awswrangler.s3.read_orc
- awswrangler.s3.read_orc(path: str | list[str], path_root: str | None = None, dataset: bool = False, path_suffix: str | list[str] | None = None, path_ignore_suffix: str | list[str] | None = None, ignore_empty: bool = True, partition_filter: Callable[[dict[str, str]], bool] | None = None, columns: list[str] | None = None, validate_schema: bool = False, last_modified_begin: datetime | None = None, last_modified_end: datetime | None = None, version_id: str | dict[str, str] | None = None, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', use_threads: bool | int = True, ray_args: RaySettings | None = None, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) → DataFrame
Read ORC file(s) from an S3 prefix or a list of S3 object paths.
The concept of dataset enables more complex features like partitioning and catalog integration (AWS Glue Catalog).
This function accepts Unix shell-style wildcards in the path argument. * (matches everything), ? (matches any single character), [seq] (matches any character in seq), [!seq] (matches any character not in seq). If you want to use a path which includes Unix shell-style wildcard characters (*, ?, []), you can use glob.escape(path) before passing the argument to this function.
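For example (a minimal sketch; the bucket, prefix, and partition layout are hypothetical), a wildcard can select matching objects, while glob.escape() lets you pass a prefix that contains literal wildcard characters:
>>> import glob
>>> import awswrangler as wr
>>> # Read only the ORC objects under the year=2020 partitions (hypothetical layout)
>>> df = wr.s3.read_orc(path='s3://bucket/prefix/year=2020/*.orc')
>>> # Escape a prefix containing literal wildcard characters before passing it
>>> df = wr.s3.read_orc(path=glob.escape('s3://bucket/data[raw]/'))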
Note
If use_threads=True, the number of threads is obtained from os.cpu_count().
Note
Filtering by last_modified_begin and last_modified_end is applied after listing all S3 files.
Note
This function has arguments which can be configured globally through wr.config or environment variables:
dtype_backend
Check out the Global Configurations Tutorial for details.
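As a sketch of that mechanism (assuming the standard wr.config interface; the bucket name is hypothetical), dtype_backend can be set once globally instead of being passed to every call:
>>> import awswrangler as wr
>>> # Configure the dtype backend globally for subsequent reads
>>> wr.config.dtype_backend = 'pyarrow'
>>> df = wr.s3.read_orc(path='s3://bucket/prefix/')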
Note
The following arguments are not supported in distributed mode with engine EngineEnum.RAY:
boto3_session
version_id
s3_additional_kwargs
dtype_backend
- Parameters:
  - path (str | list[str]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 object paths (e.g. [s3://bucket/key0, s3://bucket/key1]).
  - path_root (str | None) – Root path of the dataset. If dataset=True, it is used as a starting point to load partition columns.
  - dataset (bool) – If True, read an ORC dataset instead of individual file(s), loading all related partitions as columns.
  - path_suffix (str | list[str] | None) – Suffix or list of suffixes to be read (e.g. [".gz.orc", ".snappy.orc"]). If None (default), reads all files.
  - path_ignore_suffix (str | list[str] | None) – Suffix or list of suffixes to be ignored (e.g. [".csv", "_SUCCESS"]). If None (default), reads all files.
  - ignore_empty (bool) – Ignore files with 0 bytes.
  - partition_filter (Callable[[dict[str, str]], bool] | None) – Callback function to filter PARTITION columns (push-down filter). The function must receive a single argument (dict[str, str]) where keys are partition names and values are partition values. Partition values are always strings, and the function must return a bool: True to read the partition, False to ignore it. Ignored if dataset=False. E.g. lambda x: True if x["year"] == "2020" and x["month"] == "1" else False. See https://aws-sdk-pandas.readthedocs.io/en/3.10.0/tutorials/023%20-%20Flexible%20Partitions%20Filter.html
  - columns (list[str] | None) – List of columns to read from the file(s).
  - validate_schema (bool) – Check that the schema is consistent across individual files.
  - last_modified_begin (datetime | None) – Filter S3 objects by last-modified date. The filter is only applied after listing all objects.
  - last_modified_end (datetime | None) – Filter S3 objects by last-modified date. The filter is only applied after listing all objects.
  - version_id (str | dict[str, str] | None) – Version id of the object, or a mapping of object path to version id (e.g. {'s3://bucket/key0': '121212', 's3://bucket/key1': '343434'}).
  - dtype_backend (Literal['numpy_nullable', 'pyarrow']) – Which dtype backend to use for the resulting DataFrame. With "numpy_nullable", nullable NumPy-backed dtypes are used for all dtypes that have a nullable implementation; with "pyarrow", pyarrow dtypes are used for all dtypes. The dtype backends are still experimental. The "pyarrow" backend is only supported with Pandas 2.0 or above.
  - use_threads (bool | int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.
  - ray_args (RaySettings | None) – Parameters of the Ray Modin settings. Only used when distributed computing is used with Ray and Modin installed.
  - boto3_session (Session | None) – Boto3 Session. The default boto3 session is used if None is received.
  - s3_additional_kwargs (dict[str, Any] | None) – Forwarded to S3 botocore requests.
  - pyarrow_additional_kwargs (dict[str, Any] | None) – Forwarded to the to_pandas method when converting from PyArrow tables to a Pandas DataFrame. Valid values include "split_blocks", "self_destruct", "ignore_metadata". E.g. pyarrow_additional_kwargs={'split_blocks': True}.
- Return type:
DataFrame
- Returns:
Pandas DataFrame.
Examples
Reading all ORC files under a prefix
>>> import awswrangler as wr
>>> df = wr.s3.read_orc(path='s3://bucket/prefix/')
Reading all ORC files from a list
>>> import awswrangler as wr
>>> df = wr.s3.read_orc(path=['s3://bucket/filename0.orc', 's3://bucket/filename1.orc'])
Reading ORC Dataset with PUSH-DOWN filter over partitions
>>> import awswrangler as wr
>>> my_filter = lambda x: True if x["city"].startswith("new") else False
>>> df = wr.s3.read_orc(path, dataset=True, partition_filter=my_filter)
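Reading ORC files with several of the arguments described above combined (column projection, suffix filtering, last-modified filtering, and the pyarrow dtype backend) – a hypothetical sketch; the bucket, suffix, and column names are placeholders:
>>> from datetime import datetime, timezone
>>> import awswrangler as wr
>>> df = wr.s3.read_orc(
...     path='s3://bucket/prefix/',
...     path_suffix='.snappy.orc',           # only read objects with this suffix
...     columns=['id', 'value'],             # hypothetical column names
...     last_modified_begin=datetime(2023, 1, 1, tzinfo=timezone.utc),
...     dtype_backend='pyarrow',
...     use_threads=4,
... )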