awswrangler.s3.read_orc_table
- awswrangler.s3.read_orc_table(table: str, database: str, filename_suffix: str | list[str] | None = None, filename_ignore_suffix: str | list[str] | None = None, catalog_id: str | None = None, partition_filter: Callable[[dict[str, str]], bool] | None = None, columns: list[str] | None = None, validate_schema: bool = True, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', use_threads: bool | int = True, ray_args: RaySettings | None = None, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) → DataFrame
Read an Apache ORC table registered in the AWS Glue Catalog.
Note
If use_threads=True, the number of threads is obtained from os.cpu_count().
Note
This function has arguments which can be configured globally through wr.config or environment variables:
catalog_id
database
dtype_backend
Check out the Global Configurations Tutorial for details.
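For instance, a minimal sketch of setting these values globally before any call (the database name is a placeholder; the matching environment variables use the WR_ prefix, e.g. WR_DATABASE):
>>> import awswrangler as wr
>>> wr.config.database = "my_database"   # used when `database` is not passed explicitly
>>> wr.config.dtype_backend = "pyarrow"  # applies to readers that accept `dtype_backend`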
Note
The following arguments are not supported in distributed mode with engine EngineEnum.RAY:
boto3_session
s3_additional_kwargs
dtype_backend
- Parameters:
  - table (str) – AWS Glue Catalog table name.
  - database (str) – AWS Glue Catalog database name.
  - filename_suffix (str | list[str] | None) – Suffix or list of suffixes to be read (e.g. [".gz.orc", ".snappy.orc"]). If None (default), read all files.
  - filename_ignore_suffix (str | list[str] | None) – Suffix or list of suffixes of S3 keys to be ignored (e.g. [".csv", "_SUCCESS"]). If None (default), read all files.
  - catalog_id (str | None) – The ID of the Data Catalog from which to retrieve databases. If None is provided, the AWS account ID is used by default.
  - partition_filter (Callable[[dict[str, str]], bool] | None) – Callback function used to filter PARTITION columns (PUSH-DOWN filter). The function receives a single argument (dict[str, str]) whose keys are partition names and whose values are partition values; partition values are always strings, and the function must return a bool: True to read the partition, False to ignore it. E.g. lambda x: True if x["year"] == "2020" and x["month"] == "1" else False (see the sketch after this list).
    https://aws-sdk-pandas.readthedocs.io/en/3.10.0/tutorials/023%20-%20Flexible%20Partitions%20Filter.html
  - columns (list[str] | None) – List of columns to read from the file(s).
  - validate_schema (bool) – Check that the schema is consistent across individual files.
  - dtype_backend (Literal['numpy_nullable', 'pyarrow']) – Which dtype_backend to use. With "numpy_nullable", nullable dtypes are used for all dtypes that have a nullable implementation; with "pyarrow", PyArrow dtypes are used for all columns. The dtype_backends are still experimental. The "pyarrow" backend is only supported with Pandas 2.0 or above.
  - use_threads (bool | int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number is used instead.
  - ray_args (RaySettings | None) – Parameters for the Ray and Modin settings. Only used when distributed computing is used with Ray and Modin installed.
  - boto3_session (Session | None) – Boto3 Session. The default boto3 session is used if None is received.
  - s3_additional_kwargs (dict[str, Any] | None) – Forwarded to botocore requests against S3.
  - pyarrow_additional_kwargs (dict[str, Any] | None) – Forwarded to the to_pandas method when converting from PyArrow tables to a Pandas DataFrame. Valid values include "split_blocks", "self_destruct", "ignore_metadata". E.g. pyarrow_additional_kwargs={'split_blocks': True}.
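For illustration, a minimal sketch of such a push-down filter, assuming a table partitioned by a hypothetical year column (database and table names are placeholders):
>>> import awswrangler as wr
>>> # Partition values always arrive as strings; return True to read a partition
>>> recent = lambda x: x["year"] >= "2020"  # "year" is an assumed partition name
>>> df = wr.s3.read_orc_table(database='...', table='...', partition_filter=recent)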
- Return type:
DataFrame
- Returns:
Pandas DataFrame.
Examples
Reading ORC Table
>>> import awswrangler as wr
>>> df = wr.s3.read_orc_table(database='...', table='...')
Reading ORC Table with PUSH-DOWN filter over partitions
>>> import awswrangler as wr
>>> my_filter = lambda x: True if x["city"].startswith("new") else False
>>> df = wr.s3.read_orc_table(database='...', table='...', partition_filter=my_filter)
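As an additional sketch, projecting specific columns and requesting the PyArrow dtype backend (the column names are placeholders; "pyarrow" requires Pandas 2.0 or above):
>>> import awswrangler as wr
>>> df = wr.s3.read_orc_table(
...     database='...',
...     table='...',
...     columns=['id', 'value'],
...     dtype_backend='pyarrow',
... )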