awswrangler.s3.read_parquet_metadata

awswrangler.s3.read_parquet_metadata(path: str | list[str], dataset: bool = False, version_id: str | dict[str, str] | None = None, path_suffix: str | None = None, path_ignore_suffix: str | list[str] | None = None, ignore_empty: bool = True, ignore_null: bool = False, dtype: dict[str, str] | None = None, sampling: float = 1.0, coerce_int96_timestamp_unit: str | None = None, use_threads: bool | int = True, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None) → _ReadTableMetadataReturnValue

Read the metadata of Apache Parquet file(s) from an S3 prefix or a list of S3 object paths.

The concept of dataset enables more complex features like partitioning and catalog integration (AWS Glue Catalog).

This function accepts Unix shell-style wildcards in the path argument: * (matches everything), ? (matches any single character), [seq] (matches any character in seq), and [!seq] (matches any character not in seq). To use a path that contains literal wildcard characters (*, ?, []), pass it through glob.escape(path) before handing it to this function.
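The wildcard rules above are the same ones implemented by Python's standard fnmatch module, so they can be tried locally without touching S3. The following is a minimal sketch under that assumption; the keys are hypothetical, and whether the library matches paths in exactly this way internally is not stated here.

```python
import fnmatch
import glob

# Hypothetical object paths, used only for illustration.
keys = [
    "s3://bucket/prefix/part-0.parquet",
    "s3://bucket/prefix/part-1.parquet",
    "s3://bucket/prefix/part-10.parquet",
    "s3://bucket/prefix/data[1].parquet",
]

# "?" matches exactly one character, so "part-10" is excluded.
single_char = fnmatch.filter(keys, "s3://bucket/prefix/part-?.parquet")

# "[seq]" matches one character taken from the set.
in_set = fnmatch.filter(keys, "s3://bucket/prefix/part-[01].parquet")

# Unescaped, "[1]" is read as a character class, so the literal
# brackets in the last key are not matched.
literal = "s3://bucket/prefix/data[1].parquet"
unescaped = fnmatch.filter(keys, literal)

# glob.escape() neutralises the wildcard characters in the path.
escaped_pattern = glob.escape(literal)
escaped = fnmatch.filter(keys, escaped_pattern)
```

In this sketch, unescaped is empty while escaped contains the key with literal brackets, which is the situation glob.escape(path) is meant to handle.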

Note

If use_threads=True, the number of threads is obtained from os.cpu_count().
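The thread-count rule for use_threads can be sketched as follows. This is an illustration of the documented behaviour, not the library's own code; the helper name resolve_max_threads is hypothetical, and treating False as a single thread is an assumption.

```python
import os
from typing import Union


def resolve_max_threads(use_threads: Union[bool, int]) -> int:
    """Hypothetical helper mirroring the documented use_threads rule."""
    # bool must be checked first because bool is a subclass of int.
    if isinstance(use_threads, bool):
        if use_threads:
            # os.cpu_count() can return None, so fall back to 1.
            return os.cpu_count() or 1
        # Assumption: disabling threads means a single worker.
        return 1
    # An explicit integer is used as given (at least one thread).
    return max(1, int(use_threads))
```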

Parameters:

path (str | list[str]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 object paths (e.g. [s3://bucket/key0, s3://bucket/key1]).
dataset (bool) – If True, read a parquet dataset instead of individual file(s), loading all related partitions as columns.
version_id (str | dict[str, str] | None) – Version id of the object, or mapping of object path to version id (e.g. {'s3://bucket/key0': '121212', 's3://bucket/key1': '343434'}).
path_suffix (str | None) – Suffix or list of suffixes of the files to read (e.g. [".gz.parquet", ".snappy.parquet"]). If None (default), all files are read.
path_ignore_suffix (str | list[str] | None) – Suffix or list of suffixes of the files to ignore (e.g. [".csv", "_SUCCESS"]). If None (default), all files are read.
ignore_empty (bool) – Ignore files with 0 bytes.
ignore_null (bool) – Ignore columns with null type.
dtype (dict[str, str] | None) – Dictionary of column names and Athena/Glue types to cast. Use it when columns with undetermined data types are used as partition columns (e.g. {'col name': 'bigint', 'col2 name': 'int'}).
sampling (float) – Fraction of the files whose metadata is inspected. Must satisfy 0.0 < sampling <= 1.0. Higher values are more accurate; lower values are faster.
use_threads (bool | int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number is used.
boto3_session (Session | None) – Boto3 Session. The default boto3 session is used if boto3_session is None.
s3_additional_kwargs (dict[str, Any] | None) – Forwarded to S3 botocore requests.
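The interplay of path_suffix and path_ignore_suffix can be pictured with a small local filter. This is a sketch of one plausible reading of the two parameters, not the library's implementation; the function filter_keys and the sample keys are hypothetical.

```python
from typing import List, Optional, Tuple, Union

Suffixes = Optional[Union[str, List[str]]]


def _as_tuple(value: Suffixes) -> Tuple[str, ...]:
    """Normalise None, a single suffix, or a list of suffixes to a tuple."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


def filter_keys(
    keys: List[str],
    path_suffix: Suffixes = None,
    path_ignore_suffix: Suffixes = None,
) -> List[str]:
    """Hypothetical filter: keep matching suffixes, drop ignored ones."""
    keep = _as_tuple(path_suffix)
    drop = _as_tuple(path_ignore_suffix)
    selected = []
    for key in keys:
        # With no path_suffix given, every key is a candidate.
        if keep and not key.endswith(keep):
            continue
        # Ignored suffixes are dropped even if they were candidates.
        if drop and key.endswith(drop):
            continue
        selected.append(key)
    return selected


# Hypothetical listing of a prefix.
sample_keys = [
    "s3://bucket/prefix/a.snappy.parquet",
    "s3://bucket/prefix/b.gz.parquet",
    "s3://bucket/prefix/notes.csv",
    "s3://bucket/prefix/_SUCCESS",
]
only_snappy = filter_keys(sample_keys, path_suffix=".snappy.parquet")
without_markers = filter_keys(sample_keys, path_ignore_suffix=[".csv", "_SUCCESS"])
```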

Return type:

_ReadTableMetadataReturnValue

Returns:

columns_types: Dictionary with keys as column names and values as data types (e.g. {'col0': 'bigint', 'col1': 'double'}).
partitions_types: Dictionary with keys as partition names and values as data types (e.g. {'col2': 'date'}).
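Once the two dictionaries are available, they can be combined into a single schema listing. The sketch below uses the example values shown above instead of a live S3 call; describe_schema is a hypothetical helper, and the guard against a missing partitions_types assumes it may be None or empty when no partitions are read, which is not stated here.

```python
from typing import Dict, List, Optional


def describe_schema(
    columns_types: Dict[str, str],
    partitions_types: Optional[Dict[str, str]],
) -> List[str]:
    """Hypothetical helper: one 'name: type' line per column."""
    lines = [f"{name}: {dtype}" for name, dtype in columns_types.items()]
    # Assumption: partitions_types may be None or empty without partitions.
    for name, dtype in (partitions_types or {}).items():
        lines.append(f"{name}: {dtype} (partition)")
    return lines


# Example values taken from the return description, not from a real call.
columns_types = {"col0": "bigint", "col1": "double"}
partitions_types = {"col2": "date"}
schema_lines = describe_schema(columns_types, partitions_types)
```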

Examples

Reading all Parquet files (with partitions) metadata under a prefix

>>> import awswrangler as wr
>>> columns_types, partitions_types = wr.s3.read_parquet_metadata(path='s3://bucket/prefix/', dataset=True)

Reading all Parquet files metadata from a list

>>> import awswrangler as wr
>>> columns_types, partitions_types = wr.s3.read_parquet_metadata(path=[
...     's3://bucket/filename0.parquet',
...     's3://bucket/filename1.parquet'
... ])