awswrangler.s3.read_parquet_metadata

awswrangler.s3.read_parquet_metadata(path: str | list[str], dataset: bool = False, version_id: str | dict[str, str] | None = None, path_suffix: str | None = None, path_ignore_suffix: str | list[str] | None = None, ignore_empty: bool = True, ignore_null: bool = False, dtype: dict[str, str] | None = None, sampling: float = 1.0, coerce_int96_timestamp_unit: str | None = None, use_threads: bool | int = True, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None) → _ReadTableMetadataReturnValue

Read Apache Parquet file(s) metadata from an S3 prefix or a list of S3 object paths.

The concept of dataset enables more complex features like partitioning and catalog integration (AWS Glue Catalog).

This function accepts Unix shell-style wildcards in the path argument: * (matches everything), ? (matches any single character), [seq] (matches any character in seq), and [!seq] (matches any character not in seq). To use a path that contains literal wildcard characters (*, ?, []), pass it through glob.escape(path) before handing it to this function.
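The wildcard rules described above are the same ones implemented by Python's standard fnmatch module, so the syntax can be tried locally without touching S3. The sketch below is illustrative only: the bucket and key names are made up, the match helper is hypothetical, and it relies on fnmatch and glob.escape from the standard library rather than on awswrangler's own matching code.

```python
import fnmatch
import glob

# Hypothetical object paths, used only to illustrate the pattern syntax.
keys = [
    "s3://bucket/prefix/part-0.parquet",
    "s3://bucket/prefix/part-1.parquet",
    "s3://bucket/prefix/part-a.parquet",
    "s3://bucket/prefix/data[1].parquet",
]


def match(pattern):
    """Return the keys matched by a Unix shell-style pattern (case-sensitive)."""
    return [k for k in keys if fnmatch.fnmatchcase(k, pattern)]


star = match("s3://bucket/prefix/part-*.parquet")         # * matches everything
single = match("s3://bucket/prefix/part-?.parquet")       # ? matches any single character
in_seq = match("s3://bucket/prefix/part-[01].parquet")    # [seq] matches a character in seq
not_seq = match("s3://bucket/prefix/part-[!01].parquet")  # [!seq] matches a character not in seq

# A path that contains literal wildcard characters must be escaped first;
# otherwise "[1]" is read as a character class rather than as text.
literal = "s3://bucket/prefix/data[1].parquet"
escaped = glob.escape(literal)

print(match(literal))  # the unescaped pattern does not match the literal key
print(match(escaped))  # the escaped pattern does
```

The escaped string is what would be passed as path when the object key itself contains *, ? or [.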

Note

If use_threads=True, the number of threads is obtained from os.cpu_count().

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

Note

This function has arguments that can be configured globally through wr.config or environment variables.

Check out the Global Configurations Tutorial for details.

Parameters:
  • path (Union[str, List[str]]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 object paths (e.g. [s3://bucket/key0, s3://bucket/key1]).

  • dataset (bool, default False) – If True, read a parquet dataset instead of individual file(s), loading all related partitions as columns.

  • version_id (Union[str, Dict[str, str]], optional) – Version ID of the object, or a mapping of object path to version ID (e.g. {'s3://bucket/key0': '121212', 's3://bucket/key1': '343434'}).

  • path_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be read (e.g. [".gz.parquet", ".snappy.parquet"]). If None (default), all files are read.

  • path_ignore_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be ignored (e.g. [".csv", "_SUCCESS"]). If None (default), no files are ignored.

  • ignore_empty (bool, default True) – Ignore files with 0 bytes.

  • ignore_null (bool, default False) – Ignore columns with null type.

  • dtype (Dict[str, str], optional) – Dictionary of column names and Athena/Glue types to cast. Use this when some columns, such as partition columns, have undetermined data types (e.g. {'col name': 'bigint', 'col2 name': 'int'}).

  • sampling (float) – Fraction of files whose metadata is inspected. Must satisfy 0.0 < sampling <= 1.0. Higher values are more accurate; lower values are faster.

  • use_threads (bool, int) – True to enable concurrent requests, False to disable multiple threads. If True, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if boto3_session is None.

  • s3_additional_kwargs (dict[str, Any], optional) – Additional keyword arguments forwarded to the botocore S3 requests.
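To make the interaction between path_suffix and path_ignore_suffix concrete, the sketch below restates the documented filtering rule in plain Python. The filter_paths helper and the sample paths are hypothetical and are not part of awswrangler. The sketch assumes that a path is kept when it ends with one of the path_suffix values (or path_suffix is None) and does not end with any of the path_ignore_suffix values; the library's actual implementation may differ in detail.

```python
from typing import List, Optional, Tuple, Union

Suffix = Optional[Union[str, List[str]]]


def _as_tuple(value: Suffix) -> Tuple[str, ...]:
    """Normalize None, a single suffix, or a list of suffixes to a tuple."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(value)


def filter_paths(
    paths: List[str],
    path_suffix: Suffix = None,
    path_ignore_suffix: Suffix = None,
) -> List[str]:
    """Hypothetical helper: keep paths by suffix, then drop ignored suffixes."""
    keep = _as_tuple(path_suffix)
    ignore = _as_tuple(path_ignore_suffix)
    result = []
    for p in paths:
        if keep and not p.endswith(keep):
            continue  # path_suffix given, but this path does not match it
        if ignore and p.endswith(ignore):
            continue  # path matches a suffix that should be ignored
        result.append(p)
    return result


# Made-up object paths for illustration.
paths = [
    "s3://bucket/prefix/a.snappy.parquet",
    "s3://bucket/prefix/b.gz.parquet",
    "s3://bucket/prefix/c.csv",
    "s3://bucket/prefix/_SUCCESS",
]

only_parquet = filter_paths(paths, path_suffix=[".gz.parquet", ".snappy.parquet"])
without_markers = filter_paths(paths, path_ignore_suffix=[".csv", "_SUCCESS"])
everything = filter_paths(paths)
```

With these sample paths, both filters select the same two Parquet objects, which shows that either parameter can be used to skip marker files such as _SUCCESS.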

Returns:

  • columns_types: Dictionary with keys as column names and values as data types (e.g. {'col0': 'bigint', 'col1': 'double'}).

  • partitions_types: Dictionary with keys as partition names and values as data types (e.g. {'col2': 'date'}).

Return type:

Tuple[Dict[str, str], Optional[Dict[str, str]]]
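Because the second element of the returned tuple is Optional, callers may need to handle None before combining the two dictionaries. The sketch below uses a hypothetical helper, full_schema, which is not part of awswrangler. The sample dictionaries are the illustrative values from the Returns description, not real output.

```python
from typing import Dict, Optional


def full_schema(
    columns_types: Dict[str, str],
    partitions_types: Optional[Dict[str, str]],
) -> Dict[str, str]:
    """Hypothetical helper: merge column and partition types into one mapping."""
    schema = dict(columns_types)           # copy so the input is not mutated
    schema.update(partitions_types or {})  # treat None as "no partitions"
    return schema


# Illustrative values taken from the Returns description.
columns_types = {"col0": "bigint", "col1": "double"}
partitions_types = {"col2": "date"}

schema = full_schema(columns_types, partitions_types)
schema_without_partitions = full_schema(columns_types, None)
```

In real use, the two arguments would come from unpacking the result of wr.s3.read_parquet_metadata.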

Examples

Reading all Parquet files (with partitions) metadata under a prefix

>>> import awswrangler as wr
>>> columns_types, partitions_types = wr.s3.read_parquet_metadata(path='s3://bucket/prefix/', dataset=True)

Reading all Parquet files metadata from a list

>>> import awswrangler as wr
>>> columns_types, partitions_types = wr.s3.read_parquet_metadata(path=[
...     's3://bucket/filename0.parquet',
...     's3://bucket/filename1.parquet'
... ])