awswrangler.s3.store_parquet_metadata

awswrangler.s3.store_parquet_metadata(path: str, database: str, table: str, catalog_id: str | None = None, path_suffix: str | None = None, path_ignore_suffix: str | List[str] | None = None, ignore_empty: bool = True, dtype: Dict[str, str] | None = None, sampling: float = 1.0, dataset: bool = False, use_threads: bool | int = True, description: str | None = None, parameters: Dict[str, str] | None = None, columns_comments: Dict[str, str] | None = None, compression: str | None = None, mode: Literal['append', 'overwrite'] = 'overwrite', catalog_versioning: bool = False, regular_partitions: bool = True, athena_partition_projection_settings: AthenaPartitionProjectionSettings | None = None, s3_additional_kwargs: Dict[str, Any] | None = None, boto3_session: Session | None = None) → Tuple[Dict[str, str], Dict[str, str] | None, Dict[str, List[str]] | None]

Infer and store parquet metadata on AWS Glue Catalog.

Infer the metadata of Apache Parquet file(s) under a given S3 prefix and store it in the AWS Glue Catalog, including all inferred partitions (no need for ‘MSCK REPAIR TABLE’).

The concept of Dataset goes beyond the simple idea of files and enables more complex features like partitioning and catalog integration (AWS Glue Catalog).

This function accepts Unix shell-style wildcards in the path argument: * (matches everything), ? (matches any single character), [seq] (matches any character in seq), and [!seq] (matches any character not in seq). If the path itself contains any of these characters (*, ?, []) and they should be matched literally, call glob.escape(path) before passing the path to this function.
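The wildcard rules and the effect of glob.escape can be checked locally. The sketch below uses the standard-library fnmatch module purely to illustrate the matching semantics; that this mirrors how the function filters S3 keys is an assumption, and the bucket and key names are hypothetical.

```python
import fnmatch
import glob

# Hypothetical S3 keys, used only for illustration.
keys = [
    "s3://bucket/prefix/part-1.parquet",
    "s3://bucket/prefix/part-22.parquet",
    "s3://bucket/data[2024]/a.parquet",
    "s3://bucket/data2/a.parquet",
]


def match(pattern):
    """Return the keys matching a shell-style pattern (case-sensitive)."""
    return [key for key in keys if fnmatch.fnmatchcase(key, pattern)]


# '?' matches exactly one character, so 'part-22' is excluded.
single = match("s3://bucket/prefix/part-?.parquet")

# Unescaped, '[2024]' is a character class matching one of '2', '0' or '4'.
literal_path = "s3://bucket/data[2024]/a.parquet"
unescaped = match(literal_path)

# glob.escape() wraps the special characters so that they match literally.
escaped_pattern = glob.escape(literal_path)
escaped = match(escaped_pattern)
```

Without escaping, the literal path silently selects a different key (data2) and misses the intended one, which is why escaping matters for prefixes that contain brackets.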

Note

When use_threads=True, the number of threads spawned is taken from os.cpu_count().
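As a rough illustration of how a use_threads value could translate into a thread count, here is a minimal sketch. The helper resolve_max_threads is hypothetical and is not part of awswrangler; it only mirrors the behaviour described for use_threads (True, False, or an integer).

```python
import os


def resolve_max_threads(use_threads):
    """Hypothetical helper: map a use_threads value to a thread count."""
    # bool is a subclass of int, so it has to be checked first.
    if isinstance(use_threads, bool):
        # os.cpu_count() may return None, hence the fallback to 1.
        return (os.cpu_count() or 1) if use_threads else 1
    # An explicit integer is used as given, with a floor of one thread.
    return max(1, int(use_threads))


threads_disabled = resolve_max_threads(False)
threads_explicit = resolve_max_threads(4)
threads_default = resolve_max_threads(True)
```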

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • catalog_id

  • database

Check out the Global Configurations Tutorial for details.
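A minimal sketch of the environment-variable route follows. The variable names assume the WR_ prefix plus the upper-cased argument name, as described in the Global Configurations Tutorial; the database name and account ID are hypothetical placeholders.

```python
import os

# Assumed variable names: "WR_" + upper-cased argument name.
# They are typically set before awswrangler is imported.
os.environ["WR_DATABASE"] = "my_database"      # hypothetical database name
os.environ["WR_CATALOG_ID"] = "123456789012"   # hypothetical AWS account ID

# Equivalent in-process configuration, assuming awswrangler is installed:
#   import awswrangler as wr
#   wr.config.database = "my_database"
#   wr.config.catalog_id = "123456789012"
```

With either approach, database and catalog_id would no longer need to be passed explicitly on each call.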

Parameters:
  • path (str) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix).

  • table (str) – Glue/Athena catalog: Table name.

  • database (str) – AWS Glue Catalog database name.

  • catalog_id (str, optional) – The ID of the Data Catalog from which to retrieve Databases. If none is provided, the AWS account ID is used by default.

  • path_suffix (Union[str, List[str], None]) – Suffix or List of suffixes for filtering S3 keys.

  • path_ignore_suffix (Union[str, List[str], None]) – Suffix or List of suffixes for S3 keys to be ignored.

  • ignore_empty (bool) – Ignore files with 0 bytes.

  • dtype (Dict[str, str], optional) – Dictionary of column names and the Athena/Glue types to cast them to. Useful when you have columns with undetermined data types, such as partition columns. (e.g. {‘col name’: ‘bigint’, ‘col2 name’: ‘int’})

  • sampling (float) – Random sample ratio of files that will have the metadata inspected. Must be 0.0 < sampling <= 1.0. The higher, the more accurate. The lower, the faster.

  • dataset (bool) – If True, read a parquet dataset instead of simple file(s), loading all the related partitions as columns.

  • use_threads (bool, int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number is used instead.

  • description (str, optional) – Glue/Athena catalog: Table description

  • parameters (Dict[str, str], optional) – Glue/Athena catalog: Key/value pairs to tag the table.

  • columns_comments (Dict[str, str], optional) – Glue/Athena catalog: Column names and their related comments (e.g. {‘col0’: ‘Column 0.’, ‘col1’: ‘Column 1.’, ‘col2’: ‘Partition.’}).

  • compression (str, optional) – Compression style (None, snappy, gzip, etc).

  • mode (str) – ‘overwrite’ to recreate the table if it already exists, or ‘append’ to keep the existing table.

  • catalog_versioning (bool) – If True and mode=‘overwrite’, creates an archived version of the table catalog before updating it.

  • regular_partitions (bool) – Create regular (non-projected) partitions on Glue Catalog. Disable only if you will work exclusively with Partition Projection. Keeping this enabled while using projections is useful to keep Redshift Spectrum working with the regular partitions.

  • athena_partition_projection_settings (AthenaPartitionProjectionSettings, optional) –

    Parameters of the Athena Partition Projection (https://docs.aws.amazon.com/athena/latest/ug/partition-projection.html). AthenaPartitionProjectionSettings is a TypedDict, meaning the passed parameter can be instantiated either as an instance of AthenaPartitionProjectionSettings or as a regular Python dict.

    The following projection parameters are supported:

    • projection_types (Optional[Dict[str, str]]) – Dictionary of partition names and Athena projection types. Valid types: “enum”, “integer”, “date”, “injected”. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html (e.g. {‘col_name’: ‘enum’, ‘col2_name’: ‘integer’})

    • projection_ranges (Optional[Dict[str, str]]) – Dictionary of partition names and Athena projection ranges. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html (e.g. {‘col_name’: ‘0,10’, ‘col2_name’: ‘-1,8675309’})

    • projection_values (Optional[Dict[str, str]]) – Dictionary of partition names and Athena projection values. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html (e.g. {‘col_name’: ‘A,B,Unknown’, ‘col2_name’: ‘foo,boo,bar’})

    • projection_intervals (Optional[Dict[str, str]]) – Dictionary of partition names and Athena projection intervals. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html (e.g. {‘col_name’: ‘1’, ‘col2_name’: ‘5’})

    • projection_digits (Optional[Dict[str, str]]) – Dictionary of partition names and Athena projection digits. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html (e.g. {‘col_name’: ‘1’, ‘col2_name’: ‘2’})

    • projection_formats (Optional[Dict[str, str]]) – Dictionary of partition names and Athena projection formats. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-supported-types.html (e.g. {‘col_date’: ‘yyyy-MM-dd’, ‘col2_timestamp’: ‘yyyy-MM-dd HH:mm:ss’})

    • projection_storage_location_template (Optional[str]) – Value which allows Athena to properly map partition values if the S3 file locations do not follow a typical …/column=value/… pattern. https://docs.aws.amazon.com/athena/latest/ug/partition-projection-setting-up.html (e.g. s3://bucket/table_root/a=${a}/${b}/some_static_subdirectory/${c}/)

  • s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests. e.g. s3_additional_kwargs={‘ServerSideEncryption’: ‘aws:kms’, ‘SSEKMSKeyId’: ‘YOUR_KMS_KEY_ARN’}

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if boto3_session is None.
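Because athena_partition_projection_settings can be passed as a regular Python dict, the settings can be assembled and sanity-checked before the call. The sketch below is only an illustration: the partition names (year, region) and the helper invalid_projection_types are hypothetical, and the set of valid types is copied from the projection_types description above.

```python
from typing import Dict, List, Union

# Valid projection types, as listed in the projection_types description.
VALID_PROJECTION_TYPES = {"enum", "integer", "date", "injected"}

# Plain dict using the documented keys; partition names are hypothetical.
projection_settings: Dict[str, Union[str, Dict[str, str]]] = {
    "projection_types": {"year": "integer", "region": "enum"},
    "projection_ranges": {"year": "2020,2030"},
    "projection_values": {"region": "us,eu,apac"},
}


def invalid_projection_types(settings) -> List[str]:
    """Hypothetical check: partition names whose projection type is not valid."""
    types = settings.get("projection_types") or {}
    return sorted(
        name for name, kind in types.items() if kind not in VALID_PROJECTION_TYPES
    )


problems = invalid_projection_types(projection_settings)
```

The resulting dict would then be passed as athena_partition_projection_settings=projection_settings, typically together with regular_partitions=False when only Partition Projection is used.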

Returns:

The metadata used to create the Glue Table:

  • columns_types: Dictionary with keys as column names and values as data types (e.g. {‘col0’: ‘bigint’, ‘col1’: ‘double’}).

  • partitions_types: Dictionary with keys as partition names and values as data types (e.g. {‘col2’: ‘date’}).

  • partitions_values: Dictionary with keys as S3 path locations and values as a list of partitions values as str (e.g. {‘s3://bucket/prefix/y=2020/m=10/’: [‘2020’, ‘10’]}).

Return type:

Tuple[Dict[str, str], Optional[Dict[str, str]], Optional[Dict[str, List[str]]]]
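Since the second and third elements of the returned tuple are Optional, callers may want to guard against None before using them. The sketch below works on hand-written sample values shaped like the return type rather than on a real call; the helper summarize_metadata and the sample values are hypothetical.

```python
from typing import Dict, List, Optional


def summarize_metadata(
    columns_types: Dict[str, str],
    partitions_types: Optional[Dict[str, str]],
    partitions_values: Optional[Dict[str, List[str]]],
) -> Dict[str, int]:
    """Hypothetical helper: count columns, partition columns and partitions."""
    return {
        "columns": len(columns_types),
        # Both partition elements may be None, so fall back to empty dicts.
        "partition_columns": len(partitions_types or {}),
        "partitions": len(partitions_values or {}),
    }


# Sample values shaped like the documented return type.
columns_types = {"col0": "bigint", "col1": "double"}
partitions_types = {"y": "string", "m": "string"}
partitions_values = {"s3://bucket/prefix/y=2020/m=10/": ["2020", "10"]}

summary = summarize_metadata(columns_types, partitions_types, partitions_values)
no_partitions = summarize_metadata(columns_types, None, None)
```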

Examples

Inferring and storing the metadata of all Parquet files under a prefix

>>> import awswrangler as wr
>>> columns_types, partitions_types, partitions_values = wr.s3.store_parquet_metadata(
...     path='s3://bucket/prefix/',
...     database='...',
...     table='...',
...     dataset=True
... )