awswrangler.s3.read_parquet_table
- awswrangler.s3.read_parquet_table(table: str, database: str, filename_suffix: Optional[Union[str, List[str]]] = None, filename_ignore_suffix: Optional[Union[str, List[str]]] = None, catalog_id: Optional[str] = None, partition_filter: Optional[Callable[[Dict[str, str]], bool]] = None, columns: Optional[List[str]] = None, validate_schema: bool = True, categories: Optional[List[str]] = None, safe: bool = True, map_types: bool = True, chunked: Union[bool, int] = False, use_threads: Union[bool, int] = True, boto3_session: Optional[Session] = None, s3_additional_kwargs: Optional[Dict[str, Any]] = None) → Any
Read an Apache Parquet table registered in the AWS Glue Catalog.
Note
Batching (chunked argument) (Memory Friendly): will enable the function to return an Iterable of DataFrames instead of a regular DataFrame.
There are two batching strategies on awswrangler:
If chunked=True, a new DataFrame will be returned for each file in your path/dataset.
If chunked=INTEGER, awswrangler will paginate through files, slicing and concatenating to return DataFrames with the number of rows equal to the received INTEGER.
P.S. chunked=True is faster and uses less memory, while chunked=INTEGER is more precise in the number of rows for each DataFrame.
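For illustration, a minimal sketch of the chunked=INTEGER strategy (the database and table names are placeholders):
>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=100000)
>>> for df in dfs:
...     print(len(df))  # 100000 rows per DataFrame, except possibly the last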
Note
In case of use_threads=True, the number of threads that will be spawned will be obtained from os.cpu_count().
Note
This function has arguments which can be configured globally through wr.config or environment variables:
catalog_id
database
Check out the Global Configurations Tutorial for details.
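For example, a minimal sketch of configuring database globally (the database name is a placeholder; per the tutorial, setting the WR_DATABASE environment variable achieves the same effect):
>>> import awswrangler as wr
>>> wr.config.database = 'my_database'  # placeholder; picked up by every call that accepts `database`
>>> df = wr.s3.read_parquet_table(table='...')  # `database` resolved from wr.config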
- Parameters
table (str) – AWS Glue Catalog table name.
database (str) – AWS Glue Catalog database name.
filename_suffix (Union[str, List[str], None]) – Suffix or list of suffixes to be read (e.g. [".gz.parquet", ".snappy.parquet"]). If None, will try to read all files (default).
filename_ignore_suffix (Union[str, List[str], None]) – Suffix or list of suffixes for S3 keys to be ignored (e.g. [".csv", "_SUCCESS"]). If None, will try to read all files (default).
catalog_id (str, optional) – The ID of the Data Catalog from which to retrieve Databases. If none is provided, the AWS account ID is used by default.
partition_filter (Optional[Callable[[Dict[str, str]], bool]]) – Callback function to apply as a filter on PARTITION columns (PUSH-DOWN filter). This function MUST receive a single argument (Dict[str, str]) where keys are partition names and values are partition values. Partition values will always be strings extracted from S3. This function MUST return a bool, True to read the partition or False to ignore it (see also the casting example at the end of the Examples section below). E.g.:
lambda x: True if x["year"] == "2020" and x["month"] == "1" else False
https://aws-sdk-pandas.readthedocs.io/en/2.17.0/tutorials/023%20-%20Flexible%20Partitions%20Filter.html
columns (List[str], optional) – Names of columns to read from the file(s).
validate_schema (bool) – Check that individual file schemas are all the same / compatible. Schemas within a folder prefix should all be the same. Disable if you have schemas that differ and want to skip this check.
categories (Optional[List[str]], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory restricted environments.
safe (bool, default True) – For certain data types, a cast is needed in order to store the data in a pandas DataFrame or Series (e.g. timestamps are always stored as nanoseconds in pandas). This option controls whether it is a safe cast or not.
map_types (bool, default True) – True to convert pyarrow DataTypes to pandas ExtensionDtypes. It is used to override the default pandas type for conversion of built-in pyarrow types or in absence of pandas_metadata in the Table schema.
chunked (Union[bool, int]) – If True, break the data into smaller DataFrames, one per file (non-deterministic number of rows). If an INTEGER is passed, return DataFrames with that many rows each (see the Batching note above). Otherwise, return a single DataFrame with the whole data.
use_threads (Union[bool, int]) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the max number of threads. If an integer is provided, the specified number is used.
boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.
s3_additional_kwargs (Optional[Dict[str, Any]]) – Forward to botocore requests. Only "SSECustomerAlgorithm" and "SSECustomerKey" arguments will be considered.
- Returns
Pandas DataFrame or a Generator in case of chunked=True.
- Return type
Union[pandas.DataFrame, Generator[pandas.DataFrame, None, None]]
Examples
Reading Parquet Table
>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(database='...', table='...')
Reading Parquet Table encrypted
>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     s3_additional_kwargs={
...         'ServerSideEncryption': 'aws:kms',
...         'SSEKMSKeyId': 'YOUR_KMS_KEY_ARN'
...     }
... )
Reading Parquet Table in chunks (Chunk by file)
>>> import awswrangler as wr
>>> dfs = wr.s3.read_parquet_table(database='...', table='...', chunked=True)
>>> for df in dfs:
...     print(df)  # Smaller Pandas DataFrame
Reading Parquet Table with PUSH-DOWN filter over partitions
>>> import awswrangler as wr
>>> my_filter = lambda x: True if x["city"].startswith("new") else False
>>> df = wr.s3.read_parquet_table(database='...', table='...', partition_filter=my_filter)
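Reading Parquet Table casting partition values (a minimal sketch: as noted under partition_filter, partition values always arrive as strings, so cast them before numeric comparisons; "year" is a placeholder partition name)
>>> import awswrangler as wr
>>> my_filter = lambda x: int(x["year"]) >= 2020  # partition values are strings; cast to int
>>> df = wr.s3.read_parquet_table(database='...', table='...', partition_filter=my_filter)
Reading a subset of columns as Categoricals (a minimal sketch combining the columns and categories parameters above; the column names are placeholders)
>>> import awswrangler as wr
>>> df = wr.s3.read_parquet_table(
...     database='...',
...     table='...',
...     columns=['city', 'value'],
...     categories=['city']  # returned as pandas.Categorical to save memory
... )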