awswrangler.s3.select_query¶
- awswrangler.s3.select_query(sql: str, path: str | list[str], input_serialization: str, input_serialization_params: dict[str, bool | str], compression: str | None = None, scan_range_chunk_size: int | None = None, path_suffix: str | list[str] | None = None, path_ignore_suffix: str | list[str] | None = None, ignore_empty: bool = True, use_threads: bool | int = True, last_modified_begin: datetime | None = None, last_modified_end: datetime | None = None, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', boto3_session: Session | None = None, s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) DataFrame¶
Filter contents of Amazon S3 objects based on SQL statement.
Note: Scan ranges are only supported for uncompressed CSV/JSON, CSV (without quoted delimiters) and JSON objects (in LINES mode only). It means scanning cannot be split across threads if the aforementioned conditions are not met, leading to lower performance.
- Parameters:
sql (
str) – SQL statement used to query the object.path (
str|list[str]) – S3 prefix (accepts Unix shell-style wildcards) (e.g. s3://bucket/prefix) or list of S3 objects paths (e.g.[s3://bucket/key0, s3://bucket/key1]).input_serialization (
str) – Format of the S3 object queried. Valid values: “CSV”, “JSON”, or “Parquet”. Case sensitive.input_serialization_params (
dict[str,bool|str]) – Dictionary describing the serialization of the S3 object.compression (
str|None) – Compression type of the S3 object. Valid values: None, “gzip”, or “bzip2”. gzip and bzip2 are only valid for CSV and JSON objects.scan_range_chunk_size (
int|None) – Chunk size used to split the S3 object into scan ranges. 1,048,576 by default.path_suffix (
str|list[str] |None) – Suffix or List of suffixes to be read (e.g. [“.csv”]). If None, read all files. (default)path_ignore_suffix (
str|list[str] |None) – Suffix or List of suffixes for S3 keys to be ignored. (e.g. [“_SUCCESS”]). If None, read all files. (default)ignore_empty (
bool) – Ignore files with 0 bytes.use_threads (
bool|int) – True (default) to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() is used as the max number of threads. If integer is provided, specified number is used.last_modified_begin (
datetime|None) – Filter S3 objects by Last modified date. Filter is only applied after listing all objects.last_modified_end (
datetime|None) – Filter S3 objects by Last modified date. Filter is only applied after listing all objects.dtype_backend (
Literal['numpy_nullable','pyarrow']) –Which dtype_backend to use, e.g. whether a DataFrame should have NumPy arrays, nullable dtypes are used for all dtypes that have a nullable implementation when “numpy_nullable” is set, pyarrow is used for all dtypes if “pyarrow” is set.
The dtype_backends are still experimential. The “pyarrow” backend is only supported with Pandas 2.0 or above.
boto3_session (
Session|None) – The default boto3 session is used if none is provided.s3_additional_kwargs (
dict[str,Any] |None) – Forwarded to botocore requests. Valid values: “SSECustomerAlgorithm”, “SSECustomerKey”, “ExpectedBucketOwner”. e.g. s3_additional_kwargs={‘SSECustomerAlgorithm’: ‘md5’}.pyarrow_additional_kwargs (
dict[str,Any] |None) – Forwarded to to_pandas method converting from PyArrow tables to Pandas DataFrame. Valid values include “split_blocks”, “self_destruct”, “ignore_metadata”. e.g. pyarrow_additional_kwargs={‘split_blocks’: True}.
- Return type:
DataFrame- Returns:
Pandas DataFrame with results from query.
Examples
Reading a gzip compressed JSON document
>>> import awswrangler as wr >>> df = wr.s3.select_query( ... sql='SELECT * FROM s3object[*][*]', ... path='s3://bucket/key.json.gzip', ... input_serialization='JSON', ... input_serialization_params={ ... 'Type': 'Document', ... }, ... compression="gzip", ... )
Reading multiple CSV objects from a prefix
>>> import awswrangler as wr >>> df = wr.s3.select_query( ... sql='SELECT * FROM s3object', ... path='s3://bucket/prefix/', ... input_serialization='CSV', ... input_serialization_params={ ... 'FileHeaderInfo': 'Use', ... 'RecordDelimiter': '\r\n' ... }, ... )
Reading a single column from Parquet object with pushdown filter
>>> import awswrangler as wr >>> df = wr.s3.select_query( ... sql='SELECT s.\"id\" FROM s3object s where s.\"id\" = 1.0', ... path='s3://bucket/key.snappy.parquet', ... input_serialization='Parquet', ... )