awswrangler.s3.select_query¶
- awswrangler.s3.select_query(sql: str, path: str, input_serialization: str, input_serialization_params: Dict[str, bool | str], compression: str | None = None, use_threads: bool | int = False, boto3_session: Session | None = None, s3_additional_kwargs: Dict[str, Any] | None = None) → DataFrame¶
Filter contents of an Amazon S3 object based on SQL statement.
Note: Scan ranges are only supported for uncompressed CSV/JSON objects: CSV without quoted delimiters, and JSON in LINES mode only. When these conditions are not met, the scan cannot be split across threads, leading to lower performance.
- Parameters:
sql (str) – SQL statement used to query the object.
path (str) – S3 path to the object (e.g. s3://bucket/key).
input_serialization (str) – Format of the S3 object queried. Valid values: "CSV", "JSON", or "Parquet". Case sensitive.
input_serialization_params (Dict[str, Union[bool, str]]) – Dictionary describing the serialization of the S3 object.
compression (Optional[str]) – Compression type of the S3 object. Valid values: None, “gzip”, or “bzip2”. gzip and bzip2 are only valid for CSV and JSON objects.
use_threads (Union[bool, int]) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.
boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if none is provided.
s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests. Valid values: "SSECustomerAlgorithm", "SSECustomerKey", "ExpectedBucketOwner". e.g. s3_additional_kwargs={'SSECustomerAlgorithm': 'md5'}
- Returns:
Pandas DataFrame with results from query.
- Return type:
pandas.DataFrame
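The input_serialization and input_serialization_params arguments travel together: the format name selects the parser, and the params dictionary configures it. A minimal sketch of how they pair up for a CSV object (the helper below is hypothetical, not part of awswrangler):

```python
from typing import Any, Dict

# Hypothetical helper, NOT part of awswrangler: bundles the two
# serialization arguments that wr.s3.select_query expects, validating
# the (case-sensitive) format name along the way.
def build_select_kwargs(fmt: str, **params: Any) -> Dict[str, Any]:
    valid = {"CSV", "JSON", "Parquet"}  # per the parameter docs above
    if fmt not in valid:
        raise ValueError(f"input_serialization must be one of {valid}, got {fmt!r}")
    return {"input_serialization": fmt, "input_serialization_params": params}

kwargs = build_select_kwargs("CSV", FileHeaderInfo="Use", RecordDelimiter="\r\n")
# kwargs can then be splatted into wr.s3.select_query(sql=..., path=..., **kwargs)
```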
Examples
Reading a gzip-compressed JSON document
>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object[*][*]',
...     path='s3://bucket/key.json.gzip',
...     input_serialization='JSON',
...     input_serialization_params={
...         'Type': 'Document',
...     },
...     compression="gzip",
... )
Reading an entire CSV object using threads
>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...         'RecordDelimiter': '\r\n'
...     },
...     use_threads=True,
... )
Reading a single column from a Parquet object with a pushdown filter
>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT s."id" FROM s3object s where s."id" = 1.0',
...     path='s3://bucket/key.snappy.parquet',
...     input_serialization='Parquet',
... )