awswrangler.s3.select_query(sql: str, path: str, input_serialization: str, input_serialization_params: Dict[str, Union[bool, str]], compression: Optional[str] = None, use_threads: Union[bool, int] = False, boto3_session: Optional[Session] = None, s3_additional_kwargs: Optional[Dict[str, Any]] = None) → DataFrame

Filter contents of an Amazon S3 object based on SQL statement.

Note: Scan ranges are only supported for uncompressed CSV and JSON objects, with CSV objects limited to those without quoted delimiters and JSON objects limited to LINES mode. When these conditions are not met, scanning cannot be split across threads, leading to lower performance.

Parameters

  • sql (str) – SQL statement used to query the object.

  • path (str) – S3 path to the object (e.g. s3://bucket/key).

  • input_serialization (str) – Format of the S3 object queried. Valid values: “CSV”, “JSON”, or “Parquet”. Case sensitive.

  • input_serialization_params (Dict[str, Union[bool, str]]) – Dictionary describing the serialization of the S3 object.

  • compression (Optional[str]) – Compression type of the S3 object. Valid values: None, “gzip”, or “bzip2”. gzip and bzip2 are only valid for CSV and JSON objects.

  • use_threads (Union[bool, int]) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number of threads is used.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if none is provided.

  • s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests. Valid values: “SSECustomerAlgorithm”, “SSECustomerKey”, “ExpectedBucketOwner”. e.g. s3_additional_kwargs={'SSECustomerAlgorithm': 'md5'}


Returns

Pandas DataFrame with results from query.

Return type

pandas.DataFrame

Examples

Reading a gzip compressed JSON document

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object[*][*]',
...     path='s3://bucket/key.json.gzip',
...     input_serialization='JSON',
...     input_serialization_params={
...         'Type': 'Document',
...     },
...     compression="gzip",
... )
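
Reading a bzip2 compressed CSV object (a sketch with a hypothetical bucket and key; as noted above, bzip2 compression is only valid for CSV and JSON objects)

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv.bz2',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...     },
...     compression="bzip2",
... )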

Reading an entire CSV object using threads

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT * FROM s3object',
...     path='s3://bucket/key.csv',
...     input_serialization='CSV',
...     input_serialization_params={
...         'FileHeaderInfo': 'Use',
...         'RecordDelimiter': '\r\n'
...     },
...     use_threads=True,
... )

Reading a single column from Parquet object with pushdown filter

>>> import awswrangler as wr
>>> df = wr.s3.select_query(
...     sql='SELECT s."id" FROM s3object s WHERE s."id" = 1.0',
...     path='s3://bucket/key.snappy.parquet',
...     input_serialization='Parquet',
... )