awswrangler.athena.read_sql_query¶

awswrangler.athena.read_sql_query(sql: str, database: str, ctas_approach: bool = True, unload_approach: bool = False, ctas_parameters: AthenaCTASSettings | None = None, unload_parameters: AthenaUNLOADSettings | None = None, categories: list[str] | None = None, chunksize: int | bool | None = None, s3_output: str | None = None, workgroup: str = 'primary', encryption: str | None = None, kms_key: str | None = None, keep_files: bool = True, use_threads: bool | int = True, boto3_session: Session | None = None, client_request_token: str | None = None, athena_cache_settings: AthenaCacheSettings | None = None, data_source: str | None = None, athena_query_wait_polling_delay: float = 1.0, params: dict[str, Any] | list[str] | None = None, paramstyle: Literal['qmark', 'named'] = 'named', result_reuse_configuration: dict[str, Any] | None = None, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) → DataFrame | Iterator[DataFrame]¶

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

Related tutorial:

There are three approaches available through ctas_approach and unload_approach parameters:

1 - ctas_approach=True (Default):

Wrap the query with a CTAS and then reads the table data as parquet directly from s3.

PROS:

Faster for mid and big result sizes.
Can handle some level of nested types.

CONS:

Requires create/delete table permissions on Glue.
Does not support timestamp with time zone
Does not support columns with repeated names.
Does not support columns with undefined data types.
A temporary table will be created and then deleted immediately.
Does not support custom data_source/catalog_id.

2 - unload_approach=True and ctas_approach=False:

Does an UNLOAD query on Athena and parse the Parquet result on s3.

PROS:

Faster for mid and big result sizes.
Can handle some level of nested types.
Does not modify Glue Data Catalog

CONS:

Output S3 path must be empty.
Does not support timestamp with time zone.
Does not support columns with repeated names.
Does not support columns with undefined data types.

3 - ctas_approach=False:

Does a regular query on Athena and parse the regular CSV result on s3.

PROS:

Faster for small result sizes (less latency).
Does not require create/delete table permissions on Glue
Supports timestamp with time zone.
Support custom data_source/catalog_id.

CONS:

Slower for big results (But stills faster than other libraries that uses the regular Athena’s API)
Does not handle nested types at all.

Note

The resulting DataFrame (or every DataFrame in the returned Iterator for chunked queries) have a query_metadata attribute, which brings the query result metadata returned by Boto3/Athena .

For a practical example check out the related tutorial!

Note

Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].

P.S. ‘CSE_KMS’ is not supported.

Note

Create the default Athena bucket if it doesn’t exist and s3_output is None.

(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)

Note

chunksize argument (Memory Friendly) (i.e batching):

Return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies:

If chunksize=True, depending on the size of the data, one or more data frames are returned per file in the query result. Unlike chunksize=INTEGER, rows from different files are not mixed in the resulting data frames.
If chunksize=INTEGER, awswrangler iterates on the data by number of rows equal to the received INTEGER.

P.S. chunksize=True is faster and uses less memory while chunksize=INTEGER is more precise in number of rows for each data frame.

P.P.S. If ctas_approach=False and chunksize=True, you will always receive an iterator with a single DataFrame because regular Athena queries only produces a single output file.

Note

In case of use_threads=True the number of threads that will be spawned will be gotten from os.cpu_count().

Parameters:

sql (str) – SQL query.
database (str) – AWS Glue/Athena database name - It is only the origin database from where the query will be launched. You can still using and mixing several databases writing the full table name within the sql (e.g. database.table).
ctas_approach (bool) – Wraps the query using a CTAS, and read the resulted parquet data on S3. If false, read the regular CSV on S3.
unload_approach (bool) – Wraps the query using UNLOAD, and read the results from S3. Only PARQUET format is supported.
ctas_parameters (AthenaCTASSettings | None) –
Note

Following arguments are not supported in distributed mode with engine EngineEnum.RAY:
- boto3_session
- s3_additional_kwargs
This function has arguments which can be configured globally through wr.config or environment variables:
- ctas_approach
- database
- athena_cache_settings
- athena_query_wait_polling_delay
- workgroup
- chunksize
- dtype_backend
Check out the Global Configurations Tutorial for details.

Parameters of the CTAS such as database, temp_table_name, bucketing_info, and compression.

unload_parameters

Parameters of the UNLOAD such as format, compression, field_delimiter, and partitioned_by.

categories

List of columns names that should be returned as pandas.Categorical. Recommended for memory restricted environments.

chunksize

If passed will split the data in a Iterable of DataFrames (Memory friendly). If True awswrangler iterates on the data by files in the most efficient way without guarantee of chunksize. If an INTEGER is passed awswrangler will iterate on the data by number of rows equal the received INTEGER.

s3_output

Amazon S3 path. Not required for the regular query path (ctas_approach=False, unload_approach=False) when the workgroup uses managed query results. Still used for CTAS/UNLOAD paths.

workgroup

Athena workgroup. Primary by default.

encryption

Valid values: [None, ‘SSE_S3’, ‘SSE_KMS’]. Notice: ‘CSE_KMS’ is not supported.

kms_key

For SSE-KMS, this is the KMS key ARN or ID.

keep_files

Whether staging files produced by Athena are retained. ‘True’ by default.

use_threads

True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads. If integer is provided, specified number is used.

boto3_session

The default boto3 session will be used if boto3_session receive None.

client_request_token

A unique case-sensitive string used to ensure the request to create the query is idempotent (executes only once). If another StartQueryExecution request is received, the same response is returned and another query is not created. If a parameter has changed, for example, the QueryString , an error is returned. If you pass the same client_request_token value with different parameters the query fails with error message “Idempotent parameters do not match”. Use this only with ctas_approach=False and unload_approach=False and disabled cache.

athena_cache_settings

Parameters of the Athena cache settings such as max_cache_seconds, max_cache_query_inspections, max_remote_cache_entries, and max_local_cache_entries. AthenaCacheSettings is a TypedDict, meaning the passed parameter can be instantiated either as an instance of AthenaCacheSettings or as a regular Python dict. If cached results are valid, awswrangler ignores the ctas_approach, s3_output, encryption, kms_key, keep_files and ctas_temp_table_name params. If reading cached data fails for any reason, execution falls back to the usual query run path.

data_source

Data Source / Catalog name. If None, ‘AwsDataCatalog’ will be used by default.

athena_query_wait_polling_delay

Interval in seconds for how often the function will check if the Athena query has completed.

params

Parameters that will be used for constructing the SQL query. Only named or question mark parameters are supported. The parameter style needs to be specified in the paramstyle parameter.

For paramstyle="named", this value needs to be a dictionary. The dict needs to contain the information in the form {'name': 'value'} and the SQL query needs to contain :name. The formatter will be applied client-side in this scenario.

For paramstyle="qmark", this value needs to be a list of strings. The formatter will be applied server-side. The values are applied sequentially to the parameters in the query in the order in which the parameters occur.

paramstyle

Determines the style of params. Possible values are:

named
qmark

result_reuse_configuration

A structure that contains the configuration settings for reusing query results. This parameter is only valid when both ctas_approach and unload_approach are set to False. See also: https://docs.aws.amazon.com/athena/latest/ug/reusing-query-results.html

dtype_backend

Which dtype_backend to use, e.g. whether a DataFrame should have NumPy arrays, nullable dtypes are used for all dtypes that have a nullable implementation when “numpy_nullable” is set, pyarrow is used for all dtypes if “pyarrow” is set.

The dtype_backends are still experimential. The “pyarrow” backend is only supported with Pandas 2.0 or above.

s3_additional_kwargs

Forwarded to botocore requests. e.g. s3_additional_kwargs={‘RequestPayer’: ‘requester’}

pyarrow_additional_kwargs

Forwarded to to_pandas method converting from PyArrow tables to Pandas DataFrame. Valid values include “split_blocks”, “self_destruct”, “ignore_metadata”. e.g. pyarrow_additional_kwargs={‘split_blocks’: True}.

Return type:: DataFrame | Iterator[DataFrame]
Returns:: Pandas DataFrame or Generator of Pandas DataFrames if chunksize is passed.

Examples

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(sql="...", database="...")
>>> scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="SELECT * FROM my_table WHERE name=:name AND city=:city",
...     params={"name": "filtered_name", "city": "filtered_city"}
... )

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="...",
...     database="...",
...     athena_cache_settings={
...          "max_cache_seconds": 90,
...     },
... )