awswrangler.athena.read_sql_query¶
- awswrangler.athena.read_sql_query(sql: str, database: str, ctas_approach: bool = True, unload_approach: bool = False, ctas_parameters: AthenaCTASSettings | None = None, unload_parameters: AthenaUNLOADSettings | None = None, categories: list[str] | None = None, chunksize: int | bool | None = None, s3_output: str | None = None, workgroup: str = 'primary', encryption: str | None = None, kms_key: str | None = None, keep_files: bool = True, use_threads: bool | int = True, boto3_session: Session | None = None, client_request_token: str | None = None, athena_cache_settings: AthenaCacheSettings | None = None, data_source: str | None = None, athena_query_wait_polling_delay: float = 1.0, params: dict[str, Any] | list[str] | None = None, paramstyle: Literal['qmark', 'named'] = 'named', dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', s3_additional_kwargs: dict[str, Any] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) DataFrame | Iterator[DataFrame] ¶
Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.
There are three approaches available through the ctas_approach and unload_approach parameters (see the sketch after this list):
1 - ctas_approach=True (Default):
Wraps the query with a CTAS and then reads the table data as Parquet directly from S3.
PROS:
Faster for mid and big result sizes.
Can handle some level of nested types.
CONS:
Requires create/delete table permissions on Glue.
Does not support timestamp with time zone.
Does not support columns with repeated names.
Does not support columns with undefined data types.
A temporary table will be created and then deleted immediately.
Does not support custom data_source/catalog_id.
2 - unload_approach=True and ctas_approach=False:
Does an UNLOAD query on Athena and parses the Parquet result on S3.
PROS:
Faster for mid and big result sizes.
Can handle some level of nested types.
Does not modify the Glue Data Catalog.
CONS:
Output S3 path must be empty.
Does not support timestamp with time zone.
Does not support columns with repeated names.
Does not support columns with undefined data types.
3 - ctas_approach=False:
Does a regular query on Athena and parses the regular CSV result on S3.
PROS:
Faster for small result sizes (less latency).
Does not require create/delete table permissions on Glue.
Supports timestamp with time zone.
Supports custom data_source/catalog_id.
CONS:
Slower for big results (but still faster than other libraries that use the regular Athena API).
Does not handle nested types at all.
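The calls below sketch how each approach is selected; the table, database, and bucket names are placeholders, not values from this documentation.
>>> import awswrangler as wr
>>> # 1 - CTAS approach (default): temporary CTAS table, Parquet read from S3
>>> df = wr.athena.read_sql_query(sql="SELECT * FROM my_table", database="my_db")
>>> # 2 - UNLOAD approach: no Glue Data Catalog changes, Parquet read from S3
>>> df = wr.athena.read_sql_query(
...     sql="SELECT * FROM my_table",
...     database="my_db",
...     ctas_approach=False,
...     unload_approach=True,
...     s3_output="s3://my-bucket/empty-prefix/",  # placeholder; the prefix must be empty
... )
>>> # 3 - Regular approach: plain Athena query, CSV result read from S3
>>> df = wr.athena.read_sql_query(sql="SELECT * FROM my_table", database="my_db", ctas_approach=False)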
Note
The resulting DataFrame (or every DataFrame in the returned Iterator for chunked queries) has a query_metadata attribute, which contains the query result metadata returned by Boto3/Athena.
For a practical example, check out the related tutorial!
Note
Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].
P.S. ‘CSE_KMS’ is not supported.
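As a minimal sketch of requesting SSE-KMS encryption of the query results (the KMS key ARN below is a placeholder):
>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="...",
...     database="...",
...     encryption="SSE_KMS",
...     kms_key="arn:aws:kms:...",  # placeholder KMS key ARN or ID
... )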
Note
Creates the default Athena bucket if it doesn't exist and s3_output is None.
(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)
Note
chunksize argument (memory friendly, i.e. batching):
Return an Iterable of DataFrames instead of a regular DataFrame.
There are two batching strategies:
If chunksize=True, depending on the size of the data, one or more data frames are returned per file in the query result. Unlike chunksize=INTEGER, rows from different files are not mixed in the resulting data frames.
If chunksize=INTEGER, awswrangler iterates on the data by number of rows equal to the received INTEGER.
P.S. chunksize=True is faster and uses less memory while chunksize=INTEGER is more precise in number of rows for each data frame.
P.P.S. If ctas_approach=False and chunksize=True, you will always receive an iterator with a single DataFrame because regular Athena queries only produce a single output file.
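A minimal sketch of batched reading (query and database are placeholders):
>>> import awswrangler as wr
>>> dfs = wr.athena.read_sql_query(sql="...", database="...", chunksize=True)  # one or more DataFrames per result file
>>> for df in dfs:
...     print(len(df))  # process each chunk independently, keeping memory usage bounded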
Note
If use_threads=True, the number of threads that will be spawned is determined by os.cpu_count().
Note
The following arguments are not supported in distributed mode with engine EngineEnum.RAY:
boto3_session
s3_additional_kwargs
Note
This function has arguments which can be configured globally through wr.config or environment variables:
ctas_approach
database
athena_cache_settings
athena_query_wait_polling_delay
workgroup
chunksize
dtype_backend
Check out the Global Configurations Tutorial for details.
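Based on the list above, a sketch of setting a few of these options globally through wr.config (the values are illustrative placeholders):
>>> import awswrangler as wr
>>> wr.config.ctas_approach = False        # applies to every subsequent call
>>> wr.config.workgroup = "my-workgroup"   # placeholder workgroup name
>>> df = wr.athena.read_sql_query(sql="...", database="...")  # picks up the global settings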
- Parameters:
sql (str) – SQL query.
database (str) – AWS Glue/Athena database name - it is only the origin database from where the query will be launched. You can still use and mix several databases by writing the full table name within the sql (e.g. database.table).
ctas_approach (bool) – Wraps the query using a CTAS and reads the resulting Parquet data on S3. If false, reads the regular CSV on S3.
unload_approach (bool) – Wraps the query using UNLOAD and reads the results from S3. Only PARQUET format is supported.
ctas_parameters (AthenaCTASSettings | None) – Parameters of the CTAS such as database, temp_table_name, bucketing_info, and compression.
unload_parameters (AthenaUNLOADSettings | None) – Parameters of the UNLOAD such as format, compression, field_delimiter, and partitioned_by.
categories (list[str] | None) – List of column names that should be returned as pandas.Categorical. Recommended for memory-restricted environments.
chunksize (int | bool | None) – If passed, splits the data into an Iterable of DataFrames (memory friendly). If True, awswrangler iterates on the data by files in the most efficient way without guaranteeing the chunksize. If an INTEGER is passed, awswrangler iterates on the data by number of rows equal to the received INTEGER.
s3_output (str | None) – Amazon S3 path.
workgroup (str) – Athena workgroup. Primary by default.
encryption (str | None) – Valid values: [None, 'SSE_S3', 'SSE_KMS']. Notice: 'CSE_KMS' is not supported.
kms_key (str | None) – For SSE-KMS, this is the KMS key ARN or ID.
keep_files (bool) – Whether staging files produced by Athena are retained. True by default.
use_threads (bool | int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the maximum number of threads. If an integer is provided, that number is used.
boto3_session (Session | None) – The default boto3 session is used if boto3_session receives None.
client_request_token (str | None) – A unique case-sensitive string used to ensure the request to create the query is idempotent (executes only once). If another StartQueryExecution request is received, the same response is returned and another query is not created. If a parameter has changed, for example the QueryString, an error is returned. If you pass the same client_request_token value with different parameters, the query fails with the error message "Idempotent parameters do not match". Use this only with ctas_approach=False and unload_approach=False and disabled cache.
athena_cache_settings (AthenaCacheSettings | None) – Parameters of the Athena cache settings such as max_cache_seconds, max_cache_query_inspections, max_remote_cache_entries, and max_local_cache_entries. AthenaCacheSettings is a TypedDict, meaning the passed parameter can be instantiated either as an instance of AthenaCacheSettings or as a regular Python dict. If cached results are valid, awswrangler ignores the ctas_approach, s3_output, encryption, kms_key, keep_files and ctas_temp_table_name params. If reading cached data fails for any reason, execution falls back to the usual query run path.
data_source (str | None) – Data Source / Catalog name. If None, 'AwsDataCatalog' is used by default.
athena_query_wait_polling_delay (float) – Interval in seconds for how often the function checks if the Athena query has completed.
params (dict[str, Any] | list[str] | None) – Parameters that will be used for constructing the SQL query. Only named or question mark parameters are supported. The parameter style needs to be specified in the paramstyle parameter. For paramstyle="named", this value needs to be a dictionary. The dict needs to contain the information in the form {'name': 'value'} and the SQL query needs to contain :name. The formatter will be applied client-side in this scenario. For paramstyle="qmark", this value needs to be a list of strings. The formatter will be applied server-side. The values are applied sequentially to the parameters in the query in the order in which the parameters occur.
paramstyle (Literal['qmark', 'named']) – Determines the style of params. Possible values are: "named", "qmark".
dtype_backend (Literal['numpy_nullable', 'pyarrow']) – Which dtype_backend to use: when "numpy_nullable" is set, nullable dtypes are used for all dtypes that have a nullable implementation; when "pyarrow" is set, pyarrow is used for all dtypes. The dtype_backends are still experimental. The "pyarrow" backend is only supported with Pandas 2.0 or above.
s3_additional_kwargs (dict[str, Any] | None) – Forwarded to botocore requests. e.g. s3_additional_kwargs={'RequestPayer': 'requester'}
pyarrow_additional_kwargs (dict[str, Any] | None) – Forwarded to the to_pandas method converting from PyArrow tables to Pandas DataFrame. Valid values include "split_blocks", "self_destruct", "ignore_metadata". e.g. pyarrow_additional_kwargs={'split_blocks': True}.
- Return type:
DataFrame | Iterator[DataFrame]
- Returns:
Pandas DataFrame or Generator of Pandas DataFrames if chunksize is passed.
Examples
>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(sql="...", database="...")
>>> scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="SELECT * FROM my_table WHERE name=:name AND city=:city",
...     params={"name": "filtered_name", "city": "filtered_city"}
... )

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="...",
...     database="...",
...     athena_cache_settings={
...         "max_cache_seconds": 90,
...     },
... )
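In addition to the named style shown above, a sketch of the qmark (server-side) parameter style; the table name and values are placeholders:
>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="SELECT * FROM my_table WHERE name=? AND city=?",
...     database="...",
...     params=["filtered_name", "filtered_city"],  # applied in the order the ? placeholders occur
...     paramstyle="qmark",
... )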