awswrangler.athena.read_sql_query

awswrangler.athena.read_sql_query(sql: str, database: str, ctas_approach: bool = True, unload_approach: bool = False, unload_parameters: Dict[str, Any] | None = None, categories: List[str] | None = None, chunksize: int | bool | None = None, s3_output: str | None = None, workgroup: str | None = None, encryption: str | None = None, kms_key: str | None = None, keep_files: bool = True, ctas_database_name: str | None = None, ctas_temp_table_name: str | None = None, ctas_bucketing_info: Tuple[List[str], int] | None = None, ctas_write_compression: str | None = None, use_threads: bool | int = True, boto3_session: Session | None = None, max_cache_seconds: int = 0, max_cache_query_inspections: int = 50, max_remote_cache_entries: int = 50, max_local_cache_entries: int = 100, data_source: str | None = None, athena_query_wait_polling_delay: float = 0.25, params: Dict[str, Any] | None = None, s3_additional_kwargs: Dict[str, Any] | None = None, pyarrow_additional_kwargs: Dict[str, Any] | None = None) Any

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

There are three approaches available through the ctas_approach and unload_approach parameters:

1 - ctas_approach=True (Default):

Wraps the query with a CTAS and then reads the table data as Parquet directly from S3.

PROS:

  • Faster for mid and big result sizes.

  • Can handle some level of nested types.

CONS:

  • Requires create/delete table permissions on Glue.

  • Does not support timestamp with time zone.

  • Does not support columns with repeated names.

  • Does not support columns with undefined data types.

  • A temporary table will be created and then deleted immediately.

  • Does not support custom data_source/catalog_id.

2 - unload_approach=True and ctas_approach=False:

Runs an UNLOAD query on Athena and parses the Parquet result on S3.

PROS:

  • Faster for mid and big result sizes.

  • Can handle some level of nested types.

  • Does not modify the Glue Data Catalog.

CONS:

  • Output S3 path must be empty.

  • Does not support timestamp with time zone.

  • Does not support columns with repeated names.

  • Does not support columns with undefined data types.

3 - ctas_approach=False:

Runs a regular query on Athena and parses the regular CSV result on S3.

PROS:

  • Faster for small result sizes (less latency).

  • Does not require create/delete table permissions on Glue.

  • Supports timestamp with time zone.

  • Supports custom data_source/catalog_id.

CONS:

  • Slower for big results (but still faster than other libraries that use Athena’s regular API).

  • Does not handle nested types at all.
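The trade-offs above boil down to a handful of keyword arguments. A minimal offline sketch (the dicts below would be splatted into wr.athena.read_sql_query(sql=..., database=..., **kw); the S3 path is a placeholder):

```python
# Keyword combinations for the three approaches described above.
default_ctas = dict(ctas_approach=True)   # 1: CTAS + Parquet (default), best for mid/big results
no_catalog_writes = dict(                 # 2: UNLOAD + Parquet, leaves Glue untouched
    ctas_approach=False,
    unload_approach=True,
    s3_output="s3://my-bucket/unload/",   # placeholder; this path must be empty
)
low_latency = dict(ctas_approach=False)   # 3: plain query + CSV, best for small results

print(low_latency["ctas_approach"])
```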

Note

The resulting DataFrame (or every DataFrame in the returned Iterator for chunked queries) has a query_metadata attribute, which carries the query result metadata returned by Boto3/Athena.

For a practical example check out the related tutorial!

Note

Valid encryption modes: [None, ‘SSE_S3’, ‘SSE_KMS’].

P.S. ‘CSE_KMS’ is not supported.

Note

Creates the default Athena bucket if it doesn’t exist and s3_output is None.

(E.g. s3://aws-athena-query-results-ACCOUNT-REGION/)

Note

chunksize argument (memory friendly, i.e. batching):

Return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies:

  • If chunksize=True, depending on the size of the data, one or more DataFrames will be returned per file in the query result. Unlike chunksize=INTEGER, rows from different files will not be mixed in the resulting DataFrames.

  • If chunksize=INTEGER, awswrangler will iterate on the data by a number of rows equal to the received INTEGER.

P.S. chunksize=True is faster and uses less memory while chunksize=INTEGER is more precise in number of rows for each Dataframe.

P.P.S. If ctas_approach=False and chunksize=True, you will always receive an iterator with a single DataFrame because regular Athena queries only produce a single output file.
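The chunked mode is consumed as a plain loop. A minimal offline sketch, where a list of small DataFrames stands in for the iterator that wr.athena.read_sql_query(..., chunksize=True) would return:

```python
import pandas as pd

# Stand-in for the iterator of DataFrames returned when chunksize is set;
# in real use: chunks = wr.athena.read_sql_query(sql=..., database=..., chunksize=True)
chunks = [pd.DataFrame({"x": [1, 2]}), pd.DataFrame({"x": [3]})]

total_rows = 0
for df in chunks:          # each chunk is an independent pandas DataFrame
    total_rows += len(df)  # process per chunk without holding the full result in memory

print(total_rows)
```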

Note

In case of use_threads=True, the number of threads that will be spawned is obtained from os.cpu_count().

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • ctas_approach

  • database

  • max_cache_query_inspections

  • max_cache_seconds

  • max_remote_cache_entries

  • max_local_cache_entries

  • athena_query_wait_polling_delay

  • workgroup

  • chunksize

Check out the Global Configurations Tutorial for details.
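Per the Global Configurations tutorial, each of these arguments can also be supplied as an environment variable with a WR_ prefix, or set on wr.config. A sketch (the workgroup name is a placeholder):

```python
import os

# Environment-variable form, read by awswrangler's global configuration:
os.environ["WR_CTAS_APPROACH"] = "False"
os.environ["WR_WORKGROUP"] = "my-workgroup"  # placeholder name

# Equivalent programmatic form (requires awswrangler installed):
# import awswrangler as wr
# wr.config.ctas_approach = False
# wr.config.workgroup = "my-workgroup"

print(os.environ["WR_CTAS_APPROACH"])
```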

Parameters:
  • sql (str) – SQL query.

  • database (str) – AWS Glue/Athena database name - It is only the origin database from where the query will be launched. You can still use and mix several databases by writing the full table name within the sql (e.g. database.table).

  • ctas_approach (bool) – Wraps the query with a CTAS and reads the resulting Parquet data on S3. If False, reads the regular CSV on S3.

  • unload_approach (bool) – Wraps the query with UNLOAD and reads the results from S3. Only the PARQUET format is supported.

  • unload_parameters (Optional[Dict[str, Any]]) – Params of the UNLOAD such as format, compression, field_delimiter, and partitioned_by.

  • categories (List[str], optional) – List of column names that should be returned as pandas.Categorical. Recommended for memory-restricted environments.

  • chunksize (Union[int, bool], optional) – If passed, will split the data into an Iterable of DataFrames (memory friendly). If True, awswrangler iterates on the data by files in the most efficient way, without guaranteeing the chunk size. If an INTEGER is passed, awswrangler will iterate on the data by a number of rows equal to the received INTEGER.

  • s3_output (str, optional) – Amazon S3 path.

  • workgroup (str, optional) – Athena workgroup.

  • encryption (str, optional) – Valid values: [None, ‘SSE_S3’, ‘SSE_KMS’]. Notice: ‘CSE_KMS’ is not supported.

  • kms_key (str, optional) – For SSE-KMS, this is the KMS key ARN or ID.

  • keep_files (bool) – Whether staging files produced by Athena are retained. True by default.

  • ctas_database_name (str, optional) – The name of the alternative database where the CTAS temporary table is stored. If None, the default database is used.

  • ctas_temp_table_name (str, optional) – The name of the temporary table and also the directory name on S3 where the CTAS result is stored. If None, it will use the following random pattern: f”temp_table_{uuid.uuid4().hex}”. On S3 this directory will be under the pattern: f”{s3_output}/{ctas_temp_table_name}/”.

  • ctas_bucketing_info (Tuple[List[str], int], optional) – Tuple consisting of the column names used for bucketing as the first element and the number of buckets as the second element. Only str, int and bool are supported as column data types for bucketing.

  • ctas_write_compression (str, optional) – Write compression for the temporary table where the CTAS result is stored. Corresponds to the write_compression parameters for CREATE TABLE AS statement in Athena.

  • use_threads (bool, int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the max number of threads. If an integer is provided, that number is used.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if boto3_session receives None.

  • max_cache_seconds (int) – awswrangler can look up in Athena’s history if this query has been run before. If so, and its completion time is less than max_cache_seconds before now, awswrangler skips query execution and just returns the same results as last time. If cached results are valid, awswrangler ignores the ctas_approach, s3_output, encryption, kms_key, keep_files and ctas_temp_table_name params. If reading cached data fails for any reason, execution falls back to the usual query run path.

  • max_cache_query_inspections (int) – Max number of queries that will be inspected from the history to try to find some result to reuse. The larger the number of inspections, the higher the latency for non-cached queries. Only takes effect if max_cache_seconds > 0.

  • max_remote_cache_entries (int) – Max number of queries that will be retrieved from AWS for cache inspection. The larger the number, the higher the latency for non-cached queries. Only takes effect if max_cache_seconds > 0. Default is 50.

  • max_local_cache_entries (int) – Max number of queries for which metadata will be cached locally. This reduces latency and allows keeping more than max_remote_cache_entries available for the cache. This value should not be smaller than max_remote_cache_entries. Only takes effect if max_cache_seconds > 0. Default is 100.

  • data_source (str, optional) – Data Source / Catalog name. If None, ‘AwsDataCatalog’ will be used by default.

  • athena_query_wait_polling_delay (float, default: 0.25 seconds) – Interval in seconds for how often the function will check if the Athena query has completed.

  • params (Dict[str, Any], optional) – Dict of parameters used to construct the SQL query. Only named parameters are supported. The dict must contain entries in the form {‘name’: ‘value’}, and the SQL query must contain :name;. Note that for varchar columns and similar, you must surround the value in single quotes.

  • s3_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to botocore requests, e.g. s3_additional_kwargs={‘RequestPayer’: ‘requester’}.

  • pyarrow_additional_kwargs (Optional[Dict[str, Any]]) – Forwarded to the ParquetFile class or used when converting an Arrow table to Pandas. Currently only the “coerce_int96_timestamp_unit” and “timestamp_as_object” arguments are considered. If reading parquet files with timestamps that cannot be converted to pandas Timestamp[ns], consider setting timestamp_as_object=True to allow timestamp units larger than “ns”. If reading parquet data that still uses INT96 (like Athena outputs), you can use coerce_int96_timestamp_unit to specify which timestamp unit INT96 is decoded to (by default “ns”; if you know the output parquet came from a system that encodes timestamps in a particular unit, set this to that same unit, e.g. coerce_int96_timestamp_unit=”ms”).

Returns:

Pandas DataFrame or Generator of Pandas DataFrames if chunksize is passed.

Return type:

Union[pd.DataFrame, Iterator[pd.DataFrame]]

Examples

>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(sql="...", database="...")
>>> scanned_bytes = df.query_metadata["Statistics"]["DataScannedInBytes"]
>>> import awswrangler as wr
>>> df = wr.athena.read_sql_query(
...     sql="SELECT * FROM my_table WHERE name=:name; AND city=:city;",
...     database="...",
...     params={"name": "'filtered_name'", "city": "'filtered_city'"}
... )
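As the params note above explains, named parameters are substituted as text, so values for varchar columns must carry their own single quotes. A tiny hypothetical helper (not part of awswrangler) makes that explicit:

```python
def quote(value: str) -> str:
    # Wrap a string value in the single quotes Athena expects for varchar params.
    return f"'{value}'"

# Builds the same dict passed as `params` in the example above.
params = {"name": quote("filtered_name"), "city": quote("filtered_city")}
print(params["name"])
```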