awswrangler.redshift.unload

awswrangler.redshift.unload(sql: str, path: str, con: redshift_connector.Connection, iam_role: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_session_token: str | None = None, region: str | None = None, max_file_size: float | None = None, kms_key_id: str | None = None, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', chunked: bool | int = False, keep_files: bool = False, parallel: bool = True, use_threads: bool | int = True, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, str] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) DataFrame | Iterator[DataFrame]

Load a Pandas DataFrame from an Amazon Redshift query result, using Parquet files on S3 as a stage.

This is a HIGH latency and HIGH throughput alternative to wr.redshift.read_sql_query()/wr.redshift.read_sql_table() for extracting large amounts of Amazon Redshift data into Pandas DataFrames through the UNLOAD command.

This strategy has more overhead and requires more IAM privileges than the regular wr.redshift.read_sql_query()/wr.redshift.read_sql_table() functions, so it is only recommended for fetching 1k+ rows at once.

https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

Note

Batching (chunked argument) (Memory Friendly):

Will enable the function to return an Iterable of DataFrames instead of a regular DataFrame.

There are two batching strategies on awswrangler:

  • If chunked=True, depending on the size of the data, one or more data frames are returned per file. Unlike chunked=INTEGER, rows from different files are not mixed in the resulting data frames.

  • If chunked=INTEGER, awswrangler iterates on the data by number of rows (equal to the received INTEGER).

P.S. chunked=True is faster and uses less memory while chunked=INTEGER is more precise in the number of rows for each DataFrame.
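
A minimal sketch of the memory-friendly pattern (the connection name, table, path and role ARN are placeholders, reused from the example at the bottom of this page):

>>> import awswrangler as wr
>>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con:
...     dfs = wr.redshift.unload(
...         sql="SELECT * FROM public.mytable",
...         path="s3://bucket/extracted_parquet_files/",
...         con=con,
...         iam_role="arn:aws:iam::XXX:role/XXX",
...         chunked=True,  # return an Iterator[DataFrame] instead of one DataFrame
...     )
...     for df in dfs:
...         print(len(df))  # process each chunk independently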

Note

If use_threads=True, the number of threads to be spawned is obtained from os.cpu_count().

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • dtype_backend

Check out the Global Configurations Tutorial for details.
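
As a sketch, the backend can be set once instead of being passed on every call (assuming the wr.config attribute mirrors the argument name, as described in the Global Configurations Tutorial):

>>> import awswrangler as wr
>>> wr.config.dtype_backend = "pyarrow"  # assumption: config attribute mirrors the argument name; requires Pandas 2.0+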

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

  • s3_additional_kwargs

Parameters:
  • sql (str) – SQL query.

  • path (str) – S3 path to write stage files (e.g. s3://bucket_name/any_name/)

  • con (Connection) – Use redshift_connector.connect() to use credentials directly, or wr.redshift.connect() to fetch them from the Glue Catalog.

  • iam_role (str | None) – AWS IAM role with the related permissions.

  • aws_access_key_id (str | None) – The access key for your AWS account.

  • aws_secret_access_key (str | None) – The secret key for your AWS account.

  • aws_session_token (str | None) – The session key for your AWS account. This is only needed when you are using temporary credentials.

  • region (str | None) – Specifies the AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that isn’t in the same AWS Region as the Amazon Redshift cluster. By default, UNLOAD assumes that the target Amazon S3 bucket is located in the same AWS Region as the Amazon Redshift cluster.

  • max_file_size (float | None) – Specifies the maximum size (MB) of files that UNLOAD creates in Amazon S3. Specify a decimal value between 5.0 MB and 6200.0 MB. If None, the default maximum file size is 6200.0 MB.

  • kms_key_id (str | None) – Specifies the key ID for an AWS Key Management Service (AWS KMS) key to be used to encrypt data files on Amazon S3.

  • keep_files (bool) – Whether to keep the stage files on S3.

  • parallel (bool) – Whether to unload to multiple files in parallel. Defaults to True. By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. If parallel is False, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used.

  • dtype_backend (Literal['numpy_nullable', 'pyarrow']) –

    Which dtype_backend to use, e.g. whether a DataFrame should be backed by NumPy arrays or pyarrow: if “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; if “pyarrow” is set, pyarrow is used for all dtypes.

    The dtype_backends are still experimental. The “pyarrow” backend is only supported with Pandas 2.0 or above.

  • chunked (bool | int) – If passed, the data is split into an Iterable of DataFrames (memory friendly). If True, awswrangler iterates on the data by files in the most efficient way without any guarantee of chunksize. If an INTEGER is passed, awswrangler will iterate on the data by number of rows equal to the received INTEGER.

  • use_threads (bool | int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the max number of threads. If an integer is provided, the specified number is used.

  • boto3_session (Session | None) – The default boto3 session will be used if boto3_session is None.

  • s3_additional_kwargs (dict[str, str] | None) – Forward to botocore requests.

  • pyarrow_additional_kwargs (dict[str, Any] | None) – Forwarded to the to_pandas method when converting from PyArrow tables to a Pandas DataFrame. Valid values include “split_blocks”, “self_destruct”, “ignore_metadata”. e.g. pyarrow_additional_kwargs={'split_blocks': True}.

Return type:

DataFrame | Iterator[DataFrame]

Returns:

Result as Pandas DataFrame(s).

Examples

>>> import awswrangler as wr
>>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con:
...     df = wr.redshift.unload(
...         sql="SELECT * FROM public.mytable",
...         path="s3://bucket/extracted_parquet_files/",
...         con=con,
...         iam_role="arn:aws:iam::XXX:role/XXX"
...     )
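
A second sketch, unloading with explicit temporary credentials instead of an IAM role, capping the size of the stage files and cleaning them up afterwards; the connection name, key values, Region and file size are placeholders:

>>> import awswrangler as wr
>>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con:
...     df = wr.redshift.unload(
...         sql="SELECT * FROM public.mytable",
...         path="s3://bucket/extracted_parquet_files/",
...         con=con,
...         aws_access_key_id="XXX",
...         aws_secret_access_key="XXX",
...         aws_session_token="XXX",   # only needed for temporary credentials
...         region="us-east-1",        # required if the bucket is not in the cluster's Region
...         max_file_size=512.0,       # maximum size (MB) of each file UNLOAD creates on S3
...         keep_files=False,          # delete the stage files after reading
...     )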