awswrangler.redshift.unload¶
- awswrangler.redshift.unload(sql: str, path: str, con: redshift_connector.core.Connection, iam_role: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_session_token: str | None = None, region: str | None = None, max_file_size: float | None = None, kms_key_id: str | None = None, dtype_backend: Literal['numpy_nullable', 'pyarrow'] = 'numpy_nullable', chunked: bool | int = False, keep_files: bool = False, parallel: bool = True, cleanpath: bool = False, use_threads: bool | int = True, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, str] | None = None, pyarrow_additional_kwargs: dict[str, Any] | None = None) DataFrame | Iterator[DataFrame]¶
Load Pandas DataFrame from a Amazon Redshift query result using Parquet files on s3 as stage.
This is a HIGH latency and HIGH throughput alternative to wr.redshift.read_sql_query()/wr.redshift.read_sql_table() to extract large Amazon Redshift data into a Pandas DataFrames through the UNLOAD command.
This strategy has more overhead and requires more IAM privileges than the regular wr.redshift.read_sql_query()/wr.redshift.read_sql_table() function, so it is only recommended to fetch 1k+ rows at once.
https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
Note
Batching(chunked argument) (Memory Friendly):Will enable the function to return an Iterable of DataFrames instead of a regular DataFrame.
There are two batching strategies on awswrangler:
If chunked=True, depending on the size of the data, one or more data frames are returned per file. Unlike chunked=INTEGER, rows from different files are not be mixed in the resulting data frames.
If chunked=INTEGER, awswrangler iterates on the data by number of rows (equal to the received INTEGER).
P.S. chunked=True is faster and uses less memory while chunked=INTEGER is more precise in the number of rows for each DataFrame.
Note
In case of use_threads=True the number of threads that will be spawned will be gotten from os.cpu_count().
- Parameters:
sql (
str) – SQL query.path (
str) – S3 path to write stage files (e.g. s3://bucket_name/any_name/)con (
Connection) – Use redshift_connector.connect() to use ” “credentials directly or wr.redshift.connect() to fetch it from the Glue Catalog.iam_role (
str|None) – AWS IAM role with the related permissions.aws_access_key_id (
str|None) – The access key for your AWS account.aws_secret_access_key (
str|None) – The secret key for your AWS account.aws_session_token (
str|None) – The session key for your AWS account. This is only needed when you are using temporary credentials.region (
str|None) – Specifies the AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that isn’t in the same AWS Region as the Amazon Redshift cluster. By default, UNLOAD assumes that the target Amazon S3 bucket is located in the same AWS Region as the Amazon Redshift cluster.max_file_size (
float|None) – Specifies the maximum size (MB) of files that UNLOAD creates in Amazon S3. Specify a decimal value between 5.0 MB and 6200.0 MB. If None, the default maximum file size is 6200.0 MB.kms_key_id (
str|None) – Specifies the key ID for an AWS Key Management Service (AWS KMS) key to be used to encrypt data files on Amazon S3.keep_files (
bool) – Should keep stage files?parallel (
bool) – Whether to unload to multiple files in parallel. Defaults to True. By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. If parallel is False, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used.cleanpath (
bool) –Use CLEANPATH instead of ALLOWOVERWRITE. When True, uses CLEANPATH to remove existing files located in the Amazon S3 path before unloading files. When False (default), uses ALLOWOVERWRITE to overwrite existing files, including the manifest file. These options are mutually exclusive.
ALLOWOVERWRITE: By default, UNLOAD fails if it finds files that it would possibly overwrite. If ALLOWOVERWRITE is specified, UNLOAD overwrites existing files, including the manifest file.
CLEANPATH: Removes existing files located in the Amazon S3 path specified in the TO clause before unloading files to the specified location. If you include the PARTITION BY clause, existing files are removed only from the partition folders to receive new files generated by the UNLOAD operation. You must have the s3:DeleteObject permission on the Amazon S3 bucket. Files removed using CLEANPATH are permanently deleted and can’t be recovered.
For more information, see: https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html
dtype_backend (
Literal['numpy_nullable','pyarrow']) –Which dtype_backend to use, e.g. whether a DataFrame should have NumPy arrays, nullable dtypes are used for all dtypes that have a nullable implementation when “numpy_nullable” is set, pyarrow is used for all dtypes if “pyarrow” is set.
The dtype_backends are still experimential. The “pyarrow” backend is only supported with Pandas 2.0 or above.
chunked (
bool|int) – If passed will split the data in a Iterable of DataFrames (Memory friendly). If True awswrangler iterates on the data by files in the most efficient way without guarantee of chunksize. If an INTEGER is passed awswrangler will iterate on the data by number of rows equal the received INTEGER.use_threads (
bool|int) – True to enable concurrent requests, False to disable multiple threads. If enabled os.cpu_count() will be used as the max number of threads. If integer is provided, specified number is used.boto3_session (
Session|None) – The default boto3 session will be used if boto3_session isNone.s3_additional_kwargs (
dict[str,str] |None) – Forward to botocore requests.pyarrow_additional_kwargs (
dict[str,Any] |None) – Forwarded to to_pandas method converting from PyArrow tables to Pandas DataFrame. Valid values include “split_blocks”, “self_destruct”, “ignore_metadata”. e.g. pyarrow_additional_kwargs={‘split_blocks’: True}.
- Return type:
DataFrame|Iterator[DataFrame]- Returns:
Result as Pandas DataFrame(s).
Examples
>>> import awswrangler as wr >>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con: ... df = wr.redshift.unload( ... sql="SELECT * FROM public.mytable", ... path="s3://bucket/extracted_parquet_files/", ... con=con, ... iam_role="arn:aws:iam::XXX:role/XXX" ... ) >>> # Using CLEANPATH instead of ALLOWOVERWRITE >>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con: ... df = wr.redshift.unload( ... sql="SELECT * FROM public.mytable", ... path="s3://bucket/extracted_parquet_files/", ... con=con, ... iam_role="arn:aws:iam::XXX:role/XXX", ... cleanpath=True ... )