awswrangler.redshift.unload_to_files

awswrangler.redshift.unload_to_files(sql: str, path: str, con: redshift_connector.Connection, iam_role: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_session_token: str | None = None, region: str | None = None, unload_format: Literal['CSV', 'PARQUET'] | None = None, parallel: bool = True, max_file_size: float | None = None, kms_key_id: str | None = None, manifest: bool = False, partition_cols: list[str] | None = None, boto3_session: boto3.Session | None = None) → None

Unload the result of a Redshift query to Parquet or CSV files on S3 (through the UNLOAD command).

https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html

Note

When use_threads=True, the number of threads spawned is obtained from os.cpu_count().

Parameters:
  • sql (str) – SQL query.

  • path (str) – S3 path to write stage files (e.g. s3://bucket_name/any_name/)

  • con (redshift_connector.Connection) – Use redshift_connector.connect() to use credentials directly or wr.redshift.connect() to fetch it from the Glue Catalog.

  • iam_role (str, optional) – AWS IAM role with the related permissions.

  • aws_access_key_id (str, optional) – The access key for your AWS account.

  • aws_secret_access_key (str, optional) – The secret key for your AWS account.

  • aws_session_token (str, optional) – The session key for your AWS account. This is only needed when you are using temporary credentials.

  • region (str, optional) – Specifies the AWS Region where the target Amazon S3 bucket is located. REGION is required for UNLOAD to an Amazon S3 bucket that isn’t in the same AWS Region as the Amazon Redshift cluster. By default, UNLOAD assumes that the target Amazon S3 bucket is located in the same AWS Region as the Amazon Redshift cluster.

  • unload_format (str, optional) – Format of the unloaded S3 objects from the query. Valid values: “CSV”, “PARQUET”. Case sensitive. Defaults to PARQUET.

  • parallel (bool) – Whether to unload to multiple files in parallel. Defaults to True. By default, UNLOAD writes data in parallel to multiple files, according to the number of slices in the cluster. If parallel is False, UNLOAD writes to one or more data files serially, sorted absolutely according to the ORDER BY clause, if one is used.

  • max_file_size (float, optional) – Specifies the maximum size (MB) of files that UNLOAD creates in Amazon S3. Specify a decimal value between 5.0 MB and 6200.0 MB. If None, the default maximum file size is 6200.0 MB.

  • kms_key_id (str, optional) – Specifies the key ID for an AWS Key Management Service (AWS KMS) key to be used to encrypt data files on Amazon S3.

  • manifest (bool) – Whether to unload a manifest file on S3.

  • partition_cols (List[str], optional) – Specifies the partition keys for the unload operation.

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if boto3_session receives None.

Return type:

None

Examples

>>> import awswrangler as wr
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.unload_to_files(
...     sql="SELECT * FROM public.mytable",
...     path="s3://bucket/extracted_parquet_files/",
...     con=con,
...     iam_role="arn:aws:iam::XXX:role/XXX"
... )
>>> con.close()
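
The parameters above can also be combined to change the unload format, partition the output, and cap file sizes. A minimal sketch, assuming a hypothetical event_date column in the queried table (the bucket and role ARN are placeholders):

>>> import awswrangler as wr
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.unload_to_files(
...     sql="SELECT * FROM public.mytable",
...     path="s3://bucket/extracted_csv_files/",
...     con=con,
...     iam_role="arn:aws:iam::XXX:role/XXX",
...     unload_format="CSV",             # CSV instead of the default PARQUET
...     partition_cols=["event_date"],   # hypothetical partition column
...     max_file_size=512.0,             # cap each unloaded file at ~512 MB
...     manifest=True,                   # also write a manifest file
... )
>>> con.close()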
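
When an IAM role is not available, static or temporary credentials can be passed instead, and region is needed when the target bucket is not in the same Region as the cluster. A minimal sketch with placeholder credentials, bucket, and a hypothetical id column, using parallel=False so the output is written serially in ORDER BY order:

>>> import awswrangler as wr
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.unload_to_files(
...     sql="SELECT * FROM public.mytable ORDER BY id",
...     path="s3://bucket-in-us-west-2/extracted_parquet_files/",
...     con=con,
...     aws_access_key_id="ASIA...",      # placeholder temporary credentials
...     aws_secret_access_key="...",
...     aws_session_token="...",
...     region="us-west-2",               # bucket Region differs from the cluster
...     parallel=False,                   # write serially, sorted by the ORDER BY clause
... )
>>> con.close()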