awswrangler.redshift.copy

awswrangler.redshift.copy(df: pd.DataFrame, path: str, con: redshift_connector.Connection, table: str, schema: str, iam_role: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_session_token: str | None = None, index: bool = False, dtype: dict[str, str] | None = None, mode: _ToSqlModeLiteral = 'append', overwrite_method: _ToSqlOverwriteModeLiteral = 'drop', diststyle: _ToSqlDistStyleLiteral = 'AUTO', distkey: str | None = None, sortstyle: _ToSqlSortStyleLiteral = 'COMPOUND', sortkey: list[str] | None = None, primary_keys: list[str] | None = None, varchar_lengths_default: int = 256, varchar_lengths: dict[str, int] | None = None, serialize_to_json: bool = False, keep_files: bool = False, use_threads: bool | int = True, lock: bool = False, commit_transaction: bool = True, sql_copy_extra_params: list[str] | None = None, boto3_session: boto3.Session | None = None, s3_additional_kwargs: dict[str, str] | None = None, max_rows_by_file: int | None = 10000000, precombine_key: str | None = None, use_column_names: bool = False) None

Load a Pandas DataFrame as a table on Amazon Redshift using Parquet files on S3 as a stage.

This is a HIGH latency and HIGH throughput alternative to wr.redshift.to_sql() to load large DataFrames into Amazon Redshift through the SQL COPY command.

This strategy has more overhead and requires more IAM privileges than the regular wr.redshift.to_sql() function, so it is only recommended for inserting 1,000+ rows at once.

https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
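
Conceptually, the function stages the DataFrame as Parquet files under path and then issues a COPY statement against the cluster. The sketch below shows that two-step idea using awswrangler's s3.to_parquet and a raw redshift_connector cursor; df, con, the bucket, table and role ARN are placeholders, and the real implementation additionally handles table creation, stage-file cleanup and transactions.

>>> import awswrangler as wr
>>> # 1. Stage the DataFrame as Parquet files on S3 (the staging path must be empty).
>>> wr.s3.to_parquet(df=df, path="s3://bucket/stage/", dataset=True)
>>> # 2. Load the staged files into the target table with a COPY statement.
>>> cursor = con.cursor()
>>> cursor.execute(
...     "COPY public.my_table "
...     "FROM 's3://bucket/stage/' "
...     "IAM_ROLE 'arn:aws:iam::XXX:role/XXX' "
...     "FORMAT AS PARQUET"
... )
>>> con.commit()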

Note

If the table does not exist yet, it will be automatically created for you using the Parquet metadata to infer the column data types.

Note

In case of use_threads=True, the number of threads that will be spawned is obtained from os.cpu_count().

Note

The following arguments are not supported in distributed mode with engine EngineEnum.RAY:

  • boto3_session

  • s3_additional_kwargs

Parameters:
  • df (pandas.DataFrame) – Pandas DataFrame.

  • path (str) – S3 path to write stage files (e.g. s3://bucket_name/any_name/). Note: This path must be empty.

  • con (redshift_connector.Connection) – Use redshift_connector.connect() to use credentials directly or wr.redshift.connect() to fetch them from the Glue Catalog.

  • table (str) – Table name

  • schema (str) – Schema name

  • iam_role (str, optional) – AWS IAM role with the related permissions.

  • aws_access_key_id (str, optional) – The access key for your AWS account.

  • aws_secret_access_key (str, optional) – The secret key for your AWS account.

  • aws_session_token (str, optional) – The session key for your AWS account. This is only needed when you are using temporary credentials.

  • index (bool) – True to store the DataFrame index in file, otherwise False to ignore it.

  • dtype (Dict[str, str], optional) – Dictionary of column names and Athena/Glue types to be cast. Useful when you have columns with undetermined or mixed data types. (e.g. {'col name': 'bigint', 'col2 name': 'int'})

  • mode (str) – Append, overwrite or upsert.

  • overwrite_method (str) –

    Drop, cascade, truncate, or delete. Only applicable in overwrite mode (see the overwrite example in the Examples section below).

    "drop" - DROP ... RESTRICT - drops the table. Fails if there are any views that depend on it.

    "cascade" - DROP ... CASCADE - drops the table, and all views that depend on it.

    "truncate" - TRUNCATE ... - truncates the table, but immediately commits the current transaction & starts a new one, hence the overwrite happens in two transactions and is not atomic.

    "delete" - DELETE FROM ... - deletes all rows from the table. Slow relative to the other methods.

  • diststyle (str) – Redshift distribution styles. Must be in [“AUTO”, “EVEN”, “ALL”, “KEY”]. https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html

  • distkey (str, optional) – Specifies a column name or positional number for the distribution key.

  • sortstyle (str) – Sorting can be “COMPOUND” or “INTERLEAVED”. https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html

  • sortkey (List[str], optional) – List of columns to be sorted.

  • primary_keys (List[str], optional) – Primary keys.

  • varchar_lengths_default (int) – The size that will be set for all VARCHAR columns not specified with varchar_lengths.

  • varchar_lengths (Dict[str, int], optional) – Dict of VARCHAR length by columns. (e.g. {"col1": 10, "col5": 200}).

  • serialize_to_json (bool) – Whether to add the SERIALIZETOJSON parameter to the COPY command. Necessary to load nested or semistructured data into SUPER columns.

  • keep_files (bool) – Whether to keep the stage files in S3 after the load.

  • use_threads (bool, int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() is used as the max number of threads. If an integer is provided, that number of threads is used.

  • lock (bool) – True to execute LOCK command inside the transaction to force serializable isolation.

  • commit_transaction (bool) – Whether to commit the transaction. True by default.

  • sql_copy_extra_params (Optional[List[str]]) – Additional copy parameters to pass to the command. For example: [“STATUPDATE ON”]

  • boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session is used if boto3_session is None.

  • s3_additional_kwargs (Dict[str, str], optional) – Forwarded to botocore requests. e.g. s3_additional_kwargs={'ServerSideEncryption': 'aws:kms', 'SSEKMSKeyId': 'YOUR_KMS_KEY_ARN'}

  • max_rows_by_file (int) – Max number of rows in each file. Default is 10,000,000. (e.g. 33554432, 268435456)

  • precombine_key (str, optional) – When there is a primary_keys match during an upsert, the values of this column in the source and target rows are compared and the row with the larger value is kept. Only takes effect when mode = upsert (see the upsert example below).

  • use_column_names (bool) – If set to True, the column names of the DataFrame will be used to generate the column list of the load statement. E.g. if the DataFrame has two columns col1 and col3 and use_column_names is True, data will only be loaded into the database columns col1 and col3.

Returns:

None.

Return type:

None

Examples

>>> import awswrangler as wr
>>> import pandas as pd
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.copy(
...     df=pd.DataFrame({'col': [1, 2, 3]}),
...     path="s3://bucket/my_parquet_files/",
...     con=con,
...     table="my_table",
...     schema="public",
...     iam_role="arn:aws:iam::XXX:role/XXX"
... )
>>> con.close()
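
Overwriting an existing table by truncating it rather than dropping it, taking a LOCK inside the transaction (a sketch; the bucket, table and role ARN are placeholders):

>>> import awswrangler as wr
>>> import pandas as pd
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.copy(
...     df=pd.DataFrame({'col': [1, 2, 3]}),
...     path="s3://bucket/my_parquet_files/",
...     con=con,
...     table="my_table",
...     schema="public",
...     iam_role="arn:aws:iam::XXX:role/XXX",
...     mode="overwrite",
...     overwrite_method="truncate",
...     lock=True
... )
>>> con.close()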
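
Upserting on a primary key, keeping the row with the larger updated_at value when a source row matches a target row (a sketch; the id and updated_at columns, bucket, table and role ARN are placeholders):

>>> import awswrangler as wr
>>> import pandas as pd
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.copy(
...     df=pd.DataFrame({'id': [1, 2], 'col': [10, 20], 'updated_at': ['2024-01-01', '2024-01-02']}),
...     path="s3://bucket/my_parquet_files/",
...     con=con,
...     table="my_table",
...     schema="public",
...     iam_role="arn:aws:iam::XXX:role/XXX",
...     mode="upsert",
...     primary_keys=["id"],
...     precombine_key="updated_at"
... )
>>> con.close()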
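
Authenticating the COPY with temporary credentials instead of an IAM role (a sketch; the credential values, bucket and table are placeholders):

>>> import awswrangler as wr
>>> import pandas as pd
>>> con = wr.redshift.connect("MY_GLUE_CONNECTION")
>>> wr.redshift.copy(
...     df=pd.DataFrame({'col': [1, 2, 3]}),
...     path="s3://bucket/my_parquet_files/",
...     con=con,
...     table="my_table",
...     schema="public",
...     aws_access_key_id="YOUR_ACCESS_KEY_ID",
...     aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
...     aws_session_token="YOUR_SESSION_TOKEN"
... )
>>> con.close()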