awswrangler.redshift.copy_from_files
- awswrangler.redshift.copy_from_files(path: str, con: redshift_connector.Connection, table: str, schema: str, iam_role: str | None = None, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_session_token: str | None = None, data_format: Literal['parquet', 'orc', 'csv'] = 'parquet', redshift_column_types: dict[str, str] | None = None, parquet_infer_sampling: float = 1.0, mode: Literal['append', 'overwrite', 'upsert'] = 'append', overwrite_method: Literal['drop', 'cascade', 'truncate', 'delete'] = 'drop', diststyle: Literal['AUTO', 'EVEN', 'ALL', 'KEY'] = 'AUTO', distkey: str | None = None, sortstyle: Literal['COMPOUND', 'INTERLEAVED'] = 'COMPOUND', sortkey: list[str] | None = None, primary_keys: list[str] | None = None, varchar_lengths_default: int = 256, varchar_lengths: dict[str, int] | None = None, serialize_to_json: bool = False, path_suffix: str | None = None, path_ignore_suffix: str | list[str] | None = None, use_threads: bool | int = True, lock: bool = False, commit_transaction: bool = True, manifest: bool | None = False, sql_copy_extra_params: list[str] | None = None, boto3_session: Session | None = None, s3_additional_kwargs: dict[str, str] | None = None, precombine_key: str | None = None, column_names: list[str] | None = None, add_new_columns: bool = False) → None
Load files from Amazon S3 to a table on Amazon Redshift (through the COPY command).
https://docs.aws.amazon.com/redshift/latest/dg/r_COPY.html
Note
If the table does not exist yet, it will be automatically created for you using the Parquet/ORC/CSV metadata to infer the column data types. If the data is in CSV format, the Redshift column types need to be specified manually using redshift_column_types.

Note
In case of use_threads=True, the number of threads that will be spawned will be obtained from os.cpu_count().
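For the CSV case described in the first note, a minimal sketch is shown below. It is illustrative only: the S3 prefix, Glue connection name, IAM role ARN, and column names are placeholders, not values taken from this page.

>>> import awswrangler as wr
>>> # Hypothetical CSV load: column types must be supplied because CSV files carry no schema metadata.
>>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con:  # placeholder Glue connection
...     wr.redshift.copy_from_files(
...         path="s3://bucket/my_csv_files/",  # placeholder S3 prefix
...         con=con,
...         table="my_csv_table",
...         schema="public",
...         data_format="csv",
...         redshift_column_types={"col1": "BIGINT", "col2": "VARCHAR(256)"},
...         iam_role="arn:aws:iam::XXX:role/XXX",  # placeholder role ARN
...     )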
- Parameters:
  - path (str) – S3 prefix (e.g. s3://bucket/prefix/).
  - con (redshift_connector.Connection) – Use redshift_connector.connect() to use credentials directly or wr.redshift.connect() to fetch them from the Glue Catalog.
  - table (str) – Table name.
  - schema (str) – Schema name.
  - iam_role (str | None) – AWS IAM role with the related permissions.
  - aws_access_key_id (str | None) – The access key for your AWS account.
  - aws_secret_access_key (str | None) – The secret key for your AWS account.
  - aws_session_token (str | None) – The session key for your AWS account. This is only needed when you are using temporary credentials.
  - data_format (Literal['parquet', 'orc', 'csv']) – Data format to be loaded. Supported values are Parquet, ORC, and CSV. Default is Parquet.
  - redshift_column_types (dict[str, str] | None) – Dictionary with keys as column names and values as Redshift column types. Only used when data_format is CSV. e.g. {'col1': 'BIGINT', 'col2': 'VARCHAR(256)'}
  - parquet_infer_sampling (float) – Random sample ratio of files that will have the metadata inspected. Must be 0.0 < sampling <= 1.0. The higher, the more accurate. The lower, the faster.
  - mode (Literal['append', 'overwrite', 'upsert']) – Append, overwrite or upsert.
  - overwrite_method (Literal['drop', 'cascade', 'truncate', 'delete']) – Drop, cascade, truncate, or delete. Only applicable in overwrite mode.
    - "drop" – DROP ... RESTRICT – drops the table. Fails if there are any views that depend on it.
    - "cascade" – DROP ... CASCADE – drops the table, and all views that depend on it.
    - "truncate" – TRUNCATE ... – truncates the table, but immediately commits the current transaction and starts a new one, hence the overwrite happens in two transactions and is not atomic.
    - "delete" – DELETE FROM ... – deletes all rows from the table. Slow relative to the other methods.
  - diststyle (Literal['AUTO', 'EVEN', 'ALL', 'KEY']) – Redshift distribution style. Must be in ["AUTO", "EVEN", "ALL", "KEY"]. https://docs.aws.amazon.com/redshift/latest/dg/t_Distributing_data.html
  - distkey (str | None) – Specifies a column name or positional number for the distribution key.
  - sortstyle (Literal['COMPOUND', 'INTERLEAVED']) – Sorting can be "COMPOUND" or "INTERLEAVED". https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data.html
  - sortkey (list[str] | None) – List of columns to be sorted.
  - primary_keys (list[str] | None) – Primary keys.
  - varchar_lengths_default (int) – The size that will be set for all VARCHAR columns not specified with varchar_lengths.
  - varchar_lengths (dict[str, int] | None) – Dict of VARCHAR length by column (e.g. {"col1": 10, "col5": 200}).
  - serialize_to_json (bool) – Should awswrangler add the SERIALIZETOJSON parameter to the COPY command? SERIALIZETOJSON is necessary to load nested data. https://docs.aws.amazon.com/redshift/latest/dg/ingest-super.html#copy_json
  - path_suffix (str | None) – Suffix or list of suffixes to be scanned on S3 for the schema extraction (e.g. [".gz.parquet", ".snappy.parquet"]). Only has effect during the table creation. If None, will try to read all files (default).
  - path_ignore_suffix (str | list[str] | None) – Suffix or list of suffixes for S3 keys to be ignored during the schema extraction (e.g. [".csv", "_SUCCESS"]). Only has effect during the table creation. If None, will try to read all files (default).
  - use_threads (bool | int) – True to enable concurrent requests, False to disable multiple threads. If enabled, os.cpu_count() will be used as the maximum number of threads. If an integer is provided, the specified number is used.
  - lock (bool) – True to execute a LOCK command inside the transaction to force serializable isolation.
  - commit_transaction (bool) – Whether to commit the transaction. True by default.
  - manifest (bool | None) – If set to True, the path argument accepts an S3 URI to a manifest file.
  - sql_copy_extra_params (list[str] | None) – Additional COPY parameters to pass to the command. For example: ["STATUPDATE ON"]
  - boto3_session (Session | None) – The default boto3 session will be used if boto3_session is None.
  - s3_additional_kwargs (dict[str, str] | None) – Forwarded to botocore requests. e.g. s3_additional_kwargs={'ServerSideEncryption': 'aws:kms', 'SSEKMSKeyId': 'YOUR_KMS_KEY_ARN'}
  - precombine_key (str | None) – When there is a primary_key match during upsert, this column changes the upsert method: the values of the specified column from source and target are compared and the larger of the two is kept. Will only work when mode = upsert (see the upsert example below).
  - column_names (list[str] | None) – List of column names to map source data fields to the target columns.
  - add_new_columns (bool) – If True, automatically adds new DataFrame columns to the target table.
- Return type:
None
Examples
>>> import awswrangler as wr
>>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con:
...     wr.redshift.copy_from_files(
...         path="s3://bucket/my_parquet_files/",
...         con=con,
...         table="my_table",
...         schema="public",
...         iam_role="arn:aws:iam::XXX:role/XXX"
...     )
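As a further illustration, the sketch below combines mode="upsert" with primary_keys and precombine_key, as referenced in the parameter descriptions above. The table name, the id and last_updated columns, and the role ARN are assumed placeholders; adjust them to your own data.

>>> import awswrangler as wr
>>> # Hypothetical upsert: rows matching on "id" keep the record with the larger "last_updated" value.
>>> with wr.redshift.connect("MY_GLUE_CONNECTION") as con:
...     wr.redshift.copy_from_files(
...         path="s3://bucket/my_parquet_files/",
...         con=con,
...         table="my_table",
...         schema="public",
...         iam_role="arn:aws:iam::XXX:role/XXX",
...         mode="upsert",
...         primary_keys=["id"],
...         precombine_key="last_updated",
...     )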