8 - Redshift - COPY & UNLOAD¶
Amazon Redshift
has two SQL command that help to load and unload large amount of data staging it on Amazon S3
:
1 - COPY
2 - UNLOAD
Let’s take a look and how awswrangler can use it.
[ ]:
# Install the optional modules first
!pip install 'awswrangler[redshift]'
[1]:
import awswrangler as wr
con = wr.redshift.connect("aws-sdk-pandas-redshift")
Enter your bucket name:¶
[2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/stage/"
···········································
Enter your IAM ROLE ARN:¶
[3]:
iam_role = getpass.getpass()
····················································································
Creating a DataFrame from the NOAA’s CSV files¶
[4]:
cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]
df = wr.s3.read_csv(
path="s3://noaa-ghcn-pds/csv/by_year/1897.csv", names=cols, parse_dates=["dt", "obs_time"]
) # ~127MB, ~4MM rows
df
[4]:
id | dt | element | value | m_flag | q_flag | s_flag | obs_time | |
---|---|---|---|---|---|---|---|---|
0 | AG000060590 | 1897-01-01 | TMAX | 170 | NaN | NaN | E | NaN |
1 | AG000060590 | 1897-01-01 | TMIN | -14 | NaN | NaN | E | NaN |
2 | AG000060590 | 1897-01-01 | PRCP | 0 | NaN | NaN | E | NaN |
3 | AGE00135039 | 1897-01-01 | TMAX | 140 | NaN | NaN | E | NaN |
4 | AGE00135039 | 1897-01-01 | TMIN | 40 | NaN | NaN | E | NaN |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3923594 | UZM00038457 | 1897-12-31 | TMIN | -145 | NaN | NaN | r | NaN |
3923595 | UZM00038457 | 1897-12-31 | PRCP | 4 | NaN | NaN | r | NaN |
3923596 | UZM00038457 | 1897-12-31 | TAVG | -95 | NaN | NaN | r | NaN |
3923597 | UZM00038618 | 1897-12-31 | PRCP | 66 | NaN | NaN | r | NaN |
3923598 | UZM00038618 | 1897-12-31 | TAVG | -45 | NaN | NaN | r | NaN |
3923599 rows × 8 columns
Load and Unload with COPY and UNLOAD commands¶
Note: Please use a empty S3 path for the COPY command.
[5]:
%%time
wr.redshift.copy(
df=df,
path=path,
con=con,
schema="public",
table="commands",
mode="overwrite",
iam_role=iam_role,
)
CPU times: user 2.78 s, sys: 293 ms, total: 3.08 s
Wall time: 20.7 s
[6]:
%%time
wr.redshift.unload(
sql="SELECT * FROM public.commands",
con=con,
iam_role=iam_role,
path=path,
keep_files=True,
)
CPU times: user 10 s, sys: 1.14 s, total: 11.2 s
Wall time: 27.5 s
[6]:
id | dt | element | value | m_flag | q_flag | s_flag | obs_time | |
---|---|---|---|---|---|---|---|---|
0 | AG000060590 | 1897-01-01 | TMAX | 170 | <NA> | <NA> | E | <NA> |
1 | AG000060590 | 1897-01-01 | PRCP | 0 | <NA> | <NA> | E | <NA> |
2 | AGE00135039 | 1897-01-01 | TMIN | 40 | <NA> | <NA> | E | <NA> |
3 | AGE00147705 | 1897-01-01 | TMAX | 164 | <NA> | <NA> | E | <NA> |
4 | AGE00147705 | 1897-01-01 | PRCP | 0 | <NA> | <NA> | E | <NA> |
... | ... | ... | ... | ... | ... | ... | ... | ... |
3923594 | USW00094967 | 1897-12-31 | TMAX | -144 | <NA> | <NA> | 6 | <NA> |
3923595 | USW00094967 | 1897-12-31 | PRCP | 0 | P | <NA> | 6 | <NA> |
3923596 | UZM00038457 | 1897-12-31 | TMAX | -49 | <NA> | <NA> | r | <NA> |
3923597 | UZM00038457 | 1897-12-31 | PRCP | 4 | <NA> | <NA> | r | <NA> |
3923598 | UZM00038618 | 1897-12-31 | PRCP | 66 | <NA> | <NA> | r | <NA> |
7847198 rows × 8 columns
[7]:
con.close()