AWS SDK for pandas

8 - Redshift - COPY & UNLOAD

Amazon Redshift has two SQL command that help to load and unload large amount of data staging it on Amazon S3:

1 - COPY

2 - UNLOAD

Let’s take a look and how awswrangler can use it.

[ ]:
# Install the optional modules first
!pip install 'awswrangler[redshift]'
[1]:
import awswrangler as wr

con = wr.redshift.connect("aws-sdk-pandas-redshift")

Enter your bucket name:

[2]:
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/stage/"
 ···········································

Enter your IAM ROLE ARN:

[3]:
iam_role = getpass.getpass()
 ····················································································

Creating a DataFrame from the NOAA’s CSV files

Reference

[4]:
cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/by_year/1897.csv", names=cols, parse_dates=["dt", "obs_time"]
)  # ~127MB, ~4MM rows

df
[4]:
id dt element value m_flag q_flag s_flag obs_time
0 AG000060590 1897-01-01 TMAX 170 NaN NaN E NaN
1 AG000060590 1897-01-01 TMIN -14 NaN NaN E NaN
2 AG000060590 1897-01-01 PRCP 0 NaN NaN E NaN
3 AGE00135039 1897-01-01 TMAX 140 NaN NaN E NaN
4 AGE00135039 1897-01-01 TMIN 40 NaN NaN E NaN
... ... ... ... ... ... ... ... ...
3923594 UZM00038457 1897-12-31 TMIN -145 NaN NaN r NaN
3923595 UZM00038457 1897-12-31 PRCP 4 NaN NaN r NaN
3923596 UZM00038457 1897-12-31 TAVG -95 NaN NaN r NaN
3923597 UZM00038618 1897-12-31 PRCP 66 NaN NaN r NaN
3923598 UZM00038618 1897-12-31 TAVG -45 NaN NaN r NaN

3923599 rows × 8 columns

Load and Unload with COPY and UNLOAD commands

Note: Please use a empty S3 path for the COPY command.

[5]:
%%time

wr.redshift.copy(
    df=df,
    path=path,
    con=con,
    schema="public",
    table="commands",
    mode="overwrite",
    iam_role=iam_role,
)
CPU times: user 2.78 s, sys: 293 ms, total: 3.08 s
Wall time: 20.7 s
[6]:
%%time

wr.redshift.unload(
    sql="SELECT * FROM public.commands",
    con=con,
    iam_role=iam_role,
    path=path,
    keep_files=True,
)
CPU times: user 10 s, sys: 1.14 s, total: 11.2 s
Wall time: 27.5 s
[6]:
id dt element value m_flag q_flag s_flag obs_time
0 AG000060590 1897-01-01 TMAX 170 <NA> <NA> E <NA>
1 AG000060590 1897-01-01 PRCP 0 <NA> <NA> E <NA>
2 AGE00135039 1897-01-01 TMIN 40 <NA> <NA> E <NA>
3 AGE00147705 1897-01-01 TMAX 164 <NA> <NA> E <NA>
4 AGE00147705 1897-01-01 PRCP 0 <NA> <NA> E <NA>
... ... ... ... ... ... ... ... ...
3923594 USW00094967 1897-12-31 TMAX -144 <NA> <NA> 6 <NA>
3923595 USW00094967 1897-12-31 PRCP 0 P <NA> 6 <NA>
3923596 UZM00038457 1897-12-31 TMAX -49 <NA> <NA> r <NA>
3923597 UZM00038457 1897-12-31 PRCP 4 <NA> <NA> r <NA>
3923598 UZM00038618 1897-12-31 PRCP 66 <NA> <NA> r <NA>

7847198 rows × 8 columns

[7]:
con.close()