25 - Redshift - Loading Parquet files with Spectrum
[ ]:
# Install the optional modules first
!pip install 'awswrangler[redshift]'
Enter your bucket name:
[1]:
import getpass
bucket = getpass.getpass()
PATH = f"s3://{bucket}/files/"
Mocking some Parquet Files on S3
[2]:
import pandas as pd
import awswrangler as wr
df = pd.DataFrame(
    {
        "col0": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        "col1": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    }
)
df
[2]:
|    | col0 | col1 |
|----|------|------|
| 0  | 0    | a    |
| 1  | 1    | b    |
| 2  | 2    | c    |
| 3  | 3    | d    |
| 4  | 4    | e    |
| 5  | 5    | f    |
| 6  | 6    | g    |
| 7  | 7    | h    |
| 8  | 8    | i    |
| 9  | 9    | j    |
[3]:
wr.s3.to_parquet(df, PATH, max_rows_by_file=2, dataset=True, mode="overwrite")
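With max_rows_by_file=2, the ten rows above are split across five Parquet files. As a quick, optional check (not part of the original notebook), the written objects can be listed:
[ ]:
# List the Parquet objects written under PATH; five files are expected,
# since each file holds at most two rows.
wr.s3.list_objects(PATH)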
Crawling the metadata and adding it to the Glue Catalog
[4]:
wr.s3.store_parquet_metadata(path=PATH, database="aws_sdk_pandas", table="test", dataset=True, mode="overwrite")
[4]:
({'col0': 'bigint', 'col1': 'string'}, None, None)
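The return value is a tuple of (columns_types, partitions_types, partitions_values); the last two are None here because the dataset is not partitioned. If you want the inferred schema in a variable, the same call can be unpacked, as in this sketch:
[ ]:
# Capture the metadata returned by store_parquet_metadata.
columns_types, partitions_types, partitions_values = wr.s3.store_parquet_metadata(
    path=PATH, database="aws_sdk_pandas", table="test", dataset=True, mode="overwrite"
)
print(columns_types)  # {'col0': 'bigint', 'col1': 'string'}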
Running the CTAS query to load the data into Redshift storage
[5]:
con = wr.redshift.connect(connection="aws-sdk-pandas-redshift")
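The queries below read from an external schema named aws_sdk_pandas_external, which this tutorial assumes is already mapped to the aws_sdk_pandas Glue database via Redshift Spectrum. If it is not, it can be created with a statement along these lines (the IAM role ARN is a placeholder for a role with Glue and S3 access):
[ ]:
# Assumption: the cluster has an attached IAM role that can read the Glue
# Catalog and the S3 data. Replace the ARN with your own role.
create_external_schema = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS aws_sdk_pandas_external
FROM DATA CATALOG
DATABASE 'aws_sdk_pandas'
IAM_ROLE 'arn:aws:iam::111111111111:role/my-spectrum-role'
"""
with con.cursor() as cursor:
    cursor.execute(create_external_schema)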
[6]:
query = "CREATE TABLE public.test AS (SELECT * FROM aws_sdk_pandas_external.test)"
[7]:
with con.cursor() as cursor:
    cursor.execute(query)
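As an optional sanity check (this cell is not in the original notebook), the new table should now hold the ten rows read from the Parquet files through Spectrum:
[ ]:
# Count the rows now stored in Redshift managed storage; 10 are expected.
with con.cursor() as cursor:
    cursor.execute("SELECT COUNT(*) FROM public.test")
    print(cursor.fetchone())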
Running an INSERT INTO query to load MORE data into Redshift storage
[8]:
df = pd.DataFrame(
    {
        "col0": [10, 11],
        "col1": ["k", "l"],
    }
)
wr.s3.to_parquet(df, PATH, dataset=True, mode="overwrite")
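Note that mode="overwrite" replaces the Parquet files under PATH, so S3 now holds only the two new rows; the ten rows loaded earlier already live in Redshift managed storage and are unaffected. Reading the dataset back from S3 makes this visible (an optional check, not in the original notebook):
[ ]:
# Only the two freshly written rows remain in the S3 dataset.
wr.s3.read_parquet(PATH, dataset=True)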
[9]:
query = "INSERT INTO public.test (SELECT * FROM aws_sdk_pandas_external.test)"
[10]:
with con.cursor() as cursor:
    cursor.execute(query)
Checking the result
[11]:
query = "SELECT * FROM public.test"
[13]:
wr.redshift.read_sql_query(sql=query, con=con)
[13]:
|    | col0 | col1 |
|----|------|------|
| 0  | 5    | f    |
| 1  | 1    | b    |
| 2  | 3    | d    |
| 3  | 6    | g    |
| 4  | 8    | i    |
| 5  | 10   | k    |
| 6  | 4    | e    |
| 7  | 0    | a    |
| 8  | 2    | c    |
| 9  | 7    | h    |
| 10 | 9    | j    |
| 11 | 11   | l    |
[14]:
con.close()
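Optionally, the resources created by this tutorial can be cleaned up afterwards. A sketch using the same names as above (the commit is defensive, in case autocommit is disabled on the connection):
[ ]:
# Optional cleanup: drop the Redshift table, the Glue table and the S3 files.
con = wr.redshift.connect(connection="aws-sdk-pandas-redshift")
with con.cursor() as cursor:
    cursor.execute("DROP TABLE IF EXISTS public.test")
con.commit()
con.close()

wr.catalog.delete_table_if_exists(database="aws_sdk_pandas", table="test")
wr.s3.delete_objects(PATH)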