AWS SDK for pandas

4 - Parquet Datasets

awswrangler supports three different write modes for storing Parquet datasets on Amazon S3.

  • append (Default)

    Only adds new files, without deleting any existing ones.

  • overwrite

    Deletes everything in the target directory and then adds the new files. Because the delete happens first, old files are not restored if writing the new ones fails for any reason (a simplified sketch of this behavior follows the list).

  • overwrite_partitions (Partition Upsert)

    Deletes only the paths of the partitions that need to be updated, then writes the new partition files. It works like a “partition upsert”.
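
As a mental model, mode="overwrite" behaves roughly like deleting the prefix first and then appending. The sketch below (using the path and df defined in the cells that follow) is a simplification to illustrate why a failed write cannot restore the old files, not the actual implementation:

wr.s3.delete_objects(path)  # everything under the prefix is gone at this point
wr.s3.to_parquet(df=df, path=path, dataset=True, mode="append")  # only then are the new files written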

[1]:
from datetime import date

import pandas as pd

import awswrangler as wr

Enter your bucket name:

[2]:
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/dataset/"

Creating the Dataset

[3]:
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"], "date": [date(2020, 1, 1), date(2020, 1, 2)]})

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite")

wr.s3.read_parquet(path, dataset=True)
[3]:
   id value        date
0   1   foo  2020-01-01
1   2   boo  2020-01-02
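
Besides writing the files, wr.s3.to_parquet returns a dictionary describing what it wrote (in recent versions it includes a "paths" key), which is handy for logging. A minimal sketch:

res = wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite")
print(res["paths"])  # S3 URIs of the Parquet files that were just written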

Appending

[4]:
df = pd.DataFrame({"id": [3], "value": ["bar"], "date": [date(2020, 1, 3)]})

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="append")

wr.s3.read_parquet(path, dataset=True)
[4]:
   id value        date
0   3   bar  2020-01-03
1   1   foo  2020-01-01
2   2   boo  2020-01-02
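
Note that append never checks for duplicates: running the cell above a second time would store the id=3 row twice. If the dataset must stay unique on a key, one option is to de-duplicate after reading; a sketch using this tutorial's id column:

df_unique = wr.s3.read_parquet(path, dataset=True).drop_duplicates(subset="id")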

Overwriting

[5]:
wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite")

wr.s3.read_parquet(path, dataset=True)
[5]:
   id value        date
0   3   bar  2020-01-03

Creating a Partitioned Dataset

[6]:
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"], "date": [date(2020, 1, 1), date(2020, 1, 2)]})

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite", partition_cols=["date"])

wr.s3.read_parquet(path, dataset=True)
[6]:
   id value        date
0   1   foo  2020-01-01
1   2   boo  2020-01-02
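
With partition_cols, the files land under Hive-style key=value prefixes, one per distinct date. Listing the prefix makes that layout visible (the actual file names are generated UUIDs; the ones below are placeholders):

wr.s3.list_objects(path)
# e.g. ['s3://<bucket>/dataset/date=2020-01-01/<uuid>.snappy.parquet',
#       's3://<bucket>/dataset/date=2020-01-02/<uuid>.snappy.parquet']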

Upserting partitions (overwrite_partitions)

[7]:
df = pd.DataFrame({"id": [2, 3], "value": ["xoo", "bar"], "date": [date(2020, 1, 2), date(2020, 1, 3)]})

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite_partitions", partition_cols=["date"])

wr.s3.read_parquet(path, dataset=True)
[7]:
   id value        date
0   1   foo  2020-01-01
1   2   xoo  2020-01-02
2   3   bar  2020-01-03
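
Because the dataset is partitioned, a single partition can also be read back on its own through the partition_filter argument of wr.s3.read_parquet; the callable receives the partition values as strings. A minimal sketch:

wr.s3.read_parquet(
    path,
    dataset=True,
    partition_filter=lambda x: x["date"] == "2020-01-02",
)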

BONUS - Glue/Athena integration
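
Passing database and table makes to_parquet register the dataset in the AWS Glue Data Catalog, so Athena can query it right away. The Glue database itself must already exist; if it does not, it can be created first, for example:

wr.catalog.create_database(name="aws_sdk_pandas", exist_ok=True)  # exist_ok is available in recent versions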

[8]:
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"], "date": [date(2020, 1, 1), date(2020, 1, 2)]})

wr.s3.to_parquet(df=df, path=path, dataset=True, mode="overwrite", database="aws_sdk_pandas", table="my_table")

wr.athena.read_sql_query("SELECT * FROM my_table", database="aws_sdk_pandas")
[8]:
   id value        date
0   1   foo  2020-01-01
1   2   boo  2020-01-02
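
To avoid leaving test resources behind, the objects and the catalog entry created above can be deleted afterwards; a sketch using the same names as this tutorial:

wr.s3.delete_objects(path)  # remove the Parquet files
wr.catalog.delete_table_if_exists(database="aws_sdk_pandas", table="my_table")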