AWS SDK for pandas

23 - Flexible Partitions Filter (PUSH-DOWN)

  • partition_filter argument:

    - Callback function filter to apply on PARTITION columns (PUSH-DOWN filter).
    - This function MUST receive a single argument (Dict[str, str]) where keys are partition names and values are partition values (see the sketch after this list).
    - This function MUST return a bool: True to read the partition, False to ignore it.
    - Ignored if `dataset=False`.
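
As a minimal sketch of the expected callback signature (the partition column name "value" here is just an illustration):

from typing import Dict

def keep_partition(partitions: Dict[str, str]) -> bool:
    # Receives e.g. {"value": "foo"} for a partition stored under .../value=foo/
    return partitions["value"] == "foo"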
    

P.S. Check the function API doc to see which arguments can be configured through Global Configurations.

[1]:
import awswrangler as wr
import pandas as pd

Enter your bucket name:

[2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/dataset/"
 ············

Creating the Dataset (Parquet)

[3]:
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["foo", "boo", "bar"],
})

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["value"]
)

wr.s3.read_parquet(path, dataset=True)
[3]:
   id value
0   3   bar
1   2   boo
2   1   foo
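
Because partition_cols=["value"] was used, the rows land under Hive-style prefixes such as s3://.../dataset/value=foo/. One way to confirm the layout (a sketch; the actual file names are generated by the library):

for p in wr.s3.list_objects(path):
    print(p)  # e.g. s3://<bucket>/dataset/value=foo/<generated-name>.snappy.parquet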

Parquet Example 1

[4]:
my_filter = lambda x: x["value"].endswith("oo")

wr.s3.read_parquet(path, dataset=True, partition_filter=my_filter)
[4]:
   id value
0   2   boo
1   1   foo
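
Any Python expression over the partition dict works, so for this dataset the same partitions could also be selected with an explicit membership test (a sketch equivalent to the endswith filter above):

my_filter = lambda x: x["value"] in {"foo", "boo"}

wr.s3.read_parquet(path, dataset=True, partition_filter=my_filter)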

Parquet Example 2

[5]:
from Levenshtein import distance


def my_filter(partitions):
    return distance("boo", partitions["value"]) <= 1


wr.s3.read_parquet(path, dataset=True, partition_filter=my_filter)
[5]:
   id value
0   2   boo
1   1   foo
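
Note that Levenshtein comes from a third-party package (e.g. pip install python-Levenshtein). If you prefer to avoid the extra dependency, a similar fuzzy match can be sketched with the standard library's difflib; for this dataset, a 0.5 threshold keeps "boo" and "foo" but drops "bar":

from difflib import SequenceMatcher


def my_filter(partitions):
    # Ratio is 1.0 for "boo", ~0.67 for "foo", ~0.33 for "bar"
    return SequenceMatcher(None, "boo", partitions["value"]).ratio() >= 0.5


wr.s3.read_parquet(path, dataset=True, partition_filter=my_filter)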

Creating the Dataset (CSV)

[6]:
df = pd.DataFrame({
    "id": [1, 2, 3],
    "value": ["foo", "boo", "bar"],
})

wr.s3.to_csv(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    partition_cols=["value"],
    compression="gzip",
    index=False
)

wr.s3.read_csv(path, dataset=True)
[6]:
   id value
0   3   bar
1   2   boo
2   1   foo
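
Keep in mind that partition values always arrive as strings (the callback receives Dict[str, str]), even when the original column was numeric, so comparisons need an explicit cast. A hypothetical filter for a dataset partitioned by a numeric year column (not part of this tutorial's dataset):

my_filter = lambda x: int(x["year"]) >= 2020  # values arrive as "2019", "2020", ...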

CSV Example 1

[7]:
my_filter = lambda x: x["value"].endswith("oo")

wr.s3.read_csv(path, dataset=True, partition_filter=my_filter)
[7]:
   id value
0   2   boo
1   1   foo
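
When a dataset has several partition columns, the same dict carries all of them, so conditions can be combined freely. A hypothetical sketch for a dataset partitioned by country and value (this tutorial's dataset only has value):

def my_filter(partitions):
    # Both keys are present because both columns are partitions
    return partitions["country"] == "BR" and partitions["value"].endswith("oo")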

CSV Example 2

[8]:
from Levenshtein import distance


def my_filter(partitions):
    return distance("boo", partitions["value"]) <= 1


wr.s3.read_csv(path, dataset=True, partition_filter=my_filter)
[8]:
   id value
0   2   boo
1   1   foo