AWS SDK for pandas

42 - Amazon S3 Tables

Amazon S3 Tables provide analytics-optimized tabular storage using the Apache Iceberg format. S3 Tables introduce three resource types: table buckets, which contain namespaces, which in turn group tables.

AWS SDK for pandas supports S3 Tables through the wr.s3 module. Read and write operations require the pyiceberg optional dependency:

pip install awswrangler[pyiceberg]
[ ]:
! pip install awswrangler[pyiceberg]
[ ]:
import getpass

import pandas as pd

import awswrangler as wr
[ ]:
bucket_name = getpass.getpass("Enter a table bucket name:")

Creating resources

Create a Table Bucket

[ ]:
bucket_arn = wr.s3.create_table_bucket(name=bucket_name)
print(f"Table bucket ARN: {bucket_arn}")
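The returned ARN encodes the Region and account that own the bucket. The following is a minimal sketch of splitting it apart, assuming the `arn:aws:s3tables:<region>:<account>:bucket/<name>` layout. `parse_table_bucket_arn` and `TableBucketArn` are hypothetical helpers written for this tutorial, not part of awswrangler, and the example ARN is made up.

```python
from typing import NamedTuple


class TableBucketArn(NamedTuple):
    region: str
    account_id: str
    bucket_name: str


def parse_table_bucket_arn(arn: str) -> TableBucketArn:
    """Split a table bucket ARN into Region, account ID, and bucket name."""
    # Assumed layout: arn:<partition>:s3tables:<region>:<account>:bucket/<name>
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn" or parts[2] != "s3tables":
        raise ValueError(f"Not an S3 Tables ARN: {arn!r}")
    resource_type, _, name = parts[5].partition("/")
    # Reject anything that is not exactly "bucket/<name>" (e.g. a table ARN)
    if resource_type != "bucket" or not name or "/" in name:
        raise ValueError(f"Not a table bucket ARN: {arn!r}")
    return TableBucketArn(region=parts[3], account_id=parts[4], bucket_name=name)


# Placeholder account ID and bucket name, for illustration only
example = parse_table_bucket_arn(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/my-table-bucket"
)
print(example.region, example.bucket_name)
```

The Region extracted this way is the value to substitute for `<region>` in endpoint URLs later in this tutorial.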

Create a Namespace

[ ]:
namespace = "tutorial"

wr.s3.create_namespace(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
)

Write

Writing a DataFrame

wr.s3.to_iceberg creates the table automatically if it does not already exist, so no separate create-table step is needed.

[ ]:
df = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "amount": [10.50, 20.00, 15.75],
        "region": ["us", "eu", "us"],
    }
)

wr.s3.to_iceberg(
    df=df,
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
)

Appending data

[ ]:
df_new = pd.DataFrame(
    {
        "order_id": [4, 5],
        "amount": [30.00, 12.25],
        "region": ["eu", "us"],
    }
)

wr.s3.to_iceberg(
    df=df_new,
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
    mode="append",
)

Overwriting data

[ ]:
df_replace = pd.DataFrame(
    {
        "order_id": [100, 200],
        "amount": [99.99, 49.99],
        "region": ["ap", "ap"],
    }
)

wr.s3.to_iceberg(
    df=df_replace,
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
    mode="overwrite",
)
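To make the effect of the three writes concrete, here is a plain-Python model of the table contents after each call. It assumes that the initial write and `mode="append"` add rows, and that `mode="overwrite"` replaces every existing row. `apply_write` is a hypothetical stand-in, not an awswrangler function, and no AWS call is made.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_write(table: List[Row], rows: List[Row], mode: str = "append") -> List[Row]:
    """Return the table contents after a write (toy model, no I/O)."""
    if mode == "append":
        return table + rows
    if mode == "overwrite":
        return list(rows)
    raise ValueError(f"Unsupported mode: {mode!r}")


orders: List[Row] = []  # the table does not exist yet
orders = apply_write(orders, [{"order_id": i} for i in (1, 2, 3)])
orders = apply_write(orders, [{"order_id": i} for i in (4, 5)], mode="append")
ids_after_append = [row["order_id"] for row in orders]

orders = apply_write(orders, [{"order_id": i} for i in (100, 200)], mode="overwrite")
ids_after_overwrite = [row["order_id"] for row in orders]

print(ids_after_append)     # [1, 2, 3, 4, 5]
print(ids_after_overwrite)  # [100, 200]
```

The model keeps rows in insertion order for readability; a real table scan does not necessarily return rows in the order they were written.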

Read

Read entire table

[ ]:
df = wr.s3.from_iceberg(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
)
df

Column selection and row filtering

[ ]:
df = wr.s3.from_iceberg(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
    columns=["order_id", "amount"],
    row_filter="amount > 50.0",
)
df
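As a sanity check on what this call should return, the sketch below applies the same column selection and predicate to the rows written in the overwrite step, using plain Python instead of a table scan. It assumes the table still holds exactly those two rows.

```python
# Rows written by the overwrite step earlier in this tutorial
rows = [
    {"order_id": 100, "amount": 99.99, "region": "ap"},
    {"order_id": 200, "amount": 49.99, "region": "ap"},
]
columns = ["order_id", "amount"]

# Local equivalent of row_filter="amount > 50.0" plus the column selection
expected = [
    {name: row[name] for name in columns}
    for row in rows
    if row["amount"] > 50.0
]
print(expected)  # [{'order_id': 100, 'amount': 99.99}]
```

Only order 100 passes the filter, because 49.99 is below the 50.0 threshold.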

Limiting rows

[ ]:
df = wr.s3.from_iceberg(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
    limit=1,
)
df

Using the AWS Glue Iceberg REST endpoint

By default, read and write operations use the S3 Tables REST endpoint. To use the AWS Glue Iceberg REST endpoint instead, set wr.config.s3tables_catalog_endpoint_url. This enables integration with services that work through the Glue Data Catalog (e.g. Amazon Athena, Amazon Redshift).

Prerequisites

Before using the Glue endpoint, your table bucket must be integrated with the AWS Glue Data Catalog. This requires:

  1. An IAM role for Lake Formation with s3tables:* permissions and a trust policy allowing lakeformation.amazonaws.com to assume it.

  2. A Lake Formation resource registration for arn:aws:s3tables:<region>:<account>:bucket/* with WithFederation=True and HybridAccessEnabled=True.

  3. A Glue federated catalog named s3tablescatalog linked to S3 Tables via the aws:s3tables connection.

  4. Lake Formation permissions granting the caller access to the catalog, databases, and tables.

For step-by-step instructions, see Integrating S3 Tables with AWS analytics services.
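As an illustration of prerequisite 1, the sketch below builds the two IAM policy documents as Python dictionaries and prints them as JSON. It is a minimal starting point under stated assumptions: the trust policy only grants `sts:AssumeRole`, and the permissions policy uses `Resource: "*"` for brevity. The linked guide may require additional actions or conditions, so treat it as the authoritative source.

```python
import json

# Trust policy: lets the Lake Formation service assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lakeformation.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# Permissions policy: grants s3tables:* as described in the list above.
# Resource "*" is an assumption made for brevity; scope it down in practice.
permissions_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3tables:*",
            "Resource": "*",
        }
    ],
}

trust_policy_json = json.dumps(trust_policy, indent=2)
permissions_policy_json = json.dumps(permissions_policy, indent=2)
print(trust_policy_json)
print(permissions_policy_json)
```

The resulting JSON strings can be pasted into the IAM console or passed to whichever tool you use to create the role.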

[ ]:
# Point read/write at the Glue Iceberg REST endpoint.
# Replace <region> with the AWS Region of your table bucket (for example, us-east-1).
wr.config.s3tables_catalog_endpoint_url = "https://glue.<region>.amazonaws.com/iceberg"

df = wr.s3.from_iceberg(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
)

# Reset to default (S3 Tables endpoint)
wr.config.s3tables_catalog_endpoint_url = None
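Setting and resetting the endpoint by hand is easy to get wrong if the read raises an exception in between. The sketch below shows one way to build the URL from a Region name and to restore the previous setting automatically. `glue_iceberg_endpoint` and `catalog_endpoint` are hypothetical helpers, not part of awswrangler, and the URL template assumes a standard `amazonaws.com` Region. The demonstration runs against a stand-in object so that it works without awswrangler installed.

```python
import contextlib
import re
from types import SimpleNamespace
from typing import Iterator, Optional


def glue_iceberg_endpoint(region: str) -> str:
    """Build the Glue Iceberg REST endpoint URL for a Region such as 'us-east-1'."""
    if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d+", region):
        raise ValueError(f"Unexpected Region name: {region!r}")
    return f"https://glue.{region}.amazonaws.com/iceberg"


@contextlib.contextmanager
def catalog_endpoint(config: object, url: Optional[str]) -> Iterator[None]:
    """Temporarily set config.s3tables_catalog_endpoint_url, then restore it."""
    previous = getattr(config, "s3tables_catalog_endpoint_url", None)
    setattr(config, "s3tables_catalog_endpoint_url", url)
    try:
        yield
    finally:
        # Runs even if the body raises, so the old value always comes back
        setattr(config, "s3tables_catalog_endpoint_url", previous)


# Stand-in for wr.config, used here only so the sketch is runnable offline
fake_config = SimpleNamespace(s3tables_catalog_endpoint_url=None)
with catalog_endpoint(fake_config, glue_iceberg_endpoint("us-east-1")):
    inside = fake_config.s3tables_catalog_endpoint_url
after = fake_config.s3tables_catalog_endpoint_url
print(inside)
```

In a notebook, passing `wr.config` in place of `fake_config` is the intended use, though that combination is an assumption and has not been verified here.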

Deleting resources

Delete resources in the reverse order of creation: a namespace must be empty before it can be deleted, and a table bucket must contain no namespaces or tables.

[ ]:
wr.s3.delete_table(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
    table_name="orders",
)
wr.s3.delete_namespace(
    table_bucket_arn=bucket_arn,
    namespace=namespace,
)
wr.s3.delete_table_bucket(
    table_bucket_arn=bucket_arn,
)