AWS SDK for pandas

5 - Glue Catalog

awswrangler makes heavy use of Glue Catalog to store metadata of tables and connections.

[1]:
import pandas as pd

import awswrangler as wr

Enter your bucket name:

[2]:
import getpass

bucket = getpass.getpass()
path = f"s3://{bucket}/data/"
 ············

Creating a Pandas DataFrame

[3]:
df = pd.DataFrame(
    {"id": [1, 2, 3], "name": ["shoes", "tshirt", "ball"], "price": [50.3, 10.5, 20.0], "in_stock": [True, True, False]}
)
df
[3]:
id name price in_stock
0 1 shoes 50.3 True
1 2 tshirt 10.5 True
2 3 ball 20.0 False

Checking Glue Catalog Databases

[4]:
databases = wr.catalog.databases()
print(databases)
            Database                                   Description
0  aws_sdk_pandas  AWS SDK for pandas Test Arena - Glue Database
1            default                         Default Hive database

Create the database awswrangler_test if not exists

[5]:
if "awswrangler_test" not in databases.values:
    wr.catalog.create_database("awswrangler_test")
    print(wr.catalog.databases())
else:
    print("Database awswrangler_test already exists")
            Database                                   Description
0  aws_sdk_pandas  AWS SDK for pandas Test Arena - Glue Database
1   awswrangler_test
2            default                         Default Hive database

Checking the empty database

[6]:
wr.catalog.tables(database="awswrangler_test")
[6]:
Database Table Description Columns Partitions

Writing DataFrames to Data Lake (S3 + Parquet + Glue Catalog)

[7]:
desc = "This is my product table."

param = {"source": "Product Web Service", "class": "e-commerce"}

comments = {
    "id": "Unique product ID.",
    "name": "Product name",
    "price": "Product price (dollar)",
    "in_stock": "Is this product availaible in the stock?",
}

res = wr.s3.to_parquet(
    df=df,
    path=f"s3://{bucket}/products/",
    dataset=True,
    database="awswrangler_test",
    table="products",
    mode="overwrite",
    glue_table_settings=wr.typing.GlueTableSettings(description=desc, parameters=param, columns_comments=comments),
)

Checking Glue Catalog (AWS Console)

Glue Console

Looking Up for the new table!

[8]:
wr.catalog.tables(name_contains="roduc")
[8]:
Database Table Description Columns Partitions
0 awswrangler_test products This is my product table. id, name, price, in_stock
[9]:
wr.catalog.tables(name_prefix="pro")
[9]:
Database Table Description Columns Partitions
0 awswrangler_test products This is my product table. id, name, price, in_stock
[10]:
wr.catalog.tables(name_suffix="ts")
[10]:
Database Table Description Columns Partitions
0 awswrangler_test products This is my product table. id, name, price, in_stock
[11]:
wr.catalog.tables(search_text="This is my")
[11]:
Database Table Description Columns Partitions
0 awswrangler_test products This is my product table. id, name, price, in_stock

Getting tables details

[12]:
wr.catalog.table(database="awswrangler_test", table="products")
[12]:
Column Name Type Partition Comment
0 id bigint False Unique product ID.
1 name string False Product name
2 price double False Product price (dollar)
3 in_stock boolean False Is this product availaible in the stock?

Cleaning Up the Database

[13]:
for table in wr.catalog.get_tables(database="awswrangler_test"):
    wr.catalog.delete_table_if_exists(database="awswrangler_test", table=table["Name"])

Delete Database

[14]:
wr.catalog.delete_database("awswrangler_test")