5 - Glue Catalog¶
awswrangler makes heavy use of Glue Catalog to store metadata of tables and connections.
[1]:
import pandas as pd
import awswrangler as wr
Enter your bucket name:¶
[2]:
import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"
············
Creating a Pandas DataFrame¶
[3]:
df = pd.DataFrame(
{"id": [1, 2, 3], "name": ["shoes", "tshirt", "ball"], "price": [50.3, 10.5, 20.0], "in_stock": [True, True, False]}
)
df
[3]:
id | name | price | in_stock | |
---|---|---|---|---|
0 | 1 | shoes | 50.3 | True |
1 | 2 | tshirt | 10.5 | True |
2 | 3 | ball | 20.0 | False |
Checking Glue Catalog Databases¶
[4]:
databases = wr.catalog.databases()
print(databases)
Database Description
0 aws_sdk_pandas AWS SDK for pandas Test Arena - Glue Database
1 default Default Hive database
Create the database awswrangler_test if not exists¶
[5]:
if "awswrangler_test" not in databases.values:
wr.catalog.create_database("awswrangler_test")
print(wr.catalog.databases())
else:
print("Database awswrangler_test already exists")
Database Description
0 aws_sdk_pandas AWS SDK for pandas Test Arena - Glue Database
1 awswrangler_test
2 default Default Hive database
Checking the empty database¶
[6]:
wr.catalog.tables(database="awswrangler_test")
[6]:
Database | Table | Description | Columns | Partitions |
---|
Writing DataFrames to Data Lake (S3 + Parquet + Glue Catalog)¶
[7]:
desc = "This is my product table."
param = {"source": "Product Web Service", "class": "e-commerce"}
comments = {
"id": "Unique product ID.",
"name": "Product name",
"price": "Product price (dollar)",
"in_stock": "Is this product availaible in the stock?",
}
res = wr.s3.to_parquet(
df=df,
path=f"s3://{bucket}/products/",
dataset=True,
database="awswrangler_test",
table="products",
mode="overwrite",
glue_table_settings=wr.typing.GlueTableSettings(description=desc, parameters=param, columns_comments=comments),
)
Checking Glue Catalog (AWS Console)¶
Looking Up for the new table!¶
[8]:
wr.catalog.tables(name_contains="roduc")
[8]:
Database | Table | Description | Columns | Partitions | |
---|---|---|---|---|---|
0 | awswrangler_test | products | This is my product table. | id, name, price, in_stock |
[9]:
wr.catalog.tables(name_prefix="pro")
[9]:
Database | Table | Description | Columns | Partitions | |
---|---|---|---|---|---|
0 | awswrangler_test | products | This is my product table. | id, name, price, in_stock |
[10]:
wr.catalog.tables(name_suffix="ts")
[10]:
Database | Table | Description | Columns | Partitions | |
---|---|---|---|---|---|
0 | awswrangler_test | products | This is my product table. | id, name, price, in_stock |
[11]:
wr.catalog.tables(search_text="This is my")
[11]:
Database | Table | Description | Columns | Partitions | |
---|---|---|---|---|---|
0 | awswrangler_test | products | This is my product table. | id, name, price, in_stock |
Getting tables details¶
[12]:
wr.catalog.table(database="awswrangler_test", table="products")
[12]:
Column Name | Type | Partition | Comment | |
---|---|---|---|---|
0 | id | bigint | False | Unique product ID. |
1 | name | string | False | Product name |
2 | price | double | False | Product price (dollar) |
3 | in_stock | boolean | False | Is this product availaible in the stock? |
Cleaning Up the Database¶
[13]:
for table in wr.catalog.get_tables(database="awswrangler_test"):
wr.catalog.delete_table_if_exists(database="awswrangler_test", table=table["Name"])
Delete Database¶
[14]:
wr.catalog.delete_database("awswrangler_test")