An AWS Professional Service open source initiative | aws-proserve-opensource@amazon.com
Quick Start
pip install awswrangler

# Optional modules are installed with:
pip install 'awswrangler[redshift]'
import awswrangler as wr
import pandas as pd
from datetime import datetime
df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})
# Storing data in the data lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)
# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")
# Getting a Redshift connection from the Glue Catalog and retrieving data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()
# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)
# The write succeeded if no records were rejected
assert len(rejected_records) == 0
# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")
Read The Docs
- What is AWS SDK for pandas?
- Install
- At scale
- Tutorials
- 1 - Introduction
- 2 - Sessions
- 3 - Amazon S3
- 4 - Parquet Datasets
- 5 - Glue Catalog
- 6 - Amazon Athena
- 7 - Redshift, MySQL, PostgreSQL, SQL Server and Oracle
- 8 - Redshift - COPY & UNLOAD
- 9 - Redshift - Append, Overwrite and Upsert
- 10 - Parquet Crawler
- 11 - CSV Datasets
- 12 - CSV Crawler
- 13 - Merging Datasets on S3
- 14 - Schema Evolution
- 15 - EMR
- 16 - EMR & Docker
- 17 - Partition Projection
- 18 - QuickSight
- 19 - Amazon Athena Cache
- 20 - Spark Table Interoperability
- 21 - Global Configurations
- 22 - Writing Partitions Concurrently
- 23 - Flexible Partitions Filter (PUSH-DOWN)
- 24 - Athena Query Metadata
- 25 - Redshift - Loading Parquet files with Spectrum
- 26 - Amazon Timestream
- 27 - Amazon Timestream - Example 2
- 28 - Amazon DynamoDB
- 29 - S3 Select
- 30 - Data Api
- 31 - OpenSearch
- 33 - Amazon Neptune
- 34 - Distributing Calls Using Ray
- 35 - Distributing Calls on Ray Remote Cluster
- 36 - Distributing Calls on Glue Interactive sessions
- 37 - Glue Data Quality
- 38 - OpenSearch Serverless
- 39 - Athena Iceberg
- 40 - EMR Serverless
- 41 - Apache Spark on Amazon Athena
- Architectural Decision Records
- 1. Record architecture decisions
- 2. Handling unsupported arguments in distributed mode
- 3. Use TypedDict to group similar parameters
- 4. AWS SDK for pandas does not alter IAM permissions
- 5. Move dependencies to optional
- 6. Deprecate wr.s3.merge_upsert_table
- 7. Design of engine and memory format
- 8. Switching between PyArrow and Pandas based datasources for CSV/JSON I/O
- 9. Engine selection and lazy initialization
- API Reference
- Amazon S3
- AWS Glue Catalog
- Amazon Athena
- Amazon Redshift
- PostgreSQL
- MySQL
- Data API Redshift
- Data API RDS
- AWS Glue Data Quality
- OpenSearch
- Amazon Neptune
- DynamoDB
- Amazon Timestream
- AWS Clean Rooms
- Amazon EMR
- Amazon EMR Serverless
- Amazon CloudWatch Logs
- Amazon QuickSight
- AWS STS
- AWS Secrets Manager
- Amazon Chime
- Typing
- Global Configurations
- Engine and Memory Format
- Distributed - Ray
- Community Resources
- Logging
- Who uses AWS SDK for pandas?
- License
- Contributing