AWS SDK for pandas

36 - Distributing Calls on Glue Interactive sessionsΒΆ

AWS SDK for pandas is pre-loaded into AWS Glue interactive sessions with Ray kernel, making it by far the easiest way to experiment with the library at scale.

In AWS Glue Studio, choose Notebook to create an AWS Glue interactive session:

image1

Then select Ray as the kernel. The IAM role must trust the AWS Glue service principal.

image2

Once the notebook is up and running you can import the library. You can install awswrangler and modin as additional dependencies.

[16]:
%additional_python_modules awswrangler,modin
Additional python modules to be included:
awswrangler
modin
[1]:
import awswrangler as wr
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::463623607974:role/service-role/AmazonSageMakerServiceCatalogProductsGlueRole
Trying to create a Glue session for the kernel.
Worker Type: Z.2X
Number of Workers: 5
Session ID: 32566e82-34d2-4db7-adac-cbee573e20bf
Job Type: glueray
Applying the following default arguments:
--glue_kernel_version 0.38.1
--enable-glue-datacatalog true
--auto-scaling-ray-min-workers 1
--additional-python-modules awswrangler,modin
Waiting for session 32566e82-34d2-4db7-adac-cbee573e20bf to get into ready status...
Session 32566e82-34d2-4db7-adac-cbee573e20bf has been created.
[8]:
df = wr.s3.read_parquet(path="s3://ursa-labs-taxi-data/2017/")
[9]:
df.head()
  vendor_id           pickup_at  ... improvement_surcharge  total_amount
0         1 2017-01-09 11:13:28  ...                   0.3     15.300000
1         1 2017-01-09 11:32:27  ...                   0.3      7.250000
2         1 2017-01-09 11:38:20  ...                   0.3      7.300000
3         1 2017-01-09 11:52:13  ...                   0.3      8.500000
4         2 2017-01-01 00:00:00  ...                   0.3     52.799999

[5 rows x 17 columns]

To avoid incurring a charge, make sure to delete the Jupyter Notebook when you are done experimenting.