AWS SDK for pandas

36 - Distributing Calls on Glue Interactive sessions

AWS SDK for pandas is pre-loaded into AWS Glue interactive sessions with Ray kernel, making it by far the easiest way to experiment with the library at scale.

In AWS Glue Studio, choose Jupyter Notebook to create an AWS Glue interactive session:

../_images/glue_is_create.png

Then select Ray as the kernel. The IAM role must trust the AWS Glue service principal.

../_images/glue_is_setup.png

Once the notebook is up and running you can import the library. Since we are running on AWS Glue with Ray, AWS SDK for pandas will automatically use the existing Ray cluster with no extra configuration needed.

Install the library

[ ]:
!pip install "awswrangler[modin]"
[1]:
import awswrangler as wr
Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::977422593089:role/AWSGlueMantaTests
Trying to create a Glue session for the kernel.
Worker Type: Z.2X
Number of Workers: 5
Session ID: 309824f0-bad7-49d0-a2b4-e1b8c7368c5f
Job Type: glueray
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
Waiting for session 309824f0-bad7-49d0-a2b4-e1b8c7368c5f to get into ready status...
Session 309824f0-bad7-49d0-a2b4-e1b8c7368c5f has been created.
2022-11-21 16:24:03,136 INFO worker.py:1329 -- Connecting to existing Ray cluster at address: 2600:1f10:4674:6822:5b63:3324:984:3152:6379...
2022-11-21 16:24:03,144 INFO worker.py:1511 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
[3]:
df = wr.s3.read_csv(path="s3://nyc-tlc/csv_backup/yellow_tripdata_2021-0*.csv")
Read progress: 100%|##########| 9/9 [00:10<00:00,  1.15s/it]
UserWarning: When using a pre-initialized Ray cluster, please ensure that the runtime env sets environment variable __MODIN_AUTOIMPORT_PANDAS__ to 1
[4]:
df.head()
   VendorID tpep_pickup_datetime  ... total_amount  congestion_surcharge
0       1.0  2021-01-01 00:30:10  ...        11.80                   2.5
1       1.0  2021-01-01 00:51:20  ...         4.30                   0.0
2       1.0  2021-01-01 00:43:30  ...        51.95                   0.0
3       1.0  2021-01-01 00:15:48  ...        36.35                   0.0
4       2.0  2021-01-01 00:31:49  ...        24.36                   2.5

[5 rows x 18 columns]

To avoid incurring a charge, make sure to delete the Jupyter Notebook when you are done experimenting.