API Reference

Amazon S3

copy_objects(paths, source_path, target_path)

Copy a list of S3 objects to another S3 directory.

delete_objects(path[, use_threads, ...])

Delete Amazon S3 objects from a received S3 prefix or list of S3 object paths.

describe_objects(path[, version_id, ...])

Describe Amazon S3 objects from a received S3 prefix or list of S3 object paths.

does_object_exist(path[, ...])

Check if an object exists on Amazon S3.

download(path, local_file[, version_id, ...])

Download a file from a received S3 path to a local file.

get_bucket_region(bucket[, boto3_session])

Get bucket region name.

list_buckets([boto3_session])

List Amazon S3 buckets.

list_directories(path[, chunked, ...])

List Amazon S3 objects from a prefix.

list_objects(path[, suffix, ignore_suffix, ...])

List Amazon S3 objects from a prefix.

merge_datasets(source_path, target_path[, ...])

Merge a source dataset into a target dataset.

read_csv(path[, path_suffix, ...])

Read CSV file(s) from a received S3 prefix or list of S3 object paths.

read_excel(path[, version_id, use_threads, ...])

Read Excel file(s) from a received S3 path.

read_fwf(path[, path_suffix, ...])

Read fixed-width formatted file(s) from a received S3 prefix or list of S3 object paths.

read_json(path[, path_suffix, ...])

Read JSON file(s) from a received S3 prefix or list of S3 object paths.

read_parquet(path[, path_root, dataset, ...])

Read Parquet file(s) from an S3 prefix or list of S3 object paths.

read_parquet_metadata(path[, dataset, ...])

Read Apache Parquet file(s) metadata from an S3 prefix or list of S3 object paths.

read_parquet_table(table, database[, ...])

Read Apache Parquet table registered in the AWS Glue Catalog.

read_orc(path[, path_root, dataset, ...])

Read ORC file(s) from an S3 prefix or list of S3 object paths.

read_orc_metadata(path[, dataset, ...])

Read Apache ORC file(s) metadata from an S3 prefix or list of S3 object paths.

read_orc_table(table, database[, ...])

Read Apache ORC table registered in the AWS Glue Catalog.

read_deltalake(path[, version, partitions, ...])

Load Delta Lake table data from an S3 path.

select_query(sql, path, input_serialization, ...)

Filter the contents of Amazon S3 objects based on a SQL statement.

size_objects(path[, version_id, ...])

Get the size (ContentLength) in bytes of Amazon S3 objects from a received S3 prefix or list of S3 object paths.

store_parquet_metadata(path, database, table)

Infer and store Parquet metadata in the AWS Glue Catalog.

to_csv(df[, path, sep, index, columns, ...])

Write CSV file or dataset on Amazon S3.

to_excel(df, path[, boto3_session, ...])

Write an Excel file on Amazon S3.

to_json(df[, path, index, columns, ...])

Write JSON file on Amazon S3.

to_parquet(df[, path, index, compression, ...])

Write Parquet file or dataset on Amazon S3.

to_orc(df[, path, index, compression, ...])

Write ORC file or dataset on Amazon S3.

to_deltalake(df, path[, index, mode, dtype, ...])

Write a DataFrame to S3 as a Delta Lake table.

upload(local_file, path[, use_threads, ...])

Upload a local file to a received S3 path.

wait_objects_exist(paths[, delay, ...])

Wait until Amazon S3 objects exist.

wait_objects_not_exist(paths[, delay, ...])

Wait until Amazon S3 objects no longer exist.
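
A minimal usage sketch for the S3 module; the bucket, prefix, and column names below are placeholders:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

    # Write a Parquet dataset, read it back, and list the created objects.
    wr.s3.to_parquet(df=df, path="s3://my-bucket/my-prefix/", dataset=True, mode="overwrite")
    df2 = wr.s3.read_parquet(path="s3://my-bucket/my-prefix/", dataset=True)
    objects = wr.s3.list_objects("s3://my-bucket/my-prefix/")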

AWS Glue Catalog

add_column(database, table, column_name[, ...])

Add a column to an AWS Glue Catalog table.

add_csv_partitions(database, table, ...[, ...])

Add partitions (metadata) to a CSV Table in the AWS Glue Catalog.

add_parquet_partitions(database, table, ...)

Add partitions (metadata) to a Parquet Table in the AWS Glue Catalog.

create_csv_table(database, table, path, ...)

Create a CSV Table (Metadata Only) in the AWS Glue Catalog.

create_database(name[, description, ...])

Create a database in the AWS Glue Catalog.

create_json_table(database, table, path, ...)

Create a JSON Table (Metadata Only) in the AWS Glue Catalog.

create_parquet_table(database, table, path, ...)

Create a Parquet Table (Metadata Only) in the AWS Glue Catalog.

databases([limit, catalog_id, boto3_session])

Get a Pandas DataFrame with all listed databases.

delete_column(database, table, column_name)

Delete a column from an AWS Glue Catalog table.

delete_database(name[, catalog_id, ...])

Delete a database in the AWS Glue Catalog.

delete_partitions(table, database, ...[, ...])

Delete the specified partitions from an AWS Glue Catalog table.

delete_all_partitions(table, database[, ...])

Delete all partitions from an AWS Glue Catalog table.

delete_table_if_exists(database, table[, ...])

Delete a Glue table if it exists.

does_table_exist(database, table[, ...])

Check if the table exists.

drop_duplicated_columns(df)

Drop all repeated columns (duplicated names).

extract_athena_types(df[, index, ...])

Extract column and partition types (Amazon Athena) from a Pandas DataFrame.

get_columns_comments(database, table[, ...])

Get all column comments.

get_columns_parameters(database, table[, ...])

Get all column parameters.

get_csv_partitions(database, table[, ...])

Get all partitions from a Table in the AWS Glue Catalog.

get_databases([catalog_id, boto3_session])

Get an iterator of databases.

get_parquet_partitions(database, table[, ...])

Get all partitions from a Table in the AWS Glue Catalog.

get_partitions(database, table[, ...])

Get all partitions from a Table in the AWS Glue Catalog.

get_table_description(database, table[, ...])

Get table description.

get_table_location(database, table[, ...])

Get a table's location from the Glue Catalog.

get_table_number_of_versions(database, table)

Get total number of versions.

get_table_parameters(database, table[, ...])

Get all parameters.

get_table_types(database, table[, ...])

Get all columns and types from a table.

get_table_versions(database, table[, ...])

Get all versions.

get_tables([catalog_id, database, ...])

Get an iterator of tables.

overwrite_table_parameters(parameters, ...)

Overwrite all existing parameters.

sanitize_column_name(column)

Convert the column name to be compatible with Amazon Athena and the AWS Glue Catalog.

sanitize_dataframe_columns_names(df[, ...])

Normalize all column names to be compatible with Amazon Athena.

sanitize_table_name(table)

Convert the table name to be compatible with Amazon Athena and the AWS Glue Catalog.

search_tables(text[, catalog_id, boto3_session])

Get a Pandas DataFrame of tables filtered by a search string.

table(database, table[, catalog_id, ...])

Get table details as a Pandas DataFrame.

tables([limit, catalog_id, database, ...])

Get a DataFrame of tables filtered by a search term, prefix, or suffix.

upsert_table_parameters(parameters, ...[, ...])

Insert or update the received parameters.
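
A minimal sketch of registering existing S3 data in the Glue Catalog and browsing it; the database, table, path, and column types are placeholders:

    import awswrangler as wr

    # Create the (metadata-only) database and table entries.
    wr.catalog.create_database(name="my_database")
    wr.catalog.create_parquet_table(
        database="my_database",
        table="my_table",
        path="s3://my-bucket/my-prefix/",
        columns_types={"id": "bigint", "value": "string"},
    )

    # Browse the catalog as Pandas DataFrames.
    dbs = wr.catalog.databases()
    tbls = wr.catalog.tables(database="my_database")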

Amazon Athena

create_athena_bucket([boto3_session])

Create the default Athena bucket if it doesn't exist.

create_spark_session(workgroup[, ...])

Create a Spark session and wait until it is ready to accept calculations.

create_ctas_table(sql[, database, ...])

Create a new table populated with the results of a SELECT query.

generate_create_query(table[, database, ...])

Generate the query that created a table (EXTERNAL_TABLE) or a view (VIRTUAL_TABLE).

get_query_columns_types(query_execution_id)

Get the data type of all columns queried.

get_query_execution(query_execution_id[, ...])

Fetch query execution details.

get_query_executions(query_execution_ids[, ...])

From specified query execution IDs, return a DataFrame of query execution details.

get_query_results(query_execution_id[, ...])

Get AWS Athena SQL query results as a Pandas DataFrame.

get_named_query_statement(named_query_id[, ...])

Get the named query statement string from a query ID.

get_work_group(workgroup[, boto3_session])

Return information about the workgroup with the specified name.

list_query_executions([workgroup, ...])

Fetch the list of query execution IDs run in the specified workgroup, or in the primary workgroup if none is specified.

read_sql_query(sql, database[, ...])

Execute any SQL query on AWS Athena and return the results as a Pandas DataFrame.

read_sql_table(table, database[, ...])

Extract a full table from AWS Athena and return the results as a Pandas DataFrame.

repair_table(table[, database, data_source, ...])

Run Hive's metastore consistency check: 'MSCK REPAIR TABLE table;'.

run_spark_calculation(code, workgroup[, ...])

Execute a Spark calculation and wait for completion.

show_create_table(table[, database, ...])

Generate the query that created a table: 'SHOW CREATE TABLE table;'.

start_query_execution(sql[, database, ...])

Start a SQL query against AWS Athena.

stop_query_execution(query_execution_id[, ...])

Stop a query execution.

to_iceberg(df, database, table[, temp_path, ...])

Insert into an Athena Iceberg table using INSERT INTO.

delete_from_iceberg_table(df, database, ...)

Delete rows from an Iceberg table.

unload(sql, path, database[, file_format, ...])

Write query results from a SELECT statement to the specified data format using UNLOAD.

wait_query(query_execution_id[, ...])

Wait for the query to end.

create_prepared_statement(sql, statement_name)

Create a SQL statement with the name statement_name to be run at a later time.

list_prepared_statements([workgroup, ...])

List the prepared statements in the specified workgroup.

delete_prepared_statement(statement_name[, ...])

Delete the prepared statement with the specified name from the specified workgroup.
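
A minimal sketch of querying Athena; the database and SQL are placeholders:

    import awswrangler as wr

    # Run a query and get the result as a Pandas DataFrame (CTAS approach by default).
    df = wr.athena.read_sql_query(sql="SELECT * FROM my_table LIMIT 10", database="my_database")

    # Or manage the execution manually.
    query_execution_id = wr.athena.start_query_execution(
        sql="MSCK REPAIR TABLE my_table",
        database="my_database",
    )
    wr.athena.wait_query(query_execution_id=query_execution_id)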

Amazon Redshift

connect([connection, secret_id, catalog_id, ...])

Return a redshift_connector connection from a Glue Catalog connection or AWS Secrets Manager.

connect_temp(cluster_identifier, user[, ...])

Return a redshift_connector temporary connection (no password required).

copy(df, path, con, table, schema[, ...])

Load a Pandas DataFrame as a table on Amazon Redshift, using Parquet files on S3 as a staging area.

copy_from_files(path, con, table, schema[, ...])

Load files from S3 into a table on Amazon Redshift (through the COPY command).

read_sql_query(sql, con[, index_col, ...])

Return a DataFrame corresponding to the result set of the query string.

read_sql_table(table, con[, schema, ...])

Return a DataFrame corresponding to the table.

to_sql(df, con, table, schema[, mode, ...])

Write records stored in a DataFrame into Redshift.

unload(sql, path, con[, iam_role, ...])

Load a Pandas DataFrame from an Amazon Redshift query result, using Parquet files on S3 as a staging area.

unload_to_files(sql, path, con[, iam_role, ...])

Unload Parquet files to S3 from a Redshift query result (through the UNLOAD command).
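
A minimal sketch of loading and unloading data through S3; the Glue Catalog connection name, bucket, schema, and table are placeholders:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

    con = wr.redshift.connect("my-redshift-connection")

    # Stage the DataFrame as Parquet on S3 and COPY it into Redshift.
    wr.redshift.copy(df=df, path="s3://my-bucket/stage/", con=con, table="my_table", schema="public")

    # UNLOAD a query result back into a DataFrame.
    df2 = wr.redshift.unload(sql="SELECT * FROM public.my_table", path="s3://my-bucket/unload/", con=con)
    con.close()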

PostgreSQL

connect([connection, secret_id, catalog_id, ...])

Return a pg8000 connection from a Glue Catalog Connection.

read_sql_query(sql, con[, index_col, ...])

Return a DataFrame corresponding to the result set of the query string.

read_sql_table(table, con[, schema, ...])

Return a DataFrame corresponding to the table.

to_sql(df, con, table, schema[, mode, ...])

Write records stored in a DataFrame into PostgreSQL.

MySQL

connect([connection, secret_id, catalog_id, ...])

Return a pymysql connection from a Glue Catalog Connection or Secrets Manager.

read_sql_query(sql, con[, index_col, ...])

Return a DataFrame corresponding to the result set of the query string.

read_sql_table(table, con[, schema, ...])

Return a DataFrame corresponding to the table.

to_sql(df, con, table, schema[, mode, ...])

Write records stored in a DataFrame into MySQL.

Microsoft SQL Server

connect([connection, secret_id, catalog_id, ...])

Return a pyodbc connection from a Glue Catalog Connection.

read_sql_query(sql, con[, index_col, ...])

Return a DataFrame corresponding to the result set of the query string.

read_sql_table(table, con[, schema, ...])

Return a DataFrame corresponding to the table.

to_sql(df, con, table, schema[, mode, ...])

Write records stored in a DataFrame into Microsoft SQL Server.

Oracle

connect([connection, secret_id, catalog_id, ...])

Return an oracledb connection from a Glue Catalog Connection.

read_sql_query(sql, con[, index_col, ...])

Return a DataFrame corresponding to the result set of the query string.

read_sql_table(table, con[, schema, ...])

Return a DataFrame corresponding to the table.

to_sql(df, con, table, schema[, mode, ...])

Write records stored in a DataFrame into Oracle Database.
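
The PostgreSQL, MySQL, Microsoft SQL Server, and Oracle modules share the same connect/read/write pattern; a minimal sketch using PostgreSQL, where the Glue Catalog connection name, schema, and table are placeholders (swap wr.postgresql for wr.mysql, wr.sqlserver, or wr.oracle as needed):

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

    con = wr.postgresql.connect("my-postgresql-connection")
    wr.postgresql.to_sql(df=df, con=con, table="my_table", schema="public", mode="overwrite")
    df2 = wr.postgresql.read_sql_query("SELECT * FROM public.my_table", con=con)
    con.close()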

Data API Redshift

RedshiftDataApi([cluster_id, database, ...])

Provides access to a Redshift cluster via the Data API.

connect([cluster_id, database, ...])

Create a Redshift Data API connection.

read_sql_query(sql, con[, database])

Run an SQL query on a RedshiftDataApi connection and return the result as a DataFrame.

Data API RDS

RdsDataApi(resource_arn, database[, ...])

Provides access to the RDS Data API.

connect(resource_arn, database[, ...])

Create an RDS Data API connection.

read_sql_query(sql, con[, database])

Run an SQL query on an RdsDataApi connection and return the result as a DataFrame.

to_sql(df, con, table, database[, mode, ...])

Insert data using an SQL query on a Data API connection.
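
A minimal sketch of the Data API connections; the cluster identifier, database, user, and ARNs are placeholders, and the db_user and secret_arn keyword arguments are assumed here and may differ by version:

    import awswrangler as wr

    # Redshift Data API.
    con_redshift = wr.data_api.redshift.connect(cluster_id="my-cluster", database="dev", db_user="my_user")
    df = wr.data_api.redshift.read_sql_query("SELECT 1 AS col", con=con_redshift)

    # RDS Data API (e.g. Aurora Serverless).
    con_rds = wr.data_api.rds.connect(
        resource_arn="arn:aws:rds:us-east-1:111111111111:cluster:my-cluster",
        database="mydb",
        secret_arn="arn:aws:secretsmanager:us-east-1:111111111111:secret:my-secret",
    )
    df2 = wr.data_api.rds.read_sql_query("SELECT 1 AS col", con=con_rds)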

AWS Glue Data Quality

create_recommendation_ruleset(database, ...)

Create a recommendation Data Quality ruleset.

create_ruleset(name, database, table[, ...])

Create a Data Quality ruleset.

evaluate_ruleset(name, iam_role_arn[, ...])

Evaluate a Data Quality ruleset.

get_ruleset(name[, boto3_session])

Get a Data Quality ruleset.

update_ruleset(name[, mode, df_rules, ...])

Update a Data Quality ruleset.
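
A minimal sketch of creating and evaluating a ruleset; the ruleset name, database, table, DQDL rule, and IAM role ARN are placeholders:

    import awswrangler as wr

    wr.data_quality.create_ruleset(
        name="my-ruleset",
        database="my_database",
        table="my_table",
        dqdl_rules="Rules = [ RowCount > 0 ]",
    )
    results = wr.data_quality.evaluate_ruleset(
        name="my-ruleset",
        iam_role_arn="arn:aws:iam::111111111111:role/my-glue-dq-role",
    )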

OpenSearch

connect(host[, port, boto3_session, region, ...])

Create a secure connection to the specified Amazon OpenSearch domain.

create_collection(name[, collection_type, ...])

Create an Amazon OpenSearch Serverless collection.

create_index(client, index[, doc_type, ...])

Create an index.

delete_index(client, index)

Delete an index.

index_csv(client, path, index[, doc_type, ...])

Index all documents from a CSV file into an OpenSearch index.

index_documents(client, documents, index[, ...])

Index all documents into an OpenSearch index.

index_df(client, df, index[, doc_type, ...])

Index all documents from a DataFrame into an OpenSearch index.

index_json(client, path, index[, doc_type, ...])

Index all documents from a JSON file into an OpenSearch index.

search(client[, index, search_body, ...])

Return results matching a query DSL as a pandas DataFrame.

search_by_sql(client, sql_query, **kwargs)

Return results matching a SQL query as a pandas DataFrame.
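
A minimal sketch of indexing and searching documents; the domain endpoint and index name are placeholders, and the id_keys argument is an assumption about the indexing options:

    import awswrangler as wr
    import pandas as pd

    client = wr.opensearch.connect(host="my-domain.us-east-1.es.amazonaws.com")

    df = pd.DataFrame({"id": [1, 2], "title": ["foo", "boo"]})
    wr.opensearch.index_df(client, df=df, index="my-index", id_keys=["id"])

    hits = wr.opensearch.search(
        client,
        index="my-index",
        search_body={"query": {"match": {"title": "foo"}}},
    )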

Amazon Neptune

connect(host, port[, iam_enabled])

Create a connection to a Neptune cluster.

execute_gremlin(client, query)

Return results of a Gremlin traversal as a pandas DataFrame.

execute_opencypher(client, query)

Return results of an openCypher traversal as a pandas DataFrame.

execute_sparql(client, query)

Return results of a SPARQL query as a pandas DataFrame.

flatten_nested_df(df[, include_prefix, ...])

Flatten the lists and dictionaries of the input data frame.

to_property_graph(client, df[, batch_size, ...])

Write records stored in a DataFrame into Amazon Neptune.

to_rdf_graph(client, df[, batch_size, ...])

Write records stored in a DataFrame into Amazon Neptune.

bulk_load(client, df, path, iam_role[, ...])

Write records into Amazon Neptune using the Neptune Bulk Loader.

bulk_load_from_files(client, path, iam_role)

Load files from S3 into Amazon Neptune using the Neptune Bulk Loader.
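
A minimal sketch of querying a Neptune cluster; the endpoint and queries are placeholders:

    import awswrangler as wr

    client = wr.neptune.connect(host="my-cluster.cluster-xxxxxxxx.us-east-1.neptune.amazonaws.com", port=8182)

    # Gremlin and openCypher results come back as pandas DataFrames.
    df = wr.neptune.execute_gremlin(client, "g.V().limit(5)")
    df2 = wr.neptune.execute_opencypher(client, "MATCH (n) RETURN n LIMIT 5")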

DynamoDB

delete_items(items, table_name[, boto3_session])

Delete all items in the specified DynamoDB table.

execute_statement(statement[, parameters, ...])

Run a PartiQL statement against a DynamoDB table.

get_table(table_name[, boto3_session])

Get a DynamoDB table object for the specified table name.

put_csv(path, table_name[, boto3_session, ...])

Write all items from a CSV file to a DynamoDB table.

put_df(df, table_name[, boto3_session, ...])

Write all items from a DataFrame to a DynamoDB table.

put_items(items, table_name[, ...])

Insert all items into the specified DynamoDB table.

put_json(path, table_name[, boto3_session, ...])

Write all items from a JSON file to a DynamoDB table.

read_items(table_name[, index_name, ...])

Read items from a given DynamoDB table.

read_partiql_query(query[, parameters, ...])

Read data from a DynamoDB table via a PartiQL query.
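
A minimal sketch of writing and reading items; the table name (the table must already exist) and its key schema are placeholders:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({"pk": ["a", "b"], "value": [1, 2]})
    wr.dynamodb.put_df(df=df, table_name="my-table")

    # Full scan here for brevity; key conditions and filters are also supported.
    df2 = wr.dynamodb.read_items(table_name="my-table", allow_full_scan=True)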

Amazon Timestream

batch_load(df, path, database, table, ...[, ...])

Batch load a Pandas DataFrame into an Amazon Timestream table.

batch_load_from_files(path, database, table, ...)

Batch load files from S3 into an Amazon Timestream table.

create_database(database[, kms_key_id, ...])

Create a new Timestream database.

create_table(database, table, ...[, tags, ...])

Create a new Timestream table.

delete_database(database[, boto3_session])

Delete a given Timestream database.

delete_table(database, table[, boto3_session])

Delete a given Timestream table.

list_databases([boto3_session])

List all databases in Timestream.

list_tables([database, boto3_session])

List tables in Timestream.

query(sql[, chunked, pagination_config, ...])

Run a query and retrieve the result as a Pandas DataFrame.

wait_batch_load_task(task_id[, ...])

Wait for the Timestream batch load task to complete.

write(df, database, table[, time_col, ...])

Store a Pandas DataFrame into an Amazon Timestream table.

unload_to_files(sql, path[, unload_format, ...])

Unload query results to Amazon S3.

unload(sql, path[, unload_format, ...])

Unload query results to Amazon S3 and read the results as a Pandas DataFrame.
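
A minimal sketch of writing and querying time series data; the database, table, and column names are placeholders, and the table must already exist:

    import awswrangler as wr
    import pandas as pd
    from datetime import datetime

    df = pd.DataFrame({"time": [datetime.now()], "sensor": ["sensor-1"], "measure": [1.0]})

    rejected = wr.timestream.write(
        df=df,
        database="my_database",
        table="my_table",
        time_col="time",
        measure_col="measure",
        dimensions_cols=["sensor"],
    )
    df2 = wr.timestream.query('SELECT * FROM "my_database"."my_table" LIMIT 10')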

AWS Clean Rooms

read_sql_query([sql, analysis_template_arn, ...])

Execute a Clean Rooms protected SQL query and return the results as a Pandas DataFrame.

wait_query(membership_id, query_id[, ...])

Wait for the Clean Rooms protected query to end.

Amazon EMR

build_spark_step(path[, args, deploy_mode, ...])

Build the Step structure (dictionary).

build_step(command[, name, ...])

Build the Step structure (dictionary).

create_cluster(subnet_id[, cluster_name, ...])

Create an EMR cluster with an instance fleets configuration.

get_cluster_state(cluster_id[, boto3_session])

Get the EMR cluster state.

get_step_state(cluster_id, step_id[, ...])

Get the EMR step state.

submit_ecr_credentials_refresh(cluster_id, path)

Update internal ECR credentials.

submit_spark_step(cluster_id, path[, args, ...])

Submit a Spark step.

submit_step(cluster_id, command[, name, ...])

Submit a new job to the EMR cluster.

submit_steps(cluster_id, steps[, boto3_session])

Submit a list of steps.

terminate_cluster(cluster_id[, boto3_session])

Terminate an EMR cluster.
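
A minimal sketch of creating a cluster and submitting a step; the subnet ID and the S3 path of the PySpark script are placeholders:

    import awswrangler as wr

    cluster_id = wr.emr.create_cluster(subnet_id="subnet-0123456789abcdef0")
    step_id = wr.emr.submit_spark_step(cluster_id=cluster_id, path="s3://my-bucket/scripts/my_job.py")
    state = wr.emr.get_step_state(cluster_id=cluster_id, step_id=step_id)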

Amazon EMR Serverless

create_application(name, release_label[, ...])

Create an EMR Serverless application.

run_job(application_id, execution_role_arn, ...)

Run an EMR Serverless job.

wait_job(application_id, job_run_id[, ...])

Wait for the EMR Serverless job to finish.
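
A minimal sketch of creating an application; the name and release label are placeholders, and the returned application ID is what run_job and wait_job expect:

    import awswrangler as wr

    application_id = wr.emr_serverless.create_application(name="my-app", release_label="emr-6.10.0")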

Amazon CloudWatch Logs

read_logs(query, log_group_names[, ...])

Run a query against AWS CloudWatch Logs Insights and convert the results to a Pandas DataFrame.

run_query(query, log_group_names[, ...])

Run a query against AWS CloudWatch Logs Insights and wait for the results.

start_query(query, log_group_names[, ...])

Start a query against AWS CloudWatch Logs Insights.

wait_query(query_id[, boto3_session, ...])

Wait until the query ends.

describe_log_streams(log_group_name[, ...])

List the log streams for the specified log group, return results as a Pandas DataFrame.

filter_log_events(log_group_name[, ...])

List log events from the specified log group.
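
A minimal sketch of a Logs Insights query; the log group name is a placeholder:

    import awswrangler as wr

    df = wr.cloudwatch.read_logs(
        query="fields @timestamp, @message | sort @timestamp desc | limit 20",
        log_group_names=["/aws/lambda/my-function"],
    )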

Amazon QuickSight

cancel_ingestion(ingestion_id[, ...])

Cancel an ongoing ingestion of data into SPICE.

create_athena_data_source(name[, workgroup, ...])

Create a QuickSight data source pointing to an Athena/Workgroup.

create_athena_dataset(name[, database, ...])

Create a QuickSight dataset.

create_ingestion([dataset_name, dataset_id, ...])

Create and start a new SPICE ingestion on a dataset.

delete_all_dashboards([account_id, ...])

Delete all dashboards.

delete_all_data_sources([account_id, ...])

Delete all data sources.

delete_all_datasets([account_id, ...])

Delete all datasets.

delete_all_templates([account_id, ...])

Delete all templates.

delete_dashboard([name, dashboard_id, ...])

Delete a dashboard.

delete_data_source([name, data_source_id, ...])

Delete a data source.

delete_dataset([name, dataset_id, ...])

Delete a dataset.

delete_template([name, template_id, ...])

Delete a template.

describe_dashboard([name, dashboard_id, ...])

Describe a QuickSight dashboard by name or ID.

describe_data_source([name, data_source_id, ...])

Describe a QuickSight data source by name or ID.

describe_data_source_permissions([name, ...])

Describe a QuickSight data source permissions by name or ID.

describe_dataset([name, dataset_id, ...])

Describe a QuickSight dataset by name or ID.

describe_ingestion(ingestion_id[, ...])

Describe a QuickSight ingestion by ID.

get_dashboard_id(name[, account_id, ...])

Get the QuickSight dashboard ID for a given name; fails if more than one ID is associated with the name.

get_dashboard_ids(name[, account_id, ...])

Get QuickSight dashboard IDs given a name.

get_data_source_arn(name[, account_id, ...])

Get the QuickSight data source ARN for a given name; fails if more than one ARN is associated with the name.

get_data_source_arns(name[, account_id, ...])

Get QuickSight Data source ARNs given a name.

get_data_source_id(name[, account_id, ...])

Get the QuickSight data source ID for a given name; fails if more than one ID is associated with the name.

get_data_source_ids(name[, account_id, ...])

Get QuickSight data source IDs given a name.

get_dataset_id(name[, account_id, boto3_session])

Get the QuickSight dataset ID for a given name; fails if more than one ID is associated with the name.

get_dataset_ids(name[, account_id, ...])

Get QuickSight dataset IDs given a name.

get_template_id(name[, account_id, ...])

Get the QuickSight template ID for a given name; fails if more than one ID is associated with the name.

get_template_ids(name[, account_id, ...])

Get QuickSight template IDs given a name.

list_dashboards([account_id, boto3_session])

List dashboards in an AWS account.

list_data_sources([account_id, boto3_session])

List all QuickSight data source summaries.

list_datasets([account_id, boto3_session])

List all QuickSight dataset summaries.

list_groups([namespace, account_id, ...])

List all QuickSight Groups.

list_group_memberships(group_name[, ...])

List all QuickSight Group memberships.

list_iam_policy_assignments([status, ...])

List IAM policy assignments in the current Amazon QuickSight account.

list_iam_policy_assignments_for_user(user_name)

List all IAM policy assignments for a given user.

list_ingestions([dataset_name, dataset_id, ...])

List the history of SPICE ingestions for a dataset.

list_templates([account_id, boto3_session])

List all QuickSight templates.

list_users([namespace, account_id, ...])

Return a list of all of the Amazon QuickSight users belonging to this account.

list_user_groups(user_name[, namespace, ...])

List the Amazon QuickSight groups that an Amazon QuickSight user is a member of.
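
A minimal sketch of wiring Athena data into QuickSight; the data source, dataset, database, and table names are placeholders, and the data_source_name keyword is an assumption about the dataset options:

    import awswrangler as wr

    wr.quicksight.create_athena_data_source(name="my-data-source", workgroup="primary")
    wr.quicksight.create_athena_dataset(
        name="my-dataset",
        database="my_database",
        table="my_table",
        data_source_name="my-data-source",
    )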

AWS STS

get_account_id([boto3_session])

Get Account ID.

get_current_identity_arn([boto3_session])

Get current user/role ARN.

get_current_identity_name([boto3_session])

Get current user/role name.

AWS Secrets Manager

get_secret(name[, boto3_session])

Get secret value.

get_secret_json(name[, boto3_session])

Get JSON secret value.
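
A minimal sketch; "my-secret" is a placeholder name for a secret holding a JSON payload such as database credentials:

    import awswrangler as wr

    secret = wr.secretsmanager.get_secret_json(name="my-secret")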

Amazon Chime

post_message(webhook, message)

Send a message to an existing Chime chat room.

Typing

GlueTableSettings

Typed dictionary defining the settings for the Glue table.

AthenaCTASSettings

Typed dictionary defining the settings for using CTAS (CREATE TABLE AS SELECT).

AthenaUNLOADSettings

Typed dictionary defining the settings for using UNLOAD.

AthenaCacheSettings

Typed dictionary defining the settings for using cached Athena results.

AthenaPartitionProjectionSettings

Typed dictionary defining the settings for Athena Partition Projection.

TimestreamBatchLoadReportS3Configuration

Report configuration for a batch load task.

ArrowDecryptionConfiguration

Configuration for Arrow file decryption.

ArrowEncryptionConfiguration

Configuration for Arrow file encryption.

RaySettings

Typed dictionary defining the settings for distributing calls using Ray.

RayReadParquetSettings

Typed dictionary defining the settings for distributing reading calls using Ray.

_S3WriteDataReturnValue

Typed dictionary defining the dictionary returned by S3 write functions.

_ReadTableMetadataReturnValue(columns_types, ...)

Named tuple defining the return value of the read_*_metadata functions.
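
A sketch of how these typed dictionaries are passed to read/write calls; the athena_cache_settings keyword assumes the 3.x API, and the database and table are placeholders:

    import awswrangler as wr

    df = wr.athena.read_sql_query(
        sql="SELECT * FROM my_table LIMIT 10",
        database="my_database",
        ctas_approach=True,
        athena_cache_settings={"max_cache_seconds": 900},
    )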

Global Configurations

reset()

Reset one or all (if None is received) configuration values.

to_pandas()

Load all configurations into a Pandas DataFrame.
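
A sketch of working with the global configuration object; concurrent_partitioning is one example of a configurable attribute (the same values can also be supplied via WR_* environment variables):

    import awswrangler as wr

    wr.config.concurrent_partitioning = True
    current = wr.config.to_pandas()
    wr.config.reset()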

Engine and Memory Format

Engine()

Execution engine configuration class.

MemoryFormat()

Memory format configuration class.

Distributed - Ray

initialize_ray([address, redis_password, ...])

Connect to an existing Ray cluster or start one and connect to it.
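
A sketch of switching to the distributed backends, assuming awswrangler was installed with the ray and modin extras; the Ray address is a placeholder:

    import awswrangler as wr
    from awswrangler.distributed.ray import initialize_ray

    wr.engine.set("ray")
    wr.memory_format.set("modin")

    # Connect to an existing cluster by address, or start a local one when address is None.
    initialize_ray(address=None)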