awswrangler.data_quality.evaluate_ruleset

Evaluate Data Quality ruleset.

Note

This function has arguments which can be configured globally through wr.config or environment variables:

catalog_id
database

Check out the Global Configurations Tutorial for details.
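As a minimal sketch of the two mechanisms named above, defaults for database and catalog_id can be exported as environment variables or assigned on wr.config. The WR_ prefix convention, the requirement to set the variables before import, and the database name and account id used here are assumptions for illustration; confirm them against the Global Configurations Tutorial.

```python
import os

# Hypothetical values for illustration only; replace with your own.
DEFAULTS = {"database": "my_database", "catalog_id": "123456789012"}


def set_env_defaults(defaults: dict) -> None:
    """Export each default as a WR_<NAME> environment variable.

    Assumption: awswrangler reads variables named "WR_" plus the upper-cased
    argument name, and they should be set before awswrangler is imported.
    """
    for key, value in defaults.items():
        os.environ[f"WR_{key.upper()}"] = value


def set_config_defaults(defaults: dict) -> None:
    """Assign the same defaults on wr.config at runtime."""
    import awswrangler as wr  # lazy import; requires awswrangler to be installed

    wr.config.database = defaults["database"]
    wr.config.catalog_id = defaults["catalog_id"]


# Option 1: environment variables (no awswrangler import needed here).
set_env_defaults(DEFAULTS)

# Option 2 (not executed in this sketch): set_config_defaults(DEFAULTS)
```

With either option in place, evaluate_ruleset can be called without passing database or catalog_id explicitly; an argument passed in the call should still take precedence over the global value.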

Parameters:

name (str | list[str]) – Ruleset name or list of names.
iam_role_arn (str) – IAM Role ARN.
number_of_workers (int) – The number of G.1X workers to be used in the run. The default is 5.
timeout (int) – The timeout for a run in minutes. The default is 2880 (48 hours).
database (str | None) – Glue database name. Database associated with the ruleset will be used if not provided.
table (str | None) – Glue table name. Table associated with the ruleset will be used if not provided.
catalog_id (str | None) – Glue Catalog id.
connection_name (str | None) – Glue connection name.
additional_options (dict[str, str] | None) –
Additional options for the table. Supported keys:
- pushDownPredicate: filters on partitions without having to list and read all the files in the dataset.
- catalogPartitionPredicate: uses server-side partition pruning through partition indexes in the Glue Data Catalog.
additional_run_options (dict[str, str | bool] | None) –
Additional run options. Supported keys:
- CloudWatchMetricsEnabled: whether to enable CloudWatch metrics.
- ResultsS3Prefix: prefix for Amazon S3 to store results.
client_token (str | None) – Random id used for idempotency. Will be automatically generated if not provided.
boto3_session (Session | None) – The default boto3 session will be used if boto3_session is None.
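The optional parameters above can be combined as in the following sketch, which assembles the keyword arguments separately from the call so they can be inspected first. The helper names, the role ARN, the S3 prefix, and the partition predicate are hypothetical; only the parameter names and the supported dictionary keys come from the list above.

```python
from typing import Any, Dict, Optional


def build_evaluation_kwargs(
    name: str,
    iam_role_arn: str,
    partition_predicate: Optional[str] = None,
    results_s3_prefix: Optional[str] = None,
    timeout: int = 60,
) -> Dict[str, Any]:
    """Assemble keyword arguments for wr.data_quality.evaluate_ruleset."""
    kwargs: Dict[str, Any] = {
        "name": name,
        "iam_role_arn": iam_role_arn,
        "timeout": timeout,  # minutes; the documented default is 2880
    }
    if partition_predicate is not None:
        # Restrict the evaluation to matching partitions.
        kwargs["additional_options"] = {"pushDownPredicate": partition_predicate}
    if results_s3_prefix is not None:
        # Publish CloudWatch metrics and store results under the given prefix.
        kwargs["additional_run_options"] = {
            "CloudWatchMetricsEnabled": True,
            "ResultsS3Prefix": results_s3_prefix,
        }
    return kwargs


def run_evaluation(kwargs: Dict[str, Any]):
    """Run the evaluation; requires awswrangler and AWS credentials."""
    import awswrangler as wr  # lazy import so the builder works without it

    return wr.data_quality.evaluate_ruleset(**kwargs)


# Hypothetical values for illustration only.
kwargs = build_evaluation_kwargs(
    name="ruleset",
    iam_role_arn="arn:aws:iam::123456789012:role/hypothetical-role",
    partition_predicate="year == '2024'",
    results_s3_prefix="s3://hypothetical-bucket/dq-results/",
)
```

Calling run_evaluation(kwargs) would then start the run. Keeping the builder separate from the call is a design choice for this sketch, not something the library requires.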

Return type:

DataFrame

Returns:

Data frame with ruleset evaluation results.
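A short sketch of how the returned data frame might be inspected follows. It assumes the frame has one row per rule, with a Result column holding values such as "PASS" or "FAIL" and a Name column identifying the rule; these column names and values are assumptions, so check the columns of the frame you actually receive. The sample frame stands in for real output.

```python
import pandas as pd


def failed_rules(df_results: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose Result is anything other than "PASS".

    Assumption: the results frame exposes a "Result" column.
    """
    return df_results[df_results["Result"] != "PASS"]


# Stand-in for the output of evaluate_ruleset; the contents are made up.
sample_results = pd.DataFrame(
    {
        "Name": ["Rule_1", "Rule_2"],
        "Description": ["RowCount between 1 and 3", "IsComplete \"c0\""],
        "Result": ["PASS", "FAIL"],
    }
)

failures = failed_rules(sample_results)
all_passed = failures.empty
```

A pipeline could raise an error or skip downstream steps when all_passed is False.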

Examples

>>> import awswrangler as wr
>>> import pandas as pd
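>>>
>>> # Placeholder values, not part of the original example; substitute your own.
>>> path = "s3://my-bucket/my-prefix/"
>>> glue_data_quality_role = "arn:aws:iam::123456789012:role/my-glue-data-quality-role"
>>>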
>>> df = pd.DataFrame({"c0": [0, 1, 2], "c1": [0, 1, 2], "c2": [0, 0, 1]})
>>> wr.s3.to_parquet(df, path, dataset=True, database="database", table="table")
>>> wr.data_quality.create_ruleset(
...     name="ruleset",
...     database="database",
...     table="table",
...     dqdl_rules="Rules = [ RowCount between 1 and 3 ]",
... )
>>> df_ruleset_results = wr.data_quality.evaluate_ruleset(
...     name="ruleset",
...     iam_role_arn=glue_data_quality_role,
... )