awswrangler.data_quality.evaluate_ruleset

awswrangler.data_quality.evaluate_ruleset(name: str | List[str], iam_role_arn: str, number_of_workers: int = 5, timeout: int = 2880, database: str | None = None, table: str | None = None, catalog_id: str | None = None, connection_name: str | None = None, additional_options: Dict[str, str] | None = None, additional_run_options: Dict[str, bool | str] | None = None, client_token: str | None = None, boto3_session: Session | None = None) → Any

Evaluate Data Quality ruleset.

Note

This function has arguments which can be configured globally through wr.config or environment variables:

  • catalog_id

  • database

Check out the Global Configurations Tutorial for details.
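As a minimal sketch of the environment-variable route, assuming awswrangler's usual WR_<ARGUMENT> naming (so WR_DATABASE and WR_CATALOG_ID) and assuming the variables are read when the library is imported. The values shown are hypothetical placeholders:

```python
import os

# Hypothetical placeholder values; replace with your own database and account id.
os.environ["WR_DATABASE"] = "my_database"
os.environ["WR_CATALOG_ID"] = "111111111111"

# Assumption: awswrangler picks up WR_* variables at import time, so they
# should be set before `import awswrangler as wr`. The in-code alternative
# would be:
#   wr.config.database = "my_database"
#   wr.config.catalog_id = "111111111111"
configured = {
    key: value
    for key, value in os.environ.items()
    if key in ("WR_DATABASE", "WR_CATALOG_ID")
}
print(configured)
```

With either approach, database and catalog_id can then be left out of the evaluate_ruleset call.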

Parameters:
  • name (str or list[str]) – Ruleset name or list of names.

  • iam_role_arn (str) – IAM Role ARN.

  • number_of_workers (int, optional) – The number of G.1X workers to be used in the run. The default is 5.

  • timeout (int, optional) – The timeout for a run in minutes. The default is 2880 (48 hours).

  • database (str, optional) – Glue database name. Database associated with the ruleset will be used if not provided.

  • table (str, optional) – Glue table name. Table associated with the ruleset will be used if not provided.

  • catalog_id (str, optional) – Glue Catalog id.

  • connection_name (str, optional) – Glue connection name.

  • additional_options (dict, optional) – Additional options for the table. Supported keys:

      • pushDownPredicate: filters on partitions without having to list and read all the files in your dataset.

      • catalogPartitionPredicate: uses server-side partition pruning based on partition indexes in the Glue Data Catalog.

  • additional_run_options (Dict[str, Union[str, bool]], optional) – Additional run options. Supported keys:

      • CloudWatchMetricsEnabled: whether to enable CloudWatch metrics.

      • ResultsS3Prefix: prefix for Amazon S3 to store results.

  • client_token (str, optional) – Random id used for idempotency. Will be automatically generated if not provided.

  • boto3_session (boto3.Session, optional) – Boto3 Session. If None, the default boto3 session is used.
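The two option dictionaries above accept only a small set of keys, so it can help to validate them before starting a run. The following is a rough sketch, not part of awswrangler: build_evaluate_kwargs is a hypothetical helper, and the role ARN, bucket, and partition column (year) are placeholder assumptions.

```python
from typing import Any, Dict, Optional, Union

# Keys listed as supported in the parameter descriptions above.
TABLE_OPTION_KEYS = {"pushDownPredicate", "catalogPartitionPredicate"}
RUN_OPTION_KEYS = {"CloudWatchMetricsEnabled", "ResultsS3Prefix"}


def build_evaluate_kwargs(  # hypothetical helper, not an awswrangler function
    name: str,
    iam_role_arn: str,
    additional_options: Optional[Dict[str, str]] = None,
    additional_run_options: Optional[Dict[str, Union[str, bool]]] = None,
) -> Dict[str, Any]:
    """Reject unknown option keys and return kwargs for evaluate_ruleset."""
    unknown = set(additional_options or {}) - TABLE_OPTION_KEYS
    unknown |= set(additional_run_options or {}) - RUN_OPTION_KEYS
    if unknown:
        raise ValueError(f"Unsupported option keys: {sorted(unknown)}")
    kwargs: Dict[str, Any] = {"name": name, "iam_role_arn": iam_role_arn}
    if additional_options:
        kwargs["additional_options"] = additional_options
    if additional_run_options:
        kwargs["additional_run_options"] = additional_run_options
    return kwargs


kwargs = build_evaluate_kwargs(
    name="ruleset",
    iam_role_arn="arn:aws:iam::111111111111:role/hypothetical-role",  # placeholder
    # Assumes the table is partitioned by a column named `year`.
    additional_options={"pushDownPredicate": "year == '2024'"},
    additional_run_options={
        "CloudWatchMetricsEnabled": True,
        "ResultsS3Prefix": "s3://hypothetical-bucket/dq-results/",  # placeholder
    },
)
# With AWS credentials configured, the run would then be started with:
#   wr.data_quality.evaluate_ruleset(**kwargs)
```

Failing fast on a misspelled key avoids discovering the mistake only after the Glue run has been submitted.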

Returns:

Data frame with ruleset evaluation results.

Return type:

pd.DataFrame

Examples

>>> import awswrangler as wr
>>> import pandas as pd
>>>
>>> # `path` (an S3 URI) and `glue_data_quality_role` (an IAM role ARN)
>>> # are assumed to be defined by the caller.
>>> df = pd.DataFrame({"c0": [0, 1, 2], "c1": [0, 1, 2], "c2": [0, 0, 1]})
>>> wr.s3.to_parquet(df, path, dataset=True, database="database", table="table")
>>> wr.data_quality.create_ruleset(
...     name="ruleset",
...     database="database",
...     table="table",
...     dqdl_rules="Rules = [ RowCount between 1 and 3 ]",
... )
>>> df_ruleset_results = wr.data_quality.evaluate_ruleset(
...     name="ruleset",
...     iam_role_arn=glue_data_quality_role,
... )