awswrangler.emr.create_cluster
- awswrangler.emr.create_cluster(subnet_id: str, cluster_name: str = 'my-emr-cluster', logging_s3_path: str | None = None, emr_release: str = 'emr-6.0.0', emr_ec2_role: str = 'EMR_EC2_DefaultRole', emr_role: str = 'EMR_DefaultRole', instance_type_master: str = 'r5.xlarge', instance_type_core: str = 'r5.xlarge', instance_type_task: str = 'r5.xlarge', instance_ebs_size_master: int = 64, instance_ebs_size_core: int = 64, instance_ebs_size_task: int = 64, instance_num_on_demand_master: int = 1, instance_num_on_demand_core: int = 0, instance_num_on_demand_task: int = 0, instance_num_spot_master: int = 0, instance_num_spot_core: int = 0, instance_num_spot_task: int = 0, spot_bid_percentage_of_on_demand_master: int = 100, spot_bid_percentage_of_on_demand_core: int = 100, spot_bid_percentage_of_on_demand_task: int = 100, spot_provisioning_timeout_master: int = 5, spot_provisioning_timeout_core: int = 5, spot_provisioning_timeout_task: int = 5, spot_timeout_to_on_demand_master: bool = True, spot_timeout_to_on_demand_core: bool = True, spot_timeout_to_on_demand_task: bool = True, python3: bool = True, spark_glue_catalog: bool = True, hive_glue_catalog: bool = True, presto_glue_catalog: bool = True, consistent_view: bool = False, consistent_view_retry_seconds: int = 10, consistent_view_retry_count: int = 5, consistent_view_table_name: str = 'EmrFSMetadata', bootstraps_paths: List[str] | None = None, debugging: bool = True, applications: List[str] | None = None, visible_to_all_users: bool = True, key_pair_name: str | None = None, security_group_master: str | None = None, security_groups_master_additional: List[str] | None = None, security_group_slave: str | None = None, security_groups_slave_additional: List[str] | None = None, security_group_service_access: str | None = None, docker: bool = False, extra_public_registries: List[str] | None = None, spark_log_level: str = 'WARN', spark_jars_path: List[str] | None = None, spark_defaults: Dict[str, str] | None = None, spark_pyarrow: bool = False, custom_classifications: List[Dict[str, Any]] | None = None, maximize_resource_allocation: bool = False, steps: List[Dict[str, Any]] | None = None, custom_ami_id: str | None = None, step_concurrency_level: int = 1, keep_cluster_alive_when_no_steps: bool = True, termination_protected: bool = False, auto_termination_policy: Dict[str, int] | None = None, tags: Dict[str, str] | None = None, boto3_session: Session | None = None) → str
Create an EMR cluster with instance fleets configuration.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
- Parameters:
subnet_id (str) – VPC subnet ID.
cluster_name (str) – Cluster name.
logging_s3_path (str, optional) – Logging s3 path (e.g. s3://BUCKET_NAME/DIRECTORY_NAME/). If None, the default is s3://aws-logs-{AccountId}-{RegionId}/elasticmapreduce/
emr_release (str) – EMR release (e.g. emr-5.28.0).
emr_ec2_role (str) – IAM role name.
emr_role (str) – IAM role name.
instance_type_master (str) – EC2 instance type.
instance_type_core (str) – EC2 instance type.
instance_type_task (str) – EC2 instance type.
instance_ebs_size_master (int) – Size of EBS in GB.
instance_ebs_size_core (int) – Size of EBS in GB.
instance_ebs_size_task (int) – Size of EBS in GB.
instance_num_on_demand_master (int) – Number of on demand instances.
instance_num_on_demand_core (int) – Number of on demand instances.
instance_num_on_demand_task (int) – Number of on demand instances.
instance_num_spot_master (int) – Number of spot instances.
instance_num_spot_core (int) – Number of spot instances.
instance_num_spot_task (int) – Number of spot instances.
spot_bid_percentage_of_on_demand_master (int) – The bid price, as a percentage of On-Demand price.
spot_bid_percentage_of_on_demand_core (int) – The bid price, as a percentage of On-Demand price.
spot_bid_percentage_of_on_demand_task (int) – The bid price, as a percentage of On-Demand price.
spot_provisioning_timeout_master (int) – The spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
spot_provisioning_timeout_core (int) – The spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
spot_provisioning_timeout_task (int) – The spot provisioning timeout period in minutes. If Spot instances are not provisioned within this time period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
spot_timeout_to_on_demand_master (bool) – If a spot provisioning timeout occurs, whether the cluster switches to on-demand instances (True) or shuts down (False).
spot_timeout_to_on_demand_core (bool) – If a spot provisioning timeout occurs, whether the cluster switches to on-demand instances (True) or shuts down (False).
spot_timeout_to_on_demand_task (bool) – If a spot provisioning timeout occurs, whether the cluster switches to on-demand instances (True) or shuts down (False).
python3 (bool) – Python 3 Enabled?
spark_glue_catalog (bool) – Spark integration with Glue Catalog?
hive_glue_catalog (bool) – Hive integration with Glue Catalog?
presto_glue_catalog (bool) – Presto integration with Glue Catalog?
consistent_view (bool) – Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
consistent_view_retry_seconds (int) – Delay between the tries (seconds).
consistent_view_retry_count (int) – Number of tries.
consistent_view_table_name (str) – Name of the DynamoDB table to store the consistent view data.
bootstraps_paths (List[str], optional) – Bootstraps paths (e.g. ["s3://BUCKET_NAME/script.sh"]).
debugging (bool) – Debugging enabled?
applications (List[str], optional) – List of applications (e.g. ["Hadoop", "Spark", "Ganglia", "Hive"]). If None, ["Spark"] will be considered.
visible_to_all_users (bool) – Whether the cluster is visible to all IAM principals in the AWS account.
key_pair_name (str, optional) – Key pair name.
security_group_master (str, optional) – The identifier of the Amazon EC2 security group for the master node.
security_groups_master_additional (List[str], optional) – A list of additional Amazon EC2 security group IDs for the master node.
security_group_slave (str, optional) – The identifier of the Amazon EC2 security group for the core and task nodes.
security_groups_slave_additional (List[str], optional) – A list of additional Amazon EC2 security group IDs for the core and task nodes.
security_group_service_access (str, optional) – The identifier of the Amazon EC2 security group for the Amazon EMR service to access clusters in VPC private subnets.
docker (bool) – Enable Docker Hub and ECR registries access.
extra_public_registries (List[str], optional) – Additional docker registries.
spark_log_level (str) – log4j.rootCategory log level (ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF, TRACE).
spark_jars_path (List[str], optional) – spark.jars paths (e.g. ["s3://…/foo.jar", "s3://…/boo.jar"]) https://spark.apache.org/docs/latest/configuration.html
spark_defaults (Dict[str, str], optional) – https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
spark_pyarrow (bool) – Enable PySpark to use PyArrow behind the scenes. Note: you must install pyarrow yourself via a bootstrap action.
custom_classifications (List[Dict[str, Any]], optional) – Extra classifications.
maximize_resource_allocation (bool) – Configure your executors to utilize the maximum resources possible https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
steps (List[Dict[str, Any]], optional) – Steps definitions (use wr.emr.build_step() to build them; see the steps example below).
custom_ami_id (str, optional) – The custom AMI ID to use for the provisioned instance group.
step_concurrency_level (int) – The number of steps the cluster can execute concurrently.
keep_cluster_alive_when_no_steps (bool) – Specifies whether the cluster should remain available after completing all steps.
termination_protected (bool) – Specifies whether the Amazon EC2 instances in the cluster are protected from termination by API calls, user intervention, or in the event of a job-flow error.
auto_termination_policy (Dict[str, int], optional) – Specifies the auto-termination policy attached to the Amazon EMR cluster, e.g. auto_termination_policy={'IdleTimeout': 123}. IdleTimeout specifies the amount of idle time in seconds after which the cluster automatically terminates. You can specify a minimum of 60 seconds and a maximum of 604800 seconds (seven days). See the auto-termination example below.
tags (Dict[str, str], optional) – Key/Value collection to put on the cluster (e.g. {"foo": "boo", "bar": "xoo"}).
boto3_session (boto3.Session(), optional) – Boto3 Session. The default boto3 session will be used if boto3_session receives None.
- Returns:
Cluster ID.
- Return type:
str
Examples
Minimal Example
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster("SUBNET_ID")
Minimal Example With Custom Classification
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     subnet_id="SUBNET_ID",
...     custom_classifications=[
...         {
...             "Classification": "livy-conf",
...             "Properties": {
...                 "livy.spark.master": "yarn",
...                 "livy.spark.deploy-mode": "cluster",
...                 "livy.server.session.timeout": "16h",
...             },
...         }
...     ],
... )
Full Example
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     cluster_name="wrangler_cluster",
...     logging_s3_path="s3://BUCKET_NAME/emr-logs/",
...     emr_release="emr-5.28.0",
...     subnet_id="SUBNET_ID",
...     emr_ec2_role="EMR_EC2_DefaultRole",
...     emr_role="EMR_DefaultRole",
...     instance_type_master="m5.xlarge",
...     instance_type_core="m5.xlarge",
...     instance_type_task="m5.xlarge",
...     instance_ebs_size_master=50,
...     instance_ebs_size_core=50,
...     instance_ebs_size_task=50,
...     instance_num_on_demand_master=1,
...     instance_num_on_demand_core=1,
...     instance_num_on_demand_task=1,
...     instance_num_spot_master=0,
...     instance_num_spot_core=1,
...     instance_num_spot_task=1,
...     spot_bid_percentage_of_on_demand_master=100,
...     spot_bid_percentage_of_on_demand_core=100,
...     spot_bid_percentage_of_on_demand_task=100,
...     spot_provisioning_timeout_master=5,
...     spot_provisioning_timeout_core=5,
...     spot_provisioning_timeout_task=5,
...     spot_timeout_to_on_demand_master=True,
...     spot_timeout_to_on_demand_core=True,
...     spot_timeout_to_on_demand_task=True,
...     python3=True,
...     spark_glue_catalog=True,
...     hive_glue_catalog=True,
...     presto_glue_catalog=True,
...     bootstraps_paths=None,
...     debugging=True,
...     applications=["Hadoop", "Spark", "Ganglia", "Hive"],
...     visible_to_all_users=True,
...     key_pair_name=None,
...     spark_jars_path=["s3://...jar"],
...     maximize_resource_allocation=True,
...     keep_cluster_alive_when_no_steps=True,
...     termination_protected=False,
...     spark_pyarrow=True,
...     tags={
...         "foo": "boo"
...     },
... )
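Example With Steps
A minimal sketch of passing pre-defined steps at cluster creation; the subnet ID, step names, and shell commands are illustrative placeholders.
>>> import awswrangler as wr
>>> steps = []
>>> for cmd in ['echo "Hello"', "ls -la"]:
...     steps.append(wr.emr.build_step(name=cmd, command=cmd))  # each command becomes one EMR step
>>> cluster_id = wr.emr.create_cluster("SUBNET_ID", steps=steps)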
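Example With Spark Defaults and Auto-Termination
A sketch with illustrative values only; the Spark property and the idle timeout shown are placeholders, not recommended settings.
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     subnet_id="SUBNET_ID",
...     spark_defaults={"spark.sql.shuffle.partitions": "200"},  # extra spark-defaults properties (placeholder value)
...     auto_termination_policy={"IdleTimeout": 3600},  # terminate after 1 hour idle (60-604800 seconds allowed)
... )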