awswrangler.emr.create_cluster
- awswrangler.emr.create_cluster(subnet_id: str, cluster_name: str = 'my-emr-cluster', logging_s3_path: str | None = None, emr_release: str = 'emr-6.7.0', emr_ec2_role: str = 'EMR_EC2_DefaultRole', emr_role: str = 'EMR_DefaultRole', instance_type_master: str = 'r5.xlarge', instance_type_core: str = 'r5.xlarge', instance_type_task: str = 'r5.xlarge', instance_ebs_size_master: int = 64, instance_ebs_size_core: int = 64, instance_ebs_size_task: int = 64, instance_num_on_demand_master: int = 1, instance_num_on_demand_core: int = 0, instance_num_on_demand_task: int = 0, instance_num_spot_master: int = 0, instance_num_spot_core: int = 0, instance_num_spot_task: int = 0, spot_bid_percentage_of_on_demand_master: int = 100, spot_bid_percentage_of_on_demand_core: int = 100, spot_bid_percentage_of_on_demand_task: int = 100, spot_provisioning_timeout_master: int = 5, spot_provisioning_timeout_core: int = 5, spot_provisioning_timeout_task: int = 5, spot_timeout_to_on_demand_master: bool = True, spot_timeout_to_on_demand_core: bool = True, spot_timeout_to_on_demand_task: bool = True, python3: bool = True, spark_glue_catalog: bool = True, hive_glue_catalog: bool = True, presto_glue_catalog: bool = True, consistent_view: bool = False, consistent_view_retry_seconds: int = 10, consistent_view_retry_count: int = 5, consistent_view_table_name: str = 'EmrFSMetadata', bootstraps_paths: list[str] | None = None, debugging: bool = True, applications: list[str] | None = None, visible_to_all_users: bool = True, key_pair_name: str | None = None, security_group_master: str | None = None, security_groups_master_additional: list[str] | None = None, security_group_slave: str | None = None, security_groups_slave_additional: list[str] | None = None, security_group_service_access: str | None = None, security_configuration: str | None = None, docker: bool = False, extra_public_registries: list[str] | None = None, spark_log_level: str = 'WARN', spark_jars_path: list[str] | None = None, spark_defaults: dict[str, str] | None = None, spark_pyarrow: bool = False, custom_classifications: list[dict[str, Any]] | None = None, maximize_resource_allocation: bool = False, steps: list[dict[str, Any]] | None = None, custom_ami_id: str | None = None, step_concurrency_level: int = 1, keep_cluster_alive_when_no_steps: bool = True, termination_protected: bool = False, auto_termination_policy: dict[str, int] | None = None, tags: dict[str, str] | None = None, boto3_session: Session | None = None, configurations: list[dict[str, Any]] | None = None) → str
Create an EMR cluster with instance fleets configuration.
https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-instance-fleet.html
- Parameters:
  - subnet_id (str) – VPC subnet ID.
  - cluster_name (str) – Cluster name.
  - logging_s3_path (str | None) – Logging S3 path (e.g. s3://BUCKET_NAME/DIRECTORY_NAME/). If None, the default is s3://aws-logs-{AccountId}-{RegionId}/elasticmapreduce/
  - emr_release (str) – EMR release (e.g. emr-5.28.0).
  - emr_ec2_role (str) – IAM role name.
  - emr_role (str) – IAM role name.
  - instance_type_master (str) – EC2 instance type.
  - instance_type_core (str) – EC2 instance type.
  - instance_type_task (str) – EC2 instance type.
  - instance_ebs_size_master (int) – Size of EBS in GB.
  - instance_ebs_size_core (int) – Size of EBS in GB.
  - instance_ebs_size_task (int) – Size of EBS in GB.
  - instance_num_on_demand_master (int) – Number of On-Demand instances.
  - instance_num_on_demand_core (int) – Number of On-Demand instances.
  - instance_num_on_demand_task (int) – Number of On-Demand instances.
  - instance_num_spot_master (int) – Number of Spot instances.
  - instance_num_spot_core (int) – Number of Spot instances.
  - instance_num_spot_task (int) – Number of Spot instances.
  - spot_bid_percentage_of_on_demand_master (int) – The bid price, as a percentage of the On-Demand price.
  - spot_bid_percentage_of_on_demand_core (int) – The bid price, as a percentage of the On-Demand price.
  - spot_bid_percentage_of_on_demand_task (int) – The bid price, as a percentage of the On-Demand price.
  - spot_provisioning_timeout_master (int) – The Spot provisioning timeout period in minutes. If Spot instances are not provisioned within this period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
  - spot_provisioning_timeout_core (int) – The Spot provisioning timeout period in minutes. If Spot instances are not provisioned within this period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
  - spot_provisioning_timeout_task (int) – The Spot provisioning timeout period in minutes. If Spot instances are not provisioned within this period, the TimeOutAction is taken. Minimum value is 5 and maximum value is 1440. The timeout applies only during initial provisioning, when the cluster is first created.
  - spot_timeout_to_on_demand_master (bool) – After a provisioning timeout, should the cluster switch to On-Demand or shut down?
  - spot_timeout_to_on_demand_core (bool) – After a provisioning timeout, should the cluster switch to On-Demand or shut down?
  - spot_timeout_to_on_demand_task (bool) – After a provisioning timeout, should the cluster switch to On-Demand or shut down?
  - python3 (bool) – Python 3 enabled?
  - spark_glue_catalog (bool) – Spark integration with Glue Catalog?
  - hive_glue_catalog (bool) – Hive integration with Glue Catalog?
  - presto_glue_catalog (bool) – Presto integration with Glue Catalog?
  - consistent_view (bool) – Consistent view allows EMR clusters to check for list and read-after-write consistency for Amazon S3 objects written by or synced with EMRFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-consistent-view.html
  - consistent_view_retry_seconds (int) – Delay between retries (seconds).
  - consistent_view_retry_count (int) – Number of retries.
  - consistent_view_table_name (str) – Name of the DynamoDB table that stores the consistent view data.
  - bootstraps_paths (list[str] | None) – Bootstrap scripts paths (e.g. ["s3://BUCKET_NAME/script.sh"]).
  - debugging (bool) – Debugging enabled?
  - applications (list[str] | None) – List of applications (e.g. ["Hadoop", "Spark", "Ganglia", "Hive"]). If None, ["Spark"] will be used.
  - visible_to_all_users (bool) – True or False.
  - key_pair_name (str | None) – Key pair name.
  - security_group_master (str | None) – The identifier of the Amazon EC2 security group for the master node.
  - security_groups_master_additional (list[str] | None) – A list of additional Amazon EC2 security group IDs for the master node.
  - security_group_slave (str | None) – The identifier of the Amazon EC2 security group for the core and task nodes.
  - security_groups_slave_additional (list[str] | None) – A list of additional Amazon EC2 security group IDs for the core and task nodes.
  - security_group_service_access (str | None) – The identifier of the Amazon EC2 security group for the Amazon EMR service to access clusters in VPC private subnets.
  - security_configuration (str | None) – The name of a security configuration to apply to the cluster.
  - docker (bool) – Enable Docker Hub and ECR registries access.
  - extra_public_registries (list[str] | None) – Additional Docker registries.
  - spark_log_level (str) – log4j.rootCategory log level (ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF, TRACE).
  - spark_jars_path (list[str] | None) – spark.jars (e.g. ["s3://…/foo.jar", "s3://…/boo.jar"]). https://spark.apache.org/docs/latest/configuration.html
  - spark_defaults (dict[str, str] | None) – Spark default properties. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#spark-defaults
  - spark_pyarrow (bool) – Enable PySpark to use PyArrow behind the scenes. Note: you must install PyArrow yourself via a bootstrap script.
  - custom_classifications (list[dict[str, Any]] | None) – Extra classifications.
  - maximize_resource_allocation (bool) – Configure your executors to utilize the maximum resources possible. https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html#emr-spark-maximizeresourceallocation
  - custom_ami_id (str | None) – The custom AMI ID to use for the provisioned instance group.
  - steps (list[dict[str, Any]] | None) – Steps definitions. Use wr.emr.build_step() to build them; a sketch is shown in the Examples below.
  - step_concurrency_level (int) – The number of steps that can be executed concurrently.
  - keep_cluster_alive_when_no_steps (bool) – Specifies whether the cluster should remain available after completing all steps.
  - termination_protected (bool) – Specifies whether the Amazon EC2 instances in the cluster are protected from termination by API calls, user intervention, or a job-flow error.
  - auto_termination_policy (dict[str, int] | None) – Specifies the auto-termination policy attached to the cluster, e.g. auto_termination_policy = {'IdleTimeout': 123}. IdleTimeout specifies the amount of idle time in seconds after which the cluster automatically terminates. You can specify a minimum of 60 seconds and a maximum of 604800 seconds (seven days).
  - tags (dict[str, str] | None) – Key/Value collection to put on the cluster (e.g. {"foo": "boo", "bar": "xoo"}).
  - boto3_session (Session | None) – The default boto3 session will be used if boto3_session is None.
  - configurations (list[dict[str, Any]] | None) – The list of configurations supplied for an EMR cluster instance group. By default, adds a log4j config as follows: {"Classification": "spark-log4j", "Properties": {"log4j.rootCategory": f"{pars['spark_log_level']}, console"}}
- Return type:
str
- Returns:
Cluster ID.
Examples
Minimal Example
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster("SUBNET_ID")
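The returned cluster ID can be fed back into the other functions in this module. A minimal sketch, assuming the cluster above was created successfully, using wr.emr.get_cluster_state() and wr.emr.terminate_cluster() from this same module:

>>> state = wr.emr.get_cluster_state(cluster_id)  # e.g. 'STARTING'
>>> wr.emr.terminate_cluster(cluster_id)  # shut the cluster down when finished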
Minimal Example With Custom Classification
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     subnet_id="SUBNET_ID",
...     custom_classifications=[
...         {
...             "Classification": "livy-conf",
...             "Properties": {
...                 "livy.spark.master": "yarn",
...                 "livy.spark.deploy-mode": "cluster",
...                 "livy.server.session.timeout": "16h",
...             },
...         }
...     ],
... )
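Minimal Example With Auto-Termination
A sketch combining the minimal call with the auto_termination_policy parameter described above; the 1800-second idle timeout is an arbitrary illustration, any value between 60 and 604800 seconds is accepted:

>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     subnet_id="SUBNET_ID",
...     auto_termination_policy={"IdleTimeout": 1800},  # terminate after 30 idle minutes
... )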
Full Example
>>> import awswrangler as wr
>>> cluster_id = wr.emr.create_cluster(
...     cluster_name="wrangler_cluster",
...     logging_s3_path="s3://BUCKET_NAME/emr-logs/",
...     emr_release="emr-5.28.0",
...     subnet_id="SUBNET_ID",
...     emr_ec2_role="EMR_EC2_DefaultRole",
...     emr_role="EMR_DefaultRole",
...     instance_type_master="m5.xlarge",
...     instance_type_core="m5.xlarge",
...     instance_type_task="m5.xlarge",
...     instance_ebs_size_master=50,
...     instance_ebs_size_core=50,
...     instance_ebs_size_task=50,
...     instance_num_on_demand_master=1,
...     instance_num_on_demand_core=1,
...     instance_num_on_demand_task=1,
...     instance_num_spot_master=0,
...     instance_num_spot_core=1,
...     instance_num_spot_task=1,
...     spot_bid_percentage_of_on_demand_master=100,
...     spot_bid_percentage_of_on_demand_core=100,
...     spot_bid_percentage_of_on_demand_task=100,
...     spot_provisioning_timeout_master=5,
...     spot_provisioning_timeout_core=5,
...     spot_provisioning_timeout_task=5,
...     spot_timeout_to_on_demand_master=True,
...     spot_timeout_to_on_demand_core=True,
...     spot_timeout_to_on_demand_task=True,
...     python3=True,
...     spark_glue_catalog=True,
...     hive_glue_catalog=True,
...     presto_glue_catalog=True,
...     bootstraps_paths=None,
...     debugging=True,
...     applications=["Hadoop", "Spark", "Ganglia", "Hive"],
...     visible_to_all_users=True,
...     key_pair_name=None,
...     spark_jars_path=["s3://...jar"],
...     maximize_resource_allocation=True,
...     keep_cluster_alive_when_no_steps=True,
...     termination_protected=False,
...     spark_pyarrow=True,
...     tags={
...         "foo": "boo"
...     })
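Example With Steps
As referenced in the steps parameter above, step definitions are built with wr.emr.build_step(). A sketch, assuming a hypothetical PySpark script at s3://BUCKET_NAME/app.py and build_step's default arguments:

>>> import awswrangler as wr
>>> steps = [
...     wr.emr.build_step(name="app_test", command="spark-submit s3://BUCKET_NAME/app.py")
... ]
>>> cluster_id = wr.emr.create_cluster(
...     subnet_id="SUBNET_ID",
...     steps=steps,
... )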