Running Spark Applications
You can run Spark applications locally or distributed across a cluster, either by using an interactive shell or by submitting an application. Running Spark applications interactively is commonly performed during the data-exploration phase and for ad hoc analysis.
The Spark 2 Job Commands
With Spark 2, you use slightly different command names than in Spark 1, so that you can run both versions of Spark side-by-side without conflicts:
- spark2-submit instead of spark-submit (an example invocation follows this list).
- spark2-shell instead of spark-shell.
- pyspark2 instead of pyspark.
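For example, a minimal spark2-submit invocation on YARN might look like the following sketch. The class name, JAR path, and resource settings are placeholders rather than values from this guide; substitute the details of your own application and cluster.

# Submit a Spark 2 application to YARN in cluster mode (placeholder values).
spark2-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 2G \
  --num-executors 4 \
  /path/to/my-app.jar arg1 arg2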
For development and test purposes, you can also configure each host so that invoking the Spark 1 command name runs the corresponding Spark 2 executable. See Configuring Names of Spark 2 Tools for details.
Canary Test for pyspark Command
The following example shows a simple pyspark session that refers to the SparkContext, calls the collect() function, which runs a Spark 2 job, and writes data to HDFS. This sequence of operations helps you check for obvious configuration issues that would prevent Spark jobs from working at all. For the HDFS path of the output directory, substitute a path that exists on your own system.
$ hdfs dfs -mkdir /user/systest/spark
$ pyspark
...
SparkSession available as 'spark'.
>>> strings = ["one","two","three"]
>>> s2 = sc.parallelize(strings)
>>> s3 = s2.map(lambda word: word.upper())
>>> s3.collect()
['ONE', 'TWO', 'THREE']
>>> s3.saveAsTextFile('hdfs:///user/systest/spark/canary_test')
>>> quit()
$ hdfs dfs -ls /user/systest/spark
Found 1 items
drwxr-xr-x   - systest supergroup          0 2016-08-26 14:41 /user/systest/spark/canary_test
$ hdfs dfs -ls /user/systest/spark/canary_test
Found 3 items
-rw-r--r--   3 systest supergroup          0 2016-08-26 14:41 /user/systest/spark/canary_test/_SUCCESS
-rw-r--r--   3 systest supergroup          4 2016-08-26 14:41 /user/systest/spark/canary_test/part-00000
-rw-r--r--   3 systest supergroup         10 2016-08-26 14:41 /user/systest/spark/canary_test/part-00001
$ hdfs dfs -cat /user/systest/spark/canary_test/part-00000
ONE
$ hdfs dfs -cat /user/systest/spark/canary_test/part-00001
TWO
THREE
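The same canary check can also be run non-interactively with spark2-submit. The following is a minimal sketch of an equivalent standalone script; the file name canary_test.py and the output path are assumptions, so substitute values that suit your own system.

# canary_test.py -- non-interactive version of the canary check (placeholder paths).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("canary-test").getOrCreate()
sc = spark.sparkContext

strings = ["one", "two", "three"]
upper = sc.parallelize(strings).map(lambda word: word.upper())
print(upper.collect())   # expected: ['ONE', 'TWO', 'THREE']
upper.saveAsTextFile('hdfs:///user/systest/spark/canary_test_submit')

spark.stop()

Submit the script with the Spark 2 command, for example: spark2-submit canary_test.py.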
Fetching Spark 2 Maven Dependencies
The Maven coordinates are a combination of groupId, artifactId, and version. The groupId and artifactId are the same as for the upstream Apache Spark project. For example, for spark-core, the groupId is org.apache.spark and the artifactId is spark-core_2.11, both the same as in the upstream project. The version is different for the Cloudera packaging; see Version and Packaging Information for the exact version string for your release.
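Using those coordinates, a pom.xml dependency entry for spark-core would look roughly like the following sketch. The CLOUDERA_SPARK2_VERSION token is a placeholder, not a real version string; replace it with the value listed in Version and Packaging Information. The provided scope is one common choice for applications that are launched with spark2-submit, where Spark itself is supplied by the cluster.

<!-- Spark 2 core dependency; CLOUDERA_SPARK2_VERSION is a placeholder. -->
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.11</artifactId>
  <version>CLOUDERA_SPARK2_VERSION</version>
  <scope>provided</scope>
</dependency>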
Accessing the Spark 2 History Server
In CDH 6, the Spark 2 history server is available on port 18088, the same port used by the Spark 1 history server in CDH 5. This is a change from port 18089, which was formerly used for the history server with the separate Spark 2 parcel.
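As a quick check that the history server is reachable on that port, you can request its REST API. The host name below is a placeholder for whichever host runs your history server.

# List applications known to the Spark 2 history server (placeholder host name).
curl http://historyserver-host:18088/api/v1/applications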