Running the Apache Beam samples with Apache Spark

Prerequisites

Check the prerequisites to make sure you have the correct Java version, have built your Apache Hop fat jar and have exported your project metadata to a JSON file.
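As a quick sanity check, you can verify the Java version and the two build artifacts from a terminal. The file names and the /opt/spark location below are only assumptions for this example; use the paths where you saved your fat jar and metadata export:

java -version
ls -l /opt/spark/hop-fat.jar /opt/spark/hop-metadata.json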

Get Spark

Download your selected Spark version and unzip to a convenient location.
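For example, to fetch and unpack the Spark build used in the output shown later on this page (a sketch, assuming Spark 3.1.2 with Hadoop 3.2 from the Apache archive; pick the version and mirror that suit you):

wget https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop3.2.tgz
tar -xzf spark-3.1.2-bin-hadoop3.2.tgz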

Start your local Spark single node cluster

To keep things as simple as possible, we’ll run a local single node Spark cluster.

First we need to start our local master. This can be done with a single command from the folder where you unzipped Spark:

<SPARK_FOLDER>/sbin/start-master.sh

Your output should look similar to the one below:

starting org.apache.spark.deploy.master.Master, logging to <PATH>/spark-3.1.2-bin-hadoop3.2/logs/spark-<USER>-org.apache.spark.deploy.master.Master-1-<HOSTNAME>.out

You should now be able to access the Spark Master’s web UI at http://localhost:8080.

Copy the master’s URL from the master’s page header, e.g. spark://<YOUR_HOSTNAME>.localdomain:7077.

With the master in place, we can start a worker (formerly called slave). Similar to the master, this is a single command that takes the master’s URL you just copied:

<SPARK_FOLDER>/sbin/start-worker.sh spark://<YOUR_HOSTNAME>.localdomain:7077

Your output should look similar to the one below:

starting org.apache.spark.deploy.worker.Worker, logging to <PATH>/spark-3.1.2-bin-hadoop3.2/logs/spark-<USER>-org.apache.spark.deploy.worker.Worker-1-<HOSTNAME>.out

Run sample pipeline with Spark Submit

Since Spark doesn’t support remote execution, we’ll run one of the sample pipelines through spark-submit.

INFO: The sample pipeline we’ll run in this example reads the variables for its file input and output from the Spark pipeline run configuration. Check the variables tab of the Spark pipeline run configuration in the metadata perspective for more details.

The command below passes all the required information to run the input-process-output.hpl pipeline from the samples project on our local Spark cluster with spark-submit.
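The exact invocation depends on your setup. A minimal sketch, assuming the fat jar was saved as hop-fat.jar and the metadata export as hop-metadata.json in /opt/spark, and assuming org.apache.hop.beam.run.MainBeam as the fat jar’s Beam entry point taking the pipeline file, the metadata export and the run configuration name as arguments:

spark-submit \
  --master spark://<YOUR_HOSTNAME>.localdomain:7077 \
  --class org.apache.hop.beam.run.MainBeam \
  /opt/spark/hop-fat.jar \
  /opt/spark/input-process-output.hpl \
  /opt/spark/hop-metadata.json \
  Spark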

Tip: Optionally you can provide a 4th argument after the run configuration name: the name of the environment configuration file to use.

In this case, the fat jar and metadata export files were saved to /opt/spark. The last argument, Spark, is the name of the Spark pipeline run configuration in the samples project. Replace these values with the ones for your environment and run the command.

You should see verbose logging output from spark-submit while the pipeline runs.

After your pipeline finishes and the spark-submit command ends, your Spark master UI will show a new entry in the 'Finished Applications' list. You can follow any running applications in the 'Running Applications' list and drill down into their execution details while they run.
