Integrating Apache Spark with ArcGIS Notebooks in ArcGIS Pro
Apache Spark has grown into the de facto standard for processing data at scale. ArcGIS GeoAnalytics Server and GeoAnalytics Desktop both use Apache Spark as the processing engine in the background to execute analytics quickly and efficiently. This blog post explains how you can integrate Apache Spark with ArcGIS Notebooks to execute big data tasks remotely on a Spark cluster instead of on your local machine.
Apache Spark is an open-source analytics engine for processing data at scale. It provides a framework for distributed processing of big data based on the MapReduce paradigm. MapReduce can accelerate big data tasks by splitting large datasets into smaller chunks and processing them in a massively parallel manner. Apache Spark is mainly written in the Scala programming language, but it provides APIs for Python, Scala, Java, and R. Moreover, Spark has a module for structured data processing (Spark SQL) that lets you query data using structured query language (SQL).
Apache Spark Core includes the basic functionality of Spark. It provides distributed task dispatching, scheduling, memory management, fault recovery, I/O functionality, and more. The functionality of Spark is extended through several components that are built on top of Spark Core: Spark SQL allows data engineers and data scientists to execute distributed queries on structured (or semi-structured) data; Spark Streaming provides libraries for real-time data processing; MLlib includes Apache Spark’s libraries for machine learning tasks; and GraphX provides support for graphs and graph-parallel computation.
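To give a flavor of how structured data is queried through Spark SQL from the Python API, here is a minimal sketch; the file name and column names are placeholders for illustration only:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession for this example
spark = SparkSession.builder.appName("spark-sql-example").getOrCreate()

# Read a structured dataset; "sales.csv" and its columns are hypothetical
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("sales")
totals = spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
totals.show()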
Spark Cluster Architecture
An Apache Spark cluster for processing big data follows a driver/worker architecture with one driver node and any number of worker nodes, which are managed by a cluster manager.
- Driver node: contains the driver process that is responsible for executing the user code, converting the code into tasks, and acquiring executors on worker nodes.
- Cluster manager: controls and allocates resources on worker nodes when a Spark application is launched. Apache Spark can be configured to use the cluster manager from one of the following open-source systems:
- Spark Standalone mode, included with the Spark distribution.
- Apache Hadoop YARN, the resource manager in the second generation of Hadoop.
- Apache Mesos, developed to manage resources in large-scale clusters.
- Kubernetes, a platform for managing containerized workloads.
- Worker nodes: contain the executor processes that are responsible for executing the computations that the driver process assigns to them.
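To make these roles concrete, the sketch below shows how a driver process is pointed at a cluster manager when a SparkSession is created; the standalone master URL and resource settings are hypothetical:

from pyspark.sql import SparkSession

# The driver process runs wherever this code runs; the master URL tells it
# which cluster manager to request executors from (standalone mode here).
spark = (SparkSession.builder
         .master("spark://cluster-manager-host:7077")   # hypothetical standalone cluster manager
         .appName("cluster-example")
         .config("spark.executor.memory", "4g")         # memory per executor on the worker nodes
         .config("spark.executor.cores", "2")           # cores per executor
         .getOrCreate())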
Apache Spark and ArcGIS
You can use Apache Spark’s distributed processing capability in ArcGIS in different ways. For example, Apache Spark is the engine behind distributed processing in an ArcGIS GeoAnalytics Server site. GeoAnalytics Server provides a collection of tools that parallelize computation to analyze your big data more efficiently. It executes a process quickly by breaking it into smaller jobs, running those jobs across multiple server cores and machines simultaneously, and then aggregating the outcomes and returning the final result to users. Since the release of ArcGIS Pro 2.4, GeoAnalytics functionality can also be accessed through the GeoAnalytics Desktop toolbox. The tools in this toolbox use Apache Spark in the background and parallelize computation across the available CPU cores on your local machine.
Moreover, you can connect some ArcGIS applications to remote Spark clusters when processing large datasets requires more resources than your local machine can provide. For instance, you can integrate ArcGIS Insights Desktop with a managed Spark cluster to perform analysis at scale. In the remainder of this blog post you will see how to integrate ArcGIS Notebooks with a remote Spark cluster.
ArcGIS Notebooks & Remote Spark Clusters
Another way to use Apache Spark in ArcGIS is to configure ArcGIS Notebooks in ArcGIS Pro to process large datasets on a remote Spark cluster using Databricks, a cloud-based platform that provides data engineers and data scientists with fully managed Spark clusters for big data workloads. With Databricks you can easily create a set of computation resources when needed. Databricks Connect is a client library that enables users to execute heavy workloads remotely on a managed cluster instead of on their local machine: you write and run your code locally, but Spark jobs are submitted to the Spark server running in Databricks and executed on the cluster. To enable this capability for ArcGIS Notebooks in ArcGIS Pro, complete the following steps:
Note: You will need a Databricks subscription to use Databricks Connect on your local machine.
Step 1: Create a cluster on Databricks. The version of Python installed on your local machine must match the Python version of the Databricks cluster. For example, ArcGIS Pro 2.8 comes with Python 3.7, so when creating a cluster you must use one of the Databricks Runtime versions that also come with Python 3.7: Databricks Runtime 7.3 LTS ML, 7.3 LTS, 6.4 ML, and 6.4 (LTS stands for Long Term Support). Databricks Runtime ML (Databricks Runtime for Machine Learning) creates a cluster optimized for machine learning tasks that includes the most popular machine learning libraries, such as scikit-learn, PyTorch, TensorFlow, Keras, and XGBoost. Depending on whether you use Databricks on Azure or AWS, you may only have access to some of these Runtime versions.
Databricks Connect uses port number 15001 by default. You can configure a cluster to use a different port, such as 8787. To do this, navigate to Advanced > Spark > Spark Config when creating the cluster and add the line below to spark.conf:
spark.databricks.service.port 8787
You can also add an additional line to spark.conf to make sure that you can connect your favorite applications or notebook servers to a Databricks cluster and execute Apache Spark code:
spark.databricks.service.server.enabled true
Step 2: Download and install Java SE Runtime Version 8 on your local machine. It's important to install version 8 specifically, and to change the install location to "C:\Java\jdk1.8.0_301" (or any other location that doesn’t include a space in the path name). By default, the installer uses the "Program Files" folder as the install location, which can cause problems with Spark because of the space in the path name.
Once you have Java installed, open Windows PowerShell and execute the following command to set the environment variable for JAVA_HOME:
[Environment]::SetEnvironmentVariable("JAVA_HOME", "C:\Java\jdk1.8.0_301", "Machine")
Step 3: Clone the default conda environment that is shipped with ArcGIS Pro (i.e., arcgispro-py3) and set this new environment as the active environment. Then, open the Python Command Prompt and run the pip command to install the Databricks Connect library for the Databricks Runtime version you are using in the Databricks cluster. For example, if you are using Databricks Runtime version 7.3, run the following command:
pip install -U "databricks-connect==7.3.*"
Step 4: Determine the parameter values you will need to connect. You will need to define five parameters to use a Databricks cluster remotely: Databricks Host, Databricks Token, Cluster ID, Org ID, and Port.
Databricks Host, Cluster ID, and Org ID can be found in the Databricks workspace URL when you are logged into your Databricks deployment. For example, if you are using Microsoft Azure, the URL will be:
<Databricks Host>/?o=<Org ID>#setting/clusters/<Cluster ID>/configuration
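For example, in the hypothetical Azure workspace URL https://adb-1234567890123456.7.azuredatabricks.net/?o=1234567890123456#setting/clusters/0612-123456-abcdef12/configuration, the Databricks Host would be https://adb-1234567890123456.7.azuredatabricks.net, the Org ID would be 1234567890123456, and the Cluster ID would be 0612-123456-abcdef12 (the values shown here are made up for illustration).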
The port number is the same as the one you used in Step 1, either the default value of 15001, or the number you specified for spark.databricks.service.port when you created the cluster.
You will also need to generate a personal access token in your Databricks workspace. The steps for generating the token can be found in the Databricks online documentation.
Step 5: Configure the connection to the remote cluster. Run the following command in the Python Command Prompt:
databricks-connect configure
Accept the license when prompted, then specify the parameters from the previous step to connect to the remote cluster you created.
Step 6: Test your connection. Run the following command in the Python Command Prompt to test the connectivity to Databricks:
databricks-connect test
The test will start the cluster if you haven't already started it manually, and will run a number of checks to verify that all the parameters are configured correctly and that you have Java 8 installed on your local machine. The cluster will remain running until its configured auto-termination time. You should see a message that says “All tests passed” in the terminal when the tests are completed. If you see any error messages, make sure that the JAVA_HOME environment variable and all Databricks connection parameters have been set correctly.
Once you have your connection set up and configured, you can use the power of a Spark cluster in ArcGIS Notebooks. To do this, create a new notebook in a project in ArcGIS Pro and add the following code block to create a SparkSession:
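A minimal version of that code block might look like the sketch below. Assuming databricks-connect has been installed and configured as described in the previous steps, the standard SparkSession builder connects to the remote Databricks cluster rather than starting a local Spark instance:

from pyspark.sql import SparkSession

# With databricks-connect configured, this builder returns a session
# that submits Spark jobs to the remote Databricks cluster.
spark = SparkSession.builder.getOrCreate()
print(spark.version)   # quick check that the session was created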
You can then use Spark functions in your Python scripts. For example, you can read a dataset from a hosted feature service, query it, and aggregate a column using the pyspark library as follows:
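The exact code depends on your data, but a sketch of this workflow might look like the following. The feature service URL and the STATUS and MAGNITUDE field names are hypothetical, and the attribute table is assumed to be small enough to collect locally before handing it to Spark:

from arcgis.gis import GIS
from arcgis.features import FeatureLayer
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the attribute table of a hosted feature service into a pandas DataFrame;
# GIS("home") uses the portal connection of the active ArcGIS Pro session.
gis = GIS("home")
layer = FeatureLayer("https://services.arcgis.com/<org>/arcgis/rest/services/Incidents/FeatureServer/0", gis)
pdf = layer.query(where="1=1", as_df=True).drop(columns=["SHAPE"])

# Hand the data to the remote Spark cluster as a Spark DataFrame
df = spark.createDataFrame(pdf)

# Query the data and aggregate a column with the DataFrame API
result = (df.filter(df["STATUS"] == "Open")
            .groupBy("STATUS")
            .agg({"MAGNITUDE": "avg"}))
result.show()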
Do you have ideas for data engineering and analytical tasks in ArcGIS Notebooks that could benefit from the power of distributed processing? If so, please share them with us in the comments.
This post was translated to French and can be viewed here.