# Apache Beam Google DataFlow Pipeline Engine

### Beam DataFlow

Google Cloud Dataflow is a fully managed service for executing Apache Beam pipelines within the Google Cloud Platform ecosystem. As a managed Google Cloud service, it provisions worker nodes and out of the box optimization.

The Cloud Dataflow Runner and service are suitable for large scale continuous jobs and provide:

* A fully managed service
* Autoscaling of the number of workers throughout the lifetime of the Dataflow job
* Dynamic work re-balancing

Check the [Google DataFlow docs](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params) and [Apache Beam DataFlow runner docs](https://beam.apache.org/documentation/runners/dataflow/) for more information.

### Google Dataflow Configuration

INFO: this configuration checklist was reprinted (copied) from the [Apache Beam documentation](https://beam.apache.org/documentation/runners/dataflow/).

To use the Google Cloud Dataflow runtime configuration, you must complete the setup in the *Before you begin* section of the [Cloud Dataflow quickstart](https://cloud.google.com/dataflow/docs/quickstarts) for your chosen language.

* Select or create a Google Cloud Platform Console project.
* Enable billing for your project.
* Enable the required Google Cloud APIs: Cloud Dataflow, Compute Engine, Stackdriver Logging, Cloud Storage, Cloud Storage JSON, and Cloud Resource Manager. You may need to enable additional APIs (such as BigQuery, Cloud Pub/Sub, or Cloud Datastore) if you use them in your pipeline code.
* Authenticate with Google Cloud Platform.
* Install the Google Cloud SDK.
* Create a Cloud Storage bucket.

### Options

| Option                                       | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
| -------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Project ID                                   | The project ID for your Google Cloud Project. This is required if you want to run your pipeline using the Dataflow managed service.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Service Account                              | With this option you can run the pipeline dataflow job with a specific service account, instead of the default GCE robot. The default is to leave this blank.                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
| Application name                             | The name of the Dataflow job being executed as it appears in Dataflow’s jobs list and job details. Also used when updating an existing pipeline.                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| Staging location                             | Cloud Storage path for staging local files. Must be a valid Cloud Storage URL, beginning with gs\://.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
| Initial number of workers                    | The initial number of Google Compute Engine instances to use when executing your pipeline. This option determines how many workers the Dataflow service starts up when your job begins.                                                                                                                                                                                                                                                                                                                                                                                                                                    |
| Maximum number of workers                    | The maximum number of Compute Engine instances to be made available to your pipeline during execution. Note that this can be higher than the initial number of workers (specified by numWorkers to allow your job to scale up, automatically or otherwise.                                                                                                                                                                                                                                                                                                                                                                 |
| Auto-scaling algorithm                       | The autoscaling mode for your Dataflow job. Possible values are THROUGHPUT\_BASED to enable autoscaling, or NONE to disable. See [Autotuning features](https://cloud.google.com/dataflow/service/dataflow-service-desc#Autotuning) to learn more about how autoscaling works in the Dataflow managed service.                                                                                                                                                                                                                                                                                                              |
| Worker machine type                          | <p>The Compute Engine machine type that Dataflow uses when starting worker VMs. You can use any of the available Compute Engine machine type families as well as custom machine types.</p><p>For best results, use n1 machine types. Shared core machine types, such as f1 and g1 series workers, are not supported under the Dataflow Service Level Agreement.</p><p>Note that Dataflow bills by the number of vCPUs and GB of memory in workers. Billing is independent of the machine type family. Check the <a href="https://cloud.google.com/compute/docs/machine-types">list</a> of machine types for reference.</p> |
| Worker disk type                             | <p>The type of persistent disk to use, specified by a full URL of the disk type resource.</p><p>For example, use compute.googleapis.com/projects//zones//diskTypes/pd-ssd to specify a SSD persistent disk.</p><p><a href="https://cloud.google.com/compute/docs/disks#pdspecs">more</a>.</p>                                                                                                                                                                                                                                                                                                                              |
| Disk size in GB                              | The disk size, in gigabytes, to use on each remote Compute Engine worker instance. If set, specify at least 30 GB to account for the worker boot image and local logs.                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Region                                       | <p>Specifies a Compute Engine region for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs. The zone for workerRegion is <a href="https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#autozone">automatically assigned</a>.</p><p>Note: This option cannot be combined with workerZone or zone.</p><p>(<a href="https://cloud.google.com/dataflow/docs/concepts/regional-endpoints">regions list</a>).</p>                                                                               |
| Zone                                         | <p>Specifies a Compute Engine zone for launching worker instances to run your pipeline. This option is used to run workers in a different location than the region used to deploy, manage, and monitor jobs.</p><p>Note: This option cannot be combined with workerRegion or zone.</p>                                                                                                                                                                                                                                                                                                                                     |
| Network                                      | This is the GCE network for launching workers. For more information, see the reference documentation <https://cloud.google.com/compute/docs/networking>. The default is up to the Dataflow service.                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Subnetwork                                   | This is the GCE subnetwork for launching workers. For more information, see the reference documentation <https://cloud.google.com/compute/docs/networking>. The default is up to the Dataflow service.                                                                                                                                                                                                                                                                                                                                                                                                                     |
| Use public IPs?                              | Specifies whether worker pools should be started with public IP addresses. **WARNING**: This feature is experimental. You must be allowlisted to use it.                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Dataflow Service Options                     | This is a comma separated list of options in the format `option=value,option2=value2,…​`. It serves as a catch-all option for the Dataflow service. This helps if Beam is missing some service related options and if new options are added in the future older Beam versions could still use them through these catch-all options.                                                                                                                                                                                                                                                                                        |
| User agent                                   | A user agent string as per [RFC2616](https://tools.ietf.org/html/rfc2616), describing the pipeline to external services.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
| Temp location                                | Cloud Storage path for temporary files. Must be a valid Cloud Storage URL, beginning with gs\://.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Plugins to stage (, delimited)               | Comma separated list of plugins.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| Transform plugin classes                     | List of transform plugin classes.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| XP plugin classes                            | List of extensions point plugins.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |
| Streaming Hop transforms flush interval (ms) | The amount of time after which the internal buffer is sent completely over the network and emptied.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| Hop streaming transforms buffer size         | The internal buffer size to use.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
| Fat jar file location                        | Fat jar location. Generate a fat jar using `Tools → Generate a Hop fat jar`. The generated fat jar file name will be copied to the clipboard.                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |

**Environment Settings**

This environment variable need to be set locally.

```highlight
GOOGLE_APPLICATION_CREDENTIALS=/path/to/google-key.json
```

### Security considerations

To allow encrypted (TLS) network connections to, for example, Kafka and Neo4j Aura certain older security algorithms are [disabled on Dataflow](https://github.com/apache/hop/blob/master/plugins/engines/beam/src/main/java/org/apache/hop/beam/engines/dataflow/DataFlowJvmStart.java). This is done by setting security property `jdk.tls.disabledAlgorithms` to value: `Lv3, RC4, DES, MD5withRSA, DH keySize < 1024, EC keySize < 224, 3DES_EDE_CBC, anon, NULL`.

Please let us know if you have a need to make this configurable and we’ll look for a way to not hardcode this. Just create a Github Issue to let us know.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.primeur.com/data-shaper-1.21/knowing-the-data-shaper-designer/pipelines/pipeline-run-configurations/beam-dataflow-pipeline-engine.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
