Apache Tika

Description

The Apache Tika transform parses files in a wide range of formats and extracts their text content along with any available metadata. It uses the Apache Tika libraries to do the parsing.

The extracted metadata is returned in JSON format. If you need specific pieces of information from this metadata, you can extract them with a JSON Input transform.
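To illustrate what picking specific pieces out of the metadata means, here is a minimal sketch in Python. It runs outside Hop and only mirrors the idea behind the JSON Input transform. The sample JSON string, its keys and values, and the helper name `pick_fields` are assumptions made for this example, not output captured from the transform.

```python
import json

# Hypothetical metadata string, shaped like a flat JSON object of metadata keys.
# The keys and values are illustrative and do not come from a real file.
metadata_json = """
{
  "Content-Type": "application/pdf",
  "dc:title": "Quarterly Report",
  "dc:creator": ["Alice Example", "Bob Example"]
}
"""


def pick_fields(raw_json, wanted):
    """Return only the requested keys; keys absent from the metadata map to None."""
    metadata = json.loads(raw_json)
    return {key: metadata.get(key) for key in wanted}


# Ask for two keys that exist and one that does not.
selected = pick_fields(metadata_json, ["Content-Type", "dc:title", "xmpTPg:NPages"])
print(selected)
```

Which metadata keys are present depends on the file type, so a lookup that tolerates missing keys, as `metadata.get` does here, is usually the safer choice.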

Supported Engines

Hop Engine

✓

Spark

Flink

Dataflow

Options

Option

Description

Transform name

Name of the transform. Note: This name must be unique within a single pipeline.

File tab

On this tab you specify which files are read and examined.

Content tab

This tab holds the content settings, such as the file encoding and the output format.

Output fields tab

On this tab you enter the names of the fields you want in the output.

