Parquet File Output
 Parquet File Output
 Parquet File Output
 Parquet File Output
 Parquet File OutputDescription
The Parquet File Output transform writes data into the Apache Parquet file format.
For more information on this see: Apache Parquet.
Hop Engine
✓
Spark
✓
Flink
✓
Dataflow
✓
Options
Notes:
- The date optionally referenced in the output file name(s) will be the start of the pipeline execution. 
- Hop Date types are serialized as EPOC: milliseconds since - 1970-01-01 00:00:00.000
- Strings are written as binary in UTF-8 
- Compression of data into columnar format is being done in memory. This happens when all rows are written. To not run out of memory make sure to specify a split size. 
Transform name
Name of the transform this name has to be unique in a single pipeline.
Base file name
Specify the base filename. This is composed of where you want to write the Parquet file to as well as the start of the filename. Examples:
Write to Amazon AWS S3 : s3://my-bucket-name/transactions
Write to a local folder : /my/folder/customer-data
Extension
This is the extension of the file. Usually this is simply snappy
Include date?
Check this box if you want to include the date in the filename with mask yyyMMdd
Include time?
Check this box if you want to include the time in the filename with mask HHmmss
Include date-time-format?
Check this box if you want to include a specific custom date-time format in the filename
Include transform copy number?
Enable this option if you run this transform in multiple copies to not have multiple threads write to the same file. The copy number is formatted with mask 00
Split into parts and include number?
Enable this option if you want to split the output into multiple parts. Specify a split size larger than 0 and this is then the number of rows per file. The file part (split) number will be included in the filename to make sure that the same file is not being overwritten. The split number is formatted with mask 0000
Compression codec
Here you can indicate which compression codec you want to use. The default is SNAPPY for Apache Snappy compression.
Version
Choose the protocol version of Parquet (1.0 or 2.0)
Row group size
The amount of rows in a group
Data page size
The data page size on a 1kB boundary (default is 1048576)
Dictionary page size
The data dictionary page size on a 1kB boundary (default is 1048576)
Fields
You can specify which fields to write and in which order. You can use the "Get Fields" button to populate the dialog.
Last updated
