Common Properties of Readers
- Readers allow you to specify the location of input data.
See examples of the File URL attribute for reading from local and remote files, through a proxy, from an input port and from a dictionary in Supported File URL Formats for Readers below.
- Readers allow you to view the source data. See Viewing Data on Readers below.
- Readers can read data from an input port, e.g. URLs of the files to be read. See Input Port Reading below.
- Readers can read only records that are new since the last graph run. See Incremental Reading below.
- Readers can skip a specific number of initial records or set a limit on the number of records to be read. See Selecting Input Records below.
- Readers allow you to configure a policy for parsing incomplete or invalid data records. See Data Policy below.
- Some readers can log information about errors.
- XML-reading components allow you to configure the parser. See XML Features below.
- In some Readers, a transformation can be or must be defined. For information about transformation templates for transformations written in CTL, see CTL Templates for Readers below.
- Similarly, for information about transformation interfaces that must be implemented in transformations written in Java, see Java Interfaces for Readers below.
COMPONENT | DATA SOURCE | INPUT PORTS | OUTPUT PORTS | EACH TO ALL OUTPUTS [1] | DIFFERENT TO DIFFERENT OUTPUTS [2] | TRANSFORMATION | TRANSF. REQ. | JAVA | CTL | AUTO-PROPAGATED METADATA |
---|---|---|---|---|---|---|---|---|---|---|
ComplexDataReader | flat file | 1 | 1-n | x | ✓ | ✓ | ✓ | ✓ | ✓ | x |
DatabaseReader | database | 0 | 1-n | ✓ | x | x | x | x | x | x |
DataGenerator | none | 0 | 1-n | x | ✓ | ✓ | ✓ | ✓ | ✓ | x |
EDIFACTReader | EDIFACT files | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
FlatFileReader | flat file | 0-1 | 1-2 | x | ✓ | x | x | x | x | x |
JSONExtract | JSON file | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
JSONReader | JSON file | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
LDAPReader | LDAP directory tree | 0 | 1-n | x | x | x | x | x | x | x |
MultiLevelReader | flat file | 1 | 1-n | x | ✓ | ✓ | ✓ | ✓ | x | x |
SpreadsheetDataReader | XLS(X) file | 0-1 | 1-2 | x | x | x | x | x | x | x |
UniversalDataReader | flat file | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
X12Reader | X12 files | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
XMLExtract | XML file | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
XMLReader | XML file | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
XMLXPathReader | XML file | 0-1 | 1-n | x | ✓ | x | x | x | x | x |
[1] The component sends each data record to all of the connected output ports.
[2] The component sends different data records to different output ports using return values of the transformation (DataGenerator and MultiLevelReader). For more information, see Return Values of Transformations. XMLExtract, XMLReader and XMLXPathReader send data to ports as defined in their Mapping or Mapping URL attribute.
Supported File URL Formats for Readers
The File URL attribute may be defined using the URL File Dialog.
Below are examples of possible URLs for Readers:
Reading of Local Files
- /path/filename.txt
Reads the specified file.
- /path1/filename1.txt;/path2/filename2.txt
Reads the two specified files.
- /path/filename?.txt
Reads all files matching the mask.
- /path/*
Reads all files in the specified directory.
- zip:(/path/file.zip)
Reads the first file compressed in the file.zip file.
- zip:(/path/file.zip)#innerfolder/filename.txt
Reads the specified file compressed in the file.zip file.
- gzip:(/path/file.gz)
Reads the first file compressed in the file.gz file.
- tar:(/path/file.tar)#innerfolder/filename.txt
Reads the specified file archived in the file.tar file.
- zip:(/path/file??.zip)#innerfolder?/filename.*
Reads all files matching the specified mask from the matching zip file(s). Wild cards (? and *) may be used in the compressed file names, inner folder names and inner file names.
- tar:(/path/file????.tar)#innerfolder??/filename*.txt
Reads all files matching the specified mask from the matching tar archive file(s). Wild cards (? and *) may be used in the compressed file names, inner folder names and inner file names.
- gzip:(/path/file*.gz)
Reads the first file from each gzip file matching the specified mask. Wild cards may be used in the compressed file names.
- tar:(gzip:/path/file.tar.gz)#innerfolder/filename.txt
Reads the specified file compressed in the file.tar.gz file.
Although Data Shaper can read data from a .tar file, writing to .tar files is not supported.
- tar:(gzip:/path/file??.tar.gz)#innerfolder?/filename*.txt
Reads all files matching the specified mask from the matching gzipped tar archive file(s). Wild cards (? and *) may be used in the compressed file names, inner folder names and inner file names.
- zip:(zip:(/path/name?.zip)#innerfolder/file.zip)#innermostfolder?/filename*.txt
Reads all files matching the file mask from all paths matching the path mask in all compressed files matching the specified zip mask. Wild cards (? and *) may be used in the outer compressed file names, innermost folder names and innermost file names. They cannot be used in the inner folder and inner zip file names.
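Several sources of any of the forms above can be combined into a single File URL using the default semicolon separator. A sketch with hypothetical paths:
/data/input1.txt;zip:(/data/archive.zip)#reports/*.txt
This would read input1.txt first and then every .txt file from the reports folder inside archive.zip.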
Reading of Remote Files
- ftp://username:password@server/path/filename.txt
Reads the specified filename.txt file on a remote server connected via the FTP protocol, using a username and password.
- sftp://username:password@server/path/filename.txt
Reads the specified filename.txt file on a remote server connected via the SFTP protocol, using a username and password.
If certificate-based authentication is used, the certificates are placed in the ${PROJECT}/ssh-keys/ directory. For more information, see the SFTP Certificate in Data Shaper section in URL file dialog.
Note that only certificates without a password are currently supported. With certificate-based authentication, the URL contains no password: sftp://username@server/path/filename.txt
- http://server/path/filename.txt
Reads the specified filename.txt file on a remote server connected via the HTTP protocol.
- https://server/path/filename.txt
Reads the specified filename.txt file on a remote server connected via the HTTPS protocol.
- zip:(ftp://username:password@server/path/file.zip)#innerfolder/filename.txt
Reads the specified filename.txt file compressed in the file.zip file on a remote server connected via the FTP protocol, using a username and password.
- zip:(http://server/path/file.zip)#innerfolder/filename.txt
Reads the specified filename.txt file compressed in the file.zip file on a remote server connected via the HTTP protocol.
- tar:(ftp://username:password@server/path/file.tar)#innerfolder/filename.txt
Reads the specified filename.txt file archived in the file.tar file on a remote server connected via the FTP protocol, using a username and password.
- zip:(zip:(ftp://username:password@server/path/name.zip)#innerfolder/file.zip)#innermostfolder/filename.txt
Reads the specified filename.txt file compressed in the file.zip file that is itself compressed in the name.zip file on a remote server connected via the FTP protocol, using a username and password.
- gzip:(http://server/path/file.gz)
Reads the first file compressed in the file.gz file on a remote server connected via the HTTP protocol.
- http://server/filename*.dat
Reads all files from a WebDAV server that match the specified mask (only * is supported).
- s3://access_key_id:
Reads all objects matching the specified mask from the given bucket of the Amazon S3 storage service, using an access key ID and a secret access key.
It is recommended to connect to S3 via a region-specific S3 URL: s3://s3.eu-central-1.amazonaws.com/bucket.name/. The region-specific URL has much better performance than the generic one (s3://s3.amazonaws.com/bucket.name/).
See the recommendation on Amazon S3 URLs in URL file dialog.
- az-blob://account:
Reads all objects matching the specified mask from the specified container in the Microsoft Azure Blob Storage service.
Connects using the specified Account Key. See Azure Blob Storage in URL file dialog for other authentication options.
- hdfs://CONN_ID/path/filename.dat
Reads a file from the Hadoop distributed file system (HDFS). The HDFS NameNode to connect to is defined by the Hadoop connection with the ID CONN_ID. This example URL reads the file with the absolute HDFS path /path/filename.dat.
- smb://domain%3Buser:password@server/path/filename.txt
Reads files from a Windows share (Microsoft SMB/CIFS protocol) version 1. The URL path may contain wildcards (both * and ? are supported). The server part may be a DNS name, an IP address or a NetBIOS name. The userinfo part of the URL (domain%3Buser:password) is not mandatory, and any URL-reserved character it contains should be escaped using %-encoding, as the semicolon (;) is escaped with %3B in the example (the semicolon collides with the default Data Shaper file URL separator).
The SMB protocol is implemented in the JCIFS library, which may be configured using Java system properties. See Setting Client Properties in the JCIFS documentation for the list of all configurable properties.
- smb2://domain%3Buser:password@server/path/filename.txt
Reads files from a Windows share (Microsoft SMB/CIFS protocol) versions 2 and 3.
The SMB2 protocol is implemented in the SMBJ library.
Reading from Input Port
- port:$0.FieldName:discrete
Each data record field from the input port represents one particular data source.
- port:$0.FieldName:source
Each data record field from the input port represents a URL to be loaded and parsed.
- port:$0.FieldName:stream
Input port field values are concatenated and processed as input file(s); null values are replaced by an EOF.
See also Input Port Reading below.
Using Proxy in Readers
- http:(direct:)//seznam.cz
Without a proxy.
- http:(proxy://user:
Proxy setting for the HTTP protocol.
- ftp:(proxy://user:password@proxyserver:1234)//seznam.cz
Proxy setting for the FTP protocol.
- sftp:(proxy://66.11.122.193:443)//user:password@server/path/file.dat
Proxy setting for the SFTP protocol.
- s3:(proxy://user:
Proxy setting for the S3 protocol.
Reading from Dictionary
- dict:keyName:discrete [1]
Reads data from the dictionary.
- dict:keyName:source [1]
Reads data from the dictionary in the same way as the discrete processing type, but expects the dictionary values to be input file URLs. The data from these inputs is passed to the Reader.
[1] The Reader finds out the type of the source value from the dictionary and creates a readable channel for the parser. The Reader supports the following types of sources: InputStream, byte[], ReadableByteChannel, CharSequence, CharSequence[], List.
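As an illustration, suppose the graph dictionary contains an entry named inputText (a hypothetical name) that an earlier phase fills with delimited data, e.g. in CTL: dictionary.inputText = "1;Alice\n2;Bob";. A Reader can then parse that value directly with the following File URL:
dict:inputText:discrete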
Sandbox Resource as Data Source
A sandbox resource, whether it is a shared, local or partitioned sandbox, is specified in the graph under the File URL attribute as a so-called sandbox URL:
sandbox://data/path/to/file/file.dat
where data is the code of the sandbox and path/to/file/file.dat is the path to the resource from the sandbox root. The URL is evaluated by Data Shaper Server during graph execution, and the component (reader or writer) obtains an opened stream from the Server. This may be a stream to a local file or to some other remote resource, so the graph does not have to run on the node that has local access to the resource. Several sandbox resources may be used in the graph, each of them possibly on a different node. In such cases, Data Shaper Server chooses the node with the most local resources to minimize remote streams.
The sandbox URL has a specific use for parallel data processing. When the sandbox URL with the resource in a partitioned sandbox is used, that part of the graph/phase runs in parallel, according to the node allocation specified by the list of partitioned sandbox locations. Thus, each worker has its own local sandbox resource. Data Shaper Server evaluates the sandbox URL on each worker and provides an open stream to a local resource to the component.
Viewing Data on Readers
To view data on a Reader, click the Reader; the records are displayed in the Data Inspector tab.
If Data Inspector is not opened, right-click the desired component and select Inspect Data from the context menu.
See Data Inspector.
The same can be done in some Writers, but only after the output file has been created. See Viewing Data on Writers.
Input Port Reading
Input port reading allows you to read file names or data from an optional input port. This feature is available in most Readers.
To use input port mapping, connect an edge to an input port. Assign metadata to the edge. In the Reader, edit the File URL attribute.
The attribute value has the syntax port:$0.FieldName[:processingType]. You can enter the value directly or with the help of the URL File Dialog.
Here processingType is optional and defines whether the data is processed as plain data or as URL addresses. It can be source, discrete, or stream. If not set explicitly, discrete is applied by default.
Processing Type
- discrete
Each data record field from the input port represents one particular data source.
- source
Each data record field from the input port represents a URL to be loaded and parsed.
- stream
All data fields from the input port are concatenated and processed as one input file. If a null value is encountered in the field, it is replaced by an EOF; the following data record fields are then parsed as another input file in the same way, i.e. until the next null value. The Reader starts parsing the data as soon as the first bytes arrive on the port and processes them progressively until EOF. For more information about writing with the stream processing type, see Output Port Writing.
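For example, with the File URL set as below and the input edge metadata containing a string field named url (an illustrative name), the Reader downloads and parses each URL that arrives on the input port:
port:$0.url:source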
Input Port Metadata
In input port reading, only metadata fields of certain data types can be used: the type of the FieldName input field can only be string, byte or cbyte.
Processing of Input Port Record
When the graph runs, data is read from the original data source (according to the metadata of the edge connected to the Reader's optional input port) and received by the Reader through that port. Each record is read independently of the other records. The specified field of each record is processed by the Reader according to the output metadata.
Readers with Input Port Reading
Remember that port reading can also be used by DBExecute for receiving SQL commands; the Query URL is then port:$0.fieldName:discrete. An SQL command can also be read from a file: its name, including the path, is passed to DBExecute from an input port, and the Query URL attribute should then be port:$0.fieldName:source.
Incremental Reading
Incremental reading is a way to read only the records that are new since the last graph run, so you can avoid reading already processed records. Incremental reading works with new records from a single file as well as from multiple files; if the file URL also matches files that did not exist during the previous run, records from those new files are read too.
Incremental reading is set up with the Incremental file and Incremental key attributes. The Incremental key is a string that holds information about the records/files already read; it is stored in the file specified by the Incremental file attribute. This way, the component reads only records or files that have not yet been marked in the Incremental file.
Readers with incremental reading are:
DatabaseReader, which reads data from databases, performs incremental reading in a different way, described below.
Incremental Reading in Database Components
Unlike other incremental readers, a database component can evaluate and use several database columns as key fields. The Incremental key is a sequence of individual expressions separated by semicolons, each of the form keyname=FUNCTIONNAME(db_field)[!InitialValue] (e.g. key01=MAX(EmployeeID);key02=FIRST(CustomerID)!20). The available functions are FIRST, LAST, MIN and MAX.
At the same time, when you define an Incremental key, you also need to add these key parts to the Query: part of the where clause of the query references them, e.g. where db_field1 > #key01 and db_field2 < #key02. This way, you can limit which records will be read next time, depending on the values of db_field1 and db_field2.
Only the records that satisfy the condition specified by the query will be read. The values of these key fields are stored in the Incremental file. To define the Incremental key, click its attribute row and, using the Plus or Minus buttons in the Define incremental key dialog, add or remove key names; then select DB field names and function names, each from a combo box of possible values.
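A minimal sketch of the two attributes working together, with a hypothetical table and column:
Incremental key: key01=MAX(OrderID)!0
SQL query: select * from orders where OrderID > #key01
On each run, the query returns only the rows whose OrderID exceeds the value stored under key01; after the run, the new maximum is saved in the Incremental file.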
Selecting Input Records
Some readers allow you to limit the number of records that should be read. You can set up:
- Maximum number of records to be read
- Number of records to be skipped
- Maximum number of records to be read per source
- Number of records to be skipped per source
The last two constraints can be defined only in Readers that allow reading multiple files at the same time. In these Readers, you can define the records that should be read for each input file separately as well as for all of the input files in total.
Per Reader Configuration
The maximum number of records to be read is defined by the Max number of records attribute. The number of records to be skipped is defined by the Number of skipped records attribute. The records are skipped and/or read continuously throughout all input files, independently of the values of the per source file attributes.
Per Source File Configuration
In some components, you can also specify how many records should be skipped and/or read from each input file. To do this, set up the following two attributes: Number of skipped records per source and/or Max number of records per source.
Combination of per File and per Reader Configuration
If you set up both: per file and per reader limits; firstly, the per file limits are applied. Secondly, the per reader limits are applied.
For example, suppose there are two input files, the first containing records 1-5 and the second containing records 6-10:
File 1:
1 | Alice
2 | Bob
3 | Carolina
4 | Daniel
5 | Eve
File 2:
6 | Filip
7 | Gina
8 | Henry
9 | Isac
10 | Jane
And the reader has the following configuration:
ATTRIBUTE | VALUE |
---|---|
Number of skipped records | 2 |
Max number of records | 5 |
Number of skipped records per source | 1 |
Max number of records per source | 3 |
The files are read in the following way:
- From each file, the first record is skipped and the next three records are read (records 2, 3, 4 and 7, 8, 9).
- Of the six records read in the previous step, the first two are skipped. The following records (at most five, the value of Max number of records) are sent to the output.
The records read by the reader are:
4 | Daniel
7 | Gina
8 | Henry
9 | Isac
The example shows that you can read even fewer records than the number specified in the Max number of records attribute.
The total number of records that are skipped equals Number of skipped records per source multiplied by the number of source files, plus Number of skipped records.
The total number of records that are read is limited both by Max number of records per source multiplied by the number of source files and by Max number of records.
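Applied to the example above: 1 × 2 + 2 = 4 records are explicitly skipped (records 1 and 6 per source, then records 2 and 3 globally), and of the 3 × 2 − 2 = 4 remaining candidates, all fit under the limit of five, so records 4, 7, 8 and 9 are read.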
The Readers that allow limiting the records for both individual input file and all input files in total are:
The following two Readers allow you to limit the total number of records by using the Number of skipped mappings and/or Max number of mappings attributes. What is called a mapping here is a subtree that should be mapped and sent out through the output ports.
- XMLExtract; in addition, this component allows you to use the skipRows and/or numRecords attributes of individual XML elements.
- XMLXPathReader; in addition, this component allows you to use the XPath language to limit the number of mapped XML structures.
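For illustration, in an XMLExtract mapping these two attributes sit directly on the mapped element (the element name and output port here are hypothetical):
<Mapping element="employee" outPort="0" skipRows="2" numRecords="5">
</Mapping>
This would skip the first two employee subtrees and map at most five of them to output port 0.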
The following Readers allow limiting the numbers in a different way:
- The DatabaseReader component can use the SQL query or Query URL attribute to limit the number of records.
The following Readers do not allow limiting the number of records that should be read (they read them all):
Data Policy
Data Policy affects processing (parsing) of incorrect or incomplete records. This can be specified by the Data Policy attribute. There are three options:
- Strict. This data policy is set by default. Data parsing stops as soon as a record field with an incorrect value or format is read, and further processing is aborted.
- Controlled. Every error is logged, but incorrect records are skipped and data parsing continues. Generally, incorrect records with error information are logged to stdout. Only FlatFileReader, JSONReader and SpreadsheetDataReader can send them out through the optional second port; see the error metadata descriptions of these Readers.
- Lenient. Incorrect records are simply skipped and data parsing continues.
Data policy can be set in the following Readers:
- ComplexDataReader
- DatabaseReader
- FlatFileReader
- JSONReader
- MultiLevelReader
- XMLReader
- XMLXPathReader
XML Features
In XMLExtract, XMLReader and XMLXPathReader, you can configure the validation of your input XML files by specifying the Xml features attribute. The Xml features attribute configures validation of the XML in more detail by enabling or disabling specific checks; see Parser Features. It is expressed as a sequence of individual expressions of the form nameM:=true or nameN:=false, where each nameM is an XML feature to be validated. The expressions are separated from each other by semicolons.
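For example, the following value (using standard SAX parser feature names for illustration) enables namespace processing and disables validation:
http://xml.org/sax/features/namespaces:=true;http://xml.org/sax/features/validation:=false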
The options for validation are the following:
- Custom parser setting
- Default parser setting
- No validations
- All validations
You can define this attribute using the following dialog:
In this dialog, you can add features using the Plus button, select their true or false values, etc.
CTL Templates for Readers
DataGenerator requires a transformation which can be written in both CTL and Java.
For more information about the transformation template, see CTL Templates for DataGenerator.
Remember that this component allows sending each record through the connected output port whose number equals the value returned by the transformation (see Return Values of Transformations). Mapping must be defined for such a port.
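A minimal sketch of such a transformation in CTL, assuming output port 0 has metadata with a single string field named name (the field and value are illustrative):
//#CTL2
function integer generate() {
	// fill the fields of the generated record
	$out.0.name = "example";
	// return OK, or the number of the output port to send the record to
	return OK;
}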
Java Interfaces for Readers
- DataGenerator requires a transformation, which can be written in both CTL and Java.
For more information about the interface, see Java Interface.
Remember that this component allows sending each record through the connected output port whose number equals the value returned by the transformation (see Return Values of Transformations). Mapping must be defined for such a port.
- MultiLevelReader requires a transformation which can only be written in Java.
For more information, see Java Interfaces for MultiLevelReader.