Common Properties of Readers

  • Readers allow you to specify the location of input data. For examples of the File URL attribute for reading from local and remote files, through a proxy, from an input port and from a dictionary, see Supported File URL Formats for Readers below.
  • Readers allow you to view the source data. See Viewing Data on Readers below.
  • Readers can read data from an input port, e.g. the URLs of files to be read. See Input Port Reading below.
  • Readers can read only records that are new since the last run. See Incremental Reading below.
  • Readers can skip a specific number of initial records or set a limit on the number of records to be read. See Selecting Input Records below.
  • Readers allow you to configure a policy for parsing incomplete or invalid data records. See Data Policy below.
  • Some Readers can log information about errors.
  • XML-reading components allow you to configure the parser. See XML Features below.
  • In some Readers, a transformation can be or must be defined. For information about transformation templates for transformations written in CTL, see CTL Templates for Readers below.
  • Similarly, for information about the transformation interfaces that must be implemented in transformations written in Java, see Java Interfaces for Readers below.
COMPONENT             | DATA SOURCE         | INPUT PORTS | OUTPUT PORTS | EACH TO ALL OUTPUTS [1] | DIFFERENT TO DIFFERENT OUTPUTS [2] | TRANSFORMATION | TRANSF. REQ. | JAVA | CTL | AUTO-PROPAGATED METADATA
ComplexDataReader     | flat file           | 1           | 1-n          | x                       | βœ“                                  | βœ“              | βœ“            | βœ“    | βœ“   | x
DatabaseReader        | database            | 0           | 1-n          | βœ“                       | x                                  | x              | x            | x    | x   | x
DataGenerator         | none                | 0           | 1-n          | x                       | βœ“                                  | βœ“              | βœ“            | βœ“    | βœ“   | x
EDIFACTReader         | EDIFACT files       | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
FlatFileReader        | flat file           | 0-1         | 1-2          | x                       | βœ“                                  | x              | x            | x    | x   | x
JSONExtract           | JSON file           | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
JSONReader            | JSON file           | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
LDAPReader            | LDAP directory tree | 0           | 1-n          | x                       | x                                  | x              | x            | x    | x   | x
MultiLevelReader      | flat file           | 1           | 1-n          | x                       | βœ“                                  | βœ“              | βœ“            | βœ“    | x   | x
SpreadsheetDataReader | XLS(X) file         | 0-1         | 1-2          | x                       | x                                  | x              | x            | x    | x   | x
UniversalDataReader   | flat file           | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
X12Reader             | X12 files           | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
XMLExtract            | XML file            | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
XMLReader             | XML file            | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x
XMLXPathReader        | XML file            | 0-1         | 1-n          | x                       | βœ“                                  | x              | x            | x    | x   | x

[1] The component sends each data record to all of the connected output ports.
[2] The component sends different data records to different output ports using return values of the transformation (DataGenerator and MultiLevelReader). For more information, see Return Values of Transformations. XMLExtract, XMLReader and XMLXPathReader send data to ports as defined in their Mapping or Mapping URL attribute.

Supported File URL Formats for Readers

The File URL attribute may be defined using the URL File Dialog.

Warning!

To ensure graph portability, forward slashes must be used when defining the path in URLs (even on Microsoft Windows).

Below are examples of possible URLs for Readers:

Reading of Local Files

  • /path/filename.txt
    Reads a specified file.

  • /path1/filename1.txt;/path2/filename2.txt
    Reads two specified files.

  • /path/filename?.txt
    Reads all files satisfying the mask.

  • /path/*
    Reads all files in a specified directory.

  • zip:(/path/file.zip)
    Reads the first file compressed in the file.zip file.

  • zip:(/path/file.zip)#innerfolder/filename.txt
    Reads a specified file compressed in the file.zip file.

  • gzip:(/path/file.gz)
    Reads the first file compressed in the file.gz file.

  • tar:(/path/file.tar)#innerfolder/filename.txt
    Reads a specified file archived in the file.tar file.

  • zip:(/path/file??.zip)#innerfolder?/filename.*
    Reads all files from the compressed zip file(s) that satisfy the specified mask. Wild cards (? and *) may be used in the compressed file names, inner folder and inner file names.

  • tar:(/path/file????.tar)#innerfolder??/filename*.txt
    Reads all files from the archive file(s) that satisfy the specified mask. Wild cards (? and *) may be used in the compressed file names, inner folder and inner file names.

  • gzip:(/path/file*.gz)
    Reads the files compressed in all .gz files that satisfy the specified mask. Wild cards may be used in the compressed file names.

  • tar:(gzip:/path/file.tar.gz)#innerfolder/filename.txt
    Reads a specified file compressed in the file.tar.gz file.



Note

Although Data Shaper can read data from a .tar file, writing to .tar files is not supported.

  • tar:(gzip:/path/file??.tar.gz)#innerfolder?/filename*.txt
    Reads all files from the gzipped tar archive file(s) that satisfy the specified mask. Wild cards (? and *) may be used in the compressed file names, inner folder and inner file names.

  • zip:(zip:(/path/name?.zip)#innerfolder/file.zip)#innermostfolder?/filename*.txt
    Reads all files satisfying the file mask from all paths satisfying the path mask from all compressed files satisfying the specified zip mask. Wild cards (? and *) may be used in the outer compressed files, innermost folder and innermost file names. They cannot be used in the inner folder and inner zip file names.
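The nested-archive URLs above can be pictured with an ordinary archive API. A minimal sketch in Python (illustrative only, not Data Shaper's implementation; the paths are placeholders):

    import fnmatch
    import zipfile

    def read_zip_entries(archive_path, entry_mask):
        """Yield (name, bytes) of every zip entry matching the mask (? and *)."""
        with zipfile.ZipFile(archive_path) as zf:
            for name in zf.namelist():
                if fnmatch.fnmatch(name, entry_mask):
                    yield name, zf.read(name)

    # zip:(/path/file.zip)#innerfolder/filename.txt
    for name, data in read_zip_entries("/path/file.zip",
                                       "innerfolder/filename.txt"):
        print(name, len(data))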

Reading of Remote Files

  • ftp://username:password@server/path/filename.txt
    Reads a specified filename.txt file on a remote server connected via the FTP protocol, using a username and password.

  • sftp://username:password@server/path/filename.txt
    Reads a specified filename.txt file on a remote server connected via the SFTP protocol, using a username and password.
    If certificate-based authentication is used, certificates are placed in the ${PROJECT}/ssh-keys/ directory. For more information, see SFTP Certificate in Data Shaper section in URL file dialog.
    Note that only certificates without a password are currently supported. With certificate-based authentication, the URL contains no password: sftp://username@server/path/filename.txt

  • http://server/path/filename.txt
    Reads a specified filename.txt file on a remote server connected via the HTTP protocol.

  • https://server/path/filename.txt
    Reads a specified filename.txt file on a remote server connected via the HTTPS protocol.

  • zip:(ftp://username:password@server/path/file.zip)#innerfolder/filename.txt
    Reads a specified filename.txt file compressed in the file.zip file on a remote server connected via the FTP protocol, using a username and password.

  • zip:(http://server/path/file.zip)#innerfolder/filename.txt
    Reads a specified filename.txt file compressed in the file.zip file on a remote server connected via the HTTP protocol.

  • tar:(ftp://username:password@server/path/file.tar)#innerfolder/filename.txt
    Reads a specified filename.txt file archived in the file.tar file on a remote server connected via the FTP protocol, using a username and password.

  • zip:(zip:(ftp://username:password@server/path/name.zip)#innerfolder/file.zip)#innermostfolder/filename.txt
    Reads a specified filename.txt file compressed in the file.zip file that is itself compressed in the name.zip file on a remote server connected via the FTP protocol, using a username and password.

  • gzip:(http://server/path/file.gz)
    Reads the first file compressed in the file.gz file on a remote server connected via the HTTP protocol.

  • http://server/filename*.dat
    Reads all files from a WebDAV server that satisfy the specified mask (only * is supported).

  • s3://access_key_id:
    Reads all objects that satisfy the specified mask from a given bucket of the Amazon S3 web storage service, using an access key ID and a secret access key.
    It is recommended to connect to S3 via a region-specific S3 URL, e.g. s3://s3.eu-central-1.amazonaws.com/bucket.name/. A region-specific URL has much better performance than the generic one (s3://s3.amazonaws.com/bucket.name/).
    See the recommendation on Amazon S3 URLs in URL file dialog.

  • az-blob://account:
    Reads all objects matching the specified mask from the specified container in Microsoft Azure Blob Storage service.
    Connects using the specified Account Key. See Azure Blob Storage in URL file dialog for other authentication options.

  • hdfs://CONN_ID/path/filename.dat
    Reads a file from the Hadoop distributed file system (HDFS). The HDFS NameNode to connect to is defined by the Hadoop connection with the ID CONN_ID. This example URL reads the file with the absolute HDFS path /path/filename.dat.

  • smb://domain%3Buser:password@server/path/filename.txt
    Reads files from a Windows share (Microsoft SMB/CIFS protocol) version 1. The URL path may contain wildcards (both * and ? are supported). The server part may be a DNS name, an IP address or a NetBIOS name. The userinfo part of the URL (domain%3Buser:password) is not mandatory, and any URL-reserved character it contains must be escaped with %-encoding, as the semicolon ; is escaped with %3B in this example (the semicolon collides with the default Data Shaper file URL separator); see the escaping sketch below this list.
    The SMB protocol is implemented in the JCIFS library, which may be configured using Java system properties. See Setting Client Properties in the JCIFS documentation for the list of all configurable properties.

  • smb2://domain%3Buser:password@server/path/filename.txt
    Reads files from a Windows share (Microsoft SMB/CIFS protocol) versions 2 and 3.
    The SMB2 protocol is implemented in the SMBJ library.
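For illustration, the userinfo escaping can be produced with any URL-encoding routine. A minimal sketch in Python (the domain, user, password and server names are placeholders):

    from urllib.parse import quote

    # Escape URL-reserved characters in the userinfo part; ';' becomes %3B
    # so it does not collide with the default file URL separator.
    userinfo = quote("domain;user", safe="")  # -> 'domain%3Buser'
    url = f"smb://{userinfo}:password@server/path/filename.txt"
    print(url)  # smb://domain%3Buser:password@server/path/filename.txt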

Reading from Input Port

  • port:$0.FieldName:discrete
    Each data record field from an input port represents one particular data source.

  • port:$0.FieldName:source
    Each data record field from an input port represents a URL to be loaded and parsed.

  • port:$0.FieldName:stream
    Input port field values are concatenated and processed as one input file; a null value is treated as an end of file (EOF).

See also Input Port Reading below.

Using Proxy in Readers

  • http:(direct:)//seznam.cz
    Without a proxy.

  • http:(proxy://user:
    Proxy setting for the HTTP protocol.

  • ftp:(proxy://user:password@proxyserver:1234)//seznam.cz
    Proxy setting for the FTP protocol.

  • sftp:(proxy://66.11.122.193:443)//user:password@server/path/file.dat
    Proxy setting for the SFTP protocol.

  • s3:(proxy://user:
    Proxy setting for the S3 protocol.

Reading from Dictionary

  • dict:keyName:discrete [1]
    Reads data from the dictionary.

  • dict:keyName:source [1]
    Reads data from the dictionary in the same way as the discrete processing type, but expects the dictionary values to be input file URLs. The data from these inputs is passed to the Reader.

[1] The Reader determines the type of the source value in the dictionary and creates a readable channel for the parser. The following source types are supported: InputStream, byte[], ReadableByteChannel, CharSequence, CharSequence[], List<CharSequence>.
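As an illustration of the footnote above, the dispatch can be pictured in Python (a sketch only, with the Java source types noted in comments; not the actual Reader code):

    from io import BytesIO, StringIO

    def to_readable(value):
        """Turn a dictionary value into a readable stream for the parser."""
        if isinstance(value, bytes):          # byte[]
            return BytesIO(value)
        if isinstance(value, str):            # CharSequence
            return StringIO(value)
        if isinstance(value, (list, tuple)):  # CharSequence[], List<CharSequence>
            return StringIO("".join(value))
        if hasattr(value, "read"):            # InputStream, ReadableByteChannel
            return value
        raise TypeError(f"unsupported dictionary source: {type(value)}")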

Sandbox Resource as Data Source

A sandbox resource, whether it is a shared, local or partitioned sandbox, is specified in the graph by the fileURL attribute as a so-called sandbox URL:
sandbox://data/path/to/file/file.dat

where data is the code of the sandbox and path/to/file/file.dat is the path to the resource relative to the sandbox root. The URL is evaluated by Data Shaper Server during graph execution, and the component (Reader or Writer) obtains an opened stream from the Server. This may be a stream to a local file or to some other remote resource, so a graph does not have to run on the node that has local access to the resource. A graph may use multiple sandbox resources, each of which may be on a different node. In such cases, Data Shaper Server chooses the node with the most local resources to minimize remote streams.

The sandbox URL has a specific use in parallel data processing. When a sandbox URL pointing to a resource in a partitioned sandbox is used, that part of the graph/phase runs in parallel, according to the node allocation specified by the list of partitioned sandbox locations. Each worker thus has its own local sandbox resource: Data Shaper Server evaluates the sandbox URL on each worker and provides the component with an open stream to the local resource.

Viewing Data on Readers

To view data in a Reader, click the Reader; the records are displayed in the Data Inspector tab.
If Data Inspector is not open, right-click the desired component and select Inspect Data from the context menu.
See Data Inspector.
The same can be done in some Writers, but only after the output file has been created. See Viewing Data on Writers.

Input Port Reading

Input port reading allows you to read file names or data from an optional input port. This feature is available in most Readers.
To use input port mapping, connect an edge to the input port and assign metadata to the edge. Then, in the Reader, edit the File URL attribute.
The attribute value has the syntax port:$0.FieldName[:processingType]. You can enter the value directly or with the help of the URL File Dialog.
The optional processingType defines whether the field value is processed as plain data or as URL addresses. It can be source, discrete or stream; if not set explicitly, discrete is applied by default.

Processing Type

  • discrete
    Each data record field from an input port represents one particular data source.

  • source
    Each data record field from an input port represents a URL to be loaded and parsed.

  • stream
    All data fields from an input port are concatenated and processed as one input file. When a null value is encountered, it is treated as an EOF, and the following data record fields are parsed as another input file in the same way, again until a null value is met. The Reader starts parsing data as soon as the first bytes arrive on the port and processes them progressively until EOF. The three types are sketched below. For more information about writing with the stream processing type, see Output Port Writing.
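The three processing types can be pictured with a short Python sketch (an illustration only: fields arrive as a list of byte values with None standing for null, and fetch is a hypothetical loader for URLs):

    from io import BytesIO

    def discrete(values):
        """discrete: each field value is one separate data source."""
        return [BytesIO(v) for v in values if v is not None]

    def source(values, fetch):
        """source: each field value is a URL; fetch(url) loads its content."""
        return [BytesIO(fetch(url)) for url in values if url is not None]

    def stream(values):
        """stream: values are concatenated into one input file; a null value
        acts as an EOF and starts another input file."""
        files, current = [], b""
        for v in values:
            if v is None:
                files.append(BytesIO(current))
                current = b""
            else:
                current += v
        if current:
            files.append(BytesIO(current))
        return files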

Input Port Metadata

In input port reading, only metadata fields of particular data types can be used: the FieldName input field must be of type string, byte or cbyte.

Processing of Input Port Record

When the graph runs, data is read from the original data source (according to the metadata of the edge connected to the optional input port of the Reader) and received by the Reader through this input port. Each record is read independently of the other records. The specified field of each record is processed by the Reader according to the output metadata.

Readers with Input Port Reading

Remember that port reading can also be used by DBExecute to receive SQL commands; the Query URL is then port:$0.fieldName:discrete. An SQL command can also be read from a file whose name, including the path, is passed to DBExecute from an input port; the Query URL attribute should then be port:$0.fieldName:source.

Incremental Reading

Incremental reading is a way to read only the records that are new since the last graph run, so that already processed records are not read again. It works both for a single file and for multiple files; if the file URL also matches newly added files, records from those new files are read as well.

Incremental reading is set up with the Incremental file and Incremental key attributes. The Incremental key is a string that holds information about the records/files already read; it is stored in the file specified by the Incremental file attribute. This way, the component reads only the records or files that have not yet been marked in the incremental file.

Readers with incremental reading are:

The DatabaseReader component, which reads data from databases, performs incremental reading in a different way.

Incremental Reading in Database Components

Unlike in other incremental Readers, in a database component more database columns can be evaluated and used as key fields. The Incremental key is a sequence of individual expressions of the form keyname=FUNCTIONNAME(db_field)[!InitialValue], separated by semicolons (e.g. key01=MAX(EmployeeID);key02=FIRST(CustomerID)!20). The available functions are FIRST, LAST, MIN and MAX.

At the same time, when you define an Incremental key, you also need to reference the key parts in the Query: they appear in the WHERE clause, e.g. where db_field1 > #key01 and db_field2 < #key02. This limits which records will be read next time, depending on the values of db_field1 and db_field2.

Only the records that satisfy the condition specified by the query are read. The key field values are stored in the Incremental file. To define the Incremental key, click the attribute row and, using the Plus and Minus buttons in the Define incremental key dialog, add or remove key names and select db field names and function names; the latter two are selected from combo lists of possible values.
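The mechanism can be sketched as follows (an illustration only, not the actual component; the file name, table and column names are made up):

    import json, os

    INC_FILE = "incremental.txt"  # stands in for the Incremental file

    def load_key(name, initial):
        """Return the stored key value, or the initial value on the first run."""
        if os.path.exists(INC_FILE):
            with open(INC_FILE) as f:
                return json.load(f).get(name, initial)
        return initial

    def save_key(name, value):
        state = {}
        if os.path.exists(INC_FILE):
            with open(INC_FILE) as f:
                state = json.load(f)
        state[name] = value
        with open(INC_FILE, "w") as f:
            json.dump(state, f)

    # Incremental key: key01=MAX(EmployeeID)!0
    key01 = load_key("key01", 0)
    # Query: select * from employees where EmployeeID > #key01
    query = f"select * from employees where EmployeeID > {key01}"
    rows = [(7, "Ann"), (9, "Bob")]  # pretend result of running the query
    if rows:
        save_key("key01", max(r[0] for r in rows))  # evaluate MAX(EmployeeID)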

Selecting Input Records

Some readers allow you to limit the number of records that should be read. You can set up:

  • Maximum number of records to be read
  • Number of records to be skipped
  • Maximum number of records to be read per source
  • Number of records to be skipped per source

The last two constraints can be defined only in Readers that allow reading multiple files at once. In these Readers, you can define the records to be read for each input file separately and for all input files in total.

Per Reader Configuration

The maximum number of records to be read is defined by the Max number of records attribute. The number of records to be skipped is defined by the Number of skipped records attribute. These records are skipped and/or read continuously throughout all input files, independently of the values of the per source file attributes.

Per Source File Configuration

In some components, you can also specify how many records should be skipped and/or read from each input file. To do this, set up the following two attributes: Number of skipped records per source and/or Max number of records per source.

Combination of per File and per Reader Configuration

If you set up both per file and per reader limits, the per file limits are applied first and the per reader limits second.

For example, there are two files:
1  | Alice
2  | Bob
3  | Carolina
4  | Daniel
5  | Eve

6  | Filip
7  | Gina
8  | Henry
9  | Isac
10 | Jane

And the reader has the following configuration:

ATTRIBUTE                            | VALUE
Number of skipped records            | 2
Max number of records                | 5
Number of skipped records per source | 1
Max number of records per source     | 3

The files are read in the following way:

  1. From each file, the first record is skipped and the next three records (at most) are read.
  2. Of the records read in the previous step, the first two are skipped. The following records (at most five, the Max number of records) are sent to the output.

The records read by the reader are:
4 | Daniel
7 | Gina
8 | Henry
9 | Isac

The example shows that the Reader can read fewer records than the number specified in the Max number of records attribute.
The total number of records that are skipped equals Number of skipped records per source multiplied by the number of source files, plus Number of skipped records.
The total number of records that are read equals Max number of records per source multiplied by the number of source files, reduced by Number of skipped records, but never more than Max number of records. The sketch below reproduces the example.
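A minimal sketch of how the two levels of limits combine (assuming each source is an iterator of records; this reproduces the example above):

    from itertools import chain, islice

    def read_with_limits(sources, skip_per_src, max_per_src, skip, max_total):
        # Per-source limits are applied first...
        per_source = (islice(src, skip_per_src, skip_per_src + max_per_src)
                      for src in sources)
        # ...then the per-reader limits on the merged stream.
        merged = chain.from_iterable(per_source)
        return list(islice(merged, skip, skip + max_total))

    file1 = ["1|Alice", "2|Bob", "3|Carolina", "4|Daniel", "5|Eve"]
    file2 = ["6|Filip", "7|Gina", "8|Henry", "9|Isac", "10|Jane"]
    print(read_with_limits([iter(file1), iter(file2)], 1, 3, 2, 5))
    # -> ['4|Daniel', '7|Gina', '8|Henry', '9|Isac']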
The Readers that allow limiting the records both for individual input files and for all input files in total are:

The following two Readers allow you to limit the total number of records by using the Number of skipped mappings and/or Max number of mappings attributes. A mapping here is a subtree that is to be mapped and sent out through the output ports.

  • XMLExtract; in addition, this component allows using the skipRows and/or numRecords attributes of individual XML elements.
  • XMLXPathReader; in addition, this component allows using the XPath language to limit the number of mapped XML structures.

The following Readers allow limiting the numbers in a different way:

  • The DatabaseReader component can use the SQL query or Query URL attribute to limit the number of records.

The following Readers do not allow limiting the number of records that should be read (they read them all):

Data Policy

Data Policy affects processing (parsing) of incorrect or incomplete records. This can be specified by the Data Policy attribute. There are three options:

  • Strict. This data policy is set by default. Data parsing stops as soon as a record field with an incorrect value or format is read, and further processing is aborted.
  • Controlled. Every error is logged, but incorrect records are skipped and data parsing continues. Incorrect records, together with error information, are logged to stdout. Only FlatFileReader, JSONReader and SpreadsheetDataReader can also send them out through an optional second port; see the error metadata description of each of these Readers. The three policies are sketched below.
  • Lenient. Incorrect records are only skipped and data parsing continues.
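For illustration, the three policies can be sketched as follows (parse is a hypothetical function that raises ValueError on an invalid record; this is not the actual parser code):

    def read_all(lines, parse, policy="strict"):
        records, errors = [], []
        for line in lines:
            try:
                records.append(parse(line))
            except ValueError as err:
                if policy == "strict":      # default: abort processing
                    raise
                if policy == "controlled":  # log the error, then skip
                    errors.append((line, str(err)))
                # lenient: skip silently and continue parsing
        return records, errors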

Data policy can be set in the following Readers:

XML Features

In XMLExtract, XMLReader and XMLXPathReader, you can configure the validation of your input XML files by specifying the Xml features attribute. The Xml features attribute configures the validation of XML in more detail by enabling or disabling specific checks; see Parser Features. It is expressed as a sequence of individual expressions of the form featureName:=true or featureName:=false, separated by semicolons, where each featureName is an XML parser feature to be set.
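For example, the attribute value might look as follows (the feature names shown are standard Xerces parser features, used here only for illustration):

    http://xml.org/sax/features/validation:=true;http://apache.org/xml/features/nonvalidating/load-external-dtd:=false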

The options for validation are the following:

  • Custom parser setting
  • Default parser setting
  • No validations
  • All validations

You can define this attribute in a dedicated dialog, in which you can add features using the Plus button, select their true or false values, etc.

CTL Templates for Readers

DataGenerator requires a transformation, which can be written in both CTL and Java.
For more information about the transformation template, see CTL Templates for DataGenerator.
Remember that this component allows sending each record through the connected output port whose number equals the value returned by the transformation (see Return Values of Transformations). Mapping must be defined for such a port.

Java Interfaces for Readers