Data Formats

Sometimes, a Format may be defined for parsing and formatting data values.

  1. Any date can be parsed and/or formatted using date and time format pattern. See Date and Time Format below.
    Parsing and formatting can also be influenced by Locale (names of months, order of day or month information, etc.) and Time Zone.

  2. Any numeric data type (decimal, integer, long, number) can be parsed and/or formatted using the numeric format pattern. See Numeric Format.
    Parsing and formatting can also be influenced by locale (e.g. decimal dot or decimal comma, etc.). See Locale.

  3. Any boolean data type can be parsed and formatted using the boolean format pattern. See Boolean Format.

  4. Any string data type can be parsed using the string format pattern. See String Format.


Remember that both date and time formats and numeric formats are displayed using the system Locale value or the Locale specified in the defaultProperties file, unless another Locale is explicitly specified.

For more information on how Locale may be changed in the defaultProperties see Engine Configuration.

Date and Time Format

A formatting string describes how date/time values should be read and written from/to string representation (flat files, human readable output, etc.). Formatting and parsing of dates is also affected by Locale and Time Zone.

A format can also specify an engine which Data Shaper will use by specifying a prefix (see below). There are two built-in date engines available: standard Java and third-party Joda (https://www.joda.org/joda-time).

DATE ENGINEPREFIXDEFAULTDESCRIPTIONEXAMPLE
Java java:yes - when no prefix is givenStandard Java date implementation. Provides lenient, error-prone and full-featured parsing and writing. It has moderate speed and is generally a good choice unless you need to work with large quantities of date/time fields. For advanced study, please refer to Java SimpleDateFormat documentation.java:yyyy-MM-dd HH:mm:ss
Joda joda:An improved third-party date library. Joda is more strict on input data accuracy when parsing and does not work well with time zones. Joda provides a 20-30% speed increase compared to standard Java.

Joda may be convenient for AS/400 machines.

On the other hand, Joda is unable to read a time zone expressed with any number of z letters and/or at least three Z letters in a pattern.

For further reading, please visit the project site at https://www.joda.org/joda-time.
joda:yyyy-MM-dd HH:mm:ss
Joda iso-8601This format offers support to parse and print dates and times formatted according to ISO 8601. The standard provides more ways of time expression, but usually the format YYYY-MM-DDThh:mm:ss±hh:mm is used - especially in the case of data interchange using XML or JSON documents.

For additional information on the standard, see Wikipedia article on ISO-8601
There are three possible format values:

  • iso-8601:dateTime for timestamps
  • iso-8601:date for simple dates without time information
  • iso-8601:time for simple times without date information
  • Please note that actual format strings for Java and Joda are almost 100% compatible with each other - see tables below.

    Warning!

    The format patterns described in this section are used both in metadata as the Format property and in CTL.

    At first, we provide the list of pattern syntax, the rules and the examples of its usage for Java:

    LETTERDATE OR TIME COMPONENTPRESENTATIONEXAMPLES
    GEra designatorTextAD
    yYearYear1996; 96
    YWeek yearYear2009; 09
    MMonth in yearMonthJuly; Jul; VII; 07; 7
    wWeek in yearNumber27
    WWeek in monthNumber2
    DDay in yearNumber189
    dDay in monthNumber10
    FDay of week in monthNumber2
    EDay in weekTextTuesday; Tue
    uDay number of week (1 = Monday, …​, 7 = Sunday)Number1
    aAM/PM markerTextPM
    HHour in day (0-23)Number0
    kHour in day (1-24)Number24
    KHour in am/pm (0-11)Number0
    hHour in am/pm (1-12)Number12
    mMinute in hourNumber30
    sSecond in minuteNumber55
    SMillisecondNumber970
    zTime zoneGeneral time zonePacific Standard Time; PST; GMT-08:00
    ZTime zoneRFC 822 time zone-0800
    XTime zoneISO 8601 time zone-08; -0800; -08:00
    'Escape for text/idDelimiter(none)
    "Single quoteLiteral'

    The number of symbol letters you specify also determines the format. For example, if the "zz" pattern results in "PDT", then the "zzzz" pattern generates "Pacific Daylight Time". The following table summarizes these rules:

    PRESENTATIONPROCESSINGNUMBER OF PATTERN LETTERSFORM
    TextFormatting1 - 3Short or abbreviated form, if one exists.
    TextFormatting>= 4Full form
    TextParsing>= 1Both forms
    YearFormatting2Truncated to 2 digits
    YearFormatting1 or >= 3Interpreted as Number.
    YearParsing1Interpreted literally
    YearParsing2Interpreted relative to the century within 80 years before or 20 years after the time when the SimpleDateFormat instance is created.
    YearParsing>= 3Interpreted literally
    MonthBoth1-2Interpreted as a Number
    MonthParsing>= 3Interpreted as Text (using Roman numbers, abbreviated month name - if exists, or full month name).
    MonthFormatting3Interpreted as Text (using Roman numbers, or abbreviated month name - if exists).
    MonthFormatting>= 4Interpreted as Text (full month name).
    NumberFormattingMinimum number of required digitsShorter numbers are padded with zeros
    NumberParsingThe number of pattern letters is ignored (unless needed to separate two adjacent fields).Any form
    General time zoneBoth1-3Short or abbreviated form, if it has a name. Otherwise, GMT offset value (GMT[sign][0]0-23]:[00-59]).
    General time zoneBoth>= 4Full form, if it has a name; otherwise, GMT offset value (GMT[sign][0]0-23]:[00-59]).
    General time zoneParsing>= 1RFC 822 time zone form is allowed.
    RFC 822 time zoneBoth>= 1RFC 822 4-digit time zone format is used ([sign][0-23][00-59]).
    RFC 822 time zoneParsing>= 1General time zone form is allowed.

    Examples of date format patterns and resulting dates follow:

    DATE AND TIME PATTERNRESULT
    "yyyy.MM.dd G 'at' HH:mm:ss z"2001.07.04 AD at 12:08:56 PDT
    "EEE, MMM d, ''yy"Wed, Jul 4, '01
    "h:mm a"12:08 PM
    "hh 'o''clock' a, zzzz"12 o’clock PM, Pacific Daylight Time
    "K:mm a, z"0:08 PM, PDT
    "yyyyy.MMMMM.dd GGG hh:mm aaa"02001.July.04 AD 12:08 PM
    "EEE, d MMM yyyy HH:mm:ss Z"Wed, 4 Jul 2001 12:08:56 -0700
    "yyMMddHHmmssZ"010704120856-0700
    "yyyy-MM-dd’T’HH:mm:ss.SSSZ"2001-07-04T12:08:56.235-0700

    The described format patterns are used both in metadata as the Format property and in CTL.

    Now the list of format pattern syntax for Joda follows:

    SYMBOLMEANINGPRESENTATIONEXAMPLES
    GEra designatorTextAD
    CCentury of era (>=0)Number20
    YYear of era (>=0)Year1996
    yYearYear1996
    xWeek of weekyearYear1996
    MMonth of yearMonthJuly; Jul; 07
    wWeek of yearNumber27
    DDay of yearNumber189
    dDay of monthNumber10
    eDay of weekNumber2
    EDay of weekTextTuesday; Tue
    aHalfday of dayTextPM
    HHour of day (0-23)Number0
    kClockhour of day (1-24)Number24
    KHour of halfday (0-11)Number0
    hClockhour of halfday (1-12)Number12
    mMinute of hourNumber30
    sSecond of minuteNumber55
    SFraction of secondNumber970
    zTime zoneTextPacific Standard Time; PST
    ZTime zone offset/idZone-0800; -08:00; America/Los_Angeles
    'Escape for text/idDelimiter(none)
    ''Single quoteLiteral'

    The number of symbol letters you specify also determines the format. The following table summarizes these rules:

    PRESENTATIONPROCESSINGNUMBER OF PATTERN LETTERSFORM
    TextFormatting1 - 3Short or abbreviated form, if one exists.
    TextFormatting>= 4Full form
    TextParsing>= 1Both forms
    YearFormatting2Truncated to 2 digits
    YearFormatting1 or >= 3Interpreted as Number
    YearParsing>= 1Interpreted literally
    MonthBoth1-2Interpreted as Number
    MonthParsing>= 3Interpreted as Text (using Roman numbers, abbreviated month name - if exists, or full month name).
    MonthFormatting3Interpreted as Text (using Roman numbers, or abbreviated month name - if exists).
    MonthFormatting>= 4Interpreted as Text (full month name)
    NumberFormattingThe minimum number of required digits.Shorter numbers are padded with zeros.
    NumberParsing>= 1Any form
    Zone nameFormatting1-3Short or abbreviated form
    Zone nameFormatting>= 4Full form
    Time zone offset/idFormatting1Offset without a colon between hours and minutes.
    Time zone offset/idFormatting2Offset with a colon between hours and minutes.
    Time zone offset/idFormatting>= 3Full textual form like this: "Continent/City".
    Time zone offset/idParsing1Offset without a colon between hours and minutes.
    Time zone offset/idParsing2Offset with a colon between hours and minutes.

    Capt. Eddie to ground control:

    "Remember that parsing with any number of "z" letters, as well as parsing with the number of "Z" letters greater than or equal to 3 is not allowed."

    See information about data types in metadata and CTL (CTL2):

    They are also used in CTL functions. See:

    Numeric Format

    When a text is parsed as any numeric data type or any numeric data type should be formatted to a text, format pattern can be specified. If no format pattern is specified, empty pattern is used and numbers still get parsed and formatted to text.

    There are differences in text parsing and number formatting between cases with an empty pattern and specified pattern.

    1. No pattern and default locale

      • Used when a pattern is empty and no locale is set.
      • Javolution TypeFormat is used for parsing
      • Formatting uses Java’s toString() function (e.g. Integer.toString())
      • Parsing uses Javolution library. It is typically faster than standard Java library but more strict: parsing "10,00" as number fails, parsing "10.00" as integer fails. The expected format for number type is {'.'}{'E|e'}.
    2. A pattern or locale is set (the format from the documentation is used)

      • DecimalFormat for formatting and parsing.
      • Parsing depends on pattern, but e.g. 10,00 is parsed as 1000 (with empty pattern and US locale) and 10.00 will be parsed as valid integer (with value 10).

    Parsing and formatting are locale sensitive.

    In Data Shaper, Java decimal format is used.

    SYMBOLLOCATIONLOCALIZED?MEANING
    #NumberYesDigit, zero shows as absent
    0NumberYesDigit
    .NumberYesDecimal separator or monetary decimal separator
    -NumberYesMinus sign
    ,NumberYesGrouping separator
    ENumberYesSeparates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix.
    ;Subpattern boundaryYesSeparates positive and negative subpatterns
    %Prefix or suffixYesMultiply by 100 and show as percentage
    ‰ (\u2030)Prefix or suffixYesMultiply by 1000 and show as per mille value
    ¤ (\u00A4)Prefix or suffixNoCurrency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator.
    'Prefix or suffixNoUsed to quote special characters in a prefix or suffix; for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock".

    Both prefix and suffix are Unicode characters from \u0000 to \uFFFD, including the margins, but excluding special characters.

    Format pattern composes of subpatterns, prefixes, suffixes, etc. in the way shown in the following table:

    FORMATCOMPONENTS
    patternsubpattern{;subpattern}
    subpattern{prefix}integer{.fraction}{suffix}
    prefix'\u0000'..'\uFFFD' - specialCharacters
    suffix'\u0000'..'\uFFFD' - specialCharacters
    integer'#' '0' '0'
    fraction'0' '#'

    Explanation of these symbols follow:

    NOTATIONDESCRIPTION
    X*0 or more instances of X
    (X | Y)either X or Y
    X..Yany character from X up to Y, inclusive
    S - Tcharacters in S, except those in T
    {X}X is optional

    The grouping separator is commonly used for thousands, but in some countries it separates ten-thousands. The grouping size is a constant number of digits between the grouping characters, such as 3 for 100,000,000 or 4 for 1,0000,0000. If you supply a pattern with multiple grouping characters, the interval between the last one and the end of the integer is the one that is used. So "#,##,###,####" == "######,####" == "##,####,####".

    Remember also that formatting is locale sensitive. See the following table in which results are different for different locales:

    PATTERNLOCALERESULT
    ### ,###.###en.US123,456.789
    ### ,###.###de.DE123.456,789
    ### ,###.###fr.FR123 456,789

    For a deeper look on handling numbers, consult the official Java documentation of NumberFormat, and DecimalFormat.

    Warning!

    Space as group separator
    If locale with space as group separator is used, there should be a hard space (char 160) between digits to parse the number correctly.

    Scientific Notation

    Numbers in scientific notation are expressed as the product of a mantissa and a power of ten.

    For example, 1234 can be expressed as 1.234 x 103.

    The mantissa is often in the range 1.0 <= x < 10.0, but it’s not required.

    Numeric data types can be instructed to format and parse scientific notation only via a pattern. In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation.

    Example: "0.###E0" formats the number 1234 as "1.234E3".

    Examples of numeric pattern and results follow:

    VALUEPATTERNRESULT
    12340.###E01.234E3
    12345## 0.#####E0[1]12.345E3
    123456## 0.#####E0[1]123.456E3
    1234567## 0.#####E0[1]1.234567E6
    12345# 0.#####E0[2]1.2345E4
    123456# 0.#####E0[2]12.3456E4
    1234567# 0.#####E0[2]1.234567E6
    0.0012300.###E0[3]12.3E-4
    123456## 0.##E0[4]12.346E3

    [1] #x00A0;Maximum number of integer digits is 3, minimum number of integer digits is 1, maximum is greater than minimum, thus exponent will be a multiplicate of three (maximum number of integer digits) in each of the cases.

    [2] Maximum number of integer digits is 2, minimum number of integer digits is 1, maximum is greater than minimum, thus exponent will be a multiplicate of two (maximum number of integer digits) in each of the cases.

    [3] Maximum number of integer digits is 2, minimum number of integer digits is 2, maximum is equal to minimum, minimum number of integer digits will be achieved by adjusting the exponent.

    [4] Maximum number of integer digits is 3, maximum number of fraction digits is 2, number of significant digits is sum of maximum number of integer digits and maximum number of fraction digits, thus, the number of significant digits is as shown (5 digits).

    Binary Formats

    The table below presents a list of available formats:

    TYPENAMEFORMATLENGHT
    integerBIG_ENDIANtwo’s-complement, big-endianvariable
    integerLITTLE_ENDIANtwo’s-complement, little-endianvariable
    integerPACKED_DECIMALpacked decimalvariable
    floating-pointDOUBLE_BIG_ENDIANIEEE 754, big-endian8 bytes
    floating-pointDOUBLE_LITTLE_ENDIANIEEE 754, little-endian8 bytes
    floating-pointFLOAT_BIG_ENDIANIEEE 754, big-endian4 bytes
    floating-pointFLOAT_LITTLE_ENDIANIEEE 754, little-endian4 bytes

    The floating-point formats can be used with numeric and decimal datatypes. The integer formats can be used with integer and long datatypes. The exception to the rule is the decimal datatype, which also supports integer formats (BIG_ENDIAN, LITTLE_ENDIAN and PACKED_DECIMAL). When an integer format is used with the decimal datatype, implicit decimal point is set according to the Scale attribute. For example, if the stored value is 123456789 and Scale is set to 3, the value of the field will be 123456.789.

    To use a binary format, create a metadata field with one of the supported datatypes and set the Format attribute to the name of the format prefixed with "BINARY:", e.g. to use the PACKED_DECIMAL format, create a decimal field and set its Format to "BINARY:PACKED_DECIMAL" by choosing it from the list of available formats.

    For the fixed-length formats (double and float) also the Size attribute must be set accordingly.

    Currently, binary data formats can only be handled by ComplexDataReader.

    Boolean Format

    The format for boolean data type specified in Metadata consists of up to four parts separated from each other by the same delimiter.

    This delimiter must also be at the beginning and the end of the Format string. On the other hand, the delimiter must not be contained in the values of the boolean field.

    Warning!

    If you do not use the same character at the beginning and the end of the Format string, the whole string will serve as a regular expression for the true value. The default values (false|F|FALSE|NO|N|f|0|no|n) will be the only ones interpreted as false. Values that match neither the Format regular expression (interpreted as true only) nor the mentioned default values for false will be interpreted as error. In such a case, graph would fail.

    If we symbolically display the format as:

    /A/B/C/D/

    the meaning of each part is as follows:

    1. If the value of the boolean field matches the pattern of the first part (A) and does not match the second part (B), it is interpreted as true.

    2. If the value of the boolean field does not match the pattern of the first part (A), but matches the second part (B), it is interpreted as false.

    3. If the value of the boolean field matches both the pattern of the first part (A) and, at the same time, the pattern of the second part (B), it is interpreted as true.

    4. If the value of the boolean field matches neither the pattern of the first part (A), nor the pattern of the second part (B), it is interpreted as error. In such a case, the graph fails.

    All parts are optional; however, if any of them is omitted, all of the others that are at its right side must also be omitted.

    If the second part (B) is omitted, the following default values are the only ones that are parsed as boolean false:

    false|F|FALSE|NO|N|f|0|no|n

    If there is not any Format, the following default values are the only ones that are parsed as boolean true:

    true|T|TRUE|YES|Y|t|1|yes|y

    The third part (C) is a formatting string used to express boolean true for all matched strings. If the third part is omitted, either the true word is used (if the first part (A) is complicated regular expression), or the first substring from the first part is used (if the first part is a serie of simple substrings separated by pipe, e.g.: Iagree|sure|yes|ok - all these values are formatted as Iagree).

    The fourth part (D) is a formatting string used to express boolean false for all matched strings. If the fourth part is omitted, either the false word is used (if the second part (B) is complicated regular expression), or the first substring from the second part is used (if the second part is a serie of simple substrings separated by pipe, e.g.: Idisagree|nope|no - all these values are formatted as Idisagree).

    String Format

    Such string pattern is a regular expression that allows or prohibits parsing of a string.

    The combo box offers several pre-filled regular expressions.

    The last option (excel:raw) serves to read more precise values from .xlsx files. See documentation on SpreadsheetDataReader.

    Example 6. String Format

    If an input file contains a string field and a Format property is \w{4} for this field, only the string whose length is 4 will be parsed.

    Thus, when a Format property is specified for a string, Data policy may cause a failure of the graph (if Data policy is Strict).

    If Data policy is set to Controlled or Lenient, the records in which this string value matches the specified Format property are read and the others are skipped (either sent to Console or to the rejected port).