Last updated
Sometimes, a Format may be defined for parsing and formatting data values.
Any date can be parsed and/or formatted using a date and time format pattern. See Date and Time Format below. Parsing and formatting can also be influenced by locale (names of months, order of day or month information, etc.) and time zone.
Any numeric data type (decimal, integer, long, number) can be parsed and/or formatted using the numeric format pattern. See Numeric Format below.
Parsing and formatting can also be influenced by locale (e.g. decimal dot or decimal comma, etc.).
Any boolean data type can be parsed and formatted using the boolean format pattern. See Boolean Format below.
Any string data type can be parsed using the string format pattern. See String Format below.
A formatting string describes how date/time values should be read from and written to a string representation (flat files, human-readable output, etc.). Formatting and parsing of dates is also affected by locale and time zone.
A format can also specify which engine Data Shaper will use by means of a prefix (see below). There are two built-in date engines available: standard Java and third-party Joda.
iso-8601:dateTime for timestamps
iso-8601:date for simple dates without time information
iso-8601:time for simple times without date information
Please note that actual format strings for Java and Joda are almost 100% compatible with each other - see tables below.
Warning!
The format patterns described in this section are used both in metadata as the Format property and in CTL.
First, we list the pattern syntax, the rules, and examples of usage for Java:
The number of symbol letters you specify also determines the format. For example, if the "zz" pattern results in "PDT", then the "zzzz" pattern generates "Pacific Daylight Time". The following table summarizes these rules:
Examples of date format patterns and resulting dates follow:
The described format patterns are used both in metadata as the Format property and in CTL.
The format pattern syntax for Joda follows:
The number of symbol letters you specify also determines the format. The following table summarizes these rules:
See information about data types in metadata and CTL (CTL2):
They are also used in CTL functions. See:
A format pattern can be specified when text is parsed as a numeric data type or when a numeric data type is formatted to text. If no format pattern is specified, an empty pattern is used and numbers are still parsed and formatted to text.
There are differences in text parsing and number formatting between cases with an empty pattern and specified pattern.
No pattern and default locale
Used when the pattern is empty and no locale is set.
Javolution TypeFormat is used for parsing.
Formatting uses Java's toString() function (e.g. Integer.toString()).
Parsing uses the Javolution library. It is typically faster than the standard Java library but stricter: parsing "10,00" as a number fails, and parsing "10.00" as an integer fails. The expected format for the number type is <decimal>{'.'<fraction>}{'E|e'<exponent>}.
A pattern or locale is set (the format from the documentation is used)
DecimalFormat is used for formatting and parsing.
Parsing depends on the pattern, but e.g. 10,00 is parsed as 1000 (with an empty pattern and the US locale) and 10.00 is parsed as a valid integer (with the value 10).
Parsing and formatting are locale sensitive.
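The difference between the two parsing paths can be tried out in plain Java. The sketch below is only an illustration: the class and method names are hypothetical, the locale-aware path uses the default US pattern of NumberFormat (a DecimalFormat) rather than a literally empty pattern, and the strict path uses Integer.parseInt and Double.parseDouble as stand-ins for Javolution TypeFormat, which is not called here.

```java
import java.text.NumberFormat;
import java.text.ParsePosition;
import java.util.Locale;

public class NumericParsingDemo {

    // Locale-aware path: for Locale.US, NumberFormat hands back a DecimalFormat.
    // Returns null when no number can be read from the start of the text.
    static Number lenientParse(String text) {
        NumberFormat fmt = NumberFormat.getNumberInstance(Locale.US);
        return fmt.parse(text, new ParsePosition(0));
    }

    // Strict stand-in (assumption: behaves like the no-pattern path for these inputs).
    static boolean strictIntegerAccepts(String text) {
        try {
            Integer.parseInt(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Strict stand-in for the "number" type.
    static boolean strictNumberAccepts(String text) {
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The comma is taken as a grouping separator, so the digits are read as 1000.
        System.out.println("lenient 10,00 -> " + lenientParse("10,00"));
        // The fraction is zero, so the value fits an integer.
        System.out.println("lenient 10.00 -> " + lenientParse("10.00"));
        System.out.println("strict number accepts 10,00: " + strictNumberAccepts("10,00"));
        System.out.println("strict integer accepts 10.00: " + strictIntegerAccepts("10.00"));
    }
}
```

The practical consequence is that setting a pattern or a locale changes not only the output format but also which inputs are accepted.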
In Data Shaper, Java decimal format is used.
Both prefix and suffix are Unicode characters from \u0000 to \uFFFD, including the margins, but excluding special characters.
A format pattern is composed of subpatterns, prefixes, suffixes, etc., as shown in the following table:
An explanation of these symbols follows:
Remember also that formatting is locale sensitive. See the following table in which results are different for different locales:
Warning!
Space as group separator: if a locale with a space as group separator is used, there should be a hard space (char 160) between digits to parse the number correctly.
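Locale sensitivity can be checked directly with Java's DecimalFormat. The following is a minimal sketch with a hypothetical class name; it formats the same value with the same pattern under three locales. Depending on the Java version, the French grouping separator may be the hard space (char 160) or a narrow no-break space, so the sketch prints its code point instead of assuming one.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class LocaleFormattingDemo {

    // Formats the value with a fixed pattern and the symbols of the given locale.
    static String format(double value, Locale locale) {
        DecimalFormat fmt = new DecimalFormat("###,###.###",
                DecimalFormatSymbols.getInstance(locale));
        return fmt.format(value);
    }

    public static void main(String[] args) {
        double value = 123456.789;
        System.out.println("US:      " + format(value, Locale.US));
        System.out.println("Germany: " + format(value, Locale.GERMANY));
        System.out.println("France:  " + format(value, Locale.FRANCE));

        // The French separator looks like a space but is not the ordinary space (char 32).
        char sep = DecimalFormatSymbols.getInstance(Locale.FRANCE).getGroupingSeparator();
        System.out.println("French grouping separator code point: " + (int) sep);
    }
}
```

Because the separator is not an ordinary space, input typed with a plain space between digit groups will generally not parse under such a locale.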
Numbers in scientific notation are expressed as the product of a mantissa and a power of ten. For example, 1234 can be expressed as 1.234 x 10^3. The mantissa is often in the range 1.0 <= x < 10.0, but it's not required.
Numeric data types can be instructed to format and parse scientific notation only via a pattern. In a pattern, the exponent character immediately followed by one or more digit characters indicates scientific notation.
Example: "0.###E0" formats the number 1234 as "1.234E3".
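A short Java sketch of scientific-notation patterns follows. It assumes that Java's DecimalFormat with US symbols behaves the same way as the numeric format described here; the class and helper names are hypothetical.

```java
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParsePosition;
import java.util.Locale;

public class ScientificNotationDemo {

    // Builds a formatter with US symbols so the output is machine independent.
    static DecimalFormat of(String pattern) {
        return new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US));
    }

    static String format(String pattern, double value) {
        return of(pattern).format(value);
    }

    // Parses text in scientific notation; fails loudly when nothing can be parsed.
    static double parse(String pattern, String text) {
        Number n = of(pattern).parse(text, new ParsePosition(0));
        if (n == null) {
            throw new IllegalArgumentException("Cannot parse: " + text);
        }
        return n.doubleValue();
    }

    public static void main(String[] args) {
        // One integer digit in the pattern: the mantissa stays below 10.
        System.out.println(format("0.###E0", 1234));
        // Up to three integer digits: the exponent becomes a multiple of three.
        System.out.println(format("##0.#####E0", 12345));
        // Exactly two integer digits: the exponent is adjusted to reach them.
        System.out.println(format("00.###E0", 0.00123));
        System.out.println(parse("0.###E0", "1.234E3"));
    }
}
```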
Examples of numeric patterns and results follow:
[1] Maximum number of integer digits is 3, minimum number of integer digits is 1; the maximum is greater than the minimum, thus the exponent will be a multiple of three (the maximum number of integer digits) in each of the cases.
[2] Maximum number of integer digits is 2, minimum number of integer digits is 1; the maximum is greater than the minimum, thus the exponent will be a multiple of two (the maximum number of integer digits) in each of the cases.
[3] Maximum number of integer digits is 2, minimum number of integer digits is 2; the maximum is equal to the minimum, so the minimum number of integer digits will be achieved by adjusting the exponent.
[4] Maximum number of integer digits is 3, maximum number of fraction digits is 2; the number of significant digits is the sum of the maximum number of integer digits and the maximum number of fraction digits, thus the number of significant digits is as shown (5 digits).
The table below presents a list of available formats:
The floating-point formats can be used with numeric and decimal datatypes. The integer formats can be used with integer and long datatypes. The exception to the rule is the decimal datatype, which also supports integer formats (BIG_ENDIAN, LITTLE_ENDIAN and PACKED_DECIMAL). When an integer format is used with the decimal datatype, the implicit decimal point is set according to the Scale attribute. For example, if the stored value is 123456789 and Scale is set to 3, the value of the field will be 123456.789.
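The implicit decimal point can be illustrated with plain Java. The sketch below is not the Data Shaper implementation; it only shows, under the assumption of two's-complement storage, how an integer read from bytes combines with a scale of 3. Class and method names are hypothetical.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class BinaryDecimalDemo {

    // BigInteger(byte[]) reads big-endian two's-complement bytes;
    // the scale then places the implicit decimal point.
    static BigDecimal fromBigEndian(byte[] bytes, int scale) {
        return new BigDecimal(new BigInteger(bytes), scale);
    }

    // Little-endian input is reversed first and then read the same way.
    static BigDecimal fromLittleEndian(byte[] bytes, int scale) {
        byte[] reversed = new byte[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            reversed[i] = bytes[bytes.length - 1 - i];
        }
        return fromBigEndian(reversed, scale);
    }

    public static void main(String[] args) {
        // 123456789 is 0x075BCD15.
        byte[] bigEndian = {0x07, 0x5B, (byte) 0xCD, 0x15};
        byte[] littleEndian = {0x15, (byte) 0xCD, 0x5B, 0x07};
        System.out.println(fromBigEndian(bigEndian, 3).toPlainString());
        System.out.println(fromLittleEndian(littleEndian, 3).toPlainString());
    }
}
```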
To use a binary format, create a metadata field with one of the supported datatypes and set the Format attribute to the name of the format prefixed with "BINARY:", e.g. to use the PACKED_DECIMAL format, create a decimal field and set its Format to "BINARY:PACKED_DECIMAL" by choosing it from the list of available formats.
For the fixed-length formats (double and float), the Size attribute must also be set accordingly.
The format for boolean data type specified in Metadata consists of up to four parts separated from each other by the same delimiter.
This delimiter must also be at the beginning and the end of the Format string. On the other hand, the delimiter must not be contained in the values of the boolean field.
Warning!
If you do not use the same character at the beginning and the end of the Format string, the whole string will serve as a regular expression for the true value. The default values (false|F|FALSE|NO|N|f|0|no|n) will be the only ones interpreted as false. Values that match neither the Format regular expression (interpreted as true only) nor the mentioned default values for false will be interpreted as an error. In such a case, the graph would fail.
If we symbolically display the format as:
/A/B/C/D/
the meaning of each part is as follows:
If the value of the boolean field matches the pattern of the first part (A) and does not match the second part (B), it is interpreted as true.
If the value of the boolean field does not match the pattern of the first part (A), but matches the second part (B), it is interpreted as false.
If the value of the boolean field matches both the pattern of the first part (A) and, at the same time, the pattern of the second part (B), it is interpreted as true.
If the value of the boolean field matches neither the pattern of the first part (A), nor the pattern of the second part (B), it is interpreted as an error. In such a case, the graph fails.
All parts are optional; however, if any of them is omitted, all of the others that are at its right side must also be omitted.
If the second part (B) is omitted, the following default values are the only ones that are parsed as boolean false:
false|F|FALSE|NO|N|f|0|no|n
If there is not any Format, the following default values are the only ones that are parsed as boolean true:
true|T|TRUE|YES|Y|t|1|yes|y
The third part (C) is a formatting string used to express boolean true for all matched strings. If the third part is omitted, either the word true is used (if the first part (A) is a complicated regular expression), or the first substring from the first part is used (if the first part is a series of simple substrings separated by pipes, e.g. Iagree|sure|yes|ok, in which case all these values are formatted as Iagree).
The fourth part (D) is a formatting string used to express boolean false for all matched strings. If the fourth part is omitted, either the word false is used (if the second part (B) is a complicated regular expression), or the first substring from the second part is used (if the second part is a series of simple substrings separated by pipes, e.g. Idisagree|nope|no, in which case all these values are formatted as Idisagree).
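The parsing rules for parts A and B can be summarized in a few lines of Java. This is a hedged sketch, not the product's code: the class name is hypothetical, whole-value matching is an assumption, and the formatting parts C and D are not modeled.

```java
import java.util.regex.Pattern;

public class BooleanFormatSketch {
    static final String DEFAULT_TRUE = "true|T|TRUE|YES|Y|t|1|yes|y";
    static final String DEFAULT_FALSE = "false|F|FALSE|NO|N|f|0|no|n";

    private final Pattern truePattern;
    private final Pattern falsePattern;

    public BooleanFormatSketch(String format) {
        String a = DEFAULT_TRUE;   // used when there is no Format at all
        String b = DEFAULT_FALSE;  // used when part B is omitted
        if (format != null && !format.isEmpty()) {
            char delimiter = format.charAt(0);
            boolean delimited = format.length() >= 2
                    && format.charAt(format.length() - 1) == delimiter;
            if (delimited) {
                // Strip the outer delimiters and split the rest into parts A, B, C, D.
                String inner = format.substring(1, format.length() - 1);
                String[] parts = inner.split(Pattern.quote(String.valueOf(delimiter)), -1);
                a = parts[0];
                if (parts.length > 1) {
                    b = parts[1];
                }
            } else {
                // Different first and last character: the whole string is the regex for true.
                a = format;
            }
        }
        truePattern = Pattern.compile(a);
        falsePattern = Pattern.compile(b);
    }

    // A match on A wins, even if B matches too; no match at all is an error.
    public boolean parse(String value) {
        if (truePattern.matcher(value).matches()) {
            return true;
        }
        if (falsePattern.matcher(value).matches()) {
            return false;
        }
        throw new IllegalArgumentException("Not a boolean value: " + value);
    }

    public static void main(String[] args) {
        BooleanFormatSketch fmt =
                new BooleanFormatSketch("/Iagree|sure|yes|ok/Idisagree|nope|no/");
        System.out.println(fmt.parse("sure"));
        System.out.println(fmt.parse("nope"));
    }
}
```

Checking part A first is what makes a value that matches both patterns come out as true.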
The combo box offers several pre-filled regular expressions.
Example 6. String Format
If an input file contains a string field and the Format property of this field is \w{4}, only strings consisting of exactly four word characters will be parsed.
Thus, when a Format property is specified for a string, Data policy may cause a failure of the graph (if Data policy is Strict).
If Data policy is set to Controlled or Lenient, the records in which this string value matches the specified Format property are read and the others are skipped (either sent to Console or to the rejected port).
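The effect of a string Format can be approximated with Java regular expressions. The sketch below assumes that the whole field value must match the pattern; the class and method names are hypothetical, and the filtering method only imitates the "matching records are read, others are skipped" behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class StringFormatDemo {

    // True when the entire value matches the Format regular expression.
    static boolean accepts(String format, String value) {
        return Pattern.matches(format, value);
    }

    // Keeps the values that match and drops the rest.
    static List<String> keepMatching(String format, List<String> values) {
        List<String> kept = new ArrayList<>();
        for (String value : values) {
            if (accepts(format, value)) {
                kept.add(value);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String format = "\\w{4}";
        System.out.println(accepts(format, "abcd"));   // four word characters
        System.out.println(accepts(format, "abcde"));  // too long
        System.out.println(accepts(format, "ab d"));   // a space is not a word character
    }
}
```

Note that \w{4} constrains the characters as well as the length, so a four-character value containing a space is rejected.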
For a deeper look at handling numbers, consult the official Java documentation of NumberFormat and DecimalFormat.
Currently, binary data formats can only be handled by .
Such a string pattern is a regular expression that allows or prohibits parsing of a string.
The last option (excel:raw) serves to read more precise values from .xlsx files.
| Letter | Date or time component | Presentation | Examples |
| --- | --- | --- | --- |
| G | Era designator | Text | AD |
| y | Year | Year | 1996; 96 |
| Y | Week year | Year | 2009; 09 |
| M | Month in year | Month | July; Jul; VII; 07; 7 |
| w | Week in year | Number | 27 |
| W | Week in month | Number | 2 |
| D | Day in year | Number | 189 |
| d | Day in month | Number | 10 |
| F | Day of week in month | Number | 2 |
| E | Day in week | Text | Tuesday; Tue |
| u | Day number of week (1 = Monday, ..., 7 = Sunday) | Number | 1 |
| a | AM/PM marker | Text | PM |
| H | Hour in day (0-23) | Number | 0 |
| k | Hour in day (1-24) | Number | 24 |
| K | Hour in am/pm (0-11) | Number | 0 |
| h | Hour in am/pm (1-12) | Number | 12 |
| m | Minute in hour | Number | 30 |
| s | Second in minute | Number | 55 |
| S | Millisecond | Number | 970 |
| z | Time zone | General time zone | Pacific Standard Time; PST; GMT-08:00 |
| Z | Time zone | RFC 822 time zone | -0800 |
| X | Time zone | ISO 8601 time zone | -08; -0800; -08:00 |
| ' | Escape for text/id | Delimiter | (none) |
| '' | Single quote | Literal | ' |
| Presentation | Processing | Number of pattern letters | Form |
| --- | --- | --- | --- |
| Text | Formatting | 1 - 3 | Short or abbreviated form, if one exists. |
| Text | Formatting | >= 4 | Full form |
| Text | Parsing | >= 1 | Both forms |
| Year | Formatting | 2 | Truncated to 2 digits |
| Year | Formatting | 1 or >= 3 | Interpreted as Number. |
| Year | Parsing | 1 | Interpreted literally |
| Year | Parsing | 2 | Interpreted relative to the century within 80 years before or 20 years after the time when the SimpleDateFormat instance is created. |
| Year | Parsing | >= 3 | Interpreted literally |
| Month | Both | 1-2 | Interpreted as a Number |
| Month | Parsing | >= 3 | Interpreted as Text (using Roman numbers, abbreviated month name - if exists, or full month name). |
| Month | Formatting | 3 | Interpreted as Text (using Roman numbers, or abbreviated month name - if exists). |
| Month | Formatting | >= 4 | Interpreted as Text (full month name). |
| Number | Formatting | Minimum number of required digits | Shorter numbers are padded with zeros |
| Number | Parsing | The number of pattern letters is ignored (unless needed to separate two adjacent fields). | Any form |
| General time zone | Both | 1-3 | Short or abbreviated form, if it has a name. Otherwise, GMT offset value (GMT[sign][[0]0-23]:[00-59]). |
| General time zone | Both | >= 4 | Full form, if it has a name; otherwise, GMT offset value (GMT[sign][[0]0-23]:[00-59]). |
| General time zone | Parsing | >= 1 | RFC 822 time zone form is allowed. |
| RFC 822 time zone | Both | >= 1 | RFC 822 4-digit time zone format is used ([sign][0-23][00-59]). |
| RFC 822 time zone | Parsing | >= 1 | General time zone form is allowed. |

| Date and time pattern | Result |
| --- | --- |
| "yyyy.MM.dd G 'at' HH:mm:ss z" | 2001.07.04 AD at 12:08:56 PDT |
| "EEE, MMM d, ''yy" | Wed, Jul 4, '01 |
| "h:mm a" | 12:08 PM |
| "hh 'o''clock' a, zzzz" | 12 o'clock PM, Pacific Daylight Time |
| "K:mm a, z" | 0:08 PM, PDT |
| "yyyyy.MMMMM.dd GGG hh:mm aaa" | 02001.July.04 AD 12:08 PM |
| "EEE, d MMM yyyy HH:mm:ss Z" | Wed, 4 Jul 2001 12:08:56 -0700 |
| "yyMMddHHmmssZ" | 010704120856-0700 |
| "yyyy-MM-dd'T'HH:mm:ss.SSSZ" | 2001-07-04T12:08:56.235-0700 |
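Some of the example patterns can be reproduced with Java's SimpleDateFormat. The sketch below is only an illustration with a hypothetical class name. It pins the locale to US and the time zone to America/Los_Angeles, which is an assumption made so that the results do not depend on the machine, and it formats 4 July 2001, 12:08:56 local time.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;
import java.util.TimeZone;

public class DatePatternDemo {
    // Pinned zone (assumption) so that the output is reproducible.
    static final TimeZone ZONE = TimeZone.getTimeZone("America/Los_Angeles");

    // The sample instant: 2001-07-04 12:08:56 in the pinned zone, zero milliseconds.
    static Date sampleDate() {
        Calendar cal = new GregorianCalendar(ZONE, Locale.US);
        cal.clear();
        cal.set(2001, Calendar.JULY, 4, 12, 8, 56);
        return cal.getTime();
    }

    // Formats the sample instant with the given pattern.
    static String format(String pattern) {
        SimpleDateFormat fmt = new SimpleDateFormat(pattern, Locale.US);
        fmt.setTimeZone(ZONE);
        return fmt.format(sampleDate());
    }

    public static void main(String[] args) {
        String[] patterns = {
            "yyyy.MM.dd G 'at' HH:mm:ss z",
            "EEE, MMM d, ''yy",
            "hh 'o''clock' a, zzzz",
            "yyMMddHHmmssZ"
        };
        for (String pattern : patterns) {
            System.out.println(pattern + " -> " + format(pattern));
        }
    }
}
```

Doubling a pattern letter count (for example MMM versus MMMM, or z versus zzzz) is an easy way to see the short and full forms side by side.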
| Symbol | Meaning | Presentation | Examples |
| --- | --- | --- | --- |
| G | Era designator | Text | AD |
| C | Century of era (>=0) | Number | 20 |
| Y | Year of era (>=0) | Year | 1996 |
| y | Year | Year | 1996 |
| x | Weekyear | Year | 1996 |
| M | Month of year | Month | July; Jul; 07 |
| w | Week of year | Number | 27 |
| D | Day of year | Number | 189 |
| d | Day of month | Number | 10 |
| e | Day of week | Number | 2 |
| E | Day of week | Text | Tuesday; Tue |
| a | Halfday of day | Text | PM |
| H | Hour of day (0-23) | Number | 0 |
| k | Clockhour of day (1-24) | Number | 24 |
| K | Hour of halfday (0-11) | Number | 0 |
| h | Clockhour of halfday (1-12) | Number | 12 |
| m | Minute of hour | Number | 30 |
| s | Second of minute | Number | 55 |
| S | Fraction of second | Number | 970 |
| z | Time zone | Text | Pacific Standard Time; PST |
| Z | Time zone offset/id | Zone | -0800; -08:00; America/Los_Angeles |
| ' | Escape for text/id | Delimiter | (none) |
| '' | Single quote | Literal | ' |
| Presentation | Processing | Number of pattern letters | Form |
| --- | --- | --- | --- |
| Text | Formatting | 1 - 3 | Short or abbreviated form, if one exists. |
| Text | Formatting | >= 4 | Full form |
| Text | Parsing | >= 1 | Both forms |
| Year | Formatting | 2 | Truncated to 2 digits |
| Year | Formatting | 1 or >= 3 | Interpreted as Number |
| Year | Parsing | >= 1 | Interpreted literally |
| Month | Both | 1-2 | Interpreted as Number |
| Month | Parsing | >= 3 | Interpreted as Text (using Roman numbers, abbreviated month name - if exists, or full month name). |
| Month | Formatting | 3 | Interpreted as Text (using Roman numbers, or abbreviated month name - if exists). |
| Month | Formatting | >= 4 | Interpreted as Text (full month name) |
| Number | Formatting | The minimum number of required digits. | Shorter numbers are padded with zeros. |
| Number | Parsing | >= 1 | Any form |
| Zone name | Formatting | 1-3 | Short or abbreviated form |
| Zone name | Formatting | >= 4 | Full form |
| Time zone offset/id | Formatting | 1 | Offset without a colon between hours and minutes. |
| Time zone offset/id | Formatting | 2 | Offset with a colon between hours and minutes. |
| Time zone offset/id | Formatting | >= 3 | Full textual form like this: "Continent/City". |
| Time zone offset/id | Parsing | 1 | Offset without a colon between hours and minutes. |
| Time zone offset/id | Parsing | 2 | Offset with a colon between hours and minutes. |
| Symbol | Location | Localized? | Meaning |
| --- | --- | --- | --- |
| # | Number | Yes | Digit, zero shows as absent |
| 0 | Number | Yes | Digit |
| . | Number | Yes | Decimal separator or monetary decimal separator |
| - | Number | Yes | Minus sign |
| , | Number | Yes | Grouping separator |
| E | Number | Yes | Separates mantissa and exponent in scientific notation. Need not be quoted in prefix or suffix. |
| ; | Subpattern boundary | Yes | Separates positive and negative subpatterns |
| % | Prefix or suffix | Yes | Multiply by 100 and show as percentage |
| ‰ (\u2030) | Prefix or suffix | Yes | Multiply by 1000 and show as per mille value |
| ¤ (\u00A4) | Prefix or suffix | No | Currency sign, replaced by currency symbol. If doubled, replaced by international currency symbol. If present in a pattern, the monetary decimal separator is used instead of the decimal separator. |
| ' | Prefix or suffix | No | Used to quote special characters in a prefix or suffix; for example, "'#'#" formats 123 to "#123". To create a single quote itself, use two in a row: "# o''clock". |
| Element | Definition |
| --- | --- |
| pattern | subpattern{;subpattern} |
| subpattern | {prefix}integer{.fraction}{suffix} |
| prefix | '\u0000'..'\uFFFD' - specialCharacters |
| suffix | '\u0000'..'\uFFFD' - specialCharacters |
| integer | '#'* '0'* '0' |
| fraction | '0'* '#'* |
X*: 0 or more instances of X
(X | Y): either X or Y
X..Y: any character from X up to Y, inclusive
S - T: characters in S, except those in T
{X}: X is optional
| Format string | Locale | Result |
| --- | --- | --- |
| ###,###.### | en.US | 123,456.789 |
| ###,###.### | de.DE | 123.456,789 |
| ###,###.### | fr.FR | 123 456,789 |
| Value | Pattern | Result |
| --- | --- | --- |
| 1234 | 0.###E0 | 1.234E3 |
| 12345 | ##0.#####E0 [1] | 12.345E3 |
| 123456 | ##0.#####E0 [1] | 123.456E3 |
| 1234567 | ##0.#####E0 [1] | 1.234567E6 |
| 12345 | #0.#####E0 [2] | 1.2345E4 |
| 123456 | #0.#####E0 [2] | 12.3456E4 |
| 1234567 | #0.#####E0 [2] | 1.234567E6 |
| 0.00123 | 00.###E0 [3] | 12.3E-4 |
| 123456 | ##0.##E0 [4] | 12.346E3 |
| Type | Name | Format | Length |
| --- | --- | --- | --- |
| integer | BIG_ENDIAN | two's-complement, big-endian | variable |
| integer | LITTLE_ENDIAN | two's-complement, little-endian | variable |
| integer | PACKED_DECIMAL | | variable |
| floating-point | DOUBLE_BIG_ENDIAN | IEEE 754, big-endian | 8 bytes |
| floating-point | DOUBLE_LITTLE_ENDIAN | IEEE 754, little-endian | 8 bytes |
| floating-point | FLOAT_BIG_ENDIAN | IEEE 754, big-endian | 4 bytes |
| floating-point | FLOAT_LITTLE_ENDIAN | IEEE 754, little-endian | 4 bytes |