Primeur Online Docs
Data Shaper
FastSort

Last updated 2 months ago

Short Description

FastSort sorts input records using a sort key. FastSort is faster than ExtSort, but requires more system resources and sorting is not stable.

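| COMPONENT | SAME INPUT METADATA | SORTED INPUTS | INPUTS | OUTPUTS | JAVA | CTL | AUTO-PROPAGATED METADATA |
| --- | --- | --- | --- | --- | --- | --- | --- |
| FastSort | - | x | 1 | 1-N | - | - | ✓ |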

Ports

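| PORT TYPE | NUMBER | REQUIRED | DESCRIPTION | METADATA |
| --- | --- | --- | --- | --- |
| Input | 0 | ✓ | For input data records | the same input and output metadata |
| Output | 0 | ✓ | For sorted data records | the same input and output metadata |
| Output | 1-N | x | For sorted data records | the same input and output metadata |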

Metadata

Metadata can be propagated through this component. All output metadata must be the same.

FastSort Attributes

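| ATTRIBUTE | REQ | DESCRIPTION | POSSIBLE VALUES |
| --- | --- | --- | --- |
| BASIC | | | |
| Sort key | ✓ | A list of fields (separated by a semicolon) according to which the data records are sorted, including a sorting order for each data field separately, see Sort Key. | |
| In memory only | | If true, internal sorting is forced and all attributes except Sort key and Run size are ignored. | false (default), true |
| ADVANCED | | | |
| Run size (records) | | The number of records sorted at once in memory; the size of one read buffer. Largely affects speed and memory requirements, see Run Size below. Multiplies with Number of read buffers (which depends on the Number of sorting threads). Reasonable Run sizes vary from 5,000 to 200,000 based on the record size and the total number of records. | 10000 (default), 1000-N |
| Max open files | | Limits the number of temp files that can be created during the sorting. Too low a number (500 or less) significantly reduces performance, see Max Open Files below. 0 denotes that the number of temp files is unlimited. | 1000 (default), 1-N, 0 (unlimited) |
| Number of sorting threads | | The number of worker threads to do the job. Setting this value too high may even slow the graph run down, see Number of Sorting Threads below. Also affects memory requirements. | 1 (default), 1-N |
| Number of read buffers | | How many chunks of data (each the size of Run size) will be held in memory at a time, see Number of Read Buffers below. Defaults to Number of sorting threads + 2. | auto (default), 1-N |
| Tape buffer size (bytes) | | A buffer used by a worker for filling the output. Slightly affects performance, see Tape Buffer Size below. | 8192 (default), 1-N |
| Compress temporary files | | If true, temporary files are compressed. For more information, see Compress Temporary Files below. | false (default), true |
| DEPRECATED | | | |
| Estimated record count [1] | | An estimated number of input records to be sorted. | auto (default), 1-N |
| Average record size (bytes) [2] | | Guess on average byte size of records. | auto (default), 1-N |
| Maximum memory (MB, GB) [2] | | Rough estimate of maximum memory that can be used. | auto (default), 1-N |
| Sorting locale | | Locale used for a correct sorting order. | none (default), any locale |
| Case sensitive | | By default (Sorting locale is none), upper-case characters are sorted separately and precede lower-case characters, which are sorted separately, too. If Sorting locale is set, upper- and lower-case characters are sorted together; if Case sensitive is true, a lower-case character precedes the corresponding upper-case one, while false preserves the order in which data strings appear in the input. The Case sensitive value is taken into account only if Sorting locale is set. | false (default), true |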

[1] Estimated record count is a helper attribute which is used for calculating (rather unnatural) Run size automatically as approximately Estimated record count to the power of 0.66. If Run size is set explicitly, Estimated record count is ignored.

[2] These attributes affect the automatic guess of Run size. Generally, the following formula must be true: Number of read buffers × Run size × Average record size < Maximum memory

Details

FastSort is a high-performance sort component that reaches optimal efficiency when enough system resources are available. FastSort can be up to 2.5 times faster than ExtSort, but consumes significantly more memory and temporary disk space.

The component takes input records and sorts them using a sorting key - a single field or a set of fields. You can specify sorting order for each field in the key separately. The sorted output is sent to all connected ports.
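As a rough illustration of a multi-field sorting key with a separate order per field, the sketch below sorts a few records in plain Python. The field names (`country`, `revenue`) and the `SORT_KEY` structure are hypothetical and are not FastSort syntax; the point is only how per-field ordering behaves.

```python
from functools import cmp_to_key

# Hypothetical sort key: (field name, ascending?) pairs, most significant first.
SORT_KEY = [("country", True), ("revenue", False)]

def compare(a, b):
    """Compare two records field by field, honoring each field's order."""
    for field, ascending in SORT_KEY:
        if a[field] != b[field]:
            result = -1 if a[field] < b[field] else 1
            return result if ascending else -result
    return 0  # records are equal on every key field

records = [
    {"country": "IT", "revenue": 120},
    {"country": "DE", "revenue": 300},
    {"country": "IT", "revenue": 450},
    {"country": "DE", "revenue": 90},
]

sorted_records = sorted(records, key=cmp_to_key(compare))
for rec in sorted_records:
    print(rec["country"], rec["revenue"])
```

Here `country` is ascending and `revenue` is descending, so the records come out as DE 300, DE 90, IT 450, IT 120.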

Satisfactory results can be obtained with the default settings (only the sorting key must be specified). However, to achieve the best performance, a number of attributes are available for tweaking.

Sorting Null Values

In ascending order, null values are sorted before strings, numbers, booleans or dates. In descending order, null values are sorted after strings, numbers, booleans or dates. Note that FastSort treats null values in the same field of the Sort key as equal to each other.
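The null ordering described above can be mimicked in plain Python, with `None` standing in for null. This is only a sketch of the ordering rule, not of how FastSort implements it; `null_aware_key` is a hypothetical helper.

```python
def null_aware_key(value):
    """Sort key that places None before every non-null value."""
    # The boolean flag is compared first: False (null) orders before True.
    # The second element is only compared when both flags are equal,
    # so the placeholder 0 is never compared against a real value.
    return (value is not None, value if value is not None else 0)

values = [3, None, 1, None, 2]

ascending = sorted(values, key=null_aware_key)
descending = sorted(values, key=null_aware_key, reverse=True)

print(ascending)   # nulls first
print(descending)  # nulls last
```

Because both nulls map to the same key, they compare as equal, which matches the rule that nulls in the same key field are treated as equal.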

FastSort Tweaking

You usually do not need to set any of these attributes, but setting them can sometimes improve performance. For example, you may have limited memory, a very large number of records to sort, or very large records. In such cases, you can adjust FastSort to your needs.

Hint! The memory requirements of the component can be estimated as follows:

  • heap memory = Number of read buffers × Run size × estimated record size
  • direct memory = Max open files × Tape buffer size
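The two estimates from the hint can be put into a small calculation. The sketch below plugs in the documented defaults (Run size 10,000, Max open files 1,000, Tape buffer size 8,192 bytes, one sorting thread and therefore 3 read buffers); the 200-byte record size is an arbitrary assumption, and `estimate_memory` is a hypothetical helper rather than part of the product.

```python
def estimate_memory(read_buffers, run_size, record_size_bytes,
                    max_open_files, tape_buffer_bytes):
    """Return (heap_bytes, direct_bytes) following the two formulas in the hint."""
    heap_bytes = read_buffers * run_size * record_size_bytes
    direct_bytes = max_open_files * tape_buffer_bytes
    return heap_bytes, direct_bytes

heap, direct = estimate_memory(
    read_buffers=3,          # default: Number of sorting threads (1) + 2
    run_size=10_000,         # default Run size
    record_size_bytes=200,   # assumed average record size
    max_open_files=1_000,    # default Max open files
    tape_buffer_bytes=8_192, # default Tape buffer size
)

print(f"heap:   {heap / 1024**2:.1f} MiB")
print(f"direct: {direct / 1024**2:.1f} MiB")
```

These figures are estimates only; actual consumption depends on the real record sizes and on the JVM.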

Run Size

The core attribute for FastSort. It determines how many records form a "run" (i.e. a group of sorted records in temp files). The lower the Run size, the more temp files are created, the less memory is used, and the greater the speed achieved. On the other hand, higher values might cause memory issues. There is no rule of thumb as to whether Run size should be high or low to get the best performance. Generally, the more records you are about to sort, the bigger the Run size you might want. The rough formula for Run size is Estimated record count^0.66. Note that memory consumption multiplies with Number of read buffers, which in turn grows with Number of sorting threads, so higher Run sizes result in much higher memory footprints.

Max Open Files

FastSort uses a relatively large number of temporary files during its operation. By default, the number of temporary files is limited to 1,000. For production systems, it is recommended to set the limit as high as possible, because there is no speed sacrifice, see Performance Bottlenecks. On the other hand, you can lower the limit even further to avoid hitting the user quota or other OS-specific and runtime limitations. The following table should give you a better idea:

| DATASET SIZE | NUMBER OF TEMP. FILES | DEFAULT RUN SIZE | NOTE |
| --- | --- | --- | --- |
| 1,000,000 | ~100 | ~10,000 | |
| 10,000,000 | ~250 | ~45,000 | |
| 1,000,000,000 | 20,000 to 2,000 | 50,000 to 500,000 | Depends on available memory |

Note that numbers in the table above are not exact and might be different on your system.
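The rough Run size formula (Estimated record count to the power of 0.66) can be checked numerically. The sketch below also estimates the number of runs under the assumption that each run ends up in one temporary file; that assumption and both helper names are illustrative, not taken from the product.

```python
import math

def estimated_run_size(record_count):
    """Rough automatic Run size: record count to the power of 0.66."""
    return round(record_count ** 0.66)

def estimated_run_count(record_count, run_size):
    """Number of runs needed, assuming one temporary file per run."""
    return math.ceil(record_count / run_size)

for count in (1_000_000, 10_000_000):
    run_size = estimated_run_size(count)
    runs = estimated_run_count(count, run_size)
    print(f"{count:>12,} records: run size ~{run_size:,}, ~{runs} runs")
```

For one million records this gives a Run size of roughly 9,100 and about 110 runs, which is in the same range as the approximate figures quoted for that dataset size.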

Number of Sorting Threads

Tells FastSort how many runs (chunks) should be sorted in parallel. By default, it is automatically set to 1 or 2 based on the number of CPU cores in your system. Overriding this value makes sense if your system has many CPU cores and your disks can handle that many parallel data streams.

Number of Read Buffers

This setting is closely tied to Number of sorting threads: it must be equal to or greater than Number of sorting threads. The higher the number of read buffers, the lower the chance that the workers will block each other. Defaults to Number of sorting threads + 2.

Compress Temporary Files

Along with Temporary files charset, this option lets you reduce the space required for temporary files. Note that compression can save a lot of space but decreases performance by up to 30%.

Tape Buffer Size

Size (in bytes) of a file output buffer. The default value is 8kB. Decreasing this value might avoid memory exhaustion for large numbers of runs (e.g. when Run size is very small compared to the total number of records). However, the impact of this setting is quite small.
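The relationship between Number of sorting threads and Number of read buffers can be expressed as a small validation routine. This is a sketch of the two rules stated above (the default of threads + 2 and the requirement that buffers are not fewer than threads); `resolve_read_buffers` is a hypothetical helper, not a product API.

```python
def resolve_read_buffers(sorting_threads, read_buffers=None):
    """Return the effective number of read buffers for a given thread count."""
    if sorting_threads < 1:
        raise ValueError("Number of sorting threads must be at least 1")
    if read_buffers is None:
        # Documented default: Number of sorting threads + 2.
        return sorting_threads + 2
    if read_buffers < sorting_threads:
        raise ValueError(
            "Number of read buffers must be equal to or greater than "
            "Number of sorting threads"
        )
    return read_buffers

print(resolve_read_buffers(2))     # default applied
print(resolve_read_buffers(2, 6))  # explicit value accepted
```

Since each read buffer holds one run, raising the thread count also raises the heap estimate unless Run size is lowered accordingly.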

Best Practices

Performance Bottlenecks

  • Sorting big records (long string fields, tens or hundreds of fields, etc.): FastSort is greedy for both memory and CPU cores. If the system does not have enough of either, FastSort can easily crash with an out-of-memory error. In such a case, use the ExtSort component instead.

  • Utilizing more than 2 CPU cores: Unless you are using very fast disk drives, overriding the default value of Number of sorting threads to more than 2 threads does not necessarily help. It can even slow the process down slightly, because each additional thread requires extra memory.

  • Coping with quotas and other runtime limitations: In complex graphs with several parallel sorts, possibly alongside other graph components that also keep a huge number of files open, the Too many open files error and graph execution failure may occur. There are two possible solutions to this issue:
      a. Increase the limit (quota). This option is recommended for production systems since there is no speed sacrifice. On Unix systems, this typically means setting the limit to a higher number.
      b. Force FastSort to keep the number of temporary files below some limit. For regular users on large servers, increasing the quota is not an option, so Max open files must be set to a reasonable value. FastSort then performs intermediate merges of temporary files to keep their number below the limit. However, setting Max open files to a value for which such merges are inevitable often causes a significant performance drop, so keep it as high as possible. If you are forced to limit FastSort to fewer than a hundred temporary files, even for large datasets, consider using ExtSort instead, which is designed to perform well with a limited number of tapes.
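To see why a low Max open files value costs time, the sketch below merges sorted runs while never combining more than a fixed number of them at once. In-memory lists stand in for temporary files, and the code is a generic external-merge illustration under that assumption, not FastSort's actual implementation; `merge_with_limit` is a hypothetical name.

```python
import heapq

def merge_with_limit(runs, max_open):
    """Merge sorted runs, reading at most max_open runs at a time.

    Returns the fully merged list and the number of intermediate merge passes.
    """
    if max_open < 2:
        raise ValueError("max_open must be at least 2")
    passes = 0
    while len(runs) > max_open:
        # Intermediate merge: combine groups of max_open runs into longer runs.
        runs = [
            list(heapq.merge(*runs[i:i + max_open]))
            for i in range(0, len(runs), max_open)
        ]
        passes += 1
    return list(heapq.merge(*runs)), passes

# Build 8 sorted runs of 5 records each from a deterministic permutation of 0..39.
data = [(i * 7) % 40 for i in range(40)]
runs = [sorted(data[i:i + 5]) for i in range(0, 40, 5)]

for limit in (8, 3, 2):
    merged, passes = merge_with_limit(runs, limit)
    print(f"limit {limit}: {passes} intermediate pass(es)")
```

With a limit of 8 all runs are merged in one go; with a limit of 3 or 2, every record is rewritten in one or two extra passes before the final merge, which is the extra work behind the performance drop.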

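See also

  • ExtSort
  • Common Properties of Components
  • Specific attribute types
  • Common Properties of Transformers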

Warning: FastSort does not preserve the order of records with equal key values (the sorting algorithm is not stable). If stability is required, use ExtSort instead.
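For readers unfamiliar with the term, the sketch below shows what a stable result looks like. It relies on Python's built-in `sorted`, which is stable; the record contents are made up for illustration.

```python
# Each record is (key, payload); two records share key "a" and two share key "b".
records = [("b", "first"), ("a", "second"), ("b", "third"), ("a", "fourth")]

# A stable sort keeps records with equal keys in their input order.
stable = sorted(records, key=lambda rec: rec[0])
print(stable)
```

A stable sort always yields "second" before "fourth" and "first" before "third". An unstable sort such as FastSort may return either order within each key group, so downstream logic should not depend on it.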


Make sure you have dedicated enough memory to your Java Virtual Machine (JVM). With plenty of memory available, FastSort can perform very well. Remember that the default JVM heap space (64MB) can cause FastSort to crash. For best results, try increasing the memory value up to 2GB (if possible, while still leaving enough memory for the operating system). How to set the JVM is described in Runtime configuration.

