Primeur Online Docs
Data Shaper
Data Shaper
  • 🚀GETTING STARTED
    • What is Primeur Data Shaper
      • What is the Data Shaper Designer
      • What is the Data Shaper Server
      • What is the Data Shaper Cluster
    • How does the Data Shaper Designer work
      • Designer Views and Graphs
      • Data Shaper Graphs
      • Designer Projects and Sandboxes
      • Data Shaper Designer Reference
    • How do the Data Shaper Server and Cluster work
      • Data Shaper Server and Cluster
      • Data Shaper Server Reference
    • VFS Graph Components
      • DataOneFileDescriptor (DOFD) metadata
      • Passing parameters from Data One Contract to Data Shaper graph
      • Inheriting Data One context attributes in Data Shaper graph
  • DATA SHAPER DESIGNER
    • Configuration
      • Runtime configuration
        • Logging
        • Master Password
        • User Classpath
      • Data Shaper Server Integration
      • Execution monitoring
      • Java configuration
      • Engine configuration
      • Refresh Operation
    • Designer User Interface
      • Graph Editor with Palette of Components
      • Project Explorer Pane
      • Outline Pane
      • Tabs Pane
      • Execution Tab
      • Keyboard Shortcuts
    • Projects
      • Creating Data Shaper projects
      • Converting Data Shaper projects
      • Structure of Data Shaper projects
      • Versioning of server project content
      • Working with Data Shaper Server Projects
      • Project configuration
    • Graphs
      • Creating an empty graph
      • Creating a simple graph
        • Placing Components
        • Placing Components from Palette
        • Connecting Components with Edges
    • Execution
      • Successful Graph Execution
      • Run configuration
      • Connecting to a running job
      • Graph states
    • Common dialogs
      • URL file dialog
      • Edit Value dialog
      • Open Type dialog
    • Import
      • Import Data Shaper projects
      • Import from Data Shaper server sandbox
      • Import graphs
      • Import metadata
    • Export
      • Export graphs to HTML
      • Export to Data Shaper Server sandbox
      • Export image
      • Export Project as Library
    • Graph tracking
      • Changing record count font size
    • Search functionality
    • Working with Data Shaper server
      • Data Shaper server project basic principles
      • Connecting via HTTP
      • Connecting via HTTPS
      • Connecting via Proxy Server
    • Graph components
      • Adding components
      • Finding components
      • Edit component dialog box
      • Enable/disable component
      • Passing data through disabled component
      • Common properties of components
      • Specific attribute types
      • Metadata templates
    • Edges
      • Connecting Components with Edges
      • Types of Edges
      • Assigning Metadata to Edges
      • Colors of Edges
      • Debugging Edges
      • Edge Memory Allocation
    • Metadata
      • Records and Fields
        • Record Types
        • Data Types in Metadata
        • Data Formats
        • Locale and Locale Sensitivity
        • Time Zone
        • Autofilling Functions
      • Metadata Types
        • Internal Metadata
        • External (Shared) Metadata
        • SQL Query Metadata
        • Reading Metadata from Special Sources
      • Auto-propagated Metadata
        • Sources of Auto-Propagated Metadata
        • Explicitly Propagated Metadata
        • Priorities of Metadata
        • Propagation of SQL Query Metadata
      • Creating Metadata
        • Extracting Metadata from a Flat File
        • Extracting Metadata from an XLS(X) File
        • Extracting Metadata from a Database
        • Extracting Metadata from a DBase File
        • Extracting Metadata from Salesforce
        • SQL Query Metadata
        • User Defined Metadata
      • Merging Existing Metadata
      • Creating Database Table from Metadata and Database Connection
      • Metadata Editor
        • Opening Metadata Editor
        • Basics of Metadata Editor
        • Record Pane
        • Field Name vs. Label vs. Description
        • Details Pane
      • Changing and Defining Delimiters
      • Editing Metadata in the Source Code
      • Multi-value Fields
        • Lists and Maps Support in Components
        • Joining on multivalue fields (Comparison Rules)
    • Connections
      • Database Connections
        • Internal Database Connections
        • External (Shared) Database Connections
        • Database Connections Properties
        • Encryption of Access Password
        • Browsing Database and Extracting Metadata from Database Tables
        • Windows Authentication on Microsoft SQL Server
        • Snowflake Connection
        • Hive Connection
        • Troubleshooting
      • JMS Connections
      • QuickBase Connections
      • Hadoop Connections
      • Kafka Connections
      • OAuth2 Connections
      • MongoDB Connections
      • Salesforce Connections
    • Lookup Tables
      • Lookup Tables in Cluster Environment
      • Internal Lookup Tables
      • External (Shared) Lookup Tables
      • Types of Lookup Tables
    • Sequences
      • Persistent Sequences
      • Non Persistent Sequences
      • Internal Sequences
      • External (Shared) Sequences
      • Editing a Sequence
      • Sequences in Cluster Environment
    • Parameters
      • Internal Parameters
      • External (Shared) Parameters
      • Secure Graph Parameters
      • Graph Parameter Editor
      • Secure Graph Parameters
      • Parameters with CTL2 Expressions (Dynamic Parameters)
      • Environment Variables
      • Canonicalizing File Paths
      • Using Parameters
    • Internal/External Graph Elements
    • Dictionary
      • Creating a Dictionary
      • Using a Dictionary in Graphs
    • Execution Properties
    • Notes in Graphs
      • Placing Notes into Graph
      • Resizing Notes
      • Editing Notes
      • Formatted Text
      • Links from Notes
      • Folding Notes
      • Notes Properties
    • Transformations
      • Defining Transformations
      • Transform Editor
      • Common Java Interfaces
    • Data Partitioning (Parallel Running)
    • Data Partitioning in Cluster
      • High Availability
      • Scalability
      • Graph Allocation Examples
      • Example of Distributed Execution
      • Remote Edges
    • Readers
      • Common Properties of Readers
      • ComplexDataReader
      • DatabaseReader
      • DataGenerator
      • DataOneVFSReader
      • EDIFACTReader
      • FlatFileReader
      • JSONExtract
      • JSONReader
      • LDAPReader
      • MultiLevelReader
      • SpreadsheetDataReader
      • UniversalDataReader
      • X12Reader
      • XMLExtract
      • XMLReader
      • XMLXPathReader
    • Writers
      • Common Properties of Writers
      • DatabaseWriter
      • DataOneVFSWriter
      • EDIFACTWriter
      • FlatFileWriter
      • JSONWriter
      • LDAPWriter
      • SpreadsheetDataWriter
      • HIDDEN StructuredDataWriter
      • HIDDEN TableauWriter
      • Trash
      • UniversalDataWriter
      • X12Writer
      • XMLWriter
    • Transformers
      • Common Properties of Transformers
      • Aggregate
      • Concatenate
      • DataIntersection
      • DataSampler
      • Dedup
      • Denormalizer
      • ExtSort
      • FastSort
      • Filter
      • Map
      • Merge
      • MetaPivot
      • Normalizer
      • Partition
      • Pivot
      • Rollup
      • SimpleCopy
      • SimpleGather
      • SortWithinGroups
      • XSLTransformer
    • Joiners
      • Common Properties of Joiners
      • Combine
      • CrossJoin
      • DBJoin
      • ExtHashJoin
      • ExtMergeJoin
      • LookupJoin
      • RelationalJoin
    • Others
      • Common Properties of Others
      • CheckForeignKey
      • DBExecute
      • HTTPConnector
      • LookupTableReaderWriter
      • WebServiceClient
    • CTL2 - Data Shaper Transformation Language
    • Language Reference
      • Program Structure
      • Comments
      • Import
      • Data Types in CTL2
      • Literals
      • Variables
      • Dictionary in CTL2
      • Operators
      • Simple Statement and Block of Statements
      • Control Statements
      • Error Handling
      • Functions
      • Conditional Fail Expression
      • Accessing Data Records and Fields
      • Mapping
      • Parameters
      • Regular Expressions
    • CTL Debugging
      • Debug Perspective
      • Importing and Exporting Breakpoints
      • Inspecting Variables and Expressions
      • Examples
    • Functions Reference
      • Conversion Functions
      • Date Functions
      • Mathematical Functions
      • String Functions
      • Mapping Functions
      • Container Functions
      • Record Functions (Dynamic Field Access)
      • Miscellaneous Functions
      • Lookup Table Functions
      • Sequence Functions
      • Data Service HTTP Library Functions
      • Custom CTL Functions
      • CTL2 Appendix - List of National-specific Characters
      • HIDDEN Subgraph Functions
    • Tutorial
      • Creating a Transformation Graph
      • Filtering the records
      • Sorting the Records
      • Processing Speed-up with Parallelization
      • Debugging the Java Transformation
  • DATA SHAPER SERVER
    • Introduction
    • Administration
      • Monitoring
    • Using Graphs
      • Job Queue
      • Execution History
      • Job Inspector
    • Cluster
      • Sandboxes in Cluster
      • Troubleshooting
  • Install Data Shaper
    • Install Data Shaper
      • Introduction to Data Shaper installation process
      • Planning Data Shaper installation
      • Data Shaper System Requirements
      • Data Shaper Domain Master Configuration reference
      • Performing Data Shaper initial installation and master configuration
        • Creating database objects for PostgreSQL
        • Creating database objects for Oracle
        • Executing Data Shaper installer
        • Configuring additional firewall rules for Data Shaper
Powered by GitBook
On this page
  • Short Description
  • Ports
  • Metadata
  • ExtHashJoin Attributes
  • Details
  • CTL Scripting Specifics
  • Examples
  • Java Interfaces
  • Best Practices
  • See also
  1. DATA SHAPER DESIGNER
  2. Joiners

ExtHashJoin

Short Description

ExtHashJoin is a general purpose joiner. It merges potentially unsorted data from two or more data sources on a common key. The component is fast as it is processed in memory.

COMPONENT
SAME INPUT METADATA
SORTED INPUTS
SLAVE INPUTS
OUTPUTS
OUTPUT FOR DRIVER WITHOUT SLAVE
OUTPUT FOR SLAVES WITHOUT DRIVER
JOINING BASED ON EQUALITY
AUTO-PROPAGATED METADATA

ExtHashJoin

x

x

1-n

1-2

x

x

✓

✓

Ports

ExtHashJoin receives data through two or more input ports, each of which may have a different metadata structure. The joined data is then sent to the single output port.

PORT TYPE
NUMBER
REQUIRED
DESCRIPTION
METADATA

Input

0

✓

Master input port

Any

1

✓

Slave input port

Any

2-n

x

Optional slave input ports

Any

Output

0

✓

Output port for the joined data

Any

1

x

Output port for the unjoined data

Input 0

Metadata

ExtHashJoin propagates metadata from the first input port to the second output port and from the second output port to the first input port. ExtHashJoin has no metadata templates.

ExtHashJoin Attributes

ATTRIBUTE
REQ
DESCRIPTION
POSSIBLE VALUES

BASIC

Join key

yes

A key according to which incoming data flows are joined. See Join Key below.

Join type

Type of the join. See Join Types below.

Inner (default) | Left outer | Full outer

Transform

[bl

A transformation in CTL or Java defined in a graph.

Transform URL

[bl

An external file defining the transformation in CTL or Java.

Transform class

[bl

External transformation class.

Allow slave duplicates

If set to true, records with duplicate key values are allowed. If false, only the first record is used for join.

false (default) | true

ADVANCED

Transform source charset

Encoding of an external file defining the transformation. The default encoding depends on DEFAULT_SOURCE_CODE_CHARSET in defaultProperties.

E.g. UTF-8

Hash table size

An initial size of a hash table that should be used when joining data flows. If there are more records that should be joined, the hash table can be rehashed; however, it slows down the parsing process. The lowest possible value is 512; if a value lower than 512 is defined, 512 is used instead. The number denotes the number or record. For more information, see Hash Tables below.

512 (default)

DEPRECATED

Error actions

A definition of the action that should be performed when the specified transformation returns an Error code. See Return Values of Transformations.

Error log

A URL of the file to which error messages for specified Error actions should be written. If not set, they are written to Console.

Left outer

If set to true, left outer join is performed. By default, the attribute is set to false. However, this attribute has a lower priority than Join type. If you set both, only Join type will be applied.

false (default) | true

Full outer

If set to true, full outer join is performed. By default, the attribute is set to false. However, this attribute has a lower priority than Join type. If you set both, only Join type will be applied.

false (default) | true

Details

ExtHashJoin reads records from slave ports and stores them into hash tables. The records from the hash tables are employed for mapping with the records from the master.

The data attached to the first input port is called the master (as usual in other Joiners). All remaining connected input ports are called slaves.

Each master record is matched to all slave records on one or more fields known as the join key. An output is produced after applying a transformation that maps joined inputs to the output. For details, see Joining Mechanics below.

This joiner should be avoided in the case of large inputs on the slave port. The reason is that slave data is cached in the main memory.

Join Key The Join key is a key according to which the incoming data flows are joined. It is specified as a sequence of mapping expressions for all slaves.

The mapping expressions are separated from each other by hash. Each of these mapping expressions is a sequence of field names from master and slave records (in this order) put together using an equal sign and separated from each other by a semicolon, colon, or pipe.

$CUSTOMERID=$CUSTOMERID#$ORDERID=$ORDERID;$PRODUCTID=$PRODUCTID

The order of these mappings must correspond to the order of the slave input ports. If some of these mappings are empty or missing for some of the slave input ports, the mapping of the first slave input port is used instead.

Note: Different slaves can be joined with the master using different master fields!

Example 42. Slave Part of Join Key for ExtHashJoin

$master_field1=$slave_field1;$master_field2=$slave_field2;...;$master_fieldN=$slave_fieldN
  • If some $slave_fieldJ is missing (i.e. if the subexpression looks like this: $master_fieldJ=), it is supposed to be the same as the $master_fieldJ.

  • If some $master_fieldK is missing, $master_fieldK from the first port is used.

Example 43. Join Key for ExtHashJoin

$first_name=$fname;$last_name=$lname#=$lname;$salary=;$hire_date=$hdate
  • The following is the part of Join key for the first slave data source (input port 1): $first_name=$fname;$last_name=$lname.

    • Thus, the following two fields from the master data flow are used for join with the first slave data source: $first_name and $last_name.

    • They are joined with the following two fields from this first slave data source: $fname and $lname, respectively.

  • The following is the part of Join key for the second slave data source (input port 2): =$lname;$salary=;$hire_date=$hdate.

    • Thus, the following three fields from the master data flow are used for join with the second slave data source: $last_name (because it is the field which is joined with the $lname for the first slave data source), $salary and $hire_date.

    • They are joined with the following three fields from this second slave data source: $lname, $salary and $hdate, respectively. (This slave $salary field is expressed using the master field of the same name.)

Join Key Dialog To create the Join key attribute, use the Join key dialog. When you click the Join key attribute row, a button appears in this row. By clicking this button, you open the dialog.

In the dialog, you can see tabs for all of the slave input ports. In each slave tab, there are two panes: Slave fields and Key mappings.

The Slave fields pane is on the left. It contains a list of all the slave field names and their data types.

The Key mapping pane is on the right. It contains three columns: Master fields, Master field type and Slave key field mapped. The left column contains all field names of the driver input port. The middle column contains data types corresponding to the fields in the first column. The right column contains the mapped fields from the slave fields tab.

Mapping the Fields To map a slave field to a driver (master) field, drag the desired slave field from the left pane, and drop it into the Slave key field mapped column in the right pane. The mapping will be created.

Repeat for all slave fields to be mapped. The same process must be repeated for all slave tabs.

Note that you can also use the Auto mapping button or other buttons in each tab. Thus, slave fields are mapped to driver (Master) fields according to their names.

Note that different slaves can map different number of slave fields to different number of driver (Master) fields.

Hash Tables The component first receives the records incoming through the slave input ports, reads them and creates hash tables from these records. For every slave input port one hash table is created. After that, the component looks up the corresponding records in these hash tables for each driver record incoming through the driver input port.

If such record(s) are found, the tuple of the driver record and the slave record(s) from the hash tables are sent to a transformation class. The transform method is called for each tuple of the master and its corresponding slave records.

The hash tables must be sufficiently small to fit into the main memory.

The incoming records do not need to be sorted.

If the table is 75% full, the size is doubled and the table is recalculated.

The real size of the hash table is nearest power of 2 greater than or equal to the defined parameter value.

The initialization of hash tables is time consuming, therefore it may be a good idea to specify how many records will be stored in hash tables. If you decide to specify the Hash table size attribute, it is wise to consider these facts and set it to the value greater than needed. Nevertheless, for small sets of records, it is not necessary to change the default value.

Joining Mechanics All slave input data is stored in the memory. However, the master data is not. As for memory requirements, you therefore need to consider only the size of your slave data. In consequence, be sure to always set the larger data to the master and smaller inputs as slaves. ExtHashJoin uses in-memory hash tables for storing slave records.

Remember that each slave port can be joined with master using different numbers of various master fields.

Transformation A transformation in ExtHashJoin is required. A transformation in ExtHashJoin lets you define the transformation that sends records to the first output port. Unjoined master records sent to the second output cannot be modified within the ExtHashJoin transformation.

CTL Scripting Specifics

The mapping of unmatched records to the second (optional) port is performed without being explicitly specified.

Examples

Joining Two Data Streams The master port contains metadata fields ProductID and Color. The color is in the RGB code. The slave port contains RGB values and the corresponding color names.

master (port 0)

Product_A|FF0000
Product_B|00FF00
Product_C|00FFFF

slave (port 1)

FF0000|red
00FF00|green
0000FF|blue

Match the product with corresponding color name.

Solution Set the Join key attribute to $Color=$ColorRGB. Set the **Transform **attributes to

//#CTL2
function integer transform() {
 	$out.0.ProuctID = $in.0.ProductID;
 	$out.0.ColorName = $in.1.ColorName;

 	return ALL;
}

The output from ExtHashJoin is

Product_A|red
Product_B|green

The Product_C is missing as it does not match any record with RGB color.

If you need to send products with no corresponding color to the output, set Join type to Left outer join. The items without a match will have a null value instead of the color.

Java Interfaces

Best Practices

If the transformation is specified in an external file (with Transform URL), we recommend users to explicitly specify Transform source charset.

See also

PreviousDBJoinNextExtMergeJoin

Last updated 1 month ago

[1] One of these must be set. These transformation attributes must be specified. Any of these transformation attributes must use a common CTL template for Joiners or implement a RecordTransform interface. For more information, see CTL Scripting Specifics or Java Interfaces below. For detailed information about transformations, see also .

Hint! If you have larger data, consider using the component. If your data sources are unsorted, use a sorting component first (, or ).

All Joiners share the same transformation template which can be found in .

For detailed information about Data Shaper Transformation Language, see .

If you define your transformation in Java, it must implement the following interface that is common for all Joiners: .

Defining Transformations
ExtMergeJoin
ExtSort
FastSort
SortWithinGroups
CTL2 - Data Shaper Transformation Language
Common Properties of Components
Specific Attribute Types
Common Properties of Joiners
CTL Templates for Joiners
Java Interfaces for Joiners