Primeur Online Docs
Data Shaper

Dedup

Short Description

Dedup removes duplicate records.

| COMPONENT | SAME INPUT METADATA | SORTED INPUTS | INPUTS | OUTPUTS | JAVA | CTL | AUTO-PROPAGATED METADATA |
|---|---|---|---|---|---|---|---|
| Dedup | - | ✓ | 1 | 0-1 | - | - | ✓ |

Input records may be only partially sorted, i.e. records with the same value of the Dedup key are grouped together, but the groups themselves need not be ordered.

Ports

| PORT TYPE | NUMBER | REQUIRED | DESCRIPTION | METADATA |
|---|---|---|---|---|
| Input | 0 | ✓ | For input data records | Any |
| Output | 0 | ✓ | For deduplicated data records | equal input metadata |
| Output | 1 | x | For duplicate data records | equal input metadata |

Metadata

Metadata can be propagated through this component. Dedup has no metadata template. Dedup does not require any specific metadata fields.

Dedup Attributes

| ATTRIBUTE | REQ | DESCRIPTION | POSSIBLE VALUES |
|---|---|---|---|
| BASIC | | | |
| Dedup key | | A key according to which the records are deduplicated. If the Dedup key is not set, the whole input is considered one group, so the Number of duplicates attribute specifies the number of records that are sent to the output. If the Dedup key is set, only the specified number of records with the same values in the fields of the Dedup key is picked up. See Dedup key below. | |
| Keep | | Defines which records are preserved. If First, those from the beginning. If Last, those from the end. Records are selected from a group or from the whole input. If Unique, only records with no duplicates are selected and Number of duplicates is ignored. | First (default), Last, Unique |
| Sorted input | | Assume the input is sorted. See Sorted versus Unsorted Input below. | true (default), false |
| Equal NULL | | By default, records with null values of key fields are considered to be equal. If false, they are considered to be different. | true (default), false |
| Number of duplicates | | The maximum number of duplicate records to be selected from each group of adjacent records with an equal key value or, if the key is not set, the maximum number of records from the beginning or the end of all records. Ignored if the Unique option is selected. | 1 (default), 1-N |

Details

Dedup reads a flow of records grouped by the same values of the Dedup key. The key is formed by field name(s) from the input records. If no key is specified, the component behaves like the Unix head or tail command. The groups don't have to be ordered. The component can select a specified number of the first or the last records from each group or from the whole input. Alternatively, it can select only records that have no duplicates. The deduplicated records are sent to output port 0. The duplicate records may be sent through output port 1.

  • Dedup key

    The component can process sorted input data as well as partially sorted data. When setting the fields composing the Dedup key, choose the proper Order attribute:

    a. Ascending - if the groups of input records with the same key field value(s) are sorted in ascending order
    b. Descending - if the groups of input records with the same key field value(s) are sorted in descending order
    c. Auto - the sorting order of the groups of input records is guessed from the first two records with different values in the key field, i.e. from the first records of the first two groups
    d. Ignore - if the groups of input records with the same key field value(s) are not sorted
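The behavior described in Details can be modeled in a few lines of Python. This is only an illustrative sketch of the documented semantics, not the component's implementation: the function name `dedup_sorted` and its parameters are hypothetical, and it assumes that every record not selected (including all members of non-unique groups when Keep is Unique) goes to output port 1.

```python
from itertools import groupby


def dedup_sorted(records, key=None, keep="first", count=1):
    """Model of Dedup on grouped input; returns (port0, port1) lists."""
    kept, rejected = [], []
    # With no key, the whole input forms a single group (head/tail behavior).
    groups = groupby(records, key=key) if key else [(None, records)]
    for _, group in groups:
        group = list(group)
        if keep == "unique":
            # Only single-record groups pass; count is ignored.
            chosen = group if len(group) == 1 else []
            rest = [] if len(group) == 1 else group
        elif keep == "first":
            chosen, rest = group[:count], group[count:]
        else:  # keep == "last"
            split = max(len(group) - count, 0)
            chosen, rest = group[split:], group[:split]
        kept.extend(chosen)
        rejected.extend(rest)
    return kept, rejected


rows = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5), ("c", 6)]
port0, port1 = dedup_sorted(rows, key=lambda r: r[0])
```

Because `groupby` only merges adjacent records, the sketch needs the input to be grouped by key but not globally ordered, which mirrors the partially sorted input the component accepts.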

Sorted versus Unsorted Input

Dedup can process data in two modes: sorted and unsorted.

If you want to process a huge number of records with many different key values, sort the records first and then use Dedup with Sorted input.

If your data contains only a few different key values, you can use unsorted input. Dedup with unsorted input does not require pre-sorting, but it is limited by the available main memory, because the records to be sent to the first output port are cached in memory.

The main memory requirements in unsorted mode depend on the values of the Number of duplicates and Keep attributes. A lower Number of duplicates means less memory is necessary. Selecting several first records requires less memory than selecting several last records.

Unsorted Input Records and Order of Output Records

If you use unsorted input records, the order of output records is not guaranteed to be the same as the order of input records.

If you keep First record(s), the order on both output ports is preserved.

If you keep Last record(s), the order within any group with the same key is preserved on both output ports. The order of records from different groups on the second output port is not guaranteed.

If you keep Unique records, the order of unique records on the first output port is preserved. The order of records on the second output port may be arbitrary.
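The following Python sketch gives an intuition for why keeping the first records can be cheaper than keeping the last ones in unsorted mode. It is an assumption-laden model, not the component's actual caching strategy: the function names are hypothetical, and the First variant is shown with per-key counters only, which is one plausible way to do it.

```python
from collections import Counter, defaultdict, deque


def keep_first_unsorted(records, key, count=1):
    """Yield (port, record) pairs as records arrive; only per-key counters are stored."""
    seen = Counter()
    for record in records:
        k = key(record)
        seen[k] += 1
        # The decision is final as soon as the record is read.
        yield (0 if seen[k] <= count else 1), record


def keep_last_unsorted(records, key, count=1):
    """Return (port0, port1); candidate records are buffered until the input ends."""
    buffers = defaultdict(lambda: deque(maxlen=count))
    rejected = []
    for record in records:
        buf = buffers[key(record)]
        if len(buf) == count:
            # The oldest buffered record can no longer be among the last `count`.
            rejected.append(buf[0])
        buf.append(record)
    kept = [r for buf in buffers.values() for r in buf]
    return kept, rejected
```

In the Last variant, up to `count` records per distinct key stay in memory until the end of the input, so memory grows with both the number of distinct keys and Number of duplicates. The kept records also come out grouped by key rather than in input order, which matches the note that output order is not guaranteed.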

Examples

Dedup Sorted Records

This example shows the usage of Dedup with sorted input records. This case is suitable for a large number of records with many different key values.

An access log contains IPaddress and timestamp fields. The records are sorted in ascending order according to IPaddress and timestamp. For each IPaddress, find the timestamp of the first access. Null values do not appear in the data.

Solution

Set the Dedup key and Keep attributes.

| ATTRIBUTE | VALUE |
|---|---|
| Dedup key | IPaddress |
| Keep | First |

By default, the number of duplicates is one, therefore it does not have to be set up.
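For readers who want to check the expected result, the same selection can be reproduced outside Data Shaper with a short Python sketch. The sample log entries are invented for illustration; only the field names IPaddress and timestamp come from the example.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical access log, already sorted by IPaddress and then by timestamp.
access_log = [
    {"IPaddress": "10.0.0.1", "timestamp": "2024-01-01 08:00:00"},
    {"IPaddress": "10.0.0.1", "timestamp": "2024-01-01 09:30:00"},
    {"IPaddress": "10.0.0.2", "timestamp": "2024-01-01 08:15:00"},
    {"IPaddress": "10.0.0.7", "timestamp": "2024-01-01 07:45:00"},
    {"IPaddress": "10.0.0.7", "timestamp": "2024-01-02 11:00:00"},
]

# Keep = First with Number of duplicates = 1: take the first record of each group.
first_access = [
    next(group) for _, group in groupby(access_log, key=itemgetter("IPaddress"))
]
```

Since the input is sorted by timestamp within each IPaddress, the first record of every group carries the earliest access time.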

Dedup Unsorted Records

This example shows the usage of Dedup with unsorted input records. This case is suitable for datasets with a small number of different key values.

An access log contains timestamp, username, and IPaddress fields. The records are sorted in ascending order according to the timestamp. The log contains a huge number of records, but only a few different usernames. Your task is to select the last two logins for each user.

Solution

Set Sorted input and Number of duplicates.

| ATTRIBUTE | VALUE |
|---|---|
| Dedup key | username |
| Keep | Last |
| Sorted input | false |
| Number of duplicates | 2 |

Sorted input is set to false as records are not sorted according to Dedup key. Timestamp is not Dedup key. Note that the order of records sent to the output port may be different from the order of records received from the input port.
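As a rough cross-check of what this configuration should produce, the selection can be imitated in Python. The log entries below are made up, the IPaddress field is omitted for brevity, and the code is only a sketch of the expected result, not of how Dedup works internally.

```python
from collections import defaultdict, deque

# Hypothetical log, ordered by timestamp rather than by username.
logins = [
    ("09:00", "alice"), ("09:05", "bob"), ("09:10", "alice"),
    ("09:20", "alice"), ("09:30", "bob"),
]

# One bounded buffer per username; older entries fall out automatically.
last_two = defaultdict(lambda: deque(maxlen=2))
for timestamp, username in logins:
    last_two[username].append((timestamp, username))

result = {user: list(entries) for user, entries in last_two.items()}
```

Here `result` holds at most two entries per user, in timestamp order within each user, while the relative order between different users no longer follows the input.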

Sending out the first N records

The previous component (A) sends out a variable number of records. Send the first 100 records to the component B and send the other records to the component C.

Solution

Connect the input port of Dedup with component A, the first output port with component B, and the second output port with component C. Set the Number of duplicates attribute.

| ATTRIBUTE | VALUE |
|---|---|
| Number of duplicates | 100 |
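With no Dedup key set, this configuration acts as a head-style split. A minimal Python sketch of the equivalent routing follows; the function name `split_first_n` is hypothetical and the two returned lists stand in for components B and C.

```python
from itertools import islice


def split_first_n(records, n=100):
    """Return (to_b, to_c): the first n records and everything after them."""
    iterator = iter(records)
    to_b = list(islice(iterator, n))  # first output port
    to_c = list(iterator)             # second output port
    return to_b, to_c


to_b, to_c = split_first_n(range(250))
```

If component A sends fewer than 100 records, everything goes to B and C receives nothing.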

See also

Common Properties of Components
Specific attribute types
Common Properties of Transformers