Primeur Online Docs
Data Shaper
Troubleshooting

Cluster Reliability in Unreliable Network Environment

Data Shaper Server instances must cooperate with each other to form a Cluster. If the connection between nodes does not work at all, or is not configured, the Cluster cannot work properly. This chapter describes the behavior of Cluster nodes in an environment where the connection between them is unreliable.

Nodes use three channels to exchange status information and data:

  1. Synchronous calls (via HTTP/HTTPS): typically, NodeA requests some operation on NodeB, e.g. a job execution. HTTP/HTTPS is also used for streaming data between workers of a parallel execution.

  2. Asynchronous messaging (a TCP connection on port 7800 by default): typically heart-beats or events, e.g. job started or job finished.

  3. Shared database: each node must be able to create a DB connection to access shared configuration data, execution history, etc.

The following scenarios are described one by one; note that they may also occur together:

  • NodeA Cannot Establish HTTP Connection to NodeB

  • NodeA Cannot Establish TCP Connection (Port 7800 by Default) to NodeB

  • NodeB is Killed or It Cannot Connect to the Database

  • Node Cannot Access the Sandboxes Home Directory

  • Auto-Resuming in Unreliable Network

  • Long-Term Network Malfunction May Cause Jobs to Hang

NodeA Cannot Establish HTTP Connection to NodeB

When an HTTP request cannot be established between nodes, jobs which are delegated between nodes, or which run in parallel on multiple nodes, fail. The error is visible in the Execution History. Each node periodically executes a check-task which verifies the HTTP connection to the other nodes. If a problem is detected, one of the nodes is suspended, since the nodes cannot cooperate with each other.

Time-line describing the scenario:

  • 0s - the network connection between NodeA and NodeB goes down;

  • 0-40s - a check-task running on NodeA cannot establish an HTTP connection to NodeB; the check may last up to 30s before it times out; there is no retry: if the connection fails even once, it is considered unreliable and the nodes cannot cooperate;

  • the status of NodeA or NodeB (the one with the shorter uptime) is changed to suspended.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.check.checkMinInterval The periodicity of Cluster node checks, in milliseconds. Default: 20000

  • cluster.sync.connection.readTimeout The HTTP connection response timeout, in milliseconds. Default: 90000

  • cluster.sync.connection.connectTimeout The timeout for establishing an HTTP connection, in milliseconds. Default: 7000
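As a quick reference, the three properties could be set together in the server configuration, e.g. as a .properties fragment (property names are taken from this section; the values shown are the documented defaults):

```properties
# Periodicity of the inter-node HTTP check (default: 20 s)
cluster.node.check.checkMinInterval=20000

# HTTP connection response timeout (default: 90 s)
cluster.sync.connection.readTimeout=90000

# Timeout for establishing an HTTP connection (default: 7 s)
cluster.sync.connection.connectTimeout=7000
```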

NodeA Cannot Establish TCP Connection (Port 7800 by Default) to NodeB

The TCP connection is used for asynchronous messaging. When NodeB cannot send or receive asynchronous messages, the other nodes are not notified about started or finished jobs, so a parent flow running on NodeA keeps waiting for the event from NodeB. A heart-beat is vital for meaningful load balancing; the same check-task mentioned above also checks the heart-beat from all Cluster nodes.

Time-line describing the scenario:

  • 0s - the network connection between NodeA and NodeB is down;

  • 60s - NodeA uses the last available NodeB heart-beat;

  • 0-40s - a check-task running on NodeA detects the missing heart-beat from NodeB;

  • the status of NodeA or NodeB (the one with shorter uptime) is changed to suspended.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.check.checkMinInterval The periodicity of Cluster node checks, in milliseconds. Default: 40000

  • cluster.node.sendinfo.interval The periodicity of heart-beat messages, in milliseconds. Default: 2000

  • cluster.node.sendinfo.min_interval A heart-beat may occasionally be sent more often than specified by cluster.node.sendinfo.interval. This property specifies the minimum interval in milliseconds. Default: 500

  • cluster.node.remove.interval The maximum interval without a heart-beat before a node is considered missing, in milliseconds. Default: 50000
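The timing logic above can be sketched in a few lines (a simplified model for illustration only, not the server's actual implementation; the function name is invented): with the default values, heart-beats arrive roughly every 2 s, and a node's heart-beat is treated as missing once nothing has been received for longer than cluster.node.remove.interval.

```python
def heartbeat_missing(last_heartbeat_ms: int, now_ms: int,
                      remove_interval_ms: int = 50_000) -> bool:
    """Return True once the time since the last received heart-beat
    exceeds cluster.node.remove.interval (default 50 s)."""
    return now_ms - last_heartbeat_ms > remove_interval_ms

# A 40 s gap is still tolerated; a 60 s gap marks the heart-beat as missing.
print(heartbeat_missing(0, 40_000), heartbeat_missing(0, 60_000))
```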

NodeB is Killed or It Cannot Connect to the Database

Access to the database is vital for running jobs, running the scheduler, and cooperating with the other nodes. Touching the database is also used to detect a dead process: when the JVM process of NodeB is killed, it stops touching the database and the other nodes can detect it.

Time-line describing the scenario:

  • 0s-30s - the last touch on DB;

  • NodeB or its connection to the database is down;

  • 90s - NodeA sees the last touch.

  • 0-40s - a check-task running on NodeA detects an obsolete touch from NodeB;

  • the status of NodeB is changed to stopped; jobs running on NodeB are solved, which means that their status is changed to UNKNOWN and the event is dispatched among the Cluster nodes. The job result is considered an error.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.touch.interval The periodicity of a database touch, in milliseconds. Default: 20000

  • cluster.node.touch.forced_stop.interval The interval during which the other nodes still accept the last touch, in milliseconds. Default: 60000

  • cluster.node.check.checkMinInterval The periodicity of Cluster node checks, in milliseconds. Default: 40000

  • cluster.node.touch.forced_stop.solve_running_jobs.enabled A boolean value which enables or disables the solving of running jobs mentioned above.
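Put together as a configuration fragment (the defaults are from this section; the value shown for the ...solve_running_jobs.enabled property is illustrative, as its default is not stated here):

```properties
# NodeB touches the shared database every 20 s (default)
cluster.node.touch.interval=20000

# Other nodes accept the last touch for 60 s (default)
cluster.node.touch.forced_stop.interval=60000

# Periodicity of Cluster node checks (default: 40 s)
cluster.node.check.checkMinInterval=40000

# Enables/disables solving of jobs running on a dead node
# (the value below is illustrative; the default is not documented here)
cluster.node.touch.forced_stop.solve_running_jobs.enabled=true
```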

Node Cannot Access the Sandboxes Home Directory

The sandboxes home directory is the place where shared sandboxes are located (configured by the sandboxes.home server property). The directory can be on a local or a network file system. If the directory is not accessible, the node cannot work correctly (e.g. jobs cannot be executed), so the affected node must be suspended to prevent jobs from being sent to it.

Time-line describing the scenario:

  • the sandboxes home is located on a remote file system;

  • the connection to the file system is lost;

  • a periodic check tries to access the directory;

  • if the check fails, the node is suspended.

The following configuration properties set the time intervals mentioned above:

  • sandboxes.home.check.checkMinInterval The periodicity of sandboxes home checks, in milliseconds. Default: 20000

  • sandboxes.home.check.filewrite.timeout The timeout for a test write to the sandboxes home, in milliseconds. Default: 600000

Note that setting the timeout too low might force a node under heavy load to suspend itself even though the sandboxes home is actually available. The suspended node can be resumed when the directory is accessible again; see Auto-Resuming in Unreliable Network.
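The check can be pictured as a bounded file-write probe (a hypothetical sketch, not the server's implementation; the helper name and probe mechanics are invented for illustration): the node attempts a small write under the sandboxes home and suspends itself if the write does not finish within sandboxes.home.check.filewrite.timeout.

```python
import concurrent.futures
import tempfile

def sandboxes_home_accessible(path: str, timeout_s: float = 600.0) -> bool:
    """Try to write a probe file under the sandboxes home; report failure
    if the write raises an error or does not complete within the timeout."""
    def write_probe() -> bool:
        with tempfile.NamedTemporaryFile(dir=path, prefix=".ds-probe-") as f:
            f.write(b"ok")
            f.flush()
        return True

    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        try:
            return pool.submit(write_probe).result(timeout=timeout_s)
        except Exception:
            return False
```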

Auto-Resuming in Unreliable Network

In version 4.4, auto-resuming of suspended nodes was introduced.

Time-line describing the scenario:

  • NodeB is suspended after connection loss

  • 0s - NodeA successfully reestablishes the connection to NodeB;

  • 120s - NodeA changes the NodeB status to forced_resume;

  • NodeB attempts to resume itself if the maximum auto-resume count is not reached;

  • If the connection is lost again, the cycle repeats; if the maximum auto-resume count is exceeded, the node will remain suspended until the counter is reset, to prevent suspend-resume cycles.

  • 240min - the auto-resume counter is reset.

The following configuration properties set the time intervals mentioned above:

  • cluster.node.check.intervalBeforeAutoresume Time a node has to be accessible to be forcibly resumed, in milliseconds. Default: 120000

  • cluster.node.check.maxAutoresumeCount How many times a node may try to auto-resume itself. Default: 3

  • cluster.node.check.intervalResetAutoresumeCount The time after which the auto-resume counter is reset, in minutes. Default: 240
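The counter behavior can be modeled as follows (a simplified illustration using the defaults above; the class and method names are invented):

```python
class AutoResumeGuard:
    """Simplified model of the auto-resume limit: a node may resume itself
    at most maxAutoresumeCount times; the counter is reset after
    intervalResetAutoresumeCount minutes."""

    def __init__(self, max_count: int = 3, reset_interval_min: int = 240):
        self.max_count = max_count
        self.reset_interval_min = reset_interval_min
        self.count = 0
        self.last_reset_min = 0

    def may_resume(self, now_min: int) -> bool:
        if now_min - self.last_reset_min >= self.reset_interval_min:
            self.count = 0            # periodic counter reset
            self.last_reset_min = now_min
        if self.count >= self.max_count:
            return False              # stay suspended until the reset
        self.count += 1
        return True
```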

Long-Term Network Malfunction May Cause Jobs to Hang

The master execution executing child jobs on another Cluster node must be notified about status changes of its child jobs. When asynchronous messaging does not work, events from the child jobs are not delivered, so the parent jobs keep running. When the network works again, the child job events may be re-transmitted, so hung parent jobs may finish. However, the network malfunction may last so long that the events can no longer be re-transmitted.

See the following time-line to consider a proper configuration:

  • job A running on NodeA executes job B running on NodeB;

  • the network between NodeA and NodeB goes down for some reason;

  • job B finishes and sends the finished event; however, it cannot be delivered to NodeA, so the event is stored in the sent events buffer;

  • since the network is down, heart-beats cannot be delivered either, and HTTP connections may not be established, so the Cluster reacts as described in the sections above. Even though the nodes may be suspended, the parent job A keeps waiting for the event from job B;

  • now, there are three possibilities:

    • The network finally starts working again; since all undelivered events are kept in the sent events buffer, they are re-transmitted and finally delivered. Parent job A is notified and proceeds. It may still fail later, since some Cluster nodes may be suspended.

    • The network finally starts working again, but the number of events sent during the malfunction exceeded the size limit of the sent events buffer, so some messages were dropped and will not be re-transmitted. The buffer size limit should therefore be raised in environments with an unreliable network. The default limit is 10,000 events, which should be sufficient for thousands of simple job executions; the exact number depends on the number of job phases, since each job execution produces at least 3 events (job started, phase finished, job finished). Note that other events are also fired occasionally (configuration changes, suspending, resuming, cache invalidation), and the messaging layer itself stores its own messages in the buffer, although their number is negligible (tens of messages per hour). Heart-beats are not stored in the buffer. There is also an inbound events buffer used as temporary storage, so that events can be delivered in the correct order when some of them cannot be delivered at the moment. When a Cluster node is inaccessible, the inbound buffer is released after a timeout, which is 1 hour by default.

    • NodeB is restarted, so all undelivered events in its buffer are lost.

The following configuration properties set the time intervals mentioned above:

  • cluster.jgroups.protocol.NAKACK.gc_lag Limits the size of the sent events buffer. Note that each stored message takes 2 kB of heap memory. Default: 10000

  • cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout The inbound buffer timeout for an inaccessible Cluster node (1 hour by default).
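For an environment with an unreliable network, the sent events buffer could be enlarged, for example as follows. With the default of 10,000 events at roughly 2 kB each, the buffer costs about 20 MB of heap; a 50,000-event buffer would cost about 100 MB and cover roughly 16,000 simple job executions at three events each. The values below are illustrative, and the timeout value assumes the property is given in milliseconds:

```properties
# Sent events buffer enlarged for longer outages
# (each stored message takes about 2 kB of heap)
cluster.jgroups.protocol.NAKACK.gc_lag=50000

# Inbound buffer timeout for an inaccessible node
# (illustrative value of 1 hour, assuming milliseconds)
cluster.jgroups.protocol.NAKACK.xmit_table_obsolete_member_timeout=3600000
```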

