Hive Connection

A connection to Apache Hive can be created in exactly the same way as any other DB Connection. Here we make just a few Hive-specific remarks you may find useful.

Hive JDBC Driver

The JDBC driver is part of the Hive release, but the library and its dependencies are scattered among the other Hive libraries. Moreover, the driver depends on one more library from the Hadoop distribution: hadoop-core-*.jar or hadoop-common-*.jar, depending on your Hadoop version (there is always only one of the two).

For Hive version 0.8.1, here is a minimal list of libraries you need for the Hive DB connection JDBC driver:

  • hadoop-core-0.20.205.jar
  • hive-exec-0.8.1.jar
  • hive-jdbc-0.8.1.jar
  • hive-metastore-0.8.1.jar
  • hive-service-0.8.1.jar
  • libfb303-0.7.0.jar
  • slf4j-api-1.6.1.jar
  • slf4j-log4j12-1.6.1.jar

You can put all of the Hive distribution libraries plus the one Hadoop library on the JDBC driver classpath. However, some of the Hive distribution libraries may already be included in Data Shaper, which may result in class loading conflicts. Typically, no commons-logging* and log4j* libraries should be included; otherwise (harmless) warnings will appear in the graph run log.
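
If you want to check the driver and its classpath outside of a graph, a minimal sketch such as the following can help. It assumes Hive 0.8.x with HiveServer1 (driver class org.apache.hadoop.hive.jdbc.HiveDriver, URL prefix jdbc:hive://); the host name, port and database are placeholders you need to adjust.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.ResultSet;
  import java.sql.Statement;

  public class HiveJdbcSmokeTest {
      public static void main(String[] args) throws Exception {
          // Driver class shipped with Hive 0.8.x (HiveServer1); newer Hive releases
          // use org.apache.hive.jdbc.HiveDriver and the jdbc:hive2:// URL prefix.
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

          // Host, port and database are placeholders -- point them at your Hive server.
          String url = "jdbc:hive://hive-server.example.com:10000/default";

          try (Connection conn = DriverManager.getConnection(url, "", "");
               Statement stmt = conn.createStatement();
               ResultSet rs = stmt.executeQuery("SHOW TABLES")) {
              while (rs.next()) {
                  System.out.println(rs.getString(1));
              }
          }
      }
  }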

Using Hive in Data Shaper Transformation Graphs

Remember that Hive is not an ordinary SQL relational database; it has its own SQL-like query language called Hive QL. A great resource on Hive QL and Hive in general is the Apache Hive Wiki.

One of the consequences is that it makes no sense to use the DatabaseWriter component, because the INSERT INTO statement can insert only the results of a SELECT query. Even though it is still possible to work around this, each Data Shaper data record inserted with such an INSERT statement results in a heavyweight MapReduce job, which renders the component extremely slow. Use the LOAD DATA Hive QL statement instead.
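
A typical pattern is therefore to write the records to a file on HDFS first (for example with a file writer component) and then register that file with Hive in a single LOAD DATA statement, for instance issued from the DBExecute component. The JDBC sketch below shows the same idea; the table name and file path are only illustrative.

  import java.sql.Connection;
  import java.sql.DriverManager;
  import java.sql.Statement;

  public class HiveLoadDataSketch {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:hive://hive-server.example.com:10000/default", "", "");
               Statement stmt = conn.createStatement()) {
              // LOAD DATA only moves the already written HDFS file under the table's
              // location; no MapReduce job is started, unlike per-record INSERTs.
              stmt.execute("LOAD DATA INPATH '/tmp/export/orders.csv' INTO TABLE orders");
          }
      }
  }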

In the DBExecute component, always set the Transaction set attribute to One statement. The reason is that the Hive JDBC driver does not support transactions, and any attempt to use them would result in an error saying that the AutoCommit mode cannot be disabled.
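
The sketch below illustrates the restriction; the connection details are the same placeholders as above, and the exact error message may differ between driver versions.

  import java.sql.Connection;
  import java.sql.DriverManager;

  public class HiveAutoCommitCheck {
      public static void main(String[] args) throws Exception {
          Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
          try (Connection conn = DriverManager.getConnection(
                  "jdbc:hive://hive-server.example.com:10000/default", "", "")) {
              // The driver works in auto-commit mode only; this call is expected to
              // throw a SQLException, which is why DBExecute has to run with
              // Transaction set = One statement (no multi-statement transactions).
              conn.setAutoCommit(false);
          }
      }
  }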

Note that the file append operation is fully supported only since HDFS version 0.21.0. Consequently, if you run Hive on top of an older HDFS, you cannot append data to existing tables (use of the OVERWRITE keyword becomes mandatory).
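
On such older deployments, the LOAD DATA call from the sketch above would have to use the OVERWRITE form, replacing the existing table (or partition) contents; the names are again only illustrative.

  // Continuing the earlier JDBC sketch: without HDFS append support, appending
  // to an existing table is not possible, so its contents must be replaced.
  stmt.execute("LOAD DATA INPATH '/tmp/export/orders.csv' OVERWRITE INTO TABLE orders");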