Hive Writer

Writes to one or more tables in Apache Hive.

When the input stream of a HiveWriter target is of a user-defined type, it can write to Hive tables that use Avro, ORC, Parquet, or text file storage formats, and writes use SQL APPEND or INSERT INTO.

When the input stream is the output stream of a DatabaseReader or CDC source:

  • If the Mode is InitialLoad, the storage format may be Avro, ORC, Parquet, or text file, and writes use SQL APPEND or INSERT INTO.

  • If the Mode is Incremental and the storage format is Avro or Parquet, writes use SQL APPEND or INSERT INTO.

  • If the Mode is Incremental and the storage format is ORC, Hive ACID transactions must be enabled and writes use SQL MERGE (which your version of Hive must support). In this case, no duplicate events will be written to Hive after recovery ("exactly-once processing"; see Recovering applications), as may happen when using SQL APPEND or INSERT INTO.
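To illustrate the third case, the following sketch shows a HiveWriter target in Incremental mode writing to an ORC table with ACID transactions enabled (the stream name, table names, addresses, and credentials are hypothetical; your source and environment will differ):

CREATE TARGET HiveMergeTarget USING HiveWriter (
  ConnectionURL:'jdbc:hive2://192.0.2.5:10000',
  Username:'hiveuser',
  Password:'********',
  hadoopurl:'hdfs://192.0.2.5:8020/',
  Mode:'incremental',
  Tables:'MYSCHEMA.EMP,employee keycolumns(emp_id)'
)
INPUT FROM OracleCDCStream;

Because each target row is identified by emp_id, recovery can use SQL MERGE rather than re-appending events.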

Limitations:

  • When the input stream is the output stream of a CDC reader, the reader's Compression property must be False.

  • DDL is not supported. If you need to alter the source tables, quiesce the application, change the source and target tables, and restart.

  • Columns specified in the Tables property's keycolumns option may not be updated. Any attempted update will be silently discarded.

  • Bucketed or partitioned columns may not be updated. This is a limitation of Hive, not HiveWriter.

  • Multiple instances of HiveWriter cannot write to the same table. When a HiveWriter target is deployed on multiple Striim servers, partition the input stream or use an environment variable in table mappings to ensure that they do not write to the same tables.

Hive Writer properties

Authentication Policy (String)

If the HDFS cluster uses Kerberos authentication, provide credentials in the format Kerberos, Principal:<Kerberos principal name>, KeytabPath:<fully qualified keytab file name>. Otherwise, leave blank. For example: authenticationpolicy:'Kerberos, Principal:nn/ironman@EXAMPLE.COM, KeytabPath:/etc/security/keytabs/nn.service.keytab'

Connection URL (String)

The JDBC connection URL, for example, ConnectionURL:'jdbc:hive2://192.0.2.5:10000'

Directory (String)

By default, Striim will create an HDFS directory on the Hive server to use as a staging area. If Striim does not have permission to create the necessary directory, HiveWriter will terminate with a "File Not Found" exception. To resolve that issue, create a staging directory manually and specify it here.

Hadoop Configuration Path (String)

If using Kerberos authentication, specify the path to Hadoop configuration files such as core-site.xml and hdfs-site.xml. If this path is incorrect or the configuration changes, authentication may fail.

Hadoop URL (String)

The URI for the HDFS cluster NameNode. See the sample application below for an example. The default HDFS NameNode IPC port is 8020 or 9000 (depending on the distribution). Port 50070 is the web UI port and should not be specified here.

For an HDFS cluster with high availability, use the value of the dfs.nameservices property from hdfs-site.xml with the syntax hadoopurl:'hdfs://<value>', for example, hadoopurl:'hdfs://mycluster'. When the current NameNode fails, Striim will automatically connect to the next one.

Ignorable Exception Code (String)

Set to TABLE_NOT_FOUND to prevent the application from terminating when Striim tries to write to a table that does not exist in the target. See Handling "table not found" errors for more information.

Ignored exceptions will be written to the application's exception store (see CREATE EXCEPTIONSTORE).

Merge Policy (String; default: eventcount:10000, interval:5m)

With the default setting, events are written every five minutes, or sooner if 10,000 events have accumulated.
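For example, to stage larger batches and flush less often, you could override the defaults (the values here are illustrative, not recommendations):

MergePolicy:'eventcount:50000, interval:10m'

With this setting, events are written every ten minutes, or sooner if 50,000 events have accumulated.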

Mode (String; default: incremental)

With an input stream of a user-defined type, do not change the default. See Replicating Oracle data to Hive.

Password (encrypted password)

The password for the specified user. See Encrypted passwords.

Parallel Threads (Integer)

See Creating multiple writer instances.

Tables (String)

The name(s) of the table(s) to write to. The table(s) must exist in Hive.

When the input stream of the target is the output of a DatabaseReader, IncrementalBatchReader, or SQL CDC source (that is, when replicating data from one database to another), HiveWriter can write to multiple tables. In this case, specify the names of both the source and target tables. You may use the % wildcard only for table names, not for schemas or databases. If the reader uses three-part names, you must use them here as well.

Note that Oracle CDB/PDB source table names must be specified in two parts when the source is Database Reader or Incremental Batch Reader (schema.%,schema.%) but in three parts when the source is Oracle Reader or OJet (database.schema.%,schema.%). SQL Server source table names must be specified in three parts when the source is Database Reader or Incremental Batch Reader (database.schema.%,schema.%) but in two parts when the source is MS SQL Reader or MS Jet (schema.%,schema.%).

Examples:

source.emp,target.emp
source.db1,target.db1;source.db2,target.db2
source.%,target.%
source.mydatabase.emp%,target.mydb.%
source1.%,target1.%;source2.%,target2.%

Since Hive does not have primary keys, you must use the keycolumns option to define a unique identifier for each row in the target table: for example, Tables:'DEMO.EMPLOYEE,employee keycolumns(emp_id)'. If necessary to ensure uniqueness, specify multiple columns with the syntax keycolumns(<column 1>,<column 2>,...). You may use wildcards for the source table provided all the tables have the key columns: for example, Tables:'DEMO.Ora%,HIVE.Hiv% KeyColumns(...)'.

See Mapping columns for additional options.

Username (String)

A Hive user for the server specified in ConnectionURL. The user must have INSERT, UPDATE, DELETE, TRUNCATE, CREATE, DROP, and ALTER privileges on the specified tables.

Hive Writer sample application

The following sample code writes data from PosDataPreview.csv to Hive (to run this code, you must first create the target table in Hive):
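A minimal target table could be created as follows (a sketch: the column types and the delimited text storage format are assumptions matching the DSVFormatter output and the fields selected by the CsvToPosData CQ, not part of the sample itself):

CREATE TABLE posdata (
  merchantId STRING,
  dateTime TIMESTAMP,
  amount DOUBLE,
  zip STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;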

CREATE SOURCE PosSource USING FileReader (
  directory:'Samples/PosApp/AppData',
  wildcard:'PosDataPreview.csv',
  positionByEOF:false
)
PARSE USING DSVParser (
  header:yes
)
OUTPUT TO RawStream;

CREATE CQ CsvToPosData
INSERT INTO PosDataStream
SELECT TO_STRING(data[1]) as merchantId,
  TO_DATEF(data[4],'yyyyMMddHHmmss') as dateTime,
  TO_DOUBLE(data[7]) as amount,
  TO_STRING(data[9]) as zip
FROM RawStream;

CREATE TARGET HiveSample USING HiveWriter (
  ConnectionURL:'jdbc:hive2://192.0.2.76:10000',
  Username:'hiveuser', 
  Password:'********',
  hadoopurl:'hdfs://192.0.2.76:9000/',
  Tables:'posdata'
)
FORMAT USING DSVFormatter ()
INPUT FROM PosDataStream;

HiveWriter data type support and correspondence

When the input stream is of a user-defined type:

TQL type → Hive type

java.lang.Byte → BINARY
java.lang.Double → DOUBLE
java.lang.Float → FLOAT
java.lang.Integer → INTEGER
java.lang.Long → BIGINT
java.lang.Short → SMALLINT, TINYINT
java.lang.String → CHAR, DECIMAL, INTERVAL, NUMERIC, STRING, VARCHAR
org.joda.time.DateTime → TIMESTAMP

When the input stream of a HiveWriter target is the output of an Oracle source (DatabaseReader or OracleReader):

Oracle type → Hive type

BINARY_DOUBLE → DOUBLE
BINARY_FLOAT → FLOAT
BLOB → BINARY
CHAR → CHAR, STRING, VARCHAR
CLOB → STRING
DATE → TIMESTAMP
DECIMAL → DECIMAL, DOUBLE, FLOAT
FLOAT → FLOAT
INTEGER → BIGINT, INT, SMALLINT, TINYINT
LONG → BIGINT
NCHAR → STRING, VARCHAR
NUMBER →
  • INT when the scale is 0 and the precision is less than 10
  • BIGINT when the scale is 0 and the precision is less than 19
  • DECIMAL when the scale is greater than 0 or the precision is greater than 19
NVARCHAR2 → STRING, VARCHAR
SMALLINT → SMALLINT
TIMESTAMP → TIMESTAMP
TIMESTAMP WITH LOCAL TIME ZONE → TIMESTAMP
TIMESTAMP WITH TIME ZONE → TIMESTAMP
VARCHAR2 → STRING, VARCHAR
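Applying the NUMBER rules above to a hypothetical Oracle table (column names and definitions are illustrative only):

-- Oracle source (hypothetical column definitions)
CREATE TABLE accounts (
  id        NUMBER(8,0),   -- scale 0, precision < 10  → Hive INT
  serial_no NUMBER(18,0),  -- scale 0, precision < 19  → Hive BIGINT
  balance   NUMBER(10,2)   -- scale > 0                → Hive DECIMAL
);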

Cloudera Hive Writer

ClouderaHiveWriter is identical to HiveWriter except that, since Cloudera's Hive distribution does not support SQL MERGE, there is no Mode property. ClouderaHiveWriter is always in InitialLoad mode, and writes always use SQL APPEND or INSERT INTO.

See Hive Writer for further discussion and documentation of the properties.

Hortonworks Hive Writer

Except for the name of the adapter, HortonworksHiveWriter is identical to HiveWriter. See Hive Writer for documentation of the properties.