
Hive insert query like SQL

sql
data-insertion
hive-queries
bulk-data-insertion
by Alex Kataev · Nov 23, 2024
TLDR

To insert data in Hive, use the following syntax:

INSERT INTO table_name SELECT * FROM source_table;

When inserting specific values, consider the following:

INSERT INTO TABLE table_name VALUES ('value1', 'value2', 'value3');

Ensure that the source and target table schemas match column-for-column. ACID operations (transactional inserts, updates, and deletes) require transactional tables stored in the ORC file format.
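As a minimal sketch (table and column names here are placeholders), a transactional table is declared like this:

```sql
-- ORC storage plus the transactional property enables ACID operations
CREATE TABLE orders (
  id INT,
  item STRING
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');
```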

Bulk data insertion and temp table usage

For multi-row data insertions, use the stack UDTF inside a SELECT (as a table-generating function, it cannot appear in a VALUES clause):

/* Like a BOGO sale but for data: Buy One Get One */
INSERT INTO TABLE your_table
SELECT stack(2,
  'row1_value1', 'row1_value2', 'row1_value3',
  'row2_value1', 'row2_value2', 'row2_value3');

You can also create a one-row dummy table to simplify inserting literal values:

CREATE TABLE dummy_table AS SELECT * FROM existing_table LIMIT 1;

With a guaranteed single row to select from, you can insert literal values via SELECT ... FROM dummy_table.

Before appending data to the main table, test the statement on a temporary table, or prefix it with EXPLAIN to validate the syntax and inspect the query plan.
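For example, EXPLAIN returns the execution plan without touching any data (table names here are placeholders):

```sql
-- Dry-run: shows the plan, writes nothing
EXPLAIN
INSERT INTO TABLE your_table
SELECT * FROM source_table;
```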

Advanced append operations in Hive

To manipulate existing data, use the INSERT OVERWRITE TABLE clause:

/* Game of tables: Overwrite is coming */
INSERT OVERWRITE TABLE your_table
SELECT * FROM source_table;

The LOAD DATA operation moves a file into the table's directory; without the OVERWRITE keyword, existing data is preserved:

/* Loading... But not the annoying kind */
LOAD DATA INPATH 'path/to/datafile' INTO TABLE your_table;

Prefer Hive statements over juggling temporary text or CSV files by hand; keeping data movement inside Hive keeps the metastore and the data consistent.

To generate a new table from literal values, combine CREATE TABLE AS SELECT with stack():

CREATE TABLE new_table AS
SELECT stack(1, 'value1', 'value2', 'value3') AS (column1, column2, column3);

Use the LIMIT clause for single-row insertions from a multi-row source:

INSERT INTO TABLE your_table SELECT 'value1', 'value2', 'value3' FROM dummy_table LIMIT 1;

Remember, Hive's append syntax closely mirrors standard SQL, so familiar patterns carry over.

Hive insertion optimization tips

Increase efficiency through these strategies:

  • Transactional tables: Enable row-level changes like inserts, updates, and deletes.

  • Partitioning: Direct the insert operation to a particular partition to curtail data processing.

  • Bucketing: Divide data within each partition for granular control.
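Partitioning in particular pays off on insert: writing into a single partition avoids scanning or rewriting the rest of the table. A sketch, assuming a hypothetical sales table partitioned by dt:

```sql
-- Insert only into the named partition; other partitions are untouched
INSERT INTO TABLE sales PARTITION (dt = '2024-11-23')
SELECT id, amount
FROM staging_sales
WHERE dt = '2024-11-23';
```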

The LOAD operation without the overwrite clause is beneficial for appending large datasets.
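For instance, with the LOCAL keyword the file is copied from the client filesystem, and omitting OVERWRITE appends it alongside existing data (the path is a placeholder):

```sql
-- Appends the file to the table's data; add OVERWRITE to replace instead
LOAD DATA LOCAL INPATH '/tmp/new_batch.csv' INTO TABLE your_table;
```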

Dealing with complex data types in Hive

Hive supports complex data types like structs, arrays, and maps. The VALUES clause does not accept complex-type constructors, so insert them through a SELECT instead:

/* Playing with complex data types! */
INSERT INTO TABLE complex_table
SELECT named_struct('id', 1, 'name', 'Item1'),
       array('val1', 'val2'),
       map('key1', 'val1', 'key2', 'val2');

Ensuring data consistency

Make sure to inspect the data for consistency after insertion:

  • Run COUNT checks: Validate the number of rows inserted.

  • Data sampling: Pick out a random portion of data and check for accuracy.

  • Checksum comparisons: A reliable way to compare source and destination data.
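The simplest of these checks is comparing row counts (table names are placeholders):

```sql
-- Both counts should agree after a full INSERT ... SELECT
SELECT COUNT(*) FROM source_table;
SELECT COUNT(*) FROM your_table;
```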

Troubleshooting errors during insertion

If problems arise during the data insertion process, probe into the following areas:

  • Data types: Check for compatibility between source and destination column data types.

  • Hive version: Make sure you're using Hive 0.14 or later to use the VALUES clause.

  • File formats: Transactions require certain file formats like ORC.
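When the source and destination column types disagree, an explicit CAST in the SELECT usually resolves the error. A sketch with placeholder names:

```sql
-- Cast mismatched columns explicitly instead of relying on implicit conversion
INSERT INTO TABLE your_table
SELECT CAST(id AS BIGINT), name
FROM source_table;
```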

Visual representation

Imagine you have a set of ingredients (source_table) to create a perfect dish (table_name):

/* Masterchef: Data edition */
INSERT INTO YourKitchen.Ingredients
SELECT * FROM RecipeBook.Ingredients;

As a result 🍽, your dish contains exactly the ingredients from the recipe book (every row accurately copied into your database table).

Further reading

  1. Hive's LanguageManual DML: Contains information on Hive's Data Manipulation Language.
  2. SQL vs HiveQL: A deep dive into differences between the two for data engineers.
  3. Stack Overflow Discussion: Delves into the intricacies of INSERT ... SELECT statements in HiveQL.
  4. Tutorial on Hive Joins: Discusses Hive joins and subqueries for complex data querying.