Hive
Most data warehouse applications are built on relational databases that use the Structured Query Language (SQL), and this is where Hive comes in. Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.
Hive is most suited for data warehouse applications, where relatively static data is analyzed and fast response times are not required.
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. The biggest limitation is that Hive does not provide record-level updates, inserts, or deletes.
Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing). It’s closer to being an OLAP (Online Analytical Processing) tool, but as we’ll see, Hive isn’t ideal for satisfying the “online” part of OLAP, at least today, since there can be significant latency between issuing a query and receiving a reply, due both to the overhead of Hadoop and to the size of the data sets Hadoop was designed to serve.
If you need OLTP features for large-scale data, you
should consider using a NoSQL
database.
Examples include HBase, a NoSQL database integrated with Hadoop; Cassandra; and DynamoDB, if you are using Amazon’s Elastic MapReduce (EMR) or Elastic Compute Cloud (EC2).
Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
Hive has two types of tables:
1. Managed tables
2. External tables
Managed Tables
The tables we have created so far are called managed tables, or sometimes internal tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default.
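As a minimal sketch (the table name emp_managed and its columns are illustrative, not from this post), a managed table is created like any other table, and dropping it deletes both the metadata and the data files under the warehouse directory:

create table emp_managed (id INT, name STRING, salary DOUBLE)
row format delimited
fields terminated by ','
stored as textfile;
-- dropping a managed table deletes its metadata AND its data in HDFS
drop table emp_managed;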
However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, and we want to run some queries against it without giving Hive ownership of the data. We can define an external table that points to that data but doesn’t take ownership of it.
External Tables
The EXTERNAL keyword tells Hive this table is external, and the LOCATION … clause is required to tell Hive where it’s located. Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted. There are a few other small differences between managed and external tables, where some HiveQL constructs are not permitted for external tables. We’ll discuss those when we come to them. However, it’s important to note that the differences between managed and external tables are smaller than they appear at first. Even for managed tables, you know where they are located, so you can use other tools, hadoop dfs commands, etc., to modify and even delete the files in the directories for managed tables.
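For example, a minimal sketch of an external table (the table name emp_external and the HDFS path /user/cloudera/emp_data are assumptions for illustration, not taken from this post):

create EXTERNAL table emp_external (id INT, name STRING, salary DOUBLE)
row format delimited
fields terminated by ','
location '/user/cloudera/emp_data';  -- assumed path to pre-existing data files
-- dropping an external table removes only the metadata; the data files remain
drop table emp_external;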
Apache Hive
· Hive is one of the components of Hadoop.
· Hive’s query language is called HiveQL (Hive Query Language).
· Hive was introduced at Facebook in order to fulfill requirements with respect to ETL technology.
· Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.
· The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
· Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
· A mechanism to impose structure on a variety of data formats.
· Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
Hive Limitations
· Not all “standard” SQL is supported.
· No support for UPDATE or DELETE.
· No support for inserting single rows.
How to find the Hive version?
hive --version
The version in Cloudera CDH 5.12.0 is Hive 1.1.0-cdh5.12.0.
Delimiters and file formats
By default, Hive stores tables as plain text (TEXTFILE) with the ^A (Ctrl-A) character as the field delimiter. ORC (Optimized Row Columnar) is an optimized columnar format commonly used instead.
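For instance, a table can be stored in ORC rather than the default text format (the table name txn_orc is illustrative):

create table txn_orc (txnno INT, amount DOUBLE)
stored as orc;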
Data Types
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:
Integers
· TINYINT—1-byte integer
· SMALLINT—2-byte integer
· INT—4-byte integer
· BIGINT—8-byte integer
Boolean type
· BOOLEAN—TRUE/FALSE
Floating point numbers
· FLOAT—single precision
· DOUBLE—double precision
Fixed point numbers
· DECIMAL—a fixed point value of user-defined scale and precision
String types
· STRING—sequence of characters in a specified character set
· VARCHAR—sequence of characters in a specified character set with a maximum length
· CHAR—sequence of characters in a specified character set with a defined length
Date and time types
· TIMESTAMP—a specific point in time, up to nanosecond precision
· DATE—a date
Binary types
· BINARY—a sequence of bytes
This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.
Explicit type conversion can be done using the cast operator, as shown below.
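A small sketch of explicit casts (the table txns and column amount are illustrative assumptions):

-- cast a string literal to an integer
select cast('100' as INT);
-- cast a numeric column to a string
select cast(amount as STRING) from txns limit 5;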
Complex Types
Complex types can be built up from primitive types and other composite types using the following (a combined sketch follows this list):
· Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
· Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
· Arrays (indexable lists): the elements in the array have to be of the same type. Elements can be accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
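Putting the three access patterns together in one minimal sketch (the table complex_demo and its column names are illustrative):

create table complex_demo (
  c   STRUCT<a:INT, b:INT>,
  m   MAP<STRING, INT>,
  arr ARRAY<STRING>);
-- struct field, map value, and zero-based array element access
select c.a, m['group'], arr[1] from complex_demo;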
Apache Pig vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases Hive operates on HDFS in a similar way to Apache Pig. The following table lists a few significant points that set Apache Pig apart from Hive.
Apache Pig | Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
Pig Latin is a data flow language. | HiveQL is a query processing language.
Pig Latin is a procedural language and it fits in the pipeline paradigm. | HiveQL is a declarative language.
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.
Describe commands
Use describe <table_name> to list a table’s columns, and describe formatted <table_name> for full details such as location and table type.
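For example, against the EXtxn909 table created later in this post:

describe EXtxn909;            -- lists column names and types
describe formatted EXtxn909;  -- also shows location, table type (MANAGED_TABLE/EXTERNAL_TABLE), SerDe, etc.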
sudo service
hive-metastore status
sudo service hive-server2 status
Useful Hadoop service and safe-mode commands:
1. hdfs dfsadmin -safemode leave
2. sudo service hadoop-master stop : to stop the Hadoop master
3. sudo service hadoop-master start : to start the Hadoop master
4. hadoop dfsadmin -safemode enter : locks HDFS into safe mode
5. hadoop dfsadmin -safemode leave : unlocks safe mode
Create Table:
create EXTERNAL table EXtxn909 (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA LOCAL INPATH '/home/cloudera/emp.txt' OVERWRITE INTO TABLE e1234;
LOAD DATA LOCAL INPATH
'/home/cloudera/Pig123' OVERWRITE INTO TABLE student1;
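A quick query can confirm each load worked, using the tables from the LOAD statements above:

select * from e1234 limit 10;
select count(*) from student1;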