Hive
Most data warehouse applications are built on relational databases that use the Structured Query Language (SQL), and this is where Hive comes in. Hive provides an SQL dialect, called Hive Query Language (abbreviated HiveQL or just HQL), for querying data stored in a Hadoop cluster.
Hive is most suited for data warehouse applications, where relatively static data is analyzed and fast response times are not required.
Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do. The biggest limitation is that Hive does not provide record-level updates, inserts, or deletes.
Hive doesn’t provide crucial features required for OLTP (Online Transaction Processing). It’s closer to being an OLAP (Online Analytical Processing) tool, but as we’ll see, Hive isn’t ideal for satisfying the “online” part of OLAP, at least today, since there can be significant latency between issuing a query and receiving a reply, due both to the overhead of Hadoop and to the size of the data sets Hadoop was designed to serve.
If you need OLTP features for large-scale data, you
should consider using a NoSQL
database.
Examples include HBase, a NoSQL database integrated with Hadoop; Cassandra; and DynamoDB, if you are using Amazon’s Elastic MapReduce (EMR) or Elastic Compute Cloud (EC2).
Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
Hive has two types of tables:
1. Managed tables
2. External tables
Managed Tables
The tables we have created so far are called managed tables, or sometimes internal tables, because Hive controls the lifecycle of their data (more or less). As we’ve seen, Hive stores the data for these tables in a subdirectory under the directory defined by hive.metastore.warehouse.dir (e.g., /user/hive/warehouse) by default.
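As a minimal sketch (the table name emp_managed and its columns are illustrative, not from this post), a managed table is created like any other table, and dropping it deletes both the metadata and the data files under the warehouse directory:

create table emp_managed (id INT, name STRING, salary DOUBLE)
row format delimited
fields terminated by ','
stored as textfile;
-- dropping a managed table deletes its metadata AND its data in HDFS
drop table emp_managed;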
However, managed tables are less convenient for sharing with other tools. For example, suppose we have data that is created and used primarily by Pig or other tools, and we want to run some queries against it without giving Hive ownership of the data. We can define an external table that points to that data but doesn’t take ownership of it.
External Tables
The EXTERNAL keyword tells Hive this table is external, and the LOCATION … clause is required to tell Hive where it’s located. Because it’s external, Hive does not assume it owns the data. Therefore, dropping the table does not delete the data, although the metadata for the table will be deleted. There are a few other small differences between managed and external tables, where some HiveQL constructs are not permitted for external tables. We’ll discuss those when we come to them. However, it’s important to note that the differences between managed and external tables are smaller than they appear at first. Even for managed tables, you know where they are located, so you can use other tools, hadoop dfs commands, etc., to modify and even delete the files in the directories for managed tables.
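For example, a minimal sketch of an external table (the table name emp_external and the HDFS path /user/cloudera/emp_data are assumptions for illustration, not taken from this post):

create EXTERNAL table emp_external (id INT, name STRING, salary DOUBLE)
row format delimited
fields terminated by ','
location '/user/cloudera/emp_data';  -- assumed path to pre-existing data files
-- dropping an external table removes only the metadata; the data files remain
drop table emp_external;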
Apache Hive
· Hive is one of the components of Hadoop.
· Hive’s query language is called HiveQL (Hive Query Language).
· Hive was introduced at Facebook in order to fulfill requirements with respect to ETL technology.
· Hive comes with built-in connectors for comma- and tab-separated values (CSV/TSV) text files, Apache Parquet™, Apache ORC™, and other formats.
· The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.
· Tools to enable easy access to data via SQL, thus enabling data warehousing tasks such as extract/transform/load (ETL), reporting, and data analysis.
· A mechanism to impose structure on a variety of data formats.
· Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™.
Hive Limitations
· Not all “standard” SQL is supported.
· No support for UPDATE or DELETE.
· No support for inserting single rows.
How to find the Hive version?
hive --version
The version in Cloudera CDH 5.12.0 is Hive 1.1.0-cdh5.12.0.
Delimiters and file formats
By default, Hive stores tables as plain text (TEXTFILE) with the ^A (Ctrl-A) character as the field delimiter. ORC (Optimized Row Columnar) is an optimized columnar format commonly used instead.
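For instance, a table can be stored in ORC rather than the default text format (the table name txn_orc is illustrative):

create table txn_orc (txnno INT, amount DOUBLE)
stored as orc;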
Data Types
Primitive Types
Types are associated with the columns in the tables. The following primitive types are supported:
Integers
· TINYINT—1-byte integer
· SMALLINT—2-byte integer
· INT—4-byte integer
· BIGINT—8-byte integer
Boolean type
· BOOLEAN—TRUE/FALSE
Floating point numbers
· FLOAT—single precision
· DOUBLE—double precision
Fixed point numbers
· DECIMAL—a fixed point value of user-defined scale and precision
String types
· STRING—sequence of characters in a specified character set
· VARCHAR—sequence of characters in a specified character set with a maximum length
· CHAR—sequence of characters in a specified character set with a defined length
Date and time types
· TIMESTAMP—a specific point in time, up to nanosecond precision
· DATE—a date
Binary types
· BINARY—a sequence of bytes
This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.
Explicit type conversion can be done using the cast operator, as shown below.
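A small sketch of explicit casts (the table txns and column amount are illustrative assumptions):

-- cast a string literal to an integer
select cast('100' as INT);
-- cast a numeric column to a string
select cast(amount as STRING) from txns limit 5;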
Complex Types
Complex types can be built up from primitive types and other composite types using the following (a combined sketch follows this list):
· Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a.
· Maps (key-value tuples): the elements are accessed using ['element name'] notation. For example, in a map M comprising a mapping from 'group' -> gid, the gid value can be accessed using M['group'].
· Arrays (indexable lists): the elements in the array have to be of the same type. Elements can be accessed using the [n] notation, where n is a zero-based index into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] returns 'b'.
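Putting the three access patterns together in one minimal sketch (the table complex_demo and its column names are illustrative):

create table complex_demo (
  c   STRUCT<a:INT, b:INT>,
  m   MAP<STRING, INT>,
  arr ARRAY<STRING>);
-- struct field, map value, and zero-based array element access
select c.a, m['group'], arr[1] from complex_demo;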
Apache Pig vs Hive
Both Apache Pig and Hive are used to create MapReduce jobs, and in some cases Hive operates on HDFS in a similar way to Apache Pig. The following table lists a few significant points that set Apache Pig apart from Hive.
Apache Pig | Hive
Apache Pig uses a language called Pig Latin. It was originally created at Yahoo. | Hive uses a language called HiveQL. It was originally created at Facebook.
Pig Latin is a data flow language. | HiveQL is a query processing language.
Pig Latin is a procedural language and it fits in the pipeline paradigm. | HiveQL is a declarative language.
Apache Pig can handle structured, unstructured, and semi-structured data. | Hive is mostly for structured data.
Describe commands
Use describe <table_name> to list a table’s columns, and describe formatted <table_name> for full details such as location and table type.
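For example, against the EXtxn909 table created later in this post:

describe EXtxn909;            -- lists column names and types
describe formatted EXtxn909;  -- also shows location, table type (MANAGED_TABLE/EXTERNAL_TABLE), SerDe, etc.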
sudo service
hive-metastore status
sudo service hive-server2 status
Useful Hadoop service and safe-mode commands:
1. hdfs dfsadmin -safemode leave
2. sudo service hadoop-master stop : to stop the Hadoop master
3. sudo service hadoop-master start : to start the Hadoop master
4. hadoop dfsadmin -safemode enter : locks HDFS into safe mode
5. hadoop dfsadmin -safemode leave : unlocks safe mode
Create Table:
create EXTERNAL table EXtxn909 (txnno INT, txndate STRING, custno INT, amount DOUBLE, category STRING, product STRING, city STRING, state STRING, spendby STRING)
row format delimited
fields terminated by ','
stored as textfile;
LOAD DATA LOCAL INPATH '/home/cloudera/emp.txt' OVERWRITE INTO TABLE e1234;
LOAD DATA LOCAL INPATH
'/home/cloudera/Pig123' OVERWRITE INTO TABLE student1;
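A quick query can confirm each load worked, using the tables from the LOAD statements above:

select * from e1234 limit 10;
select count(*) from student1;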