Posts

Showing posts from March, 2020

HCatalog

What is HCatalog?

HCatalog is a table and storage management layer for Hadoop. It exposes the tabular data in the Hive metastore to other Hadoop applications, so users of different data processing tools (Pig, MapReduce) can easily read and write data on the grid. Users do not have to worry about where or in what format their data is stored. HCatalog is a key component of Hive and lets users store their data in any format and any structure.

Why HCatalog?

Enabling the right tool for the right job: The Hadoop ecosystem contains different tools for data processing, such as Hive, Pig, and MapReduce. Although these tools do not require metadata, they can still benefit from it when it is present. Sharing a metadata store also enables users across tools to share data more easily. A workflow where data is loaded and normalized using MapReduce or Pig and then analyzed via Hive is very common. If all these tools share one metastore, then the users of each tool have immediate access t...

Zookeeper

ZooKeeper is an open-source project that provides services such as maintaining configuration information, naming, and distributed synchronization. In HBase, ZooKeeper holds ephemeral nodes that represent the region servers. Master servers use these nodes to discover available servers, and the same nodes are used to track server failures and network partitions. Clients use ZooKeeper to locate region servers before communicating with them. In pseudo-distributed and standalone modes, HBase manages ZooKeeper itself.
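The ephemeral-node idea can be illustrated with a small in-memory model. This is only a sketch: `ToyCoordinator`, the `/hbase/rs/...` paths, and the session ids are made up for illustration and are not the ZooKeeper client API. It shows how a node disappears when its owning session ends, which is what lets a master notice a failed server.

```python
class ToyCoordinator:
    """In-memory stand-in for a coordination service (not the ZooKeeper API)."""

    def __init__(self):
        self._nodes = {}  # node path -> id of the session that created it

    def create_ephemeral(self, path, session_id):
        # An ephemeral node lives only as long as the session that created it.
        self._nodes[path] = session_id

    def expire_session(self, session_id):
        # Losing a session (crash or network partition) removes all its nodes.
        self._nodes = {p: s for p, s in self._nodes.items() if s != session_id}

    def children(self, prefix):
        # List the node names registered directly under a prefix.
        base = prefix.rstrip("/") + "/"
        return sorted(p[len(base):] for p in self._nodes if p.startswith(base))


zk = ToyCoordinator()
zk.create_ephemeral("/hbase/rs/server-a", session_id=1)  # hypothetical paths
zk.create_ephemeral("/hbase/rs/server-b", session_id=2)
live_before = zk.children("/hbase/rs")

zk.expire_session(2)  # server-b crashes or is partitioned away
live_after = zk.children("/hbase/rs")
print(live_before, live_after)
```

A master that watches the list of children would see server-b vanish without the server having to report its own failure.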

HBase Vs RDBMS

HBase and RDBMS

| HBase | RDBMS |
| --- | --- |
| HBase is schema-less; it has no concept of a fixed column schema and defines only column families. | An RDBMS is governed by its schema, which describes the whole structure of its tables. |
| It is built for wide tables and is horizontally scalable. | It is thin and built for small tables; it is hard to scale. |
| HBase has no multi-row transactions. | An RDBMS is transactional. |
| It holds de-normalized data. | It holds normalized data. |
| It is good for semi-structured as well as structured data. | It is good for structured data. |
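The "column families only" point can be made concrete with a toy model. This is a sketch under assumptions: `ToyTable` and the sample rows are hypothetical and say nothing about how HBase stores data. It shows that the column family is fixed when the table is defined, while each row is free to carry different column qualifiers.

```python
from collections import defaultdict


class ToyTable:
    """Toy model of a table that fixes column families but not columns."""

    def __init__(self, column_families):
        self.column_families = set(column_families)
        self.rows = defaultdict(dict)  # row key -> {"family:qualifier": value}

    def put(self, row_key, family, qualifier, value):
        # Only the family is checked; any qualifier is accepted.
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"] = value

    def get(self, row_key):
        # Return a copy of the row's cells, or an empty dict if the row is absent.
        return dict(self.rows.get(row_key, {}))


users = ToyTable(column_families=["info"])
users.put("u1", "info", "name", "Asha")
users.put("u2", "info", "email", "b@example.com")
users.put("u2", "info", "phone", "555-0100")  # u2 has columns that u1 lacks
print(users.get("u1"), users.get("u2"))
```

In a relational table, both rows would have to share the same declared columns, with NULLs filling the gaps.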

HBase Vs HDFS

HBase and HDFS

| HDFS | HBase |
| --- | --- |
| HDFS is a distributed file system suitable for storing large files. | HBase is a database built on top of HDFS. |
| HDFS does not support fast individual record lookups. | HBase provides fast lookups for large tables. |
| It is geared toward high-latency batch processing. | It provides low-latency access to single rows from billions of records (random access). |
| It provides only sequential access to data. | HBase stores its data in indexed files on HDFS, which gives it random access and faster lookups. |
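The difference between sequential and random access can be sketched with plain Python lists. This is a toy comparison, not a description of HDFS or HBase internals; the record layout and the function names are assumptions made for the example. A sequential scan has to walk the records until it finds the key, while a sorted index lets a lookup jump close to the key with a binary search.

```python
import bisect

# Hypothetical data set: 10,000 (row key, value) pairs, sorted by row key.
records = [(f"row{i:05d}", i * 10) for i in range(10_000)]
keys = [k for k, _ in records]  # sorted key index


def sequential_lookup(key):
    """Scan from the start; return (value, number of records examined)."""
    steps = 0
    for k, v in records:
        steps += 1
        if k == key:
            return v, steps
    return None, steps


def indexed_lookup(key):
    """Binary-search the sorted key index; return the value or None."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[i][1]
    return None


value_scan, steps = sequential_lookup("row09999")
value_index = indexed_lookup("row09999")
print(value_scan, steps, value_index)
```

Both functions return the same value, but the scan examines every record to reach the last key, whereas the binary search needs only about log2(10,000), roughly 14, comparisons.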

HBase

What is HBase?

HBase is a distributed, column-oriented database built on top of the Hadoop file system. It is an open-source project and is horizontally scalable. Its data model is similar to Google's Bigtable and is designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS). As part of the Hadoop ecosystem, it provides random, real-time read/write access to data in HDFS. Data can be stored in HDFS either directly or through HBase, and data consumers can read it randomly through HBase.

Sqoop Command

Compress option: To decrease the size of the data after importing it to HDFS, we can use the --compress option when running the command.

    sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root --password cloudera --table categories --hive-import --create-hive-table --hive-table vanshu.sqoop_retailer -m 1 --compress

Direct option: Sqoop can handle bulk transfers very well. We can speed up the transfers by using the --direct parameter.

    sqoop import --connect jdbc:mysql://localhost:3306/retail_db --username root --password cloudera --table categories --direct --target-dir /user/cloudera/Retailer-sqoop --split-by category_id -m 1

Incremental append in Sqoop:

1) Insert new values into the table in MySQL:

    INSERT INTO categories VALUES(60,9,"Cricket12");
    INSERT INTO categories VALUES(61,9,"Volley...
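When the same import is run with different options, the argument list can be assembled in a small helper. This is a minimal sketch: `sqoop_import_args` is a hypothetical function, it only uses flags that appear in the commands above, and it builds the command without running it.

```python
import shlex


def sqoop_import_args(connect, username, password, table, *,
                      target_dir=None, split_by=None, mappers=1,
                      compress=False, direct=False):
    """Build the argument list for a `sqoop import` call (does not run it)."""
    args = ["sqoop", "import",
            "--connect", connect,
            "--username", username,
            "--password", password,
            "--table", table]
    if target_dir:
        args += ["--target-dir", target_dir]
    if split_by:
        args += ["--split-by", split_by]
    args += ["-m", str(mappers)]  # number of parallel map tasks
    if compress:
        args.append("--compress")
    if direct:
        args.append("--direct")
    return args


cmd = sqoop_import_args(
    "jdbc:mysql://localhost:3306/retail_db", "root", "cloudera", "categories",
    target_dir="/user/cloudera/Retailer-sqoop",
    split_by="category_id",
    direct=True,
)
print(shlex.join(cmd))  # shell-quoted form of the command, for inspection
```

Keeping the arguments as a list rather than one string avoids quoting problems if the list is later passed to something like `subprocess.run`.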