When a file is stored in Hadoop, it is only a raw file that still has to be processed before it can be queried.
A log file, for instance, contains semi-structured data, and making such a file queryable is a challenge. For example, suppose we have dumped a large employee database into a simple text file and uploaded it to HDFS for processing. How do we make it queryable?
This is where Hive comes in. It takes the raw data and presents it in a queryable form that the enterprise can use.
Consider a file in any format, such as JSON, PDF, CSV, or email. The workflow is as follows.
1. The file to be processed and made queryable is loaded into HDFS.
2. Hive sits between the SQL query and the files.
3. First, a table is defined for querying the file. For example, it may have five fields.
4. Then the schema is defined: the name and type of each field.
5. Next, the file is loaded into the Hive table using the LOAD command.
6. Hive takes the SQL query and translates it into a MapReduce pipeline.
7. The pipeline is executed on the cluster.
8. The final result is translated back into a format the user can read.
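The steps above can be sketched as the HiveQL statements a client would submit. The Python snippet below only assembles those statements as strings. The table name employees, its five columns, and the HDFS path are hypothetical, chosen to match the employee example, and how the script reaches Hive (for example through a command-line client) is deliberately left out.

```python
# Hypothetical names: the table, its columns and the HDFS path are
# illustrative assumptions, not taken from a real deployment.
TABLE = "employees"
HDFS_FILE = "/data/raw/employees.csv"  # assumed location of the uploaded file

# Steps 3-4: define the table and its schema (five fields, comma-delimited text).
create_table = (
    f"CREATE TABLE {TABLE} (\n"
    "  id INT,\n"
    "  name STRING,\n"
    "  department STRING,\n"
    "  salary DOUBLE,\n"
    "  hire_date STRING\n"
    ")\n"
    "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
    "STORED AS TEXTFILE"
)

# Step 5: load the file that already sits in HDFS into the table.
load_data = f"LOAD DATA INPATH '{HDFS_FILE}' INTO TABLE {TABLE}"

# Step 6: an ordinary SQL query; Hive turns it into jobs on the cluster.
query = (
    f"SELECT department, COUNT(*) AS headcount FROM {TABLE} GROUP BY department"
)


def hive_script():
    """Join the statements into one script, each terminated by a semicolon."""
    return ";\n".join([create_table, load_data, query]) + ";\n"


if __name__ == "__main__":
    print(hive_script())
```

The printed script could be saved to a file and passed to a Hive client. The comma delimiter is an assumption about how the text file was written.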
Where does Hive store the data?
Hive does not have its own database for storing the data. It keeps only the metadata (the table definitions) and applies that schema to the files when a query runs. In that sense it works on the data virtually rather than storing it itself. The file itself is stored in Hive's warehouse directory in HDFS.
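As a rough illustration of where the files end up, the sketch below computes the HDFS directory of a managed table. It assumes the warehouse root is left at its default of /user/hive/warehouse (the hive.metastore.warehouse.dir setting) and that the table was created without an explicit location. The helper table_location and the database and table names are hypothetical.

```python
import posixpath

# Assumption: the default value of hive.metastore.warehouse.dir is in use.
DEFAULT_WAREHOUSE = "/user/hive/warehouse"


def table_location(table, database="default", warehouse=DEFAULT_WAREHOUSE):
    """Return the directory that holds a managed table's data files.

    Tables in the default database sit directly under the warehouse root;
    tables in any other database sit under a '<database>.db' subdirectory.
    Names are lowercased because Hive treats them as case-insensitive.
    """
    table = table.lower()
    database = database.lower()
    if database == "default":
        return posixpath.join(warehouse, table)
    return posixpath.join(warehouse, database + ".db", table)


if __name__ == "__main__":
    print(table_location("employees"))
    print(table_location("employees", database="hr"))
```

External tables, or tables created with their own location, would not follow this layout.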
Components of Hive
Hive has the following components.
A. Metastore: It stores the schema, that is, the definition of the table we want to query. A relational database such as MySQL can be used for the metastore, and Hive connects to it over JDBC. Hive uses the stored schema to resolve the fields referenced in a query. However, the schema alone is not enough for formats such as JSON, which cannot be made queryable directly; additional code is needed to read the file.
B. Set of SerDes (serializers/deserializers): A SerDe takes a file of a specific format and converts each record into the corresponding fields, which are the fields defined in the schema. A compatible SerDe must be used for each file format. The default SerDe is sufficient for reading text files; for other kinds of file formats, you must specify a different SerDe.
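How the two components cooperate can be sketched in a few lines of Python. This is a conceptual toy, not Hive's actual implementation. The dictionary METASTORE stands in for the metastore, the two functions stand in for the deserializing half of a SerDe, and all table and field names are hypothetical.

```python
import json

# Toy "metastore": table name -> ordered field names and the SerDe to use.
# Hypothetical tables; a real metastore lives in an RDBMS.
METASTORE = {
    "employees_csv": {"fields": ["id", "name", "department"], "serde": "delimited"},
    "employees_json": {"fields": ["id", "name", "department"], "serde": "json"},
}


def deserialize_delimited(line, fields, delimiter=","):
    """Split one delimited text line into the fields named by the schema."""
    values = line.rstrip("\n").split(delimiter)
    # In this sketch, missing trailing columns become None and extra
    # columns are dropped by zip().
    values += [None] * (len(fields) - len(values))
    return dict(zip(fields, values))


def deserialize_json(line, fields):
    """Parse one JSON object per line and keep only the schema's fields."""
    record = json.loads(line)
    return {name: record.get(name) for name in fields}


SERDES = {"delimited": deserialize_delimited, "json": deserialize_json}


def read_row(table, line):
    """Look up the table's schema and SerDe, then turn a raw line into a row."""
    entry = METASTORE[table]
    return SERDES[entry["serde"]](line, entry["fields"])


if __name__ == "__main__":
    print(read_row("employees_csv", "7,Asha,Finance\n"))
    print(read_row("employees_json", '{"id": 7, "name": "Asha", "department": "Finance"}'))
```

The same schema yields the same row shape from two different file formats; only the SerDe changes. The delimited reader leaves every value as a string, since type conversion is omitted from the sketch.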
With these components in place, the Hive setup is complete and the file is ready to be queried.