Top 11 Hadoop Interview Questions

by Brian Wallace

Thanks to the Big Data revolution, businesses today make use of all the data available around them to draw more valuable insights for better-informed decision-making and for strategies that keep them ahead of the competition. As organizations leverage Big Data analytics, adoption of Big Data technologies like Hadoop and Spark has grown. Hadoop is one of the most popular technologies employed in developing solutions to Big Data challenges, and there is accordingly a rising demand for Big Data professionals with Hadoop certification training and experience. Between this demand and securing a job, however, stands a rigorous interview.


A professional’s potential is best assessed during an interview, which is why adequate preparation for an upcoming one matters. We have compiled 11 of the top Hadoop interview questions below. The list is by no means exhaustive, so go the extra mile and research areas of Hadoop beyond what we have covered.

  1. Explain block and block scanner in HDFS.

In Hadoop, HDFS splits large files into smaller pieces referred to as blocks. A block is the smallest amount of data in the file system that can be read or written; blocks have a standard size of 128 MB by default, apart from the last block of a file, which can be smaller.

A block scanner runs regular checks on the data blocks stored in each DataNode to identify checksum errors, so that corrupted blocks can be repaired from healthy replicas. This keeps data integrity in check during transmission and before a read operation.
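To make this concrete, here is a minimal sketch in Java using Hadoop's public FileSystem API that lists the blocks of a file and the DataNodes hosting them. The path /my/test_file is an illustrative assumption, not something from the article.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/my/test_file")); // hypothetical path
        // One BlockLocation per block; the last block may be smaller than 128 MB.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset=%d length=%d hosts=%s%n",
                    block.getOffset(), block.getLength(),
                    String.join(",", block.getHosts()));
        }
        fs.close();
    }
}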

  2. What are the various types of distributed file systems?
  • HDFS (Hadoop Distributed File System), Hadoop’s primary file system
  • GFS (Google File System)
  • MapR File System
  • Ceph File System
  • IBM General Parallel File System (GPFS)
  3. What is the difference between the NameNode, Checkpoint NameNode, and Backup Node?
  • NameNode. The NameNode is the core of HDFS: it stores the metadata that maps files to blocks, but none of the file data itself, which lives on the DataNodes. It keeps the list of all blocks in the Hadoop cluster along with their locations, and it maintains the namespace using the following files:
  • fsimage file - the base file, which records the latest checkpoint of the namespace and in essence reflects the persisted state of HDFS
  • edits file - a log of the modifications made to the namespace since the last checkpoint
  • Checkpoint NameNode. In Hadoop, checkpointing is the process by which the modifications recorded in the edits file are merged into the fsimage file to form a new fsimage file. If changes are only logged to the edits file and never merged, the edit log grows very long and slows down the next startup. The edits therefore have to be merged into the fsimage file from time to time at runtime and the edits log cleared. The Checkpoint Node does this and updates the NameNode, so that on the next startup the merged state is read from fsimage. The Checkpoint Node has the same directory structure as the NameNode.
  • Backup Node. The Backup Node also performs checkpointing, and in addition keeps an up-to-date copy of the file system namespace in memory by streaming edits from the NameNode.
  4. What is rack awareness and how does it work?

DataNodes in a Hadoop cluster are placed across several racks. The NameNode, which acts as the directory of every block’s rack ID, services read/write requests from the blocks that are closest according to those rack IDs. This concept of choosing the closest replica based on its rack ID is known as rack awareness.

Rack awareness is important in Hadoop HDFS because it reduces network traffic between DataNodes: DataNodes located within the same rack communicate more easily and efficiently than those in different racks, which ultimately improves cluster performance. Rack awareness also helps achieve fault tolerance, high data availability, and minimal latency. In practice, Hadoop learns the rack topology from an administrator-supplied script configured through the net.topology.script.file.name property.
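As a rough illustration of how rack placement surfaces to clients, this hedged Java sketch prints the topology path (rack plus DataNode) of every block replica of a file; the path is again the hypothetical /my/test_file:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowRacks {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/my/test_file")); // hypothetical path
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            // Each topology path pairs a rack with a DataNode holding a replica,
            // e.g. /rack1/datanode-3:9866 (example output, not from the article).
            for (String topo : loc.getTopologyPaths()) {
                System.out.println(topo);
            }
        }
        fs.close();
    }
}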

  5. Explain two ways in which the replication factor can be modified or overwritten in HDFS.

Overwriting or modifying replication factors in HDFS can be done in the following ways.

  • Change the replication factor of a single file using the Hadoop FS shell with the following command:

$hadoop fs -setrep -w 2 /my/test_file (test_file is the filename whose replication factor will be set to 2)

  • Modify or overwrite the replication factor of all files in a directory using the Hadoop FS shell with the following command:

$hadoop fs -setrep -w 5 /my/test_dir (test_dir is the name of the directory; all the files in this directory will have their replication factor set to 5)
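The same change can also be made programmatically. Below is a minimal sketch using FileSystem.setReplication, with the same hypothetical /my/test_file path:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Equivalent to: hadoop fs -setrep 2 /my/test_file
        boolean ok = fs.setReplication(new Path("/my/test_file"), (short) 2);
        System.out.println(ok ? "replication factor updated" : "update failed");
        fs.close();
    }
}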

  6. What are edge nodes in Hadoop?

Edge nodes, also known as gateway nodes, are the interface between Hadoop clusters and external networks. Edge nodes run client applications and administration tools.

  7. What is the distributed cache and why is it beneficial in Hadoop?

The distributed cache is a service of the MapReduce framework that copies files to every node in a Hadoop cluster before any task is executed. These files can be read-only data files, jar files, or archives. A file that has been cached for a specific job becomes available both in memory and on the local filesystem of each node that runs the job’s tasks.

Hadoop’s distributed cache is beneficial because:

  • It can distribute complex files such as jar files and archives
  • It tracks the modification timestamp of each cached file, so a cached file cannot be modified while its job is executing.
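As a sketch of how a job registers a cache file with the modern MapReduce API (the older DistributedCache class is deprecated in Hadoop 2+), consider the following; the file path and job name are illustrative assumptions:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "cache-demo"); // hypothetical job name
        // Ship a read-only lookup file to every node before tasks start;
        // the #lookup fragment creates a local symlink of that name.
        job.addCacheFile(new URI("/my/lookup.txt#lookup"));
        // Inside a Mapper/Reducer setup(), the cached file is then readable
        // locally, e.g. via context.getCacheFiles() or the "lookup" symlink.
        // (Mapper, reducer, and input/output setup omitted from this sketch.)
    }
}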
  8. What are the core methods of a reducer?

There are three core methods of a reducer. These are:

  • setup() - configures the reducer’s parameters, such as the distributed cache and input data size, at the start of a task
  • reduce() - the heart of the reducer; it is called once per key and applies the reduce logic to the set of values that share that key
  • cleanup() - called once the reduce() calls are complete, to delete temporary files and release resources
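A minimal word-count style reducer showing all three methods might look like this (the types and the exact logic are illustrative assumptions):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void setup(Context context) {
        // Runs once per task: read configuration, open side files, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;                        // aggregate all values for this key
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once after the last reduce() call: close resources, flush state.
    }
}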
  9. What is a custom partitioner and how do you write one for a Hadoop MapReduce job?

The partition phase takes place between the Map and Reduce phases. The HashPartitioner, which is Hadoop’s default partitioner, divides the key-value pairs emitted by the map phase across the reduce tasks by hashing the key, so every occurrence of a given key is routed to exactly one reducer. A custom partitioner replaces this default so that keys are assigned to reducers according to user-defined conditions.

Writing a custom partitioner follows the steps below:

  • Create a new class that extends the predefined Partitioner class
  • Override the getPartition method in your new class
  • Register the custom partitioner with the job, either through the job configuration in the driver or programmatically via the setPartitionerClass method, as in the sketch below.
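For illustration, here is a hedged sketch of a custom partitioner that routes keys by their first letter; the class name and routing rule are assumptions, not anything prescribed by Hadoop:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String k = key.toString();
        if (k.isEmpty()) {
            return 0;                       // route empty keys to the first reducer
        }
        // Keys starting with a-m go to the first bucket, everything else
        // to the second; the modulo keeps the result valid for any reducer count.
        char first = Character.toLowerCase(k.charAt(0));
        int bucket = (first >= 'a' && first <= 'm') ? 0 : 1;
        return bucket % numPartitions;
    }
}

It would then be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class); the reducer count set via job.setNumReduceTasks(...) determines the numPartitions value passed in.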
  10. What is the difference between the Map-side Join and Reduce-side Join?

In a map-side join, all the work of joining the records is done by the mapper. It is usually applied where a large dataset is being joined to a small dataset, and it requires the input datasets to be structured (sorted and partitioned in the same way).

In a reduce-side join, also known as a repartitioned join, the joining is done by the reducer. Here, the input datasets do not require any particular structure.

  11. What additional benefits does YARN bring to Hadoop?

YARN facilitates the effective use of cluster resources by allowing multiple applications to run simultaneously, which improves Hadoop’s performance significantly. YARN also supports additional processing models, allowing applications that are not based on the MapReduce model to run on the cluster.

Conclusion

While there is always a lot to cover when preparing for an interview, try not to get overwhelmed. Organize your time so you cover the most important aspects of Hadoop, then spend some time going through the interview questions and answers. This will help you learn how to frame your own answers.

