Big Data - Hadoop Interview Questions – Sandeep Kanao
What is Big Data?
Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate to deal with them. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.
What are grid, cloud and cluster?
Cloud: simply an aggregate of computing power. For your purposes, you can think of the entire "cloud" as a single server. It is conceptually much like an old-school mainframe to which you submit your jobs and which returns the result, except that nowadays the concept is applied more widely (i.e. not just raw computing, but also entire services or storage).
Grid: simply many computers which together solve a given problem or crunch data. The fundamental difference between a grid and a cluster is that in a grid each node is relatively independent of the others; problems are solved in a divide-and-conquer fashion.
Cluster: conceptually, many machines combined to make one really big and powerful one. This architecture is much more difficult to get right than a cloud or a grid, because you have to orchestrate all nodes to work together and keep things such as cache, memory, and even clocks consistent. Clouds have much the same problem, but unlike a cluster, a cloud is not conceptually one big machine, so the architecture doesn't have to treat it as such. A cloud, for instance, need not allocate the full capacity of the data center to a single request, whereas that is the point of a cluster: to be able to throw 100% of its power at a single problem.
What are the examples of Big Data?
Black Box Data: The black box is a component of helicopters, airplanes, jets, etc. It captures the voices of the flight crew, recordings from microphones and earphones, and the performance information of the aircraft.
Social Media Data: Social media such as Facebook and Twitter hold information and the views posted by millions of people across the globe.
Stock Exchange Data: The stock exchange data holds information about the ‘buy’ and ‘sell’ decisions that customers make on shares of different companies.
Power Grid Data: The power grid data holds information about the power consumed by a particular node with respect to a base station.
Transport Data: Transport data includes model, capacity, distance and availability of a vehicle.
Search Engine Data: Search engines retrieve lots of data from different databases.
Thus big data involves a huge volume, high velocity, and a wide variety of data. The data falls into three types:
Structured data: Relational data.
Semi-structured data: XML data.
Unstructured data: Word, PDF, text, media logs.
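To make the three types concrete, here is a minimal Python sketch that reads one tiny sample of each. The table, the XML snippet, the sentence and all function names are made up for illustration; only the standard library (`sqlite3`, `xml.etree.ElementTree`) is assumed.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical samples, invented for this illustration.
SEMI_STRUCTURED = "<user id='1'><name>Asha</name><tags><tag>admin</tag></tags></user>"
UNSTRUCTURED = "Asha logged in from Pune and reported that the dashboard was slow."


def read_structured():
    """Structured: a relational table with a fixed schema, queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT, city TEXT)")
    conn.execute("INSERT INTO users VALUES (1, 'Asha', 'Pune')")
    rows = conn.execute("SELECT name, city FROM users WHERE id = 1").fetchall()
    conn.close()
    return rows


def read_semi_structured(xml_text):
    """Semi-structured: tags describe the data, but there is no rigid schema."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "name": root.findtext("name"),
        "tags": [tag.text for tag in root.iter("tag")],
    }


def read_unstructured(text):
    """Unstructured: free text; any structure has to be inferred by the reader."""
    return text.split()


if __name__ == "__main__":
    print(read_structured())
    print(read_semi_structured(SEMI_STRUCTURED))
    print(read_unstructured(UNSTRUCTURED)[:3])
```

The point of the sketch is only the difference in how much the data tells you about itself: the table has named, typed columns, the XML carries its own labels, and the sentence carries none.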
What are Big Data Technologies?
There are various technologies in the market from different vendors, including Amazon, IBM, Microsoft, etc., to handle big data. While looking into the technologies that handle big data, we examine the following two classes of technology:
Operational Big Data
These include systems like MongoDB that provide operational capabilities for real-time, interactive workloads where data is primarily captured and stored.
Analytical Big Data
These include systems like Massively Parallel Processing (MPP) database systems and MapReduce that provide analytical capabilities for retrospective and complex analysis that may touch most or all of the data.
MapReduce provides a new method of analyzing data that is complementary to the capabilities provided by SQL, and a system based on MapReduce can be scaled up from single servers to thousands of high- and low-end machines.
What are the major challenges associated with Big Data?
The major challenges associated with big data are as follows:
Capturing data
Curation
Storage
Searching
Sharing
Transfer
Analysis
Presentation
Many of these problems can be addressed using a programming model called MapReduce, introduced by Google. It divides a task into small parts, assigns them to many computers, and collects their results, which, when integrated, form the result dataset.
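The divide, assign and collect steps described above can be sketched in a few lines of Python. This is a toy word count, not Hadoop's or Google's implementation: the function names are hypothetical, and threads on one machine stand in for the many computers of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, num_chunks):
    """Divide the input into roughly equal parts, one per worker."""
    return [lines[i::num_chunks] for i in range(num_chunks)]


def map_chunk(chunk):
    """Map step: count the words in one chunk, independently of the others."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def reduce_counts(partial_counts):
    """Reduce step: merge the partial results into one result dataset."""
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total


def word_count(lines, num_workers=3):
    chunks = split_into_chunks(lines, num_workers)
    # Assumption: threads stand in for the separate machines of a cluster.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(map_chunk, chunks))
    return reduce_counts(partial_counts)


if __name__ == "__main__":
    sample = ["big data is big", "hadoop handles big data", "data is everywhere"]
    print(word_count(sample))
```

Because each chunk is counted without looking at any other chunk, the map step can run on as many workers as are available; only the final merge needs to see all partial results.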