BDI's Big Data Approach

Text Mining of Survey Comments using Hadoop-based technologies

Demographic Distribution in India using Big Data Analytics


BDI Systems has solid experience with Hadoop-related technologies and has completed multiple projects using them.

Big Data is more than structured information stored neatly in rows and columns. It comes in complex, unstructured formats: everything from web sites, social media and email to videos and presentations. This is a critical distinction because, to extract valuable business intelligence from Big Data, an organization needs technologies that enable scalable, accurate, and powerful analysis of these formats.

Apache Hadoop is a framework that allows for the distributed processing of such large data sets across clusters of machines.

Apache Hadoop, at its core, consists of two sub-projects: Hadoop MapReduce and the Hadoop Distributed File System (HDFS). Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. HDFS is the primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations. Other Hadoop-related projects at Apache include Chukwa, Hive, HBase, Mahout, Sqoop and ZooKeeper.

Here is a brief introduction to each of these technologies:

HDFS - Filesystems that manage storage across a network of machines are called distributed filesystems. HDFS is designed for storing very large files with write-once, read-many-times access patterns, running on clusters of commodity hardware.
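As a rough illustration of how a large file can be cut into blocks and each block copied to several machines, here is a minimal Python sketch. It is not HDFS code: the function names, the node names, and the simple round-robin placement are assumptions made for this example (real HDFS uses a rack-aware placement policy), and the 128 MB block size and replication factor of 3 are illustrative values.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return the size in bytes of each block for a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, nodes, replication=3):
    """Assign every block to `replication` distinct nodes (round-robin, hypothetical policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


# A 300 MB file stored on a hypothetical four-node cluster.
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])

for block_id, holders in placement.items():
    print(f"block {block_id} ({blocks[block_id] // MB} MB) -> {holders}")
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves two copies of each block available.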

MapReduce - MapReduce is a framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster. The framework is inspired by the map and reduce functions commonly used in functional programming.
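The map and reduce idea can be shown on a single machine with a word count over a few survey comments. This is a minimal sketch in plain Python, not the Hadoop API: the function names and the sample comments are made up for illustration, and the "shuffle" step that Hadoop performs across the cluster is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: fold the list of counts for one word into a single total."""
    return key, reduce(lambda a, b: a + b, values)


def word_count(documents):
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical survey comments.
comments = ["good service", "service was slow", "good good value"]
counts = word_count(comments)
print(counts)
```

In a real cluster each call to the map function would run on the node that holds the corresponding block of input, and the reduce calls for different keys would run in parallel on different nodes.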

Chukwa - Chukwa is a Hadoop subproject devoted to large-scale log collection and analysis. Chukwa is built on top of HDFS and the MapReduce framework and inherits Hadoop’s scalability and robustness.

Hive - Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis. HiveServer provides a Thrift interface and a JDBC / ODBC server.

HBase - HBase is the Hadoop application to use when you require real-time, random read/write access to very large datasets. It is a distributed column-oriented database built on top of HDFS.
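To give a feel for random read/write access by row key and column, the following Python sketch models a table as a dictionary of rows, each holding named columns. It is only a single-process toy: `TinyTable`, its methods, and the `family:qualifier` column names are hypothetical and are not the HBase client API, and unlike HBase it keeps no older versions of a cell and is not distributed.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory table: row key -> {"family:qualifier": value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        """Write (or overwrite) one cell, addressed by row key and column name."""
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        """Read one cell, or the whole row as a dict when no column is given."""
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)


table = TinyTable()
table.put("resp-001", "survey:comment", "good service")
table.put("resp-001", "meta:region", "North")
table.put("resp-001", "survey:comment", "very good service")  # overwrites the cell

print(table.get("resp-001", "survey:comment"))
print(table.get("resp-001"))
```

Each read or write touches a single row directly by its key, which is the access pattern the description above refers to, in contrast to scanning a whole file.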

Mahout - Mahout is a highly scalable, open source machine learning library from Apache. Mahout aims to be the machine learning tool of choice when the collection of data to be processed is very large, perhaps far too large for a single machine.

Sqoop/Flume - Sqoop allows easy import and export of data between Hadoop and structured data stores such as relational databases, enterprise data warehouses, and NoSQL systems. The dataset being transferred is sliced into partitions, and a map-only job is launched in which each mapper is responsible for transferring one slice of the dataset.
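The slicing step can be sketched as dividing a numeric key range evenly among the mappers. The Python below is an illustration under assumptions, not Sqoop's implementation: the function names, the `id` key column, and the in-memory table are all hypothetical, and the "mappers" are simulated by a plain loop rather than run in parallel.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive key range [min_id, max_id] into contiguous slices."""
    total = max_id - min_id + 1
    base, extra = divmod(total, num_mappers)
    ranges = []
    start = min_id
    for i in range(num_mappers):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first slices
        if size == 0:
            break  # more mappers than rows: remaining mappers get nothing
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def transfer_slice(table, low, high):
    """Stand-in for one mapper: copy only the rows whose key falls in its slice."""
    return [row for row in table if low <= row["id"] <= high]


# Hypothetical source table with ids 1..10, transferred by four simulated mappers.
source = [{"id": i, "name": f"row{i}"} for i in range(1, 11)]
slices = split_ranges(1, 10, 4)
imported = [row for low, high in slices for row in transfer_slice(source, low, high)]

print(slices)
```

Since the slices are contiguous and do not overlap, every row is transferred exactly once, whichever mapper handles it.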

ZooKeeper - ZooKeeper is a distributed, open-source coordination service for distributed applications. It exposes a simple set of primitives that distributed applications can build upon to implement higher level services for synchronization, configuration maintenance, and groups and naming.