Apache Hadoop

The Apache Hadoop project is the development of open source software for scalable and secure distributed computing.


The Hadoop software library is a framework that enables distributed processing of large datasets using clusters of computers or servers, using simple programming models.

Hadoop is designed to scale easily from single server systems to thousands of machines.

The library detects and handles errors at the application level so that it does not rely on hardware and provides high availability with a cluster of servers with a high level of fault tolerance on each of the machines that make up the cluster.


Hadoop Modules

The Hadoop project consists of four main modules:

Hadoop Common is a set of common utilities for all other modules.

HDFS, Hadoop Distributed Files System, is the distributed file system that provides high-performance access to application data.

Hadoop YARN is a framework for scheduling tasks and managing cluster resources.

Hadoop MapReduce is a system for parallel processing of large datasets, based on YARN.


Hadoop Ecosystem

Apart from the above modules, the complete platform, or Hadoop ecosystem, includes other related projects such asApache Ambari, Apache Cassandra, Apache HBase, Apache Hive, Apache Mahout, Apache Pig or Apache Spark, among others.


Ecosistema de Apache Hadoop

Resources about Apache Hadoop

Apache Hadoop Project Home Page

Tutorials and help on the official Apache Hadoop Wiki

Apache Hadoop download page


Posts about Apache projects on Dataprix