Apache Hadoop

The Apache Hadoop project develops open source software for reliable, scalable distributed computing.

 

The Hadoop software library is a framework that enables the distributed processing of large datasets across clusters of computers using simple programming models.

Hadoop is designed to scale easily from a single server to thousands of machines.

Rather than relying on hardware to deliver high availability, the library detects and handles failures at the application layer, so the cluster as a whole remains highly available even when individual machines fail.

 

Hadoop Modules

The Hadoop project consists of four main modules:

Hadoop Common is a set of common utilities for all other modules.

HDFS, the Hadoop Distributed File System, is a distributed file system that provides high-performance access to application data; a short sketch of its Java API follows this list.

Hadoop YARN is a framework for scheduling tasks and managing cluster resources.

Hadoop MapReduce is a YARN-based system for the parallel processing of large datasets; a minimal WordCount job is sketched after this list.
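Taking HDFS first: the sketch below uses the standard org.apache.hadoop.fs.FileSystem Java API to write a small file to the distributed file system and read it back. The NameNode address and the /user/demo/hello.txt path are made-up values for illustration, not settings taken from this article; on a real cluster the file system URI usually comes from core-site.xml.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed NameNode address; normally supplied by the cluster configuration.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/user/demo/hello.txt");

            // Write a small file to HDFS, overwriting it if it already exists.
            try (FSDataOutputStream out = fs.create(file, true)) {
                out.write("Hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back, line by line.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }
}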

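To show what the "simple programming models" mentioned above look like in practice, here is a compact version of the classic WordCount job from the standard Hadoop MapReduce tutorial: the map phase emits a (word, 1) pair for every word in the input, and the reduce phase sums the counts per word. Input and output paths are read from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input split.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Such a job would typically be packaged as a JAR and submitted to the cluster with something like: hadoop jar wordcount.jar WordCount /input /output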
 

Hadoop Ecosystem

Apart from the above modules, the complete platform, or Hadoop ecosystem, includes other related projects such as Apache Ambari, Apache Cassandra, Apache HBase, Apache Hive, Apache Mahout, Apache Pig and Apache Spark, among others.

 

The Apache Hadoop ecosystem

Resources about Apache Hadoop

Apache Hadoop Project Home Page

Tutorials and help on the official Apache Hadoop Wiki

Apache Hadoop download page

 

Posts about Apache projects on Dataprix