As an organization plans its big data strategy, the following terms are likely to be used with increasing frequency.


  • Hadoop: A batch-oriented programming framework that supports the processing of large data sets in a distributed computing environment. Hadoop is written in the Java programming language and is a top-level Apache project (Apache is a decentralized community of developers supporting open-source software).
  • HBase: A non-relational, column-oriented distributed database written in Java. A column-oriented database stores each table column by column rather than row by row, as most relational databases do, which speeds up aggregation and computation over large numbers of similar data items.
  • HDFS: A distributed, scalable, and portable file system written in Java for the Hadoop framework.
  • Hive: A data warehouse infrastructure built on top of Hadoop, providing data summarization, query, and analysis. It permits queries over the data using a familiar SQL-like syntax.
  • Flume: A tool for collecting, aggregating, and moving large amounts of log data from applications to Hadoop.
  • Mahout: A library of Hadoop implementations of common analytical computations.
  • Oozie: A workflow scheduler system developed to manage Hadoop jobs.
  • Pig: A platform for analyzing large data sets that consists of a high-level language (Pig Latin) for expressing data analysis programs, coupled with infrastructure for evaluating these programs.
  • R: A free programming language and software environment for statistical computing and graphics. The R language is widely used among statisticians and data miners for developing statistical software and performing data analysis.
  • Sqoop: A tool for transferring bulk data between relational databases and Hadoop, in either direction.
  • ZooKeeper: A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services for distributed applications.
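
To make the HBase entry's contrast between row-oriented and column-oriented storage concrete, here is a minimal Python sketch. It does not use HBase or any of its APIs. The sample records and the helper name to_columns are hypothetical, chosen only to show why an aggregate over one field can be cheaper when each column is stored together.

```python
# Hypothetical sample data: three sales records in row-oriented form.
rows = [
    {"region": "east", "units": 3, "price": 9.5},
    {"region": "west", "units": 5, "price": 4.0},
    {"region": "east", "units": 2, "price": 7.25},
]


def to_columns(records):
    """Pivot a row-oriented list of dicts into one list per column.

    Illustrative helper only; it assumes every record has the same keys.
    """
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
total_from_rows = sum(record["units"] for record in rows)

# Column layout: the values of that field are already stored together.
total_from_columns = sum(columns["units"])

print(total_from_rows, total_from_columns)
```

Both totals agree. The difference is in how much unrelated data has to be read to compute them, which is the advantage the definition above attributes to column-oriented databases.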
