1. Hadoop-related tools
1.Hadoop
Apache's Hadoop project has become almost synonymous with big data. It continues to grow and has become a complete ecosystem with many open source tools for highly scalable distributed computing.
Supported operating systems: Windows, Linux and OSX.
2.Ambari
As part of the Hadoop ecosystem, this Apache project provides an intuitive web-based interface for configuring, managing, and monitoring Hadoop clusters. For developers who want to integrate Ambari functionality into their own applications, Ambari also exposes a REST (Representational State Transfer) API.
Supported operating systems: Windows, Linux and OSX.
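As a rough illustration of calling Ambari's REST API from another application, the sketch below builds an HTTP request for the `/api/v1/clusters` collection using only Python's standard library. The server address, the credentials, and the helper name `build_list_clusters_request` are hypothetical, and the request is constructed but never sent.

```python
import base64
from urllib.request import Request

def build_list_clusters_request(server_url, username, password):
    """Build (but do not send) a GET request for the clusters collection."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return Request(
        f"{server_url}/api/v1/clusters",
        headers={"Authorization": f"Basic {token}"},  # HTTP Basic authentication
        method="GET",
    )

# Hypothetical server address and credentials; replace with your own.
request = build_list_clusters_request("http://ambari.example.com:8080", "admin", "secret")
print(request.get_method(), request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would actually send it; that step is left out here so the sketch runs without a cluster.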
3.Avro
This Apache project provides a data serialization system with rich data structures and compact formats. Schemas are defined in JSON, which is easily integrated with dynamic languages.
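To show what a JSON-defined schema can look like, here is a minimal sketch that uses only Python's standard library. The `User` record and its fields are hypothetical, and `conforms` is a hand-rolled toy check covering a few primitive types, not the validation performed by the Avro libraries.

```python
import json

# A hypothetical Avro record schema, written as JSON text.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "age",  "type": "int"}
  ]
}
"""

# Toy mapping from a few Avro primitive type names to Python types.
PRIMITIVES = {"string": str, "int": int, "boolean": bool, "double": float}

def conforms(record, schema):
    """Return True if record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        isinstance(record[field["name"]], PRIMITIVES[field["type"]])
        for field in fields
    )

schema = json.loads(USER_SCHEMA_JSON)
print(conforms({"name": "Ada", "age": 36}, schema))    # True
print(conforms({"name": "Ada", "age": "36"}, schema))  # False: age is a string
```

Because the schema is ordinary JSON, a dynamic language can load it with its standard JSON parser and inspect it like any other data structure.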
4.Cascading
Cascading is an application development platform based on Hadoop. Commercial support and training services are available.
5.Chukwa
Chukwa is based on Hadoop and can collect data from large distributed systems for monitoring. It also contains tools for analyzing and displaying data.
Supported operating systems: Linux and OSX.
6.Flume
Flume can collect log data from other applications and then send this data to Hadoop. The official website claims: "It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms."
Supported operating systems: Linux and OSX.
7.HBase
HBase is designed for very large tables with billions of rows and millions of columns. It is a distributed database that provides random, real-time read/write access to big data. It is modeled on Google's Bigtable, but is built on Hadoop and the Hadoop Distributed File System (HDFS).
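The idea of a very wide, sparse table with random reads and writes can be sketched with a small in-memory model. This is a toy under stated assumptions: the class `ToyWideColumnTable`, its methods, and the `family:qualifier` column naming used in the example are illustrative only and are not the HBase client API.

```python
import time
from collections import defaultdict

class ToyWideColumnTable:
    """In-memory toy: row key -> column name -> list of (timestamp, value)."""

    def __init__(self):
        # Only cells that were actually written take up space (sparse storage).
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp=None):
        """Write a new version of one cell."""
        ts = time.time_ns() if timestamp is None else timestamp
        self._rows[row_key][column].append((ts, value))

    def get(self, row_key, column):
        """Return the most recent value of a cell, or None if it was never written."""
        row = self._rows.get(row_key)
        if row is None or column not in row:
            return None
        return max(row[column], key=lambda cell: cell[0])[1]

table = ToyWideColumnTable()
table.put("user#42", "info:name", "Ada", timestamp=1)
table.put("user#42", "info:name", "Ada L.", timestamp=2)
print(table.get("user#42", "info:name"))   # Ada L.
print(table.get("user#42", "info:email"))  # None
```

Each row can carry a different set of columns, which is what makes "millions of columns" practical: absent cells cost nothing.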
8.Hadoop Distributed File System (HDFS)
HDFS is a file system for Hadoop, but it can also be used as an independent distributed file system. It is based on Java and is fault-tolerant, highly scalable and highly configurable.
Supported operating systems: Windows, Linux and OSX.
9.Hive
Apache Hive is a data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a SQL-like language.
10.Hivemall
Hivemall is a collection of machine learning algorithms for Hive. It includes many highly scalable algorithms for classification, regression, recommendation, k-nearest neighbors, anomaly detection, and feature hashing.
11.Mahout
According to the official website, the purpose of the Mahout project is to “create an environment for rapidly building scalable, high-performance machine learning applications.” It includes many data mining algorithms for Hadoop MapReduce, as well as some novel algorithms for Scala and Spark environments.
12.MapReduce
As an integral part of Hadoop, the MapReduce programming model provides a method for processing large distributed data sets. It was originally developed by Google, but is now used by several other big data tools covered in this article, including CouchDB, MongoDB, and Riak.
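The programming model itself is small enough to sketch on a single machine. The following word count is a minimal, non-distributed illustration of the map, shuffle, and reduce phases; the function names are chosen for this example and do not correspond to Hadoop's API.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

print(word_count(["big data", "big tools"]))  # {'big': 2, 'data': 1, 'tools': 1}
```

In a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data across the network, but the contract between the phases is the same.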
13.Oozie
This workflow scheduling tool is specially designed to manage Hadoop tasks. It can trigger tasks based on time or data availability, and integrates with MapReduce, Pig, Hive, Sqoop and many other related tools.
Supported operating systems: Linux and OSX.
14.Pig
Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which offers simplified parallel programming, automatic optimization, and extensibility.
15.Sqoop
Enterprises often need to transfer data between relational databases and Hadoop, and Sqoop is a tool that can complete this task. It can import data into Hive or HBase and export from Hadoop to a relational database management system (RDBMS).
16.Spark
Spark is a data processing engine positioned as an alternative to MapReduce. The project claims it is up to 100 times faster than MapReduce when working in memory and up to 10 times faster when working on disk. It can be used with Hadoop and Apache Mesos, or on its own.
Supported operating systems: Windows, Linux and OSX.
17.Tez
Tez is built on Apache Hadoop YARN and describes itself as "an application framework which allows for a complex directed-acyclic-graph of tasks for processing data." It allows Hive and Pig to simplify complex jobs that would otherwise require multiple steps to complete.
Supported operating systems: Windows, Linux and OSX.
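To make the idea of a directed acyclic graph of tasks concrete, here is a minimal single-process sketch using `graphlib` from Python's standard library (Python 3.9 or later). The `run_dag` helper and the task names are hypothetical; this shows only dependency-ordered execution, not how Tez schedules work on YARN.

```python
from graphlib import TopologicalSorter

def run_dag(tasks, dependencies):
    """Run every task after the tasks it depends on.

    tasks:        name -> callable taking a dict of upstream results
    dependencies: name -> set of upstream task names
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results = {}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name](inputs)
    return results

# Hypothetical four-step job: one load feeding two branches that are then joined.
tasks = {
    "load": lambda inputs: [3, 1, 2],
    "sort": lambda inputs: sorted(inputs["load"]),
    "total": lambda inputs: sum(inputs["load"]),
    "report": lambda inputs: (inputs["sort"], inputs["total"]),
}
dependencies = {"sort": {"load"}, "total": {"load"}, "report": {"sort", "total"}}

print(run_dag(tasks, dependencies)["report"])  # ([1, 2, 3], 6)
```

Expressing the whole job as one graph is what lets the two middle branches share a single `load` step instead of being written as separate multi-step jobs.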
18.ZooKeeper
This big data management tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows nodes in a Hadoop cluster to coordinate with each other.
Supported operating systems: Linux, Windows (only suitable for development environment) and OSX (only suitable for development environment).
2. Big data analysis platform and tools
19.Disco
Disco was originally developed by Nokia. Like Hadoop, it is a distributed computing framework based on MapReduce. It includes a distributed file system and a database supporting billions of keys and values.
Supported operating systems: Linux and OSX.
20.HPCC
HPCC is a big data platform positioned as an alternative to Hadoop, promising very high speed and scalability. In addition to the free community version, HPCC Systems also provides paid enterprise versions, paid modules, training, consulting and other services.
Supported operating system: Linux.
21.Lumify
Lumify, owned by Altamira Technologies (known for its national security technology), is an open source big data integration, analysis and visualization platform. You can try the demo at Try.Lumify.io to see it in action.
Supported operating system: Linux.
22.Pandas
The Pandas project includes data structures and data analysis tools based on the Python programming language. It allows enterprise organizations to use Python as an alternative to R for big data analytics projects.
Supported operating systems: Windows, Linux and OSX.
23.Storm
Storm, now an Apache project, provides real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, The Weather Channel, WebMD, Alibaba, Yelp, Yahoo! Japan, Spotify, Groupon, Flipboard and many others.
Supported operating system: Linux.
3. Database/Data Warehouse
24.Blazegraph
Formerly named "Bigdata", Blazegraph is a highly scalable, high-performance database. It is available under both open source and commercial licenses.
25.Cassandra
This NoSQL database was originally developed by Facebook and is now used by more than 1,500 enterprise organizations, including Apple, the European Organization for Nuclear Research (CERN), Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit and others. It can support very large clusters; for example, the Cassandra system deployed by Apple includes more than 75,000 nodes and holds more than 10PB of data.
26.CouchDB
CouchDB describes itself as "a database that completely embraces the web". It stores data as JSON documents, which can be accessed from a web browser over HTTP and processed with JavaScript. It is easy to use, highly available and scalable on a distributed network.
Supported operating systems: Windows, Linux, OSX and Android.
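Since CouchDB stores JSON documents and is accessed over HTTP, storing a document amounts to an HTTP `PUT` of JSON to `/{database}/{document_id}`. The sketch below builds such a request with Python's standard library without sending it. The database name, document ID, document contents, and the helper `build_put_document_request` are hypothetical, and `localhost:5984` assumes a default local installation.

```python
import json
from urllib.request import Request

def build_put_document_request(base_url, database, doc_id, document):
    """Build (but do not send) a request that would store one JSON document."""
    body = json.dumps(document).encode("utf-8")
    return Request(
        f"{base_url}/{database}/{doc_id}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )

# Hypothetical database, document ID, and contents.
doc = {"title": "Big data tools", "tags": ["hadoop", "nosql"]}
put_request = build_put_document_request("http://localhost:5984", "articles", "tools-list", doc)
print(put_request.get_method(), put_request.full_url)
```

The same URL fetched with a plain `GET` would return the stored document, which is why a web browser is enough to read data back.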
27.FlockDB
FlockDB, developed by Twitter, is a very fast and scalable graph database that is good at storing social network data. While it is still available for download, the open source version of the project has not been updated in some time.
28.Hibari
This Erlang-based project claims to be "a distributed ordered key-value storage system that guarantees strong consistency." It was originally developed by Gemini Mobile Technologies and is now used by several telecom operators in Europe and Asia.
29.Hypertable
Hypertable is a Hadoop-compatible big data database that promises ultra-high performance. Its users include eBay, Baidu, Groupon, Yelp and many other Internet companies. Commercial support services are available.
Supported operating systems: Linux and OSX.
30.Impala
Cloudera claims that the SQL-based Impala database is "the leading open source analytics database for Apache Hadoop." It can be downloaded as a standalone product and is part of Cloudera's commercial big data products.
Supported operating systems: Linux and OSX.
31.InfoBright Community Edition
Designed for data analysis, InfoBright is a column-oriented database with a high compression ratio. InfoBright.com offers paid products based on the same code and provides support services.
Supported operating systems: Windows and Linux.
32.MongoDB
With over 10 million downloads, MongoDB is an extremely popular NoSQL database. An enterprise edition, support, training and related products and services are available on MongoDB.com.
Supported operating systems: Windows, Linux, OSX and Solaris.