
How many tools are needed for big data analysis?

爱喝马黛茶的安东尼
2019-07-25 17:25:41 | 4350 views


1. Hadoop related tools

1.Hadoop

Apache’s Hadoop project has become almost synonymous with big data. It continues to grow and has become a complete ecosystem with many open source tools for highly scalable distributed computing.

Supported operating systems: Windows, Linux and OSX.

2.Ambari

As part of the Hadoop ecosystem, this Apache project provides an intuitive web-based interface for configuring, managing, and monitoring Hadoop clusters. For developers who want to integrate Ambari functionality into their own applications, Ambari also provides a REST (Representational State Transfer) API.

Supported operating systems: Windows, Linux and OSX.
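
As a rough illustration of what calling such a REST API from your own application might look like, the sketch below only builds (and does not send) an HTTP request using Python's standard library. The host name, credentials, port 8080 and the `/api/v1/clusters` path are assumptions made for illustration; check them against the documentation for your Ambari version before relying on them.

```python
import base64
from urllib.request import Request


def build_clusters_request(host, user, password, port=8080):
    """Build a GET request for a cluster listing (not sent here).

    The port and the /api/v1/clusters path are assumptions for this sketch.
    """
    url = f"http://{host}:{port}/api/v1/clusters"
    # HTTP Basic authentication: base64("user:password")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(url, headers={"Authorization": f"Basic {token}"}, method="GET")


# Hypothetical host and credentials, for illustration only.
req = build_clusters_request("ambari.example.com", "admin", "secret")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; that step is left out here because it needs a reachable server.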

3.Avro

This Apache project provides a data serialization system with rich data structures and compact formats. Schemas are defined in JSON, which is easily integrated with dynamic languages.
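
To make the "schemas are defined in JSON" point concrete, here is a minimal sketch of a record schema written as JSON and loaded with Python's standard `json` module. The `User` record and its fields are hypothetical, and the snippet only parses the schema text; it does not use an Avro library or serialize any data.

```python
import json

# Hypothetical record schema, written as JSON text.
USER_SCHEMA_JSON = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["null", "int"], "default": null},
    {"name": "tags", "type": {"type": "array", "items": "string"}}
  ]
}
"""

# Because the schema is plain JSON, a dynamic language can load and
# inspect it with no code generation step.
schema = json.loads(USER_SCHEMA_JSON)
field_names = [field["name"] for field in schema["fields"]]
print(schema["name"], field_names)
```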

4.Cascading

Cascading is an application development platform based on Hadoop. Commercial support and training services are available.

5.Chukwa

Chukwa is based on Hadoop and can collect data from large distributed systems for monitoring. It also contains tools for analyzing and displaying data.

Supported operating systems: Linux and OSX.

6.Flume

Flume can collect log data from other applications and then send this data to Hadoop. According to the official website, it is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms.

Supported operating systems: Linux and OSX.

7.HBase

HBase is designed for very large tables with billions of rows and millions of columns. It is a distributed database that provides random, real-time read/write access to big data. It is somewhat similar to Google's Bigtable, but is built on Hadoop and the Hadoop Distributed File System (HDFS).

8.Hadoop Distributed File System (HDFS)

HDFS is a file system for Hadoop, but it can also be used as an independent distributed file system. It is based on Java and is fault-tolerant, highly scalable and highly configurable.

Supported operating systems: Windows, Linux and OSX.

9.Hive

Apache Hive is a data warehouse for the Hadoop ecosystem. It allows users to query and manage big data using HiveQL, a SQL-like language.

10.Hivemall

Hivemall is a collection of machine learning algorithms for Hive. It includes many highly scalable algorithms for data classification, regression, recommendation, k-nearest neighbors, anomaly detection, and feature hashing.

11.Mahout

According to the official website, the purpose of the Mahout project is to “create an environment for rapidly building scalable, high-performance machine learning applications.” It includes many algorithms for data mining on Hadoop MapReduce, as well as some novel algorithms for Scala and Spark environments.

12.MapReduce

As an integral part of Hadoop, the MapReduce programming model provides a method for processing large distributed data sets. It was originally developed by Google, but is now used by several other big data tools, including CouchDB, MongoDB, and Riak.
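
The programming model itself can be sketched in a few lines of single-process Python. This is only an illustration of the map, shuffle, and reduce steps using the classic word-count example; it is not Hadoop's API, and the function names are chosen for this sketch. A real framework would run the map and reduce functions in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data tools", "big data analysis"])
print(counts)
```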

13.Oozie

This workflow scheduling tool is specially designed to manage Hadoop tasks. It can trigger tasks based on time or data availability, and integrates with MapReduce, Pig, Hive, Sqoop and many other related tools.

Supported operating systems: Linux and OSX.

14.Pig

Apache Pig is a platform for distributed big data analysis. It relies on a programming language called Pig Latin, which offers simplified parallel programming, optimization, and extensibility.

15.Sqoop

Enterprises often need to transfer data between relational databases and Hadoop, and Sqoop is a tool that can complete this task. It can import data into Hive or HBase and export from Hadoop to a relational database management system (RDBMS).

16.Spark

Spark is a data processing engine positioned as an alternative to MapReduce. It claims to be up to 100 times faster than MapReduce when working in memory and up to 10 times faster when working on disk. It can be used with Hadoop and Apache Mesos, or on its own.

Supported operating systems: Windows, Linux and OSX.

17.Tez

Tez is built on Apache Hadoop YARN and is “an application framework that allows building a complex directed acyclic graph of tasks to process data.” It allows Hive and Pig to simplify complex jobs that would otherwise require multiple steps to complete.

Supported operating systems: Windows, Linux and OSX.
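
To show what a "directed acyclic graph of tasks" means in practice, the sketch below uses Python's standard `graphlib` module (Python 3.9+) to order a small set of tasks so that each one runs only after the tasks it depends on. The task names are hypothetical and this is not Tez's API; it only illustrates the idea of expressing a multi-step job as one graph.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each task maps to the set of tasks it depends on.
dag = {
    "load_orders": set(),
    "load_customers": set(),
    "join": {"load_orders", "load_customers"},
    "aggregate": {"join"},
    "write_report": {"aggregate"},
}


def run_order(graph):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = run_order(dag)
print(order)
```

The two load tasks have no dependency on each other, so a scheduler is free to run them in parallel before the join.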

18.ZooKeeper

This big data management tool describes itself as "a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services." It allows nodes in a Hadoop cluster to coordinate with each other.

Supported operating systems: Linux, Windows (only suitable for development environment) and OSX (only suitable for development environment).


2. Big data analysis platform and tools

19.Disco

Disco was originally developed by Nokia. Like Hadoop, it is a distributed computing framework based on MapReduce. It includes a distributed file system and a database that supports billions of keys and values.

Supported operating systems: Linux and OSX.

20.HPCC

As an alternative to Hadoop, HPCC, a big data platform, promises to be very fast and highly scalable. In addition to the free community version, HPCC Systems also provides paid enterprise versions, paid modules, training, consulting and other services.

Supported operating system: Linux.

21.Lumify

Lumify, owned by Altamira Technologies (known for its national security technology), is an open source big data integration, analysis and visualization platform. You can just try the demo version at Try.Lumify.io to see it in action.

Supported operating system: Linux.

22.Pandas

The Pandas project includes data structures and data analysis tools based on the Python programming language. It allows enterprise organizations to use Python as an alternative to R for big data analytics projects.

Supported operating systems: Windows, Linux and OSX.

23.Storm

Storm is now an Apache project that provides real-time processing of big data (unlike Hadoop, which only provides batch processing). Its users include Twitter, The Weather Channel, WebMD, Alibaba, Yelp, Yahoo Japan, Spotify, Groupon, Flipboard and many others.

Supported operating system: Linux.

3. Database/Data Warehouse

24.Blazegraph

Formerly named "Bigdata", Blazegraph is a highly scalable, high-performance graph database. It is available under both open source and commercial licenses.

25.Cassandra

This NoSQL database was originally developed by Facebook and is now used by more than 1,500 enterprise organizations, including Apple, the European Organization for Nuclear Research (CERN), Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit and others. It can support very large clusters; for example, the Cassandra system deployed by Apple includes more than 75,000 nodes and holds more than 10 PB of data.

26.CouchDB

CouchDB bills itself as "a database that fully embraces the web". It stores data in JSON documents that can be queried through a web browser and manipulated with JavaScript. It is easy to use, highly available, and scalable across a distributed network.

Supported operating systems: Windows, Linux, OSX and Android.

27.FlockDB

FlockDB, developed by Twitter, is a very fast and scalable graph database that is good at storing social network data. While it is still available for download, the open source version of the project has not been updated in some time.

28.Hibari

This Erlang-based project claims to be "a distributed ordered key-value storage system that guarantees strong consistency." It was originally developed by Gemini Mobile Technologies and is now used by several telecom operators in Europe and Asia.

29.Hypertable

Hypertable is a big data database compatible with Hadoop that promises very high performance. Its users include eBay, Baidu, Groupon, Yelp and many other Internet companies. Commercial support services are available.

Supported operating systems: Linux and OSX.

30.Impala

Cloudera claims that the SQL-based Impala database is "the leading open source analytics database for Apache Hadoop." It can be downloaded as a standalone product and is also part of Cloudera's commercial big data products.

Supported operating systems: Linux and OSX.

31.InfoBright Community Edition

Designed for data analysis, InfoBright is a column-oriented database with a high compression ratio. InfoBright.com offers paid products based on the same code and provides support services.

Supported operating systems: Windows and Linux.

32.MongoDB

With over 10 million downloads, MongoDB is an extremely popular NoSQL database. An enterprise edition, support, training and related products and services are available on MongoDB.com.

Supported operating systems: Windows, Linux, OSX and Solaris.
