
What is the classic learning route for big data?

silencement (Original) · 2019-06-14


The learning route of big data is as follows:

Java (Java SE, JavaWeb)

Linux (shell, high-concurrency architecture, Lucene, Solr)

Hadoop (HDFS, MapReduce, YARN, Hive, HBase, Sqoop, ZooKeeper, Flume)

Machine learning (R, Mahout)

Storm (Storm, Kafka, Redis)

Spark (Scala, Spark Core, Spark SQL, Spark Streaming, Spark MLlib, Spark GraphX)

Python (Python, PySpark)

Cloud computing platforms (Docker, KVM, OpenStack)

Explanation of terms

1. Linux

Lucene: a full-text indexing and search library that serves as the architecture underlying many search servers.

Solr: a full-text search server built on Lucene. It is configurable, scalable, optimized for query performance, and ships with a complete administration interface.

2. Hadoop

HDFS: a distributed storage system composed of a NameNode and DataNodes. The NameNode holds the filesystem metadata (which blocks make up each file and where they live); the DataNodes store the actual data blocks.
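The NameNode/DataNode split can be sketched as a toy model in plain Python. All names and structures below are invented for illustration (this is not the HDFS API): the NameNode keeps only metadata about which blocks make up a file and where they are, while the DataNodes hold the bytes.

```python
# Toy model of HDFS's metadata/data split (illustration only, not the real API).

BLOCK_SIZE = 4  # bytes per block for the demo; real HDFS defaults to 128 MB

class DataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block_id -> bytes actually stored here

class NameNode:
    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.metadata = {}        # filename -> [(block_id, datanode_name)]

    def write(self, filename, data):
        """Split data into blocks and spread them round-robin over DataNodes."""
        entries = []
        for i in range(0, len(data), BLOCK_SIZE):
            dn = self.datanodes[len(entries) % len(self.datanodes)]
            block_id = f"{filename}#{len(entries)}"
            dn.blocks[block_id] = data[i:i + BLOCK_SIZE]
            entries.append((block_id, dn.name))
        self.metadata[filename] = entries

    def read(self, filename):
        """Reassemble a file by following the metadata to each DataNode."""
        by_name = {dn.name: dn for dn in self.datanodes}
        return b"".join(by_name[name].blocks[bid]
                        for bid, name in self.metadata[filename])

nodes = [DataNode("dn1"), DataNode("dn2")]
nn = NameNode(nodes)
nn.write("log.txt", b"hello big data")
```

Note that the NameNode never touches file contents, which is why in real HDFS it can track a huge filesystem in memory while the data itself is spread across many machines.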

YARN: Hadoop's resource management and job scheduling layer, which coordinates MapReduce (and other) workloads; it is split into a ResourceManager and per-node NodeManagers.

MapReduce: a software framework for writing programs that process large datasets in parallel as map and reduce phases.
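The map/reduce programming model can be sketched without a cluster. The classic word-count example below simulates the three phases (map, shuffle, reduce) in plain Python; it illustrates the model only, not Hadoop's actual API.

```python
# Word count in the MapReduce style, simulated in plain Python.
from itertools import groupby

def map_phase(line):
    # map: emit a (word, 1) pair for every word in the input line
    for word in line.split():
        yield (word, 1)

def reduce_phase(word, counts):
    # reduce: sum all the counts collected for one key
    return (word, sum(counts))

def word_count(lines):
    pairs = [kv for line in lines for kv in map_phase(line)]
    pairs.sort(key=lambda kv: kv[0])   # the "shuffle": group pairs by key
    return dict(reduce_phase(word, [c for _, c in grp])
                for word, grp in groupby(pairs, key=lambda kv: kv[0]))

counts = word_count(["big data big", "data pipeline"])
# counts == {"big": 2, "data": 2, "pipeline": 1}
```

In real Hadoop the map and reduce functions run on many machines and the framework performs the shuffle over the network, but the contract is the same.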

Hive: a data warehouse that can be queried with SQL, which it compiles into Map/Reduce jobs. It is well suited to batch work such as computing trends over website logs, but should not be used for real-time queries, since results take a long time to return.

HBase: a distributed database, very well suited to real-time queries over big data. Facebook has used HBase to store message data and analyze messages in real time.

ZooKeeper: a reliable coordination service for large distributed systems. Hadoop's distributed synchronization is implemented with ZooKeeper, for example coordinating multiple NameNodes and active/standby failover.

Sqoop: transfers data between relational databases and HDFS, in both directions.

Mahout: a scalable machine learning and data mining library, used for recommendation mining, clustering, classification, and frequent itemset mining.

Chukwa: an open-source data collection system for monitoring large distributed systems, built on the HDFS and Map/Reduce frameworks; it also displays, monitors, and analyzes the collected results.

Ambari: a web-based tool with a friendly interface for configuring, managing, and monitoring Hadoop clusters.

3. Cloudera

Cloudera Manager: integrated management, monitoring, and diagnostics for Cloudera clusters.

Cloudera CDH (Cloudera's Distribution, including Apache Hadoop): Cloudera's distribution of Hadoop with its own modifications; the release is called CDH.

Cloudera Flume: a log collection system that supports customizing the various data senders in the logging pipeline to collect data.

Cloudera Impala: provides direct, interactive SQL queries over data stored in Apache Hadoop's HDFS and HBase.

Cloudera Hue: a web manager consisting of the Hue UI, Hue server, and Hue database. Hue provides a shell-style web interface to all CDH components, and MapReduce jobs can be written from within Hue.

4. Machine learning / R

R: a language and environment for statistical analysis and graphics. Integrations with Hadoop (often referred to as Hadoop-R) also exist.

Mahout: provides scalable implementations of classic machine learning algorithms, including clustering, classification, recommendation filtering, and frequent itemset mining, and can scale out to the cloud via Hadoop.

5. Storm

Storm: a distributed, fault-tolerant real-time stream computing system, used for real-time analytics, online machine learning, stream processing, continuous computation, and distributed RPC; it can process messages and update databases in real time.

Kafka: a high-throughput distributed publish-subscribe messaging system that can handle all the action-stream data (page views, searches, and so on) of a consumer-scale website. Whereas Hadoop targets log data and offline analysis, Kafka enables real-time processing; it can also use Hadoop's parallel loading mechanism to unify online and offline message processing.
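Kafka's publish-subscribe model can be sketched in a few lines of plain Python. This is an illustration of the idea only, not the Kafka client API (the `Broker` and `Consumer` classes below are invented): producers append messages to an append-only topic log, and each consumer tracks its own offset into that log.

```python
# Toy in-memory publish-subscribe broker in the spirit of Kafka.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only message log

    def publish(self, topic, message):
        self.topics[topic].append(message)

class Consumer:
    def __init__(self, broker, topic):
        self.broker, self.topic, self.offset = broker, topic, 0

    def poll(self):
        """Return all messages published to the topic since the last poll."""
        log = self.broker.topics[self.topic]
        messages, self.offset = log[self.offset:], len(log)
        return messages

broker = Broker()
clicks = Consumer(broker, "clicks")
broker.publish("clicks", "user1 viewed /home")
broker.publish("clicks", "user2 searched 'spark'")
first = clicks.poll()                       # both messages so far
broker.publish("clicks", "user1 viewed /docs")
second = clicks.poll()                      # only the new message
```

Because the log is append-only and each consumer owns its offset, many independent consumers can read the same stream at their own pace; that is the core of Kafka's design.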

Redis: a key-value, log-structured database written in C; it supports networked access and can be both memory-resident and persisted to disk.
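The combination of in-memory speed plus optional persistence can be sketched with a toy key-value store in Python. The `TinyKV` class and its JSON snapshot format below are invented for illustration; they are not Redis commands or file formats.

```python
# Toy key-value store: fast in-memory dict plus a disk snapshot,
# loosely mirroring Redis's memory-plus-persistence design.
import json, os, tempfile

class TinyKV:
    def __init__(self):
        self.data = {}                    # all reads/writes hit memory

    def set(self, key, value):
        self.data[key] = value

    def get(self, key):
        return self.data.get(key)

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.data, f)       # point-in-time snapshot to disk

    @classmethod
    def load(cls, path):
        kv = cls()
        with open(path) as f:
            kv.data = json.load(f)
        return kv

kv = TinyKV()
kv.set("user:1", "alice")
path = os.path.join(tempfile.mkdtemp(), "dump.json")
kv.save(path)
restored = TinyKV.load(path)              # survives a "restart"
```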

6. Spark

Scala: a fully object-oriented programming language, similar to Java.

jblas: a fast linear algebra library for Java. It is based on BLAS and LAPACK, the de facto industry standards for matrix computation, and uses the ATLAS implementation's advanced infrastructure for all computational routines, making it very fast.

Spark: a general-purpose parallel computing framework, similar to Hadoop MapReduce, implemented in Scala. Beyond the advantages of MapReduce, Spark differs in that intermediate job results can be kept in memory rather than written to and re-read from HDFS, which makes it better suited to algorithms that iterate over the data, such as data mining and machine learning. It can operate on the Hadoop file system, and the third-party Mesos cluster framework can support running it alongside Hadoop.
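The benefit of keeping intermediate results in memory can be sketched with a toy, lazily evaluated "RDD" in plain Python. The names below are invented for illustration; this is not the PySpark API, but it mimics the idea that transformations are chained lazily and a computed result can be cached for reuse across iterations.

```python
# Toy lazily-evaluated dataset with caching, in the spirit of Spark RDDs.

class ToyRDD:
    def __init__(self, compute):
        self._compute = compute       # thunk that produces the data on demand
        self._cached = None

    @classmethod
    def from_list(cls, data):
        return cls(lambda: list(data))

    def map(self, fn):
        # transformations build a new lazy node; nothing runs yet
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        # materialize once and keep in memory, so later iterations
        # do not recompute (or, in real Spark, re-read from HDFS)
        self._cached = self._compute()
        return self

    def collect(self):
        return self._cached if self._cached is not None else self._compute()

evens = ToyRDD.from_list(range(10)).filter(lambda x: x % 2 == 0).cache()
doubled = evens.map(lambda x: x * 2).collect()
```

An iterative algorithm would call further transformations on `evens` many times; because it is cached, the filter runs only once.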

Spark SQL: part of the Apache Spark big data framework; it handles structured data and allows SQL-style queries over Spark data.

Spark Streaming: a real-time computing framework built on Spark that extends Spark's ability to process streaming big data.

Spark MLlib: Spark's implementation library of commonly used machine learning algorithms. At the time of writing (May 2014) it supported binary classification, regression, clustering, and collaborative filtering, and also included a low-level gradient descent optimization primitive. MLlib depends on the jblas linear algebra library, which in turn relies on native Fortran routines.
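Gradient descent, the low-level optimizer mentioned above, is easy to sketch in plain Python. The example below fits a one-dimensional linear model y = w·x by repeatedly stepping against the gradient of the mean squared error; it illustrates the algorithm itself, not MLlib's API.

```python
# Gradient descent for 1-D linear regression (fit y = w * x).

def fit_slope(xs, ys, lr=0.01, steps=500):
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is (2/n) * sum((w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= lr * grad                 # step downhill against the gradient
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]             # generated with true slope 3
w = fit_slope(xs, ys)                  # converges close to 3.0
```

MLlib applies the same idea at scale: each iteration's gradient is computed as a distributed sum over the partitions of the dataset.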

Spark GraphX: Spark's API for graphs and graph-parallel computation. It provides a one-stop solution on top of Spark and makes a complete pipeline of graph-computation operations convenient and efficient.

Fortran: The earliest high-level computer programming language, widely used in scientific and engineering computing fields.

BLAS: the Basic Linear Algebra Subprograms library, a large collection of pre-written routines for linear algebra operations.

LAPACK: a well-known open-source library covering the most common numerical linear algebra problems in scientific and engineering computing, such as solving linear systems, linear least-squares problems, eigenvalue problems, and singular value problems.

ATLAS: an automatically tuned, optimized implementation of the BLAS linear algebra routines.
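The kind of problem LAPACK solves can be illustrated at toy scale: the sketch below solves a small linear system A·x = b by Gaussian elimination with partial pivoting in plain Python. It is for illustration only; production code should use a tuned library such as LAPACK via NumPy/SciPy.

```python
# Solve A x = b for a small dense system by Gaussian elimination.

def solve(A, b):
    n = len(A)
    # build the augmented matrix [A | b]
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for col in range(n):
        # partial pivoting: swap in the row with the largest pivot
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        # eliminate the column below the pivot
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    # back-substitution from the last row upward
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

# 2x + y = 3 and x + 3y = 5 has the solution x = 0.8, y = 1.4
x = solve([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0])
```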

Spark Python (PySpark): Spark is written in Scala, but Java and Python interfaces are provided for wider adoption and compatibility.

7. Python

Python: an object-oriented, interpreted computer programming language.

8. Cloud computing platforms

Docker: an open-source application container engine.

KVM (Kernel-based Virtual Machine): a virtualization module built into the Linux kernel.

OpenStack: an open-source cloud computing management platform project.

