What are the four major components of Spark?
The four major components of Spark are: 1. Spark Streaming, a component for stream processing of real-time data; 2. Spark SQL, a component for working with structured data; 3. GraphX, Spark's framework and algorithm library for graph computation; 4. MLlib, a machine learning algorithm library.
Four major components of Spark
1. Spark Streaming:
Many application domains need to process real-time data as a stream, for example web server logs or message queues of status updates submitted by users. Spark Streaming is the component of the Spark platform that performs stream processing on such real-time data, and it provides a rich API for working with data streams. Because this API corresponds closely to the basic operations in Spark Core, developers who already know Spark's core concepts and programming model will find Spark Streaming applications easier to write. By design, Spark Streaming offers the same level of fault tolerance, throughput, and scalability as Spark Core.
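To make the idea of applying batch-style operations to a live stream more concrete, here is a minimal plain-Python sketch. It does not use Spark. It only imitates the micro-batch model, in which the classic Spark Streaming API groups incoming records into small batches and applies the same kind of operation (here a word count) to each one. The names `micro_batches` and `count_words` and the sample log lines are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(records, batch_size):
    """Group an iterable of records into consecutive lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch):
    """Batch-style word count, the same logic one would write for static data."""
    return Counter(word for line in batch for word in line.split())


# Hypothetical stand-in for a live source such as web server log lines.
log_lines = [
    "GET /index.html",
    "GET /about.html",
    "POST /login",
    "GET /index.html",
    "GET /index.html",
]

running_total = Counter()
for batch in micro_batches(log_lines, batch_size=2):
    # State carried across batches, updated as each small batch arrives.
    running_total.update(count_words(batch))
```

The per-batch function is ordinary batch code, which is the reason familiarity with the core programming model carries over to streaming applications.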
2. Spark SQL:
Spark SQL is Spark's component for working with structured data. With Spark SQL, users can query data using SQL or the Apache Hive dialect of SQL (HQL). Spark SQL supports multiple data sources, such as Hive tables, Parquet, and JSON. Besides providing a SQL interface to Spark, it lets developers embed SQL statements in Spark application code. Whether they use Python, Java, or Scala, users can combine SQL queries with complex data analysis in a single application. This tight integration with the rich computing environment provided by Spark sets Spark SQL apart from other open source data warehouse tools. Spark SQL was first introduced in Spark 1.0. Before Spark SQL, the University of California, Berkeley, had tried modifying Apache Hive to run on Spark, a project called Shark. As Spark SQL developed and became more tightly integrated with the Spark engine and API, it replaced Shark.
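The pattern of mixing SQL statements with ordinary program logic can be sketched as follows. This example deliberately uses Python's built-in `sqlite3` module as a stand-in, not Spark SQL, so that it runs without a Spark installation. In a Spark application the query would typically be passed to the session's `sql` method and the result would come back as a DataFrame. The table `visits` and its columns are hypothetical.

```python
import sqlite3

# In-memory database standing in for a structured data source (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT, seconds INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        ("alice", "/index.html", 30),
        ("bob", "/index.html", 45),
        ("alice", "/about.html", 120),
        ("carol", "/index.html", 15),
    ],
)

# Step 1: a declarative SQL query does the aggregation.
rows = conn.execute(
    "SELECT page, COUNT(*) AS hits, SUM(seconds) AS total_seconds "
    "FROM visits GROUP BY page ORDER BY page"
).fetchall()

# Step 2: ordinary program code continues the analysis on the query result.
average_seconds = {page: total / hits for page, hits, total in rows}
conn.close()
```

The point of the sketch is only the division of labour: SQL handles the relational part, and the host language handles whatever analysis comes next, all within one program.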
3. GraphX:
GraphX is the framework and algorithm library that Spark provides for graph computation. GraphX introduces the resilient distributed property graph and, on that basis, unifies the graph view and the table view of the same data. It also provides a rich set of operators for processing graph data, such as subgraph for extracting a subgraph, mapVertices for transforming vertex attributes, and mapEdges for transforming edge attributes. GraphX also includes an implementation of the Pregel API and ships with common graph algorithms that can be used directly, such as PageRank and triangle counting.
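The relationship between the table view and the graph view can be illustrated with a toy property graph in plain Python. This is not the GraphX API (GraphX itself is used from Scala). The class `PropertyGraph` and its methods are hypothetical and only loosely mirror the operator names mentioned above. Unlike GraphX, which reports triangle counts per vertex, this sketch returns one global count.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass
class PropertyGraph:
    """Toy property graph: a vertex table and an edge table, both with attributes."""
    vertices: dict  # vertex id -> attribute
    edges: dict     # (source id, destination id) -> attribute

    def map_vertices(self, fn):
        """Return a new graph with fn(id, attr) applied to every vertex attribute."""
        return PropertyGraph({v: fn(v, a) for v, a in self.vertices.items()}, self.edges)

    def map_edges(self, fn):
        """Return a new graph with fn(src, dst, attr) applied to every edge attribute."""
        return PropertyGraph(
            self.vertices, {e: fn(e[0], e[1], a) for e, a in self.edges.items()}
        )

    def subgraph(self, vertex_pred):
        """Keep vertices that satisfy vertex_pred and edges whose endpoints both survive."""
        kept = {v: a for v, a in self.vertices.items() if vertex_pred(v, a)}
        edges = {e: a for e, a in self.edges.items() if e[0] in kept and e[1] in kept}
        return PropertyGraph(kept, edges)

    def triangle_count(self):
        """Count triangles in the whole graph, treating every edge as undirected."""
        neighbours = {v: set() for v in self.vertices}
        for src, dst in self.edges:
            if src != dst:
                neighbours[src].add(dst)
                neighbours[dst].add(src)
        total = 0
        for adjacent in neighbours.values():
            total += sum(1 for a, b in combinations(adjacent, 2) if b in neighbours[a])
        return total // 3  # each triangle is seen once from each of its three corners


# Hypothetical sample data: one triangle (1, 2, 3) plus a dangling vertex 4.
graph = PropertyGraph(
    vertices={1: "alice", 2: "bob", 3: "carol", 4: "dave"},
    edges={(1, 2): 1.0, (2, 3): 1.0, (3, 1): 1.0, (3, 4): 1.0},
)
upper = graph.map_vertices(lambda vid, name: name.upper())
without_dave = graph.subgraph(lambda vid, name: name != "dave")
triangles = graph.triangle_count()
```

Each operator returns a new graph and leaves the original untouched, which is the style a distributed, immutable graph abstraction encourages.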
4. MLlib:
MLlib is the machine learning library provided by Spark. It contains a variety of classic, commonly used machine learning algorithms, mainly for classification, regression, clustering, and collaborative filtering. MLlib also provides supporting functionality such as model evaluation and data import, as well as some lower-level machine learning primitives, including a generic gradient descent optimization algorithm. All of these are designed to scale easily across a cluster.
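As an illustration of what a generic gradient descent primitive looks like, here is a minimal single-machine sketch in plain Python. It is not MLlib's implementation and does not use Spark. The functions `gradient_descent` and `least_squares_gradient`, the step size, and the training data are all assumptions chosen for the example.

```python
def gradient_descent(gradient, initial, step_size=0.1, iterations=1000):
    """Generic gradient descent: repeatedly step against the supplied gradient."""
    params = list(initial)
    for _ in range(iterations):
        grads = gradient(params)
        params = [p - step_size * g for p, g in zip(params, grads)]
    return params


def least_squares_gradient(points):
    """Build the gradient of the mean squared error for the model y = w * x + b."""
    n = len(points)

    def gradient(params):
        w, b = params
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        return [grad_w, grad_b]

    return gradient


# Hypothetical training data lying exactly on the line y = 2x + 1.
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = gradient_descent(
    least_squares_gradient(points), initial=[0.0, 0.0], step_size=0.05, iterations=5000
)
```

The optimizer knows nothing about the model. It only needs a gradient function, which is what makes it reusable as a building block under different algorithms. In a cluster setting, the sums inside the gradient are the part that would be computed in parallel over partitions of the data.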