Home  >  Article  >  Operation and Maintenance  >  what is apache kafka data collection

what is apache kafka data collection

藏色散人
藏色散人Original
2019-09-29 10:45:265403browse

what is apache kafka data collection

What is Apache Kafka data collection?

Apache Kafka - Introduction

Apache Kafka originated at LinkedIn and later became an open source Apache project in 2011 and then became an Apache first-class project in 2012. Kafka is written in Scala and Java. Apache Kafka is a fault-tolerant messaging system based on publish and subscribe. It is fast, scalable and distributed by design.

This tutorial will explore the principles, installation, and operation of Kafka, and then introduce the deployment of Kafka clusters. Finally, we will conclude with real-time applications and integration with Big Data Technologies.

Before proceeding with this tutorial, you must have a good understanding of Java, Scala, distributed messaging systems, and Linux environments.

In big data, a large amount of data is used. Regarding data, we have two main challenges. The first challenge is how to collect large amounts of data, and the second challenge is analyzing the collected data. To overcome these challenges, you need a messaging system.

Kafka is designed for distributed high-throughput systems. Kafka tends to work well as an alternative to more traditional mail brokers. Compared to other messaging systems, Kafka has better throughput, built-in partitioning, replication, and inherent fault tolerance, making it ideal for large-scale message processing applications.

What is a mail system?

The messaging system is responsible for transferring data from one application to another, so applications can focus on the data but not worry about how to share it. Distributed messaging is based on the concept of reliable message queues. Messages are queued asynchronously between the client application and the messaging system. Two types of messaging patterns are available - one is point-to-point and the other is a publish-subscribe (pub-sub) messaging system. Most messaging patterns follow pub-sub.

Point-to-point messaging system

In a point-to-point system, messages will remain in queues. One or more consumers can consume messages from the queue, but a specific message can be consumed by at most only one consumer. Once a consumer reads a message from a queue, it disappears from the queue. A typical example of this system is an order processing system, where each order will be processed by one order processor, but multiple order processors can work simultaneously. The diagram below depicts the structure.

what is apache kafka data collection

Publish-Subscribe Messaging System

In a publish-subscribe system, messages will remain in topics. Unlike peer-to-peer systems, a consumer can subscribe to one or more topics and consume all messages in that topic. In the Publish-Subscribe system, the message generator is called the publisher, and the message consumer is called the subscriber. A real-life example is Dish TV, which publishes different channels like sports, movies, music, etc. Anyone can subscribe to their own channels and get their subscription channels.

what is apache kafka data collection

#What is Kafka?

Apache Kafka is a distributed publish-subscribe messaging system and powerful queue that can handle large amounts of data and enables you to deliver messages from one endpoint to another. Kafka is suitable for offline and online message consumption. Kafka messages are persisted on disk and replicated within the cluster to prevent data loss. Kafka is built on the ZooKeeper synchronization service. It integrates perfectly with Apache Storm and Spark to stream data analysis in real time.

Advantages The following are several benefits of Kafka -

Reliability - Kafka is distributed, partitioned, replicated and fault-tolerant.

Scalability - The Kafka messaging system scales easily with no downtime.

Durability - Kafka uses a distributed commit log, which means messages remain on disk as quickly as possible, so it is durable.

Performance - Kafka has high throughput for both publish and subscribe messages. It maintains stable performance even when many terabytes of messages are stored.

Kafka is very fast, guaranteeing zero downtime and zero data loss.

Use Cases

Kafka can be used for many use cases. Some of them are listed below -

Metrics - Kafka is often used to run monitoring data. This involves aggregating statistics from distributed applications to produce a centralized feed of operational data.

Log aggregation solution - Kafka can be used across an organization to collect logs from multiple services and serve them to multiple servers in a standard format.

Stream Processing - Popular frameworks like Storm and Spark

Streaming reads data from a topic, processes it, and writes the processed data to a new topic that is available to users and applications . Kafka's strong durability is also very useful in stream processing.

Kafka requires

Kafka is a unified platform for processing all real-time data sources. Kafka supports low-latency messaging and guarantees fault tolerance in the presence of machine failures. It has the ability to handle a large number of different consumers. Kafka is very fast, performing 2 million writes/second. Kafka persists all data to disk, which essentially means that all writes go to the operating system's (RAM) page cache. This transfers data from the page cache to the web socket very efficiently.

For more Apache related knowledge, please visit the Apache usage tutorial column!

The above is the detailed content of what is apache kafka data collection. For more information, please follow other related articles on the PHP Chinese website!

Statement:
The content of this article is voluntarily contributed by netizens, and the copyright belongs to the original author. This site does not assume corresponding legal responsibility. If you find any content suspected of plagiarism or infringement, please contact admin@php.cn