
Kafka from an Interview Perspective


Kafka is an excellent distributed message middleware, used for message communication in many systems. Understanding and being able to use a distributed messaging system has almost become a required skill for backend developers. Today, 马哥byte will start from common Kafka interview questions and walk you through Kafka.

[Figure: Mind map]

Let’s talk about distributed messaging middleware

Questions

  • What is distributed messaging middleware?
  • What is the role of message middleware?
  • What are the usage scenarios of message middleware?
  • Message middleware selection?
[Figure: Message queue]

Distributed messaging is a communication mechanism. Unlike RPC, HTTP, or RMI, message middleware communicates through a distributed intermediary agent. As shown in the figure, after message middleware is introduced, the upstream business system sends messages, which are first stored in the message middleware and then distributed by it to the corresponding downstream business applications (a distributed producer-consumer model). This asynchronous approach reduces the coupling between services.

[Figure: Architecture]

Defining message middleware:

  • Uses an efficient and reliable message-passing mechanism for platform-independent data exchange
  • Integrates distributed systems on the basis of data communication
  • Extends inter-process communication in distributed environments by providing message-passing and message-queuing models

Introducing additional components into a system architecture inevitably increases the architectural complexity and the difficulty of operation and maintenance. So what are the advantages of using distributed message middleware in a system? What role does message middleware play in the system?

  • Decoupling
  • Redundancy (storage)
  • Scalability
  • Peak clipping
  • Recoverability
  • Sequence guarantee
  • Buffering
  • Asynchronous Communication

During interviews, interviewers often care about a candidate's ability to select open-source components. This tests both the breadth of the candidate's knowledge and the depth of their knowledge of a particular type of system, and it also shows their grasp of overall system and architecture design. There are many open-source distributed messaging systems, each with its own characteristics. Choosing a messaging system requires not only some understanding of each system, but also a clear picture of your own system's requirements.

The following is a comparison of several common distributed messaging systems:

[Figure: Comparison of common distributed messaging systems]

Answer Keywords

  • What is distributed message middleware? Communication, queue, distributed, producer-consumer model.
  • What is the role of message middleware? Decoupling, peak handling, asynchronous communication, buffering.
  • What are the usage scenarios of message middleware? Asynchronous communication, message storage and processing.
  • Message middleware selection? Language, protocol, HA, data reliability, performance, transaction, ecology, simplicity, push-pull mode.

Kafka basic concepts and architecture

Questions

  • Briefly talk about the architecture of Kafka?
  • Is Kafka in push mode or pull mode? What is the difference between push and pull?
  • How does Kafka broadcast messages?
  • Are Kafka’s messages in order?
  • Does Kafka support read-write separation?
  • How does Kafka ensure high data availability?
  • What is the role of zookeeper in Kafka?
  • Does it support transactions?
  • Can the number of partitions be reduced?

General concepts in Kafka architecture:

[Figure: Architecture]
  • Producer: Producer, that is, the party that sends the message. Producers are responsible for creating messages and then sending them to Kafka.
  • Consumer: Consumer, that is, the party that receives the message. Consumers connect to Kafka and receive messages, and then perform corresponding business logic processing.
  • Consumer Group: A consumer group can contain one or more consumers. Using multiple partitions with multiple consumers can greatly improve downstream data processing speed. Consumers in the same consumer group will not consume messages repeatedly, and consumers in different consumer groups do not affect each other. Kafka implements both the P2P and broadcast message models through consumer groups.
  • Broker: Service proxy node. Broker is Kafka's service node, that is, Kafka's server.
  • Topic: Messages in Kafka are divided into Topic units. The producer sends messages to a specific Topic, and the consumer is responsible for subscribing to Topic messages and consuming them.
  • Partition: A Topic is a logical concept that can be subdivided into multiple partitions, each of which belongs to a single topic. Different partitions under the same topic contain different messages. At the storage level, a partition can be regarded as an appendable log file; when a message is appended to the partition's log file, it is assigned a specific offset.
  • Offset: The offset is the unique identifier of a message within its partition. Kafka uses it to guarantee message ordering within a partition; offsets do not span partitions. In other words, Kafka guarantees partition ordering rather than topic ordering.
  • Replication: Replicas are Kafka's way of ensuring high data availability. Data in the same Partition can have multiple replicas on multiple Brokers, and usually only the leader replica provides read and write services. When the Broker hosting the leader replica crashes or a network failure occurs, Kafka re-elects a new leader replica, under the management of the Controller, to continue serving reads and writes.
  • Record: The message record that is actually written to Kafka and can be read. Each record contains key, value and timestamp.

Kafka Topic Partitions Layout

[Figure: Topic partitions layout]
Kafka partitions the Topic. Partitions can be read and written concurrently.
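
To make this concrete, here is a minimal sketch using the Kafka Java AdminClient to create a multi-partition topic. The broker address, topic name, partition count, and replication factor are illustrative assumptions (a replication factor of 2 assumes at least two brokers):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.NewTopic;

    public class CreateTopicDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            try (AdminClient admin = AdminClient.create(props)) {
                // 3 partitions allow concurrent reads/writes; replication factor 2 for availability
                NewTopic topic = new NewTopic("demo-topic", 3, (short) 2);
                admin.createTopics(Collections.singleton(topic)).all().get();
            }
        }
    }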

Kafka Consumer Offset

[Figure: Consumer offset]

zookeeper

[Figure: Zookeeper]
  • Broker registration: Brokers are deployed in a distributed manner and are independent of each other. Zookeeper is used to manage all Broker nodes registered to the cluster.
  • Topic registration: In Kafka, messages of the same Topic are divided into multiple partitions distributed across multiple Brokers. This partition information, and its mapping to Brokers, is also maintained by Zookeeper.
  • Producer load balancing: Since messages of the same Topic are partitioned and distributed across multiple Brokers, the producer needs to distribute messages sensibly among these Brokers.
  • Consumer load balancing: Similar to producers, consumers in Kafka also need load balancing to achieve multiple consumers to reasonably receive messages from the corresponding Broker server. A consumer group contains several consumers, and each message will only be sent to one consumer in the group. Different consumer groups consume messages under their own specific Topics without interfering with each other.

Answer Keywords

  • Briefly explain the architecture of Kafka?

    Producer, Consumer, Consumer Group, Topic, Partition

  • Is Kafka in push mode or pull mode? What is the difference between push and pull?

Kafka Producers push messages to the Broker, and Consumers pull messages from it. Pull mode lets consumers manage their own offsets and can improve read performance.

  • How does Kafka broadcast messages?

    Consumer group

  • Are Kafka’s messages in order?

Unordered at the Topic level, ordered within a Partition

  • Does Kafka support read-write separation?

Not supported; only the Leader provides external read and write services

  • How does Kafka ensure high data availability?

Replicas, acks, HW

  • What is the role of zookeeper in Kafka?

    Cluster management, metadata management

  • Does it support transactions?

Transactions are supported since 0.11, enabling "exactly once" semantics (see the sketch after this list)

  • Can the number of partitions be reduced?

    No, data will be lost
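
As a rough illustration of the transactional producer API mentioned above (a sketch, not production code; the broker address, topic, and transactional.id are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class TransactionalProducerDemo {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
            props.put("transactional.id", "demo-tx-1");        // required to enable transactions
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                producer.initTransactions();
                producer.beginTransaction();
                try {
                    producer.send(new ProducerRecord<>("demo-topic", "key", "value"));
                    producer.commitTransaction(); // all-or-nothing delivery
                } catch (Exception e) {
                    producer.abortTransaction();  // roll back on failure
                }
            }
        }
    }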

Using Kafka

Questions

  • What command line tools does Kafka have? Which ones have you used?
  • The execution process of Kafka Producer?
  • What are the common configurations of Kafka Producer?
  • How to keep Kafka messages in order?
  • How does the Producer ensure that data is not lost in transmission?
  • How to improve the performance of Producer?
  • If the number of consumers in the same group is greater than the number of partitions, how does Kafka handle it?
  • Is Kafka Consumer thread-safe?
  • Tell me about the thread model when you use Kafka Consumer to consume messages. Why is it designed like this?
  • Common configurations of Kafka Consumer?
  • When will a Consumer be kicked out of the cluster?
  • How will Kafka react when a Consumer joins or exits?
  • What is Rebalance and when does Rebalance occur?

Command line tools

Kafka's command line tools live in the /bin directory of the Kafka package and mainly include service and cluster management scripts, configuration scripts, information viewing scripts, Topic scripts, client scripts, etc.

  • kafka-configs.sh: configuration management script
  • kafka-console-consumer.sh: kafka consumer console
  • kafka-console-producer.sh: kafka producer console
  • kafka-consumer-groups.sh: kafka consumer group information
  • kafka-delete-records.sh: delete records up to a given offset (low watermark)
  • kafka-log-dirs.sh: kafka message log directory information
  • kafka-mirror-maker.sh: kafka cross-datacenter cluster replication tool
  • kafka-preferred-replica-election.sh: trigger preferred replica election
  • kafka-producer-perf-test.sh: kafka producer performance test script
  • kafka-reassign-partitions.sh: partition reassignment script
  • kafka-replica-verification.sh: replication progress verification script
  • kafka-server-start.sh: start the kafka service
  • kafka-server-stop.sh: stop the kafka service
  • kafka-topics.sh: topic management script
  • kafka-verifiable-consumer.sh: verifiable kafka consumer
  • kafka-verifiable-producer.sh: verifiable kafka producer
  • zookeeper-server-start.sh: start the zk service
  • zookeeper-server-stop.sh: stop the zk service
  • zookeeper-shell.sh: zk client

We typically use the kafka-console-consumer.sh and kafka-console-producer.sh scripts to test Kafka production and consumption; kafka-topics.sh is usually used to view and manage Topics in the cluster; and kafka-consumer-groups.sh shows the state of Kafka's consumer groups.

Kafka Producer

The normal production logic of Kafka producer includes the following steps:

  1. Configure the producer client parameters and create the producer instance.
  2. Construct the message to be sent.
  3. Send a message.
  4. Close the producer instance.
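
A minimal Java sketch of these four steps; the broker address and topic name are illustrative assumptions:

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SimpleProducer {
        public static void main(String[] args) {
            // 1. Configure client parameters and create the producer instance
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // 2. Construct the message to be sent
                ProducerRecord<String, String> record =
                        new ProducerRecord<>("demo-topic", "key", "hello kafka");
                // 3. Send the message (asynchronous; the accumulator batches it to the Broker)
                producer.send(record);
            } // 4. try-with-resources closes the producer instance
        }
    }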

The process by which the Producer sends messages is shown in the figure below. A message passes through the interceptor, serializer, and partitioner, and is finally sent to the Broker in batches by the accumulator.

[Figure: Producer]

Kafka Producer requires the following necessary parameters:

  • bootstrap.servers: specifies the Kafka Broker address(es)
  • key.serializer: serializer for message keys
  • value.serializer: serializer for message values

Common parameters:

  • batch.num.messages

    Default value: 200. The number of messages sent per batch; only takes effect in async mode.

  • request.required.acks

    Default value: 0. 0 means the producer does not wait for acknowledgment from the leader; 1 means the leader acknowledges as soon as it has written to its local log; -1 means the producer is acknowledged only after all replicas are synchronized. It only takes effect in async mode. Tuning this parameter is a tradeoff between data loss and transmission efficiency: if you are not sensitive to data loss but care about efficiency, consider setting it to 0, which can greatly improve the producer's sending efficiency.

  • request.timeout.ms

    Default value: 10000. Acknowledgment timeout.

  • partitioner.class

    Default value: kafka.producer.DefaultPartitioner. Must implement kafka.producer.Partitioner and provide a key-based partitioning strategy. Sometimes we need messages of the same type to be processed in order, so we must customize the partitioning strategy to route the same type of data to the same partition (see the partitioner sketch after this list).

  • producer.type

    Default value: sync. Specifies whether messages are sent synchronously or asynchronously: async batch sending uses kafka.producer.AsyncProducer, while sync uses kafka.producer.SyncProducer. Synchronous versus asynchronous sending also affects message production efficiency.

  • compression.topic

    Default value: none, i.e. no compression. Other compression options are "gzip", "snappy" and "lz4". Compressing messages can greatly reduce network transfer volume and network IO, improving overall performance.

  • compressed.topics

    Default value: null. When compression is enabled, you can specify which topics to compress; if unspecified, all topics are compressed.

  • message.send.max.retries

    Default value: 3. The maximum number of retries when sending a message.

  • retry.backoff.ms

    Default value: 300. Extra backoff interval added before each retry.

  • topic.metadata.refresh.interval.ms

    Default value: 600000. The interval for periodically refreshing metadata. The producer also fetches metadata actively when a partition is lost or the leader becomes unavailable. If 0, metadata is fetched on every message send (not recommended); if negative, metadata is fetched only on failure.

  • queue.buffering.max.ms

    Default value: 5000. The maximum time data may be buffered in the producer queue; async mode only.

  • queue.buffering.max.messages

    Default value: 10000. The maximum number of messages buffered by the producer; async mode only.

  • queue.enqueue.timeout.ms

    Default value: -1. 0 means discard when the queue is full; a negative value means block when the queue is full; a positive value means block for that many milliseconds when the queue is full; async mode only.
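
As noted under partitioner.class above, keeping one type of message in order means routing it to a single partition. Below is a hedged sketch using the newer Java client's org.apache.kafka.clients.producer.Partitioner interface (the kafka.producer.Partitioner named above belongs to the legacy Scala producer); the key-hashing scheme and fallback are illustrative choices:

    import java.util.Arrays;
    import java.util.Map;
    import org.apache.kafka.clients.producer.Partitioner;
    import org.apache.kafka.common.Cluster;

    // Routes records with equal keys to the same partition so that
    // messages of the same type are consumed in order.
    public class KeyHashPartitioner implements Partitioner {
        @Override
        public int partition(String topic, Object key, byte[] keyBytes,
                             Object value, byte[] valueBytes, Cluster cluster) {
            int numPartitions = cluster.partitionsForTopic(topic).size();
            if (keyBytes == null) {
                return 0; // illustrative fallback for keyless records
            }
            // mask the sign bit so the index is non-negative
            return (Arrays.hashCode(keyBytes) & 0x7fffffff) % numPartitions;
        }

        @Override
        public void close() {}

        @Override
        public void configure(Map<String, ?> configs) {}
    }

It would be registered on the producer with props.put("partitioner.class", KeyHashPartitioner.class.getName()).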

Kafka Consumer

Kafka has the concept of consumer groups. Each consumer can only consume messages from its assigned partitions, and each partition can be consumed by only one consumer within a consumer group. Therefore, if the number of consumers in a consumer group exceeds the number of partitions, some consumers will be assigned no partition and consume nothing. The relationship between consumer groups and consumers is shown in the following figure:

[Figure: Consumer group]
Consuming messages with the Kafka Consumer client usually involves the following steps:

  1. Configure the client and create a consumer
  2. Subscribe to the topic
  3. Pull the message and consume it
  4. Commit the consumption offset
  5. Close consumer instance
[Figure: Consumption process]
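
A minimal Java sketch of these five steps; the broker address, group id, and topic name are illustrative assumptions:

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SimpleConsumer {
        public static void main(String[] args) {
            // 1. Configure the client and create the consumer
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "demo-group");              // assumed consumer group
            props.put("enable.auto.commit", "false");         // offsets are committed manually below
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // 2. Subscribe to the topic
                consumer.subscribe(Collections.singletonList("demo-topic"));
                for (int i = 0; i < 10; i++) { // bounded loop to keep the sketch finite
                    // 3. Pull messages and consume them
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(1000));
                    for (ConsumerRecord<String, String> record : records) {
                        System.out.printf("offset=%d key=%s value=%s%n",
                                record.offset(), record.key(), record.value());
                    }
                    // 4. Commit the consumption offsets
                    consumer.commitSync();
                }
            } // 5. try-with-resources closes the consumer instance
        }
    }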
Because Kafka's Consumer client is not thread-safe, a Reactor-like thread model can be used on the Consumer side to guarantee thread safety while improving consumption performance; a sketch follows the figure below.

[Figure: Consumption model]
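
A rough sketch of this model, under the assumption that a single thread polls and a fixed worker pool processes records (offset management is omitted for brevity; in practice offsets must be committed only after processing completes):

    import java.time.Duration;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class PollAndDispatch {
        // A single thread owns the consumer (KafkaConsumer is not thread-safe);
        // record processing is handed off to a worker pool.
        public static void run(KafkaConsumer<String, String> consumer) {
            ExecutorService workers = Executors.newFixedThreadPool(8);
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(record -> workers.submit(() -> {
                    // business logic goes here
                    System.out.printf("processing offset=%d value=%s%n",
                            record.offset(), record.value());
                }));
            }
        }
    }

This is the "separate pulling from processing" design referred to in the answer keywords below.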

Kafka consumer parameters

  • bootstrap.servers: broker address(es) to connect to, in host:port format.
  • group.id: The consumer group to which the consumer belongs.
  • key.deserializer: Corresponding to the producer's key.serializer, the deserialization method of key.
  • value.deserializer: Corresponding to the producer's value.serializer, the deserialization method of value.
  • session.timeout.ms: The coordinator's failure-detection timeout, 10s by default. This is the interval within which the Consumer Group detects that a member consumer has crashed, similar to a heartbeat expiration time.
  • auto.offset.reset: Specifies what the consumer does when it has no committed offset or its offset is invalid (for example, the consumer was down so long that the records at its offset were aged out and deleted). The default, latest, means reading from the newest records (those produced after the consumer starts); the alternative, earliest, means the consumer starts reading from the beginning of the partition when the offset is invalid.
  • enable.auto.commit: Whether to commit offsets automatically. If false, offsets must be committed manually in the program. For exactly-once semantics it is best to commit offsets manually.
  • fetch.max.bytes: The maximum number of bytes pulled in a single fetch.
  • max.poll.records: The maximum number of messages returned by a single poll call. If the processing logic is very lightweight, this value can be increased appropriately; however, all max.poll.records records must be processed within session.timeout.ms. The default value is 500.
  • request.timeout.ms: The maximum time to wait for a request response. If no response arrives within the timeout, Kafka retransmits the request, or fails once the retry count is exceeded.

Kafka Rebalance

Rebalance is essentially a protocol that stipulates how all consumers under a consumer group agree on allocating the partitions of the topics they subscribe to. For example, if a group has 20 consumers and subscribes to a topic with 100 partitions, then under normal circumstances Kafka allocates an average of 5 partitions to each consumer. This allocation process is called rebalance.

When to rebalance?

This is also a question that is often mentioned. There are three trigger conditions for rebalance:

  • Group membership changes (a new consumer joins the group, an existing consumer leaves voluntarily, or an existing consumer crashes; the difference between the latter two is discussed later)
  • The number of subscribed topics has changed
  • The number of partitions subscribed to the topic has changed

How to allocate partitions within a group?

Kafka provides two allocation strategies by default: Range and Round-Robin. Of course, Kafka adopts a pluggable allocation strategy, and you can create your own allocator to implement different allocation strategies.
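
For example, switching strategies is just a consumer configuration. A minimal sketch (the Java client ships RangeAssignor, the default, and RoundRobinAssignor; broker address and group id are placeholders):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.RoundRobinAssignor;

    public class AssignorConfigDemo {
        public static Properties consumerProps() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "demo-group");              // assumed consumer group
            // use Round-Robin assignment instead of the default Range strategy
            props.put("partition.assignment.strategy", RoundRobinAssignor.class.getName());
            return props;
        }
    }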

Answer Keywords

  • What are the command line tools for Kafka? Which ones have you used? The /bin directory: managing the kafka cluster, managing topics, producing and consuming for tests
  • The execution process of Kafka Producer? Interceptors, Serializers, Partitioners and Accumulators
  • What are the common configurations for Kafka Producer? Broker configuration, ack configuration, network and sending parameters, compression parameters, ack parameters
  • How to keep Kafka messages in order? Kafka is unordered at the topic level and only ordered within a partition. Therefore, to guarantee processing order, you can customize the partitioner and send data that must be processed sequentially to the same partition
  • How does the Producer ensure that data is sent without loss? ack mechanism, retry mechanism
  • How to improve the performance of Producer? Batch, asynchronous, compression
  • If the number of consumers in the same group is greater than the number of partitions, how does Kafka handle it? The extra consumers sit idle and do not consume data
  • Is Kafka Consumer thread-safe? Unsafe, single-thread consumption, multi-thread processing
  • Tell me about the thread model when you use Kafka Consumer to consume messages. Why is it designed like this? Separating pulling and processing
  • Common configurations for Kafka Consumer? broker, network and pull parameters, heartbeat parameters
  • When will the Consumer be kicked out of the cluster? Crash, network abnormality, processing time too long, offset commit timeout
  • How will Kafka react when a Consumer joins or exits? Perform Rebalance
  • What is Rebalance and when does Rebalance occur? Topic changes, consumer changes

High availability and performance

Questions

  • How does Kafka ensure high availability?
  • Kafka’s delivery semantics?
  • What does a Replica do?
  • What are AR and ISR?
  • What are Leader and Follower?
  • What do HW, LEO, LSO, LW, etc. in Kafka stand for?
  • What has Kafka done to ensure superior performance?

Partitions and replicas

[Figure: Partition replicas]

In distributed data systems, partitions are usually used to improve the system's processing capacity, and replicas are used to ensure high availability of data. Multiple partitions mean the ability to process concurrently. Among the multiple replicas, only one is the leader; the others are follower replicas, and only the leader replica serves external requests. Follower replicas are usually stored on different brokers from the leader replica. Through this mechanism high availability is achieved: when a machine fails, a follower replica can quickly take over and start serving requests.

Why don't follower replicas provide read service?

This problem is essentially a trade-off between performance and consistency. Imagine what would happen if follower replicas also served reads. Performance would certainly improve, but a series of problems would arise, similar to phantom reads and dirty reads in database transactions. For example, you write a message to Kafka topic a, and consumer b consumes from topic a but finds nothing, because the latest message has not yet been replicated to the partition replica that consumer b reads from. Meanwhile, another consumer c can consume the latest data, because it consumes from the leader replica. Kafka uses HW and offset management to determine which data the Consumer can read and which data is currently being written.

[Figure: Watermark]

Since only the Leader provides external read services, how is the Leader elected?

Kafka places the replicas that stay synchronized with the leader replica in the ISR replica set, and the leader replica is always present in the ISR; in some special cases the ISR may contain only the leader. When the leader fails, Kafka senses this through zookeeper, selects a new leader from the ISR replica set, and resumes serving requests. But this raises another problem: as just mentioned, the ISR may contain only the leader, so when that leader replica dies the ISR set becomes empty. What then? If the unclean.leader.election.enable parameter is set to true, Kafka will elect as leader a replica that is not fully synchronized, i.e. a replica outside the ISR replica set.

The existence of replicas introduces the problem of replica synchronization

Kafka maintains a list of in-sync replicas (ISR) within the set of all assigned replicas (AR). When a Producer sends a message to the Broker, the ack configuration determines how many replicas must have synchronized the message before the write counts as successful. Internally, the Broker uses the ReplicaManager service to manage data synchronization between followers and the leader.

[Figure: Replica synchronization]
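
On the producer side, this durability tradeoff is configured roughly as below, a sketch using the newer Java client's parameter names (acks/retries) rather than the legacy ones listed earlier; the broker address is a placeholder:

    import java.util.Properties;

    public class DurableProducerProps {
        public static Properties build() {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("acks", "all"); // ack only after all in-sync replicas (ISR) have the record
            props.put("retries", 3);  // retry transient failures instead of dropping data
            return props;
        }
    }

Broker- and topic-level settings such as min.insync.replicas and unclean.leader.election.enable complete the durability picture.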

Performance Optimization

  • Partition Concurrency
  • Sequential read and write to disk
  • page cache: read and write by page
  • Read ahead: Kafka will read the messages to be consumed into memory in advance
  • High-performance serialization (binary)
  • Memory mapping
  • Lock-free offset management: improve concurrency capability
  • Java NIO model
  • Batch: batch reading and writing
  • Compression: message compression, storage compression, reducing network and IO overhead

Partition concurrency

On the one hand, since different Partitions can be located on different machines, you can make full use of the advantages of the cluster to achieve parallel processing between machines. On the other hand, since a Partition physically corresponds to a folder, even when multiple Partitions are located on the same node, different Partitions on that node can be placed on different disk drives to achieve parallel processing between disks and take full advantage of multiple disks.

Sequential reading and writing

The files in each Kafka partition directory are cut into equal-sized data files (1 GB per file by default, configurable). Each data file is called a segment file, and each segment has data appended to it.

[Figure: Appending data to a segment]

Answer Keywords

  • How does Kafka ensure high availability?

High data availability is ensured through replicas, producer acks and retries, automatic Leader election, and Consumer rebalancing

  • Kafka’s delivery semantics?

    Delivery semantics generally include at least once, at most once and exactly once. Kafka implements the first two through ack configuration.

  • What does a Replica do?

    Achieve high availability of data

  • What are AR and ISR?

AR: Assigned Replicas, the set of replicas allocated when the partition is created after the topic is created; the number of replicas is determined by the replication factor. ISR: In-Sync Replicas, a particularly important concept in Kafka, referring to the set of replicas in AR that are in sync with the Leader. A replica in AR may not be in the ISR, but the Leader replica is naturally included in the ISR. Regarding the ISR, another common interview question is how to determine whether a replica belongs in the ISR: the current criterion is whether the time by which the Follower replica's LEO lags behind the Leader's LEO exceeds the Broker-side parameter replica.lag.time.max.ms. If it does, the replica is removed from the ISR.

  • What are Leader and Follower?

    The Leader replica of a partition serves reads and writes; Follower replicas synchronize data from the Leader and stand ready to take over if it fails.

  • What does HW stand for in Kafka?

High watermark. This important field controls the range of messages a consumer can read. An ordinary consumer can only "see" the messages on the Leader replica between the Log Start Offset and the HW (exclusive); messages at or above the high watermark are invisible to consumers.

  • What has Kafka done to ensure superior performance?

Partition concurrency, sequential disk reads and writes, page cache, compression, high-performance serialization (binary), memory mapping, lock-free offset management, Java NIO model, batching

This article has not gone into Kafka's implementation details or source code analysis, but Kafka truly is an excellent open-source system. Many of its elegant architectural and source-code designs are worth learning from. It is highly recommended that interested readers study this open-source system in more depth; it will be of great help to your architecture design, coding, and performance optimization skills.


Statement:
This article is reproduced from Java后端技术全栈. If there is any infringement, please contact admin@php.cn to delete it.