Completed Kafka from an interview perspective
Kafka is an excellent distributed message middleware, used in many systems for inter-service communication. Understanding and using distributed messaging systems has become an almost essential skill for backend developers. Today, 马哥byte will start from common Kafka interview questions and walk through Kafka with you.
Distributed messaging is a communication mechanism. Unlike RPC, HTTP, or RMI, message middleware communicates through a distributed intermediary. As shown in the figure, after introducing message middleware, the upstream business system sends a message, which is first stored in the middleware; the middleware then distributes it to the corresponding downstream application (a distributed producer-consumer model). This asynchronous approach reduces coupling between services.
Define message middleware:
Introducing additional components into a system architecture inevitably increases its complexity and the difficulty of operations. So what are the advantages of using distributed message middleware, and what role does it play in the system?
During interviews, interviewers often care about a candidate's ability to select open source components. This tests both the breadth of the candidate's knowledge and the depth of their understanding of a particular class of systems, and it also shows their grasp of overall system and architecture design. There are many open source distributed messaging systems, each with different characteristics. Choosing one requires not only some understanding of each messaging system, but also a clear picture of your own system's requirements.
The following is a comparison of several common distributed messaging systems:
General concepts in Kafka architecture:
Kafka Topic Partitions Layout
Kafka divides a Topic into partitions, and partitions can be read and written concurrently.
Kafka Consumer Offset
Briefly explain the architecture of Kafka?
Producer, Consumer, Consumer Group, Topic, Partition
The Kafka Producer pushes messages to the Broker, and the Consumer pulls messages from it. The pull model lets the consumer manage its own offset, which improves read performance.
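The pull-plus-manual-offset pattern can be sketched with the Java client. This is a minimal sketch, not a production consumer; the broker address `localhost:9092`, the group id `demo-group`, and the topic `demo-topic` are all assumptions, and running it requires a live broker.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("group.id", "demo-group");              // assumed group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");         // manage the offset ourselves

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("demo-topic"));    // assumed topic name
            while (true) {
                // Pull model: the consumer decides when to fetch.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // Commit only after processing succeeds, so a crash re-delivers
                // unprocessed records instead of silently skipping them.
                consumer.commitSync();
            }
        }
    }
}
```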
Consumer group
Are Kafka’s messages in order?
At the Topic level messages are unordered; within a single Partition they are ordered.
Does Kafka support read-write separation?
Not supported; only the Leader provides read and write services externally.
How does Kafka ensure high data availability?
Replicas, acks, HW
What is the role of zookeeper in Kafka?
Cluster management, metadata management
Does it support transactions?
Since 0.11, transactions are supported, enabling "exactly once" semantics
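The transactional API can be sketched as follows. This is a minimal sketch assuming a live broker at `localhost:9092` and topics `topic-a` and `topic-b` (all hypothetical); note that consumers only get the atomicity guarantee if they set `isolation.level=read_committed`.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalSend {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("transactional.id", "demo-tx-1");       // enables the transactional producer

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                producer.send(new ProducerRecord<>("topic-a", "k", "v1"));
                producer.send(new ProducerRecord<>("topic-b", "k", "v2"));
                // Both writes become visible atomically to read_committed consumers.
                producer.commitTransaction();
            } catch (Exception e) {
                // Neither write is exposed to read_committed consumers.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```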
Can the number of partitions be reduced?
No, data will be lost
Kafka's command line tools live in the /bin directory of the Kafka distribution, and mainly include scripts for service and cluster management, configuration, information viewing, Topic management, and clients. We usually use kafka-console-producer.sh and kafka-console-consumer.sh to test Kafka production and consumption, kafka-topics.sh to view and manage Topics in the cluster, and kafka-consumer-groups.sh to inspect Consumer groups.
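Besides the shell scripts, the same kind of inspection can be done programmatically with the AdminClient API. A sketch assuming a broker at `localhost:9092` (hypothetical):

```java
import java.util.Properties;
import java.util.Set;

import org.apache.kafka.clients.admin.AdminClient;

public class ClusterInspect {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Roughly what kafka-topics.sh --list shows.
            Set<String> topics = admin.listTopics().names().get();
            System.out.println("topics: " + topics);
            // Roughly what kafka-consumer-groups.sh --list shows.
            System.out.println("groups: " + admin.listConsumerGroups().all().get());
        }
    }
}
```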
The normal production logic of Kafka producer includes the following steps:
The process of sending messages by the Producer is shown in the figure below. A record passes through the interceptor, serializer and partitioner, and is finally sent to the Broker in batches by the accumulator.
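The interceptor, serializer, partitioner and accumulator are internal to the client; from the application's side a send looks like the sketch below. The broker address and topic name are assumptions, and running it requires a live broker.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        // The serializer step of the pipeline:
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "5"); // let the accumulator batch records for up to 5 ms

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key feeds the partitioner; the record then waits in the
            // accumulator until a batch fills up or linger.ms expires.
            producer.send(new ProducerRecord<>("demo-topic", "user-1", "hello"),
                    (metadata, e) -> {
                        if (e == null) {
                            System.out.printf("sent to partition %d at offset %d%n",
                                    metadata.partition(), metadata.offset());
                        }
                    });
        } // close() flushes any batches still sitting in the accumulator
    }
}
```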
Kafka Producer requires the following necessary parameters:
Common parameters:
batch.num.messages
Default value: 200, the number of messages sent per batch; only takes effect in async mode.
request.required.acks
Default value: 0. 0 means the producer does not wait for acknowledgment from the leader; 1 means the leader acknowledges as soon as it has written the message to its local log; -1 means the producer waits until all replicas have acknowledged. Tuning this parameter is a tradeoff between data loss and throughput: if you are not sensitive to data loss but care about efficiency, consider setting it to 0, which greatly improves the producer's send throughput.
request.timeout.ms
Default value: 10000, confirmation timeout.
partitioner.class
Default value: kafka.producer.DefaultPartitioner. A custom partitioner must implement kafka.producer.Partitioner and provide a partitioning strategy based on the key. Sometimes we need messages of the same type to be processed in order, so we must customize the partitioning strategy to route the same type of data to the same partition.
producer.type
Default value: sync, specifies whether messages are sent synchronously or asynchronously. Asynchronous (async) batch sending uses kafka.producer.AsyncProducer; synchronous (sync) sending uses kafka.producer.SyncProducer. The choice between synchronous and asynchronous sending also affects message production efficiency.
compression.codec
Default value: none, i.e. no compression. Other options are "gzip", "snappy" and "lz4". Compressing messages can greatly reduce network transfer volume and network IO, improving overall performance.
compressed.topics
Default value: null. When compression is enabled, this can restrict compression to specific topics; if unspecified, all topics are compressed.
message.send.max.retries
Default value: 3, the maximum number of attempts to send a message.
retry.backoff.ms
Default value: 300, additional interval added to each attempt.
topic.metadata.refresh.interval.ms
Default value: 600000, the interval for periodically refreshing metadata. The producer also proactively fetches metadata when a partition is lost or the leader is unavailable. If set to 0, metadata is fetched on every send, which is not recommended. If negative, metadata is fetched only on failure.
queue.buffering.max.ms
Default value: 5000, the maximum time data may sit in the producer queue; only applies in async mode.
queue.enqueue.timeout.ms
Default value: -1. When the queue is full: 0 means discard immediately, a negative value means block indefinitely, and a positive value means block for that many milliseconds; only applies in async mode.
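The key-based routing described under partitioner.class can be illustrated with a pure-Java sketch. The hashing here is illustrative only: Kafka's real default partitioner uses murmur2, not String.hashCode.

```java
// Illustrative key-based partitioning: the same key always maps to the same
// partition, so messages of one type keep their relative order.
// Note: Kafka's actual default partitioner hashes keys with murmur2.
public class KeyPartitioner {
    public static int partitionFor(String key, int numPartitions) {
        if (key == null || numPartitions <= 0) {
            throw new IllegalArgumentException("need a non-null key and a positive partition count");
        }
        // Mask the sign bit instead of Math.abs, which breaks on Integer.MIN_VALUE.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```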
A Kafka Consumer requires the following parameters:
bootstrap.servers, a list of broker addresses in host:port format.
key.deserializer, the deserialization method for the key.
value.deserializer, the deserialization method for the value.
enable.auto.commit: when set to false, you must commit the offset manually in your program. For exactly-once semantics it is best to commit the offset manually.
max.poll.records: default value 500; these records must be processed within session.timeout.ms.
What is Rebalance?
Rebalance is essentially a protocol that stipulates how all consumers in a consumer group agree on allocating the partitions of the subscribed topics. For example, a group has 20 consumers and subscribes to a topic with 100 partitions. Under normal circumstances, Kafka allocates 5 partitions to each consumer on average. This allocation process is called rebalance.
When to rebalance?
This is also a frequently asked question. There are three trigger conditions for rebalance: a change in group membership (a consumer joins, leaves, or crashes), a change in the number of subscribed topics, and a change in the number of partitions of a subscribed topic.
How to allocate partitions within a group?
Kafka provides two allocation strategies by default: Range and Round-Robin. Of course, Kafka adopts a pluggable allocation strategy, and you can create your own allocator to implement different allocation strategies.
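The two default strategies can be illustrated with a pure-Java sketch for a single topic. This is a simplified model of the assignment math, not Kafka's actual assignor classes.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified sketch of Kafka's two default assignment strategies
// for one topic: out.get(c) is the partition list for consumer c.
public class AssignmentStrategies {

    // Range: split partitions into contiguous chunks; the first
    // (partitions % consumers) consumers get one extra partition.
    public static List<List<Integer>> range(int partitions, int consumers) {
        List<List<Integer>> out = new ArrayList<>();
        int base = partitions / consumers, extra = partitions % consumers, next = 0;
        for (int c = 0; c < consumers; c++) {
            int take = base + (c < extra ? 1 : 0);
            List<Integer> mine = new ArrayList<>();
            for (int i = 0; i < take; i++) mine.add(next++);
            out.add(mine);
        }
        return out;
    }

    // Round-robin: deal partitions out one at a time, like cards.
    public static List<List<Integer>> roundRobin(int partitions, int consumers) {
        List<List<Integer>> out = new ArrayList<>();
        for (int c = 0; c < consumers; c++) out.add(new ArrayList<>());
        for (int p = 0; p < partitions; p++) out.get(p % consumers).add(p);
        return out;
    }
}
```

With 100 partitions and 20 consumers, both strategies give each consumer exactly 5 partitions; they only differ when the division is uneven or multiple topics are involved.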
The command line tools in the /bin directory can manage the Kafka cluster, manage topics, and produce and consume messages.
In distributed data systems, partitions are usually used to improve processing capacity, and replicas to ensure high availability of data. Multiple partitions mean the ability to process concurrently. Among the multiple replicas of a partition, only one is the leader; the others are follower replicas, and only the leader replica serves requests. Follower replicas are usually stored on different brokers from the leader. Through this mechanism high availability is achieved: when a machine goes down, a follower replica can quickly take over and start serving requests.
Why does the follower copy not provide read service?
This problem is essentially a trade-off between performance and consistency. Imagine what would happen if follower replicas also served reads. Performance would certainly improve, but a series of problems would arise, similar to phantom reads and dirty reads in database transactions. For example, you write a record to Kafka topic a; consumer b reads from topic a but cannot see it, because the latest message has not yet been replicated to the partition replica that b reads from, while consumer c, reading from the leader replica, can already consume it. Kafka uses HW and Offset management to determine which data a Consumer can read and which data has been durably written.
Only the Leader can provide external read services, so how to elect the Leader
Kafka keeps the replicas that are in sync with the leader replica in the ISR replica set; the leader replica is always in the ISR. In some special cases the ISR may even contain only the leader. When the leader fails, Kafka detects this through ZooKeeper, elects a new leader from the ISR, and resumes serving requests. But this raises another problem: as mentioned, the ISR may contain only the leader, so when that leader dies the ISR set becomes empty. What then? If the unclean.leader.election.enable parameter is set to true, Kafka will elect a leader from the out-of-sync replicas, i.e. replicas not in the ISR, at the risk of losing data.
The existence of copies will cause copy synchronization problems
Kafka maintains a list of available replicas (ISR) within all assigned replicas (AR). When the Producer sends a message to the Broker, the ack configuration determines how many replicas must confirm the write before it is considered successful. The Broker internally uses the ReplicaManager service to manage data synchronization between followers and the leader.
On the one hand, since different Partitions can be located on different machines, you can make full use of the cluster to achieve parallel processing across machines. On the other hand, since each Partition physically corresponds to a folder, even when multiple Partitions sit on the same node, they can be configured to use different disk drives, achieving parallelism across disks and making full use of multiple disks.
The log in each Kafka partition directory is cut into equal-sized data files (1 GB by default, configurable via log.segment.bytes). Each data file is called a segment file, and each segment has data appended to it sequentially.
How does Kafka ensure high availability?
Kafka ensures high data availability through replicas, producer acks, retries, automatic Leader election, and Consumer rebalancing.
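On the producer side these mechanisms map to a handful of settings. Below is a sketch of a durability-leaning configuration; the values are illustrative, not recommendations.

```java
import java.util.Properties;

public class ReliabilityConfig {
    // Producer settings that favor durability over latency (illustrative values).
    public static Properties reliableProducerProps() {
        Properties props = new Properties();
        props.put("acks", "all");                // wait for all in-sync replicas to confirm
        props.put("retries", "10");              // retry transient send failures
        props.put("enable.idempotence", "true"); // retries cannot introduce duplicates
        return props;
    }
}
```

For acks=all to actually wait on more than one replica, the topic also needs a broker/topic-level min.insync.replicas of at least 2.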
Kafka’s delivery semantics?
Delivery semantics generally include at least once, at most once and exactly once. Kafka implements the first two through the ack configuration.
What does Replica do?
Achieve high availability of data
What are AR and ISR?
AR: Assigned Replicas, the set of replicas allocated when the partition is created after the topic is created; the number of replicas is determined by the replication factor. ISR: In-Sync Replicas, a particularly important concept in Kafka, referring to the subset of AR that is in sync with the Leader. A replica in AR may not be in the ISR, but the Leader replica is always in the ISR. Regarding the ISR, another common interview question is how to decide whether a replica belongs in the ISR: the current criterion is whether the time since the Follower replica last caught up with the Leader's LEO exceeds the Broker-side parameter replica.lag.time.max.ms. If it does, the replica is removed from the ISR.
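The replica.lag.time.max.ms rule reduces to a simple time comparison. A pure-Java sketch of the membership check follows; the method and parameter names are descriptive, not Kafka APIs.

```java
// A follower stays in the ISR as long as the time since it last caught up
// with the leader's LEO is within replica.lag.time.max.ms.
public class IsrCheck {
    public static boolean inIsr(long nowMs, long lastCaughtUpMs, long maxLagMs) {
        return nowMs - lastCaughtUpMs <= maxLagMs;
    }
}
```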
What are Leader and Follower?
For each partition, the Leader replica serves all reads and writes, while Follower replicas replicate the Leader's data and stand by to take over if the Leader fails.
What does HW stand for in Kafka?
High watermark, an important field that controls the range of messages a consumer can read. An ordinary consumer can only "see" messages on the Leader replica between Log Start Offset and HW (exclusive); messages above the high watermark are invisible to consumers.
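The visibility rule is just an interval check. A pure-Java sketch (names are descriptive, not Kafka APIs):

```java
// An ordinary consumer can only read offsets in [logStartOffset, highWatermark).
public class HwVisibility {
    public static boolean visibleToConsumer(long offset, long logStartOffset, long highWatermark) {
        return offset >= logStartOffset && offset < highWatermark;
    }
}
```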
What has Kafka done to ensure superior performance?
Partition concurrency, sequential disk reads and writes, page cache, compression, high-performance binary serialization, memory mapping, lock-free offset management, and the Java NIO model.
This article does not go into Kafka's implementation details or source code, but Kafka is truly an excellent open source system, and many of its elegant architectural and source-level designs are worth learning from. Interested readers are strongly encouraged to dig deeper into this open source system; it will greatly help your own architecture design, coding, and performance optimization skills.
The above is the detailed content of Completed Kafka from an interview perspective. For more information, please follow other related articles on the PHP Chinese website!