Kafka Source Code Analysis (12): Producer Timeouts

A KafkaProducer can run into message timeouts while sending messages. In this chapter I explain the problem from the perspective of the Kafka client's internals.

1. Timeout Scenarios

Let's first look at the situations in which a timeout can occur:

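A RecordBatch sits in the BufferPool-backed buffer for a long time and is never picked up by the Sender thread;
The Sender thread has sent the message but never receives a response, so the NetworkSend request piles up in InFlightRequests.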

1.1 BufferPool Timeout

Let's start with the first case. The Sender thread's main loop contains this line:

    // Sender.java
    
    void run(long now) {
        //...
    
        // 5. Abort batches that have expired (request.timeout.ms, 30s by default)
        List<RecordBatch> expiredBatches = this.accumulator.abortExpiredBatches(this.requestTimeout, now);
    }

Next, RecordAccumulator's abortExpiredBatches method. Its logic is as follows:

    // RecordAccumulator.java
    
    private final ConcurrentMap<TopicPartition, Deque<RecordBatch>> batches;
    public List<RecordBatch> abortExpiredBatches(int requestTimeout, long now) {
        List<RecordBatch> expiredBatches = new ArrayList<>();
        int count = 0;
        // 1. Scan for expired RecordBatches
        for (Map.Entry<TopicPartition, Deque<RecordBatch>> entry : this.batches.entrySet()) {
            Deque<RecordBatch> dq = entry.getValue();
            TopicPartition tp = entry.getKey();
            if (!muted.contains(tp)) {
                synchronized (dq) {
                    RecordBatch lastBatch = dq.peekLast();
                    Iterator<RecordBatch> batchIterator = dq.iterator();
                    while (batchIterator.hasNext()) {
                        RecordBatch batch = batchIterator.next();
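                        // Every batch except the last one in the deque is treated as full
                        boolean isFull = batch != lastBatch || batch.isFull();
                        // Check whether the batch has expired
                        if (batch.maybeExpire(requestTimeout, retryBackoffMs, now, this.lingerMs, isFull)) {
                            expiredBatches.add(batch);
                            count++;
                            batchIterator.remove();    // remove it from the deque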
                        } else {
                            break;
                        }
                    }
                }
            }
        }
        // 2. Fire the callbacks
        if (!expiredBatches.isEmpty()) {
            log.trace("Expired {} batches in accumulator", count);
            for (RecordBatch batch : expiredBatches) {
                // Invoke the callbacks
                batch.expirationDone();
                // Release the Buffer allocated to the batch
                deallocate(batch);
            }
        }
    
        return expiredBatches;
    }
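
abortExpiredBatches hands the actual decision to RecordBatch.maybeExpire, which this chapter does not list. The block below is a minimal sketch of the three conditions as I understand them for the 0.10.x-era client; treat it as a model under that assumption, not as the real source. The class BatchExpirySketch, its fields, and the method expiryReason are hypothetical names introduced only for illustration.

```java
// Simplified model of batch expiry. NOT the actual Kafka RecordBatch class.
final class BatchExpirySketch {
    final long createdMs;   // when the batch was created
    long lastAppendMs;      // when a record was last appended to the batch
    long lastAttemptMs;     // when the batch was last attempted (or created)
    int attempts;           // > 0 means the batch is being retried

    BatchExpirySketch(long createdMs) {
        this.createdMs = createdMs;
        this.lastAppendMs = createdMs;
        this.lastAttemptMs = createdMs;
    }

    boolean inRetry() {
        return attempts > 0;
    }

    /** Returns a human-readable reason if the batch has expired, otherwise null. */
    String expiryReason(int requestTimeoutMs, long retryBackoffMs, long now,
                        long lingerMs, boolean isFull) {
        // Case 1: a full batch that has not been appended to for too long
        if (!inRetry() && isFull && now - lastAppendMs > requestTimeoutMs)
            return (now - lastAppendMs) + " ms has passed since last append";
        // Case 2: a non-retried batch that outlived creation time + linger
        if (!inRetry() && now - (createdMs + lingerMs) > requestTimeoutMs)
            return (now - (createdMs + lingerMs)) + " ms has passed since batch creation plus linger time";
        // Case 3: a retried batch that outlived last attempt + retry backoff
        if (inRetry() && now - (lastAttemptMs + retryBackoffMs) > requestTimeoutMs)
            return (now - (lastAttemptMs + retryBackoffMs)) + " ms has passed since last attempt plus backoff time";
        return null;
    }

    public static void main(String[] args) {
        BatchExpirySketch batch = new BatchExpirySketch(0L);
        // 30 s timeout, 100 ms backoff, 5 ms linger, batch not full
        System.out.println(batch.expiryReason(30_000, 100L, 30_005L, 5L, false));
        System.out.println(batch.expiryReason(30_000, 100L, 30_006L, 5L, false));
    }
}
```

In this model the first call prints null, because the batch is exactly at the limit, and the second reports 30001 ms since creation plus linger time. The comparison is strict, so a batch expires only once the elapsed time exceeds the timeout.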

RecordBatch's expirationDone method ends up calling the internal done method, which is what triggers the callbacks:

    // RecordBatch.java
    
    void expirationDone() {
        if (expiryErrorMessage == null)
            throw new IllegalStateException("Batch has not expired");
        this.done(-1L, Record.NO_TIMESTAMP,
                  new TimeoutException("Expiring " + recordCount + " record(s) for " + topicPartition + ": " + expiryErrorMessage));
    }
    
    public void done(long baseOffset, long logAppendTime, RuntimeException exception) {
        //...
        // execute callbacks
        for (Thunk thunk : thunks) {
            try {
                if (exception == null) {
                    RecordMetadata metadata = thunk.future.value();
                    thunk.callback.onCompletion(metadata, null);
                } else {
                    thunk.callback.onCompletion(null, exception);
                }
            } catch (Exception e) {
                log.error("Error executing user-provided callback on message for topic-partition '{}'", topicPartition, e);
            }
        }
        produceFuture.done();
    }
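
To see what this means for calling code, here is a small self-contained sketch of the same callback loop. It does not use the Kafka classes: SendCallback is a hypothetical stand-in for the producer's callback interface, and ExpiredCallbackSketch is a hypothetical stand-in for the batch. It only illustrates that each callback receives the exception and that one faulty callback does not stop the others.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a batch that completes user callbacks.
final class ExpiredCallbackSketch {
    /** Hypothetical stand-in for the producer's callback interface. */
    interface SendCallback {
        void onCompletion(Long offset, Exception exception);
    }

    private final List<SendCallback> callbacks = new ArrayList<>();

    void register(SendCallback callback) {
        callbacks.add(callback);
    }

    /** Completes every registered callback and returns how many of them threw. */
    int done(Long baseOffset, RuntimeException exception) {
        int failures = 0;
        for (SendCallback callback : callbacks) {
            try {
                if (exception == null)
                    callback.onCompletion(baseOffset, null);
                else
                    callback.onCompletion(null, exception);
            } catch (Exception e) {
                // A faulty user callback must not prevent the remaining ones from running
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        ExpiredCallbackSketch batch = new ExpiredCallbackSketch();
        batch.register((offset, e) -> { throw new IllegalStateException("buggy callback"); });
        batch.register((offset, e) -> System.out.println("send failed: " + e.getMessage()));
        int failures = batch.done(null, new RuntimeException("Expiring 1 record(s)"));
        System.out.println("callbacks that threw: " + failures);
    }
}
```

The practical takeaway is that a callback should check its exception argument before touching the metadata, since an expired batch is reported with a null result and a non-null exception.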

1.2 InFlightRequests Timeout

Now the other timeout scenario. The Sender thread's main loop calls NetworkClient's poll method:

    // Sender.java
    
    void run(long now) {
        //...
        this.client.poll(pollTimeout, now);
    }

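NetworkClient's poll method contains a timeout check: if a request to a Broker has gone unanswered for longer than request.timeout.ms (30s by default), the client closes the connection to that Broker and treats it as failed. It then cleans up the related in-memory data structures and flags the metadata as needing another refresh: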

    // NetworkClient.java
    
    public List<ClientResponse> poll(long timeout, long now) {
        long metadataTimeout = metadataUpdater.maybeUpdate(now);
        try {
            this.selector.poll(Utils.min(timeout, metadataTimeout, requestTimeoutMs));
        } catch (IOException e) {
            log.error("Unexpected error during I/O", e);
        }
    
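        // Declared earlier in poll(); the other handle* steps are elided here
        long updatedNow = this.time.milliseconds();
        List<ClientResponse> responses = new ArrayList<>();
        //...
    
        // Handle timed-out requests
        handleTimedOutRequests(responses, updatedNow);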
    
        return responses;
    }
    
    private void handleTimedOutRequests(List<ClientResponse> responses, long now) {
        // Find the Brokers that have timed-out requests
        List<String> nodeIds = this.inFlightRequests
            .getNodesWithTimedOutRequests(now, this.requestTimeoutMs);
        for (String nodeId : nodeIds) {
            // Close the connection to that Broker
            this.selector.close(nodeId);
            processDisconnection(responses, nodeId, now);
        }
    
        // Flag the metadata for an update
        if (!nodeIds.isEmpty())
            metadataUpdater.requestUpdate();
    }
    
    private void processDisconnection(List<ClientResponse> responses, String nodeId, long now) {
        connectionStates.disconnected(nodeId, now);
        nodeApiVersions.remove(nodeId);
        nodesNeedingApiVersionsFetch.remove(nodeId);
        // Clear the requests cached in InFlightRequests for this Broker
        for (InFlightRequest request : this.inFlightRequests.clearAll(nodeId)) {
            if (request.isInternalRequest && request.header.apiKey() == ApiKeys.METADATA.id)
                metadataUpdater.handleDisconnection(request.destination);
            else
                responses.add(request.disconnected(now));
        }
    }

The timeout check itself:

    // InFlightRequests.java
    
    public List<String> getNodesWithTimedOutRequests(long now, int requestTimeout) {
        List<String> nodeIds = new LinkedList<>();
        for (Map.Entry<String, Deque<NetworkClient.InFlightRequest>> requestEntry : requests.entrySet()) {
            String nodeId = requestEntry.getKey();
            Deque<NetworkClient.InFlightRequest> deque = requestEntry.getValue();
            if (!deque.isEmpty()) {
                NetworkClient.InFlightRequest request = deque.peekLast();
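                // peekLast() is the oldest request, because new requests are added at the head.
                // Timed out if (current time - request send time) exceeds `request.timeout.ms`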
                long timeSinceSend = now - request.sendTimeMs;
                if (timeSinceSend > requestTimeout)
                    nodeIds.add(nodeId);
            }
        }
        return nodeIds;
    }
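
The check only ever looks at one request per Broker. The following self-contained sketch shows why that is enough, assuming (as in the client) that new requests are pushed onto the head of each deque so the tail is always the oldest. InFlightTimeoutSketch and its methods are hypothetical names, and plain send timestamps stand in for the real request objects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of per-node in-flight request tracking. NOT the Kafka class.
final class InFlightTimeoutSketch {
    // nodeId -> send timestamps of in-flight requests, newest at the head
    private final Map<String, Deque<Long>> requests = new LinkedHashMap<>();

    void add(String nodeId, long sendTimeMs) {
        requests.computeIfAbsent(nodeId, id -> new ArrayDeque<>()).addFirst(sendTimeMs);
    }

    /** Returns the nodes whose oldest in-flight request has exceeded the timeout. */
    List<String> nodesWithTimedOutRequests(long now, int requestTimeoutMs) {
        List<String> nodeIds = new ArrayList<>();
        for (Map.Entry<String, Deque<Long>> entry : requests.entrySet()) {
            // The tail holds the oldest request; if it is still in time, so are all the others
            Long oldestSendTime = entry.getValue().peekLast();
            if (oldestSendTime != null && now - oldestSendTime > requestTimeoutMs)
                nodeIds.add(entry.getKey());
        }
        return nodeIds;
    }

    public static void main(String[] args) {
        InFlightTimeoutSketch inFlight = new InFlightTimeoutSketch();
        inFlight.add("0", 1_000L);   // oldest request to node 0
        inFlight.add("0", 25_000L);  // newer request to node 0
        inFlight.add("1", 20_000L);
        System.out.println(inFlight.nodesWithTimedOutRequests(31_001L, 30_000)); // prints [0]
    }
}
```

Because a single overdue request is enough to mark the whole node, every other request still in flight to that Broker is dropped along with it when the connection is closed.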

2. Summary

In this chapter I covered the message timeouts that can occur while a KafkaProducer sends messages. They fall into two cases:

Requests pile up in the BufferPool;
Requests pile up in InFlightRequests.

In both cases the timeout check depends on the request.timeout.ms parameter, which defaults to 30s. Once a timeout is detected, the callbacks are eventually invoked.
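
For reference, the settings involved can be collected in a plain Properties object. This is only a sketch: the keys are standard producer configuration names, but the class ProducerTimeoutConfigSketch, the build helper, the broker address, and the chosen values are all illustrative assumptions rather than recommendations.

```java
import java.util.Properties;

// Hypothetical helper that gathers the timeout-related producer settings.
final class ProducerTimeoutConfigSketch {
    /** Builds the producer settings that influence batch and request timeouts. */
    static Properties build(String bootstrapServers, int requestTimeoutMs, int lingerMs) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        // Upper bound used by both timeout checks discussed in this chapter
        props.setProperty("request.timeout.ms", Integer.toString(requestTimeoutMs));
        // Passed to the batch expiry check together with the request timeout
        props.setProperty("linger.ms", Integer.toString(lingerMs));
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a placeholder address, not taken from the article
        Properties props = build("localhost:9092", 30_000, 5);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

A real producer also needs key.serializer and value.serializer set before these properties can be passed to the KafkaProducer constructor.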
