
A Hands-On Debugging Summary of the Kernel Block Layer Multi-Queue (blk-mq)

Before reading this article, I suggest first reading my earlier posts 《linux内核block层Multi queue多队列核心点分析》, which summarizes the basic concepts of block multi-queue (blk-mq for short), and 《内核block层Multi queue多队列的一次优化实践》, which describes a performance optimization of IO dispatch from blk-mq hardware queues. This article builds on both of them with more detailed notes gathered from hands-on debugging. The kernel is CentOS 8.3 (4.18.0-240), and the block layer IO scheduler is bfq.

Under the kernel blk-mq framework, an IO request (rq or req for short) is normally dispatched to the disk driver in blk_mq_dispatch_rq_list(). Here are a few stack traces showing this:

Reading a file

  • 0xffffffff96409e10 : blk_mq_dispatch_rq_list+0x0/0x820 [kernel]
  •  0xffffffff9640f4ba : blk_mq_do_dispatch_sched+0x11a/0x160 [kernel]
  •  0xffffffff9640ff99 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffff96410020 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffff96407691 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffff96407f51 : __blk_mq_delay_run_hw_queue+0x141/0x160 [kernel]
  •  0xffffffff96410351 : blk_mq_sched_insert_requests+0x71/0xf0 [kernel]
  •  0xffffffff9640b4c6 : blk_mq_flush_plug_list+0x196/0x2c0 [kernel]
  •  0xffffffff963ffbe7 : blk_flush_plug_list+0xd7/0x100 [kernel]
  •  0xffffffff963ffc31 : blk_finish_plug+0x21/0x2e [kernel]
  •  0xffffffff9623744f : read_pages+0x7f/0x190 [kernel]
  •  0xffffffff96237721 : __do_page_cache_readahead+0x1c1/0x1e0 [kernel]
  •  0xffffffff96237939 : ondemand_readahead+0x1f9/0x2c0 [kernel]
  •  0xffffffff9622ce5f : generic_file_buffered_read+0x71f/0xb00 [kernel]
  •  0xffffffffc0670ed7 : xfs_file_buffered_aio_read+0x47/0xe0 [xfs]
  •  0xffffffffc0670fde : xfs_file_read_iter+0x6e/0xd0 [xfs]
  •  0xffffffff962d8841 : new_sync_read+0x121/0x170 [kernel]
  •  0xffffffff962db1c1 : vfs_read+0x91/0x140 [kernel]

jbd2 dispatching IO

  • 0xffffffff96409e10 : blk_mq_dispatch_rq_list+0x0/0x820 [kernel]
  •  0xffffffff9640f4ba : blk_mq_do_dispatch_sched+0x11a/0x160 [kernel]
  •  0xffffffff9640ff99 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffff96410020 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffff96407691 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffff96407f51 : __blk_mq_delay_run_hw_queue+0x141/0x160 [kernel]
  •  0xffffffff96410351 : blk_mq_sched_insert_requests+0x71/0xf0 [kernel]
  •  0xffffffff9640b4c6 : blk_mq_flush_plug_list+0x196/0x2c0 [kernel]
  •  0xffffffff963ffbe7 : blk_flush_plug_list+0xd7/0x100 [kernel]
  •  0xffffffff963ffc31 : blk_finish_plug+0x21/0x2e [kernel]
  •  0xffffffffc09cb164 : jbd2_journal_commit_transaction+0xf64/0x19f0 [jbd2]

An fio process dispatching IO directly

  • 0xffffffff96409e10 : blk_mq_dispatch_rq_list+0x0/0x820 [kernel]
  •  0xffffffff9640f4ba : blk_mq_do_dispatch_sched+0x11a/0x160 [kernel]
  •  0xffffffff9640ff99 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffff96410020 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffff96407691 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffff96407f51 : __blk_mq_delay_run_hw_queue+0x141/0x160 [kernel]
  •  0xffffffff96410351 : blk_mq_sched_insert_requests+0x71/0xf0 [kernel]
  •  0xffffffff9640b4c6 : blk_mq_flush_plug_list+0x196/0x2c0 [kernel]
  •  0xffffffff963ffbe7 : blk_flush_plug_list+0xd7/0x100 [kernel]
  •  0xffffffff963ffc31 : blk_finish_plug+0x21/0x2e [kernel]
  •  0xffffffff96331bb9 : __x64_sys_io_submit+0xd9/0x180 [kernel]
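For reference, the io_submit path in the last trace can be driven from user space with plain libaio, which is essentially what fio's libaio engine does. Below is a minimal sketch; the path /tmp/testfile is a placeholder and error handling is reduced to bare exits (compile with -laio):

    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;

        /* O_DIRECT so the IO goes straight to the block layer */
        int fd = open("/tmp/testfile", O_RDONLY | O_DIRECT);
        if (fd < 0 || io_setup(1, &ctx) || posix_memalign(&buf, 4096, 4096))
            return 1;

        io_prep_pread(&cb, fd, buf, 4096, 0);  /* 4KB read at offset 0 */
        if (io_submit(ctx, 1, cbs) != 1)       /* enters __x64_sys_io_submit */
            return 1;
        io_getevents(ctx, 1, 1, &ev, NULL);    /* wait for completion */

        io_destroy(ctx);
        close(fd);
        return 0;
    }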

Testing shows that most processes reading and writing files dispatch IO through the flush-plug path. Take reading a file as an example:

    static int read_pages(struct address_space *mapping, struct file *filp,
            struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
    {
        struct blk_plug plug;
        unsigned page_idx;
        int ret;

        blk_start_plug(&plug);
        for (page_idx = 0; page_idx < nr_pages; page_idx++) {
            struct page *page = lru_to_page(pages);
            list_del(&page->lru);
            if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
                // the rqs for the file IO being dispatched are added to plug->mq_list
                mapping->a_ops->readpage(filp, page);
            put_page(page);
        }
        ret = 0;
    out:
        // kick off the actual IO dispatch
        blk_finish_plug(&plug);
        return ret;
    }

The flow ext4_readpage->ext4_mpage_readpages->submit_bio->generic_make_request->blk_mq_make_request->blk_add_rq_to_plug adds each rq to be dispatched to the plug->mq_list list. Roughly how an rq lands on that list is sketched below.
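This is a lightly trimmed reconstruction from my reading of blk-mq.c in kernels of this era; the RHEL 8 kernel carries heavy block-layer backports, so field names may differ slightly:

    static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
    {
        list_add_tail(&rq->queuelist, &plug->mq_list);
        plug->rq_count++;

        /* record whether rqs from more than one request_queue got plugged,
         * so the flush path knows whether the list must be sorted first */
        if (!plug->multiple_queues && !list_is_singular(&plug->mq_list)) {
            struct request *tmp;

            tmp = list_first_entry(&plug->mq_list, struct request, queuelist);
            if (tmp->q != rq->q)
                plug->multiple_queues = true;
        }
    }

Afterwards, blk_finish_plug->blk_flush_plug_list->blk_mq_flush_plug_list->blk_mq_sched_insert_requests dispatches the IO. Here is the source of blk_mq_sched_insert_requests: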

    void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
                      struct blk_mq_ctx *ctx,
                      struct list_head *list, bool run_queue_async)
    {
        struct elevator_queue *e;
        struct request_queue *q = hctx->queue;

        percpu_ref_get(&q->q_usage_counter);

        e = hctx->queue->elevator;
        if (e && e->type->ops.insert_requests)
            // bfq_insert_requests(): insert every IO request on list into the
            // scheduler's queues
            e->type->ops.insert_requests(hctx, list, false);
        else {
            if (!hctx->dispatch_busy && !e && !run_queue_async) {
                blk_mq_try_issue_list_directly(hctx, list);
                if (list_empty(list))
                    goto out;
            }
            blk_mq_insert_requests(hctx, ctx, list);
        }

        // dispatch IO to the driver; run_queue_async chooses async vs sync
        // dispatch. Most processes reading/writing files dispatch synchronously.
        blk_mq_run_hw_queue(hctx, run_queue_async);
    out:
        percpu_ref_put(&q->q_usage_counter);
    }

In blk_mq_sched_insert_requests, bfq_insert_requests() inserts all the IO requests on plug->mq_list into the bfq scheduler's queues. Then blk_mq_run_hw_queue->__blk_mq_delay_run_hw_queue->__blk_mq_run_hw_queue->blk_mq_sched_dispatch_requests->__blk_mq_sched_dispatch_requests does the real dispatching. Note that even when plug->mq_list holds several rqs, there is no guarantee they can all be dispatched to the disk driver in one continuous run; this is explained further below.
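As an aside, the sync/async split happens in __blk_mq_delay_run_hw_queue. A lightly trimmed sketch from my reading of blk-mq.c of this era (details may differ on the RHEL 8 backport):

    static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
                        unsigned long msecs)
    {
        if (unlikely(blk_mq_hctx_stopped(hctx)))
            return;

        /* sync dispatch: run the hardware queue right here on this CPU,
         * as long as this CPU is one the hctx is mapped to */
        if (!async && !(hctx->flags & BLK_MQ_F_BLOCKING)) {
            int cpu = get_cpu();
            if (cpumask_test_cpu(cpu, hctx->cpumask)) {
                __blk_mq_run_hw_queue(hctx);
                put_cpu();
                return;
            }
            put_cpu();
        }

        /* async dispatch: hand hctx->run_work to a kblockd kworker, which is
         * exactly the kworker seen in the stack trace further below */
        kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
                        msecs_to_jiffies(msecs));
    }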

Next, the source of __blk_mq_sched_dispatch_requests:

    int __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
    {
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
        int ret = 0;
        LIST_HEAD(rq_list);

        // if hctx->dispatch holds rqs, dispatch them first; these are rqs whose
        // earlier dispatch failed because the disk driver was busy
        if (!list_empty_careful(&hctx->dispatch)) {
            spin_lock(&hctx->lock);
            if (!list_empty(&hctx->dispatch))
                list_splice_init(&hctx->dispatch, &rq_list);
            spin_unlock(&hctx->lock);
        }
        if (!list_empty(&rq_list)) {
            // set the BLK_MQ_S_SCHED_RESTART flag in hctx->state
            blk_mq_sched_mark_restart_hctx(hctx);
            // dispatch the reqs on hctx->dispatch first; if the driver was not
            // busy this returns true, and we continue below with
            // blk_mq_do_dispatch_sched to dispatch rqs from the IO scheduler
            if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
                /* loop: take IO requests from the scheduler queue into rq_list,
                 * then dispatch them to the disk driver. If a dispatch fails
                 * because the driver queue or the nvme hardware is busy, the req
                 * is parked on hctx->dispatch for later and the loop exits. The
                 * loop also exits once the scheduler queue is drained. */
                if (has_sched_dispatch)
                    ret = blk_mq_do_dispatch_sched(hctx);
                else
                    ret = blk_mq_do_dispatch_ctx(hctx);
            }
        } else if (has_sched_dispatch) {
            ret = blk_mq_do_dispatch_sched(hctx);
        } else if (hctx->dispatch_busy) {
            ret = blk_mq_do_dispatch_ctx(hctx);
        } else {
            blk_mq_flush_busy_ctxs(hctx, &rq_list);
            blk_mq_dispatch_rq_list(q, &rq_list, false);
        }
        return ret;
    }
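The BLK_MQ_S_SCHED_RESTART flag set above pairs with blk_mq_sched_restart(), which the rq completion/free path calls so that a finished request re-runs the hardware queue. A short sketch from my reading of blk-mq-sched.c of this era:

    void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
    {
        if (test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
            return;

        set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
    }

    void blk_mq_sched_restart(struct blk_mq_hw_ctx *hctx)
    {
        if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
            return;
        clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

        // re-run the hardware queue asynchronously now that a request finished
        blk_mq_run_hw_queue(hctx, true);
    }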

In the normal case, blk_mq_do_dispatch_sched does the dispatching; here is its source:

    static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
    {
        struct request_queue *q = hctx->queue;
        struct elevator_queue *e = q->elevator;
        LIST_HEAD(rq_list);
        int ret = 0;

        do {
            struct request *rq;

            // bfq_has_work
            if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
                break;
            if (!list_empty_careful(&hctx->dispatch)) {
                ret = -EAGAIN;
                break;
            }
            if (!blk_mq_get_dispatch_budget(hctx))
                break;

            // call the bfq dispatch hook, bfq_dispatch_request
            rq = e->type->ops.dispatch_request(hctx);
            if (!rq) {
                // bfq_dispatch_request returned NULL: release the budget and let
                // blk_mq_delay_run_hw_queues() start the blk-mq async-dispatch
                // kworker to retry
                blk_mq_put_dispatch_budget(hctx);
                blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
                break;
            }
            list_add(&rq->queuelist, &rq_list);

        /* dispatch the req on rq_list to the disk driver; if it fails because
         * the driver queue or the nvme hardware is busy, the rq is parked on
         * hctx->dispatch for later, blk_mq_dispatch_rq_list returns false, and
         * the while loop exits */
        } while (blk_mq_dispatch_rq_list(q, &rq_list, true));

        return ret;
    }

    static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
    {
        struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;

        return !list_empty_careful(&bfqd->dispatch) ||
            bfq_tot_busy_queues(bfqd) > 0;
    }

The job of this function is to loop: take an IO request from the scheduler queue into rq_list, then dispatch the rq on rq_list to the disk driver. If a dispatch fails because the driver queue or the disk hardware is busy, the rq is parked on hctx->dispatch for later and the loop exits. The loop also exits once the scheduler queue has been drained.

Concretely, the function first calls bfq_has_work to check whether the scheduler queues still have rqs to dispatch; if not, it breaks out of the while loop and stops dispatching. If there is work, bfq_dispatch_request picks one rq from bfq's queues (the block layer IO scheduler here is bfq); if no suitable rq is found it returns NULL. Note that NULL does not necessarily mean bfq's queues are empty: bfq's scheduling policy may simply have declined to hand out an rq right now. In that case blk_mq_delay_run_hw_queues starts the blk-mq async-dispatch kworker. Here is a captured stack trace:

  • kworker/3:1H 500
  • 0xffffffffc06b3950 : bfq_dispatch_request+0x0/0x9f0 [bfq]
  •  0xffffffffa480f385 : blk_mq_do_dispatch_sched+0xc5/0x160 [kernel]
  •  0xffffffffa480feb9 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
  •  0xffffffffa480ff40 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
  •  0xffffffffa48076a1 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
  •  0xffffffffa44d3477 : process_one_work+0x1a7/0x360 [kernel]
  •  0xffffffffa44d3b40 : worker_thread+0x30/0x390 [kernel]
  •  0xffffffffa44d9502 : kthread+0x112/0x130 [kernel]

Why this mechanism? As briefly mentioned above, bfq_dispatch_request can return NULL even though bfq's queues still hold rqs, purely because of bfq's scheduling policy. What then? blk_mq_delay_run_hw_queues schedules the blk-mq async-dispatch kworker, and that kworker runs __blk_mq_sched_dispatch_requests->blk_mq_do_dispatch_sched again to dispatch rqs. This repeats until bfq's queues are fully drained; at that point bfq_has_work returns false inside blk_mq_do_dispatch_sched, the check if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) fires, and the while loop breaks, stopping the dispatch. This mechanism guarantees the rqs in bfq's queues always get fully dispatched!
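blk_mq_delay_run_hw_queues itself is simple. Roughly, from my reading of blk-mq.c of this era (BLK_MQ_BUDGET_DELAY is a few milliseconds):

    void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
    {
        struct blk_mq_hw_ctx *hctx;
        int i;

        queue_for_each_hw_ctx(q, hctx, i) {
            if (blk_mq_hctx_stopped(hctx))
                continue;

            // queue hctx->run_work on a kblockd kworker msecs from now,
            // i.e. the async branch of __blk_mq_delay_run_hw_queue shown earlier
            blk_mq_delay_run_hw_queue(hctx, msecs);
        }
    }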

Finally, blk_mq_dispatch_rq_list is where an rq is actually handed to the disk driver. Here is the source:

    bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
                     bool got_budget)
    {
        struct blk_mq_hw_ctx *hctx;
        struct request *rq, *nxt;
        bool no_tag = false;
        int errors, queued;
        blk_status_t ret = BLK_STS_OK;
        bool no_budget_avail = false;
        ................
        errors = queued = 0;
        do {
            struct blk_mq_queue_data bd;

            rq = list_first_entry(list, struct request, queuelist);
            hctx = rq->mq_hctx;
            if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
                blk_mq_put_driver_tag(rq);
                no_budget_avail = true;
                break;
            }
            ................
            list_del_init(&rq->queuelist);
            bd.rq = rq;
            if (list_empty(list))
                bd.last = true;
            else {
                nxt = list_first_entry(list, struct request, queuelist);
                bd.last = !blk_mq_get_driver_tag(nxt);
            }

            // hand the rq to the driver: scsi_queue_rq / nvme_queue_rq
            ret = q->mq_ops->queue_rq(hctx, &bd);

            // this branch means the driver queue or the nvme hardware is busy
            // and cannot take more IO, so this rq's dispatch failed
            if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
                if (!list_empty(list)) {
                    // release the driver tag grabbed for the next req when
                    // computing bd.last above; we are stopping, so don't hold it
                    nxt = list_first_entry(list, struct request, queuelist);
                    blk_mq_put_driver_tag(nxt);
                }
                // put the failed rq back on the list
                list_add(&rq->queuelist, list);
                __blk_mq_requeue_request(rq);
                break;
            }
            ...........
            // the rq was dispatched successfully
            queued++;

        // keep dispatching reqs from the list until it is empty
        } while (!list_empty(list));

        hctx->dispatched[queued_to_index(queued)]++;

        // if the list is not empty, dispatch hit a busy driver queue or busy
        // hardware and some rqs did not go through
        if (!list_empty(list)) {
            ...........
            spin_lock(&hctx->lock);
            // move the rqs that failed to dispatch onto hctx->dispatch,
            // to be dispatched later
            list_splice_tail_init(list, &hctx->dispatch);
            spin_unlock(&hctx->lock);
            ......................
            needs_restart = blk_mq_sched_needs_restart(hctx);
            if (!needs_restart ||
                (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
                // kick off another dispatch run right away
                blk_mq_run_hw_queue(hctx, true);
            else if (needs_restart && (ret == BLK_STS_RESOURCE ||
                           no_budget_avail))
                // schedule a delayed async dispatch
                blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);

            // mark the blk-mq hardware queue busy
            blk_mq_update_dispatch_busy(hctx, true);
            // returning false means dispatch hit a busy driver queue or busy
            // nvme hardware
            return false;
        } else
            // mark the blk-mq hardware queue not busy
            blk_mq_update_dispatch_busy(hctx, false);

        // if dispatch hit a busy driver queue or hardware, return false;
        // otherwise return true below
        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
            return false;
        return (queued + errors) != 0;
    }

It takes reqs off the list and dispatches them to the disk driver; any rq whose dispatch fails because the driver queue or the disk hardware is busy is added to hctx->dispatch for later. A few points worth highlighting:

1: If dispatching an rq in blk_mq_dispatch_rq_list fails because the driver queue or the disk hardware is busy, blk_mq_update_dispatch_busy(hctx, true) is executed to mark the blk-mq hardware queue busy; otherwise blk_mq_update_dispatch_busy(hctx, false) marks it not busy. Once the hardware queue is marked busy, later rq dispatching is affected. For example, blk_mq_sched_insert_requests contains if (!hctx->dispatch_busy && !e && !run_queue_async) blk_mq_try_issue_list_directly(hctx, list), meaning that when the hardware queue is not busy (and there is no IO scheduler), blk_mq_try_issue_list_directly issues the rqs directly to the disk driver. A sketch of how the busy state is tracked is shown below.
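From my reading of blk-mq.c in kernels of this era, hctx->dispatch_busy is not a plain flag but an exponentially weighted moving average, so a single busy event decays away over subsequent successful dispatches (a lightly trimmed sketch):

    #define BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT  8
    #define BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR  4

    static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
    {
        unsigned int ewma;

        ewma = hctx->dispatch_busy;

        if (!ewma && !busy)
            return;

        // weighted average: new = (7*old + 16)/8 on a busy run, (7*old)/8
        // otherwise, so dispatch_busy rises quickly when busy and decays
        // back toward 0 afterwards
        ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
        if (busy)
            ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
        ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;

        hctx->dispatch_busy = ewma;
    }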

2: The real hand-off of an rq to the disk driver is q->mq_ops->queue_rq(hctx, &bd) in blk_mq_dispatch_rq_list. Under this mechanism, only one rq is dispatched at a time: in the while loop of blk_mq_do_dispatch_sched, each iteration executes bfq_dispatch_request to take a single rq from the scheduler queue, then executes blk_mq_dispatch_rq_list(q, &rq_list, true) to dispatch just that one rq.

Isn't dispatching only one rq at a time inefficient? There is no way around it. If blk_mq_dispatch_rq_list were asked to dispatch several rqs at once, they would all have to belong to the same hardware queue; otherwise an rq handed to the disk driver could hang forever and never complete. The rqs returned by successive bfq_dispatch_request calls in blk_mq_do_dispatch_sched are not guaranteed to belong to the same hardware queue, so blk_mq_do_dispatch_sched can only have blk_mq_dispatch_rq_list dispatch one rq per call.

So when can blk_mq_dispatch_rq_list dispatch several rqs in one call? Look back at the __blk_mq_sched_dispatch_requests source above: if hctx->dispatch holds several rqs (rqs that failed an earlier blk_mq_dispatch_rq_list dispatch because the driver queue or disk hardware was busy, and were temporarily parked on hctx->dispatch for delayed dispatch), then if (!list_empty(&rq_list)) holds, and blk_mq_dispatch_rq_list(q, &rq_list, false) dispatches those rqs from hctx->dispatch back to back.

3: When blk_mq_get_dispatch_budget comes into play. When blk_mq_do_dispatch_sched dispatches an rq, it first executes if (!blk_mq_get_dispatch_budget(hctx)) to grab a budget, and then calls blk_mq_dispatch_rq_list(q, &rq_list, true) with the got_budget parameter set to true. When __blk_mq_sched_dispatch_requests executes if (blk_mq_dispatch_rq_list(q, &rq_list, false)), got_budget is false, because blk_mq_get_dispatch_budget(hctx) was not called beforehand.

What difference does the got_budget parameter make? The blk_mq_dispatch_rq_list source answers it: at the line if (!got_budget && !blk_mq_get_dispatch_budget(hctx)), if got_budget is false, blk_mq_get_dispatch_budget is called there to grab the budget. Why grab a budget at all? It is how blk-mq confirms the disk driver has spare capacity, i.e. that it is safe to dispatch an rq. A sketch of the budget hook is shown below.
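From my reading of the blk-mq headers in kernels of this era, the budget is just an optional driver hook (a sketch; the exact location and signature vary by version):

    static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
    {
        struct request_queue *q = hctx->queue;

        // drivers that can express per-device capacity implement get_budget;
        // scsi does (scsi_mq_get_budget checks the device queue depth), while
        // nvme leaves it unset, so for nvme this always returns true
        if (q->mq_ops->get_budget)
            return q->mq_ops->get_budget(hctx);
        return true;
    }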
