Before reading this article, I recommend first reading my post 《linux内核block层Multi queue多队列核心点分析》, which summarizes the basics of the block layer Multi queue (blk-mq for short) framework, and also 《内核block层Multi queue多队列的一次优化实践》, which describes a performance optimization of how blk-mq hardware queues dispatch IO. This article builds on those two, recording knowledge points verified through actual debugging, in more detail. The kernel here is centos 8.3 4.18.0-240, and the block layer IO scheduler is bfq.
Under the kernel's blk-mq multi-queue framework, IO requests (rq or req for short) are normally dispatched to the disk driver in the blk_mq_dispatch_rq_list() function. Here are a few such stack traces:
Reading a file
- 0xffffffff96409e10 : blk_mq_dispatch_rq_list+0x0/0x820 [kernel]
- 0xffffffff9640f4ba : blk_mq_do_dispatch_sched+0x11a/0x160 [kernel]
- 0xffffffff9640ff99 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
- 0xffffffff96410020 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
- 0xffffffff96407691 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
- 0xffffffff96407f51 : __blk_mq_delay_run_hw_queue+0x141/0x160 [kernel]
- 0xffffffff96410351 : blk_mq_sched_insert_requests+0x71/0xf0 [kernel]
- 0xffffffff9640b4c6 : blk_mq_flush_plug_list+0x196/0x2c0 [kernel]
- 0xffffffff963ffbe7 : blk_flush_plug_list+0xd7/0x100 [kernel]
- 0xffffffff963ffc31 : blk_finish_plug+0x21/0x2e [kernel]
- 0xffffffff9623744f : read_pages+0x7f/0x190 [kernel]
- 0xffffffff96237721 : __do_page_cache_readahead+0x1c1/0x1e0 [kernel]
- 0xffffffff96237939 : ondemand_readahead+0x1f9/0x2c0 [kernel]
- 0xffffffff9622ce5f : generic_file_buffered_read+0x71f/0xb00 [kernel]
- 0xffffffffc0670ed7 : xfs_file_buffered_aio_read+0x47/0xe0 [xfs]
- 0xffffffffc0670fde : xfs_file_read_iter+0x6e/0xd0 [xfs]
- 0xffffffff962d8841 : new_sync_read+0x121/0x170 [kernel]
- 0xffffffff962db1c1 : vfs_read+0x91/0x140 [kernel]
jbd2 dispatching IO
- 0xffffffff96409e10 : blk_mq_dispatch_rq_list+0x0/0x820 [kernel]
- 0xffffffff9640f4ba : blk_mq_do_dispatch_sched+0x11a/0x160 [kernel]
- 0xffffffff9640ff99 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
- 0xffffffff96410020 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
- 0xffffffff96407691 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
- 0xffffffff96407f51 : __blk_mq_delay_run_hw_queue+0x141/0x160 [kernel]
- 0xffffffff96410351 : blk_mq_sched_insert_requests+0x71/0xf0 [kernel]
- 0xffffffff9640b4c6 : blk_mq_flush_plug_list+0x196/0x2c0 [kernel]
- 0xffffffff963ffbe7 : blk_flush_plug_list+0xd7/0x100 [kernel]
- 0xffffffff963ffc31 : blk_finish_plug+0x21/0x2e [kernel]
- 0xffffffffc09cb164 : jbd2_journal_commit_transaction+0xf64/0x19f0 [jbd2]
The fio process dispatching IO directly
- 0xffffffff96409e10 : blk_mq_dispatch_rq_list+0x0/0x820 [kernel]
- 0xffffffff9640f4ba : blk_mq_do_dispatch_sched+0x11a/0x160 [kernel]
- 0xffffffff9640ff99 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
- 0xffffffff96410020 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
- 0xffffffff96407691 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
- 0xffffffff96407f51 : __blk_mq_delay_run_hw_queue+0x141/0x160 [kernel]
- 0xffffffff96410351 : blk_mq_sched_insert_requests+0x71/0xf0 [kernel]
- 0xffffffff9640b4c6 : blk_mq_flush_plug_list+0x196/0x2c0 [kernel]
- 0xffffffff963ffbe7 : blk_flush_plug_list+0xd7/0x100 [kernel]
- 0xffffffff963ffc31 : blk_finish_plug+0x21/0x2e [kernel]
- 0xffffffff96331bb9 : __x64_sys_io_submit+0xd9/0x180 [kernel]
Actual testing shows that most processes reading and writing files dispatch IO through the flush-plug path. Taking reading a file as an example:
- static int read_pages(struct address_space *mapping, struct file *filp,
- struct list_head *pages, unsigned int nr_pages, gfp_t gfp)
- {
- struct blk_plug plug;
- unsigned page_idx;
- int ret;
- blk_start_plug(&plug);
- for (page_idx = 0; page_idx < nr_pages; page_idx++) {
- struct page *page = lru_to_page(pages);
- list_del(&page->lru);
- if (!add_to_page_cache_lru(page, mapping, page->index, gfp))
- //inside ->readpage, the rq built for this file IO gets added to plug->mq_list
- mapping->a_ops->readpage(filp, page);
- put_page(page);
- }
- ret = 0;
- out:
- //kick off the actual IO dispatch
- blk_finish_plug(&plug);
- return ret;
- }
In the path ext4_readpage->ext4_mpage_readpages->submit_bio->generic_make_request->blk_mq_make_request->blk_add_rq_to_plug (taking ext4's readpage path as an example), the rq to be dispatched is added to the plug->mq_list list, as sketched below.
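For reference, a simplified sketch of blk_add_rq_to_plug, based on mainline code of roughly this era rather than the exact RHEL 8.3 source (the multiple_queues handling in particular may differ between versions):
- static void blk_add_rq_to_plug(struct blk_plug *plug, struct request *rq)
- {
-     struct request *last = NULL;
-     if (!list_empty(&plug->mq_list))
-         last = list_entry_rq(plug->mq_list.prev);
-     //queue the new rq at the tail of the per-task plug list
-     list_add_tail(&rq->queuelist, &plug->mq_list);
-     plug->rq_count++;
-     //if rqs from different request_queues end up plugged together, remember
-     //it so blk_mq_flush_plug_list can sort the list by queue before inserting
-     if (!plug->multiple_queues && last && last->q != rq->q)
-         plug->multiple_queues = true;
- }
Afterwards, blk_finish_plug->blk_flush_plug_list->blk_mq_flush_plug_list->blk_mq_sched_insert_requests dispatches the IO. Here is the source of blk_mq_sched_insert_requests: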
- void blk_mq_sched_insert_requests(struct blk_mq_hw_ctx *hctx,
- struct blk_mq_ctx *ctx,
- struct list_head *list, bool run_queue_async)
- {
- struct elevator_queue *e;
- struct request_queue *q = hctx->queue;
- percpu_ref_get(&q->q_usage_counter);
- e = hctx->queue->elevator;
- if (e && e->type->ops.insert_requests)
- //bfq_insert_requests() inserts all the IO requests on the list into the bfq scheduler's run queues
- e->type->ops.insert_requests(hctx, list, false);
- else
- {
- if (!hctx->dispatch_busy && !e && !run_queue_async) {
- blk_mq_try_issue_list_directly(hctx, list);
- if (list_empty(list))
- goto out;
- }
- blk_mq_insert_requests(hctx, ctx, list);
- }
- //dispatch IO to the driver; run_queue_async decides asynchronous vs synchronous dispatch. For most file read/write processes this is synchronous
- blk_mq_run_hw_queue(hctx, run_queue_async);
- out:
- percpu_ref_put(&q->q_usage_counter);
- }
In blk_mq_sched_insert_requests, bfq_insert_requests() inserts all the IO requests on plug->mq_list into the bfq scheduler's run queues. Then the chain blk_mq_run_hw_queue->__blk_mq_delay_run_hw_queue->__blk_mq_run_hw_queue->blk_mq_sched_dispatch_requests->__blk_mq_sched_dispatch_requests performs the real dispatch. Note that even when plug->mq_list holds multiple rqs, there is no guarantee they can all be handed to the disk driver in one continuous batch; this is explained further below.
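Where that chain forks into synchronous versus asynchronous dispatch is __blk_mq_delay_run_hw_queue. A condensed sketch based on mainline code around this version (the BLK_MQ_F_BLOCKING and get_cpu/put_cpu details are omitted):
- static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async,
-                                         unsigned long msecs)
- {
-     if (unlikely(blk_mq_hctx_stopped(hctx)))
-         return;
-     //synchronous path: run the hardware queue in the caller's context,
-     //provided the current CPU is one of the CPUs mapped to this hctx
-     if (!async && cpumask_test_cpu(raw_smp_processor_id(), hctx->cpumask)) {
-         __blk_mq_run_hw_queue(hctx);
-         return;
-     }
-     //asynchronous path: arm hctx->run_work on the kblockd workqueue, and a
-     //kworker will call __blk_mq_run_hw_queue after msecs milliseconds
-     kblockd_mod_delayed_work_on(blk_mq_hctx_next_cpu(hctx), &hctx->run_work,
-                                 msecs_to_jiffies(msecs));
- }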
Next, look at the source of __blk_mq_sched_dispatch_requests:
- int __blk_mq_sched_dispatch_requests(struct blk_mq_hw_ctx *hctx)
- {
- struct request_queue *q = hctx->queue;
- struct elevator_queue *e = q->elevator;
- const bool has_sched_dispatch = e && e->type->ops.dispatch_request;
- int ret = 0;
- LIST_HEAD(rq_list);
- //if hctx->dispatch holds rqs, dispatch those first; they are rqs whose previous dispatch failed, e.g. because the disk driver was busy
- if (!list_empty_careful(&hctx->dispatch)) {
- spin_lock(&hctx->lock);
- if (!list_empty(&hctx->dispatch))
- list_splice_init(&hctx->dispatch, &rq_list);
- spin_unlock(&hctx->lock);
- }
- if (!list_empty(&rq_list)) {
- //set the BLK_MQ_S_SCHED_RESTART flag in hctx->state
- blk_mq_sched_mark_restart_hctx(hctx);
- //dispatch the reqs on hctx->dispatch first; if no disk-driver busy condition was hit this returns true, and we continue below with blk_mq_do_dispatch_sched to dispatch the requests in the IO scheduler queue
- if (blk_mq_dispatch_rq_list(q, &rq_list, false)) {
- /*loop: take IO requests from the IO scheduler queue one at a time into rq_list, then dispatch the req on rq_list to the disk driver. If dispatch fails because the driver queue or the nvme hardware is busy, the req is added to hctx->dispatch to be dispatched later and the loop exits. The loop also exits once the IO scheduler queue has been drained*/
- if (has_sched_dispatch)
- ret = blk_mq_do_dispatch_sched(hctx);
- else
- ret = blk_mq_do_dispatch_ctx(hctx);
- }
- } else if (has_sched_dispatch) {
- ret = blk_mq_do_dispatch_sched(hctx);
- } else if (hctx->dispatch_busy) {
- ret = blk_mq_do_dispatch_ctx(hctx);
- } else {
- blk_mq_flush_busy_ctxs(hctx, &rq_list);
- blk_mq_dispatch_rq_list(q, &rq_list, false);
- }
- return ret;
- }
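The restart flag set above comes from blk_mq_sched_mark_restart_hctx, which in mainline kernels of this era is just a one-bit marker (shown here for reference):
- static inline void blk_mq_sched_mark_restart_hctx(struct blk_mq_hw_ctx *hctx)
- {
-     //record that this hctx wants to be re-run once resources free up;
-     //blk_mq_sched_restart() tests and clears this bit when an rq completes
-     if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
-         set_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);
- }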
In the normal case IO is dispatched by blk_mq_do_dispatch_sched; here is the source:
- static int blk_mq_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
- {
- struct request_queue *q = hctx->queue;
- struct elevator_queue *e = q->elevator;
- LIST_HEAD(rq_list);
- int ret = 0;
- do {
- struct request *rq;
- //bfq_has_work
- if (e->type->ops.has_work && !e->type->ops.has_work(hctx))
- break;
- if (!list_empty_careful(&hctx->dispatch)) {
- ret = -EAGAIN;
- break;
- }
- if (!blk_mq_get_dispatch_budget(hctx))
- break;
- rq = e->type->ops.dispatch_request(hctx);//calls the bfq scheduler's dispatch function bfq_dispatch_request
- if (!rq) {
- //bfq_dispatch_request returned NULL: call blk_mq_delay_run_hw_queues() to start the blk-mq asynchronous IO-dispatch kworker
- blk_mq_put_dispatch_budget(hctx);
- blk_mq_delay_run_hw_queues(q, BLK_MQ_BUDGET_DELAY);
- break;
- }
- list_add(&rq->queuelist, &rq_list);
- /*dispatch the req on rq_list to the disk driver; if dispatch fails because the driver queue or the nvme hardware is busy, the rq is added to hctx->dispatch to be dispatched later, blk_mq_dispatch_rq_list returns false, and the while loop exits*/
- } while (blk_mq_dispatch_rq_list(q, &rq_list, true));
- return ret;
- }
- static bool bfq_has_work(struct blk_mq_hw_ctx *hctx)
- {
- struct bfq_data *bfqd = hctx->queue->elevator->elevator_data;
- return !list_empty_careful(&bfqd->dispatch) ||
- bfq_tot_busy_queues(bfqd) > 0;
- }
This function loops: it takes one IO request at a time from the IO scheduler queue into rq_list, then dispatches the rq on rq_list to the disk driver. If dispatch fails because the driver queue or the disk hardware is busy, the rq is added to hctx->dispatch for a later retry and the loop exits. The loop also exits once the IO scheduler queue has been drained.
Concretely, it first calls bfq_has_work to check whether the scheduler's run queues still have rqs to dispatch; if not, it breaks out of the while loop and stops dispatching. If there are rqs, it calls bfq_dispatch_request to pick one rq from the run queues (the block layer IO scheduler here being bfq). If no suitable rq is found it returns NULL; note this does not necessarily mean bfq's queues are empty, bfq's scheduling policy may simply have declined to hand out an rq right now. In that case blk_mq_delay_run_hw_queues is called to start the blk-mq asynchronous rq-dispatch kworker. Here is a captured stack trace of that kworker:
- kworker/3:1H 500
- 0xffffffffc06b3950 : bfq_dispatch_request+0x0/0x9f0 [bfq]
- 0xffffffffa480f385 : blk_mq_do_dispatch_sched+0xc5/0x160 [kernel]
- 0xffffffffa480feb9 : __blk_mq_sched_dispatch_requests+0x189/0x1e0 [kernel]
- 0xffffffffa480ff40 : blk_mq_sched_dispatch_requests+0x30/0x60 [kernel]
- 0xffffffffa48076a1 : __blk_mq_run_hw_queue+0x51/0xd0 [kernel]
- 0xffffffffa44d3477 : process_one_work+0x1a7/0x360 [kernel]
- 0xffffffffa44d3b40 : worker_thread+0x30/0x390 [kernel]
- 0xffffffffa44d9502 : kthread+0x112/0x130 [kernel]
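blk_mq_delay_run_hw_queues itself is small; based on mainline code of this era, it simply arms the per-hctx delayed run_work (the same run_work path shown in the __blk_mq_delay_run_hw_queue sketch earlier):
- void blk_mq_delay_run_hw_queues(struct request_queue *q, unsigned long msecs)
- {
-     struct blk_mq_hw_ctx *hctx;
-     int i;
-     //arm delayed run_work on every hardware queue of this request_queue,
-     //so kworkers will retry the dispatch after msecs milliseconds
-     queue_for_each_hw_ctx(q, hctx, i) {
-         if (blk_mq_hctx_stopped(hctx))
-             continue;
-         blk_mq_delay_run_hw_queue(hctx, msecs);
-     }
- }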
Why is it done this way? As briefly noted above, bfq_dispatch_request called from blk_mq_do_dispatch_sched can return NULL even though bfq's queues still hold rqs, purely because of bfq's scheduling policy. What then? blk_mq_delay_run_hw_queues starts the blk-mq asynchronous dispatch kworker, which again runs __blk_mq_sched_dispatch_requests->blk_mq_do_dispatch_sched to dispatch rqs. Once every rq in bfq's queues has been dispatched, bfq_has_work returns false inside blk_mq_do_dispatch_sched, the condition if (e->type->ops.has_work && !e->type->ops.has_work(hctx)) becomes true, and the while loop breaks, stopping dispatch. This mechanism guarantees that the rqs in bfq's queues eventually all get dispatched!
The function that actually hands rqs to the disk driver is blk_mq_dispatch_rq_list; here is the source:
- bool blk_mq_dispatch_rq_list(struct request_queue *q, struct list_head *list,
- bool got_budget)
- {
- struct blk_mq_hw_ctx *hctx;
- struct request *rq, *nxt;
- bool no_tag = false;
- int errors, queued;
- blk_status_t ret = BLK_STS_OK;
- bool no_budget_avail = false;
- ................
- errors = queued = 0;
- do {
- struct blk_mq_queue_data bd;
- rq = list_first_entry(list, struct request, queuelist);
- hctx = rq->mq_hctx;
- if (!got_budget && !blk_mq_get_dispatch_budget(hctx)) {
- blk_mq_put_driver_tag(rq);
- no_budget_avail = true;
- break;
- }
- ................
- list_del_init(&rq->queuelist);
- bd.rq = rq;
- if (list_empty(list))
- bd.last = true;
- else {
- nxt = list_first_entry(list, struct request, queuelist);
- bd.last = !blk_mq_get_driver_tag(nxt);
- }
- //hand the rq to the driver
- ret = q->mq_ops->queue_rq(hctx, &bd);//scsi_queue_rq or nvme_queue_rq
- //this branch means the driver queue or the nvme hardware is busy and no more IO can be pushed to the driver, so this rq failed to dispatch
- if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE) {
- if (!list_empty(list)) {
- //release the driver tag of the next req on the list: it was acquired speculatively by blk_mq_get_driver_tag(nxt) above when computing bd.last, and since dispatch is stopping it must be given back
- nxt = list_first_entry(list, struct request, queuelist);
- blk_mq_put_driver_tag(nxt);
- }
- //put the rq that failed to dispatch back onto the list
- list_add(&rq->queuelist, list);
- __blk_mq_requeue_request(rq);
- break;
- }
- ...........
- //the rq was dispatched successfully, count it in queued
- queued++;
- //keep dispatching reqs from the list until the list is empty
- } while (!list_empty(list));
- hctx->dispatched[queued_to_index(queued)]++;
- //if the list is still non-empty, dispatch hit a busy driver queue or busy hardware, and some rqs did not go out
- if (!list_empty(list)) {
- ...........
- spin_lock(&hctx->lock);
- //splice the undispatched rqs from list onto hctx->dispatch for delayed dispatch
- list_splice_tail_init(list, &hctx->dispatch);
- spin_unlock(&hctx->lock);
- ......................
- needs_restart = blk_mq_sched_needs_restart(hctx);
- if (!needs_restart ||
- (no_tag && list_empty_careful(&hctx->dispatch_wait.entry)))
- //kick the hardware queue again right away (asynchronously, no delay)
- blk_mq_run_hw_queue(hctx, true);
- else if (needs_restart && (ret == BLK_STS_RESOURCE ||
- no_budget_avail))
- //kick the hardware queue again after a delay, via the async dispatch kworker
- blk_mq_delay_run_hw_queue(hctx, BLK_MQ_RESOURCE_DELAY);
- //mark the blk-mq hardware queue busy
- blk_mq_update_dispatch_busy(hctx, true);
- //returning false here means dispatch hit a busy driver queue or busy nvme hardware
- return false;
- } else
- //mark the blk-mq hardware queue not busy
- blk_mq_update_dispatch_busy(hctx, false);
- //if dispatch hit a busy driver queue or busy hardware return false; otherwise dispatch was normal and true is returned below
- if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
- return false;
- return (queued + errors) != 0;
- }
It takes reqs off the list one by one and hands them to the disk driver; if a dispatch fails because the driver queue or the disk hardware is busy, the rq is added to hctx->dispatch for a later retry. A few points worth highlighting:
1: If dispatching an rq in blk_mq_dispatch_rq_list fails because the driver queue or the disk hardware is busy, blk_mq_update_dispatch_busy(hctx, true) marks the blk-mq hardware queue busy; otherwise blk_mq_update_dispatch_busy(hctx, false) marks it not busy. A hardware queue marked busy affects later blk-mq dispatch. For example, blk_mq_sched_insert_requests contains if (!hctx->dispatch_busy && !e && !run_queue_async) blk_mq_try_issue_list_directly(hctx, list), meaning that only when the hardware queue is not busy are rqs issued directly to the disk driver via blk_mq_try_issue_list_directly. A sketch of the busy tracking follows below.
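hctx->dispatch_busy is not a plain on/off flag but an exponentially weighted moving average, so a single busy event decays over subsequent successful dispatches. This is the mainline implementation around this version (the EWMA weight/factor constants are taken from upstream):
- static void blk_mq_update_dispatch_busy(struct blk_mq_hw_ctx *hctx, bool busy)
- {
-     unsigned int ewma;
-     ewma = hctx->dispatch_busy;
-     if (!ewma && !busy)
-         return;
-     //decay the previous value and mix in the new busy/idle sample
-     ewma *= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT - 1;
-     if (busy)
-         ewma += 1 << BLK_MQ_DISPATCH_BUSY_EWMA_FACTOR;
-     ewma /= BLK_MQ_DISPATCH_BUSY_EWMA_WEIGHT;
-     hctx->dispatch_busy = ewma;
- }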
2: The real handoff of an rq to the disk driver is q->mq_ops->queue_rq(hctx, &bd) inside blk_mq_dispatch_rq_list. In this blk-mq mechanism only one rq is dispatched at a time: in the while loop of blk_mq_do_dispatch_sched, each bfq_dispatch_request call takes just one rq from the scheduler queue, and then blk_mq_dispatch_rq_list(q, &rq_list, true) dispatches that single rq.
Isn't dispatching just one rq at a time inefficient? There is no way around it: for blk_mq_dispatch_rq_list to dispatch multiple rqs in one call, they must all belong to the same hardware queue, otherwise an rq handed to the disk driver can hang forever and never complete. The rqs returned by successive bfq_dispatch_request calls in blk_mq_do_dispatch_sched are not guaranteed to belong to the same hardware queue, so blk_mq_do_dispatch_sched can only dispatch one rq per blk_mq_dispatch_rq_list invocation. One mitigation via the bd.last flag is sketched below.
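The per-rq cost is partly mitigated by the bd.last flag passed to queue_rq in blk_mq_dispatch_rq_list: a driver may buffer submitted commands and only kick the hardware once last is set. A hypothetical sketch modeled on the mainline nvme driver of roughly this era (nvme_submit_cmd and its signature are assumptions from upstream, not verified against this RHEL kernel):
- static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
-                                   const struct blk_mq_queue_data *bd)
- {
-     ................
-     //copy the command into the submission queue, but only write the SQ
-     //doorbell when bd->last says no further rq follows in this batch
-     nvme_submit_cmd(nvmeq, &cmnd, bd->last);
-     return BLK_STS_OK;
- }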
When can blk_mq_dispatch_rq_list dispatch multiple rqs in one call? Look back at the __blk_mq_sched_dispatch_requests source above: if hctx->dispatch holds multiple rqs (rqs that previously failed to dispatch in blk_mq_dispatch_rq_list because the driver queue or the disk hardware was busy, and were parked on hctx->dispatch for delayed dispatch), then if (!list_empty(&rq_list)) is true and blk_mq_dispatch_rq_list(q, &rq_list, false) dispatches those rqs from hctx->dispatch back to back. This is safe because everything on hctx->dispatch belongs to that same hardware queue.
3: When blk_mq_get_dispatch_budget comes into play. When blk_mq_do_dispatch_sched dispatches an rq, it first executes if (!blk_mq_get_dispatch_budget(hctx)) to acquire a budget, then calls blk_mq_dispatch_rq_list(q, &rq_list, true) with the got_budget parameter set to true. When __blk_mq_sched_dispatch_requests executes if (blk_mq_dispatch_rq_list(q, &rq_list, false)), got_budget is false, because that path did not acquire a budget via blk_mq_get_dispatch_budget(hctx) beforehand.
What does the got_budget parameter change? The blk_mq_dispatch_rq_list source above answers it: at the line if (!got_budget && !blk_mq_get_dispatch_budget(hctx)), when got_budget is false, blk_mq_get_dispatch_budget has to be called there to acquire the budget. Why acquire a budget at all? It is how blk-mq confirms the disk driver still has spare capacity and can accept another rq; a sketch of the hook follows.
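A condensed sketch of the budget hook, based on mainline code around 4.18 (scsi is the typical implementer via scsi_mq_get_budget, which checks per-device busy counters against the queue depth; nvme registers no get_budget callback, so for nvme the budget check always succeeds):
- static inline bool blk_mq_get_dispatch_budget(struct blk_mq_hw_ctx *hctx)
- {
-     struct request_queue *q = hctx->queue;
-     //drivers that can be throttled (e.g. scsi) supply a get_budget hook;
-     //drivers without one are assumed to always have room
-     if (q->mq_ops->get_budget)
-         return q->mq_ops->get_budget(hctx);
-     return true;
- }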