NVMe Command Submission and Completion Mechanism

Posted on January 22, 2021 by jliu9

Followed by the last post, I’m just wondering how submission queue and completion queue works. Essentially, there is ynchronization and I’d like to understand its detail to see if I can find the cause of long tail, because maybe there is some periodical operations that I am not aware of.


  • The NVMe Express (NVMe) interface allows host software to communicate with a non-valatile memory subsystem.
  • Namespace: A quantity of non-valatile memory that may be formatted into logical blocks.
  • SQ (Submission Queue): A submission queue is a circular buffer with a fixed slot size that the host software used to submit commands for execution by the controller.
    • The controller fetches the Submission Queue and may execute those commands in any order.
    • Each Submission Queue entry is a command. Commands are 64 bytes in size. The physical memory locations in memory to use for data transfers are specified using Physical Region Page (PRP) entries or Scatter Gather Lists.
  • CQ (Completion Queue): A circular buffer with a fixed slot size used to post status for completed commands. A completed command is uniquely identified by a combination of the associated SQ identifer and command identifer that is assigned by host software.
    • The CQ Head pointer is updated by host software after processing completion queue entries indicating the last free CQ slot. A Phase Tag (P) bit is defined in the completion queue entry to indicate whether an entry has been newly posted without consulting a register. This enables host software to determine whether the new entry was posted as part of the previous or current round of completion notifications.
    • Specifically, each round through the Completion Queue entries, the controller inverts the Phase Tag bit.
  • Controller:
    • A controller is the inferface between a host and an NVM subsystem. There are three types of controllers:
      • I/O Controller
      • Discovery controllers; and
      • Administrative controllers
    • A controller executes command submitted by a host on a submission queue and posts an completion on a Completion Queue.


PCI Header

The header contains the address (aka. pointer) of other registers and some other metadata. It’s always indirection sort of thing. The format is like this:

     0x0  0x1  0x2  0x3  0x4  0x5  0x6  0x7  0x8  0x9  0xa  0xb  0xc  0xd  0xe  0xf
0x00:|-      ID      -|  |-CMD-|   |-STS-|   RID  |-    CC   -|  CLS  MLT HTYPE BIST
0x10:|-     MLBAR    -|  |-     MUBAR    -|  |-     BAR2     -|  |-     BAR3     -|
0x20:|-     BAR4     -|  |-     BAR5     -|  |-    CCPTR     -|  |-      SS      -|
0x30:|-     EROM     -|  CAP  |-          Reserved           -|  |-INTR-|  MGNT MLAT

Some notable fields:

  • CMD: Command Register
  • CLS: Cache Line Size
  • HTYPER: Header Type
  • MLBAR: Memory Register Base Address, lower 32-bits
  • MUBAR: Memory Register Base Address, upper 32-bits
  • SS: Subsystem Identifier
  • INTR: Interrupt Information

Controller Registers

Controller registers are located in the MLBAR/MUBAR registers (PCI BAR0 and BAR1) that shall be mapped to a memory space that supports in-order access and variable access width. For many computer architectures, specifying the memory space as uncacheable processing producing this behavior.

      0x0  0x1  0x2  0x3  0x4  0x5  0x6  0x7  0x8  0x9  0xa  0xb  0xc  0xd  0xe  0xf
 0x00:|-   CAP: Controller Capabilities   -|  |- VS: Version  -|  |-     INTMS     -|
 0x10:|-     INTMC    -|  |-      CC      -|  |-   Reserved   -|  |-  CSTS: Status -|
 0x20:|-    NSSR [O]  -|  |-      AQA     -|  |- ASQ: Admin Submission Queue Addr  -|
 0x30:|- ACQ: Admin Completion Queue Addr -|  |-  CMBLOC [O]  -|  |-   CMBSZ [O]   -|
0x100:|-    SQ0TDBL   -|
0x1000+(1*4<<CAP.DSTRD) -- 0x1003+(1*4<<CAP.DSTRD): |-CQ0HDBL-|

Then followed by the non-admin queue pairs like this (Note: not in 16 bytes per line format):

0x1000+(2*4<<CAP.DSTRD) -- 0x1003+(2*4<<CAP.DSTRD): Submission Queue 1 Tail Doorbell
0x1000+(3*4<<CAP.DSTRD) -- 0x1003+(3*4<<CAP.DSTRD): Completion Queue 1 Head Doorbell
0x1000+(4*4<<CAP.DSTRD) -- 0x1003+(4*4<<CAP.DSTRD): Submission Queue 2 Tail Doorbell
0x1000+(5*4<<CAP.DSTRD) -- 0x1003+(5*4<<CAP.DSTRD): Completion Queue 2 Head Doorbell
  • SQ0TDBL: Submission Queue 0 Tail Doorbell (Admin).
    • SQT: Indicates the new value of the Submission Queue Tail entry pointer. This value shll overwrite any previous Submission Queue Tail entry pointer value provided. The difference between the last SQT write and the current SQT write indicates the number of commands added to the Submission Queue.
      • rollover needs to be accounted for by the host.
  • CQ0HDBL: Completion Queue 0 Head Doorbell (Admin).
    • Completion Queue Head (CQH): Indicates the new value of the Completion Queue Head entry pointer. This value shall overwrite any previous Completion Queue Head value provided. The difference between the last CQH write and the current CQH entry pointer write indicates the number of entries that are now available for re-use by the controller in the Completion Queue.
      • again, host handles rollover
  • CAP.DSTRD: doorbell stride, Each Submission Queue and Completion Queue Doorbell register is 32-bits in size. This register indicates the stride between doorbell registers. The stride is specified as (2^(2+DSTRD)) in bytes, A value of 0h indicates a stride of 4 bytes, where the doorbell registers are packed without reserved space between each register.
  • [O]: Optional.

So if we assume DSTRD is 0h, the doorbell registers will look like this:

      0x0  0x1  0x2  0x3  0x4  0x5  0x6  0x7  0x8  0x9  0xa  0xb  0xc  0xd  0xe  0xf
0x100:|-    SQ0TDBL   -|  |-    CQ0HDBL   -|  |-    SQ1HDBL   -|  |-    CQ1HDBL   -|
0x101:|-    SQ2TDBL   -|  |-    CQ2HDBL   -| ...

How NVMe IO (w/ polling host) works? Data Structures and Algorithms

Or, this section is about the mechanism of command submission and completion. It is the data structures and algorithms, together, constitute the mechanism.

The central data structures used for this data transfer between the host and controller is the qpair (one submission queue and one completion queue), plus two doorbells. One particular noteworthy thing here is each of the doorbell register is only written by the host.

The qpair basically means a pair of queue, one for submission request and one for completion. It is organized as a circular buffer, with each entry size set to 64Byte. In submission, the host is the producer and thus updates the tail of SQ. For completion processing, the host, as the consumer, will update the head of CQ.

Submission/Completion Queue

The format of completion entry is:

        0x0  0x1  0x2  0x3  0x4  0x5  0x6  0x7  0x8  0x9  0xa  0xb  0xc  0xd  0xe  0xf
DWord0: |-                             Command Spefic                               -|
DWord1: |-                                  Reserved                                -|
DWord2: |-         SQ Header Pointer        -|  |-          SQ Identifier           -|
DWord3: |-         Command Identifer        -| |-P-|  |-        Status Field        -|

The [P] bit is Phase Tag, indicating whether a completion queue entry is new. Basically, in each round of this circular buffer (CQ), the Phase Tag has one value, and is inverted once passed through.

The detailed steps of this submission/completion mechanism is depicted in the bellow pic. In description, the controller parts are italic.

  1. The host write commands for execution to the next free slots in SQ. (maybe several entries).
  2. The host updates the SQiTDBL
  3. The controller fetches the commands in SQ slots into its own memory
  4. The controller executes the commands (Note execution may be in the different order)
  5. The controller writes entry to CQ
  6. The host consumes the completion.
  7. The host updates the CQiHDBL to indicate the entry has been consumed.
Submission/Completion Mechanism

A small trick here is that, the controller uses SQ Head Pointer (SQHD) field in CQ entries to communicate the most recent value of SQ Header Pointer to the host, which indicates that SQ entries have been consumed by the controller, but does not indicate either execution or completion of any command.


  • NVMe Specicifcation v1.4. link