Which Queue Pair type to use?

When writing a new RDMA application (just like when writing a new application over sockets), one must decide which QP type to work with.

In this post, I will describe in detail the characteristics of each transport type.

In RDMA, there are several QP types. Each can be written as XY, where X describes the reliability and Y describes the connection type.

X can be:
Reliable: There is a guarantee that messages are delivered at most once, in order and without corruption.
Unreliable: There isn't any guarantee that messages will be delivered, nor about the order of the packets.

In RDMA, every packet has a CRC and corrupted packets are dropped (for any transport type). The reliability of a QP transport type refers to the reliability of the whole message.

Y can be:
Connected: one QP sends to and receives from exactly one QP
Unconnected: one QP sends to and receives from any QP

RDMA uses the following mechanisms:
* CRC: The CRC field validates that packets weren't corrupted along the path.

* PSN: The Packet Sequence Number makes sure that packets are received in order. It helps detect missing and duplicated packets.

* Acknowledgement: (only in RC QP) Only after a message has been written successfully on the responder side is an ack packet sent back to the requester. If the requester doesn't receive an ack, it resends the message according to the QP's attributes (timeout and retry count). If no ack (or nack) ever arrives from the remote QP, the requester reports an error (retry exceeded).
If there is any kind of error on the responder side (protection, resources, etc.), a nack is sent to the requester, which then reports an error.
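
To make the role of the PSN concrete, here is a minimal Python sketch of how a receiver could classify an incoming PSN against the one it expects. The PSN field is 24 bits wide, so the comparison has to wrap around. The function name and the split of the PSN space into two halves (ahead means a gap, behind means a duplicate) are simplifying assumptions for illustration, not a description of any specific device.

```python
PSN_BITS = 24                  # width of the PSN field in the transport header
PSN_MOD = 1 << PSN_BITS
HALF_WINDOW = PSN_MOD // 2     # assumption: half of the space is "ahead", half is "behind"


def classify_psn(expected: int, received: int) -> str:
    """Classify a received PSN relative to the PSN the responder expects."""
    delta = (received - expected) % PSN_MOD   # distance with wraparound
    if delta == 0:
        return "in-order"
    if delta < HALF_WINDOW:
        return "out-of-sequence"   # ahead of expected: at least one packet is missing
    return "duplicate"             # behind expected: this packet was already seen
```

For example, with an expected PSN of 0xFFFFFF, a received PSN of 0 is one step ahead (a packet went missing), not a huge step behind.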

Reliable Connected (RC) QP

One RC QP is connected to (i.e. sends messages to and receives messages from) exactly one RC QP in a reliable way. It is guaranteed that messages are delivered from a requester to a responder at most once, in order and without corruption. The maximum supported message size is up to 2 GB (the actual value may be lower, depending on the RDMA device attributes). An RC QP supports Send operations (with or without immediate), RDMA Write operations (with or without immediate), RDMA Read operations and Atomic operations (depending on the level of atomic support in the RDMA device).

If a message is bigger than the path MTU, the sending side fragments it into packets and the receiving side reassembles it.
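
As a rough illustration of the fragmentation, the following Python sketch computes how many packets a message needs for a given path MTU. The function name is hypothetical, and the sketch assumes that a zero-length message still occupies one packet.

```python
VALID_PATH_MTUS = (256, 512, 1024, 2048, 4096)   # path MTU values, in bytes


def packets_for_message(message_size: int, path_mtu: int) -> int:
    """Number of packets needed to carry a message of message_size bytes."""
    if path_mtu not in VALID_PATH_MTUS:
        raise ValueError(f"unsupported path MTU: {path_mtu}")
    if message_size < 0:
        raise ValueError("message size cannot be negative")
    # Ceiling division; assumption: an empty message is still sent as one packet.
    return max(1, -(-message_size // path_mtu))
```

A 2 GB message over a 4096-byte path MTU, for example, is split into 524288 packets.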

The requester considers a message operation complete once it receives an ack from the responder side indicating that the message was read from or written to the responder's memory.

The responder considers a message operation complete once the message was read from or written to its (local) memory.

Unreliable Connected (UC) QP

One UC QP is connected to (i.e. sends messages to and receives messages from) exactly one UC QP in an unreliable way. There isn't any guarantee that messages will be received by the other side: corrupted or out-of-sequence packets are silently dropped. If a packet is dropped, the whole message that it belongs to is dropped. In this case, the responder won't stop, but continues to receive incoming packets. There isn't any guarantee about the packet ordering. The maximum supported message size is up to 2 GB (the actual value may be lower, depending on the RDMA device attributes). A UC QP supports Send operations (with or without immediate) and RDMA Write operations (with or without immediate).

If a message is bigger than the path MTU, the sending side fragments it into packets and the receiving side reassembles it.

The requester considers a message operation complete once the whole message was sent to the fabric.

The responder considers a message operation complete once it received a complete message in the correct sequence and wrote the data to its (local) memory.
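
The "drop the whole message, then keep receiving" behavior of a UC responder can be modeled with a short Python sketch. This is a simplified model under stated assumptions: packets are hypothetical `(psn, position, payload)` tuples, `position` is one of `"first"`, `"middle"`, `"last"` or `"only"`, PSN wraparound is ignored, and the responder resynchronizes on the next packet that starts a message.

```python
def uc_receive(packets):
    """Return the payloads of the messages a UC-like responder would complete."""
    completed = []
    expected_psn = None   # None: waiting for a packet that starts a new message
    parts = None          # payload fragments of the message in progress

    for psn, position, payload in packets:
        if position in ("first", "only"):
            # A new message restarts reassembly; a partial message, if any, is dropped.
            parts = [payload]
        elif parts is not None and psn == expected_psn:
            parts.append(payload)
        else:
            # Gap or stray middle/last packet: silently drop the message in progress.
            parts = None
            expected_psn = None
            continue

        expected_psn = psn + 1
        if position in ("last", "only"):
            completed.append(b"".join(parts))
            parts = None

    return completed
```

If the packet with PSN 3 is lost in the middle of the second message, only the first and third messages are completed:

```python
arrived = [
    (0, "first", b"AA"), (1, "last", b"BB"),   # message 1: complete
    (2, "first", b"CC"), (4, "last", b"DD"),   # message 2: PSN 3 never arrived
    (5, "only", b"EE"),                        # message 3: complete
]
print(uc_receive(arrived))
```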

Unreliable Datagram (UD) QP

One UD QP can send messages to and receive messages from any other UD QP, in either a unicast (one to one) or multicast (one to many) way, unreliably. There isn't any guarantee that messages will be received by the other side: corrupted or out-of-sequence packets are silently dropped. There isn't any guarantee about the packet ordering. The maximum supported message size is the maximum path MTU. A UD QP supports only Send operations.

The requester considers a message operation complete once the (one packet) message was sent to the fabric.

The responder considers a message operation complete once it received a complete message and wrote the data to its (local) memory.
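
Since a UD message must fit in one packet and delivery isn't guaranteed, an application that sends more data than the path MTU has to split it and track delivery itself. The Python sketch below shows one possible way to do that: each datagram is prefixed with an application-level sequence number. The 4-byte header layout and both function names are hypothetical, and sequence number wraparound is not handled.

```python
import struct

SEQ_HEADER = struct.Struct("!I")   # hypothetical 4-byte application sequence number


def to_ud_datagrams(data: bytes, path_mtu: int, first_seq: int = 0):
    """Split data into datagrams of at most path_mtu bytes, header included."""
    chunk = path_mtu - SEQ_HEADER.size
    if chunk <= 0:
        raise ValueError("path MTU is too small for the sequence header")
    return [
        SEQ_HEADER.pack(first_seq + i) + data[offset:offset + chunk]
        for i, offset in enumerate(range(0, len(data), chunk))
    ]


def missing_sequences(datagrams, expected_count: int, first_seq: int = 0):
    """Sequence numbers the receiver expected but did not see."""
    seen = {SEQ_HEADER.unpack_from(d)[0] for d in datagrams}
    return [s for s in range(first_seq, first_seq + expected_count) if s not in seen]
```

The receiver can then report the missing sequence numbers back to the sender, or simply ignore the gap if the application tolerates loss.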

Choosing the right QP type

Choosing the right QP type is critical to the correctness and scalability of an application.

RC QP should be chosen if:

  1. Reliability by the fabric is needed
  2. The fabric isn't big, or the cluster is big but not all nodes send traffic to the same node (one victim)

Possible uses for an RC QP include FTP over RDMA or a file system over RDMA.

UC QP should be chosen if:

  1. Reliability by the fabric isn't needed (i.e. reliability isn't important at all or the application takes care of it)
  2. The fabric isn't big, or the cluster is big but not all nodes send traffic to the same node (one victim)
  3. Big messages (more than the path MTU) are sent

One use for a UC QP can be: video over RDMA.

UD QP should be chosen if:

  1. Reliability by the fabric isn't needed (i.e. reliability isn't important at all or the application takes care of it)
  2. The fabric is big and every node sends messages to any other node in the fabric. UD is one of the best solutions for scalability problems.
  3. Multicast messages are needed

One use for a UD QP can be: voice over RDMA.
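
The three checklists above can be condensed into a small Python helper. It is only a sketch of the criteria listed in this post: the function and parameter names are hypothetical, and it returns every QP type whose criteria are met rather than forcing a single answer. An empty list means these criteria alone don't single out a type for that combination.

```python
def candidate_qp_types(needs_fabric_reliability: bool,
                       large_all_to_all_fabric: bool,
                       needs_multicast: bool,
                       messages_exceed_mtu: bool):
    """Return the QP types that match the selection criteria listed above."""
    candidates = []
    small_or_spread = not large_all_to_all_fabric   # "not big, or no single victim"

    if needs_fabric_reliability and small_or_spread and not needs_multicast:
        candidates.append("RC")
    if not needs_fabric_reliability and small_or_spread and not needs_multicast:
        candidates.append("UC")
    # UD carries at most one path MTU per message, so bigger messages rule it out
    # (unless the application splits them itself, which this sketch ignores).
    if not needs_fabric_reliability and not messages_exceed_mtu:
        candidates.append("UD")
    return candidates
```

For instance, an application that needs no fabric reliability, runs on a small fabric and sends big messages gets `["UC"]`, whereas a big all-to-all fabric with multicast and small messages gets `["UD"]`.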

Summary

The following table describes the characteristics of each QP Transport Service Type:

| Metric | UD | UC | RC |
|---|---|---|---|
| Opcode: SEND (with or without immediate) | Supported | Supported | Supported |
| Opcode: RDMA Write (with or without immediate) | Not supported | Supported | Supported |
| Opcode: RDMA Read | Not supported | Not supported | Supported |
| Opcode: Atomic operations | Not supported | Not supported | Supported |
| Reliability | No | No | Yes |
| Connection type | Datagram (one to any/many) | Connected (one to one) | Connected (one to one) |
| Maximum message size | Maximum path MTU | 2 GB | 2 GB |
| Multicast | Supported | Not supported | Not supported |
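
The same table can be kept as data in a program that needs to check whether a QP type fits its requirements. The following Python sketch encodes the rows above; the dictionary keys and the helper name are hypothetical.

```python
# Capabilities per QP type, taken from the table above.
CAPABILITIES = {
    "UD": {"send": True, "rdma_write": False, "rdma_read": False,
           "atomic": False, "reliable": False, "multicast": True},
    "UC": {"send": True, "rdma_write": True, "rdma_read": False,
           "atomic": False, "reliable": False, "multicast": False},
    "RC": {"send": True, "rdma_write": True, "rdma_read": True,
           "atomic": True, "reliable": True, "multicast": False},
}


def qp_types_supporting(*features: str):
    """Return the QP types that provide every requested feature."""
    return [qp for qp, caps in CAPABILITIES.items()
            if all(caps[feature] for feature in features)]
```

Calling `qp_types_supporting("multicast", "reliable")` returns an empty list, which reflects that none of the three QP types offers reliable multicast.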

Comments

  1. Rupert Dance says: November 29, 2013

    Nice summary. Thanks for taking the time to respond to questions.

    • Dotan Barak says: November 29, 2013

      Sure
      :)

      Thanks
      Dotan

  2. Baturay O. says: September 5, 2014

    Hi Dotan,

    I'm using SoftiWARP and trying to create qp with qp_type IBV_QPT_UD. However, I'm getting "out of memory" error. What other options should I change or consider?

    • Dotan Barak says: September 5, 2014

      Hi.

      It seems that SoftiWARP currently supports only RC QPs, so this is the reason for the failure. IMHO, they should return ENOSYS, but I'm not part of this project.

      Thanks
      Dotan

      • Baturay O. says: September 8, 2014

        I didn't know that. Thank you. I wonder if everything had gone well, would it be enough to set qp_type as IBV_QPT_UD ?

      • Dotan Barak says: September 8, 2014

        I'm sorry, but I don't really understand what you mean.

        If SoftiWARP supported UD QPs, then the QP creation would have returned a valid QP pointer.
        Is this what you meant?

        Thanks
        Dotan

  3. Baturay O. says: September 10, 2014

    I'm sorry for the misunderstanding. I mean, if SoftiWARP supported UD QPs, are there any other options to change in addition to qp_type?

    • Dotan Barak says: September 10, 2014

      For the QP creation, only the QP type would have to change.
      But the rest of the flow is different (connecting the QP is different, an Address Handle is used only in UD, and the attributes of the Send Request are different).

      Thanks
      Dotan

  4. Frank says: June 14, 2015

    Hi, thanks for all your great work, it is really helping me.

    I have a running implementation of a reliable connection using RDMA write. I want to improve the bandwidth using an unreliable connection.

    My first question: Would that switch improve the performance? And which parts do I need to change in order to use an unreliable connection (do you maybe have a code example?)

    • Dotan Barak says: June 15, 2015

      Thanks!

      I don't expect that moving to UC (Unreliable Connection) will give you extra performance...

      Anyway, the only difference between UC and RC (from the SW point of view) is the connection establishment:
      fewer attributes are set in ibv_modify_qp(). The SW should take care of the reliability (if it is needed).

      In the git repository of libibverbs there are examples that can be used as a reference.

      Thanks
      Dotan

  5. Matt says: August 5, 2015

    I am just curious about the relationship between a Queue Pair and the Receive Side Scaling queues in Linux. I mean, with a Mellanox card, I assume that the kernel will talk with the card via a Queue Pair. So if the card has, say, 8 RSS queues, does it mean that the kernel has 8 Queue Pairs to talk with the card?

    Thanks for all this great info!

    • Dotan Barak says: August 10, 2015

      Hi.

      How RSS is implemented is device specific;
      there are devices that have full Queue Pairs, whereas other devices may have only a Receive Queue or some other thin Work Queue.

      Thanks
      Dotan

  6. Elena says: July 10, 2016

    Hello Dotan,
    Thanks for a short and pithy description.
    Maybe you could also add advanced features like XRC and DCT here?

    • Dotan Barak says: August 7, 2016

      Hi Elena.

      It is in my "todo list".

      However, I fail to find the time for this;
      I'm maintaining this blog in my (little) free time and sponsor it without any help.

      I hope that within a few months I'll find the time to do it.
      Dotan

  7. Yacine says: June 27, 2017

    Hi Dotan,

    Thank you for the amazing blog.

    I had a question concerning the RC QP: say I am sending a large RDMA_WRITE and crash in the middle, the receiver's memory gets altered, right?

    Another, maybe unrelated question, do you have any idea about the granularity of writing from the NIC to the memory, for old PCI I believe it was 32- to 64-bits because of the bus length, but what about PCIe?

    Thanks a lot!

    • Dotan Barak says: July 2, 2017

      Hi.

      In a large message that is written to memory,
      if the process crashes, its resources are being cleaned and the RDMA device stops writing data to the buffers
      (otherwise, bad and unexpected things would have happened).

      Sorry, I can't answer about the PCIe writes granularity question
      (I just don't have the knowledge and don't want to confuse you here).

      Thanks
      Dotan

  8. Yacine says: July 2, 2017

    Thanks for your answer.
    I'm very curious about the details of the first question. I've seen (e.g. for InfiniBand) in the IB Specification Vol-1 (Page-139) that arriving packets are written to memory after checking that they are not corrupted and that they arrived in order, without waiting for the whole message.
    My question is: what exactly do you mean by cleaned? I can hardly imagine that the RDMA device can roll back to the old version of a receive buffer in memory, right?

    Thanks a lot,
    Yacine

    • Dotan Barak says: July 2, 2017

      Hi.

      Once a process gets a segmentation fault, the kernel is completely aware that this happened,
      and then starts to destroy/invalidate all the relevant resources.

      If an incoming packet uses an invalidated Memory Region or a non-existent Queue Pair,
      the packet will be dropped...

      Thanks
      Dotan

  9. Alok says: December 26, 2017

    Hi Dotan,
    I have a very basic question: why do we need to create multiple QPs in an application when the same thing can be done with a single QP? I mean, what is the use case for creating multiple QPs?

    • Dotan Barak says: January 5, 2018

      Hi.

      One can use one or multiple QPs in the same application; it depends on the usage.
      If you develop an all-to-all application, a single QP may not provide the best performance
      (will you use a UD QP, or multiple RC QPs, each connected to a different client?).

      This is just like the question: should I use one or multiple sockets?
      It depends on what your application is doing, how many parallel connections, performance requirements, etc.

      Thanks
      Dotan

  10. Alok says: December 26, 2017

    In an RC QP, an ACK is sent back to the requester.
    How is the ACK sent? Does it use a QP, and what is the opcode?
    Thanks in advance.
    I must say your blog cleared up most of my doubts :).

    • Dotan Barak says: January 5, 2018

      Hi.

      ACK is a packet; in the InfiniBand specifications one can see what the ACK packets look like.
      Describing the transport in detail is out of the scope of this blog...

      Thanks
      Dotan

  11. Matt says: August 8, 2018

    Hi! InfiniBand literature talks about end-to-end flow control when using RC connections (based on a credit mechanism). But there is also an RNR error type for situations "when receiver was not ready". I don't quite get how an RNR can occur if there is end-to-end flow control. The sending QP will never send a message unless it has the credits, which means that the receiver is ready and waiting for the message. So when does an RNR occur?

    Thanks!

    • Dotan Barak says: September 7, 2018

      Hi.

      It is true that there are end-to-end credits in RDMA.
      However, RNR can occur in (at least) the following flows:
      * First message
      * The remote side has an SRQ
      * Local/Remote side doesn't support this type of credit
      * If there are no credits at the remote side, (only) one message will be sent

      This was a really good question about the RDMA transport;
      I usually answer verbs programming and not transport questions.

      Thanks
      Dotan

  12. Siyuan says: December 26, 2018

    Hi, thanks for the good article. I think the "Requester" at the beginning of the last sentence in "Reliable Connected (RC) QP" part should be "Responder".

    • Dotan Barak says: December 26, 2018

      Nice catch, fixed.

      Thanks
      Dotan

  13. Jeff says: October 29, 2019

    Hi Dotan,

    Thank you for the great article!

    I have a question about the UC QP type. In the article it says that "Requester considers a message operation complete once all of the message was sent to the fabric." Does this mean that a work completion is generated locally as soon as the packet is on the wire? If so, is there no way to know if a packet has reached the other side when using UC?

    Thank you in advance!

    • Dotan Barak says: November 22, 2020

      Hi.

      Exactly. This is the meaning of the U (Unreliable) in the UC transport type.
      If one needs to know whether the packets were received on the remote side, one needs to add a message counter,
      or any other mechanism at the application level.

      Thanks
      Dotan

  14. Jakub says: March 17, 2020

    Hi, I would like to ask if it is possible to find out whether a packet is out of sequence when using UD. Thanks!

    • Dotan Barak says: July 10, 2020

      Hi.

      There isn't any receiver sequence number handling for a UD QP.
      If packets are sent from different QPs in the subnet, how can you maintain this value?

      If it is important for you to know which packets arrived and which didn't, add a serial number to the payload data
      and maintain it within your program...

      Thanks
      Dotan

  15. George Kalivianakis says: March 23, 2020

    Hello, and thanks for your awesome articles. I understand this might have been dead for over 2 years, but I have to ask anyway: why isn't there an RD option for the QP type? IB supports 4 transport services, namely RC, RD, UC and UD, yet there isn't an RD type. You don't mention it above and there isn't an IBV_QPT_RD, in contrast to the rest of the types.

    I hope some miracle will resurrect this comment section,
    George Kalivianakis.

    • Dotan Barak says: July 10, 2020

      Hi.

      RD is supported in the InfiniBand spec, so I guess that at the beginning the developers of the verbs layer wanted to be spec compliant.
      And maybe they didn't know if RD would eventually be supported, so they wanted to support it from the verbs layer
      (extending it later is very problematic because of binary compatibility issues).

      This is my opinion..
      Dotan

  16. Criss says: April 18, 2021

    Hello, can you tell me what the relationship is between a Queue Pair and the RDMA implementation (InfiniBand, RoCE, iWARP)? If I use RoCEv2 (which has a UDP header over Ethernet), is an RC QP still reliable?

    • Dotan Barak says: October 24, 2021

      Hi.

      Yes, the transport layer takes care of the reliability of Reliable QPs
      (the UDP header is just an extra header that allows packets to pass through Ethernet routers).

      Thanks
      Dotan

  17. Hamed says: September 9, 2021

    Hi Dotan,

    Thanks for your very informative site! Is there a way to disable iCRC check for RoCE v2 using UD QPs ?
    If you are aware of a way to do this, I'd appreciate it if you could please let me know.

    Thanks,
    Hamed

    • Dotan Barak says: October 26, 2021

      Hi.

      AFAIK, the answer is "no": there isn't any common way to perform this.

      However, maybe some HW vendors allow this as an undocumented feature; I don't know.

      Thanks
      Dotan
