
Connecting Queue Pairs

General

In RDMA there are two options for establishing a connection between two sides:

  1. Changing the QP state explicitly in the application by calling ibv_modify_qp()
  2. Using librdmacm (in iWARP this is the only way to do it)

In this post, I will describe how to establish a connection using the first option.

Needed information

In order to establish a connection, the two sides need to exchange information. How the data will be exchanged between both sides is out of the scope of this post (for example: this can be done using sockets, files, standard input). I will concentrate on the needed information.

Neither side can send the information about its QP over the QP itself, since the QP needs this information before it can send or receive any data (a chicken-and-egg problem). So in RDMA, when the remote side details are known, the connection is established using the Communication Manager (CM), which uses a well-known QP (QP 1) to exchange the needed information.

The needed information depends on the transport type of the QP that is being connected. Each attribute is configured in the QP according to its transport type and the state machine of the QP.

In this post I'm only talking about the required information on one side, but the same requirements exist on the remote side as well.
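Whatever channel is used for the exchange, both sides have to agree on the layout of the exchanged record. The following is a minimal sketch of such a record, assuming a LID, a QP number, an initial PSN and a GID are enough for the chosen transport. The `conn_info` structure and its pack/unpack helpers are hypothetical names used only for illustration; they are not part of libibverbs.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record of what one side publishes to its peer.
 * The field names are illustrative, not part of libibverbs. */
struct conn_info {
    uint16_t lid;     /* LID assigned to the local port */
    uint32_t qpn;     /* local QP number (24 bits are meaningful) */
    uint32_t psn;     /* initial send PSN (24 bits are meaningful) */
    uint8_t  gid[16]; /* local GID, only needed when a GRH is used */
};

enum { CONN_INFO_WIRE_SIZE = 2 + 4 + 4 + 16 };

/* Serialize into a fixed-size buffer in network byte order, so that
 * both sides agree on the layout regardless of their endianness. */
void conn_info_pack(const struct conn_info *ci, uint8_t *buf)
{
    assert(ci != NULL && buf != NULL);
    uint16_t lid = htons(ci->lid);
    uint32_t qpn = htonl(ci->qpn & 0xFFFFFFu);
    uint32_t psn = htonl(ci->psn & 0xFFFFFFu);

    memcpy(buf, &lid, sizeof(lid));
    memcpy(buf + 2, &qpn, sizeof(qpn));
    memcpy(buf + 6, &psn, sizeof(psn));
    memcpy(buf + 10, ci->gid, sizeof(ci->gid));
}

/* Inverse of conn_info_pack(): read the record received from the peer. */
void conn_info_unpack(const uint8_t *buf, struct conn_info *ci)
{
    assert(ci != NULL && buf != NULL);
    uint16_t lid;
    uint32_t qpn, psn;

    memcpy(&lid, buf, sizeof(lid));
    memcpy(&qpn, buf + 2, sizeof(qpn));
    memcpy(&psn, buf + 6, sizeof(psn));
    memcpy(ci->gid, buf + 10, sizeof(ci->gid));

    ci->lid = ntohs(lid);
    ci->qpn = ntohl(qpn);
    ci->psn = ntohl(psn);
}
```

Each side would fill the record from its own QP and port, pack it, send the `CONN_INFO_WIRE_SIZE` bytes over the chosen channel (a socket, a file, standard input), and unpack the record it receives from the peer.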

Connecting UD QPs

Assume that we connect a UD QP in node X with a UD QP in node Y, and that each side creates an AH using ah_attr.

  • The P_Key value in side X, found at index qp_attr.pkey_index(X) of the P_Key table of port qp_attr.port_num(X), must be equal to the corresponding P_Key value in side Y (what matters is the P_Key value and not its index in the table), and at least one of them must be a full member
  • qp_attr.port_num(X) must be equal to ah_attr.port_num(X)
    • If using unicast: the LID in qp_attr.ah_attr.dlid(X) must be assigned to port qp_attr.port_num(Y) in side Y
    • If using multicast: QP(Y) must be a member of the multicast group of the LID qp_attr.ah_attr.dlid(X)
  • send_wr.wr.ud.remote_qpn(X) must be equal to qp->qp_num(Y)
  • If GRH is configured in ah_attr X: the value ah_attr.grh.dgid(X) must exist in side Y GID's table of port qp_attr.port_num(Y)
  • qp_attr.qkey(X) should be equal to qp_attr.qkey(Y) unless a different Q_Key value is used in send_wr.wr.ud.remote_qkey(X) when sending a message
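The P_Key rule can be sketched in code. This is a minimal illustration, assuming the usual InfiniBand layout of a 16-bit P_Key in which the most significant bit marks full membership and the lower 15 bits identify the partition. The function names are hypothetical and are not part of libibverbs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PKEY_FULL_MEMBER_BIT 0x8000u /* 1 = full member, 0 = limited member */
#define PKEY_BASE_MASK       0x7FFFu /* the 15 bits that name the partition */

/* Two P_Keys allow communication when they name the same partition
 * and at least one of them is a full member. */
bool pkeys_can_communicate(uint16_t pkey_x, uint16_t pkey_y)
{
    bool same_partition =
        (pkey_x & PKEY_BASE_MASK) == (pkey_y & PKEY_BASE_MASK);
    bool one_is_full = ((pkey_x | pkey_y) & PKEY_FULL_MEMBER_BIT) != 0;

    return same_partition && one_is_full;
}

/* What matters is the value, not the index: search the local P_Key table
 * for an entry of the wanted partition and return its index, or -1. */
int pkey_find_index(const uint16_t *table, int len, uint16_t wanted)
{
    assert(table != NULL);
    for (int i = 0; i < len; i++) {
        if ((table[i] & PKEY_BASE_MASK) == (wanted & PKEY_BASE_MASK))
            return i;
    }
    return -1;
}
```

The index returned by a lookup like `pkey_find_index()` is the kind of value that would go into qp_attr.pkey_index; it may differ between the two sides as long as the P_Key values themselves are compatible.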

Connecting UC QPs

Assume that we connect a UC QP in node X with a UC QP in node Y.

  • The P_Key value in side X, found at index qp_attr.pkey_index(X) of the P_Key table of port qp_attr.port_num(X), must be equal to the corresponding P_Key value in side Y (what matters is the P_Key value and not its index in the table), and at least one of them must be a full member
  • qp_attr.rq_psn(X) must be equal to qp_attr.sq_psn(Y)
  • qp_attr.dest_qp_num(X) must be equal to qp->qp_num(Y)
  • qp_attr.path_mtu(X) must be equal to qp_attr.path_mtu(Y)
  • qp_attr.port_num(X) must be equal to qp_attr.ah_attr.port_num(X)
  • The LID qp_attr.ah_attr.dlid(X) must be assigned to port qp_attr.port_num(Y) in side Y
  • The LID of port qp_attr.port_num(X), combined with qp_attr.ah_attr.src_path_bits(X), must be equal to qp_attr.ah_attr.dlid(Y)
  • If GRH is configured in QP X: the value qp_attr.ah_attr.grh.dgid(X) must exist in side Y GID's table of port qp_attr.port_num(Y)
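One of the rules above ties the source path bits of side X to the destination LID configured in side Y. The following is a minimal sketch of how a source LID is formed, assuming the standard LMC behavior in which a port with LMC value n owns 2^n consecutive LIDs starting at its base LID. The function names are hypothetical and are not part of libibverbs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Compute the source LID that packets carry: the low 'lmc' bits of the
 * port's base LID are replaced by the low 'lmc' bits of src_path_bits. */
uint16_t source_lid(uint16_t base_lid, uint8_t lmc, uint8_t src_path_bits)
{
    assert(lmc <= 7); /* LMC is a 3-bit field */
    uint16_t mask = (uint16_t)((1u << lmc) - 1u);

    return (uint16_t)((base_lid & (uint16_t)~mask) | (src_path_bits & mask));
}

/* Side X's source LID must be the LID that side Y uses as its dlid. */
bool slid_matches_peer_dlid(uint16_t base_lid_x, uint8_t lmc_x,
                            uint8_t src_path_bits_x, uint16_t dlid_y)
{
    return source_lid(base_lid_x, lmc_x, src_path_bits_x) == dlid_y;
}
```

With LMC 0 the path bits have no effect and the source LID is simply the port's LID, which is the common case in small subnets.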

Connecting RC QPs

Assume that we connect an RC QP in node X with an RC QP in node Y.

  • The P_Key value in side X, found at index qp_attr.pkey_index(X) of the P_Key table of port qp_attr.port_num(X), must be equal to the corresponding P_Key value in side Y (what matters is the P_Key value and not its index in the table), and at least one of them must be a full member
  • qp_attr.rq_psn(X) must be equal to qp_attr.sq_psn(Y)
  • qp_attr.dest_qp_num(X) must be equal to qp->qp_num(Y)
  • qp_attr.path_mtu(X) must be equal to qp_attr.path_mtu(Y)
  • qp_attr.max_rd_atomic(X) must be less than or equal to qp_attr.max_dest_rd_atomic(Y)
  • qp_attr.port_num(X) must be equal to qp_attr.ah_attr.port_num(X)
  • The LID qp_attr.ah_attr.dlid(X) must be assigned to port qp_attr.port_num(Y) in side Y
  • The LID of port qp_attr.port_num(X), combined with qp_attr.ah_attr.src_path_bits(X), must be equal to qp_attr.ah_attr.dlid(Y)
  • If GRH is configured in QP X: the value qp_attr.ah_attr.grh.dgid(X) must exist in side Y GID's table of port qp_attr.port_num(Y)
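The RC rules can be turned into a small consistency check that is run on the two sets of attributes before they are applied. This is only a sketch under simplifying assumptions: the `rc_side` structure is a hypothetical mirror of a few qp_attr fields, LMC is assumed to be 0 (so the port's LID is the source LID), and the P_Key and GRH rules are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the attributes one side plans to configure.
 * Field names follow qp_attr, but this is not a libibverbs type. */
struct rc_side {
    uint32_t qp_num;             /* qp->qp_num */
    uint32_t dest_qp_num;        /* qp_attr.dest_qp_num */
    uint32_t sq_psn;             /* qp_attr.sq_psn */
    uint32_t rq_psn;             /* qp_attr.rq_psn */
    int      path_mtu;           /* qp_attr.path_mtu, same encoding on both sides */
    uint8_t  max_rd_atomic;      /* qp_attr.max_rd_atomic */
    uint8_t  max_dest_rd_atomic; /* qp_attr.max_dest_rd_atomic */
    uint16_t port_lid;           /* LID of the local port (LMC 0 assumed) */
    uint16_t dlid;               /* qp_attr.ah_attr.dlid */
};

/* Check the rules in the X -> Y direction.
 * Returns NULL when all rules hold, else a description of the first violation. */
const char *rc_check_direction(const struct rc_side *x, const struct rc_side *y)
{
    assert(x != NULL && y != NULL);
    if (x->rq_psn != y->sq_psn)
        return "rq_psn(X) != sq_psn(Y)";
    if (x->dest_qp_num != y->qp_num)
        return "dest_qp_num(X) != qp_num(Y)";
    if (x->path_mtu != y->path_mtu)
        return "path_mtu(X) != path_mtu(Y)";
    if (x->max_rd_atomic > y->max_dest_rd_atomic)
        return "max_rd_atomic(X) > max_dest_rd_atomic(Y)";
    if (x->dlid != y->port_lid)
        return "dlid(X) is not the LID of the port in side Y";
    return NULL;
}

/* The same requirements exist on both sides, so check both directions. */
const char *rc_check_pair(const struct rc_side *a, const struct rc_side *b)
{
    const char *err = rc_check_direction(a, b);

    return err != NULL ? err : rc_check_direction(b, a);
}
```

A check like this is mainly useful while debugging: when messages are not delivered, printing the string returned by `rc_check_pair()` points at the first attribute pair that does not match.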

Summary

In this post I covered the information needed to establish a connection between two sides. In the near future, I will cover the full librdmacm API.

FAQs

How can I get the needed information?

Some of the attributes can be taken from the QP itself (the QP number), some from the local port, and some are determined by the application.

Do I have to provide all of the above-mentioned information?

Yes. This information is needed when changing the QP state in its state machine.

What will happen if I won't configure the QP properly?

If the needed information isn't configured properly in the QPs, most likely messages won't be sent/received between the two sides.

Comments

  1. Omar Khan says: January 30, 2014

    Hi

    I want to set up an all-to-all connection between several processes using the librdmacm API, so I will have an active and a passive side in each process. I intend to start two threads in each process: one thread to process the incoming connection requests (passive side) and one thread to connect to other processes (active side). What I want to ask is: can I use one listening rdma_cm_id bound to a particular port to wait for all connection requests, or do I have to start a listening rdma_cm_id for each connecting process? I have tried using one listening rdma_cm_id for multiple processes. It works fine for a one-to-one connection between two processes, but when I connect multiple processes, the communication performance degrades. It seems as if I am unable to communicate between processes independently of the other processes. The purpose was to set up an infrastructure where each process can communicate with any other process independently and in parallel. If I do a send from one process to every other process in a loop and wait for a receive from each process, it takes a lot of time: for four processes it takes almost 15 secs when it should take a few microseconds. I hope you get what I mean.
    Please help me out, as I have exhausted almost all the information available on the net. My code is based on the cmatose.c file that comes with librdmacm. cmatose.c also sets up communication between multiple processes, but by default it is set for only one.
    I am using only one completion channel and one listening communication id at each process to set up connection but different send/receive and completion queues for each connected process.

    • Dotan Barak says: January 30, 2014

      Hi Omar.

      I'm really sorry, but I don't have a lot of experience with the librdmacm library
      (I plan to fill the gap in the next few weeks, but I guess it won't help you).

      In the librdmacm there is an example called "rping". This example has persistent mode and it uses the same cm_id for multiple connections.

      Since librdmacm tries to provide a socket semantics to RDMA, I'm sure that you can use the same cm_id for multiple connections.

      As I suggested in the previous reply, did you try to send email to "linux-rdma" about this issue?
      Adding a source code may help to reproduce or find a solution even faster.

      I'm sorry that I can't provide any further help

      Thanks
      Dotan

  2. Martin says: August 20, 2014

    Hello Dotan,
    I recently started to work with RDMA. In Berkeley sockets, "struct socket" contains "struct sockbuf so_rcv, so_snd", and "struct sockbuf" contains a member "sb_cc" that indicates the size of the buffer, so I can write a callback function that will be triggered when the buffer size changes. Is there a similar way in RDMA to write a callback function that is invoked when the Send/Recv Queues' payload changes?

    • Dotan Barak says: August 21, 2014

      Hi Martin.

      Since in RDMA the access to the data in the buffers isn't done using send()/recv() functions (the RDMA device sends/receives data directly from the application's memory buffers), there isn't such a mechanism in RDMA.

      RDMA is a protocol with different semantics than the "standard" sockets. What are you trying to achieve? Maybe I can suggest you an alternative method to meet your needs...

      Thanks
      Dotan

      • Martin says: August 23, 2014

        Hello Dotan,
        Thank you very much for your reply!
        What I am trying to do is implementing a callback function that will be invoked whenever the Send/Recv Queues are sending/receiving data.
        So I need to be able to know the changes("buffer size" in Berkeley Socket) in Send/Recv Queues.

        Thanks again!

        Best Wishes
        Martin

      • Dotan Barak says: August 23, 2014

        Hi Martin.

        In RDMA programming in userspace there isn't any callback (everything is synchronous).
        However, if you want to know when data was sent/received you need to read the Work Completions;
        Work Completions are generated when data was sent and when an incoming message was received.

        You can either work with polling or with Completion events (for more information, go to the post on ibv_req_notify_cq()).

        One exception though is incoming RDMA data - there isn't any notification about incoming RDMA data
        (but this is what RDMA is all about ...)

        I hope that this answer helped you.
        Dotan

  3. Igor R. says: September 23, 2014

    Hi Dotan,

    Is it possible to connect QPs if the peers don't have IP (and maybe even TCP/IP stack)?

    Thanks.

    • Dotan Barak says: September 23, 2014

      Yes.

      We use the TCP socket in many examples to exchange information between both sides,
      just because it is available.

      However, for InfiniBand to work (and connect QPs) TCP/IP isn't mandatory...

      Thanks
      Dotan

      • Igor R. says: September 23, 2014

        I see. So what's the alternative method of exchanging the QP params that wouldn't use TCP/IP?

      • Dotan Barak says: September 24, 2014

        I can think about (at least) two alternatives:
        1) Use libibcm library
        2) Use multicast groups to know about new members

        Thanks
        Dotan

  4. Valentin Petrov says: December 22, 2014

    Hi, Dotan, I'm using librdmacm to connect 2 peers with RC QPs. My question is: is it possible to control path_mtu size in this case? As far as I could understand it is only allowed to change this during INIT->RTR state modification and this seems to be hidden inside librdmacm.

    • Dotan Barak says: December 22, 2014

      Hi Valentin.

      The question should be:
      What is the reason that you try to change the path_mtu by yourself, and don't allow librdmacm to decide which value to use?
      (In InfiniBand, librdmacm issues an SA query and uses the optimal value)

      Thanks
      Dotan

      • Valentin Petrov says: December 23, 2014

        Hi, Dotan, thanks for your reply. The reason I'm trying to control the MTU is that I'm striving to get the best latency possible. Experiments with NetPipe based on verbs show that for messages from 512 b up to 4K, a smaller MTU gives significantly better latency numbers (mtu=512 vs mtu=4k gives ~25% improvement for 2K msg size).

      • Dotan Barak says: December 25, 2014

        Hi Valentin.

        I'm sorry, but I think that you cannot control the path MTU in rdmacm,
        since this attribute is being set using the output of the SA Query.

        Thanks
        Dotan

  5. Santosh says: May 18, 2015

    Hi Dotan

    I am able to connect the client and server machines using a single QP, but I want to extend this to multiple RDMA QPs. Can I create multiple queue pairs with the parameters given below?
    QP1:{Server IP_XX.YY.ZZ.WW:port_A, rdma_cm_id_A, PD_A, QP_attribute_1}
    QP2:{Server IP_XX.YY.ZZ.WW:port_A, rdma_cm_id_A, PD_A, QP_attribute_2}
    QPn:{Server IP_XX.YY.ZZ.WW:port_A, rdma_cm_id_A, PD_A, QP_attribute_n}

    Thanks & Regards
    Santosh

    • Dotan Barak says: May 22, 2015

      Hi Santosh.

      I'm not an expert in the RDMA-CM API (I will fill this gap in the future);
      however, I think that one rdma_cm_id can be used with one QP.

      Did it answer your question?

      Thanks
      Dotan

  6. Santosh says: May 22, 2015

    This is correct and I am able to create the QPs. Where as the client machine has one rdma_cm_id and server machine has one rdma_cm_id for each queue pair on server IP:single port.

    Now wanted to extend the same for the multiple QP and trying to investigate on the same.

    • Dotan Barak says: May 23, 2015

      Hi Santosh.

      I don't really understand what you are trying to achieve ...

      Dotan

  7. Santosh says: May 23, 2015

    Hi Dotan,

    I am new to RDMA and am trying to find the correct mechanism for creating multiple queue pairs between the client and server machines, and to figure out whether the server can listen on the same IP:port and create multiple QPs for the host.

    Thanks

    • Dotan Barak says: May 26, 2015

      I would suggest to look at the RDMACM git repository:
      http://git.openfabrics.org/?p=~shefty/librdmacm.git
      at examples/rping.c

      This example has a persistent mode, which I'm sure will give you a hint on the best way to do what you want.

      Thanks
      Dotan

  8. Jack says: June 18, 2015

    Hello Dotan,
    I am still a little bit confused about how to use the CM.
    What I am doing now is using a TCP socket to exchange the parameters of both sides.
    How can each side exchange info using the CM? (They need to establish a channel first, but how is it established? Chicken and egg, like you said.)

    All the best
    Jingyi

    • Dotan Barak says: June 19, 2015

      I agree.

      This is the reason that CM exists; it uses QP #1, a special QP for this.
      This is how the chicken and egg problem is solved (well known QP number).

      You can look at examples in librdmacm on how to do it
      (I still didn't publish posts that explain about it; it is planned in the next few months).

      BTW, sockets can be used even in pure InfiniBand subnet (over IPoIB).

      Thanks
      Dotan

  9. Steven says: August 27, 2015

    Hello Dotan,
    I am confused about the CM. Where does it reside? Is it like a daemon thread?
    Do you know where I can find example code that uses the CM (QP1) for connection establishment? Is there any such application?

    • Dotan Barak says: September 13, 2015

      Hi Steven.

      The CM is a kernel module, which is part of the RDMA stack in the kernel.
      It doesn't work in a daemon thread.

      You can find its source code in the Linux kernel:
      /drivers/infiniband/core/cm.c

      Thanks
      Dotan

  10. Jorn says: October 1, 2015

    Hello Dotan,

    great article, thanks. I am looking for examples on how to use libibcm, do you have any pointers?

    Thanks, Jorn

    • Dotan Barak says: October 3, 2015

      Thanks.

      Currently, there aren't any posts on libibcm.
      You can only check the examples that come with this library.

      Thanks
      Dotan

  11. Yanfang Le says: December 22, 2015

    Hi, Dotan,

    Could you please add some notes about connecting Queue pairs in ROCE environment? What should I add to connect the queue pairs in ROCE? Thanks.

    • Dotan Barak says: January 1, 2016

      Hi.

      You can use librdmacm to connect Queue Pairs over RoCE,
      or use sockets to exchange information; RoCE requires working with GIDs.

      Thanks
      Dotan

  12. Yanfang Le says: December 22, 2015

    Hi Dotan,

    I used tshark and saw that the server side gets the packet, but the packet is not placed into the server's receive memory region. Do you know what is happening? Are my queue pairs connected in this case? I really appreciate your help. Thanks. By the way, I use the send operation in a RoCE environment.

    • Dotan Barak says: January 1, 2016

      Hi.

      Tshark shows the incoming messages and not the packets that were written to memory.
      Was the packet received in the first place?

      Thanks
      Dotan

  13. Kaixin says: January 12, 2019

    Hello, I walk through your posts (and of course wonderful comments) thoroughly and find it quite helpful since I am really a novice in RDMA programming. Now I can write some simple codes involved with RDMA send/recv, read/write, rdma_cm_id based connection management and event-driven programming. I hope you are still maintaining the blog now and I just want to know if you will continuously update posts about using "librdmacm" ?

    • Dotan Barak says: January 13, 2019

      Hi.

      Thanks for the feedback
      :)

      I don't know if I'll add posts about librdmacm;
      I'm considering writing an (e)book that will cover it though.

      Sorry
      Dotan

  14. haonan says: September 14, 2019

    hello, Dotan
    rdmacm seems to be moved to userspace, MLNX_OFED v4.5 and linux-core(https://github.com/linux-rdma/rdma-core/tree/master/librdmacm) give some introduction. Is it true? or is rdmacm split into two parts or two mode?
    Thanks.

    • Dotan Barak says: September 17, 2019

      Hi.

      rdmacm has 2 parts: a userspace library and a kernel part (which is part of the RDMA stack).
      The userspace library was maintained in a dedicated git repository,
      and now it is part of a big repository (rdma-core) which holds most of the userspace libraries.

      I hope that this is clear now.
      Maybe I'll write a post on it in the future.

      Thanks
      Dotan

  15. Raj says: June 29, 2020

    Hi Dotan,

    When creating RC QPs from nodes X and Y to node Z, how do I make sure they dont use the same QPN?

    • Dotan Barak says: July 10, 2020

      Hi.

      When you create QP in different hosts, you don't control (and from RDMA point of view, you shouldn't care about) the QP number.

      If it is important for you and you want to enforce having different QP numbers,
      you can create a QP in host X and publish its number to host Y; Y will create QPs (without destroying them) until it gets a number different from the QP in node X,
      and then destroy all the unneeded QPs that it created.

      And I'm sure that you'll understand how to continue from here ..
      :)

      Thanks
      Dotan
