ibv_create_qp()

struct ibv_qp *ibv_create_qp(struct ibv_pd *pd,
                             struct ibv_qp_init_attr *qp_init_attr);

Description

ibv_create_qp() creates a Queue Pair (QP) associated with a Protection Domain.

The user can request minimum attributes for the QP: the number of Work Requests and the number of scatter/gather entries per Work Request in the Send and Receive Queues. The actual attributes can be equal to or higher than the requested values.

The struct ibv_qp_init_attr describes the requested attributes of the newly created QP.

struct ibv_qp_init_attr {
	void		       *qp_context;
	struct ibv_cq	       *send_cq;
	struct ibv_cq	       *recv_cq;
	struct ibv_srq	       *srq;
	struct ibv_qp_cap	cap;
	enum ibv_qp_type	qp_type;
	int			sq_sig_all;
};

Here is the full description of struct ibv_qp_init_attr:

qp_context

(optional) User defined value which will be available in qp->qp_context

send_cq

A Completion Queue, that was returned from ibv_create_cq(), to be associated with the Send Queue

recv_cq

A Completion Queue, that was returned from ibv_create_cq(), to be associated with the Receive Queue

srq

(optional) A Shared Receive Queue, that was returned from ibv_create_srq(), that this Queue Pair will be associated with. Otherwise, NULL

cap

Attributes of the Queue Pair size, as described in the table below. Upon a successful Queue Pair creation, this structure will hold the actual Queue Pair attributes

qp_type

Requested Transport Service Type of this QP:

IBV_QPT_RC	Reliable Connection
IBV_QPT_UC	Unreliable Connection
IBV_QPT_UD	Unreliable Datagram

sq_sig_all

The Signaling level of Work Requests that will be posted to the Send Queue in this QP.

0	In every Work Request submitted to the Send Queue, the user must decide whether to generate a Work Completion for successful completions or not
otherwise	All Work Requests that will be submitted to the Send Queue will always generate a Work Completion
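
When sq_sig_all is 0, a common pattern is to request a Work Completion only for every Nth Send Request, so that the Send Queue never fills up with unsignaled requests. The following is a minimal sketch of that bookkeeping in plain C. The names signal_policy and signal_policy_next are hypothetical (they are not part of libibverbs); in a real program, a non-zero return value would translate to setting IBV_SEND_SIGNALED in the send_flags of the Work Request that is passed to ibv_post_send().

```c
#include <stdint.h>

/* Hypothetical helper: decides which Send Requests should be signaled. */
struct signal_policy {
	uint32_t interval;   /* request a completion every 'interval' posts */
	uint32_t unsignaled; /* unsignaled requests since the last signaled one */
};

/* Returns 1 if the next Send Request should be signaled, 0 otherwise. */
static int signal_policy_next(struct signal_policy *p)
{
	if (p->unsignaled + 1 >= p->interval) {
		p->unsignaled = 0;
		return 1;
	}
	p->unsignaled++;
	return 0;
}
```

Keeping the interval no larger than the Send Queue depth is one way to make sure that a full Send Queue always contains at least one signaled request.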

The InfiniBand spec also defines a Reliable Datagram QP transport type. However, neither the RDMA software stack nor any RDMA device supports it.

send_cq and recv_cq can be the same CQ or different CQs.

RC and UD QPs can always be associated with an SRQ. There are RDMA devices which allow a UC QP to be associated with an SRQ as well. However, there currently isn't any indication that an RDMA device supports this.

struct ibv_qp_cap describes the size of the Queue Pair (for both Send and Receive Queues).

struct ibv_qp_cap {
	uint32_t		max_send_wr;
	uint32_t		max_recv_wr;
	uint32_t		max_send_sge;
	uint32_t		max_recv_sge;
	uint32_t		max_inline_data;
};

Here is the full description of struct ibv_qp_cap:

max_send_wr	The maximum number of outstanding Work Requests that can be posted to the Send Queue in that Queue Pair. Value can be [0..dev_cap.max_qp_wr]. Some RDMA devices may support, for specific transport types, fewer outstanding Work Requests than the maximum reported value.
max_recv_wr	The maximum number of outstanding Work Requests that can be posted to the Receive Queue in that Queue Pair. Value can be [0..dev_cap.max_qp_wr]. Some RDMA devices may support, for specific transport types, fewer outstanding Work Requests than the maximum reported value. This value is ignored if the Queue Pair is associated with an SRQ
max_send_sge	The maximum number of scatter/gather elements in any Work Request that can be posted to the Send Queue in that Queue Pair. Value can be [0..dev_cap.max_sge]. Some RDMA devices may support, for specific transport types, fewer scatter/gather elements than the maximum reported value.
max_recv_sge	The maximum number of scatter/gather elements in any Work Request that can be posted to the Receive Queue in that Queue Pair. Value can be [0..dev_cap.max_sge]. Some RDMA devices may support, for specific transport types, fewer scatter/gather elements than the maximum reported value. This value is ignored if the Queue Pair is associated with an SRQ
max_inline_data	The maximum message size (in bytes) that can be posted inline to the Send Queue. 0, if no inline message is requested
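
Because max_send_wr bounds the number of outstanding Send Requests, an application typically tracks how many Send Queue slots are free before posting. Below is a minimal sketch of such a counter, assuming that every Send Request is signaled. The struct sq_credits and its functions are hypothetical names used for illustration only; they are not part of libibverbs.

```c
#include <stdint.h>

/* Hypothetical bookkeeping for one Send Queue. */
struct sq_credits {
	uint32_t max_send_wr; /* actual value reported after QP creation */
	uint32_t outstanding; /* posted, but completion not yet polled */
};

/* Reserve a slot before posting: 0 on success, -1 if the queue is full. */
static int sq_credits_take(struct sq_credits *c)
{
	if (c->outstanding >= c->max_send_wr)
		return -1;
	c->outstanding++;
	return 0;
}

/* Release slots after polling 'n' send Work Completions from the CQ. */
static void sq_credits_release(struct sq_credits *c, uint32_t n)
{
	c->outstanding = (n > c->outstanding) ? 0 : c->outstanding - n;
}
```

The idea is to call sq_credits_take() before every post and sq_credits_release() with the number of send Work Completions that were polled.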

Sending inline data is an implementation extension that isn't defined in any RDMA specification: it allows sending the data itself in the Work Request (instead of the scatter/gather entries) that is posted to the RDMA device. The memory that holds this message doesn't have to be registered. There isn't any verb that reports the maximum message size that can be sent inline in a QP, and only some RDMA devices support it. In some RDMA devices, creating a QP will set the value of max_inline_data to the size of messages that can be sent using the requested number of scatter/gather elements of the Send Queue. In others, one should explicitly specify the message size to be sent inline before the creation of the QP. For those devices, it is advised to try to create the QP with the required message size and keep decreasing it if the QP creation fails.
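
The "decrease and retry" advice can be sketched as a small loop. This is only an illustration under stated assumptions: the callback type try_create_fn and the function find_inline_size are hypothetical names, and fake_try_create simulates a device with a fixed inline limit. In a real program the callback would fill max_inline_data, call ibv_create_qp() and return 0 when it succeeds. Halving the size on every failure is an arbitrary choice; any decreasing sequence would do.

```c
#include <stdint.h>

/* Hypothetical callback: returns 0 if a QP could be created with the
 * given max_inline_data, non-zero otherwise. */
typedef int (*try_create_fn)(uint32_t max_inline_data, void *arg);

/* Halve the requested inline size until creation succeeds.
 * Returns the accepted size, or -1 if creation fails even with 0. */
static int64_t find_inline_size(uint32_t wanted, try_create_fn try_create,
				void *arg)
{
	uint32_t size = wanted;

	for (;;) {
		if (try_create(size, arg) == 0)
			return size;
		if (size == 0)
			return -1;
		size /= 2;
	}
}

/* Stand-in for a real device: accepts sizes up to *(uint32_t *)arg. */
static int fake_try_create(uint32_t max_inline_data, void *arg)
{
	return max_inline_data <= *(const uint32_t *)arg ? 0 : -1;
}
```

With a simulated limit of 200 bytes and a requested size of 512, the loop tries 512, 256 and then settles on 128.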

Parameters

Name	Direction	Description
pd	in	Protection Domain that was returned from ibv_alloc_pd()
qp_init_attr	in/out	Requested attributes for the Queue Pair. After the QP creation, it will hold the actual attributes of the QP

Return Values

Value	Description

A pointer to the newly allocated Queue Pair.
This pointer also contains the following fields:

qp_context	The value qp_context that was provided to ibv_create_qp()
qp_num	The number of this Queue Pair. A 24-bit value, which is unique per RDMA device. As QPs are destroyed and created, QP numbers may be reused. However, at a given point in time, only a single QP in the RDMA device will exist with the given number. The user cannot control or influence this value
state	The last known state of this Queue Pair. The actual state may be different from this state (if the RDMA device transitioned the QP into another state)
qp_type	The Transport Service Type of this Queue Pair
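
Since qp_num is a 24-bit value stored in a 32-bit field, only the low 24 bits are meaningful when the number is exchanged with a remote side. A small sketch of the arithmetic follows; the macro and function names are hypothetical and chosen for illustration only.

```c
#include <stdint.h>

#define QPN_BITS  24u
#define QPN_MASK  ((1u << QPN_BITS) - 1u) /* 0xFFFFFF */
#define QPN_COUNT (1u << QPN_BITS)        /* 16777216 possible QP numbers */

/* Extract a QP number from a 32-bit word that may carry other bits. */
static uint32_t qpn_from_word(uint32_t word)
{
	return word & QPN_MASK;
}
```

This is also where the theoretical limit of 16M QPs per RDMA device comes from.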

NULL

On failure, errno indicates the failure reason:

EINVAL	Invalid pd, send_cq, recv_cq, srq or invalid value provided in max_send_wr, max_recv_wr, max_send_sge, max_recv_sge or in max_inline_data
ENOMEM	Not enough resources to complete this operation
ENOSYS	QP with this Transport Service Type isn't supported by this RDMA device
EPERM	Not enough permissions to create a QP with this Transport Service Type

Examples

1) Create a QP that uses the same CQ for both the Send and Receive Queues and destroy it:

struct ibv_pd *pd;	/* returned from ibv_alloc_pd() */
struct ibv_cq *cq;	/* returned from ibv_create_cq() */
struct ibv_qp *qp;
struct ibv_qp_init_attr qp_init_attr;
 
memset(&qp_init_attr, 0, sizeof(qp_init_attr));
 
qp_init_attr.send_cq = cq;
qp_init_attr.recv_cq = cq;
qp_init_attr.qp_type = IBV_QPT_RC;
qp_init_attr.cap.max_send_wr  = 2;
qp_init_attr.cap.max_recv_wr  = 2;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
 
qp = ibv_create_qp(pd, &qp_init_attr);
if (!qp) {
	fprintf(stderr, "Error, ibv_create_qp() failed\n");
	return -1;
}
 
if (ibv_destroy_qp(qp)) {
	fprintf(stderr, "Error, ibv_destroy_qp() failed\n");
	return -1;
}

2) Create a QP with different CQs in the Send and Receive Queues:

struct ibv_pd *pd;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_qp *qp;
struct ibv_qp_init_attr qp_init_attr;
 
memset(&qp_init_attr, 0, sizeof(qp_init_attr));
 
qp_init_attr.send_cq = send_cq;
qp_init_attr.recv_cq = recv_cq;
qp_init_attr.qp_type = IBV_QPT_RC;
qp_init_attr.cap.max_send_wr  = 2;
qp_init_attr.cap.max_recv_wr  = 2;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
 
qp = ibv_create_qp(pd, &qp_init_attr);
if (!qp) {
	fprintf(stderr, "Error, ibv_create_qp() failed\n");
	return -1;
}

3) Create a QP, which is associated with an SRQ:

struct ibv_pd *pd;
struct ibv_cq *cq;
struct ibv_srq *srq;
struct ibv_qp *qp;
struct ibv_qp_init_attr qp_init_attr;
 
memset(&qp_init_attr, 0, sizeof(qp_init_attr));
 
qp_init_attr.send_cq = cq;
qp_init_attr.recv_cq = cq;
qp_init_attr.srq     = srq;
qp_init_attr.qp_type = IBV_QPT_RC;
qp_init_attr.cap.max_send_wr  = 2;
qp_init_attr.cap.max_send_sge = 1;
 
qp = ibv_create_qp(pd, &qp_init_attr);
if (!qp) {
	fprintf(stderr, "Error, ibv_create_qp() failed\n");
	return -1;
}

4) Create a QP that supports inline messages:

struct ibv_pd *pd;
struct ibv_cq *cq;
struct ibv_qp *qp;
struct ibv_qp_init_attr qp_init_attr;
 
memset(&qp_init_attr, 0, sizeof(qp_init_attr));
 
qp_init_attr.send_cq = cq;
qp_init_attr.recv_cq = cq;
qp_init_attr.qp_type = IBV_QPT_RC;
qp_init_attr.cap.max_send_wr  = 2;
qp_init_attr.cap.max_recv_wr  = 2;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
qp_init_attr.cap.max_inline_data = 512;
 
qp = ibv_create_qp(pd, &qp_init_attr);
if (!qp) {
	fprintf(stderr, "Error, ibv_create_qp() failed\n");
	return -1;
}

FAQs

What is a QP good for anyway?

A QP is the actual object that sends and receives data in the RDMA architecture (something like a socket).

Are socket and QP equivalent?

Not exactly. A socket is an abstraction, which is maintained by the network stack and doesn't have a physical resource behind it. A QP is a resource of an RDMA device, and a QP number can be used by only one process at a time (similar to a socket that is associated with a specific TCP or UDP port number).

Can I associate several QPs with the same SRQ?

Yes, you can.

Which QP Transport Types can be associated with an SRQ?

RC and UD QPs can be associated with an SRQ in all RDMA devices. In some RDMA devices, you can associate a UC QP with an SRQ as well.

Do I need to set the Receive Queue attributes if I associate a QP with an SRQ?

No, you don't have to do it. The Receive Queue attributes are completely ignored if the QP is being associated with an SRQ.

Can I use the same CQ in both Send and Receive Queues?

Yes, you can.

Can I use one CQ in the Send Queue and another CQ in the Receive Queue?

Yes, you can.

How can I know what is the maximum message size that can be sent inline in a QP?

You can't query this information; it isn't available from any verb. You can find it only by trial and error.

I created a QP with transport type X and the QP was created successfully. I tried to create a QP with transport type Y and the QP creation failed. What happened?

The values in dev_cap.max_sge and dev_cap.max_qp_wr report the maximum numbers of scatter/gather entries and Work Requests that are supported across all QP transport types. However, for a specific RDMA device, there may be QP transport types that cannot be created with those maximum values. Using trial and error, one should find the right attributes for this specific RDMA device.

The device capabilities reported that max_qp_wr/max_sge is X, but when I tried to create a QP with those attributes it failed. What happened?

The values in dev_cap.max_sge and dev_cap.max_qp_wr report the maximum numbers of scatter/gather entries and Work Requests that are supported by any Work Queue (Send and Receive). However, for a specific RDMA device, there may be other considerations for the Send or Receive Queue that prevent a QP from being created with those maximum values. Using trial and error, one should find the right attributes for this specific RDMA device.
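
One way to do the trial and error systematically is a binary search between 1 and the reported maximum. The sketch below assumes that if a value is accepted, every smaller value is accepted too, which may not hold for every device. The names probe_fn, largest_accepted and fake_probe are hypothetical; in a real program the probe callback would call ibv_create_qp() with the candidate value, destroy the QP with ibv_destroy_qp() if it was created, and return 0 on success.

```c
#include <stdint.h>

/* Hypothetical probe: returns 0 if a QP can be created with 'value'. */
typedef int (*probe_fn)(uint32_t value, void *arg);

/* Largest value in [1, upper] accepted by the probe, or 0 if none is. */
static uint32_t largest_accepted(uint32_t upper, probe_fn probe, void *arg)
{
	uint32_t lo = 0;     /* 0, or a value known to be accepted */
	uint32_t hi = upper; /* every value above 'hi' is known to fail */

	while (lo < hi) {
		uint32_t mid = hi - (hi - lo) / 2; /* upper middle, > lo */

		if (probe(mid, arg) == 0)
			lo = mid;
		else
			hi = mid - 1;
	}
	return lo;
}

/* Stand-in for a real device: accepts values up to *(uint32_t *)arg. */
static int fake_probe(uint32_t value, void *arg)
{
	return value <= *(const uint32_t *)arg ? 0 : -1;
}
```

For a reported maximum of 16384, this needs at most about 15 creation attempts instead of thousands of one-by-one decrements.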

Written by: Dotan Barak on December 21, 2012. Last updated on June 26, 2015.

Comments

Samarth says: February 14, 2013

How do you associate each qp with cq?

- Dotan Barak says: February 14, 2013
  
  When creating a QP, one fills the structure ibv_qp_init_attr:
  * The field send_cq is the CQ that is associated with the QP's Send Queue.
  * The field recv_cq is the CQ that is associated with the QP's Receive Queue.
  
  You can use the same CQ for both Send and Receive Queues or use different CQs.
  
  When you call ibv_create_qp(), the newly created QP is associated with the CQs that you specified.
  
  Thanks
  Dotan
  
  - Max says: May 5, 2014
    
    Hi Dotan!
    Can i use same CQ for different QP's?
    
    Unfortunately I can't create QP with shared receive queue in windows OFED, but this work in the Mellanox OFED (WinOF, but i can't use it)
  - Dotan Barak says: May 5, 2014
    
    Hi Max.
    
    Yes, you can use the same CQ for different QPs.
    
    Thanks
    Dotan
Sara says: April 17, 2013

Dotan, thanks for the info. I'm getting an ENOMEM on creating the third call (in a server program; one context for each client) to ibv_create_qp() with following parameters (1 page size for memory region, shared send & recv cq, of depth 10, max send/recv sge = 1). strace indicates the create QP verb failing on write to the verbs device with ENOMEM. All others are default settings. Any pointers on how to proceed will be much appreciated.
Thanks
Sara

- Sara says: April 17, 2013
  
  A quick update: if I run it as superuser I don't hit this issue. But "ulimit -a" shows the same values for both user & superuser. What could be the difference between the two scenarios?
  Thanks
  Sara
  
  - Sara says: April 18, 2013
    
    Please ignore my comments :)
    The /etc/security/limits.conf values were not properly propagated due to incorrect pam config. I fixed that and now it works.
  - Dotan Barak says: April 18, 2013
    
    This is great that you managed to solve it, thanks for the update.
    
    When I'll finish covering all of the verbs description,
    I plan to write about the memory locking issues...
    
    I hope that you find this blog useful..
    Dotan
Jeff says: June 18, 2013

I'm able to create a qp with qp_type IBV_QPT_RC. However creating a qp with type IBV_QPT_UD and with the same parameters that I used to create IBV_QPT_RC, returns NULL with invalid argument error. I could not figure out which parameter could be invalid, any suggestions? I'm trying to create a UD.
Thanks,
Jeff

- Dotan Barak says: June 18, 2013
  
  Hi.
  
  If you'll specify the attributes that you are using for creating the QP,
  maybe I'll be able to provide a tip on this...
  
  There may be some HCAs that have different attributes to RC and UD QPs,
  so decreasing the number of s/g or the number of WRs or the number of inline data may fix this issue.
  
  Thanks
  Dotan
  
Lluis says: October 22, 2014

Hi,

I'm using an UD communication. At the "server" side I do a ibv_create_qp every time that a client "connects" (with quotes because there is no connection in the traditional sense). However, since there is no real connection, how can I know that the client disconnected in order to release the QP created with ibv_create_qp?

Thank you very much for maintaining this great, and resourceful, website!

- Dotan Barak says: October 22, 2014
  
  Hi.
  
  First of all, thanks for the compliments, I'm trying to do my best
  :)
  
  In order to know when to destroy the QP, you have several options:
  1) Use the CM libraries (libibcm/librdmacm) for connection establishment and teardown
  2) Handle this within your application: maintain "keep alive" messages and/or a "leaving" message
  
  The question is: do you really need several QPs?
  You can use the same QP to handle all the communications...
  (only different Address Handle can be used)
  
  I hope that my answer helped you..
  
  Thanks
  Dotan
  
  - Lluis says: October 22, 2014
    
    Thank you for the answer. It indeed helps.
    
    I guess that reusing the QP is the easiest solution. But that brings me two doubts:
    - Does a single QP scale well?
    - Is it expensive to create&destroy an address handle every time the server receives a message?
    
    Thanks
  - Dotan Barak says: October 23, 2014
    
    Hi Lluis.
    
    Those are good questions:
    * The question is whether one UD QP will scale to your needs
    (IMHO, one UD QP can't get to full line rate, but I don't know what your application needs are)
    * Creating and destroying an Address Handle is relatively cheap compared to creating and destroying a QP
    (an AH can be created without a context switch, depending on the low-level driver,
    and it has a small footprint compared to a QP)
    
    Thanks
    Dotan
Valentin Petrov says: December 16, 2014

Hi, Dotan,
I've got a question regarding max_send/recv_wr qp attributes. While max_recv_wr is clear for me (i can prepost as many wrs to recv qp as it was specified with this attribute value) there is still some ambiguity with max_send_wr parameter. Suppose, for example, i set max_send_wr=5, and i'm doing ibv_post_send calls in a loop (each time posting a single wr). Is it correct to say that proper code has to wait for 5 completions after each 5 WRs posted (assuming all a signalled)? Or, the work requests are consumed when they are being posted? Will the code work if max_send_wr=1 and I only check for send completion queue overflow (and not the send QP depth itself)? Thanks in advance for your help!

- Dotan Barak says: December 16, 2014
  
  Hi.
  
  A Send Request (like any other Work Request) is considered outstanding until there is a Work Completion for it or for Send Requests that were posted after it
  (if you are using Unsignaled Send Requests).
  
  The attribute max_send_wr specifies how many Send Requests can be outstanding.
  So, if all Send Requests are signaled, you must poll the Work Completions of the corresponding Send Requests.
  
  For example, if you set max_send_wr to 5 (assuming that the low-level driver didn't increase this value)
  and you posted 5 Send Requests, posting the 6th Send Request will fail. You'll be able to post another Send Request
  only after at least one Work Completion (that was generated by a Send Request which ended) is polled from the Completion Queue.
  
  You can look at it this way: polling the Work Completion of an ended Send Request consumes that Send Request from the Send Queue.
  
  Is it clearer now?
  
  Thanks
  Dotan
  
  - Valentin Petrov says: December 16, 2014
    
    Oh, I see now. Thanks a lot! BTW, does the same hold for SEND_INLINE? I mean there is no additional semantics with respect to completion right (only buffering)?
  - Dotan Barak says: December 16, 2014
    
    Yes.
    
    SEND_INLINE is yet another feature in ibv_post_send() and the semantics that I wrote is relevant to it as well.
    
    Thanks
    Dotan
  - Valentin Petrov says: December 16, 2014
    
    Ok, I see. Thanks again for doing a great job with this blog! It's been extremely helpful for me!
gp says: June 25, 2015

hi,
I am getting error no 12 while tring to create the queue pair and when I reduced the size of max_send_wr,then there is no issues in creating the queue pair. Earlier I used the max device limit which I found by devattr->max_qp_wr. Is the issue is because of the reason that you mention above.

The maximum number of outstanding Work Requests that can be posted to the Send Queue in that Queue Pair. Value can be [0..dev_cap.max_qp_wr]. There may be RDMA devices that for specific transport types may support less outstanding Work Requests than the maximum reported value.

And if it reason then is there any other way by which I find out max sendq limit.

Thanks

- Dotan Barak says: June 25, 2015
  
  Hi.
  
  I have some questions to be able to answer:
  1) Under which user name are you working?
  2) What is the value of 'ulimit -l'?
  3) Which RDMA device are you using?
  
  thanks
  Dotan
  
gp says: June 25, 2015

Hi,
ulimit -l is unlimited.

$ ibv_devinfo -v
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.7.710
node_guid: f04d:a290:9779:10e0
sys_image_guid: f04d:a290:9779:10e3
vendor_id: 0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id: DEL08F0120009
phys_port_cnt: 2
max_mr_size: 0xffffffffffffffff
page_size_cap: 0xfffffe00
max_qp: 65456
max_qp_wr: 16384

For the recv queue it is allowing 16384 but for the sendq max 16351 allowed.

Thanks,

- Dotan Barak says: June 26, 2015
  
  I added the following note to the Q&A at this post:
  
  The value in dev_cap.max_sge and dev_cap.max_qp_wr reports the maximum supported values of scatter/gather entries and Work Requests that are supported by any Work Queue (Send and Receive). However, for a specific RDMA device, there may be other considerations for the Send or Receive Queue that prevent a QP to be created with those maximum values. Using trial and error one should get the right attributes for this specific RDMA device.
  
  This answers your question...
  
  Thanks
  Dotan
  
gp says: June 27, 2015

Hi,
thanks for the reply, but it's applicable only if we always use same machine. Actually in my case software has to run on client side and there we can't try trail and error method. So if there any alternate way for eg. max sendq depth must be guaranteed to be half of the max_qp_wr.

- Dotan Barak says: June 27, 2015
  
  Sorry, I can't give you a recipe that will work for all the RDMA devices;
  Try using the reported values minus DELTA, and increase the delta that will fit all the devices that you are working with...
  
  Thanks
  Dotan
  
  You can
  
DjvuLee says: July 17, 2015

Hi, Dotan:
As the QP and CQ will consumer the resource in the RNIC, if the dev_cap.max_qp_wr = 1024, what's the meaning for max_send_wr/max_recv_wr in one connection? Does all the connections share the same dev_cap.max_qp_wr? more connections(one qp per connection) means the max_send_wr will be smaller, or no matter how many queue pairs. their max_send_wr can be as large as dev_cap.max_qp_wr?

- Dotan Barak says: July 17, 2015
  
  Hi.
  
  Every QP can have a Work Queue whose depth is at most max_qp_wr.
  How the RDMA device handles it depends on its internal implementation.
  
  For example, if there are 100 QPs which each of them has 1024 WRs,
  and the application, in a magical way, posts 1024 SRs to every one of them;
  how those SRs will be processed depends on the RDMA device.
  
  You only mentioned the number of message aspect, but if one QP sends
  several 2GB messages and the rest send only 1B message?
  
  Bottom line: the scheduling policy of the Send Requests processing is an internal attribute of the RDMA device.
  
  Thanks
  Dotan
  
  - DjvuLee says: July 17, 2015
    
    Thanks ! so this means that the post 1024 SRs can be blocked or go into a error state?
    
    Does this also apply to the Complete Queue?
  - Dotan Barak says: July 17, 2015
    
    If one posts 1024 unsignaled Send Requests (i.e. fills the Send Queue), no more Send Requests can be posted to this Send Queue.
    
    Theoretically, nothing should happen to the Completion Queue.
    However, I may think about implementations that may add Work Completions to the Completion Queue.
    
    One should avoid getting into this state, since there isn't any indication of when the processing of the Send Requests has ended.
    
    Thanks
    Dotan
  - DjvuLee says: July 18, 2015
    
    Thanks for your answer!
    
    I have occurred a very strange problem. I have a cluster of 6 machine, each one may using RDMA Read to read data from other machines at the same time. I find RDMA READ would cost about 1s to fetch data which size is about 10M. My environment is 10Gb, and using ROCE, the PFC is configure correctly using priority 3.
    
    This is too long for a RDMA READ, but I can not do nothing, because all this is done by RNIC. my max_send_wr && max_recv_wr is 200, max_send_sge && max_recv_sge is 4. Is there any wrong for my configuration?
  - Dotan Barak says: July 22, 2015
    
    Hi.
    
    Did you execute the RDMA Read/Write benchmarks on your setup?
    This will help you understand if you have a configuration problem.
    
    Maybe there is a retry and this causes a delay.
    Which values did you configure for the retry in the QP?
    (I mean the 'timeout' attribute).
    
    Thanks
    Dotan
DjvuLee says: July 23, 2015

Thanks!

The retry is 6, maybe I can set up this lower to demonstrate whether caused by retry?

I do not sure initiator_depth and responder_resources have a effect on the RDMA read, because I set both of them as 2.

- Dotan Barak says: July 24, 2015
  
  Hi.
  
  1) I would set the retry_count value to 1, to see if there are errors.
  2) If you want more parallel RDMA Reads to be initiated,
  you need to increase the initiator_depth (and make the responder_resources be able to accept this value).
  
  Thanks
  Dotan
  
  - DjvuLee says: July 24, 2015
    
    yes. I use the async mode, there are some parallel RDMA READs.
    
    I wonder whether there is some detail data about how the value of initial_depth & responder_resources impact the parallel RDMA Read? because I find many resource, and few of them take about this two values, the max_send_wr/max_recv_wr is talked more.
  - Dotan Barak says: July 27, 2015
    
    Hi.
    
    I don't know if there is any detailed information on this; try the InfiniBand spec.
    
    Anyway, those values specify the number of in-flight RDMA Read/Atomic operations that a QP can handle in parallel as requestor/responder.
    
    Thanks
    Dotan
DjvuLee says: July 28, 2015

Thanks, Dotan, I will try to find out.

Novice says: September 2, 2015

Hi Dotan,

I have a question. When we create a qp using the RDMA verbs, are the queues physically created in the RNIC memory (RNIC DDR) or the HOST system memory? Thanks in advance for your help!

- Dotan Barak says: September 11, 2015
  
  Hi and welcome to the RDMA world
  :)
  
  When calling RDMA verbs, actual HW resources are created.
  Their location (RNIC memory or Host memory) depends on the RDMA device technology (device specific):
  * Some of them will create the resources in the attached memory (if such exists)
  * Some of them will create the resources in the host memory
  
  You should ask your HW vendor how his device behaves (in this aspect).
  
  Thanks
  Dotan
  
Mandrake says: November 24, 2015

Hi!

Firstly, I'd like to thank you for this great overview. When I try to create a *lot* of QPs (>100k), I run into ENOMEM errors. Is there some way to assign more memory to the HCA or is there also some HW limitation to the number of available QPs?

- Dotan Barak says: November 27, 2015
  
  Hi Mandrake.
  
  I can't answer without knowing which HW you are using.
  In general, RDMA supports up to 16M QPs (since there are 24 bits for QP numbers).
  
  Possible solutions/ideas:
  * Maybe you need to load the driver with different parameters to allow support for many QPs
  * Maybe the problem is lack of memory in your host
  
  Thanks
  Dotan
  
  - Mandrake says: December 14, 2015
    
    Hi Dotan. Thanks a lot for your answer. We are using Mellanox Connect-IB cards running with the mlx-5 driver. I could not find any module options to the kernel module. Host memory should be no problem as the machines have 32GB of which only 4 are in use.
    
    May I ask you where the 24 bit for QP numbers are specified? I have a hard time finding any reliable information about the hardware. Even identifying the exact HCA seems to be non trivial as "Connect-IB" and "MT27600" seem to refer to a variety of cards.
  - Dotan Barak says: December 18, 2015
    
    Hi.
    
    The 24 bits are coming from the RDMA spec headers, for example: look at the BTH, it has 24 bits for encoding the destination QP number.
    Identifying a PCI device should be easy, using the PCI.ids repository: https://pci-ids.ucw.cz/
    
    Thanks
    Dotan
Mark says: January 15, 2016

Please help me with this. While I use ib_create_qp it is giving me "Invalid argument error". The code which I am using is in this Stack Overflow page: http://stackoverflow.com/questions/34788781/cannot-create-queue-pair-with-ib-create-qp . All other functions like create CQ works fine.

- Dotan Barak says: January 29, 2016
  
  Hi Mark.
  
  Which device are you using?
  
  Thanks
  Dotan
  
David R. says: March 6, 2017

Hi Dotan,

Is there any way to know when one side of a queue pair goes down without having to constantly send "keep alive" messages? For example, if I have client and server applications running and the server crashes, is there any way for the client to know that the remote side of the connection is down before trying to send to it? I guess I'm looking for something similar to a TCP RST that could be used to automatically re-establish a connection, perhaps at the subnet manager level. Any advice would be greatly appreciated!

Thanks,
David

- Dotan Barak says: August 1, 2017
  
  Hi.
  
  If you are using the QPs directly (i.e. without CM), then the answer is: No.
  
  If you are using CM for connecting and managing the QP connection,
  you should get an event when the remote QP goes down.
  
  Thanks
  Dotan
  
Param says: June 28, 2017

Hi Danton,

What is the total number of outstanding RDMA Read/Write Requests that can be performed simultaneously. I have ConnectX4 card and find that there is a problem when I go to a queue depth of more than 64. Is there any limit.

Thanking You,
Param.

- Dotan Barak says: July 2, 2017
  
  Hi.
  
  RDMA Write messages don't require any special resources, but RDMA Read do, so:
  * The total number of outstanding RDMA Write messages is limited in the requestor: HCA_CAP.max_qp_wr
  * The total number of outstanding RDMA Read messages is limited in the requestor: HCA_CAP.max_qp_init_rd_atom
  * The total number of outstanding RDMA Read messages is limited in the responder: HCA_CAP.max_qp_rd_atom
  
  Thanks
  Dotan
  
Dawood says: August 13, 2018

Hi Dotan, is there a way to attribute a queue pair to a specific traffic class?

- Dotan Barak says: August 24, 2018
  
  Hi.
  
  What do you mean by "traffic class"?
  This information exists in the IPv6/GRH header;
  Do you refer to Infiniband or RoCE?
  
  Thanks
  Dotan
  
Dawood says: August 22, 2018

Hi Dotan,

If I send several IBV_WR_RDMA_WRITE then I send a IBV_WR_RDMA_WRITE_WITH_IMM, when the completion of the IBV_WR_RDMA_WRITE_WITH_IMM appears in the destination server, is there guarantee that all the previously sent IBV_WR_RDMA_WRITE were written to the RAM of the destination and hence completed?

- Dotan Barak says: August 24, 2018
  
  Hi.
  
  This is a great question, but I don't consider myself an expert in this area;
  however, let's try to analyze it.
  
  Let's assume that all the messages are sent from one QP to a destination QP:
  All the RDMA Writes are accepted by the destination QP and now,
  the last RDMA Write is accepted as well, and a Completion is generated.
  
  So, all the previous + the last incoming RDMA write content is being DMA'ed to the RAM
  and only then the information that there is a Completion is DMA'ed to the Completion Queue memory.
  
  The big question is: "Was the content from the DMA write messages actually written before the DMA of the Completion?".
  I have a feeling that the answer is "it depends on the runtime memory ordering".
  
  However, since when one polls for Completion, AFAIK there is a read barrier so I would expect all the DMA operations to be finished,
  and the RAM should contain the memory from the incoming messages. And then the knowledge that there is a Work Completion should be available.
  
  As I said, I'm not a memory or PCI sub-system expert, but those are my 2 cents.
  Thanks
  Dotan
  
Dwood says: August 24, 2018

Yes, the GRH has a field referred to as traffic class. Mellanox defines it as "Traffic class (or class of service) is a group of all flows that receive the same service characteristics (e.g. buffer size, scheduling). It is possible that some flow with different priorities will be mapped to the same traffic class." link:https://community.mellanox.com/docs/DOC-2022
I am interested in RoCEv1. Is it only useful in switches, or we can also divide traffic between 2 HCA using the traffic class (by associating a qp with a specific traffic class I suppose)?

- Dotan Barak says: September 7, 2018
  
  Hi.
  
  As a former employee of Mellanox, I don't want to respond to things that it publishes;
  However, AFAIK the traffic class should hint at the expected type of service,
  and in theory this can affect how the packet is handled in switches, routers and adapters.
  Those components, if this is supported, can have different buffers to different type of service
  thus provide different handling to different type of class.
  
  In InfiniBand there is an SL/VL mechanism for this; I believe that this is the mechanism for IPv4/6 packets.
  
  Thanks
  Dotan
  
TomS says: February 7, 2019

Hi Dotan,

I found this comment interesting and somewhat related to my particular problem.

I am interfacing a CX5 ASIC to a FPGA. I need to create the memory region and queue pairs in FPGA physical memory, which I can map into Linux space. I see I can use PA-MR for the data buffer, but how do I create a QP in physical memory?

- Dotan Barak says: February 11, 2019
  
  Hi.
  
  This is a vendor-specific question; so I won't answer your specific question. I'll give a more general answer.
  
  The data buffers can be used from all over the system (even from memory that is attached to any PCI card); as long as they can be mapped by Linux.
  To create a QP, one needs to register this QP number and provide memory for the Work Queues.
  
  AFAIK, the caller cannot control the origin of the Work Queue memory; this should be supported by the low-level driver.
  
  I hope that I answered.
  
  Thanks
  Dotan
  
ab says: October 16, 2019

Hi Dotan,
When we create a QP using the RDMA verbs,do we get base address of the QP somehow. My question is related to how the hardware knows where to read the posted WR from that QP.

- Dotan Barak says: October 19, 2019
  
  Hi.
  
  When one creates RDMA resources (CQ, QP, SRQ, etc.),
  the internal buffers of those resources aren't (easily) exposed to the user.
  
  Thanks
  Dotan
  
alex says: August 4, 2021

Hi Dotan,

What could cause the failure to create qp

- Dotan Barak says: October 24, 2021
  
  Hi.
  
  QP creation can fail if there aren't enough resources (no more QPs, or memory in the host).
  or
  Bad configuration of the ulimit (number of locked memory pages).
  
  Thanks
  Dotan
  