# ibv_post_send()

```
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);
```

# Description

ibv_post_send() posts a linked list of Work Requests (WRs) to the Send Queue of a Queue Pair (QP). ibv_post_send() goes over all of the entries in the linked list, one by one, checks that each is valid, generates a HW-specific Send Request out of it and adds it to the tail of the QP's Send Queue, without performing any context switch. The RDMA device will handle it (later) in an asynchronous way. If there is a failure in one of the WRs, because the Send Queue is full or one of the attributes in the WR is bad, it stops immediately and returns a pointer to that WR. The QP will handle the Work Requests in the Send Queue according to the following rules:

• If the QP is in RESET, INIT or RTR state, an immediate error should be returned. However, there may be some low-level drivers that won't follow this rule (to eliminate an extra check in the data path, thus providing better performance), and posting Send Requests in one or all of those states may be silently ignored.
• If the QP is in RTS state, Send Requests can be posted and they will be processed.
• If the QP is in SQE or ERROR state, Send Requests can be posted and they will be completed with error.
• If the QP is in SQD state, Send Requests can be posted, but they won't be processed.

The struct ibv_send_wr describes a Work Request to the Send Queue of the QP, i.e. a Send Request (SR).

```
struct ibv_send_wr {
	uint64_t                wr_id;
	struct ibv_send_wr     *next;
	struct ibv_sge         *sg_list;
	int                     num_sge;
	enum ibv_wr_opcode      opcode;
	int                     send_flags;
	uint32_t                imm_data;
	union {
		struct {
			uint64_t        remote_addr;
			uint32_t        rkey;
		} rdma;
		struct {
			uint64_t        remote_addr;
			uint64_t        compare_add;
			uint64_t        swap;
			uint32_t        rkey;
		} atomic;
		struct {
			struct ibv_ah  *ah;
			uint32_t        remote_qpn;
			uint32_t        remote_qkey;
		} ud;
	} wr;
};
```

Here is the full description of struct ibv_send_wr:

**wr_id** - A 64-bit value associated with this WR. If a Work Completion is generated when this Work Request ends, it will contain this value.

**next** - Pointer to the next WR in the linked list. NULL indicates that this is the last WR.

**sg_list** - Scatter/gather array, as described in the table below. It specifies the buffers that will be read from, or the buffers where data will be written to, depending on the used opcode. The entries in the list can specify memory blocks that were registered by different Memory Regions. The message size is the sum of the lengths of all of the memory buffers in the scatter/gather list.

**num_sge** - Size of the sg_list array. This number can be less than or equal to the number of scatter/gather entries that the Queue Pair was created to support in the Send Queue (qp_init_attr.cap.max_send_sge). If this size is 0, the message size is 0.

**opcode** - The operation that this WR will perform. This value controls the way that data is sent, the direction of the data flow and the used attributes in the WR. It can be one of the following enumerated values:

• IBV_WR_SEND - The content of the local memory buffers specified in sg_list is sent to the remote QP. The sender doesn't know where the data will be written in the remote node. A Receive Request will be consumed from the head of the remote QP's Receive Queue and the sent data will be written to the memory buffers which are specified in that Receive Request. The message size can be [0, $2^{31}$] for RC and UC QPs and [0, path MTU] for a UD QP.
• IBV_WR_SEND_WITH_IMM - Same as IBV_WR_SEND, but immediate data is sent in the message. This value will be available in the Work Completion that is generated for the consumed Receive Request in the remote QP.
• IBV_WR_RDMA_WRITE - The content of the local memory buffers specified in sg_list is sent and written to a contiguous block of memory range in the remote QP's virtual address space. This doesn't necessarily mean that the remote memory is physically contiguous. No Receive Request is consumed in the remote QP. The message size can be [0, $2^{31}$].
• IBV_WR_RDMA_WRITE_WITH_IMM - Same as IBV_WR_RDMA_WRITE, but a Receive Request is consumed from the head of the remote QP's Receive Queue and immediate data is sent in the message. This value will be available in the Work Completion that is generated for the consumed Receive Request in the remote QP.
• IBV_WR_RDMA_READ - Data is read from a contiguous block of memory range in the remote QP's virtual address space and written to the local memory buffers specified in sg_list. No Receive Request is consumed in the remote QP. The message size can be [0, $2^{31}$].
• IBV_WR_ATOMIC_FETCH_AND_ADD - A 64-bit value in the remote QP's virtual address space is read, wr.atomic.compare_add is added to it, and the result is written back to the same memory address, atomically. No Receive Request is consumed in the remote QP. The original data, before the add operation, is written to the local memory buffers specified in sg_list.
• IBV_WR_ATOMIC_CMP_AND_SWP - A 64-bit value in the remote QP's virtual address space is read and compared with wr.atomic.compare_add; if they are equal, the value wr.atomic.swap is written to the same memory address, atomically. No Receive Request is consumed in the remote QP. The original data, before the compare operation, is written to the local memory buffers specified in sg_list.

**send_flags** - Describes the properties of the WR. It is either 0 or the bitwise OR of one or more of the following flags:

• IBV_SEND_FENCE - Set the fence indicator for this WR. The processing of this WR will be blocked until all previously posted RDMA Read and Atomic WRs have been completed. Valid only for QPs with Transport Service Type IBV_QPT_RC.
• IBV_SEND_SIGNALED - Set the completion notification indicator for this WR. If the QP was created with sq_sig_all=0, a Work Completion will be generated when the processing of this WR ends. If the QP was created with sq_sig_all=1, this flag has no effect.
• IBV_SEND_SOLICITED - Set the solicited event indicator for this WR. When the message of this WR ends in the remote QP, a solicited event will be created for it, and if on the remote side a user is waiting for a solicited event, it will be woken up. Relevant only for the Send and RDMA Write with immediate opcodes.
• IBV_SEND_INLINE - The memory buffers specified in sg_list will be placed inline in the Send Request. This means that the low-level driver (i.e. the CPU) will read the data and not the RDMA device. The L_Key won't be checked; in fact, those memory buffers don't even have to be registered, and they can be reused immediately after ibv_post_send() returns. Valid only for the Send and RDMA Write opcodes.

**imm_data** - (optional) A 32-bit value, in network byte order, for the SEND or RDMA WRITE with immediate opcodes, that is sent along with the payload to the remote side and placed in a Receive Work Completion, and not in a remote memory buffer.

**wr.rdma.remote_addr** - Start address of the remote memory block to access (read or write, depending on the opcode). Relevant only for the RDMA WRITE (with immediate) and RDMA READ opcodes.

**wr.rdma.rkey** - r_key of the Memory Region that is being accessed at the remote side. Relevant only for the RDMA WRITE (with immediate) and RDMA READ opcodes.

**wr.atomic.remote_addr** - Start address of the remote memory block to access. Relevant only for atomic operations.

**wr.atomic.compare_add** - For Fetch and Add: the value that will be added to the content of the remote address. For Compare and Swap: the value to be compared with the content of the remote address. Relevant only for atomic operations.

**wr.atomic.swap** - Relevant only for Compare and Swap: the value to be written to the remote address if the value that was read equals the value in wr.atomic.compare_add. Relevant only for atomic operations.

**wr.atomic.rkey** - r_key of the Memory Region that is being accessed at the remote side. Relevant only for atomic operations.

**wr.ud.ah** - Address Handle (AH) that describes how to send the packet. This AH must remain valid until all posted Work Requests that use it are no longer considered outstanding. Relevant only for UD QPs.

**wr.ud.remote_qpn** - QP number of the destination QP. The value 0xFFFFFF indicates that this is a message to a multicast group. Relevant only for UD QPs.

**wr.ud.remote_qkey** - Q_Key value of the remote QP. Relevant only for UD QPs.

The following table describes the supported opcodes for each QP Transport Service Type:

| Opcode | UD | UC | RC |
|---|---|---|---|
| IBV_WR_SEND | X | X | X |
| IBV_WR_SEND_WITH_IMM | X | X | X |
| IBV_WR_RDMA_WRITE | | X | X |
| IBV_WR_RDMA_WRITE_WITH_IMM | | X | X |
| IBV_WR_RDMA_READ | | | X |
| IBV_WR_ATOMIC_FETCH_AND_ADD | | | X |
| IBV_WR_ATOMIC_CMP_AND_SWP | | | X |

struct ibv_sge describes a scatter/gather entry. The memory buffer that this entry describes must remain registered until any posted Work Request that uses it is no longer considered outstanding. The order in which the RDMA device accesses the memory in a scatter/gather list isn't defined. This means that if some of the entries overlap the same memory address, the content of this address is undefined.

```
struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};
```

Here is the full description of struct ibv_sge:

**addr** - The address of the buffer to read from or write to.

**length** - The length of the buffer in bytes. The value 0 is a special value and is equal to $2^{31}$ bytes (and not zero bytes, as one might imagine).

**lkey** - The Local key of the Memory Region that this memory buffer was registered with.

Sending inline data is an implementation extension that isn't defined in any RDMA specification: it allows placing the data itself in the Work Request (instead of in the scatter/gather entries) that is posted to the RDMA device. The memory that holds this message doesn't have to be registered. There isn't any verb that specifies the maximum message size that can be sent inline in a QP, and only some of the RDMA devices support it. In some RDMA devices, creating a QP will set the value of max_inline_data to the size of messages that can be sent inline using the requested number of scatter/gather elements of the Send Queue. In others, one should explicitly specify the message size to be sent inline before the creation of a QP; for those devices, it is advised to try to create the QP with the required message size and keep decreasing it if the QP creation fails. While a WR is considered outstanding:

• If the WR sends data, the content of the local memory buffers shouldn't be changed, since one doesn't know when the RDMA device will stop reading from it (one exception is inline data)
• If the WR reads data, the content of the local memory buffers shouldn't be read, since one doesn't know when the RDMA device will stop writing new content to it
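The "create and decrease on failure" advice above can be sketched as a small retry loop. This is a generic sketch under stated assumptions: try_create() is a hypothetical callback that would wrap ibv_create_qp() with the candidate size placed in qp_init_attr.cap.max_inline_data, and halving the size is just one possible step policy:

```c
#include <stdint.h>
#include <stddef.h>

/* Try 'wanted' first; on failure keep halving the requested inline size
 * until the creation callback succeeds or 0 has been tried.
 * try_create() is expected to return the created QP (or NULL on failure);
 * its exact shape is an assumption of this sketch. */
static void *create_with_inline_retry(uint32_t wanted,
                                      void *(*try_create)(uint32_t size, void *ctx),
                                      void *ctx)
{
	for (;;) {
		void *qp = try_create(wanted, ctx);
		if (qp)
			return qp;   /* creation succeeded with this inline size */
		if (wanted == 0)
			return NULL; /* even max_inline_data = 0 failed */
		wanted /= 2;         /* retry with a smaller inline size */
	}
}
```

After a successful creation, the actual supported value can be read back from qp_init_attr.cap.max_inline_data, which ibv_create_qp() updates.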

# Parameters

| Name | Direction | Description |
|---|---|---|
| qp | in | Queue Pair that was returned from ibv_create_qp() |
| wr | in | Linked list of Work Requests to be posted to the Send Queue of the Queue Pair |
| bad_wr | out | A pointer that will be filled with the address of the first Work Request whose processing failed |

# Return Values

| Value | Description |
|---|---|
| 0 | On success |
| errno | On failure; no change is done to the QP, and bad_wr points to the SR that failed to be posted |
| EINVAL | Invalid value provided in wr |
| ENOMEM | Send Queue is full or not enough resources to complete this operation |
| EFAULT | Invalid value provided in qp |

# Examples

1) Posting a WR with the Send operation to a UC or RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

2) Posting a WR with the Send with immediate operation to a UD QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_SEND_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data = htonl(0x1234);
wr.wr.ud.ah = ah;
wr.wr.ud.remote_qpn = remote_qpn;
wr.wr.ud.remote_qkey = 0x11111111;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

3) Posting a WR with an RDMA Write operation to a UC or RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

4) Posting a WR with an RDMA Write with immediate operation to a UC or RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data = htonl(0x1234);
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

5) Posting a WR with an RDMA Read operation to an RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_READ;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

6) Posting a WR with a Compare and Swap operation to an RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_ATOMIC_CMP_AND_SWP;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey = remote_key;
wr.wr.atomic.compare_add = 0ULL; /* expected value in remote address */
wr.wr.atomic.swap = 1ULL; /* the value that remote address will be assigned to */

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

7) Posting a WR with a Fetch and Add operation to an RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey = remote_key;
wr.wr.atomic.compare_add = 1ULL; /* value to be added to the remote address content */

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

8) Posting a WR with the Send operation, with zero bytes, to a UC or RC QP:

```
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = NULL;
wr.num_sge = 0;
wr.opcode = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

# FAQs

#### Does ibv_post_send() cause a context switch?

No. Posting an SR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

#### How many WRs can I post?

There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created.

#### Can I know how many WRs are outstanding in a Work Queue?

No, you can't. You should keep track of the number of outstanding WRs according to the number of posted WRs and the number of Work Completions that you polled.
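A minimal sketch of that bookkeeping (the struct and function names are illustrative, not part of libibverbs):

```c
#include <stdint.h>

/* Application-side accounting of outstanding Send Requests:
 * count what is posted and subtract what ibv_poll_cq() returns. */
struct sq_accounting {
	uint32_t max_outstanding; /* qp_init_attr.cap.max_send_wr */
	uint32_t outstanding;     /* posted but not yet completed */
};

/* Returns 0 if there is room for one more Send Request, -1 otherwise. */
static int sq_reserve(struct sq_accounting *a)
{
	if (a->outstanding == a->max_outstanding)
		return -1;  /* Send Queue would overflow; don't post */
	a->outstanding++;   /* account for the WR about to be posted */
	return 0;
}

/* Call after ibv_poll_cq() returned 'num_polled' Work Completions. */
static void sq_on_completion(struct sq_accounting *a, uint32_t num_polled)
{
	a->outstanding -= num_polled;
}
```

Note that this only works if every tracked WR eventually produces a Work Completion, i.e. with sq_sig_all=1 or by counting only signaled WRs.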

#### Is the remote side aware of the fact that RDMA operations are being performed in its memory?

No, this is the idea of RDMA.

#### If the remote side isn't aware that RDMA operations are being performed in its memory, isn't this a security hole?

Actually, no. For several reasons:

• In order to allow incoming RDMA operations on a QP, the QP should be configured to enable remote operations
• In order to allow incoming RDMA access to an MR, the MR should be registered with those remote permissions enabled
• The remote side must know the r_key and the memory addresses in order to be able to access remote memory

#### What will happen if I deregister an MR that is used by an outstanding WR?

When processing a WR, if one of the MRs that are specified in the WR isn't valid, a Work Completion with error will be generated. The only exception to this is posting inline data.

#### What is the benefit from using IBV_SEND_INLINE?

Using inline data usually provides better performance (i.e. lower latency).

#### What is the difference between inline data and immediate data?

Using immediate data means that out-of-band data will be sent from the local QP to the remote QP: for a SEND opcode, this data will exist in the Work Completion; for an RDMA WRITE opcode, a WR will be consumed from the remote QP's Receive Queue. Inline data influences only the way that the RDMA device gets the data to send; the remote side isn't aware of the fact that this WR was sent inline.

#### I called ibv_post_send() and got a segmentation fault, what happened?

There may be several reasons for this to happen:
1) At least one of the sg_list entries is at an invalid address
2) In one of the posted SRs, IBV_SEND_INLINE is set in send_flags, but one of the buffers in sg_list points to an illegal address
3) The value of next points to an invalid address
4) An error occurred in one of the posted SRs (a bad value in the SR or a full Work Queue) and the variable bad_wr is NULL
5) A UD QP is used and wr.ud.ah points to an invalid address

#### Help, I've posted a Send Request and it wasn't completed with a corresponding Work Completion. What happened?

In order to debug this kind of problem, one should do the following:

• Verify that a Send Request was actually posted
• Wait enough time, maybe a Work Completion will eventually be generated
• Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
• Verify that the QP state is RTS
• If this is an RC QP, verify that the timeout value that was configured in ibv_modify_qp() isn't 0, since if a packet is dropped this may lead to an infinite timeout
• If this is an RC QP, verify that the combination of the timeout and retry_cnt values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RETRY_EXC_ERR is generated
• If this is an RC QP, verify that the rnr_retry value that was configured in ibv_modify_qp() isn't 7, since this may lead to infinite retries in case of an RNR flow
• If this is an RC QP, verify that the combination of the min_rnr_timer and rnr_retry values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RNR_RETRY_EXC_ERR is generated

#### How can I send a zero bytes message?

In order to send a zero-byte message, no matter what the opcode is, num_sge must be set to zero.

#### Can I (re)use the Send Request after ibv_post_send() returned?

Yes. This verb translates the Send Request from the libibverbs abstraction to a HW-specific Send Request, so you can (re)use both the Send Request and the s/g list within it.

## Comments

1. March 6, 2013

I have a question about whether a context switch occurs during an RDMA operation. Here (page 15) it is shown that a user space verbs call results in a call to the hardware-specific driver (e.g. mlx4). That "lives" in kernel space. So, does ibv_post_send() (RDMA mode) cause a context switch or not? Can you clarify this for me please.

Also, if ibv_post_send() never causes a context switch, then why is there an implementation of ibv_post_send() in the Linux kernel? When is this function (inside the kernel) called?

Thanks!

• March 6, 2013

This is a great question!

Every control operation (i.e. create/destroy/modify/query of any resource) will cause a context switch.
However, the data operations won't cause a context switch, and from the same context
one can post new Work Requests (either to the Send or Receive Queues).

In the example, you mentioned "mlx4"; creating a Queue Pair will perform a context switch and the following libraries/modules will be called, in order:
libibverbs -> libmlx4 -> libibverbs -> ib core -> mlx4

In order to post a Send Request, the following libraries/modules will be called, in order:
libibverbs -> libmlx4
i.e. no context switch will happen at all.

However, if there are devices (or low-level drivers) that don't support posting Send Requests without a context switch, libibverbs has prepared the infrastructure to allow posting the Work Requests at the kernel level.
Personally, I don't know about any device that uses those functions.

Thanks
Dotan

• March 6, 2013

Yes, you did! Thanks!

So in page 15 the Hardware Specific Driver (yellow box) might be the libmlx4 depending on the implementation (or it might be mlx4 linux kernel module in ./drivers/infiniband/hw/mlx4 otherwise). Am I right?

• March 6, 2013

The Hardware Specific Driver (yellow box) is the mlx4 kernel part (since this section describes the kernel space modules). The User level APIs (white box) is the libibverbs and libmlx4.
(Do you see the "kernel bypass" line? This means direct access to the HW without the need to perform a context switch).

• March 11, 2013

Yes, I see the "kernel bypass" line. But that seems like a contradiction: kernel bypass on the one hand, but libmlx4 calls something (the Hardware Specific Driver (mlx4 kernel module)) that "lives" inside the kernel (kernel context). Except if the author of the diagram means that the line goes to the InfiniBand HCA directly (firmware code). :P

Sorry for being persistent!

• March 12, 2013

It is o.k.
:)

The "kernel bypass" means that in the data path, your user level code will be able
to work directly with the HW (without performing a context switch).

Please remember that the kernel level must be involved in the control part in order to
sync the resources (between different processes/modules) and configure the HW since user
level application can't write directly in the device memory space (since this is a privileged operation).

In this slide, I can see that there are two lines:
1) The first line specifies kernel bypass (for the data path)
2) The second line specifies that the user level will call the OpenFabrics kernel-level verbs

If you enjoy this blog, please publish it to other people as well.

Thanks
Dotan

2. March 11, 2013

Hi Dotan,

I have few questions about ibv_post_send():

1. If I issue one large send request, will it (or can it) be served by multiple smaller receive buffers? Or can one send request never use multiple receive buffers?

2. When would I need to use IBV_SEND_SIGNALED and IBV_SEND_SOLICITED?

3. Can the Receive buffer be a gather list, with the HCA DMA-ing the received data to the appropriate gather elements?

• March 12, 2013

Hi Jay.

1) I assume that you mean that you send a big message over the wire;
at the receive side you can split this message into as many scatter elements
as you wish (this is a local attribute).

To summarize it:
When using RDMA operation(s): only one remote contiguous buffer can be used.
When using the Send operation: the receive side can use one or more scatter elements
(as long as the sum of the buffers is able to hold all of the message).

2) IBV_SEND_SIGNALED should be used if the QP was created with sq_sig_all=0
(which means that not all Send Requests will generate Work Completion when completed).

IBV_SEND_SOLICITED should be used when the remote side is reading the Work Completions
using events (and not in polling mode). Please check my post about ibv_req_notify_cq()
for more details.

3) Yes, this is exactly what the RDMA device will do on the Receive side,
when using the Send operation. Please keep in mind that those memory buffers should
be registered first.

If you enjoy this blog, please publish it to other people as well.

Thanks
Dotan

• March 13, 2013

Hi Dotan,

Please let me rephrase question #1: assume the receiver posts two Receive
Requests, each with n bytes worth of buffer, so the receiver has a total of
2n bytes of buffer available.
Now the sender issues one Send work request with a total of 2n+m bytes of data.
Will both receive buffers be consumed to satisfy the one Send work request?

When using RDMA operations you said one single contiguous buffer can be used.
Do you mean RDMA Write OR RDMA Read?

Jay

• March 13, 2013

Hi Jay.

The Receive Request works at the resolution of messages, not of bytes.

Every Receive Request will handle only one incoming message:
for each incoming message, one Receive Request will be fetched from the head
of the Receive Queue. The messages will be handled in the order of their arrival.

In your example there are 2 Receive Requests, each with n bytes:
* Receiving a message of n bytes or less is fine
* Receiving a message with more than n bytes will cause an error (since there isn't enough room to hold the message)

When working with RDMA operations:
* RDMA Write can read one or more local gather entries and write them to one remote contiguous block
* RDMA Read can read from one remote contiguous block and write it locally to one or more scatter entries

If you have more questions, you are more than welcome to ask..
:)

Dotan

3. March 19, 2013

When I send a 1024 byte block with the IBV_WR_RDMA_WRITE opcode, everything is OK, but if the block size is set larger (e.g. 4096 bytes), I get an IBV_WC_LOC_PROT_ERR error and then many IBV_WC_WR_FLUSH_ERR errors on the send CQ. Can you help me?

• March 19, 2013

Hi.

Please check the memory buffers in the gather list of the Send Request; I suspect that you are trying to access memory that wasn't registered.

Thanks
Dotan

4. March 25, 2013

ibv_post_send() returns -1, what is the problem? Thanks for your help

• March 25, 2013

Hi.

There can be several reasons:
* The Send Request has invalid value(s)
* The Send Queue is full

Not all of the low-level drivers return errno to indicate errors
(some of them returned -1 in the past and now return errno).

It depends on the library that you use and its version.

Thanks
Dotan

5. May 1, 2013

Hi Dotan, I'm running into a problem with ibv_post_send and hoping you can provide some guidance. I've adapted the rc_ping_pong program to exchange 312 byte messages among nodes in a 32-machine IB cluster, except that I use an epoll() based mechanism to call ibv_poll_cq(). Several messages later (around 58900 to be exact), ibv_post_send() fails returning ENOMEM and errno set to 2. Both sides of the connection are in good states: IBV_PORT_ACTIVE & IBV_QPS_RTS. When I keep track of sends posted vs sends completed I find that during the failure (posted-completed) = 31, always. However I have only max_send_wr=1 when I created the qp. So I'm not sure what's going on. On the receive side I guarantee posts (rx_depth=800 and whenever it drops to 400 I post 400 more). Any help is much appreciated, and if you need further clarifications please let me know.
Thanks much
Sara

• May 1, 2013

Hi Sara.

:)

If ibv_post_send() itself fails, that means that either:
The Send Queue is full (i.e. all of the Work Requests in the Send Queue are outstanding)
or
The posted Send Request is illegal:
* too many scatter/gather elements
* too much inline data (if inline data is used)
* wrong opcode

Please check if this helps you:
if you are sure that the Send Queue isn't full, dump the Send Request and check what I suggested above.

Thanks
Dotan

6. May 1, 2013

Thanks for the quick response Dotan.
I'm leaning towards full queue rather than illegal request because:
1. They've been going through fine for all the previous posts, and
2. I simply reuse circular buffers for subsequent sends
3. I inspected the wr (bad_wr points to it) during failure and it looks okay:
```
(gdb) p wr
$1 = {wr_id = 1, next = 0x0, sg_list = 0x7fcaca7fbcb8, num_sge = 1, opcode = IBV_WR_SEND, send_flags = 2, imm_data = 0, wr = {rdma = {remote_addr = 0, rkey = 0}, atomic = {remote_addr = 0,
      compare_add = 0, swap = 0, rkey = 0}, ud = {ah = 0x0, remote_qpn = 0, remote_qkey = 0}}}
(gdb) p *wr->sg_list
$11 = {addr = 49981952, length = 312, lkey = 175104}
```

I'm confused about two things though (if send queue full is the problem):

1. ibv_post_send() returns ENOMEM (and not -ENOMEM which is what the drivers seem to return when kmalloc fails or something similar)
2. errno=2 which is also weird, I'm unable to find out exactly who sets it & why

I've also tried running it through valgrind to check invalid memory and it looks clean.
Any pointers?

Thanks
Sara

• May 2, 2013

Hi Sara.

I'll try to help here:
1) User level libraries return positive errno values and not negative ones
(kernel level drivers return negative errno values)

2) I don't know where the errno=2 came from; libmlx4 almost never sets the errno value at all..

Did you poll all of the completions from the CQ?
Once you have the failure in the ibv_post_send(), did you try to empty the CQ and try to post the Send Request again?
(since the QP should still be in a good shape)

Thanks
Dotan

• May 3, 2013

Thanks, Dotan! Once I reach this point, all polls keep returning 0, and if I attempt to post more sends I run into the same issue. The other side is sitting idle doing an epoll_wait() with plenty of recvs posted. So it doesn't look like an easy problem to solve. I'll try a few more experiments & update (in case someone runs into similar issues later).
Sara

• May 4, 2013

This will be great, thanks!

Dotan

• May 7, 2013

Just wanted to update on this issue real quick. I restructured the code quite a bit to make it extensible and now I don't hit upon the issue anymore. So most likely some bad coding on my part - if I had more time to spare I'll explore in detail but unfortunately I'm on a deadline so don't have a clear answer :(

• May 7, 2013

Hi Sara.

I'm happy that you overcame the bug
:)

You are most welcome!
Dotan

7. June 28, 2013

Hi Dotan,

I'm receiving 'remote invalid request error' (IBV_WC_REM_INV_REQ_ERR) with RDMA_READ requests. I checked buffer sizes, access rights, and QP type and all seems fine to me. RDMA_WRITE works, and since the only difference is the opcode (as far as I know), I don't understand the issue.

BTW: I'm new to RDMA programming and your site really helps a lot!

Thanks so far.

• June 28, 2013

Hi Stefan.

Sharing the code will be great (it will allow me to review it and give feedback..)
:)

Assuming that you have both RDMA Read and RDMA Write code,
the delta between the RDMA Write and the RDMA Read support should be:
1) The QP type is IBV_QPT_RC
2) The values of max_rd_atomic/max_dest_rd_atomic aren't zero
(setting the value to one on both sides isn't efficient, but will do the trick)
3) Verify that the r_key is correct (although if it worked with RDMA Write, it should be valid)

I hope that I helped you.
If you enjoy this blog, please publish it to other people as well.

Thanks
Dotan

• July 1, 2013

Hi Dotan,

Thanks for the fast reply. I re-checked everything and found:

1) .qp_type of ibv_qp_init_attr is IBV_QPT_RC (OK)
2) The access mask was set by:

```
if (!(remote_mr = ibv_reg_mr(remote_pd, pmydata->recv_buffer, pmydata->max,
                             IBV_ACCESS_REMOTE_WRITE |
                             IBV_ACCESS_LOCAL_WRITE | ...))) {
    perror("ibv_reg_mr");
    return NULL;
}
```
which left the flags of the QP unchanged. I set them now by calling ibv_modify_qp(). The flags seem to be alright now, but the error remains.

3) Both communication partners have the same flags, for their QPs and MRs so this should be ok.

4) Both, max_rd_atomic and max_dest_rd_atomic are set to 1 by default here. I checked it and it should also be ok.

5) As you mention, since RDMA_WRITE works r_key,l_key, and remote_addr are ok. (I also re-checked that)

What seems strange is that ibv_modify_qp() raised an invalid argument error when I called it with IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MAX_QP_RD_ATOMIC to modify the values, but modifying access flags works fine.

Code actually is a mess but basically consists of these parts:

* rdma_create_event_channel() to create event channels
* rdma_create_id() to create rdma_cm_id's
* rdma_bind_addr() and rdma_listen() on the server side
* rdma_resolve_addr() and rdma_resolve_route() on the client side
* ibv_create_cq(), ibv_alloc_pd(), rdma_create_qp() and ibv_reg_mr() to setup CQ,PD and register MR
* Exchange Key and memory Address
* Message setup:
// current message size
sge.length = imyproblemsize;
sge.lkey = client_mr->lkey;

snd_wr.sg_list = &sge;
snd_wr.num_sge = 1;
snd_wr.send_flags = IBV_SEND_SIGNALED;
snd_wr.next = NULL;
snd_wr.wr.rdma.rkey = rKey;

* Start Work:
if (ibv_post_send(client_id->qp, &snd_wr, NULL)) {
perror("21 ibv_post_send");
return -21;
}
while (!ibv_poll_cq(client_cq, 1, &wc));
if (wc.status != IBV_WC_SUCCESS) {
printf("r0: wc.status: %s\n",ibv_wc_status_str(wc.status));
perror("22 ibv_poll_cq");
return -22;
}

The code is some kind of skeleton I wrote, and it originally covers send/receive, which works fine. Modifying it to work with RDMA_WRITE caused no problem, but RDMA_READ does.

Thanks a lot.

• July 4, 2013

Hi Stefan.

Can you call ibv_query_qp() when the QP should be in RTS state and verify that:
1) The QP state is RTS
2) The value of max_rd_atomic isn't zero
3) The value of max_dest_rd_atomic isn't zero

I suspect that the failing ibv_modify_qp() call is your problem
(make sure to use the right flags for each QP state transition).

Thanks
Dotan

8. October 9, 2013

Hello Dotan,

I'm measuring latency between two RDMA NICs with IBV_WR_SEND

If I send a work request with IBV_SEND_SIGNALED flag, so when I get
IBV_WC_SEND event, does it mean that the message was delivered and the remote machine sent an ack back? Should I consider this time as a roundtrip?

Thanks.

• October 10, 2013

Hi.

It depends on the used transport type:
* If this is reliable transport type (RC), when you get Work Completion in the sender side - this means that the message was written at the remote side (and an ACK was sent back)
* If this is unreliable transport type (UC/UD), when you get Work Completion in the sender side - this means that the message was sent through the local port (no ACK/NACK will be sent)

Thanks
Dotan

• October 10, 2013

Thanks a lot.
That's what I assumed.
I'm using RC. Just to make it clear, the following flow:

1. start timer
2. send message (IBV_WC_SEND)
3. wait for receive to complete (a send on the other side is posted only when the message arrived)
4. stop timer

it measures: 2 messages + ACK for the first send + (optional: ACK to the other side for the received message)

Thanks.
Boris.

• October 10, 2013

Exactly.

One tip though: if you care about latency, you should send the message inline'd
(if the message is small).

Thanks
Dotan

9. October 15, 2013

Hello Dotan

In ibv_post_send():
1. Are the ibv_send_wr list and its sg_list destroyed automatically when the operation completes?
2. Or can I destroy them after the call returns?
3. Or do they have to be kept alive until the Work Completion is received?

Boris.
Thanks.

• October 15, 2013

Hi Boris.

The sg_list array can safely be (re)used after ibv_post_send() ends:
the Send Request is enqueued to the Send Queue space of the Queue Pair
once it is posted.

Thanks
Dotan

10. November 11, 2013

Hi Dotan.

Is there any way to know, what is the max length of INLINE data can be sent in SEND or RDMA_WRITE ?

• November 11, 2013

Hi.

Unfortunately, struct ibv_device_attr doesn't contain any attribute that specifies the maximum inline data that can be sent.
When creating a QP, qp_init_attr.cap.max_inline_data is returned with the amount of inline data that can be sent in this QP.

Thanks
Dotan

11. November 22, 2013

Hi,
I'm new to RDMA and run into a weird behavior, which I was hoping you could clarify for me:

I'm using IBV_WR_SEND to send a struct-object which contains some information needed for an RDMA-read later on (rkeys, address and so on).
Now in principle this works fine, but the strange behavior is that it only works correctly if the object size is a power of 2. So I tried these cases:
sizeof(message) -> 16. This works
sizeof(message) -> 24. The last object-attribute is always wrong, the rest is correct.
sizeof(message) -> 32. This works again.

Is this normal? I have only seen restrictions about the minimum/maximum message size, but nothing that would hint at an additional restriction of this kind. Or did I do something wrong somewhere?

Thank you very much!
Martin

• November 22, 2013

Hi Martin.

I have a feeling that the problem isn't related to RDMA.
In RDMA the minimum message size can be even 0 bytes!

I have a feeling that the problem happens because of the way the compiler lays out the structure in memory.
In RDMA, as in any other networking protocol, the application needs to take care of how data is transferred between the two machines, since the machines may differ in:
* CPU arch (32/64) bits
* Big/little endian

I have two suggestions here:
1) You can send me the source code for review, and I'll give you feedback
2) You can give me more information on what went wrong (since you didn't provide this information)

Thanks
Dotan

• November 28, 2013

Hi,

Sorry for my late response, but I was busy the last week.

So, I have a struct containing: int rkey, int remote buffer size, long remote address
If I send this, everything is fine. But now suppose I add "int id" to the struct. No matter which attribute is specified last in the struct (let's say "int id" is now the last one), that attribute is not received correctly, but gives a wrong value. All other attributes of that struct are correct.

You are probably correct that this is due to some little/big endian problem.

Thank you very much!

Cheers,
Martin

• November 28, 2013

Hi Martin.

Do you want to share the code with me? This way I'll find your bug ...

Another way for you to handle it is to write (using sprintf()) the data to an array of characters,
and send this data as a string and not as a struct (and parse it on the remote side).

I hope that this tip helped you
Dotan

12. November 22, 2013

Hello Dotan,

I have the same problem as Stefan (I get IBV_WC_REM_INV_REQ_ERR with RDMA_READ requests). I tried to follow the advice you already posted here as much as possible, but I cannot sort it out myself.

I can send you a simple program which reproduces my problem, but I would need your email (and your agreement).

Best regards,

Philippe

• November 23, 2013

Hi Philippe.

If you want to share the code with me, and I'll give you a hint
on the reason of this problem, you can send it to:
support at rdmamojo dot com

Thanks
Dotan

13. November 23, 2013

Hi Dotan,

I have a question about P_KEYS in BTH header. Once a relation is established between two QP's, both ends can modify the qp attribute pkey_index. Can both ends use different pkey_index (and ultimately different pkeys) ? i.e A can say B is using Pkey=X and B can say A is using Pkey=Y.
Thanks,

Jay

• November 23, 2013

Hi Jay.

It doesn't matter which P_Key index each QP points to
(since what really matters is the P_Key value itself, and different tables
*may* have the same P_Key values but in a different order).

If at some point, the P_Key values of both QPs won't be consistent,
the packet will be dropped
(InfiniBand spec: Figure 81 Packet Header Validation Process)

In your example: if X.key != Y.key, there will be a P-Key mismatch and
the QPs won't be able to communicate (this is the whole idea of the P_Key..)

I hope that I helped you.

Thanks
Dotan

14. November 25, 2013

I am trying to use IBV_WR_ATOMIC_CMP_AND_SWP to check a remote value and proceed accordingly. I have registered a 64-bit integer using ibv_reg_mr() and sent this remote address to the sending host. But I am getting a remote access error. The sample code you have provided is not complete.
In the sample code you have used

sg.length = buf_size;
sg.lkey = mr->lkey;

Is buf_addr a 64-bit integer or a char buffer of size 8? Would it be possible for you to send complete code of a working compare-and-swap function?

• November 25, 2013

Hi Omar.

I'm sorry, but I don't have any source code that I can share with you...
(I plan to write it in the future though)

Please verify that:
1) The remote QP supports incoming Atomic operations
2) The remote MR supports incoming Atomic operations
3) The remote address is 8 bytes aligned

Thanks
Dotan

• June 3, 2016

Hi, I came across the same problem, and still cannot figure it out. I can successfully process send/recv operation(which means qpn, psn and lid of the remote side is correct), but I fail at RDMA write operation, receiving the IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error when I call ibv_poll_cq(). Any other comments besides the above three hints?? thanks in advance.

• June 8, 2016

Hi.

Did you read my post on ibv_poll_cq()?

Anyway, check that RDMA Write is enabled in both the remote QP.qp_access_flags and remote MR.access.

Thanks
Dotan

15. December 7, 2013

Hi Dotan,

I have a question about how WRs are finished. Suppose I have built a RC connection between two QPs. First on the receive side, I post two recv WRs, say recv_wr1 and recv_wr2. Then on the send side, I post two send WRs, say send_wr1, send_wr2. My question is, is there any possibility that send_wr2 finishes before send_wr1? What about the receive side? Is is possible that recv_wr2 is finished before recv_wr1?

Thanks,
Jiajun

• December 7, 2013

Hi Jiajun.

In terms of the Completion Queue of the Work Queues, you should see the Work Completions according to the order of the corresponding posted Send Requests.

In terms of the wire, this isn't an area I'm fully familiar with, BUT:
if you send a message, every packet increases the PSN (in the Send Queue and in the remote Receive Queue),
so send_wr2 cannot be sent before send_wr1 was sent; otherwise, it wouldn't be possible to detect missing packets (using the PSNs).

Anyway, you should (re)use the memory only after the relevant Work Request isn't outstanding any more.

I hope that this helped you.

Thanks
Dotan

16. January 29, 2014

Hi
My question might seem out of context for this post but it's important.
I have to ask you how to set up an all to all communication between a number of processes, some on same machine and some on different. What I have done is open a listening rdma_cm_id wait for incoming connection requests for each process and bind it to a specific port and create new rdma_cm_id when I have completed a connection request. This works fine if all processes are on different host machines, but if I start multiple processes on the same machine, I get a very slow performance or none at all, the system hangs as if in a deadlock. I had hoped that once I have a rdma_cm_id for each process than the processes should communicate without any problem. One thing is that I have only set up one communication channel but it should suffice for many clients (the man pages say this).
Regards
Omar

• January 29, 2014

Hi Omar.

I don't have a lot of experience with rdma_cm (yet?).

If you want a good answer, I suggest that you'll send this question to Sean Hefty,
the writer and maintainer of rdma_cm.

Sorry again..
Dotan

17. January 30, 2014

It seems that the descriptions of IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD are swapped.

• January 30, 2014

Fixed, thanks.

Dotan

18. February 15, 2014

Hi Dotan,

What circumstances can cause a Send Queue to overflow?

In my program I perform an RDMA Write in a loop (every time with the same source/destination addresses, just to test), and after a while I constantly get ENOMEM from ibv_post_send(). It doesn't seem to be a race, as it always happens after the same count of iterations, and even sleeping ~1 sec between iterations doesn't affect anything; besides, the number of successful iterations is correlated with the QP's max_send_wr. None of the WRs is "signaled" (I tried to poll the CQ at every iteration - it's empty).
I might be missing something basic in the QP configuration. What initialization parameter can cause such a behavior?

Thanks.

• February 16, 2014

When creating a QP, you specify how many WRs can be outstanding in the Send and Receive Queues.
A WR is considered outstanding until a Work Completion is generated for it or for a WR that was posted after it in that Work Queue.

You posted many WRs (in your case, to the Send Queue) and all of them are outstanding.
From time to time, you need to make them "signaled" and read the Work Completions.

Thanks
Dotan

• February 16, 2014

Oh, I see. This looks like a design flaw, doesn't it? At least, it's quite counter-intuitive behavior, as one would expect that an unsignaled WR gets removed from SQ silently as soon as it's processed - after all, that's the whole point of unsignaled WRs...

• February 16, 2014

But if you don't get any Work Completion, how can you avoid posting more WRs than the Work Queue size?
You *assume* that all the posted WRs were processed, in most cases it is true,
but there isn't any guarantee about it...

Thanks
Dotan

• February 16, 2014

Well, if one produces WRs faster than the HCA can consume, the SQ will eventually overflow, and in *such* situation ENOMEM would be quite logical (like in any producer-consumer scheme) - but still, implicitly treating obviously consumed WRs as outstanding doesn't seem to fit well in this logic. Sometimes the producer can know for sure that he can never overflow the queue (for instance due to retry count/timeout settings vs. timings of WRs), and such a behavior of the queue would surprise him.

• February 17, 2014

You're starting to get into the synchronization mechanism between the low-level driver and the HW...
Anyway, this is the behavior that the protocol defines.

Thanks
Dotan

• February 16, 2014

Hi Dotan, joining the question on this issue. Is there any way (or will be) to block on ibv_post_send (until there is place in the work queue)?

Otherwise, in a multithreaded application, some synchronization semaphore-like mechanism must be applied, and it could be very costly...

• February 17, 2014

Hi Boris.

Currently, there isn't any way to block in ibv_post_send() if the Work Queue is full.
This would require low-level library and API changes (to prevent breaking the current behavior).

Sorry
Dotan

19. February 18, 2014

Dotan,

You wrote regarding inline data that "the low-level driver (i.e. CPU) will read the data and not the RDMA device". Is this correct for both sides? I.e., on the responding side, will the HCA perform DMA for the inlined data, or will the CPU handle it?

Thanks a lot for your assistance.

• February 18, 2014

This is relevant only for the local side, i.e. the side that fetches the data.
On the remote side there isn't any hint that inline was used once the data is sent over the wire.

Thanks
Dotan

20. February 19, 2014

Hi Dotan,

Is there a more straightforward and efficient way to write a value atomically to the remote side than performing an RDMA Read followed by an atomic CAS? (There are no stores to this location on the remote side, only loads, but the value must appear consistently/atomically.)

Thanks.

• February 20, 2014

Hi Igor.

The only supported atomic operations in RDMA are:

* Compare and Swap
* Fetch and Add

I don't know what you are trying to achieve, but using them you can implement
mutual exclusion primitives.

What about sending a message using "Send" and increment the value locally using a good old mutex/semaphore/spinlock?

Thanks
Dotan

• February 20, 2014

Due to some constraints I can't use the send/receive flow...
What is the level of atomicity of a regular RDMA Write? I.e., does the remote HCA store bytes or words to its local memory?

• February 20, 2014

I'm sorry, but I can't provide a good answer here.
RDMA supports sending a stream of bytes, and AFAIK there isn't any guarantee about atomic access of more than one byte.

Multiple tests may show you that atomicity of words (or more) is achieved, but there may be scenarios where this won't be the case...

Dotan

21. March 17, 2014

Hi Dotan,
Great website. Thanks for all the work.
Question about posting WRs: if I post a WR to a WQ, does a copy of the WR get made, so that after ibv_post_send() completes I am free to overwrite that WR for my own purposes? Or is just a pointer to that WR posted to the WQ, so that I have to keep it intact until the completion occurs? I tried to find the internal representation of the WQs to see if I could deduce the answer myself, but no luck.

• March 18, 2014

Thanks
:)

Long answer: the low-level driver translates the Work Request structure from the verbs API to the HW API
and posts this HW-specific WR to the relevant Work Queue.

After the verb that posts the WR returns, you are free to change this WR structure.

If you want to see how this is done, you need to check the code of the low-level drivers...

Thanks
Dotan

• March 31, 2014

Hi Dotan,
Your site is a huge help!
Regarding reuse of WR, are the ibv_sge elements copied as well?
From my reading of the code they are copied but can i reuse them when ibv_post_send returns?
Also is there a restriction on multiple WR with the same wr_id?
For example can the same id be used to identify a chain of WR posted together?
Thanks!

• March 31, 2014

Thanks!

Yes. The s/g list is copied to the QP's Send Queue, and the entries can be reused.

About the wr_id; it is a user defined private data and can contain any value that you wish..
(including multiple WRs with the same wr_id).

Sure
Dotan

22. May 29, 2014

Hi

I was wondering: Is there a performance difference between IBV_WR_RDMA_WRITE(_WITH_IMM) and IBV_WR_SEND(_WITH_IMM) ?

thx
Bernard

• May 30, 2014

Hi Bernard.

(See: "Tips and tricks to optimize your RDMA code")

Yes, there is a performance difference, so one should prefer using RDMA Write with immediate instead of Send with immediate.

RDMA Read is considered more "expensive" than RDMA Write or Send operations, so one should prefer the latter operations.

I hope that I helped
Dotan

23. August 11, 2014

Hi Dotan,

This is a fantastic website for RDMA learners! I have a question regarding the atomic operations. That is, how are the RDMA atomic operations (FetchAdd & CmpSwap) implemented? I guess there should be a locking mechanism that comes into play once the atomic operations are performed on some memory buffer. Is the lock implemented on the network (RNIC?), on the specific memory buffer, on the memory bus, or somewhere else?

Henry

• August 12, 2014

Hi Henry.

Thanks for the compliment.
:)

The atomic operations are atomic only relative to other atomic operations, and not to any other operation or any other memory access.

I don't *know* the internal implementation, but I can guess;
it depends on the supported atomicity level of the RDMA device:
* If it supports atomicity within the device - it may have an internal mechanism to prevent other atomic access to this memory
* If it supports atomicity between other devices - I guess that it will lock the bus or something like that.

AFAIK, atomicity is currently supported only within the device.

I hope that this answer helped you.

Thanks
Dotan

• August 13, 2014

Hi Dotan,

> The atomic operations are atomic related to other atomic operations and not to any other operation or any other memory access.

Do you mean that if one modifies a remote value with eg. IBV_WR_ATOMIC_FETCH_AND_ADD, this modification will *not* appear as atomic for any other software (eg. running locally on that machine) that attempts to read this memory location?

• August 15, 2014

Hi Igor.

Here is the exact quote from the InfiniBand specifications:
"o9-17: Atomicity of the read/modify/write on the responder’s node by the
ATOMIC Operation shall be assured in the presence of concurrent atomic
accesses by other QPs on the same CA."

It specifies how the RDMA device will handle the content of the memory and doesn't really mention other interfaces (such as software). For example, it *may* do the following: read, modify, write - and perform the write 10 seconds after the read happened. During this time, the RDMA device will prevent any access to this memory by other atomic operations. The (local) software isn't really aware of the operations that are done by the RDMA device...

Thanks
Dotan

24. October 13, 2014

Hi Dotan

I use ibv_post_send() to do an RDMA Write, and I found that if num_sge is 4, it returns -1; if num_sge is 2 or 1, it works fine (the buffer is 4 KB each).

How can I make it send 4(or more) num_sge buffers?

Thanks.

Zhang Yue

• October 13, 2014

Hi Zhang Yue.

Can you send the output of:
ibv_devinfo | grep max_sge

Thanks
Dotan

• October 14, 2014

hi Dotan,

The command output is these:

root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo | grep max_sge
root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:0028:0820
sys_image_guid: f452:1403:0028:0823
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 3
port_lid: 4
port_lmc: 0x00

port: 2
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2
port_lmc: 0x00

root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target#

• October 14, 2014

hi Dotan

I found that the queue pair config limits it:
qp_init_attr.cap.max_send_sge = 1; /* scatter/gather entries */
qp_init_attr.cap.max_recv_sge = 1;
I changed 1 to 16 and it works.

Thanks, you are nice.

Zhang Yue

• October 15, 2014

Hi Zhang Yue.

Thanks for the update.
I've updated the description of num_sge in the posts that describe the structures of Send Request and Receive Request to be more informative according to your problem.

Thanks
Dotan

25. November 5, 2014

In a UD QP, can you post an inline send with immediate data?

• November 6, 2014

Yes, you can.

Thanks
Dotan

26. November 11, 2014

Hi Dotan,

I'd like to consult with you on the following subject: we perform IBV_WR_RDMA_WRITE to a remapped BAR of a remote PCI device and experience poor throughput. Using hardware monitoring tools we found out that the data was being written in 64-byte packets, and that's what caused the above issue.
My question is whether there's any configuration that could affect the way HCA writes the data?
I post a non-signalled rdma-write, with >1K of data as a single SGE, 4K MTU.

• November 11, 2014

Hi Igor.

I'm sorry, but this is device specific and I don't know much about it.

However, I would check with the vendor of that PCI device to get more details.
Do you have performance problems when accessing the PCI device locally?
Maybe the way that this BAR is mapped to kernel can be improved?

I hope that this give you a hint...

Thanks
Dotan

• November 11, 2014

It's "device-specific" in the sense that writing 64-byte packets causes the device to get the data slowly (which doesn't happen when the HCA writes to RAM, or when we DMA to this PCI device by other means) - the device vendor confirmed this assumption.
The BAR is remapped to user-space virtual addresses with io_remap_pfn_range(), then registered as an RDMA memory region using the PeerMemory mechanism recently introduced in Mellanox OFED especially for this purpose.
I believe the remote (w.r.t to the PCI device) HCA sends the data over the fabric in MTU-sized chunks, so it's probably the local HCA that performs such a "slow", or PCI-unfriendly, DMA.
So, the question is whether we have any control over the way HCA performs the DMA?

• November 13, 2014

Hi Igor.

AFAIK, there isn't any way to control how the HCA performs the DMA.
I doubt it, but even if there are ways to do this, you'll need to get this info from the HW vendors...

Sorry.
Dotan

27. December 3, 2014

Hi,
Suppose I post two requests in the receive queue, but for some reason I received the data for the second request before the first request. Is it possible to receive data for the second request before the first, or will it always give an error?

• December 3, 2014

Hi Govind.

Receive Requests are consumed in the order in which they were posted
(the Receive Queue "knows" only the order in which those Receive Requests were posted,
and this order is guaranteed).

The next message that enters the Queue Pair and consumes a Receive Request will take
the Receive Request at the head of the queue.

I understand that your application has the semantics of the first and second one,
however, the RDMA doesn't.

Bottom line, the answer is: no.

BTW it should always give an error. You didn't give me enough info,
but I believe that the problem is that the "first" Receive Request is too small.
This can be fixed by making sure that every Receive Request can hold any of the incoming messages...

I hope that this helps you
Dotan

28. December 3, 2014

hii all,
During ibv_post_send() I am getting errno 0 and 2 for two different messages. Can someone please point me to some document where I can find a description of these errno values? I am using the OFA RDMA APIs.

• December 3, 2014

Hi.

Unfortunately, the errno return values aren't consistent across all low-level drivers in RDMA.
If you share the code, maybe I'll be able to answer you.
If you'll share the code, maybe I'll be able to answer you.

Thanks
Dotan

29. December 3, 2014

Hello,

I can successfully send RDMA READ/WRITE, but I can't get RDMA atomic operations to work. I get an error when calling the ibv_post_send() function in the client, and errno is set to "Invalid argument". Below I pasted the important parts of my code. Could you please check my code and let me know if I'm missing anything?

*********** client side *****************:
-- Registering the memory regions --
mr = ibv_reg_mr(pd, buff, size, IBV_ACCESS_LOCAL_WRITE);
// and the size is 8

if (!mr){
fprintf(stderr, "Error, memory registration failed\n");
return -1;
}

-- Preparing RDMA ATOMIC FETCH AND
struct ibv_send_wr wr, *bad_wr = NULL;
struct ibv_sge sge;

memset(&sge, 0, sizeof(sge));
sge.length = 8;
sge.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;

wr.wr.atomic.rkey = peer_mr->rkey;

if (ibv_post_send(qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
return -1;
}
********* End of Client side *******

****** Server side ****************
-- Registering the memory regions --
mr = ibv_reg_mr(pd, rdma_region_timestamp_oracle, sizeof(TimestampOracle),
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);

if (!mr){
fprintf(stderr, "Error, memory registration() failed\n");
return -1;
}

NOTE: TimestampOracle is a class with two int members, so its size is 8 bytes (satisfies 64-bit condition for RDMA ATOMIC operations)

Erfan

• December 4, 2014

Hi Efran.

I have some questions:
1) Did you check that the RDMA device supports Atomic?
2) Did you check that the remote address is 8 byte aligned?
3) Did you enable atomic at the responder QP?
4) Is this is an RC QP?

I hope that one of the above questions gave you a hint on the problem.
If not, I'll need to see more source code and information on the RDMA devices that you are using.

Thanks
Dotan

• December 4, 2014

Hello Dotan,

1) How can I check that? Do you mean that some RDMA devices support Atomic and some don't?

2) I simplified the code, so now the remote address is one (long long) variable, which is 8 bytes (I paste the code at the end of this comment).

3) As you can see in my previous comment, on the server side code, I registered the memory region to be able to be accessed atomically by ibv_reg_mr(pd, ... , ...,
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC )). Do I need to do anything other than that?

4) When initializing the queue pairs on both client and server, I used qp_attr->qp_type = IBV_QPT_RC.

Here's the simplified code, I tried to leave unrelated parts out. I know how annoying it can be to read somebody else's lousy code. I'd really appreciate your help.

******** client code **********
void build_qp_attr(struct ibv_qp_init_attr *qp_attr){
memset(qp_attr, 0, sizeof(*qp_attr));
qp_attr->send_cq = s_ctx->cq;
qp_attr->recv_cq = s_ctx->cq;
qp_attr->qp_type = IBV_QPT_RC;

qp_attr->cap.max_send_wr = 10;
qp_attr->cap.max_recv_wr = 10;
qp_attr->cap.max_send_sge = 1;
qp_attr->cap.max_recv_sge = 1;
}

void register_memory(struct connection *conn) {
local_buffer = new long long[1];
local_mr = ibv_reg_mr(pd, local_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE);
}

void on_completion(struct ibv_wc *wc){
struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
// Assume that the client already knows about the remote_mr on the server side
if (wc->opcode & IBV_WC_RECV) {
struct ibv_send_wr wr, *bad_wr = NULL;
struct ibv_sge sge;

memset(&sge, 0, sizeof(sge));
sge.length = sizeof(long long);
sge.lkey = local_mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;

wr.wr.atomic.rkey = remote_mr.rkey;

if (ibv_post_send(conn->qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
die();
}

}
}
***** End of client code ********

**** Serve code ******
struct connection {
struct rdma_cm_id *id;
struct ibv_qp *qp;
struct ibv_mr *mr;
long long *rdma_buffer;
};

void build_qp_attr(struct ibv_qp_init_attr *qp_attr) {
memset(qp_attr, 0, sizeof(*qp_attr));
qp_attr->send_cq = s_ctx->cq;
qp_attr->recv_cq = s_ctx->cq;
qp_attr->qp_type = IBV_QPT_RC;

qp_attr->cap.max_send_wr = 10;
qp_attr->cap.max_recv_wr = 10;
qp_attr->cap.max_send_sge = 1;
qp_attr->cap.max_recv_sge = 1;
}

void register_memory(struct connection *conn){
rdma_region = 1ULL;

rm = ibv_reg_mr(pd, rdma_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);
}
***** End of Server code *******

• December 5, 2014

1) In struct ibv_device_attr, there is an attribute called 'atomic_cap'.
It describes the atomicity support level of this device,
since there may be devices that don't support atomic operations at all.
(Can you tell me what its value is?)

2) Please check that the remote address value is 8-byte aligned
(can you tell me what its value is?)

3) When calling ibv_modify_qp(), there is an attribute in struct ibv_qp_attr called 'qp_access_flags';
did you enable IBV_ACCESS_REMOTE_ATOMIC on the receiver side?

4) Only RC QP supports Atomic, so I see that you are using it.
And it's o.k., I don't mind reading other people's code :)
(I'm doing it all the time).

The code looks fine, beside from my comments above.

If you can, send me by email (dotan at rdmamojo dot com):
1) The full source code
2) The parameters of your program
3) An execution example and the output of your program
4) The output of 'ibv_devinfo -v'

(there is a limit to what I can do with only a description...)

Thanks
Dotan

30. December 5, 2014

Hello Dotan,

I'm trying to speed up ibv_post_send() when sending inline messages by using unsignaled completions. The problem is that it doesn't work if I post more than "qp_init_attr.cap.max_send_wr" unsignaled send requests. I tried to post one signaled request every N unsignaled ones, but it still crashes after max_send_wr posts. What am I doing wrong?

• December 6, 2014

Hi Jaume.

The flow that you've described sounds valid. What do you mean by "still crashes"?
(I don't expect a crash in this flow, unless there is a bug).

Did you provide a valid bad_wr pointer to the ibv_post_send() verb?

Thanks
Dotan

• December 8, 2014

By crashing, I meant that ibv_post_send() fails. I do not want to spend time reading completions, so I send "unsignaled" messages. However, the unsignaled approach does not seem to work, because posting fails once the CQ fills up. The QP is created with "qp_init_attr.sq_sig_all = 0;" and messages are sent without the IBV_SEND_INLINE flag.

• December 8, 2014

"Unsignaled Work Requests" means that those Send Requests won't generate Work Completions.
However, they are still considered outstanding, which means that you need to empty the Send Queue
by posting signaled Send Requests from time to time
(otherwise, the Send Queue will become full and you won't be able to post any new Send Requests).

The IBV_SEND_INLINE isn't relevant to the signalling of the Send Requests.

Bottom line: from time to time you must post a signaled Send Request
(if the Send Queue size is N, you can post a signaled Send Request every N messages,
and by polling its Work Completion you'll empty the Send Queue).

Thanks
Dotan

• December 7, 2014

Jaume, note that you have to process completions in the completion queue.

31. December 8, 2014

Hi Dotan,
I am trying to post a Send Request to a queue that is already full, and I am getting an error (ENOMEM). So I sleep for some time and post the same request again, but it throws the same error. (Consider that after the sleep time the Send Queue is not full.)

• December 8, 2014

Hi Govind.

Did you poll some Work Completions (which were posted to that Work Queue) from the associated CQ during this time?

Thanks
Dotan

• December 8, 2014

Yes, I did, and I am getting an error there as well... Currently I solved this issue by checking the number of pending requests in the Send Queue (using the idea that you mentioned in one of the comments) before posting any request, and it is working; but I don't want to do that because of the performance cost. One more thing: how should I increase the maximum limit of pending requests in the queue? Thanks for all the help and suggestions, I really appreciate it.

• December 8, 2014

I'm glad that I can help.
:)

Which error do you get?
Can you share the source code?
It will be easier for me to help you with the source in front of me.

Thanks
Dotan

32. December 9, 2014

Hi Dotan,
I can't share the code (confidentiality issues), but I can tell you the error numbers: the first error I get is error number 12, and then error number 5 for all the other messages during polling of the CQ. Can you please tell me how to increase the maximum limit of pending requests in the queue? Currently I am able to post ~8192 requests.

• December 9, 2014

Hi Govind.

When calling ibv_create_qp(), you control the Send Queue size (please refer to the post on this verb for more information).
I suspect that you have completions with error (i.e. the 5 and 12 values that you reported).
Am I right? (Are those the status values of the Work Completions that you polled?)

If this is the case, completion status 12 = IBV_WC_RETRY_EXC_ERR, which means that the remote side didn't answer within the expected time.

Thanks
Dotan

33. December 22, 2014

Hi Dotan,
First of all, thanks for all your help. Finally my code is working, and currently I am getting 3 times better performance with RDMA compared to UDP. I have a few more questions: how much improvement (at most) can we expect with RDMA compared to UDP? Currently I am using only channel semantics; is there a good chance of improving if I also use memory semantics?

• December 22, 2014

Hi Govind.

I'm happy that I can help
:)

1) Performance is a very big area. Which metrics do you check? What are the current numbers with UDP?
Do you compare using an RC QP or a UD QP? Which operations do you use?
2) What do you mean by channel semantics and memory semantics?

Thanks
Dotan

34. December 22, 2014

I am using an RC QP and comparing with the UDP protocol on the basis of the waiting time for requested data.
By memory semantics I mean that I am not allowing the remote node's channel adapter to write directly to host memory using an rkey (all read/write operations are done by the local channel adapter using an lkey); the reason for using only channel semantics is that I am transferring a very small amount of data at a time.

• December 23, 2014

So, I guess that your metric is latency.

I suggest that you execute a tool that comes with the OFED package called ib_send_lat,
which will give you the (best) latency that you can achieve using SEND operations in your setup.

Performance depends on so many factors that I prefer not to provide a number.

Thanks
Dotan

35. December 23, 2014

Hi Dotan,
(at the target side) When I'm doing an RDMA Read with 4 WRs, each WR having 1 SGE (4KB), the initiator easily crashes or /dev/sdxx disappears. (Doing RDMA Write is fine.)
I've set each WR's rkey and increased remote_addr by 4096; any suggestions?

Thanks
Zhang Yue
ps:
for (k = 1; k < cache_req.sglist_size; k++)
{
multi_wr[k] = rdmad->send_wr; // copy struct

multi_wr[k].next = &multi_wr[k+1];
multi_wr[k].send_flags = 0; // zy: should be 0, otherwise the task will be freed multiple times

}

// insert into the list

// this sge.length marks the total length; it will be used in iser_rdma_rd_comp_complete_handler

// so we need to keep the first wr's sge somewhere else

• December 23, 2014

Hi.

I don't know if this is related to RDMA.

I would suggest checking whether the local buffer that is being filled
is still allocated or has already been freed.

Maybe you should print the local address and check whether the values make any sense.

Also, please check that the Work Completion status is o.k. before using the values.

Thanks
Dotan

• December 25, 2014

Hi Dotan

First, Merry Christmas to all of us!
Yes, this issue is NOT related to RDMA.

Yesterday I printed every WR before calling ibv_post_send() and found an issue:
after doing a lot of 16KB writes, tgt may receive an INQUIRY, and if the INQUIRY unluckily uses a task struct that was previously used by a 16KB write (or read),
it will use the old four 4KB buffers and DMA them to the initiator. The INQUIRY only reads 70 bytes, so DMAing 16KB to it corrupts the initiator's memory.

The main fix is: check the needed DMA length, and if it is <= 0, skip the remaining buffers.

Thanks

Zhang Yue

• December 25, 2014

Hi.

Merry Christmas indeed
:)

I'm happy that you found the problem.

Dotan

36. January 15, 2015

Hi Dotan

I am trying to use the IBV_WR_ATOMIC_CMP_AND_SWP operation, and I get an error when I poll the WC: IBV_WC_REM_ACCESS_ERR.

I just made some simple modifications based on the code provided in the book "RDMA_Aware_Programing_user_manual"; do you know what the problem is?

• January 15, 2015

Hi.

Please check that IBV_ACCESS_REMOTE_ATOMIC is enabled in the remote memory buffer and in the remote QP.

Thanks
Dotan

37. January 22, 2015

Hi Dotan,

I want to post a request, but I want the remote QP to discard this request as soon as it receives it. This is because I want to send a dummy packet while the QPs are in the REARM state in order to reach the ARMED state (an incoming packet is needed for this transition).

I am using the configuration below and it seems to be working, but I would like to know whether you think this could be a generic approach for any situation:

struct ibv_send_wr wr, *bad_wr;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = NULL;
wr.num_sge = 0;
wr.opcode = 0;
wr.send_flags = 0;

if (ibv_post_send(qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
return -1;
}

Best regards,
Jesus Camacho

• January 23, 2015

Hi Jesus.

You are sending a "standard" zero-byte message. This can work, but you consume a Receive Request on the remote side.
Did you consider sending a zero-byte RDMA Write?

Thanks
Dotan

38. January 23, 2015

Hi Dotan,

I am currently using opcode 0 (which is the IBV_WR_RDMA_WRITE operation) and it is working fine with the InfiniBand microbenchmarks.

Is that what you are suggesting me?
If so, do you think this can be extrapolated to any scenario?

Jesus

• January 23, 2015

Hi.

Yes, this was my suggestion.
What do you mean by "do you think this can be extrapolated to any scenario"?

Thanks
Dotan

• January 23, 2015

Hi,

I mean if this is a general solution.

Do you think that this is going to work when using another benchmarks, applications, etc.?

Best,
Jesus

• January 23, 2015

Hi.

Yes. Using zero-byte messages is valid and can always be used.
Working with such messages with the RDMA Write opcode can provide better performance than the Send opcode.

Thanks
Dotan

• January 23, 2015

Hi,

good to know!

Jesus

• January 23, 2015

Sure
:)

Dotan

39. April 13, 2015

Hello Dotan,

I have a quick question. What happens if the local node calls ibv_post_send() with opcode IBV_WR_SEND before the remote node calls ibv_post_recv()?

Thanks!

• April 14, 2015

Hi John, the answer won't be quick though
;)

The thing that matters is not when the sides posted the Send/Receive Requests in absolute time,
since one may not know when the actual scheduling of the Send Request will take place...

If a message that consumes a Receive Request is received by a Queue Pair when there isn't any available Receive Request in that Queue,
an RNR (Receiver Not Ready) flow will start for Reliable QPs. For Unreliable QPs, the incoming message will be (silently) dropped.

Thanks
Dotan

• April 14, 2015

Hello Dotan,

I am using a Reliable QP, so I think I will get RNR errors. Now I have a couple of choices: (a) when getting an RNR error, back off and re-post the send request later; (b) implement a flow-control protocol so that the local node posts send requests only when the remote node is ready. I like (b) more than (a), but (b) adds complexity and needs to handle cases such as both nodes waiting for the other side to become ready. :-)

So I am wondering if there is a common practice.

Thanks!

• April 15, 2015

Sure :)

In RNR flows, the problem is that the receiver side doesn't post Receive Requests fast enough...

a) When you get an RNR error, your local QP is in the ERROR state, so you can't post another Send Request without reconnecting it with the remote QP.
b) is a good idea.

There are more options:
* You can increase the RNR timeout
* You can increase the RNR retry count (the value 7 means infinite retries)
* If you have several QPs at the receiver side, you can use an SRQ and make sure that the SRQ is never empty
(the SRQ LIMIT mechanism can help you detect when the number of Receive Requests drops below a specific watermark)

Adding flow control to your messages is always a good idea, in order not to enter the RNR flow in the first place..

Thanks
Dotan

40. May 29, 2015

Hi Dotan,
I have a few questions related to the connection of an RC Queue Pair.

1. If ibv_post_send() fails, can we consider the connection lost?
-> Assuming all the fields in the message are correct and the Send Queue is not full. Is the converse also true, that being able to post means there is a working connection between the nodes?

2. Is it possible to receive a send WC with an error when there is an active, working connection between the nodes, assuming the message was correct and the receiver also posted a Receive Request (no RNR error)?

3. If we post Send Requests beyond the maximum limit of the Send Queue, will it corrupt the Queue Pair so that no further posts are allowed? If not, can we post the same request again without any change?

• May 29, 2015

Hi.

1. Failure of ibv_post_send() means that one of the Send Requests is invalid or the Send Queue is full;
it doesn't mean that the connection is closed. In that case, no new Send Request was added to the Send Queue.

You can post a Send Request to a Queue Pair which was configured with bad remote attributes
("bad" meaning not the attributes that should have been configured...), i.e. no connection.

2. In general, no; but this question is tricky...
Which completion status did you get?

3. If you posted Send Requests beyond the maximum limit and all of them are unsignaled - you have a problem.
The Queue Pair isn't corrupted, but you can't post any more Send Requests to it:
the status of the outstanding Send Requests is undetermined for the sender side.
The receive side of this Queue Pair is still fully operational.

You must recover it by moving it to the Error/Reset state and reconnecting the Queue Pairs.

I hope that I helped you
Dotan

41. June 3, 2015

Hi Dotan:

Nice to meet you. I'm from China, and my English is not very good. Recently I have learned something about RDMA, but I've met a problem:

This is my test program:
server code :

/*
*/

#include
#include
#include
#include
#include
#include

#define VERB_ERR(verb, ret) \
fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

#define MB 1024 * 1024

/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH MB

/* Resources used in the example */
struct context
{
char *server_name;
char *server_port;
unsigned int msg_count;
unsigned int msg_length;
/* Resources */
struct rdma_cm_id *id;
struct rdma_cm_id *listen_id;
struct ibv_mr *recv_mr;
char *recv_buf;
};

int
reg_mem(struct context *ctx)
{
ctx->recv_buf = (char *) malloc(ctx->msg_length);
memset(ctx->recv_buf, 0x00, ctx->msg_length);

ctx->recv_mr = rdma_reg_msgs(ctx->id, ctx->recv_buf, ctx->msg_length);
if (!ctx->recv_mr) {
VERB_ERR("rdma_reg_msgs", -1);
return -1;
}

return 0;
}

int
{
int ret;
struct ibv_qp_init_attr qp_init_attr;

memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;
hints.ai_flags = RAI_PASSIVE; /* this makes it a server */

ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
return ret;
}

memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;

printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}

return 0;
}

int
get_connect_request(struct context *ctx)
{
int ret;
printf("rdma_listen\n");

ret = rdma_listen(ctx->id, 4);
if (ret) {
VERB_ERR("rdma_listen", ret);
return ret;
}

ctx->listen_id = ctx->id;
printf("rdma_get_request\n");
ret = rdma_get_request(ctx->listen_id, &ctx->id);
if (ret) {
VERB_ERR("rdma_get_request", ret);
return ret;
}

if (ctx->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
printf("unexpected event: %s", \
rdma_event_str(ctx->id->event->event));
return ret;
}

return 0;
}

int
establish_connection(struct context *ctx)
{
int ret;
struct rdma_conn_param conn_param;

/* post a receive to catch the first send */
ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}

memset(&conn_param, 0, sizeof (conn_param));
conn_param.responder_resources = 2;
conn_param.initiator_depth = 2;
conn_param.retry_count = 5;
conn_param.rnr_retry_count = 5;

printf("rdma_accept\n");
ret = rdma_accept(ctx->id, &conn_param);
if (ret) {
VERB_ERR("rdma_accept", ret);
return ret;
}

return 0;
}

int
recv_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;

ret = rdma_get_recv_comp(ctx->id, &wc);
if (ret id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}

return 0;
}

int
main(int argc, char** argv)
{
int ret, op, i, recv_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;

memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));

ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;

while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
switch (op) {
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg) * MB;
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}

printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
printf("port: %s\n", ctx.server_port);
printf("count: %d\n", ctx.msg_count);
printf("length: %d bytes\n", ctx.msg_length);
printf("\n");

if (ret) {
goto out;
}

ret = get_connect_request(&ctx);
if (ret) {
goto out;
}

ret = reg_mem(&ctx);
if (ret) {
goto out;
}

ret = establish_connection(&ctx);

recv_cnt = 0;
for (i = 0; i < ctx.msg_count; i++) {
if (recv_msg(&ctx)) {
break;
}
++recv_cnt;
}
printf("recv %d messages, each message is %d bytes\n", \
recv_cnt, ctx.msg_length);

rdma_disconnect(ctx.id);

out:
if (ctx.recv_mr) {
rdma_dereg_mr(ctx.recv_mr);
}

if (ctx.id) {
rdma_destroy_ep(ctx.id);
}

if (ctx.listen_id) {
rdma_destroy_ep(ctx.listen_id);
}

if (ctx.recv_buf) {
free(ctx.recv_buf);
}

return ret;
}

client code:

/*
*/

#include
#include
#include
#include
#include
#include

#define VERB_ERR(verb, ret) \
fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

#define MB 1024 * 1024

/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH MB
#define DEFAULT_MSEC_DELAY 500

/* Resources used in the example */
struct context
{
char *server_name;
char *server_port;
unsigned int msg_count;
unsigned int msg_length;
/* Resources */
struct rdma_cm_id *id;
struct ibv_mr *send_mr;
char *send_buf;
};

int
reg_mem(struct context *ctx)
{
ctx->send_buf = (char *) malloc(ctx->msg_length);
memset(ctx->send_buf, 'a', ctx->msg_length);

ctx->send_mr = rdma_reg_msgs(ctx->id, ctx->send_buf, ctx->msg_length);
if (!ctx->send_mr) {
VERB_ERR("rdma_reg_msgs", -1);
return -1;
}

return 0;
}

int
{
int ret;
struct ibv_qp_init_attr qp_init_attr;

memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;

ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
return ret;
}

memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;

printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}

return 0;
}

int
establish_connection(struct context *ctx)
{
int ret;
struct rdma_conn_param conn_param;

memset(&conn_param, 0, sizeof (conn_param));
conn_param.private_data_len = sizeof (int);
conn_param.responder_resources = 2;
conn_param.initiator_depth = 2;
conn_param.retry_count = 5;
conn_param.rnr_retry_count = 5;

printf("rdma_connect\n");
ret = rdma_connect(ctx->id, &conn_param);
if (ret) {
VERB_ERR("rdma_connect", ret);
return ret;
}

if (ctx->id->event->event != RDMA_CM_EVENT_ESTABLISHED) {
printf("unexpected event: %s",
rdma_event_str(ctx->id->event->event));
return -1;
}

return 0;
}

int
send_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;

ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length,
ctx->send_mr, IBV_SEND_SIGNALED);
if (ret) {
VERB_ERR("rdma_post_send", ret);
return ret;
}

ret = rdma_get_send_comp(ctx->id, &wc);
if (ret < 0) {
VERB_ERR("rdma_get_send_comp", ret);
return ret;
}

return 0;
}

int
main(int argc, char** argv)
{
int ret, op, i, send_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;

memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));

ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;

while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
switch (op) {
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg) * MB;
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}

printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
printf("port: %s\n", ctx.server_port);
printf("count: %d\n", ctx.msg_count);
printf("length: %d bytes\n", ctx.msg_length);
printf("\n");

if (!ctx.server_name) {
printf("server address must be specified for client\n");
exit(1);
}

if (ret) {
goto out;
}

ret = reg_mem(&ctx);
if (ret) {
goto out;
}

ret = establish_connection(&ctx);

send_cnt = 0;
for (i = 0; i < ctx.msg_count; i++) {
if (send_msg(&ctx)) {
break;
}
++send_cnt;
}
printf("send %d messages, each message is %d bytes\n", \
send_cnt, ctx.msg_length);

rdma_disconnect(ctx.id);

out:
if (ctx.send_mr) {
rdma_dereg_mr(ctx.send_mr);
}

if (ctx.id) {
rdma_destroy_ep(ctx.id);
}

if (ctx.send_buf) {
free(ctx.send_buf);
}

return ret;
}

What I can't understand is that sometimes this program takes 1 minute to send 1GB of data and sometimes it only needs 0.2 seconds, so it's not very stable.

I really don't know why. Can you give me some advice?
Thank you!

• June 3, 2015

Hi.

The code that you sent me is corrupted (a problem with posting it in a comment).
Can you please send it to me?
dotan at rdmamojo dot com

Thanks
Dotan

• June 4, 2015

Hi Dotan,

Thanks for the quick reply! I have sent my code to you by email. Thank you very much.

• June 4, 2015

Hi ChenCong Fu.

As I wrote in the mail, the problem is that the sender's Queue Pair enters the Receiver Not Ready (RNR) flow,
which harms the performance, and this is what you sometimes see.

Thanks
Dotan

42. June 9, 2015

Hello Dotan,
Thanks a lot for your help.
I have a design question; would you mind taking a look?

I have a client and a server; the client wants to send a lot of data to the server. Instead of using a Send operation to send the data directly from client to server, the client registers a memory region that includes the data and uses a Send operation to tell the remote server the virtual address of the data. Once the server receives this request from the client, it posts an RDMA Read operation to read the data directly from the client side.
What's the best way to do this?
At the beginning the server needs to receive a so-called "rdma msg" from the client so that it knows where to read the data on the remote side (client), which means we need to put our RDMA Read operation inside the receive-completion handler at the server side: only when the server finishes receiving the "rdma msg" from the client does it know where to read and can start the Read operation.

Is it OK to put the RDMA Read operation inside the receive-completion handler? Do you have any advice on this design?

Thanks a lot for your time!

All the best
Jack

• June 9, 2015

Hi Jack.

I'm glad to help where I can
:)

I would suggest using RDMA Write to send the data instead of RDMA Read,
i.e. the server allocates blocks and advertises their attributes to the client,
and the client initiates the RDMA Write(s).

The last RDMA Write can be with immediate data, to let the server know that it was the last message
(or from time to time during the transfer as keep-alive messages, letting the server know how many
messages to expect).

Thanks
Dotan

• June 12, 2015

Thanks a lot Dotan!

• June 13, 2015

Thanks a lot Dotan!
I will try to do both write and read.
While implementing it, I found a weird situation. I am putting the client and server on the same machine and performing an RDMA Read operation between them. The receiver (reader) can only read half of the data from the sender.

Do you have any idea? I have updated my firmware to the newest one (May 2015); my device is ConnectX-3. Does it support performing an RDMA Read operation in a local loopback?

Thanks a lot!

Jack

• June 13, 2015

Hi Jack.

I would double-check the length of the S/G entries in your Send Requests.

Thanks
Dotan

• June 15, 2015

Hello Dotan,
Thanks for your help. I have checked the S/G entry lengths; they are sufficient for the requests (the entry lengths are equal to the number of bytes of data).
I don't know what to do.

All the best
Jack

43. June 15, 2015

Thanks Dotan, I figured it out. Something wrong in another module...

• June 15, 2015

Great!

As I said, the RDMA device you mentioned works great (I have worked, and still work, with it personally).
:)

Thanks
Dotan

44. June 16, 2015

Hello Dotan,
I want to ask a question.
Suppose we want to send a huge message via ibv_post_send() that requires more than one Work Request, so we use a Send Work Request list.
For example, say the list contains 2 Work Requests (sendwr0, sendwr1).
For sendwr0 and sendwr1:
1) Do I need to assign them the same Work Request ID, because they basically represent the same message?
2) About the send flags: do I only need to set IBV_SEND_SIGNALED on the last request (in the case above, sendwr1)?

• June 17, 2015

1) No, you don't *need* to do it, but you *can* do it.
wr_id is an attribute for the application's use (or non-use).
If your application needs to know that the two Work Completions belong to the same message, you can use it as a hint.

2) You can set the SIGNALED flag on the second Send Request and get one Work Completion if everything goes fine.

The RDMA stack doesn't know (or care) that you used two Send Requests for one application message
(from the RDMA stack's point of view, you have two different messages).

Thanks
Dotan

• June 18, 2015

Thanks a lot Dotan, that's helpful!

45. June 18, 2015

Hello Dotan,
I would like to confirm whether my understanding of FRWR is correct.
If we have a sender and a receiver (reader), before they can start, the sender needs to call post_send() twice, right? The first post_send registers the memory (FRWR) with the NIC; the second one actually transfers the virtual addresses of these FRWR memory regions.
1) post_send the FRWR to store the incoming data
2) post_send to actually read the data
3) post_send to tell the remote side (sender) to invalidate the memory region (when the receiver finishes reading)
Is that correct?
And how are we supposed to know how many FRWR Read operations can be performed concurrently before we invalidate the first FRWR? Using query-device, I could not find this information; would you mind giving me a hand?

All the best
Jack

• June 21, 2015

Hi Jack.

I don't have any experience with FRWR operations, but let me try to help you anyway.
I assume that you are using RDMA Read (although you didn't write it..); this is the reason for the second post_send.

According to your scenario (using RDMA Read), yes - three post_sends are needed.

I don't really understand what you mean by:
"...how many FRWR read operations can be performed concurrently before we invalidate the first FRWR".

Thanks
Dotan

• June 22, 2015

Thanks a lot Dotan!
"...how many FRWR read operations can be performed currently before we invalidate the first FRWR".
Because with FRWR (at least from my understanding), we register a memory region, use it, and then invalidate it.
So to increase performance, the receiver (reader) may perform a couple of Read operations concurrently and will need to invalidate each specific MR when it's done; so my question was actually about how many Read operations we can perform, and I think it depends on my system.

Do you know where I can find more info about FRWR? I tried to search online, but I could not find too much info.

• June 23, 2015

Yes. It is your decision when to invalidate this Memory Region.

AFAIK, the InfiniBand specification is the only place where you can get information on FRWR.

Thanks
Dotan

46. June 29, 2015

Hello Dotan,
If I have a very large amount of data (divided into multiple chunks) to send out, there are two possible ways of doing it.
The first is using one Work Request (but this needs extra CPU time for a memory copy).
The second is using multiple RDMA Work Requests (no extra CPU time for a memory copy, but multiple Work Requests must be posted).

Which one is better?

All the best
Jingyi

• July 4, 2015

Hi Jingyi.

You can use one Send Request with a scatter/gather list;
this way you'll be able to eliminate the memory copy and send one message from multiple buffers.

If not, the best solution depends on the total message size:
* If it is small (~ < 1KB), I think that the first approach is the best.
* If the total message size is big, the second approach will give you the best performance. I suggest using selective signaling and creating a Work Completion only for the last Send Request.

Anyway, if performance is highly critical, the best way is to implement both approaches and measure the results (you develop once and use many times...).

I hope that this helped you.

Thanks
Dotan

47. July 6, 2015

Hello Dotan,
I have an idea; I am not sure if it's possible.
Suppose the sender has 10 chunks of data that need to be sent to the remote side (still the send/recv model).
I was thinking whether it's possible to perform Read and Write operations at the same time.
Back to our assumption: for the 1st chunk, the receiver (reader) reads from the sender, and at the same time the sender writes the 2nd chunk to the receiver (reader); for the rest of the chunks we do something similar. So we can improve the speed by keeping both sides busy, right?
Is the above approach possible? If so, I believe the challenge will be the ordering issue: how can we make sure that the chunks are delivered in order? Is there a good way to do it?

All the best
Jack

• July 7, 2015

Hi Jack.

Yes, RDMA Reads and Writes can happen at the same time
(obviously they are initiated by both sides).

I'm not really sure how much improvement it will give compared to the complexity
(maybe you would want to work with several QPs in parallel).

What is the meaning of order here?
Each QP can place the data in a different (predefined) location:
in a Write, you specify the remote location that the data will be written to;
in a Read, you specify the local location that the data will be written to.

So, at the end, all the chunks can be placed in one contiguous block.

Thanks
Dotan

48. July 9, 2015

Hello Dotan,
When I am doing an RDMA Write operation, I noticed a very interesting problem.
After we successfully post a Write Work Request and poll the corresponding WC, wc.byte_len is not the valid number of bytes we have written. For an RDMA Read operation, wc.byte_len is the number of bytes we read from the remote side, but for a Write operation we can't rely on it. I took a look at the driver: wc.byte_len isn't updated for a Write operation (if opcode = RDMA Write), but it is updated for a Read operation.
I also checked the InfiniBand specification; in the RDMA Write section it says we can depend on DMALen, but strangely it doesn't say anything about wc.byte_len.
Why is wc.byte_len updated for a Read operation but not for a Write?

All the best
Jack

• July 13, 2015

Hi Jack.

I *think* (since I'm not one of the IB spec authors) that if you are the requestor side of an RDMA Write or Send, you already know how much data you sent. If needed, you can maintain local information associated with each Send Request and hold a pointer to it in wr_id.

Thanks
Dotan

• July 15, 2015

Thanks Dotan!
Actually there's another confusion in the driver. If we post a Send Request and it fails, it seems that we still can't rely on wc.opcode, because the driver doesn't update it. Is there a design reason
why the driver doesn't need to update wc.opcode in the failure case?

All the best
Jack

• July 15, 2015

Hi Jack.

This is by design. Look at the post on ibv_poll_cq() for more details on valid attributes when Work Completion has an error.

Thanks
Dotan

49. July 16, 2015

Thanks for all the great info!

I didn't realize the IB verbs layer itself needs completion events created by the application layer until I saw your response to Igor R. When I first saw the description of the deadlock when the WQ is filled with non-signaled operations, I thought you were referring to the application-layer software needing completion events to keep a count of outstanding operations, to make sure the WQ is never filled.

Do you know why IB verbs pushes WR flow control back into the application layer by going into the error state when the WQ fills, instead of returning EAGAIN or EWOULDBLOCK like send(), recv(), read() or write() do for non-blocking I/O to a busy device?

• July 17, 2015

Hi Mark.

There isn't any problem if the Send Queue is full of Send Requests, as long as at least one of them is Signaled (i.e. will generate a Work Completion).

The problem only exists if all the posted Send Requests are non-signaled.

Letting the low-level driver or the HW do the bookkeeping of which Send Requests are signaled and which aren't would decrease performance, since before any Send Request is posted, the low-level driver would need to check whether there is a potential problem.

The application knows what it is doing, and can easily avoid this pitfall.

Thanks
Dotan

50. September 9, 2015

Hi, Dotan.
I have a question about parallel RDMA Reads. Since RDMA is an asynchronous model, before one RDMA Read finishes we can launch another, so there may be many unfinished RDMA Reads at a time; their number may exceed the initiator depth and responder resources. What happens when they are exceeded? Will the NIC launch the RDMA Read as usual, or will it wait until the number of unfinished RDMA Reads no longer exceeds the limit?

I use this parallel RDMA Read model in a cluster. When I don't limit the parallelism, I fail with IBV_WC_RETRY_EXC_ERR, but when I limit the number of parallel RDMA Reads, I succeed.

Is there any limit on parallel RDMA Reads? Or should we avoid this? Thanks!

• September 13, 2015

Hi.

Per QP, there are attributes for the number of RDMA Read + Atomic messages that can be sent in parallel.
If wrong values are used (for example, the initiator is configured to send more Reads than the destination can accept),
there will be a retry flow and the initiator side may get a completion with a RETRY EXCEEDED error (as you've seen).

The following attributes in the device capabilities are relevant to this operation:
* max_qp_rd_atom
* max_qp_init_rd_atom

They describe the supported number of RDMA Read and Atomic operations per QP (for initiator and target, respectively).

Thanks
Dotan

• September 13, 2015

Thanks very much! I ran into such a problem. I use shell/python and rping to compose an RDMA shuffle cluster: every node runs a server-mode process (which uses a thread for every incoming client connection), and there are also N client-mode processes on every node, which set up connections with the other nodes in the cluster. Since rping is an RDMA Read - ACK - RDMA Write - ACK procedure, there is only one outstanding RDMA operation at any time, yet I get the IBV_WC_RETRY_EXC_ERR error. In my opinion, there should be no reason for this error to occur.

By the way, when the cluster is just 15 nodes there is no error; errors occur when there are 30 nodes in the cluster.

Can you give some advice how to deal with this?

• September 15, 2015

Hi.

The problem is that there is one more attribute, 'max_res_rd_atom': the total number of RDMA Reads and Atomics that this device supports as the target.
AFAIK there isn't any sync or protocol which guarantees that no more RDMA Read / Atomic operations than this value are targeted at the device.

Thanks
Dotan

51. September 14, 2015

Hi Dotan,

I know it is not safe to ibv_post_recv several messages on the same address. But is it safe to ibv_post_send several messages on the same address? If so, is there any performance difference between posting the same address and different ones?

Thanks,
Tingyu

• September 15, 2015

Hi Tingyu.

The problem with posting multiple Receive Requests to the same address is that the content isn't consistent
(i.e. one cannot predict the value of the buffers since there isn't any guaranteed order between different Work Queues).

Sending multiple messages from the same address doesn't have this problem.

Thanks
Dotan

• September 15, 2015

Hi Dotan,

Thanks for this reply! I understand the data will not be consistent, but I wonder if RDMA allows this type of operation. So I tested by posting several Receive Requests to the same address, and the RDMA library threw an error during ibv_poll_cq on the sender side, setting wc.status to 12. Could you explain why? Is there any internal mechanism in the RDMA library that prevents reusing the same buffer?

Thanks,
Tingyu

• September 17, 2015

Hi Tingyu.

wc.status 12 means IBV_WC_RETRY_EXC_ERR.
This means that there was a transport error at some point.

Reusing the same buffer is legal in RDMA.

Thanks
Dotan

52. September 24, 2015

Hi, Dotan,
does the RC QPs guarantee the ordering of RDMA_WRITE WR? For example, if an "initiator" issues 2 consecutive IBV_WR_RDMA_WRITEs into the same remote memory location will the "target" always end up with the data from the second operation (ie, the second WR will always update remote memory after the first one) ?

• September 26, 2015

Hi Valentin.

I will be careful here:
* From the network point of view, the first message will reach the destination before the second one.
* The memory will be DMA'ed (by the RDMA device) according to the message ordering.

If the memory controller and cache in the server honor this (as I expect in most architectures),
I guess the answer is "yes".

Thanks
Dotan

53. September 30, 2015

Hi Dotan,

Is there any limit on the maximal message size posted using ibv_post_send? Say 16MB, 32MB, 64MB, 128MB? The problem for me is that when I try to post a message larger than 16MB, there is a problem (my code first posts a 16MB Receive Request using ibv_post_recv, then posts a 16MB send message using ibv_post_send to the other side. The first posted receive buffer is to receive the ack message from the other side). It turns out that the remote side doesn't receive the posted message (the other side also posted a 16MB receive buffer before receiving the message, and the connection between the two has already been established). ibv_poll_cq on the sender side returns a wc with status 12. Do you have any idea about this issue? I don't know how to debug it; could you give me any instruction on how to debug?

Thanks for help!
Tingyu

• September 30, 2015

Hi Tingyu.

The maximal message size can be found in the port properties: max_msg_sz (in general, RDMA supports up to 2GB messages).
Posting bigger messages will end with completion with error.

A completion with status 12, IBV_WC_RETRY_EXC_ERR, indicates that there is a transport problem.
I suspect that the remote side isn't ready yet, or finished its work and closed all the resources.

Thanks
Dotan

• October 2, 2015

Hi Dotan,

Thanks. I just checked: max_msg_sz was 2GB. To find the transport problem, I used the example "helloworld" code on github https://github.com/tarickb/the-geek-in-the-corner, as I got the same status 12 when the message size was set to 256MB (messages with smaller sizes worked).
The network I used was QLogic, so is it possible there was something wrong with the hardware or the underlying verbs implementation? Or was there anything wrong with the InfiniBand setup? Do you know a way to debug the problem?

Many thanks,
Tingyu

• October 19, 2015

Hi.

I didn't work with QLogic HW, so I don't have any feedback to give you.
I would suggest using the libibverbs examples (I know them and they always work).

Thanks
Dotan

54. October 27, 2015

Hello Dotan,

Will work requests be modified after posting them?

In more detail: assuming a list of requests headed by wr is posted by calling ibv_post_send(qp, wr, &bad_wr); will the fields, including the next pointers of the requests, be modified by the library?

Thanks so much!
Jon

• November 7, 2015

Hi Jon.

After a Send Request has been posted, it can be modified by the application.

During ibv_post_send(), the low-level library translates the libibverbs Send Request into a HW-specific Send Request and "tells" the RDMA device that new SRs were posted.

Thanks
Dotan

55. October 28, 2015

Hi Dotan,
I was wondering what is the behavior of an RDMA read of a remote memory if the remote machine is also writing to it concurrently?

More formally, suppose host A is reading using RDMA read, a variable v which is local to host B. If the value of v before the start of the read operation was 'a', and B is writing to v the value 'b' concurrently with the read operation, what is the return value of read going to be? Is it guaranteed to be either 'a' or 'b' or can it be a possibly garbage value too because of the local write or remote read not being atomic?
Thanks,
Sagar

• November 7, 2015

Hi Sagar.

Local Read and Local Write are not atomic and you may get garbage...

If you want to guarantee atomicity, you must use the Atomic operations.

Thanks
Dotan

• November 11, 2015

Thanks for the reply. I can see this happening when we are writing to large memory segments. Is this also true if we are writing to single instance of native data types (bits, bytes, integers, floats etc.)?

• November 11, 2015

If you don't use Atomic operations, there isn't any guarantee to atomic access even for small (and native) data types.

Thanks
Dotan

56. October 29, 2015

Hi.

First of all I would say thank you for this site and your comments, they are very useful.

My question :

I know that the atomic operations are maybe not very popular, but I have to use them. I have modified the rdma-file example to send one uint64_t-sized structure, and I am also using the example provided above. On the server side it is OK - I can see the structure changing. The problem is on the client side: I don't understand when and how I can check the swapped value. Can I check it directly after ibv_post_send, or should I wait or do something different? Right now I see nothing after ibv_post_send, but if I send back some message via a different MR, I see the swapped value. Can you give me a hint?

• November 7, 2015

Hi Vasily.

Thanks for the feedback
:)

It isn't really true that atomics aren't popular - it depends on what you are trying to do.

If you want to examine the value on the client side (i.e. the side that calls ibv_post_send()),
this can be done only after the Send Request processing has ended, i.e. after the Work Completion of the corresponding Send Request was polled from the Completion Queue.

Thanks
Dotan

57. November 26, 2015

hi Dotan,
When I use ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) to transfer one large message (200K) using one Work Request in UD mode, with wr->opcode = IBV_WR_SEND and wr->num_sge = 1,
an IBV_WC_LOC_LEN_ERR error occurs on the send side. I am sure the receive buffer is large enough on the receive side.
Does this happen because MTU (4096) < 200K? Do I need to split the 200K message into multiple Work Requests?

• November 27, 2015

Hi Songping yu.

A UD QP doesn't support message sizes above the path MTU:
this value is in the range 256-4096 bytes (depending on your subnet).

It is up to the application to split the (big) message into smaller messages using multiple Work Requests,
or to use a different QP transport type.

Thanks
Dotan

• December 26, 2017

Hi,
So does it mean that RC supports max 2GB and UD max 4KB?

• January 5, 2018

Hi.

* Maximum message size of RC QPs is 2GB (unless one of the end nodes supports a lower value)
* Maximum message size of UD QPs is 4KB (unless one of the end nodes/switches in the path supports a lower value)

Thanks
Dotan

58. December 21, 2015

Hi Dotan.
two questions:
1. I registered a big memory block; can I send part of it by address offset, length and rkey?
2. I registered many MRs of different memory sizes. When I send a message with the RDMA Send operation, how does the remote side select the receive MR?

Thanks!
Ben

• January 1, 2016

Hi.

1) Yes. You can use only part of it in a Work Request.

2) The remote side posts several Receive Requests:
the incoming messages will consume the Receive Requests according to the order they were posted.
i.e. RR[0] will be consumed by message[0], etc.

Thanks
Dotan

59. January 22, 2016

Thank you very much, Dotan. These pages are super-useful as an IB API reference.

• January 29, 2016

:)

Thanks for the great feedback
Dotan

60. May 31, 2016

Hi Dotan.
Thank you very much for your post and help!
Now I've met a problem: when I use ibv_post_send, I get a return value of 12. Before ibv_post_send, I checked send_wr.sge.addr and it is valid. I paste some code here:

1) create qp:

qp_attr.cap.max_send_wr = 1024;
qp_attr.cap.max_send_sge = 1;
qp_attr.cap.max_recv_wr = 1024;
qp_attr.cap.max_recv_sge = 1;
qp_attr.send_cq = send_cq;
qp_attr.recv_cq = recv_cq;
qp_attr.qp_type = IBV_QPT_RC;
err = rdma_create_qp(cm_id, connection->pd, &qp_attr);

2)query qp attr

if (ibv_query_qp(connection->cm_id->qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU | IBV_QP_CAP, &qp_attr))
{
printf("client query qp attr fail\n");
return RETURN_ERROR;
}

I found attr.cap.max_send_wr is equal to 2015, and attr.cap.max_recv_wr is equal to 1024, attr.cap.max_send_sge is equal to 2, attr.cap.max_recv_sge is equal to 1.

3)call ibv_post_send to send msg

memset(&sge, 0, sizeof(sge));
sge.length = sizeof(CMD_S);
sge.lkey = connection->connect_mr[MR_REQ].mr->lkey;

memset(&send_wr, 0, sizeof(send_wr));
send_wr.wr_id = (uint64_t)cmd;
send_wr.next = NULL;
send_wr.sg_list = &sge;
send_wr.num_sge = 1;
send_wr.opcode = IBV_WR_SEND;
send_wr.send_flags = IBV_SEND_SIGNALED;
ret = ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr);
if (ret != 0)
{
printf("client send connect cmd failed, ret=%d.\n", ret);
return RETURN_ERROR;
}

ret is equal to 12.

I am confused with follow question:
1. I set max_send_wr to 1024 and max_send_sge to 1, but when I query the qp later they have changed: max_send_wr is 2015 and max_send_sge is 2. Why?
2. In my test, multiple pthreads call ibv_post_send. My test has two params: one is the thread count, the other is the queue depth per thread (the queue is used by the test, not an RDMA queue). My test ran well with 8 threads and 32 queue depth, but got an error with 8 threads and 64 queue depth, and ibv_post_send returned the error value 12.

Please give me some suggestions to help me find the key point to resolve the problem. Thanks.

• June 2, 2016

I'd like to add that my test creates only one qp to send messages. 8 threads and 32 queue depth means that the qp may have to handle 8*32 requests at a time. Is the qp limited to handling 256 requests when max_send_wr is set to 1024? And are there limits when we use a qp for send/RDMA Read/RDMA Write?

• June 2, 2016

Hi.

The QP can handle Work Requests according to the max_send_wr that it was created with
(and this value is limited by the HCA capabilities).

* The Send Requests will be processed according to their order in the QP
* RDMA Read & Atomic parallel processing is limited by max_rd_atomic and max_dest_rd_atomic,
for the QP as initiator and destination respectively

Thanks
Dotan

• June 2, 2016

Hi.

1. The RDMA device/low-level driver can provide more resources than the originally requested value, according to its needs and internal structure.
2. I suspect that the Send Queue is full, i.e. you have many outstanding Send Requests (posted Send Requests that weren't yet ended with a Work Completion).

You should either increase the rate of polling Work Completions from the CQ, or increase the QP.max_send_wr value.

Thanks
Dotan

• June 2, 2016

Hi Dotan.
But I'm still confused about what causes the send queue to be full. My test generates 256 requests in total at first, and recycles them. So I think the RDMA send queue holds at most 256 work requests and should not be full. Could you give me a detailed explanation?

• June 3, 2016

Hi.

A posted Work Request is considered outstanding until a Work Completion has been generated for it, or for a Work Request posted after it.
When creating the QP, you specify the number of outstanding Work Requests for both the Send and Receive Queue of that QP.

I suspect that in your example, you post many Send Requests to the QP and don't poll the Work Completions for them.

Thanks
Dotan

61. June 8, 2016

Hi Dotan,

First of all, thank you so much for the blog! It is tremendously helpful!

Thanks!

• June 10, 2016

Hi.

Are you aware of the fact that there isn't any synchronization at all between the two sides in this test?
i.e. the sender sends a message, but the remote side may not be ready to receive it
(its QP isn't in the appropriate state, or a Receive Request wasn't posted, or it hasn't joined the multicast group yet).

This is the reason that adding a sleep to the sender will solve the problem...

You can solve it by adding synchronization between the two sides, or by letting the server send again and again while waiting for an incoming response from the client.

Thanks
Dotan

• June 16, 2016

Oh, I see. That makes sense. Thank you!

62. August 9, 2016

Hi Dotan. Like everyone else, thank you for such an informative resource for RDMA programming. My question: when ibv_post_send is used with one of the atomic opcodes (IBV_WR_ATOMIC_FETCH_AND_ADD or IBV_WR_ATOMIC_CMP_AND_SWAP), do you still need to poll for a completion event to be sure the atomic operation was successful? Or will the operation have completed when ibv_post_send returns?

• August 10, 2016

Hi.

Atomic operations, like any other operation, end when there is a Work Completion for them
(or for another Send Request that was posted after them).

When ibv_post_send() returns, it means that the low-level driver has enqueued this Send Request to the RDMA device for future processing.

Thanks
Dotan

63. January 11, 2017

Hi Dotan.

Thank you for such a guideline of rdma programing!

And I have some trouble with IBV_WR_SEND in UD. I use doorbell batching to post my sends (just like wr[i].next = &wr[i+1]). However, only the data of the last wr in the batch is received. I am sure there is no error thrown in my code, because if I replace IBV_WR_SEND with IBV_WR_SEND_WITH_IMM the same code works and the headers arrive correctly. Also, if I use a separate post_send for each wr, it works. I think something on the sender side is wrong.

Hope that you can give me some advice!

Thanks!

• February 10, 2017

Hi.

Please make sure that there isn't any race between the sides, and that when the message arrives at the remote side:
1) The remote QP is in (at least) RTR state
2) The posted Receive buffers are big enough (i.e. at least message size + 40 bytes for the GRH)

Thanks
Dotan

64. April 18, 2017

Hi Dotan,

I have a question. When I query my device, I get that max_qp_rd_atom is 16. So are more than 16 not possible? Why is it specific to RDMA Read operations? I don't see any problem when more than 16 Work Requests are posted for RDMA Read. What does attr.max_qp_rd_atom mean?

• July 3, 2017

Hi.

RDMA Read operations require special resources and handling on both the send and receive side;
this is the reason for the limitation.

Configuring QP.max_rd_atomic limits the number of RDMA Reads processed by the QP at any time;
you may post as many RDMA Read operations as you want, and the RDMA device will limit the processing.

Thanks
Dotan

65. May 9, 2017

Now I've hit some problems and tried to find the answer in RDMA_Aware_Programming_User_Manual.pdf (Version 1.7) and the IB Specification Vol 1, Release 1.3 (2015-03-03), but haven't found it, so I have to turn to you for help. The problem is: when I post a Work Request to a Queue Pair, the NIC gets a notification and fetches the Work Request from memory into the NIC cache by DMA. But when the NIC sends the data described by the Work Request onto the cable, does it need to fetch the Queue Pair information into the NIC cache? I know that the NIC cache stores Queue Pair data, memory address translation data and some network data, but when the NIC sends data, is the Queue Pair information necessary?

• July 3, 2017

Hi.

When sending data, the RDMA device needs to fetch QP information:
* QP state
* PKey index
* Qkey (for UD QPs, in specific scenarios)
* Remote side attributes (for connected QPs)

Thanks
Dotan

66. June 22, 2017

Hi Dotan,

If I want to use ibv_post_send, since we already have IBV_WR_SEND, why do we need IBV_WR_RDMA_WRITE? Is there any performance difference between these two approaches?

• July 2, 2017

Hi.

Yes. There is a performance difference:
* A Send operation will consume a Receive Request on the remote side
* An RDMA Write operation won't, and a PCI read is avoided (better latency)

Thanks
Dotan

• July 5, 2017

Great! Thanks Dotan.

67. July 10, 2017

After reading all the conversations in this post above, I have one more curious question (sorry for disturbing).
The question is: when, where and how is the necessary QP information collected for posting a send WR?
First, please allow me to sort out the procedure and explain my understanding.
When I post an ibv_send_wr *wr using ibv_post_send, the following happens:
1. With no context switch, in the same context, the ibv_post_send function transforms the ibv_send_wr *wr (the libibverbs abstraction) into a WQE (the HW-specific Send Request, as described in the Ethernet adapter programming manual). Constructing the WQE requires a Ctrl segment, Eth segment, Memory Management segment and Data segment, and the Ctrl segment includes the SQ number attribute (which seems to be the necessary QP information).
2. After constructing the new WQE, it writes the WQE to the WQE buffer and updates the Doorbell record associated with that queue (the ibv_post_send API returns).
3. The device gets the notification and asynchronously processes these new WQEs.
4. After a Work Request has been processed, the NIC writes CQEs to the relevant CQ by DMA.
5. I poll the CQ and get notifications.
OK, that's the whole procedure sorted out; is there any error in it?
From the procedure above, can I guess that collecting the necessary QP information happens when transforming the ibv_send_wr into a WQE (i.e. when calling ibv_post_send)?
And another question (sorry for my curiosity): as far as I know, at the software level the QP number is the unique identifier used to steer network message flows to the corresponding QP, while at the hardware level the GID and port are the unique identifiers used to steer packet flows. So, to summarize the question above, can I treat "fetching QP information for a work request" as "fetching the QP number and other non-unique information"?
Sorry for so many words, but I am really interested in this part. If I expressed it poorly, please point it out and I will improve. Thanks for your patience, Dotan!

• July 21, 2017

Hi.

This is an interesting question.
After the following step:
"2.after constructing new WQE,writing the WQE to the WQE buffer,and update Doorbell record associated with that queue.(ibv_post_send api returns)"
The WQE was enqueued to the RDMA device for processing; when the processing will actually start the RDMA device needs to collect relevant information for the QP:
* The QP type
* Remote QP number (for connected QP)
* Path to the remote QP (for connected QP)
* Send PSN
* more

Thanks
Dotan

• July 27, 2017

Hi,Dotan
I got it. There is still so much the device needs to do.
Sorry for my recklessness; I should carefully read the driver source code and then ask my questions. But I really do learn a lot from your detailed articles. Thanks for your patience and generosity.

• July 27, 2017

:)

68. August 25, 2017

Hi Dotan

I want to transfer data from serverA's memory to serverB's memory, so I use ibv_post_send() to do an RDMA Write. If the return value of ibv_post_send is zero, does it mean that the data has been transferred from serverA's memory to serverB's memory?

Hope that you can give me some advice!

Thanks!

• August 28, 2017

Hi.

No.

If ibv_post_send() returns the value 0,
it means that the Send Request was handed to the RDMA device for further processing.

If this is a reliable transport type, and there is a Work Completion with the SUCCESS status,
this means that the data was written to remote memory successfully.

Thanks
Dotan

69. November 7, 2017

Hi Dotan:

I am new to RDMA and I tried to do an RDMA RC Write. Everything works fine when the message size is smaller than the MTU. However, when I set my message size larger than the MTU, the side that posts the Write is not able to get any Write completion in the CQ, even though the remote side already has the complete data in the registered memory. There is no error message on either side. The side that posts the Write is stuck in the while loop of ibv_poll_cq(). I would like to ask what the problem might be.

Thanks,
Sylvia

• December 6, 2017

Hi.

Are you using RoCE or InfiniBand?
Did you configure the same MTU on both sides?

Thanks
Dotan

70. November 8, 2017

Hi Dotan,

I wrote a ping-pong program with IBV_WR_SEND; it's server/client like. The problem I met was that sending and receiving 1M messages of 4096 bytes took 26s, while the ibv_post_send calls took 9s. Is this normal? Or is there any reason for ibv_post_send to block?

• December 6, 2017

Hi.

What do you mean by "the ibv_post_send calls took 9s"?
First of all, that is too much time for a fast network; seconds are "infinite".
Second, I need to understand what you did in order to give an answer.

Thanks
Dotan

71. January 11, 2018

hello, Dotan!

I've met the problem that many guys mentioned. When I repeatedly write and read remote memory, I get ENOMEM. I tried to empty the CQ at both client and server using ibv_poll_cq, but it didn't work. Please help me! Thanks :)
/*my code seems like that: */

while (1) {

...
send_wr.opcode = IBV_WR_RDMA_WRITE;
send_wr.sg_list = &sge;
...
ret = ibv_post_send(qp, &send_wr, &bad_wr);
if (ret == EINVAL) {
printf("invalid value provided in wr\n");
} else if (ret == ENOMEM) {
printf("send queue is full\n");
do {
ne = ibv_poll_cq(cq, 1, &wc);
if (ne < 0) {
fprintf(stderr, "Failed to poll completions from the CQ: ret = %d\n",
ne);
break;
}
/* there may be an extra event with no completion in the CQ */
if (ne == 0)
continue;

if (wc.status != IBV_WC_SUCCESS) {
fprintf(stderr, "Completion with status 0x%x was found\n",
wc.status);
break;
}
} while (ne);
} else if (ret == EFAULT) {
printf("invalid value provided in qp\n");
} else if (ret != 0) {
printf("failure and no change will be done to the qp\n");
}
}

• January 19, 2018

Hi.

There are 2 options:
1) There aren't any Work Completions (and there won't be), since you didn't request their generation
(ibv_qp_init_attr.sq_sig_all for all Send Requests on that QP, or ibv_send_wr.send_flags per specific Send Request)
2) The processing is still ongoing;
for example, if there is a retransmission and the timeout is very high (or infinite).

Did you read any Work Completion from that CQ?
(from the Send Queue)

Thanks
Dotan