# ibv_post_send()

```
int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);
```

# Description

ibv_post_send() posts a linked list of Work Requests (WRs) to the Send Queue of a Queue Pair (QP). ibv_post_send() goes over all of the entries in the linked list, one by one, checks that each is valid, generates a HW-specific Send Request out of it and adds it to the tail of the QP's Send Queue, without performing any context switch. The RDMA device will handle it (later) in an asynchronous way. If there is a failure in one of the WRs, because the Send Queue is full or one of the attributes in the WR is bad, it stops immediately and returns a pointer to that WR. The QP will handle the Work Requests in the Send Queue according to the following rules:

• If the QP is in RESET, INIT or RTR state, an immediate error should be returned. However, there may be some low-level drivers that won't follow this rule (to eliminate an extra check in the data path, thus providing better performance), and posting Send Requests in one or all of those states may be silently ignored.
• If the QP is in RTS state, Send Requests can be posted and they will be processed.
• If the QP is in SQE or ERROR state, Send Requests can be posted and they will be completed with error.
• If the QP is in SQD state, Send Requests can be posted, but they won't be processed.

The struct ibv_send_wr describes a Work Request to the Send Queue of the QP, i.e. a Send Request (SR).

```
struct ibv_send_wr {
	uint64_t                wr_id;
	struct ibv_send_wr     *next;
	struct ibv_sge         *sg_list;
	int                     num_sge;
	enum ibv_wr_opcode      opcode;
	int                     send_flags;
	uint32_t                imm_data;
	union {
		struct {
			uint64_t        remote_addr;
			uint32_t        rkey;
		} rdma;
		struct {
			uint64_t        remote_addr;
			uint64_t        compare_add;
			uint64_t        swap;
			uint32_t        rkey;
		} atomic;
		struct {
			struct ibv_ah  *ah;
			uint32_t        remote_qpn;
			uint32_t        remote_qkey;
		} ud;
	} wr;
};
```

Here is the full description of struct ibv_send_wr:

**wr_id** - A 64-bit value associated with this WR. If a Work Completion is generated when this Work Request ends, it will contain this value.

**next** - Pointer to the next WR in the linked list. NULL indicates that this is the last WR.

**sg_list** - Scatter/gather array, as described in the table below. It specifies the buffers that will be read from, or the buffers where data will be written to, depending on the used opcode. The entries in the list can specify memory blocks that were registered by different Memory Regions. The message size is the sum of the lengths of all of the memory buffers in the scatter/gather list.

**num_sge** - Size of the sg_list array. This number can be less than or equal to the number of scatter/gather entries that the Queue Pair was created to support in the Send Queue (qp_init_attr.cap.max_send_sge). If this size is 0, the message size is 0.

**opcode** - The operation that this WR will perform. This value controls the way that data is sent, the direction of the data flow and the used attributes in the WR. It can be one of the following enumerated values:

• IBV_WR_SEND - The content of the local memory buffers specified in sg_list is sent to the remote QP. The sender doesn't know where the data will be written in the remote node. A Receive Request will be consumed from the head of the remote QP's Receive Queue and the sent data will be written to the memory buffers which are specified in that Receive Request. The message size can be [0, $2^{31}$] for RC and UC QPs and [0, path MTU] for a UD QP.
• IBV_WR_SEND_WITH_IMM - Same as IBV_WR_SEND, but immediate data is sent in the message. This value will be available in the Work Completion that is generated for the consumed Receive Request in the remote QP.
• IBV_WR_RDMA_WRITE - The content of the local memory buffers specified in sg_list is sent and written to a contiguous block of memory range in the remote QP's virtual address space. This doesn't necessarily mean that the remote memory is physically contiguous. No Receive Request is consumed in the remote QP. The message size can be [0, $2^{31}$].
• IBV_WR_RDMA_WRITE_WITH_IMM - Same as IBV_WR_RDMA_WRITE, but a Receive Request is consumed from the head of the remote QP's Receive Queue and immediate data is sent in the message. This value will be available in the Work Completion that is generated for the consumed Receive Request in the remote QP.
• IBV_WR_RDMA_READ - Data is read from a contiguous block of memory range in the remote QP's virtual address space and written to the local memory buffers specified in sg_list. No Receive Request is consumed in the remote QP. The message size can be [0, $2^{31}$].
• IBV_WR_ATOMIC_FETCH_AND_ADD - A 64-bit value in the remote QP's virtual address space is read, wr.atomic.compare_add is added to it, and the result is written back to the same memory address, atomically. No Receive Request is consumed in the remote QP. The original data, before the add operation, is written to the local memory buffers specified in sg_list.
• IBV_WR_ATOMIC_CMP_AND_SWP - A 64-bit value in the remote QP's virtual address space is read and compared with wr.atomic.compare_add; if they are equal, the value wr.atomic.swap is written to the same memory address, atomically. No Receive Request is consumed in the remote QP. The original data, before the compare operation, is written to the local memory buffers specified in sg_list.

**send_flags** - Describes the properties of the WR. It is either 0 or the bitwise OR of one or more of the following flags:

• IBV_SEND_FENCE - Set the fence indicator for this WR. The processing of this WR will be blocked until all previously posted RDMA Read and Atomic WRs have been completed. Valid only for QPs with Transport Service Type IBV_QPT_RC.
• IBV_SEND_SIGNALED - Set the completion notification indicator for this WR. If the QP was created with sq_sig_all=0, a Work Completion will be generated when the processing of this WR ends. If the QP was created with sq_sig_all=1, this flag has no effect.
• IBV_SEND_SOLICITED - Set the solicited event indicator for this WR. When the message of this WR ends in the remote QP, a solicited event will be created for it, and if on the remote side a user is waiting for a solicited event, it will be woken up. Relevant only for the Send and RDMA Write with immediate opcodes.
• IBV_SEND_INLINE - The memory buffers specified in sg_list will be placed inline in the Send Request. This means that the low-level driver (i.e. the CPU) will read the data and not the RDMA device. The L_Key won't be checked; in fact, those memory buffers don't even have to be registered, and they can be reused immediately after ibv_post_send() returns. Valid only for the Send and RDMA Write opcodes.

**imm_data** - (optional) A 32-bit value, in network byte order, for the SEND or RDMA WRITE with immediate opcodes, that is sent along with the payload to the remote side and placed in a Receive Work Completion, and not in a remote memory buffer.

**wr.rdma.remote_addr** - Start address of the remote memory block to access (read or write, depending on the opcode). Relevant only for the RDMA WRITE (with immediate) and RDMA READ opcodes.

**wr.rdma.rkey** - r_key of the Memory Region that is being accessed at the remote side. Relevant only for the RDMA WRITE (with immediate) and RDMA READ opcodes.

**wr.atomic.remote_addr** - Start address of the remote memory block to access. Relevant only for atomic operations.

**wr.atomic.compare_add** - For Fetch and Add: the value that will be added to the content of the remote address. For Compare and Swap: the value to be compared with the content of the remote address. Relevant only for atomic operations.

**wr.atomic.swap** - Relevant only for Compare and Swap: the value to be written to the remote address if the value that was read equals the value in wr.atomic.compare_add. Relevant only for atomic operations.

**wr.atomic.rkey** - r_key of the Memory Region that is being accessed at the remote side. Relevant only for atomic operations.

**wr.ud.ah** - Address Handle (AH) that describes how to send the packet. This AH must remain valid until all posted Work Requests that use it are no longer considered outstanding. Relevant only for UD QPs.

**wr.ud.remote_qpn** - QP number of the destination QP. The value 0xFFFFFF indicates that this is a message to a multicast group. Relevant only for UD QPs.

**wr.ud.remote_qkey** - Q_Key value of the remote QP. Relevant only for UD QPs.

The following table describes the supported opcodes for each QP Transport Service Type:

| Opcode | UD | UC | RC |
|---|---|---|---|
| IBV_WR_SEND | X | X | X |
| IBV_WR_SEND_WITH_IMM | X | X | X |
| IBV_WR_RDMA_WRITE | | X | X |
| IBV_WR_RDMA_WRITE_WITH_IMM | | X | X |
| IBV_WR_RDMA_READ | | | X |
| IBV_WR_ATOMIC_FETCH_AND_ADD | | | X |
| IBV_WR_ATOMIC_CMP_AND_SWP | | | X |

struct ibv_sge describes a scatter/gather entry. The memory buffer that this entry describes must remain registered until any posted Work Request that uses it is no longer considered outstanding. The order in which the RDMA device accesses the memory in a scatter/gather list isn't defined. This means that if some of the entries overlap the same memory address, the content of this address is undefined.

```
struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};
```

Here is the full description of struct ibv_sge:

**addr** - The address of the buffer to read from or write to.

**length** - The length of the buffer in bytes. The value 0 is a special value and is equal to $2^{31}$ bytes (and not zero bytes, as one might imagine).

**lkey** - The Local key of the Memory Region that this memory buffer was registered with.

Sending inline data is an implementation extension that isn't defined in any RDMA specification: it allows placing the data itself in the Work Request (instead of in the scatter/gather entries) that is posted to the RDMA device. The memory that holds this message doesn't have to be registered. There isn't any verb that specifies the maximum message size that can be sent inline in a QP, and only some of the RDMA devices support it. In some RDMA devices, creating a QP will set the value of max_inline_data to the size of messages that can be sent inline using the requested number of scatter/gather elements of the Send Queue. In others, one should explicitly specify the message size to be sent inline before the creation of a QP; for those devices, it is advised to try to create the QP with the required message size and keep decreasing it if the QP creation fails. While a WR is considered outstanding:

• If the WR sends data, the content of the local memory buffers shouldn't be changed, since one doesn't know when the RDMA device will stop reading from it (one exception is inline data)
• If the WR reads data, the content of the local memory buffers shouldn't be read, since one doesn't know when the RDMA device will stop writing new content to it
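The "create and decrease on failure" advice above can be sketched as a small retry loop. This is a generic sketch under stated assumptions: try_create() is a hypothetical callback that would wrap ibv_create_qp() with the candidate size placed in qp_init_attr.cap.max_inline_data, and halving the size is just one possible step policy:

```c
#include <stdint.h>
#include <stddef.h>

/* Try 'wanted' first; on failure keep halving the requested inline size
 * until the creation callback succeeds or 0 has been tried.
 * try_create() is expected to return the created QP (or NULL on failure);
 * its exact shape is an assumption of this sketch. */
static void *create_with_inline_retry(uint32_t wanted,
                                      void *(*try_create)(uint32_t size, void *ctx),
                                      void *ctx)
{
	for (;;) {
		void *qp = try_create(wanted, ctx);
		if (qp)
			return qp;   /* creation succeeded with this inline size */
		if (wanted == 0)
			return NULL; /* even max_inline_data = 0 failed */
		wanted /= 2;         /* retry with a smaller inline size */
	}
}
```

After a successful creation, the actual supported value can be read back from qp_init_attr.cap.max_inline_data, which ibv_create_qp() updates.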

# Parameters

| Name | Direction | Description |
|---|---|---|
| qp | in | Queue Pair that was returned from ibv_create_qp() |
| wr | in | Linked list of Work Requests to be posted to the Send Queue of the Queue Pair |
| bad_wr | out | A pointer that will be filled with the address of the first Work Request whose processing failed |

# Return Values

| Value | Description |
|---|---|
| 0 | On success |
| errno | On failure; no change is done to the QP, and bad_wr points to the SR that failed to be posted |
| EINVAL | Invalid value provided in wr |
| ENOMEM | Send Queue is full or not enough resources to complete this operation |
| EFAULT | Invalid value provided in qp |

# Examples

1) Posting a WR with the Send operation to a UC or RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

2) Posting a WR with the Send with immediate operation to a UD QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_SEND_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data = htonl(0x1234);
wr.wr.ud.ah = ah;
wr.wr.ud.remote_qpn = remote_qpn;
wr.wr.ud.remote_qkey = 0x11111111;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

3) Posting a WR with an RDMA Write operation to a UC or RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

4) Posting a WR with an RDMA Write with immediate operation to a UC or RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data = htonl(0x1234);
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

5) Posting a WR with an RDMA Read operation to an RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_RDMA_READ;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey = remote_key;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

6) Posting a WR with a Compare and Swap operation to an RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_ATOMIC_CMP_AND_SWP;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey = remote_key;
wr.wr.atomic.compare_add = 0ULL; /* expected value in remote address */
wr.wr.atomic.swap = 1ULL; /* the value that remote address will be assigned to */

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

7) Posting a WR with a Fetch and Add operation to an RC QP:

```
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey = remote_key;
wr.wr.atomic.compare_add = 1ULL; /* value to be added to the remote address content */

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

8) Posting a WR with the Send operation, with zero bytes, to a UC or RC QP:

```
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = NULL;
wr.num_sge = 0;
wr.opcode = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;

if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}
```

# FAQs

#### Does ibv_post_send() cause a context switch?

No. Posting an SR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

#### How many WRs can I post?

There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created.

#### Can I know how many WRs are outstanding in a Work Queue?

No, you can't. You should keep track of the number of outstanding WRs according to the number of posted WRs and the number of Work Completions that you polled.
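A minimal sketch of that bookkeeping (the struct and function names are illustrative, not part of libibverbs):

```c
#include <stdint.h>

/* Application-side accounting of outstanding Send Requests:
 * count what is posted and subtract what ibv_poll_cq() returns. */
struct sq_accounting {
	uint32_t max_outstanding; /* qp_init_attr.cap.max_send_wr */
	uint32_t outstanding;     /* posted but not yet completed */
};

/* Returns 0 if there is room for one more Send Request, -1 otherwise. */
static int sq_reserve(struct sq_accounting *a)
{
	if (a->outstanding == a->max_outstanding)
		return -1;  /* Send Queue would overflow; don't post */
	a->outstanding++;   /* account for the WR about to be posted */
	return 0;
}

/* Call after ibv_poll_cq() returned 'num_polled' Work Completions. */
static void sq_on_completion(struct sq_accounting *a, uint32_t num_polled)
{
	a->outstanding -= num_polled;
}
```

Note that this only works if every tracked WR eventually produces a Work Completion, i.e. with sq_sig_all=1 or by counting only signaled WRs.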

#### Is the remote side aware of the fact that RDMA operations are being performed in its memory?

No, this is the idea of RDMA.

#### If the remote side isn't aware that RDMA operations are being performed in its memory, isn't this a security hole?

Actually, no. For several reasons:

• In order to allow incoming RDMA operations on a QP, the QP should be configured to enable remote operations
• In order to allow incoming RDMA access to an MR, the MR should be registered with those remote permissions enabled
• The remote side must know the r_key and the memory addresses in order to be able to access remote memory

#### What will happen if I deregister an MR that is used by an outstanding WR?

When processing a WR, if one of the MRs that are specified in the WR isn't valid, a Work Completion with error will be generated. The only exception to this is posting inline data.

#### What is the benefit from using IBV_SEND_INLINE?

Using inline data usually provides better performance (i.e. lower latency).

#### What is the difference between inline data and immediate data?

Using immediate data means that out-of-band data will be sent from the local QP to the remote QP: for a SEND opcode, this data will exist in the Work Completion; for an RDMA WRITE opcode, a WR will be consumed from the remote QP's Receive Queue. Inline data influences only the way that the RDMA device gets the data to send; the remote side isn't aware of the fact that this WR was sent inline.

#### I called ibv_post_send() and got a segmentation fault, what happened?

There may be several reasons for this to happen:
1) At least one of the sg_list entries is at an invalid address
2) In one of the posted SRs, IBV_SEND_INLINE is set in send_flags, but one of the buffers in sg_list points to an illegal address
3) The value of next points to an invalid address
4) An error occurred in one of the posted SRs (a bad value in the SR or a full Work Queue) and the variable bad_wr is NULL
5) A UD QP is used and wr.ud.ah points to an invalid address

#### Help, I've posted a Send Request and it wasn't completed with a corresponding Work Completion. What happened?

In order to debug this kind of problem, one should do the following:

• Verify that a Send Request was actually posted
• Wait enough time, maybe a Work Completion will eventually be generated
• Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
• Verify that the QP state is RTS
• If this is an RC QP, verify that the timeout value that was configured in ibv_modify_qp() isn't 0, since if a packet is dropped this may lead to an infinite timeout
• If this is an RC QP, verify that the combination of the timeout and retry_cnt values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RETRY_EXC_ERR is generated
• If this is an RC QP, verify that the rnr_retry value that was configured in ibv_modify_qp() isn't 7, since this may lead to infinite retries in case of an RNR flow
• If this is an RC QP, verify that the combination of the min_rnr_timer and rnr_retry values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RNR_RETRY_EXC_ERR is generated

#### How can I send a zero bytes message?

In order to send a zero-byte message, no matter what the opcode is, num_sge must be set to zero.

#### Can I (re)use the Send Request after ibv_post_send() returned?

Yes. This verb translates the Send Request from the libibverbs abstraction to a HW-specific Send Request, so you can (re)use both the Send Request and the s/g list within it.

## Comments

1. March 6, 2013

I have a question about whether a context switch occurs during an RDMA operation. Here (page 15) it is shown that a user space verbs call results in a call to the hardware-specific driver (e.g. mlx4). That "lives" in kernel space. So, does ibv_post_send() (RDMA mode) cause a context switch or not? Can you clarify this for me please.

Also, if ibv_post_send() never causes a context switch, then why is there an implementation of ibv_post_send() in the Linux kernel? When is this function (inside the kernel) called?

Thanks!

• March 6, 2013

This is a great question!

Every control operation (i.e. create/destroy/modify/query of any resource) will cause a context switch.
However, the data operations won't cause a context switch, and from the same context
one can post new Work Requests (either to the Send or Receive Queues).

In the example, you mentioned "mlx4"; creating a Queue Pair will perform a context switch and the following libraries/modules will be called, in order:
libibverbs -> libmlx4 -> libibverbs -> ib core -> mlx4

In order to post a Send Request, the following libraries/modules will be called, in order:
libibverbs -> libmlx4
i.e. no context switch will happen at all.

However, if there are devices (or low-level drivers) that don't support posting Send Requests without a context switch, libibverbs has prepared the infrastructure to allow posting the Work Requests at the kernel level.
Personally, I don't know about any device that uses those functions.

Thanks
Dotan

• March 6, 2013

Yes, you did! Thanks!

So in page 15 the Hardware Specific Driver (yellow box) might be the libmlx4 depending on the implementation (or it might be mlx4 linux kernel module in ./drivers/infiniband/hw/mlx4 otherwise). Am I right?

• March 6, 2013

The Hardware Specific Driver (yellow box) is the mlx4 kernel part (since this section describes the kernel space modules). The User level APIs (white box) is the libibverbs and libmlx4.
(Do you see the "kernel bypass" line? This means direct access to the HW without the need to perform a context switch).

• March 11, 2013

Yes, I see the "kernel bypass" line. But that seems like a contradiction: kernel bypass on the one hand, but libmlx4 calls something (the Hardware Specific Driver (mlx4 kernel module)) that "lives" inside the kernel (kernel context). Except if the author of the diagram means that the line goes to the InfiniBand HCA directly (firmware code). :P

Sorry for being persistent!

• March 12, 2013

It is o.k.
:)

The "kernel bypass" means that in the data path, your user level code will be able
to work directly with the HW (without performing a context switch).

Please remember that the kernel level must be involved in the control part in order to
sync the resources (between different processes/modules) and configure the HW since user
level application can't write directly in the device memory space (since this is a privileged operation).

In this slide, I can see that there are two lines:
1) The first line specifies kernel bypass (for the data path)
2) The second line specifies that the user level will call the OpenFabrics kernel-level verbs

If you enjoy this blog, please publish it to other people as well.

Thanks
Dotan

2. March 11, 2013

Hi Dotan,

I have few questions about ibv_post_send():

1. If I issue one large send request, will it (or can it) be served by multiple smaller receive buffers? Or can one send request never use multiple receive buffers?

2. When would I need to use IBV_SEND_SIGNALED and IBV_SEND_SOLICITED?

3. Can the Receive buffer be a gather list, with the HCA DMA-ing the received data to the appropriate gather elements?

• March 12, 2013

Hi Jay.

1) I assume that you mean that you send a big message over the wire;
at the receive side you can split this message into as many scatter elements
as you wish (this is a local attribute).

To summarize it:
When using RDMA operation(s): only one remote contiguous buffer can be used.
When using the Send operation: the receive side can use one or more scatter elements
(as long as the sum of the buffers is able to hold all of the message).

2) IBV_SEND_SIGNALED should be used if the QP was created with sq_sig_all=0
(which means that not all Send Requests will generate Work Completion when completed).

IBV_SEND_SOLICITED should be used when the remote side is reading the Work Completions
using events (and not in polling mode). Please check my post about ibv_req_notify_cq()
for more details.

3) Yes, this is exactly what the RDMA device will do on the Receive side,
when using the Send operation. Please keep in mind that those memory buffers should
be registered first.

If you enjoy this blog, please publish it to other people as well.

Thanks
Dotan

• March 13, 2013

Hi Dotan,

Please let me rephrase question #1: assume the receiver posts two Receive
Requests, each with n bytes worth of buffer, so the receiver has a total of
2n bytes of buffer available.
Now the sender issues one Send work request with a total of 2n+m bytes of data.
Will both receive buffers be consumed to satisfy the one Send work request?

When using RDMA operations you said one single contiguous buffer can be used.
Do you mean RDMA Write OR RDMA Read?

Jay

• March 13, 2013

Hi Jay.

The Receive Request works at the resolution of messages, not of bytes.

Every Receive Request will handle only one incoming message:
for each incoming message, one Receive Request will be fetched from the head
of the Receive Queue. The messages will be handled in the order of their arrival.

In your example there are 2 Receive Requests, each with n bytes:
* Receiving a message of n bytes or less is fine
* Receiving a message with more than n bytes will cause an error (since there isn't enough room to hold the message)

When working with RDMA operations:
* RDMA Write can read one or more local gather entries and write them to one remote contiguous block
* RDMA Read can read from one remote contiguous block and write it locally to one or more scatter entries

If you have more questions, you are more than welcome to ask..
:)

Dotan

3. March 19, 2013

When I send a 1024 byte block with the IBV_WR_RDMA_WRITE opcode, everything is OK, but if the block size is set larger (e.g. 4096 bytes), I get an IBV_WC_LOC_PROT_ERR error and then many IBV_WC_WR_FLUSH_ERR errors on the send CQ. Can you help me?

• March 19, 2013

Hi.

Please check the memory buffers in the gather list of the Send Request; I suspect that you are trying to access memory that wasn't registered.

Thanks
Dotan

4. March 25, 2013

ibv_post_send() returns -1, what is the problem? Thanks for your help

• March 25, 2013

Hi.

There can be several reasons:
* The Send Request has invalid value(s)
* The Send Queue is full

Not all of the low-level drivers return errno to indicate errors
(some of them returned -1 in the past and now return errno).

It depends on the library that you use and its version.

Thanks
Dotan

5. May 1, 2013

Hi Dotan, I'm running into a problem with ibv_post_send and hoping you can provide some guidance. I've adapted the rc_ping_pong program to exchange 312 byte messages among nodes in a 32-machine IB cluster, except that I use an epoll() based mechanism to call ibv_poll_cq(). Several messages later (around 58900 to be exact), ibv_post_send() fails returning ENOMEM and errno set to 2. Both sides of the connection are in good states: IBV_PORT_ACTIVE & IBV_QPS_RTS. When I keep track of sends posted vs sends completed I find that during the failure (posted-completed) = 31, always. However I have only max_send_wr=1 when I created the qp. So I'm not sure what's going on. On the receive side I guarantee posts (rx_depth=800 and whenever it drops to 400 I post 400 more). Any help is much appreciated, and if you need further clarifications please let me know.
Thanks much
Sara

• May 1, 2013

Hi Sara.

:)

If ibv_post_send() itself fails, that means that either:
The Send Queue is full (i.e. all of the Work Requests in the Send Queue are outstanding)
or
The posted Send Request is illegal:
* too many scatter/gather elements
* too much inline data (if inline data is used)
* wrong opcode

Please check if this helps you:
if you are sure that the Send Queue isn't full, dump the Send Request and check what I suggested above.

Thanks
Dotan

6. May 1, 2013

Thanks for the quick response Dotan.
I'm leaning towards full queue rather than illegal request because:
1. They've been going through fine for all the previous posts, and
2. I simply reuse circular buffers for subsequent sends
3. I inspected the wr (bad_wr points to it) during failure and it looks okay:
```
(gdb) p wr
$1 = {wr_id = 1, next = 0x0, sg_list = 0x7fcaca7fbcb8, num_sge = 1, opcode = IBV_WR_SEND, send_flags = 2, imm_data = 0, wr = {rdma = {remote_addr = 0, rkey = 0}, atomic = {remote_addr = 0,
      compare_add = 0, swap = 0, rkey = 0}, ud = {ah = 0x0, remote_qpn = 0, remote_qkey = 0}}}
(gdb) p *wr->sg_list
$11 = {addr = 49981952, length = 312, lkey = 175104}
```

I'm confused about two things though (if send queue full is the problem):

1. ibv_post_send() returns ENOMEM (and not -ENOMEM which is what the drivers seem to return when kmalloc fails or something similar)
2. errno=2 which is also weird, I'm unable to find out exactly who sets it & why

I've also tried running it through valgrind to check invalid memory and it looks clean.
Any pointers?

Thanks
Sara

• May 2, 2013

Hi Sara.

I'll try to help here:
1) User level libraries return positive errno values and not negative ones
(kernel level drivers return negative errno values)

2) I don't know where the errno=2 came from; libmlx4 almost never sets the errno value at all..

Did you poll all of the completions from the CQ?
Once you have the failure in the ibv_post_send(), did you try to empty the CQ and try to post the Send Request again?
(since the QP should still be in a good shape)

Thanks
Dotan

• May 3, 2013

Thanks, Dotan! Once I reach this point, all polls keep returning 0, and if I attempt to post more sends I run into the same issue. The other side is sitting idle doing an epoll_wait() with plenty of recvs posted. So it doesn't look like an easy problem to solve. I'll try a few more experiments & update (in case someone runs into similar issues later).
Sara

• May 4, 2013

This will be great, thanks!

Dotan

• May 7, 2013

Just wanted to update on this issue real quick. I restructured the code quite a bit to make it extensible and now I don't hit upon the issue anymore. So most likely some bad coding on my part - if I had more time to spare I'll explore in detail but unfortunately I'm on a deadline so don't have a clear answer :(

• May 7, 2013

Hi Sara.

I'm happy that you overcame the bug
:)

You are most welcome!
Dotan

7. June 28, 2013

Hi Dotan,

I'm receiving 'remote invalid request error' (IBV_WC_REM_INV_REQ_ERR) with RDMA_READ requests. I checked buffer sizes, access rights, and QP type and all seems fine to me. RDMA_WRITE works, and since the only difference is the opcode (as far as I know), I don't understand the issue.

BTW: I'm new to RDMA programming and your site really helps a lot!

Thanks so far.

• June 28, 2013

Hi Stefan.

Sharing the code will be great (it will allow me to review it and give feedback..)
:)

Assuming that you have both RDMA Read and RDMA Write code,
the delta between the RDMA Write and the RDMA Read support should be:
1) The QP type is IBV_QPT_RC
2) The values of max_rd_atomic/max_dest_rd_atomic aren't zero
(setting the value to one on both sides isn't efficient, but will do the trick)
3) Verify that the r_key is correct (although if it worked with RDMA Write, it should be valid)

I hope that I helped you.
If you enjoy this blog, please publish it to other people as well.

Thanks
Dotan

• July 1, 2013

Hi Dotan,

Thanks for the fast reply. I re-checked everything and found:

1) .qp_type of ibv_qp_init_attr is IBV_QPT_RC (OK)
2) The access mask was set by:

```
if (!(remote_mr = ibv_reg_mr(remote_pd, pmydata->recv_buffer, pmydata->max,
                             IBV_ACCESS_REMOTE_WRITE |
                             IBV_ACCESS_LOCAL_WRITE | ...))) {
    perror("ibv_reg_mr");
    return NULL;
}
```
which left the flags of the QP unchanged. I set them now by calling ibv_modify_qp(). The flags seem to be alright now, but the error remains.

3) Both communication partners have the same flags, for their QPs and MRs so this should be ok.

4) Both, max_rd_atomic and max_dest_rd_atomic are set to 1 by default here. I checked it and it should also be ok.

5) As you mention, since RDMA_WRITE works r_key,l_key, and remote_addr are ok. (I also re-checked that)

What seems strange is that ibv_modify_qp() raised an invalid argument error when I called it with IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MAX_QP_RD_ATOMIC to modify the values, but modifying access flags works fine.

Code actually is a mess but basically consists of these parts:

* rdma_create_event_channel() to create event channels
* rdma_create_id() to create rdma_cm_id's
* rdma_bind_addr() and rdma_listen() on the server side
* rdma_resolve_addr() and rdma_resolve_route() on the client side
* ibv_create_cq(), ibv_alloc_pd(), rdma_create_qp() and ibv_reg_mr() to setup CQ,PD and register MR
* Exchange Key and memory Address
* Message setup:
// current message size
sge.length = imyproblemsize;
sge.lkey = client_mr->lkey;

snd_wr.sg_list = &sge;
snd_wr.num_sge = 1;
snd_wr.send_flags = IBV_SEND_SIGNALED;
snd_wr.next = NULL;
snd_wr.wr.rdma.rkey = rKey;

* Start Work:
if (ibv_post_send(client_id->qp, &snd_wr, NULL)) {
perror("21 ibv_post_send");
return -21;
}
while (!ibv_poll_cq(client_cq, 1, &wc));
if (wc.status != IBV_WC_SUCCESS) {
printf("r0: wc.status: %s\n",ibv_wc_status_str(wc.status));
perror("22 ibv_poll_cq");
return -22;
}

The code is some kind of skeleton I wrote, and it originally covers send/receive, which works fine. Modifying it to work with RDMA_WRITE caused no problem, but RDMA_READ does.

Thanks a lot.

• July 4, 2013

Hi Stefan.

Can you call ibv_query_qp() when the QP should be in RTS state and verify that:
1) The QP state is RTS
2) The value of max_rd_atomic isn't zero
3) The value of max_dest_rd_atomic isn't zero

I suspect that the failing ibv_modify_qp() call is your problem
(make sure to use the right flags for each QP state transition).

Thanks
Dotan

8. October 9, 2013

Hello Dotan,

I'm measuring latency between two RDMA NICs with IBV_WR_SEND

If I send a work request with IBV_SEND_SIGNALED flag, so when I get
IBV_WC_SEND event, does it mean that the message was delivered and the remote machine sent an ack back? Should I consider this time as a roundtrip?

Thanks.

• October 10, 2013

Hi.

It depends on the used transport type:
* If this is reliable transport type (RC), when you get Work Completion in the sender side - this means that the message was written at the remote side (and an ACK was sent back)
* If this is unreliable transport type (UC/UD), when you get Work Completion in the sender side - this means that the message was sent through the local port (no ACK/NACK will be sent)

Thanks
Dotan

• October 10, 2013

Thanks a lot.
That's what I assumed.
I'm using RC. Just to make it clear, the following flow:

1. start timer
2. send message (IBV_WC_SEND)
3. wait for receive to complete (a send on the other side is posted only when the message arrived)
4. stop timer

it measures: 2 messages + ACK for the first send + (optional: ACK to the other side for the received message)

Thanks.
Boris.

• October 10, 2013

Exactly.

One tip though: if you care about latency, you should send the message inline'd
(if the message is small).

Thanks
Dotan

9. October 15, 2013

Hello Dotan

In ibv_post_send():
1. Are the ibv_send_wr list and its sg_list destroyed automatically when the operation completes?
2. Or can I destroy them after the call returns?
3. Or do they have to be kept alive until the Work Completion is received?

Boris.
Thanks.

• October 15, 2013

Hi Boris.

The sg_list array can safely be (re)used after ibv_post_send() ends:
the Send Request is enqueued to the Send Queue space of the Queue Pair
once it is posted.

Thanks
Dotan

10. November 11, 2013

Hi Dotan.

Is there any way to know, what is the max length of INLINE data can be sent in SEND or RDMA_WRITE ?

• November 11, 2013

Hi.

Unfortunately, struct ibv_device_attr doesn't contain any attribute that specifies the maximum inline data that can be sent.
When creating a QP, qp_init_attr.cap.max_inline_data is returned with the amount of inline data that can be sent in this QP.

Thanks
Dotan

11. November 22, 2013

Hi,
I'm new to RDMA and run into a weird behavior, which I was hoping you could clarify for me:

I'm using IBV_WR_SEND to send a struct-object which contains some information needed for an RDMA-read later on (rkeys, address and so on).
Now in principle this works fine, but the strange behavior is that it only works correctly if the object size is a power of 2. So I tried these cases:
sizeof(message) -> 16. This works
sizeof(message) -> 24. The last object-attribute is always wrong, the rest is correct.
sizeof(message) -> 32. This works again.

Is this normal? I have only seen restrictions about the minimum/maximum message size, but nothing that would hint at an additional restriction of this kind. Or did I do something wrong somewhere?

Thank you very much!
Martin

• November 22, 2013

Hi Martin.

I have a feeling that the problem isn't related to RDMA.
In RDMA the minimum message size can be even 0 bytes!

I have a feeling that the problem happens because of the way the compiler lays out the structure in memory.
In RDMA, as in any other networking protocol, the application needs to take care of how data is transferred between the two machines, since the machines may differ in:
* CPU arch (32/64) bits
* Big/little endian

I have two suggestions here:
1) You can send me the source code for review, and I'll give you feedback
2) You can give me more information on what went wrong (since you didn't provide this information)

Thanks
Dotan

• November 28, 2013

Hi,

Sorry for my late response, but I was busy the last week.

So, I have a struct containing: int rkey, int remote buffer size, long remote address
If I send this, everything is fine. But now suppose I add "int id" to the struct. No matter which attribute is specified last in the struct (let's say "int id" is now the last one), that attribute is not received correctly, but gives a wrong value. All other attributes of that struct are correct.

You are probably correct that this is due to some little/big endian problem.

Thank you very much!

Cheers,
Martin

• November 28, 2013

Hi Martin.

Do you want to share the code with me? This way I'll find your bug ...

Another way for you to handle it is to write (using sprintf()) the data to an array of characters,
and send this data as a string and not as a struct (and parse it on the remote side).

I hope that this tip helped you
Dotan

12. November 22, 2013

Hello Dotan,

I have the same problem as Stefan (I get IBV_WC_REM_INV_REQ_ERR with RDMA_READ requests). I tried to follow the advice you already posted here as much as possible, but I cannot sort it out myself.

I can send you a simple program which reproduces my problem, but I would need your email (and your agreement).

Best regards,

Philippe

• November 23, 2013

Hi Philippe.

If you want to share the code with me, and I'll give you a hint
on the reason of this problem, you can send it to:
support at rdmamojo dot com

Thanks
Dotan

13. November 23, 2013

Hi Dotan,

I have a question about P_KEYS in BTH header. Once a relation is established between two QP's, both ends can modify the qp attribute pkey_index. Can both ends use different pkey_index (and ultimately different pkeys) ? i.e A can say B is using Pkey=X and B can say A is using Pkey=Y.
Thanks,

Jay

• November 23, 2013

Hi Jay.

It doesn't matter which P_Key index each QP points to
(since what really matters is the P_Key value itself, and different tables
*may* have the same P_Key values but in a different order).

If at some point, the P_Key values of both QPs won't be consistent,
the packet will be dropped
(InfiniBand spec: Figure 81 Packet Header Validation Process)

In your example: if X.key != Y.key, there will be a P-Key mismatch and
the QPs won't be able to communicate (this is the whole idea of the P_Key..)

I hope that I helped you.

Thanks
Dotan

14. November 25, 2013

I am trying to use IBV_WR_ATOMIC_CMP_AND_SWP to check a remote value and proceed accordingly. I have registered a 64-bit integer using ibv_reg_mr() and sent this remote address to the sending host. But I am getting a remote access error. The sample code you have provided is not complete.
In the sample code you have used

sg.length = buf_size;
sg.lkey = mr->lkey;

Is buf_addr a 64-bit integer or a char buffer of size 8? Would it be possible for you to send complete code of a working compare-and-swap function?

• November 25, 2013

Hi Omar.

I'm sorry, but I don't have any source code that I can share with you...
(I plan to write it in the future though)

Please verify that:
1) The remote QP supports incoming Atomic operations
2) The remote MR supports incoming Atomic operations
3) The remote address is 8 bytes aligned

Thanks
Dotan

• June 3, 2016

Hi, I came across the same problem, and still cannot figure it out. I can successfully process send/recv operation(which means qpn, psn and lid of the remote side is correct), but I fail at RDMA write operation, receiving the IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error when I call ibv_poll_cq(). Any other comments besides the above three hints?? thanks in advance.

• June 8, 2016

Hi.

Did you read my post on ibv_poll_cq()?

Anyway, check that RDMA Write is enabled in both the remote QP.qp_access_flags and remote MR.access.

Thanks
Dotan

15. December 7, 2013

Hi Dotan,

I have a question about how WRs are finished. Suppose I have built a RC connection between two QPs. First on the receive side, I post two recv WRs, say recv_wr1 and recv_wr2. Then on the send side, I post two send WRs, say send_wr1, send_wr2. My question is, is there any possibility that send_wr2 finishes before send_wr1? What about the receive side? Is is possible that recv_wr2 is finished before recv_wr1?

Thanks,
Jiajun

• December 7, 2013

Hi Jiajun.

In terms of the Completion Queue of the Work Queues, you should see the Work Completions according to the order of the corresponding posted Send Requests.

In terms of the wire, this isn't an area I'm fully familiar with, BUT:
if you send a message, every packet increases the PSN (in the Send Queue and in the remote Receive Queue),
so send_wr2 cannot be sent before send_wr1 was sent; otherwise, it wouldn't be possible to detect missing packets (using the PSNs).

Anyway, you should (re)use the memory only after the relevant Work Request isn't outstanding any more.

I hope that this helped you.

Thanks
Dotan

16. January 29, 2014

Hi
My question might seem out of context for this post but it's important.
I have to ask you how to set up an all to all communication between a number of processes, some on same machine and some on different. What I have done is open a listening rdma_cm_id wait for incoming connection requests for each process and bind it to a specific port and create new rdma_cm_id when I have completed a connection request. This works fine if all processes are on different host machines, but if I start multiple processes on the same machine, I get a very slow performance or none at all, the system hangs as if in a deadlock. I had hoped that once I have a rdma_cm_id for each process than the processes should communicate without any problem. One thing is that I have only set up one communication channel but it should suffice for many clients (the man pages say this).
Regards
Omar

• January 29, 2014

Hi Omar.

I don't have a lot of experience with rdma_cm (yet?).

If you want a good answer, I suggest that you'll send this question to Sean Hefty,
the writer and maintainer of rdma_cm.

Sorry again..
Dotan

17. January 30, 2014

It seems that the descriptions of IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD are swapped.

• January 30, 2014

Fixed, thanks.

Dotan

18. February 15, 2014

Hi Dotan,

What circumstances can cause a Send Queue to overflow?

In my program I perform an RDMA Write in a loop (every time with the same source/destination addresses, just to test), and after a while I constantly get ENOMEM from ibv_post_send(). It doesn't seem to be a race, as it always happens after the same count of iterations, and even sleeping ~1 sec between iterations doesn't affect anything; besides, the number of successful iterations is correlated with the QP's max_send_wr. None of the WRs is "signaled" (I tried to poll the CQ at every iteration - it's empty).
I might be missing something basic in the QP configuration. What initialization parameter can cause such a behavior?

Thanks.

• February 16, 2014

When creating a QP, you specify how many WRs can be outstanding in the Send and Receive Queues.
A WR is considered outstanding until a Work Completion is generated for it or for a WR that was posted after it in that Work Queue.

You posted many WRs (in your case, to the Send Queue) and all of them are outstanding.
From time to time, you need to make them "signaled" and read the Work Completions.

Thanks
Dotan

• February 16, 2014

Oh, I see. This looks like a design flaw, doesn't it? At least, it's quite counter-intuitive behavior, as one would expect that an unsignaled WR gets removed from SQ silently as soon as it's processed - after all, that's the whole point of unsignaled WRs...

• February 16, 2014

But if you don't get any Work Completion, how can you avoid posting more WRs than the Work Queue size?
You *assume* that all the posted WRs were processed, in most cases it is true,
but there isn't any guarantee about it...

Thanks
Dotan

• February 16, 2014

Well, if one produces WRs faster than the HCA can consume, the SQ will eventually overflow, and in *such* situation ENOMEM would be quite logical (like in any producer-consumer scheme) - but still, implicitly treating obviously consumed WRs as outstanding doesn't seem to fit well in this logic. Sometimes the producer can know for sure that he can never overflow the queue (for instance due to retry count/timeout settings vs. timings of WRs), and such a behavior of the queue would surprise him.

• February 17, 2014

You're starting to get into the synchronization mechanism between the low-level driver and the HW...
Anyway, this is the behavior that the protocol defines.

Thanks
Dotan

• February 16, 2014

Hi Dotan, joining the question on this issue. Is there any way (or will be) to block on ibv_post_send (until there is place in the work queue)?

Otherwise, in a multithreaded application, some synchronization semaphore-like mechanism must be applied, and it could be very costly...

• February 17, 2014

Hi Boris.

Currently, there isn't any way to block in ibv_post_send() if the Work Queue is full.
This would require low-level library and API changes (to prevent breaking the current behavior).

Sorry
Dotan

19. February 18, 2014

Dotan,

You wrote regarding inline data that "the low-level driver (i.e. CPU) will read the data and not the RDMA device". Is this correct for both sides? I.e., on the responding side, will the HCA perform DMA for the inlined data, or will the CPU handle it?

Thanks a lot for your assistance.

• February 18, 2014

This is relevant only for the local side, i.e. the side that fetches the data.
On the remote side there isn't any hint that inline was used once the data is sent over the wire.

Thanks
Dotan

20. February 19, 2014

Hi Dotan,

Is there a more straightforward and efficient way to write a value atomically to the remote side than performing an RDMA Read followed by an atomic CAS? (There are no stores to this location on the remote side, only loads, but the value must appear consistently/atomically.)

Thanks.

• February 20, 2014

Hi Igor.

The only supported atomic operations in RDMA are:

* Compare and Swap
* Fetch and Add

I don't know what you are trying to achieve, but using them you can implement
mutual exclusion primitives.

What about sending a message using "Send" and increment the value locally using a good old mutex/semaphore/spinlock?

Thanks
Dotan

• February 20, 2014

Due to some constraints I can't use the send/receive flow...
What is the level of atomicity of a regular RDMA Write? I.e., does the remote HCA store bytes or words to its local memory?

• February 20, 2014

I'm sorry, but I can't provide a good answer here.
RDMA supports sending a stream of bytes, and AFAIK there isn't any guarantee about atomic access of more than one byte.

Multiple tests may show you that atomicity of words (or more) is achieved, but there may be scenarios where this won't be the case...

Dotan

21. March 17, 2014

Hi Dotan,
Great website. Thanks for all the work.
Question about posting WRs: if I post a WR to a WQ, does a copy of the WR get made, so that after ibv_post_send() completes I am free to overwrite that WR for my own purposes? Or is just a pointer to that WR posted to the WQ, so that I have to keep it intact until the completion occurs? I tried to find the internal representation of the WQs to see if I could deduce the answer myself, but no luck.

• March 18, 2014

Thanks
:)

Long answer: the low-level driver translates the Work Request structure from the verbs API to the HW API
and posts this HW-specific WR to the relevant Work Queue.

After the verb that posts the WR returns, you are free to change this WR structure.

If you want to see how this is done, you need to check the code of the low-level drivers...

Thanks
Dotan

• March 31, 2014

Hi Dotan,
Your site is a huge help!
Regarding reuse of WR, are the ibv_sge elements copied as well?
From my reading of the code they are copied but can i reuse them when ibv_post_send returns?
Also is there a restriction on multiple WR with the same wr_id?
For example can the same id be used to identify a chain of WR posted together?
Thanks!

• March 31, 2014

Thanks!

Yes. The s/g list is copied to the QP's Send Queue, and the entries can be reused.

About the wr_id; it is a user defined private data and can contain any value that you wish..
(including multiple WRs with the same wr_id).

Sure
Dotan

22. May 29, 2014

Hi

I was wondering: Is there a performance difference between IBV_WR_RDMA_WRITE(_WITH_IMM) and IBV_WR_SEND(_WITH_IMM) ?

thx
Bernard

• May 30, 2014

Hi Bernard.

(See: "Tips and tricks to optimize your RDMA code")

Yes, there is a performance difference, so one should prefer using RDMA Write with immediate instead of Send with immediate.

RDMA Read is considered more "expensive" than RDMA Write or Send operations, so one should prefer the latter operations.

I hope that I helped
Dotan

23. August 11, 2014

Hi Dotan,

This is a fantastic website for RDMA learners! I have a question regarding the atomic operations. That is, how are the RDMA atomic operations (FetchAdd & CmpSwap) implemented? I guess there should be a locking mechanism that comes into play once the atomic operations are performed on some memory buffer. Is the lock implemented on the network (RNIC?), on the specific memory buffer, on the memory bus, or somewhere else?

Henry

• August 12, 2014

Hi Henry.

Thanks for the compliment.
:)

The atomic operations are atomic only relative to other atomic operations, and not to any other operation or any other memory access.

I don't *know* the internal implementation, but I can guess;
it depends on the supported atomicity level of the RDMA device:
* If it supports atomicity within the device - it may have an internal mechanism to prevent other atomic access to this memory
* If it supports atomicity between other devices - I guess that it will lock the bus or something like that.

AFAIK, atomicity is currently supported only within the device.

I hope that this answer helped you.

Thanks
Dotan

• August 13, 2014

Hi Dotan,

> The atomic operations are atomic related to other atomic operations and not to any other operation or any other memory access.

Do you mean that if one modifies a remote value with eg. IBV_WR_ATOMIC_FETCH_AND_ADD, this modification will *not* appear as atomic for any other software (eg. running locally on that machine) that attempts to read this memory location?

• August 15, 2014

Hi Igor.

Here is the exact quote from the InfiniBand specifications:
"o9-17: Atomicity of the read/modify/write on the responder’s node by the
ATOMIC Operation shall be assured in the presence of concurrent atomic
accesses by other QPs on the same CA."

It specifies how the RDMA device will handle the content of the memory and doesn't really mention other interfaces (such as software). For example, it *may* do the following: read, modify, write - and perform the write 10 seconds after the read happened. During this time, the RDMA device will prevent any access to this memory by other atomic operations. The (local) software isn't really aware of the operations that are done by the RDMA device...

Thanks
Dotan

24. October 13, 2014

Hi Dotan

I use ibv_post_send() to do an RDMA Write, and I found that if num_sge is 4, it returns -1; if num_sge is 2 or 1, it works fine (the buffer is 4 KB each).

How can I make it send 4(or more) num_sge buffers?

Thanks.

Zhang Yue

• October 13, 2014

Hi Zhang Yue.

Can you send the output of:
ibv_devinfo | grep max_sge

Thanks
Dotan

• October 14, 2014

hi Dotan,

The command output is these:

root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo | grep max_sge
root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:0028:0820
sys_image_guid: f452:1403:0028:0823
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 3
port_lid: 4
port_lmc: 0x00

port: 2
state: PORT_ACTIVE (4)
max_mtu: 4096 (5)
active_mtu: 4096 (5)
sm_lid: 1
port_lid: 2
port_lmc: 0x00

root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target#

• October 14, 2014

hi Dotan

I found that the queue pair config limits it:
qp_init_attr.cap.max_send_sge = 1; /* scatter/gather entries */
qp_init_attr.cap.max_recv_sge = 1;
I changed 1 to 16 and it works.

Thanks, you are nice.

Zhang Yue

• October 15, 2014

Hi Zhang Yue.

Thanks for the update.
I've updated the description of num_sge in the posts that describe the structures of Send Request and Receive Request to be more informative according to your problem.

Thanks
Dotan

25. November 5, 2014

In a UD QP, can you post an inline send with immediate data?

• November 6, 2014

Yes, you can.

Thanks
Dotan

26. November 11, 2014

Hi Dotan,

I'd like to consult with you on the following subject: we perform IBV_WR_RDMA_WRITE to a remapped BAR of a remote PCI device and experience poor throughput. Using hardware monitoring tools we found out that the data was being written in 64-byte packets, and that's what caused the above issue.
My question is whether there's any configuration that could affect the way HCA writes the data?
I post a non-signalled rdma-write, with >1K of data as a single SGE, 4K MTU.

• November 11, 2014

Hi Igor.

I'm sorry, but this is device specific and I don't know much about it.

However, I would check with the vendor of that PCI device to get more details.
Do you have performance problems when accessing the PCI device locally?
Maybe the way that this BAR is mapped to kernel can be improved?

I hope that this give you a hint...

Thanks
Dotan

• November 11, 2014

It's "device-specific" in the sense that writing 64-byte packets causes the device to get the data slowly (which doesn't happen when the HCA writes to RAM, or when we DMA to this PCI device by other means) - the device vendor confirmed this assumption.
The BAR is remapped to user-space virtual addresses with io_remap_pfn_range(), then registered as an RDMA memory region using the PeerMemory mechanism recently introduced in Mellanox OFED especially for this purpose.
I believe the remote (w.r.t to the PCI device) HCA sends the data over the fabric in MTU-sized chunks, so it's probably the local HCA that performs such a "slow", or PCI-unfriendly, DMA.
So, the question is whether we have any control over the way HCA performs the DMA?

• November 13, 2014

Hi Igor.

AFAIK, there isn't any way to control how the HCA performs the DMA.
I doubt it, but even if there are ways to do this, you'll need to get this info from the HW vendors...

Sorry.
Dotan

27. December 3, 2014

Hi,
Suppose I post two requests in the receive queue, but for some reason I received the data for the second request before the first request. Is it possible to receive data for the second request before the first, or will it always give an error?

• December 3, 2014

Hi Govind.

Receive Requests are consumed in the order in which they were posted
(the Receive Queue "knows" only the order in which those Receive Requests were posted,
and this order is guaranteed).

The next message that enters the Queue Pair and consumes a Receive Request will take
the Receive Request at the head of the queue.

I understand that your application has the semantics of the first and second one,
however, the RDMA doesn't.

Bottom line, the answer is: no.

BTW it should always give an error. You didn't give me enough info,
but I believe that the problem is that the "first" Receive Request is too small.
This can be fixed by making sure that every Receive Request can hold any of the incoming messages...

I hope that this helps you
Dotan

28. December 3, 2014

hii all,
During ibv_post_send() I am getting errno 0 and 2 for two different messages. Can someone please point me to some document where I can find a description of these errno values? I am using the OFA RDMA APIs.

• December 3, 2014

Hi.

Unfortunately, the errno return values aren't consistent across all low-level drivers in RDMA.
If you share the code, maybe I'll be able to answer you.
If you'll share the code, maybe I'll be able to answer you.

Thanks
Dotan

29. December 3, 2014

Hello,

I can successfully send RDMA READ/WRITE, but I can't get RDMA atomic operations to work. I get an error when calling the ibv_post_send() function in the client, and errno is set to "Invalid argument". Below I pasted the important parts of my code. Could you please check my code and let me know if I'm missing anything?

*********** client side *****************:
-- Registering the memory regions --
mr = ibv_reg_mr(pd, buff, size, IBV_ACCESS_LOCAL_WRITE);
// and the size is 8

if (!mr){
fprintf(stderr, "Error, memory registration failed\n");
return -1;
}

-- Preparing RDMA ATOMIC FETCH AND
struct ibv_send_wr wr, *bad_wr = NULL;
struct ibv_sge sge;

memset(&sge, 0, sizeof(sge));
sge.length = 8;
sge.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;

wr.wr.atomic.rkey = peer_mr->rkey;

if (ibv_post_send(qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
return -1;
}
********* End of Client side *******

****** Server side ****************
-- Registering the memory regions --
mr = ibv_reg_mr(pd, rdma_region_timestamp_oracle, sizeof(TimestampOracle),
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);

if (!mr){
fprintf(stderr, "Error, memory registration() failed\n");
return -1;
}

NOTE: TimestampOracle is a class with two int members, so its size is 8 bytes (satisfies 64-bit condition for RDMA ATOMIC operations)

Erfan

• December 4, 2014

Hi Efran.

I have some questions:
1) Did you check that the RDMA device supports Atomic?
2) Did you check that the remote address is 8 byte aligned?
3) Did you enable atomic at the responder QP?
4) Is this is an RC QP?

I hope that one of the above questions gave you a hint on the problem.
If not, I'll need to see more source code and information on the RDMA devices that you are using.

Thanks
Dotan

• December 4, 2014

Hello Dotan,

1) How can I check that? Do you mean that some RDMA devices support Atomic and some don't?

2) I simplified the code, so now the remote address is one (long long) variable, which is 8 bytes (I paste the code at the end of this comment).

3) As you can see in my previous comment, on the server side code, I registered the memory region to be able to be accessed atomically by ibv_reg_mr(pd, ... , ...,
IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC )). Do I need to do anything other than that?

4) When initializing the queue pairs on both client and server, I used qp_attr->qp_type = IBV_QPT_RC.

Here's the simplified code, I tried to leave unrelated parts out. I know how annoying it can be to read somebody else's lousy code. I'd really appreciate your help.

******** client code **********
void build_qp_attr(struct ibv_qp_init_attr *qp_attr){
memset(qp_attr, 0, sizeof(*qp_attr));
qp_attr->send_cq = s_ctx->cq;
qp_attr->recv_cq = s_ctx->cq;
qp_attr->qp_type = IBV_QPT_RC;

qp_attr->cap.max_send_wr = 10;
qp_attr->cap.max_recv_wr = 10;
qp_attr->cap.max_send_sge = 1;
qp_attr->cap.max_recv_sge = 1;
}

void register_memory(struct connection *conn) {
local_buffer = new long long[1];
local_mr = ibv_reg_mr(pd, local_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE);
}

void on_completion(struct ibv_wc *wc){
struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
// Assume that the client already knows about the remote_mr on the server side
if (wc->opcode & IBV_WC_RECV) {
struct ibv_send_wr wr, *bad_wr = NULL;
struct ibv_sge sge;

memset(&sge, 0, sizeof(sge));
sge.length = sizeof(long long);
sge.lkey = local_mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sge;
wr.num_sge = 1;
wr.send_flags = IBV_SEND_SIGNALED;

wr.wr.atomic.rkey = remote_mr.rkey;

if (ibv_post_send(conn->qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
die();
}

}
}
***** End of client code ********

**** Serve code ******
struct connection {
struct rdma_cm_id *id;
struct ibv_qp *qp;
struct ibv_mr *mr;
long long *rdma_buffer;
};

void build_qp_attr(struct ibv_qp_init_attr *qp_attr) {
memset(qp_attr, 0, sizeof(*qp_attr));
qp_attr->send_cq = s_ctx->cq;
qp_attr->recv_cq = s_ctx->cq;
qp_attr->qp_type = IBV_QPT_RC;

qp_attr->cap.max_send_wr = 10;
qp_attr->cap.max_recv_wr = 10;
qp_attr->cap.max_send_sge = 1;
qp_attr->cap.max_recv_sge = 1;
}

void register_memory(struct connection *conn){
rdma_region = 1ULL;

rm = ibv_reg_mr(pd, rdma_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC);
}
***** End of Server code *******

• December 5, 2014

1) In struct ibv_device_attr, there is an attribute called 'atomic_cap'.
It describes the atomicity support level of this device,
since there may be devices that don't support atomic operations at all.
(Can you tell me what its value is?)

2) Please check that the remote address value is 8-byte aligned
(can you tell me what its value is?)

3) When calling ibv_modify_qp(), there is an attribute in struct ibv_qp_attr called 'qp_access_flags';
did you enable IBV_ACCESS_REMOTE_ATOMIC on the receiver side?

4) Only RC QP supports Atomic, so I see that you are using it.
And it's o.k., I don't mind reading other people's code :)
(I'm doing it all the time).

The code looks fine, beside from my comments above.

If you can, send me by email (dotan at rdmamojo dot com):
1) The full source code
2) The parameters of your program
3) An execution example and the output of your program
4) The output of 'ibv_devinfo -v'

(there is a limit to what I can do with only a description...)

Thanks
Dotan

30. December 5, 2014

Hello Dotan,

I'm trying to speed up ibv_post_send() when sending inline messages by using unsignaled completions. The problem is that it doesn't work if I post more than "qp_init_attr.cap.max_send_wr" unsignaled send requests. I tried to post one signaled request every N unsignaled ones, but it still crashes after max_send_wr posts. What am I doing wrong?

• December 6, 2014

Hi Jaume.

The flow that you've described sounds valid. What do you mean by "still crashes"?
(I don't expect a crash in this flow, unless there is a bug).

Did you provide a valid bad_wr pointer to the ibv_post_send() verb?

Thanks
Dotan

• December 8, 2014

By crashing, I meant that ibv_post_send() fails. I do not want to spend time reading completions, so I send "unsignaled" messages. However, the unsignaled approach does not seem to work, because posting fails once the CQ fills up. The QP is created with "qp_init_attr.sq_sig_all = 0;" and messages are sent without the IBV_SEND_INLINE flag.

• December 8, 2014

"Unsignaled Work Requests" means that those Send Requests won't generate Work Completions.
However, they are still considered outstanding, which means that you need to empty the Send Queue
by posting signaled Send Requests from time to time
(otherwise, the Send Queue will become full and you won't be able to post any new Send Requests).

The IBV_SEND_INLINE isn't relevant to the signalling of the Send Requests.

Bottom line: from time to time you must post a signaled Send Request
(if the Send Queue size is N, you can post a signaled Send Request every N messages,
and by polling its Work Completion you'll empty the Send Queue).

Thanks
Dotan

• December 7, 2014

Jaume, note that you have to process completions in the completion queue.

31. December 8, 2014

Hi Dotan,
I am trying to post a Send Request to a queue that is already full, and I am getting an error (ENOMEM). So I sleep for some time and post the same request again, but it throws the same error. (Consider that after the sleep time the Send Queue is not full.)

• December 8, 2014

Hi Govind.

Did you poll some Work Completions (which were posted to that Work Queue) from the associated CQ during this time?

Thanks
Dotan

• December 8, 2014

Yes, I did, and I am getting an error there as well... Currently I solved this issue by checking the number of pending requests in the Send Queue (using the idea that you mentioned in one of the comments) before posting any request, and it is working; but I don't want to do that because of the performance cost. One more thing: how should I increase the maximum limit of pending requests in the queue? Thanks for all the help and suggestions, I really appreciate it.

• December 8, 2014

I'm glad that I can help.
:)

Which error do you get?
Can you share the source code?
It will be easier for me to help you with the source in front of me.

Thanks
Dotan

32. December 9, 2014

Hi Dotan,
I can't share the code (confidentiality issues), but I can tell you the error numbers: the first error I get is error number 12, and then error number 5 for all the other messages during polling of the CQ. Can you please tell me how to increase the maximum limit of pending requests in the queue? Currently I am able to post ~8192 requests.

• December 9, 2014

Hi Govind.

When calling ibv_create_qp(), you control the Send Queue size (please refer to the post on this verb for more information).
I suspect that you have completions with error (i.e. the 5 and 12 values that you reported).
Am I right? (Are those the status values of the Work Completions that you polled?)

If this is the case, completion status 12 = IBV_WC_RETRY_EXC_ERR, which means that the remote side didn't answer within the expected time.

Thanks
Dotan

33. December 22, 2014

Hi Dotan,
First of all, thanks for all your help. Finally my code is working, and currently I am getting 3 times better performance with RDMA compared to UDP. I have a few more questions: how much improvement (at most) can we expect with RDMA compared to UDP? Currently I am using only channel semantics; is there a good chance of improving if I also use memory semantics?

• December 22, 2014

Hi Govind.

I'm happy that I can help
:)

1) Performance is a very big area. Which metrics do you check? What are the current numbers with UDP?
Do you compare using an RC QP or a UD QP? Which operations do you use?
2) What do you mean by channel semantics and memory semantics?

Thanks
Dotan

34. December 22, 2014

I am using an RC QP and comparing with the UDP protocol on the basis of the waiting time for requested data.
By memory semantics I mean that I am not allowing the remote node's channel adapter to write directly to host memory using an rkey (all read/write operations are done by the local channel adapter using an lkey); the reason for using only channel semantics is that I am transferring a very small amount of data at a time.

• December 23, 2014

So, I guess that your metric is latency.

I suggest that you execute a tool that comes with the OFED package called ib_send_lat,
which will give you the (best) latency that you can achieve using SEND operations in your setup.

Performance depends on so many factors that I prefer not to provide a number.

Thanks
Dotan

35. December 23, 2014

Hi Dotan,
(at the target side) When I'm doing an RDMA Read with 4 WRs, each WR having 1 SGE (4KB), the initiator easily crashes or /dev/sdxx disappears. (Doing RDMA Write is fine.)
I've set each WR's rkey and increased remote_addr by 4096; any suggestions?

Thanks
Zhang Yue
ps:
for (k = 1; k < cache_req.sglist_size; k++)
{
multi_wr[k] = rdmad->send_wr; // copy struct

multi_wr[k].next = &multi_wr[k+1];
multi_wr[k].send_flags = 0; // zy: should be 0, otherwise the task will be freed multiple times

}

// insert into the list

// this sge.length marks the total length; it will be used in iser_rdma_rd_comp_complete_handler

// so we need to keep the first wr's sge somewhere else

• December 23, 2014

Hi.

I don't know if this is related to RDMA.

I would suggest checking whether the local buffer that is being filled
is still allocated or has already been freed.

Maybe you should print the local address and check whether the values make any sense.

Also, please check that the Work Completion status is o.k. before using the values.

Thanks
Dotan

• December 25, 2014

Hi Dotan

First, Merry Christmas to all of us!
Yes, this issue is NOT related to RDMA.

Yesterday I printed every WR before calling ibv_post_send() and found an issue:
after doing a lot of 16KB writes, tgt may receive an INQUIRY, and if the INQUIRY unluckily uses a task struct that was previously used by a 16KB write (or read),
it will use the old four 4KB buffers and DMA them to the initiator. The INQUIRY only reads 70 bytes, so DMAing 16KB to it corrupts the initiator's memory.

The main fix is: check the needed DMA length, and if it is <= 0, skip the remaining buffers.

Thanks

Zhang Yue

• December 25, 2014

Hi.

Merry Christmas indeed
:)

I'm happy that you found the problem.

Dotan

36. January 15, 2015

Hi Dotan

I am trying to use the IBV_WR_ATOMIC_CMP_AND_SWP operation, and I get an error when I poll the WC: IBV_WC_REM_ACCESS_ERR.

I just made some simple modifications based on the code provided in the book "RDMA_Aware_Programing_user_manual"; do you know what the problem is?

• January 15, 2015

Hi.

Please check that IBV_ACCESS_REMOTE_ATOMIC is enabled in the remote memory buffer and in the remote QP.

Thanks
Dotan

37. January 22, 2015

Hi Dotan,

I want to post a request, but I want the remote QP to discard this request as soon as it receives it. This is because I want to send a dummy packet while the QPs are in the REARM state in order to reach the ARMED state (an incoming packet is needed for this transition).

I am using the configuration below and it seems to be working, but I would like to know whether you think this could be a generic approach for any situation:

struct ibv_send_wr wr, *bad_wr;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = NULL;
wr.num_sge = 0;
wr.opcode = 0;
wr.send_flags = 0;

if (ibv_post_send(qp, &wr, &bad_wr)) {
fprintf(stderr, "Error, ibv_post_send() failed\n");
return -1;
}

Best regards,
Jesus Camacho

• January 23, 2015

Hi Jesus.

You are sending a "standard" zero-byte message. This can work, but you consume a Receive Request on the remote side.
Did you consider sending a zero-byte RDMA Write?

Thanks
Dotan

38. January 23, 2015

Hi Dotan,

I am currently using opcode 0 (which is the IBV_WR_RDMA_WRITE operation) and it is working fine with the InfiniBand microbenchmarks.

Is that what you are suggesting me?
If so, do you think this can be extrapolated to any scenario?

Jesus

• January 23, 2015

Hi.

Yes, this was my suggestion.
What do you mean by "do you think this can be extrapolated to any scenario"?

Thanks
Dotan

• January 23, 2015

Hi,

I mean if this is a general solution.

Do you think that this is going to work when using another benchmarks, applications, etc.?

Best,
Jesus

• January 23, 2015

Hi.

Yes. Using zero-byte messages is valid and can always be used.
Working with such messages with the RDMA Write opcode can provide better performance than the Send opcode.

Thanks
Dotan

• January 23, 2015

Hi,

good to know!

Jesus

• January 23, 2015

Sure
:)

Dotan

39. April 13, 2015

Hello Dotan,

I have a quick question. What happens if the local node calls ibv_post_send() with opcode IBV_WR_SEND before the remote node calls ibv_post_recv()?

Thanks!

• April 14, 2015

Hi John, the answer won't be quick though
;)

The thing that matters is not when the sides posted the Send/Receive Requests in absolute time,
since one may not know when the actual scheduling of the Send Request will take place...

If a message that consumes a Receive Request is received by a Queue Pair when there isn't any available Receive Request in that Queue,
an RNR (Receiver Not Ready) flow will start for Reliable QPs. For Unreliable QPs, the incoming message will be (silently) dropped.

Thanks
Dotan

• April 14, 2015

Hello Dotan,

I am using a Reliable QP, so I think I will get RNR errors. Now I have a couple of choices: (a) when getting an RNR error, back off and re-post the send request later; (b) implement a flow-control protocol so that the local node posts send requests only when the remote node is ready. I like (b) more than (a), but (b) adds complexity and needs to handle cases such as both nodes waiting for the other side to become ready. :-)

So I am wondering if there is a common practice.

Thanks!

• April 15, 2015

Sure :)

In RNR flows, the problem is that the receiver side doesn't post Receive Requests fast enough...

a) When you get an RNR error, your local QP is in the ERROR state, so you can't post another Send Request without reconnecting it with the remote QP.
b) is a good idea.

There are more options:
* You can increase the RNR timeout
* You can increase the RNR retry count (the value 7 means infinite retries)
* If you have several QPs at the receiver side, you can use an SRQ and make sure that the SRQ is never empty
(the SRQ LIMIT mechanism can help you detect when the number of Receive Requests drops below a specific watermark)

Adding flow control to your messages is always a good idea, in order not to enter the RNR flow in the first place..

Thanks
Dotan

40. May 29, 2015

Hi Dotan,
I have a few questions related to the connection of an RC Queue Pair.

1. If ibv_post_send() fails, can we consider the connection lost?
-> Assuming all the fields in the message are correct and the Send Queue is not full. Is the converse also true, that being able to post means there is a working connection between the nodes?

2. Is it possible to receive a send WC with an error when there is an active, working connection between the nodes, assuming the message was correct and the receiver also posted a Receive Request (no RNR error)?

3. If we post Send Requests beyond the maximum limit of the Send Queue, will it corrupt the Queue Pair so that no further posts are allowed? If not, can we post the same request again without any change?

• May 29, 2015

Hi.

1. Failure of ibv_post_send() means that one of the Send Requests is invalid or the Send Queue is full;
it doesn't mean that the connection is closed. In that case, no new Send Request was added to the Send Queue.

You can post a Send Request to a Queue Pair which was configured with bad remote attributes
("bad" meaning not the attributes that should have been configured...), i.e. no connection.

2. In general, no; but this question is tricky...
Which completion status did you get?

3. If you posted Send Requests beyond the maximum limit and all of them are unsignaled - you have a problem.
The Queue Pair isn't corrupted, but you can't post any more Send Requests to it:
the status of the outstanding Send Requests is undetermined for the sender side.
The receive side of this Queue Pair is still fully operational.

You must recover it by moving it to the Error/Reset state and reconnecting the Queue Pairs.

I hope that I helped you
Dotan

41. June 3, 2015

Hi Dotan:

Nice to meet you. I'm from China, and my English is not very good. Recently I have learned something about RDMA, but I've met a problem:

This is my test program:
server code :

/*
*/

#include
#include
#include
#include
#include
#include

#define VERB_ERR(verb, ret) \
fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

#define MB 1024 * 1024

/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH MB

/* Resources used in the example */
struct context
{
char *server_name;
char *server_port;
unsigned int msg_count;
unsigned int msg_length;
/* Resources */
struct rdma_cm_id *id;
struct rdma_cm_id *listen_id;
struct ibv_mr *recv_mr;
char *recv_buf;
};

int
reg_mem(struct context *ctx)
{
ctx->recv_buf = (char *) malloc(ctx->msg_length);
memset(ctx->recv_buf, 0x00, ctx->msg_length);

ctx->recv_mr = rdma_reg_msgs(ctx->id, ctx->recv_buf, ctx->msg_length);
if (!ctx->recv_mr) {
VERB_ERR("rdma_reg_msgs", -1);
return -1;
}

return 0;
}

int
{
int ret;
struct ibv_qp_init_attr qp_init_attr;

memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;
hints.ai_flags = RAI_PASSIVE; /* this makes it a server */

ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
return ret;
}

memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;

printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}

return 0;
}

int
get_connect_request(struct context *ctx)
{
int ret;
printf("rdma_listen\n");

ret = rdma_listen(ctx->id, 4);
if (ret) {
VERB_ERR("rdma_listen", ret);
return ret;
}

ctx->listen_id = ctx->id;
printf("rdma_get_request\n");
ret = rdma_get_request(ctx->listen_id, &ctx->id);
if (ret) {
VERB_ERR("rdma_get_request", ret);
return ret;
}

if (ctx->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
printf("unexpected event: %s", \
rdma_event_str(ctx->id->event->event));
return ret;
}

return 0;
}

int
establish_connection(struct context *ctx)
{
int ret;
struct rdma_conn_param conn_param;

/* post a receive to catch the first send */
ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}

memset(&conn_param, 0, sizeof (conn_param));
conn_param.responder_resources = 2;
conn_param.initiator_depth = 2;
conn_param.retry_count = 5;
conn_param.rnr_retry_count = 5;

printf("rdma_accept\n");
ret = rdma_accept(ctx->id, &conn_param);
if (ret) {
VERB_ERR("rdma_accept", ret);
return ret;
}

return 0;
}

int
recv_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;

ret = rdma_get_recv_comp(ctx->id, &wc);
if (ret id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}

return 0;
}

int
main(int argc, char** argv)
{
int ret, op, i, recv_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;

memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));

ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;

while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
switch (op) {
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg) * MB;
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}

printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
printf("port: %s\n", ctx.server_port);
printf("count: %d\n", ctx.msg_count);
printf("length: %d bytes\n", ctx.msg_length);
printf("\n");

if (ret) {
goto out;
}

ret = get_connect_request(&ctx);
if (ret) {
goto out;
}

ret = reg_mem(&ctx);
if (ret) {
goto out;
}

ret = establish_connection(&ctx);

recv_cnt = 0;
for (i = 0; i < ctx.msg_count; i++) {
if (recv_msg(&ctx)) {
break;
}
++recv_cnt;
}
printf("recv %d messages, each message is %d bytes\n", \
recv_cnt, ctx.msg_length);

rdma_disconnect(ctx.id);

out:
if (ctx.recv_mr) {
rdma_dereg_mr(ctx.recv_mr);
}

if (ctx.id) {
rdma_destroy_ep(ctx.id);
}

if (ctx.listen_id) {
rdma_destroy_ep(ctx.listen_id);
}

if (ctx.recv_buf) {
free(ctx.recv_buf);
}

return ret;
}

client code:

/*
*/

#include
#include
#include
#include
#include
#include

#define VERB_ERR(verb, ret) \
fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

#define MB 1024 * 1024

/* Default parameter values */
#define DEFAULT_PORT "51216"
#define DEFAULT_MSG_COUNT 100
#define DEFAULT_MSG_LENGTH MB
#define DEFAULT_MSEC_DELAY 500

/* Resources used in the example */
struct context
{
char *server_name;
char *server_port;
unsigned int msg_count;
unsigned int msg_length;
/* Resources */
struct rdma_cm_id *id;
struct ibv_mr *send_mr;
char *send_buf;
};

int
reg_mem(struct context *ctx)
{
ctx->send_buf = (char *) malloc(ctx->msg_length);
memset(ctx->send_buf, 'a', ctx->msg_length);

ctx->send_mr = rdma_reg_msgs(ctx->id, ctx->send_buf, ctx->msg_length);
if (!ctx->send_mr) {
VERB_ERR("rdma_reg_msgs", -1);
return -1;
}

return 0;
}

int
{
int ret;
struct ibv_qp_init_attr qp_init_attr;

memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;

ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
return ret;
}

memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;

printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}

return 0;
}

int
establish_connection(struct context *ctx)
{
int ret;
struct rdma_conn_param conn_param;

memset(&conn_param, 0, sizeof (conn_param));
conn_param.private_data_len = sizeof (int);
conn_param.responder_resources = 2;
conn_param.initiator_depth = 2;
conn_param.retry_count = 5;
conn_param.rnr_retry_count = 5;

printf("rdma_connect\n");
ret = rdma_connect(ctx->id, &conn_param);
if (ret) {
VERB_ERR("rdma_connect", ret);
return ret;
}

if (ctx->id->event->event != RDMA_CM_EVENT_ESTABLISHED) {
printf("unexpected event: %s",
rdma_event_str(ctx->id->event->event));
return -1;
}

return 0;
}

int
send_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;

ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length,
ctx->send_mr, IBV_SEND_SIGNALED);
if (ret) {
VERB_ERR("rdma_post_send", ret);
return ret;
}

ret = rdma_get_send_comp(ctx->id, &wc);
if (ret < 0) {
VERB_ERR("rdma_get_send_comp", ret);
return ret;
}

return 0;
}

int
main(int argc, char** argv)
{
int ret, op, i, send_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;

memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));

ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;

while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
switch (op) {
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg) * MB;
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}

printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
printf("port: %s\n", ctx.server_port);
printf("count: %d\n", ctx.msg_count);
printf("length: %d bytes\n", ctx.msg_length);
printf("\n");

if (!ctx.server_name) {
printf("server address must be specified for client\n");
exit(1);
}

if (ret) {
goto out;
}

ret = reg_mem(&ctx);
if (ret) {
goto out;
}

ret = establish_connection(&ctx);

send_cnt = 0;
for (i = 0; i < ctx.msg_count; i++) {
if (send_msg(&ctx)) {
break;
}
++send_cnt;
}
printf("send %d messages, each message is %d bytes\n", \
send_cnt, ctx.msg_length);

rdma_disconnect(ctx.id);

out:
if (ctx.send_mr) {
rdma_dereg_mr(ctx.send_mr);
}

if (ctx.id) {
rdma_destroy_ep(ctx.id);
}

if (ctx.send_buf) {
free(ctx.send_buf);
}

return ret;
}

What I can't understand is that sometimes this program takes 1 minute to send 1GB of data and sometimes it only needs 0.2 seconds, so it's not very stable.

I really don't know why. Can you give me some advice?
Thank you!

• June 3, 2015

Hi.

The code that you sent me is corrupted (a problem with posting it in a comment).
Can you please send it to me?
dotan at rdmamojo dot com

Thanks
Dotan

• June 4, 2015

Hi Dotan,

Thanks for the quick reply! I have sent my code to you by email. Thank you very much.

• June 4, 2015

Hi ChenCong Fu.

As I wrote in the mail, the problem is that the sender's Queue Pair enters the Receiver Not Ready (RNR) flow,
which harms the performance, and this is what you sometimes see.

Thanks
Dotan

42. June 9, 2015

Hello Dotan,
Thanks a lot for your help.
I have a design question; would you mind taking a look?

I have a client and a server; the client wants to send a lot of data to the server. Instead of using a Send operation to send the data directly from client to server, the client registers a memory region that includes the data and uses a Send operation to tell the remote server the virtual address of the data. Once the server receives this request from the client, it posts an RDMA Read operation to read the data directly from the client side.
What's the best way to do this?
At the beginning the server needs to receive a so-called "rdma msg" from the client so that it knows where to read the data on the remote side (client), which means we need to put our RDMA Read operation inside the receive-completion handler at the server side: only when the server finishes receiving the "rdma msg" from the client does it know where to read and can start the Read operation.

Is it OK to put the RDMA Read operation inside the receive-completion handler? Do you have any advice on this design?

Thanks a lot for your time!

All the best
Jack

• June 9, 2015

Hi Jack.

I'm glad to help where I can
:)

I would suggest using RDMA Write to send the data instead of RDMA Read,
i.e. the server allocates blocks and advertises their attributes to the client,
and the client initiates the RDMA Write(s).

The last RDMA Write can be with immediate data, to let the server know that it was the last message
(or from time to time during the transfer as keep-alive messages, letting the server know how many
messages to expect).

Thanks
Dotan

• June 12, 2015

Thanks a lot Dotan!

• June 13, 2015

Thanks a lot Dotan!
I will try to do both write and read.
While implementing it, I found a weird situation. I am putting the client and server on the same machine and performing an RDMA Read operation between them. The receiver (reader) can only read half of the data from the sender.

Do you have any idea? I have updated my firmware to the newest one (May 2015); my device is ConnectX-3. Does it support performing an RDMA Read operation in a local loopback?

Thanks a lot!

Jack

• June 13, 2015

Hi Jack.

I would double-check the length of the S/G entries in your Send Requests.

Thanks
Dotan

• June 15, 2015

Hello Dotan,
Thanks for your help. I have checked the S/G entry lengths; they are sufficient for the requests (the entry lengths are equal to the number of bytes of data).
I don't know what to do.

All the best
Jack

43. June 15, 2015

Thanks Dotan, I figured it out. Something wrong in another module...

• June 15, 2015

Great!

As I said, the RDMA device you mentioned works great (I have worked, and still work, with it personally).
:)

Thanks
Dotan

44. June 16, 2015

Hello Dotan,
I want to ask a question.
Suppose we want to send a huge message via ibv_post_send() that requires more than one Work Request, so we use a Send Work Request list.
For example, say the list contains 2 Work Requests (sendwr0, sendwr1).
For sendwr0 and sendwr1:
1) Do I need to assign them the same Work Request ID, because they basically represent the same message?
2) About the send flags: do I only need to set IBV_SEND_SIGNALED on the last request (in the case above, sendwr1)?

• June 17, 2015

1) No, you don't *need* to do it, but you *can* do it.
wr_id is an attribute for the application's use (or non-use).
If your application needs to know that the two Work Completions belong to the same message, you can use it as a hint.

2) You can set the SIGNALED flag on the second Send Request and get one Work Completion if everything goes fine.

The RDMA stack doesn't know (or care) that you used two Send Requests for one application message
(from the RDMA stack's point of view, you have two different messages).

Thanks
Dotan

• June 18, 2015

Thanks a lot Dotan, that's helpful!

45. June 18, 2015

Hello Dotan,
I would like to confirm whether my understanding of FRWR is correct.
If we have a sender and a receiver (reader), before they can start, the sender needs to call post_send() twice, right? The first post_send registers the memory (FRWR) with the NIC; the second one actually transfers the virtual addresses of these FRWR memory regions.
1) post_send the FRWR to store the incoming data
2) post_send to actually read the data
3) post_send to tell the remote side (sender) to invalidate the memory region (when the receiver finishes reading)
Is that correct?
And how are we supposed to know how many FRWR Read operations can be performed concurrently before we invalidate the first FRWR? Using query-device, I could not find this information; would you mind giving me a hand?

All the best
Jack

• June 21, 2015

Hi Jack.

I don't have any experience with FRWR operations, but let me try to help you anyway.
I assume that you are using RDMA Read (although you didn't write it..); this is the reason for the second post_send.

According to your scenario (using RDMA Read), yes - three post_sends are needed.

I don't really understand what you mean by:
"...how many FRWR read operations can be performed concurrently before we invalidate the first FRWR".

Thanks
Dotan

• June 22, 2015

Thanks a lot Dotan!
"...how many FRWR read operations can be performed currently before we invalidate the first FRWR".
Because with FRWR (at least from my understanding), we register a memory region, use it, and then invalidate it.
So to increase performance, the receiver (reader) may perform a couple of Read operations concurrently and will need to invalidate each specific MR when it's done; so my question was actually about how many Read operations we can perform, and I think it depends on my system.

Do you know where I can find more info about FRWR? I tried to search online, but I could not find too much info.

• June 23, 2015

Yes. It is your decision when to invalidate this Memory Region.

AFAIK, the InfiniBand specification is the only place where you can get information on FRWR.

Thanks
Dotan

46. June 29, 2015

Hello Dotan,
If I have a very large amount of data (divided into multiple chunks) to send out, there are two possible ways of doing it.
The first is using one Work Request (but this needs extra CPU time for a memory copy).
The second is using multiple RDMA Work Requests (no extra CPU time for a memory copy, but multiple Work Requests must be posted).

Which one is better?

All the best
Jingyi

• July 4, 2015

Hi Jingyi.

You can use one Send Request with a scatter/gather list;
this way you'll be able to eliminate the memory copy and send one message from multiple buffers.

If not, the best solution depends on the total message size:
* If it is small (~ < 1KB), I think that the first approach is the best.
* If the total message size is big, the second approach will give you the best performance. I suggest using selective signaling and creating a Work Completion only for the last Send Request.

Anyway, if performance is highly critical, the best way is to implement both approaches and measure the results (you develop once and use many times...).

I hope that this helped you.

Thanks
Dotan

47. July 6, 2015

Hello Dotan,
I have an idea; I am not sure if it's possible.
Suppose the sender has 10 chunks of data that need to be sent to the remote side (still the send/recv model).
I was thinking whether it's possible to perform Read and Write operations at the same time.
Back to our assumption: for the 1st chunk, the receiver (reader) reads from the sender, and at the same time the sender writes the 2nd chunk to the receiver (reader); for the rest of the chunks we do something similar. So we can improve the speed by keeping both sides busy, right?
Is the above approach possible? If so, I believe the challenge will be the ordering issue: how can we make sure that the chunks are delivered in order? Is there a good way to do it?

All the best
Jack

• July 7, 2015

Hi Jack.

Yes, RDMA Reads and Writes can happen at the same time
(obviously they are initiated by both sides).

I'm not really sure how much improvement it will give compared to the complexity
(maybe you would want to work with several QPs in parallel).

What is the meaning of order here?
Each QP can place the data in a different (predefined) location:
in a Write, you specify the remote location that the data will be written to;
in a Read, you specify the local location that the data will be written to.

So, at the end, all the chunks can be placed in one contiguous block.

Thanks
Dotan

48. July 9, 2015

Hello Dotan,
When I am doing an RDMA Write operation, I noticed a very interesting problem.
After we successfully post a Write Work Request and poll the corresponding WC, wc.byte_len is not the valid number of bytes we have written. For an RDMA Read operation, wc.byte_len is the number of bytes we read from the remote side, but for a Write operation we can't rely on it. I took a look at the driver: wc.byte_len isn't updated for a Write operation (if opcode = RDMA Write), but it is updated for a Read operation.
I also checked the InfiniBand specification; in the RDMA Write section it says we can depend on DMALen, but strangely it doesn't say anything about wc.byte_len.
Why is wc.byte_len updated for a Read operation but not for a Write?

All the best
Jack

• July 13, 2015

Hi Jack.

I *think* (since I'm not one of the IB spec authors) that if you are the requestor side of an RDMA Write or Send, you already know how much data you sent. If needed, you can maintain local information associated with each Send Request and hold a pointer to it in wr_id.

Thanks
Dotan

• July 15, 2015

Thanks Dotan!
Actually there's another confusion in the driver. If we post a Send Request and it fails, it seems that we still can't rely on wc.opcode, because the driver doesn't update it. Is there a design reason
why the driver doesn't need to update wc.opcode in the failure case?

All the best
Jack

• July 15, 2015

Hi Jack.

This is by design. Look at the post on ibv_poll_cq() for more details on valid attributes when Work Completion has an error.

Thanks
Dotan

49. July 16, 2015

Thanks for all the great info!

I didn't realize the IB verbs layer itself needs completion events created by the application layer until I saw your response to Igor R. When I first saw the description of the deadlock when the WQ is filled with non-signaled operations, I thought you were referring to the application-layer software needing completion events to keep a count of outstanding operations, to make sure the WQ is never filled.

Do you know why IB verbs pushes WR flow control back into the application layer by going into the error state when the WQ fills, instead of returning EAGAIN or EWOULDBLOCK like send(), recv(), read() or write() do for non-blocking I/O to a busy device?

• July 17, 2015

Hi Mark.

There isn't any problem if the Send Queue is full of Send Requests, as long as at least one of them is Signaled (i.e. will generate a Work Completion).

The problem only exists if all the posted Send Requests are non-signaled.

Letting the low-level driver or the HW do the bookkeeping of which Send Requests are signaled and which aren't would decrease performance, since before any Send Request is posted, the low-level driver would need to check whether there is a potential problem.

The application knows what it is doing, and can easily avoid this pitfall.

Thanks
Dotan

50. September 9, 2015

Hi, Dotan.
I have a question about parallel RDMA Reads. Since RDMA is an asynchronous model, before one RDMA Read finishes we can launch another, so there may be many unfinished RDMA Reads at a time; their number may exceed the initiator depth and responder resources. What happens when they are exceeded? Will the NIC launch the RDMA Read as usual, or will it wait until the number of unfinished RDMA Reads no longer exceeds the limit?

I use this parallel RDMA Read model in a cluster. When I don't limit the parallelism, I fail with IBV_WC_RETRY_EXC_ERR, but when I limit the number of parallel RDMA Reads, I succeed.

Is there any limit on parallel RDMA Reads? Or should we avoid this? Thanks!

• September 13, 2015

Hi.

Per QP, there are attributes for the number of RDMA Read + Atomic messages that can be sent in parallel.
If wrong values are used (for example, the initiator is configured to send more Reads than the destination can accept),
there will be a retry flow and the initiator side may get a completion with a RETRY EXCEEDED error (as you've seen).

The following attributes in the device capabilities are relevant to this operation:
* max_qp_rd_atom
* max_qp_init_rd_atom

They describe the supported number of RDMA Read and Atomic operations per QP (for initiator and target, respectively).

Thanks
Dotan

• September 13, 2015

Thanks very much! I ran into such a problem. I use shell/python and rping to compose an RDMA shuffle cluster: every node runs a server-mode process (which uses a thread for every incoming client connection), and there are also N client-mode processes on every node, which set up connections with the other nodes in the cluster. Since rping is an RDMA Read - ACK - RDMA Write - ACK procedure, there is only one outstanding RDMA operation at any time, yet I get the IBV_WC_RETRY_EXC_ERR error. In my opinion, there should be no reason for this error to occur.

By the way, when the cluster is just 15 nodes there is no error; errors occur when there are 30 nodes in the cluster.

Can you give some advice how to deal with this?

• September 15, 2015

Hi.

The problem is that there is one more attribute, 'max_res_rd_atom': the total number of RDMA Reads and Atomics that this device supports as the target.
AFAIK there isn't any sync or protocol which guarantees that no more RDMA Read / Atomic operations than this value are targeted at the device.

Thanks
Dotan

51. September 14, 2015

Hi Dotan,

I know it is not safe to ibv_post_recv several messages on the same address. But is it safe to ibv_post_send several messages on the same address? If so, is there any performance difference between posting the same address and different ones?

Thanks,
Tingyu

• September 15, 2015

Hi Tingyu.

The problem with posting multiple Receive Requests to the same address is that the content isn't consistent
(i.e. one cannot predict the value of the buffers since there isn't any guaranteed order between different Work Queues).

Sending multiple messages from the same address doesn't have this problem.

Thanks
Dotan

• September 15, 2015

Hi Dotan,

Thanks for this reply! I understand the data will not be consistent, but I wonder if RDMA allows this type of operation. So I tested by posting several Receive Requests to the same address, and the RDMA library threw an error during ibv_poll_cq on the sender side, setting wc.status to 12. Could you explain why? Is there any internal mechanism in the RDMA library that prevents reusing the same buffer?

Thanks,
Tingyu

• September 17, 2015

Hi Tingyu.

wc.status 12 means IBV_WC_RETRY_EXC_ERR.
This means that there was a transport error at some point.

Reusing the same buffer is legal in RDMA.

Thanks
Dotan

52. September 24, 2015

Hi, Dotan,
does the RC QPs guarantee the ordering of RDMA_WRITE WR? For example, if an "initiator" issues 2 consecutive IBV_WR_RDMA_WRITEs into the same remote memory location will the "target" always end up with the data from the second operation (ie, the second WR will always update remote memory after the first one) ?

• September 26, 2015

Hi Valentin.

I will be careful here:
* From the network point of view, the first message will reach the destination before the second one.
* The memory will be DMA'ed (by the RDMA device) according to the message ordering.

If the memory controller and cache in the server honor this (as I expect in most architectures),
I guess the answer is "yes".

Thanks
Dotan

53. September 30, 2015

Hi Dotan,

Is there any limit on the maximal message size posted using ibv_post_send? Say 16MB, 32MB, 64MB, 128MB? The problem for me is that when I try to post a message larger than 16MB, there is a problem (my code first posts a 16MB Receive Request using ibv_post_recv, then posts a 16MB send message using ibv_post_send to the other side. The first posted receive buffer is to receive the ack message from the other side). It turns out that the remote side doesn't receive the posted message (the other side also posted a 16MB receive buffer before receiving the message, and the connection between the two has already been established). ibv_poll_cq on the sender side returns a wc with status 12. Do you have any idea about this issue? I don't know how to debug it; could you give me any instruction on how to debug?

Thanks for help!
Tingyu

• September 30, 2015

Hi Tingyu.

The maximal message size can be found in the port properties: max_msg_sz (in general, RDMA supports up to 2GB messages).
Posting bigger messages will end with completion with error.

A completion with status 12, IBV_WC_RETRY_EXC_ERR, indicates that there is a transport problem.
I suspect that the remote side isn't ready yet, or finished its work and closed all the resources.

Thanks
Dotan

• October 2, 2015

Hi Dotan,

Thanks. I just checked: max_msg_sz was 2GB. To find the transport problem, I used the example "helloworld" code on github https://github.com/tarickb/the-geek-in-the-corner, as I got the same status 12 when the message size was set to 256MB (messages with smaller sizes worked).
The network I used was QLogic, so is it possible there was something wrong with the hardware or the underlying verbs implementation? Or was there anything wrong with the InfiniBand setup? Do you know a way to debug the problem?

Many thanks,
Tingyu

• October 19, 2015

Hi.

I didn't work with QLogic HW, so I don't have any feedback to give you.
I would suggest using the libibverbs examples (I know them and they always work).

Thanks
Dotan

54. October 27, 2015

Hello Dotan,

Will work requests be modified after posting them?

In more detail: assuming a list of requests headed by wr is posted by calling ibv_post_send(qp, wr, &bad_wr); will the fields, including the next pointers of the requests, be modified by the library?

Thanks so much!
Jon

• November 7, 2015

Hi Jon.

After a Send Request has been posted, it can be modified by the application.

During ibv_post_send(), the low-level library translates the libibverbs Send Request into a HW-specific Send Request and "tells" the RDMA device that new SRs were posted.

Thanks
Dotan

55. October 28, 2015

Hi Dotan,
I was wondering what is the behavior of an RDMA read of a remote memory if the remote machine is also writing to it concurrently?

More formally, suppose host A is reading using RDMA read, a variable v which is local to host B. If the value of v before the start of the read operation was 'a', and B is writing to v the value 'b' concurrently with the read operation, what is the return value of read going to be? Is it guaranteed to be either 'a' or 'b' or can it be a possibly garbage value too because of the local write or remote read not being atomic?
Thanks,
Sagar

• November 7, 2015

Hi Sagar.

Local Read and Local Write are not atomic and you may get garbage...

If you want to guarantee atomicity, you must use the Atomic operations.

Thanks
Dotan

• November 11, 2015

Thanks for the reply. I can see this happening when we are writing to large memory segments. Is this also true if we are writing to single instance of native data types (bits, bytes, integers, floats etc.)?

• November 11, 2015

If you don't use Atomic operations, there isn't any guarantee to atomic access even for small (and native) data types.

Thanks
Dotan

56. October 29, 2015

Hi.

First of all I would say thank you for this site and your comments, they are very useful.

My question :

I know that the atomic operations are maybe not very popular, but I have to use them. I have modified the rdma-file example to send one uint64_t-sized structure, and I am also using the example provided above. On the server side it is OK - I can see the structure changing. The problem is on the client side: I don't understand when and how I can check the swapped value. Can I check it directly after ibv_post_send, or should I wait or do something different? Right now I see nothing after ibv_post_send, but if I send back some message via a different MR, I see the swapped value. Can you give me a hint?

• November 7, 2015

Hi Vasily.

Thanks for the feedback
:)

It isn't really true that atomics aren't popular - it depends on what you are trying to do.

If you want to examine the value on the client side (i.e. the side that calls ibv_post_send()),
this can be done only after the Send Request processing has ended, i.e. after the Work Completion of the corresponding Send Request was polled from the Completion Queue.

Thanks
Dotan

57. November 26, 2015

hi Dotan,
When I use ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) to transfer one large message (200K) using one Work Request in UD mode, with wr->opcode = IBV_WR_SEND and wr->num_sge = 1,
an IBV_WC_LOC_LEN_ERR error occurs on the send side. I am sure the receive buffer is large enough on the receive side.
Does this happen because MTU (4096) < 200K? Do I need to split the 200K message into multiple Work Requests?

• November 27, 2015

Hi Songping yu.

A UD QP doesn't support message sizes above the path MTU:
this value is in the range 256-4096 bytes (depending on your subnet).

It is up to the application to split the (big) message into smaller messages using multiple Work Requests,
or to use a different QP transport type.

Thanks
Dotan

• December 26, 2017

Hi,
So does it mean that RC supports max 2GB and UD max 4KB?

• January 5, 2018

Hi.

* Maximum message size of RC QPs is 2GB (unless one of the end nodes supports a lower value)
* Maximum message size of UD QPs is 4KB (unless one of the end nodes/switches in the path supports a lower value)

Thanks
Dotan

58. December 21, 2015

Hi Dotan.
two questions:
1. I registered a big memory block; can I send part of it by address offset, length and rkey?
2. I registered many MRs of different memory sizes. When I send a message with the RDMA Send operation, how does the remote side select the receive MR?

Thanks!
Ben

• January 1, 2016

Hi.

1) Yes. You can use only part of it in a Work Request.

2) The remote side posts several Receive Requests:
the incoming messages will consume the Receive Requests according to the order they were posted.
i.e. RR[0] will be consumed by message[0], etc.

Thanks
Dotan

59. January 22, 2016

Thank you very much, Dotan. These pages are super-useful as an IB API reference.

• January 29, 2016

:)

Thanks for the great feedback
Dotan

60. May 31, 2016

Hi Dotan.
Thank you very much for your post and help!
Now I've met a problem: when I use ibv_post_send, I get a return value of 12. Before ibv_post_send, I checked send_wr.sge.addr and it is valid. I paste some code here:

1) create qp:

qp_attr.cap.max_send_wr = 1024;
qp_attr.cap.max_send_sge = 1;
qp_attr.cap.max_recv_wr = 1024;
qp_attr.cap.max_recv_sge = 1;
qp_attr.send_cq = send_cq;
qp_attr.recv_cq = recv_cq;
qp_attr.qp_type = IBV_QPT_RC;
err = rdma_create_qp(cm_id, connection->pd, &qp_attr);

2)query qp attr

if (ibv_query_qp(connection->cm_id->qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU | IBV_QP_CAP, &qp_attr))
{
printf("client query qp attr fail\n");
return RETURN_ERROR;
}

I found attr.cap.max_send_wr is equal to 2015, and attr.cap.max_recv_wr is equal to 1024, attr.cap.max_send_sge is equal to 2, attr.cap.max_recv_sge is equal to 1.

3)call ibv_post_send to send msg

memset(&sge, 0, sizeof(sge));
sge.length = sizeof(CMD_S);
sge.lkey = connection->connect_mr[MR_REQ].mr->lkey;

memset(&send_wr, 0, sizeof(send_wr));
send_wr.wr_id = (uint64_t)cmd;
send_wr.next = NULL;
send_wr.sg_list = &sge;
send_wr.num_sge = 1;
send_wr.opcode = IBV_WR_SEND;
send_wr.send_flags = IBV_SEND_SIGNALED;
ret = ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr);
if (ret != 0)
{
printf("client send connect cmd failed, ret=%d.\n", ret);
return RETURN_ERROR;
}

ret is equal to 12.

I am confused with follow question:
1. I set max_send_wr to 1024 and max_send_sge to 1, but when I query the qp later they have changed: max_send_wr is 2015 and max_send_sge is 2. Why?
2. In my test, multiple pthreads call ibv_post_send. My test has two params: one is the thread count, the other is the queue depth per thread (the queue is used by the test, not an RDMA queue). My test ran well with 8 threads and 32 queue depth, but got an error with 8 threads and 64 queue depth, and ibv_post_send returned the error value 12.

Please give me some suggestions to help me find the key point to resolve the problem. Thanks.

• June 2, 2016

I'd like to add that my test creates only one qp to send messages. 8 threads and 32 queue depth means that the qp may have to handle 8*32 requests at a time. Is the qp limited to handling 256 requests when max_send_wr is set to 1024? And are there limits when we use a qp for send/RDMA Read/RDMA Write?

• June 2, 2016

Hi.

The QP can handle Work Requests according to the max_send_wr that it was created with
(and this value is limited by the HCA capabilities).

* The Send Requests will be processed according to their order in the QP
* RDMA Read & Atomic parallel processing is limited by max_rd_atomic and max_dest_rd_atomic,
for the QP as initiator and destination respectively

Thanks
Dotan

• June 2, 2016

Hi.

1. The RDMA device/low-level driver can provide more resources than the originally requested value, according to its needs and internal structure.
2. I suspect that the Send Queue is full, i.e. you have many outstanding Send Requests (posted Send Requests that weren't yet ended with a Work Completion).

You should either increase the rate of polling Work Completions from the CQ, or increase the QP.max_send_wr value.

Thanks
Dotan

• June 2, 2016

Hi Dotan.
But I'm still confused about what causes the send queue to be full. My test generates 256 requests in total at first, and recycles them. So I think the RDMA send queue holds at most 256 work requests and should not be full. Could you give me a detailed explanation?

• June 3, 2016

Hi.

A posted Work Request is considered outstanding until a Work Completion has been generated for it, or for a Work Request posted after it.
When creating the QP, you specify the number of outstanding Work Requests for both the Send and Receive Queue of that QP.

I suspect that in your example, you post many Send Requests to the QP and don't poll the Work Completions for them.

Thanks
Dotan

61. June 8, 2016

Hi Dotan,

First of all, thank you so much for the blog! It is tremendously helpful!

Thanks!

• June 10, 2016

Hi.

Are you aware of the fact that there isn't any synchronization at all between the two sides in this test?
i.e. the sender sends a message, but the remote side may not be ready to receive it
(its QP isn't in the appropriate state, or a Receive Request wasn't posted, or it hasn't joined the multicast group yet).

This is the reason that adding a sleep to the sender will solve the problem...

You can solve it by adding synchronization between the two sides, or by letting the server send again and again while waiting for an incoming response from the client.

Thanks
Dotan

• June 16, 2016

Oh, I see. That makes sense. Thank you!

62. August 9, 2016

Hi Dotan. Like everyone else, thank you for such an informative resource for RDMA programming. My question: when ibv_post_send is used with one of the atomic opcodes (IBV_WR_ATOMIC_FETCH_AND_ADD or IBV_WR_ATOMIC_CMP_AND_SWAP), do you still need to poll for a completion event to be sure the atomic operation was successful? Or will the operation have completed when ibv_post_send returns?

• August 10, 2016

Hi.

Atomic operations, like any other operation, end when there is a Work Completion for them
(or for another Send Request that was posted after them).

When ibv_post_send() returns, it means that the low-level driver has enqueued this Send Request to the RDMA device for future processing.

Thanks
Dotan

63. January 11, 2017

Hi Dotan.

Thank you for such a guideline of rdma programing!

And I have some trouble with IBV_WR_SEND in UD. I use doorbell batching to post my sends (just like wr[i].next = &wr[i+1]). However, only the data of the last wr in the batch is received. I am sure there is no error thrown in my code, because if I replace IBV_WR_SEND with IBV_WR_SEND_WITH_IMM the same code works and the headers arrive correctly. Also, if I use a separate post_send for each wr, it works. I think something on the sender side is wrong.

Hope that you can give me some advice!

Thanks!

• February 10, 2017

Hi.

Please make sure that there isn't any race between the sides, and that when the message arrives at the remote side:
1) The remote QP is in (at least) RTR state
2) The posted Receive buffers are big enough (i.e. at least message size + 40 bytes for the GRH)

Thanks
Dotan

64. April 18, 2017

Hi Dotan,

I have a question. When I query my device, I get that max_qp_rd_atom is 16. So are more than 16 not possible? Why is it specific to RDMA Read operations? I don't see any problem when more than 16 Work Requests are posted for RDMA Read. What does attr.max_qp_rd_atom mean?

• July 3, 2017

Hi.

RDMA Read operations require special resources and handling on both the send and receive side;
this is the reason for the limitation.

Configuring QP.max_rd_atomic limits the number of RDMA Reads processed by the QP at any time;
you may post as many RDMA Read operations as you want, and the RDMA device will limit the processing.

Thanks
Dotan

65. May 9, 2017

Now I've hit some problems and tried to find the answer in RDMA_Aware_Programming_User_Manual.pdf (Version 1.7) and the IB Specification Vol 1, Release 1.3 (2015-03-03), but haven't found it, so I have to turn to you for help. The problem is: when I post a Work Request to a Queue Pair, the NIC gets a notification and fetches the Work Request from memory into the NIC cache by DMA. But when the NIC sends the data described by the Work Request onto the cable, does it need to fetch the Queue Pair information into the NIC cache? I know that the NIC cache stores Queue Pair data, memory address translation data and some network data, but when the NIC sends data, is the Queue Pair information necessary?

• July 3, 2017

Hi.

When sending data, the RDMA device needs to fetch QP information:
* QP state
* PKey index
* Qkey (for UD QPs, in specific scenarios)
* Remote side attributes (for connected QPs)

Thanks
Dotan

66. June 22, 2017

Hi Dotan,

If I want to use ibv_post_send, since we already have IBV_WR_SEND, why do we need IBV_WR_RDMA_WRITE? Is there any performance difference between these two approaches?

• July 2, 2017

Hi.

Yes. There is a performance difference:
* A Send operation will consume a Receive Request on the remote side
* An RDMA Write operation won't, and a PCI read is avoided (better latency)

Thanks
Dotan

• July 5, 2017

Great! Thanks Dotan.

67. July 10, 2017

After reading all the conversations in this post above, I have one more curious question (sorry for disturbing).
The question is: when, where and how is the necessary QP information collected for posting a send WR?
First, please allow me to sort out the procedure and explain my understanding.
When I post an ibv_send_wr *wr using ibv_post_send, the following happens:
1. With no context switch, in the same context, the ibv_post_send function transforms the ibv_send_wr *wr (the libibverbs abstraction) into a WQE (the HW-specific Send Request, as described in the Ethernet adapter programming manual). Constructing the WQE requires a Ctrl segment, Eth segment, Memory Management segment and Data segment, and the Ctrl segment includes the SQ number attribute (which seems to be the necessary QP information).
2. After constructing the new WQE, it writes the WQE to the WQE buffer and updates the Doorbell record associated with that queue (the ibv_post_send API returns).
3. The device gets the notification and asynchronously processes these new WQEs.
4. After a Work Request has been processed, the NIC writes CQEs to the relevant CQ by DMA.
5. I poll the CQ and get notifications.
OK, that's the whole procedure sorted out; is there any error in it?
From the procedure above, can I guess that collecting the necessary QP information happens when transforming the ibv_send_wr into a WQE (i.e. when calling ibv_post_send)?
And another question (sorry for my curiosity): as far as I know, at the software level the QP number is the unique identifier used to steer network message flows to the corresponding QP, while at the hardware level the GID and port are the unique identifiers used to steer packet flows. So, to summarize the question above, can I treat "fetching QP information for a work request" as "fetching the QP number and other non-unique information"?
Sorry for so many words, but I am really interested in this part. If I expressed it poorly, please point it out and I will improve. Thanks for your patience, Dotan!

• July 21, 2017

Hi.

This is an interesting question.
After the following step:
"2.after constructing new WQE,writing the WQE to the WQE buffer,and update Doorbell record associated with that queue.(ibv_post_send api returns)"
The WQE was enqueued to the RDMA device for processing; when the processing will actually start the RDMA device needs to collect relevant information for the QP:
* The QP type
* Remote QP number (for connected QP)
* Path to the remote QP (for connected QP)
* Send PSN
* more

Thanks
Dotan

• July 27, 2017

Hi,Dotan
I got it. There is still so much the device needs to do.
Sorry for my recklessness; I should carefully read the driver source code and then ask my questions. But I really do learn a lot from your detailed articles. Thanks for your patience and generosity.

• July 27, 2017

:)

68. August 25, 2017

Hi Dotan

I want to transfer data from serverA's memory to serverB's memory, so I use ibv_post_send() to do an RDMA Write. If the return value of ibv_post_send is zero, does it mean that the data has been transferred from serverA's memory to serverB's memory?

Hope that you can give me some advice!

Thanks!

• August 28, 2017

Hi.

No.

If ibv_post_send() returns the value 0,
it means that the Send Request was handed to the RDMA device for further processing.

If this is a reliable transport type, and there is a Work Completion with the SUCCESS status,
this means that the data was written to remote memory successfully.

Thanks
Dotan

69. November 7, 2017

Hi Dotan:

I am new to RDMA and I tried to do an RDMA RC Write. Everything works fine when the message size is smaller than the MTU. However, when I set my message size larger than the MTU, the side that posts the Write is not able to get any Write completion in the CQ, even though the remote side already has the complete data in the registered memory. There is no error message on either side. The side that posts the Write is stuck in the while loop of ibv_poll_cq(). I would like to ask what the problem might be.

Thanks,
Sylvia

• December 6, 2017

Hi.

Are you using RoCE or InfiniBand?
Did you configure the same MTU on both sides?

Thanks
Dotan

70. November 8, 2017

Hi Dotan,

I wrote a ping-pong program with IBV_WR_SEND; it's server/client like. The problem I met was that sending and receiving 1M messages of 4096 bytes took 26s, while the ibv_post_send calls took 9s. Is this normal? Or is there any reason for ibv_post_send to block?

• December 6, 2017

Hi.

What do you mean by "the ibv_post_send calls took 9s"?
First of all, that is too much time for a fast network; seconds are "infinite".
Second, I need to understand what you did in order to give an answer.

Thanks
Dotan

71. January 11, 2018

hello, Dotan!

I've met the problem that many guys mentioned. When I repeatedly write and read remote memory, I get ENOMEM. I tried to empty the CQ at both client and server using ibv_poll_cq, but it didn't work. Please help me! Thanks :)
/*my code seems like that: */

while (1) {

...
send_wr.opcode = IBV_WR_RDMA_WRITE;
send_wr.sg_list = &sge;
...
ret = ibv_post_send(qp, &send_wr, &bad_wr);
if (ret == EINVAL) {
printf("invalid value provided in wr\n");
} else if (ret == ENOMEM) {
printf("send queue is full\n");
do {
ne = ibv_poll_cq(cq, 1, &wc);
if (ne < 0) {
fprintf(stderr, "Failed to poll completions from the CQ: ret = %d\n",
ne);
break;
}
/* there may be an extra event with no completion in the CQ */
if (ne == 0)
continue;

if (wc.status != IBV_WC_SUCCESS) {
fprintf(stderr, "Completion with status 0x%x was found\n",
wc.status);
break;
}
} while (ne);
} else if (ret == EFAULT) {
printf("invalid value provided in qp\n");
} else if (ret != 0) {
printf("failure and no change will be done to the qp\n");
}
}

• January 19, 2018

Hi.

There are 2 options:
1) There aren't any Work Completions (and there won't be), since you didn't request their generation
(ibv_qp_init_attr.sq_sig_all for all Send Requests on that QP, or ibv_send_wr.send_flags per specific Send Request)
2) The processing is still ongoing;
for example, if there is a retransmission and the timeout is very high (or infinite).

Did you read any Work Completion from that CQ?
(from the Send Queue)

Thanks
Dotan