ibv_post_send()

int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr,
                  struct ibv_send_wr **bad_wr);

Description

ibv_post_send() posts a linked list of Work Requests (WRs) to the Send Queue of a Queue Pair (QP). ibv_post_send() goes over all of the entries in the linked list, one by one, checks that each one is valid, generates a HW-specific Send Request out of it and adds it to the tail of the QP's Send Queue without performing any context switch. The RDMA device will handle it (later) in an asynchronous way. If one of the WRs fails because the Send Queue is full or one of the attributes in the WR is bad, it stops immediately and returns a pointer to that WR. The QP will handle Work Requests in the Send Queue according to the following rules:

  • If the QP is in RESET, INIT or RTR state, an immediate error should be returned. However, there may be some low-level drivers that don't follow this rule (to eliminate an extra check in the data path, thus providing better performance), and Send Requests posted in one or all of those states may be silently ignored.
  • If the QP is in RTS state, Send Requests can be posted and they will be processed.
  • If the QP is in SQE or ERROR state, Send Requests can be posted and they will be completed with error.
  • If the QP is in SQD state, Send Requests can be posted, but they won't be processed.

The struct ibv_send_wr describes the Work Request to the Send Queue of the QP, i.e. Send Request (SR).

struct ibv_send_wr {
	uint64_t		wr_id;
	struct ibv_send_wr     *next;
	struct ibv_sge	       *sg_list;
	int			num_sge;
	enum ibv_wr_opcode	opcode;
	int			send_flags;
	uint32_t		imm_data;
	union {
		struct {
			uint64_t	remote_addr;
			uint32_t	rkey;
		} rdma;
		struct {
			uint64_t	remote_addr;
			uint64_t	compare_add;
			uint64_t	swap;
			uint32_t	rkey;
		} atomic;
		struct {
			struct ibv_ah  *ah;
			uint32_t	remote_qpn;
			uint32_t	remote_qkey;
		} ud;
	} wr;
};

Here is the full description of struct ibv_send_wr:

wr_id A 64-bit value associated with this WR. If a Work Completion is generated when this Work Request ends, it will contain this value
next Pointer to the next WR in the linked list. NULL indicates that this is the last WR
sg_list Scatter/Gather array, as described in the table below. It specifies the buffers that will be read from or the buffers where data will be written, depending on the opcode used. The entries in the list can specify memory blocks that were registered by different Memory Regions. The message size is the sum of the lengths of all the memory buffers in the scatter/gather list
num_sge Size of the sg_list array. This number must be less than or equal to the number of scatter/gather entries that the Queue Pair was created to support in the Send Queue (qp_init_attr.cap.max_send_sge). If this size is 0, this indicates that the message size is 0
opcode The operation that this WR will perform. This value controls the way that data will be sent, the direction of the data flow and which attributes of the WR are used. The value can be one of the following enumerated values:

  • IBV_WR_SEND - The content of the local memory buffers specified in sg_list is being sent to the remote QP. The sender doesn’t know where the data will be written in the remote node. A Receive Request will be consumed from the head of remote QP's Receive Queue and sent data will be written to the memory buffers which are specified in that Receive Request. The message size can be [0, 2^{31}] for RC and UC QPs and [0, path MTU] for UD QP
  • IBV_WR_SEND_WITH_IMM - Same as IBV_WR_SEND, and immediate data will be sent in the message. This value will be available in the Work Completion that will be generated for the consumed Receive Request in the remote QP
  • IBV_WR_RDMA_WRITE - The content of the local memory buffers specified in sg_list is sent and written to a contiguous block of memory in the remote QP's virtual space. This doesn't necessarily mean that the remote memory is physically contiguous. No Receive Request will be consumed in the remote QP. The message size can be [0, 2^{31}]
  • IBV_WR_RDMA_WRITE_WITH_IMM - Same as IBV_WR_RDMA_WRITE, but a Receive Request will be consumed from the head of the remote QP's Receive Queue and immediate data will be sent in the message. This value will be available in the Work Completion that will be generated for the consumed Receive Request in the remote QP
  • IBV_WR_RDMA_READ - Data is read from a contiguous block of memory in the remote QP's virtual space and written to the local memory buffers specified in sg_list. No Receive Request will be consumed in the remote QP. The message size can be [0, 2^{31}]
  • IBV_WR_ATOMIC_FETCH_AND_ADD - A 64-bit value in a remote QP's virtual space is read, added to wr.atomic.compare_add and the result is written to the same memory address, atomically. No Receive Request will be consumed in the remote QP. The original data, before the add operation, is written to the local memory buffers specified in sg_list
  • IBV_WR_ATOMIC_CMP_AND_SWP - A 64-bit value in a remote QP's virtual space is read and compared with wr.atomic.compare_add; if they are equal, the value wr.atomic.swap is written to the same memory address, atomically. No Receive Request will be consumed in the remote QP. The original data, before the compare operation, is written to the local memory buffers specified in sg_list
send_flags Describes the properties of the WR. It is either 0 or the bitwise OR of one or more of the following flags:

  • IBV_SEND_FENCE - Set the fence indicator for this WR. This means that the processing of this WR will be blocked until all previously posted RDMA Read and Atomic WRs have completed. Valid only for QPs with Transport Service Type IBV_QPT_RC
  • IBV_SEND_SIGNALED - Set the completion notification indicator for this WR. This means that if the QP was created with sq_sig_all=0, a Work Completion will be generated when the processing of this WR ends. If the QP was created with sq_sig_all=1, this flag has no effect
  • IBV_SEND_SOLICITED - Set the solicited event indicator for this WR. This means that when the message in this WR arrives at the remote QP, a solicited event will be generated for it, and if the user on the remote side is waiting for a solicited event, it will be woken up. Relevant only for the Send and RDMA Write with immediate opcodes
  • IBV_SEND_INLINE - The memory buffers specified in sg_list will be placed inline in the Send Request. This means that the low-level driver (i.e. the CPU) will read the data, and not the RDMA device. As a result the L_Key won't be checked; in fact, those memory buffers don't even have to be registered, and they can be reused immediately after ibv_post_send() returns. Valid only for the Send and RDMA Write opcodes
imm_data (optional) A 32-bit number, in network order, that is sent along with the payload to the remote side in the SEND with immediate or RDMA WRITE with immediate opcodes and is placed in a Receive Work Completion and not in a remote memory buffer
wr.rdma.remote_addr Start address of remote memory block to access (read or write, depends on the opcode). Relevant only for RDMA WRITE (with immediate) and RDMA READ opcodes
wr.rdma.rkey r_key of the Memory Region that is being accessed at the remote side. Relevant only for RDMA WRITE (with immediate) and RDMA READ opcodes
wr.atomic.remote_addr Start address of remote memory block to access
wr.atomic.compare_add For Fetch and Add: the value that will be added to the content of the remote address. For compare and swap: the value to be compared with the content of the remote address. Relevant only for atomic operations
wr.atomic.swap Relevant only for compare and swap: the value to be written in the remote address if the value that was read is equal to the value in wr.atomic.compare_add. Relevant only for atomic operations
wr.atomic.rkey r_key of the Memory Region that is being accessed at the remote side. Relevant only for atomic operations
wr.ud.ah Address handle (AH) that describes how to send the packet. This AH must remain valid until no posted Work Request that uses it is considered outstanding anymore. Relevant only for UD QP
wr.ud.remote_qpn QP number of the destination QP. The value 0xFFFFFF indicates that this is a message to a multicast group. Relevant only for UD QP
wr.ud.remote_qkey Q_Key value of remote QP. Relevant only for UD QP

The following table describes the supported opcodes for each QP Transport Service Type:

Opcode                        UD   UC   RC
IBV_WR_SEND                   X    X    X
IBV_WR_SEND_WITH_IMM          X    X    X
IBV_WR_RDMA_WRITE             -    X    X
IBV_WR_RDMA_WRITE_WITH_IMM    -    X    X
IBV_WR_RDMA_READ              -    -    X
IBV_WR_ATOMIC_CMP_AND_SWP     -    -    X
IBV_WR_ATOMIC_FETCH_AND_ADD   -    -    X

struct ibv_sge describes a scatter/gather entry. The memory buffer that this entry describes must remain registered until no posted Work Request that uses it is considered outstanding anymore. The order in which the RDMA device accesses the memory in a scatter/gather list isn't defined. This means that if some of the entries overlap the same memory address, the content of this address is undefined.

struct ibv_sge {
	uint64_t		addr;
	uint32_t		length;
	uint32_t		lkey;
};

Here is the full description of struct ibv_sge:

addr The address of the buffer to read from or write to
length The length of the buffer in bytes. The value 0 is a special value and is equal to 2^{31} bytes (and not zero bytes, as one might imagine)
lkey The Local key of the Memory Region that this memory buffer was registered with
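
The message-size rule described above (the sum of the entry lengths, with a length of 0 standing for 2^{31} bytes) can be sketched as follows. This is only an illustration: struct sge_example is a hypothetical stand-in with the same fields as struct ibv_sge, so that the snippet compiles without libibverbs, and message_size() is an illustrative helper that isn't part of the verbs API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in with the same fields as struct ibv_sge. */
struct sge_example {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

/* Per the description above, a length of 0 stands for 2^31 bytes. */
static uint64_t sge_effective_length(const struct sge_example *sge)
{
	return sge->length == 0 ? (UINT64_C(1) << 31) : sge->length;
}

/* Message size = sum of the effective lengths of all entries (0 if num_sge is 0). */
static uint64_t message_size(const struct sge_example *sg_list, int num_sge)
{
	uint64_t total = 0;

	assert(num_sge >= 0);
	assert(num_sge == 0 || sg_list != NULL);

	for (int i = 0; i < num_sge; i++)
		total += sge_effective_length(&sg_list[i]);

	return total;
}
```

A check like this can be handy before posting, for example to compare the total against the message size limit of the opcode that is used.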

Sending inline'd data is an implementation extension that isn't defined in any RDMA specification: it allows sending the data itself in the Work Request (instead of the scatter/gather entries) that is posted to the RDMA device. The memory that holds this message doesn't have to be registered. There isn't any verb that specifies the maximum message size that can be sent inline'd in a QP. Only some of the RDMA devices support it. In some RDMA devices, creating a QP will set the value of max_inline_data to the size of messages that can be sent using the requested number of scatter/gather elements of the Send Queue. In others, one should explicitly specify the message size to be sent inline before the creation of a QP. For those devices, it is advised to try to create the QP with the required message size and to keep decreasing it if the QP creation fails.

While a WR is considered outstanding:

  • If the WR sends data, the local memory buffers content shouldn't be changed since one doesn't know when the RDMA device will stop reading from it (one exception is inline data)
  • If the WR reads data, the local memory buffers content shouldn't be read since one doesn't know when the RDMA device will stop writing new content to it

Parameters

Name Direction Description
qp in Queue Pair that was returned from ibv_create_qp()
wr in Linked list of Work Requests to be posted to the Send Queue of the Queue Pair
bad_wr out A pointer that will be set to the first Work Request whose processing failed

Return Values

Value Description
0 On success
errno On failure; no change is made to the QP and bad_wr points to the SR that failed to be posted
EINVAL Invalid value provided in wr
ENOMEM Send Queue is full or not enough resources to complete this operation
EFAULT Invalid value provided in qp
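
Since posting stops at the first failing WR, the WRs that precede bad_wr in the linked list were posted and the ones from bad_wr onward were not. A minimal sketch of how a caller might count the posted ones is shown below; struct wr_node is a hypothetical stand-in that mirrors only the wr_id and next fields of struct ibv_send_wr so that the snippet compiles without libibverbs, and count_posted_before() is an illustrative helper, not a verb.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ibv_send_wr: only the fields the walk needs. */
struct wr_node {
	uint64_t        wr_id;
	struct wr_node *next;
};

/*
 * Count the WRs that precede 'bad' in the list that starts at 'head'.
 * If 'bad' is NULL (nothing failed), this is the length of the whole list.
 */
static size_t count_posted_before(const struct wr_node *head,
				  const struct wr_node *bad)
{
	size_t n = 0;

	/* A failing WR can only exist in a non-empty list. */
	assert(head != NULL || bad == NULL);

	for (const struct wr_node *it = head; it != NULL && it != bad; it = it->next)
		n++;

	return n;
}
```

With the real types, the same walk would start at the wr argument and stop at the value that ibv_post_send() stored in bad_wr.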

Examples

1) Posting a WR with the Send operation to a UC or RC QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

2) Posting a WR with the Send with immediate operation to a UD QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_SEND_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data   = htonl(0x1234);
wr.wr.ud.ah          = ah;
wr.wr.ud.remote_qpn  = remote_qpn;
wr.wr.ud.remote_qkey = 0x11111111;
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

3) Posting a WR with an RDMA Write operation to a UC or RC QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_RDMA_WRITE;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey        = remote_key;
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

4) Posting a WR with an RDMA Write with immediate operation to a UC or RC QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
wr.send_flags = IBV_SEND_SIGNALED;
wr.imm_data   = htonl(0x1234);
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey        = remote_key;
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

5) Posting a WR with an RDMA Read operation to an RC QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_RDMA_READ;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.rdma.remote_addr = remote_address;
wr.wr.rdma.rkey        = remote_key;
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

6) Posting a WR with a Compare and Swap operation to an RC QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_ATOMIC_CMP_AND_SWP;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey        = remote_key;
wr.wr.atomic.compare_add = 0ULL; /* expected value in remote address */
wr.wr.atomic.swap        = 1ULL; /* the value that remote address will be assigned to */
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

7) Posting a WR with a Fetch and Add operation to an RC QP:

struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = &sg;
wr.num_sge    = 1;
wr.opcode     = IBV_WR_ATOMIC_FETCH_AND_ADD;
wr.send_flags = IBV_SEND_SIGNALED;
wr.wr.atomic.remote_addr = remote_address;
wr.wr.atomic.rkey        = remote_key;
wr.wr.atomic.compare_add = 1ULL; /* value to be added to the remote address content */
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

8) Posting a WR with the Send operation to a UC or RC QP with zero bytes:

struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;
 
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.sg_list    = NULL;
wr.num_sge    = 0;
wr.opcode     = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED;
 
if (ibv_post_send(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_send() failed\n");
	return -1;
}

FAQs

Does ibv_post_send() cause a context switch?

No. Posting an SR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

How many WRs can I post?

There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created.

Can I know how many WRs are outstanding in a Work Queue?

No, you can't. You should keep track of the number of outstanding WRs according to the number of posted WRs and the number of Work Completions that you polled.
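
A minimal sketch of such bookkeeping is shown below. The struct and function names are hypothetical (nothing here is part of libibverbs), and the sketch assumes that every WR is signaled, so each polled Work Completion accounts for exactly one WR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one Send Queue; not part of libibverbs. */
struct send_budget {
	uint32_t max_send_wr;  /* queue depth requested when the QP was created */
	uint32_t outstanding;  /* posted but not yet completed */
};

static void send_budget_init(struct send_budget *b, uint32_t max_send_wr)
{
	b->max_send_wr = max_send_wr;
	b->outstanding = 0;
}

/* Call before ibv_post_send(); returns false if the queue would overflow. */
static bool send_budget_reserve(struct send_budget *b, uint32_t num_wrs)
{
	if (num_wrs > b->max_send_wr - b->outstanding)
		return false;
	b->outstanding += num_wrs;
	return true;
}

/* Call after polling the CQ; 'num_completed' is the number of WRs that completed. */
static void send_budget_release(struct send_budget *b, uint32_t num_completed)
{
	assert(num_completed <= b->outstanding);
	b->outstanding -= num_completed;
}
```

If ibv_post_send() fails after a reservation, the WRs that weren't posted should be released again. When only some of the WRs are signaled, the caller has to account for the unsignaled WRs that a polled completion covers as well.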

Is the remote side aware of the fact that RDMA operations are being performed in its memory?

No, this is the idea of RDMA.

If the remote side isn't aware that RDMA operations are being performed in its memory, isn't this a security hole?

Actually, no. For several reasons:

  • In order to allow incoming RDMA operations to a QP, the QP should be configured to enable remote operations
  • In order to allow incoming RDMA access to a MR, the MR should be registered with those remote permissions enabled
  • The side that initiates the RDMA operation must know the r_key and the memory addresses in order to be able to access the remote memory

What will happen if I deregister an MR that is used by an outstanding WR?

When processing a WR, if one of the MRs that are specified in the WR isn't valid, a Work Completion with error will be generated. The only exception for this is posting inline data.

What is the benefit of using IBV_SEND_INLINE?

Using inline data usually provides better performance (i.e. latency).

What is the difference between inline data and immediate data?

Using immediate data means that out-of-band data is sent from the local QP to the remote QP: with a SEND opcode, this data will appear in the Work Completion; with an RDMA WRITE opcode, a WR will additionally be consumed from the remote QP's Receive Queue. Inline data influences only the way that the RDMA device gets the data to send; the remote side isn't aware of the fact that this WR was sent inline.

I called ibv_post_send() and I got segmentation fault, what happened?

There may be several reasons for this to happen:
1) At least one of the sg_list entries points to an invalid address
2) In one of the posted SRs, IBV_SEND_INLINE is set in send_flags, but one of the buffers in sg_list is pointing to an illegal address
3) The value of next points to an invalid address
4) Error occurred in one of the posted SRs (bad value in the SR or full Work Queue) and the variable bad_wr is NULL
5) A UD QP is used and wr.ud.ah points to an invalid address

Help, I've posted a Send Request and it wasn't completed with a corresponding Work Completion. What happened?

In order to debug this kind of problem, one should do the following:

  • Verify that a Send Request was actually posted
  • Wait enough time, maybe a Work Completion will eventually be generated
  • Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
  • Verify that the QP state is RTS
  • If this is an RC QP, verify that the timeout value that was configured in ibv_modify_qp() isn't 0, since if a packet is dropped this may lead to an infinite wait
  • If this is an RC QP, verify that the combination of the timeout and retry_cnt values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RETRY_EXC_ERR is generated
  • If this is an RC QP, verify that the rnr_retry value that was configured in ibv_modify_qp() isn't 7, since this may lead to retrying indefinitely in case of an RNR flow
  • If this is an RC QP, verify that the combination of the min_rnr_timer and rnr_retry values that were configured in ibv_modify_qp() doesn't mean that a long time will pass before a Work Completion with IBV_WC_RNR_RETRY_EXC_ERR is generated
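
To get a feeling for how long the timeout and retry_cnt combination can take, the arithmetic can be sketched as below. The per-attempt formula (4.096 usec * 2^timeout) is the minimum ACK timeout for a non-zero timeout value; multiplying it by (retry_cnt + 1) attempts is a simplifying assumption made here to get a rough lower bound, and devices may wait longer in practice. Both function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Minimum ACK timeout in microseconds for a non-zero 'timeout' attribute:
 * 4.096 usec * 2^timeout. The attribute is a 5-bit value; 0 means infinite
 * and is rejected here.
 */
static double ack_timeout_usec(uint8_t timeout)
{
	assert(timeout >= 1 && timeout <= 31);
	return 4.096 * (double)(UINT64_C(1) << timeout);
}

/*
 * Rough lower bound (an assumption of this sketch) on the time before
 * IBV_WC_RETRY_EXC_ERR: the first attempt plus retry_cnt retries, each
 * waiting one minimum ACK timeout. retry_cnt is a 3-bit value.
 */
static double retry_exhaustion_usec(uint8_t timeout, uint8_t retry_cnt)
{
	assert(retry_cnt <= 7);
	return ack_timeout_usec(timeout) * (double)(retry_cnt + 1);
}
```

For example, timeout=14 gives roughly 67 msec per attempt, so with retry_cnt=7 more than half a second can pass before the error completion shows up.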

How can I send a zero-byte message?

In order to send a zero-byte message, no matter what the opcode is, num_sge must be set to zero.

Can I (re)use the Send Request after ibv_post_send() returned?

Yes. This verb translates the Send Request from the libibverbs abstraction to a HW-specific Send Request and you can (re)use both the Send Request and the s/g list within it.

Comments

  1. test says: March 6, 2013

    I have a question about whether a context switch occurs or not during an RDMA operation. Here (page 15) it is shown that a user space verbs call results in a call of the hardware specific driver (e.g. mlx4). That "lives" in kernel space. So, does ibv_post_send() (RDMA mode) cause a context switch, or not? Can you clarify this for me please.

    Also, if ibv_post_send() never causes a context switch, then why is there an implementation of ibv_post_send() in the linux kernel? When is this function (inside the kernel) called?

    Thanks!

    • Dotan Barak says: March 6, 2013

      This is a great question!

      Every control operation (i.e. create/destroy/modify/query to any resource) will cause a context switch.
      However, the data operations won't create a context switch and from the same context,
      one can post new Work Request (either to the Send or Receive Queues).

      In the example, you mentioned "mlx4"; the create Queue Pair will perform a context switch and the following libraries/modules will be called in order:
      libibverbs -> libmlx4 -> libibverbs -> ib core -> mlx4

      In order to post a Send Request, the following libraries/modules will be called in order:
      libibverbs -> libmlx4
      i.e. no context switch will happen at all.

      However, if there are devices (or low-level drivers) that don't support posting Send Requests without a context switch, libibverbs has the infrastructure in place to allow posting the Work Requests at the kernel level.
      Personally, I don't know about any device that uses those functions.

      I hope that I answered all of your questions.

      Thanks
      Dotan

      • test says: March 6, 2013

        Yes, you did! Thanks!

        ps: I forgot to paste the link I was referring to in my first post. Here it is (from OpenFabrics) --> https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CC8QFjAA&url=https%3A%2F%2Fwww.openfabrics.org%2Fofa-documents%2Fpresentations%2Fdoc_download%2F522-openfabrics-training-programs.html&ei=fW83UffUDo3Osgam0oDYCg&usg=AFQjCNEyvglOCK0V-6jnoGsqyiYEH3kQDw&bvm=bv.43287494,d.Yms&cad=rja

        So in page 15 the Hardware Specific Driver (yellow box) might be the libmlx4 depending on the implementation (or it might be mlx4 linux kernel module in ./drivers/infiniband/hw/mlx4 otherwise). Am I right?

      • Dotan Barak says: March 6, 2013

        Thanks for this link, now I fully understand your question...

        The Hardware Specific Driver (yellow box) is the mlx4 kernel part (since this section describes the kernel space modules). The User level APIs (white box) is the libibverbs and libmlx4.
        (Do you see the "kernel bypass" line? this means direct access to the HW without need for performing context switch).

      • test says: March 11, 2013

        Yes, I see the "kernel bypass" line. But that makes a contradiction. Kernel bypass from the one hand, but libmlx4 calls something (Hardware Specific Driver (mlx4 kernel module)) that "lives" inside the kernel (kernel context)). Except if the author of the diagram is meaning that the line is going to the Infiniband HCA directly (firmware code). :P

        Sorry for being persistent!

      • Dotan Barak says: March 12, 2013

        It is o.k.
        :)

        The "kernel bypass" means that in the data path, your user level code will be able
        to work directly with the HW (without performing a context switch).

        Please remember that the kernel level must be involved in the control part in order to
        sync the resources (between different processes/modules) and configure the HW since user
        level application can't write directly in the device memory space (since this is a privileged operation).

        In this slide, I can see that there are two lines:
        1) First line that specify kernel bypass (for the data path)
        2) Second line that specify that the user level will call to the open fabrics kernel level verbs

        I hope that I answered all of your questions.
        If you enjoy this blog, please publish it to other people as well.

        Thanks
        Dotan

  2. Jay says: March 11, 2013

    Hi Dotan,

    I have a few questions about ibv_post_send():

    1. If I issue one large send request,
    will (or can it be) served by multiple
    smaller receive buffers? or does one
    send request can never use multiple recv
    buffers?

    2. when would I need to use IBV_SEND_SIGNALED and IBV_SEND_SOLICITED?

    3. Can Receive buffer be a gather list and
    HCA will dma the received data to appropriate gather elements?

    • Dotan Barak says: March 12, 2013

      Hi Jay.

      I'll try to answer:
      1) I assume that you mean that you send a big message over the wire.
      At the receive side you can split this message into as many scatter elements
      as you wish (this is a local attribute).

      To summarize it:
      When using RDMA operation(s): only one contiguous buffer can be used
      When using Send operation: the receive side can post a Receive Request with one
      or more scatter elements (as long as the sum of the buffers will be able to hold
      all of the message).

      2) IBV_SEND_SIGNALED should be used if the QP was created with sq_sig_all=0
      (which means that not all Send Requests will generate Work Completion when completed).

      IBV_SEND_SOLICITED should be used when the remote side is reading the Work Completions
      using events (and not in polling mode). Please check my post about ibv_req_notify_cq()
      for more details.

      3) Yes, this is exactly what the RDMA device will do in the Receive side,
      when using the Send operation. Please keep in mind that those memory buffers should
      be registered first.

      I hope that I answered all of your questions.
      If you enjoy this blog, please publish it to other people as well.

      Thanks
      Dotan

      • Jay says: March 13, 2013

        Hi Dotan,

        Please let me rephrase the question #1 -
        Receive side has posted two receive work
        request with n-bytes worth of buffer each.
        So receiver has total of 2n byte buffer
        available.
        Now sender issues one send work request
        with total of 2n+m byte data.
        Can receiver use two receive work requests
        to satisfy one send work request?
        When using RDMA operations you said one single contig. buffer can be used.
        Do you mean RDMA write OR RDMA read?

        Thank you so much for your reply.

        Jay

      • Dotan Barak says: March 13, 2013

        Hi Jay.

        The Receive Request is working in resolution of messages and not in resolution of bytes.

        Every Receive Request will handle only one incoming message:
        for each incoming message one Receive Request will be fetched from the head
        of the Receive Queue. The messages will be handled by the order of their arrival.

        In your example there are 2 Receive Requests that each has n bytes:
        * Receiving a message of n bytes or less, is fine
        * Receiving a message with more than n bytes will cause an error (since there isn't enough room to hold the message)

        When working with RDMA operations:
        * RDMA Write can read one or more local gather entries and write them to one remote contiguous block
        * RDMA Read can read from one remote contiguous block and write it locally to one or more scatter entries

        If you have more questions, you are more than welcome to ask..
        :)

        Dotan

  3. test says: March 19, 2013

    When I send a 1024 byte block in IBV_WR_RDMA_WRITE mode, everything is OK, but if the block size is set larger (e.g. 4096 bytes), I get an IBV_WC_LOC_PROT_ERR error and then many IBV_WC_WR_FLUSH_ERR errors on the send CQ. Can you help me?

    • Dotan Barak says: March 19, 2013

      Hi.

      Please check the memory buffers in the gather list of the Send Request, I suspect that you try to access memory that wasn't registered.

      Thanks
      Dotan

  4. test says: March 25, 2013

    ibv_post_send returns -1,what is the problem ? thanks for your help

    • Dotan Barak says: March 25, 2013

      Hi.

      There can be several reasons:
      * The Send Request has invalid value(s)
      * The Send Queue is full

      Not all of the low level drivers return errno to indicate about errors
      (some of them returned -1 in the past and now return errno).

      It depends on the library that you use and its version.

      Thanks
      Dotan

  5. Sara says: May 1, 2013

    Hi Dotan, I'm running into a problem with ibv_post_send and hoping you can provide some guidance. I've adapted the rc_ping_pong program to exchange 312 byte messages among nodes in a 32-machine IB cluster, except that I use an epoll() based mechanism to call ibv_poll_cq(). Several messages later (around 58900 to be exact), ibv_post_send() fails returning ENOMEM and errno set to 2. Both sides of the connection are in good states: IBV_PORT_ACTIVE & IBV_QPS_RTS. When I keep track of sends posted vs sends completed I find that during the failure (posted-completed) = 31, always. However I have only max_send_wr=1 when I created the qp. So I'm not sure what's going on. On the receive side I guarantee posts (rx_depth=800 and whenever it drops to 400 I post 400 more). Any help is much appreciated, and if you need further clarifications please let me know.
    Thanks much
    Sara

    • Dotan Barak says: May 1, 2013

      Hi Sara.

      I will try to help you
      :)

      If ibv_post_send() itself fails that means that either:
      The Send Queue is full (i.e. all of the Work Requests in the Send Queue are outstanding)
      or
      The posted Send Request is illegal:
      * too many scatter/gather elements
      * too much inline data (if inline data is used)
      * wrong opcode

      Please check if this helps you:
      if you are sure that the Send Queue isn't full, dump the Send Request and check what I suggested above.

      Thanks
      Dotan

  6. Sara says: May 1, 2013

    Thanks for the quick response Dotan.
    I'm leaning towards full queue rather than illegal request because:
    1. They've been going through fine for all the previous posts, and
    2. I simply reuse circular buffers for subsequent sends
    3. I inspected the wr (bad_wr points to it) during failure and it looks okay:
    (gdb) p wr
    $1 = {wr_id = 1, next = 0x0, sg_list = 0x7fcaca7fbcb8, num_sge = 1, opcode = IBV_WR_SEND, send_flags = 2, imm_data = 0, wr = {rdma = {remote_addr = 0, rkey = 0}, atomic = {remote_addr = 0,
    compare_add = 0, swap = 0, rkey = 0}, ud = {ah = 0x0, remote_qpn = 0, remote_qkey = 0}}}
    (gdb) p *wr->sg_list
    $11 = {addr = 49981952, length = 312, lkey = 175104}

    I'm confused about two things though (if send queue full is the problem):

    1. ibv_post_send() returns ENOMEM (and not -ENOMEM which is what the drivers seem to return when kmalloc fails or something similar)
    2. errno=2 which is also weird, I'm unable to find out exactly who sets it & why

    I've also tried running it through valgrind to check invalid memory and it looks clean.
    Any pointers?

    Thanks
    Sara

    • Dotan Barak says: May 2, 2013

      Hi Sara.

      I'll try to help here:
      1) User level libraries return positive errno values and not negative ones
      (kernel level drivers return negative errno values)

      2) I don't know where the errno=2 came from. libmlx4 almost never sets errno.

      Did you poll all of the completions from the CQ?
      Once you have the failure in the ibv_post_send(), did you try to empty the CQ and try to post the Send Request again?
      (since the QP should still be in a good shape)

      Thanks
      Dotan
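
Since the posting verb reports failure through its return value (a positive errno-style code) and errno itself may be left untouched, a minimal sketch of error reporting could decode the return value directly instead of calling perror(). The helper names `post_result_str()` and `report_post_result()` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* rc is the value returned by the posting verb: 0 on success, otherwise a
 * positive errno-style code. The global errno is deliberately not consulted. */
static const char *post_result_str(int rc)
{
    assert(rc >= 0);
    return rc == 0 ? "success" : strerror(rc);
}

/* Hypothetical helper: log a failed post using the returned code and pass it on. */
static int report_post_result(const char *what, int rc)
{
    if (rc != 0)
        fprintf(stderr, "%s failed: %s (%d)\n", what, post_result_str(rc), rc);
    return rc;
}
```

With this, a caller would write something like `rc = ibv_post_send(qp, &wr, &bad_wr); report_post_result("ibv_post_send", rc);`, so a stale errno from an unrelated earlier call is never printed.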

      • Sara says: May 3, 2013

        Thanks, Dotan! Once I reach this point, all polls keep returning 0, and if I attempt to post more sends I run into the same issue. The other side is sitting idle doing an epoll_wait() with plenty of recvs posted. So it doesn't look like an easy problem to solve. I'll try a few more experiments & update (in case someone runs into similar issues later).
        Sara

      • Dotan Barak says: May 4, 2013

        This will be great, thanks!

        Dotan

      • Sara says: May 7, 2013

        Just wanted to update on this issue real quick. I restructured the code quite a bit to make it extensible and now I don't hit upon the issue anymore. So most likely some bad coding on my part - if I had more time to spare I'll explore in detail but unfortunately I'm on a deadline so don't have a clear answer :(
        Thanks for your help Dotan!

      • Dotan Barak says: May 7, 2013

        Hi Sara.

        I'm happy that you overcame the bug
        :)

        You are most welcome!
        Dotan

  7. Stefan says: June 28, 2013

    Hi Dotan,

    I'm receiving 'remote invalid request error' (IBV_WC_REM_INV_REQ_ERR) with RDMA_READ requests. I checked buffer sizes, access rights, and QP type and all seems fine to me. RDMA_WRITE works and since the only difference is the opcode (as far as I know), I don't understand the issue.

    BTW: I'm new to RDMA programming and your site really helps a lot!

    Thanks so far.

    • Dotan Barak says: June 28, 2013

      Hi Stefan.

      Sharing the code will be great (it will allow me to review it and give feedback..)
      Nevertheless, I will try to help you
      :)

      Assuming that you have both RDMA Read and RDMA Write code,
      the delta between RDMA Write support and RDMA Read support should be:
      1) The QP type is IBV_QPT_RC
      2) The mask IBV_ACCESS_REMOTE_READ is enabled in the responder's MR
      3) The mask IBV_ACCESS_REMOTE_READ is enabled in the responder's QP (qp_access_flags)
      4) The values of max_rd_atomic/max_dest_rd_atomic aren't zero
      (setting the value to one in both sides isn't efficient but will do the trick)
      5) verify that the r_key is correct (although if it worked with RDMA Write, it should be valid)

      I hope that I helped you.
      If you enjoy this blog, please share it with other people as well.

      Thanks
      Dotan

      • Stefan says: July 1, 2013

        Hi Dotan,

        Thanks for the fast reply. I re-checked everything again and found:

        1) .qp_type of ibv_qp_init_attr is IBV_QPT_RC (OK)
        2) access mask was set by
        if (!(remote_mr = ibv_reg_mr(remote_pd, pmydata->recv_buffer, pmydata->max,
                                     IBV_ACCESS_REMOTE_WRITE |
                                     IBV_ACCESS_LOCAL_WRITE |
                                     IBV_ACCESS_REMOTE_READ))) {
            perror("ibv_reg_mr");
            return NULL;
        }
        This left the flags of the QP unchanged. I now set them by calling ibv_modify_qp. The flags seem to be all right now, but the error remains.

        3) Both communication partners have the same flags, for their QPs and MRs so this should be ok.

        4) Both, max_rd_atomic and max_dest_rd_atomic are set to 1 by default here. I checked it and it should also be ok.

        5) As you mention, since RDMA_WRITE works r_key,l_key, and remote_addr are ok. (I also re-checked that)

        What seems strange is that ibv_modify_qp raised an invalid argument error when I called it with IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MAX_QP_RD_ATOMIC to modify the values, but modifying the access flags works fine.

        The code actually is a mess, but it basically consists of these parts:

        * rdma_create_event_channel() to create event channels
        * rdma_create_id() to create rdma_cm_id's
        * rdma_bind_addr() and rdma_listen() on the server side
        * rdma_resolve_addr() and rdma_resolve_route() on the client side
        * ibv_create_cq(), ibv_alloc_pd(), rdma_create_qp() and ibv_reg_mr() to setup CQ,PD and register MR
        * Exchange Key and memory Address
        * Message setup:
        // current message size
        sge.length = imyproblemsize;
        // Buffer address == MR address and is large enough
        sge.addr = (uint64_t)pmydata->recv_buffer;
        sge.lkey = client_mr->lkey;

        snd_wr.sg_list = &sge;
        snd_wr.num_sge = 1;
        snd_wr.opcode = IBV_WR_RDMA_READ;
        snd_wr.send_flags = IBV_SEND_SIGNALED;
        snd_wr.next = NULL;
        snd_wr.wr.rdma.remote_addr = rAddr;
        snd_wr.wr.rdma.rkey = rKey;

        * Start Work:
        if (ibv_post_send(client_id->qp, &snd_wr, NULL)) {
            perror("21 ibv_post_send");
            return -21;
        }
        while (!ibv_poll_cq(client_cq, 1, &wc));
        if (wc.status != IBV_WC_SUCCESS) {
            printf("r0: wc.status: %s\n", ibv_wc_status_str(wc.status));
            perror("22 ibv_poll_cq");
            return -22;
        }

        The code is a kind of skeleton I wrote, and it originally covered send/receive, which works fine. Modifying it to work with RDMA_WRITE caused no problem, but RDMA_READ does.

        Thanks a lot.

      • Dotan Barak says: July 4, 2013

        Hi Stefan.

        Can you call ibv_query_qp() when the QP should be in RTS state and verify that:
        1) The QP state is RTS
        2) The value of max_rd_atomic isn't zero
        3) The value of max_dest_rd_atomic isn't zero

        I suspect that the fact that ibv_modify_qp() failed is your problem.
        (please check my post about ibv_modify_qp() and make sure that you
        use the right flags for each QP state transition)

        Thanks
        Dotan

  8. boris says: October 9, 2013

    Hello Dotan,

    I'm measuring latency between two RDMA NICs with IBV_WR_SEND

    If I send a work request with IBV_SEND_SIGNALED flag, so when I get
    IBV_WC_SEND event, does it mean that the message was delivered and the remote machine sent an ack back? Should I consider this time as a roundtrip?

    Thanks.

    • Dotan Barak says: October 10, 2013

      Hi.

      It depends on the used transport type:
      * If this is reliable transport type (RC), when you get Work Completion in the sender side - this means that the message was written at the remote side (and an ACK was sent back)
      * If this is unreliable transport type (UC/UD), when you get Work Completion in the sender side - this means that the message was sent through the local port (no ACK/NACK will be sent)

      I hope that I answered your question.

      Thanks
      Dotan

      • Boris says: October 10, 2013

        Thanks a lot.
        That's what I assumed.
        I'm using RC, just to make clear, the following flow

        1. post-receive
        2. start timer
        3. send message (IBV_WC_SEND)
        4. wait for receive to complete (the send on the other side is posted only when the message has arrived)
        5. stop timer

        it measures: 2 messages + ACK for the first send + (optional: ACK to other side of received message)

        Thanks.
        Boris.

      • Dotan Barak says: October 10, 2013

        Exactly.

        One tip though: if you care about latency, you should send the message inline'd
        (if the message is small).

        Thanks
        Dotan

  9. Boris says: October 15, 2013

    Hello Dotan

    in ibv_post_send:
    1. Are the ibv_send_wr list and its sg_list destroyed automatically when the operation completes?
    2. Or can I destroy them after the call returns?
    3. Or do they have to be kept alive until the Work Completion is received?

    Boris.
    Thanks.

    • Dotan Barak says: October 15, 2013

      Hi Boris.

      The sg_list array can safely be (re)used after ibv_post_send() returns:
      the Send Request is enqueued to the Send Queue space of the Queue Pair
      once it is posted.

      Thanks
      Dotan

  10. Jagadeesh says: November 11, 2013

    Hi Dotan.

    Is there any way to know the maximum length of INLINE data that can be sent in a SEND or RDMA_WRITE?

    • Dotan Barak says: November 11, 2013

      Hi.

      Unfortunately, struct ibv_device_attr doesn't contain any attribute that specifies the maximum amount of INLINE data that can be sent.
      When creating a QP, qp_init_attr->cap.max_inline_data is returned with the number of bytes of INLINE data that can be sent in this QP.

      Thanks
      Dotan

  11. Martin says: November 22, 2013

    Hi,
    I'm new to RDMA and run into a weird behavior, which I was hoping you could clarify for me:

    I'm using IBV_WR_SEND to send a struct-object which contains some information needed for an RDMA-read later on (rkeys, address and so on).
    Now in principle this works fine, but the strange behavior is that it only works correctly if the object size is a power of 2. I tried these cases:
    sizeof(message) -> 16. This works
    sizeof(message) -> 24. The last object-attribute is always wrong, the rest is correct.
    sizeof(message) -> 32. This works again.

    Is this normal? I have only seen restrictions about the minimum/maximum message size, but nothing that would hint at an additional restriction of this kind. Or did I do something wrong somewhere?

    Thank you very much!
    Martin

    • Dotan Barak says: November 22, 2013

      Hi Martin.

      I have a feeling that the problem isn't related to RDMA.
      In RDMA the minimum message size can be even 0 bytes!

      I have a feeling that the problem happens because of the way the compiler lays out the structure in memory
      (padding, etc..).
      In RDMA, as in any other networking protocol, the application needs to take care of how data is transferred between two machines, since the machines may differ in:
      * CPU arch (32/64) bits
      * Big/little endian

      I have two suggestions here:
      1) You can send me the source code for review, and I'll give you feedback
      2) You can give me more information on what went wrong (since you didn't provide this information)

      Thanks
      Dotan

      • Martin says: November 28, 2013

        Hi,

        thank you for your reply.
        Sorry for my late response, but I was busy the last week.

        So, I have a struct containing: int rkey, int remote buffer size, long remote address
        If I send this, everything is fine. But now suppose I add "int id" to the struct. No matter which attribute is specified last in the struct (let's say for example "int id" is now the last one), that attribute is not received correctly, but gives a wrong value. All other attributes of that struct are correct.

        You are probably correct that this is due to some little/big endian problem.

        Thank you very much!

        Cheers,
        Martin

      • Dotan Barak says: November 28, 2013

        Hi Martin.

        Do you want to share the code with me? This way I'll find your bug ...

        Another way for you to handle it is to write the data (using sprintf()) to an array of characters,
        and send this data as a string and not as a struct (and parse it on the remote side).

        I hope that this tip helped you
        Dotan
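
As an alternative to formatting the data as a string, the same idea can be sketched with an explicit, fixed wire layout: every field is written at a fixed offset in big-endian order, so neither compiler padding nor host endianness affects what is sent. The struct `mr_info`, its field names and the 20-byte layout below are assumptions made for illustration; on a typical 64-bit Linux ABI, sizeof() of a struct like Martin's (two ints, a long, one more int) is 24 because of 4 bytes of tail padding, while the packed message is only 20 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical message carrying what a peer needs for a later RDMA Read. */
struct mr_info {
    uint64_t remote_addr;
    uint32_t rkey;
    uint32_t length;
    uint32_t id;
};

#define MR_INFO_WIRE_SIZE 20u  /* 8 + 4 + 4 + 4 bytes, no padding */

static void put_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

static uint32_t get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Serialize field by field at fixed offsets, most significant byte first. */
static void mr_info_pack(const struct mr_info *m, uint8_t *out)
{
    assert(m != NULL && out != NULL);
    put_be32(out,      (uint32_t)(m->remote_addr >> 32));
    put_be32(out + 4,  (uint32_t)m->remote_addr);
    put_be32(out + 8,  m->rkey);
    put_be32(out + 12, m->length);
    put_be32(out + 16, m->id);
}

/* Rebuild the struct on the receiving side from the same fixed offsets. */
static void mr_info_unpack(struct mr_info *m, const uint8_t *in)
{
    assert(m != NULL && in != NULL);
    m->remote_addr = ((uint64_t)get_be32(in) << 32) | get_be32(in + 4);
    m->rkey   = get_be32(in + 8);
    m->length = get_be32(in + 12);
    m->id     = get_be32(in + 16);
}
```

The sender would register and send the MR_INFO_WIRE_SIZE-byte buffer rather than the struct itself, and the receiver would call mr_info_unpack() on the received bytes.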

  12. Philippe Marguinaud says: November 22, 2013

    Hello Dotan,

    I have the same problem as Stefan (I get IBV_WC_REM_INV_REQ_ERR with RDMA_READ requests). I tried to follow the advice you already posted here as much as possible, but I cannot sort that out myself.

    I can send you a simple program which reproduces my problem, but I would need your email (and your agreement).

    Best regards,

    Philippe

    • Dotan Barak says: November 23, 2013

      Hi Philippe.

      If you want to share the code with me so that I can give you a hint
      about the reason for this problem, you can send it to:
      support at rdmamojo dot com

      Thanks
      Dotan

  13. Jasmine says: November 23, 2013

    Hi Dotan,

    I have a question about P_KEYS in BTH header. Once a relation is established between two QP's, both ends can modify the qp attribute pkey_index. Can both ends use different pkey_index (and ultimately different pkeys) ? i.e A can say B is using Pkey=X and B can say A is using Pkey=Y.
    Thanks,

    Jay

    • Dotan Barak says: November 23, 2013

      Hi Jay.

      It doesn't matter which P_Key index each QP is pointing to
      (since what really matters is the P_Key value itself, and different tables
      *may* have the same P_Key values but in a different order).

      If at some point, the P_Key values of both QPs won't be consistent,
      the packet will be dropped
      (InfiniBand spec: Figure 81 Packet Header Validation Process)

      In your example: if X.key != Y.key, there will be a P-Key mismatch and
      the QPs won't be able to communicate (this is the whole idea of the P_Key..)

      I hope that I helped you.

      Thanks
      Dotan

  14. Omar Khan says: November 25, 2013

    I am trying to use IBV_WR_ATOMIC_CMP_AND_SWP to check a remote value and proceed accordingly. I have registered a 64 bit integer using ibv_reg_mr and sent this remote address to the sending host. But I am getting a remote access error. The sample code you have provided is not complete.
    In the sample code you have used

    sg.addr = (uintptr_t)buf_addr;
    sg.length = buf_size;
    sg.lkey = mr->lkey;

    Is buf_addr a 64 bit integer or a char buffer of size 8? Would it be possible for you to send the complete code of a working compare and swap function?

    • Dotan Barak says: November 25, 2013

      Hi Omar.

      I'm sorry, but I don't have any source code that I can share with you...
      (I plan to write it in the future though)

      Please make sure that:
      1) The remote QP supports incoming Atomic operations
      2) The remote MR supports incoming Atomic operations
      3) The remote address is 8 bytes aligned

      Thanks
      Dotan
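
The third point (and the buffer size asked about above) can be sketched as a small sanity check. It assumes that atomic operations act on a single 64-bit value, so the target address must be 8-byte aligned and the local buffer that receives the original value is 8 bytes long; `atomic_target_ok()` is a hypothetical helper, not part of the verbs API.

```c
#include <assert.h>
#include <stdint.h>

/* The operand of an atomic operation is assumed to be one 64-bit value. */
static_assert(sizeof(uint64_t) == 8, "atomic operand must be 8 bytes");

/* Returns 1 if the remote address is 8-byte aligned and the local
 * scatter/gather entry is exactly 8 bytes long, 0 otherwise. */
static int atomic_target_ok(uint64_t remote_addr, uint32_t sge_length)
{
    return (remote_addr % 8 == 0) && (sge_length == sizeof(uint64_t));
}
```

Either a `uint64_t` variable or an 8-byte, 8-byte-aligned char buffer satisfies this check; what matters is the size and the alignment, not the declared type.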

      • wentian says: June 3, 2016

        Hi, I came across the same problem, and still cannot figure it out. I can successfully process send/recv operation(which means qpn, psn and lid of the remote side is correct), but I fail at RDMA write operation, receiving the IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error when I call ibv_poll_cq(). Any other comments besides the above three hints?? thanks in advance.

      • Dotan Barak says: June 8, 2016

        Hi.

        Did you read my post on ibv_poll_cq()?

        Anyway, check that RDMA Write is enabled in both the remote QP.qp_access_flags and remote MR.access.

        Thanks
        Dotan

  15. Jiajun says: December 7, 2013

    Hi Dotan,

    I have a question about how WRs are finished. Suppose I have built an RC connection between two QPs. First on the receive side, I post two recv WRs, say recv_wr1 and recv_wr2. Then on the send side, I post two send WRs, say send_wr1, send_wr2. My question is, is there any possibility that send_wr2 finishes before send_wr1? What about the receive side? Is it possible that recv_wr2 is finished before recv_wr1?

    Thanks,
    Jiajun

    • Dotan Barak says: December 7, 2013

      Hi Jiajun.

      In terms of the Completion Queue of the Work Queues, you should see the Work Completions in the order in which the corresponding Send Requests were posted.

      In terms of the wire, this isn't an area that I'm fully familiar with, BUT:
      when you send a message, every packet increases the PSN in the Send Queue (and in the remote Receive Queue),
      so send_wr2 cannot be sent before send_wr1 was sent. Otherwise, it wouldn't be possible to detect missing packets (using the PSNs).

      Anyway, you should (re)use the memory only after the relevant Work Request isn't outstanding any more.

      I hope that this helped you.

      Thanks
      Dotan

  16. Omar khan says: January 29, 2014

    Hi
    My question might seem out of context for this post but it's important.
    I have to ask you how to set up an all to all communication between a number of processes, some on the same machine and some on different ones. What I have done is open a listening rdma_cm_id for each process, bind it to a specific port, wait for incoming connection requests, and create a new rdma_cm_id when I have completed a connection request. This works fine if all processes are on different host machines, but if I start multiple processes on the same machine, I get very slow performance or none at all; the system hangs as if in a deadlock. I had hoped that once I have a rdma_cm_id for each process then the processes should communicate without any problem. One thing is that I have only set up one communication channel, but it should suffice for many clients (the man pages say this).
    Regards
    Omar

    • Dotan Barak says: January 29, 2014

      Hi Omar.

      I'm really sorry, but I can't help you with this...
      I don't have a lot of experience with rdma_cm (yet?).

      If you want a good answer, I suggest that you'll send this question to Sean Hefty,
      the writer and maintainer of rdma_cm.

      Sorry again..
      Dotan

  17. Igor R. says: January 30, 2014

    It seems that the descriptions of IBV_WR_ATOMIC_CMP_AND_SWP and IBV_WR_ATOMIC_FETCH_AND_ADD are swapped.

    • Dotan Barak says: January 30, 2014

      Fixed, thanks.

      Dotan

  18. Igor R. says: February 15, 2014

    Hi Dotan,

    What circumstances can make a Send Queue overflow?

    In my program I perform an RDMA Write in a loop (every time with the same source/destination addresses, just to test), and after a while I constantly get ENOMEM from ibv_post_send(). It doesn't seem to be a race, as it always happens after the same count of iterations, and even sleeping ~1sec between iterations doesn't affect anything; besides, the number of successful iterations is correlated with the QP's max_send_wr. None of the WRs is "signaled" (I tried to poll the CQ at every iteration - it's empty).
    I might be missing something basic in the QP configuration. What initialization parameter can cause such a behavior?

    Thanks.

    • Dotan Barak says: February 16, 2014

      When creating a QP, you specify how many WRs can be outstanding in each of the Send and Receive Queues.
      A WR is considered outstanding until there is a Work Completion for it or for a WR that was posted after it to that Work Queue.

      You posted many WRs (in your case, to the Send Queue) and all of them are outstanding.
      From time to time, you need to make them "signaled" and read the Work Completions.

      Thanks
      Dotan
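
A minimal sketch of the bookkeeping this implies, in plain C: every N-th Send Request is marked as signaled, and polling its Work Completion retires that request together with the unsignaled ones posted before it. The names `sq_tracker`, `sq_on_post()` and `sq_on_completion()` are hypothetical, the depth is assumed to be the Send Queue size granted at QP creation, and only successful completions are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct sq_tracker {
    uint32_t depth;            /* usable Send Queue depth (e.g. max_send_wr) */
    uint32_t outstanding;      /* WRs posted and not yet known to be complete */
    uint32_t since_signal;     /* unsignaled WRs posted since the last signaled one */
    uint32_t signal_interval;  /* request a Work Completion every N-th WR */
};

static void sq_tracker_init(struct sq_tracker *t, uint32_t depth, uint32_t interval)
{
    /* interval <= depth guarantees a signaled WR is pending whenever the queue is full */
    assert(depth > 0 && interval > 0 && interval <= depth);
    t->depth = depth;
    t->outstanding = 0;
    t->since_signal = 0;
    t->signal_interval = interval;
}

static bool sq_has_room(const struct sq_tracker *t)
{
    return t->outstanding < t->depth;
}

/* Call before posting a WR. Returns true if this WR should be posted with
 * IBV_SEND_SIGNALED set in send_flags. */
static bool sq_on_post(struct sq_tracker *t)
{
    assert(sq_has_room(t));
    t->outstanding++;
    if (++t->since_signal == t->signal_interval) {
        t->since_signal = 0;
        return true;
    }
    return false;
}

/* Call once per successful Work Completion polled for a signaled WR: it
 * retires that WR and the (interval - 1) unsignaled WRs posted before it. */
static void sq_on_completion(struct sq_tracker *t)
{
    assert(t->outstanding >= t->signal_interval);
    t->outstanding -= t->signal_interval;
}
```

When sq_has_room() returns false, the application polls the CQ until a completion arrives instead of calling ibv_post_send() and getting ENOMEM.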

      • Igor R. says: February 16, 2014

        Oh, I see. This looks like a design flaw, doesn't it? At least, it's quite counter-intuitive behavior, as one would expect that an unsignaled WR gets removed from SQ silently as soon as it's processed - after all, that's the whole point of unsignaled WRs...

      • Dotan Barak says: February 16, 2014

        But if you don't get any Work Completion, how can you avoid posting more WRs than the Work Queue size?
        You *assume* that all the posted WRs were processed, in most cases it is true,
        but there isn't any guarantee about it...

        Thanks
        Dotan

      • Igor R. says: February 16, 2014

        Well, if one produces WRs faster than the HCA can consume, the SQ will eventually overflow, and in *such* situation ENOMEM would be quite logical (like in any producer-consumer scheme) - but still, implicitly treating obviously consumed WRs as outstanding doesn't seem to fit well in this logic. Sometimes the producer can know for sure that he can never overflow the queue (for instance due to retry count/timeout settings vs. timings of WRs), and such a behavior of the queue would surprise him.

      • Dotan Barak says: February 17, 2014

        You are starting to get into the synchronization mechanism between the low-level driver and the HW...
        Anyway, this is the behavior that the protocol defines.

        Thanks
        Dotan

      • Boris says: February 16, 2014

        Hi Dotan, joining the question on this issue. Is there any way (or will be) to block on ibv_post_send (until there is place in the work queue)?

        Otherwise, in a multithreaded application, some semaphore-like synchronization mechanism must be applied, and it could be very costly...

      • Dotan Barak says: February 17, 2014

        Hi Boris.

        Currently, there isn't any way to block in post_send if the Work Queue is full.
        This would require a change in the low-level libraries and in the API (to prevent breaking the current behavior).

        This isn't anything that I can help you with.

        Sorry
        Dotan
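
Since the verb itself cannot block, one application-level option for the multithreaded case is a lock-free credit counter in front of the Send Queue. The sketch below uses C11 atomics; `sq_credits` and its functions are hypothetical names, the initial credit count is assumed to equal the Send Queue depth, and the thread that polls the CQ is assumed to return credits as Work Completions retire Send Requests.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct sq_credits {
    atomic_uint free_slots;  /* Send Queue entries currently available */
};

static void sq_credits_init(struct sq_credits *c, unsigned depth)
{
    assert(depth > 0);
    atomic_init(&c->free_slots, depth);
}

/* Take one credit before posting. Returns false when the queue is full, in
 * which case the caller can poll the CQ, yield, or retry later. */
static bool sq_credits_try_acquire(struct sq_credits *c)
{
    unsigned cur = atomic_load_explicit(&c->free_slots, memory_order_relaxed);
    while (cur > 0) {
        /* On failure, cur is reloaded with the current value and re-checked. */
        if (atomic_compare_exchange_weak_explicit(&c->free_slots, &cur, cur - 1,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return true;
    }
    return false;
}

/* Return n credits after polling completions that retire n Send Requests. */
static void sq_credits_release(struct sq_credits *c, unsigned n)
{
    atomic_fetch_add_explicit(&c->free_slots, n, memory_order_release);
}
```

The uncontended path costs one compare-and-swap per post, which is usually cheaper than a semaphore, but a thread that must really sleep until space is available still needs a blocking primitive on top of this.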

  19. Igor R. says: February 18, 2014

    Dotan,

    You write regarding the inline data that "the low-level driver (i.e. CPU) will read the data and not the RDMA device". Is this correct for both sides? I.e., on the responding side, will the HCA perform DMA for the inlined data, or will the CPU handle it?

    Thanks a lot for your assistance.

    • Dotan Barak says: February 18, 2014

      This is relevant only for the local side, i.e. the side that fetches the data.
      Once the data is sent over the wire, there isn't any hint that it was inlined.

      Thanks
      Dotan

  20. Igor R. says: February 19, 2014

    Hi Dotan,

    Is there a more straightforward and efficient way to write a value atomically to the remote side than performing rdma-read followed by atomic CAS? (There are no stores to this location on the remote side, only loads, but the value must appear consistently/atomically.)

    Thanks.

    • Dotan Barak says: February 20, 2014

      Hi Igor.

      The only supported atomic operations in RDMA are:

      * Fetch and Add
      * Compare and Swap

      I don't know what you are trying to achieve, but using them you can implement
      mutual exclusion primitives.

      What about sending a message using "Send" and increment the value locally using a good old mutex/semaphore/spinlock?

      Thanks
      Dotan

      • Igor R. says: February 20, 2014

        Due to some constraints I can't use send/receive flow...
        What is the level of atomicity of a regular RDMA Write? I.e., does the remote HCA store bytes or words to its local memory?

      • Dotan Barak says: February 20, 2014

        I'm sorry, but I can't provide a good answer here.
        RDMA supports sending a stream of bytes and AFAIK there isn't any guarantee about atomic access of more than one byte.

        Multiple testing may show you that atomicity of words (or more) is achieved, but there may be scenario that this won't be the case...

        Dotan

  21. scott says: March 17, 2014

    Hi Dotan,
    Great website. Thanks for all the work.
    Question about posting WRs. If I post a WR to a WQ, does a copy of the WR get made so that after the ibv_post_send() completes, I am free to overwrite that WR for my own purposes? Or is just a pointer to that WR posted to the WQ and I have to keep it intact until the completion occurs? I tried to find the internal representation of the WQs to see if I could deduce the answer myself, but no luck.

    • Dotan Barak says: March 18, 2014

      Thanks
      :)

      Short answer: yes, a copy is made.

      Long answer: the low-level driver translates the Work Request structure from the verbs API to the HW API
      and posts this HW-specific WR to the relevant Work Queue.

      After the verb that posts the WR returns, you are free to change this WR structure.

      If you want to see how this is done, you need to check the code of the low-level drivers...

      Thanks
      Dotan

      • Ariel says: March 31, 2014

        Hi Dotan,
        Your site is a huge help!
        Regarding reuse of WR, are the ibv_sge elements copied as well?
        From my reading of the code they are copied, but can I reuse them when ibv_post_send returns?
        Also is there a restriction on multiple WR with the same wr_id?
        For example can the same id be used to identify a chain of WR posted together?
        Thanks!

      • Dotan Barak says: March 31, 2014

        Thanks!

        Yes. The s/g list is copied to the QP's Send Queue, so it can be reused.

        About the wr_id; it is a user defined private data and can contain any value that you wish..
        (including multiple WRs with the same wr_id).

        Sure
        Dotan

  22. Bernard Gütermann says: May 29, 2014

    Hi

    thx for your previous answers.

    I was wondering: Is there a performance difference between IBV_WR_RDMA_WRITE(_WITH_IMM) and IBV_WR_SEND(_WITH_IMM) ?

    Also is there any advantage of having the remote post IBV_WR_RDMA_READ instead of posting IBV_WR_RDMA_WRITE(_WITH_IMM)/IBV_WR_SEND(_WITH_IMM) locally?

    thx
    Bernard

    • Dotan Barak says: May 30, 2014

      Hi Bernard.

      In the following post you'll find most of your answers:
      Tips and tricks to optimize your RDMA code

      However, I'll answer your questions briefly:
      Yes, there is a performance difference, so one should prefer using RDMA Write with immediate instead of Send with immediate.

      RDMA Read is considered more "expensive" than RDMA Write or Send operations, so one should prefer the latter operations.

      I hope that I helped
      Dotan

  23. Henry Fu says: August 11, 2014

    Hi Dotan,

    This is a fantastic website for RDMA learners! I have a question regarding the atomic operations. That is, how are the RDMA atomic operations (FetchAdd & CmpSwap) implemented? I guess there should be a locking mechanism that comes into play once the atomic operations are performed on some memory buffer. Is the lock implemented on the network (RNIC?), on the specific memory buffer, on the memory bus, or somewhere else?

    Thanks in advance!

    Henry

    • Dotan Barak says: August 12, 2014

      Hi Henry.

      Thanks for the compliment.
      :)

      The atomic operations are atomic related to other atomic operations and not to any other operation or any other memory access.

      I don't *know* the internal implementation but I can guess;
      It depends of the supported atomicity level of the RDMA device:
      * If it supports atomicity within the device - it may have an internal mechanism to prevent other atomic access to this memory
      * If it supports atomicity between other devices - I guess that it will lock the bus or something like this.

      AFAIK, so far atomicity is supported only within the device.

      I hope that this answer helped you.

      Thanks
      Dotan

      • Igor R. says: August 13, 2014

        Hi Dotan,

        > The atomic operations are atomic related to other atomic operations and not to any other operation or any other memory access.

        Do you mean that if one modifies a remote value with eg. IBV_WR_ATOMIC_FETCH_AND_ADD, this modification will *not* appear as atomic for any other software (eg. running locally on that machine) that attempts to read this memory location?

      • Dotan Barak says: August 15, 2014

        Hi Igor.

        Here is the exact quote from the InfiniBand specifications:
        "o9-17: Atomicity of the read/modify/write on the responder’s node by the
        ATOMIC Operation shall be assured in the presence of concurrent atomic
        accesses by other QPs on the same CA."

        It specifies how the RDMA device will handle the content of the memory and doesn't really mention other interfaces (such as the software). For example: it *may* perform the following: Read, modify, write and perform the write 10 seconds after the read happened. During this time, the RDMA device will prevent any access to this memory by other Atomic operations. The (local) software isn't really aware of the operations that are done by the RDMA device...

        Thanks
        Dotan

  24. Zhang Yue says: October 13, 2014

    Hi Dotan

    I use ibv_post_send(), doing RDMA write, I found that if the num_sge is 4, it return -1; if the num_sge is 2 or 1 , it works fine. (the buffer is 4kB each).

    How can I make it send 4(or more) num_sge buffers?

    Thanks.

    Zhang Yue

    • Dotan Barak says: October 13, 2014

      Hi Zhang Yue.

      Can you send the output of:
      ibv_devinfo | grep max_sge

      Thanks
      Dotan

      • Zhang Yue says: October 14, 2014

        hi Dotan,

        The command output is these:

        root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo | grep max_sge
        root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target# ibv_devinfo
        hca_id: mlx4_0
        transport: InfiniBand (0)
        fw_ver: 2.32.5100
        node_guid: f452:1403:0028:0820
        sys_image_guid: f452:1403:0028:0823
        vendor_id: 0x02c9
        vendor_part_id: 4099
        hw_ver: 0x0
        board_id: MT_1090120019
        phys_port_cnt: 2
        port: 1
        state: PORT_ACTIVE (4)
        max_mtu: 4096 (5)
        active_mtu: 4096 (5)
        sm_lid: 3
        port_lid: 4
        port_lmc: 0x00
        link_layer: IB

        port: 2
        state: PORT_ACTIVE (4)
        max_mtu: 4096 (5)
        active_mtu: 4096 (5)
        sm_lid: 1
        port_lid: 2
        port_lmc: 0x00
        link_layer: IB

        root@ubuntu-earth:/home/zhangyue/t0_src/stgt/perf_test/zy_target#

      • Zhang Yue says: October 14, 2014

        hi Dotan

        I found that the queue pair config limits it:
        qp_init_attr.cap.max_send_sge = 1; /* scatter/gather entries */
        qp_init_attr.cap.max_recv_sge = 1;
        I changed 1 to 16 and it works.

        Thanks, you are nice.

        Zhang Yue

      • Dotan Barak says: October 15, 2014

        Hi Zhang Yue.

        Thanks for the update.
        I've updated the description of num_sge in the posts that describe the structures of Send Request and Receive Request to be more informative according to your problem.

        Thanks
        Dotan

  25. Lluis says: November 5, 2014

    In a UD QP, can you post an inline send with immediate data?

    • Dotan Barak says: November 6, 2014

      Yes, you can.

      Thanks
      Dotan

  26. Igor R. says: November 11, 2014

    Hi Dotan,

    I'd like to consult with you on the following subject: we perform IBV_WR_RDMA_WRITE to a remapped BAR of a remote PCI device and experience poor throughput. Using hardware monitoring tools we found out that the data was being written in 64-byte packets, and that's what caused the above issue.
    My question is whether there's any configuration that could affect the way HCA writes the data?
    I post a non-signalled rdma-write, with >1K of data as a single SGE, 4K MTU.

    • Dotan Barak says: November 11, 2014

      Hi Igor.

      I'm sorry, but this is device specific and I don't know much about it.

      However, I would check with the vendor of that PCI device to get more details.
      Do you have performance problems when accessing the PCI device locally?
      Maybe the way that this BAR is mapped to kernel can be improved?

      I hope that this gives you a hint...

      Thanks
      Dotan

      • Igor R. says: November 11, 2014

        It's "device-specific" in the sense that writing 64-byte packets causes the device to get the data slowly (which doesn't happen when the HCA writes to RAM, or when we DMA to this PCI device by other means) - the device vendor confirmed this assumption.
        The BAR is remapped to a user-space virtual addresses with io_remap_pfn_range(), then registered as rdma memory-region using PeerMemory mechanism recently introduced in Mellanox OFED especially for this purpose.
        I believe the remote (w.r.t to the PCI device) HCA sends the data over the fabric in MTU-sized chunks, so it's probably the local HCA that performs such a "slow", or PCI-unfriendly, DMA.
        So, the question is whether we have any control over the way HCA performs the DMA?

      • Dotan Barak says: November 13, 2014

        Hi Igor.

        AFAIK, there isn't any way to control how the HCA performs the DMA.
        I doubt it, but even if there are ways to do this, you'll need to get this info from the HW vendors.

        Sorry.
        Dotan

  27. Govind Patidar says: December 3, 2014

    Hi,
    Suppose I post two requests in the receive queue but for some reason I receive the data for the second request before the first. Is it possible to receive data for the second request before the first, or will it always give an error?

    • Dotan Barak says: December 3, 2014

      Hi Govind.

      You have two Receive Requests in your Receive Queue
      (the Receive Queue "knows" only the order in which those Receive Requests were posted,
      and this order is guaranteed).

      The next incoming message to the Queue Pair that consumes a Receive Request will take
      those Receive Requests in the order in which they were enqueued.

      I understand that your application has the semantics of the first and second one,
      however, the RDMA doesn't.

      Bottom line, the answer is: no.

      BTW it should always give an error. You didn't give me enough info,
      but I believe that the problem is that the "first" Receive Request is small.
      This can be fixed by making sure that all the Receive Requests can hold all the incoming messages ...

      I hope that this helps you
      Dotan

  28. Govind Patidar says: December 3, 2014

    hii all,
    during ibv_post_send I am getting errno 0 and 2 for two different messages. Can someone please point me to a document where I can find a description of the errno values? I am using the OFA RDMA APIs.

    • Dotan Barak says: December 3, 2014

      Hi.

      Unfortunately, the errno return values aren't consistent across all low-level drivers in RDMA.
      If you'll share the code, maybe I'll be able to answer you.

      Thanks
      Dotan

  29. Erfan says: December 3, 2014

    Hello,

    I can successfully send RDMA READ/WRITE, but I can't get RDMA atomic operations to work. I get an error when calling ibv_post_send function in the client, and the errno will be set to "Invalid Arguments.". Below I pasted important parts of my code. Could you please check my code and let me know if I'm missing anything?

    *********** client side *****************:
    -- Registering the memory regions --
    mr = ibv_reg_mr(pd, buff, size, IBV_ACCESS_LOCAL_WRITE);
    // and the size is 8

    if (!mr){
    fprintf(stderr, "Error, memory registration failed\n");
    return -1;
    }

    -- Preparing RDMA ATOMIC FETCH AND
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_sge sge;

    memset(&sge, 0, sizeof(sge));
    sge.addr = buff;
    sge.length = 8;
    sge.lkey = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 0;
    wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
    wr.sg_list = &sge;
    wr.num_sge = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    wr.wr.atomic.remote_addr = remote_buffer;
    wr.wr.atomic.rkey = peer_mr->rkey;
    wr.wr.atomic.compare_add = 1ULL; /* value to be added to the remote address content */

    if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
    }
    ********* End of Client side *******

    ****** Server side ****************
    -- Registering the memory regions --
    mr = ibv_reg_mr(pd, rdma_region_timestamp_oracle, sizeof(TimestampOracle),
    IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC ));

    if (!mr){
    fprintf(stderr, "Error, memory registration() failed\n");
    return -1;
    }

    NOTE: TimestampOracle is a class with two int members, so its size is 8 bytes (satisfies 64-bit condition for RDMA ATOMIC operations)

    Thank you for your helps,
    Erfan

    • Dotan Barak says: December 4, 2014

      Hi Erfan.

      I have some questions:
      1) Did you check that the RDMA device supports Atomic?
      2) Did you check that the remote address is 8 byte aligned?
      3) Did you enable atomic at the responder QP?
      4) Is this an RC QP?

      I hope that one of the above questions gave you a hint on the problem.
      If not, I'll need to see more source code and information on the RDMA devices that you are using.

      Thanks
      Dotan

      • Erfan says: December 4, 2014

        Hello Dotan,

        Thank you for your response. I'll try to address your questions as far as my understanding
        1) How can I check that? Do you mean that some RDMA devices support Atomic and some don't?

        2) I simplified the code, so now the remote address is one (long long) variable, which is 8 bytes (I paste the code at the end of this comment).

        3) As you can see in my previous comment, on the server side code, I registered the memory region to be able to be accessed atomically by ibv_reg_mr(pd, ... , ...,
        IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC )). Do I need to do anything other than that?

        4) When initializing the queue pairs on both client and server, I used qp_attr->qp_type = IBV_QPT_RC.

        Here's the simplified code, I tried to leave unrelated parts out. I know how annoying it can be to read somebody else's lousy code. I'd really appreciate your help.

        ******** client code **********
        void build_qp_attr(struct ibv_qp_init_attr *qp_attr){
        memset(qp_attr, 0, sizeof(*qp_attr));
        qp_attr->send_cq = s_ctx->cq;
        qp_attr->recv_cq = s_ctx->cq;
        qp_attr->qp_type = IBV_QPT_RC;

        qp_attr->cap.max_send_wr = 10;
        qp_attr->cap.max_recv_wr = 10;
        qp_attr->cap.max_send_sge = 1;
        qp_attr->cap.max_recv_sge = 1;
        }

        void register_memory(struct connection *conn) {
        local_buffer = new long long[1];
        local_mr = ibv_reg_mr(pd, local_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE));
        }

        void on_completion(struct ibv_wc *wc){
        struct connection *conn = (struct connection *)(uintptr_t)wc->wr_id;
        // Assume that the client already knows about the remote_mr on the server side
        if (wc->opcode & IBV_WC_RECV) {
        struct ibv_send_wr wr, *bad_wr = NULL;
        struct ibv_sge sge;

        memset(&sge, 0, sizeof(sge));
        sge.addr = (uintptr_t)local_buffer;
        sge.length = sizeof(long long);
        sge.lkey = local_mr->lkey;

        memset(&wr, 0, sizeof(wr));
        wr.wr_id = 0;
        wr.opcode = IBV_WR_ATOMIC_FETCH_AND_ADD;
        wr.sg_list = &sge;
        wr.num_sge = 1;
        wr.send_flags = IBV_SEND_SIGNALED;

        wr.wr.atomic.remote_addr = (uintptr_t)remote_mr.addr;
        wr.wr.atomic.rkey = remote_mr.rkey;
        wr.wr.atomic.compare_add = 1ULL;

        if (ibv_post_send(qp, &wr, &bad_wr)) {
        fprintf(stderr, "Error, ibv_post_send() failed\n");
        die();
        }

        }
        }
        ***** End of client code ********

        **** Serve code ******
        struct connection {
        struct rdma_cm_id *id;
        struct ibv_qp *qp;
        struct ibv_mr *mr;
        long long *rdma_buffer;
        };

        void build_qp_attr(struct ibv_qp_init_attr *qp_attr) {
        memset(qp_attr, 0, sizeof(*qp_attr));
        qp_attr->send_cq = s_ctx->cq;
        qp_attr->recv_cq = s_ctx->cq;
        qp_attr->qp_type = IBV_QPT_RC;

        qp_attr->cap.max_send_wr = 10;
        qp_attr->cap.max_recv_wr = 10;
        qp_attr->cap.max_send_sge = 1;
        qp_attr->cap.max_recv_sge = 1;
        }

        void register_memory(struct connection *conn){
        rdma_region = 1ULL;

        rm = ibv_reg_mr(pd, rdma_buffer, sizeof(long long), IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_ATOMIC ));
        }
        ***** End of Server code *******

      • Dotan Barak says: December 5, 2014

        1) In struct ibv_device_attr, there is an attribute called 'atomic_cap'.
        This describes the atomicity support level of this device.

        There may be devices that don't support atomic operations.
        For more information, please read the post of ibv_query_device().

        (Can you tell me what is its value?)

        2) Please check the remote address value, that it is 8 byte aligned
        (Can you tell me what is its value?)

        3) When calling ibv_modify_qp(), there is an attribute in struct ibv_qp_attr called 'qp_access_flags';
        did you enable IBV_ACCESS_REMOTE_ATOMIC on the responder side?

        For more information, please read the post on ibv_modify_qp().

        4) Only RC QPs support Atomic, and I see that you are using one.
        And it's o.k., I don't mind reading other people's code :)
        (I'm doing it all the time).

        The code looks fine, beside from my comments above.

        If you can send me by email (dotan at rdmamojo dot com):
        1) The full source code
        2) The parameters of your program
        3) Execution example and output of your program
        4) The output of 'ibv_devinfo -v'

        I'll be able to help you further
        (there is a limit to what I can do with only a description..)

        Thanks
        Dotan
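
The alignment question in point 2 can be checked locally before posting. A minimal sketch in plain C, assuming the remote address has already been exchanged as a 64-bit integer; the helper name atomic_target_ok is hypothetical and is not part of libibverbs:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not a libibverbs call).
 * RDMA atomic operations work on a single 8-byte value, so the
 * scatter/gather length must be 8 and the remote address must be
 * 8-byte aligned. */
static bool atomic_target_ok(uint64_t remote_addr, uint32_t length)
{
    return length == 8 && (remote_addr & 7u) == 0;
}
```

Calling it with wr.wr.atomic.remote_addr and sge.length just before ibv_post_send() catches one of the four causes listed above; device support and the responder's qp_access_flags still have to be checked separately.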

  30. Jaume says: December 5, 2014

    Hello Dotan,

    I'm trying to speedup ibv_post_send when sending inline messages by using unsignaled completions. The problem is that it doesn't work if I post more than "qp_init_attr.cap.max_send_wr" unsignaled send requests. I tried to post one signaled request every N unsignaled ones, but still crashes after max_send_wr. What am I doing wrong?

    • Dotan Barak says: December 6, 2014

      Hi Jaume.

      The flow that you've described sounds valid. What do you mean by "still crashes"?
      (Since I don't expect to get a crash in this flow, unless there is a bug).

      Did you provide a valid bad_wr pointer to the ibv_post_send () verb?

      Thanks
      Dotan

      • Jaume says: December 8, 2014

        By crashing, I meant that ibv_post_send fails. I do not want to spend time reading the completions, so I send an "unsignaled" message. However, it seems that the unsignaled does not work because send fails once the CQ gets filled up. The QP is created with "qp_init_attr.sq_sig_all = 0;" and messages sent without the IBV_SEND_INLINE flag.

      • Dotan Barak says: December 8, 2014

        "Unsignaled Work Requests" mean that those Send Requests won't generate Work Completions.
        However, they are still considered outstanding. This means that you need to empty the Send Queue
        by posting signaled Send Requests from time to time
        (otherwise, the Send Queue will be full, and you won't be able to post any new Send Requests).

        The IBV_SEND_INLINE isn't relevant to the signalling of the Send Requests.

        Bottom line, from time to time, you must post signaled Send Requests
        (if the Send Queue size is N, you can post signaled Send Requests every N messages,
        and by polling its Work Completion, you'll empty the Send Queue).

        Thanks
        Dotan
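
The bookkeeping described above can be sketched in plain C. This is only an illustration under stated assumptions: the struct and function names are hypothetical (not libibverbs), the QP is assumed to complete Send Requests in order, and the caller is assumed to set IBV_SEND_SIGNALED on a Send Request whenever the helper asks for it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker for selective signaling (not a libibverbs type). */
struct sq_tracker {
    uint32_t depth;        /* Send Queue size (max_send_wr)            */
    uint32_t interval;     /* one signaled WR every `interval` posts   */
    uint32_t outstanding;  /* posted WRs not yet reclaimed             */
    uint32_t since_signal; /* WRs posted since the last signaled one   */
};

static void sq_tracker_init(struct sq_tracker *t, uint32_t depth,
                            uint32_t interval)
{
    assert(depth > 0 && interval > 0 && interval <= depth);
    t->depth = depth;
    t->interval = interval;
    t->outstanding = 0;
    t->since_signal = 0;
}

/* Returns false when the Send Queue is full and nothing may be posted.
 * Otherwise accounts for one WR and tells the caller whether it must
 * be posted as signaled. */
static bool sq_tracker_post(struct sq_tracker *t, bool *signaled)
{
    if (t->outstanding == t->depth)
        return false;
    t->outstanding++;
    t->since_signal++;
    *signaled = (t->since_signal == t->interval);
    if (*signaled)
        t->since_signal = 0;
    return true;
}

/* Call once per polled Work Completion of a signaled WR: it retires
 * that WR and the unsignaled WRs posted before it. */
static void sq_tracker_complete(struct sq_tracker *t)
{
    assert(t->outstanding >= t->interval);
    t->outstanding -= t->interval;
}
```

With depth 4 and interval 2, the fifth post is refused until one Work Completion has been polled, which matches the failure Jaume observed when the completions were never read.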

    • Igor R. says: December 7, 2014

      Jaume, note that you have to process completions in the completion queue.

  31. Govind Patidar says: December 8, 2014

    Hii Dotan,
    I am trying to post a send request to a queue that is already full, and I am getting an error (ENOMEM). So I sleep for some time and post the same request again, but it throws the same error. (Consider that after the sleep time the send queue is not full.)

    • Dotan Barak says: December 8, 2014

      Hi. Govind.

      Did you poll some Work Completions (which were posted to that Work Queue) from the associated CQ during this time?

      Thanks
      Dotan

      • Govind Patidar says: December 8, 2014

        Yes, I did, and I am getting an error there also... Currently I solved this issue by checking the number of pending requests in the send queue before posting any request (using the idea that you mentioned in one of the comments), and it is working, but I don't want to do that because of the performance cost. One more thing: how should I increase the maximum limit of pending requests in the queue? Thanks for all the help and suggestions, I really appreciate it.

      • Dotan Barak says: December 8, 2014

        I'm glad that i can help.
        :)

        Which error do you get?
        Can you share the source code?
        It will be easier for me to help you with the source in front of me.

        Thanks
        Dotan

  32. Govind Patidar says: December 9, 2014

    Hi Dotan,
    I can't share the code (confidentiality issues), but I can tell you the error numbers: the first error I get has error number 12, and after that I get error number 5 for all the other messages while polling the CQ. Can you please tell me how to increase the maximum limit of pending requests in the queue? Currently I am able to post ~8192 requests.

    • Dotan Barak says: December 9, 2014

      Hi Govind.

      When calling ibv_create_qp(), you control the Send Queue (please refer to the post on this verb for more information).
      I suspect that you have completions with error (i.e. the 5 and 12 errors that you reported).
      Am I right? (Are those the status values of the Work Completions that you polled?)

      If this is the case, completion status 12 = IBV_WC_RETRY_EXC_ERR, which means that the remote side didn't answer within the expected time.

      Thanks
      Dotan
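
For reference, the numeric Work Completion status values can be turned into names with a small lookup table. The sketch below is plain C and the function name wc_status_name is hypothetical; the table mirrors the order of enum ibv_wc_status in verbs.h (values 0 to 21). In real code, ibv_wc_status_str() from libibverbs returns a readable description and is the better choice.

```c
#include <stddef.h>

/* Hypothetical helper: maps a numeric Work Completion status to the
 * name of the matching enum ibv_wc_status value (0..21). */
static const char *wc_status_name(int status)
{
    static const char *const names[] = {
        "IBV_WC_SUCCESS",            /*  0 */
        "IBV_WC_LOC_LEN_ERR",        /*  1 */
        "IBV_WC_LOC_QP_OP_ERR",      /*  2 */
        "IBV_WC_LOC_EEC_OP_ERR",     /*  3 */
        "IBV_WC_LOC_PROT_ERR",       /*  4 */
        "IBV_WC_WR_FLUSH_ERR",       /*  5 */
        "IBV_WC_MW_BIND_ERR",        /*  6 */
        "IBV_WC_BAD_RESP_ERR",       /*  7 */
        "IBV_WC_LOC_ACCESS_ERR",     /*  8 */
        "IBV_WC_REM_INV_REQ_ERR",    /*  9 */
        "IBV_WC_REM_ACCESS_ERR",     /* 10 */
        "IBV_WC_REM_OP_ERR",         /* 11 */
        "IBV_WC_RETRY_EXC_ERR",      /* 12 */
        "IBV_WC_RNR_RETRY_EXC_ERR",  /* 13 */
        "IBV_WC_LOC_RDD_VIOL_ERR",   /* 14 */
        "IBV_WC_REM_INV_RD_REQ_ERR", /* 15 */
        "IBV_WC_REM_ABORT_ERR",      /* 16 */
        "IBV_WC_INV_EECN_ERR",       /* 17 */
        "IBV_WC_INV_EEC_STATE_ERR",  /* 18 */
        "IBV_WC_FATAL_ERR",          /* 19 */
        "IBV_WC_RESP_TIMEOUT_ERR",   /* 20 */
        "IBV_WC_GENERAL_ERR",        /* 21 */
    };
    size_t count = sizeof(names) / sizeof(names[0]);

    if (status < 0 || (size_t)status >= count)
        return "unknown";
    return names[status];
}
```

Read this way, status 5 is IBV_WC_WR_FLUSH_ERR: once the first Work Request fails (here with status 12) the QP moves to the Error state and the remaining outstanding Work Requests are flushed.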

  33. Govind Patidar says: December 22, 2014

    Hii Dotan,
    First of all, thanks for all your help. Finally my code is working, and currently I am getting 3 times better performance for RDMA compared to UDP. I have a few more questions: how much improvement (at most) can we expect with RDMA compared to UDP? Currently I am using only channel semantics; is there a good chance to improve if I use memory semantics as well?

    • Dotan Barak says: December 22, 2014

      Hi Govind.

      I'm happy that I can help
      :)

      1) Performance is a very big area. Which metrics do you check? What are the current numbers in UDP?
      Do you compare using RC QP/UD QP? Which operations do you use?
      2) What do you mean by channel semantics and memory semantics?

      Thanks
      Dotan

  34. Govind Patidar says: December 22, 2014

    I am using RC QP and comparing with the UDP protocol on the basis of the waiting time for requested data.
    Regarding memory semantics, I mean that I am not allowing the remote node's channel adapter to write directly to host memory using an rkey (all read/write operations are done by the local channel adapter using the lkey). The reason for using only channel semantics is that I am transferring a very small amount of data at a time.

    • Dotan Barak says: December 23, 2014

      So, I guess that your metric is latency.

      I suggest that you'll execute a tool that comes with the OFED package called ib_send_lat,
      which will provide you the (best) latency that you can achieve using SEND operations in your setup.

      The performance depends on so many factors, so I prefer not to provide a number.

      Thanks
      Dotan

  35. Zhang Yue says: December 23, 2014

    hi Dotan
    (at the Target side) When I'm doing an RDMA READ with 4 wr, each wr having 1 sge (4KB), the initiator will easily crash or the /dev/sdxx disappears. (Doing RDMA WRITE is fine.)
    I've set the wr's rkey and increased remote_addr by 4096, any suggestions?

    Thanks
    Zhang Yue
    ps:
    for (k = 1; k < task->cache_req.sglist_size; k++)
    {
    multi_wr[k] = rdmad->send_wr; // copy struct

    multi_wr[k].next = &multi_wr[k+1];
    multi_wr[k].sg_list = &task->rdma_sge[k];
    multi_wr[k].send_flags = 0; //zy: should be 0, otherwise the task will be freed multiple times
    multi_wr[k].wr.rdma.remote_addr += (4096 * k);

    task->rdma_sge[k].addr = tgt_phy2virt(task->cache_req.sglist[k].addr);
    task->rdma_sge[k].length = task->cache_req.sglist[k].len;
    task->rdma_sge[k].lkey = get_cache_buf_lkey(task->conn->dev, task->cache_req.sglist[k].addr);

    }

    // insert to list
    multi_wr[k-1].next = rdmad->send_wr.next;
    rdmad->send_wr.next = &multi_wr[1];

    task->task_multi_wr = multi_wr;

    //this sge.length mark the total length, will be use at iser_rdma_rd_comp_complete_handler
    rdmad->sge.length = task->rdma_rd_sz;

    // so we need to place the first wr's sge to other place
    rdmad->send_wr.sg_list = task->rdma_sge;
    task->rdma_sge[0].addr = tgt_phy2virt(task->cache_req.sglist[0].addr);
    task->rdma_sge[0].length = task->cache_req.sglist[0].len;
    task->rdma_sge[0].lkey = get_cache_buf_lkey(task->conn->dev, task->cache_req.sglist[0].addr);

    • Dotan Barak says: December 23, 2014

      Hi.

      I don't know if this is related to RDMA.

      I would suggest to check if the local buffer that is being filled
      is still allocated or being freed.

      Maybe you should print the local address and check if the values make any sense.

      Please check that before using the values the Work Completion status is o.k.

      Thanks
      Dotan

      • Zhang Yue says: December 25, 2014

        Hi Dotan

        Firstly, may all of us Merry Christmas!
        Yes, this issue is NOT related to RDMA.

        Yesterday, I printed every wr before calling ibv_post_send(), and found an issue:
        After doing a lot of 16KB writes, the tgt may receive an INQUIRY, and if the INQUIRY unluckily uses a task struct that was previously used by a 16KB write (or read),
        it will use the old four 4KB buffers and DMA them to the initiator. INQUIRY only reads 70 bytes, so DMAing 16KB to it corrupts the initiator's memory.

        The main fix is: check the remaining DMA length, and if it is <= 0, skip the remaining buffers.

        Thanks

        Zhang Yue
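
The fix described above (stop using buffers once the requested length is covered) can be sketched in plain C. This is not the code from the thread: the function name and the flat array of lengths are assumptions made for illustration, standing in for the per-SGE length fields.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: trims the per-buffer lengths so that together
 * they cover exactly `needed` bytes, and returns how many buffers are
 * actually used. Buffers past the returned count must not be posted. */
static size_t clamp_sge_lengths(uint32_t *lengths, size_t count,
                                uint32_t needed)
{
    size_t used = 0;

    for (size_t i = 0; i < count && needed > 0; i++) {
        if (lengths[i] > needed)
            lengths[i] = needed;   /* last buffer is only partly used */
        needed -= lengths[i];
        used++;
    }
    return used;
}
```

For a 70-byte INQUIRY with four stale 4KB buffers this yields one buffer of 70 bytes, so only one Work Request would be chained instead of four.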

      • Dotan Barak says: December 25, 2014

        Hi.

        Merry Christmas indeed
        :)

        I'm happy that you found the problem.

        Dotan

  36. jiaxin shi says: January 15, 2015

    Hi Dotan

    I am trying to use the IBV_WR_ATOMIC_CMP_AND_SWP operation, and I get this error when I poll the wc: IBV_WC_REM_ACCESS_ERR

    I just made some simple modifications based on the code provided in the book "RDMA_Aware_Programing_user_manual"; do you know what the problem is?

    • Dotan Barak says: January 15, 2015

      Hi.

      Please check that IBV_ACCESS_REMOTE_ATOMIC is enabled in the remote memory buffer and in the remote QP.

      Thanks
      Dotan

  37. Jesus Camacho says: January 22, 2015

    Hi Dotan,

    I want to post a request, but I want the remote QP to discard this request as soon as it receives it. This is because I want to send a dummy packet when the QPs are in the REARM state in order to reach the ARMED state (an incoming packet is needed for this transition).

    I am using the below configuration and it seems to be working, but I would like to know if you think that this could be a generic approach for any situation or not:

    struct ibv_send_wr wr;
    struct ibv_send_wr *bad_wr;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id = 0;
    wr.sg_list = NULL;
    wr.num_sge = 0;
    wr.opcode = 0;
    wr.send_flags = 0;

    if (ibv_post_send(ctx->id[num_qp]->qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
    }

    Best regards,
    Jesus Camacho

    • Dotan Barak says: January 23, 2015

      Hi Jesus.

      You are sending a "standard" zero message. This can work, but you consume a Receive Request in the remote side.
      Did you consider sending a zero message RDMA Write?

      Thanks
      Dotan

  38. Jesus Camacho says: January 23, 2015

    Hi Dotan,

    I am currently using the opcode 0 (which is the IBV_WR_RDMA_WRITE operation) and it is working fine with the Infiniband microbenchmarks.

    Is that what you are suggesting me?
    If so, do you think this can be extrapolated to any scenario?

    Thanks for your time,
    Jesus

    • Dotan Barak says: January 23, 2015

      Hi.

      Yes, this was my suggestion.
      What do you mean by "do you think this can be extrapolated to any scenario"?

      Thanks
      Dotan

      • Jesus Camacho says: January 23, 2015

        Hi,

        I mean if this is a general solution.

        Do you think that this is going to work when using another benchmarks, applications, etc.?

        Best,
        Jesus

      • Dotan Barak says: January 23, 2015

        Hi.

        Yes. Using a zero-byte message is valid and can always be used.
        Working with such messages with RDMA Write opcode can provide better performance than the Send opcode.

        Thanks
        Dotan

      • Jesus Camacho says: January 23, 2015

        Hi,

        good to know!

        Thanks for your help :-)
        Jesus

      • Dotan Barak says: January 23, 2015

        Sure
        :)

        Dotan

  39. John says: April 13, 2015

    Hello Dotan,

    I have a quick question. What happens if the local node calls ibv_post_send() with opcode IBV_WR_SEND before the remote node calls ibv_post_recv()?

    Thanks!

    • Dotan Barak says: April 14, 2015

      Hi John, the answer won't be quick though
      ;)

      What matters is not when the sides posted the Send/Receive Requests in absolute time,
      since one may not know when the actual scheduling of the Send Request will take place...

      If a message that consumes a Receive Request is received by a Queue Pair when there isn't any available Receive Request in that Queue,
      an RNR (Receiver Not Ready) flow will start for Reliable QPs. For Unreliable QPs, the incoming message will be (silently) dropped.

      Thanks
      Dotan

      • John says: April 14, 2015

        Hello Dotan,

        Thanks for the quick reply!

        I am using a Reliable QP. So I think I will get the RNR errors. Now I have a couple of choices: (a) when getting an RNR error, back off and re-post the send request later; (b) implement a flow control protocol so that the local node posts send requests only when the remote node is ready. I like (b) more than (a). But (b) adds complexity, and needs to take care of cases such as both nodes waiting for the other side to become ready. :-)

        So I am wondering if there is a common practice.

        Thanks!

      • Dotan Barak says: April 15, 2015

        Sure :)

        In RNR flows, the problem is that the receiver side doesn't post Receive Requests fast enough ..

        About your suggestions:
        a) When you have an RNR error, your local QP is in ERROR state, so you can't post another Send Request without reconnecting it with the remote QP.
        b) is a good idea

        There are more options:
        * You can increase the RNR timeout
        * You can increase the RNR retry count (the value 7 means infinite retries)
        * If you have several QPs at the receiver side, you can use a SRQ and make sure that the SRQ is never empty
        (the SRQ LIMIT mechanism can help you to detect if the number of Receive Requests dropped below a specific watermark)

        Adding flow control to your messages is always a good idea in order to not enter the RNR flow in the first place.

        Thanks
        Dotan
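
Option (b), application-level flow control, is often done with credits. A minimal sketch in plain C follows; the names are hypothetical, and how the receiver reports newly posted Receive Requests back to the sender (for example, piggybacked on its own messages) is an assumption left out of the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sender-side credit counter (not a libibverbs type).
 * One credit stands for one Receive Request that the remote side has
 * posted and that has not yet been consumed by one of our Sends. */
struct send_credits {
    uint32_t available;
};

/* Take one credit before posting a Send Request. Returns false when
 * the remote Receive Queue may be empty, so the caller should wait
 * instead of risking an RNR flow. */
static bool credits_take(struct send_credits *c)
{
    if (c->available == 0)
        return false;
    c->available--;
    return true;
}

/* Called when the remote side reports that it posted n more
 * Receive Requests. */
static void credits_grant(struct send_credits *c, uint32_t n)
{
    c->available += n;
}
```

To avoid the deadlock John mentions (both sides waiting for credits), a common design choice is to reserve at least one Receive Request on each side for credit updates only.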

  40. gp says: May 29, 2015

    Hi Dotan,
    I have few questions related to connection of RC queue pair.

    1. If ibv_post_send fails then we consider the connection lost.
    -> Assuming all the fields in the message are correct and the send queue is not full, is the converse also true, i.e. if we are able to post, does that mean there is a working connection between the nodes?

    2. Is it possible to receive a send WC with an error if there is an active, working connection between the nodes, assuming the message was correct and the receiver also posted a recv request (no RNR error)?

    3. If we post send requests beyond the max limit of the send queue, will it corrupt the queue pair so that no further requests can be posted? If not, can we post the same request again without any change?

    • Dotan Barak says: May 29, 2015

      Hi.

      1. Failure of ibv_post_send() means that one of the Send Requests is invalid or the Send Queue is full;
      it doesn't mean that connection is closed. In that case no new Send Request was added to the Send Queue.

      You can post Send Requests to a Queue Pair which was configured with bad remote attributes
      ("bad" means not the attributes that you should have configured...), i.e. no connection.

      2. In general, no; but this question is tricky...
      Which completion status did you get?

      3. If you posted Send Requests beyond the maximum limit and all of them are unsignaled - you have a problem.
      The Queue Pair isn't corrupted, but you can't post any more Send Requests to it:
      The status of the outstanding Send Requests is undetermined for the sender side.
      The Receive Side of this Queue Pair is still fully operational.

      You must recover it by moving it to the Error/Reset state and reconnecting the Queue Pairs.

      I hope that I helped you
      Dotan

  41. ChenCong Fu says: June 3, 2015

    Hi Dotan:

    Nice to meet you. I'm from China. My English is not very good. Recently I have learned something about RDMA. But I met a problem:

    This is my test program:
    server code :

    /*
    * Copyright (C) fuchencong@163.com
    */

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <getopt.h>
    #include <rdma/rdma_verbs.h>

    #define VERB_ERR(verb, ret) \
    fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

    #define MB (1024 * 1024)

    /* Default parameter values */
    #define DEFAULT_PORT "51216"
    #define DEFAULT_MSG_COUNT 100
    #define DEFAULT_MSG_LENGTH MB

    /* Resources used in the example */
    struct context
    {
    char *server_name;
    char *server_port;
    unsigned int msg_count;
    unsigned int msg_length;
    /* Resources */
    struct rdma_cm_id *id;
    struct rdma_cm_id *listen_id;
    struct ibv_mr *recv_mr;
    char *recv_buf;
    };

    int
    reg_mem(struct context *ctx)
    {
    ctx->recv_buf = (char *) malloc(ctx->msg_length);
    memset(ctx->recv_buf, 0x00, ctx->msg_length);

    ctx->recv_mr = rdma_reg_msgs(ctx->id, ctx->recv_buf, ctx->msg_length);
    if (!ctx->recv_mr) {
    VERB_ERR("rdma_reg_msgs", -1);
    return -1;
    }

    return 0;
    }

    int
    getaddrinfo_and_create_ep(struct context *ctx)
    {
    int ret;
    struct rdma_addrinfo *rai, hints;
    struct ibv_qp_init_attr qp_init_attr;

    memset(&hints, 0, sizeof (hints));
    hints.ai_port_space = RDMA_PS_TCP;
    hints.ai_flags = RAI_PASSIVE; /* this makes it a server */

    printf("rdma_getaddrinfo\n");
    ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
    if (ret) {
    VERB_ERR("rdma_getaddrinfo", ret);
    return ret;
    }

    memset(&qp_init_attr, 0, sizeof (qp_init_attr));
    qp_init_attr.cap.max_send_wr = 1;
    qp_init_attr.cap.max_recv_wr = 1;
    qp_init_attr.cap.max_send_sge = 1;
    qp_init_attr.cap.max_recv_sge = 1;

    printf("rdma_create_ep\n");
    ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
    if (ret) {
    VERB_ERR("rdma_create_ep", ret);
    return ret;
    }

    rdma_freeaddrinfo(rai);

    return 0;
    }

    int
    get_connect_request(struct context *ctx)
    {
    int ret;
    printf("rdma_listen\n");

    ret = rdma_listen(ctx->id, 4);
    if (ret) {
    VERB_ERR("rdma_listen", ret);
    return ret;
    }

    ctx->listen_id = ctx->id;
    printf("rdma_get_request\n");
    ret = rdma_get_request(ctx->listen_id, &ctx->id);
    if (ret) {
    VERB_ERR("rdma_get_request", ret);
    return ret;
    }

    if (ctx->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
    printf("unexpected event: %s", \
    rdma_event_str(ctx->id->event->event));
    return -1;
    }

    return 0;
    }

    int
    establish_connection(struct context *ctx)
    {
    int ret;
    struct rdma_conn_param conn_param;

    /* post a receive to catch the first send */
    ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
    ctx->recv_mr);
    if (ret) {
    VERB_ERR("rdma_post_recv", ret);
    return ret;
    }

    memset(&conn_param, 0, sizeof (conn_param));
    conn_param.responder_resources = 2;
    conn_param.initiator_depth = 2;
    conn_param.retry_count = 5;
    conn_param.rnr_retry_count = 5;

    printf("rdma_accept\n");
    ret = rdma_accept(ctx->id, &conn_param);
    if (ret) {
    VERB_ERR("rdma_accept", ret);
    return ret;
    }

    return 0;
    }

    int
    recv_msg(struct context *ctx)
    {
    int ret;
    struct ibv_wc wc;

    ret = rdma_get_recv_comp(ctx->id, &wc);
    if (ret <= 0) {
    VERB_ERR("rdma_get_recv_comp", ret);
    return -1;
    }

    ret = rdma_post_recv(ctx->id, NULL, ctx->recv_buf, ctx->msg_length,
    ctx->recv_mr);
    if (ret) {
    VERB_ERR("rdma_post_recv", ret);
    return ret;
    }

    return 0;
    }

    int
    main(int argc, char** argv)
    {
    int ret, op, i, recv_cnt;
    struct context ctx;
    struct ibv_qp_attr qp_attr;

    memset(&ctx, 0, sizeof (ctx));
    memset(&qp_attr, 0, sizeof (qp_attr));

    ctx.server_port = DEFAULT_PORT;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;

    while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
    switch (op) {
    case 'a':
    ctx.server_name = optarg;
    break;
    case 'p':
    ctx.server_port = optarg;
    break;
    case 'c':
    ctx.msg_count = atoi(optarg);
    break;
    case 'l':
    ctx.msg_length = atoi(optarg) * MB;
    break;
    default:
    printf("usage: %s [-s or -a required]\n", argv[0]);
    printf("\t[-a ip_address]\n");
    printf("\t[-p port_number]\n");
    printf("\t[-c msg_count]\n");
    printf("\t[-l msg_length]\n");
    exit(1);
    }
    }

    printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
    printf("port: %s\n", ctx.server_port);
    printf("count: %d\n", ctx.msg_count);
    printf("length: %d bytes\n", ctx.msg_length);
    printf("\n");

    ret = getaddrinfo_and_create_ep(&ctx);
    if (ret) {
    goto out;
    }

    ret = get_connect_request(&ctx);
    if (ret) {
    goto out;
    }

    ret = reg_mem(&ctx);
    if (ret) {
    goto out;
    }

    ret = establish_connection(&ctx);
    if (ret) {
    goto out;
    }

    recv_cnt = 0;
    for (i = 0; i < ctx.msg_count; i++) {
    if (recv_msg(&ctx)) {
    break;
    }
    ++recv_cnt;
    }
    printf("recv %d messages, each message is %d bytes\n", \
    recv_cnt, ctx.msg_length);

    rdma_disconnect(ctx.id);

    out:
    if (ctx.recv_mr) {
    rdma_dereg_mr(ctx.recv_mr);
    }

    if (ctx.id) {
    rdma_destroy_ep(ctx.id);
    }

    if (ctx.listen_id) {
    rdma_destroy_ep(ctx.listen_id);
    }

    if (ctx.recv_buf) {
    free(ctx.recv_buf);
    }

    return ret;
    }

    client code:

    /*
    * Copyright (C) fuchencong@163.com
    */

    #include <stdlib.h>
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <getopt.h>
    #include <rdma/rdma_verbs.h>

    #define VERB_ERR(verb, ret) \
    fprintf(stderr, "%s returned %d errno %d\n", verb, ret, errno)

    #define MB (1024 * 1024)

    /* Default parameter values */
    #define DEFAULT_PORT "51216"
    #define DEFAULT_MSG_COUNT 100
    #define DEFAULT_MSG_LENGTH MB
    #define DEFAULT_MSEC_DELAY 500

    /* Resources used in the example */
    struct context
    {
    char *server_name;
    char *server_port;
    unsigned int msg_count;
    unsigned int msg_length;
    /* Resources */
    struct rdma_cm_id *id;
    struct ibv_mr *send_mr;
    char *send_buf;
    };

    int
    reg_mem(struct context *ctx)
    {
    ctx->send_buf = (char *) malloc(ctx->msg_length);
    memset(ctx->send_buf, 'a', ctx->msg_length);

    ctx->send_mr = rdma_reg_msgs(ctx->id, ctx->send_buf, ctx->msg_length);
    if (!ctx->send_mr) {
    VERB_ERR("rdma_reg_msgs", -1);
    return -1;
    }

    return 0;
    }

    int
    getaddrinfo_and_create_ep(struct context *ctx)
    {
    int ret;
    struct rdma_addrinfo *rai, hints;
    struct ibv_qp_init_attr qp_init_attr;

    memset(&hints, 0, sizeof (hints));
    hints.ai_port_space = RDMA_PS_TCP;

    printf("rdma_getaddrinfo\n");
    ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
    if (ret) {
    VERB_ERR("rdma_getaddrinfo", ret);
    return ret;
    }

    memset(&qp_init_attr, 0, sizeof (qp_init_attr));
    qp_init_attr.cap.max_send_wr = 1;
    qp_init_attr.cap.max_recv_wr = 1;
    qp_init_attr.cap.max_send_sge = 1;
    qp_init_attr.cap.max_recv_sge = 1;

    printf("rdma_create_ep\n");
    ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
    if (ret) {
    VERB_ERR("rdma_create_ep", ret);
    return ret;
    }

    rdma_freeaddrinfo(rai);

    return 0;
    }

    int
    establish_connection(struct context *ctx)
    {
    int ret;
    struct rdma_conn_param conn_param;

    memset(&conn_param, 0, sizeof (conn_param));
    conn_param.private_data_len = sizeof (int);
    conn_param.responder_resources = 2;
    conn_param.initiator_depth = 2;
    conn_param.retry_count = 5;
    conn_param.rnr_retry_count = 5;

    printf("rdma_connect\n");
    ret = rdma_connect(ctx->id, &conn_param);
    if (ret) {
    VERB_ERR("rdma_connect", ret);
    return ret;
    }

    if (ctx->id->event->event != RDMA_CM_EVENT_ESTABLISHED) {
    printf("unexpected event: %s",
    rdma_event_str(ctx->id->event->event));
    return -1;
    }

    return 0;
    }

    int
    send_msg(struct context *ctx)
    {
    int ret;
    struct ibv_wc wc;

    ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length,
    ctx->send_mr, IBV_SEND_SIGNALED);
    if (ret) {
    VERB_ERR("rdma_post_send", ret);
    return ret;
    }

    ret = rdma_get_send_comp(ctx->id, &wc);
    if (ret < 0) {
    VERB_ERR("rdma_get_send_comp", ret);
    return ret;
    }

    return 0;
    }

    int
    main(int argc, char** argv)
    {
    int ret, op, i, send_cnt;
    struct context ctx;
    struct ibv_qp_attr qp_attr;

    memset(&ctx, 0, sizeof (ctx));
    memset(&qp_attr, 0, sizeof (qp_attr));

    ctx.server_port = DEFAULT_PORT;
    ctx.msg_count = DEFAULT_MSG_COUNT;
    ctx.msg_length = DEFAULT_MSG_LENGTH;

    while ((op = getopt(argc, argv, "a:p:c:l:")) != -1) {
    switch (op) {
    case 'a':
    ctx.server_name = optarg;
    break;
    case 'p':
    ctx.server_port = optarg;
    break;
    case 'c':
    ctx.msg_count = atoi(optarg);
    break;
    case 'l':
    ctx.msg_length = atoi(optarg) * MB;
    break;
    default:
    printf("usage: %s [-s or -a required]\n", argv[0]);
    printf("\t[-a ip_address]\n");
    printf("\t[-p port_number]\n");
    printf("\t[-c msg_count]\n");
    printf("\t[-l msg_length]\n");
    exit(1);
    }
    }

    printf("address: %s\n", (!ctx.server_name) ? "NULL" : ctx.server_name);
    printf("port: %s\n", ctx.server_port);
    printf("count: %u\n", ctx.msg_count);
    printf("length: %u bytes\n", ctx.msg_length);
    printf("\n");

    if (!ctx.server_name) {
    printf("server address must be specified for client\n");
    exit(1);
    }

    ret = getaddrinfo_and_create_ep(&ctx);
    if (ret) {
    goto out;
    }

    ret = reg_mem(&ctx);
    if (ret) {
    goto out;
    }

    ret = establish_connection(&ctx);
    if (ret) {
    goto out;
    }

    send_cnt = 0;
    for (i = 0; i < ctx.msg_count; i++) {
    if (send_msg(&ctx)) {
    break;
    }
    ++send_cnt;
    }
    printf("sent %d messages, each message is %u bytes\n",
    send_cnt, ctx.msg_length);

    rdma_disconnect(ctx.id);

    out:
    if (ctx.send_mr) {
    rdma_dereg_mr(ctx.send_mr);
    }

    if (ctx.id) {
    rdma_destroy_ep(ctx.id);
    }

    if (ctx.send_buf) {
    free(ctx.send_buf);
    }

    return ret;
    }

    What I can't understand is that this program sometimes takes 1 minute to send 1 GB of data and sometimes needs only 0.2 seconds, so its performance is not very stable.

    I really don't know why. Can you give me some advice?
    Thank you!

    • Dotan Barak says: June 3, 2015

      Hi.

      The code that you sent me is corrupted (a problem with posting it in a comment).
      Can you please send it to me by mail?
      dotan at rdmamojo dot com

      Thanks
      Dotan

      • ChenCong Fu says: June 4, 2015

        Hi Dotan,

        Thanks for the quick reply! I have sent my code to you by email. Thank you very much.

      • Dotan Barak says: June 4, 2015

        Hi ChenCong Fu.

        As I wrote in the mail, the problem is that the sender's Queue Pair enters the Receiver Not Ready (RNR) flow,
        which harms performance; this is what you sometimes see.

        Thanks
        Dotan

  42. Jack says: June 9, 2015

    Hello Dotan,
    Thanks a lot for your help.
    I have a design question; would you mind taking a look at it?

    I have a client and a server, and the client wants to send a lot of data to the server. Instead of using a "send" operation to send the data directly, the client registers a memory region that contains the data and uses a "send" operation to tell the remote server the virtual address of the data. Once the server receives this request, it posts an "RDMA Read" operation to read the data directly from the client side.
    What's the best way to do it?
    At the beginning, the server needs to receive a so-called "rdma msg" from the client so that it knows where to read the data on the remote side (the client). This means we need to put our "RDMA Read" operation inside the "receive completion handler" on the server side: only when the server finishes receiving the "rdma msg" from the client does it know where to read from, and only then can it start the "read" operation.

    Is it OK to put the "RDMA Read" operation inside the "receive completion handler"? Do you have any advice for this design?

    Thanks a lot for your time!

    All the best
    Jack

    • Dotan Barak says: June 9, 2015

      Hi Jack.

      I'm glad to help where I can
      :)

      I would suggest using RDMA Write to send the data instead of RDMA Read,
      i.e. the server allocates blocks and advertises their attributes to the client,
      and the client initiates the RDMA Write(s).

      The last RDMA Write can be with immediate, to let the server know that it was the last message
      (or use it from time to time during the transfer as a keep-alive message that lets the server know how many
      messages to expect).

      Did I answer your questions?

      Thanks
      Dotan
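
One way to sketch the "advertise the block attributes" step: the server packs the address, rkey and length of a registered block into a small fixed-size control message, sends it to the client, and the client uses those values as the remote address and rkey of its RDMA Writes. The `buf_advert` struct, the 16-byte big-endian layout and the function names below are assumptions made for illustration only; they are not part of libibverbs, and in a real application the values would be taken from the registered Memory Region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical control message describing one registered block. */
struct buf_advert {
    uint64_t addr;   /* virtual address of the block on the server */
    uint32_t rkey;   /* remote key of the Memory Region that covers it */
    uint32_t length; /* size of the block in bytes */
};

#define BUF_ADVERT_WIRE_SIZE 16

/* Store the low nbytes of v in big-endian order. */
static void put_be(uint8_t *dst, uint64_t v, size_t nbytes)
{
    for (size_t i = 0; i < nbytes; i++)
        dst[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
}

/* Load nbytes stored in big-endian order. */
static uint64_t get_be(const uint8_t *src, size_t nbytes)
{
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | src[i];
    return v;
}

/* Server side: serialize the advertisement before sending it. */
void buf_advert_pack(const struct buf_advert *a,
                     uint8_t out[BUF_ADVERT_WIRE_SIZE])
{
    assert(a != NULL && out != NULL);
    put_be(out, a->addr, 8);
    put_be(out + 8, a->rkey, 4);
    put_be(out + 12, a->length, 4);
}

/* Client side: recover the attributes to use in the RDMA Write. */
void buf_advert_unpack(const uint8_t in[BUF_ADVERT_WIRE_SIZE],
                       struct buf_advert *a)
{
    assert(a != NULL && in != NULL);
    a->addr = get_be(in, 8);
    a->rkey = (uint32_t)get_be(in + 8, 4);
    a->length = (uint32_t)get_be(in + 12, 4);
}
```

A fixed byte order is used so that the message is interpreted the same way even if the two hosts differ in endianness.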

      • Jack says: June 12, 2015

        Thanks a lot Dotan!

      • Jack says: June 13, 2015

        Thanks a lot Dotan!
        I will try to do both write and read.
        While implementing it, I ran into a weird situation. I put the client and the server on the same machine and perform RDMA Read operations between them. The receiver (reader) can read only the first half of the data from the sender.
        For example, the sender sends a packet to the receiver (containing the address that the reader will read from). Assuming there are 100 bytes at that address, the receiver (reader) reads only the first 50 bytes correctly from the sender side (if the sender sends 16 bytes, only 8 bytes can be read). It's pretty weird, because I have already tested RDMA send/receive operations and they work fine (in loopback), which means DMA works OK.

        Do you have any idea? I have updated my firmware to the newest one (May 2015); my device is ConnectX3. Does it support performing RDMA Read operations in local loopback?

        Thanks a lot!

        Jack

      • Dotan Barak says: June 13, 2015

        Hi Jack.

        I would double-check the length of the S/G entries in your Send Requests.

        Thanks
        Dotan

      • Jack says: June 15, 2015

        Hello Dotan,
        Thanks for your help. I have checked the lengths of the S/G entries; they are big enough for the requests (each entry's length equals the number of data bytes).
        I don't know what else to do.

        All the best
        Jack

  43. Jack says: June 15, 2015

    Thanks Dotan, I figured it out. Something wrong in another module...

    • Dotan Barak says: June 15, 2015

      Great!

      As I said, the RDMA device you mentioned works great (I worked/working with it personally).
      :)

      Thanks
      Dotan

  44. Jack says: June 16, 2015

    Hello Dotan,
    I want to ask a question.
    Suppose we want to send a huge message via post_send that requires more than one work request (we will use a list of send work requests).
    For example, we have a send work request list that contains 2 work requests (sendwr0, sendwr1).
    For sendwr0 and sendwr1:
    1) Do I need to assign them the same work request ID, because they basically represent the same message?
    2) About the send flags: do I only need to set the signaled send flag on the last request (in the case above, sendwr1)?

    • Dotan Barak says: June 17, 2015

      1) No, you don't *need* to do it, but you *can* do it.
      wr_id is an attribute for the application to use (or not use).
      If your application needs to know that the two Work Completions belong to the same message, you can use it as a hint.

      2) You can set the SIGNALED flag only in the second Send Request and get one Work Completion if everything goes fine.

      The RDMA stack doesn't know (or care) that you used two Send Requests for one application message
      (from the RDMA stack point of view, you have two different messages).

      Thanks
      Dotan
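
The two answers above can be sketched as a small helper that links the Send Requests of one application message, gives them the same wr_id, and asks for a completion only on the last one. To keep the sketch compilable without libibverbs, `struct demo_send_wr` and `DEMO_SEND_SIGNALED` are stand-ins (assumptions) for `struct ibv_send_wr` and `IBV_SEND_SIGNALED`; only the three fields relevant here are mirrored, and `chain_message` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for IBV_SEND_SIGNALED (assumption for this sketch). */
#define DEMO_SEND_SIGNALED 0x2

/* Stand-in for the relevant fields of struct ibv_send_wr. */
struct demo_send_wr {
    uint64_t wr_id;
    struct demo_send_wr *next;
    int send_flags;
};

/*
 * Link wrs[0..n-1] into one list for a single post call.
 * Every WR carries the same application-chosen message id,
 * and only the last WR requests a Work Completion.
 * Returns the head of the list.
 */
struct demo_send_wr *chain_message(struct demo_send_wr *wrs, size_t n,
                                   uint64_t msg_id)
{
    assert(wrs != NULL && n > 0);
    for (size_t i = 0; i < n; i++) {
        wrs[i].wr_id = msg_id;
        wrs[i].next = (i + 1 < n) ? &wrs[i + 1] : NULL;
        wrs[i].send_flags = (i + 1 == n) ? DEMO_SEND_SIGNALED : 0;
    }
    return &wrs[0];
}
```

With the real API, unsignaled Send Requests require that the QP was created with sq_sig_all set to 0; otherwise every Send Request generates a Work Completion regardless of the flag.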

      • Jack says: June 18, 2015

        Thanks a lot Dotan, that's helpful!

  45. Jack says: June 18, 2015

    Hello Dotan,
    I would like to confirm if my understanding about FRWR is correct.
    If we have a sender and a receiver (reader), the sender needs to call "post_send()" twice before they can start, right? The first "post_send" registers the memory (FRWR) with the NIC, and the second one actually transfers the virtual address of these FRWR memory regions.
    How many "post_send" calls should the receiver (reader) make? Maybe three?
    1) "post_send" an FRWR to store the incoming data
    2) "post_send" to actually read the data
    3) "post_send" to tell the remote side (sender) to invalidate the memory region (once the receiver finishes reading)
    Is that correct?
    And how are we supposed to know how many FRWR read operations can be performed concurrently before we invalidate the first FRWR? I could not find this information using query device; would you mind giving me a hand?

    All the best
    Jack

    • Dotan Barak says: June 21, 2015

      Hi Jack.

      I don't have any experience with FRWR operations. But let me try to help you anyway.
      I assume that you are using RDMA Read (although you didn't write it); this is the reason for the second post send.

      According to your scenario (using RDMA Read), yes - three post_sends are needed.

      I don't really understand what you mean by:
      "...how many FRWR read operations can be performed concurrently before we invalidate the first FRWR".

      Can you please explain it?

      Thanks
      Dotan

      • Jack says: June 22, 2015

        Thanks a lot Dotan!
        "...how many FRWR read operations can be performed concurrently before we invalidate the first FRWR".
        With FRWR (at least as I understand it), we register a memory region, use it, and then invalidate it.
        So, to increase performance, the receiver (reader) may perform a couple of Read operations concurrently and then needs to invalidate that specific FMR when it's done. My question was actually about how many Read operations we can perform; I think it depends on my system.

        Do you know where I can find more info about FRWR? I tried to search online, but I could not find too much info.

      • Dotan Barak says: June 23, 2015

        Yes. It is your decision when to invalidate this Memory Region.

        AFAIK, the InfiniBand specification is the only place where you can get information on FRWR.

        Thanks
        Dotan

  46. Jack says: June 29, 2015

    Hello Dotan,
    If I have a very large amount of data (divided into multiple chunks) that I want to send out, there are two possible ways of doing it.
    The first is to use one work request (but this needs extra CPU time for a mem copy).
    The second is to use multiple RDMA work requests (no extra CPU time for a mem copy, but multiple work requests must be posted).

    Which one is better?

    All the best
    Jingyi

    • Dotan Barak says: July 4, 2015

      Hi Jingyi.

      You can use one Send Request with a scatter/gather list;
      this way you'll be able to eliminate the need to perform a mem copy and send the message from multiple buffers.

      If not, the best solution depends on the total message size:
      * If it is small (~ < 1KB), I think that the first one is the best.
      * If the total message size is big, the second approach will give you the best performance.
      I suggest using selective signaling and creating a Work Completion only for the last Send Request.

      Anyway, if performance is highly critical, the best way is to implement both approaches and measure the results
      (you develop once and use many times ...)

      I hope that this helped you.

      Thanks
      Dotan

  47. Jack says: July 6, 2015

    Hello Dotan,
    Thanks a lot for your reply!
    I have an idea, but I am not sure whether it's possible.
    Suppose the sender has 10 chunks of data that need to be sent to the remote side (still the send/recv model).
    The normal way to do it is that the sender sends the vaddr to the receiver and the receiver reads the data from the sender, or the receiver sends its vaddr to the sender and the sender writes to the receiver.
    I was wondering whether we can perform read and write operations at the same time.
    Back to our assumption: for the 1st chunk, the receiver (reader) reads from the sender, and at the same time the sender writes the 2nd chunk to the receiver (reader); for the rest of the chunks, we do something similar. So we can improve the speed by keeping both sides busy, right?
    Is the above approach possible? If so, I believe the challenge will be the ordering issue: how can we make sure that the chunks are delivered in order? Is there any good way to do it?

    All the best
    Jack

    • Dotan Barak says: July 7, 2015

      Hi Jack.

      Yes, RDMA Reads and Writes can happen at the same time
      (obviously they are initiated by both sides).

      I'm not really sure how much improvement it will give compared to the complexity
      (maybe you would want to work with several QPs in parallel).

      Anyway, back to your idea:
      What is the meaning of order?
      Each QP can place the data in a different (predefined) location.
      In a Write, you specify the remote location that the data will be written to.
      In a Read, you specify the local location that the data will be written to.

      So, at the end all the chunks can be placed in one contiguous block.

      Thanks
      Dotan

  48. Jack says: July 9, 2015

    Hello Dotan,
    Thanks for your time!
    While doing RDMA Write operations, I noticed a very interesting problem.
    After we successfully post a write work request and poll the corresponding wc, wc.byteLen is not the number of bytes that we wrote. In an RDMA Read operation, wc.byteLen is the number of bytes we read from the remote side, but in a write operation we can't rely on it. I took a look at the driver: wc.byteLen is not updated in a write operation (if opcode = rdma write), but it is updated in a read operation.
    I also checked the InfiniBand specification: the RDMA Write section says we can depend on dmaLen, but, oddly, it doesn't say anything about wc.byteLen.
    Why is wc.byteLen updated for a read operation but not for a write?

    All the best
    Jack

    • Dotan Barak says: July 13, 2015

      Hi Jack.

      I *think* (since I'm not one of the IB spec authors) that the reason is that if you are the Requestor side of an RDMA Write or Send, you know how much data you sent. If needed, you can maintain local information associated with the Send Request and hold a pointer to it in the wr_id.

      Thanks
      Dotan
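
A minimal sketch of the "hold a pointer in the wr_id" idea: the application keeps a small context per Send Request, stores its address in the 64-bit wr_id, and reads the byte count back from the context when the Work Completion arrives. The `send_ctx` struct and the helper names are hypothetical; the only assumption about the verbs API is that wr_id is an opaque 64-bit value that is returned unchanged in the Work Completion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request bookkeeping kept by the application. */
struct send_ctx {
    size_t bytes_posted; /* total length of the S/G entries posted */
    int is_rdma_write;   /* 1 for RDMA Write, 0 for Send */
};

/* Value to store in the wr_id field when posting the request. */
uint64_t ctx_to_wr_id(struct send_ctx *ctx)
{
    return (uint64_t)(uintptr_t)ctx;
}

/* Recover the context from the wr_id of a polled Work Completion. */
struct send_ctx *wr_id_to_ctx(uint64_t wr_id)
{
    return (struct send_ctx *)(uintptr_t)wr_id;
}

/* Number of bytes that the completed request transferred. */
size_t completed_bytes(uint64_t wr_id)
{
    struct send_ctx *ctx = wr_id_to_ctx(wr_id);
    assert(ctx != NULL);
    return ctx->bytes_posted;
}
```

The context must stay allocated until the corresponding Work Completion has been polled, since the pointer is dereferenced at that time.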

      • Jack says: July 15, 2015

        Thanks Dotan!
        Actually, there's another confusing thing in the driver. After post_send(wr), in the failure case, it seems that we still can't rely on wc.opcode, because the driver doesn't update it. Is there any design reason for that?
        Why doesn't the driver need to update wc.opcode in the failure case?

        All the best
        Jack

      • Dotan Barak says: July 15, 2015

        Hi Jack.

        This is by design. Look at the post on ibv_poll_cq() for more details on valid attributes when Work Completion has an error.

        Thanks
        Dotan

  49. Mark Sherred says: July 16, 2015

    Thanks for all the great info!

    I didn't realize the IB verbs layer itself needs completion events created by the application layer, until I saw your response to Igor R. When I first saw the description of the deadlock when the WQ is filled with non-signaled operations, I thought you were referring to the application layer SW needing completion events to keep a count of outstanding operations to make sure the WQ is never filled.

    Do you know why IB verbs pushes WR flow control back into the application layer by going into the error state when the WQ fills, instead of returning EAGAIN or EWOULDBLOCK like send(), recv(), read() or write() for non-blocking I/O to a busy device?

    • Dotan Barak says: July 17, 2015

      Hi Mark.

      There isn't any problem if the Send Queue is full of Send Requests, as long as one of them is signaled (i.e. will generate a Work Completion).

      The problem exists only if all the posted Send Requests are unsignaled.

      Letting the low-level driver or the HW do the bookkeeping of which Send Requests are signaled and which aren't would decrease performance, since before any Send Request is posted, the low-level driver would need to check whether there is a potential problem.

      The application knows what it is doing, and easily can avoid getting into this pitfall.

      Thanks
      Dotan
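
One possible way for the application to avoid this pitfall is a small counter that marks every N-th Send Request as signaled and refuses to post when the Send Queue would overflow. The sketch below is only an illustration: `sq_tracker` and its functions are hypothetical names, `depth` is assumed to be the max_send_wr the QP was created with, and all Send Requests are assumed to go through this tracker on a single QP.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical application-side view of one Send Queue. */
struct sq_tracker {
    unsigned depth;        /* max_send_wr of the QP */
    unsigned signal_every; /* request a completion every N-th WR */
    unsigned in_flight;    /* WRs posted and not yet known complete */
    unsigned since_signal; /* WRs posted since the last signaled WR */
};

void sq_tracker_init(struct sq_tracker *t, unsigned depth,
                     unsigned signal_every)
{
    /* signal_every <= depth guarantees that a full queue always
     * contains at least one signaled WR. */
    assert(t != NULL && depth > 0);
    assert(signal_every > 0 && signal_every <= depth);
    t->depth = depth;
    t->signal_every = signal_every;
    t->in_flight = 0;
    t->since_signal = 0;
}

/*
 * Call before posting one WR.
 * Returns -1 if the Send Queue is full (do not post),
 * 1 if the WR should be posted signaled, 0 if unsignaled.
 */
int sq_tracker_post(struct sq_tracker *t)
{
    if (t->in_flight == t->depth)
        return -1;
    t->in_flight++;
    t->since_signal++;
    if (t->since_signal == t->signal_every) {
        t->since_signal = 0;
        return 1;
    }
    return 0;
}

/*
 * Call once per polled Work Completion of a signaled WR: it retires
 * that WR and the unsignaled WRs that were posted before it.
 */
void sq_tracker_complete(struct sq_tracker *t)
{
    assert(t->in_flight >= t->signal_every);
    t->in_flight -= t->signal_every;
}
```

When `sq_tracker_post()` returns -1, the application polls the Completion Queue until a signaled completion arrives, calls `sq_tracker_complete()`, and then retries.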

  50. DjvuLee says: September 9, 2015

    Hi, Dotan.
    I have a question about parallel RDMA READs. Since RDMA is an async model, we can launch another RDMA READ before the previous one has finished, so there can be many unfinished RDMA READs at a time, and their number may exceed the initiator_depth and the responder resources. What happens when it is exceeded? Will the NIC launch the RDMA READ as usual, or will it wait until the number of unfinished RDMA READs no longer exceeds the limit?

    I use the parallel RDMA READ model in a cluster. When I do not limit the number of parallel reads, I fail with IBV_WC_RETRY_EXC_ERR, but when I limit the number of parallel RDMA READs, it succeeds.

    Is there any limit on parallel RDMA READs, or should we avoid this? Thanks!

    • Dotan Barak says: September 13, 2015

      Hi.

      Per QP, there are attributes that set the number of RDMA Read + Atomic messages that can be sent in parallel.
      If wrong values are used (for example: the initiator is configured to send more READs than the destination can accept),
      there will be a retry flow and the initiator side may get a completion with a RETRY EXCEEDED error (as you have seen).

      The following attributes in the device capabilities are relevant to this operation:
      * max_qp_rd_atom
      * max_qp_init_rd_atom

      They hold the supported number of RDMA Read and Atomic operations per QP (for the target and the initiator, respectively).

      Thanks
      Dotan
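
As a rough illustration of "the initiator must not send more READs than the destination can accept": the value configured on the initiator side should be the minimum of what the application wants, what the local device can initiate, and what the remote QP accepts as a target. The sketch below also adds an optional application-level credit counter that matches the workaround described in the question (limiting the number of parallel RDMA Reads). All names are hypothetical, and how the remote limit is learned (for example, exchanged during connection setup) is left as an assumption.

```c
#include <assert.h>
#include <stddef.h>

static unsigned min_u(unsigned a, unsigned b)
{
    return a < b ? a : b;
}

/*
 * Pick the number of outstanding RDMA Read / Atomic operations for the
 * initiator: never more than the local device can initiate per QP
 * (max_qp_init_rd_atom) or than the remote QP accepts as a target.
 */
unsigned pick_rd_atomic(unsigned wanted, unsigned local_init_cap,
                        unsigned remote_target_depth)
{
    return min_u(wanted, min_u(local_init_cap, remote_target_depth));
}

/* Hypothetical application-level limiter for in-flight RDMA Reads. */
struct rd_credits {
    unsigned limit;
    unsigned in_flight;
};

void rd_credits_init(struct rd_credits *c, unsigned limit)
{
    assert(c != NULL && limit > 0);
    c->limit = limit;
    c->in_flight = 0;
}

/* Returns 1 if one more RDMA Read may be posted now, 0 otherwise. */
int rd_credits_acquire(struct rd_credits *c)
{
    if (c->in_flight == c->limit)
        return 0;
    c->in_flight++;
    return 1;
}

/* Call when the Work Completion of an RDMA Read has been polled. */
void rd_credits_release(struct rd_credits *c)
{
    assert(c->in_flight > 0);
    c->in_flight--;
}
```

The limiter is not a replacement for configuring the QP attributes consistently on both sides; it only caps what the application itself posts.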

      • DjvuLee says: September 13, 2015

        Thanks very much! I ran into the following problem. I use shell/python and rping to compose an RDMA shuffle cluster: every node runs a server-mode process (which uses a thread for every incoming client connection), and there are also N client-mode processes on every node, which set up connections with the other nodes in the cluster. Since rping follows an RDMA READ -- ACK -- RDMA WRITE -- ACK procedure, there is only one outstanding RDMA operation at any time, but I still get the IBV_WC_RETRY_EXC_ERR error. In my opinion, there should be no reason for this error to occur.

        By the way, when the cluster has just 15 nodes there is no error; the errors occur when there are 30 nodes in the cluster.

        Can you give me some advice on how to deal with this?

      • Dotan Barak says: September 15, 2015

        Hi.

        The problem is that there is one more attribute, 'max_res_rd_atom': the total number of RDMA Read and Atomic operations that this device supports as the target,
        and there isn't any sync or protocol (AFAIK) that prevents more RDMA Read / Atomic operations than this value from being targeted at the device.

        Thanks
        Dotan

  51. Tingyu says: September 14, 2015

    Hi Dotan,

    I know it is not safe to ibv_post_recv several messages on the same address. But is it safe to ibv_post_send several messages from the same address? If so, is there any performance difference between posting from the same address and from different addresses?

    Thanks,
    Tingyu

    • Dotan Barak says: September 15, 2015

      Hi Tingyu.

      The problem with posting multiple Receive Requests to the same address is that the content isn't consistent
      (i.e. one cannot predict the value of the buffers since there isn't any guaranteed order between different Work Queues).

      Sending multiple messages from the same address doesn't have this problem.

      Thanks
      Dotan

      • Tingyu says: September 15, 2015

        Hi Dotan,

        Thanks for this reply! I understand
        data will not be consistent, but I wonder
        if RDMA allows this type of operation.
        So I tested by posting several receive
        requests to the same address on the
        receiver side, it seems
        RDMA library threw out an error during
        ibv_poll_cq on the sender side, by setting
        wc.status to 12. Could you explain why?
        Is there any internal mechanism in RDMA library
        that prevents reusing the same buffer?

        Thanks,
        Tingyu

      • Dotan Barak says: September 17, 2015

        Hi Tingyu.

        wc.status 12 means IBV_WC_RETRY_EXC_ERR.
        This means that there was a transport error at some point.

        Reusing the same buffer is legal in RDMA.

        Thanks
        Dotan

  52. Valentin Petrov says: September 24, 2015

    Hi, Dotan,
    do RC QPs guarantee the ordering of RDMA_WRITE WRs? For example, if an "initiator" issues 2 consecutive IBV_WR_RDMA_WRITEs into the same remote memory location, will the "target" always end up with the data from the second operation (i.e., will the second WR always update the remote memory after the first one)?

    • Dotan Barak says: September 26, 2015

      Hi Valentin.

      I will be careful here:
      * From the network point of view, the first message will reach the destination before the second one.
      * The memory will be DMA'ed (by the RDMA device) according to the message ordering.

      If the memory controller and the cache in the server honor this (as I expect is the case in most architectures),
      I guess the answer is "yes".

      Thanks
      Dotan

  53. Tingyu says: September 30, 2015

    Hi Dotan,

    Is there any limit on the maximal size of a message posted using
    ibv_post_send? Say 16MB, 32MB, 64MB, 128MB? My problem
    is that when I try to post a message larger than 16MB, it fails
    (my code first posts a 16MB receive request using ibv_post_recv, then posts a 16MB send message using ibv_post_send
    to the other side. The first posted receive buffer is for receiving
    the ack message from the other side). It turns out that the remote side doesn't receive the posted message (the other side also posted a 16MB receive buffer before receiving the message, and the connection between the two had already been established). ibv_poll_cq on the sender side returns a wc with status 12. Do you have any idea what causes this issue? I don't know how to debug it; could you give me some instructions?

    Thanks for help!
    Tingyu

    • Dotan Barak says: September 30, 2015

      Hi Tingyu.

      The maximal message size can be found in the port properties: max_msg_sz (in general, RDMA supports messages of up to 2GB).
      Posting bigger messages will end with a completion with error.

      A completion with status 12, IBV_WC_RETRY_EXC_ERR, indicates that there is a transport problem.
      I suspect that the remote side isn't ready yet, or that it finished its work and closed all the resources.

      Thanks
      Dotan

      • Tingyu says: October 2, 2015

        Hi Dotan,

        Thanks. I just checked the max_msg_sz
        was 2GB. To find the transport problem, I
        used the example "helloworld" code on github
        https://github.com/tarickb/the-geek-in-the-corner as
        described by http://www.hpcadvisorycouncil.com/pdf/building-an-rdma-capable-application-with-ib-verbs.pdf.
        I got the same status 12 when the message size was set as 256MB (messages with smaller size
        worked).
        The network I used was qlogic, so is it possible
        there was something wrong with the hardware or underlying
        verb implementation? Or
        was there anything wrong with the infiniband setup? Do you know
        the way to debug the problem?

        Many thanks,
        Tingyu

      • Dotan Barak says: October 19, 2015

        Hi.

        I haven't worked with QLOGIC HW, so I don't have any feedback to give you.
        I would suggest using the libibverbs examples (I know them and they always work).

        Thanks
        Dotan

  54. Jon says: October 27, 2015

    Hello Dotan,

    Will work requests be modified after posting them?

    In more detail: assuming a list of requests headed by wr is posted by calling ibv_post_send(qp, wr, &bad_wr), will the fields of the requests, including the next pointers, be modified by the library?

    Thanks so much!
    Jon

    • Dotan Barak says: November 7, 2015

      Hi Jon.

      After a Send Request was posted, it can be modified by the application.

      During post send, the low-level library translates the libibverbs Send Request into a HW-specific Send Request and "tells" the RDMA device that new SRs were posted.

      Thanks
      Dotan

  55. Sagar Jha says: October 28, 2015

    Hi Dotan,
    I was wondering what is the behavior of an RDMA read of a remote memory if the remote machine is also writing to it concurrently?

    More formally, suppose host A is reading using RDMA read, a variable v which is local to host B. If the value of v before the start of the read operation was 'a', and B is writing to v the value 'b' concurrently with the read operation, what is the return value of read going to be? Is it guaranteed to be either 'a' or 'b' or can it be a possibly garbage value too because of the local write or remote read not being atomic?
    Thanks,
    Sagar

    • Dotan Barak says: November 7, 2015

      Hi Sagar.

      Local Read and Local Write are not atomic and you may get garbage...

      If you want to guarantee atomicity, you must use the Atomic operations.

      Thanks
      Dotan

      • Sagar Jha says: November 11, 2015

        Thanks for the reply. I can see this happening when we are writing to large memory segments. Is this also true if we are writing to single instance of native data types (bits, bytes, integers, floats etc.)?

      • Dotan Barak says: November 11, 2015

        If you don't use Atomic operations, there isn't any guarantee to atomic access even for small (and native) data types.

        Thanks
        Dotan

  56. Vasily says: October 29, 2015

    Hi.

    First of all I would say thank you for this site and your comments, they are very useful.

    My question :

    I know that atomic operations may not be very popular, but I have to use them. I have modified the rdma-file example to send one uint64_t-sized structure, and I am also using the example provided above. On the server side it is OK - I can see when this structure changes. The problem is on the client side. I do not understand when and how I can check the swapped value: can I check it directly after ibv_post_send, or should I wait or do something different? Right now I see nothing after ibv_post_send, but if I send back some message via a different MR, I see the swapped value. Can you give me a hint?

    • Dotan Barak says: November 7, 2015

      Hi Vasily.

      Thanks for the feedback
      :)

      It isn't really true that atomics aren't popular - it depends on what you are trying to do.

      If you want to examine the value on the client side (i.e. the side that calls ibv_post_send()),
      this can be done only after the processing of the Send Request has ended, i.e. after the Work Completion of the corresponding Send Request was polled from the Completion Queue.

      Thanks
      Dotan

  57. songping yu says: November 26, 2015

    hi Dotan,
    When I use ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr **bad_wr) to transfer one large message (200K) using one work request in UD mode, with wr->opcode=IBV_WR_SEND and wr->num_sge=1,
    an IBV_WC_LOC_LEN_ERR error occurs on the send side. I am sure the receive buffer is large enough on the receive side.
    Does this happen because MTU (4096) < 200K? Do I need to split the 200K message into multiple work requests?

    • Dotan Barak says: November 27, 2015

      Hi Songping yu.

      A UD QP doesn't support messages larger than the path MTU:
      this value is in the range 256-4096 bytes (depends on your subnet).

      It is up to the application to split the (big) message into smaller messages,
      using multiple Work Requests, or to use a different QP transport type.

      Thanks
      Dotan
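
The splitting itself is plain arithmetic. As a sketch (the `chunk` struct and both function names are made up for this example), the application can compute how many messages are needed and the offset and length of each one, then post one Work Request per chunk. Here `max_payload` is assumed to be the path MTU in bytes.

```c
#include <assert.h>
#include <stddef.h>

/* One piece of the big message: where it starts and how long it is. */
struct chunk {
    size_t offset;
    size_t length;
};

/* Number of messages needed to carry msg_len bytes. */
size_t chunk_count(size_t msg_len, size_t max_payload)
{
    assert(max_payload > 0);
    return (msg_len + max_payload - 1) / max_payload;
}

/* Offset and length of the chunk with the given index. */
struct chunk chunk_at(size_t msg_len, size_t max_payload, size_t index)
{
    struct chunk c;

    assert(index < chunk_count(msg_len, max_payload));
    c.offset = index * max_payload;
    c.length = msg_len - c.offset;
    if (c.length > max_payload)
        c.length = max_payload; /* every chunk but the last is full */
    return c;
}
```

For the 200K message and the 4096-byte MTU from the question this gives 50 chunks. Note that on the receive side of a UD QP, each receive buffer also needs 40 extra bytes for the GRH.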

  58. Ben says: December 21, 2015

    Hi Dotan.
    Two questions:
    1. I register a big memory block; can I send part of it by specifying an address offset, len, and rkey?
    2. I register many MRs of different memory sizes. When I send a msg using an RDMA SEND operation, how does the remote side select the recv MR?

    Thanks!
    Ben

    • Dotan Barak says: January 1, 2016

      Hi.

      1) Yes. You can use only part of it in a Work Request.

      2) The remote side posts several Receive Requests:
      the incoming messages will consume the Receive Requests according to the order they were posted.
      i.e. RR[0] will be consumed by message[0], etc.

      Thanks
      Dotan
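
For the first answer, the only requirement is that the address range used in the Work Request lies completely inside the registered block. A small sketch of that check follows; `range_in_region` is a hypothetical helper, `mr_addr` and `mr_len` are assumed to be the start address and length of the registered block, and `addr` and `len` are the values that would go into the S/G entry.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if [addr, addr + len) lies inside the registered block
 * [mr_addr, mr_addr + mr_len), 0 otherwise.
 * Written with subtractions only, so it cannot overflow.
 */
int range_in_region(uint64_t mr_addr, uint64_t mr_len,
                    uint64_t addr, uint64_t len)
{
    uint64_t offset;

    assert(mr_len > 0);
    if (addr < mr_addr)
        return 0;
    offset = addr - mr_addr;
    if (offset > mr_len)
        return 0;
    return len <= mr_len - offset;
}
```

For example, with a 4096-byte block registered at 0x1000, sending 2048 bytes starting at 0x1800 is fine, while 2049 bytes from the same address would run past the end of the block.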

  59. Andy Malakov says: January 22, 2016

    Thank you very much, Dotan. These pages are super-useful as an IB API reference.

    • Dotan Barak says: January 29, 2016

      :)

      Thanks for the great feedback
      Dotan

  60. liuyu says: May 31, 2016

    Hi Dotan.
    Thank you very much for your post and help!
    Now I have a problem: when I use ibv_post_send, I get a return value of 12. Before calling ibv_post_send, I checked send_wr.sge.addr and it is valid. I paste some code here:

    1) create qp:

    qp_attr.cap.max_send_wr = 1024;
    qp_attr.cap.max_send_sge = 1;
    qp_attr.cap.max_recv_wr = 1024;
    qp_attr.cap.max_recv_sge = 1;
    qp_attr.send_cq = send_cq;
    qp_attr.recv_cq = recv_cq;
    qp_attr.qp_type = IBV_QPT_RC;
    err = rdma_create_qp(cm_id, connection->pd, &qp_attr);

    2)query qp attr

    if (ibv_query_qp(connection->cm_id->qp, &attr, IBV_QP_STATE | IBV_QP_PATH_MTU | IBV_QP_CAP, &qp_attr))
    {
    printf("client query qp attr fail\n");
    return RETURN_ERROR;
    }

    I found attr.cap.max_send_wr is equal to 2015, and attr.cap.max_recv_wr is equal to 1024, attr.cap.max_send_sge is equal to 2, attr.cap.max_recv_sge is equal to 1.

    3)call ibv_post_send to send msg

    memset(&sge, 0, sizeof(sge));
    sge.addr = (uint64_t)cmd;
    sge.length = sizeof(CMD_S);
    sge.lkey = connection->connect_mr[MR_REQ].mr->lkey;

    memset(&send_wr, 0, sizeof(send_wr));
    send_wr.wr_id = (uint64_t)cmd;
    send_wr.next = NULL;
    send_wr.sg_list = &sge;
    send_wr.num_sge = 1;
    send_wr.opcode = IBV_WR_SEND;
    send_wr.send_flags = IBV_SEND_SIGNALED;
    ret = ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr);
    if (ret != 0)
    {
        printf("client send connect cmd failed, ret=%d.\n", ret);
        return RETURN_ERROR;
    }

    ret is equal to 12.

    I am confused by the following questions:
    1. I set max_send_wr to 1024 and max_send_sge to 1, but when I query the QP later, they have changed: max_send_wr is 2015 and max_send_sge is 2. Why?
    2. In my test, multiple pthreads call ibv_post_send. The test has two parameters: the number of threads and the queue depth per thread (this queue belongs to the test, it is not an RDMA queue). The test runs well with 8 threads and a queue depth of 32, but fails with 8 threads and a queue depth of 64: ibv_post_send returns the error value 12.

    Please give me some suggestions to help me find the key to resolving the problem. Thanks.

    • liuyu says: June 2, 2016

      I'd like to add that my test creates only one QP to send messages. 8 threads with a queue depth of 32 means that the QP sometimes has to handle 8*32 requests at once. Is the QP limited to handling 256 requests when max_send_wr is set to 1024? And are there limits when we use a QP for send/RDMA read/RDMA write?

      • Dotan Barak says: June 2, 2016

        Hi.

        The QP can handle Work Requests according to the max_send_wr that it was created with
        (and this value is limited by the HCA capabilities).

        However, please notice the following:
        * The Send Requests will be processed according to their order in the QP
        * RDMA Read & Atomic parallel processing is limited by max_rd_atomic and max_dest_rd_atomic
        for QP as initiator and destination

        Thanks
        Dotan

    • Dotan Barak says: June 2, 2016

      Hi.

      1. The RDMA device/low level driver can provide more resources than the originally requested value, according to its needs and internal structure
      2. I suspect that the Send Queue is full, i.e. you have many outstanding Send Requests (posted Send Requests that haven't ended with a Work Completion yet).

      You should either increase the rate of polling out the Work Completions from the CQ or increase the QP.max_send_wr value

      Thanks
      Dotan

      • liuyu says: June 2, 2016

        Hi Dotan.
        Thanks for your answer!
        But I'm still confused about what causes the Send Queue to be full. My test generates 256 requests in total up front and recycles them. So I think the RDMA Send Queue holds at most 256 Work Requests and should not be full. Could you give me a more detailed explanation?

      • Dotan Barak says: June 3, 2016

        Hi.

        A posted Work Request is considered outstanding until a Work Completion is generated for it or for a Work Request posted after it.
        When creating the QP, you specify the maximum number of outstanding Work Requests for both the Send Queue and the Receive Queue of that QP.

        I suspect that in your example, you post many Send Requests to the QP and don't poll the Work Completions for them.

        Thanks
        Dotan

  61. David R. says: June 8, 2016

    Hi Dotan,

    First of all, thank you so much for the blog! It is tremendously helpful!

    I'm not sure if this is the right place to ask this, but I'm having trouble with one of the sample programs from the RDMA Aware Programming User Manual. I'm not 100% certain, but I believe the problem has to do with ibv_post_send() so this was the best place I could think of to ask. The sample program is from Section 8.2 (Multicast Code Example Using RDMA CM). The basic description of this program is that a sender and receiver create a UD QP, join the multicast group, the sender posts a certain number of sends to the group, and the receiver waits to receive them. When I try to run the program, the sender successfully posts the sends, but the receiver never actually receives them. No errors are returned (from the sender or receiver); the receiver simply waits forever. However, if I add a sleep(1) call just before the sender calls ibv_post_send(), everything works correctly. At first I thought the problem was that the sender was posting the sends before the receives are posted by the receiver, but this does not appear to be the case. Are there any other reasons you know of that would explain why sleep() must be called before ibv_post_send() in this case? Or could this problem be caused by something else entirely and calling sleep() just appears to fix it? I'm not sure if this is a common issue or not; hopefully my question is not too vague. The code I am testing is from Revision 1.7 of the manual, but I can post or email it if that would help; just let me know. I greatly appreciate any help you can give me!

    Thanks!

    • Dotan Barak says: June 10, 2016

      Hi.

      Are you aware of the fact that there isn't any synchronization at all between the two sides in this test?
      i.e. the sender sends a message, but the remote side may not be ready to receive it
      (its QP isn't in the appropriate state, or a Receive Request wasn't posted, or it hasn't joined the multicast group yet).

      This is the reason that adding a sleep to the sender solves the problem...

      You can solve it by adding synchronization between the two sides, or by letting the server send again and again while waiting for an incoming response from the client.

      Thanks
      Dotan

      • David R. says: June 16, 2016

        Oh, I see. That makes sense. Thank you!

  62. Bill L says: August 9, 2016

    Hi Dotan. Like everyone else, thank you for such an informative resource for RDMA programming. My question: when ibv_post_send is used with one of the atomic opcodes (IBV_WR_ATOMIC_FETCH_AND_ADD or IBV_WR_ATOMIC_CMP_AND_SWAP), do you still need to poll for a completion event to be sure the atomic operation was successful? Or will the operation have completed when ibv_post_send returns?

    • Dotan Barak says: August 10, 2016

      Hi.

      Atomic operations, like any other operation, end when there is a Work Completion for them
      (or for any other Send Request that was posted after them).

      When ibv_post_send() returns, this means that the low-level driver has enqueued this Send Request to the RDMA device
      for future processing.

      Thanks
      Dotan

  63. windybeing says: January 11, 2017

    Hi Dotan.

    Thank you for such a guide to RDMA programming!

    I have some trouble with IBV_WR_SEND in UD. I use doorbell batching to post my sends (just like wr[i].next = &wr[i+1]). However, only the data of the last WR in the batch is received. I am sure that no error is thrown in my code, because if I replace IBV_WR_SEND with IBV_WR_SEND_WITH_IMM the same code works and the headers arrive correctly. Also, if I use a separate post_send for each WR, it works. I think something on the sender side is wrong.

    Hope that you can give me some advice!

    Thanks!

    • Dotan Barak says: February 10, 2017

      Hi.

      Please make sure that there isn't any race between the sides,
      and that when the message arrives at the remote side:
      1) The remote QP is in (at least) RTR state
      2) There are already Receive Requests available in the remote QP
      3) The receive buffers are big enough (i.e. at least message size + 40 bytes for the GRH)

      Thanks
      Dotan
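
Point 3 can be sketched as two small helpers, assuming the usual UD layout in which the first 40 bytes of every receive buffer are reserved for the GRH and the payload follows. `GRH_BYTES`, `ud_recv_buffer_len` and `ud_payload` are hypothetical names introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Space reserved for the GRH at the start of every UD receive buffer. */
#define GRH_BYTES 40u

/* Smallest receive buffer that can hold a UD message of `payload_len` bytes. */
size_t ud_recv_buffer_len(size_t payload_len)
{
    return payload_len + GRH_BYTES;
}

/* Start of the message payload inside a UD receive buffer. */
const uint8_t *ud_payload(const uint8_t *recv_buf)
{
    assert(recv_buf != NULL);
    return recv_buf + GRH_BYTES;
}
```

A Receive Request for a 100-byte UD message would therefore be posted with a buffer of at least 140 bytes, and the receiver would read the message starting at offset 40.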

  64. Param says: April 18, 2017

    Hi Dotan,

    I have a question. When I query my device, I see that max_qp_rd_atom is 16. So is it not possible to have more than 16? Why is this specific to RDMA Read operations? I do not see any problem when more than 16 Work Requests are posted for RDMA Read. What does attr.max_qp_rd_atom mean?

    • Dotan Barak says: July 3, 2017

      Hi.

      RDMA Read operations require special resources and handling on both the send and receive sides,
      which is the reason for the limitation.

      Configuring QP.max_rd_atomic limits the number of RDMA Reads that the QP processes at any time;
      you may post as many RDMA Read operations as you want, and the RDMA device will limit the processing.

      Thanks
      Dotan

  65. QiuHaonan says: May 9, 2017

    Hi Dotan, I have read many of your articles to learn RDMA programming.
    Now I have a problem. I searched RDMA_Aware_Programming_User_Manual.pdf (Version 1.7) and the IB Specification Vol 1-Release-1.3-2015-03-03.pdf for an answer but haven't found one, so I have to turn to you for help. The problem is: when I post a Work Request to a Queue Pair, the NIC gets a notification and fetches the Work Request from memory into the NIC cache by DMA. But when the NIC sends the data contained in the Work Request to the cable, does it need to fetch the Queue Pair information into the NIC cache? I know that the NIC cache stores the Queue Pair data, memory address translation data and some network data, but when the NIC sends data, is the Queue Pair information necessary?

    • Dotan Barak says: July 3, 2017

      Hi.

      When sending data, the RDMA device needs to fetch QP information:
      * QP state
      * PKey index
      * Qkey (for UD QPs, in specific scenarios)
      * Remote side attributes (for connected QPs)

      So, the answer is yes.

      Thanks
      Dotan

  66. Haodong says: June 22, 2017

    Hi Dotan,

    If I want to use "ibv_post_send", since we already have "IBV_WR_SEND", why do we need "IBV_WR_RDMA_WRITE"? Is there any performance difference between these two approaches?

    • Dotan Barak says: July 2, 2017

      Hi.

      Yes. There is a performance difference:
      * A Send operation will consume a Receive Request on the remote side
      * An RDMA Write operation won't, so a PCI read is avoided (better latency)

      Thanks
      Dotan

      • Haodong says: July 5, 2017

        Great! Thanks Dotan.

  67. qiuhaonan says: July 10, 2017

    Thanks for your answer, Dotan!
    After reading all the conversations in this post above, I have one more question (sorry for disturbing).
    The question is: when, where and how is the necessary QP information collected for a posted send WR?
    First, please allow me to sort out the procedure and explain my understanding.
    When I post an ibv_send_wr *wr using ibv_post_send, things go as follows:
    1. No context switch: in the same context, the ibv_post_send function transforms the ibv_send_wr *wr (libibverbs abstraction) into a WQE (HW-specific Send Request; the WQE is described in the Ethernet_Adapter_programming_Manual). Constructing the WQE requires a Ctrl segment, Eth segment, Memory Management segment and Data segment, and the Ctrl segment includes the SQ number attribute (which seems to be the necessary information about the QP).
    2. After constructing the new WQE, writing the WQE to the WQE buffer, and updating the Doorbell record associated with that queue (ibv_post_send API returns).
    3. The device gets a notification and asynchronously processes these new WQEs.
    4. After a Work Request has been processed, the NIC writes CQEs to the relevant CQ by DMA.
    5. I poll the CQ and get the notifications.
    OK, that is the whole procedure. Is there any error in it?
    From the procedure above, can I guess that collecting the necessary QP information happens when ibv_send_wr is transformed into a WQE (i.e. when calling ibv_post_send)?
    And another question (sorry for my curiosity): as far as I know, at the software level the QP number is the unique identifier used to steer a network message flow to the corresponding QP, and at the hardware level the GID and port are the unique identifiers used to steer the packet flow. So, to summarize the question above, can I treat "fetching QP information for a Work Request" as "fetching the QP number and other non-unique information"?
    Sorry for so many words, but I am really interested in this part. If I expressed myself poorly, please point it out and I will improve. Thanks for your patience, Dotan!

    • Dotan Barak says: July 21, 2017

      Hi.

      This is an interesting question.
      After the following step:
      "2. After constructing the new WQE, writing the WQE to the WQE buffer, and updating the Doorbell record associated with that queue (ibv_post_send API returns)."
      the WQE has been enqueued to the RDMA device for processing; when the processing actually starts, the RDMA device needs to collect relevant information for the QP:
      * The QP type
      * Remote QP number (for connected QP)
      * Path to the remote QP (for connected QP)
      * Send PSN
      * more

      I hope that I answered your question.

      Thanks
      Dotan

      • qiuhaonan says: July 27, 2017

        Hi,Dotan
        I got it. There are still so many things that the device needs to do.
        Sorry for my recklessness. I should carefully read the driver source code and then ask my questions. But I do learn very much from your detailed articles. Thanks for your patience and generosity.

      • Dotan Barak says: July 27, 2017

        :)
