ibv_post_recv()

int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
                  struct ibv_recv_wr **bad_wr);

Description

ibv_post_recv() posts a linked list of Work Requests (WRs) to the Receive Queue of a Queue Pair (QP).

ibv_post_recv() goes over all of the entries in the linked list, one by one, checks that each one is valid, generates a HW-specific Receive Request out of it and adds it to the tail of the QP's Receive Queue without performing any context switch. The RDMA device will take one of those Work Requests as soon as an incoming opcode to that QP consumes a Receive Request (RR). If one of the WRs fails because the Receive Queue is full or one of its attributes is bad, ibv_post_recv() stops immediately and returns a pointer to that WR through bad_wr.
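Since the list is walked in order and the walk stops at the first failing WR, the position of bad_wr tells how far the verb got. The following is a minimal sketch of that bookkeeping, not part of libibverbs: wrs_before_bad() is a hypothetical helper, and the two structs are local stand-ins that mirror the definitions shown in this post so that the sketch compiles without libibverbs (real code would include <infiniband/verbs.h> instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the structs shown in this post; real code
 * would take them from <infiniband/verbs.h>. */
struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct ibv_recv_wr {
	uint64_t wr_id;
	struct ibv_recv_wr *next;
	struct ibv_sge *sg_list;
	int num_sge;
};

/* Hypothetical helper: count the WRs that precede bad_wr in the list that
 * was handed to ibv_post_recv(). Returns -1 if bad_wr is not in the list. */
int wrs_before_bad(const struct ibv_recv_wr *head,
		   const struct ibv_recv_wr *bad_wr)
{
	int n = 0;

	assert(bad_wr != NULL);
	for (const struct ibv_recv_wr *p = head; p != NULL; p = p->next) {
		if (p == bad_wr)
			return n;
		n++;
	}
	return -1;
}
```

A caller could use the result to decide which WRs to retry later, starting from bad_wr itself.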

A QP, which isn't associated with an SRQ, will handle Work Requests in the Receive queue according to the following rules:

If the QP is in RESET state, an immediate error should be returned. However, there may be low-level drivers that don't follow this rule (to eliminate an extra check in the data path, thus providing better performance), in which case Receive Requests posted in this state may be silently ignored.
If the QP is in INIT state, Receive Requests can be posted, but they won't be processed.
If the QP is in RTR, RTS, SQD or SQE state, Receive Requests can be posted and they will be processed.
If the QP is in ERROR state, Receive Requests can be posted and they will be completed with error.
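The rules above can be summarized as a small lookup. This is only an illustrative sketch: the enum names below are stand-ins invented for this example (libibverbs defines the real enum ibv_qp_state), and rr_fate_in_state() is a hypothetical helper, not a library call.

```c
#include <assert.h>

/* Stand-in for the QP states named above. */
enum qp_state {
	QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_SQD, QPS_SQE, QPS_ERROR
};

/* What is expected to happen to a Receive Request posted in a given state. */
enum rr_fate {
	RR_REJECTED,	/* immediate error (some drivers may silently ignore it) */
	RR_QUEUED_ONLY,	/* can be posted, but won't be processed */
	RR_PROCESSED,	/* can be posted and will be processed */
	RR_FLUSHED	/* can be posted, will be completed with error */
};

enum rr_fate rr_fate_in_state(enum qp_state s)
{
	switch (s) {
	case QPS_RESET:
		return RR_REJECTED;
	case QPS_INIT:
		return RR_QUEUED_ONLY;
	case QPS_RTR:
	case QPS_RTS:
	case QPS_SQD:
	case QPS_SQE:
		return RR_PROCESSED;
	case QPS_ERROR:
		return RR_FLUSHED;
	}
	assert(!"unknown QP state");
	return RR_REJECTED;
}
```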

If the QP is associated with a Shared Receive Queue (SRQ), one must call ibv_post_srq_recv(), and not ibv_post_recv(), since the QP's own receive queue will not be used.

If a RR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming message, the content of those first 40 bytes will be undefined. This means that in all cases, the actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the scatter list.
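The 40-byte offset translates into simple arithmetic when sizing and reading UD receive buffers. Here is a minimal sketch; the helper names and the GRH_BYTES constant are made up for this example, and ud_payload_len() assumes that the byte count reported for the receive includes the 40 GRH bytes, which should be verified in your own completion handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRH_BYTES 40	/* space reserved for the GRH at the start of the buffer */

/* Buffer size to post on a UD QP that should accept payloads of up to
 * max_payload bytes. */
size_t ud_recv_buf_size(size_t max_payload)
{
	return GRH_BYTES + max_payload;
}

/* Start of the message data inside a completed UD receive buffer. */
uint8_t *ud_payload(uint8_t *recv_buf)
{
	assert(recv_buf != NULL);
	return recv_buf + GRH_BYTES;
}

/* Payload length, assuming bytes_received counts the GRH area as well. */
size_t ud_payload_len(size_t bytes_received)
{
	assert(bytes_received >= GRH_BYTES);
	return bytes_received - GRH_BYTES;
}
```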

The struct ibv_recv_wr describes the Work Request to the Receive Queue of the QP, i.e. Receive Request.

struct ibv_recv_wr {
	uint64_t		wr_id;
	struct ibv_recv_wr     *next;
	struct ibv_sge	       *sg_list;
	int			num_sge;
};

Here is the full description of struct ibv_recv_wr:

wr_id	A 64-bit value associated with this WR. When this Work Request ends, a Work Completion will be generated and it will contain this value
next	Pointer to the next WR in the linked list. NULL indicates that this is the last WR
sg_list	Scatter/Gather array, as described in the table below. It specifies the buffers that the incoming data will be written to. The entries in the list can specify memory blocks that were registered by different Memory Regions. The maximum message size that it can serve is the sum of the lengths of all the memory buffers in the scatter/gather list
num_sge	Size of the sg_list array. This number can be less than or equal to the number of scatter/gather entries that the Queue Pair was created to support in the Receive Queue (qp_init_attr.cap.max_recv_sge). If this size is 0, this indicates that the message size is 0
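To make the next and sg_list fields concrete, here is a minimal sketch that links several WRs into one NULL-terminated list, each receiving into its own slot of a single buffer. build_recv_chain() is a hypothetical helper; base, slot_size and lkey are assumed to describe memory that was already registered; and the structs are local stand-ins mirroring the definitions in this post so that the sketch compiles without libibverbs (real code would include <infiniband/verbs.h>).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local stand-ins mirroring the structs shown in this post. */
struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct ibv_recv_wr {
	uint64_t wr_id;
	struct ibv_recv_wr *next;
	struct ibv_sge *sg_list;
	int num_sge;
};

/* Fill n WRs (and n one-entry scatter lists) so that WR i receives into
 * slot i of one buffer, and link them into a NULL-terminated list. */
struct ibv_recv_wr *build_recv_chain(struct ibv_recv_wr *wrs,
				     struct ibv_sge *sges, int n,
				     void *base, uint32_t slot_size,
				     uint32_t lkey)
{
	assert(n > 0);
	for (int i = 0; i < n; i++) {
		memset(&sges[i], 0, sizeof(sges[i]));
		sges[i].addr   = (uintptr_t)base + (uint64_t)i * slot_size;
		sges[i].length = slot_size;
		sges[i].lkey   = lkey;

		/* zero the whole WR so that no field holds garbage */
		memset(&wrs[i], 0, sizeof(wrs[i]));
		wrs[i].wr_id   = (uint64_t)i;	/* identifies the slot in the completion */
		wrs[i].sg_list = &sges[i];
		wrs[i].num_sge = 1;
		/* the last WR must terminate the list with NULL */
		wrs[i].next    = (i + 1 < n) ? &wrs[i + 1] : NULL;
	}
	return &wrs[0];
}
```

The returned head would then be passed as the wr argument of a single ibv_post_recv() call.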

struct ibv_sge describes a scatter/gather entry. The memory buffer that this entry describes must stay registered until every posted Work Request that uses it is no longer considered outstanding. The order in which the RDMA device accesses the memory in a scatter/gather list isn't defined. This means that if some of the entries overlap the same memory address, the content of this address is undefined.

struct ibv_sge {
	uint64_t		addr;
	uint32_t		length;
	uint32_t		lkey;
};

Here is the full description of struct ibv_sge:

addr	The address of the buffer to read from or write to
length	The length of the buffer in bytes. The value 0 is a special value and is equal to 2^31 bytes (and not zero bytes, as one might imagine)
lkey	The Local key of the Memory Region that this memory buffer was registered with
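Putting the two tables together, the largest message a single RR can hold is the sum of its scatter/gather lengths, with a length of 0 counted as 2^31 bytes. A minimal sketch of that calculation follows; recv_capacity() is a hypothetical helper, and the structs are local stand-ins mirroring the definitions in this post (real code would include <infiniband/verbs.h>).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the structs shown in this post. */
struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct ibv_recv_wr {
	uint64_t wr_id;
	struct ibv_recv_wr *next;
	struct ibv_sge *sg_list;
	int num_sge;
};

/* Maximum message size, in bytes, that this single RR can serve. */
uint64_t recv_capacity(const struct ibv_recv_wr *wr)
{
	uint64_t total = 0;

	assert(wr != NULL);
	for (int i = 0; i < wr->num_sge; i++) {
		uint32_t len = wr->sg_list[i].length;

		/* a length of 0 is the special value meaning 2^31 bytes */
		total += (len == 0) ? (UINT64_C(1) << 31) : len;
	}
	return total;	/* 0 when num_sge is 0 */
}
```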

While a WR is considered outstanding, the content of its local memory buffers shouldn't be read, since one doesn't know when the RDMA device will stop writing new content to them.

Parameters

Name	Direction	Description
qp	in	Queue Pair that was returned from ibv_create_qp()
wr	in	Linked list of Work Requests to be posted to the Receive Queue of the Queue Pair
bad_wr	out	A pointer that will be filled with the first Work Request whose processing failed

Return Values

Value	Description
0	On success
errno	On failure; no change will be done to the QP and bad_wr points to the RR that failed to be posted
EINVAL	Invalid value provided in wr
ENOMEM	Receive Queue is full or not enough resources to complete this operation
EFAULT	Invalid value provided in qp

Examples

Posting a RR to a QP that isn't associated with an SRQ:

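struct ibv_sge sg;
struct ibv_recv_wr wr;
struct ibv_recv_wr *bad_wr;

/* describe the registered buffer that will receive the message */
memset(&sg, 0, sizeof(sg));
sg.addr   = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey   = mr->lkey;

/* zero the whole WR so that no field (next in particular) holds garbage */
memset(&wr, 0, sizeof(wr));
wr.wr_id   = 0;
wr.next    = NULL;	/* single WR: terminate the list explicitly */
wr.sg_list = &sg;
wr.num_sge = 1;

if (ibv_post_recv(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_recv() failed\n");
	return -1;
}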

FAQs

Does ibv_post_recv() cause a context switch?

No. Posting a RR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

How many WRs can I post?

There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created.

Can I know how many WRs are outstanding in a Work Queue?

No, you can't. You should keep track of the number of outstanding WRs according to the number of posted WRs and the number of Work Completions that you polled.
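One way to do this bookkeeping is a small counter that is incremented on every successful post and decremented for every polled Work Completion. The sketch below is only an illustration; struct rq_tracker and its helpers are hypothetical names, and max_recv_wr is assumed to be the Receive Queue depth of the QP.

```c
#include <assert.h>
#include <stdbool.h>

struct rq_tracker {
	int max_recv_wr;	/* Receive Queue depth of the QP */
	int outstanding;	/* posted RRs whose completions weren't polled yet */
};

/* Is there room for n more RRs? */
bool rq_can_post(const struct rq_tracker *t, int n)
{
	return t->outstanding + n <= t->max_recv_wr;
}

/* Call after ibv_post_recv() succeeded for n RRs. */
void rq_on_posted(struct rq_tracker *t, int n)
{
	assert(rq_can_post(t, n));
	t->outstanding += n;
}

/* Call after polling n receive Work Completions. */
void rq_on_completions(struct rq_tracker *t, int n)
{
	assert(n <= t->outstanding);
	t->outstanding -= n;
}
```

If several threads post to or poll the same QP, the counter would need to be protected or made atomic.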

Which operations will consume RRs?

If the remote side posts a Send Request with one of the following opcodes, a RR will be consumed:

Send
Send with Immediate
RDMA Write with immediate

What will happen if I deregister an MR that is used by an outstanding WR?

When processing a WR, if one of the MRs that are specified in the WR isn't valid, a Work Completion with error will be generated.

I called ibv_post_recv() and got a segmentation fault, what happened?

There may be several reasons for this to happen:
1) At least one of the sg_list entries is at an invalid address
2) The value of next points to an invalid address
3) An error occurred in one of the posted RRs (bad value in the RR or full Work Queue) and the variable bad_wr is NULL
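Some of these mistakes can be caught with a cheap check before posting. The sketch below is an illustration under stated assumptions: check_recv_chain() is a hypothetical helper, max_wrs is a bound chosen by the caller, and the structs are local stand-ins mirroring the definitions in this post (real code would include <infiniband/verbs.h>). It can only catch NULL pointers and lists that loop back or never end; a next pointer holding garbage will still crash the walk, so zeroing every WR before filling it remains the real protection.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins mirroring the structs shown in this post. */
struct ibv_sge {
	uint64_t addr;
	uint32_t length;
	uint32_t lkey;
};

struct ibv_recv_wr {
	uint64_t wr_id;
	struct ibv_recv_wr *next;
	struct ibv_sge *sg_list;
	int num_sge;
};

/* Returns 0 if the list passes the basic checks, -1 otherwise. */
int check_recv_chain(const struct ibv_recv_wr *head,
		     struct ibv_recv_wr **bad_wr, int max_wrs)
{
	int n = 0;

	assert(max_wrs > 0);
	if (head == NULL || bad_wr == NULL)
		return -1;	/* bad_wr must point to a valid variable */

	for (const struct ibv_recv_wr *p = head; p != NULL; p = p->next) {
		if (++n > max_wrs)
			return -1;	/* list too long, or next forms a loop */
		if (p->num_sge < 0)
			return -1;
		if (p->num_sge > 0 && p->sg_list == NULL)
			return -1;	/* entries promised but no array given */
	}
	return 0;
}
```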

Help, I've posted a Receive Request and it wasn't completed with a corresponding Work Completion. What happened?

In order to debug this kind of problem, one should do the following:

Verify that a Send Request was actually posted in the remote QP
Verify that a Receive Request was actually posted in the local QP
Wait enough time, maybe a Work Completion will eventually be generated
Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
Verify that the QP is in one of the following states: RTR, RTS, SQD, SQE, ERROR

I had code that worked with UC or RC QPs and I added support for UD QPs, but I keep getting Work Completions with error. What happened?

For a UD QP, an extra 40 bytes should be added to the RR buffers (to allow saving the GRH, if one exists in the message).

Can I (re)use the Receive Request after ibv_post_recv() returned?

Yes. This verb translates the Receive Request from the libibverbs abstraction to a HW-specific Receive Request and you can (re)use both the Receive Request and the s/g list within it.

Written by: Dotan Barak on February 2, 2013. Last updated on February 18, 2015.

Comments

Igor R. says: February 11, 2014

Hi Dotan,

Is it possible to increase the "level of debugability" of ibverbs? In particular, to have some debug symbols and/or verbose tracing messages from libibverbs?
I'm trying to debug a segfault in ibv_post_recv (bad_wr is ok, and there's no obvious reason for the segfault; perhaps some memory corruption prior to this call, but it would be helpful to see where it crashes exactly).

- Dotan Barak says: February 11, 2014
  
  Hi Igor.
  
  I'm very sorry, but there isn't any built-in capability for debugging in libibverbs
  (or in most of low-level drivers implementations in the user level).
  
  If you'll send me the part of your code that calls the ibv_post_recv() function, I might give you a tip where the potential bug is.
  
  Thanks
  Dotan
  
  - Igor R. says: February 11, 2014
    
    Unfortunately, it's quite a lot of code, and I've not managed yet to produce an SSCCE.
    The error is inconsistent (as it usually happens with memory-management issues in async systems), but the symptom is that sometimes it's segfault in ib_post_recv, and sometimes it's ib_post_send that returns EINVAL. Really looks like bad_wr == 0, but it's not.
    Just to be sure: the expected lifespan of bad_wr is just during ib_post_xxx call (i.e., after the function returns, this pointer won't be used anymore), right? And the expected lifespan of the memory region and its underlying buffer is until the appropriate WC gets arrived, isn't it?
  - Igor R. says: February 11, 2014
    
    Solved.
    It happened due to the lack of total zeroing of ibv_send_wr/ibv_recv_wr prior to their use (despite the proper initialization of "sensible" fields).
  - Dotan Barak says: February 11, 2014
    
    Hi Igor.
    
    I'm happy that this issue was solved.
    I would suggest to zero the structure fields one by one to check which one caused the failure,
    since maybe the memset() that you use hides a bug.
    (If you will be able to share the field name that caused the problem, it will be nice).
    
    I *think* that the send_flags attributes is the problem (and the problem was that the inline bit indicator was set).
    
    Thanks
    Dotan
    
Igor R. says: February 11, 2014

Not send_flags, much worse: "next" pointer.

- Dotan Barak says: February 11, 2014
  
  Yes, invalid address in the 'next' pointer may be a problem..
  One should set this pointer to NULL explicitly in the last SR.
  
  I'll update the relevant posts to reflect this, so next programmer(s) won't fall into this hole too...
  
  Thanks
  Dotan
  
  - Igor R. says: February 11, 2014
    
    Thanks!
    BTW, what's the difference between posting a chain of WRs vs. calling ibv_post_xxx several times? Just a function call?
  - Dotan Barak says: February 11, 2014
    
    Posting chained WRs, compared to calling ibv_post_XXX() several times, saves not only the function calls themselves; it also allows the low-level driver to perform optimizations that may result in better performance. For example: adding all the WRs to the queue and only then notifying the RDMA device about them.
    
    Dotan
  - Igor R. says: February 11, 2014
    
    Great, thanks!
ganesh.irrinki says: April 11, 2014

Hi Dotan,
I'm new to RDMA concepts. can you please tell me what are "chained WRs posted on an SQ of a QP". i think WRs are individual to each other how can be they chained together?

Thanks,
I.Ganesh.

- Dotan Barak says: April 11, 2014
  
  Hi.
  
  "Chained WRs posted on an SQ of a QP" are actually a linked list of Send Requests.
  As you (correctly) said, every Send Request by itself is individual (and independent),
  but posting them together allows some optimizations when compared to posting them
  one by one.
  
  I hope that now it is clear.
  
  Thanks
  Dotan
  
  - ganesh.irrinki says: April 11, 2014
    
    Hi Dotan,
    
    So,if suppose in a SQ/RQ has the following WRs respectively
    
    1)SEND
    2)RDMA-READ
    2)RDMA-READ
    3)RDMA-READ
    4)RDMA-WRITE
    5)RDMA-WRITE
    6)SEND
    .........
    .........
    
    then above multiple reads & multiple writes can be done at a time will give better performance......am i right?
    And we can call this process as Chained WRs posted on SQ/RQ of a QP. right?
    
    Thanks for your advise & quick reply....
    Thanks,
    I.Ganesh
  - Dotan Barak says: April 11, 2014
    
    In general: Yes
    
    They won't be processed at the *same* time, but posting them as one linked list will give better performance (for example: you'll consume fewer CPU cycles and may get a better message rate) compared to posting them one by one.
    
    BTW, if you perform RDMA Read followed by RDMA Write, Send or Atomic operation you may need to use Fence
    (if you access the same addresses).
    
    Thanks
    Dotan
  - ganesh.irrinki says: April 13, 2014
    
    Thank you very much Dotan......thanks a lot.
  - ganesh.irrinki says: April 14, 2014
    
    Hi Dotan,
    I faced problem with ib_reg_phys_mr() ,can you please help me?
    I have mellanox 10G card having mlx4_en module on ScientificLinux6u3,kernel is 2.6.32-279.22.1.el6.x86_64.
    
    Look at once The following code :
    
    dbgmr->mr_handle = ib_reg_phys_mr(pd, phys_buff,
    dbgmr->num_dma_addr, acc_flag,
    (u64*)&phys_buff[0].addr);
    printk("dbgmr->mr_handle = 0x%llx\n",dbgmr->mr_handle);
    if (IS_ERR(dbgmr->mr_handle)) {
    .............
    .............
    .............
    }
    
    Problem: the above code printk() prints dbgmr->mr_handle = 0xffffffffffffffda
    
    But the cusor goes to the if(IS_ERR(dbgmr->mr_handle)){......} block and exit the function. Which is not desired.How can i avoid this problem?
    
    Thanks,
    I.Ganesh.
  - Dotan Barak says: April 14, 2014
    
    Hi I.Ganesh.
    
    I assume that the value 0xffffffffffffffda is a kernel encoding
    (using ERR_PTR) of the value -ENOSYS.
    
    Could it be that the device that you are using doesn't support this verb?
    (or something is problematic in your RDMA stack...)
    
    If you are using a Mellanox Ethernet device, you should check if RoCE is enabled on this port/device.
    
    Thanks
    Dotan
    
    If you like RDMAmojo, support it.
ganesh.irrinki says: April 14, 2014

yes Dotan, you're Absolutely Correct. 0xffffffffffffffda is an kernel encoding
(using ERR_PTR) of the value -ENOSYS(-38). And i have heard that Mellanox doesn't implement the function ib_reg_phys_mr(). But i need to do it now. How can i implement that memory registration through any another option on this Mellanox card.Can you please help me?
Thanks a lot Dotan, for giving me knowledge....

Thanks,
Ganesh.

- Dotan Barak says: April 15, 2014
  
  Hi Ganesh.
  
  You can find in the Linux kernel examples on how to register memory in the kernel:
  drivers/infiniband/ulp
  
  For example, one function that you can use is ib_get_dma_mr().
  
  I hope that I gave you a hint on how to continue.
  Dotan
  
  - ganesh.irrinki says: April 15, 2014
    
    Thanks Dotan, I will see it. thank you very much for helping me....
  - ganesh.irrinki says: April 23, 2014
    
    Hi Dotan,
    I wanted to create chained WRs, last time you told me that chained WRs are nothing but linked lists of WRs which can be posted through either ibv_post_send() or ibv_post_recv(). we know that each WR contains memory buffer details which are already crated previously,but what i struggled was if there are 100 buffers having each differeent QP(100 QPs), do we need to maintain & check each WR details like QP_num,QP_type[RC,UC,UD],sg_list,sg_length,sg_lkey,opcode,imm_data,.....etc. I don't get any idea to create the linked list of WRs.....can you please help me. if possible send me some code regarding this.
    
    Thanks & Regards,
    Ganesh.
  - Dotan Barak says: April 23, 2014
    
    Hi Ganesh.
    
    I must admit that I didn't fully understand your problem.
    
    However, you should be able to know which buffers have outstanding Work Request
    (to prevent access to them by your application before the RDMA device stopped working with them).
    To know which WR was posted to any QP is needed only if you need it in your application
    (if for example, in the incoming message your application keeps the origin ID, maybe this isn't required).
    
    As a general rule of thumb you don't need to keep the whole Work Request that was posted, you need to keep track only to the important information..
    
    I hope that this helped you
    Dotan
    
  - ganesh.irrinki says: April 24, 2014
    
    Hi Dotan,
    Thanks for your explanation. suppose i have 3 buffers, let me assume mr1,mr2,mr3. I need to create chained wr for those buffers. so i need to maintain those mr's details as wr, so finally i have 3 wr's(assume all are outstanding wr's). now i need to create a linked list for those 3 wr's and finally post the list with ibv_post_send() or ibv_post_recv(). can you please give me a programmatic overview for the above requirement?
    
    Thanks & Regards,
    Ganesh.
  - Dotan Barak says: April 24, 2014
    
    Hi Ganesh.
    
    You have 2 options:
    1) Have one Work Request with one S/G entry to each MR
    2) Have several Work Requests each of them will point to a different MR
    
    which of the above to use is a decision that you need to take.
    
    But anyway, I fail to understand what is the question that you need me to answer to.
    Dotan
  - ganesh.irrinki says: April 25, 2014
    
    Hi Dotan,
    Thanks for your explanation. Let me try first and will consult you if i will face any problems. Thanks for your Blog also....
    
    Thanks,
    Ganesh.
  - Dotan Barak says: April 25, 2014
    
    Thanks
    :)
    
    Dotan
Henry Fu says: August 22, 2014

Hi Dotan,

Thanks for the answer to my previous question which helps a lot.

I have another problem regarding on the Send/Recv. I was trying to build a constant communication using Send/Recv but it did not work for some reason. Here is a brief description of my approach: after the connection is established, on the responder's side, I have a while loop which post a RR and wait to get_cq_event, poll the cq, return and continue the loop. On the sender's side, it is almost the same except that a send request is posted instead of RR. However, the test result is that on the responder's side, it stalls at get_cq_event. On the sender's side, it keeps telling me that the WC status is not successful (instead it is 13-IBV_WC_RNR_RETRY_EXC_ERR and 5-IBV_WC_WR_FLUSH_ERR). Can you see any flaws of this general approach? If not, I may instead have some minor issues in the code to figure out.

Many thanks!

Henry

- Dotan Barak says: August 24, 2014
  
  Hi Henry.
  
  The protocol you've just described looks fine. I have a feeling that something went wrong in the implementation.
  
  The problem is that the sender side sent a message but there wasn't any Receive Request ready on the receiver side. This is the reason for the first bad completion (IBV_WC_RNR_RETRY_EXC_ERR). The rest of the bad completions (IBV_WC_WR_FLUSH_ERR) mean that the Work Queue is in error state.
  
  Thanks
  Dotan
  
  - Henry Fu says: August 25, 2014
    
    Hi Dotan,
    
    Thanks for the answer. It seems that if I put some time interval (usleep(100) for example) before ibv_post_send on the sender's side, it'll work. Although it works fine now, I'm not quite understand the reasoning behind it. Does it indicate that too many ibv_post_send will clog the sender's cq or overflow the remote cq? But I have a ibv_get_cq_event after each ibv_post_send or ibv_post_recv, it should block until the send/recv is finished, right?
    
    Thanks,
    
    Henry
  - Dotan Barak says: August 26, 2014
    
    I fully agree, adding sleep is problematic as a constant solution...
    
    I would expect to get different completion than you mentioned in case of a CQ overrun, so this is weird.
    
    The sleep solved the problem, since it made the receiver side to be faster than the sender side, and gave him a chance to post a Receive Request to the Receive Queue - and by this prevent the RNR error.
    
    Thanks
    Dotan
Govind Patidar says: December 15, 2014

hiii,
I want to post some thousands(~50000) of work request in advance but the value of max_qp_wr = 16384, so can you suggest some alternate way by which max_qp_wr can be increased (by changing device etc...).

- Dotan Barak says: December 15, 2014
  
  Hi.
  
  This is a device capability and cannot be changed.
  
  The question is: do you really need so many Work Requests?
  * Maybe you can unite some of the messages in one Work Request?
  * Maybe you can use multiple QPs to send those Work Requests
  
  Just throwing some ideas here ..
  
  Thanks
  Dotan
  
neuralcn says: April 27, 2015

One client only send, one server only receive, how to avoid RNR error?

- Dotan Barak says: April 27, 2015
  
  Hi.
  
  In order to prevent getting into RNR, one needs to always make sure that there are enough Receive Requests in the Receive Queue.
  
  - Dotan Barak says: May 7, 2015
    
    Hi.
    
    The comments are moderated; this is the reason that you didn't see it
    (until I approved it).
    
    RNR error means that the clients send messages to the server, but there aren't enough Receive Requests in the Server side.
    
    There are few options to deal with this:
    * Increase the RNR timeout / RNR retry count
    * Increase the number of Receive Requests in the server side
    * Use Shared Receive Queue (SRQ) in the server side instead of a QP
    * Maybe use multiple SRQs on the server side, if the server can't post Receive Requests to one SRQ fast enough
    * Implement flow control in your application (to prevent a case in which messages are sent without Receive Requests on the server side).
    
    I hope that this helped you
    
    Thanks
    Dotan
Mavis says: June 27, 2015

Hi Dotan,

I have a question about the order of the ibv_post_recv(). The work request should always be in order for consumption, or they can be in different order? For example, I have several thread post work request to SQ/RQ so the order of WR in SQ may be different from that in RQ.

For example
SQ RQ
RDMA_WRITE_WITH_IMM(zero byte msg) RDMA_WRITE_WITH_IMM
RDMA_WRITE_WITH_IMM RDMA_WRITE_WITH_IMM(zero byte msg)
will fail, right?

The problem is more severe using SRQ? Does it mean I can only use one thread to post WR to make sure the order of SQ/RQ is the same? Thank you so much!

Best,
Mavis

- Dotan Barak says: June 29, 2015
  
  Hi Mavis.
  
  The order of the Receive Request consumptions in a Receive Queue is by the order that they were posted to it.
  When you have a SRQ, you cannot predict which Receive Request will be consumed by which QP,
  so all the Receive Requests in that SRQ should be able to contain the incoming message (in terms of length).
  
  Thanks
  Dotan
  
Mavis says: June 27, 2015

Hi Dotan,
The order of WR in SQ/RQ should always be the same, right? If there are multiple threads posting WRs to SQ/RQ, it seems there is no guarantee on the order. So should I use one thread to post WRs?
Thanks!

- Dotan Barak says: June 29, 2015
  
  Hi Mavis.
  
  The order of processing a Work Request is guaranteed per Work Queue according to the order the Work Requests were added to it.
  
  Using threads just make it harder to predict the order.
  
  I don't know what you are trying to do, but if the remote RQ can hold any message that you send,
  you can continue working with threads.
  
  Thanks
  Dotan
  
Tingyu says: July 15, 2015

Hi Dotan,

Is it possible to register a large receive buffer once, but post it many times at the same time, and at each post setting "sg.addr" to a different region of this large buffer?

Thanks,
Tingyu

- Dotan Barak says: July 16, 2015
  
  Yes.
  
  Memory Region can be registered once and worked with many times, access any address within this region.
  
  Thanks
  Dotan
  
Junhyun says: September 27, 2016

Hi Dotan,

Should sge.length used in ibv_post_recv WR be exactly the size of the incoming payload or can it be larger? What happens if the buffer is smaller? would it trigger a Segmentation Fault?

- Dotan Barak says: October 1, 2016
  
  Hi.
  
  The buffer size of a receive request should be at least the size of the incoming message.
  If the supplied buffer is smaller, there will be a Work Completion with error
  (and not a segmentation fault, which is always a program bug)
  
  Thanks
  Dotan
  
Junhyun says: September 27, 2016

Also, is it possible to know a priori the size of the next incoming payload before the MR I provided for recv is filled?

- Dotan Barak says: October 1, 2016
  
  Hi.
  
  The answer is no, exchanging the supported message sizes is something that the application must handle
  (the maximum message size is 2 GB)
  
  Thanks
  Dotan
  
Junhyun says: December 14, 2016

Hi, Dotan
If I'm sure the opcodes of incoming SRs will be RDMA_WRITE_WITH_IMMs exclusively,
is it okay to post recv requests with sge {addr: nullptr, length: 0}?

- Dotan Barak says: February 10, 2017
  
  Yes.
  
  If you are sure that the opcode will be RDMA Write with immediate,
  the S/G list can be empty, and the Receive Request will be used only to provide the immediate data
  in the Work Completion.
  
  Thanks
  Dotan
  
Srinivas says: January 1, 2017

Hi Dotan,

If we post a work request with IBV_SEND_INLINE, does it apply to only send, or would the receive work request also be inline? I mean to say should we still post receive buffer to receive data that is sent through inline send, or whether it will be part of the work request itself, without posting any receive buffer?

- Dotan Barak says: February 10, 2017
  
  Hi.
  
  IBV_SEND_INLINE is relevant *only* to the Send side.
  
  The inline indicator tells the HW how the message buffer will be provided to it
  (the HW will perform a DMA read, or the data will be contained in the Send Request descriptor).
  
  Over the wire, one cannot know if it was "inlined" or not.
  
  The receive side should behave the same for inline/non-inline buffer messages
  (since it just doesn't know this information and it isn't relevant to it).
  
  Thanks
  Dotan
  
wangt0907 says: August 1, 2017

Hi, Dotan,
I want to use SEND&RECV verbs to implement a RPC subsystem. The server's recv queue should have enough recv wrs for fear that clients' send wrs fail. But the send wrs has various lengths. I wonder how to determine the address and length of buffers when posting recv wrs, so that every send can be well received and keep the use of buffer efficient. Thank you!
My English is not very well, I'm sorry if there is something unclear...

- Dotan Barak says: August 1, 2017
  
  Hi.
  
  The Receive Requests are fetched according the order they exist in the Receive Queue and the incoming message
  (Receive Request N buffers will be filled with the incoming N message).
  
  If you plan to have several length messages,
  maybe you should have several QPs for efficient work, each one will serve different message sizes range.
  
  For example:
  * QPX messages: 1-2KB
  * QPY messages: 2-16KB
  * Etc.
  
  I hope that I answered your question.
  Dotan
  
RAPHAEL says: April 6, 2018

any chance to get rid off those 40 extra bytes for UD/QP ? is there a workaround when you want concatenate incoming data ?

regards

- Dotan Barak says: April 19, 2018
  
  Hi.
  
  No. There isn't any way to ignore the extra 40 bytes of the GRH.
  What you can do is to write the GRH to specific location in all the Receive Requests,
  and concatenate only the incoming data.
  
  Thanks
  Dotan
  
ksang says: July 7, 2019

Hi Dotan,

After did some tests in terms of reading arbitrary data from sender, my finding is:
if the sge length requested in receiver's ibv_post_recv is smaller than the actual bytes length the sender sent in ibv_post_send, data after that length will be discarded. If receiver posted another receive request, it will receive data from next WRs sender send.
Please correct me if my understanding is incorrect.

Thanks

- Dotan Barak says: July 8, 2019
  
  Hi.
  
  Please let me answer a little differently from how you asked the question:
  If the length of an incoming message with a Send opcode is longer than the total length of the Receive Request in the Receive Queue,
  the message will be dropped (and there may be an error - depends on the QPs transport type).
  
  Let me emphasize this: the same message won't be written to the buffers of several consecutive Receive Requests.
  
  However, an incoming message will be written to all the buffers specified in a single Receive Request, according to the SGE entries.
  Example:
  We got an incoming message of 100 bytes.
  There is a RR with:
  S/G[0]: buffer with 30 bytes
  S/G[1]: buffer with 40 bytes
  S/G[2]: buffer with 100 bytes
  S/G[3]: buffer with 15 bytes
  
  The first 30 bytes will be written to buffer of S/G[0]
  then 40 bytes will be written to the buffer of S/G[1]
  then 30 bytes will be written to the buffer of S/G[2]
  
  I hope that my answer was clear.
  
  Thanks
  Dotan
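The 100-byte example in the reply above can be reproduced with a few lines of arithmetic. This is only a sketch of the fill order described there, not driver code; scatter_split() is a hypothetical helper that reports how many bytes land in each scatter/gather entry, or fails when the message is longer than the whole list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Split a msg_len-byte message over n scatter entries, filling them in
 * order. written[i] receives the byte count placed in entry i.
 * Returns 0 on success, -1 if the message doesn't fit in the list. */
int scatter_split(const uint32_t *sge_len, size_t n, uint64_t msg_len,
		  uint32_t *written)
{
	assert(sge_len != NULL && written != NULL);
	for (size_t i = 0; i < n; i++) {
		uint32_t take = (msg_len < sge_len[i]) ? (uint32_t)msg_len
						       : sge_len[i];

		written[i] = take;
		msg_len -= take;
	}
	return (msg_len == 0) ? 0 : -1;
}
```

With entry lengths of 30, 40, 100 and 15 bytes and a 100-byte message, written ends up as 30, 40, 30 and 0, matching the walk-through in the reply.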
  
vinit says: September 24, 2019

Hi,
I am using rdma_post_recv (ibv_post_recv) to post on a QP, however I sometimes get ENOMEM as the error code. My connection to the server consists of an RC QP with qp->cap.max_send_wr/recv_wr of 2048, and this QP has a dedicated CQ with 8192 entries. This CQ is polled by a dedicated thread as well. I tried increasing the above numbers to 8192 and 16384 respectively, but I still get ENOMEM sometimes. I looked into the kernel driver code and realized it might be a queue-full condition, but I'm not sure why I still get the error with such high numbers. Any pointer in this direction would be of great help. Thanks.

- Dotan Barak says: September 28, 2019
  
  Hi.
  
  I think that it isn't related to the CQ.
  I suspect that it is related to the receive requests in the QP,
  that sometimes the incoming messages consumed receive requests from that QP - and sometimes not.
  So, there is a race between the incoming messages and the code that posts the receive requests.
  
  This is a theory, since I didn't see your code...
  
  Thanks
  Dotan
  
Vinit says: September 30, 2019

Thanks, I’ll take a closer look again.

Norman says: November 18, 2020

Hi Dotan,

When I do an RDMA read of 10MB of data, I wonder what is the difference between using ibv_post_recv() and ibv_post_send() with the opcode of IBV_WR_RDMA_READ? Any performance trade-off? Any pros and cons?

Thanks,
Norman

- Dotan Barak says: November 21, 2020
  
  Hi Norman.
  
  In an RDMA Read, you don't need to call ibv_post_recv() at the responder side.
  (if you want, you can do it, but it just won't have any effect).
  
  Thanks
  Dotan
  