int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr,
                  struct ibv_recv_wr **bad_wr);


ibv_post_recv() posts a linked list of Work Requests (WRs) to the Receive Queue of a Queue Pair (QP).

ibv_post_recv() goes over all of the entries in the linked list, one by one, checks that each one is valid, generates a HW-specific Receive Request out of it and adds it to the tail of the QP's Receive Queue without performing any context switch. The RDMA device will take one of those Work Requests as soon as an incoming opcode to that QP consumes a Receive Request (RR). If one of the WRs fails because the Receive Queue is full or one of the attributes in the WR is bad, ibv_post_recv() stops immediately and returns a pointer to that WR in bad_wr.

A QP that isn't associated with an SRQ handles Work Requests in the Receive Queue according to the following rules:

  • If the QP is in RESET state, an immediate error should be returned. However, there may be some low-level drivers that don't follow this rule (to eliminate an extra check in the data path, thus providing better performance), and posting Receive Requests in this state may be silently ignored.
  • If the QP is in INIT state, Receive Requests can be posted, but they won't be processed.
  • If the QP is in RTR, RTS, SQD or SQE state, Receive Requests can be posted and they will be processed.
  • If the QP is in ERROR state, Receive Requests can be posted and they will be completed with error.

If the QP is associated with a Shared Receive Queue (SRQ), one must call ibv_post_srq_recv(), and not ibv_post_recv(), since the QP's own receive queue will not be used.

If an RR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming message, the content of those first 40 bytes is undefined. This means that in all cases, the actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the scatter list.
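The 40-byte rule above translates into a little buffer arithmetic. The following is a minimal sketch; the names UD_GRH_BYTES, ud_recv_buffer_size(), ud_payload_start() and ud_payload_length() are hypothetical helpers, not part of libibverbs. It also assumes that the byte count reported in the Work Completion of a UD receive includes the 40 reserved bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical constant: bytes reserved for the GRH at the start of a UD receive buffer */
#define UD_GRH_BYTES 40u

/* Size the receive buffer must have to hold a payload of max_payload bytes */
static size_t ud_recv_buffer_size(size_t max_payload)
{
	return max_payload + UD_GRH_BYTES;
}

/* Address at which the actual message data starts inside the receive buffer */
static void *ud_payload_start(void *recv_buf)
{
	return (uint8_t *)recv_buf + UD_GRH_BYTES;
}

/* Payload length, ASSUMING the reported byte count includes the 40 GRH bytes */
static size_t ud_payload_length(size_t wc_byte_len)
{
	return wc_byte_len > UD_GRH_BYTES ? wc_byte_len - UD_GRH_BYTES : 0;
}
```

For example, to receive messages of up to 2048 bytes, each Receive Request would need a buffer of ud_recv_buffer_size(2048) bytes.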

The struct ibv_recv_wr describes the Work Request to the Receive Queue of the QP, i.e. a Receive Request.

struct ibv_recv_wr {
	uint64_t		wr_id;
	struct ibv_recv_wr     *next;
	struct ibv_sge	       *sg_list;
	int			num_sge;
};

Here is the full description of struct ibv_recv_wr:

wr_id A 64-bit value associated with this WR. A Work Completion will be generated when this Work Request ends, and it will contain this value
next Pointer to the next WR in the linked list. NULL indicates that this is the last WR
sg_list Scatter/Gather array, as described in the table below. It specifies the buffers into which data will be written. The entries in the list can specify memory blocks that belong to different Memory Regions. The maximum message size that it can serve is the sum of the lengths of all the memory buffers in the scatter/gather list
num_sge Size of the sg_list array. This number can be less than or equal to the number of scatter/gather entries that the Queue Pair was created to support in the Receive Queue (qp_init_attr.cap.max_recv_sge). If this size is 0, this indicates that the message size is 0

struct ibv_sge describes a scatter/gather entry. The memory buffer that this entry describes must remain registered for as long as any posted Work Request that uses it is still considered outstanding. The order in which the RDMA device accesses the memory in a scatter/gather list isn't defined. This means that if some of the entries overlap the same memory address, the content of this address is undefined.

struct ibv_sge {
	uint64_t		addr;
	uint32_t		length;
	uint32_t		lkey;
};

Here is the full description of struct ibv_sge:

addr The address of the buffer to read from or write to
length The length of the buffer in bytes. The value 0 is a special value and is equal to 2^{31} bytes (and not zero bytes, as one might imagine)
lkey The Local key of the Memory Region that this memory buffer was registered with
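To make the "sum of all buffer lengths" rule concrete, here is a small sketch that computes the largest message a scatter list can hold. The helper names sge_effective_length() and recv_capacity() are hypothetical, and the handling of length 0 simply follows the special-value rule stated in the description of length above.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes that one entry contributes; a length of 0 is treated as 2^31 bytes,
 * per the special-value rule described for the length field */
static uint64_t sge_effective_length(uint32_t length)
{
	return length == 0 ? (UINT64_C(1) << 31) : (uint64_t)length;
}

/* Largest incoming message that a scatter list with these entry lengths can hold.
 * With num_sge == 0 the result is 0, i.e. only zero-length messages fit. */
static uint64_t recv_capacity(const uint32_t *lengths, size_t num_sge)
{
	uint64_t total = 0;

	for (size_t i = 0; i < num_sge; i++)
		total += sge_effective_length(lengths[i]);
	return total;
}
```

An application could compare this value with the largest message it expects before posting the Receive Request, since a message that does not fit ends with an error completion rather than being truncated silently.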

While a WR is considered outstanding, the content of its local memory buffers shouldn't be read, since one doesn't know when the RDMA device will stop writing new content to them.


Name Direction Description
qp in Queue Pair that was returned from ibv_create_qp()
wr in Linked list of Work Requests to be posted to the Receive Queue of the Queue Pair
bad_wr out A pointer that will be filled with the first Work Request whose processing failed

Return Values

Value Description
0 On success
errno On failure; no change is made to the QP, and bad_wr points to the RR that failed to be posted
EINVAL Invalid value provided in wr
ENOMEM Receive Queue is full or not enough resources to complete this operation
EFAULT Invalid value provided in qp


Posting an RR to a QP which isn't associated with an SRQ:

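struct ibv_sge sg;
struct ibv_recv_wr wr;
struct ibv_recv_wr *bad_wr;

/* Describe the registered buffer that will receive the message */
memset(&sg, 0, sizeof(sg));
sg.addr	  = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey	  = mr->lkey;

/* Zero the whole WR so that no field holds garbage */
memset(&wr, 0, sizeof(wr));
wr.wr_id      = 0;
wr.next       = NULL; /* last (and only) WR in the list */
wr.sg_list    = &sg;
wr.num_sge    = 1;

if (ibv_post_recv(qp, &wr, &bad_wr)) {
	fprintf(stderr, "Error, ibv_post_recv() failed\n");
	return -1;
}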
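Since ibv_post_recv() accepts a linked list, several RRs can be prepared and posted with one call. The sketch below only builds such a list: it carves one registered buffer into equal slots, gives each slot its own RR and terminates the list with NULL. The function build_recv_chain() and its parameters are hypothetical names. The fallback struct definitions exist only so that the sketch compiles on a machine without the libibverbs headers; they mirror the two structs shown above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__has_include)
#  if __has_include(<infiniband/verbs.h>)
#    include <infiniband/verbs.h>
#    define HAVE_VERBS_H 1
#  endif
#endif

#ifndef HAVE_VERBS_H
/* Stand-ins used only when the real header is not installed */
struct ibv_sge { uint64_t addr; uint32_t length; uint32_t lkey; };
struct ibv_recv_wr {
	uint64_t wr_id;
	struct ibv_recv_wr *next;
	struct ibv_sge *sg_list;
	int num_sge;
};
#endif

/* Fill n Receive Requests, one per slot of slot_size bytes inside a single
 * registered buffer starting at base, and link them into one list.
 * Returns the head of the list, or NULL if n is 0. */
static struct ibv_recv_wr *build_recv_chain(struct ibv_recv_wr *wrs,
					    struct ibv_sge *sges, size_t n,
					    void *base, uint32_t slot_size,
					    uint32_t lkey)
{
	if (n == 0)
		return NULL;

	/* Zero everything first so that no field is left uninitialized */
	memset(wrs, 0, n * sizeof(*wrs));
	memset(sges, 0, n * sizeof(*sges));

	for (size_t i = 0; i < n; i++) {
		sges[i].addr   = (uint64_t)(uintptr_t)base + (uint64_t)i * slot_size;
		sges[i].length = slot_size;
		sges[i].lkey   = lkey;

		wrs[i].wr_id   = i; /* slot index, returned in the Work Completion */
		wrs[i].sg_list = &sges[i];
		wrs[i].num_sge = 1;
		/* The last WR must end the list with an explicit NULL */
		wrs[i].next    = (i + 1 < n) ? &wrs[i + 1] : NULL;
	}
	return &wrs[0];
}
```

The returned head would then be passed as the wr argument of ibv_post_recv(), with slot_size and lkey taken from the application's own registered Memory Region.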


Does ibv_post_recv() cause a context switch?

No. Posting an RR doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

How many WRs can I post?

There is a limit to the maximum number of outstanding WRs for a QP. This value was specified when the QP was created (qp_init_attr.cap.max_recv_wr for the Receive Queue).

Can I know how many WRs are outstanding in a Work Queue?

No, you can't. You should keep track of the number of outstanding WRs according to the number of posted WRs and the number of Work Completions that you polled.
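One way to do this bookkeeping is a small counter that is updated on every post and every polled completion. This is only a sketch: struct rq_tracker and its helper functions are hypothetical names, and capacity is assumed to be the Receive Queue size that the QP was created with.

```c
#include <assert.h>

/* Application-side bookkeeping for one Receive Queue */
struct rq_tracker {
	unsigned capacity;    /* queue size chosen when the QP was created */
	unsigned outstanding; /* posted RRs that have not completed yet */
};

/* Number of RRs that can still be posted without overflowing the queue */
static unsigned rq_free_slots(const struct rq_tracker *t)
{
	return t->capacity - t->outstanding;
}

/* Call before posting n RRs; returns 0 if they fit, -1 if the queue would be full */
static int rq_on_post(struct rq_tracker *t, unsigned n)
{
	if (n > rq_free_slots(t))
		return -1;
	t->outstanding += n;
	return 0;
}

/* Call after polling n Work Completions that belong to this Receive Queue */
static void rq_on_completions(struct rq_tracker *t, unsigned n)
{
	assert(n <= t->outstanding);
	t->outstanding -= n;
}
```

If several threads post to the same queue, the counter would also need a lock or atomic updates, which this sketch leaves out.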

Which operations will consume RRs?

If the remote side posts a Send Request with one of the following opcodes, an RR will be consumed:

  • Send
  • Send with Immediate
  • RDMA Write with immediate

What will happen if I deregister an MR that is used by an outstanding WR?

When processing a WR, if one of the MRs that are specified in the WR isn't valid, a Work Completion with error will be generated.

I called ibv_post_recv() and I got a segmentation fault, what happened?

There may be several reasons for this to happen:
1) The sg_list pointer of at least one of the RRs holds an invalid address
2) The value of next points to an invalid address
3) An error occurred in one of the posted RRs (bad value in the RR or full Work Queue) and the variable bad_wr is NULL

Help, I've posted a Receive Request and it wasn't completed with a corresponding Work Completion. What happened?

In order to debug this kind of problem, one should do the following:

  • Verify that a Send Request was actually posted in the remote QP
  • Verify that a Receive Request was actually posted in the local QP
  • Wait enough time, maybe a Work Completion will eventually be generated
  • Verify that the logical port state of the RDMA device is IBV_PORT_ACTIVE
  • Verify that the QP state is in one of the following states: RTR, RTS, SQD, SQE, ERROR

I had code that works with UC or RC QPs and I added support for UD QPs, but I keep getting Work Completions with error. What happened?

For a UD QP, an extra 40 bytes should be added to the RR buffers (to allow saving the GRH, if one exists in the message).

Can I (re)use the Receive Request after ibv_post_recv() returned?

Yes. This verb translates the Receive Request from the libibverbs abstraction to a HW-specific Receive Request and you can (re)use both the Receive Request and the s/g list within it.


Tell us what do you think.

  1. the hammer says: February 21, 2013

    Your webpage does not render correctly on my blackberry - you might wanna try and repair that

    • Dotan Barak says: February 21, 2013


      I must admit that I didn't think about smartphone users (until today).
      Starting today, my blog is smartphone friendly.


  2. Igor R. says: February 11, 2014

    Hi Dotan,

    Is it possible to increase the "level of debugability" of ibverbs? In particular, to have some debug symbols and/or verbose tracing messages from libibverbs?
    I'm trying to debug a segfault in ibv_post_recv (bad_wr is ok, and there's no obvious reason for segafult; perhaps some memory corruption prior to this call, but it would be helpful to see where it crashes exactly).

    • Dotan Barak says: February 11, 2014

      Hi Igor.

      I'm very sorry, but there isn't any built-in capability for debugging in libibverbs
      (or in most of low-level drivers implementations in the user level).

      If you'll send me the part of your code that calls the ibv_post_recv() function, I might give you a tip where the potential bug is.


      • Igor R. says: February 11, 2014

        Unfortunately, it's quite a lot of code, and I've not managed yet to produce an SSCCE.
        The error is inconsistent (as it usually happens with memory-management issues in async systems), but the symptom is that sometimes it's segfault in ib_post_recv, and sometimes it's ib_post_send that returns EINVAL. Really looks like bad_wr == 0, but it's not.
        Just to be sure: the expected lifespan of bad_wr is just during ib_post_xxx call (i.e., after the function returns, this pointer won't be used anymore), right? And the expected lifespan of the memory region and its underlying buffer is until the appropriate WC gets arrived, isn't it?

      • Igor R. says: February 11, 2014

        It happened due to the lack of total zeroing of ibv_send_wr/ibv_recv_wr prior to their use (despite the proper initialization of "sensible" fields).

      • Dotan Barak says: February 11, 2014

        Hi Igor.

        I'm happy that this issue was solved.
        I would suggest to zero the structure fields one by one to check which one caused the failure,
        since maybe the memset() that you use hides a bug.
        (If you will be able to share the field name that caused the problem, it will be nice).

        I *think* that the send_flags attributes is the problem (and the problem was that the inline bit indicator was set).



  3. Igor R. says: February 11, 2014

    Not send_flags, much worse: "next" pointer.

    • Dotan Barak says: February 11, 2014

      Yes, invalid address in the 'next' pointer may be a problem..
      One should set this pointer to NULL explicitly in the last SR.

      I'll update the relevant posts to reflect this, so next programmer(s) won't fall into this hole too...


      • Igor R. says: February 11, 2014

        BTW, what's the difference between posting a chain of WRs vs. calling ibv_post_xxx several times? Just a function call?

      • Dotan Barak says: February 11, 2014

        Posting a chained WRs vs. calling ibv_post_XXX() several times can save not only the function call itself, it allows the low-level driver to perform optimizations that may result a better performance. For example: adding all the WRs to the queue and only then notify the RDMA device about them.


      • Igor R. says: February 11, 2014

        Great, thanks!

  4. ganesh.irrinki says: April 11, 2014

    Hi Dotan,
    I'm new to RDMA concepts. can you please tell me what are "chained WRs posted on an SQ of a QP". i think WRs are individual to each other how can be they chained together?


    • Dotan Barak says: April 11, 2014


      A "chained WRs posted on an SQ of a QP" is actually a linked list of Send Requests.
      As you (correctly) said, every Send Request by itself is individual (and independent),
      but posting them together allow to perform some optimization when compared to posting them
      one by one.

      I hope that now it is clear.


      • ganesh.irrinki says: April 11, 2014

        Hi Dotan,

        So,if suppose in a SQ/RQ has the following WRs respectively


        then above multiple reads & multiple writes can be done at a time will give better i right?
        And we can call this process as Chained WRs posted on SQ/RQ of a QP. right?

        Thanks for your advise & quick reply....

      • Dotan Barak says: April 11, 2014

        In general: Yes

        They won't be processed at the *same* time, but this will perform better performance (for example: you'll consume less CPU cycles and may get better message rate) when posted at one linked list in compared to posting them one by one..

        BTW, if you perform RDMA Read followed by RDMA Write, Send or Atomic operation you may need to use Fence
        (if you access the same addresses).


      • ganesh.irrinki says: April 13, 2014

        Thank you very much Dotan......thanks a lot.

      • ganesh.irrinki says: April 14, 2014

        Hi Dotan,
        I faced problem with ib_reg_phys_mr() ,can you please help me?
        I have mellanox 10G card having mlx4_en module on ScientificLinux6u3,kernel is 2.6.32-279.22.1.el6.x86_64.

        Look at once The following code :

        dbgmr->mr_handle = ib_reg_phys_mr(pd, phys_buff,
        dbgmr->num_dma_addr, acc_flag,
        printk("dbgmr->mr_handle = 0x%llx\n",dbgmr->mr_handle);
        if (IS_ERR(dbgmr->mr_handle)) {

        Problem: the above code printk() prints dbgmr->mr_handle = 0xffffffffffffffda

        But the cusor goes to the if(IS_ERR(dbgmr->mr_handle)){......} block and exit the function. Which is not desired.How can i avoid this problem?


      • Dotan Barak says: April 14, 2014

        Hi I.Ganesh.

        I assume that the value 0xffffffffffffffda is an kernel encoding
        (using ERR_PTR) of the value -ENOSYS.

        Could it be that the device that you are using doesn't support this verb?
        (or something is problematic in your RDMA stack...)

        If you are using a Mellanox Ethernet device, you should check if RoCE is enabled on this port/device.
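The decoding mentioned in this reply can be checked with plain arithmetic. The sketch below is a user-space illustration, not kernel code: err_ptr_to_errno() is a hypothetical helper, and the bound of 4095 is the range that the Linux kernel reserves for error values encoded in pointers.

```c
#include <stdint.h>

/* Largest errno value that the kernel encodes inside a pointer */
#define MAX_ERRNO 4095

/* Returns the negative errno encoded in a pointer value,
 * or 0 if the value is outside the error range (i.e. a regular pointer) */
static long err_ptr_to_errno(uint64_t ptr_value)
{
	if (ptr_value < (uint64_t)0 - MAX_ERRNO)
		return 0;
	/* Two's complement negation gives the magnitude of the error code */
	return -(long)(~ptr_value + 1);
}
```

Applied to 0xffffffffffffffda this yields -38, which is -ENOSYS on Linux.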



  5. ganesh.irrinki says: April 14, 2014

    yes Dotan, you're Absolutely Correct. 0xffffffffffffffda is an kernel encoding
    (using ERR_PTR) of the value -ENOSYS(-38). And i have heard that Mellanox doesn't implement the function ib_reg_phys_mr(). But i need to do it now. How can i implement that memory registration through any another option on this Mellanox card.Can you please help me?
    Thanks a lot Dotan, for giving me knowledge....


    • Dotan Barak says: April 15, 2014

      Hi Ganesh.

      You can find in the Linux kernel examples on how to register memory in the kernel:

      For example, one function that you can use is ib_get_dma_mr().

      I hope that I gave you a hint on how to continue.


      • ganesh.irrinki says: April 15, 2014

        Thanks Dotan, I will see it. thank you very much for helping me....

      • ganesh.irrinki says: April 23, 2014

        Hi Dotan,
        I wanted to create chained WRs, last time you told me that chained WRs are nothing but linked lists of WRs which can be posted through either ibv_post_send() or ibv_post_recv(). we know that each WR contains memory buffer details which are already crated previously,but what i struggled was if there are 100 buffers having each differeent QP(100 QPs), do we need to maintain & check each WR details like QP_num,QP_type[RC,UC,UD],sg_list,sg_length,sg_lkey,opcode,imm_data,.....etc. I don't get any idea to create the linked list of WRs.....can you please help me. if possible send me some code regarding this.

        Thanks & Regards,

      • Dotan Barak says: April 23, 2014

        Hi Ganesh.

        I must admit that I didn't fully understand your problem.

        However, you should be able to know which buffers have outstanding Work Request
        (to prevent access to them by your application before the RDMA device stopped working with them).
        To know which WR was posted to any QP is needed only if you need it in your application
        (if for example, in the incoming message your application keeps the origin ID, maybe this isn't required).

        As a general rule of thumb you don't need to keep the whole Work Request that was posted, you need to keep track only to the important information..

        I hope that this helped you


      • ganesh.irrinki says: April 24, 2014

        Hi Dotan,
        Thanks for your explanation. suppose i have 3 buffers, let me assume mr1,mr2,mr3. I need to create chained wr for those buffers. so i need to maintain those mr's details as wr, so finally i have 3 wr's(assume all are outstanding wr's). now i need to create a linked list for those 3 wr's and finally post the list with ibv_post_send() or ibv_post_recv(). can you please give me a programmatic overview for the above requirement?

        Thanks & Regards,

      • Dotan Barak says: April 24, 2014

        Hi Ganesh.

        You have 2 options:
        1) Have one Work Request with one S/G entry to each MR
        2) Have several Work Requests each of them will point to a different MR

        which of the above to use is a decision that you need to take.

        But anyway, I fail to understand what is the question that you need me to answer to.

      • ganesh.irrinki says: April 25, 2014

        Hi Dotan,
        Thanks for your explanation. Let me try first and will consult you if i will face any problems. Thanks for your Blog also....


      • Dotan Barak says: April 25, 2014



  6. Henry Fu says: August 22, 2014

    Hi Dotan,

    Thanks for the answer to my previous question which helps a lot.

    I have another problem regarding on the Send/Recv. I was trying to build a constant communication using Send/Recv but it did not work for some reason. Here is a brief description of my approach: after the connection is established, on the responder's side, I have a while loop which post a RR and wait to get_cq_event, poll the cq, return and continue the loop. On the sender's side, it is almost the same except that a send request is posted instead of RR. However, the test result is that on the responder's side, it stalls at get_cq_event. On the sender's side, it keeps telling me that the WC status is not successful (instead it is 13-IBV_WC_RNR_RETRY_EXC_ERR and 5-IBV_WC_WR_FLUSH_ERR). Can you see any flaws of this general approach? If not, I may instead have some minor issues in the code to figure out.

    Many thanks!


    • Dotan Barak says: August 24, 2014

      Hi Henry.

      The protocol you've just describes looks fine. I have a feeling that something went wrong in the implementation ..

      The problem is that the sender side sent a message but there wasn't any Receive Request ready in the receiver side. This is the reason for the first bad completion (IBV_WC_RNR_RETRY_EXC_ERR). The rest of the bad completions (IBV_WC_WR_FLUSH_ERR) means that the Work Queue is in error state.


      • Henry Fu says: August 25, 2014

        Hi Dotan,

        Thanks for the answer. It seems that if I put some time interval (usleep(100) for example) before ibv_post_send on the sender's side, it'll work. Although it works fine now, I'm not quite understand the reasoning behind it. Does it indicate that too many ibv_post_send will clog the sender's cq or overflow the remote cq? But I have a ibv_get_cq_event after each ibv_post_send or ibv_post_recv, it should block until the send/recv is finished, right?



      • Dotan Barak says: August 26, 2014

        I fully agree, adding sleep is problematic as a constant solution...

        I would expect to get different completion than you mentioned in case of a CQ overrun, so this is weird.

        The sleep solved the problem, since it made the receiver side to be faster than the sender side, and gave him a chance to post a Receive Request to the Receive Queue - and by this prevent the RNR error.


  7. Govind Patidar says: December 15, 2014

    I want to post some thousands(~50000) of work request in advance but the value of max_qp_wr = 16384, so can you suggest some alternate way by which max_qp_wr can be increased (by changing device etc...).

    • Dotan Barak says: December 15, 2014


      This is a device capability and cannot be changed.

      The question is: do you really need so many Work Requests?
      * Maybe you can unite some of the messages in one Work Request?
      * Maybe you can use multiple QPs to send those Work Requests

      Just throwing some ideas here ..


  8. neuralcn says: April 27, 2015

    One client only send, one server only receive, how to avoid RNR error?

    • Dotan Barak says: April 27, 2015


      In order to prevent getting into RNR, one needs to always make sure that there are enough Receive Requests in the Receive Queue.

      • Dotan Barak says: May 7, 2015


        The comments are moderated; this is the reason that you didn't see it
        (until I approved it).

        RNR error means that the clients send messages to the server, but there aren't enough Receive Requests in the Server side.

        There are few options to deal with this:
        * Increase the RNR timeout / RNR retry count
        * Increase the number of Receive Requests in the server side
        * Use Shared Receive Queue (SRQ) in the server side instead of a QP
        * Maybe use multiple SRQs in the server side, if there server can't post Receive Requests to one SRQ fast enough
        * Implement flow control by your application (to prevent a case which messages are sent without a Receive Requests in the server side).

        I hope that this helped you
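The last option in the list above, application-level flow control, is often done with credits. The following is a minimal sender-side sketch under these assumptions: the receiver tells the sender how many Receive Requests it pre-posted, and later reports how many it has reposted (how that report travels is up to the application). struct send_credits and its functions are hypothetical names.

```c
/* Sender-side view of how many Receive Requests are waiting at the receiver */
struct send_credits {
	unsigned available;
};

/* Start with the number of RRs that the receiver pre-posted */
static void credits_init(struct send_credits *c, unsigned preposted_rrs)
{
	c->available = preposted_rrs;
}

/* Call before each Send; returns 1 if sending is safe,
 * 0 if the sender must wait for the receiver to return credits */
static int credits_take(struct send_credits *c)
{
	if (c->available == 0)
		return 0;
	c->available--;
	return 1;
}

/* Call when the receiver reports that it reposted n Receive Requests */
static void credits_return(struct send_credits *c, unsigned n)
{
	c->available += n;
}
```

As long as the sender posts a Send only after credits_take() returns 1, a message should not arrive without a Receive Request waiting for it.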


  9. Mavis says: June 27, 2015

    Hi Dotan,

    I have a question about the order of the ibv_post_recv(). The work request should always be in order for consumption, or they can be in different order? For example, I have several thread post work request to SQ/RQ so the order of WR in SQ may be different from that in RQ.

    For example
    SQ RQ
    will fail, right?

    The problem is more severe using SRQ? Does it mean I can only use one thread to post WR to make sure the order of SQ/RQ is the same? Thank you so much!


    • Dotan Barak says: June 29, 2015

      Hi Mavis.

      The order of the Receive Request consumptions in a Receive Queue is by the order that they were posted to it.
      When you have a SRQ, you cannot predict which Receive Request will be consumed by which QP,
      so all the Receive Requests in that SRQ should be able to contain the incoming message (in terms of length).


  10. Mavis says: June 27, 2015

    Hi Dotan,
    The order of WR in SQ/RQ should always be the same, right? If there are multiple threads posting WRs to SQ/RQ, it seems there is no guarantee on the order. So should I use one thread to post WRs?

    • Dotan Barak says: June 29, 2015

      Hi Mavis.

      The order of processing a Work Request is guaranteed per Work Queue according to the order the Work Requests were added to it.

      Using threads just make it harder to predict the order.

      I don't know what you are trying to do, but if the remote RQ can hold any message that you send,
      you can continue working with threads.


  11. Tingyu says: July 15, 2015

    Hi Dotan,

    Is it possible to register a large receive buffer once, but post it many times at the same time, and at each post setting "sg.addr" to a different region of this large buffer?


    • Dotan Barak says: July 16, 2015


      Memory Region can be registered once and worked with many times, access any address within this region.


  12. Junhyun says: September 27, 2016

    Hi Dotan,

    Should sge.length used in ibv_post_recv WR be exactly the size of the incoming payload or can it be larger? What happens if the buffer is smaller? would it trigger a Segmentation Fault?

    • Dotan Barak says: October 1, 2016


      The buffer size of a receive request should be at least the size of the incoming message.
      If the supplied buffer is smaller, there will a Work Completion with error
      (and not a segmentation fault, which is always a program bug)


  13. Junhyun says: September 27, 2016

    Also, is it possible to know a priori the size of the next incoming payload before the MR I provided for recv is filled?

    • Dotan Barak says: October 1, 2016


      The answer is no, exchanging the supported message sizes is something that the application must handle
      (the maximum message size is 2 GB)


  14. Junhyun says: December 14, 2016

    Hi, Dotan
    If I'm sure the opcodes of incoming SRs will be RDMA_WRITE_WITH_IMMs exclusively,
    is it okay to post recv requests with sge {addr: nullptr, length: 0}?

    • Dotan Barak says: February 10, 2017


      If you are sure that the opcode will be RDMA Write with immediate,
      the S/G list can be empty, and the Receive Request will be used only to provide the immediate data
      in the Work Completion.


  15. Srinivas says: January 1, 2017

    Hi Dotan,

    If we post a work request with IBV_SEND_INLINE, does it apply to only send, or would the receive work request also be inline? I mean to say should we still post receive buffer to receive data that is sent through inline send, or whether it will be part of the work request itself, without posting any receive buffer?

    • Dotan Barak says: February 10, 2017


      IBV_SEND_INLINE is relevant *only* to the Send side.

      The inline indicator explain to the HW how the message buffer will be provided to it
      (the HW will perform DMA read, or it will be contained in the Send Request descriptor).

      Over the wire, one can not know if it was "inlined" or not.

      The receive side should behave the same for inline/non-inline buffer messages
      (since it just don't know this information and it isn't relevant to it..)


  16. wangt0907 says: August 1, 2017

    Hi, Dotan,
    I want to use SEND&RECV verbs to implement a RPC subsystem. The server's recv queue should have enough recv wrs for fear that clients' send wrs fail. But the send wrs has various lengths. I wonder how to determine the address and length of buffers when posting recv wrs, so that every send can be well received and keep the use of buffer efficient. Thank you!
    My English is not very well, I'm sorry if there is something unclear...

    • Dotan Barak says: August 1, 2017


      The Receive Requests are fetched according the order they exist in the Receive Queue and the incoming message
      (Receive Request N buffers will be filled with the incoming N message).

      If you plan to have several length messages,
      maybe you should have several QPs for efficient work, each one will serve different message sizes range.

      For example:
      * QPX messages: 1-2KB
      * QPY messages: 2-16KB
      * Etc.

      I hope that I answered your question.

  17. RAPHAEL says: April 6, 2018

    any chance to get rid off those 40 extra bytes for UD/QP ? is there a workaround when you want concatenate incoming data ?


    • Dotan Barak says: April 19, 2018


      No. There isn't any way to ignore the extra 40 bytes of the GRH.
      What you can do is to write the GRH to specific location in all the Receive Requests,
      and concatenate only the incoming data.
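The suggestion in this reply can be sketched with a two-entry scatter list: the first entry is exactly 40 bytes long and points to a scratch area that absorbs the GRH, and the second entry points to where the payload should land. The function fill_ud_split_sges() and its parameters are hypothetical names. The fallback struct definition exists only so that the sketch compiles without the libibverbs headers.

```c
#include <stdint.h>

#if defined(__has_include)
#  if __has_include(<infiniband/verbs.h>)
#    include <infiniband/verbs.h>
#    define HAVE_VERBS_H 1
#  endif
#endif

#ifndef HAVE_VERBS_H
/* Stand-in used only when the real header is not installed */
struct ibv_sge { uint64_t addr; uint32_t length; uint32_t lkey; };
#endif

/* Bytes reserved for the GRH at the start of every UD receive */
#define UD_GRH_BYTES 40u

/* Entry 0 takes the 40 GRH bytes, entry 1 takes the payload,
 * so consecutive payloads can be placed back to back in memory */
static void fill_ud_split_sges(struct ibv_sge sges[2],
			       void *grh_scratch, uint32_t grh_lkey,
			       void *payload_dst, uint32_t payload_len,
			       uint32_t payload_lkey)
{
	sges[0].addr   = (uint64_t)(uintptr_t)grh_scratch;
	sges[0].length = UD_GRH_BYTES;
	sges[0].lkey   = grh_lkey;

	sges[1].addr   = (uint64_t)(uintptr_t)payload_dst;
	sges[1].length = payload_len;
	sges[1].lkey   = payload_lkey;
}
```

The Receive Request would then use sg_list = sges and num_sge = 2, which requires that the QP was created with max_recv_sge of at least 2.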

