

int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);


ibv_poll_cq() polls Work Completions from a Completion Queue (CQ).

A Work Completion indicates that a Work Request in a Work Queue associated with the CQ, and all of the outstanding unsignaled Work Requests that were posted to that Work Queue before it, are done. Any Receive Request, signaled Send Request, or Send Request that ended with an error will generate a Work Completion when its processing ends.

When a Work Request ends, a Work Completion is added to the tail of the CQ that its Work Queue is associated with. ibv_poll_cq() checks whether Work Completions are present in a CQ and pops them from the head of the CQ in the order they entered it (FIFO). Once a Work Completion has been popped from a CQ, it cannot be returned to it.

One should consume Work Completions at a rate that prevents the CQ from being overrun (holding more Work Completions than the CQ size). In case of a CQ overrun, the async event IBV_EVENT_CQ_ERR will be triggered, and the CQ cannot be used anymore.
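As a rough sketch of this sizing rule: in the worst case, every outstanding Work Request on every Work Queue that uses the CQ may complete before the application polls, so the CQ must be at least that deep. The helper below is hypothetical (not part of libibverbs) and just makes the arithmetic explicit:

```c
/* Hypothetical helper: worst-case number of Work Completions a CQ may
 * have to hold. Every outstanding Work Request on every Work Queue
 * that uses this CQ may complete before the application polls, so the
 * CQ depth must be at least the sum of all those queue depths. */
static int min_cq_depth(int num_qps, int send_depth, int recv_depth)
{
	return num_qps * (send_depth + recv_depth);
}
```

For example, 4 QPs with Send and Receive Queues of depth 16 each need a CQ of at least 128 entries if a single CQ serves all of their Work Queues.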

The struct ibv_wc describes the Work Completion attributes.

struct ibv_wc {
	uint64_t		wr_id;
	enum ibv_wc_status	status;
	enum ibv_wc_opcode	opcode;
	uint32_t		vendor_err;
	uint32_t		byte_len;
	uint32_t		imm_data;
	uint32_t		qp_num;
	uint32_t		src_qp;
	int			wc_flags;
	uint16_t		pkey_index;
	uint16_t		slid;
	uint8_t			sl;
	uint8_t			dlid_path_bits;
};

Here is the full description of struct ibv_wc:

wr_id The 64-bit value that was associated with the corresponding Work Request
status Status of the operation. The value can be one of the following enumerated values (with their numeric value):

  • IBV_WC_SUCCESS (0) - Operation completed successfully: this means that the corresponding Work Request (and all of the unsignaled Work Requests that were posted previous to it) ended and the memory buffers that this Work Request refers to are ready to be (re)used.
  • IBV_WC_LOC_LEN_ERR (1) - Local Length Error: this happens if a Work Request that was posted in a local Send Queue contains a message that is greater than the maximum message size supported by the RDMA device port that should send it, or if an Atomic operation whose size is different than 8 bytes was sent. It may also happen if a Work Request that was posted in a local Receive Queue isn't big enough to hold the incoming message, or if the incoming message is greater than the maximum message size supported by the RDMA device port that received it.
  • IBV_WC_LOC_QP_OP_ERR (2) - Local QP Operation Error: an internal QP consistency error was detected while processing this Work Request. This happens if a Work Request that was posted in a local Send Queue of a UD QP contains an Address Handle that is associated with a different Protection Domain than the QP's, or if the Work Request contains an opcode which isn't supported by the transport type of the QP (for example: RDMA Write over a UD QP).
  • IBV_WC_LOC_EEC_OP_ERR (3) - Local EE Context Operation Error: an internal EE Context consistency error was detected while processing this Work Request (unused, since it's relevant only to RD QPs and EE Contexts, which aren't supported).
  • IBV_WC_LOC_PROT_ERR (4) - Local Protection Error: the locally posted Work Request's buffers in the scatter/gather list do not reference a Memory Region that is valid for the requested operation.
  • IBV_WC_WR_FLUSH_ERR (5) - Work Request Flushed Error: A Work Request was in process or outstanding when the QP transitioned into the Error State.
  • IBV_WC_MW_BIND_ERR (6) - Memory Window Binding Error: A failure happened when trying to bind a MW to a MR.
  • IBV_WC_BAD_RESP_ERR (7) - Bad Response Error: an unexpected transport layer opcode was returned by the responder. Relevant for RC QPs.
  • IBV_WC_LOC_ACCESS_ERR (8) - Local Access Error: a protection error occurred on a local data buffer during the processing of a RDMA Write with Immediate operation sent from the remote node. Relevant for RC QPs.
  • IBV_WC_REM_INV_REQ_ERR (9) - Remote Invalid Request Error: The responder detected an invalid message on the channel. Possible causes include: the operation is not supported by this receive queue (qp_access_flags in the remote QP wasn't configured to support this operation), insufficient buffering to receive a new RDMA or Atomic Operation request, or the length specified in an RDMA request is greater than 2^{31} bytes. Relevant for RC QPs.
  • IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error: a protection error occurred on a remote data buffer to be read by an RDMA Read, written by an RDMA Write or accessed by an atomic operation. This error is reported only on RDMA operations or atomic operations. Relevant for RC QPs.
  • IBV_WC_REM_OP_ERR (11) - Remote Operation Error: the operation could not be completed successfully by the responder. Possible causes include a responder QP related error that prevented the responder from completing the request or a malformed WQE on the Receive Queue. Relevant for RC QPs.
  • IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: The local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, it usually means that the connection attributes are wrong or the remote side isn't in a state in which it can respond to messages. If this happens after sending the first message, it usually means that the remote QP isn't available anymore. Relevant for RC QPs.
  • IBV_WC_RNR_RETRY_EXC_ERR (13) - RNR Retry Counter Exceeded: The RNR NAK retry count was exceeded. This usually means that the remote side didn't post any WR to its Receive Queue. Relevant for RC QPs.
  • IBV_WC_LOC_RDD_VIOL_ERR (14) - Local RDD Violation Error: The RDD associated with the QP does not match the RDD associated with the EE Context (unused, since it's relevant only to RD QPs and EE Contexts, which aren't supported).
  • IBV_WC_REM_INV_RD_REQ_ERR (15) - Remote Invalid RD Request: The responder detected an invalid incoming RD message. Causes include a Q_Key or RDD violation (unused, since it's relevant only to RD QPs and EE Contexts, which aren't supported).
  • IBV_WC_REM_ABORT_ERR (16) - Remote Aborted Error: For UD or UC QPs associated with a SRQ, the responder aborted the operation.
  • IBV_WC_INV_EECN_ERR (17) - Invalid EE Context Number: An invalid EE Context number was detected (unused, since it's relevant only to RD QPs and EE Contexts, which aren't supported).
  • IBV_WC_INV_EEC_STATE_ERR (18) - Invalid EE Context State Error: Operation is not legal for the specified EE Context state (unused, since it's relevant only to RD QPs and EE Contexts, which aren't supported).
  • IBV_WC_FATAL_ERR (19) - Fatal Error.
  • IBV_WC_RESP_TIMEOUT_ERR (20) - Response Timeout Error.
  • IBV_WC_GENERAL_ERR (21) - General Error: an error which isn't one of the above errors.
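libibverbs provides ibv_wc_status_str() to turn these values into readable strings. As a minimal stand-alone sketch of what such a lookup does, built only from the numeric values listed above (this is not the real library function; use ibv_wc_status_str() in real code):

```c
#include <string.h>

/* Minimal local lookup based on the numeric values listed above.
 * Real code should call ibv_wc_status_str() from libibverbs instead. */
static const char *wc_status_name(int status)
{
	static const char *names[] = {
		"IBV_WC_SUCCESS",           "IBV_WC_LOC_LEN_ERR",
		"IBV_WC_LOC_QP_OP_ERR",     "IBV_WC_LOC_EEC_OP_ERR",
		"IBV_WC_LOC_PROT_ERR",      "IBV_WC_WR_FLUSH_ERR",
		"IBV_WC_MW_BIND_ERR",       "IBV_WC_BAD_RESP_ERR",
		"IBV_WC_LOC_ACCESS_ERR",    "IBV_WC_REM_INV_REQ_ERR",
		"IBV_WC_REM_ACCESS_ERR",    "IBV_WC_REM_OP_ERR",
		"IBV_WC_RETRY_EXC_ERR",     "IBV_WC_RNR_RETRY_EXC_ERR",
		"IBV_WC_LOC_RDD_VIOL_ERR",  "IBV_WC_REM_INV_RD_REQ_ERR",
		"IBV_WC_REM_ABORT_ERR",     "IBV_WC_INV_EECN_ERR",
		"IBV_WC_INV_EEC_STATE_ERR", "IBV_WC_FATAL_ERR",
		"IBV_WC_RESP_TIMEOUT_ERR",  "IBV_WC_GENERAL_ERR"
	};

	if (status < 0 || status > 21)
		return "unknown";
	return names[status];
}
```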
opcode The operation that the corresponding Work Request performed. This value determines how the data was sent, the direction of the data flow, and the valid attributes in the Work Completion. The value can be one of the following enumerated values:

  • IBV_WC_SEND - Send operation for a WR that was posted to the Send Queue
  • IBV_WC_RDMA_WRITE - RDMA Write operation for a WR that was posted to the Send Queue
  • IBV_WC_RDMA_READ - RDMA Read operation for a WR that was posted to the Send Queue
  • IBV_WC_COMP_SWAP - Compare and Swap operation for a WR that was posted to the Send Queue
  • IBV_WC_FETCH_ADD - Fetch and Add operation for a WR that was posted to the Send Queue
  • IBV_WC_BIND_MW - Memory Window bind operation for a WR that was posted to the Send Queue
  • IBV_WC_RECV - Send data operation for a WR that was posted to a Receive Queue (of a QP or to an SRQ)
  • IBV_WC_RECV_RDMA_WITH_IMM - RDMA with immediate for a WR that was posted to a Receive Queue (of a QP or to an SRQ). For this opcode, only a Receive Request was consumed and the sg_list of this RR wasn't used
vendor_err Vendor-specific error. If the Work Completion ended with an error, this value provides a hint to the RDMA device's vendor about the reason for the failure.
byte_len The number of bytes transferred. Relevant in the Receive Queue for incoming Send or RDMA Write with immediate operations. This value doesn't include the length of the immediate data, if such exists. Relevant in the Send Queue for RDMA Read and Atomic operations.

For the Receive Queue of a UD QP that is not associated with an SRQ, or for an SRQ that is associated with a UD QP, this value equals the message payload plus the 40 bytes reserved for the GRH, whether or not the GRH is present.

imm_data (optional) A 32-bit value, in network byte order, that is sent along with the payload of a Send or RDMA Write with immediate opcode to the remote side and placed in a Receive Work Completion, not in a remote memory buffer. This value is valid only if the IBV_WC_WITH_IMM flag is set in wc_flags.
qp_num Local QP number of the completed WR. Relevant for Receive Work Completions that are associated with an SRQ
src_qp Source QP number (remote QP number) of the completed WR. Relevant for Receive Work Completions of a UD QP
wc_flags Flags of the Work Completion. It is either 0 or the bitwise OR of one or more of the following flags:

  • IBV_WC_GRH - Indicator that a GRH is present for a Receive Work Completion of a UD QP. If this bit is set, the first 40 bytes of the buffers that were referred to in the Receive Request will contain the GRH of the incoming message. If this bit is cleared, the content of those first 40 bytes is undefined
  • IBV_WC_WITH_IMM - Indicator that imm_data is valid. Relevant for Receive Work Completions
pkey_index P_Key index. Relevant for GSI QPs
slid Source LID (the base LID that this message was sent from). Relevant for Receive Work Completions of a UD QP
sl Service Level (the SL that this message was sent with). Relevant for Receive Work Completions of a UD QP
dlid_path_bits Destination LID path bits. Relevant for Receive Work Completions of a UD QP (not applicable for multicast messages)

The test (opcode & IBV_WC_RECV) indicates whether a completion is of a Work Request that was posted to the Receive Queue.
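This test works because, in <infiniband/verbs.h>, IBV_WC_RECV is defined as a bit flag (1 << 7) that is set in both receive opcodes. The sketch below mirrors those definitions locally (prefixed MY_ so it can run stand-alone; in real code, use the enum values from the header):

```c
/* Mirrors of the relevant <infiniband/verbs.h> opcode values, so this
 * sketch is self-contained: IBV_WC_RECV is the bit flag 1 << 7, and
 * IBV_WC_RECV_RDMA_WITH_IMM is the value right after it. */
enum {
	MY_IBV_WC_SEND               = 0,
	MY_IBV_WC_RDMA_WRITE         = 1,
	MY_IBV_WC_RECV               = 1 << 7,
	MY_IBV_WC_RECV_RDMA_WITH_IMM = (1 << 7) + 1,
};

/* Returns non-zero when the Work Completion came from the Receive Queue */
static int is_recv_completion(int opcode)
{
	return (opcode & MY_IBV_WC_RECV) != 0;
}
```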

For a Receive Work Completion of a UD QP, the data starts at offset 40 from the start of the posted receive buffer, whether or not the IBV_WC_GRH bit is set.

Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid:

  • wr_id
  • status
  • qp_num
  • vendor_err


Parameters

Name Direction Description
cq in Completion Queue that was returned from ibv_create_cq()
num_entries in Maximum number of Work Completions to read from the CQ
wc out Array of size num_entries where the Work Completions read from the CQ will be returned

Return Values

Value Description
Positive Number of Work Completions that were read from the CQ and returned in wc. If this value is less than num_entries, there aren't any more Work Completions in the CQ at the moment. If this value equals num_entries, there may be more Work Completions in the CQ
0 The CQ is empty
Negative A failure occurred while trying to read Work Completions from the CQ


Poll a Work Completion from a CQ (in polling mode):

struct ibv_wc wc;
int num_comp;

do {
	num_comp = ibv_poll_cq(cq, 1, &wc);
} while (num_comp == 0);

if (num_comp < 0) {
	fprintf(stderr, "ibv_poll_cq() failed\n");
	return -1;
}

/* verify the completion status */
if (wc.status != IBV_WC_SUCCESS) {
	fprintf(stderr, "Failed status %s (%d) for wr_id %d\n",
		ibv_wc_status_str(wc.status), wc.status, (int)wc.wr_id);
	return -1;
}


What is that Work Completion anyway?

A Work Completion means that the corresponding Work Request has ended, and its buffers can be read, (re)used, or freed.

Does ibv_poll_cq() cause a context switch?

No. Polling for Work Completions doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

Is there a limit to the number of Work Completions that can be polled when calling ibv_poll_cq()?

No. One can read as many Work Completions as one wishes.

I called ibv_poll_cq() and it filled all of the array that I've provided to it. Can I know how many more Work Completions exist in the CQ?

No, you can't.

I got a Work Completion from the Receive Queue of a UD QP and it ended well. I read the data from the memory buffers and I got bad data. Why?

Maybe you read the data starting at offset 0. For any Receive Work Completion of a UD QP, the data is placed at offset 40 of the relevant memory buffers, whether or not a GRH was present.
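A minimal sketch of this layout rule (the helper names are hypothetical): the first 40 bytes of a UD receive buffer are reserved for the GRH, and byte_len includes those 40 bytes, so the payload is found and measured like this:

```c
#include <stdint.h>

#define GRH_SIZE 40  /* bytes reserved for the GRH in a UD receive buffer */

/* For a UD Receive Work Completion, the payload always starts at
 * offset 40 of the posted buffer, whether or not a GRH was placed
 * there (IBV_WC_GRH tells you if those 40 bytes are valid). */
static uint8_t *ud_payload(uint8_t *recv_buf)
{
	return recv_buf + GRH_SIZE;
}

/* byte_len in the Work Completion includes the 40 reserved bytes */
static uint32_t ud_payload_len(uint32_t byte_len)
{
	return byte_len - GRH_SIZE;
}
```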

What is this GRH and why do I need it?

The Global Routing Header (GRH) provides information that is most useful for sending a message back to the sender of this message if it came from a different subnet or from a multicast group.

I've got completion with error status. Can I read all of the Work Completion fields?

No. If the Work Completion status indicates that there is an error, only the following attributes are valid: wr_id, status, qp_num, and vendor_err. The rest of the attributes are undefined.

I read a Work Completion from the CQ and I don't need it. Can I return it to the CQ?

No, you can't.

Can I read a Work Completion that belongs to a specific Work Queue?

No, you can't.

What will happen if more Work Completions than the CQ size are added to it?

There will be a CQ overrun and the CQ (and all of the QPs that are associated with it) will move into the error state.



Tell us what you think.

  1. exhaust system says: February 27, 2013

    I really enjoyed viewing your post. You have a lot of insight and truly opened my eyes with a point you made.

  2. Omar Khan says: September 20, 2013

    If i post an RDMA send, how would i know that the receiving side has received the buffer. Does the entry in the Completion queue of the sender, indicate that the receiver has received the data, or does it only indicate that the sender can now reuse the buffer.


    • Dotan Barak says: September 20, 2013

      Hi Omar.

      The question is: which QP transport type are you using?
      Assuming that the Work Completion was ended successfully:

      • For Reliable QP (for example, RC): this means that the sent buffer was written at the receiver side.
      • For Unreliable QP: this means that the sent buffer can be reused, since the message was already sent.

      I hope that this answer helped you.


      • Alan says: April 4, 2014


        In your previous post regarding upon the end of a successful Work Completion using RC RDMA Write, you said it means the send buffer was written at the receiver side. My question is what does the "receiver side" mean? Does it mean the user memory at the remote or the HCA on the remote?

        I saw some posts that point out that a successful Work Completion for a RDMA Write doesn't mean user can read the data on the receiver buffer.

        Did I misunderstand something?



      • Dotan Barak says: April 4, 2014

        Hi Alan, this is a great question.

        The receiver side is the responder side (remote side).
        I mean that the data was received to the remote side HCA and in almost all cases was written to its memory.

        However, the remote side doesn't know that the RDMA Write was finished to its memory
        (it doesn't have any indication that RDMA Write was performed to its memory or that it was finished).

        Sure, it can inspect the memory and see that it was changed, but if the last byte was changed it doesn't necessarily mean that the whole buffer changed.

        I think that it is better to be cautious and wait until the remote side has a Work Completion on this QP. But I guess other methods can be used instead.

        Did I answer your question?


      • Alan says: April 4, 2014

        Hi Dotan,
        Thanks for the fast reply. It answers part of my question. In certain scenario I cannot poll cq on the remote side, so there is no way for me to get and process the Work Completion on the remote. I am not sure if doing something as following would help:

        RDMA_Write(big user data);
        RDMA_Read (last byte of the date from remote);
        wait Work Completion for both of them.
        RDMA_Write (flag);

        while (!flag) ;
        check data from sending side;
        Please note that the two RDMA_Write may use different QPs. But the RDMA_Read will use the same QP as the 1st RDAM_Write.



      • Dotan Barak says: April 4, 2014

        Hi Alan.

        I'm sorry, but I didn't understand what you are doing;
        which operations is performed by every side, and in which QP.
        What is the reason that you try to write and then read the last byte of the data?

        Please note that there isn't any guarantee between messages from different queues.


      • Alan says: April 4, 2014

        Hi Dotan,

        I am sorry I didn't make it clear.

        What I want to know is that if the receiving of the Work Completion of a RDMA_Read which follows a RDMA_Write on the same QP would guarantee (or force) the data of the RDMA_Write being written into the remote memory.



      • Dotan Barak says: April 5, 2014

        The question is: why do you assume that the memory will be written after the RDMA Read was completed?
        (and why do you assume that it won't be written in the first place).

        Can you please send me the reference to the post that you are referring to?


      • Alan says: April 5, 2014

        Hi Dotan,

        Here is one of the links:

        The other place is in the print outs we had for IB education years ago.



      • Dotan Barak says: April 19, 2014

        Hi Alan.

        (I don't consider myself a PCI express or computer architecture expert, so I hope that I'm not confusing you with this answer).

        As far as I understand, this question is a little bit tricky; since it isn't related to RDMA.

        The same problem can happen to you when you send data using Send opcode as well
        (and may happen in other network architecture that allow HW offloads,
        and in some cases even when using sockets).

        The data that you want to write to the memory *may* be different than the data that was actually written to memory, because of errors/bit flips or any kind of error that may happen between the time the data reached the remote side HW and the time the data was written to the memory.

        Actually, this kind of error can happen when you are accessing local memory as well, without performing any data transfer at all.

        So, I think that this issue isn't related to RDMA.

        BTW, if you want to make sure that the same content was written you can add checksums to your data.


  3. Omar Khan says: January 23, 2014

    i want to know one thing. if i get a "IBV_WC_RNR_RETRY_EXC_ERR" when I poll the completion queue, can i repoll the queue after a while or does my queue enter an error state and cannot be used any more.


    • Dotan Barak says: January 23, 2014

      Hi Omar.

      You are polling a CQ for Completion. If you get a Completion with bad status
      (e.g. "IBV_WC_RNR_RETRY_EXC_ERR"), the QP itself enters the Error state and cannot be used.

      However, the CQ itself is still valid and fully functional; If this CQ is being used in several QPs,
      one/some of them may get into error and the rest of them can still be fully functional...

      I hope that I answered

  4. Omar Khan says: June 25, 2014

    Dear Dotan
    I want to know if it is necessary to poll the send completion queue after each ibv_post_send whether it's for RDMA WRITE OR normal send. Polling the send completion queue is time consuming and takes almost 10 microseconds on our cluster and if I do not poll the send completion queue, I overflow it after the maximum send queue counter set for the queue pairs. Is it possible that I do not generate a completion entry for send operation. Please share with me some code snippet where I set up the queue pairs such that for each entry added to the send queue no completion is generated.
    Hopefully I have made my point clear.

    Omar Khan

    • Dotan Barak says: June 26, 2014

      Hi Omar.

      You don't have to poll the Send Completion Queue after every call to ibv_post_send();
      you can create the Queue Pair and specify that a Work Completion isn't needed for each Send Request:

      struct ibv_qp_init_attr attr = {
      	.send_cq = ctx->cq,
      	.recv_cq = ctx->cq,
      	.cap     = {
      		.max_send_wr  = 1,
      		.max_recv_wr  = rx_depth,
      		.max_send_sge = 1,
      		.max_recv_sge = 1
      	},
      	.qp_type = IBV_QPT_RC,
      	.sq_sig_all = 0
      };

      When posting a Send Request(s), you need to specify the Send Requests that will generate the Work Completion
      (by setting the IBV_SEND_SIGNALED flag):

      struct ibv_send_wr wr = {
      	.wr_id = PINGPONG_SEND_WRID,
      	.sg_list = &list,
      	.num_sge = 1,
      	.opcode = IBV_WR_SEND,
      	.send_flags = IBV_SEND_SIGNALED,
      };

      I hope that it helped.
      I guess that I'll write a post this weekend on selective signalling..


      • Omar Khan says: June 26, 2014

        Dear Dotan

        Thanks for your reply. I set send_flags = IBV_SEND_SIGNALED for those send requests for which completion entry is required. What about those for which completion entry in CQ is not required? Do I set the send flag = 0

      • Omar says: June 26, 2014

        Dear Dotan

        I have tried what you have said about setting .sq_sig_all = 0 and only using .send_flags = IBV_SEND_SIGNALED for those send requests which i need to signal. For those send requests whose completion notification is not required, I set .send_flags = 0. I have also set the .max_send_wr = 1 before creating the queues. But it does not work. If i set the .sq_sig_all = 1 and poll the send completion queue after every ibv_post_send, it works very well but i get a delay of several microseconds.
        Please help me out in this.


      • Omar says: June 26, 2014

        Selective signalling works. All we need to do is signal one WR for every SQ-depth worth of WRs posted. For example, If the SQ depth is 16, we must signal at least one out of every 16. This ensures proper flow control for HW resources.
        Courtesy: section 8.2.1 of the iWARP Verbs draft


        Omar Khan

      • Dotan Barak says: June 26, 2014

        Hi Omar.

        I'm happy that it is working for you and thanks for the URL that you shared.


  5. Aunn Raza says: November 6, 2014

    Hi Dotan,
    What if the CQ has 2 entries, but i take only 1 entry by ibv_poll_cq, Will it generate another notification for other one when i will poll it again? or i have take both the entries together?

    • Dotan Barak says: November 6, 2014

      Hi Aunn.

      The question is what do you mean by "notification".
      If you are talking about Completion Notification,
      then the next Work Completion that will be added to the CQ will generate Completion event
      (if you asked to get this notification from the first place).

      This notification will happen when a new Work Completion is added to the CQ,
      and it doesn't matter if the CQ is empty or not.

      I hoped that I answer to your question.


  6. Valentin Petrov says: December 16, 2014

    Hi, Dotan, could you possibly give a hint (maybe somewhere in the literature) on how to organize flow control when a single RCQ (recv completion queue) is shared among multiple QPs. The issue I have is the following. I maintain the necessary level of pre-posted recv WRs in all QPs so that there are no dropped packets. This is easy to do on a per-connection (per QP) basis since everybody knows how many recvs are preposted on the other side. But the shared RCQ can be easily overflown in case its depth < N*num_preposted (N - number of connections). I believe there should be a "gold/commonly_adopted" algorithm for this scenario. Can you suggest anything here?

    • Dotan Barak says: December 18, 2014

      Sorry, there isn't such an algorithm that I'm aware of..
      If you'll develop one, it will be great if you'll share it.

      You need to be careful not to overflow the CQ, and if needed work with several CQs;
      make sure that if you have X QPs that every QP may get Y Work Completion, the CQ size must be bigger than X * Y.

      If there can be a case where the CQ won't be big enough, you should use multiple CQs.
      Working with Completion Events and an event channel that handle multiple CQs can be useful too.


  7. Starichok says: December 25, 2014

    Hi! Please, help me!
    I can't get any events at the receiving side.
    Although I see from the debugger that the contents of the receive buffer has changed. On the server side ibv_poll_cq always return 0. If I use ibv_get_cq_event, then the program will be blocked forever.
    - Client side:
    - ibv_post_send() with IBV_SEND_SIGNALED and opcode=IBV_WR_RDMA_WRITE;
    - ibv_poll_cq;
    - Server side:
    - ibv_poll_cq;

    Trying .sq_sig_all = 0 and .sq_sig_all = 1, but the result on server side is the same.
    What am I doing wrong?

    • Dotan Barak says: December 25, 2014


      Let me try to understand what is going on:
      In the client side, you post a Send Request with an RDMA operation,
      and poll for Work Completion (i.e. poll_cq return a value which isn't 0, and fill a Work Completion structure).

      However, in the server side you don't get any completion at all - right?

      Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
      (this is the whole idea of RDMA).

      If you want to get a Work Completion in the receiver side, I suggest that you'll:
      1) post a Receive Request at the server side
      2) Use RDMA Write with immediate, which will consume the Receive Request in the receiver side and generate a Work Completion.

      I hope that this helped you.


      • Starichok says: December 26, 2014

        Thank you very much!!! Today did as you said - it all worked perfectly!!!

        "Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
        (this is the whole idea of RDMA)."
        Sorry for the boring, but
        how, then, can the remote side find out that data was written to its buffer, other than in my case or via a TCP/IP socket?

      • Starichok says: December 26, 2014

        or so - what is the best way to learn about it

      • Dotan Barak says: December 28, 2014


        I didn't really understand the question here.
        But I'll try to explain what I think you meant:
        The sender side perform RDMA Write to the receiver memory,
        and he should hint the receiver that its memory was changed.

        This can be done by sending Send or RDMA Write with immediate operations.
        One may ask: what is this good for?
        Well, the sender can issue several RDMA Write to the receiver memory and hint the receiver only once about all the written memory buffers.

        This blog is a good place to start learning RDMA from.
        Currently, there isn't any "Getting started" post, but I'll guess that I'll write such in the (near?) future.


      • Starichok says: December 30, 2014

        Thank you very much!!!
        I have achieved transfer rate by 65 KB (interface QDR) about 8 Gbit/s using one QP and four buffers !!!
        Happy New Year !!!

      • Dotan Barak says: December 31, 2014

        (This is a very good start)

        Happy new year

  8. Parthiban says: January 2, 2015

    Hi Dotan,
    Happy New Year!!

    I'm trying RDMA transfer between two nodes and I observe no work completion WU in the queue. The same application works between two adjacent nodes but when i try to run across the network nodes i observe the above mentioned error.
    Then i checked the ibv_rc_pingpong or ibping test, i see the remote address are shared but the transfer didn't happen. But the normal ping to remote node is working fine.


    • Dotan Barak says: January 2, 2015

      Hi Parthiban.

      I need some more information:
      Which transport are you using (InfiniBand, RoCE, iWARP)?
      Can you send me the output of ibv_devinfo?


  9. Parthiban says: January 2, 2015

    Hi Dotan,
    Thanks for the reply. I'm using InfiniBand.

    system 1:
    hca_id: mlx4_1
    transport: InfiniBand (0)
    fw_ver: 2.10.630
    node_guid: 0025:90ff:ff17:0448
    sys_image_guid: 0025:90ff:ff17:044b
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: SM_2191000001000
    phys_port_cnt: 1
    port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 31
    port_lid: 4
    port_lmc: 0x00
    link_layer: IB

    hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.32.5100
    node_guid: f452:1403:008c:3d80
    sys_image_guid: f452:1403:008c:3d83
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: MT_1090120019
    phys_port_cnt: 2
    port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 1
    port_lid: 1
    port_lmc: 0x00
    link_layer: IB

    port: 2
    state: PORT_DOWN (1)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 0
    port_lid: 0
    port_lmc: 0x00
    link_layer: IB
    System 2:
    hca_id: mlx4_0
    transport: InfiniBand (0)
    fw_ver: 2.32.5100
    node_guid: f452:1403:008e:e9b0
    sys_image_guid: f452:1403:008e:e9b3
    vendor_id: 0x02c9
    vendor_part_id: 4099
    hw_ver: 0x0
    board_id: MT_1090120019
    phys_port_cnt: 2
    port: 1
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 19
    port_lid: 1
    port_lmc: 0x00
    link_layer: IB

    port: 2
    state: PORT_ACTIVE (4)
    max_mtu: 2048 (4)
    active_mtu: 2048 (4)
    sm_lid: 1
    port_lid: 24
    port_lmc: 0x00
    link_layer: IB

    • Dotan Barak says: January 2, 2015


      Which IB port did you try to work with, 1 or 2?

      (Since I think that port 1 of the devices isn't managed by the same SM)



  10. Parthiban says: January 2, 2015

    Yes you are right! there are again two separate IB networks the systems are connected to. I use port 2. one more doubt! if the two ports are connected to different IB network and the same system is configured to run the SM for the two network, will it work properly for both the networks?


    • Dotan Barak says: January 2, 2015


      If you use the same SM for two networks, it becomes one subnet.

      If you have two subnets (for example, all port 1 in one subnet and all port 2 in the second one), working with port 1 in different machines will communicate (same goes for port 2).


  11. Parthiban says: January 3, 2015

    Hi Dotan,
    I see that

    system001:~ # ibv_rc_pingpong
    local address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
    remote address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::

    system002:~ # ibv_rc_pingpong
    local address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::
    remote address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
    Failed status transport retry counter exceeded (12) for wr_id 2


    system001:~ # ibping -S -d -v
    ibdebug: [12314] ibping_serv: starting to serve...

    system002:~ # ibping -d -v 14
    ibdebug: [6738] ibping: Ping..
    ibwarn: [6738] ib_vendor_call_via: route Lid 14 data 0x7fff4c7b8c10
    ibwarn: [6738] ib_vendor_call_via: class 0x132 method 0x1 attr 0x0 mod 0x0 datasz 216 off 40 res_ex 1
    ibwarn: [6738] mad_rpc_rmpp: rmpp (nil) data 0x7fff4c7b8c10
    ibwarn: [6738] mad_rpc_rmpp: MAD completed with error status 0xc; dport (Lid 14)
    ibdebug: [6738] main: ibping to Lid 14 failed

    not able to figure out the reason.


    • Dotan Barak says: January 4, 2015


      First of all, in system001, ibv_rc_pingpong prints that the local LID is 0x1b (27 decimal),
      but when you executed ibping you used LID 14.

      The above failure in ibv_rc_pingpong suggests that there is a connectivity problem in your subnet.
      Are they both in the same subnet now?


      • Parthiban says: January 4, 2015

        Hi Dotan,
        Yes, both systems are in the same network. If I execute a normal ping, it works fine. Another scenario: if I run the RDMA sample application which uses RDMA CM, the application works fine, but if I use IB verbs it fails with "completion wasn't found in the CQ" and "poll completion failed".

  12. Parthiban says: January 5, 2015

    Hi Dotan,
    The issue is fixed. The bug was that the program scans the interfaces and tries to use the first interface found, but that interface is not connected to the same subnet. Now I pass the interface to use, and it works!

    • Dotan Barak says: January 5, 2015


      Thanks for updating me.


    • yuzhen says: September 13, 2016

      Hi Parthiban,

      I also tried to run the example provided using IB verbs, but it failed with the same error as yours: "completion wasn't found in the CQ after time out. poll completion failed".

      Do you have any suggestions?


      • Dotan Barak says: September 16, 2016


        Which example did you try to use?
        What is the exact command line and the output that you got?


  13. Anonymous says: January 14, 2015

    Hi Dotan!
    I use a QDR device. How do I use all 4 lanes? Experimentally, I found that all clients use a single lane :(. If I run one client, the maximum transmission speed is 10 Gb/s; if I run 4 clients, then the total transfer speed is still 10 Gb/s, and each client can transmit at 2.5 Gb/s...
    How can I fill the entire bandwidth, i.e., 40 Gb/s?

    • Dotan Barak says: January 14, 2015


      QDR means that the speed of the line is 4 times faster than the base speed.
      Base speed: SDR is 2.5 Gb/s per lane.

      Please execute 'ibstat | grep Rate' to get the maximum supported BW for your adapter.
      (assuming that you are using InfiniBand)


  14. Floaterions says: February 17, 2015

    Hello Dotan,

    When a program is waiting in ibv_poll_cq(), does it consume CPU, or does it go idle and wait for an event to wake it up? I'm asking because I'm facing a design choice where I can end up with hundreds of threads (more than CPU cores), each polling on a separate QP for messages, and I was wondering if the waiting threads actually incur any cost to the system.
    Thank you for your help

    • Dotan Barak says: February 17, 2015

      Hi. Floaterions.

      When ibv_poll_cq() is called, it consumes CPU (i.e. polling).

      If you want to reduce the CPU consumption (and latency isn't an issue),
      it is preferred to work with Completion events.


  15. DjvuLee says: March 9, 2015

    HI, Dotan!

    I have a question: I want to see what happens if a Receive Request is not ready on the receiver node (i.e. an RNR condition).

    So I post just one Receive Request on the receiver node, and the sender posts several Send Requests in a loop. I expect an IBV_WC_RNR_RETRY_EXC_ERR error on the second iteration.

    The first iteration works as expected: the receiver receives the message and consumes the Receive Request. However, on the second iteration the receiver gets an event (ibv_get_cq_event), but the following ibv_poll_cq returns zero, and it blocks in ibv_get_cq_event again.

    This seems impossible: there is an event notification from the completion queue, yet polling returns nothing. How did this happen?

    • DjvuLee says: March 9, 2015

      Oh, I am a little sorry, Dotan. There were some mistakes in my last post.

      Every Send Request is signaled, and it is the sender that gets an event notification using ibv_get_cq_event but gets 0 from ibv_poll_cq.

      The receiver just blocks in ibv_get_cq_event; no error message is thrown.

      • Dotan Barak says: March 10, 2015


        Yes, in RDMA you may get a Completion Event without finding a Work Completion in the Completion Queue
        (I've written about this in my posts).

        Some questions:
        * Are you using a Reliable transport type for the Queue Pair?
        * If you switch to polling instead of using events, do you still have the problem?
        * Do you check the status of the Work Completions (on both sides)?
        * What are the values of the following attributes: min_rnr_timer, rnr_retry, timeout, retry_cnt?


  16. DjvuLee says: March 11, 2015

    Thanks very much! I will search your blog for this.
    I use RC. I modified my code later and it got a bit messy, so I have to restore my code and check this status later.

  17. DjvuLee says: March 12, 2015

    Hi Dotan! I have a question about the concurrency connection setup.

    If I have a server which will accept a lot of clients.

    During connection setup, suppose we get a RDMA_CM_EVENT_CONNECT_REQUEST event from one client, then a RDMA_CM_EVENT_CONNECT_REQUEST from another client, and then a RDMA_CM_EVENT_ESTABLISHED event.

    Because we use the same event channel, and we cannot get the connection id when we get the RDMA_CM_EVENT_ESTABLISHED event, how do we know which client got established?

    I thought maybe RDMA deals with this in another way: if we get a RDMA_CM_EVENT_CONNECT_REQUEST event, we reject connection requests from other clients until we get RDMA_CM_EVENT_ESTABLISHED for the former client. But if the server never gets RDMA_CM_EVENT_ESTABLISHED for that client, what happens? Other clients would be rejected forever.

    Or should we use a different event channel for each client? That doesn't seem like a good way.

    I wrote a program which uses the main thread for connection setup, from RDMA_CM_EVENT_CONNECT_REQUEST to RDMA_CM_EVENT_ESTABLISHED. After the RDMA_CM_EVENT_ESTABLISHED event, it dispatches the established connection to another thread and uses the main thread to accept new connections. But when several clients connect to the server simultaneously, only one gets serviced; the others are rejected. In TCP/IP, this is the right way to handle concurrent connections.

    I also wonder how to tell which client disconnected when we receive RDMA_CM_EVENT_DISCONNECTED, since we cannot get the connection id from the event.

    I have little RDMA programming experience, so I hope this problem is not too stupid.

    • Dotan Barak says: March 17, 2015


      I'm sorry, but I don't consider myself an expert (yet) in programming over librdmacm.
      There is an example in the rdmacm git repository, called rping.

      This example has a persistent mode, and I think that all your questions will be answered by this example.
      Please pay attention to the function rping_run_persistent_server().

      If you care about specific clients, maybe you can use the private_data field to exchange important information about the remote identity.

      I hope that this helps you

      • DjvuLee says: March 21, 2015

        Thanks Dotan.

        I know how to deal with this now. I just use one thread to listen for the events, use the connection id to relate the different events, and dispatch the connections to a thread pool.

      • Dotan Barak says: March 29, 2015


        Thanks for the update

  18. Avis says: July 17, 2015

    Hi Dotan,
    I see a behavior where a completion event (for receive) is triggered, but when I poll the CQ (ib_poll_cq), it returns 0 work completions. Why would a completion event be generated when there are no work completions? Is this normal behavior? If not, where do you suspect the problem could be?

    • Dotan Barak says: July 17, 2015


      Yes. Completion events can be triggered even if there isn't any Work Completion in the Completion Queue.
      This can happen if you armed the CQ and then emptied it (thus polling the Work Completion that triggered the event). When you read the event and check the CQ, you may find that the CQ is empty.

      I believe that if you check, you'll find that all the Work Completions were read from the CQ before you got this event with an empty CQ.


      • Avis says: July 17, 2015

        Thank you.

  19. Anon says: July 22, 2015

    Hi Dotan,

    I am trying to create an example of a one sided RDMA READ off the rc_pingpong.c sample from the ibverbs code.
    What I have changed is:
    1. When creating the memory regions, allow remote reads through the IBV_ACCESS_REMOTE_READ flag.
    2. The pp_post_send function to use IBV_WR_RDMA_READ as the opcode.
    3. Removed all calls to pp_post_recv.
    4. Changed the main while loop, so that the server and the client both poll the cq. Once an event has happened, they exit. Particularly, the server will keep running, and the client exits after it does one run of pp_post_send.

    The issue that I am seeing is that on the client side, the work completion returns code IBV_WC_REM_INV_REQ_ERR.
    Do you know why this might be? It seems that qp_access_flags is not used anymore (or at least when I try to set them, they don't get modified), and the buffers in the pingpong context are still the same 4KB page size. With the permissions set on the memory regions, I am not sure what else is going wrong?

    Thanks for any help

    • Dotan Barak says: July 22, 2015

      Hi Anon.

      Did you enable RDMA Read in qp_attr.qp_access_flags?


      • Anon says: July 23, 2015

        Hi Dotan,

        I eventually figured it out. I was setting the qp_access_flags to allow IBV_ACCESS_REMOTE_READ.
        The issue is that I misinterpreted what the max_dest_rd_atomic and max_rd_atomic fields are used for -- I thought they were only for remote Atomic operations. As such, I set them to 0, so when I tried to move the QP state machine to RTR, the access flags simply didn't update.

        Thanks for the help.

      • Dotan Barak says: July 23, 2015

        I'm glad everything is working for you


  20. Adrian says: September 14, 2015

    Hello Dotan,

    I am trying to figure out what the maximum number of scatter/gather entries I can use per one Work Request is.
    I have read the FAQ on the ibv_create_qp page, however, I am not seeing the failure when I am trying to create the QP.
    What I have is:
    1. ibv_query_device returns a max_sge value of 32.
    2. I use this in the max_send_sge field of the ibv_qp_cap struct in the ibv_qp_init_attr struct used to create the QP. When the function returns, the value in max_send_sge is updated to 62, to my surprise (I am not sure why...)
    3. I then attempt an RDMA READ with an sg_list of length 32, and 31. Each scatter/gather entry has length 1 (i.e. I am reading only one byte from the remote buffer into the local one for each entry). Both of these return IBV_WC_LOC_LEN_ERR as the completion.
    4. If I use an sg_list of length 30, everything seems to work.

    Do you know why:
    a) ibv_create_qp modifies the max_send_sge to be larger than the max_sge value returned from ibv_query_device?
    b) The max_sge value seems to be too large, even though creating the QP with that value set in the init attributes returns with no error?

    Thanks in advance.

    • Dotan Barak says: September 15, 2015

      Hi Adrian.

      The problem is that ibv_query_device() provides a single max_sge value for all transport types and for all Work Queues (both Send and Receive),
      and sometimes this single value isn't accurate for every configuration.

      I suspect that there is a bug in the low-level driver; you should use the latest version of it,
      and inform the low-level driver provider if this still happens.


  21. Anonymous says: September 14, 2015

    Hi Dotan,
    When posting a signaled RDMA Send from the server side, I receive no WC at the client side and the kernel hangs, even though I have some outstanding receive work requests posted on the client side. Can you tell me what the reason could be?

    • Dotan Barak says: September 15, 2015


      Signaled RDMA Sends are relevant only to the local side (the remote side isn't aware of the signaling mode).
      Is this the first message? (Maybe the QPs weren't connected correctly.)


  22. Long says: February 4, 2016

    Hi Dotan,

    Thanks for your web site that provides a lot of useful information about RDMA and Infiniband.

    I want to use an example program provided by Tarick Bedeir ( for setting up an RDMA connection (using RDMA_CM) between two machines and then call
    ibv_post_send()/ibv_post_recv() to send/receive data. Setting up the RDMA connection works fine. However, ibv_post_send() fails on the first attempt to send (I get error IBV_WC_RETRY_EXC_ERR (12)).

    Your article on ibv_poll_cq says that "this means that the connection attributes are wrong or the remote side isn't in a state in which it can respond to messages". However, this example program sets the retry_count parameter to 7 (infinite retry). Further, the program uses rdma_create_qp() to create the queue pairs, and the RDMA programming manual says that "QPs allocated to an rdma_cm_id are automatically transitioned by the librdmacm through their states. After being allocated, the QP will be ready to handle posting of receives."

    I wonder what goes wrong and how I can fix it. The example code is available at .

    I would be grateful if you have any suggestion for me.


    • Dotan Barak says: February 5, 2016

      Hi Long.

      I didn't have a chance to play with this example (yet).
      7 is the infinite value only for rnr_retry.
      For retry_cnt, 7 means an actual seven retries.

      I would suggest executing ibv_rc_pingpong and rping,
      to check that your fabric is configured and functioning correctly.


  23. liuyu says: March 15, 2016

    Hi Dotan,

    Thanks for your help!

    Now I am very confused while programming with verbs. I want to use RDMA Read or RDMA Write for handling IO, but I get error IBV_WC_REM_INV_REQ_ERR (9) at the sender side. I have checked the memory and didn't find anything wrong. I paste some code here; could you give me some suggestions?

    //create qp
    RCPRINT("client creating qp\n");
    qp_attr.cap.max_send_wr = MAX_WR;
    qp_attr.cap.max_send_sge = 1;
    qp_attr.cap.max_recv_wr = MAX_WR;
    qp_attr.cap.max_recv_sge = 1;
    qp_attr.send_cq = send_cq;
    qp_attr.recv_cq = recv_cq;
    qp_attr.qp_type = IBV_QPT_RC;
    err = rdma_create_qp(cm_id, pd, &qp_attr);
    if (err) {
        RCPRINT_ERROR("client create qp fail\n");
        return 1;
    }

    //rdma write or read
    memset(&send_wr, 0, sizeof(send_wr));
    send_wr.wr_id = (uint64_t)sge;
    send_wr.sg_list = sge;
    send_wr.num_sge = 1;
    send_wr.opcode = (opCode == CTRL_READ) ? IBV_WR_RDMA_READ : IBV_WR_RDMA_WRITE;
    send_wr.send_flags = IBV_SEND_SIGNALED;
    send_wr.wr.rdma.remote_addr = remoteRdma->remote_addr;
    send_wr.wr.rdma.rkey = remoteRdma->rkey;

    if (ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr)) {
        RCPRINT_ERROR("server send rdma opt(%d) fail\n", opCode);
        return RETURN_ERROR;
    }

    • liuyu says: March 17, 2016

      Hi Dotan,

      Today I wrote a test program. I found that the client can post IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ successfully, but the server can only post IBV_WR_RDMA_WRITE successfully. When the server posts a send with opcode IBV_WR_RDMA_READ, it gets error IBV_WC_REM_INV_REQ_ERR (9) after ibv_poll_cq succeeds, and wc.opcode changes to IBV_WC_SEND. In the test program, I just send one IBV_WR_RDMA_READ. Could you give me some suggestions?

      • Dotan Barak says: March 22, 2016


        I think that the problem is with the permissions of the QP or MR on the client side.
        (RDMA Read isn't enabled)


  24. liuyu says: March 29, 2016

    Thanks very much for your help! I have resolved the problem. When the server calls rdma_accept(), I did not assign a value to struct rdma_conn_param's member initiator_depth, which defaults to zero. So struct ibv_qp_attr's member max_rd_atomic was zero as well, and the server could never send an RDMA Read operation.

  25. tamlok says: April 13, 2016

    I posted a Receive Request and call ibv_poll_cq() to see if we received anything. However, after I suddenly kill the program which is intended to send something to the receiver, the ibv_poll_cq() called by the receiver still keeps returning 0.
    So it is confusing that ibv_poll_cq() doesn't return a negative value even after the connection has been disconnected.
    Do you have any ideas?
    Thanks very much!

    • Dotan Barak says: April 19, 2016


      A negative return value from ibv_poll_cq() means that there is an error in the CQ itself.
      The QP doesn't know that the remote side is dead ...
      (unless CM is used, and there is a DISCONNECT indication)

      If needed, you can add "keep alive" messages to your application
      (for example: RDMA Write with 0 bytes - if you are using RC).


      • tamlok says: April 19, 2016

        Thanks very much! Do you know how high the cost of ibv_poll_cq() is? Will it be expensive if I keep calling ibv_poll_cq() frequently? Will it consult a hardware register or just memory?

      • Dotan Barak says: April 19, 2016


        It is hard to answer, since it is device specific.

        In Mellanox devices (for example), ibv_poll_cq() accesses memory, which is relatively cheap
        (no context switch or any other expensive operation).

        I can't say for other devices...


      • tamlok says: April 19, 2016

        Impressive! Thanks very much!

  26. Junhyun says: October 8, 2016

    Hi Dotan, when exactly is the buffer posted for a Recv WR updated?
    Is it updated when I call ibv_poll_cq, or whenever the device accepts the next incoming Send WR?
    For instance, if I post 10 Recv WRs on the same buffer, and on the other host I post 2 Send WRs, will the first Send's contents be overwritten? Or can I get the contents of the first Send by polling just the first Recv WR out of the CQ?

    • Dotan Barak says: October 11, 2016


      The Receive WR buffer(s) are filled when the incoming message arrives and a Receive Request is fetched.
      Once the whole message has been written to the buffer, a Work Completion is enqueued to the CQ.

      In your example, the first message's content will be overwritten by the second message.


  27. vineeth says: October 27, 2016

    I am trying to do an NVMe over Fabrics (RDMA) project. But I am getting an RDMA Read failure with status 5 on the host, and the QP moves to the Error state. Can you please tell me why my QP is moving to the Error state on the host side during the RDMA Read?

    • Dotan Barak says: November 4, 2016


      Work Completion with status 0x5 means IBV_WC_WR_FLUSH_ERR.
      * Is this the first completion with an error?
      * Was the QP already in the Error state?
      * Was there an asynchronous event on the remote side?


  28. shilvea says: February 20, 2017

    In the case of an SRQ, is poll_cq not used? I can't understand how I can call it, since its input parameter is a CQ, but I didn't create a CQ, only an SRQ.

    • Dotan Barak says: February 22, 2017


      The SRQ by itself can't be polled;
      it is used by the QP(s) as the source of Receive Requests for incoming messages.

      The Work Completion of such a Receive Request is enqueued to the CQ that is
      associated with the Receive Queue of the QP that consumed it.

