# ibv_poll_cq()

 int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);

# Description

ibv_poll_cq() polls Work Completions from a Completion Queue (CQ).

A Work Completion indicates that a Work Request in a Work Queue associated with the CQ, along with all of the outstanding unsignaled Work Requests that were posted to that Work Queue before it, is done. Every Receive Request, every signaled Send Request and every Send Request that ended with an error generates a Work Completion when its processing ends.

When a Work Request ends, a Work Completion is added to the tail of the CQ that its Work Queue is associated with. ibv_poll_cq() checks whether Work Completions are present in a CQ and pops them from the head of the CQ in the order they entered it (FIFO). After a Work Completion has been popped from a CQ, it can't be returned to it.

Work Completions should be consumed at a rate that prevents the CQ from being overrun (that is, from having to hold more Work Completions than the CQ size). In case of a CQ overrun, the async event IBV_EVENT_CQ_ERR is triggered and the CQ cannot be used anymore.
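The FIFO and overrun behaviour described above can be illustrated with a small toy model. This is only a sketch in plain C: `toy_cq`, `toy_wc`, `toy_cq_add()` and `toy_cq_poll()` are hypothetical names that are not part of libibverbs, and the model simply refuses further polls after an overrun to mimic a CQ that can no longer be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_CQ_SIZE 4                 /* hypothetical CQ capacity */

struct toy_wc { uint64_t wr_id; };    /* stand-in for struct ibv_wc */

struct toy_cq {
    struct toy_wc ring[TOY_CQ_SIZE];
    unsigned head;                    /* index of the next entry to pop */
    unsigned count;                   /* entries currently queued */
    int overrun;                      /* set once the CQ has been overrun */
};

/* Models the device adding a Work Completion to the tail of the CQ. */
static void toy_cq_add(struct toy_cq *cq, uint64_t wr_id)
{
    if (cq->count == TOY_CQ_SIZE) {
        cq->overrun = 1;              /* analogous to IBV_EVENT_CQ_ERR */
        return;
    }
    cq->ring[(cq->head + cq->count) % TOY_CQ_SIZE].wr_id = wr_id;
    cq->count++;
}

/* Same return convention as ibv_poll_cq():
 * > 0 entries read, 0 when the CQ is empty, < 0 on failure. */
static int toy_cq_poll(struct toy_cq *cq, int num_entries, struct toy_wc *wc)
{
    int n = 0;

    assert(num_entries > 0 && wc != NULL);
    if (cq->overrun)
        return -1;
    while (n < num_entries && cq->count > 0) {
        wc[n++] = cq->ring[cq->head];  /* pop from the head: FIFO order */
        cq->head = (cq->head + 1) % TOY_CQ_SIZE;
        cq->count--;
    }
    return n;
}
```

Adding three completions and then polling with `num_entries = 2` returns two entries in posting order, then one, then zero. Adding more than `TOY_CQ_SIZE` entries without polling marks the toy CQ as overrun.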

The struct ibv_wc describes the Work Completion attributes.

    struct ibv_wc {
        uint64_t            wr_id;
        enum ibv_wc_status  status;
        enum ibv_wc_opcode  opcode;
        uint32_t            vendor_err;
        uint32_t            byte_len;
        uint32_t            imm_data;
        uint32_t            qp_num;
        uint32_t            src_qp;
        int                 wc_flags;
        uint16_t            pkey_index;
        uint16_t            slid;
        uint8_t             sl;
        uint8_t             dlid_path_bits;
    };

Here is the full description of struct ibv_wc:

wr_id

The 64-bit value that was associated with the corresponding Work Request.

status

Status of the operation. The value can be one of the following enumerated values (the numeric value is given in parentheses):

• IBV_WC_SUCCESS (0) - Operation completed successfully: this means that the corresponding Work Request (and all of the unsignaled Work Requests that were posted before it) ended and the memory buffers that this Work Request refers to are ready to be (re)used.
• IBV_WC_LOC_LEN_ERR (1) - Local Length Error: this happens if a Work Request that was posted to a local Send Queue contains a message that is larger than the maximum message size supported by the RDMA device port that should send the message, or if an Atomic operation whose size is different from 8 bytes was sent. It may also happen if a Work Request that was posted to a local Receive Queue isn't big enough to hold the incoming message, or if the incoming message is larger than the maximum message size supported by the RDMA device port that received the message.
• IBV_WC_LOC_QP_OP_ERR (2) - Local QP Operation Error: an internal QP consistency error was detected while processing this Work Request. This happens if a Work Request that was posted to a local Send Queue of a UD QP contains an Address Handle that is associated with a different Protection Domain than the QP, or if the opcode isn't supported by the transport type of the QP (for example: RDMA Write over a UD QP).
• IBV_WC_LOC_EEC_OP_ERR (3) - Local EE Context Operation Error: an internal EE Context consistency error was detected while processing this Work Request (unused, since it is relevant only to RD QPs or EE Contexts, which aren't supported).
• IBV_WC_LOC_PROT_ERR (4) - Local Protection Error: the buffers in the scatter/gather list of the locally posted Work Request do not reference a Memory Region that is valid for the requested operation.
• IBV_WC_WR_FLUSH_ERR (5) - Work Request Flushed Error: a Work Request was in process or outstanding when the QP transitioned into the Error state.
• IBV_WC_MW_BIND_ERR (6) - Memory Window Binding Error: a failure happened when trying to bind a MW to a MR.
• IBV_WC_BAD_RESP_ERR (7) - Bad Response Error: an unexpected transport layer opcode was returned by the responder. Relevant for RC QPs.
• IBV_WC_LOC_ACCESS_ERR (8) - Local Access Error: a protection error occurred on a local data buffer during the processing of an RDMA Write with Immediate operation sent from the remote node. Relevant for RC QPs.
• IBV_WC_REM_INV_REQ_ERR (9) - Remote Invalid Request Error: the responder detected an invalid message on the channel. Possible causes include: the operation is not supported by this receive queue (qp_access_flags in the remote QP wasn't configured to support this operation), insufficient buffering to receive a new RDMA or Atomic Operation request, or the length specified in an RDMA request is greater than $2^{31}$ bytes. Relevant for RC QPs.
• IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error: a protection error occurred on a remote data buffer to be read by an RDMA Read, written by an RDMA Write or accessed by an atomic operation. This error is reported only on RDMA operations or atomic operations. Relevant for RC QPs.
• IBV_WC_REM_OP_ERR (11) - Remote Operation Error: the operation could not be completed successfully by the responder. Possible causes include a responder QP related error that prevented the responder from completing the request or a malformed WQE on the Receive Queue. Relevant for RC QPs.
• IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: the local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, it usually means that the connection attributes are wrong or that the remote side isn't in a state in which it can respond to messages. If this happens after sending the first message, it usually means that the remote QP isn't available anymore. Relevant for RC QPs.
• IBV_WC_RNR_RETRY_EXC_ERR (13) - RNR Retry Counter Exceeded: the RNR NAK retry count was exceeded. This usually means that the remote side didn't post any WR to its Receive Queue. Relevant for RC QPs.
• IBV_WC_LOC_RDD_VIOL_ERR (14) - Local RDD Violation Error: the RDD associated with the QP does not match the RDD associated with the EE Context (unused, since it is relevant only to RD QPs or EE Contexts, which aren't supported).
• IBV_WC_REM_INV_RD_REQ_ERR (15) - Remote Invalid RD Request: the responder detected an invalid incoming RD message. Causes include a Q_Key or RDD violation (unused, since it is relevant only to RD QPs or EE Contexts, which aren't supported).
• IBV_WC_REM_ABORT_ERR (16) - Remote Aborted Error: for UD or UC QPs associated with an SRQ, the responder aborted the operation.
• IBV_WC_INV_EECN_ERR (17) - Invalid EE Context Number: an invalid EE Context number was detected (unused, since it is relevant only to RD QPs or EE Contexts, which aren't supported).
• IBV_WC_INV_EEC_STATE_ERR (18) - Invalid EE Context State Error: the operation is not legal for the specified EE Context state (unused, since it is relevant only to RD QPs or EE Contexts, which aren't supported).
• IBV_WC_FATAL_ERR (19) - Fatal Error.
• IBV_WC_RESP_TIMEOUT_ERR (20) - Response Timeout Error.
• IBV_WC_GENERAL_ERR (21) - General Error: another error which isn't one of the above errors.

opcode

The operation that the corresponding Work Request performed. This value controls the way that data was sent, the direction of the data flow and the valid attributes in the Work Completion. The value can be one of the following enumerated values:

• IBV_WC_SEND - Send operation for a WR that was posted to the Send Queue
• IBV_WC_RDMA_WRITE - RDMA Write operation for a WR that was posted to the Send Queue
• IBV_WC_RDMA_READ - RDMA Read operation for a WR that was posted to the Send Queue
• IBV_WC_COMP_SWAP - Compare and Swap operation for a WR that was posted to the Send Queue
• IBV_WC_FETCH_ADD - Fetch and Add operation for a WR that was posted to the Send Queue
• IBV_WC_BIND_MW - Memory Window bind operation for a WR that was posted to the Send Queue
• IBV_WC_RECV - Send data operation for a WR that was posted to a Receive Queue (of a QP or to an SRQ)
• IBV_WC_RECV_RDMA_WITH_IMM - RDMA with immediate for a WR that was posted to a Receive Queue (of a QP or to an SRQ). For this opcode, only a Receive Request was consumed and the sg_list of this RR wasn't used

vendor_err

Vendor specific error which provides more information if the completion ended with an error. This value gives the vendor of the RDMA device a hint about the reason for the failure when a Work Completion ended with an error.

byte_len

The number of bytes transferred. Relevant in the Receive Queue for incoming Send or RDMA Write with immediate operations; this value doesn't include the length of the immediate data, if such exists. Relevant in the Send Queue for RDMA Read and Atomic operations. For the Receive Queue of a UD QP that is not associated with an SRQ, or for an SRQ that is associated with a UD QP, this value equals the payload of the message plus the 40 bytes reserved for the GRH, whether or not the GRH is present.

imm_data

(optional) A 32-bit number, in network order, that is sent along with the payload in a Send or RDMA Write operation and is placed in a Receive Work Completion on the remote side rather than in a remote memory buffer. This value is valid if the IBV_WC_WITH_IMM flag is set.

qp_num

Local QP number of the completed WR. Relevant for Receive Work Completions that are associated with an SRQ.

src_qp

Source QP number (remote QP number) of the completed WR. Relevant for Receive Work Completions of a UD QP.

wc_flags

Flags of the Work Completion. It is either 0 or the bitwise OR of one or more of the following flags:

• IBV_WC_GRH - Indicates that a GRH is present for a Receive Work Completion of a UD QP. If this bit is set, the first 40 bytes of the buffers that were referred to in the Receive Request contain the GRH of the incoming message. If this bit is cleared, the content of those first 40 bytes is undefined
• IBV_WC_WITH_IMM - Indicates that imm_data is valid. Relevant for Receive Work Completions

pkey_index

P_Key index. Relevant for GSI QPs.

slid

Source LID (the base LID that this message was sent from). Relevant for Receive Work Completions of a UD QP.

sl

Service Level (the SL that this message was sent with). Relevant for Receive Work Completions of a UD QP.

dlid_path_bits

Destination LID path bits. Relevant for Receive Work Completions of a UD QP (not applicable for multicast messages).

The test (opcode & IBV_WC_RECV) indicates whether a completion came from the Receive Queue.
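As a rough illustration of why this bit test works: in `<infiniband/verbs.h>` the receive opcodes start at `1 << 7`, so every receive opcode has that bit set and no send-side opcode does. The sketch below mirrors a few of those values in a local enum so that it compiles without libibverbs; the `TOY_` names and `wc_is_from_recv_queue()` are hypothetical, and a real program would include the header and use `IBV_WC_RECV` directly. Since opcode is not among the attributes that are valid for failed completions, the test is only meaningful when status is IBV_WC_SUCCESS.

```c
#include <assert.h>

/* Local mirror of a few enum ibv_wc_opcode values (assumption: these match
 * <infiniband/verbs.h>; a real program would include that header instead). */
enum toy_wc_opcode {
    TOY_WC_SEND               = 0,
    TOY_WC_RDMA_WRITE         = 1,
    TOY_WC_RDMA_READ          = 2,
    TOY_WC_RECV               = 1 << 7,
    TOY_WC_RECV_RDMA_WITH_IMM = (1 << 7) + 1
};

/* Non-zero if the opcode belongs to a completion from a Receive Queue. */
static int wc_is_from_recv_queue(int opcode)
{
    return (opcode & TOY_WC_RECV) != 0;
}
```

With these values, `wc_is_from_recv_queue(TOY_WC_RECV_RDMA_WITH_IMM)` is non-zero while `wc_is_from_recv_queue(TOY_WC_RDMA_READ)` is zero.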

For a Receive Work Completion of a UD QP, the data starts at offset 40 from the start of the posted receive buffer, whether or not the IBV_WC_GRH bit is set.
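A minimal sketch of how a receiver might locate the payload of a UD message, assuming `recv_buf` is the start of the posted receive buffer and `byte_len` is the value reported in the Work Completion. The helper names `ud_payload()` and `ud_payload_len()` and the constant `UD_GRH_BYTES` are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define UD_GRH_BYTES 40   /* bytes reserved for the GRH in a UD receive buffer */

/* Start of the message payload inside a UD receive buffer. The offset is the
 * same whether or not IBV_WC_GRH is set in wc_flags. */
static const uint8_t *ud_payload(const uint8_t *recv_buf)
{
    return recv_buf + UD_GRH_BYTES;
}

/* Payload length, given the byte_len reported in the Work Completion
 * (which includes the 40 reserved bytes). */
static uint32_t ud_payload_len(uint32_t byte_len)
{
    assert(byte_len >= UD_GRH_BYTES);
    return byte_len - UD_GRH_BYTES;
}
```

For example, a completion with `byte_len == 48` carries 8 bytes of payload, starting at `recv_buf + 40`.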

Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid:

• wr_id
• status
• qp_num
• vendor_err

# Parameters

| Name | Direction | Description |
|---|---|---|
| cq | in | Completion Queue that was returned from ibv_create_cq() |
| num_entries | in | Maximum number of Work Completions to read from the CQ |
| wc | out | Array of size num_entries of the Work Completions that will be read from the CQ |

# Return Values

| Value | Description |
|---|---|
| Positive | Number of Work Completions that were read from the CQ and whose values were returned in wc. If this value is less than num_entries, there aren't any more Work Completions in the CQ. If this value equals num_entries, there may be more Work Completions in the CQ |
| 0 | The CQ is empty |
| Negative | A failure occurred while trying to read Work Completions from the CQ |
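The return value conventions above can be summarized in a small helper. This is only a sketch: `enum poll_outcome` and `classify_poll_result()` are hypothetical names, not part of libibverbs, and `ret` stands for the value returned by ibv_poll_cq() when it was called with `num_entries`.

```c
#include <assert.h>

enum poll_outcome {
    POLL_FAILED,      /* negative return: reading from the CQ failed */
    POLL_EMPTY,       /* zero: the CQ is empty */
    POLL_DRAINED,     /* fewer than num_entries: no more completions in the CQ */
    POLL_MAYBE_MORE   /* exactly num_entries: the CQ may hold more completions */
};

/* Interprets the return value of a poll call made with num_entries. */
static enum poll_outcome classify_poll_result(int ret, int num_entries)
{
    assert(num_entries > 0);
    if (ret < 0)
        return POLL_FAILED;
    if (ret == 0)
        return POLL_EMPTY;
    if (ret < num_entries)
        return POLL_DRAINED;
    return POLL_MAYBE_MORE;
}
```

A caller that wants to empty the CQ would typically keep polling while the outcome is `POLL_MAYBE_MORE`.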

# Examples

Poll a Work Completion from a CQ (in polling mode):

    struct ibv_wc wc;
    int num_comp;

    do {
        num_comp = ibv_poll_cq(cq, 1, &wc);
    } while (num_comp == 0);

    if (num_comp < 0) {
        fprintf(stderr, "ibv_poll_cq() failed\n");
        return -1;
    }

    /* verify the completion status */
    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "Failed status %s (%d) for wr_id %d\n",
                ibv_wc_status_str(wc.status), wc.status, (int)wc.wr_id);
        return -1;
    }

# FAQs

#### What is that Work Completion anyway?

A Work Completion means that the corresponding Work Request has ended and that its buffer can be (re)used: read, written or freed.

#### Does ibv_poll_cq() cause a context switch?

No. Polling for Work Completions doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

#### Is there a limit to the number of Work Completions that can be polled when calling ibv_poll_cq()?

No. One can read as many Work Completions as one wishes.

No, you can't.

#### I got a Work Completion from the Receive Queue of a UD QP and it ended well. I read the data from the memory buffers and I got bad data. Why?

Maybe you looked at the data starting at offset 0. For any Receive Work Completion of a UD QP, the data is placed at offset 40 of the relevant memory buffers, whether or not a GRH was present.

#### What is this GRH and why do I need it?

The Global Routing Header (GRH) provides information that is most useful for sending a message back to the sender of this message if it came from a different subnet or from a multicast group.

#### I've got a completion with an error status. Can I read all of the Work Completion fields?

No. If the Work Completion status indicates that there is an error, only the following attributes are valid: wr_id, status, qp_num, and vendor_err. The rest of the attributes are undefined.

#### What will happen if more Work Completions than the CQ size are added to it?

There will be a CQ overrun and the CQ (and all of the QPs that are associated with it) will move into the error state.

1. February 27, 2013

I really enjoyed viewing your post. You have a lot of insight and truly opened my eyes with a point you made.

2. September 20, 2013

If i post an RDMA send, how would i know that the receiving side has received the buffer. Does the entry in the Completion queue of the sender, indicate that the receiver has received the data, or does it only indicate that the sender can now reuse the buffer.

Regards

• September 20, 2013

Hi Omar.

The question is: which QP transport type are you using?
Assuming that the Work Completion was ended successfully:

• For Reliable QP (for example, RC): this means that the sent buffer was written at the receiver side.
• For Unreliable QP: this means that the sent buffer can be reused, since the message was already sent.

I hope that this answer helped you.

Thanks
Dotan

• April 4, 2014

Hi,

In your previous post regarding upon the end of a successful Work Completion using RC RDMA Write, you said it means the send buffer was written at the receiver side. My question is what does the "receiver side" mean? Does it mean the user memory at the remote or the HCA on the remote?

I saw some posts that point out that a successful Work Completion for a RDMA Write doesn't mean user can read the data on the receiver buffer.

Did I misunderstand something?

Thanks.

Alan

• April 4, 2014

Hi Alan, this is a great question.

The receiver side is the responder side (remote side).
I mean that the data was received to the remote side HCA and in almost all cases was written to its memory.

However, the remote side doesn't know that the RDMA Write was finished to its memory
(it doesn't have any indication that RDMA Write was performed to its memory or that it was finished).

Sure, it can inspect the memory and see that it was changed, but if the last byte was changed it doesn't necessarily mean that the whole buffer changed.

I think that it is better to be cautious and wait until the remote side has a Work Completion on this QP. But I guess other methods can be used instead.

Thanks
Dotan

• April 4, 2014

Hi Dotan,
Thanks for the fast reply. It answers part of my question. In certain scenario I cannot poll cq on the remote side, so there is no way for me to get and process the Work Completion on the remote. I am not sure if doing something as following would help:
=============================

RDMA_Write(big user data);
RDMA_Read (last byte of the data from remote);
wait Work Completion for both of them.
RDMA_Write (flag);

while (!flag) ;
check data from sending side;
=============================
Please note that the two RDMA_Write may use different QPs. But the RDMA_Read will use the same QP as the 1st RDMA_Write.

Thanks.

Alan

• April 4, 2014

Hi Alan.

I'm sorry, but I didn't understand what you are doing;
which operations is performed by every side, and in which QP.
What is the reason that you try to write and then read the last byte of the data?

Please note that there isn't any guarantee between messages from different queues.

Thanks
Dotan

• April 4, 2014

Hi Dotan,

I am sorry I didn't make it clear.

What I want to know is that if the receiving of the Work Completion of a RDMA_Read which follows a RDMA_Write on the same QP would guarantee (or force) the data of the RDMA_Write being written into the remote memory.

Thanks.

Alan

• April 5, 2014

The question is: why do you assume that the data will be written to memory after the RDMA Read was completed?
(and why do you assume that it won't be written in the first place).

Can you please send me the reference to the post that you are referring to?

Thanks
Dotan

• April 5, 2014

Hi Dotan,

Here is one of the links: http://lists.openfabrics.org/pipermail/general/2007-May/036615.html

The other place is in the print outs we had for IB education years ago.

Regards,

Alan

• April 19, 2014

Hi Alan.

(I don't consider myself a PCI Express or computer architecture expert, so I hope that I'm not confusing you with this answer).

As far as I understand, this question is a little bit tricky; since it isn't related to RDMA.

The same problem can happen to you when you send data using Send opcode as well
(and may happen in other network architecture that allow HW offloads,
and in some cases even when using sockets).

The data that you want to write *may* be different from the data that was actually written to memory, because of bit flips or any other kind of error that may happen between the time the data reached the remote side HW and the time it was written to memory.

Actually, this kind of error can also happen when you are accessing local memory, without performing any data transfer.

So, I think that this issue isn't related to RDMA.

BTW, if you want to make sure that the same content was written you can add checksums to your data.

Thanks
Dotan

3. January 23, 2014

Hi
i want to know one thing. if i get a "IBV_WC_RNR_RETRY_EXC_ERR" when I poll the completion queue, can i repoll the queue after a while or does my queue enter an error state and cannot be used any more.

regards
Omar

• January 23, 2014

Hi Omar.

You are polling a CQ for Completions. If you get a Completion with a bad status
(e.g. "IBV_WC_RNR_RETRY_EXC_ERR"), the QP itself enters the Error state and cannot be used.

However, the CQ itself is still valid and fully functional; if this CQ is being used by several QPs,
one/some of them may get into the Error state and the rest of them can still be fully functional...

Dotan

4. June 25, 2014

Dear Dotan
I want to know if it is necessary to poll the send completion queue after each ibv_post_send whether it's for RDMA WRITE OR normal send. Polling the send completion queue is time consuming and takes almost 10 microseconds on our cluster and if I do not poll the send completion queue, I overflow it after the maximum send queue counter set for the queue pairs. Is it possible that I do not generate a completion entry for send operation. Please share with me some code snippet where I set up the queue pairs such that for each entry added to the send queue no completion is generated.
Hopefully I have made my point clear.

Regards
Omar Khan

• June 26, 2014

Hi Omar.

You don't have to poll the Send Completion Queue after every call to ibv_post_send();
you can create the Queue Pair and specify that a Work Completion isn't needed for each Send Request:

    struct ibv_qp_init_attr attr = {
        .send_cq = ctx->cq,
        .recv_cq = ctx->cq,
        .cap     = {
            .max_send_wr  = 1,
            .max_recv_wr  = rx_depth,
            .max_send_sge = 1,
            .max_recv_sge = 1
        },
        .qp_type    = IBV_QPT_RC,
        .sq_sig_all = 0
    };

When posting a Send Request(s), you need to specify the Send Requests that will generate the Work Completion
(by setting the IBV_SEND_SIGNALED flag):

    struct ibv_send_wr wr = {
        .wr_id      = PINGPONG_SEND_WRID,
        .sg_list    = &list,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    };

I hope that it helped.
I guess that I'll write a post this weekend on selective signalling..

Thanks
Dotan

• June 26, 2014

Dear Dotan

Thanks for your reply. I set send_flags = IBV_SEND_SIGNALED for those send requests for which completion entry is required. What about those for which completion entry in CQ is not required? Do I set the send flag = 0

• June 26, 2014

Dear Dotan

I have tried what you have said about setting .sq_sig_all = 0 and only using .send_flags = IBV_SEND_SIGNALED for those send requests which i need to signal. For those send requests whose completion notification is not required, I set .send_flags = 0. I have also set the .max_send_wr = 1 before creating the queues. But it does not work. If i set the .sq_sig_all = 1 and poll the send completion queue after every ibv_post_send, it works very well but i get a delay of several microseconds.

Regards

• June 26, 2014

Selective signalling works. All we need to do is signal one WR for every SQ-depth worth of WRs posted. For example, If the SQ depth is 16, we must signal at least one out of every 16. This ensures proper flow control for HW resources.
Courtesy: section 8.2.1 of the iWARP Verbs draft http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.2.1

Regards

Omar Khan

• June 26, 2014

Hi Omar.

I'm happy that it is working for you and thanks for the URL that you shared.

Thanks
Dotan
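The rule quoted in this thread (signal at least one Send Request out of every Send Queue depth's worth of posted requests) can be sketched as a small helper that decides the send_flags for each request. This is only an illustration: `send_flags_for()` is a hypothetical name, and `TOY_SEND_SIGNALED` is a local stand-in that assumes the value of `IBV_SEND_SIGNALED` in `<infiniband/verbs.h>`; a real program would use the real flag and would still have to poll the signaled completions.

```c
#include <assert.h>

/* Local stand-in (assumption: same value as IBV_SEND_SIGNALED in verbs.h). */
#define TOY_SEND_SIGNALED (1 << 1)

/* posted:   number of Send Requests already posted before this one
 * sq_depth: depth of the Send Queue (max_send_wr)
 * Returns the send_flags to use so that the last request of every
 * sq_depth-sized group is signaled and the others are not. */
static unsigned send_flags_for(unsigned long posted, unsigned sq_depth)
{
    assert(sq_depth > 0);
    return ((posted + 1) % sq_depth == 0) ? TOY_SEND_SIGNALED : 0;
}
```

With a Send Queue depth of 16, requests number 16, 32, 48 and so on are signaled. With a depth of 1, as in the `max_send_wr = 1` configuration discussed above, every request ends up signaled, which may explain why unsignaled sends did not work in that setup.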

5. November 6, 2014

Hi Dotan,
What if the CQ has 2 entries, but i take only 1 entry by ibv_poll_cq, Will it generate another notification for other one when i will poll it again? or i have take both the entries together?

• November 6, 2014

Hi Aunn.

The question is what do you mean by "notification".
then the next Work Completion that will be added to the CQ will generate Completion event

This notification will happen when a new Work Completion is added to the CQ,
and it doesn't matter if the CQ is empty or not.

Thanks
Dotan

6. December 16, 2014

Hi, Dotan, could you possibly give a hint (maybe somewhere in the literature) on how to organize flow control when a single RCQ (recv completion queue) is shared among multiple QPs. The issue i have is the following. I do maintain necessary level of pre-posted recv WRs in all QPs so that there is no dropped packets. This is easy to do on per-connection (per QP) basis since everybody knows how many recvs are preposted on the other side. But the shared RCQ can be easily overflowed in case its depth < N*num_preposted (N - number of connections). I believe there should be a "gold/commonly_adopted" algorithm for this scenario. Can u suggest anything here?

• December 18, 2014

Sorry, there isn't such algorithm that I'm aware of..
If you'll develop one, it will be great if you'll share it.
:)

You need to be careful not to overflow the CQ and, if needed, work with several CQs;
if you have X QPs and every QP may get Y Work Completions, the CQ size must be bigger than X * Y.

If there can be a case where the CQ won't be big enough, you should use multiple CQs.
Working with Completion Events and an event channel that handles multiple CQs can be useful too.

Dotan
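The sizing rule from this reply can be written down as a simple check. The sketch below takes the reply's wording literally (the CQ size must be bigger than X * Y); `shared_cq_is_big_enough()` and its parameter names are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>

/* cq_size:       number of entries the shared CQ was created with
 * num_qps:       X, the number of QPs that report to this CQ
 * max_wc_per_qp: Y, the most Work Completions one QP may have outstanding
 * Returns non-zero if the CQ is bigger than X * Y, as the reply suggests. */
static int shared_cq_is_big_enough(unsigned long cq_size,
                                   unsigned long num_qps,
                                   unsigned long max_wc_per_qp)
{
    return cq_size > num_qps * max_wc_per_qp;
}
```

For example, 8 QPs with 64 pre-posted Receive Requests each would call for a CQ with more than 512 entries; if the device cannot provide that, the reply's advice is to spread the QPs over several CQs.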

7. December 25, 2014

I can't get any events at the receiving side.
Although I see from the debugger that the contents of the receive buffer has changed. On the server side ibv_poll_cq always return 0. If I use ibv_get_cq_event, then the program will be blocked forever.
Pseudocode:
- Client side:
- ibv_post_send() with IBV_SEND_SIGNALED and opcode=IBV_WR_RDMA_WRITE;
- ibv_poll_cq;
- Server side:
- ibv_poll_cq;

Trying .sq_sig_all = 0 and .sq_sig_all = 1, but the result on server side is the same.
What am I doing wrong?

• December 25, 2014

Hi.

Let me try to understand what is going on:
In the client side, you post a Send Request with an RDMA operation,
and poll for a Work Completion (i.e. ibv_poll_cq() returns a value which isn't 0, and fills a Work Completion structure).

However, in the server side you don't get any completion at all - right?

Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
(this is the whole idea of RDMA).

If you want to get a Work Completion in the receiver side, I suggest that you'll:
1) post a Receive Request at the server side
2) Use RDMA Write with immediate, which will consume the Receive Request in the receiver side and generate a Work Completion.

I hope that this helped you.

Thanks
Dotan

• December 26, 2014

Thank you very much!!! Today did as you said - it all worked perfectly!!!

"Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
(this is the whole idea of RDMA)."
Sorry for the boring, but
how, then, can be found on the remote side that its buffer data were recorded, in addition to my case and the TCP/IP socket?

• December 26, 2014

or so - what is the best way to learn about it

• December 28, 2014

Hi.

I didn't really understand the question here.
But I'll try to explain what I think you meant:
The sender side performs an RDMA Write to the receiver's memory,
and it should hint to the receiver that its memory was changed.

This can be done by sending a Send or an RDMA Write with immediate operation.
One may ask: what is this good for?
Well, the sender can issue several RDMA Writes to the receiver's memory and hint to the receiver only once about all the written memory buffers.

This blog is a good place to start learning RDMA from.
Currently, there isn't any "Getting started" post, but I'll guess that I'll write such in the (near?) future.

Thanks
Dotan

• December 30, 2014

Hi!
Thank you very much!!!
I have achieved transfer rate by 65 KB (interface QDR) about 8 Gbit/s using one QP and four buffers !!!
Happy New Year !!!

• December 31, 2014

Nice...
(This is a very good start)

Happy new year
Dotan

8. January 2, 2015

Hi Dotan,
Happy New Year!!

I'm trying RDMA transfer between two nodes and I observe no work completion WU in the queue. The same application works between two adjacent nodes but when i try to run across the network nodes i observe the above mentioned error.
Then i checked the ibv_rc_pingpong or ibping test, i see the remote address are shared but the transfer didn't happen. But the normal ping to remote node is working fine.

Thanks,
Parthiban

• January 2, 2015

Hi Parthiban.

Which transport are you using (InfiniBand, RoCE, iWARP)?
Can you send me the output of ibv_devinfo?

Thanks
Dotan

9. January 2, 2015

Hi Dotan,
Thanks for the reply. I'm using InfiniBand.

system 1:
hca_id: mlx4_1
transport: InfiniBand (0)
fw_ver: 2.10.630
node_guid: 0025:90ff:ff17:0448
sys_image_guid: 0025:90ff:ff17:044b
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: SM_2191000001000
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 31
port_lid: 4
port_lmc: 0x00

hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:008c:3d80
sys_image_guid: f452:1403:008c:3d83
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00

port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
System 2:
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:008e:e9b0
sys_image_guid: f452:1403:008e:e9b3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 19
port_lid: 1
port_lmc: 0x00

port: 2
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 24
port_lmc: 0x00

• January 2, 2015

Hi.

Which IB port did you try to work with: 1 or 2?

(Since i think that port 1 of the devices isn't managed by the same SM)

Thanks
Dotan

10. January 2, 2015

Yes you are right! there are again two separate IB networks the systems are connected to. I use port 2. one more doubt! if the two ports are connected to different IB network and the same system is configured to run the SM for the two network, will it work properly for both the networks?

Thanks,

• January 2, 2015

Hi.

If you use the same SM for two networks, it becomes one subnet.

If you have two subnets (for example, all port 1 in one subnet and all port 2 in the second one), port 1 in different machines will be able to communicate (the same goes for port 2).

Thanks
Dotan

11. January 3, 2015

Hi Dotan,
I see that

system001:~ # ibv_rc_pingpong
local address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
remote address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::

system002:~ # ibv_rc_pingpong 192.168.96.101
local address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::
remote address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
Failed status transport retry counter exceeded (12) for wr_id 2

and

system001:~ # ibping -S -d -v
ibdebug: [12314] ibping_serv: starting to serve...

system002:~ # ibping -d -v 14
ibdebug: [6738] ibping: Ping..
ibwarn: [6738] ib_vendor_call_via: route Lid 14 data 0x7fff4c7b8c10
ibwarn: [6738] ib_vendor_call_via: class 0x132 method 0x1 attr 0x0 mod 0x0 datasz 216 off 40 res_ex 1
ibwarn: [6738] mad_rpc_rmpp: rmpp (nil) data 0x7fff4c7b8c10
ibwarn: [6738] mad_rpc_rmpp: MAD completed with error status 0xc; dport (Lid 14)
ibdebug: [6738] main: ibping to Lid 14 failed

not able to figure out the reason.

Thanks
Parthiban

• January 4, 2015

Hi.

First of all, in system001, ibv_rc_pingpong prints that the local LID is 0x1b (27 decimal),
but when you executed ibping you used LID 14.

The above failure in ibv_rc_pingpong suggests that there is connectivity problem in your subnet.
Are they both in the same subnet now?

Thanks
Dotan

• January 4, 2015

Hi Dotan,
Yes, both the systems are in same network. If i execute normal ping it works fine. Another scenario is that if I run the RDMA sample application which uses RDMA CM the application is working fine but if use IB verbs it fails with completion wasn't found in the CQ and poll completion failed.
Thanks

12. January 5, 2015

Hi Dotan,
The issue is fixed, actually the bug is the program scans the interfaces and tries to use the interface found first, but that interface is not connected to the same subnet. Now I pass the interface to use and it works!
Thanks,
Parthiban

• January 5, 2015

Great!

Thanks for updating me.

Dotan

• September 13, 2016

Hi Parthiban,

I also tried to run the example provided using IB verbs, but it failed with the same error like yours. "completion wasn't found in the CQ after time out. poll completion failed".

Do you have any suggestions?

Thanks

• September 16, 2016

Hi.

Which example did you try to use?
What is the exact command line and the output that you got?

Thanks
Dotan

13. January 14, 2015

Hi Dotan!
I use the QDR device. How do I use all 4 tires? Experimentally, I found that all clients use a single bus :(. If I run one client the maximum transmission speed is 10 GB/s, if I run 4 client, then the total transfer speed is equal to 10 GB/s, and each client can transmit at 2.5 GB/s...
How can I fill the entire bandwidth, i.e., 40 GB/s???
Thanks!

• January 14, 2015

Hi.

QDR means that the speed of the line is 4 times faster than the base speed.
Base speed: SDR is 2.5 Gb/s.

(assuming that you are using InfiniBand)

Thanks
Dotan

14. February 17, 2015

Hello Dotan,

When program is waiting at ibv_poll_cq(), does it consumes CPU, or does it go idle and wait for an event to wake it up? I'm asking this because I'm now facing a design choice, where I can end up with hundreds of threads (more than cpu cores), each polling on a separate QP for messages, and I was wondering if the waiting threads actually incur any cost to the system.

• February 17, 2015

Hi Floaterions.

When ibv_poll_cq() is called, it consumes CPU (i.e. polling).

If you want to reduce the CPU consumption (and latency isn't an issue),
it is preferred to work with Completion events.

Thanks
Dotan

15. March 9, 2015

Hi, Dotan!

I have a question: I want to see what happens if the Receive Request is not ready on the receiver node (i.e. RNR).

So I post just one Receive Request on the receiver node, and the sender sends several Send Requests in a loop. I expect an IBV_WC_RNR_RETRY_EXC_ERR error in the second iteration.

The first iteration goes as I expected: the receiver receives the message and consumes the Receive Request. However, in the second iteration, the receiver gets an event (ibv_get_cq_event), but the following ibv_poll_cq returns zero, and it blocks in ibv_get_cq_event again.

This seems impossible, because there is an event notification from the completion queue, yet the poll gets nothing. How can this happen?

• March 9, 2015

Oh, I am sorry Dotan. There was a mistake in my last post.

Every Send Request was signaled, and it is the sender that gets an event notification from ibv_get_cq_event but gets 0 from ibv_poll_cq.

The receiver just blocks in ibv_get_cq_event; no error message is thrown.

• March 10, 2015

Hi.

Yes, in RDMA you may get a Completion Event without finding a Work Completion in the Completion Queue
(I've written about it in my posts).

Some questions:
* Are you using Reliable transport types for the Queue Pair?
* If you switch to polling instead of using events do you still have a problem?
* Do you check the status of the Work Completions (in both sides)?
* What are the values of the following attributes: min_rnr_timer, rnr_retry, timeout, retry_cnt?

Thanks
Dotan

16. March 11, 2015

Thanks very much! I will search your blog for this.
I use RC. I have since modified my code and it is in a bit of a mess, so I have to restore it and check the status later.

17. March 12, 2015

Hi Dotan! I have a question about concurrent connection setup.

Suppose I have a server which accepts a lot of clients.

During connection setup, suppose we get a RDMA_CM_EVENT_CONNECT_REQUEST event from one client, then a RDMA_CM_EVENT_CONNECT_REQUEST from another client, and then a RDMA_CM_EVENT_ESTABLISHED event.

Because we use the same event channel and cannot get the connection id when we get the RDMA_CM_EVENT_ESTABLISHED event, which client got established?

I thought maybe RDMA deals with this in another way: once we get a RDMA_CM_EVENT_CONNECT_REQUEST event, we reject connection requests from other clients until we get the RDMA_CM_EVENT_ESTABLISHED for the former client. But if the server fails to get RDMA_CM_EVENT_ESTABLISHED for this client, what happens? Other clients will be rejected forever.

Or should we use a different event channel for each client? That does not seem like a good way.

I wrote a program which uses the main thread for the connection setup, from RDMA_CM_EVENT_CONNECT_REQUEST to RDMA_CM_EVENT_ESTABLISHED. After the RDMA_CM_EVENT_ESTABLISHED event, it dispatches the established connection to another thread and uses the main thread to accept new connections. But when several clients connect to the server simultaneously, only one gets serviced and the others are rejected. In TCP/IP, this is the right way to handle concurrent connections.

I also wonder how to tell which client disconnected when we receive RDMA_CM_EVENT_DISCONNECTED, since we cannot get the connection id from the event.

I have little RDMA programming experience, so I hope this question is not too stupid.

• March 17, 2015

Hi.

I'm sorry, but I don't consider myself an expert (yet) in programming over librdmacm.
There is an example in the rdmacm git repository, called rping.

This example has a persistent mode, and I think that all your questions will be answered from this example.
Please pay attention to the function rping_run_persistent_server().

If you care about specific clients, maybe you can use the private_data field to exchange important information about the remote identity.

I hope that this helps you
Dotan

• March 21, 2015

Thanks Dotan.

I know how to deal with this now. I use one thread to listen for events, use the connection id to relate different events, and dispatch the connection to the thread pool.

• March 29, 2015

Cool.

Thanks for the update
Dotan

18. July 17, 2015

Hi Dotan,
I see a behavior where a completion event (for receive) is triggered, but when I poll the CQ (ib_poll_cq), it returns 0 work completions. Why would a completion event be generated when there are no work completions? Is this normal behavior? If not, where do you suspect the problem could be?

• July 17, 2015

Hi.

Yes. Completion events can be triggered even if there isn't any Work Completion in the Completion Queue.
This can happen if you armed the CQ and then emptied it (thus polling the Work Completion that triggered the event). When you read the event and check the CQ, you may find that the CQ is empty.

I believe that if you check, you'll find that all the Work Completions were read from the CQ before you got this event with an empty CQ.

Thanks
Dotan

• July 17, 2015

Thank you.

19. July 22, 2015

Hi Dotan,

I am trying to create an example of a one sided RDMA READ off the rc_pingpong.c sample from the ibverbs code.
What I have changed is:
1. When creating the memory regions, allow remote reads through the IBV_ACCESS_REMOTE_READ flag.
2. The pp_post_send function to use IBV_WR_RDMA_READ as the opcode.
3. Removed all calls to pp_post_recv.
4. Changed the main while loop, so that the server and the client both poll the cq. Once an event has happened, they exit. Particularly, the server will keep running, and the client exits after it does one run of pp_post_send.

The issue that I am seeing is that on the client side, the work completion returns code IBV_WC_REM_INV_REQ_ERR.
Do you know why this might be? It seems that the qp_access_flags is not used anymore (? or at least when I try and set them, they don't get modified) and the buffers in the pingpong context are still the same 4KB page size. With the permissions set on the memory regions, I am not sure what else is going wrong?

Thanks for any help

• July 22, 2015

Hi Anon.

Did you enable RDMA_READ in qp_attr.qp_access_flags?

Thanks
Dotan

• July 23, 2015

Hi Dotan,

I eventually figured it out. I was setting the qp_access_flags to allow IBV_ACCESS_REMOTE_READ.
The issue is that I misinterpreted what max_dest_rd_atomic and max_rd_atomic fields were used for -- I thought it was only for remote atomic operations. As such, I set them to 0. So when I tried to modify the QP state machine to RTR, the access flags simply didn't update.

Thanks for the help.

• July 23, 2015

I'm glad everything is working for you
:)

Dotan

20. September 14, 2015

Hello Dotan,

I am trying to figure out what the maximum number of scatter/gather entries I can use per one Work Request is.
I have read the FAQ on the ibv_create_qp page, however, I am not seeing the failure when I am trying to create the QP.
What I have is:
1. ibv_query_device returns a max_sge value of 32.
2. I use this in the max_send_sge field of the ibv_qp_cap struct in the ibv_qp_init_attr struct used to create the QP. When the function returns, the value in max_send_sge is updated to 62, to my surprise (I am not sure why...)
3. I then attempt an RDMA READ with an sg_list of length 32, and 31. Each scatter/gather entry has length 1 (i.e. I am reading only one byte from the remote buffer into the local one for each entry). Both of these return IBV_WC_LOC_LEN_ERR as the completion.
4. If I use an sg_list of length 30, everything seems to work.

Do you know why:
a) ibv_create_qp modifies the max_send_sge to be larger than the max_sge value returned from ibv_query_device?
b) The max_sge value seems to be too large, even though creating the QP with that value set in the init attributes returns with no error?

• September 15, 2015

The problem is that ibv_query_device provides one max_sge value for all transport types and for all Work Queues (both Send and Receive),
and sometimes this just isn't enough.

I suspect that there is a bug in the low-level driver and you should use the latest version of it,
and inform the low-level driver provider if this still happens.

Thanks
Dotan

21. September 14, 2015

Hi Dotan,
When posting a signaled RDMA Send from the server side, I receive no WC at the client side and the kernel hangs, even though I have some outstanding Receive Requests posted on the client side. Can you tell me what the reason could be?

• September 15, 2015

Hi.

Signaled RDMA Sends are relevant only to the local side (the remote side isn't aware of the signaling mode).
Is this the first message? (Maybe the QPs weren't connected correctly.)

Thanks
Dotan

22. February 4, 2016

Hi Dotan,

Thanks for your web site that provides a lot of useful information about RDMA and Infiniband.

I want to use an example program provided by Tarick Bedeir (https://thegeekinthecorner.wordpress.com/) for setting up an RDMA connection (using RDMA_CM) between two machines and then call
ibv_post_send()/ibv_post_recv() to send/receive data. Setting up the RDMA connection works fine. However, ibv_post_send() fails on the first attempt to send (I get error IBV_WC_RETRY_EXC_ERR (12)).

Your article on ibv_poll_cq says that "this mean that the connection attributes are wrong or the remote side isn't in a state that it can respond to messages". However, this example program sets the retry_count parameter to 7 (infinite retry). Further, the program uses rdma_create_qp() to create the queue pairs and the RDMA programming manual says that "QPs allocated to an rdma_cm_id are automatically transitioned by the librdmacm through their states. After being allocated, the QP will be ready to handle posting of receives."

I wonder what goes wrong and how I can fix it. The example code is available at

I would be grateful if you have any suggestion for me.

Thanks,
Long

• February 5, 2016

Hi Long.

I didn't have a chance to play with this example (yet).
7 means infinite retries only for rnr_retry.
For retry_cnt, 7 is an actual count of seven retries.

I would suggest executing ibv_rc_pingpong and rping,
to check that your fabric is configured and functioning correctly.

Thanks
Dotan
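
The difference between the two attributes can be sketched as follows. This is only an illustration of the rule stated in the reply above: the helper names are hypothetical (they are not part of libibverbs), and the sketch assumes that both attributes are 3-bit fields, so valid values are 0 to 7.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumption: both attributes are 3-bit fields (valid values 0..7). */
#define RETRY_FIELD_MAX 7

/* rnr_retry: the value 7 is special and means "retry forever"
 * when the remote side answers with RNR NACK. */
static bool rnr_retry_is_infinite(int rnr_retry)
{
    assert(rnr_retry >= 0 && rnr_retry <= RETRY_FIELD_MAX);
    return rnr_retry == 7;
}

/* retry_cnt: there is no special value; 7 means exactly seven retries.
 * Returns the total number of transmission attempts (first try + retries)
 * before the Work Request completes with IBV_WC_RETRY_EXC_ERR. */
static int max_transmit_attempts(int retry_cnt)
{
    assert(retry_cnt >= 0 && retry_cnt <= RETRY_FIELD_MAX);
    return 1 + retry_cnt;
}
```

So a QP configured with retry_cnt = 7 still gives up after a bounded number of attempts when the remote side never acknowledges, which matches the IBV_WC_RETRY_EXC_ERR seen in the question.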

23. March 15, 2016

Hi Dotan,

I am very confused when programming with verbs. I want to use RDMA Read or RDMA Write for handling IO, but I get the error IBV_WC_REM_INV_REQ_ERR (9) at the sender side. I have checked the memory and didn't find anything wrong. I paste some code here, could you give me some suggestions?

//create qp
RCPRINT("client creating qp\n");
qp_attr.cap.max_send_wr = MAX_WR;
qp_attr.cap.max_send_sge = 1;
qp_attr.cap.max_recv_wr = MAX_WR;
qp_attr.cap.max_recv_sge = 1;
qp_attr.send_cq = send_cq;
qp_attr.recv_cq = recv_cq;
qp_attr.qp_type = IBV_QPT_RC;
err = rdma_create_qp(cm_id, pd, &qp_attr);
if (err)
{
    RCPRINT_ERROR("client create qp fail\n");
    clientDestroyRdmaObj(connection);
    return 1;
}

memset(&send_wr, 0, sizeof(send_wr));
send_wr.wr_id = (uint64_t)sge;
send_wr.sg_list = sge;
send_wr.num_sge = 1;
send_wr.send_flags = IBV_SEND_SIGNALED;
send_wr.wr.rdma.rkey = remoteRdma->rkey;

{
    RCPRINT_ERROR("server send rdma opt(%d) fail\n", opCode);
    return RETURN_ERROR;
}

• March 17, 2016

Hi Dotan,

Today I wrote a test program and found that the client can post IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ successfully, but the server can only post IBV_WR_RDMA_WRITE successfully. When the server posts a Send Request with opcode IBV_WR_RDMA_READ, ibv_poll_cq returns a Work Completion with status IBV_WC_REM_INV_REQ_ERR (9), and wc.opcode changes to IBV_WC_SEND. In the test program, I send just one IBV_WR_RDMA_READ. Could you give me some suggestions?

• March 22, 2016

Hi.

I think that the problem is with the permission of the QP or MR in the client side.

Thanks
Dotan

24. March 29, 2016

Thanks very much for your help! I have resolved the problem. When the server calls rdma_accept, I do not assign value to struct rdma_conn_param's member initiator_depth which value is zero default. So struct ibv_qp_attr's member max_rd_atomic is zero also. The server cannot send RDMA_READ operation nerver.

25. April 13, 2016

I posted a Receive Request and call ibv_poll_cq() to see if we received anything. However, after I suddenly kill the program which is supposed to send something to the receiver, the ibv_poll_cq() called by the receiver still keeps returning 0.
It is confusing that ibv_poll_cq() doesn't return a negative value even after the connection has been disconnected.
Do you have any ideas?
Thanks very much!

• April 19, 2016

Hi.

A negative return value from ibv_poll_cq() means that there is an error in the CQ.
The QP doesn't know that the remote side is dead ...
(unless CM is used, and there is a DISCONNECT indication)

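To detect a dead peer without CM, the application can send a message to the remote side and check its completion status
(for example: RDMA Write with 0 bytes - if you are using RC).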

Thanks
Dotan

• April 19, 2016

Thanks very much! Do you know how much the cost of ibv_poll_cq() is? Will it be expensive if I keep calling ibv_poll_cq() frequently? Will it consult the hardware register or just memory?

• April 19, 2016

Hi.

It is hard to answer, since it is device specific.

In Mellanox devices (for example), ibv_poll_cq() accesses memory, which is relatively cheap
(no context switch or any expensive operation).

I can't say for other devices...

Thanks
Dotan

• April 19, 2016

Impressive! Thanks very much!

26. October 8, 2016

Hi Dotan, when exactly is the buffer posted for recv WR updated?
Is it updated when I call ibv_poll_cq or whenever the device accepts a next incoming send WR?
For instance, if I post 10 Recv WRs on the same buffer, and on the other host I post 2 Send WRs, will the contents of the first Send be overwritten? Or can I get the contents of the first Send by polling just the first Recv WR out of the CQ?

• October 11, 2016

Hi.

The Receive WR buffer(s) are filled when the incoming message arrives and the Receive Request is fetched.
Once the whole message has been written to the buffer, a Work Completion is enqueued to the CQ.

In your example, the first message content will be overwritten by the second message.

Thanks
Dotan
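
One common way to avoid this kind of overwrite is to give every Receive Request its own slot in a registered buffer and to use wr_id as the slot index. The following is a minimal sketch of the address arithmetic only; struct recv_pool and recv_slot_addr() are hypothetical names introduced for illustration, and the actual registration and posting calls are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical receive pool: one registered memory region,
 * split into num_slots equal slots of slot_size bytes. */
struct recv_pool {
    uint8_t *base;
    size_t   slot_size;
    size_t   num_slots;
};

/* Address of the slot that belongs to the Receive Request
 * whose wr_id equals `slot`. Each slot is distinct, so two
 * incoming messages never land in the same bytes. */
static uint8_t *recv_slot_addr(const struct recv_pool *pool, uint64_t slot)
{
    assert(pool != NULL && slot < pool->num_slots);
    return pool->base + (size_t)slot * pool->slot_size;
}
```

When posting Receive Request number i, the scatter/gather entry would point at recv_slot_addr(&pool, i) and wr_id would be set to i. When the Work Completion is polled, its wr_id tells which slot holds that message.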

• February 28, 2017

Hi!

I used send/recv to transfer 50 bytes of data over an RC QP. The receiver polled a CQE whose byte_len is exactly 50 and whose status is IBV_WC_SUCCESS, but I cannot find the data in the buffer pointed to by the pre-posted RR. I am very confused about what's going on...

• July 21, 2017

Hi.

I believe that either:
1) There is a bug in the code
2) The program overwrites the buffer (by using multiple Receive Requests that point to the same location, or by writing directly to the buffer)

Thanks
Dotan

27. October 27, 2016

I am working on an NVMe over Fabrics project (RDMA). On the host side, RDMA Read fails with status 5, i.e. the QP moves to the Error state. Can you please tell me why my QP moves to the Error state on the host side during RDMA Read?

• November 4, 2016

Hi.

Work Completion with status 0x5 means: IBV_WC_WR_FLUSH_ERR.
* Is this the first completion with error?
* Is the QP already in error?
* Is there an asynchronous event on the remote side?

Thanks
Dotan
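
For reference, the numeric status values that come up in this thread (5, 9, 12, 13) can be decoded with a small table like the one below. This is a sketch for reading logs, not a replacement for the library: real code can call ibv_wc_status_str() from libibverbs. The function name wc_status_name() is made up for this example, and the table only covers values 0 to 13.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names of enum ibv_wc_status values 0..13, indexed by numeric value. */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",           /* 0 */
    "IBV_WC_LOC_LEN_ERR",       /* 1 */
    "IBV_WC_LOC_QP_OP_ERR",     /* 2 */
    "IBV_WC_LOC_EEC_OP_ERR",    /* 3 */
    "IBV_WC_LOC_PROT_ERR",      /* 4 */
    "IBV_WC_WR_FLUSH_ERR",      /* 5 */
    "IBV_WC_MW_BIND_ERR",       /* 6 */
    "IBV_WC_BAD_RESP_ERR",      /* 7 */
    "IBV_WC_LOC_ACCESS_ERR",    /* 8 */
    "IBV_WC_REM_INV_REQ_ERR",   /* 9 */
    "IBV_WC_REM_ACCESS_ERR",    /* 10 */
    "IBV_WC_REM_OP_ERR",        /* 11 */
    "IBV_WC_RETRY_EXC_ERR",     /* 12 */
    "IBV_WC_RNR_RETRY_EXC_ERR", /* 13 */
};

#define WC_STATUS_NAMES_LEN \
    (sizeof(wc_status_names) / sizeof(wc_status_names[0]))

/* Returns the name for a numeric status, or a placeholder string
 * for values that this (partial) table does not cover. */
static const char *wc_status_name(int status)
{
    assert(WC_STATUS_NAMES_LEN == 14);
    if (status < 0 || (size_t)status >= WC_STATUS_NAMES_LEN)
        return "(not in this table)";
    return wc_status_names[status];
}
```

A flush error (5) is usually a consequence rather than a cause: it is reported for Work Requests that were outstanding when the QP entered the Error state, so the first completion with a different error status is the one worth investigating.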

28. February 20, 2017

Hi!
In the case of an SRQ, is poll_cq not used? I can't understand how I can call it if the input parameter is a CQ. I didn't create a CQ, only an SRQ.

• February 22, 2017

Hi.

The SRQ can't be used by itself;
it is used by the QP(s) to hold the Receive Requests.

The corresponding Work Completion of that Receive Request is enqueued to the CQ that is associated with the Receive Queue
of the QP that received the incoming message.

Thanks
Dotan

29. April 21, 2017

Hello Dotan,

My understanding is this: When receiving an incoming UD Send, the gather region will contain the payload offset by the 40-byte grh. But the grh will only be valid if the sender is in a different subnet or the message is part of a multicast (please correct me if I am mistaken).
My questions are the following:

In the case of a unicast UD Send within the same subnet is the grh actually transported? I.e. can I overload the 40 bytes of the grh with actual payload, since it's going to go to my gather list in the receiver anyway?

In the case of the WR posted on the Receive Queue, is it mandatory to specify a valid, registered region in the sge list to accommodate the GRH, even if the incoming UD Send has no payload (i.e. it sends only the immediate value)?
In the case of sending no payload I should be able to write num_sge=0 when specifying the receive request. Does that hold true for a Receive Request that accommodates a UD Send?

Thank you!

• July 3, 2017

Hi.

You are correct, for a UD QP the packet payload will be placed starting at offset 40 in the Receive Request buffer;
the first 40 bytes will contain a GRH only if the packet contained a GRH.

When posting a Send Request to a UD QP, the sender controls, in the Address Handle, whether or not a GRH will be sent over the wire.
You cannot control the content of the 40 bytes of the GRH; most of the GRH is filled automatically by the RDMA device.

If a GRH isn't present, I *believe* (i.e. I didn't verify it) that you don't have to provide a valid MR key,
since I have a feeling that the RDMA device validates the S/G list provided in the Receive Request only just before it writes content to memory.

The spec says: "Note that for UD QPs, the first 40 bytes of the buffer(s) referred to by the Scatter/Gather list will contain the GRH of the incoming message. If no GRH is present, the contents of first 40 bytes of the buffer(s) will be undefined."

The behavior is implementation dependent; different vendors/devices may behave differently.
I suggest not counting on the implementation of a specific RDMA device and always providing a valid MR.

Thanks
Dotan
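
The 40-byte offset described above can be sketched as a small helper on the receive side. This is an illustration under stated assumptions: struct ud_payload and ud_payload_of() are hypothetical names, and the sketch assumes that byte_len in the Work Completion of a UD receive counts the 40 bytes reserved for the GRH in addition to the payload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes reserved for the GRH at the start of every UD receive buffer. */
#define UD_GRH_BYTES 40u

/* Hypothetical view of the payload part of a UD receive buffer. */
struct ud_payload {
    const uint8_t *data;
    uint32_t       len;
};

/* recv_buf: start of the buffer given in the Receive Request.
 * byte_len: value taken from the Work Completion (assumed to
 *           include the 40 reserved bytes).
 * Returns false if byte_len is too small to be a valid UD receive. */
static bool ud_payload_of(const uint8_t *recv_buf, uint32_t byte_len,
                          struct ud_payload *out)
{
    assert(recv_buf != NULL && out != NULL);
    if (byte_len < UD_GRH_BYTES)
        return false;
    out->data = recv_buf + UD_GRH_BYTES; /* payload starts at offset 40 */
    out->len  = byte_len - UD_GRH_BYTES;
    return true;
}
```

The first 40 bytes should only be interpreted as a GRH when the Work Completion indicates that one is present (the IBV_WC_GRH bit in wc_flags); otherwise their contents are undefined, as the spec quote above says.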

30. December 14, 2017

Hi Dotan,

Is ibv_poll_cq a blocking function? for example, if the CQ is empty, would ibv_poll_cq return 0 immediately, or would it block and only sporadically returns 0?

• December 22, 2017

Hi.

ibv_poll_cq() isn't a blocking function and it will always return immediately.

It returns a negative value in case of an error.
Otherwise, it returns the number of Work Completions that were polled (0 means that no Work Completions were found in that CQ).

Thanks
Dotan

31. January 22, 2018

Hi,
I have a problem: when the sender endpoint posts a Send Request, it gets an "RNR retry counter exceeded" completion (because at that time there is no Receive Request posted on the remote endpoint). After I post a Receive Request on the remote endpoint, I let the sender post a Send Request again; however, it completes with a Work Request Flushed Error. Could you tell me how to solve this problem? I really appreciate it.

• March 2, 2018

Hi.

Once you get a Work Completion with error in an RC QP,
the QP is transitioned to the Error state.

If you want to keep working with that QP, you need to reestablish the connection on both sides
(move the QP to the Reset state, and then bring it back up to the RTS state on both sides).

Thanks
Dotan

32. March 24, 2018

I hit the same issue. Here is my data:
* Is this the first completion with error?
Yes, it is the first error.
* Is the QP already in error?
How do I know whether the QP is in error or not?
* Is there an asynchronous event on the remote side?
No.
I'm using SoftRoCE. Could that be related?

• April 19, 2018

Hi.

You can find out whether a QP is in error by calling ibv_query_qp() and checking the qp_state field.

Personally, I don't have any experience with SoftRoCE, but all RDMA stacks should behave in the same way...

Thanks
Dotan
