ibv_poll_cq()


int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc);

Description

ibv_poll_cq() polls Work Completions from a Completion Queue (CQ).

A Work Completion indicates that a Work Request in a Work Queue associated with the CQ, and all of the outstanding unsignaled Work Requests that were posted to that Work Queue before it, are done. Every Receive Request, every signaled Send Request and every Send Request that ended with an error generates a Work Completion when its processing ends.

When a Work Request ends, a Work Completion is added to the tail of the CQ that its Work Queue is associated with. ibv_poll_cq() checks whether Work Completions are present in a CQ and pops them from the head of the CQ in the order they entered it (FIFO). After a Work Completion has been popped from a CQ, it can't be returned to it.

One should consume Work Completions at a rate that prevents the CQ from being overrun (holding more Work Completions than the CQ size). In case of a CQ overrun, the async event IBV_EVENT_CQ_ERR will be triggered, and the CQ cannot be used anymore.

The struct ibv_wc describes the Work Completion attributes.

struct ibv_wc {
	uint64_t		wr_id;
	enum ibv_wc_status	status;
	enum ibv_wc_opcode	opcode;
	uint32_t		vendor_err;
	uint32_t		byte_len;
	uint32_t		imm_data;
	uint32_t		qp_num;
	uint32_t		src_qp;
	int			wc_flags;
	uint16_t		pkey_index;
	uint16_t		slid;
	uint8_t			sl;
	uint8_t			dlid_path_bits;
};

Here is the full description of struct ibv_wc:

wr_id	The 64-bit value that was associated with the corresponding Work Request
status	Status of the operation. The value can be one of the following enumerated values (numeric value in parentheses):
	IBV_WC_SUCCESS (0) - Operation completed successfully: this means that the corresponding Work Request (and all of the unsignaled Work Requests that were posted before it) ended and the memory buffers that this Work Request refers to are ready to be (re)used.
	IBV_WC_LOC_LEN_ERR (1) - Local Length Error: this happens if a Work Request that was posted in a local Send Queue contains a message that is greater than the maximum message size supported by the RDMA device port that should send the message, or if an Atomic operation whose size is not 8 bytes was sent. This may also happen if a Work Request that was posted in a local Receive Queue isn't big enough to hold the incoming message, or if the incoming message size is greater than the maximum message size supported by the RDMA device port that received the message.
	IBV_WC_LOC_QP_OP_ERR (2) - Local QP Operation Error: an internal QP consistency error was detected while processing this Work Request. This happens if a Work Request that was posted in a local Send Queue of a UD QP contains an Address Handle that is associated with a Protection Domain different from the one the QP is associated with, or an opcode that isn't supported by the transport type of the QP (for example: RDMA Write over a UD QP).
	IBV_WC_LOC_EEC_OP_ERR (3) - Local EE Context Operation Error: an internal EE Context consistency error was detected while processing this Work Request (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
	IBV_WC_LOC_PROT_ERR (4) - Local Protection Error: the buffers in the scatter/gather list of the locally posted Work Request do not reference a Memory Region that is valid for the requested operation.
	IBV_WC_WR_FLUSH_ERR (5) - Work Request Flushed Error: a Work Request was in process or outstanding when the QP transitioned into the Error State.
	IBV_WC_MW_BIND_ERR (6) - Memory Window Binding Error: a failure happened when trying to bind a MW to a MR.
	IBV_WC_BAD_RESP_ERR (7) - Bad Response Error: an unexpected transport layer opcode was returned by the responder. Relevant for RC QPs.
	IBV_WC_LOC_ACCESS_ERR (8) - Local Access Error: a protection error occurred on a local data buffer during the processing of an RDMA Write with Immediate operation sent from the remote node. Relevant for RC QPs.
	IBV_WC_REM_INV_REQ_ERR (9) - Remote Invalid Request Error: the responder detected an invalid message on the channel. Possible causes include: the operation is not supported by this receive queue (qp_access_flags in the remote QP wasn't configured to support this operation), insufficient buffering to receive a new RDMA or Atomic Operation request, or the length specified in an RDMA request is greater than $2^{31}$ bytes. Relevant for RC QPs.
	IBV_WC_REM_ACCESS_ERR (10) - Remote Access Error: a protection error occurred on a remote data buffer to be read by an RDMA Read, written by an RDMA Write or accessed by an atomic operation. This error is reported only on RDMA operations or atomic operations. Relevant for RC QPs.
	IBV_WC_REM_OP_ERR (11) - Remote Operation Error: the operation could not be completed successfully by the responder. Possible causes include a responder QP related error that prevented the responder from completing the request, or a malformed WQE on the Receive Queue. Relevant for RC QPs.
	IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: the local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, it usually means that the connection attributes are wrong or the remote side isn't in a state in which it can respond to messages. If this happens after sending the first message, it usually means that the remote QP isn't available anymore. Relevant for RC QPs.
	IBV_WC_RNR_RETRY_EXC_ERR (13) - RNR Retry Counter Exceeded: the RNR NAK retry count was exceeded. This usually means that the remote side didn't post any WR to its Receive Queue. Relevant for RC QPs.
	IBV_WC_LOC_RDD_VIOL_ERR (14) - Local RDD Violation Error: the RDD associated with the QP does not match the RDD associated with the EE Context (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
	IBV_WC_REM_INV_RD_REQ_ERR (15) - Remote Invalid RD Request: the responder detected an invalid incoming RD message. Causes include a Q_Key or RDD violation (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
	IBV_WC_REM_ABORT_ERR (16) - Remote Aborted Error: for UD or UC QPs associated with an SRQ, the responder aborted the operation.
	IBV_WC_INV_EECN_ERR (17) - Invalid EE Context Number: an invalid EE Context number was detected (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
	IBV_WC_INV_EEC_STATE_ERR (18) - Invalid EE Context State Error: the operation is not legal for the specified EE Context state (unused, since it's relevant only to RD QPs or EE Context, which aren't supported).
	IBV_WC_FATAL_ERR (19) - Fatal Error.
	IBV_WC_RESP_TIMEOUT_ERR (20) - Response Timeout Error.
	IBV_WC_GENERAL_ERR (21) - General Error: another error which isn't one of the above errors.
opcode	The operation that the corresponding Work Request performed. This value controls the way that data was sent, the direction of the data flow and the valid attributes in the Work Completion. The value can be one of the following enumerated values:
	IBV_WC_SEND - Send operation for a WR that was posted to the Send Queue.
	IBV_WC_RDMA_WRITE - RDMA Write operation for a WR that was posted to the Send Queue.
	IBV_WC_RDMA_READ - RDMA Read operation for a WR that was posted to the Send Queue.
	IBV_WC_COMP_SWAP - Compare and Swap operation for a WR that was posted to the Send Queue.
	IBV_WC_FETCH_ADD - Fetch and Add operation for a WR that was posted to the Send Queue.
	IBV_WC_BIND_MW - Memory Window bind operation for a WR that was posted to the Send Queue.
	IBV_WC_RECV - Send data operation for a WR that was posted to a Receive Queue (of a QP or to an SRQ).
	IBV_WC_RECV_RDMA_WITH_IMM - RDMA with immediate for a WR that was posted to a Receive Queue (of a QP or to an SRQ). For this opcode, only a Receive Request was consumed and the sg_list of this RR wasn't used.
vendor_err	Vendor-specific error which provides more information if the completion ended with an error. The value is a hint that the RDMA device's vendor can use to determine the reason for the failure
byte_len	The number of bytes transferred. Relevant in the Receive Queue for incoming Send or RDMA Write with immediate operations; this value doesn't include the length of the immediate data, if it exists. Relevant in the Send Queue for RDMA Read and Atomic operations. For the Receive Queue of a UD QP that is not associated with an SRQ, or for an SRQ that is associated with a UD QP, this value equals the payload of the message plus the 40 bytes reserved for the GRH, whether or not the GRH is present
imm_data	(optional) A 32-bit number, in network order, that is sent along with the payload of a Send or RDMA Write operation with immediate and is placed in a Receive Work Completion rather than in a remote memory buffer. This value is valid if IBV_WC_WITH_IMM is set in wc_flags
qp_num	Local QP number of completed WR. Relevant for Receive Work Completions that are associated with an SRQ
src_qp	Source QP number (remote QP number) of completed WR. Relevant for Receive Work Completions of a UD QP
wc_flags	Flags of the Work Completion. It is either 0 or the bitwise OR of one or more of the following flags:
	IBV_WC_GRH - Indicator that the GRH is present for a Receive Work Completion of a UD QP. If this bit is set, the first 40 bytes of the buffers that were referred to in the Receive Request will contain the GRH of the incoming message. If this bit is cleared, the content of those first 40 bytes is undefined.
	IBV_WC_WITH_IMM - Indicator that imm_data is valid. Relevant for Receive Work Completions.
pkey_index	P_Key index. Relevant for GSI QPs
slid	Source LID (the base LID that this message was sent from). Relevant for Receive Work Completions of a UD QP
sl	Service Level (the SL that this message was sent with). Relevant for Receive Work Completions of a UD QP
dlid_path_bits	Destination LID path bits. Relevant for Receive Work Completions of a UD QP (not applicable for multicast messages)
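
Since imm_data is carried in network order, a receiver usually converts it before interpreting it. The following is a minimal sketch, assuming a POSIX system; the helper names imm_data_to_host() and imm_data_from_host() are hypothetical and are not part of libibverbs:

```c
#include <arpa/inet.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a libibverbs call): convert the imm_data value
 * found in a Receive Work Completion from network order to host order.
 */
static inline uint32_t imm_data_to_host(uint32_t imm_data_net)
{
	return ntohl(imm_data_net);
}

/*
 * Hypothetical helper for the sending side: convert a host-order value
 * to network order before placing it in the Send Request.
 */
static inline uint32_t imm_data_from_host(uint32_t value)
{
	return htonl(value);
}
```

The converted value is meaningful only when IBV_WC_WITH_IMM is set in wc_flags.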

The following test (opcode & IBV_WC_RECV) indicates that a completion is from the Receive Queue.

For Receive Work Completions of a UD QP, the data starts at offset 40 from the start of the posted receive buffer, whether the IBV_WC_GRH bit is set or not.
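
The 40-byte offset for UD receives can be sketched as follows. This is only an illustration: the helper ud_payload() and the constant UD_GRH_RESERVED are hypothetical names, and the function assumes that byte_len was taken from a successful Receive Work Completion of a UD QP (where it includes the 40 reserved bytes):

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes reserved for the GRH at the start of every UD receive buffer. */
#define UD_GRH_RESERVED 40u

/*
 * Hypothetical helper: locate the message payload of a UD receive.
 *   buf         - start of the posted receive buffer
 *   byte_len    - wc.byte_len (payload plus the 40 reserved bytes)
 *   payload_len - out: number of payload bytes
 * Returns a pointer to the payload, or NULL if the arguments are invalid
 * or byte_len is too small to cover the reserved area.
 */
static inline const uint8_t *ud_payload(const void *buf, uint32_t byte_len,
					size_t *payload_len)
{
	if (buf == NULL || payload_len == NULL || byte_len < UD_GRH_RESERVED)
		return NULL;

	*payload_len = byte_len - UD_GRH_RESERVED;
	return (const uint8_t *)buf + UD_GRH_RESERVED;
}
```

The first 40 bytes should be read as a GRH only when IBV_WC_GRH is set in wc_flags; otherwise their content is undefined.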

Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid:

wr_id
status
qp_num
vendor_err

Parameters

Name	Direction	Description
cq	in	Completion Queue that was returned from ibv_create_cq()
num_entries	in	Maximum number of Work Completions to read from the CQ
wc	out	Array of size num_entries of the Work Completions that will be read from the CQ

Return Values

Value	Description
Positive	Number of Work Completions that were read from the CQ and whose values were returned in wc. If this value is less than num_entries, there aren't any more Work Completions in the CQ. If this value equals num_entries, there may be more Work Completions in the CQ
0	The CQ is empty
Negative	A failure occurred while trying to read Work Completions from the CQ
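
The return value rules above can be expressed as a small decision helper. This is a sketch only: the enum poll_outcome and the function classify_poll() are hypothetical names, and ret is assumed to be the value returned by ibv_poll_cq(cq, num_entries, wc):

```c
/* Hypothetical classification of an ibv_poll_cq() return value. */
enum poll_outcome {
	POLL_FAILED,     /* negative: reading from the CQ failed */
	POLL_EMPTY,      /* 0: the CQ is empty */
	POLL_DRAINED,    /* 0 < ret < num_entries: nothing more was queued at the time of the call */
	POLL_MAYBE_MORE  /* ret == num_entries: more Work Completions may be waiting */
};

/* Map the return value and the requested batch size to an outcome. */
static inline enum poll_outcome classify_poll(int ret, int num_entries)
{
	if (ret < 0)
		return POLL_FAILED;
	if (ret == 0)
		return POLL_EMPTY;
	if (ret < num_entries)
		return POLL_DRAINED;
	return POLL_MAYBE_MORE;
}
```

A caller that polls in batches would typically call ibv_poll_cq() again right away on POLL_MAYBE_MORE, and process only the first ret entries of the wc array in every case.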

Examples

Poll a Work Completion from a CQ (in polling mode):

struct ibv_wc wc;
int num_comp;
 
do {
	num_comp = ibv_poll_cq(cq, 1, &wc);
} while (num_comp == 0);
 
if (num_comp < 0) {
	fprintf(stderr, "ibv_poll_cq() failed\n");
	return -1;
}
 
/* verify the completion status */
if (wc.status != IBV_WC_SUCCESS) {
	fprintf(stderr, "Failed status %s (%d) for wr_id %llu\n",
		ibv_wc_status_str(wc.status),
		wc.status, (unsigned long long)wc.wr_id);
	return -1;
}

FAQs

What is that Work Completion anyway?

A Work Completion means that the corresponding Work Request has ended and its buffer can be (re)used: read, written or freed.

Does ibv_poll_cq() cause a context switch?

No. Polling for Work Completions doesn't cause a context switch at all; this is why RDMA technologies can achieve very low latency (below 1 usec).

Is there a limit to the number of Work Completions that can be polled when calling ibv_poll_cq()?

No. One can read as many Work Completions as one wishes.

I called ibv_poll_cq() and it filled all of the array that I've provided to it. Can I know how many more Work Completions exist in the CQ?

No, you can't.

I got a Work Completion from the Receive Queue of a UD QP and it ended well. I read the data from the memory buffers and I got bad data. Why?

Maybe you looked at the data starting at offset 0. For any Receive Work Completion of a UD QP, the data is placed at offset 40 of the relevant memory buffers, no matter if the GRH was present or not.

What is this GRH and why do I need it?

The Global Routing Header (GRH) provides information that is most useful for sending a message back to the sender of this message if it came from a different subnet or from a multicast group.

I've got completion with error status. Can I read all of the Work Completion fields?

No. If the Work Completion status indicates that there is an error, only the following attributes are valid: wr_id, status, qp_num, and vendor_err. The rest of the attributes are undefined.

I read a Work Completion from the CQ and I don't need it. Can I return it to the CQ?

No, you can't.

Can I read Work Completion that belongs to a specific Work Queue?

No, you can't.

What will happen if more Work Completions than the CQ size are added to it?

There will be a CQ overrun and the CQ (and all of the QPs that are associated with it) will move into the error state.

Written by: Dotan Barak on February 15, 2013. Last updated on April 17, 2015.

Comments


Omar Khan says: September 20, 2013

If I post an RDMA Send, how would I know that the receiving side has received the buffer? Does the entry in the Completion Queue of the sender indicate that the receiver has received the data, or does it only indicate that the sender can now reuse the buffer?

Regards

- Dotan Barak says: September 20, 2013
  Hi Omar.
  
  The question is: which QP transport type are you using?
  Assuming that the Work Completion ended successfully:
  
  For Reliable QP (for example, RC): this means that the sent buffer was written at the receiver side.
  
  For Unreliable QP: this means that the sent buffer can be reused, since the message was already sent.
  
  I hope that this answer helped you.
  
  Thanks
  Dotan
  - Alan says: April 4, 2014
    
    Hi,
    
    In your previous reply about a successful Work Completion for an RC RDMA Write, you said it means the sent buffer was written at the receiver side. My question is: what does the "receiver side" mean? Does it mean the user memory at the remote side or the HCA on the remote side?
    
    I saw some posts that point out that a successful Work Completion for an RDMA Write doesn't mean that the user can read the data in the receiver's buffer.
    
    Did I misunderstand something?
    
    Thanks.
    
    Alan
  - Dotan Barak says: April 4, 2014
    
    Hi Alan, this is a great question.
    
    The receiver side is the responder side (remote side).
    I mean that the data was received to the remote side HCA and in almost all cases was written to its memory.
    
    However, the remote side doesn't know that the RDMA Write was finished to its memory
    (it doesn't have any indication that RDMA Write was performed to its memory or that it was finished).
    
    Sure, it can inspect the memory and see that it was changed, but if the last byte was changed it doesn't necessarily mean that the whole buffer changed.
    
    I think that it is better to be cautious and wait until the remote side has a Work Completion on this QP. But I guess other methods can be used instead.
    
    Did I answer your question?
    
    Thanks
    Dotan
  - Alan says: April 4, 2014
    
    Hi Dotan,
    Thanks for the fast reply. It answers part of my question. In certain scenarios I cannot poll the CQ on the remote side, so there is no way for me to get and process the Work Completion on the remote side. I am not sure if doing something like the following would help:
    =============================
    
    RDMA_Write(big user data);
    RDMA_Read (last byte of the data from remote);
    wait Work Completion for both of them.
    RDMA_Write (flag);
    
    while (!flag) ;
    check data from sending side;
    =============================
    Please note that the two RDMA_Writes may use different QPs. But the RDMA_Read will use the same QP as the 1st RDMA_Write.
    
    Thanks.
    
    Alan
  - Dotan Barak says: April 4, 2014
    
    Hi Alan.
    
    I'm sorry, but I didn't understand what you are doing:
    which operations are performed by each side, and on which QP?
    What is the reason that you try to write and then read the last byte of the data?
    
    Please note that there isn't any guarantee between messages from different queues.
    
    Thanks
    Dotan
  - Alan says: April 4, 2014
    
    Hi Dotan,
    
    I am sorry I didn't make it clear.
    
    What I want to know is whether receiving the Work Completion of an RDMA_Read that follows an RDMA_Write on the same QP would guarantee (or force) that the data of the RDMA_Write has been written into the remote memory.
    
    Thanks.
    
    Alan
  - Dotan Barak says: April 5, 2014
    
    The question is: why do you assume that the data will be written to memory after the RDMA Read is completed?
    (and why do you assume that it won't be written in the first place?)
    
    Can you please send me the reference to the post that you are referring to?
    
    Thanks
    Dotan
  - Alan says: April 5, 2014
    
    Hi Dotan,
    
    Here is one of the links: http://lists.openfabrics.org/pipermail/general/2007-May/036615.html
    
    The other place is in the print outs we had for IB education years ago.
    
    Regards,
    
    Alan
  - Dotan Barak says: April 19, 2014
    
    Hi Alan.
    
    (I don't consider myself a PCI Express or computer architecture expert, so I hope that I'm not confusing you with this answer).
    
    As far as I understand, this question is a little bit tricky; since it isn't related to RDMA.
    
    The same problem can happen to you when you send data using Send opcode as well
    (and may happen in other network architecture that allow HW offloads,
    and in some cases even when using sockets).
    
    The data that you want to write *may* be different from the data that was actually written to memory, because of bit flips or any other kind of error that may happen between the time the data reached the remote side HW and the time it was written to memory.
    
    Actually, this kind of error can happen when you are accessing local memory, without performing any data transfer at all.
    
    So, I think that this issue isn't related to RDMA.
    
    BTW, if you want to make sure that the same content was written you can add checksums to your data.
    
    Thanks
    Dotan
Omar Khan says: January 23, 2014

Hi
I want to know one thing: if I get an "IBV_WC_RNR_RETRY_EXC_ERR" when I poll the Completion Queue, can I poll the queue again after a while, or does my queue enter an error state so that it cannot be used anymore?

regards
Omar

- Dotan Barak says: January 23, 2014
  
  Hi Omar.
  
  You are polling a CQ for a Completion. If you get a Completion with a bad status
  (e.g. "IBV_WC_RNR_RETRY_EXC_ERR"), the QP itself enters the error state and cannot be used.
  
  However, the CQ itself is still valid and fully functional; if this CQ is being used by several QPs,
  one/some of them may get into the error state and the rest of them can still be fully functional...
  
  I hope that I answered your question
  Dotan
  
Omar Khan says: June 25, 2014

Dear Dotan
I want to know if it is necessary to poll the send Completion Queue after each ibv_post_send(), whether it's for an RDMA Write or a normal Send. Polling the send Completion Queue is time consuming and takes almost 10 microseconds on our cluster, and if I do not poll the send Completion Queue, I overflow it after the maximum number of Send Requests set for the Queue Pairs. Is it possible not to generate a completion entry for a send operation? Please share with me a code snippet that sets up the Queue Pairs such that no completion is generated for each entry added to the Send Queue.
Hopefully I have made my point clear.

Regards
Omar Khan

- Dotan Barak says: June 26, 2014
  
  Hi Omar.
  
  You don't have to poll the Send Completion Queue after every call to ibv_post_send();
  you can create the Queue Pair and specify that a Work Completion isn't needed for each Send Request:
  
  struct ibv_qp_init_attr attr = {
  	.send_cq = ctx->cq,
  	.recv_cq = ctx->cq,
  	.cap     = {
  		.max_send_wr  = 1,
  		.max_recv_wr  = rx_depth,
  		.max_send_sge = 1,
  		.max_recv_sge = 1
  	},
  	.qp_type    = IBV_QPT_RC,
  	.sq_sig_all = 0
  };
  
  When posting a Send Request(s), you need to specify the Send Requests that will generate the Work Completion
  (by setting the IBV_SEND_SIGNALED flag):
  
  struct ibv_send_wr wr = {
  	.wr_id      = PINGPONG_SEND_WRID,
  	.sg_list    = &list,
  	.num_sge    = 1,
  	.opcode     = IBV_WR_SEND,
  	.send_flags = IBV_SEND_SIGNALED,
  };
  
  I hope that it helped.
  I guess that I'll write a post this weekend on selective signalling..
  
  Thanks
  Dotan
  
  - Omar Khan says: June 26, 2014
    
    Dear Dotan
    
    Thanks for your reply. I set send_flags = IBV_SEND_SIGNALED for those Send Requests for which a completion entry is required. What about those for which a completion entry in the CQ is not required? Do I set send_flags = 0?
  - Omar says: June 26, 2014
    
    Dear Dotan
    
    I have tried what you said about setting .sq_sig_all = 0 and using .send_flags = IBV_SEND_SIGNALED only for those Send Requests which I need to signal. For those Send Requests whose completion notification is not required, I set .send_flags = 0. I have also set .max_send_wr = 1 before creating the queues. But it does not work. If I set .sq_sig_all = 1 and poll the send Completion Queue after every ibv_post_send(), it works very well, but I get a delay of several microseconds.
    Please help me out in this.
    
    Regards
  - Omar says: June 26, 2014
    
    Selective signalling works. All we need to do is signal one WR for every SQ-depth worth of WRs posted. For example, If the SQ depth is 16, we must signal at least one out of every 16. This ensures proper flow control for HW resources.
    Courtesy: section 8.2.1 of the iWARP Verbs draft http://tools.ietf.org/html/draft-hilland-rddp-verbs-00#section-8.2.1
    
    Regards
    
    Omar Khan
  - Dotan Barak says: June 26, 2014
    
    Hi Omar.
    
    I'm happy that it is working for you and thanks for the URL that you shared.
    
    Thanks
    Dotan
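
The rule described in this thread (signal at least one Work Request out of every Send Queue depth worth of posted Work Requests) can be sketched as a small counter. This is an illustration only: struct signal_pacer and its functions are hypothetical names, and the caller is assumed to set IBV_SEND_SIGNALED in send_flags whenever signal_pacer_next() returns true, and 0 otherwise:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bookkeeping for one Send Queue. */
struct signal_pacer {
	uint32_t interval;   /* signal one WR out of every `interval` posted */
	uint32_t unsignaled; /* WRs posted since the last signaled one */
};

/* interval is expected to be at most the Send Queue depth (max_send_wr). */
static inline void signal_pacer_init(struct signal_pacer *p, uint32_t interval)
{
	p->interval = interval ? interval : 1; /* 0 is treated as "signal every WR" */
	p->unsignaled = 0;
}

/* Returns true when the WR about to be posted should be signaled. */
static inline bool signal_pacer_next(struct signal_pacer *p)
{
	if (++p->unsignaled >= p->interval) {
		p->unsignaled = 0;
		return true;
	}
	return false;
}
```

With a Send Queue depth of 16, for example, signal_pacer_init(&p, 16) makes every 16th Send Request signaled. The resulting Work Completion still has to be polled from the CQ so that the Send Queue entries can be reused.
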
Aunn Raza says: November 6, 2014

Hi Dotan,
What if the CQ has 2 entries, but I take only 1 entry with ibv_poll_cq()? Will it generate another notification for the other one when I poll it again, or do I have to take both entries together?

- Dotan Barak says: November 6, 2014
  
  Hi Aunn.
  
  The question is what you mean by "notification".
  If you are talking about a Completion Notification,
  then the next Work Completion that is added to the CQ will generate a Completion event
  (if you asked to get this notification in the first place).
  
  This notification will happen when a new Work Completion is added to the CQ,
  and it doesn't matter if the CQ is empty or not.
  
  I hope that I answered your question.
  
  Thanks
  Dotan
  
Valentin Petrov says: December 16, 2014

Hi, Dotan, could you possibly give a hint (maybe somewhere in the literature) on how to organize flow control when a single RCQ (recv completion queue) is shared among multiple QPs? The issue I have is the following. I maintain the necessary level of pre-posted recv WRs in all QPs so that there are no dropped packets. This is easy to do on a per-connection (per-QP) basis, since everybody knows how many recvs are preposted on the other side. But the shared RCQ can easily overflow in case its depth < N*num_preposted (N - number of connections). I believe there should be a "gold/commonly_adopted" algorithm for this scenario. Can you suggest anything here?

- Dotan Barak says: December 18, 2014
  
  Sorry, there isn't any such algorithm that I'm aware of..
  If you develop one, it will be great if you share it.
  :)
  
  You need to be careful not to overflow the CQ, and if needed work with several CQs;
  if you have X QPs and every QP may get Y Work Completions, make sure that the CQ size is bigger than X * Y.
  
  If there can be a case where the CQ won't be big enough, you should use multiple CQs.
  Working with Completion Events and an event channel that handles multiple CQs can be useful too.
  
  Dotan
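
The sizing rule from this reply can be written as a small check. This is a sketch under stated assumptions: the function names are hypothetical, X is num_qps, Y is max_wc_per_qp (the largest number of Work Completions one QP may have outstanding in the CQ), and the comparison follows the "bigger than X * Y" wording of the reply:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: worst-case number of Work Completions that
 * num_qps QPs can add to a shared CQ when each may add max_wc_per_qp.
 * Returns false if the product does not fit in 32 bits.
 */
static inline bool cq_worst_case_entries(uint32_t num_qps,
					 uint32_t max_wc_per_qp,
					 uint32_t *entries)
{
	uint64_t need = (uint64_t)num_qps * max_wc_per_qp;

	if (need > UINT32_MAX)
		return false;
	*entries = (uint32_t)need;
	return true;
}

/* True when cq_depth is bigger than the worst case X * Y. */
static inline bool cq_depth_sufficient(uint32_t cq_depth, uint32_t num_qps,
				       uint32_t max_wc_per_qp)
{
	uint32_t need;

	return cq_worst_case_entries(num_qps, max_wc_per_qp, &need) &&
	       cq_depth > need;
}
```

When the check fails, the options mentioned in the reply apply: a larger CQ or several CQs.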
  
Starichok says: December 25, 2014

Hi! Please, help me!
I can't get any events at the receiving side.
Although I see in the debugger that the contents of the receive buffer have changed, on the server side ibv_poll_cq() always returns 0. If I use ibv_get_cq_event(), the program blocks forever.
Pseudocode:
- Client side:
- ibv_post_send() with IBV_SEND_SIGNALED and opcode=IBV_WR_RDMA_WRITE;
- ibv_poll_cq;
- Server side:
- ibv_poll_cq;

Trying .sq_sig_all = 0 and .sq_sig_all = 1, but the result on server side is the same.
What am I doing wrong?

- Dotan Barak says: December 25, 2014
  
  Hi.
  
  Let me try to understand what is going on:
  On the client side, you post a Send Request with an RDMA operation,
  and poll for a Work Completion (i.e. poll_cq returns a value which isn't 0, and fills a Work Completion structure).
  
  However, in the server side you don't get any completion at all - right?
  
  Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
  (this is the whole idea of RDMA).
  
  If you want to get a Work Completion on the receiver side, I suggest that you:
  1) post a Receive Request at the server side
  2) Use RDMA Write with immediate, which will consume the Receive Request in the receiver side and generate a Work Completion.
  
  I hope that this helped you.
  
  Thanks
  Dotan
  
  - Starichok says: December 26, 2014
    
    Thank you very much!!! Today did as you said - it all worked perfectly!!!
    
    "Since you are using RDMA Write, you shouldn't get any Work Completion in the receiver side at all
    (this is the whole idea of RDMA)."
    Sorry to bother you, but
    how, then, can the remote side find out that data was written to its buffer, other than the way I did it or by using a TCP/IP socket?
  - Starichok says: December 26, 2014
    
    or so - what is the best way to learn about it
  - Dotan Barak says: December 28, 2014
    
    Hi.
    
    I didn't really understand the question here.
    But I'll try to explain what I think you meant:
    The sender side performs an RDMA Write to the receiver's memory,
    and it should hint to the receiver that its memory was changed.
    
    This can be done by sending a Send or an RDMA Write with immediate operation.
    One may ask: what is this good for?
    Well, the sender can issue several RDMA Writes to the receiver's memory and hint to the receiver only once about all the written memory buffers.
    
    This blog is a good place to start learning RDMA from.
    Currently, there isn't any "Getting started" post, but I guess that I'll write one in the (near?) future.
    
    Thanks
    Dotan
  - Starichok says: December 30, 2014
    
    Hi!
    Thank you very much!!!
    I have achieved a transfer rate of about 8 Gbit/s with 65 KB transfers (QDR interface), using one QP and four buffers !!!
    Happy New Year !!!
  - Dotan Barak says: December 31, 2014
    
    Nice...
    (This is a very good start)
    
    Happy new year
    Dotan
Parthiban says: January 2, 2015

Hi Dotan,
Happy New Year!!

I'm trying RDMA transfer between two nodes and I observe no work completion WU in the queue. The same application works between two adjacent nodes but when i try to run across the network nodes i observe the above mentioned error.
Then i checked the ibv_rc_pingpong or ibping test, i see the remote address are shared but the transfer didn't happen. But the normal ping to remote node is working fine.

Thanks,
Parthiban

- Dotan Barak says: January 2, 2015
  
  Hi Parthiban.
  
  I need some more information:
  Which transport are you using (InfiniBand, RoCE, iWARP)?
  Can you send me the output of ibv_devinfo?
  
  Thanks
  Dotan
  
Parthiban says: January 2, 2015

Hi Dotan,
Thanks for the reply. I'm using InfiniBand.

system 1:
hca_id: mlx4_1
transport: InfiniBand (0)
fw_ver: 2.10.630
node_guid: 0025:90ff:ff17:0448
sys_image_guid: 0025:90ff:ff17:044b
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: SM_2191000001000
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 31
port_lid: 4
port_lmc: 0x00
link_layer: IB

hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:008c:3d80
sys_image_guid: f452:1403:008c:3d83
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 1
port_lmc: 0x00
link_layer: IB

port: 2
state: PORT_DOWN (1)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 0
port_lid: 0
port_lmc: 0x00
link_layer: IB
System 2:
hca_id: mlx4_0
transport: InfiniBand (0)
fw_ver: 2.32.5100
node_guid: f452:1403:008e:e9b0
sys_image_guid: f452:1403:008e:e9b3
vendor_id: 0x02c9
vendor_part_id: 4099
hw_ver: 0x0
board_id: MT_1090120019
phys_port_cnt: 2
port: 1
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 19
port_lid: 1
port_lmc: 0x00
link_layer: IB

port: 2
state: PORT_ACTIVE (4)
max_mtu: 2048 (4)
active_mtu: 2048 (4)
sm_lid: 1
port_lid: 24
port_lmc: 0x00
link_layer: IB

- Dotan Barak says: January 2, 2015
  
  Hi.
  
  Which IB port did you try to work with: 1 or 2?
  
  (Since I think that port 1 of the devices isn't managed by the same SM)
  
  Thanks
  Dotan
  
  
Parthiban says: January 2, 2015

Yes, you are right! There are two separate IB networks that the systems are connected to. I use port 2. One more question: if the two ports are connected to different IB networks and the same system is configured to run the SM for both networks, will it work properly for both of them?

Thanks,

- Dotan Barak says: January 2, 2015
  
  Hi.
  
  If you use the same SM for two networks, it becomes one subnet.
  
  If you have two subnets (for example, all port 1s in one subnet and all port 2s in the second one), port 1 on different machines will be able to communicate with each other (the same goes for port 2).
  
  Thanks
  Dotan
  
Parthiban says: January 3, 2015

Hi Dotan,
I see that

system001:~ # ibv_rc_pingpong
local address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
remote address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::

system002:~ # ibv_rc_pingpong 192.168.96.101
local address: LID 0x0001, QPN 0x60004a, PSN 0xbb7261, GID ::
remote address: LID 0x001b, QPN 0x340049, PSN 0xa06196, GID ::
Failed status transport retry counter exceeded (12) for wr_id 2

and

system001:~ # ibping -S -d -v
ibdebug: [12314] ibping_serv: starting to serve...

system002:~ # ibping -d -v 14
ibdebug: [6738] ibping: Ping..
ibwarn: [6738] ib_vendor_call_via: route Lid 14 data 0x7fff4c7b8c10
ibwarn: [6738] ib_vendor_call_via: class 0x132 method 0x1 attr 0x0 mod 0x0 datasz 216 off 40 res_ex 1
ibwarn: [6738] mad_rpc_rmpp: rmpp (nil) data 0x7fff4c7b8c10
ibwarn: [6738] mad_rpc_rmpp: MAD completed with error status 0xc; dport (Lid 14)
ibdebug: [6738] main: ibping to Lid 14 failed

not able to figure out the reason.

Thanks
Parthiban

Reply
- Dotan Barak says: January 4, 2015
  
  Hi.
  
  First of all, in system001, ibv_rc_pingpong prints that the local LID is 0x1b (27 decimal),
  but when you executed ibping you used LID 14.
  
  The above failure in ibv_rc_pingpong suggests that there is a connectivity problem in your subnet.
  Are they both in the same subnet now?
  
  Thanks
  Dotan
  
  Reply
  - Parthiban says: January 4, 2015
    
    Hi Dotan,
    Yes, both systems are in the same network. If I execute a normal ping, it works fine. Another scenario: if I run the RDMA sample application that uses RDMA CM, it works fine, but if I use IB verbs it fails with "completion wasn't found in the CQ" and "poll completion failed".
    Thanks
Parthiban says: January 5, 2015

Hi Dotan,
The issue is fixed. The bug was that the program scans the interfaces and tries to use the first interface it finds, but that interface is not connected to the same subnet. Now I pass the interface to use explicitly and it works!
Thanks,
Parthiban

Reply
- Dotan Barak says: January 5, 2015
  
  Great!
  
  Thanks for updating me.
  
  Dotan
  
  Reply
- yuzhen says: September 13, 2016
  
  Hi Parthiban,
  
  I also tried to run the example provided using IB verbs, but it failed with the same error as yours: "completion wasn't found in the CQ after time out. poll completion failed".
  
  Do you have any suggestions?
  
  Thanks
  
  Reply
  - Dotan Barak says: September 16, 2016
    
    Hi.
    
    Which example did you try to use?
    What is the exact command line and the output that you got?
    
    Thanks
    Dotan
Anonymous says: January 14, 2015

Hi Dotan!
I use a QDR device. How do I use all 4 lanes? Experimentally, I found that all clients share a single bus :(. If I run one client, the maximum transmission speed is 10 GB/s; if I run 4 clients, the total transfer speed is still 10 GB/s, and each client transmits at 2.5 GB/s...
How can I fill the entire bandwidth, i.e., 40 GB/s?
Thanks!

Reply
- Dotan Barak says: January 14, 2015
  
  Hi.
  
  QDR means that the speed of the line is 4 times faster than the base speed.
  Base speed: SDR is 2.5 Gb/s.
  
  Please execute 'ibstat | grep Rate' to get the maximum supported BW for your adapter.
  (assuming that you are using InfiniBand)
  
  Thanks
  Dotan
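
The arithmetic behind this answer can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical (they are not part of libibverbs), the 4-lane width is an assumption about a typical 4X port, and the 8b/10b factor applies to SDR/DDR/QDR links only.

```c
#include <assert.h>
#include <stdint.h>

/* Per-lane signalling rate of SDR InfiniBand in Mb/s (2.5 Gb/s). */
#define IB_SDR_LANE_MBPS 2500u

/* Raw signalling rate: base rate x speed multiplier x number of lanes.
 * speed_mult: 1 = SDR, 2 = DDR, 4 = QDR. lanes: e.g. 1 or 4 (assumed). */
uint32_t ib_signal_rate_mbps(uint32_t speed_mult, uint32_t lanes)
{
    assert(speed_mult > 0 && lanes > 0);
    return IB_SDR_LANE_MBPS * speed_mult * lanes;
}

/* Usable data rate after 8b/10b line encoding (SDR/DDR/QDR only). */
uint32_t ib_data_rate_mbps(uint32_t speed_mult, uint32_t lanes)
{
    return ib_signal_rate_mbps(speed_mult, lanes) / 10u * 8u;
}
```

Under these assumptions, ib_signal_rate_mbps(4, 4) gives 40000 Mb/s for a 4X QDR port, of which 32000 Mb/s is usable for data. The whole port shares this rate, so several clients on one port split it rather than each getting the full amount.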
  
  Reply
Floaterions says: February 17, 2015

Hello Dotan,

When a program is waiting at ibv_poll_cq(), does it consume CPU, or does it go idle and wait for an event to wake it up? I'm asking this because I'm now facing a design choice, where I can end up with hundreds of threads (more than CPU cores), each polling on a separate QP for messages, and I was wondering if the waiting threads actually incur any cost to the system.
Thank you for your help

Reply
- Dotan Barak says: February 17, 2015
  
  Hi Floaterions.
  
  When ibv_poll_cq() is called, it consumes CPU (i.e. polling).
  
  If you want to reduce the CPU consumption (and latency isn't an issue),
  it is preferred to work with Completion events.
  
  Thanks
  Dotan
  
  Reply
DjvuLee says: March 9, 2015

Hi, Dotan!

I want to see what happens when a Receive Request is not ready on the receiver node (i.e. RNR).

So I post just one Receive Request on the receiver node, and the sender sends several Send Requests in a loop. I expected an IBV_WC_RNR_RETRY_EXC_ERR error in the second iteration.

The first iteration went as I expected: the receiver received the message and consumed the Receive Request. However, in the second iteration, the receiver got an event (ibv_get_cq_event), but the following ibv_poll_cq returned zero, and it blocked in ibv_get_cq_event again.

This seems impossible, because there was an event notification from the Completion Queue, yet the poll got nothing. How did this happen?

Reply
- DjvuLee says: March 9, 2015
  
  Oh, I am a little sorry, Dotan. There was a mistake in my last post.
  
  Every Send Request was signaled, and it is the sender that got an event notification from ibv_get_cq_event but got 0 from ibv_poll_cq.
  
  The receiver just blocks in ibv_get_cq_event; no error message is thrown.
  
  Reply
  - Dotan Barak says: March 10, 2015
    
    Hi.
    
    Yes, in RDMA you may get a Completion Event without finding a Work Completion in the Completion Queue
    (I've written about it in my posts).
    
    Some questions:
    * Are you using Reliable transport types for the Queue Pair?
    * If you switch to polling instead of using events do you still have a problem?
    * Do you check the status of the Work Completions (in both sides)?
    * What are the values of the following attributes: min_rnr_timer, rnr_retry, timeout, retry_cnt?
    
    Thanks
    Dotan
DjvuLee says: March 11, 2015

Thanks very much! I will search your blog to see this.
I use RC. I modified my code later and made a bit of a mess, so I have to restore my code and check the status later.

Reply
DjvuLee says: March 12, 2015

Hi Dotan! I have a question about setting up connections concurrently.

Suppose I have a server that accepts a lot of clients.

During connection setup, suppose we get a RDMA_CM_EVENT_CONNECT_REQUEST event from one client, then a RDMA_CM_EVENT_CONNECT_REQUEST from another client, and then a RDMA_CM_EVENT_ESTABLISHED event.

Because we use the same event channel, and we cannot get the connection ID when we get the RDMA_CM_EVENT_ESTABLISHED event, which client got established?

I thought that maybe RDMA deals with this in another way: once we get a RDMA_CM_EVENT_CONNECT_REQUEST event, connection requests from other clients are rejected until we get the RDMA_CM_EVENT_ESTABLISHED for the former client. But if the server fails to get RDMA_CM_EVENT_ESTABLISHED for this client, what will that lead to? Other clients will be rejected forever.

Or should we use a different event channel for each client? That doesn't seem like a good way.

I wrote a program that uses the main thread for the connection setup from RDMA_CM_EVENT_CONNECT_REQUEST to RDMA_CM_EVENT_ESTABLISHED. After the RDMA_CM_EVENT_ESTABLISHED event, we dispatch the established connection to another thread and use the main thread to accept new connections. But when several clients connect to the server simultaneously, only one gets serviced and the others are rejected. In TCP/IP, this is the right way to handle concurrent connections.

I also wonder how to find out which client disconnected when we receive RDMA_CM_EVENT_DISCONNECTED, since we cannot get the connection ID from the event.

I have little RDMA programming experience, so I hope this question is not too stupid.

Reply
- Dotan Barak says: March 17, 2015
  
  Hi.
  
  I'm sorry, but I don't consider myself an expert (yet) in programming over librdmacm.
  There is an example in the rdmacm git repository, called rping.
  
  This example has a persistent mode, and I think that all your questions will be answered from this example.
  Please pay attention to the function rping_run_persistent_server().
  
  If you care about specific clients, maybe you can use the private_data field to exchange important information about the remote identity.
  
  I hope that this helps you
  Dotan
  
  Reply
  - DjvuLee says: March 21, 2015
    
    Thanks Dotan.
    
    I know how to deal with this now. I use just one thread to listen for events, use the connection ID to relate the different events, and dispatch the connection to the thread pool.
  - Dotan Barak says: March 29, 2015
    
    Cool.
    
    Thanks for the update
    Dotan
Avis says: July 17, 2015

Hi Dotan,
I see a behavior where a completion event (for receive) is triggered, but when I poll the CQ (ib_poll_cq), it returns 0 Work Completions. Why would a completion event be generated when there are no Work Completions? Is this normal behavior? If not, where do you suspect the problem could be?

Reply
- Dotan Barak says: July 17, 2015
  
  Hi.
  
  Yes. Completion events can be triggered even if there isn't any Work Completion in the Completion Queue.
  This can happen if you armed the CQ and then emptied the CQ (thus polling the Work Completion that triggered the event). When you read the event and then check the CQ, you may find that the CQ is empty.
  
  I believe that if you check, you'll find that all the Work Completions were read from the CQ before you got this event with an empty CQ.
  
  Thanks
  Dotan
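
The sequence described in this reply can be reproduced with a toy model. The sketch below is a simplified illustration only: struct toy_cq and the toy_* functions are hypothetical and are not the libibverbs API; they just mimic an armed CQ that raises one event for the next completion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a CQ with event notification; NOT the libibverbs API. */
struct toy_cq {
    int  queued_wc;     /* Work Completions waiting in the CQ */
    int  queued_events; /* completion events waiting to be read */
    bool armed;         /* true after a notification was requested */
};

/* Application side: request one event for the next completion. */
void toy_arm(struct toy_cq *cq)
{
    assert(cq != NULL);
    cq->armed = true;
}

/* Device side: add a Work Completion; an armed CQ fires exactly once. */
void toy_complete(struct toy_cq *cq)
{
    assert(cq != NULL);
    cq->queued_wc++;
    if (cq->armed) {
        cq->queued_events++;
        cq->armed = false;
    }
}

/* Application side: drain the CQ and return how many completions were read. */
int toy_poll_all(struct toy_cq *cq)
{
    assert(cq != NULL);
    int n = cq->queued_wc;
    cq->queued_wc = 0;
    return n;
}

/* Application side: consume one pending event, if there is one. */
bool toy_get_event(struct toy_cq *cq)
{
    assert(cq != NULL);
    if (cq->queued_events == 0)
        return false;
    cq->queued_events--;
    return true;
}
```

In this model, calling toy_arm(), then toy_complete(), then toy_poll_all() reads the one completion while its event is still pending. A later toy_get_event() succeeds, but the following toy_poll_all() returns 0, which matches the "event with an empty CQ" case.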
  
  Reply
  - Avis says: July 17, 2015
    
    Thank you.
Anon says: July 22, 2015

Hi Dotan,

I am trying to create an example of a one sided RDMA READ off the rc_pingpong.c sample from the ibverbs code.
What I have changed is:
1. When creating the memory regions, allow remote reads through the IBV_ACCESS_REMOTE_READ flag.
2. The pp_post_send function to use IBV_WR_RDMA_READ as the opcode.
3. Removed all calls to pp_post_recv.
4. Changed the main while loop, so that the server and the client both poll the cq. Once an event has happened, they exit. Particularly, the server will keep running, and the client exits after it does one run of pp_post_send.

The issue that I am seeing is that on the client side, the work completion returns code IBV_WC_REM_INV_REQ_ERR.
Do you know why this might be? It seems that the qp_access_flags is not used anymore (? or at least when I try and set them, they don't get modified) and the buffers in the pingpong context are still the same 4KB page size. With the permissions set on the memory regions, I am not sure what else is going wrong?

Thanks for any help

Reply
- Dotan Barak says: July 22, 2015
  
  Hi Anon.
  
  Did you enable RDMA Read in qp_attr.qp_access_flags?
  
  Thanks
  Dotan
  
  Reply
  - Anon says: July 23, 2015
    
    Hi Dotan,
    
    I eventually figured it out. I was setting the qp_access_flags to allow IBV_ACCESS_REMOTE_READ.
    The issue is that I misinterpreted what max_dest_rd_atomic and max_rd_atomic fields were used for -- I thought it was only for remote atomic operations. As such, I set them to 0. So when I tried to modify the QP state machine to RTR, the access flags simply didn't update.
    
    Thanks for the help.
  - Dotan Barak says: July 23, 2015
    
    I'm glad everything is working for you
    :)
    
    Dotan
Adrian says: September 14, 2015

Hello Dotan,

I am trying to figure out what the maximum number of scatter/gather entries I can use per one Work Request is.
I have read the FAQ on the ibv_create_qp page, however, I am not seeing the failure when I am trying to create the QP.
What I have is:
1. ibv_query_device returns a max_sge value of 32.
2. I use this in the max_send_sge field of the ibv_qp_cap struct in the ibv_qp_init_attr struct used to create the QP. When the function returns, the value in max_send_sge is updated to 62, to my surprise (I am not sure why...)
3. I then attempt an RDMA READ with an sg_list of length 32, and 31. Each scatter/gather entry has length 1 (i.e. I am reading only one byte from the remote buffer into the local one for each entry). Both of these return IBV_WC_LOC_LEN_ERR as the completion.
4. If I use an sg_list of length 30, everything seems to work.

Do you know why:
a) ibv_create_qp modifies the max_send_sge to be larger than the max_sge value returned from ibv_query_device?
b) The max_sge value seems to be too large, even though creating the QP with that value set in the init attributes returns with no error?

Thanks in advance.

Reply
- Dotan Barak says: September 15, 2015
  
  Hi Adrian.
  
  The problem is that ibv_query_device provides one max_sge value for all transport types and for all Work Queues (both Send and Receive),
  and sometimes this just isn't enough.
  
  I suspect that there is a bug in the low-level driver and you should use the latest version of it,
  and inform the low-level driver provider if this still happens.
  
  Thanks
  Dotan
  
  Reply
Anonymous says: September 14, 2015

Hi Dotan,
When posting a signaled RDMA Send from the server side, I receive no WC on the client side and the kernel hangs, even though I have some outstanding Receive Requests posted on the client side. Can you tell me what the reason could be?

Reply
- Dotan Barak says: September 15, 2015
  
  Hi.
  
  Signaled RDMA Sends are relevant only to the local side (the remote side isn't aware of the signaling mode).
  Is this the first message? (Maybe the QPs weren't connected correctly.)
  
  Thanks
  Dotan
  
  Reply
Long says: February 4, 2016

Hi Dotan,

Thanks for your web site that provides a lot of useful information about RDMA and Infiniband.

I want to use an example program provided by Tarick Bedeir (https://thegeekinthecorner.wordpress.com/) for setting up an RDMA connection (using RDMA_CM) between two machines and then call
ibv_post_send()/ibv_post_recv() to send/receive data. Setting up the RDMA connection works fine. However, ibv_post_send() fails on the first attempt to send (I get error IBV_WC_RETRY_EXC_ERR (12)).

Your article on ibv_poll_cq says that "this mean that the connection attributes are wrong or the remote side isn't in a state that it can respond to messages". However, this example program sets the retry_count parameter to 7 (infinite retry). Further, the program uses rdma_create_qp() to create the queue pairs and the RDMA programming manual says that "QPs allocated to an rdma_cm_id are automatically transitioned by the librdmacm through their states. After being allocated, the QP will be ready to handle posting of receives."

I wonder what goes wrong and how I can fix it. The example code is available at
https://github.com/tarickb/the-geek-in-the-corner/tree/master/02_read-write .

I would be grateful if you have any suggestion for me.

Thanks,
Long

Reply
- Dotan Barak says: February 5, 2016
  
  Hi Long.
  
  I didn't have a chance to play with this example (yet).
  7 is an infinite value only for rnr_retry.
  For retry_cnt, 7 is actually seven retries.
  
  I would suggest executing ibv_rc_pingpong and rping,
  to check that your fabric is configured and functioning correctly.
  
  Thanks
  Dotan
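
The difference between the two retry attributes can be written down as a small sketch. The helper names are hypothetical (not libibverbs functions). The timeout helper relies on the usual definition of the QP timeout attribute, 4.096 microseconds times 2^timeout with 0 meaning "wait forever", which is stated here as an assumption rather than taken from this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* rnr_retry is a 3-bit value; only here does 7 mean "retry forever". */
bool rnr_retry_is_infinite(uint8_t rnr_retry)
{
    assert(rnr_retry <= 7);
    return rnr_retry == 7;
}

/* retry_cnt is also 3 bits, but 7 is a literal count: seven retries. */
unsigned retry_cnt_attempts(uint8_t retry_cnt)
{
    assert(retry_cnt <= 7);
    return retry_cnt;
}

/* Assumed formula for the local ACK timeout: 4.096 us * 2^timeout.
 * Returns the wait in nanoseconds, or 0 when timeout == 0 (infinite). */
uint64_t ack_timeout_ns(uint8_t timeout)
{
    assert(timeout <= 31);
    return timeout == 0 ? 0 : (uint64_t)4096 << timeout;
}
```

With retry_cnt = 7 and, for example, timeout = 14 (about 67 ms per attempt under the assumed formula), a peer that never answers leads to IBV_WC_RETRY_EXC_ERR after a bounded time, not to an endless retry.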
  
  Reply
liuyu says: March 15, 2016

Hi Dotan,

Thanks for your help!

Now I am very confused when programming with verbs. I want to use RDMA Read or RDMA Write for handling IO, but I get the error IBV_WC_REM_INV_REQ_ERR (9) at the sender side. I have checked the memory and didn't find anything wrong. I paste some code here; could you give me some suggestions?

//create qp
RCPRINT("client creating qp\n");
qp_attr.cap.max_send_wr = MAX_WR;
qp_attr.cap.max_send_sge = 1;
qp_attr.cap.max_recv_wr = MAX_WR;
qp_attr.cap.max_recv_sge = 1;
qp_attr.send_cq = send_cq;
qp_attr.recv_cq = recv_cq;
qp_attr.qp_type = IBV_QPT_RC;
err = rdma_create_qp(cm_id, pd, &qp_attr);
if (err)
{
    RCPRINT_ERROR("client create qp fail\n");
    clientDestroyRdmaObj(connection);
    return 1;
}

//rdma write or read
memset(&send_wr, 0, sizeof(send_wr));
send_wr.wr_id = (uint64_t)sge;
send_wr.sg_list = sge;
send_wr.num_sge = 1;
send_wr.opcode = (opCode == CTRL_READ) ? IBV_WR_RDMA_READ : IBV_WR_RDMA_WRITE;
send_wr.send_flags = IBV_SEND_SIGNALED;
send_wr.wr.rdma.remote_addr = remoteRdma->remote_addr;
send_wr.wr.rdma.rkey = remoteRdma->rkey;

if (ibv_post_send(connection->cm_id->qp, &send_wr, &bad_wr))
{
    RCPRINT_ERROR("server send rdma opt(%d) fail\n", opCode);
    return RETURN_ERROR;
}

Reply
- liuyu says: March 17, 2016
  
  Hi Dotan,
  
  Today I wrote a test program. I found that the client could post IBV_WR_RDMA_WRITE or IBV_WR_RDMA_READ successfully, but the server could only post IBV_WR_RDMA_WRITE successfully. When the server posts a send with opcode IBV_WR_RDMA_READ, it gets the error IBV_WC_REM_INV_REQ_ERR (9) from a successful ibv_poll_cq, and wc.opcode changes to IBV_WC_SEND. In the test program, I just send one IBV_WR_RDMA_READ. Could you give me some suggestions?
  
  Reply
  - Dotan Barak says: March 22, 2016
    
    Hi.
    
    I think that the problem is with the permission of the QP or MR in the client side.
    (RDMA Read isn't enabled)
    
    Thanks
    Dotan
liuyu says: March 29, 2016

Thanks very much for your help! I have resolved the problem. When the server called rdma_accept, I did not assign a value to the initiator_depth member of struct rdma_conn_param, so it defaulted to zero. As a result, the max_rd_atomic member of struct ibv_qp_attr was also zero, and the server could never issue an RDMA Read operation.

Reply
tamlok says: April 13, 2016

I posted a Receive Request and call ibv_poll_cq() to see whether we received anything. However, after I suddenly kill the program that is supposed to send something to the receiver, the ibv_poll_cq() called by the receiver still keeps returning 0.
So it is confusing that ibv_poll_cq() doesn't return a negative value even after the connection has been disconnected.
Do you have any ideas?
Thanks very much!

Reply
- Dotan Barak says: April 19, 2016
  
  Hi.
  
  A negative return value from ibv_poll_cq() means that there is an error in the CQ.
  The QP doesn't know that the remote side is dead ...
  (unless CM is used, and there is a DISCONNECT indication)
  
  If needed, you can add "keep alive" messages to your application
  (for example: RDMA Write with 0 bytes - if you are using RC).
  
  Thanks
  Dotan
  
  Reply
  - tamlok says: April 19, 2016
    
    Thanks very much! Do you know how much the cost of ibv_poll_cq() is? Will it be expensive if I keep calling ibv_poll_cq() frequently? Will it consult the hardware register or just memory?
  - Dotan Barak says: April 19, 2016
    
    Hi.
    
    It is hard to answer, since it is device specific.
    
    In Mellanox devices (for example), ibv_poll_cq() accesses memory, which is relatively cheap
    (no context switch or any expensive operation).
    
    I can't say for other devices...
    
    Thanks
    Dotan
  - tamlok says: April 19, 2016
    
    Impressive! Thanks very much!
Junhyun says: October 8, 2016

Hi Dotan, when exactly is the buffer posted for recv WR updated?
Is it updated when I call ibv_poll_cq or whenever the device accepts a next incoming send WR?
For instance, if I posted 10 Recv WRs on the same buffer, and on another host I post 2 Send WRs, will the contents of the first Send be overwritten? Or can I get the contents of the first Send by polling just the first Recv WR out of the CQ?

Reply
- Dotan Barak says: October 11, 2016
  
  Hi.
  
  The Receive WR buffer(s) are filled when the incoming message arrives and the Receive Request is fetched.
  Once the whole message has been written to the buffer, a Work Completion is enqueued to the CQ.
  
  In your example, the first message content will be overwritten by the second message.
  
  Thanks
  Dotan
  
  Reply
  - Wang says: February 28, 2017
    
    Hi!
    
    Thanks for your web site.
    
    I used send/recv to transfer 50 bytes of data over an RC QP. The receiver polled a CQE whose byte_len is exactly 50 and whose status is IBV_WC_SUCCESS, but I cannot find the data in the buffer pointed to by the pre-posted RR. I am very confused about what's going on...
  - Dotan Barak says: July 21, 2017
    
    Hi.
    
    I believe that either:
    1) There is a bug in the code
    2) The program overwrites the buffer (using multiple Receive Requests that point to the same location, or writing directly to the buffer)
    
    Thanks
    Dotan
vineeth says: October 27, 2016

I am working on an NVMe over Fabrics project (RDMA). In RDMA, an RDMA Read fails with status 5 on the host side, i.e. the QP moves to the error state. Can you please tell me why my QP moves to the error state on the host side during an RDMA Read?

Reply
- Dotan Barak says: November 4, 2016
  
  Hi.
  
  Work Completion with status 0x5 means: IBV_WC_WR_FLUSH_ERR.
  * Is this the first completion with error?
  * Is the QP already in error?
  * Is there an asynchronous event in the remote side?
  
  Thanks
  Dotan
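
The first question above ("is this the first completion with error?") matters because a flush error is usually a consequence of an earlier failure on the same QP. A minimal sketch of that check over a list of polled status values follows; the function names and WC_* macros are hypothetical, and the numeric values 0 and 5 are local copies of the values quoted on this page, not taken from <infiniband/verbs.h>.

```c
#include <assert.h>
#include <stddef.h>

#define WC_SUCCESS      0 /* IBV_WC_SUCCESS, value as quoted on this page */
#define WC_WR_FLUSH_ERR 5 /* IBV_WC_WR_FLUSH_ERR, value as quoted above */

/* Index of the first status that is not success, or -1 if all succeeded.
 * That entry is the one worth investigating. */
int first_error_index(const int *statuses, size_t n)
{
    assert(statuses != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (statuses[i] != WC_SUCCESS)
            return (int)i;
    return -1;
}

/* A flush error only reports that the QP was already in the Error state
 * when this Work Request was processed; it is not a root cause. */
int is_flush_only(int status)
{
    return status == WC_WR_FLUSH_ERR;
}
```

If the first failing entry is itself a flush error, the QP was most likely moved to the Error state by something other than a local Work Request, for example an earlier unobserved completion or an asynchronous event.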
  
  Reply
shilvea says: February 20, 2017

Hi!
In the case of an SRQ, is poll_cq not used? I can't understand how I can call it if the input parameter is a CQ, but I didn't create a CQ, only an SRQ.

Reply
- Dotan Barak says: February 22, 2017
  
  Hi.
  
  The SRQ can't be used by itself;
  it is used by the QP(s) to hold the Receive Requests.
  
  The corresponding Work Completion of that Receive Request is enqueued to the CQ that is associated with the Receive Queue of the QP
  that received the incoming message.
  
  Thanks
  Dotan
  
  Reply
Vasilis.G says: April 21, 2017

Hello Dotan,

My understanding is this: When receiving an incoming UD Send, the gather region will contain the payload offset by the 40-byte grh. But the grh will only be valid if the sender is in a different subnet or the message is part of a multicast (please correct me if I am mistaken).
My questions are the following:

In the case of a unicast UD Send within the same subnet is the grh actually transported? I.e. can I overload the 40 bytes of the grh with actual payload, since it's going to go to my gather list in the receiver anyway?

In the case of the WR posted on the Receive Queue, is it mandatory to specify a valid, registered region in the sge list to accommodate the GRH, even if the incoming UD Send has no payload (i.e. it sends only the immediate value)?
In the case of sending no payload I should be able to write num_sge=0 when specifying the receive request. Does that hold true for a Receive Request that accommodates a UD Send?

Thank you!

Reply
- Dotan Barak says: July 3, 2017
  
  Hi.
  
  You are correct, for a UD QP the packet payload will be placed starting offset 40 in the Receive Request buffer;
  the first 40 bytes will contain a GRH only if the packet contained a GRH.
  
  When posting a Send Request of a UD QP, the sender controls, in the Address Handle, whether or not a GRH will be sent over the wire.
  You cannot control the content of the 40 bytes of the GRH; most of the GRH is filled automatically by the RDMA device.
  
  If a GRH isn't present, I *believe* (i.e. I didn't verify it) that you don't have to provide a valid MR key,
  since I have a feeling that the RDMA device validates the S/G list provided in the Receive Request only before it writes content to memory.
  
  The spec says: "Note that for UD QPs, the first 40 bytes of the buffer(s) referred to by the Scatter/Gather list will contain the GRH of the incoming message. If no GRH is present, the contents of first 40 bytes of the buffer(s) will be undefined."
  
  The behavior is implementation dependent; different vendors/devices may behave differently.
  I suggest not to count on the implementation of specific RDMA device and always provide a valid MR.
  
  Thanks
  Dotan
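
The 40-byte offset discussed above can be handled with a few small helpers. This is a sketch under stated assumptions: the helper names are hypothetical, and it assumes that byte_len in the Work Completion of a UD receive counts the 40 bytes reserved for the GRH in addition to the payload.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes reserved for the GRH at the start of every UD receive buffer. */
#define UD_GRH_BYTES 40u

/* Size to register and post for a given maximum payload size. */
size_t ud_recv_buf_size(size_t max_payload)
{
    return max_payload + UD_GRH_BYTES;
}

/* Start of the payload inside a UD receive buffer. */
const uint8_t *ud_payload(const uint8_t *recv_buf)
{
    assert(recv_buf != NULL);
    return recv_buf + UD_GRH_BYTES;
}

/* Payload length, assuming byte_len includes the 40 reserved bytes. */
uint32_t ud_payload_len(uint32_t byte_len)
{
    assert(byte_len >= UD_GRH_BYTES);
    return byte_len - UD_GRH_BYTES;
}
```

Whether the first 40 bytes actually hold a GRH should be decided from the Work Completion flags, not from the buffer contents, since the spec leaves those bytes undefined when no GRH is present.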
  
  Reply
Erfan Zamanian says: December 14, 2017

Hi Dotan,

Is ibv_poll_cq a blocking function? For example, if the CQ is empty, would ibv_poll_cq return 0 immediately, or would it block and only sporadically return 0?

Reply
- Dotan Barak says: December 22, 2017
  
  Hi.
  
  ibv_poll_cq() isn't a blocking function and it will always return immediately.
  
  The return value is negative in case of an error.
  Otherwise, it is the number of Work Completions returned (0 means that no Work Completions were found in that CQ).
  
  Thanks
  Dotan
  
  Reply
HeBoxin says: January 22, 2018

Hi,
There is a problem: when the sender endpoint posts a Send Request, it gets an "RNR retry counter exceeded" completion (because at that time no Receive Request was posted on the remote endpoint). After I post a Receive Request on the remote endpoint, I let the sender endpoint post a Send Request again. However, a Work Request Flushed Error occurs. Could you tell me how to solve this problem? I really appreciate it.

Reply
- Dotan Barak says: March 2, 2018
  
  Hi.
  
  Once you get a Work Completion with an error in an RC QP,
  the QP is transitioned to the Error state.
  
  If you want to keep working with that QP, you need to reestablish the connection on both sides
  (move the QP to the Reset state, and then configure it on both sides up to the RTS state).
  
  Thanks
  Dotan
  
  Reply
Long says: March 9, 2018

Hi Dotan,

After setting the local QP to IBV_ACCESS_REMOTE_WRITE mode and sending a WQE, once the message is sent and its CQE is generated, the local side calls poll_cq and gets an ibv_wc, but with byte_len = 0.

As far as I understand, byte_len should be the number of bytes transferred, which means the size of the buffer sent.

Am I missing something?

Thank you for your help.

Long

Reply
- Dotan Barak says: May 13, 2019
  
  Hi.
  
  In a Work Completion with an error, not all fields are valid.
  
  Thanks
  Dotan
  
  Reply
Leslie says: March 24, 2018

I met the same issue. Here is my data:
* Is this the first completion with error?
Yes. It is the first error.
* Is the QP already in error?
How do I know whether the QP is in error or not?
* Is there an asynchronous event in the remote side?
No.
I'm using SoftRoCE. Could it be related?

Reply
- Dotan Barak says: April 19, 2018
  
  Hi.
  
  You can tell whether a QP is in error by calling ibv_query_qp() and checking the state field.
  
  Personally, I don't have any experience with SoftRoCE, but all RDMA stacks should behave in the same way...
  
  Thanks
  Dotan
  
  Reply
Conrad says: May 13, 2018

I am currently trying to make a simple sender/receiver setup using RDMA over InfiniBand (UD protocol). All the hardware is rated for 40+ Gbit/s, but I am only able to achieve around 13. From my understanding, the completion polling seems to be slowing it down. The sender was made faster by signaling only every 100th Work Request, thereby saving a lot of polling, but on the receiving side I have to poll for each request? How do I speed up the application?

Reply
- Dotan Barak says: May 18, 2019
  
  Hi.
  
  If you want to decrease the stress on CPU, work with CQ events.
  I think that the number of outstanding WRs that you use is too low;
  I wrote a post on improving RDMA application performance - check it out.
  
  Thanks
  Dotan
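
The "signal only every 100th Work Request" scheme from the question can be sketched as follows. The helper names are hypothetical; in a real sender the boolean would decide whether the signaled flag is set on the Send Request being posted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if the Send Request with this 0-based index should be signaled,
 * so that one Work Completion is generated per signal_every requests. */
bool should_signal(uint64_t wr_index, uint32_t signal_every)
{
    assert(signal_every > 0);
    return (wr_index + 1) % signal_every == 0;
}

/* Number of Work Completions the sender has to poll for total_wrs requests. */
uint64_t expected_send_completions(uint64_t total_wrs, uint32_t signal_every)
{
    assert(signal_every > 0);
    return total_wrs / signal_every;
}
```

The interval should stay below the Send Queue depth, because unsignaled requests keep occupying Send Queue entries until a later signaled completion is polled. On the receive side every Receive Request generates a Work Completion, but a single ibv_poll_cq() call can return several of them when num_entries is greater than 1.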
  
  Reply
Igor Leshenko says: August 19, 2018

Once I get a WC from ibv_poll_cq(), is there a standard API to find out the address of the corresponding memory buffer (provided in ibv_post_recv())?

Reply
- Dotan Barak says: August 24, 2018
  
  Hi.
  
  No.
  It is up to the SW to provide hints and use information that exists in the Work Completion to know what the corresponding memory buffer is.
  
  Here are ideas how this can be done:
  * If it was a SEND message, the wr_id can be useful
  * If it was an RDMA message, imm_data can be used
  
  Thanks
  Dotan
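
One common way to apply the wr_id idea is to store a pointer to an application-owned context in wr_id when posting, and cast it back after polling. The sketch below shows only the round trip; struct recv_ctx and both helpers are hypothetical names, not part of libibverbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request context owned by the application. */
struct recv_ctx {
    void  *buf; /* buffer referenced by the Receive Request's sg_list */
    size_t len; /* size of that buffer */
};

/* Value to store in the wr_id field when posting the Receive Request. */
uint64_t ctx_to_wr_id(struct recv_ctx *ctx)
{
    return (uint64_t)(uintptr_t)ctx;
}

/* Recover the context from the wr_id of the polled Work Completion. */
struct recv_ctx *wr_id_to_ctx(uint64_t wr_id)
{
    assert(wr_id != 0);
    return (struct recv_ctx *)(uintptr_t)wr_id;
}
```

The context must stay alive until the matching Work Completion has been polled. An index into a preallocated array works equally well when a pointer is not convenient.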
  
  Reply
Zhao, Bing says: September 11, 2018

Hi Dotan,
In your previous comment, "I mean that the data was received to the remote side HCA and in almost all cases was written to its memory", I have a question (I cannot click the "reply" button due to a network problem). Do you mean that once the data arrives at the remote side in a WRITE operation, the NIC/HCA will generate an ACK if RC mode is used? I tested the write latency on an MLNX NIC (RDMA over Ethernet) and got a little confused. In RC mode, for example, it takes about 43 cycles (of a low-precision counter) to get the successful status after the write request function returns, but in UC mode it takes only 21 cycles. So RC mode takes twice as long as UC mode. To my understanding, the link time and the HW ACK should take little time. Why, then, does it cost more time in RC mode? Many thanks.
BR. Bing

Reply
- Dotan Barak says: September 11, 2018
  
  Hi.
  
  There isn't any ACK in UC;
  You got a Work Completion in the sender side once the data was sent out of the port.
  
  You ask about the reason for the difference between reliable and unreliable latency;
  it depends on many things: RDMA device implementation, path (switch/cable type + length), local/remote chipset, and more ..
  
  Thanks
  Dotan
  
  Reply
Zhao, Bing says: September 19, 2018

Hi Dotan,
Thanks a lot for your reply. I do understand "You got a Work Completion in the sender side once the data was sent out of the port" now. Maybe I didn't describe my question clearly above. The gap between the latencies of RC and UC mode is not very big with the ib_write_lat tool, only about 0.01~0.02 microseconds. I made some modifications to the tool's code, and then measured each part of the test loop within an iteration with an MLNX NIC on x86_64 and aarch64 platforms.
The cost of "mlx5 post send" is very small. The "poll cq" cost differs between RC and UC mode: in UC mode it is only half of that in RC mode. As you say, it is due to the "ACK". Most of the cycles saved in UC mode are then "wasted" in the infinite loop that waits for the data from the remote side.
So almost half of the "poll cq" cycles in UC mode is about waiting for the ACK from the remote. (If the poll cq drivers are almost the same for the UC and RC mode of mlx5). I just wonder why it is so long?

B.R
Bing

Reply
- Dotan Barak says: September 26, 2018
  
  Hi.
  
  I'm sorry, but I don't understand what your application does.
  * Is it a pingpong?
  * Does only one side post Send Requests?
  * How does the sender "know" that the remote side got a response (in UC)?
  
  Thanks
  Dotan
  
  Reply
FANTAR says: May 13, 2019

Hi Dotan,

I want to establish a UD connection between a server and a client.

- I have the 128-bit GIDs of the server and the client (ibv_query_gid, index 0, port 1).

- I created UD QPs on both sides and set the QP number qp->num to a wanted value (does it work?) I use it in
- I use PKey index 0 on both sides: 0xFFFF
- Qkey set to a wanted value: 0x1234

On the host side I create an address handle (used in the Send Work Request):

- union gid with dgid.raw[16] (values from ibv_query_gid)

- destination QP number with my values (do I need to keep the initial values generated by ibv_create_qp?)

Do you have an example of establishing a UD connection using GRH?

Thank you very much in advance

Ramy

Reply
- Dotan Barak says: May 18, 2019
  
  Hi.
  
  The destination QP number is the value that was returned from ibv_create_qp().
  You can find an example for UD connection in:
  https://github.com/linux-rdma/rdma-core/blob/master/libibverbs/examples/ud_pingpong.c
  
  Thanks
  Dotan
  
  Reply
Vinit Agnihotri says: June 24, 2019

I am running an RDMA server under CentOS. I can do RDMA send/recv or RDMA read/write without any issue. However, while posting RDMA Read operations from the server, I get IBV_WC_REM_INV_REQ_ERR if and only if the data size is 128k or more, and as a cherry on top it does not happen for every request; it's pretty random.

qp_access_flags, length, mr permissions all are correct as same code works well for sizes below 128k.

I have no issues for posting rdma writes, but some rdma reads get into trouble.

I tried setting IBV_SEND_FENCE while posting, and then the error goes away, but it seems to lower throughput. Any pointers/thoughts about what could be going wrong? Any help is greatly appreciated.

I am using rdma_post_read() to post operation and rdma_reg_write() to register buffer.

Thanks

Reply
- Dotan Barak says: June 27, 2019
  
  Hi.
  
  What are the values of max_rd_atomic and max_dest_rd_atomic in both QPs?
  
  Thanks
  Dotan
  
  Reply
Vinit Agnihotri says: July 16, 2019

Values are as follows.
max_qp_rd_atom=21 max_res_rd_atom=387072 max_qp_init_rd_atom=21

Reply
- Dotan Barak says: July 16, 2019
  
  Hi.
  
  What are the values in the QP context, not in the device capabilities?
  You can query the QP to get those values.
  
  thanks
  Dotan
  
  Reply
Vinit says: July 17, 2019

Ahh got it, after ibv_query_qp() it returns both values as 0,0(max_rd_atomic, max_dest_rd_atomic)

Reply
- Dotan Barak says: July 17, 2019
  
  Hi.
  
  I suggest that you'll set a non-zero value in there ...
  
  Thanks
  Dotan
  
  Reply
vinit says: July 18, 2019

Would you suggest using max_qp_rd_atom (from the device query) for the QP? What impact would it have if I assign, say, 10 as opposed to 21 at my end? Would it address my problem?

Thanks.

Reply
- Dotan Barak says: July 18, 2019
  
  For proper operation, max_rd_atomic should be lower than or equal to the remote side's max_dest_rd_atomic.
  
  For example, if you have QP_A and QP_B:
  QP_A.max_rd_atomic <= QP_B.max_dest_rd_atomic
  and
  QP_B.max_rd_atomic <= QP_A.max_dest_rd_atomic
  
  The higher the better (to allow supporting more outstanding RDMA Reads/Atomic operations).
  
  Thanks
  Dotan
  
  (this should be done configured to every
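
The two inequalities for QP_A and QP_B can be checked with a few lines of plain C. This is only a sketch: the function names are hypothetical helpers (not part of libibverbs), and it assumes each side already learned the other side's values through some out-of-band exchange during connection setup.

```c
#include <stdint.h>

/* Hypothetical helper: the largest max_rd_atomic that the local QP may use,
 * given what the application would like and what the remote QP accepts
 * (its max_dest_rd_atomic). */
static uint8_t pick_max_rd_atomic(uint8_t local_wanted,
                                  uint8_t remote_max_dest_rd_atomic)
{
    return local_wanted < remote_max_dest_rd_atomic
               ? local_wanted
               : remote_max_dest_rd_atomic;
}

/* Hypothetical helper: returns 1 when both directions satisfy
 *   QP_A.max_rd_atomic <= QP_B.max_dest_rd_atomic
 *   QP_B.max_rd_atomic <= QP_A.max_dest_rd_atomic
 * and 0 otherwise. */
static int rd_atomic_config_ok(uint8_t a_max_rd_atomic,
                               uint8_t a_max_dest_rd_atomic,
                               uint8_t b_max_rd_atomic,
                               uint8_t b_max_dest_rd_atomic)
{
    return a_max_rd_atomic <= b_max_dest_rd_atomic &&
           b_max_rd_atomic <= a_max_dest_rd_atomic;
}
```

For instance, if the local side would like 21 outstanding Reads but the remote QP accepts only 10, pick_max_rd_atomic(21, 10) yields 10, which is the value one would then set in the local QP attributes.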
  
Vinit says: July 22, 2019

Unfortunately I don't have any control over the client, as it runs in a Windows domain. The only thing I control is the Linux-based server. So is there any query I could run to get the remote side's parameters?

- Dotan Barak says: July 26, 2019
  
  Hi.
  
  It is up to the SW protocol, as part of the communication manager, to exchange the supported attributes and configure the best attributes.
  
  Thanks
  Dotan
  
vinit says: August 1, 2019

Alright, then I think there is not much I can do in this case.
I'll try setting some value at my end at least and see how it goes.
Thank you.

Christopher R says: September 25, 2019

Hi Dotan,

First, thanks for the great blog!

I am wondering what an appropriate method for measuring the latency of ib verbs is. If you have a reliable connection, would taking a timestamp, then issuing the post read/write, then spin polling for a completion, followed by another timestamp accurately capture the latency?

What about for unreliable connection / unreliable datagram?

Thanks

- Dotan Barak says: September 28, 2019
  
  Thanks
  :)
  
  Hi.
  
  I think that the answer is "no":
  the flow you just described measures the latency of the RDMA device's scheduling and processing plus the network latency.
  If you want to measure the latency of the RDMA verbs themselves, take a timestamp before and after calling the verb.
  
  If you want to measure the latency of the data, the flow you described does it.
  BTW, there are tools for checking the BW and latency of RDMA: the ib_*_lat and ib_*_bw performance tools.
  
  Thanks
  Dotan
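
The difference between the two measurements can be sketched as follows. The timing code uses the POSIX clock_gettime() call; everything else is an assumption made for illustration: fake_post() and fake_poll() are hypothetical stand-ins for posting a Send Request and polling the CQ once, so the sketch runs without an RDMA device.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdint.h>
#include <time.h>

/* Monotonic timestamp in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* A step returns non-zero when it is done; ctx is opaque user data. */
typedef int (*step_fn)(void *ctx);

struct latency_sample {
    uint64_t verb_ns;   /* time spent inside the "post" call only */
    uint64_t total_ns;  /* post + spin polling until a completion shows up */
};

static struct latency_sample measure(step_fn post, step_fn poll_once, void *ctx)
{
    struct latency_sample s;
    uint64_t t0, t1, t2;

    t0 = now_ns();
    (void)post(ctx);                 /* latency of the verb itself */
    t1 = now_ns();
    while (poll_once(ctx) == 0)      /* spin until the completion arrives */
        ;
    t2 = now_ns();

    s.verb_ns = t1 - t0;
    s.total_ns = t2 - t0;            /* device + network + polling */
    return s;
}

/* Hypothetical stand-ins so the sketch is runnable without hardware. */
struct fake_ctx { int polls_until_done; };

static int fake_post(void *ctx) { (void)ctx; return 1; }

static int fake_poll(void *ctx)
{
    struct fake_ctx *c = ctx;
    return c->polls_until_done-- <= 0;
}
```

In a real program the two stand-ins would wrap the actual post and poll calls; verb_ns then reflects only the software cost of the verb, while total_ns reflects the data latency.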
  
Nathan says: April 27, 2020

Hi, I'm wondering how to determine whether an incoming message is from a normal UD RDMA Send or from a UD multicast. Thanks :)

- Dotan Barak says: July 10, 2020
  
  Hi.
  
  A multicast message always has a GRH and the mgid[15] = 0xff,
  so you can get this information from the Work Completion and the message's GRH header.
  
  Thanks
  Dotan
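
A minimal sketch of that check, written without libibverbs so it can be compiled anywhere. It assumes an InfiniBand-style 40-byte GRH at the start of the UD receive buffer, with the destination GID occupying its last 16 bytes and the 0xff multicast marker being the first DGID byte as it appears in memory. The function name is hypothetical, and wc_has_grh stands for the result of testing wc.wc_flags against IBV_WC_GRH.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    GRH_LEN = 40,          /* the GRH occupies the first 40 bytes of the buffer */
    GRH_DGID_OFFSET = 24   /* destination GID: last 16 bytes of the GRH */
};

/* Hypothetical helper: true when the received UD message was sent to a
 * multicast GID. Without a GRH the message cannot be multicast. */
static bool ud_recv_is_multicast(bool wc_has_grh,
                                 const uint8_t *recv_buf,
                                 size_t buf_len)
{
    if (!wc_has_grh || recv_buf == NULL || buf_len < GRH_LEN)
        return false;

    /* Multicast GIDs start with 0xff. */
    return recv_buf[GRH_DGID_OFFSET] == 0xff;
}
```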
  
rdma is hard says: July 24, 2020

Hello Dotan,
Assume that I have two machines talking over RC QPs (all connection setup is done and working properly). A sender issues an RDMA Read and marks it as signaled using IBV_SEND_SIGNALED. Assuming no other requests, I use an infinite while loop to poll one completion from the completion queue. Does the completion indicate that the read is complete (i.e. the data has been fetched into local memory) or just that the read request has reached the remote side?

Thanks for your blog.

- Dotan Barak says: July 25, 2020
  
  Hi.
  
  IBV_SEND_SIGNALED requests that a Work Completion be generated in the *local* CQ
  when the processing of that Send Request ends.
  
  For an RDMA Read, this happens once all the data that was requested (from the remote side) has been received by the requestor.
  This means that you can now process the buffers, since the data is available in the local buffers.
  
  Thanks
  Dotan
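
The polling loop described in the question can be sketched like this. To keep it runnable without an RDMA device, the CQ is replaced by a hypothetical fake (struct fake_cq and fake_poll() are inventions for this example). The only thing borrowed from ibv_poll_cq() is its return convention: 0 when the CQ is empty, a positive count of completions, or a negative value on failure.

```c
#include <stddef.h>

/* Hypothetical, simplified Work Completion: status 0 plays the role of
 * IBV_WC_SUCCESS, anything else is an error. */
struct completion { int status; };

typedef int (*poll_fn)(void *cq, int max_entries, struct completion *out);

/* Spin until one completion is available. Returns 0 when the signaled
 * RDMA Read completed successfully (the local buffer may now be read),
 * and -1 when polling failed or the completion carries an error. */
static int wait_for_read(poll_fn poll, void *cq)
{
    struct completion wc;
    int n;

    do {
        n = poll(cq, 1, &wc);
    } while (n == 0);               /* 0 means the CQ is still empty */

    if (n < 0)
        return -1;                  /* the poll call itself failed */
    if (wc.status != 0)
        return -1;                  /* buffer contents are not valid */
    return 0;
}

/* Hypothetical fake CQ: stays empty for a few polls, then delivers one
 * completion with the configured status. */
struct fake_cq { int empty_polls_left; int final_status; };

static int fake_poll(void *cq, int max_entries, struct completion *out)
{
    struct fake_cq *q = cq;

    if (max_entries < 1)
        return 0;
    if (q->empty_polls_left > 0) {
        q->empty_polls_left--;
        return 0;
    }
    out->status = q->final_status;
    return 1;
}
```

Only after wait_for_read() returns 0 should the application touch the destination buffer of the Read.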
  
Alberto Perro says: October 27, 2020

Hi Dotan,

Thank you so much for this blog, it is helping me so much.
I have set up an RC QP and I can exchange IBV_SEND/RECV messages without issues.
When I try to use IBV_RDMA_READ/WRITE I can successfully post an SR but polling the CQ gives me error 12 for the wc.
I am using the same QP and MR and I don't know why it is happening.

Thanks,
Alberto

- Dotan Barak says: November 21, 2020
  
  Hi.
  
  Most likely you have a sync problem between the sides;
  I suspect that the requestor posted the Send Request BEFORE the responder QP was transitioned to (at least) RTR.
  
  Thanks
  Dotan
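
For reference, status 12 is IBV_WC_RETRY_EXC_ERR (transport retry counter exceeded), which fits a responder that was not yet ready to answer. The lookup below is a standalone sketch that mirrors the numeric values of enum ibv_wc_status; the function name is hypothetical. In a program that links against libibverbs, ibv_wc_status_str() gives a readable description instead.

```c
#include <stddef.h>

/* Hypothetical helper: maps a numeric Work Completion status to the name
 * of the matching enum ibv_wc_status value. */
static const char *wc_status_name(int status)
{
    static const char *const names[] = {
        "IBV_WC_SUCCESS",            /*  0 */
        "IBV_WC_LOC_LEN_ERR",        /*  1 */
        "IBV_WC_LOC_QP_OP_ERR",      /*  2 */
        "IBV_WC_LOC_EEC_OP_ERR",     /*  3 */
        "IBV_WC_LOC_PROT_ERR",       /*  4 */
        "IBV_WC_WR_FLUSH_ERR",       /*  5 */
        "IBV_WC_MW_BIND_ERR",        /*  6 */
        "IBV_WC_BAD_RESP_ERR",       /*  7 */
        "IBV_WC_LOC_ACCESS_ERR",     /*  8 */
        "IBV_WC_REM_INV_REQ_ERR",    /*  9 */
        "IBV_WC_REM_ACCESS_ERR",     /* 10 */
        "IBV_WC_REM_OP_ERR",         /* 11 */
        "IBV_WC_RETRY_EXC_ERR",      /* 12 */
        "IBV_WC_RNR_RETRY_EXC_ERR"   /* 13 */
    };

    if (status < 0 || (size_t)status >= sizeof names / sizeof names[0])
        return "unknown";
    return names[status];
}
```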
  