Comparison of the verbs implementation vs. the specifications

The InfiniBand spec defines several features and verbs that the verbs implementation (i.e. the RDMA stack in the Linux kernel and libibverbs) either doesn't implement or implements in a different way.

In this post I will cover the verbs and functionality that are defined in the specifications but are missing from the implementation, and those that were implemented differently.

Missing functionality

Reliable Datagram (RD)

The RDMA stack in the kernel and libibverbs don't support RD at all: RD isn't a valid transport type when creating a QP, and none of the following verbs, which manage the RD-related resources, were implemented:

Allocate Reliable Datagram Domain
Deallocate Reliable Datagram Domain
Create EE Context
Modify EE Context Attributes
Query EE Context
Destroy EE Context

Address Handle (AH)

Modify Address Handle
Query Address Handle

The RDMA stack in the kernel supports those verbs, but most low-level drivers don't support them.
libibverbs doesn't support those verbs at all.

Memory Region (MR)

Reregister Memory Region

libibverbs has preparations to support this verb, but there isn't any implementation of it (yet?), so in practice libibverbs doesn't support this verb at all.

Memory Window (MW)

Allocate Memory Window
Query Memory Window
Bind Memory Window
Deallocate Memory Window

The RDMA stack in the kernel supports those verbs and some of the low-level drivers support them as well.

libibverbs has preparations to support those verbs. However, there isn't any implementation of them (yet?).

Changed functionality

Memory Region (MR)

Query Memory Region

The RDMA stack in the kernel supports this verb, but most low-level drivers don't support it.
libibverbs doesn't support this verb at all. However, the attributes addr and length (which were provided when registering the MR) and lkey and rkey (which were filled by ibv_reg_mr()) are part of struct ibv_mr, and this replaces the need for this verb. The only attribute that cannot be retrieved from the MR after its creation is its access permissions.

Completion Queue (CQ)

Query Completion Queue

The RDMA stack in the kernel and libibverbs don't support this verb at all. However, the attribute cqe is part of struct ibv_cq and this replaces the need for this verb.

Set Completion Event Handler

The RDMA stack in the kernel and libibverbs don't support this verb at all. However, when calling ib_create_cq() in the RDMA stack in the kernel the client code can specify a CQ event handler.

In libibverbs, the client code can create a thread that calls ibv_req_notify_cq(), ibv_get_cq_event() and ibv_ack_cq_events(); such a thread effectively behaves as a Completion Event Handler.

Asynchronous Event

Set Asynchronous Event Handler

The RDMA stack in the kernel and libibverbs don't fully support this verb. However, when calling ib_create_cq(), ib_create_srq() or ib_create_qp() in the RDMA stack in the kernel, the client code can specify an Asynchronous Event Handler for those resources. Furthermore, the user can call ib_register_event_handler() to register an event handler for the RDMA device's events.
In libibverbs, the client code can create a thread that calls ibv_get_async_event() and ibv_ack_async_event(); such a thread effectively behaves as an Asynchronous Event Handler.

eXtended Reliable Connected (XRC)

Annex A14 adds XRC to the IB spec. The following verbs were added:

Allocate XRC Domain
Deallocate XRC Domain
Create XRC Shared Receive Queue
Query XRC Shared Receive Queue
Modify XRC Shared Receive Queue
Destroy XRC Shared Receive Queue
Create XRC Target Queue Pair
Query XRC Target Queue Pair
Modify XRC Target Queue Pair
Destroy XRC Target Queue Pair

Most of this functionality was added to the RDMA stack in the kernel, either by adding new verbs or by extending the functionality of existing ones (for example: instead of adding a new verb for creating an XRC Shared Receive Queue, ib_create_srq() was extended to support the creation of XRC SRQs as well).
However, libibverbs doesn't support XRC at all.

Some notes:
1) There are some OFED distributions (such as MLNX-OFED) that have XRC support.

2) Patches that extend libibverbs to support XRC were sent to the mailing list, but they weren't (yet?) accepted into the upstream libibverbs.

Fast Memory Region (FMR)

The IB spec defines FMR registration only as an operation that is posted in a Send Request. However, the RDMA stack in the kernel also provides verbs that allow creating and using FMR pools, so FMRs can be registered using verbs and not only using Send Requests.

General

The InfiniBand spec defines special return values for errors that may happen when calling the verbs (for example: Invalid HCA handle, Invalid protection domain, Insufficient resources to complete request and more). The RDMA stack in the kernel and libibverbs use errno values instead.
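
To illustrate the errno convention from the client's point of view, here is a small sketch. It assumes the two common libibverbs conventions: verbs that return an int return 0 on success or an errno value on failure, and verbs that return a pointer return NULL on failure and set errno. The helper names verb_status_from_ptr() and verb_status_str() are hypothetical and are not part of libibverbs.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for verbs that return a pointer (an MR, a QP, ...):
 * NULL means failure and the reason is expected to be in errno.
 * Returns 0 on success or the errno value on failure. */
int verb_status_from_ptr(const void *result)
{
    return result != NULL ? 0 : errno;
}

/* Hypothetical helper: a printable description of a verb status.
 * status is 0 or an errno value, as returned by verbs that return an int. */
const char *verb_status_str(int status)
{
    return status == 0 ? "success" : strerror(status);
}
```

For example, the return value of a verb that returns an int can be passed directly to verb_status_str(), and the pointer that a creation verb returned can be passed to verb_status_from_ptr() first.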

Written by: Dotan Barak on November 23, 2013. Last updated on February 19, 2015.

Comments

Mahesh says: December 6, 2013

Hi Dotan,
1) Why wasn't RD considered in the implementation? Is it because it doesn't have any use cases?
2) XRC is more relevant in user space (libibverbs), as MPI may benefit from it if it's in user space. Why is it restricted to the kernel stack?
Also, it was there in OFED-1.5.4's libibverbs but was removed from OFED-3.5. Any reason?

- Dotan Barak says: December 6, 2013
  
  Hi Mahesh.
  
  1) IMHO, RD has a lot of use cases. However, (AFAIK) there isn't a single HW device that supports it. For this reason, the RDMA stack didn't add any support for it.
  
  2) The answer is a little bit complicated:
  XRC is mostly relevant for user space. However, the RDMA stack (kernel part) added support only in the kernel space.
  There are some suggestions (and patches) to extend libibverbs in order to support XRC, but they weren't (yet) accepted.
  
  For your question about the XRC removal, I *think* that the methodology for choosing the content of the OFED distribution was changed (content is taken only from the upstream).
  
  Thanks
  Dotan
  
Baturay O. says: November 14, 2014

Hi Dotan,

I have large data and want to send it in a blockwise manner. You say there isn't any implementation of Memory Windows. So how can I handle the problem?

- Dotan Barak says: November 14, 2014
  
  Hi Baturay.
  
  What is the reason that you think that Memory Windows will help you? How did you plan to use them?
  
  Thanks
  Dotan
  
  - Baturay O. says: November 15, 2014
    
    Actually, I have a large integer vector and want to send it block by block. So I've planned that my program automatically registers the blocks in my vector. I mean I don't want to deregister and register the MR with block's address every time. I thought some windowing operation may help to solve the problem.
  - Dotan Barak says: November 15, 2014
    
    Hi Baturay.
    
    Yes, I agree that Memory Windows could be handy for your task.
    
    I wonder, what is the reason that you can't (or don't want to) (re)use the same Memory Region every time?
    
    Thanks
    Dotan
Baturay O. says: November 16, 2014

Hi Dotan.

I will use RDMA Write. So when I reuse the same MR every time, I have to send the virtual address and remote key to the other side every time, and it will take time. The communication cost is important in my study. That's the reason.

- Dotan Barak says: November 17, 2014
  
  Hi Baturay.
  
  But this issue won't be eliminated with Memory Windows;
  you'll still need to send the virtual address and the remote key (of the Memory Window).
  
  Thanks
  Dotan
  
  - Baturay O. says: November 17, 2014
    
    Hi Dotan,
    
    Oh, thanks. But I wonder, if this is the case, what is the advantage of MW compared to MR?
  - Dotan Barak says: November 17, 2014
    
    The advantage of Memory Windows over Memory Regions is:
    Lightweight generation of r_keys (with changing permissions).
    
    If you register and deregister memory, it will take a lot of time.
    However, binding a Memory Window to a Memory Region will generate a new r_key
    in a short time. If you want to invalidate this r_key, it takes a short time as well
    (since the Region is already registered).
    
    I hope that I was clear on this..
    
    Thanks
    Dotan
Baturay O. says: November 17, 2014

Hi Dotan,

I understand. Actually, that is what I want to implement. I want to register memory for the whole vector at once and use some blocks of it without deregistering and registering again. Also, I don't want to do a memcpy and, of course, I don't want to send the virtual address and r_key every time. I hope I have explained my problem clearly. In your opinion, how can I handle this issue?

- Dotan Barak says: November 17, 2014
  
  I would suggest registering the memory buffers several times, with different permissions (if needed),
  and providing the remote side with the appropriate remote key + address of the block that it needs to access.
  
  Thanks
  Dotan
  
Jack says: July 21, 2016

Hi Dotan,

Thanks for the posts! There is a lot of information here that is difficult to find elsewhere.

I have a question about the FMR section. As you said, "in the RDMA stack in the kernel there are verbs that allow creating of FMR pools using verbs too and not only using Send Requests." Assuming my RDMA cards support both methods (i.e., the FMR pool method and the Send Requests method), which one will have better performance in general?

BTW, my cards are the Mellanox ConnectX-3 Pro EN 40 Gigabit.

Thanks,
Jack

- Dotan Barak says: August 7, 2016
  
  Hi Jack.
  
  You are welcome
  :)
  
  It is hard for me to answer this question, and I would suggest that you write a benchmark for your typical scenario and check which approach provides the better performance.
  
  However, if you would ask me to guess:
  I would guess that the registration using a Work Request will provide the best performance.
  
  But again, this needs to be tested ...
  
  Thanks
  Dotan
  