
Tips and tricks to optimize your RDMA code


RDMA is used in many places, mainly because of the high performance it makes possible. In this post, I will provide tips and tricks for optimizing RDMA code in several aspects.

General tips

Avoid using control operations in the data path

Unlike the data operations, which stay in the same context they were called in (i.e., they don't perform a context switch) and are written in an optimized way, the control operations (all of the create/destroy/query/modify operations) are very expensive because:

  • Most of the time, they perform a context switch
  • Sometimes they allocate or free dynamic memory
  • Sometimes they access the RDMA device itself

As a general rule of thumb, one should avoid calling control operations in the data path, or at least minimize their use there.

The following verbs are considered data operations:

  • ibv_post_send()
  • ibv_post_recv()
  • ibv_post_srq_recv()
  • ibv_poll_cq()
  • ibv_req_notify_cq()

When posting multiple WRs, post them in a list in one call

When posting several Work Requests with one of the ibv_post_*() verbs, posting them as a linked list in a single call, instead of one call per Work Request, will provide better performance, since it allows the low-level driver to perform optimizations.
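As a rough sketch of the idea (the QP qp, the Memory Region mr and the two buffers are assumptions standing in for whatever your application has already set up), chaining two Send Requests into one call looks like this:

```c
#include <infiniband/verbs.h>
#include <string.h>
#include <stdint.h>

/* Post two buffers as a chained list of Send Requests in one call.
 * qp and mr are assumed to be an already-connected QP and a Memory
 * Region that covers both buffers. */
static int post_send_chain(struct ibv_qp *qp, struct ibv_mr *mr,
                           void *buf1, uint32_t len1,
                           void *buf2, uint32_t len2)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)buf1, .length = len1, .lkey = mr->lkey },
        { .addr = (uintptr_t)buf2, .length = len2, .lkey = mr->lkey },
    };
    struct ibv_send_wr wr[2], *bad_wr = NULL;

    memset(wr, 0, sizeof(wr));
    wr[0].wr_id   = 1;
    wr[0].sg_list = &sge[0];
    wr[0].num_sge = 1;
    wr[0].opcode  = IBV_WR_SEND;
    wr[0].next    = &wr[1];               /* chain the second WR to the first */

    wr[1].wr_id      = 2;
    wr[1].sg_list    = &sge[1];
    wr[1].num_sge    = 1;
    wr[1].opcode     = IBV_WR_SEND;
    wr[1].send_flags = IBV_SEND_SIGNALED; /* request a completion for the last WR */
    wr[1].next       = NULL;              /* end of the list */

    /* One call posts the whole chain; on failure bad_wr points at the
     * first WR that could not be posted. */
    return ibv_post_send(qp, &wr[0], &bad_wr);
}
```

The same pattern works for ibv_post_recv() and ibv_post_srq_recv() with struct ibv_recv_wr.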

When using Work Completion events, acknowledge several events in one call

When handling Work Completions using events, acknowledging several completions in one call, instead of one call per completion, will provide better performance, since fewer mutual-exclusion locks are taken.
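A sketch of the batching pattern (it assumes one CQ per completion channel; the channel and the loop bound come from the surrounding application):

```c
#include <infiniband/verbs.h>

/* Count CQ events as they arrive and acknowledge them in one batch,
 * instead of calling ibv_ack_cq_events() once per event. Assumes a
 * single CQ is attached to the channel. */
static void drain_events(struct ibv_comp_channel *channel, int iterations)
{
    struct ibv_cq *ev_cq = NULL;
    void *ev_ctx;
    unsigned int unacked = 0;

    for (int i = 0; i < iterations; i++) {
        if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))
            break;                       /* error reading the event */

        unacked++;                       /* remember it, don't ack yet */
        ibv_req_notify_cq(ev_cq, 0);     /* re-arm the CQ */
        /* ... empty the CQ with ibv_poll_cq() here ... */

        if (unacked == 64) {             /* ack a whole batch in one call */
            ibv_ack_cq_events(ev_cq, unacked);
            unacked = 0;
        }
    }
    if (unacked)
        ibv_ack_cq_events(ev_cq, unacked);  /* ack the remainder */
}
```

The batch size of 64 is an arbitrary example value; destroying a CQ with unacknowledged events will block, so make sure the remainder is always acknowledged.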

Avoid using many scatter/gather entries

Using several scatter/gather entries in a Work Request (either a Send Request or a Receive Request) means that the RDMA device will read those entries and then read the memory they refer to. Using one scatter/gather entry will provide better performance than using more than one.

Avoid using Fence

A Send Request with the fence flag set will be blocked until all prior RDMA Read and Atomic Send Requests complete. This will decrease the BW.

Avoid using atomic operations

Atomic Operations allow performing read-modify-write in an atomic way. This will usually decrease performance, since doing so typically involves locking access to the memory (implementation dependent).

Read multiple Work Completions at once

ibv_poll_cq() allows reading multiple Work Completions at once. If the number of Work Completions in the CQ is less than the number one tried to read, it means that the CQ is empty and there is no need to check whether more Work Completions are in it.
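A minimal sketch of batched polling (cq is assumed to be an existing Completion Queue; the batch size is an example value):

```c
#include <infiniband/verbs.h>
#include <stdio.h>

#define BATCH 16

/* Read up to BATCH Work Completions in one ibv_poll_cq() call. If
 * fewer than BATCH were returned, the CQ is now empty and there is no
 * need for an extra call just to discover that. */
static int poll_batch(struct ibv_cq *cq)
{
    struct ibv_wc wc[BATCH];
    int n = ibv_poll_cq(cq, BATCH, wc);

    if (n < 0)
        return -1;                        /* polling error */

    for (int i = 0; i < n; i++) {
        if (wc[i].status != IBV_WC_SUCCESS)
            fprintf(stderr, "wr_id %llu failed: %s\n",
                    (unsigned long long)wc[i].wr_id,
                    ibv_wc_status_str(wc[i].status));
        /* ... handle the completion ... */
    }
    return n;   /* n < BATCH means the CQ was emptied */
}
```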

Set processor affinity for a certain task or process

When working with Symmetric MultiProcessing (SMP) machines, binding processes to specific CPU(s)/core(s) may provide better utilization of the CPU(s)/core(s) and thus better performance. Running as many processes as there are CPU(s)/core(s) in the machine and binding one process to each CPU/core may be a good practice. This can be done with the "taskset" utility.

Work with local NUMA node

When working on a Non-Uniform Memory Access (NUMA) machine, binding processes to CPU(s)/core(s) that are local to the NUMA node of the RDMA device may provide better performance because of faster CPU access. Spreading the processes across all of the local CPU(s)/core(s) may be a good practice.

Work with cache-line aligned buffers

Working with cache-line aligned buffers (in the S/G list, Send Request, Receive Request, and data) will improve performance compared to working with unaligned memory buffers: it will decrease the number of CPU cycles and the number of memory accesses.

Improving the Bandwidth

Find the best MTU for the RDMA device

The MTU value specifies the maximum packet payload size (i.e., excluding the packet headers) that can be sent. As a rule of thumb, since the packet header sizes are the same for all MTU values, using the maximum available MTU size will decrease the "price paid" per packet; the percentage of payload data in the total used BW will increase. However, there are RDMA devices that provide their best performance at MTU values lower than the maximum supported value. One should perform some testing in order to find the best MTU for the specific device being used.

Use big messages

Sending a few big messages is more effective than sending many small messages. At the application level, one should aggregate data and send it over RDMA in big messages.

Work with multiple outstanding Send Requests

Working with multiple outstanding Send Requests and keeping the Send Queue always full (i.e., for every polled Work Completion, post a new Send Request) will keep the RDMA device busy and prevent it from being idle.

Configure the Queue Pair to allow several RDMA Reads and Atomic in parallel

If one uses RDMA Read or Atomic operations, it is advised to configure the QP to allow several RDMA Read and Atomic operations in flight, since this will provide higher BW.
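This is configured when moving the QP to RTS; a sketch (qp is assumed to already be in RTR, and the example values for timeout, retry counts and PSN stand in for whatever your connection setup uses; max_rd_atomic must not exceed max_qp_rd_atom from ibv_query_device()):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* Move the QP to RTS while allowing several outstanding RDMA
 * Reads/Atomics as the initiator (max_rd_atomic). The responder-side
 * limit, max_dest_rd_atomic, is set in the RTR transition. */
static int modify_to_rts(struct ibv_qp *qp)
{
    struct ibv_qp_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.qp_state      = IBV_QPS_RTS;
    attr.timeout       = 14;
    attr.retry_cnt     = 7;
    attr.rnr_retry     = 7;
    attr.sq_psn        = 0;
    attr.max_rd_atomic = 4;   /* up to 4 RDMA Reads/Atomics in flight */

    return ibv_modify_qp(qp, &attr,
                         IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT |
                         IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                         IBV_QP_MAX_QP_RD_ATOMIC);
}
```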

Work with selective signaling in the Send Queue

Working with selective signaling in the Send Queue means that not every Send Request produces a Work Completion when it completes; this reduces the number of Work Completions that must be handled.
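A sketch of the posting side (it assumes the QP was created with sq_sig_all = 0, and that qp, sge and a running counter exist in the surrounding application; signaling every 64th WR is an example choice):

```c
#include <infiniband/verbs.h>
#include <string.h>
#include <stdint.h>

/* With sq_sig_all = 0, only Send Requests that carry IBV_SEND_SIGNALED
 * generate a Work Completion. Here every 64th Send Request is signaled;
 * its successful completion implies that all the unsignaled WRs posted
 * before it completed as well. */
static int post_selective(struct ibv_qp *qp, struct ibv_sge *sge,
                          uint64_t count)
{
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id   = count;
    wr.sg_list = sge;
    wr.num_sge = 1;
    wr.opcode  = IBV_WR_SEND;
    if (count % 64 == 63)
        wr.send_flags = IBV_SEND_SIGNALED;  /* only this WR completes */

    return ibv_post_send(qp, &wr, &bad_wr);
}
```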

Reducing the latency

Read Work Completions by polling

In order to read Work Completions as soon as they are added to the Completion Queue, polling will provide the best results (rather than working with Work Completion events).

Send small messages as inline

On RDMA devices that support sending data inline, sending small messages as inline will provide better latency, since it eliminates the need for the RDMA device to perform an extra read (over the PCIe bus) to fetch the message payload.
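A sketch of an inline send (qp and a small buf are assumptions; the message must fit in the QP's max_inline_data, which is requested via cap.max_inline_data at QP creation):

```c
#include <infiniband/verbs.h>
#include <string.h>
#include <stdint.h>

/* Send a small message inline: the payload is copied into the Send
 * Request itself, so the device does not read the buffer over PCIe.
 * The lkey is ignored, so the buffer does not even need to be part of
 * a registered Memory Region. */
static int post_small_inline(struct ibv_qp *qp, void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        /* .lkey ignored for inline sends */
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;
    wr.send_flags = IBV_SEND_INLINE | IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad_wr);
}
```

A side benefit: since the payload is copied at post time, the buffer can be reused immediately after ibv_post_send() returns, without waiting for the Work Completion.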

Use low values in QP's timeout and min_rnr_timer

Using lower values in the QP's timeout and min_rnr_timer means that if something goes wrong and a retry is required (whether because the remote QP doesn't answer or because it doesn't have an outstanding Receive Request), the wait before retransmission will be short.

If immediate data is used, use RDMA Write with immediate instead of Send with immediate

When sending a message that carries only immediate data, RDMA Write with immediate will provide better performance than Send with immediate, since the latter causes the outstanding posted Receive Request (on the responder side) to actually be read and not merely consumed.

Reducing memory consumption

Use Shared Receive Queue (SRQ)

Instead of posting many Receive Requests for each Queue Pair, using an SRQ can reduce the total number of outstanding Receive Requests and thus the total consumed memory.
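A sketch of creating one SRQ that many QPs later share (pd is an existing Protection Domain; the queue sizes are example values):

```c
#include <infiniband/verbs.h>
#include <string.h>

/* One SRQ serving many QPs: instead of posting, say, 1024 Receive
 * Requests on each of N QPs, all the QPs pull from a single pool.
 * The returned SRQ is later passed in ibv_qp_init_attr.srq when the
 * QPs are created. */
static struct ibv_srq *create_shared_rq(struct ibv_pd *pd)
{
    struct ibv_srq_init_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.attr.max_wr  = 1024;  /* outstanding RRs shared by all QPs */
    attr.attr.max_sge = 1;

    return ibv_create_srq(pd, &attr);
}
```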

Register physically contiguous memory

Registering physically contiguous memory, such as huge pages, can allow the low-level driver(s) to perform optimizations, since fewer memory address translations will be required (compared to a buffer made of 4 KB memory pages).
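One way to do this on Linux is to back the Memory Region with huge pages; a sketch (pd is an existing Protection Domain, huge pages must be reserved on the system beforehand, e.g. via /proc/sys/vm/nr_hugepages, and the access flags here are just an example set):

```c
#define _GNU_SOURCE
#include <infiniband/verbs.h>
#include <sys/mman.h>

/* Back a Memory Region with huge pages (typically 2 MB each), so the
 * device needs far fewer address translations than with 4 KB pages.
 * len should be a multiple of the huge-page size. */
static struct ibv_mr *reg_hugepage_mr(struct ibv_pd *pd, size_t len)
{
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;    /* no huge pages available */

    return ibv_reg_mr(pd, buf, len,
                      IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ |
                      IBV_ACCESS_REMOTE_WRITE);
}
```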

Reduce the size of the used Queues to the minimum

Creating the various Queues (Queue Pairs, Shared Receive Queues, Completion Queues) may consume a lot of memory. One should set their sizes to the minimum required by the application.

Reducing CPU consumption

Work with Work Completion events

Reading Work Completions using events eliminates the need for constant polling of the CQ, since the RDMA device will send an event when a Work Completion is added to the CQ.
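A sketch of the canonical event-driven loop (the CQ is assumed to have been created with a completion channel from ibv_create_comp_channel(), and the CQ armed once with ibv_req_notify_cq() before the first wait):

```c
#include <infiniband/verbs.h>

/* Sleep in ibv_get_cq_event() until the device signals a new Work
 * Completion, then re-arm the CQ and drain it. Note the re-arm happens
 * before the drain, to avoid missing completions that arrive between
 * the two steps. */
static int wait_and_drain(struct ibv_comp_channel *channel)
{
    struct ibv_cq *ev_cq;
    void *ev_ctx;
    struct ibv_wc wc;

    if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx))  /* blocks, no CPU spin */
        return -1;
    ibv_ack_cq_events(ev_cq, 1);

    if (ibv_req_notify_cq(ev_cq, 0))                 /* re-arm for next event */
        return -1;

    while (ibv_poll_cq(ev_cq, 1, &wc) > 0) {
        /* ... handle wc ... */
    }
    return 0;
}
```

This trades some latency for CPU time: the process does not consume cycles while the CQ is empty.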

Work with solicited events in Responder side

When reading Work Completions on the Responder side, solicited events give the Requestor a way to hint that now is a good time to read the completions. This reduces the total number of Work Completion events that are handled.

Share the same CQ with several Queues

Using the same CQ with several Queues and reducing the total number of CQs will eliminate the need to check several CQs in order to understand if an outstanding Work Request was completed. This can be done by sharing the same CQ with multiple Send Queues, multiple Receive Queues or with a mix of them.

Increase the scalability

Use collective algorithms

Using collective algorithms will reduce the total number of messages that cross the wire and decrease the number of messages and resources that each node in the cluster uses. There are RDMA devices that provide special collective-offload operations that help reduce CPU utilization.

Use Unreliable Datagram (UD) QP

If every node needs to be able to send messages to, or receive messages from, any other node in the subnet, using connected QPs (either Reliable or Unreliable) may be a bad solution, since many QPs would be created on every node. Using a UD QP is better, since one UD QP can send messages to and receive messages from any other UD QP in the subnet.


Comments


  1. Rdma User says: October 28, 2013

    Hi RDMAmojo.

    Do you know about the performance dependence of RDMA reads or writes on the size of the registered memory region?

My experiments show that random reads on a 1 KB registered region are much faster than random reads on a 1 GB registered region. Random writes are even faster. My setup is a server with several connected clients. With 1 KB registered at the server, I get about 5 million reads per second. With 1 GB, I get only about 1.6 million reads per second. The difference for writes is even larger.

    What could be the reason for this? HCA caching could explain the faster reads (it could cache the entire 1 KB region, while it probably can't cache the 1 GB region). But caching cannot explain the faster writes. Does the HCA delay writes to the CPU memory by buffering them in the HCA's RAM?

    Could the faster performance for reads/writes be due to TLB misses in the HCA?

    • Dotan Barak says: October 30, 2013

      Hi.

      There shouldn't be any performance impact when using small or big Memory Regions.

      Did you try to register 1 GB buffer and access only the first 1 KB?
      (just to prove that the above sentence is correct in your benchmark)

      Thanks
      Dotan

  2. Rdma User says: October 30, 2013

    Hi Dotan,

    Sorry, I should have posted that I solved my issue. The slowness was due to TLB misses inside the HCA, which I removed by using hugepages.

    I made reads faster by using your advice: "Configure the Queue Pair to allow several RDMA Reads and Atomic in parallel".

    Thanks a ton for your blog!

    • Dotan Barak says: October 31, 2013

      I'm happy that you find this blog helpful
      :)

      Thanks
      Dotan

  3. Ariel says: March 31, 2014

    Hi Dotan,
    Re: Avoid using multiple SGEs
Are you suggesting that if I have 10 separate buffers to send, I should use a list of 10 chained work requests (in a single post) rather than a single work request with 10 SGEs?

    What is the performance gain if any?
    Thanks,
    Ariel

    • Dotan Barak says: March 31, 2014

      Hi Ariel.

      I mean that if possible, one should use one S/G entry (in one message) instead of multiple S/G entries.
Doing this will decrease the amount of memory that is read by the device and the number of memory accesses (the same data will be read, but accessing different memory places will harm performance).

If the options are 10 different messages or 10 S/G entries in one message, I'm quite sure that the latter will provide the best performance, for many reasons.

I'm sorry, but I don't have any performance numbers that I can share with you. But you can write a trivial application to check this..
      ;)

      Thanks
      Dotan

  4. Lior says: November 28, 2014

    Hi Dotan,

    Thanks for the excellent Blog.

    I have a many to many application and I am using UD qp, one on each process.
I wanted to ask if there is a limit to the IOPs a single QP can provide.
Is it better to use 2 QPs instead of 1 QP in each process?

    Thanks a lot.

    Lior

    • Dotan Barak says: November 28, 2014

      Hi Lior.

      Thanks for the feedback
      :)

      It is very hard for me to answer this question, since there are many variables which can influence the answer.

But if you'll push me into a corner and ask for an answer, I'll say that 2 QPs will provide more IOPs than 1.
(But if I were you, I would implement both options and check, in my specific application and configuration, whether this has any impact - since managing 2 QPs may add some complexity to the program.)

      I hope that this answer helped you..

      Thanks
      Dotan

  5. Lior says: December 1, 2014

    Hi Dotan,

    I have a question regarding profiling code using libibverbs.
    Is there a recommended way or tutorial out there ?

    Currently I am profiling using 2 methods using home made time stamping (using rdtsc) around
    interesting areas + using gprof (using 2 separate runs of the code).
I can see that the libibverbs functions such as ibv_post_send and ibv_post_recv, which take a considerable
amount of time according to my homemade timestamping method, do not appear in the gprof output.
    Do you have any ideas??

    Also, is there a performance report regarding the cost (in time) of the libibverbs routines??

    Regards
    --lior

    • Dotan Barak says: December 1, 2014

      Hi Lior.

      IMHO, using rdtsc is a good way to profile code.

Profiling can be tricky, since you may get different values on different RDMA devices (even from the same HW vendor); let's add chipsets to this equation, too. In RDMA, a lot of SW (starting from the low-level driver) and many HW flows are involved.

      ibv_post_send() may be involved in several locks (spinlocks/mutexes) and write barriers
      (depends on the driver that you are using).

      AFAIK, the only profiler that exists for libibverbs is libibprof,
      which is part of the Mellanox HPC-X™ Software Toolkit
      (http://www.mellanox.com/page/products_dyn?product_family=189&mtag=hpc-x).

      I don't know about any document of performance report about cost of libibverbs routines
      (Maybe there is, but I'm not aware of it).

      Sorry, and I hope that I helped you.
      Dotan

  6. Rafi C says: December 12, 2014

    Hi Dotan,

First of all, thanks for this wonderful blog. I was trying to improve our RDMA implementation. From this blog I understood that using memory registration in the data path will decrease performance to some extent, and I got some numbers from a trivial application that prove the same. Currently we are registering/deregistering memory for each read and write. Is there any way to overcome this issue, where I can pin a large enough memory area at startup and make use of that region dynamically? Could memory windows or shared memory help in this case?

    Thanks & Regards
    Rafi KC

    • Dotan Barak says: December 12, 2014

You are welcome, and thanks for the compliments
      :)

Yes. Registering/deregistering memory in the data path is a performance killer.
I suggest that you register the memory block in advance (in one or several Memory Regions),
      and only give the Memory Region attributes (address + r_key) to the remote side.

      Using Memory Windows (if they are implemented) can be a good idea as well,
since allocating/invalidating/deallocating them is much cheaper than Memory Regions.

      I hope that I helped here.

      Thanks
      Dotan

      • Rafi C says: December 12, 2014

        Thanks for a quick reply, and obviously it will be really helpful for us.

In fact, I have a couple of doubts regarding preallocation. We are transferring large amounts of data in chunks of various sizes (0 to some thousands of KB). What would be the appropriate size of each Memory Region, and the number of Memory Regions?

Also, in scenarios where we get a buffer in the RDMA application and are forced to write into that same buffer, I guess we will be forced to do the registration in the code path, unless we go with an extra level of copying?

        Rafi KC

      • Dotan Barak says: December 12, 2014

        Sure.

You can use whichever Memory Region size you need, as long as:
1) The size is supported by the RDMA device (this value can be queried in the device attributes)
2) There are enough resources in the RDMA device to handle this Memory Region
(many Memory Regions will consume a lot of resources)
3) You don't pin too much of the computer's memory, since this may
make your computer very slow

If needed, maybe you can configure the low-level driver of the RDMA device
to prepare more resources in advance.

Thousands of KBs isn't a lot of memory nowadays ..
(when you can have a machine with several GBs of RAM).

Yes, there are 2 options:
1) Copy the memory to the supplied buffer
2) Register the memory and have the RDMA device write to it directly (i.e., Zero Copy)

        Obviously, zero copy will give you the best performance ..

        Dotan

  7. Dmitry says: December 14, 2014

    Hi Dotan,
    Thanks for the blog.
    I found one more helpful tip which you may include in the optimization:
    Use aligned RDMA if possible.
    In my tests I measured the timing of RDMA write ibv_post_send() using blocks with 8-byte alignment and 1 page alignment (4096 bytes). The former took around 3000 clocks, the latter around 350 clocks.

    • Dotan Barak says: December 16, 2014

      Hi Dmitry.

      Thanks for the tip!
      I wonder, what is the block size that you used?
      Did you enable inline send?
      (since theoretically, there shouldn't be any difference in the CPU usage of the ibv_post_send() code itself, no matter if the data buffer is aligned or not).

      Which memory that should be aligned? the data itself or the Send Request and s/g list?

      I wonder, can you share this benchmark?

      Thanks
      Dotan

      • Dmitry says: December 17, 2014

        Hi Dotan.

Let me give some more details of my setup. I have a producer and a consumer side. The producer has packets of various sizes, say from 600 bytes to 1500 bytes. Initially I tried to send every packet aligned to an 8-byte boundary in the RDMA buffer, and I found that the time spent in ibv_post_send() is about 3000 clocks, measured using rdtsc. Then I decided to combine several packets into one RDMA send. This send is performed once the page (4096 bytes) is full, and exactly one page is sent irrespective of how many packets are in it. The number of calls was reduced, of course. But the most surprising thing was the reduction of the time spent in ibv_post_send() - it is about 350 clocks. So I thought there may be some optimization in the case where exactly one page is sent.

I didn't use inline in either approach.
        Yes, it is the data itself that should be aligned.

        Unfortunately I cannot share the code, it is under NDA.
        Best regards,
        Dmitry

      • Dotan Barak says: December 25, 2014

        Hi Dmitry.

        Which device are you using?
        I need to understand more details before I give a recommendation...
AFAIK, there isn't any optimization in the flow of post_send() that depends on the alignment of the data.

However, maybe some other addresses were affected by the fact that the data is page aligned
(for example, the s/g list array or the Send Request is now cache-line aligned).

        Thanks
        Dotan

      • Dmitry says: January 11, 2015

        Hi Dotan,
        Sorry for late reply, I was out of work on holiday.
        The device we use is:
        46:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Both the SG list and the WR are on the stack, and the data in both cases is on the heap. I think maybe this behavior is caused by cache overflow when too many requests are performed in the non-aligned case...

      • Dotan Barak says: January 16, 2015

        Hi Dmitry.

        Thanks for the update.
I'll add a note specifying that all memory buffers (WRs, S/Gs, data) should be cache-line aligned in order to get the best performance.

        Thanks
        Dotan

  8. Jack says: June 8, 2015

    Hello Dotan,
    I have two questions,
1) Do you know the max number of S/G elements supported in a send/recv work request?
2) What's the max number of work requests we can post with post_send at one time?

    All the best
    Jack

    • Dotan Barak says: June 8, 2015

      Hi Jack.

Those numbers are attributes of the device that you are working with
(part of the HCA attributes).

      Different devices may have different attributes.

You can get those values from the verbs layer (look at the post on ibv_query_device() for more information),
      or by executing ibv_devinfo from the command line.

      I hope that this helped you.
      Dotan

      • Jack says: June 8, 2015

        Thanks a lot Dotan, that's helpful.
For me, the max sge is 32, and the max wr is 16351.
Does that mean I can post up to 32*16351 S/G elements in only one post_send?

      • Dotan Barak says: June 9, 2015

Theoretically, yes (unless there are more resource limitations on the machine, device, etc.).
        But please notice that every Work Request is one message (no matter the number of S/G entries it has).

        Thanks
        Dotan
