Verify that RDMA is working

In the last few posts, I explained how to install the RDMA stack in several ways (inbox, OFED and manually). In this post, I'll describe how to verify that the RDMA stack is working properly.

Verify that RDMA kernel part is loaded

First, one should check that the kernel part of the RDMA stack is working. There are two options to do this: using the service file or using lsmod.

Verify that RDMA kernel part is loaded using service file

Verifying that the kernel part is loaded can be done using the relevant service file of the package/OS. For example, on an inbox RedHat 6.* installation:

[root@localhost] # /etc/init.d/rdma status
Low level hardware support loaded:
        mlx4_ib
 
Upper layer protocol modules:
        ib_ipoib
 
User space access modules:
        rdma_ucm ib_ucm ib_uverbs ib_umad
 
Connection management modules:
        rdma_cm ib_cm iw_cm
 
Configured IPoIB interfaces: none
Currently active IPoIB interfaces: ib0 ib1

Verify that RDMA kernel part is loaded using lsmod

In all Linux distributions, lsmod can show the loaded kernel modules.

[root@localhost] # lsmod | grep ib
mlx4_ib               113239  0
mlx4_core             189003  2 mlx4_ib,mlx4_en
ib_ipoib               68315  0
ib_ucm                  9597  0
ib_uverbs              30216  2 rdma_ucm,ib_ucm
ib_umad                 8931  4
ib_cm                  30987  3 ib_ipoib,ib_ucm,rdma_cm
ib_addr                 5176  2 rdma_ucm,rdma_cm
ib_sa                  19056  5 mlx4_ib,ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_mad                 32968  4 mlx4_ib,ib_umad,ib_cm,ib_sa
ib_core                59893  11 mlx4_ib,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad

One should verify that the following kernel modules are loaded: ib_uverbs and the low-level driver of the hardware installed in the machine.
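
One can also perform this check programmatically: when ib_uverbs is loaded and devices are registered, it creates uverbs<N> character devices under /dev/infiniband (assuming a standard udev setup). Here is a minimal C sketch of such a check; the file name and build line are only assumptions about the environment (e.g. gcc check_uverbs.c):

#include <stdio.h>
#include <string.h>
#include <dirent.h>

int main(void)
{
    DIR *dir = opendir("/dev/infiniband");
    struct dirent *ent;
    int found = 0;

    if (!dir) {
        perror("opendir(/dev/infiniband)");
        return 1;
    }

    while ((ent = readdir(dir)) != NULL) {
        /* ib_uverbs creates one uverbs<N> char device per registered RDMA device */
        if (!strncmp(ent->d_name, "uverbs", 6)) {
            printf("Found /dev/infiniband/%s\n", ent->d_name);
            found = 1;
        }
    }

    closedir(dir);

    if (!found)
        fprintf(stderr, "No uverbs char devices found - is ib_uverbs loaded?\n");

    return !found;
}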

Verify that userspace applications are working

Verify that RDMA devices are available

ibv_devices is a tool, included in the libibverbs-utils rpm, that shows the available RDMA devices in the local machine.

[root@localhost libibverbs]# ibv_devices
device                 node GUID
------              ----------------
mlx4_0              000c29632d420400

One should verify that the number of available devices equals the number of devices expected in the local machine.
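
The same enumeration can be done from code with libibverbs. Here is a minimal C sketch of such a check (the build line, e.g. gcc list_devices.c -libverbs, is an assumption about the environment):

#include <stdio.h>
#include <endian.h>
#include <inttypes.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);

    if (!dev_list) {
        perror("ibv_get_device_list");
        return 1;
    }

    printf("Found %d RDMA device(s)\n", num_devices);
    for (int i = 0; i < num_devices; ++i)
        /* The GUID is returned in network byte order */
        printf("    %-16s node GUID 0x%016" PRIx64 "\n",
               ibv_get_device_name(dev_list[i]),
               (uint64_t)be64toh(ibv_get_device_guid(dev_list[i])));

    ibv_free_device_list(dev_list);
    return 0;
}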

Verify that RDMA devices can be accessed

ibv_devinfo is a tool, included in the libibverbs-utils rpm, that opens a device and queries its attributes; by doing so, it verifies that the user and kernel parts of the RDMA stack can work together.

[root@localhost libibverbs]# ibv_devinfo -d mlx4_0
hca_id: mlx4_0
    transport:                  InfiniBand (0)
    fw_ver:                     1.2.005
    node_guid:                  000c:2963:2d42:0300
    sys_image_guid:             000c:2963:2d42:0200
    vendor_id:                  0x02c9
    vendor_part_id:             25418
    hw_ver:                     0xa
    phys_port_cnt:              2
            port:   1
                    state:              PORT_ACTIVE (4)
                    max_mtu:            4096 (5)
                    active_mtu:         4096 (5)
                    sm_lid:             1
                    port_lid:           1
                    port_lmc:           0x00
                    link_layer:         InfiniBand
            port:   2
                    state:              PORT_INIT (2)
                    max_mtu:            4096 (5)
                    active_mtu:         256 (1)
                    sm_lid:             0
                    port_lid:           0
                    port_lmc:           0x00
                    link_layer:         InfiniBand

One should verify that at least one port is in the PORT_ACTIVE state, which means that the port is up and can be used for sending and receiving traffic.
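
This port check can also be done from code. Here is a minimal C sketch that opens the first device and prints the state of each of its ports; the choice of the first device and the build line (gcc check_ports.c -libverbs) are assumptions:

#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);

    if (!dev_list || num_devices == 0) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    /* Open the first device; a full check would iterate over all of them */
    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) {
        fprintf(stderr, "Failed to open %s\n", ibv_get_device_name(dev_list[0]));
        return 1;
    }

    struct ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr)) {
        perror("ibv_query_device");
        return 1;
    }

    /* Port numbers start at 1 */
    for (int port = 1; port <= dev_attr.phys_port_cnt; ++port) {
        struct ibv_port_attr port_attr;

        if (ibv_query_port(ctx, port, &port_attr))
            continue;
        printf("%s port %d state: %s\n", ibv_get_device_name(dev_list[0]),
               port, ibv_port_state_str(port_attr.state));
    }

    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return 0;
}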

Verify that traffic is working

Send traffic using ibv_*_pingpong

The ibv_*_pingpong tests, included in the libibverbs-utils rpm, send traffic over RDMA using the SEND opcode. They are relevant only to InfiniBand and RoCE.

It is highly recommended to execute these tools with an explicit device name and port number, although they will work without any parameters; without parameters, they use the first detected RDMA device and port number 1.

Here is an execution example of the server side:

[root@localhost libibverbs]# ibv_rc_pingpong -g 0 -d mlx4_0 -i 1
  local address:  LID 0x0003, QPN 0xb5de9e, PSN 0x9d7046, GID fe80::c:2963:2d42:401
  remote address: LID 0x0003, QPN 0xb5de9f, PSN 0xfeec26, GID fe80::c:2963:2d42:401
8192000 bytes in 0.27 seconds = 239.96 Mbit/sec
1000 iters in 0.27 seconds = 273.11 usec/iter

Here is an execution example of the client side (the IP address is the address of the machine that the server is running on):

[root@localhost libibverbs]# ibv_rc_pingpong -g 0 -d mlx4_0 -i 2 192.168.2.106
  local address:  LID 0x0003, QPN 0xb5de9f, PSN 0xfeec26, GID fe80::c:2963:2d42:401
  remote address: LID 0x0003, QPN 0xb5de9e, PSN 0x9d7046, GID fe80::c:2963:2d42:401
8192000 bytes in 0.27 seconds = 245.91 Mbit/sec
1000 iters in 0.27 seconds = 266.50 usec/iter

One should execute the server side before the client side (otherwise, the client will fail to connect to the server).

Send traffic using rping

rping is a tool, included in the librdmacm-utils rpm, that sends RDMA traffic. rping is relevant to all RDMA-powered protocols (InfiniBand, RoCE and iWARP).
The address for both the client and server sides (the '-a' parameter) is the address that the server listens on. In InfiniBand, this should be the address of an IPoIB network interface; in RoCE and iWARP, it is the IP address of the network interface.

Here is an execution example of the server side:

[root@localhost libibverbs]# rping -s -a 192.168.11.1 -v
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
server ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
server ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
server ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
server ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
server ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
server ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
server ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
server ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
server ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
server ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
server ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
server ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
server ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
server ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
server ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
server ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ

Here is an execution example of the client side:

[root@localhost libibverbs]# rping -c -a 192.168.11.1 -v             
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ

One should execute the server side before the client side (otherwise, the client will fail to connect to the server).

rping will run endlessly and continue printing the data to stdout until CTRL-C is pressed.
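
For completeness, the address-to-device resolution that rping relies on can be exercised directly with librdmacm. Here is a minimal C sketch; the destination address is taken from the rping example above, and the build line (gcc resolve_addr.c -lrdmacm -libverbs) is an assumption:

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ch;
    struct rdma_cm_id *id;
    struct rdma_cm_event *event;
    struct sockaddr_in dst;
    int ret = 1;

    ch = rdma_create_event_channel();
    if (!ch) {
        perror("rdma_create_event_channel");
        return 1;
    }

    if (rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        return 1;
    }

    /* The server address from the rping example above (an IPoIB address for InfiniBand) */
    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    inet_pton(AF_INET, "192.168.11.1", &dst.sin_addr);

    /* Map the IP address to a local RDMA device (2000 ms timeout) */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000)) {
        perror("rdma_resolve_addr");
        goto out;
    }

    if (rdma_get_cm_event(ch, &event))
        goto out;

    if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED) {
        printf("Address resolved to RDMA device %s\n",
               ibv_get_device_name(id->verbs->device));
        ret = 0;
    } else {
        fprintf(stderr, "Address resolution failed: %s\n",
                rdma_event_str(event->event));
    }
    rdma_ack_cm_event(event);

out:
    rdma_destroy_id(id);
    rdma_destroy_event_channel(ch);
    return ret;
}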

Comments

  1. Murthy says: March 31, 2015

    Hi Dotan,
    I have a very fundamental doubt:
    what actually improves latency in RDMA?
    Is it kernel bypass, or the saved address translation (memory registration)? If yes, both are there in send/recv as well.
    The only reason I can imagine is one DMA saved in RDMA, which otherwise needs to be done to retrieve the recv WQE in send/recv.
    Please clarify, if possible.

    • Dotan Barak says: April 17, 2015

      Hi Murthy.

      Several things improve the latency of RDMA (compared to other technologies). I guess that the most important ones are:
      * Kernel bypass (the kernel path may add tens to hundreds of nanoseconds on each side)
      * The fact that memory buffers are always present in RAM (no page faults)
      * The fact that the RDMA device, and not SW (i.e. the network stack), handles the data path

      When comparing RDMA Write vs. Send, in RDMA Write there isn't any consumption of a Receive Request (fewer PCI transactions),
      and as soon as data is received, the device knows the address it should be written to (no delay until the Receive Request is fetched).

      I hope that this was clear enough
      :)

      Dotan

  2. Murthy says: March 31, 2015

    Hi Dotan,
    Can you please shed light on the technique/feature that improves latency in RDMA?
    The only reason I can imagine is that one DMA is saved, which is otherwise needed to fetch the recv WQE in the case of send/recv.

    • Dotan Barak says: April 17, 2015

      I believe that I answered you in the previous comment...

  3. tamlok says: April 19, 2016

    Hi! Given that we want to transfer 100 pages (contiguous or not) through RDMA READ, which way is more efficient: 100 WRs with only one SGE, or 10 WRs with 10 SGEs each? Thanks very much!

    • Dotan Barak says: April 22, 2016

      It is HW specific.

      However, IMHO 10 WRs with 10 S/Gs each will be more effective than the other suggestion,
      since the overhead of checking the Send Request attributes (those not related to the S/Gs) will be reduced.
      For example: checking whether the QP exists, whether the WQ is full, etc.

      I would suggest writing a benchmark to be sure.

      Thanks
      Dotan

  4. Youngmoon says: May 13, 2016

    Can RDMA be implemented inside the kernel, or as a kernel module?
    I want it to do its job transparently.
    Is there a kernel-level implementation that uses only kernel headers?

    • Dotan Barak says: May 16, 2016

      Hi.

      Yes, RDMA can work at the kernel level.
      IPoIB is an example of such a module.

      Thanks
      Dotan

  5. A. M. Sheppard says: August 20, 2016

    Hello again, my good sir!

    Finally having my Mellanox Ex III/20GBps (MT25208's) installed, I came across some good info and thus decided to switch to Debian 8 instead of SLES 11 SP4. The cards seem detected, yet I'm having more than a bit of bother.

    I have two machines, HPV00 & HPV01, respectively. Both 4x PCIE cards are in 8x PCIE 1.0 slots.

    When attempting to connect each card's respective port 0 to the other, I can only get them to link @ 2.5 Gbps (HPV00 Port 0 to HPV01 Port 0). When connecting HPV00 Port 0 to HPV01 Port 1, I get an ibstate rate of 10 Gb/sec (4X). Connecting HPV00 Port 0 to HPV00 Port 1 returns a linked rate of 20 Gb/sec (4X DDR)... per card specs.

    I am unable to get IPoIB operational, thus unable to verify that traffic is working (as advised in this article).

    I think I bungled up my port rates not knowing how to use ibportstate properly. How can I ensure I've properly reset {port, node, etc.} GUIDs/LIDs back to their default states, and/or how can I force 4X DDR on each port?

    I am using an OpenSM 3.3.18 config (/etc/opensm/opensm.conf), from Debian repos, not Mellanox OFED. Apologies that I should have called "Port 0" "Port 1", etc., per ibstat & ibstatus.

    Linux hpv00 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
    Linux hpv01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux

    HPV00 & HPV01 lsmod ib
    http://pastebin.com/ELtfTE8x

    HPV00 mthca Port 0 to HPV01 mthca Port 1
    http://pastebin.com/0PgKkLpq

    HPV00 & HPV01 mthca ibportstate history
    http://pastebin.com/d8kZHaEk

    Any advice is appreciated.

    • A. M. Sheppard says: August 20, 2016

      CORRECTION:
      Per HPV00 mthca Port 0 (really "Port 1") to HPV01 mthca Port 1 (really "Port 2")'s pastebin, I have misinfo in it. Checking the OpenSM Status's "Loading Cached Option:guid = 0x0002c90200223ac9",

      ll /var/log/opensm.0x0002c90200223ac9.log returns 542798 [7A1EB700] 0x02 -> SUBNET UP

      Apologies for adding to the confusion. Please advise if there's any more info I can provide.

      • A. M. Sheppard says: August 20, 2016

        A better log for my comment stating my correction:
        HPV00 log OpenSM bound Port GUID after fresh SM restart
        http://pastebin.com/RT2Unqwd

      • Dotan Barak says: September 16, 2016

        Hi.

        Sorry, I had a lot to handle and failed to answer until now.
        Do you still have a problem?

        What is the output of ibv_devinfo?
        (you can send me this by mail)

        Thanks
        Dotan

  6. A. M. Sheppard says: September 17, 2016

    Hello Dotan -

    It's a pleasure to hear from you.

    Since posting 2016-08-20, I switched my connected port on HPV01 mthca from Port 1 (GUID 0x0002c90200223719) to Port 2 (GUID 0x0002c9020022371a) as HPV01 Port 1 was only showing LinkUp of 2.5/SDR when connected to HPV00 mthca Port 1 (HPV00 GUID 0x0002c90200223ac9). I have only one cable.

    I have successfully set up IPoIB via HPV00 Port 1 to HPV01 Port 2 ... though, as stated above, it's only connected @ 10Gbps/4X. Of course, I would prefer to be able to ensure all ports are running at 20 Gbps/4X DDR.

    As requested & for the sake of anyone stumbling across this thread, here's the ibv_devinfo && ibstatus && ibstat && iblinkinfo && ibportstate -D 0 1 && ibdiagnet -lw 4x -ls 5 -c 1000 && lspci -Qvvs && cat /sys/class/net/{ib1, ib0}/mode && uname -a for both HPV01 && HPV00.

    HPV01 Port 2 to HPV00 Port1 - 4X - Jessie
    http://pastebin.com/4Lkparcm

    HPV00 Port 1 to HPV01 Port 2 - 4X - Jessie
    http://pastebin.com/5AwqNUAB

    Looking forward to your insights.

    (Note: it seems I'm unable to reply to your 2016-09-16 response as max. thread depth seems reached.)

    • Dotan Barak says: September 21, 2016

      Hi.

      I suggest ignoring the port attributes before the SM has configured the fabric;
      I can see that this is the case since the logical port is INITIALIZING and not ACTIVE.

      The SM will configure the ports to use the maximum possible values.

      Thanks
      Dotan

  7. Muneendra Kumar says: July 5, 2017

    Hi, I have downloaded the Linux distro and the corresponding libs as specified in the link below.
    https://community.mellanox.com/docs/DOC-2184
    I did exactly as the above link specifies.
    I am able to do ibv_rc_pingpong on both client and server.
    But when I try to do rping it's not working. Any suggestions here will help me a lot.

    • Dotan Barak says: July 8, 2017

      Hi.

      I can't answer if I don't know what the problem is;
      rping works on any RDMA device (from all vendors):
      * InfiniBand - if the IPoIB I/F is up and configured
      * RoCE - if the I/F is configured

      Thanks
      Dotan

  8. githubfoam says: April 26, 2018

    I do not get it. There are the inbox driver and the Mellanox OFED driver. The inbox driver is built and ready to use. When you install the Mellanox OFED driver, it uninstalls some parts of the kernel and then inserts its own modules. Why would someone bother to do this? What are the advantages of using the Mellanox OFED driver over the inbox driver? Thanks.

    • Dotan Barak says: April 27, 2018

      Hi.

      A word of ethics: I'm currently a Mellanox Technologies employee.

      Now, to the answer:
      The inbox driver is a relatively old driver, based on code that was accepted by the upstream kernel.

      MLNX-OFED contains the most up-to-date code, with some features/enhancements that:
      a) weren't (yet) submitted to the upstream kernel due to time limitations
      b) were merged into the upstream kernel but not yet released in the inbox driver (by any Linux distribution)
      c) were denied by the community

      The downside of this is that you change the kernel modules that you load,
      with all the implications of this...

      Thanks
      Dotan

  9. Zhig says: January 5, 2019

    Hi Dotan,

    One question about the IB perf tests (I couldn't find a more relevant rdmamojo page to ask this question).

    First let me describe my use case:

    So I'm planning to limit the bandwidth of InfiniBand temporarily (my final goal is to vary the bandwidth and see its impact on my application). The solution I came to is to use the InfiniBand perf tests (e.g. ib_read_bw) in the background to consume part of the bandwidth. For example, by having an ib_read_bw that consumes 9GB/sec of my 10GB/sec network, I will have 1GB/sec left for my application.

    Now my questions:
    1- Is there any better, more standard way to limit (or throttle) the bandwidth?
    2- Is there a way to prioritize the ib_read_bw packets over my application packets, so that I will be sure that 9GB/sec is dedicated to ib_read_bw, and my app will not steal that.
    3- There is a flag in ib_read_bw (-w or --limit_bw) that seems to be perfect for me, but I don't seem to get it to work properly. What I do is:
    on the server: ib_read_bw -w 5
    on the client: ib_read_bw SERVER_IP -w 5
    but the final report indicates that the bandwidth was not limited.
    What did I do wrong?

    Thank you

    • Dotan Barak says: May 18, 2019

      Hi.

      AFAIK, the only way to limit the BW is to use the rate limit in the Address Vector
      (the whole point of RDMA is best performance - not lowering it).

      AFAIK, there isn't any tool that allows controlling the effective BW.

      Thanks
      Dotan

  10. briankr says: March 4, 2019

    I have a basic question - I think. I am interested in using RDMA to get data from a Mellanox card to my GPU. The potential wrinkle is that the data is sourced by a non-GPU server that is just spewing out a datastream.

    The other question is if I can verify RDMA using a single server that has 2 GPUs and 2 Mellanox cards. Do I need an external switch?

    Thanks in advance.

    • Dotan Barak says: March 8, 2019

      Hi.

      1) Let me see if I understand your question:
      You have a computer (without a GPU) that has data, and you want another computer to take this data
      and write/use it with a GPU.

      I don't see any problem with this - it will work;
      the GPU isn't really a factor here...

      2) I don't understand what the expected topology is:
      You can use the following topology:
      device 1 port 1 -> device 2 port 1
      device 1 port 2 -> device 2 port 2

      And you won't need any switch.

      If you want full connectivity between all the ports, you'll need a switch
      (since in the described topology, you can't send any message from port 1 to port 2 on any device)

      Thanks
      Dotan

  11. Hamed says: January 11, 2021

    Hi Dotan,

    I hope this post finds you well! All the above tests are working for me except for 'ibv_rc_pingpong'. I am receiving the following error on the client side: Failed status transport retry counter exceeded (12) for wr_id 2. Please see the full output below.

    Have you ever encountered such an issue? Any tips/advice would be much appreciated!

    Thanks, in advance!

    Best,
    Hamed

    Server:

    ibv_rc_pingpong -g 4 -d mlx5_0
    local address: LID 0x0000, QPN 0x00067c, PSN 0x8dccae, GID ::ffff:192.168.1.2
    remote address: LID 0x0000, QPN 0x000716, PSN 0xbda7d2, GID ::ffff:192.168.1.3

    Client:

    ibv_rc_pingpong -g 4 -d mlx5_0 192.168.1.2
    local address: LID 0x0000, QPN 0x000716, PSN 0xbda7d2, GID ::ffff:192.168.1.3
    remote address: LID 0x0000, QPN 0x00067c, PSN 0x8dccae, GID ::ffff:192.168.1.2
    Failed status transport retry counter exceeded (12) for wr_id 2

    • Dotan Barak says: February 28, 2021

      Hi.

      Many reasons can cause this problem;
      I don't have enough information here to understand what went wrong
      (MTU too big? network interface IPs aren't configured? SM wasn't executed - for IB?).

      Thanks
      Dotan
