Verify that RDMA is working

In the last few posts, I explained how to install the RDMA stack in several ways (inbox, OFED and manually). In this post, I'll describe how to verify that the RDMA stack is working properly.

Verify that the RDMA kernel part is loaded

First, one should check that the kernel part of the RDMA stack is working. There are two options to do this: using the service file or using lsmod.

Verify that the RDMA kernel part is loaded using the service file

Verifying that the kernel part is loaded can be done using the relevant service file of the package/OS. For example, on an inbox RedHat 6.* installation:

[root@localhost] # /etc/init.d/rdma status
Low level hardware support loaded:
        mlx4_ib
 
Upper layer protocol modules:
        ib_ipoib
 
User space access modules:
        rdma_ucm ib_ucm ib_uverbs ib_umad
 
Connection management modules:
        rdma_cm ib_cm iw_cm
 
Configured IPoIB interfaces: none
Currently active IPoIB interfaces: ib0 ib1

Verify that the RDMA kernel part is loaded using lsmod

In all Linux distributions, lsmod can show the loaded kernel modules.

[root@localhost] # lsmod | grep ib
mlx4_ib               113239  0
mlx4_core             189003  2 mlx4_ib,mlx4_en
ib_ipoib               68315  0
ib_ucm                  9597  0
ib_uverbs              30216  2 rdma_ucm,ib_ucm
ib_umad                 8931  4
ib_cm                  30987  3 ib_ipoib,ib_ucm,rdma_cm
ib_addr                 5176  2 rdma_ucm,rdma_cm
ib_sa                  19056  5 mlx4_ib,ib_ipoib,rdma_ucm,rdma_cm,ib_cm
ib_mad                 32968  4 mlx4_ib,ib_umad,ib_cm,ib_sa
ib_core                59893  11 mlx4_ib,ib_ipoib,rdma_ucm,ib_ucm,ib_uverbs,ib_umad,rdma_cm,ib_cm,iw_cm,ib_sa,ib_mad

One should verify that the following kernel modules are loaded: ib_uverbs and the low-level driver of the hardware installed in the machine.

Verify that userspace applications are working

Verify that RDMA devices are available

ibv_devices is a tool, included in the libibverbs-utils RPM, that shows the available RDMA devices in the local machine.

[root@localhost libibverbs]# ibv_devices
device                 node GUID
------           ----------------
mlx4_0           000c29632d420400

One should verify that the number of available devices equals the number of devices expected in the local machine.
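
For readers who prefer to do this check from code, here is a minimal sketch (not part of the original post; the file name and build line are assumptions) that uses ibv_get_device_list() from libibverbs to enumerate the local RDMA devices, roughly what ibv_devices does under the hood. Build it with something like: gcc list_devices.c -o list_devices -libverbs

/*
 * Minimal sketch: enumerate the RDMA devices that libibverbs can see,
 * similar to what ibv_devices prints (file name is an assumption).
 */
#include <stdio.h>
#include <endian.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);

    if (!dev_list) {
        perror("ibv_get_device_list");
        return 1;
    }

    printf("Found %d RDMA device(s)\n", num_devices);
    for (int i = 0; i < num_devices; i++)
        /* The GUID is returned in network byte order; convert it for printing */
        printf("    %-16s 0x%016llx\n",
               ibv_get_device_name(dev_list[i]),
               (unsigned long long)be64toh(ibv_get_device_guid(dev_list[i])));

    ibv_free_device_list(dev_list);
    return 0;
}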

Verify that RDMA devices can be accessed

ibv_devinfo is a tool, included in the libibverbs-utils RPM, that opens a device and queries its attributes; doing so verifies that the user space and kernel parts of the RDMA stack can work together.

[root@localhost libibverbs]# ibv_devinfo -d mlx4_0
hca_id: mlx4_0
    transport:                  InfiniBand (0)
    fw_ver:                     1.2.005
    node_guid:                  000c:2963:2d42:0300
    sys_image_guid:             000c:2963:2d42:0200
    vendor_id:                  0x02c9
    vendor_part_id:             25418
    hw_ver:                     0xa
    phys_port_cnt:              2
            port:   1
                    state:              PORT_ACTIVE (4)
                    max_mtu:            4096 (5)
                    active_mtu:         4096 (5)
                    sm_lid:             1
                    port_lid:           1
                    port_lmc:           0x00
                    link_layer:         InfiniBand
            port:   2
                    state:              PORT_INIT (2)
                    max_mtu:            4096 (5)
                    active_mtu:         256 (1)
                    sm_lid:             0
                    port_lid:           0
                    port_lmc:           0x00
                    link_layer:         InfiniBand

One should verify that at least one port is in the PORT_ACTIVE state, which means that the port is available for use.
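
The same verification can be sketched in code: the example below (not part of the original post; the file name is an assumption) opens the first detected device, queries every physical port and reports whether at least one of them is PORT_ACTIVE, similar to what ibv_devinfo shows. Build it with something like: gcc check_ports.c -o check_ports -libverbs

/*
 * Minimal sketch: open the first detected RDMA device and check whether
 * at least one of its ports is PORT_ACTIVE (file name is an assumption).
 */
#include <stdio.h>
#include <stdint.h>
#include <infiniband/verbs.h>

int main(void)
{
    struct ibv_device **dev_list = ibv_get_device_list(NULL);
    if (!dev_list || !dev_list[0]) {
        fprintf(stderr, "No RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(dev_list[0]);
    if (!ctx) {
        fprintf(stderr, "Failed to open %s\n", ibv_get_device_name(dev_list[0]));
        return 1;
    }

    struct ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr)) {
        perror("ibv_query_device");
        return 1;
    }

    int active = 0;
    /* Port numbers start at 1 */
    for (uint8_t port = 1; port <= dev_attr.phys_port_cnt; port++) {
        struct ibv_port_attr port_attr;
        if (ibv_query_port(ctx, port, &port_attr))
            continue;
        printf("%s port %u: %s\n", ibv_get_device_name(dev_list[0]), port,
               ibv_port_state_str(port_attr.state));
        if (port_attr.state == IBV_PORT_ACTIVE)
            active = 1;
    }

    ibv_close_device(ctx);
    ibv_free_device_list(dev_list);
    return active ? 0 : 1;
}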

Verify that traffic is working

Send traffic using ibv_*_pingpong

The ibv_*_pingpong tests, included in the libibverbs-utils RPM, send traffic over RDMA using the SEND opcode. They are relevant only to InfiniBand and RoCE.

It is highly recommended to execute these tools with an explicit device name and port number, although they will work without any parameters; without parameters, they use the first detected RDMA device and port number 1.

Here is an execution example of the server side:

[root@localhost libibverbs]# ibv_rc_pingpong -g 0 -d mlx4_0 -i 1
  local address:  LID 0x0003, QPN 0xb5de9e, PSN 0x9d7046, GID fe80::c:2963:2d42:401
  remote address: LID 0x0003, QPN 0xb5de9f, PSN 0xfeec26, GID fe80::c:2963:2d42:401
8192000 bytes in 0.27 seconds = 239.96 Mbit/sec
1000 iters in 0.27 seconds = 273.11 usec/iter

Here is an execution example of the client side (the IP address is an IP address of the machine that the server is running on):

[root@localhost libibverbs]# ibv_rc_pingpong -g 0 -d mlx4_0 -i 2 192.168.2.106
  local address:  LID 0x0003, QPN 0xb5de9f, PSN 0xfeec26, GID fe80::c:2963:2d42:401
  remote address: LID 0x0003, QPN 0xb5de9e, PSN 0x9d7046, GID fe80::c:2963:2d42:401
8192000 bytes in 0.27 seconds = 245.91 Mbit/sec
1000 iters in 0.27 seconds = 266.50 usec/iter

One should execute the server side before the client side (otherwise, the client will fail to connect to the server).

Send traffic using rping

rping is a tool, included in the librdmacm-utils RPM, that sends RDMA traffic. rping is relevant for all RDMA-powered protocols (InfiniBand, RoCE and iWARP).
The address for both the client and server sides (the '-a' parameter) is the address that the server listens on. In InfiniBand, this address should belong to an IPoIB network interface. In RoCE and iWARP, this is the IP address of the network interface.

Here is an execution example of the server side:

[root@localhost libibverbs]# rping -s -a 192.168.11.1 -v
server ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
server ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
server ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
server ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
server ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
server ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
server ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
server ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
server ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
server ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
server ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
server ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
server ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
server ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
server ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
server ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
server ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
server ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
server ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ

Here is an execution example of the client side:

[root@localhost libibverbs]# rping -c -a 192.168.11.1 -v             
ping data: rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
ping data: rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs
ping data: rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst
ping data: rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu
ping data: rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv
ping data: rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw
ping data: rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx
ping data: rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy
ping data: rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
ping data: rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-10: KLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA
ping data: rdma-ping-11: LMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzAB
ping data: rdma-ping-12: MNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABC
ping data: rdma-ping-13: NOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCD
ping data: rdma-ping-14: OPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDE
ping data: rdma-ping-15: PQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEF
ping data: rdma-ping-16: QRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFG
ping data: rdma-ping-17: RSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGH
ping data: rdma-ping-18: STUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHI
ping data: rdma-ping-19: TUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzABCDEFGHIJ

One should execute the server side before the client side (otherwise, the client will fail to connect to the server).

rping runs endlessly, printing the data to stdout, until CTRL-C is pressed.
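
For illustration, here is a minimal sketch (not part of the original post; the file name and the fallback address are assumptions) that uses librdmacm to perform roughly the first step an rdma_cm client such as rping takes: resolving the server address to a local RDMA device. If this succeeds, the rdma_cm stack can map the address (an IPoIB address for InfiniBand, the interface IP address for RoCE/iWARP) to an RDMA device. Build it with something like: gcc resolve_addr.c -o resolve_addr -lrdmacm -libverbs

/*
 * Minimal sketch: resolve an IP address to an RDMA device using librdmacm.
 * The fallback address below is only an example; pass the address of your
 * IPoIB (InfiniBand) or Ethernet (RoCE/iWARP) interface on the command line.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <arpa/inet.h>
#include <rdma/rdma_cma.h>

int main(int argc, char *argv[])
{
    const char *ip = (argc > 1) ? argv[1] : "192.168.11.1";
    struct sockaddr_in dst;
    struct rdma_event_channel *channel;
    struct rdma_cm_id *id;
    struct rdma_cm_event *event;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1) {
        fprintf(stderr, "Bad IPv4 address: %s\n", ip);
        return 1;
    }

    channel = rdma_create_event_channel();
    if (!channel || rdma_create_id(channel, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        return 1;
    }

    /* Resolve the destination address to a local RDMA device (2000 ms timeout) */
    if (rdma_resolve_addr(id, NULL, (struct sockaddr *)&dst, 2000) ||
        rdma_get_cm_event(channel, &event)) {
        perror("rdma_resolve_addr");
        return 1;
    }

    if (event->event == RDMA_CM_EVENT_ADDR_RESOLVED)
        printf("%s is reachable through RDMA device %s\n", ip,
               ibv_get_device_name(id->verbs->device));
    else
        printf("Address resolution failed: %s\n", rdma_event_str(event->event));

    rdma_ack_cm_event(event);
    rdma_destroy_id(id);
    rdma_destroy_event_channel(channel);
    return 0;
}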

Comments

  1. Murthy says: March 31, 2015

    Hi Dotan,
    I have a very fundamental doubt.
    What actually improves latency in RDMA?
    Is it kernel bypass? Is it the address translation saved (memory registration)? If yes, both exist in send/recv as well.
    The only reason I can imagine is one DMA saved in RDMA, which otherwise needs to be done to retrieve the recv WQE in send/recv.
    Please clarify, if possible.

    • Dotan Barak says: April 17, 2015

      Hi Murthy.

      Several things improve the latency of RDMA (compared to other technologies). I guess that the most important ones are:
      * Kernel bypass (bypassing the kernel may save tens to hundreds of nanoseconds at each side)
      * The fact that memory buffers are always present in RAM (no page faults)
      * The fact that the RDMA device, and not SW (i.e. the network stack), handles the data path

      When comparing RDMA Write vs. Send: in RDMA Write there isn't any consumption of a Receive Request (fewer PCI transactions),
      and as soon as data is received, the device knows the address that it should be written to (no delay until the Receive Request is fetched).

      I hope that this was clear enough
      :)

      Dotan

  2. Murthy says: March 31, 2015

    Hi Dotan,
    Can you please throw light on the technique/feature which improves latency in RDMA?
    The only reason I can imagine is that one DMA is saved, which otherwise is needed to fetch the recv WQE in the case of send/recv.

    • Dotan Barak says: April 17, 2015

      I believe that I answered you in the previous comment...

  3. tamlok says: April 19, 2016

    Hi! Given that we want to transfer 100 pages (contiguous or not) through RDMA READ, which way is more efficient: 100 WRs with only one SGE each, or 10 WRs with 10 SGEs each? Thanks very much!

    • Dotan Barak says: April 22, 2016

      It is HW specific.

      However, IMHO 10 WRs with 10 S/Gs will be more effective than the other suggestion,
      since the overhead of checking the Send Request attributes (those not related to the S/Gs) will be reduced.
      For example: checking if the QP exists, checking if the WQ is full, etc.

      I would suggest writing a benchmark to be sure.

      Thanks
      Dotan

  4. Youngmoon says: May 13, 2016

    Can RDMA be implemented inside the kernel, or as a kernel module?
    I want it to do its job transparently.
    Is there a kernel-level implementation that uses only kernel headers?

    • Dotan Barak says: May 16, 2016

      Hi.

      Yes, RDMA can work at the kernel level.
      IPoIB is an example of such a module.

      Thanks
      Dotan

  5. A. M. Sheppard says: August 20, 2016

    Hello again, my good sir!

    Finally having my Mellanox Ex III/20GBps (MT25208's) installed, I came across some good info on, thus decided to switch to Debian 8 instead of SLES 11 SP4. The cards seem detected, yet I'm having more than a bit of bother.

    I have two machines, HPV00 & HPV01, respectively. Both 4x PCIE cards are in 8x PCIE 1.0 slots.

    When attempting to connect each card's respective port 0 to the other, I can only get them to link @ 2.5 Gbps (HPV00 Port 0 to HPV01 Port 0). When connecting HPV00 Port 0 to HPV01 Port 1, I get an ibstate rate of 10 Gb/sec (4X). Connecting HPV00 Port 0 to HPV00 Port 1 returns a linked rate of 20 Gb/sec (4X DDR)... per card specs.

    I am unable to get IPoIB operational, thus unable to verify that traffic is working (as advised in this article).

    I think I bungled up my port rates not knowing how to use ibportstate properly. How can I ensure I've properly reset {port, node, etc.} GUIDs/LIDs back to their default states and/or how can I force 4X DDR on each port?

    I am using an OpenSM 3.3.18 config (/etc/opensm/opensm.conf), from Debian repos, not Mellanox OFED. Apologies that I should have called "Port 0" "Port 1", etc., per ibstat & ibstatus.

    Linux hpv00 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux
    Linux hpv01 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2+deb8u3 (2016-07-02) x86_64 GNU/Linux

    HPV00 & HPV01 lsmod ib
    http://pastebin.com/ELtfTE8x

    HPV00 mthca Port 0 to HPV01 mthca Port 1
    http://pastebin.com/0PgKkLpq

    HPV00 & HPV01 mthca ibportstate history
    http://pastebin.com/d8kZHaEk

    Any advice is appreciated.

    • A. M. Sheppard says: August 20, 2016

      CORRECTION:
      Per HPV00 mthca Port 0 (really "Port 1") to HPV01 mthca Port 1 (really "Port 2")'s pastebin, I have misinfo in it. Checking the OpenSM Status's "Loading Cached Option:guid = 0x0002c90200223ac9",

      ll /var/log/opensm.0x0002c90200223ac9.log returns 542798 [7A1EB700] 0x02 -> SUBNET UP

      Apologies for adding to the confusion. Please advise if there's any more info I can provide.

      • A. M. Sheppard says: August 20, 2016

        A better log for my comment stating my correction:
        HPV00 log OpenSM bound Port GUID after fresh SM restart
        http://pastebin.com/RT2Unqwd

      • Dotan Barak says: September 16, 2016

        Hi.

        Sorry, I had a lot to handle and failed to answer until now.
        Do you still have a problem?

        what is the output of ibv_devinfo?
        (you can send me this by mail)

        Thanks
        Dotan

  6. A. M. Sheppard says: September 17, 2016

    Hello Dotan -

    It's a pleasure to hear from you.

    Since posting 2016-08-20, I switched my connected port on HPV01 mthca from Port 1 (GUID 0x0002c90200223719) to Port 2 (GUID 0x0002c9020022371a) as HPV01 Port 1 was only showing LinkUp of 2.5/SDR when connected to HPV00 mthca Port 1 (HPV00 GUID 0x0002c90200223ac9). I have only one cable.

    I have successfully set up IPoIB via HPV00 Port 1 to HPV01 Port 2 ... though, as stated above, it's only connected @ 10Gbps/4X. Of course, I would prefer to be able to ensure all ports are running at 20Gbps/4X DDR.

    As requested & for the sake of anyone stumbling across this thread, here's the ibv_devinfo && ibstatus && ibstat && iblinkinfo && ibportstate -D 0 1 && ibdiagnet -lw 4x -ls 5 -c 1000 && lspci -Qvvs && cat /sys/class/net/{ib1, ib0}/mode && uname -a for both HPV01 && HPV00.

    HPV01 Port 2 to HPV00 Port1 - 4X - Jessie
    http://pastebin.com/4Lkparcm

    HPV00 Port 1 to HPV01 Port 2 - 4X - Jessie
    http://pastebin.com/5AwqNUAB

    Looking forward to your insights.

    (Note: it seems I'm unable to reply to your 2016-09-16 response as max. thread depth seems reached.)

    • Dotan Barak says: September 21, 2016

      Hi.

      I suggest ignoring the port attributes before the SM has configured the fabric;
      I can see that this is the case since the logical port is INITIALIZING and not ACTIVE.

      The SM will configure the ports to use maximum possible values.

      Thanks
      Dotan
