IP over InfiniBand (IPoIB) architecture

Motivation

InfiniBand is a great networking protocol that has many features and provides great performance. However, unlike RoCE and iWARP, which run over an Ethernet infrastructure (NICs, switches and routers) and support legacy (IP-based) applications by design, InfiniBand is a completely different protocol: it uses a different addressing mechanism, which isn't IP, and doesn't support sockets. Therefore, it doesn't support legacy applications. This means that in some clusters InfiniBand can't be used as the only interconnect, since many management systems are IP-based. As a result, clusters that use InfiniBand for data traffic might have to deploy an Ethernet infrastructure as well (for management), which increases the price and complexity of cluster deployment.

To solve this problem, IP over InfiniBand (IPoIB) was specified by the Internet Engineering Task Force (IETF) IPoIB working group. IPoIB allows working with IP-based applications, thus allowing legacy applications and many management systems to run seamlessly in an InfiniBand fabric.

IPoIB Architecture

The IPoIB module registers with the local operating system's network stack as an Ethernet device and translates all the needed functionality between Ethernet and InfiniBand. An Unreliable Datagram (UD) Queue Pair (QP) that represents the network interface is created, and the link layer address (MAC) of that interface is built according to the following scheme:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+-----------------------------------------------+
|     Reserved  |                   Queue Pair number           |
+---------------+-----------------------------------------------+
|                                                               |
+                                                               +
|                                                               |
+                              GID                              +
|                                                               |
+                                                               +
|                                                               |
+---------------------------------------------------------------+

1 byte: reserved
3 bytes: QP number
16 bytes: One of the port's GIDs
Total: 20 bytes

Every such IPoIB network interface has a 20-byte MAC address. This may cause problems, since the "standard" Ethernet MAC address is 6 bytes (48 bits) and there are applications, services and operating systems that don't support a network device whose MAC address isn't 6 bytes long.
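The layout above can be sketched in a few lines of Python. This is only an illustration of the byte layout: the function names and the sample QP number and GID are made up for the example, and the GID is written in IPv6 textual notation purely as a convenience for handling a 128-bit value.

```python
import ipaddress
import struct


def pack_ipoib_hwaddr(qpn, gid, reserved=0):
    """Pack 1 reserved byte, a 3-byte QP number and a 16-byte GID into 20 bytes."""
    if not 0 <= qpn < (1 << 24):
        raise ValueError("QP number must fit in 24 bits")
    if not 0 <= reserved < (1 << 8):
        raise ValueError("reserved field must fit in 8 bits")
    first_word = (reserved << 24) | qpn  # first 32-bit row of the diagram
    return struct.pack("!I", first_word) + ipaddress.IPv6Address(gid).packed


def unpack_ipoib_hwaddr(addr):
    """Split a 20-byte address back into (reserved, qpn, gid)."""
    if len(addr) != 20:
        raise ValueError("an IPoIB link layer address is 20 bytes long")
    (first_word,) = struct.unpack("!I", addr[:4])
    gid = str(ipaddress.IPv6Address(addr[4:]))
    return first_word >> 24, first_word & 0xFFFFFF, gid


# Hypothetical values, for illustration only.
hwaddr = pack_ipoib_hwaddr(0x000123, "fe80::2:c903:0:1")
print(":".join(f"{b:02x}" for b in hwaddr))
print(unpack_ipoib_hwaddr(hwaddr))
```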

Such an IPoIB network interface is created for every port of the InfiniBand device. In Linux, the prefix of those interfaces is "ib".
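On Linux, these interfaces can be found by scanning sysfs, as in the minimal sketch below. It assumes the usual `/sys/class/net/<interface>/type` and `address` files and relies on the ARP hardware type 32 that is assigned to InfiniBand; the function name and the `sysfs_net` parameter are made up for the example (the parameter exists only so the function can be pointed at a different directory).

```python
from pathlib import Path

ARPHRD_INFINIBAND = 32  # ARP hardware type assigned to InfiniBand


def find_ipoib_interfaces(sysfs_net="/sys/class/net"):
    """Return {interface name: link layer address} for InfiniBand-type interfaces."""
    found = {}
    root = Path(sysfs_net)
    if not root.is_dir():
        return found
    for dev in sorted(root.iterdir()):
        try:
            if int((dev / "type").read_text().strip()) != ARPHRD_INFINIBAND:
                continue
            found[dev.name] = (dev / "address").read_text().strip()
        except (OSError, ValueError):
            # Skip entries that aren't readable network devices.
            continue
    return found


for name, address in find_ipoib_interfaces().items():
    print(name, address)
```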

The traffic that is sent over the IPoIB network interface uses the network stack of the kernel and doesn't benefit from features of the InfiniBand device such as kernel bypass, reliability, zero copy, segmentation and reassembly of messages into packets, and more.

The kernel hands the IPoIB module buffers to be sent, and IPoIB sends them over the InfiniBand fabric. When receiving data from the wire, IPoIB places it in kernel buffers and passes them to the kernel (which, in turn, gives them to the userspace application that uses sockets). The IP packets are encapsulated in InfiniBand packets, so packet sniffing tools such as tcpdump and Wireshark can be used on an IPoIB network interface.

Interoperability

The IPoIB RFCs specify a wire protocol, which means that different IPoIB implementations (even between different Operating Systems) can interoperate.

Supported protocols

The IPoIB network interface supports ICMP, IPv4, IPv6 and all the network protocols that use IP, such as UDP, TCP and more.

Broadcast

In order for an Ethernet interface to work properly, broadcast must be supported, since many protocols (such as ARP) broadcast messages. However, InfiniBand, by design, doesn't support broadcast. IPoIB handles this with InfiniBand multicast groups: it uses a specific multicast group as the broadcast domain, and when a message needs to be broadcast, it is sent to that multicast group. The following scheme describes the multicast GID that is used:

|   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
+--------+----+----+-----------------+---------+-------------------+
|11111111|0001|scop| IPoIB signature |  P_Key  |      group ID     |
+--------+----+----+-----------------+---------+-------------------+

Working with multicast groups (for example: sending join/leave requests) requires an active Subnet Administrator (SA).
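The multicast GID scheme above can be assembled with plain bit shifts, as in this minimal sketch. The signature values (0x401B for IPv4, 0x601B for IPv6) and the all-ones 32-bit group ID of the IPv4 broadcast group come from RFC 4391 rather than from this post; the function and constant names are made up for the example.

```python
import ipaddress

# IPoIB signatures as defined in RFC 4391.
IPOIB_SIGNATURE_IPV4 = 0x401B
IPOIB_SIGNATURE_IPV6 = 0x601B


def build_ipoib_mgid(scope, signature, pkey, group_id):
    """Assemble the 128-bit multicast GID from the fields of the scheme above."""
    if not 0 <= scope < (1 << 4):
        raise ValueError("scope is a 4-bit field")
    if not 0 <= signature < (1 << 16) or not 0 <= pkey < (1 << 16):
        raise ValueError("signature and P_Key are 16-bit fields")
    if not 0 <= group_id < (1 << 80):
        raise ValueError("group ID is an 80-bit field")
    value = (
        (0xFF << 120)         # 8 bits: 11111111
        | (0x1 << 116)        # 4 bits: 0001
        | (scope << 112)      # 4 bits: scope
        | (signature << 96)   # 16 bits: IPoIB signature
        | (pkey << 80)        # 16 bits: P_Key
        | group_id            # 80 bits: group ID
    )
    return ipaddress.IPv6Address(value)


# IPv4 broadcast group for the default partition, link-local scope (0x2).
broadcast_gid = build_ipoib_mgid(0x2, IPOIB_SIGNATURE_IPV4, 0xFFFF, 0xFFFFFFFF)
print(broadcast_gid)  # ff12:401b:ffff::ffff:ffff
```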

ARP resolution

When an application needs to send a message to a remote IP address for the first time, and the MAC address of that remote address is still unknown, the Address Resolution Protocol (ARP) is used. An ARP request is broadcast to the network with the IP address whose MAC address is needed, and the NIC in the subnet that was configured with this IP address returns an ARP response with its MAC address. This is a standard protocol in Ethernet networks and isn't specific to IPoIB.

Knowing the MAC address of the remote IPoIB network interface isn't enough; this address must be translated from the Ethernet space to the InfiniBand space. Luckily, the MAC address of the remote IPoIB network interface contains a port's GID, and by sending a Path Query request to the SA, the InfiniBand attributes (for example: SL, LID, and more) needed for creating an Address Handle for sending messages to the remote side can be acquired. Usually, this Address Handle is associated with the ARP entry at the kernel level, but this is implementation-specific behavior.
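To make the first half of this flow concrete, here is a minimal sketch that builds an ARP request carrying 20-byte hardware addresses and then extracts the QP number and GID from the sender's address. It assumes the standard ARP payload layout with hardware type 32 (InfiniBand) and hardware address length 20; the function names and all sample addresses are made up. The Path Query to the SA needs a real InfiniBand stack and is deliberately left out.

```python
import socket
import struct

ARP_HTYPE_INFINIBAND = 32   # ARP hardware type assigned to InfiniBand
ARP_PTYPE_IPV4 = 0x0800
ARP_OP_REQUEST = 1
IPOIB_HWADDR_LEN = 20


def build_arp_request(sender_hw, sender_ip, target_ip):
    """Build an ARP request payload; the target hardware address is unknown (zeros)."""
    if len(sender_hw) != IPOIB_HWADDR_LEN:
        raise ValueError("sender hardware address must be 20 bytes")
    header = struct.pack("!HHBBH", ARP_HTYPE_INFINIBAND, ARP_PTYPE_IPV4,
                         IPOIB_HWADDR_LEN, 4, ARP_OP_REQUEST)
    return (header + sender_hw + socket.inet_aton(sender_ip)
            + bytes(IPOIB_HWADDR_LEN) + socket.inet_aton(target_ip))


def parse_arp_sender(payload):
    """Return (operation, sender QPN, sender GID bytes, sender IP) from an ARP payload."""
    htype, ptype, hlen, plen, oper = struct.unpack("!HHBBH", payload[:8])
    if htype != ARP_HTYPE_INFINIBAND or hlen != IPOIB_HWADDR_LEN or plen != 4:
        raise ValueError("not an IPv4-over-InfiniBand ARP payload")
    sender_hw = payload[8:8 + hlen]
    sender_ip = socket.inet_ntoa(payload[8 + hlen:8 + hlen + plen])
    qpn = int.from_bytes(sender_hw[1:4], "big")  # bytes 1..3: QP number
    gid = sender_hw[4:20]                        # bytes 4..19: port GID
    return oper, qpn, gid, sender_ip


# Hypothetical sender: QPN 0x000123 and a made-up link-local GID.
sample_gid = bytes.fromhex("fe800000000000000002c90300000001")
sample_hw = bytes([0x00, 0x00, 0x01, 0x23]) + sample_gid
request = build_arp_request(sample_hw, "192.168.0.1", "192.168.0.2")
print(len(request), parse_arp_sender(request))
```

The GID returned by `parse_arp_sender` is the value that would be handed to the SA in the Path Query request.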

Network I/F configuration

The IPoIB network interface is like any other network interface, except for the fact that it has a 20-byte MAC address. Its IP address can be configured statically using an OS configuration file or using DHCP. Please note that there are DHCP server implementations that don't natively support IPoIB, and special patches may need to be applied to them for supporting IPoIB. The IPoIB network interface attributes, such as MTU size, enabled offloads, and more, can be configured like those of any other network interface; in Linux, using ifconfig and ethtool.

VLAN configuration

In IPoIB, VLANs are implemented using InfiniBand partitions. The fabric should be configured in advance by the SM to support the partition and the corresponding multicast group, according to the multicast GID scheme above. Configuring such a VLAN on the IPoIB network interface is done using a special method, not the standard vconfig.
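As one example of such a method, the Linux kernel IPoIB driver accepts a P_Key written to the parent interface's `create_child` file in sysfs and creates a child interface for it. The sketch below assumes that interface, assumes the child is named `<parent>.<pkey>` with the full-membership bit (0x8000) set, and needs root privileges when run against the real sysfs. The function name and the `sysfs_net` parameter are made up for the example.

```python
from pathlib import Path


def create_ipoib_child(parent, pkey, sysfs_net="/sys/class/net"):
    """Ask the IPoIB driver to create a child interface for the given P_Key.

    Returns the expected name of the child interface.
    """
    if not 0 < pkey <= 0xFFFF:
        raise ValueError("P_Key must be a non-zero 16-bit value")
    full_pkey = pkey | 0x8000  # assumption: child uses the full-membership P_Key
    control_file = Path(sysfs_net) / parent / "create_child"
    control_file.write_text(f"0x{full_pkey:04x}\n")
    return f"{parent}.{full_pkey:04x}"


# Example (not executed here, since it changes system state):
#   create_ipoib_child("ib0", 0x0001)   # expected child interface: ib0.8001
```

The SM must already know the partition, otherwise the child interface will not be able to join its broadcast group.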

Connected Mode IPoIB

The IPoIB network interface that uses a UD QP is called IPoIB Datagram. It is mandatory, and it is always active as part of the IPoIB protocol.

An optional feature of IPoIB is Connected Mode (CM): after negotiation over the UD QP, a connected QP, either RC or UC (usually RC), is created for handling the data between the local interface and the remote interface. This allows using an MTU of 64KB instead of the 2KB or 4KB MTU that is usually used in IPoIB Datagram. Using a big MTU value allows the network stack to handle fewer packets per message, thus improving the bandwidth and decreasing the CPU utilization.
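The effect of the MTU on the number of packets can be shown with simple arithmetic. The sketch below uses the round MTU values mentioned above and ignores protocol headers, so the real numbers will differ slightly; the function name and the 1 MiB message size are made up for the example.

```python
def packets_per_message(message_bytes, mtu_bytes):
    """Number of MTU-sized packets needed to carry a message (headers ignored)."""
    if message_bytes < 0 or mtu_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-message_bytes // mtu_bytes)  # ceiling division


message = 1024 * 1024  # a 1 MiB message

print(packets_per_message(message, 2 * 1024))    # 512 packets with a 2KB MTU
print(packets_per_message(message, 4 * 1024))    # 256 packets with a 4KB MTU
print(packets_per_message(message, 64 * 1024))   # 16 packets with a 64KB MTU
```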

IPoIB CM will be used only if both sides support it.

The downside of working with IPoIB CM is that in a big fabric many QPs are created; this isn't scalable and consumes many resources. In a big cluster, it is advised to work with IPoIB Datagram, which is highly scalable.

Stateless offloads

There are InfiniBand devices that support stateless offloads, such as checksum offloads, LSO and more. IPoIB Datagram can utilize those offloads, thus decreasing the CPU usage and improving the bandwidth.

IPoIB Limitations

IPoIB solves many problems for us. However, compared to a standard Ethernet network interface, it has some limitations:

  • IPoIB supports IP-based applications only (since the Ethernet header isn't encapsulated).
  • SM/SA must always be available in order for IPoIB to function.
  • The MAC address of an IPoIB network interface is 20 bytes.
  • The MAC address of the network interface can't be controlled by the user.
  • The MAC address of the IPoIB network interface may change between consecutive loadings of the IPoIB module; it isn't persistent (i.e. it isn't a constant attribute of the interface).
  • Configuring VLANs in an IPoIB network interface requires the SM to be aware of the corresponding P_Keys.
  • Non-standard interface for managing VLANs.

More information

More information can be found at the following URLs:

RFC 4391: Transmission of IP over InfiniBand (IPoIB)

RFC 4392: IP over InfiniBand (IPoIB) Architecture

RFC 4755: IP over InfiniBand: Connected Mode

Comments

  1. Stephen says: March 30, 2015

    THANK YOU for writing this. I just stumbled across your blog and I've found it to be fantastic.

    • Dotan Barak says: March 30, 2015

      Hi Stephen.

      Thanks for the feedback!!!
      Dotan

  2. Whitney says: May 13, 2019

    Hi Dotan,
    Thank you very much for writing this :)
    Here are my questions:
    1)How does TCP/IP message received by HCA card enter the TCP/IP protocol stack of Linux?
    2)What is the process of HCA mounting as a NIC?

    • Dotan Barak says: May 18, 2019

      Hi.

      1) IPoIB register a network interface for every port;
      a "standard" devices with 20 bytes HW address.
      Packets are processed like any other network interface.

      2) The HCA isn't mounted as a NIC, every port is registered as a network interface.

      Thanks
      Dotan
