IP over InfiniBand (IPoIB) architecture
InfiniBand is a feature-rich networking protocol that provides excellent performance. However, unlike RoCE and iWARP, which run over an Ethernet infrastructure (NICs, switches and routers) and therefore support legacy (IP-based) applications by design, InfiniBand is a completely different protocol: it uses its own addressing mechanism, which isn't IP, and doesn't support sockets, so it doesn't support legacy applications. This means that in some clusters InfiniBand can't be used as the only interconnect, since many management systems are IP-based. As a result, clusters that use InfiniBand for data traffic might have had to deploy an Ethernet infrastructure as well (for management), increasing the price and complexity of cluster deployment.
To solve this problem, IP over InfiniBand (IPoIB) was specified by the Internet Engineering Task Force (IETF) IPoIB working group. IPoIB allows working with IP-based applications, thus allowing legacy applications and many management systems to run seamlessly over an InfiniBand fabric.
The IPoIB module registers with the local Operating System's network stack as an Ethernet device and translates all the needed functionality between Ethernet and InfiniBand. An Unreliable Datagram (UD) Queue Pair (QP) that represents the network interface is created, and the link-layer address (MAC) of that interface is constructed according to the following scheme:
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+---------------+-----------------------------------------------+
|    Reserved   |               Queue Pair number               |
+---------------+-----------------------------------------------+
|                                                               |
+                                                               +
|                                                               |
+                              GID                              +
|                                                               |
+                                                               +
|                                                               |
+---------------------------------------------------------------+
1 byte: reserved
3 bytes: QP number
16 bytes: One of the port's GID
Total: 20 bytes
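As a minimal sketch, the layout above can be expressed in code; the QP number and GID values below are fabricated for illustration:

```python
def ipoib_mac(qpn: int, gid: bytes) -> bytes:
    """Compose the 20-byte IPoIB hardware address:
    1 reserved byte, a 3-byte QP number, and a 16-byte port GID."""
    assert 0 <= qpn < (1 << 24), "the QP number field is 24 bits"
    assert len(gid) == 16, "a GID is 128 bits"
    return bytes(1) + qpn.to_bytes(3, "big") + gid

# A link-local-style GID (fe80::...) with a fabricated interface ID
gid = bytes.fromhex("fe80000000000000" "0002c90300001234")
mac = ipoib_mac(0x000516, gid)
print(len(mac))        # 20 bytes, as described above
```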
Every such IPoIB network interface has a 20-byte MAC address. This may cause problems, since the "standard" Ethernet MAC address is 6 bytes (48 bits), and there are applications, services and operating systems which don't support a network device whose MAC address isn't 6 bytes long.
Such an IPoIB network interface is created for every port of the InfiniBand device. In Linux, those interfaces are prefixed with "ib".
Traffic sent over the IPoIB network interface goes through the kernel's network stack and doesn't benefit from the features of the InfiniBand device: kernel bypass, reliability, zero copy, segmentation and reassembly of messages into packets, and more.
The kernel provides the IPoIB module with buffers to be sent, and IPoIB sends them over the InfiniBand fabric. When receiving data from the wire, IPoIB fills kernel buffers and hands them to the kernel (which, in turn, passes them to the userspace application that uses sockets). The IP packets are encapsulated in InfiniBand packets, hence packet-sniffing tools such as tcpdump and Wireshark can be used on an IPoIB network interface.
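To make the encapsulation concrete, here is a sketch of parsing the 4-octet IPoIB encapsulation header (per RFC 4391, a 16-bit protocol type carrying the usual EtherType values, followed by 16 reserved bits); the frame bytes are fabricated:

```python
ETHERTYPES = {0x0800: "IPv4", 0x0806: "ARP", 0x86DD: "IPv6"}

def parse_ipoib_frame(frame: bytes):
    """Split an IPoIB datagram payload into (protocol name, payload):
    the first 2 octets are the EtherType, the next 2 are reserved."""
    ptype = int.from_bytes(frame[0:2], "big")
    return ETHERTYPES.get(ptype, hex(ptype)), frame[4:]

# Fabricated frame: IPv4 EtherType, reserved word, then a dummy payload byte
proto, payload = parse_ipoib_frame(bytes.fromhex("08000000") + b"\x45")
print(proto)    # IPv4
```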
The IPoIB RFCs specify a wire protocol, which means that different IPoIB implementations (even between different Operating Systems) can interoperate.
The IPoIB network interface supports ICMP, IPv4, IPv6 and all the network protocols that use IP, such as UDP, TCP and more.
For an Ethernet-like interface to work properly, broadcast must be supported, since many protocols (such as ARP) broadcast messages. However, InfiniBand, by design, doesn't support broadcast. This is handled using InfiniBand multicast groups: IPoIB uses a specific multicast group as a broadcast domain, and when a message needs to be broadcast, it is sent to that multicast group. Here is the scheme that describes the multicast GID being used:
|   8    |  4 |  4 |     16 bits     | 16 bits |      80 bits      |
+--------+----+----+-----------------+---------+-------------------+
|11111111|0001|scop| IPoIB signature |  P_Key  |     group ID      |
+--------+----+----+-----------------+---------+-------------------+
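Filling in the fields of this scheme for the broadcast group gives the well-known IPoIB broadcast GID; a sketch, assuming the IPv4 signature value 0x401B from RFC 4391 and link-local scope (2):

```python
def ipoib_broadcast_gid(pkey: int) -> bytes:
    """Build the 16-byte broadcast GID: all-ones prefix (0xff),
    flags 0001 + scope 2 (link-local) -> 0x12, the IPoIB IPv4
    signature 0x401B, the partition key, and the 80-bit broadcast
    group ID (last 32 bits all ones)."""
    group_id = (0xFFFFFFFF).to_bytes(10, "big")
    return bytes([0xFF, 0x12, 0x40, 0x1B]) + pkey.to_bytes(2, "big") + group_id

# With the default partition key (0xFFFF):
print(ipoib_broadcast_gid(0xFFFF).hex())
# ff12401bffff000000000000ffffffff
```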
Working with multicast groups (for example: sending join/leave requests) requires an active Subnet Administrator (SA).
When an application needs to send a message to a remote IP address for the first time, and the MAC address of that remote address is still unknown, the Address Resolution Protocol (ARP) is used. An ARP request carrying the IP address whose MAC address is needed is broadcast to the network, and the NIC in the subnet that was configured with this IP address returns an ARP response with its MAC address. This is a standard protocol in Ethernet networks and isn't special to IPoIB.
Knowing the MAC address of the remote IPoIB network interface isn't enough; this address still needs to be translated from the Ethernet space to the InfiniBand space. Luckily, the MAC address of the remote IPoIB network interface contains a port GID, and by sending a Path Query request to the SA, the InfiniBand attributes (for example: SL, LID, and more) needed for creating an Address Handle for sending messages to the remote side can be acquired. Usually this Address Handle is associated with the ARP entry at the kernel level, but this is implementation-specific behavior.
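This resolution step can be sketched the same way: given a 20-byte address learned via ARP, extract the remote QP number and the port GID that would then go into the SA path query (the address bytes here are fabricated):

```python
def parse_ipoib_mac(mac: bytes):
    """Split a 20-byte IPoIB hardware address into (QP number, port GID)."""
    assert len(mac) == 20, "IPoIB hardware addresses are 20 bytes"
    return int.from_bytes(mac[1:4], "big"), mac[4:20]

# Fabricated remote address: reserved byte, QPN 0x000516, then a GID
remote = bytes(1) + bytes.fromhex("000516") \
       + bytes.fromhex("fe800000000000000002c90300005678")
qpn, gid = parse_ipoib_mac(remote)
print(hex(qpn))   # 0x516
```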
Network I/F configuration
The IPoIB network interface is like any other network interface, except for the fact that it has a 20-byte MAC address. Its IP address can be configured statically using an OS configuration file or using DHCP. Please note that there are DHCP server implementations that don't natively support IPoIB and may need special patches applied in order to support it. The IPoIB network interface attributes, such as MTU size, enabled offloads, and more, can be configured like those of any other network interface; in Linux, using ifconfig and ethtool.
In IPoIB, VLANs are implemented using InfiniBand partitions. The fabric should be configured by the SM, in advance, to support the partition and the corresponding multicast group, according to the above multicast GID scheme. Configuring such a VLAN on the IPoIB network interface is done using a special method and not using the standard vconfig.
Connected Mode IPoIB
The IPoIB network interface that uses a UD QP is called IPoIB Datagram. It is mandatory, always active, and works as part of the IPoIB protocol.
An optional feature of IPoIB is Connected mode: after performing negotiation over the UD QP, a connected QP, RC or UC (usually RC), is created for handling the data between the local interface and the remote interface. This allows using an MTU of 64KB instead of the 2KB or 4KB MTU usually used in IPoIB Datagram. A big MTU value allows the network stack to handle fewer packets per message, thus improving the bandwidth and decreasing the CPU utilization.
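Back-of-the-envelope arithmetic (the MTU values below are typical examples, not a specification) shows why the larger MTU helps:

```python
import math

MSG = 1 << 20                        # a 1 MiB message
for mtu in (2044, 65520):            # typical datagram vs connected-mode MTU
    print(mtu, math.ceil(MSG / mtu)) # 2044 -> 514 packets, 65520 -> 17
```

Roughly 30x fewer packets means 30x fewer trips through the network stack for the same payload.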
IPoIB CM will be used only if both sides support it.
The downside of working with IPoIB CM is that in a big fabric many QPs are created; this isn't scalable and consumes many resources. In a big cluster, it is advised to work with IPoIB Datagram, which is highly scalable.
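The resource cost can be sketched with simple counting (assuming all-to-all traffic, which is not stated in the text): each node that talks to every other node needs a connected QP per peer.

```python
def cm_qp_count(nodes: int):
    """QPs per node and fabric-wide for all-to-all Connected mode."""
    return nodes - 1, nodes * (nodes - 1)

print(cm_qp_count(1000))   # (999, 999000)
```

In Datagram mode, by contrast, a single UD QP per interface serves all peers, regardless of fabric size.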
There are InfiniBand devices which support stateless offloads, such as checksum offloads, LSO and more. IPoIB Datagram can utilize those offloads, thus decreasing the CPU usage and improving the bandwidth.
IPoIB solves many problems for us. However, compared to a standard Ethernet network interface, it has some limitations:
- IPoIB supports IP-based applications only (since the Ethernet header isn't encapsulated).
- SM/SA must always be available in order for IPoIB to function.
- The MAC address of an IPoIB network interface is 20 bytes.
- The MAC address of the network interface can't be controlled by the user.
- The MAC address of the IPoIB network interface may change across consecutive loadings of the IPoIB module; it isn't persistent (i.e., a constant attribute of the interface).
- Configuring VLANs on an IPoIB network interface requires the SM to be aware of the corresponding P_Keys.
- Non-standard interface for managing VLANs.
More information can be found at the following URLs:
Tell us what you think.