Zero byte messages

RDMA supports zero byte messages. To send one, post a Send Request without a scatter/gather list (i.e. a list with zero entries).

Zero byte messages can be sent with the following opcodes:

  • Send
  • Send with Immediate
  • RDMA Write
  • RDMA Write with Immediate
  • RDMA Read

For the RDMA operations, the remote address and remote key are not actually used or validated, so those values don't have to contain the details of a valid remote Memory Region.
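To make the "zero entries" rule concrete, here is a minimal sketch in plain C of how the message length follows from the scatter/gather list. It assumes the behavior described in the comments below, where an entry whose length is 0 is read as 2^31 bytes (one commenter reports that not every vendor's driver does this). `struct sge_len`, `sge_effective_length` and `message_length` are hypothetical names used only for illustration; they are not part of libibverbs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the length member of a scatter/gather entry.
 * The address and key members are omitted because only the length matters here. */
struct sge_len {
    uint32_t length;
};

/* Bytes that one entry contributes, ASSUMING the rule discussed in the
 * comments: a length of 0 is read as 2^31 bytes, not as zero bytes. */
static uint64_t sge_effective_length(uint32_t length)
{
    return length == 0 ? (UINT64_C(1) << 31) : (uint64_t)length;
}

/* Total message length for a list with num_sge entries.
 * Under the assumption above, only an empty list (num_sge == 0)
 * yields a zero byte message. */
static uint64_t message_length(const struct sge_len *list, int num_sge)
{
    uint64_t total = 0;

    assert(num_sge >= 0);
    assert(num_sge == 0 || list != NULL);

    for (int i = 0; i < num_sge; i++)
        total += sge_effective_length(list[i].length);
    return total;
}
```

Under this model, calling `message_length(NULL, 0)` gives 0, while a list with one entry of length 0 gives 2^31, which is why a zero byte message is posted with no entries at all.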

What are zero byte messages good for?

Zero byte messages can be useful in the following scenarios:

  • When only the immediate data is used - This can be useful to signal a directive or a status update.
  • For keep-alive messages in a reliable QP - A zero byte RDMA Write or RDMA Read makes a good non-intrusive keep-alive message in a reliable QP, to make sure that the remote QP is still alive and functioning. If the remote QP is offline (for example, if the QP was transitioned to the Error or Reset state, if the process was terminated, or even if the node itself was rebooted), there will be a Work Completion with Retry Exceeded status. The other opcodes listed above (Send, Send with Immediate and RDMA Write with Immediate) consume a Receive Request from the remote side QP.
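For the first scenario, the 32 bits of immediate data have to carry the whole directive. The sketch below shows one possible way to pack a directive code and a small argument into that value. The layout (8-bit code, 24-bit argument), the `directive` enum and the `imm_*` helper names are all assumptions made for illustration, and the conversion with `htonl()` assumes the application wants the value in network byte order so that both sides agree regardless of endianness.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>

/* Hypothetical directive codes; a real application defines its own. */
enum directive {
    DIR_STATUS_OK   = 1,
    DIR_STATUS_BUSY = 2,
    DIR_SHUTDOWN    = 3
};

/* Pack a directive (top 8 bits) and an argument (low 24 bits) into a
 * 32-bit value in network byte order. The layout is an assumption. */
static uint32_t imm_pack(uint8_t directive, uint32_t arg)
{
    assert(arg <= 0xFFFFFFu);   /* argument must fit in 24 bits */
    return htonl(((uint32_t)directive << 24) | arg);
}

/* Extract the directive code from a received immediate value. */
static uint8_t imm_directive(uint32_t imm_be)
{
    return (uint8_t)(ntohl(imm_be) >> 24);
}

/* Extract the 24-bit argument from a received immediate value. */
static uint32_t imm_arg(uint32_t imm_be)
{
    return ntohl(imm_be) & 0xFFFFFFu;
}
```

The sender would place the result of `imm_pack()` in the immediate data of a zero byte Send with Immediate or RDMA Write with Immediate, and the receiver would decode the immediate data found in its Work Completion with `imm_directive()` and `imm_arg()`.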

Comments

Tell us what you think.

  1. Hiroyuki Sato(@hiroysato) says: September 25, 2013

    This is a nice article!

    By the way, one of my followers said that it causes IBV_WC_LOC_LEN_ERR when a zero byte message is sent to a peer on a ConnectX-3 card.
    The source is here.
    http://www.nminoru.jp/~nminoru/data/201309/libibverbs-1.1.6-zero-length-send-test.diff

    Could you please tell me what is missing?

    • Dotan Barak says: September 25, 2013

      Thanks.

      As I wrote, in order to send zero byte messages, the s/g list should have zero entries, but in your example it has one entry:

      struct ibv_send_wr wr = {
          .wr_id = PINGPONG_SEND_WRID,
          .sg_list = &list,
          .num_sge = 1, <-------------------- this should be zero
      -   .opcode = IBV_WR_SEND,
      +   .opcode = IBV_WR_SEND_WITH_IMM,
      +   .imm_data = 0,
          .send_flags = IBV_SEND_SIGNALED,
      };

      Sending a scatter/gather list with the value zero in the size member actually means sending 2 GB...

      Thanks
      Dotan

      • Hiroyuki Sato(@hiroysato) says: September 25, 2013

        Hello Dotan.

        Thank you for your reply. I'll check it
        and give feedback later.

      • rsai says: April 18, 2014

        From what I understand, this should have actually caused 2 GB of data transfer. But it causes IBV_WC_LOC_LEN_ERR. Why?
        The reason I ask is that I am facing the same problem: setting .num_sge = 1 and .length = 0 causes IBV_WC_LOC_LEN_ERR.

      • Dotan Barak says: April 18, 2014

        The question is whether the S/G entry that you described points to a valid Memory Region space?

        Thanks
        Dotan

        If you like RDMAmojo, support it.

  2. Hiroyuki Sato(@hiroysato) says: September 28, 2013

    Hello Dotan.

    Thank you for your advice. It worked properly.

    • Dotan Barak says: September 28, 2013

      Great!

  3. Anuj Kalia says: November 11, 2013

    Hey Dotan.

    I was wondering if you could help me understand the interaction between RDMA and CPU caches. I had the following specific question:

    When a remote host reads from a server via RDMA, where does the read actually come from? I read that writes go to L3 cache. Do the reads come from L3 cache too? If so, what happens if something is in a modified state in L1 or L2 cache? Is L3 always up-to-date with L1/L2?

    Thanks a lot for your time!

    • Dotan Barak says: November 30, 2013

      Hi Anuj and sorry for the late response.

      Whether the L3 cache is up-to-date with L1/L2 is a question that you should ask
      the chipset/CPU guys.

      But IMHO, the answer is yes.

      Thanks
      Dotan

  4. rsai says: April 14, 2014

    Hey Dotan,

    I noticed this from your post on ibv_post_send about sge.length:
    The length of the buffer in bytes. The value 0 is a special value and is equal to 2^{31} bytes (and not zero bytes, as one might imagine)

    Is this Mellanox specific or generic?
    Is there a spec document which describes this? I could not find any which states this, except for a couple of forum posts on zero-byte messages.
    Is there a reason why 0 is expected to mean 2gig? If so, what would my CI interpret if sge.length is set to 2147483648 (bytes) which can also be stored in uint32_t?

    Thanks,
    --

    • Dotan Barak says: April 15, 2014

      Hi.

      This is a good question.
      I searched for an answer in the InfiniBand spec, and couldn't find one.
      So, I can't give you a quick answer what is the origin of this behavior.

      I can think of one reason: what is the meaning of a scatter/gather entry with zero bytes?
      If it is zero bytes, why did you add it in the first place?

      Another reason is that 0 is actually 2 GB modulo 2 GB, so if for any scatter/gather entry length you perform a modulo of 2 GB (the maximum size of a message in RDMA), 2 GB becomes 0.

      I'll investigate it further, but it will take some time.
      (BTW, all posts are moderated to prevent SPAM, so they will be seen only once they are approved by me)

      Dotan

      • rsai says: April 15, 2014

        Dotan,

        Thanks for your response.
        And no problem. Please take your time. I shall keep my eyes open on this thread. :-)

  5. Parthiban says: September 23, 2014

    Hi Dotan,
    I have the same question as "rsai" has. Did you find any answer for that? Also,
    I'm trying to do an RDMA Write of 5 GB of memory, but I see that only 1 GB is getting into the remote buffer. I have assigned the allocated buffer and its length to the sge and sge.length, and I'm using only one sge. Kindly suggest what the issue could be.
    Thanks,

    • Dotan Barak says: September 23, 2014

      Hi Parthiban.

      In general, RDMA (the protocol itself) can support up to 2 GB in one message.
      RDMA devices may have lower limit.

      If you need to send more data than the maximum supported value (1 GB in your example),
      you can use several RDMA writes to send the local (big) buffer to the remote buffer.

      Thanks
      Dotan
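The splitting suggested in the reply above comes down to simple offset arithmetic. The following is a minimal sketch in plain C, not a complete RDMA program: `struct chunk`, `chunk_count` and `chunk_at` are hypothetical helpers, and the maximum message size is passed in as a parameter because the real limit depends on the device.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One piece of a large transfer: the byte offset into both the local and
 * the remote buffer, and the number of bytes to write at that offset. */
struct chunk {
    uint64_t offset;
    uint32_t length;
};

/* Number of messages needed to move total bytes when a single message
 * may carry at most max_msg bytes. */
static size_t chunk_count(uint64_t total, uint32_t max_msg)
{
    assert(max_msg > 0);
    return (size_t)((total + max_msg - 1) / max_msg);
}

/* Describe chunk number index (0-based). Every chunk is max_msg bytes
 * long except possibly the last one, which holds the remainder. */
static struct chunk chunk_at(uint64_t total, uint32_t max_msg, size_t index)
{
    struct chunk c;
    uint64_t remaining;

    assert(index < chunk_count(total, max_msg));

    c.offset = (uint64_t)index * max_msg;
    remaining = total - c.offset;
    c.length = remaining < max_msg ? (uint32_t)remaining : max_msg;
    return c;
}
```

For a 5 GB buffer and a 1 GB limit this yields five chunks. Each chunk would then be posted as its own RDMA Write, with `offset` added to both the local buffer address and the remote address, and `length` used as the length of the single scatter/gather entry.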

      • Parthiban says: September 23, 2014

        Hi Dotan,
        Thanks a lot for your response, I'm new to RDMA is there any program that i can refer to implement this?
        Thanks.

      • Dotan Barak says: September 23, 2014

        Hi Parthiban.

        I haven't published a "hello world" post yet.
        A good example can be the examples/rping.c in the following URL:
        .

        Thanks
        Dotan

      • rsai says: September 25, 2014

        Just to point out, I looked through different vendors' driver source code, and what I found was that Mellanox is the only one that treats 0 as 2 GB.

      • Dotan Barak says: September 25, 2014

        Hi.

        Here is a text from the InfiniBand specifications:

        9.3.3.3 DMA LENGTH (DMALEN) - 32 BITS

        This field indicates the length, in bytes, of the remote DMA operation.

        C9-9: For an HCA performing RDMA operations, the minimum length
        specified in the DMALen field is 0; the maximum length is 2^31.

        So, the value zero (in packet headers) means 2^31.
        If one wishes to send a message with zero bytes, he can use a Send Request with no scatter/gather elements at all.

        Thanks
        Dotan

  6. Parthiban says: September 25, 2014

    Hi Dotan,
    Thanks for your pointers, not able to see the link you have posted, but i have referred the program in the below link to implement,
    http://web.mit.edu/freebsd/head/contrib/ofed/libibverbs/examples/rc_pingpong.c
    Thanks.
