DCQCN+ ICNP 2018

Taming Large-scale Incast Congestion in RDMA

Posted by Yiran on April 2, 2019

解决问题

DCQCN是目前RDMA over Ethernet Networks的拥塞控制算法,但是大规模的incast导致其性能下降。

核心思想

动态参数控制:发送端需要知悉incast的规模, using a long period and a small increase step for large-scale incast, and a short period and a large increase step for small-scale incast.

DCQCN: Fixed rate recovery period($K$) and fixed increase steps($R_{AI}$)

1. Flow rate reduction:

  • The receiver generates and sends a CNP to the sender only if (1) the received packet is marked on ECN bits and (2) the flow has not been notified for a fixed period. This period, called interval between CNPs $N$, is a static parameter that needed to be configured ahead of time. DCQCN 设置为固定的50us.

  • Senders cut the current rate $R_{C}$ and the target rate $R_{T}$ as follows:

    $\alpha$ denotes the reduction factor, $g$ is a pre-configured constant, and $R_{min}$ means the minimum rate of a flow. Rate cut is trivial. If CNPs are received for 2 or more periods, $\alpha$ will increase and rate cut ratio will be larger at the next time. $K_{α}$ 是更新$\alpha$ 的timer.

2. Flow rate recovery:

  • Receivers maintain a time counter and a byte counter and corresponding state bits for each flow. There is also a state counter to sign the state of increase. The time state and byte state are set to 0 after rate cut. When a flow has not received CNPs for time $K = 55us$, its time state increases. When a flow has already sent B bytes without receiving a CNP, its byte state increases. When one of the two states is not 0, the flow starts recovery in a bisecting way for $F = 5$ rounds called Fast Recovery (FR):

  • The start of a new round is the increase of the states. After that, Additive Increase (AI) and Hyper Increase (HI) can be triggered. In Additive Increase, rate increases by fixed step $R_{AI}$:

  • Hyper Increase is triggered when both the states exceed $F$. In Hyper Increase, the rate shows an exponential growth with parameter RHAI, i = min{time_state, byte_state} − F:

  • At each recovery, $\alpha$ will also decrease, $K_{\alpha} = K$ 是更新$\alpha$ 的timer:

3. Line-rate strategy:

In DCQCN, flows start at line rate to get full utilization of links. But line-rate is so aggressive that as the number of congested flow grows, the trivial-way rate cut needs more periods to take effect. 因此DCQCN仍然需要PFC保证不丢包.

4. DCQCN 设计与实际 Mellanox CX4网卡参数的不同:

5. 简单地调大K、调小$R_{AI}$对小Incast不友好

6. CNP的产生速率(网卡实现问题)

RDMA网卡大概能够1-5μs产生一个CNP, CX4 能够1μs产生一个CNP. 对10Gbps链路, 每1.25 μs到达一个1KB size数据包, CNP产生速率足够, 但是对于40Gbps, 100Gbps,CNP的产生速率就不够了. 如果拥塞的流不能够收到足够的CNP, 就会导致在rate recovery的时间长.

Design

  • The recovery timer should always be larger than the CNP period to receive a CNP. All congested flows with the same receiver share the CNPs equally by time multiplexing. Thus, CNP period can reflect the number of congested flows on the receiver. We utilize an available field in CNPs to carry the CNP period without using additional signal packets.

    Flow rate reduction:

    If the CNP interval τ is larger than 50us, the timer of $\alpha$ update $K_{\alpha}$ and the timer of rate increase $K$ will update as follows:

    $\lambda_{\alpha}$ is set to 1 and $\lambda$ is set to 2. M: MTU.

  • The total throughput increase should not increase as the number of congested flows grows. Thus, increase step $R_{AI}$ should be proportional to flow rate. Generally, flows of larger rate have more packets in the congested queue, thus with a larger possibility packets it will receive a CNP and reduce rate. So when converged, all congested flows have almost same rate, which is related to the flow number.

    Flow rate recovery:

    Additive Increase (AI):

    Hyper Increase (HI):

Implementation and evaluation

ns3 + testbed.

  • Large-scale Incast. At least 2,000 flows. 10 times larger than that of DCQCN in simulation and 4 times larger in testbed.

  • Small-scale Incast. Similar to DCQCN.

  • Realistic Workload. Similar to DCQCN.

My thinking

参考文献

DCQCN+: Taming Large-scale Incast Congestion in RDMA over Ethernet Networks