QJUMP NSDI 2015

Queues Don’t Matter When You Can JUMP Them!

Posted by Yiran on July 14, 2019

核心思想

rate limiting 与 priority value 相结合, 来保证 latency

Motivation

解决数据中心网络中 network interference 问题: congestion from throughput-intensive applications causes queueing that delays traffic from latency-sensitive applications.

实际上与DCTCP解决的问题有相似之处, latency来自于排队, 而大流占据队列使得小流的排队长, DCTCP通过减小queue长度来降低latency, 采用per-flow的拥塞控制+switch的ECN标记. QJUMP从应用的角度解决network interference来保证latency

Design

  • Single Switch Service Model

    \[worst \ case \ end-to-end \ delay <= n \times \frac{P}{R} + \epsilon\]

    $P$ the maximum packet size (in bits), $R$ is the rate of the slowest link in bits per second and $\epsilon$ is the cumulative processing delay introduced by switch hops.

    If we can find a bound for servicing delay, we can rate limit hosts so that they can never experience queueing dealy.

  • Network epochs

    A network epoch is the maximum time that an idle network will take to service one packet from every sending host, regardless of the source, destination or timing of those packets.

    每个host一个epoch只发一个包, 就能有latency的bound. 由于时钟不同步,放宽后一个epoch应当是worst case end-to-end delay的两倍.

    \[network \ epoch = 2n \times \frac{P}{R} + \epsilon\]
  • QJUMP level

    如果按照上式的所有host每个epoch发一个包, 随着host数目增多, 吞吐会下降. 对吞吐敏感的应用不友好.

    \[throughput = \frac{P}{network \ epoch} \approx \frac{R}{2n}\]

    然而, 不同的应用对latency要求不同, 并且实际情况下, 不可能所有的host同时向一个目的端口发数据. 通过引入一个 factor $f$, 根据不同的应用放宽host的数目, 也就是引入QJUMP level. $f$ is a “throughput factor”: as the value of $f$ grows, so does the amount of bandwidth available.

    Use hardware priorities to run different QJump levels together, but isolated from each other.

    \[n^{'} = \frac{n}{f}\]

    $f = 1$时, 一个epoch能发1个包保证latency; $f = n$时, 一个epoch能发$n$个包保证throughput. QJUMP level利用网络支持的优先级实现: For each priority, assign a distinct value of $f$ , with higher priorities receiving smaller values. Since a small value of $f$ implies an aggressive rate limit, priorities become useful because they are no longer “free”.

Experiments

  • Testbed
  • Simulation

    $P = 9kB$, $n = 144$, {$f_{0}$… $f_{7}$} = {144,100,20,10,5,3,2,1}

Thoughts

  1. QJUMP实际上是对不同的优先级, 实施不同的速率limit. 高优先级的速率控制应当比低优先级的速率控制更加aggressive.

  2. 在端进行rate limit实际上把网络中的排队一部分转移到了端的socket buffer中, QJUMP resizes socket buffers to apply early back-pressure. end-to-end latency实际上包括在端的排队latency以及网络中排队latency. 如果不resizes socket buffer可能会性能下降. 应用层 rate control 的重要性?

  3. Figure 9(b), 结果中QJUMP比pfabric指标好, 说明了端速率控制的必要性?QJUMP对小流甚至能超过pfabric,但是大流不一定. Figure 9(f), between 30% and 63% worse than pFabric.

    文中的解释:In the data-mining workload, 85% of all flows transfer fewer than 100kB, but over 80% of the bytes are transferred in flows of greater than 100MB (less than 15% of the total flows). QJUMP’s short epoch intervals cannot sense the difference between large flows, so it does not apply any rate-limiting (scheduling) to them. This results in sub-optimal behavior. A combined approach where QJUMP regulates interactions between large flows and small flows, while DCTCP regulates the interactions between different large flows might improve this.

  4. 优先级是针对应用的, 四层以上的, QJUMP解决的也是应用 interference的问题. DCTCP针对三层, 更加底层, 解决小流排队长

参考文献

Queues Don’t Matter When You Can JUMP Them! (best paper)