NetBouncer NSDI 2019

Failure Localization in Data Center Networks

Posted by Yiran on March 28, 2019


Localizes data center failures, both device failures and link failures, by probing with IP-in-IP packets. (Microsoft)

Three requirements for network troubleshooting (Motivation)

  • End host’s perspective: switches cannot observe gray failures. A gray failure is one the switch does not log, e.g. it drops packets probabilistically, and hence cannot be detected by simply evaluating connectivity.
  • Compatible with commodity hardware, existing network protocols and software stack
  • Precise and accurate: tools such as Pingmesh can only localize a failure to a region.

Design Overview

1.Probing plan design


Assumption: failures are independent

Goal: a link-identifiable probing plan

Theorem 1 (sufficient probing theorem; proof in the appendix). In a Clos network with k layers of switches (k ≥ 1), by probing all paths from the servers to the top-layer switches, we can uniquely infer the link success probabilities from the measured path success probabilities, if and only if at least one path with success probability 1 passes each switch.
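
One way to see why the perfect-path condition matters is the following simplified intuition. It is not the paper's proof, and the symbols a_i, b_j, y_ij and c are introduced here only for illustration. Consider a switch whose links on one side have success probabilities a_1, ..., a_m and whose links on the other side have b_1, ..., b_n, and suppose each probe crossing the switch measures

$$
y_{ij} = a_i \, b_j .
$$

For any c > 0 that keeps all values in [0, 1], the rescaled assignment

$$
a_i' = c \, a_i , \qquad b_j' = \frac{b_j}{c}
$$

produces exactly the same y_ij, so the measurements alone do not fix the scale. If some path through the switch has y_ij = 1, then a_i = b_j = 1 is forced because no probability exceeds 1, and every other link follows from a_{i'} = y_{i'j} and b_{j'} = y_{ij'}.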


2.Efficient path probing via IP-in-IP

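Probing uses an IP-in-IP packet bouncing approach. Among earlier schemes, ping-based probing is unable to pinpoint the routing path, and Tracert consumes switch CPU, whose usage is strictly limited (007).

Packet bouncing: the probing server picks a switch, and the switch bounces the probe back. The sender and the receiver are the same server. Each link is probed in both directions.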
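
The bouncing step can be sketched as a toy model in Python. This is an illustration of the idea only, not NetBouncer's implementation: the class names, the function names and the IP addresses are hypothetical, and setting the inner source address to the server is an assumption. The outer header steers the probe to the chosen switch; once the switch strips it, the inner header routes the packet back to the server that sent it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPHeader:
    src: str
    dst: str


@dataclass(frozen=True)
class Probe:
    outer: IPHeader  # addressed to the switch that should bounce the probe
    inner: IPHeader  # addressed back to the probing server


def build_probe(server_ip: str, switch_ip: str) -> Probe:
    """Encapsulate a packet destined to the server inside one destined to the switch."""
    return Probe(
        outer=IPHeader(src=server_ip, dst=switch_ip),
        inner=IPHeader(src=server_ip, dst=server_ip),  # inner src is an assumption
    )


def bounce(probe: Probe, switch_ip: str) -> IPHeader:
    """Model the switch: strip the outer header and forward by the inner one."""
    if probe.outer.dst != switch_ip:
        raise ValueError("probe is not addressed to this switch")
    return probe.inner


# Hypothetical addresses, for illustration only.
probe = build_probe(server_ip="10.0.0.5", switch_ip="10.1.0.1")
returned = bounce(probe, switch_ip="10.1.0.1")
print(returned.dst)  # the inner destination is the probing server itself
```

Because the probe leaves from and returns to the same server, a single host can record both the send and the receive, and the upward and downward directions of each link on the path are exercised by one probe.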

3.Failure inference from path measurements


Problem: how can one infer the link probabilities from the end-to-end path measurements?


A latent factor model, combined with domain knowledge from network troubleshooting.
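
The objective function that the following notes refer to is not reproduced in this post. As a hedged reconstruction that matches the description (a data-fit term followed by a regularizer with a two-direction penalty and a quadratic term), it can be written as below, where x_i is the success probability of link i, y_j is the measured success probability of path j, and λ is the regularization weight. The exact notation is an assumption and should be checked against the paper.

$$
\min_{x} \; \sum_{j} \Bigl( y_j - \prod_{i \,:\, \mathrm{link}_i \in \mathrm{path}_j} x_i \Bigr)^{2} \; + \; \lambda \sum_{i} x_i \, (1 - x_i)
\qquad \text{subject to } 0 \le x_i \le 1 .
$$

The regularizer x_i (1 - x_i) is zero at x_i = 0 and x_i = 1 and largest at x_i = 1/2, so minimizing it moves each link probability toward one of the two ends.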

The second term is a specialized regularization term. It has two characteristics: (a) it has a two-direction penalty; (b) because of the quadratic term, the closer to 1 the greater the slope.

Characteristic (a): The regularization term tends to “pull” the good links to be better, while “push” the bad links to be worse, while the product of link probabilities will stay approximately the same.

Characteristic (b): It mitigates accidental packet loss and noisy measurements, which helps endorse most links (good) and assign the blame to only a small number of links (bad).

The model uses a non-convex representation (which works better than a convex representation). It is solved with coordinate descent; gradient descent is not used because its running time is too long.
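
A minimal coordinate descent sketch in Python follows. It assumes the objective is a sum of squared errors between each measured path probability y_j and the product of its link probabilities, plus lam times the sum of x_i (1 - x_i), with every x_i kept in [0, 1]. The function name, the initial guess and the default parameters are choices made for this illustration, not NetBouncer's code, and each link is assumed to appear at most once per path. With all other links fixed, the objective is a quadratic in x_i, so each coordinate update has a closed form.

```python
def coordinate_descent(paths, y, n_links, lam=0.1, max_sweeps=100, tol=1e-9):
    """paths[j] lists the link ids on path j; y[j] is its measured success rate."""
    # Initial guess (an assumption): mean success rate of the paths using each link.
    x = []
    for i in range(n_links):
        ys = [y[j] for j, p in enumerate(paths) if i in p]
        x.append(sum(ys) / len(ys) if ys else 1.0)

    for _ in range(max_sweeps):
        max_change = 0.0
        for i in range(n_links):
            # With the other links fixed, the objective in x[i] is
            # a * x[i]**2 + b * x[i] + const.
            a, b, covered = -lam, lam, False
            for j, p in enumerate(paths):
                if i not in p:
                    continue
                covered = True
                c = 1.0  # product of the other links on path j
                for k in p:
                    if k != i:
                        c *= x[k]
                a += c * c
                b -= 2.0 * c * y[j]
            if not covered:
                continue  # no measurement touches this link; leave it unchanged
            if a > 0:
                # Convex in x[i]: clip the vertex to [0, 1].
                new = min(1.0, max(0.0, -b / (2.0 * a)))
            else:
                # Concave or linear: the minimum is at an endpoint (f(0)=0, f(1)=a+b).
                new = 1.0 if a + b < 0 else 0.0
            max_change = max(max_change, abs(new - x[i]))
            x[i] = new
        if max_change < tol:
            break
    return x


# Toy example: link 0 is shared by both paths; only the path through link 2 is lossy.
paths = [[0, 1], [0, 2]]
y = [1.0, 0.5]
x = coordinate_descent(paths, y, n_links=3)
print(x)  # links 0 and 1 are endorsed; link 2 takes the blame
```

In this toy input the perfect path through links 0 and 1 lets the solver endorse both of them, so the loss on the second path is attributed to link 2 alone.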

Implementation and evaluation

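  • Simulation: performance is very close to hop-by-hop probing; device failure detection is essential for avoiding false negatives; specialized regularization vs. standard regularization; convergence time of stochastic gradient descent (SGD) vs. coordinate descent (CD); effect of the regularization parameter lambda on false positives and false negatives; comparison with existing systems: deTector, NetScope, and KDD14

  • Testbed: the controller/agent architecture is similar to Pingmesh's

    NetBouncer chooses 5 minutes as one probing epoch. The algorithm runs in tens of seconds (under 60 seconds).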



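  • It assumes that all probing packets experience the same failures as application data
  • Failures that affect only specific packets, such as packet black holes or ACL/firewall misconfigurations, cannot be localized, because NetBouncer scans across a large number of IP addresses
  • In theory, neither zero false positives nor zero false negatives can be guaranteed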


NetBouncer: Active Device and Link Failure Localization in Data Center Networks