Networked Systems for Cloud Computing

  • sovereignty: legislation cause data to stay within regions

motivation: cloud app

  • merchant silicon switch: Broadcom; only provide chip (cheaper)
    • as opposed to Cisco, which provide entire rack + software (expensive)
  • previous network could not run new huge app
    • four-post design: 4 router each connected to each of 512 racks via 1G port
      • each ToR only 4G uplink despite 40x server w/ 1G link
      • ⇒ 10:1 oversubscription (40G of server NIC bandwidth over 4G uplink)
  • uniform bandwidth: pairwise same among server
    • power domain: outlet w/ same power source
      • source fail ⇒ all fail
    • uniform bandwidth resilient to power domain fail
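
The four-post numbers above can be checked with a quick calculation. A minimal sketch; the function name and argument layout are made up for illustration:

```python
def oversubscription(servers, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth at one ToR."""
    return (servers * server_gbps) / (uplinks * uplink_gbps)

# Four-post design from the notes: 40 servers at 1G, four 1G uplinks per ToR.
four_post = oversubscription(servers=40, server_gbps=1, uplinks=4, uplink_gbps=1)
print(f"four-post ToR oversubscription = {four_post:.0f}:1")
```

A ratio of 1 would correspond to the non-oversubscribed (uniform bandwidth) goal.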

Papers

Jupiter rising: a decade of clos topologies and centralized control in Google’s datacenter network

Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Hong Liu, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, Amin Vahdat, CACM, 2016

  • Clos topology: switch have same radix; core & aggregation layer
    • assume switch has k port ⇒ k/2 core + k aggregation switch
    • ⇒ get switch w/ k²/2 port
    • rearrangeable
    • non-blocking: 1:1 subscription ratio (telecom terminology)
      • mathematically proved: as long as m ≥ n, where n is downlink per middle layer, m is uplink
      • bisection bandwidth: as if cut network in half
    • multi-stage Clos: more layer ⇒ exponential scaling
      • w/ k-port switch: 2-stage give k²/2 port, n-stage give 2·(k/2)^n
  • Firehose: 32up, 32down aggregation block each made of Clos of 8-port switch
    • each ToR connect to 2 aggregation block
    • deployed side-by-side w/ legacy network; big red button (fallback)
  • Watchtower: 128-port line card from 3 layer of 8x 16-port switch chip
    • standardized design for economic of scale
    • optical fiber
  • Saturn: similar to Firehose but w/ 288-port line card from 12x 24-port chip
    • ToR: 4up 20down (5:1 oversubscription) or 8up 16down (2:1 oversubscription)
  • Jupiter: w/ 16x40G or 64x10G switch chip
    • 128-port centauri chassis from 4 switch chip (not interconnected)
    • 64up 256down blocking middle block from 4 centauri
    • aggregation block from 8 middle block
    • spine block from 6 centauri; 128down to 64x aggregation block (2x redundancy)
    • incremental: build aggregation block first, spine later
  • external connection: cluster border router (CBR), work like normal racks
    • much larger internal traffic than external
    • choose this bc any racks can have all external bandwidth
    • freedome block (FDB): freedome border router (FBR) + freedome edge router (FER)
      • ??
    • datacenter freedome (DFD): 4x FDB to campus layer
    • campus freedome (CFD): 4x FDB to WAN
  • routing for full bisection bandwidth
    • equal-cost multi-path (ECMP)
      • same path per flow, e.g., hash flow 5-tuple
    • centralized routing
      • work bc topology very regular
      • switch (client) tell Firepath master state w/ BGP update
      • master provide 1 default route for outgoing traffic, aggregate incoming traffic into a single IP prefix
      • ??
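
A minimal sketch of the per-flow ECMP choice noted above (hash the 5-tuple, pick one of the equal-cost next hops). The SHA-256 hash, the addresses, and the uplink names are assumptions for illustration; real switch chips use their own hardware hash functions.

```python
import hashlib

def ecmp_next_hop(flow, next_hops):
    """Pick a next hop for a flow.

    flow: 5-tuple (src_ip, dst_ip, protocol, src_port, dst_port).
    The same 5-tuple always maps to the same next hop, so packets of one
    flow stay on one path and are not reordered.
    """
    key = "|".join(str(field) for field in flow).encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

uplinks = ["spine-0", "spine-1", "spine-2", "spine-3"]  # hypothetical names
flow = ("10.0.0.1", "10.0.1.9", 6, 43512, 443)          # hypothetical TCP flow

first = ecmp_next_hop(flow, uplinks)
second = ecmp_next_hop(flow, uplinks)
print(first, second)  # same uplink both times
```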

Jupiter evolving: transforming google’s datacenter network via optical circuit switches and software-defined networking

Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, Amin Vahdat, SIGCOMM, 2022

  • wavelength division multiplexing (WDM): multiple stream on 1 optic fiber
    • data rate can vary per stream
  • optical circuit switch (OCS): programmable mirror
    • make topology reconfigurable w/o manual operation
    • each block w/ 4 separate power domain (failure domain)
  • direct connect architecture: rid spine, OCS connect high-speed aggregation block
    • bc spine & full bisection cost too much
    • blocking ⇒ traffic engineering & topology engineering
      • doable bc traffic mostly stable, topology more stable
      • weighted cost multi-path (WCMP)
  • traffic matrix: directed demand between aggregation block
  • Orion: software-defined networking (SDN) control plane for OCS
    • aggregation block run Orion domain controller
    • OCS group run Orion DCNI
      • DCNI: datacenter network interconnection layer (where OCS sit)
    • topology engineering controller change Orion DCNI
    • use separate control plane network; but collocate w/ data plane
    • fail static: data plane continue w/ last programmed config when control plane fail
  • traffic engineering
    • indirect path: may hop over another aggregation block to satisfy demand
      • hedging: reduce burst; use spare capacity
    • minimize maximum link utilization (MLU)
    • frequent: 15min at paper’s time; 2-5min now
  • topology engineering
    • only change topology when no feasible solution thru routing
    • traffic-aware topology; help w/ heterogeneous link capacity
    • manual rewiring + OCS reconfiguration
      • gradual rewiring a few links
      • capacity drop ≤25%
    • may reduce “stretch” for some traffic: path length relative to shortest path
    • infrequent: weeks at paper’s time
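
A small worked example of the MLU vs. stretch trade-off from the traffic engineering notes above. This is a sketch under assumed numbers: three aggregation blocks A, B, C in a full mesh, 100G per directed link, and a 150G demand from A to B. The function names are made up, and the 50/50 split is picked by hand rather than by an optimizer.

```python
from collections import defaultdict

def max_link_utilization(path_flows, capacity):
    """path_flows: list of (path, gbps); path is a tuple of block names.
    capacity: dict mapping directed link (u, v) to gbps."""
    load = defaultdict(float)
    for path, gbps in path_flows:
        for u, v in zip(path, path[1:]):
            load[(u, v)] += gbps
    return max(load[link] / capacity[link] for link in load)

def average_stretch(path_flows):
    """Traffic-weighted number of block-level hops (direct path = 1)."""
    total = sum(gbps for _, gbps in path_flows)
    return sum(gbps * (len(path) - 1) for path, gbps in path_flows) / total

blocks = ["A", "B", "C"]
capacity = {(u, v): 100.0 for u in blocks for v in blocks if u != v}

direct_only = [(("A", "B"), 150.0)]
with_transit = [(("A", "B"), 75.0), (("A", "C", "B"), 75.0)]

mlu_direct = max_link_utilization(direct_only, capacity)
mlu_transit = max_link_utilization(with_transit, capacity)
stretch_direct = average_stretch(direct_only)
stretch_transit = average_stretch(with_transit)
print(mlu_direct, mlu_transit, stretch_direct, stretch_transit)
```

Using the indirect path through C brings MLU under 1 but raises stretch. If topology engineering moved more links between A and B, the direct path could carry the demand without the extra hop.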

Alibaba HPN: A Data Center Network for Large Language Model Training

Kun Qian, Yongqing Xi, Jiamin Cao, Jiaqi Gao, Yichi Xu, Yu Guan, Binzhang Fu, Xuemei Shi, Fangbo Zhu, Rui Miao, Chao Wang, Peng Wang, Pengcheng Zhang, Xianlong Zeng, Eddie Ruan, Zhiping Yao, Ennan Zhai, Dennis Cai, SIGCOMM, 2024

  • dense (not MoE) LLM training cause huge synchronized spike
  • very few connection per network interface (~10)
  • 2 network: frontend standard Clos, backend for LLM training only
    • frontend has storage; handle inference
  • backend:
  • NIC per GPU + CPU; 8 GPU per host (server)
  • intra-host network for each GPU e.g. NVLink (alongside NIC)
    • only CPU’s NIC connect to frontend; GPU NIC connect to backend
  • 136-host segment; +1/16 spare GPU in case of failure
  • rail-optimized ToR: each GPU in a host connect to a different ToR, the same one other hosts’ same-index GPU connect to
    • can go thru different ToR + NVLink for communication
    • mitigate biggest impact: ToR failure
  • every other backend ToR connect another plane’s aggregation block
    • 2-plane pod
    • avoid routing load imbalance from hash polarization
  • multi-pod core layer: highly oversubscribed bc training fitted in 1 pod
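
A toy illustration of the hash polarization problem mentioned above: if two switch tiers apply the same hash to the same flow fields, every flow that reaches a given second-tier switch has already hashed to the same value, so that switch picks the same uplink for all of them. The hash function, flow IDs, and salt are assumptions. The salted variant only shows what independent hashing looks like; it is not HPN's fix (the notes attribute that to the 2-plane design).

```python
import hashlib

def pick_uplink(flow_id, n_uplinks, salt=""):
    """Hash a flow ID (stand-in for a 5-tuple) onto one of n_uplinks."""
    digest = hashlib.sha256(f"{salt}:{flow_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_uplinks

flows = range(1000)

# Tier 1 spreads flows over 2 uplinks; keep the flows sent to uplink 0.
arrived_at_switch0 = [f for f in flows if pick_uplink(f, 2) == 0]

# Tier 2 reuses the identical hash: all of these flows pick uplink 0 again.
polarized = {pick_uplink(f, 2) for f in arrived_at_switch0}

# Tier 2 with a different salt: choices are independent of tier 1.
independent = {pick_uplink(f, 2, salt="tier2") for f in arrived_at_switch0}

print("same hash uses uplinks:", sorted(polarized))
print("salted hash uses uplinks:", sorted(independent))
```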

NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network

Cong Liang, Xiangli Song, Jing Cheng, Mowei Wang, Yashe Liu, Zhenhua Liu, Shizhen Zhao, Yong Cui, SIGCOMM, 2024

  • futuristic configurable ToR
  • AWGR: dumb optical switch; smart ToR reconfigure path by sending different wavelength
    • nanosecond reconfiguration time
  • each epoch: ToR send 1-bit message to indicate whether it want to send, to trigger reconfiguration
    • no centralized control
    • REQUEST, GRANT, ACCEPT message to establish connection
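
The REQUEST / GRANT / ACCEPT exchange above can be sketched as one round of distributed matching. This is a simplified illustration, not the paper's exact algorithm: it assumes one port per ToR, random tie-breaking, and hypothetical ToR names.

```python
import random

def negotiate(demand, rng):
    """One epoch of matching.

    demand: dict mapping source ToR to the set of destination ToRs it has
    queued data for. Returns dict source -> destination for the
    connections established this epoch.
    """
    # REQUEST: each source asks every destination it has data for.
    requests = {}
    for src, dsts in demand.items():
        for dst in dsts:
            requests.setdefault(dst, []).append(src)

    # GRANT: each destination grants exactly one of its requesters.
    grants = {}
    for dst, srcs in requests.items():
        chosen = rng.choice(sorted(srcs))
        grants.setdefault(chosen, []).append(dst)

    # ACCEPT: each source accepts exactly one of the grants it received.
    return {src: rng.choice(sorted(dsts)) for src, dsts in grants.items()}

demand = {"tor1": {"tor2", "tor3"}, "tor2": {"tor3"}, "tor3": {"tor1"}}
matching = negotiate(demand, random.Random(0))
print(matching)
```

No ToR needs global state: each destination grants at most one source and each source accepts at most one grant, so the result is always a conflict-free matching.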

Running BGP in Data Centers at Scale

Anubhavnidhi Abhashkumar, Kausik Subramanian, Alexey Andreyev, Hyojeong Kim, Nanda Kishore Salem, Jingyi Yang, Petr Lapukhov, Aditya Akella, Hongyi Zeng, NSDI, 2021

  • Facebook used BGP for DC network routing
    • tussle in software development: BGP already exist & has software
    • ⇒ fast startup
  • no IGP bc OSPF does not scale
  • emulate external BGP: each switch is 1 AS
  • peer group: switch w/ same role, connected to same group
    • use very similar BGP policy
  • use AS confederation for ASN assignment
    • group all ASN within each pod into 1 when advertise externally
    • uniform ASN assignment across DC, reuse if possible
    • avoid devastating buggy config via simulation & staged rollout
  • each spine plane has unique ASN
  • infrastructure IP for switch, vs production IP for server
  • aggregate route per rack/pod to minimize routing table
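
A small example of per-rack route aggregation as in the last bullet, using Python's standard `ipaddress` module. The prefixes are hypothetical: four contiguous /26 server subnets in one rack collapse into a single /24 that the rack switch could advertise upstream.

```python
import ipaddress

# Hypothetical server subnets behind one rack switch.
rack_subnets = [
    ipaddress.ip_network("10.1.0.0/26"),
    ipaddress.ip_network("10.1.0.64/26"),
    ipaddress.ip_network("10.1.0.128/26"),
    ipaddress.ip_network("10.1.0.192/26"),
]

# collapse_addresses merges adjacent/overlapping networks into the
# smallest equivalent list of prefixes.
aggregated = list(ipaddress.collapse_addresses(rack_subnets))
print(aggregated)  # one /24 instead of four /26 entries
```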

Orion: Google’s Software-Defined Networking Control Plane

Andrew D. Ferguson, Steve Gribble, Chi-Yao Hong, Charles Killian, Waqar Mohsin, Henrik Muehe, Joon Ong, Leon Poutievski, Arjun Singh, Lorenzo Vicisano, Richard Alimi, Shawn Shuoshuo Chen, Mike Conley, Subhasree Mandal, Karthik Nagaraj, Kondapa Naidu Bollineni, Amr Sabaa, Shidong Zhang, Min Zhu, Amin Vahdat, NSDI, 2021

  • SDN: app write to state, controller read state & adjust
  • generalized forwarding: forwarding table++; match any header field, do any action
  • Orion: break up routing, network management, config management into microservices
  • blast radius: portion of network affected by a controller failure
  • inter-block controller (IBC) control controller in spine/aggregation block
  • no fate sharing in SDN: dunno what failed when controller cannot reach switch
    • Orion identify switch state: (un)healthy/unknown
    • for unknown, initially fail-closed; if below capacity, fail-static
  • out-of-band control plane except for ToR (in-band)
    • out-of-band break circular dependency, but expensive
  • intent reconciliation:
    • believe top-level authority on conflict
  • architecture: core → network information base (NIB) → managers → OpenFlow front-end (OFE)
    • NIB: in-memory database of network state; non-durable
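
A toy sketch of the "app write to state, controller read state & adjust" pattern around an in-memory NIB. The class, table names, and callback API are all hypothetical and much simpler than Orion's actual interfaces; the sketch only shows the publish-subscribe shape.

```python
from collections import defaultdict

class NIB:
    """In-memory, non-durable store of network state with change notifications."""

    def __init__(self):
        self._tables = defaultdict(dict)       # table name -> {key: value}
        self._subscribers = defaultdict(list)  # table name -> [callback]

    def subscribe(self, table, callback):
        """Register callback(key, value), invoked on every write to table."""
        self._subscribers[table].append(callback)

    def write(self, table, key, value):
        self._tables[table][key] = value
        for callback in self._subscribers[table]:
            callback(key, value)

    def read(self, table, key):
        return self._tables[table].get(key)

nib = NIB()
programmed = []  # stand-in for flows pushed to switches by a front-end

# A front-end-like component reacts to route changes in the NIB.
nib.subscribe("routes", lambda key, value: programmed.append((key, value)))

# A routing app expresses intent by writing state, not by touching switches.
nib.write("routes", "10.0.0.0/24", "port3")
print(programmed)
```

Because the store is non-durable, a restarted controller would have to rebuild this state from switches and apps rather than from disk.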

Teal: Learning-Accelerated Optimization of WAN Traffic Engineering

Zhiying Xu, Francis Y. Yan, Rachee Singh, Justin T. Chiu, Alexander M. Rush, Minlan Yu, SIGCOMM, 2023

RedTE: Mitigating Subsecond Traffic Bursts with Real-time and Distributed Traffic Engineering

Fei Gui, Songtao Wang, Dan Li, Li Chen, Kaihui Gao, Congcong Min, Yi Wang, SIGCOMM, 2024

B4 and after: managing hierarchy, partitioning, and asymmetry for availability and scale in google’s software-defined WAN

Chi-Yao Hong, Subhasree Mandal, Mohammad Al-Fares, Min Zhu, Richard Alimi, Kondapa Naidu B., Chandan Bhagat, Sourabh Jain, Jay Kaimal, Shiyu Liang, Kirill Mendelev, Steve Padgett, Faro Rabe, Saikat Ray, Malveeka Tewari, Matt Tierney, Monika Zahn, Jonathan Zolla, Joon Ong, Amin Vahdat, SIGCOMM, 2018

Achieving high utilization with software-driven WAN

Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, Roger Wattenhofer, SIGCOMM, 2013

EBB: Reliable and Evolvable Express Backbone Network in Meta

Marek Denis, Yuanjun Yao, Ashley Hatch, Qin Zhang, Chiun Lin Lim, Shuqiang Zhang, Kyle Sugrue, Henry Kwok, Mikel Jimenez Fernandez, Petr Lapukhov, Sandeep Hebbani, Gaya Nagarajan, Omar Baldonado, Lixin Gao, Ying Hang, SIGCOMM, 2023

OneWAN is better than two: Unifying a split WAN architecture

Umesh Krishnaswamy, Rachee Singh, Paul Mattes, Paul-Andre C Bissonnette, Nikolaj Bjørner, Zahira Nasrin, Sonal Kothari, Prabhakar Reddy, John Abeln, Srikanth Kandula, Himanshu Raj, Luis Irun-Briz, Jamie Gaudette, Erica Lan, NSDI, 2023

Data center TCP (DCTCP)

Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, Murari Sridharan, SIGCOMM, 2010

Swift: Delay is Simple and Effective for Congestion Control in the Datacenter

Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, Amin Vahdat, SIGCOMM, 2020

Crux: GPU-Efficient Communication Scheduling for Deep Learning Training

Jiamin Cao, Yu Guan, Kun Qian, Jiaqi Gao, Wencong Xiao, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai, SIGCOMM, 2024

Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms

Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, James Dinan, Jeff Hammond, Torsten Hoefler, arXiv, 2025

Andromeda: Performance, Isolation, and Velocity at Scale in Cloud Network Virtualization

Michael Dalton, David Schultz, Jacob Adriaens, Ahsan Arefin, Anshuman Gupta, Brian Fahs, Dima Rubinstein, Enrique Cauich Zermeno, Erik Rubow, James Alexander Docauer, Jesse Alpert, Jing Ai, Jon Olson, Kevin DeCabooter, Marc de Kruijf, Nan Hua, Nathan Lewis, Nikhil Kasinadhuni, Riccardo Crepaldi, Srinivas Krishnan, Subbaiah Venkata, Yossi Richter, Uday Naik, Amin Vahdat, NSDI, 2018

Network Virtualization in Multi-tenant Datacenters

Teemu Koponen, Keith Amidon, Peter Balland, Martin Casado, Anupam Chanda, Bryan Fulton, Igor Ganichev, Jesse Gross, Paul Ingram, Ethan Jackson, Andrew Lambeth, Romain Lenglet, Shih-Hao Li, Amar Padmanabhan, Justin Pettit, Ben Pfaff, Rajiv Ramanathan, Scott Shenker, Alan Shieh, Jeremy Stribling, Pankaj Thakkar, Dan Wendlandt, Alexander Yip, Ronghua Zhang, NSDI, 2014

Achelous: Enabling Programmability, Elasticity, and Reliability in Hyperscale Cloud Networks

Chengkun Wei, Xing Li, Ye Yang, Xiaochong Jiang, Tianyu Xu, Bowen Yang, Taotao Wu, Chao Xu, Yilong Lv, Haifeng Gao, Zhentao Zhang, Zikang Chen, Zeke Wang, Zihui Zhang, Shunmin Zhu, Wenzhi Chen, SIGCOMM, 2023

Triton: A Flexible Hardware Offloading Architecture for Accelerating Apsara vSwitch in Alibaba Cloud

Xing Li, Xiaochong Jiang, Ye Yang, Lilong Chen, Yi Wang, Chao Wang, Chao Xu, Yilong Lv, Bowen Yang, Taotao Wu, Haifeng Gao, Zikang Chen, Yisong Qiao, Hongwei Ding, Yijian Dong, Hang Yang, Jianming Song, Jianyuan Lu, Pengyu Zhang, Chengkun Wei, Zihui Zhang, Wenzhi Chen, Qinming He, Shunmin Zhu, SIGCOMM, 2024

Azure Accelerated Networking: SmartNICs in the Public Cloud

Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, Albert Greenberg, NSDI, 2018

1RMA: Re-envisioning Remote Memory Access for Multi-tenant Datacenters

Arjun Singhvi, Aditya Akella, Dan Gibson, Thomas F. Wenisch, Monica Wong-Chan, Sean Clark, Milo M. K. Martin, Moray McLaren, Prashant Chandra, Rob Cauble, Hassan M. G. Wassel, Behnam Montazeri, Simon L. Sabato, Joel Scherpelz, Amin Vahdat, SIGCOMM, 2020

Empowering Azure Storage with RDMA

Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara, Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin, Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg, Manish Gupta, Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse, Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu, Guohan Lu, Yuemin Lu, Xiakun Lu, Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari, Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel, Jordan Rhee, Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun, Nick Swanson, Fuhou Tian, Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, Brian Zill, NSDI, 2023

Harmonic: Hardware-assisted RDMA Performance Isolation for Public Clouds

Jiaqi Lou, Xinhao Kong, Jinghan Huang, Wei Bai, Nam Sung Kim, Danyang Zhuo, NSDI, 2024

Maglev: A Fast and Reliable Software Network Load Balancer

Danielle E. Eisenbud, Cheng Yi, Carlo Contavalli, Cody Smith, Roman Kononov, Eric Mann-Hielscher, Ardas Cilingiroglu, Bin Cheyney, Wentao Shang, Jinnah Dylan Hosein, NSDI, 2016

Ananta: cloud scale load balancing

Parveen Patel, Deepak Bansal, Lihua Yuan, Ashwin Murthy, Albert Greenberg, David A. Maltz, Randy Kern, Hemant Kumar, Marios Zikos, Hongyu Wu, Changhoon Kim, Naveen Karri, SIGCOMM, 2013

Network Load Balancing with In-network Reordering Support for RDMA

Cha Hwan Song, Xin Zhe Khooi, Raj Joshi, Inho Choi, Jialin Li, Mun Choon Chan, SIGCOMM, 2023

End-User Mapping: Next Generation Request Routing for Content Delivery

Fangfei Chen, Ramesh K. Sitaraman, Marcelo Torres, SIGCOMM, 2015

Analyzing the Performance of an Anycast CDN

Matt Calder, Ashley Flavel, Ethan Katz-Bassett, Ratul Mahajan, Jitendra Padhye, IMC, 2015

Taking the Edge off with Espresso: Scale, Reliability and Programmability for Global Internet Peering

Kok-Kiong Yap, Murtaza Motiwala, Jeremy Rahe, Steve Padgett, Matthew Holliman, Gary Baldus, Marcus Hines, Taeeun Kim, Ashok Narayanan, Ankur Jain, Victor Lin, Colin Rice, Brian Rogan, Arjun Singh, Bert Tanaka, Manish Verma, Puneet Sood, Mukarram Tariq, Matt Tierney, Dzevad Trumic, Vytautas Valancius, Calvin Ying, Mahesh Kallahalla, Bikash Koley, Amin Vahdat, SIGCOMM, 2017

EdgeFabric