Advanced Operating Systems How to read a paper, Srinivasan Keshav, SIGCOMM CCR, 2007 An Evaluation of The Ninth SOSP Submissions or How (and How Not) to Write A Good Systems Paper, Roy Levin, David D. Redell RAID: High-Performance, Reliable Secondary Storage, Peter M. Chen, Edward K. Lee, Garth A. Gibson, Randy H. Katz, David A. Patterson, ACM Computing Surveys, 1994 The Recovery Manager of the System R Database Manager, Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, Irving Traiger, ACM Computing Surveys, 1981 Disconnected Operation in the Coda File System, James J. Kistler M. Satyanarayanan, ACM Transactions on Computer Systems, 1992 Eraser: A Dynamic Data Race Detector for Multithreaded Programs, Stefan Savage, Michael Burrows, Greg Nelson, and Patrick Sobalvarro, Thomas Anderson, ACM Transactions on Computer Systems, 1997 Improving the reliability of commodity operating systems, Michael M. Swift, Brian N. Bershad, Henry M. Levy, SOSP, 2003 Cores that don’t count, Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy, Ranganathan, David E. Culler, Amin Vahdat, HotOS, 2021 The Design and Implementation of a Log-Structured File System, Mendel Rosenblum, John K. Ousterh, ACM Transactions on Computer Systems, 1992 Rethink the sync, Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, Jason Flinn, TOCS, 2008 Caching in the Sprite Network File System, Michael N. Nelson, Brent B. Welch, John K. Ousterho, ACM Transactions on Computer Systems, 1988 Outatime: Using Speculation to Enable Low-Latency Continuous Interaction for Mobile Cloud Gaming, Kyungmin Lee, David Chu, Eduardo Cuervo, Johannes Kopf, Yury Degtyarev, Sergey Grizan, Alec Wolman, Jason Flinn, MobiSys, 2015 A low-bandwidth network file system, Athicha Muthitacharoen, Benjie Chen, David Mazières, SOSP, 2001 Coz: finding code that counts with causal profiling, Charlie Curtsinger, Emery D. Berger, SOSP, 2015 Horcrux: Automatic JavaScript Parallelism for Resource-Efficient Web Computation, Shaghayegh Mardani, Ayush Goel, Ronny Ko, Harsha V. Madhyastha, Ravi Netravali, OSDI, 2021 Shielding Applications from an Untrusted Cloud with Haven, Andrew Baumann, Marcus Peinado, Galen Hunt, OSDI, 2014 Difference Engine: Harnessing Memory Redundancy in Virtual Machines, Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, Amin Vahdat, Communications of the ACM, 2010 Deciding when to forget in the Elephant file system, Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, Jacob Ofir, SOSP, 2017 Energy-aware adaptation for mobile applications, Jason Flinn, M. Satyanarayanan, SOSP, 1999 Wide-area cooperative storage with CFS, Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, SIGOPS OSR, 2001 Cluster-Based Scalable Network Services, Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier, SOSP, 1997 Improving MapReduce Performance in Heterogeneous Environments, Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, Ion Stoica, OSDI, 2008 Efficient Memory Disaggregation with Infnswap, Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin, NSDI, 2017 Hints for Computer System Design, Butler W. Lampson, SOSP, 1983 On system design, Jim Waldo, SIGPLAN Notices, 2006 reliability things fail: storage, communication, bug in code, datacenter, CPU/other hardware security vuln.: spectre, meltdown, buffer overflow, attack (DDoS) incompatibility: version change performance non-scalable algo. congestion deadlock out of memory GC pause hypervisor pause 3 pass: intro + heading, overall grasp, reproduce survey: start from reference section of recent searched paper; find prominent author/conference paper category: implementation, theory, idea, measurement, survey, simulation know related work: old & new, peer-reviewed not readily solved by existing approach (or w/ tweak) disk array for performance&reliability&cost compare architecture: RAID level 0-6 motivation: processor much faster but disk is bottleneck of overall perf large disk array exponentially likely to fail array of disk improve bandwidth for big I/O & latency for small I/O data striping: distribute data across multiple disk for parallel data interleaving granularity fine-grained: every I/O use all disk, but is blocking & cause seek for all disk coarse-grained: parallel request method to handle redundant info compute: parity, e.g., Hamming/Reed-Solomon code bottleneck on disk storing parity bit P+Q redundancy (Reed-Solomon): 2 parity disk to defend against ≥2 disk failure distribution: whether concentrate redundant info on few disk memory-style ECC not assume controller know which disk bad RAID 3&5&6 do inversely better on big I/O, poorer in small I/O (but, 5&6 are capped) can read fewer disk if write large improve small I/O perf by batching&caching block interleaving cannot handle some system crash unless w/ hardware like nonvolatile RAM system crash more frequent track if each parity block consistent for recovery perf bit rot hard to handle; solution: predict disk about to fail & replace correlated disk failure, system crash, bit rot greatly increase data loss probability in RAID 5 The Recovery Manager of the System R Database Manager , Jim Gray, Paul McJones, Mike Blasgen, Bruce Lindsay, Raymond Lorie, Tom Price, Franco Putzolu, Irving Traiger, ACM Computing Surveys, 1981concurrent SQL database failure recovery transaction: ACD (no I) only way to undo transaction is another shadowing: CoW, point to new copy at commit transaction log for recovery include update value to be idempotent checkpoint to avoid log big recovery: redo forward from checkpoint for committed (winner), undo backward for loser offline access to remote file system shared file for collaboration & larger storage local caching for availability prioritize user-specified, recently used; fetch periodically cache entire file instead of block reconcile conflict thru replay of log of update aggregate log periodically, only save last update, rm inverse operation detect conflict by version number impossible to automatically resolve conflict evaluation: benchmark & user trace little conflict during in CMU usage bc not editing same file Eraser: A Dynamic Data Race Detector for Multithreaded Programs , Stefan Savage, Michael Burrows, Greg Nelson, and Patrick Sobalvarro, Thomas Anderson, ACM Transactions on Computer Systems, 1997monitor: force all access to acquire lock, but cannot handle dynamic allocation happens-before problem: cannot catch when separate access by chance detect synchronization between variable access across thread by transitive happens-before lock set intersection to track all lock held when accessing each variable allow initialization w/o locking, RW lock, reuse allocation high overhead, but optimize by deduplicating lock set to track mechanism to opt out of checking bc false positive binary modification to track set of lock held when accessing each static/heap variable device driver/ kernel module responsible for many OS crash need backward compatibility → cannot separate address space why possible: well-defined OS-driver interface extension procedure call (XPC): wrapper of kernel call that copy argument/ return value memory back and forth (interpose) separate address space/ page table for driver assume extension/driver not malicious; not defensive recovery: deallocate based on object tracking; restart driver page tracking to avoid double free reliable: avoid almost all crash overhead: higher CPU utilization/ lower performance packet sending higher overhead than receiving bc no batching XPC is main bottleneck bc page table switch cause TLB flush, close to page tracking & object tracking extensible: lots of shared code among different driver Cores that don’t count , Peter H. Hochschild, Paul Turner, Jeffrey C. Mogul, Rama Govindaraju, Parthasarathy, Ranganathan, David E. Culler, Amin Vahdat, HotOS, 2021CPU certain (“mercurial”) core certain instruction silent failure e.g., faulty encryption cause data loss, mutex malfunction storage/network redundancy trick do not apply detection: proactive/ passive/ live with hard: intermittent, specific core, after aging, under specific frequency&voltage&temperature infeasible overhead: need triple computation to verify no known solution motivation: disk seek slow → slow read/write slow; RAM cheap; want to utilize write speed store all data in append-only log cache read & inode map in memory metric: write cost: disk access time per storage vs theoretical time superlinear: write cost vs fraction alive in segment cleaned segment & clean disk by group of block distinguish hot & cold data; prioritize cleaning cold segment bc they may squat segment for long assume temporal locality: factor youngest file age in segment evaluation: user trace & micro benchmark Rethink the sync , Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, Jason Flinn, TOCS, 2008Caching in the Sprite Network File System , Michael N. Nelson, Brent B. Welch, John K. Ousterho, ACM Transactions on Computer Systems, 1988Difference Engine: Harnessing Memory Redundancy in Virtual Machines , Diwaker Gupta, Sangmin Lee, Michael Vrable, Stefan Savage, Alex C. Snoeren, George Varghese, Geoffrey M. Voelker, Amin Vahdat, Communications of the ACM, 2010Deciding when to forget in the Elephant file system , Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch, Ross W. Carton, Jacob Ofir, SOSP, 2017Wide-area cooperative storage with CFS , Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, Ion Stoica, SIGOPS OSR, 2001Cluster-Based Scalable Network Services , Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, Paul Gauthier, SOSP, 1997On system design , Jim Waldo, SIGPLAN Notices, 2006