Big Data Systems - Index

Infrastructure, Frameworks, and Paradigms

NFS: Sun's Network File System
[SOSP '03] The Google File System
[OSDI '04] MapReduce: Simplified Data Processing on Large Clusters
[SOSP '09] FAWN: A Fast Array of Wimpy Nodes
[NSDI '11] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
[NSDI '12] Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
[HotOS '15] Scalability! But at what COST? (pdf)
[HotOS '21] From Cloud Computing to Sky Computing

Scheduling & Resource Allocation

[NSDI '11] Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center
[EuroSys '13] Omega: flexible, scalable schedulers for large compute clusters
[SoCC '13] Apache Hadoop YARN: Yet Another Resource Negotiator
[SoCC '14] Wrangler: Predictable and Faster Jobs using Fewer Resources
[OSDI '14] Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing (pdf)
[SIGCOMM '14] Tetris: Multi-Resource Packing for Cluster Schedulers (pdf)
[ASPLOS '14] Quasar: Resource-Efficient and QoS-Aware Cluster Management
[SIGCOMM '15] Network-Aware Scheduling for Data-Parallel Jobs: Plan When You Can
[OSDI '16] CARBYNE: Altruistic Scheduling in Multi-Resource Clusters (pdf)
[OSDI '16] Packing and Dependency-aware Scheduling for Data-Parallel Clusters
[NSDI '16] HUG: Multi-Resource Fairness for Correlated and Elastic Demands
[EuroSys '16] TetriSched: global rescheduling with adaptive plan-ahead in dynamic heterogeneous clusters
[SoCC '17] Selecting the best VM across multiple public clouds: A data-driven performance modeling approach
[ATC '18] On the diversity of cluster workloads and its impact on research results

Cloud/Serverless Computing

[SoCC '17] Occupy the Cloud: Distributed Computing for the 99%
[arXiv '19] Cloud Programming Simplified: A Berkeley View on Serverless Computing
[SoCC '19] Centralized Core-granular Scheduling for Serverless Functions
[SoCC '19] Cirrus: a Serverless Framework for End-to-end ML Workflows
[NSDI '19] Shuffling, Fast and Slow: Scalable Analytics on Serverless Infrastructure
[SIGMOD '20] Le Taureau: Deconstructing the Serverless Landscape & A Look Forward
[SoCC '20] Serverless linear algebra
[ATC '20] Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider
[SIGMOD '21] Towards Demystifying Serverless Machine Learning Training
[OSDI '21] Dorylus: Affordable, Scalable, and Accurate GNN Training with Distributed CPU Servers and Serverless Threads
[SoCC '21] Atoll: A Scalable Low-Latency Serverless Platform
[NSDI '21] Caerus: Nimble Task Scheduling for Serverless Analytics
[ASPLOS '22] Serverless computing on heterogeneous computers
[arXiv '22] Groundhog: Efficient Request Isolation in FaaS (pdf)

Network Flow Scheduling

[SIGCOMM '11] Managing Data Transfers in Computer Clusters with Orchestra
[HotNets '12] Coflow: A Networking Abstraction for Cluster Applications
[SIGCOMM '14] Efficient coflow scheduling with Varys
[SIGCOMM '14] Baraat: Decentralized task-aware scheduling for data center networks
[SIGCOMM '15] Aalo: Efficient coflow scheduling without prior knowledge
[SIGCOMM '16] CODA: Toward Automatically Identifying and Scheduling COflows in the DArk
[SIGCOMM '16] Scheduling Mix-flows in Commodity Datacenters with Karuna (pdf)
[SIGCOMM '18] Sincronia: Near-Optimal Network Design for Coflows
[SPAA '19] Near Optimal Coflow Scheduling in Networks

Graphs

MIT's 6.886 Graph Analytics reading list by Prof. Julian Shun
[SIGMOD '10] Pregel: A System for Large-Scale Graph Processing
[OSDI '12] PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs
[PPoPP '13] Ligra: A Lightweight Graph Processing Framework for Shared Memory (pdf)
[OSDI '14] GraphX: Graph Processing in a Distributed Dataflow Framework
[ATC '17] Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication
[EuroSys '17] MOSAIC: Processing a Trillion-Edge Graph on a Single Machine (pdf)
[VLDB '18] A Distributed Multi-GPU System for Fast Graph Processing (pdf)
[SoCC '20] PaGraph: Scaling GNN Training on Large Graphs via Computation-aware Caching (pdf)
[EuroSys '21] NextDoor: Accelerating graph sampling for graph machine learning using GPUs
[OSDI '21] Marius: Learning Massive Graph Embeddings on a Single Machine
[arXiv '22] Marius++: Large-Scale Training of Graph Neural Networks on a Single Machine
[MLSys '22] Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph

Distributed Tracing

[Textbook] Distributed Tracing in Practice
[SOSP '15] Pivot tracing: dynamic causal monitoring for distributed systems (pdf)
[SoCC '16] Principled Workflow-Centric Tracing of Distributed Systems (pdf)
[SOSP '17] Canopy: An End-to-End Performance Tracing And Analysis System (pdf)
[SoCC '18] Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay (pdf)
[SoCC '19] Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering (pdf)
[HotNets '21] Snicket: Query-Driven Distributed Tracing (pdf)
[NSDI '23] The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems (pdf)

Caching

[SoCC '11] Small Cache, Big Effect: Provable Load Balancing for Randomly Partitioned Cluster Services
[NSDI '16] Be Fast, Cheap and in Control with SwitchKV
[SOSP '17] NetCache: Balancing Key-Value Stores with Fast In-Network Caching
[FAST '19] DistCache: Provable Load Balancing for Large-Scale Storage Systems with Distributed Caching

New Data, Hardware Models

[ISCA '17] In-Datacenter Performance Analysis of a Tensor Processing Unit

Databases

[SIGMOD '12] Towards a Unified Architecture for in-RDBMS Analytics
[arXiv '13] Bayesian Optimization in a Billion Dimensions via Random Embeddings
[SIGMOD '17] Automatic Database Management System Tuning Through Large-scale Machine Learning
[HotStorage '20] Too Many Knobs to Tune? Towards Faster Database Tuning by Pre-selecting Important Knobs
[arXiv '21] Facilitating Database Tuning with Hyper-Parameter Optimization: A Comprehensive Experimental Evaluation
[VLDB '21] An Inquiry into Machine Learning-based Automatic Configuration Tuning Services on Real-World Database Management Systems (pdf)
[VLDB '22] LlamaTune: Sample-Efficient DBMS Configuration Tuning

Meta stuff

Reading lists
- CS 294 @ Berkeley: Machine Learning Systems
- CS 744 @ UW-Madison: Big Data Systems
- CS 6787 @ Cornell: Advanced Machine Learning Systems, with a focus on the ML side
- Awesome-System-for-Machine-Learning: An open-sourced reading list
- The MLSys conference
- SOSP AI Systems workshop
Some other stuff
- Meta papers
  - A Berkeley View of Systems Challenges for AI
  - MLSys: The New Frontier of Machine Learning Systems
- Systems Benchmarking Crimes
- CSE 599W @ U Washington Slides: Not a paper reading class, more of an end-to-end, comprehensive introduction to the foundations of DL Systems
- CS 759 @ UW-Madison (HPC) Course Notes: A great overview of HPC, CUDA, OpenMP, MPI
