Mesos is a scheduler for sharing cluster resources between multiple services/applications. Using two-level scheduling (application-specific schedulers), its orchestration mechanism can provide scalable, decentralized scheduling.
- Motivation: Share resources among multiple frameworks/services
- Background: OS scheduling: time sharing
- Mechanism: Pre-empt processes
- Policy: Which process is chosen to run
- Background: Cluster scheduling
- Time sharing: Context switches are more expensive
- Space sharing: Partition resources at the same time across jobs
- Policies: Should be aware of locality
- Scale
- Fault tolerance
- Background: Target environment
- Multiple MapReduce versions
- Mix of frameworks: MPI, Spark, MapReduce
- Data sharing across frameworks
- Avoid per-framework clusters (not good for resource utilization)
- Centralized master (& its backups)
- Agent on every machine
- Two-level scheduling: Framework scheduler attached to Mesos
- "Simplicity across frameworks"
- Slave send heartbeats with information on available resources
- Mesos master sends resource offers to frameworks -> Frameworks replies tasks & their granularity
- Constraints
- Example of constraints
- Soft constraints: Prefer to run the task at a particular location (e.g. for data locality)
- Hard constraints: Task needs GPUs
- Constraints in Mesos
- Applications can reject offers
- Optimization: Filters (reduces the number of rejected offers)
- Example of constraints
- Allocation
- Assumption: Tasks are short -> allocate when they finish
- Long tasks: Revocation beyond guaranteed allocation
- E.g., MPI has guaranteed allocation of 8 machines; currently assigned 12 machines; can take away 4 machines
- Isolation
- Containers (Docker/Linux cgroups)
- Node fails -> forward the failure to Hadoop, let them decide what to do
- Master fails -> recover (soft) state by communicating with framework schedulers/workers
- Also has a standby master
- Framework scheduler fails -> Mesos doesn't handle that
- Problem: More frameworks have preferred nodes than available. Who gets the offers?
- Lottery scheduling: Offers weighted by num allocations
- Centralized vs. Distributed
- Framework complexity: Every framework developer needs to implement a scheduler
- Fragmentation, starvation: Especially if a job has large resource requirements. Partial workaround: min offer size
- Inter-dependent framework: 2 frameworks cannot be colocated (e.g. due to security, privacy, ...)