---
typora-copy-images-to: ./img
---

CS449

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2018: Systems for Data Science

[TOC]

Introduction

  • parallel programming : difficult, needs abstractions and an understanding of systems architecture
  • working with changing data at scale : very difficult, requires consistency, complex protocols, concurrency hard to analyze, strange failures, acceptable performance
  • "Whenever the load on the system increases by an order of magnitude, a completely new design is needed!"
  • systems
    • data wrangling, ETL : OLAP (online analytical processing)
    • data storage, safekeeping, management : relational database management systems RDBMS, non-relational systems
    • online vs real-time : streaming, IVM
    • data analysis : query engines, machine learning
  • database management system DBMS : software designed to store and manage databases with concurrency control and recovery components
    • query optimization and execution
    • relational operators
    • files and access methods
    • buffer management
    • disk space management
  • data model : collection of concepts for describing data
    • schema : description of particular collection of data given data model
    • relational model of data : main concept relation, every relation has schema
  • levels of abstraction
    • many views : how users see data
    • single conceptual (logical) schema : logical structures
    • physical schema : files and indexes used
  • data manipulation language : DML, queries
  • data description language : DDL, schema
  • query processing : high level declarative to efficient code automatically, query plan, cost-based optimizer, memory hierarchy-aware operator
  • data-parallel programming : SIMD, single instruction on multiple data
  • transaction
    • atomic : all or nothing sequence of database actions (reads/writes)
    • consistency : execution of ${T_1,\ldots, T_n}$ equivalent to serial execution, limits (serial, consensus when distributed)
    • integrity constraints
    • recovery from failure : write-ahead, append-only log and smart protocol, alternative (redundancy and replication via k-safety)
  • NoSQL : give up consistency or use eventual consistency (if there are no further changes, an eventually consistent state is reached)
  • database landscape : (figure omitted)

Writing database queries

  • SQL : DSL, limited, non-Turing complete, declarative, fundamental power is compositionality

  • universal quantifiers

  • complex queries : build from smaller parts

  • sometimes inexpressible : cannot match some polynomial-time problems (e.g. do these pairs form a cycle?)

  • first-order logic

    • variables : $x, y, x_1, x_2$
    • constants : $a, b, c, d$
    • relations : $R, S, T$
    • terms : constants and variables
    • formulae : $t_1=t_2$, $R(t_1,\ldots,t_n)$, $\phi\wedge\psi$, $\neg\phi$, $\exists x;\phi$
    • free variable : $free(x=1)=\{x\}$, $free(\exists x;\phi)=free(\phi)\setminus\{x\}$
    • quantifier rank : $qr:formulae\to\mathbb N$, $qr(t_1=t_2)=0$, $qr(\phi\wedge\psi)=max(qr(\phi),qr(\psi))$, $qr(\exists x;\phi)=qr(\phi)+1$
    • structure of signature $(c_1,\ldots, c_k;R_1^{ar(R_1)},\ldots, R_l^{ar(R_l)})$ (schema) : tuple $A=(c_1^A,\ldots, c_k^A;R_1^A,\ldots, R_l^A)$ with $c$ constants and $ar(R)$-ary relation
    • satisfaction relation : $A \vDash \phi[a_1,\ldots, a_n]$ when the free variables are replaced by $a_1,\ldots,a_n$
    • equivalent : $\phi \equiv\psi$ iff they are satisfied by exactly the same structures and assignments
    • neutral elements : monoids $(\mathbb R, +)$, $(\mathbb R,\cdot)$
  • domain independence : a query $Q$ is domain independent iff there exist no database instance $A$ and sets $B, C$ both containing all atomic values that appear in $A$ (the active domain) s.t. $Q_B(A)\not=Q_C(A)$; queries must evaluate to the same result no matter what domain is assumed, undecidable property

    • solution : syntactic restrictions (range-restricted queries)
    • safe-range normal form SRNF : variable substitution (no distinct pair of quantifiers may employ the same variable), remove universal quantifiers (use existentials), remove implications, remove double negations, flatten and/or
    • Codd's theorem : a query is definable in relational algebra iff it is definable in the range-restricted domain calculus
    • relativized rewritings : $\forall y;\phi(x_1,\ldots, x_k,y)$ into $\forall y; D(y)\Rightarrow\phi(x_1,\ldots,x_k,y)$ (example below)
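
A small illustration of the relativized rewriting (my own example, not from the notes), combined with the SRNF step that removes the universal quantifier:

```latex
% D ranges over the active domain
\[ \forall y\,\big(R(x,y) \Rightarrow S(y)\big)
   \;\rightsquigarrow\;
   \neg\,\exists y\,\big(D(y) \wedge R(x,y) \wedge \neg S(y)\big) \]
```

The quantified variable $y$ is now bound by a positive atom, so this part of the query is range-restricted and evaluates identically on any domain containing the active domain.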

Parallel collection programming and Spark

  • idea of parallelization : given polynomially much hardware (unrealistic), computation in better than linear time

  • relational algebra

    • selection : $\sigma$, select subset of rows from relation
    • projection : $\pi$, delete unwanted columns from relation
    • cross-product : $\times$, all pairs of tuples from the two relations
    • set-difference : $-$, tuples in relation 1 but not in relation 2
    • union : $\cup$, tuples in relation 1 or relation 2
    • intersection, join, division, renaming : syntactic sugar
  • complexity

    • $AC_k$ : languages recognizable by logspace-uniform classes of unbounded fan-in circuits of polysize and $O(\log^k(\cdot))$ depth
    • $AC_0$ : constant depth
    • $NC_k\subseteq AC_k$ : $AC_k$ with bounded (w.l.o.g. 2) fan-in gates
    • $NC=\cup_k NC_k$ : Nick's class, highly parallelizable problems
    • theorem : $AC_k\subseteq NC_{k+1}$, replace unbounded fan-in gates by binary trees of binary gates, depth grows by log factor
    • theorem : $NC_1\subseteq LOGSPACE$
    • conjecture : $NC\subset P$
    • hierarchy : $AC_0\subseteq NC_1\subseteq LOGSPACE\subseteq AC_1\subseteq\cdots\subseteq NC_i\subseteq AC_i\subseteq\cdots\subseteq NC = \cup_k NC_k=\cup_k AC_k\subseteq P$
  • first order queries : $AC_0$, for fixed schema and query, given the domain size build a circuit, single $LOGSPACE$ algo for this

    • given enough hardware : queries evaluated in constant time, in practice too large to be materialized
    • $k$ computing nodes : map each input gate to one node, network communication is an issue as queries are unknown beforehand
  • parallelizing SQL

    • SQL : $FO/RA$ + aggregates + bells and whistles
  • functional collection interface

    • list
    • foldRight : sequential
    • map : nicely parallelize
    • flatten
    • relation algebra : simple mapping
    • group by
    • reduce : nice parallel implementations
  • monad : map and flatten

  • object oriented database : motivated by the impedance mismatch (mapping rows to objects), but OO is complex, now many RDBMS are object-relational (relational at core, OO through an outside layer)

  • monad algebra

    • complex values : constructed from sets, tuples, atomic values from single-sorted domain
    • types : terms of the grammar $\tau=Dom\mid\{\tau\}\mid<A_1:\tau_1,\ldots, A_k:\tau_k>$
    • collection types
    • nullary tuple : $<>$
    • monad language $M$
      • identity : $id:x\mapsto x$, $sng \circ flatten = map(sng)\circ flatten=id$
      • composition : $f\circ g:x\mapsto g(f(x))$
      • constants : from $Dom\cup{\emptyset,<>}$
      • singleton set construction : $sng:x\mapsto\{x\}$
      • application of function to every member : $map(f):X\mapsto\{f(x)\mid x\in X\}$
      • unnesting : $flatten:x\mapsto\cup x$, $flatten\circ flatten=map(flatten)\circ flatten$ (associativity)
      • pairing : $pairwith_{A_1}:<A_1:X_1,\ldots,A_n:x_n>\mapsto\{<A_1:x_1,\ldots, A_n:x_n>\mid x_1\in X_1\}$
      • tuple formation : $<A_1:f_1,\ldots, A_n:f_n>:x\mapsto<A_1:f_1(x),\ldots,A_n:f_n(x)>$
      • projection : $\pi_{A_i}:<A_1:x_1,\ldots,A_n:x_n>\mapsto x_i$ (applied to tuples rather to sets of tuples)
      • flatmap : map then flatten
      • cartesian product : $f\times g$ defined as $<1:f,2:g>\circ pairwith_1\circ flatmap(pairwith_2)$
    • boolean : true $\{<>\}$, false $\emptyset$
      • conjunction : $\gamma\wedge\delta$ as $\gamma\times\delta$
      • negation : $\neg\gamma$ as $<1:id,2:\emptyset>\circ =$
    • monad with tensorial strength (tuple) : query language
    • monoid : endofunctors map (same type)
    • positive monad algebra : $M_U$ extends with set union, impractical query language as cannot express value equality
    • equality of atomic values : $<A_1:x_1,\ldots, A_k:x_k>\mapsto\{<>\mid x_i=x_j\}$
      • deep : holds on same type for each element
    • nest : $nest_{C=(B)}(R)$ computes $\{ \langle A : x ,C : \{ \langle B : y \rangle \mid \langle A : x ,B : y \rangle \in R \} \rangle \mid ( \exists y ) \langle A : x ,B : y \rangle \in R \}$
    • full monad algebra : extends $M_U$ either with deep equality, testing set membership $A\in B$ or containment $A\subseteq B$, selection $\sigma_{A=B}$, set difference $-$, set intersection $\cap$ or nesting
      • conservative extension over relational algebra
      • theorem : a mapping from a relational database to a relation is expressible in $M_U[=_{deep}]$ iff it is expressible in relational algebra, all of relational calculus can be expressed in $M_U[=_{deep}]$
      • conservativity : transitive closure not expressible
    • powerset : $pow:\{\tau\}\mapsto\{\{\tau\}\}$, $pow:S\mapsto 2^S$
      • $M_U[=_{deep},pow]$ : extremely expressive, all of $ELEMENTARY$ (the class of problems solvable in time $O(2^{2^{\cdot^{\cdot^{2^n}}}})$) queries
    • xquery : XML nodes encoded as label, child, or list pairs
  • map reduce :

    • hadoop

      // minimal MapReduce skeleton on plain Scala collections
      def hadoopMapReduce[BLOCK, K, V, RES](
            data: List[BLOCK],
            mapper: BLOCK => List[(K, V)],
            reducer: ((K, Iterable[V])) => RES): Iterable[RES] =
        data.flatMap(mapper)                                    // map phase: emit (key, value) pairs
          .groupBy(_._1)                                        // reshuffle: group values by key
          .map { case (k, kvs) => reducer((k, kvs.map(_._2))) } // reduce phase, one call per key
    • reshuffle : naive all-to-all communication between nodes, implemented by flatten and group-by

    • parallelization : map (yes), reduce/aggregation (not in general, yes if the operation is associative)

  • spark : fast and expressive cluster computing system, improve efficiency through general execution graphs and in-memory storage (10x faster on disk, 100x in memory), less code

    • resilient distributed datasets : RDDs, spread across cluster, built through parallel transformations, automatically rebuilt on failure, controllable persistence
    • operations : lazy transformations (map, filter, group by), actions (count, collect, save)
    • sort : locally sort each node's partition, collect the local tops and compute the global top (see the RDD sketch after this list)
  • yarn : yet another resource negotiator

  • bulk-synchronous parallel programming model : BSP, in each superstep, each node's compute function called in parallel, node message received at next superstep
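
A minimal RDD sketch of the lazy-transformation/action split and the local-top/global-top idea above (my own illustration; file path, app name and master are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

val counts = sc.textFile("input.txt")   // lazy: only records the lineage
  .flatMap(_.split("\\s+"))             // transformation
  .map(word => (word, 1))               // transformation
  .reduceByKey(_ + _)                   // shuffle + associative, parallel reduce

// Action: each partition keeps its local top, the driver merges them into the global top.
val top10 = counts.top(10)(Ordering.by[(String, Int), Int](_._2))
top10.foreach(println)
sc.stop()
```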

Parallel join processing and analytics

  • asynchronous : every node progresses individually, communication by message passing
  • synchronous : computation proceeds in distinct aligned epochs across distributed system (assumed in the following)
  • speed-up : more resources, proportionally less time for given amount of data
  • scale-up : if resources are increased in proportion to the increase in data size, time stays constant; get a better machine
  • scale-out : get more machines
  • parallel DBMS : natural
    • pipeline parallelism : many machines each doing one step in multi-step process
    • partition parallelism : many machines doing same thing on different pieces of data
    • shared-what question
      • shared memory : SMP, easy program, expensive to build, difficult to scale up
      • shared disk
      • shared nothing (network) : hard to program, cheap to build, easy to scale out
    • parallelism types
      • intra-operator : all machines working to compute given ops (scan, sort, join)
      • inter-operator : each ops run concurrently on different site (exploit pipelining), bushy trees
      • inter-query : different queries on different sites
    • automatic data partitioning : shared nothing benefits from good partitioning
      • range : good for equijoin, range queries, group-by
      • hash : good for equijoin
      • round robin : good to spread load
    • storing data
      • fragmentation : row-horizontal (disjoint), col-vertical (lossless-join, ids)
      • replication : increased availability, faster query evaluation, update and consistency more challenging
  • parallel hash equijoin : partition both relations by join key, distribute to sites, join at each site
    • good hash : distribute work evenly, caution on key skew
    • data placement : optimize partitioning scheme for workload
  • bloom filter : bit-vector of size $b$, $k$ different hash functions (see the sketch after this list)
    • principle : if some value hashes to $i$ for some hash function, set bit $i$ to 1
    • test value : true if the bits given by all hash functions are 1, no false negatives
    • optimal $k=(b/n)\ln 2$ where $n=|\text{distinct key values}|$
  • query optimization : reordering joins, push selections and projections down the tree, push aggregations (usually down), keep intermediate result as small as possible
  • data skew : reducer processing becomes the bottleneck
    • reducer-centric cost model : processing time at a reducer approximately monotonic in its input and output size
    • problem classification
      • input-size dominated : minimize max-reducer input
      • output-size dominated : minimize max-reducer output
      • input-output balanced : minimize combination of both
    • theta-join model : cover each true-valued cell of the join matrix by exactly one reducer
    • 1-bucket theta : distribute very evenly over regions, wins for output-size dominated joins
      • map and reduce phases : (figures omitted)
      • randomization : avoid pre-processing step, effectively removes output skew, input size very close to target (probability of receiving 10% over target close to 0)
    • how do we know which matrix cells have value true ?
    • cartesian product : entire matrix need to be covered
      • use square-shaped regions : reducer that covers $c$ will receive at least $2\sqrt{c}$ input tuples
      • lower bound : $|S||T|/r$, max-reducer input and max-reducer output within factors 2 and 4 of the lower bound
    • join-specific : unlikely to improve max-reducer output, has to beat 1-bucket-theta on input cost
    • approximate join matrix : (figure omitted)
    • M-bucket-I (O) algorithms : use histograms to minimize max-reducer input (output), identify candidate cells, try to cover all with $r$ regions, need a simple enough join predicate
    • memory-awareness : input for region might exceed reducer memory, solve by create more regions
      • 1-bucket theta : use square of side-length mem/2
      • M-bucket-I/O : instead of binary search on max-reducer I/O set it immediately to mem
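
A minimal Bloom-filter sketch matching the description above (my own illustration; the $k$ hash functions are derived from MurmurHash3 with different seeds):

```scala
import scala.util.hashing.MurmurHash3

// Bit-vector of size b with k hash functions; false positives possible, never false negatives.
final class BloomFilter(b: Int, k: Int) {
  private val bits = new java.util.BitSet(b)

  private def positions(x: String): Seq[Int] =
    (0 until k).map(seed => (MurmurHash3.stringHash(x, seed) & Int.MaxValue) % b)

  def add(x: String): Unit = positions(x).foreach(i => bits.set(i))
  def mightContain(x: String): Boolean = positions(x).forall(i => bits.get(i))
}

val bf = new BloomFilter(1 << 16, 5)
bf.add("epfl")
bf.mightContain("epfl")   // true
bf.mightContain("ethz")   // false with high probability
```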

Consistency and concurrency

  • consistency : receive the current/latest value no matter which node is contacted; if there is no update, the same value is retrieved from every node (requires atomic writes)
  • transaction : atomic sequence of reads and writes executing as if by itself in isolation, either commits or aborts
    • consistent state afterwards : if consistent beforehand
    • scheduling
      • serial : does not interleave actions of different transactions
      • equivalent : for any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule
      • serializable : same as some serial execution of transactions
      • conflict equivalent : involve same actions of same transactions, every pair of conflicting action ordered same way
      • conflict serializable : if conflict equivalent to some serial schedule (iff acyclic dependency graph)
    • dependency graph : one node per transaction, edge from $T_i$ to $T_j$ if $T_j$ reads/writes object last written by $T_i$
    • conflicts
      • reading uncommitted data : WR, dirty reads
      • unrepeatable reads : RW
      • overwriting uncommitted data : WW
    • aborting : $T_i$ aborted, all its actions undone, if $T_j$ reads object last written by $T_i$, $T_j$ must also be aborted
      • maintains logs in which every write recorded : used to undo and recover from crashes
  • acid properties
    • atomicity
    • consistency
    • isolation
    • durability
  • lock-based controls
    • two-phase locking : 2PL, each transaction must obtain a S shared lock before reading and an X exclusive lock before writing
      • transaction cannot request more locks once any lock released
      • X hold : no transaction can get X or S on that object
    • strict two-phase locking : all locks released only once transaction completes, enforce precedence graph to be acyclic thus serializable, simplify aborts
    • deadlocks : cycle of transactions waiting for lock to be released by each other
      • detection : build waits-for dependency graph
      • prevention : timestamp-based priorities make deadlock impossible
    • relaxing the assumption that the DB is a fixed collection of objects : even strict 2PL will not ensure serializability (phantoms), no records of a locked kind should be added while that kind is locked
      • index locking : lock index, if no index lock all pages or file/table to prevent new pages from being added, special case of predicate locking
      • predicate locking : grant lock on all records satisfying some logical predicate, lots of overhead
  • optimistic concurrency control : if conflicts are rare, gain concurrency by not locking and instead checking for conflicts before a transaction commits
    • Kung-Robinson model
      • read : from database, make change to private copies of objects
      • validate : check for conflict, each transaction assigned at end of read phase a numeric id (timestamp)
        • readSet($T_i$)
        • writeSet($T_i$)
        • test 1 : $T_i<T_j$ need $T_i$ completed before $T_j$ begins
        • test 2 : $T_i<T_j$ need $T_i$ completes before $T_j$ begins write, writeSet($T_i$) intersect readSet($T_j$) empty
        • test 3 : $T_i<T_j$ need $T_i$ completed read phase before $T_j$ does, writeSet($T_i$) intersect readSet$(T_j)$ empty and writeSet($T_i$) intersect writeSet($T_j$) empty
        • failure : $T_j$ restarted once $T_i$ finishes
      • write : make local copies of changes public
      • critical section : id assignment, validation and write phase, no concurrency; read-only transactions do not need it
      • overheads : record readSet and writeSet per transaction, must make validated write global, might require clean-up
  • multiversion concurrency control : let writer make new copy while readers use old copy, each transaction has reading timestamp $RTS$, each version of object has its writer timestamp $WTS$
    • version : chained backward, discard too old
    • transaction classified : writer if may write, otherwise reader
    • reader : for each object to be read, find newest version $WTS<TS(T)$, never restarted but might wait until appropriate version commits
    • writer : reads as a reader does; for a write, find the newest version $V$ s.t. $WTS(V)< TS(T)$; if not $RTS(V)< TS(T)$ reject, else $T$ makes a copy $CV$ of $V$ with a pointer to $V$, with $WTS(CV)=TS(T)$, $RTS(CV)=TS(T)$; $CV$ is buffered until $T$ commits, other transactions can see the $TS$ values but cannot read $CV$
  • snapshot isolation : each transaction has access mode, diagnostics size and isolation level, implemented in real DBMS
    • level serializable does not guarantee serializability
    • transaction work on copy at starting time, commit only if no update conflict since snapshot
    • can be implemented by multiversion concurrency control
  • distributed transaction management
    • assumption : multiple database servers talk to each other, no replication, transaction access objects on multiple servers
    • global 2PL : local strict 2PL at each site
    • distributed deadlock detection : each site maintains local waits-for graph, global deadlock might exist if no cycle in local graph
      • centralized : send all local graphs to one site
      • hierarchical : send local graphs to parent in hierarchy
      • timeout : abort transaction if waits too long
    • two-phase commit : two rounds, voting and termination; any site can decide to abort, every message is first locally logged (to survive failure), all log records carry the transaction id and coordinator id, the commit record includes the ids of all subordinates
      • failure at a site : if commit or abort log, must redo/undo
        • if coordinator : resend ack requests, subordinates that voted yes cannot decide on their own and block the transaction (unless one said no), if it fails after prepare it must abort
        • if subordinate and prepare logged : contact coordinator to find status, commit/abort, redo/undo and write log records
        • if subordinate without prepare : abort and undo
      • link failure : does not respond during commit
        • if coordinator : abort
        • if subordinate and has not yet voted : abort
        • if subordinate and voted yes : blocked until coordinator respond
      • subtransaction without update : answer reader instead of yes/no to prepare
  • impossibility to scale out transaction processing
    • consensus : at least a round-trip from master to each node
    • distance between nodes increases as the number of nodes increases, transaction throughput is upper-bounded by 1/latency
    • CAP theorem : cannot have consistency, availability and partition tolerance all at once; choose consistency (ACID DBMS) or availability (NoSQL systems), as network partitions do happen
      • consistency : possible answers to read request, newest value or error
      • availability : possible answers to read request, a value, not necessarily the newest
  • weak consistency : eventual consistency, no transactions, atomicity only to support consistency in the presence of replicas
    • consistent state : converged
    • server-side viewpoint : N = #nodes, W = #replicas that need ack receipt of update before update complete, R = #replicas contacted when read operation
      • W+R>N : strong consistency (quorum consensus method) (N=3, R=2, W=2 good performance/availability tradeoff)
      • W+R<=N : weak/eventual consistency (R=1 reasonable)
  • causal consistency : if memory operations are causally related, they are seen by every node in the same order, enforced by vector clocks (see the sketch after this list)
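
A minimal vector-clock sketch for the causal-consistency point above (my own illustration):

```scala
object VectorClocks {
  // One logical counter per node; clocks are attached to messages and merged on receipt.
  type VClock = Map[String, Long]

  def tick(vc: VClock, node: String): VClock =
    vc.updated(node, vc.getOrElse(node, 0L) + 1)

  // Element-wise maximum, applied when a node receives a message carrying a clock.
  def merge(a: VClock, b: VClock): VClock =
    (a.keySet ++ b.keySet).map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L))).toMap

  // a happened-before b iff a <= b component-wise and a != b; otherwise the events are concurrent.
  def happenedBefore(a: VClock, b: VClock): Boolean =
    a != b && (a.keySet ++ b.keySet).forall(k => a.getOrElse(k, 0L) <= b.getOrElse(k, 0L))
}
```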

How to parallelize

  • optimization criterion : running time performance
    • detailed goal : minimize expected cost over possible algos
    • need to know : distribution of possible inputs i, workloads, characteristics
    • reachability in graph :
      • scenario : undirected graph, single-core compute, main memory O(#vertices), huge graph on disk
      • workload : queries (a reachable from b), inserts (new edge), 100 queries for every insert, no delete
      • best : store connected components, union/find in memory, near-constant time per operation (inverse Ackermann); see the sketch after this list
    • parallel pagerank
      • scenario : stochastic n x n matrix M, find the principal eigenvector p = M * p, given mild ergodicity assumptions
      • option 1 : materialize all p on every machine, broadcast k * n
      • option 2 : server maintains p, worker requests/writes 2n components
  • heuristics
    • tricks : partitioning, replication, communication (eager, synchronous)
    • read-only vs read/write
    • big/small data
    • compute in place vs move data (reshuffle?)
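
A sketch of the union/find approach for the reachability workload above (standard union-find with path compression and union by rank; my own illustration):

```scala
final class UnionFind(n: Int) {
  private val parent = Array.tabulate(n)(identity)
  private val rank = Array.fill(n)(0)

  def find(x: Int): Int = {
    if (parent(x) != x) parent(x) = find(parent(x))   // path compression
    parent(x)
  }

  // insert: a new edge (a, b) merges the two connected components
  def union(a: Int, b: Int): Unit = {
    val (ra, rb) = (find(a), find(b))
    if (ra != rb) {
      if (rank(ra) < rank(rb)) parent(ra) = rb
      else if (rank(ra) > rank(rb)) parent(rb) = ra
      else { parent(rb) = ra; rank(ra) += 1 }
    }
  }

  // query: a reachable from b iff both have the same representative
  def reachable(a: Int, b: Int): Boolean = find(a) == find(b)
}
```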

Locality

  • locality : relates software systems to the physical world, any performance concern can be reduced to a locality issue; memory and computing power are limited (2/3 dimensions, transistors cannot get arbitrarily smaller)

  • in computing : all time/energy cost due to moving data over distance, networks are bottlenecks

  • principles

    • caching : used frequently, prefetching (usually assumes sequential access)
      • examples : CPU cache, MMU TLB (memory management unit, translation lookaside buffer), DBMS materialized views
    • prefer sequential over random access : hard drives, flash (only large blocks can be written)
    • partitioning : decomposability and embarrassing parallelism
  • data structures

    • arrays : storage layout should match looping order, row vs columns stores
    • trees : keep siblings local, leaf level dominates in size, aligned with sort order, good for range lookups, secondary indexes do not align
    • graphs
      • local representation graphs : chains, trees
      • small cuts graphs : resource-constrained graphs, road networks, brain, planar, k degrees of separation
      • random graphs : giant connected component, except under high sparseness
      • real-worlds graphs : locality nightmare, internet communication, social networks
      • no small cut : no partition of the nodes into two equally sized sets s.t. the number of crossing edges is less than linear, impossible to parallelize
      • small world phenomenon : short paths between any two nodes
  • external sorting

    • 2-way sort : 3 buffer pages; with $B$ buffers in general, number of passes $1+\left\lceil \log_{B-1}\lceil N/B\rceil\right\rceil$, cost $2N\times$ passes; in practice sorted in 2-3 passes, use a block of pages for each buffer

    • double buffering : prefetch shadow block for each buffer

  • joins

    • simple nested loops join : scan the inner relation once per outer tuple
    • block nested loops join : cost = scan of outer + #outer blocks $\times$ scan of inner, with #outer blocks = pages of outer / blocksize (cost sketch after this list)
    • hash-join : #partitions $k < B-1$, $B-2 >$ size of largest partition to be held in memory, $B$ must be $>\sqrt{M}$ assuming uniformly sized partitions; partitioning phase $2(M+N)$, matching phase $M+N$; superior to sort-merge join if relation sizes differ, can be parallelized, sensitive to data skew
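
A small sketch turning the cost formulas above into code (my own illustration; N, M are page counts, B the number of buffer pages):

```scala
import math.{ceil, log}

// passes of external merge sort with B buffers: 1 + ceil(log_{B-1}(ceil(N / B)))
def sortPasses(n: Long, b: Long): Long =
  1 + ceil(log(ceil(n.toDouble / b)) / log((b - 1).toDouble)).toLong

def sortCost(n: Long, b: Long): Long = 2 * n * sortPasses(n, b)   // each pass reads and writes N pages

// block nested loops join: scan of outer + (#outer blocks) * scan of inner
def bnlJoinCost(outer: Long, inner: Long, b: Long): Long = {
  val outerBlocks = ceil(outer.toDouble / (b - 2)).toLong         // B - 2 buffer pages per outer block
  outer + outerBlocks * inner
}
```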

Query engines

  • precomputation : indexing

  • query processing : algebraic plans

  • query optimization : heuristic, cost-based

  • memory hierarchies : cache, main memory, disk caches, disk, tapes

    • hard disk : sequential vs random access, latency vs bandwidth, capacity, block size, sectors, tracks, surfaces
      • seek time : 10ms, move head and read/write
      • transfer time per page : 0.1ms
      • cost : #page I/O, #seek, dominates CPU cost
    • flash : read/write/erase times differ, finitely many operations (mean time to failure MTTF), only erase large blocks
    • pages based
  • join : two collections sequentially stored, find pairs that match condition

    • block I/O : block nested loops BNL
    • exploiting sorted data : index nested loops, sort-merge
    • bucketization : hash join
  • relational model : (figure omitted)

  • select-project-cartesian queries : equivalent to FO, conjunctive queries, SQL (without union, except, aggs, negation or disjunction)

    • selection : $\sigma_\phi(R)$
    • projection : $\pi_A(R)$
    • cross-product : $R\times S$
    • column renaming : $\rho_{A_1,\ldots,A_{ar(R)}}(R)$
    • theta-join : $R \bowtie _ { \theta } S = \sigma _ { \theta } ( R \times S )$
  • optimization : using index, clustering, materialized views

    • algebraic rewriting : pushing selection and project down the algebra tree, join reordering
    • cost-based : using cost function for operations and statistics about data
      • selectivity : $\operatorname{sel}_{\phi}[R] \approx \frac{|\sigma_{\phi}(R)|}{|R|}$ estimates the size reduction, result size $|\sigma_{\phi}(R)| \approx \operatorname{sel}_{\phi}[R] \cdot |R|$
      • estimate : stats, a $1:n$ relationship gives $|R \bowtie_{\theta} S| = |S|$ and $\operatorname{sel}(R \bowtie_{\theta} S) = 1/|R|$, uniform categorical attribute with $k$ values $\operatorname{sel}_{A=c}[R] \approx 1/k$, equality selection on a key $\operatorname{sel}_{K=c}(R) = 1/|R|$
    • different operator implementation
  • query plans : relation algebra trees with annotations

    • each operator : choice of implementation

    • NL join : which outer (left child is outer by convention)

    • how buffer pages assigned

    • choose IdxNL join over BNL if few tuples in outer and index clustered

    • left-deep plans good as they admit pipelining

    • pipelining : selection, projection, BNL join and index NL join, versus materialization

  • System R algorithm : dynamic programming (bottom-up), $i$th phase builds $i$-relation plans from $(i-1)$-relation plans, considers only left-deep plans (sketch below)
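
A minimal sketch of the bottom-up, left-deep enumeration (my own illustration, assuming a hypothetical cost(partialPlan, nextRelation) estimator and ignoring interesting orders and physical operator choices):

```scala
def bestLeftDeepPlan(relations: Set[String],
                     cost: (List[String], String) => Double): List[String] = {
  // best(S) = cheapest left-deep plan joining exactly the relations in S
  var best = Map.empty[Set[String], (List[String], Double)]
  for (r <- relations) best += Set(r) -> ((List(r), 0.0))
  for (size <- 2 to relations.size; s <- relations.subsets(size)) {
    val candidates = for {
      r <- s.toList                    // r joined last, as the inner of the topmost join
      (plan, c) <- best.get(s - r)     // best plan for the remaining relations
    } yield (plan :+ r, c + cost(plan, r))
    best += s -> candidates.minBy(_._2)
  }
  best(relations)._1
}
```

With relations {R, S, T} this explores only orders such as ((R ⋈ S) ⋈ T), never bushy plans.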

Physical database design and tuning

  • understand workload
    • most important queries and their frequency
    • most important updates and their frequency
    • desired performance for these queries
  • decisions
    • what index to create ?
    • kind of index ? clustered or hash/tree
    • change to schema ? denormalized, horizontal partitioning, replication, views
  • index selection
    • hash index on inner good for nested loops
    • clustered B+ tree on join columns good for sort-merge
    • clustering useful whenever many tuples are retrieved
  • first normal form : need all
    • eliminate repeating groups in individual tables
    • create separate table for each set of related data
    • identify each set of related data with primary key
  • second normal form : in addition to first
    • no non-prime attribute is dependent on any proper subset of any candidate key of the relation
  • third normal form : 3NF iff at least one of the following holds for each functional dependency $X\to A$, equivalent to Codd's definition
    • $X$ contains $A$
    • $X$ is superkey
    • each column in $A-X$ contained in some candidate key
  • Boyce-Codd normal form BCNF : eliminate 3NF third alternative
  • schema evolution : define views to mask these changes
  • 3NF vs BCNF (example below)
    • lossless decomposition
    • dependency-preserving
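
A classic textbook illustration of the trade-off (my own example, not from the notes):

```latex
% R(Student, Course, Teacher) with functional dependencies
\[ \{Student,\,Course\} \to Teacher, \qquad Teacher \to Course \]
% Teacher -> Course violates BCNF (Teacher is not a superkey), yet R is in 3NF,
% since Course is contained in the candidate key {Student, Course}.
% The BCNF decomposition (Teacher, Course), (Student, Teacher) is lossless-join
% but not dependency-preserving: {Student, Course} -> Teacher is lost.
```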

Data warehouse and decision support

  • OLAP : online analytics processing

  • OLTP : online transaction processing, short update xacts (whereas OLAP/decision support runs long queries)

  • data warehousing : consolidate data from many sources in one large repository, synchronization of replicas, semantic integration

    • semantic integration : multiple sources, eliminate mismatches
    • heterogeneous sources : variety of formats and repositories
    • load, refresh, purge : periodically, freshness
    • metadata management : keep track of source, loading time
  • multidimensional data model : collection of numeric measures which depend on set of dimensions

    • MOLAP : stored as array
    • ROLAP : stored as relation, fact table (relate dimension to measure), dimension table, columns store
  • queries

    • roll-up : aggregating at different levels of dimension hierarchy
    • drill-down : inverse of roll-up
    • pivoting : aggregation on selected dimension
    • slicing and dicing
  • CUBE operator : $k$ dimensions, $2^k$ possible SQL group-bys can be generated through pivoting on subsets of the dimensions

  • star schema : fact table containing ids referring to detail tables (date, product, location), the full join is called a star join

  • bitmap index : one-hot encoding and bit-vector operations (see the sketch after this list)

  • join index : between multiple tables, rapidly grows

  • column store : column by column, order and duplicates matter, better compression, best for analytics (row store best for xacts)
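
A tiny bitmap-index sketch for the bullet above (my own illustration of a 6-row table with hypothetical columns color and size):

```scala
import scala.collection.immutable.BitSet

// One bit-vector per distinct column value; bit i is set iff row i has that value.
val colorIndex: Map[String, BitSet] = Map(
  "red"   -> BitSet(0, 3),
  "green" -> BitSet(1, 4, 5),
  "blue"  -> BitSet(2))
val sizeIndex: Map[String, BitSet] = Map(
  "small" -> BitSet(0, 1, 2),
  "large" -> BitSet(3, 4, 5))

// WHERE color = 'green' AND size = 'large'  =>  bitwise AND of the two vectors
val matching = colorIndex("green") & sizeIndex("large")   // BitSet(4, 5)
```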

Incremental view maintenance

  • materialized views : recomputation
  • incremental view maintenance : perform work only needed to update view, determine changes and apply
    • insert and delete : delta tuples
    • self-join : tricky delete (natural join behave as * and union as +)
    • set semantics : problematic for deletions, union has no inverse (hence the bag/ring semantics below)
  • algebra
    • semigroup (S, op) : $op: S\times S\to S$, associative
    • group : semigroup with neutral element e and inverses
    • commutativity
    • ring (S, +, *) : (S, +) commutative group and (S, *) semigroup and distributive law holds
  • multisets (bags) semantics
    • goal : generalization of relation, clean framework for insertions and deletions
    • generalized multiset relations : function $R:T\to A$ s.t. $R(\vec t)\not=0^A$ for at most a finite number of tuples $\vec t$
      • typed tuple $T$ : partial function from vocabulary of columns names to data values
      • commutative ring : $\mathbb{Z,Q,R}$
  • ring of relations : (figures omitted)
    • polynomial ring : variables are relation names and constants are elements
    • delta : (figure omitted)
  • recursive IVM : $\Delta f(x)=f(x+1)-f(x)$; on an increment $x := x+1$, update $f(x) := f(x) + \Delta f(x)$ (worked sketch at the end of this section)
    • $f$ polynomial : $deg(\Delta f(x))=\max(0,deg(f(x))-1)$ and $\Delta^k f=0$ for $k>deg(f)$
    • aggressive recursive incremental view maintenance : maintain $Q,\Delta Q,\Delta^2 Q,\ldots$
      • compile query $Q$ : incrementally maintain materialized view
        • compute $\Delta Q$ for insertion/deletion
        • incremental view maintenance : $Q := Q \pm \Delta Q$
        • recursively compile : $\Delta Q$
      • query language L : closed under delta
  • aggregation calculus AGCA : (definitions in figures omitted)
    • ring elements beyond relation : modelled by GMR, without finite support => avalanche (semi)rings
    • sum : multiplicity-preserving relational projection
  • parameterized GMRs : function that takes an environment to a GMR, does not require finite support, forms a ring
  • compilation : low-level, inline, improve cache-locality, dead code elimination
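
A tiny worked sketch of recursive IVM on a polynomial (my own illustration, not from the notes): for $f(x)=x^2$, $\Delta f(x)=2x+1$ and $\Delta^2 f=2$, so the view and its delta can be maintained with two additions per update instead of recomputing $x^2$:

```scala
object RecursiveIvm {
  var x  = 0L
  var f  = 0L   // materialized view f(x) = x^2
  var d1 = 1L   // first delta  Δf(x)  = 2x + 1
  val d2 = 2L   // second delta Δ²f(x) = 2 (constant, maintenance bottoms out here)

  // On the update x := x + 1, maintain f and Δf incrementally.
  def incrementX(): Unit = { f += d1; d1 += d2; x += 1 }
}
```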