---
typora-copy-images-to: ./img
---

CS449

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Spring 2018: Systems for Data Science

[TOC]

Introduction

  • parallel programming : difficult, needs abstractions and an understanding of systems architecture
  • working with changing data at scale : very difficult, requires consistency, complex protocols, concurrency hard to analyze, strange failures, acceptable performance
  • "Whenever the load on the system increases by an order of magnitude, a completely new design is needed!"
  • systems
    • data wrangling, ETL : OLAP (online analytical processing)
    • data storage, safekeeping, management : relational database management systems RDBMS, non-relational systems
    • online vs real-time : streaming, IVM
    • data analysis : query engines, machine learning
  • database management system DBMS : software designed to store and manage databases with concurrency control and recovery components
    • query optimization and execution
    • relational operators
    • files and access methods
    • buffer management
    • disk space management
  • data model : collection of concepts for describing data
    • schema : description of particular collection of data given data model
    • relational model of data : main concept relation, every relation has schema
  • levels of abstraction
    • many views : how users see data
    • single conceptual (logical) schema : logical structures
    • physical schema : files and indexes used
  • data manipulation language : DML, queries
  • data description language : DDL, schema
  • query processing : high level declarative to efficient code automatically, query plan, cost-based optimizer, memory hierarchy-aware operator
  • data-parallel programming : SIMD, single instruction on multiple data
  • transaction
    • atomic : all or nothing sequence of database actions (reads/writes)
    • consistency : execution of ${T_1,\ldots, T_n}$ equivalent to serial execution, limits (serial, consensus when distributed)
    • integrity constraints
    • recovery from failure : write-ahead, append-only log and smart protocol, alternative (redundancy and replication via k-safety)
  • NoSQL : give up consistency or use eventual consistency (if there are no further changes, an eventually consistent state is reached)
  • database landscape : (figure omitted)

Writing database queries

  • SQL : DSL, limited, non-Turing complete, declarative, fundamental power is compositionality

  • universal quantifiers

  • complex queries : build from smaller parts

  • sometimes inexpressible : cannot match some polynomial-time problems (e.g. do these pairs form a cycle?)

  • first-order logic

    • variables : $x, y, x_1, x_2$
    • constants : $a, b, c, d$
    • relations : $R, S, T$
    • terms : constants and variables
    • formulae : $t_1=t_2$, $R(t_1,\ldots,t_n)$, $\phi\wedge\psi$, $\neg\phi$, $\exists x;\phi$
    • free variable : $free(x=1)=\{x\}$, $free(\exists x;\phi)=free(\phi)\setminus\{x\}$
    • quantifier rank : $qr:formulae\to\mathbb N$, $qr(t_1=t_2)=0$, $qr(\phi\wedge\psi)=max(qr(\phi),qr(\psi))$, $qr(\exists x;\phi)=qr(\phi)+1$
    • structure of signature $(c_1,\ldots, c_k;R_1^{ar(R_1)},\ldots, R_l^{ar(R_l)})$ (schema) : tuple $A=(c_1^A,\ldots, c_k^A;R_1^A,\ldots, R_l^A)$ with $c$ constants and $ar(R)$-ary relation
    • satisfaction relation : $A \vDash \phi[a_1,\ldots, a_n]$ when the free variables are replaced by $a_1,\ldots,a_n$
    • equivalent : $\phi \equiv\psi$ iff they are satisfied by exactly the same structures and assignments
    • neutral elements : monoids $(\mathbb R, +)$, $(\mathbb R,\cdot)$
  • domain independence : a query $Q$ is domain independent iff there exist no database instance $A$ and sets $B, C$ both containing all atomic values that appear in $A$ (the active domain) s.t. $Q_B(A)\not=Q_C(A)$; queries must evaluate to the same result no matter what domain is assumed, undecidable property

    • solution : syntactic restrictions (range-restricted queries)
    • safe-range normal form SRNF : variable substitution (no distinct pair of quantifiers may employ the same variable), remove universal quantifiers (use existentials), remove implications, remove double negations, flatten and/or
    • Codd's theorem : a query is definable in relational algebra iff it is definable in the range-restricted domain calculus
    • relativized rewritings : $\forall y;\phi(x_1,\ldots, x_k,y)$ into $\forall y; D(y)\Rightarrow\phi(x_1,\ldots,x_k,y)$ (example below)
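
A small illustration of the relativized rewriting (my own example, not from the notes), combined with the SRNF step that removes the universal quantifier:

```latex
% D ranges over the active domain
\[ \forall y\,\big(R(x,y) \Rightarrow S(y)\big)
   \;\rightsquigarrow\;
   \neg\,\exists y\,\big(D(y) \wedge R(x,y) \wedge \neg S(y)\big) \]
```

The quantified variable $y$ is now bound by a positive atom, so this part of the query is range-restricted and evaluates identically on any domain containing the active domain.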

Parallel collection programming and Spark

  • idea of parallelization : given polynomially much hardware (unrealistic), computation in better than linear time

  • relational algebra

    • selection : $\sigma$, select subset of rows from relation
    • projection : $\pi$, delete unwanted columns from relation
    • cross-product : $\times$, all pairs of tuples from the two relations
    • set-difference : $-$, tuples in relation 1 but not in relation 2
    • union : $\cup$, tuples in relation 1 or relation 2
    • intersection, join, division, renaming : syntactic sugar
  • complexity

    • $AC_k$ : languages recognizable by logspace-uniform classes of unbounded fan-in circuits of polysize and $O(\log^k(\cdot))$ depth
    • $AC_0$ : constant depth
    • $NC_k\subseteq AC_k$ : $AC_k$ with bounded (w.l.o.g. 2) fan-in gates
    • $NC=\cup_k NC_k$ : Nick's class, highly parallelizable problems
    • theorem : $AC_k\subseteq NC_{k+1}$, replace unbounded fan-in gates by binary trees of binary gates, depth grows by log factor
    • theorem : $NC_1\subseteq LOGSPACE$
    • conjecture : $NC\subset P$
    • hierarchy : $AC_0\subseteq NC_1\subseteq LOGSPACE\subseteq AC_1\subseteq\cdots\subseteq NC_i\subseteq AC_i\subseteq\cdots\subseteq NC = \cup_k NC_k=\cup_k AC_k\subseteq P$
  • first order queries : $AC_0$, for fixed schema and query, given the domain size build a circuit, single $LOGSPACE$ algo for this

    • given enough hardware : queries evaluated in constant time, in practice too large to be materialized
    • $k$ computing nodes : map each input gate to one node, network communication is an issue as queries are unknown beforehand
  • parallelizing SQL

    • SQL : $FO/RA$ + aggregates + bells and whistles
  • functional collection interface

    • list
    • foldRight : sequential
    • map : nicely parallelize
    • flatten
    • relation algebra : simple mapping
    • group by
    • reduce : nice parallel implementations
  • monad : map and flatten

  • object oriented database : motivated by the impedance mismatch (mapping rows to objects), but OO is complex, now many RDBMS are object-relational (relational at core, OO through an outside layer)

  • monad algebra

    • complex values : constructed from sets, tuples, atomic values from single-sorted domain
    • types : terms of the grammar $\tau=Dom\mid\{\tau\}\mid<A_1:\tau_1,\ldots, A_k:\tau_k>$
    • collection types
    • nullary tuple : $<>$
    • monad language $M$
      • identity : $id:x\mapsto x$, $sng \circ flatten = map(sng)\circ flatten=id$
      • composition : $f\circ g:x\mapsto g(f(x))$
      • constants : from $Dom\cup{\emptyset,<>}$
      • singleton set construction : $sng:x\mapsto\{x\}$
      • application of function to every member : $map(f):X\mapsto\{f(x)\mid x\in X\}$
      • unnesting : $flatten:x\mapsto\cup x$, $flatten\circ flatten=map(flatten)\circ flatten$ (associativity)
      • pairing : $pairwith_{A_1}:<A_1:X_1,\ldots,A_n:x_n>\mapsto\{<A_1:x_1,\ldots, A_n:x_n>\mid x_1\in X_1\}$
      • tuple formation : $<A_1:f_1,\ldots, A_n:f_n>:x\mapsto<A_1:f_1(x),\ldots,A_n:f_n(x)>$
      • projection : $\pi_{A_i}:<A_1:x_1,\ldots,A_n:x_n>\mapsto x_i$ (applied to tuples rather to sets of tuples)
      • flatmap : map then flatten
      • cartesian product : $f\times g$ defined as $<1:f,2:g>\circ pairwith_1\circ flatmap(pairwith_2)$
    • boolean : true $\{<>\}$, false $\emptyset$
      • conjunction : $\gamma\wedge\delta$ as $\gamma\times\delta$
      • negation : $\neg\gamma$ as $<1:id,2:\emptyset>\circ =$
    • monad with tensorial strength (tuple) : query language
    • monoid : endofunctors map (same type)
    • positive monad algebra : $M_U$ extends with set union, impractical query language as cannot express value equality
    • equality of atomic values : $<A_1:x_1,\ldots, A_k:x_k>\mapsto\{<>\mid x_i=x_j\}$
      • deep : holds on same type for each element
    • nest : $nest_{C=(B)}(R)$ computes $\{ \langle A : x ,C : \{ \langle B : y \rangle \mid \langle A : x ,B : y \rangle \in R \} \rangle \mid ( \exists y ) \langle A : x ,B : y \rangle \in R \}$
    • full monad algebra : extends $M_U$ either with deep equality, testing set membership $A\in B$ or containment $A\subseteq B$, selection $\sigma_{A=B}$, set difference $-$, set intersection $\cap$ or nesting
      • conservative extension over relational algebra
      • theorem : a mapping from a relational database to a relation is expressible in $M_U[=_{deep}]$ iff it is expressible in relational algebra, all of relational calculus can be expressed in $M_U[=_{deep}]$
      • conservativity : transitive closure not expressible
    • powerset : $pow:\{\tau\}\mapsto\{\{\tau\}\}$, $pow:S\mapsto 2^S$
      • $M_U[=_{deep},pow]$ : extremely expressive, all of $ELEMENTARY$ (the class of problems solvable in time $O(2^{2^{\cdot^{\cdot^{2^n}}}})$) queries
    • xquery : XML nodes encoded as label, child, or list pairs
  • map reduce :

    • hadoop

      // minimal MapReduce skeleton on plain Scala collections
      def hadoopMapReduce[BLOCK, K, V, RES](
            data: List[BLOCK],
            mapper: BLOCK => List[(K, V)],
            reducer: ((K, Iterable[V])) => RES): Iterable[RES] =
        data.flatMap(mapper)                                    // map phase: emit (key, value) pairs
          .groupBy(_._1)                                        // reshuffle: group values by key
          .map { case (k, kvs) => reducer((k, kvs.map(_._2))) } // reduce phase, one call per key
    • reshuffle : naive all-to-all communication between nodes, implemented by flatten and group-by

    • parallelization : map (yes), reduce/aggregation (not in general, yes if the operation is associative)

  • spark : fast and expressive cluster computing system, improve efficiency through general execution graphs and in-memory storage (10x faster on disk, 100x in memory), less code

    • resilient distributed datasets : RDDs, spread across cluster, built through parallel transformations, automatically rebuilt on failure, controllable persistence
    • operations : lazy transformations (map, filter, group by), actions (count, collect, save)
    • sort : locally sort each node's partition, collect the local tops and compute the global top (see the RDD sketch after this list)
  • yarn : yet another resource negotiator

  • bulk-synchronous parallel programming model : BSP, in each superstep, each node's compute function called in parallel, node message received at next superstep
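
A minimal RDD sketch of the lazy-transformation/action split and the local-top/global-top idea above (my own illustration; file path, app name and master are placeholders):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("sketch").setMaster("local[*]"))

val counts = sc.textFile("input.txt")   // lazy: only records the lineage
  .flatMap(_.split("\\s+"))             // transformation
  .map(word => (word, 1))               // transformation
  .reduceByKey(_ + _)                   // shuffle + associative, parallel reduce

// Action: each partition keeps its local top, the driver merges them into the global top.
val top10 = counts.top(10)(Ordering.by[(String, Int), Int](_._2))
top10.foreach(println)
sc.stop()
```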

Parallel join processing and analytics

  • asynchronous : every node progresses individually, communication by message passing
  • synchronous : computation proceeds in distinct aligned epochs across distributed system (assumed in the following)
  • speed-up : more resources, proportionally less time for given amount of data
  • scale-up : if resources are increased in proportion to the increase in data size, time stays constant; get a better machine
  • scale-out : get more machines
  • parallel DBMS : natural
    • pipeline parallelism : many machines each doing one step in multi-step process
    • partition parallelism : many machines doing same thing on different pieces of data
    • shared-what question
      • shared memory : SMP, easy program, expensive to build, difficult to scale up
      • shared disk
      • shared nothing (network) : hard to program, cheap to build, easy to scale out
    • parallelism types
      • intra-operator : all machines working to compute given ops (scan, sort, join)
      • inter-operator : each ops run concurrently on different site (exploit pipelining), bushy trees
      • inter-query : different queries on different sites
    • automatic data partitioning : shared nothing benefits from good partitioning
      • range : good for equijoin, range queries, group-by
      • hash : good for equijoin
      • round robin : good to spread load
    • storing data
      • fragmentation : row-horizontal (disjoint), col-vertical (lossless-join, ids)
      • replication : increased availability, faster query evaluation, update and consistency more challenging
  • parallel hash equijoin : partition both relations by join key, distribute to sites, join at each site
    • good hash : distribute work evenly, caution on key skew
    • data placement : optimize partitioning scheme for workload
  • bloom filter : bit-vector of size $b$, $k$ different hash functions (see the sketch after this list)
    • principle : if some value hashes to $i$ for some hash function, set bit $i$ to 1
    • test value : true if the bits given by all hash functions are 1, no false negatives
    • optimal $k=(b/n)\ln 2$ where $n=|\text{distinct key values}|$
  • query optimization : reordering joins, push selections and projections down the tree, push aggregations (usually down), keep intermediate result as small as possible
  • data skew : reducer processing becomes the bottleneck
    • reducer-centric cost model : processing time at a reducer approximately monotonic in its input and output size
    • problem classification
      • input-size dominated : minimize max-reducer input
      • output-size dominated : minimize max-reducer output
      • input-output balanced : minimize combination of both
    • theta-join model : cover each true-valued cell of the join matrix by exactly one reducer
    • 1-bucket theta : distribute very evenly over regions, wins for output-size dominated joins
      • map and reduce phases : (figures omitted)
      • randomization : avoid pre-processing step, effectively removes output skew, input size very close to target (probability of receiving 10% over target close to 0)
    • how do we know which matrix cells have value true ?
    • cartesian product : entire matrix need to be covered
      • use square-shaped regions : reducer that covers $c$ will receive at least $2\sqrt{c}$ input tuples
      • lower bound : $|S||T|/r$, max-reducer input and max-reducer output within factors 2 and 4 of the lower bound
    • join-specific : unlikely to improve max-reducer output, has to beat 1-bucket-theta on input cost
    • approximate join matrix : (figure omitted)
    • M-bucket-I (O) algorithms : use histograms to minimize max-reducer input (output), identify candidate cells, try to cover all with $r$ regions, need a simple enough join predicate
    • memory-awareness : input for region might exceed reducer memory, solve by create more regions
      • 1-bucket theta : use square of side-length mem/2
      • M-bucket-I/O : instead of binary search on max-reducer I/O set it immediately to mem
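
A minimal Bloom-filter sketch matching the description above (my own illustration; the $k$ hash functions are derived from MurmurHash3 with different seeds):

```scala
import scala.util.hashing.MurmurHash3

// Bit-vector of size b with k hash functions; false positives possible, never false negatives.
final class BloomFilter(b: Int, k: Int) {
  private val bits = new java.util.BitSet(b)

  private def positions(x: String): Seq[Int] =
    (0 until k).map(seed => (MurmurHash3.stringHash(x, seed) & Int.MaxValue) % b)

  def add(x: String): Unit = positions(x).foreach(i => bits.set(i))
  def mightContain(x: String): Boolean = positions(x).forall(i => bits.get(i))
}

val bf = new BloomFilter(1 << 16, 5)
bf.add("epfl")
bf.mightContain("epfl")   // true
bf.mightContain("ethz")   // false with high probability
```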

Consistency and concurrency

  • consistency : receive the current/latest value no matter which node is contacted; if there is no update, the same value is retrieved from every node (requires atomic writes)
  • transaction : atomic sequence of reads and writes executing as if by itself in isolation, either commits or aborts
    • consistent state afterwards : if consistent beforehand
    • scheduling
      • serial : does not interleave actions of different transactions
      • equivalent : for any database state, the effect of executing the first schedule is identical to the effect of executing the second schedule
      • serializable : same as some serial execution of transactions
      • conflict equivalent : involve same actions of same transactions, every pair of conflicting action ordered same way
      • conflict serializable : if conflict equivalent to some serial schedule (iff acyclic dependency graph)
    • dependency graph : one node per transaction, edge from $T_i$ to $T_j$ if $T_j$ reads/writes object last written by $T_i$
    • conflicts
      • reading uncommitted data : WR, dirty reads
      • unrepeatable reads : RW
      • overwriting uncommitted data : WW
    • aborting : $T_i$ aborted, all its actions undone, if $T_j$ reads object last written by $T_i$, $T_j$ must also be aborted
      • maintains logs in which every write recorded : used to undo and recover from crashes
  • acid properties
    • atomicity
    • consistency
    • isolation
    • durability
  • lock-based controls
    • two-phase locking : 2PL, each transaction must obtain a S shared lock before reading and an X exclusive lock before writing
      • transaction cannot request more locks once any lock released
      • X hold : no transaction can get X or S on that object
    • strict two-phase locking : all locks released only once transaction completes, enforce precedence graph to be acyclic thus serializable, simplify aborts
    • deadlocks : cycle of transactions waiting for lock to be released by each other
      • detection : build waits-for dependency graph
      • prevention : timestamp-based priorities make deadlock impossible
    • relaxing the assumption that the DB is a fixed collection of objects : even strict 2PL will not ensure serializability (phantoms), no records of a locked kind should be added while that kind is locked
      • index locking : lock index, if no index lock all pages or file/table to prevent new pages from being added, special case of predicate locking
      • predicate locking : grant lock on all records satisfying some logical predicate, lots of overhead
  • optimistic concurrency control : if conflicts are rare, gain concurrency by not locking and instead checking for conflicts before a transaction commits
    • Kung-Robinson model
      • read : from database, make change to private copies of objects
      • validate : check for conflict, each transaction assigned at end of read phase a numeric id (timestamp)
        • readSet($T_i$)
        • writeSet($T_i$)
        • test 1 : $T_i<T_j$ need $T_i$ completed before $T_j$ begins
        • test 2 : $T_i<T_j$ need $T_i$ completes before $T_j$ begins write, writeSet($T_i$) intersect readSet($T_j$) empty
        • test 3 : $T_i<T_j$ need $T_i$ completed read phase before $T_j$ does, writeSet($T_i$) intersect readSet$(T_j)$ empty and writeSet($T_i$) intersect writeSet($T_j$) empty
        • failure : $T_j$ restarted once $T_i$ finishes
      • write : make local copies of changes public
      • critical section : id assignment, validation and write phase, no concurrency; read-only transactions do not need it
      • overheads : record readSet and writeSet per transaction, must make validated write global, might require clean-up
  • multiversion concurrency control : let writer make new copy while readers use old copy, each transaction has reading timestamp $RTS$, each version of object has its writer timestamp $WTS$
    • version : chained backward, discard too old
    • transaction classified : writer if may write, otherwise reader
    • reader : for each object to be read, find newest version $WTS<TS(T)$, never restarted but might wait until appropriate version commits
    • writer : reads as a reader does; for a write, find the newest version $V$ s.t. $WTS(V)< TS(T)$; if not $RTS(V)< TS(T)$ reject, else $T$ makes a copy $CV$ of $V$ with a pointer to $V$, with $WTS(CV)=TS(T)$, $RTS(CV)=TS(T)$; $CV$ is buffered until $T$ commits, other transactions can see the $TS$ values but cannot read $CV$
  • snapshot isolation : each transaction has access mode, diagnostics size and isolation level, implemented in real DBMS
    • level serializable does not guarantee serializability
    • transaction work on copy at starting time, commit only if no update conflict since snapshot
    • can be implemented by multiversion concurrency control
  • distributed transaction management
    • assumption : multiple database servers talk to each other, no replication, transaction access objects on multiple servers
    • global 2PL : local strict 2PL at each site
    • distributed deadlock detection : each site maintains local waits-for graph, global deadlock might exist if no cycle in local graph
      • centralized : send all local graphs to one site
      • hierarchical : send local graphs to parent in hierarchy
      • timeout : abort transaction if waits too long
    • two-phase commit : two rounds, voting and termination; any site can decide to abort, every message is first locally logged (to survive failure), all log records carry the transaction id and coordinator id, the commit record includes the ids of all subordinates
      • failure at a site : if commit or abort log, must redo/undo
        • if coordinator : resend ack requests, subordinates that voted yes cannot decide on their own and block the transaction (unless one said no), if it fails after prepare it must abort
        • if subordinate and prepare logged : contact coordinator to find status, commit/abort, redo/undo and write log records
        • if subordinate without prepare : abort and undo
      • link failure : does not respond during commit
        • if coordinator : abort
        • if subordinate and has not yet voted : abort
        • if subordinate and voted yes : blocked until coordinator respond
      • subtransaction without update : answer reader instead of yes/no to prepare
  • impossibility to scale out transaction processing
    • consensus : at least a round-trip from master to each node
    • distance between nodes increases as the number of nodes increases, transaction throughput is upper-bounded by 1/latency
    • CAP theorem : cannot have consistency, availability and partition tolerance all at once; choose consistency (ACID DBMS) or availability (NoSQL systems), as network partitions do happen
      • consistency : possible answers to read request, newest value or error
      • availability : possible answers to read request, a value, not necessarily the newest
  • weak consistency : eventual consistency, no transactions, atomicity only to support consistency in the presence of replicas
    • consistent state : converged
    • server-side viewpoint : N = #nodes, W = #replicas that need ack receipt of update before update complete, R = #replicas contacted when read operation
      • W+R>N : strong consistency (quorum consensus method) (N=3, R=2, W=2 good performance/availability tradeoff)
      • W+R<=N : weak/eventual consistency (R=1 reasonable)
  • causal consistency : if memory operations are causally related, they are seen by every node in the same order, enforced by vector clocks (see the sketch after this list)
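
A minimal vector-clock sketch for the causal-consistency point above (my own illustration):

```scala
object VectorClocks {
  // One logical counter per node; clocks are attached to messages and merged on receipt.
  type VClock = Map[String, Long]

  def tick(vc: VClock, node: String): VClock =
    vc.updated(node, vc.getOrElse(node, 0L) + 1)

  // Element-wise maximum, applied when a node receives a message carrying a clock.
  def merge(a: VClock, b: VClock): VClock =
    (a.keySet ++ b.keySet).map(k => k -> math.max(a.getOrElse(k, 0L), b.getOrElse(k, 0L))).toMap

  // a happened-before b iff a <= b component-wise and a != b; otherwise the events are concurrent.
  def happenedBefore(a: VClock, b: VClock): Boolean =
    a != b && (a.keySet ++ b.keySet).forall(k => a.getOrElse(k, 0L) <= b.getOrElse(k, 0L))
}
```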

How to parallelize

  • optimization criterion : running time performance
    • detailed goal : minimize expected cost over possible algos
    • need to know : distribution of possible inputs i, workloads, characteristics
    • reachability in graph :
      • scenario : undirected graph, single-core compute, main memory O(#vertices), huge graph on disk
      • workload : queries (a reachable from b), inserts (new edge), 100 queries for every insert, no delete
      • best : store connected components, union/find in memory, near-constant time per operation (inverse Ackermann); see the sketch after this list
    • parallel pagerank
      • scenario : stochastic n x n matrix M, find the principal eigenvector p = M * p, given mild ergodicity assumptions
      • option 1 : materialize all p on every machine, broadcast k * n
      • option 2 : server maintains p, worker requests/writes 2n components
  • heuristics
    • tricks : partitioning, replication, communication (eager, synchronous)
    • read-only vs read/write
    • big/small data
    • compute in place vs move data (reshuffle?)
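
A sketch of the union/find approach for the reachability workload above (standard union-find with path compression and union by rank; my own illustration):

```scala
final class UnionFind(n: Int) {
  private val parent = Array.tabulate(n)(identity)
  private val rank = Array.fill(n)(0)

  def find(x: Int): Int = {
    if (parent(x) != x) parent(x) = find(parent(x))   // path compression
    parent(x)
  }

  // insert: a new edge (a, b) merges the two connected components
  def union(a: Int, b: Int): Unit = {
    val (ra, rb) = (find(a), find(b))
    if (ra != rb) {
      if (rank(ra) < rank(rb)) parent(ra) = rb
      else if (rank(ra) > rank(rb)) parent(rb) = ra
      else { parent(rb) = ra; rank(ra) += 1 }
    }
  }

  // query: a reachable from b iff both have the same representative
  def reachable(a: Int, b: Int): Boolean = find(a) == find(b)
}
```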

Locality

  • locality : relates software systems to the physical world, any performance concern can be reduced to a locality issue; memory and computing power are limited (2/3 dimensions, transistors cannot get arbitrarily smaller)

  • in computing : all time/energy cost due to moving data over distance, networks are bottlenecks

  • principles

    • caching : used frequently, prefetching (usually assumes sequential access)
      • examples : CPU cache, MMU TLB (memory management unit, translation lookaside buffer), DBMS materialized views
    • prefer sequential over random access : hard drives, flash (only large blocks can be written)
    • partitioning : decomposability and embarrassing parallelism
  • data structures

    • arrays : storage layout should match looping order, row vs columns stores
    • trees : keep siblings local, leaf level dominates in size, aligned with sort order, good for range lookups, secondary indexes do not align
    • graphs
      • local representation graphs : chains, trees
      • small cuts graphs : resource-constrained graphs, road networks, brain, planar, k degrees of separation
      • random graphs : giant connected component, except under high sparseness
      • real-worlds graphs : locality nightmare, internet communication, social networks
      • no small cut : no partition of the nodes into two equally sized sets s.t. the number of crossing edges is less than linear, impossible to parallelize
      • small world phenomenon : short paths between any two nodes
  • external sorting

    • 2-way sort : 3 buffer pages; with $B$ buffers in general, number of passes $1+\left\lceil \log_{B-1}\lceil N/B\rceil\right\rceil$, cost $2N\times$ passes; in practice sorted in 2-3 passes, use a block of pages for each buffer

    • double buffering : prefetch shadow block for each buffer

  • joins

    • simple nested loops join : scan the inner relation once per outer tuple
    • block nested loops join : cost = scan of outer + #outer blocks $\times$ scan of inner, with #outer blocks = pages of outer / blocksize (cost sketch after this list)
    • hash-join : #partitions $k < B-1$, $B-2 >$ size of largest partition to be held in memory, $B$ must be $>\sqrt{M}$ assuming uniformly sized partitions; partitioning phase $2(M+N)$, matching phase $M+N$; superior to sort-merge join if relation sizes differ, can be parallelized, sensitive to data skew
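
A small sketch turning the cost formulas above into code (my own illustration; N, M are page counts, B the number of buffer pages):

```scala
import math.{ceil, log}

// passes of external merge sort with B buffers: 1 + ceil(log_{B-1}(ceil(N / B)))
def sortPasses(n: Long, b: Long): Long =
  1 + ceil(log(ceil(n.toDouble / b)) / log((b - 1).toDouble)).toLong

def sortCost(n: Long, b: Long): Long = 2 * n * sortPasses(n, b)   // each pass reads and writes N pages

// block nested loops join: scan of outer + (#outer blocks) * scan of inner
def bnlJoinCost(outer: Long, inner: Long, b: Long): Long = {
  val outerBlocks = ceil(outer.toDouble / (b - 2)).toLong         // B - 2 buffer pages per outer block
  outer + outerBlocks * inner
}
```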

Query engines

  • precomputation : indexing

  • query processing : algebraic plans

  • query optimization : heuristic, cost-based

  • memory hierarchies : cache, main memory, disk caches, disk, tapes

    • hard disk : sequential vs random access, latency vs bandwidth, capacity, block size, sectors, tracks, surfaces
      • seek time : 10ms, move head and read/write
      • transfer time per page : 0.1ms
      • cost : #page I/O, #seek, dominates CPU cost
    • flash : read/write/erase times differ, finitely many operations (mean time to failure MTTF), only erase large blocks
    • pages based
  • join : two collections sequentially stored, find pairs that match condition

    • block I/O : block nested loops BNL
    • exploiting sorted data : index nested loops, sort-merge
    • bucketization : hash join
  • relational model : (figure omitted)

  • select-project-cartesian queries : equivalent to FO, conjunctive queries, SQL (without union, except, aggs, negation or disjunction)

    • selection : $\sigma_\phi(R)$
    • projection : $\pi_A(R)$
    • cross-product : $R\times S$
    • column renaming : $\rho_{A_1,\ldots,A_{ar(R)}}(R)$
    • theta-join : $R \bowtie _ { \theta } S = \sigma _ { \theta } ( R \times S )$
  • optimization : using index, clustering, materialized views

    • algebraic rewriting : pushing selection and project down the algebra tree, join reordering
    • cost-based : using cost function for operations and statistics about data
      • selectivity : $\operatorname{sel}_{\phi}[R] \approx \frac{|\sigma_{\phi}(R)|}{|R|}$ estimates the size reduction, result size $|\sigma_{\phi}(R)| \approx \operatorname{sel}_{\phi}[R] \cdot |R|$
      • estimate : stats, a $1:n$ relationship gives $|R \bowtie_{\theta} S| = |S|$ and $\operatorname{sel}(R \bowtie_{\theta} S) = 1/|R|$, uniform categorical attribute with $k$ values $\operatorname{sel}_{A=c}[R] \approx 1/k$, equality selection on a key $\operatorname{sel}_{K=c}(R) = 1/|R|$
    • different operator implementation
  • query plans : relation algebra trees with annotations

    • each operator : choice of implementation

    • NL join : which outer (left child is outer by convention)

    • how buffer pages assigned

    • choose IdxNL join over BNL if few tuples in outer and index clustered

    • left-deep plans good as they admit pipelining

    • pipelining : selection, projection, BNL join and index NL join, versus materialization

  • System R algorithm : dynamic programming (bottom-up), $i$th phase builds $i$-relation plans from $(i-1)$-relation plans, considers only left-deep plans (sketch below)
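
A minimal sketch of the bottom-up, left-deep enumeration (my own illustration, assuming a hypothetical cost(partialPlan, nextRelation) estimator and ignoring interesting orders and physical operator choices):

```scala
def bestLeftDeepPlan(relations: Set[String],
                     cost: (List[String], String) => Double): List[String] = {
  // best(S) = cheapest left-deep plan joining exactly the relations in S
  var best = Map.empty[Set[String], (List[String], Double)]
  for (r <- relations) best += Set(r) -> ((List(r), 0.0))
  for (size <- 2 to relations.size; s <- relations.subsets(size)) {
    val candidates = for {
      r <- s.toList                    // r joined last, as the inner of the topmost join
      (plan, c) <- best.get(s - r)     // best plan for the remaining relations
    } yield (plan :+ r, c + cost(plan, r))
    best += s -> candidates.minBy(_._2)
  }
  best(relations)._1
}
```

With relations {R, S, T} this explores only orders such as ((R ⋈ S) ⋈ T), never bushy plans.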

Physical database design and tuning

  • understand workload
    • most important queries and their frequency
    • most important updates and their frequency
    • desired performance for these queries
  • decisions
    • what index to create ?
    • kind of index ? clustered or hash/tree
    • change to schema ? denormalized, horizontal partitioning, replication, views
  • index selection
    • hash index on inner good for nested loops
    • clustered B+ tree on join columns good for sort-merge
    • clustering useful whenever many tuples are retrieved
  • first normal form : need all
    • eliminate repeating groups in individual tables
    • create separate table for each set of related data
    • identify each set of related data with primary key
  • second normal form : in addition to first
    • no non-prime attribute is dependent on any proper subset of any candidate key of the relation
  • third normal form : 3NF iff at least one of the following holds for each functional dependency $X\to A$, equivalent to Codd's definition
    • $X$ contains $A$
    • $X$ is superkey
    • each column in $A-X$ contained in some candidate key
  • Boyce-Codd normal form BCNF : eliminate 3NF third alternative
  • schema evolution : define views to mask these changes
  • 3NF vs BCNF (example below)
    • lossless decomposition
    • dependency-preserving
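
A classic textbook illustration of the trade-off (my own example, not from the notes):

```latex
% R(Student, Course, Teacher) with functional dependencies
\[ \{Student,\,Course\} \to Teacher, \qquad Teacher \to Course \]
% Teacher -> Course violates BCNF (Teacher is not a superkey), yet R is in 3NF,
% since Course is contained in the candidate key {Student, Course}.
% The BCNF decomposition (Teacher, Course), (Student, Teacher) is lossless-join
% but not dependency-preserving: {Student, Course} -> Teacher is lost.
```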

Data warehouse and decision support

  • OLAP : online analytics processing

  • OLTP : online transaction processing, short update xacts (whereas OLAP/decision support runs long queries)

  • data warehousing : consolidate data from many sources in one large repository, synchronization of replicas, semantic integration

    • semantic integration : multiple sources, eliminate mismatches
    • heterogeneous sources : variety of formats and repositories
    • load, refresh, purge : periodically, freshness
    • metadata management : keep track of source, loading time
  • multidimensional data model : collection of numeric measures which depend on set of dimensions

    • MOLAP : stored as array
    • ROLAP : stored as relation, fact table (relate dimension to measure), dimension table, columns store
  • queries

    • roll-up : aggregating at different levels of dimension hierarchy
    • drill-down : inverse of roll-up
    • pivoting : aggregation on selected dimension
    • slicing and dicing
  • CUBE operator : $k$ dimensions, $2^k$ possible SQL group-bys can be generated through pivoting on subsets of the dimensions

  • star schema : fact table containing ids referring to detail tables (date, product, location), the full join is called a star join

  • bitmap index : one-hot encoding and bit-vector operations (see the sketch after this list)

  • join index : between multiple tables, rapidly grows

  • column store : column by column, order and duplicates matter, better compression, best for analytics (row store best for xacts)
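
A tiny bitmap-index sketch for the bullet above (my own illustration of a 6-row table with hypothetical columns color and size):

```scala
import scala.collection.immutable.BitSet

// One bit-vector per distinct column value; bit i is set iff row i has that value.
val colorIndex: Map[String, BitSet] = Map(
  "red"   -> BitSet(0, 3),
  "green" -> BitSet(1, 4, 5),
  "blue"  -> BitSet(2))
val sizeIndex: Map[String, BitSet] = Map(
  "small" -> BitSet(0, 1, 2),
  "large" -> BitSet(3, 4, 5))

// WHERE color = 'green' AND size = 'large'  =>  bitwise AND of the two vectors
val matching = colorIndex("green") & sizeIndex("large")   // BitSet(4, 5)
```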

Incremental view maintenance

  • materialized views : recomputation
  • incremental view maintenance : perform work only needed to update view, determine changes and apply
    • insert and delete : delta tuples
    • self-join : tricky delete (natural join behave as * and union as +)
    • set semantics : problematic for deletions, union has no inverse (hence the bag/ring semantics below)
  • algebra
    • semigroup (S, op) : $op: S\times S\to S$, associative
    • group : semigroup with neutral element e and inverses
    • commutativity
    • ring (S, +, *) : (S, +) commutative group and (S, *) semigroup and distributive law holds
  • multisets (bags) semantics
    • goal : generalization of relation, clean framework for insertions and deletions
    • generalized multiset relations : function $R:T\to A$ s.t. $R(\vec t)\not=0^A$ for at most a finite number of tuples $\vec t$
      • typed tuple $T$ : partial function from vocabulary of columns names to data values
      • commutative ring : $\mathbb{Z,Q,R}$
  • ring of relations : (figures omitted)
    • polynomial ring : variables are relation names and constants are elements
    • delta : (figure omitted)
  • recursive IVM : $\Delta f(x)=f(x+1)-f(x)$; on an increment $x := x+1$, update $f(x) := f(x) + \Delta f(x)$ (worked sketch at the end of this section)
    • $f$ polynomial : $deg(\Delta f(x))=\max(0,deg(f(x))-1)$ and $\Delta^k f=0$ for $k>deg(f)$
    • aggressive recursive incremental view maintenance : maintain $Q,\Delta Q,\Delta^2 Q,\ldots$
      • compile query $Q$ : incrementally maintain materialized view
        • compute $\Delta Q$ for insertion/deletion
        • incremental view maintenance : $Q := Q \pm \Delta Q$
        • recursively compile : $\Delta Q$
      • query language L : closed under delta
  • aggregation calculus AGCA : (definitions in figures omitted)
    • ring elements beyond relation : modelled by GMR, without finite support => avalanche (semi)rings
    • sum : multiplicity-preserving relational projection
  • parameterized GMRs : function that takes an environment to a GMR, does not require finite support, forms a ring
  • compilation : low-level, inline, improve cache-locality, dead code elimination
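
A tiny worked sketch of recursive IVM on a polynomial (my own illustration, not from the notes): for $f(x)=x^2$, $\Delta f(x)=2x+1$ and $\Delta^2 f=2$, so the view and its delta can be maintained with two additions per update instead of recomputing $x^2$:

```scala
object RecursiveIvm {
  var x  = 0L
  var f  = 0L   // materialized view f(x) = x^2
  var d1 = 1L   // first delta  Δf(x)  = 2x + 1
  val d2 = 2L   // second delta Δ²f(x) = 2 (constant, maintenance bottoms out here)

  // On the update x := x + 1, maintain f and Δf incrementally.
  def incrementX(): Unit = { f += d1; d1 += d2; x += 1 }
}
```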