Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
An RDD is a read-only, partitioned collection of records that can only be created through deterministic operations on data in stable storage or on other RDDs.
RDDs are best suited for batch analytics that apply the same operation to all elements of a dataset.
transformations
lazy operations that define a new RDD without computing it (sketch after this list)
- map
- filter
- flatMap
- sample
- groupByKey
- reduceByKey
- union
- join
- cogroup
- crossProduct
- mapValues
- sort
- partitionBy
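A minimal sketch of the transformation side, assuming a local Spark master and a hypothetical input file `events.log`; every line below is lazy and only builds lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-notes").setMaster("local[*]"))

val lines  = sc.textFile("events.log")            // hypothetical input file
val errors = lines.filter(_.startsWith("ERROR"))  // lazy: defines a new RDD, runs nothing
val codes  = errors.map(_.split("\t")(1))         // lazy: still no job launched
```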
actions
launch a computation to return a value to the program or write data to external storage (sketch after this list)
- count
- collect
- save
- reduce
- lookup
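Continuing the sketch above, only an action launches a job; `saveAsTextFile` is the released API's counterpart of the paper's `save`, and the output path here is a placeholder:

```scala
val n   = codes.count()              // runs the filter/map pipeline, returns a Long to the driver
val all = codes.collect()            // ships the results to the driver as a local Array[String]
codes.saveAsTextFile("codes.out")    // writes the dataset to external storage
```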
control
- persistence
- partitioning
Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.
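A sketch of the persistence control, reusing the `errors` RDD from above; in the released Spark API the spill-to-disk behavior the note mentions corresponds to an explicit storage level:

```scala
import org.apache.spark.storage.StorageLevel

errors.persist(StorageLevel.MEMORY_AND_DISK)  // keep in RAM, spill partitions to disk if RAM runs out
errors.count()                                // first action computes and caches the partitions
errors.count()                                // later actions reuse the cached partitions
```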
interface
- partitions()
- preferredLocations(p)
- dependencies()
- iterator(p, parentIters)
- partitioner()
in other words, each RDD is represented by (trait sketch after this list):
- a set of partitions
- a set of dependencies on parent RDDs
- narrow: each parent partition is used by at most one child partition (e.g., map)
- wide: each parent partition may be used by many child partitions (e.g., groupByKey)
- a function for computing the dataset based on its parents
- metadata about its partitioning scheme and data placement
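A sketch of this common interface as a Scala trait; the member names follow the list above, while the surrounding types are illustrative stubs rather than Spark's actual classes:

```scala
trait Partition    // an atomic piece of the dataset
trait Dependency   // narrow or wide link to a parent RDD
trait Partitioner  // e.g., hash- or range-partitioning scheme

trait RDD[T] {
  def partitions(): Seq[Partition]                    // the set of partitions
  def preferredLocations(p: Partition): Seq[String]   // nodes where p is faster to access
  def dependencies(): Seq[Dependency]                 // links to parent RDDs
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p from parent iterators
  def partitioner(): Option[Partitioner]              // partitioning metadata, if any
}
```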
Comparison with Distributed Shared Memory (DSM)
DSM
- fine-grained reads and writes; fault tolerance requires checkpoints and rollback
RDD
- efficient fault tolerance: lost partitions are recomputed from lineage instead of replicating or logging data
- immutable nature lets a system mitigate slow nodes by running backup copies of slow tasks
- bulk operations can be scheduled based on data locality
- degrade gracefully when memory is scarce: partitions that do not fit in RAM spill to disk
Implementation
Job scheduling
- when an action runs, build a DAG of stages from the RDD's lineage graph; stage boundaries fall at wide dependencies
- assign tasks to nodes based on data locality using delay scheduling (sketch below)
- for wide dependencies, materialize intermediate records on the nodes holding parent partitions to simplify fault recovery
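A hedged sketch of the delay-scheduling idea, not Spark's actual scheduler code (the names and the wait threshold are illustrative): prefer a task whose preferred locations include the free node, and fall back to any task only after a bounded wait:

```scala
final case class Task(id: Int, preferredNodes: Set[String])

// When `node` frees up: pick a data-local task if one exists; only after
// waiting `maxDelayMs` give up locality and take any pending task.
def pickTask(node: String, pending: Seq[Task],
             waitedMs: Long, maxDelayMs: Long = 3000L): Option[Task] =
  pending.find(_.preferredNodes.contains(node))
    .orElse(if (waitedMs >= maxDelayMs) pending.headOption else None)
```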
Interpreter integration
- class shipping: the interpreter serves the classes it generates for each typed line over HTTP so workers can load them
- modified code generation: each line's code object references the singleton objects of previous lines directly
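Roughly why both changes are needed: the Scala REPL compiles each typed line into its own class, and closures shipped to workers reference those per-line singletons. A schematic session, assuming a Spark REPL with the `errors` RDD from above (the Line1 name illustrates the compilation scheme, not literal compiler output):

```scala
scala> val threshold = 10
// compiled to (roughly) class Line1 { val threshold = 10 }

scala> errors.filter(_.length > threshold).count()
// the closure references Line1's singleton, so workers must fetch the Line1
// class over HTTP (class shipping) and resolve the same instance the driver
// used (modified code generation)
```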
Memory management
- LRU eviction at the level of RDDs: when a new partition does not fit, evict a partition from the least recently accessed RDD, unless it is the same RDD as the new partition's
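An illustrative sketch of that eviction policy (keys, types, and the class itself are hypothetical, not Spark's cache manager):

```scala
import scala.collection.mutable

// Keys are "rddId:partitionId"; byte arrays stand in for cached partition data.
class PartitionCache(capacity: Int) {
  private val entries = mutable.LinkedHashMap.empty[String, Array[Byte]]

  def get(key: String): Option[Array[Byte]] =
    entries.remove(key).map { v => entries.put(key, v); v }  // re-insert at MRU end

  def put(key: String, value: Array[Byte], computingRddId: String): Unit = {
    while (entries.size >= capacity) {
      // Evict the least recently used partition, preferring one from a
      // different RDD so the RDD being computed keeps its own partitions.
      val victim = entries.keys
        .find(k => !k.startsWith(computingRddId + ":"))
        .getOrElse(entries.head._1)
      entries.remove(victim)
    }
    entries.put(key, value)
  }
}
```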
Support for checkpointing
- users request checkpointing by passing a REPLICATE flag to persist(); useful when lineage chains grow long (e.g., iterative jobs like PageRank), since recovery by recomputation would be expensive
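A sketch using the released Spark API's closest equivalents, reusing `sc` and `codes` from the sketches above (the checkpoint directory path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path on stable storage
val ranks = codes.map(c => (c, 1.0))           // stand-in for a long-lineage pair RDD
ranks.persist(StorageLevel.MEMORY_ONLY_2)      // "_2" levels replicate across two nodes
ranks.checkpoint()                             // truncates the lineage chain
ranks.count()                                  // first job materializes the checkpoint
```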