Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
An RDD is a read-only, partitioned collection of records that can only be created through deterministic operations on data in stable storage or on other RDDs.
RDDs are best suited for batch analytics that apply the same operation to all elements of a dataset.
transformations
lazy operations that define a new RDD without computing it (sketch after this list)
- map
- filter
- flatMap
- sample
- groupByKey
- reduceByKey
- union
- join
- cogroup
- crossProduct
- mapValues
- sort
- partitionBy
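A minimal sketch of the transformation side, assuming a local Spark master and a hypothetical input file `events.log`; every line below is lazy and only builds lineage:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-notes").setMaster("local[*]"))

val lines  = sc.textFile("events.log")            // hypothetical input file
val errors = lines.filter(_.startsWith("ERROR"))  // lazy: defines a new RDD, runs nothing
val codes  = errors.map(_.split("\t")(1))         // lazy: still no job launched
```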
actions
launch a computation to return a value to the program or write data to external storage (sketch after this list)
- count
- collect
- save
- reduce
- lookup
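Continuing the sketch above, only an action launches a job; `saveAsTextFile` is the released API's counterpart of the paper's `save`, and the output path here is a placeholder:

```scala
val n   = codes.count()              // runs the filter/map pipeline, returns a Long to the driver
val all = codes.collect()            // ships the results to the driver as a local Array[String]
codes.saveAsTextFile("codes.out")    // writes the dataset to external storage
```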
control
- persistence
- partitioning
Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.
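A sketch of the persistence control, reusing the `errors` RDD from above; in the released Spark API the spill-to-disk behavior the note mentions corresponds to an explicit storage level:

```scala
import org.apache.spark.storage.StorageLevel

errors.persist(StorageLevel.MEMORY_AND_DISK)  // keep in RAM, spill partitions to disk if RAM runs out
errors.count()                                // first action computes and caches the partitions
errors.count()                                // later actions reuse the cached partitions
```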
interface
- partitions()
- preferredLocations(p)
- dependencies()
- iterator(p, parentIters)
- partitioner()
in other words, each RDD is represented by (trait sketch after this list):
- a set of partitions
- a set of dependencies on parent RDDs
- narrow: each parent partition is used by at most one child partition (e.g., map)
- wide: each parent partition may be used by many child partitions (e.g., groupByKey)
- a function for computing the dataset based on its parents
- metadata about its partitioning scheme and data placement
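A sketch of this common interface as a Scala trait; the member names follow the list above, while the surrounding types are illustrative stubs rather than Spark's actual classes:

```scala
trait Partition    // an atomic piece of the dataset
trait Dependency   // narrow or wide link to a parent RDD
trait Partitioner  // e.g., hash- or range-partitioning scheme

trait RDD[T] {
  def partitions(): Seq[Partition]                    // the set of partitions
  def preferredLocations(p: Partition): Seq[String]   // nodes where p is faster to access
  def dependencies(): Seq[Dependency]                 // links to parent RDDs
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T]  // compute p from parent iterators
  def partitioner(): Option[Partitioner]              // partitioning metadata, if any
}
```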
Comparison with Distributed Shared Memory (DSM)
DSM
- fine-grained reads and writes; fault tolerance requires checkpoints and rollback
RDD
- efficient fault tolerance: lost partitions are recomputed from lineage instead of replicating or logging data
- immutable nature lets a system mitigate slow nodes by running backup copies of slow tasks
- bulk operations can be scheduled based on data locality
- degrade gracefully when memory is scarce: partitions that do not fit in RAM spill to disk
Implementation
Job scheduling
- when an action runs, build a DAG of stages from the RDD's lineage graph; stage boundaries fall at wide dependencies
- assign tasks to nodes based on data locality using delay scheduling (sketch below)
- for wide dependencies, materialize intermediate records on the nodes holding parent partitions to simplify fault recovery
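A hedged sketch of the delay-scheduling idea, not Spark's actual scheduler code (the names and the wait threshold are illustrative): prefer a task whose preferred locations include the free node, and fall back to any task only after a bounded wait:

```scala
final case class Task(id: Int, preferredNodes: Set[String])

// When `node` frees up: pick a data-local task if one exists; only after
// waiting `maxDelayMs` give up locality and take any pending task.
def pickTask(node: String, pending: Seq[Task],
             waitedMs: Long, maxDelayMs: Long = 3000L): Option[Task] =
  pending.find(_.preferredNodes.contains(node))
    .orElse(if (waitedMs >= maxDelayMs) pending.headOption else None)
```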
Interpreter integration
- class shipping: the interpreter serves the classes it generates for each typed line over HTTP so workers can load them
- modified code generation: each line's code object references the singleton objects of previous lines directly
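Roughly why both changes are needed: the Scala REPL compiles each typed line into its own class, and closures shipped to workers reference those per-line singletons. A schematic session, assuming a Spark REPL with the `errors` RDD from above (the Line1 name illustrates the compilation scheme, not literal compiler output):

```scala
scala> val threshold = 10
// compiled to (roughly) class Line1 { val threshold = 10 }

scala> errors.filter(_.length > threshold).count()
// the closure references Line1's singleton, so workers must fetch the Line1
// class over HTTP (class shipping) and resolve the same instance the driver
// used (modified code generation)
```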
Memory management
- LRU eviction at the level of RDDs: when a new partition does not fit, evict a partition from the least recently accessed RDD, unless it is the same RDD as the new partition's
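An illustrative sketch of that eviction policy (keys, types, and the class itself are hypothetical, not Spark's cache manager):

```scala
import scala.collection.mutable

// Keys are "rddId:partitionId"; byte arrays stand in for cached partition data.
class PartitionCache(capacity: Int) {
  private val entries = mutable.LinkedHashMap.empty[String, Array[Byte]]

  def get(key: String): Option[Array[Byte]] =
    entries.remove(key).map { v => entries.put(key, v); v }  // re-insert at MRU end

  def put(key: String, value: Array[Byte], computingRddId: String): Unit = {
    while (entries.size >= capacity) {
      // Evict the least recently used partition, preferring one from a
      // different RDD so the RDD being computed keeps its own partitions.
      val victim = entries.keys
        .find(k => !k.startsWith(computingRddId + ":"))
        .getOrElse(entries.head._1)
      entries.remove(victim)
    }
    entries.put(key, value)
  }
}
```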
Support for checkpointing
- users request checkpointing by passing a REPLICATE flag to persist(); useful when lineage chains grow long (e.g., iterative jobs like PageRank), since recovery by recomputation would be expensive
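A sketch using the released Spark API's closest equivalents, reusing `sc` and `codes` from the sketches above (the checkpoint directory path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

sc.setCheckpointDir("/tmp/spark-checkpoints")  // placeholder path on stable storage
val ranks = codes.map(c => (c, 1.0))           // stand-in for a long-lineage pair RDD
ranks.persist(StorageLevel.MEMORY_ONLY_2)      // "_2" levels replicate across two nodes
ranks.checkpoint()                             // truncates the lineage chain
ranks.count()                                  // first job materializes the checkpoint
```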