Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

An RDD is a read-only, partitioned collection of records that can be created only through deterministic operations on either data in stable storage or other RDDs.

RDDs are best suited for batch analytics that apply the same operation to all elements of a dataset.
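
A minimal sketch of both creation paths in Spark's Scala API; the HDFS path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rdd-demo").setMaster("local[*]"))

// 1. From data in stable storage:
val lines = sc.textFile("hdfs://logs")

// 2. From a deterministic operation on another RDD:
val errors = lines.filter(_.startsWith("ERROR"))
```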

transformations

lazy operations that define a new RDD from an existing dataset; Spark records the lineage but computes nothing until an action runs (see the sketch after this list)

  • map
  • filter
  • flatMap
  • sample
  • groupByKey
  • reduceByKey
  • union
  • join
  • cogroup
  • crossProduct
  • mapValues
  • sort
  • partitionBy
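
A short sketch of this laziness: each call below returns a new RDD immediately and only records lineage; no job runs until an action is invoked.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("lazy-demo").setMaster("local[*]"))

val nums    = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)   // lazy: only lineage is recorded
val evens   = squares.filter(_ % 2 == 0)    // still lazy
val pairs   = evens.map(x => (x % 10, x))   // key-value RDD: enables reduceByKey, join, ...
val sums    = pairs.reduceByKey(_ + _)      // even shuffle transformations stay lazy
// Nothing has been computed at this point.
```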

actions

operations that launch a computation to return a value to the program or write data to external storage (example after the list)

  • count
  • collect
  • save
  • reduce
  • lookup
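
A companion sketch of actions forcing evaluation; the output path is a placeholder:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("action-demo").setMaster("local[*]"))

val words  = sc.parallelize(Seq("a", "b", "a", "c"))
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // lazy so far

counts.count()                                      // runs a job; returns a Long to the driver
counts.collect()                                    // Array[(String, Int)] in the driver
counts.reduce((a, b) => if (a._2 > b._2) a else b)  // aggregates to a single value
counts.lookup("a")                                  // Seq[Int]: the values for one key
counts.saveAsTextFile("/tmp/counts")                // writes to external storage
```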

control

  • persistence
  • partitioning

Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.
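
A sketch of both control knobs; the path and partition count are placeholders:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("control-demo").setMaster("local[*]"))
val pairs = sc.textFile("hdfs://logs").map(l => (l.split(' ')(0), l))

// Persistence: in memory by default, spilling to disk when RAM runs out.
pairs.persist(StorageLevel.MEMORY_AND_DISK)

// Partitioning: hash-partition by key across 100 partitions, so later joins
// against RDDs with the same partitioner need no shuffle for this side.
val byKey = pairs.partitionBy(new HashPartitioner(100)).persist()
```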

interface

  • partitions()
  • preferredLocations(p)
  • dependencies()
  • iterator(p, parentIters)
  • partitioner()

That is, each RDD is characterized by (see the trait sketch after this list):

  • a set of partitions
  • a set of dependencies on parent RDDs
    • narrow: each parent partition is used by at most one child partition (e.g., map, filter)
    • wide: a parent partition may be used by many child partitions (e.g., groupByKey), requiring a shuffle
  • a function for computing the dataset based on its parents
  • metadata about its partitioning scheme and data placement
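
The interface can be rendered as a Scala trait. This is a sketch following the paper's Table 3, not Spark's actual internal RDD class; RDDLike and the dependency case classes are hypothetical names.

```scala
trait Partition { def index: Int }

sealed trait Dependency { def parent: RDDLike[_] }
// narrow: each parent partition feeds at most one child partition (map, filter)
case class NarrowDependency(parent: RDDLike[_]) extends Dependency
// wide: a parent partition may feed many child partitions (groupByKey)
case class WideDependency(parent: RDDLike[_]) extends Dependency

trait RDDLike[T] {
  def partitions: Seq[Partition]                    // the atomic pieces of the dataset
  def dependencies: Seq[Dependency]                 // lineage edges to parent RDDs
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T] // compute one partition from its parents
  def preferredLocations(p: Partition): Seq[String] // placement hints (e.g., HDFS block hosts)
  def partitioner: Option[AnyRef]                   // hash/range partitioning metadata, if any
}
```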

Comparison with Distributed Shared Memory (DSM)

DSM

  • fine-grained reads and writes to arbitrary memory locations

RDD

  • writes only through coarse-grained bulk transformations
  • efficient fault tolerance: lost partitions are recomputed from lineage instead of being replicated or logged
  • immutability lets the system mitigate slow nodes (stragglers) by running backup copies of slow tasks
  • bulk operations can be scheduled based on data locality
  • degrades gracefully when memory is scarce by storing excess partitions on disk

Implementation

Job scheduling

  • the scheduler builds a DAG of stages from the target RDD's lineage, cutting stage boundaries at wide dependencies (sketched below)
  • assigns tasks to nodes based on data locality using delay scheduling
  • for wide dependencies, intermediate records are materialized on the nodes holding parent partitions, which simplifies fault recovery
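
A minimal sketch of the stage-cutting idea, with hypothetical names (Node, NarrowDep, WideDep, buildStages); Spark's real scheduler additionally accounts for cached partitions and deduplicates shared ancestors.

```scala
sealed trait Dep { def parent: Node }
case class NarrowDep(parent: Node) extends Dep // pipelined into the child's stage
case class WideDep(parent: Node) extends Dep   // stage boundary: requires a shuffle

case class Node(name: String, deps: List[Dep])

// Collect one stage's members by following narrow dependencies, and return
// the wide-dependency parents that must run in earlier stages.
def collect(n: Node): (Set[String], List[Node]) =
  n.deps.foldLeft((Set(n.name), List.empty[Node])) {
    case ((members, parents), NarrowDep(p)) =>
      val (pm, pp) = collect(p)
      (members ++ pm, parents ++ pp)
    case ((members, parents), WideDep(p)) =>
      (members, parents :+ p)
  }

// Stages in execution order: parents of a shuffle run before the stage that consumes it.
def buildStages(target: Node): List[Set[String]] = {
  val (members, shuffleParents) = collect(target)
  shuffleParents.flatMap(buildStages) :+ members
}

val lines  = Node("lines",  Nil)
val errors = Node("errors", List(NarrowDep(lines)))  // e.g., filter
val groups = Node("groups", List(WideDep(errors)))   // e.g., groupByKey
val counts = Node("counts", List(NarrowDep(groups))) // e.g., mapValues
println(buildStages(counts)) // List(Set(lines, errors), Set(groups, counts))
```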

Interpreter integration

  • class shipping: worker nodes fetch the bytecode of the classes the interpreter generates for each input line over HTTP
  • modified code generation: generated code references each line's singleton object directly, so closures shipped to workers see the right instances

Memory management

  • LRU eviction at the granularity of RDD partitions; when a newly computed partition does not fit, a partition from the least recently accessed RDD is evicted, unless it belongs to the same RDD as the new partition (sketched below)
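
A minimal sketch of partition-level LRU using java.util.LinkedHashMap in access order; PartitionCache and its members are hypothetical, and the same-RDD refinement above is omitted for brevity.

```scala
import java.util.{LinkedHashMap => JLinkedHashMap, Map => JMap}

// Evicts the least recently used partition when the cache is full.
class PartitionCache[K, V](maxEntries: Int) {
  private val map = new JLinkedHashMap[K, V](16, 0.75f, /* accessOrder = */ true) {
    override protected def removeEldestEntry(eldest: JMap.Entry[K, V]): Boolean =
      size() > maxEntries // drop the least recently accessed entry
  }
  def put(key: K, partition: V): Unit = map.put(key, partition)
  def get(key: K): Option[V] = Option(map.get(key)) // a get refreshes recency
}

val cache = new PartitionCache[String, Array[Int]](2)
cache.put("rdd1:p0", Array(1, 2))
cache.put("rdd1:p1", Array(3, 4))
cache.get("rdd1:p0")              // touch p0, making p1 the eldest
cache.put("rdd2:p0", Array(5, 6)) // evicts rdd1:p1
```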

Support for checkpointing

  • a REPLICATE flag to persist requests checkpointing to stable storage; the choice of which RDDs to checkpoint is left to the user
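
A usage sketch with the modern Spark API, where the paper's REPLICATE flag roughly corresponds to the replicated storage levels (the trailing "_2"); paths are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("ckpt-demo").setMaster("local[*]"))
sc.setCheckpointDir("/tmp/spark-checkpoints")

val pairs  = sc.textFile("hdfs://logs").map(l => (l.split(' ')(0), 1))
val counts = pairs.reduceByKey(_ + _)

counts.persist(StorageLevel.MEMORY_ONLY_2) // "_2": keep two in-memory replicas
counts.checkpoint()                        // write to stable storage, truncating lineage
counts.count()                             // the action triggers computation and the checkpoint
```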