Saturday, August 18, 2018

Reading “Spanner: Google's Globally Distributed Database”

  • Introduction

    • Scalable, Globally-distributed database 
    • Shards data across many sets of Paxos state machines in global datacenters
    • Two features 
      • Externally consistent R / W
      • Globally consistent R
      • Enabled by globally-meaningful commit timestamps that reflect a serialization order satisfying external consistency (if a transaction T1 commits before another transaction T2 starts, then T1's commit timestamp is smaller than T2's). The enabler is the TrueTime API. 

  • Implementation (Structure of Spanner's Implementation, feature set, engineering decisions) 

      Servers in a Spanner Universe 
      • Spanserver Software Stack 
        • Responsible for between 100 and 1000 instances of a data structure called a tablet (a bag of (key, timestamp) -> value mappings), stored in a set of B-tree-like files and a write-ahead log in a DFS called Colossus. 
        • Implements a single Paxos state machine on top of each tablet to support replication
          • These state machines implement a consistently replicated bag of mappings; the key-value mapping state of each replica is stored in its corresponding tablet. 
          • Writes initiate the Paxos protocol at the leader
          • Reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date
        • Implements a lock table at every leader replica. The lock table contains state for two-phase locking.
          • Operations that require synchronization (e.g., transactional reads) acquire locks in the lock table; other operations bypass it 
        • Implements a transaction manager at every leader replica. 
          • The transaction manager is used to implement a participant leader; the group's other replicas act as participant slaves.
          • One Paxos group involved in a transaction? Bypass the transaction manager (the lock table plus Paxos already provide transactionality)
          • Multiple Paxos groups involved in a transaction? The group leaders coordinate to perform two-phase commit. 
            • Coordinator leader: participant leader of one of the participant groups 
            • Coordinator slaves: slaves from the one group above
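
        A rough structural sketch in Python of the stack described above; the class and field names here are mine, not Spanner's, and this is a mental model rather than an implementation:

        from dataclasses import dataclass, field
        from typing import Dict, List, Tuple

        @dataclass
        class Tablet:
            # A tablet is a bag of (key, timestamp) -> value mappings, stored as
            # B-tree-like files plus a write-ahead log on Colossus.
            rows: Dict[Tuple[str, int], str] = field(default_factory=dict)

        @dataclass
        class PaxosGroupReplica:
            # One Paxos state machine sits on top of each tablet; the replicated
            # key-value state of each replica lives in its tablet. Lock-table and
            # transaction-manager state exist only at the leader replica.
            tablet: Tablet
            is_leader: bool = False
            lock_table: Dict[str, str] = field(default_factory=dict)
            prepared_txns: Dict[str, float] = field(default_factory=dict)

        @dataclass
        class Spanserver:
            # Each spanserver is responsible for roughly 100-1000 tablets.
            replicas: List[PaxosGroupReplica] = field(default_factory=list)

        def needs_two_phase_commit(participant_groups: list) -> bool:
            # One Paxos group: lock table + Paxos already give transactionality,
            # so the transaction manager is bypassed. Several groups: their
            # leaders coordinate two-phase commit (one acts as coordinator).
            return len(participant_groups) > 1
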
      • Directories and Placement
        • Directory - bucketing abstraction, unit of data placement
          • All data in a directory has the same replication configuration
          • Replication properties can be specified by an application 
        • Applications can..
          • Control locality of their data by choosing keys carefully
          • Control how data is replicated by tagging each database and/or individual directories with a combination of named options in two dimensions (1st dimension - the number and types of replicas, 2nd dimension - the geographic placement of those replicas) 
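
        A toy illustration of such a placement tag (the option names are made up; the values mirror the kind of menu entry the paper mentions, e.g. North America, replicated 5 ways with 1 witness):

        # Invented option names, illustrative values only.
        directory_placement = {
            "replicas": "5 ways with 1 witness",  # number and types of replicas
            "placement": "North America",         # geographic placement of those replicas
        }
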
      • Data Model
        • Application data model -> layered on top of the directory-bucketed key-value mappings
        • An application creates 1+ databases in a universe
        • Each database can contain an unlimited number of schematized tables
        • Features
          • Schematized semi-relational tables
            • Semi?
              • Each table is a mapping from its primary-key columns to its non-primary-key columns. Keys can be NULL (hence 'semi'?) or composite (hence 'semi'?)
          • Query Language
            • Interleaving
              • INTERLEAVE IN 
                • declares the hierarchies in database schemas
                  • Top of hierarchy - directory table 
                  • Each row in a directory table with key K, together with all the rows in descendant tables that start with K, forms a directory 
              • ON DELETE CASCADE
                • deleting a row in a directory table deletes any associated child rows
              • Interleaving is important because clients can describe the locality relationships that exist between multiple tables 
          • General-purpose transactions 
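
        To make the interleaving/directory idea concrete, here is a small Python sketch (tables and keys invented for the example) of how rows whose keys share a parent-key prefix end up grouped into one directory:

        from collections import defaultdict

        # Invented example: Users is the directory table, Albums is interleaved in it.
        users = {(1,): "user-row-1", (2,): "user-row-2"}
        albums = {(1, 10): "album-row", (1, 11): "album-row", (2, 20): "album-row"}

        # Each row of the directory table with key K, together with every
        # descendant-table row whose key starts with K, forms one directory.
        directories = defaultdict(dict)
        for key, row in {**users, **albums}.items():
            directories[key[0]][key] = row

        print(sorted(directories[1]))  # [(1,), (1, 10), (1, 11)] -> one unit of placement
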

      • TrueTime (API, Implementation)
        • API
          • TT.now(), TT.after(t), TT.before(t) 
        • Implementation
          • Time master machine per datacenter 
            • Two kinds of time reference are used because they have different failure modes: most masters have GPS receivers, and the remaining ('Armageddon') masters have atomic clocks
            • All masters' time references are regularly compared against each other
            • Each master cross-checks the rate at which its reference advances time against its own local clock and evicts itself if there's a substantial divergence
          • Timeslave daemon per machine
            • Every daemon polls a variety of masters to reduce errors from any one master
            • Daemons apply a variant of Marzullo's algorithm 
            • Has slowly increasing time uncertainty between synchronizations 
            • Daemon poll interval of 30 seconds, applied drift rate of 200 μs/sec 
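
        A rough Python model of the API above; the uncertainty bound epsilon and the clock source here are stand-ins, not the real implementation:

        import time
        from dataclasses import dataclass

        @dataclass
        class TTInterval:
            earliest: float
            latest: float

        class TrueTime:
            """Toy model: now() returns an interval assumed to contain the
            absolute time at which now() was invoked."""
            def __init__(self, epsilon_seconds: float = 0.007):  # stand-in bound
                self.epsilon = epsilon_seconds

            def now(self) -> TTInterval:
                t = time.time()
                return TTInterval(earliest=t - self.epsilon, latest=t + self.epsilon)

            def after(self, t: float) -> bool:
                # True only once t has definitely passed.
                return t < self.now().earliest

            def before(self, t: float) -> bool:
                # True only while t has definitely not yet arrived.
                return t > self.now().latest

        TT = TrueTime()
        interval = TT.now()
        # Holds as long as the two calls happen within epsilon of each other.
        assert interval.earliest <= time.time() <= interval.latest

        (The sketches further down reuse this toy TT object wherever a tt parameter appears.)
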

      • Concurrency Control (How TrueTime is used to derive correctness properties and implement features) 

            • Timestamp management
              • Assigning timestamps to RW transactions
                • Two-phase locking
                  • Timestamps can be assigned at any time when all locks have been acquired but before any locks have been released
                • Monotonicity invariant
                  • Within each Paxos group, Spanner assigns timestamps to Paxos writes in monotonically increasing order, even across leaders 
                  • A single leader replica can trivially assign timestamps in monotonically increasing order 
                  • Across leaders - disjointness invariant (a leader must only assign timestamps within the interval of its leader lease)
                • External consistency invariant
                  • If the start of T2 occurs after the commit of T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1. 
                • Protocol
                  • e_i^server = arrival event of the commit request at the coordinator leader for a write T_i 
                  • Start: The coordinator leader for a write, T_i, assigns a commit timestamp, s_i, no less than the value of TT.now().latest computed after e_i^server. 
                  • Commit Wait: The coordinator leader ensures that clients cannot see any data committed by T_i until TT.after(s_i) is true. Commit wait ensures that s_i is less than the absolute commit time of T_i. 
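
                A minimal sketch of these two rules, with tt standing for the toy TrueTime model sketched in the TrueTime section (the tie-break and the busy-wait are mine):

                import time

                def assign_commit_timestamp(tt, last_assigned, lease_start, lease_end):
                    # Start rule: pick s_i no less than TT.now().latest, where now()
                    # is evaluated after e_i^server. Monotonicity/disjointness: s_i
                    # must also exceed anything this leader has already assigned and
                    # fall inside its lease interval.
                    s = max(tt.now().latest, last_assigned + 1e-6)  # toy tie-break
                    assert lease_start <= s <= lease_end
                    return s

                def commit_wait(tt, s):
                    # Commit Wait rule: clients must not see data committed by T_i
                    # until TT.after(s_i) is true, so s_i is guaranteed to be less
                    # than T_i's absolute commit time.
                    while not tt.after(s):
                        time.sleep(0.001)  # expected wait is about 2 * epsilon
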
              • Serving Reads at a Timestamp
                • The monotonicity invariant allows Spanner to determine whether a replica's state is sufficiently up to date to satisfy a read. Every replica tracks safe time, t_safe, which is the maximum timestamp at which a replica is up-to-date. A replica can satisfy a read at timestamp t, if t <= t_safe. 
                • Derivation of t_safe: t_safe = min(t_safe^Paxos, t_safe^TM), where t_safe^Paxos is the timestamp of the highest-applied Paxos write at the replica, and t_safe^TM is infinity if the group has no prepared-but-not-committed transactions, and otherwise min_i(s_i^prepare,g) - 1 over those transactions
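
                The same derivation as a tiny sketch (plain numbers stand in for timestamps):

                import math

                def t_safe(highest_applied_paxos_ts, pending_prepare_ts):
                    # t_safe = min(t_safe_paxos, t_safe_tm)
                    # t_safe_paxos: timestamp of the highest-applied Paxos write here.
                    # t_safe_tm: infinity when no transaction is prepared but not yet
                    # committed at this group; otherwise just below the earliest such
                    # prepare timestamp.
                    t_tm = math.inf if not pending_prepare_ts else min(pending_prepare_ts) - 1
                    return min(highest_applied_paxos_ts, t_tm)

                # A replica can serve a read at timestamp t iff t <= t_safe.
                print(t_safe(100, []))    # 100
                print(t_safe(100, [90]))  # 89 -- a pending 2PC participant holds t_safe back
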
              • Assigning Timestamps to Read-only Transactions 
                • Two phases 
                  • Assign a timestamp, s_read
                  • Execute the transaction's reads as snapshot reads at s_read. 
                • Assignment s_read = TT.now().latest 
                  • Preserves external consistency 
                  • Such a timestamp may require the execution of the data reads at s_read to block, if t_safe has not advanced sufficiently. 
            • Details
              • Read-write transactions
                • Client drives two-phase commit 
                  • For reads within read-write transactions:
                    • The client issues reads to the leader replica of the appropriate group, which acquires read locks and then reads the most recent data
                    • While the transaction remains open, the client sends keepalive messages to the participant leaders so its transaction doesn't time out
                    • When client has finished all reads & buffered all writes, it begins two-phase commit. 
                    • Client chooses coordinator group and sends a commit message to each participant leader with the identity of the coordinator.
                    • Each non-coordinator-participant leader acquires write locks, chooses a prepare timestamp (larger than any timestamp it has previously assigned), logs a prepare record through Paxos, and then notifies the coordinator of its prepare timestamp.
                    • The coordinator leader also first acquires write locks but skips the prepare phase; it chooses a commit timestamp for the entire transaction after hearing from all other participant leaders.
                    • Before allowing any coordinator replica to apply the commit record, the coordinator leader waits until TT.after(s) to obey Commit Wait rule.
                    • After Commit Wait, coordinator sends the commit timestamp to the client and all other participant leaders.
                    • Each participant leader logs the transaction's outcome through Paxos and releases its locks. 
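
                Putting the commit path above into one sketch; the method names on client/coordinator/participants are hypothetical placeholders, and tt is again the toy TrueTime model:

                def commit_read_write_txn(tt, client, coordinator, other_participants):
                    # The client has already read through the group leaders, buffered
                    # its writes, chosen a coordinator group, and sent a commit message
                    # to every participant leader naming that coordinator.

                    # Non-coordinator participant leaders: acquire write locks, choose
                    # a prepare timestamp, log a prepare record through Paxos, and
                    # report the prepare timestamp to the coordinator.
                    prepare_ts = [p.prepare() for p in other_participants]

                    # Coordinator leader: acquires write locks, skips the prepare
                    # phase, and picks the commit timestamp s for the whole
                    # transaction: >= every prepare timestamp and >= TT.now().latest.
                    coordinator.acquire_write_locks()
                    s = max(prepare_ts + [tt.now().latest])

                    # Commit Wait: no coordinator replica applies the commit record
                    # before TT.after(s) is true.
                    while not tt.after(s):
                        pass

                    # Then the commit timestamp goes to the client and to every other
                    # participant leader; each leader logs the outcome through Paxos
                    # and releases its locks.
                    client.notify_committed(s)
                    for p in other_participants:
                        p.commit(s)
                        p.release_locks()
                    coordinator.release_locks()
                    return s
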
              • Read-only transactions 
                • Scope expression is required for every Read-only transaction
                  • Expression that summarizes the keys that will be read by the entire transaction 
                  • If the scope's values are served by a single Paxos group, the client issues the read-only transaction to the group's leader
                  • If the scope's values are served by multiple Paxos groups, then..
                    • Client has the reads executed at s_read = TT.now().latest
                    • All reads in the transaction can be sent to replicas that are sufficiently up-to-date. 
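
                And a last sketch of the read-only path, again with hypothetical helpers and the toy TrueTime model as tt:

                def run_read_only_txn(tt, scope_keys, lookup_groups):
                    # The scope expression summarizes the keys the whole transaction
                    # will read; lookup_groups() maps it to the serving Paxos groups.
                    groups = lookup_groups(scope_keys)

                    if len(groups) == 1:
                        # Single group: issue the transaction to that group's leader.
                        return groups[0].leader.read_only(scope_keys)

                    # Multiple groups: execute the reads as snapshot reads at
                    # s_read = TT.now().latest (which may wait for t_safe to advance)
                    # and send them to any sufficiently up-to-date replicas.
                    s_read = tt.now().latest
                    return [g.up_to_date_replica(s_read).snapshot_read(scope_keys, s_read)
                            for g in groups]
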
