Google's Globally Distributed Database: Spanner

Published 2023-10-26 02:36:23 · Author: 真昼小天使daisuki

At OSDI 2012, Google presented the Spanner database. In my view, Spanner's handling of multi-version data and external consistency for transactions, and its use of TrueTime plus timestamps to keep globally distributed replicas in sync, are well worth a read. Understanding its timing logic matters for deploying a distributed DB over a wide area (typically nationwide to worldwide) while keeping replication consistent.

Key points:

  • external consistency -> transaction ordering
  • TrueTime + timestamps: synchronization & multi-versioning
  • global deployment
  • 2PC & 2PL
  • 3 basic transaction types (read-write, read-only, snapshot read)

Spanner: Google's Globally-Distributed Database

Implementation

Different environments run as separate universes:
e.g. test, development, and production.

Hierarchy

  1. universe: global

    The universe master and the placement driver are currently singletons.

  2. zone: the unit of administrative deployment; also the unit of physical isolation

    zonemaster & location proxies

  3. spanserver
  4. tablet

Spanserver

software stack

One leader plus several replicas, placed in different data centers.

Every replica has:

  1. tablet: a versioned key-value map (see the sketch after this list)
    $$
    (key:string, timestamp:int64) → string
    $$

  2. Colossus: a distributed file system, the successor to GFS; a tablet's state (B-tree-like files and a write-ahead log) lives on Colossus

  3. Paxos state machine: supports replication; each replica set forms a Paxos group that maintains a consistently replicated bag of mappings

    Each state machine stores its metadata and log in its corresponding tablet. Paxos implementation supports long-lived leaders with time-based leader leases.

    Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
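A minimal sketch of the tablet's versioned mapping and of a snapshot read against it (the class and method names here are invented for illustration, not Spanner's actual interfaces):

```python
from collections import defaultdict


class MultiVersionTablet:
    """Toy sketch of a tablet's (key: string, timestamp: int64) -> string map."""

    def __init__(self) -> None:
        # Per key: a list of (timestamp, value) versions, oldest first.
        self._versions: dict[str, list[tuple[int, str]]] = defaultdict(list)

    def write(self, key: str, timestamp: int, value: str) -> None:
        # Paxos applies writes in order, so per-key timestamps arrive increasing.
        self._versions[key].append((timestamp, value))

    def read_at(self, key: str, timestamp: int) -> str | None:
        # Snapshot read: the newest version written at or before `timestamp`.
        best = None
        for ts, value in self._versions[key]:
            if ts <= timestamp:
                best = value
            else:
                break
        return best


t = MultiVersionTablet()
t.write("user/alice", 10, "v1")
t.write("user/alice", 20, "v2")
assert t.read_at("user/alice", 15) == "v1"  # a read "in the past" sees v1
```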

The Paxos implementation is pipelined (to improve throughput in the presence of WAN latencies), but writes are still applied by Paxos in order.

Only the leader has:

  1. lock table: the state for two-phase locking
  2. transaction manager: used to support distributed transactions that span Paxos groups

Directories and Placement

On top of the key-value map, Spanner supports a bucketing abstraction called a directory: a set of contiguous keys that share a common prefix.

tablet: unlike a Bigtable tablet, a Spanner tablet is a container that may encapsulate multiple partitions of the row space.

Movedir: a background task, not a single transaction. It registers the fact that it is starting to move data, copies the data in the background, and then uses a transaction to atomically move only the small remaining amount and update the metadata (and in practice it moves fragments of a directory, not one big directory). A sketch follows.
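A hedged, toy-level sketch of that flow, with two Python dicts standing in for the source and destination Paxos groups and an `ownership` dict standing in for the placement metadata (all names here are invented):

```python
def movedir(prefix: str, src: dict[str, str], dst: dict[str, str],
            ownership: dict[str, str]) -> None:
    """Toy Movedir: relocate all keys under `prefix` from src to dst."""
    keys = sorted(k for k in src if k.startswith(prefix))

    # Phase 1: copy the bulk of the data in the background; reads and
    # writes keep going to the source group while this runs.
    for key in keys[:-1]:
        dst[key] = src[key]

    # Phase 2: a single transaction atomically moves the nominal remainder
    # and updates the placement metadata for the two groups.
    for key in keys:
        dst[key] = src.pop(key)
    ownership[prefix] = "dst-group"
```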

Data model

  1. schematized semi-relational tables
  2. a query language
  3. general-purpose transactions

Spanner’s data model is not purely relational, in that rows must have names.

Hierarchies are declared in database schemas via INTERLEAVE IN, which lets clients express locality relationships between tables: child rows are stored next to their parent row, as sketched below.
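A rough sketch of why interleaving buys locality, using the paper's Users/Albums example; the key encoding below is illustrative, not Spanner's actual on-disk format:

```python
def user_key(uid: int) -> str:
    return f"Users({uid})"

def album_key(uid: int, aid: int) -> str:
    # Albums is declared INTERLEAVE IN PARENT Users, so a child row's key
    # is prefixed by its parent's key and sorts right next to it.
    return f"Users({uid})/Albums({aid})"

rows = sorted([user_key(2), user_key(1), album_key(1, 10), album_key(1, 11)])
# -> ['Users(1)', 'Users(1)/Albums(10)', 'Users(1)/Albums(11)', 'Users(2)']
# The contiguous run sharing the prefix 'Users(1)' forms one directory.
```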


TrueTime

API:

  • now(): returns an interval [earliest, latest] that contains the absolute time of the call
  • after(t): true if t has definitely passed
  • before(t): true if t has definitely not arrived

underlying time references: GPS and atomic clocks
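A minimal sketch of the shape of this API, assuming a fixed uncertainty bound ε (in reality ε is derived from the GPS/atomic-clock references plus worst-case clock drift; the value below is only illustrative):

```python
import time
from dataclasses import dataclass


@dataclass
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Sketch of the TT API: now() returns an interval, not a single instant."""

    def __init__(self, epsilon_s: float = 0.004) -> None:
        self.epsilon = epsilon_s  # illustrative uncertainty bound

    def now(self) -> TTInterval:
        t = time.time()  # stand-in for a clock with bounded uncertainty
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        # True iff `t` has definitely passed.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True iff `t` has definitely not arrived yet.
        return self.now().latest < t
```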


Concurrency Control

two-phase commit generates a Paxos write for the prepare phase that has no corresponding Spanner client write.

transactions:

  • read-write (standalone writes are handled as read-write transactions)
  • read-only: lock-free; served by any replica that is sufficiently up-to-date
  • snapshot reads: reads in the past at a client-chosen timestamp; no locking; any replica that is sufficiently up-to-date

Paxos leader lease:

timed leases (10 seconds by default): make Paxos leadership long-lived; a would-be leader collects lease votes from the replicas

lease interval: starts when the leader discovers it has a quorum of lease votes and ends when it no longer has a quorum

smax: the maximum timestamp used by a leader; a leader must not abdicate until TT.after(smax) holds, so that successive leaders' lease intervals stay disjoint.
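A toy sketch of that bookkeeping, reusing the `TrueTime` sketch above; `vote_expirations` is a made-up stand-in for the per-replica lease-vote expiry times, and vote re-extension on writes is ignored:

```python
def lease_is_valid(tt: TrueTime, vote_expirations: list[float]) -> bool:
    # The leader holds its lease while a quorum of lease votes have
    # definitely not expired yet.
    quorum = len(vote_expirations) // 2 + 1
    unexpired = [e for e in vote_expirations if tt.before(e)]
    return len(unexpired) >= quorum


def may_abdicate(tt: TrueTime, smax: float) -> bool:
    # A leader must wait until smax has definitely passed before abdicating,
    # which keeps successive leaders' lease intervals disjoint.
    return tt.after(smax)
```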

Two-phase commit: a protocol that keeps all participants consistent; if it does not succeed, the transaction is rolled back (a sketch follows the list).

  1. prepare phase
  2. commit phase
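A bare-bones sketch of the coordinator's side of the protocol (the `Participant` stub and its methods are hypothetical stand-ins for the participant leaders):

```python
class Participant:
    """Stand-in for a non-coordinator participant leader."""

    def prepare(self) -> bool:
        # Acquire locks and log a prepare record through Paxos; vote yes/no.
        return True

    def commit(self, ts: float) -> None:
        pass  # apply the transaction's writes at timestamp `ts`

    def abort(self) -> None:
        pass  # release locks and roll back


def two_phase_commit(participants: list[Participant], commit_ts: float) -> bool:
    # Phase 1 (prepare): every participant must vote yes. Each prepare is
    # itself a Paxos write with no corresponding client write.
    if not all(p.prepare() for p in participants):
        for p in participants:
            p.abort()  # any "no" vote rolls the whole transaction back
        return False
    # Phase 2 (commit): once all votes are in, the decision is commit.
    for p in participants:
        p.commit(commit_ts)
    return True
```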

RW txn:

Writes are buffered at the client until commit.

Wound-wait is used to avoid deadlock.

Both kinds of participant leader take write locks:

  • non-coordinator participant leaders: acquire write locks, choose a prepare timestamp, and log a prepare record through Paxos
  • coordinator leader: also acquires write locks but skips the prepare phase; it picks a commit timestamp s no less than all prepare timestamps and greater than TT.now().latest at the time it received the commit message, then commit-waits until TT.after(s) before releasing locks (see the sketch below)
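A sketch of the coordinator's timestamp choice and commit wait, reusing the `TrueTime` sketch above (this shows the invariant, not Spanner's actual code):

```python
def commit_with_wait(tt: TrueTime, prepare_timestamps: list[float]) -> float:
    # The commit timestamp must be no less than every participant's prepare
    # timestamp and no less than TT.now().latest when the commit arrived.
    s = max(prepare_timestamps + [tt.now().latest])

    # Commit wait: do not release locks or acknowledge the client until s
    # has definitely passed; this is what gives external consistency.
    while not tt.after(s):
        time.sleep(0.0001)  # `time` is imported in the TrueTime sketch
    return s
```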

RO txn:

execution flow:

  • assign a timestamp sread
  • execute the transaction’s reads as snapshot reads at sread.

The simple choice is sread = TT.now().latest, which preserves external consistency but may block while waiting for replicas' safe time to advance; ideally Spanner assigns the oldest timestamp that still preserves external consistency.

  • single Paxos group: the leader can assign sread = LastTS() instead and avoid waiting.

    Define LastTS() to be the timestamp of the last committed write at a Paxos group.

  • multiple Paxos groups: Spanner avoids a negotiation round among the groups' leaders and simply uses sread = TT.now().latest (see the sketch below).
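A hedged sketch of that choice, again reusing the `TrueTime` sketch; `last_ts_per_group` stands in for LastTS() at each involved Paxos group, and the safe-time wait at the replicas is omitted:

```python
def choose_sread(tt: TrueTime, last_ts_per_group: list[float]) -> float:
    if len(last_ts_per_group) == 1:
        # Reads within a single Paxos group can use LastTS(): the timestamp
        # of the last committed write at that group, so no waiting is needed.
        return last_ts_per_group[0]
    # Reads across groups skip a negotiation round and simply pick
    # TT.now().latest; each replica serves the read once it is safely
    # caught up to sread.
    return tt.now().latest
```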

Schema-Change Transactions

Discussion

Paxos + TrueTime -> consistency

strong consistency across data centers

data model: not purely relational (a SQL-like query language is available)

tablets are replicated; concurrency coordination is handled by Paxos

transactions that span multiple Paxos groups are coordinated with 2PC among the participant leaders

What is the actual difference compared with a classical distributed database?

consistent versions of the data: reads always see a consistent snapshot

The core idea: timestamps and version control.

the time mechanism

global-time consistency: the assigned commit timestamp is a single value with no uncertainty

commit time as reported by TrueTime: an interval

Given two transactions, Spanner must be able to tell which one actually happened before the other.

Participant leader -> Transaction manager -> Paxos group

The three basic read/write operations provide external consistency: global timestamps synchronize replicas across regions and fix a definite transaction order.

Concurrency control boils down to timestamp management.

timestamp -> multi-version -> snapshot

Almost all the work in Spanner revolves around the ordering of timestamps!

condition: multiple data centers

target: external consistency ~= linearizability

Two-phase locking:

  1. growing phase: acquire lock
  2. shrinking phase: release lock
  • 2PC: coordinates a distributed commit across nodes (global agreement)
  • 2PL: serializes multiple transactions on a single node by controlling how they acquire and release locks on resources (see the sketch below)
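A minimal single-threaded sketch of the two phases (the lock table here is just a dict mapping a resource name to the holding transaction id; wound-wait and blocking are omitted):

```python
class TwoPhaseLockingTxn:
    """Toy 2PL: every acquire happens before the first release."""

    def __init__(self, txn_id: str, lock_table: dict[str, str]) -> None:
        self.txn_id = txn_id
        self.lock_table = lock_table
        self.held: set[str] = set()
        self.shrinking = False

    def acquire(self, resource: str) -> bool:
        # Growing phase only: after any release, no new locks may be taken,
        # otherwise serializability could be violated.
        assert not self.shrinking, "cannot acquire after releasing (2PL)"
        owner = self.lock_table.get(resource, self.txn_id)
        if owner != self.txn_id:
            return False  # held by another txn: the caller waits (or wounds it)
        self.lock_table[resource] = self.txn_id
        self.held.add(resource)
        return True

    def release_all(self) -> None:
        # Shrinking phase: in strict 2PL this happens at commit or abort.
        self.shrinking = True
        for resource in self.held:
            del self.lock_table[resource]
        self.held.clear()
```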

TrueTime: turns local clocks into a globally meaningful clock, which is essential for a globally distributed system that needs synchronization.

uncertainty interval [earliest, latest]: keep it as small as possible (higher accuracy) -> shorter commit wait, so locks are held for less time -> higher efficiency

Thus timestamps plus TrueTime form a globally accessible time service for applications around the world.

external-consistency invariant: if transaction T1 commits before T2 starts, then s1 < s2.
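In the paper's notation, where t_abs(e) is the absolute time of an event and e1^commit, e2^start are the commit event of T1 and the start event of T2:

$$
t_{abs}(e_1^{commit}) < t_{abs}(e_2^{start}) \;\Rightarrow\; s_1 < s_2
$$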