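Google published the Spanner database at OSDI 2012. I think Spanner's handling of versioning and externally consistent transactions, and its use of TrueTime + timestamps to keep globally distributed replicas in sync, are well worth studying. Understanding its timing logic matters a lot for deploying a distributed DB over a wide area (typically nationwide to worldwide) while keeping replication in sync.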
key point:
external consistency -> txn sequence
truetime + timestamp, sync & multi-version
global deployment
2PC 2PL
3 basic txns(RW, RO, snapshot)
Spanner: Globally-Distributed Database
Implementation
Different environments: each Spanner deployment is a universe
e.g. test, development, production
Hierarchy
- universe: global
The universe master and the placement driver are currently singletons.
- zone: the unit of administrative deployment; logical & physical isolation
zone master & location proxy
- spanserver
- tablet
Spanserver
software stack
1 leader, several replicas, in different data centers
all have:
- tablet: a bag of mappings of the form
$$
(\text{key}: \text{string}, \text{timestamp}: \text{int64}) \rightarrow \text{string}
$$
- Colossus: a distributed filesystem like GFS
- Paxos state machine: supports replication, giving a consistently replicated bag of mappings; the set of replicas is a Paxos group
Each state machine stores its metadata and log in its corresponding tablet. Paxos implementation supports long-lived leaders with time-based leader leases.
Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date.
Paxos: the implementation is pipelined, but writes are applied in order
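The (key, timestamp) → value mapping above is what makes snapshot reads possible. A minimal sketch, assuming an in-memory toy store (the class `MultiVersionStore` and its method names are made up for illustration and say nothing about Spanner's real storage format):

```python
import bisect
from collections import defaultdict


class MultiVersionStore:
    """Toy (key, timestamp) -> value map; not Spanner's actual storage."""

    def __init__(self):
        # For each key: parallel lists of timestamps (kept sorted) and values.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, key, timestamp, value):
        """Store a new version of `key` at `timestamp`."""
        idx = bisect.bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(idx, timestamp)
        self._values[key].insert(idx, value)

    def snapshot_read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, or None."""
        idx = bisect.bisect_right(self._timestamps.get(key, []), timestamp)
        return self._values[key][idx - 1] if idx > 0 else None


store = MultiVersionStore()
store.write("a", 10, "v1")
store.write("a", 20, "v2")
print(store.snapshot_read("a", 15))  # v1
```

A read at timestamp 15 sees the version written at 10 and ignores the one written at 20, which is why such reads need no locks.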
leader uniquely has:
- lock table: the state for two-phase locking
- transaction manager: for distributed transactions, across Paxos group
Directories and Placement
On top of the k/v map there is a bucketing abstraction called a directory, which is a set of contiguous keys that share a common prefix.
tablet: unlike a Bigtable tablet, a Spanner tablet is a container that may encapsulate multiple partitions of the row space
Movedir: a background task, not a single txn. It registers the fact that it is starting to move data, moves the data in the background, and then uses a transaction to atomically move the small remaining amount. (The unit actually moved is a fragment, not a whole directory.)
Data model
- schematized semi-relational tables
- a query language
- general-purpose transactions
Spanner’s data model is not purely relational, in that rows must have names.
hierarchies: declared in database schemas via INTERLEAVE IN, which tells the system about locality relationships between tables.
TrueTime
API:
- TT.now(): returns an interval [earliest, latest]
- TT.after(t): true if t has definitely passed
- TT.before(t): true if t has definitely not arrived
underlying time references: GPS and atomic clocks
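To make the three calls concrete, here is a rough sketch assuming a single local clock with a fixed error bound `epsilon`. The class names, the default `epsilon`, and the injectable `clock` parameter are assumptions for illustration; the real service derives its bound from the GPS and atomic clock references.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: local clock plus/minus an assumed error bound."""

    def __init__(self, epsilon=0.004, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty bound, in seconds
        self.clock = clock      # injectable so the sketch can be tested

    def now(self):
        """Return an interval that is assumed to contain the true time."""
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t):
        """True if t has definitely not arrived."""
        return self.now().latest < t
```

With `epsilon=1.0` and a clock stuck at 100.0, `now()` is [99.0, 101.0], so `after(98.5)` holds but `after(100.0)` does not: a timestamp inside the interval can be neither confirmed as past nor as future.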
Concurrency Control
two-phase commit generates a Paxos write for the prepare phase that has no corresponding Spanner client write.
transactions:
- read-write: (including Standalone writes)
- read-only: without locking, any replica that is sufficiently up-to-date
- snapshot-reads: read in the past, no locking, any replica that is sufficiently up-to-date
Paxos leader lease:
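timed leases: make leadership long-lived; a leader is elected by collecting lease votes
lease interval: starts when the leader discovers it has a quorum of lease votes, ends when it no longer has a quorum (some votes have expired)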
Smax: the maximum timestamp used by a leader.
two-phase commit: a protocol that keeps participants consistent; if it does not succeed, everything is rolled back
- prepare phase
- commit phase
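The two phases can be sketched as follows. This is a generic, in-memory illustration of two-phase commit, not Spanner's implementation (where each participant is a Paxos group and the state is logged through Paxos); `Participant` and `two_phase_commit` are hypothetical names.

```python
class Participant:
    """Toy participant that votes in prepare and then commits or rolls back."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Vote yes only if this participant is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every participant committed, False if all rolled back."""
    # Phase 1 (prepare): collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2 (commit): everyone voted yes, so commit everywhere.
        for p in participants:
            p.commit()
        return True
    # At least one "no" vote: roll back everywhere.
    for p in participants:
        p.rollback()
    return False
```

A single `Participant("b", can_commit=False)` in the list is enough to make the whole transaction roll back.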
RW txn:
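writes are buffered at the client until commit
wound-wait: avoids deadlock
both kinds of participant leader acquire write locks:
- non-coordinator participant leader: chooses a prepare timestamp and logs a prepare record through Paxos
- coordinator leader: skips the prepare phase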
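The wound-wait rule mentioned above fits in a few lines. A minimal sketch of the textbook rule, assuming each transaction carries a start timestamp and that a smaller timestamp means an older transaction (the function name and return strings are made up for illustration):

```python
def wound_wait(requester_ts, holder_ts):
    """Decide what happens when the requester wants a lock the holder owns.

    Smaller timestamp = older transaction.
    """
    if requester_ts < holder_ts:
        # An older requester "wounds" (aborts) the younger holder.
        return "wound holder"
    # A younger requester waits for the older holder to finish.
    return "wait"
```

Because only younger transactions ever wait for older ones, a cycle of waiting transactions cannot form, which is how the scheme avoids deadlock.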
RO txn:
execution flow:
- assign a timestamp sread
- execute the transaction’s reads as snapshot reads at sread.
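The simple choice is sread = TT.now().latest.
- single Paxos group: Define LastTS() to be the timestamp of the last committed write at a Paxos group. If there are no prepared transactions, sread = LastTS() is enough.
- multiple Paxos groups: the simple option is to skip negotiation between groups and use sread = TT.now().latest.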
Schema-Change Transactions
Discussion
Paxos Truetime consistency
strong consistency cross data centers
data model: not purely relational (but SQL can be used)
tablets are replicated; replicas are coordinated by Paxos
txns with multiple Paxos groups --- 2PC coordination
leader
What is the actual difference compared with a classical distributed database?
consistent versions of the data
read-only access to the data
the core idea: timestamps & version control
time mechanism
global-time consistency: assigned timestamps carry no uncertainty
commit time: an interval
given two txns, be able to tell which one actually happened before the other
Participant leader -> Transaction manager -> Paxos group
The three basic txn types provide external consistency; global timestamps keep regions in sync and give txns a definite order
Concurrency control: comes down to timestamp management
timestamp -> multi-version -> snapshot
Almost all of the work in Spanner revolves around the ordering of timestamps!
condition: multiple data centers
target: external consistency ~= linearizability
Two phase locking:
- growing phase: acquire lock
- shrinking phase: release lock
- 2PC: across nodes in a distributed system, global coordination
- 2PL: on one node, among multiple txns, acquiring and managing resources
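The growing and shrinking phases can be sketched as below. This is a single-process toy with exclusive locks only, assuming a plain dict as the lock table; `TwoPhaseLockingTxn` is a hypothetical name, and a real lock table would also handle shared locks and waiting.

```python
class TwoPhaseLockingTxn:
    """Toy transaction that obeys the two-phase locking rule."""

    def __init__(self, lock_table, name):
        self.lock_table = lock_table  # shared dict: key -> owning txn name
        self.name = name
        self.held = set()
        self.shrinking = False

    def acquire(self, key):
        """Growing phase: take a lock; return False on conflict."""
        if self.shrinking:
            raise RuntimeError("2PL violation: cannot acquire after releasing")
        owner = self.lock_table.get(key)
        if owner not in (None, self.name):
            return False  # held by another txn; caller decides to wait or abort
        self.lock_table[key] = self.name
        self.held.add(key)
        return True

    def release_all(self):
        """Shrinking phase: release every lock; no acquires are allowed after."""
        self.shrinking = True
        for key in self.held:
            del self.lock_table[key]
        self.held.clear()
```

Releasing everything at once, as `release_all` does, is the strict variant of the rule; the basic rule only requires that no lock is acquired after the first release.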
TrueTime: local clock -> global clock, which is essential for a globally distributed system because of its sync needs.
uncertainty interval [earliest, latest]: try to make it as small as possible (increase accuracy) -> shorter waits, so locks are held for less time -> higher efficiency
Thus, timestamps + TrueTime provide a globally accessible time service for applications around the world.
external-consistency invariant: if txn T2 starts after txn T1 commits, then s1 < s2 (s1, s2 are their commit timestamps)
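One way to see why this invariant can hold is commit wait: pick the commit timestamp s = TT.now().latest, then do not finish the commit until TT.after(s) is true. A rough sketch under simplifying assumptions (a single clock, a fixed error bound `epsilon`, and injectable `clock` and `sleep` functions so it can be run without real delays; all names here are hypothetical):

```python
import time


def tt_now(clock, epsilon):
    """Return (earliest, latest); epsilon is an assumed clock error bound."""
    t = clock()
    return t - epsilon, t + epsilon


def commit_with_wait(clock=time.monotonic, sleep=time.sleep, epsilon=0.004):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    _, s = tt_now(clock, epsilon)          # s = TT.now().latest
    while tt_now(clock, epsilon)[0] <= s:  # loop until TT.after(s) holds
        sleep(epsilon / 2)
    return s
```

Any transaction that starts after `commit_with_wait` returns sees a `TT.now().latest` greater than s, so its own commit timestamp is larger. The wait is roughly twice `epsilon`, which is why a smaller uncertainty interval directly shortens commits.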