取巧:既然是跑分,OB也在某种程度上利用了规则,以降低硬件成本,不过这是在FDR中明确披露的“All other tables are replicated into 3 replicas: one full replica, one data replica and one log replica that are distributed among database nodes. A full replica contains cardinality, in-memory increments (mutations) that could be checkpointed to disk and redo log of the corresponding table. A data replica contains cardinality, checkpoints of in-memory increments (mutations) and redo log (both from the full replica). A log replica contains redo log only.” 与正常的应用在网站核心业务上3副本5副本的部署方式不同,OB在tpcc的性能测试中,3个副本不是对等的,其中1个完整副本包含redolog,LSM内存表,SST数据,这里隐含的意思是这个副本会做compaction;第2个副本只包含redolog,SST数据,它应该是在主副本完成compaction之后从主副本直接copy sst;第三副本只包含redolog。因此这里可以分析出来:整个1557台节点的内存基本都可以用于事务读写,两个副本基本没有内存开销;整个1557台节点的硬盘,用于存储2份数据文件而不是3份。之所以可以这样,是因为tpcc并没有规定或者考察RTO时间,所以只要保证RPO为零就行了。
P1 脏读 (“Dirty read”): SQL-transaction T1 modifies a row. SQL- transaction T2 then reads that row before T1 performs a COMMIT. If T1 then performs a ROLLBACK, T2 will have read a row that was never committed and that may thus be considered to have never existed.
P2 不可重复读 (“Non-repeatable read”): SQL-transaction T1 reads a row. SQL- transaction T2 then modifies or deletes that row and performs a COMMIT. If T1 then attempts to reread the row, it may receive the modified value or discover that the row has been deleted.
P3 幻读 (“Phantom”): SQL-transaction T1 reads the set of rows N that satisfy some <search condition>. SQL-transaction T2 then executes SQL-statements that generate one or more rows that satisfy the <search condition> used by SQL-transaction T1. If SQL-transaction T1 then repeats the initial read with the same <search condition>, it obtains a different collection of rows.
通过依次禁止这三种异象,ANSI确定了4种标准隔离级别,如下表所示:
级别
P1(脏读)
P2(不可重复读)
P3(幻读)
Read Uncommitted
允许
允许
允许
Read Committed
禁止
允许
允许
Repeatable Read
禁止
禁止
允许
(Anomaly) Serializable
禁止
禁止
禁止
Note: The exclusion of these penomena or SQL-transactions executing at isolation level SERIALIZABLE is a consequence of the requirement that such transactions be serializable.
The execution of concurrent SQL-transactions at isolation level SERIALIZABLE is guaranteed to be serializable. A serializable execution is defined to be an execution of the operations of concurrently executing SQL-transactions that produces the same effect as some serial execution of those same SQL-transactions
两阶段提交(2 Phase Commit简称2PC)协议是用于在多个节点之间达成一致的通信协议,它是实现“有状态的”分布式系统所必须面对的经典问题之一。本文通过对比经典2PC协议,和Google工程实践的基础上,分析一种优化延迟的2PC协议。为了方便说明,本文主要针对分布式数据库中,跨域sharding的2PC方案的讨论。主要参考文献:Gray J, Lamport L. Consensus on transaction commit[J]. ACM Transactions on Database Systems (TODS), 2006, 31(1): 133-160.
我们无法通过简单的修改pop逻辑来避免“ABA问题”,因为这里的纠结之处在于要“锁定(取出)”top,就要使用CAS原子的将top改为它的next指针,但是要安全的引用top→next又要先“锁定(取出)”top。因此解决“ABA问题”的根本,在于保证pop逻辑引用top指针时,它不会被释放,而在释放节点内存空间时,严格保证不会再有其他人引用这个节点。Hazard Pointer(Michael M M. Hazard pointers: Safe memory reclamation for lock-free objects[J]. IEEE Transactions on Parallel and Distributed Systems, 2004, 15(6): 491-504.)技术是一个里程碑式的创新,真正在实际工程中解决“ABA问题”(memsql/oceanbase都在使用),Hazard Version是对Hazard Pointer在易用性上的改进,原理一致,详细介绍请参考《共享变量的并发读写》一文。