计算机工程与科学 ›› 2022, Vol. 44 ›› Issue (04): 571-583.
沈佳杰1,卢修文1,2,3,向望1,赵泽宇1,王新1,2,3
收稿日期:
2021-10-08
修回日期:
2021-12-03
接受日期:
2022-04-25
出版日期:
2022-04-25
发布日期:
2022-04-20
基金资助:
SHEN Jia-jie1,LU Xiu-wen1,2,3,XIANG Wang1,ZHAO Ze-yu1,WANG Xin1,2,3
Received:
2021-10-08
Revised:
2021-12-03
Accepted:
2022-04-25
Online:
2022-04-25
Published:
2022-04-20
摘要: 读写一致性算法被广泛部署到分布式存储系统,以保证读写数据的正确性。然而,读写一致性算法通常需要使用一个复杂的通信协议来保证多个节点读写数据的正确性,会带来较大网络传输开销和读写时延。由于各种读写一致性算法实现机制存在较大差异,特定的读写一致性算法往往需要部署到特定的存储应用场景,才能高效地执行数据读写操作,保障对其上应用的服务质量。因此,实际的存储系统开发过程中,开发人员往往需要根据存储应用场景选择读写一致性算法,从而减少数据读写操作带来的系统开销。为了明确各种读写一致性算法适合的应用场景,介绍了分布式存储系统中存在的读写一致性问题,并综述了当前读写一致性算法的实现机制。总结了在副本和纠删码2种存储机制下主流的读写一致性算法,比较了这些读写一致性算法在实现机制、网络开销和数据存储开销等方面的特性。在此基础上,结合了单数据中心分布式存储系统和跨数据中心云际存储系统2种经典的应用场景,总结了开发人员在实际存储系统中部署读写一致性算法过程中需要注意的要点,分析了亟需解决的问题和提升数据读写操作性能的可能途径,展望了读写一致性算法未来的发展方向。
沈佳杰, 卢修文, 向望, 赵泽宇, 王新, . 分布式存储系统读写一致性算法性能优化研究综述[J]. 计算机工程与科学, 2022, 44(04): 571-583.
SHEN Jia-jie, LU Xiu-wen, XIANG Wang, ZHAO Ze-yu, WANG Xin, . consensus algorithms for distributed storage system[J]. Computer Engineering & Science, 2022, 44(04): 571-583.
[1] | Shvachko K, Kuang H, Radia S, et al. The Hadoop distributed file system[C]∥Proc of IEEE Symposium on Mass Storage Systems & Technologies,2010:1-10. |
[2] | Abd-El-Malek M, Ganger G R,Goodson G R,et al.Fault-scalable Byzantine fault-tolerant services[C]∥Proc of the 20th ACM Symposium on Operating Systems Principles,2005:59-74. |
[3] | Castro M, Liskov B.Practical Byzantine fault tolerance and proactive recovery[J].ACM Transactions on Computer Systems,2002,20(4):398-461. |
[4] | Lampson B.The ABCD’s of Paxos[C]∥Proc of the 20th Annual ACM Symposium on Principles of Distributed Comput- ing,2001:1-25. |
[5] | Lamport L.The part-time parliament[J].ACM Transactions on Computer Systems,1998,16(2):133-169. |
[6] | Ongaro D, Ousterhout J K. In search of an understandable consensus algorithm[C]∥Proc of the 2014 USENIX Confe- rence on USENIX Annual Technical Conference,2014:305-320. |
[7] | Uluyol M,Huang A,Goel A,et al.Near-optimal latency versus cost trade-offs in geo-distributed storage[C]∥Proc of the 17th USENIX Symposium on Networked Systems Design and Implementation,2020:157-180. |
[8] | Ghemawat S,Gobioff H,Leung S T.The Google file system[C]∥Proc of the 19th ACM Symposium on Operating Systems Principles,2003:29-43. |
[9] | Samaras G,Britton K,Citron A,et al.Two-phase commit optimizations in a commercial distributed environment[J].Distributed & Parallel Databases,1995,3(4):325-360. |
[10] | Al-Houmaily Y J, Samaras G.Three-phase commit[M]∥Encyclopedia of Database System.New York:Springer,2016:3091-3097. |
[11] | Pritchett D. BASE:An ACID alternative:In partitioned databases,trading some consistency for availability can lead to dramatic improvements in scalability[J].Queue,2008,6(3):48-55. |
[12] | Lamport L. Paxos made simple[J].ACM SIGACT News (Distributed Computing Column),2001,32(4):18-25. |
[13] | Chandra T D,Griesemer R,Redstone J,et al.Paxos made live:An engineering perspective [C]∥Proc of the 26th ACM Symposium on Principles of Distributed Computing,2007:398-407. |
[14] | Lamport L.Generalized consensus and Paxos[R].Microsoft Research Technical Report MSR-TR-2005-33,Radmond:Microsoft,2005. |
[15] | Lamport L.Fast Paxos[J].Distributed Computing,2006,19(2):79-103. |
[16] | Lamport L,Malkhi D,Zhou L.Vertical Paxos and primary-backup replication[C]∥Proc of the 28th ACM Symposium on Principles of Distributed Computing,2009:312-313. |
[17] | Lamport L,Massa M.Cheap Paxos[C]∥Proc of International Conference on Dependable Systems and Networks,2004:1-9. |
[18] | Nawab F,Agrawal D,Abbadi A E.DPaxos:Managing data closer to users for low-latency and mobile applications[C]∥Proc of the 2018 International Conference on Management of Data,2018:1221-1236. |
[19] | Howard H,Malkhi D,Spiegelman A.Flexible Paxos:Quorum intersection revisited[J].arXiv:1608.06696v1,2016. |
[20] | Moraru I,Andersen D G,Kaminsky M.There is more consensus in egalitarian parliaments[C]∥Proc of the 24th ACM Symposium on Operating Systems Principles,2013:358-372. |
[21] | Benjamin R,Flavio P J.A simple totally ordered broadcast protocol[C]∥Proc of the 2nd Workshop on Large-Scale Distributed Systems and Middleware,2008:1-6. |
[22] | Junqueira F P,Reed B C,Serafini M.ZAB:High-performance broadcast for primary-backup systems[C]∥Proc of IEEE/IFIP International Conference on Dependable Systems & Networks,2011:245-256. |
[23] | Hunt P, Konar M,Junqueira F P,et al. ZooKeeper:Wait-free coordination for internet-scale systems[C]∥Proc of USENIX Annual Technical Conference,2010:1-14. |
[24] | Medeiros A. ZooKeeper’s atomic broadcast protocol:Theory and practice[EB/OL].[2021-05-07]. www.tcs-hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf. |
[25] | Brenner S,Wulf C,Goltzsche D,et al.SecureKeeper:confidential ZooKeeper using Intel SGX[C]∥Proc of the 17th International Middleware Conference,2016:1-13. |
[26] | Frommgen A, Haas S, Pfannemuller M, et al. Switching ZooKeeper’s consensus protocol at runtime[C]∥Proc of IEEE International Conference on Autonomic Computing,2017:81-82. |
[27] | Shi Bo-xuan, Zhang Feng, Jiang Wen-bao.Research on unified trust anchor model based on ZooKeeper[J].Application Research of Computers,2020,37(12):3722-3725.(in Chinese) |
[28] | Gray C, Cheriton D. Leases:An efficient fault-tolerant mechanism for distributed file cache consistency[C]∥Proc of the 12th ACM Symposium on Operating Systems Principles,1989:202-210. |
[29] | Moraru I,Andersen D G,Kaminsky M.Paxos quorum leases:Fast reads without sacrificing writes [C]∥Proc of the ACM Symposium on Cloud Computing,2014:1-13. |
[30] | Moraru I,Andersen D G,Kaminsky M.Quorum read leases[EB/OL].[2014-10-17].https://github.com/efficient/qlease. |
[31] | Decandia G,Hastorun D,Jampani M,et al.Dynamo:Amazon’s highly available key-value store[J].ACM SIGOPS Operating Systems Review,2007,41(6):205-220. |
[32] | Wang Hua-jin, Li Jian-hua,Shen Zhi-hong. Write latency model for NWR databases based on (n,r,k) fork-join queueing analysis[J].Application Research of Computers,2019,36(2):466-471.(in Chinese) |
[33] | Huang X,Wang J,Bai J,et al.Inherent replica inconsistency in Cassandra[C]∥Proc of IEEE International Congress on Big Data,2014:740-747. |
[34] | Zhu Tao, Guo Jin-wei, Zhou Huan,et al.Consistency and availability in distributed database systems[J].Journal of Software,2018,29(1):131-149.(in Chinese) |
[35] | Wang H J,Li J H,Zhang H M,et al.Benchmarking replication and consistency strategies in cloud serving databases:HBase and Cassandra [C]∥Proc of Workshop on Big Data Benchmarks,Performance Optimization,and Emerging Hardware,2014:71-82. |
[36] | Vora M N,Consultancy T.Hadoop-HBase for large-scale data[C]∥Proc of 2011 International Conference on Computer Science and Network Technology,2011:601-605. |
[37] | Lakshman A,Malik P.Cassandra[J].ACM SIGOPS Ope- rating Systems Review,2010,44(2):35-40. |
[38] | Glendenning L,Beschastnikh I,Krishnamurthy A,et al.Scalable consistency in Scatter[C]∥Proc of the 23rd ACM Symposium on Operating Systems Principles,2011:15-28. |
[39] | LevAri K,Boritnikov E,Keidar I,et al.Modular composition of coordination services[C]∥Proc of the 2016 USENIX Conference on USENIX Annual Technical Conference,2016:250-264. |
[40] | Lloyd W,Freedman M J,Kaminsky M,et al.Don't settle for eventual:Scalable causal consistency for wide-area storage with COPS[C]∥Proc of the 23rd ACM Symposium on Operating Systems Principles,2011:401-416. |
[41] | Ailijiang A,Charapko A,Demirbas M,et al.WPaxos:Wide area network flexible consensus[J].IEEE Transactions on Parallel and Distributed Systems,2020,31(1):211-223. |
[42] | Jeffrey H,Matthew B,Amit L,et al.Regular sequential serializability and regular sequential consistency[C]∥Proc of ACM SIGOPS the 28th Symposium on Operating Systems Principles,2021:163-179. |
[43] | Trencseni M, Gazso A,Reinhardt H.PaxosLease:Diskless Paxos for leases[EB/OL].[2021-07-10].http://arxiv.org/pdf/1209.4187.pdf. |
[44] | Yin J,Alvisi L,Dahlin M, et al.Volume leases for consistency in large-scale systems [J].IEEE Transactions on Knowledge and Data Engineering,1999,11(4):563-576. |
[45] | Lamport L,Malkhi D,Zhou L.Reconfiguring a state machine [J].ACM SIGACT News,2010,41(1):63-73. |
[46] | Burrows M. The Chubby lock service for loosely-coupled distributed systems[C]∥Proc of the 7th USENIX Confe- rence on Operating Systems Design and Implementation,2006:335-350. |
[47] | Li J,Michael E,Sharma N K,et al.Just say no to paxos overhead:Replacing consensus with network ordering[C]∥Proc of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI 16),2016:467-483. |
[48] | Mahmoud H,Nawab F,Pucher A,et al.Low-latency multi-datacenter databases using replicated commits[J].Proceedings of the VLDB Endowment,2013,6(9):661-672. |
[49] | Nawab F, Arora V,Agrawal D, et al.Minimizing commit latency of transactions in geo-replicated data stores[C]∥Proc of the 2015 International Conference on Management of Data,2015:1279-1294. |
[50] | Guo H,Larson P-A,Ramakrishnan R, et al.Relaxed currency and consistency:How to say “good enough” in SQL[C]∥Proc of ACM International Conference on Management of Data,2004:1-12. |
[51] | Baker J,Bond C,Corbett J C,et al.Megastore:Providing scalable,highly available storage for interactive services[C]∥Proc of the 5th Biennial Conference on Innovative Data System Research,2011:223-234. |
[52] | Corbett J C,Dean J,Epstein M,et al.Spanner:Google’s globally distributed database[J].ACM Transactions on Computer Systems,2013,31(3):8:1-8:22. |
[53] | Bradley C K,Frigo M,Paluskas J M,et al.Everyone loves file:File storage service (FSS) in Oracle cloud infrastructure[C]∥Proc of the 2019 USENIX Conference on USENIX Annual Technical Conference,2019:15-32. |
[54] | Agarwal S,Dunagan J,Jain N,et al.Volley:Automated data placement for geo-distributed cloud services[C]∥Proc of USENIX Symposium on Networked Systems Design and Implementation,2010:1-16. |
[55] | Liskov B,Ghemawat S,Gruber R,et al.Replication in the Harp file system[J]. ACM SIGOPS Operating Systems Review,1991,25(5):226-238. |
[56] | Mao Y,Junqueria F P,Marzullo K.Mencius:Building efficient replicated state machines for WANs [C]∥Proc of the 8th USENIX Conference on Operating Systems Design and Implementation,2008:369-384. |
[57] | The etcd storage systems[EB/OL].[2021-07-10].https://etcd.io/. |
[58] | Kraska T,Pang G,Franklin M J,et al.MDCC:Multi-data center consistency[C]∥Proc of the 8th ACM European Conference on Computer Systems,2013:1-12. |
[59] | Reed I S, Solomon G.Polynomial codes over certain finite fields[J].Journal of the Society for Industrial and Applied Mathematics,1960,8(2):300-304. |
[60] | Dimakis A G,Godfrey P B,Wu Y,et al.Network coding for distributed storage systems[J].IEEE Transactions on Information Theory,2010,56(9):4539-4551. |
[61] | Shum K W,Hu Y.Cooperative regenerating codes[J].IEEE Transactions on Information Theory,2013,59(11):7229-7258. |
[62] | Huang H,Simitci H,Xu Y K,et al.Erasure coding in windows azure storage[C]∥Proc of the 2012 USENIX Confe- rence on USENIX Annual Technical Conference,2012:15-26. |
[63] | Shen J,Li Y,Zhou Y, et al.Mobile cloud-of-clouds storage made efficient:A network coding based approach[C]∥Proc of IEEE Symposium on Reliable Distributed Systems,2018:72-82. |
[64] | Li R, Lin J, Lee P P C. Core:Augmenting regenerating- coding-based recovery for single and concurrent failures in distributed storage systems[C]∥Proc of IEEE Conference on Mass Storage Systems and Technologies,2013:1-6. |
[65] | Zhang G,Wu G,Wang S,et al.CaCo:An efficient Cauchy coding approach for cloud storage systems[J].IEEE Transactions on Computers,2016,65(2):435-447. |
[66] | Huang J,Liang X,Qin X,et al.PUSH:A pipelined reconstruction I/O for erasure-coded storage clusters[J].IEEE Transactions on Parallel and Distributed Systems,2014,26(2):516-526. |
[67] | Resch J K,Plank J S.AONT-RS:Blending security and performance in dispersed storage systems[C]∥Proc of the 9th USENIX Conference on File and Storage Technologies,2011:191-202. |
[68] | Li M,Qin C,Lee P P C,et al.Convergent dispersal:Toward storage-efficient security in a cloud-of-clouds[C]∥Proc of the 6th USENIX Workshop on Hot Topics in Storage and File Systems,2014:1-5. |
[69] | Li M,Qin C,Lee P P C.CDStore:Toward reliable,secure,and cost-efficient cloud storage via convergent dispersal[C]∥Proc of the 2015 USENIX Conference on USENIX Annual Technical Conference,2015:111-124. |
[70] | Tang H,Liu F,Shen G,et al.UniDrive:Synergize multiple consumer cloud storage services[C]∥Proc of the 16th Annual Middleware Conference,2015:137-148. |
[71] | Xiang L P,Xu Y L, Lui J C S,et al.Optimal recovery of single disk failure in RDP code storage systems[C]∥Proc of ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems,2010:119-130. |
[72] | Mu S, Chen K,Wu Y, et al.When Paxos meets erasure code:Reduce network and storage cost in state machine replication[C]∥Proc of the 23rd International ACM Symposium on High-Performance Parallel and Distributed Comput- ing,2014:61-72. |
[73] | Wang Z, Li T, Wang H, et al.CRaft:An erasure- coding-supported version of raft or reducing storage cost and network cost[C]∥Proc of the 18th USENIX Conference on File and Storage Technologies,2020:297-308. |
[74] | Lai F, Zhu X, Madhyastha H V, et al. Oort:Efficient federated learning via guided participant selection[C]∥Proc of the 15th USENIX Symposium on Operating Systems Design and Implementation,2021:19-35. |
[75] | Sathiamoorthy M,Asteris M,Papailiopoulos D,et al.XORing elephants:Novel erasure codes for big data[J].Proceedings of the VLDB Endowment,2013,6(5):325-336. |
[76] | Chen Y L,Mu S,Li J Y,et al.Giza:Erasure coding objects across global data centers [C]∥Proc of the 2017 USENIX Conference on USENIX Annual Technical Conference,2017:539-551. |
[77] | Chen H,Zhang H,Dong M,et al.Efficient and available in-memory KV-store with hybrid erasure coding and replication[J].ACM Transactions on Storage,2017,13(3):25:1-25:30. |
[78] | Shen Z,Lee P P.Cross-rack-aware updates in erasure-coded data centers[C]∥Proc of the 47th International Conference on Parallel Processing,2018:1-10. |
[79] | Shi H,Lu X,Shankar D,et al.UMR-EC:A unified and multi-rail erasure coding library for high-performance distributed storage systems [C]∥Proc of the International ACM Symposium on High-Performance Parallel and Distributed Computing,2019:219-230. |
[80] | Renesse V R,Altinbuken D.Paxos made moderately complex[J].ACM Computing Surveys,2015,47(3):1-16. |
[81] | Wang Jin-xi,Yan Shu-bin,Su Hao, et al.Heretic packaging of MEMS vapor cell of chip-scale cesium atomic clock[J].Micronanoelectronic Technology,2021,58(4):342-349.(in Chinese) |
附中文参考文献: | |
[27] | 史博轩,章峰,蒋文保.基于ZooKeeper的全网统一信任锚模型研究[J].计算机应用研究,2020,37(12):3722-3725. |
[32] | 王华进,黎建辉,沈志宏.基于(n,r,k) fork-join队列分析的NWR数据库写延时模型[J].计算机应用研究,2019,36(2):466-471. |
[34] | 朱涛,郭进伟,周欢,等.分布式数据库中一致性与可用性的关系[J].软件学报,2018,29(1):131-149. |
[81] | 王锦曦,闫树斌,苏浩,等.芯片级铯原子钟MEMS气室的气密性封装[J].微纳电子技术,2021,58(4):342-349. |
[1] | 康宇晗, 时洋, 陈照云, 文梅. 面向迈创+MatrixZone异构系统的深度学习编程框架[J]. 计算机工程与科学, 2023, 45(07): 1149-1158. |
[2] | 莫舒恒, 卢圣有, 黄聃, 卢宇彤. 基于即时编译的GNU Octave性能优化[J]. 计算机工程与科学, 2022, 44(12): 2091-2101. |
[3] | 李馨 谢长生 曹强. 自适应分布式存储系统设计[J]. J4, 2007, 29(9): 140-142. |
阅读次数 | ||||||
全文 |
|
|||||
摘要 |
|
|||||
湘公网安备 43010502000083号
湘ICP备10006030号
版权所有 © 《计算机工程与科学》 编辑部
地址:中国湖南省长沙市开福区德雅路109号(410073) 电话:0731-87002567 Email: jsjgcykx@vip.163.com
本系统由北京玛格泰克科技发展有限公司设计开发 技术支持:support@magtech.com.cn