[1] Brown T B,Mann B,Ryder N,et al.Language models are few-shot learners[C]∥Proc of the 34th International Conference on Neural Information Processing Systems,2020:1877-1901.
[2] Smith S,Patwary M,Norick B,et al.Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B,a large-scale generative language model[J].arXiv:2201.11990,2022.
[3] Shoeybi M,Patwary M,Puri R,et al.Megatron-LM:Training multi-billion parameter language models using model parallelism[J].arXiv:1909.08053,2019.
[4] OpenAI.Techniques for training large neural networks[EB/OL].[2022-06-09].https://openai.com/index/techniques-for-training-large-neural-networks.
[5] Narayanan D,Shoeybi M,Casper J,et al.Efficient large-scale language model training on GPU clusters using Megatron-LM[C]∥Proc of the International Conference for High Performance Computing,Networking,Storage and Analysis,2021:1-15.
[6] Rashidi S,Sridharan S,Srinivasan S,et al.ASTRA-SIM:Enabling SW/HW co-design exploration for distributed DL training platforms[C]∥Proc of 2020 IEEE International Symposium on Performance Analysis of Systems and Software,2020:81-92.
[7] Foley D,Danskin J.Ultra-performance Pascal GPU and NVLink interconnect[J].IEEE Micro,2017,37(2):7-17.
[8] Ugnius.Wafer-scale processors:The time has come[EB/OL].[2019-09-06].https://www.cerebras.net/blog/wafer-scale-processors-the-time-has-come/.
[9] Chun S R,Kuo T H,Tsai H Y,et al.InFO_SoW (system-on-wafer) for high performance computing[C]∥Proc of 2020 IEEE 70th Electronic Components and Technology Conference,2020:1-6.
[10] Lewington R.An AI chip with unprecedented performance to do the unimaginable[EB/OL].[2021-08-17].https://www.cerebras.net/blog/an-ai-chip-with-unprecedented-performance-to-do-the-unimaginable/.
[11] Talpes E,Williams D D,Sarma D D.DOJO:The microarchitecture of Tesla's exa-scale computer[C]∥Proc of 2022 IEEE Hot Chips 34 Symposium,2022:1-28.
[12] Klender J.Tesla's in-house Dojo chip teased by legendary engineer ahead of AI day[EB/OL].[2021-08-03].https://www.teslarati.com/tesla-dojo-chip-images-dennis-hong/.
[13] Jia Z,Tillman B,Maggioni M,et al.Dissecting the Graphcore IPU architecture via microbenchmarking[J].arXiv:1912.03413,2019.
[14] Hewitt C,Bishop P,Steiger R.A universal modular ACTOR formalism for artificial intelligence[C]∥Proc of the 3rd International Joint Conference on Artificial Intelligence,1973:235-245.
[15] Agha G A,Mason I A,Smith S F,et al.A foundation for actor computation[J].Journal of Functional Programming,1997,7(1):1-72.
[16] Virding R,Wikström C,Williams M.Concurrent programming in ERLANG (2nd ed.)[M].Hertfordshire:Prentice Hall International (UK) Ltd.,1996.
[17] Akka:A concurrency framework for Java and Scala[EB/OL].[2023-05-15].https://akka.io/.
[18] Bernstein P,Bykov S,Geller A,et al.Orleans:Distributed virtual actors for programmability and scalability[EB/OL].[2023-05-05].https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Orleans-MSR-TR-2014-41.pdf.
[19] Yuan J H,Li X Q,Cheng C,et al.OneFlow:Redesign the distributed deep learning framework from scratch[J].arXiv:2110.15032,2021.
[20] Srirama S N,Vemuri D.CANTO:An actor model-based distributed fog framework supporting neural networks training in IoT applications[J].Computer Communications,2023,199:1-9.
[21] Choquette J,Gandhi W,Giroux O,et al.NVIDIA A100 Tensor Core GPU:Performance and innovation[J].IEEE Micro,2021,41(2):29-35.
[22] Huang Y P,Cheng Y L,Bapna A,et al.GPipe:Efficient training of giant neural networks using pipeline parallelism[C]∥Proc of the 33rd International Conference on Neural Information Processing Systems,2019:103-112.