[1] Kubala F, Anastasakos T, Jin H, et al. Transcribing radio news[C]∥Proc of 4th International Conference on Spoken Language Processing, 1996: 598-601.
[2] Rybach D, Gollan C, Schlüter R, et al. Audio segmentation for speech recognition using segment features[C]∥Proc of 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, 2009: 4197-4200.
[3] Yeh C K, Chen J, Yu C, et al. Unsupervised speech recognition via segmental empirical output distribution matching[J]. arXiv:1812.09323v1, 2018.
[4] Ren Y, Ruan Y J, Tan X, et al. FastSpeech: Fast, robust and controllable text to speech[C]∥Proc of the 33rd International Conference on Neural Information Processing Systems, 2019: 3171-3180.
[5] Kreuk F, Keshet J, Adi Y. Self-supervised contrastive learning for unsupervised phoneme segmentation[J]. arXiv:2007.13465v2, 2020.
[6] van den Oord A, Li Y Z, Vinyals O. Representation learning with contrastive predictive coding[J]. arXiv:1807.03748, 2019.
[7] Baevski A, Zhou H, Mohamed A, et al. wav2vec 2.0: A framework for self-supervised learning of speech representations[C]∥Proc of the 34th International Conference on Neural Information Processing Systems, 2020: 12449-12460.
[8] Keshet J, Shalev-Shwartz S, Singer Y, et al. Phoneme alignment based on discriminative learning[C]∥Proc of the 9th European Conference on Speech Communication and Technology, 2005: 2961-2964.
[9] McAuliffe M, Socolof M, Mihuc S, et al. Montreal forced aligner: Trainable text-speech alignment using Kaldi[C]∥Proc of Interspeech, 2017: 498-502.
[10] King S, Hasegawa-Johnson M. Accurate speech segmentation by mimicking human auditory processing[C]∥Proc of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013: 8096-8100.
[11] Franke J, Mueller M, Hamlaoui F, et al. Phoneme boundary detection using deep bidirectional LSTMs[C]∥Proc of Speech Communication; 12. ITG Symposium, 2016: 1-5.
[12] Kreuk F, Sheena Y, Keshet J, et al. Phoneme boundary detection using learnable segmental features[C]∥Proc of 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020: 8089-8093.
[13] Kim H, Choi H-S. Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric[C]∥Proc of 2023 IEEE International Conference on Acoustics, Speech and Signal Processing, 2023: 1-5.
[14] Dusan S, Rabiner L. On the relation between maximum spectral transition positions and phone boundaries[C]∥Proc of the 9th International Conference on Spoken Language Processing, 2006: 645-648.
[15] Pereiro-Estevan Y P, Wan V, Scharenborg O. Finding maximum margin segments in speech[C]∥Proc of 2007 IEEE International Conference on Acoustics, Speech and Signal Processing, 2007: IV-937-IV-940.
[16] Almpanidis G, Kotropoulos C. Phonemic segmentation using the generalised Gamma distribution and small sample Bayesian information criterion[J]. Speech Communication, 2008, 50(1): 38-55.
[17] Räsänen O, Laine U, Altosaar T. Blind segmentation of speech using non-linear filtering methods[M]∥Speech Technologies. Rijeka, Croatia: InTech, 2011: 105-124.
[18] Michel P, Räsänen O, Thiollière R, et al. Blind phoneme segmentation with temporal prediction errors[C]∥Proc of Annual Meeting of the Association for Computational Linguistics, 2017: 62-68.
[19] Chorowski J, Ciesielski G, Dzikowski J, et al. Aligned contrastive predictive coding[J]. arXiv:2104.11946v3, 2021.
[20] Cuervo S, Grabias M, Chorowski J, et al. Contrastive prediction strategies for unsupervised segmentation and categorization of phonemes and words[C]∥Proc of 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022: 3189-3193.
[21] Cuervo S, Lancucki A, Marxer R, et al. Variable-rate hierarchical CPC leads to acoustic unit discovery in speech[C]∥Proc of International Conference on Neural Information Processing Systems, 2022: 34995-35006.
[22] Schneider S, Baevski A, Collobert R, et al. Wav2vec: Unsupervised pre-training for speech recognition[J]. arXiv:1904.05862v4, 2019.
[23] Gemello R, Albesano D, Mana F. Multi-source neural networks for speech recognition[C]∥Proc of the IEEE International Joint Conference on Neural Networks, 2000: 265-270.
[24] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]∥Proc of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[25] Räsänen O J, Laine U K, Altosaar T. An improved speech segmentation quality measure: The R-value[C]∥Proc of 10th Annual Conference of the International Speech Communication Association, 2009: 1851-1854.