Talks and presentations

Low-Resource Text Classification via Cross-Lingual Language Model Fine-Tuning

October 31, 2020

Poster paper, The Nineteenth China National Conference on Computational Linguistics (CCL 2020), Haikou, Hainan

Text classification is difficult when manually labeled text corpora are scarce. This is especially true for low-resource agglutinative languages such as Uyghur, Kazakh, and Kyrgyz (UKK languages), in which words are formed by concatenating a stem with several suffixes and stems carry most of the textual content; this morphology permits a practically unbounded vocabulary of derived forms, which leads to high variability in written forms and many redundant features. Low-resource agglutinative text classification therefore faces two major challenges: the lack of labeled data in the target domain and the morphological diversity of derivations in the language structure. Fine-tuning a pre-trained language model is an effective way to obtain meaningful, easy-to-use feature extractors for downstream text classification tasks. To this end, we propose AgglediFiT, a low-resource agglutinative language model fine-tuning approach. Specifically, we build a low-noise fine-tuning dataset via morphological analysis and stem extraction, then fine-tune a cross-lingual pre-trained model on this dataset. Moreover, we propose an attention-based fine-tuning strategy that better selects the relevant semantic and syntactic information from the pre-trained language model and applies those features to downstream text classification tasks. We evaluate our methods on nine Uyghur, Kazakh, and Kyrgyz classification datasets, on which they significantly outperform several strong baselines.
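
The attention-based feature selection described in the abstract can be pictured as a learned pooling over the pre-trained model's token representations. Below is a minimal PyTorch sketch of that general idea, not the authors' implementation; the class name `AttentiveClassifierHead`, the dimensions, and the toy inputs are all illustrative assumptions.

```python
# Minimal sketch: attention pooling over a pre-trained LM's token states
# for classification. Illustrative only; names and shapes are assumptions.
import torch
import torch.nn as nn

class AttentiveClassifierHead(nn.Module):
    """Pools token-level features from an encoder with learned attention."""
    def __init__(self, hidden_dim: int, num_classes: int):
        super().__init__()
        self.scorer = nn.Linear(hidden_dim, 1)          # relevance score per token
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_states: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden); mask: (batch, seq_len), 1 = real token
        scores = self.scorer(token_states).squeeze(-1)         # (batch, seq_len)
        scores = scores.masked_fill(mask == 0, float("-inf"))  # ignore padding
        weights = torch.softmax(scores, dim=-1)                # attention over tokens
        pooled = torch.einsum("bs,bsh->bh", weights, token_states)
        return self.classifier(pooled)

# Toy usage, with random features standing in for the LM's outputs:
head = AttentiveClassifierHead(hidden_dim=768, num_classes=5)
states = torch.randn(2, 16, 768)
mask = torch.ones(2, 16, dtype=torch.long)
print(head(states, mask).shape)  # torch.Size([2, 5])
```

The design point is that the softmax weights let the classifier emphasize the tokens (e.g., stems rather than suffix-heavy derived forms) that carry the most task-relevant signal, instead of relying on a single fixed sentence vector.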

YuQ: A Chinese-Uyghur Medical-domain Neural Machine Translation Dataset Towards Knowledge-driven

October 11, 2020

Oral presentation, The 16th China Conference on Machine Translation (CCMT 2020), Hohhot, Inner Mongolia

Recent advances in deep learning have delivered state-of-the-art performance in medical text analysis. However, deep neural networks (DNNs) require large amounts of training data with high-quality annotations, which are unavailable or expensive to obtain in the medical domain. Research on medical-domain neural machine translation (NMT) is largely limited by the lack of parallel sentences annotated with medical-domain background knowledge. To this end, we propose YuQ, a knowledge-driven Chinese-Uyghur NMT dataset grounded in medical-domain knowledge graphs. Our corpus contains 65K parallel sentences from the medical domain and 130K utterances. By introducing medical-domain glossary knowledge into the training model, we address the challenge of low translation accuracy on professional terms in Chinese-Uyghur machine translation. We provide several benchmark models, and ablation study results show that the models are enhanced by the introduction of domain knowledge.
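
As a concrete illustration of glossary injection, the sketch below shows one common way to expose term translations to an NMT model during training: annotating source sentences inline with the glossary's target-side terms so the model can learn to copy them. This is a hypothetical example, not the YuQ pipeline; the `annotate_with_glossary` function, the `<term>` tags, and the glossary entry are placeholder assumptions (the target gloss is written in English here purely for readability).

```python
# Hypothetical sketch of glossary-based term annotation for NMT training data.
# The YuQ paper grounds sentences in knowledge graphs; this shows only the
# general inline-annotation technique, with placeholder data.
from typing import Dict

def annotate_with_glossary(src: str, glossary: Dict[str, str]) -> str:
    """Insert the target-side translation after each glossary term found
    in the source sentence, wrapped in marker tags."""
    for term, translation in glossary.items():
        if term in src:
            src = src.replace(term, f"{term} <term> {translation} </term>")
    return src

glossary = {"青霉素": "penicillin"}  # placeholder entry, not real YuQ data
print(annotate_with_glossary("患者对青霉素过敏。", glossary))
# -> 患者对青霉素 <term> penicillin </term>过敏。
```

Annotations of this kind give the decoder direct access to the correct professional term at training time, which is one simple way to mitigate the terminology-accuracy problem the abstract describes.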