CLMAD is an open Chinese Language Model Adaptation Dataset. The dataset contains 740,000 news articles in 14 classes. Several preprocessing steps are applied to prepare the dataset for language model training. Documents are split into sentences at punctuation marks, and all punctuation is then removed. The ICTCLAS word segmentation tool is used to segment continuous character sequences into word sequences. The text of each class is split into a training set and a testing set. The testing set consists of 7,000 randomly selected sentences, and the training and testing sets do not overlap. Detailed comparative experiments on four selected domains (fashion, finance, sport, and stock) are reported in our paper "CLMAD: A Chinese Language Model Adaptation Dataset", Ye Bai, Jianhua Tao, Jiangyan Yi, Zhengqi Wen, Cunhang Fan, ISCSLP 2018 (submitted).
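The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the punctuation sets, the function names, and the de-duplication step are all assumptions, and segment() is a hypothetical character-level stand-in for the word segmentation tool named above.

```python
import random
import re

# Punctuation used as sentence boundaries (an assumed set; the dataset
# description does not list the exact characters).
SPLIT_PUNCT = "。！？；，、：!?;,:."
# Remaining punctuation removed after splitting (also an assumption).
OTHER_PUNCT = "“”‘’（）《》【】…—·\"'()[]<>"


def split_sentences(document):
    """Split a document at punctuation marks and drop all punctuation."""
    pieces = re.split("[" + re.escape(SPLIT_PUNCT) + "]", document)
    table = str.maketrans("", "", OTHER_PUNCT)
    cleaned = (piece.translate(table).strip() for piece in pieces)
    return [sentence for sentence in cleaned if sentence]


def segment(sentence):
    """Hypothetical placeholder: one token per character.

    Replace this with a call to a real Chinese word segmenter.
    """
    return [ch for ch in sentence if not ch.isspace()]


def split_train_test(sentences, test_size, seed=0):
    """Hold out `test_size` random sentences; the rest form the training set.

    Duplicates are removed first (an assumption) so that the two sets
    cannot share a sentence.
    """
    unique = list(dict.fromkeys(sentences))
    rng = random.Random(seed)
    held_out = set(rng.sample(range(len(unique)), min(test_size, len(unique))))
    train = [s for i, s in enumerate(unique) if i not in held_out]
    test = [s for i, s in enumerate(unique) if i in held_out]
    return train, test
```

For one class of documents, a caller would collect the output of split_sentences() over all documents, join the tokens returned by segment() with spaces, and then call split_train_test() with test_size=7000.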
The dataset is extended from the THUCNews text classification dataset. We thank the NLP lab of Tsinghua University for providing the THUCNews corpus, and Dr. Zhiyuan Liu for permitting us to extend it.
You can cite the data using the following BibTeX entry:

@inproceedings{yebai2018clmad,
  title={CLMAD: A Chinese Language Model Adaptation Dataset},
  author={Ye Bai and Jianhua Tao and Jiangyan Yi and Zhengqi Wen and Cunhang Fan},
  booktitle={The Eleventh International Symposium on Chinese Spoken Language Processing (ISCSLP 2018)},
  pages={To Appear},
  year={2018},
}