MADCAT Chinese data splits
Identifier: SLR50
Summary: Unofficial data splits (dev/train/test) for the MADCAT Chinese LDC corpus
Category: Other
License: Apache 2.0
Downloads (use a mirror closer to you):
madcat.dev.raw.lineid [725K] (dev set) Mirrors:
[US]
[EU]
[CN]
madcat.test.raw.lineid [734K] (test set) Mirrors:
[US]
[EU]
[CN]
madcat.train.raw.lineid [2.8M] (train set) Mirrors:
[US]
[EU]
[CN]
About this resource:
These are unofficial data splits for the MADCAT Chinese Pilot Training Set corpus (LDC2014T13).
LDC provides only the training data for this corpus, not the original dev/eval sets, so the original
training data have been split into three disjoint parts to allow performance to be evaluated in the usual way.
The parts are disjoint by document: there should be no sentences/lines from the same document in different
sets, as each document in the MADCAT data is handwritten/transcribed by a different author.
Also, please note that the license applies only to the splits. You still need to obtain the original databases
and respect the databases' license!
Each file lists MADCAT XML file names together with segment ids (s{1,2,3,4}). For example:
GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s1
GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s2
GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s3
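The following Python sketch shows one way the lineid files might be read and the document-level disjointness of the splits checked. It rests on two assumptions that are not confirmed by this page: that a file is a whitespace-separated sequence of (XML name, segment id) pairs, and that the XML file name can serve as the document key. The function names `parse_lineids`, `documents` and `shared_documents` are hypothetical helpers introduced for illustration only.

```python
def parse_lineids(text):
    """Parse lineid content into (xml_name, segment_id) pairs.

    Assumption: tokens alternate between an XML file name and a segment id,
    separated by any whitespace (spaces or newlines).
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected an even number of tokens (name/segment pairs)")
    return list(zip(tokens[0::2], tokens[1::2]))


def documents(pairs):
    """Return the set of XML file names (used here as the document key)."""
    return {xml_name for xml_name, _ in pairs}


def shared_documents(splits):
    """Map each pair of split names to the XML names they have in common.

    `splits` is a dict from split name to a list of (xml_name, segment_id)
    pairs. Split pairs with no overlap are left out of the result, so an
    empty dict means the splits are disjoint by document.
    """
    names = sorted(splits)
    overlaps = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            common = documents(splits[first]) & documents(splits[second])
            if common:
                overlaps[(first, second)] = common
    return overlaps


# The example entries from this page, parsed as three name/segment pairs.
sample = (
    "GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s1 "
    "GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s2 "
    "GMW_CMN_20070118.0014_001_LDC0632.madcat.xml s3"
)
sample_pairs = parse_lineids(sample)
```

To apply this to the downloaded files, read each of madcat.train.raw.lineid, madcat.dev.raw.lineid and madcat.test.raw.lineid with `open(path, encoding="utf-8").read()` (the encoding is also an assumption), pass the contents through `parse_lineids`, and hand the three results to `shared_documents`. An empty result would be consistent with the disjointness described above.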