MADCAT Arabic data splits
Identifier: SLR48
Summary: Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus
Category: Other
License: Apache 2.0
Downloads (use a mirror closer to you):
madcat.dev.raw.lineid [581K] (dev set) Mirrors: [US] [EU] [CN]
madcat.test.raw.lineid [602K] (test set) Mirrors: [US] [EU] [CN]
madcat.train.raw.lineid [9.9M] (train set) Mirrors: [US] [EU] [CN]
About this resource:
These are unofficial data splits for the MADCAT Arabic corpus (LDC2013T15, LDC2013T09, LDC2012T15).
LDC provides only training data for these corpora, not the original dev/eval sets, so the original
training data has been split into three disjoint parts to allow performance to be evaluated in the usual way.
Sentences/lines from the same document should not appear in different sets, since each document
in the MADCAT data is handwritten/transcribed by a different author.
Also, please note that the license applies only to the splits. You still need to obtain the original databases
and respect their licenses!
Each entry in a split file consists of the MADCAT XML file name and a segment id (s{1,2,3,4}). For example:
groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml s1
groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml s2
groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml s3
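
The following is a minimal Python sketch of how the split files might be read and checked for document-level disjointness. It assumes each file is a whitespace-separated sequence of (XML name, segment id) pairs, which is inferred from the example above and not from a published format specification. The function names `parse_lineids`, `load_lineids`, and `shared_documents` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict


def parse_lineids(text):
    """Group segment ids by MADCAT XML file name.

    Assumption: the content is a whitespace-separated sequence of
    (xml name, segment id) pairs. Reading tokens in pairs works whether
    the pairs sit one per line or several on the same line.
    """
    tokens = text.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected (xml name, segment id) pairs")
    segments = defaultdict(list)
    for name, segment_id in zip(tokens[0::2], tokens[1::2]):
        segments[name].append(segment_id)
    return dict(segments)


def load_lineids(path):
    """Read one split file (e.g. madcat.dev.raw.lineid) from disk."""
    with open(path, encoding="utf-8") as handle:
        return parse_lineids(handle.read())


def shared_documents(split_a, split_b):
    """Return the XML names that occur in both splits.

    An empty result means the two splits are disjoint at document level.
    """
    return set(split_a) & set(split_b)
```

With the three files downloaded, something like `shared_documents(load_lineids("madcat.train.raw.lineid"), load_lineids("madcat.test.raw.lineid"))` would be expected to return an empty set if the splits are disjoint as described. The segment ids only identify lines; the images and transcriptions themselves still have to come from the original LDC releases.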