openslr.org

Open Speech and Language Resources

MADCAT Arabic data splits

Identifier: SLR48

Summary: Unofficial data splits (dev/train/test) for the MADCAT Arabic LDC corpus

Category: Other

License: Apache 2.0

Downloads (use a mirror closer to you):
madcat.dev.raw.lineid [581K] (dev set) Mirrors: [US] [EU] [CN]
madcat.test.raw.lineid [602K] (test set) Mirrors: [US] [EU] [CN]
madcat.train.raw.lineid [9.9M] (train set) Mirrors: [US] [EU] [CN]

About this resource:

These are unofficial data splits for the MADCAT Arabic corpus (LDC2013T15, LDC2013T09, LDC2012T15). LDC provides only training data for these corpora, not the original dev/eval sets, so the original training data have been split into three disjoint parts to allow performance to be evaluated in the usual way. The split is made at the document level: there should be no sentences/lines from the same document in different sets, since each document in the MADCAT data is handwritten/transcribed by a different author.
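The disjointness property can be checked once the split files are downloaded. The following is only a minimal sketch: it assumes that each line of a split file starts with the document's XML file name, followed by whitespace, and the function names `documents_in` and `shared_documents` as well as the toy data are made up for illustration.

```python
from collections import defaultdict


def documents_in(lines):
    """Return the set of document names found in an iterable of split-file lines.

    Assumption: the document name is the first whitespace-separated field.
    """
    docs = set()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            docs.add(fields[0])
    return docs


def shared_documents(splits):
    """Map every document that occurs in more than one split to those splits.

    `splits` maps a split name (e.g. "train") to an iterable of lines.
    An empty result means the splits are disjoint at the document level.
    """
    seen = defaultdict(set)
    for split_name, lines in splits.items():
        for doc in documents_in(lines):
            seen[doc].add(split_name)
    return {doc: sorted(names) for doc, names in seen.items() if len(names) > 1}


# Toy data (invented, not taken from the real files): "a.madcat.xml" leaks
# from train into test, so it is reported.
toy = {
    "train": ["a.madcat.xml s1", "a.madcat.xml s2"],
    "dev": ["b.madcat.xml s1"],
    "test": ["a.madcat.xml s3"],
}
overlap = shared_documents(toy)
```

On the real data, each value of `splits` could be an open file handle for one of the three downloaded files, and the result would be expected to be empty.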

Also, please note that the license applies only to the splits. You still need to obtain the original databases and respect their license!

Each line of a split file contains the MADCAT XML file name and a segment ID (s{1,2,3,4}). For example:

	groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml s1
	groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml s2
	groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml s3
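Lines in this format can be read with a few lines of Python. This is a sketch under the assumption that the file name and the segment ID are separated by whitespace and that the segment ID is the last field. The helper names `parse_lineid` and `group_by_document` are hypothetical and not part of any MADCAT tooling.

```python
def parse_lineid(lines):
    """Yield (xml_name, segment_id) pairs from lines of a .lineid file.

    Assumption: the segment ID is the last whitespace-separated field and
    everything before it is the MADCAT XML file name.
    """
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:  # ignore blank lines
            continue
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            raise ValueError(f"line {lineno}: expected '<xml name> <segment id>'")
        yield parts[0], parts[1]


def group_by_document(lines):
    """Collect the segment IDs of each document, in file order."""
    grouped = {}
    for xml_name, segment_id in parse_lineid(lines):
        grouped.setdefault(xml_name, []).append(segment_id)
    return grouped


# The three example lines shown above.
name = "groups.google.com_women1000_508c404bd84f8ba3_ARB_20060426_124900_3_LDC0188.madcat.xml"
sample = [f"\t{name} s1", f"\t{name} s2", f"\t{name} s3"]
segments = group_by_document(sample)
```

For the sample above, `segments` maps the single XML file name to the list `["s1", "s2", "s3"]`. Grouping by document is convenient because the splits are defined at the document level.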