Joshua decoder | Indian Languages Parallel Corpora

This page describes a set of six parallel corpora obtained by translating popular Wikipedia documents in six languages from the Indian sub-continent into English. The languages are:

Bengali
Hindi
Malayalam
Tamil
Telugu
Urdu

The collection and release of this data are described in the following paper:

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Matt Post, Chris Callison-Burch, and Miles Osborne
WMT 2012

Download & License

The Indian parallel corpora dataset is hosted on GitHub. You can clone the repository or download a release tarball from it. The corpus is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).

Scores

Below are the best translation scores (case-insensitive BLEU-4) that have been reported on the provided test sets. The Google results were recorded in the fall of 2011 and are described in Post et al. (2012). Google did not have a Malayalam system at that time, so no Google score is listed for Malayalam.

Citation	BN	HI	ML	TA	TE	UR
Google	20.01	25.21	–	13.51	16.03	23.09
Post et al. (2012)	13.53	17.29	13.72	9.81	12.46	19.53
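
For readers who want to see what "case-insensitive BLEU-4" involves, here is a minimal sketch of a corpus-level computation. It is an illustration only, not the scoring script used for the numbers above: it assumes whitespace tokenization, lowercases both sides, uses the closest reference length for the brevity penalty, and applies no smoothing. The function names `ngrams` and `bleu4` are hypothetical, so scores from it may differ from those produced by standard evaluation tools.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(hypotheses, references_list):
    """Corpus-level, case-insensitive BLEU-4 on a 0-100 scale.

    hypotheses:      list of system output strings, one per segment
    references_list: list of lists of reference strings, one list per segment
    Assumes whitespace tokenization and no smoothing (illustrative only).
    """
    matches = [0] * 4   # clipped n-gram matches for n = 1..4
    totals = [0] * 4    # hypothesis n-gram counts for n = 1..4
    hyp_len = 0
    ref_len = 0

    for hyp, refs in zip(hypotheses, references_list):
        hyp_tok = hyp.lower().split()
        refs_tok = [r.lower().split() for r in refs]
        hyp_len += len(hyp_tok)
        # Closest reference length; ties resolve to the shorter reference.
        ref_len += min((abs(len(r) - len(hyp_tok)), len(r)) for r in refs_tok)[1]

        for n in range(1, 5):
            hyp_counts = ngrams(hyp_tok, n)
            # Maximum count of each n-gram over all references.
            max_ref = Counter()
            for r in refs_tok:
                max_ref |= ngrams(r, n)
            clipped = hyp_counts & max_ref
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any zero precision makes the geometric mean zero.
    if min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)
```

Because both sides are lowercased before counting, `bleu4(["The cat sat on the mat"], [["the cat sat on the mat"]])` treats the hypothesis as a perfect match despite the capital letter.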
