This project has retired. For details please refer to its Attic page.
Joshua decoder | Indian Languages Parallel Corpora

Datasets

Indian Parallel Languages

Download

This page describes a set of six parallel corpora, obtained by translating popular Wikipedia documents written in six languages of the Indian subcontinent into English. The languages are:
  • Bengali
  • Hindi
  • Malayalam
  • Tamil
  • Telugu
  • Urdu

The collection and release of this data are described in the following paper:

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing
Matt Post, Chris Callison-Burch, and Miles Osborne
WMT 2012

Download & License

The Indian parallel corpora dataset is hosted on GitHub. You can clone the repository, or download a release tarball by clicking the big green button above. The corpus is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License (CC BY-SA 3.0).

Scores

Below are the best translation scores (case-insensitive BLEU-4) reported on the provided test sets. The Google results were recorded in the fall of 2011 and are described in Post et al. (2012). Google did not have a Malayalam system at that time, so no Google score is listed for Malayalam.

Citation            BN     HI     ML     TA     TE     UR
Google              20.01  25.21  -      13.51  16.03  23.09
Post et al. (2012)  13.53  17.29  13.72  9.81   12.46  19.53
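
For readers who want to compare their own output against the scores above, the following is a minimal sketch of corpus-level, case-insensitive BLEU-4 with multiple references per segment. It is not the scoring script used for the reported numbers. The function names (`ngrams`, `corpus_bleu4`), the whitespace tokenization, and the closest-reference-length rule for the brevity penalty are assumptions made for illustration, so scores may differ slightly from those produced by other BLEU implementations.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu4(hypotheses, references_list, max_n=4):
    """Corpus-level, case-insensitive BLEU on a 0-100 scale.

    hypotheses:      list of system output strings, one per segment
    references_list: list of lists of reference strings, one list per segment
    Tokenization is plain whitespace splitting (an assumption of this sketch).
    """
    matches = [0] * max_n  # clipped n-gram matches, per order
    totals = [0] * max_n   # hypothesis n-gram counts, per order
    hyp_len = 0
    ref_len = 0

    for hyp, refs in zip(hypotheses, references_list):
        # Lowercasing both sides makes the score case-insensitive.
        hyp_tok = hyp.lower().split()
        refs_tok = [r.lower().split() for r in refs]

        hyp_len += len(hyp_tok)
        # Effective reference length: the reference closest in length
        # to the hypothesis, with ties broken toward the shorter one.
        ref_len += min((abs(len(r) - len(hyp_tok)), len(r)) for r in refs_tok)[1]

        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            max_ref = Counter()
            for r in refs_tok:
                max_ref |= ngrams(r, n)      # element-wise maximum over references
            clipped = hyp_counts & max_ref   # element-wise minimum (clipping)
            matches[n - 1] += sum(clipped.values())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    hyps = ["The cat sat on the mat today"]
    refs = [["the cat sat on the mat", "a cat was sitting on the mat"]]
    print(f"BLEU-4: {corpus_bleu4(hyps, refs):.2f}")
```

Counts are accumulated over the whole test set before the precisions are computed, which is how corpus-level BLEU differs from averaging per-sentence scores.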