Joshua Documentation | Getting Started

This page contains end-user oriented documentation for the 5.0 release of the Joshua decoder.

Download and Setup

Download Joshua from the command line:
```
wget -q http://cs.jhu.edu/~post/files/joshua-v5.0.tgz
```

Next, unpack it, set environment variables, and compile everything:

```
tar xzf joshua-v5.0.tgz
cd joshua-v5.0

# for bash
export JAVA_HOME=/path/to/java
export JOSHUA=$(pwd)
echo "export JOSHUA=$JOSHUA" >> ~/.bashrc

# for tcsh (startup file is ~/.cshrc, or ~/.tcshrc if you have one; not ~/.profile)
setenv JAVA_HOME /path/to/java
setenv JOSHUA `pwd`
echo "setenv JOSHUA $JOSHUA" >> ~/.cshrc

ant
```

(If you don’t know what to set $JAVA_HOME to, try /usr/java/default.)

If you have a Hadoop installation, make sure that the environment variable $HADOOP is set and points to it. If you don’t, Joshua will roll one out for you in standalone mode.
If you want to use Cherry & Foster’s batch MIRA tuner (recommended), you need to install Moses and define the $MOSES environment variable to point to the root of the Moses installation.
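
Before running `ant` or the pipeline, it can help to confirm that these variables point at real directories. The following is a minimal sketch, not part of Joshua: `check_dir_var` is a hypothetical helper name, and treating $HADOOP and $MOSES as optional follows the two notes above.

```sh
# Report whether an environment variable names an existing directory.
# Usage: check_dir_var NAME  (prints one status line; returns 0 only if OK)
# Hypothetical helper for illustration; it is not shipped with Joshua.
check_dir_var() {
    name=$1
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
        echo "$name is not set"
        return 1
    elif [ ! -d "$value" ]; then
        echo "$name points to a missing directory: $value"
        return 1
    fi
    echo "$name ok: $value"
}

# JAVA_HOME and JOSHUA are needed for the build;
# HADOOP and MOSES are optional, so a "not set" line for them is not an error.
for var in JAVA_HOME JOSHUA HADOOP MOSES; do
    check_dir_var "$var" || true
done
```

A "not set" line for $HADOOP only means Joshua will set up its own standalone Hadoop; a "not set" line for $MOSES means the batch MIRA tuner will not be available.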

Quick start

Our pipeline script is the quickest way to get started. The steps below train and test a complete model translating from Bengali to English.

First, download the Indian languages data:

```
wget --no-check-certificate -O indian-languages.tgz https://github.com/joshua-decoder/indian-parallel-corpora/tarball/master
tar xf indian-languages.tgz
ln -s joshua-decoder-indian-parallel-corpora-b71d31a input
```
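
The directory name in the `ln -s` command ends in a commit identifier, so a tarball fetched from `master` at a later date may unpack under a different name. As a hedged sketch, the name can be read from the archive itself instead of being typed by hand; `archive_top_dir` is a hypothetical helper, and it assumes the archive has a single top-level directory that appears as its first entry.

```sh
# Print the top-level directory of a tar archive.
# Assumption: the first archive entry lives under the single top-level directory.
archive_top_dir() {
    tar tf "$1" | awk -F/ 'NR == 1 { print $1 }'
}

if [ -f indian-languages.tgz ]; then
    top=$(archive_top_dir indian-languages.tgz)
    echo "archive unpacks to: $top"
    # Link it as "input" unless that name is already taken.
    [ -e input ] || [ -L input ] || ln -s "$top" input
fi
```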

Then, train and test a model:

```
$JOSHUA/bin/pipeline.pl --source bn --target en \
    --no-prepare --aligner berkeley \
    --corpus input/bn-en/tok/training.bn-en \
    --tune input/bn-en/tok/dev.bn-en \
    --test input/bn-en/tok/devtest.bn-en
```

This will align the data with the Berkeley aligner, build a Hiero model, tune with MERT, decode the test set, and report results that should correspond with what you find on the Indian Parallel Corpora page. For more details, including information on the many options available with the pipeline script, please see its documentation page.
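
Because a full run takes a while, it may be worth checking the inputs before launching it. The sketch below assumes that the `--corpus`, `--tune`, and `--test` arguments are file prefixes completed by the source and target language extensions (for example `training.bn-en.bn` and `training.bn-en.en`), and that sets with several references carry a numeric suffix such as `.en.0`. Both points are assumptions about the data layout, so check them against the pipeline documentation. `has_parallel_files` is a hypothetical helper, not a Joshua command.

```sh
# Succeed if PREFIX.SRC exists and at least one of PREFIX.TGT or PREFIX.TGT.* exists.
# Assumption: pipeline inputs are prefixes completed by language extensions.
has_parallel_files() {
    prefix=$1
    src=$2
    tgt=$3
    [ -f "$prefix.$src" ] || return 1
    # An unmatched glob stays literal and simply fails the -f test.
    for f in "$prefix.$tgt" "$prefix.$tgt".*; do
        [ -f "$f" ] && return 0
    done
    return 1
}

for p in input/bn-en/tok/training.bn-en \
         input/bn-en/tok/dev.bn-en \
         input/bn-en/tok/devtest.bn-en; do
    if has_parallel_files "$p" bn en; then
        echo "found: $p"
    else
        echo "missing: $p"
    fi
done
```

Any `missing:` line usually means the `input` symlink points at the wrong directory or the archive was not unpacked.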

More information

For more detail on the decoder itself, including its command-line options, see the Joshua decoder page. You can also learn more about other steps of the Joshua MT pipeline, including grammar extraction with Thrax and Joshua’s efficient grammar representation.

If you have problems or issues, you might find some help on our answers page or in the mailing list archives.

You can also create a bundled configuration, a minimal set of configuration, resource, and script files that is easy to transfer and share.