Use SGE cluster array job for inference

To speed up the inference, the neuralmonkey-run binary provides the --grid option, which can be used when running the program as a SGE array job.

The run script make use of the SGE_TASK_ID and SGE_TASK_STEPSIZE environment variables that are set in each computing node of the array job. If the --grid option is supplied and these variables are present, it runs the inference only on a subset of the dataset, specified by the variables.

Consider this example test_data.ini:

[main]
test_datasets=[<dataset>]
variables=["path/to/variables.data"]

[dataset]
class=dataset.load_dataset_from_files
s_source="data/source.en"
s_target_out="out/target.de"

If we want to run a model configured in model.ini on this dataset, we can do:

neuralmonkey-run model.ini test_data.ini

And the program executes the model on the dataset loaded from data/source.en and stores the results in out/target.de.

If the source file is large or if you use a slow inference method (such as beam search), you may want to split the source file into smaller parts and execute the model on all of them in parallel. If you have access to a SGE cluster, you don’t have to do it manually - just create an array job and supply the --grid option to the program. Now, suppose that the source file contains 100,000 sentences and you want to split it to 100 parts and run it on cluster. To accomplish this, just run:

qsub <qsub_options> -t 1-100000:1000 -b y \
"neuralmonkey-run --grid model.ini test_data.ini"

This will submit 100 jobs to your cluster. Each job will use its SGE_TASK_ID and SGE_TASK_STEPSIZE parameters to determine its part of the data to process. It then runs the inference only on the subset of the dataset and stores the result in a suffixed file.

For example, if the SGE_TASK_ID is 3, the SGE_TASK_STEPSIZE is 100, and the --grid option is specified, the inference will be run on lines 201 to 300 of the file data/source.en and the output will be written to out/target.de.0000000200.

After all the jobs are finished, you just need to manually run:

cat out/target.de.* > out/target.de

and delete the intermediate files. (Careful when your file has more than 10^10 lines - you need to concatenate the intermediate files in the right order!)