Joshua MT Tool
Joshua MT Tool is an open-source, parsing-based toolkit for statistical machine translation. Its authors report state-of-the-art translation performance on the French-English translation task.
History
Joshua is written in Java and implements all the essential algorithms for parsing-based translation: chart parsing, n-gram language model integration, beam and cube pruning, and k-best extraction. The toolkit also implements suffix-array grammar extraction and minimum error rate training, and it uses parallel and distributed computing techniques for scalability. Considerable effort has gone into making the toolkit easy to use and to extend. It has been used to translate roughly a million sentences in a parallel corpus for large-scale discriminative training experiments, with the aim of advancing research on syntax-based machine translation.
Goals
Joshua is designed around three major goals:
Extensibility: The Joshua code is organized into separate packages for each major aspect of functionality. This makes it clear which files contribute to a given function, so researchers can focus on a single package without worrying about the rest of the system. All extensible components are defined by Java interfaces.
End-to-end cohesion: The components of a machine translation pipeline are often designed by separate groups and have different file formats and interaction requirements. To reduce this friction, the Joshua toolkit integrates the most critical components of the pipeline. Each component can also be used as a stand-alone tool and does not rely on the rest of the toolkit.
Scalability: The decoder is designed to scale to large models and data sets. The techniques that contribute to this include parsing and pruning algorithms implemented with dynamic programming strategies and efficient data structures, suffix-array grammar extraction, parallel and distributed decoding, and Bloom filter language models.
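The idea behind the Bloom filter language models mentioned under Scalability can be illustrated with a minimal sketch. The class name `NgramBloomFilter`, the seeded FNV-1a hash, and the double-hashing scheme are assumptions chosen for illustration and are not taken from the Joshua code base. The sketch only answers whether an n-gram may have been seen; a real Bloom filter language model would also need to encode counts or probabilities.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Illustrative Bloom filter over n-gram strings (hypothetical, not Joshua's own class). */
final class NgramBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    NgramBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** FNV-1a style hash of the n-gram bytes; the seed is an assumption used to get a second hash. */
    private static int hash(String s, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    /** Derives the i-th bit position by double hashing: h1 + i * h2, reduced modulo the table size. */
    private int index(String ngram, int i) {
        int h1 = hash(ngram, 0);
        int h2 = hash(ngram, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    /** Records an n-gram by setting all of its bit positions. */
    void add(String ngram) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(ngram, i));
        }
    }

    /** Returns false only if the n-gram was never added; true may be a false positive. */
    boolean mightContain(String ngram) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(ngram, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        NgramBloomFilter filter = new NgramBloomFilter(1 << 16, 4);
        filter.add("the cat sat");
        filter.add("cat sat on");
        // An inserted n-gram is always reported as possibly present.
        System.out.println(filter.mightContain("the cat sat"));
        // An unseen n-gram is usually, but not always, reported as absent.
        System.out.println(filter.mightContain("sat on mat"));
    }
}
```

The memory saving comes from storing only bits rather than the n-gram strings themselves, at the cost of occasional false positives.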
Main functions implemented in the Joshua toolkit
Training Corpus Sub-sampling: Joshua makes use of a method proposed by Kishore Papineni to select the subset of the training data consisting of sentences useful for inducing a grammar to translate a particular test set.
Suffix-array Grammar Extraction: Joshua uses a source language suffix array to extract only those rules which will actually be used in translating a particular set of test sentences. This results in a vastly smaller rule set than techniques which extract all rules from the training set.
Decoding Algorithms: The decoder assumes a probabilistic synchronous context-free grammar (SCFG).
Language Models: Joshua implements three local n-gram language models.
Minimum Error Rate Training (MERT): Joshua's MERT optimizes parameter weights to maximize performance on a development set as measured by an automatic evaluation metric.
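The lookup step behind suffix-array grammar extraction can be sketched as follows. This is a simplified illustration, not Joshua's implementation: the class `SuffixArraySketch` and its methods are hypothetical, the suffix array is built by naive sorting, and only contiguous source phrases are located. A real extractor would also consult word alignments and the target side of the corpus to build translation rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Illustrative token-level suffix array over a source corpus (hypothetical class). */
final class SuffixArraySketch {
    private final String[] tokens;
    private final Integer[] suffixes; // start positions, sorted by the suffix that begins there

    SuffixArraySketch(String[] tokens) {
        this.tokens = tokens;
        this.suffixes = new Integer[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            suffixes[i] = i;
        }
        // Naive construction: sort positions by comparing suffixes token by token.
        Arrays.sort(suffixes, (a, b) -> compareSuffixes(a, b));
    }

    private int compareSuffixes(int a, int b) {
        while (a < tokens.length && b < tokens.length) {
            int c = tokens[a].compareTo(tokens[b]);
            if (c != 0) {
                return c;
            }
            a++;
            b++;
        }
        // The suffix that ran out of tokens first sorts first.
        return Integer.compare(tokens.length - a, tokens.length - b);
    }

    /** Compares the first phrase.length tokens of the suffix at pos with the phrase. */
    private int comparePrefix(int pos, String[] phrase) {
        for (int k = 0; k < phrase.length; k++) {
            if (pos + k >= tokens.length) {
                return -1; // suffix is a proper prefix of the phrase, so it sorts before it
            }
            int c = tokens[pos + k].compareTo(phrase[k]);
            if (c != 0) {
                return c;
            }
        }
        return 0; // the phrase is a prefix of this suffix
    }

    /** Returns the corpus positions where the phrase occurs, in ascending order. */
    List<Integer> find(String[] phrase) {
        // Binary search for the first suffix that is not smaller than the phrase.
        int lo = 0;
        int hi = suffixes.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (comparePrefix(suffixes[mid], phrase) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        // All matching suffixes are adjacent in the sorted array.
        List<Integer> hits = new ArrayList<>();
        for (int i = lo; i < suffixes.length && comparePrefix(suffixes[i], phrase) == 0; i++) {
            hits.add(suffixes[i]);
        }
        Collections.sort(hits);
        return hits;
    }

    public static void main(String[] args) {
        String[] corpus = "the cat sat on the mat the cat ran".split(" ");
        SuffixArraySketch index = new SuffixArraySketch(corpus);
        System.out.println(index.find(new String[] {"the", "cat"})); // prints [0, 6]
        System.out.println(index.find(new String[] {"dog"}));        // prints []
    }
}
```

Because only phrases that occur in the test sentences are looked up, rules are extracted on demand rather than for every phrase in the training corpus.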
References
- Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese and Omar Zaidan, 2009. Joshua: An Open Source Toolkit for Parsing-based Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT09).
External links
- Joshua home: http://joshua.sourceforge.net/Joshua/Welcome.html
See also
- Machine Translation
- Comparison of machine translation applications
Categories:
- Machine Translation Tool
- Statistical machine translation
- Phrase-based machine translation
Wikimedia foundation. 2010.