In all the discussions about Machine Translation, we seldom hear about post-editors and what could be done to improve the task of PEMT, which is often viewed negatively. Lucía Guerrero, Senior Translation and Localization Project Manager at CPSL, shares insights from her direct experience of making this kind of work more rewarding for post-editors.
Post-editing has become common practice when using MT. According to Common Sense Advisory (2016), more than 80% of LSPs offer Machine Translation Post-Editing (MTPE) services, and one of the main conclusions of a study presented by Memsource at the 2017 Conference of the European Association for Machine Translation (EAMT) is that less than 10% of MT carried out in Memsource Cloud is left unedited. While it is true that a lot of user-generated content is machine-translated without post-editing (you can see it every day on eBay, Amazon, and Airbnb, to name just a few examples), post-editors are still needed to improve the raw MT output, whether it comes from RBMT, SMT, or NMT.
Quantitative Evaluation Methods: Only Half the Picture
While this data shows that post-editors are key, linguists are often excluded from the MT process and are only brought in for the post-editing task itself, with no interaction “in process.” Human evaluation is still seen as “expensive, time-consuming and prone to subjectivity.” Error annotation takes a long time compared to automated metrics such as BLEU or WER, which are certainly cheaper and faster. These tools provide quantitative data, usually obtained by automatically comparing the raw MT output to a reference translation, but the post-editor’s evaluation is hardly ever taken into account. Shouldn’t that evaluation matter if the role of the post-editor is here to stay?
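To see why metrics like WER are so cheap and fast compared to human annotation, here is a minimal sketch of Word Error Rate: the word-level edit distance between a reference translation and the raw MT, divided by the reference length. This is a simplified illustration, not any particular toolkit’s implementation.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # Dynamic-programming table for Levenshtein distance over words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

A metric like this runs in milliseconds over thousands of segments, but it can only say *how much* the output differs from a reference, not *why*, which is exactly the gap the post-editor’s qualitative feedback fills.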
Although machines are better than we are at spotting differences, humans are better at assessing linguistic phenomena, categorising them, and providing detailed analysis.
Our approach at CPSL is to involve post-editors in three stages of the MT process:
- For testing an MT engine in a new domain or language combination
- For regular evaluation of an existing MT engine
- For creating/updating post-editing guidelines
Some companies use the Likert scale for collecting human evaluation. This method involves asking people – usually the end-users, rather than linguists – to assess raw MT segments one by one, based on criteria such as adequacy (how effectively has the source text message been transferred to the translation?) and fluency (does the segment sound natural to a native speaker of the target language?).
For our evaluation purposes, we find it more useful to ask the post-editor to fill in a form with their feedback, correlating information such as source segment, raw MT and post-edited segment, type and severity of errors encountered, and personal comments.
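A feedback form like the one described can be modelled as a simple per-segment record. The sketch below uses hypothetical field names (the actual CPSL form may differ) to show how individual feedback entries could be collected and then summarised into error counts for the MT provider:

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PostEditFeedback:
    """One row of a hypothetical post-editor feedback form."""
    source: str        # source segment
    raw_mt: str        # raw MT output
    post_edited: str   # segment after post-editing
    error_type: str    # e.g. "word order", "capitalisation"
    severity: str      # e.g. "minor", "major", "critical"
    comments: str = ""  # free-form remarks from the post-editor


def error_summary(records):
    """Count (error type, severity) pairs across all feedback records."""
    return Counter((r.error_type, r.severity) for r in records)
```

Aggregating the records this way turns scattered per-segment remarks into a ranked error profile that can be discussed with the MT provider.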
Turning Bad Experiences Into Rewarding Jobs
One of the main issues I often face when managing an MT-based project is the reluctance of some translators to work with machine-translated files because of previous bad experiences with post-editing. I have heard many stories about post-editors being paid based on an edit distance that was calculated from a test that was a long way from reality, or post-editors never being asked for their evaluation of the raw MT output. They were only asked to deliver the post-edited files and, sometimes, the time they spent on the job, but only for billing purposes. One of our regular translators even told me that he had received machine-translated files that were worse than the results from Google Translate (NMT had not yet been implemented). A common theme in all these stories is that post-editors are seldom involved in evaluating and improving the system. This can turn post-editing into an alienating job that nobody wants to do a second time.
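For readers unfamiliar with the mechanics, edit-distance-based rates are typically derived from a Levenshtein comparison between the raw MT segment and its post-edited version. A minimal sketch (the normalisation by the longer string is one common choice, used here for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance between two strings (iterative DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def edit_ratio(raw_mt: str, post_edited: str) -> float:
    """Edit distance normalised to 0.0 (unchanged) .. 1.0 (fully rewritten)."""
    if not raw_mt and not post_edited:
        return 0.0
    return levenshtein(raw_mt, post_edited) / max(len(raw_mt), len(post_edited))
```

The problem described above is not the formula itself but the sample: if the ratio is measured on an unrepresentative test, the resulting rate will not reflect the effort the real job demands.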
To avoid such situations, we decided to create our own feedback form for categorising errors, assessing their severity, and prioritising fixes. For example, errors such as incorrect capitalisation of months and days in Spanish, word-order problems in questions in English, and punctuation issues in French were given the highest priority by our post-editors, and our MT provider was asked to fix them immediately. The complexity of the evaluation document can vary according to need: it can be as detailed as the Dynamic Quality Framework (DQF) template or a simple list of the main errors with examples.
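Prioritising the categorised errors can be as simple as ranking them by severity and then by frequency. The severity labels and weights below are invented for illustration; a real workflow would use whatever scale the evaluation form defines:

```python
# Hypothetical severity weights: higher means "fix first"
SEVERITY_WEIGHT = {"critical": 3, "major": 2, "minor": 1}


def prioritise(errors):
    """Sort (category, severity, count) tuples, most urgent first.

    Severity dominates; frequency breaks ties within a severity level.
    """
    return sorted(errors,
                  key=lambda e: (SEVERITY_WEIGHT[e[1]], e[2]),
                  reverse=True)
```

With this ordering, a critical capitalisation error outranks a frequent but minor punctuation issue, matching the priority the post-editors assigned in the example above.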
(To be continued)