Using text mining techniques to maintain translation memories

author: Andraž Repar, Department of Knowledge Technologies, Jožef Stefan Institute
published: May 23, 2017,   recorded: April 2017
Description

In this paper, we explore the use of text mining techniques for translation memory maintenance. Language service providers often have large databases of translations, called translation memories, which have frequently been in use for a long time, leading to the translation memory gradually accumulating content from other domains (e.g. financial content being added to a technical-domain translation memory). To the best of our knowledge, no tools exist that can effectively separate the content of a translation memory by domain. The ability to extract individual domains from low-quality translation memories could significantly benefit language service providers looking to adopt modern translation methods, such as machine translation and automated terminology management.

In the first stage, we used OntoGen, a semi-automatic ontology building tool, to separate the segments in the translation memory according to domains. In the second stage, we tested whether OntoGen's topic keywords could serve as shortcuts for building classification models, the motivation being that manual annotation is costly and time-consuming. If the topics extracted with OntoGen are accurate enough, the manual annotation phase of text classification could potentially be skipped, significantly speeding up the process.

We successfully built an ontology of the translation memory, although the boundaries between some topics were relatively vague. One reason for this is that we had to deal with individual sentences, which are more difficult to classify than larger blocks of text. Nevertheless, the results of the ontology creation were promising, with manual evaluation showing that around 4 in 5 strings were assigned a correct label.
The results of the second stage were less clear: accuracy improved significantly over the majority-class classifier, but did not reach levels that would be deemed useful in a professional language service provider environment.
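The second-stage idea can be sketched as a simple weak-labeling step: segments are assigned the domain whose topic keywords they overlap most, in place of manual annotation. This is only an illustrative sketch, not OntoGen's actual interface or output; the domain names and keyword lists below are hypothetical examples.

```python
# Sketch of weak labeling with topic keywords (hypothetical domains/keywords,
# not OntoGen's actual output): each translation-memory segment receives the
# domain whose keyword set it overlaps most, or None if nothing matches.

TOPIC_KEYWORDS = {
    "technical": {"voltage", "sensor", "firmware", "calibration"},
    "financial": {"invoice", "payment", "interest", "balance"},
}

def weak_label(segment):
    """Return the best-matching domain for a segment, or None if no overlap."""
    tokens = {word.strip(".,;:!?") for word in segment.lower().split()}
    scores = {domain: len(tokens & kws) for domain, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None  # None = leave unlabeled

segments = [
    "Check the sensor voltage before calibration.",
    "The invoice payment is due with interest.",
    "Thank you for your cooperation.",
]
print([weak_label(s) for s in segments])
# → ['technical', 'financial', None]
```

Segments labeled this way could then be used as training data for a standard text classifier, which is where the gap between the promising stage-one results and the weaker stage-two accuracy becomes visible.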
