Automatic detection and aggregation of name variants from large multi-lingual document collections
published: Nov. 26, 2007, recorded: October 2007, views: 4008
Most Named Entity Recognition software will recognize Vladimir Putin, Wladimir Putin and Vladimir Poutine as named entities. For some applications, however, it is necessary to mark them all as variants of the same person. At the Joint Research Centre we address this problem by merging the name variants gathered from a news corpus as part of the Europe Media Monitor (EMM) system. Such a highly multilingual system (35,000 news articles per day in more than 30 languages) is very likely to encounter various spellings that refer to the same person. For example, within 15 hours EMM found 10 variants referring to Abdullah Gul (Turkish foreign minister): Abdullah Gül / Abdullah Gul / Абдуллах Гюл / Abdulá Gül / Abdullah Guel / Abdullah Gulas / Абдула Гюл / لگ للهادبع / لوغ للها دبع / Абдуллаха Гюл.

Our approach consists of extracting names from a multilingual corpus and then merging very similar names in our repository. The steps are:
- guessing new names using light, language-specific resources;
- storing names in a repository;
- looking up known names, including some variants for languages that decline proper names. For example, in Polish we can encounter the following sentence, which contains a declension of the proper name Tony Blair: Brown tymczasem wybrał skromne wakacje w ojczyźnie chcąc wyraźnie odróżnić się od Tony’ego Blaira.
- computing the similarity of names, including names written in different alphabets, in order to merge two variants as belonging to the same person.

In NewsExplorer the system automatically adds 450 new names per day; 10% are automatically recognized as variants of existing person names, and 9% are possible variants to be validated by an expert. These figures highlight the importance of such a system when dealing with multilingual information. The collected variants of person names are available on our public website: http://press.jrc.it/NewsExplorer
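The similarity and merging step can be illustrated with a minimal sketch. It is not the EMM or NewsExplorer implementation: the transliteration table, the use of Python's difflib ratio as the similarity measure, the 0.8 threshold, and the function names normalize, similarity and group_variants are all assumptions chosen for illustration. The table covers only the Cyrillic letters needed for the examples above and does not handle Arabic script.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny Cyrillic-to-Latin table (assumption for this
# sketch; a real system would need full, language-aware transliteration rules).
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "г": "g", "д": "d", "л": "l",
    "у": "u", "х": "h", "ю": "u",
}


def normalize(name: str) -> str:
    """Lowercase, transliterate the known Cyrillic letters, strip diacritics."""
    lowered = name.lower()
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in lowered)
    decomposed = unicodedata.normalize("NFKD", latin)
    # Drop combining marks so that "ü" becomes "u" and "á" becomes "a".
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] computed on the normalized forms."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def group_variants(names, threshold=0.8):
    """Greedily attach each name to the first group whose first member
    (used as the group's representative) is similar enough."""
    groups = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group matched: start a new one.
            groups.append([name])
    return groups


if __name__ == "__main__":
    names = [
        "Abdullah Gül", "Abdullah Gul", "Абдуллах Гюл", "Abdulá Gül",
        "Vladimir Putin", "Wladimir Putin", "Vladimir Poutine",
    ]
    for group in group_variants(names):
        print(" / ".join(group))
```

With these inputs the sketch places the four Gül spellings in one group and the three Putin spellings in another. A fixed threshold is a simplification: in the workflow described above, borderline matches would go to an expert for validation rather than being merged automatically.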
Download slides: mmdss07_pouliquen_ada_01.ppt (2.9 MB)