Automatic detection and aggregation of name variants from large multi-lingual document collections
Description
Most named entity recognition software will recognise Vladimir Putin, Wladimir Putin and Vladimir Poutine as named entities. For some applications, however, it is also necessary to mark them all as variants of the same person's name. At the Joint Research Centre we address this problem by merging the name variants gathered from a news corpus as part of the Europe Media Monitor (EMM) system. Such a highly multilingual system (35,000 news articles per day in more than 30 languages) is very likely to encounter various spellings that refer to the same person. For example, within 15 hours EMM found 10 variants referring to Abdullah Gul (Turkish foreign minister): Abdullah Gül / Abdullah Gul / Абдуллах Гюл / Abdulá Gül / Abdullah Guel / Abdullah Gulas / Абдула Гюл / عبدالله گل / عبد الله غول / Абдуллаха Гюл.
Our approach consists of extracting names from a multilingual corpus and then merging very similar names in our repository. The steps are:
- guessing new names using lightweight language-specific resources
- storing names in a repository
- looking up known names, including inflected variants in languages that decline proper names. For example, the following Polish sentence contains a declined form of the proper name Tony Blair: Brown tymczasem wybrał skromne wakacje w ojczyźnie chcąc wyraźnie odróżnić się od Tony’ego Blaira. (Meanwhile, Brown chose a modest holiday in his homeland, clearly wanting to set himself apart from Tony Blair.)
- computing the similarity of names, including names written in different alphabets, in order to merge two variants as belonging to the same person.
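The last step can be illustrated with a minimal Python sketch. It is not the EMM implementation: the tiny `CYRILLIC_TO_LATIN` table, the `normalise` and `similarity` helpers, and the use of the standard library's `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. The sketch transliterates a few Cyrillic letters, strips diacritics, and then compares the normalised strings.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny transliteration table: it covers only the
# Cyrillic letters needed for the example names below.
CYRILLIC_TO_LATIN = {
    "а": "a", "б": "b", "г": "g", "д": "d",
    "л": "l", "у": "u", "х": "h", "ю": "yu",
}


def normalise(name: str) -> str:
    """Lowercase, transliterate, strip diacritics, and collapse whitespace."""
    lowered = name.lower()
    latin = "".join(CYRILLIC_TO_LATIN.get(ch, ch) for ch in lowered)
    # NFKD splits "ü" into "u" plus a combining mark, which is then dropped.
    decomposed = unicodedata.normalize("NFKD", latin)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.split())


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1] for two names after normalisation."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


if __name__ == "__main__":
    reference = "Abdullah Gul"
    for variant in ["Abdullah Gül", "Abdullah Guel", "Абдуллах Гюл", "Vladimir Putin"]:
        print(f"{variant!r}: {similarity(reference, variant):.2f}")
```

A real system would need full transliteration tables per script and a similarity measure tuned on known variant pairs; the sketch only shows why normalising before comparing lets variants in different alphabets score as near-identical.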
In NewsExplorer, the system automatically adds 450 new names per day; 10% are automatically recognised as variants of existing person names, and 9% are possible variants to be validated by an expert. These figures highlight the importance of such a system when dealing with multilingual information.
The collected variants of person names are available on our public website: http://press.jrc.it/NewsExplorer
| Slides | |
| 0:00 | Automatic detection and aggregation of name variants from large multilingual document collections |
| 0:23 | Introduction: Translation of proper names |
| 4:28 | Introduction (2) |
| 6:27 | Context |
| 6:59 | Context, observation |
| 8:15 | Context: Europe Media Monitor |
| 9:04 | Lookup known names in text |
| 12:07 | Example |
| 13:26 | Guessing unknown names |
| 15:59 | Name Knowledge base |
| 17:03 | Adding name variants from web sources |
| 18:18 | Merging Name Variants |
| 18:53 | Transliteration |
| 20:38 | Normalisation (1) |
| 21:44 | Normalisation (2) |
| 23:35 | Similarity measure (1) |
| 27:37 | Similarity measure (2) |
| 29:01 | Comparing strings |
| 30:26 | Merging names, some results: (1) |
| 31:12 | Merging names, some results: (2) |
| 35:12 | Name entity recognition and merging: example of use |
| 35:23 | Person name recognition – Result |
| 38:57 | Highlighting, Cross-lingual Glossing |
| 39:48 | OSInt |
| 40:12 | Cross-language document similarity |
| 40:36 | NewsExplorer - Cross-lingual cluster linking |
| 42:34 | Social networks build out of names in news |
| 43:28 | Social networks: statistical |
| 44:09 | Links between two names |
| 44:11 | Live Social Networks |
| 44:45 | Social network, visualisation |
| 44:51 | Relation Extraction (1) |
| 45:48 | Relation Extraction (2) |
| 46:01 | Quotations |
| 46:36 | Quotation link: from a person about another entity |
| 47:15 | Quotation network: |
| 48:19 | Future work |
| 51:14 | Conclusion |
| 52:59 | Bibliography (1) |
| 54:11 | Bibliography (2) |
| 60:57 | Questions |


