Populating the Semantic Web by Macro-Reading Internet Text
published: Nov. 24, 2009, recorded: October 2009, views: 1019
Report a problem or upload filesIf you have found a problem with this lecture or would like to send us extra material, articles, exercises, etc., please use our ticket system to describe your request and upload the data.
Enter your e-mail into the 'Cc' field, and we will keep you updated with your request's status.
A key question to the future of the semantic web is "how will we acquire structured information to populate the semantic web on a vast scale?" One approach is to enter this information manually. A second approach is to take advantage of the great deal of structured information already present in various databases, and to develop common ontologies, publishing standards, and reward systems to make this data widely accessible. We consider here a third approach: developing software that automatically extracts structured information from unstructured text present on the web.
This talk will survey attempts to extract structured knowledge from unstructured text, and will focus on an approach with three characteristics that we hypothesize make it viable. First, in contrast to the very difficult problem of reading information from a single document, we consider the much easier problem of reading hundreds of millions of documents simultaneously, so that our system can extract facts that are stated many times by combining evidence from many documents. Second, our system begins with a given ontology that defines the types of information to be extracted, enabling it to focus its effort and to ignore most of the text which is irrelevant to the target ontology. Third, the system uses a new class of semi-supervised learning algorithms to learn how to extract information from web pages -- algorithms designed to achieve greater accuracy when given more complex ontologies. Our experiments show that this approach can produce knowledge bases containing tens of thousands of facts to populate given ontologies with approximately 90% accuracy, starting with only a handful of labeled training examples and 200 million unlabeled web pages.
Download slides: iswc09_mitchell_ptsw_01.pdf (2.5 MB)
Download slides: iswc09_mitchell_ptsw_01.ppt (2.2 MB)
Link this pageWould you like to put a link to this lecture on your homepage?
Go ahead! Copy the HTML snippet !