Carnegie Mellon Machine Learning Lunch seminar

Exploiting document structure and feature hierarchy for semi-supervised domain adaptation

author: Andrew Arnold, Machine Learning Department; School of Computer Science; Carnegie Mellon University

Description

In this work we try to bridge a gap that researchers often encounter: they have few or no labeled examples from their desired target domain, yet have access to large amounts of labeled data from related but distinct source domains, with seemingly no way to transfer knowledge from one to the other.

Experimentally, we focus on the problem of extracting protein mentions from academic publications in the field of biology, where the source domain data are abstracts labeled with protein mentions, and the target domain data are wholly unlabeled captions. We mine the large number of full-text biology articles freely available on the Internet to supplement the limited amount of annotated data available.

By exploiting the explicit and implicit common structure of the different subsections of these documents, including the unlabeled full text, we are able to generate robust features that are insensitive to changes in marginal and conditional distributions of classes and data across domains. We supplement these domain-insensitive features with automatically obtained high-confidence positive and negative predictions on the target domain to learn extractors that generalize well from one section of a document to another. Similarly, we develop a novel hierarchical prior structure over the features motivated by the common structure of feature spaces for this task across natural language data sets. Finally, lacking labeled target testing data, we employ comparative user preference studies to evaluate the relative performance of the proposed methods with respect to existing baselines.
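
As a rough illustration of the idea of keeping only high-confidence positive and negative predictions on the target domain, the following minimal sketch splits candidate mentions by score. The function name, the score list, and both thresholds are hypothetical and chosen only for illustration; the talk does not specify how confidence is measured or which cutoffs are used.

```python
from typing import List, Tuple


def select_confident(
    scores: List[float],
    pos_threshold: float = 0.9,  # assumed cutoff, not taken from the talk
    neg_threshold: float = 0.1,  # assumed cutoff, not taken from the talk
) -> Tuple[List[int], List[int]]:
    """Return indices of confidently positive and confidently negative candidates.

    `scores` stands in for per-candidate probabilities produced by an
    extractor trained on the source domain (for example, abstracts).
    Candidates scoring between the two thresholds are left unlabeled.
    """
    if neg_threshold >= pos_threshold:
        raise ValueError("neg_threshold must be below pos_threshold")
    positives = [i for i, s in enumerate(scores) if s >= pos_threshold]
    negatives = [i for i, s in enumerate(scores) if s <= neg_threshold]
    return positives, negatives


# Hypothetical scores for five candidate mentions in target-domain captions.
scores = [0.97, 0.55, 0.03, 0.91, 0.10]
positives, negatives = select_confident(scores)
```

The selected indices could then serve as automatically labeled target-domain examples when retraining an extractor, while the uncertain middle band is ignored.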

Slides
0:00 Exploiting document structure and feature hierarchy for semi-supervised domain adaptation
0:34 Domain: Biological publications
0:48 Problem: Protein-name extraction
1:13 The Problem
2:53 Motivation
3:10 What we are able to do:
3:45 What we would like to be able to do:
4:59 What we’d like to be able to do:
5:52 State-of-the-art features: Lexical
7:03 Feature Hierarchy
7:56 State-of-the-art features: Lexical
8:01 Feature Hierarchy
9:02 Hierarchical prior model (HIER)
10:26 Data
11:29 Experiments
13:15 Results: Intra-genre, same-task transfer (1)
14:12 Results: Intra-genre, same-task transfer (2)
14:51 Results: Baselines vs. HIER
15:42 Conclusions
16:42 Transfer across document structure:
17:20 Sample biology paper
19:39 Structural frequency features (1)
20:30 Structural frequency features (2)
22:28 Snippets
24:42 Sample biology paper
25:58 Data
27:39 Performance: abstract --> abstract (1)
28:45 Performance: abstract --> abstract (2)
30:11 Performance: abstract --> captions
31:59 Conclusions
34:03 Thank you!
