Webpage Understanding: an Integrated Approach

Published on 2007-09-14, 6606 views

Jun Zhu

Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text contents

Presentation

Webpage Understanding: an Integrated Approach 00:03

Outline 00:32

Motivating Examples 00:50

Characteristics of Webpage 02:10

Tasks of Web Data Extraction 03:39

slide 6 04:49

Existing Attempts – De-coupled Approaches 04:57

Disadvantages 05:36

Why no integrated approach? 06:16

Statistical Web Structure Mining Model (KDD 2006) 07:33

Integrated Webpage Understanding Model 08:50

Factorized Distribution 09:59

Separate Learning 13:04

Experiments 13:33

Extraction Accuracy 14:20

NP-Chunking Features 15:02

Conclusions & Future Work 15:35