Webpage Understanding: an Integrated Approach
Description
Recent work has shown the effectiveness of leveraging layout and tag-tree structure for segmenting webpages and labeling HTML elements. However, how to effectively segment and label the text content inside HTML elements remains an open problem. Because much of the text on a webpage consists of fragments that are not strictly grammatical, traditional natural language processing techniques, which typically expect grammatical sentences, are not directly applicable. In this paper, we examine how to use layout and tag-tree structure in a principled way to help understand the text content of webpages. We propose to segment and label the page structure and the text content of a webpage in a joint discriminative probabilistic model. In this model, semantic labels of the page structure can be leveraged to help understand the text content, and semantic labels of the text phrases can be used in page structure understanding tasks such as data record detection. Thus, integrating page structure understanding with text content understanding leads to an integrated solution to webpage understanding. Experimental results on research homepage extraction show the feasibility and promise of our approach.
Slides
| Time | Slide title |
| 0:03 | Webpage Understanding: an Integrated Approach |
| 0:32 | Outline |
| 0:50 | Motivating Examples |
| 2:10 | Characteristics of Webpage |
| 3:39 | Tasks of Web Data Extraction |
| 4:49 | slide 6 |
| 4:57 | Existing Attempts – De-coupled Approaches |
| 5:36 | Disadvantages |
| 6:16 | Why no integrated approach? |
| 7:10 | Outline |
| 7:33 | Statistical Web Structure Mining Model (KDD 2006) |
| 8:50 | Integrated Webpage Understanding Model |
| 9:59 | Factorized Distribution |
| 13:04 | Separate Learning |
| 13:26 | Outline |
| 13:33 | Experiments |
| 14:20 | Extraction Accuracy |
| 15:02 | NP-Chunking Features |
| 15:35 | Conclusions & Future Work |