Boilerplate Detection Using Shallow Text Features

author: Christian Kohlschütter, L3S Research Center, Leibniz University of Hannover
published: Oct. 7, 2010, recorded: February 2010
Description

In addition to the actual content, Web pages consist of navigational elements, templates, and advertisements. This boilerplate text is typically unrelated to the main content and may reduce search precision, so it needs to be detected properly. In this paper, we analyze a small set of shallow text features for classifying the individual text elements in a Web page. We compare the approach to complex, state-of-the-art techniques and show that competitive accuracy can be achieved at almost no cost. Moreover, we derive a simple and plausible stochastic model for describing the boilerplate creation process. With the help of our model, we also quantify the impact of boilerplate removal on retrieval performance and show significant improvements over the baseline. Finally, we extend the principled approach with straightforward heuristics, achieving remarkable accuracy.
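
To make the idea of classifying text elements by shallow features more concrete, here is a minimal Python sketch. It is not the classifier from the paper: the `TextBlock` type, the choice of word count and link density as the features, and both threshold values are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class TextBlock:
    """A hypothetical text element extracted from a Web page."""
    text: str         # all visible text in the block
    anchor_text: str  # the part of that text that sits inside links


def word_count(s: str) -> int:
    """Number of whitespace-separated tokens."""
    return len(s.split())


def link_density(block: TextBlock) -> float:
    """Fraction of the block's words that belong to link text."""
    total = word_count(block.text)
    if total == 0:
        return 0.0
    return word_count(block.anchor_text) / total


def is_boilerplate(block: TextBlock,
                   max_link_density: float = 0.33,
                   min_words: int = 10) -> bool:
    """Label a block using two shallow features.

    Both thresholds are illustrative assumptions, not values
    taken from the paper.
    """
    if link_density(block) > max_link_density:
        return True  # mostly links: likely navigation
    return word_count(block.text) < min_words  # very short: likely a label


blocks = [
    TextBlock("Home News Sports Contact", "Home News Sports Contact"),
    TextBlock("In addition to the actual content, Web pages consist of "
              "navigational elements, templates, and advertisements.", ""),
    TextBlock("Read more", "Read more"),
]
labels = [is_boilerplate(b) for b in blocks]
```

In this sketch the navigation bar and the "Read more" link are labeled as boilerplate, while the full sentence is kept as content. A real system would learn such rules from labeled pages rather than hard-code them.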


Reviews and comments:

Comment1 Christian Kohlschütter, November 20, 2010 at 11:09 a.m.:

To improve the audio quality of the video, please turn your speakers' balance to the left (= mono).

To test my algorithms, have a look at http://boilerpipe-web.appspot.com/ and http://code.google.com/p/boilerpipe/

Cheers,
Christian


Comment2 Dr. Christian Kohlschütter, March 13, 2011 at 3:28 p.m.:

More than one year has passed by now, and videolectures.net still has not managed to update the audio in the stream.

Until this is fixed, feel free to download/watch the video with corrected (mono) audio here:
http://www.l3s.de/~kohlschuetter/boil...

Best,
Christian


Comment3 avi, September 15, 2013 at 4:15 a.m.:

Hello Christian,

How can I use this algorithm with jQuery? Do you have a proof of concept (POC)? Please let me know. Thanks.


Comment4 mohsin, February 26, 2014 at 10:14 p.m.:

Hello Christian,
How can I implement this algorithm in C? Please let me know.

thanks in advance

kind regards,
Mohsin
