Detecting Duplicate Web Documents using Clickthrough Data
published: Aug. 9, 2011, recorded: February 2011, views: 3079
The web contains many duplicate and near-duplicate documents. Given that user satisfaction is negatively affected by redundant information in search results, a significant amount of research has been devoted to developing duplicate detection algorithms. However, most such algorithms rely solely on document content to detect duplication, ignoring the fact that a primary goal of duplicate detection is to identify documents that contain redundant information with respect to a particular user query. Similarly, although query-dependent result diversification algorithms compute a query-dependent ranking, they tend to do so on the basis of a query-independent content similarity score.
In this paper, we bridge the gap between query-dependent redundancy and query-independent duplication by showing how user click behavior following a query provides evidence about the relative novelty of web documents. While most previous work on interpreting user clicks on search results has assumed that they reflect just result relevance, we show that clicks also provide information about duplication between web documents since users consider search results in the context of previously seen documents. Moreover, we find that duplication explains a substantial amount of presentation bias observed in clicking behavior. We identify three distinct types of redundancy that commonly occur on the web and show how click data can be used to detect these different types.