Efficient Indexing of Repeated n-Grams

author: Samuel Huston, Department of Computer Science, University of Massachusetts Amherst
published: Aug. 9, 2011, recorded: February 2011

The identification of repeated n-gram phrases in text has many practical applications, including authorship attribution, text reuse identification, and plagiarism detection. We consider methods for finding the repeated n-grams in text corpora, with emphasis on techniques that can be effectively scaled across a cluster of processors to handle very large amounts of text. We compare our proposed method to existing techniques using the 1.5 TB TREC ClueWeb-B text collection, using both single-processor and multi-processor approaches. The experiments show that our method offers an important tradeoff between speed and temporary storage space, and provides an alternative to previous approaches that scales almost linearly in the length of the sequence, is largely independent of n, and provides a uniform workload balance across the set of available processors.
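To make the task concrete, here is a minimal sketch of what "finding the repeated n-grams" means on a tiny input. It is a naive in-memory baseline that counts every n-gram in a hash table, not the method presented in the talk, and it would not scale to a collection the size of ClueWeb-B. The function name `repeated_ngrams` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def repeated_ngrams(tokens, n):
    """Return {n-gram: count} for every n-gram that occurs at least twice.

    Naive baseline: holds every distinct n-gram in memory at once.
    """
    # Slide a window of width n over the token sequence and count each tuple.
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    # Keep only the n-grams seen more than once.
    return {gram: c for gram, c in counts.items() if c >= 2}


if __name__ == "__main__":
    # Hypothetical example text, split on whitespace.
    text = "to be or not to be that is the question to be or not"
    for gram, count in sorted(repeated_ngrams(text.split(), 3).items()):
        print(" ".join(gram), count)
```

The memory held by the counter grows with the number of distinct n-grams, which is the kind of cost that the speed versus temporary storage tradeoff mentioned in the abstract is concerned with.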

See Also:

Download slides: wsdm2011_huston_eir_01.pdf (1.6 MB)