Analyzing Word Frequencies in Large Text Corpora using Inter-arrival Times and Bootstrapping

Published on 2011-11-30

2749 Views

Jefrey Lijffijt

Comparing frequency counts over texts or corpora is an important task in many applications and scientific disciplines. Given a text corpus, we want to test a hypothesis, such as "word X is frequent",

Presentation

00:00 Motivation

00:19 Data

01:17 Problem setting

01:54 Binomial test (bag-of-words model) - 1

02:56 Binomial test (bag-of-words model) - 2

03:14 Binomial test (bag-of-words model) - 3

04:03 Many words are bursty

05:32 Proposed method 1: Inter-arrival times - 1

07:11 Proposed method 1: Inter-arrival times - 2

08:24 Proposed method 2: Bootstrapping

09:09 Comparison for sergeant

09:46 Example: frequency thresholds

11:43 Finding significant news events

13:23 Conclusion