C4Corpus: Multilingual Web-size Corpus with Free License

Published on 2016-07-28, 1092 Views

Ivan Habernal

Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens).

LREC 2016 - Portorož