
C4Corpus: Multilingual Web-size Corpus with Free License
Published on 2016-07-28
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens)