C4Corpus: Multilingual Web-size Corpus with Free License
Published on Jul 28, 2016 (1080 views)
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article, we present the construction of a 12-million-page Web corpus (over 10 billion tokens)