
C4Corpus: Multilingual Web-size Corpus with Free License
Published on Feb 4, 2025
Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens)