
C4Corpus: Multilingual Web-size Corpus with Free License

Published on Jul 28, 2016 (1080 views)

Large Web corpora containing full documents with permissive licenses are crucial for many NLP tasks. In this article we present the construction of a 12-million-page Web corpus (over 10 billion tokens)
