Synchronic English Web Corpus

This corpus consists of 467,713,650 words (tokens) of web-extracted text. It covers the period 2000-2010 and is split into the two sub-corpora below.

339,907,995 words from 100,000 randomly selected web pages, forming a sample of the distribution of texts throughout the web.

127,805,655 words from 56,000 pages selected on the basis of the Open Directory classification of web pages. Each of the 14 domains contributes 4,000 pages.

Arts: 7,633,937 words
Business: 8,042,995 words
Computers: 9,270,736 words
Games: 9,236,305 words
Health: 10,413,454 words
Home: 7,007,868 words
Kids and Teens: 9,776,391 words
News: 6,792,433 words
Recreation: 7,419,495 words
Reference: 12,272,054 words
Science: 14,919,257 words
Shopping: 5,185,703 words
Society: 11,231,670 words
Sport: 8,603,357 words
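
The figures above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of the corpus tooling. The variable names are hypothetical, and the numbers are copied from this page. It sums the per-domain counts, adds the random-sample sub-corpus, and reports each domain's share of the classified sub-corpus.

```python
# Word counts per Open Directory domain, copied from the list above.
domain_words = {
    "Arts": 7_633_937,
    "Business": 8_042_995,
    "Computers": 9_270_736,
    "Games": 9_236_305,
    "Health": 10_413_454,
    "Home": 7_007_868,
    "Kids and Teens": 9_776_391,
    "News": 6_792_433,
    "Recreation": 7_419_495,
    "Reference": 12_272_054,
    "Science": 14_919_257,
    "Shopping": 5_185_703,
    "Society": 11_231_670,
    "Sport": 8_603_357,
}

# Figures stated for the random-sample sub-corpus and the per-domain page quota.
random_sample_words = 339_907_995
pages_per_domain = 4_000

# Totals derived from the figures above.
domain_total = sum(domain_words.values())
corpus_total = random_sample_words + domain_total
domain_pages = pages_per_domain * len(domain_words)

print(f"Domain sub-corpus: {domain_total:,} words from {domain_pages:,} pages")
print(f"Whole corpus:      {corpus_total:,} words")

# Share of the domain sub-corpus contributed by each domain, largest first.
for name, words in sorted(domain_words.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<15}{words / domain_total:6.1%}")
```

Although every domain contributes the same number of pages, the word counts differ considerably, so any per-domain frequency comparison should be normalised by the domain's word total, not by its page count.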

Read more about the construction of our web corpora.
