Birmingham Blog Corpus

This corpus consists of 628,558,282 words extracted from blog texts. The corpus is split into sections according to how the texts were discovered and downloaded:

Technorati: 427,768,219 words
    Crawled the top blogs ranked by Technorati.
Google Blog Search: 19,384,429 words
    Downloaded new posts daily as identified by Google Blog Search.
Blogspot and Wordpress: 181,405,634 words
    Crawled blogs hosted on Blogspot and Wordpress. Extracted date information and separated posts from comments:
    posts: 94,977,062 words
    comments: 86,428,572 words

Read more about the construction of our web corpora.
