Birmingham Blog Corpus
This corpus consists of 628,558,282 words extracted from blog texts.
The corpus is split into sections according to how the texts were discovered and downloaded:
|Crawled the top blogs ranked by Technorati.|
|Google Blog Search||19,384,429 words|
|Downloaded new posts daily as identified by Google Blog Search.|
|Blogspot and WordPress||181,405,634 words|
|Crawled blogs hosted on blogspot.com and wordpress.com. Extracted date information and separated posts from comments:|