Web Corpora

The web corpora available in WebCorp LSE were constructed as outlined below.

Miniweb

The miniweb section of the corpus is intended to be a microcosm of the web: as few restrictions as possible were placed on the selection of pages, with the aim that the distribution of pages within the corpus roughly mirrors the distribution of pages across the web itself. Our method for selecting pages takes inspiration from Baroni and Bernardini (2004):

  1. Select high frequency words from the British National Corpus and our existing newspaper corpora, excluding grammatical words.
  2. Create 100 combinations of three words by choosing words at random from the high frequency word list.
  3. Submit these combinations of words to the Google Search API (no longer in operation) and retrieve the top 5 hits.
  4. Use the full set of Google hits as the seeds for a 'broad' web crawl. A broad crawl means that any link to any web page found may be followed by the crawler. The Heritrix crawler (version 1.12) was used.
  5. Run the crawl for one day.

The above steps were repeated every 6 days between August 2008 and February 2011.
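
Steps 1-3 above can be sketched as follows. The word list here is a hypothetical stand-in for the actual high-frequency lists drawn from the BNC and newspaper corpora, and the query tuples would have been submitted to the (now defunct) Google Search API:

```python
import random

# Hypothetical high-frequency word list; in practice this was drawn from
# the British National Corpus and newspaper corpora, excluding
# grammatical words.
high_freq_words = [
    "government", "people", "time", "world", "work",
    "school", "water", "music", "money", "house",
]

def make_seed_queries(words, n_queries=100, words_per_query=3, seed=0):
    """Build random three-word combinations to use as search queries."""
    rng = random.Random(seed)
    return [tuple(rng.sample(words, words_per_query)) for _ in range(n_queries)]

queries = make_seed_queries(high_freq_words)
print(len(queries))  # 100 query tuples, each of three distinct words
```

The top five hits returned for each query would then be pooled to form the seed list for the broad crawl.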

Domain specific corpora

The domains used in this section of the corpus are based on the Open Directory Project, a manually curated list of websites split into a number of subject domains. Unfortunately, the Open Directory is no longer in operation. The method used was:

  1. Choose a top level domain from the Open Directory project.
  2. Extract all URLs for sites listed under that domain.
  3. Choose 50 URLs at random and use these as seeds for a limited web crawl. The crawl is limited to the path of the URL, so only links to pages in the same sub-path may be followed by the crawler. The Heritrix crawler (version 1.12) was used.
  4. Run the crawl for one day.

The above steps were repeated for each domain over a cycle of 17 days (interspersed with the miniweb crawls from above) between August 2008 and February 2011. The crawl of each domain was run on a different day of the week each cycle.
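
The path restriction in step 3 can be illustrated with a simple scope check. This is only a sketch of the idea, not Heritrix's actual scoping code:

```python
from urllib.parse import urlparse

def in_crawl_scope(seed_url, candidate_url):
    """Return True if candidate_url lies on the same host and under the
    same sub-path as the seed URL, mimicking a path-limited crawl scope."""
    seed, cand = urlparse(seed_url), urlparse(candidate_url)
    # Normalise the seed path to end in "/" so prefix matching is safe.
    seed_path = seed.path if seed.path.endswith("/") else seed.path + "/"
    return cand.netloc == seed.netloc and (cand.path + "/").startswith(seed_path)

print(in_crawl_scope("http://example.org/science/",
                     "http://example.org/science/physics/page.html"))  # True
print(in_crawl_scope("http://example.org/science/",
                     "http://example.org/arts/page.html"))             # False
```

Links failing such a check would simply not be queued by the crawler, keeping each crawl within its Open Directory domain.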

Text clean-up

Several modules were used to clean up the textual data, including n-gram-based language identification (Cavnar and Trenkle 1994) and part-of-speech tagging (Schmid 1994); see Kehoe and Gee (2007) for details of how the web texts were made more 'text-like'.

Synchronic Web Corpus

For practical reasons, the synchronic web corpus is a random selection of pages from the miniweb and domain-specific datasets described above.

Diachronic Web Corpus

The diachronic web corpus is a sample of the miniweb described above, with a random sample of the same size taken for each month between 2000 and 2010. An additional restriction was imposed: the date of each page must have been discoverable from the server header, from HTML metadata, or from a phrase such as "last modified" in the textual body. See Kehoe (2006) for more details.
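
The date-discovery heuristics can be sketched roughly as below. The patterns and fallback order here are illustrative assumptions, not the actual WebCorp LSE implementation described in Kehoe (2006):

```python
import re

# Date patterns: "3 March 2005" or ISO "2005-03-03" (illustrative only).
DATE = r"(\d{1,2}\s+\w+\s+\d{4}|\d{4}-\d{2}-\d{2})"

def discover_date(headers, html):
    """Try the HTTP Last-Modified header, then an HTML <meta> date, then a
    'last modified ...' phrase in the body. Return None if the page cannot
    be dated (such pages were excluded from the diachronic corpus)."""
    if "Last-Modified" in headers:
        return headers["Last-Modified"]
    m = re.search(r'<meta[^>]*name=["\']date["\'][^>]*content=["\']([^"\']+)',
                  html, re.I)
    if m:
        return m.group(1)
    m = re.search(r"last\s+(?:modified|updated)\D{0,10}" + DATE, html, re.I)
    if m:
        return m.group(1)
    return None

print(discover_date({}, "<p>Last modified: 3 March 2005</p>"))  # 3 March 2005
```

In practice the three sources can disagree or be unreliable (e.g. server headers reflecting the time of serving rather than of writing), which is why the header, metadata, and body evidence are all consulted.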

Birmingham Blog Corpus

The Birmingham Blog Corpus was constructed using a combination of the methods above and website-specific HTML parsing rules to enable the accurate extraction of blog posts and comments. See Kehoe and Gee (2012) for more details.
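
A site-specific extraction rule of this kind might look roughly like the following sketch, using Python's standard-library HTML parser. The class names "entry-content" and "comment-body" are hypothetical examples; each blog host in the corpus needed its own rules (Kehoe and Gee 2012):

```python
from html.parser import HTMLParser

class BlogExtractor(HTMLParser):
    """Collect text from elements whose class attribute matches a
    site-specific post or comment marker (a minimal sketch)."""

    RULES = {"entry-content": "post", "comment-body": "comment"}

    def __init__(self):
        super().__init__()
        self.texts = []       # collected (kind, text) pairs
        self._current = None  # kind being collected, or None
        self._depth = 0       # nesting depth inside the matched element

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if self._current is None and cls in self.RULES:
            self._current = self.RULES[cls]
            self._depth = 1
        elif self._current is not None:
            self._depth += 1

    def handle_endtag(self, tag):
        if self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.texts.append((self._current, data.strip()))

page = ('<div class="entry-content"><p>The post itself.</p></div>'
        '<div class="comment-body">A reader comment.</div>')
parser = BlogExtractor()
parser.feed(page)
print(parser.texts)  # [('post', 'The post itself.'), ('comment', 'A reader comment.')]
```

Separating posts from comments in this way is what allows the two text types to be searched independently in the corpus.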

References

Baroni, M. and S. Bernardini. 2004. BootCaT: Bootstrapping Corpora and Terms from the Web. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC'04), 1313-1316. Lisbon, Portugal.

Cavnar, W. and J. Trenkle. 1994. N-Gram-Based Text Categorization. In Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval, 161-175. Las Vegas, USA.

Kehoe, A. 2006. Diachronic Linguistic Analysis on the Web with WebCorp. In The Changing Face of Corpus Linguistics, edited by A. Renouf and A. Kehoe, 297-308. Amsterdam: Rodopi.

Kehoe, A. and M. Gee. 2007. New corpora from the web: making web text more 'text-like'. In Towards Multimedia in Corpus Studies, edited by P. Pahta, I. Taavitsainen, T. Nevalainen and J. Tyrkkö. Electronic publication, University of Helsinki.

Kehoe, A. and M. Gee. 2012. Reader comments as an aboutness indicator in online texts: introducing the Birmingham Blog Corpus. In Aspects of Corpus Linguistics: Compilation, Annotation, Analysis, edited by S. Oksefjell Ebeling, J. Ebeling and H. Hasselgård. Electronic publication, University of Helsinki.

Schmid, H. 1994. Probabilistic Part-of-Speech Tagging Using Decision Trees. In Proceedings of the International Conference on New Methods in Language Processing, 44-49. Manchester, UK.