How does it work?

WebCorp LSE is a search engine built on a unique architecture that facilitates linguistic study of the web. Web content is downloaded, processed and stored, and will eventually form a 10 billion word corpus.

Crawling: Arbitrary or targeted selection and retrieval of documents from the web.
Indexing: Extracting meta information from the downloaded content, cleaning up the text, duplicate detection and building the corpus.
Linguistic Post-Processing: Linguistic knowledge bases including collocates.
Searching: Linguistic search software accessed through our graphical user interface.
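
The indexing and post-processing stages listed above can be illustrated with a minimal Python sketch. It is not the WebCorp LSE implementation: the function names (clean_text, fingerprint, build_corpus, collocates) are hypothetical, text cleaning is reduced to whitespace normalisation, duplicate detection is reduced to exact matching on a hash of the case-folded text, and collocates are simply words counted within a fixed window around a node word.

```python
import hashlib
import re
from collections import Counter


def clean_text(raw):
    """Collapse runs of whitespace (a stand-in for real text clean-up)."""
    return re.sub(r"\s+", " ", raw).strip()


def fingerprint(text):
    """Hash the case-folded text so identical documents share one key."""
    return hashlib.sha1(text.lower().encode("utf-8")).hexdigest()


def build_corpus(documents):
    """Clean each document and keep only the first copy of each distinct text."""
    seen = set()
    corpus = []
    for raw in documents:
        text = clean_text(raw)
        key = fingerprint(text)
        if text and key not in seen:
            seen.add(key)
            corpus.append(text)
    return corpus


def collocates(corpus, node, span=4):
    """Count words occurring within `span` tokens either side of `node`."""
    counts = Counter()
    node = node.lower()
    for text in corpus:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - span):i]
                right = tokens[i + 1:i + 1 + span]
                counts.update(left + right)
    return counts
```

A production system would more plausibly use near-duplicate detection and a statistical association measure to rank collocates; raw window counts are used here only to keep the sketch short.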

We can download documents in many formats but have chosen to deal only with HTML, PDF, Microsoft Word, PostScript, RTF and plain text. These are the most frequent text formats on the web and the most likely to contain connected prose. For more information, see A. Kehoe and M. Gee, 'Making Web text more "text-like": some recent advances', 2007.
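
As a rough illustration of this kind of format restriction, a crawler could check the Content-Type header of each HTTP response against a list of accepted media types. The sketch below is an assumption about how such a filter might look, not a description of the WebCorp LSE crawler; the names ACCEPTED_TYPES and accepted_format are hypothetical, and the media types are the standard ones commonly used for these formats.

```python
# Media types for the six accepted formats (assumed mapping for illustration).
ACCEPTED_TYPES = {
    "text/html": "HTML",
    "application/pdf": "PDF",
    "application/msword": "Microsoft Word",
    "application/postscript": "PostScript",
    "application/rtf": "RTF",
    "text/rtf": "RTF",
    "text/plain": "plain text",
}


def accepted_format(content_type_header):
    """Return the format name for an accepted Content-Type, else None."""
    # Drop parameters such as "; charset=UTF-8" and normalise case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return ACCEPTED_TYPES.get(media_type)
```

For example, accepted_format("text/html; charset=UTF-8") yields "HTML", while an image type such as "image/png" yields None and the document would be skipped.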