WebCorp LSE is a search engine built on a unique architecture designed to facilitate linguistic study of the web. Web content is downloaded, processed and stored, and will eventually form a 10-billion-word corpus.
| Component | Description |
| --- | --- |
| Crawling | Arbitrary or targeted selection and retrieval of documents from the web. |
| Indexing | Extracting metadata from the downloaded content, cleaning up the text, detecting duplicates and building the corpus. |
| Linguistic Post-Processing | Building linguistic knowledge bases, including collocates. |
| Searching | Linguistic search software accessed through our graphical user interface. |
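As a rough illustration of the duplicate detection mentioned under Indexing, the sketch below drops documents whose text is identical after whitespace and case normalisation. It is an assumption made for illustration only, not a description of how WebCorp LSE detects duplicates; the function names `normalise` and `deduplicate` are hypothetical, and exact hashing like this would not catch near-duplicates.

```python
import hashlib
import re
from typing import List, Set


def normalise(text: str) -> str:
    """Collapse runs of whitespace and lowercase the text before hashing."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(documents: List[str]) -> List[str]:
    """Keep the first occurrence of each document, comparing normalised text.

    Hypothetical helper: only exact matches after normalisation are removed.
    """
    seen: Set[str] = set()
    unique: List[str] = []
    for doc in documents:
        digest = hashlib.sha1(normalise(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique
```

Calling `deduplicate(["Hello  world", "hello world", "Other"])` keeps the first and third items, because the first two normalise to the same string.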
We can download documents in many formats but have chosen to process only HTML, PDF, Microsoft Word, PostScript, RTF and plain text, because these are the most common text formats on the web and the most likely to contain connected prose. For more information, see Kehoe, A. and Gee, M. (2007), 'Making Web text more "text-like": some recent advances'.
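The format restriction described above could be applied at crawl time by checking each response's `Content-Type` header. The following is a minimal sketch under that assumption; the mapping `ACCEPTED_TYPES` and the function `accepted_format` are hypothetical names, and the actual crawler may identify formats differently (for example, by file extension or content sniffing).

```python
from typing import Optional

# Media types for the six formats named in the text (illustrative mapping).
ACCEPTED_TYPES = {
    "text/html": "HTML",
    "application/pdf": "PDF",
    "application/msword": "Microsoft Word",
    "application/postscript": "PostScript",
    "application/rtf": "RTF",
    "text/rtf": "RTF",
    "text/plain": "plain text",
}


def accepted_format(content_type: str) -> Optional[str]:
    """Return the format name for an accepted Content-Type value, else None.

    Parameters such as "; charset=UTF-8" are ignored.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return ACCEPTED_TYPES.get(media_type)
```

For instance, `accepted_format("text/html; charset=UTF-8")` yields `"HTML"`, while an image type such as `"image/png"` yields `None` and the document would be skipped.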