Dates

Each document in our corpus contains some metadata, including the date of its last modification. This information must be extracted from the document or its header data. Doing so is often difficult because there is no single web standard for recording dates, so we have developed a methodology that extracts dates of the following types from a web page (see A. Kehoe, 'Diachronic Linguistic Analysis on the Web with WebCorp', 2006). Within WebCorp LSE's interface, the date type for a web page is displayed in brackets beside its date.

No. | Name | Description
1 | Server returned last modified | Most web pages are returned with a Last-Modified HTTP header, based on the date on which the serving operating system deemed the page to have changed.
2 | Last modified meta tag | Some web pages contain meta information within the document (including PDF and MS Word formats). We can look here for a last modified date.
3 | Last modified within the page body | Some web pages contain a string similar to 'Last Modified:' followed by a date within the body of the page.
4 | Copyright year | Meta information contained within the document can include a copyright date, or the body may include a string similar to 'Copyright' or '©' followed by a date. From either of these we can extract the year in which the document was last modified.
5 | Year within the URL | Some web pages contain dates within the URL. We can extract the year from these.
6 | Any other date | If none of the above can be found, we search the document body for any date earlier than the date on which the page was downloaded.
7 | Download date | If no date can be found at all within the web page, we use the date on which the document was downloaded.
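
As an illustration of types 4 and 5, the following Python sketch pulls candidate years from a copyright string and from a URL. The regular expressions and function names are hypothetical and are not WebCorp LSE's actual rules, which are not given here. Discarding URL years later than the download year is also an assumption, borrowed from the rule stated for type 6.

```python
import re
from datetime import date

# Hypothetical patterns for illustration only; the real extraction rules may differ.
# A copyright marker followed, within a few non-digit characters, by a four-digit year.
COPYRIGHT_RE = re.compile(r"(?:copyright|©|\(c\))\D{0,20}((?:19|20)\d{2})", re.IGNORECASE)
# A four-digit year starting 19 or 20 that is not part of a longer run of digits.
URL_YEAR_RE = re.compile(r"(?<!\d)((?:19|20)\d{2})(?!\d)")


def copyright_years(text: str) -> list[int]:
    """Return the first year that follows each copyright marker (type 4)."""
    return [int(year) for year in COPYRIGHT_RE.findall(text)]


def url_years(url: str, downloaded: date) -> list[int]:
    """Return years found in a URL (type 5), ignoring any later than the download year."""
    years = [int(year) for year in URL_YEAR_RE.findall(url)]
    return [year for year in years if year <= downloaded.year]
```

A simple pattern like this only captures the first year after each marker, so a range such as '© 2001-2005' would need extra handling.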

Priority is given to the date type with the lowest number. The types are checked in order, and as soon as a date of one type is found (for example, type 2) it is used and none of the following types are checked. If more than one date of a particular type is found, we use the most recent.
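
The priority rule can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the WebCorp LSE implementation: the names `DateType` and `choose_date` are hypothetical, and the candidate dates are assumed to have been collected already, whereas the procedure described above stops checking as soon as one type yields a date.

```python
from datetime import date
from enum import IntEnum


class DateType(IntEnum):
    """The seven date types, numbered as in the table (names are illustrative)."""
    SERVER_LAST_MODIFIED = 1
    META_LAST_MODIFIED = 2
    BODY_LAST_MODIFIED = 3
    COPYRIGHT_YEAR = 4
    URL_YEAR = 5
    ANY_OTHER_DATE = 6
    DOWNLOAD_DATE = 7


def choose_date(candidates: dict[DateType, list[date]],
                downloaded: date) -> tuple[date, DateType]:
    """Return the date of the lowest-numbered type found, most recent within that type."""
    for date_type in sorted(DateType):
        if date_type is DateType.DOWNLOAD_DATE:
            break  # type 7 is the fallback handled below
        found = candidates.get(date_type, [])
        if found:
            return max(found), date_type
    # No date of types 1 to 6 was found, so fall back to the download date.
    return downloaded, DateType.DOWNLOAD_DATE
```

For example, given two type 2 dates and one type 5 date, the function returns the later of the two type 2 dates together with `DateType.META_LAST_MODIFIED`. How year-only types (4 and 5) are represented as full dates is left open here.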