About Dante

The Database

Dante is not a dictionary; it is much more than that. Dante is a lexical database which describes the core vocabulary of English at a level of detail which is – to the best of our knowledge – unequalled.

Similarly, its users are not simply dictionary users. It was created for lexicographers and computational linguists who are developing dictionaries and computer lexicons, either manually, with computer assistance, or automatically.

The database itself is written in British English, but the corpus is not restricted to British English, and particular attention has been paid to ensuring equal coverage of American English orthography and usage.

Dante is a completely new product. It was conceived from the start as a systematic, coherent, and comprehensive database, whose objective was to record and exemplify every significant linguistic fact about the behaviour of the 50,000 most frequent English words. This called for meticulous planning. In an intensive R&D phase, every decision – about corpus resources, software, entry structures, and editorial policy – was geared to achieving this goal.

The Dante methodology

From a lexicographic point of view, Dante applies an established methodology, across the lexicon, in a rigorous way. For every lemma, the data in a 1.7-billion-word corpus is analysed in two stages:

  1. identifying the discrete uses or lexical units (LUs) of each lemma
  2. attaching relevant linguistic facts to each sense

The LU approximates to a dictionary sense. It is the principal ‘currency’ of the database, and represents a word form with its own part of speech, expressing a discrete sense. The ordering of LUs within an entry is based on semantic proximity, not (as in many dictionaries) on part of speech.

Word sense disambiguation is notoriously difficult and prone to variation. The challenge for Dante was to systematize this process, and our approach was to focus on context. ‘Context’ includes semantic features; syntactic, colligational, and collocational behaviour; and text-type preferences (domain, register, and so on). Each sense identified in Dante represents a unique cluster of these contextual features, which can be observed to occur frequently in the corpus. In this way, lexicographers’ intuitions are tested and refined using objective linguistic data.

With a robust inventory of senses in place, the second main task is to provide a complete account of how each sense behaves, and to record these facts as far as possible in formal codings which are machine-retrievable.
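The two stages can be pictured with a small data model. The following is only a sketch under stated assumptions: the class names, field names, glosses, and the example sentence are invented for illustration and do not reflect Dante's actual entry structure. It shows LUs being listed for a lemma, a coded fact being attached to one of them, and that fact then being retrieved mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    """One coded linguistic fact, e.g. datatype='register', value='informal'."""
    datatype: str
    value: str
    examples: list[str] = field(default_factory=list)


@dataclass
class LexicalUnit:
    """A lemma in one discrete sense, with its own part of speech (hypothetical model)."""
    lemma: str
    pos: str
    gloss: str
    facts: list[Fact] = field(default_factory=list)


def find_lus(lus, datatype, value):
    """Return every LU that carries the given coded fact."""
    return [
        lu for lu in lus
        if any(f.datatype == datatype and f.value == value for f in lu.facts)
    ]


# Stage 1: identify the discrete uses (LUs) of a lemma. Glosses are invented.
lus = [
    LexicalUnit("cool", "adjective", "slightly cold"),
    LexicalUnit("cool", "adjective", "fashionable or impressive"),
]

# Stage 2: attach a coded fact to the relevant LU. The example sentence is
# invented for this sketch and is not taken from the Dante corpus.
lus[1].facts.append(Fact("register", "informal", ["That is a really cool jacket."]))

informal = find_lus(lus, "register", "informal")
print([lu.gloss for lu in informal])
```

Because the fact is stored as a (datatype, value) pair rather than as free text, the lookup is a simple filter; this is the sense in which formal codings are machine-retrievable.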

A toolkit for lexicographic description

The Dante database consists of over 42,000 lexical entries (or headword entries). Collectively, these provide a description of around 42,000 single-word lemmas and 23,000 multiword lemmas, and include over 27,000 idioms and phrases, together with over 622,000 example sentences. In order to record the relevant facts about each of these lemmas, we employ over 40 machine-searchable datatypes. The principal ones are:

Field (number of attributes): examples of attributes

part of speech (16 attributes): noun, verb, adjective, adverb etc.
inherent grammatical properties (32 attributes):
  • nouns: the count / uncount distinction, the unitary / mass distinction, pluralization etc.
  • verbs: impersonal, passive-only etc.
  • adjectives: attributive, predicative, post-modifying etc.
  • adverbs: manner, degree, viewpoint, and sentence adverbs
valency / syntactic contexts, nouns (18 attributes): modified by or modifying a noun; clausal, infinitival and prepositional complements etc.
valency / syntactic contexts, verbs (42 attributes): intransitive, transitive and ditransitive uses; clausal, infinitival and prepositional complements etc.
valency / syntactic contexts, adjectives (17 attributes): premodified by noun or adverb; clausal, infinitival and prepositional complements etc.
linguistic labels, domain (156 attributes): accountancy, aerospace, agriculture, airforce, anatomy etc.
linguistic labels, region (6 attributes): American English, British English, Australian English etc.
linguistic labels, register (4 attributes): formal, informal, very informal, vulgar
linguistic labels, style (21 attributes): child language, drug abuse language, euphemism, journalese etc.
linguistic labels, time (2 attributes): dated, obsolete
linguistic labels, attitude (3 attributes): appreciative, offensive, pejorative

Crucially, each individual fact that we record using any of these datatypes is supported by at least one (and often as many as four or five) example sentences taken in full directly from the corpus. It is this combination of rigorous analysis, systematic recording, and corpus underpinning which makes Dante unique.
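The requirement that every recorded fact has at least one supporting example can be expressed as a simple consistency check. The sketch below is an assumption about how such a check might look, not a description of Dante's own tooling: the dictionary keys, the function name `unsupported_facts`, and the sample sentence are all hypothetical.

```python
def unsupported_facts(facts):
    """Return the (datatype, value) pairs that lack a non-empty example sentence."""
    return [
        (f["datatype"], f["value"])
        for f in facts
        if not any(sentence.strip() for sentence in f["examples"])
    ]


# Hypothetical records; the example sentence is invented, not corpus data.
facts = [
    {
        "datatype": "register",
        "value": "informal",
        "examples": ["That is a really cool jacket."],
    },
    {
        "datatype": "region",
        "value": "American English",
        "examples": [],  # no supporting example, so this fact is flagged
    },
]

print(unsupported_facts(facts))
```

An empty result would mean that every fact in the list is backed by at least one example.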

A note on the example sentences

Our guiding principle has been to record facts found in the corpus and exemplify these with corpus sentences. In certain cases, however, corpus examples are supplemented with brief, formulaic examples designed to illustrate the complete range of likely contexts. Examples of this type appear in entries compiled using generic ‘proformas’. Proforma entries are applied to members of lexical sets, which include natural-kind terms, days of the week, colours, illnesses, containers, and minerals. The entries for pink and Monday show how this works. In some entries for ‘named entities’ (names of countries, rivers, or cities, for example), the headword must be substituted for ‘X’ in these formulaic examples. Note that not all of these examples with ‘X’ are relevant to the specific headword.
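The substitution of a headword for ‘X’ can be sketched in a few lines. This is an illustrative assumption only: the function name `fill_proforma` and the two template sentences are invented here and are not taken from Dante's proformas.

```python
import re


def fill_proforma(templates, headword):
    """Replace each standalone 'X' in the formulaic examples with the headword."""
    # A function is used as the replacement so that the headword is inserted
    # literally, even if it contains characters such as backslashes.
    return [re.sub(r"\bX\b", lambda _match: headword, t) for t in templates]


# Hypothetical formulaic examples for a 'country' proforma.
templates = [
    "They flew to X last year.",
    "X is known for its wine.",
]

for sentence in fill_proforma(templates, "France"):
    print(sentence)
```

The sketch fills every template mechanically. Deciding which of the resulting examples are actually relevant to a given headword remains a separate, editorial step.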

Use the Search box at the top of the page to see how the various datatypes appear in the entries.