A Syntactically Annotated Idiom Database (SAID) v.1
Therese Aitchison: Department of Linguistics, University of Canterbury
Kees van der Veer: Max Planck Institut für Psycholinguistik, Nijmegen.
For e-mail contact with the authors: <[email protected]>
Copyright © Koenraad Kuiper, Heather McCann, Heidi Quinn, Therese Aitchison,
Kees van der Veer 2003
The right of Koenraad Kuiper, Heather McCann, Heidi Quinn, Therese Aitchison
and Kees van der Veer to be identified as authors of this work is asserted.
All uses of SAID are to be bibliographically acknowledged.
What's in SAID and how to use it.
This data set was constructed with a number of ends in view. The chief of these was
to provide data for investigating the structural configurations in which English
idioms are typically found. The assumption was that, since idioms are phrasal lexical
items (PLIs), they will therefore have structural properties which are idiosyncratic.
A number of these can be noted.
Bound words exist in a number of PLIs.
take cognisance of
have an inkling
What are the properties of these words and how frequent are they in PLIs?
There may be locality constraints on the syntactic configuration of PLIs.
Are all lexicalized constituents within the maximal projection domain of the head of
the PLI?
Do all lexical heads of a PLI form a lexical selection chain within the M domain of
the head of the PLI? (Van Gestel 1995, O'Grady 1998, Kuiper and Everaert 2000).
PLIs can have slots, i.e. unfilled positions (Williams 1994).
Some are empty argument positions.
e.g. take NP to task
Some are not.
e.g. get NP’s goat.
Some slots have selectional restrictions.
e.g. blow hot and cold
(Only human subjects can blow hot and cold.)
Some have co-indexing restrictions.
e.g. get NP’s goat
(The NP in the genitive cannot be co-referential with the subject of the PLI.)
Some PLIs have constituents which may or may not be used but are known to
be part of the PLI.
e.g. Good riddance (to bad rubbish).
Some PLIs have options that are a smaller set than the syntax and semantics of
the PLI would generally permit.
e.g. in a bad/foul mood/temper
e.g. in a good mood/temper
but not e.g. #in a pleasant mood
(We use a # to indicate a phrase which is not lexicalised.)
Some PLIs allow the optional insertion of free modifiers (adjuncts) (Abeillé
1995).
e.g. get annoyed, get very annoyed, get slightly annoyed.
e.g. Get lost! #Get very lost!
Some PLIs have degrees of freedom under movement with a range from frozen
to free, e.g. passivisation of:
e.g. take care of NP
e.g. poke borax at NP
e.g. kick the bucket
(Abeillé 1995, Chafe 1968, Nunberg et al. 1994)
Some PLIs are restricted collocations where more than one option for selection
exists in the grammar but only one is conventionally selected (Howarth 1996).
e.g. get on the bus, #get in the bus
e.g. to the best of one’s abilities, #at the best of one’s abilities
Further questions arise such as:
How much adjunction is there in PLIs?
Can any functional projection be potentially lexicalized?
Can slots occur internal to a PLI?
How are PLIs entered into the computational system in a minimalist model of
Data are needed to answer these and many other questions relating to the structural
properties of phrasal lexical items. PLIs are also used as data in arguing for particular
theoretical positions (Everaert and Kuiper 1996). Such data would be more useful if
it came with syntactic annotation.
The data was originally drawn from four dictionaries of English idioms: Cowie and
Mackin 1975, Cowie, Mackin and McCaig 1983, Long 1979, and Courteney 1983.
Only citation forms, suitably adapted for our purposes, were used. The citation files
were amalgamated. (See SAID1.txt.) The rationale for the selection was that these
are among the biggest and most comprehensive listings of English idioms.
An assumption was made that many of the structural types would be represented.
No assumption was made that the selected items were a statistically significant
subset of the total phrasal lexicon of English. Estimates of the size of the phrasal
lexicon of an average native speaker of English range from about the same order of
magnitude as the single word lexicon (Jackendoff 1995: 137) to an order of
magnitude larger (Mel'čuk 1995: 169). Both these are guesstimates. Given that the
current guesstimates of the size of the single word vocabulary of a native speaker are
conservatively between 20,000 (Goulden, Nation and Read 1990) and 60,000 words
(Fromkin 2000: 8), claims as to whether our sample is representative in some way
would be premature.
There are 13,467 PLIs in the SAID1.txt file.
Who and how?
The analysis was conducted by Heather McCann and Koenraad Kuiper, the checking
by Heidi Quinn and Therese Aitchison. Within each pair, each member went over the
other's analysis, double-checking it to improve consistency. Computational checking of
bracketing was done by Kees van der Veer who also did the conversion to PROLOG
and the various other formats.
The analysis was manual for the following reasons. First, when the analysis began
(quite some years ago), machine parsers were not able to provide sufficient detail.
Second, manual annotation raised questions about the best analysis which were
heuristically challenging. Third, the time period taken for the analysis allowed a
number of people to work on the project both with analysis and checking and this has
led to a perhaps more considered analysis than what might have been done with
faster machine parsing.
There are consequences. These data are unlikely to be entirely free of error. They have
been through analysis twice and checked twice. There are no bracketing errors since
bracketing symmetry has been checked computationally. But it is likely that there
remain errors of commission and omission. We apologize for these. In our defence
we would say that, if the analysis had been done computationally, then all computer
analysis errors would have been systematic and thus probably created more problems
for the user than our odd, casual and unsystematic errors.
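The computational check of bracketing symmetry mentioned above is not described further in this document. A minimal Python sketch of such a check, assuming only that every constituent opens with [ and closes with ], might look as follows. The function name is hypothetical, and this is not the program that was actually used for SAID.

```python
def brackets_balanced(annotation: str) -> bool:
    """Return True if every '[' has a matching ']' and no ']' comes too early."""
    depth = 0
    for ch in annotation:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:      # a closing bracket with nothing left to close
                return False
    return depth == 0          # every opened constituent was closed


# Example (9) from this document is symmetric; dropping its last bracket is not.
good = "[S’[S[PRO][AUX[to]][VPkill]]]"
print(brackets_balanced(good))        # True
print(brackets_balanced(good[:-1]))   # False
```

A check of this kind only establishes that brackets pair up. It says nothing about whether the labels or the constituency are right, which is why errors of commission and omission can remain.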
The analytic system we used was initially drawn from a pre-barriers generative
framework. The following were notable analytic decisions:
Projections and categories in the verb complex
Following Chomsky (1981), we are assuming that the basic structure of a sentence
(S) is NP - AUX - VP (1).
The VP is headed by the lexical verb, and contains no further verbal elements. Any
modal and non-modal auxiliaries are constituents of AUX, as is the infinitive marker
to. As can be seen from (1), non-modal auxiliaries are assigned the category V.
Modal auxiliaries have the category MOD (2).
The infinitive marker to is bracketed like an auxiliary verb, but does not have its own
category label (3).
As (4) illustrates, AUX may contain more than one verbal element.
When the lexical verb be is inverted in WH-questions, this is treated like an
auxiliary verb, as in (5).
Other constituents typically found in AUX are the negative marker not (6), and any
VP-external APs (7)-(8).
Non-finite clauses and PRO
PRO appears in the subject position of non-finite clauses that lack an overt subject
and have the infinitive marker to in AUX. Following Chomsky (1981, 1986) we are
assuming that PRO must be ungoverned. Any clause with a PRO subject is therefore
presented as projecting up to S’ level (9).
(9) dressed [S’[S[PRO][AUX[to]][VPkill]]]
Gerund and participle constructions
Since there is considerable debate about the categorial and structural status of
different gerund and participle constructions (cf. Abney 1987, Cowper 1993, Kratzer
1996), we have decided to represent gerunds as deficient clauses that project only up
to S level and are therefore unable to take PRO subjects. This means that gerunds
will occur either without any subject at all (10), or with overt subject NP (11).
Similar constructions involving the past/passive participle of the verb are given an
analogous analysis (12).
(10) keen on [S[VPing]]
(11) not see [S[NP[PRONit/that]][VP[Vhappening]]]
(12) have got [S[NPsb][VP[Vfooled]]]
Where an NP is followed by a non-verbal predicative phrase (13)-(15), or by a VP
headed by a bare infinitive (16), we have analysed the whole constituent as a small
clause. To ensure easy identification and compatibility with different existing
approaches to small clauses (Aarts 1992, Bowers 1993, Chomsky 1993, Cardinaletti
& Guasti 1995), all small clauses identified in the dictionary have the category label
SC and a basic NP - XP structure.
We are treating the possessive marker ’s as a lexical clitic that turns an NP into a
possessive phrase (cf. Halpern 1995 for a similar approach). Such possessive
phrases are assigned the category POSS (17).
Possessive pronouns are assumed to project both NP and POSS, without the need for
a separate possessive marker (18).
Where the possessive form of a pronoun is indicated by ’s, the pronoun is analysed
as the head of a noun phrase taking a possessive clitic (19).
Conjunction constructions are assumed to be headed by the conjuncts rather than the
conjunction (cf. Pesetsky 1982, Gazdar et al. 1985, Jackendoff 1990, Pollard & Sag
1994, Kaplan & Maxwell 1995). This means that the overall constituent inherits the
category from its conjuncts (20).
If the conjuncts belong to different categories, the overall constituent is bracketed but
does not bear a category label (21).
Conjunctions have the category CONJ. If no overt conjunction is present, conjuncts
are separated by an empty CONJ node (22).
It is usually phrases that are conjoined, but sometimes words are conjoined, as in
(23).
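The labelling rule for coordinate structures described above can be restated as a small Python function. This is only an illustrative sketch: the function name and the use of an empty string to stand for "bracketed but without a category label" are assumptions made here, not part of SAID.

```python
def coordinate_label(conjunct_labels):
    """Category of a coordination: the shared category of its conjuncts,
    or an empty string when the conjuncts differ in category."""
    unique = set(conjunct_labels)
    return unique.pop() if len(unique) == 1 else ""


print(coordinate_label(["AP", "AP"]))        # AP
print(repr(coordinate_label(["AP", "PP"])))  # ''
```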
Comparative and equative structures
Our analysis distinguishes two types of comparative/equative structures:
Type 1: Comparatives and equatives where the degree adverb (more, as, enough)
semantically modifies an A (24) or N (25). The constituent following than or as is
either clausal or readily interpreted as a reduced clause.
(24) more beautiful(ly) than ever
as bad as ever
old enough to be X
(25) more trouble than be worth
Our syntactic analysis of Type 1 comparatives and equatives is designed to reflect
their semantic properties. Thus the initial degree adverb is treated as a modifying
AP, and than and the second as in equatives are analysed as complementizers
introducing an embedded clause (26)-(28).
Support for this analysis comes from the optionality of the initial as in many equative
structures (29), and from the occurrence of morphological as well as periphrastic
comparatives in Type 1 comparative constructions (30)-(31).
Type 2: Comparatives where the degree element (more, better) does not have this
kind of modifying relationship with the following word, but rather seems to
introduce a kind of coordinate structure. For this reason, the degree element may be
followed by clauses (32) as well as phrasal constituents (33). As we might expect
from a structure resembling either…or… and both…and… coordinates, the
constituent following than is usually of the same category as the constituent
following the degree element.
(32) better to be safe than sorry
(33) more dead than alive
better late than never
more in sorrow than in anger
Our analysis of Type 2 comparatives captures the special status of the degree
element and the symmetry between the compared constituents by assuming that the
whole structure is headed by the degree adverb, which takes the following
constituent and the than clause as its complements (34)-(35).
The category of all
All is treated as a QP when it modifies an overt N or PRON (36), as an AP when it
modifies an overt A (37), and as an NP when it occurs by itself (38).
(36) a. [NP[QP[Qall]][DETthe][Nfun][PP[Pof][NP[DETthe][Nfair]]]]
We have analysed the ungrammatical PLIs as best we can. For example, where these
are historical throwbacks, we have tried to analyse them in line with the grammar of
the period, as in (39).
The following conventions were adopted in the analysis to give SAID2.txt which is
the manual analysis file.
Square brackets enclose constituents.
Upper case notation inside the leftmost bracket provides the syntactic
label for the constituent.
All linguistic data is reduced to lower case.
/ is placed between alternative heads (selection sets).
() is placed around lexicalized optional constituents.
NP is used for many slots.
Dictionary abbreviations like sb and sth for somebody or something are
also used within slots.
* indicates an ungrammatical PLI.
What is missing?
No single bar levels are used.
No traces are indicated.
No co-indexing is noted.
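The conventions above (square brackets, an upper-case label directly after the opening bracket, lower-case linguistic data) are regular enough to be read mechanically. The following Python sketch shows one way to turn a bracketed analysis into a tree. It assumes that the input is a single bracketed constituent and that a prime directly before an opening bracket belongs to the label, as in S’. The names Node, parse and words are hypothetical and are not part of the SAID distribution.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                    # e.g. "NP"; "" if unlabelled
    children: list = field(default_factory=list)  # Node objects or strings


def parse(text: str) -> Node:
    """Parse one bracketed constituent such as '[NP[DETthe][Nfair]]'."""
    node, end = _parse_at(text, 0)
    if end != len(text):
        raise ValueError("unexpected material after the closing bracket")
    return node


def _parse_at(text: str, i: int):
    if i >= len(text) or text[i] != "[":
        raise ValueError(f"expected '[' at position {i}")
    i += 1
    start = i
    # The label is the run of upper-case letters after the bracket ...
    while i < len(text) and text[i].isupper():
        i += 1
    # ... plus a prime when it directly precedes a daughter (assumption).
    if i + 1 < len(text) and text[i] in "’'" and text[i + 1] == "[":
        i += 1
    node = Node(text[start:i])
    while i < len(text) and text[i] != "]":
        if text[i] == "[":
            child, i = _parse_at(text, i)
            node.children.append(child)
        else:
            start = i
            while i < len(text) and text[i] not in "[]":
                i += 1
            word = text[start:i].strip()
            if word:
                node.children.append(word)
    if i >= len(text):
        raise ValueError("missing closing bracket")
    return node, i + 1


def words(node: Node) -> list:
    """Collect the lower-case linguistic data from left to right."""
    out = []
    for child in node.children:
        out.extend(words(child) if isinstance(child, Node) else [child])
    return out


tree = parse("[NP[QP[Qall]][DETthe][Nfun][PP[Pof][NP[DETthe][Nfair]]]]")
print(tree.label)               # NP
print(" ".join(words(tree)))    # all the fun of the fair
```

Alternatives written with / and optional material in () are left inside the word strings by this sketch; handling them would need a further step.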
In order to facilitate machine manipulation of the annotated data, the manual analysis
was converted to PROLOG format. This involved expansions of those PLIs which
had optional constituents so that both the case with and that without the options were
made available. Alternatives were left in the PROLOG file separated by /. SAID3.txt
contains the data in SAID2 form above and Prolog form below for each datum.
SAID4.txt contains only the PROLOG form of the data.
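The expansion of optional constituents described above can be illustrated with a short Python sketch. It works on plain strings in which optional material is enclosed in (), and it leaves alternatives separated by / untouched, as the PROLOG file does. The function name is hypothetical, and the exact shape of the PROLOG terms in SAID3.txt and SAID4.txt is not reproduced here.

```python
import re

_OPTIONAL = re.compile(r"\(([^()]*)\)")   # an innermost "(...)" group


def expand_optional(form: str) -> list:
    """Return every variant of `form` with each optional part kept or dropped."""
    match = _OPTIONAL.search(form)
    if match is None:
        return [" ".join(form.split())]        # normalise spacing
    before, after = form[:match.start()], form[match.end():]
    with_option = before + match.group(1) + after
    without_option = before + after
    variants = []
    for variant in expand_optional(with_option) + expand_optional(without_option):
        if variant not in variants:            # keep order, drop duplicates
            variants.append(variant)
    return variants


print(expand_optional("good riddance (to bad rubbish)"))
# ['good riddance to bad rubbish', 'good riddance']
```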
We think that the various file formats provided will make it possible to convert our
format to others with judiciously constructed algorithms.
Files have been left in text format with each record separated by a return. This should
make it easy to import the data into any database for interrogation.
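As an illustration of such an import, the following Python sketch loads one record per line into an SQLite table using the standard sqlite3 module. The table name pli, the column names and the function name are assumptions made for this example; the character encoding of the distributed files is not stated in this document, so the sample below uses an in-memory list of lines rather than a file.

```python
import sqlite3


def load_records(lines, db_path=":memory:"):
    """Store one record per non-empty line in a single SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pli (id INTEGER PRIMARY KEY, form TEXT)"
    )
    records = [(line.strip(),) for line in lines if line.strip()]
    conn.executemany("INSERT INTO pli (form) VALUES (?)", records)
    conn.commit()
    return conn


# With the real data one would pass an open file such as SAID1.txt instead.
sample = ["take cognisance of\n", "have an inkling\n", "kick the bucket\n"]
conn = load_records(sample)
rows = conn.execute(
    "SELECT form FROM pli WHERE form LIKE ?", ("%the%",)
).fetchall()
print(rows)   # [('kick the bucket',)]
```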
We have also enclosed Theo Vosse's program 'TreeParse' and its manual, with his
blessing, for Macintosh users since it will draw tree diagrams from PROLOG input.
We hope this data set will be useful for linguists. Those working in parsing and
machine translation might find the data useful for priming linguistic analysis of new
data and cutting down the search space for non-compositional phrases in parsing and
machine translation algorithms.
Some teachers of English as a second or foreign language may also find the
structural analyses useful for grounding grammar learning in idioms which are often
themselves memorable or at least worth knowing if you are a foreign language
learner.
We have made our best effort as to the consistency and accuracy of the data analysis;
however no guarantees are made or implied as to either.
We are grateful to the following for grants:
The New Zealand Vice Chancellors' Committee,
The University of Canterbury.
Koenraad Kuiper is grateful to:
the University of Canterbury for periods of study leave during which some of his
contribution to the project was made,
NWO, who provided support during a period of leave hosted by het
Onderzoeksinstituut voor Taal en Spraak (OTS) at the University of Utrecht (1995-6),
the Max Planck Institut für Psycholinguistik and Professor W.J.M. Levelt for
hospitality and assistance during a three-month stay at the Institute in 1999,
the Netherlands Institute for Advanced Studies in the Humanities and Social
Sciences for a Fellowship in 2002-2003.
Aarts, Bas. 1992. Small clauses in English: the nonverbal types. (Topics in English
linguistics, 8) Berlin: Mouton de Gruyter.
Abeillé, Anne. 1995. The flexibility of French idioms: A representation with
lexicalized tree adjoining grammar. Idioms: Structural and psychological
perspectives, ed. by Martin Everaert, Eric-Jan van der Linden, André Schenk,
and Rob Schroeder, 15-52. Hillsdale, New Jersey: Lawrence Erlbaum
Associates.
Abney, Steven Paul. 1987. The English noun phrase in its sentential aspect.
Unpublished PhD thesis. Cambridge, MA: MIT.
Borsley, Robert D. 1994. In defense of coordinate structures. Linguistic Analysis 24.
Bowers, John. 1993. The syntax of predication. Linguistic Inquiry 24. 591-656.
Cardinaletti, Anna and Maria Teresa Guasti (eds). 1995. Small clauses. (Syntax and
semantics, 28) San Diego: Academic Press.
Chafe, Wallace. 1968. Idiomaticity as an anomaly in the Chomskian paradigm.
Foundations of language 4. 109-127.
Chomsky, Noam. 1981. Lectures on government and binding. Dordrecht: Foris.
Chomsky, Noam. 1986. Knowledge of language: its nature, origin, and use.
(Convergence) New York: Praeger.
Chomsky, Noam. 1993. A minimalist program for linguistic theory. The view from
building 20: essays in linguistics in honor of Sylvain Bromberger, ed. by
Kenneth Hale and Samuel Jay Keyser, 1-52. (Current studies in linguistics, 24)
Cambridge, MA: MIT Press.
Courteney, Rosemary. 1983. Longman dictionary of phrasal verbs. Harlow, Essex:
Cowie, Anthony P. and Ronald Mackin. 1975. Oxford dictionary of current idiomatic
English: Verbs with prepositions and particles. Oxford: Oxford University
Cowie, Anthony P., Ronald Mackin, and Isobel McCaig. 1983. Oxford Dictionary of
Current Idiomatic English: Phrase, clause and sentence idioms. Oxford: Oxford
Cowper, Elizabeth. 1993. A non-unified treatment of -ing. Toronto Working Papers
in Linguistics 12. 49-59.
Everaert, Martin and Koenraad Kuiper. 1996. Theory and data in idiom research.
Parasession on theory and data in linguistics, ed. by Michele AuCoin, Rodolfo
Celis, Lise M. Dobrin, Lisa McNair, and Kora Singer (CLS 32v2), 43-57.
Chicago: Chicago Linguistics Society.
Fromkin, Victoria et al. 2000. Linguistics: An introduction to linguistic theory.
Gazdar, Gerald, Ewan Klein, Geoffrey K. Pullum, and Ivan A. Sag. 1985.
Generalized phrase structure grammar. Cambridge, MA: Harvard University
Goulden, Robin, Paul Nation, and John Read. 1990. How large can the receptive
vocabulary be? Applied Linguistics 11. 341-363.
Halpern, Aaron. 1995. On the placement and morphology of clitics. (Dissertations in
linguistics) Stanford, CA: Center for the Study of Language and Information
Howarth, Peter A. 1996. Phraseology in English academic writing: Some
implications for language learning and dictionary making. Max Niemeyer
Jackendoff, Ray. 1990. On Larson’s analysis of the double object construction.
Linguistic Inquiry 21. 427-456.
Jackendoff, Ray S. 1995. The boundaries of the lexicon. Idioms: structural and
psychological perspectives, ed. by Martin Everaert, Eric-Jan van der Linden,
André Schenk, and Rob Schroeder, 133-165. Hillsdale, New Jersey: Lawrence
Erlbaum Associates.
Kaplan, Ronald M. and John T. Maxwell III. 1995. Constituent coordination in
lexical-functional grammar. Formal issues in lexical-functional grammar, ed.
by Mary Dalrymple, Ronald M. Kaplan, John T. Maxwell III, and Annie
Zaenen, 199-210. (CSLI lecture notes, 47) Stanford: CSLI.
Kratzer, Angelika. 1996. Severing the external argument from its verb. Phrase
structure and the lexicon, ed. by Johan Rooryck and Laurie Zaring, 109-137.
(Studies in natural language and linguistic theory, 33) Dordrecht: Kluwer
Kuiper, Koenraad and Martin Everaert. 2000. Constraints on the phrase structural
properties of English phrasal lexical items. PASE papers in language studies:
Proceedings of the 8th annual conference of the Polish Association for the
study of English, ed. by Bożena Rozwadowska, 151-170. Wrocław: Aksel.
Long, Thomas H. et al. 1979. The Longman dictionary of English idioms. Harlow,
Mel'čuk, Igor. 1995. Idioms: structural and psychological perspectives, ed. by Martin
Everaert, Eric-Jan van der Linden, André Schenk, and Rob Schroeder, 167-232.
Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Nunberg, Geoffrey, Ivan Sag, and Thomas Wasow. 1994. Idioms. Language 70. 491-538.
O'Grady, William. 1998. The syntax of idioms. Natural Language and Linguistic
Theory 16. 279-312.
Pesetsky, David. 1982. Paths and categories. Unpublished PhD thesis. Cambridge,
Pollard, Carl and Ivan A. Sag. 1994. Head-driven phrase structure grammar.
Chicago: University of Chicago Press.
Van Gestel, Frank, 1995. En bloc insertion. Idioms: structural and psychological
perspectives, ed. by Martin Everaert, Eric-Jan van der Linden, André Schenk,
and Rob Schroeder, 75-94. Hillsdale, New Jersey: Lawrence Erlbaum
Associates.
Williams, Edwin. 1994. Thematic structure in syntax. Cambridge, Massachusetts: