** I take the liberty of reproducing the review of CCL by Katsoyannou and Economou, which appeared in the International Journal of Applied Linguistics and was sent to me by the authors **

Review of OOI, Vincent B.Y. (1998) Computer Corpus Lexicography. Edinburgh U.P., pp. xii+243. ISBN 0-7486-0815-X. International Journal of Applied Linguistics, Vol. 9/2, pp. 299-303.

Reviewed by

Katsoyannou Marianne, Institute for Language and Speech Processing, Athens


Economou Constandina, Institute for Language and Speech Processing, Athens

The title Computer Corpus Lexicography constitutes an effective merger of three notions: computational linguistics (CL1), computational lexicography (CL2) and computer corpus linguistics (CL3). The book is more an introduction to the use of computers in dictionary making and corpus lexical analysis (exclusively for NLP [natural language processing] purposes) and less an overview of the disciplines of lexicography and lexicology. The theoretical core of the book can be encapsulated in the following words: “In the present book, the emphasis on corpus lexicography means that corpus data forms the primary basis for the description of a word’s behaviour.” (p. 93). Despite the fact that it is not clearly stated, the book is evidently addressed to final-year/graduate students studying related subjects or to researchers interested in the respective fields. The existence of study questions at the end of each chapter, as well as the case study discussed in Chapter 6, points to this.

The book is divided into seven chapters, three appendices and a bibliography followed by a short index. The first chapter starts by introducing the three CLs and stating the main aims of the book: constructing a linguistic theory for the analysis of a corpus-based lexicon and applying this framework to corpora in order to describe, extract and process a corpus-based lexicon. The rest of the chapter constitutes an adequate introduction to various theories and notions of the lexicon, especially with regard to the concept of grammar, in order to explain the author’s main choice of the word as the basic unit in linguistic analysis, at least as far as text processing and corpus analysis are concerned. However, the choice of the word (based upon Hudson’s position) against the clause (Halliday’s basic unit of analysis in Functional Grammar) is too fundamental to be confined to two pages. Although the presentation of related issues is exhaustive, the discussion in this section of the book is too short and dense to satisfy the needs of a readership lacking specialist knowledge. Another issue dealt with in the same chapter concerns the kind of linguistic information that the lexicon should provide, the general statements it should contain, and the means by which lexicographers should represent lexical and pragmatic knowledge: the author suggests the notion of lexical frame as a structure intended to unify the various types of information included in the lexicon.

The purpose of Chapter 2 is to thoroughly define the three CLs and to discuss what they entail for the lexicon. CL1 involves “the computing of natural language seen from the point of view of linguistics” and is distinguished from NLP, which is “the computing of natural language seen from the point of view of computer science” (p. 24). Within this framework, machine translation is considered to be the oldest CL1 enterprise. The nature of the lexicon in CL1 has shifted from simple word lists to large computational lexicons derived from the analysis of real texts using modern programs. CL2 involves not only the creation of MRDs (machine-readable dictionaries, both directly created in an electronic format and derived from published dictionaries), but also the building of lexicons for machine use, as well as the development of dictionaries (in databases) for human use. According to the author, the development of dictionaries for advanced learners is the most important part of CL2. Several such dictionaries are listed, compared and contrasted in terms of the use of corpora for their construction. The advent of computers in CL3 obliges the author to coin the term “computer corpus linguistics”, as the most appropriate to replace corpus linguistics. Textual and acoustic (or speech) corpora form the basis for the study of language within this field. The contribution of G. Leech to the establishment of contemporary CL3 is duly admitted. The role of the lexicon within this field was considered rather secondary, since the main object of CL3 is the corpus. However, the use of corpora instead of/as well as MRDs for the construction of huge computational lexicons used in the COBUILD dictionary has challenged this view. The most interesting part of Chapter 2 is the section where the author deals with the interdependence of CL1, CL2 and CL3. Nowadays, it is more than evident that these fields can no longer be treated in isolation. The author enumerates several attempts to provide textual and lexical standards and to share various kinds of lexical resources.

Chapter 3 constitutes a more detailed attempt to explore what it means to derive lexicons from corpus data. Besides the common problem of how to handle the corpus as a lexical resource (comprising a series of questions such as the choice, size and representativeness of data), using corpora for the development of dictionaries also involves the means by which lexical evidence is extracted and treated. The latter include basic electronic tools like the concordancer and a linguistic framework for the exploitation of corpus data. The main advantages of gathering evidence from corpora are mentioned: first, the systematisation of the method, which facilitates the generation of reliable frequency statistics; second, the representativeness of the data (the evidence from corpora indicates the typical and central tendencies of language and shows the full range of variability for a given word or expression); third, the speed of data processing; and, finally, the size of the corpus. Different examples illustrate the above points. The linguistic framework for examining corpus data that is analysed in this chapter has been implemented by J. Sinclair (1995) with the COBUILD “Bank of English” corpus. The main characteristic of this framework is that the context is taken into account: words are grouped into larger units of meaning which form an upper layer of lexical structure.
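To make the notion of a concordancer concrete for readers unfamiliar with it, the following is a minimal sketch in Python of a keyword-in-context listing together with a simple frequency count. It is not taken from the book under review: the function names, the crude tokeniser and the sample sentences are our own assumptions, chosen only for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately crude tokeniser)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, keyword, width=3):
    """Return keyword-in-context lines showing `width` tokens on either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical three-sentence "corpus", invented for this illustration.
SAMPLE = ("The bank raised interest rates. She sat on the river bank. "
          "The bank approved the loan.")

tokens = tokenize(SAMPLE)
kwic = concordance(tokens, "bank")   # one context line per occurrence of "bank"
freq = Counter(tokens)               # raw frequency of every word form

for line in kwic:
    print(line)
print(freq.most_common(3))
```

Even on such a tiny sample, the listing places the financial and the riverside uses of "bank" side by side, which is the kind of contextual evidence the lexicographer is expected to interpret.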

Chapter 4 is devoted to the issue of lexical acquisition, which may be “regarded as a rubric for both lexical knowledge acquisition and lexical data acquisition” (p. 73). Information is drawn either from machine-readable dictionaries or corpora. Here, research is presented along with the challenge of developing leading-edge computational techniques to create tools efficient enough to derive lexicons automatically from real texts. Of course, this approach is not without its disadvantages, especially as far as issues like representativeness and reusability of lexical information are concerned. Furthermore, such efforts, being still in their infancy, lead to semi-automatic (rather than purely automatic) methods of language processing, involving varying degrees of human intervention. The principles governing extraction and organisation of information in order to achieve the transition from corpus to lexicon are embodied in a framework called the LFA (Lexical Frame Analysis). This framework has been applied to two sublanguage corpora, and this experience is described in Chapter 6.

Chapter 5 deals with the computational organisation of a lexicon, offering a survey of the notions of a computerised lexicon, lexical database and lexical knowledge base, mainly by presenting various structures already used and available in the literature. The concept of formalism, defined as the encoding of linguistic or computational information in an explicit manner, is discussed, and a distinction is made between: a) a linguistically motivated organisation of the lexicon, called formalism_1 and b) a computational organisation/representation of the lexicon, called formalism_2. Both these formalisms may be built into a lexical database or lexical knowledge base, although the author does not make a sharp distinction between the two. Rather, lexical knowledge bases are viewed as more developed and dynamic tools which could contain information about computational and formal semantics. They are considered particularly efficient and economical, as they allow for structuring the lexical entries with “inheritance”, which means that they are integrated into a system where a daughter node inherits some or all of the properties of a parent node (example: the daughter node transitive verb inherits the properties of the parent node verb). The chapter concludes with a critical presentation of a number of databases (all English), most of which are well known in the literature, with examples of their application.
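The inheritance mechanism can be illustrated with a short sketch in Python. This is our own illustration rather than any of the formalisms surveyed in the book: the class name LexicalNode, the property names and the sample entry "acquire" are all hypothetical.

```python
class LexicalNode:
    """A node in a toy lexical hierarchy; properties missing locally are looked up in the parent."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent
        self.local = properties  # only the properties stated on this node itself

    def get(self, key):
        """Walk up the hierarchy until some node supplies the property."""
        node = self
        while node is not None:
            if key in node.local:
                return node.local[key]
            node = node.parent
        raise KeyError(key)

    def all_properties(self):
        """Merge inherited and local properties; local values override inherited ones."""
        inherited = self.parent.all_properties() if self.parent else {}
        return {**inherited, **self.local}

# Hypothetical hierarchy mirroring the example in the text: verb -> transitive verb -> entry.
verb = LexicalNode("verb", category="V", inflects_for_tense=True)
transitive = LexicalNode("transitive verb", parent=verb, takes_object=True)
acquire = LexicalNode("acquire", parent=transitive, lemma="acquire")

print(acquire.all_properties())
```

The economy lies in the fact that the entry for "acquire" states only its lemma, while its category and its need for an object are stored once, higher up in the hierarchy.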

In Chapter 6, the author presents an application of the methods and principles suggested for lexical acquisition and computational organisation of the lexicon using the corpus-based approach. The two corpora used come from the sublanguage of business English. In passing, the author also provides a short overview of the main theories about the notions of sublanguage, genre, register and text type. Detailed information about the size, material, origin, and dating of the two corpora is given, as well as examples using text samples and concordance listings. The issues of corpus tagging and syntactic parsing are discussed and presented with samples coming from both corpora. However, neither the stage of lemmatisation nor the problems encountered and the methods of converting word forms into lemmas are addressed. Two more problematic points for computational linguistics, those of semantic tagging and parsing, are mentioned, but, as the related techniques are not fully developed yet, no adequate solution is provided. Another interesting issue is corpus processing for pragmatic information which allows us to have considerable typological data about each dictionary entry by referencing texts according to a series of category labels. The structure of the dictionary lexical entries, as well as their degree of potentiality for NLP applications, derives from the above corpus processing: not only is syntactic and semantic information extracted, but also collocational, typological and frequency information can be obtained as a result of this processing. The chapter concludes with concrete examples from the analysis of some sublanguage lexemes.

The concluding chapter stresses the importance of the lexicon as “the central repository of linguistic knowledge” (p. 173) in natural language processing systems. To achieve an adequate lexicon, dictionaries and texts must be combined as machine-readable lexical resources. The whole enterprise presupposes that we take advantage of all the insights that the computer can offer to produce dictionaries that can be disseminated in various ways.

The clear organisation with helpful conclusions at the end of each chapter, the study questions (with the suggested solutions to exercises in Appendix C) and the further reading section for each chapter containing a handful of well-chosen sources, are some of the good points of the book in terms of its structure. One of the serious gaps that we could spot was the omission from the bibliography of some of the more recent publications based on international conferences which reflect significant issues in natural language processing, computational linguistics, corpus linguistics and machine translation, for instance Ljung (1997) or Mitkov & Nicolov (1997). Throughout the book there is reference to relevant web sites, which are all conveniently gathered in Appendix B. However, one cannot help but notice that all the examples refer exclusively to the English language. No mention of corpora and databases involving other languages can be found in the book. This is not in itself a disadvantage; however, we have to strongly disagree with the remark on page 119 that “…it should be clear that similar principles can apply to the case of the computational storage of the lexicon of other languages”. If this were the case, then the same language processing tools would be available for most natural languages. Another disadvantage of the book is the heavy use of abbreviations and acronyms that may put off the novice reader, although they will not be unfamiliar to experts in the field.

Overall, the book seems to be an attempt by the author to share his experience of working with business English corpora to produce lexicons processed in the computer and stored in an electronic format. From this perspective, the first five chapters explain the theoretical framework upon which he bases this enterprise. Since computer lexicography and other related fields are still in their infancy, all such contributions are welcome.


Mitkov, R. and Nicolov, N. (eds.) (1997) Recent advances in natural language processing. Amsterdam: John Benjamins.

Ljung, M. (ed.) (1997) Corpus-based studies in English. Amsterdam/Atlanta: Rodopi.
