** I take the liberty of reproducing the review of CCL by Katsoyannou and
Economou, which appeared in the International Journal of Applied Linguistics
and was sent to me by the authors **
Review of OOI, Vincent B.Y. (1998) Computer Corpus
Lexicography. Edinburgh U.P., pp. xii+243. ISBN 0-7486-0815-X.
International Journal of Applied Linguistics, Vol. 9/2, pp. 299-303.
Reviewed by
Katsoyannou Marianne, Institute for Language and Speech
Processing, Athens
and
Economou Constandina, Institute for Language and Speech Processing, Athens
The title Computer Corpus Lexicography constitutes an
effective merger of three notions: computational linguistics (CL1),
computational lexicography (CL2) and computer corpus linguistics (CL3). The book
is more an introduction to the use of computers in dictionary making and corpus
lexical analysis (exclusively for NLP [natural language processing] purposes)
and less an overview of the disciplines of lexicography and lexicology. The
theoretical core of the book can be encapsulated in the following words: “In the
present book, the emphasis on corpus lexicography means that corpus data forms
the primary basis for the description of a word’s behaviour.” (p. 93). Although
this is not clearly stated, the book is evidently addressed to
final-year/graduate students studying related subjects or to researchers
interested in the respective fields. The existence of study questions at the end
of each chapter, as well as the case study discussed in Chapter 6, points to
this.
The book is divided into seven chapters, three appendices and
a bibliography followed by a short index. The first chapter starts by
introducing the three CLs and stating the main aims of the book: constructing a
linguistic theory for the analysis of a corpus-based lexicon and applying this
framework to corpora in order to describe, extract and process a corpus-based
lexicon. The rest of the chapter constitutes an adequate introduction to various
theories and notions of the lexicon, especially with regard to the concept of
grammar, in order to explain the author’s main choice of the word as the
basic unit in linguistic analysis – at least as far as text processing and
corpus analysis are concerned. However, the choice of the word (based upon
Hudson’s position) over the clause (Halliday’s basic unit of analysis in
Functional Grammar) is too fundamental to be confined to two pages. Although the
presentation of related issues is exhaustive, the discussion in this section of
the book is too short and dense to satisfy the needs of readers without
specialist knowledge. Another issue dealt with in the same chapter
concerns the kind of linguistic information that the lexicon should provide, the
general statements it should contain, and the means by which lexicographers
should represent lexical and pragmatic knowledge: the author suggests the notion
of lexical frame as a structure intended to unify the various types of
information included in the lexicon.
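To make the idea of such a unifying structure more tangible, here is a purely illustrative sketch in Python; the field names are ours, not the author’s, and the book’s own lexical frames may well be organised quite differently.

    from dataclasses import dataclass, field

    @dataclass
    class LexicalFrame:
        # One record unifying several layers of information about a headword
        # (illustrative only; the field names are ours, not Ooi's).
        headword: str
        pos: str                                        # part of speech
        morphology: dict = field(default_factory=dict)  # e.g. inflected forms
        syntax: dict = field(default_factory=dict)      # e.g. subcategorisation
        semantics: dict = field(default_factory=dict)   # e.g. sense glosses
        pragmatics: dict = field(default_factory=dict)  # e.g. register, genre

    acquire = LexicalFrame(
        headword="acquire",
        pos="verb",
        morphology={"past": "acquired", "ing": "acquiring"},
        syntax={"subcat": "transitive"},
        semantics={"gloss": "to obtain or gain possession of"},
        pragmatics={"register": "frequent in business English"},
    )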
The purpose
of Chapter 2 is to thoroughly define the three CLs and to discuss what they
entail for the lexicon. CL1 involves “the computing of natural language seen
from the point of view of linguistics” and is distinguished from NLP, which is
“the computing of natural language seen from the point of view of computer
science” (p. 24). Within this framework, machine translation is considered to be
the oldest CL1 enterprise. The nature of the lexicon in CL1 has shifted from
simple word lists to large computational lexicons derived from the analysis of
real texts using modern programs. CL2 involves not only the creation of MRDs
(machine-readable dictionaries, both directly created in an electronic format
and derived from published dictionaries), but also the building of lexicons for
machine use, as well as the development of dictionaries (in databases) for human
use. According to the author, the development of dictionaries for advanced
learners is the most important part of CL2. Several such dictionaries are
listed, compared and contrasted in terms of the use of corpora for their
construction. The advent of computers in CL3 leads the author to coin the
term “computer corpus linguistics” as the most appropriate replacement for
“corpus linguistics”. Textual and acoustic (or speech) corpora form the basis for the
study of language within this field. The contribution of G. Leech to the
establishment of contemporary CL3 is duly acknowledged. The role of the lexicon
within this field was considered rather secondary, since the main object of CL3
is the corpus. However, the use of corpora instead of (or alongside) MRDs for
the construction of the huge computational lexicons behind the COBUILD dictionary has
challenged this view. The most interesting part of Chapter 2 is the section
where the author deals with the interdependence of CL1, CL2 and CL3. Nowadays,
it is more than evident that these fields can no longer be treated in isolation.
The author enumerates several attempts to provide textual and lexical standards
and to share various kinds of lexical resources.
Chapter 3 constitutes a more detailed attempt to explore what it means to derive
lexicons from corpus data. Besides the common problem of how to handle the
corpus as a lexical resource (comprising a series of questions such as the
choice, size and representativeness of data), using corpora for the development
of dictionaries also involves the means by which lexical evidence is extracted
and treated. The latter include basic electronic tools like the concordancer and
a linguistic framework for the exploitation of corpus data.
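To give a concrete, if deliberately simplified, idea of what a concordancer does, the following Python sketch of our own produces key-word-in-context (KWIC) lines for a node word; real concordancers offer far richer sorting and display facilities.

    import re

    def kwic(text, node, width=30):
        # Return simple key-word-in-context lines for each occurrence of
        # `node` in `text` (a toy concordancer, for illustration only).
        pattern = r"\b" + re.escape(node) + r"\b"
        lines = []
        for m in re.finditer(pattern, text, re.IGNORECASE):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
        return lines

    sample = ("The company plans to acquire new assets. "
              "Firms that acquire competitors must report the acquisition.")
    for line in kwic(sample, "acquire"):
        print(line)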
The main advantages of gathering evidence from corpora are mentioned:
first, the systematisation of the method, which facilitates the generation of
reliable frequency statistics; second, the representativeness of the data (the
evidence from corpora indicates the typical and central tendencies of language
and shows the full range of variability for a given word or expression); third,
the speed of data processing; and, finally, the size of the corpus. Different
examples illustrate the above points. The linguistic framework analysed in this
chapter for examining corpus data is the one implemented by J. Sinclair (1995)
with the COBUILD “Bank of English” corpus. The main characteristic of this
framework is that the
context is taken into account: words are grouped into larger units of meaning
which form an upper layer of lexical structure.
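By way of illustration of the kind of frequency evidence involved (our own toy example, not Sinclair’s actual methodology), a few lines of Python suffice to count word forms and adjacent word pairs in a corpus, the latter serving as a crude stand-in for the “larger units of meaning” just mentioned.

    import re
    from collections import Counter

    def tokens(text):
        # Lower-cased word forms only; a real tokeniser would do much more.
        return re.findall(r"[a-z]+", text.lower())

    corpus = ("the board approved the takeover bid. "
              "a rival takeover bid was rejected by the board.")

    word_freq = Counter(tokens(corpus))          # word-form frequencies
    toks = tokens(corpus)
    bigram_freq = Counter(zip(toks, toks[1:]))   # adjacent word pairs

    print(word_freq.most_common(3))    # the most frequent word forms
    print(bigram_freq.most_common(3))  # e.g. ('takeover', 'bid') as a candidate unit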
Chapter 4 is devoted to the issue of lexical acquisition,
which may be “regarded as a rubric for both lexical knowledge acquisition and
lexical data acquisition” (p. 73). Information is drawn either from
machine-readable dictionaries or corpora. Current research is presented here,
along with the challenge of developing leading-edge computational techniques
capable of producing tools efficient enough to derive lexicons automatically
from real texts. Of course,
this approach is not without its disadvantages, especially as far as issues like
representativeness and reusability of lexical information are concerned.
Furthermore, such efforts, being still in their infancy, lead to semi-automatic
(rather than purely automatic) methods of language processing, involving a
greater or lesser degree of human intervention. The principles governing the
extraction and organisation of
information in order to achieve the transition from corpus to lexicon are
embodied in a framework called the LFA (Lexical Frame Analysis). This framework
has been applied to two sublanguage corpora, and this experience is described in
Chapter 6.
Chapter 5
deals with the computational organisation of a lexicon, offering a survey of the
notions of a computerised lexicon, lexical database and lexical knowledge base,
mainly by presenting various structures already used and available in the
literature. The concept of formalism, defined as the encoding of linguistic or
computational information in an explicit manner, is discussed, and a distinction
is made between: a) a linguistically motivated organisation of the lexicon,
called formalism_1 and b) a computational organisation/representation of the
lexicon, called formalism_2. Both these formalisms may be built into a lexical
database or lexical knowledge base, although the author does not make a sharp
distinction between the two. Rather, lexical knowledge bases are viewed as more
developed and dynamic tools which could contain information about computational
and formal semantics. They are considered particularly efficient and
economical, as they allow lexical entries to be structured with “inheritance”:
entries are integrated into a system where a daughter node inherits some or all
of the properties of a parent node (for example, the daughter node transitive
verb inherits the properties of the parent node verb).
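The inheritance mechanism can be made concrete with a minimal sketch of our own (not taken from the book), in which a daughter class for transitive verbs inherits the general properties of a parent verb class and adds one of its own.

    class Verb:
        # Parent node: properties shared by all verbs.
        category = "verb"
        inflects_for_tense = True

    class TransitiveVerb(Verb):
        # Daughter node: inherits everything from Verb and adds the
        # requirement of a direct object.
        takes_direct_object = True

    entry = TransitiveVerb()
    print(entry.category)             # "verb", inherited from the parent node
    print(entry.takes_direct_object)  # True, specific to the daughter node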
The chapter concludes with a critical presentation of a number of databases (all
English), most of which are well known in the literature, with examples of their
application.
In Chapter 6, the author presents an application of the
methods and principles suggested for lexical acquisition and computational
organisation of the lexicon using the corpus-based approach. The two corpora
used come from the sublanguage of business English. In passing, the author also
provides a short overview of the main theories about the notions of sublanguage,
genre, register and text type. Detailed information about the size, material,
origin, and dating of the two corpora is given, as well as examples using text
samples and concordance listings. The issues of corpus tagging and syntactic
parsing are discussed and presented with samples coming from both corpora.
However, neither the lemmatisation stage nor the problems and methods involved
in converting word forms into lemmas are addressed (a toy illustration of
lemmatisation is sketched below). Two more problematic
points for computational linguistics, those of semantic tagging and parsing, are
mentioned, but, as the related techniques are not fully developed yet, no
adequate solution is provided. Another interesting issue is corpus processing
for pragmatic information, which yields considerable typological data about
each dictionary entry by referencing texts according to a series of category
labels. The structure of the dictionary’s lexical entries, as well as their
potential for NLP applications, derives from the above corpus
processing: not only is syntactic and semantic information extracted, but
collocational, typological and frequency information can also be obtained as a result
of this processing. The chapter concludes with concrete examples from the
analysis of some sublanguage lexemes.
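As an indication of what the lemmatisation stage passed over in the book would involve, here is a deliberately naive, rule-based sketch of our own for converting a few English word forms into lemmas; any serious system would rely on a full morphological lexicon rather than suffix rules of this kind.

    # Toy suffix-stripping lemmatiser (illustration only).
    EXCEPTIONS = {"bought": "buy", "sold": "sell"}

    def lemma(word_form):
        w = word_form.lower()
        if w in EXCEPTIONS:
            return EXCEPTIONS[w]
        for suffix, replacement in (("ies", "y"), ("ing", "e"), ("ed", "e"), ("s", "")):
            if w.endswith(suffix) and len(w) > len(suffix) + 2:
                return w[:-len(suffix)] + replacement
        return w

    for form in ["acquisitions", "acquiring", "acquired", "companies", "bought"]:
        print(form, "->", lemma(form))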
The concluding chapter stresses the importance of the lexicon
as “the central repository of linguistic knowledge” (p. 173) in natural language
processing systems. To achieve an adequate lexicon, dictionaries and texts must
be combined as machine-readable lexical resources. The whole enterprise
presupposes that we take advantage of all the insights that the computer can
offer to produce dictionaries that can be disseminated in various ways.
The clear organisation, with helpful conclusions at the end of each chapter,
the study questions (with suggested solutions to the exercises in Appendix C)
and the further reading section for each chapter, containing a handful of
well-chosen sources, are among the good points of the book in terms
of its structure. One of the serious gaps that we could spot was the omission
from the bibliography of some of the more recent publications based on
international conferences which reflect significant issues in natural
language processing, computational linguistics, corpus linguistics and machine
translation, for instance Ljung (1997) or Mitkov & Nicolov (1997).
Throughout the book there is reference to relevant web sites, which are all
conveniently gathered in Appendix B. However, one cannot help but notice that
all the examples refer exclusively to the English language. No mention of
corpora and databases involving other languages can be found in the book. This
is not in itself a disadvantage; however, we have to disagree strongly with the
remark on page 119 that “…it should be clear that similar principles can apply
to the case of the computational storage of the lexicon of other languages”. If
this were the case, then the same language processing tools would have been
available for most natural languages. Another disadvantage of the book is the
heavy use of abbreviations and acronyms that may put off the novice reader,
although they will not be unfamiliar to experts in the field.
Overall, the book seems to be an attempt by the author to
share his experience of working with business English corpora to produce
lexicons processed by computer and stored in electronic format. From this
perspective, the first five chapters explain the theoretical framework upon
which he bases this enterprise. Since computer lexicography and other related
fields are still in their infancy, all such contributions are welcome.
References
Mitkov, R. and Nicolov, N. (eds.) (1997) Recent advances in natural language
processing. Amsterdam: John Benjamins.
Ljung, M. (ed.) (1997) Corpus-based studies in English.
Amsterdam/Atlanta, GA: Rodopi.