LANGUAGE LEARNING AS COMPRESSION
Gerry Wolff, ORCID: 0000-0002-4624-8904
There is good evidence that the way a child learns his or her first
language may, in large measure, be understood as information compression.
The principle of "Minimum Length Encoding" (MLE), "Minimum Description
Length" (MDL) or "Minimum Message Length" (MML) encoding, which has been
pursued in other research on grammatical inference, appears to be highly
relevant to understanding language learning by children.
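To make the principle concrete, here is a minimal sketch of the MLE/MDL trade-off (it is not code from either of the models described below): a 'grammar' is scored by the bits needed to write it down plus the bits needed to encode the data in terms of it, and the grammar with the smallest total is preferred. The 5-bits-per-character figure, the toy text and the candidate grammars are illustrative assumptions only.

```python
import math

def grammar_bits(units):
    # Cost of writing the grammar down, at roughly 5 bits per character.
    return 5 * sum(len(u) for u in units)

def encoding_bits(tokens, units):
    # Cost of the data expressed as references to grammar units,
    # using a fixed-length code over the set of units.
    assert all(t in units for t in tokens)
    return math.log2(len(units)) * len(tokens)

def total_bits(tokens, units):
    # The MLE/MDL score: grammar size plus size of the encoded data.
    return grammar_bits(units) + encoding_bits(tokens, units)

text = "thecatsatonthematthecatsatonthemat"

# Grammar 1: single letters only -- a tiny grammar but an expensive encoding.
letters = sorted(set(text))
# Grammar 2: word-sized chunks -- a bigger grammar but a much cheaper encoding.
words = ["the", "cat", "sat", "on", "mat"]
word_tokens = ["the", "cat", "sat", "on", "the", "mat"] * 2

print(round(total_bits(list(text), letters)))   # approx. 153 bits
print(round(total_bits(word_tokens, words)))    # approx. 98 bits
```

Under this crude measure, chunking the text into word-sized units gives the smaller total, which is the sense in which a good grammar compresses its data.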
Two computer models of first language learning have been developed:
- Program MK10, dedicated to the learning of conjunctive (segmental) structures
(words, phrases etc) from unsegmented linguistic input (a simple illustration of
this kind of chunking appears after this list).
- Program SNPR, dedicated to the learning of hierarchical grammars with the
formation of conjunctive structures and disjunctive categories (parts
of speech etc) at any level, together with the generalisation of
grammatical rules and the correction of overgeneralisations.
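As a rough illustration of the kind of chunking involved in learning segmental structures from unsegmented input, the sketch below repeatedly finds the most frequent pair of adjacent elements in a character sequence and replaces it with a new unit. It is offered only as an analogy and is not the published MK10 algorithm; the merge limit and frequency threshold are arbitrary choices for the example.

```python
from collections import Counter

def chunk(text, n_merges=3, min_count=2):
    seq = list(text)                           # start from single characters
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))     # count adjacent pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break                              # nothing repeats often enough
        merged, i = [], 0
        while i < len(seq):                    # replace every (a, b) by one new unit
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return seq

# Unsegmented 'speech': repeated words with no boundaries marked.
print(chunk("thecatsatonthematthecatsatonthemat", n_merges=3))
```

After only three merges the recurring sequences "the" and "at" have already become units, without any boundary information in the input; with more merges larger chunks form, and on the compression view it is the effect on overall description length that decides which chunks are worth keeping.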
Both models may be seen as systems for information compression.
As a by-product of this feature and without ad hoc provision,
the patterns of learning by the models correspond remarkably well with
observed features of language learning by children:
- Unsupervised learning of conjunctive segmental structures
(words, phrases etc) from unsegmented linguistic input.
- Unsupervised learning of disjunctive categories (nouns, verbs etc) from
unsegmented linguistic input.
- Unsupervised learning of hierarchical grammars with conjunctive
and disjunctive structures at any level.
- Generalisation of grammatical rules and unsupervised correction of
overgeneralisations.
- The pattern of changes in the rate of acquisition of words as
learning proceeds. The slowing of language development in later years.
- The order of acquisition of words and morphemes.
- Brown's (1973) Law of Cumulative Complexity.
- The S-P/episodic-semantic shift.
- Learning of non-linguistic cognitive structures.
- Learning of 'correct' forms despite the existence of errors
in linguistic data.
- The word frequency effect.
This empirical evidence is described and discussed in
Learning syntax and meanings through optimization and distributional analysis.
The two models are based on the belief (for which there is good
evidence) that, notwithstanding Gold's theorems,
a child can learn his or her first language without correction by
a 'teacher' or the provision of 'negative' samples or the grading of
samples in any way (although any of these things may assist learning).
In short, these are unsupervised models of language learning.
Notice that 'correct' grammatical generalisations and erroneous over-generalisations
(e.g. "I bightened it" or "They are gooses")
both, by definition, have zero frequency in a child's experience. How can a
child learn to eliminate the incorrect generalisations and retain the correct
generalisations without error correction by a 'teacher' or other kinds of
supervision? MLE appears to provide the answer: a
grammar containing 'correct' generalisations but no 'incorrect' ones provides
an optimal or near-optimal balance between the size of the grammar and
its efficiency for encoding data. The SNPR model demonstrates how 'correct'
generalisations can be discriminated from 'incorrect' ones without external
supervision (see
Language acquisition, data compression and generalization).
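As a rough illustration of that argument (and not the SNPR program itself), the sketch below scores three toy grammars for two-word sentences by grammar size plus encoding cost. A rote grammar stores every heard sentence whole; a 'correct' grammar groups words into noun and verb categories; an overgeneral grammar lets any word fill either slot. The 5-bits-per-character figure, the vocabulary and the heard corpus are invented for the example; under this crude measure the grammar with the 'correct' generalisation gives the smallest total.

```python
import math

BITS_PER_CHAR = 5  # crude assumption for the cost of writing the grammar down

def grammar_bits(categories):
    # Characters of every stored unit, plus a tag saying which category each
    # unit belongs to (the tag is free when there is only one category).
    units = [u for us in categories.values() for u in us]
    tag = math.log2(len(categories)) if len(categories) > 1 else 0
    return BITS_PER_CHAR * sum(len(u) for u in units) + tag * len(units)

def encoding_bits(corpus, categories, rule):
    # Each sentence is encoded slot by slot: log2(category size) bits per slot.
    bits = 0
    for sentence in corpus:
        parts = sentence.split() if len(rule) > 1 else [sentence]
        for part, cat in zip(parts, rule):
            assert part in categories[cat]
            bits += math.log2(len(categories[cat]))
    return bits

def total_bits(corpus, categories, rule):
    return grammar_bits(categories) + encoding_bits(corpus, categories, rule)

# 25 heard sentences: five noun-verb combinations, five times each
# ("sue jumps" happens never to have been heard).
heard = ["john runs", "john jumps", "mary runs", "mary jumps", "sue runs"] * 5

grammars = {
    "rote":        ({"S": heard[:5]}, ["S"]),
    "correct":     ({"N": ["john", "mary", "sue"], "V": ["runs", "jumps"]}, ["N", "V"]),
    "overgeneral": ({"W": ["john", "mary", "sue", "runs", "jumps"]}, ["W", "W"]),
}

for name, (cats, rule) in grammars.items():
    print(name, round(total_bits(heard, cats, rule)))   # 'correct' gives the smallest total
```

The rote grammar makes no generalisations and pays for a large grammar; the overgeneral grammar also licenses nonsense such as "runs john" and pays for an inefficient encoding; the grammar with the 'correct' generalisation, which also covers the unheard "sue jumps", minimises the sum, without any teacher or negative samples.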
The evidence for language learning as information compression reinforces
the idea, which dates back to William of Ockham in the 14th century, that information processing
in brains and nervous systems may often usefully be understood as
information compression (see
Cognition as Compression).
These ideas have provided the inspiration for a current programme of research
developing the idea that computing
in some deep sense may be understood as information compression
(see Computing as Compression).
The list of publications below contains most of my publications relating to the learning of a first language or languages by children. The marked publications are the ones that I regard as my best in this area.
Why should I be advertising publications with dates which are not recent? Surely, they are all "old hat" and
superseded by later work? I offer some explanation and justification in
an accompanying web page.
The five articles that I regard as the best are available as MS Word documents (in compressed or uncompressed form). They were scanned into MS Word using optical character recognition, with manual correction of errors.
Towards a Theory of Cognition and Computing. Chichester:
Ellis Horwood, 1991.
The book is on sale from the publishers and from
amazon.com.
This book describes work on 'language learning as compression' and
'computing as compression' up to 1991. It includes copies of my 1988 and 1982 papers (see below).
Learning syntax and meanings through optimization and distributional analysis
Chapter 7 in Y Levy, I M Schlesinger and M D S Braine (Eds),
Categories and Processes in Language Acquisition, Hillsdale, NJ: Lawrence Erlbaum,
pp 179-215, 1988. PDF (227 Kb), MS Word
(389 Kb), MS Word
(compressed) (132 Kb).
This is an overview of the entire programme of research on language
learning as information compression. It is reproduced in
Towards a Theory of Cognition and Computing, Chapter 2.
"Cognitive development as optimization". In L Bolc (Ed),
Computational Models of Learning, Heidelberg:
Springer-Verlag, pp 161-205, 1987.
Discusses how the SNPR model may be generalised to handle discontinuous dependencies in syntax.
Language acquisition, data compression and generalization
Language & Communication 2, 57-89, 1982. MS Word
(219 Kb), MS Word
(compressed) (59 Kb).
This gives a detailed description of program SNPR, which can
discover simple grammars successfully from unsegmented text without
supervision by a 'teacher'. The program demonstrates how correct generalisations can
be distinguished from overgeneralisations without correction by a teacher
(or negative samples or graded presentation of samples) using the principle
of Minimum Length Encoding. The article is reproduced in
Towards a Theory of Cognition and Computing, Chapter 3.
Language acquisition and the discovery of phrase structure
Language & Speech 23, 255-269, 1980. MS Word
(131 Kb), MS Word
(compressed) (52 Kb).
Describes the application of program MK10 to the discovery of
phrase structure in natural language.
The discovery of segments in natural language
British Journal of Psychology 68, 97-106, 1977. MS Word
(135 Kb), MS Word
(compressed) (56 Kb).
Describes the application of program MK10 to the discovery of word
structure in natural language.
An algorithm for the segmentation of an artificial language analogue
British Journal of Psychology 66, 79-90, 1975. MS Word
(654 Kb), MS Word
(compressed) (288 Kb).
Describes program MK10 which is designed to discover segmental structure in
unsegmented samples of 'language' by statistical means. Examples
are presented showing how it can detect 'words' in artificial language texts.
Frequency, conceptual structure and pattern recognition
British Journal of Psychology 67 (3), 377-390, 1976. MS Word
(444 Kb), MS Word (compressed) (393 Kb).
This is an early attempt to model conceptual structure and the
recognition of concepts in terms of economical encoding of information. The
computer models are now superseded by the SP framework (see Computing
as Compression) but the paper is still useful for identifying six key
features of human conceptual/recognition systems: Salience of concepts;
hierarchical relations amongst concepts; overlap between concepts; ‘fuzziness’
of conceptual boundaries; the polythetic nature of human concepts including
the possibility of recognizing patterns in spite of distortions;
differential weighting of attributes in recognition.