LANGUAGE LEARNING AS COMPRESSION

Gerry Wolff, ORCID: 0000-0002-4624-8904, short BIO and longer BIO.

KEY READS: Key publications with notes.

There is good evidence that the way a child learns his or her first language may, in large measure, be understood as information compression. The principle of "Minimum Length Encoding" (MLE), also known as "Minimum Description Length" (MDL) or "Minimum Message Length" (MML) encoding, which has been pursued in other research on grammatical inference, appears to be highly relevant to understanding language learning by children.
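Stated schematically in two-part MDL terms (a standard formulation rather than a quotation from the publications listed below), the learner is assumed to favour the grammar G that minimises the cost of describing the grammar itself plus the cost of encoding the observed language data D in terms of that grammar:

    G^{*} = \arg\min_{G} \bigl[ L(G) + L(D \mid G) \bigr]

where L(G) is the number of bits needed to describe the grammar and L(D | G) is the number of bits needed to encode the data when that grammar is used as the code. A grammar that is too large, or one that encodes the data inefficiently, loses on one term or the other.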

Two computer models of first language learning have been developed:

  • Program MK10, dedicated to the learning of conjunctive (segmental) structures (words, phrases etc) from unsegmented linguistic input.
  • Program SNPR, dedicated to the learning of hierarchical grammars with the formation of conjunctive structures and disjunctive categories (parts of speech etc) at any level, together with the generalisation of grammatical rules and the correction of overgeneralisations.
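Neither program is reproduced here, but the flavour of the MK10 idea can be conveyed with a minimal sketch in Python (an illustrative analogue only, not the MK10 algorithm itself): repeatedly merge the most frequent pair of adjacent symbols into a new chunk, so that frequently recurring sequences come to be treated as units.

from collections import Counter

def chunk_by_frequency(text, n_merges=20):
    """Repeatedly merge the most frequent pair of adjacent symbols.
    A toy analogue of learning segmental structure from unsegmented
    input by compression: each merge lets many occurrences of a
    frequent pair be written as a single new symbol."""
    seq = list(text)                        # start from single characters
    for _ in range(n_merges):
        pairs = Counter(zip(seq, seq[1:]))  # count adjacent pairs
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:                       # nothing recurs; stop merging
            break
        merged, new_seq, i = a + b, [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                new_seq.append(merged)      # replace the pair with one chunk
                i += 2
            else:
                new_seq.append(seq[i])
                i += 1
        seq = new_seq
    return seq

# an unsegmented sample built from a tiny artificial 'language'
sample = "johnlovesmaryjohnlovesannmarylovesjohn"
print(chunk_by_frequency(sample))

With enough merges on a larger sample, recurring sequences such as 'john' or 'loves' tend to emerge as single chunks, and merging stops when no pair occurs often enough to be worth a new symbol.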

Both models may be seen as systems for information compression. As a by-product of this feature and without ad hoc provision, the patterns of learning by the models correspond remarkably well with observed features of language learning by children:

  • Unsupervised learning of conjunctive segmental structures (words, phrases etc) from unsegmented linguistic input.
  • Unsupervised learning of disjunctive categories (nouns, verbs etc) from unsegmented linguistic input.
  • Unsupervised learning of hierarchical grammars with conjunctive and disjunctive structures at any level.
  • Generalization of grammatical rules and unsupervised correction of overgeneralizations.
  • The pattern of changes in the rate of acquisition of words as learning proceeds, and the slowing of language development in later years.
  • The order of acquisition of words and morphemes.
  • Brown's (1973) Law of Cumulative Complexity.
  • The S-P/episodic-semantic shift.
  • Learning of non-linguistic cognitive structures.
  • Learning of 'correct' forms despite the existence of errors in linguistic data.
  • The word frequency effect.

This empirical evidence is described and discussed in Learning syntax and meanings through optimization and distributional analysis.

The two models are based on the belief (for which there is good evidence) that, notwithstanding Gold's theorems, a child can learn his or her first language without correction by a 'teacher' or the provision of 'negative' samples or the grading of samples in any way (although any of these things may assist learning). In short, these are unsupervised models of language learning.

Notice that 'correct' grammatical generalisations and erroneous 'over' generalisations (e.g. "I bightened it" or "They are gooses") both, by definition, have zero frequency in a child's experience. How can a child learn to eliminate the incorrect generalisations and retain the correct generalisations without error correction by a 'teacher' or other kinds of supervision? MLE appears to provide the answer: a grammar containing 'correct' generalisations but no 'incorrect' ones appears to provide an optimal or near-optimal balance between the size of the grammar and its efficiency for encoding data.
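A deliberately over-simple toy calculation in Python (not the SNPR program, and not the coding scheme used in the publications below; the costs are assumptions chosen purely for illustration) shows the kind of comparison involved. Two candidate grammars for three-word sentences are compared: one with 'correct' noun and verb classes, and an over-general one that allows any word in any position. The over-general grammar wastes encoding capacity on sentences that never occur, so its total description length comes out larger.

import math

BITS_PER_CHAR = 5  # crude, assumed cost per character for listing a word in the grammar

def grammar_cost(slots):
    """Bits needed to list every distinct word that the grammar mentions."""
    words = {w for slot in slots for w in slot}
    return sum(BITS_PER_CHAR * len(w) for w in words)

def data_cost(corpus, slots):
    """Bits to encode the corpus: one uniform choice per slot per sentence."""
    for sentence in corpus:
        assert all(word in slot for word, slot in zip(sentence, slots)), \
            "grammar cannot encode this sentence"
    bits_per_sentence = sum(math.log2(len(slot)) for slot in slots)
    return len(corpus) * bits_per_sentence

nouns = ["john", "mary", "sue", "ann"]
verbs = ["loves", "helps"]
corpus = [("john", "loves", "mary"), ("sue", "helps", "ann"),
          ("mary", "loves", "john"), ("ann", "helps", "sue")] * 12

candidates = {
    "noun-verb-noun (correct classes)": [nouns, verbs, nouns],
    "any-any-any (over-general)":       [nouns + verbs] * 3,
}

for name, slots in candidates.items():
    g = grammar_cost(slots)
    d = data_cost(corpus, slots)
    print(f"{name:35s} grammar={g:6.1f}  data={d:6.1f}  total={g + d:6.1f} bits")

On this toy corpus the class-based grammar gives the smaller total, which is the essence of the argument: candidate generalisations survive or are discarded according to whether they pay their way in overall compression.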

The SNPR model demonstrates how 'correct' generalisations can be discriminated from 'incorrect' ones without external supervision (see Language acquisition, data compression and generalization). 

The evidence for language learning as information compression reinforces the idea, which dates back to William of Ockham in the 14th century, that information processing in brains and nervous systems may often usefully be understood as information compression (see Cognition as Compression).

These ideas have provided the inspiration for a current programme of research developing the idea that computing in some deep sense may be understood as information compression (see Computing as Compression).

PUBLICATIONS ON LANGUAGE LEARNING AS COMPRESSION

The publications listed below are most of my publications relating to the learning of a first language or languages by children. Those marked 'KEY READ' are the ones that I regard as my best publications in this area: they give the best overall view of the research.

Why should I be advertising publications that are not recent? Surely they are all "old hat", superseded by later work? I offer some explanation and justification in an accompanying web page.

The five articles that I regard as the best are available as MS Word documents (in compressed or uncompressed form). They were scanned into MS Word using optical character recognition, with manual correction of errors.

The book is on sale from the publishers and from amazon.com.


Towards a Theory of Cognition and Computing.
Chichester: Ellis Horwood, 1991.
This book describes work on 'language learning as compression' and 'computing as compression' up to 1991. It includes copies of my 1988 and 1982 papers (see below).

KEY READ: Learning syntax and meanings through optimization and distributional analysis

Chapter 7 in Y Levy, I M Schlesinger and M D S Braine (Eds), Categories and Processes in Language Acquisition, Hillsdale, NJ: Lawrence Erlbaum, pp 179-215, 1988. PDF (227 Kb), MS Word (389 Kb), MS Word (compressed) (132 Kb).

This is an overview of the entire programme of research on language learning as information compression. It is reproduced in Towards a Theory of Cognition and Computing, Chapter 2.

"Cognitive development as optimization".
In L Bolc (Ed), Computational Models of Learning, Heidelberg: Springer-Verlag, pp 161-205, 1987.
Discusses how the SNPR model may be generalised to handle discontinuous dependencies in syntax.

KEY READ: Language acquisition, data compression and generalization

Language & Communication 2, 57-89, 1982. MS Word (219 Kb), MS Word (compressed) (59 Kb).

This gives a detailed description of program SNPR, which can discover simple grammars successfully from unsegmented text without supervision by a 'teacher'. The program demonstrates how correct generalisations can be distinguished from overgeneralisations without correction by a teacher (or negative samples or graded presentation of samples) using the principle of Minimum Length Encoding. The article is reproduced in Towards a Theory of Cognition and Computing, Chapter 3.

KEY READ: Language acquisition and the discovery of phrase structure

Language & Speech 23, 255-269, 1980. MS Word (131 Kb), MS Word (compressed) (52 Kb).

Describes the application of program MK10 to the discovery of phrase structure in natural language.

KEY READ: The discovery of segments in natural language

British Journal of Psychology 68, 97-106, 1977.  MS Word (135 Kb), MS Word (compressed) (56 Kb).

Describes the application of program MK10 to the discovery of word structure in natural language.

KEY READ: An algorithm for the segmentation of an artificial language analogue

British Journal of Psychology 66, 79-90, 1975. MS Word (654 Kb), MS Word (compressed) (288 Kb).

Describes program MK10 which is designed to discover segmental structure in unsegmented samples of 'language' by statistical means. Examples are presented showing how it can detect 'words' in artificial language texts.

Frequency, conceptual structure and pattern recognition

British Journal of Psychology 67 (3), 377-390, 1976. MS Word (444 Kb), MS Word (compressed) (393 Kb)

This is an early attempt to model conceptual structure and the recognition of concepts in terms of economical encoding of information. The computer models are now superseded by the SP framework (see Computing as Compression), but the paper is still useful for identifying six key features of human conceptual/recognition systems: salience of concepts; hierarchical relations amongst concepts; overlap between concepts; ‘fuzziness’ of conceptual boundaries; the polythetic nature of human concepts, including the possibility of recognizing patterns in spite of distortions; and differential weighting of attributes in recognition.
