AI and Natural Languages - 20th Century to Today
How decades of scientific work brought about the LLM revolution
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/d965afb4-92f8-40b9-a810-0e78fcbf2a7b/Beehiv-Newsletter-001.png?t=1734549470)
“The limits of my language mean the limits of my world.”
1. The Straw
The year is 1950 — humanity is recovering from one of the most harrowing conflicts it has ever seen. Major world powers have witnessed the capabilities of computers in war and are in a race for technological hegemony.
At this point, all eyes are set on Alan Turing — the father of computer science and major codebreaker of Bletchley Park — who has just published an article in Mind titled ‘Computing Machinery and Intelligence’.
Little does anyone realize that this article would set the stage for today by asking one simple question: can we tell if a machine thinks?
Turing’s answer is simple — rather than trying to determine if a machine can think, one should see if it can mimic someone who does.
Turing proposes what he calls the imitation game — simply have a human evaluator talk to a person and a machine through a computer terminal and see if they can tell the difference. This, of course, would be the first major publication in the nascent fields of artificial intelligence and natural language processing.
The earliest researchers in natural language processing — interested in mechanizing language translation — employed logical models in the study of language. The idea of using a set of rules to approach natural language wasn’t new — in the early 17th century, the famous philosopher and mathematician René Descartes, in a letter to Marin Mersenne of Mersenne prime fame, proposed the idea of a universal language that assigned common codes to equivalent ideas across languages. Later, Gottfried Leibniz would publish a similar idea in ‘De Arte Combinatoria’, where he posited the existence of a universal ‘human thought alphabet’ on the basis of the commonality of human intellect.
René Descartes, French philosopher and mathematician of Cartesian coordinate fame.
Meanwhile, in the 1930s, Petr Troyanskii and Georges Artsrouni independently filed patents for mechanical tape machines for multilingual translation. Noam Chomsky's 1957 book, Syntactic Structures, laid the foundations for studying languages using generative grammars. The advent of digital computers, however, would rapidly accelerate developments in the area.
“There is no use disputing with the translating machine. It will prevail.”
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/63304a61-21eb-4c93-876f-9b93a6771e8e/image.png?t=1733822390)
Chomsky, author of Syntactic Structures, which approached linguistics via generative grammar.
2. Call to Action
In 1947, Warren Weaver had first suggested using computers for machine translation in a conversation with Andrew Booth, who would go on to conduct the first set of experiments with punched cards the following year.
At the urging of colleagues, Weaver would publish his now celebrated 1949 memorandum, simply titled ‘Translation’, which outlined four major proposals for machine translation:
Words should be judged using a context window.
Languages must be understood using logical models.
Languages may lend themselves to statistical analysis.
All translation requires an understanding of a hidden ‘universal human language’.
Responses to Weaver’s memorandum were mixed — many believed mechanizing translation was a pipe dream. Some, however, gave Weaver’s ideas serious consideration, such as Erwin Reifler and Abraham Kaplan.
That said, perhaps the most important consequence of Weaver’s memorandum was its role in the 1951 appointment of logician Yehoshua Bar-Hillel to a research position in machine translation at MIT; he would go on to organize the first conference on machine translation the following year.
Warren Weaver, mathematician and science administrator, and a central figure in machine translation.
Hurd, Dostert, and Watson at the Georgetown-IBM demonstration. Image taken from John Hutchins’ paper.
3. Early Contenders
Natural language processing first earned the general public’s attention with the Georgetown-IBM demonstration in 1954. Using a small set of 250 words and six syntax rules, the accompanying IBM 701 mainframe computer translated Russian sentences on topics such as mathematics, organic chemistry, and politics into English.
The project was headed by Cuthbert Hurd, head of Applied Sciences at IBM, and Leon Dostert, the central figure behind the interpretation system used at the Nuremberg trials, with Thomas Watson at IBM being the key facilitator.
How’d this happen? Bar-Hillel’s 1952 conference had persuaded an initially skeptical Dostert that mechanized translation was feasible, and it was his idea that a practical, small-scale experiment should be the next step in approaching machine translation.
The demonstration made it to the front pages of the New York Times and several other newspapers across America and Europe, and it was the first time the average person was exposed to the idea of computers translating languages. And, up until the 1980s, rule-based systems would dominate natural language processing as a whole.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/125a83fc-8d01-449a-8788-e53e046b4f71/image.png?t=1733823353)
Weizenbaum interacting with ELIZA on a computer terminal with printable output, circa 1966. Image taken from ‘Desiring Fakes’ by Daniel Becker.
Some major examples of rule-based developments in the ‘60s include:
Daniel Bobrow’s PhD thesis in 1964, which presented an AI system called STUDENT. Written in Lisp, STUDENT provided numerical answers to elementary algebra word problems. It is one of the earliest examples of a question answering system.
1966 saw Joseph Weizenbaum’s ELIZA — an early chatbot that simulated a Rogerian psychotherapist by reflecting the user’s input back in rephrased form.
In 1968, Terry Winograd began developing SHRDLU, a rudimentary conversational AI operating in an internalized ‘blocks world’ that users could instruct.
The 1970s and 1980s continued the general trend of rule-driven exploration in natural language processing, e.g. chatbots such as Jabberwacky trying to tackle humor and methods such as the Lesk algorithm attempting to address word-sense disambiguation.
4. Change of Plans
Of course, rule-based methods weren’t the only ones. Earlier, in 1948, Claude Shannon — father of information theory and modern communications — had outlined the word n-gram model for statistically modelling transmitted messages. It was precisely this idea that Weaver had referred to in ‘Translation’ for his third proposal. However, statistical methods in the context of natural language processing suffered from two major flaws — they required swathes of data and great computing power. As a result, statistical models weren’t explored as much in natural language processing. But as fate would have it, statistical methods would gain in popularity due to a series of developments leading up to the late 1980s.
Claude Shannon, father of information theory and central figure in the use of statistics in communications. Image taken from Wikipedia.
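To make the word n-gram idea concrete, here is a minimal sketch of a bigram model (n = 2) in Python. The toy corpus and function name are ours, purely for illustration — this is the flavor of the technique, not any particular historical system.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus):
    """Count how often each word follows another, then normalize the counts to probabilities."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, curr in zip(tokens, tokens[1:]):
            counts[prev][curr] += 1
    return {
        prev: {w: c / sum(next_counts.values()) for w, c in next_counts.items()}
        for prev, next_counts in counts.items()
    }

# Toy corpus -- real systems need vastly more data, which is exactly
# why statistical methods had to wait for bigger corpora and machines.
corpus = [
    "the machine translates the sentence",
    "the machine reads the sentence",
]
model = train_bigram_model(corpus)
print(model["the"])  # e.g. {'machine': 0.5, 'sentence': 0.5}
```

Even this toy version makes the two flaws above obvious: the estimates are only as good as the corpus, and the tables grow rapidly as n and the vocabulary increase.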
“But it must be recognized that the notion of ‘probability of a sentence’ is an entirely useless one, under any known interpretation of this term.”
What led to this migration? Well,
‘Chomskyan’ approaches to language modelling — which generally discouraged the use of text corpora as the foundation for natural language processing — fell out of favor due to ever-increasing complexity in rule design without matching results.
The landscape of computer hardware was transforming dramatically — computer manufacturers abandoned the vacuum tube for transistors by the dawn of the ‘60s, and Moore’s Law was already in high gear: Intel’s 4004 in 1971 with around 2,300 transistors paled in comparison to their 80286 with 120,000+ transistors from eleven years later!
Massive text corpora became available — the famous Brown Corpus was on tape by 1964.
As a result of all of this, the late 1980s saw an upswing in statistical methods being used for language modelling, notable examples being the IBM alignment models and a slew of original methods built around varying strategies for navigating probabilities.
It would take until the 2000s for the natural language processing landscape to change into something closer to what it is today.
“Of course, linguists do not generally keep NLP in mind as they do their work, so not all of linguistics should be expected to be useful.”
5. Signs of the Revolution
Artificial Intelligence research reemerged from the ‘AI Winter’ on the back of neural networks, largely thanks to the collective work of several (oft ignored) researchers:
Alexey Ivakhnenko, the inventor of deep learning.
Seppo Linnainmaa, who discovered the backpropagation algorithm.
Kunihiko Fukushima, the creator of what we now know as the convolutional neural network architecture.
Accompanying computational experiments, such as those of David Rumelhart and Geoffrey Hinton in 1986 and Yann LeCun in 1989, further demonstrated the capabilities of neural-network-based approaches. Naturally, it was expected that neural networks would be the way to go in natural language processing as well.
![](https://media.beehiiv.com/cdn-cgi/image/fit=scale-down,format=auto,onerror=redirect,quality=80/uploads/asset/file/3ba5a148-3213-4295-a123-a48f57de366c/image.png?t=1733824258)
Alexey Ivakhnenko, the inventor of deep learning. Image taken from Wikipedia.
In 1997, Sepp Hochreiter and Jürgen Schmidhuber published the oft-cited paper Long Short-Term Memory, which is the source of the LSTM architecture.
Schmidhuber and Stefan Heil had already demonstrated the effectiveness of neural networks in text analysis a year earlier. In 2001, LSTMs were proven to learn languages impenetrable to methods such as hidden Markov models.
Then in 2003, Yoshua Bengio et al. outperformed the word n-gram model by using a multilayer perceptron.
Later in 2013, Tomas Mikolov and the team at Google would develop word2vec, a neural approach to developing vector representations for text.
This was matched in 2014 by Jeffrey Pennington et al.’s GloVe, with notable improvements over the former.
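As a rough sketch of what these embedding methods buy you, here is what training and querying word2vec looks like, assuming the gensim library (`pip install gensim`); the toy corpus, hyperparameters, and similarity query are ours, chosen only for illustration.

```python
from gensim.models import Word2Vec

# Tiny toy corpus of pre-tokenized sentences -- real embeddings are
# trained on billions of words, which is what made word2vec notable.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["the", "dog", "chases", "the", "ball"],
]

# Skip-gram variant (sg=1) with small vectors, purely for demonstration.
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, sg=1, epochs=200)

# Every word is now a dense vector; similarity is just a cosine between vectors.
print(model.wv["king"][:4])                    # first few coordinates of the vector
print(model.wv.similarity("king", "queen"))    # cosine similarity of two words
```

With such a tiny corpus the numbers are meaningless, of course; the point is the shape of the approach — words in, dense vectors out, with similarity reduced to geometry.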
Although promising, these results don’t quite match the modern-day large language model (LLM) boom. So what’s different? We must first turn our attention to seq2seq, a family of natural language processing approaches built on Shannon’s theory of communication.
Viewing linguistic problems as noisy channel problems, seq2seq makes use of an encoder-decoder pair (built using LSTMs) to produce word sequences. One could then, in principle, chat with a well-tuned model built on this idea much as one would with a modern LLM.
Although promising in concept, seq2seq had one major flaw — the fixed size of the encoding vector acted as an information bottleneck. As a consequence, information in long streams of text would be lost.
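To see where that bottleneck comes from, here is a minimal PyTorch sketch of an LSTM encoder-decoder. The class name, dimensions, and toy inputs are our own, chosen purely for illustration; real seq2seq systems add teacher forcing, beam search, and much more.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Bare-bones encoder-decoder: the entire source sentence must be
    squeezed into the encoder's final hidden state (the bottleneck)."""

    def __init__(self, vocab_size, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, src_ids, tgt_ids):
        # Encode: no matter how long src_ids is, only (h, c) survives.
        _, (h, c) = self.encoder(self.embed(src_ids))
        # Decode: the fixed-size (h, c) is all the decoder ever sees of the source.
        dec_out, _ = self.decoder(self.embed(tgt_ids), (h, c))
        return self.out(dec_out)  # logits over the target vocabulary

model = Seq2Seq(vocab_size=1000)
src = torch.randint(0, 1000, (1, 12))   # a 12-token "source sentence"
tgt = torch.randint(0, 1000, (1, 9))    # a 9-token "target prefix"
print(model(src, tgt).shape)            # torch.Size([1, 9, 1000])
```

Whether the source is 12 tokens or 1,200, the decoder only ever receives the same fixed-size pair of hidden vectors, which is precisely the bottleneck described above.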
6. It Is Televised
“GPT is like alchemy!”
Then, in 2017, one particular paper took the world by storm — ‘Attention Is All You Need‘.
Published by a team of eight scientists from Google, this paper is the critical component that all LLMs draw on today because it introduced two key developments to modern-day natural language processing — attention and transformers.
Although attention as a mechanism was well studied by cognitive scientists, and earlier papers such as Schmidhuber’s in 1992 had proposed mathematically similar concepts, it wasn’t until the Google publication that transformers became so sought after.
Transformers overtook LSTMs in most applications for two key reasons:
Transformers lack recurrent components, which makes them easier to parallelize and quicker to train.
Transformers overcome seq2seq’s bottleneck problem thanks to the attention mechanism (sketched below).
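As a rough sketch of the mechanism, here is a minimal scaled dot-product attention in NumPy, with toy dimensions of our choosing; the actual paper adds learned projections, multiple heads, and masking on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each query attends to every key; the output is a weighted mix of the values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # similarity of queries to keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over the keys
    return weights @ V                                    # one blended value vector per query

# Toy example: 4 tokens, 8-dimensional representations.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

Because every token’s output is computed from all other tokens in a single matrix product, nothing has to be processed step by step the way an LSTM’s recurrence demands, and the model can look directly at any part of the input instead of through one fixed-size summary.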
Although transformers were initially meant to tackle machine translation, they’ve since enjoyed several applications such as computer vision, audio processing, and playing games like chess.
In fact, transformers are the core behind several pre-trained models such as BERT and Generative Pre-Trained Transformer — the GPT family of models that has earned public attention since ChatGPT was unveiled.
With a solid foundation of heuristic principles and a heap of data to analyze (i.e. the internet), LLMs have been improved primarily by expanding datasets and increasing the number of underlying parameters, which roughly corresponds to how complex the architecture is under the hood. Let’s look at the GPT family as an example (a rough sense of what these counts mean follows the list):
GPT-1 released in 2018 with 117 million parameters.
GPT-2 released in 2019 with 1.5 billion parameters.
GPT-3 released in 2020 with 175 billion parameters.
GPT-4 — the backbone of ChatGPT — is estimated to have at least a trillion parameters!
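As a back-of-the-envelope illustration of what those parameter counts mean, the snippet below estimates the memory needed just to store the weights, assuming 16-bit parameters (2 bytes each); the assumption and helper name are ours, not figures from any vendor.

```python
def rough_memory_gb(num_params, bytes_per_param=2):
    """Very rough lower bound: memory needed just to hold the weights."""
    return num_params * bytes_per_param / 1e9

for name, params in [("GPT-1", 117e6), ("GPT-2", 1.5e9), ("GPT-3", 175e9)]:
    print(f"{name}: ~{rough_memory_gb(params):.1f} GB of weights")
# GPT-1: ~0.2 GB, GPT-2: ~3.0 GB, GPT-3: ~350.0 GB
```

Actual serving requirements are higher once activations, caches, and optimizer state enter the picture; the point is only the order of magnitude of the growth.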
If you’ve kept up with these models as they’ve been released, you can immediately see the difference in performance! So far it seems like smooth sailing, no?
7. Do LLMs Dream of Made-Up Text?
“The problem is not access to annotated data, the problem is access to data [...] We have grad students who are incredibly smart who are working on beer reviews and Twitter and emojis because that's where the data is, not because they are not interested in applying [clinical NLP] techniques.”
Of course, there’s no such thing as a free lunch. Despite how promising these LLMs are as far as conversational ability is concerned, there is one glaring flaw staring us all right in the face — hallucinations.
The key idea is that when a complex model is trained simply to fit a statistical distribution to large amounts of data — with no concern for the underlying semantics — it ends up producing nonsense when exposed to problems it hasn’t seen before.
Examples of hallucinations include providing misinformation on works of media and generally conflating niche technical concepts, e.g. advanced definitions in mathematics. This issue is why they’re often referred to as ‘stochastic parrots’ — they merely figure out the odds of what can be said given what is already said.
Hallucinations are in fact the symptom of a much bigger problem — the inability to reason. Efforts to assess LLMs in technical domains such as physics, mathematics, and computer science have yielded mixed or poor results. Terence Tao’s Mathstodon posts highlight major issues in the latest iteration of GPT’s ability to tackle advanced math, and the recently published mathematics benchmark FrontierMath showcases poor performance on a novel dataset of math problems that existing models have not been exposed to in their training pipelines. Anyone in the tech sector can spot shoddy LLM-generated code within minutes, and there is already a buzz in academia about low-effort papers produced using LLMs, with tell-tale signs such as non-existent citations. Although quite helpful, LLMs in technical contexts still have to be overseen by technical experts.
Terence Tao, mathematics professor at UCLA who was awarded the Fields Medal in 2006. He’s one of many mathematicians interested in how AI tackles mathematics.
Well-known figures in the AI field are highlighting these issues and insisting on the need to go beyond LLMs — Ilya Sutskever, one of the original founders of OpenAI, and LeCun have already argued that the pre-training approach has hit a plateau.
Major research groups, such as Michael Bronstein’s at Oxford, are trying to broaden our understanding of deep learning by appealing to mathematics. Google’s AlphaGeometry 2 made waves at the 2024 International Mathematical Olympiad thanks to its symbolic deduction engine.
If there’s one thing that can be said, it’s that all hands are on deck to mitigate the issues we’ve seen so far with LLMs.
In some sense, this brief foray into the shared history of natural language processing and AI has a tale to tell — maybe the right answer is indeed ‘somewhere in the middle’.
Purely rule-based systems quickly become intractable, as any complexity theorist would tell you, while methods relying solely on statistics fail to demonstrate the sound leaps of logic humans make every day.
All in all, I’d wager that the times ahead of us will be fascinating to say the least.
If you’re interested in AI agents for your organization or product, visit our website at antematter.io