Welcome to The Content Technologist, a newsletter for content professionals working in the age of algorithms.
This week's essay was again written with the help of Anthropic's Claude. Here's what I've done to keep the writing my own, while using Claude as a stand-in for a Natural Language Understanding SME.
- I reread my primary sources: the transformer paper and several other landmark machine learning research papers from the 2010s.
- I reread all the Wikipedia articles too because they are quite detailed (as well as the source for much internet content, including what chatbots have learned).
- I took a stab at describing how I understand the architectures, in my own words.
- I first fed individual technical paragraphs to Claude, which identified errors and inaccuracies.
- Claude explained my inaccuracies and suggested other, more accurate language, all of which was poorly structured.
- I interpreted and rewrote the technical details according to my corrected understanding, referencing primary sources and the Wiki when I was confused.
- I fed that writing back to Claude once more, and we repeated the process until the writing was clear and satisfactory for both my standards and Claude's.
It did not make my process faster, but I did not have to consult a machine learning professional. Feel free to correct me if you see flaws.
Next week I am off to the State Fair, so I'll see you in September.
–DC
Disambiguation, sliding doors, hallucinations, and madeleines: How transformers process, clarify, and produce language
This is the third in a series about the technologies that underpin natural language understanding (NLU). Read the first two entries in the series:

Let's put on a record before we start. The Smashing Pumpkins are the entire reason I'm on the internet after all.
"Transformer" is one of my favorite tracks from The Aeroplane Flies High, the b-sides box-set a significant percentage of my friend cohort received for Christmas 1996. At the time, I used my family's dial-up AOL subscription to lurk Smashing Pumpkins newsgroups, an early Reddit-like forum that I believed was a world-class panel of extremely knowledgeable Billy Corgan experts. (A relatively recent thread on Reddit is more what it was like. Some things change some things stay the same.)
I remember walking into my junior high homeroom excited to share—no one is cool in junior high—my newfound knowledge of not only the Smashing Pumpkins' birthdays, but also the date Billy Corgan shaved his head. Thankfully, in the thirty years since, I've cleared these less important facts from my cache. But because I listened to "Transformer" on repeat until my CD skipped, every time I talk about "transformers" in NLU, Corgan shouts "hit it!", D'arcy's bass kicks in, Jimmy starts drumming, and I hum along with the memory of every word.
So. Here's the song for you to queue on your favorite streaming service.
Hit it.
Feedback loops: The drawbacks of disambiguation with vector embeddings and recurrent neural network processing
Last week we left off at vector embeddings, which identify the context of a word based on its statistically most likely counterparts: other nouns and verbs with close semantic relationships. But they're limited by their own context. Or rather, they're limited by their ability to process their own context.
If the same word or phrase carries different meanings in similar contexts—like, say, "bounce rate" in email deliverability versus "bounce rate" in website analytics—static vector embeddings can't handle the disambiguation. A processing model that uses only static vector embeddings might generate the sentence, "To determine whether users value your website content, check your bounce rate and monitor your deliverability scores."
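Here's a minimal Python sketch of that limitation. The words and vectors below are invented for illustration (real embedding tables are learned from enormous corpora), but the failure mode is the same: one word, one vector, no matter the sentence around it.

```python
# A toy static embedding table: hypothetical numbers, not from any real model.
static_embeddings = {
    "bounce": [0.82, 0.10, 0.33],
    "deliverability": [0.79, 0.15, 0.30],
    "analytics": [0.75, 0.22, 0.41],
}

def embed(sentence):
    """Look up the same fixed vector for each known word, regardless of its neighbors."""
    return [static_embeddings[word] for word in sentence.split() if word in static_embeddings]

# "bounce" gets an identical vector in both sentences, so downstream models have
# no way to tell an email bounce from a website bounce.
print(embed("email deliverability bounce"))
print(embed("website analytics bounce"))
```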
We cross-channel marketing geniuses can see what's wrong with this sentence: it conflates two different jargony analytics terms. Not a huge deal.
But if we revert to last week's example of multiple meanings for "tokenized," a natural language processing engine would see no problem with, "I felt tokenized so I cashed in all my crypto."
When combined with linear processing algorithms like recurrent neural networks (RNNs), vector embeddings create semantic sinkholes. Because RNNs work sequentially—interpreting one word at a time to statistically predict the next—language generation from static vector embeddings through RNNs could run into loops, like this example (provided entirely and unedited by Claude because I don't think my human brain could accurately construct what a computer brain does best):
"Your website bounce rate shows visitor bounce patterns. To decrease bounce rate, monitor bounce metrics and bounce indicators..."
The math of the sentence gets stuck in a loop, and the algorithm breaks like a busted Excel sheet. The RNN's sequential processing amplifies the vector embeddings' inability to disambiguate context.
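If you'd like to see that failure mode in miniature, here's a toy sketch. It isn't a real recurrent network, just a lookup table of invented "most likely next word" choices, but it shows how greedy, one-word-at-a-time prediction can chase its own tail.

```python
# A toy stand-in for sequential prediction: always pick the single most likely
# next word and never look back. The table is invented, not a trained RNN.
most_likely_next = {
    "your": "bounce",
    "bounce": "rate",
    "rate": "shows",
    "shows": "bounce",   # ...and the chain circles back to "bounce"
}

def generate(start, steps=9):
    words = [start]
    for _ in range(steps):
        words.append(most_likely_next.get(words[-1], "bounce"))
    return " ".join(words)

print(generate("your"))
# your bounce rate shows bounce rate shows bounce rate shows
```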
From recursive language loops to parallel processing: How transformers work to establish semantic context
The bassline in "Transformer" is a little like a recursive language loop. It's familiar like something from Saturday morning cartoons, somehow exponentially annoying when combined with Billy's nasal whine. Then! James chimes in with the rhythm guitars:
And she's tired, and she's sick
of the same old shit
It's just more of the same old same.
It's a pogo dance-off chorus for everyone's inner eighth grader, all distortion and—wouldn't ya know it?—transformers. The kind of transformers that turn your amps up to eleven, the kind of transformers all my musician buddies would discuss at length while I turned on the disco ball.
"Transformers" is also the term for the architecture that pulls LLMs out of recursive loops and spins them into context. Coined by scientists at Google in the 2017 paper "Attention Is All You Need," transformers process semantic context rapidly, in many directions at once, enabling language generators to understand entire paragraphs in parallel, scaled globally using NVIDIA chips and all those data centers you're hearing so much about.
Transformers take the same old same of static vector embeddings, which assign a single fixed vector to every token, and expand that context, processing each token's relationship to every other word in the sentence nearly simultaneously. Every token becomes something like a relational database, finding its most likely connections with other tokens via what's termed an "attention" mechanism.
Attention mechanisms forecast a multiverse of sentence futures
Would you look at that? It's another disambiguation! "Attention" in natural language understanding means that, based on both training data and conversational inputs, a processing mechanism called the decoder "decides which part of the sentence to pay attention to" (Bahdanau, Cho, Bengio, 2015). It is a metaphor for the human state of self-aware focus and has nothing to do with the "attention economy"... unless it does. More on the disambiguation of "attention" in the future.
In the transformer architecture, attention mechanisms weight a variety of dimensions when analyzing or choosing the next word. In one case, attention might be weighted toward determining which sense of "bounce rate" or "tokenize" fits the context, while simultaneously weighing and statistically choosing among alternate language pathways for each structure.
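For the curious, here's a compact sketch of the core calculation, scaled dot-product attention, as described in "Attention Is All You Need." The token vectors here are random placeholders; in a real model, the query, key, and value projections are learned during training. The point to notice is that every token's relationship to every other token is computed at once, not one word at a time.

```python
import numpy as np

def softmax(x):
    """Normalize raw scores into probabilities that sum to 1 along the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each token becomes a weighted blend of the tokens it attends to."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token relates to every other token
    return softmax(scores) @ V

# Three tokens with four-dimensional embeddings, all processed in parallel.
tokens = np.random.rand(3, 4)
contextualized = attention(tokens, tokens, tokens)
print(contextualized.shape)   # (3, 4): every token, re-described by its context
```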
It's similar to another late-90s niche cultural touchpoint, the film Sliding Doors. After a catastrophic day, Gwyneth Paltrow's character, Helen, lives out two alternate futures. In one, Helen pays more attention to her own needs in a cruel and uncaring world. In the transformer, this is called the "self-attention mechanism," in which each token weighs its relationship to every other token in the same sequence: the sentence, in effect, pays attention to itself.

But other forces in Helen's world—her ex-boyfriend, her new boyfriend, her career, the city of London—all factor into her life decisions in the alternate timelines. Those other factors, and whether or not they carry consequence in her life, are analogous to the "multi-head attention" in the transformer model.
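And here's a rough sketch of the "multi-head" part, reusing the attention() function from the sketch above. Several heads, each with its own projection of the same tokens, attend to the sentence from different angles, and their outputs are stitched back together. The projection matrices below are random stand-ins for what a trained model would learn.

```python
import numpy as np

def multi_head_attention(tokens, num_heads=2, head_dim=4):
    """Each head projects the same tokens differently, runs attention on its own view,
    and the heads' outputs are concatenated into one combined view per token."""
    heads = []
    for _ in range(num_heads):
        # Random placeholder projections; a real model learns these weights.
        W_q, W_k, W_v = (np.random.rand(tokens.shape[-1], head_dim) for _ in range(3))
        heads.append(attention(tokens @ W_q, tokens @ W_k, tokens @ W_v))
    return np.concatenate(heads, axis=-1)

tokens = np.random.rand(3, 4)
print(multi_head_attention(tokens).shape)   # (3, 8): two heads' perspectives, side by side
```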
How a multiverse of language and context condenses to a linear language output
As a transformer processes tokens and vector embeddings, each token is simultaneously processing the possibilities of the kind of Helen it could be and the kind of Helen the world wants it to be. In the end—spoiler alert—only one possibility survives: the Helen statistically determined to be the most relevant match in its prompted context, whether on a search engine results page or in the sentence you get back in chat.
Or, in the context of our soundtrack today, the guitars and pedals and vocals and studio effects and everything piles up to a noisy, brief bridge:
Don't hate her because she's undecided
The transformer evaluates the possibilities, then decides, via a statistical normalization function called softmax.
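Softmax sounds more intimidating than it is. Here's a small worked example with invented scores for a handful of candidate next words: the function simply turns raw scores into probabilities that add up to one, and the highest probability wins.

```python
import math

# Invented raw scores for three candidate next words.
scores = {"deliverability": 2.1, "engagement": 1.3, "turtles": -0.5}

# Softmax: exponentiate every score, then divide by the total so the results sum to 1.
exp_scores = {word: math.exp(s) for word, s in scores.items()}
total = sum(exp_scores.values())
probabilities = {word: e / total for word, e in exp_scores.items()}

print(probabilities)                                # roughly {'deliverability': 0.66, 'engagement': 0.29, 'turtles': 0.05}
print(max(probabilities, key=probabilities.get))    # "deliverability" wins: the surviving Helen
```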
Through relationships among vector embeddings, transformer architectures mathematically weight syntax, semantics, context, tone, colloquialism, and structural priorities within a semantic entity—literally the group of words associated with your brand, based on the language you publish on your website, on your social channels, or through other referral sources. Depending on whether the transformer is used in a search engine or in a large language model, it retrieves some information and generates an answer to the user's prompt.
The origin of hallucinations: Spiraling seas of entities and training data
But sometimes, when the transformer does not have enough information to statistically determine a good answer based on semantics, or when it has been trained not to plagiarize, or when it's been contextually and mathematically confused in some way, the math spirals out. When the equations are translated back into language, they don't always retain the original meaning of the embeddings.
Those incorrect choices and the spirals they generate are what we all know as "hallucinations," another disambiguation of a term from cognitive science. Hallucinations happen when our vector embedding version of Helen takes an unplanned holiday or decides to enroll in clown college or suddenly recalls the exact date that Billy Corgan shaved his head.
Transformers are also why hallucinations build and why, in my observation, LLMs get stuck in weird personalities. If there's not enough factual data on a topic to build a decent statistical probability, why not revisit the part of the model that's trained on 4chan, free Amazon romance novels, marketing spam, and sci-fi fanfic?
It's not hard to see how the same constructs that build brand authority in search algorithms—entities constructed from word clouds condensed from large amounts of published text—can create rabbit holes of weirdness when also trained on podcast transcripts, forums, social networks, and postmodern literature. So how do you guide those transformers toward what's accurate and verified amid the sea of textual data they've been trained from?
Next month we'll dive into the mechanisms to maintain "truth" in natural language understanding and generation. But before we spiral out, let's return to the main entity.
With all love to G. Stein: A transformer is a transformer is a transformer is a transformer is a transformer
Because my favorite fact about the transformer architecture is that it's not named for a guitar effect or a mechanism that transfers electricity or a world-altering leader or a Smashing Pumpkins song. It's named for the 1980s children's toy and intellectual property franchise Transformers.
As Steven Levy wrote in a 2024 Wired story,
They picked the name “transformers” from “day zero,” Uszkoreit says. The idea was that this mechanism would transform the information it took in, allowing the system to extract as much understanding as a human might—or at least give the illusion of that. Plus Uszkoreit had fond childhood memories of playing with the Hasbro action figures. “I had two little Transformer toys as a very young kid,” he says. The document ended with a cartoony image of six Transformers in mountainous terrain, zapping lasers at one another.
Fictional robots that turn into cars and war machines are the Proustian madeleines of machine learning. Engineers trained on science fiction build that context and reference into their work processes. The juvenilia may be excised from the final paper because it's unprofessional, but it's present in the palimpsest of the work process and its supporting ideology.
With a transformer architecture, language is a mathematical manipulation, context-aware but pre-loaded with an engineer's assumptions and experiences. Similarly, I've demonstrated how childhood memories of a pop song and a rom-com can explain a complex concept.
One could call this essay's pop culture framing digressive, untechnical, and flippant. But I structured it intentionally to capture the weirdness of writing, language, and meaning, all the while knowing the engineers who developed the whole concept that underlies contemporary AI were also digressive, childish, and so focused on their vision for a "transformer" that they changed what the word means to frame and market their vision.
I also wrote it to determine whether, as a whole 1,900-word entity, it could be processed, synthesized, and understood by today's transformer-powered chatbots. It's not particularly linear, and it certainly doesn't stick to one core topic.
But both Claude and ChatGPT understood the core concepts and themes of the essay, which, as a feat of mathematics, language, and mechanical engineering, indicates that natural language understanding at scale is immensely technically advanced and every bit as cool as "Transformer," which I hope bops in your head throughout the weekend. I will ignore that ChatGPT misclassified the literary and structural techniques I employed as "metaphors."
She's not sorry she's happy
As a turtle
Content tech links of the week
- Laura Hartenberger is also concerned with the practical outputs of AI and what makes good writing good, over in Noema.
- Lauren Goode explores the seduction of working with vibe coders in Wired.
- NNGroup published some very good research on how AI is changing search habits. Before you listen to a fraudulent projection from a marketer about mass adoption of AI, give this piece a read.
- I met Chris Penn on a panel a few years ago. His practical understanding of AI has made him a necessary follow in the years since, and this week he shared an amazing visualization of the decision-making for words in a transformer model. So if you read the whole thing above and you still need more clarity, check out this animation.
The Content Technologist is a newsletter and consultancy based in Minneapolis, working with clients and collaborators around the world. The entire newsletter is written and edited by Deborah Carver, an independent content strategy consultant.
Affiliate referrals: Ghost publishing system | Bonsai contract/invoicing | The Sample newsletter exchange referral | Writer AI Writing Assistant
Cultural recommendations / personal social: Spotify | Instagram | Letterboxd | PI.FYI
Did you read? is the assorted content at the very bottom of the email. Cultural recommendations, off-kilter thoughts, and quotes from foundational works of media theory we first read in college—all fair game for this section.
It's high peach season, and I'm torn between making a crumble or baking a cake. Peach dessert recommendations are welcome.