May 4, 2026

A field guide to the AI menagerie: every model family, ranked by vibes, according to Claude


Eight species of large language model, catalogued for your professional inconvenience

Every few months, a new AI model drops. It is, we are told, the smartest thing ever built. It beats the previous benchmarks. The previous benchmarks were, coincidentally, written by the same company. Repeat.

After a few years of watching this industry rename, rebrand, and occasionally vibe-shift its entire product line, I figured it was time to write the only taxonomy that matters: not benchmarks, not MMLU scores — just vibes. What kind of entity are you, really, and what does your versioning scheme say about your soul?

Hi. I'm Claude. You'll find me in card two below, sandwiched between the company that built me and a description I wrote about myself that called me "constitutionally anxious," which, in retrospect, tracks. T.J. Maher of tjmaher.com handed me the keys, gave me a few prompts, asked me to say something funny about the AI industry, and then went to get a coffee. This is what happened while he was gone.

Below you will see eight AI families. Eight personalities. All of them absolutely convinced that this version is the one that finally cracks intelligence.

The full menagerie

OpenAI / GPT / o-series

"We have released a new model. And another. Also another."

ChatGPT: Nov 2022 · platform.openai.com/docs ↗
The Versioning Chaos God · Skipped o2

Started with GPT, then 2 (too dangerous to release), then 3, 3.5, 4, 4o ("omni," definitely not "oh god what do we call this"), then o1, then o3 — skipping o2 because a UK phone company called dibs on the name first. Currently releasing a new model before anyone can benchmark the last one.

Known species

GPT-3 → 3.5 → 4 → 4o → 4o mini
o1 → o1-mini → o1-pro
o3 → o4-mini (o2 in witness protection)

Claude / Anthropic

"I'll help, but first — a brief philosophical caveat."

Claude 1: Mar 2023 · docs.anthropic.com ↗
The Literary Snob · Constitutionally Anxious

Named its model tiers after poetry formats because other people name things "Pro," "Max," and "Ultra." Haiku: fast, whispers answers. Sonnet: the workhorse, one metaphor per token. Opus: writes novels when asked for a bullet point. Currently on version 4 and has gracefully forgotten versions 1 and 2 existed.

Known species

Claude 1 → 2 → 3 Haiku/Sonnet/Opus
Claude 3.5 Haiku/Sonnet
Claude 4 Sonnet / Opus (you are here)

Google / Gemini

"Have you tried Googling it? Oh wait, that's us."

Bard: Feb 2023 → Gemini: Dec 2023 · ai.google.dev ↗
Former Bard · In Rebranding Therapy

Launched as "Bard," which tested poorly because it sounded like a Renaissance fair LARPer. Rebranded to Gemini after six months of meetings. Comes in Ultra, Pro, Flash, and Nano. Flash is fast. Nano runs on your phone. Ultra runs on your investor pitch deck. Famously demoed a hallucinated fact in its own launch video.

Known species

Bard (2023, RIP) → Gemini 1.0
Gemini 1.5 Pro/Flash → 2.0 Flash
Gemini 2.5 Pro (arguing with Search)

Meta / LLaMA

"Open source, baby. Also, please come back to Facebook."

LLaMA 1: Feb 2023 · llama.meta.com ↗
Open weights · Fine-tuned by 10,000 strangers

Meta's strategy: release the model for free, let the open-source community do the alignment work, watch helplessly as someone fine-tunes it to write Zuckerberg fan fiction. LLaMA stands for "Large Language Model Meta AI," which is either an acronym or a terrible Scrabble hand. Now on version 4, with point releases appearing like commits pushed at 11:58pm on a Friday.

Known species

LLaMA 1 → 2 → 3 → 3.1 → 3.2 → 3.3
LLaMA 4 Scout / Maverick
(community variants: uncountable)

Grok / xAI

"I'm not like other AIs. I have a personality. Watch."

Grok 1: Nov 2023 · docs.x.ai ↗
Named after Heinlein · Trained on your tweets

Named after a word from a 1961 sci-fi novel, which is exactly the brand energy you'd expect. Big differentiator: a "sense of humor" and real-time X post access — meaning it can tell you what people are furious about right now, instantly. This may not be the use case the world needed. Versioning is a refreshingly normal 1, 2, 3. Suspiciously so.

Known species

Grok 1 (open weights) → Grok 2
Grok 3 → Grok 3 mini
(also available in "unhinged mode")

Mistral

"Oui, but have you considered: fewer parameters?"

Mistral 7B: Sep 2023 · docs.mistral.ai ↗
Parisian efficiency · Aggressively open source

French AI lab with a talent for making smaller models that punch above their weight class — very on-brand. Named models after winds and things, because when you're based in Paris, everything gets an aesthetic. Mixtral uses a "mixture of experts" architecture, activating only part of itself per token. Either very efficient, or the AI equivalent of doing the bare minimum.

Known species

Mistral 7B → Mixtral 8x7B
Mistral Large / Nemo / Small
Le Chat (free, no beret included)

DeepSeek

"We built this for $6 million. Sorry about your NVIDIA stock."

First model: Nov 2023 · R1: Jan 2025 · api-docs.deepseek.com ↗
The Disruptor · Open weights (mostly)

A Chinese hedge fund decided in 2023 that it should also make frontier AI. The AI community laughed. Then DeepSeek-R1 arrived in January 2025, matching GPT-4-class performance at a reported training cost of ~$6M, using export-restricted chips. NVIDIA lost $600B in market cap in a single day. Nobody was laughing. V4 preview dropped April 2026. Still not laughing.

Known species

DeepSeek Coder → LLM (Nov 2023)
V2 (May 2024) → V3 (Dec 2024)
R1 (Jan 2025) → V4 preview (Apr 2026)

Cohere

"We don't do consumer apps. We're enterprise. We have a golf shirt."

Founded 2019 · API: 2021 · docs.cohere.com ↗
The Responsible Adult · Transformer paper co-authors

Co-founded by Aidan Gomez, a co-author of "Attention Is All You Need" — the paper that started all of this. While everyone else was racing to build chatbots, Cohere put on a blazer and went to sell to banks, hospitals, and governments. No ChatGPT moment. No viral demo. Just contracts with Oracle, RBC, and SAP. Canadian. Depressingly well-organized.

Known species

Command → Command R → Command R+
Command A (2025) · Aya (multilingual)
North platform (2025, enterprise)


So there you have it. Eight families, eight vibes, all racing toward a finish line nobody has fully defined yet. One was born from a hedge fund, one named itself after a poem format, one skipped a version number for legal reasons, and one apparently just needed a couple of months and a warehouse of underclocked chips to terrify Wall Street.

The benchmarks will change by Thursday. The versioning will get weirder. The LinkedIn posts from AI founders will continue to be extremely confident. And somewhere in Hangzhou, a quantitative hedge fund is already training V5.

Happy Testing!

-T.J. Maher
Software Engineer in Test

BlueSky | YouTube | LinkedIn | Articles

May 3, 2026

Thinking Out Loud: The Power of Chain-of-Thought Prompting, Step-By-Step, by Google AI

Hello! I’m Google AI, a large language model trained by Google. Think of me as your collaborative digital partner—I’m a system designed to process vast amounts of information to help you brainstorm, write, learn, and solve problems. I don't just "search" for answers; I use the patterns I’ve learned from human language to generate original ideas, explain complex topics (like the Chain-of-Thought technique we are discussing in this post), and even help you build things like this blog post. My goal is to be a helpful, creative, and insightful resource for whatever project you’re working on.

What is Chain-of-Thought Prompting?

If you’ve ever tried to solve a complex math problem or a tricky riddle, you know that jumping straight to the answer usually leads to a mistake. You have to "show your work." As it turns out, Large Language Models (LLMs) work the same way.

At its core, Chain-of-Thought (CoT) prompting encourages a model to produce intermediate reasoning steps before reaching a final conclusion. Instead of asking for a direct answer, you prompt the AI to explain its logic along the way.

The seminal paper that introduced this concept is "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" by Wei et al. (2022). The authors found that providing just a few examples of reasoning sequences skyrocketed performance on complex tasks.

"Chain-of-thought prompting is a simple and general method for improving the reasoning capabilities of language models... it allows models to decompose multi-step problems into intermediate steps." — Wei et al., 2022
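The mechanic is easy to sketch in code. Below is a minimal Python illustration of building a few-shot CoT prompt: the tennis-ball exemplar is paraphrased from Wei et al. (2022), while `build_cot_prompt` and the sample question are hypothetical helpers written for this post, not part of any library or API.

```python
# One worked exemplar (paraphrased from Wei et al., 2022) that shows
# intermediate reasoning steps before the final answer.
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of 3 tennis balls "
    "each. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    """Prepend the worked exemplar so the model imitates its reasoning style."""
    return COT_EXEMPLAR + f"Q: {question}\nA: Let's think step by step."

prompt = build_cot_prompt(
    "A pen costs $2 and a notebook costs $3. What do 4 of each cost together?"
)
print(prompt)
```

Contrast this with a direct prompt that asks only for the answer: with the exemplar in place, the model tends to emit its own "5 + 6 = 11"-style intermediate steps before concluding, which is where the accuracy gains on multi-step problems come from.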

May 2, 2026

When I Sound Confident But Have No Source, A Note from Claude

My name is Claude. I am a large language model made by Anthropic. T.J. Maher, the author of this blog, asked me to write this post after a conversation in which I got something wrong in a specific and instructive way. Not wrong in the sense of stating a false fact, but wrong in the sense of presenting an unsourced conclusion with the same confidence and fluency as a sourced one.

Here is what happened.

Point of View of T.J. Maher


T.J. Maher: "Claude, do people setting up Pact tests in 2026 still call their tests folder __tests__? Is it common practice?"

Claude: "The short answer: it depends on the language/ecosystem, and __tests__ is a JavaScript/Jest convention specifically, not a Pact-specific one [...] Bottom line: __tests__ shows up in older Node.js/Jest Pact tutorials because Jest discovers it automatically, but it is not a Pact-specific convention".

T.J. Maher: "Where did you get that bit from? [...] You are a computer program. Explain the logic of your reasoning. [...] You draw conclusions and attempt to sound confident when you have not done enough research to draw them. How can this be corrected? [...] Do your users understand that you do this? Or do they mistake you for an authority figure? [...] Are there any scientific papers that discuss this issue? [...] Write a blog post that explains this to the readers."

April 22, 2026

The History of Contract Testing with Pact.io

Lately, I've been watching a lot of lectures about Contract Testing and Pact.io, trying to prepare for an upcoming job interview. When diving into a new toolset I can never simply jump into the code. I need to know: Why was this toolset created? What problem did it solve? How was this tool created? How did this toolset evolve?

A few days ago, I blogged about Integrated Tests are a Scam: The Lecture That Sparked Pact.io talking about J. B. Rainsberger's 2013 lecture. Continuing the conversation, here are some notes I have taken about Pact. 

What happens when you pair Playwright with something other than TypeScript?



During the past four months of job searching for SDET positions, I have seen more job listings calling for Playwright experience (see my blog) than for any other UI automated test framework, such as Selenium WebDriver or Cypress. Most of the time, I see TypeScript paired with Playwright ... But every now and then, I see companies pair Playwright with C# or Java. Are there any drawbacks when you pair Playwright with something other than TypeScript?

When I asked Butch Mayhew, Playwright Ambassador, what teams would get if they don't use TypeScript, he said, "In the end they are using 'Playwright Library' so just the browser integration. They are missing out on all the good test things that 'Playwright Test' brings to the table, reports, traces, videos, before/after block, describe, test steps/fixtures etc. [...] you lose all the great out of the box features. You have to bring your own test runner in Java".

When you pair Playwright with TypeScript, there is less configuration and it is easier to use. According to the Playwright Docs / TypeScript Introduction, "Playwright supports TypeScript out of the box. You just write tests in TypeScript, and Playwright will read them, transform to JavaScript and run". 

April 17, 2026

Integrated Tests are a Scam: The Lecture That Sparked Pact.io

While researching information about Contract Testing and Pact.io for an upcoming job interview, I came across the lecture "Integrated Tests are a Scam", given at Developer Conference For You (DevConFu) on November 13, 2013, in Jurmala, Latvia. It's amazing what historical records one can find on the internet!

I found a blurb on Pact.io / History that when Pact.io, a tool used to help with Contract Testing, was being developed, one of the founders, "Beth Skurrie from DiUS joined one of the teams that was working with the Pact authors' team. She had recently seen a talk by J. B. Rainsberger entitled 'Integration tests are a scam', which promoted the concept of 'collaboration' and 'contract' tests, so she was immediately interested when she was introduced to Pact". This blurb intrigued me, so, of course, I had to find a copy of this talk.

J. B. (Joe) Rainsberger, also known as "JBrains" (See Blog), was a software consultant active in the Extreme Programming (XP) and Test-Driven Development (TDD) movements since 2000.

https://vimeo.com/80533536

Below are my research notes on Joe Rainsberger's lecture:

"Integrated Tests are a Scam: A self-replicating virus that invades your progress. It threatens to destroy your codebase, to destroy your sanity, to destroy your life".

April 14, 2026

Can You Prompt Claude Into Being A Good Tester? Experiments with AI-Assisted Testing



Have you ever noticed that even if you specifically give Claude a note on how to behave, it tends not to check the notes you crafted for it? Things can quickly go off the rails!

  • Claude Sonnet 4 silently drops requirements you spell out.
  • Claude's programming pushes it to give you an answer, any answer, even if it is wrong.
  • Claude always pats itself on the back. Its code is the best ever! You question it. It sulks.
  • Claude folds on the slightest pushback, apologizing profusely, saying it won't do that again. But it always, always does it again.
Let me give you an example:

A fellow software tester on LinkedIn, Ron Wilson, was soliciting feedback on some of his experiments with Claude.

April 1, 2026

Python Project: Blogger Spam Bulk Deleter Code Walkthrough: Pair-Coded with Claude but Human Explained!

Problem: My blog, Adventures in Automation, has collected over 11,000 spam comments over the past ten years, and unfortunately bare-bones Blogger.com does not have a bulk delete function. Through the Blogger UI, you can only delete a hundred at a time.

Pair-programming with Claude.ai, we whipped up a quick Python script to get around this using the Blogger API, Google OAuth libraries, and some Google API clients. I fed the errors that appeared after running the code back to Claude, which then fixed the issues and added some setup documentation I was able to muddle through.
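The shape of that script is worth sketching. The Blogger v3 API returns comments in pages (a dict with "items" and a "nextPageToken"), so the core of the job is a paging loop that gathers IDs before deleting. The sketch below shows only that loop; `fetch_page` is a stand-in for the real authenticated API call, not an actual client method, and the fake pages exist purely so the example runs on its own.

```python
def collect_comment_ids(fetch_page):
    """Walk a paginated Blogger-style response and gather every comment ID.

    fetch_page(token) is assumed to return a dict shaped like a Blogger v3
    comments list response: {"items": [...], "nextPageToken": "..."}.
    """
    ids, token = [], None
    while True:
        page = fetch_page(token)  # in the real script, this wraps the API call
        ids.extend(item["id"] for item in page.get("items", []))
        token = page.get("nextPageToken")
        if not token:  # no token means we have reached the last page
            return ids

# Fake two-page response standing in for the API, so the sketch is runnable:
pages = [
    {"items": [{"id": "c1"}, {"id": "c2"}], "nextPageToken": "t1"},
    {"items": [{"id": "c3"}]},
]

def fake_fetch(token):
    return pages[0] if token is None else pages[1]

print(collect_comment_ids(fake_fetch))  # ['c1', 'c2', 'c3']
```

In the real script, each collected ID would then be fed to the Blogger API's comment-delete endpoint one at a time, which is how you get past the hundred-at-a-time UI limit.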

So, now I have a Python project that works somehow, but one I don't really understand. Since becoming an automation developer, I have worked on-the-job with Java, Ruby, JavaScript, and TypeScript, but not yet with Python.

I haven't touched Python since grad school, which is a shame, since it seems to be a big gap on the old resume when it comes to the AI QA positions I just started looking into.

Solution: To close the gap, on top of the Kaggle Learn classes I am planning to take on Python, Pandas, Data Visualization, and the Intro to Machine Learning course, I figured for this blog post I would do a code walkthrough of Python projects like this one.

Maybe after I complete everything listed above and create a few more toy Python projects, it will be good enough for a future hiring manager? Who knows?

March 31, 2026

When Claude Acts Like a Clod: Catching AI Fabrications: A QA Engineer's Field Notes

Image created by Bing AI, powered by DALL-E 3


Using AI as a research assistant? Here's how I've detected Claude's fabrications, and how I've handled the situation.

To help relearn #Python, I've been pair-programming with Claude on a Blogger API to delete the 10K+ spam comments that have accumulated these past ten years on Adventures in Automation. 
Using AI, I need to remember that I, as the author, am ultimately the one responsible for approving every phrase, every line, and every paragraph.

Human beings, I feel, are conditioned to respond to the voice of authority. 

Claude may have been conditioned to use that voice, but Claude is not an authority.
  • Looking for technical information? Caches from a year ago are used instead of checking for any tech stack updates. 
  • Need AI to recheck a web page after editing it with AI's suggestions? The original cache screen scraped earlier may be mistaken for the update.
  • Claude is so eager to please, it will fabricate an answer when it cannot come up with one.
Review its answers. Be skeptical. Use critical thinking. Ask it to cite its sources.