March 27, 2026

Becoming AI QA: Why Python? How AI and Python became linked

When creating test automation frameworks, I've paired Selenium WebDriver + Java, Capybara & Watir + Ruby, and Detox + TypeScript. What I haven't used since grad school? Python. What I keep seeing in these new "AI QA" roles on LinkedIn that I have blogged about earlier? Python... I wonder why?  

Before I begin, let's get back to basics... What is Python? 

Hey, Claude.ai! I want to use you as a Research Assistant: Assemble notes examining why being an AI QA is connected to Python, with a history of how it came to be that way, and how Python came to be used to examine data.


What is Python? 

According to the Foreword of the book Programming Python, it starts with a holiday.

In December 1989, a Dutch programmer named Guido van Rossum was sitting at the Centrum Wiskunde & Informatica (CWI) in Amsterdam with his office closed for Christmas. Looking for something to keep himself busy, he started writing an interpreter for a scripting language he'd been thinking about. He named it after the British comedy troupe Monty Python's Flying Circus, because he was a fan. The snake imagery came later.

The first public release, Python 0.9.0, came out in February 1991. It already had classes with inheritance, exception handling, functions, and the core data types -- list, dict, str -- that are still in Python today.

Van Rossum's stated goals for Python, written out in a 1999 report to DARPA, were deliberately modest: easy and intuitive, open source, readable like plain English, and suitable for everyday tasks. He was not thinking about data science or machine learning. He was thinking about making programming less painful.

As Van Rossum wrote in the Foreword of Programming Python, "This emphasis on readability is no accident. As an object-oriented language, Python aims to encourage the creation of reusable code. Even if we all wrote perfect documentation all of the time, code can hardly be considered reusable if it's not readable".

As detailed in the article Why Python Became the Language of AI, this design philosophy -- human readability over machine efficiency -- turned out to be exactly what scientists needed when they started writing code.

The Python Software Foundation currently supports Python 3.10 through 3.14, with 3.15 in active development. If you are learning Python for AI QA in 2026, that is Python 3.

Python Gets Adopted -- By Science!

Python had a quality that appealed to researchers: it could be written quickly, it read like pseudocode, and you didn't have to fight the language to express a mathematical idea. Scientists who weren't primarily programmers could use it without becoming software engineers first.

Travis Oliphant was among the researchers who latched onto Python for scientific computing. In 2005 and 2006, he merged two earlier numeric Python projects (Numeric and Numarray) into what became NumPy, short for Numerical Python. NumPy gave Python fast, efficient multidimensional array objects. Suddenly Python could work with large matrices at speeds closer to C or Fortran, not because Python itself was fast, but because NumPy's core was written in C and Fortran with a Python wrapper on top.
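A tiny sketch of what NumPy's array model makes possible -- vectorized operations on a whole matrix at once, with the loop running in compiled C rather than the Python interpreter:

```python
import numpy as np

# A small 2-D array; NumPy stores it in a contiguous C-level buffer
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])

# Operations are vectorized: no Python for-loop needed
doubled = matrix * 2

print(matrix.shape)   # (2, 2)
print(matrix.mean())  # 2.5
print(doubled.sum())  # 20.0
```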

The next pieces came in quick succession:

Matplotlib was created by John D. Hunter and originally released in 2003, designed to produce publication-quality plots for scientific visualization. It became the standard for data charting in Python.
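A minimal Matplotlib sketch, using the non-interactive Agg backend so the chart renders to a file (the data and filename here are just illustrative):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; render to a file instead of a window
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3, 4], [1, 4, 9, 16], marker="o")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
ax.set_title("A minimal Matplotlib plot")
fig.savefig("plot.png")
```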

pandas arrived in 2008. Wes McKinney built it while working at a hedge fund, frustrated by the lack of good tools for analyzing financial data in Python. The name comes from "panel data," an econometrics term, plus a play on "Python data analysis." pandas introduced the DataFrame -- a two-dimensional labeled data structure that made manipulating tabular data feel natural. It was built on top of NumPy, as almost everything in the Python data stack is.
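A quick sketch of the DataFrame idea, using made-up test-result data to show why tabular manipulation feels natural:

```python
import pandas as pd

# A DataFrame: two-dimensional, labeled, tabular data
df = pd.DataFrame({
    "test_name": ["login", "checkout", "search"],
    "passed": [True, False, True],
    "duration_s": [1.2, 3.4, 0.8],
})

# Label-based filtering reads almost like a sentence
failed = df[~df["passed"]]
print(failed["test_name"].tolist())       # ['checkout']
print(round(df["duration_s"].mean(), 2))  # 1.8
```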

scikit-learn started as a Google Summer of Code project in 2007, built by David Cournapeau. It provides a clean, consistent interface for machine learning algorithms -- classification, regression, clustering, dimensionality reduction -- all built on NumPy and SciPy. 
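The "clean, consistent interface" is scikit-learn's fit/predict pattern, which looks the same across its estimators. A toy sketch with made-up one-feature data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Four one-feature samples: values below ~1.5 are class 0, above are class 1
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Every scikit-learn estimator follows the same fit/predict interface
model = LogisticRegression()
model.fit(X, y)
print(model.predict([[0.5], [2.5]]))  # [0 1]
```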

By the early 2010s, Python had NumPy, SciPy, pandas, Matplotlib, and scikit-learn. As Jake VanderPlas describes in the Python Data Science Handbook: "Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind." Python now had a robust, community-built ecosystem of libraries.

Python Used For Deep Learning

The 2010s brought deep learning -- neural networks with many layers, trained on GPU clusters, capable of tasks such as image recognition, speech transcription, and language translation.

Two main deep-learning frameworks emerged -- TensorFlow (which, according to r/machinelearning, isn't used as much these days) and PyTorch:

TensorFlow was developed by the Google Brain team and released in 2015. It was designed to be a highly scalable framework for machine learning in research and production. Google built it with a Python API on top of a C++ engine -- because the data science community was already in Python, and TensorFlow needed to go where the researchers were. 

PyTorch was originally developed by Meta AI's research lab and open-sourced in 2016. It evolved from an earlier library called Torch, which was written in Lua -- a language that had almost no data science community around it. PyTorch offered a Python interface for the same underlying capabilities, "responding to the growing demands and popularity of Python in the data science and machine learning communities." Its dynamic computation graph -- where the graph is built as code runs, rather than compiled ahead of time -- made debugging and experimentation dramatically easier. Researchers adopted PyTorch in large numbers. 

  • PyTorch.org / About: "PyTorch is an open source deep learning framework built to be flexible and modular for research, with the stability and support needed for production deployment".

Both TensorFlow and PyTorch chose Python as their primary user-facing language because the entire scientific computing community -- the people who built the datasets, trained the models, and evaluated the results -- was already using Python.

As Benjamin Peterson, a long-time Python core developer, observed: "Meta has PyTorch, which has an amazing Python API... There's also SciPy, NumPy, Numba, Pandas. There's a whole suite of really extensive tools." That ecosystem created a self-reinforcing cycle: more AI frameworks came to Python because the researchers were there, which made more researchers come to Python because the frameworks were there.

Claude Investigates: What AI QA Actually Requires

According to Claude.ai, here is what this means for an SDET like myself trying to pivot to being an AI QA:

"When a team is building an AI-powered product -- a chatbot, a RAG pipeline, a document summarization service -- the testing questions are not the same as testing a web application. You can't write a simple assertion that the response equals an expected string, because the whole point of a language model is that it generates text dynamically. The test cases look more like:

  • "Does the model hallucinate facts not in the provided context?
  • "Is the response relevant to the user's question?
  • "Does the model produce toxic, biased, or unsafe output?
  • "When the prompt changes slightly, does the model's behavior regress?

"These evaluation questions require statistical reasoning, comparison across multiple runs, and libraries that understand LLM output. All of those tools are, predictably, in Python.

"The framework that most cleanly maps to what SDETs already do is DeepEval, built by Confident AI. The DeepEval GitHub README describes it as "similar to Pytest but specialized for unit testing LLM apps." That description is doing a lot of work for SDETs coming from pytest backgrounds. An LLM test case in DeepEval looks like this:

import pytest
from deepeval import assert_test
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

def test_case():
    correctness_metric = GEval(
        name="Correctness",
        criteria="Determine if the 'actual output' is correct based on the 'expected output'.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
        threshold=0.5,
    )
    test_case = LLMTestCase(
        input="What if these shoes don't fit?",
        actual_output="You have 30 days to get a full refund at no extra cost.",
        expected_output="We offer a 30-day full refund at no extra costs.",
        retrieval_context=["All customers are eligible for a 30 day full refund at no extra costs."],
    )
    assert_test(test_case, [correctness_metric])


"If you've used pytest, that test structure is immediately readable. The conceptual jump is not "learn an entirely new testing paradigm" -- it's "learn Python, learn what an LLMTestCase is, and learn what metrics like hallucination and answer relevancy mean."

"DeepEval integrates with pytest, supports CI/CD pipelines via GitHub Actions or similar, and plugs into LLM frameworks like LangChain, OpenAI Agents, and CrewAI. Its native pytest integration means that SDETs who know how to set up a test pipeline already understand the scaffolding.

"Other tools in this space worth knowing:

  • "LangSmith (LangChain's evaluation and observability platform) -- for teams building with LangChain, this is the native debugging and tracing environment.
  • "Ragas -- open-source metrics for evaluating RAG pipelines, measuring faithfulness, answer relevancy, and context precision.
  • "Arize Phoenix -- ML observability and retrieval evaluation, particularly strong for production monitoring.

"Every one of these tools is a Python library. Every tutorial, quickstart, and documentation example is in Python".

... Thank you, Claude, for that help! Question, though... how would my knowledge as an SDET help me as I pivot to becoming an AI QA?

Claude Investigates: The SDET Translation Guide

"If you've been doing test automation with Selenium WebDriver in Java or Playwright in TypeScript, here is how the concepts translate:

"The test runner concept maps directly. pytest is to Python as JUnit or TestNG is to Java, or as the built-in test runner is to Playwright. If you've used @Test annotations or test() blocks, you'll recognize def test_something(): immediately. pytest uses conventions over configuration: files named test_*.py, functions named test_*, fixtures defined with @pytest.fixture.

"The assertion library concept maps directly. pytest's assert statements read naturally: assert actual == expected, assert "error" not in response.text. More complex assertions use pytest plugins or libraries like assertpy.

"The Page Object pattern has an analog in AI QA work too. Instead of abstracting UI elements, you're abstracting LLM calls and evaluation configurations. A reusable LLMTestCase factory or a shared metrics configuration file serves the same organizational purpose.

"The CI/CD integration is the same. pytest outputs JUnit XML by default, which GitHub Actions, GitLab CI, and Jenkins all know how to consume. If you've set up a GitHub Actions workflow to run Playwright tests, the workflow for running DeepEval tests is structurally identical.

"What is genuinely new: Jupyter notebooks. Data scientists do exploratory analysis and evaluation experiments in notebooks before turning them into formal test suites. Understanding how to read, run, and eventually write notebooks is part of the AI QA workflow. That should be the subject of the next post in this series".

... Thank you for that suggestion, Claude! I will keep that in mind... 


Claude Investigates: Free Resources to Get Started

How do I start learning all of this, Claude? What do you suggest? 

"Python has some of the best free learning material of any language on the internet. The barrier is time, not cost.

"Python fundamentals (starting from zero):

  • "Python for Everybody (University of Michigan, Coursera) -- Dr. Charles Severance's course is widely considered the best on-ramp for people who are programmers in other languages. It is auditable for free. Five courses, moves deliberately, explains the why.
  • "Google's Python Class (Google Developers) -- Free, written by Google engineers, assumes you can already write code in some language. Covers Python 3 fundamentals with hands-on exercises. Direct and technical without being padded.
  • "freeCodeCamp Scientific Computing with Python -- Free, browser-based, covers Python fundamentals with an eye toward data work. Good option for self-paced learning without any account required.
  • "Microsoft Python Developer Certificate (Microsoft Learn) -- Free, self-paced, professionally framed. Good for SDETs who want a certification path that reads well on a resume.

"Python for data and AI specifically:

  • "Data Analysis with Python (IBM via Coursera) -- Covers pandas, NumPy, and Matplotlib. Auditable for free. This is the layer directly under the AI QA tooling.
  • "Python Data Science Handbook (Jake VanderPlas) -- The entire book is free online. Not a course, but a deep reference that covers IPython, NumPy, pandas, Matplotlib, and scikit-learn thoroughly. Bookmark this one.
  • "Kaggle Python and Kaggle Pandas -- Free, browser-based, no install required. Kaggle runs the exercises in the browser so you can start writing Python immediately.

"Once you have Python fundamentals and basic data skills, the AI QA-specific path:

Claude Suggestions: Jupyter Notebooks

So, what do you suggest after this, Claude? What should I learn? 

"One piece of the AI QA workflow I haven't covered is Jupyter notebooks -- the interactive computing environment where data scientists explore data, prototype models, and share findings. Jupyter notebooks are the medium where a lot of AI QA exploratory work lives before it graduates into formal test files".

Thank you, Claude, for that information! I will start looking into that!

And, as always, Happy Testing!

-T.J. Maher
Software Engineer in Test

BlueSky | YouTube | LinkedIn | Articles
