
Symbolic Regression and Minimal Basis Stack Audits

Symbolic regression finds formulas, not predictions. What the EML operator teaches developers about stack redundancy and interpretable ML alternatives.

Jakub Czechowski

Builds websites and e-commerce at JC Web Studio, runs StackCompass – a publication on content architecture and stack decisions – and co-organizes CMS Conf, a conference on content systems.

7 min read

A recent paper on arXiv proves that every elementary mathematical function - sine, cosine, logarithm, square root, the constants π and e, all of complex arithmetic - can be generated from a single operator: eml(x, y) = exp(x) − ln(y) combined with the constant 1. Two rules. That is the complete basis.
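A few identities follow directly from the definition and can be checked numerically. The sketch below is a minimal illustration restricted to positive real inputs. The compositions are derived here from the formula above and are not necessarily the constructions the paper itself uses; the helper names are hypothetical.

```python
import math

def eml(x: float, y: float) -> float:
    """eml(x, y) = exp(x) - ln(y), restricted here to real x and y > 0."""
    return math.exp(x) - math.log(y)

# The constant e comes from the constant 1 alone: exp(1) - ln(1) = e - 0.
E = eml(1, 1)

def exp_via_eml(x: float) -> float:
    # exp(x) - ln(1) = exp(x)
    return eml(x, 1)

def ln_via_eml(x: float) -> float:
    # eml(1, x) = e - ln(x). Exponentiating that with eml(., 1) and then
    # applying eml(1, .) gives e - (e - ln(x)) = ln(x).
    return eml(1, eml(eml(1, x), 1))

print(E, exp_via_eml(2.0), ln_via_eml(5.0))
```

The logarithm already needs three nested eml calls, which gives a feel for how quickly tree depth grows for less trivial functions.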

The paper, by Andrzej Odrzywołek (April 2026), frames EML as a mathematical analog of NAND completeness in logic: just as every Boolean circuit can be built from NAND gates alone, every elementary function can be expressed as a finite tree of eml operations. The grammar fits in one line: S → 1 | eml(S, S). At tree depth 2 through 4, nearly any target function is reachable. At depth 5 and beyond, success rate collapses to under 1%.
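To get a feel for why deeper trees are harder to search, it helps to count how many syntactically distinct trees the grammar produces. The sketch below assumes a leaf has depth 0, which may differ from the paper's depth convention, and it counts trees rather than distinct functions.

```python
def count_trees(max_depth: int) -> int:
    """Count expression trees of depth <= max_depth for S -> 1 | eml(S, S)."""
    count = 1  # depth 0: only the leaf "1"
    for _ in range(max_depth):
        # A tree is either the leaf, or eml(left, right) where both
        # subtrees come from the previous, shallower level.
        count = 1 + count * count
    return count

for depth in range(6):
    print(depth, count_trees(depth))
# depth 0..5 -> 1, 2, 5, 26, 677, 458330
```

The count roughly squares with each extra level, so a search that is comfortable at depth 4 faces a space hundreds of times larger at depth 5, even before variables are added as leaves.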

The math stands on its own. But there are two things in this paper that transfer directly to how development teams build and maintain systems.

What “Minimal Basis” Means for Stack Auditing

The deep question the EML result forces is not “what can we build?” but “what is the irreducible set we actually need?” Everything else is derivable - present for convenience, not necessity.

Most technology stacks do not get designed this way. They accumulate. A mature SaaS product ends up with three caching layers added by three different engineers at three different company stages. An e-commerce platform runs two SEO plugins installed by different teams that never spoke to each other. An observability setup routes the same service-health signal through five different alert paths because each incident spawned a new alert rule and nobody decommissioned the old one.

Research on structured stack audits documents consistent results: teams that perform systematic reviews report 40% shorter development cycles and 35% cost reductions. A dedicated audit framework study found that AI tooling alone was cut by 40% after structured review - what researchers call “AI tool sprawl,” where a chat assistant, a code completion tool, an autonomous agent, and a pipeline orchestrator all overlap on the same narrow set of tasks without any team member having a clear picture of which one handles what.

The minimal basis concept offers a cleaner way to frame the audit. Instead of asking “can we remove this tool?” - which invites defensive answers - the better question is: what is the generating set for this stack? Which tools are primitive operations, and which are derived combinations expressible through primitives you already have?

A practical three-step process:

  1. Map outputs, not tools. List what the stack must produce - deployed artifacts, data pipelines, user-facing features, observability signals. Don’t start with tool categories.
  2. Trace derivation depth. For each output, trace back through the tools required to produce it. If the chain exceeds 3–4 hops without meaningful transformation at each step, you have redundancy.
  3. Identify the functional basis. Which tools, if removed, would make a category of outputs unreachable? Those are load-bearing. Everything else is a candidate for consolidation.
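The three steps above can be sketched in a few lines of Python. Everything in this example is hypothetical: the inventory, the tool names, and the idea of recording each output as a list of alternative tool sets that can produce it. It is one possible way to make "load-bearing" checkable, not an established audit method.

```python
from typing import Dict, List, Set

# Step 1: map outputs, not tools. Each output lists the alternative
# tool sets ("routes") that can produce it. Hypothetical inventory.
STACK: Dict[str, List[Set[str]]] = {
    "deployed artifact": [{"ci", "registry"}],
    "cached responses": [{"cdn"}, {"redis"}, {"app_cache"}],
    "service-health alert": [{"metrics", "alerting"}, {"metrics", "uptime_probe"}],
}

def reachable(outputs: Dict[str, List[Set[str]]], available: Set[str]) -> Set[str]:
    """Outputs that at least one route can still produce with the available tools."""
    return {name for name, routes in outputs.items()
            if any(route <= available for route in routes)}

def load_bearing(outputs: Dict[str, List[Set[str]]]) -> Dict[str, Set[str]]:
    """Step 3: map each tool to the outputs lost if that tool alone is removed."""
    all_tools = set().union(*(r for routes in outputs.values() for r in routes))
    baseline = reachable(outputs, all_tools)
    result = {}
    for tool in sorted(all_tools):
        lost = baseline - reachable(outputs, all_tools - {tool})
        if lost:
            result[tool] = lost
    return result

basis = load_bearing(STACK)
all_tools = set().union(*(r for routes in STACK.values() for r in routes))
print("load-bearing:", basis)
print("consolidation candidates:", sorted(all_tools - set(basis)))
```

In this toy inventory the three caching tools all land in the candidate list, because each one can be removed individually without losing an output. That does not mean all three can go at once; the check has to be rerun after every removal.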

The EML depth-limit finding has a direct operational parallel: at tree depth 5+, combinatorial explosion overwhelms the search and success rate drops below 1%. Stacks with more than a handful of overlapping tools exhibit the same degradation - not from individual tools being bad, but from integration surface multiplying beyond the team’s ability to reason about it. This is the same pattern that surfaces when architectural abstractions grow past the point where they still justify their overhead.

Symbolic Regression Is Interpretable ML That Returns Formulas

The second lesson comes from the context in which EML appears: symbolic regression.

Symbolic regression is the task of finding a mathematical formula that fits a dataset, rather than fitting a model whose internals remain opaque. The difference in output:

  • “For these inputs, the model predicts 17.3” - gradient boosting, neural net
  • “The relationship is y = 2x² + sin(x) − 3” - symbolic regression

The first gives you a prediction. The second gives you something you can print, reason about, implement in any language, and hand to a domain expert who has never heard of machine learning.

PySR, an open-source library by Miles Cranmer (arXiv:2305.01582), makes this operational. It installs with pip install pysr, is production-ready, and is used in published physics, materials science, and climate research. The key output is not a prediction - it is a ranked list of formulas along a complexity-accuracy Pareto frontier. You choose how much simplicity is worth versus how much fit you need.
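A minimal PySR run might look like the sketch below. The synthetic data, the operator set, and the iteration count are illustrative assumptions, and the function name is hypothetical. The imports sit inside the function because PySR sets up a Julia backend when first used, which can take a while.

```python
def fit_formulas():
    # Heavy optional dependencies are imported lazily on purpose.
    import numpy as np
    from pysr import PySRRegressor

    # Synthetic data drawn from y = 2x^2 + sin(x) - 3 (assumed for illustration).
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = 2 * X[:, 0] ** 2 + np.sin(X[:, 0]) - 3

    model = PySRRegressor(
        niterations=40,                    # search budget, chosen arbitrarily
        binary_operators=["+", "-", "*"],  # building blocks the search may use
        unary_operators=["sin"],
    )
    model.fit(X, y)

    # equations_ holds one row per candidate formula on the frontier.
    print(model.equations_[["complexity", "loss", "equation"]])
    return model.sympy()  # the expression PySR selects by default
```

Calling fit_formulas() prints the candidate table and returns a SymPy expression. Because the search is stochastic, the exact formulas and losses can vary between runs.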

Three Places Symbolic Regression Beats a Black Box

Demand forecasting with auditable logic. A gradient-boosted model for weekly demand might achieve strong accuracy but give you nothing to reason with when the forecast breaks. A symbolic regression run on the same dataset might return demand ≈ base × (1 + 0.4 · sin(2π · week/52)) × inventory^0.6. That formula is directly readable: a seasonal cycle with 40% amplitude, a sub-linear inventory effect. It can be handed to a domain expert, questioned, and improved. It ports to a spreadsheet. It runs as a threshold alert without a model serving endpoint.
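To show what "runs without a model serving endpoint" can mean in practice, the formula above translates into a plain function. The coefficients are the ones quoted in the text; the base value, the floor, and the alert helper are hypothetical.

```python
import math

def demand(week: int, inventory: float, base: float) -> float:
    """demand = base * (1 + 0.4 * sin(2*pi*week/52)) * inventory**0.6"""
    seasonal = 1 + 0.4 * math.sin(2 * math.pi * week / 52)
    return base * seasonal * inventory ** 0.6

def below_floor(week: int, inventory: float, base: float, floor: float) -> bool:
    """Hypothetical alert rule: fire when expected demand drops under a floor."""
    return demand(week, inventory, base) < floor

print(demand(0, 1.0, 100.0))    # start of the cycle, seasonal factor is 1
print(demand(13, 1.0, 100.0))   # quarter cycle, seasonal factor peaks near 1.4
print(below_floor(39, 1.0, 100.0, floor=70.0))  # seasonal trough, factor near 0.6
```

The sub-linear inventory effect is also easy to read off: doubling inventory multiplies demand by 2^0.6, roughly 1.52, rather than 2.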

Capacity planning with threshold alerts. Infrastructure teams build load models and then struggle to explain them to finance or operations. A symbolic regression on historical metrics produces an algebraic relationship between request volume, session length, and memory consumption - one that can be embedded directly in a monitoring rule. No inference endpoint, no model drift monitoring, no retraining pipeline.

Auditable pricing rules. Pricing models that need to be defensible - for internal governance, regulatory review, or enterprise contract negotiation - cannot be black boxes. A symbolic regression fit produces a formula that sales, legal, and finance can inspect, argue over, and sign off on. This requirement appears across B2B contexts far more often than most ML adoption discussions acknowledge.

What Makes Symbolic Regression Different from Curve Fitting

The obvious objection: isn’t this just polynomial regression with extra steps?

No. Classical curve fitting requires choosing a functional form first - linear, quadratic, exponential - and then fitting coefficients. Symbolic regression searches over functional forms simultaneously. It will find that your data is better described by a · log(b + x) than by any polynomial you tried, without requiring you to guess that form in advance.
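The difference can be made concrete with a toy example. The sketch below fits coefficients for several candidate forms and keeps the one with the lowest error. It is a brute-force illustration over a hand-picked list, not PySR's evolutionary search, and it fixes b = 1 inside the logarithm to keep each fit in closed form. The data and candidate list are assumptions.

```python
import math

def fit_line(us, ys):
    """Closed-form least squares for y = a*u + c; returns (a, c, mse)."""
    n = len(us)
    mean_u, mean_y = sum(us) / n, sum(ys) / n
    var_u = sum((u - mean_u) ** 2 for u in us)
    a = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys)) / var_u
    c = mean_y - a * mean_u
    mse = sum((a * u + c - y) ** 2 for u, y in zip(us, ys)) / n
    return a, c, mse

# Candidate functional forms y = a * f(x) + c (hand-picked for illustration).
CANDIDATES = {
    "x": lambda x: x,
    "x^2": lambda x: x * x,
    "sqrt(x)": math.sqrt,
    "log(1 + x)": lambda x: math.log(1 + x),
}

def best_form(xs, ys):
    """Fit every candidate form and return the one with the smallest error."""
    fits = {name: fit_line([f(x) for x in xs], ys) for name, f in CANDIDATES.items()}
    name = min(fits, key=lambda k: fits[k][2])
    return name, fits[name]

# Synthetic data generated from y = 3 * log(1 + x) + 2.
xs = [0.5 * i for i in range(1, 41)]
ys = [3.0 * math.log(1 + x) + 2.0 for x in xs]

name, (a, c, mse) = best_form(xs, ys)
print(name, round(a, 6), round(c, 6))
```

Classical curve fitting corresponds to picking one entry of CANDIDATES up front and calling fit_line once. Symbolic regression replaces the fixed list with a search over expression trees, so the form does not have to be anticipated.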

PySR uses evolutionary search over expression trees, guided by Pareto optimisation between accuracy and complexity. The output is a set of expressions along the complexity frontier: a depth-2 expression might capture 85% of variance, while a depth-4 expression might capture 94%. You choose based on what the deployment context requires.
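Choosing from a complexity-accuracy frontier can be sketched without any library. The candidate list below is made up for illustration, and the selection rule (simplest formula under a loss budget) is one reasonable policy, not the rule PySR applies internally.

```python
from typing import List, NamedTuple, Optional

class Candidate(NamedTuple):
    complexity: int
    loss: float
    expr: str

def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates that no simpler-or-equal candidate beats on loss."""
    front: List[Candidate] = []
    best_loss = float("inf")
    for c in sorted(cands, key=lambda c: (c.complexity, c.loss)):
        if c.loss < best_loss:  # strictly better than everything simpler
            front.append(c)
            best_loss = c.loss
    return front

def simplest_within(front: List[Candidate], max_loss: float) -> Optional[Candidate]:
    """First (simplest) frontier entry whose loss fits the budget, if any."""
    for c in front:  # front is ordered by increasing complexity
        if c.loss <= max_loss:
            return c
    return None

# Hypothetical search results: (complexity, loss, expression).
results = [
    Candidate(1, 4.0, "c"),
    Candidate(3, 1.5, "a*x"),
    Candidate(5, 1.8, "a*x + b*x"),      # dominated: more complex, worse loss
    Candidate(7, 0.6, "a*x^2 + c"),
    Candidate(12, 0.55, "a*x^2 + sin(x) + c"),
]

front = pareto_front(results)
choice = simplest_within(front, max_loss=1.0)
print([c.expr for c in front], choice)
```

With a loss budget of 1.0 the rule picks the complexity-7 formula and ignores the complexity-12 one, whose extra term buys only a marginal improvement.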

The relevant framing for engineering teams: symbolic regression is appropriate when you need a model that is portable (formula runs anywhere, no runtime dependency), auditable (humans can read and verify it), and stable (no retraining pipeline once the formula is validated). Those three properties together describe a large class of operational business metrics that teams are currently handling with far heavier machinery - the same pattern that applies when AI tooling is added to workflows that did not need the additional layer.

Two Frameworks from One Math Paper

The EML result is mathematically elegant. Its practical value for engineering teams lies in what it clarifies about the structure of complexity.

Minimal basis thinking reframes stack audits from defensive tool-by-tool justification to a generative question: what is the irreducible set, and what is derived? That reframing reliably surfaces redundancy that incremental reviews miss, because it changes the default assumption from “keep unless proven unnecessary” to “justify as primitive or collapse into what you have.”

Symbolic regression via PySR makes interpretable ML accessible for the class of business problems where compact relationships exist. The EML depth-limit finding provides theoretical grounding: most real-world metrics live in the regime where elementary functions at depth 2–4 describe the actual relationship. When they do, the formula is the better deliverable.

Before adding a tool: what is its primitive contribution, and which existing primitives already cover it? Before training a model: does a formula suffice, and if so, what are the operational advantages of having the formula instead of the weights?

These are older questions dressed in new language. The EML paper makes them harder to ignore.

If these trade-offs - when to add a tool, when to reach for a formula instead of a model - are decisions you’re working through with a team, CMS Conf is worth a look. The talks tend to be practitioner-level, covering exactly the kind of architecture and tooling decisions this paper points at.


Sources

  1. Andrzej Odrzywołek, “All elementary functions from a single binary operator”, arXiv:2603.21852v2, April 2026 - https://arxiv.org/abs/2603.21852
  2. Miles Cranmer et al., “Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl”, arXiv:2305.01582 - https://arxiv.org/abs/2305.01582
  3. PySR GitHub repository - https://github.com/MilesCranmer/PySR