— Practice 03 · Intelligence

AI that does the job,
not the demo.

LLM apps, RAG, and agents grounded in your own data, behind your auth, on your cloud. Built with evals, guardrails, and observability, so you can ship to real users without wincing every time they hit your system.

// 01
10+
AI systems in development
// 02
95%+
Eval pass-rate before any system ships
// 03
6 wks
Avg. time from kickoff to first release
// 04
100%
Of systems shipped with traces & evals
— The thesis

AI as a system.
Not a prompt.

AI engineering is what stands between a slick demo and a system real users depend on. Without it, organisations end up with prototypes that hallucinate in front of customers, prompts duct-taped together, and no way to tell whether last week's tweak made things better or worse.

We treat AI as software with a probabilistic core: grounded in your data via retrieval, guarded with input/output filters, evaluated against golden sets in CI, and observed in production with full traces. Models swap; the system stays.

The outcome: AI features that pass review, pass legal, and pass actual users, with the receipts to prove they keep working as the model, the prompt, or the data underneath them changes.

— What's inside

Anatomy of an AI system

Every production AI system we ship has the same five concerns. Most teams nail one or two and skip the rest, which is why their demo never makes it to launch.

// 01

Knowledge & retrieval

Your documents, tickets, transcripts, and product data, chunked, embedded, indexed, and refreshed. Retrieval is most of the answer; the model is just the wrapping.

OpenSearch · pgvector · Pinecone · Azure AI Search · Bedrock KB
// 02

Reasoning & orchestration

Prompt design, tool use, function calling, and multi-step agent flows. Deterministic where possible, model-driven only where it earns its keep.

Claude · GPT-4o · Bedrock · Azure OpenAI · LangGraph
// 03

Guardrails & safety

Input filters, output validators, PII redaction, jailbreak detection, and policy enforcement. The bit your security and legal teams will ask about, handled up front.

Bedrock Guardrails · AI Content Safety · JSON-schema · Pydantic AI
// 04

Evals & CI

Golden sets, judge models, regression suites, and red-team prompts — running in CI on every prompt, model, or retrieval change. No eval, no merge.

Promptfoo · Braintrust · Ragas · DeepEval
// 05

Observability & feedback

Full traces of every call — prompts, retrievals, tool use, costs, latency. User feedback loops piped back into the eval set so the system gets sharper, not staler.

Langfuse · LangSmith · Arize · OpenTelemetry
— Capabilities

What we actually do

Six AI capabilities, mapped to where most teams get stuck. Most engagements start with RAG or evals, because that is where the demo-to-production gap is widest.

// 01

RAG on your data

Retrieval-augmented generation grounded in your own corpus. Chunking, embeddings, hybrid search, re-ranking, and citations the user can click.

  • Vector + BM25 hybrid retrieval
  • Document-level access control
  • Citations & source attribution
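
To make the hybrid retrieval idea concrete, here is a minimal sketch of one common way to combine a keyword (BM25) ranking with a vector ranking: reciprocal rank fusion, with a document-level access check applied before fusion. The fusion method, the function names, the document ids, and the ACL layout are all assumptions for illustration; a real build would take its rankings from the search engine and vector store in use.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); k damps the top-rank bonus.
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(bm25_ranking, vector_ranking, doc_acl, user_groups, top_n=3):
    """Hybrid retrieval that drops documents the user may not see."""
    def allowed(doc_id):
        # A document is visible if it shares at least one group with the user.
        return bool(doc_acl.get(doc_id, set()) & user_groups)

    visible = [[d for d in ranking if allowed(d)]
               for ranking in (bm25_ranking, vector_ranking)]
    return reciprocal_rank_fusion(visible)[:top_n]


# Hypothetical rankings and access-control lists.
bm25_ranking = ["doc-a", "doc-b", "doc-c"]
vector_ranking = ["doc-c", "doc-d", "doc-a"]
doc_acl = {
    "doc-a": {"support"},
    "doc-b": {"finance"},
    "doc-c": {"support"},
    "doc-d": {"support", "finance"},
}

# The returned ids double as the citations shown next to the answer.
citations = retrieve(bm25_ranking, vector_ranking, doc_acl, {"support"})
print(citations)
```

Filtering before fusion, rather than after, means a restricted document can never push a permitted one out of the top results.
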
// 02

Agents & tool use

Multi-step agents that call your APIs, query your databases, and take actions, with explicit tool contracts, retries, and human-in-the-loop where stakes are high.

  • Function-calling & tool routing
  • LangGraph / Bedrock Agents
  • Approval gates & HITL
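
A rough sketch of what an explicit tool contract with retries and an approval gate can look like, in plain Python rather than any particular agent framework. The `Tool` dataclass, `call_tool`, and the two example tools are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    """Contract for one tool: what it needs and how carefully to call it."""
    name: str
    run: Callable[[dict], str]
    required_args: frozenset
    needs_approval: bool = False
    max_attempts: int = 3


class ApprovalRequired(Exception):
    """Raised when a high-stakes tool is called without human sign-off."""


def call_tool(tool, args, approved=False):
    # Enforce the contract before anything runs.
    missing = tool.required_args - set(args)
    if missing:
        raise ValueError(f"{tool.name}: missing arguments {sorted(missing)}")
    if tool.needs_approval and not approved:
        raise ApprovalRequired(tool.name)
    last_error = None
    # Retry only transient failures, up to the tool's limit.
    for _ in range(tool.max_attempts):
        try:
            return tool.run(args)
        except ConnectionError as err:
            last_error = err
    raise RuntimeError(
        f"{tool.name} failed after {tool.max_attempts} attempts"
    ) from last_error


# Hypothetical tools: a flaky read and a high-stakes write.
attempts = {"n": 0}


def flaky_lookup(args):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("upstream timeout")
    return f"order {args['order_id']}: shipped"


lookup = Tool("lookup_order", flaky_lookup, frozenset({"order_id"}))
refund = Tool("issue_refund", lambda a: f"refunded {a['order_id']}",
              frozenset({"order_id"}), needs_approval=True)

lookup_result = call_tool(lookup, {"order_id": "A17"})
try:
    call_tool(refund, {"order_id": "A17"})
    refund_blocked = False
except ApprovalRequired:
    refund_blocked = True  # hand off to a human reviewer here
refund_result = call_tool(refund, {"order_id": "A17"}, approved=True)
```

Keeping the gate in the calling layer, not in the prompt, means the model cannot talk its way past it.
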
// 03

Evals & quality

Golden datasets, judge-model evals, regression suites, and red-team prompts in CI. Every prompt change runs the gauntlet before merge.

  • Eval harness & golden sets
  • LLM-as-judge with calibration
  • Red-team & jailbreak suites
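
As an illustration of a golden-set gate, the sketch below scores a system against a small golden set and turns the pass-rate into a CI exit code. The 95% threshold echoes the pass-rate figure quoted at the top of this page; the golden cases, the substring grader, and the canned `system_under_test` are assumptions standing in for a real dataset, a calibrated judge, and the real application.

```python
def contains_expected(answer, expected):
    """Toy grader: pass if the expected phrase appears in the answer."""
    return expected.lower() in answer.lower()


def pass_rate(golden_set, answer_fn, grade_fn):
    """Fraction of golden cases whose answer the grader accepts."""
    passed = sum(
        1 for case in golden_set
        if grade_fn(answer_fn(case["input"]), case["expected"])
    )
    return passed / len(golden_set)


def gate(rate, threshold=0.95):
    """True if the change may merge."""
    return rate >= threshold


# Hypothetical golden set.
golden_set = [
    {"input": "What is the refund window?", "expected": "30 days"},
    {"input": "Which regions do you ship to?", "expected": "UK and EU"},
    {"input": "How do I reset my password?", "expected": "reset link"},
    {"input": "Do you offer phone support?", "expected": "weekdays"},
]

# Stand-in for the real system: canned answers, one of them wrong.
canned = {
    "What is the refund window?": "Refunds are accepted within 30 days.",
    "Which regions do you ship to?": "We ship to the UK and EU.",
    "How do I reset my password?": "Use the reset link on the sign-in page.",
    "Do you offer phone support?": "Yes, at any time.",
}


def system_under_test(question):
    return canned[question]


rate = pass_rate(golden_set, system_under_test, contains_expected)
exit_code = 0 if gate(rate) else 1  # a CI job would exit with this code
print(f"pass-rate {rate:.0%}, exit code {exit_code}")
```

In a real pipeline the same script would run on every prompt, model, or retrieval change, and a non-zero exit code would block the merge.
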
// 04

Guardrails & safety

The work that makes legal and security sign off. PII redaction, prompt-injection defence, output validation, and policy enforcement.

  • Input/output filtering
  • PII detection & redaction
  • Jailbreak & injection defence
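
A small sketch of two of these layers using only the Python standard library: an input filter that redacts PII before text reaches the model, and an output validator that rejects malformed responses. The regular expressions are deliberately simple and illustrative, not production-grade detectors, and the field names are hypothetical; a real system would more likely lean on a managed guardrail service or a schema library.

```python
import json
import re

# Illustrative patterns only; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text):
    """Input filter: replace emails and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def validate_output(raw, required):
    """Output validator: parse JSON and check required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as err:
        raise ValueError(f"model output is not valid JSON: {err}") from err
    if not isinstance(data, dict):
        raise ValueError("model output must be a JSON object")
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(
                f"field {key!r} is missing or not a {expected_type.__name__}"
            )
    return data


# Hypothetical user message and response contract.
safe_prompt = redact_pii(
    "Contact jane.doe@example.com or +44 20 7946 0958 today"
)
contract = {"answer": str, "citations": list}
parsed = validate_output(
    '{"answer": "Shipped", "citations": ["doc-a"]}', contract
)
print(safe_prompt)
```

A response that fails validation is never shown to the user; it is retried or routed to a fallback instead.
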
// 05

Observability & ops

Trace every call. Track cost, latency, token use, and quality drift. Pipe user feedback back into the eval set so the system improves week-on-week.

  • Langfuse / LangSmith tracing
  • Cost & latency dashboards
  • Feedback loops to evals
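
The shape of a trace record can be sketched without committing to a particular tracing tool. The example below wraps a model call, records latency, token counts, and cost, and promotes a thumbs-down into a candidate eval case. The `Span` fields, the per-token prices, and `fake_model` are assumptions made for the example; they are not the API of Langfuse, LangSmith, or any other product.

```python
import time
from dataclasses import dataclass, field

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

TRACE = []  # in production this would go to a tracing backend


@dataclass
class Span:
    name: str
    prompt: str
    output: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    feedback: list = field(default_factory=list)


def traced_call(name, model_fn, prompt):
    """Call the model and record what happened as a span."""
    start = time.perf_counter()
    output, input_tokens, output_tokens = model_fn(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    cost = (input_tokens * PRICE_PER_1K["input"]
            + output_tokens * PRICE_PER_1K["output"]) / 1000
    span = Span(name, prompt, output, latency_ms,
                input_tokens, output_tokens, cost)
    TRACE.append(span)
    return output, span


def fake_model(prompt):
    """Stand-in for a real model client: returns text and token counts."""
    return "Your order shipped on Monday.", 1000, 200


answer, span = traced_call("answer_question", fake_model, "Where is my order?")

# A user clicks thumbs-down; the span becomes a candidate eval case.
span.feedback.append("thumbs_down")
eval_candidates = [
    {"input": s.prompt, "bad_output": s.output}
    for s in TRACE if "thumbs_down" in s.feedback
]
```

Each candidate still needs a human to supply the expected answer before it joins the golden set.
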
// 06

Classical ML & MLOps

Where prediction beats generation: forecasting, churn, fraud, recommendation. Trained, deployed, and monitored on the same platform as your AI.

  • SageMaker / Azure ML / MLflow
  • Feature stores & registries
  • Drift, retraining, & rollback
— Platforms

The technologies we build on.

The models, frameworks, and runtimes we reach for most. Cloud-agnostic where it matters, opinionated where it counts.

Anthropic Claude

Frontier reasoning & tool use

Claude on Bedrock or direct API. Our default for tool-using agents and long-context document work. Strong at instruction-following and structured output.

OpenAI & Azure OpenAI

GPT-4o, o-series, embeddings

Direct or via Azure OpenAI for enterprise data residency. Strong general-purpose models, mature SDKs, and a deep ecosystem of tools and integrations.

Amazon Bedrock

Multi-model + Guardrails + KB

Single API across Claude, Llama, Titan, and more. Bedrock Knowledge Bases for managed RAG, Guardrails for policy enforcement, Agents for orchestration.

LangGraph & LangSmith

Agents & observability

LangGraph for stateful agent flows; LangSmith for tracing, eval, and prompt management. Pairs cleanly with Langfuse where self-hosted observability is required.

— How we engage

Three ways to start

Three shapes of engagement, depending on whether you need a use-case opinion, a delivery team, or someone to operate it once it's live.

// 01 / Consult

AI discovery

Two to four weeks of structured discovery. Use-case shortlisting, value sizing, risk and feasibility scoring, and a build-vs-buy view per use-case.

  • Use-case workshops
  • Feasibility & value scoring
  • Build-vs-buy analysis
  • Prioritised AI roadmap
// 02 / Build

AI delivery

Engineers embedded with your team, shipping AI features in fortnightly sprints. Evals before merge, traces in production, knowledge-transfer baked in.

  • RAG & agent builds
  • Eval harness & CI integration
  • Guardrails & safety
  • Classical ML & MLOps
// 03 / Run

Managed AI

We operate your AI systems — eval runs, model upgrades, prompt versioning, cost & quality monitoring, and incident response when the model misbehaves.

  • Eval & regression on every change
  • Model & prompt versioning
  • Cost, quality, & drift monitoring
  • Quarterly model upgrades
— A typical engagement

From nothing to shipped.

Most AI engagements run on this rhythm. Six weeks from kickoff to a working system in front of real users — then iterate from there.

STEP 01 · Frame & scope

Pick the use-case worth winning. Define the user, the success metric, and the first eval set. Output: a one-page brief, an eval suite, and a thin slice to build.

STEP 02 · Scaffold & evals

Repo, CI, eval harness, tracing, guardrail layer, retrieval index. The substrate every prompt and agent will run on top of.

STEP 03 · Build & ship

RAG / agent / classifier built behind your auth, on your data, with citations and traces. Eval pass-rate gates every release. Real users on it by week six.

STEP 04 · Operate & evolve

Trace every call, capture every thumbs-down, fold them into the eval set. Swap models when better ones land — without breaking the contract with users.

— Our product

Meet Recall

Recall is our in-house product for turning unstructured business conversations into structured, operational intelligence: voice notes, calls, and follow-ups are captured, interpreted, and routed to your systems of record.

Source preserved. Interpretations traceable. Knowledge always current. Built for environments where accuracy, transparency, and auditability are non-negotiable.

  • Immutable source capture
  • Entity & relationship extraction
  • State-aware knowledge model
  • CRM & knowledge-base integration
  • Citations on every interpretation
  • Audit-ready by default
Explore Recall →
// product · live · early access
Recall: Capture → Interpret → State → Integrate (conversation → systems)
— Bring us an AI problem

Bring us the use-case.
We'll bring the system for it.

Whether you need discovery, a build, or full managed operation, start with a 30-minute call.