AI & ML Systems · cloudandthings.io

— The thesis

AI as a system.
Not a prompt.

AI engineering is what stands between a slick demo and a system real users depend on. Without it, organisations end up with prototypes that hallucinate in front of customers, prompts duct-taped together, and no way to tell whether last week's tweak made things better or worse.

We treat AI as software with a probabilistic core: grounded in your data via retrieval, guarded with input/output filters, evaluated against golden sets in CI, and observed in production with full traces. Models swap; the system stays.

The outcome: AI features that pass review, pass legal, and pass actual users, with the receipts to prove they keep working as the model, the prompt, or the data underneath them changes.

— What's inside

Anatomy of an AI system

Every production AI system we ship has the same five concerns. Most teams nail one or two and skip the rest, which is why their demo never makes it to launch.

// 01

Knowledge & retrieval

Your documents, tickets, transcripts, and product data, chunked, embedded, indexed, and refreshed. Retrieval is most of the answer; the model is just the wrapping.

OpenSearch · pgvector · Pinecone · Azure AI Search · Bedrock KB
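To make "chunked" concrete, here is a minimal sketch of overlapping chunking, the step that happens before embedding and indexing. It assumes fixed-size character windows; the function name and the default sizes are illustrative, and real pipelines often split on tokens or document structure instead.

```python
def chunk_text(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Overlap keeps a sentence that straddles a boundary visible in both chunks.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each chunk would then be embedded and written to the index alongside its source document id, so that answers can cite where they came from.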

// 02

Reasoning & orchestration

Prompt design, tool use, function calling, and multi-step agent flows. Deterministic where possible, model-driven only where it earns its keep.

Claude · GPT-4o · Bedrock · Azure OpenAI · LangGraph
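"Deterministic where possible" can be sketched as a router that tries cheap, exact rules first and only falls back to the model when none match. The order-id format, the handler names, and the `call_model` callable below are all hypothetical, chosen only to show the shape.

```python
import re
from typing import Callable

# Hypothetical order-id format, used only for illustration.
ORDER_ID = re.compile(r"\bORD-\d{6}\b")


def route(message: str, call_model: Callable[[str], str]) -> tuple[str, str]:
    """Return (handler, result), trying deterministic rules before the model."""
    match = ORDER_ID.search(message)
    if match:
        # An exact pattern match needs no model call at all.
        return ("order_lookup", match.group(0))
    # Nothing deterministic applies, so hand the message to the model.
    return ("model", call_model(message))
```

The deterministic branch is cheap, testable with ordinary unit tests, and never hallucinates; the model branch handles everything the rules cannot express.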

// 03

Guardrails & safety

Input filters, output validators, PII redaction, jailbreak detection, and policy enforcement. The bit your security and legal team will ask about — handled up front.

Bedrock Guardrails · AI Content Safety · JSON-schema · Pydantic AI
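An output validator can be as small as a parser that refuses anything outside the agreed contract. The sketch below assumes a hypothetical contract (a JSON object with a non-empty `text` and at least one citation) and uses only the standard library to stay self-contained; a schema library would normally do this work.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Answer:
    text: str
    citations: list[str]


def parse_answer(raw: str) -> Answer:
    """Parse model output and raise ValueError if it breaks the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    text = data.get("text")
    citations = data.get("citations")
    if not isinstance(text, str) or not text.strip():
        raise ValueError("'text' must be a non-empty string")
    if not isinstance(citations, list) or not citations:
        raise ValueError("'citations' must be a non-empty list")
    if not all(isinstance(c, str) for c in citations):
        raise ValueError("every citation must be a string")
    return Answer(text=text, citations=citations)
```

A rejected output can trigger a retry or a safe fallback message, so malformed or uncited answers never reach the user.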

// 04

Evals & CI

Golden sets, judge models, regression suites, and red-team prompts — running in CI on every prompt, model, or retrieval change. No eval, no merge.

Promptfoo · Braintrust · Ragas · DeepEval
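"No eval, no merge" boils down to a pass-rate check that CI can fail on. This is a minimal sketch, assuming a golden set of question and must-contain pairs and a hypothetical 0.9 threshold; the cases and the substring check are illustrative, and real suites usually add judge models and richer assertions.

```python
from typing import Callable

# Hypothetical golden cases, for illustration only.
GOLDEN_SET = [
    {"question": "What is the refund window?", "must_contain": "30 days"},
    {"question": "Which plan includes SSO?", "must_contain": "Enterprise"},
]


def pass_rate(answer_fn: Callable[[str], str], cases: list[dict]) -> float:
    """Fraction of cases whose answer contains the expected phrase."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1
        for case in cases
        if case["must_contain"].lower() in answer_fn(case["question"]).lower()
    )
    return passed / len(cases)


def gate(answer_fn: Callable[[str], str], cases: list[dict], threshold: float = 0.9) -> bool:
    """True when the system under test clears the threshold."""
    return pass_rate(answer_fn, cases) >= threshold
```

In a CI job, the script would exit with a non-zero status when `gate(...)` returns `False`, which is what blocks the merge.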

// 05

Observability & feedback

Full traces of every call — prompts, retrievals, tool use, costs, latency. User feedback loops piped back into the eval set so the system gets sharper, not staler.

Langfuse · LangSmith · Arize · OpenTelemetry
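A trace is a list of timed spans that share a trace id. The sketch below is an in-memory stand-in, not the API of any tracing product named above; the `Tracer` and `Span` names and the attribute keys are assumptions made for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    trace_id: str
    attributes: dict = field(default_factory=dict)
    duration_ms: float = 0.0


class Tracer:
    """Collects spans in memory; a real system would export them to a backend."""

    def __init__(self) -> None:
        self.spans: list[Span] = []

    @contextmanager
    def span(self, name: str, trace_id: str, **attributes):
        record = Span(name=name, trace_id=trace_id, attributes=dict(attributes))
        start = time.perf_counter()
        try:
            yield record  # the caller can attach tokens, cost, doc ids, ...
        finally:
            # Record latency even when the wrapped call raises.
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)
```

Wrapping retrieval, the model call, and each tool call in `with tracer.span(...)` under one trace id gives a per-request record of what was asked, what was fetched, and how long each step took.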

— Capabilities

What we actually do

Six AI capabilities, mapped to where most teams get stuck. Most engagements start with RAG or evals, because that is where the demo-to-production gap is widest.

// 01

RAG on your data

Retrieval-augmented generation grounded in your own corpus. Chunking, embeddings, hybrid search, re-ranking, and citations the user can click.

Vector + BM25 hybrid retrieval
Document-level access control
Citations & source attribution
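One common way to combine vector and BM25 results is reciprocal rank fusion, shown here together with a document-level access filter. This is a sketch under assumptions: the page does not say which fusion method is used, `k = 60` is just a commonly used default, and the group-based permission model is hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of doc ids; each list adds 1 / (k + rank) per doc."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))


def filter_by_access(
    doc_ids: list[str],
    doc_groups: dict[str, set[str]],
    user_groups: set[str],
) -> list[str]:
    """Keep documents that share a group with the user; unknown docs are denied."""
    return [d for d in doc_ids if doc_groups.get(d, set()) & user_groups]
```

Filtering before the documents reach the prompt matters: a model cannot leak a document it was never shown.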

// 02

Agents & tool use

Multi-step agents that call your APIs, query your databases, and take actions, with explicit tool contracts, retries, and human-in-the-loop where stakes are high.

Function-calling & tool routing
LangGraph / Bedrock Agents
Approval gates & HITL
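An explicit tool contract can be sketched as a small record that declares required arguments, whether a human must approve the call, and how many attempts are allowed. All names here (`Tool`, `call_tool`, `ApprovalRequired`) are hypothetical, and the retry loop omits backoff to keep the example short.

```python
from dataclasses import dataclass
from typing import Any, Callable


class ApprovalRequired(Exception):
    """Raised when a high-stakes tool is called without human sign-off."""


@dataclass(frozen=True)
class Tool:
    name: str
    run: Callable[..., Any]
    required_args: frozenset
    needs_approval: bool = False
    max_attempts: int = 3


def call_tool(tool: Tool, args: dict, approved: bool = False) -> Any:
    missing = tool.required_args - set(args)
    if missing:
        raise ValueError(f"{tool.name}: missing arguments {sorted(missing)}")
    if tool.needs_approval and not approved:
        raise ApprovalRequired(tool.name)
    last_error = None
    for _ in range(tool.max_attempts):
        try:
            return tool.run(**args)
        except ConnectionError as exc:
            # Retry only transient failures; other errors propagate at once.
            last_error = exc
    raise RuntimeError(f"{tool.name} failed after {tool.max_attempts} attempts") from last_error
```

Read-only lookups would typically run unattended, while tools that move money or change records set `needs_approval=True` and wait for a person.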

// 03

Evals & quality

Golden datasets, judge-model evals, regression suites, and red-team prompts in CI. Every prompt change runs the gauntlet before merge.

Eval harness & golden sets
LLM-as-judge with calibration
Red-team & jailbreak suites
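Calibrating a judge model usually means checking how often it agrees with human graders on the same answers. One way to measure that is Cohen's kappa, which corrects raw agreement for chance; the choice of kappa and the pass/fail labels below are assumptions for illustration, not a statement of the method used here.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters labelled at random with their own rates.
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A judge with low kappa against human labels is a sign to revise the judge prompt or rubric before trusting its scores in CI.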

// 04

Guardrails & safety

The work that makes legal and security sign off. PII redaction, prompt-injection defence, output validation, and policy enforcement.

Input/output filtering
PII detection & redaction
Jailbreak & injection defence
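As a rough illustration of PII redaction on the input side, the sketch below masks email addresses and phone-like numbers with regular expressions. The patterns are deliberately simple assumptions; production redaction typically relies on a dedicated detection service and covers far more entity types.

```python
import re

# Illustrative patterns only; they will miss many real-world formats.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text: str) -> str:
    """Replace each detected entity with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text
```

Running `redact` before a message is logged or sent to a model keeps the raw values out of prompts and traces.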

// 05

Observability & ops

Trace every call. Track cost, latency, token use, and quality drift. Pipe user feedback back into the eval set so the system improves week-on-week.

Langfuse / LangSmith tracing
Cost & latency dashboards
Feedback loops to evals
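The feedback loop can be sketched as a function that turns corrected thumbs-down responses into new eval cases. The `Feedback` record and the case format are hypothetical; the point is that only feedback with a known-good correction becomes a test, and duplicates are skipped.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    trace_id: str
    question: str
    answer: str
    thumbs_up: bool
    correction: Optional[str] = None


def to_eval_cases(feedback: list[Feedback], existing_questions: set[str]) -> list[dict]:
    """Build new eval cases from thumbs-down feedback that has a correction."""
    seen = set(existing_questions)
    cases = []
    for item in feedback:
        if item.thumbs_up or not item.correction:
            continue  # without a correction there is no expected answer to test
        if item.question in seen:
            continue  # already covered by the eval set
        seen.add(item.question)
        cases.append(
            {
                "question": item.question,
                "expected": item.correction,
                "source_trace": item.trace_id,
            }
        )
    return cases
```

Keeping the trace id on each case lets a reviewer jump from a failing eval back to the production call that produced it.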

// 06

Classical ML & MLOps

Where prediction beats generation: forecasting, churn, fraud, recommendation. Trained, deployed, and monitored on the same platform as your AI.

SageMaker / Azure ML / MLflow
Feature stores & registries
Drift, retraining, & rollback
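Drift monitoring for a classical model often starts with a distribution comparison between training data and live traffic. The sketch below uses the Population Stability Index over pre-binned proportions; PSI is one common choice rather than a method confirmed here, and the alert threshold in the usage note is an assumption.

```python
import math


def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions.

    Both inputs are per-bin proportions over the same bins, each summing to 1.
    """
    if len(expected) != len(actual) or not expected:
        raise ValueError("need two non-empty lists with the same number of bins")
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) on empty bins
        total += (a - e) * math.log(a / e)
    return total
```

A monitoring job might compute `psi` per feature each day and flag a retraining review when the value crosses a team-chosen threshold, for example 0.2.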

— Platforms

The technologies we build on.

The models, frameworks, and runtimes we reach for most. Cloud-agnostic where it matters, opinionated where it counts.

Anthropic Claude

Frontier reasoning & tool use

Claude on Bedrock or direct API. Our default for tool-using agents and long-context document work. Strong at instruction-following and structured output.

OpenAI & Azure OpenAI

GPT-4o, o-series, embeddings

Direct or via Azure OpenAI for enterprise data residency. Strong general-purpose models, mature SDKs, and a deep ecosystem of tools and integrations.

Amazon Bedrock

Multi-model + Guardrails + KB

Single API across Claude, Llama, Titan, and more. Bedrock Knowledge Bases for managed RAG, Guardrails for policy enforcement, Agents for orchestration.

LangGraph & LangSmith

Agents & observability

LangGraph for stateful agent flows; LangSmith for tracing, eval, and prompt management. Pairs cleanly with Langfuse where self-hosted observability is required.

— How we engage

Three ways to start

Three shapes of engagement, depending on whether you need a use-case opinion, a delivery team, or someone to operate it once it's live.

// 01 / Consult

AI discovery

Two to four weeks of structured discovery. Use-case shortlisting, value sizing, risk and feasibility scoring, and a build-vs-buy view per use-case.

Use-case workshops
Feasibility & value scoring
Build-vs-buy analysis
Prioritised AI roadmap

// 02 / Build

AI delivery

Engineers embedded with your team, shipping AI features in fortnightly sprints. Evals before merge, traces in production, knowledge-transfer baked in.

RAG & agent builds
Eval harness & CI integration
Guardrails & safety
Classical ML & MLOps

// 03 / Run

Managed AI

We operate your AI systems — eval runs, model upgrades, prompt versioning, cost & quality monitoring, and incident response when the model misbehaves.

Eval & regression on every change
Model & prompt versioning
Cost, quality, & drift monitoring
Quarterly model upgrades

— A typical engagement

From nothing to shipped.

Most AI engagements run on this rhythm. Six weeks from kickoff to a working system in front of real users — then iterate from there.

STEP 01 · Frame & scope

Pick the use-case worth winning. Define the user, the success metric, and the first eval set. Output: a one-page brief, an eval suite, and a thin slice to build.

STEP 02 · Scaffold & evals

Repo, CI, eval harness, tracing, guardrail layer, retrieval index. The substrate every prompt and agent will run on top of.

STEP 03 · Build & ship

RAG / agent / classifier built behind your auth, on your data, with citations and traces. Eval pass-rate gates every release. Real users on it by week six.

STEP 04 · Operate & evolve

Trace every call, capture every thumbs-down, fold them into the eval set. Swap models when better ones land — without breaking the contract with users.
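Swapping models without breaking the contract is easier when application code depends on an interface rather than a vendor SDK. This is a minimal sketch using a structural `Protocol`; the `ChatModel` interface and both stub models are hypothetical stand-ins for real provider clients.

```python
from typing import Protocol


class ChatModel(Protocol):
    """The only surface the application is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class CurrentModel:
    def complete(self, prompt: str) -> str:
        return f"[current] {prompt}"  # stand-in for a real provider call


class CandidateModel:
    def complete(self, prompt: str) -> str:
        return f"[candidate] {prompt}"  # stand-in for the newer model


def answer(question: str, model: ChatModel) -> dict:
    """The response shape stays fixed regardless of which model is behind it."""
    return {"question": question, "answer": model.complete(question)}
```

Because both models satisfy the same interface, the candidate can be run through the same eval suite as the current one and promoted only if it passes.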

— Our product

Meet Recall

Recall is our in-house product for turning unstructured business conversations into structured, operational intelligence: voice notes, calls, and follow-ups, captured, interpreted, and routed to your systems of record.

Source preserved. Interpretations traceable. Knowledge always current. Built for environments where accuracy, transparency, and auditability are non-negotiable.

Immutable source capture
Entity & relationship extraction
State-aware knowledge model
CRM & knowledge-base integration
Citations on every interpretation
Audit-ready by default

Explore Recall →

// product · live

Recall: CAPTURE · INTERPRET · STATE · INTEGRATE

// CONVERSATION → SYSTEMS · EARLY ACCESS

AI that does the job,
not the demo.

Bring us the use-case.
We'll bring the system for it.