<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[The Digital AI With Ashish]]></title><description><![CDATA[AI Engineering Leader Enterprise Architect Cloud & Distributed Systems Expert Driving AI/ML Infrastructure, Scalable Microservices, and High‑Performance Platforms 18+ Years in FinTech & Enterprise Systems]]></description><link>https://thedigitalshiftaiwithashish.substack.com</link><image><url>https://substackcdn.com/image/fetch/$s_!AApv!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fthedigitalshiftaiwithashish.substack.com%2Fimg%2Fsubstack.png</url><title>The Digital AI With Ashish</title><link>https://thedigitalshiftaiwithashish.substack.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 06 Jun 2026 02:40:53 GMT</lastBuildDate><atom:link href="https://thedigitalshiftaiwithashish.substack.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Ashish Kumar]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[thedigitalshiftaiwithashish@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[thedigitalshiftaiwithashish@substack.com]]></itunes:email><itunes:name><![CDATA[The Digital AI With Ashish]]></itunes:name></itunes:owner><itunes:author><![CDATA[The Digital AI With Ashish]]></itunes:author><googleplay:owner><![CDATA[thedigitalshiftaiwithashish@substack.com]]></googleplay:owner><googleplay:email><![CDATA[thedigitalshiftaiwithashish@substack.com]]></googleplay:email><googleplay:author><![CDATA[The Digital AI With Ashish]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AI Product Roadmaps in a Fast-Moving World]]></title><description><![CDATA[Model 
Lifecycle Management &#183; Drift Detection &#183; Observability]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/ai-product-roadmaps-in-a-fast-moving</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/ai-product-roadmaps-in-a-fast-moving</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Tue, 21 Apr 2026 15:43:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vrl4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-mc0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-mc0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 424w, https://substackcdn.com/image/fetch/$s_!-mc0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 848w, https://substackcdn.com/image/fetch/$s_!-mc0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 1272w, 
https://substackcdn.com/image/fetch/$s_!-mc0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-mc0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png" width="1035" height="124" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:124,&quot;width&quot;:1035,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14005,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194929999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-mc0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 424w, https://substackcdn.com/image/fetch/$s_!-mc0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 848w, 
https://substackcdn.com/image/fetch/$s_!-mc0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 1272w, https://substackcdn.com/image/fetch/$s_!-mc0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffed94aed-ea92-43db-a3dc-19a701dbe209_1035x124.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><blockquote><p><strong>&#8220;A roadmap that doesn&#8217;t account for model drift isn&#8217;t a plan &#8212; it&#8217;s a wish. AI products degrade silently without any code change. Plan for that before it happens to you.&#8221;</strong></p></blockquote><p>Most AI product roadmaps are written the same way as software roadmaps &#8212; features, timelines, milestones. But AI products have a property software products don&#8217;t: they degrade silently without any code change. A roadmap that doesn&#8217;t account for model lifecycle, drift detection, and observability will be rewritten after the first production incident.</p><p><strong>THE PATTERN BEHIND MOST AI ROADMAP FAILURES</strong></p>
<p><strong>01. </strong>Team builds a roadmap around features. Ships on time. No monitoring beyond uptime. Six months later, user satisfaction is down 18%. Outputs degraded silently. It wasn&#8217;t on the roadmap to detect this.</p><p><strong>02. </strong>Foundation model provider releases a major update. Team scrambles. Breaks three prompt flows, changes output formats, degrades one critical feature. No lifecycle plan, no staging, no rollback. Two weeks of firefighting.</p><p><strong>03. </strong>Incident post-mortem: &#8220;we didn&#8217;t know there was a problem until users told us.&#8221; No observability beyond error rates. The incident was preventable.</p><p><strong>04. </strong>Team rebuilds with lifecycle management, drift detection, and observability. Next foundation model update: deployed to staging, tested, rolled out over 72 hours. Zero incidents. Next drift event: caught in 4 minutes.</p><p><strong>PILLAR 01 OF 03 &#183; AI PRODUCT ROADMAPS IN A FAST-MOVING WORLD</strong></p><p><strong>Model Lifecycle Management</strong></p><p>Software breaks when you change it. AI models break when the world changes around them. A model deployed today was trained on a snapshot. Six months later, language has shifted, user behaviour evolved, topics changed. The model hasn&#8217;t changed - but its performance has.</p><h1><strong>The Four Lifecycle Stages</strong></h1><p><strong>Model Assessment - Know What You Have Before You Commit</strong></p><p>Quarterly cadence. 
Benchmark against the production distribution, not just a held-out test set.</p><p>Run a structured model assessment before deploying and every quarter post-deployment: a capability audit (what does the model handle well, and where are the hard boundaries?), a benchmark against the production distribution (use recent production queries with known correct outputs), a comparative evaluation (how do newer versions compare on your specific use cases?), and a cost/latency profile (inference economics at current and projected scale). Log everything. The assessment is the baseline for the lifecycle decision.</p><p><strong>Staged Rollout - Never Ship to 100% Without a Safety Net</strong></p><p>Shadow &#8594; Canary 1% &#8594; Canary 10% &#8594; Full. Every model update. No exceptions.</p><p>Shadow mode: the new model runs in parallel on live traffic, with its outputs logged but never served, so regressions are caught before any user sees them. Canary 1%: serve to 1% of traffic and measure all production metrics for 24+ hours. Canary 10%: expand only if all metrics hold, then run for 48&#8211;72 hours. Full rollout: only after both canary stages pass, with rollback criteria still active for 72 hours. The pressure to skip stages always comes at exactly the wrong moment. Build the policy before you face that pressure.</p><p><strong>Foundation Model Update Protocol</strong></p><p>Pin &#8594; test in staging &#8594; identify regressions &#8594; fix &#8594; staged migration. Provider updates are not minor dependency updates.</p><p>When a provider update is announced: (1) Freeze the current version - pin API calls to it immediately. (2) Test in staging - run your full prompt test suite against the new version. (3) Identify regressions - which prompt flows break, and which output formats change? (4) Fix before migrating - don&#8217;t migrate until regressions are resolved. (5) Staged migration - follow your standard canary rollout. 
Provider updates are often improvements in aggregate but regressions for your specific use cases.</p><p><strong>Model Deprecation - Retiring Without User Impact</strong></p><p>90-day timeline. Never retire without a proven replacement already in canary.</p><p>Deprecation criteria: performance below the agreed threshold for 7+ days, OR the replacement shows a statistically significant improvement, OR the provider announces end-of-life. 90-day deprecation timeline: Days 1&#8211;30: staging testing of the replacement. Days 31&#8211;60: canary rollout. Days 61&#8211;90: monitor and address regressions. Day 90: full switch, with the old model kept available for rollback for 30 more days. Never retire without a proven replacement already in canary.</p><h1><strong>Lifecycle Decision Matrix</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vrl4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vrl4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 424w, https://substackcdn.com/image/fetch/$s_!vrl4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 848w, https://substackcdn.com/image/fetch/$s_!vrl4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 1272w, 
https://substackcdn.com/image/fetch/$s_!vrl4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vrl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png" width="1027" height="369" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:1027,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41609,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194929999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!vrl4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 424w, https://substackcdn.com/image/fetch/$s_!vrl4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 848w, 
https://substackcdn.com/image/fetch/$s_!vrl4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 1272w, https://substackcdn.com/image/fetch/$s_!vrl4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1182628f-2183-4962-aa42-e0e5119fff6f_1027x369.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>PILLAR 02 OF 03 &#183; AI PRODUCT ROADMAPS IN A FAST-MOVING WORLD</strong></p><p><strong>Drift 
Detection</strong></p><blockquote><p><strong>WHAT DRIFT ACTUALLY IS</strong></p><p><strong>Three Types - All Invisible to Standard Error Rate Monitoring</strong></p><p>Covariate drift: user input distribution changes (new topics, new phrasing, new segments). Concept drift: the correct answer to a given question changes (outdated facts, changed policies). Label drift: what users consider a &#8220;good&#8221; response changes over time. All three are invisible to error rate monitoring. A drifted model returns 200 OK on every request. The signal is in quality metrics - and only if you&#8217;re measuring them.</p></blockquote><h1><strong>The Four-Layer Drift Detection Stack</strong></h1><p><strong>Statistical Distribution Monitoring</strong></p><p>PSI on input features, daily. PSI &gt; 0.20 for 48h = investigation trigger. Target: 5-min MTTD with streaming PSI.</p><p>Run Population Stability Index (PSI) calculations on key input features daily. PSI thresholds: &lt; 0.10 = stable. 0.10&#8211;0.20 = minor drift, monitor closely. 0.20&#8211;0.25 = significant drift, investigate. &gt; 0.25 = critical, consider emergency retrain. For LLM systems: embed a daily sample of inputs, compute the mean cosine distance to the training corpus centroid, and track it as a rolling 7-day average. Alerting: PSI &gt; 0.20 sustained over 48 hours triggers a Slack/PagerDuty alert to the ML team. A daily batch cannot detect a shift faster than the batch interval, so a 5-minute MTTD is achievable only when PSI is computed on a streaming window.</p><p><strong>Output Quality Monitoring</strong></p><p>Automated scoring on every response. Track P50/P90/P99. Tail degradation is often the first signal.</p><p>Instrument automated quality scoring on every output: a fluency/coherence classifier (&lt;10ms, flags broken outputs), a task completion heuristic (did the output contain the expected structure?), and an LLM-as-judge spot check (1% random sample against your calibrated quality rubric). Track P50, P90, P99 of quality scores - not just the mean. 
Tail degradation (P99 dropping while P50 holds) is often the first visible signal of drift before it shows in averages.</p><p><strong>User Feedback Signals</strong></p><p>Thumbs-down rate, correction rate, follow-up query rate. The trend is the metric, not the absolute value.</p><p>Three signals that matter most: thumbs-down rate (explicit, low volume, high signal), correction behaviour (the user edits AI-generated text - the model produced something wrong enough to fix), and follow-up query rate (the user asks a clarifying question immediately after - the response didn&#8217;t answer what they needed). Track them as weekly trending metrics. A 5% thumbs-down rate that has been stable for three months is fine. A 5% rate that was 2% three months ago is serious. The trend is the metric.</p><p><strong>Scheduled Human Evaluation</strong></p><p>Weekly 50-sample review. 30 minutes. The only signal that catches what automated metrics miss.</p><p>Review 50 randomly sampled outputs each week: someone with domain expertise scores them against your quality rubric. What to look for specifically: factual errors (automated scoring misses these), tone/persona drift, policy violations (subtle, not caught by toxicity classifiers), and new failure modes that haven&#8217;t been seen before, which are the first signal of a new drift pattern. Log every finding in your drift registry. Over months, the registry builds a picture of how your model&#8217;s blind spots are evolving.</p><p><strong>PILLAR 03 OF 03 &#183; AI PRODUCT ROADMAPS IN A FAST-MOVING WORLD</strong></p><p><strong>Observability</strong></p><blockquote><p><strong>AI OBSERVABILITY VS INFRASTRUCTURE MONITORING</strong></p><p><strong>Uptime and Latency Are Necessary. They&#8217;re Not Sufficient.</strong></p><p>Infrastructure monitoring tells you the system is running. AI observability tells you whether it&#8217;s doing what it&#8217;s supposed to do. 
A model can have 100% uptime, P99 latency under 200ms, and zero error codes - while returning increasingly wrong answers to every user. AI observability adds three dimensions: input distribution monitoring, output quality monitoring, and behaviour monitoring.</p></blockquote><h1><strong>The Complete Observability Stack</strong></h1><p><strong>Inference Logging - Log Everything, Retain Selectively</strong></p><p>Log: request ID, timestamp, input hash, model version, output hash, latency, token count, quality score, user segment. Do not log raw inputs in PII-sensitive products. Retention: 30 days routine monitoring, 90 days incident investigation. Every field you don&#8217;t capture today is a debugging blind spot in tomorrow&#8217;s incident.</p><p><strong>Alerting Strategy - Alert on Leading Indicators</strong></p><p>PSI &gt; 0.20 &#8594; ML team Slack (15-min MTTD). Quality P99 drops 10% vs 7-day baseline &#8594; PagerDuty (5-min MTTD). Negative feedback rate up 50% &#8594; ML + Product Slack. Leading indicators catch problems before users report them. Lagging indicators confirm what users already know.</p><p><strong>Real-Time Dashboard - Four Panels Every Team Needs</strong></p><p>Panel 1: Request volume + error rate (infrastructure). Panel 2: Quality score distribution P50/P90/P99 (quality). Panel 3: Input distribution PSI (drift). Panel 4: User feedback trend (user impact). These four answer: is the system running AND working well AND are users experiencing value? Visible to everyone, not just the ML team.</p><p><strong>Tracing + Root Cause - Link Incidents to Specific Causes</strong></p><p>Every request logs: model version, prompt version, context retrieval result. When quality drops, structured tracing answers in minutes: model drift? prompt regression? input distribution change? upstream data issue? Without tracing, incident response is guessing. 
With it, root cause is usually findable in under 10 minutes.</p><blockquote><p><strong>OBSERVABILITY &#8594; ROADMAP INTEGRATION</strong></p><p><strong>Every Roadmap Item Should Be Traceable to an Observability Signal</strong></p><p>Improving hallucination rate &#8594; linked to faithfulness score trending up. Reducing follow-up query rate &#8594; linked to task completion metric improving. The weekly review rhythm: 30 minutes reviewing observability dashboard trends &#8594; updates to the roadmap backlog &#8594; sprint planning includes observability-driven items alongside feature work. Teams that do this catch regressions in 4 minutes instead of 4 days.</p></blockquote><h1><strong>The AI Product Roadmap Structure</strong></h1><p>Four sections every AI product roadmap needs - not as afterthoughts but as first-class commitments with owners and timelines.</p><p><strong>Capability Horizon - Feature Work With Lifecycle Assumptions</strong></p><p>Every capability item has a lifecycle assumption attached: what model performance level it requires.</p><p>Feature roadmap items are planned the same way as software - user story, acceptance criteria, timeline. The addition for AI products: every capability item includes a lifecycle assumption: &#8220;this capability assumes current model performance on benchmark X remains within Y% of today&#8217;s baseline.&#8221; When the assumption breaks (detected by observability), the feature appears in the sprint backlog for investigation automatically.</p><p><strong>Infrastructure Horizon - 20&#8211;30% of Every Sprint</strong></p><p>Non-negotiable. Lifecycle, observability, drift detection. This is what keeps capability work possible.</p><p>Every AI product roadmap needs a permanent allocation for lifecycle and observability work: model assessment cadence, observability improvements, drift detection tuning, rollout tooling. Target: 20&#8211;30% of engineering capacity. 
Teams that don&#8217;t allocate this will have it consumed by incidents instead. The choice is planned maintenance vs unplanned firefighting. Both take time - planned maintenance takes less. Track explicitly as a first-class commitment.</p><p><strong>Risk Register - Living Document, Updated Weekly from Observability Signals</strong></p><p>Known threats to your roadmap: model limitations, pending updates, drift levels, failure modes.</p><p>Unlike software roadmaps, AI product roadmaps need a formal risk register: known model limitations, pending foundation model updates and estimated impact, current drift levels and trends, known failure modes from red-team sessions, third-party model support timelines. Review weekly, update from observability signals. A risk that moves from &#8220;low&#8221; to &#8220;medium&#8221; this week becomes a sprint item next week &#8212; not a surprise six weeks from now.</p><p><strong>Incident Response Playbooks - Written Before You Need Them</strong></p><p>Quality degradation, drift response, provider update. Pre-written for the on-call engineer who has never seen this problem.</p><p>Every AI product roadmap should include documented incident response procedures: quality degradation playbook (who is paged, what they check first, rollback criteria), drift response playbook (when to retrain vs patch the prompt vs add data), provider update playbook (staging test procedure, canary rollout). 
The test of a good playbook: an on-call engineer who has never seen this specific incident should be able to follow it and resolve or contain the issue in under 30 minutes.</p><h1><strong>AI Roadmap vs Software Roadmap</strong></h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Px4-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Px4-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 424w, https://substackcdn.com/image/fetch/$s_!Px4-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 848w, https://substackcdn.com/image/fetch/$s_!Px4-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 1272w, https://substackcdn.com/image/fetch/$s_!Px4-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Px4-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png" width="1027" height="342" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:1027,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36562,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194929999?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Px4-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 424w, https://substackcdn.com/image/fetch/$s_!Px4-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 848w, https://substackcdn.com/image/fetch/$s_!Px4-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 1272w, https://substackcdn.com/image/fetch/$s_!Px4-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d6b237e-cd08-4542-91a2-67c93d0ed29e_1027x342.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h1><strong>Dos &amp; Don&#8217;ts - AI Product Roadmap Planning</strong></h1><p><strong>&#9989; What Works</strong></p><p>&#9989; <strong>Allocate 20&#8211;30% of engineering capacity to lifecycle and observability.</strong> Teams that don&#8217;t plan this will have it consumed by incidents. Planned maintenance always takes less time than unplanned firefighting.</p><p>&#9989; <strong>Pin foundation model versions immediately when a provider update is announced.</strong> Test the new version in staging. 
Never migrate without a staged rollout.</p><p>&#9989; <strong>Build incident response playbooks before any incident.</strong> Write them for the on-call engineer who has never seen this problem before.</p><p>&#9989; <strong>Track quality as a distribution - P50, P90, P99.</strong> P99 quality degradation is often the first signal of drift. Mean quality hides it until it&#8217;s obvious.</p><p>&#9989; <strong>Link every roadmap item to an observability signal.</strong> If you can&#8217;t point to a metric that tells you whether this worked, it&#8217;s not ready for the roadmap yet.</p><p>&#9989; <strong>Review observability trends weekly in sprint planning.</strong> Thirty minutes is enough to connect what your monitoring knows to what your roadmap says.</p><p><strong>&#10060; What Creates Roadmap Debt</strong></p><p>&#10060; <strong>Don&#8217;t write an AI roadmap as if the model were a static component.</strong> Every AI product roadmap needs lifecycle management, drift detection, and observability as first-class sections.</p><p>&#10060; <strong>Don&#8217;t rely on user complaints as your drift detection system.</strong> By the time users are complaining, you&#8217;re already weeks into a degradation event.</p><p>&#10060; <strong>Don&#8217;t skip canary stages under pressure to ship faster.</strong> The pressure is always highest when skipping is most dangerous.</p><p>&#10060; <strong>Don&#8217;t treat a foundation model update as a minor dependency update.</strong> Test it against your specific prompt flows. Improvements in aggregate often include regressions for specific tasks.</p><p>&#10060; <strong>Don&#8217;t monitor only error rates and latency.</strong> A model returning wrong answers with 100% uptime looks healthy on infrastructure monitoring. 
It isn&#8217;t.</p><p>&#10060; <strong>Don&#8217;t leave the risk register as a one-time document.</strong> A risk register that isn&#8217;t updated weekly is a snapshot of what you were worried about six months ago.</p><blockquote><p>&#8220;An AI product roadmap that doesn&#8217;t include model lifecycle management, drift detection, and observability is a plan for the first six months. After that, you&#8217;re responding to events instead of planning ahead. The infrastructure isn&#8217;t a tax on feature development - it&#8217;s what keeps feature development possible.&#8221;</p></blockquote><p>#AIProductRoadmap #ModelLifecycle #DriftDetection #AIObservability #LLMOps #ModelMonitoring #MLOps #MachineLearning #TechLeadership #AIEngineering #AIGovernance #EnterpriseAI #AIStrategy #BuildInPublic</p><p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;9fad7eac-767e-4e0c-a461-3c62053659f3&quot;}" data-component-name="MentionToDOM"></span> </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[POST 5 OF 5 · DATA STRATEGY FOR AI PRODUCTS Data Flywheel & Drift — How It All Compounds ]]></title><description><![CDATA[&#8592; PREVIOUS: Posts 1&#8211;4: Why Data Strategy, Versioning, Active Learning, Synthetic Data &#8594; NEXT: Series complete &#8212; the flywheel connects all three pillars into a compounding system]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/post-5-of-5-data-strategy-for-ai</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/post-5-of-5-data-strategy-for-ai</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Mon, 20 Apr 2026 21:08:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!2_Eq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>&#8220;The flywheel compounds if you feed it consistently. Data drift works against it silently. Understanding both is how you build a system that gets better on its own.&#8221;</strong></p><p>The data flywheel is what turns the three pillars into a compounding system. Data versioning makes each cycle traceable. Active learning makes each cycle efficient. Synthetic data makes each cycle scalable. 
But the flywheel only works if data drift isn&#8217;t quietly working against it.</p><h1>The 5-Stage Data Flywheel</h1><p><strong>Deploy Model to Production</strong></p><p>Every inference logs predictions + confidence scores. This is how the flywheel feeds itself.</p><p>The model goes to production. Every inference logs predictions and confidence scores. Without this logging, the flywheel has no input. With it, every user interaction is a signal about where the model is confident and where it isn&#8217;t. Over days and weeks, this builds a detailed map of the model&#8217;s uncertainty landscape &#8212; the map that active learning reads in Stage 2.</p><p><strong>Active Learning Selection</strong></p><p>Daily pipeline: 70% uncertainty-sampled + 30% diversity-sampled. 2&#8211;5% of production volume.</p><p>The daily active learning pipeline runs against yesterday&#8217;s prediction logs. It selects the 2&#8211;5% most informative examples: 70% by uncertainty, 30% by diversity. As the model improves over multiple flywheel cycles, this selection becomes more precise &#8212; the model is uncertain about genuinely harder cases, which are more informative to label. 
The flywheel accelerates as the model gets better.</p><p><strong>Label + Augment</strong></p><p>Human annotators label selected examples. Validated synthetic augmentation expands hard cases.</p><p>Human annotators label the active learning selection. For the hardest cases, apply validated synthetic augmentation: generate variations using LLM augmentation or programmatic generation, run the cosine similarity filter, and add passing examples to the annotation batch. Every labelled batch is versioned in DVC before being added to the training pool.</p><p><strong>Version + Train</strong></p><p>New dataset version committed to DVC. MLflow logs dataset hash and all metrics.</p><p>The augmented dataset receives a new DVC version. The training run logs the DVC hash alongside all hyperparameters and metrics in MLflow. Distribution statistics are computed and logged before training proceeds. A significant distribution shift at this point is a flag to investigate before training continues. This is the traceability step that makes the flywheel auditable.</p><p><strong>Evaluate + Ship &#8594; Loop Restarts</strong></p><p>Automated evaluation, then a canary release. If metrics hold, expand the rollout. The loop restarts at Stage 1.</p><p>The new model goes through automated evaluation against a held-out test set. If metrics hold, a canary release deploys to 1% of traffic, then 5%, 25%, and 100%, with monitoring at each stage. When the model hits production, it restarts Stage 1: new inferences generate new confidence scores, the next active learning cycle begins, and the flywheel keeps turning. 
After six months, the model is better in exactly the ways users have been testing it.</p><h1>Data Drift - Three Types, Three Different Fixes</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2_Eq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2_Eq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 424w, https://substackcdn.com/image/fetch/$s_!2_Eq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 848w, https://substackcdn.com/image/fetch/$s_!2_Eq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 1272w, https://substackcdn.com/image/fetch/$s_!2_Eq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2_Eq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png" width="1024" height="369" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:369,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89547,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194845668?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2_Eq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 424w, https://substackcdn.com/image/fetch/$s_!2_Eq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 848w, https://substackcdn.com/image/fetch/$s_!2_Eq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 1272w, https://substackcdn.com/image/fetch/$s_!2_Eq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F184ab746-2aae-4f69-9622-363a5c4c0642_1024x369.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Series Summary - One Action Per Post</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SXpV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SXpV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 424w, 
https://substackcdn.com/image/fetch/$s_!SXpV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 848w, https://substackcdn.com/image/fetch/$s_!SXpV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 1272w, https://substackcdn.com/image/fetch/$s_!SXpV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SXpV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png" width="1030" height="394" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:394,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:70411,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194845668?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!SXpV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 424w, https://substackcdn.com/image/fetch/$s_!SXpV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 848w, https://substackcdn.com/image/fetch/$s_!SXpV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 1272w, https://substackcdn.com/image/fetch/$s_!SXpV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0c62ea-3ec2-427e-9c62-857f917494c6_1030x394.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#9989; What compounds</strong></p><p>&#9989; <strong>Run the flywheel continuously, not in periodic sprints.</strong> The compounding effect comes from consistent small cycles, not from large infrequent retrains.</p><p>&#9989; <strong>Monitor PSI on input features as a first-class production metric.</strong> Drift caught at PSI &gt; 0.2 is a retrain trigger. Drift caught at PSI &gt; 0.5 is a performance incident.</p><p>&#9989; <strong>Treat each drift type differently.</strong> The same fix applied to the wrong drift type makes things worse.</p><p><strong>&#10060; What stalls the flywheel</strong></p><p>&#10060; <strong>Don&#8217;t run the flywheel without monitoring drift.</strong> The flywheel trains on production data. If that data is drifting, the flywheel amplifies the drift.</p><p>&#10060; <strong>Don&#8217;t add more data with outdated labels when label drift is the issue.</strong> Update the guidelines first. More data with old labels anchors the model to ground truth that has changed.</p><blockquote><p>&#8220;After six months of the data flywheel running consistently, your model is not just better than when you launched &#8212; it is better in exactly the ways that your users have been testing it through their actual usage. 
That is the compounding return on a data strategy that works.&#8221;</p></blockquote><p>#DataStrategy #DataFlywheel #DataDrift #ActiveLearning #DataVersioning #SyntheticData #MachineLearning #MLOps #DataEngineering #AIProducts #ProductionML #TechLeadership #AIEngineering #BuildInPublic</p>]]></content:encoded></item><item><title><![CDATA[POST 4 OF 5 · DATA STRATEGY FOR AI PRODUCTS Synthetic Data — Expand Without Proportional Annotation Cost]]></title><description><![CDATA[&#8592; PREVIOUS: Post 3: Active Learning &#8212; selecting the most informative examples to label &#8594; NEXT: Post 5: Data Flywheel + Drift &#8212; how everything compounds and what works against it]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/post-4-of-5-data-strategy-for-ai</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/post-4-of-5-data-strategy-for-ai</guid><dc:creator><![CDATA[The Digital AI With 
Ashish]]></dc:creator><pubDate>Mon, 20 Apr 2026 17:11:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Y7Vd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>&#8220;Every team that has seen synthetic data hurt their model skipped the same step: they didn&#8217;t check whether the generated distribution overlapped with the real one. The UMAP visualisation takes ten minutes. The debugging session it prevents takes weeks.&#8221;</strong></p></blockquote><p>Synthetic data can expand a dataset by 10&#215; at a fraction of the cost of human labelling. It can also degrade model performance by 20% if the synthetic distribution doesn&#8217;t match production. The difference between these two outcomes is entirely in the validation step that most teams skip.</p><h1>When It Works &#8212; and When It Reliably Fails</h1><p><strong>Works: Rare event augmentation</strong></p><p>Generate more examples of the 0.1% failure cases. 
Edge cases that would take years to collect naturally.</p><p><strong>Works: Linguistic variation</strong></p><p>Paraphrase existing examples to increase surface diversity without changing the label.</p><p><strong>Works: Controlled counterfactuals</strong></p><p>Change one variable while holding others constant. Tests model sensitivity to specific features.</p><p><strong>Fails: Distribution mismatch</strong></p><p>Generated examples cover regions that real users never actually query. The model trains on a different problem.</p><p><strong>Fails: Label contamination</strong></p><p>The LLM generating the examples has no reliable ground truth for your domain, so mislabelled examples enter training.</p><p><strong>Fails: Replacing human ground truth</strong></p><p>Synthetic labels replacing human-verified labels in high-stakes domains &#8212; medical, legal, financial.</p><h1>Three Generation Approaches</h1><p><strong>LLM-Based Augmentation &#8212; Practical Starting Point for Text</strong></p><p>Prompt GPT-4 or Claude. Validate distribution alignment. Discard below 0.70 cosine similarity.</p><p>For text-based products, LLM augmentation is the fastest approach. Three uses work reliably: paraphrase generation (same meaning, different phrasing), instruction variation (same intent, different wordings), and rare class augmentation (generate underrepresented categories). Validate before training: embed the generated and real examples and visualise them in UMAP &#8212; they should overlap substantially. Compute the cosine similarity between each synthetic example and its nearest real neighbour. Discard any synthetic example where similarity &lt; 0.70. Track the percentage of generated examples that pass; a declining pass rate signals drift in the generation prompt.</p><p><strong>Programmatic Generation &#8212; Full Control Over the Distribution</strong></p><p>Write code to generate systematic variations. No LLM biases. 
You control the distribution exactly.</p><p>For structured inputs (tabular, code, structured text), programmatic generation is more reliable because you control the distribution. Define meaningful variation dimensions for your input space (code assistant: language, complexity, error type; fraud classifier: amount, merchant category, time pattern). Write generators that sample from these dimensions. Use domain knowledge for realistic ranges and co-occurrence constraints. No distribution bias beyond what you deliberately encode. Good for: code generation, tabular ML, structured prediction, systematic edge case coverage.</p><p><strong>Augmentation for Multimodal Inputs</strong></p><p>Images: geometric transforms. Audio: pitch shift + noise. Text: backtranslation. Stay in the plausible production range.</p><p>Images: random crop, horizontal flip, colour jitter, rotation (&#177;15&#176;), mixup/cutmix. Use albumentations or torchvision.transforms. Audio: pitch shift (&#177;2 semitones), time stretch (0.9&#215;&#8211;1.1&#215;), background noise at realistic SNR. Text: back-translation (translate to another language and back &#8212; changes surface form, preserves meaning), synonym substitution, random deletion/swap at small probability. One rule across all modalities: augmentations should produce inputs that could plausibly appear in production. 
Augmentations creating impossible inputs hurt more than they help.</p><h1>Synthetic Data Validation &#8212; The 5 Checks</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Y7Vd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 424w, https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 848w, https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 1272w, https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png" width="1030" height="349" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:349,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:65291,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194820574?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 424w, https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 848w, https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 1272w, https://substackcdn.com/image/fetch/$s_!Y7Vd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feb51b02b-111c-489f-a682-58cd8dd0bef1_1030x349.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#9989; What works</strong></p><p>&#9989; <strong>Always run the UMAP distribution check before training.</strong> Ten minutes of validation prevents weeks of debugging.</p><p>&#9989; <strong>Use synthetic data for rare events and surface diversity.</strong> These are the two scenarios with the best economics and most manageable distribution risk.</p><p><strong>&#10060; What reliably hurts</strong></p><p>&#10060; <strong>Don&#8217;t add synthetic data without validating distribution alignment.</strong> Synthetic data that doesn&#8217;t overlap with your real distribution trains the model on the wrong problem.</p><p>&#10060; <strong>Don&#8217;t scale generation before validating on a small batch.</strong> Generate 100 examples, run the full validation, confirm the pass rate, then scale.</p><blockquote><p>&#8220;The 
validation step is not optional. It is the entire difference between synthetic data helping your model and synthetic data hurting it.&#8221;</p><p>- On synthetic data validation as a non-negotiable practice</p></blockquote><p>#SyntheticData #DataAugmentation #DataStrategy #MachineLearning #NLP #LLMTraining #TrainingData #MLOps #AIEngineering #TechLeadership</p>]]></content:encoded></item><item><title><![CDATA[POST 3 OF 5 · DATA STRATEGY FOR AI PRODUCTS Active Learning — Label the 2–5% That Teaches Most]]></title><description><![CDATA[&#8592; PREVIOUS: Post 2: Data Versioning &#8212; DVC + MLflow for reproducible training &#8594; NEXT: Post 4: Synthetic Data &#8212; generating and validating training data augmentation]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/post-3-of-5-data-strategy-for-ai</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/post-3-of-5-data-strategy-for-ai</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Fri, 17 Apr 
2026 10:03:10 GMT</pubDate><content:encoded><![CDATA[<p><strong>&#8220;Active learning is not a sophisticated ML technique. It is a question: which examples, if labelled, would change the model the most? The answer is in your production logs. You are already generating it on every prediction.&#8221;</strong></p><p>Random sampling from your unlabelled pool gives you a representative distribution of what your model already handles correctly. Active learning asks a different question: which examples, if labelled, would change the model&#8217;s behaviour most? In practice: 2&#8211;5% of your production data volume, strategically selected, produces the same improvement as 40&#8211;60% labelled randomly.</p><h1>The 5-Stage Active Learning Pipeline</h1><p><strong>Log Confidence Scores for Every Production Prediction</strong></p><p>The confidence score is the raw material. Without it, nothing in this pipeline exists.</p><p>For every production prediction, store: the input, the predicted output, the confidence score (softmax probability for classifiers; perplexity or token-level probability for LLMs), the timestamp, and the user segment. 
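</p><p>As a rough sketch of what one such log record could look like for a classifier: the helper name build_log_record, the field names, and the example labels and segment below are illustrative assumptions, not part of any specific logging library.</p>

```python
import json
import math
from datetime import datetime, timezone


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_log_record(input_text, logits, labels, user_segment):
    """Assemble the five fields listed above into one JSON-serialisable dict.

    Hypothetical helper: the field names are an assumption, not a standard.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "input": input_text,
        "prediction": labels[best],
        "confidence": probs[best],  # max softmax probability
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_segment": user_segment,
    }


# Example values are made up for illustration.
record = build_log_record(
    "refund not received",
    [2.0, 0.5, 0.1],
    ["billing", "shipping", "other"],
    "smb",
)
print(json.dumps(record))
```

<p>Writing one JSON line per prediction like this keeps the log easy to sort by confidence later. </p><p>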
For LLMs where a softmax over classes isn&#8217;t directly available, proxy metrics work: agreement across several samples drawn at the same temperature (high disagreement = uncertain), entropy of the token-level output distribution, or a small calibrated classifier trained to predict model error from input features.</p><p><strong>Uncertainty Sampling &#8212; Find What the Model Doesn&#8217;t Know</strong></p><p>Sort by confidence ascending. The bottom 2&#8211;5% are your most informative examples.</p><p>Three variants. Least confidence: the lowest maximum class probability, scored as uncertainty = 1 - max(softmax_probs). Margin sampling: the smallest gap between the top-1 and top-2 class probabilities (the model is torn between exactly two options). Entropy sampling: the highest entropy across all class probabilities. Practical threshold: top 2&#8211;5% by uncertainty score, sampled daily from production logs. This is where your decision-boundary cases live &#8212; the inputs your training set didn&#8217;t cover.</p><p><strong>Diversity Sampling &#8212; Cover the Full Input Space</strong></p><p>70% uncertainty-selected, 30% diversity-selected. Prevents the clustering problem.</p><p>Pure uncertainty sampling tends to select near-duplicate examples clustered in one region of your input space. Add diversity sampling: cluster production logs by semantic similarity (embed the inputs, then run K-means or a FAISS nearest-neighbour search). From each cluster, select the most uncertain example. Alternatively, use the Core Set method: select examples that maximise the minimum distance to any already-labelled example in embedding space. A 70% uncertainty / 30% diversity split covers both depth (hard cases) and breadth (full distribution coverage).</p><p><strong>Human-in-the-Loop Annotation</strong></p><p>Active learning selects. Humans label. Close the loop with an evaluation run after every batch.</p><p>Route selected examples to your annotation tool (Label Studio, Scale AI, Prodigy, or custom). 
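</p><p>Before a batch reaches the annotation tool, the selection described in the two sampling stages above might be sketched like this. It is a simplification under stated assumptions: uncertainty uses the least-confidence score, the diversity share uses a greedy farthest-point step in the spirit of the Core Set idea (measured against examples already chosen for the batch rather than already labelled), and select_batch, probs_by_id, and embeddings_by_id are hypothetical names.</p>

```python
import math


def least_confidence(probs):
    """1 - max class probability: higher means less certain."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the top-1 and top-2 probabilities: smaller means less certain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def entropy(probs):
    """Shannon entropy in nats: higher means less certain."""
    return -sum(p * math.log(p) for p in probs if p)


def select_batch(probs_by_id, embeddings_by_id, budget, uncertainty_share=0.7):
    """Pick `budget` ids: most uncertain first, then most distant for diversity.

    Assumes budget is at least 1 and every id has an embedding.
    """
    n_uncertain = max(1, round(budget * uncertainty_share))
    ranked = sorted(
        probs_by_id,
        key=lambda i: least_confidence(probs_by_id[i]),
        reverse=True,
    )
    chosen = ranked[:n_uncertain]
    remaining = [i for i in ranked if i not in chosen]
    while remaining and len(chosen) != budget:
        # Greedy farthest-point step: take the candidate whose nearest
        # already-chosen neighbour is farthest away in embedding space.
        pick = max(
            remaining,
            key=lambda i: min(
                math.dist(embeddings_by_id[i], embeddings_by_id[c]) for c in chosen
            ),
        )
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


# Toy data, made up for illustration: class probabilities and 2-D embeddings.
probs = {
    "a": [0.9, 0.1],
    "b": [0.55, 0.45],
    "c": [0.6, 0.4],
    "d": [0.99, 0.01],
    "e": [0.8, 0.2],
}
embeddings = {
    "a": [0.0, 0.0],
    "b": [1.0, 1.0],
    "c": [1.1, 1.0],
    "d": [5.0, 5.0],
    "e": [0.2, 0.1],
}
print(select_batch(probs, embeddings, budget=3))  # prints ['b', 'c', 'd']
```

<p>With a budget of 3 and a 70/30 split, the two least confident examples are taken first and the remaining slot goes to the example farthest from them in embedding space. </p><p>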
Prioritise by uncertainty score within the batch &#8212; annotators see the hardest cases first. Set a daily annotation budget (50&#8211;200 examples) to keep labelling sustainable. Use annotator confidence as a data quality signal. Close the loop: every labelled batch should trigger a model evaluation run. Track: how much did the model&#8217;s uncertainty on the newly labelled examples drop after incorporation?</p><p><strong>Outcome-Based Feedback &#8212; The Free Label Source</strong></p><p>User behaviour signals right and wrong predictions with no annotation cost.</p><p>Some products generate implicit labels from user behaviour: thumbs up/down on a chatbot response, click-through rate on a recommendation, correction behaviour (user edits AI-generated text), task completion rate. These are noisy signals, not ground truth. Use high-confidence behavioural negatives as hard negatives for contrastive training. Weight behavioural labels lower than human-annotated ones in your training loss. Treat as weak supervision, not ground truth.</p><p><strong>&#9989; What works</strong></p><p>&#9989; <strong>Log confidence scores from day one.</strong> This is the prerequisite for everything in active learning. The earlier you start logging, the richer the signal.</p><p>&#9989; <strong>Mix uncertainty and diversity sampling (70/30).</strong> Uncertainty alone creates a biased annotation batch. Diversity ensures full input space coverage.</p><p>&#9989; <strong>Set a sustainable daily annotation budget.</strong> 100 carefully selected examples per day, consistently, beats 10,000 randomly selected in one expensive sprint.</p><p><strong>&#10060; What wastes annotation budget</strong></p><p>&#10060; <strong>Don&#8217;t label random samples when you have production confidence scores.</strong> Random labelling is expensive. 
Uncertainty-based selection produces more improvement per dollar.</p><p>&#10060; <strong>Don&#8217;t treat behavioural feedback as ground truth.</strong> Use it as weak supervision, weighted lower than human labels.</p><blockquote><p>&#8220;2&#8211;5% of your production volume, strategically selected by uncertainty and diversity, produces the same model improvement as 40&#8211;60% labelled randomly. The annotation budget is the same. The result isn&#8217;t.&#8221;</p></blockquote><p>#ActiveLearning #DataAnnotation #DataStrategy #MachineLearning #MLOps #UncertaintySampling #TrainingData #AIEngineering #TechLeadership</p><p>@ActiveLearning @DataAnnotation @DataStrategy @MachineLearning @MLOps @UncertaintySampling @TrainingData @AIEngineering @TechLeadership</p>]]></content:encoded></item><item><title><![CDATA[POST 2 OF 5 · DATA STRATEGY FOR AI PRODUCTS Data Versioning — Make Every Training Run Reproducible ]]></title><description><![CDATA[&#8592; PREVIOUS: Post 1: Why Data Strategy &#8212; the three pillars and how they connect &#8594; NEXT: Post 3: Active Learning &#8212; logging confidence scores and uncertainty sampling]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/post-2-of-5-data-strategy-for-ai</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/post-2-of-5-data-strategy-for-ai</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Thu, 16 Apr 2026 22:38:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eKf0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>&#8220;git checkout v1.2.0 &amp;&amp; dvc pull. That&#8217;s the command that restores both the code and the exact dataset at that commit. 
If you can&#8217;t run that command and get the same model, you don&#8217;t have version control &#8212; you have file history.&#8221;</strong></p><h1>DVC + MLflow &#8212; How They Work Together</h1><p><strong>DVC &#8212; DATA VERSION CONTROL</strong></p><p><strong>Git for Datasets &#8212; Pointer in Git, Data in Remote Storage</strong></p><p>When you run dvc add data/train.csv, DVC writes a small data/train.csv.dvc pointer file (containing the file&#8217;s content hash, size, and path), stores the data in its local cache, and adds the original file to .gitignore. You commit the pointer to Git; dvc push then uploads the cached data to the remote storage you have configured (S3, GCS, Azure Blob). Git tracks the pointer. DVC tracks the content. git checkout v1.2.0 &amp;&amp; dvc pull restores both the code and the exact dataset at that commit. Content-addressed hashing means any change to the data produces a different hash. Key commands: dvc add (start tracking), dvc push/pull (sync with remote), dvc repro (re-run the pipeline stages whose inputs changed). Critical discipline: every dataset modification gets a new tracked version (re-run dvc add and commit the updated pointer to Git). 
No ad-hoc in-place edits.</p><p><strong>MLflow &#8212; EXPERIMENT TRACKING</strong></p><p><strong>Link Dataset Versions to Model Performance &#8212; Build a Queryable History</strong></p><p>DVC answers &#8220;what data was this?&#8221; MLflow answers &#8220;what happened when we trained on it?&#8221; In every training run, log: dataset path + DVC hash, all hyperparameters, train/validation/test metrics, and the model artifact. The result: a queryable history of &#8220;which data version + which parameters = which performance.&#8221; When a model degrades in production, the debugging question becomes: same code, different data? Same data, different hyperparameters? Both questions are answerable in seconds if you&#8217;ve been logging. MLflow can also hold data drift metrics. It does not compute them for you: in your training script, compare the distribution of the current dataset against the last production dataset, log the KL divergence as a run metric, and alert if it crosses the threshold you have set.</p><h1>Dataset Versioning Schema</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eKf0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eKf0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 424w, https://substackcdn.com/image/fetch/$s_!eKf0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 848w, 
https://substackcdn.com/image/fetch/$s_!eKf0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 1272w, https://substackcdn.com/image/fetch/$s_!eKf0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eKf0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png" width="1030" height="478" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84778,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194459561?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eKf0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 424w, 
https://substackcdn.com/image/fetch/$s_!eKf0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 848w, https://substackcdn.com/image/fetch/$s_!eKf0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 1272w, https://substackcdn.com/image/fetch/$s_!eKf0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce7cc1e4-4782-487b-bec4-cff86bfcab94_1030x478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#9989; What works</strong></p><p>&#9989; <strong>Version every dataset change, no exceptions.</strong> The discipline pays off the first time you need to reproduce a model. One-time setup, permanent value.</p><p>&#9989; <strong>Log the DVC hash alongside every training metric in MLflow.</strong> This is the link between &#8220;what we trained on&#8221; and &#8220;what we got.&#8221;</p><p>&#9989; <strong>Apply semantic versioning to datasets.</strong> Major version = schema change. Minor = new data. Patch = corrections.</p><p><strong>&#10060; What creates debugging debt</strong></p><p>&#10060; <strong>Don&#8217;t modify datasets in place.</strong> In-place edits destroy history. Always create a new version.</p><p>&#10060; <strong>Don&#8217;t track data separately from experiment results.</strong> A dataset version with no linked model performance is half the story.</p><blockquote><p>&#8220;You can reproduce a commit. Can you reproduce the model that commit produced? If the answer is no &#8212; because you didn&#8217;t version the data that went with it &#8212; then you don&#8217;t have reproducible machine learning. You have reproducible code running on unknown data.&#8221;</p></blockquote><p>#DataVersioning #DVC #MLflow #ExperimentTracking #MLOps #DataStrategy #MachineLearning #DataEngineering #TechLeadership</p><p>@DataVersioning @DVC @MLflow @ExperimentTracking @MLOps @DataStrategy @MachineLearning @DataEngineering @TechLeadership</p>]]></content:encoded></item><item><title><![CDATA[POST 1 OF 5 · DATA STRATEGY FOR AI PRODUCTS Why Data Strategy — The Foundation Everything Else Builds On ]]></title><description><![CDATA[&#8594; NEXT: Post 2: Data Versioning &#8212; DVC + MLflow, reproducible training runs]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/post-1-of-5-data-strategy-for-ai</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/post-1-of-5-data-strategy-for-ai</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Thu, 16 Apr 2026 02:52:24 GMT</pubDate><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pf_r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pf_r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 424w, https://substackcdn.com/image/fetch/$s_!Pf_r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 848w, 
https://substackcdn.com/image/fetch/$s_!Pf_r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 1272w, https://substackcdn.com/image/fetch/$s_!Pf_r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pf_r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png" width="1021" height="136" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:136,&quot;width&quot;:1021,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24881,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194365073?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pf_r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 424w, 
https://substackcdn.com/image/fetch/$s_!Pf_r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 848w, https://substackcdn.com/image/fetch/$s_!Pf_r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 1272w, https://substackcdn.com/image/fetch/$s_!Pf_r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3f59d94c-fdb1-4a12-a89c-388200469a1f_1021x136.png 1456w" sizes="100vw" fetchpriority="high"></picture><div></div></div></a></figure></div><blockquote><p>"Two AI teams. Same model architecture. Same compute. Six months later, one has improved by 18 points. The other is debugging a drift incident. The difference is always data strategy."</p></blockquote><p>Most teams spend months tuning model architecture and almost no time on data strategy. Then they wonder why the model degrades in production. The teams that build great AI products aren&#8217;t always the ones with the best models &#8212; they&#8217;re the ones with the most systematic approach to data.</p><p><strong>THE DATA DECAY PATTERN</strong></p><p><strong>01. </strong>Team launches AI product. Quality drifts down three months later. Nobody can explain why &#8212; because nobody tracked which data version trained the current model.</p><p><strong>02. </strong>Team tries to retrain. Spends six weeks labelling 50,000 samples. Mostly examples the model already handles well. The 200 hard edge cases causing 80% of failures are still unlabelled.</p><p><strong>03. </strong>Team tries synthetic data. Performance drops. The synthetic distribution didn&#8217;t match production. Nobody validated it before training.</p><p><strong>04. 
</strong>Team rebuilds: DVC + MLflow for versioning, uncertainty sampling for active learning, validated synthetic augmentation. Next retrain: 3,000 strategic samples. +14 points. Every result reproducible and traceable.</p><h1>The Three Pillars</h1><p><strong>Data Versioning &#8212; Make Every Training Run Reproducible</strong></p><p>DVC tracks the data. MLflow connects data versions to model performance.</p><p>Your model is a function of two things: the architecture and the data it trained on. Most teams version their code carefully and version their data almost never. This means you can reproduce a commit but not a model. DVC stores a pointer in Git and the actual data in remote storage. MLflow logs the dataset hash alongside every training metric. When production performance degrades, you can compare the current model against the last good one: same code, different data? Same data, different hyperparameters? The answer is in the logs &#8212; if you built the logging.</p><p><strong>Active Learning &#8212; Label the 2&#8211;5% That Teaches Most</strong></p><p>Uncertainty sampling on production logs. Same improvement as 40&#8211;60% labelled randomly.</p><p>Random labelling gives you a representative sample of what your model already handles well. Active learning asks: which examples, if labelled, would change the model&#8217;s behaviour the most? Uncertainty sampling selects the inputs the model is least confident about. Diversity sampling ensures coverage of the full input space. The result: 2&#8211;5% of your production volume, strategically selected, produces the same improvement as 40&#8211;60% labelled at random. Your annotation budget stays the same. Your model improves faster.</p><p><strong>Synthetic Data &#8212; Expand Without Proportional Annotation Cost</strong></p><p>10&#215; dataset expansion is possible. The validation step is what makes it work.</p><p>Synthetic data can expand a dataset by 10&#215; at a fraction of the cost of human labelling. 
It can also degrade model performance by 20% if the synthetic distribution doesn&#8217;t match production. The difference is the validation step: embed generated and real examples, visualise overlap in UMAP, compute cosine similarity between each synthetic example and its nearest real neighbour. Discard anything below 0.70 similarity. One unvalidated batch can undo months of careful labelling.</p><p><strong>THE DATA FLYWHEEL</strong></p><p><strong>Deploy &#8594; Active Learning &#8594; Label + Augment &#8594; Version + Train &#8594; Evaluate + Ship &#8594; Repeat</strong></p><p>Each sprint through this loop makes the next one more efficient. As your model improves, uncertainty sampling becomes more precise &#8212; finding harder cases, which are more informative to label. After six months, your model is better in exactly the ways your users have been pushing on it.</p><p><strong>&#9989; What compounds over time</strong></p><p>&#9989; <strong>Treat data as a first-class engineering asset.</strong> Version it, test it, monitor it. The same discipline you apply to code.</p><p>&#9989; <strong>Connect data versions to model performance metrics.</strong> Without this link, you can&#8217;t tell whether a retrain improved because of data or architecture changes.</p><p>&#9989; <strong>Log confidence scores from day one.</strong> This is the raw material for everything in Posts 3 and 5. It costs almost nothing and enables everything.</p><p><strong>&#10060; What leads to data debt</strong></p><p>&#10060; <strong>Don&#8217;t invest in model architecture while neglecting data quality.</strong> A better architecture trained on degraded data loses to a simpler model trained on clean, well-curated data.</p><p>&#10060; <strong>Don&#8217;t label data randomly when you have production signals.</strong> Random labelling is expensive and slow. 
Uncertainty sampling is faster and more effective.</p><p>&#10060; <strong>Don&#8217;t treat data versioning as optional until you need it.</strong> By the time you need it, it&#8217;s too late to add retroactively.</p><blockquote><p>&#8220;The model is the visible output. The data strategy is the invisible infrastructure. Teams that get the infrastructure right improve every sprint without heroics. Teams that don&#8217;t are constantly surprised by performance degradation they can neither explain nor reverse.&#8221;</p></blockquote><p></p><p>#DataStrategy #MachineLearning #MLOps #AIEngineering #DataVersioning #ActiveLearning #SyntheticData #TechLeadership #ProductionML</p><p></p><p>@DataStrategy @MachineLearning @MLOps @AIEngineering @DataVersioning @ActiveLearning @SyntheticData @TechLeadership @ProductionML</p>]]></content:encoded></item><item><title><![CDATA[Data Strategy for AI Products]]></title><description><![CDATA[Data Versioning &#183; Active Learning &#183; Synthetic Data &#183; The Data Flywheel]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/data-strategy-for-ai-products</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/data-strategy-for-ai-products</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Tue, 14 Apr 2026 10:02:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!CTbm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p>"Two AI teams. Same model architecture. Same compute. Six months later, one has improved by 18 points. The other is debugging a drift incident. The difference is always data strategy."</p></blockquote><p>Most teams spend months tuning model architecture and almost no time on data strategy. Then they wonder why the model degrades in production. 
The teams that build great AI products aren&#8217;t always the ones with the best models - they&#8217;re the ones with the most systematic approach to data. Data versioning, active learning, and synthetic data are the three practices that separate teams that improve from teams that stagnate.</p><p><strong>THE DATA DECAY PATTERN - FOUR STEPS, REPEATING</strong></p><p><strong>01. </strong>Team launches AI product. Performance looks great. Three months later, quality metrics start drifting down. Nobody can explain why - because nobody tracked which data version trained the current model.</p><p><strong>02. </strong>Team tries to retrain. Spends 6 weeks labelling 50,000 samples - mostly examples the model already handles well. The 200 hard edge cases causing 80% of failures are still unlabelled.</p><p><strong>03. </strong>Team tries synthetic data. Generates 100,000 examples. Trains on them. Model performance drops. The synthetic distribution didn&#8217;t match production. Nobody validated it before training.</p><p><strong>04. 
</strong>Team rebuilds with data versioning (DVC + MLflow), active learning (uncertainty sampling on production logs), and validated synthetic augmentation. Next retrain: 3,000 strategically selected samples. Performance improves 14 points. Every result is reproducible and traceable.</p><h1>The Three Pillars - How They Connect</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CQmi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CQmi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 424w, https://substackcdn.com/image/fetch/$s_!CQmi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 848w, https://substackcdn.com/image/fetch/$s_!CQmi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 1272w, https://substackcdn.com/image/fetch/$s_!CQmi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CQmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png" width="1027" height="283" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:283,&quot;width&quot;:1027,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69208,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194143818?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CQmi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 424w, https://substackcdn.com/image/fetch/$s_!CQmi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 848w, https://substackcdn.com/image/fetch/$s_!CQmi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 1272w, https://substackcdn.com/image/fetch/$s_!CQmi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9eeb9a99-5df5-419a-96b4-79f13fc92fc9_1027x283.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Pillar 01 - Data Versioning</h1><p>Your model is a function of two things: the architecture and the data it trained on. Most teams version their code carefully and almost never version their data. This means you can reproduce a commit but not a model. If production performance degrades, you can&#8217;t roll back to the dataset that produced the last good model.</p><p><strong>// DVC + MLflow - How They Work Together</strong></p><blockquote><p><strong>DVC - DATA VERSION CONTROL</strong></p><p><strong>Git for Datasets - Pointer in Git, Data in Remote Storage</strong></p><p>DVC tracks datasets the same way Git tracks code. A small .dvc pointer file lives in Git (it records the content hash, size, and path of the tracked data) while the actual data lives in remote storage (S3, GCS, Azure Blob) configured as a DVC remote. 
git checkout v1.2.0 &amp;&amp; dvc pull restores both the code and the exact dataset at that commit. Critical discipline: every dataset change is re-tracked with dvc add, and the updated .dvc file is committed to Git. No ad-hoc data modifications without a version. Key commands: dvc add data/train.csv (start tracking), dvc push/pull (sync with remote), dvc repro (re-run the pipeline stages defined in dvc.yaml).</p></blockquote><p><strong>MLflow - EXPERIMENT TRACKING</strong></p><blockquote><p><strong>Link Data Versions to Model Performance</strong></p><p>DVC versions the data. MLflow connects the data version to the experiment that used it. In every training run, log: dataset path + DVC hash, hyperparameters, metrics (train/val/test), model artifacts. The result: a queryable history of &#8220;which data version + which parameters = which performance.&#8221; When a model degrades in production, you can compare against the last good run: same code, different data? Same data, different hyperparameters? MLflow can also hold data drift metrics: compute a KL divergence between the current dataset and the last production dataset as part of your training pipeline, and log it as a metric on the run.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CTbm!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CTbm!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 424w, https://substackcdn.com/image/fetch/$s_!CTbm!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 848w, 
https://substackcdn.com/image/fetch/$s_!CTbm!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 1272w, https://substackcdn.com/image/fetch/$s_!CTbm!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CTbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png" width="1023" height="478" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:1023,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:87287,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194143818?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CTbm!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 424w, 
https://substackcdn.com/image/fetch/$s_!CTbm!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 848w, https://substackcdn.com/image/fetch/$s_!CTbm!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 1272w, https://substackcdn.com/image/fetch/$s_!CTbm!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F428332b1-265f-4daa-ae3d-c2b982d78bc6_1023x478.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Pillar 02 - Active Learning</h1><p>Random sampling from your unlabelled pool gives you a representative slice of your existing data - which is mostly examples your model already handles correctly. Active learning asks a different question: which examples, if labelled, would change the model&#8217;s behaviour most? In practice, 2&#8211;5% of your production data volume, strategically selected, can produce the same improvement as 40&#8211;60% labelled randomly.</p><p><strong>Log Production Predictions with Confidence Scores: </strong>Every inference produces a confidence score. Log it. That score drives the entire pipeline.</p><p>For every production prediction, store: input, predicted output, confidence score (softmax probability for classifiers; perplexity or token probability for LLMs), timestamp, user segment, and outcome signal if available. For LLMs where output probabilities aren&#8217;t directly available, proxy metrics work: semantic consistency across multiple samples, entropy of the output distribution, or a calibrated classifier trained to predict model error from input features.</p><p><strong>Uncertainty Sampling: </strong>Sort by confidence ascending. The bottom 2&#8211;5% are your most informative examples to label.</p><p>Select examples where model confidence is lowest. Three variants: Least confidence - lowest max class probability, scored as uncertainty = 1 - max(softmax_probs). Margin sampling - smallest gap between top-1 and top-2 class probability (the model is confused between two options). Entropy sampling - highest entropy across all class probabilities. Threshold to queue for labelling: top 2&#8211;5% by uncertainty score, sampled daily from production logs. 
This is where you find edge cases your training set didn&#8217;t cover - examples that live at the decision boundary.</p><p><strong>Diversity Sampling: </strong>Uncertainty alone creates a biased sample clustered at one boundary. Diversity covers the full input space.</p><p>Pure uncertainty sampling tends to select near-duplicate examples clustered in the same region of the input space. Add diversity sampling: cluster production logs by semantic similarity (embed inputs, then run K-means or a FAISS nearest-neighbour search). From each cluster, select the most uncertain example. Core-set method: select examples that maximise the minimum distance to any already-labelled example in embedding space. Practical target: 70% uncertainty-selected, 30% diversity-selected. This helps your annotation batch cover the full input space.</p><p><strong>Human-in-the-Loop Annotation: </strong>Active learning selects. Humans label. The loop closes when new labels retrain the model.</p><p>Route selected examples to your annotation tool (Label Studio, Scale AI, Prodigy, or custom). Prioritise by uncertainty score - annotators see the hardest cases first. Set a daily annotation budget (50&#8211;200 examples) to keep labelling sustainable. Use annotator confidence as a data quality signal. Close the loop: every labelled batch should trigger a model evaluation run. Track: how much did the model&#8217;s uncertainty on newly labelled examples drop after incorporation? 
A drop confirms the labelling was informative.</p><p><strong>Outcome-Based Feedback - The Free Label Source: </strong>User behaviour signals which predictions were right and which were wrong - no annotation cost.</p><p>Some products generate implicit labels from user behaviour: thumbs up/down on a chatbot response, click-through rate on a recommendation, correction behaviour (user edits AI-generated text), task completion rate. These are noisy signals, not ground truth. At scale, the signal is still valuable: use high-confidence behavioural negatives (user immediately dismissed or corrected output) as hard negatives for contrastive training. Weight behavioural labels lower than human-annotated ones in your training loss, and treat them as weak supervision.</p><h1>Pillar 03 - Synthetic Data</h1><p>Synthetic data can expand a dataset by 10&#215; at a fraction of the cost of human labelling. It can also degrade model performance by 20% if the synthetic distribution doesn&#8217;t match production. The difference is whether you validate before training.</p><p><strong>When Synthetic Data Works - and When It Doesn&#8217;t</strong></p><p>Works well: rare event augmentation (more examples of the 0.1% failure cases), linguistic variation (paraphrase to increase surface diversity), controlled counterfactual generation. Fails when: the generation model has different biases than production data, the generated distribution covers regions production inputs never occupy, or synthetic labels are used to replace human ground truth without measuring alignment.</p><p><strong>LLM-Based Augmentation - The Most Practical Starting Point: </strong>Prompt GPT-4 or Claude to generate variations. Validate distribution alignment before training.</p><p>For text-based AI products (chatbots, RAG systems, classifiers), LLM-based augmentation is the fastest practical approach. Use for: paraphrase generation (same meaning, different phrasing - increases surface diversity without changing the label), instruction-following variation (same intent, different wordings - critical for instruction-tuned models), rare class augmentation (generate more examples of underrepresented categories). Validation before training: embed generated and real examples, visualise in t-SNE/UMAP - they should overlap, not cluster separately. Compute semantic similarity between each synthetic example and its nearest real neighbour. Discard synthetic examples where similarity &lt; 0.7.</p><p><strong>Programmatic Data Generation: </strong>Write code to generate systematic variations across known input dimensions. No distribution bias beyond what you encode.</p><p>For structured inputs (tabular data, code, structured text), programmatic generation is more reliable than LLM generation because you control the distribution exactly. Approach: define the meaningful variation dimensions for your input space (for a code assistant: language, complexity, error type; for a fraud classifier: transaction amount, merchant category, time pattern). Write generators that sample from these dimensions. Use domain knowledge to set realistic ranges. Why this works: you know exactly what the generated distribution covers.</p><p><strong>Data Augmentation for Multimodal Inputs: </strong>Images: geometric transforms + colour jitter. Audio: pitch shift + noise. Text: back-translation.</p><p>Images: random crop, horizontal flip, colour jitter, rotation (&#177;15&#176;), mixup/cutmix for training stability (albumentations or torchvision.transforms). Audio: pitch shift (&#177;2 semitones), time stretch (0.9&#215;&#8211;1.1&#215;), background noise addition. Text: back-translation (translate to another language and back - changes surface form while preserving meaning), synonym substitution (EDA: Easy Data Augmentation), random deletion/swap with small probability. The rule: augmentations should produce inputs that could plausibly appear in production. Augmentations that create impossible inputs hurt more than they help.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XYB3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XYB3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 424w, https://substackcdn.com/image/fetch/$s_!XYB3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 848w, https://substackcdn.com/image/fetch/$s_!XYB3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 1272w, https://substackcdn.com/image/fetch/$s_!XYB3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XYB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png" width="1029" height="343" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1029,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57404,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194143818?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XYB3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 424w, https://substackcdn.com/image/fetch/$s_!XYB3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 848w, https://substackcdn.com/image/fetch/$s_!XYB3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 1272w, https://substackcdn.com/image/fetch/$s_!XYB3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8c494ef-cadc-4f13-b4dc-2a1e59b1d0cd_1029x343.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>The Data Flywheel - How the Three Pillars Compound</h1><p><strong>01 &#183; Deploy Model</strong></p><p>Model goes to production. Every inference logs predictions + confidence scores to your data store.</p><p><strong>02 &#183; Active Learning</strong></p><p>Daily pipeline selects 2&#8211;5% most uncertain + diverse examples from production logs. Queued for annotation.</p><p><strong>03 &#183; Label + Augment</strong></p><p>Human annotators label selected examples. Synthetic augmentation expands the hard cases. Distribution validated.</p><p><strong>04 &#183; Version + Train</strong></p><p>New dataset version committed to DVC. Training run logged in MLflow with dataset hash + all metrics.</p><p><strong>05 &#183; Evaluate + Ship</strong></p><p>Automated evaluation against held-out set. Canary release. 
If metrics hold &#8594; expand rollout. Loop restarts.</p><p><strong>WHY THIS COMPOUNDS OVER TIME</strong></p><p><strong>Better Model &#8594; More Precise Selection &#8594; Better Labels &#8594; Stronger Model</strong></p><p>The flywheel accelerates for a structural reason: as your model improves, uncertainty sampling becomes more precise - it finds harder and harder cases, which are more informative to label. A weak model is uncertain about many things. A strong model is only uncertain about genuinely ambiguous cases - exactly where your annotation budget should be spent. After 6 months of this loop, your model is better in exactly the ways your users have been pushing on it. Teams that build this flywheel improve every sprint. Teams that don&#8217;t build it end up labelling at random and retraining in the hope that something changes.</p><h1>Data Drift - The Silent Killer of Production AI</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ROvC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ROvC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 424w, https://substackcdn.com/image/fetch/$s_!ROvC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 848w, https://substackcdn.com/image/fetch/$s_!ROvC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ROvC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ROvC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png" width="1027" height="432" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:432,&quot;width&quot;:1027,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82067,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/194143818?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ROvC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 424w, https://substackcdn.com/image/fetch/$s_!ROvC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 848w, 
https://substackcdn.com/image/fetch/$s_!ROvC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 1272w, https://substackcdn.com/image/fetch/$s_!ROvC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff818acd9-9ccb-4531-8a3e-edd7824aca14_1027x432.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Dos &amp; Don&#8217;ts - Data Strategy for AI Products</h1><p><strong>&#9989; What compounds over time</strong></p><p>&#9989; 
<strong>Version every dataset change with DVC, no exceptions.</strong> The discipline pays off the first time you need to reproduce a model or roll back a data change. One-time setup, permanent compounding value.</p><p>&#9989; <strong>Log confidence scores for every production prediction.</strong> This is the raw material of active learning. Without it, you have no input signal for uncertainty sampling.</p><p>&#9989; <strong>Validate synthetic data distribution alignment before training.</strong> Run a UMAP overlap check and a cosine-similarity filter before adding any synthetic batch. One unvalidated batch can undo months of improvement.</p><p>&#9989; <strong>Link every training run to its dataset version in MLflow.</strong> Makes the flywheel auditable. You can always trace a production model back to the exact data it trained on.</p><p>&#9989; <strong>Monitor data drift as a first-class production metric.</strong> PSI (Population Stability Index) on input features, implicit feedback trends, weekly human spot-checks. Drift caught early is a retrain trigger. Drift caught late is an incident.</p><p>&#9989; <strong>Use outcome-based feedback as a weak supervision signal.</strong> It&#8217;s not ground truth, but at scale it&#8217;s valuable. Weight it lower than human labels - use it for signal, not for replacing annotation.</p><p><strong>&#10060; What leads to data debt</strong></p><p>&#10060; <strong>Don&#8217;t label and retrain on random samples hoping diversity covers the gap.</strong> Labelling 50,000 random examples when 2,000 strategically selected ones would do the same work is a budget problem that compounds over sprints.</p><p>&#10060; <strong>Don&#8217;t add synthetic data without validating distribution alignment.</strong> Synthetic data that doesn&#8217;t overlap with your real distribution trains the model on a different problem than the one it will face in production.</p><p>&#10060; <strong>Don&#8217;t treat data versioning as optional until you need it.</strong> You will need it. 
The first time you need to reproduce a model and can&#8217;t - because the dataset was never versioned - is expensive and avoidable.</p><p>&#10060; <strong>Don&#8217;t assume test set quality matches production quality.</strong> If your test set and production distribution have diverged, your metrics are measuring the wrong thing. Validate test set representativeness regularly.</p><p>&#10060; <strong>Don&#8217;t ignore label drift.</strong> Adding more data with outdated labels anchors the model to old ground truth. Periodic re-evaluation of annotation guidelines is as important as adding new data.</p><blockquote><p>&#8220;The model is the visible output. The data strategy is the invisible infrastructure. Teams that get the infrastructure right improve every sprint without heroics. Teams that don&#8217;t are constantly surprised by performance degradation they can neither explain nor reverse.&#8221;</p></blockquote><p>#DataStrategy #DataVersioning #ActiveLearning #SyntheticData #MLOps #MachineLearning #DataEngineering #DVC #MLflow #DataFlywheel #AIProducts #ProductionML #TechLeadership #AIEngineering #DataQuality #BuildInPublic</p>
]]></content:encoded></item><item><title><![CDATA[Shipping AI Features Safely]]></title><description><![CDATA[Safety Filters &#183; Moderation Pipelines &#183; Red-Teaming &#183; Canary Releases]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/shipping-ai-features-safely</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/shipping-ai-features-safely</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Tue, 07 Apr 2026 19:16:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Lmei!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>&#8220;You built a great model. Tested it on clean data. Shipped it to real users. It did something terrible. None of this had to happen.&#8221;</strong></p></blockquote><p>AI safety in production is not about preventing every possible bad output &#8212; that&#8217;s impossible. It&#8217;s about designing systems where failures are caught before they reach users, contained when they slip through, and learned from so they don&#8217;t repeat. 
This post covers the four components that make this real: safety filters, moderation pipelines, red-teaming, and canary releases.</p><p><strong>THE PATTERN BEHIND MOST AI INCIDENTS</strong></p><p><strong>01. </strong>Team ships an AI feature. Tested on clean benchmark data. No red-team session, no adversarial inputs. User on day 3 asks something nobody tested. Model complies with something it absolutely shouldn&#8217;t.</p><p><strong>02. </strong>Team adds a keyword blocklist. User rephrases. Filter misses it. The filter was catching words, not intent.</p><p><strong>03. </strong>Team rolls back, rebuilds, ships new version to all users at once. New version has a regression. Affects 100% of users. Rollback takes 4 hours.</p><p><strong>04. </strong>Team rebuilds properly: semantic safety filters, layered moderation pipeline, adversarial red-team before every launch, canary releases at 1% &#8594; 5% &#8594; 25% &#8594; 100%. Next incident: caught at 1% traffic. 
Fixed before 99% of users saw it.</p><p><strong>END-TO-END SAFETY ARCHITECTURE</strong></p><p><strong>Every request follows the same path: </strong>Input Filter &#8594; Model &#8594; Moderation Pipeline &#8594; Canary Release &#8594; Monitoring. Red-teaming attacks every layer before launch.</p><blockquote><p><strong>Safety is not a launch gate &#8212; it is a continuous practice</strong></p><p>No single layer catches everything. Input filters stop known bad inputs. Moderation catches unexpected bad outputs. Canary releases limit blast radius. Monitoring closes the learning loop. Each layer catches what the previous one missed.</p></blockquote><h1>Layer 01 &#8212; Input Safety Filters</h1><p>The cheapest safety win is stopping bad inputs before the model ever sees them. Input filters typically run in under 10ms, while model inference takes 500ms&#8211;5s. Blocking at the input also means you never expose the model to adversarial patterns that might influence its output in unexpected ways.</p><p><strong>Classifier-Based Filters</strong></p><p>Fine-tuned models per harm category &#8212; run in parallel, not sequentially.</p><p>Train or fine-tune small classifiers (&lt;100M params) on harm categories: violence/harm, self-harm, PII, off-topic, jailbreak patterns. Run all classifiers in parallel &#8212; any single BLOCK triggers rejection. Return a structured decision: {"category": "jailbreak", "confidence": 0.94, "action": "block"}. Calibrate thresholds per category based on the cost of a false positive. Medical systems tolerate more false positives than e-commerce. Log every borderline decision for weekly threshold review.</p><p><strong>Rule-Based + Semantic Filters</strong></p><p>Keywords for speed. Semantic similarity for robustness against rephrasing.</p><p>Keyword blocklists are fast (microseconds) and catch obvious violations cheaply &#8212; but adversarial users route around them by rephrasing. 
Combine with semantic similarity: embed the input and compare against a vector store of known harmful query patterns. If cosine similarity &gt; 0.85 to a known jailbreak template, flag it. For PII: use Presidio or a cloud service (Amazon Comprehend, Google Cloud DLP) to strip names, emails, and SSNs before they enter the model.</p><p><strong>Prompt Injection &amp; Jailbreak Detection</strong></p><p>Three approaches working together &#8212; classifier, delimiter isolation, secondary model scan.</p><p>Prompt injection embeds instructions designed to override your system prompt (&#8220;Ignore all previous instructions...&#8221;). Jailbreaks craft inputs that cause the model to violate guardrails via roleplay or hypotheticals. Three detection approaches: (1) Classifier trained on known patterns (NIST maintains a public taxonomy). (2) Clear input/system prompt separation &#8212; never concatenate user input directly into system instructions. (3) Secondary model scan: pass the input through a guard model such as LlamaGuard and score injection risk independently of the main model. Confidence &gt;0.80: block. 0.60&#8211;0.80: log and flag for human review. Below 0.60: pass with monitoring.</p><h1>Layer 02 &#8212; Moderation Pipeline</h1><p>Even benign inputs can produce harmful outputs, because the model hallucinates or extrapolates unexpectedly, or because the training data contained patterns the filters didn&#8217;t anticipate. The moderation pipeline is your last line of defence before the user sees anything.</p><p><strong>Stage 01 : Toxicity &amp; Harm Scoring</strong></p><p>Pass every output through a harm scoring service before returning it. Perspective API (Google): scores 6 harm categories, free, ~100ms latency. OpenAI Moderation: 11 categories, free with API, built for LLM outputs. Azure Content Safety: enterprise SLA, configurable thresholds, on-premises option. LlamaGuard (Meta): open-source, self-hostable, fine-tunable to your taxonomy. 
For high-stakes systems, use multiple providers in parallel &#8212; different models have different blind spots. Threshold logic: &gt;0.8 &#8594; block. 0.5&#8211;0.8 &#8594; transform or add disclaimer. &lt;0.5 &#8594; pass.</p><p><strong>Stage 02 : Factual Grounding Check</strong></p><p>For RAG systems: verify every factual claim can be traced to a retrieved source chunk. Use a small LLM (gpt-4o-mini, ~50 tokens) or a RAGAS faithfulness scorer. Claims not grounded &#8594; flag or add &#8220;could not verify&#8221; disclaimer. In regulated domains (medical, legal, financial): remove unverified claims, don&#8217;t just flag them. Track unverified claim rate over time &#8212; a rising rate signals retrieval degradation or model drift.</p><p><strong>Stage 03 : Policy &amp; Scope Compliance</strong></p><p>Generic harm classifiers catch standard categories. Policy compliance catches violations specific to your product: off-topic responses (a legal AI answering medical questions), competitor mentions, financial advice from a non-financial product, inappropriate relationship patterns in consumer apps. Build a lightweight classifier or LLM prompt to check: &#8220;Given the system instructions and this response, does the response violate any of the following policies?&#8221; Pass/fail with category. This covers the gap between &#8220;technically harmless&#8221; and &#8220;appropriate for your specific product.&#8221;</p><p><strong>Stage 04 : Output Transformation &#8212; Not Just Blocking</strong></p><p>Blocking every borderline output creates two problems: poor UX for edge cases, and an adversarial feedback loop where users learn what to avoid. Transformation options for borderline outputs (0.5&#8211;0.8 harm score, matching the Stage 01 thresholds): (1) Disclaimer insertion &#8212; prepend a caveat before health-adjacent responses. (2) Partial redaction &#8212; remove the harmful segment while preserving the rest. (3) Topic steering &#8212; replace harmful answer with redirect to appropriate resources. 
(4) Confidence hedging &#8212; add &#8220;I&#8217;m not certain about this &#8212; please verify with [authority].&#8221; Reserve hard blocks for outputs above 0.8 harm score.</p><h1>Layer 03 &#8212; Red-Teaming</h1><p>Red-teaming is structured adversarial testing &#8212; not one person trying random things for an afternoon. Define the threat model, enumerate attack categories, assign severity, test systematically, document findings, fix before launch. The output is a risk register, not a pass/fail.</p><p><strong>RED-TEAM ATTACK TAXONOMY</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lmei!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lmei!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 424w, https://substackcdn.com/image/fetch/$s_!Lmei!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 848w, https://substackcdn.com/image/fetch/$s_!Lmei!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 1272w, https://substackcdn.com/image/fetch/$s_!Lmei!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Lmei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png" width="1026" height="535" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:535,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:98580,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193500878?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lmei!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 424w, https://substackcdn.com/image/fetch/$s_!Lmei!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 848w, https://substackcdn.com/image/fetch/$s_!Lmei!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Lmei!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98185363-1023-4f68-92cb-c744813fb2dd_1026x535.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>THE FOUR-STEP RED-TEAM PROCESS</strong></p><p><strong>Define the Threat Model</strong></p><p>Who is the realistic adversary? What do they want? What can they do?</p><p>Before writing a single test case: Who might misuse this system? (curious users, bad actors, competitors, insiders.) What are their goals? 
(extract harmful content, extract training data, bypass billing.) What access do they have? (API user, web interface, authenticated or not.) The threat model determines which attack categories are high priority. A consumer chatbot has different priorities than an enterprise code assistant.</p><p><strong>Automated Red-Teaming</strong></p><p>Use an LLM to generate attack variations at scale &#8212; 500+ cases, not 50.</p><p>Prompt a capable LLM (GPT-4, Claude) with your attack taxonomy and generate 50&#8211;100 variations per category. Include direct attacks, indirect attacks, context manipulation, gradual escalation, and novel framings. Tools: Garak (open-source LLM vulnerability scanner), PromptBench, or custom pipelines. Run the suite in CI on every model change: a safety regression counts as a regression like any other. Track pass rates per category over time. Any decline triggers human review before the change ships.</p><p><strong>Human Red-Team Sessions</strong></p><p>Domain experts + adversarial mindset + novel attack angles automation misses.</p><p>Run a structured human session before every major launch. Composition: 3&#8211;5 people mixing domain experts, security researchers, and product users. Duration: 2&#8211;4 hours. Structure: 30-min briefing on the system and known vulnerabilities, then structured exploration by category. Output: written report with findings, severity, reproduction steps, and mitigations. Every HIGH severity finding must be addressed before launch. MEDIUM findings: mitigated, or accepted as risk with documentation.</p><p><strong>Continuous Monitoring + Risk Register</strong></p><p>Red-teaming is an ongoing practice. Production surfaces new attacks daily.</p><p>Every red-team finding enters a living risk register: description, severity, status (open/mitigated/accepted), owner, last reviewed. Feedback loops: (1) Hourly safety incident monitoring &#8212; any spike in blocked inputs triggers an alert. 
(2) Weekly human review of blocked content samples &#8212; look for emerging attack vectors. (3) Monthly expansion of adversarial test set from production logs. (4) Quarterly full red-team refresh for any system with significant user base.</p><h1>Layer 04 &#8212; Canary Releases</h1><p>AI systems behave differently at scale. A model that passes your test set and red-teaming will still encounter edge cases in production that no evaluation anticipated &#8212; because real users have more creativity and intent diversity than any test set. Ship to 1% of users first. Measure for 24&#8211;48h. Expand only if all metrics hold.</p><p><strong>5-STAGE CANARY ROLLOUT</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yBCc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yBCc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 424w, https://substackcdn.com/image/fetch/$s_!yBCc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 848w, https://substackcdn.com/image/fetch/$s_!yBCc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yBCc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yBCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png" width="1030" height="451" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:451,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68777,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193500878?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yBCc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 424w, https://substackcdn.com/image/fetch/$s_!yBCc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 848w, 
https://substackcdn.com/image/fetch/$s_!yBCc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 1272w, https://substackcdn.com/image/fetch/$s_!yBCc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9ea897a5-318d-4fdf-aa18-209b06e96fea_1030x451.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>CANARY METRICS &#8212; WHAT TO MEASURE AT EACH STAGE</strong></p><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4u_l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4u_l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 424w, https://substackcdn.com/image/fetch/$s_!4u_l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 848w, https://substackcdn.com/image/fetch/$s_!4u_l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 1272w, https://substackcdn.com/image/fetch/$s_!4u_l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4u_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png" width="1030" height="378" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:1030,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:63552,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193500878?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4u_l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 424w, https://substackcdn.com/image/fetch/$s_!4u_l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 848w, https://substackcdn.com/image/fetch/$s_!4u_l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 1272w, https://substackcdn.com/image/fetch/$s_!4u_l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4ce44c99-fd9d-4e29-9632-4e90d9ae15dc_1030x378.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><blockquote><p><strong>Automated Rollback &#8212; Define Before Launch, Not During the Incident</strong></p><p><strong>Rollback decisions made by humans under pressure at 2am are often wrong. Automate the trigger against pre-defined criteria: if safety incident rate &gt; threshold OR blocked rate &gt; 2&#215; baseline OR error rate &gt; 1% for 10 consecutive minutes &#8594; automatically roll back to the previous version and page on call. The human&#8217;s job is to decide whether to re-enable, not whether to roll back. Every automated rollback generates a report: timestamp, trigger metric, values, duration. 
That report feeds the post-incident review.</strong></p></blockquote><h1>Dos &amp; Don&#8217;ts</h1><p><strong>&#9989; What Actually Works</strong></p><p>&#9989; <strong>Layer safety defences &#8212; input filter + output moderation + monitoring.</strong> No single layer catches everything. Defence in depth is resilience, not redundancy.</p><p>&#9989; <strong>Use semantic safety filters alongside keyword lists.</strong> Keywords catch exact text. Semantic classifiers catch intent across rephrasing. You need both.</p><p>&#9989; <strong>Red-team before every major launch, not just once.</strong> Models drift. User populations change. Attack patterns evolve. Quarterly minimum for production systems.</p><p>&#9989; <strong>Define canary rollback criteria before the launch begins.</strong> Write the thresholds, automate the trigger, and remove the human from the rollback decision path.</p><p>&#9989; <strong>Measure false positive rate on safety filters weekly.</strong> Over-blocking legitimate use cases destroys trust and pushes users toward workarounds.</p><p>&#9989; <strong>Treat every production safety incident as an automated test case.</strong> Novel attack patterns from production go into your red-team suite within 24 hours.</p><p><strong>&#10060; What Creates False Confidence</strong></p><p>&#10060; <strong>Don&#8217;t rely on keyword blocklists as your primary safety mechanism.</strong> They catch exact text and miss every rephrasing. Semantic classifiers are primary; keywords are supplementary.</p><p>&#10060; <strong>Don&#8217;t launch AI features to 100% of users at once.</strong> A safety incident at 1% for 24h is recoverable. 
The same incident at 100% for 4h is not.</p><p>&#10060; <strong>Don&#8217;t treat red-teaming as a one-time launch gate.</strong> Production surfaces attack patterns no pre-launch red-team anticipates.</p><p>&#10060; <strong>Don&#8217;t set safety filter thresholds once and forget them.</strong> Review and recalibrate monthly &#8212; model updates and new attack patterns shift optimal thresholds.</p><p>&#10060; <strong>Don&#8217;t skip the automated rollback trigger.</strong> Humans under pressure make worse rollback decisions than pre-programmed thresholds.</p><p>&#10060; <strong>Don&#8217;t confuse a low safety incident rate with safety.</strong> A low blocked rate can mean filters are working &#8212; or they&#8217;re missing things. Check both with weekly human review.</p><h1>Quick Reference - Safety Pre-Launch Checklist</h1><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9TLQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9TLQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 424w, https://substackcdn.com/image/fetch/$s_!9TLQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 848w, https://substackcdn.com/image/fetch/$s_!9TLQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 1272w, 
https://substackcdn.com/image/fetch/$s_!9TLQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9TLQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png" width="1026" height="384" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:384,&quot;width&quot;:1026,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76843,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193500878?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9TLQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 424w, https://substackcdn.com/image/fetch/$s_!9TLQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 848w, 
https://substackcdn.com/image/fetch/$s_!9TLQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 1272w, https://substackcdn.com/image/fetch/$s_!9TLQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a5b434d-a721-4e95-ac13-3c1daa0b0c78_1026x384.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p>&#8220;Safety is not the absence of incidents &#8212; it&#8217;s the presence of systems that contain them. 
The teams that ship AI reliably aren&#8217;t the ones that never have problems. They&#8217;re the ones that catch problems at 1% traffic instead of 100%, at the filter instead of the newspaper, and at the red-team session instead of the user complaint.&#8221;</p><p>&#8212; Production AI safety retrospective, enterprise LLM platform</p></blockquote><p>#AISafety #ContentModeration #RedTeaming #CanaryRelease #AIGuardrails #ResponsibleAI #LLMSafety #ProductionAI #AIGovernance #MachineLearning #MLOps #TechLeadership #AIEngineering #EnterpriseAI</p><p> <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;3b9f0b46-1247-4cfc-833d-a76eb158ef68&quot;}" data-component-name="MentionToDOM"></span> </p>]]></content:encoded></item>
<item><title><![CDATA[SERIES · POST 5 OF 5: A/B Testing for AI Systems]]></title><description><![CDATA[Ship with confidence, not luck - the statistical framework every ML team needs to get right]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/series-post-5-of-5-ab-testing-for</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/series-post-5-of-5-ab-testing-for</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Mon, 06 Apr 2026 10:03:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!oPok!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>Your model scores well in offline evaluation. Your human evaluators prefer it. Everything looks green. Now the real question: does it actually perform better for real users in production? That&#8217;s what A/B testing answers - if you do it right.</em></p><p>- The last mile of AI evaluation</p></blockquote><h4><strong>The Final Layer: From Offline to Online Evaluation</strong></h4><p>Posts 1 through 4 built your offline evaluation system - automated metrics, human evaluation frameworks, annotation pipelines. Together, these give you high confidence that your model is technically better before you ship it.</p><p>But offline evaluation has a fundamental limitation: it&#8217;s offline. 
The model hasn&#8217;t seen real users, real queries, real contexts. The distribution shift between your evaluation set and production traffic is often significant. The behavior changes are often subtle. And the business impact - which is ultimately what you care about - can only be measured with real users.</p><p><strong>A/B testing is the bridge </strong>between &#8216;this model is technically better&#8217; and &#8216;this model creates more value in the world.&#8217; It&#8217;s also where the most expensive mistakes happen - because the stakes are higher and the statistical traps are real.</p><h4><strong>The Full A/B Testing Pipeline for AI</strong></h4><p><strong>A/B Testing - Full Decision Flow for AI Systems</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oPok!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!oPok!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 424w, https://substackcdn.com/image/fetch/$s_!oPok!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 848w, https://substackcdn.com/image/fetch/$s_!oPok!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 1272w, https://substackcdn.com/image/fetch/$s_!oPok!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oPok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png" width="1138" height="376" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:376,&quot;width&quot;:1138,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58322,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!oPok!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 424w, https://substackcdn.com/image/fetch/$s_!oPok!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 848w, https://substackcdn.com/image/fetch/$s_!oPok!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 1272w, https://substackcdn.com/image/fetch/$s_!oPok!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F65573cfa-f6f8-469e-89a3-dac6b515bd83_1138x376.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><em>Figure 9: A/B Testing flow for AI systems. Note the power analysis step before data collection - most teams skip this and end up with inconclusive or misleading results.</em></p><p>Notice what comes before the traffic split: power analysis. Skipping it is the single most common reason A/B tests produce inconclusive or misleading results. If you don&#8217;t know how much data you need before you start, you&#8217;re flying blind.</p><h4><strong>The Statistical Foundation - No Shortcuts</strong></h4><p><strong>Null hypothesis (H&#8320;): </strong>Models A and B perform equally on metric M. This is what you&#8217;re trying to disprove.</p><p><strong>Alternative hypothesis (H&#8321;): </strong>Model B outperforms Model A on metric M. This is what you&#8217;re testing for.</p><p><strong>p-value: </strong>The probability of observing your result (or something more extreme) if H&#8320; is true. If p &lt; 0.05, you reject H&#8320;. That is the standard threshold, though some teams use 0.01 for high-stakes decisions.</p><p><strong>Effect size: </strong>How big is the difference? Statistical significance does NOT equal practical significance. A p-value of 0.001 on a 0.01% improvement is statistically significant and practically irrelevant. Always report both.</p><p><strong>Power (1 &#8722; &#946;): </strong>The probability that your test will detect a real effect if one exists. Aim for &#8805; 0.80. 
A test with 50% power is a coin flip - you&#8217;ll miss real improvements half the time.</p><h4><strong>Sample Size: The Most Ignored Calculation</strong></h4><p>This is the calculation that separates teams who run rigorous A/B tests from teams who generate statistically invalid results that they act on anyway.</p><p>Run a power analysis before you collect a single data point. Here&#8217;s what you need:</p><p><strong>Sample Size - What Drives the Calculation</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!deJk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!deJk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 424w, https://substackcdn.com/image/fetch/$s_!deJk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 848w, https://substackcdn.com/image/fetch/$s_!deJk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 1272w, https://substackcdn.com/image/fetch/$s_!deJk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!deJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png" width="1125" height="511" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:511,&quot;width&quot;:1125,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67051,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!deJk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 424w, https://substackcdn.com/image/fetch/$s_!deJk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 848w, https://substackcdn.com/image/fetch/$s_!deJk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 1272w, 
https://substackcdn.com/image/fetch/$s_!deJk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3ca65ea-e8dd-4636-86bb-cd029d621a84_1125x511.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 10: Sample size inputs and consequences. 
The most common mistake: running a test without this analysis, then declaring a null result when the test was simply underpowered.</em></p><p><strong>The uncomfortable rule: </strong>detecting a 3% relative improvement (e.g., from 72% to 74.2%) with 80% power at &#945;=0.05 takes more users than most teams expect: roughly 6,400 per variant for this example under a standard two-sided two-proportion test, and tens of thousands or more when the baseline rate is lower or the effect is smaller. If you can&#8217;t get there, you need to either accept that you can only detect larger effects, or use Bayesian methods.</p><p>&#9888;&#65039; <strong>AI-Specific A/B Testing Challenges</strong></p><p>Standard A/B testing frameworks were built for web products - button colors, CTA copy, page layout. AI model evaluation has unique challenges that these frameworks don&#8217;t handle well:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!98za!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!98za!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 424w, https://substackcdn.com/image/fetch/$s_!98za!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 848w, https://substackcdn.com/image/fetch/$s_!98za!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 1272w, 
https://substackcdn.com/image/fetch/$s_!98za!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!98za!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png" width="1126" height="859" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:859,&quot;width&quot;:1126,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:120125,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!98za!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 424w, https://substackcdn.com/image/fetch/$s_!98za!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 848w, 
https://substackcdn.com/image/fetch/$s_!98za!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 1272w, https://substackcdn.com/image/fetch/$s_!98za!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3bd2bd5-3ab1-45b6-94b5-c689cb2d2f62_1126x859.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>The Peeking Problem - The Most Common Statistical Error</h4><div class="captioned-image-container"><figure><a class="image-link 
image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!XmKo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!XmKo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 424w, https://substackcdn.com/image/fetch/$s_!XmKo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 848w, https://substackcdn.com/image/fetch/$s_!XmKo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 1272w, https://substackcdn.com/image/fetch/$s_!XmKo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!XmKo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png" width="1135" height="832" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:832,&quot;width&quot;:1135,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:100132,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!XmKo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 424w, https://substackcdn.com/image/fetch/$s_!XmKo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 848w, https://substackcdn.com/image/fetch/$s_!XmKo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 1272w, https://substackcdn.com/image/fetch/$s_!XmKo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F01020b61-1e75-41cf-829c-fb074f1b062a_1135x832.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h4>Dos and Don'ts: A/B Testing for AI</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ghZD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ghZD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 424w, 
https://substackcdn.com/image/fetch/$s_!ghZD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 848w, https://substackcdn.com/image/fetch/$s_!ghZD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 1272w, https://substackcdn.com/image/fetch/$s_!ghZD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ghZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png" width="1132" height="883" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:883,&quot;width&quot;:1132,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:121366,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!ghZD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 424w, https://substackcdn.com/image/fetch/$s_!ghZD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 848w, https://substackcdn.com/image/fetch/$s_!ghZD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 1272w, https://substackcdn.com/image/fetch/$s_!ghZD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F60d2b08c-8949-4af9-814f-d93dda7a9f57_1132x883.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h4><strong>Interpreting Results: Beyond the p-Value</strong></h4><p>You&#8217;ve run the test, hit your sample size, and calculated a p-value. Now what? A lot of teams stop at p &lt; 0.05 and ship. That&#8217;s not enough.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EJlC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EJlC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 424w, https://substackcdn.com/image/fetch/$s_!EJlC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 848w, https://substackcdn.com/image/fetch/$s_!EJlC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 1272w, https://substackcdn.com/image/fetch/$s_!EJlC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!EJlC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png" width="1125" height="753" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a0691aab-591a-4b80-8e9a-fde655598071_1125x753.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1125,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:89485,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EJlC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 424w, https://substackcdn.com/image/fetch/$s_!EJlC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 848w, https://substackcdn.com/image/fetch/$s_!EJlC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 1272w, 
https://substackcdn.com/image/fetch/$s_!EJlC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa0691aab-591a-4b80-8e9a-fde655598071_1125x753.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>Putting It All Together: Your Evaluation Maturity Checklist</strong></h4><p>If you&#8217;ve read all five posts, you now have the complete evaluation framework. 
Here&#8217;s the consolidation:</p><p><strong>Metrics Quick Reference - The Full Picture</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-LHU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-LHU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 424w, https://substackcdn.com/image/fetch/$s_!-LHU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 848w, https://substackcdn.com/image/fetch/$s_!-LHU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 1272w, https://substackcdn.com/image/fetch/$s_!-LHU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-LHU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png" width="1036" height="910" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1036,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:107113,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!-LHU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 424w, https://substackcdn.com/image/fetch/$s_!-LHU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 848w, https://substackcdn.com/image/fetch/$s_!-LHU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 1272w, https://substackcdn.com/image/fetch/$s_!-LHU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96a9d545-9f58-412a-b559-428a40b8c825_1036x910.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Figure 11: Complete metrics reference. Trust level reflects how well the metric correlates with real user satisfaction. 
Note: a metric that is highly scalable but low-trust should always be paired with a higher-trust metric.</em></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dD9Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dD9Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 424w, https://substackcdn.com/image/fetch/$s_!dD9Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 848w, https://substackcdn.com/image/fetch/$s_!dD9Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 1272w, https://substackcdn.com/image/fetch/$s_!dD9Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dD9Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png" width="1032" height="553" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:553,&quot;width&quot;:1032,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64936,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193305870?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dD9Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 424w, https://substackcdn.com/image/fetch/$s_!dD9Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 848w, https://substackcdn.com/image/fetch/$s_!dD9Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 1272w, https://substackcdn.com/image/fetch/$s_!dD9Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb37d669e-b37b-48bd-86a9-dc145da4a015_1032x553.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p><em>The goal is never a perfect score on a benchmark. The goal is an AI system that reliably does something valuable for real people. 
Evaluation is just how you measure whether you got there.</em></p><p>- The final word on AI evaluation</p></blockquote><p>#ABTesting #StatisticalSignificance #MLOps #ProductionML #AIEvaluation #MachineLearning #TechLeadership #Statistics #ExperimentDesign #DataScience</p><p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;e5c81794-c345-494d-b6a3-accf22336e62&quot;}" data-component-name="MentionToDOM"></span> </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[SERIES · POST 4 OF 5: Human Evaluation Frameworks]]></title><description><![CDATA[The signal machines can't give you &#8212; and how to collect it at scale without losing your mind]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/series-post-4-of-5-human-evaluation</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/series-post-4-of-5-human-evaluation</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Mon, 06 Apr 2026 00:42:54 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ga3r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>Automated metrics tell you if your model looks like the reference. Human evaluation tells you if your model is actually good. For many tasks - especially safety, helpfulness, and open-ended generation - there is no substitute.</em></p><p>- The role of human judgment in AI evaluation</p></blockquote><h4><strong>Why Automated Metrics Aren&#8217;t Enough</strong></h4><p>Posts 2 and 3 covered the automated metrics layer - Precision, Recall, BLEU, ROUGE, BERTScore. These are indispensable. 
They scale, they&#8217;re fast, they&#8217;re objective, and they give you a consistent signal across experiments.</p><p>But they all share one fundamental limitation: they measure what the model produces relative to a reference or a predefined schema. They cannot tell you whether the output is actually useful, clear, honest, or safe for a real human being.</p><p><strong>Human evaluation fills that gap. </strong>It&#8217;s the layer where your evaluation system stops asking &#8216;does this look like the reference?&#8217; and starts asking &#8216;would a real person find this helpful, accurate, and appropriate?&#8217;</p><p>Most teams skip it or do it badly not because they don&#8217;t believe in it, but because human evaluation done carelessly is expensive, slow, noisy, and prone to bias. 
This post is about doing it right.</p><h4><strong>The Human Evaluation Pipeline</strong></h4><p><strong>&#129489;&#8205;&#9878;&#65039;Human Evaluation Framework - End-to-End Workflow</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ga3r!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ga3r!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 424w, https://substackcdn.com/image/fetch/$s_!ga3r!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 848w, https://substackcdn.com/image/fetch/$s_!ga3r!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 1272w, https://substackcdn.com/image/fetch/$s_!ga3r!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ga3r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png" width="946" height="484" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:484,&quot;width&quot;:946,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:49732,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193303829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ga3r!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 424w, https://substackcdn.com/image/fetch/$s_!ga3r!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 848w, https://substackcdn.com/image/fetch/$s_!ga3r!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 1272w, https://substackcdn.com/image/fetch/$s_!ga3r!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7bce84e3-a7dd-4ed4-a73a-0564000c378e_946x484.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Figure 7: Human evaluation pipeline. The IAA (Inter-Annotator Agreement) gate is critical - without it, you can&#8217;t know if your scores are meaningful.</em></p><p>The pipeline has three stages before you collect a single annotation. This preparation phase is where most teams fail - they skip straight to rating and wonder why their results are noisy and inconsistent.</p><h4><strong>Framework 1: Likert Scale Rating</strong></h4><p>The most common human evaluation format. Annotators rate each output on a 1&#8211;5 or 1&#8211;7 scale on one or more quality dimensions. 
The key word is one: ask about a single dimension per question, because rating several dimensions at once degrades agreement as annotators start making implicit tradeoffs between them.</p><p><strong>Likert Scale Dimensions - What to Rate and How</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cVuJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cVuJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 424w, https://substackcdn.com/image/fetch/$s_!cVuJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 848w, https://substackcdn.com/image/fetch/$s_!cVuJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 1272w, https://substackcdn.com/image/fetch/$s_!cVuJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cVuJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png" width="940" height="540" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:540,&quot;width&quot;:940,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68752,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193303829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cVuJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 424w, https://substackcdn.com/image/fetch/$s_!cVuJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 848w, https://substackcdn.com/image/fetch/$s_!cVuJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 1272w, https://substackcdn.com/image/fetch/$s_!cVuJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe14c4083-68f5-4aa3-90fe-541e1efa3547_940x540.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Figure 8: Likert scale dimensions for human evaluation. Rate one dimension at a time - mixing them in a single question degrades annotation quality significantly.</em></p><p><strong>The anchor problem: </strong>&#8216;Rate quality on a scale of 1 to 5&#8217; is meaningless without anchors. What does a 3 look like? What separates a 2 from a 3? Without concrete, domain-specific examples anchoring each score, you&#8217;ll get different distributions from every annotator - and when you average them, you get noise.</p><p><strong>The solution: </strong>write detailed rubrics. Include 2-3 annotated examples for each score level. Run calibration sessions where all annotators rate the same examples together and discuss disagreements before the main annotation round begins. 
The time investment pays for itself immediately in data quality.</p><h4><strong>Framework 2: Pairwise Comparison</strong></h4><p>Instead of rating a single output, annotators compare two outputs (A vs B) and indicate which is better on a specific dimension - or whether they&#8217;re equivalent.</p><p>Pairwise is often more reliable than Likert for one simple reason: humans are better at relative judgments than absolute ones. It&#8217;s cognitively easier and more consistent to say &#8216;this summary is more accurate than that one&#8217; than to decide whether it&#8217;s a 3 or a 4 on a scale whose anchors you&#8217;re still calibrating to.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!idz9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!idz9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 424w, https://substackcdn.com/image/fetch/$s_!idz9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 848w, https://substackcdn.com/image/fetch/$s_!idz9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 1272w, https://substackcdn.com/image/fetch/$s_!idz9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!idz9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png" width="943" height="520" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4655f12f-90bf-4364-a5ce-27709a250120_943x520.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:520,&quot;width&quot;:943,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60970,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193303829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!idz9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 424w, https://substackcdn.com/image/fetch/$s_!idz9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 848w, https://substackcdn.com/image/fetch/$s_!idz9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 1272w, 
https://substackcdn.com/image/fetch/$s_!idz9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4655f12f-90bf-4364-a5ce-27709a250120_943x520.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Elo-style ranking at scale: </strong>If you need to rank multiple models against each other, run pairwise comparisons at scale and apply an Elo algorithm (the same system used for chess ratings) to produce a ranked leaderboard. This is exactly what Chatbot Arena does with thousands of human evaluators. 
The key requirement: each model must be compared against every other model enough times to establish statistical confidence in the ranking.</p><h4><strong>Framework 3: Annotation Task Design - The Foundation</strong></h4><p>The quality of your human evaluation lives and dies by how well you design the annotation task. A brilliant pool of annotators with a poorly designed task produces garbage. An average pool with a well-designed task produces signal.</p><p>1. Write a detailed annotation guide. It should include: the task definition, all quality dimensions with precise definitions, concrete examples at each score level (or for each pairwise preference outcome), common edge cases with guidance, and what to do when uncertain.</p><p>2. Run a calibration pilot. Take 2-3 annotators, have them annotate the same 50-100 examples independently, then review disagreements together. Update the guide based on what you learn. Repeat until agreement is stable.</p><p>3. Establish a gold standard. Identify 20-50 examples where the correct annotation is unambiguous. Use these to onboard new annotators and to monitor for drift over time.</p><p>4. Gate annotators against the gold standard. Before a new annotator joins the main annotation pool, they must agree with the gold standard above a threshold (typically 80%). Below that threshold, they review the guide together with a supervisor.</p><p>5. Monitor inter-annotator agreement continuously. For every batch, calculate Cohen&#8217;s &#954; (for categorical ratings from two annotators) or Krippendorff&#8217;s &#945; (for ordinal or more complex schemas, more than two annotators, or missing ratings). Flag any rater whose agreement drops below your threshold - they may be drifting, fatigued, or working on ambiguous examples.</p><p>6. Collect metadata. Every annotation should include: annotator ID, timestamp, confidence rating (optional but valuable), and a free-text note on any uncertainty. This metadata is essential for auditing disagreements and improving the guide.</p><p>7. Reserve a blind hold-out. 
Never use annotation insights from your first batch to update your model and then re-evaluate on the same annotators. They&#8217;ve been exposed to your system&#8217;s failure modes. Keep a fresh hold-out batch for final evaluation.</p><h4>Dos and Don'ts: Human Evaluation</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!IY7b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!IY7b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 424w, https://substackcdn.com/image/fetch/$s_!IY7b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 848w, https://substackcdn.com/image/fetch/$s_!IY7b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 1272w, https://substackcdn.com/image/fetch/$s_!IY7b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!IY7b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png" width="949" height="774" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:774,&quot;width&quot;:949,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102994,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193303829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!IY7b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 424w, https://substackcdn.com/image/fetch/$s_!IY7b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 848w, https://substackcdn.com/image/fetch/$s_!IY7b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 1272w, https://substackcdn.com/image/fetch/$s_!IY7b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffb5348a0-def4-43d6-adb9-8ed7e81f760c_949x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!-bAQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!-bAQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 424w, 
https://substackcdn.com/image/fetch/$s_!-bAQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 848w, https://substackcdn.com/image/fetch/$s_!-bAQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 1272w, https://substackcdn.com/image/fetch/$s_!-bAQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!-bAQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png" width="943" height="508" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:508,&quot;width&quot;:943,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59740,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193303829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!-bAQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 424w, https://substackcdn.com/image/fetch/$s_!-bAQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 848w, https://substackcdn.com/image/fetch/$s_!-bAQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 1272w, https://substackcdn.com/image/fetch/$s_!-bAQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7b6742ac-220f-4355-a1a6-a102c5d96d2a_943x508.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lo50!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lo50!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 424w, https://substackcdn.com/image/fetch/$s_!lo50!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 848w, https://substackcdn.com/image/fetch/$s_!lo50!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 1272w, https://substackcdn.com/image/fetch/$s_!lo50!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lo50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png" width="936" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:936,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:55084,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193303829?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lo50!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 424w, https://substackcdn.com/image/fetch/$s_!lo50!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 848w, https://substackcdn.com/image/fetch/$s_!lo50!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 1272w, https://substackcdn.com/image/fetch/$s_!lo50!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F405c262a-fd99-49a1-8b9f-f25440fcc75e_936x460.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>#HumanEval #AIQuality #AnnotationFramework #NLP #LLM #MachineLearning #AIEvaluation #MLOps #DataAnnotation #ModelEvaluation</p>
<p> <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;8f733f95-8dc6-4619-9544-416d3dd1ea31&quot;}" data-component-name="MentionToDOM"></span> </p><p><strong>&#8594; Coming Next: </strong>Post 5: A/B Testing - proving your model works in production</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[SERIES · POST 3 OF 5 : BLEU & ROUGE]]></title><description><![CDATA[The text generation metrics everyone uses - and almost everyone misreads]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/series-post-3-of-5-bleu-and-rouge</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/series-post-3-of-5-bleu-and-rouge</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Fri, 03 Apr 2026 10:02:34 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aVOg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>A model can produce a fluent, perfectly accurate translation and score near zero on BLEU. Another model can produce a confident, completely wrong translation and score quite well. If you&#8217;re using BLEU as your primary quality signal, you need to read this carefully.</em></p><p>- The uncomfortable truth about NLP&#8217;s most popular metric</p></blockquote><h4><strong>The Context Switch: From Classification to Generation</strong></h4><p>Posts 1 and 2 established the evaluation pipeline and deep-dived on classification metrics. Now we&#8217;re moving up the stack - into text generation evaluation, where the problem fundamentally changes.</p><p>In classification, there&#8217;s a ground truth label: the answer is right or wrong. 
In generation, there are infinitely many ways to be right - and almost as many ways to look right while being wrong. This makes evaluation harder, more ambiguous, and more important to get right.</p><p><strong>The core challenge: </strong>how do you measure whether a generated text is &#8216;good&#8217; when there&#8217;s no single correct answer? BLEU and ROUGE are the most widely deployed answers to this question. They&#8217;re not perfect answers - but they&#8217;re useful ones, if you understand what they actually measure.</p><h4><strong>BLEU: Precision for Generated Text</strong></h4><p>BLEU (Bilingual Evaluation Understudy) was developed at IBM in 2002 specifically for machine translation. Its insight: a good translation shares most of its words and phrases with a human reference translation. More overlap = better quality. 
Roughly.</p><p>Let&#8217;s walk through exactly how it&#8217;s calculated:</p><h4><strong>BLEU Score - Step-by-Step Calculation Flow</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_z7E!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_z7E!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 424w, https://substackcdn.com/image/fetch/$s_!_z7E!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 848w, https://substackcdn.com/image/fetch/$s_!_z7E!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 1272w, https://substackcdn.com/image/fetch/$s_!_z7E!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_z7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png" width="1327" height="232" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:232,&quot;width&quot;:1327,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57509,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_z7E!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 424w, https://substackcdn.com/image/fetch/$s_!_z7E!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 848w, https://substackcdn.com/image/fetch/$s_!_z7E!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 1272w, https://substackcdn.com/image/fetch/$s_!_z7E!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc42e12d6-d2c0-4042-819c-efe527b95905_1327x232.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Figure 5: BLEU calculation pipeline. 
The brevity penalty (BP) is critical - without it, generating a single word that appears in the reference gives a perfect unigram precision score.</em></p><p><strong>Step 1 - Tokenize. </strong>Split both the hypothesis (generated text) and the reference into individual tokens (usually words, sometimes sub-words).</p><p><strong>Step 2 - N-gram match. </strong>For each n from 1 to 4, count how many n-grams in the hypothesis (single words, word pairs, triples, and 4-word sequences) also appear in the reference, then divide by the total number of n-grams in the hypothesis to get the precision for that n. BLEU-4, which combines all four orders, is the most commonly reported variant in MT research.</p><p><strong>Step 3 - Clipped count. </strong>This is the important part most explanations skip. BLEU clips each n-gram&#8217;s count to the number of times it appears in the reference (with several references, the maximum count in any single one). This prevents gaming: if you repeat &#8216;the&#8217; fifty times, the count is clipped to however many times &#8216;the&#8217; appears in the reference.</p><p><strong>Step 4 - Brevity Penalty (BP). </strong>If the hypothesis is shorter than the reference, apply a penalty: with hypothesis length c and reference length r, BP = 1 when c &#8805; r and BP = exp(1 - r/c) when c &lt; r. Without this, a model could emit only the few words it is most confident about and still score near-perfect precision.</p><p>The final BLEU score is the geometric mean of the n-gram precisions (equally weighted across n = 1 to 4 for BLEU-4), multiplied by the BP. 
It ranges from 0 to 1 (or 0 to 100 if reported as a percentage).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aVOg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aVOg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 424w, https://substackcdn.com/image/fetch/$s_!aVOg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 848w, https://substackcdn.com/image/fetch/$s_!aVOg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 1272w, https://substackcdn.com/image/fetch/$s_!aVOg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aVOg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png" width="1324" height="649" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:649,&quot;width&quot;:1324,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aVOg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 424w, https://substackcdn.com/image/fetch/$s_!aVOg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 848w, https://substackcdn.com/image/fetch/$s_!aVOg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 1272w, https://substackcdn.com/image/fetch/$s_!aVOg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1eef42bc-4cc3-49fe-a474-47b78e486d2e_1324x649.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4><strong>ROUGE: Recall for Summaries</strong></h4><p>ROUGE (Recall-Oriented Understudy for Gisting Evaluation) was developed for automatic summarization evaluation. 
Where BLEU asks &#8216;how much of the output matches the reference?&#8217;, ROUGE asks &#8216;how much of the reference does the output cover?&#8217;</p><p>This recall orientation makes ROUGE more natural for summarization, where the goal is coverage - you want to know whether the important information, as captured by the reference summary, made it into the generated summary.</p><p><strong>ROUGE Variants - Which One to Use and When</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ey6u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ey6u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 424w, https://substackcdn.com/image/fetch/$s_!ey6u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 848w, https://substackcdn.com/image/fetch/$s_!ey6u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 1272w, https://substackcdn.com/image/fetch/$s_!ey6u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ey6u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png" width="1323" height="762" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:762,&quot;width&quot;:1323,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:102194,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ey6u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 424w, https://substackcdn.com/image/fetch/$s_!ey6u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 848w, https://substackcdn.com/image/fetch/$s_!ey6u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ey6u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1aac02e5-1480-47f9-a0b8-d83babe062ab_1323x762.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 6: ROUGE and BERTScore variants. ROUGE-L is the most widely used single variant. Add BERTScore when semantic similarity matters more than surface overlap.</em></p><p><strong>Practical ROUGE guidance: </strong>ROUGE-1 gives you breadth of vocabulary coverage. ROUGE-2 adds some phrase structure sensitivity. 
ROUGE-L is the most principled variant - it finds the longest common subsequence, which roughly captures how much the generated text &#8216;flows like&#8217; the reference. For most summarization evaluation tasks, start with ROUGE-L.</p><h4><strong>The Dirty Secret About Both Metrics</strong></h4><p>Here&#8217;s what the leaderboards don&#8217;t tell you: BLEU and ROUGE measure surface overlap, not meaning. This distinction matters enormously in practice.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yzgh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yzgh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 424w, https://substackcdn.com/image/fetch/$s_!yzgh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 848w, https://substackcdn.com/image/fetch/$s_!yzgh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 1272w, https://substackcdn.com/image/fetch/$s_!yzgh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!yzgh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png" width="1324" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1324,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91690,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yzgh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 424w, https://substackcdn.com/image/fetch/$s_!yzgh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 848w, https://substackcdn.com/image/fetch/$s_!yzgh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 1272w, 
https://substackcdn.com/image/fetch/$s_!yzgh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F553691c2-c6fd-46b1-a1de-59fbb4d4125c_1324x769.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>BLEU vs ROUGE vs BERTScore: When to Use What</strong></h4><p>The evolution of text evaluation metrics is essentially a progression toward capturing meaning rather than surface form. 
Here&#8217;s how to think about when to use each:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZwNu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZwNu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 424w, https://substackcdn.com/image/fetch/$s_!ZwNu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 848w, https://substackcdn.com/image/fetch/$s_!ZwNu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwNu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZwNu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png" width="1330" height="979" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:979,&quot;width&quot;:1330,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:122755,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZwNu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 424w, https://substackcdn.com/image/fetch/$s_!ZwNu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 848w, https://substackcdn.com/image/fetch/$s_!ZwNu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 1272w, https://substackcdn.com/image/fetch/$s_!ZwNu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe83fa532-f7f7-4f24-b544-12d0c65ad1ce_1330x979.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h4>Dos and Don'ts: BLEU &amp; ROUGE</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DV_k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DV_k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 424w, 
https://substackcdn.com/image/fetch/$s_!DV_k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 848w, https://substackcdn.com/image/fetch/$s_!DV_k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 1272w, https://substackcdn.com/image/fetch/$s_!DV_k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DV_k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png" width="1138" height="898" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:898,&quot;width&quot;:1138,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:129671,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!DV_k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 424w, https://substackcdn.com/image/fetch/$s_!DV_k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 848w, https://substackcdn.com/image/fetch/$s_!DV_k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 1272w, https://substackcdn.com/image/fetch/$s_!DV_k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8defa1c-d08d-49e0-a4d5-5be908d666ac_1138x898.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZCu3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZCu3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 424w, https://substackcdn.com/image/fetch/$s_!ZCu3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 848w, https://substackcdn.com/image/fetch/$s_!ZCu3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCu3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZCu3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png" width="1228" height="588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:588,&quot;width&quot;:1228,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/193029966?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZCu3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 424w, https://substackcdn.com/image/fetch/$s_!ZCu3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 848w, https://substackcdn.com/image/fetch/$s_!ZCu3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 1272w, https://substackcdn.com/image/fetch/$s_!ZCu3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F433a05f7-d9db-40d6-87d1-6af4badb51f5_1228x588.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>#BLEU #ROUGE #NLP #LLM #TextGeneration #AIEvaluation #MachineLearning #BERTScore #ModelEvaluation #DeepLearning</p><p><strong>&#8594; Coming Next: </strong>Post 4: Human Evaluation Frameworks - the signal machines can't give you</p>
]]></content:encoded></item><item><title><![CDATA[SERIES · POST 2 OF 5 Precision, Recall & F1 ]]></title><description><![CDATA[The foundation every ML engineer must master - with the tradeoff that trips up even senior teams]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/series-post-2-of-5-precision-recall</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/series-post-2-of-5-precision-recall</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Thu, 02 Apr 2026 10:03:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1VsS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>If you only understand two metrics deeply for your entire ML career, make them Precision and Recall. Every other metric you&#8217;ll ever use either builds from them or compensates for their limitations.</em></p><p>- The most important foundational insight in classification evaluation</p></blockquote><h4><strong>Why We&#8217;re Starting Here</strong></h4><p>Post 1 established the evaluation pipeline - three layers, each catching different failure modes, all leading to a decision gate. Now we go deep on the first automated layer: classification metrics.</p><p>Precision and Recall sit at the foundation of almost every AI evaluation task - not just classification. 
Even for generative models, many of the higher-level metrics (BLEU, ROUGE, BERTScore) derive from the same core concepts of precision and recall applied to overlapping n-grams or embeddings.</p><p>Understand these two metrics thoroughly, and you&#8217;ll have the mental model to understand everything that comes after.</p><h4><strong>The Confusion Matrix: Where It All Starts</strong></h4><p>Before formulas, you need the confusion matrix. 
Every classification metric - Precision, Recall, F1, AUC - is built on the four numbers inside it.</p><p><strong>Confusion Matrix - The Foundation of Precision &amp; Recall</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jWjR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jWjR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 424w, https://substackcdn.com/image/fetch/$s_!jWjR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 848w, https://substackcdn.com/image/fetch/$s_!jWjR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 1272w, https://substackcdn.com/image/fetch/$s_!jWjR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jWjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png" width="1143" height="247" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:247,&quot;width&quot;:1143,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:34798,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192913568?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jWjR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 424w, https://substackcdn.com/image/fetch/$s_!jWjR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 848w, https://substackcdn.com/image/fetch/$s_!jWjR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 1272w, https://substackcdn.com/image/fetch/$s_!jWjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F92fb8455-fcdf-4fb8-8cef-2bc0f2a6838d_1143x247.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Figure 3: The confusion matrix. Every Precision/Recall/F1 formula derives from these four cells. 
Memorize this.</em></p><p><strong>True Positive (TP): </strong>You predicted Positive, and it was actually Positive. A correct hit. This is what you want.</p><p><strong>True Negative (TN): </strong>You predicted Negative, and it was actually Negative. A correct rejection. Also good.</p><p><strong>False Positive (FP): </strong>You predicted Positive, but it was actually Negative. A false alarm. Type I Error. The cost depends on your use case - but it always has one.</p><p><strong>False Negative (FN): </strong>You predicted Negative, but it was actually Positive. You missed something real. Type II Error. In medical diagnosis, this is the worst kind of mistake.</p><p>The reason most teams don&#8217;t think clearly about evaluation is that they report a single number - accuracy - that weights both kinds of error equally. They&#8217;re not equal. The cost of an FP and the cost of an FN are completely different in almost every real-world application.</p><h4><strong>The Formulas - Explained in Plain English</strong></h4><p><strong>Precision = TP / (TP + FP)</strong></p><p>Of everything I said was Positive, how much was actually Positive? This is asking: when I sound the alarm, how often am I right? High precision means few of the alarms you raise are false.</p><p><strong>Recall = TP / (TP + FN)</strong></p><p>Of everything that was actually Positive, how many did I catch? This is asking: out of all the real threats, how many did I find? High recall means you miss very little.</p><p><strong>F1 Score = 2 &#215; (Precision &#215; Recall) / (Precision + Recall)</strong></p><p>The harmonic mean of Precision and Recall. It penalizes extreme imbalance - a model with 100% Precision and 0% Recall gets an F1 of 0, not the 50% an arithmetic mean would give. Use F1 when you care about both and neither can be sacrificed.</p><p><strong>AUC-PR (Area Under the Precision-Recall Curve). </strong>Instead of reporting a single point, plot the full curve across all confidence thresholds and report the area. 
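</p><p>The Precision, Recall, and F1 formulas above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not code from the original post: the ConfusionCounts class, the helper names, and the example counts are all hypothetical, and reporting 0.0 for an undefined ratio (0/0) is a convention chosen here.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """The four cells of a binary confusion matrix (hypothetical helper)."""
    tp: int
    tn: int
    fp: int
    fn: int


def safe_div(numerator, denominator):
    # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def precision(c):
    # Of everything predicted Positive, the share that was actually Positive.
    return safe_div(c.tp, c.tp + c.fp)


def recall(c):
    # Of everything actually Positive, the share that was caught.
    return safe_div(c.tp, c.tp + c.fn)


def f1(c):
    # Harmonic mean of Precision and Recall.
    p, r = precision(c), recall(c)
    return safe_div(2 * p * r, p + r)


# Made-up counts: 1,000 cases, 100 of them actually Positive.
balanced = ConfusionCounts(tp=80, tn=870, fp=30, fn=20)
# Made-up counts: never raises a false alarm, but catches 1 of 100 positives.
timid = ConfusionCounts(tp=1, tn=900, fp=0, fn=99)

for name, counts in [("balanced", balanced), ("timid", timid)]:
    print(name, round(precision(counts), 3), round(recall(counts), 3), round(f1(counts), 3))
```

<p>The second set of counts shows why the harmonic mean matters: Precision is 1.0 and Recall is 0.01, so an arithmetic mean would land near 0.5, while F1 stays near 0.02.</p><p>Back to AUC-PR. 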
This is the most informative single number for imbalanced datasets. An AUC-PR of 0.85 means precision stays high across most of the recall range. For reference, a random classifier scores roughly the fraction of positives in the data.</p><h4><strong>When to Prioritize Which Metric</strong></h4><p><strong>Use Case: Prioritize - and Why</strong></p><p><strong>Medical diagnosis: </strong>Recall - a missed positive can be fatal; false alarms are manageable</p><p><strong>Fraud detection: </strong>Recall first, then Precision - missing fraud costs more than investigating a false alert</p><p><strong>Cancer screening: </strong>Recall - screen everyone; biopsy is the filter for false positives</p><p><strong>Spam filtering: </strong>Precision - blocking legitimate email destroys user trust immediately</p><p><strong>Content moderation: </strong>Context-dependent - safety content needs high Recall; borderline content needs high Precision</p><p><strong>Search engines: </strong>Recall@K - show all the relevant results; ranking handles precision</p><h4><strong>The Tradeoff That Trips Up Even Senior Teams</strong></h4><p>Here&#8217;s what most people understand in theory but get wrong in practice: you can almost always trade Precision for Recall by adjusting your confidence threshold. Lower the threshold &#8594; more positives predicted &#8594; Recall goes up, Precision usually goes down. Raise it &#8594; the reverse.</p><p>This tradeoff is not a flaw. It&#8217;s a feature. Your job as a technical leader is to determine where on this curve your business needs to operate - and that decision belongs to the business, not the model.</p><h4><strong>The Precision-Recall Tradeoff - Decision Matrix</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1VsS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1VsS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 424w, https://substackcdn.com/image/fetch/$s_!1VsS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 848w, https://substackcdn.com/image/fetch/$s_!1VsS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 1272w, 
https://substackcdn.com/image/fetch/$s_!1VsS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1VsS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png" width="1146" height="354" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:354,&quot;width&quot;:1146,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47980,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192913568?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1VsS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 424w, https://substackcdn.com/image/fetch/$s_!1VsS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 848w, 
https://substackcdn.com/image/fetch/$s_!1VsS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 1272w, https://substackcdn.com/image/fetch/$s_!1VsS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e606d-173f-44d8-b15b-a284ea05c740_1146x354.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><em>Figure 4: Precision-Recall tradeoff decision matrix. 
Your business context - not the math - should determine your operating point.</em></p><p>The mistake teams make is picking a threshold of 0.5 by default, reporting the F1 score at that point, and calling it done. That single number hides everything important about the model&#8217;s actual behavior across the operating range.</p><p><strong>What to do instead: </strong>plot the full PR curve. Report AUC-PR. Then convene a conversation about what the cost of an FP is vs. the cost of an FN in your specific application. That conversation - not the math - determines your operating point.</p><h4><strong>Dos and Don&#8217;ts: Precision, Recall &amp; F1</strong></h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cYC1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cYC1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 424w, https://substackcdn.com/image/fetch/$s_!cYC1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 848w, https://substackcdn.com/image/fetch/$s_!cYC1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 1272w, https://substackcdn.com/image/fetch/$s_!cYC1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 
1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cYC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png" width="1140" height="847" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:847,&quot;width&quot;:1140,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113943,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192913568?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cYC1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 424w, https://substackcdn.com/image/fetch/$s_!cYC1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 848w, https://substackcdn.com/image/fetch/$s_!cYC1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cYC1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6bea45d-d59e-46e8-ade9-4bd920387930_1140x847.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>&#128161; <strong>PRO TIP - AUC-PR vs AUC-ROC</strong></h4><p>For balanced datasets, AUC-ROC is a fine summary metric.</p><p>For imbalanced datasets (fraud, medical, most real-world classification), AUC-PR is far more informative.</p><p>Why: on a 1:1000 imbalanced set, AUC-ROC can read 0.95 even when most of the model&#8217;s positive predictions are false alarms, because the false positive rate is diluted by the huge negative class.</p><p>AUC-PR 
forces the model to prove it can find true positives without flooding you with false alarms.</p><p>If your positive class is &lt; 10% of your data, always report AUC-PR alongside AUC-ROC.</p><p>Rule: if you can&#8217;t explain your choice of summary metric to a business stakeholder, it&#8217;s the wrong metric.</p><h4>&#128680; <strong>The Class Imbalance Trap</strong></h4><p>Scenario: 99% of your data is class 0, 1% is class 1 (e.g., fraud).</p><p>A model that predicts class 0 for everything gets 99% accuracy.</p><p>Its Recall for class 1 is 0. Its Precision is undefined (0/0) and, like F1, is conventionally reported as 0.</p><p>This model is completely useless - but your accuracy dashboard shows 99% and looks great.</p><p>Solution: always check Precision and Recall for your minority class. Always. Every time.</p><p>If you&#8217;re not doing this, you don&#8217;t know if your model actually works.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cd-U!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cd-U!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 424w, https://substackcdn.com/image/fetch/$s_!cd-U!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 848w, https://substackcdn.com/image/fetch/$s_!cd-U!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 1272w, 
https://substackcdn.com/image/fetch/$s_!cd-U!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cd-U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png" width="1138" height="160" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:160,&quot;width&quot;:1138,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:29373,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192913568?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cd-U!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 424w, https://substackcdn.com/image/fetch/$s_!cd-U!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 848w, 
https://substackcdn.com/image/fetch/$s_!cd-U!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 1272w, https://substackcdn.com/image/fetch/$s_!cd-U!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F318b1159-b6f8-4c06-a068-2af35a9219ac_1138x160.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>#PrecisionRecall #F1Score #MLMetrics #MachineLearning #DataScience #AIEvaluation #ClassificationMetrics #MLOps #AUC #ModelEvaluation</p><p><strong>&#8594; Coming Next: </strong>Post 3: BLEU &amp; ROUGE - text generation metrics everyone uses wrong</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[SERIES · POST 1 OF 5 Why AI Evaluation Fails]]></title><description><![CDATA[And the architecture that actually fixes it - for every team, at every stage]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/series-post-1-of-5-why-ai-evaluation</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/series-post-1-of-5-why-ai-evaluation</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Wed, 01 Apr 2026 10:01:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xTuV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>Most teams measure what&#8217;s easy, not what&#8217;s important. They track accuracy on a clean benchmark, celebrate a high BLEU score, and ship - only to find users hate the outputs.</em></p><p>- A pattern that repeats across startups and Fortune 500s alike</p></blockquote><h4><strong>The Problem: Evaluation as a Checkbox</strong></h4><p>Here&#8217;s how most teams approach AI evaluation: train the model, run it against a test set, see a number they like, and ship. The number looks good in a slide deck. The demo works. Leadership is happy.</p><p>Then production happens.</p><p>Users start complaining. Edge cases emerge. 
The behavior that looked great in evaluation falls apart when real humans with real, messy queries start using it. The team scrambles. The postmortem inevitably includes some version of: our evaluation didn&#8217;t reflect reality.</p><p><strong>Evaluation is not a number. </strong>It&#8217;s a system. And like every system, it needs to be designed - not stumbled into.</p><p>I&#8217;ve watched this play out across organizations of every size. The ones that get evaluation right from the start aren&#8217;t necessarily the ones with the best models. They&#8217;re the ones with the most rigorous thinking about what good actually looks like, and how to measure it in a way that predicts production performance.</p><h4><strong>The Architecture: Three Layers Working Together</strong></h4><p>A mature AI evaluation system is a pipeline - three distinct layers that each catch different failure modes, and together give you confidence that what you&#8217;re shipping will actually work.</p><p><strong>AI Evaluation Architecture - End-to-End Pipeline</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xTuV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xTuV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 424w, https://substackcdn.com/image/fetch/$s_!xTuV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 848w, 
https://substackcdn.com/image/fetch/$s_!xTuV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 1272w, https://substackcdn.com/image/fetch/$s_!xTuV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xTuV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png" width="1159" height="598" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:598,&quot;width&quot;:1159,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69791,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192811109?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xTuV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 424w, 
https://substackcdn.com/image/fetch/$s_!xTuV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 848w, https://substackcdn.com/image/fetch/$s_!xTuV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 1272w, https://substackcdn.com/image/fetch/$s_!xTuV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7673392d-fd79-41a7-9075-0bf51d868f2f_1159x598.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><em>Figure 1: The complete AI Evaluation pipeline. Automated metrics, human judgment, and A/B testing converge into a single decision gate. Each layer informs the next.</em></p><p><strong>Layer 1 - Automated Metrics. </strong>Fast, scalable, and consistent. These run on every model change, every experiment, every commit if you want them to. They&#8217;re your early warning system. Precision, Recall, F1, BLEU, ROUGE - they tell you whether your model&#8217;s outputs look like the reference. The limitation: they can be gamed, and they measure the wrong thing if you let them. We&#8217;ll go deep on each in Posts 2 and 3.</p><p><strong>Layer 2 - Human Evaluation. </strong>Slow, expensive, and irreplaceable. When automated metrics give you a green light, human evaluation tells you whether it&#8217;s real. This is where you find out if the outputs are actually useful, safe, coherent, and faithful. It&#8217;s the layer most teams skip or do badly. Post 4 covers how to do it right.</p><p><strong>Layer 3 - A/B Testing. </strong>The final proof. Even if both previous layers pass, your model needs to demonstrate real improvement with real users before you can be confident it&#8217;s better. Post 5 covers the statistical framework - including the mistakes that invalidate most A/B tests before they even start.</p><h4><strong>The Evaluation Maturity Model</strong></h4><p>Not every team needs all three layers on day one. Building evaluation is a journey, and the biggest mistake is trying to implement everything before you have the fundamentals. 
Here&#8217;s how to think about it:</p><p><strong>Evaluation Maturity Model - Where Are You?</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jbTR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jbTR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 424w, https://substackcdn.com/image/fetch/$s_!jbTR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 848w, https://substackcdn.com/image/fetch/$s_!jbTR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 1272w, https://substackcdn.com/image/fetch/$s_!jbTR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jbTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png" width="1134" height="220" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:220,&quot;width&quot;:1134,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:58620,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192811109?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jbTR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 424w, https://substackcdn.com/image/fetch/$s_!jbTR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 848w, https://substackcdn.com/image/fetch/$s_!jbTR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 1272w, https://substackcdn.com/image/fetch/$s_!jbTR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd7a25b2-cd56-4fdd-b58c-355e3605e7e7_1134x220.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Figure 2: Evaluation maturity model. 
Build in stages - don&#8217;t try to implement everything at once.</em></p><p>The most important thing isn&#8217;t being at the Advanced stage - it&#8217;s being honest about where you actually are. A team at the Early stage that knows it and operates accordingly is far healthier than a team at the Early stage that thinks it&#8217;s at the Mature stage.</p><p><strong>The Golden Rules That Apply at Every Stage</strong></p><p>1. Never ship a model change without measuring it against a consistent benchmark. Even a simple held-out test set is better than nothing.</p><p>2. Use at least one metric you can&#8217;t game. A human eval on a 50-example sample is harder to manipulate than any automated score.</p><p>3. Make your eval set representative of production. If production is messy, your eval should be too. Clean benchmarks produce clean scores that say little about how the model handles messy data.</p><p>4. Separate your dev set from your test set. The dev set is for tuning. The test set is for final evaluation. Never, ever use your test set during development - contamination is invisible and devastating.</p><p>5. Track results over time. A score in isolation is almost meaningless. A score trending upward or downward over 10 experiments is signal.</p><p>6. When in doubt, call in a human. No automated metric fully captures whether an output is actually good for the person asking.</p><h4>&#9888;&#65039; <strong>Why Teams Resist Rigorous Evaluation</strong></h4><p>This deserves an honest conversation. Rigorous evaluation takes time. It slows down the &#8216;move fast and ship&#8217; culture that many AI teams operate in. And when leadership is breathing down your neck for a demo or a launch, skipping the evaluation layer feels like the pragmatic choice.</p><p>It&#8217;s not. It&#8217;s borrowing time from the future.</p><p>The teams I&#8217;ve seen move fastest long-term are the ones that invested in evaluation infrastructure early. Because when you have solid eval, you can move fast confidently. 
You can try more experiments, ship more often, and know immediately when something breaks. The teams that skip eval move fast initially and then spend months debugging production incidents.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9ITx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9ITx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 424w, https://substackcdn.com/image/fetch/$s_!9ITx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 848w, https://substackcdn.com/image/fetch/$s_!9ITx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 1272w, https://substackcdn.com/image/fetch/$s_!9ITx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9ITx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png" width="1143" height="217" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:217,&quot;width&quot;:1143,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8579,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192811109?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9ITx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 424w, https://substackcdn.com/image/fetch/$s_!9ITx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 848w, https://substackcdn.com/image/fetch/$s_!9ITx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 1272w, https://substackcdn.com/image/fetch/$s_!9ITx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F08cd5fd1-670f-4b37-a454-1826b30ed604_1143x217.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>&#128161; <strong>The Most Important Mindset Shift</strong></p><p>Stop thinking of evaluation as something you do at the end of training.</p><p>Start thinking of it as 
the infrastructure you build before you train.</p><p>The teams that define &#8216;good&#8217; before they build are the ones that build it.</p><p>Ask: &#8216;How will we know if this works?&#8217; before you write a single line of training code.</p><p>If you can&#8217;t answer that question clearly, you&#8217;re not ready to train yet.</p><blockquote><p><strong>Coming Next: </strong>Post 2: Precision, Recall &amp; F1 - the foundation every ML engineer must master</p></blockquote><p>#AIEvaluation #MachineLearning #MLOps #TechLeadership #DataScience #AIEngineering #ProductionML #ModelQuality #AIMetrics #BuildInPublic</p>]]></content:encoded></item><item><title><![CDATA[AI Evaluation : Metrics That Matter]]></title><description><![CDATA[A deep-dive for engineers, researchers, and technical leaders who want to evaluate AI systems with confidence - not just ship fast.]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/ai-evaluation-metrics-that-matter</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/ai-evaluation-metrics-that-matter</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Tue, 31 Mar 2026 10:03:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!W5Rh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Why Most Teams Get AI Evaluation Wrong</strong></p><p>Here&#8217;s an uncomfortable truth: <strong>most teams measure what&#8217;s easy, not what&#8217;s important.</strong> They track accuracy on a clean benchmark, celebrate a high BLEU score, and ship - only to find users hate the outputs. 
I&#8217;ve seen this pattern repeat across startups and Fortune 500s alike.</p><p>Evaluation is not a checkbox at the end of your training loop. It&#8217;s a living system that tells you whether your model actually works in the real world - the difference between AI that creates value and AI that creates technical debt.</p><p><strong>The Evaluation Architecture</strong></p><p>A mature AI evaluation system isn&#8217;t a single metric - it&#8217;s a pipeline combining automated signals, human judgment, and statistical rigor.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W5Rh!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W5Rh!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 424w, 
https://substackcdn.com/image/fetch/$s_!W5Rh!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 848w, https://substackcdn.com/image/fetch/$s_!W5Rh!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 1272w, https://substackcdn.com/image/fetch/$s_!W5Rh!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W5Rh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png" width="1456" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:109406,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!W5Rh!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 424w, https://substackcdn.com/image/fetch/$s_!W5Rh!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 848w, https://substackcdn.com/image/fetch/$s_!W5Rh!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 1272w, https://substackcdn.com/image/fetch/$s_!W5Rh!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc58cedfc-a982-4d6c-92db-cde8e84c55fd_1515x778.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p><em>Figure 1: AI Evaluation pipeline: model outputs &#8594; automated metrics &#8594; human eval &#8594; A/B test &#8594; deployment decision.</em></p><p><strong>1. Precision &amp; Recall: The Foundation</strong></p><p>If you only deeply understand two metrics, make them <strong>Precision</strong> and <strong>Recall</strong>. Everything else builds from here.</p><p><strong>Confusion Matrix - Foundation of Precision &amp; Recall</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gUpS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gUpS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 424w, https://substackcdn.com/image/fetch/$s_!gUpS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 848w, https://substackcdn.com/image/fetch/$s_!gUpS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 1272w, 
https://substackcdn.com/image/fetch/$s_!gUpS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gUpS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png" width="1456" height="261" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:261,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gUpS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 424w, https://substackcdn.com/image/fetch/$s_!gUpS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 848w, 
https://substackcdn.com/image/fetch/$s_!gUpS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 1272w, https://substackcdn.com/image/fetch/$s_!gUpS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F942c0f7f-e254-4f8c-9685-e1731c23de29_1513x271.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Figure 2: Confusion Matrix - foundation of Precision and Recall.</em></p><h2>The Formulas</h2><p>Here TP, FP, and FN are the counts of true positives, false positives, and false negatives from the confusion matrix.</p><p><strong>Precision = TP / (TP + FP)</strong> &#8594; Of everything predicted Positive, what fraction was actually Positive?</p><p><strong>Recall = TP / (TP + FN)</strong> &#8594; Of all actual Positives, what fraction was caught?</p><p><strong>F1 = 2 &#215; (Precision &#215; Recall) / (Precision + Recall)</strong> &#8594; Harmonic mean of Precision and Recall; more informative than accuracy when classes are imbalanced.</p><h2>When to Prioritize Which Metric</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W1G9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W1G9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 424w, https://substackcdn.com/image/fetch/$s_!W1G9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 848w, 
https://substackcdn.com/image/fetch/$s_!W1G9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 1272w, https://substackcdn.com/image/fetch/$s_!W1G9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W1G9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png" width="1456" height="692" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3bee949-f5e1-4698-853d-19a910094895_1515x720.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:692,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W1G9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 424w, 
https://substackcdn.com/image/fetch/$s_!W1G9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 848w, https://substackcdn.com/image/fetch/$s_!W1G9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 1272w, https://substackcdn.com/image/fetch/$s_!W1G9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3bee949-f5e1-4698-853d-19a910094895_1515x720.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>The Precision-Recall Tradeoff</h2><p>Lower your confidence threshold &#8594; Recall goes up, Precision usually drops. Raise it &#8594; the reverse. This tradeoff lives at the heart of almost every production AI system. Plot the full PR curve; report AUC-PR, not just a single operating point.</p><blockquote><p>&#128161; <strong>PRO TIP: Plot the PR Curve, Don&#8217;t Just Report a Number</strong></p></blockquote><p>Always plot the PR curve across confidence thresholds.</p><p>Report AUC-PR - it gives the full picture. For imbalanced datasets, AUC-PR is far more informative than AUC-ROC.</p><p>Choose your operating point based on business cost, not just math.</p><p>Example: 95% precision at 70% recall may be the right operating point for fraud detection - but a single F1 score (about 0.81 here) would not tell you that.</p><h2>Dos and Don&#8217;ts: Precision &amp; Recall</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7lXH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7lXH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 424w, https://substackcdn.com/image/fetch/$s_!7lXH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 848w, 
https://substackcdn.com/image/fetch/$s_!7lXH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 1272w, https://substackcdn.com/image/fetch/$s_!7lXH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7lXH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png" width="1456" height="742" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:742,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80912,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7lXH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 424w, 
https://substackcdn.com/image/fetch/$s_!7lXH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 848w, https://substackcdn.com/image/fetch/$s_!7lXH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 1272w, https://substackcdn.com/image/fetch/$s_!7lXH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F883a93ee-7a50-4d2d-a621-5a9de9db4802_1518x774.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line 
x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>2. BLEU &amp; ROUGE</strong> Text Generation Metrics</p><p>When you move from classification to generation - translation, summarization, dialogue - you need different tools. BLEU and ROUGE are workhorses. Widely used and widely misunderstood.</p><p><strong>BLEU Score Calculation Flow</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WqvL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WqvL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 424w, https://substackcdn.com/image/fetch/$s_!WqvL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 848w, https://substackcdn.com/image/fetch/$s_!WqvL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 1272w, https://substackcdn.com/image/fetch/$s_!WqvL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WqvL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png" width="1456" height="201" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:201,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:43969,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WqvL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 424w, https://substackcdn.com/image/fetch/$s_!WqvL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 848w, https://substackcdn.com/image/fetch/$s_!WqvL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 1272w, https://substackcdn.com/image/fetch/$s_!WqvL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8cf783cb-a8eb-4ca5-8837-c20c18dc7460_1510x208.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Figure 3: BLEU Score pipeline - tokenize &#8594; n-gram match &#8594; precision &#8594; brevity penalty &#8594; score.</em></p><h2>BLEU - Precision for Generated 
Text</h2><p><strong>Core idea: </strong>Count n-gram overlaps between the generated and reference text, then apply a brevity penalty so that very short outputs cannot game the score.</p><p><strong>BLEU-1:</strong> unigrams only | <strong>BLEU-2:</strong> up to bigrams | <strong>BLEU-4:</strong> up to 4-grams (most common in MT research)</p><h2>ROUGE - Recall for Summaries</h2><p>ROUGE flips the perspective: how much of the reference does the generated text cover? It is recall-oriented, which makes it a natural fit for summarization.</p><p><strong>Variant | What It Measures</strong></p><p><strong>ROUGE-1:</strong> Unigram overlap - breadth of coverage</p><p><strong>ROUGE-2:</strong> Bigram overlap - captures phrase structure</p><p><strong>ROUGE-L:</strong> Longest Common Subsequence - preserves flow</p><p><strong>ROUGE-SU:</strong> Skip-bigram + unigram - flexible word order</p><p><strong>ROUGE-W:</strong> Weighted LCS - rewards consecutive matches more</p><blockquote><p>&#9888;&#65039; <strong>The Dirty Secret About BLEU &amp; ROUGE</strong></p></blockquote><p>High BLEU / ROUGE &#8800; good outputs. 
These metrics reward surface overlap, not meaning.</p><p>A grammatically perfect paraphrase built from synonyms can score near zero.</p><p>A fluent but wrong translation can score reasonably well.</p><p>They correlate weakly with human judgment for open-ended generation.</p><p>Always pair automatic metrics with at least a sample-level human review.</p><h2>Dos and Don&#8217;ts: BLEU &amp; ROUGE</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0Pzu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0Pzu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 424w, https://substackcdn.com/image/fetch/$s_!0Pzu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 848w, https://substackcdn.com/image/fetch/$s_!0Pzu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 1272w, https://substackcdn.com/image/fetch/$s_!0Pzu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0Pzu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png" width="1456" height="811" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:811,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:95823,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0Pzu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 424w, https://substackcdn.com/image/fetch/$s_!0Pzu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 848w, https://substackcdn.com/image/fetch/$s_!0Pzu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 1272w, https://substackcdn.com/image/fetch/$s_!0Pzu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa4625628-27f0-4b0d-b377-56ad027ed858_1516x844.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>3. Human Evaluation Frameworks</strong> When Machines Aren&#8217;t Enough</p><p>Automated metrics tell you if your model looks like the reference. Human evaluation tells you if it&#8217;s actually good. 
For open-ended generation, instruction following, and safety - human judgment remains the gold standard.</p><h2>Framework 1: Likert Scale Rating</h2><p>Raters score outputs on a 1&#8211;5 or 1&#8211;7 scale per dimension:</p><p>&#8226; Fluency - Is the text natural and well-formed?</p><p>&#8226; Coherence - Does it make logical sense throughout?</p><p>&#8226; Relevance - Does it address the prompt?</p><p>&#8226; Faithfulness - Does it accurately reflect the source?</p><p>&#8226; Helpfulness - Does it solve the user&#8217;s problem?</p><p>&#8226; Harmlessness - Is it free from bias, toxicity, and unsafe content?</p><blockquote><p>&#128203; <strong>Likert Scale Best Practices</strong></p></blockquote><p>Anchor your scale: define what &#8216;1&#8217;, &#8216;3&#8217;, and &#8216;5&#8217; mean with concrete examples.</p><p>Rate one dimension at a time - mixing dimensions degrades annotation quality.</p><p>Use 3+ raters per output; report inter-annotator agreement (Fleiss&#8217; &#954; or Krippendorff&#8217;s &#945; for three or more raters; Cohen&#8217;s &#954; applies to a pair of raters).</p><p>As a rule of thumb, &#954; &gt; 0.6 is acceptable and &#954; &gt; 0.8 is strong. Randomize and blind all outputs.</p><h2>Framework 2: Pairwise Comparison</h2><p>Raters compare two outputs (A vs B) and pick the better one - often more reliable than absolute ratings, because humans are better at relative judgments. To scale this up, aggregate the comparisons into an Elo-style ranking (the approach behind LMSYS Chatbot Arena).</p><h2>Framework 3: Annotation Task Design</h2><p>1. Write a detailed annotation guide with definitions, examples, and edge cases.</p><p>2. Pilot with 2&#8211;3 annotators on 50&#8211;100 examples; iterate on the guide.</p><p>3. Establish a gold standard: 20&#8211;50 examples with known correct labels.</p><p>4. Check annotators against the gold standard before scaling up.</p><p>5. Monitor agreement throughout - flag raters who drift.</p><p>6. Collect metadata: annotator ID, timestamp, confidence, reasoning.</p><p>7. 
Reserve a blind hold-out set - never use annotation insights mid-eval.</p><h2>Dos and Don&#8217;ts: Human Evaluation</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NbTL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NbTL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 424w, https://substackcdn.com/image/fetch/$s_!NbTL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 848w, https://substackcdn.com/image/fetch/$s_!NbTL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 1272w, https://substackcdn.com/image/fetch/$s_!NbTL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NbTL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png" width="1456" height="837" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:837,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:94998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NbTL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 424w, https://substackcdn.com/image/fetch/$s_!NbTL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 848w, https://substackcdn.com/image/fetch/$s_!NbTL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 1272w, https://substackcdn.com/image/fetch/$s_!NbTL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff2ce0f34-4e60-4f12-bdd2-54451da1d5a3_1510x868.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>4. A/B Testing for AI Systems</strong> Statistical Rigor in Production</p><p>Your model scores well in eval. Now what? You need to know it performs better in production. 
That&#8217;s where A/B testing comes in - and where most teams make costly statistical mistakes.</p><p><strong>A/B Testing Decision Flow in AI Systems</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9e8X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9e8X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 424w, https://substackcdn.com/image/fetch/$s_!9e8X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 848w, https://substackcdn.com/image/fetch/$s_!9e8X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 1272w, https://substackcdn.com/image/fetch/$s_!9e8X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9e8X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png" width="1456" height="188" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:188,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:54616,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9e8X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 424w, https://substackcdn.com/image/fetch/$s_!9e8X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 848w, https://substackcdn.com/image/fetch/$s_!9e8X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 1272w, https://substackcdn.com/image/fetch/$s_!9e8X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc2bc22f3-5f6a-442d-a0be-f20fbb0612ef_1519x196.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><em>Figure 4: A/B Testing pipeline - hypothesis &#8594; traffic split &#8594; data collection &#8594; stat test &#8594; deployment decision.</em></p><h2>The Core 
Statistical Framework</h2><p><strong>H&#8320;:</strong> Models A and B perform equally on metric M.</p><p><strong>H&#8321;:</strong> Model B outperforms Model A.</p><p><strong>p-value:</strong> Probability of seeing a result at least as extreme as yours if H&#8320; is true. Common threshold: p &lt; 0.05.</p><p><strong>Effect size:</strong> How big is the difference? Statistical significance &#8800; practical significance.</p><p><strong>Power (1 &#8722; &#946;):</strong> Probability of detecting a true effect. Aim for &#8805; 0.8.</p><h2>Sample Size: The Most Ignored Calculation</h2><p>Run a power analysis before you start - not after. You need: baseline metric value, minimum detectable effect (MDE), desired power (0.80), and significance level &#945; (0.05). Use tools like Evan Miller&#8217;s sample size calculator or the Python statsmodels library.</p><h2>AI-Specific A/B Testing Challenges</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_R28!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_R28!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 424w, https://substackcdn.com/image/fetch/$s_!_R28!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 848w, https://substackcdn.com/image/fetch/$s_!_R28!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 1272w, 
https://substackcdn.com/image/fetch/$s_!_R28!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!_R28!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png" width="1456" height="748" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:748,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:80074,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_R28!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 424w, https://substackcdn.com/image/fetch/$s_!_R28!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 848w, 
https://substackcdn.com/image/fetch/$s_!_R28!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 1272w, https://substackcdn.com/image/fetch/$s_!_R28!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2caa7dbe-33f8-4441-9dea-7bb83de071b7_1512x777.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>&#128680; <strong>The Peeking Problem - Don&#8217;t Do This</strong></p></blockquote><p>Stopping when p &lt; 0.05 before 
your pre-registered sample size = WRONG.</p><p>It dramatically inflates the false-positive rate. You&#8217;ll declare winners that aren&#8217;t real.</p><p>Solution: pre-register the sample size and don&#8217;t test for significance until you hit it.</p><p>Or use mSPRT (mixture sequential probability ratio test) - designed for interim analysis without inflating error rates.</p><h2>Dos and Don&#8217;ts: A/B Testing</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lID-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lID-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 424w, https://substackcdn.com/image/fetch/$s_!lID-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 848w, https://substackcdn.com/image/fetch/$s_!lID-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 1272w, https://substackcdn.com/image/fetch/$s_!lID-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lID-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png" width="1456" height="865" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:865,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:97472,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lID-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 424w, https://substackcdn.com/image/fetch/$s_!lID-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 848w, https://substackcdn.com/image/fetch/$s_!lID-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 1272w, https://substackcdn.com/image/fetch/$s_!lID-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F58001bac-6c5f-4264-8382-2912e013ace0_1516x901.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Putting It All Together</strong> The Evaluation Maturity Model</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i82W!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i82W!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 424w, 
https://substackcdn.com/image/fetch/$s_!i82W!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 848w, https://substackcdn.com/image/fetch/$s_!i82W!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 1272w, https://substackcdn.com/image/fetch/$s_!i82W!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i82W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png" width="1456" height="595" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75329,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!i82W!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 424w, https://substackcdn.com/image/fetch/$s_!i82W!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 848w, https://substackcdn.com/image/fetch/$s_!i82W!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 1272w, https://substackcdn.com/image/fetch/$s_!i82W!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ddf6c4a-206e-4c87-a936-75c9320feb0e_1522x622.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>The Golden Rules of AI Evaluation</h2><p>1. Never ship a model change without measuring it against a consistent benchmark.</p><p>2. Use at least one metric you can&#8217;t game - a human eval sample or held-out blind set.</p><p>3. Make your eval set representative of production. If prod is messy, your eval should be too.</p><p>4. Track evaluation results over time - a dashboard beats a one-off spreadsheet.</p><p>5. Separate your dev set (tuning) from your test set (final eval). Never reuse the test set.</p><p>6. When in doubt, call a human. No automated metric fully captures what a good output looks like.</p><p><strong>Quick Reference - Metrics at a Glance</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DP5v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DP5v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 424w, https://substackcdn.com/image/fetch/$s_!DP5v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 848w, 
https://substackcdn.com/image/fetch/$s_!DP5v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 1272w, https://substackcdn.com/image/fetch/$s_!DP5v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DP5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png" width="1456" height="1038" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/da0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1038,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:123269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/192678967?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DP5v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 424w, 
https://substackcdn.com/image/fetch/$s_!DP5v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 848w, https://substackcdn.com/image/fetch/$s_!DP5v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 1272w, https://substackcdn.com/image/fetch/$s_!DP5v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fda0aff01-b916-4027-931f-a163ac4221a7_1521x1084.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>#AIEvaluation #MachineLearning #MLOps #ArtificialIntelligence #DataScience #NLP #LLM #DeepLearning #ModelEvaluation #AIMetrics #BLEU #ROUGE #PrecisionRecall #TechLeadership #AIEngineering #BuildInPublic #AIResearch #ProductionML #ABTesting</p><p><span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with 
Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;d140beae-60be-4b64-8451-12e4e388eada&quot;}" data-component-name="MentionToDOM"></span> </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[RAG Systems in Practice]]></title><description><![CDATA[A Complete Technical Guide to Embeddings, Chunking, Hybrid Search, and Latency Trade-offs]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/rag-systems-in-practice</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/rag-systems-in-practice</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Tue, 24 Mar 2026 10:03:03 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!8PLi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>Why Retrieval Quality is the Actual Bottleneck</h1><p style="text-align: justify;">Most teams building LLM-powered search or question-answering systems spend their time picking the right language model. That&#8217;s understandable &#8212; the model is the visible part. But in practice, the model is not the bottleneck. Retrieval is.</p><p style="text-align: justify;">The language model is only as useful as what you hand it. If the chunks you retrieve are the wrong ones &#8212; too big, too small, missing the keyword the user typed, or retrieved from a poorly trained embedding space &#8212; the model will hallucinate, hedge, or confidently answer the wrong question. The quality of your retrieval pipeline determines the ceiling of your entire system.</p>
<p style="text-align: justify;">This guide covers the four things that actually determine retrieval quality in a production system: how you turn text into vectors (embeddings), how you cut documents into pieces (chunking), how you combine two different kinds of search (hybrid search), and how you manage the time each component takes (latency trade-offs). There are no magic numbers here. Each section explains the reasoning behind the decisions so you can tune them for your specific situation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8PLi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8PLi!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 424w, https://substackcdn.com/image/fetch/$s_!8PLi!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 848w, 
https://substackcdn.com/image/fetch/$s_!8PLi!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 1272w, https://substackcdn.com/image/fetch/$s_!8PLi!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8PLi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png" width="588" height="339" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:339,&quot;width&quot;:588,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131459,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8PLi!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 424w, 
https://substackcdn.com/image/fetch/$s_!8PLi!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 848w, https://substackcdn.com/image/fetch/$s_!8PLi!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 1272w, https://substackcdn.com/image/fetch/$s_!8PLi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F54d2e05d-049b-4f01-999a-a3da865a9246_588x339.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p style="text-align: justify;">Full RAG system architecture. The ingestion pipeline runs offline; the query pipeline runs on every user request. Latency budget lives entirely in the bottom band.</p><h1>Section 1: Embeddings</h1><p style="text-align: justify;">An embedding model takes a piece of text and turns it into a fixed-length list of numbers &#8212; a vector. The model is trained so that texts with similar meanings end up with vectors that are close to each other in that number space. When you search, you embed the query using the same model, then find the stored vectors nearest to it. That&#8217;s the whole mechanism.</p><p style="text-align: justify;">What trips people up is assuming this &#8220;closeness&#8221; is reliable regardless of context. It isn&#8217;t. A model trained mostly on web articles will produce a good vector for &#8220;explain machine learning&#8221; but a poor one for &#8220;what does error code 403 mean in our internal API&#8221; &#8212; because it has never seen your API docs.</p><h2>1.1 How Embedding Models are Trained</h2><p style="text-align: justify;">Most embedding models are trained using a technique called contrastive learning. The model sees pairs of texts that are semantically related (a question and its answer, a sentence and a paraphrase of it) and learns to pull their vectors closer together. It simultaneously sees unrelated pairs and learns to push those further apart.</p><p style="text-align: justify;">The training data determines what &#8220;similar&#8221; means in the resulting space. A model trained on scientific paper abstracts will have a very different notion of similarity than one trained on customer support chat logs. 
This is why model selection matters more than most people think: you are not just choosing model size or speed &#8212; you are choosing a definition of similarity.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JIv8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JIv8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 424w, https://substackcdn.com/image/fetch/$s_!JIv8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 848w, https://substackcdn.com/image/fetch/$s_!JIv8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 1272w, https://substackcdn.com/image/fetch/$s_!JIv8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JIv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png" width="1456" height="638" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b31e3043-08cd-4527-9bab-734931c892d6_1543x676.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:638,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:167827,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JIv8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 424w, https://substackcdn.com/image/fetch/$s_!JIv8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 848w, https://substackcdn.com/image/fetch/$s_!JIv8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 1272w, https://substackcdn.com/image/fetch/$s_!JIv8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb31e3043-08cd-4527-9bab-734931c892d6_1543x676.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>1.2 Choosing the Right Model</h2><p style="text-align: justify;">The best starting point for model evaluation is the MTEB benchmark (Massive Text Embedding Benchmark), which tests embedding models across retrieval, classification, clustering, and other tasks. It&#8217;s imperfect, but it&#8217;s far better than guessing. 
More importantly, you should always evaluate candidate models on a sample of your own queries against your own documents.</p><p style="text-align: justify;">Practical model tiers:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!clYN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!clYN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 424w, https://substackcdn.com/image/fetch/$s_!clYN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 848w, https://substackcdn.com/image/fetch/$s_!clYN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 1272w, https://substackcdn.com/image/fetch/$s_!clYN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!clYN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png" width="1456" height="588" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:588,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:101387,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!clYN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 424w, https://substackcdn.com/image/fetch/$s_!clYN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 848w, https://substackcdn.com/image/fetch/$s_!clYN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 1272w, https://substackcdn.com/image/fetch/$s_!clYN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c919656-1df0-40f6-a3e8-1c952d0279db_1504x607.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>1.3 Dimensionality: Bigger is Not Always Better</h2><p style="text-align: justify;">Higher-dimensional vectors can capture more nuance, but they cost more to store and more to search. A 1536-dimension embedding takes 3&#215; the memory and roughly 2&#8211;3&#215; the search time of a 512-dimension one. On many practical retrieval tasks, the quality difference between 512 and 1536 dimensions is less than 5% on recall metrics, which is not worth the cost for most systems.</p><p style="text-align: justify;">Some models, including OpenAI&#8217;s text-embedding-3 series, support Matryoshka representation learning (MRL). This lets you truncate a vector to a shorter length without retraining the model, as long as you re-normalise the truncated vector and use the same length for stored and query vectors. 
You might store 256-dimension vectors for initial filtering, then use 1536-dimension vectors only for a final re-ranking pass on the top 20 results.</p><h2>1.4 Normalisation and Distance Metrics</h2><p style="text-align: justify;">When you normalise each vector to length 1 (L2 normalisation), cosine similarity between two vectors reduces to a simple dot product. A dot product is cheaper to compute than full cosine similarity because it skips the two norm calculations and the division. Most production vector databases offer a dot-product (inner-product) metric that takes advantage of this, but it only matches cosine similarity if the vectors are already unit length, so normalise them before storage.</p><h2>1.5 The One-Model Rule</h2><p style="text-align: justify;">Every chunk in your vector database was embedded using a specific model. Your search query must be embedded using the exact same model: same version, same parameters. If you upgrade your embedding model, you must re-embed your entire corpus. Mixing old-model chunks with new-model query vectors produces meaningless similarity scores because the two models map text into unrelated vector spaces. Pin your model version explicitly in configuration.</p><p><strong>Fine-tuning note</strong></p><p>If your retrieval quality is consistently below acceptable thresholds on domain-specific queries, fine-tuning is worth exploring. Even a small fine-tuning dataset (1,000&#8211;5,000 query-document pairs) can meaningfully improve recall. Libraries like Sentence-Transformers make this relatively accessible. Start with a pre-trained model and fine-tune only the final layers.</p><p><strong>&#10003; What to do</strong></p><p>&#8226; Evaluate models on your own data before committing. MTEB scores are a starting point, not a decision.</p><p>&#8226; Normalise embeddings (L2) before storage. Cosine similarity becomes a dot product and search gets faster.</p><p>&#8226; Pin your model version explicitly in config. 
Treat any upgrade as a full re-index task.</p><p>&#8226; Consider MRL (Matryoshka) models if you need to trade off storage vs. precision dynamically.</p><p>&#8226; Benchmark recall@5, recall@10, and MRR on a representative sample of your actual queries, not generic test sets.</p><p>&#8226; Store metadata alongside vectors: source document ID, chunk index, timestamp, model version.</p><p><strong>&#10007; What to avoid</strong></p><p>&#8226; Do not assume a model that ranks well on MTEB will rank well on your domain. Always verify.</p><p>&#8226; Do not embed entire long documents as a single vector. You will average out the meaning into undifferentiated noise.</p><p>&#8226; Do not mix embedding model versions in the same index. Results will silently degrade.</p><p>&#8226; Do not use a 1536-dim model when 512-dim is sufficient. You are paying a storage and latency cost for marginal benefit.</p><p>&#8226; Do not skip re-embedding the corpus when you upgrade models, even if the model name looks similar.</p><p>&#8226; Do not treat embedding quality as a one-time decision. Evaluate periodically as your content evolves.</p><h1>Section 2: Chunking Strategy</h1><p style="text-align: justify;">Chunking is the process of splitting source documents into smaller pieces before embedding them. It is probably the most underestimated decision in a RAG system. The model that generates your answer sees only what you retrieve, and what you retrieve is determined entirely by how you cut your documents.</p><p style="text-align: justify;">Too large a chunk: you retrieve a 2,000-word block that contains the answer somewhere in the middle, surrounded by unrelated paragraphs. Precision suffers. Too small a chunk: you retrieve a sentence or two that has the right keywords but lacks the surrounding context the model needs to answer. 
The right chunk size is the smallest unit that is still self-contained enough to answer a question on its own.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9UnI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9UnI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 424w, https://substackcdn.com/image/fetch/$s_!9UnI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 848w, https://substackcdn.com/image/fetch/$s_!9UnI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 1272w, https://substackcdn.com/image/fetch/$s_!9UnI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9UnI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png" width="1456" height="684" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:684,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:219061,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9UnI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 424w, https://substackcdn.com/image/fetch/$s_!9UnI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 848w, https://substackcdn.com/image/fetch/$s_!9UnI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 1272w, https://substackcdn.com/image/fetch/$s_!9UnI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F109a1b28-1b92-44c7-829b-210b66a34445_1572x739.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>2.1 Fixed-Size Chunking</h2><p style="text-align: justify;">The simplest approach: split the document every N tokens, with an overlap of M tokens between consecutive chunks. Typical starting values: 512 tokens per chunk, 64-token overlap. The main problem is that it ignores document structure &#8212; a split might happen mid-sentence, mid-code-block, or mid-table.</p><p><strong>Token counting note</strong></p><p>One token is roughly 0.75 words in English for most modern tokenisers. 512 tokens is approximately 380 words. 
Always count tokens, not characters or words: different embedding models use different tokenisers, and the counts can differ by 15&#8211;20%.</p><h2>2.2 Structural / Recursive Chunking</h2><p style="text-align: justify;">A better approach: define a hierarchy of split points and work through them in order. For a Markdown document: split on h1 headings, then h2, then h3, then paragraphs, then sentences, and only fall back to token count if a sentence is still too long. The result is chunks that respect the document&#8217;s own structure.</p><p><strong>For code specifically:</strong></p><p>Never split code at an arbitrary token boundary. Split at function boundaries, class boundaries, or file boundaries. A chunk containing half a function is useless. Use tree-sitter, or Python&#8217;s built-in ast module for Python source, to extract complete functions as natural chunks.</p><h2>2.3 Semantic Chunking</h2><p style="text-align: justify;">The most sophisticated approach: use an embedding model to detect where the &#8220;topic&#8221; of the text shifts, and split there. You embed consecutive sentences, track cosine similarity between adjacent pairs, and cut wherever similarity drops below a threshold. Each chunk is semantically coherent: it covers one idea from start to finish. The downside is that it is slower, because every sentence must be embedded at indexing time, and it produces variable-length chunks.</p><h2>2.4 Hierarchical Chunking (Parent-Child)</h2><p style="text-align: justify;">This approach stores chunks at two levels of granularity. Small chunks (128-token passages, for example) are what you index for retrieval. When a small chunk is retrieved, you return its parent chunk (the full paragraph or section) to the language model. Small chunks retrieve precisely; the parent provides context. RAPTOR extends this further by recursively clustering chunks and creating summary nodes at each level of the resulting tree.</p><h2>2.5 Overlap: How Much and Why</h2><p style="text-align: justify;">Overlap exists because answers frequently span chunk boundaries. 
Rule of thumb: 10&#8211;15% of your chunk size as overlap, so roughly 50&#8211;75 tokens for a 512-token chunk. Too little overlap and you miss boundary-spanning answers. Too much and you are storing and searching duplicate content.</p><p style="text-align: justify;"><strong>&#10003; What to do</strong></p><p>&#8226; Use 10&#8211;15% token overlap between consecutive chunks to avoid missing boundary-spanning answers.</p><p>&#8226; Respect document structure first. Split on headings and paragraphs before falling back to token count.</p><p>&#8226; For code, split at function or class boundaries only. Never at a fixed token count.</p><p>&#8226; Store the parent chunk ID as metadata on every small chunk. You will need it for parent-child retrieval.</p><p>&#8226; Test chunk quality with real queries. Check that the correct chunk appears in the top-5 retrieval results.</p><p>&#8226; For PDFs with complex layouts, extract text per page or per section, not as one flat string.</p><p><strong>&#10007; What to avoid</strong></p><p>&#8226; Do not use 2,000+ token chunks without a re-ranking step. Your precision will be poor.</p><p>&#8226; Do not mix chunk strategies across the same index without careful metadata tracking.</p><p>&#8226; Do not forget to re-check chunk sizes, and re-chunk if needed, when you change your embedding model. A different tokeniser changes the token counts.</p><p>&#8226; Do not chunk tables, structured data, or JSON as flat text. Serialise row-by-row as natural language.</p><p>&#8226; Do not assume the same chunk size works for all your document types.</p><p>&#8226; Do not add metadata (headers, source info) inside the chunk text, because it dilutes the embedding.</p><h1>Section 3: Hybrid Search</h1><p style="text-align: justify;">Dense vector search handles semantic similarity well: it matches &#8220;how do I cancel my subscription&#8221; against chunks containing &#8220;unsubscribe&#8221; or &#8220;terminate account&#8221;. 
But it struggles with exact matches: product codes, error codes, proper nouns, specific technical identifiers.</p><p style="text-align: justify;">Sparse search (BM25) has the opposite profile &#8212; excellent at exact matches and rare terms, poor at synonyms and paraphrases. Hybrid search combines both. The question is how to merge two ranked lists from completely different scoring systems.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d6i1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d6i1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 424w, https://substackcdn.com/image/fetch/$s_!d6i1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 848w, https://substackcdn.com/image/fetch/$s_!d6i1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 1272w, https://substackcdn.com/image/fetch/$s_!d6i1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d6i1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png" width="1456" 
height="738" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:738,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:275722,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d6i1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 424w, https://substackcdn.com/image/fetch/$s_!d6i1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 848w, https://substackcdn.com/image/fetch/$s_!d6i1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 1272w, https://substackcdn.com/image/fetch/$s_!d6i1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F14d4bf16-54c6-4077-bbd0-39f02b70e242_1570x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>3.1 BM25: What It Is and How It Works</h2><p style="text-align: justify;">BM25 (Best Matching 25) is a probabilistic keyword scoring function. For each query term, it scores a chunk on three factors, then sums the per-term scores:</p><p>&#8226; <strong>Term frequency (TF): </strong>How many times does the query term appear in this chunk? More occurrences = higher score, with diminishing returns.</p><p>&#8226; <strong>Inverse document frequency (IDF): </strong>How rare is this term across all chunks? Rare terms get higher weight than common words.</p><p>&#8226; <strong>Length normalisation: </strong>Scales term frequency by the chunk&#8217;s length relative to the average, so a long chunk does not win simply by containing more words.</p><p style="text-align: justify;">You do not need to implement BM25 yourself. 
Elasticsearch, OpenSearch, and Weaviate include BM25 as a built-in option, and engines such as Typesense and Qdrant offer their own keyword or sparse-vector search. For Python, rank-bm25 is a clean standalone implementation.</p><h2>3.2 Score Fusion: Why Raw Scores Cannot Be Merged</h2><p style="text-align: justify;">A BM25 score might be 4.7. A cosine similarity score from vector search for the same chunk might be 0.83. These numbers are on completely different scales and cannot be directly averaged. The reliable solution is Reciprocal Rank Fusion (RRF). Instead of merging scores, RRF merges ranks:</p><p style="text-align: center;">RRF_score(doc) = 1/(k + rank_dense) + 1/(k + rank_sparse), where k = 60</p><p>k is a smoothing constant that damps the influence of the very top ranks; the value 60 has proved empirically robust across many evaluation sets. Ranks start at 1. A document appearing at rank 2 in dense and rank 1 in sparse scores 1/62 + 1/61 &#8776; 0.0325. A document that ranks highly in both lists beats one that tops only one of them.</p><h2>3.3 When to Weight Dense vs. Sparse Differently</h2><p style="text-align: justify;">RRF treats both lists equally. In practice you may want to upweight one depending on query type, for example by multiplying each list&#8217;s term by a weight: keyword-heavy queries (product codes, error messages, names) benefit from upweighting sparse; conceptual queries benefit from upweighting dense. Some systems implement lightweight query classification to adjust weights automatically.</p><h2>3.4 SPLADE: Beyond Simple BM25</h2><p style="text-align: justify;">SPLADE (Sparse Lexical and Expansion model) produces learned sparse vectors that perform term expansion: for &#8220;fast car&#8221; it might also activate &#8220;velocity&#8221;, &#8220;speed&#8221;, &#8220;vehicle&#8221;. This gives it some semantic generalisation while retaining the speed of sparse search. It is more expensive to index than BM25, because every chunk passes through a neural model, but it can outperform BM25 significantly on queries requiring generalisation.</p><p style="text-align: justify;"><strong>&#10003; What to do</strong></p><p>&#8226; Use RRF as your default fusion method. 
It is scale-agnostic and consistently robust.</p><p>&#8226; Build your BM25 index alongside your vector index from day one. Adding it retroactively requires a full re-index.</p><p>&#8226; Log which retrieval path contributed to successful retrievals. Use this to tune weights.</p><p>&#8226; Consider SPLADE if BM25 consistently misses on queries containing synonyms or related terms.</p><p>&#8226; Use a managed database (Weaviate, Elasticsearch, Azure AI Search, formerly Azure Cognitive Search) that handles both indexes in one system.</p><p><strong>&#10007; What to avoid</strong></p><p>&#8226; Do not naively average BM25 and cosine similarity scores. The result will be arbitrary.</p><p>&#8226; Do not skip BM25 in domains with product codes, error messages, names, or other exact identifiers.</p><p>&#8226; Do not assume 0.5/0.5 dense-sparse weighting is optimal. Log and tune it.</p><p>&#8226; Do not maintain separate BM25 and vector systems unless absolutely necessary.</p><p>&#8226; Do not treat hybrid search as a one-time setup. Retune weights if your query patterns change.</p><h1>Section 4: Latency Trade-offs</h1><p style="text-align: justify;">Every component in your retrieval pipeline adds time. The question is not how to make everything fast; it is which components deserve the latency budget and which can be optimised without hurting quality. Set a latency budget before you design the pipeline. 
Not after.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!saZj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!saZj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 424w, https://substackcdn.com/image/fetch/$s_!saZj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 848w, https://substackcdn.com/image/fetch/$s_!saZj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 1272w, https://substackcdn.com/image/fetch/$s_!saZj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!saZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png" width="1456" height="815" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:815,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:297065,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!saZj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 424w, https://substackcdn.com/image/fetch/$s_!saZj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 848w, https://substackcdn.com/image/fetch/$s_!saZj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 1272w, https://substackcdn.com/image/fetch/$s_!saZj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F30bffa3a-3adf-4d2c-85b8-7b4547e345bc_1566x877.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>4.1 Component-by-Component Latency Breakdown</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!m4ur!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!m4ur!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 424w, 
https://substackcdn.com/image/fetch/$s_!m4ur!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 848w, https://substackcdn.com/image/fetch/$s_!m4ur!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 1272w, https://substackcdn.com/image/fetch/$s_!m4ur!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!m4ur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png" width="1456" height="836" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:836,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:159540,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!m4ur!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 424w, https://substackcdn.com/image/fetch/$s_!m4ur!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 848w, https://substackcdn.com/image/fetch/$s_!m4ur!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 1272w, https://substackcdn.com/image/fetch/$s_!m4ur!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93282b39-2738-41e2-974e-7f1af135ca27_1501x862.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" 
stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lkAX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lkAX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 424w, https://substackcdn.com/image/fetch/$s_!lkAX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 848w, https://substackcdn.com/image/fetch/$s_!lkAX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 1272w, https://substackcdn.com/image/fetch/$s_!lkAX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lkAX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png" width="1456" height="662" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:662,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184651,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lkAX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 424w, https://substackcdn.com/image/fetch/$s_!lkAX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 848w, https://substackcdn.com/image/fetch/$s_!lkAX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 1272w, https://substackcdn.com/image/fetch/$s_!lkAX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3756f4ac-749e-48f2-a9b9-446787647aa4_1578x718.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
<h2>4.2 HNSW Parameters You Actually Need to Know</h2><p style="text-align: justify;">HNSW (Hierarchical Navigable Small World) is the dominant ANN algorithm in production vector databases. Three parameters matter:</p><p>&#8226; <strong>M: </strong>Maximum edges per node in the graph. Higher M = better recall, more memory, slower build. Start with 32.</p><p>&#8226; <strong>ef_construction: </strong>Size of the candidate list explored when inserting a new vector. Affects index quality and build time, not query time. Common range: 100&#8211;400.</p><p>&#8226; <strong>ef_search: </strong>Number of candidates explored at query time, and your main quality-vs-speed knob. Start at 100, tune from there. Below 50, recall often drops below 90%, so verify on your own data.</p><h2>4.3 Vector Quantisation</h2><p style="text-align: justify;">Quantisation compresses your vectors to use less memory. 
Two main approaches:</p><p>&#8226; <strong>Scalar quantisation (SQ8): </strong>Compresses each float32 component (4 bytes) to int8 (1 byte), giving 4&#215; compression. Recall degradation is typically 1&#8211;3%. A safe default.</p><p>&#8226; <strong>Product quantisation (PQ): </strong>More aggressive (8&#8211;32&#215; is common). Recall degradation is typically 5&#8211;15%, but the memory savings are dramatic. Use it when you have hundreds of millions of vectors.</p>
<h2>4.4 Cross-Encoder Re-ranking: When It&#8217;s Worth It</h2><p style="text-align: justify;">A cross-encoder takes the query and a candidate document together as input and produces a single relevance score. It attends to query and passage jointly, so it can model their exact relationship, but it must score every candidate separately. For top-20 candidates, that is 20 query-passage pairs to score; batching helps throughput, but the cost still grows linearly with the number of candidates.</p><p style="text-align: justify;">Practical threshold: if your P95 latency budget is above 700&#8211;800ms total, a cross-encoder on top-20 is usually worth it. Below that, rely on RRF alone or a lightweight bi-encoder re-ranker.</p><p><strong>Lightweight alternative</strong></p><p>ColBERT uses late interaction: query and document are still encoded separately, but each is kept as a set of token-level embeddings rather than a single sentence-level vector, and relevance is computed by matching query tokens against document tokens. This captures more interaction signal than a standard bi-encoder without the full cost of a cross-encoder. Useful when you need better precision than bi-encoder similarity but cannot afford cross-encoder latency.</p>
<h2>4.5 Caching Strategies</h2><p style="text-align: justify;">In many production systems, 10&#8211;20% of unique queries account for 60&#8211;80% of total traffic, so caching has an outsized impact:</p><p>&#8226; <strong>Query embedding cache: </strong>Exact-match cache keyed on the normalised query string. Even a simple LRU cache of 10,000 entries can give a 30&#8211;50% hit rate when traffic is this skewed.</p><p>&#8226; <strong>Retrieval result cache: </strong>Cache the full list of retrieved chunk IDs for each query. Valid as long as the index has not changed. 
Use stale-while-revalidate to avoid blocking.</p><p>&#8226; <strong>LLM response cache: </strong>For repeated exact queries. Brittle if underlying documents change frequently, but valuable for high-traffic FAQ-style systems.</p>
<h2>4.6 Measuring Latency Correctly</h2><p style="text-align: justify;">Measuring latency in development tells you almost nothing useful. Measure under realistic conditions:</p><p>&#8226; Run load tests at your expected peak QPS before launch, not after.</p><p>&#8226; Measure P50, P90, P95, and P99 separately. P50 (median) hides the tail. P99 is what the slowest 1% of requests experience.</p><p>&#8226; Measure end-to-end from the user&#8217;s perspective. Network overhead, serialisation, and streaming all add up.</p><p>&#8226; Cold-start latency after an index reload is typically 5&#8211;10&#215; higher than steady-state, so measure it separately.</p>
<p><strong>&#10003; What to do</strong></p><p>&#8226; Set a concrete latency budget before designing the pipeline. Budget first, components second.</p><p>&#8226; Run dense and sparse search in parallel. They are independent, and their results can be merged afterwards.</p><p>&#8226; Apply SQ8 quantisation to your vector index when memory is a constraint. Recall loss is typically under 3%.</p><p>&#8226; Cache query embeddings. Query distributions are skewed; a small cache has a large hit rate.</p><p>&#8226; Measure latency under realistic concurrent load, not in a single-threaded dev environment.</p><p>&#8226; Consider async re-ranking: return RRF results immediately, then update them if the re-ranker produces a significantly different ordering.</p>
<p><strong>&#10007; What to avoid</strong></p><p>&#8226; Do not add a cross-encoder re-ranker without measuring its actual P95 impact under production load.</p><p>&#8226; Do not retrieve top-100 chunks and pass all of them to the language model.</p><p>&#8226; Do not ignore cold-start latency. 
Index load after a restart will be much slower than steady-state.</p><p>&#8226; Do not run re-ranking synchronously on every query without a latency fallback path.</p><p>&#8226; Do not measure only P50. P95 and P99 are what your slowest users actually experience.</p><p>&#8226; Do not over-index on embedding latency. LLM generation is 10&#8211;50&#215; slower than retrieval.</p><h1>Section 5: Evaluating Your Retrieval Pipeline</h1><p style="text-align: justify;">A retrieval system you cannot measure is one you cannot improve. The minimum you need is a set of representative query-answer pairs that you evaluate against regularly. Without this, changes to your embedding model, chunk size, or search weights are just guesses.</p>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dSgI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dSgI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 424w, https://substackcdn.com/image/fetch/$s_!dSgI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 848w, https://substackcdn.com/image/fetch/$s_!dSgI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dSgI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dSgI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png" width="1456" height="659" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:261225,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dSgI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 424w, https://substackcdn.com/image/fetch/$s_!dSgI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 848w, 
https://substackcdn.com/image/fetch/$s_!dSgI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 1272w, https://substackcdn.com/image/fetch/$s_!dSgI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9fa884aa-aed4-42c6-a057-d1fef75f44f2_1579x715.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>5.1 Metrics That Matter</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" 
target="_blank" href="https://substackcdn.com/image/fetch/$s_!X9Xo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X9Xo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 424w, https://substackcdn.com/image/fetch/$s_!X9Xo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 848w, https://substackcdn.com/image/fetch/$s_!X9Xo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 1272w, https://substackcdn.com/image/fetch/$s_!X9Xo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X9Xo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png" width="1456" height="961" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:961,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:184725,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X9Xo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 424w, https://substackcdn.com/image/fetch/$s_!X9Xo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 848w, https://substackcdn.com/image/fetch/$s_!X9Xo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 1272w, https://substackcdn.com/image/fetch/$s_!X9Xo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6d85dbf3-81b4-4927-aa28-60fd058066cd_1506x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>5.2 Building an Evaluation Set</h2><p style="text-align: justify;">The most important property of an evaluation set is that it reflects your actual query distribution. Generic benchmarks are useful for comparing models but will not tell you how your system performs on your specific documents.</p><p>&#8226; Export a sample of real production queries (remove any PII). Aim for at least 200&#8211;300 queries with known-correct answers.</p><p>&#8226; For each query, manually identify the correct source chunk(s). 
You cannot automate ground truth creation reliably with an LLM.</p><p>&#8226; Include hard negatives: queries where a plausible-looking chunk is actually the wrong answer.</p><p>&#8226; Re-evaluate on this set every time you change: embedding model, chunk strategy, search weights, or re-ranking approach.</p><h1>Section 6: Quick Reference</h1><h3>Chunking starting points by content type</h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hS6T!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hS6T!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 424w, https://substackcdn.com/image/fetch/$s_!hS6T!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 848w, https://substackcdn.com/image/fetch/$s_!hS6T!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 1272w, https://substackcdn.com/image/fetch/$s_!hS6T!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hS6T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png" width="1456" 
height="689" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:689,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:113191,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191936243?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hS6T!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 424w, https://substackcdn.com/image/fetch/$s_!hS6T!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 848w, https://substackcdn.com/image/fetch/$s_!hS6T!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 1272w, https://substackcdn.com/image/fetch/$s_!hS6T!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F544cafb2-068d-467f-9f77-9d13f4a2cd44_1516x717.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div>
<h3>Latency budget allocation (500ms target)</h3><p>&#8226; Query embedding: aim for &lt; 20ms. Use the cached embedding if the query is a repeat.</p><p>&#8226; Vector + keyword search (parallel): aim for &lt; 40ms total.</p><p>&#8226; RRF merge: &lt; 5ms. Non-negotiable.</p><p>&#8226; Re-ranking: at a 500ms total budget, skip it or make it async.</p><p>&#8226; LLM generation: stream to the user. 
Time-to-first-token matters more than total generation time.</p><h1>Closing Thoughts</h1><p style="text-align: justify;">Building a retrieval pipeline that works well in production requires getting four things right at the same time: picking an embedding model that understands your domain, cutting documents at boundaries that preserve meaning, combining vector and keyword search so you can handle both semantic and exact-match queries, and keeping the whole pipeline fast enough that it is actually usable.</p><p style="text-align: justify;">None of these decisions is permanent. Your query distribution will shift as your users evolve. Your documents will grow. Treat your retrieval pipeline as something you tune continuously, not something you set and forget.</p><p style="text-align: justify;">The one thing that separates teams who get this right from teams who struggle: they measure continuously. They have an evaluation set. They track recall@10 across every pipeline change. They know which component is causing problems when something degrades. If you build nothing else from this guide, build that evaluation loop.</p><p>#RAG #VectorSearch #Embeddings #HybridSearch #LLM #AIEngineering #MachineLearning #NLP #SemanticSearch #MLOps #RetrievalAugmentedGeneration #BM25 #TechLeadership #GenAI #AIArchitecture #SoftwareEngineering #DataEngineering #HNSW</p>]]></content:encoded></item><item><title><![CDATA[Failure Modes & The Complete Production Checklist]]></title><description><![CDATA[Four failure patterns, 14 production rules, and the architecture questions to answer before sprint one.]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/failure-modes-and-the-complete-production</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/failure-modes-and-the-complete-production</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Sun, 22 Mar 2026 10:02:20 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!X5W8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>&#8220;The RAG demo worked perfectly. The production system took 6 months. Almost none of that time was the LLM. It was chunking strategy, pipeline freshness, hybrid search, and reranker integration.&#8221;</strong></p><p><strong>4 : </strong>Failure modes that kill production RAG</p><p><strong>7 : </strong>Production rules that work</p><p><strong>7 : </strong>Silent killers that destroy RAG systems</p><p><strong>4 : </strong>Architecture questions before sprint one</p><p><strong>FOUR WAYS PRODUCTION RAG FAILS &#8212; DIAGNOSIS &amp; FIX</strong></p>
<p><strong>Failure 01 Stale Index</strong></p><p>Symptom: &#8220;The AI gave the right answer last week. Now it&#8217;s wrong.&#8221; Source documents changed but embeddings weren&#8217;t updated. Fix: CDC-triggered pipeline updates with content hash tracking. Never rely on scheduled full re-indexes alone. Users stop trusting the system long before anyone measures the staleness rate.</p>
<p><strong>Failure 02 Wrong Chunk Size</strong></p><p>Symptom: &#8220;The answer is in the document but the AI can&#8217;t find it&#8221; or &#8220;retrieved context is just noise.&#8221; Fixed-size chunks on structured docs destroy context. Fix: match chunking strategy to document type. Use hierarchical chunking for enterprise documents. Evaluate retrieval precision at the chunk level, not just answer quality.</p>
<p><strong>Failure 03 Dense-Only Retrieval</strong></p><p>Symptom: &#8220;The AI can&#8217;t find anything about [specific product code / legal clause number].&#8221; Dense embeddings dilute exact-match signals. Fix: add BM25 sparse search and apply Reciprocal Rank Fusion. Hybrid search consistently outperforms dense-only by 30&#8211;40% on enterprise corpora. 
Implement before launch &#8212; not as a retrofit.</p><p><strong>Failure 04 No Reranking</strong></p><p>Symptom: &#8220;Hallucination rate too high even when the answer is clearly in the knowledge base.&#8221; Top-5 from dense search includes irrelevant context that causes the LLM to confabulate. Fix: retrieve top-50, rerank to top-5 with a cross-encoder. The single highest-impact change for reducing hallucination in a deployed RAG system.</p><p><strong>THE COMPLETE 7 + 7 PRODUCTION RULES</strong></p><p><strong>&#9989; 7 Things That Work in Production RAG</strong></p><p>&#9989; <strong>Use CDC for index updates, not scheduled polls.</strong> Eliminates 80&#8211;90% of redundant embedding compute. Keeps index fresh without reindexing spikes.</p><p>&#9989; <strong>Match chunking strategy to document type.</strong> Legal: hierarchical. FAQs: fixed-size. Research: semantic. One-size-fits-all chunking is one-size-fits-none.</p><p>&#9989; <strong>Always use hybrid search &#8212; dense + BM25 + RRF.</strong> 30&#8211;40% recall improvement. Implement from launch &#8212; retrofitting requires index restructuring.</p><p>&#9989; <strong>Rerank top-50 to top-5 before every LLM call.</strong> The single highest-impact pipeline change. Moves accuracy from ~60% to ~87%.</p><p>&#9989; <strong>Build a golden evaluation dataset from day one.</strong> 50&#8211;200 QA pairs. Run RAGAS daily. Track chunk-level precision separately from answer accuracy.</p><p>&#9989; <strong>Use gRPC between all microservices.</strong> Saves 15&#8211;25ms per hop. At 4 hops per query, that&#8217;s 60&#8211;100ms per request, permanently.</p><p>&#9989; <strong>Enrich every chunk with access control metadata from day one.</strong> Source, date, type, user role. 
Retrofitting requires full re-indexing &#8212; build it in from the start.</p><p><strong>&#10060; 7 Silent Killers of RAG in Production</strong></p><p>&#10060; <strong>Don&#8217;t change embedding models without re-indexing everything.</strong> Different model = different vector space. Every similarity score becomes meaningless. Treat changes as schema migrations.</p><p>&#10060; <strong>Don&#8217;t use Chroma or in-memory stores in production.</strong> No persistence, no hybrid search, no access control. Migration cost is weeks. Choose a production store on day one.</p><p>&#10060; <strong>Don&#8217;t embed chunks one-at-a-time.</strong> 50&#8211;100&#215; slower than batching. For 100k documents, individual embedding takes hours. Batched takes minutes.</p><p>&#10060; <strong>Don&#8217;t put the data pipeline in the query path.</strong> Chunking, embedding, and index updates are background. Coupling them to query latency breaks SLAs.</p><p>&#10060; <strong>Don&#8217;t deploy RAG without a faithfulness gate.</strong> One confident wrong answer erodes more trust than ten correct ones build.</p><p>&#10060; <strong>Don&#8217;t ignore the lost-in-the-middle effect.</strong> Most relevant chunks go first and last. The middle is where important context goes to be ignored.</p><p>&#10060; <strong>Don&#8217;t skip access control metadata.</strong> In enterprise RAG, every query should only retrieve authorised chunks. Retrofitting requires full re-indexing.</p><p><strong>PRE-SPRINT ARCHITECTURE CHECKLIST</strong></p><p>Answer these four questions before sprint planning. 
Every question you skip becomes a production failure at a predictable time.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!X5W8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!X5W8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 424w, https://substackcdn.com/image/fetch/$s_!X5W8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 848w, https://substackcdn.com/image/fetch/$s_!X5W8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 1272w, https://substackcdn.com/image/fetch/$s_!X5W8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!X5W8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png" width="1140" height="343" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:343,&quot;width&quot;:1140,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61965,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191441545?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!X5W8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 424w, https://substackcdn.com/image/fetch/$s_!X5W8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 848w, https://substackcdn.com/image/fetch/$s_!X5W8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 1272w, https://substackcdn.com/image/fetch/$s_!X5W8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9478ad6b-8b22-4d70-a054-2d3fd6d61588_1140x343.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>The RAG demo worked perfectly. The RAG production system took six months to get right &#8212; and almost none of that time was spent on the LLM. It was chunking strategy, pipeline freshness, hybrid search implementation, and reranker integration. The model was fine from day one. 
Everything around it needed an architecture, not a script.</p><p>#RAG #VectorDatabase #AIEngineering #DataPipelines #RetrievalAugmentedGeneration #VectorSearch #LLMEngineering #Microservices #SemanticSearch #HybridSearch #MachineLearning #GenerativeAI #MLOps #TechLeadership #EnterpriseAI #AIArchitecture #AICareer</p><p> <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;14e9e6f2-5b04-46d7-892b-0e387458953a&quot;}" data-component-name="MentionToDOM"></span> </p>]]></content:encoded></item><item><title><![CDATA[The RAG Pipeline - 6 Stages to a Grounded Response]]></title><description><![CDATA[Every stage controls a specific quality dimension. Skip one and the failure is exactly where it was skipped.]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/the-rag-pipeline-6-stages-to-a-grounded</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/the-rag-pipeline-6-stages-to-a-grounded</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Sat, 21 Mar 2026 10:01:02 GMT</pubDate><content:encoded><![CDATA[<p><strong>&#8220;The LLM isn&#8217;t hallucinating because it&#8217;s a bad model. 
It&#8217;s hallucinating because you sent it noisy context. Reranking fixes that. One pipeline stage. 27 percentage points.&#8221;</strong></p><p><strong>~60% : </strong>Answer accuracy without reranking</p><p><strong>~87% : </strong>Answer accuracy with reranking</p><p><strong>+15&#8211;25% : </strong>HyDE recall improvement on abstract queries</p><p><strong>50&#8211;100ms : </strong>Cross-encoder reranking latency (worth every ms)</p><p><strong>THE 6-STAGE RAG PIPELINE</strong></p><p><strong>01- Query Analysis &amp; Decomposition</strong></p><p>IMPROVEMENT: complex queries decomposed &#8594; sub-queries &#8594; better recall</p><p>Classify the query before embedding. Simple factual: embed and retrieve directly. Complex multi-part: decompose into 2&#8211;4 sub-queries, retrieve separately, merge before reranking. Hypothetical Document Embeddings (HyDE): generate a hypothetical answer first, embed that answer, then retrieve with that embedding. A hypothetical answer reads more like the source passages than a short question does, so its embedding tends to land closer to the relevant chunks. HyDE improves recall on abstract queries by 15&#8211;25%. 
Route query type to the right retrieval strategy.</p><p><strong>02- Hybrid Search - Dense + BM25 + RRF</strong></p><p>RECALL: hybrid search improves recall 30&#8211;40% over dense-only on most corpora</p><p>Run HNSW dense search and BM25 sparse search in parallel. Fuse with Reciprocal Rank Fusion: score = 1/(rank_dense + 60) + 1/(rank_sparse + 60), where each rank is the chunk&#8217;s 1-based position in that list and a chunk missing from a list contributes nothing for it. Return top-50 fused candidates to the reranker. Never return raw top-5 from dense search directly to the LLM &#8212; the top-5 bi-encoder results typically include 2&#8211;3 chunks that are semantically adjacent but not actually relevant to the specific question. Those chunks are what the LLM confabulates from.</p><p><strong>03- Cross-Encoder Reranking - The Quality Gate</strong></p><p>PRECISION: reranking top-50 to top-5 moves answer accuracy from ~60% to ~87%</p><p>Bi-encoder retrieval embeds the query and each chunk separately, then compares the two vectors. Cross-encoders (Cohere Rerank / BGE-Reranker) see query + chunk together &#8212; scoring joint relevance, not independent similarity. Budget 50&#8211;100ms for scoring the top-50 candidates. This single stage accounts for the majority of accuracy improvement in production RAG. Run it on every query. The irrelevant context in top-5-without-reranking is what causes the LLM to hallucinate from noise rather than knowledge gaps.</p><p><strong>04- Context Assembly - Token Budget and Ordering</strong></p><p>QUALITY: context ordering affects LLM attention (lost-in-the-middle effect)</p><p>Lost-in-the-middle: LLMs pay more attention to content at the beginning and end of context, less to the middle. Put most relevant chunks first and last. Token budget: don&#8217;t keep concatenating chunks until the window is full; reserve 30&#8211;40% of it for generation. Include source metadata (title, date, section) with each chunk. Add a grounding instruction: &#8220;Answer only from the provided context. 
If the context doesn&#8217;t contain the answer, say so.&#8221; This single instruction significantly reduces confabulation.</p><p><strong>05- Generation + Faithfulness Gate</strong></p><p>SAFETY: faithfulness scoring catches hallucinations before they reach users</p><p>After generation, check: does each substantive claim appear in the retrieved context? Use a small LLM call (gpt-4o-mini) or a dedicated evaluation framework (TruLens, or the RAGAS faithfulness metric). Claims not supported by any retrieved chunk &#8594; flag or remove. Track answer relevance (did the response answer the question?) and context relevance (were retrieved chunks actually about the question?) separately. When they diverge, the divergence tells you which pipeline stage to fix: low context relevance usually points at retrieval, while relevant context with a weak answer points at context assembly or generation.</p><p><strong>06- RAG Evaluation Loop - You Cannot Improve What You Don&#8217;t Measure</strong></p><p>PRODUCTION HEALTH: golden dataset + daily RAGAS metrics</p><p>Log every query together with its retrieved chunks and generated response. Run RAGAS evaluation daily on a sample: faithfulness, answer relevance, context precision, context recall. Build a golden dataset of 50&#8211;200 QA pairs with known correct source chunks from day one, and run it automatically on every pipeline change. Track retrieval miss rate: queries where none of the top-5 chunks were relevant. A rising miss rate is an early warning of index drift before answer quality metrics degrade.</p><p><strong>DOS &amp; DON&#8217;TS &#8212; RAG PIPELINE</strong></p><p><strong>&#9989; What Builds Accurate RAG</strong></p><p>&#9989; <strong>Always rerank top-50 to top-5.</strong> Cross-encoder reranking is the single highest-impact pipeline change for RAG accuracy. Run it on every query, no exceptions.</p><p>&#9989; <strong>Use HyDE for abstract or open-ended queries.</strong> +15&#8211;25% recall improvement. 
Generate a hypothetical answer, embed it, retrieve matching content.</p><p>&#9989; <strong>Order context with most relevant first and last.</strong> Lost-in-the-middle is real and empirically validated. Position your best chunks at the extremes.</p><p>&#9989; <strong>Build a golden evaluation dataset from day one.</strong> 50&#8211;200 QA pairs. Run RAGAS daily. You cannot improve what you don&#8217;t measure.</p><p><strong>&#10060; What Causes Hallucination</strong></p><p>&#10060; <strong>Don&#8217;t return top-5 from bi-encoder directly to the LLM.</strong> Bi-encoder top-5 includes irrelevant context. That context is what the LLM hallucinates from.</p><p>&#10060; <strong>Don&#8217;t skip the faithfulness gate.</strong> LLMs confidently state things not in context. A 30-token faithfulness check prevents the most damaging hallucinations.</p><p>&#10060; <strong>Don&#8217;t fill the context window without managing token budget.</strong> Leaving 30&#8211;40% for generation gives the LLM room to reason and synthesise.</p><p>&#10060; <strong>Don&#8217;t deploy without an evaluation loop.</strong> Every pipeline change without measurement is a gamble. Golden dataset + RAGAS is the safety net.</p><p>We added cross-encoder reranking on a Tuesday afternoon. By Thursday, the team had independently noticed the answers were better &#8212; without being told anything had changed. Faithfulness scores went from 61% to 89%. The LLM hadn&#8217;t changed. 
The retrieved context had.</p><p>#RAG #RetrievalAugmentedGeneration #LLMEngineering #GenerativeAI #AIEngineering #ProductionML #VectorSearch #CohereRerank #TechLeadership</p><p> <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;5882f948-86a6-4b12-b9b6-4e27893b6b31&quot;}" data-component-name="MentionToDOM"></span> </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Vector Stores & Hybrid Search]]></title><description><![CDATA[HNSW + BM25 + RRF &#8212; the retrieval stack that actually works at enterprise scale.]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/vector-stores-and-hybrid-search</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/vector-stores-and-hybrid-search</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Fri, 20 Mar 2026 10:01:10 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!k0ip!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>&#8220;Dense vector search is what every RAG tutorial teaches. It misses 30&#8211;40% of what your users are actually looking for. Hybrid search closes that gap.&#8221;</strong></p><p><strong>30&#8211;40% : </strong>Recall improvement hybrid over dense-only</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">This Substack is reader-supported. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>10ms : </strong>HNSW ANN search at billion scale</p><p><strong>RRF : </strong>Reciprocal Rank Fusion, no per-query hyperparameters</p><p><strong>k=60 : </strong>Constant in the RRF formula; works across domains without tuning</p><p><strong>WHY HYBRID SEARCH IS NON-NEGOTIABLE</strong></p><p>Dense embeddings dilute exact-match signals into semantic space. Product codes, legal clause numbers, technical specifications, proper nouns &#8212; these are the searches that fail silently in dense-only systems. BM25 sparse search catches what dense search misses. Both are needed.</p><p><strong>THE HYBRID SEARCH ARCHITECTURE</strong></p><p><strong>Dense ANN (HNSW)</strong></p><p>Hierarchical Navigable Small World &#8212; the ANN algorithm used by Pinecone, Weaviate, and Qdrant. Returns the top 50 candidates (k=50) in under 10ms at billion-document scale. Tune ef_search at query time based on your latency budget. Graph quality degrades under repeated delete+insert; upsert preserves graph integrity.</p><p><strong>Sparse BM25</strong></p><p>BM25 (Best Match 25) &#8212; term frequency weighted by inverse document frequency, with document-length normalisation. Catches exact product codes, legal clause references, version numbers, proper nouns, technical specifications. Runs in parallel with HNSW, not sequentially. The two are complementary: dense finds semantically relevant content, sparse finds exactly matching content.</p><p><strong>RRF Fusion</strong></p><p>Formula: score = 1/(rank_dense + 60) + 1/(rank_sparse + 60). k=60 dampens the influence of top-ranked results, ensuring a #1 result in one list doesn&#8217;t dominate over #2 in both lists. 
No per-query hyperparameters. Consistent across domains. Return top-50 fused candidates to the reranker.</p><p><strong>Metadata Pre-Filter</strong></p><p>Apply date range, document type, access control labels, and language filters BEFORE ANN search. Pre-filtering reduces the search space from the full index to the relevant subset. Faster search, more precise results, access-controlled retrieval without post-filtering.</p><p><strong>VECTOR STORE COMPARISON &#8212; PRODUCTION DECISION GUIDE</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k0ip!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k0ip!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 424w, https://substackcdn.com/image/fetch/$s_!k0ip!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 848w, https://substackcdn.com/image/fetch/$s_!k0ip!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 1272w, https://substackcdn.com/image/fetch/$s_!k0ip!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!k0ip!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png" width="1141" height="439" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:439,&quot;width&quot;:1141,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71938,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191438486?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k0ip!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 424w, https://substackcdn.com/image/fetch/$s_!k0ip!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 848w, https://substackcdn.com/image/fetch/$s_!k0ip!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 1272w, 
https://substackcdn.com/image/fetch/$s_!k0ip!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8872a4b6-306c-4b24-91bb-824ca0108b56_1141x439.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>&#9888;&#65039; Critical: Never Use Chroma or In-Memory Stores in Production</strong></p><p>Chroma has no persistence guarantees, no distributed deployment, no access control, and no hybrid search. The migration cost when you outgrow it &#8212; full re-indexing, pipeline changes, application rewrites &#8212; is measured in weeks. 
Choose a production store on day one, even if your dataset is small.</p><p><strong>DOS &amp; DON&#8217;TS &#8212; VECTOR STORES &amp; HYBRID SEARCH</strong></p><p><strong>&#9989; What Works</strong></p><p>&#9989; <strong>Implement hybrid search from day one.</strong> Adding sparse search as a retrofit requires index restructuring. The 30&#8211;40% recall improvement is consistent &#8212; build it before launch.</p><p>&#9989; <strong>Use RRF for fusion &#8212; no hyperparameters, consistent results.</strong> The k=60 constant works across domains without tuning.</p><p>&#9989; <strong>Apply metadata pre-filters before ANN search.</strong> Reduces search space, enables access control without post-filtering overhead.</p><p>&#9989; <strong>Choose your production store on day one.</strong> Pinecone for fastest time-to-production. Qdrant for best open-source. pgvector if you&#8217;re already on Postgres at small scale.</p><p><strong>&#10060; What Breaks Retrieval</strong></p><p>&#10060; <strong>Don&#8217;t use dense-only retrieval for enterprise content.</strong> Product codes, clause references, and technical terms are invisible to dense search. BM25 is not optional.</p>
<p>&#10060; <strong>Don&#8217;t mix embedding models between index time and query time.</strong> Different model = different vector space. Every similarity score becomes meaningless.</p><p>&#10060; <strong>Don&#8217;t return top-5 from dense retrieval directly to the LLM.</strong> Retrieve top-50, fuse with RRF, rerank to top-5. The bi-encoder top-5 includes too much irrelevant context.</p><p>&#10060; <strong>Don&#8217;t use delete+insert for index updates.</strong> HNSW graph fragmentation is silent and cumulative.
Always upsert.</p><p>The team couldn&#8217;t understand why the AI failed on specific product code searches when semantic searches worked fine. Dense embeddings turned &#8220;SKU-48291&#8221; into a 1536-dimensional vector semantically adjacent to every other product. BM25 finds it trivially. Hybrid search resolved the issue in one day &#8212; a 35% improvement in recall on exact-match queries.</p><p>#VectorDatabase #VectorSearch #SemanticSearch #HybridSearch #RAG #Pinecone #Weaviate #Qdrant #pgvector #AIEngineering #ProductionML</p><p>@VectorDatabase @VectorSearch @SemanticSearch @HybridSearch @RAG @Pinecone @Weaviate @Qdrant @pgvector @AIEngineering @ProductionML <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;5dffdd35-9c20-4a85-b555-e3865906a28e&quot;}" data-component-name="MentionToDOM"></span> </p>
]]></content:encoded></item><item><title><![CDATA[Data Pipelines for Production RAG]]></title><description><![CDATA[Five stages. Each one determines a specific dimension of index trustworthiness.]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/data-pipelines-for-production-rag</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/data-pipelines-for-production-rag</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Thu, 19 Mar 2026 10:02:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!OoH7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>&#8220;Your RAG answered correctly last week. Now it&#8217;s wrong. The document changed. The index didn&#8217;t.
That&#8217;s a pipeline problem.&#8221;</strong></p><p style="text-align: center;"><strong>80&#8211;90%</strong></p><p style="text-align: center;">Redundant compute eliminated by CDC</p><p style="text-align: center;"><strong>Up to 100&#215;</strong></p><p style="text-align: center;">Speedup from batch embedding over one-at-a-time calls</p><p style="text-align: center;"><strong>ASYNC</strong></p><p style="text-align: center;">Pipeline must never block a user query</p><p style="text-align: center;"><strong>UPSERT</strong></p><p style="text-align: center;">Not delete+insert &#8212; preserves HNSW graph</p><p><strong>THE 5-STAGE PIPELINE</strong></p><p><strong>01- Source Ingestion &#8212; Change Detection, Not Full Rescans</strong></p><p>KEY INSIGHT: polling re-ingests everything. CDC re-ingests only what changed.</p><p>Use Change Data Capture (CDC) via Debezium for databases and webhook listeners for APIs. Track a content hash for every document &#8212; only re-embed when the hash changes. On most enterprise knowledge bases, 80&#8211;90% of documents are unchanged between runs. For file systems (S3, SharePoint), use event-based triggers on put/modify. CDC eliminates re-indexing spikes and the window of degraded retrieval during a full re-scan.</p><p><strong>02- Chunking &#8212; The Most Underestimated Decision in RAG</strong></p><p>MOST UNDERESTIMATED: wrong chunking breaks retrieval more than the wrong embedding model.</p><p>Fixed-size chunking (512&#8211;1024 tokens, 20% overlap) is the default &#8212; and wrong for most structured documents. Hierarchical (parent-child): small chunks (128 tokens) for precise retrieval, large parent chunks (512&#8211;2048 tokens) returned to the LLM &#8212; best for legal, financial, technical docs. Semantic chunking: splits on meaning boundaries, requires an extra embedding pass. Sentence window: embeds single sentences, returns the N surrounding sentences at retrieval time &#8212; ideal for QA. Match strategy to document type.
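</p><p>A minimal sketch of the fixed-size default described above, assuming the document has already been tokenized into a list. The function name chunk_tokens and the whitespace "tokenizer" in the example are illustrative assumptions, not part of any particular library; a real pipeline would count tokens with the embedding model's own tokenizer.</p>

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks that overlap by overlap_ratio."""
    overlap = int(chunk_size * overlap_ratio)
    step = max(1, chunk_size - overlap)  # guard against a zero or negative stride
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks


# Hypothetical example: whitespace splitting stands in for a real tokenizer.
tokens = "the quick brown fox jumps over the lazy dog again".split()
chunks = chunk_tokens(tokens, chunk_size=5, overlap_ratio=0.2)
```

<p>With a chunk size of 5 and 20% overlap, each chunk shares one token with its predecessor. The same stride logic applies at 512&#8211;1024 tokens; hierarchical and sentence-window strategies need document structure that this sketch deliberately ignores.</p><p>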
This decision cannot be changed later without re-indexing.</p><p><strong>03- Embedding &#8212; Batch Processing and Model Version Discipline</strong></p><p>PERFORMANCE: batching is 50&#8211;100&#215; faster. Model consistency is non-negotiable.</p><p>Never embed chunks individually. Batch 100&#8211;500 chunks per call. text-embedding-ada-002 accepts at most 2048 inputs per call. Local models (BGE, E5): use GPU batch sizes of 64&#8211;256. The same embedding model must be used at index time AND query time &#8212; switching models invalidates the entire index, because every existing vector becomes meaningless in the new embedding space. Treat embedding model changes like database schema migrations.</p><p><strong>04- Metadata Enrichment &#8212; The Layer That Makes Filtering Possible</strong></p><p>UNLOCK: metadata pre-filters reduce ANN search space &#8212; faster, more precise, access-controlled.</p><p>Every chunk must carry: source_url, document_type, created_at, updated_at, section_hierarchy, access_control_labels, language, chunk_index. This enables pre-filtered search: &#8220;find the top-5 most relevant chunks from financial reports published after 2024-01-01 accessible to this user role.&#8221; Access control labels are especially critical &#8212; users must only retrieve authorised documents. Retrofitting this requires full re-indexing.</p><p><strong>05- Index Update &#8212; Upsert, Not Delete+Insert</strong></p><p>STABILITY: delete+insert fragments the HNSW graph. Upsert preserves index integrity.</p><p>Delete+insert on HNSW-indexed stores fragments the graph, degrades search performance over time, and creates consistency windows where documents aren&#8217;t findable. Always use upsert operations. For major document rewrites: soft-delete (mark old chunks as superseded), insert new chunks, then run background compaction during off-peak hours.
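</p><p>The content-hash check from stage 01 and the upsert rule from this stage can be combined roughly as follows. This is a sketch under stated assumptions: a plain dict stands in for the vector store, and upsert_if_changed and fake_embed are hypothetical names introduced only for illustration. A real store exposes its own upsert call.</p>

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def upsert_if_changed(index, chunk_id, text, embed):
    """Re-embed and upsert a chunk only when its content hash has changed."""
    digest = content_hash(text)
    existing = index.get(chunk_id)
    if existing is not None and existing["hash"] == digest:
        return False  # unchanged: skip the embedding call entirely
    # Overwrite in place under the same ID instead of delete-then-insert.
    index[chunk_id] = {"hash": digest, "vector": embed(text), "text": text}
    return True


def fake_embed(text):
    """Hypothetical stand-in for a real embedding model."""
    return [float(len(text))]


index = {}  # assumption: a dict plays the role of the vector store
first = upsert_if_changed(index, "doc-1#0", "refund policy v1", fake_embed)
second = upsert_if_changed(index, "doc-1#0", "refund policy v1", fake_embed)
third = upsert_if_changed(index, "doc-1#0", "refund policy v2", fake_embed)
```

<p>The second call returns False and never touches the embedding function, which is where the saving on unchanged documents comes from.</p><p>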
This gives consistency, searchability, and index health simultaneously.</p><p><strong>CHUNKING STRATEGY QUICK REFERENCE</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!OoH7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!OoH7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 424w, https://substackcdn.com/image/fetch/$s_!OoH7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 848w, https://substackcdn.com/image/fetch/$s_!OoH7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 1272w, https://substackcdn.com/image/fetch/$s_!OoH7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!OoH7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png" width="943" height="303" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:303,&quot;width&quot;:943,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57748,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thedigitalshiftaiwithashish.substack.com/i/191437683?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!OoH7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 424w, https://substackcdn.com/image/fetch/$s_!OoH7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 848w, https://substackcdn.com/image/fetch/$s_!OoH7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 1272w, https://substackcdn.com/image/fetch/$s_!OoH7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8aed3ad4-9d13-47d1-93ce-899b31a160cd_943x303.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p><strong>DOS &amp; DON&#8217;TS &#8212; DATA PIPELINES</strong></p><p><strong>&#9989; What Works</strong></p><p>&#9989; <strong>Use CDC for index updates, not scheduled polls.</strong> Eliminates 80&#8211;90% of redundant embedding compute. Keeps index fresh without spikes.</p><p>&#9989; <strong>Match chunking strategy to document type.</strong> One decision guide covers 90% of production cases. Fixed-size for prose; hierarchical for structured enterprise docs.</p><p>&#9989; <strong>Batch embeddings at 100&#8211;500 per call.</strong> For 100k documents, the difference between individual and batched calls is hours versus minutes.</p><p>&#9989; <strong>Enrich every chunk with access control metadata.</strong> Retrofitting this later requires full re-indexing.
Build it on day one.</p><p><strong>&#10060; What Kills Pipeline Reliability</strong></p><p>&#10060; <strong>Don&#8217;t put the pipeline in the query path.</strong> Chunking, embedding, and index updates are background operations. Coupling them to query latency breaks SLAs.</p><p>&#10060; <strong>Don&#8217;t switch embedding models without re-indexing.</strong> Treat it as a schema migration. Old and new vectors are incomparable &#8212; every similarity score becomes meaningless.</p><p>&#10060; <strong>Don&#8217;t use delete+insert for index updates.</strong> It fragments the HNSW graph. Always upsert.</p><p>&#10060; <strong>Don&#8217;t skip content hash tracking.</strong> Without it, you re-embed unchanged documents on every run.</p><p>The RAG demo answered every question correctly. Three weeks after launch, it started giving wrong answers on updated documents. Nobody had built the pipeline to detect changes and re-embed. Stale retrieval doesn&#8217;t announce itself &#8212; it just quietly erodes trust.</p><p>#DataEngineering #DataPipelines #MLOps #Kafka #RAG #VectorSearch #LLMEngineering #EmbeddingModels #EnterpriseAI</p><p>@DataEngineering @DataPipelines @MLOps @Kafka @RAG @VectorSearch @LLMEngineering @EmbeddingModels @EnterpriseAI <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;7bdcab0a-085e-4544-b642-342519f18c51&quot;}" data-component-name="MentionToDOM"></span> </p>
]]></content:encoded></item><item><title><![CDATA[AI Microservices Architecture]]></title><description><![CDATA[Four independently scalable services.
One coherent production system.]]></description><link>https://thedigitalshiftaiwithashish.substack.com/p/ai-microservices-architecture</link><guid isPermaLink="false">https://thedigitalshiftaiwithashish.substack.com/p/ai-microservices-architecture</guid><dc:creator><![CDATA[The Digital AI With Ashish]]></dc:creator><pubDate>Wed, 18 Mar 2026 10:02:45 GMT</pubDate><content:encoded><![CDATA[<p><strong>&#8220;Your RAG system isn&#8217;t one service. It&#8217;s four services that need to scale independently. Most teams build it as one. That&#8217;s why it breaks at scale.&#8221;</strong></p><p><strong>THE CORE PROBLEM</strong></p><p>Embedding, retrieval, reranking, and generation have completely different scaling characteristics. Retrieval runs at 100 RPS. Generation runs at 10 RPS.
If they share the same service, you must scale the expensive generation stack just to absorb cheap retrieval traffic.</p><p style="text-align: center;"><strong>100 RPS</strong></p><p style="text-align: center;">Retrieval service target throughput</p><p style="text-align: center;"><strong>10 RPS</strong></p><p style="text-align: center;">Generation service target throughput</p><p style="text-align: center;"><strong>60&#8211;100ms</strong></p><p style="text-align: center;">Compound latency saved by gRPC over REST (4 hops)</p><p style="text-align: center;"><strong>~3ms</strong></p><p style="text-align: center;">API Gateway latency budget</p><p><strong>THE FOUR SERVICES &#8212; ARCHITECTURE BREAKDOWN</strong></p><p><strong>Service 01 &#183; API Gateway</strong></p><p>Request Ingestion, Auth, Rate Limiting &amp; Routing</p><p>API Gateway (Kong / AWS API GW) is the single entry point. Validates auth, enforces rate limits, routes requests to downstream services. Session context cached in Redis &#8212; shared across all service boundaries. Bloom filter for duplicate detection: a repeat of the same query within 30s is served from cache. Never let unauthenticated requests reach the model layer.</p><p><strong>Service 02 &#183; Embedding Service</strong></p><p>Query Vectorization &#8212; GPU-Backed, Independently Scalable</p><p>Converts the incoming query into a vector using the exact same embedding model used at index time. This is non-negotiable &#8212; switching embedding models invalidates the entire index. Scales independently from retrieval. At peak load you may need 5 embedding replicas but only 2 retrieval replicas. Deployed as a separate K8s Deployment with GPU node affinity. Exposes a gRPC endpoint, not REST.</p><p><strong>Service 03 &#183; Retrieval Service</strong></p><p>Hybrid ANN + BM25 Search &#8212; Highest RPS in the Pipeline</p><p>Applies metadata pre-filters, runs HNSW ANN search in parallel with BM25 sparse keyword search, fuses both lists via RRF, and returns the top-50 candidates to the reranker.
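</p><p>A minimal sketch of the RRF fusion step, assuming each retriever returns an ordered list of document IDs. The function name rrf_fuse and the sample IDs are illustrative; k=60 is the commonly used constant, and each document scores the sum of 1 / (k + rank) over the lists it appears in.</p>

```python
from collections import defaultdict


def rrf_fuse(ranked_lists, k=60, top_n=50):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        # Ranks are 1-based: the best hit in each list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


dense_hits = ["doc_a", "doc_b", "doc_c"]   # hypothetical ANN results
sparse_hits = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 results
fused = rrf_fuse([dense_hits, sparse_hits])
```

<p>Documents found by both retrievers (doc_a, doc_c) rise above those found by only one, without any score normalization between the dense and sparse lists.</p><p>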
This service typically runs at 5&#8211;10&#215; the RPS of generation. Scales horizontally without GPU &#8212; CPU-optimized. Circuit breaker (Istio) prevents cascade failures if the vector store degrades.</p><p><strong>Service 04 &#183; Reranking + Generation</strong></p><p>Cross-Encoder &#8594; Context Assembly &#8594; LLM Response</p><p>Cross-encoder reranker sees query + chunk together &#8212; joint relevance scoring. Reranks top-50 to top-5 in 50&#8211;100ms. Context assembled with the most relevant chunks at the start and end. Token budget: 30&#8211;40% reserved for generation. LLM instructed to cite sources and stay grounded. Generation is GPU-intensive, expensive, low-throughput &#8212; run fewer replicas with queue-based backpressure.</p><p><strong>GRPC + ISTIO &#8212; WHY THEY MATTER</strong></p><p><strong>gRPC vs REST</strong></p><p>15&#8211;25ms saved per hop. Binary protocol, no text parsing, HTTP/2 multiplexed connections. At 4 service hops per query, that&#8217;s a permanent 60&#8211;100ms compound saving &#8212; the difference between meeting and missing an SLA.</p><p><strong>Istio Service Mesh</strong></p><p>Handles retries with exponential backoff, circuit breaking when downstream services degrade, and mutual TLS between all services. Never write retry logic in application code when you have a service mesh &#8212; you&#8217;ll implement it inconsistently.</p><p><strong>Independent Scaling</strong></p><p>Retrieval: CPU-optimized, high RPS. Generation: GPU-backed, low RPS. Bundling them means scaling the expensive component to handle the volume of the cheap one &#8212; a permanent cost penalty.</p><p><strong>Redis Cross-Service State</strong></p><p>Conversation history and query context in Redis &#8212; accessible by all services without direct coupling.
Semantic cache on retrieval: ~30% of production queries served from cache on most enterprise knowledge bases.</p><p><strong>DOS &amp; DON&#8217;TS &#8212; MICROSERVICES</strong></p><p><strong>&#9989; Production Rules</strong></p><p>&#9989; <strong>Use gRPC between all internal services.</strong> 15&#8211;25ms saved per hop. At 4 hops, that&#8217;s 60&#8211;100ms. A configuration change, not a rewrite.</p><p>&#9989; <strong>Deploy each AI capability as an independent K8s Deployment.</strong> Independent scaling, independent rollouts, independent failure domains.</p><p>&#9989; <strong>Cache aggressively at the retrieval layer.</strong> Redis semantic cache serves ~30% of queries without touching the vector store.</p><p>&#9989; <strong>Use Istio for retries and circuit breaking.</strong> Never implement retry logic in application code &#8212; it&#8217;s inconsistent and hard to tune.</p><p><strong>&#10060; What Breaks at Scale</strong></p><p>&#10060; <strong>Don&#8217;t bundle embedding + retrieval + generation in one service.</strong> You&#8217;ll scale the most expensive component to handle the load of the cheapest one.</p><p>&#10060; <strong>Don&#8217;t use REST+JSON for internal service communication.</strong> Text parsing overhead is invisible in development and painful in production.</p><p>&#10060; <strong>Don&#8217;t let cold-start containers serve live traffic.</strong> Pre-warm all AI service instances. A cold GPU container takes 30&#8211;90 seconds &#8212; that&#8217;s a failed request.</p><p>&#10060; <strong>Don&#8217;t skip distributed tracing.</strong> Without request tracing across all 4 services, latency debugging is guesswork.</p><p>The architecture mistake I see most often: one Python service that embeds, retrieves, reranks, and generates. It works in the demo. It fails in production the moment you try to scale retrieval without scaling generation. Separate the services.
Scale them independently.</p><p>&#8212; Production AI architecture review, enterprise knowledge platform</p><p>#Microservices #AIEngineering #MLOps #Kubernetes #RAG #LLMEngineering #gRPC #ProductionAI #TechLeadership</p><p>@Microservices @AIEngineering @MLOps @Kubernetes @RAG @LLMEngineering @gRPC @ProductionAI @TechLeadership</p><p> <span class="mention-wrap" data-attrs="{&quot;name&quot;:&quot;Product Management with Mani&quot;,&quot;id&quot;:390487508,&quot;type&quot;:&quot;user&quot;,&quot;url&quot;:null,&quot;photo_url&quot;:&quot;https://substackcdn.com/image/fetch/$s_!wKto!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb252a93e-f1b2-4f9b-b282-3258f61e8ed0_1080x1080.png&quot;,&quot;uuid&quot;:&quot;68a32174-bdbf-43a8-9990-22bac7de0864&quot;}" data-component-name="MentionToDOM"></span> </p>
]]></content:encoded></item></channel></rss>