Conversational APIs break most of the evaluation methods that search and recommendation teams have built up over the past two decades. When we started working on the Foursquare ASK API, our natural language place search endpoint, we quickly learned that we would need a new approach to evaluation. We couldn’t label a static ground-truth set the way we did for keyword search, because the query surface was effectively infinite. And in the pre-launch period, we didn’t yet have enough traffic to lean on behavioral signals like clicks and dwell time. To meet our quality bar before taking this offering to customers, we knew we needed an evaluation we could defend.
So we built our own solution. In this blog, we’ll walk through what we built, what it produces, and what we’ve learned along the way. Our approach focuses on the use case of place search, but we hope our journey inspires evaluation methods specific to your own products, constraints, and users.
Why conversational APIs were hard for us to evaluate
Traditional search evaluation leans on three distinct areas: labeled relevance judgments, behavioral signals like clicks and dwell time, and well-understood ranking metrics like NDCG or MRR. For the ASK API, all three were strained.
The query surface was wider than we were used to. “Italian restaurants near me” is trivial to label, whereas “somewhere fun for a first date that is not too loud and has good cocktails in the Mission” is not. Labeling the latter class at scale was expensive, and we watched labels age quickly as the system evolved.
The answer was not a single item. The user experience depends on intent interpretation, ranking, breadth of coverage, whether the system explains itself, and whether it stays in the right geography. We saw plenty of cases where a system got the top result right and still produced a bad experience by missing everywhere else.
And the definition of correct was contextual. For “emergency plumber tonight”, urgency dominated; for “cocktail lounge in Miami”, breadth and variety mattered more than any single best answer. A metric that treated these the same would have systematically misled us.
All of this meant the framework we built had to handle that nuance while being cost-effective enough to run on every model or retrieval change, and rigorous enough that we’d trust its verdicts.
With that context, let’s dive into our solution and learnings along the way.
We compare, we don’t score
The first (and most vital) decision was to structure the evaluation as pairwise comparisons between two versions of the system, not as absolute scores against an ideal.
Pointwise scoring (“rate this result list from 1 to 5”) initially sounded simpler, but was noisier than we wanted. Judges—human or model—drifted in their calibration across sessions, across query types, and across days. Small but real improvements got lost in that drift. Pairwise comparison sidestepped the problem: the judge didn’t need to agree with itself on an absolute scale, only on which of two lists was better for a given query, and that turned out to be a much more stable task.
Pairwise also mapped more cleanly to the question we actually needed to answer. Instead of abstractly asking “is this good?”, we were asking “is this better than what we had before?” Pairwise evaluation answered that question directly.
How we use the judge
For anything above a modest scale, we needed an automated judge. We use an LLM as the judge, which trades away some per-case nuance for the ability to evaluate thousands of queries repeatably. That tradeoff has proven worthwhile, but only because of our discipline in how we configure the judge—and we learned to be disciplined by making the mistakes first.
Three things have ended up mattering (a minimal sketch of how they fit together follows this list):
- Determinism. Temperature set low, seeds fixed, prompts version-controlled. If we rerun the same evaluation tomorrow with no code changes, we want the same verdicts. Early on, we weren’t careful enough about this, and we couldn’t tell whether a change in the numbers came from the system under test or from judge variability. Once we tightened up the config, we could track progress over time without second-guessing it.
- Blinding. The judge never sees which system produced which result list. We shuffle the two sides on every query and strip any metadata that could leak the source. None of this is novel—it’s standard—but we underestimated early on how strong the bias toward the new one can be, and it was worth the engineering cost to do it properly.
- Calibration against human judgments. We check that the judge’s verdicts agree with thoughtful human reviewers on a sampled subset. When the judge has systematically over-weighted one dimension or misread a class of queries, we’ve wanted to know before we used it to make product decisions. We’ve caught issues this way, and they’ve always been worse than we expected going in.
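To make that concrete, here is a minimal sketch of a blinded, deterministic pairwise judgment. The client wrapper, prompt text, and JSON schema are illustrative assumptions, not our production code:

```python
import hashlib
import json
import random

JUDGE_PROMPT_VERSION = "v7"  # hypothetical tag; prompts live in version control


def call_judge_model(prompt: str) -> str:
    """Hypothetical wrapper around your LLM client, assumed configured with
    temperature near zero and a fixed seed so reruns give the same verdicts."""
    raise NotImplementedError


def judge_pair(query: str, results_old: list[dict], results_new: list[dict]) -> dict:
    # Blinding: which system appears as side "A" is decided per query,
    # derived deterministically from the query text so reruns shuffle identically.
    flip_seed = int(hashlib.sha256(query.encode()).hexdigest(), 16)
    flipped = random.Random(flip_seed).random() < 0.5
    side_a, side_b = (results_new, results_old) if flipped else (results_old, results_new)

    # Strip metadata that could leak which system produced a list.
    def scrub(results: list[dict]) -> list[dict]:
        return [{"name": r["name"], "snippet": r.get("snippet", "")} for r in results]

    prompt = (
        f"[judge prompt {JUDGE_PROMPT_VERSION}]\n"
        f"Query: {query}\n"
        f"List A: {json.dumps(scrub(side_a))}\n"
        f"List B: {json.dumps(scrub(side_b))}\n"
        "Which list better serves the query? Reply with JSON only: "
        '{"winner": "A" | "B" | "tie", "confidence": <0.0-1.0>}'
    )
    verdict = json.loads(call_judge_model(prompt))

    # Un-blind: map A/B back to old/new before recording the verdict.
    label = {"A": "new" if flipped else "old", "B": "old" if flipped else "new"}
    winner = verdict["winner"] if verdict["winner"] == "tie" else label[verdict["winner"]]
    return {"query": query, "winner": winner, "confidence": float(verdict["confidence"])}
```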
What worked for us
We ask the judge for its confidence on every call, not just the winner. That one change unlocked one of the most useful views we built (more on that below). A judgment is only as useful as the confidence behind it, and the models we’ve used are able to produce a reasonable self-reported confidence when prompted for it.
Why we score multiple dimensions
A single “which is better” verdict is the goal, but on its own it hides where the difference actually comes from. So we ask the judge to score each list along five dimensions that, in our work on place search, have captured the real user experience well:
- Intent match — Do the results match what the user actually asked for?
- Ranking quality — Are the best results surfaced at the top?
- Coverage and diversity — Does the list cover the relevant space well without becoming repetitive?
- Justification quality — Do supporting snippets or rationale help explain why each result is relevant?
- Geo correctness — Are results correctly localized when geography is explicit in the query?
These dimensions are specific to place search. What’s worked well for us is maintaining a small, fixed set of axes that collectively define what “good” looks like for our product—and scoring each one independently. This allows us to understand why a version performs better, not just that it does. For example, in our most recent ASK release, intent-match showed +5 deltas on queries like “family dentist near Sunnyvale, CA” and “zoo tickets in San Diego, CA”—cases where the previous system returned the wrong category of place entirely. Geo-correctness improvements were most noticeable in comparison queries such as “cheaper than X in city”, where staying anchored to the correct local market was critical. Those kinds of insights would have been difficult to capture using a single relevance score.
The dimensions also allow us to build a composite relevance score—an average of the per-dimension scores on a 0–1 scale. This approach smooths out the noise of binary win/loss counts and provides a more continuous quality signal for tracking performance over time.
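As a sketch, the composite is nothing more exotic than a rescaled mean. Assuming each dimension is scored on a 1–5 scale (consistent with the intent-match numbers later in this post), it might look like:

```python
DIMENSIONS = [
    "intent_match",
    "ranking_quality",
    "coverage_diversity",
    "justification_quality",
    "geo_correctness",
]


def composite_score(scores: dict[str, float], lo: float = 1.0, hi: float = 5.0) -> float:
    """Mean of the per-dimension judge scores, rescaled to 0-1.

    The 1-to-5 per-dimension scale is an illustrative assumption."""
    return sum((scores[d] - lo) / (hi - lo) for d in DIMENSIONS) / len(DIMENSIONS)


def composite_delta(new: dict[str, float], old: dict[str, float]) -> float:
    """Per-query delta, new minus old, fed into the statistical views below."""
    return composite_score(new) - composite_score(old)
```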
How we try to earn our conclusions
This was the hardest part for us to get right, and the one where we see teams (including past versions of ourselves) most often come up short. The trap is familiar: you produce a number, the number goes up, the team ships. But “went up” isn’t the same as “went up more than random variation would explain.”
Three statistical views we’ve come to rely on, and what they showed on our most recent ASK release:
| View | What it tells us | Result |
|---|---|---|
| Sign test on pairwise wins | Is the win margin unlikely to be random? | p = 0.0017 |
| Wilcoxon on composite deltas | Does the margin hold when we weight by size? | p = 0.0001 |
| Bootstrap 95% CI on composite delta | What’s the plausible range of the gain? | [+0.037, +0.099] |
The sign test answers whether one version wins more often than chance. The Wilcoxon test uses the size of each win, not just the direction, so that a lot of small wins don’t outweigh a few big losses. The confidence interval tells us what the real improvement most plausibly is: if the interval excludes zero, we believe the effect is real; if the lower bound is only marginally above zero, we try to be honest about having a small effect dressed up in large-sample confidence.
Each view answers a different question: does one version win more often, does it win by more, and is the average improvement itself meaningfully different from zero? We’ve learned to require a “yes” on all three before calling an improvement “better”. On the ASK release above, all three views agreed, and the lower bound of the CI sat comfortably above zero.
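With SciPy, the three views take only a few lines. A minimal sketch over per-query composite deltas (new minus old), under the scoring assumptions above:

```python
import numpy as np
from scipy import stats


def evaluate_release(deltas: np.ndarray, n_boot: int = 10_000, seed: int = 0) -> dict:
    """Three statistical views over per-query composite deltas (new - old)."""
    nonzero = deltas[deltas != 0]  # ties carry no direction; drop them
    wins = int((nonzero > 0).sum())

    # 1. Sign test: is this many wins unlikely under a fair coin?
    sign_p = stats.binomtest(wins, n=len(nonzero), p=0.5).pvalue

    # 2. Wilcoxon signed-rank: does the margin hold when wins are weighted by size?
    wilcoxon_p = stats.wilcoxon(nonzero).pvalue

    # 3. Bootstrap 95% CI on the mean delta: the plausible range of the gain.
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(deltas, size=len(deltas), replace=True).mean()
        for _ in range(n_boot)
    ])
    ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])

    return {"sign_p": sign_p, "wilcoxon_p": wilcoxon_p, "ci95": (ci_low, ci_high)}
```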
Why we slice by query complexity
An additional lesson we learned early: aggregate wins can hide asymmetric tradeoffs. A common failure mode in ranking systems is a change that makes hard queries much better, easy queries slightly worse, and still looks positive in aggregate. For a product where easy queries dominate real-world usage, that’s a silent regression we don’t want to miss.
Given this, we slice our benchmark into low, medium, and high complexity tiers and look at wins, composite deltas, and intent-match distributions in each. A change we’re comfortable shipping has to show meaningful gains at high complexity and not regress—ideally gain modestly—at low complexity. A rescue/loss analysis (how many easy queries moved from bad to good versus good to bad) has turned out to be a particularly sharp ship-safety check, because it surfaces movement that aggregate scores can mask.
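A sketch of that rescue/loss check, assuming a per-query quality score and an illustrative threshold for “good”:

```python
def rescue_loss(old_scores: list[float], new_scores: list[float],
                good_threshold: float = 3.5) -> dict:
    """Count queries that crossed the quality bar in each direction.

    The 3.5 threshold on the intent-match scale is an assumption for
    illustration; use whatever boundary separates bad from good for
    your product."""
    pairs = list(zip(old_scores, new_scores))
    rescued = sum(old < good_threshold <= new for old, new in pairs)
    lost = sum(new < good_threshold <= old for old, new in pairs)
    return {
        "rescued_pct": 100 * rescued / len(pairs),
        "lost_pct": 100 * lost / len(pairs),
    }
```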
On our most recent ASK release, the slice looked like this:
| Complexity | Net win rate | Composite lift |
|---|---|---|
| High | +27.5 pts | +0.101 |
| Medium | +5.4 pts | +0.038 |
| Low | +24.6 pts | +0.064 |
Within the low-complexity slice, the mean intent-match score rose from 4.23 to 4.54, and 13.1% of easy queries were rescued from a bad state while only 8.2% moved the other way. That shape—biggest gains on hard queries, no meaningful degradation on easy ones, and in fact net rescues there too—is what a change we’re ready to ship tends to look like for us.
Why we look at the high-confidence subset
Remember the confidence scores we ask the judge to produce? This is where they earn their keep.
After we compute headline results across the full benchmark, we rerun the same analysis on the subset of queries where the judge reported high confidence in its own call (a sketch of this rerun follows the list). There are two patterns we look for:
- If the improvement shrinks or disappears in the high-confidence subset, our overall win is being driven by borderline calls where the judge could plausibly have gone the other way. We treat that as a weak signal.
- If the improvement gets stronger in the high-confidence subset, the clearest cases—the ones the judge was most certain about—are driving the result. That’s a strong signal.
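A minimal sketch of that subset rerun, reusing the verdict records from the judge sketch earlier:

```python
def net_win_rate(verdicts: list[dict]) -> float:
    """Net win rate in points: (new wins - old wins) / total, as a percentage."""
    wins = sum(v["winner"] == "new" for v in verdicts)
    losses = sum(v["winner"] == "old" for v in verdicts)
    return 100 * (wins - losses) / len(verdicts)


def high_confidence_view(verdicts: list[dict], threshold: float = 0.80) -> dict:
    """Compare the headline metric against the judge's most confident calls."""
    confident = [v for v in verdicts if v["confidence"] >= threshold]
    return {
        "full_net_win": net_win_rate(verdicts),
        "confident_net_win": net_win_rate(confident),
        "non_tie_rate": 100 * sum(v["winner"] != "tie" for v in confident) / len(confident),
    }
```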
On our most recent ASK release, the full-benchmark net win rate was +18.5%. When we restricted to the subset where the judge reported confidence of 0.80 or higher, the net win rate jumped to +34.0% with a 95.4% non-tie rate, and 94.5% of the overall net wins came from that high-confidence subset. That told us the headline wasn’t a pile of coin-flip calls stacking in one direction by chance: the clearest judgments were pulling in the same direction as the aggregate.
How we treat regressions
Every pairwise comparison produces some number of cases where the new version is worse. Our instinct early on was to minimize that number. We’ve learned that the better instinct is not to explain those cases away. Handled carefully, the regression set has become the single most useful artifact the evaluation produces, because it tells us exactly what to work on next.
We categorize regressions along two axes: severity (does the user still get a usable answer, or did the product fail?) and reproducibility (does the failure happen every time, or only sometimes?). This gives us four quadrants, each with a different action we take:
| | Reproducible | Non-reproducible |
|---|---|---|
| High severity | Prioritize fixing immediately | Treat as instability (an underrated class of issue) |
| Low severity | Backlog: edges of the envelope | Note and move on |
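As a sketch, the quadrant mapping and the retest that feeds it are simple. The hooks into the system and the failure check are hypothetical stand-ins:

```python
from enum import Enum
from typing import Callable


class Action(Enum):
    FIX_NOW = "prioritize fixing immediately"
    INSTABILITY = "treat as a non-determinism issue"
    BACKLOG = "backlog: edges of the envelope"
    NOTE = "note and move on"


def is_reproducible(query: str, run_query: Callable, is_failure: Callable,
                    n_retests: int = 3) -> bool:
    """Retest a flagged regression: does the failure recur on every rerun?

    run_query and is_failure are hypothetical hooks into your system and
    your failure check."""
    return all(is_failure(query, run_query(query)) for _ in range(n_retests))


def triage(high_severity: bool, reproducible: bool) -> Action:
    """Map a flagged regression onto the four quadrants above."""
    if high_severity:
        return Action.FIX_NOW if reproducible else Action.INSTABILITY
    return Action.BACKLOG if reproducible else Action.NOTE
```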
In our most recent cycle of work, fewer than 9% of flagged regression cases landed in the critical-or-high-severity quadrant; the vast majority still produced usable results with varying degrees of quality degradation. The compound-concept pattern—where a query like “arcade bar in Los Angeles, CA” returns one strong match and then falls back to generic bars—sat in the reproducible quadrant and is now a named workstream on our roadmap. The non-reproducible cases—where severe misses on queries like “pizza places with gluten-free crust in Denver, CO” couldn’t be re-created on retest—pushed us to treat non-determinism as a distinct class of issue rather than filing it under regressions. Without the structured look, we’d have been working off anecdotes.
What this has given us
The payoff from this kind of framework has been bigger for us than any one result it produces.
We iterate faster. When every model or retrieval change goes through the same benchmark, we spend less time trying to determine whether a change is actually producing better results and more time shipping the ones that are.
We make more quantitative product decisions. Rather than debates led by opinion or intuition, our discussions now sound like: “We improved the composite score by 10% with a confidence interval that excludes zero, and 94.5% of the net wins came from the judgments the evaluator was most sure about.” This changes how we talk about quality, both internally and with customers.
We get an honest read on work to be done. In practice, the regression set is one of the most useful outputs the framework produces. It tells us what the next cycle of work should be, and it keeps us honest about the gap between where the product is and where we want it to be. The framework gives us defensible evidence of quality improvements for ourselves, customers, and partners alike, turning quality from a claim into a measurement.
The Foursquare ASK API is our natural language place search endpoint – powered by improvements validated with our evaluation framework. Working on something similar and want to trade notes?