What Is the Mean Average Precision (MAP) Metric and How to Calculate It?

Pratik Bhavsar

Evals & Leaderboards @ Galileo Labs

Your RAG pipeline retrieved five documents for a customer query, but the two most relevant ones landed at positions four and five. The LLM never used them. Your response quality tanked, and you had no metric granular enough to explain why.

Mean Average Precision (MAP) captures exactly this failure mode. Unlike single-threshold metrics that only check whether relevant results appear somewhere in the list, MAP measures where they appear, penalizing systems that bury critical context below the cutoff. This guide covers how MAP works, how to calculate it, how it compares to NDCG and MRR, and how to use it as a regression metric for retrieval quality.

TLDR:

  • MAP averages precision scores at each position where a relevant result appears, rewarding systems that rank relevant content early

  • Unlike NDCG (graded relevance) or MRR (first relevant result only), MAP evaluates all relevant items with binary relevance labels

  • MAP@K is the variant that matters for RAG pipelines, where K matches your context window size

  • Use MAP as a regression metric in CI/CD to catch retrieval quality degradation before it reaches production

  • Galileo's Metrics Engine supports retrieval quality evaluation with Context Adherence, Chunk Relevance, and Completeness metrics

What Is Mean Average Precision

Mean Average Precision evaluates ranking tasks by calculating the average of Average Precision (AP) scores across a set of queries. It provides a single score that combines relevance detection with position sensitivity, telling you how effectively your system ranks relevant results near the top. 

Originally standardized through TREC benchmarks at NIST, MAP has become the default metric for comparing retrieval system quality across both academic research and production deployments.

When a search engine returns ten documents, MAP cares not just whether relevant documents appear, but where they appear. A relevant document ranked first contributes more than the same document ranked tenth. This makes MAP particularly valuable for RAG pipelines, search engines, and recommendation systems where ranking order directly affects output quality.

MAP for Ranking Systems vs. Object Detection

MAP has two distinct definitions across ML domains, and conflating them produces meaningless comparisons. In ranking systems (search, recommendation, RAG retrieval), MAP measures how well relevant results are ordered, rewarding systems that surface the right content early. You calculate AP per query based on where relevant documents fall in the ranked list, then average across queries.

In object detection, mAP evaluates spatial accuracy: how well predicted bounding boxes overlap with ground truth at various IoU thresholds. Models like YOLO and Faster R-CNN report mAP@0.5:0.95 on benchmarks like COCO, where the calculation involves precision-recall curves across confidence thresholds and intersection-over-union ranges.

For LLM and agent evaluation, you want ranking MAP. The goal is positioning relevant content where it gets used, not measuring spatial overlap. When reading benchmark papers or configuring evaluation frameworks, verify which definition applies before comparing scores. A "good" mAP in object detection (40-50% on COCO) looks completely different from a good MAP in information retrieval.

What Is a Good MAP Score

No universal threshold defines a "good" MAP score. Performance varies significantly across domains and depends on query difficulty, corpus characteristics, and the number of relevant documents per query. A MAP of 0.4 might be strong for a broad enterprise knowledge base with thousands of overlapping documents, while the same score would be concerning for a curated FAQ retrieval system with clear answers.

Establish baselines through comparative evaluation against published benchmarks on identical datasets rather than seeking absolute thresholds. Track your own MAP over time and focus on relative changes: a 10% drop after a retrieval configuration change matters more than whether your absolute score exceeds a specific number.

How MAP Compares to NDCG and MRR

Choosing the right ranking metric depends on your retrieval strategy and how your system uses the results.

MAP assumes binary relevance, where documents are either relevant or not, and rewards systems that rank all relevant items early. Use it when you have multiple relevant documents per query and care about finding all of them, such as multi-document RAG retrieval.

NDCG (Normalized Discounted Cumulative Gain) handles graded relevance, distinguishing between highly relevant and marginally relevant results. Choose NDCG when relevance varies in degree: product search where some items are perfect matches while others are acceptable alternatives.

MRR (Mean Reciprocal Rank) only considers the first relevant result. Use MRR when you typically need just one good answer: factoid question answering, navigational queries, or single-document retrieval.

For RAG pipelines, the choice depends on your context window. If your agent needs multiple supporting documents for reasoning, MAP captures whether all relevant context lands in the window. 

If you're retrieving one authoritative source, MRR is more appropriate. If your relevance labels distinguish between primary and supporting sources, NDCG provides finer-grained signal. Galileo's Context Precision metric complements all three by measuring whether retrieved chunks actually contain information relevant to the query, regardless of ranking position.
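To make the contrast concrete, here is a minimal sketch computing all three metrics for a single ranked list. It uses plain Python with binary 0/1 relevance labels; the list itself is made up for illustration, and the NDCG shown is the simple binary variant with a log2 discount.

```python
import math

def average_precision(rels):
    """AP: mean of precision at each rank that holds a relevant item."""
    hits, precisions = 0, []
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            precisions.append(hits / k)
    return sum(precisions) / hits if hits else 0.0

def reciprocal_rank(rels):
    """RR: 1 / rank of the first relevant item (0 if none appears)."""
    for k, rel in enumerate(rels, start=1):
        if rel:
            return 1.0 / k
    return 0.0

def ndcg(rels):
    """Binary NDCG: DCG of the list divided by DCG of the ideal ordering."""
    dcg = sum(rel / math.log2(k + 1) for k, rel in enumerate(rels, start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum(rel / math.log2(k + 1) for k, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

# Hypothetical ranked list: relevant docs at positions 1 and 4
rels = [1, 0, 0, 1, 0]
print(average_precision(rels))  # 0.75   -> penalizes the late second hit
print(reciprocal_rank(rels))    # 1.0    -> only sees the first hit
print(ndcg(rels))               # ~0.877 -> discounts the late hit logarithmically
```

Note how MRR returns a perfect 1.0 here because the first result is relevant, while MAP and NDCG both penalize the second relevant document sitting at position 4: each metric is answering a different question about the same ranking.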

How to Calculate Mean Average Precision

The calculation follows two steps: compute Average Precision (AP) for individual queries, then average across all queries for the final MAP score.

Average Precision Formula

Average Precision for a single query is calculated as:

AP = (1 / number of relevant documents) × Σ P(k) × rel(k)

Where P(k) is precision at cutoff k (relevant items in top k results divided by k), and rel(k) is 1 if the item at rank k is relevant, 0 otherwise. The sum runs over all positions in the ranked list. The formula only accumulates precision at positions where relevant documents appear, making it sensitive to ranking quality rather than just total recall.

The multiplication by rel(k) means irrelevant results at any position don't inflate the score. A system that returns five relevant documents in positions 1-5 earns a perfect AP of 1.0, while one that scatters them across positions 1, 5, 10, 15, and 20 scores much lower despite retrieving the same number of relevant items.

MAP is then the mean of AP scores across all queries: MAP = (1/Q) × Σ AP_i, where Q is the total number of queries in your evaluation set. Libraries like scikit-learn provide built-in implementations through average_precision_score, while pytrec_eval and TorchMetrics offer specialized variants for information retrieval and deep learning workflows respectively.
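The formula translates almost line for line into code. Below is an illustrative pure-Python sketch rather than a library call; scikit-learn's `average_precision_score` agrees with it on binary labels when every relevant document appears in the ranked list with a distinct score. The optional `num_relevant` parameter is an assumption worth noting: passing the true number of relevant documents for the query (rather than just the count retrieved) penalizes relevant documents your system never surfaced.

```python
def average_precision(rels, num_relevant=None):
    """AP = (1/R) * sum over ranks k of P(k) * rel(k).

    rels: binary relevance of the ranked list, e.g. [1, 0, 1].
    num_relevant: total relevant docs for the query (R); defaults
    to the number of relevant docs present in the list.
    """
    hits, total = 0, 0.0
    for k, rel in enumerate(rels, start=1):
        if rel:
            hits += 1
            total += hits / k  # P(k), accumulated only where rel(k) = 1
    denom = num_relevant if num_relevant is not None else hits
    return total / denom if denom else 0.0

def mean_average_precision(queries):
    """MAP = mean of per-query AP scores."""
    return sum(average_precision(q) for q in queries) / len(queries)

print(average_precision([1, 0, 1, 0, 1]))  # ~0.756: (1 + 2/3 + 3/5) / 3
```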

Worked Example with Three Queries

Query 1: Retrieved results: [Relevant, Not Relevant, Relevant, Not Relevant, Relevant]

Precision at relevant positions: P(1) = 1/1 = 1.0, P(3) = 2/3 = 0.667, P(5) = 3/5 = 0.6

AP₁ = (1.0 + 0.667 + 0.6) / 3 = 0.756

Query 2: Retrieved results: [Not Relevant, Relevant, Relevant, Not Relevant, Not Relevant]

Precision at relevant positions: P(2) = 1/2 = 0.5, P(3) = 2/3 = 0.667

AP₂ = (0.5 + 0.667) / 2 = 0.583

Query 3: Retrieved results: [Relevant, Relevant, Not Relevant, Relevant, Relevant]

Precision at relevant positions: P(1) = 1.0, P(2) = 1.0, P(4) = 3/4 = 0.75, P(5) = 4/5 = 0.8

AP₃ = (1.0 + 1.0 + 0.75 + 0.8) / 4 = 0.888

Final MAP = (0.756 + 0.583 + 0.888) / 3 = 0.742

Query 3 scores highest because relevant documents cluster near the top. Query 2 scores lowest because the first relevant result doesn't appear until position 2, and only two relevant documents were retrieved total.

MAP@K for Top-K Retrieval

In production RAG systems, context windows impose hard limits on how many retrieved documents actually reach the LLM. MAP@K addresses this by truncating evaluation at position K, ignoring everything below the cutoff.

If your agent's context window fits 5 documents, MAP@10 is meaningless. MAP@5 tells you whether relevant content lands where it actually gets used. Choose K based on your system's constraints: context window size, latency budgets, or UX patterns. Common values include MAP@3 for tight context windows and MAP@10 for more permissive retrieval.

Track multiple K values during development to understand how ranking quality degrades as you move down the results list. A sharp drop between MAP@3 and MAP@10 suggests your retriever finds relevant documents but struggles to rank them near the top, while consistently low MAP across all K values points to a fundamental relevance detection gap in your retrieval model.
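Truncation is the only change from plain AP, but conventions for the denominator differ across libraries. The sketch below uses one common convention, normalizing by min(R, K) so a query with more relevant documents than slots can still score 1.0; check which convention your evaluation framework uses before comparing numbers. The ranked list mirrors the failure mode from the introduction, with relevant documents buried at positions 4 and 5.

```python
def ap_at_k(rels, k, num_relevant=None):
    """AP@K: the AP formula evaluated only on the top-K ranks.

    Normalizes by min(R, K); other tools normalize by R or by the
    number of relevant docs in the top K, so conventions vary.
    """
    r = num_relevant if num_relevant is not None else sum(rels)
    hits, total = 0, 0.0
    for rank, rel in enumerate(rels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    denom = min(r, k)
    return total / denom if denom else 0.0

# Relevant docs buried at positions 4 and 5
rels = [0, 0, 0, 1, 1]
for k in (3, 5, 10):
    print(f"AP@{k}: {ap_at_k(rels, k):.3f}")
# AP@3 is 0.000: if the context window cuts off at 3, the LLM sees nothing useful
```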

Interpreting MAP with Precision-Recall Curves

Precision-recall curves visualize how precision changes as recall increases, giving you a diagnostic lens into why your MAP score is what it is. For a single query, average precision corresponds to the area under this step-shaped curve. Averaging AP scores across queries gives you MAP.

How PR Curves Explain Your MAP Score

If your system consistently ranks relevant items early, your PR curves stay high and MAP follows suit. Curves that drop quickly or appear jagged signal inconsistency; maybe you retrieve some relevant results early but can't sustain that precision as recall increases. A curve that starts high and drops sharply after 30% recall tells a different story than one that maintains moderate precision through 80% recall, even if both produce similar MAP scores.

For RAG workflows, this diagnostic is especially important because you're feeding retrieved content directly into a generative model. If the top-ranked items aren't relevant, the final output suffers regardless of what appears further down the list. 

PR curves let you pinpoint where the breakdown happens: a relevance detection problem means your retriever doesn't recognize relevant documents at all, while a ranking problem means it finds them but places them too low. Each failure mode requires a different fix, and the curve shape tells you which one you're facing.
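You can get those PR points without any plotting library by sweeping the ranked list and recording (recall, precision) after every rank; AP then falls out as the sum of precision at relevant ranks weighted by the recall gained at each, which is the discrete area under the step curve. A minimal sketch, reusing the Query 1 list from the worked example:

```python
def pr_curve(rels):
    """(recall, precision) after each rank of a binary-relevance ranked list."""
    total_relevant = sum(rels)
    hits, points = 0, []
    for k, rel in enumerate(rels, start=1):
        hits += rel
        points.append((hits / total_relevant, hits / k))
    return points

rels = [1, 0, 1, 0, 1]  # Query 1 from the worked example
for recall, precision in pr_curve(rels):
    print(f"recall={recall:.2f}  precision={precision:.2f}")

# AP as area under the step curve: precision at each relevant rank,
# weighted by the recall gained there (1/R per relevant doc)
ap = sum(p / sum(rels) for (r, p), rel in zip(pr_curve(rels), rels) if rel)
print(f"AP = {ap:.3f}")
```

A steeply dropping curve with the same AP as a flatter one is exactly the kind of behavioral difference this view surfaces and a single MAP number hides.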

Using PR Curves to Detect Drift and Compare Models

Two models might have identical MAP scores but very different precision-recall dynamics. One might rank relevant items sharply at the top with precision dropping off steeply after position 3, which is ideal for user-facing search, where only the first few results matter. 

Another might spread relevant items more evenly across the top 10, which could work better for exploratory research tools or multi-document reasoning tasks. PR curves surface those differences so you're comparing retrieval behavior, not just a single aggregate number.

In a live environment, changes to your PR curves over time can reveal model drift, broken retrieval signals, or shifts in user behavior. A comprehensive survey of RAG evaluation methods confirms that traditional metrics like MAP remain dominant in retrieval assessment, though combining them with LLM-based evaluation provides fuller coverage. 

MAP might stay flat for a while, but if early precision is gradually falling, you'll feel it in response quality long before the aggregate number moves. Monitoring precision at fixed recall points across weekly snapshots catches these degradation patterns early.
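That monitoring idea can be sketched as a small helper: for each snapshot, compute precision at the first rank where recall reaches a fixed target, and alert when it slides. The weekly snapshot data below is entirely hypothetical, standing in for relevance labels you would derive from logged queries.

```python
def precision_at_recall(rels, target_recall):
    """Precision at the first rank where recall >= target (None if never reached)."""
    total_relevant = sum(rels)
    hits = 0
    for k, rel in enumerate(rels, start=1):
        hits += rel
        if total_relevant and hits / total_relevant >= target_recall:
            return hits / k
    return None

# Hypothetical weekly snapshots of one query's ranked relevance labels
snapshots = {
    "week-1": [1, 1, 0, 1, 0],
    "week-2": [1, 0, 1, 0, 1],
    "week-3": [0, 1, 0, 1, 1],
}
for week, rels in snapshots.items():
    p = precision_at_recall(rels, target_recall=0.5)
    print(f"{week}: precision@recall>=0.5 = {p:.2f}")
# Early precision slides week over week even though each list
# still retrieves all three relevant documents
```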

Where Mean Average Precision Applies in Production AI

MAP's position sensitivity makes it valuable in systems where ranking order directly affects downstream quality. Two domains stand out for production AI teams.

Search and RAG Retrieval

MAP measures how effectively your retrieval system surfaces relevant documents at the positions that matter. In RAG pipelines, this directly determines whether your LLM receives the context it needs to generate accurate responses. 

A retriever with high recall but poor ranking floods the context window with marginally relevant chunks, diluting the signal your model depends on for grounded generation. When deciding which RAG performance metrics matter most, MAP works best alongside generation-side metrics that measure how well your LLM actually uses the retrieved context.

Autonomous agents in customer service and research workflows rely on MAP-evaluated retrieval to ensure critical details land within the context window where reasoning processes can use them. When your agent needs to synthesize information from multiple sources to answer a complex query, MAP@K reveals whether all the necessary pieces made it into the top K positions. 

Pair MAP with Galileo's Context Adherence and Chunk Relevance metrics to evaluate both retrieval ranking and generation groundedness, or use Chunk Attribution to identify which retrieved chunks actually influenced the generated response.

Recommendation Systems

E-commerce platforms and streaming services use MAP to evaluate how effectively they position relevant items where users will interact with them. Unlike precision@K, which only counts how many relevant items appear in the top K, MAP penalizes systems that place a highly relevant product at position 8 instead of position 2.

MAP analysis guides algorithmic adjustments by measuring both recommendation accuracy and ranking quality. When you A/B test a new recommendation algorithm, MAP@K gives you a single metric to compare ranking quality across variants. Production agents powering personalized assistants use MAP-optimized ranking to recommend actions, tools, or content based on user intent.

Best Practices for Evaluating Retrieval Quality with MAP

Implementing MAP effectively requires more than plugging in a formula. The quality of your evaluation depends on your labels, your cutoff choices, and how you integrate MAP into your workflow.

Building Representative Relevance Labels

Label quality drives MAP accuracy more than any other factor. Inconsistent or incomplete relevance judgments produce scores that mislead rather than inform. Use multiple annotators for ambiguous queries and establish clear relevance criteria before annotation begins: what counts as "relevant" for a customer support query needs explicit definition.

Production feedback loops help refine labels over time. Track which retrieved documents your users actually click, quote, or act on, and feed those signals back into your evaluation datasets. For RAG systems, distinguish between documents that directly answer the query and documents that provide supporting context. 

A document that contains the answer buried in paragraph 12 is technically relevant, but if your chunking strategy splits it poorly, that relevance never reaches the LLM. Galileo's Chunk Utilization metric helps you measure this gap. Align your labels with what your system can actually surface, not what exists somewhere in the corpus.

Choosing the Right K for Your System

Tie K to your system's actual constraints rather than picking a round number. If your LLM's context window fits 4,096 tokens and your average chunk is 512 tokens, MAP@8 is a reasonable starting point. If latency budgets limit you to 3 retrieval calls, MAP@3 is what matters.

Run evaluations at multiple K values to understand where ranking quality degrades. A sharp drop between MAP@3 and MAP@5 suggests your retriever finds relevant documents but struggles to rank them near the top; reranking might solve it. A gradual decline from MAP@3 through MAP@20 indicates a relevance detection gap where your embedding model isn't capturing the right semantic relationships.

Tracking MAP Across Retrieval Changes

Use MAP as a regression metric in your CI/CD pipeline. Before any retrieval configuration change, whether it's a new embedding model, modified chunking strategy, or updated reranker, run your evaluation suite and compare MAP against your baseline. Set degradation thresholds that block merges when MAP drops below acceptable levels.
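The gate itself can be very simple. Below is an illustrative sketch; the baseline value, the 5% relative-drop threshold, and how you persist or load the baseline are all assumptions to adapt to your own pipeline, with the failing branch typically mapped to a nonzero exit code in CI.

```python
def map_regression_gate(current_map, baseline_map, max_drop=0.05):
    """Return True if MAP did not drop more than max_drop (relative) vs. baseline."""
    if baseline_map <= 0:
        return True  # no baseline yet: let this run establish one
    relative_drop = (baseline_map - current_map) / baseline_map
    return relative_drop <= max_drop

# Hypothetical values: current from this eval run, baseline from the last release
baseline, current = 0.74, 0.69
if map_regression_gate(current, baseline):
    print("MAP gate passed")
else:
    print(f"MAP regression: {baseline:.3f} -> {current:.3f}; block the merge")
```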

Galileo's experiment comparison tools let you run these evaluations side-by-side, comparing MAP alongside Completeness and other RAG evaluation metrics across different configurations. For a broader framework on optimizing RAG evaluation, combine MAP with generation-side metrics to cover both retrieval and response quality. Track MAP over time as your corpus grows; what worked well for 10,000 documents may degrade at 100,000.

Building Retrieval Quality into Your Eval Strategy

Ranking quality is the foundation of every RAG pipeline, search system, and recommendation engine you operate. MAP gives you a position-sensitive metric that reveals whether relevant content reaches the positions where it gets used. Combined with precision-recall curve analysis and systematic regression testing, MAP transforms retrieval evaluation from a one-time benchmark into a continuous quality signal.

Galileo provides the evaluation infrastructure to operationalize retrieval quality measurement at production scale:

  • Metrics Engine: 20+ out-of-the-box metrics including Context Adherence, Chunk Relevance, and Completeness for end-to-end RAG evaluation

  • Luna-2 SLMs: Purpose-built evaluation models running at 98% lower cost than LLM-based evaluation with sub-200ms latency

  • Signals: Automatic failure pattern detection that surfaces retrieval quality degradation you didn't know to look for

  • Runtime Protection: Real-time guardrails that intercept low-quality outputs before they reach your users

  • CLHF: Improve metric accuracy from as few as 2-5 annotated examples through continuous learning

  • Experiments: Compare retrieval configurations side-by-side with statistical rigor across multiple metrics

Book a demo to see how Galileo's evaluation platform helps you measure, monitor, and improve retrieval quality across your AI systems.

FAQ

What is the difference between Average Precision and Mean Average Precision? Average Precision (AP) measures ranking quality for a single query by calculating precision at each position where a relevant document appears and averaging those values. Mean Average Precision (MAP) takes the mean of AP scores across all queries in your evaluation set, giving you a single aggregate metric for overall system performance.

How do I choose between MAP, NDCG, and MRR for my retrieval system? Use MAP when you have binary relevance labels and care about finding all relevant documents, which is common in RAG pipelines that need multiple supporting chunks. Use NDCG when relevance varies in degree and you need to distinguish between highly relevant and marginally useful results. Use MRR when only the first relevant result matters, such as in factoid question answering.

What MAP@K value should I use for RAG pipelines? Tie K to your system's actual context window size. If your LLM can process 5 retrieved chunks effectively, evaluate at MAP@5. Running evaluations at multiple K values reveals whether quality issues stem from relevance detection or ranking order, which informs different optimization strategies.

Can MAP measure graded relevance or only binary relevance? Standard MAP uses binary relevance labels where documents are either relevant (1) or not (0). If you need graded relevance, where some results are more relevant than others, NDCG is the better choice. Some implementations support graded MAP variants, but binary MAP remains the standard in information retrieval benchmarks.

How does Galileo evaluate retrieval quality in production RAG systems? Galileo's Metrics Engine provides dedicated RAG metrics including Context Adherence (measuring response groundedness), Chunk Relevance (evaluating retrieval precision), and Completeness (assessing whether responses use all relevant context). These metrics run at production scale via Luna-2 SLMs at 98% lower cost than LLM-based evaluation, with Signals automatically surfacing retrieval quality degradation patterns.
