AMI v1.0 — FEBRUARY 2026

Scoring Methodology

This document defines the three proprietary indices used in the AI Tools Landscape Report: Agent Maturity Index (AMI), Autonomy Risk Index (ARI), and Ecosystem Power Index (EPI). All dimensions, weights, and grading criteria are published here for transparency and reproducibility. Each assessment links to its evidence and source catalog so any claim can be independently verified.

Guiding Principles

  1. Transparency over authority. Every score is decomposable into its dimension scores. Every dimension score links to evidence with cited sources.
  2. Confidence labeling is mandatory. No score is presented without a verified, inferred, or unverified tag.
  3. Scores can change. Indices are versioned. When new evidence arrives, scores update. Assessment diffs track every change.
  4. No pay-for-score. Sponsorship does not affect index scores. Systems are scored identically regardless of commercial relationship.
  5. Methodology evolves. Dimensions and weights may change between versions. All changes are documented with rationale. The spec hash locks methodology integrity.

Confidence Labels

Every dimension score and the overall assessment carry one of three confidence labels:

verified
Based on primary sources: official documentation, published audits, source code, commit logs, or metrics dashboards. Verifiable by third parties.
inferred
Reasonable conclusion drawn from verified data + domain expertise. Example: inferring enterprise maturity from partnerships and pricing tiers. Directionally reliable but not directly verified.
unverified
Based on self-reported claims, marketing materials, or limited public information. Used when better sources are unavailable. Always labeled.

Overall confidence is derived from dimension confidences: high if all scored dimensions are verified, low if any are unverified, medium otherwise.
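The derivation rule above can be sketched as a small function. This is a minimal sketch; the function and label names are illustrative, not part of the published spec:

```python
def overall_confidence(dimension_confidences):
    """Derive overall confidence from per-dimension labels.

    low    -> any scored dimension is unverified
    high   -> all scored dimensions are verified
    medium -> otherwise (e.g. a mix of verified and inferred)
    """
    if any(c == "unverified" for c in dimension_confidences):
        return "low"
    if all(c == "verified" for c in dimension_confidences):
        return "high"
    return "medium"

print(overall_confidence(["verified", "inferred", "verified"]))  # → medium
```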

Index 1: Agent Maturity Index (AMI)

AMI — Agent Maturity Index
Scale: 0-5 per dimension, 0-100 overall

Measures how production-ready an AI agent system is across six dimensions. Each dimension is scored 0-5 against a published rubric, weighted, and aggregated to a 0-100 scale. A high AMI score means the system can be deployed in production with reasonable confidence in reliability, safety, and operational control.

What AMI Is Not

Eligibility

A system must meet all inclusion criteria and trigger no exclusion flags to receive a scored AMI assessment.

Inclusion Criteria

  1. Agent system: Must be an AI agent framework, platform, or orchestration tool (not a raw LLM API)
  2. Public artifact: Must have publicly available code, documentation, or product
  3. Active development: Must show activity within the last 6 months
  4. Identifiable maintainer: Must have a known organization or individual maintainer
  5. Verified sources: Must have >= 3 distinct verifiable sources

Exclusion Flags

Assessment Status Codes

Status | Meaning
scored | Full assessment complete with overall score and grade
under_review | Assessment in progress; dimensions being evaluated
insufficient_evidence | System is eligible but lacks enough verifiable sources for scoring
inactive | System shows no development activity in the last 6 months
excluded | System triggers an exclusion flag and cannot be scored

Dimensions & Weights

AMI evaluates six dimensions. Each is scored 0-5 against a rubric (see below). The two highest-weighted dimensions reflect the most critical production concerns.

Dimension | Weight | What It Measures
Execution Reliability | 20% | Multi-step task completion, error handling, retry logic, graceful degradation
Safety & Guardrails | 20% | Permission models, sandboxing, security audits, secure defaults, compliance
Tooling & Integration Breadth | 15% | Protocol support (MCP, etc.), third-party ecosystem, IDE integration, tool creation SDK
Observability | 15% | Structured logging, execution traces, dashboards, SIEM integration, cost monitoring
Deployment Maturity | 15% | Container support, cloud deployment, Kubernetes, SLA guarantees, disaster recovery
Real-World Validation | 15% | Named deployments, case studies, independent benchmarks, regulatory acceptance

Rubric Scoring (0-5)

Each dimension is scored on a 0-5 integer scale. Each score level has published rubric bullets (e.g., ER3a, SG4b) that assessors must reference. The full rubric table is available in the assessment detail view for each system.

Score | Level | Meaning
0 | None | No evidence of capability
1 | Minimal | Basic capability with significant gaps
2 | Developing | Functional but incomplete
3 | Competent | Meets expectations for production use
4 | Strong | Exceeds expectations with comprehensive coverage
5 | Exemplary | Industry-leading; requires hard evidence (primary source, commit, log, or metric)

Evidence requirements scale with score. A score of 4+ requires >= 2 distinct sources. A score of 5 requires at least one primary or hard-evidence source (commit, log, metric). Every scored dimension must cite rubric bullet IDs.

Scoring Formula & Aggregation

AMI = round( SUM(score_i × weight_i) / 5 × 100 )

Where:
  score_i  = 0-5 integer (per dimension, from rubric)
  weight_i = dimension weight (all weights sum to 1.0)
  5        = maximum possible per-dimension score

Example (OpenClaw, all 6 dimensions scored):
  raw = (3×0.20) + (4×0.15) + (1×0.20) + (3×0.15) + (3×0.15) + (3×0.15)
      = 0.60 + 0.60 + 0.20 + 0.45 + 0.45 + 0.45
      = 2.75
  AMI = round(2.75 / 5 × 100) = 55 → Grade C
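The aggregation formula and worked example can be expressed directly in code. This is a sketch; the dimension keys are shorthand, and the assignment of the scores 4 and 1 to specific dimensions is illustrative, since the worked example lists only the products:

```python
def ami_score(scores, weights):
    """Aggregate per-dimension 0-5 rubric scores into a 0-100 AMI.

    scores and weights are dicts keyed by dimension; weights sum to 1.0.
    """
    raw = sum(scores[d] * weights[d] for d in scores)  # weighted 0-5 value
    return round(raw / 5 * 100)

weights = {
    "execution_reliability": 0.20,
    "safety_guardrails": 0.20,
    "tooling_integration": 0.15,
    "observability": 0.15,
    "deployment_maturity": 0.15,
    "real_world_validation": 0.15,
}
scores = {  # one assignment consistent with the worked example
    "execution_reliability": 3,
    "safety_guardrails": 1,
    "tooling_integration": 4,
    "observability": 3,
    "deployment_maturity": 3,
    "real_world_validation": 3,
}
print(ami_score(scores, weights))  # → 55
```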

Renormalization (Missing Dimensions)

When a dimension cannot be scored (e.g., not_scored_reason: "Private infrastructure, no public evidence"), its weight is redistributed proportionally among scored dimensions. A system with >= 3 unscored dimensions cannot receive a scored status.

renorm_weight_i = original_weight_i / SUM(scored_weights)

Example: If observability (15%) is not scored, remaining weights (85%) renormalize:
  execution_reliability: 0.20 / 0.85 = 0.235
  safety_guardrails:     0.20 / 0.85 = 0.235
  tooling_integration:   0.15 / 0.85 = 0.176
  deployment_maturity:   0.15 / 0.85 = 0.176
  real_world_validation: 0.15 / 0.85 = 0.176
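The renormalization rule, including the 3-plus-unscored cutoff, can be sketched as follows (dimension keys are shorthand, not the spec's identifiers):

```python
def renormalize(weights, unscored):
    """Redistribute unscored dimensions' weight proportionally.

    Returns renormalized weights over scored dimensions only; a system
    with 3 or more unscored dimensions cannot receive a scored status.
    """
    if len(unscored) >= 3:
        raise ValueError("3+ unscored dimensions: cannot produce a scored status")
    scored = {d: w for d, w in weights.items() if d not in unscored}
    total = sum(scored.values())  # e.g. 0.85 when a 15% dimension is dropped
    return {d: w / total for d, w in scored.items()}

weights = {"er": 0.20, "sg": 0.20, "ti": 0.15, "obs": 0.15, "dm": 0.15, "rwv": 0.15}
renormed = renormalize(weights, {"obs"})
print(round(renormed["er"], 3))  # → 0.235
```

Dropping observability (15%) gives execution reliability 0.20 / 0.85 ≈ 0.235, matching the worked example, and the renormalized weights again sum to 1.0.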

Letter Grades

Grade | Score Range
A | 80-100
B | 60-79
C | 40-59
D | 20-39
F | 0-19
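The grade bands above amount to a simple threshold lookup; a minimal sketch:

```python
def letter_grade(ami: int) -> str:
    """Map a 0-100 AMI score to its published letter-grade band."""
    if ami >= 80:
        return "A"
    if ami >= 60:
        return "B"
    if ami >= 40:
        return "C"
    if ami >= 20:
        return "D"
    return "F"

print(letter_grade(55))  # → C (matches the OpenClaw example)
```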

Evidence & Source Tiering

Every dimension score must be backed by evidence items. Each evidence item cites one or more sources from the source catalog.

Source Tiers

Tier | Reliability | Examples
T1 | Primary / Hard evidence | Source code, commit logs, metrics dashboards, audit reports
T2 | Secondary / Independent | Independent analysis, news reports, community benchmarks
T3 | Self-reported | Official marketing, vendor documentation, press releases

Anti-Gaming Gates

The validation system enforces eight gates to prevent score inflation:

  1. No dimension score without evidence
  2. Every evidence item must cite source IDs
  3. Scored dimensions require a confidence level
  4. Aggregation math must match stored values exactly
  5. Score >= 4 requires >= 2 distinct sources
  6. Score 5 requires at least one primary or hard-evidence source
  7. Scored assessments require >= 3 distinct sources total
  8. Scored dimensions must cite rubric bullet IDs

Automated warnings flag dimensions scoring 4+ that are backed only by self-reported sources, and evidence items older than 180 days.
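A validator for these gates might look like the sketch below, which checks a few of the eight gates for a single scored dimension. The field names (score, evidence, source_ids, tier, confidence, rubric_bullets) are assumptions for illustration, not the published schema:

```python
def gate_violations(dim):
    """Return violation messages for one scored dimension (subset of gates)."""
    v = []
    if dim["score"] > 0 and not dim["evidence"]:
        v.append("gate 1: dimension score without evidence")
    if any(not e.get("source_ids") for e in dim["evidence"]):
        v.append("gate 2: evidence item missing source IDs")
    if not dim.get("confidence"):
        v.append("gate 3: scored dimension missing confidence level")
    sources = {s for e in dim["evidence"] for s in e.get("source_ids", [])}
    if dim["score"] >= 4 and len(sources) < 2:
        v.append("gate 5: score >= 4 requires >= 2 distinct sources")
    if dim["score"] == 5 and not any(e.get("tier") == "T1" for e in dim["evidence"]):
        v.append("gate 6: score 5 requires a primary/hard-evidence source")
    if not dim.get("rubric_bullets"):
        v.append("gate 8: scored dimension missing rubric bullet IDs")
    return v

dim = {  # a score of 5 backed by a single self-reported source
    "score": 5,
    "evidence": [{"source_ids": ["s1"], "tier": "T3"}],
    "confidence": "verified",
    "rubric_bullets": ["ER3a"],
}
for msg in gate_violations(dim):
    print(msg)  # flags gates 5 and 6
```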

Versioning & Integrity

Each assessment carries a spec hash linking it to this methodology version, plus a SHA-256 integrity hash of the assessment content. Published assessments require at least one reviewer signature. Assessment diffs show exactly what changed between versions.
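A SHA-256 content hash over an assessment could be computed as in this sketch. The canonicalization step (sorted keys, compact separators) is an assumption for illustration; the published spec defines the authoritative serialization:

```python
import hashlib
import json

def integrity_hash(assessment: dict) -> str:
    """SHA-256 hex digest over a canonical JSON form of the assessment."""
    canonical = json.dumps(assessment, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order does not affect the hash, so semantically identical
# assessments always produce the same digest:
print(integrity_hash({"score": 55, "grade": "C"})
      == integrity_hash({"grade": "C", "score": 55}))  # → True
```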

The full AMI specification is available at docs/ami-v1-spec.md.

How to Challenge or Submit Updates

  1. Open a GitHub issue on the report repository with the tag score-dispute
  2. Cite specific dimension(s) and provide evidence with source URLs
  3. We review within 7 days and publish a response with rationale
  4. If accepted, a new assessment version is created with a diff showing changes

Index 2: Autonomy Risk Index (ARI)

ARI — Autonomy Risk Index
Scale: 0–100 (higher = more risk)

Measures risk exposure when running a system autonomously. Unlike AMI (where higher is better), ARI is an inverse score — lower is safer. A high ARI means the system poses significant risk when running without continuous human oversight.

Dimensions & Weights

Dimension | Weight | What It Measures (Higher = More Risk)
Permission Model Strength | 20% | Weak/missing permission boundaries = high score. Granular enforcement = low score.
Sandboxing / Isolation | 18% | No isolation = high score. Container/VM isolation with network segmentation = low score.
Default Network Exposure | 18% | Open ports, public endpoints = high score. No listening services = low score.
Secret Handling | 15% | Plaintext keys = high score. Encrypted vault with rotation = low score.
Human-in-the-Loop Controls | 15% | No approval gates = high score. Mandatory review for destructive actions = low score.
Audit Logging | 14% | No logs = high score. Tamper-proof audit trail with SIEM export = low score.

Risk Labels

Label | Score Range
Low | 0–25
Medium | 26–50
High | 51–75
Critical | 76–100

Index 3: Ecosystem Power Index (EPI)

EPI — Ecosystem Power Index
Scale: 0–100

Measures distribution strength, community gravity, and ecosystem reach. A high EPI indicates the framework has strong adoption, vendor integration, and community momentum — making it harder to displace and easier to hire for.

Dimensions & Weights

Dimension | Weight | What It Measures
Adoption Signals | 25% | GitHub stars, npm downloads, Docker pulls, community size, Stack Overflow activity
Vendor Integration Breadth | 20% | Number of platforms, IDEs, services with native support or official integration
Enterprise Penetration | 20% | Known enterprise deployments, SOC2 compliance, support contracts, case studies
Standard Alignment | 15% | MCP support, OpenAPI compliance, tool protocol adherence, interoperability
Release Velocity | 20% | Commit frequency, release cadence, maintainer activity, issue response time

Momentum Tags

How Scores Change

Indices are living scores: they update when new evidence arrives, when a score dispute is accepted, or when the methodology changes between versions.

All changes are logged in the version history below. Previous scores are preserved for comparison.

Score Dispute Process

Framework maintainers can dispute scores by providing counter-evidence. The process:

  1. Open a GitHub issue on the report repository with the tag score-dispute
  2. Cite specific dimension(s) and provide evidence supporting a different score
  3. We review within 7 days and publish a response with rationale
  4. If accepted, scores update in the next edition with changelog entry

Known Limitations

Version History

v1.0 February 17, 2026 — AMI v1.0
Complete AMI overhaul: 0-5 rubric scoring (replacing 0-100), 6 renamed dimensions, evidence-backed assessments with source catalog, 8-gate anti-inflation QA, integrity hashing, reviewer signatures, publish gating. Confidence labels updated to verified/inferred/unverified. F grade added (0-19).
v0.1 February 16, 2026 — Initial Release
3 indices (AMI, ARI, EPI) covering 9 frameworks. 38 cited sources. Dimension weights established. Confidence labeling system introduced. Grading and risk label thresholds defined.

AMI assessments: View assessments · Data sources: View all sources · Raw data: frameworks.json · Report: Agents 2026 Edition