AMI v1.0 — FEBRUARY 2026

Scoring Methodology

This document defines the three proprietary indices used in the AI Tools Landscape Report: Agent Maturity Index (AMI), Autonomy Risk Index (ARI), and Ecosystem Power Index (EPI). All dimensions, weights, and grading criteria are published here for transparency and reproducibility. Each assessment links to its evidence and source catalog so any claim can be independently verified.

Guiding Principles

  1. Transparency over authority. Every score is decomposable into its dimension scores. Every dimension score links to evidence with cited sources.
  2. Confidence labeling is mandatory. No score is presented without a verified, inferred, or unverified tag.
  3. Scores can change. Indices are versioned. When new evidence arrives, scores update. Assessment diffs track every change.
  4. No pay-for-score. Sponsorship does not affect index scores. Systems are scored identically regardless of commercial relationship.
  5. Methodology evolves. Dimensions and weights may change between versions. All changes are documented with rationale. The spec hash locks methodology integrity.

Confidence Labels

Every dimension score and the overall assessment carry one of three confidence labels:

verified
Based on primary sources: official documentation, published audits, source code, commit logs, or metrics dashboards. Verifiable by third parties.
inferred
Reasonable conclusion drawn from verified data + domain expertise. Example: inferring enterprise maturity from partnerships and pricing tiers. Directionally reliable but not directly verified.
unverified
Based on self-reported claims, marketing materials, or limited public information. Used when better sources are unavailable. Always labeled.

Overall confidence is derived from dimension confidences: high if all scored dimensions are verified, low if any are unverified, medium otherwise.
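The derivation rule above can be sketched as a small function. This is a minimal sketch; the function and label names are illustrative, not part of the published spec:

```python
def overall_confidence(dimension_confidences):
    """Derive overall confidence from per-dimension labels.

    low    -> any scored dimension is unverified
    high   -> all scored dimensions are verified
    medium -> otherwise (e.g. a mix of verified and inferred)
    """
    if any(c == "unverified" for c in dimension_confidences):
        return "low"
    if all(c == "verified" for c in dimension_confidences):
        return "high"
    return "medium"

print(overall_confidence(["verified", "inferred", "verified"]))  # → medium
```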

Index 1: Agent Maturity Index (AMI)

AMI — Agent Maturity Index
Scale: 0-5 per dimension, 0-100 overall

Measures how production-ready an AI agent system is across six dimensions. Each dimension is scored 0-5 against a published rubric, weighted, and aggregated to a 0-100 scale. A high AMI score means the system can be deployed in production with reasonable confidence in reliability, safety, and operational control.

What AMI Is Not

Eligibility

A system must meet all inclusion criteria and trigger no exclusion flags to receive a scored AMI assessment.

Inclusion Criteria

  1. Agent system: Must be an AI agent framework, platform, or orchestration tool (not a raw LLM API)
  2. Public artifact: Must have publicly available code, documentation, or product
  3. Active development: Must show activity within the last 6 months
  4. Identifiable maintainer: Must have a known organization or individual maintainer
  5. Verified sources: Must have >= 3 distinct verifiable sources

Exclusion Flags

Assessment Status Codes

Status | Meaning
scored | Full assessment complete with overall score and grade
under_review | Assessment in progress; dimensions being evaluated
insufficient_evidence | System is eligible but lacks enough verifiable sources for scoring
inactive | System shows no development activity in the last 6 months
excluded | System triggers an exclusion flag and cannot be scored

Dimensions & Weights

AMI evaluates six dimensions. Each is scored 0-5 against a rubric (see below). The two highest-weighted dimensions reflect the most critical production concerns.

Dimension | Weight | What It Measures
Execution Reliability | 20% | Multi-step task completion, error handling, retry logic, graceful degradation
Safety & Guardrails | 20% | Permission models, sandboxing, security audits, secure defaults, compliance
Tooling & Integration Breadth | 15% | Protocol support (MCP, etc.), third-party ecosystem, IDE integration, tool creation SDK
Observability | 15% | Structured logging, execution traces, dashboards, SIEM integration, cost monitoring
Deployment Maturity | 15% | Container support, cloud deployment, Kubernetes, SLA guarantees, disaster recovery
Real-World Validation | 15% | Named deployments, case studies, independent benchmarks, regulatory acceptance

Rubric Scoring (0-5)

Each dimension is scored on a 0-5 integer scale. Each score level has published rubric bullets (e.g., ER3a, SG4b) that assessors must reference. The full rubric table is available in the assessment detail view for each system.

Score | Level | Meaning
0 | None | No evidence of capability
1 | Minimal | Basic capability with significant gaps
2 | Developing | Functional but incomplete
3 | Competent | Meets expectations for production use
4 | Strong | Exceeds expectations with comprehensive coverage
5 | Exemplary | Industry-leading; requires hard evidence (primary source, commit, log, or metric)

Evidence requirements scale with score. A score of 4+ requires >= 2 distinct sources. A score of 5 requires at least one primary or hard-evidence source (commit, log, metric). Every scored dimension must cite rubric bullet IDs.

Scoring Formula & Aggregation

AMI = round( SUM(score_i × weight_i) / 5 × 100 )

Where:
  score_i  = 0-5 integer (per dimension, from rubric)
  weight_i = dimension weight (all weights sum to 1.0)
  5        = maximum possible per-dimension score

Example (OpenClaw, all 6 dimensions scored):
  raw = (3×0.20) + (4×0.15) + (1×0.20) + (3×0.15) + (3×0.15) + (3×0.15)
      = 0.60 + 0.60 + 0.20 + 0.45 + 0.45 + 0.45
      = 2.75
  AMI = round(2.75 / 5 × 100) = 55 → Grade C
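The aggregation formula and worked example can be expressed directly in code. This is a sketch; the dimension keys are shorthand, and the assignment of the scores 4 and 1 to specific dimensions is illustrative, since the worked example lists only the products:

```python
def ami_score(scores, weights):
    """Aggregate per-dimension 0-5 rubric scores into a 0-100 AMI.

    scores and weights are dicts keyed by dimension; weights sum to 1.0.
    """
    raw = sum(scores[d] * weights[d] for d in scores)  # weighted 0-5 value
    return round(raw / 5 * 100)

weights = {
    "execution_reliability": 0.20,
    "safety_guardrails": 0.20,
    "tooling_integration": 0.15,
    "observability": 0.15,
    "deployment_maturity": 0.15,
    "real_world_validation": 0.15,
}
scores = {  # one assignment consistent with the worked example
    "execution_reliability": 3,
    "safety_guardrails": 1,
    "tooling_integration": 4,
    "observability": 3,
    "deployment_maturity": 3,
    "real_world_validation": 3,
}
print(ami_score(scores, weights))  # → 55
```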

Renormalization (Missing Dimensions)

When a dimension cannot be scored (e.g., not_scored_reason: "Private infrastructure, no public evidence"), its weight is redistributed proportionally among scored dimensions. A system with >= 3 unscored dimensions cannot receive a scored status.

renorm_weight_i = original_weight_i / SUM(scored_weights)

Example: If observability (15%) is not scored, remaining weights (85%) renormalize:
  execution_reliability: 0.20 / 0.85 = 0.235
  safety_guardrails:     0.20 / 0.85 = 0.235
  tooling_integration:   0.15 / 0.85 = 0.176
  deployment_maturity:   0.15 / 0.85 = 0.176
  real_world_validation: 0.15 / 0.85 = 0.176
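The renormalization rule, including the 3-plus-unscored cutoff, can be sketched as follows (dimension keys are shorthand, not the spec's identifiers):

```python
def renormalize(weights, unscored):
    """Redistribute unscored dimensions' weight proportionally.

    Returns renormalized weights over scored dimensions only; a system
    with 3 or more unscored dimensions cannot receive a scored status.
    """
    if len(unscored) >= 3:
        raise ValueError("3+ unscored dimensions: cannot produce a scored status")
    scored = {d: w for d, w in weights.items() if d not in unscored}
    total = sum(scored.values())  # e.g. 0.85 when a 15% dimension is dropped
    return {d: w / total for d, w in scored.items()}

weights = {"er": 0.20, "sg": 0.20, "ti": 0.15, "obs": 0.15, "dm": 0.15, "rwv": 0.15}
renormed = renormalize(weights, {"obs"})
print(round(renormed["er"], 3))  # → 0.235
```

Dropping observability (15%) gives execution reliability 0.20 / 0.85 ≈ 0.235, matching the worked example, and the renormalized weights again sum to 1.0.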

Letter Grades

Grade | Score Range
A | 80-100
B | 60-79
C | 40-59
D | 20-39
F | 0-19
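The grade bands above amount to a simple threshold lookup; a minimal sketch:

```python
def letter_grade(ami: int) -> str:
    """Map a 0-100 AMI score to its published letter-grade band."""
    if ami >= 80:
        return "A"
    if ami >= 60:
        return "B"
    if ami >= 40:
        return "C"
    if ami >= 20:
        return "D"
    return "F"

print(letter_grade(55))  # → C (matches the OpenClaw example)
```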

Evidence & Source Tiering

Every dimension score must be backed by evidence items. Each evidence item cites one or more sources from the source catalog.

Source Tiers

Tier | Reliability | Examples
T1 | Primary / Hard evidence | Source code, commit logs, metrics dashboards, audit reports
T2 | Secondary / Independent | Independent analysis, news reports, community benchmarks
T3 | Self-reported | Official marketing, vendor documentation, press releases

Anti-Gaming Gates

The validation system enforces eight gates to prevent score inflation:

  1. No dimension score without evidence
  2. Every evidence item must cite source IDs
  3. Scored dimensions require a confidence level
  4. Aggregation math must match stored values exactly
  5. Score >= 4 requires >= 2 distinct sources
  6. Score 5 requires at least one primary or hard-evidence source
  7. Scored assessments require >= 3 distinct sources total
  8. Scored dimensions must cite rubric bullet IDs

Automated warnings flag dimensions scoring 4+ that are backed only by self-reported sources, and evidence items older than 180 days.
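A validator for these gates might look like the sketch below, which checks a few of the eight gates for a single scored dimension. The field names (score, evidence, source_ids, tier, confidence, rubric_bullets) are assumptions for illustration, not the published schema:

```python
def gate_violations(dim):
    """Return violation messages for one scored dimension (subset of gates)."""
    v = []
    if dim["score"] > 0 and not dim["evidence"]:
        v.append("gate 1: dimension score without evidence")
    if any(not e.get("source_ids") for e in dim["evidence"]):
        v.append("gate 2: evidence item missing source IDs")
    if not dim.get("confidence"):
        v.append("gate 3: scored dimension missing confidence level")
    sources = {s for e in dim["evidence"] for s in e.get("source_ids", [])}
    if dim["score"] >= 4 and len(sources) < 2:
        v.append("gate 5: score >= 4 requires >= 2 distinct sources")
    if dim["score"] == 5 and not any(e.get("tier") == "T1" for e in dim["evidence"]):
        v.append("gate 6: score 5 requires a primary/hard-evidence source")
    if not dim.get("rubric_bullets"):
        v.append("gate 8: scored dimension missing rubric bullet IDs")
    return v

dim = {  # a score of 5 backed by a single self-reported source
    "score": 5,
    "evidence": [{"source_ids": ["s1"], "tier": "T3"}],
    "confidence": "verified",
    "rubric_bullets": ["ER3a"],
}
for msg in gate_violations(dim):
    print(msg)  # flags gates 5 and 6
```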

Versioning & Integrity

Each assessment carries a spec hash linking it to this methodology version, plus a SHA-256 integrity hash of the assessment content. Published assessments require at least one reviewer signature. Assessment diffs show exactly what changed between versions.
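A SHA-256 content hash over an assessment could be computed as in this sketch. The canonicalization step (sorted keys, compact separators) is an assumption for illustration; the published spec defines the authoritative serialization:

```python
import hashlib
import json

def integrity_hash(assessment: dict) -> str:
    """SHA-256 hex digest over a canonical JSON form of the assessment."""
    canonical = json.dumps(assessment, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Key order does not affect the hash, so semantically identical
# assessments always produce the same digest:
print(integrity_hash({"score": 55, "grade": "C"})
      == integrity_hash({"grade": "C", "score": 55}))  # → True
```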

The full AMI specification is available at docs/ami-v1-spec.md.

How to Challenge or Submit Updates

  1. Open a GitHub issue on the report repository with the tag score-dispute
  2. Cite specific dimension(s) and provide evidence with source URLs
  3. We review within 7 days and publish a response with rationale
  4. If accepted, a new assessment version is created with a diff showing changes

Index 2: Autonomy Risk Index (ARI)

ARI — Autonomy Risk Index
Scale: 0–100 (higher = more risk)

Measures risk exposure when running a system autonomously. Unlike AMI (where higher is better), ARI is an inverse score — lower is safer. A high ARI means the system poses significant risk when running without continuous human oversight.

Dimensions & Weights

Dimension | Weight | What It Measures (Higher = More Risk)
Permission Model Strength | 20% | Weak/missing permission boundaries = high score. Granular enforcement = low score.
Sandboxing / Isolation | 18% | No isolation = high score. Container/VM isolation with network segmentation = low score.
Default Network Exposure | 18% | Open ports, public endpoints = high score. No listening services = low score.
Secret Handling | 15% | Plaintext keys = high score. Encrypted vault with rotation = low score.
Human-in-the-Loop Controls | 15% | No approval gates = high score. Mandatory review for destructive actions = low score.
Audit Logging | 14% | No logs = high score. Tamper-proof audit trail with SIEM export = low score.

Risk Labels

Label | Score Range
Low | 0–25
Medium | 26–50
High | 51–75
Critical | 76–100

Index 3: Ecosystem Power Index (EPI)

EPI — Ecosystem Power Index
Scale: 0–100

Measures distribution strength, community gravity, and ecosystem reach. A high EPI indicates the framework has strong adoption, vendor integration, and community momentum — making it harder to displace and easier to hire for.

Dimensions & Weights

Dimension | Weight | What It Measures
Adoption Signals | 25% | GitHub stars, npm downloads, Docker pulls, community size, Stack Overflow activity
Vendor Integration Breadth | 20% | Number of platforms, IDEs, services with native support or official integration
Enterprise Penetration | 20% | Known enterprise deployments, SOC2 compliance, support contracts, case studies
Standard Alignment | 15% | MCP support, OpenAPI compliance, tool protocol adherence, interoperability
Release Velocity | 20% | Commit frequency, release cadence, maintainer activity, issue response time

Momentum Tags

How Scores Change

Indices are living scores: they update when new evidence arrives, when a score dispute is accepted, or when the methodology changes between versions.

All changes are logged in the version history below. Previous scores are preserved for comparison.

Score Dispute Process

Framework maintainers can dispute scores by providing counter-evidence. The process:

  1. Open a GitHub issue on the report repository with the tag score-dispute
  2. Cite specific dimension(s) and provide evidence supporting a different score
  3. We review within 7 days and publish a response with rationale
  4. If accepted, scores update in the next edition with changelog entry

Known Limitations

Version History

v1.0 February 17, 2026 — AMI v1.0
Complete AMI overhaul: 0-5 rubric scoring (replacing 0-100), 6 renamed dimensions, evidence-backed assessments with source catalog, 8-gate anti-inflation QA, integrity hashing, reviewer signatures, publish gating. Confidence labels updated to verified/inferred/unverified. F grade added (0-19).
v0.1 February 16, 2026 — Initial Release
3 indices (AMI, ARI, EPI) covering 9 frameworks. 38 cited sources. Dimension weights established. Confidence labeling system introduced. Grading and risk label thresholds defined.

AMI assessments: View assessments · Data sources: View all sources · Raw data: frameworks.json · Report: Agents 2026 Edition