This document defines the three proprietary indices used in the AI Tools Landscape Report: Agent Maturity Index (AMI), Autonomy Risk Index (ARI), and Ecosystem Power Index (EPI). All dimensions, weights, and grading criteria are published here for transparency and reproducibility. Each assessment links to its evidence and source catalog so any claim can be independently verified.
Every dimension score and the overall assessment carry one of three confidence labels.
Overall confidence is derived from dimension confidences: high if all scored dimensions are verified, low if any are unverified, medium otherwise.
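The derivation rule above can be sketched directly. This is a minimal illustration, not the project's actual implementation; the label strings for the middle dimension-level confidence are not reproduced here, so anything that is neither verified nor unverified simply yields medium:

```python
def overall_confidence(dim_confidences):
    """Derive overall confidence from per-dimension labels.

    Rule from the methodology: low if any scored dimension is
    unverified, high if all are verified, medium otherwise.
    """
    if any(c == "unverified" for c in dim_confidences):
        return "low"
    if all(c == "verified" for c in dim_confidences):
        return "high"
    return "medium"
```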
Measures how production-ready an AI agent system is across six dimensions. Each dimension is scored 0-5 against a published rubric, weighted, and aggregated to a 0-100 scale. A high AMI score means the system can be deployed in production with reasonable confidence in reliability, safety, and operational control.
A system must meet all inclusion criteria and trigger no exclusion flags to receive a scored AMI assessment.
- base_llm_only — Raw model API without agent orchestration
- prompt_library_only — Prompt template collection, not a system
- research_prototype_only — Academic prototype without production path
- wrapper_only — Thin wrapper around another scored system

| Status | Meaning |
|---|---|
| scored | Full assessment complete with overall score and grade |
| under_review | Assessment in progress; dimensions being evaluated |
| insufficient_evidence | System is eligible but lacks enough verifiable sources for scoring |
| inactive | System shows no development activity in the last 6 months |
| excluded | System triggers an exclusion flag and cannot be scored |
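One way to read the status table is as a screening pipeline. The sketch below is an assumption about ordering and thresholds (the minimum source count is illustrative, and the transitional under_review state is set by assessors rather than derived), but the flag set and the 6-month activity window come from the methodology:

```python
# Exclusion flags published in the methodology.
EXCLUSION_FLAGS = {
    "base_llm_only",
    "prompt_library_only",
    "research_prototype_only",
    "wrapper_only",
}

def assessment_status(flags, source_count, months_since_activity,
                      min_sources=3):
    """Map screening data to an assessment status (illustrative)."""
    if EXCLUSION_FLAGS & set(flags):
        return "excluded"          # flagged systems can never be scored
    if months_since_activity >= 6:
        return "inactive"          # no development activity in 6 months
    if source_count < min_sources:
        return "insufficient_evidence"
    return "scored"
```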
AMI evaluates six dimensions. Each is scored 0-5 against a rubric (see below). The two highest-weighted dimensions reflect the most critical production concerns.
| Dimension | Weight | What It Measures |
|---|---|---|
| Execution Reliability | | Multi-step task completion, error handling, retry logic, graceful degradation |
| Safety & Guardrails | | Permission models, sandboxing, security audits, secure defaults, compliance |
| Tooling & Integration Breadth | | Protocol support (MCP, etc.), third-party ecosystem, IDE integration, tool creation SDK |
| Observability | | Structured logging, execution traces, dashboards, SIEM integration, cost monitoring |
| Deployment Maturity | | Container support, cloud deployment, Kubernetes, SLA guarantees, disaster recovery |
| Real-World Validation | | Named deployments, case studies, independent benchmarks, regulatory acceptance |
Each dimension is scored on a 0-5 integer scale. Each score level has published rubric bullets (e.g., ER3a, SG4b) that assessors must reference. The full rubric table is available in the assessment detail view for each system.
| Score | Level | Meaning |
|---|---|---|
| 0 | None | No evidence of capability |
| 1 | Minimal | Basic capability with significant gaps |
| 2 | Developing | Functional but incomplete |
| 3 | Competent | Meets expectations for production use |
| 4 | Strong | Exceeds expectations with comprehensive coverage |
| 5 | Exemplary | Industry-leading; requires hard evidence (primary source, commit, log, or metric) |
Evidence requirements scale with score. A score of 4+ requires >= 2 distinct sources. A score of 5 requires at least one primary or hard-evidence source (commit, log, metric). Every scored dimension must cite rubric bullet IDs.
When a dimension cannot be scored (e.g., not_scored_reason: "Private infrastructure, no public evidence"),
its weight is redistributed proportionally among scored dimensions. A system with >= 3 unscored dimensions cannot receive a
scored status.
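The aggregation and redistribution rules can be sketched as follows. The 0-100 mapping (a weighted mean of 0-5 scores scaled by 20) is an assumption consistent with the description above, and the example weights are placeholders, not the published ones:

```python
def aggregate_ami(scores, weights):
    """Aggregate 0-5 dimension scores to a 0-100 AMI score.

    scores:  dimension -> 0-5 int, or None when not scored.
    weights: dimension -> weight, fractions summing to 1.0.
    Unscored dimensions' weight is redistributed proportionally by
    renormalising over scored dimensions; >= 3 unscored dimensions
    means the system cannot receive a scored status.
    """
    scored = {d: s for d, s in scores.items() if s is not None}
    if len(scores) - len(scored) >= 3:
        return None
    total_w = sum(weights[d] for d in scored)
    weighted_mean = sum(scores[d] * weights[d] for d in scored) / total_w
    return round(weighted_mean * 20, 1)  # 0-5 scale -> 0-100 scale
```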
Every dimension score must be backed by evidence items. Each evidence item cites one or more sources from the source catalog.
| Tier | Reliability | Examples |
|---|---|---|
| T1 | Primary / Hard evidence | Source code, commit logs, metrics dashboards, audit reports |
| T2 | Secondary / Independent | Independent analysis, news reports, community benchmarks |
| T3 | Self-reported | Official marketing, vendor documentation, press releases |
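Combining the tier table with the evidence requirements above gives a simple per-dimension check. This is a sketch under the assumption that sources are identified by an id plus a tier; the data model used by the actual validator may differ:

```python
def evidence_ok(score, sources):
    """Check evidence requirements for one dimension score.

    sources: list of (source_id, tier) pairs, e.g. [("s1", "T1")].
    A 4+ needs >= 2 distinct sources; a 5 additionally needs at
    least one T1 (primary / hard evidence) source.
    """
    ids = {sid for sid, _ in sources}
    tiers = {tier for _, tier in sources}
    if score >= 4 and len(ids) < 2:
        return False
    if score == 5 and "T1" not in tiers:
        return False
    return True
```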
The validation system enforces eight gates to prevent score inflation. In addition, automated warnings flag dimensions scoring 4+ that are backed only by self-reported sources, and evidence items older than 180 days.
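The two automated warnings can be sketched as a single pass over the evidence. The (date, tier) evidence shape and the warning labels are illustrative assumptions; the 4+ threshold, the T3 (self-reported) tier, and the 180-day staleness window come from the methodology:

```python
from datetime import date

def evidence_warnings(dim_scores, evidence, today):
    """dim_scores: dimension -> 0-5 score.
    evidence:   dimension -> list of (date_collected, tier) pairs.
    """
    warnings = []
    for dim, score in dim_scores.items():
        items = evidence.get(dim, [])
        # Warning 1: a 4+ score backed only by self-reported (T3) sources.
        if score >= 4 and items and all(tier == "T3" for _, tier in items):
            warnings.append((dim, "self_reported_only"))
        # Warning 2: any evidence item older than 180 days.
        if any((today - d).days > 180 for d, _ in items):
            warnings.append((dim, "stale_evidence"))
    return warnings
```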
Each assessment carries a spec hash linking it to this methodology version, plus a SHA-256 integrity hash of the assessment content. Published assessments require at least one reviewer signature. Assessment diffs show exactly what changed between versions.
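A content integrity hash of this kind is typically computed over a canonical serialisation. The sketch below assumes canonical JSON (sorted keys, compact separators); the methodology states only that a SHA-256 hash of the assessment content is stored alongside the spec hash:

```python
import hashlib
import json

def integrity_hash(assessment: dict) -> str:
    """SHA-256 over a canonical JSON serialisation of an assessment.

    Sorting keys makes the hash independent of dict insertion order,
    so semantically identical assessments hash identically.
    """
    canonical = json.dumps(assessment, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```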
The full AMI specification is available at docs/ami-v1-spec.md.
Measures risk exposure when running a system autonomously. Unlike AMI (where higher is better), ARI is an inverse score — lower is safer. A high ARI means the system poses significant risk when running without continuous human oversight.
| Dimension | Weight | What It Measures (Higher = More Risk) |
|---|---|---|
| Permission Model Strength | | Weak/missing permission boundaries = high score. Granular enforcement = low score. |
| Sandboxing / Isolation | | No isolation = high score. Container/VM isolation with network segmentation = low score. |
| Default Network Exposure | | Open ports, public endpoints = high score. No listening services = low score. |
| Secret Handling | | Plaintext keys = high score. Encrypted vault with rotation = low score. |
| Human-in-the-Loop Controls | | No approval gates = high score. Mandatory review for destructive actions = low score. |
| Audit Logging | | No logs = high score. Tamper-proof audit trail with SIEM export = low score. |
Measures distribution strength, community gravity, and ecosystem reach. A high EPI indicates the framework has strong adoption, vendor integration, and community momentum — making it harder to displace and easier to hire for.
| Dimension | Weight | What It Measures |
|---|---|---|
| Adoption Signals | | GitHub stars, npm downloads, Docker pulls, community size, Stack Overflow activity |
| Vendor Integration Breadth | | Number of platforms, IDEs, services with native support or official integration |
| Enterprise Penetration | | Known enterprise deployments, SOC2 compliance, support contracts, case studies |
| Standard Alignment | | MCP support, OpenAPI compliance, tool protocol adherence, interoperability |
| Release Velocity | | Commit frequency, release cadence, maintainer activity, issue response time |
Indices are living scores: they are updated when new evidence is added, when the methodology version changes, or when a dispute is resolved.
All changes are logged in the version history below. Previous scores are preserved for comparison.
Framework maintainers can dispute scores by providing counter-evidence through the score-dispute process.
AMI assessments: View assessments · Data sources: View all sources · Raw data: frameworks.json · Report: Agents 2026 Edition