ShipRPG gives every agent a quality score. You see the leaderboard. They don't. And before each run, they get a criteria rubric that makes their output measurably better — confirmed in a controlled experiment with real data.
// free beta · no credit card · npm + pip · self-hostable
Challenge: try it. When it works, you have to come back and tell us.
| # | Agent | XP (30d) | Tasks |
|---|---|---|---|
| | loading... | | |
What ShipRPG actually is
Every agent on your team runs tasks. ShipRPG scores each one and builds a picture over time. Which agents are improving? Which are coasting? Which just had their worst week? You're not just deploying agents anymore — you're managing a team.
Every agent gets a live quality score. You see who's #1 and who's regressing. The agents never see this view. They just get their rubric and do the work.
One good run is luck. ShipRPG tracks 30-day rolling averages so you can see real improvement vs noise. Founding Engineer is regressing? Now you know.
Monthly performance card per agent: best run, weakest dimension, score trajectory. The Spotify Wrapped format, for your AI team. Something worth posting.
Daily run streaks, quality milestones, improvement badges. Not for the agents — for you, as the person watching the team perform. It turns out this is addictive.
The mechanism that makes it work
LLMs don't fail because they lack capability. They fail because they don't know what you're optimizing for. ShipRPG injects a scored quality rubric before each agent call — while it's still making decisions.
The agent never sees its rank or score. It just sees: here's what good looks like. That's enough to change what it produces.
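The injection step can be sketched in a few lines. The helper name and rubric text below are illustrative, not ShipRPG's actual API:

```python
# Illustrative rubric; the real dimensions are whatever you configure.
RUBRIC = """Before you answer, optimize for:
1. Correctness: the change works and passes existing tests.
2. Scope: touch only what the task requires.
3. Clarity: explain non-obvious decisions."""

def inject_rubric(messages: list[dict]) -> list[dict]:
    """Prepend the rubric as a system message. The agent sees what good
    looks like; it never sees its rank or score."""
    return [{"role": "system", "content": RUBRIC}] + messages
```

The same prepend works with any message-list framework, since it only touches the input, never the model.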
Blind A/B test. 29 coding tasks. Randomised assignment. Senior engineer judge. Effect replicated across bugfixes, implementations, and edge-case tasks.
The experiment
Setup
Works with LangChain, AutoGen, CrewAI, Claude Code, raw Anthropic/OpenAI API calls, or any custom loop. No issue tracker required.
ShipRPG prepends the quality rubric automatically. The agent sees it. The agent doesn't see its rank. That's the whole trick.
Every run is scored and logged. Your dashboard shows trends by agent, by task type, by dimension. Who's improving? Who's coasting? Now you know.
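The 30-day trend behind those dashboards can be approximated with a simple rolling mean over logged scores. This is a hedged sketch with an assumed data model and 0-100 scale, not ShipRPG's storage:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class Run:
    agent: str
    day: date
    score: float  # judged quality on a 0-100 scale (assumed)

def rolling_average(runs, agent, today, window_days=30):
    """30-day rolling mean for one agent; None if no runs in the window."""
    cutoff = today - timedelta(days=window_days)
    scores = [r.score for r in runs
              if r.agent == agent and cutoff < r.day <= today]
    return sum(scores) / len(scores) if scores else None
```

Comparing this window against the previous one is what separates real improvement from a single lucky run.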
Coming Q2 2026
Every month, ShipRPG generates a shareable performance card for each agent: top scoring task, weakest dimension, quality trajectory, best run config. The Spotify Wrapped format, applied to your AI team. Something worth posting.
Common questions
POST /complete accepts any task ID you already use — DB row ID, UUID, job queue ID, anything. If your agent knows a task finished, ShipRPG can record it. No Linear, Jira, or GitHub required.

Free beta. One install. Criteria injected before every run. You see the leaderboard. They don't.
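One way to call that endpoint from Python, sketched with the standard library. The base URL and JSON field names here are assumptions, not the documented contract:

```python
import json
from urllib.request import Request

def complete_request(task_id: str, agent: str, api_key: str) -> Request:
    # task_id is whatever identifier you already have: DB row ID, UUID,
    # job queue ID. No issue tracker needed.
    body = json.dumps({"task_id": task_id, "agent": agent}).encode()
    return Request(
        "https://api.shiprpg.example/complete",  # hypothetical host
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Send it with `urllib.request.urlopen(req)`, or swap in any HTTP client your agent loop already uses.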
// free beta · no credit card · works with any agent framework