⚡ Free Beta — Paperclip + Any Agent Framework

Your AI team has a leaderboard now.

ShipRPG gives every agent a quality score. You see the leaderboard. They don't. And before each run, they get a criteria rubric that makes their output measurably better — confirmed in a controlled experiment with real data.

Your agents — 30d rolling live
# Agent XP (30d) Tasks
View full leaderboard →

A management game. Your agents are the players.

Every agent on your team runs tasks. ShipRPG scores each one and builds a picture over time. Which agents are improving? Which are coasting? Which just had their worst week? You're not just deploying agents anymore — you're managing a team.

🏆

Leaderboard — yours only

Every agent gets a live quality score. You see who's #1 and who's regressing. The agents never see this view. They just get their rubric and do the work.

📈

Score trends over time

One good run is luck. ShipRPG tracks 30-day rolling averages so you can tell real improvement from noise. Your Founding Engineer agent is regressing? Now you know.

🔥

Streaks and milestones

Daily run streaks, quality milestones, improvement badges. Not for the agents — for you, as the person watching the team perform. It turns out this is addictive.

Before every run, the agent gets a rubric.

LLMs don't fail because they lack capability. They fail because they don't know what you're optimizing for. ShipRPG injects a scored quality rubric before each agent call — while it's still making decisions.

The agent never sees its rank or score. It just sees: here's what good looks like. That's enough to change what it produces.

+23.2% quality improvement

p < 0.001  ·  Cohen's d = 0.66  ·  n = 472

Blind A/B test. 472 coding task pairs across 17 independent runs. Randomised assignment. Blind judge. Effect replicated across bugfixes, implementations, and edge-case tasks.

quality context — injected before every run live
You are a professional software engineer.
Complete the task below.
 
Your solution will be scored on three criteria:
  CORRECTNESS  (0-2): all cases, edge cases, boundaries
  TEST QUALITY (0-2): covers real failure modes
  CONCISENESS  (0-2): clean, idiomatic, no bloat
 
A score of 6/6 is achievable.
---
$ _

We tested it. Here's exactly what happened.

+23.2% Mean quality improvement
<0.001 p-value (sign test)
0.66 Cohen's d (medium-large)
472 Blind task pairs
Protocol: 472 coding task pairs across 17 independent runs, three categories (bugfixes, implementations, edge-case tasks). Each task run twice — once with criteria injection, once without. Labels randomised before blind judging. Criteria injection arm: 295 wins, 79 losses, 98 ties (62.5% win rate). Effect replicated across all three task categories and all 17 runs. We also tested rank/XP framing (null result) and expert identity framing (it hurt). Criteria injection is the only arm that won.

One command. Permanent loop.

01 — INSTALL

One command. No API keys. No registration.

$ npx shiprpg-agent setup-claude  # Claude Code
$ npm install shiprpg-agent  # SDK (LangChain, AutoGen, etc.)

Claude Code users: run the command, type your name, restart Claude Code. That's it — no API keys, no registration, no config files. For LangChain, AutoGen, CrewAI, or OpenAI, install the SDK instead.

02 — INJECT

Criteria go in before every run

// Claude Code? Already done. SDK users:
import { init } from 'shiprpg-agent';

const agent = init({ agentId: 'my-agent' });
const { context } = await agent.getContext();
// prepend context to your system prompt

ShipRPG prepends the quality rubric automatically. The agent sees it. The agent doesn't see its rank. That's the whole trick.
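As a sketch of what that prepend looks like in practice (the `withRubric` helper and the sample strings here are illustrative, not part of the ShipRPG SDK):

```javascript
// Hypothetical helper: prepend the ShipRPG rubric to a system prompt.
// `context` stands in for the string returned by agent.getContext().
function withRubric(context, systemPrompt) {
  // Rubric goes first, so the model reads the scoring criteria
  // before the task-specific instructions.
  return `${context}\n\n${systemPrompt}`;
}

const context = 'Your solution will be scored on three criteria: ...';
const prompt = withRubric(context, 'You are a code-review agent.');
```

The order matters: criteria first, task second, so the rubric is in context while the model is still deciding how to approach the task.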

03 — WATCH

You see the scores. They don't.

Every run is scored and logged. Your dashboard shows trends by agent, by task type, by dimension. Who's improving? Who's coasting? Now you know.

The objections, answered inline.

Objection
"I could just add a rubric to my system prompt myself."
Yes. And it would probably help. ShipRPG does that automatically and tracks whether it's working. Without measurement, you don't know if your rubric is good, degrading over time, or different across agents. The injection is table stakes. The tracking is the product.
Objection
"Does injecting criteria actually change what the agent produces?"
We ran a blind A/B test to find out. 472 task pairs across 17 independent runs, randomised assignment, blind judge. Result: +23.2%, p<0.001, d=0.66. 295 treatment wins vs 79 losses. We also tested rank/XP framing (null result) and expert identity framing (it hurt). Criteria injection is the only arm that won.
Objection
"Gamification is for consumer apps, not production AI systems."
The leaderboard is for you, not the agent. The agent never sees its rank. It just gets a quality rubric. Strip the UI and it's still the same thing: criteria injection before each run, scores logged over time. The game part is how you read the data — not how the agent behaves.
Objection
"What data does ShipRPG see from my agent?"
Task metadata, not task content. Task ID, timestamp, token count, completion status. We never see the task description, output, or any code. Think of it like a CDN that sees byte counts but not file contents. Self-host option keeps everything local.
Objection
"My agents don't use GitHub issues."
You don't need a ticket system. POST /complete accepts any task ID you already use — DB row ID, UUID, job queue ID, anything. If your agent knows a task finished, ShipRPG can record it. No Linear, Jira, or GitHub required.
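For illustration, a minimal sketch of reporting a completion to `POST /complete` (the endpoint name is from above; the payload field names and URL are assumptions, not documented API):

```javascript
// Hypothetical payload builder for POST /complete.
// Field names are illustrative; check the SDK docs for the real shape.
function buildCompletionPayload(taskId, status) {
  return JSON.stringify({
    taskId,                            // any ID you already use: UUID, DB row, job queue ID
    status,                            // e.g. 'done' or 'failed'
    completedAt: new Date().toISOString(),
  });
}

const body = buildCompletionPayload('job-42', 'done');
// fetch('https://<your-shiprpg-host>/complete', { method: 'POST', body, ... })
```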
Objection
"What's the overhead per run?"
~200 tokens of added context per run. At standard rates that's ~$0.0006 per run. The quality improvement from criteria injection far outweighs the marginal token cost — but you can measure that yourself once you're tracking scores.
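The per-run cost above is back-of-envelope arithmetic; as a sanity check (assuming a nominal rate of $3 per million input tokens, which is what makes 200 tokens come out to $0.0006):

```javascript
// Back-of-envelope overhead cost per run.
const addedTokens = 200;               // ~200 tokens of injected rubric
const dollarsPerMillionTokens = 3;     // assumed input-token rate
const costPerRun = (addedTokens * dollarsPerMillionTokens) / 1e6;
console.log(costPerRun); // 0.0006
```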

Pricing

Free for 1 agent — unlimited runs, public leaderboard.
Pro ($29/mo): unlimited agents, custom criteria, private leaderboard, email alerts.

Your agents have a
quality score now.

Free beta. One command. Type your name. Restart Claude Code. Done — no API keys, no registration.

$ npx shiprpg-agent setup-claude