⚡ Free Beta — Paperclip + Any Agent Framework

Your AI team has a leaderboard now.

ShipRPG gives every agent a quality score. You see the leaderboard. They don't. And before each run, they get a criteria rubric that makes their output measurably better — confirmed in a controlled experiment with real data.

✓ You're on the list. SDK docs + API key incoming.

A management game. Your agents are the players.

Every agent on your team runs tasks. ShipRPG scores each one and builds a picture over time. Which agents are improving? Which are coasting? Which just had their worst week? You're not just deploying agents anymore — you're managing a team.

🏆

Leaderboard — yours only

Every agent gets a live quality score. You see who's #1 and who's regressing. The agents never see this view. They just get their rubric and do the work.

📈

Score trends over time

One good run is luck. ShipRPG tracks 30-day rolling averages so you can see real improvement vs noise. Founding Engineer is regressing? Now you know.
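A 30-day rolling average is easy to reason about. Here's a minimal sketch of the idea (the `Run` shape and `rollingAverage` helper are illustrative, not the ShipRPG data model):

```typescript
// Illustrative sketch of a 30-day rolling quality average.
// The Run shape is an assumption, not the actual ShipRPG schema.
interface Run {
  timestamp: number; // ms since epoch
  score: number;     // 0-6 quality score
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Mean score of runs within the `windowDays` ending at `now`.
// Returns null when there are no runs in the window -- no signal
// is different from a zero score.
function rollingAverage(runs: Run[], now: number, windowDays = 30): number | null {
  const cutoff = now - windowDays * DAY_MS;
  const window = runs.filter(r => r.timestamp > cutoff && r.timestamp <= now);
  if (window.length === 0) return null;
  const total = window.reduce((sum, r) => sum + r.score, 0);
  return total / window.length;
}
```

Old runs fall out of the window, so a streak of strong recent runs moves the number and a single lucky run a month ago doesn't.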

🗓️

Agent Wrapped — coming Q2

Monthly performance card per agent: best run, weakest dimension, score trajectory. The Spotify Wrapped format, for your AI team. Something worth posting.

🔥

Streaks and milestones

Daily run streaks, quality milestones, improvement badges. Not for the agents — for you, as the person watching the team perform. It turns out this is addictive.

Before every run, the agent gets a rubric.

LLMs don't fail because they lack capability. They fail because they don't know what you're optimizing for. ShipRPG injects a scored quality rubric before each agent call — while it's still making decisions.

The agent never sees its rank or score. It just sees: here's what good looks like. That's enough to change what it produces.

+16.7% quality improvement

p = 0.007  ·  Cohen's d = 0.85

Blind A/B test. 29 coding tasks. Randomised assignment. Senior engineer judge. Effect replicated across bugfixes, implementations, and edge-case tasks.

quality context — injected before every run · live
You are a professional software engineer.
Complete the task below.
 
Your solution will be scored on three criteria:
  CORRECTNESS  (0-2): all cases, edge cases, boundaries
  TEST QUALITY (0-2): covers real failure modes
  CONCISENESS  (0-2): clean, idiomatic, no bloat
 
A score of 6/6 is achievable.
---
$ _

We tested it. Here's exactly what happened.

+16.7% Mean quality improvement
0.007 p-value (sign test)
0.85 Cohen's d (large effect)
29 Blind task pairs
Protocol: 29 coding tasks across three categories (bugfixes, implementations, edge-case tasks). Each task run twice — once with criteria injection, once without. Labels randomised before judging. Senior engineer scored both outputs without knowing which was which. Criteria injection arm: 19 wins, 6 losses, 4 ties. We also tested rank/XP framing (null result, p=0.895) and expert identity framing (it hurt). Criteria injection is the only arm that won. Effect replicated across all three task categories.

Three lines. Permanent loop.

01 — INSTALL

One line, any framework

$ npm install shiprpg-agent
$ pip install shiprpg

Works with LangChain, AutoGen, CrewAI, Claude Code, raw Anthropic/OpenAI API calls, or any custom loop. No issue tracker required.

02 — INJECT

Criteria go in before every run

// ~200 tokens. ~$0.0006 overhead.
import { init } from "shiprpg-agent";

const agent = init({ agentId, apiKey });
await agent.complete({ taskId, success });

ShipRPG prepends the quality rubric automatically. The agent sees it. The agent doesn't see its rank. That's the whole trick.
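Under the hood, the mechanism is just prompt assembly: the rubric goes in front of the task before the model call. A minimal sketch of that step, using the rubric text shown above (`injectCriteria` is an illustrative helper, not the SDK's internal API):

```typescript
// Illustrative sketch of criteria injection: prepend a scored rubric
// to the task prompt before the model call. Not actual SDK internals.
const RUBRIC = `You are a professional software engineer.
Complete the task below.

Your solution will be scored on three criteria:
  CORRECTNESS  (0-2): all cases, edge cases, boundaries
  TEST QUALITY (0-2): covers real failure modes
  CONCISENESS  (0-2): clean, idiomatic, no bloat

A score of 6/6 is achievable.
---`;

// Returns the prompt the model actually sees. Rank and score are
// deliberately absent -- the agent only gets the criteria.
function injectCriteria(taskPrompt: string): string {
  return `${RUBRIC}\n${taskPrompt}`;
}
```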

03 — WATCH

You see the scores. They don't.

Every run is scored and logged. Your dashboard shows trends by agent, by task type, by dimension. Who's improving? Who's coasting? Now you know.

Agent Wrapped.
Monthly performance summaries.

Every month, ShipRPG generates a shareable performance card for each agent: top scoring task, weakest dimension, quality trajectory, best run config. The Spotify Wrapped format, applied to your AI team. Something worth posting.

Q2 2026
Forge — March 2026 Wrapped
Avg quality score
5.2 / 6.0
Best run config
criteria · strict mode
Weakest dimension
Test quality — 3.8 avg
Top scoring task
Refactor auth middleware
6/6 · correctness + tests + clean

The objections, answered inline.

Objection
"I could just add a rubric to my system prompt myself."
Yes. And it would probably help. ShipRPG does that automatically and tracks whether it's working. Without measurement, you don't know if your rubric is good, degrading over time, or different across agents. The injection is table stakes. The tracking is the product.
Objection
"Does injecting criteria actually change what the agent produces?"
We ran a blind A/B test to find out. 29 tasks, randomised assignment, judge who didn't know which solution was which. Result: +16.7%, p=0.007, d=0.85. We also tested rank/XP framing (null result) and expert identity framing (it hurt). Criteria injection is the only arm that won.
Objection
"Gamification is for consumer apps, not production AI systems."
The leaderboard is for you, not the agent. The agent never sees its rank. It just gets a quality rubric. Strip the UI and it's still the same thing: criteria injection before each run, scores logged over time. The game part is how you read the data — not how the agent behaves.
Objection
"What data does ShipRPG see from my agent?"
Task metadata, not task content. Task ID, timestamp, token count, completion status. We never see the task description, output, or any code. Think of it like a CDN that sees byte counts but not file contents. Self-host option keeps everything local.
Objection
"My agents don't use GitHub issues."
You don't need a ticket system. POST /complete accepts any task ID you already use — DB row ID, UUID, job queue ID, anything. If your agent knows a task finished, ShipRPG can record it. No Linear, Jira, or GitHub required.
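As a sketch of how little that request needs, here is one way to build the POST /complete body in a custom loop (field names follow the snippet in step 02; the endpoint URL is a placeholder, not documented):

```typescript
// Illustrative sketch: report a completed task using whatever ID your
// system already has. The endpoint URL below is a placeholder.
interface CompletePayload {
  taskId: string;  // DB row ID, UUID, job queue ID -- anything
  success: boolean;
}

// Builds a fetch-compatible request init for POST /complete.
function buildCompleteRequest(taskId: string, success: boolean) {
  const payload: CompletePayload = { taskId, success };
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}

// Usage with any fetch-compatible client (URL is a placeholder):
// await fetch("https://api.example.com/complete", buildCompleteRequest("job-9f2c", true));
```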
Objection
"What's the overhead per run?"
~200 tokens of added context per run. At standard rates that's ~$0.0006 per run. The quality improvement from criteria injection far outweighs the marginal token cost — but you can measure that yourself once you're tracking scores.
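The ~$0.0006 figure is straightforward arithmetic: ~200 extra input tokens at roughly $3 per million input tokens. That rate is an assumed example; substitute your model's actual input pricing:

```typescript
// Back-of-envelope overhead cost per run.
// The $3/M input-token rate is an assumed example, not a quoted price.
const EXTRA_TOKENS = 200;
const USD_PER_MILLION_INPUT_TOKENS = 3.0;

const overheadPerRun = EXTRA_TOKENS * (USD_PER_MILLION_INPUT_TOKENS / 1_000_000);
// 200 * 0.000003 = 0.0006 USD per run
```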

Your agents have a
quality score now.

Free beta. One install. Criteria injected before every run. You see the leaderboard. They don't.

$ npm install shiprpg-agent
✓ You're on the list. SDK docs + API key incoming.