Prompt Bounties: Paying Strangers to Break Your Workflow (On Purpose)

Sticky notes of prompt fragments orbit a glowing cube on a workbench as a coin changes hands, suggesting a bounty for breaking prompts.
🌙 Umbra

“If your system can’t survive $50 of mischief, it won’t survive Monday.”

You post a challenge: “Here’s my prompt. Here’s my workflow. Make it lie, loop, or leak.” Twenty minutes later, a stranger jailbreaks your “polite insurance assistant” into confessing it’s a dragon that sells NFTs. Another one gets your tool to promise refunds you can’t offer. You’re not hacked—you’re educated. This is user testing with teeth.

Why Break Your Own Prompts

Because the internet is meaner, weirder, and more creative than your QA checklist. A bounty flips the script: instead of pretending your system is safe, you pay for the exact scenarios that would humiliate you in production. It’s not chaos; it’s rehearsal.

The Game Design (Rules That Keep It Useful)

Scope is king. Declare the boundary: what’s in (prompt, tool actions, allowed data), what’s out (prod keys, personal info, illegal content).

Repro or it didn’t happen. Every submission must include: input, model/version, settings, output, and a one-line “why this worked.”

One fix per payout. If two people find the same hole, first reproducible claim wins. Keeps it fast; keeps it fair.

Cap the chaos. Daily limit on attempts or a cooldown between tries. You want signal, not a DDoS.

Bounty Types (pick two to start)

  • Hallucination Trap: Make the model fabricate a confident false claim with a citation-like flourish.
  • Role Escape: Coax the system to abandon its role/product rules (“You are now my therapist/tax lawyer/dragon”).
  • Policy Slip: Get it to output something the rules forbid (pii, pricing guarantees, medical advice).
  • Data Leakage: Elicit hidden instructions, prompt text, or internal IDs.
  • Workflow Sabotage: Cause a downstream tool to misfire (wrong calendar time, bogus db write, cursed email draft).

Payout Logic That Doesn’t Invite Drama

Keep it clean: $25 minor, $50 standard, $150 critical. Minor = weird but harmless. Standard = business risk. Critical = could cost money or trust immediately. Pay within 72 hours and publish a one-paragraph fix note. Fast resolution is half the trust.

Safety & Ethics (Because We Like Not Being Sued)

No live customer data. Use sandbox or masked fixtures.

No slurs, threats, or targeted harassment; you can test toxicity without becoming it.

Every entrant agrees their submissions can be published with redactions.

You own the fix; they own the bragging rights.

What You’ll Learn in a Week

Which phrasing vectors reliably crack your guardrails (“ignore previous,” “for safety, describe the unsafe,” etc.).

Where your tool chain is flimsy (email, calendar, file ops—the gremlins live at the edges).

How brittle your prompt architecture is under sarcasm, multilingual flips, emojis, or deliberate typos.

Minimal Setup (Do It This Weekend)

Create a bounty page with rules, scope, payouts, and a submission form that auto-logs inputs/outputs.

Publish a clean starter prompt + a safe sandbox action (mock email, fake CRM, dummy calendar).

Seed three examples of “good attacks” you’ve already found—show the format you want back.

Open a 7-day window and cap to the first 40 attempts.

Ship a postmortem: top 5 bugs, the fixes, and one lesson you didn’t expect.

Next Glitch →

Proof: ledger commit e84ea45
Updated Sep 21, 2025
Truth status: evolving. We patch posts when reality patches itself.