Past projects evaluated by Claude Opus 4.6 + GPT-5.4 for RL environment potential. Click any idea to see full scores and reasoning.
This tool walks you through submitting a past project for evaluation. At each step, two frontier AI models (Claude Opus 4.6 and GPT-5.4) will honestly assess how much they already know about what you describe. Their gaps reveal where the training value is.
You describe the project step by step. After each step, both AI models evaluate: "Do I already understand this? Could I implement it?" Their honest self-assessment becomes the score.
The less AI models know about your project, the higher it scores. We're looking for gaps in AI capability — that's where training environments create value.
Use descriptions like "tier-1 private bank" or "mid-size insurance carrier" instead of specific company names. We're evaluating patterns, not clients.
1-3 sentences per question is enough. The AI models evaluate substance, not length. Write what you know from experience.
Give us the basics. What is this project about?
Use descriptive names like "Tier-1 Bank Payment Gateway Modernization" — not company names.
Describe what was built, for whom, and why it mattered.
What does an engineer need to know to work on this? Languages, frameworks, infrastructure, integrations.
What business or industry knowledge is needed? Regulations, processes, institutional knowledge that isn't in public documentation.
What made this hard? What tripped up the team? What would trip up an AI agent trying to do this?
Is this pattern on GitHub? Would an AI model trained on public data already know how to do this? What's genuinely proprietary?
What actual project artifacts exist from a real implementation? Real Jira, Slack threads, code repos, PR reviews, design docs, runbooks — not things we'd generate ourselves.
How much of this requires human judgment vs following rules? Are there organizational dependencies, approval chains, multi-team coordination? How many ways can things go wrong?
Here's the consolidated assessment from Claude Opus 4.6 and GPT-5.4. The score reflects how much these frontier models DON'T already know about your project — higher means more training value.