Senior Software Engineer - AI Interaction Evaluator (Codex / Claude Code, up to $200/hr)

g2i

São Paulo | Full-time | Posted 3 day(s) ago

Salary: $0-$0 / yr

Region: São Paulo

Start Date: ASAP

About this Role

SENIOR AI INTERACTION EVALUATOR (CODEX / CLAUDE CODE)

Contract | $50-200/hr | 10+ hrs/week | Project-based

Roles open on a rolling basis - apply to join the talent bench and we’ll reach out when one matches. Expect 40+ hrs once a project starts; timing depends on availability, but we move people in at the earliest genuine opportunity.

These roles are currently filled, but we hire on a rolling basis as new projects open up. Apply now to join our talent bench; qualified candidates will be contacted directly when roles become available.

Check out this Loom video for more details! https://www.loom.com/share/b0d1b0bf24c44ae8b95dca84b9db60e5

We’re looking for highly experienced software engineers (senior+) to help evaluate the quality of interactions with modern coding agents such as OpenAI Codex and Claude Code.

This is not a traditional engineering role.

You won’t be writing production code. You’ll be evaluating something harder: whether the model thinks like a great engineer.

WHAT THIS ROLE ACTUALLY IS

You will assess how AI coding agents behave in real-world scenarios — focusing on:

  • Whether the response makes sense

  • Whether the preamble and reasoning are useful

  • Whether the output reflects strong engineering judgment

  • Whether the interaction feels right to an experienced developer

This role is about engineering taste — not syntax correctness.

WHAT YOU’LL BE DOING

  • Evaluate AI-generated coding interactions end-to-end

  • Judge whether outputs are:

      • Useful

      • Correct (at a high level)

      • Aligned with how a strong engineer would think

  • Assess the quality of explanations and reasoning, not just code

  • Distinguish between different levels of response quality (e.g. what makes something a 2 vs 4)

  • Provide clear, opinionated feedback on:

      • What worked

      • What didn’t

      • What felt “off” or misleading

  • Help define what great looks like when interacting with tools like Cursor

WHAT WE MEAN BY “TASTE”

We’re specifically looking for engineers who can answer questions like:

  • Does this feel like something a strong engineer would actually say?

  • Is this explanation helpful, or just technically correct?

  • Is the model guiding the user well, or just dumping output?

  • Would this interaction build or erode trust?

You should be comfortable making subjective but rigorous judgments.

WHO YOU ARE

  • Staff / Principal-level engineer (or equivalent experience)

  • Strong background in one of the following:

      • TypeScript / JavaScript

      • Python

  • Hands-on experience using:

      • OpenAI Codex

      • Claude Code

      • Cursor

  • Deep familiarity with modern AI-assisted dev workflows

  • Able to evaluate code without needing to fully execute or deeply review every line

  • Comfortable giving direct, opinionated feedback

  • High bar for what “good engineering” looks like

NICE TO HAVE

  • Experience with tools like Cursor or similar AI-first IDEs

  • Prior exposure to prompt design or evaluation workflows

  • Experience mentoring senior engineers or defining engineering standards

ENGAGEMENT DETAILS

  • US and Canada: up to $200/hr

  • EU and LatAm: up to $150/hr

  • Other locations: up to $100/hr

  • Hours: ~10–20 hours/week

  • Duration: Ongoing — project-based

  • Process:

      • Take-home evaluation exercise

      • One behavioral interview
