
Region: Brazil
Start Date: ASAP
Gramian Consultancy brings together the perspective of a software engineer, the knowledge of a technical recruiter, and the vision of a business builder. This combination of experience is our signature advantage in delivering top-quality recruiting, staff augmentation, and outsourcing services.
About Us
Gramian Consultancy is a boutique consultancy specializing in IT professional services and engineering talent solutions. With a strong background in software engineering and leadership, we help companies build high-performing teams by matching them with professionals who truly fit their needs.
Role overview
We are looking for an AI Evaluation Engineer specializing in software engineering to design benchmark tasks based on real-world coding workflows.
You will create scenarios where AI systems must analyze large codebases, apply precise changes (bug fixes, refactors, migrations), and produce correct, testable outputs.
Commitment required: 8 hours per day, including 4 hours of overlap with PST.
Employment type: Contractor assignment (no medical/paid leave)
Duration of contract: 4 weeks+
Location: Bangladesh, Brazil, Colombia, Egypt, Ghana, India, Indonesia, Kenya, Nigeria, Turkey, Vietnam
Interview: take-home assessment
Responsibilities
Design and build multi-agent benchmark tasks based on real-world code changes (bug fixes, migrations, refactors)
Work with the Harbor evaluation framework to run and validate tasks in containerized environments
Write clear, precise task instructions (file paths, function signatures, expected behavior, constraints)
Develop Python-based verification scripts to validate correctness of code changes
Define task decomposition strategies across multiple specialized agents
Analyze and navigate large open-source codebases to extract realistic task scenarios
Run, debug, and refine tasks in Docker environments to ensure reproducibility
Improve task quality, clarity, and difficulty based on evaluation results
Requirements
5+ years of experience in software development (Python and JavaScript)
Strong experience working with large codebases (e.g., Django, Flask, FastAPI, Node.js or similar)
Familiarity with Git workflows (pull requests, diffs, commits, cherry-picking)
Experience writing tests or validation scripts (pytest, unittest, or similar)
Ability to write clear, precise technical specifications
Familiarity with AI coding benchmarks or evaluation frameworks (e.g., SWE-bench or similar)
Hands-on experience with Docker (Dockerfiles, image builds, debugging)
Experience contributing to or maintaining open-source projects
Experience with code migrations or large-scale refactoring
Familiarity with CI/CD pipelines and automated testing workflows
Exposure to LLM-based coding tools or evaluation frameworks