RYZ Labs is looking for an experienced **AI Evaluation Engineer** to join one of our clients’ teams.
### Responsibilities
* Design and implement evaluation pipelines to measure the performance and reliability of AI models.
* Develop automated testing frameworks to assess model outputs at scale.
* Analyze model performance using both traditional statistical metrics and AI-specific evaluation methods.
* Evaluate AI systems built on modern architectures such as **LLM-based applications and Retrieval-Augmented Generation (RAG)**.
* Identify potential issues related to **accuracy, hallucinations, bias, safety, and model drift**.
* Conduct adversarial testing to uncover vulnerabilities and ensure safe model behavior.
* Collaborate with engineering and AI teams to improve prompt design, model outputs, and system performance.
* Monitor model performance in production and help define best practices for AI evaluation and observability.
### Requirements
* Proficiency in **Python** and experience building scripts or pipelines to evaluate model outputs.
* Experience working with **AI/ML systems**, particularly **large language models (LLMs)** or generative AI applications.
* Familiarity with concepts such as **prompt engineering, prompt optimization, and LLM evaluation**.
* Understanding of evaluation metrics such as **precision, recall, F1-score**, and AI-specific metrics related to model quality and safety.
* Experience evaluating **RAG systems or knowledge retrieval pipelines** is a plus.
* Experience with modern **AI evaluation or observability tools** is a plus (e.g., DeepEval, Promptfoo, RAGAS, LangSmith, Arize, Weights & Biases).
* Strong analytical mindset with the ability to interpret model behavior and propose improvements.
### Nice to Have
* Experience performing **adversarial testing or red-teaming** of AI systems.
* Familiarity with **AI safety, bias detection, and model alignment practices**.
* Experience working in production environments deploying or monitoring AI systems.