Staff Developer, AI Evaluation & Reliability

Caseware

Colombia | Full time - permanent | Posted 17 day(s) ago | $0-$0 / yr

Salary: $0-$0 / yr

Region: Colombia

Start Date: ASAP

About this Role.

Caseware is one of Canada's original Fintech companies, having led the global audit and accounting software industry for over 30 years, with more than 500,000 users across 130 countries and software available in 16 different languages. While you might not have heard of us (yet), over 36,000 accounting and audit professionals list Caseware as a skill on their LinkedIn profiles!

As we build the next generation of intelligent, cloud-based solutions for auditors, accountants, and financial professionals, agentic AI is a core pillar of our strategy. We are developing a reusable, enterprise-grade agentic AI platform that enables product teams across Caseware Cloud to safely, consistently, and efficiently deliver AI-powered capabilities in highly regulated environments.

We are looking for a Staff Developer – AI Evaluation & Reliability to raise the bar on the quality, trustworthiness, and operational reliability of our AI platform. This is a senior individual contributor role with broad technical influence and leadership expectations. You will own how our agentic systems are evaluated, validated, and governed in production, and help define the standards that product teams across Caseware rely on.

In this role, you will provide technical stewardship for evaluation frameworks, reliability mechanisms, and compliance-aligned controls that sit at the center of Caseware’s AI strategy. You’ll partner closely with Staff Engineers, Product Management, QA, Security, Data, and Infrastructure teams to ensure the platform scales reliably, meets enterprise and regulatory standards, and delivers measurable value to both product teams and customers.

📍 Location: This is a fully remote position located in Colombia.

Contact: Maira Russo - Senior Talent Acquisition Partner

What you will be doing

* Own and evolve **evaluation strategy** for LLM- and agent-based systems, including golden datasets, rubric-based scoring, reference-free evaluations, regression testing, and A/B experimentation.
* Benchmark and analyze **foundation model performance** within Caseware’s domain, identifying capability gaps, failure modes, and opportunities for improvement.
* Lead the design and optimization of **Retrieval-Augmented Generation (RAG)** pipelines, including embeddings, retrieval strategies, reranking, and retrieval quality metrics.
* Design and maintain **feedback and evaluation pipelines** that connect real-world user behavior to measurable improvements in agent performance.
* Apply data science techniques to analyze agent behavior, diagnose reliability issues, detect drift, and surface systemic risks.
* Define and implement **guardrails** for agentic systems, including schema validation, content filtering, tool governance, and policy enforcement.
* Establish **approval gates, audit trails, and controlled rollout mechanisms** for AI and agent changes, including feature flags, staged deployments, and kill switches.
* Partner with Security and Data teams to embed **privacy-by-design** practices, including PII detection and masking, data minimization, and retention controls.
* Support and influence **SOC 2 and ISO 27001-aligned controls** across AI data flows, including access management, logging, and incident response.
* Act as a **Staff-level technical leader**, mentoring other engineers, shaping best practices, and raising the overall bar for AI reliability and evaluation across the organization.

What you’ll bring

* Strong **data science foundation**, including Python, SQL, statistics, and experiment design.
* Deep hands-on experience with **LLMs**, prompting strategies, and agent reasoning patterns.
* Practical expertise with **embeddings, vector databases, retrieval metrics, and reranking approaches**.
* Proven experience designing or operating **evaluation frameworks for generative AI or agentic systems**, including automated and human-in-the-loop evaluation.
* Strong understanding of **AI reliability, safety, and governance**, including guardrails, validation, monitoring, and change control.
* Working knowledge of **privacy engineering principles** and familiarity with GDPR/CCPA concepts such as consent, purpose limitation, and data subject rights.
* Experience operating in **enterprise or regulated environments**, including contributions to SOC 2 / ISO 27001-aligned systems and processes.
* Ability to influence across teams, communicate clearly about complex AI trade-offs, and drive alignment without direct authority.
* **Strong English language communication and collaboration skills**.

Nice to have

* Experience with agent frameworks such as **LangChain** or similar.
* Domain experience in **finance, accounting**, or other regulated industries (e.g., healthcare, legal).
* Experience with **AI safety or red-teaming**, including prompt injection, data exfiltration, or tool misuse.
* Familiarity with **governed change management**, including feature flags, staged rollouts, and kill switches.
* Experience with **agentic coding** or autonomous development workflows.

Technology stack your team works with

* Backend & Platform: TypeScript, NestJS, Python
* Cloud & Infrastructure: AWS EKS, AWS Lambda, AWS Bedrock, AWS AgentCore
* Search & Retrieval: AWS OpenSearch Serverless
* Document & Data Processing: AWS Textract, DynamoDB, S3
* AI Evaluation & Observability: LangFuse, LangSmith (or equivalent)
* AI-assisted development tools: GitHub Copilot, AWS Kiro
* Developer Tooling: GitHub, GitHub Actions, Nx Monorepo
* Collaboration: Jira, Confluence, Microsoft Teams, Outlook

Perks & Benefits

* Indefinite-term employment contract ("contrato a término indefinido") with all the legal benefits
* Prepaid medicine
* Life insurance and funeral assistance
* Internet allowance
* Home office stipend
* Competitive compensation, above the market average
* 100% remote work environment and an excellent work-life balance
* Opportunity to work for a growing global SaaS leader
* A culture that promotes independence, innovation, trust, and accountability
* Open space to be creative, innovate, and strategize for the future
* Mentorship by highly experienced professionals
* Budget for training; we want you to grow
* 5 Personal Time Off days per year
* Sick leave top-up to 100% of salary, paid by the employer from day 3 to day 90
* Recognition Award: additional paid time off in recognition of the corresponding year of service
* Upgraded vacation starting at 5 years of service
* Global Employee Assistance Program - Telus Health
