QA Engineer

Titan AI • Full-time • Remote (United States)

About Titan

Titan builds AI software for banks: purpose-built small language models, a banking ontology, and AI bankers that financial institutions can trust. Our models outperform general-purpose LLMs by 30 to 80 percent on banking tasks. Customers include community banks, credit unions, and large regional and super-regional institutions. We are backed by leading fintech investors and operate under the compliance, audit, and model-risk standards that banking requires.

Why This Role Exists

Titan is scaling from a handful of live banking customers to thirty, then to hundreds. Right now, there is no formal QA function. There is no evaluation framework, no regression baseline, no quality gate in CI/CD. A QA failure at a bank is not a user experience problem. It is an operational and regulatory risk. This role exists because that gap has to close before the customer count grows.

This is a hands-on, individual-contributor role first. You are coming in to do the work: write the test cases, build the evaluation framework, set up CI/CD gates, and triage bugs alongside engineering. The function gets built because you build it yourself. Once the practice is stable and documented, you bring in additional quality engineers to scale it.

What You Own

AI Evaluation. You personally design and execute the evaluation framework for LLM and agentic AI outputs across Foundry, Agent Builder, and client-deployed instances. You write the assertions, define the behavioral contracts, and own regression baselines for model behavior. Standard QA methods break down here: you cannot write a deterministic assertion for whether an AI accurately summarized a 200-page loan agreement. You need to think in distributions and confidence intervals, and you need to build tooling that does too.
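To make "distributions and confidence intervals" concrete, here is a minimal sketch of a statistical assertion, not a description of Titan's actual tooling. It assumes each evaluation run has already been graded as a binary pass or fail by some grader that is not shown. The function names and the 0.90 threshold are hypothetical. Instead of asserting on one output, it checks that the lower bound of a Wilson score interval on the pass rate clears a minimum.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an observed pass rate.

    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= passes <= trials:
        raise ValueError("passes must be between 0 and trials")
    p_hat = passes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    centre = p_hat + z2 / (2 * trials)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / denom


def passes_gate(passes: int, trials: int, min_rate: float = 0.90, z: float = 1.96) -> bool:
    """Hypothetical quality gate: pass only if we are confident the true
    pass rate is at least min_rate, not merely that the sample looked good."""
    return wilson_lower_bound(passes, trials, z) >= min_rate


if __name__ == "__main__":
    # Same kind of check, different sample sizes: the interval tightens as trials grow.
    for passes, trials in [(95, 100), (980, 1000)]:
        lb = wilson_lower_bound(passes, trials)
        print(f"{passes}/{trials}: lower bound {lb:.3f}, gate passed: {passes_gate(passes, trials)}")
```

The design point is that a 95 percent observed pass rate on 100 samples does not clear a 0.90 gate under this rule, because the interval is still too wide. The gate rewards larger evaluation sets rather than lucky runs.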

Test Coverage. You write and maintain the automated test suite: end-to-end, integration, and regression coverage for backend APIs, document ingestion pipelines, AI inference workflows, and frontend surfaces. You own performance and load testing for latency-sensitive inference paths. You set up and enforce quality gates in CI/CD pipelines. When a bug surfaces in production, you take part in triage, you write the reproduction case, and you own the regression test that prevents it from coming back.

Compliance and Client Quality. You produce the test artifacts, audit logs, and process documentation that meet SOC 2 Type II standards. You work directly with Forward Deployed Engineering on client-side validation and production issue reproduction. Bank examiners will scrutinize this work. It needs to be defensible on its own.

Who You Are

Seven or more years in software QA engineering, with at least two years personally testing AI or ML systems. You have written test cases against LLM outputs, built evaluation pipelines from scratch, and learned the difference between a flaky test and a genuinely non-deterministic system. You are fluent in Python and have built automated suites using pytest, Playwright, or Selenium. You have hands-on experience with RAGAS, DeepEval, LangSmith, or comparable evaluation tooling, not just familiarity with the names.

You can trace a failure from the application layer to infrastructure and know enough about Azure, async systems, and REST APIs to do it without waiting on an engineer to walk you through it. You have integrated QA gates into CI/CD pipelines and owned the process end to end. Experience in fintech, banking, or another regulated environment is a strong advantage. Familiarity with document processing pipelines, multi-agent architectures, RAG validation, or observability tooling such as Arize or Langfuse puts you ahead. You are not here to manage. You are here to build and test.

What Success Looks Like

In your first 90 days: a diagnostic of current test coverage shared with engineering leadership, an evaluation framework running against at least one AI-powered workflow that you built yourself, and quality gates live in CI/CD. In your first six months: regression baselines established for model behavior, SOC 2 test artifacts documented and audit-ready, and the test suite running on every release without manual intervention. At one year: the function is staffed, coverage scales with every product release, and quality is a first-class input to every deployment decision. The work you did personally is the foundation the team builds on.