Senior Researcher
Want to build vision-language models that understand complex, real-world environments?
You’ll join a small, highly technical team working on foundational problems in multimodal AI, focused on training models that can interpret, reason about, and act on large-scale first-person video data.
You’ll be part of the founding research team, working directly with the Chief Science Officer, shaping how models are designed, trained, and evaluated. The work sits at the intersection of VLMs, long-context reasoning, and real-world deployment.
The focus is on building systems that move beyond static perception, towards temporal understanding, activity recognition, and higher-level reasoning across dynamic environments.
Your work will centre on:
- Designing and training VLMs on large-scale video datasets
- Developing post-training approaches including SFT, RLHF, and parameter-efficient tuning
- Building scalable training and evaluation pipelines
- Exploring long-context and temporal modelling
- Designing efficient systems for edge and server-side inference
- Defining benchmarks for spatial and behavioural understanding
Experience with large datasets, model optimisation, or deploying models into production environments will be valuable. Exposure to video data or long-context modelling is particularly relevant.
This is a team that values speed, ownership, and first-principles thinking. You’ll be working on open-ended problems with real-world impact, with the freedom to explore and define approaches.
Compensation: $200K - $400K (negotiable, depending on experience) + equity
Location: San Francisco, on-site
If you’re interested in building multimodal systems that operate in real-world settings, and want to join a well-funded, highly skilled research team, please apply now!
All applicants will receive a response.