AI Engineers

Accellor • Full-time • Mountain View, CA, US • 2d ago

Accellor is an AI-native services firm purpose-built for the post-ChatGPT era. Free from legacy constraints, we focus on delivering measurable business outcomes through advanced AI, data, and engineering capabilities. Our mission is to operationalize AI at scale and unlock sustained enterprise value.

Our offerings span AI solutions, data services, enterprise applications, and product engineering, tailored to industry-specific needs across healthcare, life sciences, telecom, retail, financial services, and technology. By leveraging design thinking and technology-agnostic architectures, we ensure faster time-to-value and seamless interoperability.

With a proven track record of enabling Fortune 100 enterprises and global innovators, Accellor stands as a trusted partner for organizations seeking to harness the full potential of AI. Our vision is clear: to build intelligent, connected ecosystems that deliver measurable outcomes and redefine the future of enterprise transformation.

Role Summary:

We are looking for an AI Engineer with strong problem-solving ability, solid Python engineering skills, and hands-on understanding of machine learning, deep learning, GPUs, and CUDA fundamentals. The ideal candidate may be early in their career but must be technically sharp, curious, implementation-driven, and capable of learning complex AI systems quickly.

This role is suited for someone who can build models, debug training issues, optimize GPU workloads, understand tensor operations, and work closely with research, platform, and engineering teams.

Key Responsibilities:

Design, build, train, and evaluate AI/ML models using Python, TensorFlow, PyTorch, or JAX
Develop clean training pipelines with data loading, checkpointing, logging, validation, and experiment tracking
Work on deep learning models including CNNs, Transformers, LLMs, embeddings, and attention-based architectures
Debug model issues such as poor convergence, overfitting, unstable loss, NaNs, tensor-shape errors, and GPU memory failures
Run and optimize models on GPUs with awareness of CUDA execution, memory usage, batching, mixed precision, and kernel performance
Profile training and inference workloads to identify bottlenecks in compute, memory, data loading, and communication
Support development of custom operators, fused kernels, CUDA/Triton-based optimizations, or framework-level performance improvements
Build AI inference pipelines with a focus on latency, throughput, reliability, cost, and quality
Create evaluation pipelines for model accuracy, robustness, hallucination risk, safety, latency, and regression testing
Read research papers, implement ideas, run experiments, and clearly document findings, trade-offs, and limitations

Requirements:

Strong programming experience in Python
Good understanding of data structures, algorithms, numerical programming, and clean code practices
Hands-on experience with TensorFlow, PyTorch, or JAX
Strong fundamentals in deep learning, backpropagation, gradient descent, optimizers, loss functions, regularization, CNNs, Transformers, attention mechanisms, embeddings, and model evaluation
Experience training and debugging models on CUDA-enabled GPUs
Basic understanding of GPU memory, CUDA kernels, threads, blocks, and warps, memory bandwidth, kernel launch overhead, mixed precision training, and CUDA out-of-memory debugging
Comfortable with Linux, Git, Docker, shell scripting, and experiment management
Ability to reason from first principles and solve ambiguous technical problems

Preferred Skills:

CUDA C/C++ or Triton kernel development
C++ exposure for performance-critical AI systems
Distributed training using PyTorch DDP, TensorFlow distribution strategies (tf.distribute), JAX, NCCL, MPI, or similar tools
Experience with GPU profiling tools such as NVIDIA Nsight Systems, NVIDIA Nsight Compute, PyTorch Profiler, and TensorFlow Profiler
Experience with inference optimization, quantization, TensorRT, ONNX, XLA, or model compilation
Exposure to LLMs, RAG systems, agents, embeddings, multimodal AI, or AI evaluation frameworks
Understanding of FP32, FP16, BF16, INT8, tensor cores, and numerical stability

What Excellent Looks Like:

Can implement a model from a research paper without blindly copying code
Can debug why a model is not learning
Can explain why GPU utilization is low
Can identify whether a workload is compute-bound, memory-bound, or communication-bound
Can write clean, reproducible, and testable Python code
Can profile before optimizing
Can work across model logic, GPU execution, data pipelines, and inference systems
Can communicate technical findings clearly and precisely

Ideal Candidate Profile:

The ideal candidate is an early-career AI engineer who does not treat ML libraries as black boxes.
They understand how models train, how tensors move, how GPUs execute workloads, and how performance, quality, reliability, and safety come together in real AI systems.
The strongest candidates will show hands-on projects in model implementation, GPU acceleration, CUDA/Triton operators, distributed training, LLM evaluation, inference optimization, or research paper reproduction.