Accellor is an AI-native services firm purpose-built for the post-ChatGPT era. Free from legacy constraints, we focus on delivering measurable business outcomes through advanced AI, data, and engineering capabilities. Our mission is to operationalize AI at scale and unlock sustained enterprise value.
Our offerings span AI solutions, data services, enterprise applications, and product engineering, tailored to industry-specific needs across healthcare, life sciences, telecom, retail, financial services, and technology. By leveraging design thinking and technology-agnostic architectures, we ensure faster time-to-value and seamless interoperability.
With a proven track record of enabling Fortune 100 enterprises and global innovators, Accellor stands as a trusted partner for organizations seeking to harness the full potential of AI. Our vision is clear: to build intelligent, connected ecosystems that deliver measurable outcomes and redefine the future of enterprise transformation.
Role Summary:
We are looking for an AI Engineer with strong problem-solving ability, solid Python engineering skills, and hands-on understanding of machine learning, deep learning, GPUs, and CUDA fundamentals. The ideal candidate may be early in their career but must be technically sharp, curious, implementation-driven, and capable of learning complex AI systems quickly.
This role is suited for someone who can build models, debug training issues, optimize GPU workloads, understand tensor operations, and work closely with research, platform, and engineering teams.
Key Responsibilities:
- Design, build, train, and evaluate AI/ML models using Python, TensorFlow, PyTorch, or JAX
- Develop clean training pipelines with data loading, checkpointing, logging, validation, and experiment tracking
- Work on deep learning models including CNNs, Transformers, LLMs, embeddings, and attention-based architectures
- Debug model issues such as poor convergence, overfitting, unstable loss, NaNs, tensor-shape errors, and GPU memory failures
- Run and optimize models on GPUs with awareness of CUDA execution, memory usage, batching, mixed precision, and kernel performance
- Profile training and inference workloads to identify bottlenecks in compute, memory, data loading, and communication
- Support development of custom operators, fused kernels, CUDA/Triton-based optimizations, or framework-level performance improvements
- Build AI inference pipelines with a focus on latency, throughput, reliability, cost, and quality
- Create evaluation pipelines for model accuracy, robustness, hallucination risk, safety, latency, and regression testing
- Read research papers, implement ideas, run experiments, and clearly document findings, trade-offs, and limitations
Requirements:
- Strong programming experience in Python
- Good understanding of data structures, algorithms, numerical programming, and clean code practices
- Hands-on experience with TensorFlow, PyTorch, or JAX
- Strong fundamentals in:
  - Deep learning
  - Backpropagation
  - Gradient descent
  - Optimizers
  - Loss functions
  - Regularization
  - CNNs
  - Transformers
  - Attention mechanisms
  - Embeddings
  - Model evaluation
- Experience training and debugging models on CUDA-enabled GPUs
- Basic understanding of:
  - GPU memory
  - CUDA kernels
  - Threads, blocks, and warps
  - Memory bandwidth
  - Kernel launch overhead
  - Mixed precision training
  - CUDA out-of-memory debugging
- Comfortable with Linux, Git, Docker, shell scripting, and experiment management
- Ability to reason from first principles and solve ambiguous technical problems
Preferred Skills:
- CUDA C/C++ or Triton kernel development
- C++ exposure for performance-critical AI systems
- Distributed training using PyTorch DDP, TensorFlow tf.distribute.Strategy, JAX, NCCL, MPI, or similar tools
- Experience with GPU profiling tools such as:
  - NVIDIA Nsight Systems
  - NVIDIA Nsight Compute
  - PyTorch Profiler
  - TensorFlow Profiler
- Experience with inference optimization, quantization, TensorRT, ONNX, XLA, or model compilation
- Exposure to LLMs, RAG systems, agents, embeddings, multimodal AI, or AI evaluation frameworks
- Understanding of FP32, FP16, BF16, INT8, tensor cores, and numerical stability
What Excellent Looks Like:
- Can implement a model from a research paper without blindly copying code
- Can debug why a model is not learning
- Can explain why GPU utilization is low
- Can identify whether a workload is compute-bound, memory-bound, or communication-bound
- Can write clean, reproducible, and testable Python code
- Can profile before optimizing
- Can work across model logic, GPU execution, data pipelines, and inference systems
- Can communicate technical findings clearly and precisely
Ideal Candidate Profile:
- The ideal candidate is an early-career AI engineer who is not limited to using ML libraries as black boxes
- They understand how models train, how tensors move, how GPUs execute workloads, and how performance, quality, reliability, and safety come together in real AI systems
- The strongest candidates will show hands-on projects in model implementation, GPU acceleration, CUDA/Triton operators, distributed training, LLM evaluation, inference optimization, or research paper reproduction