13+ years at AMD & Xilinx bridging Applied AI and systems engineering. Built end-to-end AI-based solutions. Mentored 10+ engineers across global sites.

Hackathons & Research

  • Deep RL for Floorplan Optimization — AMD Internal Conference Finalist; Graph Isomorphism Network (GIN) on 15M-node netlists, 2% QoR gain. (arXiv pending)
  • ML-based Delay Prediction for EDA — AMD Internal Conference Finalist; GNN complexity models with automated fine-tuning & drift detection
  • Adaptive OFDM Pilots — IEEE WAMICON 2009; adaptive pilot placement for OFDM systems
  • ALTERA Design Challenge — Top 15, Innovate India Design Contest 2007
  • ImageNet Classifier — ResNet-50, 77.4% Top-1 on ImageNet-1K; CutMix, MixUp, Random Erasing, LR Finder
  • Elite Mentorship Program — AMD

Applied AI

  • Researching auto-generated scheduling algorithms for executing large computational DAGs on memory-constrained accelerators, where tensors far exceed on-chip capacity and data movement must be orchestrated across execution stages
  • Approach: an auto-research pipeline identifies DAG motifs and synthesizes motif-specific scheduling strategies, combining greedy and heuristic algorithms with AI-assisted heuristic selection via a multi-armed bandit (MAB) — the MAB learns which heuristic to apply per motif, avoiding one-size-fits-all schedules — 50% latency reduction vs. baseline schedulers
  • Built an agentic auto-optimizer for MoE workloads on B200: a two-agent feedback loop in which one agent iteratively builds, runs, and evaluates kernels while a second profiles them with Nsight Compute & Nsight Systems — the loop continues until hitting the Speed-of-Light (SOL) limit, achieving ~20% gain over a naive PyTorch baseline
  • Deep RL for FPGA directive optimization: GIN feature extraction on 15M-node netlists, reward shaping — 2% QoR gain
  • Ray distributed training with Grid/ASHA/PBT hyperparameter search for scalable RL experiments
  • GNN delay prediction with automated fine-tuning, monitoring, and drift detection
  • Production agentic AI framework: reverse-engineers EDA tool behavior into a skill document cataloguing every known error pattern and its resolution; during live sessions the agent reasons over the error against the skill document & Python docs to suggest targeted fixes — graph-based LLM orchestration, iterative self-correction, Dockerized evaluation
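The per-motif heuristic selection above can be sketched as an epsilon-greedy multi-armed bandit — a minimal illustration only; the motif names, heuristic names, and reward model below are hypothetical placeholders, not the production scheduler:

```python
import random
from collections import defaultdict

class MotifBandit:
    """Epsilon-greedy multi-armed bandit: learns, per DAG motif,
    which scheduling heuristic tends to yield the best reward."""

    def __init__(self, heuristics, epsilon=0.1, seed=0):
        self.heuristics = heuristics
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # pulls per (motif, heuristic)
        self.values = defaultdict(float)  # running mean reward per (motif, heuristic)

    def select(self, motif):
        # explore with probability epsilon, otherwise exploit the best estimate
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.heuristics)
        return max(self.heuristics, key=lambda h: self.values[(motif, h)])

    def update(self, motif, heuristic, reward):
        key = (motif, heuristic)
        self.counts[key] += 1
        # incremental mean update
        self.values[key] += (reward - self.values[key]) / self.counts[key]

bandit = MotifBandit(["greedy_depth_first", "min_peak_memory", "critical_path"])
# simulated feedback: pretend "min_peak_memory" schedules this motif best
for _ in range(500):
    h = bandit.select("reduction_tree")
    bandit.update("reduction_tree", h, 1.0 if h == "min_peak_memory" else 0.2)
```

After a few hundred simulated trials, the bandit's reward estimate for the winning heuristic converges, so exploitation selects it for that motif while other motifs can learn different winners.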
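The ASHA-style search used for the RL experiments reduces to successive halving: evaluate many configurations on a small budget, promote the best fraction, repeat. A toy pure-Python sketch of that idea (Ray Tune provides the real scheduler as `ASHAScheduler`; the objective function here is synthetic):

```python
def successive_halving(configs, evaluate, budget=1, eta=2, rounds=3):
    """ASHA-style successive halving: evaluate every config on a small
    budget, keep the top 1/eta, and give the survivors eta-times more budget."""
    survivors = list(configs)
    for _ in range(rounds):
        scored = sorted(survivors, key=lambda c: evaluate(c, budget), reverse=True)
        survivors = scored[: max(1, len(scored) // eta)]
        budget *= eta
    return survivors[0]

# synthetic objective: score grows with budget and peaks at lr == 0.1
def evaluate(config, budget):
    return budget * (1.0 - abs(config["lr"] - 0.1))

best = successive_halving([{"lr": lr} for lr in (0.001, 0.01, 0.1, 0.5)], evaluate)
```

Real ASHA additionally stops trials asynchronously rather than in synchronized rounds, which is what makes it practical on a shared cluster.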

Backend Engineering

AMD (formerly Xilinx) 2012 – Present · San Jose, CA
Senior Staff / Research Engineer — Design Automation & Systems
  • Mentored 10+ engineers across global sites on simulation tooling for 10nm/7nm/2nm FPGA nodes
  • Client/server system (Boost Asio + Protobuf) for concurrent multi-capture — 3x throughput
  • Divide-and-conquer parallel processing via LSF Farm — 20x scale
  • Graph compression pipeline: 3.5B datapoints → 500K patterns with Python analytics
  • Tool profiling, linters, dashboards, and YAML semantic verifier for HW/SW validation
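The graph-compression idea — collapsing billions of raw datapoints into a small dictionary of recurring patterns — can be illustrated with simple n-gram counting over an event stream (a toy sketch; the event names and window size are hypothetical, not the production pipeline):

```python
from collections import Counter

def compress_events(events, window=3):
    """Collapse a long event stream into counts of recurring n-gram
    patterns — a toy stand-in for pattern-based graph compression."""
    windows = (tuple(events[i:i + window]) for i in range(len(events) - window + 1))
    return Counter(windows).most_common()

# hypothetical event stream: one dominant pattern plus a trailing outlier
stream = ["rd", "wr", "ack"] * 1000 + ["err"]
top = compress_events(stream)
```

A 3,001-event stream collapses to a handful of patterns dominated by the repeating triple, mirroring how 3.5B datapoints can reduce to ~500K patterns when the underlying structure is highly repetitive.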

Technical Skills

ML/DL Frameworks: PyTorch, HuggingFace, TRL, vLLM, DeepSpeed, FSDP

RL & Agents: Stable-Baselines3, Ray, Gym, multi-agent orchestration

GPU & Performance: Triton, Flash Attention, Nsight Systems/Compute, CUDA, mixed-precision

Languages: Python, C++, Golang

Infrastructure: Kubernetes, Docker, Ray, LSF, W&B, Optuna

Education

M.Eng Electrical Engineering — University of Cincinnati, 2012

B.Eng Electronics & Communication — Anna University, 2007

Certifications

Triton Kernel Dev on AMD Instinct GPUs · LLM Serving with vLLM & MI300X · Agentic Framework (HuggingFace) · Generative AI with LLMs (DeepLearning.AI) · ML Ops (DeepLearning.AI) · Machine Learning (Stanford) · Analytics Edge (MITx) · Parallel & Distributed Computing (Rice) · Kubernetes (Udacity) · Big Data with Spark (Berkeley)