Learning | Nagharjun Mathi Mariappan 👋

Python • TVM • CUDA

Benchmarking a Vision Transformer encoder and a transformer decoder in TVM Relax, comparing baseline and optimized pipelines.

C++ • MLIR • LLVM • CUDA

Tiny playground for a custom MLIR dialect with Add+ReLU fusion and lowering to PTX.

C++ • MLIR • LLVM

Learning by adding new operations to an existing MLIR dialect and lowering them to LLVM.

Python • PyTorch • ONNX Runtime • Kria KV260

Detects Wi-Fi extender power loss by watching its status LED, then sends instant alerts.

Python • PyTorch • Triton

Custom implementation of Flash Attention forward and backward kernels, with benchmarks.

C++ • CUDA 12.x • RTX 3060

Benchmarking matrix multiplication, comparing a shared-memory implementation against a naive one.

Learning new stuff