Hi, I'm Nagharjun Mathi Mariappan.
I’m someone who loves solving real-world problems with a mix of creativity, precision, and a deep attention to detail.
About
I’m a Data Scientist at Mount Sinai, applying machine learning to healthcare problems. I have a Master’s in Computer Engineering from NYU, where I built a strong foundation in ML, deep learning, and signal processing. Most of my recent work has been on time series modeling and transformer-based approaches for signal-to-signal tasks, particularly respiratory flow prediction from plethysmography signals. I’m also interested in experimenting with different model architectures and in running large models and algorithms efficiently on hardware. Beyond my professional work, I enjoy playing soccer.
- Programming: Python, C++, SQL
- Libraries/ML: PyTorch, TensorFlow, Scikit-learn, Pandas, NumPy
- MLOps & Data: MLflow, Airflow, Spark, Docker, Kubernetes, Git, Tableau, Linux
- Cloud: AWS - EC2, S3, Lambda, Bedrock, Redshift, ECS; Azure - VMs, Functions, Cognitive Services
- Machine Learning: Supervised ML, Clustering, PCA, CNNs, LSTMs, Attention, LLMs, Hypothesis Testing
Experience
- Generated airflow signal for home sleep-apnea studies lacking nasal cannulas by training a Dilated Residual CNN + Attention on plethysmography, reaching 84% recall and 75% precision.
- Restored a failed MLflow service within 24 h with 100% data integrity, using custom backup/restore pods and automated MinIO data snapshots scheduled with Kubernetes CronJobs.
- Improved team model tracking, logging, experimentation, and versioning by 30% by deploying MLflow on Ubuntu Server, improving workflow efficiency and reproducibility.
- Flagged and removed sensitive entities in clinical notes with a lightweight zero-shot GLiNER NER model, blocking 98% of protected health information (PHI) leaks before LLM inference via AWS Bedrock.
- Reduced token usage by 40% by clustering redundant clinical note chunks in RAG using cosine similarity.
- Lifted contract engagement 20% using a two-tower recommender system combining BERT text embeddings and company/contract metadata.
- Slashed document-parsing time 90% by deploying a GPT-3.5 Lambda API for PDF/DOCX/PPT extraction.
- Saved 8 h/week with Apache Airflow pipelines feeding AWS Redshift and S3.
- Enhanced Llama2 13B text generation structure by fine-tuning, achieving targeted structure in 90% of instances. Integrated Retrieval Augmented Generation to extract data from housing and zoning documents, improving data processing.
- Adopted Prompt Engineering techniques to control hallucinations in large language models, and applied few-shot learning for the remaining 10% of outputs with unexpected structures. This strategy completely eliminated manual intervention.
- Reduced Llama2 inference time by 57.7% (from approximately 26 seconds to 11 seconds per query) by integrating ExLlamaV2 with faster processing kernels on NYU’s HPC.
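The redundant-chunk reduction mentioned in the RAG bullet above can be illustrated with a minimal sketch. It is not the production pipeline: it assumes each chunk already has an embedding vector, uses a greedy similarity filter as a simple stand-in for clustering, and the function names and the 0.9 threshold are hypothetical choices for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def deduplicate_chunks(chunks, embeddings, threshold=0.9):
    """Keep a chunk only if it is not too similar to any chunk already kept.

    `threshold` is an assumed cut-off, not a value taken from the original work.
    """
    kept = []  # indices of chunks retained so far
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return [chunks[i] for i in kept]


# Toy example: the second chunk nearly duplicates the first.
chunks = ["note A", "note A (repeat)", "note B"]
embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
unique_chunks = deduplicate_chunks(chunks, embeddings)
```

Dropping near-duplicate chunks before they reach the prompt is what reduces the token count; the threshold trades off token savings against the risk of discarding distinct content.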
Open Source Contributions & Writing
Publications on Towards Data Science
Projects

A NYC Citibike dashboard based on Tableau
- Tools: AWS S3, Snowflake, Snowpipe, Apache Airflow, AWS EC2, Tableau
- Fetched real-time Citi Bike data and stored it in AWS S3.
- Automated ingestion into Snowflake using Snowpipe pipelines.
- Orchestrated ETL workflows with Apache Airflow on AWS EC2.
- Powered live Tableau dashboards for NYC bike availability visualization.

Poisoning an object detection model to evade ad blockers
- Tools: YOLOv5, PyTorch, OpenCV, Python
- Trained a YOLOv5 model to detect ads with 90% mean IoU.
- Created adversarial images to reduce detection rates by 15%.
- Analyzed model vulnerabilities against manipulated ad visuals.
- Retrained and fine-tuned the model to recover lost performance.
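For readers unfamiliar with the metric quoted above, intersection over union (IoU) for two axis-aligned boxes can be sketched as follows. This is an illustration only: the `(x1, y1, x2, y2)` box format, the function names, and the assumption that predictions are already matched one-to-one with ground-truth boxes are mine, not details from the project.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(predicted_boxes, true_boxes):
    """Average IoU over pairs, assuming the boxes are already matched in order."""
    pairs = list(zip(predicted_boxes, true_boxes))
    return sum(box_iou(p, t) for p, t in pairs) / len(pairs)


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
example_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```

An adversarial image that shifts or suppresses the predicted box lowers this overlap, which is one way a drop in detection quality would show up.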

Predicting NYC subway crowds using SparkML
- Tools: Apache Spark, PySpark, SparkML, Python
- Parallelized data transformations on 11M+ NYC subway crowd records using Apache Spark.
- Engineered features for crowd-level forecasting models.
- Trained regression models in SparkML achieving R² = 0.62.
- Enabled accurate real-time crowd predictions across subway stations.

A Vision Transformer-based image captioning model.
- Tools: Vision Transformer (ViT), Transformer Decoder, PyTorch, Flickr8K, GloVe
- Extracted image features using a pre-trained Vision Transformer encoder.
- Trained a transformer decoder to generate image captions.
- Achieved a 4-gram BLEU score of 19.6 on the Flickr8K dataset.
- Enabled image captioning for accessibility applications.

A webapp for doctors to segment and classify skin lesions.
- Tools: TensorFlow, React, Flask, Cloud Firestore, HAM10000
- Built a two-stage segmentation and classification pipeline for skin lesions.
- Achieved 89.3% accuracy and 89.1% precision across seven disease classes.
- Used atrous convolutions with residual learning for balanced performance.
- Deployed an interactive web app for faster skin cancer diagnosis.

A custom fine-tuned language model
- Tools: GPT-3, OpenAI API, Weights & Biases (wandb), Python, Google Colab
- Fine-tuned GPT-3 for script generation tasks using few-shot learning and transfer learning.
- Optimized fine-tuning by experimenting with epochs, batch size, and learning rates.
- Achieved high-quality results with minimal training and validation data.
- Proposed a cost-effective strategy for fine-tuning large language models.

A visually efficient regression to predict insurance prices.

A map visualization of the city of Baton Rouge through GeoPandas and Folium.
- Tools: Python, Folium, Pandas, GeoPandas, Matplotlib
- Visualized public parks and gyms in Baton Rouge using GeoJSON data.
- Created interactive maps with Folium and custom styling.
- Processed and cleaned spatial data with Pandas and GeoPandas.
- Enhanced urban data insights through layered map visualizations.
Research
Publications
- Ramamurthy, K., Muthuswamy, A., Mathimariappan, N., & Kathiresan, G. S. A novel two-staged network for skin disease detection using atrous residual convolutional networks. Concurrency and Computation: Practice and Experience.
- Karthik, R., Menaka, R., Kathiresan, G. S., Anirudh, M., & Nagharjun, M. Gaussian dropout based stacked ensemble CNN for classification of breast tumor in ultrasound images. IRBM, 43(6), 715-733.
- Kathiresan, G., Anirudh, M., Nagharjun, M., & Karthik, R. Disease detection in rice leaves using transfer learning techniques. Journal of Physics: Conference Series, Vol. 1911, No. 1.
Education
New York, NY, USA
Degree: Master of Science in Computer Engineering
CGPA: 3.96/4.0
Relevant Coursework:
- Big Data
- High Performance Machine Learning
- Deep Learning
- Data Structures and Algorithms
- Image Processing
- Databases