Hi, I'm Nagharjun Mathi Mariappan.

I’m someone who loves solving real-world problems with a mix of creativity, precision, and a deep attention to detail.

About

I’m a Data Scientist at Mount Sinai, working on applying machine learning to healthcare problems. I have a Master’s in Computer Engineering from NYU, where I built a strong foundation in ML, deep learning, and signal processing. Most of my recent work has been around time series modeling and transformer-based approaches for signal-to-signal tasks, particularly in respiratory flow prediction from plethysmography signals. I’m also interested in experimenting with different model architectures and running large models and algorithms on hardware efficiently. Beyond my professional work, I enjoy playing soccer.

  • Programming: Python, C++, SQL
  • Libraries/ML: PyTorch, TensorFlow, Scikit-learn, Pandas, NumPy
  • MLOps & Data: MLflow, Airflow, Spark, Docker, Kubernetes, Git, Tableau, Linux
  • Cloud: AWS - EC2, S3, Lambda, Bedrock, Redshift, ECS; Azure - VMs, Functions, Cognitive Services
  • Machine Learning: Supervised ML, Clustering, PCA, CNNs, LSTMs, Attention, LLMs, Hypothesis Testing

Experience

Data Scientist
  • Generated airflow signal for home sleep-apnea studies lacking nasal cannulas by training a Dilated Residual CNN + Attention on plethysmography, reaching 84% recall and 75% precision.
  • Restored a failed MLflow service within 24 hours with 100% data integrity, using custom backup/restore pods and automated MinIO data snapshots scheduled with Kubernetes CronJobs.
  • Improved team model tracking, logging, experimentation, and versioning by 30% by deploying MLflow on Ubuntu Server, accelerating workflow efficiency and reproducibility.
  • Flagged and removed sensitive entities in clinical notes with a lightweight zero-shot GLiNER NER model, blocking 98% of protected health information (PHI) leaks before LLM inference via AWS Bedrock.
  • Reduced token usage by 40% by clustering redundant clinical note chunks in RAG using cosine similarity.
July 2024 - Present | New York, NY
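
A minimal sketch of the idea behind the token-reduction bullet above, assuming each clinical note chunk already has an embedding vector. It uses a simple greedy threshold filter as a stand-in for clustering: a chunk is dropped when its cosine similarity to an already kept chunk meets the threshold. The function names, the 0.9 threshold, and the toy vectors are illustrative assumptions, not the pipeline described above.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def drop_redundant_chunks(embeddings, threshold=0.9):
    """Greedily keep a chunk only if it is not too similar to any kept chunk.

    Returns the indices of the chunks to keep, in their original order.
    The threshold is a hypothetical value chosen for illustration.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy embeddings (made up): chunk 1 is a near-duplicate of chunk 0.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],
    [0.0, 1.0, 0.0],
]
kept_indices = drop_redundant_chunks(toy_embeddings)
print(kept_indices)
```

Only the kept chunks would then be passed to the LLM, which is where a token saving would come from; the actual saving depends on the data and the threshold.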
Data Science Intern
  • Lifted contract engagement 20% using a two-tower recommender system combining BERT text embeddings and company/contract metadata.
  • Slashed document-parsing time 90% by deploying a GPT-3.5 Lambda API for PDF/DOCX/PPT extraction.
  • Saved 8 h/week with Apache Airflow pipelines feeding AWS Redshift and S3.
January 2024 - May 2024 | New York, NY
Research Assistant
  • Enhanced Llama2 13B text generation structure by fine-tuning, achieving targeted structure in 90% of instances. Integrated Retrieval Augmented Generation to extract data from housing and zoning documents, improving data processing.
  • Adopted Prompt Engineering techniques to control hallucinations in large language models, and applied few-shot learning for the remaining 10% of outputs with unexpected structures. This strategy completely eliminated manual intervention.
  • Reduced Llama2 inference time by 57.7% (from approximately 26 seconds to 11 seconds per query) by integrating ExLlamaV2 with faster processing kernels on NYU’s HPC.
September 2023 - December 2023 | New York, NY

Open Source Contributions & Writing

Projects

NYC Citibike Dashboard

An NYC Citi Bike dashboard built with Tableau

Accomplishments
  • Tools: AWS S3, Snowflake, Snowpipe, Apache Airflow, AWS EC2, Tableau
  • Fetched real-time Citi Bike data and stored it in AWS S3.
  • Automated ingestion into Snowflake using Snowpipe pipelines.
  • Orchestrated ETL workflows with Apache Airflow on AWS EC2.
  • Powered live Tableau dashboards for NYC bike availability visualization.
Evading Ad Blockers

Poisoning an object detection model to evade ad blockers

Accomplishments
  • Tools: YOLOv5, PyTorch, OpenCV, Python
  • Trained a YOLOv5 model to detect ads with 90% mean IoU.
  • Created adversarial images to reduce detection rates by 15%.
  • Analyzed model vulnerabilities against manipulated ad visuals.
  • Retrained and fine-tuned the model to recover lost performance.
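
For readers unfamiliar with the mean IoU metric mentioned above, here is a minimal sketch of how intersection over union is computed for axis-aligned bounding boxes and averaged over matched prediction/ground-truth pairs. The `(x1, y1, x2, y2)` box format and the function names are assumptions for illustration, not the project's evaluation code.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over (predicted_box, ground_truth_box) pairs."""
    return sum(box_iou(pred, truth) for pred, truth in pairs) / len(pairs)


# Hypothetical boxes: one perfect match and one partial overlap.
example_pairs = [
    ((0, 0, 2, 2), (0, 0, 2, 2)),
    ((0, 0, 2, 2), (1, 1, 3, 3)),
]
print(mean_iou(example_pairs))
```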
NYC Station Sense

Predicting NYC subway crowds using SparkML

Accomplishments
  • Tools: Apache Spark, PySpark, SparkML, Python
  • Parallelized data transformations on 11M+ NYC subway crowd records using Apache Spark.
  • Engineered features for crowd-level forecasting models.
  • Trained regression models in SparkML achieving R² = 0.62.
  • Enabled accurate real-time crowd predictions across subway stations.
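
As context for the R² figure above, this is a small sketch of how the coefficient of determination is defined: one minus the ratio of the residual sum of squares to the total sum of squares. It is written in plain Python with made-up numbers for illustration; the project itself reports the value from SparkML, and the function name here is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Made-up crowd counts, not project data.
observed = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(observed, [1.0, 2.0, 3.0, 4.0])   # every prediction exact
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
print(perfect, baseline)
```

An R² of 0.62 therefore means the model accounts for roughly 62% of the variance in the target relative to always predicting the mean.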
Image Captioning using ViT

A Vision Transformer-based image captioning model.

Accomplishments
  • Tools: Vision Transformer (ViT), Transformer Decoder, PyTorch, Flickr8K, GloVe
  • Extracted image features using a pre-trained Vision Transformer encoder.
  • Trained a transformer decoder to generate image captions.
  • Achieved a 4-gram BLEU score of 19.6 on the Flickr8K dataset.
  • Enabled image captioning for accessibility applications.
Skin Lesion Segment and Classify

A web app for doctors to segment and classify skin lesions.

Accomplishments
  • Tools: TensorFlow, React, Flask, Cloud Firestore, HAM10000
  • Built a two-stage segmentation and classification pipeline for skin lesions.
  • Achieved 89.3% accuracy and 89.1% precision across seven disease classes.
  • Used atrous convolutions with residual learning for balanced performance.
  • Deployed an interactive web app for faster skin cancer diagnosis.
Fine Tuning GPT-3

A custom fine-tuned language model

Accomplishments
  • Tools: GPT-3, OpenAI API, Weights & Biases (wandb), Python, Google Colab
  • Fine-tuned GPT-3 for script generation tasks using few-shot learning and transfer learning.
  • Optimized fine-tuning by experimenting with epochs, batch size, and learning rates.
  • Achieved high-quality results with minimal training and validation data.
  • Proposed a cost-effective strategy for fine-tuning large language models.
Interactive Multiple Regression

An interactive regression model to predict insurance prices.

Accomplishments
  • Tools: Scikit-learn, Pandas, Matplotlib, ipywidgets
  • Built a linear regression model to predict insurance charges.
  • Selected features using correlation and ANOVA tests.
  • Achieved a mean absolute error of $4,256 on test data.
  • Created an interactive widget for real-time predictions.
Urban Data Visualization

A map visualization of the city of Baton Rouge using GeoPandas and Folium.

Accomplishments
  • Tools: Python, Folium, Pandas, GeoPandas, Matplotlib
  • Visualized public parks and gyms in Baton Rouge using GeoJSON data.
  • Created interactive maps with Folium and custom styling.
  • Processed and cleaned spatial data with Pandas and GeoPandas.
  • Enhanced urban data insights through layered map visualizations.

Research

Publications

  • Ramamurthy, K., Muthuswamy, A., Mathimariappan, N., & Kathiresan, G. S. A novel two‐staged network for skin disease detection using atrous residual convolutional networks. Concurrency and Computation: Practice and Experience. Link to paper
  • Karthik, R., Menaka, R., Kathiresan, G. S., Anirudh, M., & Nagharjun, M. Gaussian dropout based stacked ensemble CNN for classification of breast tumor in ultrasound images. IRBM, 43(6), 715-733. Link to paper
  • Kathiresan, G., Anirudh, M., Nagharjun, M., & Karthik, R. Disease detection in rice leaves using transfer learning techniques. Journal of Physics: Conference Series, Vol. 1911, No. 1. Link to paper

Skills

Languages and Databases

Python
C++
MySQL
PostgreSQL
Shell Scripting

Libraries

PyTorch
TensorFlow
NumPy
Pandas
OpenCV
scikit-learn
matplotlib

Frameworks

MLflow
Apache Airflow
Apache Spark
Docker
Kubernetes

Other

Git
Tableau
Linux
AWS
Azure

Education

New York University

New York, NY, USA

Degree: Master of Science in Computer Engineering
CGPA: 3.96/4.0

    Relevant Coursework:

    • Big Data
    • High Performance Machine Learning
    • Deep Learning
    • Data Structures and Algorithms
    • Image Processing
    • Databases

Contact