Incident Classification with BERT
Replaced TF-IDF baseline with fine-tuned BERT for mixed stack traces and free text, improving F1 from 0.78 to 0.92. Deployed on AWS EKS with p95 latency under 100ms for live Jira ticket processing.
Hello, I'm a Data Scientist with 3 years of hands-on experience in modern data architecture, Lakehouse/Databricks, ETL migration, and production-grade deep learning (NLP/CV). From PySpark pipelines to LLM-powered agents, I build systems that deliver measurable impact.
Data Scientist & Engineer based in Potsdam, Germany — building scalable data platforms, production ML systems, and modern web applications.
3 years at Envestnet Inc. and Cognizant — data pipeline modernization, Lakehouse architecture, anomaly detection, and full-stack development.
M.Sc. Data Science — University of Europe for Applied Sciences, Potsdam.
B.Tech EEE — Govt. Engineering College, Barton Hill, India.
PySpark & Delta Lake pipelines, PyTorch/TensorFlow for NLP & CV, RAG systems with LangChain, and React/Next.js frontends.
End-to-end streaming pipeline for financial transaction events with Kafka, Structured Streaming, and Delta Lake sink. Exactly-once semantics with under 30s end-to-end latency at 10k+ events/min.
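In Structured Streaming, exactly-once delivery to a Delta sink typically combines checkpointed offsets with an idempotent upsert on a unique event key, so broker redeliveries become no-ops. A minimal sketch of that dedup idea (class and field names here are illustrative, not the production code):

```python
# Sketch of the idempotent-sink pattern behind exactly-once delivery: Kafka
# can redeliver events, so the sink deduplicates by event id before applying
# them. In production the dedup key would drive a Delta Lake MERGE.

class IdempotentSink:
    def __init__(self):
        self.seen_ids = set()   # stands in for the Delta merge key
        self.total = 0.0

    def apply(self, event):
        """Apply a transaction event exactly once, ignoring replayed duplicates."""
        if event["id"] in self.seen_ids:
            return False        # duplicate delivery: no-op
        self.seen_ids.add(event["id"])
        self.total += event["amount"]
        return True

sink = IdempotentSink()
events = [
    {"id": "tx-1", "amount": 10.0},
    {"id": "tx-2", "amount": 5.0},
    {"id": "tx-1", "amount": 10.0},  # redelivered by the broker
]
applied = [sink.apply(e) for e in events]  # third event is dropped
```

The same at-least-once-plus-dedup structure is what makes replays safe after a checkpoint recovery.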
Fine-tuned ResNet50 for surface defect classification (scratches, cracks) with 96.5% validation accuracy. Achieved 30 FPS on CPU-limited edge devices via INT8 quantization and ONNX Runtime.
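The speed-up above comes from INT8 quantization. Real deployments use ONNX Runtime's calibration tooling; this sketch just shows the per-tensor scale/round/clamp arithmetic that the technique rests on (values are illustrative):

```python
# Symmetric per-tensor INT8 quantization: map floats into [-128, 127] with a
# single scale, then dequantize to approximate the original values.

def quantize_int8(values):
    """Quantize floats to int8 using a symmetric per-tensor scale."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-128, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.02, 1.27]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

The quantization error is bounded by half a scale step, which is why accuracy usually survives the 4x shrink in weight size.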
RAG system for 5,000+ pages of financial documents with hybrid retrieval (vector + BM25), metadata filtering, and conversation memory. Reduced hallucination rate by 30% on compliance queries.
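One common way to fuse a vector ranking with a BM25 keyword ranking in a hybrid retriever is reciprocal rank fusion (RRF). This is a sketch of that fusion step only, with illustrative document ids; the production system may weight the two lists differently:

```python
# Reciprocal rank fusion: each list contributes 1 / (k + rank) per document,
# so documents ranked well by both retrievers float to the top.

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; returns ids sorted best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]   # semantic similarity order
bm25_hits   = ["doc_b", "doc_d", "doc_a"]   # keyword match order
fused = rrf([vector_hits, bm25_hits])
```

RRF needs no score normalization across the two retrievers, which is why it is a popular default for hybrid search.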
Fine-tuned BERT-Base for financial document classification across 8 categories with Macro-F1 of 0.91. Optimized to 45ms p95 latency via ONNX INT8 quantization for CPU deployment.
Configurable data quality framework with Great Expectations for automated validation across Bronze/Silver/Gold layers. Lineage tracking with Airflow XComs and dbt manifest parsing visualized in Streamlit.
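Great Expectations expresses checks like these declaratively; the pure-Python sketch below just shows the pass/fail contract such a framework enforces per layer (column names and rows are illustrative):

```python
# Expectation-style data quality checks: each returns a result dict with a
# success flag and the indices of failing rows, mirroring the validation
# contract applied at the Bronze/Silver/Gold boundaries.

def expect_values_not_null(rows, column):
    failures = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not failures, "failed_rows": failures}

def expect_values_between(rows, column, low, high):
    failures = [i for i, r in enumerate(rows)
                if not (low <= r[column] <= high)]
    return {"success": not failures, "failed_rows": failures}

bronze_rows = [
    {"amount": 12.5, "currency": "EUR"},
    {"amount": -3.0, "currency": None},   # fails both checks
]
null_check = expect_values_not_null(bronze_rows, "currency")
range_check = expect_values_between(bronze_rows, "amount", 0, 1_000_000)
```

In the real framework these results feed the lineage view, so a failing Bronze check blocks promotion to Silver.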
University of Europe for Applied Sciences, Potsdam
Master's thesis: Predictive Maintenance on heavy-duty truck sensor data. Built a hybrid ensemble (Temporal Fusion Transformer + XGBoost) that outperformed RNN baselines by 18% F1 on 1.1M sensor measurements. Quantified EUR 450K in annual savings potential via asymmetric loss optimization, and used SHAP for interpretable fault-driver identification.
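The asymmetric-loss idea can be sketched in a few lines: in predictive maintenance, a missed failure (false negative) is far more expensive than a false alarm, so the cost function weights the two errors differently. The 10:1 ratio below is illustrative, not the thesis's calibrated value:

```python
# Asymmetric misclassification cost: false negatives (missed failures) are
# penalized more heavily than false positives (unnecessary service stops).

def asymmetric_cost(y_true, y_pred, fn_weight=10.0, fp_weight=1.0):
    """Total cost of a set of binary predictions under asymmetric penalties."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 0:
            cost += fn_weight   # missed failure
        elif t == 0 and p == 1:
            cost += fp_weight   # false alarm
    return cost

y_true = [1, 0, 1, 0]
miss_one  = asymmetric_cost(y_true, [0, 0, 1, 0])  # one false negative
alarm_one = asymmetric_cost(y_true, [1, 1, 1, 0])  # one false positive
```

Optimizing the decision threshold against such a cost, rather than plain accuracy, is what turns model quality into a monetary savings estimate.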
Envestnet Inc.
Refactored legacy stored procedures into PySpark jobs (10M+ rows), accelerating runtimes by 60%. Built Delta Lake pipelines (Bronze/Silver/Gold) for 500k+ daily financial records. Reduced pipeline downtime by 75% via Airflow orchestration. Developed an LSTM autoencoder for anomaly detection, reducing server downtime by 15%. Deployed inference services with Docker and TF Serving at p99 < 50ms.
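The anomaly-detection pattern behind the LSTM autoencoder is to score each window by reconstruction error and flag scores above a threshold calibrated on normal traffic. The model itself is stubbed out in this sketch and the error values are illustrative:

```python
# Reconstruction-error thresholding: calibrate mean + n*sigma on errors from
# known-normal windows, then flag live windows whose error exceeds it.

def anomaly_threshold(normal_errors, n_sigma=3.0):
    """Threshold = mean + n_sigma * std of errors on normal data."""
    mean = sum(normal_errors) / len(normal_errors)
    var = sum((e - mean) ** 2 for e in normal_errors) / len(normal_errors)
    return mean + n_sigma * var ** 0.5

def flag_anomalies(errors, threshold):
    """Indices of windows whose reconstruction error exceeds the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]

normal = [0.10, 0.12, 0.11, 0.09, 0.13]       # autoencoder errors, healthy traffic
threshold = anomaly_threshold(normal)
live = [0.11, 0.10, 0.95, 0.12]               # 0.95: server misbehaving
flags = flag_anomalies(live, threshold)
```

Because the autoencoder is trained only on normal sequences, anything it reconstructs poorly is by construction unusual, which is what makes this an unsupervised detector.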
Cognizant Technology Solutions
Migrated 10-year-old on-prem Hadoop pipelines to a cloud Lakehouse architecture (TB-scale), reducing data availability latency by 40%. Cut cloud compute costs by 35% for credit risk scoring pipelines via execution plan optimization. Implemented RBAC and column-level PII masking for zero-trust compliance.
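Column-level PII masking of the kind mentioned above can be sketched as a one-way tokenization rule applied before data reaches analyst-facing views. Enforcement in a real platform sits in the RBAC/policy layer; the column names and token format here are illustrative:

```python
# One-way PII masking: replace sensitive values with a stable hash-derived
# token, so the column is unreadable but still joinable across tables.

import hashlib

def mask_pii(row, pii_columns):
    """Return a copy of the row with PII columns replaced by stable tokens."""
    masked = dict(row)
    for col in pii_columns:
        if col in masked and masked[col] is not None:
            digest = hashlib.sha256(str(masked[col]).encode()).hexdigest()
            masked[col] = "tok_" + digest[:12]
    return masked

row = {"customer_id": "C-1001", "iban": "DE89370400440532013000", "score": 712}
safe = mask_pii(row, pii_columns=["iban"])
```

Using a deterministic token (rather than redaction) preserves referential integrity: the same IBAN masks to the same token everywhere, so joins and distinct-counts still work. A production system would add a secret salt to prevent dictionary attacks.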
Govt. Engineering College, Barton Hill, India
Capstone: autonomous vehicle prototype using CNNs and sensor fusion. Achieved 30 FPS inference on CPU, reduced steering MAE by 15% vs. PilotNet, and reached 95% collision avoidance over 200 test runs. CGPA: 8.0/10.
I'm always open to new opportunities — whether it's data engineering, ML projects, agentic AI systems, or creative frontend work. Drop me a message and let's build something great.