Tags: machine-learning, data-science, python, jupyter

Classification of Point Sources in Space

By Vidas Sileikis
Duration
4 Months
Role
Machine Learning Research
Classification of Point Sources analysis
Model comparison charts

Overview

Developed a machine learning pipeline to classify astronomical point sources in space using low-resolution photometric measurements as an alternative to expensive high-resolution imaging. The project tested and compared multiple classification algorithms to determine which approach most accurately identifies and categorizes celestial objects from photometric survey data.


Problem Statement

Traditional classification of astronomical point sources—such as stars, quasars, and galaxies—relies heavily on high-resolution spectroscopic imaging, which is both time-consuming and expensive. As the volume of astronomical survey data grows exponentially, there is a pressing need for automated classification methods that can work with more cost-effective, low-resolution photometric data without sacrificing accuracy.


Approach

The project followed a structured data science workflow, beginning with comprehensive data preprocessing and exploratory data analysis, followed by the implementation and evaluation of multiple machine learning models. Each model was trained, tuned, and benchmarked against the others to identify the most effective classifier for this domain.
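The train–tune–benchmark loop described above can be sketched as a cross-validated grid search. This is a minimal illustration on synthetic data, not the project's actual pipeline; the real photometric features, grid values, and class labels are assumptions here.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the photometric measurements.
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Scale features, then tune the classifier with 5-fold cross-validation.
pipe = Pipeline([("scale", StandardScaler()),
                 ("clf", KNeighborsClassifier())])
search = GridSearchCV(pipe, {"clf__n_neighbors": [3, 5, 9]}, cv=5)
search.fit(X_train, y_train)

# Held-out score is the benchmark number compared across models.
test_score = search.score(X_test, y_test)
```

The same scale-tune-score skeleton applies to each classifier below; only the estimator and its parameter grid change.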


Algorithms Implemented

K-Nearest Neighbors (KNN) — Instance-based learning algorithm that classifies point sources based on proximity to labeled training examples in feature space.
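A minimal KNN sketch on synthetic data (the real photometric features and the tuned value of k are not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for photometric features and source classes.
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each test source takes the majority label of its k nearest training sources.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
knn_score = knn.score(X_test, y_test)
```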

Decision Tree — Tree-based classifier that creates interpretable decision boundaries, useful for understanding which photometric features drive classification.
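The interpretability claim can be seen directly through the fitted tree's feature importances. A sketch on synthetic data (depth limit and feature count are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Limiting depth keeps the tree small enough to read and explain.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# Normalized importances reveal which features drive the splits.
importances = tree.feature_importances_
top_feature = int(importances.argmax())
```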

Linear and Gaussian SVM — Support Vector Machines with both linear and radial basis function kernels, effective at finding optimal hyperplanes in high-dimensional photometric feature spaces.
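Both kernels can be compared with the same scaled pipeline; a sketch, again on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = {}
for kernel in ("linear", "rbf"):  # "rbf" is the Gaussian kernel
    # SVMs are sensitive to feature scale, so standardize first.
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = svm.fit(X_train, y_train).score(X_test, y_test)
```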

Extreme Gradient Boosting (XGBoost) — Ensemble method combining multiple weak learners for high-accuracy classification with built-in regularization.
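A gradient-boosting sketch using scikit-learn's `GradientBoostingClassifier` as a stand-in; the project itself used `xgboost.XGBClassifier`, which exposes the same `fit`/`predict` interface and adds built-in L1/L2 regularization. Hyperparameter values here are illustrative defaults, not the project's tuned settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each boosting stage fits a shallow tree to the errors of the ensemble so far.
gbt = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                 learning_rate=0.1, random_state=0)
gbt.fit(X_train, y_train)
gbt_score = gbt.score(X_test, y_test)
```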

Gaussian Mixture Model — Probabilistic model that captures the underlying distribution of point source classes, useful for identifying overlapping populations.
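The "overlapping populations" idea shows up in the mixture's soft responsibilities. A sketch with one component per class on synthetic data (the project's exact mixture formulation and thresholds are not reproduced):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.mixture import GaussianMixture

X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)

# Fit one Gaussian component per source class.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

# Soft "responsibilities": each row sums to 1 across the 3 components.
resp = gmm.predict_proba(X)

# Sources with no dominant component sit between overlapping populations.
ambiguous = resp.max(axis=1) < 0.9
```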

Neural Network — Deep learning model that captures complex nonlinear relationships between photometric features and source classifications.
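A small feed-forward sketch using scikit-learn's `MLPClassifier` as a stand-in; the project's network was built with TensorFlow/Keras, and the layer sizes here are illustrative assumptions, not the actual architecture.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers; standardize first, since MLPs are scale-sensitive.
nn = make_pipeline(StandardScaler(),
                   MLPClassifier(hidden_layer_sizes=(32, 16),
                                 max_iter=2000, random_state=0))
nn.fit(X_train, y_train)
nn_score = nn.score(X_test, y_test)
```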


Model Comparison

A dedicated model comparison pipeline was built to systematically evaluate all algorithms against consistent metrics including accuracy, precision, recall, and F1-score. This allowed for an objective assessment of each model's strengths and weaknesses when applied to astronomical classification tasks.
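The comparison loop can be sketched as a single table of the four metrics; two classifiers and synthetic data stand in for the full model set here. Macro averaging is an assumption — it treats all source classes equally regardless of size.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {"KNN": KNeighborsClassifier(),
          "Decision Tree": DecisionTreeClassifier(random_state=0)}

results = {}
for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    results[name] = {
        "accuracy":  accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, average="macro"),
        "recall":    recall_score(y_test, pred, average="macro"),
        "f1":        f1_score(y_test, pred, average="macro"),
    }
```

Every model sees the identical train/test split, so the resulting scores are directly comparable.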


Technical Implementation

The entire pipeline was developed in Jupyter Notebooks with a separate notebook for each algorithm, enabling clear documentation and reproducibility. Data preprocessing handled missing values, feature scaling, and class balancing. The neural network component was implemented as a standalone Python script for modularity.
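The three preprocessing steps named above — missing values, scaling, class balancing — can be sketched on a toy photometric matrix; the column meanings and values are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Toy matrix: two photometric columns, one missing measurement (np.nan).
X = np.array([[14.2, 0.31], [15.8, np.nan], [13.9, 0.27], [16.1, 0.88]])
y = np.array([0, 1, 0, 1])

# 1. Fill missing values with the per-feature median.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)

# 2. Standardize each feature to zero mean, unit variance.
X_scaled = StandardScaler().fit_transform(X_imputed)

# 3. Per-class weights to counter imbalance; passed to any estimator
#    that accepts a class_weight argument.
weights = compute_class_weight("balanced", classes=np.unique(y), y=y)
```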


Tech Stack

  • Python — Core programming language for all data processing and modeling
  • Jupyter Notebooks — Interactive development environment for analysis and visualization
  • scikit-learn — Machine learning library for classical algorithms (KNN, SVM, Decision Trees)
  • XGBoost — Gradient boosting framework for ensemble classification
  • TensorFlow / Keras — Neural network implementation
  • Pandas & NumPy — Data manipulation and numerical computation
  • Matplotlib & Seaborn — Data visualization and result plotting
