Improving Model Performance When One Class Dominates the Dataset
Introduction
In real-world machine learning systems, imbalanced datasets are common. Fraud detection, medical diagnosis, anomaly detection, and spam filtering often contain a majority class that significantly outweighs the minority class.
A model trained on such data may show high accuracy but perform poorly in detecting the minority class — which is usually the class of interest.
Let’s examine a practical scenario.

Problem Statement
You are building a binary classification model to detect fraudulent transactions.
Dataset distribution:
- 95% Non-Fraud (Class 0)
- 5% Fraud (Class 1)
You trained a Logistic Regression model and obtained:
- Accuracy: 95%
- Fraud detection rate: Very low
At first glance, 95% accuracy looks excellent.
However, the model predicts almost all transactions as Non-Fraud.

Why is accuracy misleading in this case, and how should the problem be handled properly?
Analysis
Accuracy is calculated as:

Accuracy = Correct Predictions / Total Predictions
If 95% of the data is Non-Fraud, a model predicting “Non-Fraud” for every case will still achieve 95% accuracy.
This means:
- The model is biased toward the majority class.
- It fails at detecting fraud (minority class).
- Business impact is negative despite high accuracy.
Therefore, accuracy is not an appropriate metric for imbalanced datasets.
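A tiny sketch with made-up labels makes this concrete: an always-predict-Non-Fraud "model" on a 95/5 split scores 95% accuracy while catching zero fraud.

```python
# Hypothetical labels: 95 non-fraud (0) and 5 fraud (1) transactions.
y_true = [0] * 95 + [1] * 5
# A degenerate model that predicts non-fraud for every transaction.
y_pred = [0] * 100

# Accuracy looks excellent...
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# ...but recall on the fraud class is zero: no fraud case is ever caught.
fraud_recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / 5

print(accuracy)      # 0.95
print(fraud_recall)  # 0.0
```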
Correct Approach
1. Use Better Evaluation Metrics
Instead of accuracy, use:
- Precision
- Recall
- F1-Score
- ROC-AUC
- Confusion Matrix
For fraud detection, Recall for Class 1 (Fraud) is critical.
Recall measures:

Recall = True Positives / (True Positives + False Negatives)
High recall ensures fewer fraud cases are missed.
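To see how these metrics behave, here is a small sketch computing precision, recall, and F1 for the fraud class from a hypothetical confusion matrix (the counts are illustrative, not from a real model):

```python
# Hypothetical confusion matrix counts for Class 1 (Fraud).
tp, fn = 30, 20    # fraud caught vs. fraud missed
fp, tn = 40, 910   # false alarms vs. correctly flagged non-fraud

precision = tp / (tp + fp)   # of all fraud alerts, how many were real?
recall = tp / (tp + fn)      # of all real fraud, how much did we catch?
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3))   # 0.429
print(round(recall, 3))      # 0.6
print(round(f1, 3))          # 0.5
```

Note that accuracy here would be (30 + 910) / 1000 = 94%, even though nearly half the fraud is missed.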
2. Apply Resampling Techniques
To handle imbalance:
Oversampling
- SMOTE (Synthetic Minority Over-sampling Technique)
- Random oversampling
Undersampling
- Reduce majority class samples
Example using SMOTE (conceptually):
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
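Random oversampling, listed above, can also be sketched in plain Python. The rows here are toy stand-ins, not real transaction features:

```python
import random

random.seed(0)

# Toy dataset: 95 majority rows (label 0) and 5 minority rows (label 1).
majority = [(i, 0) for i in range(95)]
minority = [(i, 1) for i in range(5)]

# Random oversampling: draw minority rows with replacement
# until both classes have the same number of samples.
resampled = majority + random.choices(minority, k=len(majority))

labels = [label for _, label in resampled]
print(labels.count(0), labels.count(1))  # 95 95
```

Unlike SMOTE, which synthesizes new minority points by interpolating between neighbors, plain random oversampling only duplicates existing rows, which can encourage overfitting to those exact samples.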
3. Use Class Weights
Many algorithms support class weighting:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(class_weight='balanced')
This penalizes mistakes on the minority class more heavily during training.
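Concretely, scikit-learn's `'balanced'` mode assigns each class the weight n_samples / (n_classes * count_c). A quick sketch on a hypothetical 95/5 split:

```python
# Hypothetical class counts for a 95/5 imbalanced dataset.
counts = {0: 950, 1: 50}

n_samples = sum(counts.values())
n_classes = len(counts)

# weight_c = n_samples / (n_classes * count_c), as scikit-learn
# computes for class_weight='balanced'.
weights = {c: n_samples / (n_classes * n) for c, n in counts.items()}

print(weights)  # minority class errors cost ~19x more than majority
```

With these counts, the fraud class gets a weight of 10.0 versus roughly 0.53 for non-fraud, so each missed fraud case contributes far more to the loss.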
4. Try Tree-Based Models
Algorithms like:
- Random Forest
- XGBoost
- LightGBM
often perform better on imbalanced data when tuned properly.
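As a minimal sketch (synthetic data via `make_classification`, mostly default hyperparameters, so not a tuned production setup), a class-weighted Random Forest can be combined with the metrics discussed earlier:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic dataset with roughly a 95/5 class imbalance.
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# class_weight='balanced' makes splits penalize minority-class
# errors more heavily, instead of relying on resampling.
model = RandomForestClassifier(class_weight="balanced", random_state=42)
model.fit(X_train, y_train)

# Report per-class precision, recall, and F1 rather than accuracy alone.
print(classification_report(y_test, model.predict(X_test)))
```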

Conclusion
Imbalanced classification problems require more than training a standard model and checking accuracy.
Key Takeaways:
- Accuracy can be misleading.
- Use Recall, Precision, and F1-score.
- Apply resampling techniques like SMOTE.
- Adjust class weights.
- Evaluate using confusion matrix and ROC-AUC.
Machine learning performance is not about high numbers; it is about meaningful metrics aligned with the problem objective.