If you’re working in machine learning with Python, Scikit-learn (sklearn) is one of the most powerful and beginner-friendly libraries you’ll ever use. It provides tools for data preprocessing, model training, model selection, and evaluation.
This Scikit-learn cheatsheet is your one-stop guide, covering everything from the basics to advanced techniques. Bookmark it and you’ll never get stuck again!
Table of Contents
1. Getting Started with Scikit-learn
2. Data Preprocessing
3. Machine Learning Models
4. Model Evaluation
5. Model Selection & Hyperparameter Tuning
6. Advanced Topics
7. Model Deployment
1. Getting Started with Scikit-learn
Installation
pip install scikit-learn
Importing
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import datasets
Loading Datasets
# Built-in datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
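Since pandas is imported above, you can also load built-in datasets directly as DataFrames; a minimal sketch using the same iris dataset (as_frame=True needs scikit-learn >= 0.23):
iris_frame = datasets.load_iris(as_frame=True)
df = iris_frame.frame                              # features and target in a single DataFrame
X_df, y_df = iris_frame.data, iris_frame.target    # DataFrame of features, Series of targets
print(df.head())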
2. Data Preprocessing
Handling Missing Data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
Feature Scaling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
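Note that calling fit_transform on the full dataset leaks test-set statistics into training; the usual pattern is to fit the scaler on the training split only and reuse it on the test split, as in this minimal sketch:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean/std from training data only
X_test_scaled = scaler.transform(X_test)         # reuse the same statistics on test data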
Encoding Categorical Data
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label Encoding
le = LabelEncoder()
y_encoded = le.fit_transform(y)
# One-Hot Encoding (expects a 2D array of categorical columns; sparse_output needs scikit-learn >= 1.2, older versions use sparse=False)
ohe = OneHotEncoder(sparse_output=False)
X_encoded = ohe.fit_transform(X)
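On real tabular data you usually want to encode only the categorical columns and scale the numeric ones; here’s a minimal sketch with a hypothetical two-column DataFrame (the column names are made up for illustration):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Hypothetical mixed-type data
df = pd.DataFrame({'age': [25, 32, 47, 51], 'city': ['Delhi', 'Paris', 'Delhi', 'Tokyo']})
ct = ColumnTransformer([
    ('num', StandardScaler(), ['age']),                     # scale numeric columns
    ('cat', OneHotEncoder(sparse_output=False), ['city'])   # one-hot encode categorical columns
])
X_mixed = ct.fit_transform(df)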
3. Machine Learning Models
Classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
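Most classifiers also expose class probabilities via predict_proba, which are often more useful than hard labels (e.g. for custom thresholds or ranking); a quick sketch with the random forest fitted above:
y_proba = clf.predict_proba(X_test)   # one probability per class for each test sample
print(y_proba[:5])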
Regression
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)
Clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
labels = kmeans.labels_
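KMeans does not choose the number of clusters for you; one common heuristic (just a sketch, silhouette scores are another option) is to compare inertia across several values of k:
for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)   # within-cluster sum of squares; always decreases as k grows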
Dimensionality Reduction
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
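A quick sanity check after PCA is how much of the original variance the kept components explain:
print(pca.explained_variance_ratio_)         # variance share of each component
print(pca.explained_variance_ratio_.sum())   # total variance retained by the 2 components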
4. Model Evaluation
Classification Metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
Regression Metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
Cross-Validation
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
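By default cross_val_score uses the estimator’s default scorer (accuracy for classifiers); you can pass any built-in metric name via scoring, for example:
scores_f1 = cross_val_score(clf, X, y, cv=5, scoring='f1_macro')
print(scores_f1.mean())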
5. Model Selection & Hyperparameter Tuning
Grid Search
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)
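After fitting, the search object keeps the best model refit on the whole training set, so you can use it directly; a short usage sketch:
best_model = grid.best_estimator_          # best estimator, already refit on X_train
print(grid.best_score_)                    # mean cross-validated score of the best params
print(best_model.score(X_test, y_test))    # accuracy on the held-out test set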
Random Search
from sklearn.model_selection import RandomizedSearchCV
# param_grid above has only 4 combinations, so n_iter is effectively capped at 4; random search pays off with larger spaces
random_search = RandomizedSearchCV(RandomForestClassifier(), param_grid, n_iter=10, cv=5)
random_search.fit(X_train, y_train)
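Random search is most useful when you sample from distributions instead of short lists; a minimal sketch using scipy.stats (scipy is already a scikit-learn dependency):
from scipy.stats import randint
param_dist = {'n_estimators': randint(100, 500), 'max_depth': randint(3, 15)}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X_train, y_train)
print(random_search.best_params_)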
6. Advanced Topics
Pipelines
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])
pipeline.fit(X_train, y_train)
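Pipelines drop straight into GridSearchCV; hyperparameters are addressed as <step_name>__<param>, e.g. clf__C for the logistic regression step above:
param_grid_pipe = {'clf__C': [0.1, 1, 10]}
grid_pipe = GridSearchCV(pipeline, param_grid_pipe, cv=5)
grid_pipe.fit(X_train, y_train)
print(grid_pipe.best_params_)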
Feature Selection
from sklearn.feature_selection import SelectKBest, chi2
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
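To see which columns survived, get_support() returns a boolean mask over the original features; a quick sketch using the iris feature names loaded earlier:
mask = selector.get_support()                # True for the k selected features
print(np.array(iris.feature_names)[mask])    # names of the selected features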
Handling Imbalanced Data
from sklearn.utils.class_weight import compute_class_weight
weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(weights)
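Most classifiers can apply this weighting for you via the class_weight parameter (or take a dict built from the weights computed above); a minimal sketch:
clf_balanced = LogisticRegression(class_weight='balanced', max_iter=1000)
clf_balanced.fit(X_train, y_train)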
Ensemble Learning (Stacking)
from sklearn.ensemble import StackingClassifier
estimators = [
    ('rf', RandomForestClassifier(n_estimators=10)),
    ('svc', SVC(probability=True))
]
stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)
7. Model Deployment
Save and Load Models
import joblib
joblib.dump(clf, 'model.pkl')
loaded_model = joblib.load('model.pkl')
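The reloaded estimator behaves exactly like the original; a quick sanity check (assuming the X_test/y_test split from earlier):
print(loaded_model.predict(X_test[:5]))     # predictions from the reloaded model
print(loaded_model.score(X_test, y_test))   # should match the original clf's test accuracy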
Conclusion
This Scikit-learn Cheatsheet 2025 has walked you through everything from basic data preprocessing to advanced model tuning and deployment. Whether you’re a beginner just getting started or an advanced ML engineer, this guide is your quick reference for all things Scikit-learn.
Bookmark this blog and the next time you’re stuck, you’ll know exactly where to look!
Related Reads
- NumPy Cheatsheet 2025: From Basics to Advanced in One Guide
- Pandas Cheatsheet: The Ultimate Guide for Data Analysis in Python
- 10 Best AI Engineering Books to Read in 2025
- Mastering GPT-5 Prompting: The Complete Guide to Smarter AI Outputs
- Mastering Context Engineering: 6 Proven Strategies to Make AI Agents Smarter and More Reliable