Scikit-learn Cheatsheet 2025: From Beginner to Advanced

If you work on machine learning in Python, Scikit-learn (sklearn) is one of the most capable and beginner-friendly libraries available. It provides tools for data preprocessing, model training, model selection and evaluation, and fitted models can be saved to disk for deployment.

This Scikit-learn cheatsheet is your one-stop guide, covering everything from the basics to advanced techniques. Bookmark it and you’ll never get stuck again!

1. Getting Started with Scikit-learn

Installation

pip install scikit-learn

Importing

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import datasets

Loading Datasets

# Built-in datasets
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
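For classification tasks it often helps to keep the class proportions the same in both splits. A minimal sketch, assuming the same Iris data, using the `stratify` parameter of `train_test_split`:

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# stratify=y keeps the class ratio of y in both the train and test splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Count how many test samples fall in each class
test_counts = np.bincount(y_test)
print(X_train.shape, X_test.shape)
print(test_counts)
```

Iris has 50 samples per class, so a stratified 20% test split contains 10 samples of each class.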

2. Data Preprocessing

Handling Missing Data

from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)
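The Iris data has no missing values, so the effect of imputation is easier to see on a small array. A sketch on made-up data (`X_missing` is a hypothetical name, not part of the cheatsheet):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data with one missing value in each column (illustrative only)
X_missing = np.array([
    [1.0, 2.0],
    [np.nan, 4.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X_missing)

# statistics_ holds the per-column means learned during fit
print(imputer.statistics_)  # column means: 2.0 and 3.0
print(X_filled)
```

Each `NaN` is replaced by the mean of the observed values in its column.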

Feature Scaling

from sklearn.preprocessing import StandardScaler, MinMaxScaler

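scaler = StandardScaler()
# Fit on the training split only, then reuse the fitted scaler on the test split
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)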
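`MinMaxScaler` is imported above but not shown in use. A minimal sketch on made-up values (`X_small` is a hypothetical name) of how it rescales each feature to the [0, 1] range seen during fitting:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature with a minimum of 1.0 and a maximum of 5.0 (illustrative only)
X_small = np.array([[1.0], [3.0], [5.0]])

minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X_small)
print(X_minmax.ravel())  # 0.0, 0.5, 1.0

# New data is scaled with the min and max learned during fit,
# so values outside the fitted range land outside [0, 1]
X_new_scaled = minmax.transform(np.array([[7.0]]))
print(X_new_scaled.ravel())  # 1.5
```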

Encoding Categorical Data

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

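# Label Encoding (meant for target labels such as y, not for input features)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# One-Hot Encoding (apply to categorical feature columns; X_cat is a small illustrative example)
X_cat = np.array([['red'], ['green'], ['blue']])
ohe = OneHotEncoder(sparse_output=False)
X_encoded = ohe.fit_transform(X_cat)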

3. Machine Learning Models

Classification

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

Regression

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

reg = LinearRegression()
reg.fit(X_train, y_train)
y_pred = reg.predict(X_test)

Clustering

from sklearn.cluster import KMeans

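# Fix n_init and random_state so the clustering is reproducible across runs
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)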
kmeans.fit(X)
labels = kmeans.labels_

Dimensionality Reduction

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
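After fitting PCA it is worth checking how much variance the kept components retain. A short sketch, assuming the unscaled Iris features:

```python
from sklearn import datasets
from sklearn.decomposition import PCA

X, y = datasets.load_iris(return_X_y=True)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# Fraction of the total variance captured by each component
explained = pca.explained_variance_ratio_
print(explained)
print(explained.sum())
```

If the summed ratio is low, more components may be needed to represent the data faithfully.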

4. Model Evaluation

Classification Metrics

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

# Predictions from the fitted classifier
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
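`precision_score`, `recall_score`, and `f1_score` are imported above but need an `average` argument for multiclass targets. A small sketch on hand-written labels (`y_true` and `y_hat` are made up for illustration):

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hand-written labels for three classes (illustrative only)
y_true = [0, 1, 2, 2, 1, 0]
y_hat = [0, 2, 2, 2, 1, 0]

# average="macro" computes the metric per class, then takes the unweighted mean
macro_precision = precision_score(y_true, y_hat, average="macro")
macro_recall = recall_score(y_true, y_hat, average="macro")
macro_f1 = f1_score(y_true, y_hat, average="macro")

print(round(macro_precision, 3))  # 0.889
print(round(macro_recall, 3))     # 0.833
print(round(macro_f1, 3))         # 0.822
```

Other options such as `average="weighted"` weight each class by its number of true samples.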

Regression Metrics

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Predictions from the fitted regressor
y_pred = reg.predict(X_test)
print(mean_squared_error(y_test, y_pred))
print(r2_score(y_test, y_pred))
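To make the regression metrics concrete, here is a sketch on four hand-written values (`y_true` and `y_hat` are made up). RMSE is taken as the square root of MSE so the error is in the same units as the target:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Hand-written targets and predictions (illustrative only)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.0, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_hat)   # mean of squared errors
rmse = np.sqrt(mse)                       # same units as the target
mae = mean_absolute_error(y_true, y_hat)  # mean of absolute errors
r2 = r2_score(y_true, y_hat)              # 1.0 would be a perfect fit

print(mse)             # 1.5
print(mae)             # 1.0
print(round(rmse, 3))  # 1.225
print(round(r2, 3))    # 0.593
```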

Cross-Validation

from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
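When preprocessing is involved, one common approach is to cross-validate a pipeline so the scaler is refit inside each fold. A sketch, assuming Iris data and a logistic regression model (the model choice is illustrative):

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# The scaler is fit only on the training part of each fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Shuffled, stratified folds with a fixed seed for reproducibility
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv)

print(scores)
print(scores.mean(), scores.std())
```

Reporting the standard deviation next to the mean shows how much the score varies between folds.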

5. Model Selection & Hyperparameter Tuning

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 200], 'max_depth': [5, 10]}
grid = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)

from sklearn.model_selection import RandomizedSearchCV

# param_grid holds only 4 combinations, so n_iter is kept below that
random_search = RandomizedSearchCV(RandomForestClassifier(), param_grid, n_iter=3, cv=5, random_state=42)
random_search.fit(X_train, y_train)
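Randomized search is most useful when it samples from distributions rather than a short list. A sketch using `scipy.stats.randint`, with ranges chosen only for illustration:

```python
from scipy.stats import randint
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Integer distributions to sample from (upper bound is exclusive)
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 10),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=5,          # number of sampled parameter settings
    cv=3,
    random_state=42,   # makes the sampled settings reproducible
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)
```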

6. Advanced Topics

Pipelines

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression())
])

pipeline.fit(X_train, y_train)
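A pipeline can also be tuned as a single estimator. Parameters of a step are addressed as `<step name>__<parameter>`. A sketch, assuming the same two-step pipeline and an illustrative grid over `C`:

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# "clf__C" targets the C parameter of the step named "clf"
param_grid = {"clf__C": [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X_train, y_train)

test_accuracy = grid.score(X_test, y_test)
print(grid.best_params_)
print(test_accuracy)
```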

Feature Selection

from sklearn.feature_selection import SelectKBest, chi2

# chi2 requires non-negative feature values (true for the raw Iris measurements)
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

Handling Imbalanced Data

from sklearn.utils.class_weight import compute_class_weight

weights = compute_class_weight('balanced', classes=np.unique(y), y=y)
print(weights)
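Iris is perfectly balanced, so the weights above all come out equal. A sketch on a made-up 90/10 label vector shows the effect more clearly, and how the weights could be passed to a classifier (the random features are illustrative only):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels: 90 samples of class 0, 10 of class 1
y_imbalanced = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_imbalanced)

# "balanced" gives n_samples / (n_classes * class_count) for each class
weights = compute_class_weight("balanced", classes=classes, y=y_imbalanced)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)  # class 0 gets about 0.556, class 1 gets 5.0

# Random features, only to show the weights being passed to an estimator
rng = np.random.default_rng(0)
X_random = rng.normal(size=(100, 2))
weighted_clf = LogisticRegression(class_weight=class_weight)
weighted_clf.fit(X_random, y_imbalanced)
```

Many scikit-learn classifiers also accept `class_weight='balanced'` directly, which applies the same formula internally.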

Ensemble Learning (Stacking)

from sklearn.ensemble import StackingClassifier

estimators = [
    ('rf', RandomForestClassifier(n_estimators=10)),
    ('svc', SVC(probability=True))
]

stack = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())
stack.fit(X_train, y_train)

7. Model Deployment

Save and Load Models

import joblib

joblib.dump(clf, 'model.pkl')
loaded_model = joblib.load('model.pkl')
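Saving the whole pipeline rather than only the final model keeps the preprocessing attached to it. A sketch that writes to a temporary directory and checks the round trip (the file name `model.joblib` is an arbitrary choice):

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_iris(return_X_y=True)

# Scaler and classifier are persisted together as one object
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored pipeline should give the same predictions as the original
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print(same_predictions)  # True
```

Only load files from sources you trust, and load them with the same scikit-learn version that saved them where possible.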

Conclusion

This Scikit-learn Cheatsheet 2025 has walked you through everything from basic data preprocessing to advanced model tuning and deployment. Whether you’re a beginner just getting started or an advanced ML engineer, this guide is your quick reference for all things Scikit-learn.

Bookmark this blog and the next time you’re stuck, you’ll know exactly where to look!
