When it comes to deep learning frameworks, two names dominate the field: TensorFlow and PyTorch. While TensorFlow has been around longer, PyTorch has quickly gained popularity due to its flexibility, dynamic computation graph and Pythonic style. Researchers, developers, and data scientists across the globe use PyTorch for everything from computer vision to natural language processing.

However, with its wide set of functions and features, it can be difficult to remember every detail while coding. That’s why this PyTorch Cheatsheet is a must-have. It serves as a quick reference guide with the most important commands, methods and workflows you’ll use in PyTorch projects.
Whether you’re a beginner exploring deep learning or an experienced developer, this cheatsheet will save you time and boost productivity.
What is PyTorch?
PyTorch is an open-source machine learning framework developed by Facebook’s AI Research lab (FAIR). It is known for:
- Dynamic computation graphs (define-by-run)
- Easy debugging with native Python tools
- GPU acceleration with CUDA
- Strong ecosystem of libraries for vision, NLP, reinforcement learning, etc.
PyTorch feels more “Pythonic” than other frameworks, making it easier to learn and integrate into research or production code.
1. Installation & Setup
pip install torch torchvision torchaudio
import torch
print(torch.__version__)
print("GPU Available:", torch.cuda.is_available())
Install PyTorch, along with torchvision (for image datasets and models) and torchaudio (for audio tasks). Check GPU availability to ensure CUDA is working. If torch.cuda.is_available() returns True, you can accelerate computations on your GPU.
2. Creating Tensors
torch.tensor([1, 2, 3])
torch.zeros(3, 3)
torch.ones(2, 2)
torch.eye(3)
torch.arange(0, 10, 2)
torch.rand(2, 2)
torch.randn(2, 2)
Tensors are the core data structure in PyTorch, similar to NumPy arrays but optimized for GPUs. You can initialize tensors with constants (zeros, ones, eye), ranges (arange), or random distributions (rand, randn). Use .to('cuda') to move them to GPU memory.
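For instance, a minimal sketch of moving a tensor to the GPU (guarded so it also runs on CPU-only machines):
t = torch.rand(2, 2)
if torch.cuda.is_available():
    t = t.to('cuda')    # move the tensor to GPU memory
print(t.device)         # "cpu" or "cuda:0"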
3. Tensor Operations
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
y = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)
torch.add(x, y)
torch.sub(x, y)
torch.mul(x, y)
torch.mm(x, y)
x.t()
x.view(-1)
These are basic mathematical and linear algebra operations.
mul is element-wise multiplication, while mm performs matrix multiplication. t() transposes a matrix. view() reshapes a tensor. Use -1 to let PyTorch infer the correct dimension automatically.
4. Data Loading with DataLoader
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch_x, batch_y in loader:
    print(batch_x, batch_y)
PyTorch provides DataLoader for batching, shuffling, and parallel data loading. TensorDataset wraps tensors into a dataset object. For large datasets, use built-in datasets in torchvision.datasets and apply transformations with torchvision.transforms.
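As an example, a standard image pipeline might load MNIST through torchvision (a minimal sketch; the download path and batch size are arbitrary choices):
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))   # standard MNIST mean/std
])
train_set = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)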
5. Building Models
Sequential API
import torch.nn as nn
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10),
    nn.Softmax(dim=1)
)
nn.Sequential builds a model layer by layer in a linear stack. Common layers include Linear (fully connected), activation functions (ReLU, Sigmoid), and Dropout. Softmax normalizes outputs into class probabilities. Note that if you train with nn.CrossEntropyLoss (see section 7), you should omit the final Softmax layer, because that loss applies log-softmax internally and expects raw logits.
Custom Model Class
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = MyModel()
For more control, subclass nn.Module and define a custom forward method. This is the standard approach in PyTorch for building models.
6. Common Layers
nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
nn.MaxPool2d(kernel_size=2, stride=2)
nn.LSTM(input_size=50, hidden_size=100, num_layers=2, batch_first=True)
nn.Embedding(num_embeddings=5000, embedding_dim=300)
nn.BatchNorm1d(128)
nn.Dropout(0.5)
Conv2d and MaxPool2d are standard for CNNs. LSTM is useful for sequences and time-series data. Embedding maps discrete tokens to dense vectors. BatchNorm1d and Dropout improve training stability and generalization.
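As an illustration, these layers can be combined into a small CNN. This is a minimal sketch that assumes 3-channel 32×32 inputs and 10 output classes:
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),   # 3x32x32 -> 32x32x32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 32x32x32 -> 32x16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10)                               # logits for 10 classes
)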
7. Defining Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Choose a loss function suitable for your task (e.g., MSELoss for regression, CrossEntropyLoss for classification). The optimizer updates weights based on gradients; Adam is a good default choice.
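For a regression task, you might swap these for MSELoss and plain SGD instead (a minimal sketch; the learning rate and momentum are just illustrative values):
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)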
8. Training Loop
for epoch in range(5):
    for batch_x, batch_y in loader:
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
Unlike TensorFlow/Keras where .fit() abstracts training, PyTorch requires an explicit loop:
- Forward pass (model(batch_x))
- Compute loss (criterion)
- Zero gradients (zero_grad)
- Backward pass (loss.backward())
- Update weights (optimizer.step())
This flexibility makes PyTorch great for research.
9. Evaluation
model.eval()
with torch.no_grad():
    correct = 0
    for batch_x, batch_y in loader:
        outputs = model(batch_x)
        predicted = torch.argmax(outputs, dim=1)
        correct += (predicted == batch_y).sum().item()
print("Accuracy:", correct / len(dataset))
Switch to evaluation mode with model.eval(), which disables dropout and batch norm updates. Use torch.no_grad() to avoid computing gradients and save memory.
10. Saving & Loading Models
# Save
torch.save(model.state_dict(), "model.pth")
# Load
model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()
PyTorch recommends saving model weights with state_dict(). This gives flexibility if you later want to reload weights into a modified architecture. Call .eval() when using the model for inference.
11. GPU Acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
x = x.to(device)
Move models and tensors to GPU with .to(device). Always check for CUDA availability. For multi-GPU setups, use torch.nn.DataParallel or torch.distributed for distributed training.
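Inside a training loop this usually looks like the sketch below, where each batch is moved to the same device as the model (model, loader, criterion, and optimizer are assumed from the earlier sections):
model.to(device)
for batch_x, batch_y in loader:
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)   # keep data and model on the same device
    outputs = model(batch_x)
    loss = criterion(outputs, batch_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()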
12. Visualization of Training
import matplotlib.pyplot as plt
losses = []
for epoch in range(5):
    for batch_x, batch_y in loader:
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.show()
Recording and plotting the loss curve helps monitor training progress. A decreasing loss indicates the model is learning; if loss stalls or increases, you may need to tune hyperparameters.
13. Transfer Learning with Pretrained Models
import torchvision.models as models
# Load pretrained ResNet18 (newer torchvision versions prefer weights=models.ResNet18_Weights.DEFAULT)
model = models.resnet18(pretrained=True)
# Freeze all layers
for param in model.parameters():
    param.requires_grad = False
# Replace final layer for 10-class classification
model.fc = nn.Linear(model.fc.in_features, 10)
Transfer learning lets you reuse pretrained models (like ResNet, VGG, BERT) for new tasks. Freezing layers saves training time and prevents overfitting on small datasets.
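Since the earlier layers are frozen, you would typically give the optimizer only the parameters of the new head (a minimal sketch; the learning rate is an illustrative value):
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)   # only the replaced layer is trained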
14. Custom Dataset Class
from torch.utils.data import Dataset
from PIL import Image
class MyDataset(Dataset):
    def __init__(self, file_paths, labels, transform=None):
        self.file_paths = file_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img = Image.open(self.file_paths[idx])
        if self.transform:
            img = self.transform(img)
        return img, self.labels[idx]
When working with custom image, text, or audio datasets, subclass Dataset and implement __getitem__ and __len__. Combine it with DataLoader for batching.
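Usage might look like this sketch, where file_paths and labels are hypothetical lists you have already built:
from torch.utils.data import DataLoader
from torchvision import transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = MyDataset(file_paths, labels, transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)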
15. Mixed Precision Training (Faster on GPU)
scaler = torch.cuda.amp.GradScaler()
for epoch in range(5):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(x_batch)
            loss = criterion(outputs, y_batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Mixed precision uses FP16 + FP32 for faster training with lower memory usage on GPUs that support Tensor Cores (NVIDIA RTX/A100). torch.cuda.amp automates this.
16. Gradient Clipping (Stabilizing Training)
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Gradient clipping prevents exploding gradients, especially in RNNs/LSTMs. It caps gradients at a specified norm.
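Clipping is applied after the backward pass and before the optimizer step, as in this sketch:
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale gradients if their norm exceeds 1.0
optimizer.step()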
17. Distributed Training (Multi-GPU)
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
dist.init_process_group("nccl")          # one process per GPU (e.g. launched with torchrun)
rank = dist.get_rank()                   # this process's rank: 0, 1, ...
device = torch.device(f"cuda:{rank}")    # assumes a single node; use LOCAL_RANK otherwise
model = model.to(device)
model = DDP(model, device_ids=[rank])
PyTorch supports distributed training across multiple GPUs/machines with DistributedDataParallel. It is faster and more scalable than DataParallel.
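Each process should also see a different shard of the data, which is what DistributedSampler is for (a sketch, assuming the script is launched with torchrun):
from torch.utils.data import DistributedSampler

sampler = DistributedSampler(dataset)                          # call sampler.set_epoch(epoch) each epoch for reshuffling
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
# launch: torchrun --nproc_per_node=4 train.py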
18. TorchScript for Model Deployment
scripted_model = torch.jit.script(model)
torch.jit.save(scripted_model, "model_scripted.pt")
TorchScript converts PyTorch models into a serialized format for deployment in production environments (e.g., C++ runtimes, mobile apps).
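The saved file can later be reloaded without the original Python class definition (a minimal sketch; the input shape matches the MLP from section 5):
loaded = torch.jit.load("model_scripted.pt")
loaded.eval()
with torch.no_grad():
    out = loaded(torch.rand(1, 28 * 28))   # run inference on a dummy input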
19. Learning Rate Scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
for epoch in range(30):
    train(...)
    scheduler.step()
Schedulers adjust the learning rate dynamically during training to improve convergence. Options include StepLR, ExponentialLR, CosineAnnealingLR, and ReduceLROnPlateau.
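ReduceLROnPlateau is the one scheduler that needs a metric. A sketch, where validate(...) stands in for a hypothetical validation step returning a loss:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)
for epoch in range(30):
    val_loss = validate(...)    # hypothetical validation step
    scheduler.step(val_loss)    # reduce the LR when val_loss stops improving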
20. Model Checkpointing
# Save
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}, "checkpoint.pth")
# Load
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
Instead of saving only model weights, save full checkpoints including optimizer state and epoch number to resume training later.
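To actually resume, restore the epoch counter as well (a minimal sketch; the total epoch count of 30 and the train(...) placeholder mirror section 19):
start_epoch = checkpoint['epoch'] + 1    # continue from the next epoch
model.train()                            # switch back to training mode before resuming
for epoch in range(start_epoch, 30):
    train(...)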
21. Visualization with TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/experiment1")
for epoch in range(5):
    writer.add_scalar("Loss/train", loss.item(), epoch)
    writer.add_scalar("Accuracy/train", acc, epoch)
TensorBoard can be used with PyTorch for monitoring metrics, histograms, and graphs. Launch it via tensorboard --logdir=runs.
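You can also log weight histograms, and you should close the writer when done (a minimal sketch; epoch is the current training epoch):
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)   # distribution of each weight tensor
writer.close()                                 # flush and close the event file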
22. Autograd Tricks
x = torch.randn(3, requires_grad=True)
y = x ** 2
y.backward(torch.tensor([1.0, 0.5, 2.0]))
print(x.grad)
PyTorch’s autograd automatically computes gradients. You can pass custom gradients in .backward(), useful for vector-Jacobian products.
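For a scalar output you can also obtain gradients directly with torch.autograd.grad instead of .backward() (a minimal sketch):
x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()                   # scalar output
grad, = torch.autograd.grad(loss, x)    # returns the gradient without populating x.grad
print(grad)                             # equals 2 * x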
Conclusion
This PyTorch Cheatsheet provides a complete quick reference guide to building, training, and deploying deep learning models with PyTorch. From creating tensors to writing custom training loops, this cheatsheet covers everything you need to get started and be productive.
Bookmark this guide and keep it handy while coding. With practice, PyTorch’s simplicity and flexibility will allow you to build powerful machine learning models confidently.
Related Reads
- The Ultimate TensorFlow Cheatsheet: From Basics to Advanced
- Applied Machine Learning – CS 5785 at Cornell Tech: Complete Course Guide
- NLP Text Preprocessing Cheatsheet 2025: The Ultimate Powerful Guide
- Plotly Cheatsheet 2025: Powerful Techniques from Beginner to Advanced
- Matplotlib Cheatsheet 2025: From Beginner to Advanced
External Links
🔗 PyTorch Official Documentation
🔗 PyTorch Tutorials – Official
🔗 PyTorch Beginner Guide (LearnPyTorch.io)
🔗 Deep Learning with PyTorch: A 60 Minute Blitz
🔗 GeeksforGeeks – PyTorch Basics