When it comes to deep learning frameworks, two names dominate the field: TensorFlow and PyTorch. While TensorFlow has been around longer, PyTorch has quickly gained popularity due to its flexibility, dynamic computation graph and Pythonic style. Researchers, developers, and data scientists across the globe use PyTorch for everything from computer vision to natural language processing.

However, with its wide set of functions and features, it can be difficult to remember every detail while coding. That’s why this PyTorch Cheatsheet is a must-have. It serves as a quick reference guide with the most important commands, methods and workflows you’ll use in PyTorch projects.
Whether you’re a beginner exploring deep learning or an experienced developer, this cheatsheet will save you time and boost productivity.
What is PyTorch?
PyTorch is an open-source machine learning framework developed by Facebook’s AI Research lab (FAIR). It is known for:
- Dynamic computation graphs (define-by-run)
- Easy debugging with native Python tools
- GPU acceleration with CUDA
- Strong ecosystem of libraries for vision, NLP, reinforcement learning, etc.
PyTorch feels more “Pythonic” than other frameworks, making it easier to learn and integrate into research or production code.
1. Installation & Setup
pip install torch torchvision torchaudio
import torch
print(torch.__version__)
print("GPU Available:", torch.cuda.is_available())
Install PyTorch, along with torchvision (for image datasets and models) and torchaudio (for audio tasks). Check GPU availability to ensure CUDA is working. If torch.cuda.is_available() returns True, you can accelerate computations on your GPU.
2. Creating Tensors
torch.tensor([1, 2, 3])
torch.zeros(3, 3)
torch.ones(2, 2)
torch.eye(3)
torch.arange(0, 10, 2)
torch.rand(2, 2)
torch.randn(2, 2)
Tensors are the core data structure in PyTorch, similar to NumPy arrays but optimized for GPUs. You can initialize tensors with constants (zeros, ones, eye), ranges (arange), or random distributions (rand, randn). Use .to('cuda') to move them to GPU memory.
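For instance, a minimal sketch of moving a tensor to the GPU (guarded so it also runs on CPU-only machines):
t = torch.rand(2, 2)
if torch.cuda.is_available():
    t = t.to('cuda')    # move the tensor to GPU memory
print(t.device)         # "cpu" or "cuda:0"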
3. Tensor Operations
x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float32)
y = torch.tensor([[5, 6], [7, 8]], dtype=torch.float32)
torch.add(x, y)
torch.sub(x, y)
torch.mul(x, y)
torch.mm(x, y)
x.t()
x.view(-1)
These are basic mathematical and linear algebra operations.
mul is element-wise multiplication, while mm performs matrix multiplication. t() transposes a matrix. view() reshapes a tensor. Use -1 to let PyTorch infer the correct dimension automatically.
4. Data Loading with DataLoader
from torch.utils.data import DataLoader, TensorDataset
dataset = TensorDataset(x, y)
loader = DataLoader(dataset, batch_size=2, shuffle=True)
for batch_x, batch_y in loader:
    print(batch_x, batch_y)
PyTorch provides DataLoader for batching, shuffling, and parallel data loading. TensorDataset wraps tensors into a dataset object. For large datasets, use built-in datasets in torchvision.datasets and apply transformations with torchvision.transforms.
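As an example, a standard image pipeline might load MNIST through torchvision (a minimal sketch; the download path and batch size are arbitrary choices):
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),                       # PIL image -> float tensor in [0, 1]
    transforms.Normalize((0.1307,), (0.3081,))   # standard MNIST mean/std
])
train_set = datasets.MNIST(root="data", train=True, download=True, transform=transform)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)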
5. Building Models
Sequential API
import torch.nn as nn
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28*28, 128),
    nn.ReLU(),
    nn.Dropout(0.2),
    nn.Linear(128, 10),
    nn.Softmax(dim=1)
)
nn.Sequential builds a model layer by layer in a linear stack. Common layers include Linear (fully connected), activation functions (ReLU, Sigmoid), and Dropout. Softmax normalizes outputs into class probabilities. Note that if you train with nn.CrossEntropyLoss (see section 7), you should omit the final Softmax layer, because that loss applies log-softmax internally and expects raw logits.
Custom Model Class
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.fc1 = nn.Linear(28*28, 128)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x

model = MyModel()
For more control, subclass nn.Module and define a custom forward method. This is the standard approach in PyTorch for building models.
6. Common Layers
nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
nn.MaxPool2d(kernel_size=2, stride=2)
nn.LSTM(input_size=50, hidden_size=100, num_layers=2, batch_first=True)
nn.Embedding(num_embeddings=5000, embedding_dim=300)
nn.BatchNorm1d(128)
nn.Dropout(0.5)
Conv2d and MaxPool2d are standard for CNNs. LSTM is useful for sequences and time-series data. Embedding maps discrete tokens to dense vectors. BatchNorm1d and Dropout improve training stability and generalization.
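As an illustration, these layers can be combined into a small CNN. This is a minimal sketch that assumes 3-channel 32×32 inputs and 10 output classes:
cnn = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),   # 3x32x32 -> 32x32x32
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),                   # 32x32x32 -> 32x16x16
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 10)                               # logits for 10 classes
)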
7. Defining Loss and Optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
Choose a loss function suitable for your task (e.g., MSELoss for regression, CrossEntropyLoss for classification). The optimizer updates weights based on gradients; Adam is a good default choice.
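For a regression task, you might swap these for MSELoss and plain SGD instead (a minimal sketch; the learning rate and momentum are just illustrative values):
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)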
8. Training Loop
for epoch in range(5):
    for batch_x, batch_y in loader:
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")
Unlike TensorFlow/Keras where .fit() abstracts training, PyTorch requires an explicit loop:
- Forward pass (model(batch_x))
- Compute loss (criterion)
- Zero gradients (zero_grad)
- Backward pass (loss.backward())
- Update weights (optimizer.step())
This flexibility makes PyTorch great for research.
9. Evaluation
model.eval()
with torch.no_grad():
    correct = 0
    for batch_x, batch_y in loader:
        outputs = model(batch_x)
        predicted = torch.argmax(outputs, dim=1)
        correct += (predicted == batch_y).sum().item()
print("Accuracy:", correct / len(dataset))
Switch to evaluation mode with model.eval(), which disables dropout and batch norm updates. Use torch.no_grad() to avoid computing gradients and save memory.
10. Saving & Loading Models
# Save
torch.save(model.state_dict(), "model.pth")
# Load
model = MyModel()
model.load_state_dict(torch.load("model.pth"))
model.eval()
PyTorch recommends saving model weights with state_dict(). This gives flexibility if you later want to reload weights into a modified architecture. Call .eval() when using the model for inference.
11. GPU Acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
x = x.to(device)
Move models and tensors to GPU with .to(device). Always check for CUDA availability. For multi-GPU setups, use torch.nn.DataParallel or torch.distributed for distributed training.
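Inside a training loop this usually looks like the sketch below, where each batch is moved to the same device as the model (model, loader, criterion, and optimizer are assumed from the earlier sections):
model.to(device)
for batch_x, batch_y in loader:
    batch_x, batch_y = batch_x.to(device), batch_y.to(device)   # keep data and model on the same device
    outputs = model(batch_x)
    loss = criterion(outputs, batch_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()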
12. Visualization of Training
import matplotlib.pyplot as plt
losses = []
for epoch in range(5):
    for batch_x, batch_y in loader:
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
plt.plot(losses)
plt.xlabel("Iteration")
plt.ylabel("Loss")
plt.show()
Recording and plotting the loss curve helps monitor training progress. A decreasing loss indicates the model is learning; if loss stalls or increases, you may need to tune hyperparameters.
13. Transfer Learning with Pretrained Models
import torchvision.models as models
# Load pretrained ResNet18 (newer torchvision versions prefer weights=models.ResNet18_Weights.DEFAULT)
model = models.resnet18(pretrained=True)
# Freeze all layers
for param in model.parameters():
    param.requires_grad = False
# Replace final layer for 10-class classification
model.fc = nn.Linear(model.fc.in_features, 10)
Transfer learning lets you reuse pretrained models (like ResNet, VGG, BERT) for new tasks. Freezing layers saves training time and prevents overfitting on small datasets.
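Since the earlier layers are frozen, you would typically give the optimizer only the parameters of the new head (a minimal sketch; the learning rate is an illustrative value):
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)   # only the replaced layer is trained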
14. Custom Dataset Class
from torch.utils.data import Dataset
from PIL import Image
class MyDataset(Dataset):
    def __init__(self, file_paths, labels, transform=None):
        self.file_paths = file_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.file_paths)

    def __getitem__(self, idx):
        img = Image.open(self.file_paths[idx])
        if self.transform:
            img = self.transform(img)
        return img, self.labels[idx]
When working with custom image, text, or audio datasets, subclass Dataset and implement __getitem__ and __len__. Combine it with DataLoader for batching.
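Usage might look like this sketch, where file_paths and labels are hypothetical lists you have already built:
from torch.utils.data import DataLoader
from torchvision import transforms

transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
dataset = MyDataset(file_paths, labels, transform=transform)
loader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)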
15. Mixed Precision Training (Faster on GPU)
scaler = torch.cuda.amp.GradScaler()
for epoch in range(5):
    for x_batch, y_batch in loader:
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(x_batch)
            loss = criterion(outputs, y_batch)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
Mixed precision uses FP16 + FP32 for faster training with lower memory usage on GPUs that support Tensor Cores (NVIDIA RTX/A100). torch.cuda.amp automates this.
16. Gradient Clipping (Stabilizing Training)
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Gradient clipping prevents exploding gradients, especially in RNNs/LSTMs. It caps gradients at a specified norm.
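Clipping is applied after the backward pass and before the optimizer step, as in this sketch:
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)   # rescale gradients if their norm exceeds 1.0
optimizer.step()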
17. Distributed Training (Multi-GPU)
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
dist.init_process_group("nccl")          # one process per GPU (e.g. launched with torchrun)
rank = dist.get_rank()                   # this process's rank: 0, 1, ...
device = torch.device(f"cuda:{rank}")    # assumes a single node; use LOCAL_RANK otherwise
model = model.to(device)
model = DDP(model, device_ids=[rank])
PyTorch supports distributed training across multiple GPUs/machines with DistributedDataParallel. It is faster and more scalable than DataParallel.
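Each process should also see a different shard of the data, which is what DistributedSampler is for (a sketch, assuming the script is launched with torchrun):
from torch.utils.data import DistributedSampler

sampler = DistributedSampler(dataset)                          # call sampler.set_epoch(epoch) each epoch for reshuffling
loader = DataLoader(dataset, batch_size=32, sampler=sampler)
# launch: torchrun --nproc_per_node=4 train.py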
18. TorchScript for Model Deployment
scripted_model = torch.jit.script(model)
torch.jit.save(scripted_model, "model_scripted.pt")
TorchScript converts PyTorch models into a serialized format for deployment in production environments (e.g., C++ runtimes, mobile apps).
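The saved file can later be reloaded without the original Python class definition (a minimal sketch; the input shape matches the MLP from section 5):
loaded = torch.jit.load("model_scripted.pt")
loaded.eval()
with torch.no_grad():
    out = loaded(torch.rand(1, 28 * 28))   # run inference on a dummy input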
19. Learning Rate Scheduler
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)
for epoch in range(30):
    train(...)
    scheduler.step()
Schedulers adjust the learning rate dynamically during training to improve convergence. Options include StepLR, ExponentialLR, CosineAnnealingLR, and ReduceLROnPlateau.
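ReduceLROnPlateau is the one scheduler that needs a metric. A sketch, where validate(...) stands in for a hypothetical validation step returning a loss:
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)
for epoch in range(30):
    val_loss = validate(...)    # hypothetical validation step
    scheduler.step(val_loss)    # reduce the LR when val_loss stops improving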
20. Model Checkpointing
# Save
torch.save({
    'epoch': epoch,
    'model_state_dict': model.state_dict(),
    'optimizer_state_dict': optimizer.state_dict(),
    'loss': loss
}, "checkpoint.pth")
# Load
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
Instead of saving only model weights, save full checkpoints including optimizer state and epoch number to resume training later.
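To actually resume, restore the epoch counter as well (a minimal sketch; the total epoch count of 30 and the train(...) placeholder mirror section 19):
start_epoch = checkpoint['epoch'] + 1    # continue from the next epoch
model.train()                            # switch back to training mode before resuming
for epoch in range(start_epoch, 30):
    train(...)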
21. Visualization with TensorBoard
from torch.utils.tensorboard import SummaryWriter
writer = SummaryWriter("runs/experiment1")
for epoch in range(5):
    writer.add_scalar("Loss/train", loss.item(), epoch)
    writer.add_scalar("Accuracy/train", acc, epoch)
TensorBoard can be used with PyTorch for monitoring metrics, histograms, and graphs. Launch it via tensorboard --logdir=runs.
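You can also log weight histograms, and you should close the writer when done (a minimal sketch; epoch is the current training epoch):
for name, param in model.named_parameters():
    writer.add_histogram(name, param, epoch)   # distribution of each weight tensor
writer.close()                                 # flush and close the event file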
22. Autograd Tricks
x = torch.randn(3, requires_grad=True)
y = x ** 2
y.backward(torch.tensor([1.0, 0.5, 2.0]))
print(x.grad)
PyTorch’s autograd automatically computes gradients. You can pass custom gradients in .backward(), useful for vector-Jacobian products.
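For a scalar output you can also obtain gradients directly with torch.autograd.grad instead of .backward() (a minimal sketch):
x = torch.randn(3, requires_grad=True)
loss = (x ** 2).sum()                   # scalar output
grad, = torch.autograd.grad(loss, x)    # returns the gradient without populating x.grad
print(grad)                             # equals 2 * x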
Conclusion
This PyTorch Cheatsheet provides a complete quick reference guide to building, training, and deploying deep learning models with PyTorch. From creating tensors to writing custom training loops, this cheatsheet covers everything you need to get started and be productive.
Bookmark this guide and keep it handy while coding. With practice, PyTorch’s simplicity and flexibility will allow you to build powerful machine learning models confidently.
Related Reads
- The Ultimate TensorFlow Cheatsheet: From Basics to Advanced
- Applied Machine Learning – CS 5785 at Cornell Tech: Complete Course Guide
- NLP Text Preprocessing Cheatsheet 2025: The Ultimate Powerful Guide
- Plotly Cheatsheet 2025: Powerful Techniques from Beginner to Advanced
- Matplotlib Cheatsheet 2025: From Beginner to Advanced
External Links
🔗 PyTorch Official Documentation
🔗 PyTorch Tutorials – Official
🔗 PyTorch Beginner Guide (LearnPyTorch.io)
🔗 Deep Learning with PyTorch: A 60 Minute Blitz
🔗 GeeksforGeeks – PyTorch Basics