Neural Networks Explained

1. Neural Networks as Function Approximators

2. Vectors, Matrices, and Layers

3. Activation Functions and Nonlinearity

4. Loss Functions Define the Goal

5. Forward Pass: Computing Predictions

6. Gradient Descent and Learning Direction

7. Backpropagation: Efficient Credit Assignment

8. Parameter Updates and Optimizers

9. Data Pipelines and Normalization

10. Multilayer Perceptrons

11. Convolutional Neural Networks

12. RNNs and Transformers for Sequences

13. Generalization, Validation, and Regularization

14. Common Failure Modes

15. End-to-End Training Loop

1. Neural Networks as Function Approximators

Does a neural network explicitly store the training examples?

Usually no; it stores parameters that shape a function, though a model with enough capacity can effectively memorize individual training examples.

Why is the function view useful?

It unifies tasks such as classification, regression, translation, and control as input-output mapping problems.

Function approximation: Learning a parameterized function that maps inputs to desired outputs.

Input vector: A numerical representation of one example, often written as \(x\).

Parameters: Trainable weights and biases collectively denoted by \(\theta\).

Prediction: The network output, commonly written as \(\hat{y}\).
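
The function view can be made concrete with a minimal sketch. The linear model, the name `predict`, and the two parameter settings below are illustrative assumptions rather than anything the text prescribes; the point is only that the same input \(x\) maps to different predictions \(\hat{y}\) under different parameters \(\theta\).

```python
def predict(x, theta):
    """Compute y_hat = w . x + b for parameters theta = (w, b)."""
    w, b = theta
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


x = [3.0, 4.0]                 # input vector
theta_a = ([1.0, 2.0], 0.5)    # one hypothetical parameter setting
theta_b = ([0.0, -1.0], 0.0)   # another setting, hence another function

y_hat_a = predict(x, theta_a)  # 1*3 + 2*4 + 0.5 = 11.5
y_hat_b = predict(x, theta_b)  # 0*3 - 1*4 + 0.0 = -4.0
print(y_hat_a, y_hat_b)
```

Training, in this view, is the search for the \(\theta\) whose function best matches the desired outputs.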

2. Vectors, Matrices, and Layers

Why use matrices instead of writing every neuron separately?

Matrix notation is compact, matches hardware acceleration, and reveals the structure of layer computation.

What does a neuron represent mathematically?

It is one coordinate of a transformed vector, formed by a weighted sum, bias, and activation.

Weight matrix: A matrix \(W\) containing trainable coefficients connecting one layer to the next.

Bias vector: A trainable offset \(b\) added before activation.

Affine transformation: A linear map plus translation, written as \(Wx+b\).

Batch: A group of examples processed together for efficient training.
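
As a rough sketch of the terms above, the affine transformation \(Wx+b\) can be written in a few lines of plain Python and applied to a small batch. The helper name `affine` and the numbers are made up for illustration; real code would normally use an array library so the whole batch runs as one matrix product.

```python
def affine(W, x, b):
    """Return W x + b; each output coordinate is one neuron's pre-activation."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]


W = [[1.0, 0.0, -1.0],    # weights of neuron 0 (3 inputs)
     [2.0, 1.0, 0.0]]     # weights of neuron 1
b = [0.5, -1.0]           # one bias per neuron

batch = [[1.0, 2.0, 3.0],  # example 0
         [0.0, 1.0, 0.0]]  # example 1

outputs = [affine(W, x, b) for x in batch]
print(outputs)  # [[-1.5, 3.0], [0.5, 0.0]]
```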

3. Activation Functions and Nonlinearity

Why not use only linear layers?

A stack of linear layers is equivalent to a single linear layer, so it cannot model complex nonlinear relationships.

Is ReLU always the best activation?

No; alternatives such as GELU, tanh, sigmoid, and leaky ReLU can be better depending on architecture and task.

Activation function: A nonlinear function applied to layer outputs before passing them onward.

ReLU: The function \(\max(0,z)\), widely used in hidden layers.

Sigmoid: A squashing function that maps real values into the interval \((0,1)\).

Vanishing gradient: A failure mode where gradients become too small to train early layers effectively.
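
The claim that stacked linear layers collapse into one can be checked numerically. The sketch below uses two arbitrary 2x2 weight matrices (an assumption for illustration, with biases omitted) and shows that \(W_2(W_1 x)\) equals \((W_2 W_1)x\), while inserting a ReLU between the layers changes the result.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    return [sum(w * x_j for w, x_j in zip(row, x)) for row in W]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


W1 = [[1.0, 2.0], [3.0, 4.0]]
W2 = [[0.0, 1.0], [1.0, 1.0]]
x = [1.0, -1.0]

stacked = matvec(W2, matvec(W1, x))        # two linear layers in sequence
collapsed = matvec(matmul(W2, W1), x)      # one layer with weights W2 W1
with_relu = matvec(W2, [relu(z) for z in matvec(W1, x)])

print(stacked, collapsed, with_relu)  # [-1.0, -2.0] [-1.0, -2.0] [0.0, 0.0]
```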

4. Loss Functions Define the Goal

Why must the loss be differentiable or almost differentiable?

Gradient-based learning needs derivatives or subgradients to know how to change parameters.

Can high accuracy and high loss happen together?

Yes, if correct predictions are made with low confidence or a few wrong predictions are extremely confident.

Loss function: A scalar objective measuring how bad predictions are.

Mean squared error: A regression loss based on squared differences between predictions and targets.

Cross-entropy: A classification loss that penalizes low probability assigned to the true class.

Regularization: An added constraint or penalty that reduces overfitting.
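
A small sketch of the two losses defined above, using hand-picked numbers that are purely illustrative. It also hints at why accuracy and loss can disagree: a hesitant correct prediction costs more than a confident correct one, and a single confident mistake costs far more than either.

```python
import math


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(probabilities[true_class])


mse = mean_squared_error([2.0, 4.0], [1.0, 6.0])    # (1 + 4) / 2 = 2.5

# Three binary predictions; class 0 is the true class each time.
hesitant_correct = cross_entropy([0.55, 0.45], 0)   # about 0.60
confident_correct = cross_entropy([0.99, 0.01], 0)  # about 0.01
confident_wrong = cross_entropy([0.01, 0.99], 0)    # about 4.61

print(mse, hesitant_correct, confident_correct, confident_wrong)
```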

5. Forward Pass: Computing Predictions

Why store intermediate activations during training?

Backpropagation needs them to calculate derivatives for earlier parameters.

Is inference the same as training?

No; inference computes predictions only, while training also computes loss, gradients, and updates.

Forward pass: The computation from input through layers to prediction.

Hidden representation: An intermediate vector produced inside the network.

Softmax: A function that converts class scores into probabilities summing to one.

Inference: Using a trained model to make predictions without updating parameters.
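
The sketch below illustrates a forward pass through a hypothetical two-layer network: affine layer, ReLU, affine layer, softmax. The weights, the `params` dictionary layout, and the `cache` of intermediate values are assumptions chosen for illustration; the cache stands in for the activations that training would keep for backpropagation.

```python
import math


def affine(W, x, b):
    return [sum(w * x_j for w, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def softmax(scores):
    shift = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, params):
    z1 = affine(params["W1"], x, params["b1"])
    h = [max(0.0, v) for v in z1]            # hidden representation (ReLU)
    scores = affine(params["W2"], h, params["b2"])
    probs = softmax(scores)
    cache = {"x": x, "z1": z1, "h": h, "scores": scores}
    return probs, cache


params = {
    "W1": [[1.0, -1.0], [0.5, 0.5]], "b1": [0.0, 0.0],
    "W2": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], "b2": [0.0, 0.0, 0.0],
}
probs, cache = forward([2.0, 1.0], params)
print(cache["scores"], probs)   # scores are [1.0, 1.5, 2.5]
```

At inference time only `probs` is needed, so the cache can be discarded.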

6. Gradient Descent and Learning Direction

Why subtract the gradient?

The gradient points toward steepest local increase, so subtracting it moves toward lower loss.

Can gradient descent get stuck?

It can slow down near saddle points and flat regions, or settle in a poor local minimum, but stochasticity, momentum, and adaptive methods often help.

Gradient: A vector containing derivatives of the loss with respect to parameters.

Learning rate: The step size \(\eta\) used in parameter updates.

Mini-batch SGD: Gradient descent using a small subset of training examples per update.

Nonconvex surface: A loss landscape with many valleys, saddle points, and local minima.
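
To see the update rule \(\theta \leftarrow \theta - \eta \nabla L(\theta)\) in action, here is a minimal sketch on a one-dimensional convex loss, \(L(\theta) = (\theta - 3)^2\). The loss, the starting point, and both learning rates are assumptions picked for illustration; real network losses are nonconvex and high-dimensional.

```python
def loss(theta):
    return (theta - 3.0) ** 2


def grad(theta):
    return 2.0 * (theta - 3.0)   # dL/dtheta


def gradient_descent(eta, steps, theta=0.0):
    history = [loss(theta)]
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # step against the gradient
        history.append(loss(theta))
    return theta, history


theta_good, history_good = gradient_descent(eta=0.1, steps=50)
theta_bad, history_bad = gradient_descent(eta=1.1, steps=50)

print(theta_good)   # close to the minimizer 3.0
print(theta_bad)    # far from 3.0: the step size is too large
```

With \(\eta = 0.1\) each step shrinks the distance to the minimum by a factor of 0.8; with \(\eta = 1.1\) each step overshoots and the distance grows by a factor of 1.2.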

7. Backpropagation: Efficient Credit Assignment

Is backpropagation biologically realistic?

It is mainly an engineering and mathematical algorithm, not a confirmed model of biological learning.

Why is backpropagation efficient?

It reuses intermediate derivatives and computes all parameter gradients in time comparable to a few forward passes.

Backpropagation: An algorithm for computing gradients through composed operations.

Chain rule: A calculus rule for differentiating nested functions.

Credit assignment: Determining how internal parameters affected the final loss.

Optimizer: A method that uses gradients to update parameters.
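
A hand-written example may make the chain rule concrete. The sketch assumes a tiny scalar network, \(\hat{y} = w_2 \tanh(w_1 x + b_1)\) with squared-error loss, which is not taken from the text; it computes gradients by backpropagation and compares them against finite differences as a sanity check.

```python
import math


def forward(params, x, y):
    w1, b1, w2 = params
    z = w1 * x + b1
    h = math.tanh(z)
    y_hat = w2 * h
    loss = (y_hat - y) ** 2
    return loss, (z, h, y_hat)          # keep intermediates for the backward pass


def backward(params, x, y, cache):
    w1, b1, w2 = params
    z, h, y_hat = cache
    d_y_hat = 2.0 * (y_hat - y)         # dL/dy_hat
    d_w2 = d_y_hat * h                  # y_hat = w2 * h
    d_h = d_y_hat * w2
    d_z = d_h * (1.0 - h ** 2)          # derivative of tanh
    d_w1 = d_z * x                      # z = w1 * x + b1
    d_b1 = d_z
    return [d_w1, d_b1, d_w2]


def numerical_gradient(params, x, y, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, y)[0] - forward(down, x, y)[0]) / (2 * eps))
    return grads


params = [0.5, -0.2, 1.5]
x, y = 2.0, 1.0
loss, cache = forward(params, x, y)
analytic = backward(params, x, y, cache)
numeric = numerical_gradient(params, x, y)
print(analytic)
print(numeric)
```

Note that `d_z` is computed once and reused for both `d_w1` and `d_b1`; that reuse of intermediate derivatives is where the efficiency comes from.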

8. Parameter Updates and Optimizers

Why is Adam popular?

It often works well with minimal tuning because it adapts learning rates for different parameters.

Does a better optimizer guarantee better generalization?

No; it may reduce training loss faster, but validation performance also depends on data, model size, and regularization.

SGD: Stochastic gradient descent, an optimizer using mini-batch gradient estimates.

Momentum: A technique that accumulates update direction to smooth optimization.

Adam: An adaptive optimizer using estimates of first and second gradient moments.

Weight decay: A regularization update that discourages large parameter values.
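
The update rules behind these terms can be sketched for a single scalar parameter. The function names and hyperparameter defaults below are illustrative assumptions (the Adam defaults follow commonly used values), and the toy loss \(\theta^2\) is chosen only so the behavior is easy to follow.

```python
import math


def sgd_momentum_step(theta, grad, velocity, lr=0.1, momentum=0.9):
    velocity = momentum * velocity - lr * grad   # accumulate update direction
    return theta + velocity, velocity


def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2    # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias correction
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Momentum SGD on the toy loss theta^2, whose gradient is 2 * theta.
theta_sgd, velocity = 1.0, 0.0
for _ in range(200):
    theta_sgd, velocity = sgd_momentum_step(theta_sgd, 2.0 * theta_sgd, velocity)

# Adam's first step has size close to lr whether the gradient is small or large.
after_small, _, _ = adam_step(1.0, 2.0, 0.0, 0.0, t=1)
after_large, _, _ = adam_step(1.0, 200.0, 0.0, 0.0, t=1)
print(theta_sgd, after_small, after_large)
```

The last two values are both close to 0.9, which is one way to see the per-parameter step-size adaptation described above.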

9. Data Pipelines and Normalization

Why compute normalization statistics only on training data?

Using validation or test data would leak information and overestimate real-world performance.

Can preprocessing change model behavior?

Yes; inconsistent scaling or tokenization can make a trained model fail even if the architecture is unchanged.

Data pipeline: The process that prepares and delivers data batches for training or inference.

Normalization: Rescaling features to comparable numerical ranges or distributions.

Data leakage: Accidental use of validation or test information during training.

Tokenization: Converting text into discrete units that a model can process.
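
A minimal sketch of leakage-free standardization, assuming a single numeric feature and made-up values: the mean and standard deviation are fitted on the training split only and then reused unchanged on the validation split.

```python
import math


def fit_normalizer(train_values):
    """Compute statistics from training data only."""
    mean = sum(train_values) / len(train_values)
    variance = sum((v - mean) ** 2 for v in train_values) / len(train_values)
    return mean, math.sqrt(variance)


def normalize(values, mean, std):
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
validation = [6.0, 0.0]

mean, std = fit_normalizer(train)                      # fitted on train only
train_norm = normalize(train, mean, std)
validation_norm = normalize(validation, mean, std)     # same statistics reused

print(mean, std)
print(validation_norm)
```

The same `mean` and `std` would also have to be stored and applied at inference time; otherwise the model would see inputs on a different scale from the one it was trained on.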

10. Multilayer Perceptrons

Are MLPs obsolete?

No; they remain strong for tabular data, embeddings, and as components inside modern architectures.

Why can dense layers be inefficient for images?

They do not exploit local spatial structure, so they may require many parameters to learn simple visual patterns.

MLP: A feedforward network composed mainly of fully connected layers.

Dense layer: A layer where each output unit receives input from every previous unit.

Feedforward: Information flows from input to output without recurrent loops.

Parameter count: The number of trainable weights and biases in a model.
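
Parameter count for an MLP follows directly from the layer sizes: each dense layer contributes one weight per input-output pair plus one bias per output. The layer sizes below (a flattened 28x28 image, one hidden layer of 128 units, 10 classes) are an assumed example, not something specified in the text.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of fully connected layers between consecutive sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out    # weight matrix plus bias vector
    return total


count = mlp_parameter_count([784, 128, 10])
print(count)  # 784*128 + 128 + 128*10 + 10 = 101770
```

Most of these parameters sit in the first layer, which is one reason dense layers scale poorly to larger images.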

11. Convolutional Neural Networks

Convolutional neural networks are designed for grid-like data such as images, spectrograms, and medical scans. Instead of connecting every input pixel to every neuron, a convolution applies small learnable filters across local neighborhoods. This creates parameter sharing: the same detector can recognize an edge or texture in many positions. A convolutional layer transforms an input tensor into feature maps, often followed by nonlinearities, pooling, normalization, or residual connections. CNNs encode useful inductive biases: locality, translation equivariance, and hierarchical feature extraction. Early layers often detect edges and colors, middle layers detect motifs, and later layers detect object parts or semantic patterns. This design greatly reduces parameters compared with fully connected image models while improving data efficiency.

Why are CNNs good for images?

They exploit local patterns and reuse filters across positions, matching common visual structure.

What does pooling do?

Pooling summarizes nearby activations, reducing spatial size and adding some robustness to small shifts.

Convolution: A local filtering operation applied across a grid-like input.

Feature map: A spatial array of activations produced by a learned filter.

Parameter sharing: Reusing the same weights at multiple input locations.

Inductive bias: A built-in modeling assumption that helps learning on certain data types.
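
Parameter sharing can be illustrated with a tiny hand-rolled 2D convolution. This is a sketch under simplifying assumptions: one channel, no padding, stride 1, and, as in most deep learning libraries, the kernel is not flipped (strictly a cross-correlation). The 3x4 image and the two-weight edge filter are invented for the example.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image and take a weighted sum at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


image = [[0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0],
         [0.0, 0.0, 1.0, 1.0]]      # dark left half, bright right half
kernel = [[-1.0, 1.0]]              # responds to a dark-to-bright step

feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]

conv_weights = len(kernel) * len(kernel[0])                              # 2, shared everywhere
dense_weights = (len(image) * len(image[0])) * (len(feature_map) * len(feature_map[0]))
print(conv_weights, dense_weights)  # 2 108
```

The same two weights detect the edge in every row, whereas a dense layer producing the same 3x3 output from the 12 input pixels would need 108 separate weights.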

12. RNNs and Transformers for Sequences

Why did Transformers replace many RNNs?

They model long-range context more directly and train efficiently in parallel on modern hardware.

Do Transformers understand word order?

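They need positional information, such as positional embeddings, because self-attention alone is permutation-equivariant: reordering the input tokens only reorders the outputs, so word order by itself carries no signal.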

RNN: A recurrent model that updates hidden state across sequence steps.

Hidden state: A vector storing information from previous sequence elements.

Self-attention: A mechanism letting tokens weight information from other tokens.

Transformer: A sequence architecture built primarily from attention and feedforward blocks.
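
The need for positional information can be checked with a stripped-down self-attention sketch. As a simplifying assumption, queries, keys, and values are the token vectors themselves (no learned projections, a single head), and the three two-dimensional tokens are invented. Shuffling the inputs simply shuffles the outputs in the same way, so nothing in the computation depends on position.

```python
import math


def softmax(scores):
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product attention with queries = keys = values = tokens."""
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)            # how much this token attends to each token
        outputs.append([sum(w * value[j] for w, value in zip(weights, tokens))
                        for j in range(d)])
    return outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
order = [2, 0, 1]                            # a reordering of the sequence

out = self_attention(tokens)
out_shuffled = self_attention([tokens[i] for i in order])
expected = [out[i] for i in order]           # the original outputs, reordered

matches = all(abs(a - b) < 1e-12
              for row_a, row_b in zip(out_shuffled, expected)
              for a, b in zip(row_a, row_b))
print(matches)  # True
```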

13. Generalization, Validation, and Regularization

A network’s goal is not to minimize training loss perfectly, but to perform well on unseen data. Generalization is estimated with validation and test sets that are not used for parameter updates. Overfitting occurs when a model learns training-specific noise or shortcuts, producing low training loss and poor validation performance. Regularization methods reduce this risk: dropout randomly masks activations during training, data augmentation creates realistic variations, weight decay penalizes large weights, and early stopping halts training when validation performance stops improving. The bias-variance tradeoff appears in model capacity: a network that is too small underfits, while a large, poorly regularized network may overfit. Good evaluation also requires metrics that reflect the task, the class balance, and the deployment conditions.

Why not train until the loss is as low as possible?

Training loss may keep improving after validation performance worsens, indicating overfitting.

Is a larger dataset a form of regularization?

In effect, yes; more diverse data reduces reliance on accidental patterns and improves generalization, even though it is not a penalty or constraint on the model itself.

Generalization: The ability to perform well on data not seen during training.

Overfitting: Learning training-specific patterns that do not transfer to new examples.

Dropout: A regularization method that randomly disables activations during training.

Validation set: Held-out data used to tune models without training on it directly.
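
Two of the regularization methods above are simple enough to sketch directly. The code assumes the common "inverted" form of dropout (kept activations are rescaled by \(1/(1-p)\)) and a patience-based early stopping rule; the function names, the patience value, and the validation losses are hypothetical.

```python
import random


def dropout(activations, p, rng):
    """Inverted dropout: zero each activation with probability p, rescale the rest."""
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


def early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch, bad_epochs = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch     # stop: no improvement for `patience` epochs
    return best_epoch, len(val_losses) - 1


rng = random.Random(0)
dropped = dropout([1.0] * 8, p=0.5, rng=rng)   # each entry is either 0.0 or 2.0

val_losses = [1.0, 0.8, 0.7, 0.72, 0.75, 0.74, 0.9]
best_epoch, stop_epoch = early_stopping(val_losses, patience=2)
print(dropped)
print(best_epoch, stop_epoch)  # 2 4
```

Dropout is applied only during training; at inference the activations pass through unchanged, which is why the rescaling is done at training time.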

14. Common Failure Modes

Neural networks fail in predictable ways that are often diagnosable. Underfitting appears when both training and validation performance are poor, suggesting insufficient capacity, bad features, weak optimization, or excessive regularization. Overfitting appears when training performance is strong but validation performance is weak. Vanishing or exploding gradients make deep or recurrent models train slowly or unstably; gradient clipping, normalization, residual connections, and better initialization can help. Dataset problems are equally serious: mislabeled examples, class imbalance, distribution shift, leakage, and spurious correlations can produce models that look accurate in benchmarks but fail in deployment. Numerical issues such as NaNs, saturated activations, and inappropriate learning rates can stop learning entirely. Debugging requires observing data, losses, gradients, and predictions together.

What is the first thing to check when training fails?

Inspect the data and labels, then verify that loss decreases on a tiny subset of examples.

Can high benchmark accuracy still be unsafe?

Yes; models may rely on shortcuts, fail under distribution shift, or perform poorly for underrepresented groups.

Underfitting: Failure to learn enough structure from the training data.

Distribution shift: A mismatch between training data and real deployment data.

Exploding gradient: A gradient that grows too large, causing unstable updates.

Spurious correlation: A misleading pattern that predicts labels in training but is not causally reliable.
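
Two of the diagnostics mentioned in this section, gradient clipping and checking for NaNs, can be sketched in a few lines. The helper names and the threshold are illustrative assumptions; deep learning frameworks ship their own versions of these utilities.

```python
import math


def gradient_norm(grads):
    """Euclidean norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grads))


def clip_by_norm(grads, max_norm):
    """Rescale gradients so their norm does not exceed max_norm."""
    norm = gradient_norm(grads)
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


def has_non_finite(values):
    return any(math.isnan(v) or math.isinf(v) for v in values)


clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)     # norm 5 is rescaled to norm 1
untouched = clip_by_norm([0.3, 0.4], max_norm=1.0)   # norm 0.5 is left alone

print(clipped, gradient_norm(clipped))
print(has_non_finite([1.0, float("nan")]))  # True
```

Clipping keeps the gradient direction and only limits its length, so a single exploding step cannot throw the parameters far away.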

15. End-to-End Training Loop

Why log both training and validation metrics?

Their relationship reveals optimization progress, overfitting, underfitting, and data issues.

Why save checkpoints?

They allow recovery from failures and make it possible to select the best validation-performing model.

Training loop: The repeated procedure that moves from data batches to parameter updates.

Epoch: One pass through the training dataset.

Checkpoint: A saved model state that can be restored later.

Gradient norm: A scalar summary of gradient magnitude used to monitor stability.
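
The pieces defined above fit together roughly as in the following sketch. Everything concrete here is an assumption for illustration: a one-feature linear model, noise-free synthetic data generated from \(y = 2x + 1\), plain mini-batch SGD, and an in-memory dictionary standing in for a checkpoint file.

```python
import math
import random


def make_data(xs):
    return [(x, 2.0 * x + 1.0) for x in xs]      # synthetic target y = 2x + 1


def mean_loss(params, data):
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def batch_gradient(params, batch):
    w, b = params
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
    grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
    return grad_w, grad_b


train = make_data([-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0])
validation = make_data([-1.25, 0.25, 1.75])

rng = random.Random(0)
params = (0.0, 0.0)
lr, batch_size, epochs = 0.1, 4, 200
best = {"val_loss": float("inf"), "params": params, "epoch": -1}   # checkpoint stand-in
log = []

for epoch in range(epochs):                       # one epoch = one pass over the data
    rng.shuffle(train)
    for start in range(0, len(train), batch_size):
        batch = train[start:start + batch_size]
        grad_w, grad_b = batch_gradient(params, batch)
        grad_norm = math.sqrt(grad_w ** 2 + grad_b ** 2)
        params = (params[0] - lr * grad_w, params[1] - lr * grad_b)

    train_loss = mean_loss(params, train)
    val_loss = mean_loss(params, validation)
    log.append((epoch, train_loss, val_loss, grad_norm))   # last batch's gradient norm
    if val_loss < best["val_loss"]:                         # keep the best model so far
        best = {"val_loss": val_loss, "params": params, "epoch": epoch}

w, b = best["params"]
print(w, b, best["val_loss"])   # w close to 2, b close to 1
```

In a real project the checkpoint would be written to disk and the log sent to a monitoring tool, but the control flow (shuffle, batch, gradient, update, evaluate, save) has the same shape.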
