Building a Neural Network from Scratch¶
In this notebook, we'll build a complete neural network using only NumPy. We'll implement:
- Forward propagation with multiple layers
- Various activation functions
- Backpropagation algorithm
- Mini-batch gradient descent
- Training on a scaled-down, MNIST-style digit classification task (scikit-learn's 8×8 digits)
This hands-on implementation will help you understand the inner workings of neural networks.
In [14]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_moons, make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
# Set random seed for reproducibility
np.random.seed(42)
# Configure matplotlib
plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline
1. Activation Functions¶
Let's implement common activation functions and their derivatives.
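For reference, the functions implemented in the next cell and their derivatives are (the softmax derivative is not listed separately because it is folded into the cross-entropy gradient during backpropagation):

$$\mathrm{ReLU}(x)=\max(0,x),\qquad \mathrm{ReLU}'(x)=\mathbf{1}[x>0]$$

$$\sigma(x)=\frac{1}{1+e^{-x}},\qquad \sigma'(x)=\sigma(x)\,(1-\sigma(x))$$

$$\tanh'(x)=1-\tanh^2(x)$$

$$\mathrm{softmax}(x)_i=\frac{e^{x_i-\max_j x_j}}{\sum_k e^{x_k-\max_j x_j}}$$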
In [15]:
class Activations:
    @staticmethod
    def relu(x):
        """ReLU activation function."""
        return np.maximum(0, x)

    @staticmethod
    def relu_derivative(x):
        """Derivative of ReLU."""
        return (x > 0).astype(float)

    @staticmethod
    def sigmoid(x):
        """Sigmoid activation function."""
        return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

    @staticmethod
    def sigmoid_derivative(x):
        """Derivative of sigmoid."""
        s = Activations.sigmoid(x)
        return s * (1 - s)

    @staticmethod
    def tanh(x):
        """Tanh activation function."""
        return np.tanh(x)

    @staticmethod
    def tanh_derivative(x):
        """Derivative of tanh."""
        return 1 - np.tanh(x) ** 2

    @staticmethod
    def softmax(x):
        """Softmax activation for multi-class classification."""
        exp_x = np.exp(x - np.max(x, axis=1, keepdims=True))
        return exp_x / np.sum(exp_x, axis=1, keepdims=True)
# Visualize the activation functions and their derivatives
def plot_activation_functions():
    x = np.linspace(-5, 5, 100)
    fig, axes = plt.subplots(2, 3, figsize=(12, 8))

    functions = [
        ('ReLU', Activations.relu, Activations.relu_derivative),
        ('Sigmoid', Activations.sigmoid, Activations.sigmoid_derivative),
        ('Tanh', Activations.tanh, Activations.tanh_derivative),
    ]

    for col, (name, func, deriv) in enumerate(functions):
        # Top row: the activation itself (blue); bottom row: its derivative (red)
        for row, (f, title) in enumerate([(func, name), (deriv, f'{name} Derivative')]):
            ax = axes[row, col]
            fmt = 'b-' if row == 0 else 'r-'
            ax.plot(x, f(x), fmt, linewidth=2)
            ax.set_title(title, fontweight='bold')
            ax.grid(True, alpha=0.3)
            ax.axhline(y=0, color='k', linewidth=0.5)
            ax.axvline(x=0, color='k', linewidth=0.5)

    plt.suptitle('Activation Functions and Their Derivatives', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

plot_activation_functions()
2. Neural Network Implementation¶
In [16]:
class NeuralNetwork:
    def __init__(self, layer_sizes, activation='relu', learning_rate=0.01):
        """
        Initialize the neural network.

        Parameters:
        - layer_sizes: list of integers giving the number of neurons in each layer
        - activation: hidden-layer activation function ('relu', 'sigmoid', 'tanh')
        - learning_rate: learning rate for gradient descent
        """
        self.layer_sizes = layer_sizes
        self.n_layers = len(layer_sizes)
        self.learning_rate = learning_rate

        # Set activation functions
        self.activation_name = activation
        if activation == 'relu':
            self.activation = Activations.relu
            self.activation_derivative = Activations.relu_derivative
        elif activation == 'sigmoid':
            self.activation = Activations.sigmoid
            self.activation_derivative = Activations.sigmoid_derivative
        elif activation == 'tanh':
            self.activation = Activations.tanh
            self.activation_derivative = Activations.tanh_derivative
        else:
            raise ValueError(f"Unknown activation: {activation}")

        # Initialize weights and biases
        self.weights = []
        self.biases = []
        for i in range(1, self.n_layers):
            # He initialization for ReLU, Xavier for others
            if activation == 'relu':
                w = np.random.randn(layer_sizes[i-1], layer_sizes[i]) * np.sqrt(2.0 / layer_sizes[i-1])
            else:
                w = np.random.randn(layer_sizes[i-1], layer_sizes[i]) * np.sqrt(1.0 / layer_sizes[i-1])
            b = np.zeros((1, layer_sizes[i]))
            self.weights.append(w)
            self.biases.append(b)

        # Storage for training history (validation metrics recorded once per epoch)
        self.loss_history = []
        self.accuracy_history = []

    def forward_propagation(self, X):
        """
        Perform forward propagation.

        Returns:
        - activations: list of activations for each layer (including the input)
        - z_values: list of pre-activation weighted sums for each layer
        """
        activations = [X]
        z_values = []
        for i in range(self.n_layers - 1):
            z = activations[-1] @ self.weights[i] + self.biases[i]
            z_values.append(z)
            # Use softmax for the output layer in multi-class classification
            if i == self.n_layers - 2 and self.layer_sizes[-1] > 1:
                a = Activations.softmax(z)
            # Use sigmoid for the binary classification output
            elif i == self.n_layers - 2 and self.layer_sizes[-1] == 1:
                a = Activations.sigmoid(z)
            else:
                a = self.activation(z)
            activations.append(a)
        return activations, z_values

    def compute_loss(self, y_true, y_pred):
        """Compute the loss based on the problem type."""
        # Binary cross-entropy for binary classification
        if self.layer_sizes[-1] == 1:
            y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
            loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        # Categorical cross-entropy for multi-class
        else:
            y_pred = np.clip(y_pred, 1e-7, 1 - 1e-7)
            loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))
        return loss

    def backward_propagation(self, X, y, activations, z_values):
        """Perform backward propagation and update the weights."""
        m = X.shape[0]

        # Initialize gradient storage
        dw_list = [None] * (self.n_layers - 1)
        db_list = [None] * (self.n_layers - 1)

        # Output-layer gradient: for both sigmoid + binary cross-entropy and
        # softmax + categorical cross-entropy, dL/dz simplifies to (a - y)
        dz = activations[-1] - y

        # Backpropagate through the layers
        for layer_idx in range(self.n_layers - 2, -1, -1):
            # Weight and bias gradients for this layer
            dw = (1/m) * (activations[layer_idx].T @ dz)
            db = (1/m) * np.sum(dz, axis=0, keepdims=True)
            dw_list[layer_idx] = dw
            db_list[layer_idx] = db

            # Gradient for the previous layer (if not at the input layer)
            if layer_idx > 0:
                # Propagate the gradient backwards through the current weights
                da = dz @ self.weights[layer_idx].T
                # Apply the activation derivative of the previous layer
                dz = da * self.activation_derivative(z_values[layer_idx - 1])

        # Update all weights and biases
        for i in range(self.n_layers - 1):
            self.weights[i] -= self.learning_rate * dw_list[i]
            self.biases[i] -= self.learning_rate * db_list[i]

    def train(self, X, y, epochs=100, batch_size=32, validation_split=0.2, verbose=True):
        """Train the neural network with mini-batch gradient descent."""
        # Split the data for validation
        n_samples = X.shape[0]
        n_val = int(n_samples * validation_split)
        indices = np.random.permutation(n_samples)
        X_train = X[indices[n_val:]]
        y_train = y[indices[n_val:]]
        X_val = X[indices[:n_val]]
        y_val = y[indices[:n_val]]
        n_train = X_train.shape[0]

        for epoch in range(epochs):
            # Shuffle the training data
            indices = np.random.permutation(n_train)
            X_train = X_train[indices]
            y_train = y_train[indices]

            # Mini-batch training
            for i in range(0, n_train, batch_size):
                batch_X = X_train[i:i+batch_size]
                batch_y = y_train[i:i+batch_size]
                # Forward propagation
                activations, z_values = self.forward_propagation(batch_X)
                # Backward propagation
                self.backward_propagation(batch_X, batch_y, activations, z_values)

            # Calculate validation metrics
            val_activations, _ = self.forward_propagation(X_val)
            val_loss = self.compute_loss(y_val, val_activations[-1])
            if self.layer_sizes[-1] == 1:
                val_predictions = (val_activations[-1] > 0.5).astype(int)
                val_accuracy = np.mean(val_predictions == y_val)
            else:
                val_predictions = np.argmax(val_activations[-1], axis=1)
                val_accuracy = np.mean(val_predictions == np.argmax(y_val, axis=1))
            self.loss_history.append(val_loss)
            self.accuracy_history.append(val_accuracy)

            if verbose and epoch % 10 == 0:
                print(f"Epoch {epoch:3d} - Loss: {val_loss:.4f}, Accuracy: {val_accuracy:.4f}")

    def predict(self, X):
        """Make predictions (class labels)."""
        activations, _ = self.forward_propagation(X)
        if self.layer_sizes[-1] == 1:
            return (activations[-1] > 0.5).astype(int)
        else:
            return np.argmax(activations[-1], axis=1)

print("Neural Network class defined successfully!")
Neural Network class defined successfully!
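In equation form, with the row-vector convention used above ($z^{[l]} = a^{[l-1]} W^{[l]} + b^{[l]}$, batch size $m$), the losses and gradients implemented in `compute_loss` and `backward_propagation` are:

$$\mathcal{L}_{\text{BCE}} = -\frac{1}{m}\sum_{i=1}^{m}\bigl[y_i\log\hat{y}_i + (1-y_i)\log(1-\hat{y}_i)\bigr], \qquad \mathcal{L}_{\text{CCE}} = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k} y_{ik}\log\hat{y}_{ik}$$

$$\delta^{[L]} = a^{[L]} - y, \qquad \frac{\partial\mathcal{L}}{\partial W^{[l]}} = \frac{1}{m}\,(a^{[l-1]})^{\top}\delta^{[l]}, \qquad \frac{\partial\mathcal{L}}{\partial b^{[l]}} = \frac{1}{m}\sum_{i=1}^{m}\delta^{[l]}_{i}, \qquad \delta^{[l-1]} = \bigl(\delta^{[l]}(W^{[l]})^{\top}\bigr)\odot g'\!\bigl(z^{[l-1]}\bigr)$$

followed by the updates $W^{[l]} \leftarrow W^{[l]} - \eta\,\partial\mathcal{L}/\partial W^{[l]}$ and $b^{[l]} \leftarrow b^{[l]} - \eta\,\partial\mathcal{L}/\partial b^{[l]}$. The simple form $\delta^{[L]} = a^{[L]} - y$ holds because the sigmoid output is paired with binary cross-entropy and the softmax output with categorical cross-entropy.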
3. Test on Simple Classification Problem¶
In [17]:
# Generate moon dataset
X, y = make_moons(n_samples=1000, noise=0.1, random_state=42)
y = y.reshape(-1, 1)
# Normalize features
scaler = StandardScaler()
X = scaler.fit_transform(X)
# Create and train neural network
nn = NeuralNetwork(
    layer_sizes=[2, 16, 8, 1],  # 2 inputs, 2 hidden layers, 1 output
    activation='relu',
    learning_rate=0.1
)
print("Training neural network on moon dataset...")
nn.train(X, y, epochs=100, batch_size=32, verbose=False)
print(f"Final accuracy: {nn.accuracy_history[-1]:.4f}")
Training neural network on moon dataset...
Final accuracy: 1.0000
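As a quick sanity check (not part of the original training run), the sketch below scores the trained network on the full standardized moons dataset, reusing the `nn`, `X`, and `y` objects defined above.
In [ ]:
# Sanity-check sketch: accuracy of the trained network on all standardized moons samples
preds = nn.predict(X)  # shape (n_samples, 1), values in {0, 1}
print(f"Accuracy on the full dataset: {np.mean(preds == y):.4f}")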
4. Visualize Decision Boundary and Training Progress¶
In [18]:
def plot_decision_boundary(X, y, model, title="Decision Boundary"):
    """Plot the decision boundary for 2D data."""
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(10, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.RdBu, alpha=0.8)
    # Plot the data points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y.ravel(), cmap=plt.cm.RdBu,
                          edgecolor='black', linewidth=0.5)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title, fontsize=14, fontweight='bold')
    plt.colorbar(scatter)
    plt.tight_layout()
    plt.show()

# Plot decision boundary
plot_decision_boundary(X, y, nn, "Neural Network Decision Boundary")

# Plot training history (both curves are computed on the validation split)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

ax1.plot(nn.loss_history, 'b-', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Validation Loss', fontweight='bold')
ax1.grid(True, alpha=0.3)

ax2.plot(nn.accuracy_history, 'g-', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Validation Accuracy', fontweight='bold')
ax2.grid(True, alpha=0.3)

plt.suptitle('Training Progress', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
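Convergence also depends heavily on the learning rate (an insight revisited in the summary). The following is a rough sketch, not part of the original experiments, that retrains the same [2, 16, 8, 1] architecture on the moons data at several learning rates; exact numbers will vary with random initialization.
In [ ]:
# Sketch: learning-rate sensitivity on the moons data
for lr in [0.001, 0.01, 0.1, 1.0]:
    nn_lr = NeuralNetwork(layer_sizes=[2, 16, 8, 1], activation='relu', learning_rate=lr)
    nn_lr.train(X, y, epochs=100, batch_size=32, verbose=False)
    print(f"learning_rate={lr:<6} final validation accuracy: {nn_lr.accuracy_history[-1]:.4f}")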
5. Visualize Network Architecture¶
In [19]:
def visualize_network_architecture(nn):
    """Visualize the neural network architecture."""
    fig, ax = plt.subplots(figsize=(12, 8))
    layer_sizes = nn.layer_sizes
    n_layers = len(layer_sizes)

    # Horizontal positions of the layers
    layer_positions = np.linspace(0.1, 0.9, n_layers)

    # Draw neurons and connections
    for layer_idx in range(n_layers):
        layer_size = layer_sizes[layer_idx]
        x_pos = layer_positions[layer_idx]
        # Vertical positions for the neurons in this layer
        y_positions = np.linspace(0.1, 0.9, layer_size)

        # Draw neurons
        for neuron_idx in range(layer_size):
            y_pos = y_positions[neuron_idx]

            # Draw connections to the next layer
            if layer_idx < n_layers - 1:
                next_layer_size = layer_sizes[layer_idx + 1]
                next_x_pos = layer_positions[layer_idx + 1]
                next_y_positions = np.linspace(0.1, 0.9, next_layer_size)
                for next_neuron_idx in range(next_layer_size):
                    next_y_pos = next_y_positions[next_neuron_idx]
                    # Get the weight value
                    weight = nn.weights[layer_idx][neuron_idx, next_neuron_idx]
                    # Color by sign, transparency by magnitude
                    color = 'blue' if weight > 0 else 'red'
                    alpha = min(abs(weight) / 2, 0.8)
                    ax.plot([x_pos, next_x_pos], [y_pos, next_y_pos],
                            color=color, alpha=alpha, linewidth=0.5)

            # Draw the neuron
            circle = plt.Circle((x_pos, y_pos), 0.02,
                                color='darkblue', zorder=10)
            ax.add_patch(circle)

        # Add the layer label
        if layer_idx == 0:
            label = f'Input\n({layer_size} neurons)'
        elif layer_idx == n_layers - 1:
            label = f'Output\n({layer_size} neurons)'
        else:
            label = f'Hidden {layer_idx}\n({layer_size} neurons)'
        ax.text(x_pos, 0.02, label, ha='center', fontsize=10)

    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.set_aspect('equal')
    ax.axis('off')
    ax.set_title('Neural Network Architecture\n(Blue: positive weights, Red: negative weights)',
                 fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

visualize_network_architecture(nn)
6. Weight Visualization and Analysis¶
In [20]:
def visualize_weights(nn):
    """Visualize weight matrices as heatmaps."""
    n_weight_matrices = len(nn.weights)
    fig, axes = plt.subplots(1, n_weight_matrices, figsize=(4*n_weight_matrices, 4))
    if n_weight_matrices == 1:
        axes = [axes]

    for idx, (weights, ax) in enumerate(zip(nn.weights, axes)):
        im = ax.imshow(weights, cmap='coolwarm', aspect='auto')
        ax.set_title(f'Layer {idx+1} Weights\nShape: {weights.shape}')
        ax.set_xlabel('To neuron')
        ax.set_ylabel('From neuron')
        plt.colorbar(im, ax=ax)

    plt.suptitle('Weight Matrices Visualization', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

visualize_weights(nn)

# Weight statistics
print("\nWeight Statistics:")
for idx, weights in enumerate(nn.weights):
    print(f"Layer {idx+1}:")
    print(f"  Shape: {weights.shape}")
    print(f"  Mean: {weights.mean():.4f}")
    print(f"  Std: {weights.std():.4f}")
    print(f"  Min: {weights.min():.4f}")
    print(f"  Max: {weights.max():.4f}")
Weight Statistics:
Layer 1:
  Shape: (2, 16)
  Mean: -0.1511
  Std: 1.0939
  Min: -2.4372
  Max: 1.8965
Layer 2:
  Shape: (16, 8)
  Mean: 0.0186
  Std: 0.4826
  Min: -1.4957
  Max: 1.4920
Layer 3:
  Shape: (8, 1)
  Mean: 0.0808
  Std: 1.5586
  Min: -2.3508
  Max: 3.1191
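As a complement to the summary statistics, the sketch below (not in the original notebook) plots each layer's weight distribution as a histogram, reusing the trained `nn` from Section 3.
In [ ]:
# Sketch: histograms of the learned weights in each layer of `nn`
fig, axes = plt.subplots(1, len(nn.weights), figsize=(4 * len(nn.weights), 3))
for idx, (w, ax) in enumerate(zip(nn.weights, axes)):
    ax.hist(w.ravel(), bins=30, color='steelblue', edgecolor='black')
    ax.set_title(f'Layer {idx + 1} weights')
    ax.set_xlabel('Weight value')
    ax.set_ylabel('Count')
plt.tight_layout()
plt.show()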
7. Test on More Complex Dataset¶
In [21]:
# Generate a more complex dataset (circles)
X_complex, y_complex = make_circles(n_samples=1000, noise=0.1, factor=0.5, random_state=42)
y_complex = y_complex.reshape(-1, 1)
# Normalize
X_complex = StandardScaler().fit_transform(X_complex)
# Create deeper network
nn_complex = NeuralNetwork(
    layer_sizes=[2, 32, 16, 8, 1],  # Deeper network
    activation='relu',
    learning_rate=0.1
)
print("Training on circles dataset...")
nn_complex.train(X_complex, y_complex, epochs=150, batch_size=32, verbose=False)
print(f"Final accuracy: {nn_complex.accuracy_history[-1]:.4f}")
# Visualize results
plot_decision_boundary(X_complex, y_complex, nn_complex, "Complex Dataset Decision Boundary")
Training on circles dataset...
Final accuracy: 0.9850
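To make the value of depth concrete, here is a hedged sketch (not among the original experiments) that trains a much smaller [2, 4, 1] network on the same circles data; its final accuracy will typically trail the deeper model, though results depend on the random seed.
In [ ]:
# Sketch: a shallow baseline on the circles data, for comparison with the deeper network above
nn_shallow = NeuralNetwork(layer_sizes=[2, 4, 1], activation='relu', learning_rate=0.1)
nn_shallow.train(X_complex, y_complex, epochs=150, batch_size=32, verbose=False)
print(f"Shallow [2, 4, 1] network final accuracy: {nn_shallow.accuracy_history[-1]:.4f}")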
8. Activation Patterns Analysis¶
In [22]:
def analyze_activation_patterns(nn, X, n_samples=5):
    """Analyze and visualize activation patterns for sample inputs."""
    # Pick random samples
    indices = np.random.choice(X.shape[0], n_samples, replace=False)
    X_samples = X[indices]

    # Plot the activations of every layer for each sample
    fig, axes = plt.subplots(n_samples, len(nn.layer_sizes),
                             figsize=(3*len(nn.layer_sizes), 2*n_samples))

    for sample_idx, x in enumerate(X_samples):
        activations, _ = nn.forward_propagation(x.reshape(1, -1))
        for layer_idx, activation in enumerate(activations):
            ax = axes[sample_idx, layer_idx] if n_samples > 1 else axes[layer_idx]
            # Flatten the activation vector for plotting
            act_values = activation.flatten()
            # Bar plot of per-neuron activations; the axes autoscale, since ReLU
            # outputs are unbounded and standardized inputs can be negative
            ax.bar(range(len(act_values)), act_values, color='blue', alpha=0.7)
            ax.set_xlabel('Neuron')
            ax.set_ylabel('Activation')
            if sample_idx == 0:
                if layer_idx == 0:
                    ax.set_title('Input Layer')
                elif layer_idx == len(nn.layer_sizes) - 1:
                    ax.set_title('Output Layer')
                else:
                    ax.set_title(f'Hidden Layer {layer_idx}')

    plt.suptitle('Activation Patterns for Sample Inputs', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

analyze_activation_patterns(nn, X, n_samples=3)
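A related, hedged sketch (not in the original notebook): ReLU activations are often sparse, and we can quantify that by measuring the fraction of hidden units that output exactly zero across the whole moons dataset.
In [ ]:
# Sketch: fraction of inactive (zero-output) ReLU units per hidden layer, averaged over X
all_activations, _ = nn.forward_propagation(X)
for layer_idx, a in enumerate(all_activations[1:-1], start=1):
    print(f"Hidden layer {layer_idx}: {np.mean(a == 0):.1%} of activations are exactly zero")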
9. Compare Different Activation Functions¶
In [23]:
def compare_activation_functions(X, y):
    """Compare the performance of different activation functions."""
    activations = ['relu', 'sigmoid', 'tanh']
    results = {}

    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    for idx, activation in enumerate(activations):
        # Train a network with this activation
        nn = NeuralNetwork(
            layer_sizes=[2, 16, 8, 1],
            activation=activation,
            learning_rate=0.1
        )
        nn.train(X, y, epochs=100, batch_size=32, verbose=False)
        results[activation] = {
            'final_accuracy': nn.accuracy_history[-1],
            'final_loss': nn.loss_history[-1],
            'model': nn
        }

        # Plot the training curve
        axes[idx].plot(nn.accuracy_history, linewidth=2)
        axes[idx].set_xlabel('Epoch')
        axes[idx].set_ylabel('Accuracy')
        axes[idx].set_title(f'{activation.upper()} Activation')
        axes[idx].grid(True, alpha=0.3)
        axes[idx].set_ylim([0.5, 1.0])

    plt.suptitle('Activation Function Comparison', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()

    # Print the results
    print("\nActivation Function Performance:")
    print("-" * 50)
    for activation, metrics in results.items():
        print(f"{activation.upper():8s} - Accuracy: {metrics['final_accuracy']:.4f}, Loss: {metrics['final_loss']:.4f}")

compare_activation_functions(X, y)
Activation Function Performance:
--------------------------------------------------
RELU     - Accuracy: 1.0000, Loss: 0.0036
SIGMOID  - Accuracy: 0.8700, Loss: 0.2628
TANH     - Accuracy: 1.0000, Loss: 0.0052
10. Mini MNIST-like Example¶
In [24]:
# Create a simple digit classification problem (a scaled-down, MNIST-style task)
from sklearn.datasets import load_digits

# Load the digits dataset
digits = load_digits(n_class=5)  # only use the digits 0-4
X_digits = digits.data / 16.0    # normalize pixel values to [0, 1]
y_digits = digits.target

# Convert the labels to one-hot encoding
y_digits_onehot = np.zeros((y_digits.shape[0], 5))
y_digits_onehot[np.arange(y_digits.shape[0]), y_digits] = 1

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X_digits, y_digits_onehot, test_size=0.2, random_state=42
)

# Create a network for digit classification
nn_digits = NeuralNetwork(
    layer_sizes=[64, 32, 16, 5],  # 64 input features (8x8 pixels), 5 output classes
    activation='relu',
    learning_rate=0.1
)

print("Training on digits dataset (0-4)...")
nn_digits.train(X_train, y_train, epochs=200, batch_size=32, verbose=False)

# Test accuracy
predictions = nn_digits.predict(X_test)
true_labels = np.argmax(y_test, axis=1)
test_accuracy = np.mean(predictions == true_labels)
print(f"Test Accuracy: {test_accuracy:.4f}")

# Visualize some predictions
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i in range(10):
    ax = axes[i // 5, i % 5]
    # Reshape back to an 8x8 image
    image = X_test[i].reshape(8, 8)
    ax.imshow(image, cmap='gray')
    pred = predictions[i]
    true = true_labels[i]
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred}, True: {true}', color=color)
    ax.axis('off')
plt.suptitle('Digit Classification Results', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Plot the confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(true_labels, predictions)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix', fontweight='bold')
plt.show()
Training on digits dataset (0-4)...
Test Accuracy: 0.9945
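For a per-class breakdown beyond the confusion matrix, here is a short sketch using scikit-learn's classification_report on the same test predictions (`true_labels` and `predictions` from the cell above):
In [ ]:
# Sketch: per-class precision, recall, and F1 for the digits test set
from sklearn.metrics import classification_report
print(classification_report(true_labels, predictions, target_names=[str(d) for d in range(5)]))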
Summary and Conclusions¶
In this notebook, we've built a complete neural network from scratch using only NumPy. Key accomplishments:
What We Implemented:¶
- Forward Propagation: Matrix multiplication through layers with activation functions
- Backpropagation: Gradient calculation and weight updates
- Multiple Activation Functions: ReLU, Sigmoid, Tanh with their derivatives
- Mini-batch Gradient Descent: Efficient training on batches
- Different Loss Functions: Binary and categorical cross-entropy
Key Insights:¶
- Weight Initialization: He initialization for ReLU, Xavier for others
- Activation Functions: ReLU is a strong default for hidden layers; in our comparison, ReLU and tanh both reached perfect accuracy on the moons data while sigmoid converged more slowly
- Network Depth: Deeper networks can learn more complex patterns
- Learning Rate: Critical hyperparameter affecting convergence
Visualizations Created:¶
- Decision boundaries showing how the network separates classes
- Weight matrices revealing learned patterns
- Activation patterns showing information flow
- Training curves demonstrating convergence
Next Steps:¶
- Implement advanced optimizers (Adam, RMSprop); a minimal sketch of the Adam update follows this list
- Add dropout and batch normalization
- Implement convolutional layers
- Try on real-world datasets
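For the optimizer item above, here is a minimal, hedged sketch of the Adam update rule in NumPy; the function name `adam_step` is illustrative and is not part of the NeuralNetwork class implemented in this notebook.
In [ ]:
def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array `w` with gradient `grad`.

    `m` and `v` are running estimates of the first and second moments of the
    gradient, and `t` is the 1-based step counter. Returns (w, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v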
This implementation should help you understand the fundamental mathematics behind deep learning frameworks like TensorFlow and PyTorch!