Neural Networks - Post-Quiz

Time: 15 minutes
Questions: 10
Passing Score: 70%
Purpose: Validate your learning after completing Phase 5

Question 1 (Medium)

What is the output of the sigmoid activation function when input = 0?

A) 0.0
B) 0.5 ✓
C) 1.0
D) Undefined

Explanation

Answer: B) 0.5

Sigmoid function: σ(x) = 1 / (1 + e^(-x))

When x = 0:


σ(0) = 1 / (1 + e^0)
     = 1 / (1 + 1)
     = 1 / 2
     = 0.5

This is the midpoint of the sigmoid curve, which ranges from 0 to 1.
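
The calculation above can be checked numerically. This is a minimal sketch using only the standard library; the `sigmoid` helper and the variable names are illustrative and not part of the course code.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

mid = sigmoid(0.0)     # e^0 = 1, so this is 1 / 2
low = sigmoid(-10.0)   # approaches 0 for very negative inputs
high = sigmoid(10.0)   # approaches 1 for very positive inputs

print(mid, low, high)
```

The two extra inputs illustrate the 0-to-1 range: the output gets close to the bounds but never reaches them for finite x.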

Reference: Phase 6 - Activation Functions

Question 2 (Hard)


def forward_pass(X, W1, b1, W2, b2):
    Z1 = np.dot(X, W1) + b1
    A1 = relu(Z1)
    Z2 = np.dot(A1, W2) + b2
    A2 = sigmoid(Z2)
    return A2

In this 2-layer network, what does Z1 represent?

A) Activated output of first layer
B) Pre-activation output of first layer ✓
C) Input to the network
D) Final output

Explanation

Answer: B) Pre-activation output of first layer

Notation:

Z = Pre-activation (weighted sum + bias, before activation function)
A = Activation (after applying activation function)

So the sequence is:

Z1 = XW1 + b1 ← Pre-activation
A1 = ReLU(Z1) ← Activation
Z2 = A1W2 + b2 ← Pre-activation
A2 = Sigmoid(Z2) ← Final output
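
To see the Z/A sequence on concrete arrays, here is a small runnable sketch. The `relu` and `sigmoid` helpers, the function name `forward_pass_with_cache`, and the layer sizes (4 examples, 3 features, 5 hidden units, 1 output) are assumptions chosen for illustration; the quiz code does not define them.

```python
import numpy as np

def relu(z):
    # Element-wise max(0, z)
    return np.maximum(0, z)

def sigmoid(z):
    # Element-wise logistic function
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass_with_cache(X, W1, b1, W2, b2):
    """Same computation as the quiz snippet, but returns every intermediate."""
    Z1 = np.dot(X, W1) + b1    # weighted sum + bias, layer 1
    A1 = relu(Z1)              # layer 1 after ReLU
    Z2 = np.dot(A1, W2) + b2   # weighted sum + bias, layer 2
    A2 = sigmoid(Z2)           # layer 2 after sigmoid
    return Z1, A1, Z2, A2

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))    # rows are examples (illustrative sizes)
W1 = rng.normal(size=(3, 5))
b1 = np.zeros((1, 5))
W2 = rng.normal(size=(5, 1))
b2 = np.zeros((1, 1))

Z1, A1, Z2, A2 = forward_pass_with_cache(X, W1, b1, W2, b2)
print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```

Z1 can contain negative values while A1 cannot, which is a quick way to tell the two apart when debugging.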

Reference: Phase 5 - Forward Propagation

Question 3 (Medium)

Why is the ReLU activation function preferred over sigmoid in hidden layers?

A) It’s easier to compute
B) It mitigates the vanishing gradient problem ✓
C) It always outputs positive values
D) It’s more accurate

Explanation

Answer: B) It mitigates the vanishing gradient problem

ReLU advantages:

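Gradient is exactly 1 for positive inputs and 0 otherwise, so it does not shrink as it passes through active units
Cheaper to compute (no exponential), which is a secondary benefit rather than the main reason
Mitigates vanishing gradients in deep networks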

ReLU: f(x) = max(0, x)
Gradient: f’(x) = 1 if x > 0, else 0

Sigmoid problems:

Gradient saturates (approaches 0) for large |x| and never exceeds 0.25, its value at x = 0
Causes vanishing gradients in deep networks
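
The contrast can be made concrete with a short sketch. It compares the two derivatives at a few inputs and then multiplies the best-case per-layer factor across 10 layers. The depth of 10 and the helper names are assumptions for illustration, and the chain product deliberately ignores the weight matrices, which also scale gradients in a real network.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # d/dz sigmoid(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    # 1 where z > 0, else 0 (using the common convention of 0 at z = 0)
    return (z > 0).astype(float)

z = np.array([-10.0, 0.0, 10.0])
sig_g = sigmoid_grad(z)    # tiny at both ends, 0.25 in the middle
relu_g = relu_grad(z)      # 0, 0, 1

# Best case for sigmoid: every layer sits at z = 0, where the derivative peaks.
depth = 10
sigmoid_chain = 0.25 ** depth   # activation-derivative factor after 10 layers
relu_chain = 1.0 ** depth       # active ReLU units pass the gradient unchanged

print(sig_g, relu_g, sigmoid_chain, relu_chain)
```

Even in the best case, the sigmoid factor after 10 layers is below one millionth, while active ReLU units contribute a factor of exactly 1.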

Reference: Phase 6 - Activation Functions

Question 4 (Hard)

What is the derivative of the ReLU function at x = 0?

A) 0
B) 1
C) 0.5
D) Technically undefined, but set to 0 in practice ✓

Explanation

Answer: D) Technically undefined, but set to 0 in practice

ReLU: f(x) = max(0, x)

Derivative:

f’(x) = 1 if x > 0
f’(x) = 0 if x < 0
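f’(x) = undefined at x = 0 (a kink: ReLU is continuous there, but its left slope is 0 and its right slope is 1)

In practice: We set f’(0) = 0 (any value in [0, 1] is a valid subgradient), which works well in gradient descent because inputs rarely land exactly on 0.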
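
A small sketch shows both the convention and why the derivative is undefined at 0. The `at_zero` parameter is hypothetical and exists only to make the chosen convention explicit; it is not an argument of any library function.

```python
import numpy as np

def relu(x):
    return max(0.0, x)

def relu_grad(z, at_zero=0.0):
    """ReLU derivative with an explicit (hypothetical) choice for z == 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, at_zero))

z = np.array([-2.0, 0.0, 3.0])
g_default = relu_grad(z)              # convention f'(0) = 0
g_alt = relu_grad(z, at_zero=1.0)     # another valid subgradient choice

# One-sided difference quotients at 0 disagree, so no single derivative exists.
h = 1e-6
right_slope = (relu(h) - relu(0.0)) / h    # slope approaching from the right
left_slope = (relu(0.0) - relu(-h)) / h    # slope approaching from the left

print(g_default, g_alt, right_slope, left_slope)
```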

Reference: Phase 6 - Activation Function Derivatives

Question 5 (Medium)

In gradient descent, weights are updated using:

A) W = W + learning_rate * gradient
B) W = W - learning_rate * gradient ✓
C) W = W * learning_rate * gradient
D) W = W / learning_rate * gradient

Explanation

Answer: B) W = W - learning_rate * gradient

Gradient Descent Update Rule:


W_new = W_old - α * ∂L/∂W

Where:

α = learning rate
∂L/∂W = gradient of loss w.r.t. weight

Why subtract? The gradient points in the direction of steepest increase in loss. Stepping in the opposite direction decreases the loss, provided the learning rate is small enough.
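
The effect of the sign can be seen on a one-parameter toy problem. This sketch assumes a made-up loss L(w) = (w - 3)^2 and a learning rate of 0.1; neither comes from the course material.

```python
def loss(w):
    # Toy loss with its minimum at w = 3 (illustrative assumption)
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw for the toy loss
    return 2.0 * (w - 3.0)

lr = 0.1

# Correct rule: subtract the gradient.
w = 0.0
history = [loss(w)]
for _ in range(50):
    w = w - lr * grad(w)
    history.append(loss(w))

# Wrong sign: adding the gradient climbs the loss instead.
w_bad = 0.0
for _ in range(5):
    w_bad = w_bad + lr * grad(w_bad)

print(w, history[-1], w_bad, loss(w_bad))
```

With subtraction, w moves toward 3 and the loss falls at every step; with addition, w moves away from 3 and the loss grows.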

Reference: Phase 5 - Gradient Descent

Question 6 (Hard)


def backprop_step(dZ, A_prev, W):
    m = A_prev.shape[0]
    dW = (1/m) * np.dot(A_prev.T, dZ)
    db = (1/m) * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ, W.T)
    return dW, db, dA_prev

What does dW represent?

A) Change in weights
B) Gradient of loss with respect to weights ✓
C) New weight values
D) Weight updates after learning rate

Explanation

Answer: B) Gradient of loss with respect to weights

Backpropagation calculates:

dW = ∂L/∂W (gradient of loss w.r.t. weights)
db = ∂L/∂b (gradient of loss w.r.t. biases)
dA_prev = ∂L/∂A_prev (gradient to pass to previous layer)

The actual weight update is:


W = W - learning_rate * dW
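
One way to convince yourself that dW really is ∂L/∂W is a finite-difference check. The sketch below reuses `backprop_step` from the question and assumes a single linear layer with a mean squared-error loss, so that dZ = Z - Y per example (without the 1/m factor). The loss, the array sizes, and the name `loss_fn` are illustrative assumptions.

```python
import numpy as np

def backprop_step(dZ, A_prev, W):
    m = A_prev.shape[0]
    dW = (1 / m) * np.dot(A_prev.T, dZ)
    db = (1 / m) * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ, W.T)
    return dW, db, dA_prev

def loss_fn(W, b, A_prev, Y):
    # Assumed loss: mean over examples of 0.5 * ||z - y||^2
    Z = A_prev @ W + b
    return 0.5 * np.mean(np.sum((Z - Y) ** 2, axis=1))

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(6, 3))   # 6 examples, 3 inputs
Y = rng.normal(size=(6, 2))        # 2 outputs
W = rng.normal(size=(3, 2))
b = np.zeros((1, 2))

dZ = (A_prev @ W + b) - Y          # per-example dL/dZ for the assumed loss
dW, db, dA_prev = backprop_step(dZ, A_prev, W)

# Central finite differences: nudge each weight and measure the loss change.
eps = 1e-6
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus = W.copy()
        W_plus[i, j] += eps
        W_minus = W.copy()
        W_minus[i, j] -= eps
        dW_num[i, j] = (loss_fn(W_plus, b, A_prev, Y)
                        - loss_fn(W_minus, b, A_prev, Y)) / (2 * eps)

max_error = np.max(np.abs(dW - dW_num))
print(max_error)
```

A tiny `max_error` indicates that the analytic dW matches the measured slope of the loss, which is what "gradient of loss with respect to weights" means.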

Reference: Phase 5 - Backpropagation Implementation

Question 7 (Medium)

What is the purpose of dividing by m (batch size) in gradient calculation?

A) To speed up computation
B) To average gradients across the batch ✓
C) To normalize weights
D) To prevent overflow

Explanation

Answer: B) To average gradients across the batch

When training on a batch of m examples, we calculate the average gradient across all examples:


dW = (1/m) * Σ(gradient_for_each_example)

Why average?

Makes gradients independent of batch size
Ensures consistent learning rate effect
Gives a lower-variance gradient estimate than any single example (more stable training)
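
The averaging can be verified directly. This sketch assumes random stand-in arrays for `A_prev` and `dZ` (8 examples, 3 inputs, 2 outputs); it shows that the vectorized formula equals the mean of per-example gradients, and that a summed gradient scales with batch size while an averaged one does not.

```python
import numpy as np

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(8, 3))   # illustrative batch of 8 examples
dZ = rng.normal(size=(8, 2))
m = A_prev.shape[0]

# Per-example gradient is the outer product of that example's input and dZ row.
per_example = [np.outer(A_prev[i], dZ[i]) for i in range(m)]
dW_loop = sum(per_example) / m
dW_vec = (1 / m) * np.dot(A_prev.T, dZ)

# Duplicate the batch: same data distribution, twice the batch size.
A_big = np.vstack([A_prev, A_prev])
dZ_big = np.vstack([dZ, dZ])
summed_small = np.dot(A_prev.T, dZ)
summed_big = np.dot(A_big.T, dZ_big)                   # doubles with batch size
averaged_big = np.dot(A_big.T, dZ_big) / A_big.shape[0]  # unchanged

print(np.allclose(dW_loop, dW_vec))
```

Because the averaged gradient does not grow with m, the same learning rate behaves comparably across batch sizes.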

Reference: Phase 5 - Mini-Batch Gradient Descent

Question 8 (Hard)

Which initialization strategy is best for deep networks with ReLU?

A) All zeros
B) All ones
C) Random small values from N(0, 0.01)
D) He initialization ✓

Explanation

Answer: D) He initialization

He Initialization:


W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)

Why it works:

Designed for ReLU activation
Maintains variance across layers
Prevents vanishing/exploding gradients

Wrong answers:

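All zeros: Every neuron in a layer computes the same output and receives the same gradient, so they learn the same features (symmetry problem)
All ones: Same symmetry problem, plus large pre-activations that can saturate or blow up
N(0, 0.01): Too small for deep networks; activations and gradients shrink layer by layer (vanishing gradients)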

For sigmoid/tanh: Use Xavier initialization instead
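
The difference between He initialization and small random values can be observed by pushing random data through a stack of ReLU layers. This is a rough sketch under stated assumptions: 10 layers of width 256, a batch of 512 standard-normal inputs, no biases, and N(0, 0.01) read as a standard deviation of 0.01. The helper `final_mean_square` is hypothetical.

```python
import numpy as np

def final_mean_square(std_fn, n_layers=10, width=256, batch=512, seed=0):
    """Mean squared activation after n_layers of ReLU layers.

    std_fn maps n_in to the standard deviation used for that layer's weights.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((batch, width))
    for _ in range(n_layers):
        W = rng.standard_normal((width, width)) * std_fn(width)
        A = np.maximum(0, A @ W)   # linear layer followed by ReLU
    return float(np.mean(A ** 2))

he_ms = final_mean_square(lambda n_in: np.sqrt(2 / n_in))   # He initialization
small_ms = final_mean_square(lambda n_in: 0.01)             # fixed small std

print(he_ms, small_ms)
```

Under these assumptions the He-initialized signal stays on the order of its input scale, while the small-std signal collapses toward zero after a few layers.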

Reference: Phase 5 - Weight Initialization

Question 9 (Medium)

What is “one epoch” in neural network training?

A) One forward pass
B) One backward pass
C) One complete pass through the entire training dataset ✓
D) One weight update

Explanation

Answer: C) One complete pass through the entire training dataset

Training terminology:

Iteration: One forward + backward pass and one weight update on one batch
Epoch: Complete pass through all training data
Batch: Subset of training data processed together

Example:

1000 training samples, batch size = 100
1 epoch = 10 iterations (1000/100)
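
The same arithmetic in code, including the case where the batch size does not divide the dataset evenly. The batch size of 64 in the second part is an extra illustrative value, not from the example above.

```python
import math

n_samples = 1000
batch_size = 100

# Closed form: number of batches needed to cover the dataset once.
iterations_per_epoch = math.ceil(n_samples / batch_size)

# Same count by walking through the dataset one batch at a time.
iterations = 0
for start in range(0, n_samples, batch_size):
    iterations += 1

# Uneven split: the final batch is smaller than the rest.
uneven_iterations = math.ceil(n_samples / 64)
last_batch_size = n_samples - 64 * (uneven_iterations - 1)

print(iterations_per_epoch, iterations, uneven_iterations, last_batch_size)
```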

Reference: Phase 5 - Training Process

Question 10 (Hard)


# Training loop
for epoch in range(100):
    # Forward pass
    A = forward_pass(X, W1, b1, W2, b2)
    loss = compute_loss(A, Y)
    
    # Backward pass
    dW1, db1, dW2, db2 = backprop(X, Y, A, W1, b1, W2, b2)
    
    # Update weights
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2

This implements which optimization algorithm?

A) Stochastic Gradient Descent (SGD)
B) Batch Gradient Descent ✓
C) Mini-Batch Gradient Descent
D) Adam

Explanation

Answer: B) Batch Gradient Descent

Clue: forward_pass(X, ...) processes the entire dataset X at once, and the weights are updated only once per epoch.

Gradient Descent variants:

Batch GD:

Uses entire dataset for each update
Exact gradient of the training loss, but each update is slow on large datasets
What’s shown in the code

Stochastic GD (SGD):

Uses one example at a time
Fast but noisy updates

Mini-Batch GD:

Uses small batches (e.g., 32, 64, 128)
Good balance: fast + stable
Most commonly used in practice

Adam:

Adaptive per-parameter learning rates
Tracks running averages of the gradient and its square (not present in the code)
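
The first three variants differ only in how many examples feed each update. The sketch below makes that explicit with a hypothetical `iterate_batches` helper and an assumed dataset of 256 examples: a batch size equal to the dataset gives Batch GD, a batch size of 1 gives SGD, and anything in between gives Mini-Batch GD.

```python
import numpy as np

def iterate_batches(n_samples, batch_size, rng):
    """Yield index arrays that together cover the dataset once, in shuffled order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
n_samples = 256   # illustrative dataset size

# Each yielded batch corresponds to one weight update.
updates_per_epoch = {
    "batch": sum(1 for _ in iterate_batches(n_samples, n_samples, rng)),
    "sgd": sum(1 for _ in iterate_batches(n_samples, 1, rng)),
    "mini_batch": sum(1 for _ in iterate_batches(n_samples, 32, rng)),
}

# Every example is visited exactly once per epoch regardless of batch size.
seen = np.sort(np.concatenate(list(iterate_batches(n_samples, 32, rng))))

print(updates_per_epoch)
```

In a training loop, each index array would select `X[idx]` and `Y[idx]` for one forward pass, backward pass, and update.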

Reference: Phase 5 - Optimization Algorithms

Self-Check Guide

0-5 correct: Review Phase 5 content more carefully. Focus on:

Forward and backward propagation
Activation functions and their derivatives
Weight initialization and updates

6-7 correct: Good progress. Review the questions you missed and practice implementing a neural network from scratch.

8-9 correct: Strong understanding. Practice on real datasets to reinforce concepts.

10 correct: Excellent grasp of neural network fundamentals. You’re ready for more advanced topics.

Compare Your Scores

Pre-Quiz Score: ___ / 10
Post-Quiz Score: ___ / 10
Improvement: +___ points

Typical improvement: Several questions
Strong improvement: Most of the quiz

Next Steps

If several concepts still feel shaky:
- Re-watch Phase 5 videos
- Redo the assignment
- Work through challenges
- Implement a neural network from scratch

If most concepts now feel comfortable:
- ✅ Move to Phase 6 (Advanced topics)
- Try the optional stretch challenges
- Build a project using neural networks

If the quiz felt easy end to end:
- Mentor others in the community
- Contribute to the repository
- Explore research papers on neural architectures

Congratulations on completing Phase 5! 🎉🧠