Neural Networks - Post-Quiz
Time: 15 minutes
Questions: 10
Passing Score: 70%
Purpose: Validate your learning after completing Phase 5
Question 1 (Medium)
What is the output of the sigmoid activation function when input = 0?
A) 0.0
B) 0.5 ✓
C) 1.0
D) Undefined
Explanation
Answer: B) 0.5
Sigmoid function: σ(x) = 1 / (1 + e^(-x))
When x = 0:
σ(0) = 1 / (1 + e^0)
= 1 / (1 + 1)
= 1 / 2
= 0.5
This is the midpoint of the sigmoid curve, which ranges from 0 to 1.
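The arithmetic above can be checked numerically. This is a minimal sketch that assumes NumPy is available; the helper name `sigmoid` and the sample inputs are illustrative choices, not part of the course code.

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

# At x = 0 the exponential is e^0 = 1, so the result is 1 / (1 + 1).
value_at_zero = sigmoid(0.0)

# Far from zero the curve flattens toward its limits of 0 and 1.
values = sigmoid(np.array([-10.0, 0.0, 10.0]))
print(value_at_zero, values)
```

Large negative inputs approach 0 and large positive inputs approach 1, with 0.5 sitting exactly in between.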
Reference: Phase 6 - Activation Functions
Question 2 (Hard)
def forward_pass(X, W1, b1, W2, b2):
    Z1 = np.dot(X, W1) + b1
    A1 = relu(Z1)
    Z2 = np.dot(A1, W2) + b2
    A2 = sigmoid(Z2)
    return A2
In this 2-layer network, what does Z1 represent?
A) Activated output of first layer
B) Pre-activation output of first layer ✓
C) Input to the network
D) Final output
Explanation
Answer: B) Pre-activation output of first layer
Notation:
- Z = Pre-activation (weighted sum + bias, before activation function)
- A = Activation (after applying activation function)
So the sequence is:
Z1 = XW1 + b1       ← Pre-activation
A1 = ReLU(Z1)       ← Activation
Z2 = A1W2 + b2      ← Pre-activation
A2 = Sigmoid(Z2)    ← Final output
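To see the Z/A distinction on actual arrays, here is a hedged sketch of the same forward pass that also returns the intermediate values. The function name `forward_pass_with_cache`, the helper definitions, and the layer sizes (3 inputs, 5 hidden units, 1 output) are assumptions made for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass_with_cache(X, W1, b1, W2, b2):
    Z1 = np.dot(X, W1) + b1   # pre-activation, layer 1
    A1 = relu(Z1)             # activation, layer 1
    Z2 = np.dot(A1, W2) + b2  # pre-activation, layer 2
    A2 = sigmoid(Z2)          # final output
    return Z1, A1, Z2, A2

# Illustrative shapes: 4 examples, 3 features, 5 hidden units, 1 output.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W1, b1 = rng.normal(size=(3, 5)), np.zeros((1, 5))
W2, b2 = rng.normal(size=(5, 1)), np.zeros((1, 1))

Z1, A1, Z2, A2 = forward_pass_with_cache(X, W1, b1, W2, b2)
print(Z1.shape, A1.shape, Z2.shape, A2.shape)
```

Z1 can contain negative entries, while A1 is the same array with those entries clipped to zero.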
Reference: Phase 5 - Forward Propagation
Question 3 (Medium)
Why is the ReLU activation function preferred over sigmoid in hidden layers?
A) It’s easier to compute
B) It mitigates the vanishing gradient problem ✓
C) It always outputs positive values
D) It’s more accurate
Explanation
Answer: B) It mitigates the vanishing gradient problem
ReLU advantages:
- Gradient is exactly 1 for positive inputs (it doesn’t shrink like sigmoid’s, which never exceeds 0.25)
- Cheaper to compute (no exponential), which speeds up training
- Mitigates vanishing gradients in deep networks
ReLU: f(x) = max(0, x)
Gradient: f’(x) = 1 if x > 0, else 0
Sigmoid problems:
- Gradient saturates (very small) for large |x|
- Causes vanishing gradients in deep networks
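The comparison above can be made concrete with a few numbers. This sketch looks only at the activation derivatives and ignores the weight matrices that also multiply the gradient during backpropagation, so it is a simplified picture; the depth of 10 is an arbitrary illustrative choice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    return np.where(np.asarray(x) > 0, 1.0, 0.0)

peak_sigmoid_grad = sigmoid_grad(0.0)   # the largest value the derivative takes
saturated_grad = sigmoid_grad(10.0)     # nearly zero once the input is large

# Product of activation derivatives across 10 layers (weights ignored).
depth = 10
best_case_sigmoid_factor = peak_sigmoid_grad ** depth
relu_factor = float(relu_grad(2.0) ** depth)

print(peak_sigmoid_grad, saturated_grad, best_case_sigmoid_factor, relu_factor)
```

Even in sigmoid's best case the factor shrinks geometrically with depth, while an active ReLU unit passes the gradient through unchanged.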
Reference: Phase 6 - Activation Functions
Question 4 (Hard)
What is the derivative of the ReLU function at x = 0?
A) 0
B) 1
C) 0.5
D) Technically undefined, but set to 0 in practice ✓
Explanation
Answer: D) Technically undefined, but set to 0 in practice
ReLU: f(x) = max(0, x)
Derivative:
- f’(x) = 1 if x > 0
- f’(x) = 0 if x < 0
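- f’(x) is undefined at x = 0 (ReLU is continuous there, but its left and right slopes, 0 and 1, differ)
In practice: We set f’(0) = 0 (some implementations use 1; any value in [0, 1] is a valid subgradient), which works well in gradient descent.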
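A short sketch of this convention follows, assuming NumPy. The helper `relu_grad` and the step size `h` are illustrative; the one-sided slopes show why no single derivative exists at 0.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Convention used here: the derivative at exactly x == 0 is taken to be 0.
    return (x > 0).astype(float)

x = np.array([-2.0, 0.0, 3.0])
grads = relu_grad(x)

# One-sided finite-difference slopes at x = 0 disagree.
h = 1e-6
left_slope = (relu(0.0) - relu(-h)) / h
right_slope = (relu(h) - relu(0.0)) / h
print(grads, left_slope, right_slope)
```

Because an input of exactly 0 is rare in floating-point training, the choice of convention seldom matters in practice.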
Reference: Phase 6 - Activation Function Derivatives
Question 5 (Medium)
In gradient descent, weights are updated using:
A) W = W + learning_rate * gradient
B) W = W - learning_rate * gradient ✓
C) W = W * learning_rate * gradient
D) W = W / learning_rate * gradient
Explanation
Answer: B) W = W - learning_rate * gradient
Gradient Descent Update Rule:
W_new = W_old - α * ∂L/∂W
Where:
- α = learning rate
- ∂L/∂W = gradient of loss w.r.t. weight
Why subtract? Gradient points in direction of increasing loss. We want to go in the opposite direction (decreasing loss).
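As a small illustration of the update rule, the sketch below minimizes a one-dimensional loss. The loss L(w) = (w - 3)^2, the learning rate of 0.1, and the 50 steps are assumptions chosen so the behaviour is easy to follow.

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    # dL/dw for the quadratic loss above
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0
history = [loss(w)]
for _ in range(50):
    w = w - learning_rate * grad(w)   # step against the gradient
    history.append(loss(w))

# A single step with the wrong sign moves away from the minimum.
w_wrong = 0.0 + learning_rate * grad(0.0)
print(w, history[0], history[-1], loss(w_wrong))
```

Subtracting the gradient drives w toward the minimizer at 3, whereas adding it increases the loss.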
Reference: Phase 5 - Gradient Descent
Question 6 (Hard)
def backprop_step(dZ, A_prev, W):
    m = A_prev.shape[0]
    dW = (1/m) * np.dot(A_prev.T, dZ)
    db = (1/m) * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ, W.T)
    return dW, db, dA_prev
What does dW represent?
A) Change in weights
B) Gradient of loss with respect to weights ✓
C) New weight values
D) Weight updates after learning rate
Explanation
Answer: B) Gradient of loss with respect to weights
Backpropagation calculates:
dW = ∂L/∂W             (gradient of loss w.r.t. weights)
db = ∂L/∂b             (gradient of loss w.r.t. biases)
dA_prev = ∂L/∂A_prev   (gradient to pass to previous layer)
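One way to convince yourself that dW really is ∂L/∂W is a numerical gradient check. The sketch below assumes a single linear layer with a mean squared error loss, L = (1/(2m)) Σ (Z - Y)², so that the per-example dZ is Z - Y; the loss, the shapes, and the names `loss_fn` and `dW_numeric` are illustrative assumptions, not part of the quiz code.

```python
import numpy as np

def backprop_step(dZ, A_prev, W):
    m = A_prev.shape[0]
    dW = (1 / m) * np.dot(A_prev.T, dZ)
    db = (1 / m) * np.sum(dZ, axis=0, keepdims=True)
    dA_prev = np.dot(dZ, W.T)
    return dW, db, dA_prev

def loss_fn(A_prev, W, b, Y):
    Z = np.dot(A_prev, W) + b
    return np.sum((Z - Y) ** 2) / (2 * A_prev.shape[0])

rng = np.random.default_rng(0)
A_prev = rng.normal(size=(6, 4))
W = rng.normal(size=(4, 3))
b = np.zeros((1, 3))
Y = rng.normal(size=(6, 3))

# Analytic gradient from the backprop formulas.
dZ = np.dot(A_prev, W) + b - Y
dW, db, dA_prev = backprop_step(dZ, A_prev, W)

# Numerical gradient: nudge each weight and measure the change in loss.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        dW_numeric[i, j] = (
            loss_fn(A_prev, W_plus, b, Y) - loss_fn(A_prev, W_minus, b, Y)
        ) / (2 * eps)

max_abs_diff = np.max(np.abs(dW - dW_numeric))
print(max_abs_diff)
```

dW has the same shape as W, and each entry says how the loss changes when that one weight is nudged.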
The actual weight update is:
W = W - learning_rate * dW
Reference: Phase 5 - Backpropagation Implementation
Question 7 (Medium)
What is the purpose of dividing by m (batch size) in gradient calculation?
A) To speed up computation
B) To average gradients across the batch ✓
C) To normalize weights
D) To prevent overflow
Explanation
Answer: B) To average gradients across the batch
When training on a batch of m examples, we calculate the average gradient across all examples:
dW = (1/m) * Σ(gradient_for_each_example)
Why average?
- Keeps the gradient's magnitude independent of batch size
- Lets the same learning rate behave consistently when the batch size changes
- Gives a less noisy gradient estimate than a single example would (more stable training)
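The effect of the 1/m factor can be seen by comparing summed and averaged gradients at two batch sizes. This is a sketch on synthetic data; the linear model, the `true_w` values, and the batch sizes of 100 and 10000 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([[2.0], [-1.0]])   # hypothetical target weights
w = np.zeros((2, 1))                 # current weights

def batch_gradients(m):
    X = rng.normal(size=(m, 2))
    Y = X @ true_w
    dZ = X @ w - Y                   # per-example error
    summed = X.T @ dZ                # grows with the number of examples
    averaged = summed / m            # scale stays comparable across batch sizes
    return summed, averaged

sum_small, avg_small = batch_gradients(100)
sum_large, avg_large = batch_gradients(10000)

norm = np.linalg.norm
print(norm(sum_small), norm(sum_large))   # very different magnitudes
print(norm(avg_small), norm(avg_large))   # similar magnitudes
```

With summed gradients, a learning rate tuned for one batch size would be far too large or too small for another.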
Reference: Phase 5 - Mini-Batch Gradient Descent
Question 8 (Hard)
Which initialization strategy is best for deep networks with ReLU?
A) All zeros
B) All ones
C) Random small values from N(0, 0.01)
D) He initialization ✓
Explanation
Answer: D) He initialization
He Initialization:
W = np.random.randn(n_in, n_out) * np.sqrt(2 / n_in)
Why it works:
- Designed for ReLU activation
- Maintains variance across layers
- Helps prevent vanishing/exploding gradients
Wrong answers:
- All zeros: Neurons learn same features (symmetry problem)
- All ones: Same symmetry problem, plus very large pre-activations that grow with layer width
- N(0, 0.01): Too small for deep networks (activations and gradients shrink layer by layer)
For sigmoid/tanh: Use Xavier initialization instead
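The difference between He initialization and small random values shows up when activations are pushed through many ReLU layers. In this sketch, the depth of 20, the width of 256, and reading N(0, 0.01) as a standard deviation of 0.01 are all assumptions; the function name `final_activation_scale` is hypothetical.

```python
import numpy as np

def final_activation_scale(init, n_layers=20, width=256, seed=0):
    """Return the RMS activation after n_layers of ReLU layers (no biases)."""
    rng = np.random.default_rng(seed)
    A = rng.normal(size=(512, width))
    for _ in range(n_layers):
        if init == "he":
            W = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)
        else:
            # "small": standard deviation 0.01 (an assumed reading of N(0, 0.01))
            W = rng.normal(size=(width, width)) * 0.01
        A = np.maximum(0.0, A @ W)
    return float(np.sqrt(np.mean(A ** 2)))

he_scale = final_activation_scale("he")
small_scale = final_activation_scale("small")
print(he_scale, small_scale)
```

With He scaling the signal keeps roughly its original magnitude, while with the small fixed scale it collapses toward zero, leaving almost nothing for gradients to act on.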
Reference: Phase 5 - Weight Initialization
Question 9 (Medium)
What is “one epoch” in neural network training?
A) One forward pass
B) One backward pass
C) One complete pass through the entire training dataset ✓
D) One weight update
Explanation
Answer: C) One complete pass through the entire training dataset
Training terminology:
- Iteration: One forward + backward pass on one batch
- Epoch: Complete pass through all training data
- Batch: Subset of training data processed together
Example:
- 1000 training samples, batch size = 100
- 1 epoch = 10 iterations (1000/100)
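The same bookkeeping can be written as a small helper. This sketch assumes a final partial batch still counts as one iteration (some training loops drop it instead); the function names are illustrative.

```python
import math

def iterations_per_epoch(n_samples, batch_size):
    # Round up so a leftover partial batch counts as one iteration.
    return math.ceil(n_samples / batch_size)

def total_iterations(n_samples, batch_size, n_epochs):
    return iterations_per_epoch(n_samples, batch_size) * n_epochs

print(iterations_per_epoch(1000, 100))   # the example above
print(total_iterations(1000, 100, 5))    # five epochs of the same setup
```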
Reference: Phase 5 - Training Process
Question 10 (Hard)
# Training loop
for epoch in range(100):
    # Forward pass
    A = forward_pass(X, W1, b1, W2, b2)
    loss = compute_loss(A, Y)
    # Backward pass
    dW1, db1, dW2, db2 = backprop(X, Y, A, W1, b1, W2, b2)
    # Update weights
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
This implements which optimization algorithm?
A) Stochastic Gradient Descent (SGD)
B) Batch Gradient Descent ✓
C) Mini-Batch Gradient Descent
D) Adam
Explanation
Answer: B) Batch Gradient Descent
Clue: forward_pass(X, ...) processes the entire dataset X at once, and the weights are updated only once per pass.
Gradient Descent variants:
Batch GD:
- Uses entire dataset for each update
- Most accurate gradients, but slow
- What’s shown in the code
Stochastic GD (SGD):
- Uses one example at a time
- Fast but noisy updates
Mini-Batch GD:
- Uses small batches (e.g., 32, 64, 128)
- Good balance: fast + stable
- Most commonly used in practice
Adam:
- Adaptive learning rates
- Requires running estimates of the gradient's first and second moments (not shown)
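The three gradient descent variants differ mainly in how many examples feed each update. The sketch below uses a hypothetical `iterate_batches` helper and synthetic data to count the updates per epoch for each choice of batch size; it does not perform any training.

```python
import numpy as np

def iterate_batches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs covering the data once."""
    n = X.shape[0]
    order = rng.permutation(n)
    for start in range(0, n, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = rng.normal(size=(1000, 1))

# Each yielded batch would correspond to one weight update.
updates_per_epoch = {
    "batch": sum(1 for _ in iterate_batches(X, Y, len(X), rng)),
    "mini-batch": sum(1 for _ in iterate_batches(X, Y, 64, rng)),
    "stochastic": sum(1 for _ in iterate_batches(X, Y, 1, rng)),
}
print(updates_per_epoch)
```

Setting the batch size to the dataset size recovers Batch GD, and setting it to 1 recovers SGD, so one loop can express all three.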
Reference: Phase 5 - Optimization Algorithms
Self-Check Guide
0-5 correct: Review Phase 5 content more carefully. Focus on:
- Forward and backward propagation
- Activation functions and their derivatives
- Weight initialization and updates
6-7 correct: Good progress. Review the questions you missed and practice implementing a neural network from scratch.
8-9 correct: Strong understanding. Practice on real datasets to reinforce concepts.
10 correct: Excellent grasp of neural network fundamentals. You’re ready for more advanced topics.
Compare Your Scores
Pre-Quiz Score: ___ / 10
Post-Quiz Score: ___ / 10
Improvement: +___ points
Typical improvement: Several questions
Strong improvement: Most of the quiz
Next Steps
- If several concepts still feel shaky:
  - Re-watch Phase 5 videos
  - Redo the assignment
  - Work through challenges
  - Implement a neural network from scratch
- If most concepts now feel comfortable:
  - ✅ Move to Phase 6 (Advanced topics)
  - Try the optional stretch challenges
  - Build a project using neural networks
- If the quiz felt easy end to end:
  - Mentor others in the community
  - Contribute to the repository
  - Explore research papers on neural architectures
Congratulations on completing Phase 5! 🎉🧠