In stochastic gradient descent (SGD), how is the gradient estimated at each iteration?
What is the key advantage of using mini-batch SGD over standard SGD?
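For reference, a minimal sketch of the distinction behind these two questions: standard SGD estimates the gradient from a single randomly drawn example, while mini-batch SGD averages the gradient over a small random batch, which reduces the variance of the estimate. The data `X`, `y`, and the model below are made-up placeholders, not from the original material.

```python
import torch

# Hypothetical data and model for illustration only.
X = torch.randn(1000, 20)          # 1000 examples, 20 features
y = torch.randn(1000, 1)
model = torch.nn.Linear(20, 1)
loss_fn = torch.nn.MSELoss()

# Standard SGD: gradient estimated from a single random example.
i = torch.randint(0, X.shape[0], (1,))
loss = loss_fn(model(X[i]), y[i])
loss.backward()                     # high-variance estimate from one example

# Mini-batch SGD: gradient averaged over a random batch of 32 examples.
model.zero_grad()
idx = torch.randperm(X.shape[0])[:32]
loss = loss_fn(model(X[idx]), y[idx])
loss.backward()                     # lower-variance gradient estimate
```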
Which of the following statements is true about the expected update step in stochastic gradient descent?
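The property this question tests is that the sampled gradient is an unbiased estimator of the full gradient, so the update step is correct in expectation. A sketch of the identity, assuming an index \(i\) drawn uniformly from \(n\) training examples:

```latex
\mathbb{E}_i\!\left[\nabla_\theta \ell_i(\theta)\right]
  = \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \ell_i(\theta)
  = \nabla_\theta L(\theta),
\qquad\text{hence}\qquad
\mathbb{E}\!\left[\theta_{t+1}\right]
  = \theta_t - \eta\,\nabla_\theta L(\theta_t).
```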
In multinomial logistic regression, what is the role of the softmax function (\(\gamma\))?
What is the Kullback-Leibler (KL) divergence used for in multinomial logistic regression?
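As an illustration of these two ingredients, here is a hedged sketch: softmax turns raw class scores into a probability distribution, and the KL divergence measures how far the predicted distribution is from the true label distribution (with a one-hot label it reduces to cross-entropy). The logits and label below are invented examples.

```python
import torch

def softmax(z: torch.Tensor) -> torch.Tensor:
    # Subtract the max for numerical stability, then exponentiate and
    # normalize so the outputs form a probability distribution.
    z = z - z.max()
    e = torch.exp(z)
    return e / e.sum()

logits = torch.tensor([2.0, 1.0, 0.1])   # hypothetical class scores
q = softmax(logits)                       # predicted class probabilities

p = torch.tensor([1.0, 0.0, 0.0])        # one-hot true label

# KL divergence D(p || q); terms with p = 0 contribute zero by convention.
kl = torch.where(p > 0, p * torch.log(p / q), torch.zeros_like(p)).sum()
print(kl)
```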
Which of the following is NOT a component of the forward pass in the backpropagation algorithm for multinomial logistic regression?
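For context, the forward pass this question refers to is typically: compute the linear scores (logits), apply the softmax, and evaluate the loss; backpropagation then works backward through those steps. A minimal sketch, with illustrative shapes and names (784 inputs as in flattened MNIST, 10 classes):

```python
import torch

W = torch.randn(10, 784, requires_grad=True)   # weights: 10 classes x 784 inputs
b = torch.zeros(10, requires_grad=True)        # per-class biases

x = torch.randn(784)                            # one flattened input vector
y = torch.tensor(3)                             # true class index

# Forward pass: linear scores -> softmax probabilities -> loss.
z = W @ x + b                                   # logits
q = torch.softmax(z, dim=0)                     # class probabilities
loss = -torch.log(q[y])                         # negative log-likelihood

# Backward pass (backpropagation): gradients of the loss w.r.t. W and b.
loss.backward()
```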
In the analysis of the MNIST dataset, what is the purpose of the Flatten layer in PyTorch?
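By way of illustration, `nn.Flatten` reshapes each 28×28 MNIST image into a 784-dimensional vector so it can be fed to a fully connected layer. A minimal sketch with a made-up batch:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Flatten(),           # (batch, 1, 28, 28) -> (batch, 784)
    nn.Linear(28 * 28, 10)  # 10 output classes for the MNIST digits
)

batch = torch.randn(64, 1, 28, 28)   # a fake batch of MNIST-shaped images
logits = model(batch)
print(logits.shape)                   # torch.Size([64, 10])
```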
What is the purpose of the zero_grad() method in PyTorch optimizers?
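For reference, `zero_grad()` clears the gradients left over from earlier `backward()` calls, since PyTorch accumulates gradients into `.grad` by default. A minimal training-step sketch; the model, loss, and data are placeholders:

```python
import torch

model = torch.nn.Linear(784, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.CrossEntropyLoss()

x = torch.randn(64, 784)              # hypothetical batch of flattened images
y = torch.randint(0, 10, (64,))       # hypothetical labels

optimizer.zero_grad()                 # clear gradients from the previous step;
                                      # otherwise backward() would accumulate
loss = loss_fn(model(x), y)
loss.backward()                       # compute fresh gradients
optimizer.step()                      # apply the parameter update
```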