1. Credit: https://commons.wikimedia.org/wiki/File:Transformer,_full_architecture.png

  2. Although time-consuming, this is in fact the best way to study a mathematical text, whether the proofs are new to you or not.

  3. https://www.kaggle.com/datasets/harrywang/housing

  4. Check that you can derive it for yourself!

  5. “Deep learning” refers to the use of very large artificial neural networks for machine learning, an approach which rose to prominence at the beginning of the 2010s and enabled the ongoing revolution in machine learning and artificial intelligence.

  6. Check that we then indeed have \((\Sigma_{X,X}^{-\frac{1}{2}})^{2}=\Sigma_{X,X}^{-1}.\)
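  One way to verify this, assuming (as is standard) that \(\Sigma_{X,X}\) is symmetric positive definite and that \(\Sigma_{X,X}^{-\frac{1}{2}}\) is defined via an eigendecomposition \(\Sigma_{X,X}=U\Lambda U^{T}\) as \(U\Lambda^{-\frac{1}{2}}U^{T}\):

  \[(\Sigma_{X,X}^{-\frac{1}{2}})^{2}=U\Lambda^{-\frac{1}{2}}U^{T}\,U\Lambda^{-\frac{1}{2}}U^{T}=U\Lambda^{-1}U^{T}=(U\Lambda U^{T})^{-1}=\Sigma_{X,X}^{-1},\]

  using \(U^{T}U=I\) and the fact that \(\Lambda\) is diagonal with positive entries.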

  7. https://en.wikipedia.org/wiki/Confusion_matrix

  8. The case of rank \(k=0\) is the trivial case \(X=0.\)

  9. The equation ([eq:ch5_higher_dim_lin_gradient]) is the transpose of ([eq:ch7_norm_eqs_grad]) with \(W^{T}\) in place of \(A,\) since here the parameter matrix is \(W\in\mathbb{R}^{d\times m}\) while there it is \(A\in\mathbb{R}^{m\times d}.\)

  10. Technical detail: together with a \(\sigma\)-algebra \(\mathcal{A}\) on \(\mathcal{X}\times\mathcal{Y},\) so that \((\mu,\mathcal{A},\mathcal{X}\times\mathcal{Y})\) is a probability space.

  11. Technical detail: \(f_{*}\) is assumed to be measurable, where \(\mathcal{X}\) and \(\mathcal{Y}\) are equipped with the \(\sigma\)-algebras arising as restrictions of \(\mathcal{A}.\) In particular, the latter can be achieved by equipping \(\mathcal{X}\) and \(\mathcal{Y}\) with \(\sigma\)-algebras \(\mathcal{A}_{\mathcal{X}}\) resp. \(\mathcal{A}_{\mathcal{Y}}\) and letting \(\mathcal{A}\) be the product \(\sigma\)-algebra \(\mathcal{A}_{\mathcal{X}}\otimes\mathcal{A}_{\mathcal{Y}}.\)

  12. See M08 Stochastics, Example 0.2, page 10 (screen pdf): https://moodle.fernuni.ch/pluginfile.php/366253/mod_resource/content/76/probability_fernuni_screen.pdf#page=10

  13. Technical detail: for \(A\) measurable.

  14. Technical detail: On the measurable space obtained from the countably infinite product of \((\mathcal{X}\times\mathcal{Y},\mathcal{A}).\)

  15. Technical detail: measurable.

  16. Technical detail: measurable.

  17. Technical detail: Assuming that \(R(f)\) is finite, which is the only reasonable case, the random variables \(L(f(x_{i}^{{\rm train}}),y_{i}^{{\rm train}})\) have finite \(L^{1}\)-norm, so the law of large numbers applies.
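  Concretely, assuming the training pairs \((x_{i}^{{\rm train}},y_{i}^{{\rm train}})\) are drawn i.i.d. from \(\mu\) and \(R(f)\) denotes the expected loss, as in the surrounding text, the strong law of large numbers then gives

  \[\frac{1}{n}\sum_{i=1}^{n}L(f(x_{i}^{{\rm train}}),y_{i}^{{\rm train}})\;\longrightarrow\;\mathbb{E}_{(x,y)\sim\mu}\left[L(f(x),y)\right]=R(f)\qquad\text{almost surely as }n\to\infty.\]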

  18. The original version of this section had \(y=x+\varepsilon,\) i.e. the case \(w_{*}=1,\) and the variance of \(\varepsilon\) was \(\frac{1}{10},\) i.e. \(s^{2}=\frac{1}{10}.\) I have changed this to \(y=xw_{*}+\varepsilon\) with variance \(s^{2}\) for arbitrary unknown \(w_{*}\) and \(s^{2}>0,\) to emphasize that the analysis here plausibly applies, in a rough sense, also to a real-world situation where the data distribution \(\mu\) is not known to the user of the machine learning algorithm.
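  As a hypothetical illustration (not part of the original text), one can simulate this data model; the specific values of \(w_{*}\), \(s^{2}\), and the input distribution below are assumptions chosen only for the sketch:

  ```python
  # Sketch of the data model y = x * w_star + eps with noise variance s^2.
  # w_star, s2, n, and the distribution of x are illustrative assumptions.
  import numpy as np

  rng = np.random.default_rng(0)

  w_star = 1.7   # unknown true slope (assumed for illustration)
  s2 = 0.1       # noise variance s^2 (assumed for illustration)
  n = 100        # number of training samples

  x = rng.uniform(-1.0, 1.0, size=n)           # inputs (distribution assumed)
  eps = rng.normal(0.0, np.sqrt(s2), size=n)   # noise with variance s^2
  y = x * w_star + eps                         # targets according to the model

  # Least-squares estimate of w_star from the sample (one-dimensional case).
  w_hat = (x @ y) / (x @ x)
  print(w_hat)  # close to w_star for large n
  ```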

  19. Credit and license: https://neuralnetworksanddeeplearning.com/chap1.html

  20. These were initially introduced in the 1990s for rendering 3D graphics, which also involves huge numbers of matrix-vector operations that can be carried out in parallel. After they became popular for neural network training, manufacturers such as Nvidia started producing chips purely for machine learning, without graphics capabilities. The name GPU has stuck nevertheless, although Google calls its chips TPUs (Tensor Processing Units), and other similar chips or chip subsystems are called NPUs (Neural Processing Units).

  21. Named for Kaiming He, one of its inventors.

  22. Named for Xavier Glorot, one of its inventors.

  23. Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60 (6): 84–90, 2017.
