A set is a collection of objects. It can be finite, countably infinite, or uncountably infinite, and it can be empty (denoted by $\emptyset$). The size of a set $\mathcal{X}$ is denoted by $|\mathcal{X}|$.
A function $f : \mathcal{X} \longrightarrow \mathcal{Y}$ is a mapping from a set of points known as the domain to another set of points known as the codomain. For example, $y = x^2$ is a function mapping $\mathcal{X} = \mathbb{R}$ to $\mathcal{Y} = \mathbb{R}_+$.
A function $f$ is continuous at a point $\bold{a}$ if
$$\forall\epsilon > 0\ \exists\ \delta > 0\ \text{s.t.}\ ||\bold{x} - \bold{a}|| < \delta \Longrightarrow ||f(\bold{x}) - f(\bold{a})|| < \epsilon$$
If we move a small distance in the input space, then the function's output should only change a little bit.
The Lipschitz constant of a function measures how much the function can change as we vary its input. In the 1d case, it is defined as a constant $C \ge 0$ such that, for all $x_1, x_2$,
$$|f(x_1) - f(x_2)| \le C|x_1 - x_2|$$
That is, for a given constant $C$, if we change the input by one unit, the function's output cannot change by more than $C$.
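As a rough numerical sketch (my own example in JAX, not from the text above): for a differentiable 1d function the smallest valid $C$ is $\sup_x |f'(x)|$, which we can estimate for $f(x) = \sin(x)$ by sampling the derivative on a grid.

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x)          # Lipschitz with C = 1, since |f'(x)| = |cos(x)| <= 1

# For a differentiable 1d function, the smallest valid C is sup_x |f'(x)|;
# estimate it by evaluating the derivative on a dense grid.
xs = jnp.linspace(-10.0, 10.0, 10_001)
C_hat = jnp.max(jnp.abs(jax.vmap(jax.grad(f))(xs)))
print(C_hat)                   # ~1.0

# Spot-check the defining inequality |f(x1) - f(x2)| <= C |x1 - x2|.
x1, x2 = 0.3, 2.7
print(jnp.abs(f(x1) - f(x2)) <= C_hat * jnp.abs(x1 - x2))   # True
```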
Consider a function that maps a vector to another vector, $\bold{f} : \mathbb{R}^n \rightarrow \mathbb{R}^m$. The Jacobian matrix of this function is the $m \times n$ matrix of partial derivatives (we lay out the results in numerator layout):
$$\bold{J} \triangleq \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \ldots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \ldots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} $$
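As a concrete sketch (my own example), `jax.jacobian` computes this matrix in exactly this layout; here $\bold{f} : \mathbb{R}^2 \rightarrow \mathbb{R}^3$.

```python
import jax
import jax.numpy as jnp

def f(x):
    # f : R^2 -> R^3
    return jnp.array([x[0] ** 2, x[0] * x[1], jnp.sin(x[1])])

x = jnp.array([2.0, 3.0])
J = jax.jacobian(f)(x)   # shape (m, n) = (3, 2); row i holds the partials of f_i
print(J)
# [[4.  0.  ]
#  [3.  2.  ]
#  [0. -0.99]]   (last entry is cos(3))
```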
The Jacobian-vector product (JVP) is right-multiplying the Jacobian matrix $\bold{J} \in \mathbb{R}^{m \times n}$ by a vector $\bold{v} \in \mathbb{R}^n$, giving $\bold{J}\bold{v} \in \mathbb{R}^m$. The vector-Jacobian product (VJP) is left-multiplying the Jacobian matrix by a vector $\bold{u} \in \mathbb{R}^m$, giving $\bold{u}^{\mathsf{T}}\bold{J}$. The JVP is more efficient when $m \ge n$, and the VJP is more efficient when $m \le n$.
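A minimal JAX sketch of both products (same illustrative $\bold{f}$ as above): `jax.jvp` computes $\bold{J}\bold{v}$ with forward-mode autodiff and `jax.vjp` computes $\bold{u}^{\mathsf{T}}\bold{J}$ with reverse-mode autodiff, without materializing $\bold{J}$.

```python
import jax
import jax.numpy as jnp

def f(x):
    # same illustrative f : R^2 -> R^3 as above
    return jnp.array([x[0] ** 2, x[0] * x[1], jnp.sin(x[1])])

x = jnp.array([2.0, 3.0])
J = jax.jacobian(f)(x)                 # full (3, 2) Jacobian, for comparison only

v = jnp.array([1.0, -1.0])             # v in R^n
_, Jv = jax.jvp(f, (x,), (v,))         # JVP: J @ v, via forward-mode autodiff
print(jnp.allclose(Jv, J @ v))         # True

u = jnp.array([1.0, 0.0, 2.0])         # u in R^m
_, vjp_fn = jax.vjp(f, x)
(uJ,) = vjp_fn(u)                      # VJP: u^T J (equivalently J^T u), via reverse-mode autodiff
print(jnp.allclose(uJ, u @ J))         # True
```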
For a function $f : \mathbb{R}^n \rightarrow \mathbb{R}$ that is twice differentiable, we define the Hessian matrix as the $n \times n$ matrix of second partial derivatives:
$$\bold{H} = \frac{\partial^2 f}{\partial \bold{x}^2} = \nabla^2 f = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \ldots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix} $$
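A small sketch (my own example): for the quadratic form $f(\bold{x}) = \frac{1}{2}\bold{x}^{\mathsf{T}}\bold{A}\bold{x}$ with symmetric $\bold{A}$, the Hessian is exactly $\bold{A}$, which `jax.hessian` confirms.

```python
import jax
import jax.numpy as jnp

A = jnp.array([[3.0, 1.0],
               [1.0, 2.0]])            # symmetric matrix

def f(x):
    return 0.5 * x @ A @ x             # quadratic form, whose Hessian is exactly A

x = jnp.array([1.0, -2.0])
H = jax.hessian(f)(x)                  # (n, n) matrix of second partial derivatives
print(jnp.allclose(H, A))              # True
```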
Convex set
We say $\mathcal{S}$ is a convex set if, for any $\bold{x}, \bold{x}' \in \mathcal{S}$, we have
$$\lambda \bold{x} + (1 - \lambda)\bold{x}' \in \mathcal{S}, \quad \forall \lambda \in [0, 1]$$
That is, if we draw a line segment from $\bold{x}$ to $\bold{x}'$, all points on the segment lie inside the set.
Convex function
We say a function $f$ is convex if its epigraph (the set of points above the function) defines a convex set.
A twice-differentiable function $f$ is convex iff $\bold{H} = \nabla^2f(\bold{x})$ is positive semidefinite for all $\bold{x} \in \text{dom}(f)$; $f$ is strictly convex if $\bold{H}$ is positive definite.
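As a numerical sketch (my own example, and only a check at a single point, not a proof of convexity everywhere), we can inspect the eigenvalues of the Hessian.

```python
import jax
import jax.numpy as jnp

def f(x):
    # sum of exponentials plus a squared norm: a strictly convex function
    return jnp.sum(jnp.exp(x)) + jnp.sum(x ** 2)

x = jnp.array([0.5, -1.0, 2.0])
eigs = jnp.linalg.eigvalsh(jax.hessian(f)(x))
print(eigs)          # all strictly positive, so H is positive definite at this x
```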
Jensen's inequality states that for any convex function $f$, we have that
$$f(\sum_{i=1}^{n} \lambda_i \bold{x}_i) \le \sum_{i=1}^n \lambda_i f(\bold{x}_i)$$
where $\lambda_i \ge 0$ and $\sum_{i=1}^n \lambda_i = 1$.
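A quick numerical check of the inequality for the convex function $f(x) = e^x$ (my own sketch, using scalar $x_i$ for simplicity):

```python
import jax.numpy as jnp

f = jnp.exp                           # a convex function

x = jnp.array([0.0, 1.0, 3.0])        # points x_i
lam = jnp.array([0.2, 0.5, 0.3])      # weights: nonnegative, summing to 1

lhs = f(jnp.dot(lam, x))              # f(sum_i lambda_i x_i)
rhs = jnp.dot(lam, f(x))              # sum_i lambda_i f(x_i)
print(lhs <= rhs)                     # True, as Jensen's inequality requires
```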
Linear approximation around a point $\bold{a} \in \mathbb{R}^n$:
$$\widehat{f}(\bold{x}) = f(\bold{a}) + \nabla f(\bold{a})^{\mathsf{T}}(\bold{x} - \bold{a})$$
Quadratic approximation:
$$\widehat{f}(\bold{x}) = f(\bold{a}) + \nabla f(\bold{a})^{\mathsf{T}}(\bold{x} - \bold{a}) + \frac{1}{2}(\bold{x} - \bold{a})^{\mathsf{T}}\nabla^2f(\bold{a})(\bold{x} - \bold{a})$$
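A sketch of both approximations built with `jax.grad` and `jax.hessian` (my own example function; near $\bold{a}$ the quadratic approximation should track $f$ more closely than the linear one).

```python
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sum(jnp.exp(x)) + x[0] * x[1]

a = jnp.array([0.5, -0.5])
g = jax.grad(f)(a)                     # gradient at a
H = jax.hessian(f)(a)                  # Hessian at a

def f_lin(x):
    return f(a) + g @ (x - a)

def f_quad(x):
    return f(a) + g @ (x - a) + 0.5 * (x - a) @ H @ (x - a)

x = a + jnp.array([0.1, -0.05])        # a point near a
print(f(x), f_lin(x), f_quad(x))       # the quadratic approximation is closer to f(x)
```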
Let $f : \Omega \rightarrow \mathbb{R}$ be a continuously differentiable, strictly convex function defined on a closed convex set $\Omega$. The Bregman divergence is defined as follows:
$$D_f(\bold{w}, \bold{v}) = f(\bold{w}) - f(\bold{v}) - (\bold{w} - \bold{v})^{\mathsf{T}}\nabla f(\bold{v})$$
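A generic sketch of this formula (my own example): with $f(\bold{w}) = ||\bold{w}||^2$ the Bregman divergence reduces to the squared Euclidean distance, and with the negative entropy it reduces to the KL divergence.

```python
import jax
import jax.numpy as jnp

def bregman(f, w, v):
    # D_f(w, v) = f(w) - f(v) - (w - v)^T grad f(v)
    return f(w) - f(v) - (w - v) @ jax.grad(f)(v)

sq_norm = lambda x: jnp.sum(x ** 2)
w, v = jnp.array([1.0, 2.0]), jnp.array([0.0, 1.0])
print(bregman(sq_norm, w, v))          # 2.0 = ||w - v||^2

neg_entropy = lambda p: jnp.sum(p * jnp.log(p))
p, q = jnp.array([0.7, 0.3]), jnp.array([0.5, 0.5])
print(bregman(neg_entropy, p, q))      # KL(p || q) ~ 0.082
```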
Conjugate duality is a useful way to construct linear lower bounds to non-convex functions, which we can then easily maximize.
Consider an arbitrary continuous function $f(\bold{x})$, and suppose we create a linear lower bound on it of the form
$$\mathcal{L}(\bold{x}, \boldsymbol{\lambda}) \triangleq \boldsymbol{\lambda}^{\mathsf{T}}\bold{x} - f^*(\boldsymbol{\lambda}) \le f(\bold{x})$$
where $\boldsymbol{\lambda}$ is the slope and $f^*(\boldsymbol{\lambda})$ is the intercept, which we solve for. The tightest such bound is given by the conjugate $f^*(\boldsymbol{\lambda}) \triangleq \sup_{\bold{x}} \boldsymbol{\lambda}^{\mathsf{T}}\bold{x} - f(\bold{x})$.
It is interesting to see what happens if we take the conjugate of the conjugate. If $f$ is convex, then $f^{**} = f$, so $f$ and $f^*$ are called conjugate duals.
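A numerical sketch of these ideas in 1d (my own example, computing the conjugate by brute force on a grid for $f(x) = x^2$): the resulting linear function lower-bounds $f$, and conjugating twice recovers $f$ up to grid error.

```python
import jax.numpy as jnp

xs = jnp.linspace(-5.0, 5.0, 2001)             # grid over the domain of f
f_vals = xs ** 2                               # f(x) = x^2, a convex function

# Conjugate on a grid: f*(lambda) = max_x [lambda * x - f(x)].
lams = jnp.linspace(-12.0, 12.0, 2001)
f_star = jnp.max(lams[:, None] * xs[None, :] - f_vals[None, :], axis=1)

# L(x, lambda) = lambda * x - f*(lambda) lower-bounds f for any fixed slope.
lam_idx = 1500                                 # slope lambda = lams[1500] = 6.0
L = lams[lam_idx] * xs - f_star[lam_idx]
print(jnp.all(L <= f_vals + 1e-4))             # True

# Conjugating twice recovers f (up to grid error), since f is convex.
f_star_star = jnp.max(xs[:, None] * lams[None, :] - f_star[None, :], axis=1)
print(jnp.max(jnp.abs(f_star_star - f_vals)))  # small
```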