do we always have to choose principal components based on maximum
variance explained?
Yes. "Maximum variance explained" is closely related to the main objective as follows.
Our main objective is: for a limited budget K dimensions, what information $\mbox{a}=(a_1,...,a_K)$ to keep from original data $\mbox{x}=(x_1,...,x_D)$ ($D \gg K$) in order to be able to reconstruct $\mbox{x}$ from $\mbox{a}$ as close as possible?
If we only allow rotation and scaling of original data, i.e. $a_k := \mbox{x}.\mbox{v}_k$ for unknown set of vectors $V_K=\{\mbox{v}_k|\mbox{v}_k \in \mathbb{R}^D, 1 \leq k \leq K\}$, and define the reconstruction error as
$$loss(\mbox{x},V_K):=\left \| \mbox{x}-\underbrace{\sum_{k=1}^{K}\overbrace{(\mbox{x}.\mbox{v}_k)}^{a_k}\mbox{v}_k}_{\hat{\mbox{x}}} \right \|^2,$$
the solution $V^*_K$ that minimizes this error is PCA. For first dimension, PCA keeps the projection of data on vector $\mbox{v}^*_1$ in the direction of largest data variance, namely $a^*_1$. For second dimension, it keeps the projection on vector $\mbox{v}^*_2$ in the direction of second largest data variance, namely $a^*_2$, and so on.
In other words, when we try to find a K-vector set $V_K$ that minimizes $loss(X,V_K)=\frac{1}{N}\sum_{n=1}^{N}loss(\mbox{x}_n,V_K)$, the solution
$V^*_K$ includes $\mbox{v}^*_k$ that is in the direction of $k\mbox{-th}$ largest data variance.
Note that "ratio of variance explained" is a measure from statistics. Using the previous notations, it is defined as:
$$\mbox{R}(X,V_K):=1 - \frac{loss(X,V_K)}{Var(X)}$$
Since variance of original data $Var(X)$ is independent of solution, minimum of $loss(X,V_K)$ is equivalent to maximum of $\mbox{R}(X,V_K)$. For example, if $K=2$, then $V^*_2=\{\mbox{v}^*_1, \mbox{v}^*_2\}$ minimizes $loss(X,V_2)$ and equivalently maximizes $\mbox{R}(X,V_2)$. Ideally, if original data $X$ can be perfectly reconstructed from $V_K$, then $R(X, V_K)$ would be $1$.
Does maximum variance means most information about my data in higher
dimension is captured into lower dimension?
Yes. If we agree that "keep as much information as possible" is equivalent to "be able to reconstruct the data as close as possible", then our objective $min_{V_K}loss(X,V_K)$ formalizes "keep as much information as possible", and its solution is "maximum variance".