Aalto Dictionary of ML – Probability Mass Function

The probability mass function (pmf) of a discrete random variable (RV) $x$ is a function $p^{(x)}\left(\cdot\right): \mathcal{X}\rightarrow [0,1]$ that assigns to each possible value $x' \in \mathcal{X}$ of $x$ the probability $p^{(x)}\left(x'\right) = \mathbb{P}\left(x = x'\right)$ (Papoulis and Pillai 2002). Fig. 1 illustrates the pmf of a discrete RV $x$.

Fig. 1: The pmf $p^{(x)}\left(\cdot\right)$ of a discrete RV $x$ taking
values in the set $\mathcal{X}= \{\star,\otimes\}$. Three datasets are
also shown whose relative frequencies of data points match this pmf
exactly. Such datasets could arise as realizations of independent and
identically distributed (i.i.d.) RVs sharing the common pmf
$p^{(x)}\left(\cdot\right)$.
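The idea of a dataset whose relative frequencies match a pmf exactly can be sketched in a few lines of Python. The pmf values below (1/4 and 3/4), the string labels `"star"` and `"otimes"`, and the helper names are assumptions made for illustration; they are not taken from the figure.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical pmf on the two-element set {star, otimes}.
# Exact fractions avoid floating-point issues in the equality checks below.
pmf = {"star": Fraction(1, 4), "otimes": Fraction(3, 4)}


def is_valid_pmf(p):
    """Check that every value lies in [0, 1] and that the values sum to one."""
    return all(0 <= v <= 1 for v in p.values()) and sum(p.values()) == 1


def relative_frequencies(dataset, values):
    """Return, for each value, the fraction of data points equal to it."""
    counts = Counter(dataset)
    m = len(dataset)
    return {v: Fraction(counts[v], m) for v in values}


# Three illustrative datasets of sizes 4, 8, and 12.
datasets = [
    ["star", "otimes", "otimes", "otimes"],
    ["otimes", "star", "otimes", "otimes", "otimes", "otimes", "star", "otimes"],
    ["otimes"] * 9 + ["star"] * 3,
]

# True for each dataset whose relative frequencies equal the pmf exactly.
matches = [relative_frequencies(d, pmf) == pmf for d in datasets]
```

Note that an exact match is only possible when the dataset size $m$ makes every product $m \cdot p^{(x)}\left(x'\right)$ an integer, which is why the sizes here are multiples of four.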

A pmf always satisfies $\sum_{x' \in \mathcal{X}} p^{(x)}\left(x'\right) = 1$. We can view a pmf as representing a collection of (sufficiently long) datasets. This collection contains every dataset $\mathcal{D}= \{x^{(1)}, \,\ldots, \,x^{(m)}\}$ for which the relative frequency of each value $x' \in \mathcal{X}$ is close to the corresponding pmf value $p^{(x)}\left(x'\right)$,

$$\frac{\big|\{r\in \{1,\,\ldots,\,m\}: x^{(r)}= x'\}\big|}{m} \approx p^{(x)}\left(x'\right).$$

Requiring the relative frequencies to be close to the pmf values implies that the empirical entropy of such a dataset is close to the entropy of the pmf $p^{(x)}\left(\cdot\right)$. Information theory refers to the collection of such datasets as the typical set corresponding to the pmf $p^{(x)}\left(\cdot\right)$ (Cover and Thomas 2006). A main result of information theory states that, for a sufficiently large sample size $m$, a dataset generated by i.i.d. sampling from $p^{(x)}\left(\cdot\right)$ belongs, with high probability, to the typical set with respect to $p^{(x)}\left(\cdot\right)$ (Cover and Thomas 2006, Th. 3.1.2).
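As a rough numerical illustration of this statement, the following Python sketch draws an i.i.d. dataset from a pmf and compares its relative frequencies and entropy with those of the pmf. The pmf values, the sample size, the seed, and all function names are assumptions chosen for illustration. The empirical entropy is taken here to be the entropy of the relative frequencies, which is one possible reading of the term.

```python
import math
import random
from collections import Counter

# Hypothetical pmf on a two-element set (values chosen for illustration).
pmf = {"star": 0.25, "otimes": 0.75}


def entropy(p):
    """Entropy in bits of a pmf given as a dict; zero-mass values are skipped."""
    return -sum(v * math.log2(v) for v in p.values() if v > 0)


def sample_iid(p, m, rng):
    """Draw m i.i.d. realizations from the pmf p."""
    values = list(p)
    weights = [p[v] for v in values]
    return rng.choices(values, weights=weights, k=m)


def relative_frequencies(dataset, values):
    """Return, for each value, the fraction of data points equal to it."""
    counts = Counter(dataset)
    return {v: counts[v] / len(dataset) for v in values}


rng = random.Random(0)  # fixed seed so that the run is reproducible
dataset = sample_iid(pmf, 10_000, rng)
freqs = relative_frequencies(dataset, pmf)

# Largest gap between a relative frequency and the matching pmf value.
max_deviation = max(abs(freqs[v] - pmf[v]) for v in pmf)

# Gap between the entropy of the relative frequencies and the pmf entropy.
entropy_gap = abs(entropy(freqs) - entropy(pmf))
```

Both `max_deviation` and `entropy_gap` tend to shrink as the sample size grows, which is the behavior the typical-set result describes.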
See also: random variable (RV), probability, probability distribution, probabilistic model.

Cover, T. M., and J. A. Thomas. 2006. Elements of Information Theory. 2nd ed. Hoboken, NJ, USA: Wiley.

Papoulis, A., and S. U. Pillai. 2002. Probability, Random Variables, and Stochastic Processes. 4th ed. New York, NY, USA: McGraw-Hill Higher Education.


📚 This explanation is part of the Aalto Dictionary of Machine Learning — an open-access multi-lingual glossary developed at Aalto University to support accessible and precise communication in ML.

Written on December 7, 2025