Consider a real-valued non-negative random variable (RV) $x$ for which
the expectation $\mathbb{E}\{x\}$ exists. Markov’s inequality provides
an upper bound on the probability $\mathbb{P}\left(x\geq a\right)$ that
$x$ exceeds a given threshold $a>0$. In particular,
\(\mathbb{P}\left(x \geq a\right) \leq \frac{\mathbb{E}\{x\}}{a} \qquad \mbox{ holds for any } a > 0.\)
This inequality can be verified by
noting that $\mathbb{P}\left(x \geq a\right)$ is the expectation
$\mathbb{E}\{g(x)\}$ with the function
\(g: \mathbb{R} \rightarrow \mathbb{R}: x' \mapsto \mathbb{I}_{[a,\infty)}(x'),\)
which takes the value $1$ if $x' \geq a$ and the value $0$ otherwise.
As illustrated in the Figure below, for any $a>0$,
\(g(x') \leq x'/a \mbox{ for all } x' \geq 0.\)
Since $x$ is non-negative, this pointwise inequality
implies Markov’s inequality via the monotonicity of the Lebesgue integral (Folland 1999, 50).
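The bound is also easy to check numerically. The following minimal sketch (assuming NumPy; the exponential distribution and the threshold are arbitrary illustrative choices) compares an empirical tail probability with the Markov bound:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=1_000_000)  # non-negative RV with E{x} = 2
a = 5.0                                         # threshold

empirical_prob = (x >= a).mean()  # estimate of P(x >= a)
markov_bound = x.mean() / a       # Markov bound E{x}/a

print(f"P(x >= {a}) is approx. {empirical_prob:.4f} <= {markov_bound:.4f}")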
See also: expectation, probability, concentration inequality.
Folland, Gerald B. 1999. Real Analysis: Modern Techniques and Their
Applications. 2nd ed. New York, NY, USA: Wiley.
📚 This explanation is part of the Aalto Dictionary of Machine Learning —
an open-access multi-lingual glossary developed at Aalto University to support
accessible and precise communication in ML.
In the context of machine learning (ML), a sample is a finite sequence
(of length $m$) of data points, ${\bf z}^{(1)}, \ldots, {\bf z}^{(m)}$.
The number $m$ is called the sample size. Empirical risk minimization
(ERM)-based methods use a sample to train a model (or learn a
hypothesis) by minimizing the average loss (the empirical risk) over
that sample. Since a sample is defined as a sequence, the same data
point may appear more than once. By contrast, some authors in statistics
define a sample as a set of data points, in which case duplicates are
not allowed (Everitt and Skrondal 2010; Upton and Cook 2014). These two
views can be reconciled by regarding a sample as a sequence of
feature–label pairs, $\left( {\bf x}^{(1)},y^{(1)} \right), \ldots,
\left( {\bf x}^{(m)},y^{(m)} \right)$. The $r$-th pair consists of the
features ${\bf x}^{(r)}$ and the label $y^{(r)}$ of a unique underlying
data point $\widetilde{{\bf z}}^{(r)}$. While the underlying data points
$\widetilde{{\bf z}}^{(1)},\ldots,\widetilde{{\bf z}}^{(m)}$ are unique,
some of them can have identical features and labels.
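For concreteness, a sample can be represented in Python as a list of feature–label pairs; this is a toy example with purely hypothetical values, in which one pair occurs twice:

# a sample of size m = 4, represented as a sequence (list) of feature-label pairs
sample = [
    ((0.2, 1.5), 1),
    ((0.9, 0.3), 0),
    ((0.2, 1.5), 1),  # identical features and label as the first pair
    ((1.1, 2.0), 1),
]
m = len(sample)  # the sample size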
Figure: A sample viewed as a finite sequence. Each element of this
sample consists of the feature vector and the label of a data point from
an underlying population. The same data point may occur more than once
in the sample.
For the analysis of machine learning (ML) methods, it is common to
interpret (the generation of) a sample as the realization of a
stochastic process indexed by $\{1,\ldots,m\}$. A widely used assumption
is the independent and identically distributed assumption
(i.i.d. assumption), where sample elements
$\left( {\bf x}^{(r)},y^{(r)} \right)$, for $r=1,\ldots,m$, are
independent and identically distributed (i.i.d.) random variables (RVs)
with a common probability distribution.
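Under the i.i.d. assumption, generating a sample amounts to drawing $m$ independent realizations from a common distribution. A minimal sketch (the standard normal features and the noisy linear labels are purely illustrative assumptions):

import numpy as np

rng = np.random.default_rng(42)
m, d = 100, 3                    # sample size and number of features
X = rng.standard_normal((m, d))  # i.i.d. feature vectors x^(1), ..., x^(m)
w_true = np.array([1.0, -2.0, 0.5])            # hypothetical true parameter vector
y = X @ w_true + 0.1 * rng.standard_normal(m)  # labels y^(1), ..., y^(m)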
See also: dataset, sequence, independent and identically distributed
assumption (i.i.d. assumption).
Everitt, B. S., and A. Skrondal. 2010. The Cambridge Dictionary of
Statistics. 4th ed. Cambridge, U.K.: Cambridge Univ. Press.
Upton, Graham, and Ian Cook. 2014. A Dictionary of Statistics. 3rd ed.
Oxford, U.K.: Oxford Univ. Press.
📚 This explanation is part of the Aalto Dictionary of Machine Learning —
an open-access multi-lingual glossary developed at Aalto University to support
accessible and precise communication in ML.
TL;DR: Generalization is a model’s ability to make accurate predictions on new, unseen data
after training. This post explains probabilistic and robustness-based views of generalization, including
empirical risk minimization (ERM) and perturbation tests.
Generalization refers to the ability of a machine learning (ML)
method trained on a training set to make accurate predictions on new, unseen data.
This is a central goal of ML and AI: to learn patterns that extend beyond the data
available during training.
Most ML systems use empirical risk minimization (ERM) to
learn a hypothesis $\hat{h} \in \mathcal{H}$ by minimizing
the average loss over a training set of data points
${\bf z}^{(1)}, \ldots, {\bf z}^{(m)}$, denoted as $\mathcal{D}^{(\rm train)}$. However, success on the
training set does not guarantee success on unseen data—this discrepancy is the challenge of generalization.
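Written out, ERM selects
\(\hat{h} \in \arg\min_{h \in \mathcal{H}} \widehat{L}(h|\mathcal{D}^{(\rm train)}), \qquad \widehat{L}(h|\mathcal{D}^{(\rm train)}) := \frac{1}{m} \sum_{r=1}^{m} L(h, {\bf z}^{(r)}),\)
where $L(h, {\bf z})$ denotes the loss incurred by a hypothesis $h$ on a data point ${\bf z}$.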
To study generalization mathematically, we often assume a probabilistic model for data generation, such as the
i.i.d. assumption. That is, we interpret data points
as independent random variables (RVs) with an identical probability distribution $p({\bf z})$. This distribution,
while unknown, allows us to define the risk of a trained model $\hat{h}$ as the expected loss
\(\bar{L}(\hat{h}) := \mathbb{E}\{ L(\hat{h}, {\bf z}) \}, \qquad {\bf z} \sim p({\bf z}).\)
The difference between the risk $\bar{L}(\hat{h})$ and the empirical risk $\widehat{L}(\hat{h}|\mathcal{D}^{(\rm train)})$ is
called the generalization gap. Tools from probability theory—such as concentration inequalities and uniform
convergence—can be used to bound this gap under specific conditions (Shalev-Shwartz and Ben-David 2014).
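In practice, $p({\bf z})$ is unknown, so the risk is typically estimated on held-out data. A minimal sketch (assuming scikit-learn and purely synthetic data) approximates the generalization gap as the difference between validation loss and training loss:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 5))             # synthetic features
y = X[:, 0] + 0.1 * rng.standard_normal(500)  # synthetic labels

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)
h_hat = DecisionTreeRegressor(max_depth=10, random_state=0).fit(X_tr, y_tr)

emp_risk = mean_squared_error(y_tr, h_hat.predict(X_tr))    # empirical risk on the training set
est_risk = mean_squared_error(y_val, h_hat.predict(X_val))  # estimate of the risk
print("estimated generalization gap:", est_risk - emp_risk)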
Generalization Without Probability
Probability theory is one way to study generalization, but not the only one. Another approach uses perturbations
to the training data. The idea is simple: a good model $\hat{h}$ should be robust. Its prediction $\hat{h}({\bf x})$
should not change much if we slightly modify the input ${\bf x}$ of a data point ${\bf z}$.
For example, an object detector trained on smartphone photos should still detect the object if a few random pixels are masked (Su, Vargas, and Sakurai 2019). Similarly, it should produce consistent predictions if the object is rotated (Mallat 2016).
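The following sketch implements such a perturbation test for a generic regression model (the noise scale is an arbitrary illustrative choice):

import numpy as np

def perturbation_sensitivity(model, X, noise_scale=0.01, seed=0):
    # average absolute change in predictions under small random input perturbations
    rng = np.random.default_rng(seed)
    X_perturbed = X + noise_scale * rng.standard_normal(X.shape)
    return np.mean(np.abs(model.predict(X) - model.predict(X_perturbed)))

# usage with the model and data from the sketch above:
# print(perturbation_sensitivity(h_hat, X_val))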
Figure: Two data points ${\bf z}^{(1)},{\bf z}^{(2)}$ used as a training set to learn a hypothesis $\hat{h}$ via ERM. Generalization
can be evaluated using either the i.i.d. assumption or deterministic perturbations of the training data.
❓ Frequently Asked Questions
What is generalization in machine learning?
Generalization is a model’s ability to make accurate predictions on new, unseen data after training on a known dataset.
What causes poor generalization?
Overfitting, data imbalance, and overly complex models often lead to poor generalization.
Can we study generalization without probability theory?
Yes. Deterministic approaches like robustness testing with perturbations offer a complementary perspective.
What is the generalization gap?
It is the difference $\bar{L}(\hat{h}) - \widehat{L}(\hat{h}|\mathcal{D}^{(\rm train)})$ between the risk (the expected loss on unseen data) and the empirical risk on the training data.
📚 References
Mallat, Stéphane. 2016. “Understanding Deep Convolutional Networks.” Philosophical Transactions of the Royal Society A 374 (2065): 20150203. https://doi.org/10.1098/rsta.2015.0203
Shalev-Shwartz, S., and S. Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms.
Cambridge University Press.
Su, J., D. V. Vargas, and K. Sakurai. 2019. “One Pixel Attack for Fooling Deep Neural Networks.” IEEE Transactions on Evolutionary Computation 23 (5): 828–841. https://doi.org/10.1109/TEVC.2019.2890858
📚 This explanation is part of the Aalto Dictionary of Machine Learning —
an open-access multilingual glossary developed at Aalto University to support accessible and precise communication in ML.
This makes it a robust alternative to the mean when data includes outliers.
📚 This explanation is part of the Aalto Dictionary of Machine Learning —
an open-access multi-lingual glossary developed at Aalto University to support
accessible and precise communication in ML.
It is almost 10 years since I started on the tenure track at Aalto University (Finland).
From the very beginning, I had the unique opportunity to contribute to shaping and
enhancing the Machine Learning curriculum at our university.
Over the past decade, I have designed and implemented core courses for the
machine learning curricula at Aalto University at both the Bachelor and
Master levels. These courses have not only benefited degree-seeking students
but also reached beyond, offering opportunities for adult learners through
the Finnish Institute of Technology (FITech.io).
Developing a machine learning curriculum is iterative. A curriculum
needs to balance theoretical foundations with practical applications.
Three core principles have driven my approach:
Practical Relevance: The design of a course starts with writing down the learning goals. These goals include very concrete skills as well as core theoretical concepts. It is often effective to motivate and demonstrate theoretical concepts through everyday applications.
Student-Centric Design: I consider student feedback to be as important as the peer reviews for my research papers. Similar to the peer-review process of journals, I prepare response letters that explain how student feedback has been taken into account.
Collaboration & Inclusivity: By working with partners like FITech and Unite!, the curriculum
extends beyond Aalto University, offering access to a broader learning community, including adult
learners and international students.
These principles foster a curriculum that grows alongside advancements in machine learning and AI.
📣 Testimonial from a Co-Lecturer
“…I like to take the opportunity to express that, in my opinion, Alex has generated a very strong, didactically excellent course with a good focus on the necessary basic concepts and principles. It mixes theory and focused exercises with a machine learning project in which students grow while applying their learned knowledge on actual data. From last year to this year, the course has been further improved significantly by rigorously addressing the feedback collected by the students.”
📣 Testimonial from VP Education of Aalto University
“…I learnt to know Alex in 2015 in my earlier role, VP Education of Aalto University, when Alex joined the faculty of Aalto University School of Science as an Assistant Professor. My first impression was his dedication and passion to create powerful learning experiences for all his students. It was no surprise that the student feedback was extremely positive from day one…”
🌐 Looking Ahead
As the field of machine learning evolves, so must our university curricula.
My current focus is on integrating more content related to explainable AI (XAI), federated learning,
and trustworthy AI. These are areas of growing importance in both academia
and industry. We need to equip students with the skills required to build human-centered
and trustworthy AI.
There is also growing demand for education in legal literacy in AI and data-driven
technologies. We need to reshape tech-focused curricula in order to cover the
intersection of AI and law. This includes courses that provide students with an
understanding of key legal concepts such as data privacy regulations (like GDPR),
AI ethics, and accountability frameworks. These additions will enable students to
grasp the legal and ethical implications of deploying AI systems, which is increasingly
essential for roles in both academia and industry.
An essential requirement for trustworthy artificial intelligence (AI) is its explainability to human users [1].
One principled approach to explainable AI is via the concept of simulatability: Roughly speaking,
the better a specific human user can anticipate the behaviour of an AI, the more explainable it
is (to this particular user).
In the context of machine learning (ML), which is at the core of many current AI systems, we can formalize
explainability by the notion of a user signal [2]. The user signal is some subjective characteristic of
data points. We can think of a user signal as a specific feature that a user assigns to a data point.
We denote the user signal by u(x); it is a function of the raw features x of a data point.
In this blog post, we explore explainable ML using a straightforward weather prediction example based
on real-world data from the Finnish Meteorological Institute (FMI). The data was recorded at an FMI weather
station near Kustavi Isokari.
We aim to create an explainable weather prediction model that aligns with human
intuition. A user signal models this intuition.
Predicting Maximum Daytime Temperature with Explainable AI: A Real-World Example
Given historical weather data, we want to learn to predict the maximum daytime temperature (maxtemp)
solely from the minimum daytime temperature (mintemp). To this end, we first download weather recordings
from the FMI website
into a csv file KustaviIsokari.csv. The following code snippet reads the downloaded data from the csv file
and stores the features and labels in the variables X and y, respectively [3]:
import pandas as pd

# Load the data from the CSV file
file_path = "KustaviIsokari.csv"  # Replace with the actual file path
data = pd.read_csv(file_path)

# Extract relevant columns
data = data[["Maximum temperature [°C]", "Minimum temperature [°C]"]]

# Randomly select a specified number of data points
num_data_points = 15  # Specify the number of data points to select
data = data.sample(n=num_data_points, random_state=42)

# Build feature matrix (with one column in this case) and label vector
X = data[["Minimum temperature [°C]"]]
y = data["Maximum temperature [°C]"]
Using the features X and labels y, we next train two basic ML models:
a decision tree regressor and a polynomial regressor [4].
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Train a Decision Tree Regressor
maxdep = 3
dt_model = DecisionTreeRegressor(random_state=42, max_depth=maxdep)
dt_model.fit(X, y)

# Train a Polynomial Regression model
poly_degree = 7  # Specify the degree of the polynomial
poly_features = PolynomialFeatures(degree=poly_degree)
X_train_poly = poly_features.fit_transform(X)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y)
We then plot the predictions of the trained models along with training data.
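The plotting code itself is not shown above; a minimal sketch with matplotlib (reusing the variables X, y, dt_model, poly_model, and poly_features from the previous snippets) could look as follows:

import numpy as np
import matplotlib.pyplot as plt

# evaluate both trained models on a fine grid of mintemp values
x_grid = pd.DataFrame(
    {"Minimum temperature [°C]": np.linspace(X.min().item(), X.max().item(), 200)}
)
plt.scatter(X, y, color="black", label="training data")
plt.plot(x_grid, dt_model.predict(x_grid), label="decision tree")
plt.plot(x_grid, poly_model.predict(poly_features.transform(x_grid)), label="polynomial")
plt.xlabel("mintemp [°C]")
plt.ylabel("maxtemp [°C]")
plt.legend()
plt.show()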
How do you like the behaviour of the trained models? Both models predict increasing maxtemp for
decreasing mintemp for very cold days (towards the left in the above plot). Moreover, the polynomial
regressor predicts decreasing maxtemp with increasing mintemp for warm days (towards the right in
the above plot). Predicting a decreasing maxtemp for increasing mintemp is counter-intuitive.
Enhancing Explainability in AI with Data Augmentation
It seems reasonable to assume that higher minimum temperatures result in higher maximum temperatures.
We can exploit this intuition (or user knowledge) to regularize the above model training
via data augmentation [5]:
For each original data point, with mintemp x and maxtemp y, we add two
additional data points:
one with mintemp x+1 and maxtemp y+1
one with mintemp x-1 and maxtemp y-1
Note that this data augmentation strategy can be interpreted as
a specific choice for a user signal u(x). In particular, the user signal
satisfies u(x+1)=y+1 and u(x-1)=y-1 for any data point (x,y)
in the original training set.
The following code snippet implements the above data augmentation and
then retrains the decision tree and polynomial regression model.
# Augment the training set
augmented_data = data.copy()
augmented_data["Minimum temperature [°C]"] = data["Minimum temperature [°C]"] + 1
augmented_data["Maximum temperature [°C]"] = data["Maximum temperature [°C]"] + 1
augmented_data_minus = data.copy()
augmented_data_minus["Minimum temperature [°C]"] = data["Minimum temperature [°C]"] - 1
augmented_data_minus["Maximum temperature [°C]"] = data["Maximum temperature [°C]"] - 1

# Combine original and augmented data
data_augmented = pd.concat([data, augmented_data, augmented_data_minus], ignore_index=True)

# Split augmented data into predictors and target
X_augmented = data_augmented[["Minimum temperature [°C]"]]
y_augmented = data_augmented["Maximum temperature [°C]"]

# Train a Decision Tree Regressor
dt_model = DecisionTreeRegressor(random_state=42, max_depth=3)
dt_model.fit(X_augmented, y_augmented)

# Train a Polynomial Regression model
poly_degree = 7  # Specify the degree of the polynomial
poly_features = PolynomialFeatures(degree=poly_degree)
X_train_poly = poly_features.fit_transform(X_augmented)
poly_model = LinearRegression()
poly_model.fit(X_train_poly, y_augmented)
And here are the resulting trained DT and polynomial regressor, along with the
original and augmented data points. Carefully note that the trained models
now respect my (your?) intuition that maxtemp is monotonically increasing
with mintemp. In this sense, these models can be considered more explainable
than the trained models without data augmentation.
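One way to sanity-check the claimed monotonicity numerically (a sketch, reusing the variables from the snippet above):

import numpy as np

# evaluate the retrained models on an increasing grid of mintemp values
x_grid = pd.DataFrame(
    {"Minimum temperature [°C]": np.linspace(
        X_augmented.min().item(), X_augmented.max().item(), 200)}
)
dt_monotone = bool(np.all(np.diff(dt_model.predict(x_grid)) >= 0))
poly_monotone = bool(np.all(np.diff(poly_model.predict(poly_features.transform(x_grid))) >= 0))
print("decision tree monotone:", dt_monotone)
print("polynomial monotone:", poly_monotone)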
References and Further Reading on Explainable AI
[1] High-Level Expert Group on Artificial Intelligence. (2019). Ethics Guidelines for Trustworthy AI. European Commission.
[2] A. Jung and P. H. J. Nardelli, “An Information-Theoretic Approach to Personalized Explainable Machine Learning,” IEEE Signal Processing Letters, vol. 27, pp. 825–829, 2020. https://doi.org/10.1109/LSP.2020.2993176
[3] A Python script to reproduce the presented results is available online.
[4] A. Jung, Machine Learning: The Basics, Springer, 2022. https://doi.org/10.1007/978-981-16-8193-6
[5] L. Zhang, G. Karakasidis, A. Odnoblyudova, et al., “Explainable Empirical Risk Minimization,” Neural Computing and Applications 36, 3983–3996 (2024). https://doi.org/10.1007/s00521-023-09269-3
Figure: Alex teaching a class during a Machine Learning course at Aalto University.
In recent years, there’s been a troubling trend in higher education:
the gradual transformation of taxpayer-funded universities into profit-driven
entities. Under the guise of global competitiveness, many institutions are
becoming little more than research factories, churning out papers and
funding applications to sustain bloated systems while neglecting their
core missions—education and the public good.
The Commodification of Knowledge
Taxpayer-funded universities are pillars of society. They are meant to be sanctuaries of learning, innovation,
and critical thinking. But as financial pressures mount and market-oriented policies take hold, the focus has
shifted alarmingly.
Academic success is increasingly measured by metrics—the number of publications, grant money secured,
or the size of international collaborations (Hicks et al., 2015). These metrics prioritize quantity over quality,
reducing groundbreaking research to a box-ticking exercise. Worse, they incentivize universities to exploit
their researchers, burden faculty with administrative tasks, and disregard the well-being of students.
Who Benefits? Not the Taxpayer.
During my academic career, I have observed a growing trend (or fashion) at universities to funnel resources
into large, well-funded research groups. These groups act like private companies within the university ecosystem,
hoarding resources and focusing on high-revenue projects. While such groups rake in grants, smaller departments
and programs suffer budget cuts, and students face higher tuition fees with diminishing returns in terms of education quality (Slaughter & Rhoades, 2004).
Taxpayers fund these institutions expecting them to educate future generations and contribute to society—not to turn into profit centers that primarily serve private interests.
The Human Cost of Overproduction
This obsession with profitability also has devastating consequences for academia’s human core:
Faculty Burnout: Professors are stretched thin and expected to churn out high-quality publications and secure
external funding (Shanafelt et al., 2015). This leaves little time for mentorship, directly impacting the next generation of scholars.
Precarious Positions: Early-career researchers typically face insecure, short-term contracts, jumping from project to
project without stable employment (Chisholm & Finelli, 2023).
Student Neglect: Overcrowded classrooms, reduced course offerings, and a lack of individual attention are the
norm as resources are funneled into research output rather than teaching quality (Ingleby, 2021).
A Call for Change
It’s time to reclaim universities as public institutions dedicated to serving society. Here are three steps we must take:
Prioritize Education: Public funding must be used to ensure high-quality education instead of
funneling taxpayer money into expanding research portfolios.
Value Impact Over Metrics: Shift the focus away from publication counts and grant amounts. Reward meaningful research
that directly benefits society, rather than research done for the sake of securing the next funding cycle (Moher et al., 2018).
Protect Academic Freedom: Universities must provide faculty with the freedom to pursue research and teaching without the
relentless pressure to commercialize or industrialize their work (Altbach et al., 2017).
A University for the People
Taxpayer-funded universities are a collective investment in the future. They should serve the public, not function as cash cows
for a select few. Education and research are not commodities—they are the bedrock of a thriving, equitable society. Let’s demand better.
Let’s ensure that universities remain places of inspiration, innovation, and integrity—not profit.
References:
Hicks, D., Wouters, P., Waltman, L., de Rijcke, S., & Rafols, I. (2015). Bibliometrics: The Leiden Manifesto for research metrics. Nature, 520(7548), 429-431.
Slaughter, S., & Rhoades, G. (2004). Academic Capitalism and the New Economy: Markets, State, and Higher Education. Johns Hopkins University Press.
Shanafelt, T. D., Hasan, O., Dyrbye, L. N., et al. (2015). Changes in burnout and satisfaction with work-life balance in physicians and the general US working population between 2011 and 2014. Mayo Clinic Proceedings, 90(12), 1600-1613.
Chisholm, K.I., Finelli, M.J. (2023). Enhancing research culture in academia: a spotlight on early career researchers. BMC Neurosci 24, 46. https://doi.org/10.1186/s12868-023-00816-1
Ingleby, E. (2021). Neoliberalism and Higher Education. In: Neoliberalism Across Education. Palgrave Studies on Global Policy and Critical Futures in Education. Palgrave Pivot, Cham. https://doi.org/10.1007/978-3-030-73962-1_6
Moher D, Naudet F, Cristea IA, Miedema F, Ioannidis JPA, Goodman SN (2018) Assessing scientists for hiring, promotion, and tenure. PLoS Biol 16(3): e2004089. https://doi.org/10.1371/journal.pbio.2004089
Altbach, P. G., Reisberg, L., & de Wit, H. (2017). Responding to Massification: Differentiation in Postsecondary Education Worldwide. Rotterdam: Sense Publishers.
Let me know your thoughts: Are you seeing this trend in your own university or organization? How can we push back against it?
My name is Alexander Jung, and I am a tenured Associate Professor specializing in Machine Learning. With over 15 years of
experience in both research and education, I have dedicated my career to advancing the field of machine learning and mentoring the next generation of experts.