Learning from Imbalanced Classes
For this month’s Throwback Thursday, a post that provides insight and concrete advice on how to tackle imbalanced data.
Tom has over 20 years of experience applying machine learning and data science across five different companies. He is co-author of the highly regarded and top-selling book Data Science for Business (O’Reilly, 2013), which is now used in over 140 universities around the world.
Prior to joining SVDS, as a senior architect at Proofpoint, Tom applied machine learning techniques, including social network analysis and probabilistic inference, to email analysis and filtering. While at Stanford’s Center for the Study of Language and Information, he led a DARPA-sponsored project on Transfer Learning. He has also held senior research scientist positions at HP Labs, NYNEX, and GTE Labs.
Tom holds a Ph.D. in Computer Science (Machine Learning) from the University of Massachusetts, Amherst. He is an action editor of the journal Machine Learning; he also serves on the editorial boards of the journals Data Mining and Knowledge Discovery and Big Data, as well as on the advisory board of the Berkeley Extension Data Science Program.
Business leaders cannot afford to ignore their organization’s data—rather, that data should be used to make informed decisions. In this post, Principal Data Scientist Tom Fawcett and Professor of Data […]
You should understand whether the right things have been measured and whether the results are suitable for the business problem.
We (Tom, a machine learning practitioner, and Drew, a professional statistician) have worked together for several years. We believe we have an understanding of the role of each field within data science, which we attempt to articulate here.
In this post, we will look at driving product engagement with behavioral data, as well as building an integrated analytical environment.
The promise of data and analytics for product companies is that they can help you understand usage, and improve your ability to build, deploy, and service products to customers much more accurately and efficiently. In this post, we look at understanding the customer life cycle.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API.
A basic mantra in statistics and data science is "correlation is not causation," meaning that just because two things appear to be related to each other doesn’t mean that one causes the other. This is a lesson worth learning.
Here we share some further thoughts on imbalanced classes, and offer more resources.
This post gives insight and concrete advice on how to tackle imbalanced data.
In this post, we will explore some aspects of the train delay data we’ve been collecting from the Caltrain API over the past few months. The goal is to get our heads into the data before setting off on building a prediction model.
A previous blog post made the point that classifiers shouldn’t use classification accuracy as a performance metric. The next part in this series was going to discuss other evaluation techniques such as ROC curves, profit curves, and lift curves. However, there are several important points to be made first. Here I present a sequence that shows the progression and inter-relation of the issues.
If it’s easy, it’s probably wrong.
A basic mantra in statistics and data science is "correlation is not causation." This is a lesson worth learning.
As an R&D project, we have been playing with data science techniques to understand and predict delays in the Caltrain system.