Machine learning made easy

Predicting wine quality

This post is as much about wine as it is about machine learning, so if you enjoy wine, like we do, you may find it especially interesting. Here’s some R and Matlab code, and if you want to get right to the point, skip to the charts.

There’s a book by Philipp Janert called Data Analysis with Open Source Tools, which, by the way, we would recommend. From this book we found out about the wine quality datasets. There are two, one for red wine and one for white wine, and they are interesting because they contain quality ratings (1 - 10) for a few thousand wines, along with their physical and chemical properties. We could probably use these properties to predict a rating for a wine. We’ll be looking at white and red wine separately, for reasons you will see shortly.

Principal component analysis for white wine

Janert performs a principal component analysis (PCA) and shows the resulting plot for white wine. What’s interesting about this plot is that, judging by the first two principal components, quality is very much correlated with alcohol content and fixed acidity (or pH - it’s basically the same thing).

Moreover, these two properties are easily obtained for other wines, and hence may be of practical value. This is especially true of alcohol content, which is listed on every bottle. Fixed acidity can be measured with a pretty low-cost electronic pH meter. Just dip it and you’re done.

Here’s the plot with PC1 on the horizontal axis and PC2 on the vertical axis:

White wine PCA

Bear in mind that the plot only shows the first two principal components, so what we see here is not the whole story, because there are more components:

White wine PCA components
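For readers who want to try this kind of analysis themselves, here is a minimal PCA sketch in Python with NumPy. It is not Janert's code and not the R or Matlab code linked from this post. Standardizing each column before the decomposition is our assumption, and the small matrix is made-up toy data, not rows from the wine datasets.

```python
import numpy as np

def pca(X):
    """Return (scores, components, explained) for a samples-by-features array.

    Assumption: every column is centered and scaled to unit variance first,
    so properties measured in different units are comparable.
    """
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    U, S, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = S**2 / np.sum(S**2)   # share of total variance per component
    scores = Z @ Vt.T                 # samples projected onto the components
    return scores, Vt, explained

# Toy data for illustration only (NOT from the wine datasets):
# the first two columns move together, the third is mostly unrelated.
toy = np.array([
    [1.0, 2.0, 0.5],
    [2.0, 4.1, 0.4],
    [3.0, 5.9, 0.7],
    [4.0, 8.2, 0.3],
    [5.0, 9.8, 0.6],
])

scores, components, explained = pca(toy)
print(explained)       # variance share of PC1, PC2, PC3
print(scores[:, :2])   # the PC1 / PC2 coordinates one would plot
```

The `explained` array is what a components chart summarizes: if the first two shares are well below one, a PC1 versus PC2 plot leaves out a good part of the variation.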

PCA for red wine

The picture suggests that for red wines, alcohol content is just as important for quality, while fixed acidity is not.

Red wine PCA

Red wine PCA components

Now let’s do some more analysis involving quality, alcohol content and fixed acidity.

Assumptions and terminology

We make two main assumptions:

  1. That the information about Portuguese vinho verde could be generalized to other wines
  2. That the ratings are trustworthy

If they hold, it turns out we can tell a lot about a wine just by knowing its alcohol content and fixed acidity. It’s not enough to predict a rating, but it is enough to provide some guidelines for selecting wines.

Let’s introduce some terminology:

  • good means rating seven or higher
  • bad means rating four or lower
  • mediocre means rating five
  • OK means rating six.

We don’t show the OK category on the charts, because it’s all over the place.
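The terminology above can be written down as a small Python function. The function names are ours, chosen for illustration; only the rating boundaries come from the definitions above.

```python
def rating_category(rating: int) -> str:
    """Map a quality rating to the category names used in this post."""
    if rating >= 7:
        return "good"
    if rating <= 4:
        return "bad"
    if rating == 5:
        return "mediocre"
    return "OK"  # the only rating left is six

def shown_on_charts(rating: int) -> bool:
    """The OK category is left off the charts."""
    return rating_category(rating) != "OK"

print([rating_category(r) for r in (3, 5, 6, 8)])
# ['bad', 'mediocre', 'OK', 'good']
```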

White wine

Good and bad

Red means good. Blue means bad. The horizontal axis is alcohol content, the vertical axis is fixed acidity.

White wine

If alcohol content is at least 11%, or better yet, 12%, you are very likely to have a good wine. On the other hand, if it’s 10% or less, don’t set your hopes too high.

But that’s not the whole story. There’s another factor: the less acidity, the better. Basically, you can get away with lower alcohol content if the wine has relatively low fixed acidity. That’s the space between 10 and 12 percent alcohol.

Overall, one can draw a line. If you see a wine on the right side of that line, it is very likely to be good. If it’s on the other side, then you don’t know.
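The idea of a separating line can be sketched in a few lines of Python. The line is described by two (acidity, alcohol) points. The alcohol endpoints of 10 and 12 percent come from the text above, but the acidity values 5.0 and 9.0 are placeholders we made up for illustration. They were not read off the chart, so treat the function names and the numbers as assumptions.

```python
def alcohol_needed(acidity, low_point, high_point):
    """Alcohol level on the line through two (acidity, alcohol) points."""
    (acid0, alc0), (acid1, alc1) = low_point, high_point
    slope = (alc1 - alc0) / (acid1 - acid0)
    return alc0 + slope * (acidity - acid0)

def verdict(alcohol, acidity, low_point, high_point):
    """'likely good' to the right of the line, otherwise 'unknown'."""
    if alcohol >= alcohol_needed(acidity, low_point, high_point):
        return "likely good"
    return "unknown"

# Hypothetical line for white wine: less acidity means less alcohol is needed.
# The acidity values (5.0 and 9.0) are placeholders, not measured from the plot.
WHITE_LINE = ((5.0, 10.0), (9.0, 12.0))

print(verdict(11.5, 6.0, *WHITE_LINE))  # needs 10.5 at this acidity
print(verdict(10.5, 8.0, *WHITE_LINE))  # needs 11.5 at this acidity
```

Note that the function never answers "bad": on the left side of the line the good and bad wines are mixed, so the honest answer there is that you don't know.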

The Good, The Bad, and The Mediocre

White wine

This picture also shows mediocre examples, in green. As you can see, there are a lot of them and they are distributed similarly to bad ones. This is also the case with red wines (see the chart below).

Red wine

Alcohol OR acidity, please

Red wine

As with white wines, if alcohol content is 12% or more, you’re probably all set. Also, you want a wine which is either high in alcohol or highly acidic. In other words, there are quite a lot of good wines with less than 12% alcohol; what they have in common is high fixed acidity, with some mediocre ones mixed in among them.

Note the slope of the separating line. Compare it with the white wine case: it’s the other way around. There, it’s high alcohol AND low acidity. Here, it’s OR.
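A coarse way to see the AND versus OR difference is to write both rules as boolean tests in Python. This replaces the sloped line with simple thresholds, so it is a rougher approximation than the charts. The alcohol thresholds (11% and 12%) are taken from the text, while the acidity thresholds in the example calls are hypothetical placeholders that the caller would have to choose.

```python
def white_rule(alcohol, acidity, alcohol_min, acidity_max):
    """White wine: high alcohol AND low fixed acidity."""
    return alcohol >= alcohol_min and acidity <= acidity_max

def red_rule(alcohol, acidity, alcohol_min, acidity_min):
    """Red wine: high alcohol OR high fixed acidity."""
    return alcohol >= alcohol_min or acidity >= acidity_min

# Acidity thresholds below (7.0 and 9.0) are placeholders for illustration only.
print(white_rule(12.5, 6.0, alcohol_min=11.0, acidity_max=7.0))  # True
print(red_rule(10.0, 10.0, alcohol_min=12.0, acidity_min=9.0))   # True
print(red_rule(10.0, 7.0, alcohol_min=12.0, acidity_min=9.0))    # False
```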

To sum up, judging a wine on just two properties is rather simplistic. There are other factors to consider, for example age. All other things being equal, we much prefer a two-year-old wine to a one-year-old wine. Still, those charts do tell a story, don’t they?