Machine learning made easy

Classifying time series using feature extraction

When you want to classify a time series, there are two options. One is to use a time-series-specific method, for example an LSTM, or a recurrent neural network in general. The other is to extract features from the series and use them with ordinary supervised learning. In this article, we look at how to automatically extract relevant features with a Python package called tsfresh.

The datasets we use come from the Time Series Classification Repository. The site provides information on the best accuracy achieved for each dataset. It looks like we get results close to the state of the art, or better, with every dataset we try.

Time series are trickier than standard tasks, because by definition the examples are not independent (the closer in time they are to each other, the more codependent they are). Think of temperature. If it’s 20 degrees Celsius today, it may be 15 or 25 tomorrow, but probably not 5 or 35, even if these temperatures might occur at another time of the year.

This means we can’t use a normal classifier directly, because a normal classifier assumes independent examples. Moreover, the structure of the data is one level deeper: an example is not a single point, it is a time series consisting of multiple points (steps). Each step might consist of several attributes, for example temperature, humidity and wind speed.

However, we can reduce a series to a single point by extracting features. For example, if we’re dealing with a time series of daily weather over a month, we could use the following features:

  • minimum temperature
  • maximum temperature
  • average temperature
  • median temperature
  • variance of temperatures
  • minimum humidity
  • maximum humidity

And, in fact, many more. As you might guess, inventing and implementing them can be tedious. Luckily, there is a Python package called tsfresh, which extracts a boatload of features automatically.
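To get a sense of what the hand-rolled version looks like, here is a minimal sketch computing the features listed above with pandas. The weather DataFrame and the feature names are hypothetical, made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily weather readings for one month
rng = np.random.default_rng(0)
weather = pd.DataFrame({
    'temperature': rng.normal(20, 5, 30),
    'humidity': rng.uniform(40, 90, 30),
})

# Reduce the whole series to a single feature vector
features = {
    'temp_min': weather['temperature'].min(),
    'temp_max': weather['temperature'].max(),
    'temp_mean': weather['temperature'].mean(),
    'temp_median': weather['temperature'].median(),
    'temp_var': weather['temperature'].var(),
    'humidity_min': weather['humidity'].min(),
    'humidity_max': weather['humidity'].max(),
}
```

Multiply this by dozens of feature types and several attributes per step, and the tedium becomes clear.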

One can expect most of them to be irrelevant, so it’s good to select those with predictive power. Tsfresh does that too. Note that this step uses classification labels, so to avoid label leakage, you should first split the data set into training and validation parts and only use the training part for feature selection. Otherwise your validation results will be overly optimistic.
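The split-before-selecting principle can be sketched with plain NumPy. Here a simple correlation filter stands in for tsfresh’s selection step, and the data is synthetic; the point is only that the validation rows’ labels never influence which features are kept:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))   # extracted features, one row per series
# Only feature 0 actually drives the label
y = (X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Split BEFORE selecting, so validation labels cannot leak
# into the feature selection step
train_idx = np.arange(80)
valid_idx = np.arange(80, 100)

# Stand-in for the selection step: keep the features whose absolute
# correlation with the target, on the training part only, is highest
corr = np.array([abs(np.corrcoef(X[train_idx, j], y[train_idx])[0, 1])
                 for j in range(X.shape[1])])
keep = np.argsort(corr)[::-1][:5]

X_train = X[train_idx][:, keep]
X_valid = X[valid_idx][:, keep]
```

The validation set is then scored on features chosen without ever seeing its labels.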

Feature extraction and selection. The serpent symbolizes time series data. Art by Ungoogleable Michaelangelo.

As far as we understand, tsfresh uses pairwise (feature-target) significance tests for selection. This might pose a problem if the target happens to be determined solely by an interaction of features and not by any one feature separately.

In practice

We grabbed three of the biggest datasets: FordA, FordB, and Wafer. The time series from the repository all appear to be one-dimensional (for example, temperature, or humidity, but not both). In this setup, each series is a row in the CSV file and columns represent time steps:

In [9]: d.head()
       0       1        2        3
0  1.01430  1.0143  1.01430  1.01430
1 -0.88485 -1.0375 -0.97771 -1.01690
2  0.58040  0.5804  0.59777  0.59777
3 -0.88390 -1.0371 -0.97998 -1.01210
4  1.10500  1.2856  1.19630  1.25610

Therefore, we need to reshape the data:

d = d.stack()
d.index.rename([ 'id', 'time' ], inplace = True )
d = d.reset_index()

To get the tsfresh format:

In [11]: d.head()
   id  time       0
0   0     0  1.0143
1   0     1  1.0143
2   0     2  1.0143
3   0     3  1.0143
4   0     4  1.0143

Feature extraction and selection are quite compute-intensive, so tsfresh does them in parallel. A byproduct of this is that one needs to write programs in the if __name__ == '__main__': style, otherwise multiprocessing goes haywire. Alternatively, one can set the n_jobs parameter to 1.

The feature extraction step in particular takes a long while.

f = extract_features( d, column_id = "id", column_sort = "time" )
# Feature Extraction: 20it [22:33, 67.67s/it]

Some of the feature constructors output nulls. To deal with them, tsfresh provides the impute() function.

impute( f )
assert f.isnull().sum().sum() == 0

When selecting, there is a hyperparameter to tune: fdr_level. It is the theoretical expected percentage of irrelevant features among all created features. By default, it is set pretty low, at 5%. As long as our downstream classifier is able to deal with non-informative features (and which one isn’t?), we might want to increase fdr_level to 0.5, or even 0.9, depending on the number of features we get from selection. On the other hand, we want the examples-to-features ratio to be as high as possible, to get the best generalization and avoid the curse of dimensionality.
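To see why a higher fdr_level keeps more features, here is a sketch of the Benjamini-Hochberg procedure, a standard way to control the false discovery rate (tsfresh’s selection uses a procedure from this family; the p-values below are made up):

```python
import numpy as np

def benjamini_hochberg(p_values, fdr_level):
    """Return a boolean mask of features kept at the given FDR level."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    # Step-up rule: find the largest k with p_(k) <= k/m * fdr_level
    thresholds = fdr_level * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    keep = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.nonzero(below)[0])
        keep[order[:cutoff + 1]] = True
    return keep

# A permissive fdr_level keeps more features than a strict one
p = np.array([0.001, 0.008, 0.04, 0.2, 0.5, 0.9])
strict = benjamini_hochberg(p, 0.05)
loose = benjamini_hochberg(p, 0.5)
```

Raising the level trades a few more false discoveries for a richer feature set, which the downstream classifier can usually tolerate.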

In [2]: run
loading data/wafer/features.csv
selecting features...
selected 247 features.
saving data/wafer/train.csv
saving data/wafer/test.csv

Selected features, divided into train and test sets. Art by Nikolai Bartram.

Afterwards, we are free to train and evaluate a few classifiers. Logistic regression on scaled features usually works fine.
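A minimal sketch of that recipe with scikit-learn, on synthetic data standing in for the selected features (the scale disparities are exaggerated on purpose, since regularized logistic regression is sensitive to features on wildly different scales):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical stand-in for the selected features, on very different scales
X = rng.normal(size=(200, 10)) * np.array([1, 100, 0.01] + [1] * 7)
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

# Scale, then fit logistic regression, as one pipeline
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X[:150], y[:150])
accuracy = clf.score(X[150:], y[150:])
```

Wrapping the scaler and the classifier in one pipeline also ensures the scaling statistics are learned from the training part only.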

The complete code is available on GitHub. It should work with all [binary classification] datasets from the Time Series Repository, because they are all in the same format. One just needs to remove the ARFF headers from the CSV files after downloading.