Today it’s about the Merck challenge - let’s beat the benchmark real quick. Not by much, but quick. If you look at the benchmark’s source code, you’ll notice this:
# data sets 1 and 6 are too large to fit into memory and run basic
# random forest. Sample 20% of data set instead.
if (i== 1 | i==6) {
    Nrows = length(train[,1])
    train <- train[sample(Nrows, as.integer(0.2*Nrows)),]
}
It means that only 20% of sets one and six is used for training. This suggests an angle of attack, because more data beats a cleverer algorithm [1].
Let’s try our pal VW. We’ll just convert those sets to Vowpal Wabbit format, run training, run prediction, and convert the results to Kaggle format. OK, a score of 0.39, we’re done for the evening.
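For the curious, the two conversion steps could look roughly like the sketch below. It is not the script we actually ran, just a minimal illustration under a few assumptions: the CSV has an ID column called `MOLECULE` and a target column called `Act`, every other column is a numeric descriptor, the namespace name `f` is arbitrary, and the submission header is hypothetical (check the competition's sample submission for the real one). Zero-valued features are skipped, since VW treats missing features as zeros anyway.

```python
import csv
import io


def row_to_vw(row, label_col="Act", id_col="MOLECULE"):
    """Turn one CSV row (a dict) into a VW line: label 'tag |f name:value ...

    Column names are assumptions; test rows without a label get a dummy 0.
    """
    features = []
    for name, value in row.items():
        if name in (label_col, id_col):
            continue
        if float(value) != 0.0:  # keep the file sparse, drop zero features
            features.append(name + ":" + value.strip())
    label = row.get(label_col) or "0"
    tag = row.get(id_col, "")
    line = label + " '" + tag + " |f " + " ".join(features)
    return line.rstrip()


def csv_to_vw(csv_file, vw_file, **kwargs):
    """Convert a whole CSV file object to VW format, one example per line."""
    for row in csv.DictReader(csv_file):
        vw_file.write(row_to_vw(row, **kwargs) + "\n")


def predictions_to_csv(pred_file, out_file, header=("MOLECULE", "Prediction")):
    """Turn VW prediction lines ("prediction tag") into a two-column CSV.

    The header is hypothetical, not the official submission format.
    """
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(header)
    for line in pred_file:
        parts = line.split()
        if len(parts) == 2:
            prediction, tag = parts
            writer.writerow([tag, prediction])


if __name__ == "__main__":
    # Tiny made-up sample, just to show the shape of the output.
    sample = io.StringIO("MOLECULE,Act,D_1,D_2,D_3\nM_1,1.5,0,2,1\n")
    out = io.StringIO()
    csv_to_vw(sample, out)
    print(out.getvalue(), end="")
```

Between the two conversions, training and prediction would be something like `vw -d train.vw -f model` followed by `vw -t -d test.vw -i model -p preds.txt`. Those are standard VW options, but which settings gave the score above isn't recorded here, so treat the commands as a starting point.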
Earlier we tried a random forest implementation that could fit the whole set into memory, but it still took maybe an hour to run, and the result of that first attempt wasn’t so good. We’re not into this kind of tempo, so we explored other possibilities.
[1] Pedro Domingos - A Few Useful Things to Know about Machine Learning (PDF)