Here’s the final version of the script, with two ideas improving on our previous take: searching in product names and spelling correction.
As far as product names go, we will use product data available in XML format to extract SKU and name for each product:
8564564 Ace Combat 6: Fires of Liberation Platinum Hits
2755149 Ace Combat: Assault Horizon
1208344 Adrenalin Misfits
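As a rough illustration, the extraction might look like the sketch below. The element names `product`, `sku` and `name`, and the inline `SAMPLE_XML` document, are assumptions made for this example; the real product file may be laid out differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; the real product data may use a different layout.
SAMPLE_XML = """<products>
  <product>
    <sku>8564564</sku>
    <name>Ace Combat 6: Fires of Liberation Platinum Hits</name>
  </product>
  <product>
    <sku>2755149</sku>
    <name>Ace Combat: Assault Horizon</name>
  </product>
</products>"""


def extract_sku_names(xml_text):
    """Return a list of (sku, name) pairs found in the XML text."""
    root = ET.fromstring(xml_text)
    pairs = []
    for product in root.iter("product"):
        sku = product.findtext("sku")
        name = product.findtext("name")
        # Skip products that are missing either field.
        if sku and name:
            pairs.append((sku.strip(), name.strip()))
    return pairs


for sku, name in extract_sku_names(SAMPLE_XML):
    print(sku, name)
```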
Further, we will process the names in the same way we processed queries:
8564564 acecombat6firesofliberationplatinumhits
2755149 acecombatassaulthorizon
1208344 adrenalinmisfits
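Judging by the examples above, the processing amounts to lowercasing and dropping everything that is not a letter or a digit. A minimal sketch under that assumption (the function name `normalize` is ours, and the actual script may apply further rules):

```python
import re


def normalize(text):
    """Lowercase the text and remove every character that is not a-z or 0-9."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


print(normalize("Ace Combat 6: Fires of Liberation Platinum Hits"))
print(normalize("L.A. Noire"))
```

Using the same function for queries and names is what makes the two directly comparable later on.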
When we need to fill in some predictions, instead of taking them from the benchmark, we will search in names. If that is not enough, then we go to the benchmark.
But wait! There’s more. We will spell-correct test queries when looking in our query -> sku mapping. For example, we may have a SKU for L.A. Noire (lanoire), but not for L.A. Noir (lanoir). It is easy to see that this is basically the same query, only misspelled. Edit distance is one (just adding “e”), so we will catch that easily.
How do we do spelling correction? The Peter Norvig way: http://norvig.com/spell-correct.html
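The core of that approach is generating every string within edit distance one of the input and keeping those we already know. The sketch below adapts the idea to our normalized queries. Two things here are our assumptions rather than part of Norvig's article: digits are added to the alphabet, and `correct` returns all known candidates instead of picking the single most frequent one.

```python
import string

# Normalized queries contain lowercase letters and digits only.
ALPHABET = string.ascii_lowercase + string.digits


def edits1(word):
    """Return all strings one edit (delete, transpose, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(query, known_queries):
    """Return the known queries within edit distance one of the query."""
    return sorted(c for c in edits1(query) if c in known_queries and c != query)


known = {"lanoire", "gearsofwar", "batmanarkham"}
print(correct("lanoir", known))
print(correct("gearsoffwar", known))
```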
The algorithm is as follows: given a query,
* search in the queries -> SKUs mapping
* if there are fewer than five results, spell-correct the query and search again
* if there are still fewer than five results, search in product names
* if there are still fewer than five results, fill in with popular SKUs
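The steps above can be sketched as a simple cascade. Everything in this example is illustrative: the function and parameter names are ours, the spelling corrector is passed in as a callable, and name search is done by substring matching, which is what the sample output further down suggests but is not confirmed.

```python
def predict(query, query_to_skus, products, popular_skus, correct_fn, n=5):
    """Collect up to n SKUs for a query, falling back step by step.

    query_to_skus: dict mapping a normalized query to a list of SKUs
    products:      list of (sku, normalized_name) pairs
    popular_skus:  list of SKUs ordered by popularity (the benchmark)
    correct_fn:    callable returning spelling candidates for a query
    """
    predictions = []

    def extend(skus):
        # Add SKUs in order, skipping duplicates, until we have n of them.
        for sku in skus:
            if len(predictions) >= n:
                break
            if sku not in predictions:
                predictions.append(sku)

    # 1. Direct lookup in the mapping.
    extend(query_to_skus.get(query, []))
    # 2. Lookup of spell-corrected variants.
    if len(predictions) < n:
        for candidate in correct_fn(query):
            extend(query_to_skus.get(candidate, []))
    # 3. Search in product names (substring match, an assumption).
    if len(predictions) < n:
        extend(sku for sku, name in products if query in name)
    # 4. Fill the rest with popular SKUs.
    if len(predictions) < n:
        extend(popular_skus)
    return predictions


mapping = {"lanoire": ["111"]}
products = [("222", "blazingangels2secretmissionsofwwii")]
popular = ["901", "902", "903", "904", "905"]
print(predict("lanoir", mapping, products, popular, lambda q: ["lanoire"]))
print(predict("blazingangels", mapping, products, popular, lambda q: []))
```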
We can also apply spelling correction to name search, but we’re not sure if it’s going to improve the score.
This time, the script will run a little longer. To make your waiting more pleasant, it will output both name matches and corrected queries, like this:
matches: blazingangels ---> ['blazingangels2secretmissionsofwwii']
matches: dragonknightsaga ---> ['divinityiithedragonknightsaga']
gearsoffwar ---> ['gearsofwar', 'gearsofgwar']
batmanakham ---> ['batmanarkham']
hardenedediton ---> ['hardenededition']
And to sum up:
Found mapping in 26277 / 28241 (0.930455720407)
Found corrected in 4430 / 28241 (0.156864133706)
Used search in 6538 / 28241 (0.231507382883)
Found search in 2975 / 28241 (0.105343295209)
Used benchmark in 6469 / 28241 (0.229064126624)
Does it work? Instead of 72.8%, we get 74.3%. That’s a good ten places on the leaderboard.