Last time we wrote about the Facebook challenge. Now it’s time for some more details. The main concept is this: in its original state, the data is useless, because many names refer to the same entity. Specifically, there are about 350k unique names, while the total number of entities is maybe 20k. So cleaning the data is the first and most important step.
These are the steps we take, in order:
- extract all the names, including numbers, from the graph files and the paths file
- compute a fingerprint for each name. We used fingerprints similar to the ones in Google Refine; ours not only sorted the tokens (words) in a name, but also sorted the letters within each token.
Names with the same fingerprint are very likely the same. The remaining distortions are mostly word combinations like these:
a name
a longer name
a still longer name
- this can be dealt with by checking whether a given name’s tokens are a subset of another name’s tokens. If it is a subset of exactly one other name, it’s a good bet that the two refer to the same entity. Applying this iteratively shows that the three names above are likely the same.
- finally, map the (still noisy) resulting canonical names to numbers. Searching with numbers is more efficient than searching with names.
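The fingerprint and subset steps above can be sketched in a few lines. This is a minimal illustration, not the exact competition code; the tokenization regex and helper names are our choices for the example:

```python
import re
from collections import defaultdict

def fingerprint(name):
    # lowercase, strip punctuation, split into alphanumeric tokens
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # sort the letters within each token, then sort the tokens themselves
    return " ".join(sorted("".join(sorted(t)) for t in tokens))

def token_set(name):
    return set(re.findall(r"[a-z0-9]+", name.lower()))

def find_unique_superset(name, others):
    # a name is merged into a longer name only when its tokens are
    # a strict subset of exactly one other name's tokens
    ts = token_set(name)
    supersets = [o for o in others if ts < token_set(o)]
    return supersets[0] if len(supersets) == 1 else None

# names with reordered tokens or reordered letters get the same fingerprint
clusters = defaultdict(list)
for n in ["John Smith", "Smith, John", "Jonh Smith"]:
    clusters[fingerprint(n)].append(n)
```

Names that differ only by token order or by letter transpositions inside a token collapse into one cluster; the subset check then handles the "a name" / "a longer name" cases that fingerprints miss.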
Now we take each path and search the corresponding graph for alternative paths to determine whether the given path is optimal.
First we check the cost of the path. If the cost is zero, the path is optimal by definition, so that’s it. If the cost is bigger than zero, we search for paths with a smaller cost (effectively, the cost - 1). If we find such a path, the given path is clearly not optimal; if we don’t, it is.
If you have some experience with search algorithms, it’s relatively easy to roll your own code for the task. We used a uniform cost search. You can also use graph libraries like NetworkX or igraph, which have built-in search functionality and might be faster. The search was the most compute-intensive part for us: it took a few hours to check all the paths.
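A hand-rolled uniform cost search for this check can look like the sketch below. The graph representation (a dict of dicts with edge costs) is an assumption for illustration, not the competition’s actual format:

```python
import heapq

def cheapest_cost(graph, start, goal):
    # uniform cost search: always expand the frontier node
    # with the lowest path cost so far
    frontier = [(0, start)]
    best = {start: 0}
    while frontier:
        cost, node = heapq.heappop(frontier)
        if node == goal:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale entry
        for nxt, w in graph.get(node, {}).items():
            new_cost = cost + w
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(frontier, (new_cost, nxt))
    return float("inf")  # goal unreachable

def is_optimal(graph, path, path_cost):
    # zero cost is optimal by definition; otherwise the path is optimal
    # exactly when nothing cheaper exists between its endpoints
    if path_cost == 0:
        return True
    return cheapest_cost(graph, path[0], path[-1]) >= path_cost
```

Searching for the cheapest path and comparing its cost against the given one is equivalent to searching for any path of cost - 1 or less, as described above.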
Let’s predict the historical mode, the same way as the benchmark does. This tells us how good the data cleaning was. We got an AUC of 0.690, while the target is 0.705 - far from perfect, but pretty good.
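As a hedged sketch of this baseline, assume the history is a list of (key, label) pairs, with label 1 for an optimal path and 0 otherwise; the grouping key and the sample data here are hypothetical:

```python
from collections import Counter, defaultdict

# hypothetical history: (key, label) pairs
history = defaultdict(list)
for key, label in [("u1", 1), ("u1", 1), ("u1", 0), ("u2", 0)]:
    history[key].append(label)

def predict_mode(key):
    # predict the most common historical label for this key
    return Counter(history[key]).most_common(1)[0][0]
```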
Then let’s try a mean instead of a mode: this gives a better result. Markov chain predictions are better still, and that’s where we ended up, with 0.712 on the public leaderboard and the benchmark beaten.
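The post doesn’t spell out the Markov chain details, so treat the following as a loose illustration only: a first-order chain over a key’s label sequence, estimating P(next label | last label) from smoothed transition counts:

```python
from collections import defaultdict

def markov_predict(sequence, smoothing=1.0):
    # count transitions between consecutive labels in the sequence
    counts = defaultdict(lambda: defaultdict(float))
    for prev, nxt in zip(sequence, sequence[1:]):
        counts[prev][nxt] += 1
    last = sequence[-1]
    # Laplace-smoothed probability that the next label is 1
    ones = counts[last][1] + smoothing
    total = counts[last][0] + counts[last][1] + 2 * smoothing
    return ones / total
```

Unlike the mode, this produces a graded probability, which is what AUC rewards.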