Our team - Paul Perry, Elena Cuoco* and Zygmunt Zając - did not discover the leak. We didn’t attain a winning score, but on the other hand, the features we used and the analysis we present are not tainted by using that source of extraneous information.
This document contains high-level insights. For a more detailed study, see the notebook.
We were able to score above 0.88 AUC with a single XGBoost model using just 100 features. It appears that the most important features come from three groups: demographics, general patient activity, and practitioner information.
Demographics that matter are location and age, and - to a lesser extent - education level, household income and ethnicity.
Patient activity (in the general sense, not referring to the activity table) consists mostly of basic statistics, such as the total number of procedures, the number of visits, and the time between the first and last visit.
Practitioner information relates to the doctors - for example, practitioner ID or percentage of screeners among the patients - as well as location, which of course is correlated with the patient’s location.
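The basic patient activity statistics mentioned above can be sketched with a pandas group-by. This is a minimal illustration, not our actual pipeline: the toy `activity` table and its `visit_date` and `procedures` columns are assumptions made for the example, and `date_delta` is counted in calendar months.

```python
import pandas as pd

# Toy per-visit table; column names and values are hypothetical.
activity = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 3],
    "visit_date": pd.to_datetime([
        "2008-03-01", "2010-06-15", "2014-01-20",
        "2012-05-05", "2012-11-30",
        "2013-07-07",
    ]),
    "procedures": [2, 1, 4, 3, 3, 1],
})

# One row per patient: visit count, procedure total, first and last visit.
stats = activity.groupby("patient_id").agg(
    visits=("visit_date", "count"),
    num_procedures=("procedures", "sum"),
    first_visit=("visit_date", "min"),
    last_visit=("visit_date", "max"),
)

# Months between the first and the last visit, based on calendar year and month.
stats["date_delta"] = (
    (stats["last_visit"].dt.year - stats["first_visit"].dt.year) * 12
    + (stats["last_visit"].dt.month - stats["first_visit"].dt.month)
)
print(stats)
```

A patient with a single visit gets a `date_delta` of zero, which a tree model can still use as a signal.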
Feature importance plot
Here is the feature importance plot from our best single model showing the details.
The most important features are, in order:
- (activity) num_procedures - a total count of procedures for a patient
- (location) cbsa_y - CBSA of primary_practitioner (see the Location section below)
- (practitioner) phy_practitioner_id - physician who diagnosed a GYN checkup
- (location) patient_state
- (location) cbsa_x - CBSA of the most visited practitioner for diagnosis across all diagnoses
- (activity) visits - total count of visits from patient_activity
- (activity) num_visits - total count of visits to the primary CBSA
- (activity) date_delta - last_visit minus first_visit, in months (the medical data spans 2008 to 2014)
- (practitioner) obg_screen_pct - screening % of the OBG
- (practitioner) obg_id - primary_practitioner_id with specialty_code in ('OBG', 'GYN', 'REN', 'OBS')
- (activity) V72_date - date of first GYN checkup V72.31
- (practitioner) obg_patient_count - number of patients of the OBG
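Practitioner-level features such as obg_screen_pct and obg_patient_count can be derived by aggregating patients per practitioner and joining the result back. The sketch below assumes a toy `patients` table with made-up values; only the column names obg_id, is_screener, obg_screen_pct and obg_patient_count come from the list above.

```python
import pandas as pd

# Toy patient table; values are made up for illustration.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3, 4, 5],
    "obg_id": ["A", "A", "A", "B", "B"],
    "is_screener": [1, 1, 0, 0, 1],
})

# Per-practitioner patient count and share of screeners.
per_obg = (
    patients.groupby("obg_id")["is_screener"]
    .agg(obg_patient_count="count", obg_screen_pct="mean")
    .reset_index()
)

# Attach the practitioner-level features to every patient row.
features = patients.merge(per_obg, on="obg_id", how="left")
print(features)
```

On real data the screening percentage can only be computed from rows with known labels, so test-set patients would inherit the value of their practitioner from the training set.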
A bunch of features in the middle of the plot - of the form A_2014, R_2014, etc. - are activity counts for a given year (A and R types respectively).
The most important demographics are location and age. We look at those more closely.
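Per-year activity counts of this kind can be built with a cross-tabulation. The following is a minimal sketch under assumptions: the `activity_type` and `activity_year` column names and the toy rows are hypothetical, chosen only to produce columns named like A_2014 and R_2014.

```python
import pandas as pd

# Toy activity table; column names and values are hypothetical.
activity = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "activity_type": ["A", "R", "A", "R", "R"],
    "activity_year": [2013, 2014, 2014, 2014, 2014],
})

# Build labels such as "A_2014", then count rows per patient and label.
key = activity["activity_type"] + "_" + activity["activity_year"].astype(str)
yearly = pd.crosstab(activity["patient_id"], key)
print(yearly)
```

Combinations that never occur for a patient come out as zero counts rather than missing values.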
Location features include a patient’s state and CBSA (see below).
CBSA stands for Core-Based Statistical Area. Defined by the US Office of Management and Budget, a CBSA is a geographic area consisting of an urban center of at least 10,000 people plus adjacent areas that are socioeconomically tied to the urban center by commuting.
We matched each patient’s location with a CBSA code and used that as a feature.
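Matching a location to a CBSA code amounts to a left join against a lookup table. The sketch below is illustrative only: the `zip` key, the `crosswalk` table and its rows are assumptions, since the exact location field and lookup source are not described here.

```python
import pandas as pd

# Toy patient table; the "zip" key is a hypothetical location field.
patients = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "zip": ["10001", "94103", "99999"],
})

# Toy location-to-CBSA lookup; rows are illustrative, not a real crosswalk file.
crosswalk = pd.DataFrame({
    "zip": ["10001", "94103"],
    "cbsa": [35620, 41860],
})

# Left join keeps every patient; unmatched locations get a sentinel code.
merged = patients.merge(crosswalk, on="zip", how="left")
merged["cbsa"] = merged["cbsa"].fillna(-1).astype(int)
print(merged)
```

Using a sentinel such as -1 keeps patients outside any CBSA in the data as their own category.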
In : train.groupby( 'patient_age_group' ).is_screener.mean()
Out:
patient_age_group
24-26    71.9%
27-29    70.5%
30-32    67.8%
33-35    65.7%
36-38    64.2%
39-41    62.5%
42-44    60.3%
45-47    58.1%
48-50    55.9%
51-53    53.7%
54-56    51.0%
57-59    49.0%
60-62    46.0%
63-65    40.8%
66-68    34.5%
69-71    28.5%
In : train.groupby( 'patient_age_group' ).is_screener.mean().plot( kind = 'bar', color = 'g' )
One can see that the is_screener percentage declines strictly with each age group.
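The strict decline can be checked directly from the percentages listed above. This small sketch re-enters those numbers by hand into a pandas Series (the variable names are ours) and looks at the step from each age group to the next.

```python
import pandas as pd

# Screener percentages per age group, copied from the output above.
rates = pd.Series(
    [71.9, 70.5, 67.8, 65.7, 64.2, 62.5, 60.3, 58.1,
     55.9, 53.7, 51.0, 49.0, 46.0, 40.8, 34.5, 28.5],
    index=["24-26", "27-29", "30-32", "33-35", "36-38", "39-41",
           "42-44", "45-47", "48-50", "51-53", "54-56", "57-59",
           "60-62", "63-65", "66-68", "69-71"],
    name="is_screener_pct",
)

# Change relative to the previous age group; the first group has no predecessor.
drops = rates.diff().dropna()

strictly_declining = bool((drops < 0).all())
steepest_group = drops.idxmin()
print(strictly_declining, steepest_group)
```

The steps are small up to age 62 and become much larger from the 63-65 group onwards.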
It appears that if you were to select one data table, procedures offer the most predictive information. Just raw procedure counts by patient and by code allowed us to exceed 0.8 AUC. It is therefore worth asking which procedure codes matter most to a classifier. We performed an L1 (Lasso) feature selection using Vowpal Wabbit:
In : varinfo[['procedure_code', 'procedure_description', 'RelScore']].head( 10 )
Out:
   procedure_code  procedure_description                             RelScore
0  57454           COLPOSCOPY CERVIX BX CERVIX & ENDOCRV CURRETAGE   100.00%
1  81252           GJB2 GENE ANALYSIS FULL GENE SEQUENCE             96.93%
2  57456           COLPOSCOPY CERVIX ENDOCERVICAL CURETTAGE          95.00%
3  57455           COLPOSCOPY CERVIX UPPR/ADJCNT VAGINA W/CERVIX BX  91.42%
4  S4020           IN VITRO FERTILIZATION PROCEDURE CANCELLED BEFOR  85.76%
5  S0605           DIGITAL RECTAL EXAMINATION, MALE, ANNUAL          83.64%
6  G0143           SCREENING CYTOPATHOLOGY, CERVICAL OR VAGINAL (AN  78.39%
7  90696           DTAP-IPV VACCINE CHILD 4-6 YRS FOR IM USE         76.98%
8  S4023           DONOR EGG CYCLE, INCOMPLETE, CASE RATE            76.67%
9  69710           IMPLTJ/RPLCMT EMGNT BONE CNDJ DEV TEMPORAL BONE   72.06%
The procedures above are positively associated with a patient being a screener. Here are the codes that indicate otherwise:
In : varinfo[['procedure_code', 'procedure_description', 'RelScore']].tail( 10 )
Out:
       procedure_code  procedure_description                             RelScore
14337  K0735           SKIN PROTECTION WHEELCHAIR SEAT CUSHION, ADJUSTA  -62.48%
14338  34805           EVASC RPR AAA AORTO-UNIILIAC/AORTO-UNIFEM PROSTH  -64.68%
14339  L5975           ALL LOWER EXTREMITY PROSTHESIS, COMBINATION SING  -65.39%
14340  89321           SEMEN ANALYSIS SPERM PRESENCE&/MOTILITY SPRM      -65.51%
14341  S9145           INSULIN PUMP INITIATION, INSTRUCTION IN INITIAL   -69.96%
14342  00632           ANESTHESIA LUMBAR REGION LUMBAR SYMPATHECTOMY     -77.13%
14343  27756           PRQ SKELETAL FIXATION TIBIAL SHAFT FRACTURE       -78.61%
14344  3303F           AJCC CANCER STAGE IA, DOCUMENTED (ONC), (ML)      -82.77%
14345  23675           CLTX SHOULDER DISLC W/SURG/ANTMCL NECK FX W/MANJ  -83.14%
14346  Q4111           GAMMAGRAFT, PER SQUARE CENTIMETER                 -85.49%
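For readers without Vowpal Wabbit, the idea of L1 feature selection on procedure counts can be illustrated with scikit-learn instead. This is a swapped-in technique, not the run that produced the tables above: the data is synthetic, the three code names are placeholders, and scaling the weights so that the largest magnitude equals 100% is our assumption about how a relative score like RelScore could be computed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic per-patient procedure counts; code names are placeholders.
X = pd.DataFrame(
    rng.poisson(1.0, size=(n, 3)),
    columns=["code_pos", "code_neg", "code_noise"],
)

# The label depends positively on one code, negatively on another.
logit = 1.5 * X["code_pos"] - 1.5 * X["code_neg"]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# L1-penalised logistic regression; a smaller C means stronger regularisation.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05, random_state=0)
model.fit(X, y)

# Relative score: each weight as a percentage of the largest absolute weight.
weights = pd.Series(model.coef_[0], index=X.columns)
rel_score = 100 * weights / weights.abs().max()
print(rel_score.sort_values(ascending=False))
```

Sorting the relative scores puts codes that push towards screening at the top and codes that push away from it at the bottom, in the same spirit as the head and tail listings above.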
Complete data and code are available on request.