FastML

Machine learning made easy

Basic Insight

Our team - Paul Perry, Elena Cuoco* and Zygmunt Zając - did not discover the leak. We didn’t attain a winning score, but on the other hand, the features we used and the analysis we present are not tainted by using that source of extraneous information.

This document contains high-level insights. For a more detailed study, see the notebook.

*Elena, as it happens, is part of the gravitational waves discovery team and a co-author of the paper. Congratulations!

Executive summary

We were able to score > 0.88 AUC with a single XGBoost model using just 100 features. The most important features appear to come from three groups: demographics, general patient activity, and practitioner information.

Demographics that matter are location and age, and - to a lesser extent - education level, household income and ethnicity.

Patient activity (in the general sense, not referring to the activity table) consists mostly of basic statistics, such as the total number of procedures, the number of visits, and the time between the first and last visit.
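As an illustration of these basic statistics, here is a minimal pandas sketch. The table layout and the input column names (patient_id, visit_date, procedure_code) are assumptions made for the example, not the competition's actual schema, and the toy rows are invented.

```python
import pandas as pd

# Toy procedure-level records; the schema is assumed for illustration only.
procedures = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "visit_date": pd.to_datetime(
        ["2008-01-15", "2008-01-15", "2010-07-01", "2012-03-10", "2012-05-10"]
    ),
    "procedure_code": ["57454", "G0143", "G0143", "90696", "S0605"],
})

per_patient = procedures.groupby("patient_id").agg(
    num_procedures=("procedure_code", "size"),  # total procedure count
    visits=("visit_date", "nunique"),           # number of distinct visit dates
    first_visit=("visit_date", "min"),
    last_visit=("visit_date", "max"),
)

# Time between the first and last visit, in whole calendar months.
per_patient["date_delta"] = (
    (per_patient["last_visit"].dt.year - per_patient["first_visit"].dt.year) * 12
    + (per_patient["last_visit"].dt.month - per_patient["first_visit"].dt.month)
)
```

Counting distinct dates as visits is one possible definition; a real pipeline might define a visit differently.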

Practitioner information relates to the doctors - for example, practitioner ID or the percentage of screeners among a practitioner’s patients - as well as the practitioner’s location, which of course is correlated with the patient’s location.
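A practitioner-level feature such as the percentage of screeners among a doctor's patients could be derived roughly as follows. This is a sketch on invented rows: the column names obg_id and is_screener follow the names used in this post, but the table layout is an assumption.

```python
import pandas as pd

# Toy training rows; the layout is assumed for illustration only.
train = pd.DataFrame({
    "patient_id":  [1, 2, 3, 4, 5],
    "obg_id":      ["A", "A", "A", "B", "B"],
    "is_screener": [1, 1, 0, 0, 1],
})

# Per-practitioner aggregates computed from labelled training rows.
obg_stats = (
    train.groupby("obg_id")["is_screener"]
    .agg(obg_patient_count="size", obg_screen_pct="mean")
    .reset_index()
)

# Attach the aggregates back to each patient row.
train = train.merge(obg_stats, on="obg_id", how="left")
```

Because such a feature is derived from the target, it is commonly computed out-of-fold to avoid overly optimistic validation scores; the sketch above skips that step for brevity.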

Feature importance plot

Here is the feature importance plot from our best single model showing the details.



The most important features are, in order:

  • (activity) num_procedures - a total count of procedures for a patient
  • (location) cbsa_y - CBSA of primary_practitioner (see the Location section below)
  • (practitioner) phy_practitioner_id - physician who diagnosed a GYN checkup
  • (location) patient_state
  • (location) cbsa_x - CBSA of the most visited practitioner for diagnosis across all diagnoses
  • (activity) visits - total count of visits from patient_activity
  • (activity) num_visits - total count of visits to the primary CBSA
  • (activity) date_delta - last_visit minus first_visit, in months (the data runs from 2008 to 2014)
  • (practitioner) obg_screen_pct - screening % of the OBG
  • (practitioner) obg_id - primary_practitioner_id with specialty_code in ('OBG', 'GYN', 'REN', 'OBS')
  • (activity) V72_date - date of first GYN checkup V72.31
  • patient_age_group
  • (practitioner) obg_patient_count - number of patients of the OBG

A bunch of features in the middle of the plot - of the form A_2014, R_2014, etc. - are activity counts for a given year (A and R types respectively).
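Yearly counts of this kind can be produced with a cross-tabulation. The sketch below uses invented rows, and the input column names (activity_type, activity_year) are assumptions; only the output naming scheme follows the features described above.

```python
import pandas as pd

# Toy activity records; the schema is assumed for illustration only.
activity = pd.DataFrame({
    "patient_id":    [1, 1, 1, 2, 2],
    "activity_type": ["A", "A", "R", "R", "A"],
    "activity_year": [2013, 2014, 2014, 2014, 2014],
})

# Build a "<type>_<year>" key, then count rows per patient and key.
key = (activity["activity_type"] + "_" + activity["activity_year"].astype(str)).rename("feature")
yearly_counts = pd.crosstab(activity["patient_id"], key)
```

Each row of yearly_counts is a patient and each column is a type-year combination, with zeros where a patient has no activity of that kind.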

Demographics

The most important demographics are location and age. We look at both more closely below.

Location

Location features include a patient’s state and CBSA (see below).

CBSA stands for Core-Based Statistical Area. Defined by the US Office of Management and Budget, a CBSA is a geographic area made up of an urban center of at least 10,000 people plus adjacent areas that are socioeconomically tied to the urban center by commuting.

We matched each patient’s location with a CBSA code and used that as a feature.
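One way such a match could be done is a left join against a lookup table. Everything specific in this sketch is hypothetical: the join key (patient_zip), the crosswalk table, and the placeholder CBSA codes are invented for illustration and are not taken from our actual pipeline.

```python
import pandas as pd

# Toy patient table; "patient_zip" is a hypothetical location key.
patients = pd.DataFrame({
    "patient_id":  [1, 2, 3],
    "patient_zip": ["10001", "90001", "99999"],
})

# Hypothetical crosswalk; the CBSA values are placeholders, not real codes.
crosswalk = pd.DataFrame({
    "patient_zip": ["10001", "90001"],
    "cbsa":        ["CBSA_1", "CBSA_2"],
})

# Left join keeps every patient; validate guards against duplicate keys
# in the crosswalk, which would silently multiply patient rows.
patients = patients.merge(crosswalk, on="patient_zip", how="left", validate="many_to_one")

# Locations outside any CBSA get an explicit category instead of NaN.
patients["cbsa"] = patients["cbsa"].fillna("NO_CBSA")
```

Keeping unmatched patients under their own category lets a tree model treat "not in any CBSA" as a signal of its own.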

Age

In [10]: train.groupby( 'patient_age_group' ).is_screener.mean()
Out[10]:
patient_age_group
24-26       71.9%
27-29       70.5%
30-32       67.8%
33-35       65.7%
36-38       64.2%
39-41       62.5%
42-44       60.3%
45-47       58.1%
48-50       55.9%
51-53       53.7%
54-56       51.0%
57-59       49.0%
60-62       46.0%
63-65       40.8%
66-68       34.5%
69-71       28.5%

In [11]: train.groupby( 'patient_age_group' ).is_screener.mean().plot( kind = 'bar', color = 'g' )

One can see that the is_screener percentage declines strictly with each successive age group.
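The decline can be checked directly from the numbers in the output above. The sketch below copies those percentages into a Series by hand; the variable names are ours.

```python
import pandas as pd

# Screener rates per age group, copied from the output above (in percent).
rates = pd.Series(
    [71.9, 70.5, 67.8, 65.7, 64.2, 62.5, 60.3, 58.1,
     55.9, 53.7, 51.0, 49.0, 46.0, 40.8, 34.5, 28.5],
    index=["24-26", "27-29", "30-32", "33-35", "36-38", "39-41", "42-44", "45-47",
           "48-50", "51-53", "54-56", "57-59", "60-62", "63-65", "66-68", "69-71"],
)

# Change from each age group to the next; all must be negative for a strict decline.
steps = rates.diff().dropna()
strictly_declining = bool((steps < 0).all())

# Age group with the steepest fall relative to the previous group.
largest_drop = steps.idxmin()
```

The drops are fairly even up to the early sixties and become noticeably steeper in the last three groups.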

Procedures

It appears that if you were to select one data table, procedures would offer the most predictive information. Just raw procedure counts by patient and by code allowed us to exceed 0.8 AUC. It is therefore interesting to see which procedure codes matter to a classifier. We performed an L1 (Lasso) feature selection using Vowpal Wabbit:

In [37]: varinfo[['procedure_code', 'procedure_description', 'RelScore']].head( 10 )
Out[37]:
  procedure_code                             procedure_description RelScore
0          57454   COLPOSCOPY CERVIX BX CERVIX & ENDOCRV CURRETAGE  100.00%
1          81252             GJB2 GENE ANALYSIS FULL GENE SEQUENCE   96.93%
2          57456          COLPOSCOPY CERVIX ENDOCERVICAL CURETTAGE   95.00%
3          57455  COLPOSCOPY CERVIX UPPR/ADJCNT VAGINA W/CERVIX BX   91.42%
4          S4020  IN VITRO FERTILIZATION PROCEDURE CANCELLED BEFOR   85.76%
5          S0605          DIGITAL RECTAL EXAMINATION, MALE, ANNUAL   83.64%
6          G0143  SCREENING CYTOPATHOLOGY, CERVICAL OR VAGINAL (AN   78.39%
7          90696         DTAP-IPV VACCINE CHILD 4-6 YRS FOR IM USE   76.98%
8          S4023            DONOR EGG CYCLE, INCOMPLETE, CASE RATE   76.67%
9          69710   IMPLTJ/RPLCMT EMGNT BONE CNDJ DEV TEMPORAL BONE   72.06%

The procedures above are positively correlated with a patient being a screener. Here are the codes that indicate otherwise:

In [42]: varinfo[['procedure_code', 'procedure_description', 'RelScore']].tail( 10 )
Out[42]:
      procedure_code                             procedure_description RelScore
14337          K0735  SKIN PROTECTION WHEELCHAIR SEAT CUSHION, ADJUSTA  -62.48%
14338          34805  EVASC RPR AAA AORTO-UNIILIAC/AORTO-UNIFEM PROSTH  -64.68%
14339          L5975  ALL LOWER EXTREMITY PROSTHESIS, COMBINATION SING  -65.39%
14340          89321      SEMEN ANALYSIS SPERM PRESENCE&/MOTILITY SPRM  -65.51%
14341          S9145  INSULIN PUMP INITIATION, INSTRUCTION IN INITIAL   -69.96%
14342          00632     ANESTHESIA LUMBAR REGION LUMBAR SYMPATHECTOMY  -77.13%
14343          27756       PRQ SKELETAL FIXATION TIBIAL SHAFT FRACTURE  -78.61%
14344          3303F      AJCC CANCER STAGE IA, DOCUMENTED (ONC), (ML)  -82.77%
14345          23675  CLTX SHOULDER DISLC W/SURG/ANTMCL NECK FX W/MANJ  -83.14%
14346          Q4111                 GAMMAGRAFT, PER SQUARE CENTIMETER  -85.49%
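For readers who want to try a similar L1 selection, the raw procedure counts first need to be written in Vowpal Wabbit's text input format. The helper below is a minimal sketch; the function name, the namespace "p", and the sample codes are our own choices for illustration.

```python
from collections import Counter

def to_vw_line(is_screener, procedure_codes):
    """Format one patient as a VW example: a label, then code:count features."""
    label = 1 if is_screener else -1  # VW's logistic loss expects labels of -1 or 1
    counts = Counter(procedure_codes)
    features = " ".join(f"{code}:{n}" for code, n in sorted(counts.items()))
    return f"{label} |p {features}"

line = to_vw_line(True, ["57454", "G0143", "G0143"])
# line is "1 |p 57454:1 G0143:2"
```

A file of such lines can then be passed to something like `vw -d train.vw --loss_function logistic --l1 1e-6 -f model.vw`, where the L1 strength shown is an arbitrary placeholder that would need tuning.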

Complete data and code are available on request.
