Open Data Meetup

Pierre Gutierrez, Data Scientist at Dataiku (www.dataiku.com).

Data

  • medicare (real) data can be found here : https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/Medicare-Provider-Charge-Data/Part-D-Prescriber.html

  • complements for other years can be found on propublica website : https://projects.propublica.org/data-store/

  • payments can be found here : http://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html

  • npi exclusions can be found here : https://oig.hhs.gov/exclusions/exclusions_list.asp#instruct

  • FDA information comes from : http://www.fda.gov/Drugs/InformationOnDrugs/ucm079750.htm

  • we provide the drug_paiments_companies file, which enables joining the companies from the FDA data with those in the payments data. It was generated with fuzzy matching and is not considered exact; it can contain errors (a sketch of this kind of matching follows the list).
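For illustration, here is a minimal sketch of the kind of fuzzy matching that could have produced such a file; the input files, column names and the 0.85 cutoff are assumptions, not the actual generation code:

import difflib
import pandas as pd

# hypothetical inputs : one company-name column on each side
fda = pd.read_csv("fda_companies.csv")
payments = pd.read_csv("payment_companies.csv")
candidates = payments["payment_company"].unique().tolist()

def best_match(name):
    # closest payment company name, or None if nothing is close enough
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=0.85)
    return matches[0] if matches else None

fda["payment_company"] = fda["fda_company"].map(best_match)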

NB : ProPublica has many interesting articles on the subject. For example, on fraud :

  • http://www.propublica.org/article/fraud-still-plagues-medicare-drug-program-watchdog-finds
  • http://www.propublica.org/article/fanny-pack-mixup-unravels-massive-medicare-fraud-scheme
  • http://www.propublica.org/article/how-fraud-flourishes-in-medicares-drug-plan

An interesting report on how to detect fraud : http://oig.hhs.gov/oei/reports/oei-02-09-00603.pdf

Importing and Merging Data into Postgresql

  • take the csv files and load them into postgresql using pgloader (https://github.com/dimitri/pgloader); a pandas-based alternative is sketched after this list.
  • use the postgresql database dump.
  • see the sql code for the rest in the directory.
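If you prefer to stay in python, a minimal alternative sketch loads a csv into postgresql with pandas and SQLAlchemy; the connection string, file name and table name are assumptions:

import pandas as pd
from sqlalchemy import create_engine

# illustrative connection string, file and table names
engine = create_engine("postgresql://USER:PASSWORD@localhost/DATABASE")
for chunk in pd.read_csv("PartD_Prescriber_PUF.csv", chunksize=100000):
    # append chunk by chunk to avoid loading the whole file in memory
    chunk.to_sql("prescriptions", engine, if_exists="append", index=False)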

NB : if you don't have a nice sql interface, use psycopg2 !

import psycopg2

# open a connection and run DDL / aggregation queries directly from python
con = psycopg2.connect(host="localhost", user=USER, password=PASSWORD, database=DATABASE)
cur = con.cursor()
cur.execute('CREATE TABLE ... AS SELECT ... ')  # fill in the table name and query
con.commit()
  • most aggregations can also be done in python with the pandas package (sketched just below). It is also possible to use another sql database.
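For instance, per-prescriber aggregates similar to the ones used later could be computed like this in pandas; the raw table and column names are assumptions:

import pandas as pd

# hypothetical raw table : one row per (prescriber npi, drug)
prescr = pd.read_csv("prescriptions.csv")

# per-prescriber aggregates, mirroring what the sql code computes
agg = prescr.groupby("npi").agg({
    "drug_name": pd.Series.nunique,           # number of distinct drugs
    "total_drug_cost": ["sum", "max", "var"], # cost aggregates
})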

Creating a supervised Model for Fraud detection

In [102]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the roc plots below
import random as rd
import psycopg2
from pandas.io import sql
import scipy

# scikit learn 
from sklearn import cross_validation
from sklearn.preprocessing import StandardScaler
#from sklearn.preprocessing import OneHotEncoder
#from sklearn.preprocessing import LabelEncoder 
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
In [7]:
# define the connection
USER = *****
PASSWORD = *****
DATABASE = *****
con = psycopg2.connect(host="localhost", user=USER, password=PASSWORD, database=DATABASE)
In [8]:
q = " SELECT * from complete ; "
In [9]:
df = sql.read_sql(q, con)  # read_frame is deprecated in recent pandas

In [10]:
df.columns
Out[10]:
Index([u'npi', u'speciality', u'nb_diff_prescriptions', u'sum_prescr',
       u'sum_days', u'sum_cost', u'max_prescr', u'max_days', u'max_cost',
       u'var_prescr', u'var_days', u'var_cost', u'nb_perc_presc',
       u'nb_perc_days', u'nb_perc_cost', u'perc_unicost', u'score',
       u'weighted_score', u'payments_2013', u'is_fraud'],
      dtype='object')
In [11]:
categorical_features = [u'speciality']
# base features
numerical_features = ['nb_diff_prescriptions', 'sum_prescr', 'sum_days', 'sum_cost',
                      'var_prescr', 'var_days', 'var_cost',
                      'max_prescr', 'max_days', 'max_cost']
# outlier features
numerical_features = numerical_features + ['nb_perc_presc', 'nb_perc_days', 'nb_perc_cost', 'perc_unicost']
# graph features
numerical_features = numerical_features + ['score', 'weighted_score']
# payment features
numerical_features = numerical_features + ['payments_2013']

target = ['is_fraud']

allvars = categorical_features + numerical_features + target
In [84]:
y = df["is_fraud"].values
X = df[allvars].drop('is_fraud',axis=1)
In [85]:
X_train, X_valid, y_train, y_valid = cross_validation.train_test_split(X, y, test_size=0.2, random_state=0)
print X_train.shape, X_valid.shape

# and this is a 5-fold cross validation example
#kf = cross_validation.KFold(X.shape[0], n_folds=5, random_state=0)
#for train_index, valid_index in kf:
#    print("TRAIN:", len(train_index), "TEST:", len(valid_index))
#    X_train, X_valid = X.values[train_index], X.values[valid_index]
#    y_train, y_valid = y[train_index], y[valid_index]
(646417, 18) (161605, 18)

In [86]:
# filling missing values 
X_train[numerical_features] = X_train[numerical_features].fillna(0) # is mean better ? 
X_valid[numerical_features] = X_valid[numerical_features].fillna(0) # is mean better ? 
X_train[categorical_features] = X_train[categorical_features].fillna('unknown') 
X_valid[categorical_features] = X_valid[categorical_features].fillna('unknown')
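To answer the "is mean better ?" question above, here is a minimal mean-imputation sketch that could replace the fillna(0) lines; the means are computed on the training set only, to avoid leaking validation information:

# alternative to fillna(0): impute with the training-set mean
train_means = X_train[numerical_features].mean()
X_train[numerical_features] = X_train[numerical_features].fillna(train_means)
X_valid[numerical_features] = X_valid[numerical_features].fillna(train_means)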
In [87]:
# rescaling 
scaler = StandardScaler()
X_train[numerical_features] = scaler.fit_transform(X_train[numerical_features].values)
X_valid[numerical_features] = scaler.transform(X_valid[numerical_features].values)
In [89]:
# dummify categorical features
# DictVectorizer one-hot encodes string values; dict(enumerate(row)) turns each
# row into {column_index: value} pairs, and hstack glues the resulting dummies
# to the (already scaled) numerical columns as one sparse matrix
vectorizer = DictVectorizer()
tmp = [dict(enumerate(sample)) for sample in X_train[categorical_features].values]
tmp2 = [dict(enumerate(sample)) for sample in X_valid[categorical_features].values]
X_train = scipy.sparse.hstack((X_train[numerical_features].values, vectorizer.fit_transform(tmp)))
X_valid = scipy.sparse.hstack((X_valid[numerical_features].values, vectorizer.transform(tmp2)))
In [120]:
# declare classifiers
clf1 = LogisticRegression() # pimp me (a grid-search sketch follows)
clf2 = RandomForestClassifier(n_estimators=100, max_depth=10, class_weight='auto') # pimp me
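Both models are close to their defaults. As a hedged illustration of how one might tune them, here is a minimal grid-search sketch; the parameter grid is an assumption, and the old sklearn.grid_search module is used to match the sklearn version in this notebook:

from sklearn.grid_search import GridSearchCV

# illustrative parameter grid, not a recommendation
param_grid = {'C': [0.01, 0.1, 1.0, 10.0], 'class_weight': [None, 'auto']}
grid = GridSearchCV(LogisticRegression(), param_grid, scoring='roc_auc', cv=5)
grid.fit(X_train, y_train)
print grid.best_params_, grid.best_score_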
In [121]:
# train model 1
clf1.fit(X_train,y_train)
Out[121]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0)
In [122]:
# train model 2
clf2.fit(X_train.toarray(),y_train)
Out[122]:
RandomForestClassifier(bootstrap=True, class_weight='auto', criterion='gini',
            max_depth=10, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)
In [123]:
# evaluate and plot roc curve 1
probas = clf1.predict_proba(X_valid)
fpr, tpr, thresholds = roc_curve(y_valid, probas[:, 1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label='ROC (area = %0.2f)' % roc_auc)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
In [124]:
# evaluate and plot roc curve 2
probas = clf2.predict_proba(X_valid.toarray())
fpr, tpr, thresholds = roc_curve(y_valid, probas[:, 1])
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, lw=1, label='ROC (area = %0.2f)' % roc_auc)
plt.xlim([-0.05, 1.05])
plt.ylim([-0.05, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
In [125]:
# That's all ! It's now up to you to create your own features and to tune your algorithm !
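As one possible starting point, here is a minimal sketch for inspecting which features the random forest relies on; it assumes the column order produced by the hstack above (numerical features first, then the one-hot dummies):

# feature importances from the random forest
# assumes hstack order: numerical features first, then the vectorizer dummies
feature_names = numerical_features + vectorizer.get_feature_names()
ranked = sorted(zip(clf2.feature_importances_, feature_names), reverse=True)
for imp, name in ranked[:15]:
    print '%.4f %s' % (imp, name)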