Kaggle Competition Part 1
18 Nov 2016
This post chronicles my attempt at the Kaggle Titanic competition. The objective of the competition is to build a classifier, trained on the training data, that correctly predicts passenger survival in the test set. My best placement from this effort is 370th out of ~5800, or roughly the top 7%. This post is part 1 of the process: it covers data munging, model specification, and the use of logistic regression as a classifier. Part 2 will detail the random forest approach, which produced much better results.
Data Munging and Variable Inspection
First, import the CSV data any way you like.
train <- read.csv(".../train.csv")
test <- read.csv(".../test.csv")
Next, let’s combine the training and test sets for easier data manipulation. The test set starts at passenger 892, so we can resplit the data later for analysis.
test$Survived <- NA #test needs matching column length for binding
comb_df <- rbind(train, test)
Use the head() function to inspect the first six observations for each variable. The str() function is also useful for examining the structure of the data frame: each variable’s name, data type, and factor levels.
head(comb_df)
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp
## 1 Braund, Mr. Owen Harris male 22 1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1
## 3 Heikkinen, Miss. Laina female 26 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1
## 5 Allen, Mr. William Henry male 35 0
## 6 Moran, Mr. James male NA 0
## Parch Ticket Fare Cabin Embarked
## 1 0 A/5 21171 7.2500 S
## 2 0 PC 17599 71.2833 C85 C
## 3 0 STON/O2. 3101282 7.9250 S
## 4 0 113803 53.1000 C123 S
## 5 0 373450 8.0500 S
## 6 0 330877 8.4583 Q
str(comb_df)
## 'data.frame': 1309 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...
One should quickly notice that the dependent variable (Survived) is not a factor. Pclass is also stored as an integer rather than a factor. Both should be converted to the factor data type. I create new variables for each because I prefer to leave the originals in their original form.
comb_df$Survived2 <- as.factor(comb_df$Survived)
comb_df$Pclass2 <- as.factor(comb_df$Pclass)
Now let’s examine the extent of missing data. This is easily done with the VIM package and its aggr() plotting function. Let’s also quickly drop the Survived variable, since it is obviously missing for all of the test data.
library(VIM)
comb_df2 <- comb_df[ ,-c(2,13)]
aggr_plot <- aggr(comb_df2, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(comb_df2), cex.axis=.7, gap=3,
ylab=c("Histogram of missing data","Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## Age 0.2009167303
## Fare 0.0007639419
## PassengerId 0.0000000000
## Pclass 0.0000000000
## Name 0.0000000000
## Sex 0.0000000000
## SibSp 0.0000000000
## Parch 0.0000000000
## Ticket 0.0000000000
## Cabin 0.0000000000
## Embarked 0.0000000000
## Pclass2 0.0000000000
The histogram and resulting table indicate that roughly 20% of the Age values are missing and that only a tiny proportion (about 0.0008) of the Fare values are missing.1 Logistic regression can handle missing values through listwise deletion, but that is not a sound strategy when we have the tools to impute. Methods like random forests cannot handle missing values anyway, so we need an imputation solution regardless. Let’s assume these values are missing at random and that they make up an acceptably small share of the full data frame.2
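Before imputing anything, it is worth seeing exactly how many observations listwise deletion would cost. A minimal base-R cross-check of the aggr() results (my own quick sketch, reusing the comb_df2 frame from above):
colSums(is.na(comb_df2)) #number of NAs per variable
colMeans(is.na(comb_df2)) #proportion of NAs per variable
sum(!complete.cases(comb_df2)) #rows that listwise deletion would drop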
Since Fare is missing only one value, it is easy enough to impute with some measure of central tendency. The mean is 33.29, while the median is 14.45. Inspecting the Fare variable, however, reveals a couple of problems.
summary(comb_df$Fare)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 7.896 14.450 33.300 31.280 512.300 1
par(mfrow=c(1,2))
plot(comb_df$Fare)
hist(comb_df$Fare)
First, there are fares of 0. Did these passengers board for free? Probably not. Second, it is clear upon inspection that outliers are pulling the mean fare upward. So we have more than a simple imputation to deal with. We can impute the missing fare based on passenger class using the rpart package, but first let’s set the zero values to NA so that they are imputed by class as well.
comb_df$fare2 <- ifelse(comb_df$Fare==0, NA, comb_df$Fare)
summary(comb_df$fare2) #inspect to see no more zero value, and now 18 NAs
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.171 7.925 14.500 33.730 31.330 512.300 18
library(rpart)
fareimp <- rpart(fare2 ~ Pclass, data=comb_df[!is.na(comb_df$fare2),],
method="anova")
comb_df$fare2[is.na(comb_df$fare2)] <- predict(fareimp, comb_df[is.na(comb_df$fare2),])
summary(comb_df$fare2) #inspect to see no more NAs
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.171 7.925 14.500 33.880 31.390 512.300
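For comparison, here is a minimal sketch of the simpler alternative mentioned above: imputing the literal median fare within each passenger class using base R’s ave(). The objects fare_alt and class_median are throwaway names of my own; the rpart-imputed fare2 is what the models below actually use.
fare_alt <- ifelse(comb_df$Fare == 0, NA, comb_df$Fare) #same zero-to-NA recoding as above
class_median <- ave(fare_alt, comb_df$Pclass, FUN = function(x) median(x, na.rm = TRUE)) #median fare of each passenger's class
fare_alt[is.na(fare_alt)] <- class_median[is.na(fare_alt)] #fill the NAs with the class medians
summary(fare_alt)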
Now let’s impute age based on multiple covariates.
ageimp <- rpart(Age ~ Pclass2 + Sex + SibSp + Parch + fare2 + Embarked,
data=comb_df[!is.na(comb_df$Age),],
method="anova")
comb_df$age2 <- comb_df$Age #keep the observed ages; only the missing ones get imputed
comb_df$age2[is.na(comb_df$age2)] <- predict(ageimp, comb_df[is.na(comb_df$age2),])
summary(comb_df$age2) #inspect to make sure.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.60 27.43 27.43 29.63 32.10 40.55
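As a quick sanity check on the imputation (my own addition), we can compare the distribution of the observed ages with the values filled in for the rows where Age was originally missing:
plot(density(comb_df$Age, na.rm=TRUE), main="Observed vs. imputed Age", xlab="Age") #observed ages
lines(density(comb_df$age2[is.na(comb_df$Age)]), lty=2) #imputed values for the originally missing rows
legend("topright", legend=c("Observed", "Imputed"), lty=1:2)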
Finally, in terms of feature extraction, let’s look at the titles housed in the Name variable. People more clever than I noticed that titles appear consistently throughout the passenger names. Historically, titles were also associated with personal characteristics such as gender, class, and nobility.
comb_df$Name <- as.character(comb_df$Name)
strsplit(comb_df$Name[2], split='[,.]') #want to isolate the "title" string segment
## [[1]]
## [1] "Cumings"
## [2] " Mrs"
## [3] " John Bradley (Florence Briggs Thayer)"
comb_df$Title <- sapply(comb_df$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
comb_df$Title <- sub(' ', '', comb_df$Title)
comb_df$Title[comb_df$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
comb_df$Title[comb_df$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
comb_df$Title[comb_df$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
comb_df$Title <- factor(comb_df$Title)
table(comb_df$Title)
##
## Col Dr Lady Master Miss Mlle Mr Mrs Ms Rev
## 4 8 4 61 260 3 757 197 2 8
## Sir
## 5
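A quick tabulation (my own addition) of survival by title in the original training rows (passengers 1-891), where Survived is still available as a 0/1 integer, gives a sense of whether the titles are worth extracting:
aggregate(Survived ~ Title, data=comb_df[1:891, ], FUN=mean) #training-set survival rate by title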
Next, we can create a couple of family related variables. These are family size and family id or cluster:
#family size.
comb_df$FamilySize <- comb_df$SibSp + comb_df$Parch + 1 #add one to account for the passenger as well.
#combine surname and family size into a factor
comb_df$Surname <- sapply(comb_df$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
comb_df$FamilyID <- paste(as.character(comb_df$FamilySize), comb_df$Surname, sep="")
comb_df$FamilyID[comb_df$FamilySize <= 2] <- 'Small'
famIDs <- data.frame(table(comb_df$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
comb_df$FamilyID[comb_df$FamilyID %in% famIDs$Var1] <- 'Small'
comb_df$FamilyID <- factor(comb_df$FamilyID)
table(comb_df$FamilyID)
##
## 11Sage 3Abbott 3Boulos 3Bourke 3Brown
## 11 3 3 3 4
## 3Caldwell 3Collyer 3Compton 3Coutts 3Crosby
## 3 3 3 3 3
## 3Danbom 3Davies 3Dodge 3Drew 3Elias
## 3 5 3 3 3
## 3Goldsmith 3Hart 3Hickman 3Johnson 3Klasen
## 3 3 3 3 3
## 3Mallet 3McCoy 3Moubarek 3Nakid 3Navratil
## 3 3 3 3 3
## 3Peacock 3Peter 3Quick 3Rosblom 3Samaan
## 3 3 3 3 3
## 3Sandstrom 3Spedden 3Taussig 3Thayer 3Touma
## 3 3 3 3 3
## 3van Billiard 3Van Impe 3Wells 3Wick 3Widener
## 3 3 3 3 3
## 4Allison 4Baclini 4Becker 4Carter 4Dean
## 4 4 4 4 4
## 4Herman 4Johnston 4Laroche 4West 5Ford
## 4 4 4 4 5
## 5Lefebre 5Palsson 5Ryerson 6Fortune 6Panula
## 5 5 5 6 6
## 6Rice 6Skoog 7Andersson 7Asplund 8Goodwin
## 6 6 9 7 8
## Small
## 1074
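A similar quick tabulation (again my own addition) of training-set survival by family size:
tapply(comb_df$Survived[1:891], comb_df$FamilySize[1:891], mean) #survival rate by family size in the training rows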
Finally, let’s make a women-and-children-first variable. This is based on the well-known account of lifeboats taking on women and children before adult males.
comb_df$priority <- 0
comb_df$priority[which(comb_df$Sex=="female" | comb_df$age2 < 16)] <- 1
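And a quick check (my own addition) of how the flag relates to survival in the training rows:
tapply(comb_df$Survived[1:891], comb_df$priority[1:891], mean) #survival rate by the priority flag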
We have done a lot of work here! Let’s run some models. First, resplit the combined data set back into the training and test sets.
train <- comb_df[1:891,]
test <- comb_df[892:1309,]
Logistic Regression
Logistic regression is a classic classifier in statistics and econometrics, and it serves as a workhorse modeling method in political science and sociology. As with other classification methods, we are interested in the probability that an outcome \(Y = 1\) occurs conditional on a set of \(n\) predictor variables in \(x\prime\).
\[ Pr(Y = 1|x\prime) \]
One can read the above as being the probability of \(Y\) equaling 1 given the measures found in \(x\prime\). To model this predictive relationship, we can utilize conventions from linear regression.
First we can represent the above conditional probability as:
\[ Pr(Y = 1|x\prime) = p(x\prime) \]
Next, propose a linear function of this probability:
\[ p(x\prime) = \beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n} \]
Unfortunately, stopping here would not help us. Our model would sometimes result in probabilities that are \(p(x\prime) < 0\) or \(p(x\prime) > 1\). It is impossible to have probabilities outside of the range \(p \in [0,1]\). Fortunately, we can use the logistic function to guarantee outputs between 0 and 1 as such:
\[ p(x\prime) = \frac{e^{\beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n}}}{1 + e^{\beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n}}} \]
From here we can calculate the odds corresponding to \(p(x\prime)\):
\[ \frac{p(x\prime)}{1 - p(x\prime)} = e^{\beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n}} \]
Recalling basic probability, the odds tell us about the probability of \(Y = 1\): odds close to 0 indicate a very low probability of \(Y = 1\), while odds approaching \(\infty\) indicate a very high probability of \(Y = 1\). By taking the log of both sides of the odds equation above, we obtain the logit, or logged odds:
\[ \log \left( \frac{p(x\prime)}{1 - p(x\prime)} \right) = \beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n} \]
This is the basis for using logistic regression to make classification predictions. We can interpret the influence of predictor \(x_{i}\) as follows: a one-unit increase in \(x_{i}\) changes the logged odds of \(Y = 1\) by \(\beta_{i}\). Unlike linear regression, the corresponding change in the probability itself is not constant; it depends on the current values of the predictors. Even so, the direction generalizes: if \(\beta_{i}\) is positive, increasing \(x_{i}\) increases the logged odds (and therefore the probability) of \(Y = 1\), while a negative \(\beta_{i}\) has the opposite effect.
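One consequence worth making explicit: exponentiating a coefficient yields an odds ratio, so a one-unit increase in \(x_{i}\) multiplies the odds of \(Y = 1\) by \(e^{\beta_{i}}\), regardless of where \(x_{i}\) currently sits:
\[ \frac{\text{odds}(x_{i} + 1)}{\text{odds}(x_{i})} = \frac{e^{\beta_{0} + \dots + \beta_{i}(x_{i} + 1) + \dots + \beta_{n}x_{n}}}{e^{\beta_{0} + \dots + \beta_{i}x_{i} + \dots + \beta_{n}x_{n}}} = e^{\beta_{i}} \]
In R, exp(coef(fit)) reports these odds ratios directly once a model has been fit.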
The above seems easy enough, but how do we find the values of \(\beta_{0}\), \(\beta_{i}\), and so on? These parameters are generally estimated through the maximum likelihood method. Assume \(n\) observations, let \(x_{i}\) denote the covariates for observation \(i\), and let \(\beta\) be the corresponding vector of coefficients. First, start with the likelihood function:
\[ \mathcal{L}(\beta_{0},\beta) = \prod_{i = 1}^{n}p(x_{i})^{y_{i}}(1 - p(x_{i}))^{1 - y_{i}} \]
Taking the log gives the log-likelihood, which turns the product into sums:
\[ \ell(\beta_{0}, \beta) = \sum_{i=1}^{n} -\log\left(1 + e^{\beta_{0} + \beta x_{i}}\right) + \sum_{i=1}^{n} y_{i}(\beta_{0} + \beta x_{i}) \]
Finally, one can find the maximum likelihood estimate for a specific parameter by taking the partial derivative of the log-likelihood function. Say we want the estimate \(\beta_{j}\) from the set of estimates \(\beta\); we take the partial derivative of the log-likelihood with respect to \(\beta_{j}\):
\[ \frac{\partial \ell}{\partial \beta_{j}} = -\sum_{i=1}^{n} \frac{1}{1 + e^{\beta_{0} + \beta x_{i}}} e^{\beta_{0} + \beta x_{i}}x_{ij} + \sum_{i=1}^{n} y_{i}x_{ij} \]
\[ = \sum_{i=1}^{n}(y_{i} - p(x_{i};\beta_{0},\beta))x_{ij} \]
From here, we have the basis for estimating \(\beta_{j}\). Setting this derivative to zero yields a transcendental equation with no closed-form solution, so the estimates must be found numerically, typically via Newton-Raphson (Fisher scoring), which is what glm() reports below as its Fisher Scoring iterations.
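To make the estimation step concrete, here is a minimal, self-contained sketch of Newton-Raphson/Fisher scoring for a logistic regression with one covariate. This is my own toy example on simulated data, not part of the competition workflow; the point is only that the hand-rolled estimates match what glm() produces.
set.seed(42)
n <- 500
x <- cbind(1, rnorm(n)) #design matrix: intercept plus one covariate
beta_true <- c(-0.5, 1.2)
y <- rbinom(n, 1, plogis(x %*% beta_true)) #simulate binary outcomes
beta <- c(0, 0) #starting values
for (iter in 1:25) {
  p <- as.vector(plogis(x %*% beta)) #current fitted probabilities
  W <- diag(p * (1 - p)) #weights
  score <- t(x) %*% (y - p) #gradient of the log-likelihood
  info <- t(x) %*% W %*% x #Fisher information
  step <- solve(info, score) #Newton-Raphson update
  beta <- beta + as.vector(step)
  if (max(abs(step)) < 1e-8) break #converged
}
beta #close to beta_true
coef(glm(y ~ x[, 2], family=binomial)) #glm() arrives at the same estimates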
Logistic Regression using GLM
So let’s put the above into action. Logistic regression models are easy to implement in R using the glm() function. Here I use the logit link.
lfit <- glm(Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + priority + age2 + Parch + Title,
data = train, family = binomial(logit))
summary(lfit)
##
## Call:
## glm(formula = Survived2 ~ Pclass2 + SibSp + fare2 + Embarked +
## priority + age2 + Parch + Title, family = binomial(logit),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4405 -0.6336 -0.3982 0.5188 2.4633
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.114166 1.749083 0.637 0.52412
## Pclass22 -1.086626 0.403463 -2.693 0.00708 **
## Pclass23 -2.181913 0.444036 -4.914 0.000000893 ***
## SibSp -0.438038 0.143767 -3.047 0.00231 **
## fare2 0.002477 0.002457 1.008 0.31342
## EmbarkedQ -0.108594 0.395787 -0.274 0.78380
## EmbarkedS -0.456701 0.245963 -1.857 0.06334 .
## priority -2.171446 0.979025 -2.218 0.02656 *
## age2 -0.023742 0.024821 -0.957 0.33880
## Parch -0.360929 0.136907 -2.636 0.00838 **
## TitleDr 0.467218 1.652747 0.283 0.77741
## TitleLady 2.478722 2.116785 1.171 0.24160
## TitleMaster 4.471005 1.645752 2.717 0.00659 **
## TitleMiss 4.874589 1.739323 2.803 0.00507 **
## TitleMlle 17.430855 840.228654 0.021 0.98345
## TitleMr -0.338500 1.440880 -0.235 0.81427
## TitleMrs 5.381037 1.756308 3.064 0.00219 **
## TitleMs 18.896646 1455.398594 0.013 0.98964
## TitleRev -14.406445 591.633321 -0.024 0.98057
## TitleSir -0.156936 1.700415 -0.092 0.92647
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 729.89 on 871 degrees of freedom
## AIC: 769.89
##
## Number of Fisher Scoring iterations: 14
dev <- 1186.66 - 729.89 #null deviance minus residual deviance, from the model output above
dfree <- 890 - 871 #difference in degrees of freedom
sig <- 1 - pchisq(dev, df=dfree)
sig
## [1] 0
#the model is a significant improvement over the null model (the p-value is effectively zero)
anova(lfit, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Survived2
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 890 1186.66
## Pclass2 2 103.547 888 1083.11 < 0.00000000000000022 ***
## SibSp 1 0.040 887 1083.07 0.8417158
## fare2 1 7.479 886 1075.59 0.0062416 **
## Embarked 2 16.721 884 1058.87 0.0002339 ***
## priority 1 238.582 883 820.29 < 0.00000000000000022 ***
## age2 1 3.652 882 816.63 0.0559969 .
## Parch 1 0.941 881 815.69 0.3320660
## Title 10 85.802 871 729.89 0.00000000000003626 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(car) #vif() comes from the car package
vif(lfit)
## GVIF Df GVIF^(1/(2*Df))
## Pclass2 4.533083 2 1.459145
## SibSp 2.085296 1 1.444055
## fare2 1.640160 1 1.280687
## Embarked 1.309050 2 1.069644
## priority 27.055227 1 5.201464
## age2 4.978660 1 2.231291
## Parch 1.521050 1 1.233309
## Title 46.902284 10 1.212159
test$Survived <- predict(lfit, newdata = test, type="response") #predicted survival probabilities
test$Survived <- ifelse(test$Survived > .5, 1, 0) #classify with a 0.5 threshold
vars <- c("PassengerId", "Survived")
sub <- test[vars]
sub$Survived <- as.factor(sub$Survived)
write.csv(sub, "besaw_sub.csv", row.names=FALSE) #drop row names so only PassengerId and Survived are written
Score of 0.79426, not bad… Let’s try some interaction effects.
lfit <- glm(Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + priority + age2 + Parch + Title + priority*Pclass2 +
SibSp*priority + Pclass2*SibSp + Pclass2*Parch,
data = train, family = binomial(logit))
summary(lfit)
##
## Call:
## glm(formula = Survived2 ~ Pclass2 + SibSp + fare2 + Embarked +
## priority + age2 + Parch + Title + priority * Pclass2 + SibSp *
## priority + Pclass2 * SibSp + Pclass2 * Parch, family = binomial(logit),
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.0165 -0.5355 -0.4199 0.3414 2.3268
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.679764 1.914146 0.355 0.72249
## Pclass22 -1.604792 0.547557 -2.931 0.00338 **
## Pclass23 -1.345472 0.536659 -2.507 0.01217 *
## SibSp 0.305911 0.357008 0.857 0.39151
## fare2 0.001726 0.002698 0.640 0.52227
## EmbarkedQ -0.155566 0.381959 -0.407 0.68380
## EmbarkedS -0.535461 0.250159 -2.140 0.03232 *
## priority -0.140686 1.170246 -0.120 0.90431
## age2 -0.011483 0.031468 -0.365 0.71519
## Parch -0.744546 0.418493 -1.779 0.07522 .
## TitleDr -0.097666 1.689988 -0.058 0.95392
## TitleLady 0.816524 2.033841 0.401 0.68808
## TitleMaster 4.284450 1.700119 2.520 0.01173 *
## TitleMiss 4.415994 1.798349 2.456 0.01407 *
## TitleMlle 15.384437 840.259572 0.018 0.98539
## TitleMr -0.600261 1.449138 -0.414 0.67871
## TitleMrs 4.603745 1.813619 2.538 0.01114 *
## TitleMs 17.652788 1455.398737 0.012 0.99032
## TitleRev -13.915157 589.976359 -0.024 0.98118
## TitleSir -0.362893 1.712217 -0.212 0.83215
## Pclass22:priority -0.139366 0.974519 -0.143 0.88628
## Pclass23:priority -2.485059 0.842521 -2.950 0.00318 **
## SibSp:priority -0.321285 0.382498 -0.840 0.40093
## Pclass22:SibSp -0.559350 0.606727 -0.922 0.35657
## Pclass23:SibSp -0.469754 0.476165 -0.987 0.32387
## Pclass22:Parch 1.193951 0.607597 1.965 0.04941 *
## Pclass23:Parch 0.424308 0.443516 0.957 0.33872
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1186.66 on 890 degrees of freedom
## Residual deviance: 691.46 on 864 degrees of freedom
## AIC: 745.46
##
## Number of Fisher Scoring iterations: 14
The change in deviance seemed promising, but this submission scored 0.78, so not an improvement…
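In hindsight, a cheaper way to compare specifications than spending Kaggle submissions is k-fold cross-validation on the training set. Here is a rough sketch of my own (not part of the original workflow); note that rare Title levels can trigger rank-deficiency warnings in some folds, which I ignore for this quick comparison.
set.seed(123)
k <- 10
folds <- sample(rep(1:k, length.out=nrow(train))) #randomly assign each row to a fold
acc <- numeric(k)
for (i in 1:k) {
  fit <- glm(Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + priority + age2 + Parch + Title,
             data=train[folds != i, ], family=binomial(logit))
  prob <- predict(fit, newdata=train[folds == i, ], type="response")
  acc[i] <- mean(ifelse(prob > .5, 1, 0) == train$Survived[folds == i]) #held-out accuracy
}
mean(acc) #estimated out-of-sample accuracy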
It is actually just one missing value.↩
Multiple imputation assumes that missing values are missing at random (MAR). One should be careful when performing multiple imputation on real-world data, as the MAR assumption is often not plausible. I am not sure why Age has so many missing values here; it could be by design on Kaggle’s part, which is what I assume for this analysis. Normally, if one variable has such a high degree of missingness, it is unlikely to be missing at random. In addition, missing values should normally not exceed 5%-10% of a measure’s observations; much higher degrees of missingness are often a sign of a MAR violation. Consequently, with more missingness you are more likely to construct artificial patterns in the imputed data if your observed data are unrepresentative of the real population.↩