Kaggle Competition Part 1

This post chronicles my attempt at the Kaggle Titanic competition. The objective of the competition is to build a classifier on the training data that correctly predicts passenger survival in the held-out test data. My best score from this effort placed 370th out of roughly 5,800 entrants, or about the top 7%. This is part 1 of the process: it covers the data preparation and the use of logistic regression as a classifier, along with model specification. Part 2 will detail the random forest approach, which produced much better results.

Data Munging and Variable Inspection

First, import the CSV data any way you like.

train <- read.csv(".../train.csv")

test <- read.csv(".../test.csv")

Next, let’s combine the training and test sets for easier data manipulation. The test set starts at passenger 892, so we can re-split the data later for analysis.

test$Survived <- NA #test needs matching column length for binding

comb_df <- rbind(train, test)

Use the head() function to inspect the first six observations for each variable. The str() function is also useful for examining the structure of the data frame: each variable’s name, data type, and, for factors, its levels.

head(comb_df)
##   PassengerId Survived Pclass
## 1           1        0      3
## 2           2        1      1
## 3           3        1      3
## 4           4        1      1
## 5           5        0      3
## 6           6        0      3
##                                                  Name    Sex Age SibSp
## 1                             Braund, Mr. Owen Harris   male  22     1
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female  38     1
## 3                              Heikkinen, Miss. Laina female  26     0
## 4        Futrelle, Mrs. Jacques Heath (Lily May Peel) female  35     1
## 5                            Allen, Mr. William Henry   male  35     0
## 6                                    Moran, Mr. James   male  NA     0
##   Parch           Ticket    Fare Cabin Embarked
## 1     0        A/5 21171  7.2500              S
## 2     0         PC 17599 71.2833   C85        C
## 3     0 STON/O2. 3101282  7.9250              S
## 4     0           113803 53.1000  C123        S
## 5     0           373450  8.0500              S
## 6     0           330877  8.4583              Q
str(comb_df)
## 'data.frame':    1309 obs. of  12 variables:
##  $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
##  $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
##  $ Name       : Factor w/ 1307 levels "Abbing, Mr. Anthony",..: 109 191 358 277 16 559 520 629 417 581 ...
##  $ Sex        : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
##  $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
##  $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
##  $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
##  $ Ticket     : Factor w/ 929 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
##  $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
##  $ Cabin      : Factor w/ 187 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
##  $ Embarked   : Factor w/ 3 levels "C","Q","S": 3 1 3 3 3 2 3 3 3 1 ...

One should quickly notice that the dependent variable (Survived) is not a factor. Pclass also appears to be misclassified as an integer. Both should be converted to the factor data type. I make new variables for each conversion because I prefer leaving the original variables in their original form.

comb_df$Survived2 <- as.factor(comb_df$Survived)

comb_df$Pclass2 <- as.factor(comb_df$Pclass)

Now let’s examine the extent of missing data. This is easily done with the VIM package and its aggr() plotting function. Let’s also quickly drop the two survival columns, since Survived is obviously missing for all of the test observations.

library(VIM)
comb_df2 <- comb_df[ ,-c(2,13)]

aggr_plot <- aggr(comb_df2, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(comb_df2), cex.axis=.7, gap=3, 
                  ylab=c("Histogram of missing data","Pattern"))

## 
##  Variables sorted by number of missings: 
##     Variable        Count
##          Age 0.2009167303
##         Fare 0.0007639419
##  PassengerId 0.0000000000
##       Pclass 0.0000000000
##         Name 0.0000000000
##          Sex 0.0000000000
##        SibSp 0.0000000000
##        Parch 0.0000000000
##       Ticket 0.0000000000
##        Cabin 0.0000000000
##     Embarked 0.0000000000
##      Pclass2 0.0000000000

The histogram and resulting table indicate that roughly 20% of the Age values are missing, and that a very small fraction (0.0008) of the Fare values is missing.1 Logistic regression can handle missing values through listwise deletion, but that is not a sound strategy when we have the tools to impute them. Methods like random forests cannot handle missing values anyway, so we really should find a good imputation solution. Let’s assume these values are missing at random and make up an acceptably small share of the full data frame.2
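
Incidentally, the same proportions can be pulled without the plot; here is a quick base R sketch using the comb_df2 frame created above:

round(colMeans(is.na(comb_df2)), 4) #proportion of missing values in each column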

Since Fare is missing only one value, it is easy enough to impute with some measure of central tendency. The mean is 33.29, while the median is 14.45. Inspecting the Fare variable, however, reveals a few problems.

summary(comb_df$Fare)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   7.896  14.450  33.300  31.280 512.300       1
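
For reference, a simple median fill really is a one-liner. Here is a sketch on a copy of the column; below we take a class-based route instead:

fare_naive <- comb_df$Fare

fare_naive[is.na(fare_naive)] <- median(fare_naive, na.rm = TRUE) #fill the single NA with the overall median (14.45)

summary(fare_naive) #no NAs remain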
par(mfrow=c(1,2))

plot(comb_df$Fare)

hist(comb_df$Fare)

First, there are fares of 0. Did these people get on for free? Probably not. Second, it is clear on inspection that outliers are pulling the mean fare upward. So we now have more than a single simple imputation to deal with. We can impute the missing fare from passenger class using the rpart package, but first let’s set the zero values to NA and then impute a class-based fare for all of them.

comb_df$fare2 <- ifelse(comb_df$Fare==0, NA, comb_df$Fare)

summary(comb_df$fare2) #inspect to see no more zero value, and now 18 NAs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.171   7.925  14.500  33.730  31.330 512.300      18
library(rpart)

fareimp <- rpart(fare2 ~ Pclass, data=comb_df[!is.na(comb_df$fare2),], 
                method="anova")


comb_df$fare2[is.na(comb_df$fare2)] <- predict(fareimp, comb_df[is.na(comb_df$fare2),])

summary(comb_df$fare2) #inspect to see no more NAs
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.171   7.925  14.500  33.880  31.390 512.300

Now let’s impute age based on multiple covariates.

ageimp <- rpart(Age ~ Pclass2 + Sex + SibSp + Parch + fare2 + Embarked,
                data=comb_df[!is.na(comb_df$Age),], 
                method="anova")

comb_df$age2 <- comb_df$Age #start from the observed ages so only the missing ones get imputed

comb_df$age2[is.na(comb_df$age2)] <- predict(ageimp, comb_df[is.na(comb_df$age2),])

summary(comb_df$age2) #inspect to make sure.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.60   27.43   27.43   29.63   32.10   40.55

Finally, in terms of feature extraction, let’s look at the titles housed in the Name variable. People more clever than I noticed that titles appear consistently throughout the passenger names. Historically, titles were also associated with personal characteristics such as gender, class, and nobility.

comb_df$Name <- as.character(comb_df$Name)

strsplit(comb_df$Name[2], split='[,.]') #want to isolate the "title" string segment
## [[1]]
## [1] "Cumings"                               
## [2] " Mrs"                                  
## [3] " John Bradley (Florence Briggs Thayer)"
comb_df$Title <- sapply(comb_df$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][2]})
comb_df$Title <- sub(' ', '', comb_df$Title)
comb_df$Title[comb_df$Title %in% c('Mme', 'Mlle')] <- 'Mlle'
comb_df$Title[comb_df$Title %in% c('Capt', 'Don', 'Major', 'Sir')] <- 'Sir'
comb_df$Title[comb_df$Title %in% c('Dona', 'Lady', 'the Countess', 'Jonkheer')] <- 'Lady'
comb_df$Title <- factor(comb_df$Title)
table(comb_df$Title)
## 
##    Col     Dr   Lady Master   Miss   Mlle     Mr    Mrs     Ms    Rev 
##      4      8      4     61    260      3    757    197      2      8 
##    Sir 
##      5

Next, we can create a couple of family-related variables: family size and a family ID, or cluster:

#family size.
comb_df$FamilySize <- comb_df$SibSp + comb_df$Parch + 1 #add one to account for the passenger as well.

#combine surname and family size into a factor 
comb_df$Surname <- sapply(comb_df$Name, FUN=function(x) {strsplit(x, split='[,.]')[[1]][1]})
comb_df$FamilyID <- paste(as.character(comb_df$FamilySize), comb_df$Surname, sep="")
comb_df$FamilyID[comb_df$FamilySize <= 2] <- 'Small'
famIDs <- data.frame(table(comb_df$FamilyID))
famIDs <- famIDs[famIDs$Freq <= 2,]
comb_df$FamilyID[comb_df$FamilyID %in% famIDs$Var1] <- 'Small'
comb_df$FamilyID <- factor(comb_df$FamilyID)
table(comb_df$FamilyID)
## 
##        11Sage       3Abbott       3Boulos       3Bourke        3Brown 
##            11             3             3             3             4 
##     3Caldwell      3Collyer      3Compton       3Coutts       3Crosby 
##             3             3             3             3             3 
##       3Danbom       3Davies        3Dodge         3Drew        3Elias 
##             3             5             3             3             3 
##    3Goldsmith         3Hart      3Hickman      3Johnson       3Klasen 
##             3             3             3             3             3 
##       3Mallet        3McCoy     3Moubarek        3Nakid     3Navratil 
##             3             3             3             3             3 
##      3Peacock        3Peter        3Quick      3Rosblom       3Samaan 
##             3             3             3             3             3 
##    3Sandstrom      3Spedden      3Taussig       3Thayer        3Touma 
##             3             3             3             3             3 
## 3van Billiard     3Van Impe        3Wells         3Wick      3Widener 
##             3             3             3             3             3 
##      4Allison      4Baclini       4Becker       4Carter         4Dean 
##             4             4             4             4             4 
##       4Herman     4Johnston      4Laroche         4West         5Ford 
##             4             4             4             4             5 
##      5Lefebre      5Palsson      5Ryerson      6Fortune       6Panula 
##             5             5             5             6             6 
##         6Rice        6Skoog    7Andersson      7Asplund      8Goodwin 
##             6             6             9             7             8 
##         Small 
##          1074

Finally, let’s make a women-and-children-first variable. This is based on the common account of lifeboats taking on women and children before adult males.

comb_df$priority <- 0
comb_df$priority[which(comb_df$Sex=="female" | comb_df$age2 < 16)] <- 1
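
As a quick sanity check, we can cross-tabulate the new variable against the known outcomes. This is just a sketch; Survived is NA for the test rows, so we restrict it to the first 891 passengers:

table(priority = comb_df$priority[1:891], survived = comb_df$Survived[1:891])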

We have done a lot of work here! Let’s run some models. First, re-split the combined data set back into training and test sets.

train <- comb_df[1:891,]
test <- comb_df[892:1309,]

Logistic Regression

Logistic regression is a classic classifier in statistics and econometrics. It is also heavily used in political science and sociology as a workhorse modeling method. As with other regression and classification methods, we are interested in the probability of an outcome \(Y=1\) occurring conditional on a set of predictor variables contained in \(x\prime\).

\[ Pr(Y = 1|x\prime) \]

One can read the above as being the probability of \(Y\) equaling 1 given the measures found in \(x\prime\). To model this predictive relationship, we can utilize conventions from linear regression.

First we can represent the above conditional probability as:

\[ Pr(Y = 1|x\prime) = p(x\prime) \]

Next, propose a linear function of this probability:

\[ p(x\prime) = \beta_{0} + \beta_{i}x_{i}...+ \beta_{n}x_{n} \]

Unfortunately, stopping here would not help us. Our model would sometimes result in probabilities that are \(p(x\prime) < 0\) or \(p(x\prime) > 1\). It is impossible to have probabilities outside of the range \(p \in [0,1]\). Fortunately, we can use the logistic function to guarantee outputs between 0 and 1 as such:

\[ p(x\prime) = \frac{e^{\beta_{0}+\beta_{i}x_{i}...+\beta_{n}x_{n}}}{1 + e^{\beta_{0}+\beta_{i}x_{i}...+\beta_{n}x_{n}}} \]
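
For intuition, here is a tiny R sketch with made-up coefficients showing that the logistic function squeezes any linear predictor into the (0, 1) range; plogis() computes exactly \(e^{z}/(1 + e^{z})\):

b0 <- -1; b1 <- 0.8 #hypothetical intercept and slope

x <- seq(-5, 5, by = 1)

z <- b0 + b1*x #the linear predictor is unbounded

p <- plogis(z) #the logistic transform keeps every value strictly between 0 and 1

round(cbind(x, z, p), 3)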

From here we can calculate the odds of \(p(x\prime)\):

\[ \frac{p(x\prime)}{1 - p(x\prime)} = e^{\beta_{0} + \beta_{i}x_{i}...+\beta_{n}x_{n}} \]

Reiterating basic probability, the odds tell us about the probability of \(Y = 1\). Odds close to 0 indicate a very low probability that \(Y = 1\), while odds approaching \(\infty\) indicate a very high probability that \(Y = 1\). By taking the log of both sides of the odds equation above, we obtain the logit, or logged odds:

\[ \log \bigg( \, \frac{p(x\prime)}{1 - p(x\prime)} \bigg) \, = \beta_{0} + \beta_{i}x_{i}...+ \beta_{n}x_{n} \]

This is the basis for using logistic regression to make classification predictions. A one-unit increase in predictor \(x_{i}\) changes the logged odds of \(Y = 1\) by \(\beta_{i}\), which is equivalent to multiplying the odds by \(e^{\beta_{i}}\). On the probability scale the effect is not linear as it is in linear regression; the change in \(Pr(Y = 1)\) depends on the current values of the predictors. Even so, the sign of \(\beta_{i}\) generalizes: if \(\beta_{i}\) is positive, increasing \(x_{i}\) increases the logged odds of \(Y = 1\), and a negative \(\beta_{i}\) has the opposite effect.
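
For a concrete, made-up number: with \(\beta_{i} = 0.5\), each one-unit increase in \(x_{i}\) multiplies the odds of \(Y = 1\) by \(e^{0.5}\), no matter where \(x_{i}\) currently sits:

exp(0.5) #about 1.65, so the odds increase by roughly 65% per unit of x_i

exp(-0.5) #about 0.61, a negative coefficient shrinks the odds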

The above seems easy enough, but how can we know the values of \(\beta_{0}\), \(\beta_{i}\), and so on? These parameters are generally estimated through the method of maximum likelihood. Let’s assume \(n\) observations, with \(x\) the set of covariates and \(\beta\) their corresponding coefficients. First, start with the likelihood function:

\[ L(\beta_{0},\beta) = \prod_{i = 1}^{n}p(x_{i})^{y_{i}}(1 - p(x_{i}))^{1 - y_{i}} \]

Taking the log of the likelihood turns the product into a sum, and a little algebra gives the log-likelihood:

\[ \ell(\beta_{0}, \beta) = -\sum_{i=1}^{n}\log\left(1 + e^{\beta_{0} + \beta x_{i}}\right) + \sum_{i=1}^{n}y_{i}(\beta_{0} + \beta x_{i}) \]

Finally, one can find the maximum likelihood estimate for a specific parameter by taking the partial derivative of the log-likelihood. Say we want the estimate \(\beta_{j}\) from the set of estimates \(\beta\). We take the partial derivative of the log-likelihood with respect to \(\beta_{j}\):

\[ \frac{\partial \ell}{\partial \beta_{j}} = -\sum_{i=1}^{n} \frac{1}{1 + e^{\beta_{0} + \beta x_{i}}} e^{\beta_{0} + \beta x_{i}}x_{ij} + \sum_{i=1}^{n} y_{i}x_{ij} \]

\[ = \sum_{i=1}^{n}(y_{i} - p(x_{i};\beta_{0},\beta))x_{ij} \]

From here we have the basis for estimating \(\beta_{j}\). Setting these partial derivatives to zero yields transcendental equations with no closed-form solution, so the estimates must be found numerically; R’s glm() does this via Fisher scoring, which is why its output reports a number of iterations.
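
To make the estimation step concrete, here is a minimal sketch that maximizes this log-likelihood directly with optim() and compares the result against glm(). It assumes the train data frame built above, and the small formula (Survived ~ priority + age2) is just an illustrative choice:

X <- model.matrix(~ priority + age2, data = train) #design matrix with an intercept column

y <- train$Survived

negloglik <- function(beta) {
  eta <- as.vector(X %*% beta) #linear predictor
  sum(log1p(exp(eta))) - sum(y * eta) #negative of the log-likelihood above
}

fit_optim <- optim(rep(0, ncol(X)), negloglik, method = "BFGS")

fit_glm <- glm(Survived ~ priority + age2, data = train, family = binomial(logit))

round(cbind(optim = fit_optim$par, glm = coef(fit_glm)), 4) #the estimates should nearly match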

Logistic Regression using GLM

So let’s put the above into action. Logistic regression models are easy to fit in R using the glm() function. Here I use the logit link.

lfit <- glm(Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + priority + age2 + Parch + Title, 
            data = train, family = binomial(logit))


summary(lfit)
## 
## Call:
## glm(formula = Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + 
##     priority + age2 + Parch + Title, family = binomial(logit), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4405  -0.6336  -0.3982   0.5188   2.4633  
## 
## Coefficients:
##                Estimate  Std. Error z value    Pr(>|z|)    
## (Intercept)    1.114166    1.749083   0.637     0.52412    
## Pclass22      -1.086626    0.403463  -2.693     0.00708 ** 
## Pclass23      -2.181913    0.444036  -4.914 0.000000893 ***
## SibSp         -0.438038    0.143767  -3.047     0.00231 ** 
## fare2          0.002477    0.002457   1.008     0.31342    
## EmbarkedQ     -0.108594    0.395787  -0.274     0.78380    
## EmbarkedS     -0.456701    0.245963  -1.857     0.06334 .  
## priority      -2.171446    0.979025  -2.218     0.02656 *  
## age2          -0.023742    0.024821  -0.957     0.33880    
## Parch         -0.360929    0.136907  -2.636     0.00838 ** 
## TitleDr        0.467218    1.652747   0.283     0.77741    
## TitleLady      2.478722    2.116785   1.171     0.24160    
## TitleMaster    4.471005    1.645752   2.717     0.00659 ** 
## TitleMiss      4.874589    1.739323   2.803     0.00507 ** 
## TitleMlle     17.430855  840.228654   0.021     0.98345    
## TitleMr       -0.338500    1.440880  -0.235     0.81427    
## TitleMrs       5.381037    1.756308   3.064     0.00219 ** 
## TitleMs       18.896646 1455.398594   0.013     0.98964    
## TitleRev     -14.406445  591.633321  -0.024     0.98057    
## TitleSir      -0.156936    1.700415  -0.092     0.92647    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  729.89  on 871  degrees of freedom
## AIC: 769.89
## 
## Number of Fisher Scoring iterations: 14
dev <- 1186.66 - 729.89 #null deviance minus residual deviance from the summary above
dfree <- 890 - 871 #difference in degrees of freedom
sig <- 1 - pchisq(dev, df=dfree)
sig
## [1] 0
#the model is a significant improvement over the null model (p is essentially 0)



anova(lfit, test="Chisq")
## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: Survived2
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev              Pr(>Chi)    
## NULL                       890    1186.66                          
## Pclass2   2  103.547       888    1083.11 < 0.00000000000000022 ***
## SibSp     1    0.040       887    1083.07             0.8417158    
## fare2     1    7.479       886    1075.59             0.0062416 ** 
## Embarked  2   16.721       884    1058.87             0.0002339 ***
## priority  1  238.582       883     820.29 < 0.00000000000000022 ***
## age2      1    3.652       882     816.63             0.0559969 .  
## Parch     1    0.941       881     815.69             0.3320660    
## Title    10   85.802       871     729.89   0.00000000000003626 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
library(car) #for vif()
vif(lfit)
##               GVIF Df GVIF^(1/(2*Df))
## Pclass2   4.533083  2        1.459145
## SibSp     2.085296  1        1.444055
## fare2     1.640160  1        1.280687
## Embarked  1.309050  2        1.069644
## priority 27.055227  1        5.201464
## age2      4.978660  1        2.231291
## Parch     1.521050  1        1.233309
## Title    46.902284 10        1.212159
test$Survived <- predict(lfit, newdata = test, type="response")
test$Survived[test$Survived > .5] <- "Yes"
test$Survived[test$Survived <= .5] <- "No"
vars <- c("PassengerId", "Survived")
sub <- test[vars]
sub$Survived[sub$Survived=="Yes"] <- 1
sub$Survived[sub$Survived=="No"] <- 0
sub$Survived <- as.factor(sub$Survived)
write.csv(sub, "besaw_sub.csv", row.names=FALSE) #Kaggle expects only the PassengerId and Survived columns

That submission scores 0.79426, not bad… Let’s try some interaction effects.

lfit <- glm(Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + priority + age2 + Parch + Title + priority*Pclass2 +
              SibSp*priority + Pclass2*SibSp + Pclass2*Parch, 
            data = train, family = binomial(logit))


summary(lfit)
## 
## Call:
## glm(formula = Survived2 ~ Pclass2 + SibSp + fare2 + Embarked + 
##     priority + age2 + Parch + Title + priority * Pclass2 + SibSp * 
##     priority + Pclass2 * SibSp + Pclass2 * Parch, family = binomial(logit), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.0165  -0.5355  -0.4199   0.3414   2.3268  
## 
## Coefficients:
##                      Estimate  Std. Error z value Pr(>|z|)   
## (Intercept)          0.679764    1.914146   0.355  0.72249   
## Pclass22            -1.604792    0.547557  -2.931  0.00338 **
## Pclass23            -1.345472    0.536659  -2.507  0.01217 * 
## SibSp                0.305911    0.357008   0.857  0.39151   
## fare2                0.001726    0.002698   0.640  0.52227   
## EmbarkedQ           -0.155566    0.381959  -0.407  0.68380   
## EmbarkedS           -0.535461    0.250159  -2.140  0.03232 * 
## priority            -0.140686    1.170246  -0.120  0.90431   
## age2                -0.011483    0.031468  -0.365  0.71519   
## Parch               -0.744546    0.418493  -1.779  0.07522 . 
## TitleDr             -0.097666    1.689988  -0.058  0.95392   
## TitleLady            0.816524    2.033841   0.401  0.68808   
## TitleMaster          4.284450    1.700119   2.520  0.01173 * 
## TitleMiss            4.415994    1.798349   2.456  0.01407 * 
## TitleMlle           15.384437  840.259572   0.018  0.98539   
## TitleMr             -0.600261    1.449138  -0.414  0.67871   
## TitleMrs             4.603745    1.813619   2.538  0.01114 * 
## TitleMs             17.652788 1455.398737   0.012  0.99032   
## TitleRev           -13.915157  589.976359  -0.024  0.98118   
## TitleSir            -0.362893    1.712217  -0.212  0.83215   
## Pclass22:priority   -0.139366    0.974519  -0.143  0.88628   
## Pclass23:priority   -2.485059    0.842521  -2.950  0.00318 **
## SibSp:priority      -0.321285    0.382498  -0.840  0.40093   
## Pclass22:SibSp      -0.559350    0.606727  -0.922  0.35657   
## Pclass23:SibSp      -0.469754    0.476165  -0.987  0.32387   
## Pclass22:Parch       1.193951    0.607597   1.965  0.04941 * 
## Pclass23:Parch       0.424308    0.443516   0.957  0.33872   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1186.66  on 890  degrees of freedom
## Residual deviance:  691.46  on 864  degrees of freedom
## AIC: 745.46
## 
## Number of Fisher Scoring iterations: 14

The drop in deviance seemed promising, but this model scores 0.78, which is not an improvement…


  1. It is actually just one missing value.

  2. Multiple imputation assumes that missing values are missing at random (MAR). One should be careful when performing imputation on real-world data, as the MAR assumption is often not plausible. I am not sure why Age has so many missing values here; it could be by design on Kaggle’s part, which is what I assume for this analysis. Normally, if one variable has such a high degree of missingness, it is unlikely to be missing at random. In addition, missing values should normally not exceed 5%-10% of a measure’s observations; a much higher degree of missingness is often a sign of a MAR violation. Consequently, with more missingness, you are more likely to artificially construct patterns within the imputed data if your baseline data is unrepresentative of the real population.
