San Francisco Crime Data Analysis Part 1

Hello. Today I will analyze the San Francisco Crime Data which can be found at Kaggle. In another post, I will plot the data onto the San Francisco map. Are you ready to discover how crime is taking place in this beautiful city?

NOTE: In my heat maps, if the plotted values are calculated within aparticular variable, across all the other variables (rather than both variables which is the usual case in heat maps), then to distinguish the particular variable I draw grey (horizontal or vertical) lines. I prefer this kind of heat maps when analyzing two variables, and their joint behavior (instead of say faceted plots), since in some cases having heat maps is easier to distinguish and provides a clearer big picture.

library(data.table)
library(ggplot2)
library(lubridate)
library(plotly)

trn = fread('train.csv')
## 
Read 67.2% of 878049 rows
Read 878049 rows and 9 (of 9) columns from 0.119 GB file in 00:00:03
str(trn)
## Classes 'data.table' and 'data.frame':   878049 obs. of  9 variables:
##  $ Dates     : chr  "2015-05-13 23:53:00" "2015-05-13 23:53:00" "2015-05-13 23:33:00" "2015-05-13 23:30:00" ...
##  $ Category  : chr  "WARRANTS" "OTHER OFFENSES" "OTHER OFFENSES" "LARCENY/THEFT" ...
##  $ Descript  : chr  "WARRANT ARREST" "TRAFFIC VIOLATION ARREST" "TRAFFIC VIOLATION ARREST" "GRAND THEFT FROM LOCKED AUTO" ...
##  $ DayOfWeek : chr  "Wednesday" "Wednesday" "Wednesday" "Wednesday" ...
##  $ PdDistrict: chr  "NORTHERN" "NORTHERN" "NORTHERN" "NORTHERN" ...
##  $ Resolution: chr  "ARREST, BOOKED" "ARREST, BOOKED" "ARREST, BOOKED" "NONE" ...
##  $ Address   : chr  "OAK ST / LAGUNA ST" "OAK ST / LAGUNA ST" "VANNESS AV / GREENWICH ST" "1500 Block of LOMBARD ST" ...
##  $ X         : num  -122 -122 -122 -122 -122 ...
##  $ Y         : num  37.8 37.8 37.8 37.8 37.8 ...
##  - attr(*, ".internal.selfref")=
summary(trn)
##     Dates             Category           Descript        
##  Length:878049      Length:878049      Length:878049     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##   DayOfWeek          PdDistrict         Resolution       
##  Length:878049      Length:878049      Length:878049     
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##    Address                X                Y        
##  Length:878049      Min.   :-122.5   Min.   :37.71  
##  Class :character   1st Qu.:-122.4   1st Qu.:37.75  
##  Mode  :character   Median :-122.4   Median :37.78  
##                     Mean   :-122.4   Mean   :37.77  
##                     3rd Qu.:-122.4   3rd Qu.:37.78  
##                     Max.   :-120.5   Max.   :90.00
trn[, DayOfWeek:=factor(DayOfWeek, ordered=TRUE, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))]

The train set has 878049 rows, and 9 columns. The DayOfWeek is also converted into an ordered factor. This is important if you want the days to be shown properly on plots.

Let’s check address real quick. I do not think that address will be that important regarding plotting data and making predictions, since we already have latitude and longitude data.

trn[, .N, by = Address][order(N, decreasing = T)][1:10]
##                      Address     N
##  1:   800 Block of BRYANT ST 26533
##  2:   800 Block of MARKET ST  6581
##  3: 2000 Block of MISSION ST  5097
##  4: 1000 Block of POTRERO AV  4063
##  5:   900 Block of MARKET ST  3251
##  6:       0 Block of TURK ST  3228
##  7:        0 Block of 6TH ST  2884
##  8:    300 Block of ELLIS ST  2703
##  9:    400 Block of ELLIS ST  2590
## 10:     16TH ST / MISSION ST  2504

It is interesting that some streets really have very high crime rate. 800 block of Bryant St and Market St pop out from the rest.

Let’s see how Category and Description fares. These must be highly correlated.

d=trn[, .N, by = Category][order(N, decreasing = T)]
ggplot(d, aes(x = reorder(Category, -N), y = N, fill = N)) + 
    geom_bar(stat = 'identity') + 
    coord_flip() + 
    scale_fill_continuous(guide=FALSE) + 
    labs(x = '', y = 'Total Number of Crimes', title = 'Total Count of Crimes per Category') 

trn[, .N, by = Descript][order(N, decreasing = T)]
##                                               Descript     N
##   1:                      GRAND THEFT FROM LOCKED AUTO 60022
##   2:                                     LOST PROPERTY 31729
##   3:                                           BATTERY 27441
##   4:                                 STOLEN AUTOMOBILE 26897
##   5:             DRIVERS LICENSE, SUSPENDED OR REVOKED 26839
##  ---                                                        
## 875:                       PLANTING/CULTIVATING PEYOTE     1
## 876:                               REFUSAL TO IDENTIFY     1
## 877:                      SHOOTING BY JUVENILE SUSPECT     1
## 878:            DESTROYING JAIL PROPERTY-$200 OR UNDER     1
## 879: EMBEZZLEMENT, GRAND THEFT PUBLIC/PRIVATE OFFICIAL     1

There are 39 crime categories, and 879 unique descriptions. The descriptions are like categories with finer details. It could be useful to break down the descriptions into words, and use them as predictors.

Let’s now have a look at police districts, resolution and some heat maps with previous variables.

d = trn[, .N, by = PdDistrict][order(N, decreasing = T)]
ggplot(d, aes(x = reorder(PdDistrict, -N), y = N, fill = PdDistrict)) + 
    geom_bar(stat = 'identity') + 
    coord_flip() + 
    guides(fill = F) + 
    labs(x = '', y = 'Total Number of Crimes', title = 'Total Number of Crimes per District') 

d = trn[, .N, by = Resolution][order(N, decreasing = T)]
ggplot(d, aes(x = reorder(Resolution, -N), y = N, fill = Resolution)) + 
    geom_bar(stat = 'identity') + 
    coord_flip() + 
    guides(fill = F) + 
    labs(x = '', y = 'Total Number of Crimes', title = 'Total Number of Crimes per Resolution Type') 

So we have 17 unique resolutions for comes, and 10 unique districts, with the Southern District being the top crime area. I wonder whether the crime rich streets fall within this district.

trn[i= Address =='800 Block of BRYANT ST', .N, by = PdDistrict]
##    PdDistrict     N
## 1:   SOUTHERN 26533

Yes, indeed it does. There is an easier way actually. Let’s check the top 10 streets and their districts.

trn[,.N, by =.(Address, PdDistrict)][order(N, decreasing = T)][1:10]
##                      Address PdDistrict     N
##  1:   800 Block of BRYANT ST   SOUTHERN 26533
##  2:   800 Block of MARKET ST   SOUTHERN  6581
##  3: 2000 Block of MISSION ST    MISSION  5090
##  4: 1000 Block of POTRERO AV    MISSION  4063
##  5:   900 Block of MARKET ST   SOUTHERN  3251
##  6:       0 Block of TURK ST TENDERLOIN  3228
##  7:        0 Block of 6TH ST   SOUTHERN  2884
##  8:    300 Block of ELLIS ST TENDERLOIN  2703
##  9:    400 Block of ELLIS ST TENDERLOIN  2589
## 10:     16TH ST / MISSION ST    MISSION  2504

By holding the first position in number of crimes, and having 4 contenders in the top 10, Southern District is the winner.

Now it is time to analyze how variables interact with each other. Let’s check how each district’s preferred resolution is.

d = trn[, .N, by =.(PdDistrict, Resolution)][order(N, decreasing = T)]
g = ggplot(d, aes(x = PdDistrict, y = Resolution, fill = N)) + 
    geom_tile() +
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)

We see that ‘None’ resolution dominates all the others, so taking it out will make the plot more useful.

d = trn[, .N, by =.(PdDistrict, Resolution)][order(N, decreasing = T)][i= !Resolution %in% c('NONE')]
g = ggplot(d, aes(x = PdDistrict, y = Resolution, fill = N)) + 
    geom_tile() +
    scale_fill_distiller(palette = 'RdYlGn') +
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)

Arrest-Booked sticks out from the rest as top crime type, happening the most in top three crime districts, namely Southern, Mission, and Tenderloin.

How about plotting the same heat map, but now the percentages of each district within the resolution types?

d = trn[, .N, by =.(PdDistrict, Resolution)][,Percentage:=N/sum(N)*100, by = Resolution][order(N, decreasing = T)]
g = ggplot(d, aes(x = PdDistrict, y = Resolution, fill = Percentage)) + 
    geom_tile() +
    geom_hline(yintercept=seq(0.5, 30, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') +
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)

# g = ggplot(d, aes(x = PdDistrict, y = Percentage, fill = Percentage)) + 
#     geom_bar(stat = 'identity') +
#     # geom_tile() +
#     # geom_hline(yintercept=seq(0.5, 30, by=1), size = 1) + 
#     facet_grid(.~Resolution) +
#     scale_fill_distiller(palette = 'RdYlGn') +
#     theme(axis.title.y = element_blank(), 
#           axis.text.y = element_blank(), 
#           axis.text.x = element_blank())
# g

We can see that each crime type is distributed evenly over the districts in most cases. Exceptional clearance, psychopathic case is happening the most in Southern District. I have added horizontal grey lines to make it clear that the percentages over the districts sum up to 100% for a particular Resolution.

Does the day affect crime type and volume? First let's order the days of the week, starting from Monday, and then plot a heat map.

d = trn[, .N, by =.(PdDistrict, DayOfWeek)][,Percentage:=N/sum(N)*100]
g = ggplot(d, aes(x = DayOfWeek, y = PdDistrict, fill = Percentage)) + 
    geom_tile() +
    scale_fill_distiller(palette = 'RdYlGn') +
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

Again we can confirm that Southern District being the least safe place, while Richmond and Park being the best place in terms of crime. Friday and Saturday seems to be highest crime days compared to the rest of the week.

Now, let's see whether certain days have more crime rate.

d = trn[, .N, by =.(PdDistrict, DayOfWeek)][,Percentage:=N/sum(N)*100, by = DayOfWeek]
#d[,sum(Percentage),by= DayOfWeek]

g = ggplot(d, aes(x = DayOfWeek, y = PdDistrict, fill = Percentage)) + 
    geom_tile() +
    geom_vline(xintercept=seq(0.5, 10, by=1), size = 1, color = 'grey') +
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

Above, we try to see whether on a given day, the crime rate differs across the districts. As was shown before, Southern District is the most crime rich area. We can also see whether on certain days, the ordering of districts changes or not. The crime rate over the districts does not change with day of week, and is quite stable.

d = trn[, .N, by =.(PdDistrict, DayOfWeek)][,Percentage:=N/sum(N)*100, by = PdDistrict]
#d[,sum(Percentage),by= PdDistrict]

g = ggplot(d, aes(x = DayOfWeek, y = PdDistrict, fill = Percentage)) + 
    geom_tile() +
    geom_hline(yintercept=seq(0.5, 10, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') +
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

With the above plot we can see whether crime rate changes within a police district based on the day. We can see that Friday is the day with highest crime rate, and Sunday is the lowest. Wednesday is also seemingly high. Another interesting fact is that Tenderloin is relatively a safer neighborhood to be at on Fridays.

How about day-crime type relation? Do you think day would affect the behavior of the criminals? Let's see.

d = trn[, .N, by =.(Category, DayOfWeek)][,Percentage:=N/sum(N)*100, by = DayOfWeek][order(N, decreasing = T)]
g = ggplot(d, aes(x = DayOfWeek, y = Category, fill = Percentage)) + 
    geom_tile() +
    geom_vline(xintercept=seq(0.5, 10, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

The distribution of crime categories stay the same in each day, with Fridays and Saturdays being a little more crowded. Also, some classes behave opposite to each other. Theft and crimes-other for example, have negative relationship. It seems like on Fridays and Saturdays Theft is preferred more, and on the other days 'other' crimes.

Does the crime rates with respect to categories change over days? It is very much possible that some days, some crimes might be more proffered compared to the rest. For this I'll calculate the percentages within each category.

d = trn[, .N, by =.(Category, DayOfWeek)][,Percentage:=N/sum(N)*100, by = Category][order(N, decreasing = T)]
#d[,sum(Percentage),by= Category]

g = ggplot(d, aes(x = DayOfWeek, y = Category, fill = Percentage)) + 
    geom_tile() +
    geom_hline(yintercept=seq(0.5, 40, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

![]https://datacademy.files.wordpress.com/2016/06/unnamed-chunk-14-11.png)

Indeed we can clearly some crimes being more prevalent on certain days. Gambling for example is mostly done on Fridays, while drunkenness is mostly at the weekends.

# Convert date
trn[, Dates := parse_date_time(Dates, "ymd HMS")]
trn[, `:=`(Year = year(Dates),
           Month = month(Dates),
           Hour = hour(Dates))]

trn[, `:=`(Year = as.factor(Year), Month = as.factor(Month))]
d = trn[,.N,by=Year]
g = ggplot(d, aes(x = Year, y = N)) + geom_bar(stat = 'identity')
ggplotly(g)

d = trn[, .N, by=.(Month, Year)][order(Year,Month, decreasing = T)]
d[1:12]
##     Month Year    N
##  1:     5 2015 2250
##  2:     4 2015 6609
##  3:     3 2015 6851
##  4:     2 2015 6008
##  5:     1 2015 5866
##  6:    12 2014 5391
##  7:    11 2014 6471
##  8:    10 2014 7303
##  9:     9 2014 6667
## 10:     8 2014 6147
## 11:     7 2014 5808
## 12:     6 2014 5992
d = d[,.N, by = Year]
d
##     Year  N
##  1: 2015  5
##  2: 2014 12
##  3: 2013 12
##  4: 2012 12
##  5: 2011 12
##  6: 2010 12
##  7: 2009 12
##  8: 2008 12
##  9: 2007 12
## 10: 2006 12
## 11: 2005 12
## 12: 2004 12
## 13: 2003 12

We can see that overall the number of crimes committed per year stayed about the same. However, 2015 is very low. What suddenly happened in 2015? Well this info is already given on the data site, but let's find out ourselves. Suspecting not the same amount of data was collected in 2015, I have created the table showing each month and year combination, and then the number of months from each year data was collected. Indeed, in 2015 data in the first 5 months was collected. Taking that into account, 2015 is also about the same with the other years.

Now let's see whether crime categories changed over the years, or stayed the same. 2015 will be taken out from the data since it has only 5 months. I will also take out Trea since it is present only recently, and the occurrences are very few.

d = trn[i=Year!=2015 & Category != 'TREA', .N, by = .(Category, Year)][,Percentage:=N/sum(N)*100, by=Category]
g = ggplot(d, aes(x = Year, y = Category, fill = Percentage)) + 
    geom_tile() +
    geom_hline(yintercept=seq(0.5, 40, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') +
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

By holding the category constant and calculating the percentage of crime over the years, we see that some categories have some changes over the years. Loitering was higher during 2007 and 2008 with close to 18%, while it decreased to 2% in 2014. Vehicle theft also decreased by more than 50% from 17% in 200, to 7% in 2014.

d = trn[i=Year!=2015 & Category != 'TREA', .N, by =.(Category, Year)][,Percentage:=N/sum(N)*100, by = Year]
g = ggplot(d, aes(x = Year, y = Category, fill = Percentage)) + 
    geom_tile() +
    geom_vline(xintercept=seq(0.5, 40, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank())
ggplotly(g)

The above plot let's us see the percentage of each crime type within certain a year, and also let's us realize any trends if there is any. Very quickly it can be seen that drug, burglary, assault, larceny/theft are the most common type of crimes. Also, the weight of larceny/theft within a year increased from 18% in 2003 to 26% in 2014. This can help the SFPD to look into this situation and take more precautions. Vehicle theft on the other hand decreased from 13% in 2005 to 5% in 2014.

I am also very much interested in seeing the relation between the crime types and districts. It might be that some crime types are more prevalent in certain areas.

d = trn[, .N, by =.(Category, PdDistrict)][,Percentage:=N/sum(N)*100]
g = ggplot(d, aes(x = PdDistrict, y = Category, fill = Percentage)) + 
    geom_tile() +
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)

Looking at the overall heat map we can see that some crime types stick out in some areas.

d = trn[, .N, by =.(Category, PdDistrict)][,Percentage:=N/sum(N)*100, by = Category]
g = ggplot(d, aes(x = PdDistrict, y = Category, fill = Percentage)) + 
    geom_tile() +
    geom_hline(yintercept=seq(0.5, 40, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') +
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)

When the category is held constant, we can see that prostitution happens most in mission district with 50% among all the districts, drug happens most in Tenderloin with 33%, and loitering in Southern with 35% among all the districts. So it is a fact that some crime types happen to choose some areas more than the others.

d = trn[, .N, by =.(Category, PdDistrict)][,Percentage:=N/sum(N)*100, by = PdDistrict]
g = ggplot(d, aes(x = PdDistrict, y = Category, fill = Percentage)) + 
    geom_tile() +
    geom_vline(xintercept=seq(0.5, 40, by=1), size = 1, color = 'grey') + 
    scale_fill_distiller(palette = 'RdYlGn') + 
    theme(axis.title.y = element_blank(), 
          axis.title.x = element_blank(),
          axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(g)

By checking the distribution of crime types within each area, we can see that they are mostly stable with about the same percentage within each district. However, larceny/theft happens to be significantly more prevalent in Central, Northern, and Southern.

In this post I have analyzed the San Francisco Crime Data. Although this analysis can provide insight into the relation between crime, district, weekday, and some other variables, it is by no way complete without maps. That’s exactly what I’ll be doing in my next post. See you then!

Ilyas Ustun

Expedia Data Analysis Part 1

Expedia Hotel Recommendations

This dataset can be found at Kaggle. We are given logs of visitors at different Expedia sites and are asked to predict the hotel clusters in the test set. Expedia aims to use customer data to improve their hotel recommendations. In this blog, I will analyze this dataset and try to get some insight of it.

## 

Read 37670293 rows and 24 (of 24) columns from 3.791 GB file in 00:02:37

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  25981347 1387.6   36236006 1935.3  25989623 1388.0
## Vcells 646504616 4932.5 1095841779 8360.7 651377761 4969.7

## Classes 'data.table' and 'data.frame':   37670293 obs. of  24 variables:
##  $ date_time                : chr  "2014-08-11 07:46:59" "2014-08-11 08:22:12" "2014-08-11 08:24:33" "2014-08-09 18:05:16" ...
##  $ site_name                : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ posa_continent           : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ user_location_country    : int  66 66 66 66 66 66 66 66 66 66 ...
##  $ user_location_region     : int  348 348 348 442 442 442 189 189 189 189 ...
##  $ user_location_city       : int  48862 48862 48862 35390 35390 35390 10067 10067 10067 10067 ...
##  $ orig_destination_distance: num  2234 2234 2234 913 914 ...
##  $ user_id                  : int  12 12 12 93 93 93 501 501 501 501 ...
##  $ is_mobile                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_package               : int  1 1 0 0 0 0 0 1 0 0 ...
##  $ channel                  : int  9 9 9 3 3 3 2 2 2 2 ...
##  $ srch_ci                  : chr  "2014-08-27" "2014-08-29" "2014-08-29" "2014-11-23" ...
##  $ srch_co                  : chr  "2014-08-31" "2014-09-02" "2014-09-02" "2014-11-28" ...
##  $ srch_adults_cnt          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ srch_children_cnt        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ srch_rm_cnt              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ srch_destination_id      : int  8250 8250 8250 14984 14984 14984 8267 8267 8267 8267 ...
##  $ srch_destination_type_id : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ is_booking               : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ cnt                      : int  3 1 1 1 1 1 2 1 1 1 ...
##  $ hotel_continent          : int  2 2 2 2 2 2 2 2 2 2 ...
##  $ hotel_country            : int  50 50 50 50 50 50 50 50 50 50 ...
##  $ hotel_market             : int  628 628 628 1457 1457 1457 675 675 675 675 ...
##  $ hotel_cluster            : int  1 1 1 80 21 92 41 41 69 70 ...
##  - attr(*, ".internal.selfref")=

##   date_time           site_name      posa_continent user_location_country
##  Length:37670293    Min.   : 2.000   Min.   :0.00   Min.   :  0.00       
##  Class :character   1st Qu.: 2.000   1st Qu.:3.00   1st Qu.: 66.00       
##  Mode  :character   Median : 2.000   Median :3.00   Median : 66.00       
##                     Mean   : 9.795   Mean   :2.68   Mean   : 86.11       
##                     3rd Qu.:14.000   3rd Qu.:3.00   3rd Qu.: 70.00       
##                     Max.   :53.000   Max.   :4.00   Max.   :239.00       
##                                                                          
##  user_location_region user_location_city orig_destination_distance
##  Min.   :   0.0       Min.   :    0      Min.   :    0            
##  1st Qu.: 174.0       1st Qu.:13009      1st Qu.:  313            
##  Median : 314.0       Median :27655      Median : 1140            
##  Mean   : 308.4       Mean   :27753      Mean   : 1970            
##  3rd Qu.: 385.0       3rd Qu.:42413      3rd Qu.: 2553            
##  Max.   :1027.0       Max.   :56508      Max.   :12408            
##                                          NA's   :13525001         
##     user_id          is_mobile        is_package        channel      
##  Min.   :      0   Min.   :0.0000   Min.   :0.0000   Min.   : 0.000  
##  1st Qu.: 298910   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: 2.000  
##  Median : 603914   Median :0.0000   Median :0.0000   Median : 9.000  
##  Mean   : 604452   Mean   :0.1349   Mean   :0.2489   Mean   : 5.871  
##  3rd Qu.: 910168   3rd Qu.:0.0000   3rd Qu.:0.0000   3rd Qu.: 9.000  
##  Max.   :1198785   Max.   :1.0000   Max.   :1.0000   Max.   :10.000  
##                                                                      
##    srch_ci            srch_co          srch_adults_cnt srch_children_cnt
##  Length:37670293    Length:37670293    Min.   :0.000   Min.   :0.0000   
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:0.0000   
##  Mode  :character   Mode  :character   Median :2.000   Median :0.0000   
##                                        Mean   :2.024   Mean   :0.3321   
##                                        3rd Qu.:2.000   3rd Qu.:0.0000   
##                                        Max.   :9.000   Max.   :9.0000   
##                                                                         
##   srch_rm_cnt    srch_destination_id srch_destination_type_id
##  Min.   :0.000   Min.   :    0       Min.   :0.000           
##  1st Qu.:1.000   1st Qu.: 8267       1st Qu.:1.000           
##  Median :1.000   Median : 9147       Median :1.000           
##  Mean   :1.113   Mean   :14441       Mean   :2.582           
##  3rd Qu.:1.000   3rd Qu.:18790       3rd Qu.:5.000           
##  Max.   :8.000   Max.   :65107       Max.   :9.000           
##                                                              
##    is_booking           cnt          hotel_continent hotel_country  
##  Min.   :0.00000   Min.   :  1.000   Min.   :0.000   Min.   :  0.0  
##  1st Qu.:0.00000   1st Qu.:  1.000   1st Qu.:2.000   1st Qu.: 50.0  
##  Median :0.00000   Median :  1.000   Median :2.000   Median : 50.0  
##  Mean   :0.07966   Mean   :  1.483   Mean   :3.156   Mean   : 81.3  
##  3rd Qu.:0.00000   3rd Qu.:  2.000   3rd Qu.:4.000   3rd Qu.:106.0  
##  Max.   :2.00000   Max.   :269.000   Max.   :6.000   Max.   :212.0  
##                                                                     
##   hotel_market    hotel_cluster  
##  Min.   :   0.0   Min.   : 0.00  
##  1st Qu.: 160.0   1st Qu.:25.00  
##  Median : 593.0   Median :49.00  
##  Mean   : 600.5   Mean   :49.81  
##  3rd Qu.: 701.0   3rd Qu.:73.00  
##  Max.   :2117.0   Max.   :99.00  
## 

The is_booking and cnt variables are not present in the test
set. And as expected the hotel_cluster variable is also not present
in the test set, as this is the variable to be predicted.

The test set has an ID column, which is not present in the train set.
The ID will be the first column in the submission file, and the
predicted hotel_cluster will be the second. The sample submission file
already has the ID column, thus the ID of test file is not imported.

Let’s convert the types of some columns. is_mobile, is_package,
channel, posa_continent, hotel_continent are categorical variables and
will be converted by using the as.factor function. There are some
other categorical varaibles such as hotel_cluster, hotel_country,
user_location, etc. but these will be left as integer values, since
they have too manyentries, which makes it harder to deal with them when
plotting.

The date columns are imported as characters, and will be converted to
POSIXct date format. The date_time column is the time a search was
made on the website by a user. The year, month, day of week, and day of
month, and hour information will be extracted from the date_time
columns.

I am also making use of garbage collector gc() functions, since I have
realized that when working with big data, there is a lot of leftovers in
the memory which should be cleaned.

trn[, `:=`(is_mobile = as.factor(is_mobile),
               is_package = as.factor(is_package),
               channel = as.factor(channel),
               posa_continent = as.factor(posa_continent),
               hotel_continent = as.factor(hotel_continent))]
gc()
trn[, date_time := parse_date_time(date_time, "%y-%m-%d %H:%M:%S")]
class(trn[,date_time])
## [1] 'POSIXct' 'POSIXt'
trn[, `:=`(date_year = as.factor(year(date_time)), 
               date_month = as.factor(month(date_time)), 
               date_day = as.factor(day(date_time)), 
               date_wday = as.factor(wday(date_time, label = T)), 
               date_hour = as.factor(hour(date_time)) )]
gc()

The train dataset has 37,670,293 rows, and 24 columns. This is huge! Now
that we have imported the data and made some conversions, let's start
our analysis.

Exploring the proportion of sites visited reveals that site_name 2 is
the most frequently visited site, which dominates all the others. So, it
has been taken out in the second graph, in order to see other
proportions better.

# barplot site_name
d = trn[, .N, by = site_name][, j = .(site_name, Prop = N/ sum(N))]
g = ggplot(d, aes(x = site_name, y = Prop))
g1 = g + geom_bar(aes(fill = factor(site_name)), stat = 'identity' ) + scale_fill_discrete(name = 'site_name')
ggplotly(g1)

# Site name 2 dominates all the others
d = d[i = site_name != 2]
g = ggplot(d, aes(x = site_name, y = Prop))
g1 = g + geom_bar(aes(fill = factor(site_name)), stat = 'identity') + scale_fill_discrete(name = 'site_name')
ggplotly(g1)

Hotel Cluster – Mobile – Package Relationship

The following graphs are to gain some insight into hotel cluster –
mobile relationship. The majority of customers are not mobile. I use
tile graph where each point is depicted as a rectangle and colored based
on the intensity of the 'fill' value. In this case this is the number of
points falling into each category.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile)]
g = ggplot(d, aes(x = is_mobile, y = hotel_cluster)) + geom_raster(aes(fill = N))
ggplotly(g)

Are there any hotel clusters in which mobile is more prevalent? Instead
of total number at each hotel cluster, it makes more sense to check the
proportions. The proportions are calculated over each hotel_cluster.
Thus, the probabilities in is_mobile0 and is_mobile1 sum up to 1 for
each hotel_cluster.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile)][, Prop := N/ sum(N), by = .(hotel_cluster)]
g = ggplot(d, aes(x = is_mobile, y = hotel_cluster)) + geom_raster(aes(fill = Prop))
ggplotly(g)


All the hotel_clusters have more or less the same proportion among
is_mobile 0 and 1, with just a few having a larger proportion than
average. Although I like tile graphs, bar graph could be a better
visualization in making a comparison. Let's do that.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile)][, Prop := N/ sum(N), by = .(hotel_cluster)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop)) + geom_bar(aes(fill = Prop), stat = 'identity') + facet_grid(is_mobile~., labeller = label_both)
ggplotly(g)

OK, we see the same here. So no differnce between being mobile or not
and hotel_clusters.

Now, let's check the relation between is_package and hotel_cluster.
Here I have preferred to use the bar plot rather than the tile graph.

d = trn[,j = .N, by = .(hotel_cluster, is_package)]
g = ggplot(d, aes(x = hotel_cluster, y = N)) + geom_bar(aes(fill=factor(hotel_cluster)), stat = 'identity') + facet_grid(is_package~., labeller = label_both)
ggplotly(g)

The above graph tells us that some hotel clusters have more visits. But
in order to see if any hotel cluster is more related to being searched
within a package, we will need to plot the proportions within each
hotel_cluster. In my opinion, some hotel clusters can be related to
being searched within a package deal.

d = trn[,j = .N, by = .(hotel_cluster, is_package)][, Prop := N/ sum(N), by = .(hotel_cluster)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop)) + geom_bar(aes(fill=factor(hotel_cluster)) , stat = 'identity') + facet_grid(is_package~., labeller = label_both) + theme_dark()
ggplotly(g)

Indeed, some clusters such as 52, 65, 66, and 87 are more related to
being in a package.

Below I am trying to find a relationship between hotel_cluster, is_mobile, and is_package.

First, when is_mobile is 1, does is_package change or stay the same? As, can be seen, it stays the same, albeit slightly higher proportions of is_package 1 and is_mobile1. Second, when is_package has a large value for a hotel_cluster, does it behave similarly in is_mobile? As can be seen from second graph, is_mobile behaves the same in both of its categories, suggesting indifference to is_package.

d = trn[,j = .N, by = .(hotel_cluster, is_mobile, is_package)][, Prop := N/ sum(N), by = .(hotel_cluster, is_package)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop, fill = Prop))+ geom_bar(stat = 'identity')+facet_grid(is_mobile~is_package, labeller = label_both)
ggplotly(g)

d = trn[,j = .N, by = .(hotel_cluster, is_mobile, is_package)][, Prop := N/ sum(N), by = .(hotel_cluster, is_mobile)]
g = ggplot(d, aes(x = hotel_cluster, y = Prop, fill = Prop))+ geom_bar(stat = 'identity')+facet_grid(is_mobile~is_package, labeller = label_both)
ggplotly(g)

Channel of Marketing

Let's investigate the behavior of marketing channel.

levels(trn$channel)

    ##  "0"  "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10"
d = trn[, j = .N, by = .(channel)][, Prop:=N/sum(N)]
g = ggplot(d, aes(x=channel, y= Prop, fill = channel)) + geom_bar(stat = 'identity')
ggplotly(g)

The proportion of channel 9 dominates all the others. Channel 10 is the
least used channel. It makes me wonder whether there is any relation
between channel and being mobile?

d = trn[, j = .N, by = .(is_mobile, channel)][, Prop:=N/sum(N), by = channel]
g = ggplot(d, aes(x = is_mobile, y = channel, fill = Prop)) + geom_raster() + scale_fill_gradient2(name="N", low="blue", mid = 'white' , high="red")
ggplotly(g)

Channel 0, 1 and 2 have larger proportions than average in mobile users,
suggesting that these marketing channels might be a little bit more
geared towards mobile users.

Channel 10, and 6 are interesting as their proportion was very low which
requires further investigation. It might be that Channel 10 and 6 were
unsuccessful and were abandoned after a while. Let's see.

Let's plot months over the two years for channel 9, 6, and 10. For this
I will use the new data extracted from the date field.

d = trn[i = channel == 9, .N, by = .(date_year, date_month)][order(date_year, date_month)]
g = ggplot(d,aes(x= date_month, y = N))+ geom_bar(aes(fill = date_month), stat = 'identity') + facet_grid(date_year~.) + theme(legend.position="none") + ggtitle("Channel 9")
ggplotly(g)

d = trn[i = channel == 6, .N, by = .(date_year, date_month)][order(date_year, date_month)]
g = ggplot(d,aes(x= date_month, y = N, fill = date_month))+ geom_bar(stat = 'identity') + facet_grid(date_year~.) + theme(legend.position="none") + ggtitle("Channel 6")
ggplotly(g)

d = trn[i = channel == 10, .N, by = .(date_year, date_month)][order(date_year, date_month)]
g = ggplot(d,aes(x= date_month, y = N, fill = date_month))+ geom_bar(stat = 'identity') + facet_grid(date_year~.) + theme(legend.position="none") + ggtitle("Channel 10")
ggplotly(g)

As can be seen from the tables, I first checked channel 9, which is
distributed over the months in 2013, and 2014. The amount increased by
more than 100% after the first half of 2014.

Channel 6 seems to be a new marketing channel which is seeing some
growth over time. In fact, it has almost quadrupled in the second half
of 2014, compared to January of 2014.

Channel 10 as suspected was tried in 2013 and 2014 but it seems to be a
failure. The numbers are so low, and it seems abandoned after June of
2014. It might also be a very niche market, tried for a little while.

I also have checked all the other marketing channels,and they all behave
almost the same as channel 9. This makes me wonder that ther are amny
more datapoints from 2014, than 2013.

trn[, .N, by = date_year]
##    date_year        N
## 1:      2014 26483412
## 2:      2013 11186881

Indeed, as can be seen there about 11 million points from 2013, and
almost 26.5 million from 2014. This change might be due to the fact the
internet traffic to Expedia is rising by large amounts, and/or simply
more points were sampled from 2014.

Anyway, this is all for now. Hope you gained some insight. I will
analyze the other variables in my second post on this topic. I will also
look further into the trends of different variables. I am curious to see
whether mobile traffic increased from 2013 to 2014.

Please comment below. I am always open to suggestions, and best
practices. Thank you and see you soon!