Exploring the Online Retail Dataset
This workbook was created using the “dataexpks” template:
https://github.com/DublinLearningGroup/dataexpks
1 Introduction
This workbook performs the basic data exploration of the dataset.
<- 100
dataexp_level_exclusion_threshold
<- 40
dataexp_cat_level_count <- 50 dataexp_hist_bins_count
2 Load Data
First we load the dataset as well as some support datasets.
<- read_rds("data/retail_data_raw_tbl.rds")
rawdata_tbl
%>% glimpse() rawdata_tbl
## Rows: 1,044,848
## Columns: 9
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Y…
## $ Invoice <chr> "489434", "489434", "489434", "489434", "489434", "48943…
## $ StockCode <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ Description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGH…
## $ Quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, …
## $ InvoiceDate <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07…
## $ Price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.…
## $ `Customer ID` <chr> "13085", "13085", "13085", "13085", "13085", "13085", "1…
## $ Country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "U…
2.1 Perform Quick Data Cleaning
Some of the dates provided in the dataset are in an irregular format.
<- rawdata_tbl %>% set_colnames(names(.) %>% to_snake_case())
data_tbl
%>% glimpse() data_tbl
## Rows: 1,044,848
## Columns: 9
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Ye…
## $ invoice <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ invoice_date <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07:…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Un…
2.2 Create Derived Variables
We now create derived features useful for modelling. These values are new variables calculated from existing variables in the data.
## Rows: 1,044,848
## Columns: 22
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <fct> December, December, December, December, December, De…
## $ invoice_dow <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
3 Perform Basic Checks on Data
We now want to look at some very high level checks on the data, and we leverage
some of the functionality provided by DataExplorer
.
3.1 Create High-Level Visualisations
We first want to look at a visualisation of some high-level summarys of the meta-data on this dataset. This gives us a quick view of the categorical and numeric values in the dataset, as well as the proportions of missing values.
%>%
data_tbl plot_intro(
title = "High Level Table Summary",
ggtheme = theme_cowplot()
)
3.2 Check Missing Values
Before we do anything with the data, we first check for missing values in the dataset. In some cases, missing data is coded by a special character rather than as a blank, so we first correct for this.
### _TEMPLATE_
### ADD CODE TO CORRECT FOR DATA ENCODING HERE
With missing data properly encoded, we now visualise the missing data in a number of different ways.
3.2.1 Univariate Missing Data
%>%
data_tbl plot_missing(
title = "Summary of Data Missingness",
group = list(Good = 0.05, Acceptable = 0.2, Bad = 0.8, Remove = 1),
ggtheme = theme_cowplot()
)
We now want to repeat this plot but only for those columns that have some missing values.
%>%
data_tbl plot_missing(
title = "Summary of Data Missingness (missing variables only)",
missing_only = TRUE,
group = list(Good = 0.05, Acceptable = 0.2, Bad = 0.8, Remove = 1),
ggtheme = theme_cowplot()
)
3.2.2 Multivariate Missing Data
It is useful to get an idea of what combinations of variables tend to have variables with missing values simultaneously, so to construct a visualisation for this we create a count of all the times given combinations of variables have missing values, producing a heat map for these combination counts.
<- 20
dataexp_missing_group_count
<- rawdata_tbl %>% nrow()
row_count
<- ~ .x %>% are_na() %>% vec_cast(integer())
count_nas
<- rawdata_tbl %>%
missing_vizdata_tbl mutate(across(everything(), count_nas)) %>%
mutate(label = pmap_chr(., str_c)) %>%
group_by(label) %>%
mutate(
miss_count = n(),
miss_prop = miss_count / row_count
%>%
) slice_max(order_by = miss_prop, n = 1, with_ties = FALSE) %>%
ungroup() %>%
pivot_longer(
!c(label, miss_count, miss_prop),
names_to = "variable_name",
values_to = "presence"
%>%
) mutate(
prop_label = sprintf("%6.4f", miss_prop)
)
<- missing_vizdata_tbl %>%
top10_data_tbl select(label, miss_prop) %>%
distinct() %>%
slice_max(order_by = miss_prop, n = dataexp_missing_group_count)
<- missing_vizdata_tbl %>%
missing_plot_tbl semi_join(top10_data_tbl, by = "label")
ggplot(missing_plot_tbl) +
geom_tile(aes(x = variable_name, y = prop_label, fill = presence), height = 0.8) +
scale_fill_continuous() +
scale_x_discrete(position = "top", labels = abbreviate) +
xlab("Variable") +
ylab("Proportion of Rows") +
theme(
legend.position = "none",
axis.text.x = element_text(angle = 90, vjust = 0.5)
)
This visualisation takes a little explaining.
Each row represents a combination of variables with simultaneous missing values. For each row in the graphic, the coloured entries show which particular variables are missing in that combination. The proportion of rows with that combination is displayed in both the label for the row and the colouring for the cells in the row.
3.3 Inspect High-level-count Categorical Variables
With the raw data loaded up we now remove obvious unique or near-unique variables that are not amenable to basic exploration and plotting.
<- create_coltype_list(data_tbl)
coltype_lst
<- ~ .x %>% unique() %>% length()
count_levels
<- data_tbl %>%
catvar_valuecount_tbl summarise(
.groups = "drop",
across(coltype_lst$split$discrete, count_levels)
%>%
) pivot_longer(
cols = everything(),
names_to = "var_name",
values_to = "level_count"
%>%
) arrange(desc(level_count))
print(catvar_valuecount_tbl)
## # A tibble: 15 × 2
## var_name level_count
## <chr> <int>
## 1 row_id 1044848
## 2 invoice_id 53628
## 3 customer_id 5943
## 4 description 5656
## 5 stock_code 5304
## 6 stock_code_upr 5131
## 7 invoice_minute 60
## 8 invoice_woy 52
## 9 country 43
## 10 invoice_dom 31
## 11 invoice_ym 25
## 12 invoice_hour 16
## 13 invoice_month 12
## 14 invoice_dow 7
## 15 excel_sheet 2
<- data_tbl %>% nrow()
row_count
cat(glue("Dataset has {row_count} rows\n"))
## Dataset has 1044848 rows
Now that we a table of the counts of all the categorical variables we can automatically exclude unique variables from the exploration, as the level count will match the row count.
<- catvar_valuecount_tbl %>%
unique_vars filter(level_count == row_count) %>%
pull(var_name)
print(unique_vars)
## [1] "row_id"
<- data_tbl %>%
explore_data_tbl select(-one_of(unique_vars))
Having removed the unique identifier variables from the dataset, we may also wish to exclude categoricals with high level counts also, so we create a vector of those variable names.
<- catvar_valuecount_tbl %>%
highcount_vars filter(level_count >= dataexp_level_exclusion_threshold,
< row_count) %>%
level_count pull(var_name)
cat(str_c(highcount_vars, collapse = ", "))
## invoice_id, customer_id, description, stock_code, stock_code_upr
We now can continue doing some basic exploration of the data. We may also choose to remove some extra columns from the dataset.
### You may want to comment out these next few lines to customise which
### categoricals are kept in the exploration.
<- c(highcount_vars)
drop_vars
if (length(drop_vars) > 0) {
<- explore_data_tbl %>%
explore_data_tbl select(-one_of(drop_vars))
cat(str_c(drop_vars, collapse = ", "))
}
## invoice_id, customer_id, description, stock_code, stock_code_upr
4 Univariate Data Exploration
Now that we have loaded the data we can prepare it for some basic data exploration.
4.1 Quick Univariate Data Summaries
We use a number of summary visualisations provided by DataExplorer
: a
facet plot across each variable with categorical variables getting bar plots
and numerical plots getting histograms.
We first look at the barplots of categorical variables.
plot_bar(
data_tbl,ncol = 2,
nrow = 2,
title = "Barplots of Data",
ggtheme = theme_cowplot()
)
## 10 columns ignored with more than 50 categories.
## row_id: 1044848 categories
## invoice_id: 53628 categories
## stock_code: 5304 categories
## description: 5656 categories
## invoice_date: 604 categories
## customer_id: 5943 categories
## stock_code_upr: 5131 categories
## invoice_dttm: 47635 categories
## invoice_minute: 60 categories
## invoice_woy: 52 categories
We then have a quick look at histograms of the numeric variables.
plot_histogram(
data_tbl,ncol = 2,
nrow = 2,
title = "Histograms of Data",
ggtheme = theme_cowplot()
)
Finally, we split the remaining variables into different categories and then produce a sequence of plots for each variable.
<- create_coltype_list(explore_data_tbl)
coltype_lst
print(coltype_lst)
## $split
## $split$continuous
## [1] "quantity" "price" "stock_value"
## [4] "invoice_monthprop"
##
## $split$datetime
## [1] "invoice_date" "invoice_dttm"
##
## $split$discrete
## [1] "excel_sheet" "country" "invoice_month" "invoice_dow"
## [5] "invoice_dom" "invoice_hour" "invoice_minute" "invoice_woy"
## [9] "invoice_ym"
##
## $split$logical
## [1] "cancellation"
##
##
## $columns
## excel_sheet quantity invoice_date price
## "discrete" "continuous" "datetime" "continuous"
## country cancellation invoice_dttm invoice_month
## "discrete" "logical" "datetime" "discrete"
## invoice_dow invoice_dom invoice_hour invoice_minute
## "discrete" "discrete" "discrete" "discrete"
## invoice_woy invoice_ym stock_value invoice_monthprop
## "discrete" "discrete" "continuous" "continuous"
4.2 Logical Variables
Logical variables only take two values: TRUE or FALSE. It is useful to see missing data as well though, so we also plot the count of those.
<- coltype_lst$split$logical %>% sort()
logical_vars
for (plot_varname in logical_vars) {
cat("--\n")
cat(glue("{plot_varname}\n"))
<- explore_data_tbl %>% pull(.data[[plot_varname]]) %>% are_na() %>% sum()
na_count
<- glue("Barplot of Counts for Variable: {plot_varname} ({na_count} missing values)")
plot_title
<- ggplot(explore_data_tbl) +
explore_plot geom_bar(aes(x = .data[[plot_varname]])) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(plot_title) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
plot(explore_plot)
}
## --
## cancellation
4.3 Numeric Variables
Numeric variables are usually continuous in nature, though we also have integer data.
<- coltype_lst$split$continuous %>% sort()
numeric_vars
for (plot_varname in numeric_vars) {
cat("--\n")
cat(glue("{plot_varname}\n"))
<- explore_data_tbl %>% pull(.data[[plot_varname]])
plot_var <- plot_var %>% are_na() %>% sum()
na_count
%>% summary() %>% print()
plot_var
<- glue("Histogram Plot for Variable: {plot_varname} ({na_count} missing values)")
plot_title
<- ggplot() +
all_plot geom_histogram(aes(x = plot_var), bins = dataexp_hist_bins_count) +
geom_vline(xintercept = mean(plot_var, na.rm = TRUE),
colour = "red", size = 1.5) +
geom_vline(xintercept = median(plot_var, na.rm = TRUE),
colour = "green", size = 1.5) +
xlab(plot_varname) +
ylab("Count") +
scale_x_continuous(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle(
plot_title,subtitle = "(red line is mean, green line is median)"
)
<- explore_data_tbl %>%
pos_data_tbl filter(.data[[plot_varname]] >= 0) %>%
mutate(var_val = abs(.data[[plot_varname]]))
<- ggplot(pos_data_tbl) +
pos_log_plot geom_histogram(aes(x = var_val), bins = dataexp_hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle("Positive Values")
<- explore_data_tbl %>%
neg_data_tbl filter(.data[[plot_varname]] < 0) %>%
mutate(var_val = abs(.data[[plot_varname]]))
<- ggplot(neg_data_tbl) +
neg_log_plot geom_histogram(aes(x = var_val), bins = dataexp_hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle("Negative Values")
plot_grid(
all_plot,NULL,
pos_log_plot,
neg_log_plot,nrow = 2
%>%
) print()
}
## --
## invoice_monthprop Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03226 0.28571 0.53333 0.52630 0.76667 1.00000
## --
## price Min. 1st Qu. Median Mean 3rd Qu. Max.
## -53594.36 1.25 2.10 4.59 4.13 38970.00
## --
## quantity Min. 1st Qu. Median Mean 3rd Qu. Max.
## -80995.00 1.00 3.00 9.99 10.00 80995.00
## --
## stock_value Min. 1st Qu. Median Mean 3rd Qu. Max.
## -168469.60 3.75 9.90 18.10 17.70 168469.60
4.4 Categorical Variables
Categorical variables only have values from a limited, and usually fixed, number of possible values
<- coltype_lst$split$discrete %>% sort()
categorical_vars
for (plot_varname in categorical_vars) {
cat("--\n")
cat(glue("{plot_varname}\n"))
<- explore_data_tbl %>% pull(.data[[plot_varname]]) %>% are_na() %>% sum()
na_count
<- glue("Barplot of Counts for Variable: {plot_varname} ({na_count} missing values)")
plot_title
<- explore_data_tbl %>%
standard_plot_tbl count(.data[[plot_varname]])
<- ggplot(standard_plot_tbl) +
standard_plot geom_bar(aes(x = .data[[plot_varname]], weight = n)) +
xlab(plot_varname) +
ylab("Count") +
scale_x_discrete(labels = ~ abbreviate(.x, minlength = 10)) +
scale_y_continuous(labels = label_comma()) +
ggtitle(plot_title) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
%>% print()
standard_plot
<- explore_data_tbl %>%
desc_plot_tbl pull(.data[[plot_varname]]) %>%
fct_lump(n = dataexp_cat_level_count) %>%
fct_count() %>%
mutate(f = fct_relabel(f, str_trunc, width = 15))
<- ggplot(desc_plot_tbl) +
desc_plot geom_bar(aes(x = fct_reorder(f, -n), weight = n)) +
xlab(plot_varname) +
ylab("Count") +
scale_x_discrete(labels = abbreviate) +
scale_y_continuous(labels = label_comma()) +
ggtitle(plot_title) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
%>% print()
desc_plot }
## --
## country
## --
## excel_sheet
## --
## invoice_dom
## --
## invoice_dow
## --
## invoice_hour
## --
## invoice_minute
## --
## invoice_month
## --
## invoice_woy
## --
## invoice_ym
4.5 Date/Time Variables
Date/Time variables represent calendar or time-based data should as time of the day, a date, or a timestamp.
<- coltype_lst$split$datetime %>% sort()
datetime_vars
for (plot_varname in datetime_vars) {
cat("--\n")
cat(glue("{plot_varname}\n"))
<- explore_data_tbl %>% pull(.data[[plot_varname]])
plot_var <- plot_var %>% are_na() %>% sum()
na_count
%>% summary() %>% print()
plot_var
<- glue("Barplot of Dates/Times in Variable: {plot_varname} ({na_count} missing values)")
plot_title
<- ggplot(explore_data_tbl) +
explore_plot geom_histogram(aes(x = .data[[plot_varname]]), bins = dataexp_hist_bins_count) +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(plot_title)
plot(explore_plot)
}
## --
## invoice_date Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2009-12-01" "2010-07-05" "2010-12-08" "2011-01-02" "2011-07-27" "2011-12-09"
## --
## invoice_dttm Min. 1st Qu. Median
## "2009-12-01 07:45:00" "2010-07-05 11:11:00" "2010-12-08 16:34:00"
## Mean 3rd Qu. Max.
## "2011-01-03 11:44:15" "2011-07-27 13:42:00" "2011-12-09 12:50:00"
5 Bivariate Facet Plots
We now move on to looking at bivariate plots of the data set.
A natural way to explore relationships in data is to create univariate visualisations facetted by a categorical value.
<- "invoice_month"
facet_varname
<- 3 dataexp_facet_count_max
5.1 Logical Variables
For logical variables we facet on barplots of the levels, comparing TRUE, FALSE and missing data.
<- logical_vars[!logical_vars %in% facet_varname] %>% sort()
logical_vars
for (plot_varname in logical_vars) {
cat("--\n")
cat(plot_varname)
<- data_tbl %>% filter(!are_na(.data[[plot_varname]]))
plot_tbl
<- ggplot(plot_tbl) +
explore_plot geom_bar(aes(x = .data[[plot_varname]])) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
plot(explore_plot)
}
## --
## cancellation
5.2 Numeric Variables
For numeric variables, we facet on histograms of the data.
for (plot_varname in numeric_vars) {
cat("--\n")
cat(plot_varname)
<- data_tbl %>% filter(!are_na(.data[[plot_varname]]))
plot_tbl
<- ggplot(plot_tbl) +
explore_plot geom_histogram(aes(x = .data[[plot_varname]]), bins = dataexp_hist_bins_count) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_x_continuous(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
print(explore_plot)
}
## --
## invoice_monthprop
## --
## price
## --
## quantity
## --
## stock_value
5.3 Categorical Variables
We treat categorical variables like logical variables, faceting the barplots of the different levels of the data.
<- categorical_vars[!categorical_vars %in% facet_varname] %>% sort()
categorical_vars
for (plot_varname in categorical_vars) {
cat("--\n")
cat(plot_varname)
<- data_tbl %>%
plot_tbl filter(!are_na(.data[[plot_varname]])) %>%
mutate(
varname_trunc = fct_relabel(.data[[plot_varname]], str_trunc, width = 10)
)
<- ggplot(plot_tbl) +
explore_plot geom_bar(aes(x = varname_trunc)) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_x_discrete(labels = abbreviate) +
scale_y_continuous(labels = label_comma()) +
ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
plot(explore_plot)
}
## --
## country
## --
## excel_sheet
## --
## invoice_dom
## --
## invoice_dow
## --
## invoice_hour
## --
## invoice_minute
## --
## invoice_woy
## --
## invoice_ym
5.4 Date/Time Variables
Like the univariate plots, we facet on histograms of the years in the dates.
for (plot_varname in datetime_vars) {
cat("--\n")
cat(plot_varname)
<- data_tbl %>% filter(!are_na(.data[[plot_varname]]))
plot_tbl
<- ggplot(plot_tbl) +
explore_plot geom_histogram(aes(x = .data[[plot_varname]]), bins = dataexp_hist_bins_count) +
facet_wrap(facet_varname, scales = "free") +
xlab(plot_varname) +
ylab("Count") +
scale_y_continuous(labels = label_comma()) +
ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
theme(axis.text.x = element_text(angle = 30, vjust = 0.5))
plot(explore_plot)
}
## --
## invoice_date
## --
## invoice_dttm
6 Custom Explorations
In this section we perform various data explorations.
6.1 Custom Checks for Data Integrity
We want to check the transaction data for consistency, so we create a table of all distinct
<- data_tbl %>%
stock_codes_lookup_tbl select(stock_code_upr, description) %>%
distinct() %>%
arrange(stock_code_upr, description) %>%
drop_na(description)
%>% glimpse() stock_codes_lookup_tbl
## Rows: 6,342
## Columns: 2
## $ stock_code_upr <chr> "10002", "10002R", "10080", "10080", "10109", "10120", …
## $ description <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", …
We now take a look at the first 50 rows of this table to get a sense of any
possible duplication of stock_code
.
%>% datatable() stock_codes_lookup_tbl
6.1.1 Items per Transactions
As another check on the data, we want to look at how many different objects are included in
<- data_tbl %>%
plot_tbl filter(quantity > 0) %>%
count(invoice_id, name = "n_items")
ggplot(plot_tbl) +
geom_histogram(aes(x = n_items), bins = 40) +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
xlab("Number of Items") +
ylab("Transaction Count") +
ggtitle("Histogram of Item Counts per Transactions")
6.2 Explore Aggregate Amounts
We now turn our focus to aggregating the data set in various ways and inspect how these aggregate totals are distributed.
6.2.1 Invoice-Level Amounts
We first aggregate the data at the invoice level, and inspect how those amounts are distributed.
<- data_tbl %>%
invoice_data_tbl group_by(invoice_id) %>%
summarise(
.groups = "drop",
invoice_amount = sum(price * quantity) %>% round(2)
)
<- invoice_data_tbl %>% pull(invoice_amount) %>% mean() %>% round(2)
invoice_mean <- invoice_data_tbl %>% pull(invoice_amount) %>% median() %>% round(2)
invoice_median
ggplot(invoice_data_tbl) +
geom_histogram(aes(x = invoice_amount), bins = 50) +
geom_vline(aes(xintercept = invoice_mean), colour = "black") +
geom_vline(aes(xintercept = invoice_median), colour = "red") +
xlab("Invoice Amount") +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle(
label = "Histogram Plot for Invoice Amount",
subtitle = glue("Mean is {invoice_mean}, Median is {invoice_median}")
)
We see there is a broad range of different invoice totals, with mean and median being a few hundred pounds.
6.2.2 Customer-Level Amounts
<- data_tbl %>%
customer_data_tbl group_by(customer_id) %>%
summarise(
.groups = "drop",
customer_spend = sum(price * quantity) %>% round(2)
)
ggplot(customer_data_tbl) +
geom_histogram(aes(x = customer_spend), bins = 50) +
xlab("Customer Spend") +
ylab("Count") +
scale_x_log10(labels = label_comma()) +
scale_y_continuous(labels = label_comma()) +
ggtitle("Histogram Plot for Customer Spend")
%>% pull(customer_spend) %>% hill() customer_data_tbl
6.2.3 Stock Code Investigation
We now want to have a quick look at some high level summary statistics on the stock codes, so we look at quantities and total value.
<- data_tbl %>%
stock_code_summary_tbl group_by(stock_code_upr) %>%
summarise(
.groups = "drop",
row_count = n(),
net_quantity = sum(quantity),
abs_quantity = abs(quantity) %>% sum(),
net_value = sum(stock_value),
abs_value = abs(quantity) %>% sum()
%>%
) mutate(
sc_nchar = nchar(stock_code_upr), .before = "row_count"
)
%>% datatable() stock_code_summary_tbl
It appears there are number of odd stock codes in the dataset, so we look at those codes that are 4 characters or less and inspect those.
<- data_tbl %>%
short_stock_codes_tbl semi_join(stock_code_summary_tbl %>% filter(sc_nchar < 5), by = "stock_code_upr")
%>% datatable() short_stock_codes_tbl
6.2.4 Stock Code Price Data
We want to look at the range of different prices assigned to the same
stock_code
value.
<- data_tbl %>%
stock_price_counts_tbl group_by(stock_code) %>%
summarise(
.groups = "drop",
n_prices = n(),
min_price = min(price),
p25_price = quantile(price, 0.25),
mean_price = mean(price) %>% round(2),
p50_price = median(price),
p75_price = quantile(price, 0.75),
max_price = max(price),
range_price = ((max_price - min_price) / mean_price) %>% round(4)
)
%>% datatable() stock_price_counts_tbl
<- data_tbl %>%
stock_distinct_price_counts_tbl select(stock_code, price) %>%
distinct() %>%
group_by(stock_code) %>%
summarise(
.groups = "drop",
n_prices = n(),
min_price = min(price),
p25_price = quantile(price, 0.25),
mean_price = mean(price) %>% round(2),
p50_price = median(price),
p75_price = quantile(price, 0.75),
max_price = max(price),
range_price = (max_price - min_price) / mean_price
)
%>% datatable() stock_distinct_price_counts_tbl
6.3 Construct Time-Series / Date-Based Data
Another way to look at this data is to combine all the invoice values by various time period such as daily, weekly and monthly to see how it looks.
As we are going to do a number of aggregations based on various aspects of the date, we construct a function that takes a table of data and adds a number of derived columns based on that date: things like day of week, calendar month and so on.
<- function(data_tbl, date_col) {
append_calendar_columns <- data_tbl %>%
updated_data_tbl mutate(
invoice_date = {{date_col}} %>% as.Date(),
invoice_month = {{date_col}} %>% format("%B"),
invoice_dow = {{date_col}} %>% format("%A"),
invoice_dom = {{date_col}} %>% format("%d"),
invoice_hour = {{date_col}} %>% format("%H"),
invoice_minute = {{date_col}} %>% format("%M"),
invoice_woy = {{date_col}} %>% format("%V"),
invoice_ym = {{date_col}} %>% format("%Y%m"),
.after = {{date_col}}
)
return(updated_data_tbl)
}
%>%
data_tbl select(excel_sheet, invoice_id, invoice_date, stock_value) %>%
append_calendar_columns(invoice_date) %>%
glimpse()
## Rows: 1,044,848
## Columns: 11
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4894…
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-1…
## $ invoice_month <chr> "December", "December", "December", "December", "Decemb…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", …
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ invoice_hour <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "…
## $ invoice_minute <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49", "…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2009…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50…
6.3.1 Create Univariate Time-Series of Amounts
<- data_tbl %>%
ts_data_tbl mutate(
ts_week = format(invoice_dttm, "%Y-%U"),
ts_month = format(invoice_dttm, "%Y-%m")
)
<- ts_data_tbl %>%
ts_daily_tbl group_by(label = invoice_date %>% format("%Y-%m-%d")) %>%
summarise(
.groups = "drop",
period_date = min(invoice_date),
total_spend = sum(price * quantity) %>% round(2)
)
ggplot(ts_daily_tbl) +
geom_line(aes(x = period_date, y = total_spend)) +
expand_limits(y = 0) +
scale_y_continuous(labels = label_comma()) +
xlab("Date") +
ylab("Total Spend") +
ggtitle("Lineplot of Total Spend by Day")
<- ts_data_tbl %>%
ts_weekly_tbl group_by(label = ts_week) %>%
summarise(
.groups = "drop",
period_date = min(invoice_date),
total_spend = sum(price * quantity)
)
ggplot(ts_weekly_tbl) +
geom_line(aes(x = period_date, y = total_spend)) +
expand_limits(y = 0) +
scale_y_continuous(labels = label_comma()) +
xlab("Date") +
ylab("Total Spend") +
ggtitle("Lineplot of Total Spend by Week")
<- ts_data_tbl %>%
ts_monthly_tbl group_by(label = ts_month) %>%
summarise(
.groups = "drop",
period_date = min(invoice_date),
total_spend = sum(price * quantity) %>% round(2)
)
ggplot(ts_monthly_tbl) +
geom_line(aes(x = period_date, y = total_spend)) +
expand_limits(y = 0) +
scale_y_continuous(labels = label_comma()) +
xlab("Date") +
ylab("Total Spend") +
ggtitle("Lineplot of Total Spend by Month")
To avoid dealing with multiple files for the time series, we combine them into a single object.
<- list(
ts_data_tbl daily = ts_daily_tbl,
weekly = ts_weekly_tbl,
monthly = ts_monthly_tbl
%>%
) bind_rows(.id = "series") %>%
arrange(series, period_date)
%>% write_rds("data/retail_timeseries_tbl.rds") ts_data_tbl
6.3.2 Calendar-Based Boxplots
We have aggregated our data across time periods, but it is also worth looking at both transaction-level and invoice-level amount over time.
ggplot(data_tbl) +
geom_boxplot(aes(x = invoice_woy, y = stock_value, group = invoice_woy)) +
scale_y_log10(labels = label_comma()) +
xlab("Week of Year") +
ylab("Transaction Amount") +
ggtitle("Boxplot of Transaction Sizes by Week of Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
<- data_tbl %>%
plot_tbl group_by(invoice_woy, invoice_id) %>%
summarise(
.groups = "drop",
invoice_amount = sum(stock_value) %>% round(2)
)
ggplot(plot_tbl) +
geom_boxplot(aes(x = invoice_woy, y = invoice_amount, group = invoice_woy)) +
scale_y_log10(labels = label_comma()) +
xlab("Week of Year") +
ylab("Invoice Amount") +
ggtitle("Boxplot of Invoice Amounts by Week of Year") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5))
6.4 Check Distribution of Daily Purchases
We now look at individual invoice amounts, and look at how they are distributed on a daily basis.
<- data_tbl %>%
daily_distribution_tbl group_by(invoice_date, invoice_id) %>%
summarise(
.groups = "drop",
total_spend = sum(stock_value)
)
%>% glimpse() daily_distribution_tbl
## Rows: 53,628
## Columns: 3
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ invoice_id <chr> "489434", "489435", "489436", "489437", "489438", "489439…
## $ total_spend <dbl> 505.30, 145.80, 630.33, 310.75, 2286.24, 426.30, 50.40, 3…
We now produce a boxplot for each day so we have a sense of how the distribution of transaction amounts changes and evolves over time.
<- daily_distribution_tbl %>%
plot_tbl filter(total_spend > 0) %>%
group_by(invoice_date) %>%
summarise(
.groups = "drop",
qn = c("total_spend", "p10", "p25", "p50", "p75", "p90"),
vals = c(
sum(total_spend),
quantile(total_spend, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
)%>%
) pivot_wider(
names_from = qn,
values_from = vals
)
ggplot(plot_tbl) +
geom_errorbar(aes(x = invoice_date, ymin = p25, ymax = p75, colour = total_spend),
width = 0) +
scale_y_log10(labels = label_comma()) +
scale_colour_continuous(labels = label_comma()) +
labs(x = "Date", y = "Transaction Amount", colour = "Spend") +
ggtitle("P25-P75 Quartile of Transaction Amounts by Date")
It does not look there is much of a change over time or any particular patterns that stand out, so we also want to look at the refunds.
<- daily_distribution_tbl %>%
plot_tbl filter(total_spend < 0) %>%
mutate(total_refund = abs(total_spend)) %>%
group_by(invoice_date) %>%
summarise(
.groups = "drop",
qn = c("total_refund", "p10", "p25", "p50", "p75", "p90"),
vals = c(
sum(total_refund),
quantile(total_refund, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
)%>%
) pivot_wider(
names_from = qn,
values_from = vals
)
ggplot(plot_tbl) +
geom_errorbar(aes(x = invoice_date, ymin = p25, ymax = p75, colour = total_refund),
width = 0) +
scale_y_log10(labels = label_comma()) +
scale_colour_continuous(labels = label_comma()) +
labs(x = "Date", y = "Transaction Amount", colour = "Spend") +
ggtitle("P25-P75 Quartile of Refund Amounts by Date")
6.5 Investigate Refunds
This transactions data also includes returns and refunds - adding a level of complexity to this analysis as we need to account for this when assessing the data.
One option is to ignore the refunds, at least initially, but this may add a large source of bias to our analysis and we need to get a feel for this.
To do this, we will look at each line-item entry with a negative quantity and
we will look back in time to the previous time such an time was sold, and we
match on the customer_id
, stock_code
and price
. This is strong evidence
that the previous sale was then returned, and so we can label these entries.
Due to the time snapshot nature of this data, this also means there are likely to be some returns in our data that do not have a corresponding sale, so we can also ignore those.
<- function(use_stock, use_customer, use_dttm,
filter_returns_transactions
data_tbl) {<- data_tbl %>%
matches_tbl filter(stock_code == use_stock,
== use_customer,
customer_id <= use_dttm,
invoice_dttm > 0)
quantity
return(matches_tbl)
}
<- function(data_tbl, return_price, return_quantity) {
match_tnx_prices <- data_tbl %>%
matched_tbl filter(abs(price) == abs(return_price)) %>%
select(row_id, invoice_id, stock_code, quantity, price, invoice_dttm) %>%
arrange(desc(invoice_dttm)) %>%
mutate(
cuml_quantity = cumsum(quantity),
return_quantity = return_quantity,
remaining = cuml_quantity + return_quantity
)
return(matched_tbl)
}
<- function(data_tbl) {
detemine_return_records <- data_tbl %>%
negative_tbl filter(remaining <= 0)
<- data_tbl %>%
first_tbl filter(remaining > 0) %>%
arrange(desc(invoice_dttm)) %>%
head(1)
<- list(negative_tbl, first_tbl) %>%
matched_tbl bind_rows() %>%
arrange(desc(invoice_dttm)) %>%
rename(orig_row_id = row_id)
return(matched_tbl)
}
<- data_tbl %>%
returns_data_tbl filter(quantity < 0) %>%
select(row_id, stock_code, customer_id, quantity, price, invoice_dttm) %>%
mutate(
prev_tnx_data = future_pmap(
list(use_stock = stock_code,
use_customer = customer_id,
use_dttm = invoice_dttm),
filter_returns_transactions,data_tbl = data_tbl,
.options = furrr_options(seed = 421),
.progress = TRUE
),price_data = future_pmap(
list(data_tbl = prev_tnx_data,
return_price = price,
return_quantity = quantity),
match_tnx_prices,
.options = furrr_options(seed = 422),
.progress = TRUE
),match_data = map(price_data, detemine_return_records)
)
%>% glimpse() returns_data_tbl
## Rows: 22,557
## Columns: 9
## $ row_id <chr> "ROW1025684", "ROW1025685", "ROW1025686", "ROW1025687", …
## $ stock_code <chr> "22087", "85206A", "21895", "21896", "22083", "21871", "…
## $ customer_id <chr> "16321", "16321", "16321", "16321", "16321", "16321", "1…
## $ quantity <dbl> -12, -6, -4, -6, -12, -12, -12, -24, -12, -3, -3, -3, -3…
## $ price <dbl> 2.95, 1.65, 4.25, 2.10, 2.95, 1.25, 1.25, 0.85, 2.95, 4.…
## $ invoice_dttm <dttm> 2009-12-01 10:32:59, 2009-12-01 10:32:59, 2009-12-01 10…
## $ prev_tnx_data <list> [<tbl_df[0 x 22]>], [<tbl_df[0 x 22]>], [<tbl_df[0 x 22…
## $ price_data <list> [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>]…
## $ match_data <list> [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>]…
We can then use this data to get an idea of how often items are returned, and how much time tends to pass between the purchase and the return.
<- returns_data_tbl %>%
returns_lookup_tbl select(return_row_id = row_id, return_dttm = invoice_dttm, match_data) %>%
unnest(match_data) %>%
select(
orig_row_id, return_row_id, return_dttm,adjust_quantity = return_quantity
%>%
) group_by(return_row_id) %>%
arrange(orig_row_id) %>%
slice_head(n = 1) %>%
ungroup()
%>% glimpse() returns_lookup_tbl
## Rows: 15,538
## Columns: 4
## $ orig_row_id <chr> "ROW0000260", "ROW0001235", "ROW0001214", "ROW0001233"…
## $ return_row_id <chr> "ROW1025730", "ROW1025733", "ROW1025734", "ROW1025735"…
## $ return_dttm <dttm> 2009-12-01 12:42:59, 2009-12-01 13:09:00, 2009-12-01 …
## $ adjust_quantity <dbl> -3, -24, -24, -12, -24, -81, -24, -48, -3, -12, -12, -…
We probably need to do a better job of this, as this current method is inadequate to properly match this up.
Stepping up this work to properly match up the transactions will be a topic we return to later.
We now want to look at the distribution of times between the purchase and the item being returned.
<- data_tbl %>%
orig_data_tbl select(orig_row_id = row_id, purchase_dttm = invoice_dttm)
<- data_tbl %>%
ret_data_tbl select(return_row_id = row_id, return_dttm = invoice_dttm)
<- returns_lookup_tbl %>%
return_times_tbl select(orig_row_id, return_row_id) %>%
left_join(orig_data_tbl, by = "orig_row_id") %>%
left_join(ret_data_tbl, by = "return_row_id") %>%
mutate(
return_time = return_dttm - purchase_dttm,
return_days = as.numeric(return_time) / (24 * 60 * 60)
)
<- ggplot(return_times_tbl) +
base_plot geom_histogram(aes(x = return_days), bins = 50) +
geom_vline(aes(xintercept = 90), colour = "red") +
xlab("Days") +
ylab("Count") +
ggtitle("Histogram of Days Between Purchase and Return per Item Transaction")
<- base_plot +
norm_plot scale_x_continuous(labels = label_comma())
<- base_plot +
log_plot scale_x_log10(labels = label_comma())
plot_grid(norm_plot, log_plot, nrow = 2)
7 Indicate Excluded Rows
As we have found a number of issues with this data, we now will indicate which rows we wish to exclude from futher analysis. We will then exclude these rows at a later point of the analysis.
7.1 Filter Extra Stock Codes
We discovered there are a number of extra data in this dataset for things like bad debt management, discounts, gift cards and so on, so we remove those from this.
<- c(
screen_stock_code "B", "C2", "C3", "D", "M", "S", "CRUK", "POST", "DOT", "BANK CHARGES",
"AMAZONFEE", "ADJUST", "ADJUST2", "TEST001", "TEST002"
)
<- data_tbl %>%
clean_stock_code_tbl filter(stock_code_upr %in% screen_stock_code) %>%
select(row_id)
%>% glimpse() clean_stock_code_tbl
## Rows: 5,712
## Columns: 1
## $ row_id <chr> "ROW0000090", "ROW0000127", "ROW0000174", "ROW0000593", "ROW102…
<- data_tbl %>%
clean_gift_tbl filter(str_detect(stock_code_upr, "GIFT")) %>%
select(row_id)
%>% glimpse() clean_gift_tbl
## Rows: 101
## Columns: 1
## $ row_id <chr> "ROW0029851", "ROW0030310", "ROW0031279", "ROW0040002", "ROW004…
7.2 Create Exclusion Indicator
<- list(
exclusions_tbl
clean_stock_code_tbl,
clean_gift_tbl%>%
) bind_rows() %>%
mutate(exclude = TRUE)
<- data_tbl %>%
cleaned_data_tbl left_join(exclusions_tbl, by = "row_id") %>%
replace_na(list(exclude = FALSE))
%>% glimpse() cleaned_data_tbl
## Rows: 1,044,848
## Columns: 23
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <fct> December, December, December, December, December, De…
## $ invoice_dow <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
8 Construct Stock Code Lookups
Finally, we want to construct a lookup table that provides some free-text
fields for each stock_code
value.
<- function(stock_desc) {
dedupe_stock_descs <- stock_desc %>%
dedupe_descs enframe(name = NULL, value = "stock_desc") %>%
mutate(
desc_len = stock_desc %>% nchar(),
desc_dedupe = stock_desc %>% str_trim() %>% str_replace_all("[^\\w]| ", ""),
desc_output = stock_desc %>% str_trim() %>% str_squish()
%>%
) group_by(desc_dedupe) %>%
slice_max(order_by = desc_len, n = 1, with_ties = FALSE) %>%
ungroup() %>%
pull(desc_output)
return(dedupe_descs)
}
We use some simple logic to attempt to de-dupe the descriptions as much as possible.
<- cleaned_data_tbl %>%
stock_description_tbl filter(
== FALSE,
exclude > 0,
quantity > 0,
price !are_na(description)
%>%
) mutate(
stock_code = stock_code %>% str_trim() %>% str_to_upper()
%>%
) select(stock_code, description) %>%
drop_na(description) %>%
distinct() %>%
group_by(stock_code) %>%
summarise(
.groups = "drop",
desc = description %>% sort() %>% dedupe_stock_descs() %>% str_c(collapse = " : ")
%>%
) arrange(stock_code)
%>% glimpse() stock_description_tbl
## Rows: 4,725
## Columns: 2
## $ stock_code <chr> "10002", "10002R", "10080", "10109", "10120", "10123C", "10…
## $ desc <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "GRO…
We also look at this output using DT
%>% datatable() stock_description_tbl
9 BTYD Visualisations
Despite our extensive exploration of the data earlier, the concepts around BTYD modelling suggest a few more than are worth exploring, so we will look at those now.
We model the purchase data first, then combine this to create an individual customer/invoice pairing with the total amount spent as an additional column.
<- cleaned_data_tbl %>%
tnx_purchase_tbl filter(
> 0,
quantity > 0,
price == FALSE
exclude %>%
) select(
invoice_date, invoice_id, stock_code, customer_id, description,
quantity, price, stock_value
)
%>% glimpse() tnx_purchase_tbl
## Rows: 1,015,088
## Columns: 8
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, …
Use of BTYD models assumes a total spend over a period of day and those differences between the times. This is calculated internally by the various BTYD routines so we keep just the per-invoice spend.
<- tnx_purchase_tbl %>%
daily_spend_invoice_tbl drop_na(customer_id) %>%
group_by(invoice_date, customer_id, invoice_id) %>%
summarise(
.groups = "drop",
invoice_spend = sum(stock_value)
)
%>% glimpse() daily_spend_invoice_tbl
## Rows: 36,594
## Columns: 4
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12…
## $ customer_id <chr> "12490", "12533", "12682", "12758", "12836", "12913", "1…
## $ invoice_id <chr> "489557", "489526", "489439", "489599", "489593", "48954…
## $ invoice_spend <dbl> 531.94, 821.92, 372.30, 2454.68, 423.87, 537.96, 261.00,…
<- daily_spend_invoice_tbl %>%
daily_spend_tbl group_by(invoice_date, customer_id) %>%
summarise(
.groups = "drop",
total_spend = sum(invoice_spend),
tnx_count = n()
)
%>% glimpse() daily_spend_tbl
## Rows: 32,878
## Columns: 4
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ customer_id <chr> "12490", "12533", "12682", "12758", "12836", "12913", "12…
## $ total_spend <dbl> 531.94, 821.92, 372.30, 2454.68, 423.87, 537.96, 261.00, …
## $ tnx_count <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, …
To start with, it might be worth understanding a bit more about when customers are ‘born’ in the system - that is, the date on which they make their first purchase. Another important quantity is the time between transactions for a customer, we we will visualise these.
<- daily_spend_tbl %>%
customer_cohort_tbl group_by(customer_id) %>%
summarise(
.groups = "drop",
first_tnx_date = min(invoice_date),
total_tnx_count = n()
%>%
) mutate(
cohort_qtr = first_tnx_date %>% as.yearqtr() %>% as.character(),
cohort_ym = first_tnx_date %>% format("%Y %m"),
.after = "customer_id"
)
%>% glimpse() customer_cohort_tbl
## Rows: 5,852
## Columns: 5
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", …
## $ cohort_qtr <chr> "2010 Q1", "2010 Q4", "2010 Q3", "2010 Q2", "2011 Q1",…
## $ cohort_ym <chr> "2010 03", "2010 10", "2010 09", "2010 04", "2011 02",…
## $ first_tnx_date <date> 2010-03-02, 2010-10-31, 2010-09-27, 2010-04-29, 2011-…
## $ total_tnx_count <int> 3, 8, 5, 3, 1, 1, 9, 2, 1, 2, 6, 2, 5, 10, 6, 4, 10, 2…
Now that we have a first date for each customer, we look at the total number of customers joining at each date.
<- customer_cohort_tbl %>%
plot_tbl count(first_tnx_date, name = "n_customer")
ggplot(plot_tbl) +
geom_line(aes(x = first_tnx_date, y = n_customer)) +
labs(
x = "First Transaction Date",
y = "New Customers",
title = "Plot of Count of New Customer by Date"
)
We know look at how time differences between purchases are distributed.
<- daily_spend_tbl %>%
customer_tnx_diffs_tbl group_by(customer_id) %>%
summarise(
.groups = "drop",
time_diff = diff(invoice_date) %>% as.numeric() %>% divide_by(7)
)
<- customer_tnx_diffs_tbl %>% pull(time_diff) %>% mean()
mean_diff
ggplot(customer_tnx_diffs_tbl) +
geom_histogram(aes(x = time_diff), bins = 50) +
geom_vline(aes(xintercept = mean_diff), colour = "red") +
scale_y_continuous(labels = label_comma()) +
labs(
x = "Time Difference (weeks)",
y = "Frequency",
title = "Histogram of Differences Between Transactions for Customers",
subtitle = glue(
"Mean Difference is {mean_diff} weeks", mean_diff = mean_diff %>% round(2)
) )
We also want to look at a number of customers and make some line plots of their transactions.
<- customer_cohort_tbl %>%
keep_customers_tbl filter(total_tnx_count > 2) %>%
slice_sample(n = 30)
<- daily_spend_tbl %>%
plot_tbl semi_join(keep_customers_tbl, by = "customer_id")
ggplot(plot_tbl, aes(x = invoice_date, y = customer_id, group = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Transaction Date",
y = "Customer ID",
title = "Visualisation of Transaction Times for 30 Customers"
+
) theme(axis.text.y = element_text(size = 12))
9.1 Investigate Cohorts
Finally, we want to take a look at the distribution of transaction times based on various first-transaction cohorts in the data.
ggplot(customer_cohort_tbl) +
geom_histogram(aes(x = first_tnx_date), bins = 50) +
labs(
x = "Date of First Transaction",
y = "Count",
title = "Histogram of New Customer Start Dates"
)
We also want to get a sense of the total count of customers in each cohort.
<- customer_cohort_tbl %>%
plot_tbl count(cohort_qtr, name = "customer_count") %>%
mutate(cohort_qtr = cohort_qtr %>% as.character())
ggplot(plot_tbl) +
geom_col(aes(x = cohort_qtr, y = customer_count)) +
scale_y_continuous(labels = label_comma()) +
labs(
x = "Customer Cohort",
y = "Customer Count",
title = "Bar Plot of Customer Quarterly Cohort Sizes"
)
We also want to see the monthly cohorts:
<- customer_cohort_tbl %>%
plot_tbl count(cohort_ym, name = "customer_count") %>%
mutate(cohort_ym = cohort_ym %>% as.character())
ggplot(plot_tbl) +
geom_col(aes(x = cohort_ym, y = customer_count)) +
scale_y_continuous(labels = label_comma()) +
labs(
x = "Customer Cohort",
y = "Customer Count",
title = "Bar Plot of Customer Monthly Cohort Sizes"
+
) theme(
axis.text.x = element_text(size = 10, angle = 20, vjust = 0.5)
)
For the cohort analysis, we start with a boxplot of the time difference between transactions by cohort.
<- customer_cohort_tbl %>%
plot_tbl inner_join(customer_tnx_diffs_tbl, by = "customer_id") %>%
mutate(cohort_qtr = cohort_qtr %>% as.character())
ggplot(plot_tbl) +
geom_boxplot(aes(x = cohort_qtr, y = time_diff)) +
scale_y_log10() +
labs(
x = "Cohort",
y = "Time Difference (weeks)",
title = "Boxplot of Time Differences by Starting Cohort"
)
We also construct a density plot of the time differences for these cohorts.
ggplot(plot_tbl, aes(x = time_diff, colour = cohort_qtr)) +
geom_line(stat = "density") +
geom_dl(aes(label = cohort_qtr), method = "top.bumpup", stat = "density") +
labs(
x = "Time Difference (weeks)",
y = "Density",
title = "Comparison Density Plot for Transaction Time Differences Between Cohorts"
+
) theme(legend.position = "none")
And we also look at a facetted-histogram
ggplot(plot_tbl) +
geom_histogram(aes(x = time_diff), bins = 50) +
facet_wrap(vars(cohort_qtr), scales = "free_y") +
scale_y_continuous(labels = label_comma()) +
labs(
x = "Time Difference (weeks)",
y = "Count",
title = "Facetted Histograms of Time Between Transactions"
)
9.2 Investigate Dropout Rates
We want to plot some visualisations of the lifetime and dropout rate of customers in each cohort.
<- customer_cohort_tbl %>%
cohort_dropout_est_tbl select(
customer_id, first_tnx_date, cohort_qtr, cohort_ym%>%
) inner_join(daily_spend_tbl, by = "customer_id") %>%
group_by(cohort_qtr, customer_id) %>%
mutate(
final_tnx_date = max(invoice_date)
%>%
) ungroup() %>%
select(
customer_id, cohort_qtr, first_tnx_date, final_tnx_date%>%
) distinct() %>%
mutate(
obs_lifetime = difftime(final_tnx_date, first_tnx_date, units = "week") %>%
as.numeric()
%>%
) filter(obs_lifetime > 0) %>%
group_by(cohort_qtr) %>%
summarise(
.groups = "drop",
lifetimes = list(obs_lifetime)
%>%
) mutate(
exp_fit = map(lifetimes, MASS::fitdistr, densfun = "exponential"),
param_data = map(exp_fit, broom::tidy)
%>%
) select(cohort_qtr, lifetimes, param_data) %>%
unnest(param_data)
%>% glimpse() cohort_dropout_est_tbl
## Rows: 9
## Columns: 5
## $ cohort_qtr <chr> "2009 Q4", "2010 Q1", "2010 Q2", "2010 Q3", "2010 Q4", "201…
## $ lifetimes <list> <104.2857143, 96.7142857, 104.2857143, 89.4285714, 94.0000…
## $ term <chr> "rate", "rate", "rate", "rate", "rate", "rate", "rate", "ra…
## $ estimate <dbl> 0.01180524, 0.01468272, 0.01792846, 0.02185343, 0.02433965,…
## $ std.error <dbl> 0.0004011582, 0.0004636136, 0.0007177127, 0.0010688864, 0.0…
%>% datatable() cohort_dropout_est_tbl
10 Output Cleaned Data
Finally we output the various datasets we have constructed to disks.
%>% write_rds("data/stock_description_tbl.rds")
stock_description_tbl
%>% write_rds("data/returns_lookup_tbl.rds")
returns_lookup_tbl
%>% write_rds("data/retail_data_cleaned_tbl.rds")
cleaned_data_tbl
%>% write_rds("data/customer_cohort_tbl.rds")
customer_cohort_tbl
%>% write_rds("data/daily_spend_invoice_tbl.rds")
daily_spend_invoice_tbl
%>% write_rds("data/daily_spend_tbl.rds") daily_spend_tbl
11 R Environment
::session_info() sessioninfo
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.1.1 (2021-08-10)
## os Ubuntu 20.04.3 LTS
## system x86_64, linux-gnu
## ui X11
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2021-12-06
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.1.0)
## backports 1.3.0 2021-10-27 [1] RSPM (R 4.1.0)
## bookdown 0.24 2021-09-02 [1] RSPM (R 4.1.0)
## broom 0.7.9 2021-07-27 [1] RSPM (R 4.1.0)
## bslib 0.3.1 2021-10-06 [1] RSPM (R 4.1.0)
## cachem 1.0.6 2021-08-19 [1] RSPM (R 4.1.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.1.0)
## cli 3.1.0 2021-10-27 [1] RSPM (R 4.1.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.1.1)
## colorspace 2.0-2 2021-06-24 [1] RSPM (R 4.1.0)
## conflicted * 1.0.4 2019-06-21 [1] RSPM (R 4.1.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.1.0)
## crayon 1.4.1 2021-02-08 [1] RSPM (R 4.1.0)
## crosstalk 1.1.1 2021-01-12 [1] RSPM (R 4.1.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.1.0)
## data.table 1.14.2 2021-09-27 [1] RSPM (R 4.1.0)
## DataExplorer * 0.8.2 2020-12-15 [1] RSPM (R 4.1.0)
## DBI 1.1.1 2021-01-15 [1] RSPM (R 4.1.0)
## dbplyr 2.1.1 2021-04-06 [1] RSPM (R 4.1.0)
## digest 0.6.28 2021-09-23 [1] RSPM (R 4.1.0)
## directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.1.0)
## dplyr * 1.0.7 2021-06-18 [1] RSPM (R 4.1.0)
## DT * 0.19 2021-09-02 [1] RSPM (R 4.1.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.1.0)
## evaluate 0.14 2019-05-28 [1] RSPM (R 4.1.0)
## evir * 1.7-4 2018-03-20 [1] RSPM (R 4.1.0)
## fansi 0.5.0 2021-05-25 [1] RSPM (R 4.1.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.1.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.1.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.1.0)
## fs * 1.5.0 2020-07-31 [1] RSPM (R 4.1.0)
## furrr * 0.2.3 2021-06-25 [1] RSPM (R 4.1.0)
## future * 1.22.1 2021-08-25 [1] RSPM (R 4.1.0)
## generics 0.1.1 2021-10-25 [1] RSPM (R 4.1.0)
## ggplot2 * 3.3.5 2021-06-25 [1] RSPM (R 4.1.0)
## globals 0.14.0 2020-11-22 [1] RSPM (R 4.1.0)
## glue * 1.4.2 2020-08-27 [1] RSPM (R 4.1.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.1.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.1.0)
## haven 2.4.3 2021-08-04 [1] RSPM (R 4.1.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.1.0)
## hms 1.1.1 2021-09-26 [1] RSPM (R 4.1.0)
## htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.1.0)
## htmlwidgets 1.5.4 2021-09-08 [1] RSPM (R 4.1.0)
## httr 1.4.2 2020-07-20 [1] RSPM (R 4.1.0)
## igraph 1.2.7 2021-10-15 [1] RSPM (R 4.1.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.1.0)
## jsonlite 1.7.2 2020-12-09 [1] RSPM (R 4.1.0)
## knitr 1.36 2021-09-29 [1] RSPM (R 4.1.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.1.0)
## lattice 0.20-44 2021-05-02 [2] CRAN (R 4.1.1)
## lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.1.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.1.0)
## lubridate * 1.8.0 2021-10-07 [1] RSPM (R 4.1.0)
## magrittr * 2.0.1 2020-11-17 [1] RSPM (R 4.1.0)
## MASS 7.3-54 2021-05-03 [2] CRAN (R 4.1.1)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.1.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.1.0)
## networkD3 0.4 2017-03-18 [1] RSPM (R 4.1.0)
## parallelly 1.28.1 2021-09-09 [1] RSPM (R 4.1.0)
## PerformanceAnalytics * 2.0.4 2020-02-06 [1] RSPM (R 4.1.0)
## pillar 1.6.4 2021-10-18 [1] RSPM (R 4.1.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.1.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.1.0)
## quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.1.0)
## Quandl 2.11.0 2021-08-11 [1] RSPM (R 4.1.0)
## quantmod * 0.4.18 2020-12-09 [1] RSPM (R 4.1.0)
## R6 2.5.1 2021-08-19 [1] RSPM (R 4.1.0)
## Rcpp 1.0.7 2021-07-07 [1] RSPM (R 4.1.0)
## readr * 2.0.2 2021-09-27 [1] RSPM (R 4.1.0)
## readxl 1.3.1 2019-03-13 [1] RSPM (R 4.1.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.1.0)
## rlang * 0.4.12 2021-10-18 [1] RSPM (R 4.1.0)
## rmarkdown 2.11 2021-09-14 [1] RSPM (R 4.1.0)
## rmdformats 1.0.3 2021-10-06 [1] RSPM (R 4.1.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.1.0)
## rvest 1.0.2 2021-10-16 [1] RSPM (R 4.1.0)
## sass 0.4.0 2021-05-12 [1] RSPM (R 4.1.0)
## scales * 1.1.1 2020-05-11 [1] RSPM (R 4.1.0)
## sessioninfo 1.1.1 2018-11-05 [1] RSPM (R 4.1.0)
## snakecase * 0.11.0 2019-05-25 [1] RSPM (R 4.1.0)
## stringi 1.7.5 2021-10-04 [1] RSPM (R 4.1.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.1.0)
## tibble * 3.1.5 2021-09-30 [1] RSPM (R 4.1.0)
## tidyquant * 1.0.3 2021-03-05 [1] RSPM (R 4.1.0)
## tidyr * 1.1.4 2021-09-27 [1] RSPM (R 4.1.0)
## tidyselect 1.1.1 2021-04-30 [1] RSPM (R 4.1.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.1.0)
## TTR * 0.24.2 2020-09-01 [1] RSPM (R 4.1.0)
## tzdb 0.2.0 2021-10-27 [1] RSPM (R 4.1.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.1.0)
## vctrs * 0.3.8 2021-04-29 [1] RSPM (R 4.1.0)
## withr 2.4.2 2021-04-18 [1] RSPM (R 4.1.0)
## xfun 0.27 2021-10-18 [1] RSPM (R 4.1.0)
## xml2 1.3.2 2020-04-23 [1] RSPM (R 4.1.0)
## xts * 0.12.1 2020-09-09 [1] RSPM (R 4.1.0)
## yaml 2.2.1 2020-02-01 [1] RSPM (R 4.1.0)
## zoo * 1.8-9 2021-03-09 [1] RSPM (R 4.1.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library