Exploring the Online Retail Dataset

This workbook was created using the “dataexpks” template:

https://github.com/DublinLearningGroup/dataexpks

1 Introduction

This workbook performs the basic data exploration of the dataset.

dataexp_level_exclusion_threshold <- 100

dataexp_cat_level_count <- 40
dataexp_hist_bins_count <- 50

2 Load Data

First we load the dataset as well as some support datasets.

rawdata_tbl <- read_rds("data/retail_data_raw_tbl.rds")

rawdata_tbl %>% glimpse()

## Rows: 1,044,848
## Columns: 9
## $ excel_sheet   <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Y…
## $ Invoice       <chr> "489434", "489434", "489434", "489434", "489434", "48943…
## $ StockCode     <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", …
## $ Description   <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGH…
## $ Quantity      <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, …
## $ InvoiceDate   <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07…
## $ Price         <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.…
## $ `Customer ID` <chr> "13085", "13085", "13085", "13085", "13085", "13085", "1…
## $ Country       <chr> "United Kingdom", "United Kingdom", "United Kingdom", "U…

2.1 Perform Quick Data Cleaning

Some of the dates provided in the dataset are in an irregular format.

data_tbl <- rawdata_tbl %>% set_colnames(names(.) %>% to_snake_case())

data_tbl %>% glimpse()

## Rows: 1,044,848
## Columns: 9
## $ excel_sheet  <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "Ye…
## $ invoice      <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ stock_code   <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ description  <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…
## $ quantity     <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ invoice_date <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-01 07:…
## $ price        <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ customer_id  <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ country      <chr> "United Kingdom", "United Kingdom", "United Kingdom", "Un…

2.2 Create Derived Variables

We now create derived features useful for modelling. These values are new variables calculated from existing variables in the data.

## Rows: 1,044,848
## Columns: 22
## $ row_id            <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet       <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id        <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code        <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description       <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity          <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date      <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price             <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id       <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country           <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr    <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm      <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month     <fct> December, December, December, December, December, De…
## $ invoice_dow       <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom       <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour      <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute    <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy       <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym        <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value       <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…

3 Perform Basic Checks on Data

We now want to look at some very high level checks on the data, and we leverage some of the functionality provided by DataExplorer.

3.1 Create High-Level Visualisations

We first want to look at a visualisation of some high-level summarys of the meta-data on this dataset. This gives us a quick view of the categorical and numeric values in the dataset, as well as the proportions of missing values.

data_tbl %>%
  plot_intro(
    title   = "High Level Table Summary",
    ggtheme = theme_cowplot()
    )

3.2 Check Missing Values

Before we do anything with the data, we first check for missing values in the dataset. In some cases, missing data is coded by a special character rather than as a blank, so we first correct for this.

### _TEMPLATE_
### ADD CODE TO CORRECT FOR DATA ENCODING HERE

With missing data properly encoded, we now visualise the missing data in a number of different ways.

3.2.1 Univariate Missing Data

data_tbl %>%
  plot_missing(
    title   = "Summary of Data Missingness",
    group   = list(Good = 0.05, Acceptable = 0.2, Bad = 0.8, Remove = 1),
    ggtheme = theme_cowplot()
    )

We now want to repeat this plot but only for those columns that have some missing values.

data_tbl %>%
  plot_missing(
    title        = "Summary of Data Missingness (missing variables only)",
    missing_only = TRUE,
    group        = list(Good = 0.05, Acceptable = 0.2, Bad = 0.8, Remove = 1),
    ggtheme      = theme_cowplot()
    )

3.2.2 Multivariate Missing Data

It is useful to get an idea of what combinations of variables tend to have variables with missing values simultaneously, so to construct a visualisation for this we create a count of all the times given combinations of variables have missing values, producing a heat map for these combination counts.

dataexp_missing_group_count <- 20

row_count <- rawdata_tbl %>% nrow()

count_nas <- ~ .x %>% are_na() %>% vec_cast(integer())

missing_vizdata_tbl <- rawdata_tbl %>%
  mutate(across(everything(), count_nas)) %>%
  mutate(label = pmap_chr(., str_c)) %>%
  group_by(label) %>%
  mutate(
    miss_count = n(),
    miss_prop  = miss_count / row_count
    ) %>%
  slice_max(order_by = miss_prop, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  pivot_longer(
    !c(label, miss_count, miss_prop),
    names_to = "variable_name",
    values_to = "presence"
    ) %>%
  mutate(
    prop_label = sprintf("%6.4f", miss_prop)
    )

top10_data_tbl <- missing_vizdata_tbl %>%
  select(label, miss_prop) %>%
  distinct() %>%
  slice_max(order_by = miss_prop, n = dataexp_missing_group_count)

missing_plot_tbl <- missing_vizdata_tbl %>%
  semi_join(top10_data_tbl, by = "label")

ggplot(missing_plot_tbl) +
  geom_tile(aes(x = variable_name, y = prop_label, fill = presence), height = 0.8) +
  scale_fill_continuous() +
  scale_x_discrete(position = "top", labels = abbreviate) +
  xlab("Variable") +
  ylab("Proportion of Rows") +
  theme(
    legend.position = "none",
    axis.text.x = element_text(angle = 90, vjust = 0.5)
    )

This visualisation takes a little explaining.

Each row represents a combination of variables with simultaneous missing values. For each row in the graphic, the coloured entries show which particular variables are missing in that combination. The proportion of rows with that combination is displayed in both the label for the row and the colouring for the cells in the row.

3.3 Inspect High-level-count Categorical Variables

With the raw data loaded up we now remove obvious unique or near-unique variables that are not amenable to basic exploration and plotting.

coltype_lst <- create_coltype_list(data_tbl)

count_levels <- ~ .x %>% unique() %>% length()

catvar_valuecount_tbl <- data_tbl %>%
  summarise(
    .groups = "drop",

    across(coltype_lst$split$discrete, count_levels)
    ) %>%
  pivot_longer(
    cols      = everything(),
    names_to  = "var_name",
    values_to = "level_count"
    ) %>%
  arrange(desc(level_count))

print(catvar_valuecount_tbl)

## # A tibble: 15 × 2
##    var_name       level_count
##    <chr>                <int>
##  1 row_id             1044848
##  2 invoice_id           53628
##  3 customer_id           5943
##  4 description           5656
##  5 stock_code            5304
##  6 stock_code_upr        5131
##  7 invoice_minute          60
##  8 invoice_woy             52
##  9 country                 43
## 10 invoice_dom             31
## 11 invoice_ym              25
## 12 invoice_hour            16
## 13 invoice_month           12
## 14 invoice_dow              7
## 15 excel_sheet              2

row_count <- data_tbl %>% nrow()

cat(glue("Dataset has {row_count} rows\n"))

## Dataset has 1044848 rows

Now that we a table of the counts of all the categorical variables we can automatically exclude unique variables from the exploration, as the level count will match the row count.

unique_vars <- catvar_valuecount_tbl %>%
  filter(level_count == row_count) %>%
  pull(var_name)

print(unique_vars)

## [1] "row_id"

explore_data_tbl <- data_tbl %>%
  select(-one_of(unique_vars))

Having removed the unique identifier variables from the dataset, we may also wish to exclude categoricals with high level counts also, so we create a vector of those variable names.

highcount_vars <- catvar_valuecount_tbl %>%
  filter(level_count >= dataexp_level_exclusion_threshold,
         level_count < row_count) %>%
  pull(var_name)

cat(str_c(highcount_vars, collapse = ", "))

## invoice_id, customer_id, description, stock_code, stock_code_upr

We now can continue doing some basic exploration of the data. We may also choose to remove some extra columns from the dataset.

### You may want to comment out these next few lines to customise which
### categoricals are kept in the exploration.
drop_vars <- c(highcount_vars)

if (length(drop_vars) > 0) {
  explore_data_tbl <- explore_data_tbl %>%
      select(-one_of(drop_vars))

  cat(str_c(drop_vars, collapse = ", "))
}

## invoice_id, customer_id, description, stock_code, stock_code_upr

4 Univariate Data Exploration

Now that we have loaded the data we can prepare it for some basic data exploration.

4.1 Quick Univariate Data Summaries

We use a number of summary visualisations provided by DataExplorer: a facet plot across each variable with categorical variables getting bar plots and numerical plots getting histograms.

We first look at the barplots of categorical variables.

plot_bar(
    data_tbl,
    ncol    = 2,
    nrow    = 2,
    title   = "Barplots of Data",
    ggtheme = theme_cowplot()
    )

## 10 columns ignored with more than 50 categories.
## row_id: 1044848 categories
## invoice_id: 53628 categories
## stock_code: 5304 categories
## description: 5656 categories
## invoice_date: 604 categories
## customer_id: 5943 categories
## stock_code_upr: 5131 categories
## invoice_dttm: 47635 categories
## invoice_minute: 60 categories
## invoice_woy: 52 categories

We then have a quick look at histograms of the numeric variables.

plot_histogram(
    data_tbl,
    ncol    = 2,
    nrow    = 2,
    title   = "Histograms of Data",
    ggtheme = theme_cowplot()
    )

Finally, we split the remaining variables into different categories and then produce a sequence of plots for each variable.

coltype_lst <- create_coltype_list(explore_data_tbl)

print(coltype_lst)

## $split
## $split$continuous
## [1] "quantity"          "price"             "stock_value"      
## [4] "invoice_monthprop"
## 
## $split$datetime
## [1] "invoice_date" "invoice_dttm"
## 
## $split$discrete
## [1] "excel_sheet"    "country"        "invoice_month"  "invoice_dow"   
## [5] "invoice_dom"    "invoice_hour"   "invoice_minute" "invoice_woy"   
## [9] "invoice_ym"    
## 
## $split$logical
## [1] "cancellation"
## 
## 
## $columns
##       excel_sheet          quantity      invoice_date             price 
##        "discrete"      "continuous"        "datetime"      "continuous" 
##           country      cancellation      invoice_dttm     invoice_month 
##        "discrete"         "logical"        "datetime"        "discrete" 
##       invoice_dow       invoice_dom      invoice_hour    invoice_minute 
##        "discrete"        "discrete"        "discrete"        "discrete" 
##       invoice_woy        invoice_ym       stock_value invoice_monthprop 
##        "discrete"        "discrete"      "continuous"      "continuous"

4.2 Logical Variables

Logical variables only take two values: TRUE or FALSE. It is useful to see missing data as well though, so we also plot the count of those.

logical_vars <- coltype_lst$split$logical %>% sort()

for (plot_varname in logical_vars) {
  cat("--\n")
  cat(glue("{plot_varname}\n"))

  na_count <- explore_data_tbl %>% pull(.data[[plot_varname]]) %>% are_na() %>% sum()

  plot_title <- glue("Barplot of Counts for Variable: {plot_varname} ({na_count} missing values)")

  explore_plot <- ggplot(explore_data_tbl) +
    geom_bar(aes(x = .data[[plot_varname]])) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(plot_title) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  plot(explore_plot)
}

## --
## cancellation

4.3 Numeric Variables

Numeric variables are usually continuous in nature, though we also have integer data.

numeric_vars <- coltype_lst$split$continuous %>% sort()

for (plot_varname in numeric_vars) {
  cat("--\n")
  cat(glue("{plot_varname}\n"))

  plot_var <- explore_data_tbl %>% pull(.data[[plot_varname]])
  na_count <- plot_var %>% are_na() %>% sum()

  plot_var %>% summary() %>% print()

  plot_title <- glue("Histogram Plot for Variable: {plot_varname} ({na_count} missing values)")


  all_plot <- ggplot() +
    geom_histogram(aes(x = plot_var), bins = dataexp_hist_bins_count) +
    geom_vline(xintercept = mean(plot_var, na.rm = TRUE),
               colour = "red", size = 1.5) +
    geom_vline(xintercept = median(plot_var, na.rm = TRUE),
               colour = "green", size = 1.5) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_continuous(labels = label_comma()) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(
      plot_title,
      subtitle = "(red line is mean, green line is median)"
      )

  pos_data_tbl <- explore_data_tbl %>%
    filter(.data[[plot_varname]] >= 0) %>%
    mutate(var_val = abs(.data[[plot_varname]]))

  pos_log_plot <- ggplot(pos_data_tbl) +
    geom_histogram(aes(x = var_val), bins = dataexp_hist_bins_count) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_log10(labels = label_comma()) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle("Positive Values")

  
  neg_data_tbl <- explore_data_tbl %>%
    filter(.data[[plot_varname]] < 0) %>%
    mutate(var_val = abs(.data[[plot_varname]]))

  neg_log_plot <- ggplot(neg_data_tbl) +
    geom_histogram(aes(x = var_val), bins = dataexp_hist_bins_count) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_log10(labels = label_comma()) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle("Negative Values")


  plot_grid(
      all_plot,
      NULL,
      pos_log_plot,
      neg_log_plot,
      nrow = 2
      ) %>%
    print()
}

## --
## invoice_monthprop   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03226 0.28571 0.53333 0.52630 0.76667 1.00000

## --
## price     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -53594.36      1.25      2.10      4.59      4.13  38970.00

## --
## quantity     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -80995.00      1.00      3.00      9.99     10.00  80995.00

## --
## stock_value      Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
## -168469.60       3.75       9.90      18.10      17.70  168469.60

4.4 Categorical Variables

Categorical variables only have values from a limited, and usually fixed, number of possible values

categorical_vars <- coltype_lst$split$discrete %>% sort()

for (plot_varname in categorical_vars) {
  cat("--\n")
  cat(glue("{plot_varname}\n"))

  na_count <- explore_data_tbl %>% pull(.data[[plot_varname]]) %>% are_na() %>% sum()

  plot_title <- glue("Barplot of Counts for Variable: {plot_varname} ({na_count} missing values)")

  standard_plot_tbl <- explore_data_tbl %>%
    count(.data[[plot_varname]])

  standard_plot <- ggplot(standard_plot_tbl) +
    geom_bar(aes(x = .data[[plot_varname]], weight = n)) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_discrete(labels = ~ abbreviate(.x, minlength = 10)) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(plot_title) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  standard_plot %>% print()


  desc_plot_tbl <- explore_data_tbl %>%
    pull(.data[[plot_varname]]) %>%
    fct_lump(n = dataexp_cat_level_count) %>%
    fct_count() %>%
    mutate(f = fct_relabel(f, str_trunc, width = 15))

  desc_plot <- ggplot(desc_plot_tbl) +
    geom_bar(aes(x = fct_reorder(f, -n), weight = n)) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_discrete(labels = abbreviate) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(plot_title) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  desc_plot %>% print()
}

## --
## country

## --
## excel_sheet

## --
## invoice_dom

## --
## invoice_dow

## --
## invoice_hour

## --
## invoice_minute

## --
## invoice_month

## --
## invoice_woy

## --
## invoice_ym

4.5 Date/Time Variables

Date/Time variables represent calendar or time-based data should as time of the day, a date, or a timestamp.

datetime_vars <- coltype_lst$split$datetime %>% sort()

for (plot_varname in datetime_vars) {
  cat("--\n")
  cat(glue("{plot_varname}\n"))

  plot_var <- explore_data_tbl %>% pull(.data[[plot_varname]])
  na_count <- plot_var %>% are_na() %>% sum()

  plot_var %>% summary() %>% print()

  plot_title <- glue("Barplot of Dates/Times in Variable: {plot_varname} ({na_count} missing values)")


  explore_plot <- ggplot(explore_data_tbl) +
    geom_histogram(aes(x = .data[[plot_varname]]), bins = dataexp_hist_bins_count) +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(plot_title)

  plot(explore_plot)
}

## --
## invoice_date        Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
## "2009-12-01" "2010-07-05" "2010-12-08" "2011-01-02" "2011-07-27" "2011-12-09"

## --
## invoice_dttm                 Min.               1st Qu.                Median 
## "2009-12-01 07:45:00" "2010-07-05 11:11:00" "2010-12-08 16:34:00" 
##                  Mean               3rd Qu.                  Max. 
## "2011-01-03 11:44:15" "2011-07-27 13:42:00" "2011-12-09 12:50:00"

5 Bivariate Facet Plots

We now move on to looking at bivariate plots of the data set.

A natural way to explore relationships in data is to create univariate visualisations facetted by a categorical value.

facet_varname <- "invoice_month"

dataexp_facet_count_max <- 3

5.1 Logical Variables

For logical variables we facet on barplots of the levels, comparing TRUE, FALSE and missing data.

logical_vars <- logical_vars[!logical_vars %in% facet_varname] %>% sort()


for (plot_varname in logical_vars) {
  cat("--\n")
  cat(plot_varname)

  plot_tbl <- data_tbl %>% filter(!are_na(.data[[plot_varname]]))

  explore_plot <- ggplot(plot_tbl) +
    geom_bar(aes(x = .data[[plot_varname]])) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  plot(explore_plot)
}

## --
## cancellation

5.2 Numeric Variables

For numeric variables, we facet on histograms of the data.

for (plot_varname in numeric_vars) {
  cat("--\n")
  cat(plot_varname)

  plot_tbl <- data_tbl %>% filter(!are_na(.data[[plot_varname]]))

  explore_plot <- ggplot(plot_tbl) +
    geom_histogram(aes(x = .data[[plot_varname]]), bins = dataexp_hist_bins_count) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_continuous(labels = label_comma()) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  print(explore_plot)
}

## --
## invoice_monthprop

## --
## price

## --
## quantity

## --
## stock_value

5.3 Categorical Variables

We treat categorical variables like logical variables, faceting the barplots of the different levels of the data.

categorical_vars <- categorical_vars[!categorical_vars %in% facet_varname] %>% sort()

for (plot_varname in categorical_vars) {
  cat("--\n")
  cat(plot_varname)

  plot_tbl <- data_tbl %>%
    filter(!are_na(.data[[plot_varname]])) %>%
    mutate(
      varname_trunc = fct_relabel(.data[[plot_varname]], str_trunc, width = 10)
      )

  explore_plot <- ggplot(plot_tbl) +
    geom_bar(aes(x = varname_trunc)) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_x_discrete(labels = abbreviate) +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  plot(explore_plot)
}

## --
## country

## --
## excel_sheet

## --
## invoice_dom

## --
## invoice_dow

## --
## invoice_hour

## --
## invoice_minute

## --
## invoice_woy

## --
## invoice_ym

5.4 Date/Time Variables

Like the univariate plots, we facet on histograms of the years in the dates.

for (plot_varname in datetime_vars) {
  cat("--\n")
  cat(plot_varname)

  plot_tbl <- data_tbl %>% filter(!are_na(.data[[plot_varname]]))

  explore_plot <- ggplot(plot_tbl) +
    geom_histogram(aes(x = .data[[plot_varname]]), bins = dataexp_hist_bins_count) +
    facet_wrap(facet_varname, scales = "free") +
    xlab(plot_varname) +
    ylab("Count") +
    scale_y_continuous(labels = label_comma()) +
    ggtitle(glue("{facet_varname}-Faceted Histogram for Variable: {plot_varname}")) +
    theme(axis.text.x = element_text(angle = 30, vjust = 0.5))

  plot(explore_plot)
}

## --
## invoice_date

## --
## invoice_dttm

6 Custom Explorations

In this section we perform various data explorations.

6.1 Custom Checks for Data Integrity

We want to check the transaction data for consistency, so we create a table of all distinct

stock_codes_lookup_tbl <- data_tbl %>%
  select(stock_code_upr, description) %>%
  distinct() %>%
  arrange(stock_code_upr, description) %>%
  drop_na(description)

stock_codes_lookup_tbl %>% glimpse()

## Rows: 6,342
## Columns: 2
## $ stock_code_upr <chr> "10002", "10002R", "10080", "10080", "10109", "10120", …
## $ description    <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", …

We now take a look at the first 50 rows of this table to get a sense of any possible duplication of stock_code.

stock_codes_lookup_tbl %>% datatable()

6.1.1 Items per Transactions

As another check on the data, we want to look at how many different objects are included in

plot_tbl <- data_tbl %>%
  filter(quantity > 0) %>%
  count(invoice_id, name = "n_items")

ggplot(plot_tbl) +
  geom_histogram(aes(x = n_items), bins = 40) +
  scale_x_log10(labels = label_comma()) +
  scale_y_continuous(labels = label_comma()) +
  xlab("Number of Items") +
  ylab("Transaction Count") +
  ggtitle("Histogram of Item Counts per Transactions")

6.2 Explore Aggregate Amounts

We now turn our focus to aggregating the data set in various ways and inspect how these aggregate totals are distributed.

6.2.1 Invoice-Level Amounts

We first aggregate the data at the invoice level, and inspect how those amounts are distributed.

invoice_data_tbl <- data_tbl %>%
  group_by(invoice_id) %>%
  summarise(
    .groups = "drop",
    
    invoice_amount = sum(price * quantity) %>% round(2)
  )

invoice_mean   <- invoice_data_tbl %>% pull(invoice_amount) %>% mean()   %>% round(2)
invoice_median <- invoice_data_tbl %>% pull(invoice_amount) %>% median() %>% round(2)

ggplot(invoice_data_tbl) +
  geom_histogram(aes(x = invoice_amount), bins = 50) +
  geom_vline(aes(xintercept = invoice_mean),   colour = "black") +
  geom_vline(aes(xintercept = invoice_median), colour = "red") +
  xlab("Invoice Amount") +
  ylab("Count") +
  scale_x_log10(labels = label_comma()) +
  scale_y_continuous(labels = label_comma()) +
  ggtitle(
    label    = "Histogram Plot for Invoice Amount",
    subtitle = glue("Mean is {invoice_mean}, Median is {invoice_median}")
    )

We see there is a broad range of different invoice totals, with mean and median being a few hundred pounds.

6.2.2 Customer-Level Amounts

customer_data_tbl <- data_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    customer_spend = sum(price * quantity) %>% round(2)
  )
  
ggplot(customer_data_tbl) +
  geom_histogram(aes(x = customer_spend), bins = 50) +
  xlab("Customer Spend") +
  ylab("Count") +
  scale_x_log10(labels = label_comma()) +
  scale_y_continuous(labels = label_comma()) +
  ggtitle("Histogram Plot for Customer Spend")

customer_data_tbl %>% pull(customer_spend) %>% hill()

6.2.3 Stock Code Investigation

We now want to have a quick look at some high level summary statistics on the stock codes, so we look at quantities and total value.

stock_code_summary_tbl <- data_tbl %>%
  group_by(stock_code_upr) %>%
  summarise(
    .groups = "drop",
    
    row_count = n(),

    net_quantity = sum(quantity),
    abs_quantity = abs(quantity) %>% sum(),
    net_value    = sum(stock_value),
    abs_value    = abs(quantity) %>% sum()
    ) %>%
  mutate(
    sc_nchar = nchar(stock_code_upr), .before = "row_count"
    )

stock_code_summary_tbl %>% datatable()

It appears there are number of odd stock codes in the dataset, so we look at those codes that are 4 characters or less and inspect those.

short_stock_codes_tbl <- data_tbl %>%
  semi_join(stock_code_summary_tbl %>% filter(sc_nchar < 5), by = "stock_code_upr")

short_stock_codes_tbl %>% datatable()

6.2.4 Stock Code Price Data

We want to look at the range of different prices assigned to the same stock_code value.

stock_price_counts_tbl <- data_tbl %>%
  group_by(stock_code) %>%
  summarise(
    .groups = "drop",
    
    n_prices    = n(),
    min_price   = min(price),
    p25_price   = quantile(price, 0.25),
    mean_price  = mean(price) %>% round(2),
    p50_price   = median(price),
    p75_price   = quantile(price, 0.75),
    max_price   = max(price),
    range_price = ((max_price - min_price) / mean_price) %>% round(4)
    )

stock_price_counts_tbl %>% datatable()

stock_distinct_price_counts_tbl <- data_tbl %>%
  select(stock_code, price) %>%
  distinct() %>%
  group_by(stock_code) %>%
  summarise(
    .groups = "drop",
    
    n_prices    = n(),
    min_price   = min(price),
    p25_price   = quantile(price, 0.25),
    mean_price  = mean(price) %>% round(2),
    p50_price   = median(price),
    p75_price   = quantile(price, 0.75),
    max_price   = max(price),
    range_price = (max_price - min_price) / mean_price
    )

stock_distinct_price_counts_tbl %>% datatable()

6.3 Construct Time-Series / Date-Based Data

Another way to look at this data is to combine all the invoice values by various time period such as daily, weekly and monthly to see how it looks.

As we are going to do a number of aggregations based on various aspects of the date, we construct a function that takes a table of data and adds a number of derived columns based on that date: things like day of week, calendar month and so on.

append_calendar_columns <- function(data_tbl, date_col) {
  updated_data_tbl <- data_tbl %>%
    mutate(
      invoice_date   = {{date_col}} %>% as.Date(),
      
      invoice_month  = {{date_col}} %>% format("%B"),
      invoice_dow    = {{date_col}} %>% format("%A"),
      invoice_dom    = {{date_col}} %>% format("%d"),
      invoice_hour   = {{date_col}} %>% format("%H"),
      invoice_minute = {{date_col}} %>% format("%M"),
      invoice_woy    = {{date_col}} %>% format("%V"),
      invoice_ym     = {{date_col}} %>% format("%Y%m"),
      
      .after = {{date_col}}
      )
  
  return(updated_data_tbl)
}

data_tbl %>%
  select(excel_sheet, invoice_id, invoice_date, stock_value) %>%
  append_calendar_columns(invoice_date) %>%
  glimpse()

## Rows: 1,044,848
## Columns: 11
## $ excel_sheet    <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010", "…
## $ invoice_id     <chr> "489434", "489434", "489434", "489434", "489434", "4894…
## $ invoice_date   <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-1…
## $ invoice_month  <chr> "December", "December", "December", "December", "Decemb…
## $ invoice_dow    <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", …
## $ invoice_dom    <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01", "…
## $ invoice_hour   <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "…
## $ invoice_minute <chr> "00", "00", "00", "00", "00", "00", "00", "00", "00", "…
## $ invoice_woy    <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49", "…
## $ invoice_ym     <chr> "200912", "200912", "200912", "200912", "200912", "2009…
## $ stock_value    <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50…

6.3.1 Create Univariate Time-Series of Amounts

ts_data_tbl <- data_tbl %>%
  mutate(
    ts_week  = format(invoice_dttm, "%Y-%U"),
    ts_month = format(invoice_dttm, "%Y-%m")
    )

ts_daily_tbl <- ts_data_tbl %>%
  group_by(label = invoice_date %>% format("%Y-%m-%d")) %>%
  summarise(
    .groups = "drop",

    period_date = min(invoice_date),
    total_spend = sum(price * quantity) %>% round(2)
    )

ggplot(ts_daily_tbl) +
  geom_line(aes(x = period_date, y = total_spend)) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = label_comma()) +
  xlab("Date") +
  ylab("Total Spend") +
  ggtitle("Lineplot of Total Spend by Day")

ts_weekly_tbl <- ts_data_tbl %>%
  group_by(label = ts_week) %>%
  summarise(
    .groups = "drop",

    period_date = min(invoice_date),
    total_spend = sum(price * quantity)
    )

ggplot(ts_weekly_tbl) +
  geom_line(aes(x = period_date, y = total_spend)) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = label_comma()) +
  xlab("Date") +
  ylab("Total Spend") +
  ggtitle("Lineplot of Total Spend by Week")

ts_monthly_tbl <- ts_data_tbl %>%
  group_by(label = ts_month) %>%
  summarise(
    .groups = "drop",

    period_date = min(invoice_date),
    total_spend = sum(price * quantity) %>% round(2)
    )

ggplot(ts_monthly_tbl) +
  geom_line(aes(x = period_date, y = total_spend)) +
  expand_limits(y = 0) +
  scale_y_continuous(labels = label_comma()) +
  xlab("Date") +
  ylab("Total Spend") +
  ggtitle("Lineplot of Total Spend by Month")

To avoid dealing with multiple files for the time series, we combine them into a single object.

ts_data_tbl <- list(
    daily   = ts_daily_tbl,
    weekly  = ts_weekly_tbl,
    monthly = ts_monthly_tbl
    ) %>%
  bind_rows(.id = "series") %>%
  arrange(series, period_date)

ts_data_tbl %>% write_rds("data/retail_timeseries_tbl.rds")

6.3.2 Calendar-Based Boxplots

We have aggregated our data across time periods, but it is also worth looking at both transaction-level and invoice-level amount over time.

ggplot(data_tbl) +
  geom_boxplot(aes(x = invoice_woy, y = stock_value, group = invoice_woy)) +
  scale_y_log10(labels = label_comma()) +
  xlab("Week of Year") +
  ylab("Transaction Amount") +
  ggtitle("Boxplot of Transaction Sizes by Week of Year") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

plot_tbl <- data_tbl %>%
  group_by(invoice_woy, invoice_id) %>%
  summarise(
    .groups = "drop",
    
    invoice_amount = sum(stock_value) %>% round(2)
  )

ggplot(plot_tbl) +
  geom_boxplot(aes(x = invoice_woy, y = invoice_amount, group = invoice_woy)) +
  scale_y_log10(labels = label_comma()) +
  xlab("Week of Year") +
  ylab("Invoice Amount") +
  ggtitle("Boxplot of Invoice Amounts by Week of Year") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

6.4 Check Distribution of Daily Purchases

We now look at individual invoice amounts, and look at how they are distributed on a daily basis.

daily_distribution_tbl <- data_tbl %>%
  group_by(invoice_date, invoice_id) %>%
  summarise(
    .groups = "drop",
    
    total_spend = sum(stock_value)
    )

daily_distribution_tbl %>% glimpse()

## Rows: 53,628
## Columns: 3
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ invoice_id   <chr> "489434", "489435", "489436", "489437", "489438", "489439…
## $ total_spend  <dbl> 505.30, 145.80, 630.33, 310.75, 2286.24, 426.30, 50.40, 3…

We now produce a boxplot for each day so we have a sense of how the distribution of transaction amounts changes and evolves over time.

plot_tbl <- daily_distribution_tbl %>%
  filter(total_spend > 0) %>%
  group_by(invoice_date) %>%
  summarise(
    .groups = "drop",
    
    qn   = c("total_spend", "p10", "p25", "p50", "p75", "p90"),
    vals = c(
      sum(total_spend),
      quantile(total_spend, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
      )
    ) %>%
  pivot_wider(
    names_from = qn,
    values_from = vals
  )

ggplot(plot_tbl) +
  geom_errorbar(aes(x = invoice_date, ymin = p25, ymax = p75, colour = total_spend),
                width = 0) +
  scale_y_log10(labels = label_comma()) +
  scale_colour_continuous(labels = label_comma()) +
  labs(x = "Date", y = "Transaction Amount", colour = "Spend") +
  ggtitle("P25-P75 Quartile of Transaction Amounts by Date")

It does not look there is much of a change over time or any particular patterns that stand out, so we also want to look at the refunds.

plot_tbl <- daily_distribution_tbl %>%
  filter(total_spend < 0) %>%
  mutate(total_refund = abs(total_spend)) %>%
  group_by(invoice_date) %>%
  summarise(
    .groups = "drop",
    
    qn   = c("total_refund", "p10", "p25", "p50", "p75", "p90"),
    vals = c(
      sum(total_refund),
      quantile(total_refund, probs = c(0.10, 0.25, 0.50, 0.75, 0.90))
      )
    ) %>%
  pivot_wider(
    names_from = qn,
    values_from = vals
  )

ggplot(plot_tbl) +
  geom_errorbar(aes(x = invoice_date, ymin = p25, ymax = p75, colour = total_refund),
                width = 0) +
  scale_y_log10(labels = label_comma()) +
  scale_colour_continuous(labels = label_comma()) +
  labs(x = "Date", y = "Transaction Amount", colour = "Spend") +
  ggtitle("P25-P75 Quartile of Refund Amounts by Date")

6.5 Investigate Refunds

This transactions data also includes returns and refunds - adding a level of complexity to this analysis as we need to account for this when assessing the data.

One option is to ignore the refunds, at least initially, but this may add a large source of bias to our analysis and we need to get a feel for this.

To do this, we will look at each line-item entry with a negative quantity and we will look back in time to the previous time such an time was sold, and we match on the customer_id, stock_code and price. This is strong evidence that the previous sale was then returned, and so we can label these entries.

Due to the time snapshot nature of this data, this also means there are likely to be some returns in our data that do not have a corresponding sale, so we can also ignore those.

filter_returns_transactions <- function(use_stock, use_customer,  use_dttm,
                                        data_tbl) {
  matches_tbl <- data_tbl %>%
    filter(stock_code   == use_stock,
           customer_id  == use_customer,
           invoice_dttm <= use_dttm,
           quantity     >  0)

  return(matches_tbl)
}

match_tnx_prices <- function(data_tbl, return_price, return_quantity) {
  matched_tbl <- data_tbl %>%
    filter(abs(price) == abs(return_price)) %>%
    select(row_id, invoice_id, stock_code, quantity, price, invoice_dttm) %>%
    arrange(desc(invoice_dttm)) %>%
    mutate(
      cuml_quantity   = cumsum(quantity),
      return_quantity = return_quantity,
      remaining       = cuml_quantity + return_quantity
      )

  return(matched_tbl)
}


detemine_return_records <- function(data_tbl) {
  negative_tbl <- data_tbl %>%
    filter(remaining <= 0)
  
  first_tbl <- data_tbl %>%
    filter(remaining > 0) %>%
    arrange(desc(invoice_dttm)) %>%
    head(1)
  
  
  matched_tbl <- list(negative_tbl, first_tbl) %>%
    bind_rows() %>%
    arrange(desc(invoice_dttm)) %>%
    rename(orig_row_id = row_id)
  
  return(matched_tbl)
}


returns_data_tbl <- data_tbl %>%
  filter(quantity < 0) %>%
  select(row_id, stock_code, customer_id, quantity, price, invoice_dttm) %>%
  mutate(
    prev_tnx_data = future_pmap(
      list(use_stock    = stock_code,
           use_customer = customer_id,
           use_dttm     = invoice_dttm),
      filter_returns_transactions,
      data_tbl = data_tbl,
      
      .options  = furrr_options(seed = 421),
      .progress = TRUE      
      ),
    price_data = future_pmap(
      list(data_tbl        = prev_tnx_data,
           return_price    = price,
           return_quantity = quantity),
      match_tnx_prices,
      
      .options  = furrr_options(seed = 422),
      .progress = TRUE
      ),
    match_data = map(price_data, detemine_return_records)
    )


returns_data_tbl %>% glimpse()

## Rows: 22,557
## Columns: 9
## $ row_id        <chr> "ROW1025684", "ROW1025685", "ROW1025686", "ROW1025687", …
## $ stock_code    <chr> "22087", "85206A", "21895", "21896", "22083", "21871", "…
## $ customer_id   <chr> "16321", "16321", "16321", "16321", "16321", "16321", "1…
## $ quantity      <dbl> -12, -6, -4, -6, -12, -12, -12, -24, -12, -3, -3, -3, -3…
## $ price         <dbl> 2.95, 1.65, 4.25, 2.10, 2.95, 1.25, 1.25, 0.85, 2.95, 4.…
## $ invoice_dttm  <dttm> 2009-12-01 10:32:59, 2009-12-01 10:32:59, 2009-12-01 10…
## $ prev_tnx_data <list> [<tbl_df[0 x 22]>], [<tbl_df[0 x 22]>], [<tbl_df[0 x 22…
## $ price_data    <list> [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>]…
## $ match_data    <list> [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>], [<tbl_df[0 x 9]>]…

We can then use this data to get an idea of how often items are returned, and how much time tends to pass between the purchase and the return.

returns_lookup_tbl <- returns_data_tbl %>%
  select(return_row_id = row_id, return_dttm = invoice_dttm, match_data) %>%
  unnest(match_data) %>%
  select(
    orig_row_id, return_row_id, return_dttm,
    adjust_quantity = return_quantity
    ) %>%
  group_by(return_row_id) %>%
  arrange(orig_row_id) %>%
  slice_head(n = 1) %>%
  ungroup()

returns_lookup_tbl %>% glimpse()

## Rows: 15,538
## Columns: 4
## $ orig_row_id     <chr> "ROW0000260", "ROW0001235", "ROW0001214", "ROW0001233"…
## $ return_row_id   <chr> "ROW1025730", "ROW1025733", "ROW1025734", "ROW1025735"…
## $ return_dttm     <dttm> 2009-12-01 12:42:59, 2009-12-01 13:09:00, 2009-12-01 …
## $ adjust_quantity <dbl> -3, -24, -24, -12, -24, -81, -24, -48, -3, -12, -12, -…

We probably need to do a better job of this, as this current method is inadequate to properly match this up.

Stepping up this work to properly match up the transactions will be a topic we return to later.

We now want to look at the distribution of times between the purchase and the item being returned.

orig_data_tbl <- data_tbl %>%
  select(orig_row_id   = row_id, purchase_dttm = invoice_dttm)

ret_data_tbl <- data_tbl %>%
  select(return_row_id = row_id, return_dttm   = invoice_dttm)

return_times_tbl <- returns_lookup_tbl %>%
  select(orig_row_id, return_row_id) %>%
  left_join(orig_data_tbl, by = "orig_row_id") %>%
  left_join(ret_data_tbl,  by = "return_row_id") %>%
  mutate(
    return_time = return_dttm - purchase_dttm,
    return_days  = as.numeric(return_time) / (24 * 60 * 60)
    )

base_plot <- ggplot(return_times_tbl) +
  geom_histogram(aes(x = return_days), bins = 50) +
  geom_vline(aes(xintercept = 90), colour = "red") +
  xlab("Days") +
  ylab("Count") +
  ggtitle("Histogram of Days Between Purchase and Return per Item Transaction")

norm_plot <- base_plot +
  scale_x_continuous(labels = label_comma())

log_plot <- base_plot +
  scale_x_log10(labels = label_comma())

plot_grid(norm_plot, log_plot, nrow = 2)

7 Indicate Excluded Rows

As we have found a number of issues with this data, we now will indicate which rows we wish to exclude from futher analysis. We will then exclude these rows at a later point of the analysis.

7.1 Filter Extra Stock Codes

We discovered there are a number of extra data in this dataset for things like bad debt management, discounts, gift cards and so on, so we remove those from this.

screen_stock_code <- c(
  "B", "C2", "C3", "D", "M", "S", "CRUK", "POST", "DOT", "BANK CHARGES",
  "AMAZONFEE", "ADJUST", "ADJUST2", "TEST001", "TEST002"
  )

clean_stock_code_tbl <- data_tbl %>%
  filter(stock_code_upr %in% screen_stock_code) %>%
  select(row_id)

clean_stock_code_tbl %>% glimpse()

## Rows: 5,712
## Columns: 1
## $ row_id <chr> "ROW0000090", "ROW0000127", "ROW0000174", "ROW0000593", "ROW102…

clean_gift_tbl <- data_tbl %>%
  filter(str_detect(stock_code_upr, "GIFT")) %>%
  select(row_id)

clean_gift_tbl %>% glimpse()

## Rows: 101
## Columns: 1
## $ row_id <chr> "ROW0029851", "ROW0030310", "ROW0031279", "ROW0040002", "ROW004…

7.2 Create Exclusion Indicator

exclusions_tbl <- list(
    clean_stock_code_tbl,
    clean_gift_tbl
    ) %>%
  bind_rows() %>%
  mutate(exclude = TRUE)

cleaned_data_tbl <- data_tbl %>%
  left_join(exclusions_tbl, by = "row_id") %>%
  replace_na(list(exclude = FALSE))

cleaned_data_tbl %>% glimpse()

## Rows: 1,044,848
## Columns: 23
## $ row_id            <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet       <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id        <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code        <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description       <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity          <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date      <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price             <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id       <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country           <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr    <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm      <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month     <fct> December, December, December, December, December, De…
## $ invoice_dow       <fct> Tuesday, Tuesday, Tuesday, Tuesday, Tuesday, Tuesday…
## $ invoice_dom       <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour      <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute    <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy       <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym        <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value       <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude           <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…

8 Construct Stock Code Lookups

Finally, we want to construct a lookup table that provides some free-text fields for each stock_code value.

dedupe_stock_descs <- function(stock_desc) {
  dedupe_descs <- stock_desc %>%
    enframe(name = NULL, value = "stock_desc") %>%
    mutate(
      desc_len    = stock_desc %>% nchar(),
      desc_dedupe = stock_desc %>% str_trim() %>% str_replace_all("[^\\w]| ", ""),
      desc_output = stock_desc %>% str_trim() %>% str_squish()
      ) %>%
    group_by(desc_dedupe) %>%
    slice_max(order_by = desc_len, n = 1, with_ties = FALSE) %>%
    ungroup() %>%
    pull(desc_output)

  return(dedupe_descs)
}

We use some simple logic to attempt to de-dupe the descriptions as much as possible.

stock_description_tbl <- cleaned_data_tbl %>%
  filter(
    exclude == FALSE,
    quantity > 0,
    price > 0,
    !are_na(description)
    ) %>%
  mutate(
    stock_code = stock_code %>% str_trim() %>% str_to_upper()
    ) %>%
  select(stock_code, description) %>%
  drop_na(description) %>%
  distinct() %>%
  group_by(stock_code) %>%
  summarise(
    .groups = "drop",
    
    desc = description %>% sort() %>% dedupe_stock_descs() %>% str_c(collapse = " : ")
    ) %>%
  arrange(stock_code)

stock_description_tbl %>% glimpse()

## Rows: 4,725
## Columns: 2
## $ stock_code <chr> "10002", "10002R", "10080", "10109", "10120", "10123C", "10…
## $ desc       <chr> "INFLATABLE POLITICAL GLOBE", "ROBOT PENCIL SHARPNER", "GRO…

We also look at this output using DT

stock_description_tbl %>% datatable()

9 BTYD Visualisations

Despite our extensive exploration of the data earlier, the concepts around BTYD modelling suggest a few more than are worth exploring, so we will look at those now.

We model the purchase data first, then combine this to create an individual customer/invoice pairing with the total amount spent as an additional column.

tnx_purchase_tbl <- cleaned_data_tbl %>%
  filter(
    quantity > 0,
    price > 0,
    exclude == FALSE
    ) %>%
  select(
    invoice_date, invoice_id, stock_code, customer_id, description,
    quantity, price, stock_value
    )

tnx_purchase_tbl %>% glimpse()

## Rows: 1,015,088
## Columns: 8
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ invoice_id   <chr> "489434", "489434", "489434", "489434", "489434", "489434…
## $ stock_code   <chr> "85048", "79323P", "79323W", "22041", "21232", "22064", "…
## $ customer_id  <chr> "13085", "13085", "13085", "13085", "13085", "13085", "13…
## $ description  <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY LIGHT…
## $ quantity     <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, 18, 3…
## $ price        <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55, 3.7…
## $ stock_value  <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59.50, …

Use of BTYD models assumes a total spend over a period of day and those differences between the times. This is calculated internally by the various BTYD routines so we keep just the per-invoice spend.

daily_spend_invoice_tbl <- tnx_purchase_tbl %>%
  drop_na(customer_id) %>%
  group_by(invoice_date, customer_id, invoice_id) %>%
  summarise(
    .groups = "drop",
    
    invoice_spend = sum(stock_value)
    )

daily_spend_invoice_tbl %>% glimpse()

## Rows: 36,594
## Columns: 4
## $ invoice_date  <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12…
## $ customer_id   <chr> "12490", "12533", "12682", "12758", "12836", "12913", "1…
## $ invoice_id    <chr> "489557", "489526", "489439", "489599", "489593", "48954…
## $ invoice_spend <dbl> 531.94, 821.92, 372.30, 2454.68, 423.87, 537.96, 261.00,…

daily_spend_tbl <- daily_spend_invoice_tbl %>%
  group_by(invoice_date, customer_id) %>%
  summarise(
    .groups = "drop",
    
    total_spend = sum(invoice_spend),
    tnx_count   = n()
    )

daily_spend_tbl %>% glimpse()

## Rows: 32,878
## Columns: 4
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-…
## $ customer_id  <chr> "12490", "12533", "12682", "12758", "12836", "12913", "12…
## $ total_spend  <dbl> 531.94, 821.92, 372.30, 2454.68, 423.87, 537.96, 261.00, …
## $ tnx_count    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, …

To start with, it might be worth understanding a bit more about when customers are ‘born’ in the system - that is, the date on which they make their first purchase. Another important quantity is the time between transactions for a customer, we we will visualise these.

customer_cohort_tbl <- daily_spend_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    first_tnx_date  = min(invoice_date),
    total_tnx_count = n()
    ) %>%
  mutate(
    cohort_qtr = first_tnx_date %>% as.yearqtr() %>% as.character(),
    cohort_ym  = first_tnx_date %>% format("%Y %m"),
    
    .after = "customer_id"
    )


customer_cohort_tbl %>% glimpse()

## Rows: 5,852
## Columns: 5
## $ customer_id     <chr> "12346", "12347", "12348", "12349", "12350", "12351", …
## $ cohort_qtr      <chr> "2010 Q1", "2010 Q4", "2010 Q3", "2010 Q2", "2011 Q1",…
## $ cohort_ym       <chr> "2010 03", "2010 10", "2010 09", "2010 04", "2011 02",…
## $ first_tnx_date  <date> 2010-03-02, 2010-10-31, 2010-09-27, 2010-04-29, 2011-…
## $ total_tnx_count <int> 3, 8, 5, 3, 1, 1, 9, 2, 1, 2, 6, 2, 5, 10, 6, 4, 10, 2…

Now that we have a first date for each customer, we look at the total number of customers joining at each date.

plot_tbl <- customer_cohort_tbl %>%
  count(first_tnx_date, name = "n_customer")

ggplot(plot_tbl) +
  geom_line(aes(x = first_tnx_date, y = n_customer)) +
  labs(
    x = "First Transaction Date",
    y = "New Customers",
    title = "Plot of Count of New Customer by Date"
    )

We know look at how time differences between purchases are distributed.

customer_tnx_diffs_tbl <- daily_spend_tbl %>%
  group_by(customer_id) %>%
  summarise(
    .groups = "drop",
    
    time_diff = diff(invoice_date) %>% as.numeric() %>% divide_by(7)
  )

mean_diff <- customer_tnx_diffs_tbl %>% pull(time_diff) %>% mean()

ggplot(customer_tnx_diffs_tbl) +
  geom_histogram(aes(x = time_diff), bins = 50) +
  geom_vline(aes(xintercept = mean_diff), colour = "red") +
  scale_y_continuous(labels = label_comma()) +
  labs(
    x = "Time Difference (weeks)",
    y = "Frequency",
    title = "Histogram of Differences Between Transactions for Customers",
    subtitle = glue(
      "Mean Difference is {mean_diff} weeks", mean_diff = mean_diff %>% round(2)
      )
    )

We also want to look at a number of customers and make some line plots of their transactions.

keep_customers_tbl <- customer_cohort_tbl %>%
  filter(total_tnx_count > 2) %>%
  slice_sample(n = 30)

plot_tbl <- daily_spend_tbl %>%
  semi_join(keep_customers_tbl, by = "customer_id")

ggplot(plot_tbl, aes(x = invoice_date, y = customer_id, group = customer_id)) +
  geom_line() +
  geom_point() +
  labs(
    x = "Transaction Date",
    y = "Customer ID",
    title = "Visualisation of Transaction Times for 30 Customers"
    ) +
  theme(axis.text.y = element_text(size = 12))

9.1 Investigate Cohorts

Finally, we want to take a look at the distribution of transaction times based on various first-transaction cohorts in the data.

ggplot(customer_cohort_tbl) +
  geom_histogram(aes(x = first_tnx_date), bins = 50) +
  labs(
    x = "Date of First Transaction",
    y = "Count",
    title = "Histogram of New Customer Start Dates"
    )

We also want to get a sense of the total count of customers in each cohort.

plot_tbl <- customer_cohort_tbl %>%
  count(cohort_qtr, name = "customer_count") %>%
  mutate(cohort_qtr = cohort_qtr %>% as.character())
  
ggplot(plot_tbl) +
  geom_col(aes(x = cohort_qtr, y = customer_count)) +
  scale_y_continuous(labels = label_comma()) +
  labs(
    x = "Customer Cohort",
    y = "Customer Count",
    title = "Bar Plot of Customer Quarterly Cohort Sizes"
    )

We also want to see the monthly cohorts:

plot_tbl <- customer_cohort_tbl %>%
  count(cohort_ym, name = "customer_count") %>%
  mutate(cohort_ym = cohort_ym %>% as.character())

ggplot(plot_tbl) +
  geom_col(aes(x = cohort_ym, y = customer_count)) +
  scale_y_continuous(labels = label_comma()) +
  labs(
    x = "Customer Cohort",
    y = "Customer Count",
    title = "Bar Plot of Customer Monthly Cohort Sizes"
    ) +
  theme(
    axis.text.x = element_text(size = 10, angle = 20, vjust = 0.5)
    )

For the cohort analysis, we start with a boxplot of the time difference between transactions by cohort.

plot_tbl <- customer_cohort_tbl %>%
  inner_join(customer_tnx_diffs_tbl, by = "customer_id") %>%
  mutate(cohort_qtr = cohort_qtr %>% as.character())

ggplot(plot_tbl) +
  geom_boxplot(aes(x = cohort_qtr, y = time_diff)) +
  scale_y_log10() +
  labs(
    x = "Cohort",
    y = "Time Difference (weeks)",
    title = "Boxplot of Time Differences by Starting Cohort"
    )

We also construct a density plot of the time differences for these cohorts.

ggplot(plot_tbl, aes(x = time_diff, colour = cohort_qtr)) +
  geom_line(stat = "density") +
  geom_dl(aes(label = cohort_qtr), method = "top.bumpup", stat = "density") +
  labs(
    x = "Time Difference (weeks)",
    y = "Density",
    title = "Comparison Density Plot for Transaction Time Differences Between Cohorts"
    ) +
  theme(legend.position = "none")

And we also look at a facetted-histogram

ggplot(plot_tbl) +
  geom_histogram(aes(x = time_diff), bins = 50) +
  facet_wrap(vars(cohort_qtr), scales = "free_y") +
  scale_y_continuous(labels = label_comma()) +
  labs(
    x = "Time Difference (weeks)",
    y = "Count",
    title = "Facetted Histograms of Time Between Transactions"
    )

9.2 Investigate Dropout Rates

We want to plot some visualisations of the lifetime and dropout rate of customers in each cohort.

cohort_dropout_est_tbl <- customer_cohort_tbl %>%
  select(
    customer_id, first_tnx_date, cohort_qtr, cohort_ym
    ) %>%
  inner_join(daily_spend_tbl, by = "customer_id") %>%
  group_by(cohort_qtr, customer_id) %>%
  mutate(
    final_tnx_date = max(invoice_date)
    ) %>%
  ungroup() %>%
  select(
    customer_id, cohort_qtr, first_tnx_date, final_tnx_date
    ) %>%
  distinct() %>%
  mutate(
    obs_lifetime = difftime(final_tnx_date, first_tnx_date, units = "week") %>%
      as.numeric()
    ) %>%
  filter(obs_lifetime > 0) %>%
  group_by(cohort_qtr) %>%
  summarise(
    .groups = "drop",
    
    lifetimes = list(obs_lifetime)
    ) %>%
  mutate(
    exp_fit    = map(lifetimes, MASS::fitdistr, densfun = "exponential"),
    param_data = map(exp_fit, broom::tidy)
    ) %>%
  select(cohort_qtr, lifetimes, param_data) %>%
  unnest(param_data)

cohort_dropout_est_tbl %>% glimpse()

## Rows: 9
## Columns: 5
## $ cohort_qtr <chr> "2009 Q4", "2010 Q1", "2010 Q2", "2010 Q3", "2010 Q4", "201…
## $ lifetimes  <list> <104.2857143, 96.7142857, 104.2857143, 89.4285714, 94.0000…
## $ term       <chr> "rate", "rate", "rate", "rate", "rate", "rate", "rate", "ra…
## $ estimate   <dbl> 0.01180524, 0.01468272, 0.01792846, 0.02185343, 0.02433965,…
## $ std.error  <dbl> 0.0004011582, 0.0004636136, 0.0007177127, 0.0010688864, 0.0…

cohort_dropout_est_tbl %>% datatable()

10 Output Cleaned Data

Finally we output the various datasets we have constructed to disks.

stock_description_tbl %>% write_rds("data/stock_description_tbl.rds")

returns_lookup_tbl %>% write_rds("data/returns_lookup_tbl.rds")

cleaned_data_tbl %>% write_rds("data/retail_data_cleaned_tbl.rds")

customer_cohort_tbl %>% write_rds("data/customer_cohort_tbl.rds")

daily_spend_invoice_tbl %>% write_rds("data/daily_spend_invoice_tbl.rds")

daily_spend_tbl %>% write_rds("data/daily_spend_tbl.rds")

11 R Environment

sessioninfo::session_info()

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value                       
##  version  R version 4.1.1 (2021-08-10)
##  os       Ubuntu 20.04.3 LTS          
##  system   x86_64, linux-gnu           
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  ctype    en_US.UTF-8                 
##  tz       Etc/UTC                     
##  date     2021-12-06                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package              * version   date       lib source        
##  assertthat             0.2.1     2019-03-21 [1] RSPM (R 4.1.0)
##  backports              1.3.0     2021-10-27 [1] RSPM (R 4.1.0)
##  bookdown               0.24      2021-09-02 [1] RSPM (R 4.1.0)
##  broom                  0.7.9     2021-07-27 [1] RSPM (R 4.1.0)
##  bslib                  0.3.1     2021-10-06 [1] RSPM (R 4.1.0)
##  cachem                 1.0.6     2021-08-19 [1] RSPM (R 4.1.0)
##  cellranger             1.1.0     2016-07-27 [1] RSPM (R 4.1.0)
##  cli                    3.1.0     2021-10-27 [1] RSPM (R 4.1.0)
##  codetools              0.2-18    2020-11-04 [2] CRAN (R 4.1.1)
##  colorspace             2.0-2     2021-06-24 [1] RSPM (R 4.1.0)
##  conflicted           * 1.0.4     2019-06-21 [1] RSPM (R 4.1.0)
##  cowplot              * 1.1.1     2020-12-30 [1] RSPM (R 4.1.0)
##  crayon                 1.4.1     2021-02-08 [1] RSPM (R 4.1.0)
##  crosstalk              1.1.1     2021-01-12 [1] RSPM (R 4.1.0)
##  curl                   4.3.2     2021-06-23 [1] RSPM (R 4.1.0)
##  data.table             1.14.2    2021-09-27 [1] RSPM (R 4.1.0)
##  DataExplorer         * 0.8.2     2020-12-15 [1] RSPM (R 4.1.0)
##  DBI                    1.1.1     2021-01-15 [1] RSPM (R 4.1.0)
##  dbplyr                 2.1.1     2021-04-06 [1] RSPM (R 4.1.0)
##  digest                 0.6.28    2021-09-23 [1] RSPM (R 4.1.0)
##  directlabels         * 2021.1.13 2021-01-16 [1] RSPM (R 4.1.0)
##  dplyr                * 1.0.7     2021-06-18 [1] RSPM (R 4.1.0)
##  DT                   * 0.19      2021-09-02 [1] RSPM (R 4.1.0)
##  ellipsis               0.3.2     2021-04-29 [1] RSPM (R 4.1.0)
##  evaluate               0.14      2019-05-28 [1] RSPM (R 4.1.0)
##  evir                 * 1.7-4     2018-03-20 [1] RSPM (R 4.1.0)
##  fansi                  0.5.0     2021-05-25 [1] RSPM (R 4.1.0)
##  farver                 2.1.0     2021-02-28 [1] RSPM (R 4.1.0)
##  fastmap                1.1.0     2021-01-25 [1] RSPM (R 4.1.0)
##  forcats              * 0.5.1     2021-01-27 [1] RSPM (R 4.1.0)
##  fs                   * 1.5.0     2020-07-31 [1] RSPM (R 4.1.0)
##  furrr                * 0.2.3     2021-06-25 [1] RSPM (R 4.1.0)
##  future               * 1.22.1    2021-08-25 [1] RSPM (R 4.1.0)
##  generics               0.1.1     2021-10-25 [1] RSPM (R 4.1.0)
##  ggplot2              * 3.3.5     2021-06-25 [1] RSPM (R 4.1.0)
##  globals                0.14.0    2020-11-22 [1] RSPM (R 4.1.0)
##  glue                 * 1.4.2     2020-08-27 [1] RSPM (R 4.1.0)
##  gridExtra              2.3       2017-09-09 [1] RSPM (R 4.1.0)
##  gtable                 0.3.0     2019-03-25 [1] RSPM (R 4.1.0)
##  haven                  2.4.3     2021-08-04 [1] RSPM (R 4.1.0)
##  highr                  0.9       2021-04-16 [1] RSPM (R 4.1.0)
##  hms                    1.1.1     2021-09-26 [1] RSPM (R 4.1.0)
##  htmltools              0.5.2     2021-08-25 [1] RSPM (R 4.1.0)
##  htmlwidgets            1.5.4     2021-09-08 [1] RSPM (R 4.1.0)
##  httr                   1.4.2     2020-07-20 [1] RSPM (R 4.1.0)
##  igraph                 1.2.7     2021-10-15 [1] RSPM (R 4.1.0)
##  jquerylib              0.1.4     2021-04-26 [1] RSPM (R 4.1.0)
##  jsonlite               1.7.2     2020-12-09 [1] RSPM (R 4.1.0)
##  knitr                  1.36      2021-09-29 [1] RSPM (R 4.1.0)
##  labeling               0.4.2     2020-10-20 [1] RSPM (R 4.1.0)
##  lattice                0.20-44   2021-05-02 [2] CRAN (R 4.1.1)
##  lifecycle              1.0.1     2021-09-24 [1] RSPM (R 4.1.0)
##  listenv                0.8.0     2019-12-05 [1] RSPM (R 4.1.0)
##  lubridate            * 1.8.0     2021-10-07 [1] RSPM (R 4.1.0)
##  magrittr             * 2.0.1     2020-11-17 [1] RSPM (R 4.1.0)
##  MASS                   7.3-54    2021-05-03 [2] CRAN (R 4.1.1)
##  modelr                 0.1.8     2020-05-19 [1] RSPM (R 4.1.0)
##  munsell                0.5.0     2018-06-12 [1] RSPM (R 4.1.0)
##  networkD3              0.4       2017-03-18 [1] RSPM (R 4.1.0)
##  parallelly             1.28.1    2021-09-09 [1] RSPM (R 4.1.0)
##  PerformanceAnalytics * 2.0.4     2020-02-06 [1] RSPM (R 4.1.0)
##  pillar                 1.6.4     2021-10-18 [1] RSPM (R 4.1.0)
##  pkgconfig              2.0.3     2019-09-22 [1] RSPM (R 4.1.0)
##  purrr                * 0.3.4     2020-04-17 [1] RSPM (R 4.1.0)
##  quadprog               1.5-8     2019-11-20 [1] RSPM (R 4.1.0)
##  Quandl                 2.11.0    2021-08-11 [1] RSPM (R 4.1.0)
##  quantmod             * 0.4.18    2020-12-09 [1] RSPM (R 4.1.0)
##  R6                     2.5.1     2021-08-19 [1] RSPM (R 4.1.0)
##  Rcpp                   1.0.7     2021-07-07 [1] RSPM (R 4.1.0)
##  readr                * 2.0.2     2021-09-27 [1] RSPM (R 4.1.0)
##  readxl                 1.3.1     2019-03-13 [1] RSPM (R 4.1.0)
##  reprex                 2.0.1     2021-08-05 [1] RSPM (R 4.1.0)
##  rlang                * 0.4.12    2021-10-18 [1] RSPM (R 4.1.0)
##  rmarkdown              2.11      2021-09-14 [1] RSPM (R 4.1.0)
##  rmdformats             1.0.3     2021-10-06 [1] RSPM (R 4.1.0)
##  rstudioapi             0.13      2020-11-12 [1] RSPM (R 4.1.0)
##  rvest                  1.0.2     2021-10-16 [1] RSPM (R 4.1.0)
##  sass                   0.4.0     2021-05-12 [1] RSPM (R 4.1.0)
##  scales               * 1.1.1     2020-05-11 [1] RSPM (R 4.1.0)
##  sessioninfo            1.1.1     2018-11-05 [1] RSPM (R 4.1.0)
##  snakecase            * 0.11.0    2019-05-25 [1] RSPM (R 4.1.0)
##  stringi                1.7.5     2021-10-04 [1] RSPM (R 4.1.0)
##  stringr              * 1.4.0     2019-02-10 [1] RSPM (R 4.1.0)
##  tibble               * 3.1.5     2021-09-30 [1] RSPM (R 4.1.0)
##  tidyquant            * 1.0.3     2021-03-05 [1] RSPM (R 4.1.0)
##  tidyr                * 1.1.4     2021-09-27 [1] RSPM (R 4.1.0)
##  tidyselect             1.1.1     2021-04-30 [1] RSPM (R 4.1.0)
##  tidyverse            * 1.3.1     2021-04-15 [1] RSPM (R 4.1.0)
##  TTR                  * 0.24.2    2020-09-01 [1] RSPM (R 4.1.0)
##  tzdb                   0.2.0     2021-10-27 [1] RSPM (R 4.1.0)
##  utf8                   1.2.2     2021-07-24 [1] RSPM (R 4.1.0)
##  vctrs                * 0.3.8     2021-04-29 [1] RSPM (R 4.1.0)
##  withr                  2.4.2     2021-04-18 [1] RSPM (R 4.1.0)
##  xfun                   0.27      2021-10-18 [1] RSPM (R 4.1.0)
##  xml2                   1.3.2     2020-04-23 [1] RSPM (R 4.1.0)
##  xts                  * 0.12.1    2020-09-09 [1] RSPM (R 4.1.0)
##  yaml                   2.2.1     2020-02-01 [1] RSPM (R 4.1.0)
##  zoo                  * 1.8-9     2021-03-09 [1] RSPM (R 4.1.0)
## 
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library