Parts Inventory - Applied Machine Learning

This dataset was originally posted on Kaggle. The key task is to predict whether a product/part will go on backorder.

Product backorder may be the result of strong sales performance (e.g., the product is in such high demand that production cannot keep up with sales). However, backorders can upset consumers, lead to canceled orders, and decrease customer loyalty. Companies want to avoid backorders, but also to avoid overstocking every product (which leads to higher inventory costs).

This dataset has ~1.9 million observations of products/parts over an 8-week period. The source of the data is unreferenced.

  • Outcome: whether the product went on backorder
  • Predictors: Current inventory, sales history, forecasted sales, recommended stocking amount, product risk flags, etc. (22 predictors in total)

The features and the target variable of the dataset are as follows:

Description

# Features: 
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag

# Target 
went_on_backorder - Product actually went on backorder

Overview / Roadmap

General steps:

  • Part 1: Data Carpentry & Exploratory Data Analysis

  • Part 2: Training and Validation

    • Anomaly detection
    • Dimensionality reduction
    • Classification
    • Train chosen model with full training data
  • Part 3: Testing and Evaluation

    • Apply model to testing data
    • Evaluate model performance

Part 1: Data Carpentry & Exploratory Data Analysis

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, sys
import itertools
import numpy as np
import pandas as pd

Load dataset

In [2]:
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)

# Load and shuffle
dataset = pd.read_csv(DATASET).sample(frac = 1).reset_index(drop=True)

dataset.head().transpose()
/opt/conda/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3058: DtypeWarning: Columns (0) have mixed types.Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[2]:
0 1 2 3 4
sku 1132159 2938083 2120328 2923968 1607861
national_inv 155.0 20.0 11189.0 49.0 21.0
lead_time 8.0 2.0 2.0 8.0 12.0
in_transit_qty 46.0 0.0 4020.0 1.0 0.0
forecast_3_month 100.0 0.0 18000.0 21.0 14.0
forecast_6_month 200.0 0.0 34500.0 56.0 14.0
forecast_9_month 200.0 0.0 52000.0 84.0 28.0
sales_1_month 16.0 0.0 5098.0 22.0 5.0
sales_3_month 52.0 0.0 16231.0 43.0 10.0
sales_6_month 110.0 0.0 32603.0 72.0 22.0
sales_9_month 205.0 0.0 44111.0 110.0 33.0
min_bank 49.0 0.0 4892.0 15.0 0.0
potential_issue No No No No No
pieces_past_due 0.0 0.0 0.0 0.0 0.0
perf_6_month_avg 1.0 0.8 1.0 0.83 0.93
perf_12_month_avg 0.98 0.64 1.0 0.83 0.93
local_bo_qty 0.0 0.0 0.0 0.0 0.0
deck_risk No Yes No No No
oe_constraint No No No No No
ppap_risk No No Yes No No
stop_auto_buy Yes Yes Yes Yes Yes
rev_stop No No No No No
went_on_backorder No No No No No
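
The DtypeWarning above comes from the sku column, which mixes numeric-looking and string values. A minimal sketch to silence it, assuming the same file path, is to declare the column's dtype at load time:

# Sketch: declare sku as a string up front so pandas does not have to guess.
dataset = pd.read_csv(DATASET, dtype={'sku': str}, low_memory=False)
dataset = dataset.sample(frac=1).reset_index(drop=True)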

Review dataset

In [3]:
dataset.describe().transpose()
Out[3]:
count mean std min 25% 50% 75% max
national_inv 1687860.0 496.111782 29615.233831 -27256.0 4.00 15.00 80.00 12334404.0
lead_time 1586967.0 7.872267 7.056024 0.0 4.00 8.00 9.00 52.0
in_transit_qty 1687860.0 44.052022 1342.741731 0.0 0.00 0.00 0.00 489408.0
forecast_3_month 1687860.0 178.119284 5026.553102 0.0 0.00 0.00 4.00 1427612.0
forecast_6_month 1687860.0 344.986664 9795.151861 0.0 0.00 0.00 12.00 2461360.0
forecast_9_month 1687860.0 506.364431 14378.923562 0.0 0.00 0.00 20.00 3777304.0
sales_1_month 1687860.0 55.926069 1928.195879 0.0 0.00 0.00 4.00 741774.0
sales_3_month 1687860.0 175.025930 5192.377625 0.0 0.00 1.00 15.00 1105478.0
sales_6_month 1687860.0 341.728839 9613.167104 0.0 0.00 2.00 31.00 2146625.0
sales_9_month 1687860.0 525.269701 14838.613523 0.0 0.00 4.00 47.00 3205172.0
min_bank 1687860.0 52.772303 1254.983089 0.0 0.00 0.00 3.00 313319.0
pieces_past_due 1687860.0 2.043724 236.016500 0.0 0.00 0.00 0.00 146496.0
perf_6_month_avg 1687860.0 -6.872059 26.556357 -99.0 0.63 0.82 0.97 1.0
perf_12_month_avg 1687860.0 -6.437947 25.843331 -99.0 0.66 0.81 0.95 1.0
local_bo_qty 1687860.0 0.626451 33.722242 0.0 0.00 0.00 0.00 12530.0
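
Note the -99 minimums in perf_6_month_avg and perf_12_month_avg (and the negative minimum of national_inv). Before recoding, a quick sketch to count how many rows carry the -99 sentinel:

# Count rows using the -99 sentinel in each performance column.
print((dataset[['perf_6_month_avg', 'perf_12_month_avg']] == -99).sum())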

Recode missing values

In [4]:
# Assuming -99 in the perf_6_month_avg and perf_12_month_avg columns
# represents a missing value, replace it with the mean of the
# non-missing values in that column.

z6 = dataset['perf_6_month_avg'].loc[dataset['perf_6_month_avg'] != -99]
dataset['perf_6_month_avg'].replace(-99, z6.mean(), inplace=True)

z12 = dataset['perf_12_month_avg'].loc[dataset['perf_12_month_avg'] != -99]
dataset['perf_12_month_avg'].replace(-99, z12.mean(), inplace=True)
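
An equivalent, loop-based sketch that avoids repeating the logic per column (same assumption that -99 marks a missing value):

# Replace the -99 sentinel with the mean of the valid values, per column.
for col in ['perf_6_month_avg', 'perf_12_month_avg']:
    valid_mean = dataset.loc[dataset[col] != -99, col].mean()
    dataset[col] = dataset[col].replace(-99, valid_mean)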
In [5]:
dataset.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1687861 entries, 0 to 1687860
Data columns (total 23 columns):
 #   Column             Non-Null Count    Dtype  
---  ------             --------------    -----  
 0   sku                1687861 non-null  object 
 1   national_inv       1687860 non-null  float64
 2   lead_time          1586967 non-null  float64
 3   in_transit_qty     1687860 non-null  float64
 4   forecast_3_month   1687860 non-null  float64
 5   forecast_6_month   1687860 non-null  float64
 6   forecast_9_month   1687860 non-null  float64
 7   sales_1_month      1687860 non-null  float64
 8   sales_3_month      1687860 non-null  float64
 9   sales_6_month      1687860 non-null  float64
 10  sales_9_month      1687860 non-null  float64
 11  min_bank           1687860 non-null  float64
 12  potential_issue    1687860 non-null  object 
 13  pieces_past_due    1687860 non-null  float64
 14  perf_6_month_avg   1687860 non-null  float64
 15  perf_12_month_avg  1687860 non-null  float64
 16  local_bo_qty       1687860 non-null  float64
 17  deck_risk          1687860 non-null  object 
 18  oe_constraint      1687860 non-null  object 
 19  ppap_risk          1687860 non-null  object 
 20  stop_auto_buy      1687860 non-null  object 
 21  rev_stop           1687860 non-null  object 
 22  went_on_backorder  1687860 non-null  object 
dtypes: float64(15), object(8)
memory usage: 296.2+ MB
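
The frame occupies roughly 300 MB. If memory becomes a concern, one optional sketch (not applied here) is to downcast the float64 columns to float32:

# Optional: downcast floats to shrink the in-memory footprint.
for col in dataset.select_dtypes('float64').columns:
    dataset[col] = pd.to_numeric(dataset[col], downcast='float')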
In [6]:
dataset.iloc[:3,:6]
Out[6]:
sku national_inv lead_time in_transit_qty forecast_3_month forecast_6_month
0 1132159 155.0 8.0 46.0 100.0 200.0
1 2938083 20.0 2.0 0.0 0.0 0.0
2 2120328 11189.0 2.0 4020.0 18000.0 34500.0
In [7]:
dataset.iloc[:3,6:12]
Out[7]:
forecast_9_month sales_1_month sales_3_month sales_6_month sales_9_month min_bank
0 200.0 16.0 52.0 110.0 205.0 49.0
1 0.0 0.0 0.0 0.0 0.0 0.0
2 52000.0 5098.0 16231.0 32603.0 44111.0 4892.0
In [8]:
dataset.iloc[:3,12:18]
Out[8]:
potential_issue pieces_past_due perf_6_month_avg perf_12_month_avg local_bo_qty deck_risk
0 No 0.0 1.0 0.98 0.0 No
1 No 0.0 0.8 0.64 0.0 Yes
2 No 0.0 1.0 1.00 0.0 No
In [9]:
dataset.iloc[:3,18:24]
Out[9]:
oe_constraint ppap_risk stop_auto_buy rev_stop went_on_backorder
0 No No Yes No No
1 No No Yes No No
2 No Yes Yes No No

Drop columns that are irrelevant or not processable

In [10]:
# Dropping the product id column (sku) and minimum recommended amount of stock column (min_bank).
dataset = dataset.iloc[:,np.r_[1:11,12:23]]
dataset.iloc[:3,:]
Out[10]:
national_inv lead_time in_transit_qty forecast_3_month forecast_6_month forecast_9_month sales_1_month sales_3_month sales_6_month sales_9_month ... pieces_past_due perf_6_month_avg perf_12_month_avg local_bo_qty deck_risk oe_constraint ppap_risk stop_auto_buy rev_stop went_on_backorder
0 155.0 8.0 46.0 100.0 200.0 200.0 16.0 52.0 110.0 205.0 ... 0.0 1.0 0.98 0.0 No No No Yes No No
1 20.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.8 0.64 0.0 Yes No No Yes No No
2 11189.0 2.0 4020.0 18000.0 34500.0 52000.0 5098.0 16231.0 32603.0 44111.0 ... 0.0 1.0 1.00 0.0 No No Yes Yes No No

3 rows × 21 columns

Find unique values of string columns

Making sure the Yes/No columns really contain only Yes or No, then converting them to binary (0s and 1s).

In [11]:
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype!=np.float64, dataset.columns))
print(yes_no_columns)

print('potential_issue', dataset['potential_issue'].unique())
print('deck_risk', dataset['deck_risk'].unique())
print('oe_constraint', dataset['oe_constraint'].unique())
print('ppap_risk', dataset['ppap_risk'].unique())
print('stop_auto_buy', dataset['stop_auto_buy'].unique())
print('rev_stop', dataset['rev_stop'].unique())
print('went_on_backorder', dataset['went_on_backorder'].unique())
['potential_issue', 'deck_risk', 'oe_constraint', 'ppap_risk', 'stop_auto_buy', 'rev_stop', 'went_on_backorder']
potential_issue ['No' 'Yes' nan]
deck_risk ['No' 'Yes' nan]
oe_constraint ['No' 'Yes' nan]
ppap_risk ['No' 'Yes' nan]
stop_auto_buy ['Yes' 'No' nan]
rev_stop ['No' 'Yes' nan]
went_on_backorder ['No' 'Yes' nan]
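
An equivalent way to collect the string columns, sketched with pandas' own dtype selection instead of a filter over dtypes:

# Same result as the filter above: all object-typed (string) columns.
yes_no_columns = dataset.select_dtypes(include='object').columns.tolist()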

Fill NaN values with the most common value (the mode) of each column.

In [12]:
for column_name in yes_no_columns:
    mode = dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    dataset[column_name].fillna(mode, inplace=True)
Filling missing values of potential_issue with No
Filling missing values of deck_risk with No
Filling missing values of oe_constraint with No
Filling missing values of ppap_risk with No
Filling missing values of stop_auto_buy with Yes
Filling missing values of rev_stop with No
Filling missing values of went_on_backorder with No

Convert yes/no columns into binary (0s and 1s)

In [13]:
categories = ['potential_issue',
              'deck_risk',
              'oe_constraint',
              'ppap_risk',
              'stop_auto_buy',
              'rev_stop',
              'went_on_backorder']
In [14]:
# Apply one-hot encoder to each column with categorical data
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='error',sparse=False)

for c in categories:
    print(c)
    datat = pd.DataFrame(enc.fit_transform(dataset[[c]]))
    datat.rename(columns={0:c+"_0",1:c+"_1"}, inplace=True) 
    datat = datat.drop(c+"_0", axis=1)
    dataset = pd.concat([dataset, datat], axis=1)
potential_issue
deck_risk
oe_constraint
ppap_risk
stop_auto_buy
rev_stop
went_on_backorder
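
Because every one of these columns is strictly binary (Yes/No), a simpler alternative to OneHotEncoder would be a direct mapping; a sketch of that approach (not used here):

# Alternative sketch: map Yes/No straight to 1.0/0.0, one new column per flag.
for c in categories:
    dataset[c + '_1'] = dataset[c].map({'Yes': 1.0, 'No': 0.0})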

Visual inspection of the one-hot encoded columns next to the original columns, then removing the original columns

In [15]:
dataset.iloc[:4,np.r_[10,16:28]]
Out[15]:
potential_issue oe_constraint ppap_risk stop_auto_buy rev_stop went_on_backorder potential_issue_1 deck_risk_1 oe_constraint_1 ppap_risk_1 stop_auto_buy_1 rev_stop_1 went_on_backorder_1
0 No No No Yes No No 0.0 0.0 0.0 0.0 1.0 0.0 0.0
1 No No No Yes No No 0.0 1.0 0.0 0.0 1.0 0.0 0.0
2 No No Yes Yes No No 0.0 0.0 0.0 1.0 1.0 0.0 0.0
3 No No No Yes No No 0.0 0.0 0.0 0.0 1.0 0.0 0.0
In [16]:
# Drop columns replaced with one-hot encoding from dataframe.
dataset = dataset.drop(categories, axis=1)
In [17]:
# Impute missing values in the lead_time column with its mode.
dataset["lead_time"].fillna(dataset["lead_time"].mode()[0], inplace=True)
In [18]:
# Drop records with missing data.
dataset = dataset.dropna()
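
A quick sketch to confirm that no missing values survive the imputation and the dropna:

# Sanity check: every cell should now be populated.
assert dataset.isna().sum().sum() == 0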
In [19]:
# View dataset info.
dataset.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1687860 entries, 0 to 1687860
Data columns (total 21 columns):
 #   Column               Non-Null Count    Dtype  
---  ------               --------------    -----  
 0   national_inv         1687860 non-null  float64
 1   lead_time            1687860 non-null  float64
 2   in_transit_qty       1687860 non-null  float64
 3   forecast_3_month     1687860 non-null  float64
 4   forecast_6_month     1687860 non-null  float64
 5   forecast_9_month     1687860 non-null  float64
 6   sales_1_month        1687860 non-null  float64
 7   sales_3_month        1687860 non-null  float64
 8   sales_6_month        1687860 non-null  float64
 9   sales_9_month        1687860 non-null  float64
 10  pieces_past_due      1687860 non-null  float64
 11  perf_6_month_avg     1687860 non-null  float64
 12  perf_12_month_avg    1687860 non-null  float64
 13  local_bo_qty         1687860 non-null  float64
 14  potential_issue_1    1687860 non-null  float64
 15  deck_risk_1          1687860 non-null  float64
 16  oe_constraint_1      1687860 non-null  float64
 17  ppap_risk_1          1687860 non-null  float64
 18  stop_auto_buy_1      1687860 non-null  float64
 19  rev_stop_1           1687860 non-null  float64
 20  went_on_backorder_1  1687860 non-null  float64
dtypes: float64(21)
memory usage: 283.3 MB

Create a balanced sample of the data at a more manageable size for cross-validation

In [20]:
num_backorder = np.sum(dataset['went_on_backorder_1']==1)
print('backorder ratio:', num_backorder, '/', len(dataset), '=', num_backorder / len(dataset))
backorder ratio: 11293 / 1687860 = 0.006690720794378681
In [21]:
# Splitting data into X and y for sampling.
from sklearn.model_selection import train_test_split

# Create arrays.
y = np.array(dataset.went_on_backorder_1)
X = np.array(dataset.drop('went_on_backorder_1', axis=1))

# Split into two subsets so a smaller sample is available for processing in Part 2.
X_50, X_40, y_50, y_40  = train_test_split(X, y, test_size=0.4, random_state=27)
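
A variation worth noting: passing stratify=y to train_test_split would keep the backorder ratio exactly equal in both subsets, rather than approximately (a sketch, not what was run above):

# Stratified variant: both splits keep the minority ratio identical.
X_50, X_40, y_50, y_40 = train_test_split(X, y, test_size=0.4, random_state=27, stratify=y)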
In [22]:
# Evaluating ratios of minority cases in datasets.
num_backorder_100 = np.sum(dataset['went_on_backorder_1']==1)

print('backorder ratio:', num_backorder_100, '/', len(dataset), '=', num_backorder_100 / len(dataset))
print('backorder ratio:', sum(y_40), '/', y_40.shape, '=', sum(y_40) / y_40.shape )
print('backorder ratio:', sum(y_50), '/', y_50.shape, '=', sum(y_50) / y_50.shape )
backorder ratio: 11293 / 1687860 = 0.006690720794378681
backorder ratio: 4599.0 / (675144,) = [0.00681188]
backorder ratio: 6694.0 / (1012716,) = [0.00660995]
In [23]:
# Clearing memory: release the full dataset and arrays.
del dataset, X, y
In [24]:
# Using Synthetic Minority Oversampling Technique (SMOTE) to balance training data.
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_40, y_40= sm.fit_resample(X_40, y_40)
X_50, y_50= sm.fit_resample(X_50, y_50)
In [26]:
# Evaluating ratios of minority cases in balanced datasets.
print('backorder ratio:', sum(y_40), '/', y_40.shape, '=', sum(y_40) / y_40.shape )
print('backorder ratio:', sum(y_50), '/', y_50.shape, '=', sum(y_50) / y_50.shape )
backorder ratio: 670545.0 / (1341090,) = [0.5]
backorder ratio: 1006022.0 / (2012044,) = [0.5]
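
Caveat: oversampling before cross-validation lets synthetic points leak across folds. A minimal sketch of the safer pattern uses imblearn's Pipeline so SMOTE is refit on each training fold (LogisticRegression here is just a placeholder classifier):

# SMOTE inside a pipeline is applied only to each training fold during CV.
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([('smote', SMOTE()), ('clf', LogisticRegression(max_iter=1000))])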
In [27]:
# Write the balanced samples to a local file
# ----------------------------------
import joblib

joblib.dump([X_50, y_50, X_40, y_40], '../data/sample-data-v1.pkl')
Out[27]:
['../data/sample-data-v1.pkl']
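
In Part 2, the samples can be restored with a matching joblib.load (a sketch, assuming the same path):

# Restore the balanced samples written above.
X_50, y_50, X_40, y_40 = joblib.load('../data/sample-data-v1.pkl')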