This dataset was originally posted on Kaggle. The task is to predict whether a product/part will go on backorder.
A product backorder may be the result of strong sales performance (e.g., demand is so high that production cannot keep up with sales). However, backorders can upset customers and lead to canceled orders and decreased customer loyalty. Companies want to avoid backorders, but they also want to avoid overstocking every product, which drives up inventory costs.
The dataset has ~1.9 million observations of products/parts over an 8-week period. The original source of the data is unreferenced.
The features and the target variable of the dataset are as follows:
# Features:
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag
# Target
went_on_backorder - Product actually went on backorder
General steps:
Part 1: Dataset Carpentry & Exploratory Data Analysis
Part 2: Training and Validation
Part 3: Testing and Evaluation
%matplotlib inline
import matplotlib.pyplot as plt
import os, sys
import itertools
import numpy as np
import pandas as pd
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)
# Load the CSV and shuffle the rows.
dataset = pd.read_csv(DATASET).sample(frac=1).reset_index(drop=True)
dataset.head().transpose()
dataset.describe().transpose()
# Assuming -99 in the perf_6_month_avg and perf_12_month_avg columns
# represents a missing value, replace it with the mean of the
# non-missing values in each column.
z6 = dataset.loc[dataset['perf_6_month_avg'] != -99, 'perf_6_month_avg']
dataset['perf_6_month_avg'] = dataset['perf_6_month_avg'].replace(-99, z6.mean())
z12 = dataset.loc[dataset['perf_12_month_avg'] != -99, 'perf_12_month_avg']
dataset['perf_12_month_avg'] = dataset['perf_12_month_avg'].replace(-99, z12.mean())
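A quick assertion confirms that no -99 sentinels remain (this check assumes the sentinel convention noted above):
# Sanity check: no -99 sentinel values should remain after the replacement.
assert (dataset[['perf_6_month_avg', 'perf_12_month_avg']] == -99).sum().sum() == 0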
dataset.info()
dataset.iloc[:3,:6]
dataset.iloc[:3,6:12]
dataset.iloc[:3,12:18]
dataset.iloc[:3,18:24]
# Drop the product ID column (sku, position 0) and the minimum recommended
# stock column (min_bank, position 11).
dataset = dataset.iloc[:, np.r_[1:11, 12:23]]
dataset.iloc[:3,:]
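An equivalent, name-based drop would be less brittle than positional indexing (a sketch assuming the standard Kaggle column order; left commented out so the columns are not dropped twice):
# dataset = dataset.drop(columns=['sku', 'min_bank'])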
Make sure the Yes/No columns really contain only Yes or No, then convert them to binary indicators (0s and 1s).
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype != np.float64, dataset.columns))
print(yes_no_columns)
for col in yes_no_columns:
    print(col, dataset[col].unique())
for column_name in yes_no_columns:
    mode = dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    dataset[column_name] = dataset[column_name].fillna(mode)
categories = ['potential_issue',
              'deck_risk',
              'oe_constraint',
              'ppap_risk',
              'stop_auto_buy',
              'rev_stop',
              'went_on_backorder']
# Apply a one-hot encoder to each Yes/No column. Categories sort
# alphabetically ('No' < 'Yes'), so encoded column 1 is the 'Yes' indicator.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='error', sparse=False)  # on scikit-learn >= 1.2, use sparse_output=False
for c in categories:
    print(c)
    datat = pd.DataFrame(enc.fit_transform(dataset[[c]]))
    datat.rename(columns={0: c + "_0", 1: c + "_1"}, inplace=True)
    # Keep only the 'Yes' indicator column.
    datat = datat.drop(c + "_0", axis=1)
    dataset = pd.concat([dataset, datat], axis=1)
dataset.iloc[:4,np.r_[10,16:28]]
# Drop columns replaced with one-hot encoding from dataframe.
dataset = dataset.drop(categories, axis=1)
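For reference, pandas can produce the same 'Yes' indicators in one step; this sketch would replace the encoder loop and the drop above, though the new columns would be named like went_on_backorder_Yes rather than went_on_backorder_1:
# One-step alternative (drop_first=True drops the alphabetically-first 'No' level):
# dataset = pd.get_dummies(dataset, columns=categories, drop_first=True)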
# Impute missing values in the lead_time column with its mode.
dataset["lead_time"] = dataset["lead_time"].fillna(dataset["lead_time"].mode()[0])
# Drop records with missing data.
dataset = dataset.dropna()
# View dataset info.
dataset.info()
num_backorder = np.sum(dataset['went_on_backorder_1']==1)
print('backorder ratio:', num_backorder, '/', len(dataset), '=', num_backorder / len(dataset))
# Splitting data into X and y for sampling.
from sklearn.model_selection import train_test_split
# Create arrays.
y = np.array(dataset.went_on_backorder_1)
X = np.array(dataset.drop('went_on_backorder_1', axis=1))
# Split into two subsets so a smaller sample is available for processing in
# Part 2 (with test_size=0.4, X_50 holds 60% of the rows and X_40 holds 40%).
X_50, X_40, y_50, y_40 = train_test_split(X, y, test_size=0.4, random_state=27)
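A quick shape check confirms the split proportions:
# Sanity check: the first split should hold ~60% of the rows, the second ~40%.
print('X_50:', X_50.shape, ' X_40:', X_40.shape)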
# Evaluating the minority-class ratio in the full dataset and in each split.
num_backorder_100 = np.sum(dataset['went_on_backorder_1'] == 1)
print('backorder ratio:', num_backorder_100, '/', len(dataset), '=', num_backorder_100 / len(dataset))
print('backorder ratio:', y_40.sum(), '/', len(y_40), '=', y_40.sum() / len(y_40))
print('backorder ratio:', y_50.sum(), '/', len(y_50), '=', y_50.sum() / len(y_50))
# Free the memory held by the full dataset and arrays.
del dataset, X, y
# Use the Synthetic Minority Oversampling Technique (SMOTE) to balance the
# classes by oversampling the minority (backorder) class. Note that in Parts
# 2 and 3, resampling should be fit only on training folds so synthetic
# points do not leak into validation or test data.
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_40, y_40 = sm.fit_resample(X_40, y_40)
X_50, y_50 = sm.fit_resample(X_50, y_50)
# Evaluating the minority-class ratio in the balanced datasets (should now be ~0.5).
print('backorder ratio:', y_40.sum(), '/', len(y_40), '=', y_40.sum() / len(y_40))
print('backorder ratio:', y_50.sum(), '/', len(y_50), '=', y_50.sum() / len(y_50))
# Write the balanced samples to a local file for use in Part 2.
# ----------------------------------
import joblib
joblib.dump([X_50, y_50, X_40, y_40], '../data/sample-data-v1.pkl')
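Part 2 can then reload the samples from the same relative path (a sketch, assuming the directory layout above):
# X_50, y_50, X_40, y_40 = joblib.load('../data/sample-data-v1.pkl')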