This dataset was originally posted on Kaggle. The task is to predict whether a product/part will go on backorder.
A product backorder may be the result of strong sales performance (e.g., demand is so high that production cannot keep up with sales). However, backorders can upset customers and lead to canceled orders and decreased customer loyalty. Companies want to avoid backorders, but they also want to avoid overstocking every product, which drives up inventory costs.
The dataset has ~1.9 million observations of products/parts over an 8-week period. The original source of the data is unreferenced.
The features and the target variable of the dataset are as follows:
# Features:
sku - Random ID for the product
national_inv - Current inventory level for the part
lead_time - Transit time for product (if available)
in_transit_qty - Amount of product in transit from source
forecast_3_month - Forecast sales for the next 3 months
forecast_6_month - Forecast sales for the next 6 months
forecast_9_month - Forecast sales for the next 9 months
sales_1_month - Sales quantity for the prior 1 month time period
sales_3_month - Sales quantity for the prior 3 month time period
sales_6_month - Sales quantity for the prior 6 month time period
sales_9_month - Sales quantity for the prior 9 month time period
min_bank - Minimum recommended amount to stock
potential_issue - Source issue for part identified
pieces_past_due - Parts overdue from source
perf_6_month_avg - Source performance for prior 6 month period
perf_12_month_avg - Source performance for prior 12 month period
local_bo_qty - Amount of stock orders overdue
deck_risk - Part risk flag
oe_constraint - Part risk flag
ppap_risk - Part risk flag
stop_auto_buy - Part risk flag
rev_stop - Part risk flag
# Target
went_on_backorder - Product actually went on backorder
General steps:
Part 1: Dataset Carpentry & Exploratory Data Analysis
Part 2: Training and Validation
Part 3: Testing and Evaluation
%matplotlib inline
import matplotlib.pyplot as plt
import os, sys
import itertools
import numpy as np
import pandas as pd
DATASET = '/dsa/data/all_datasets/back_order/Kaggle_Training_Dataset_v2.csv'
assert os.path.exists(DATASET)
# Load the CSV and shuffle the rows.
dataset = pd.read_csv(DATASET).sample(frac=1).reset_index(drop=True)
dataset.head().transpose()
dataset.describe().transpose()
# Assuming -99 in the perf_6_month_avg and perf_12_month_avg columns
# represents a missing value, replace it with the mean of the
# non-missing values in each column.
z6 = dataset.loc[dataset['perf_6_month_avg'] != -99, 'perf_6_month_avg']
dataset['perf_6_month_avg'] = dataset['perf_6_month_avg'].replace(-99, z6.mean())
z12 = dataset.loc[dataset['perf_12_month_avg'] != -99, 'perf_12_month_avg']
dataset['perf_12_month_avg'] = dataset['perf_12_month_avg'].replace(-99, z12.mean())
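A quick assertion confirms that no -99 sentinels remain (this check assumes the sentinel convention noted above):
# Sanity check: no -99 sentinel values should remain after the replacement.
assert (dataset[['perf_6_month_avg', 'perf_12_month_avg']] == -99).sum().sum() == 0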
dataset.info()
dataset.iloc[:3,:6]
dataset.iloc[:3,6:12]
dataset.iloc[:3,12:18]
dataset.iloc[:3,18:24]
# Drop the product ID column (sku, position 0) and the minimum recommended
# stock column (min_bank, position 11).
dataset = dataset.iloc[:, np.r_[1:11, 12:23]]
dataset.iloc[:3,:]
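An equivalent, name-based drop would be less brittle than positional indexing (a sketch assuming the standard Kaggle column order; left commented out so the columns are not dropped twice):
# dataset = dataset.drop(columns=['sku', 'min_bank'])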
Make sure the Yes/No columns really contain only Yes or No, then convert them to binary indicators (0s and 1s).
# All the column names of these yes/no columns
yes_no_columns = list(filter(lambda i: dataset[i].dtype != np.float64, dataset.columns))
print(yes_no_columns)
for col in yes_no_columns:
    print(col, dataset[col].unique())
for column_name in yes_no_columns:
    mode = dataset[column_name].apply(str).mode()[0]
    print('Filling missing values of {} with {}'.format(column_name, mode))
    dataset[column_name] = dataset[column_name].fillna(mode)
categories = ['potential_issue',
              'deck_risk',
              'oe_constraint',
              'ppap_risk',
              'stop_auto_buy',
              'rev_stop',
              'went_on_backorder']
# Apply a one-hot encoder to each Yes/No column. Categories sort
# alphabetically ('No' < 'Yes'), so encoded column 1 is the 'Yes' indicator.
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='error', sparse=False)  # on scikit-learn >= 1.2, use sparse_output=False
for c in categories:
    print(c)
    datat = pd.DataFrame(enc.fit_transform(dataset[[c]]))
    datat.rename(columns={0: c + "_0", 1: c + "_1"}, inplace=True)
    # Keep only the 'Yes' indicator column.
    datat = datat.drop(c + "_0", axis=1)
    dataset = pd.concat([dataset, datat], axis=1)
dataset.iloc[:4,np.r_[10,16:28]]
# Drop columns replaced with one-hot encoding from dataframe.
dataset = dataset.drop(categories, axis=1)
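For reference, pandas can produce the same 'Yes' indicators in one step; this sketch would replace the encoder loop and the drop above, though the new columns would be named like went_on_backorder_Yes rather than went_on_backorder_1:
# One-step alternative (drop_first=True drops the alphabetically-first 'No' level):
# dataset = pd.get_dummies(dataset, columns=categories, drop_first=True)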
# Impute missing values in the lead_time column with its mode.
dataset["lead_time"] = dataset["lead_time"].fillna(dataset["lead_time"].mode()[0])
# Drop records with missing data.
dataset = dataset.dropna()
# View dataset info.
dataset.info()
num_backorder = np.sum(dataset['went_on_backorder_1']==1)
print('backorder ratio:', num_backorder, '/', len(dataset), '=', num_backorder / len(dataset))
# Splitting data into X and y for sampling.
from sklearn.model_selection import train_test_split
# Create arrays.
y = np.array(dataset.went_on_backorder_1)
X = np.array(dataset.drop('went_on_backorder_1', axis=1))
# Split into two subsets so a smaller sample is available for processing in
# Part 2 (with test_size=0.4, X_50 holds 60% of the rows and X_40 holds 40%).
X_50, X_40, y_50, y_40 = train_test_split(X, y, test_size=0.4, random_state=27)
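A quick shape check confirms the split proportions:
# Sanity check: the first split should hold ~60% of the rows, the second ~40%.
print('X_50:', X_50.shape, ' X_40:', X_40.shape)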
# Evaluating the minority-class ratio in the full dataset and in each split.
num_backorder_100 = np.sum(dataset['went_on_backorder_1'] == 1)
print('backorder ratio:', num_backorder_100, '/', len(dataset), '=', num_backorder_100 / len(dataset))
print('backorder ratio:', y_40.sum(), '/', len(y_40), '=', y_40.sum() / len(y_40))
print('backorder ratio:', y_50.sum(), '/', len(y_50), '=', y_50.sum() / len(y_50))
# Free the memory held by the full dataset and arrays.
del dataset, X, y
# Use the Synthetic Minority Oversampling Technique (SMOTE) to balance the
# classes by oversampling the minority (backorder) class. Note that in Parts
# 2 and 3, resampling should be fit only on training folds so synthetic
# points do not leak into validation or test data.
from imblearn.over_sampling import SMOTE
sm = SMOTE()
X_40, y_40 = sm.fit_resample(X_40, y_40)
X_50, y_50 = sm.fit_resample(X_50, y_50)
# Evaluating the minority-class ratio in the balanced datasets (should now be ~0.5).
print('backorder ratio:', y_40.sum(), '/', len(y_40), '=', y_40.sum() / len(y_40))
print('backorder ratio:', y_50.sum(), '/', len(y_50), '=', y_50.sum() / len(y_50))
# Write the balanced samples to a local file for use in Part 2.
# ----------------------------------
import joblib
joblib.dump([X_50, y_50, X_40, y_40], '../data/sample-data-v1.pkl')
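Part 2 can then reload the samples from the same relative path (a sketch, assuming the directory layout above):
# X_50, y_50, X_40, y_40 = joblib.load('../data/sample-data-v1.pkl')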