Exploration of a few non-parametric models - Hugh Chen - Stat 527¶

Table of Contents

Run this to generate table of contents:

In [18]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

Introduction¶

In this project I implement several of the most popular non-parametric methods being applied today. The goal of this project is to understand the different methods and potentially help other people understand these methods. Note that since I'm not focusing on achieving high prediction accuracy, I don't focus on hyperparameter tuning in this project. Additionally, the use case of this project is for image classification, so I also discuss extracting features from images as well. In this project I use two image datasets: digits data and cats/dogs data. In order to set up your directory you should download only the training data from both and put it into the appropriate subdirectory in the "data" directory in your current working directory.

Image Featurization¶

Table of Contents

Convolutional neural networks do great at classifying images just based off of the raw image data. This is because they essentially learn filters while concurrently learning the classification problem. For most methods, using raw pixels as features is generally a bad idea because scale variation, viewpoint variation, background clutter, etc. all confound the classification. For this section, I focus on using the cats/dogs data to extract features since the images are more interesting than those from the digits dataset.

Features¶

Global Variables/Imports¶

In [2]:
from matplotlib import pyplot as plt
import multiprocessing, cv2, os
import numpy as np
%matplotlib inline

# Global variables
MYPATH = os.path.dirname(os.path.realpath("__file__"))+"/"
NUM_CORES = multiprocessing.cpu_count()
TRAINPATH = MYPATH+"data/catsdogs/"

Downsized Image¶

Downsizing the images is useful for the sake of both standardizing image sizes and for the sake of controlling running time of training/prediction.

In [3]:
files = os.listdir(TRAINPATH); i=3
fname = TRAINPATH+files[i]
image = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2RGB)
plt.imshow(image); plt.show()
image_small = cv2.resize(image, (32, 32))
plt.imshow(image_small); plt.show()

An image is represented as three matrices that store the r, g, and b pixel values for each location. If we standardize the image size and flatten these, we can directly use these pixel values as features.

In [4]:
image.shape
Out[4]:
(333, 499, 3)
In [5]:
image_small.shape
Out[5]:
(32, 32, 3)
In [6]:
image_small.flatten().shape
Out[6]:
(3072,)

3D Color Histogram¶

For visualization's sake, here is a 2D color histogram. This histogram bins according to the color values present in the green and blue pixels. Since we have two dimensions, our histogram is in two dimensions.

In [7]:
# http://www.pyimagesearch.com/2016/08/08/k-nn-classifier-for-image-classification/
plt.rcParams["figure.figsize"] = (20,5)
fig = plt.figure()

# plot a 2D color histogram for green and blue
chans = cv2.split(image)
ax = fig.add_subplot(131)
hist = cv2.calcHist([chans[1], chans[0]], [0, 1], None, [32, 32], [0, 256, 0, 256])
p = ax.imshow(hist, interpolation = "nearest")
ax.set_title("2D Color Histogram for Green and Blue")
plt.colorbar(p)
Out[7]:
<matplotlib.colorbar.Colorbar at 0x41e7250>

For my actual features, I bin based off of three colors (r, g, and b) all together, which I use as a feature. I simply use the OpenCV package provided by python.

Code - Preprocessing¶

Preprocessing the data using the two features above (flattened 32 by 32 image and then flattened 3D histogram of colors) and saving it as a npy file helps save time, since we're likely to keep using the same data over and over again. Processing it each time is inefficient.

Note that it is often very important to either normalize or standardize the data. Otherwise the methods below may place uneven emphasis on particular features, which isn't necessarily what you want.

In [8]:
# Generate the downsized images - parallelized
def get_downsize(fname):
    image = cv2.imread(TRAINPATH+fname)
    image_small = cv2.resize(image, (32, 32))
    return(image_small.flatten())

# Generate the color histograms - parallelized
def get_color(fname):
    image = cv2.imread(TRAINPATH+fname)
    hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return(hist.flatten())

# Get the labels
def get_label(fname):
    return(fname.split(".")[0]=='dog')

Note that the code below uses multiprocessing in python to process the images much faster. I have access to a server with 72 cores, so I use "NUM_CORES-10" cores to parallelize in order to leave 10 for other people. Depending on your use case, you can modify this.

In [9]:
# Load file names
files = os.listdir(TRAINPATH)
labels = [None]*len(files)

# Get downsized features and 3D color histograms - parallelized
pool = multiprocessing.Pool(NUM_CORES-10)
downsize_data = pool.map(get_downsize, files)
pool = multiprocessing.Pool(NUM_CORES-10)
color_data = pool.map(get_color, files)

# Get labels
labels = [get_label(fname) for fname in files]

# Normalize the data
downsize_data = np.array(downsize_data)
downsize_data = (downsize_data - np.mean(downsize_data))/np.std(downsize_data)
color_data = np.array(color_data)
color_data = (color_data - np.mean(color_data))/np.std(color_data)

# Save everything
img_data = np.concatenate([downsize_data, color_data],axis=1)
labels = np.array(labels)
np.save("X.npy", img_data)
np.save("y.npy", labels)

Using pre-trained network to generate features¶

As discussed before, it's common to use a convolutional neural network trained for a particular image classification problem for another image classification problem. This is called "transfer learning". In order to try this out for myself, I went through a bunch of different tutorials on transfer learning and eventually settled on this one from MXNet - a deep learning framework developed here at University of Washington. I mostly used the code from this tutorial. I test out using these features in the nearest neighbor section.

In [2]:
# Download the pretrained model
import os, urllib
def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.urlretrieve(url, filename)
def get_model(prefix, epoch):
    download(prefix+'-symbol.json')
    download(prefix+'-%04d.params' % (epoch,))
get_model('http://data.mxnet.io/models/imagenet/resnet/50-layers/resnet-50', 0)
# Set up the model
import mxnet as mx
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)
In [ ]:
# Visualize the model
mx.viz.plot_network(sym)

Part of the model visualization:

Cat/Dog data¶

In [3]:
from matplotlib import pyplot as plt
import matplotlib, cv2
import numpy as np
%matplotlib inline

# Get the labels
def get_label(fname):
    return(fname.split(".")[0]=='dog')

# Global variables
MYPATH = os.path.dirname(os.path.realpath("__file__"))+"/"
TRAINPATH = MYPATH+"data/catsdogs/"
files = os.listdir(TRAINPATH)
labels = [None]*len(files)
labels = [get_label(fname) for fname in files]
matplotlib.rc("savefig", dpi=100)
for i in range(0,8):
    img = cv2.cvtColor(cv2.imread(TRAINPATH+files[i]), cv2.COLOR_BGR2RGB)
    plt.subplot(2,4,i+1)
    plt.imshow(img)
    plt.axis('off')
    label = labels[i]
    plt.title(label)
In [4]:
import numpy as np
import cv2
def get_image(filename):
    img = cv2.imread(filename)  # read image in b,g,r order
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)   # change to r,g,b order
    img = cv2.resize(img, (224, 224))  # resize to 224*224 to fit model
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)  # change to (channel, height, width)
    img = img[np.newaxis, :]  # extend to (example, channel, heigth, width)
    return img
from collections import namedtuple
Batch = namedtuple('Batch', ['data'])
all_layers = sym.get_internals()
all_layers.list_outputs()[-10:-1]
all_layers = sym.get_internals()
sym3 = all_layers['flatten0_output']
mod3 = mx.mod.Module(symbol=sym3, label_names=None)
# I didn't use gpu since it's harder to set up, without using GPUs, 
# featurizing with a CNN will be slow.
# mod3 = mx.mod.Module(symbol=sym3, label_names=None, context=mx.gpu())
mod3.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
mod3.set_params(arg_params, aux_params)
def get_features(fname):
    img = get_image(TRAINPATH+fname)
    mod3.forward(Batch([mx.nd.array(img)]))
    out = mod3.get_outputs()[0].asnumpy()
    return(out)

Get the features from the second to last layer of the CNN:

In [5]:
get_features(files[i])
Out[5]:
array([[ 0.40588924,  0.        ,  0.39202911, ...,  0.32599568,
         0.04644855,  0.03360213]], dtype=float32)

Run the code to save all the features extracted by the CNN. Very slow, since I didn't have time to set up the GPU for MXNet for this project.

In [ ]:
pretrain_feat = np.empty((file_num, 2048))
for i in range(len(files)):
    pretrain_feat[i,:] = get_features(files[i])
np.save("X_pretrain.npy",pretrain_feat)

Nearest Neighbor¶

Table of Contents

Nearest neighbor is a non-parametric technique for regression and classification. It's a very simple algorithm which just finds the nearest training point to a given test point based off of some distance metric.

Pros:

  1. No training - you just use the raw training set for predictions.
  2. Easy to implement.

Cons:

  1. The distance metric may be hard to define in certain problems.
  2. Choosing and weighting features is a big deal.
  3. Either your training set is small and you have poor predictions or your training set is large and you have slow predictions.

Code - Nearest Neighbor¶

In [1]:
# Global variables/imports
import multiprocessing, random, time, cv2, os
from matplotlib import pyplot as plt
from datetime import datetime
import numpy as np
%matplotlib inline

# Global variables
PATH = os.path.dirname(os.path.abspath("__file__")).rsplit('/', 1)[0]
MYPATH = PATH+"/stat527/"
NUM_CORES = multiprocessing.cpu_count()
TRAINPATH = MYPATH+"data/catsdogs/"
In [2]:
# Load data
X = np.load("X.npy")
Y = np.load("y.npy")
# Partition into test and training sets
ind = np.arange(len(X)); np.random.shuffle(ind)
nine_tenth_ind = int(len(X) - len(X)/10.0)
trainX = X[:nine_tenth_ind,:]; trainY = Y[:nine_tenth_ind]
testX = X[nine_tenth_ind:,:]; testY = Y[nine_tenth_ind:]

One approach for finding the nearest neighbors is to use a KD-tree. A KD-tree is a k-dimensional tree which serves as a space-partitioning data structure for organizing points in k-dimensional spaces. Unfortunately, image data is often inherently high dimensional, so using these multidimensional search trees is not much better than brute force. In computer science, finding high-dimensional nearest neighbors is an open problem https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree. Since KD-trees aren't necessarily going to improve on brute force (especially with such high dimensional features), I don't end up trying it out.

Naturally, since there isn't any training involved in nearest neighbor, we see that the prediction is very slow. In order to combat this, I tried a variety of things. Firstly, I tried different methods for calculating distances quickly. One method involved using dot products, but ultimately using vectorized computation with numpy was the simplest and fairly fast.

In [3]:
def find_knn(x, datax, datay, k=5, method="euclidean"):
    # Compute euclidean distance using vectorized operations
    if (method == "euclidean"):
        distances = [np.sum((x1-x)**2) for x1 in datax]
    # Compute euclidean distance using dot products - slightly faster
    if (method == "euclidean_dot"):
        deltas = datax - x
        distances = np.einsum('ij,ij->i', deltas, deltas)
    # Sort the labels and return them
    knn_labels = datay[np.argpartition(distances, k)[:k]]
    return(round(np.mean(knn_labels)))

In order to further improve things, I tried out parallel processing with python. Due to a global interpreter lock, certain packages in python don't achieve true parallelism, but I found that using multiprocessing in the following way worked. Fortunately I had access to a machine with 72 cores, so things ran fairly fast. Depending on what resources you have access to it may not be the case.

Cats vs. Dogs Example¶

Predicting 100 test examples sequentially (slow)

In [9]:
startTime = datetime.now()
print "Starting with 1 core (sequential)"
pred = [find_knn(x,trainX,trainY) for x in testX[1:100,:]]
print "Elapsed time:"
print datetime.now() - startTime
Starting with 1 core (sequential)
Elapsed time:
0:00:35.779427

Predicting 100 test examples in parallel (faster)

In [10]:
# Predicting 100 test 
def find_knn1(x):
    return(find_knn(x, trainX, trainY))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, testX[1:100,:])
print "Elapsed time:"
print datetime.now() - startTime
Starting with 62 cores, using multiprocessing
Elapsed time:
0:00:02.028805

Predicting all test examples in parallel

In [4]:
def find_knn1(x):
    return(find_knn(x, trainX, trainY))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, testX)
print "Elapsed time:"
print datetime.now() - startTime
Starting with 62 cores, using multiprocessing
Elapsed time:
0:00:36.529332

Compute the test accuracy

In [5]:
100*np.sum(pred==testY)/float(len(testY))
Out[5]:
57.840000000000003

So we end up with a test accuracy of around 57%. Since guessing at random would yield an accuracy of 50%, we've improved on the baseline. Obviously there is still a lot of room for improvement.

Using pretrained model's last layer as features¶

In [6]:
# Load data
X = np.load("X_pretrain.npy")
Y = np.load("y.npy")
# Partition into test and training sets
ind = np.arange(len(X)); np.random.shuffle(ind)
nine_tenth_ind = int(len(X) - len(X)/10.0)
trainX = X[:nine_tenth_ind,:]; trainY = Y[:nine_tenth_ind]
testX = X[nine_tenth_ind:,:]; testY = Y[nine_tenth_ind:]
In [7]:
def find_knn1(x):
    return(find_knn(x, trainX, trainY))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, testX)
print "Elapsed time:"
print datetime.now() - startTime
Starting with 62 cores, using multiprocessing
Elapsed time:
0:00:20.232698
In [8]:
100*np.sum(pred==testY)/float(len(testY))
Out[8]:
50.280000000000001

Unfortunately using the pretrained CNN's extracted features doesn't do much better than random. The low accuracy simply might be because using nearest neighbor may not be very appropriate for the pretrained features.

Digits Example¶

I also tried using the digits data for nearest neighbor.

In [9]:
# Some of the code for loading data was borrowed from here:
# https://www.kaggle.com/kakauandme/tensorflow-deep-nn
import pandas as pd
data = pd.read_csv('data/digits/train.csv')
images = data.iloc[:,1:].values
images = images.astype(np.float)
images = np.multiply(images, 1.0 / 255.0) # Normalize
labels = data["label"].values.ravel()

ind = np.arange(len(images))
np.random.shuffle(ind)
nine_tenth_ind = int(len(images) - len(images)/10.0)
train_images = images[:nine_tenth_ind,:]
train_labels = labels[:nine_tenth_ind]
test_images = images[nine_tenth_ind:,:]
test_labels = labels[nine_tenth_ind:]
In [10]:
def find_knn1(x):
    return(find_knn(x, train_images, train_labels))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, test_images)
print "Elapsed time:"
print datetime.now() - startTime
Starting with 62 cores, using multiprocessing
Elapsed time:
0:03:02.245723
In [11]:
100*np.sum(pred==test_labels)/float(len(test_labels))
Out[11]:
91.071428571428569

So we end up with a test accuracy of 91.07%. Since guessing at random would yield an accuracy of 10%, we've greatly improved on the baseline, but there is still room for improvement!

Gradient Boosting Trees¶

Table of Contents

Gradient boosting trees at their simplest form are very easy to understand. Gradient boosting is a way to descend down a loss function one step at a time. If we use trees, then we are simply descending down our loss function by approximating the gradient with our residuals. For many modern day machine learning prediction problems gradient boosting trees are very generalizable and perform extremely well.

The steps to gradient boosting are as follows:

  1. Start with an initial model $F(x)$.
  2. Calculate negative gradients via residuals.
  3. Fit a regression tree $h$ to the negative gradients.
  4. Set $F=F+\rho\times h$, where $\rho$ is the learning rate.
  5. Repeat steps 2-4 until convergence.

Then only subtlety is that for classification, we have to compute that gradients for each class. This set of slides describes this point quite well.

Below, I report the results for using the digits dataset. Since this is a simple version of gradient boosting trees, I only try it out on the simple digits dataset.

Code - Gradient Boosting¶

In [12]:
# Data from https://www.kaggle.com/c/digit-recognizer
# Some of the code for loading data was borrowed from here:
# https://www.kaggle.com/kakauandme/tensorflow-deep-nn
import numpy as np
import pandas as pd
data = pd.read_csv('data/digits/train.csv')
images = data.iloc[:,1:].values
images = images.astype(np.float)
images = np.multiply(images, 1.0 / 255.0) # Normalize
labels = data["label"].values.ravel()

ind = np.arange(len(images))
np.random.shuffle(ind)
nine_tenth_ind = int(len(images) - len(images)/10.0)
train_images = images[:nine_tenth_ind,:]
train_labels = labels[:nine_tenth_ind]
test_images = images[nine_tenth_ind:,:]
test_labels = labels[nine_tenth_ind:]
In [17]:
from sklearn import tree
class tree_classifier:
    maxdepth = 4
    forest = []
    eta = 0.1
    def __init__(self, num_classes, train_images, train_labels, 
                 max_depth = 4, eta = 0.1):
        # Use a classifier tree as the initial F
        self.num_classes = num_classes
        self.max_depth = max_depth
        self.eta = eta
        self.train_images = train_images
        self.train_labels = train_labels
        clf = tree.DecisionTreeClassifier(max_depth = 4)
        clf = clf.fit(train_images, train_labels)
        self.clf = clf
        onehot_train_labels = np.zeros((len(train_labels), 10))
        onehot_train_labels[np.arange(len(train_labels)), train_labels] = 1
        self.curr_prob = clf.predict_proba(train_images)
        self.curr_res = onehot_train_labels - self.curr_prob
        
    def fit_once(self):
        trees = []
        for i in range(self.num_classes):
            tree_res = self.curr_res[:,i]
            # Add small trees regression trees per class
            curr_tree = tree.DecisionTreeRegressor(max_depth = 4)
            curr_tree.fit(self.train_images, tree_res)
            tree_prob = curr_tree.predict(self.train_images)
            self.curr_prob[:,i] = self.curr_prob[:,i] + self.eta*tree_prob
            trees.append(curr_tree)
        self.forest.append(trees)

    def fit(self, niters, verbose = True):
        for i in range(niters):
            if (verbose):
                print "Training Accuracy: {0}".format(self.train_acc())                
            self.fit_once()

    def train_acc(self):
        return(np.sum(np.argmax(self.curr_prob, axis=1)==train_labels)/float(len(train_labels)))
    
    def predict(self, test_images):
        test_prob = self.clf.predict_proba(test_images)
        for trees in self.forest:
            for j in range(self.num_classes):
                curr_tree = trees[j]
                tree_prob = curr_tree.predict(test_images)
                test_prob[:,j] = test_prob[:,j] + self.eta*tree_prob
        return(np.argmax(test_prob, axis=1))
In [18]:
clf = tree_classifier(10, train_images, train_labels)
clf.train_acc()
Out[18]:
0.63378306878306878
In [19]:
clf.fit(10)
Training Accuracy: 0.633783068783
Training Accuracy: 0.637301587302
Training Accuracy: 0.638201058201
Training Accuracy: 0.641349206349
Training Accuracy: 0.649021164021
Training Accuracy: 0.66828042328
Training Accuracy: 0.673386243386
Training Accuracy: 0.690396825397
Training Accuracy: 0.706322751323
Training Accuracy: 0.714232804233

Test Accuracy

In [20]:
np.sum(clf.predict(test_images)==test_labels)/float(len(test_labels))
Out[20]:
0.72357142857142853

While the test performance isn't that great, keep in mind that this is a very simplified model and we are still improving over the random accuracy of 10%.

Neural Network¶

Table of Contents

Description¶

High Level¶

Neural networks are often touted as being complicated/mysterious machine learning models. In reality, vanilla neural networks aren't that complicated. In order to test out a neural network for image classification, I implement a simple one below.

Neural networks are simply nonparametric models that were loosely inspired by neurons in the brain. Their power comes from their composability and the simplicity of the neurons.

In general, neural networks are defined by layers (a set of neurons) moving from an input layer to an output one. Every layer between the input and the output are considered to be "hidden" layers. The neurons themselves are simply activation functions such as tanh or the sigmoid function. The neurons are connected between each layer by weights which define the linear combinations that go into neurons.

In the example below, I use one hidden layer which has a tanh activation on its neurons which goes to an output layer which has a sigmoid activation on its neurons. In this example the layers are fully connected, which just means every neuron in layer $i$ is connected to every neuron in layer $i+1$.

Here's a depiction:

In the end our goal is to simply minimize a loss function where our predictions are determined by the neural network. In the model, it's clear that when we consider a particular training example, everything is fixed except for the weights (which are the parameters in a neural network). Since neural networks often have activation functions which are hard to work with, in practice neural networks are trained via gradient descent.

The training of a neural network looks like the following:

  1. Randomly initialize weights.
  2. Forward pass - calculate the outputs.
  3. Back propagation - calculate the errors using the chain rule (working backwards).
  4. Repeat steps 2 and 3 until you're happy with your training/validation error.

Backpropagation¶

For backpropagation, we first have to define some terms (it may be useful to look at the depiction of a neural network above):

  • $f$ is the loss function.
  • $\eta$ is the learning rate.
  • $g^{(i)}$ is the activation function (in our example it's constant for each layer) .
  • $sig$ is the sigmoid function.
  • $y_i$ is the true value of the output.
  • $x_i^{(0)}=s_i^{(0)}$ are the input values.
  • $s_i^{(l)}=\sum_jx_j^{(l-1)}w_{ji}$ is the pre-activation value for node $i$ in layer $l$.
  • $x_i^{(l)}=g^{(l)}(s_i^{(l)})$ is the post-activation value for node $i$ in layer $l$.
  • $w_{ij}^{(l)}$ is the weight from $x_i^{(l-1)}$ to $s_j^{(l)}$.

Then, in general we know that the updates we want are $w_{ij}^{(2)}\leftarrow w_{ij}^{(2)}-\eta\times \frac{df}{dw_{ij}^{(2)}}$ and $w_{ij}^{(1)}\leftarrow w_{ij}^{(1)}-\eta\times \frac{df}{dw_{ij}^{(1)}}$.

Thanks to the structure of the neural network, we can apply the chain rule and see that $\frac{df}{dw_{ij}^{(2)}}=\frac{df}{dx_{j}^{(2)}}\times \frac{dx_{j}^{(2)}}{ds_{j}^{(2)}}\times \frac{ds_{j}^{(2)}}{dw_{ij}^{(2)}}$ and $\frac{df}{dw_{ij}^{(1)}}=\frac{df}{ds_{i}^{(1)}}\times \frac{ds_{i}^{(1)}}{dw_{ij}^{(1)}}$.

Additionally, if the loss is expressed as a sum over all output nodes, $\frac{df}{ds_{i}^{(1)}}=\sum_k \frac{df}{ds_{k}^{(2)}}\times \frac{ds_{k}^{(2)}}{dx_{k}^{(2)}}\times \frac{dx_{k}^{(2)}}{ds_{i}^{(1)}}$.

So we can re-express $\frac{df}{dw_{ij}^{(1)}}=\frac{ds_{i}^{(1)}}{dw_{ij}^{(1)}}\times \sum_k \frac{df}{ds_{k}^{(2)}}\times \frac{ds_{k}^{(2)}}{dx_{k}^{(2)}}\times \frac{dx_{k}^{(2)}}{ds_{i}^{(1)}}$.

Then, in my network I simply use a mean-squared error loss (although perhaps binary cross-entropy would be more appropriate). At this point, we can simply plug in to find:

$\frac{df}{dw_{ij}^{(2)}}=\frac{df}{dx_{j}^{(2)}}\times \frac{dx_{j}^{(2)}}{ds_{j}^{(2)}}\times \frac{ds_{j}^{(2)}}{dw_{ij}^{(2)}}=(x_i^{(2)}-y_i)g'(s_i^{(2)})x_i^{(1)}=(x_i^{(2)}-y_i)sig(s_i^{(2)})(1-sig(s_i^{(2)}))x_i^{(1)}$

$\frac{df}{dw_{ij}^{(1)}}=\frac{ds_{i}^{(1)}}{dw_{ij}^{(1)}}\times \sum_k \frac{df}{ds_{k}^{(2)}}\times \frac{ds_{k}^{(2)}}{dx_{k}^{(2)}}\times \frac{dx_{k}^{(2)}}{ds_{i}^{(1)}}=x_i^{(0)}\times \sum_k (x_j^{(2)}-y_j)sig(s_j^{(2)})(1-sig(s_j^{(2)}))w_{jk}^{(2)}(1-tanh^2(s_j^{(1)}))$

In order to use the neural network, you'll have to convert to matrix equations. I found it convenient to use this video series to help get the equations.

Code - Neural Network¶

In [8]:
# Global Variables/Imports
import random, time, csv, cv2, os
from matplotlib import pyplot as plt
from datetime import datetime
import numpy as np
%matplotlib inline

num_hidden_nodes = 200
num_input_nodes = 784

# Sigmoid function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

# Derivative of sigmoid
def sig_prime(x):
    return np.multiply(sigmoid(x), 1-sigmoid(x))

# Derivative of tanh
def tanh_prime(x):
    return 1 - np.square(np.tanh(x))

# Backward pass that finds the errors
def backward_pass(x_two, s_two, w_two, s_one, y):
    delta_two = np.multiply(x_two - y, sig_prime(s_two)) # MSE
    w_two_short = w_two[0:num_hidden_nodes,:] # Chop off bias
    delta_one = np.multiply(np.dot(delta_two, w_two_short.T), tanh_prime(s_one))
    return (delta_two, delta_one)

# Forward pass that calculates new state
def forward_pass(x_zero, w_one, w_two):
    s_one = np.dot(x_zero, w_one)
    x_one = np.tanh(s_one)
    x_one = np.array([np.append(x_one[0], 1)])
    s_two = np.dot(x_one, w_two)
    x_two = np.tanh(s_two)
    return(s_one, x_one, s_two, x_two)

# Find accuracy given images and labels
def test_accuracy(w_one, w_two, images, labels):
    counter = 0
    for i in range(0, len(images)):
        x_in = np.array([np.append(images[i], 1)])
        # Predict
        s_one, x_one, s_two, x_two = forward_pass(x_in, w_one, w_two)
        prediction = np.argmax(x_two)
        if prediction == labels[i]:
            counter = counter + 1
    return counter/float(len(images))

# Print and save progress as needed (currently every 10000 iterations)
def print_save_progress(i, t0, w_two, w_one, images, labels):
    if ((i+1) % 10000 == 0):
        error = test_accuracy(w_one, w_two, images, labels)
        print "Trial: " + str(i+1) + " Accuracy: " + str(error)
        np.savetxt("w_one.csv", w_one, delimiter=",")
        np.savetxt("w_two.csv", w_two, delimiter=",")
        print "Time Elapsed: " + str(time.time() - t0)
    return(w_one, w_two)
    
# Train neural network and save the weights at a certain accuracy
def train_neural_network(images, labels, eta, niter):
    print "TRAINING NEURAL NETWORK"
    # Initialized weights uniformly in [-.01, .01]
    t0 = time.time()
    # Account for bias terms
    w_one = (np.random.rand(num_input_nodes+1,200)*.2)-.1
    w_two = (np.random.rand(num_hidden_nodes+1,10)*.2)-.1
    for i in range(niter):
        # Pick random data point
        index = random.randint(0, len(labels)-1)
        x_zero = np.array([np.append(images[index], 1)])
        y = np.array([np.zeros(10)])
        y[0, labels[index]] = 1
        # Forward pass
        s_one, x_one, s_two, x_two = forward_pass(x_zero, w_one, w_two)
        # Backward pass
        delta_two, delta_one = backward_pass(x_two, s_two, w_two, s_one, y)
        # Weight update
        w_two = w_two - eta*np.dot(x_one.T, delta_two)
        w_one = w_one - eta*np.dot(x_zero.T, delta_one)
        # Printing out progress
        print_save_progress(i, t0, w_two, w_one, images, labels)

I end up using the MNIST digit dataset on kaggle since it's a much simpler problem for the neural network to deal with.

In [9]:
# Data from https://www.kaggle.com/c/digit-recognizer
# Some of the code for loading data was borrowed from here:
# https://www.kaggle.com/kakauandme/tensorflow-deep-nn
import pandas as pd
data = pd.read_csv('data/digits/train.csv')
images = data.iloc[:,1:].values
images = images.astype(np.float)
images = np.multiply(images, 1.0 / 255.0) # Normalize
labels = data["label"].values.ravel()

ind = np.arange(len(images))
np.random.shuffle(ind)
nine_tenth_ind = int(len(images) - len(images)/10.0)
train_images = images[:nine_tenth_ind,:]
train_labels = labels[:nine_tenth_ind]
test_images = images[nine_tenth_ind:,:]
test_labels = labels[nine_tenth_ind:]
In [10]:
import matplotlib.cm as cm

# display image
def display(img):
    one_image = img.reshape(28,28)
    plt.axis('off')
    plt.imshow(one_image, cmap=cm.binary)

# output image     
display(images[100])
In [7]:
train_neural_network(train_images, train_labels, .01, 100000)
TRAINING NEURAL NETWORK
Trial: 10000 Accuracy: 0.853518518519
Time Elapsed: 14.812183857
Trial: 20000 Accuracy: 0.873201058201
Time Elapsed: 28.0374269485
Trial: 30000 Accuracy: 0.900502645503
Time Elapsed: 41.3854689598
Trial: 40000 Accuracy: 0.910132275132
Time Elapsed: 55.8106050491
Trial: 50000 Accuracy: 0.913915343915
Time Elapsed: 70.709143877
Trial: 60000 Accuracy: 0.924735449735
Time Elapsed: 86.7208969593
Trial: 70000 Accuracy: 0.92626984127
Time Elapsed: 100.601037979
Trial: 80000 Accuracy: 0.929920634921
Time Elapsed: 116.155189991
Trial: 90000 Accuracy: 0.933650793651
Time Elapsed: 130.912786961
Trial: 100000 Accuracy: 0.93544973545
Time Elapsed: 146.341517925
In [11]:
# Load the parameters and test the accuracy
w_one = np.genfromtxt('w_one.csv', delimiter=',')
w_two = np.genfromtxt('w_two.csv', delimiter=',')
accuracy = test_accuracy(w_one, w_two, test_images, test_labels)
print "Test Accuracy: {0}".format(accuracy)
Test Accuracy: 0.930238095238

The test accuracy on the digits dataset is very good. I tried to use the same neural network for the cats vs. dogs dataset, but the training was not promising. Using a vanilla neural network on mostly raw image features doesn't perform well given the complexity of the problem.

In [ ]:
'''
X = np.load("X.npy")
Y = np.load("y.npy")
# Partition into test and training sets
ind = np.arange(len(X))
np.random.shuffle(ind)
nine_tenth_ind = int(len(X) - len(X)/10.0)
trainX = X[:nine_tenth_ind,:]
trainY = Y[:nine_tenth_ind]
testX = X[nine_tenth_ind:,:]
testY = Y[nine_tenth_ind:]

Sample output when trying to do cats v dogs
TRAINING NEURAL NETWORK
Trial: 10000 Accuracy: 0.494
Trial: 20000 Accuracy: 0.494222222222
Trial: 30000 Accuracy: 0.522711111111
Trial: 40000 Accuracy: 0.504088888889
Trial: 50000 Accuracy: 0.534444444444
Trial: 60000 Accuracy: 0.520133333333
Trial: 70000 Accuracy: 0.517244444444
Trial: 80000 Accuracy: 0.505911111111
Trial: 90000 Accuracy: 0.4972
Trial: 100000 Accuracy: 0.5212
'''