Run this to generate table of contents:

In [18]:

```
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')
```

In this project I implement several of the most popular non-parametric methods in use today. The goal is to understand these methods and potentially help others understand them. Since I'm not aiming for high prediction accuracy, I don't focus on hyperparameter tuning. The use case of this project is image classification, so I also discuss extracting features from images. I use two image datasets: digits data and cats/dogs data. To set up your directory, download only the training data from both and put it into the appropriate subdirectory of the "data" directory in your current working directory.

Convolutional neural networks do great at classifying images from raw pixel data alone, because they essentially learn filters while concurrently learning the classification problem. For most other methods, using raw pixels as features is generally a bad idea: scale variation, viewpoint variation, background clutter, etc. all confound the classification. In this section, I focus on extracting features from the cats/dogs data, since those images are more interesting than the digits.

In [2]:

```
from matplotlib import pyplot as plt
import multiprocessing, cv2, os
import numpy as np
%matplotlib inline
# Global variables
MYPATH = os.path.dirname(os.path.realpath("__file__"))+"/"
NUM_CORES = multiprocessing.cpu_count()
TRAINPATH = MYPATH+"data/catsdogs/"
```

Downsizing the images is useful both for standardizing image sizes and for controlling the running time of training and prediction.

In [3]:

```
files = os.listdir(TRAINPATH); i=3
fname = TRAINPATH+files[i]
image = cv2.cvtColor(cv2.imread(fname), cv2.COLOR_BGR2RGB)
plt.imshow(image); plt.show()
image_small = cv2.resize(image, (32, 32))
plt.imshow(image_small); plt.show()
```

An image is represented as three matrices that store the r, g, and b pixel values at each location. If we standardize the image size and flatten these matrices, we can use the pixel values directly as features.

In [4]:

```
image.shape
```

Out[4]:

In [5]:

```
image_small.shape
```

Out[5]:

In [6]:

```
image_small.flatten().shape
```

Out[6]:

For visualization's sake, here is a 2D color histogram, which bins the image's pixels according to the values present in two color channels. Since we bin over two channels, the histogram is two-dimensional.

In [7]:

```
# http://www.pyimagesearch.com/2016/08/08/k-nn-classifier-for-image-classification/
plt.rcParams["figure.figsize"] = (20,5)
fig = plt.figure()
# plot a 2D color histogram for green and blue
# Note: image is in r,g,b order after the conversion above,
# so green and blue are channels 1 and 2
chans = cv2.split(image)
ax = fig.add_subplot(131)
hist = cv2.calcHist([chans[1], chans[2]], [0, 1], None, [32, 32], [0, 256, 0, 256])
p = ax.imshow(hist, interpolation = "nearest")
ax.set_title("2D Color Histogram for Green and Blue")
plt.colorbar(p)
```

Out[7]:

For my actual features, I bin over all three colors (r, g, and b) together and use the flattened 3D histogram as a feature vector. I simply use the OpenCV package for Python.

Preprocessing the data into the two feature sets above (a flattened 32-by-32 image and a flattened 3D histogram of colors) and saving the result as an npy file helps save time, since we're likely to reuse the same data over and over again; reprocessing it each time is inefficient.

Note that it is often very important to either normalize or standardize the data. Otherwise the methods below may place uneven emphasis on particular features, which isn't necessarily what you want.
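As a tiny illustration of the difference (a toy matrix, not the project's actual features), standardization recenters and rescales the data while min-max normalization maps it into [0, 1]:

```python
import numpy as np

# Hypothetical feature matrix: rows are examples, columns are features
X = np.array([[0.0, 200.0],
              [50.0, 100.0],
              [100.0, 0.0]])

# Standardize using a single global mean/std (as the preprocessing cell below does)
X_std = (X - np.mean(X)) / np.std(X)

# Min-max normalization to [0, 1] is a common alternative
X_norm = (X - X.min()) / (X.max() - X.min())
```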

In [8]:

```
# Generate the downsized images - parallelized
def get_downsize(fname):
    image = cv2.imread(TRAINPATH+fname)
    image_small = cv2.resize(image, (32, 32))
    return(image_small.flatten())
# Generate the color histograms - parallelized
def get_color(fname):
    image = cv2.imread(TRAINPATH+fname)
    hist = cv2.calcHist([image], [0, 1, 2], None, [8, 8, 8], [0, 256, 0, 256, 0, 256])
    return(hist.flatten())
# Get the labels
def get_label(fname):
    return(fname.split(".")[0]=='dog')
```

Note that the code below uses Python's multiprocessing module to process the images much faster. I have access to a server with 72 cores, so I parallelize over NUM_CORES-10 cores in order to leave 10 free for other people. Depending on your use case, you can modify this.

In [9]:

```
# Load file names
files = os.listdir(TRAINPATH)
# Get downsized features and 3D color histograms - parallelized
pool = multiprocessing.Pool(NUM_CORES-10)
downsize_data = pool.map(get_downsize, files)
color_data = pool.map(get_color, files)
# Get labels
labels = [get_label(fname) for fname in files]
# Normalize the data
downsize_data = np.array(downsize_data)
downsize_data = (downsize_data - np.mean(downsize_data))/np.std(downsize_data)
color_data = np.array(color_data)
color_data = (color_data - np.mean(color_data))/np.std(color_data)
# Save everything
img_data = np.concatenate([downsize_data, color_data],axis=1)
labels = np.array(labels)
np.save("X.npy", img_data)
np.save("y.npy", labels)
```

As discussed before, it's common to reuse a convolutional neural network trained on one image classification problem for another image classification problem; this is called "transfer learning". To try this out for myself, I went through a bunch of tutorials on transfer learning and eventually settled on one from MXNet - a deep learning framework developed here at University of Washington. I mostly use the code from that tutorial, and I test the resulting features in the nearest neighbor section.

In [2]:

```
# Download the pretrained model
import os, urllib
def download(url):
    filename = url.split("/")[-1]
    if not os.path.exists(filename):
        urllib.urlretrieve(url, filename)
def get_model(prefix, epoch):
    download(prefix+'-symbol.json')
    download(prefix+'-%04d.params' % (epoch,))
get_model('http://data.mxnet.io/models/imagenet/resnet/50-layers/resnet-50', 0)
# Set up the model
import mxnet as mx
sym, arg_params, aux_params = mx.model.load_checkpoint('resnet-50', 0)
```

In [ ]:

```
# Visualize the model
mx.viz.plot_network(sym)
```

Part of the model visualization:

In [3]:

```
from matplotlib import pyplot as plt
import matplotlib, cv2, os
import numpy as np
%matplotlib inline
# Get the labels
def get_label(fname):
    return(fname.split(".")[0]=='dog')
# Global variables
MYPATH = os.path.dirname(os.path.realpath("__file__"))+"/"
TRAINPATH = MYPATH+"data/catsdogs/"
files = os.listdir(TRAINPATH)
labels = [get_label(fname) for fname in files]
matplotlib.rc("savefig", dpi=100)
# Display the first eight images with their labels
for i in range(0,8):
    img = cv2.cvtColor(cv2.imread(TRAINPATH+files[i]), cv2.COLOR_BGR2RGB)
    plt.subplot(2,4,i+1)
    plt.imshow(img)
    plt.axis('off')
    plt.title(labels[i])
```

In [4]:

```
import numpy as np
import cv2
def get_image(filename):
    img = cv2.imread(filename)                 # read image in b,g,r order
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB) # change to r,g,b order
    img = cv2.resize(img, (224, 224))          # resize to 224x224 to fit the model
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2)               # change to (channel, height, width)
    img = img[np.newaxis, :]                   # extend to (example, channel, height, width)
    return img
from collections import namedtuple
Batch = namedtuple('Batch', ['data'])
# List the last few layers of the network
all_layers = sym.get_internals()
all_layers.list_outputs()[-10:-1]
# Use the output of the flatten layer as the feature layer
sym3 = all_layers['flatten0_output']
mod3 = mx.mod.Module(symbol=sym3, label_names=None)
# I didn't use a GPU since it's harder to set up; without GPUs,
# featurizing with a CNN will be slow.
# mod3 = mx.mod.Module(symbol=sym3, label_names=None, context=mx.gpu())
mod3.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
mod3.set_params(arg_params, aux_params)
def get_features(fname):
    img = get_image(TRAINPATH+fname)
    mod3.forward(Batch([mx.nd.array(img)]))
    out = mod3.get_outputs()[0].asnumpy()
    return(out)
```

Get the features from the second to last layer of the CNN:

In [5]:

```
get_features(files[i])
```

Out[5]:

Run the code to save all the features extracted by the CNN. Very slow, since I didn't have time to set up the GPU for MXNet for this project.

In [ ]:

```
pretrain_feat = np.empty((len(files), 2048))
for i in range(len(files)):
    pretrain_feat[i,:] = get_features(files[i])
np.save("X_pretrain.npy",pretrain_feat)
```

Nearest neighbor is a non-parametric technique for regression and classification. It's a very simple algorithm that just finds the nearest training point(s) to a given test point based on some distance metric.

Pros:

- No training - you just use the raw training set for predictions.
- Easy to implement.

Cons:

- The distance metric may be hard to define in certain problems.
- Choosing and weighting features is a big deal.
- Either your training set is small and you have poor predictions or your training set is large and you have slow predictions.
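To make the idea concrete, here is a minimal 1-nearest-neighbor sketch on toy one-dimensional data (my own names, not the project's code):

```python
import numpy as np

def nearest_neighbor_predict(x, train_x, train_y):
    # Squared Euclidean distance from x to every training point
    dists = np.sum((train_x - x) ** 2, axis=1)
    # The prediction is the label of the closest training point
    return train_y[np.argmin(dists)]

# Toy data: two well-separated clusters on the real line
train_x = np.array([[0.0], [1.0], [10.0], [11.0]])
train_y = np.array([0, 0, 1, 1])
```

Note there is no training step: the "model" is just the stored training set.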

In [1]:

```
# Global variables/imports
import multiprocessing, random, time, cv2, os
from matplotlib import pyplot as plt
from datetime import datetime
import numpy as np
%matplotlib inline
# Global variables
PATH = os.path.dirname(os.path.abspath("__file__")).rsplit('/', 1)[0]
MYPATH = PATH+"/stat527/"
NUM_CORES = multiprocessing.cpu_count()
TRAINPATH = MYPATH+"data/catsdogs/"
```

In [2]:

```
# Load data
X = np.load("X.npy")
Y = np.load("y.npy")
# Partition into test and training sets
ind = np.arange(len(X)); np.random.shuffle(ind)
X = X[ind]; Y = Y[ind]  # Apply the shuffle before splitting
nine_tenth_ind = int(len(X) - len(X)/10.0)
trainX = X[:nine_tenth_ind,:]; trainY = Y[:nine_tenth_ind]
testX = X[nine_tenth_ind:,:]; testY = Y[nine_tenth_ind:]
```

One approach for finding nearest neighbors is a KD-tree: a k-dimensional tree that serves as a space-partitioning data structure for organizing points in k-dimensional space. Unfortunately, image data is often inherently high dimensional, so these multidimensional search trees are not much better than brute force here. In computer science, efficiently finding high-dimensional nearest neighbors is an open problem (see the note in the scipy documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.html#scipy.spatial.KDTree). Since KD-trees aren't necessarily going to improve on brute force with features this high dimensional, I don't end up trying them.
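For lower-dimensional features, though, a KD-tree query is straightforward; here is a hedged sketch using scipy's `cKDTree` on toy 3-dimensional data (variable names are my own):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.RandomState(0)
data = rng.rand(1000, 3)             # 1000 points in only 3 dimensions
tree = cKDTree(data)                 # build the space-partitioning tree once
query = rng.rand(3)
dists, idx = tree.query(query, k=5)  # 5 nearest neighbors of the query point
```

In low dimensions the tree prunes most of the space per query; as the dimension grows, the pruning degrades toward brute force.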

Naturally, since there isn't any training involved in nearest neighbor, prediction is very slow. To combat this, I tried a variety of things. First, I tried different methods for calculating distances quickly. One method involved dot products, but ultimately vectorized computation with numpy was the simplest and fairly fast.

In [3]:

```
def find_knn(x, datax, datay, k=5, method="euclidean"):
    # Compute squared euclidean distances with a list comprehension
    if (method == "euclidean"):
        distances = [np.sum((x1-x)**2) for x1 in datax]
    # Compute squared euclidean distances via dot products - slightly faster
    if (method == "euclidean_dot"):
        deltas = datax - x
        distances = np.einsum('ij,ij->i', deltas, deltas)
    # Average the k nearest labels and round
    knn_labels = datay[np.argpartition(distances, k)[:k]]
    return(round(np.mean(knn_labels)))
```
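As a quick sanity check (toy arrays, my own variable names), the two distance computations should agree:

```python
import numpy as np

rng = np.random.RandomState(0)
datax = rng.rand(50, 10)
x = rng.rand(10)

# Method 1: squared distances via a list comprehension
d1 = np.array([np.sum((x1 - x) ** 2) for x1 in datax])
# Method 2: the same quantity via einsum on the difference matrix
deltas = datax - x
d2 = np.einsum('ij,ij->i', deltas, deltas)
```

The einsum form avoids the Python-level loop, which is where its speedup comes from.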

In order to further improve things, I tried out parallel processing in Python. Due to the global interpreter lock, certain approaches in Python don't achieve true parallelism, but I found that using multiprocessing in the following way worked. Fortunately I had access to a machine with 72 cores, so things ran fairly fast; depending on what resources you have, that may not be the case.

Predicting 100 test examples sequentially (slow)

In [9]:

```
startTime = datetime.now()
print "Starting with 1 core (sequential)"
pred = [find_knn(x,trainX,trainY) for x in testX[1:100,:]]
print "Elapsed time:"
print datetime.now() - startTime
```

Predicting 100 test examples in parallel (faster)

In [10]:

```
# Predicting 100 test
def find_knn1(x):
    return(find_knn(x, trainX, trainY))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, testX[1:100,:])
print "Elapsed time:"
print datetime.now() - startTime
```

Predicting all test examples in parallel

In [4]:

```
def find_knn1(x):
    return(find_knn(x, trainX, trainY))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, testX)
print "Elapsed time:"
print datetime.now() - startTime
```

Compute the test accuracy

In [5]:

```
100*np.sum(pred==testY)/float(len(testY))
```

Out[5]:

So we end up with a test accuracy of around **57%**. Since guessing at random would yield an accuracy of 50%, we've improved on the baseline. Obviously there is still a lot of room for improvement.

In [6]:

```
# Load data
X = np.load("X_pretrain.npy")
Y = np.load("y.npy")
# Partition into test and training sets
ind = np.arange(len(X)); np.random.shuffle(ind)
X = X[ind]; Y = Y[ind]  # Apply the shuffle before splitting
nine_tenth_ind = int(len(X) - len(X)/10.0)
trainX = X[:nine_tenth_ind,:]; trainY = Y[:nine_tenth_ind]
testX = X[nine_tenth_ind:,:]; testY = Y[nine_tenth_ind:]
```

In [7]:

```
def find_knn1(x):
    return(find_knn(x, trainX, trainY))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, testX)
print "Elapsed time:"
print datetime.now() - startTime
```

In [8]:

```
100*np.sum(pred==testY)/float(len(testY))
```

Out[8]:

Unfortunately, using the features extracted by the pretrained CNN doesn't do much better than random. The low accuracy might simply be because nearest neighbor isn't very appropriate for the pretrained features.

I also tried using the digits data for nearest neighbor.

In [9]:

```
# Some of the code for loading data was borrowed from here:
# https://www.kaggle.com/kakauandme/tensorflow-deep-nn
import pandas as pd
data = pd.read_csv('data/digits/train.csv')
images = data.iloc[:,1:].values
images = images.astype(np.float)
images = np.multiply(images, 1.0 / 255.0) # Normalize
labels = data["label"].values.ravel()
ind = np.arange(len(images))
np.random.shuffle(ind)
images = images[ind]; labels = labels[ind]  # Apply the shuffle before splitting
nine_tenth_ind = int(len(images) - len(images)/10.0)
train_images = images[:nine_tenth_ind,:]
train_labels = labels[:nine_tenth_ind]
test_images = images[nine_tenth_ind:,:]
test_labels = labels[nine_tenth_ind:]
```

In [10]:

```
def find_knn1(x):
    return(find_knn(x, train_images, train_labels))
startTime = datetime.now()
print "Starting with " + str(NUM_CORES-10) + " cores, using multiprocessing"
pool = multiprocessing.Pool(NUM_CORES-10)
pred = pool.map(find_knn1, test_images)
print "Elapsed time:"
print datetime.now() - startTime
```

In [11]:

```
100*np.sum(pred==test_labels)/float(len(test_labels))
```

Out[11]:

So we end up with a test accuracy of **91.07%**. Since guessing at random would yield an accuracy of 10%, we've greatly improved on the baseline, but there is still room for improvement!

Gradient boosting trees in their simplest form are very easy to understand. Gradient boosting is a way to descend a loss function one step at a time. If we use trees, then we descend the loss function by fitting each tree to the residuals, which approximate the negative gradient. For many modern-day machine learning prediction problems, gradient boosting trees generalize very well and perform extremely well.

The steps to gradient boosting are as follows:

- Start with an initial model $F(x)$.
- Calculate negative gradients via residuals.
- Fit a regression tree $h$ to the negative gradients.
- Set $F=F+\rho\times h$, where $\rho$ is the learning rate.
- Repeat steps 2-4 until convergence.
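The steps above can be sketched for squared-error regression, where the negative gradient is exactly the residual (toy sine data and my own names; the classification version below follows the same loop per class):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 1)
y = np.sin(4 * X[:, 0])

F = np.full(len(y), y.mean())   # Step 1: initial model F(x) = mean of y
rho = 0.1                       # learning rate
for _ in range(50):
    residuals = y - F           # Step 2: residuals = negative gradient of squared loss
    h = DecisionTreeRegressor(max_depth=2).fit(X, residuals)  # Step 3: fit tree to residuals
    F = F + rho * h.predict(X)  # Step 4: take a step
mse = np.mean((y - F) ** 2)     # training error shrinks as rounds accumulate
```

Each small tree corrects what the accumulated model still gets wrong, which is why shallow trees suffice.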

The only subtlety is that for classification, we have to compute the gradients for each class. This set of slides describes this point quite well.

Below, I report the results on the digits dataset. Since this is a simplified version of gradient boosting trees, I only try it out on the simpler digits data.

In [12]:

```
# Data from https://www.kaggle.com/c/digit-recognizer
# Some of the code for loading data was borrowed from here:
# https://www.kaggle.com/kakauandme/tensorflow-deep-nn
import numpy as np
import pandas as pd
data = pd.read_csv('data/digits/train.csv')
images = data.iloc[:,1:].values
images = images.astype(np.float)
images = np.multiply(images, 1.0 / 255.0) # Normalize
labels = data["label"].values.ravel()
ind = np.arange(len(images))
np.random.shuffle(ind)
images = images[ind]; labels = labels[ind]  # Apply the shuffle before splitting
nine_tenth_ind = int(len(images) - len(images)/10.0)
train_images = images[:nine_tenth_ind,:]
train_labels = labels[:nine_tenth_ind]
test_images = images[nine_tenth_ind:,:]
test_labels = labels[nine_tenth_ind:]
```

In [17]:

```
from sklearn import tree
class tree_classifier:
    def __init__(self, num_classes, train_images, train_labels,
                 max_depth = 4, eta = 0.1):
        self.num_classes = num_classes
        self.max_depth = max_depth
        self.eta = eta
        self.forest = []
        self.train_images = train_images
        self.train_labels = train_labels
        # Use a classifier tree as the initial F
        clf = tree.DecisionTreeClassifier(max_depth = self.max_depth)
        clf = clf.fit(train_images, train_labels)
        self.clf = clf
        onehot_train_labels = np.zeros((len(train_labels), num_classes))
        onehot_train_labels[np.arange(len(train_labels)), train_labels] = 1
        self.onehot_train_labels = onehot_train_labels
        self.curr_prob = clf.predict_proba(train_images)
        self.curr_res = onehot_train_labels - self.curr_prob
    def fit_once(self):
        trees = []
        for i in range(self.num_classes):
            tree_res = self.curr_res[:,i]
            # Fit a small regression tree to each class's residuals
            curr_tree = tree.DecisionTreeRegressor(max_depth = self.max_depth)
            curr_tree.fit(self.train_images, tree_res)
            tree_prob = curr_tree.predict(self.train_images)
            self.curr_prob[:,i] = self.curr_prob[:,i] + self.eta*tree_prob
            trees.append(curr_tree)
        # Recompute the residuals (negative gradients) for the next round
        self.curr_res = self.onehot_train_labels - self.curr_prob
        self.forest.append(trees)
    def fit(self, niters, verbose = True):
        for i in range(niters):
            if (verbose):
                print "Training Accuracy: {0}".format(self.train_acc())
            self.fit_once()
    def train_acc(self):
        return(np.sum(np.argmax(self.curr_prob, axis=1)==self.train_labels)/float(len(self.train_labels)))
    def predict(self, test_images):
        test_prob = self.clf.predict_proba(test_images)
        for trees in self.forest:
            for j in range(self.num_classes):
                tree_prob = trees[j].predict(test_images)
                test_prob[:,j] = test_prob[:,j] + self.eta*tree_prob
        return(np.argmax(test_prob, axis=1))
```

In [18]:

```
clf = tree_classifier(10, train_images, train_labels)
clf.train_acc()
```

Out[18]:

In [19]:

```
clf.fit(10)
```

Test Accuracy

In [20]:

```
np.sum(clf.predict(test_images)==test_labels)/float(len(test_labels))
```

Out[20]:

While the test performance isn't that great, keep in mind that this is a very simplified model and we are still improving over the random accuracy of 10%.

Neural networks are often touted as being complicated/mysterious machine learning models. In reality, vanilla neural networks aren't that complicated. In order to test out a neural network for image classification, I implement a simple one below.

Neural networks are simply nonparametric models that were loosely inspired by neurons in the brain. Their power comes from their composability and the simplicity of the neurons.

In general, neural networks are defined by layers (sets of neurons) moving from an input layer to an output layer. Every layer between the input and the output is considered a "hidden" layer. The neurons themselves are simply activation functions such as tanh or the sigmoid function, and neurons in consecutive layers are connected by weights, which define the linear combinations fed into the neurons.

In the example below, I use one hidden layer with tanh activations on its neurons, feeding an output layer with sigmoid activations on its neurons. The layers are fully connected, which just means every neuron in layer $i$ is connected to every neuron in layer $i+1$.
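This architecture can be sketched in a few lines (toy layer sizes and my own helper names; bias units are omitted here for brevity, unlike the full implementation later):

```python
import numpy as np

def forward(x, w1, w2):
    # Fully connected hidden layer with tanh activations
    hidden = np.tanh(np.dot(x, w1))
    # Fully connected output layer with sigmoid activations
    return 1 / (1 + np.exp(-np.dot(hidden, w2)))

rng = np.random.RandomState(0)
w1 = rng.rand(4, 3) - 0.5   # weights: 4 inputs -> 3 hidden neurons
w2 = rng.rand(3, 2) - 0.5   # weights: 3 hidden -> 2 output neurons
out = forward(rng.rand(4), w1, w2)
```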

Here's a depiction:

In the end, our goal is simply to minimize a loss function whose predictions are determined by the neural network. For a particular training example, everything in the model is fixed except the weights (which are the parameters of a neural network). Since neural networks often have activation functions that make closed-form optimization impractical, in practice neural networks are trained via gradient descent.

The training of a neural network looks like the following:

- Randomly initialize weights.
- Forward pass - calculate the outputs.
- Back propagation - calculate the errors using the chain rule (working backwards).
- Repeat steps 2 and 3 until you're happy with your training/validation error.

For backpropagation, we first have to define some terms (it may be useful to look at the depiction of a neural network above):

- $f$ is the loss function.
- $\eta$ is the learning rate.
- $g^{(l)}$ is the activation function for layer $l$ (in our example it's the same for all nodes within a layer).
- $sig$ is the sigmoid function.
- $y_i$ is the true value of the output.
- $x_i^{(0)}=s_i^{(0)}$ are the input values.
- $s_i^{(l)}=\sum_j x_j^{(l-1)}w_{ji}^{(l)}$ is the pre-activation value for node $i$ in layer $l$.
- $x_i^{(l)}=g^{(l)}(s_i^{(l)})$ is the post-activation value for node $i$ in layer $l$.
- $w_{ij}^{(l)}$ is the weight from $x_i^{(l-1)}$ to $s_j^{(l)}$.

Then, in general we know that the updates we want are $w_{ij}^{(2)}\leftarrow w_{ij}^{(2)}-\eta\times \frac{df}{dw_{ij}^{(2)}}$ and $w_{ij}^{(1)}\leftarrow w_{ij}^{(1)}-\eta\times \frac{df}{dw_{ij}^{(1)}}$.

Thanks to the structure of the neural network, we can apply the chain rule and see that $\frac{df}{dw_{ij}^{(2)}}=\frac{df}{dx_{j}^{(2)}}\times \frac{dx_{j}^{(2)}}{ds_{j}^{(2)}}\times \frac{ds_{j}^{(2)}}{dw_{ij}^{(2)}}$ and $\frac{df}{dw_{ij}^{(1)}}=\frac{df}{ds_{j}^{(1)}}\times \frac{ds_{j}^{(1)}}{dw_{ij}^{(1)}}$.

Additionally, if the loss is expressed as a sum over all output nodes, $\frac{df}{ds_{j}^{(1)}}=\sum_k \frac{df}{dx_{k}^{(2)}}\times \frac{dx_{k}^{(2)}}{ds_{k}^{(2)}}\times \frac{ds_{k}^{(2)}}{ds_{j}^{(1)}}$.

So we can re-express $\frac{df}{dw_{ij}^{(1)}}=\frac{ds_{j}^{(1)}}{dw_{ij}^{(1)}}\times \sum_k \frac{df}{dx_{k}^{(2)}}\times \frac{dx_{k}^{(2)}}{ds_{k}^{(2)}}\times \frac{ds_{k}^{(2)}}{ds_{j}^{(1)}}$.

Then, in my network I simply use a mean-squared error loss (although perhaps binary cross-entropy would be more appropriate). At this point, we can simply plug in to find:

$\frac{df}{dw_{ij}^{(2)}}=\frac{df}{dx_{j}^{(2)}}\times \frac{dx_{j}^{(2)}}{ds_{j}^{(2)}}\times \frac{ds_{j}^{(2)}}{dw_{ij}^{(2)}}=(x_j^{(2)}-y_j)g'(s_j^{(2)})x_i^{(1)}=(x_j^{(2)}-y_j)sig(s_j^{(2)})(1-sig(s_j^{(2)}))x_i^{(1)}$

$\frac{df}{dw_{ij}^{(1)}}=\frac{ds_{j}^{(1)}}{dw_{ij}^{(1)}}\times \sum_k \frac{df}{dx_{k}^{(2)}}\times \frac{dx_{k}^{(2)}}{ds_{k}^{(2)}}\times \frac{ds_{k}^{(2)}}{ds_{j}^{(1)}}=x_i^{(0)}\times(1-\tanh^2(s_j^{(1)}))\times\sum_k (x_k^{(2)}-y_k)sig(s_k^{(2)})(1-sig(s_k^{(2)}))w_{jk}^{(2)}$

In order to implement the neural network, you'll have to convert these to matrix equations. I found this video series helpful for deriving them.
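One way to check a backpropagation derivation like the one above is to compare the analytic gradient against a finite-difference estimate. Here is a hedged sketch with toy shapes and my own helper names, checking the formula for $\frac{df}{dw_{ij}^{(2)}}$ under a squared-error loss:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(x, y, w1, w2):
    # tanh hidden layer, sigmoid output layer, squared-error loss
    x1 = np.tanh(np.dot(x, w1))
    x2 = sigmoid(np.dot(x1, w2))
    return 0.5 * np.sum((x2 - y) ** 2)

rng = np.random.RandomState(0)
x = rng.rand(4); y = rng.rand(2)
w1 = rng.rand(4, 3) - 0.5
w2 = rng.rand(3, 2) - 0.5

# Analytic gradient for the output weights, following the derivation:
# df/dw2[i, j] = (x2_j - y_j) * sig'(s2_j) * x1_i
x1 = np.tanh(np.dot(x, w1))
x2 = sigmoid(np.dot(x1, w2))
delta2 = (x2 - y) * x2 * (1 - x2)   # sig'(s) = sig(s)(1 - sig(s)) = x2(1 - x2)
grad_w2 = np.outer(x1, delta2)

# Finite-difference estimate for one weight entry
eps = 1e-6
w2p = w2.copy(); w2p[0, 0] += eps
w2m = w2.copy(); w2m[0, 0] -= eps
numeric = (loss(x, y, w1, w2p) - loss(x, y, w1, w2m)) / (2 * eps)
```

If the derivation is right, `numeric` and `grad_w2[0, 0]` agree to several decimal places.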

In [8]:

```
# Global Variables/Imports
import random, time, csv, cv2, os
from matplotlib import pyplot as plt
from datetime import datetime
import numpy as np
%matplotlib inline
num_hidden_nodes = 200
num_input_nodes = 784
# Sigmoid function
def sigmoid(x):
    return 1/(1 + np.exp(-x))
# Derivative of sigmoid
def sig_prime(x):
    return np.multiply(sigmoid(x), 1-sigmoid(x))
# Derivative of tanh
def tanh_prime(x):
    return 1 - np.square(np.tanh(x))
# Backward pass that finds the errors
def backward_pass(x_two, s_two, w_two, s_one, y):
    delta_two = np.multiply(x_two - y, sig_prime(s_two)) # MSE
    w_two_short = w_two[0:num_hidden_nodes,:] # Chop off bias
    delta_one = np.multiply(np.dot(delta_two, w_two_short.T), tanh_prime(s_one))
    return (delta_two, delta_one)
# Forward pass that calculates the new state
def forward_pass(x_zero, w_one, w_two):
    s_one = np.dot(x_zero, w_one)
    x_one = np.tanh(s_one)
    x_one = np.array([np.append(x_one[0], 1)]) # Append the bias unit
    s_two = np.dot(x_one, w_two)
    x_two = sigmoid(s_two) # Sigmoid output layer, matching the derivation above
    return(s_one, x_one, s_two, x_two)
# Find accuracy given images and labels
def test_accuracy(w_one, w_two, images, labels):
    counter = 0
    for i in range(0, len(images)):
        x_in = np.array([np.append(images[i], 1)])
        # Predict
        s_one, x_one, s_two, x_two = forward_pass(x_in, w_one, w_two)
        prediction = np.argmax(x_two)
        if prediction == labels[i]:
            counter = counter + 1
    return counter/float(len(images))
# Print and save progress as needed (currently every 10000 iterations)
def print_save_progress(i, t0, w_two, w_one, images, labels):
    if ((i+1) % 10000 == 0):
        accuracy = test_accuracy(w_one, w_two, images, labels)
        print "Trial: " + str(i+1) + " Accuracy: " + str(accuracy)
        np.savetxt("w_one.csv", w_one, delimiter=",")
        np.savetxt("w_two.csv", w_two, delimiter=",")
        print "Time Elapsed: " + str(time.time() - t0)
# Train the neural network, saving the weights as it goes
def train_neural_network(images, labels, eta, niter):
    print "TRAINING NEURAL NETWORK"
    t0 = time.time()
    # Initialize weights uniformly in [-.1, .1], with an extra row for the bias terms
    w_one = (np.random.rand(num_input_nodes+1, num_hidden_nodes)*.2)-.1
    w_two = (np.random.rand(num_hidden_nodes+1, 10)*.2)-.1
    for i in range(niter):
        # Pick a random data point
        index = random.randint(0, len(labels)-1)
        x_zero = np.array([np.append(images[index], 1)])
        y = np.array([np.zeros(10)])
        y[0, labels[index]] = 1
        # Forward pass
        s_one, x_one, s_two, x_two = forward_pass(x_zero, w_one, w_two)
        # Backward pass
        delta_two, delta_one = backward_pass(x_two, s_two, w_two, s_one, y)
        # Weight update
        w_two = w_two - eta*np.dot(x_one.T, delta_two)
        w_one = w_one - eta*np.dot(x_zero.T, delta_one)
        # Print out progress
        print_save_progress(i, t0, w_two, w_one, images, labels)
```

I end up using the MNIST digit dataset on kaggle since it's a much simpler problem for the neural network to deal with.

In [9]:

```
# Data from https://www.kaggle.com/c/digit-recognizer
# Some of the code for loading data was borrowed from here:
# https://www.kaggle.com/kakauandme/tensorflow-deep-nn
import pandas as pd
data = pd.read_csv('data/digits/train.csv')
images = data.iloc[:,1:].values
images = images.astype(np.float)
images = np.multiply(images, 1.0 / 255.0) # Normalize
labels = data["label"].values.ravel()
ind = np.arange(len(images))
np.random.shuffle(ind)
images = images[ind]; labels = labels[ind]  # Apply the shuffle before splitting
nine_tenth_ind = int(len(images) - len(images)/10.0)
train_images = images[:nine_tenth_ind,:]
train_labels = labels[:nine_tenth_ind]
test_images = images[nine_tenth_ind:,:]
test_labels = labels[nine_tenth_ind:]
```

In [10]:

```
import matplotlib.cm as cm
# Display an image
def display(img):
    one_image = img.reshape(28,28)
    plt.axis('off')
    plt.imshow(one_image, cmap=cm.binary)
# Output image
display(images[100])
```

In [7]:

```
train_neural_network(train_images, train_labels, .01, 100000)
```

In [11]:

```
# Load the parameters and test the accuracy
w_one = np.genfromtxt('w_one.csv', delimiter=',')
w_two = np.genfromtxt('w_two.csv', delimiter=',')
accuracy = test_accuracy(w_one, w_two, test_images, test_labels)
print "Test Accuracy: {0}".format(accuracy)
```

The test accuracy on the digits dataset is very good. I tried the same neural network on the cats vs. dogs dataset, but the training was not promising: a vanilla neural network on mostly raw image features doesn't perform well given the complexity of that problem.

In [ ]:

```
'''
X = np.load("X.npy")
Y = np.load("y.npy")
# Partition into test and training sets
ind = np.arange(len(X))
np.random.shuffle(ind)
nine_tenth_ind = int(len(X) - len(X)/10.0)
trainX = X[:nine_tenth_ind,:]
trainY = Y[:nine_tenth_ind]
testX = X[nine_tenth_ind:,:]
testY = Y[nine_tenth_ind:]
Sample output when trying to do cats v dogs
TRAINING NEURAL NETWORK
Trial: 10000 Accuracy: 0.494
Trial: 20000 Accuracy: 0.494222222222
Trial: 30000 Accuracy: 0.522711111111
Trial: 40000 Accuracy: 0.504088888889
Trial: 50000 Accuracy: 0.534444444444
Trial: 60000 Accuracy: 0.520133333333
Trial: 70000 Accuracy: 0.517244444444
Trial: 80000 Accuracy: 0.505911111111
Trial: 90000 Accuracy: 0.4972
Trial: 100000 Accuracy: 0.5212
'''
```