Introduction to Deep Learning

By Kai-Zhan Lee

Welcome! This guide is a quick introduction to the theory and applications of deep learning.

A brief topic summary:

  • Machine Learning Basics
  • History of Deep Learning
  • Feedforward Neural Networks
  • Handwritten Digit Classification

Before we get started...

Why should you care?

(Aside from landing that killer machine learning job and making \$\$\$\$)

Cool Applications:

  • Finding things you like (Google, Facebook, etc.)

  • Speech synthesis and speech recognition (Siri, Alexa, etc.)

  • Identifying those in mental distress through social media (current research)

Deep learning models certain aspects of the human brain.

It helps solve problems that only humans could solve before!

Deep learning is growing -- fast

In 2012, deep learning was considered a nice mathematical escape from reality that only researchers investigated.

Today, it approaches ubiquity in research and industry alike.

Tinder (of course!)


Even this actress... has published a paper on deep learning.

Just don't be this guy...

XKCD, "Machine Learning"

Machine Learning Background

What is data science? (Thanks Robbie!)

Put simply: Finding meaning from data.

What is artificial intelligence (AI)?

"... what we want is a machine that can learn from experience."

- Alan Turing, 1947

Artificial intelligence: a perceiving agent within an environment that takes actions to maximize its chances of achieving a specific set of goals.

Remember PEAS!

  • Performance Measure: how well is the agent acting to achieve its goals?
  • Environment: what is there besides the agent?
  • Actuators: what actions can the agent perform?
  • Sensors: what does the agent perceive?

4 Kinds of AI

                 Thinking            Acting
  Naturally      Emotion, belief     Running, flying, swimming
  Rationally     Logic, proofs       Decisions, choices

In this talk, we'll examine rational actors.

What is machine learning?

Machine learning lies at the intersection of artificial intelligence and data science.

Formulation

Assume we have input and output sets $\mathcal{X}$ and $\mathcal{Y}$ and a distribution $\mathcal{D}: \mathcal{X} \times \mathcal{Y} \to \mathbb R$. Given $n$ datapoints drawn independently and identically from this distribution, we attempt to find a function $f: \mathcal{X} \to \mathcal{Y}$ that minimizes the training error, or loss.

There are many types of loss:

  • Mean squared error: $\mathbb E_{(x, y) \sim \mathcal{D}} \left[\lVert f(x) - y \rVert_2^2\right]$.
  • Cross-entropy loss: $\mathbb E_{(x, y) \sim \mathcal{D}} \left[\lVert -y \log f(x) \rVert_1\right]$.
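
In practice, these expectations are approximated by averaging over the training set. As a concrete illustration, here is a minimal NumPy sketch of both losses for a single prediction/target pair; the numbers are made up for this example:

import numpy as np

def mean_squared_error(y_pred, y_true):
    """Squared L2 distance between prediction and target."""
    return np.sum((y_pred - y_true) ** 2)

def cross_entropy(y_pred, y_true):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return np.sum(-y_true * np.log(y_pred))

# Hypothetical 3-class example: the true class is index 1.
y_true = np.array([0.0, 1.0, 0.0])
y_pred = np.array([0.1, 0.7, 0.2])
print(mean_squared_error(y_pred, y_true))  # 0.14
print(cross_entropy(y_pred, y_true))       # -log(0.7), about 0.357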

Deep Learning History

Threshold Logic

Deep learning has had a long history of iterative redesign and improvement.

  • 1943: Pioneers in mathematically modelling the brain, Walter Pitts and Warren McCulloch propose the Threshold Logic Unit (TLU), a linear model with adjustable, but non-learnable parameters $t$ and $w_1 \dots w_d$, where $d$ is the dimensionality of the input vector $\vec x$.
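
A minimal Python sketch of a TLU, with hand-set (purely illustrative) weights and threshold that happen to make it compute logical AND:

import numpy as np

def tlu(x, w, t):
    """Threshold Logic Unit: fire (1) if the weighted sum reaches threshold t."""
    return 1 if np.dot(w, x) >= t else 0

# With these hand-chosen parameters, the TLU implements logical AND.
w = np.array([1.0, 1.0])
t = 2.0
print([tlu(np.array(x), w, t) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]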

Introduction to Learning

  • 1947: "... what we want is a machine that can learn from experience." - Alan Turing
  • 1957: Frank Rosenblatt invents the Linear Perceptron (LP), making the TLU's parameters learnable.

Is this guaranteed to converge? (i.e. give a solution for $\vec w$?)
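
It is, but only when the data are linearly separable (the perceptron convergence theorem); otherwise the weights never settle. Below is a minimal sketch of the classic perceptron update rule on a toy, linearly separable dataset; the data and learning rate are invented for illustration:

import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    """Rosenblatt's rule: nudge the weights whenever a point is misclassified."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):                 # labels yi in {-1, +1}
            if yi * (np.dot(w, xi) + b) <= 0:    # misclassified (or on the boundary)
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Toy, linearly separable data: class is +1 only for the point (1, 1).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(np.sign(X @ w + b))  # [-1. -1. -1.  1.]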

Differential Learning

  • 1958: David Cox invents Logistic Regression (LR) for perceptrons, using gradient descent to find optimal weights. He uses cross-entropy loss to measure model performance.
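
As a rough illustration of that idea (not Cox's original formulation), here is a minimal sketch of binary logistic regression trained by full-batch gradient descent on cross-entropy loss; the toy data and hyperparameters are invented:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.5, epochs=500):
    """Minimize mean cross-entropy loss by gradient descent on w and b."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)            # predicted P(y = 1 | x)
        grad_w = X.T @ (p - y) / len(y)   # gradient of the loss w.r.t. w
        grad_b = np.mean(p - y)           # gradient of the loss w.r.t. b
        w -= lr * grad_w                  # step against the gradient
        b -= lr * grad_b
    return w, b

# Toy data: the label is 1 when the single feature exceeds roughly 0.5.
X = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = train_logistic_regression(X, y)
print(np.round(sigmoid(X @ w + b), 2))  # predicted probabilities rise with the feature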

Structure of a Neural Net

So what is deep learning anyway?

Deep learning is the study of neural networks, models that "learn from experience and understand the world in terms of a hierarchy of concepts."

A neural network consists of multiple layers. Each layer (aside from the input layer) consists of independent perceptrons that take in the previous layer's output as their input. There are 3 main divisions of layers:

  • Input Layer: our input vector $\vec x \in \mathcal{X}$.
  • Hidden Layer(s): vector(s) $\vec h_1 \dots \vec h_n$ of fixed size. Each element of each vector is its own perceptron with its own weights. We refer to each perceptron as a "node".
  • Output Layer: our predicted output vector $\vec{y}_{pred} \in \mathcal{Y}$.

Sample Visualization

Each column of circles is a vector; each circle with arrows pointing to it is a node (perceptron) taking input from the arrows' sources (the previous layer). This structure is called a feedforward neural network (FFNN, or NN for short), because the input is fed forward through the model and, well, it's a neural network. As with logistic regression, weights are learned with gradient descent: for each weight, we take the partial derivative $\frac{\partial L}{\partial w}$ and move in the opposite direction to minimize loss.

This particular NN has two layers: one hidden layer of 5 nodes and one output layer of 2 nodes. Note that the input (a 3-vector) doesn't count as a layer.
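
To make the picture concrete, here is a minimal NumPy sketch of the forward pass for a network of this shape (3 inputs, a 5-node hidden layer, a 2-node output layer). The weights are random placeholders rather than learned values, and training/backpropagation is omitted:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Randomly initialized parameters for a 3 -> 5 -> 2 network.
W1, b1 = rng.normal(size=(5, 3)), np.zeros(5)   # hidden layer: 5 nodes, 3 inputs each
W2, b2 = rng.normal(size=(2, 5)), np.zeros(2)   # output layer: 2 nodes, 5 inputs each

def forward(x):
    """Feed the input forward through the network, layer by layer."""
    h = sigmoid(W1 @ x + b1)     # hidden layer activations
    return sigmoid(W2 @ h + b2)  # predicted output vector

x = np.array([0.5, -1.0, 2.0])   # an arbitrary 3-dimensional input
print(forward(x))                # a 2-dimensional prediction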

Deep Learning with Keras

MNIST

Let's try out the MNIST dataset: a classic collection of 28×28 grayscale images of handwritten digits.

Python Packages: python-mnist (our data!), numpy (matrix library), matplotlib (a good plotting tool in Python), and keras (our neural network library)

In [259]:
import numpy as np
import cv2 as cv
from skimage.measure import block_reduce
from matplotlib import pyplot as plt
from mnist import MNIST
%matplotlib inline

# Load data
loader = MNIST(gz=True)
trX, trY = map(np.array, loader.load_training())
teX, teY = map(np.array, loader.load_testing())

What does this actually look like?

In [212]:
# Show the first MNIST training image
img = plt.imshow(trX[0].reshape(28, 28), cmap='gray')
print('True Label:', trY[0])
True Label: 5

Setup

Training a model in Keras is simple: specify a model structure, optimizer, and loss function, and you're ready to start training!

In [252]:
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential([
    Dense(512, activation='sigmoid', input_dim=trX.shape[1]),
    Dropout(0.5),
    Dense(128, activation='sigmoid'),
    Dropout(0.5),
    Dense(10, activation='softmax')
])
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(trX, trY, validation_split=0.1, epochs=20, batch_size=256)
Train on 48600 samples, validate on 5400 samples
Epoch 1/20
48600/48600 [==============================] - 4s 78us/step - loss: 1.0246 - acc: 0.6840 - val_loss: 0.4162 - val_acc: 0.8846
Epoch 2/20
48600/48600 [==============================] - 4s 75us/step - loss: 0.5051 - acc: 0.8523 - val_loss: 0.3234 - val_acc: 0.9091
Epoch 3/20
48600/48600 [==============================] - 4s 77us/step - loss: 0.4372 - acc: 0.8710 - val_loss: 0.2868 - val_acc: 0.9154
Epoch 4/20
48600/48600 [==============================] - 4s 75us/step - loss: 0.4003 - acc: 0.8790 - val_loss: 0.2705 - val_acc: 0.9230
Epoch 5/20
48600/48600 [==============================] - 4s 82us/step - loss: 0.3719 - acc: 0.8879 - val_loss: 0.2510 - val_acc: 0.9256
Epoch 6/20
48600/48600 [==============================] - 4s 81us/step - loss: 0.3582 - acc: 0.8941 - val_loss: 0.2340 - val_acc: 0.9302
Epoch 7/20
48600/48600 [==============================] - 4s 81us/step - loss: 0.3389 - acc: 0.8981 - val_loss: 0.2264 - val_acc: 0.9343
Epoch 8/20
48600/48600 [==============================] - 4s 82us/step - loss: 0.3262 - acc: 0.9009 - val_loss: 0.2271 - val_acc: 0.9307
Epoch 9/20
48600/48600 [==============================] - 4s 76us/step - loss: 0.3167 - acc: 0.9044 - val_loss: 0.2126 - val_acc: 0.9357
Epoch 10/20
48600/48600 [==============================] - 4s 81us/step - loss: 0.3100 - acc: 0.9064 - val_loss: 0.2061 - val_acc: 0.9383
Epoch 11/20
48600/48600 [==============================] - 4s 77us/step - loss: 0.2980 - acc: 0.9097 - val_loss: 0.2155 - val_acc: 0.9381
Epoch 12/20
48600/48600 [==============================] - 4s 80us/step - loss: 0.2959 - acc: 0.9110 - val_loss: 0.1980 - val_acc: 0.9404
Epoch 13/20
48600/48600 [==============================] - 4s 83us/step - loss: 0.2881 - acc: 0.9134 - val_loss: 0.1903 - val_acc: 0.9435
Epoch 14/20
48600/48600 [==============================] - 4s 77us/step - loss: 0.2792 - acc: 0.9153 - val_loss: 0.1929 - val_acc: 0.9420
Epoch 15/20
48600/48600 [==============================] - 4s 85us/step - loss: 0.2763 - acc: 0.9180 - val_loss: 0.1857 - val_acc: 0.9430
Epoch 16/20
48600/48600 [==============================] - 4s 80us/step - loss: 0.2772 - acc: 0.9156 - val_loss: 0.1891 - val_acc: 0.9430
Epoch 17/20
48600/48600 [==============================] - 4s 86us/step - loss: 0.2687 - acc: 0.9177 - val_loss: 0.1845 - val_acc: 0.9439
Epoch 18/20
48600/48600 [==============================] - 5s 96us/step - loss: 0.2701 - acc: 0.9185 - val_loss: 0.1864 - val_acc: 0.9406
Epoch 19/20
48600/48600 [==============================] - 4s 87us/step - loss: 0.2604 - acc: 0.9212 - val_loss: 0.1801 - val_acc: 0.9470
Epoch 20/20
48600/48600 [==============================] - 4s 82us/step - loss: 0.2532 - acc: 0.9246 - val_loss: 0.1720 - val_acc: 0.9485
Out[252]:
<keras.callbacks.History at 0x1120dd278>
In [254]:
model.evaluate(teX, teY)
10000/10000 [==============================] - 0s 39us/step
Out[254]:
[0.14467686785683037, 0.9552]

Drawing input

Let's try drawing an input and seeing how the network does!

In [354]:
sketch = Sketcher()
img = sketch.get_image()
plt.imshow(img, cmap='gray')
print('Prediction:', model.predict(np.array([img.flatten()])).argmax())
Read image...
Prediction: 2
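
The Sketcher class used above is a small drawing helper defined elsewhere in the notebook; its code isn't shown here. If you don't have it handy, a minimal alternative is to sanity-check the model on held-out test images instead, using the arrays we already loaded:

# Pick an arbitrary test image, display it, and compare the prediction to the truth.
idx = 0
plt.imshow(teX[idx].reshape(28, 28), cmap='gray')
pred = model.predict(teX[idx:idx + 1]).argmax()
print('Prediction:', pred, '| True label:', teY[idx])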

Recap

  • An artificial intelligence agent is characterized by PEAS: a performance measure, an environment, actuators, and sensors.
  • Machine learning is the intersection of AI and data science. It poses the problem of modeling a distribution: mapping input $x \in \mathcal{X}$ to output $y \in \mathcal{Y}$ while minimizing a loss function.
  • Logistic regression perceptrons are classifiers that linearly separate data and yield a continuous probability $y \in [0, 1]$.
    • Parameters are learned through backpropagation: taking the partial derivative of the loss function with respect to each parameter, and changing the parameters in the opposite direction of the derivative, in order to minimize loss.
  • Deep learning is a subset of machine learning that studies neural networks, which hierarchically represent data in layers of increasing abstraction, eventually culminating in a predicted output $y \in \mathcal{Y}$. Optimal parameter settings are learned through backpropagation as well.
    • Feedforward neural networks use layers of logistic regression perceptrons to represent meaning.

Where to next?

Machine Learning Resources:

Deep Learning:

Research papers: scholar.google.com is your best friend! Here's a starter set of papers.

The whole internet's out there to help you; feel free to use it!

Contact

Kai-Zhan Lee

kl2792@columbia.edu

The best way to learn how to do something is... to do it! So please reach out if you want some advice on a deep learning/machine learning project, or if you want to brainstorm ideas for one!

If you have any questions about starting summer research, please feel free to reach out as well.