M-% || M-x query-replace

Sachin’s random thoughts.

07 Feb 2024

Rust WASM Neural Net PART 2 - Model

General Principles

Efficiency

The model should not only be efficient in network structure1, but also in implementation (avoid clone at all costs)2.
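To make the clone point concrete (a toy illustration, not code from the model - the scale functions here are made up): taking an array by value lets ndarray reuse its buffer in place, while borrowing and cloning pays for a whole second allocation.

use ndarray::Array1;

// Takes ownership: mapv_into reuses the existing buffer, no extra allocation.
fn scale_owned(x: Array1<f64>, k: f64) -> Array1<f64> {
    x.mapv_into(|v| v * k)
}

// Borrows and clones: allocates a second array just to do the same work.
fn scale_cloned(x: &Array1<f64>, k: f64) -> Array1<f64> {
    x.clone().mapv_into(|v| v * k)
}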

Simplicity

The model should be effective without being overly complex. Simple, clean, maintainable3 code is the way to go.

From Scratch

Write from scratch4 to avoid loading massive libraries5. There is no need to bring in tch-rs or something similar.

Activation Functions

Code

I only use two activation functions6 for speed of computation - ReLU and LogSoftmax. I found that you can get pretty good accuracy with just these two.

$$ x \text{ is the input vector} $$

ReLU

Forward

$$\text{ReLU}(x) = \max(0, x)$$

1D Implementation

pub fn relu1d(x: Array1<f64>) -> Array1<f64> {
    x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}

2D Implementation

pub fn relu2d(x: Array2<f64>) -> Array2<f64> {
    x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}
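A quick usage sketch (the input values are just illustrative, assuming the functions above are in scope):

use ndarray::array;

fn main() {
    let v = array![-1.5, 0.0, 2.0];
    let m = array![[-1.0, 3.0], [0.5, -0.2]];
    // Negative entries are clamped to zero; everything else passes through.
    assert_eq!(relu1d(v), array![0.0, 0.0, 2.0]);
    assert_eq!(relu2d(m), array![[0.0, 3.0], [0.5, 0.0]]);
}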

Backward7

$$ \text{ReLU}'(x) = \begin{cases} 0 & x < 0 \newline 1 & x > 0 \end{cases} $$

1D Implementation

pub fn relu_backward1d(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
    x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 }) * y
}

2D Implementation

pub fn relu_backward2d(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 }) * y
}
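In use, the backward pass just masks the upstream gradient wherever the input was not positive (again, illustrative values):

use ndarray::array;

fn main() {
    let x = array![-1.0, 0.5, 2.0];        // pre-activation input
    let upstream = array![0.1, 0.2, 0.3];  // gradient flowing back from the next layer
    // Gradient only flows where x > 0.
    assert_eq!(relu_backward1d(x, upstream), array![0.0, 0.2, 0.3]);
}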

LogSoftmax8

Forward

$$ \begin{split} \text{Softmax}(x_i) &= \frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \newline \text{LogSoftmax}(x_i) &= \log{\left(\text{Softmax}(x_i)\right)} \newline &= \log{\left(\frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}}\right)} \newline &= \log{\left(e^{x_i - \max{(x)}}\right)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= x_i - \max{(x)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \end{split} $$

1D Implementation

pub fn logsoftmax1d(x: Array1<f64>) -> Array1<f64> {
    let max_x = x.fold(f64::NAN, |a, b| a.max(*b));
    let diff_x = x.mapv(|x| x - max_x);
    let sum_x = diff_x.mapv(f64::exp).sum().ln();
    diff_x.mapv(|x| x - sum_x)
}

2D Implementation

pub fn logsoftmax2d(x: Array2<f64>) -> Array2<f64> {
    let max_x = x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b));
    let diff_x = &x - max_x.insert_axis(Axis(1));
    let sum_x = diff_x.mapv(f64::exp).sum_axis(Axis(1));
    let log_sum_x = sum_x.mapv(f64::ln);
    diff_x - log_sum_x.insert_axis(Axis(1))
}
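A quick sanity check (with made-up logits): exponentiating the LogSoftmax output gives Softmax, so every row should sum to 1.

use ndarray::{array, Axis};

fn main() {
    let logits = array![[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]];
    let log_probs = logsoftmax2d(logits);
    // exp(LogSoftmax(x)) = Softmax(x), so each row is a probability distribution.
    let row_sums = log_probs.mapv(f64::exp).sum_axis(Axis(1));
    for &s in row_sums.iter() {
        assert!((s - 1.0).abs() < 1e-12);
    }
}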

Backward9

Rules10

Log Rule

$$ \frac{\partial}{\partial a} \log(a) = \frac{1}{a} $$

Euler Number Rule

$$ \frac{\partial}{\partial a} e^a = e^a $$

Chain Rule

$$ \frac{\partial}{\partial a} f(g(a)) = f'(g(a)) \cdot g'(a) $$

Sum Rule11

$$ \frac{\partial}{\partial a_i} \sum_{j=1}^{\text{len}(a)} f(a_j) = \frac{\partial}{\partial a_i} f(a_i) $$
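For a concrete instance (my own example), take three terms and differentiate with respect to the second one - every term not involving it vanishes:

$$ \frac{\partial}{\partial a_2} \left( f(a_1) + f(a_2) + f(a_3) \right) = 0 + f'(a_2) + 0 = f'(a_2) $$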

Putting it together

$$ \begin{split} \text{LogSoftmax}(x_i) &= x_i - \max{(x)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline \text{LogSoftmax}'(x_i) &= \frac{\partial x_i}{\partial x_i} - \frac{\partial \max{(x)}}{\partial x_i} - \frac{\partial}{\partial x_i} \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= 1 - \frac{\partial \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)}}{\partial x_i} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot \frac{\partial \left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)}{\partial x_i} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot \frac{\partial e^{x_i - \max{(x)}}}{\partial x_i} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot e^{x_i - \max{(x)}} \newline &= 1 - \frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \newline &= 1 - \text{Softmax}(x_i) \end{split} $$
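Two details worth spelling out before the code: the max(x) term is treated as a constant with respect to x_i (it is only there for numerical stability, per footnote 8), and the result above is just the diagonal case where the output index matches the input index. Differentiating output i with respect to a different input j drops the leading 1, giving the full Jacobian:

$$ \frac{\partial \, \text{LogSoftmax}(x_i)}{\partial x_j} = \delta_{ij} - \text{Softmax}(x_j) $$

The backward pass below is the vector-Jacobian product of the incoming gradient y with this matrix.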

1D Implementation

pub fn logsoftmax_backward1d(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
    // Recompute Softmax(x), using the same max-subtraction trick as the forward pass.
    let softmax_x = (&x - x.fold(f64::NAN, |a, b| a.max(*b))).mapv(f64::exp);
    let softmax_sum = softmax_x.sum();
    let softmax = softmax_x / softmax_sum;
    // Jacobian: derivative[i][j] = delta_ij - softmax[j].
    let n = x.len();
    let delta_ij = Array2::eye(n);
    let softmax_matrix = softmax.broadcast(n).unwrap().to_owned();
    let derivative = &delta_ij - &softmax_matrix;
    // Vector-Jacobian product with the upstream gradient y.
    y.dot(&derivative)
}

2D Implementation12

pub fn logsoftmax_backward2d(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    // Recompute the row-wise softmax of x (with the max-subtraction trick).
    let softmax_x = (&x
        - &x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b))
            .insert_axis(Axis(1)))
        .mapv(f64::exp);
    let softmax_sum = softmax_x.sum_axis(Axis(1)).insert_axis(Axis(1));
    let softmax = softmax_x / &softmax_sum;
    let n = x.shape()[1];
    let m = x.shape()[0];
    // Per-row Jacobian: derivative[b][i][j] = delta_ij - softmax[b][j], shape (m, n, n).
    let inner_delta_ij: Array2<f64> = Array2::eye(n);
    let delta_ij = inner_delta_ij.broadcast((m, n, n)).unwrap().to_owned();
    let softmax_matrix = softmax
        .insert_axis(Axis(1))
        .broadcast((m, n, n))
        .unwrap()
        .to_owned();
    let derivative = &delta_ij - &softmax_matrix;
    // Row-by-row vector-Jacobian products with the upstream gradient y, restacked to (m, n).
    stack(
        Axis(0),
        &y.axis_iter(Axis(0))
            .zip(derivative.axis_iter(Axis(0)))
            .map(|(y, derivative)| y.dot(&derivative))
            .collect::<Vec<Array1<f64>>>()
            .iter()
            .map(|x| x.view())
            .collect::<Vec<ArrayView1<f64>>>(),
    )
    .unwrap()
}
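On footnote 12's point: because each Jacobian row is the identity row minus the softmax vector, the vector-Jacobian product collapses to y minus softmax scaled by the row sums of y, so the (m, n, n) Jacobian never has to be materialized. A sketch of what that could look like (the function name is just for illustration, not code from the project):

use ndarray::{Array2, Axis};

// Sketch: same vector-Jacobian product as above, without building the
// (m, n, n) Jacobian. grad[b][j] = y[b][j] - softmax[b][j] * sum_i y[b][i].
pub fn logsoftmax_backward2d_fused(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    let shifted = &x
        - &x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b))
            .insert_axis(Axis(1));
    let exp = shifted.mapv(f64::exp);
    let softmax = &exp / &exp.sum_axis(Axis(1)).insert_axis(Axis(1));
    let y_row_sums = y.sum_axis(Axis(1)).insert_axis(Axis(1));
    &y - &(softmax * &y_row_sums)
}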

Network

Code

I defined just two layers13:

[Network diagram: Binary Input (784) → ReLU layer → LogSoftmax layer → Max → Output (0-9)]

The code for the net is not as dense as the activation functions, but there are a few tricks that make it efficient.14
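As a rough sketch of the shape that code takes (the Network struct, the w1/b1/w2/b2 field names, and the 784 → 128 → 10 sizes here are illustrative assumptions, not the actual definitions), a single-example forward pass just chains the pieces above:

use ndarray::{Array1, Array2};

// Illustrative only - the real net has its own structure and training code.
pub struct Network {
    w1: Array2<f64>, // first-layer weights, e.g. (784, 128)
    b1: Array1<f64>, // first-layer bias, e.g. (128,)
    w2: Array2<f64>, // second-layer weights, e.g. (128, 10)
    b2: Array1<f64>, // second-layer bias, e.g. (10,)
}

impl Network {
    // Forward pass for one example: Linear -> ReLU -> Linear -> LogSoftmax.
    pub fn forward(&self, x: Array1<f64>) -> Array1<f64> {
        let hidden = relu1d(x.dot(&self.w1) + &self.b1);
        logsoftmax1d(hidden.dot(&self.w2) + &self.b2)
    }
}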


  1. Looking back, I did not get the performance I wanted - around 90% accuracy. This is mostly because I settled on an architecture that worked well with float inputs rather than binary inputs, but I switched to binary inputs when creating the final project (as they were much more intuitive on the frontend). ↩︎

  2. The idea is mostly to transfer ownership as much as possible. I also try to keep everything in ndarray to let it do its magic. ↩︎

  3. I did not do as good a job with this as I wanted. Looking back, I wish I had made the model more generalizable and structured it as a small neural net engine. ↩︎

  4. I do, however, use the ndarray crate. ↩︎

  5. However, if I were to do it again, I would use candle, as it is pretty light and would possibly have better performance. ↩︎

  6. I also only have 2 layers ↩︎

  7. Technically, ReLU is not differentiable at 0, but this is the conventional function used during backprop. ↩︎

  8. This differs from the fully traditional logsoftmax, as I am using the max of the elements as a normalizing term in Softmax. ↩︎

  9. Yann LeCun assigning it as an exercise ↩︎

  10. These are not all official rules, but concepts that I use next. ↩︎

  11. To expand on this further: because you are taking the partial derivative with respect to the i-th term, the derivatives of all the non-i-th terms are 0. ↩︎

  12. I think there is some efficiency lost here, specifically in how I aggregate the terms at the end of the function. ↩︎

  13. This was surprisingly accurate with the full float values of the inputs - around 98%. ↩︎

  14. Specifically around memory management and reusing previous computations. Training can be made much more efficient than my implementation, however. ↩︎