M-% || M-x query-replace

Sachin’s random thoughts.

07 Feb 2024

Rust WASM Neural Net PART 2 - Model

General Principles

Efficiency

The model should not only be efficient in network structure1, but also in implementation (avoid clone at all costs)2.
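To make the clone point concrete (a toy illustration, not code from the model - the scale functions here are made up): taking an array by value lets ndarray reuse its buffer in place, while borrowing and cloning pays for a whole second allocation.

use ndarray::Array1;

// Takes ownership: mapv_into reuses the existing buffer, no extra allocation.
fn scale_owned(x: Array1<f64>, k: f64) -> Array1<f64> {
    x.mapv_into(|v| v * k)
}

// Borrows and clones: allocates a second array just to do the same work.
fn scale_cloned(x: &Array1<f64>, k: f64) -> Array1<f64> {
    x.clone().mapv_into(|v| v * k)
}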

Simplicity

The model should be effective without being overly complex. Simple, clean, maintainable3 code is the way to go.

From Scratch

Write from scratch4 to avoid loading massive libraries5. There is no need to bring in tch-rs or something similar.

Activation Functions

Code

I only use two activation functions6 for speed of computation - ReLU and LogSoftmax. I found that you can get pretty good accuracy with just these two.

$$ x \text{ is the input vector} $$

ReLU

Forward

$$\text{ReLU}(x) = \max(0, x)$$

1D Implementation

pub fn relu1d(x: Array1<f64>) -> Array1<f64> {
    x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}

2D Implementation

pub fn relu2d(x: Array2<f64>) -> Array2<f64> {
    x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}
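A quick usage sketch (the input values are just illustrative, assuming the functions above are in scope):

use ndarray::array;

fn main() {
    let v = array![-1.5, 0.0, 2.0];
    let m = array![[-1.0, 3.0], [0.5, -0.2]];
    // Negative entries are clamped to zero; everything else passes through.
    assert_eq!(relu1d(v), array![0.0, 0.0, 2.0]);
    assert_eq!(relu2d(m), array![[0.0, 3.0], [0.5, 0.0]]);
}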

Backward7

$$ \text{ReLU}'(x) = \begin{cases} 0 & x < 0 \newline 1 & x > 0 \end{cases} $$

1D Implementation

pub fn relu_backward1d(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
    x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 }) * y
}

2D Implementation

pub fn relu_backward2d(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 }) * y
}
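In use, the backward pass just masks the upstream gradient wherever the input was not positive (again, illustrative values):

use ndarray::array;

fn main() {
    let x = array![-1.0, 0.5, 2.0];        // pre-activation input
    let upstream = array![0.1, 0.2, 0.3];  // gradient flowing back from the next layer
    // Gradient only flows where x > 0.
    assert_eq!(relu_backward1d(x, upstream), array![0.0, 0.2, 0.3]);
}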

LogSoftmax8

Forward

$$ \begin{split} \text{Softmax}(x_i) &= \frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \newline \text{LogSoftmax}(x_i) &= \log{\left(\text{Softmax}(x_i)\right)} \newline &= \log{\left(\frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}}\right)} \newline &= \log{\left(e^{x_i - \max{(x)}}\right)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= x_i - \max{(x)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \end{split} $$

1D Implementation

pub fn logsoftmax1d(x: Array1<f64>) -> Array1<f64> {
    let max_x = x.fold(f64::NAN, |a, b| a.max(*b));
    let diff_x = x.mapv(|x| x - max_x);
    let sum_x = diff_x.mapv(f64::exp).sum().ln();
    diff_x.mapv(|x| x - sum_x)
}

2D Implementation

pub fn logsoftmax2d(x: Array2<f64>) -> Array2<f64> {
    let max_x = x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b));
    let diff_x = &x - max_x.insert_axis(Axis(1));
    let sum_x = diff_x.mapv(f64::exp).sum_axis(Axis(1));
    let log_sum_x = sum_x.mapv(f64::ln);
    diff_x - log_sum_x.insert_axis(Axis(1))
}
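A quick sanity check (with made-up logits): exponentiating the LogSoftmax output gives Softmax, so every row should sum to 1.

use ndarray::{array, Axis};

fn main() {
    let logits = array![[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]];
    let log_probs = logsoftmax2d(logits);
    // exp(LogSoftmax(x)) = Softmax(x), so each row is a probability distribution.
    let row_sums = log_probs.mapv(f64::exp).sum_axis(Axis(1));
    for &s in row_sums.iter() {
        assert!((s - 1.0).abs() < 1e-12);
    }
}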

Backward9

Rules10

Log Rule

$$ \frac{\partial}{\partial a} \log(a) = \frac{1}{a} $$

Euler Number Rule

$$ \frac{\partial}{\partial a} e^a = e^a $$

Chain Rule

$$ \frac{\partial}{\partial a} f(g(a)) = f'(g(a)) \cdot g'(a) $$

Sum Rule11

$$ \frac{\partial}{\partial a_i} \sum_{j=1}^{\text{len}(a)} f(a_j) = \frac{\partial}{\partial a_i} f(a_i) $$
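For a concrete instance (my own example), take three terms and differentiate with respect to the second one - every term not involving it vanishes:

$$ \frac{\partial}{\partial a_2} \left( f(a_1) + f(a_2) + f(a_3) \right) = 0 + f'(a_2) + 0 = f'(a_2) $$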

Putting it together

$$ \begin{split} \text{LogSoftmax}(x_i) &= x_i - \max{(x)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline \text{LogSoftmax}'(x_i) &= \frac{\partial x_i}{\partial x_i} - \frac{\partial \max{(x)}}{\partial x_i} - \frac{\partial}{\partial x_i} \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= 1 - \frac{\partial \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)}}{\partial x_i} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot \frac{\partial \left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)}{\partial x_i} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot \frac{\partial e^{x_i - \max{(x)}}}{\partial x_i} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot e^{x_i - \max{(x)}} \newline &= 1 - \frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \newline &= 1 - \text{Softmax}(x_i) \end{split} $$
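Two details worth spelling out before the code: the max(x) term is treated as a constant with respect to x_i (it is only there for numerical stability, per footnote 8), and the result above is just the diagonal case where the output index matches the input index. Differentiating output i with respect to a different input j drops the leading 1, giving the full Jacobian:

$$ \frac{\partial \, \text{LogSoftmax}(x_i)}{\partial x_j} = \delta_{ij} - \text{Softmax}(x_j) $$

The backward pass below is the vector-Jacobian product of the incoming gradient y with this matrix.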

1D Implementation

pub fn logsoftmax_backward1d(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
    // Recompute Softmax(x), using the same max-subtraction trick as the forward pass.
    let softmax_x = (&x - x.fold(f64::NAN, |a, b| a.max(*b))).mapv(f64::exp);
    let softmax_sum = softmax_x.sum();
    let softmax = softmax_x / softmax_sum;
    // Jacobian: derivative[i][j] = delta_ij - softmax[j].
    let n = x.len();
    let delta_ij = Array2::eye(n);
    let softmax_matrix = softmax.broadcast(n).unwrap().to_owned();
    let derivative = &delta_ij - &softmax_matrix;
    // Vector-Jacobian product with the upstream gradient y.
    y.dot(&derivative)
}

2D Implementation12

pub fn logsoftmax_backward2d(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    // Recompute the row-wise softmax of x (with the max-subtraction trick).
    let softmax_x = (&x
        - &x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b))
            .insert_axis(Axis(1)))
        .mapv(f64::exp);
    let softmax_sum = softmax_x.sum_axis(Axis(1)).insert_axis(Axis(1));
    let softmax = softmax_x / &softmax_sum;
    let n = x.shape()[1];
    let m = x.shape()[0];
    // Per-row Jacobian: derivative[b][i][j] = delta_ij - softmax[b][j], shape (m, n, n).
    let inner_delta_ij: Array2<f64> = Array2::eye(n);
    let delta_ij = inner_delta_ij.broadcast((m, n, n)).unwrap().to_owned();
    let softmax_matrix = softmax
        .insert_axis(Axis(1))
        .broadcast((m, n, n))
        .unwrap()
        .to_owned();
    let derivative = &delta_ij - &softmax_matrix;
    // Row-by-row vector-Jacobian products with the upstream gradient y, restacked to (m, n).
    stack(
        Axis(0),
        &y.axis_iter(Axis(0))
            .zip(derivative.axis_iter(Axis(0)))
            .map(|(y, derivative)| y.dot(&derivative))
            .collect::<Vec<Array1<f64>>>()
            .iter()
            .map(|x| x.view())
            .collect::<Vec<ArrayView1<f64>>>(),
    )
    .unwrap()
}
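On footnote 12's point: because each Jacobian row is the identity row minus the softmax vector, the vector-Jacobian product collapses to y minus softmax scaled by the row sums of y, so the (m, n, n) Jacobian never has to be materialized. A sketch of what that could look like (the function name is just for illustration, not code from the project):

use ndarray::{Array2, Axis};

// Sketch: same vector-Jacobian product as above, without building the
// (m, n, n) Jacobian. grad[b][j] = y[b][j] - softmax[b][j] * sum_i y[b][i].
pub fn logsoftmax_backward2d_fused(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    let shifted = &x
        - &x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b))
            .insert_axis(Axis(1));
    let exp = shifted.mapv(f64::exp);
    let softmax = &exp / &exp.sum_axis(Axis(1)).insert_axis(Axis(1));
    let y_row_sums = y.sum_axis(Axis(1)).insert_axis(Axis(1));
    &y - &(softmax * &y_row_sums)
}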

Network

Code

I defined just two layers13:

[Network diagram: Binary Input (784) → ReLU layer → LogSoftmax layer → Max → Output (0-9)]

The code for the net is not as dense as the activation functions, but there are a few tricks that make it efficient.14
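As a rough sketch of the shape that code takes (the Network struct, the w1/b1/w2/b2 field names, and the 784 → 128 → 10 sizes here are illustrative assumptions, not the actual definitions), a single-example forward pass just chains the pieces above:

use ndarray::{Array1, Array2};

// Illustrative only - the real net has its own structure and training code.
pub struct Network {
    w1: Array2<f64>, // first-layer weights, e.g. (784, 128)
    b1: Array1<f64>, // first-layer bias, e.g. (128,)
    w2: Array2<f64>, // second-layer weights, e.g. (128, 10)
    b2: Array1<f64>, // second-layer bias, e.g. (10,)
}

impl Network {
    // Forward pass for one example: Linear -> ReLU -> Linear -> LogSoftmax.
    pub fn forward(&self, x: Array1<f64>) -> Array1<f64> {
        let hidden = relu1d(x.dot(&self.w1) + &self.b1);
        logsoftmax1d(hidden.dot(&self.w2) + &self.b2)
    }
}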


  1. Looking back, I did not get the performance I wanted - around 90% accuracy. This is mostly because I settled on an architecture that worked well with float inputs rather than binary inputs, but I switched to binary inputs when creating the final project (as they were much more intuitive on the frontend). ↩︎

  2. The idea is mostly to transfer ownership as much as possible. I also try to keep everything in ndarray to let it do its magic. ↩︎

  3. I did not do as good a job with this as I wanted. Looking back, I wish I had made the model more generalizable and structured it as a small neural net engine. ↩︎

  4. I do, however, use the ndarray crate. ↩︎

  5. However, if I were to do it again, I would use candle, as it is pretty light and would possibly have better performance. ↩︎

  6. I also only have 2 layers ↩︎

  7. Technically, ReLU is not differentiable at 0, but this is the conventional function used during backprop. ↩︎

  8. This differs from the fully traditional logsoftmax, as I am using the max of the elements as a normalizing term in Softmax. ↩︎

  9. Yann LeCun assigning it as an exercise ↩︎

  10. These are not all official rules, but concepts that I use next. ↩︎

  11. To expand on this further: because you are taking the partial derivative with respect to the i-th term, the derivatives of all the non-i-th terms are 0. ↩︎

  12. I think there is some efficiency lost here, specifically in how I aggregate the terms at the end of the function. ↩︎

  13. This was surprisingly accurate with the full float values of the inputs - around 98%. ↩︎

  14. Specifically around memory management and reusing previous computations. Training can be made much more efficient than my implementation, however. ↩︎