Rust WASM Neural Net PART 2 - Model
General Principles
Efficiency
The model should not only be efficient in network structure[^1], but also in implementation (avoid clone at all costs)[^2].
Simplicity
The model should be effective without being overly complex. Simple, clean, maintainable[^3] code is the way to go.
From Scratch
Write from scratch[^4] to avoid loading massive libraries. There is no need to bring in tch-rs or something similar.
Activation Functions
Code
I only use two activation functions[^6] for speed of computation - ReLU and LogSoftmax. I found that you can get pretty good accuracy with just these two.
$$ x \text{ is the input vector} $$
ReLU
Forward
$$\text{ReLU}(x) = \max(0, x)$$
1D Implementation
pub fn relu1d(x: Array1<f64>) -> Array1<f64> {
x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}
2D Implementation
pub fn relu2d(x: Array2<f64>) -> Array2<f64> {
x.mapv(|x| if x > 0.0 { x } else { 0.0 })
}
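As a minimal sanity check (a sketch assuming `relu1d` from above is in scope and ndarray's `array!` macro is imported), negative entries are clamped to zero and non-negative entries pass through unchanged:

```rust
use ndarray::array;

#[test]
fn relu_zeroes_negatives() {
    // Negative entries clamp to 0.0; non-negative entries pass through untouched.
    let x = array![-2.0, 0.0, 3.5];
    assert_eq!(relu1d(x), array![0.0, 0.0, 3.5]);
}
```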
Backward[^7]
$$ \text{ReLU}'(x) = \begin{cases} 0 & x < 0 \newline 1 & x > 0 \end{cases} $$
1D Implementation
pub fn relu_backward1d(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 }) * y
}
2D Implementation
pub fn relu_backward2d(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
x.mapv(|x| if x > 0.0 { 1.0 } else { 0.0 }) * y
}
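The backward pass is just a mask applied to the upstream gradient: gradients flow only where the forward input was positive. A small check, again assuming the functions above and ndarray's `array!` macro:

```rust
use ndarray::array;

#[test]
fn relu_backward_masks_gradient() {
    // Upstream gradients pass through where x > 0 and are zeroed elsewhere.
    let x = array![-1.0, 2.0, 0.5];
    let upstream = array![10.0, 10.0, 10.0];
    assert_eq!(relu_backward1d(x, upstream), array![0.0, 10.0, 10.0]);
}
```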
LogSoftmax[^8]
Forward
$$ \begin{split} \text{Softmax}(x_i) &= \frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \newline \text{LogSoftmax}(x_i) &= \log{\left(\text{Softmax}(x_i)\right)} \newline &= \log{\left(\frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}}\right)} \newline &= \log{\left(e^{x_i - \max{(x)}}\right)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= x_i - \max{(x)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \end{split} $$
1D Implementation
pub fn logsoftmax1d(x: Array1<f64>) -> Array1<f64> {
let max_x = x.fold(f64::NAN, |a, b| a.max(*b));
let diff_x = x.mapv(|x| x - max_x);
let sum_x = diff_x.mapv(f64::exp).sum().ln();
diff_x.mapv(|x| x - sum_x)
}
2D Implementation
pub fn logsoftmax2d(x: Array2<f64>) -> Array2<f64> {
let max_x = x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b));
let diff_x = &x - max_x.insert_axis(Axis(1));
let sum_x = diff_x.mapv(f64::exp).sum_axis(Axis(1));
let log_sum_x = sum_x.mapv(f64::ln);
diff_x - log_sum_x.insert_axis(Axis(1))
}
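A quick property check (a sketch assuming `logsoftmax2d` above is in scope): exponentiating the output should recover a probability distribution, so each row sums to one.

```rust
use ndarray::{array, Axis};

#[test]
fn logsoftmax_rows_normalize() {
    // exp(LogSoftmax(x)) is Softmax(x), so every row should sum to ~1.
    let x = array![[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]];
    let probs = logsoftmax2d(x).mapv(f64::exp);
    for row_sum in probs.sum_axis(Axis(1)).iter() {
        assert!((row_sum - 1.0).abs() < 1e-12);
    }
}
```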
Backward
Rules[^10]
Log Rule
$$ \frac{\partial}{\partial a} \log(a) = \frac{1}{a} $$
Euler Number Rule
$$ \frac{\partial}{\partial a} e^a = e^a $$
Chain Rule
$$ \frac{\partial}{\partial a} f(g(a)) = f'(g(a)) \cdot g'(a) $$
Sum Rule[^11]
$$ \frac{\partial}{\partial a_i} \sum_{j=1}^{\text{len}(a)} f(a_j) = \frac{\partial}{\partial a_i} f(a_i) $$
Putting it together
$$ \begin{split} \text{LogSoftmax}(x_i) &= x_i - \max{(x)} - \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline \text{LogSoftmax}'(x_i) &= \frac{\partial}{\partial x_i} x_i - \frac{\partial}{\partial x_i} \max{(x)} - \frac{\partial}{\partial x_i} \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= 1 - \frac{\partial}{\partial x_i} \log{\left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right)} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot \frac{\partial}{\partial x_i} \left(\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}\right) \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot \frac{\partial}{\partial x_i} e^{x_i - \max{(x)}} \newline &= 1 - \frac{1}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \cdot e^{x_i - \max{(x)}} \newline &= 1 - \frac{e^{x_i - \max{(x)}}}{\sum_{j=1}^{\text{len}(x)}{e^{x_j - \max{(x)}}}} \newline &= 1 - \text{Softmax}(x_i) \end{split} $$
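The steps above give the diagonal entries of the Jacobian, where the output and input indices match. For entries with j ≠ i, the x_i and max(x) terms drop out (max(x) is treated as a constant, as above) and only the log-sum term contributes, so the full Jacobian that the implementations below build is

$$ \frac{\partial \text{LogSoftmax}(x_i)}{\partial x_j} = \delta_{ij} - \frac{e^{x_j - \max{(x)}}}{\sum_{k=1}^{\text{len}(x)}{e^{x_k - \max{(x)}}}} = \delta_{ij} - \text{Softmax}(x_j) $$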
1D Implementation
pub fn logsoftmax_backward1d(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
let softmax_x = (&x - x.fold(f64::NAN, |a, b| a.max(*b))).mapv(f64::exp);
let softmax_sum = softmax_x.sum();
let softmax = softmax_x / softmax_sum;
let n = x.len();
let delta_ij = Array2::eye(n);
let softmax_matrix = softmax.broadcast(n).unwrap().to_owned();
let derivative = &delta_ij - &softmax_matrix;
y.dot(&derivative)
}
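Because every row of that Jacobian is an identity row minus the same softmax vector, the matrix product collapses algebraically to y - Softmax(x) · Σy, which avoids materializing the n × n matrix. A minimal alternative sketch (not the code used in the project; the function name here is hypothetical):

```rust
use ndarray::Array1;

// Alternative sketch: exploit the Jacobian structure delta_ij - softmax_j so the
// backward pass stays O(n) in memory instead of building an n x n matrix.
pub fn logsoftmax_backward1d_fused(x: Array1<f64>, y: Array1<f64>) -> Array1<f64> {
    let max_x = x.fold(f64::NAN, |a, b| a.max(*b));
    let exp_x = x.mapv(|v| (v - max_x).exp());
    let softmax = &exp_x / exp_x.sum();
    let y_sum = y.sum();
    // grad_j = y_j - softmax_j * sum(y)
    y - softmax * y_sum
}
```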
2D Implementation[^12]
pub fn logsoftmax_backward2d(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
let softmax_x = (&x
- &x.fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b))
.insert_axis(Axis(1)))
.mapv(f64::exp);
let softmax_sum = softmax_x.sum_axis(Axis(1)).insert_axis(Axis(1));
let softmax = softmax_x / &softmax_sum;
let n = x.shape()[1];
let m = x.shape()[0];
let inner_delta_ij: Array2<f64> = Array2::eye(n);
let delta_ij = inner_delta_ij.broadcast((m, n, n)).unwrap().to_owned();
let softmax_matrix = softmax
.insert_axis(Axis(1))
.broadcast((m, n, n))
.unwrap()
.to_owned();
let derivative = &delta_ij - &softmax_matrix;
stack(
Axis(0),
&y.axis_iter(Axis(0))
.zip(derivative.axis_iter(Axis(0)))
.map(|(y, derivative)| y.dot(&derivative))
.collect::<Vec<Array1<f64>>>()
.iter()
.map(|x| x.view())
.collect::<Vec<ArrayView1<f64>>>(),
)
.unwrap()
}
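The same collapse works row-wise in the batched case and sidesteps both the (m, n, n) broadcast and the stacking at the end of the function that the footnote flags as an efficiency concern. Again a sketch under the same assumptions, with a hypothetical function name, not the project's implementation:

```rust
use ndarray::{Array2, Axis};

// Alternative batched sketch: per-row gradient y - softmax * sum(y), avoiding the
// (m, n, n) Jacobian and the stack() aggregation at the end.
pub fn logsoftmax_backward2d_fused(x: Array2<f64>, y: Array2<f64>) -> Array2<f64> {
    let max_x = x
        .fold_axis(Axis(1), f64::NAN, |&a, &b| a.max(b))
        .insert_axis(Axis(1));
    let exp_x = (&x - &max_x).mapv(f64::exp);
    let softmax = &exp_x / &exp_x.sum_axis(Axis(1)).insert_axis(Axis(1));
    let y_sums = y.sum_axis(Axis(1)).insert_axis(Axis(1));
    y - softmax * &y_sums
}
```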
Network
Code
I defined just two layers[^13]. The code for the net is not as densely complex as the activation functions, but there are a few tricks to make it efficient.[^14]
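As a rough orientation before the network code, here is how two layers can be wired from the activation functions above. This is an illustrative sketch only: the struct, field names, and use of biases are assumptions, not the project's actual model definition.

```rust
use ndarray::{Array1, Array2};

// Illustrative sketch only: a two-layer net built from the activations above.
// Field names and dimensions are assumptions, not the real model code.
pub struct Network {
    w1: Array2<f64>, // (hidden, input)
    b1: Array1<f64>, // (hidden,)
    w2: Array2<f64>, // (output, hidden)
    b2: Array1<f64>, // (output,)
}

impl Network {
    pub fn forward(&self, x: Array1<f64>) -> Array1<f64> {
        // Linear -> ReLU -> Linear -> LogSoftmax
        let h = relu1d(self.w1.dot(&x) + &self.b1);
        logsoftmax1d(self.w2.dot(&h) + &self.b2)
    }
}
```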
[^1]: Looking back, I did not get the performance I wanted - around 90% accurate. This is mostly because when I was deciding on the model I settled on an architecture that worked well with float inputs instead of binary inputs. However, I switched to binary inputs when creating the final project (as it was much more intuitive on the frontend).
[^2]: The idea is mostly to transfer ownership as much as possible. I also try to keep everything in `ndarray` to let it do its magic.
[^3]: I did not do as good of a job as I wanted with this. Looking back, I wish I had made the model more generalizable and structured it as a small neural net engine.
[^4]: However, if I were to do it again, I would use candle, as it is pretty light and might perform better.
[^6]: I also only have 2 layers.
[^7]: Technically, there is no derivative at zero, but this is the conventional function used during backprop.
[^8]: This differs from the fully traditional LogSoftmax, as I am using the max of the elements as a normalizing term in Softmax.
[^10]: These are not all official rules, but concepts that I use next.
[^11]: To expand on this further: because you are taking the partial derivative with respect to the i-th term, all other non-i-th terms are 0.
[^12]: I think there is some efficiency lost here, specifically in how I aggregate the terms at the end of the function.
[^13]: This was surprisingly accurate with the full float values of the inputs - around 98% accurate.
[^14]: Specifically around memory management and reusing previous computations. Training can get much more efficient than my implementation, however.