How TensorFlow.js Sees: Grayscale, Weights, and the Math Behind Real Time ML
ASL Tech is a TensorFlow.js website that watches your webcam and classifies American Sign Language hand shapes in real time. Source on GitHub at asl_tensorflow.
Here is the part most marketing copy gets wrong: there is no large language model attached to it. No GPT call. No OpenAI key in the .env. Just machine learning, browser APIs, and a small amount of math that runs on every frame.
This post is the short version of how that math actually works, and why a tool like this can be person-specific: built for one person, one task, one problem in their day.
Machine learning is just weighted math
Every prediction is a weighted sum followed by an activation
Strip away the marketing and a neural network is a stack of simple operations: multiply, add, activate, repeat. A model "learns" by adjusting numbers, called weights, until its outputs line up with the labels it was trained on.
A single neuron in TensorFlow.js looks like this:
```js
import * as tf from "@tensorflow/tfjs";

// y = activation( Σ (xᵢ · wᵢ) + b )
const x = tf.tensor1d([0.2, 0.8, 0.5]);  // input values
const w = tf.tensor1d([0.4, -0.6, 0.9]); // learned weights
const b = tf.scalar(0.1);                // learned bias

const y = x.mul(w).sum().add(b).sigmoid(); // forward pass
y.print(); // → one number between 0 and 1
```

That is it. Stack a few thousand of those into layers and you have an image classifier.
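To make "stack a few thousand of those into layers" concrete, here is a minimal sketch of that stacking in the tf.layers API. The shapes and layer choices are illustrative, not the actual ASL Tech architecture:

```js
// a hypothetical tiny classifier: every layer is still just
// weighted sums plus an activation, applied many times over
const model = tf.sequential({
  layers: [
    tf.layers.conv2d({
      inputShape: [64, 64, 1], // grayscale frames, as below
      filters: 8,
      kernelSize: 3,
      activation: "relu",
    }),
    tf.layers.maxPooling2d({ poolSize: 2 }),
    tf.layers.flatten(),
    tf.layers.dense({ units: 26 }), // raw scores (logits); softmax comes later
  ],
});
```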
How a model "sees" a webcam frame
A frame becomes a grid of brightness values before the model touches it
A webcam frame is not a picture to TensorFlow.js. It is a tensor: a multi-dimensional array of numbers. Color images come in as [height, width, 3] (one channel each for red, green, blue). Many vision models drop that down to grayscale because brightness alone is enough to detect shapes, edges, and motion for most tasks.
Here is the conversion happening on every frame in the browser:
```js
// from <video> element to model-ready tensor
const frame = tf.browser.fromPixels(videoEl);     // [h, w, 3] RGB
const gray = frame.mean(2).expandDims(2);         // [h, w, 1] grayscale
const small = tf.image.resizeBilinear(gray, [64, 64]);
const normalized = small.div(255).expandDims(0);  // [1, 64, 64, 1]
const prediction = model.predict(normalized);
```

Why grayscale? Color triples the math without adding much signal for hand-shape detection. The model cares about where the bright pixels are, not whether they are pink or beige. Less data in, faster inference, lower memory.
The numbers in that grid are just brightness values from 0 (black) to 255 (white) — and after .div(255) they are floats from 0 to 1. That is what the model actually consumes. Pure numbers.
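If you want to see those numbers for yourself, you can pull them back out of the tensor (a quick sketch, reusing the normalized tensor from above):

```js
// read the tensor back into a plain typed array of floats
const values = await normalized.data(); // Float32Array
console.log(values.slice(0, 5));        // first five pixel values, e.g. 0.42, 0.43, ...
```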
What "weights" actually mean
Input → weighted layers → probability distribution
A trained model is a big bag of weights: usually millions of floating-point numbers saved to disk and loaded into the page. When you call model.predict(), TensorFlow.js does a version of this for every dense layer:
```js
// dense layer, written by hand to show the math
function denseForward(input, weights, bias) {
  return input
    .matMul(weights) // matrix multiplication
    .add(bias)       // shift
    .relu();         // activation: clip negatives to 0
}
```

The matrix multiplication is the part that costs cycles. Every output neuron is a dot product of the input row and a column of weights. Bigger model = more weights = more multiplies per frame.
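You can get a rough feel for that cost by asking a loaded layers model how many weights it carries:

```js
// every parameter takes part in at least one multiply per forward pass,
// so this number is a decent proxy for per-frame work
console.log(`parameters: ${model.countParams()}`);
```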
This is also why GPU acceleration matters. tf.setBackend("webgl") runs those matrix multiplies on your graphics card, which is built to do thousands of multiplies in parallel:
```js
await tf.setBackend("webgl");
await tf.ready();
console.log("backend:", tf.getBackend()); // → webgl
```

The real time loop
Webcam → tensor → prediction → render → repeat
The whole pipeline runs every animation frame. There is no server, no API call, no token cost.
```js
const LABELS = ["A", "B", "C", "D", "E" /* ... */];

async function loop() {
  const frame = tf.browser.fromPixels(videoEl);
  const input = frame
    .toFloat()                // cast int pixels to floats before resizing
    .resizeBilinear([64, 64])
    .mean(2)
    .expandDims(2)
    .div(255)
    .expandDims(0);           // [1, 64, 64, 1]

  const logits = model.predict(input);
  const probs = logits.softmax();
  const top = probs.argMax(1);          // argMax returns a tensor too
  const topIdx = (await top.data())[0];
  const label = LABELS[topIdx];

  drawPrediction(label);

  // free GPU memory — important
  tf.dispose([frame, input, logits, probs, top]);
  requestAnimationFrame(loop);
}
```

The tf.dispose line is not optional. Tensors live on the GPU and they do not get garbage-collected like regular JS objects. Forget to dispose and the page leaks memory until the tab dies.
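One thing manual dispose does not catch: each chained op above (toFloat, resizeBilinear, mean, and so on) creates its own intermediate tensor. tf.tidy() frees every tensor created inside its callback except the one you return, which is why it is the usual idiom for the synchronous part of a loop like this. A sketch, using the same hypothetical names as above:

```js
async function loop() {
  // tidy disposes every intermediate created inside the callback;
  // only the returned tensor (probs) survives
  const probs = tf.tidy(() => {
    const input = tf.browser.fromPixels(videoEl)
      .toFloat()
      .resizeBilinear([64, 64])
      .mean(2)
      .expandDims(2)
      .div(255)
      .expandDims(0);
    return model.predict(input).softmax();
  });

  const top = probs.argMax(1);
  const topIdx = (await top.data())[0];
  drawPrediction(LABELS[topIdx]);

  tf.dispose([probs, top]); // tidy cannot wrap the async part
  requestAnimationFrame(loop);
}
```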
The math behind a live prediction
This is the part that intimidates people, but it is genuinely small. For each frame, the model produces a vector of raw scores called logits (one score per class). To turn those into probabilities, TensorFlow.js applies softmax:
```js
// softmax, written out
function softmax(logits) {
  const maxLogit = logits.max();
  const exps = logits.sub(maxLogit).exp();
  const sum = exps.sum();
  return exps.div(sum);
}
```

The max subtraction is just for numerical stability: it prevents exp(largeNumber) from overflowing. After softmax, every value is between 0 and 1 and they sum to 1. That is your probability distribution. The highest one is the prediction.
For ASL Tech, the model spits out 26 numbers (one per letter), softmax turns them into 26 probabilities, argMax picks the index of the largest one, and the LABELS array maps that index to a letter. Frame by frame, 30 times a second.
Person-specific tools, not platform features
The reason this style of building matters is that the cost of shipping a small, focused tool has collapsed. A pre-trained model is a few imports away. The browser handles the camera. A static host gives you a public URL in minutes.
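"A few imports away" is nearly literal. Loading a pre-trained model from a static host is one call; the URL here is a placeholder, not the actual ASL Tech asset:

```js
// fetch the weights once at startup; everything after runs locally
const model = await tf.loadLayersModel("/models/asl/model.json");
```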
That changes the kind of software worth building. You no longer need to write a product for everyone. You can write a tool for one person, one task, one bottleneck in their day:
- A teacher learning ASL who needs a flashcard that watches their hand shape and confirms each letter.
- A speech-language pathologist who wants a private, local tool that recognises a small set of signs for one client.
- A parent who needs a quick way to practice signs with a non-verbal child without uploading any video anywhere.
None of those people need a chat AI. They need fast, narrow ML that runs in their browser tab and respects their camera.
Why no LLM — yet
ASL Tech does not need a large language model to do its job. The classification task is narrow: take a hand shape, output a letter. A small CNN solves that at 30+ frames per second on a laptop.
Adding an LLM would mean every frame becomes a network request to an inference endpoint, real money per call, and a feedback loop too slow for a live camera. The right architecture for this kind of tool is fast ML at the edge, a frozen model running in the browser, and a language model only where flexible reasoning actually adds value.
A future version could pipe full sentences from the ASL stream into a language model for translation polish or grammar repair. That is a separate concern, layered on top, not baked into the perception loop.
The bottom line
Machine learning is not magic. It is weighted matrix multiplications, applied to tensors, repeated layer after layer, optimised to run on a GPU in your browser tab.
Once you see the math, the mystery drops out. A webcam frame becomes a grid of brightness numbers. Numbers move through layers. Each layer is a matrix multiply plus a non-linearity. The output is a probability distribution. You read the highest one and call it a prediction.
That is the entire trick. Build the tool for the person who needs it. Ship it before dinner.
Try ASL Tech in a browser, give the page permission to use your camera, and watch the prediction update live. Nothing leaves the page. The whole thing is TensorFlow.js running locally.