LSTM
What actually is LSTM and how it differs from RNN network and how it can be used to identify human speech out of the background noise
What is LSTM?
—> Long Short Term Memory.. Improved version of RNN
Contains Cell state and hidden state between input windows
LSTM
Long Short Term Memory
Imagine reading this sentence one word at a time, but forgetting each word the moment you move to the next:
"The" → read → forget
"cat" → read → forget
"sat" → read → forget
"on" → read → forget
"the" → read → forget
"mat" → read → ???
Present Transformer remember context among sentences,
My name is Manish. I live in NY...
I ---> means Manish
here name padheko xaina, we forgot My 🤣. that is Regular Neural Network,
input → [neurons] → output
input → [neurons] → output ← same network, no connection between steps
input → [neurons] → output
Why we need LSTM in our Classrec app?
we are doing Silero VAD detection for each 32ms sample …
Window 1 (32ms): [sssss] ← start of "s" sound
LSTM: "sounds like start of consonant, storing..."
h = 0.2 (low confidence so far)
Window 2 Window 2 (32ms): [ppppp] ← "p" sound
LSTM: "another consonant, consonant cluster"
h = 0.4 (building confidence)
Here confidence increases due to previous input step , as a result more easy in
detecting human audio or not
DO we need LSTM in image classification ?
No Images are not sequences. If we classify a CAT image , then it wont have any impact on image to be classified next..
There’s no “time” dimension, no before or after.
Image → CNN → classification
How CNN works
Instead of looking at the whole image at once, CNN slides a small window pixel by pixel (called a filter or kernel)
Image (simplified 6x6):
┌─────────────┐
│ 0 0 1 1 0 0 │
│ 0 1 1 1 1 0 │
│ 1 1 0 0 1 1 │
│ 1 1 0 0 1 1 │
│ 0 1 1 1 1 0 │
│ 0 0 1 1 0 0 │
└─────────────┘
Filter (3x3) slides across:
┌───┐
│1 0│ ← detects vertical edges
│1 0│
└───┘
sliding → → →
↓ ↓ ↓
How CNN learns and Classify image?
Layer 1 — detects simple patterns:
edges corners dots
───── ┐ •
───── │
─────
Layer 2 — combines simple patterns:
curves shapes textures
╭─╮ △ □ ○ ░░░░
╰─╯ ░░░░
Layer 3 — detects complex patterns:
eyes wheels windows
◉ ◯ ▭
Final layer — full objects:
cat face car building
CNN vs LSTM
Data type Structure Use
─────────────────────────────────────────────
Image 2D spatial grid CNN ✅
Audio (raw) 1D sequence LSTM or CNN
Text 1D sequence LSTM or Transformer
Video 3D (2D + time) CNN + LSTM ✅ both
Time series 1D sequence LSTM ✅
Speech detection 1D sequence LSTM ✅ (Silero VAD)
Video is a sequence of image, so first image classification then LSTM to classify it by deriving context from the image..
Lets say classifying if a dog is moving..First we have to detect dog, and then on image snapshot ,we have to compare the relative position
CNN asks: "What patterns exist WHERE in this image?"
spatial relationships matter
↓
cat ears are always near the top
wheels are always near the bottom
LSTM asks: "What happened BEFORE this moment?"
temporal relationships matter
↓
"sp" before "eech" = speech
loud bang with no buildup = door slam
Do we still use LSTM and CNN after TRANSFORMER is popular?
—> yes for lightweight tasks

Back to LSTM
# Import it
from fastapi.staticfiles import StaticFiles
# Mount it to serve files from the static folder
app.mount("/static", StaticFiles(directory="static"), name="static")
# ↓ ↓ ↓ ↓
# URL path Class Folder on disk Name for URL building