Introduction

In the previous blog, we walked through how to prepare the dataset, train, quantize, and deploy the model to the ESP32-S3, following the ESP-DL digit recognition example code:
https://github.com/espressif/esp-iot-solution/tree/master/examples/ai/esp_dl/touchpad_digit_recognition

However, we did not take a closer look at how the neural network itself is designed.

In this post, I will explain the details of the neural network architecture based on my understanding.

Later on, I plan to use this model as a foundation to develop my own digit and alphabet character recognition system.

How This Blog May Help You

  • Gain a deeper understanding of how the neural network in the ESP-DL digit recognition example works.
  • Learn about the structure and parameter calculation of a typical CNN for embedded AI.

Layer Explanation

The model contains 7 main neural network operations:

  1. Conv2D: Think of this as a window (kernel) that slides over the input matrix, performing multiplication and summation at each position.
  2. ReLU: An activation function that introduces non-linearity.
  3. MaxPool2D: A downsampling operation that reduces the spatial dimensions of the matrix.
  4. Flatten: Converts a multi-dimensional tensor (e.g., a stack of 2D feature maps) into a 1D array.
  5. Linear: A standard fully connected (dense) neural network layer.
  6. Dropout: Randomly sets a fraction of activations (layer outputs) to zero during training to prevent overfitting.
  7. Softmax: Converts raw output values into probabilities.

The most important part is understanding how data flows from one layer to another. If the shapes do not match, the neural network will not work.
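To make the layer sequence concrete, here is a minimal PyTorch sketch of such a model. The layer sizes follow the shapes worked out later in this post; the class name `DigitNet` and the dropout rate of 0.5 are my own assumptions, not taken from the example code.

```python
import torch
import torch.nn as nn

class DigitNet(nn.Module):  # hypothetical name, not from the ESP-DL example
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 7, 256),  # 2688 inputs after flattening
            nn.ReLU(),
            nn.Dropout(p=0.5),           # assumed rate
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        # Softmax is typically applied outside the model,
        # e.g., implicitly by CrossEntropyLoss during training.
        return self.classifier(self.features(x))

x = torch.zeros(1, 1, 25, 30)  # [batch, channels, H, W]
print(DigitNet()(x).shape)     # torch.Size([1, 10])
```

If the shapes did not line up (for example, a wrong `in_features` on the first Linear layer), the forward pass would raise a runtime error, which is exactly the "shapes must match" point above.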

Conv2D: kernel_size

  • kernel_size is the heart of the CNN. It defines the size of the window (e.g., 3x3) that moves across the input.
  • For kernel_size=3, there is a 3x3 matrix of weights that the network learns and optimizes.
  • The term “kernel” is standard in CNNs, though “window” might be more intuitive.

Conv2D: in_channels & out_channels

  • in_channels: Number of input 2D arrays (channels).
    For grayscale images, this is usually 1; for RGB, it’s 3.
  • out_channels: Number of output 2D arrays after the convolution.
  • For example, in_channels=1, out_channels=16 means the Conv2D layer will output 16 feature maps, each generated by a different kernel.

Conv2D: stride & padding

  • stride: Step size of the kernel as it moves across the input.
    stride=1 means the kernel moves one pixel at a time.
  • padding: Number of zeros (dummy values) added around the border of the input.
    padding=1 adds a border of one pixel.
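Kernel size, stride, and padding together determine the output size along each dimension. A quick sketch of the standard formula in plain Python:

```python
# Conv2D output size along one dimension:
#   out = floor((in + 2*padding - kernel_size) / stride) + 1
def conv2d_out(size, kernel_size=3, stride=1, padding=1):
    return (size + 2 * padding - kernel_size) // stride + 1

# With kernel_size=3, stride=1, padding=1 the size is preserved:
print(conv2d_out(25), conv2d_out(30))  # 25 30
```

This is why every Conv2D layer in this model keeps the height and width unchanged: the one-pixel padding exactly compensates for the 3x3 kernel.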

Output Shape of Each Layer

Note: In PyTorch, tensor shapes are typically defined as [batch, channels, height, width]; the batch dimension is omitted in the shapes below.

  • Conv2D (in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1):

    • Input: 25 x 30 (H x W)
    • After padding: 27 x 32 (H x W)
    • After convolution: 25 x 30 (H x W)
  • MaxPool2D (kernel_size=2, stride=2):

    • Output: 12 x 15 (H x W)
    • (Figure: how the max-pool output shape is derived)
    • Note: In the vertical direction, the last row may be dropped if it doesn’t fit the pooling window.
  • Conv2D (in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1):

    • Output: 12 x 15 (H x W)
  • MaxPool2D (kernel_size=2, stride=2):

    • Output: 6 x 7 (H x W)
    • (Figure: second max-pool output shape)
  • Conv2D (in_channels=32, out_channels=64, kernel_size=3, stride=1, padding=1):

    • Output: 6 x 7 (H x W), with 64 channels
  • Flatten:

    • Output: 1D array of size 1 x 2688 (6 x 7 x 64)
  • Linear (in_features=2688, out_features=256):

    • Output: 1 x 256
  • Final Linear Layer (in_features=256, out_features=10):

    • Output: 1 x 10 (for 10 digit classes)

Note: Sometimes, the height and width columns may appear swapped in summaries—always double-check your tensor shapes.
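The whole shape trace above can be checked in a few lines of plain Python, combining the Conv2D output formula with the analogous MaxPool2D one:

```python
# Conv2D with kernel_size=3, stride=1, padding=1 (size-preserving here)
def conv_out(n, k=3, s=1, p=1):
    return (n + 2 * p - k) // s + 1

# MaxPool2D with kernel_size=2, stride=2; floor division drops
# a trailing row/column that doesn't fit the pooling window
def pool_out(n, k=2, s=2):
    return (n - k) // s + 1

h, w = 25, 30
h, w = conv_out(h), conv_out(w)  # Conv2D    -> 25 x 30
h, w = pool_out(h), pool_out(w)  # MaxPool2D -> 12 x 15
h, w = conv_out(h), conv_out(w)  # Conv2D    -> 12 x 15
h, w = pool_out(h), pool_out(w)  # MaxPool2D -> 6 x 7
h, w = conv_out(h), conv_out(w)  # Conv2D    -> 6 x 7
print(h, w, h * w * 64)          # 6 7 2688
```

The final product, 6 x 7 x 64 = 2688, is exactly the `in_features` of the first Linear layer.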

(Figure: CNN model summary)

Parameter Calculation

Understanding the number of parameters in each layer is fundamental to grasping how CNNs work.

  1. First Conv2D Layer:

    • Input: 1 channel, Output: 16 channels, Kernel size: 3x3
    • Each output channel has its own 3x3 kernel: 16 x 9 = 144 weights
    • Plus 16 bias values: 144 + 16 = 160 parameters
  2. First MaxPool2D Layer:

    • No parameters (just a downsampling operation)
  3. Second Conv2D Layer:

    • Input: 16 channels, Output: 32 channels, Kernel size: 3x3
    • 16 x 32 x 9 = 4608 weights
    • Plus 32 bias values: 4608 + 32 = 4640 parameters
  4. Third Conv2D Layer:

    • Input: 32 channels, Output: 64 channels, Kernel size: 3x3
    • 32 x 64 x 9 = 18432 weights
    • Plus 64 bias values: 18432 + 64 = 18496 parameters
  5. First Linear Layer:

    • Input: 2688, Output: 256
    • 2688 x 256 = 688,128 weights
    • Plus 256 bias values: 688,128 + 256 = 688,384 parameters
  6. Final Linear Layer:

    • Input: 256, Output: 10
    • 256 x 10 = 2560 weights
    • Plus 10 bias values: 2560 + 10 = 2570 parameters

Tip: The total number of parameters gives you an idea of the model’s complexity and memory requirements.
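As a sanity check, the per-layer counts above can be reproduced and summed in plain Python (MaxPool2D, ReLU, Flatten, and Dropout contribute no parameters):

```python
def conv2d_params(c_in, c_out, k=3):
    return c_in * c_out * k * k + c_out  # weights + biases

def linear_params(n_in, n_out):
    return n_in * n_out + n_out          # weights + biases

layers = [
    conv2d_params(1, 16),      # 160
    conv2d_params(16, 32),     # 4,640
    conv2d_params(32, 64),     # 18,496
    linear_params(2688, 256),  # 688,384
    linear_params(256, 10),    # 2,570
]
print(sum(layers))  # 714250
```

Note how the first Linear layer alone accounts for over 96% of the total: fully connected layers dominate the memory footprint of small CNNs like this one.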

Conclusion

In this post, I explained how the tensor shapes change through each layer of the CNN and detailed how to calculate the number of parameters in each layer.

Understanding these fundamentals will help you design and modify your own CNN architectures for embedded AI applications.

If you grasp these concepts, you are well on your way to building your own custom neural networks for tasks like digit or character recognition on embedded devices.