How have you been keeping up with the different convolutional neural networks (CNNs)? In recent years, we have witnessed the birth of numerous CNNs. These networks have gotten so deep that it has become extremely difficult to visualise the entire model. We stop keeping track of them and treat them as black-box models.
Fine, maybe you don’t. But if you’re guilty too then hey, you’ve come to the right place! This article is a visualisation of 10 common CNN architectures, hand-picked by yours truly. These illustrations provide a more compact view of the entire model, without having to scroll down a couple of times just to see the softmax layer. Apart from these images, I’ve also sprinkled some notes on how they ‘evolved’ over time (from 5 to 50 convolutional layers, from plain convolutional layers to modules, from 2–3 towers to 32 towers, from 7⨉7 to 5⨉5), but more on these later.
By ‘common’, I am referring to those models whose pre-trained weights are usually shared by deep learning libraries (such as TensorFlow, Keras and PyTorch) for users to use, and models that are usually taught in classes. Some of these models have shown success in competitions like the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).
The 10 architectures that will be discussed and the year their papers were published.
Pre-trained weights which are available in Keras for 6 of the architectures that we will talk about. Adapted from a table in the Keras documentation.
The motivation for writing this is that there aren’t many blogs and articles out there with these compact visualisations (if you do know of any, please share them with me). So I decided to write one for our reference. For this purpose, I have read the papers and the code (mostly from TensorFlow and Keras) to come up with these vizzes.
Here I’d like to add that the plethora of CNN architectures we see in the wild is a result of many things: improved computer hardware, the ImageNet competition, solving specific tasks, new ideas and so on. Christian Szegedy, a researcher at Google, once mentioned that
“[m]ost of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures.” (Szegedy et al, 2014)
Now let’s get on with these beasts and observe how network architectures improve over time!
A note on the visualisations
Note that I have excluded information like the number of convolutional filters, padding, stride, dropouts, and the flatten operation in the illustrations.
Fig. 1: LeNet-5 architecture, based on their paper
LeNet-5 is one of the simplest architectures. It has 2 convolutional and 3 fully-connected layers (hence the “5”; it is very common for the names of neural networks to be derived from the number of convolutional and fully-connected layers that they have). What we now know as the average-pooling layer was called a sub-sampling layer, and it had trainable weights (which is no longer common practice when designing CNNs). This architecture has about 60,000 parameters.
This architecture has become the standard ‘template’: stacking convolutions and pooling layers, and ending the network with one or more fully-connected layers.
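As a rough sanity check on the “about 60,000 parameters” figure, here is a minimal sketch that counts weights and biases layer by layer. It assumes a simplified, modern reading of LeNet-5 (a 32×32 greyscale input, every filter in the second convolution connected to all 6 input maps, pooling layers without trainable weights, and a plain 10-way dense output), so it will not match the original paper exactly. The helper names are my own.

```python
def conv_params(kernel, c_in, c_out):
    """Weights plus biases of a kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + c_out


def dense_params(n_in, n_out):
    """Weights plus biases of a fully-connected layer."""
    return n_in * n_out + n_out


# Simplified LeNet-5: 2 convolutional layers followed by 3 fully-connected layers.
lenet5_layers = [
    ("conv1 5x5, 1 -> 6", conv_params(5, 1, 6)),
    ("conv2 5x5, 6 -> 16", conv_params(5, 6, 16)),
    ("fc1 400 -> 120", dense_params(16 * 5 * 5, 120)),
    ("fc2 120 -> 84", dense_params(120, 84)),
    ("fc3 84 -> 10", dense_params(84, 10)),
]

lenet5_total = sum(count for _, count in lenet5_layers)

for name, count in lenet5_layers:
    print(f"{name:<20} {count:>6}")
print(f"{'total':<20} {lenet5_total:>6}")
```

Under these assumptions the total lands a little above 60,000, with most of the parameters sitting in the first fully-connected layer rather than in the convolutions.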
Fig. 2: AlexNet architecture, based on their paper.
With 60M parameters, AlexNet has 8 layers — 5 convolutional and 3 fully-connected. AlexNet just stacked a few more layers onto LeNet-5. At the point of publication, the authors pointed out that their architecture was “one of the largest convolutional neural networks to date on the subsets of ImageNet.”
1. They were the first to implement Rectified Linear Units (ReLUs) as activation functions.
Fig. 3: VGG-16 architecture, based on their paper.
By now you would’ve already noticed that CNNs were starting to get deeper and deeper. This is because the most straightforward way of improving the performance of deep neural networks is by increasing their size (Szegedy et al.). The folks at Visual Geometry Group (VGG) invented VGG-16, which has 13 convolutional and 3 fully-connected layers, carrying with them the ReLU tradition from AlexNet. This network stacks more layers onto AlexNet and uses smaller filters (3×3 convolutions, with 2×2 pooling windows). It consists of 138M parameters and takes up about 500MB of storage space 😱. They also designed a deeper variant, VGG-19.
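To see where the 138M parameters and the roughly 500MB come from, here is a small sketch that walks through a VGG-16-style layer list. It assumes the commonly used configuration (a 224×224 RGB input, 3×3 convolutions with biases, 2×2 pooling that halves the spatial size, two 4096-unit dense layers and a 1000-class output) and 32-bit floats for storage. The names `VGG16_CONV`, `VGG16_DENSE` and `count_vgg16_params` are mine, not from any library.

```python
# "M" marks a 2x2 max-pooling layer; numbers are output channels of 3x3 convolutions.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_DENSE = [4096, 4096, 1000]


def count_vgg16_params(input_size=224, in_channels=3):
    """Count weights and biases for the assumed VGG-16 configuration."""
    total = 0
    channels, size = in_channels, input_size
    for item in VGG16_CONV:
        if item == "M":
            size //= 2  # pooling has no parameters, it only shrinks the feature map
        else:
            total += 3 * 3 * channels * item + item
            channels = item
    features = channels * size * size  # flattened input to the first dense layer
    for units in VGG16_DENSE:
        total += features * units + units
        features = units
    return total


vgg16_total = count_vgg16_params()
vgg16_megabytes = vgg16_total * 4 / 1e6  # 4 bytes per 32-bit float

print(f"parameters: {vgg16_total:,}")
print(f"storage:    {vgg16_megabytes:.0f} MB")
```

Most of the weight sits in the first fully-connected layer, which is one reason later architectures moved away from large dense heads.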
Fig. 4: Inception-v1 architecture. This CNN has two auxiliary networks (which are discarded at inference time). Architecture is based on Figure 3 in the paper.
This 22-layer architecture with 5M parameters is called the Inception-v1. Here, the Network In Network (see Appendix) approach is heavily used, as mentioned in the paper. This is done by means of ‘Inception modules’. The design of the architecture of an Inception module is a product of research on approximating sparse structures (read paper for more!). Each module presents 3 ideas:
It is worth noting that “[t]he main hallmark of this architecture is the improved utilisation of the computing resources inside the network.”
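One way an Inception module saves computation is by placing 1×1 convolutions in front of the expensive 3×3 and 5×5 ones to shrink the number of channels first. The sketch below compares the weight count of a module with and without these 1×1 reductions. The channel counts are illustrative values I chose to resemble an early Inception module, biases are ignored, and the function names are hypothetical.

```python
def conv_weights(kernel, c_in, c_out):
    """Weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out


C_IN = 192  # channels entering the module (illustrative)


def naive_module_weights():
    """Towers applied directly to the input, no 1x1 reductions."""
    return (conv_weights(1, C_IN, 64)
            + conv_weights(3, C_IN, 128)
            + conv_weights(5, C_IN, 32))


def reduced_module_weights():
    """Same towers, but 3x3 and 5x5 are preceded by 1x1 reductions."""
    tower_1x1 = conv_weights(1, C_IN, 64)
    tower_3x3 = conv_weights(1, C_IN, 96) + conv_weights(3, 96, 128)
    tower_5x5 = conv_weights(1, C_IN, 16) + conv_weights(5, 16, 32)
    tower_pool = conv_weights(1, C_IN, 32)  # 1x1 projection after pooling
    return tower_1x1 + tower_3x3 + tower_5x5 + tower_pool


naive = naive_module_weights()
reduced = reduced_module_weights()
print(f"naive:   {naive:,} weights")
print(f"reduced: {reduced:,} weights")
```

Even though the reduced version has more layers (including an extra projection after the pooling tower), it needs fewer than half the weights of the naive one in this example.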
The names of the modules (Stem and Inception) were not used for this version of Inception until its later versions i.e. Inception-v4 and Inception-ResNets. I have added them here for easy comparison.
Fig. 5: Inception-v3 architecture. This CNN has an auxiliary network (which is discarded at inference time). *Note: All convolutional layers are followed by batch norm and ReLU activation. Architecture is based on their GitHub code.
Inception-v3 is a successor to Inception-v1, with 24M parameters. Wait where’s Inception-v2? Don’t worry about it — it’s an earlier prototype of v3 hence it’s very similar to v3 but not commonly used. When the authors came out with Inception-v2, they ran many experiments on it, and recorded some successful tweaks. Inception-v3 is the network that incorporates these tweaks (tweaks to the optimiser, loss function and adding batch normalisation to the auxiliary layers in the auxiliary network).
The motivation for Inception-v2 and Inception-v3 is to avoid representational bottlenecks (a bottleneck here means drastically reducing the input dimensions of the next layer) and to have more efficient computations by using factorisation methods.
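To make ‘factorisation’ a bit more concrete, here is a minimal sketch of the arithmetic, assuming the two factorisations I understand the Inception-v3 paper to use: replacing a 5×5 convolution with two stacked 3×3 convolutions, and replacing a 3×3 convolution with a 1×3 followed by a 3×1. It assumes the same number of channels going in and out of every layer and ignores biases. The channel count of 64 is arbitrary.

```python
def conv_weights(kernel_h, kernel_w, channels):
    """Weights in a kernel_h x kernel_w convolution with equal input and output channels."""
    return kernel_h * kernel_w * channels * channels


channels = 64  # arbitrary; the savings below do not depend on it

# 5x5 replaced by two stacked 3x3 convolutions (same 5x5 receptive field).
weights_5x5 = conv_weights(5, 5, channels)
weights_two_3x3 = 2 * conv_weights(3, 3, channels)
saving_two_3x3 = 1 - weights_two_3x3 / weights_5x5

# 3x3 replaced by a 1x3 convolution followed by a 3x1 convolution.
weights_3x3 = conv_weights(3, 3, channels)
weights_asymmetric = conv_weights(1, 3, channels) + conv_weights(3, 1, channels)
saving_asymmetric = 1 - weights_asymmetric / weights_3x3

print(f"two 3x3 instead of 5x5: {saving_two_3x3:.0%} fewer weights")
print(f"1x3 + 3x1 instead of 3x3: {saving_asymmetric:.0%} fewer weights")
```

The factorised versions cover the same receptive field with fewer weights, and each extra layer also adds another non-linearity.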
The names of the modules (Stem, Inception-A, Inception-B etc.) were not used for this version of Inception until its later versions i.e. Inception-v4 and Inception-ResNets. I have added them here for easy comparison.
✨What’s improved from previous version, Inception-v1?
Fig. 6: ResNet-50 architecture, based on the GitHub code from keras-team.
Yes, it’s the answer to the question you see on the top of the article here (“what architecture is this?”).
Over the past few CNNs, we have seen nothing but an increasing number of layers in the design, along with better performance. But “with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly.” The folks from Microsoft Research addressed this problem with ResNet, which uses skip connections (a.k.a. shortcut connections, residuals) while building deeper models.
ResNet is one of the early adopters of batch normalisation (the batch norm paper authored by Ioffe and Szegedy was submitted to ICML in 2015). Shown above is ResNet-50, with 26M parameters.
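The idea of a skip connection is small enough to sketch in a few lines of plain Python. This is a toy illustration on flat lists of numbers, not the actual ResNet code: `residual_fn` stands in for the stack of convolutions inside a block, and `shortcut_fn` is a hypothetical hook for blocks that transform the shortcut as well (leaving it as `None` gives a plain identity shortcut).

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn, shortcut_fn=None):
    """Compute relu(F(x) + shortcut(x)), where the shortcut defaults to the identity."""
    shortcut = x if shortcut_fn is None else shortcut_fn(x)
    fx = residual_fn(x)
    if len(fx) != len(shortcut):
        raise ValueError("residual and shortcut must have the same shape to be added")
    return relu([f + s for f, s in zip(fx, shortcut)])


x = [1.0, 2.0, 3.0]

# If the layers inside the block output zeros, the block just passes x through.
passthrough = residual_block(x, lambda v: [0.0] * len(v))

# Otherwise the layers only need to learn a correction on top of x.
corrected = residual_block(x, lambda v: [0.5 * a for a in v])

print(passthrough)
print(corrected)
```

Because a block can fall back to passing its input through unchanged, stacking more blocks should not make things worse in principle, which is the intuition behind training much deeper models this way.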
The basic building blocks for ResNets are the conv and identity blocks. Because they look alike, you might simplify ResNet-50 like this (don’t quote me on this!):
⭐️ What’s novel?
Fig. 7: Xception architecture, based on the GitHub code from keras-team. Depthwise separable convolutions are denoted by ‘conv sep.’
Xception is an adaptation of Inception, where the Inception modules have been replaced with depthwise separable convolutions. It also has roughly the same number of parameters as Inception-v3 (23M).
Xception takes the Inception hypothesis to an eXtreme (hence the name). What’s the Inception hypothesis again? Thank goodness this was explicitly and concisely mentioned in this paper (thanks François!).
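Taking this idea (that cross-channel correlations and spatial correlations can be mapped separately) to an extreme means performing a 1×1 convolution across the channels, then performing a separate 3×3 convolution on each output channel. This is almost identical to replacing the Inception module with depthwise separable convolutions.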
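To get a feel for why depthwise separable convolutions are attractive, here is a rough weight count comparing one against a standard convolution. The sketch assumes the usual library formulation (one k×k filter per input channel, followed by a 1×1 convolution that mixes channels), ignores biases, and uses arbitrary channel counts of 128 in and 256 out.

```python
def standard_conv_weights(kernel, c_in, c_out):
    """Every output channel looks at every input channel through a kernel x kernel window."""
    return kernel * kernel * c_in * c_out


def separable_conv_weights(kernel, c_in, c_out):
    """Depthwise step (one kernel x kernel filter per input channel) plus a 1x1 pointwise step."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_weights(3, 128, 256)
separable = separable_conv_weights(3, 128, 256)

print(f"standard 3x3:  {standard:,} weights")
print(f"separable 3x3: {separable:,} weights")
print(f"ratio:         {standard / separable:.1f}x")
```

With these numbers the separable version needs several times fewer weights for the same input and output shapes.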
Fig. 8: Inception-v4 architecture. This CNN has an auxiliary network (which is discarded at inference time). *Note: All convolutional layers are followed by batch norm and ReLU activation. Architecture is based on their GitHub code.
The folks from Google strike again with Inception-v4, 43M parameters. Again, this is an improvement from Inception-v3. The main difference is the Stem group and some minor changes in the Inception-C module. The authors also “made uniform choices for the Inception blocks for each grid size.” They also mentioned that having “residual connections leads to dramatically improved training speed.”
All in all, the authors noted that Inception-v4 works better because of its increased model size.
✨What’s improved from the previous version, Inception-v3?
Fig. 9: Inception-ResNet-V2 architecture. *Note: All convolutional layers are followed by batch norm and ReLU activation. Architecture is based on their GitHub code.
In the same paper as Inception-v4, the same authors also introduced Inception-ResNets: a family consisting of Inception-ResNet-v1 and Inception-ResNet-v2. The latter member of the family has 56M parameters.
✨What’s improved from the previous version, Inception-v3?
Fig. 10: ResNeXt architecture, based on their paper.
If you’re thinking about ResNets, yes, they are related. ResNeXt-50 has 25M parameters (ResNet-50 has 25.5M). What’s different about ResNeXts is the addition of parallel towers/branches/paths within each module, indicated above by ‘total 32 towers.’
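It may seem odd that adding 32 towers leaves the parameter count almost unchanged. A quick back-of-the-envelope sketch shows why: each tower is very narrow. The numbers below assume a block with 256 channels in and out, a ResNet bottleneck that is 64 channels wide, and ResNeXt towers that are 4 channels wide each (my reading of the ‘32×4d’ setting). Biases and batch norm parameters are ignored, and the function names are my own.

```python
def resnet_bottleneck_weights(channels=256, width=64):
    """1x1 reduce, 3x3, then 1x1 expand, all in a single path."""
    return channels * width + 3 * 3 * width * width + width * channels


def resnext_block_weights(channels=256, towers=32, tower_width=4):
    """The same three steps, repeated in many narrow parallel towers."""
    per_tower = (channels * tower_width
                 + 3 * 3 * tower_width * tower_width
                 + tower_width * channels)
    return towers * per_tower


resnet_weights = resnet_bottleneck_weights()
resnext_weights = resnext_block_weights()

print(f"ResNet bottleneck: {resnet_weights:,} weights")
print(f"ResNeXt block:     {resnext_weights:,} weights")
```

Both blocks come out at roughly 70,000 weights, so the towers change how the capacity is organised rather than how much of it there is.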
⭐️ What’s novel?
Recall that in a convolution, the value of an output pixel is a linear combination of the weights in a filter and the current sliding window. The authors proposed that instead of this linear combination, let’s have a mini neural network with 1 hidden layer. This is what they called Mlpconv. So what we’re dealing with here is a (simple 1 hidden layer) network in a (convolutional neural) network.
This idea of Mlpconv is likened to 1×1 convolutions, and became the main feature for Inception architectures.
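Here is a toy sketch of the difference for a single output pixel, using plain Python lists. It is only meant to illustrate the idea: the sliding window is flattened into a list, the hidden layer uses a ReLU (an assumption on my part), and all weights are made-up numbers.

```python
def linear_filter(patch, weights, bias=0.0):
    """Ordinary convolution at one position: a weighted sum over the window."""
    return sum(w * p for w, p in zip(weights, patch)) + bias


def mlpconv(patch, hidden_weights, hidden_biases, out_weights, out_bias=0.0):
    """Mlpconv-style output at one position: a tiny MLP with 1 hidden layer."""
    hidden = [max(0.0, linear_filter(patch, w, b))
              for w, b in zip(hidden_weights, hidden_biases)]
    return linear_filter(hidden, out_weights, out_bias)


patch = [1.0, -2.0, 3.0, 0.5]  # a flattened 2x2 sliding window

conv_value = linear_filter(patch, [0.5, 0.5, 0.5, 0.5])

mlp_value = mlpconv(
    patch,
    hidden_weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],  # 2 hidden units
    hidden_biases=[0.0, 0.0],
    out_weights=[2.0, 3.0],
)

print(conv_value)
print(mlp_value)
```

The second layer of the mini network only mixes the hidden units computed at the same position, which is why it can be likened to a 1×1 convolution applied after an ordinary one.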
Let’s show them here again for easy reference: