Introduction

Serving as the primary healthcare system for about \(80\%\) of the world’s population, especially in developing countries1, and seeing increasing use in developed countries2 such as France and Germany, herbal medicine has gained growing popularity in today’s medical practice due to its cultural acceptability, availability, affordability, and claims of efficacy and safety3.

Herbal medicine, as defined by the World Health Organization (WHO), is the practice that includes herbs, herbal materials, herbal preparations and herbal products containing, as active ingredients, parts of plants, other plant materials, or combinations thereof3. Among the four forms of herbal medicine, herbs are basic and distinctly important, as they are the source and main component of the other forms. Therefore, quality control of herbs becomes a key problem, which directly influences the acceptability and efficacy of herbal products4,5,6. Meanwhile, the worldwide upsurge in the use of herbal medicine calls for high standards of herb quality, for which herb recognition, i.e. identifying the herb category and providing a herb quality rating, is the key step.

Herb recognition faces great challenges. Firstly, herb quality is closely related to many factors, such as temperature, use of fresh plants, light exposure, nutrients, water availability, period and time of harvest, method of harvesting, drying, packing, storage and transportation. Secondly, herbs are usually produced from plants through complicated processing, so their appearance can differ greatly from their original forms. Thirdly, there are thousands of common herbs worldwide, which makes it difficult and time-consuming to recognize all the categories. Over the past decades, two primary solutions have been proposed, namely manual recognition and automatic recognition.

In manual recognition, one popular method is herb fingerprint technology (Fig. 1b). It extracts the biological fingerprint of herbs through complicated chemical steps, including herb splitting, Gas Chromatography-Mass Spectrometry (GC-MS) analysis and fingerprint generation, and the fingerprint is then compared with the template of each category based on their similarity7,8. Because of its reliability and accuracy in herb recognition, it has been accepted as a standard pipeline worldwide3. Another manual solution is the experience-based method, wherein professionals draw on their long-term experience by looking, smelling, tasting, touching and hearing (Fig. 1c). However, manual recognition has its limitations. Firstly, it cannot work without professionals with rich knowledge and experience, or without chemical materials and devices that may not be available in resource-limited settings. Secondly, it costs too many resources and too much time to recognize thousands of herbs manually.

Figure 1
figure 1

An illustration on how the deep learning-driven mobile application works for herb recognition. Given some herbs (a), manual methods use fingerprint technology (b) or expert experience (c) for recognition. We consider the automatic way to recognize the herb image with a deep learning method on a single smartphone (d), which can provide candidate decisions to speed up the manual recognition.

To alleviate these drawbacks of manual recognition, automatic recognition has been introduced, based on the idea of utilizing herb images to quickly provide candidate categories that assist manual recognition and speed up the final decision (Fig. 1d). In conventional approaches that differentiate herbs according to color, shape and texture, the accuracy is unsatisfactory because these appearance properties are only low-level9,10,11. Recently, inspired by the development of image recognition in computer vision12,13,14,15,16 and medical areas17,18,19,20,21,22,23, deep learning based methods have shown great improvements24,25,26,27,28,29, which makes it feasible to provide candidate decisions with satisfactory accuracy. Most studies have used a Deep Neural Network (DNN) combined with powerful and expensive computation hardware to support the computation. This combination is commonly used in hospitals and laboratories, and it aims at a high degree of system automation and high system stability.

However, due to the high price of the hardware, it is not suited to dispensaries or hospitals in resource-limited settings. Simply reducing the hardware requirements is not feasible, as DNNs are usually large and heavily designed, and cost too much computation. Therefore, the key to our application design is to obtain an efficient and accurate small DNN that can be deployed on resource-limited devices. One intuitive idea is to use a small DNN directly. However, its recognition ability will be weakened due to its fewer parameters, which leads to a large drop in accuracy even though the efficiency can be largely improved. As a result, we shifted to another direction: can we compress the large DNN into a small one? Though cutting a small portion of the parameters of the large DNN can work in some cases30,31,32, this approach still has three main limitations. Firstly, cutting parameters cannot change the inner network structure, but only makes the network smaller, which may not achieve the expected efficiency in low-cost situations. Secondly, due to the huge gap between the large and small DNN in the number of parameters, cutting a large portion of parameters leads to an obvious drop in recognition accuracy. Thirdly, the large gap also makes it hard to control precisely how much to cut to reach the target level. To meet these challenges, we propose a three-step network compression algorithm, with each step addressing one limitation in order. Some recent studies show that specific network structures are appropriate for low-cost situations33,34, and DNNs with these structures can obtain satisfying accuracy with fewer parameters and higher efficiency, which matches our motivation perfectly. Inspired by this, our first motivation is to use a separate small DNN with these efficient structures. Then, given that parameter cutting cannot be carried out across different network structures, our second motivation is to transfer the recognition ability from the large DNN to the small one to preserve accuracy. Finally, within the same structure, our third motivation is to precisely control the cutting of the small DNN to achieve the best trade-off between accuracy and efficiency.

In this paper, we propose a novel deep learning based network compression algorithm to compress a heavily-designed DNN into a small one. Further, we develop a deep learning-driven mobile application for herb image recognition, which runs entirely on a single smartphone. The application runs in three steps (Fig. 1d). Firstly, image processing is applied to each herb image in preparation for DNN computation. Then, a DNN is run on the image to obtain confidence scores for each herb category. Finally, the scores are ranked and the top K herb categories are displayed as recognition results. The key step is the second one, wherein the proposed network compression algorithm guarantees fast response and accurate recognition for a good user experience. Experimental results on a public herb image dataset24 show that the compressed small DNN runs much faster (\(4X{-}5X\)) at a near-realtime speed of around 20 Frames Per Second (FPS) on common smartphones, while it still maintains a recognition accuracy quite comparable to the original small DNN. Without requiring any additional hardware, the whole recognition takes place entirely on a common smartphone and can work completely offline (without internet), which is perfectly suitable for resource-limited settings.

The main contribution of the App in this study is to facilitate the implementation of herb recognition in resource-limited laboratories, hospitals and dispensaries. Firstly, it is not designed to compete with large DNNs on powerful hardware, which pay more attention to a high degree of automation. Secondly, it is not meant to compete with manual herb recognition; our objective is to alleviate the need for expert resources and chemical materials by being a qualified assistant that provides candidate decisions to accelerate the manual process (Fig. 1). Moreover, we believe this study can promote the accessibility of herbal medicine worldwide and possibly facilitate the collection of valuable herbal medicine data, which has long been a key problem in artificial intelligence and deep learning.

In the following, we explore a deep learning based network compression algorithm for efficient and robust herb image recognition on a single smartphone. Then, we demonstrate that the accuracy of the compressed small DNN is quite consistent with that obtained by the original small DNN. The full procedure is evaluated on a public herb image dataset, and an additional large-scale image classification dataset is also adopted to further validate the effectiveness and generality of the algorithm.

Related work

In automatic recognition of herbs, researchers first utilized herb images and analyzed their low-level appearance for recognition. Chen et al.9 construct a color matching template by integrating two observation surfaces of herb pieces, and they show some anti-interference invariance in rotation, shape and color. Liu et al.10 extract texture features by converting color images to gray-scale images, wherein the gray level co-occurrence matrix is used for recognition. To distinguish herbs with similar appearance, Ming et al.11 combine Raman spectroscopy with the support vector machine. However, these studies mainly focus on a few herb categories, because low-level properties cannot handle the large variations across many categories. Recently, inspired by the development of image recognition in computer vision12,13,14,15,16 and medical areas17,18,19,20,21,22,23, deep learning based methods have shown great improvements in herb recognition24,25,26,27,28,29. Sun et al.24 apply a convolutional neural network to herb image classification, and they also release a herb image dataset with 95 categories. Vo et al.25 extract image features with a VGG16-based network13, and the features work well with the LightGBM classifier. To incorporate expert knowledge, Lai et al.26 extract a set of selected traditional features, which are further combined with deep features for joint classification. To deal with background noise, Zhu et al.28 propose a two-way attention method, which combines the instance and the category for better discrimination. To assess the feasibility of automated machine learning, Chen et al.27 use an open-source platform and build a dataset with 315 commonly used herb categories, but the dataset has not been released yet. To popularize knowledge of herbal medicine, Weng et al.29 develop a smartphone application for herb image classification, but the entire process runs on a cloud server. The above studies all use a Deep Neural Network (DNN) combined with powerful and expensive computation hardware to support the computation. However, due to the high price of the hardware, this is not suited to dispensaries or hospitals in resource-limited settings.

To enable the use of DNNs on resource-limited devices, one primary method is to design efficient network structures. One series of studies is the MobileNets. Howard et al.33 propose a lightweight DNN, namely MobileNet-V1, for embedded vision applications, whose main contribution is the depth-wise separable convolution. To further improve efficiency and accuracy, Sandler et al. design MobileNet-V234, wherein an inverted residual structure is introduced and non-linearities in the narrow layers are removed to maintain representational power. To explore automated search algorithms in network design, Howard et al. further propose MobileNet-V335, which adopts Neural Architecture Search (NAS) to better tune the network for mobile phone CPUs. Another series of studies is the ShuffleNets. Zhang et al. introduce ShuffleNet-V136, which utilizes two new operations, namely pointwise group convolution and channel shuffle, to greatly reduce the computation cost while maintaining accuracy. To guide network design towards direct metrics such as FLOPs, speed and memory access cost, Ma et al. propose ShuffleNet-V237 and derive several practical guidelines for efficient design. In this paper, we adopt MobileNet-V2 as the small DNN for its convenient design for the subsequent network transfer and cut.

Another primary method is network compression, which usually contains two main parts, namely network transfer and network cut. In network transfer, Hinton et al.38 propose to transfer the logits from the large DNN to the small DNN. To incorporate context information within a single layer or between layers, Zagoruyko et al.39 and Yim et al.40 construct attention maps and pairwise relations to improve the accuracy of the small DNN. However, these studies neglect the relations among samples. Park et al. and Liu et al. propose to model the sample relations through the similarity of pairs41 and triplets42. In this paper, instead of using logits or relations, we adopt the hidden layer for network transfer, as it is a compact representation and can be used between different network structures. In network cut, some researchers propose sparsity based methods32,43,44, but such sparsity is not friendly to hardware acceleration, which requires cutting a whole channel or filter. Based on this fact, weight based methods have been proposed that jointly cut network parameters of all the layers with some weight regularization45,46,47,48,49, but the accuracy drops obviously due to the difficulty of joint optimization. Motivated by feature map approximation50, we propose a top-down layer-wise network cut method, which is further combined with network transfer to minimize the loss of accuracy when cutting each layer.

Methods

In this section, we will give a full introduction on how to obtain an efficient and accurate DNN for use on a smartphone. We first give a brief introduction of the DNN and explain the importance of image processing, and then we present a novel deep learning based network compression algorithm to learn a compressed DNN. Finally, we show how to deploy the compressed DNN on the smartphone.

Brief introduction

In this part, we introduce some key concepts and notations of DNNs that are related to our method. Figure 2a shows a simple DNN, which contains a sequence of data representations and parameter layers, and it is trained with a one-hot label that denotes which category the herb image belongs to. In this simple structure, the feature map, hidden vector, convolution layer and fully-connected layer are important concepts. The former two are data representations, while the latter two are parameter layers.

Figure 2
figure 2

An illustration of a simple DNN for image recognition. (a) A simple DNN consists of the feature map, convolution layer, fully-connected layer and hidden vector; (b) The convolution layer extracts local information to generate a new map; (c) The fully-connected layer aggregates global information to generate the hidden vector.

Feature map

It is a general form of data representation in DNN. It has the form of \(\mathbf{{F}} \in {R^{C \times H \times W}}\), wherein C, H and W are the size of the feature map and they denote channel, height and width respectively. For example, an input RGB image has the data size of \({3 \times H \times W}\), wherein the channel C is 3. For clarity, we denote the feature map in the \({t}\)th layer as \({\mathbf{{F}}_\mathbf{{t}}} \in {R^{{C_t} \times {H_t} \times {W_t}}}\), and we will adopt this notation throughout the paper.

Convolution

It extracts information from local regions, and Fig. 2b shows how it works. For the \({t}\)th layer, the convolution consists of filters \({\mathbf{{P}}_\mathbf{{t}}} \in {R^{{C_t} \times {C_{t - 1}} \times {k_t} \times {k_t}}}\), wherein \({{C_{t - 1}} \times {k_t} \times {k_t}}\) is the parameter size of a single filter and \({{C_t}}\) is the number of filters. In particular, \({{C_{t - 1}}}\) must equal the number of channels in the input feature map \({\mathbf{{F}}_{\mathbf{{t - 1}}}}\). Each filter convolves with local regions in \({\mathbf{{F}}_{\mathbf{{t - 1}}}}\) at every s locations (i.e. with stride s) to form a new map, and the maps of all \({{C_t}}\) filters are stacked together to generate the output map \({{\mathbf{{F}}}_\mathbf{{t}}} \in {R^{{C_t} \times {H_t} \times {W_t}}}\).
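To make the shape bookkeeping concrete, the following is a minimal PyTorch sketch of the relation between \({\mathbf{{P}}_\mathbf{{t}}}\), \({\mathbf{{F}}_{\mathbf{{t - 1}}}}\) and \({\mathbf{{F}}_\mathbf{{t}}}\); the channel counts, spatial size, kernel size and stride are illustrative values only, not the configuration of our networks.

```python
import torch
import torch.nn as nn

# Illustrative sizes: an input feature map F_{t-1} with C_{t-1} = 16 channels
# and a 56 x 56 spatial extent, in (batch, channel, height, width) layout.
f_prev = torch.randn(1, 16, 56, 56)

# A convolution layer with C_t = 32 filters, each of size C_{t-1} x k x k (k = 3),
# applied at every s = 2 locations (stride).
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=2, padding=1)
f_t = conv(f_prev)

print(conv.weight.shape)  # torch.Size([32, 16, 3, 3]), i.e. C_t x C_{t-1} x k x k
print(f_t.shape)          # torch.Size([1, 32, 28, 28]), i.e. the new map F_t
```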

Fully-connected layer

It is a special form of convolution. Given the input feature map \({\mathbf{{F}}_{\mathbf{{t + 1}}}} \in {R^{{C_{t + 1}} \times {H_{t + 1}} \times {W_{t + 1}}}}\), the layer has filters \(\mathbf{{P}} \in {R^{D \times {C_{t + 1}} \times {H_{t + 1}} \times {W_{t + 1}}}}\), and these filters act the same as the ones in convolution. Different from convolution, however, the layer generates a special form of feature map \(\mathbf{{H}} \in {R^D}\), wherein the height and width both equal 1, as shown in Fig. 2c.

Hidden vector

We refer to the output feature map \(\mathbf{{H}}\) of the fully-connection layer as the hidden vector throughout the paper. The reason we transform the feature map into the hidden vector is to enhance the recognition ability with global image information, and meanwhile prepare the data form for training.
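As a minimal sketch (with illustrative sizes only; the channel count and output dimension D are assumptions for the example), the fully-connected layer and the resulting hidden vector can be written in PyTorch as follows.

```python
import torch
import torch.nn as nn

# Illustrative sizes: a feature map with C = 32 channels and a 7 x 7 spatial extent.
f = torch.randn(1, 32, 7, 7)

# A fully-connected layer is a convolution whose kernel covers the whole map,
# so its output has height = width = 1 and can be read as a hidden vector H in R^D.
fc_as_conv = nn.Conv2d(in_channels=32, out_channels=128, kernel_size=7)  # D = 128
h = fc_as_conv(f).flatten(1)          # shape (1, 128): the hidden vector H

# Equivalently (up to a reshape of the weights), flatten the map and use nn.Linear.
fc = nn.Linear(32 * 7 * 7, 128)
h_alt = fc(f.flatten(1))              # shape (1, 128)
```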

Image processing

Given a herb image, the objective of image processing is twofold. Firstly, it prepares the image for DNN computation. The image size varies a lot due to different hardware and resolutions, but the input size of the DNN is fixed, e.g. \(224 \times 224\) throughout our evaluation. Therefore, it is necessary to process the image to the target size, and this rule applies in both the training and testing phases. Secondly, image processing can be used as an effective data augmentation method to improve the robustness of recognition. We train and evaluate our method on the largest public herb image dataset24, which contains 95 herb categories with 5640 herb images, but it is actually a small-scale dataset compared to some large ones in computer vision51. Training a DNN on this dataset without any data augmentation will easily cause over-fitting, which degrades the generality of herb image recognition. In particular, this rule applies mainly in the training phase, while the processing in the testing phase is quite simple.

In the training phase, inspired by the data augmentation method in computer vision15,16, we use four types of augmentations in order: random scaling, random ratio, random cropping and resizing, as shown in Fig. 3a.

Figure 3
figure 3

An illustration of data augmentation for training and testing. (a) In training, the augmentation includes random scaling, random ratio, random cropping and resizing, and these four steps are processed in order; (b) In testing, only resizing and center-cropping are used.

Random scaling

The motivation is to adapt to different herb scales caused by the varying distance between the herbs and the smartphone. As the smartphone moves away from the herbs, their size in the image becomes smaller, which is effectively a zoom-out operation. This operation changes only the image scale, which is randomly selected, and keeps the width-height ratio fixed, as shown in the “Scale” column in Fig. 3a.

Random ratio

The motivation is to adapt to different herb shapes caused by the varying angle between the herbs and the smartphone. As the smartphone rotates around the herbs, their shape in the image changes, i.e. the width-height ratio varies. This operation modifies only the width-height ratio, which is randomly selected, and keeps the image area unchanged, as shown in the “Ratio” column in Fig. 3a.

Random cropping

The motivation is to fully learn the representation ability from every herb detail. Cropping is equivalent to concentrating on a local region, which is effectively a zoom-in operation. In this way, the DNN has to learn to correctly recognize the herb category from different local regions, which increases its ability to learn from every part of the herbs. Because the crop location and region size are randomly selected, this operation changes both the image scale and the width-height ratio, as shown in the “Crop” column in Fig. 3a.

Resize

As previously mentioned, the motivation is to produce the fixed input size for DNN computation, as shown in the “Resize” column in Fig. 3a.

For each training iteration, the above steps are processed in order to get the input image for DNN computation. With the randomness in each step, the DNN can see various combinations in training, which can largely improve the generality of the herb image recognition. In the testing phase, we only adopt resizing and center-cropping, as used in15. The testing image is first resized to \(256 \times 256\), then the center region with \(224 \times 224\) is cropped to feed the DNN for recognition, as shown in Fig. 3b. Different from the training phase, we want the testing condition to be simple by assuming the herbs are at the near-center region of the image.
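As a minimal sketch, the four training-time steps and the two testing-time steps map naturally onto standard torchvision transforms: random scaling, random ratio, random cropping and resizing are exactly what RandomResizedCrop combines (with the scale and ratio ranges given later in the experimental setting). The normalization statistics below are the common ImageNet values and are an assumption, as the text only specifies normalization by mean and variance.

```python
import torchvision.transforms as T

# Training: random scaling, random ratio, random cropping and resizing,
# combined in RandomResizedCrop, followed by float conversion and normalization.
train_tf = T.Compose([
    T.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3)),
    T.ToTensor(),                            # float image in [0, 1]
    T.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])

# Testing: only resizing to 256 x 256 and center-cropping to 224 x 224.
test_tf = T.Compose([
    T.Resize((256, 256)),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```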

Network compression

In this part, we introduce the deep learning based network compression algorithm, which compresses the large DNN into a small one that runs efficiently while maintaining considerable accuracy. The algorithm contains three main steps, namely network pre-train, network transfer and network cut. One may wonder why we need an extra cutting step instead of directly designing a DNN that is small enough. We will explain this later and validate it in the results part.

Network pre-train

Network pre-training is a strategy that first trains the DNN on a large-scale dataset, and then, starting from this initialization, further trains the DNN on the target dataset. The advantage of this strategy is twofold. Firstly, pre-training is a regularization technique, and it improves recognition generalization ability. Since the DNN is first exposed to a large amount of data, its parameters are carried to a region that is more likely to represent the overall data distribution rather than over-fitting a specific subset of the underlying data distribution52,53. Secondly, pre-training can improve training stability and speed up convergence. Deep neural networks, especially those with high representation capacity and huge numbers of parameters, tend to be vulnerable to random parameter initialization12. Pre-training initializes the parameters in a supervised way, which provides a good starting point for stable training.

In our strategy, we adopt the ImageNet dataset51 as the large-scale set and pre-train the DNN on it, and then the DNN is further trained on the herb image dataset. The ImageNet dataset is a large-scale image recognition dataset containing 1000 common object categories and 1.2M images51. Though it is not targeted at herb image recognition, it shares general representations, which brings better recognition generality. In particular, this pre-training rule applies to both the large and small DNN.

Let \({\mathbf{{X}}_\mathbf{{i}}} \in {R^{3 \times H \times W}} \;\; (i = 1,...,N)\) be the input RGB image, wherein H and W are the height and width, and N is the number of herb images. For each image \({\mathbf{{X}}_\mathbf{{i}}}\), the recognition label is defined as \({\mathbf{{Y}}_\mathbf{{i}}} \in {R^C}\), which is a one-hot vector with C herb categories. We denote the pre-trained large and small DNN as \({W_L}\) and \({W_S}\) respectively, and they are further trained on the herb image dataset with the cross-entropy loss in a mini-batch way:

$$\begin{aligned} \begin{array}{l} L({W_L}) = \min {{\sum \nolimits _{\mathrm{{i}} = 1}^B { - {\mathbf{{Y}}_\mathbf{{i}}}\log \phi ({\mathbf{{X}}_\mathbf{{i}}})} } \big / B},\\ L({W_S}) = \min {{\sum \nolimits _{\mathrm{{i}} = 1}^B { - {\mathbf{{Y}}_\mathbf{{i}}}\log \varphi ({\mathbf{{X}}_\mathbf{{i}}})} } \big / B} \end{array} \end{aligned}$$
(1)

wherein \({\phi ({\mathbf{{X}}_\mathbf{{i}}})}\) and \({\varphi ({\mathbf{{X}}_\mathbf{{i}}})}\) are the softmax output of the large and small DNN respectively, and B is the training batchsize. In this step, both DNNs are trained individually, i.e. without any interaction between them, as shown in Fig. 4a. With enough training, they can learn some recognition ability for herb image recognition.
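As a minimal sketch of this step, assuming a data loader that yields image batches with integer class labels (equivalent to the one-hot labels \({\mathbf{{Y}}_\mathbf{{i}}}\) in Eq. (1)) and assuming an older torchvision API for the pre-trained weights, the fine-tuning of each DNN could look as follows; the model-head replacements are assumptions based on the standard ResNet-50 and MobileNet-V2 definitions.

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 95  # herb categories

# Start from ImageNet pre-trained weights and replace the classification heads.
large_dnn = models.resnet50(pretrained=True)
large_dnn.fc = nn.Linear(large_dnn.fc.in_features, num_classes)
small_dnn = models.mobilenet_v2(pretrained=True)
small_dnn.classifier[1] = nn.Linear(small_dnn.last_channel, num_classes)

criterion = nn.CrossEntropyLoss()  # the batch-averaged -Y_i log softmax(X_i) of Eq. (1)
optimizer = torch.optim.SGD(small_dnn.parameters(), lr=0.01,   # settings from the
                            momentum=0.9, weight_decay=0.0005)  # experimental section

def train_step(model, optimizer, images, labels):
    """One mini-batch update of Eq. (1); images is (B, 3, 224, 224)."""
    loss = criterion(model(images), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```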

Figure 4
figure 4

An illustration of the pipeline for the network compression algorithm. (a) Firstly, the large and small DNN are pre-trained on the ImageNet dataset and further trained on the herb image dataset, and they are trained individually; (b) Secondly, the recognition ability of the large DNN is transferred to the small DNN, as shown by the dotted red line; (c) Finally, network cutting is processed in a top-down way from the top layer to the bottom layer; (d) The network cutting for a single layer, wherein some unimportant filters are removed.

Network transfer

After the pre-training, the large DNN is fixed and all the subsequent steps operate only on the small DNN. Due to its much fewer parameters, the small DNN cannot learn the recognition ability as well as the large one, so its recognition accuracy is usually lower. To reduce the accuracy gap, we propose to transfer the recognition ability from the large DNN to the small one.

Some studies in computer vision have found that, given the same input image, if the feature map of the small DNN is close to that of the large DNN, then their predicted categories will also be similar, and thus their recognition ability can be comparable39,40,41,42,54,55. Based on this understanding, we formulate the transfer process as matching the feature maps of the two networks at a specific layer. As each network has dozens of layers, one question is which layer to use. Inspired by the fact that higher-level layers have better semantic representation38,52,53,55, we propose to use the hidden vector \(\mathbf{{H}}\), as it is a compact and high-level representation. Let \({\mathbf{{H}}_\mathbf{{L}}} \in {R^D}\) and \({\mathbf{{H}}_\mathbf{{S}}} \in {R^D}\) be the hidden vectors of the large and small DNN respectively; then the network transfer on the small DNN can be formulated with an additional transfer loss as follows:

$$\begin{aligned} L({W_S}^K\left| {{W_S}} \right. ) = \min {{\sum \nolimits _{i = 1}^B {( - {\mathbf{{Y}}_\mathbf{{i}}}\log \varphi ({\mathbf{{X}}_\mathbf{{i}}}) + \lambda \left\| {{\mathbf{{H}}_\mathbf{{S}}} - {\mathbf{{H}}_\mathbf{{L}}}} \right\| _2^2)} } \big / B}, \end{aligned}$$
(2)

wherein the transferred small DNN is denoted as \({W_S}^K\), and it is trained based on the previously trained \({{W_S}}\). Besides, \(\lambda\) is the weight of the network transfer that controls how much ability should be transferred from the large DNN. In this way, the recognition ability can be transferred to the small DNN, which maintains the high accuracy of the large DNN but runs faster at the same time, as shown in Fig. 4b.
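As a minimal sketch of Eq. (2), assuming both hidden vectors share the same dimension D as stated above (if the two networks produced hidden vectors of different sizes, an extra projection layer would be needed, which is an assumption), the transfer loss could be written as:

```python
import torch.nn.functional as F

def transfer_loss(small_logits, h_small, h_large, labels, lam=1.0):
    """Eq. (2): cross-entropy on the small DNN plus an L2 term pulling H_S towards H_L.

    small_logits: (B, C) class scores of the small DNN.
    h_small:      (B, D) hidden vector H_S of the small DNN.
    h_large:      (B, D) hidden vector H_L of the fixed large DNN.
    """
    ce = F.cross_entropy(small_logits, labels)                     # batch-averaged -Y_i log(...)
    match = ((h_small - h_large.detach()) ** 2).sum(dim=1).mean()  # ||H_S - H_L||_2^2 over the batch
    return ce + lam * match                                        # lam is the transfer weight lambda
```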

Network cut

The objective of this step is to improve network efficiency by cutting its parameters, which reduces the amount of computation. As mentioned before, one may ask why we need an extra cutting step instead of using a DNN that is already small enough. Actually, in network transfer, the gap in the number of parameters between the large and small DNN cannot be too large. As the small DNN only has limited learning ability, a large gap cannot guarantee the efficacy of the transfer, and this observation will be validated in the results part. This implies that the small DNN cannot be too small, but it raises another problem: such a small DNN cannot achieve high efficiency at the same time. Therefore, we propose a network cutting method to achieve the best trade-off between accuracy and efficiency.

Before going into it, we need to ask three questions: (1) which parameters should be cut? (2) what if the accuracy drops after cutting them? (3) do we cut all layers jointly or each layer individually? In fact, the answers constitute the three steps in network cutting. Inspired by some recent studies47,48,56, we will give our answers for these questions as follows.

For the \({t}\)th layer, cutting parameters is equivalent to removing unimportant filters in \({\mathbf{{P}}_\mathbf{{t}}} \in {R^{{C_t} \times {C_{t - 1}} \times k \times k}}\), i.e. to reduce the filter number \({{C_t}}\). To measure the importance of each filter, one intuitive idea is to see how much it affects subsequent feature maps if we remove it, and the unimportant ones will have lower influences. Specifically, assume one filter has been removed, then we will have \({\mathbf{{P}}_\mathbf{{t}}} \in {R^{({C_t} - 1) \times {C_{t - 1}} \times k \times k}}\), \({\mathbf{{F}}_\mathbf{{t}}} \in {R^{({C_t} - 1) \times H \times W}}\), \({\mathbf{{P}}_{\mathbf{{t + 1}}}} \in {R^{{C_{t + 1}} \times ({C_t} - 1) \times k \times k}}\) and \({\mathbf{{F}}_{\mathbf{{t + 1}}}} \in {R^{{C_{t + 1}} \times H \times W}}\). We see that \({\mathbf{{F}}_{\mathbf{{t + 1}}}}\) is the nearest feature map with its size unchanged, thus we use it as the reference map. To measure the importance, we use a subset of herb images and calculate the reconstruction error of the reference map by removing each filter individually:

$$\begin{aligned} Scor{e^{{C_i}}} = {{\sum \nolimits _{m = 1}^M {\left\| {\mathbf{{F}}_{\mathbf{{t + 1}}}^{{C_i}}({\mathbf{{X}}_\mathbf{{m}}}) - {\mathbf{{F}}_{\mathbf{{t + 1}}}}({\mathbf{{X}}_\mathbf{{m}}})} \right\| _2^2} } \big / M}, \end{aligned}$$
(3)

wherein \(\left\{ {{\mathbf{{X}}_\mathbf{{m}}}|m = 1,...,M,M \ll N} \right\}\) is the sampled subset, \({\mathbf{{F}}_{\mathbf{{t + 1}}}^{{C_i}}({\mathbf{{X}}_\mathbf{{m}}})}\) is the feature map for image \({{\mathbf{{X}}_\mathbf{{m}}}}\) by removing the \({i}\)th filter from \({\mathbf{{P}}_\mathbf{{t}}}\), and \(Scor{e^{{C_i}}}\) is the importance score by removing the \({i}\)th filter from \({\mathbf{{P}}_\mathbf{{t}}}\). By ranking the importance scores, we can cut the filters with smaller scores by a given cutting ratio \(\alpha\), i.e. \(\left\lceil {{C_t} \times \alpha } \right\rceil\) filters will be cut and \(\left\lceil {{C_t} \times (1 - \alpha )} \right\rceil\) filters will be reserved, which gives the new filters \({\mathbf{{P}}_\mathbf{{t}}} \in {R^{\left\lceil {{C_t} \times (1 - \alpha )} \right\rceil \times {C_{t - 1}} \times k \times k}}\) for the \({t}\)th layer, as shown in Fig. 4d.
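As a sketch of Eq. (3), removing the \({i}\)th filter of layer t can be simulated by zeroing the \({i}\)th channel of \({\mathbf{{F}}_\mathbf{{t}}}\) before applying layer \({t+1}\); this is exact when layer \({t+1}\) is a plain convolution and only an approximation if, e.g., a bias or batch normalization intervenes. The two callables below are hypothetical stand-ins for slices of the small DNN.

```python
import torch

@torch.no_grad()
def filter_scores(layer_t_out, layer_tp1, images):
    """Eq. (3): score each filter of layer t by the reconstruction error it causes
    in the reference map F_{t+1} when that filter is removed.

    layer_t_out: callable returning F_t for a batch, shape (M, C_t, H_t, W_t).
    layer_tp1:   callable mapping F_t to the reference map F_{t+1}.
    images:      the sampled subset X_1..X_M of herb images (M << N).
    """
    f_t = layer_t_out(images)
    f_ref = layer_tp1(f_t)                       # unmodified reference map F_{t+1}
    scores = []
    for i in range(f_t.shape[1]):                # loop over the C_t filters
        f_cut = f_t.clone()
        f_cut[:, i] = 0                          # approximate removing filter i
        err = ((layer_tp1(f_cut) - f_ref) ** 2).sum(dim=(1, 2, 3)).mean()
        scores.append(err.item())                # Score^{C_i}, averaged over the M images
    return scores                                # filters with the smallest scores are cut
```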

Although the unimportant filters contribute less to the recognition, they still learn some image representation that can be helpful. Therefore, simply removing them will cause a loss of recognition ability43,57. To recover this loss, one common solution is to re-train the after-cut network56,58, which lets the reserved filters fully learn the image representation necessary for recognition. Specifically, we use all the herb images to re-train the after-cut small DNN, and the same training rule as in Eq. (2) is adopted:

$$\begin{aligned} L({W_S}^t{\left| {{W_S}} \right. ^K}) = \min {{\sum \nolimits _{i = 1}^B {( - {\mathbf{{Y}}_\mathbf{{i}}}\log \varphi ({\mathbf{{X}}_\mathbf{{i}}}) + \lambda \left\| {{\mathbf{{H}}_\mathbf{{S}}} - {\mathbf{{H}}_\mathbf{{L}}}} \right\| _2^2)} } \big / B}, \end{aligned}$$
(4)

wherein \({W_S}^t\) is the re-trained small DNN for the \({t}\)th layer, and it is trained from the transferred small DNN \({W_S}^K\) after cutting the filters at the \({t}\)th layer. With this re-training step, the accuracy can be well recovered, and the whole cutting process at the \({t}\)th layer is then finished, as shown in Fig. 4d.

So far, we have answered the first two questions. For the third one, about cutting all layers jointly or each layer individually, we adopt an iterative approach that cuts layers one by one. Assume there are T filter layers in the transferred small DNN \({W_S}^K\); we start the cutting from layer T and proceed all the way down to layer 1. For each layer t, the cutting follows the above two steps with only one difference in Eq. (4): the top layer T (\(t = T\)) is trained based on the after-cut \({W_S}^{K}\), while the others (\(1 \le t < T\)) use the after-cut \({W_S}^{t+1}\) as initialization, as shown in Fig. 4c. In particular, the cutting ratio \(\alpha\) can be adjusted to satisfy different efficiency requirements, but we should also pay attention to the trade-off between accuracy and efficiency. After the top-down iterative cutting on each layer, \({W_S}^{1}\) is the final compressed small DNN for deployment on smartphones. A high-level sketch of this schedule is given below.
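The sketch below outlines the top-down cutting schedule only; the helper functions for scoring (Eq. (3)), cutting and re-training with the transfer loss (Eq. (4)) are hypothetical stand-ins for the steps described above.

```python
def top_down_cut(small_dnn_K, large_dnn, layers, alpha=0.5):
    """Cut layers T, T-1, ..., 1 of the transferred small DNN W_S^K one at a time.

    layers: the T filter layers of the small DNN, ordered from bottom to top.
    """
    model = small_dnn_K                                        # initialization for layer T
    for t in reversed(range(len(layers))):                     # t = T-1, ..., 0 (0-indexed)
        scores = score_filters(model, layers[t])               # Eq. (3), hypothetical helper
        model = cut_filters(model, layers[t], scores, alpha)   # keep ceil(C_t * (1 - alpha)) filters
        model = retrain_with_transfer(model, large_dnn)        # Eq. (4), hypothetical helper
    return model                                               # W_S^1: the final compressed small DNN
```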

A short summary

To give a short summary of the network compression pipeline, we summarize the algorithm as follows.

figure a

Network deployment

After the network compression, we deploy the final compressed small DNN on a smartphone. Given a smartphone, a video sequence is first taken from the camera and split into individual image frames. Then, the testing-time data augmentation is applied to each frame to generate the input data, as shown in Fig. 3b. Finally, the compressed small DNN runs on each input to output the ranked predicted herb categories. All these steps are illustrated in Fig. 1d.
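As a minimal sketch of the per-frame inference step (the model and the preprocessed frame tensor are placeholders, and the export comment assumes standard TorchScript tooling rather than the exact deployment path used in the App):

```python
import torch

@torch.no_grad()
def recognize_frame(model, frame_tensor, k=5):
    """Run the compressed small DNN on one preprocessed frame and return the
    top-K herb categories with their confidence scores.

    frame_tensor: (1, 3, 224, 224), produced by the testing-time processing (Fig. 3b).
    """
    model.eval()
    probs = torch.softmax(model(frame_tensor), dim=1)  # confidence score per category
    conf, idx = probs.topk(k, dim=1)                   # ranked top-K results
    return list(zip(idx[0].tolist(), conf[0].tolist()))

# For on-device use, the trained model can be exported, e.g. with TorchScript:
# scripted = torch.jit.trace(model, torch.randn(1, 3, 224, 224))
# scripted.save("herb_compressed.pt")
```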

Results

We first introduce the experimental settings; then the efficacy of image processing and network compression is validated; lastly, we show the running efficiency on three common smartphones.

Experimental setting

The setup is given based on seven parts: dataset, image processing, network selection, network pre-train, network transfer/cut, network deployment and evaluation.

Dataset

We use the largest public herb image dataset, released in24. It contains 95 herb categories with 5640 images, all collected from the internet, with mutual occlusion among herbs and cluttered backgrounds. Some herb images from the dataset are shown in Fig. 3. For each herb category, the number of herb images varies from 15 to 180, and we randomly split them into \(70\%\) training and \(30\%\) testing, which contain 3904 and 1736 images respectively. This random split is repeated 5 times for fivefold cross-validation.

Image processing

In training, we use a four-step image processing method for each image. (1) Random Scale: the raw image is scaled with the scale randomly selected in \(\left[ {0.08,\,1.0} \right]\); (2) Random Ratio: the scaled image is reshaped to different width-height ratios with the ratio randomly selected in \(\left[ {3/4,\,4/3} \right]\); (3) Random Crop: the ratio image is cropped with the crop location and region size randomly selected within the image; (4) Resize: the cropped image is resized to \(224 \times 224\). In testing, we use a two-step image processing method for each image. (1) Resize: the original image is resized to \(256 \times 256\); (2) Center Crop: the center region with \(224 \times 224\) is cropped. For both training and testing, the result image is converted into the Float type ([0, 1]) and normalized with mean and variance to generate the input data for DNN computation.

Network selection

We adopt the standard ResNet-5015 and MobileNet-V234 as the large and small DNN respectively. The two DNNs have different network structures: the large one uses deep residual blocks for better accuracy15, while the small one uses depth-wise separable convolutions and inverted bottlenecks for fewer parameters and higher efficiency34.

Network pre-train

For the pre-training on the ImageNet51 dataset, both DNNs start with a learning rate of 0.1 and end with 0.0001, and the rate is divided by 10 every 30 epochs, i.e. four rates over 120 epochs are used in total. For the training on the herb image dataset, the large and small DNN are trained with a starting rate of 0.01 and an ending rate of 0.0001, and the rate is divided by 10 every 5000/10,000 iterations, i.e. three rates over 15,000/30,000 iterations are used in total. Besides, the stochastic gradient descent optimizer is adopted with a batchsize of 32, momentum of 0.9, and weight decay of 0.0005 throughout the whole evaluation.

Network transfer/cut

In network transfer, the training starts with the rate of 0.001 and ends with 0.0001, and the number of iterations for each rate is set to be 10,000 and 5000 respectively. Besides, the transfer weight \(\lambda\) in Eq. (2) is set to be 1.0. In network cut, we use the same learning rates as in the transfer process, while the number of iterations for each rate is set to be 5000 and 2000. Especially, we use the cutting ratio of 0.5 for the best trade-off between accuracy and efficiency, which will be validated later.

Network deployment

We train and evaluate our DNNs with the open-source PyTorch framework59, which also provides the tools to deploy the DNN on smartphones. We deploy the final compressed DNN on the Android system, and three smartphones with increasing CPU computation ability are used: HUAWEI Mate9 Pro, Xiaomi MI 9 and HUAWEI P30 Pro. For use in low-cost situations, the smartphones we have used are rather common and typical, as all of them have been on the market for more than two years and have a low price.

Evaluation

The proposed method is evaluated in terms of accuracy and efficiency. For the accuracy part, the accuracy of each category is computed by dividing the number of correctly recognized images by the total number of images in that category, and we report the \(\text {top}1\) and \(\text {top}5\) average accuracy over all 95 categories. In particular, since the training and testing sets are randomly split 5 times for fivefold cross-validation, the mean and variance of the average accuracy are also reported. For the efficiency part, we evaluate the time cost of image processing and DNN computation, and the average time cost over 100 runs on the smartphone CPU is reported.
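As a sketch of how the per-category top-n accuracy described above could be computed (the data loader and model are placeholders):

```python
import torch
from collections import defaultdict

@torch.no_grad()
def topn_category_accuracy(model, loader, n=1):
    """Per-category top-n accuracy: correctly recognized images divided by the
    total images of that category, then averaged over all categories."""
    correct, total = defaultdict(int), defaultdict(int)
    model.eval()
    for images, labels in loader:
        topn = model(images).topk(n, dim=1).indices     # (B, n) predicted categories
        hit = (topn == labels.unsqueeze(1)).any(dim=1)  # true label among the top n?
        for y, h in zip(labels.tolist(), hit.tolist()):
            total[y] += 1
            correct[y] += int(h)
    return sum(correct[c] / total[c] for c in total) / len(total)
```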

Image processing

In this part, we validate the effectiveness of the training data augmentation. We adopt the small DNN for evaluation, and the top1/top5 average accuracy is reported, wherein a top-n prediction is counted as correct if the true category is among the top n predicted categories. Table 1a shows the results of the small DNN with and without image processing. Training without processing obtains a top1/top5 accuracy of \(64.65\%/86.02\%\), while the extra processing step gives an improvement of about \(4\%/2\%\), to \(68.89\%/88.03\%\). These improvements imply that the image processing is effective in enhancing the robustness and generality of herb image recognition, and we give more insights to explain them below.

Table 1 A summary of main results of the image processing and network compression.

Figure 5a,b show the change of training loss and top1 accuracy along the training process with and without the augmentation, and there are two main observations. Firstly, training without augmentation (Fig. 5b) converges quickly within 5000 iterations and reaches a training loss of 0.005, which is 10 times lower than that with augmentation (Fig. 5a), which converges slowly over 20,000 iterations. Secondly, the top1 accuracy without augmentation quickly saturates around 10,000 iterations and then slightly decreases, while the accuracy with augmentation keeps rising to a higher level as the number of iterations increases. When the augmentation is adopted, different training samples are generated in each iteration; with sufficient training the DNN can fully learn the representation from every herb detail and also simulate different setups between the camera and the herbs, which helps it adapt to different testing situations with high robustness and generality. However, if the augmentation is removed, the training always sees the same samples, and the limited number of training samples cannot capture herb detail or generalize to different testing situations, which causes over-fitting. These observations indicate that data augmentation effectively overcomes the over-fitting problem by learning better recognition generality.

Figure 5
figure 5

The training loss and top1 accuracy on different networks. (a) Small DNN with the data augmentation and network pre-train; (b) Small DNN without the data augment; (c) Small DNN without the network pre-train; (d) Small DNN using the network transfer; (e) Small DNN without the pre-train but using the network transfer.

Figure 6a,b give a visualization of the recognition ability by comparing the confusion matrices with and without data augmentation. One main observation is that in the matrix without augmentation (Fig. 6b), there are more red points than in the one with augmentation, especially in the right area, and these points indicate that the corresponding true herb categories are easily confused with others. We further note that most of these confused herb categories do not have enough training samples, as the number of training samples ranges from 10 to 90 across the dataset. Without augmentation, the limited number of training samples cannot learn the discriminability between the true category and others well, which results in the confusion. With the data augmentation, as in Fig. 6a, more training samples can be generated to incorporate sample diversity, which alleviates the confusion problem to some degree.

Figure 6
figure 6

The confusion matrix of different networks on the testing set. (a) Small DNN with the data augmentation and network pre-train; (b) Small DNN without the data augment; (c) Small DNN without the network pre-train; (d) Small DNN using the network transfer; (e) Small DNN without the pre-train but using the network transfer.

Network pre-train

In this part, we validate the effectiveness of network pre-train in the network compression pipeline. Similar to the setup in image processing, the small DNN is adopted for evaluation and the top1/top5 average accuracy will be reported. Table 1b shows the results of the small DNN with and without the network pre-train. It can be clearly observed that training without the pre-train only gets the top1/top5 accuracy of \(46.51\%/75.38\%\), while the small DNN with pre-train has achieved \(68.89\%/88.03\%\), which has obtained a huge improvement of \(22\%/13\%\) respectively. This impressive increase implies that the network pre-train has played a vital role in training the DNN. As mentioned in the method part, the pre-train not only brings better recognition generality, but also improves training stability and speeds up the convergence with good initialization. We will give more insights based on these two points.

Figure 5a,c show the change of training loss and top1 accuracy along the training process with and without the network pre-train, and we have two main observations. Firstly, the small DNN without pre-train (Fig. 5c) only reaches an ending loss around 1.0, which is much higher than that with pre-train (Fig. 5a), which is around 0.05. Secondly, the loss without pre-train decreases slowly within 30,000 iterations, while the small DNN with pre-train converges quickly around 10,000 iterations. When training a network, shallow layers learn low-level representations such as lines, circles and squares60. Going deeper into the network, it learns mid-level representations such as intricate patterns or combinations of low-level shapes. Finally, the deepest layers of the network learn high-level representations that are specific to the target object categories. With pre-training on a large-scale dataset such as ImageNet51, the large number of categories and images makes the network learn rich image representations of different levels, wherein the low-level and mid-level representations are general and can be transferred across different tasks60. Due to this representation generality, the network can quickly adapt to the new task and efficiently re-learn only the high-level part. Combined with the stable initialization given by the supervised pre-training process, the pre-train can largely improve training stability and speed up convergence.

Figure 6a,c also visualize the recognition ability by showing the confusion matrices of the small DNN with and without the network pre-train. It can be observed that there are many light-colored points in the matrix without pre-train (Fig. 6c), especially in the upper-right region. Given that the top1 accuracy without pre-train is only \(46.51\%\), this observation indicates many categories are easily confused with others, which also explains why the diagonal points are not as dark as those with pre-train. Without pre-training, due to the limited number of categories and images in the small-scale herb image dataset, the representations of different levels cannot be fully learned, which causes a large drop in recognition ability and accuracy.

Network transfer

In this part, we validate the effectiveness of the network transfer in the network compression pipeline. Table 1c shows the top1/top5 accuracy of the large DNN, the small DNN and the small DNN with network transfer. The small DNN obtains a top1/top5 accuracy of \(68.89\%/88.03\%\), which is a little lower than the large DNN with \(70.61\%/88.97\%\); this small gap is caused by the different learning capacity determined by the number of network parameters. However, with the transfer, the small DNN can surpass the large DNN and achieve a top1 accuracy of \(70.97\%\), which is very encouraging as the small DNN beats the large DNN with a much smaller number of parameters. We further evaluate the influence of using no pre-train in the transfer process, and there is a large drop in the top1/top5 accuracy, to only \(55.55\%/79.25\%\) respectively. This implies that the pre-train cannot simply be replaced by network transfer. We will provide some insights into these results.

The result of the transferred small DNN beating the large one implies two facts. For the first fact, the large DNN may suffer from over-fitting. Due to the large gap in the number of parameters between the large DNN and the small one, we can make the small DNN approximate the large one, but it should be hard to surpass it. Besides, it is unlikely that the large DNN has not been well trained, given the data augmentation, network pre-train and sufficient iterations, so the most probable explanation is over-fitting. Though pre-training on the ImageNet dataset can largely alleviate the problem, the difference in the amount of training data between ImageNet and our task is quite large (categories: 1000 vs 95, images: 1.2M vs 5.4k), so the large DNN will suffer from some over-fitting. For the second fact, the small DNN without the network transfer may be optimized to a local minimum. As the DNN is usually trained with a large number of training samples, it can only be optimized with stochastic optimization methods in a mini-batch way, but methods such as Stochastic Gradient Descent (SGD) generally converge to a local minimum. However, when transferring the stronger recognition ability from the large DNN, the transfer acts as an auxiliary task that helps the optimization move in a better direction towards the global minimum. Combined with the over-fitting of the large DNN, the transferred small DNN may beat the large one.

Given that the large DNN has been pre-trained, one may wonder whether it is still necessary to pre-train the small DNN if we transfer the recognition ability from the large DNN. The answer is still yes, and Fig. 5d,e give the comparison on the training loss and top1 accuracy. It can be clearly observed that without the network pre-train (Fig. 5e), the small DNN gets a much higher ending loss and a much lower accuracy, which indicates that the transfer step cannot make the small DNN learn the same recognition ability as the pre-train does. The main reason comes from the fact that deep neural networks tend to be vulnerable to random parameter initialization. Without the good starting point given by the supervised pre-training, it is difficult to learn the recognition ability effectively and efficiently. Figure 6d,e also visualize this observation, and basically we observe the same phenomenon as in the pre-train part.

Network cut

In this part, we validate the effectiveness of the network cut, which will be evaluated from three aspects. Firstly, the cutting ratio \(\alpha\) will be selected to achieve the best trade-off between accuracy and efficiency. Then, we will show the benefits of the network transfer in recovering the loss of recognition ability. Finally, we show the cutting step is important to obtain a powerful small DNN. Particularly, the ratio \(\alpha\) is first evaluated in one split of the training and testing data, and then the best \(\alpha\) is used for fivefold cross-validation in the latter two aspects.

Figure 7a shows the top1 accuracy of the small DNN with the cutting ratio increasing from 0.1 to 0.9. It can be observed that the accuracy drops a little, by \(0.8\%\) and \(1.5\%\), at the ratios of 0.1 and 0.2 respectively, and it remains basically stable as the ratio increases from 0.2 to 0.5, while the accuracy begins to drop after the ratio of 0.5. Considering the trade-off between accuracy and efficiency, we select \(\alpha =0.5\) as the best ratio, which means the small DNN is just half the size but with only a minor loss in accuracy. Figure 7b shows the change of top1 accuracy with the selected ratio of 0.5 applied to each layer in the small DNN, which contains 17 layer blocks and is cut in a top-down way from layer 17 to layer 1. We observe some drops in accuracy at the top layers such as 17/16/11, but as the cutting goes down to the shallow layers, the accuracy gradually increases to a level comparable to layer 16. One main reason is that the top layers learn high-level representations that are highly related to the target categories, and cutting them harms the recognition ability. The shallow layers, in contrast, learn more general and common low-level and mid-level representations transferred from the pre-training task, and removing some unimportant representations can make the shallow layers focus more on the target task. We use \(\alpha =0.5\) throughout our evaluation to make the best trade-off between accuracy and efficiency.

Figure 7
figure 7

The evaluation of the network cut on different cutting ratios and different cutting layers. (a) The top1 accuracy on different cutting ratios; (b) The top1 accuracy on different cutting layers with the fixed ratio of 0.5.

Table 1d shows the influence of the network transfer and the after-cut re-training with transfer loss on the top1/top5 accuracy, wherein \(\alpha\) is the cutting ratio and \(\lambda\) is the transfer weight in Eq. (4). We have three main observations. Firstly, without network transfer as initialization (small DNN + cut), the cutting process with the re-training transfer loss (\(\lambda =1\) in Eq. (4)) achieves the top1/top5 accuracy of \(69.44\%/67.62\%\), which is about \(1.8\%/2.0\%\) higher than that without the transfer (\(\lambda =0\) in Eq. (4)). This indicates that the transfer can effectively recover the loss of recognition ability while cutting unimportant parameters. Secondly, when the network transfer is adopted as initialization (small DNN + transfer + cut), the top1/top5 accuracy is consistently higher than without this initialization (small DNN + cut), e.g. the improvement in top1 accuracy is about \(0.5\%\) for both setups (\(\lambda =0/\lambda =1\)). Given that the transferred small DNN improves the top1/top5 accuracy by \(1.9\%/0.7\%\) over the small DNN (Table 1c), it provides better initialization to stabilize the cutting process. Finally, we see that the final compressed small DNN is very competitive (small DNN + transfer + cut (\(\alpha =0.5\), \(\lambda =1\))). Compared to the small DNN, we still obtain a \(1\%\) improvement in top1 accuracy, but with only half the network size. Meanwhile, compared to the large DNN, though there is a minor loss of about \(0.6\%\) in top1 accuracy, we achieve much higher efficiency with much less computation cost, which will be shown later.

Finally, we answer the question of why we use an extra cutting step instead of a DNN that is small enough from the start. To validate the importance of the network cut, we adopt the same network structure as the final compressed small DNN to construct a new network, which we call the new DNN. Based on the compression pipeline, we put the new DNN through the network pre-train and network transfer, and then compare its top1/top5 accuracy with the final compressed small DNN, as shown in Table 1e. It can be observed that the new DNN only achieves a top1/top5 accuracy of \(62.66\%/84.91\%\), which is much lower than the small DNN with \(68.89\%/88.03\%\); this is reasonable, as the new DNN has only half the parameters and thus weaker recognition ability. However, even with the network transfer, the accuracy of the new DNN is still \(3.3\%/1.3\%\) lower than the final compressed small DNN, which implies the network cut is important for preserving the recognition ability while cutting the parameters. Actually, in network transfer, the gap in the number of parameters between the large and small DNN cannot be too large. As the new DNN only has limited learning ability, the large gap cannot guarantee the efficacy of the network transfer.

Network deployment

In this part, we evaluate the running efficiency of the final compressed DNN on smartphones. As we develop the app for Android, three Android phones are selected for the evaluation: HUAWEI Mate9 Pro, Xiaomi MI 9 and HUAWEI P30 Pro. Table 2 shows their detailed configuration in terms of CPU type, base frequency, RAM size, Android version and screen resolution. There are two main reasons why we select them. Firstly, they are common and typical, and all have been on the market for more than two years at a low price, so they can be widely accepted by resource-limited dispensaries or hospitals. Secondly, they have different computation ability, ranging from the low-level HUAWEI Mate9 Pro to the mid-level HUAWEI P30 Pro, so we can evaluate the efficacy and efficiency of the compression algorithm in different low-cost situations.

Table 2 The efficiency evaluation on three common smartphones with detailed configuration.

Table 2 shows the average running time (ms) and frames per second (fps) of the large DNN, the transferred small DNN (small DNN + transfer) and the final compressed small DNN (small DNN + transfer + cut) on the three smartphones, and we have two main observations. Firstly, from the viewpoint of different networks, the transferred small DNN achieves a speedup of \(2.6X{-}3.5X\) over the large DNN, and it also obtains a higher top1 accuracy. Meanwhile, the final compressed small DNN further improves by \(1.2X{-}1.7X\) over the transferred small DNN. Thus the final DNN achieves a total speedup of \(4X{-}5X\) over the large DNN, with only a minor loss in top1 accuracy, which demonstrates that the network compression can largely reduce the computation cost for use in low-cost situations. Secondly, from the viewpoint of different devices, the Mate9 Pro/MI 9/P30 Pro achieve final frequencies of \(13.4 \; \text {fps}/19.6\;\text {fps}/21.0\;\text {fps}\) respectively. Except for the Mate9 Pro, the frequency of the other two is around \(20\;\text {fps}\), which is close to the realtime frequency of \(24\;\text {fps}\). This near-realtime speed can guarantee the fluency of the app and a good user experience, which is beneficial for promoting the application worldwide. For the Mate9 Pro, there is still some gap to the realtime speed, which indicates the challenge in situations with extremely low computation budgets, and we will work towards this goal in our future study. Figure 8 shows some screenshots of the application interface on the above three phones at their true resolution ratio. The herbs are correctly recognized on all the phones with high confidence, which provides an effective tool to recognize herb categories and improve the quality control of herbal medicine.

Figure 8
figure 8

The screenshots of the application interface on three smartphones with their true resolution ratio. (a) HUAWEI Mate9 Pro; (b) Xiaomi MI 9; (c) HUAWEI P30 Pro.

Large-scale validation

To validate the effectiveness of the proposed network compression on a large-scale dataset, we further evaluate it on ImageNet, with 1000 categories and 1.2 million images51. This dataset is used for the general image classification task, and thus is highly related to our task. Different from the above setup in network selection, we choose ResNet-5015 as the small DNN, as the capacity of MobileNet-V2 can no longer handle this large-scale case. Correspondingly, we change the large DNN to the more powerful ResNeXt-10116.

Table 3 shows the top1 and top5 classification accuracy of the large and small DNN throughout the whole compression process, from which we make three main observations. Firstly, in the network pre-train step, the small DNN only reaches a top1/top5 accuracy of \(75.3\%/92.2\%\), while the large DNN obtains \(79.2\%/94.3\%\). This accuracy gap is expected, as the large DNN has a depth of 101 layers and about 2X the parameters of the small DNN, and thus stronger recognition ability. Then, in the network transfer step, the small DNN with network transfer achieves encouraging results that are very competitive with the large DNN, i.e. the top1 accuracy difference is within \(1\%\) (\(78.4\%\) vs \(79.2\%\)) and the top5 accuracy is even higher than that of the large DNN (\(94.5\%\) vs \(94.3\%\)). This result mirrors what we observed on the small-scale herb image dataset, indicating that the network transfer can fully exploit the learning potential of the small DNN and demonstrating the effectiveness and generality of the transfer step. Finally, in the network cut step, cutting \(50\%\) (\(\alpha = 0.5\)) of the parameters of the small DNN leaves the compressed small DNN still competitive with the original small DNN, i.e. the top1 accuracy difference is \(0.5\%\) (\(74.8\%\) vs \(75.3\%\)) and the top5 accuracy is higher (\(92.4\%\) vs \(92.2\%\)). This result is of practical importance: with only half the network parameters, the compressed small DNN achieves comparable accuracy to the original small DNN while running much faster and requiring only half the memory. These results validate the effectiveness of the proposed network compression method in the large-scale setting.
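A minimal sketch of the fixed-ratio cut idea is given below: each convolutional layer's output filters are ranked by an importance score and only the top \(1-\alpha\) fraction is kept. The L1-norm criterion is an assumption made for illustration; rebuilding the slimmer network from these masks and fine-tuning it are omitted.

```python
# Toy illustration of a fixed-ratio "network cut": rank the output filters
# of every Conv2d layer by L1 norm and keep the top (1 - alpha) fraction.
# The L1 criterion is assumed for illustration; the paper's exact importance
# measure and the subsequent rebuild/fine-tune are not reproduced here.
import torch
import torch.nn as nn

def filters_to_keep(model, alpha=0.5):
    keep = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Conv2d):
            # importance of each output filter = L1 norm of its weights
            scores = module.weight.detach().abs().sum(dim=(1, 2, 3))
            n_keep = max(1, int(round(scores.numel() * (1.0 - alpha))))
            keep[name] = torch.topk(scores, n_keep).indices.sort().values
    return keep
```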

Discussion

In this part, we discuss possible future directions for deep learning algorithms in herb image classification.

Neural architecture search

A network structure with an efficient design plays a vital role in herb image classification in resource-limited settings. In light of this, we chose the MobileNet architecture as the small DNN because of its comparable accuracy and higher efficiency with fewer parameters. Although several efficient structures have shown promising results33,34,35,36,37, there is still plenty of room for improvement, as most of them are manually designed. Neural Architecture Search (NAS), a hot research topic in recent years, can automatically search for the structure that best fits the training task. Given targets such as comparable accuracy with fewer parameters and higher efficiency, NAS searches for the design that best satisfies them. Besides, NAS can contribute to network compression: in the network cut step, where a fixed cutting ratio is used for all layers, NAS could instead search for the best cutting ratio in each layer to jointly give the best trade-off between accuracy and efficiency (a toy illustration of this search is sketched below). There is reason to believe that NAS is the future of herb image classification, as it can largely reduce human effort in network design and hyper-parameter tuning, especially in resource-limited settings with strict requirements on accuracy and efficiency.
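The following toy random search conveys the per-layer ratio idea far more simply than real NAS methods would. The evaluate callable (build the cut network, fine-tune briefly, return validation accuracy and parameter count in millions) is a placeholder the reader would supply, and the penalty weight lam is an assumption.

```python
# Toy random search over per-layer cutting ratios, trading accuracy against
# parameter count. `evaluate` is a user-supplied placeholder; this is not a
# real NAS algorithm, only an illustration of the search objective.
import random

def search_cut_ratios(num_layers, evaluate, trials=50, lam=0.5, seed=0):
    rng = random.Random(seed)
    best_score, best_ratios = float("-inf"), None
    for _ in range(trials):
        ratios = [rng.uniform(0.2, 0.8) for _ in range(num_layers)]  # candidate cut per layer
        accuracy, params_m = evaluate(ratios)          # proxy accuracy and size (millions)
        score = accuracy - lam * params_m              # reward accuracy, penalize size
        if score > best_score:
            best_score, best_ratios = score, ratios
    return best_ratios, best_score
```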

Table 3 Main results for evaluation on the ImageNet dataset.

Public dataset construction and automatic annotation

Public datasets play an important role in driving new techniques worldwide. In the case of the large-scale ImageNet dataset51 (1000 categories, 1.2 million images), a wealth of new methods has been proposed in the ten years since the ImageNet competition of 2012. In herb image classification, however, there are mainly two datasets, both of small scale: the dataset used in this paper with 95 categories and 5600+ images24, and an unpublished dataset with 315 categories and 31,000+ images27. It is therefore necessary to establish a large-scale public dataset to promote the development of herb image classification. Another key problem is how to annotate all the images once the dataset is ready. Unlike the general object classification task, in which objects are easily categorized, annotating herb categories requires professionals with expert knowledge and costs considerable time and effort. Thus, it is also necessary to develop automatic annotation algorithms to reduce human effort. We have been working on these two problems over the past several months and will release a large-scale dataset with automatic annotation in the near future.
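As one hedged illustration of automatic annotation (not the authors' forthcoming pipeline), a trained classifier can pseudo-label unlabeled herb images and keep only the confident predictions for expert spot-checking. The confidence threshold and the model and data-loader names below are assumptions.

```python
# Pseudo-labeling sketch: keep only confident predictions from an existing
# classifier as candidate annotations for later expert review. Threshold,
# model and loader are illustrative placeholders.
import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.9):
    model.eval()
    accepted = []   # (image_index, predicted_category, confidence)
    index = 0
    for images in unlabeled_loader:            # loader yields batches of images only
        probs = F.softmax(model(images), dim=1)
        conf, pred = probs.max(dim=1)
        for c, p in zip(conf.tolist(), pred.tolist()):
            if c >= threshold:
                accepted.append((index, p, c))
            index += 1
    return accepted
```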

Transfer learning and domain collaboration

In general image classification, pre-training on ImageNet has been successfully transferred to many downstream tasks such as object detection53, image segmentation52 and image retrieval. Similarly, once a large-scale herb image dataset is ready, pre-training on it can largely boost downstream tasks such as herb detection and herb segmentation, and related tasks such as plant image classification25,28 and flower image classification will also benefit considerably. We believe various transfer learning methods will be proposed in the near future to extend the influence of herb image classification in the research community. Another related problem is how to jointly use herb image data from different domains, namely, domain collaboration. For example, some herb images are taken in the laboratory with a clean background and high resolution, while others are downloaded from the Internet with a cluttered background and low resolution. Due to the domain difference, directly combining them in training can degrade the classification accuracy in each domain. It is therefore important to develop an effective domain collaboration method to maximize the utilization of herb image data.
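A short sketch of the standard transfer-learning pattern is shown below: take a backbone pre-trained on a large dataset, replace its classification head for a downstream herb task, and fine-tune the head faster than the backbone body. The MobileNet-V2 backbone and the 95-category head mirror the networks and dataset discussed in this paper, but the learning rates and other hyper-parameters are illustrative assumptions.

```python
# Transfer-learning sketch: swap in a new herb-category head and fine-tune
# with a smaller learning rate on the backbone body. Hyper-parameters are
# illustrative; pre-trained weights would normally be loaded first.
import torch.nn as nn
import torch.optim as optim
import torchvision.models as models

backbone = models.mobilenet_v2(num_classes=1000)               # stand-in for a pre-trained model
backbone.classifier[1] = nn.Linear(backbone.last_channel, 95)  # new 95-way herb-category head

optimizer = optim.SGD([
    {"params": backbone.features.parameters(), "lr": 1e-3},    # slower updates for the body
    {"params": backbone.classifier.parameters(), "lr": 1e-2},  # faster updates for the new head
], momentum=0.9, weight_decay=1e-4)
```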

Conclusion

In this paper, we present a deep learning-driven smartphone application for efficient and robust herb recognition. The App first captures and pre-processes an image of the herbs under test with some data augmentation, then runs inference with the compressed DNN produced by the network compression algorithm, and finally outputs candidate herb categories. The whole procedure runs entirely on a single common smartphone, which makes it well suited to resource-limited situations.

The proposed network compression method shows promising performance. Firstly, with only half the parameters of the original small DNN, the compressed small DNN achieves comparable accuracy, making it suitable for use in resource-limited settings. Secondly, we tested the efficiency on three common smartphones, two of which run at near-realtime speed, which demonstrates the efficacy of the network compression algorithm and indicates that the App can provide a good user experience with fast response and low power consumption.

To obtain a small DNN with high accuracy and efficiency, we proposed a network compression algorithm with three main steps. Firstly, we pre-train the small DNN on a large-scale dataset to learn a general representation and provide a stable initialization, which improves generality and speeds up training. Then, to satisfy the accuracy requirement, network transfer is applied to transfer recognition ability from the large DNN to the small one. Finally, we apply the network cut to achieve the best trade-off between efficiency and accuracy.

This study aims to encourage the implementation of herb recognition in resource-limited dispensaries, hospitals and laboratories where expensive computation hardware is not available. Firstly, instead of competing with large DNNs running on powerful hardware, the App emphasizes a high degree of automation at low cost and high efficiency. Secondly, rather than replacing manual herb recognition, the App is designed to provide confident candidate decisions that assist it, reducing the requirements for chemical devices and expert resources. Finally, by integrating herb recognition into a single smartphone, we hope this study can help bridge the digital gap between the public and herbal medicine and increase its accessibility worldwide.