Abstract
Much experimental evidence in neuroscience suggests that higher visual processing is divided into a ventral pathway specialized for object recognition and a dorsal pathway specialized for spatial recognition. Previous computational studies have suggested that neural networks with two segregated pathways (branches) perform better on visual recognition tasks than networks with a single pathway (branch). One proposed explanation is that two pathways increase the learning efficiency of a network by allowing different visual attributes to be processed by separate sub-networks. However, most of these previous studies were limited: they considered the simultaneous recognition of only two visual attributes, identity and location, with a restricted number of classes in each attribute. We investigate whether two-pathway networks remain advantageous when recognizing other visual attributes, and whether their advantage changes with the number of classes in each attribute. We find that it is always advantageous to use segregated pathways to process different visual attributes separately, and that this advantage grows with the number of classes. Thus, using a computational approach, we demonstrate that separate pathways are computationally advantageous when a given visual attribute varies widely or must be finely discriminated. Hence, when the size of a computer vision model is limited, a segregated pathway (branch) should be designed for a given visual attribute only when it is computationally advantageous to do so.
Introduction
Many lesion, neuropsychological, and anatomical studies support the hypothesis that there are two primary cortical visual pathways in the brain: a ventral pathway specialized for object recognition and a dorsal pathway specialized for spatial recognition1,2,3,4,5. In addition, some recent studies suggest that there may be multiple sub-pathways within each of the two main cortical visual pathways6,7,8,9,10.
Interestingly, over the years various computational modeling studies have used artificial neural networks to better understand the functional consequences of different network structures and how information is processed by them. Reference11 is the earliest artificial-neural-network model we are aware of that demonstrated that it is always computationally advantageous to use two segregated pathways to process different attributes separately when two visual attributes, such as identity and location, need to be recognized simultaneously. However, their model considered only very simple shapes and did not examine the processing of different kinds of shape and space information in neural networks. Later, Jacobs and Jordan12 proposed that modular neural networks consisting of several sub-networks have many advantages: they can learn faster, generalize better, generate more interpretable representations, and reduce the required number of units and lengths of connections in a neural network. In addition, they proposed a modular network with two sub-networks. They found that competition between the two sub-networks could be used to train them to specialize in shape recognition and localization, respectively. They argued that this may be a mechanism the brain uses to develop different specialized modules.
More recently, inspired by deep learning studies, Marblestone and colleagues13 hypothesized that specialized systems in the brain can be used to find more efficient solutions to important computational problems. They argued that similar to computer scientists’ need to design different algorithms to solve different problems, the brain needs to use different specialized systems for planning, accessing memories, and many other functions. Indeed, Dwivedi and colleagues14 compared the fMRI responses collected from the brain and the neural activations in deep neural networks. They found that the neural activities in different brain regions can be explained by deep neural networks trained to perform different functions. This agreement between biological and artificial neural networks lends support to the hypothesis that there may be functional consequences of different specialized modules or systems in the brain.
In a recent study, Scholte and colleagues15 examined the properties of neural representations in artificial modular networks with two artificial visual pathways. They used a neural network model with a single shared pathway at the beginning, followed by two separated pathways specialized to complete different goals. Based on their model and simulation results, they proposed that distributed and shared representations are more common in the early and intermediate layers of neural networks, whereas modular and specialized representations are more common in the later layers. In our previous studies, we used a related but different model: two segregated pathways at the beginning process the same visual inputs differently, and a shared final single pathway then receives information from the two segregated pathways to produce the final outputs16. In those studies, we showed that two-pathway models were better and more efficient than one-pathway models and that there were specialized representations in the later layers of the different visual pathways of two-pathway models. However, we also showed that shared representations in the later layers of different segregated visual pathways were common16, and these shared representations may be used to constrain the binding problem17.
According to many previous computational modeling studies, when there are two visual attributes such as identity and location that need to be recognized simultaneously, it is computationally advantageous to use two segregated pathways to process the two visual attributes separately before combining them together for recognition11,16,17 (for recent review, see Ref.18). In addition, Tamura19 showed that when training a network with parallel streams to recognize objects, similar to what we used in our previous studies16,17, each artificial pathway could spontaneously specialize and segregate, with each pathway processing different kinds of visual information. However, Tamura19 trained the two-pathway networks to recognize object identity only, whereas we trained the two-pathway networks to recognize two different visual attributes (e.g. identity and location) at the same time. Nevertheless, most of these previous studies only required the neural networks to recognize two visual attributes simultaneously with a specific number of classes in each attribute (e.g., three kinds of object identities and nine possible locations). One question that remains is whether such two-pathway networks are advantageous when recognizing some other visual attributes, such as luminance and orientation. Another question is whether such two-pathway networks always have higher performance even when there are a different number of classes in each attribute.
In our current study, we consider four visual attributes: identity (shape), luminance, orientation, and location. Black and white images with three non-overlapping objects at different locations are used as the input images to the neural networks. Convolutional neural networks are used to model cortical visual pathways. All neural networks are trained using supervised learning. In order to model different cortical visual pathways, different neural networks are trained to recognize different visual attributes. Some additional common dense layers are used to combine and process the neural representations from multiple pathways to produce the final outputs. Because our previous study suggests that the location map is the best choice for constraining the binding problem18,20, we always use a relative location map here when we are training the neural networks to bind multiple visual attributes of each object together. We compare the performance of two-pathway networks and one-pathway networks when recognizing different visual attributes and when there are different numbers of classes in each attribute.
According to our results, the performance of two-pathway networks, regardless of which two attributes are chosen, is always better than the performance of one-pathway networks when recognizing different combinations of visual attributes. In addition, the advantage of using a two-pathway network is larger when there are a greater number of classes in each visual attribute. In summary, our results suggest that in order to achieve high visual recognition performance, it is always advantageous to use segregated pathways to process different visual attributes separately, with this advantage increasing with greater numbers of classes.
Methods
Objects and visual images
The objects in the visual tasks are images of t-shirts, pants, and shoes obtained from the Fashion-MNIST dataset21. There are 200 kinds of t-shirts, 200 kinds of pants, and 200 kinds of shoes. Each object is randomly presented at one of nine possible locations on a 140 \(\times\) 140 (pixels) black background. There are three objects in each image, and these images of objects on a black background are used as input images to the neural networks. Some examples of these input images are shown in Fig. 1. We created six thousand input images. Two-thirds of these input images were used for training, one-sixth for validation, and one-sixth for testing. Cross-validation is used during training, so each time a different set of images is used to train the neural networks and the remaining images are used to validate and test them.
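As a concrete illustration, the image-generation procedure described above can be sketched as follows. This is our reconstruction, not the authors' code: the 3 \(\times\) 3 grid layout, the 28 \(\times\) 28 object size, and all names are assumptions.

```python
import numpy as np

GRID = [(r, c) for r in range(3) for c in range(3)]  # nine possible locations (assumed 3x3 grid)
CELL = 140 // 3  # each grid cell is ~46x46 pixels

def make_image(objects, rng):
    """Place three 28x28 Fashion-MNIST objects at distinct grid locations
    on a 140x140 black background; returns the image and the location indices."""
    canvas = np.zeros((140, 140), dtype=np.uint8)
    cells = rng.choice(len(GRID), size=3, replace=False)  # distinct cells, so objects never overlap
    locations = []
    for obj, idx in zip(objects, cells):
        r, c = GRID[idx]
        y = r * CELL + (CELL - 28) // 2  # center the object within its cell
        x = c * CELL + (CELL - 28) // 2
        canvas[y:y + 28, x:x + 28] = obj
        locations.append(int(idx))
    return canvas, locations
```

Each call draws three non-repeating cells, matching the paper's constraint that the three objects never overlap.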
Object and visual image examples. (a) An example of a visual image with three objects (a t-shirt, a pant, a shoe). (b) An example of a t-shirt with high luminance, a t-shirt with medium high luminance, a pant with medium low luminance, and a shoe with low luminance. (c) An example of a t-shirt with orientations up, down, left, right. (d) Nine possible locations of the objects in the visual image.
Four attributes of objects
Four attributes of objects are considered in this study: identity, luminance, orientation, and location.
-
Identity: There are three kinds of object identities: t-shirt, pant, and shoe. An example of a visual image containing a t-shirt, a pant, and a shoe is shown in Fig. 1a.
-
Luminance: The average value of all pixels in the object image is defined as the luminance value of the object. The black background pixels within each object image (as shown in Fig. 1b and c) are also included when calculating the luminance value of the object. The luminance value of each object is then classified into different luminance levels: (1) high (luminance value \(\ge\) 255/4) and low (luminance value < 255/4) when there are two classes in the visual attribute luminance; or, (2) high (luminance value \(\ge\) 3 * 255/8), medium high (2 * 255/8 \(\le\) luminance value < 3 * 255/8), medium low (255/8 \(\le\) luminance value < 2 * 255/8), low (luminance value < 255/8) when there are four classes in the visual attribute luminance. Examples of different luminance levels are shown in Fig. 1b.
-
Orientation: When there are two classes in the visual attribute orientation, each object has two possible orientations: up, down. When there are four classes in the visual attribute orientation, each object has four possible orientations: up, down, left, right. An example object in different orientations is shown in Fig. 1c, with the top row showing the conditions when there are only two classes.
-
Location: The nine possible object locations are shown in Fig. 1d. The objects in a given input image never overlap with each other.
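The luminance binning described above can be expressed as a small function. The thresholds are those stated in the text; the function name and interface are ours.

```python
def luminance_class(obj, n_classes=4):
    """Classify an object image (a NumPy array) into luminance levels.
    Luminance is the mean over all pixels, including black background pixels."""
    value = obj.mean()
    if n_classes == 2:
        # two classes: threshold at 255/4
        return "high" if value >= 255 / 4 else "low"
    # four classes: thresholds at 255/8, 2*255/8, 3*255/8
    if value >= 3 * 255 / 8:
        return "high"
    elif value >= 2 * 255 / 8:
        return "medium high"
    elif value >= 255 / 8:
        return "medium low"
    return "low"
```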
Neural networks
Convolutional neural networks are used to model the different visual pathways in the brain. We use TensorFlow to implement all neural networks, and all neural networks are trained using supervised learning with gradient descent with back-propagation, the cross-entropy loss function, and the Adam optimization method. The ReLU activation function is used at each layer except the output layer, in which the softmax activation function is used. A batch size of 256 and an initial learning rate of 0.001 are used for training. A \(30\%\) random dropout is applied to all dense layers for regularization. The other hyperparameters are shown in Figs. 2 and 3. All neural networks are trained until they reach the highest validation accuracy.
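The shared training configuration above can be sketched in TensorFlow/Keras. The exact convolutional layer sizes are given in Figs. 2 and 3, so the architecture below is a placeholder; only the activation functions, dropout rate, optimizer, loss, and learning rate come from the text.

```python
import tensorflow as tf

def build_pathway(n_outputs):
    """A placeholder pathway network with the paper's stated training settings."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(140, 140, 1)),
        tf.keras.layers.Conv2D(16, 3, activation="relu"),   # placeholder conv layer
        tf.keras.layers.MaxPooling2D(pool_size=4),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),       # placeholder dense layer
        tf.keras.layers.Dropout(0.3),                        # 30% dropout on dense layers
        tf.keras.layers.Dense(n_outputs, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_pathway(9)  # e.g., a location pathway with nine location classes
# model.fit(x_train, y_train, batch_size=256, validation_data=(x_val, y_val), ...)
```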
The structure of the neural networks for modeling visual pathways is shown in Fig. 2a. All artificial visual pathways are modeled using neural networks with the same structure and the only difference is that they may have different output layer sizes. All visual pathway neural networks take the same input images and are trained to complete different tasks (\(Network_{identity}\), \(Network_{luminance}\), \(Network_{orientation}\), \(Network_{location}\)). Specifically, \(Network_{identity}\) is trained to recognize the identities of the objects (t-shirt, pant, or shoe), \(Network_{luminance}\) is trained to recognize the luminance level of the objects (e.g. high or low), \(Network_{orientation}\) is trained to recognize the orientations of the objects (e.g. up or down), and \(Network_{location}\) is trained to recognize the locations of the objects (locations 1-9). After training, the visual pathway neural networks are used to serve as one of the pathways in a two-pathway network (\(Network_{two\;pathways}\)).
The structures of \(Network_{two\;pathways}\) and \(Network_{one\;pathway}\) are shown in Figs. 2b and 3, respectively. The two neural networks have the same size; the only difference is that they have different pathways. \(Network_{two\;pathways}\) uses the two visual pathway networks (which have been independently trained to recognize different visual attributes) to process the input images separately and then combines the two independent visual pathway outputs using common dense layers to produce the final outputs of \(Network_{two\;pathways}\). Specifically, the final output layer of each of the two visual pathways is removed after they have been independently trained. Then the second-to-last layers of the two visual pathways are concatenated, and this concatenated vector serves as the input to the following common dense layers. With the weights in the two visual pathway networks fixed, the final shared common dense layers in the two-pathway network are then trained to recognize the two visual attributes simultaneously. \(Network_{one\;pathway}\) takes the same input images and is trained to recognize the two visual attributes simultaneously using a single pathway.
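A sketch of how the two trained pathways might be assembled, following the description above. This is our reconstruction: the width of the shared head and all names are placeholders, not the authors' code.

```python
import tensorflow as tf

def build_two_pathway(pathway_a, pathway_b, n_outputs):
    """Drop each trained pathway's output layer, freeze its weights, concatenate
    the penultimate activations, and attach trainable shared dense layers."""
    for p in (pathway_a, pathway_b):
        p.trainable = False  # keep the pathway weights fixed
    inputs = tf.keras.Input(shape=(140, 140, 1))
    # sub-models that end at the second-to-last layer of each pathway
    feat_a = tf.keras.Model(pathway_a.input, pathway_a.layers[-2].output)(inputs)
    feat_b = tf.keras.Model(pathway_b.input, pathway_b.layers[-2].output)(inputs)
    x = tf.keras.layers.Concatenate()([feat_a, feat_b])
    x = tf.keras.layers.Dense(128, activation="relu")(x)  # placeholder shared head
    x = tf.keras.layers.Dropout(0.3)(x)
    out = tf.keras.layers.Dense(n_outputs, activation="softmax")(x)
    return tf.keras.Model(inputs, out)
```

Only the shared dense layers receive gradient updates when the assembled model is subsequently trained on the combined two-attribute task.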
The architecture of \(Network_{one\;pathway}\). The architecture in the red square box is the same as the architecture in the red square boxes in Fig. 2; the only difference is that all layers in \(Network_{one\;pathway}\) are twice the size.
Because there are three objects in each image, there is a binding problem (i.e. how to associate the different visual attributes of each object). According to our previous studies, the location map is a propitious choice for constraining the binding problem17,20. Therefore, we use the location map to constrain the binding problem in this study.
Briefly, we solve the binding problem with a method we have previously published20. Specifically, we use a location map to constrain the binding problem. For example, when the goal of \(Network_{two\;pathways}\) is to recognize identities and locations of all objects, we train the identity pathway to report all objects’ identities in a certain order that depends on the relative locations of the objects. Specifically, we train the identity pathway to report the identities of the objects from top to bottom. If objects are at the same horizontal line, then we train the network to report the identities of the objects from left to right. The order described here is an example without loss of generality because any consistent spatial order would suffice. The location pathway is trained to report all objects’ absolute locations regardless of the other visual attributes. After training, the binding problem is constrained by combining the relative location information in the identity pathway and the absolute location information in the location pathway. A similar method (relative location map) is used to use the location map to constrain the binding problem when the goal of \(Network_{two\;pathways}\) is to recognize other visual attributes.
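The reporting order described above (top to bottom, breaking ties left to right) can be written as a one-line sort. The sketch and its interface are ours; any consistent spatial order would serve equally well, as the text notes.

```python
def report_order(objects):
    """objects: list of (row, col, label) tuples.
    Returns the labels in the order the pathway is trained to report them:
    sorted by row (top to bottom), then by column (left to right) for ties."""
    return [label for row, col, label in sorted(objects, key=lambda o: (o[0], o[1]))]
```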
All neural networks are trained three times with cross-validation, and the testing accuracies from all three training sessions are recorded. The accuracies used to compare different networks in this study are always the testing accuracies.
Results
Each neural network is trained and tested three times with network weights randomly initialized differently each time. Welch’s two-sample t-tests are used to determine the significance of the differences between testing accuracies. The Bonferroni correction for multiple tests is used because there are multiple comparisons (n=6) between the single-pathway and two-pathway neural networks. After the Bonferroni correction, the difference is considered significant if the corresponding p-value is less than 0.05/6. The testing accuracies of one-pathway networks and two-pathway networks for recognizing different pairs of attributes are shown in Table 1 and, with more levels, in Table 2.
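The statistical procedure above can be sketched as follows. The accuracy values in the test are placeholders, not the paper's data; SciPy's `ttest_ind` with `equal_var=False` implements Welch's test.

```python
from scipy import stats

def significant(acc_a, acc_b, n_comparisons=6, alpha=0.05):
    """Welch's two-sample t-test with a Bonferroni correction for
    n_comparisons multiple comparisons."""
    t, p = stats.ttest_ind(acc_a, acc_b, equal_var=False)
    return bool(p < alpha / n_comparisons)
```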
First, we train the neural networks to recognize three kinds of objects, two luminance levels, two orientations, and nine locations. According to the results shown in Table 1, the accuracies of the two-pathway networks are higher than those of the one-pathway network in most cases, but the differences are mostly not significant. When they are trained to recognize the luminance and locations of objects, the accuracy of the two-pathway network is even slightly lower than that of the one-pathway network, though the difference (\(1.1\%\)) is small and not significant.
However, the two-pathway networks are still more efficient than the one-pathway network in these cases. According to Fig. 4a,b,d, the total required number of training epochs for the two-pathway network for recognizing orientation and location is 110 (40 for training the orientation pathway, 10 for training the location pathway, 60 for training the common dense layers). According to Fig. 4b,c,e, the total required number of training epochs for the two-pathway network for recognizing luminance and location is 170 (60 for training the luminance pathway, 10 for training the location pathway, 100 for training the common dense layers). According to Fig. 4f and g, in both cases the one-pathway networks require 300 epochs to train. Therefore, the two-pathway networks always require fewer training epochs than the one-pathway networks. In summary, the two-pathway networks always have higher overall performance (combining accuracy and efficiency): they are much more efficient even when their accuracies are the same as, or slightly lower than, those of the one-pathway networks.
The training (in blue) and validation (in orange) curves of \(Network_{two\;pathways}\) (left column) and \(Network_{one\;pathway}\) (right column) with 6000 total samples and three objects in each image. (a) The training and validation curves when training the orientation pathway in \(Network_{two\;pathways}\). (b) The training and validation curves when training the location pathway in \(Network_{two\;pathways}\). (c) The training and validation curves when training the luminance pathway in \(Network_{two\;pathways}\). (d) The training and validation curves when training the common dense layers in \(Network_{two\;pathways}\) for recognizing orientation and location. (e) The training and validation curves when training the common dense layers in \(Network_{two\;pathways}\) for recognizing luminance and location. (f) The training and validation curves of \(Network_{one\;pathway}\) for recognizing orientation and location. (g) The training and validation curves of \(Network_{one\;pathway}\) for recognizing luminance and location. According to (a,b,d), the total required number of training epochs for the two-pathway network for recognizing orientation and location is 110 (40 for training the orientation pathway, 10 for training the location pathway, 60 for training the common dense layers). According to (b,c,e), the total required number of training epochs for the two-pathway network for recognizing luminance and location is 170 (60 for training the luminance pathway, 10 for training the location pathway, 100 for training the common dense layers). According to (f,g), the one-pathway networks require 300 epochs to train in both cases.
To examine the effect of the number of classes, we increase the number of classes in luminance and orientation (from two to four classes) and train the neural networks to recognize three kinds of objects, four luminance levels, four orientations, and nine locations. According to Table 2, the accuracies of the two-pathway networks are always significantly higher than the accuracies of the one-pathway network. In addition, the magnitude of difference in accuracy between the two kinds of networks increases when there are more possible luminance levels and orientations. These results suggest that the advantages of using two-pathway networks increase when the number of classes in visual attributes increases.
Though the one-pathway network and the two-pathway network have the same number of units in each layer, they have different numbers of trainable parameters: the two-pathway network has fewer trainable parameters than the one-pathway network. In order to test whether reducing the number of trainable parameters can improve the performance of the one-pathway network, we train another one-pathway network with half the number of units in each layer. In this way, the modified one-pathway network has roughly the same number of trainable parameters as the two-pathway network. We test the performance of the modified one-pathway network on the orientation and location task, and its accuracy is (0.9 ± 0.4)\(\%\). This is even lower than the accuracy of the original one-pathway network, (2.2 ± 0.6)\(\%\).
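To illustrate why two separated pathways of width \(n\) have fewer trainable parameters than one merged pathway of width \(2n\), consider fully connected layers (the layer widths below are arbitrary placeholders, not the paper's architecture):

```python
def dense_params(widths):
    """Trainable weights + biases for a stack of fully connected layers
    with the given layer widths."""
    return sum(w_in * w_out + w_out for w_in, w_out in zip(widths, widths[1:]))

# one merged pathway of width 2n vs. two separated pathways of width n each
one_pathway = dense_params([200, 200, 200])        # 80,400 parameters
two_pathways = 2 * dense_params([100, 100, 100])   # 40,400 parameters
```

The merged pathway's weight matrices scale with the square of the width, so splitting the units into two non-interconnected pathways roughly halves the parameter count.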
Another possible reason that the two-pathway network performs better than a one-pathway network is that by training the two pathways to recognize different visual attributes separately, we have explicitly identified the different tasks and thereby made the overall task easier for the two-pathway network. In order to test this hypothesis, we first train a one-pathway network to recognize the orientations of objects (four possible orientations: up, down, left, right), then train it to recognize the locations of objects. After this task-segregated training, we then train the one-pathway network to recognize the orientations and locations of objects at the same time, and the testing accuracy is (2.1 ± 0.2)\(\%\). This accuracy is not higher than the accuracy of the one-pathway network that is trained to recognize the orientations and locations of objects at the same time.
In order to test whether the accuracy of the one-pathway network could increase with different initial learning rates, we train the one-pathway network to recognize orientation and location with different initial learning rates. The results are shown in Table 3. According to the results, the accuracy of the one-pathway network is always low even when different initial learning rates are used for training.
Discussion
In our current study, we compare the performance of one-pathway networks and two-pathway networks when four visual attributes (identity, luminance, orientation, location) are considered. We find that the two-pathway networks have higher performance (higher accuracy and less training time) than one-pathway networks, and the advantage of using two-pathway networks is larger when there are more classes within an attribute.
The performance of two-pathway networks is higher than that of one-pathway networks, with the advantage increasing with the number of classes
When there are two luminance levels and two orientations, the accuracies of two-pathway networks are higher than the accuracies of one-pathway networks in most cases. However, the differences are small and not significant in some cases. When the goal is to recognize the luminance and locations of objects, the average accuracy of the two-pathway network is lower than that of the one-pathway network, though the difference is small and not significant. In these cases, the performance of the two kinds of networks appears similar. However, the training and validation curves shown in Fig. 4 show that the two-pathway networks are always much more efficient than the one-pathway networks. When there are four luminance levels and four orientations, the accuracies of the two-pathway networks are always significantly higher than those of the one-pathway network, and the magnitude of the difference in accuracy between the two kinds of networks increases substantially. These results suggest that the advantages of using two-pathway networks grow as the number of classes in the visual attributes increases. In addition, Table 3 shows that when there are four possible orientations, the accuracy of the one-pathway network for recognizing orientation and location remains very low even when different initial learning rates are used for training.
The possible reasons for the performance differences between the two kinds of networks
The one-pathway networks find a local minimum at around 20 epochs, visible as a small plateau in both Fig. 4f and g. The accuracy is still very low around this plateau, and the one-pathway networks then require many epochs (around 300) to escape this local minimum and reach a much higher validation accuracy. This pattern suggests that it may be possible for the one-pathway networks to reach an accuracy similar to the two-pathway networks. Though the one-pathway network and the two-pathway network have the same number of units in each layer, they have different numbers of trainable parameters. The two pathways in the two-pathway network are separated and not interconnected, so the two-pathway network has fewer trainable parameters than the one-pathway network. Since the one-pathway networks have more connection weights to learn, they must search a much larger parameter space. As a result, they are much less efficient to train than the two-pathway networks. When the number of luminance levels and orientations increases, the networks must search an even larger parameter space to find the best solution. The parameter space is so large that even with many more training epochs, the one-pathway networks are still unable to find a good solution to the task. Therefore, one possible reason that the accuracy of the two-pathway networks is always much better than that of the one-pathway networks in these cases is that the two-pathway networks have fewer trainable parameters. Hence, we train another modified one-pathway network with half the number of units in each layer, so that it has roughly the same number of trainable parameters as the two-pathway network. According to the results, the modified one-pathway network still has much lower accuracy than the two-pathway network. These findings suggest that the reduced number of trainable parameters in the two-pathway network is not the reason for its performance advantage.
Because there are multiple objects in each image, there is a binding problem (e.g. which object identity goes with which location). And indeed, prior work has shown that the advantages of a two-pathway network are greater for multiple object displays17,18. However, when there is only a single object in an image, and therefore no binding problem, previous work11,16 has demonstrated that a single pathway still does not perform as well as a two-pathway network.
It is also possible that the two-pathway network performs better because we first train the two pathways to recognize different visual attributes separately; and by segregating the tasks during training, we make the overall task easier for the two-pathway network. In order to test this hypothesis, we train a one-pathway network to recognize the orientations of objects first, then train it to recognize the locations of objects. After this task-segregated training, we then train the one-pathway network to recognize the orientations and locations of objects at the same time and then obtain the testing accuracy. The accuracy of the one-pathway network trained in this way is not higher than the accuracy of the one-pathway network that is directly trained to recognize the orientations and locations of objects at the same time. Therefore, making the task easier by first separately training each of the two tasks does not seem to be the reason that the two-pathway network can perform better.
What do these results suggest about the brain?
There are many different visual attributes in natural images, and previous studies suggest that there are multiple segregated cortical visual pathways and subpathways in the brain1,2,3,4,5,6,7,8,9,10,22. In addition, Genç and colleagues23 found that higher intelligence in humans is related to lower dendritic density and arborization. Together, these findings support the hypothesis that neural networks with multiple segregated pathways are computationally advantageous because they have fewer connections between neurons when the total number of neurons is the same. However, though two-pathway networks have advantages over one-pathway networks, the performance of two-pathway networks may be worse if there are not enough neurons in each pathway11. There are a large number of neurons in the brain, but the number of neurons available in each visual pathway is more limited if there are many different visual pathways. Therefore, the brain faces a trade-off: it needs multiple segregated visual pathways to improve task performance, but it also needs to limit the number of visual pathways so that there are enough computational resources in each pathway. An important question, then, is: what determines whether the brain develops a separate visual pathway for a certain visual attribute? According to our current study, the advantage of using segregated visual pathways is larger when there are a greater number of classes, or when finer discrimination is needed, in the visual attributes. Therefore, the results of our study suggest that if a certain visual attribute varies widely in nature, or needs to be finely discriminated, it is much more efficient and accurate for the brain to develop a separate visual pathway for this attribute.
Likewise, if a certain visual attribute does not have many variations in nature or does not need to be finely discriminated, there is no computational advantage for the brain to develop a separate visual pathway for this attribute.
What do these results suggest about computer vision?
Computer vision and computational approaches to object recognition have made rapid advances (e.g., see review24). However, current computer vision models are still inefficient: state-of-the-art models usually require large amounts of computational resources and large datasets to train. It is critical to develop computation-efficient computer vision models to make wide adoption of these models more practical25. Hence, finding neural architectures that improve performance with increased efficiency has enormous consequences for computer vision. Previous studies have found that artificial neural networks with multiple pathways (branches) can significantly improve computer vision accuracy without significantly increasing model complexity11,16,17,26, so using segregated pathways may be a generally good principle for improving the computational efficiency of computer vision models. However, if the size of the model is limited, the performance of networks using multiple segregated pathways may decrease with too many pathways (branches) if there are not enough computational resources in each pathway (branch)11. Our findings suggest that in order to save computational resources, it is only computationally advantageous to use segregated pathways (branches) for visual attributes with a large amount of variation or for visual attributes that need to be finely discriminated.
Limitations and future directions
Our current study has several limitations. We consider only four visual attributes, but there are many other visual attributes in nature. In addition, we simplify the tasks by treating all visual attributes as discrete variables and training neural networks on classification tasks. However, some visual attributes (e.g., orientation, location) are continuous, and neural networks may need more complex algorithms to estimate the values of these continuous variables. Natural images also contain more complex scenarios, and the dataset in our current study does not fully represent them; with more complex scenarios, however, it becomes more difficult to control all the variables. More research is therefore needed to examine whether our findings generalize to more complex visual stimuli and tasks.
Conclusion
In summary, our study shows that it is always advantageous to use segregated pathways to process different visual attributes separately, and that this advantage increases with a greater number of classes. Two-pathway networks may not have higher accuracy when the number of classes in the visual attributes is small, but they are always more efficient to train than one-pathway networks; when the number of classes is larger, they have both much higher accuracy and much greater training efficiency. Our study suggests that it is much more computationally advantageous for the brain to develop separate visual pathways if the amount of variation of a given visual attribute is high or that attribute needs to be finely discriminated. Moreover, our findings confirm that architectures that capitalize on the segregation of processing may be a simple yet effective way to improve current computer vision algorithms. More importantly, our findings suggest that when the size of a computer vision model is limited, segregated pathways (branches) for a given visual attribute are highly computationally advantageous only if the attribute has a large amount of variation or needs to be finely discriminated.
Code availability
The code for this study will be available at https://github.com/Zhixian-Han. Additional code for our studies can be requested by contacting the corresponding author.
References
Ungerleider, L. G. & Mishkin, M. Two cortical visual systems. In Analysis of Visual Behavior (eds Goodale, M. et al.) 549–586 (MIT Press, 1982).
Mishkin, M., Ungerleider, L. G. & Macko, K. A. Object vision and spatial vision: two cortical pathways. Trends Neurosci. 6, 414–417 (1983).
Felleman, D. & Van Essen, D. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex 1, 1–47 (1991).
Logothetis, N. K. & Sheinberg, D. L. Visual object recognition. Annu. Rev. Neurosci. 19, 577–621 (1996).
Colby, C. L. & Goldberg, M. E. Space and attention in parietal cortex. Annu. Rev. Neurosci. 22, 319–349 (1999).
Aflalo, T. N. & Graziano, M. Organization of the macaque extrastriate visual cortex re-examined using the principle of spatial continuity of function. J. Neurophysiol. 105, 305–320 (2011).
Kravitz, D. J., Saleem, K. S., Baker, C. I. & Mishkin, M. A new neural framework for visuospatial processing. Nat. Rev. Neurosci. 12, 217–230 (2011).
Kravitz, D. J., Saleem, K. S., Baker, C. I., Ungerleider, L. G. & Mishkin, M. The ventral visual pathway: an expanded neural framework for the processing of object quality. Trends Cogn. Sci. 17, 26–49 (2013).
Pitcher, D. & Ungerleider, L. G. Evidence for a third visual pathway specialized for social perception. Trends Cogn. Sci. 25, 100–110 (2021).
Taubert, J., Ritchie, J. B., Ungerleider, L. G. & Baker, C. I. One object, two networks? Assessing the relationship between the face and body-selective regions in the primate visual system. Brain Struct. Funct. 227, 1423–1438 (2022).
Rueckl, J. G., Cave, K. R. & Kosslyn, S. M. Why are "what" and "where" processed by separate cortical visual systems? A computational investigation. J. Cogn. Neurosci. 1, 171–186 (1989).
Jacobs, R. A. & Jordan, M. I. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cogn. Sci. 15, 219–250 (1991).
Marblestone, A. H., Wayne, G. & Kording, K. P. Toward an integration of deep learning and neuroscience. Front. Comput. Neurosci. 10, 1–41 (2016).
Dwivedi, K., Bonner, M. F., Cichy, R. M. & Roig, G. Unveiling functions of the visual cortex using task-specific deep neural networks. PLoS Comput. Biol. 17, 1–22 (2021).
Scholte, H. S., Losch, M. M., Ramakrishnan, K., de Haan, E. H. & Bohte, S. M. Visual pathways from the perspective of cost functions and multi-task deep neural networks. Cortex 98, 249–261 (2018).
Han, Z. & Sereno, A. Modeling the ventral and dorsal cortical visual pathways using artificial neural networks. Neural Comput. 34, 138–171 (2022).
Han, Z. & Sereno, A. Identifying and localizing multiple objects using artificial ventral and dorsal cortical visual pathways. Neural Comput. 35, 249–275 (2023).
Han, Z. & Sereno, A. B. Understanding cortical streams from a computational perspective. J. Cogn. Neurosci. (2024).
Tamura, H. An analysis of information segregation in parallel streams of a multi-stream convolutional neural network. Sci. Rep. 14, 1–17 (2024).
Han, Z. & Sereno, A. B. A spatial map: a propitious choice for constraining the binding problem. Front. Comput. Neurosci. 18, 1–16 (2024).
Xiao, H., Rasul, K. & Vollgraf, R. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv (2017).
Livingstone, M. & Hubel, D. Segregation of form, color, movement, and depth: anatomy, physiology, and perception. Science 240, 740–749 (1988).
Genç, E. et al. Diffusion markers of dendritic density and arborization in gray matter predict differences in intelligence. Nat. Commun. 9, 1–11 (2018).
Manakitsa, N., Maraslidis, G. S., Moysis, L. & Fragulis, G. F. A review of machine learning and deep learning for object detection, semantic segmentation, and human action recognition in machine and robotic vision. Technologies 12, 1–40 (2024).
Wang, Y. et al. Computation-efficient deep learning for computer vision: A survey. arXiv (2023).
Xie, S., Girshick, R., Dollár, P., Tu, Z. & He, K. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 5987–5995 (2017).
Acknowledgements
We thank Aditya A. Shanghavi for comments on the manuscript. Funding to Anne B. Sereno: Purdue University and NIH CTSI (Indiana State Department of Health #20000703).
Author information
Authors and Affiliations
Contributions
Conceptualization - ZH, ABS; Algorithm Development - ZH, ABS; Experiment Design - ZH, ABS; Conducting Simulations - ZH; Data Collection - ZH; Data Analysis - ZH; Interpretation of Results - ZH, ABS; Writing Manuscript - ZH, ABS. All authors reviewed the manuscript.
Corresponding authors
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Han, Z., Sereno, A.B. Exploring neural architectures for simultaneously recognizing multiple visual attributes. Sci Rep 14, 30036 (2024). https://doi.org/10.1038/s41598-024-80679-6