AI bots could see like humans, but may require a sense of 'pressure'

How would a computer react to an approaching lion?
Sade Agard
Humans vs Robots concept image

In less than 100 milliseconds (roughly one-tenth of a second), the human brain can detect a familiar face or an oncoming car. More significantly, it can place that information in the proper context, allowing people to respond appropriately.

Computers would no doubt be quicker at this, but how accurate are their responses in a real-life situation, or better yet, in a fight-or-flight scenario? According to a new study published in JNeurosci, computers fail to fully reproduce human vision, and that is concerning.

Can computers react to danger?

The study demonstrates that deep neural networks cannot fully account for neural responses measured in human observers, a finding with significant implications for the use of deep learning models in real-world settings such as self-driving cars.

Deep neural networks, also called deep learning networks, are a type of artificial intelligence used to teach computers to analyze incoming data, such as recognizing faces and cars. This machine learning method uses interconnected nodes, or 'neurons,' arranged in a layered structure inspired by the human brain.
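The layered structure described above can be sketched in a few lines of code. This is a minimal illustration (not the networks used in the study): each layer multiplies its input by a weight matrix, adds a bias, and passes the result through a nonlinearity, and stacking several such layers yields a "deep" network. All sizes and values here are arbitrary examples.

```python
import numpy as np

def relu(x):
    # Element-wise rectified linear activation, a common choice in deep nets
    return np.maximum(0, x)

def forward(x, layers):
    # Pass an input vector through a stack of (weights, bias) layers,
    # loosely analogous to signals flowing through layered "neurons"
    for w, b in layers:
        x = relu(w @ x + b)
    return x

rng = np.random.default_rng(0)
# Illustrative layer sizes: 8 inputs -> two hidden layers -> 4 outputs
# (e.g. one score per object category, such as 'face' or 'car')
sizes = [8, 16, 16, 4]
layers = [(rng.standard_normal((n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(sizes, sizes[1:])]

scores = forward(rng.standard_normal(8), layers)
print(scores.shape)
```

Real recognition networks add many more layers, learned weights, and convolution for images, but the layered node structure is the same idea.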

However, despite the strength and promise of deep learning, computers have not yet mastered human computation, and, more importantly, the link and communication between the body and the brain, particularly concerning visual recognition.

"While promising, deep neural networks are far from being perfect computational models of human vision," said study lead Marieke Mur in a press release.

Previous studies have shown that deep learning cannot precisely replicate human visual recognition, but few have attempted to determine which specific features of human vision deep learning fails to replicate.

The team employed magnetoencephalography (MEG), a non-invasive neuroimaging technique that measures the magnetic fields generated by electrical currents in the brain.

Mur and her team found that readily nameable parts of objects, such as 'eye,' 'wheel,' and 'face,' can account for variance in human neural dynamics over and above what deep learning can deliver.
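The logic of "variance over and above what deep learning can deliver" can be illustrated with a toy variance-partitioning analysis: fit a neural response with DNN features alone, then with DNN plus nameable-label features, and compare explained variance. This is a schematic sketch on synthetic data, not the study's actual pipeline, and all feature names and sizes are invented for illustration.

```python
import numpy as np

def r_squared(X, y):
    # Fraction of variance in y explained by a least-squares fit on X
    X1 = np.column_stack([X, np.ones(len(y))])  # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(1)
n = 200
dnn_feats = rng.standard_normal((n, 5))       # stand-in DNN features
label_feats = rng.standard_normal((n, 3))     # stand-in nameable labels ('eye', 'wheel', 'face')
# Synthetic "neural response" driven by both feature sets plus noise
y = (dnn_feats @ rng.standard_normal(5)
     + label_feats @ rng.standard_normal(3)
     + 0.5 * rng.standard_normal(n))

r2_dnn = r_squared(dnn_feats, y)
r2_both = r_squared(np.column_stack([dnn_feats, label_feats]), y)
print(f"DNN alone: {r2_dnn:.2f}, DNN + labels: {r2_both:.2f}")
```

The gain from `r2_dnn` to `r2_both` is the variance the human-generated labels explain beyond the DNN features, which is the quantity the study reports as structured and nameable.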

"This discovery provides clues about what neural networks are failing to understand in images, namely visual features that are indicative of ecologically relevant object categories such as faces and animals," said Mur. 

"We suggest that neural networks can be improved as models of the brain by giving them a more human-like learning experience, like a training regime that more strongly emphasizes behavioral pressures that humans are subjected to during development."

For instance, it is critical for humans to swiftly determine whether an object is an approaching animal and, if so, to anticipate its likely next move. Incorporating these pressures during training may improve the ability of deep learning models to simulate human vision.
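One simple way such behavioral pressures could enter training is through a category-weighted loss, so that mistakes on ecologically relevant categories cost the model more. The sketch below is purely hypothetical: the study proposes the general idea of behavioral pressures, but the weighted cross-entropy, the category list, and the weight values here are our own illustrative assumptions.

```python
import numpy as np

def weighted_cross_entropy(probs, target, weights):
    # Standard cross-entropy on the target class, scaled by a per-category
    # weight so that errors on high-stakes categories are penalized more.
    # Weights are illustrative "behavioral pressure" values, not from the study.
    return -weights[target] * np.log(probs[target])

# Hypothetical categories: 0=animal, 1=face, 2=car, 3=chair
weights = np.array([3.0, 3.0, 1.0, 1.0])  # assumed pressure weights
probs = np.array([0.1, 0.2, 0.4, 0.3])    # a model's predicted probabilities

loss_animal = weighted_cross_entropy(probs, 0, weights)  # true class: animal
loss_chair = weighted_cross_entropy(probs, 3, weights)   # true class: chair
print(loss_animal > loss_chair)
```

Under this scheme, a network that under-predicts 'animal' is pushed harder to correct itself than one that under-predicts 'chair,' loosely mirroring the survival stakes humans face during development.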

The full study was published in JNeurosci on March 8.

Study abstract:

Deep neural networks (DNNs) are promising models of the cortical computations supporting human object recognition. However, despite their ability to explain a significant portion of variance in neural data, the agreement between models and brain representational dynamics is far from perfect. We address this issue by asking which representational features are currently unaccounted for in neural time series data, estimated for multiple areas of the ventral stream via source-reconstructed magnetoencephalography data acquired in human participants (nine females, six males) during object viewing. We focus on the ability of visuo-semantic models, consisting of human-generated labels of object features and categories, to explain variance beyond the explanatory power of DNNs alone. We report a gradual reversal in the relative importance of DNN versus visuo-semantic features as ventral-stream object representations unfold over space and time. Although lower-level visual areas are better explained by DNN features starting early in time (at 66 ms after stimulus onset), higher-level cortical dynamics are best accounted for by visuo-semantic features starting later in time (at 146 ms after stimulus onset). Among the visuo-semantic features, object parts and basic categories drive the advantage over DNNs. These results show that a significant component of the variance unexplained by DNNs in higher-level cortical dynamics is structured and can be explained by readily nameable aspects of the objects. We conclude that current DNNs fail to fully capture dynamic representations in higher-level human visual cortex and suggest a path toward more accurate models of ventral-stream computations.