So I was having tea alone outside the office when I heard the word ‘Rasher’, which is basically a food item (a thin slice of bacon).
Now, I hadn’t heard this word before, so I couldn’t imagine what it is or how it looks. I couldn’t even imagine how it tastes, because I didn’t know it was food in the first place.
And the funny thing is, since the people around me were talking about it and I didn’t know what it meant, I couldn’t understand what they were talking about at all. I had no clue.
This! This situation, I think, best describes why the research we are doing in machine learning seems to be heading in the wrong direction.
When I heard the word ‘Rasher’, I didn’t know what it was; I had never seen the thing before in my life, so my brain wasn’t able to understand it. I understood it better only when I went back to the office, searched a bit, and saw that it’s a food.
So our brain doesn’t understand something just from the textual information we get; we truly understand something when we see it.
When a baby hears the word ‘car’ for the first time, it doesn’t understand what the word means until it sees a car and thinks something like ‘Oh! So a car looks like this: it has four wheels, a steering wheel, seats’, and so on. After that, when somebody talks about a car, everything is understood.
Our brain is designed to understand things by looking at them. Most of what we know (in fact, all the knowledge, awareness, and feel that we have) is a collaboration of the visual as well as the textual information we receive.
One of the major problems I see in current machine learning research is that
- we are researching image processing and text processing separately.
- That’s the primary reason behind the failure: we have some awesome text processing algorithms like Word2Vec, but they don’t truly work, because when such an algorithm encounters a word, say ‘car’, it doesn’t know what a car is! It doesn’t have the visual information to truly understand the word ‘car’. As a result, we need hundreds and thousands of datasets and dozens of algorithms to cover each and every use case in life.
- On the other hand, we have some awesome image processing algorithms that can detect almost anything in an image, but they still struggle to solve real-life problems. Although the algorithm understands what a car looks like, it doesn’t know the purpose of the objects in the image, because it doesn’t get any description beyond what is available in the image. So although the algorithm knows what a car looks like, it doesn’t know why it’s there in the image. It doesn’t know why it’s waiting at a traffic signal… I know, this explanation carries a lot of overhead, but think of it this way. Keep reading…
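To make the point about purely textual models concrete, here is a minimal toy sketch. It is not Word2Vec itself; the tiny corpus, window size, and raw co-occurrence counting are my own simplifications. It only shows the underlying idea: a text-only vector for ‘car’ is built entirely from neighbouring words, so no visual information ever enters it.

```python
# Toy distributional "embedding": a word is represented ONLY by the
# words that appear near it in text. No pixels, no visual grounding.
from collections import Counter
import math

corpus = [
    "people sit in the car",
    "people sit in the bus",
    "the car waits at the signal",
    "the bus waits at the signal",
]

def context_vector(word, corpus, window=2):
    """Count words appearing within `window` tokens of `word` -- that
    co-occurrence count is ALL this kind of model knows about the word."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, tok in enumerate(tokens):
            if tok == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(t for j, t in enumerate(tokens[lo:hi], lo) if j != i)
    return counts

def cosine(a, b):
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

car, bus = context_vector("car", corpus), context_vector("bus", corpus)
# 'car' and 'bus' come out essentially identical, because in this corpus
# they share the same textual contexts -- the model cannot tell them apart,
# let alone say what either one looks like.
print(round(cosine(car, bus), 2))
```

In this toy corpus the two vectors are identical, so the similarity is 1.0. That is exactly the limitation described above: similarity here reflects shared sentences, not shared appearance or function in the world.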
Let me explain this clearly with an example.
Currently, we have algorithms to detect cars in an image, and we have algorithms to detect semantics/relations between words in a sentence. But when I want to teach an AI what “People sit in the car” means, I can’t do that… right?
This happens because our image processing algorithms know what a car is, but when a model sees this sentence, it doesn’t know what the word ‘car’ means in “People sit in the car”. The visual knowledge exists only on the image side; it never carries over into the text processing operation. The model can’t pick up the context, and that’s the problem. That’s the reason we still have to create 100 different ML algorithms to teach it 100 different things, and it still fails: our algorithms don’t use visual detection together with text processing.
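One way to picture what “using visual detection together with text processing” could mean is a shared embedding space, where an image encoder and a text encoder map into the same coordinates, so the word ‘car’ lands near the pixels of a car. Everything below is invented for illustration (the vectors are hand-made and the names are hypothetical); it is a sketch of the idea, not any real model or API.

```python
# Hypothetical sketch: words and detected image objects live in ONE
# shared 2-d space, so a word can retrieve the object it refers to.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pretend an image encoder already mapped detected objects into the space...
image_embeddings = {
    "pixels_of_a_car":    (0.9, 0.1),
    "pixels_of_a_person": (0.1, 0.9),
}
# ...and a text encoder mapped words into the SAME space.
text_embeddings = {
    "car":    (0.85, 0.15),
    "people": (0.15, 0.85),
}

# Now "People sit in the car" can be grounded: each word finds the
# visual object it refers to by nearest neighbour in the shared space.
for word in ("people", "car"):
    best = max(image_embeddings,
               key=lambda k: cosine(image_embeddings[k], text_embeddings[word]))
    print(word, "->", best)
```

With shared coordinates like these, ‘car’ in a sentence is no longer just a string: it points at the same region of the space as the car detected in an image, which is the kind of context awareness the argument above is asking for.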
And that, my friends, is the problem…
I completely understand that it’s a challenge, but it’s the only true way to advance machine learning…
Machine learning is in vain without true context awareness… once that is achieved, we will need just one algorithm, one that keeps reading all the literature in the world and understands each and every thing…