A few weeks ago, I had the pleasure of attending a compelling lecture on computer vision at The Royal Society in London. Professor Andrew Zisserman showcased an innovative approach to building models, much like how a child learns – by cross-referencing visual, audio, and text data. That is an oversimplified summary, however; the actual process can broadly be summed up in three steps, and it really got me thinking about AI and some of the issues the lecture uncovered.
Teaching AI like a child
If you haven’t been able to find time to watch the video, teaching AI like a child can be summed up in the following three steps:
- Audio/Visual Synchronisation
This sounds fancy, but it simply means training a model to associate mouth movements with the sounds they produce.
- Audio/Visual Correspondence
The next step involves teaching the model to associate different sounds with different images, e.g. the sound of a guitar with footage of someone playing one.
- Language/Visual Correspondence
Finally, the model has to go to school and learn associations between phrases and images.
By completing these steps you’ve essentially taught the model to associate words with sounds without ever training it on that pairing directly: it has learned to associate sounds with images, and images with phrases, so the link between sound and phrase follows on its own.
Let’s think of it another way. Take the phrase “man playing a guitar”: in a traditional machine learning sense you’d need to see that phrase, hear the guitar playing and see a picture of a guitar a few thousand times to associate everything, assuming you have enough examples. Instead, you first learn what a guitar looks like alongside what it sounds like, then learn the relationship between “man playing a guitar” and images of guitars. Now, if you heard a guitar, it might make you think of the phrase. You’ve linked them up with a fraction of the training data. Pretty smart!
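To make the idea concrete, here’s a toy sketch in Python. The embeddings are tiny hand-made vectors standing in for what trained encoders would actually produce, and the numbers are purely illustrative – the point is only that when audio and text are each aligned to the same image in a shared space, they end up close to each other without ever being trained against each other directly.

```python
import math

def cosine(a, b):
    # Cosine similarity: 1.0 means same direction, near 0.0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made stand-ins for learned embeddings in one joint space.
guitar_image = [0.90, 0.10, 0.20]  # anchor: the visual concept of a guitar
guitar_audio = [0.85, 0.15, 0.25]  # pulled towards the image during audio/visual training
guitar_text  = [0.88, 0.05, 0.30]  # pulled towards the image during language/visual training
dog_audio    = [0.10, 0.90, 0.30]  # an unrelated sound, for contrast

# Audio and text were never paired directly, yet both were aligned
# to the same image, so they land near each other anyway.
assert cosine(guitar_audio, guitar_text) > cosine(dog_audio, guitar_text)
```

In a real system the alignment would be learned with a contrastive objective over huge numbers of video clips, but the geometry is the same: the image acts as the bridge between sound and phrase.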
At the end of the video, Professor Zisserman shows a model that has been trained to generate audio descriptions and closed captions from film scenes. In one example it identifies Harry Potter flying off on a dragon as him being on a horse instead. I can’t help but see similarities to how a child observes the world, confidently interpreting unfamiliar objects with a limited pool of knowledge. I once heard my young cousin, upon seeing a brown cow in a field for the first time, say excitedly, “look mummy, look at that big dog in the field eating the grass!!”.
Leaps of imagination
Given this example, something I’ve been thinking about is how training a model in this way requires a “leap” of imagination by the model. To put it more technically, the associations we talked about building earlier – the “joint embedding space”, as it’s known – aren’t a clean match. At some level there is a distance between the two inputs, and perhaps this “leap”, where you accept a given distance between them, is something we think of as a very human part of our intelligence – similar to how you might picture atoms smashing into each other the way billiard balls collide in a game of pool. These “leaps” in generalising objects and applying knowledge from one context to another have a distinctly human-like quality, perhaps even an evolutionary one – think of how you might associate any flying insect with wasps and pain. Although… there’s also a chance I’m reading too much into it; perhaps the model just didn’t have enough training data of dragons. Then again, come to think of it, I don’t think I’ve ever seen a dragon either.
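The “leap” can be sketched as a nearest-neighbour lookup with a tolerance. Everything here is hypothetical – the concept names, the vectors and the threshold are made up for illustration – but it captures the behaviour: faced with something it has never seen, the model confidently picks the closest concept it *does* know, provided the leap isn’t too big.

```python
import math

def distance(a, b):
    # Euclidean distance between two points in the joint embedding space.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings for concepts the model has actually seen.
known_concepts = {
    "horse": [0.8, 0.6, 0.1],
    "car":   [0.1, 0.2, 0.9],
    "boat":  [0.2, 0.1, 0.7],
}

def interpret(embedding, max_leap=1.0):
    # Find the nearest known concept; accept it only if the "leap"
    # (the distance we're willing to tolerate) is small enough.
    name, d = min(
        ((n, distance(embedding, e)) for n, e in known_concepts.items()),
        key=lambda pair: pair[1],
    )
    return name if d <= max_leap else "unknown"

# A dragon the model has never seen: large, four-legged, ridden by someone...
dragon = [0.7, 0.7, 0.2]
print(interpret(dragon))  # nearest known concept: "horse"
```

Raise `max_leap` and the model grows bolder in its generalisations; lower it and it admits ignorance more often – a crude dial for how much imagination we allow.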
Interpreting a scene from Star Trek
Here’s a thought: what if we could “interpret” what the audio description AI really saw in the programme using another model that had been “informed” about what it was going to watch? Essentially, we’d be setting the stage: rather than baking a single context or scenario into the AI, we’d allow it to dynamically adjust based on the “lens” we provide. Take, for instance, a typical scene from Star Trek. The same scene could be interpreted quite differently by our AI depending on whether it was told it’s watching a sci-fi show or a behind-the-scenes documentary.
In the midst of all the buzz around ChatGPT and other generative models, I can’t help but consider the role of context in these AI models. Looking at the audio description model’s interpretation of film scenes, I feel these models could really benefit from the context that a generative model could provide.
Consider this scenario – if the model sees a clip of William Shatner gripping a giant ball of painted polystyrene and throwing it at a rubber monster, how would it interpret this? In the context of a sci-fi show, it might caption the scene as “James T. Kirk hurls a heavy boulder at an alien on a strange alien planet”. But if informed it’s watching a documentary, it might describe it humorously as “William Shatner, with all his theatrical glory, lifts a ‘heavy’ polystyrene ball and chucks it at a guy who’s probably cursing his luck for having to wear a rubber suit under the unforgiving Californian sun”.
Experimenting with ChatGPT
The idea of dynamically shifting the context based on cues or priming doesn’t sound so far-fetched for AI, especially seeing how ChatGPT can build upon a conversation and understand the deeper relationships within it – this is what I mean by context. As an experiment, I gave ChatGPT a generic audio description of a scene that read “William Shatner flicks his wrist to open a device that makes a noise. He talks into the device and then suddenly fades away and appears in another location”, then informed it that it was interpreting a Star Trek episode. It was able to rephrase the scene as “Captain Kirk effortlessly operates his trusty communicator with a flick of his wrist, speaking his orders into the device. In the blink of an eye, he engages the transporter, dematerialising and then reappearing in a completely different location on the ship”. Perhaps the future is already here!
It’s worth noting that the audio description AI wouldn’t necessarily need specific training on what a “communicator” or “transporter” is, as this context is provided by a generative model like ChatGPT. It would just need to understand basic interactions and actors – essentially describing what it sees – allowing another AI to interpret it in a given context. In many ways, this kind of context may actually be needed for audio descriptions throughout a whole episode. Think of how one scene might establish an alien planet called Vulcan at the beginning of an episode, with later scenes cutting back to it throughout the programme. If the audio description AI didn’t use context and saw only the last scene featuring the planet, would it know what to call it? A ChatGPT-like AI could be fed the audio descriptions for each scene and build upon them for the next. It also underscores the importance of imbuing AI models with not just data, but the ability to interpret that data from different perspectives.
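The scene-by-scene chaining could look something like the sketch below. The class name, prompt wording and example descriptions are all my own invention – in practice the prompt would be sent to a generative model such as ChatGPT, which does the actual rephrasing; this only shows how earlier scenes could prime later ones.

```python
class ContextualDescriber:
    """Accumulates scene-by-scene context so that later descriptions
    can reuse names established earlier in the programme."""

    def __init__(self, show_context):
        self.show_context = show_context  # e.g. "a Star Trek episode"
        self.history = []                 # generic descriptions of earlier scenes

    def build_prompt(self, generic_description):
        # Prime the generative model with the show's context plus
        # everything the episode has established so far.
        prior = " ".join(self.history) if self.history else "Nothing yet."
        prompt = (
            f"You are writing audio descriptions for {self.show_context}.\n"
            f"Previously established: {prior}\n"
            f"Rephrase this generic scene description in context: "
            f"{generic_description}"
        )
        self.history.append(generic_description)
        return prompt

describer = ContextualDescriber("a Star Trek episode")
describer.build_prompt("An establishing shot of a red desert planet called Vulcan.")
later = describer.build_prompt("The same red desert planet appears again.")
# 'later' carries the earlier Vulcan set-up, so the generative model
# can name the planet even though this scene never names it.
assert "Vulcan" in later
```

A fuller version would summarise the history rather than concatenate it (to stay within the model’s input limits), but the principle is the same: each scene’s description becomes context for the next.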
By coupling or chaining models together in this way, we could achieve highly advanced behaviours using much less training data. In the context of writing audio descriptions for TV shows, it might be sufficient to train the model on screenplays and photos of actors rather than having it trudge through thousands of videos of episodes. This practice could significantly improve the efficiency of model training, whilst also making it more adaptable to different situations.
Now, I’d like to turn this over to you: What do you think of this approach? Can you envision other applications where this method could be beneficial? How else do you think we could improve the way we train AI models? Please share your thoughts in the comments below.
Of course, this is just an idea and my thoughts, and as we continue exploring the evolving landscape of AI and machine learning, I encourage you to consider these points and join the discussion. If you found this post interesting, please consider sharing it with your colleagues or friends interested in AI.