Armed with his MacBook Pro and a modified version of image recognition software NeuralTalk2, McDonald walked the streets of Amsterdam and recorded as his computer tried to identify what it was seeing. The results are fascinating, if not consistently accurate.
If you look out your window right now, you can describe what you see instantly and accurately. It seems simple, but teaching computers to do the same thing is complicated. The computer needs to analyze the scene, identify individual components, and then figure out how they relate to each other.
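To make those three steps concrete, here is a toy sketch in Python. It is not how NeuralTalk2 works: a real captioner learns all of this from data with neural networks, whereas this example uses hand-written rules. The `Detection` class, the `relation` and `caption` functions, and the coordinates are all hypothetical, and the "analyze the scene" step is assumed to have already produced a list of labeled objects.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """One recognized object (hypothetical output of a scene-analysis step)."""
    label: str
    x: float  # horizontal center of the object, 0 (left edge) to 1 (right edge)
    y: float  # vertical center of the object, 0 (top edge) to 1 (bottom edge)


def relation(a: Detection, b: Detection) -> str:
    """Describe where object a sits relative to object b."""
    dx = b.x - a.x
    dy = b.y - a.y
    if abs(dx) >= abs(dy):
        # The objects are mostly separated horizontally.
        return "to the left of" if dx > 0 else "to the right of"
    # Image y grows downward, so a smaller y means higher in the frame.
    return "above" if dy > 0 else "below"


def caption(detections: list[Detection]) -> str:
    """Turn the first one or two detections into a short phrase."""
    if not detections:
        return "nothing recognizable"
    if len(detections) == 1:
        return f"a {detections[0].label}"
    a, b = detections[0], detections[1]
    return f"a {a.label} {relation(a, b)} a {b.label}"


scene = [Detection("boat", 0.30, 0.70), Detection("building", 0.35, 0.20)]
print(caption(scene))  # a boat below a building
```

Even this tiny version shows where things go wrong: if the first step mislabels an object, every later step faithfully describes the mistake.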
This video makes that clear. When the captions are accurate, it’s uncanny. “A boat is docked in the water near a city” and “a row of bikes parked next to each other” are both as accurate as they are quintessentially Amsterdam. And the look on the face of “a man eating a hot dog in a crowd” alone is worth the price of admission.
But most of the captions are totally wrong, and that’s possibly even more interesting. Why does the neural network describe McDonald, who is clearly wearing a hoodie, as “wearing a suit and tie”? Is it confusing his zipper for a tie? Why is it constantly seeing clocks where none exist? Why does it perceive so many colorful things as black and white?
NeuralTalk2 is open-source software designed to look at photos and caption them, identifying things in the images and attempting to put them into context. You can set NeuralTalk2 up yourself if you want to make a Thanksgiving project out of it, but you should know that it’s not exactly user-friendly. You’re going to need to invest some time and some smarts in this one.
But don’t worry. If you’re not a programming wizard, you can just enjoy the video, or check out more caption examples here.