Is visual object detection and recognition your thing? Assuming you always want the fastest and most accurate results for your queries, you might be interested in how the “new kid on the block”, aka Google’s Cloud Vision API, performs when compared to a few of its competing services. Well, we did this just for you and want to share our results.
Google Cloud Vision API
Google recently announced the open Beta release of their Cloud Vision API. As object and text recognition has become more and more sophisticated and many different providers for said services appeared throughout the last few years, we thought it would be interesting to do a quick review to find out how well the new service by Google works. Then, we also did a brief comparison between Google’s approach and several other providers of image recognition to see how it stacks up against the rest of the field.
The Cloud Vision API offers multiple types of recognition / detection: 
- Face Detection
detects if any faces appear in a picture, along with facial features like eye, nose and mouth placement, as well as the likelihood of various attributes like joy and sorrow. Google also states that they don’t store any facial detection / recognition information on their servers – but I’ll let you make up your own mind about that…
- Landmark Detection
detects and identifies popular natural and manmade structures (we did not find any other service providing this feature)
- Logo Detection
detects and identifies product logos within an image and returns the brand name
- Label Detection
picks out the dominant entity within an image and returns matching metadata
- Text Detection (OCR)
detects text in an image (while supporting many different languages)
- Safe Search Detection
detects inappropriate content within an image (comparable to Google SafeSearch)
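For readers who want to poke at the API themselves: the public REST endpoint (`POST https://vision.googleapis.com/v1/images:annotate`) accepts a JSON body listing the requested feature types alongside a base64-encoded image. Below is a minimal sketch of how such a request body might be assembled for the four features we exercise in this test; the image bytes are a placeholder and the actual HTTP call (with your API key) is left out.

```python
import base64
import json

# The four Cloud Vision v1 feature types exercised in this article.
FEATURES = ["LOGO_DETECTION", "LANDMARK_DETECTION",
            "LABEL_DETECTION", "TEXT_DETECTION"]

def build_annotate_request(image_bytes, max_results=10):
    """Build the JSON body for POST https://vision.googleapis.com/v1/images:annotate."""
    return {
        "requests": [{
            # Images are sent inline as base64 (alternatively via a GCS URI).
            "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
            "features": [{"type": t, "maxResults": max_results} for t in FEATURES],
        }]
    }

# Placeholder bytes – in practice, read your photo from disk.
body = build_annotate_request(b"\x89PNG...fake bytes...")
print(json.dumps(body)[:60])
```

The same body would then be POSTed with an `?key=YOUR_API_KEY` query parameter, or with an OAuth bearer token.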
For this little test scenario, we chose to use only four of the above-mentioned features (logo, landmark, label and text recognition) to see how well the API performs when given different images. The numbers in the pictures show the “level of confidence” for each tag, with 1 being the best (which would indicate that the algorithm is 100% sure the found metadata is correct).
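In the raw API response, that confidence arrives as a `score` float between 0 and 1 on each annotation; the percentages quoted throughout this article are simply `score * 100`. Here is a small sketch of filtering tags by confidence, run against a hypothetical (hand-written, not actual API output) response excerpt:

```python
# Hypothetical excerpt of an annotate response; real responses nest these
# lists under response["responses"][0]. Scores are floats in [0, 1].
sample = {
    "labelAnnotations": [
        {"description": "wedge", "score": 0.71},
        {"description": "putter", "score": 0.64},
    ],
    "logoAnnotations": [
        {"description": "Cherry", "score": 0.55},
    ],
}

def tags_above(annotations, threshold=0.5):
    """Keep tags at or above the confidence threshold, as (tag, percent) pairs."""
    return [(a["description"], round(a["score"] * 100))
            for a in annotations if a["score"] >= threshold]

print(tags_above(sample["labelAnnotations"]))  # [('wedge', 71), ('putter', 64)]
```

Picking the threshold is up to you; as the results below show, even tags above 60% can be confidently wrong.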
Let’s try this
The first test consisted of taking pictures of various items on the office desk, relatively close up. Big surprise, the algorithm was capable of detecting at least two of the three logos in the frame, as well as some semi-precise labels for the objects. One could argue that Cherry has done a pretty good job at tricking the detection software into thinking those mouse buttons are made of iron – which they are definitely not – or that we are really good at hiding a “wedge (71%)” and a “putter (64%)” somewhere in the frame, as Google’s results quite confidently stated that they were there.
Anyway, moving on. The next picture shows almost the same objects, but now from a different angle and a bit further away. Although one more logo is in the frame as a “suggestion” for detection, the algorithm now decides that there aren’t any logos in the frame anymore – which is kind of a surprise. At least the labelling no longer finds the “putter”; instead it sees a “mobile phone (85%)” and a “hand (65%)” – well, OK. On the other hand (see what I did there), labels like “gadget”, “computer”, “input device” or “computer hardware” could hardly match any better, and the text recognition also worked extremely well here.
As we move the camera even further away and introduce more objects in the frame (like my perfectly executed cable-management), Google can now see a “vehicle (64%)” and a “drum machine (51%)” in the picture but no logos at all. This is a bit funny, as the text recognition was able to read the BenQ logo on the monitor but was not able to detect it as being a logo. All in all though, the general labelling job did quite OK again.
But what if I don’t always just want to photograph my office gear? I thought we should also try the following. For this small test, we used a picture of the “Uhrturm” clock tower, a well-known landmark – shame on you if you don’t know it – located here in Graz. It obviously does not have the same reputation as, for example, the Eiffel Tower or the Statue of Liberty, but it still has the potential of being detected (as you will see in a second).
The algorithm gave no false alarms concerning logos, but was, maybe surprisingly, able to detect and identify the landmark as “Graz Schloßberg Clock Tower (98%)”, which is a really good and confident value. The general labelling also did great here with things like “clock tower (94%)”, “tower (83%)” and “steeple (56%)”. The only small issue to complain about was the falsely detected single character “a” (maybe because of the A-shaped rose arches?), but overall the software did a really great job for this sample picture.
What about the competition?
This service returned a huge amount of tags matching and describing the submitted picture, but was not able to identify the shown landmark itself. But keep in mind that this feature was never advertised for this service, so we can’t blame them for that. One unique feature of this service is its verbal representation of colors in the picture along with their relative percentage of occurrence.
After numerous attempts and being prompted with “Too many requests” by the demo, one try finally succeeded. The response “white and black concrete clock tower” was neither the most accurate, nor could CloudSight’s algorithm detect the landmark present in the picture. A positive thing to mention here though is the fact that the response was a sentence-like description actually readable by a human instead of a word cluster.
This service was again more or less able to categorize and tag the submitted image, but the accuracy was pretty weak. Or to put it another way: the returned terms were too “meta” in our opinion (“no person”), yet undeniably still correct. The conclusion for this service is that the returned tags described the visible image just too vaguely. We also tried the API service, which provides probabilities for every tag as well as colors, but that didn’t help us improve the overall object recognition results.
Now, can I actually honestly say that Google Cloud Vision is the #1?
Compared to the tested services, my answer is: yeah!
“But… what about using all the other providers of such services for your comparison?” I hear you ask. In order to answer that, we would have to hire somebody to do the testing full time – feel free to apply if that’s your cup of tea. So I guess my final answer is: maybe.
Google’s approach does a great job at what it’s supposed to do. You just shouldn’t expect any wonders / magic from it – although it still is a Google service, so who knows. At the time of writing we can recommend it, especially since the quality of, for example, labelling or logo detection will probably improve further. Compared to the quality and feature set of the other object detection and recognition services we tested, Google is in the leading position.
P.S.: You can probably imagine what the results of testing our own logo were. If you can’t, just have a look at the sad truth below… Ah, well.