Google’s Vision: Detection & Recognition at the Next Level

There have been numerous technologies, free and paid, for recognition and detection for quite some time now. But detecting and recognizing objects and text has become an essential feature for AI and the technologies that depend on it. I have tried quite a few OCR tools and detectors; most produced below-average to average results with lots of ghost characters.

Here comes Google

Mobile Vision is another interesting framework by Google. It is capable of detecting objects in photos and videos. Yes, you heard it right!
Currently, the Android API can recognize:

  • Face
  • Barcode
  • Text

whereas the API for iOS detects only faces and barcodes at this point in time.

The beauty of this framework is its ability to work with or without network connectivity. So we can now detect faces, barcodes, and text in an image or video just by having the API installed on a device.

Let’s talk about the detectors. The Google Mobile Vision API includes detectors that locate visual objects in images and video frames and return the position of each object. More interesting still, we can combine multiple detectors to detect all supported objects in a frame simultaneously. It works in real time on the device camera as well as on existing images and recorded videos. The available detectors are:


  • Face detector
  • Barcode detector
  • Text detector

Face Detector

This face detection API tracks human faces in still images, recorded video, and the live mobile camera feed. It tracks facial landmarks such as the eyes and nose, and provides information about the state of each face. Currently, the Android face API supports only two classifications: eyes open and smiling. It can track multiple faces in a frame.

You can try face detection by following the steps in the Google codelab for face detection. This API can be used to create hands-free controls for games and apps, e.g. reacting when a person smiles or winks.
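To make the smile/wink idea a little more concrete, here is a minimal sketch in plain Java. It assumes the detector hands back, for each face, a probability between 0 and 1 for "smiling" and for each eye being open, and a negative value when a probability could not be computed. As far as I know the Android Face class exposes these through getIsSmilingProbability(), getIsLeftEyeOpenProbability() and getIsRightEyeOpenProbability(), but please check the documentation. The class name, method names, and the 0.5 threshold below are my own illustrative choices, not part of the API.

```java
/**
 * Illustrative helper (hypothetical, not part of Mobile Vision) that turns
 * per-face classification probabilities into simple yes/no signals.
 */
final class FaceExpressionSketch {
    // Assumed cut-off: treat a probability at or above 0.5 as "true".
    static final float THRESHOLD = 0.5f;

    private FaceExpressionSketch() {}

    /** A negative probability is assumed to mean "not computed". */
    static boolean isKnown(float probability) {
        return probability >= 0f;
    }

    /** True when the smiling probability is known and at or above the threshold. */
    static boolean isSmiling(float smilingProbability) {
        return isKnown(smilingProbability) && smilingProbability >= THRESHOLD;
    }

    /** True when exactly one eye is judged open; unknown values never count as a wink. */
    static boolean isWinking(float leftEyeOpenProbability, float rightEyeOpenProbability) {
        if (!isKnown(leftEyeOpenProbability) || !isKnown(rightEyeOpenProbability)) {
            return false;
        }
        boolean leftOpen = leftEyeOpenProbability >= THRESHOLD;
        boolean rightOpen = rightEyeOpenProbability >= THRESHOLD;
        return leftOpen != rightOpen;
    }
}
```

In an app you would call something like FaceExpressionSketch.isWinking(left, right) for each tracked face on every frame. A real control would probably also want some smoothing across frames, since a normal blink can briefly look like a wink.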

Note: This API does not support face recognition; it cannot determine whether two faces are likely to correspond to the same person.

Barcode Detector

This API can detect barcodes through the mobile camera as well as in a picture.
It supports the following barcode formats:

  • 1D barcodes: EAN-13, EAN-8, Code-39, Code-93, Code-128, UPC-A, UPC-E, ITF, Codabar
  • 2D barcodes: PDF-417, AZTEC, QR Code, Data Matrix

It can also parse the values encoded in a barcode, e.g. URL, contact information, calendar events, email, phone, SMS, ISBN, WiFi, geolocation, and AAMVA driver license/ID.

Most importantly, it can detect multiple barcodes at once and in any orientation. You can try the barcode codelab to integrate it into your app.

Text Recognition

Google Mobile Vision’s text recognition was introduced recently (as announced on the Google blog). This API currently detects text in languages including, but not limited to, Catalan, Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Latin, Norwegian, Polish, Portuguese, Romanian, Spanish, Swedish, Tagalog, and Turkish. The output of text detection is organized into segments: blocks, lines, and words. The image below explains the block, line, and word concepts.

[Image: Google Vision text detection]
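If the picture is hard to imagine, the block/line/word idea can also be sketched as a small tree in plain Java. This is only a model for illustration: Segment and SegmentWalkSketch are hypothetical names of mine, not API classes. In the Android API I believe a text block exposes its lines, and a line its words, through getComponents(), but treat that as an assumption and check the docs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for a detected text segment (block, line, or word). */
final class Segment {
    final String value;              // the text covered by this segment
    final List<Segment> components;  // child segments; empty for a word

    Segment(String value, Segment... components) {
        this.value = value;
        this.components = Collections.unmodifiableList(Arrays.asList(components));
    }
}

final class SegmentWalkSketch {
    private SegmentWalkSketch() {}

    /** Collects the leaf segments (words) under a block or line, in stored order. */
    static List<String> words(Segment segment) {
        List<String> out = new ArrayList<>();
        collect(segment, out);
        return out;
    }

    private static void collect(Segment segment, List<String> out) {
        if (segment.components.isEmpty()) {
            out.add(segment.value); // a segment without children is treated as a word
            return;
        }
        for (Segment child : segment.components) {
            collect(child, out);
        }
    }
}
```

A block holding the lines "Hello world" and "again" would then be built as a Segment with two line children, each holding its word children, and words(block) would walk down to the individual words.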

Real-time example

[Image: Google Vision API real-time example]

With this API, we can now:

  • Organize photos that contain text
  • Automate tedious data entry for credit cards, receipts, and business cards
  • Translate documents (along with the Cloud Translate API)
  • Keep track of real objects, such as reading the numbers on subway trains
  • Provide accessibility features

That means the API can now automate data entry from visiting cards, extract the text from an image into a text file, and more. The identified text may come back in any sequence. Recognition of printed matter is at its best compared with other available frameworks and libraries, but the API has difficulty recognizing handwritten or cursive-styled fonts.

Open ends

  • Face detector does not support face recognition. It cannot tell whether two faces belong to the same person.
  • Face detector supports only two classifications: eyes open and smiling.
  • Text detector does not guarantee reading order. It may return text in a different sequence from the one in the frame. But it also returns the position of the text, which can be used to arrange it in sequence.
  • Text detector is not 100% accurate (obviously, none on the market is). Also, accuracy drops drastically when handwritten and cursive typography is introduced. I am not complaining though!
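Since the text detector returns positions, putting the text back into reading order is mostly a sorting exercise. Here is a minimal sketch in plain Java, assuming each detected piece of text comes with the left and top pixel coordinates of its bounding box and that the text reads left to right, top to bottom. TextBox, ReadingOrderSketch, and the rowTolerance parameter are hypothetical names of mine. In the Android API the coordinates would presumably come from the bounding box of each text block, but that mapping is an assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for a detected text item and the top-left corner of its box. */
final class TextBox {
    final String text;
    final int left;
    final int top;

    TextBox(String text, int left, int top) {
        this.text = text;
        this.left = left;
        this.top = top;
    }
}

final class ReadingOrderSketch {
    private ReadingOrderSketch() {}

    /**
     * Returns a new list ordered top to bottom, then left to right.
     * Boxes whose top is within rowTolerance pixels of the first box
     * in a row are treated as belonging to that row.
     */
    static List<TextBox> inReadingOrder(List<TextBox> boxes, int rowTolerance) {
        List<TextBox> byTop = new ArrayList<>(boxes);
        byTop.sort(Comparator.comparingInt((TextBox b) -> b.top));

        List<TextBox> ordered = new ArrayList<>();
        List<TextBox> row = new ArrayList<>();
        for (TextBox box : byTop) {
            // Start a new row once this box sits clearly below the row's first box.
            if (!row.isEmpty() && box.top - row.get(0).top > rowTolerance) {
                flushRow(row, ordered);
            }
            row.add(box);
        }
        flushRow(row, ordered);
        return ordered;
    }

    /** Sorts the current row left to right, appends it to the output, and empties it. */
    private static void flushRow(List<TextBox> row, List<TextBox> out) {
        row.sort(Comparator.comparingInt((TextBox b) -> b.left));
        out.addAll(row);
        row.clear();
    }
}
```

Grouping into rows first, rather than sorting with a single "roughly the same height" comparator, keeps the ordering consistent even when boxes on the same line are a few pixels apart vertically. A sensible rowTolerance depends on the font size in the image, so it is left as a parameter here.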

So get started with machine learning/computer vision on mobile now, irrespective of your experience. Google keeps expanding the learning materials it provides to developers. In addition to the above API, Google recently released a new object detection API built on TensorFlow. You can read more about it online, and the code repo is available on GitHub.

Google, Facebook, and Apple have been investing resources in these mobile models. Last year, Facebook released the Caffe2Go framework, followed by Apple’s Core ML to ease the difficulty of executing machine learning models on iOS devices. Looking forward, we will witness a big boom in machine learning and AI.

Here are the code samples provided by Google on GitHub for learning the Mobile Vision API for each type of object.

Create your own app with samples available!

Image sources:
* Text recognition segments
* StackOverflow


Post Author: Shreyo

Shreyo is a developer involved in project management; he loves to learn, guide, and keep himself updated on UI/UX and WordPress. He is also an amateur photographer, mostly of macro shots.