Computer vision is a promising and expanding field, with no shortage of challenging problems to work on. Face detection is one of them: a computer identifies that an image contains a human face and pinpoints its precise location. This article will teach you how to detect faces using Python.
Detecting any object in an image requires understanding how the computer encodes images internally and how that object visibly differs from everything else. Once that is clear, scanning an image for those cues can be automated and optimized. Put together, these steps form a computer vision algorithm that is both fast and accurate. In this guide, you'll learn:
- What face detection is
- How computers understand features in images
- How to quickly analyze many different features to reach a decision
- How to use a minimal Python solution for detecting human faces in images
What Is Face Detection?
Face detection is a subfield of computer vision concerned with locating human faces in digital photographs. Although humans do this effortlessly, computers need explicit instructions, and images may contain many objects that aren't faces at all.
Face detection is distinct from other face-related computer vision technologies, such as facial recognition, analysis, and tracking. Facial recognition means identifying a face in a picture and determining that it belongs to person X rather than person Y; its most common biometric application is smartphone unlocking. Facial analysis examines a face to infer something about the person, such as their age, gender, or emotional state. Face tracking is a kind of video analysis that follows a face and its features (such as the eyes, nose, and lips) across frames. You've probably seen it behind the filters in popular smartphone apps like Snapchat.
The technical answers to these problems are as varied as the problems themselves. In this guide, we'll look at a time-honored approach to the first of them: face detection.
How Do Computers “See” Images?
A pixel is the smallest graphical element that can be represented digitally. It’s so little it hardly registers on the screen. The pixels in a picture are laid out in rows and columns.
Often, the number of rows and columns is used to describe the picture resolution. For instance, a 3840×2160 resolution indicates that there are 3840 horizontal and 2160 vertical pixels on an Ultra HD Screen.
However, a computer doesn't see pixels as colored dots; it can only process numerical data. To convert colors to numbers, the computer uses various color models.
The RGB color model is often used to represent pixel values in color pictures. The letters RGB represent the colors red, green, and blue. Every pixel incorporates all three hues. By adjusting the proportions of red, green, and blue, RGB accurately represents the whole range of colors visible to the human eye.
Each pixel is represented by three integers, one for each of the three primary colors (red, green, and blue) since computers can only interpret numeric data. Image Segmentation with OpenCV and Python Using Color Spaces is a good resource for learning more about color spaces.
Each pixel in a black-and-white (grayscale) picture is assigned a single numeric value that represents the intensity of the light it receives. The intensity scale typically runs from 0 (black) to 255 (white), and every value in between is a shade of gray.
A picture is just a matrix (or table) of integers if each grayscale pixel represents a number:
Pixel values and color examples for a 3×3 picture
Three such matrices are used to represent the red, green, and blue components of a color picture.
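To make this concrete, here's a minimal sketch using NumPy (the array library OpenCV builds on); the pixel values are made up for illustration:

```python
import numpy as np

# A 3x3 grayscale image is just a matrix of intensities (0 = black, 255 = white)
gray = np.array([
    [0, 128, 255],
    [64, 192, 32],
    [255, 255, 0],
], dtype=np.uint8)

# A color image adds a third axis: one 3x3 matrix per channel
color = np.zeros((3, 3, 3), dtype=np.uint8)
color[0, 0] = [255, 0, 0]  # set the three channel values of the top-left pixel

print(gray.shape)   # (3, 3): rows x columns
print(color.shape)  # (3, 3, 3): rows x columns x channels
print(gray[0, 1])   # 128: intensity at row 0, column 1
```

Note that the channel order depends on the library: OpenCV, used later in this guide, stores channels as BGR rather than RGB.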
What Are Features?
An image’s feature is an informative bit of data that aids in problem-solving. It might be as basic as a single pixel value or as complicated as a series of edges, corners, and forms. A complicated characteristic is the result of combining many simpler ones.
Several image processing techniques may provide data that might be called features. Several different features and feature extraction methods have been developed for use in computer vision and image processing.
Features may be utilized to address a variety of problems, and can be anything from an image’s intrinsic or derived properties.
Setting up a development environment with the required libraries installed is required to execute the code examples. Using conda is the easiest option.
You'll need Python 3 along with three libraries: scikit-learn, scikit-image, and OpenCV.
Execute the following commands in your terminal to set up a conda environment:
```shell
$ conda create --name face-detection python=3.7
$ source activate face-detection
(face-detection)$ conda install scikit-learn
(face-detection)$ conda install -c conda-forge scikit-image
(face-detection)$ conda install -c menpo opencv3
```
Check out the OpenCV Installation Guide or this post on OpenCV Tutorials, Resources, and Guides if you’re having trouble getting OpenCV set up or getting the examples to run.
You have all you need to put what you learn here into practice now.
Viola-Jones Object Detection Framework
This technique was suggested in 2001 by Paul Viola and Michael Jones, two computer vision researchers.
They created a generic object detection framework able to deliver competitive detection rates in real time. While it can be applied to other kinds of objects, face detection is where most of the interest lies.
The Viola-Jones method consists of 4 key phases; they are discussed in greater detail below.
- Selecting Haar-like features
- Creating an integral image
- Running AdaBoost training
- Creating classifier cascades
When presented with an image, the algorithm breaks it down into many smaller subregions, each of which is analyzed for the presence of facial traits. Since an image may contain several faces of varying sizes, it must examine a wide range of positions and scales. Viola and Jones used Haar-like features to perform this analysis.
The human face has universally recognizable properties. For instance, in photographs of people's faces, the eye region is typically darker than the bridge of the nose, and the cheeks are brighter than the eye region. These characteristics can be used to determine whether a human face is present in an image.
A quick way to find out which area is brighter or darker is to add up the pixel values in both areas and compare the sums: the total for the brighter area will be greater than the total for the darker one. This is exactly what Haar-like features make possible.
A Haar-like feature is represented by taking a rectangular region of an image and subdividing it into smaller parts. These are often visualized as adjacent black and white rectangles:
Basic Haar-like rectangle features
Four basic types of Haar-like features are shown below.
- Horizontal feature with two rectangles
- Vertical feature with two rectangles
- Vertical feature with three rectangles
- Diagonal feature with four rectangles
The first two features detect edges. The third detects lines, while the fourth is best at finding diagonal features. But how do they work?
The value of a feature is a single number: the sum of the pixel values in the black area minus the sum of the pixel values in the white area. For uniform regions like a wall, this value is close to 0 and doesn't tell you anything meaningful.
To be useful, a Haar-like feature needs to produce a large value, meaning the areas under the black and white rectangles are substantially different. Certain features are known to reliably indicate human faces:
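As an illustration of this sum-and-compare idea, here's a small sketch of a two-rectangle horizontal feature in NumPy; the sign convention (left half minus right half) and the sample values are assumptions made for the example:

```python
import numpy as np

def haar_two_rect_horizontal(patch):
    """Two-rectangle Haar-like feature: sum of the pixel values in the
    left half of the patch minus the sum in the right half."""
    h, w = patch.shape
    left = int(patch[:, : w // 2].sum())
    right = int(patch[:, w // 2 :].sum())
    return left - right

# On a uniform region (like a wall), the value is close to zero
wall = np.full((4, 4), 100)
print(haar_two_rect_horizontal(wall))  # 0

# A dark-to-bright transition (an edge) gives a large magnitude
edge = np.hstack([np.full((4, 2), 30), np.full((4, 2), 220)])
print(haar_two_rect_horizontal(edge))  # 8*30 - 8*220 = -1520
```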
A Haar-like feature applied to the eye region. (Photo: Wikimedia Commons)
The area around the eyes is much darker than the rest of the face, so this feature produces a high response (a large number) in that part of the image:
A Haar-like feature applied to the bridge of the nose. (Photo: Wikimedia Commons)
When applied on the bridge of the nose, this particular instance produces a powerful reaction. The presence of a human face in a picture area may be determined by combining several of these elements.
As noted earlier, the Viola-Jones approach computes a large number of these features over a wide variety of image subregions. This quickly becomes computationally expensive: it takes a great deal of time given a computer's finite resources.
To solve this problem, Viola and Jones introduced the integral image.
Both the data structure and the algorithm that creates it are called the integral image, also known as a summed-area table. It's a handy tool for quickly and accurately summing the pixel values in any rectangular region of an image.
Each point’s value in an integral picture is equal to the sum of the values of the pixels above and to the left of the target pixel.
Computing a value in the integral image
It just takes one pass over the original picture to compute the integral image. Regardless of the size of the rectangle, this simplifies the process of adding up the pixel intensities inside the rectangle to only three operations using four numbers:
Calculating the sum of pixel values inside the orange rectangle
The sum of the pixel values inside the rectangle ABCD is D - B - C + A, where A, B, C, and D are the integral image values at the rectangle's four corners. The following shows this formula step by step:
Step-by-step pixel sum calculation
Because the area specified by A has been deducted twice (once from B and once from C), we need to put it back in.
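The whole trick can be sketched in a few lines of NumPy; the 4×4 values are arbitrary, and the padded row and column of zeros simply make the corner lookups work for rectangles touching the image border:

```python
import numpy as np

img = np.arange(1, 17).reshape(4, 4)  # a 4x4 "image" with values 1..16

# Each integral-image entry is the sum of all pixels above and to the left
integral = img.cumsum(axis=0).cumsum(axis=1)

# Pad with a zero row and column so border rectangles work too
ii = np.pad(integral, ((1, 0), (1, 0)))

def rect_sum(ii, top, left, height, width):
    """Sum of pixel values in a rectangle using four lookups: D - B - C + A."""
    A = ii[top, left]
    B = ii[top, left + width]
    C = ii[top + height, left]
    D = ii[top + height, left + width]
    return D - B - C + A

print(rect_sum(ii, 1, 1, 2, 2))  # 34
print(img[1:3, 1:3].sum())       # 34, the same sum computed directly
```

No matter how large the rectangle, the cost is always four lookups and three additions or subtractions.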
You can now quickly compute the difference between the pixel sums of two rectangles, regardless of their size. This is exactly what Haar-like features need!
So how do you decide which features to use, and at what sizes, when looking for faces in images? This is solved by a machine learning technique called boosting. Specifically, you'll learn about AdaBoost, short for "Adaptive Boosting."
Boosting is based on the hypothesis that “weak learners” may be combined to become a “strong learner.”
A weak classifier (or weak learner) is one that performs only marginally better than random guessing.
In this context, that means a weak learner is only slightly more accurate than a coin flip at determining whether a given image subregion contains a face. A strong learner, by contrast, is dramatically better at distinguishing faces from non-faces.
Boosting works by combining many (often thousands of) weak classifiers into a single strong one. In the Viola-Jones algorithm, each Haar-like feature represents a weak learner. AdaBoost evaluates the performance of every classifier you supply to decide the type and size of each feature that goes into the final classifier.
Calculating a classifier's performance requires testing it on every subregion of every training image. The classifier responds more strongly to some subregions than others; those it marks as positive, meaning it believes they contain a human face.
Subregions that don't elicit a strong response, according to the classifier, do not contain a face. These are labeled as negative.
Classifiers that perform well are given more weight. The end result is a boosted classifier, or strong classifier, that incorporates the best-performing weak classifiers.
The "adaptive" in the algorithm's name comes from the fact that, as training progresses, it prioritizes the images that were previously misclassified. Weak classifiers that do well on these harder samples are weighted more heavily.
Here’s a case in point:
Several types of samples are represented by the blue and orange circles.
Consider the following picture, in which you must use a series of weak classifiers to sort blue and orange circles:
Several of the blue circles are accurately labeled by the first weak classifier.
The first classifier you try only manages to capture some of the blue circles. In the next iteration, you give more weight to the samples it missed:
More emphasis is placed on the blue samples that were omitted.
If the second classifier correctly classifies these hard samples, it receives a greater weight. Remember: the better a weak classifier performs on the difficult samples, the more influence it has in the final strong classifier.
The larger blue circles are captured by a second classifier.
Now every blue circle is captured, but some orange ones have been mislabeled along the way. In the next iteration, these mislabeled orange circles are given more weight:
The mislabeled orange circles are emphasized while others are diminished.
The final classifier successfully identifies the orange circles:
The remaining orange circles are collected by the third classifier.
Combining these three classifiers into a single robust classifier ensures that every sample is properly categorized.
The three weak classifiers are combined into one final, robust classifier.
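To see boosting in action, here's a minimal sketch using scikit-learn (installed during setup). The two-moons dataset and all parameters are illustrative choices, not part of the Viola-Jones method; AdaBoostClassifier's default weak learner is a depth-1 decision tree, a "stump" playing the role of the weak classifiers above:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import AdaBoostClassifier

# Two interleaved classes, standing in for the blue and orange circles
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)

# Boost 50 weak learners (depth-1 trees by default) into one strong classifier
strong = AdaBoostClassifier(n_estimators=50, random_state=42)
strong.fit(X, y)

print(f"Training accuracy: {strong.score(X, y):.2f}")
```

A single depth-1 tree can only draw one straight boundary, but the weighted combination of fifty of them separates the curved classes well.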
Viola and Jones have used a variant of this method to assess tens of thousands of face-detecting classifiers. As it would be impractical to run all of these classifiers on every area in every picture, a classifier cascade was developed.
A cascade is a series of waterfalls, one coming after another. The same concept is used in computing to break complex problems into simple stages. Here, the goal is to spend less computational effort on each image subregion.
To address this issue, Viola and Jones transformed their powerful classifier (comprised of thousands of weak classifiers) into a cascade in which each weak classifier acts as a separate stage. The cascade’s purpose is to efficiently eliminate non-faces from further processing.
When a new image subregion enters the cascade, the first stage evaluates it. If the evaluation is positive, meaning the stage thinks this might be a face, it outputs a maybe.
A subregion that gets a maybe moves on to the next stage. If that stage also approves, we get another maybe, and the subregion moves on again:
A cascade classifier
This process repeats until the subregion has passed through every stage of the cascade. If all classifiers approve, it is finally classified as a human face and presented as a detection.
If the first stage doesn't think the subregion contains a face, it is discarded immediately. If it passes the first stage but fails the second, it's thrown out there. In essence, a subregion can be eliminated at any stage of the classifier:
A cascade of _n_ classifiers
To save time and computing power, this is set up such that non-faces are swiftly deleted. As each classifier stands in for a different aspect of a human face, a positive detection amounts to saying, “Yes, this subregion includes all the characteristics of a human face,” while a negative detection indicates that at least one of those features is lacking.
The best way to do this is to position high-performing classifiers at the beginning of the cascade. The eyes and nose bridge classifiers excel in the Viola-Jones method despite being quite simple.
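The early-rejection idea can be sketched in plain Python; the two stage functions below are hypothetical stand-ins for real Haar-feature classifiers:

```python
# Hypothetical cascade stages: each one is a cheap yes/no test
def has_dark_eye_region(region):
    return region.get("eyes_dark", False)

def has_bright_nose_bridge(region):
    return region.get("nose_bright", False)

STAGES = [has_dark_eye_region, has_bright_nose_bridge]

def cascade_detect(region):
    for stage in STAGES:
        if not stage(region):  # any failed stage rejects the region immediately
            return False
    return True  # a "maybe" from every stage: classified as a face

print(cascade_detect({"eyes_dark": True, "nose_bright": True}))  # True
print(cascade_detect({"eyes_dark": False}))                      # False
```

Because most subregions fail the first cheap test, the expensive later stages run on only a tiny fraction of the image.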
Now that you know how the algorithm works, you can put it to use to find faces in images.
Using a Viola-Jones Classifier
Training a Viola-Jones classifier from scratch can take a long time. Luckily, OpenCV ships with a pre-trained Viola-Jones classifier, and that's the one you'll use here.
First, find and save an image you'd like to analyze for faces. For example:
Example stock image for face detection ( Image source )
Load the image with OpenCV:
```python
import cv2 as cv

# Read image from your local file system
original_image = cv.imread('path/to/your-image.jpg')

# Convert color image to grayscale for Viola-Jones
grayscale_image = cv.cvtColor(original_image, cv.COLOR_BGR2GRAY)
```
The Viola-Jones classifier should then be loaded. It may be found in the same directory as the OpenCV library if you installed it from source.
The exact location may vary between versions, but you should find several files in a folder named haarcascades. The one you need here is haarcascade_frontalface_alt.xml.
If the pre-trained classifier was not included in your installation of OpenCV, you may get it from the OpenCV GitHub repository.
```python
# Load the classifier and create a cascade object for face detection
face_cascade = cv.CascadeClassifier('path/to/haarcascade_frontalface_alt.xml')
```
The detectMultiScale() method of the face_cascade object takes an image as input and runs the classifier cascade over it. The MultiScale part of the name means the algorithm examines subregions of the image at multiple scales, to detect faces of different sizes.
```python
detected_faces = face_cascade.detectMultiScale(grayscale_image)
```
The detections are now stored in the detected_faces variable. To see the results, iterate over the detections and draw rectangles around the identified faces.
OpenCV's rectangle() draws rectangles on images, and it needs the pixel coordinates of the top-left and bottom-right corners. The coordinates indicate the pixel's row and column in the image.
Fortunately, detections are stored as pixel coordinates. Each detection is defined by the coordinates of the top-left corner of the rectangle that contains the detected face, plus the rectangle's width and height.
Adding the width to the column and the height to the row gives you the bottom-right corner.
```python
for (column, row, width, height) in detected_faces:
    cv.rectangle(
        original_image,
        (column, row),
        (column + width, row + height),
        (0, 255, 0),
        2
    )
```
The parameters that rectangle() takes are:
- The original image
- The coordinates of the top-left point of the detection
- The coordinates of the bottom-right point of the detection
- The color of the rectangle (a tuple that defines the amounts of blue, green, and red)
- The thickness of the rectangle lines
Finally, display the image:
```python
cv.imshow('Image', original_image)
cv.waitKey(0)
cv.destroyAllWindows()
```
imshow() displays the image. waitKey() waits for a keypress; without it, imshow() would show the image only for an instant before the program moves on. Passing 0 makes it wait indefinitely. Finally, destroyAllWindows() closes the window when you press a key.
Here’s what happened:
The original image with the detected faces marked
That's it! You now have a complete, working face detector in Python.
The scikit-image website includes a tutorial with source code if you want to train the classifier on your own.
The Viola-Jones algorithm was a remarkable achievement. It's about 20 years old, yet it still works well for many applications. Other algorithms exist as well, each using a different set of features.
One example is support vector machines (SVMs) using histogram of oriented gradients (HOG) features. The Python Data Science Handbook is a good resource for this approach.
In a subsequent piece, we’ll go further into deep learning, the technology at the heart of most modern, state-of-the-art approaches to face detection and identification.
Check out the most current studies in Computer Vision and Pattern Recognition on arXiv to stay abreast of the field.
Check out Practical Text Classification With Python and Keras if you're interested in machine learning but want to shift gears away from computer vision.