Viola-Jones

August 21, 2023

  • Rapid Object Detection using a Boosted Cascade of Simple Features

    • Steerable filters, and their relatives, are excellent for the detailed analysis of boundaries, image compression, and texture analysis. In contrast rectangle features, while sensitive to the presence of edges, bars, and other simple image structure, are quite coarse. Unlike steerable filters the only orientations available are vertical, horizontal, and diagonal. The set of rectangle features do however provide a rich image representation which supports effective learning. In conjunction with the integral image, the efficiency of the rectangle feature set provides ample compensation for their limited flexibility.

    • a very small number of these features can be combined to form an effective classifier. The main challenge is to find these features

    • wasn't able to understand the adaboost algorithm for classifier learning, but checked this out and it made more sense: [What is Adaboost]({{video https://www.youtube.com/watch?v=NLRO1-jp5F8&ab_channel=KrishNaik}})

  • This algorithm is painfully slow to train but can detect faces in real-time with impressive speed.

  • A simple way to find out which region is lighter or darker is to sum up the pixel values of both regions and compare them. The sum of pixel values in the darker region will be smaller than the sum in the lighter region. If one side is lighter than the other, it may be the edge of an eyebrow, or sometimes the middle portion may be shinier than the surrounding boxes, which can be interpreted as a nose. This can be accomplished using Haar-like features, and with their help we can interpret the different parts of a face.

    There are 3 types of Haar-like features that Viola and Jones identified in their research:

    • Edge features

    • Line features

    • Four-sided features

    • Edge features and Line features are useful for detecting edges and lines respectively. The four-sided features are used for finding diagonal features.

      The value of the feature is calculated as a single number: the sum of pixel values in the black area minus the sum of pixel values in the white area. The value is zero for a plain surface in which all the pixels have the same value, and thus provides no useful information.

      Since faces are complex shapes with darker and brighter regions, a Haar-like feature produces a large value when the areas under the black and white rectangles differ strongly, and that value gives us a useful piece of information about the image.

      To be useful, then, a Haar-like feature needs to produce a large value, meaning that the areas under the black and white rectangles are very different. There are known features that perform very well for detecting human faces:

      For example, when we apply this specific Haar-like feature to the bridge of the nose, we get a good response. Similarly, we combine many of these features to decide whether an image region contains a human face.
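
      As a concrete illustration, here is a minimal Python sketch of a two-rectangle edge feature computed by direct pixel sums; the function name and the patch values are made up for the example, not taken from the paper:

        import numpy as np

        def edge_feature_value(patch):
            # two-rectangle (edge) Haar-like feature: sum of the top half
            # (treated as the "black" rectangle) minus the sum of the bottom half
            # (treated as the "white" rectangle)
            h = patch.shape[0] // 2
            top = int(patch[:h, :].sum())
            bottom = int(patch[h:, :].sum())
            return top - bottom

        # a flat patch gives 0; a patch that is dark on top and bright below
        # gives a large (negative) response, i.e. a strong edge
        flat = np.full((6, 6), 128, dtype=np.uint8)
        edge = np.vstack([np.full((3, 6), 30, dtype=np.uint8),
                          np.full((3, 6), 220, dtype=np.uint8)])
        print(edge_feature_value(flat))  # 0
        print(edge_feature_value(edge))  # -3420 -> strong edge response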

  • Main stages of the algorithm:

    • Integral Image: Start by computing the integral image of the input image.

      • Start with the Original Image: You have an image with pixel values that represent brightness.

      • Create a New Table: You make a new table of the same size as the image, and each cell in this table will hold the sum of all the pixel values from the top-left corner of the image to the corresponding pixel in the original image.

      • Fill in the Table: You start filling in the table from the top-left corner. The value in each cell is calculated by adding up the pixel value from the original image in that cell and the values in the cells to the left and above it in the new table. This way, each cell holds the cumulative sum of pixel values.

      • Use the Magic: Now, when you want to find the sum of pixel values in any rectangular area of the original image, you can use the magic table. You just look at the values in four cells of the table (the top-left, top-right, bottom-left, and bottom-right corners of the rectangular area), and with a bit of subtraction and addition, you can quickly calculate the total brightness of that area.

      • for more info refer to: Of Card Tricks and Integral Images

      • import numpy as np
        import cv2
        
        image = cv2.imread('image.jpg', cv2.IMREAD_GRAYSCALE)
        
        # empty array for the integral image (int64 so large sums don't overflow)
        integral_image = np.zeros_like(image, dtype=np.int64)
        
        # compute the integral image
        # iterate over each pixel in the original image
        # y -> row; x -> column
        for y in range(image.shape[0]):
            for x in range(image.shape[1]):
                # initialize with the pixel value of the original image
                integral_image[y, x] = image[y, x]
                if y > 0:
                    # add the cumulative sum from the region above
                    integral_image[y, x] += integral_image[y - 1, x]
                if x > 0:
                    # add the cumulative sum from the region to the left
                    integral_image[y, x] += integral_image[y, x - 1]
                if x > 0 and y > 0:
                    # correct for double-counting the overlapping area added in the previous two steps
                    integral_image[y, x] -= integral_image[y - 1, x - 1]
        
        def calculate_sum(integral_img, top_left, bottom_right):
            x1, y1 = top_left
            x2, y2 = bottom_right
        
            A = integral_img[y1 - 1, x1 - 1] if x1 > 0 and y1 > 0 else 0
            B = integral_img[y2, x1 - 1] if x1 > 0 else 0
            C = integral_img[y1 - 1, x2] if y1 > 0 else 0
            D = integral_img[y2, x2]
        
            return D - B - C + A
        
        top_left = (10, 20)      # example top-left (x, y) of a rectangular region
        bottom_right = (40, 60)  # example bottom-right (x, y) of the same region
        sum_pixels = calculate_sum(integral_image, top_left, bottom_right)
        print("Sum of pixel values in the region:", sum_pixels)
        
      • When you run through these steps for each pixel in the image, you end up with an integral_image array where each cell holds the cumulative sum of pixel values from the top-left corner of the image to that specific pixel. This integral image is then used for efficient computation of sums of pixel values in rectangular regions.

    • Haar-like Feature Evaluation: Slide different sizes and types of Haar-like features over the integral image to calculate their responses. These responses indicate the contrast variations in specific regions.
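
      • As a rough sketch (not the paper's exact feature set), a two-rectangle edge feature can be evaluated at one window position by reusing the calculate_sum helper and the integral_image array from the code above; the coordinates and window size below are made-up examples:

        def edge_feature_response(integral_img, x, y, w, h):
            # "black" rectangle = top half of the window, "white" rectangle = bottom half
            top = calculate_sum(integral_img, (x, y), (x + w - 1, y + h // 2 - 1))
            bottom = calculate_sum(integral_img, (x, y + h // 2), (x + w - 1, y + h - 1))
            return top - bottom

        # example: a 24x24 window whose top-left corner is at (16, 16)
        response = edge_feature_response(integral_image, 16, 16, 24, 24)
        print("Edge feature response:", response)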

    • Adaboost Training:

      • Train a series of weak classifiers (each based on a single Haar-like feature) using AdaBoost. Each weak classifier is trained to minimize the classification error on the training data, while later rounds focus more on examples misclassified in previous iterations.

      • Adaboost (Adaptive Boosting) is used in the Viola-Jones algorithm to create a strong classifier from a set of weak classifiers. It's employed to improve the accuracy and efficiency of object detection, specifically for tasks like face detection.

      • The cascade structure in Viola-Jones divides the classifier into stages, each consisting of a set of weak classifiers. Adaboost aids in the optimization of this cascade by determining which weak classifiers should be included in each stage, thus enabling the algorithm to quickly reject non-object regions.

      • in a nutshell (a rough sketch of this loop follows the list):

        • You have N number of features

        • Weights are assigned to each training example (initially uniform within each class)

        • One weak classifier is trained for each feature using AdaBoost

        • Weights are adjusted after each round so that misclassified examples receive higher weight than correctly classified ones

        • When training is complete: sort models based on the least error rate to the highest error rate (best models first)

        • Select the best weak classifiers based on a threshold value (drop the “useless” ones)
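
      • A minimal, hedged sketch of that boosting loop, assuming the Haar feature responses have already been computed into an (n_samples, n_features) numpy array and the labels are +1 (face) / -1 (non-face); the names and the simple stump search are illustrative, not the paper's exact formulation:

        import numpy as np

        def adaboost_train(feature_values, labels, n_rounds=10):
            n_samples, n_features = feature_values.shape
            weights = np.full(n_samples, 1.0 / n_samples)  # example weights, initially uniform
            strong = []  # list of (feature index, threshold, polarity, alpha)

            for _ in range(n_rounds):
                best = None
                # pick the single-feature threshold ("stump") with the lowest weighted error
                for f in range(n_features):
                    for thresh in np.unique(feature_values[:, f]):
                        for polarity in (1, -1):
                            preds = np.where(polarity * feature_values[:, f] < polarity * thresh, 1, -1)
                            err = weights[preds != labels].sum()
                            if best is None or err < best[0]:
                                best = (err, f, thresh, polarity, preds)

                err, f, thresh, polarity, preds = best
                err = np.clip(err, 1e-10, 1 - 1e-10)
                alpha = 0.5 * np.log((1 - err) / err)       # weight of this weak classifier
                weights *= np.exp(-alpha * labels * preds)  # misclassified examples get heavier
                weights /= weights.sum()
                strong.append((f, thresh, polarity, alpha))
            return strong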

    • Cascade Structure:

      • Organize the weak classifiers into a cascade structure. The cascade consists of multiple stages, each containing a set of weak classifiers.

      • Detecting objects like faces requires analyzing a large number of image subregions. However, most of these subregions don't contain the object of interest. If we apply all the available weak classifiers to each subregion, we might waste a lot of computation time on regions that are clearly not the object we're looking for.

        • The cascade structure addresses this problem by arranging the weak classifiers in a series of stages, where each stage is a sequence of weak classifiers. The cascade is designed in a way that the first few stages can quickly eliminate a large portion of non-object regions, allowing subsequent stages to focus only on regions that are more likely to contain the object.

        • The early stages quickly reject the majority of non-object regions. This significantly reduces the computational load for subsequent stages.

      • The best feature selected by AdaBoost rejects many negative windows while detecting almost all positive windows. Hence the classifier corresponding to the best feature is evaluated first on a given window. A positive response triggers the evaluation of a second (more complex) classifier, and so on. A negative response at any level leads to the rejection of the window.

      • This strategy rejects as many negative windows as possible at the earliest stage. Only positive instances trigger all classifiers in the cascade.

      • The cascade structure allows the Viola-Jones algorithm to strike a balance between computational efficiency and accuracy. It is particularly effective for real-time applications like face detection, where rapid processing is essential; a small sketch of this early-rejection loop follows.
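
      • A small sketch of cascade evaluation for a single candidate window; the stage representation (a list of weighted weak classifiers plus a stage threshold) is an assumption for illustration, not OpenCV's or the paper's exact data structure:

        def passes_cascade(window, stages):
            # stages: list of (weak_classifiers, stage_threshold), where
            # weak_classifiers is a list of (classifier_fn, alpha) pairs and
            # classifier_fn(window) returns 0 or 1
            for weak_classifiers, stage_threshold in stages:
                score = sum(alpha * clf(window) for clf, alpha in weak_classifiers)
                if score < stage_threshold:
                    return False  # rejected early; later stages never run
            return True  # survived every stage; treat as a detection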

    • Detection: Run the cascade of classifiers on the input image. At each stage, if a region is classified as non-object, it's quickly rejected. If a region passes all stages, it's considered a potential object and marked as a detection.

  • How It All Works Together:

    • Sliding Window Approach: The algorithm slides a rectangular window over the image at different scales and positions (a sketch of this loop follows the list).

    • Feature Evaluation: At each position and scale, the integral image is used to efficiently evaluate the responses of Haar-like features. These features indicate patterns that could be characteristic of a face.

    • Cascade Stages: The cascade structure begins with simple stages that quickly reject regions. If a region is rejected by any stage, it's discarded as a non-object region. If a region passes all stages, it's considered a potential face region.

    • Accurate Detection: As the algorithm progresses through stages, it becomes increasingly selective. Only regions that have a higher chance of containing a face pass through. The combination of the cascade structure and Adaboost-trained classifiers improves accuracy while maintaining real-time performance.

    • Multiple Scales: The algorithm analyzes the image at various scales to detect faces of different sizes. The integral image's efficiency and the cascade's rejection mechanism make this multi-scale analysis feasible.
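
    • A rough sketch of the multi-scale sliding-window loop over a grayscale image (2D numpy array); the base window size, scale factor, step, and the passes_cascade helper from the cascade sketch above are all illustrative assumptions:

      def detect(image, stages, base_size=24, scale_factor=1.25, step=4):
          detections = []
          size = base_size
          while size <= min(image.shape[0], image.shape[1]):
              # slide a size x size window over the whole image
              for y in range(0, image.shape[0] - size + 1, step):
                  for x in range(0, image.shape[1] - size + 1, step):
                      window = image[y:y + size, x:x + size]
                      if passes_cascade(window, stages):
                          detections.append((x, y, size))
              size = int(size * scale_factor)  # grow the window to catch bigger faces
          return detections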

  • summary of the lifecycle:

    • Preprocessing:

      1. Create handmade simple Haar features

    • Training:

      1. Convert the image to an integral image
      2. Compute delta values for each feature over an image region
      3. Train an AdaBoost model for each feature
      4. Sort classifiers from strongest to weakest
      5. Drop the “useless” classifiers
      6. Add the useful classifiers to the attentional cascade

    • Inference:

      1. Load the cascade
      2. Pass the image through each classifier in the cascade
      3. Get the result
  • I was thinking: how do the Haar-like features know that a face is a face? Why won't they just pick up any edge, etc.?

    • Feature Diversity: The algorithm employs a wide variety of Haar-like features, not just simple edge patterns. These features include patterns that resemble the eyes, nose, mouth, and other components of a human face. By using a diverse set of features, the algorithm increases the likelihood of detecting true face patterns while reducing the likelihood of false positives caused by non-face patterns.

    • Adaboost Training: During the Adaboost training process, the algorithm identifies the features that are most informative for discriminating between faces and non-faces. Features that consistently contribute to correct classifications are given higher weights. Adaboost ensures that the collective decision-making of multiple weak classifiers becomes increasingly selective and accurate, focusing on the unique patterns present in human faces.

    • Cascade Structure: The cascade structure further enhances the algorithm's selectivity. The early stages consist of simple weak classifiers that are adept at quickly rejecting non-object regions. These stages filter out many false positives, including patterns that may resemble edges on objects like motorcycles. Subsequent stages refine the detection process by using more complex classifiers that are better tuned to human face features.

    • Training Data: The algorithm is trained using a large dataset of positive (faces) and negative (non-faces) examples. This extensive training process helps the weak classifiers and the cascade structure learn to distinguish between the specific patterns that are characteristic of human faces and other patterns that may resemble edges or features on different objects.

    • here are some resources to dig deeper:

      • {{video(https://www.youtube.com/watch?v=88HdqNDQsEk&ab_channel=sentdex)}}

      • Get Started with Cascade Object Detector --> how to train a cascade classifier

        • Haar and LBP features are often used to detect faces because they work well for representing fine-scale textures. The HOG features are often used to detect objects such as people and cars.

        • Generally, it is better to have a greater number of simple stages because at each stage the overall false positive rate decreases exponentially. For example, if the false positive rate at each stage is 50%, then the overall false positive rate of a cascade classifier with two stages is 25%. With three stages, it becomes 12.5%, and so on. However, the greater the number of stages, the more training data the classifier requires. Also, increasing the number of stages increases the false negative rate, which results in a greater chance of rejecting a positive sample by mistake. Set the false positive rate (FalseAlarmRate) and the number of stages (NumCascadeStages) to yield an acceptable overall false positive rate. Then you can tune these two parameters experimentally.
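
        • A tiny sanity check of that exponential drop, assuming each stage independently keeps the same 50% per-stage false positive rate used in the example above:

          per_stage_fpr = 0.5
          for n_stages in (1, 2, 3, 10):
              print(n_stages, per_stage_fpr ** n_stages)
          # 1 -> 0.5, 2 -> 0.25, 3 -> 0.125, 10 -> ~0.001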