Invented by Sarah Tariq, James William Vaisey Philbin, Kratarth Goel, Zoox Inc
The Zoox Inc invention works as follows:
Techniques for using instance segmentation with a machine learning (ML) model are discussed herein. An ML model can take an image as input and output a feature map containing a number of features. Each feature can include a confidence score, classification information, and a region of interest (ROI) that is determined using a non-maximal suppression technique. Segmentation can be performed by grouping similar ROIs together. This means that instead of needing a separate ML model or a second operation for segmentation (e.g., identifying which pixels correspond to the detected object by outputting a set of lines or curves), the techniques described herein can simultaneously detect an object (e.g., determine an ROI) and segment it.
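The grouping idea above can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes each feature location predicts a box-shaped ROI, uses IoU as the similarity measure, and uses a simple greedy form of non-maximal suppression. The names `segment_instances`, `feature_rois`, and the 0.5 threshold are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Location = Tuple[int, int]               # (row, col) of a feature in the feature map


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def segment_instances(feature_rois: Dict[Location, Box],
                      scores: Dict[Location, float],
                      iou_threshold: float = 0.5) -> Dict[Location, List[Location]]:
    """Map each ROI kept by NMS to the feature locations whose ROIs resemble it."""
    locations = list(feature_rois)
    # Greedy NMS: visit ROIs from most to least confident, keep non-overlapping ones.
    kept: List[Location] = []
    for loc in sorted(locations, key=lambda l: scores[l], reverse=True):
        if all(iou(feature_rois[loc], feature_rois[k]) <= iou_threshold for k in kept):
            kept.append(loc)
    # Grouping: a location joins the kept ROI that its own ROI overlaps most,
    # provided the overlap exceeds the threshold (assumed similarity rule).
    masks: Dict[Location, List[Location]] = {k: [] for k in kept}
    for loc in locations:
        best = max(kept, key=lambda k: iou(feature_rois[loc], feature_rois[k]))
        if iou(feature_rois[loc], feature_rois[best]) > iou_threshold:
            masks[best].append(loc)
    return masks
```

Each value in the returned dictionary is the set of feature locations that "voted" for the same object, which serves as a coarse instance mask without running a second segmentation model.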
Background for Instance Segmentation Inferred From Machine Learning Model Output
Computer vision is critical for some applications, such as autonomous vehicle operation. To equip a computer with functionality that imitates human vision, software components can be built that take an image and identify its salient parts. The computer can then use those salient parts to perform further operations. Machine-learned (ML) models are one form of software that can equip a computer with this functionality.
Previous attempts to train ML models to identify salient parts of an image have resulted in flawed ML models. Some forms of ML training, for example, result in an ML model that is unable to distinguish between objects that are close together (e.g., a pedestrian who passes in front of another pedestrian in the camera’s view), resulting in inaccurate or extraneous identifications.
Moreover, while some ML models identify objects more accurately than these flawed models, they require too much computation for real-time applications and/or require expensive, specialized equipment that may not suit a specific use. For example, an autonomous vehicle may need its computer vision ML model to process a video feed and support a decision every 50 milliseconds, whereas some of these ML models need a computation time greater than 100 milliseconds. With such computation times, an object may have moved by the time it is detected, so the detection can no longer be relied on for decision-making.
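To make the latency concern concrete, the short Python sketch below computes how far an object moves while one inference is still running. The 15 m/s speed is an assumed, illustrative value; only the 50 ms and 100 ms figures come from the text.

```python
def displacement_m(speed_m_per_s: float, latency_ms: float) -> float:
    """Distance (in meters) an object travels while a single inference runs."""
    return speed_m_per_s * latency_ms / 1000.0


# Assumed relative speed of 15 m/s (54 km/h), chosen only for illustration.
print(displacement_m(15.0, 100.0))  # 1.5
print(displacement_m(15.0, 50.0))   # 0.75
```

Under this assumption, a 100 ms model reports a position that is already 1.5 m out of date, twice the error of a model that meets the 50 ms budget.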
The techniques described herein improve computer vision by increasing the accuracy of object detection and decreasing the compute time needed to obtain accurate object identifications. This allows objects to be detected in real time for applications such as autonomous vehicle control. The techniques can also be used in other applications, such as video games, augmented reality, etc.
The techniques described herein include providing an image to an ML model and receiving from the model multiple regions of interest (ROIs) for different portions of the image. An ROI can take any form that identifies where the ML model believes an object exists in the image. For example, an ROI can include a box that indicates the pixels associated with the detected object, or a mask that contains the pixels corresponding to the detected object.
In some cases, the ML model may additionally or alternatively output a confidence score (or other confidence information) for each ROI. For example, the ML model can detect an object in a portion of the image and generate an ROI that indicates where the object is located. The ML model may also, or instead, generate a confidence score that indicates how confident the model is that it has correctly identified a salient object and/or how well the ROI fits the object. A confidence score can be a value between 0 and 1, where 0 indicates that the ML model is not at all confident that an object appears in the ROI and 1 indicates that it is strongly confident that an object appears in the ROI. Other permutations of the value are possible. In short, the ML model outputs an indication of where it believes an object is, together with a score indicating how confident it is that it has correctly identified the object and/or how well the ROI identifies where the object appears in the image.
Some of the techniques discussed herein are directed at training ML models to produce better ROIs and/or more accurate confidence scores (e.g., a score closer to 0 when an ROI does not contain an object and/or closer to 1 when an ROI does indicate a salient object), and at reducing the compute time needed to obtain ROIs of such accuracy.
The ML model can include, for example, a boosted ensemble or a random forest; a directed acyclic graph (DAG) (e.g., where the nodes are organized as a Bayesian network); and/or deep learning algorithms, such as artificial neural networks (ANN) (e.g., a recurrent neural network (RNN) or a residual neural network (ResNet)), a deep belief network (DBN), etc. The loss function used to train the ML model may be based, for example, on the degree of alignment between an ROI and the area of the image that a “ground truth” indicates as representing an object. In certain instances, determining this alignment can include determining an intersection over union (IoU), a metric of how well the ROI “fits” the area indicated by the ground truth. Other indicators of fit may also be used. The ground truth may be referred to as a reference area in some cases.
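As a worked illustration of the IoU metric, here is a minimal Python sketch for axis-aligned boxes. The box representation and the `1 - IoU` loss are assumptions made for this example (a common choice, not one the text specifies).

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def intersection_over_union(roi: Box, ground_truth: Box) -> float:
    """Area of overlap divided by area of union; 1.0 is a perfect fit, 0.0 is disjoint."""
    inter_w = max(0.0, min(roi[2], ground_truth[2]) - max(roi[0], ground_truth[0]))
    inter_h = max(0.0, min(roi[3], ground_truth[3]) - max(roi[1], ground_truth[1]))
    intersection = inter_w * inter_h
    roi_area = (roi[2] - roi[0]) * (roi[3] - roi[1])
    gt_area = (ground_truth[2] - ground_truth[0]) * (ground_truth[3] - ground_truth[1])
    union = roi_area + gt_area - intersection
    return intersection / union if union > 0 else 0.0


def alignment_loss(roi: Box, ground_truth: Box) -> float:
    """Assumed loss form: the worse the fit, the larger the penalty."""
    return 1.0 - intersection_over_union(roi, ground_truth)
```

For two 2x2 boxes offset by one pixel in each direction, the overlap is 1 and the union is 7, so the IoU is 1/7.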
It is beneficial to identify the ROIs that the ML model got very wrong. An ML model can be trained by providing it with tens or hundreds of thousands of images and correcting its weights; training on the examples the model got very wrong, rather than on every example, can greatly reduce the time required to train the ML model. It may also increase accuracy, since the ML model is corrected for “very wrong” ROIs and/or confidence scores, and those corrections are not washed out by reinforcing the learning of “correct” ROIs and/or confidence scores.
In some cases, the techniques described herein can include selecting specific examples with which to train the ML model. Hard example mining may be used to select these examples. This may involve sorting ROIs by confidence score (e.g., from greatest to least) or by error in confidence score (e.g., the confidence error associated with an ROI suppressed by non-maximal suppression (NMS)), and then selecting the top n ROIs. Selecting examples by hard example mining can exclude the ROI(s) with the maximum confidence score(s). The techniques can also include selecting n random ROIs. In some cases, the number n may be chosen based on the number of positive examples in the image.
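The selection step can be sketched in a few lines of Python. This is a simplified illustration that ranks by confidence error only; the `Example` fields and the use of absolute error are assumptions, not details taken from the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Example:
    roi_id: int
    confidence: float  # score output by the model, in [0, 1]
    target: float      # assumed label: 1.0 if the ROI should indicate an object, else 0.0

    @property
    def error(self) -> float:
        """Absolute confidence error; larger means the model was more wrong."""
        return abs(self.confidence - self.target)


def mine_hard_examples(examples: List[Example], n: int) -> List[Example]:
    """Sort by confidence error, largest first, and keep the top n."""
    return sorted(examples, key=lambda e: e.error, reverse=True)[:n]
```

Only the returned examples would then contribute to the loss, so training effort concentrates on the ROIs the model currently handles worst.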
However, in some training schemes, such as training on only a portion of an area (e.g., 30% of the area that represents an object, instead of the entire area, as described in greater detail herein), selecting the top n ROIs by confidence score can skew the ML model’s learning because, often, at least some of the top n ROIs correctly identify an object. Such examples should not be penalized, because they accurately predict the ROI. As explained in more detail below, the techniques include suppressing some of the top n ROIs and choosing new ROIs to replace them. A hard negative example that agrees with a particular region of interest may be reassigned as a positive example. In this way, networks such as the ones described herein can distinguish between examples that should be penalized and those that should not. In some cases, the chosen examples can be used for backpropagation (either as hard negative examples or as positive ones).
In some cases, these techniques can exclude certain portions of the image data from training. The exclusion is based, at least in part, on determining that (1) the degree of alignment between an ROI and the ground truth for the object indicated by the ROI meets or exceeds a threshold degree of alignment (i.e., the ROI fits the ground truth area well) and (2) the ROI was generated for a portion of the image that lies within the ground truth area. An ROI generated for a portion of the image outside the area indicated by the ground truth may, regardless of how well it matches the ground truth, be included in the subset used for training and penalized by the loss function. Similarly, an ROI generated for a portion of the image that lies within the ground truth area but that is a “bad” ROI (e.g., its degree of alignment with the ground truth is below the threshold) can also be included in the training subset and penalized by the loss function. In certain instances, only the top n examples may be included in the training subset, while excluding the examples discussed above. This technique, which uses the exclusion/inclusion rule discussed herein, is referred to herein as an “improved hard example mining” method.
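One way to read the exclusion/inclusion rule is sketched below in Python. The `Candidate` fields, the 0.5 threshold, and the ranking by confidence error are assumptions for illustration; the alignment value is taken as precomputed (e.g., an IoU).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    roi_id: int
    alignment: float           # fit between the ROI and the ground truth, e.g. an IoU
    inside_ground_truth: bool  # was the ROI generated for a location inside that area?
    error: float               # confidence error used to rank hard examples


def is_excluded(c: Candidate, alignment_threshold: float) -> bool:
    """Exclude only ROIs that both fit well and come from inside the ground truth area."""
    return c.inside_ground_truth and c.alignment >= alignment_threshold


def improved_hard_example_mining(candidates: List[Candidate], n: int,
                                 alignment_threshold: float = 0.5) -> List[Candidate]:
    """Drop excluded ROIs first, then keep the n remaining ROIs with the largest error."""
    eligible = [c for c in candidates if not is_excluded(c, alignment_threshold)]
    return sorted(eligible, key=lambda c: c.error, reverse=True)[:n]
```

Filtering before ranking means a well-fitting ROI from inside the object is never penalized, even if its confidence error would otherwise place it among the top n.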
The training subset (determined according to the exclusion/inclusion rule) can be provided to a loss function. The loss function can include, for instance, a Huber loss for the confidence scores (for example, when the confidence scores are included with their associated ROIs in the training/loss determination), a mean squared error, a focal loss, etc.
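For reference, the standard Huber loss can be written as a short Python function. The `delta` default of 1.0 and the averaging over the subset are assumptions; the text does not specify either.

```python
from typing import Sequence


def huber_loss(prediction: float, target: float, delta: float = 1.0) -> float:
    """Quadratic for small residuals, linear for large ones (less sensitive to outliers)."""
    residual = abs(prediction - target)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - 0.5 * delta)


def mean_huber_loss(predictions: Sequence[float], targets: Sequence[float],
                    delta: float = 1.0) -> float:
    """Average Huber loss over the confidence scores in a training subset."""
    losses = [huber_loss(p, t, delta) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)
```

With confidence scores confined to the range 0 to 1, a smaller `delta` would be needed for the linear regime to matter; that tuning is left open here.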
In some cases, the techniques can include training the ML model in multiple stages. A first stage may include providing a first batch of images (scaled or not) to the ML model and training the model with hard examples using the procedure above. The first batch can include thousands or even millions of images in some cases.
In some cases, a second stage of training can follow the first. The second stage may include training the ML model on subsets that contain hard examples. In some cases, the second stage may also include training the ML model with a focal loss function. A focal loss function modifies a cross-entropy loss function (or other loss function) so as to reduce the weight of the errors calculated for ROIs that are already well classified.
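The down-weighting effect can be seen in a minimal Python sketch of a binary focal loss. The `gamma` value of 2.0 and the omission of a class-balancing term are assumptions made to keep the example short.

```python
import math


def _prob_of_true_class(p: float, label: int, eps: float = 1e-12) -> float:
    """Probability the model assigned to the correct class, clamped away from 0."""
    p_t = p if label == 1 else 1.0 - p
    return min(max(p_t, eps), 1.0)


def cross_entropy(p: float, label: int) -> float:
    """Plain binary cross-entropy for one prediction."""
    return -math.log(_prob_of_true_class(p, label))


def focal_loss(p: float, label: int, gamma: float = 2.0) -> float:
    """Cross-entropy scaled by (1 - p_t)^gamma, which shrinks the loss of easy examples."""
    p_t = _prob_of_true_class(p, label)
    return ((1.0 - p_t) ** gamma) * -math.log(p_t)
```

With `gamma = 2`, a well-classified example (`p_t = 0.9`) keeps only about 1% of its cross-entropy loss, while a badly classified one (`p_t = 0.1`) keeps about 81%; setting `gamma = 0` recovers plain cross-entropy.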
In some cases, the receptive field (receptive area) of the ML model may cause the ML model to produce, for objects too large for the receptive field, ROIs that are associated with a low confidence score or a high error value. By analogy, if a person were to touch their nose to a painting, it would be difficult for them to recognize the painting or even to identify specific objects within it, because most of the painting would lie outside what the person can see.
The ML model may not be able to detect objects that extend beyond its receptive field (e.g., objects too large to fit into the receptive field for the ML model to “see” the whole object). The receptive field is a good way to understand what the ML model “sees”.
Some techniques to remedy this problem include providing an image as input to an ML model, using the model to detect objects within a certain range of sizes, then down-scaling the image and re-running it through the model so that objects previously outside the range of sizes now fall within it. This process can be repeated iteratively. Scaling down the image makes large objects smaller, allowing them to fall within the receptive field of the ML model. It may be possible to train an ML model with a receptive field comparable in size to the input image; however, such a model may not be responsive or fast enough to be used in autonomous driving.
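The iterative down-scaling loop can be sketched as follows in Python. The detector here is a hypothetical stand-in (`make_fake_detector`) that simply returns known objects at the requested scale; a real system would run the ML model on a resized image. The scale factor, number of levels, and size range are all assumed values.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def make_fake_detector(objects: List[Box]) -> Callable[[float], List[Box]]:
    """Hypothetical stand-in for an ML model: returns every object in scaled-image pixels."""
    def detect_at_scale(scale: float) -> List[Box]:
        return [tuple(v * scale for v in box) for box in objects]
    return detect_at_scale


def detect_multiscale(detect_at_scale: Callable[[float], List[Box]],
                      size_range: Tuple[float, float],
                      scale_factor: float = 0.5,
                      num_levels: int = 3) -> List[Box]:
    """Run the detector on progressively smaller images, keeping in-range ROIs only."""
    min_size, max_size = size_range
    results: List[Box] = []
    scale = 1.0
    for _ in range(num_levels):
        for x0, y0, x1, y1 in detect_at_scale(scale):
            size = max(x1 - x0, y1 - y0)
            if min_size <= size <= max_size:
                # Map the ROI back to the coordinates of the original image.
                results.append((x0 / scale, y0 / scale, x1 / scale, y1 / scale))
        scale *= scale_factor
    # Note: an object may fall in range at more than one level; merging such
    # duplicates (e.g. with NMS) is omitted from this sketch.
    return results
```

In this sketch a 150-pixel object is skipped at full resolution but picked up once the image is halved, where it measures 75 pixels and fits the assumed 30 to 100 pixel range.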
Accuracy/ROI-size pairs can be aggregated over the ROIs output by the first ML model (e.g., to form a response curve for that model). Techniques may include identifying a range of ROI sizes whose accuracy levels meet or exceed a threshold. This range of sizes may indicate that the first ML model determines “good” ROIs for objects of those sizes. The first ML model may output ROIs with sizes that are within the range of sizes and suppress the other ROIs it determines.
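A rough Python sketch of deriving a size range from accuracy/size pairs follows. Binning by size, averaging accuracy per bin, and spanning from the lowest to the highest passing bin are simplifying assumptions; the text does not say how the range is computed.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def response_curve(pairs: List[Tuple[float, float]], bin_width: float) -> Dict[int, float]:
    """Mean accuracy per size bin, from (roi_size, accuracy) pairs."""
    bins: Dict[int, List[float]] = defaultdict(list)
    for size, accuracy in pairs:
        bins[int(size // bin_width)].append(accuracy)
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}


def good_size_range(pairs: List[Tuple[float, float]], bin_width: float,
                    accuracy_threshold: float) -> Optional[Tuple[float, float]]:
    """Size interval covered by bins whose mean accuracy meets the threshold."""
    curve = response_curve(pairs, bin_width)
    good = [b for b, acc in curve.items() if acc >= accuracy_threshold]
    if not good:
        return None
    # Simplification: assumes the passing bins form one contiguous run.
    return (min(good) * bin_width, (max(good) + 1) * bin_width)


def keep_in_range(roi_sizes: List[float], size_range: Tuple[float, float]) -> List[float]:
    """Suppress ROIs whose size falls outside the model's good range."""
    lo, hi = size_range
    return [s for s in roi_sizes if lo <= s < hi]
```

The resulting interval can then be used to decide which ROIs a given model should output and which should be left to a model operating at another image scale.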
In some cases, the first batch of images may be scaled by a factor (e.g., 0.75, 0.5), and the scaled images can be provided as input to a second ML model. A second response curve, and a corresponding range of sizes, can likewise be determined for the second ML model based on the scaled images.