Artificial Intelligence – Arvind YEDLA, Marcel Nassar, Mostafa El-Khamy, Jungwon Lee, Samsung Electronics Co Ltd

Abstract for “System and Method for Deep Learning Machine for Object Detection”

“Apparatuses, systems, methods, and methods of manufacturing the same are described for object detection using region-based deep learning models. One aspect of the method uses a region proposal network (RPN) to identify regions of interest (RoIs) in an image and to assign a confidence level to each RoI. The assigned confidence levels are then used to boost the background score assigned to each RoI by the downstream classifier. Finally, the boosted scores are used by a softmax function to calculate the final class probabilities for each object class.

Background for “System and Method for Deep Learning Machine for Object Detection”

“Machine-learning technology is constantly evolving and supports many aspects of modern society, including web searches, content filtering, and automated recommendations on merchant sites, as well as object detection, object classification, speech recognition, machine translation, and drug discovery and genomics. Deep neural networks represent the current state of the art in machine learning: they use multiple layers of computation to learn representations of data (usually very large quantities of data) with multiple levels of abstraction. See, for example, Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, “Deep learning,” Nature, vol. 521, pp. 436-444 (28 May 2015), which is hereby incorporated by reference in its entirety.”

“Although deep learning has shown outstanding performance in general object detection, it has been less effective for certain objects and situations. In particular, deep learning has not performed adequately at pedestrian detection, which is needed in many real-world applications such as autonomous driving and advanced driver assistance systems.

According to one aspect of the present disclosure, a method of object detection using a region-based deep-learning model is provided. The method includes using a region proposal network (RPN) to identify regions of interest (RoIs) in an image and to assign a confidence level to each RoI; using the assigned confidence levels of the RoIs to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class.

According to one aspect of the present disclosure, an apparatus for object detection using a region-based deep-learning model is provided. The apparatus includes one or more non-transitory computer-readable media and at least one processor which, when executing instructions stored on the one or more non-transitory computer-readable media, performs the steps of: using a region proposal network (RPN) to identify regions of interest (RoIs) in an image and to assign a confidence level to each RoI; using the assigned confidence levels to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class.

“According to one aspect of the present disclosure, a method of manufacturing a chipset is provided. The chipset includes at least one processor which, when executing instructions stored on one or more non-transitory computer-readable media, performs the steps of: using a region proposal network (RPN) to identify regions of interest (RoIs) in an image and to assign a confidence level to each RoI; using the assigned confidence levels to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class.

According to one aspect of the present disclosure, a method of testing an apparatus is provided. The method includes testing whether the apparatus has at least one processor which, when executing instructions stored on one or more non-transitory computer-readable media, performs the steps of: using a region proposal network (RPN) to identify regions of interest (RoIs) in an image and to assign a confidence level to each RoI; using the assigned confidence levels to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class; and testing whether the apparatus has the one or more non-transitory computer-readable media that store the instructions.

“Hereinafter, embodiments of the present disclosure are described in detail with reference to the accompanying drawings. The same elements are designated by the same reference numerals even though they are shown in different drawings. Specific details, such as configurations and components, are provided merely to assist in the overall understanding of the embodiments of the present disclosure. It should be apparent to those skilled in the art that various changes and modifications of the embodiments described herein may be made without departing from the scope of the present disclosure. Descriptions of well-known functions and constructions are omitted for clarity and conciseness. The terms used herein are defined in consideration of the functions described in the present disclosure and may differ according to the intentions or customs of users; their definitions should therefore be determined based on the contents of this specification.

“The present disclosure may have various modifications and various embodiments, among which embodiments are described below in detail with reference to the accompanying drawings. However, it should be understood that the present disclosure is not limited to the specific embodiments, but includes all modifications, equivalents, and alternatives within the scope of the disclosure.

Although terms including ordinal numbers such as “first” and “second” may be used to describe various elements, these terms do not limit the elements; they are used only to distinguish one element from another. For example, without departing from the scope of the present disclosure, a first structural element may be referred to as a second structural element, and likewise a second structural element may be referred to as a first structural element. As used herein, the term “and/or” includes any and all combinations of one or more of the associated items.

The terms used herein are merely used to describe various embodiments of the present disclosure and are not intended to limit the disclosure. Singular forms are intended to include plural forms unless the context clearly indicates otherwise. It should be understood that the terms “include” and “have” indicate the existence of a feature, a number, a step, an operation, a structural element, a part, or a combination thereof, and do not exclude the existence or the possibility of adding one or more other features, numbers, steps, operations, structural elements, parts, or combinations thereof.

“Unless defined otherwise, all terms used herein have the same meanings as those understood by a person skilled in the art to which the present disclosure belongs. Terms such as those defined in a generally used dictionary are to be interpreted as having meanings consistent with their contextual meanings in the relevant field of art, and are not to be interpreted as having ideal or excessively formal meanings unless clearly defined in the present disclosure.

“Various embodiments may include one or more elements. An element may include any structure arranged to perform certain operations. Although an embodiment may be described with a limited number of elements in a certain arrangement, the embodiment may include more or fewer elements in alternative arrangements. It is worth noting that any reference to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment. The appearances of the phrase “one embodiment” (or “an embodiment”) in various places in this specification do not necessarily refer to the same embodiment.

“Although deep learning methods have been shown to be very effective in general object detection, they have not performed as well at pedestrian detection.”

“Faster region-based convolutional neural networks (faster R-CNN) have been the most popular framework for general object detection. However, this framework suffers from a high false positive rate: background regions are detected as objects belonging to the predetermined (foreground) object categories, such as persons. As a result, faster R-CNN has not been adequate for pedestrian detection.

Embodiments of the present disclosure reduce the false positive rate by using the region proposal network (RPN) score to boost the background score, or confidence level, of the image regions (i.e., regions of interest (RoIs)) used by the downstream classifier of faster R-CNN. Simply put, if the RPN is confident that an RoI contains only background, the downstream classifier of faster R-CNN boosts its confidence in the background class proportionally, which reduces false positive objects/foregrounds. This technique can also be used at inference time for models that were not trained with RPN boosting. Semantic segmentation masks and other information can likewise be used to boost the background scores of RoIs in the downstream classifier.

Ren, S., He, K., Girshick, R., and Sun, J., “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems, pp. 91-99 (2015), which is incorporated herein by reference in its entirety, describes one of the most effective general object detection techniques. It uses a fully neural network approach with a two-stage detection process.

“FIG. 1 illustrates the faster R-CNN framework. As shown in FIG. 1, an input image 101 is processed by a deep CNN, which is referred to as the base network 110 in this disclosure.

The first stage is the RPN 130, a sliding-window detector. The RPN predicts objectness scores, which measure the likelihood that a region belongs to a set of object classes (the foreground) rather than to the background (no objects), for anchors corresponding to locations in the input image. “Objectness” can be understood as a measure of the degree to which an image region contains an object. The RPN 130 generates region proposals 135.

“In the second stage, the regions proposed by the RPN are fed into a downstream classifier 140 for further classification into one of several object categories. This is done using an attention mechanism called RoI pooling.
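As a rough illustration of the RoI pooling step just described, the following NumPy sketch max-pools an RoI of arbitrary size into a fixed grid; the function name, the 7×7 output size, and the box format are illustrative assumptions rather than the faster R-CNN implementation itself.

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=7):
    """Max-pool one RoI from a feature map into a fixed output_size x output_size grid.

    feature_map: (H, W, C) array produced by the base network.
    roi: (x1, y1, x2, y2) in feature-map coordinates, assumed to have positive size.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    region = feature_map[y1:y2 + 1, x1:x2 + 1, :]
    h, w, c = region.shape
    pooled = np.zeros((output_size, output_size, c), dtype=feature_map.dtype)
    # Split the RoI into an output_size x output_size grid of bins and max-pool each bin,
    # so every RoI yields a fixed-size feature regardless of its original shape.
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            bin_ = region[ys[i]:max(ys[i + 1], ys[i] + 1),
                          xs[j]:max(xs[j + 1], xs[j] + 1), :]
            pooled[i, j, :] = bin_.max(axis=(0, 1))
    return pooled
```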

The main shortcoming of the faster R-CNN approach is that the downstream classification must be performed independently for each RoI. The region-based fully convolutional network (R-FCN), described in Dai, J., Li, Y., He, K., and Sun, J., “R-FCN: Object detection via region-based fully convolutional networks,” arXiv preprint arXiv:1605.06409 (2016), which is incorporated herein by reference in its entirety, was created to overcome this inefficiency of the faster R-CNN framework.

“FIG. 2 illustrates the R-FCN framework. As shown in FIG. 2, an input image 201 is processed by the base network to generate feature maps 220.

The R-FCN architecture is designed to categorize the proposed RoIs into object categories or background, and the R-FCN framework implements the downstream classifier using a convolution-only network. However, convolutional networks are translation-invariant, while object detection must be sensitive to translations of object position. To address this, the R-FCN framework generates a bank of specialized convolutional layers known as position-sensitive score maps 250. Each score map encodes position information for a relative spatial location along the channel dimension.

“In contrast, the embodiments of this disclosure provide a mechanism to decrease the false positive rate (or “false object” rate) of deep learning systems that use region-based techniques for object detection. Baseline faster R-CNN/R-FCN models use the RPN scores only to sort the RoIs and to select the top-N RoIs that will be used for downstream classification. A drawback of this approach is that all top-N RoIs are treated equally by the downstream classifier, including RoIs with very low objectness scores.
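The baseline behavior described above can be made concrete with the short sketch below, which sorts candidate RoIs by their RPN objectness score and keeps only the top N; the names and the default N are hypothetical.

```python
import numpy as np

def select_top_n(rois, objectness, n=300):
    """Keep the N RoIs with the highest RPN objectness scores.

    rois: (M, 4) array of boxes; objectness: (M,) array of RPN scores.
    Note that the selected RoIs are then treated equally by the downstream
    classifier, no matter how low their objectness scores actually are.
    """
    order = np.argsort(objectness)[::-1][:n]
    return rois[order], objectness[order]
```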

“In embodiments according to the present disclosure, the region scores generated by the RPN are used to boost the scores computed downstream by the classifier; this is referred to herein as RPN boosting. One embodiment uses a Bayesian framework to calculate the a posteriori probability that an RoI is an object, given both the RPN and classifier scores. However, that approach biases all objects toward the background and lowers the scores of even the good RoIs (i.e., those more likely to be objects).

“For example, let C0, C1, . . . , CK denote K+1 classes of interest, where C0 denotes the background class. Let PB and PF denote the background and foreground probabilities assigned to an RoI by the RPN, where the foreground probability is the probability that the RoI contains an object belonging to any of the K classes of interest. Let s0, s1, . . . , sK denote the scores assigned to the RoI by the downstream classifier. The boosting of the background score based on the RPN confidence is given by Equation (1).

“The updated scores are used to compute the final class probabilities with the softmax layer, which outputs a probability distribution across the possible classes.”

“Specifically, continuing with the example above, with K+1 classes and updated boosted scores s = (s0, s1, . . . , sK), where s0 is boosted according to Equation (1), the probability ym of a class C having label m (i.e., Cm) is calculated by the softmax layer using Equation (2):”

“y_m = P(C = C_m | s) = e^{s_m} / Σ_{i=0}^{K} e^{s_i}    (2)”

“The softmax probability can be used directly in the prediction phase. Boosting s0 affects the probabilities of all classes, since it changes the denominator of Equation (2).
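Because Equation (1) is not reproduced in this text, the sketch below assumes, purely for illustration, that the background score s0 is boosted by the log-ratio of the RPN background and foreground probabilities PB and PF; the softmax of Equation (2) is then applied to the boosted scores, so the boost changes the probabilities of every class through the shared denominator.

```python
import numpy as np

def boosted_softmax(scores, p_background, p_foreground, eps=1e-8):
    """Boost the background score s0 with the RPN confidence, then apply Equation (2).

    scores: (K+1,) classifier scores s0..sK, with index 0 the background class C0.
    p_background, p_foreground: RPN probabilities PB and PF for this RoI.
    The log-ratio boost below is an assumed stand-in for Equation (1), which is
    not reproduced in this text.
    """
    s = scores.astype(float)
    s[0] = s[0] + np.log((p_background + eps) / (p_foreground + eps))  # assumed form of Eq. (1)
    e = np.exp(s - s.max())            # subtract the max for numerical stability
    return e / e.sum()                 # Equation (2): y_m = exp(s_m) / sum_i exp(s_i)

# Example: an RoI that the RPN is 90% sure is background gets a higher background probability.
scores = np.array([1.0, 2.0, 0.5])
print(boosted_softmax(scores, p_background=0.9, p_foreground=0.1))
```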

“While the softmax probabilities can be used in the training phase, it is easier to express them as a cross-entropy function E in the log domain, as shown by Equation (3) below, where ti=1 when the training input corresponds to class Cm (i.e., ti=tm=1) and ti=0 otherwise, and θ denotes the network parameters.

“E = −log L(θ | t, s) = −Σ_{i=0}^{K} t_i · log(y_i)    (3)”

“To optimize the network parameters θ, the partial derivative of the cross-entropy function E with respect to the score sm is used, as given by Equation (4):

“∂E/∂s_m = y_m − t_m    (4)”
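To make Equations (3) and (4) concrete, the sketch below compares the analytic gradient y_m − t_m against a numerical finite-difference estimate of the cross-entropy; the specific scores and target are arbitrary illustrative values.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def cross_entropy(s, t):
    # Equation (3): E = -sum_i t_i * log(y_i)
    return -np.sum(t * np.log(softmax(s)))

s = np.array([0.3, 1.2, -0.7])   # boosted scores for K+1 = 3 classes
t = np.array([0.0, 1.0, 0.0])    # one-hot target: the true class is C1

analytic = softmax(s) - t        # Equation (4): dE/ds_m = y_m - t_m
numeric = np.zeros_like(s)
h = 1e-6
for m in range(len(s)):
    d = np.zeros_like(s)
    d[m] = h
    numeric[m] = (cross_entropy(s + d, t) - cross_entropy(s - d, t)) / (2 * h)

print(np.allclose(analytic, numeric, atol=1e-5))  # True: the two gradients agree
```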

“FIG. 3 is a block diagram illustrating an exemplary deep convolutional neural network to which embodiments of this disclosure can be applied. The blocks and layers at the bottom form a residual network (ResNet). The ResNet output is fed into a region proposal convolutional network (RPN Conv 310), whose output is used to determine objectness scores and coordinates for detected objects. This information is used to further classify each detected object by the Position-Sensitive RoI Classification (PSRoI Cls 340) and Regression (PSRoI Reg 350) networks, which generate classification scores for each possible category and refine the detection boxes delivered by the RPN to the RoI networks. The Boosted Scores 360 and Boosted Softmax 370 operations are based on Equations (1) and (2).

“FIG. 4 is a flowchart of an object detection method according to an embodiment of the present disclosure. In 410, the base network processes the input image to create feature maps. In 420, the RPN sliding-window detector selects RoIs and assigns a confidence score to each RoI, i.e., the probability that the RoI is foreground (an object) or background. The downstream classifier pools the RPN regions and further categorizes each RoI into one of the object categories; however, in 430, the confidence levels calculated by the RPN in 420 are used to boost the background scores of the RoIs before the downstream classifier classifies them, and the softmax function uses the boosted scores to calculate the final class probabilities for each object class.

“Another embodiment of the disclosure uses semantic segmentation masks from any source, in place of or in addition to the RPN scores, to boost the background scores and thereby reduce the false alarm rate. A semantic segmentation mask, provided by a separate semantic segmentation algorithm, supplies pixel-wise labels for each class under consideration, in contrast to the region- or box-wise labeling produced by an object detection system. The background and foreground probabilities for each RoI can be calculated as the ratio of background or foreground pixels to the total number of pixels in the RoI. If necessary, a lower limit can be set on the number of foreground pixels within an RoI to prevent PF from falling to 0, which prevents the classifier from assigning a background probability of 1.
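A minimal sketch of the pixel-ratio computation described above follows, assuming a binary mask in which nonzero pixels are labeled foreground; the floor on foreground pixels corresponds to the lower limit mentioned in the text, and the function name and box format are hypothetical.

```python
import numpy as np

def roi_probs_from_mask(seg_mask, roi, min_foreground_pixels=1):
    """Estimate PF and PB for one RoI from a semantic segmentation mask.

    seg_mask: (H, W) array, nonzero where a pixel is labeled as a foreground class.
    roi: (x1, y1, x2, y2) in mask coordinates.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    region = seg_mask[y1:y2 + 1, x1:x2 + 1]
    total = region.size
    # The lower limit keeps PF from collapsing to 0, so the classifier is never
    # handed a background probability of exactly 1 for this RoI.
    foreground = max(int(np.count_nonzero(region)), min_foreground_pixels)
    p_foreground = foreground / total
    return p_foreground, 1.0 - p_foreground
```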

“In another embodiment of the present disclosure, the magnitude of the optical flow is used to reduce the false alarm rate of the detector. The optical flow can be obtained from any source; a separate algorithm provides the optical flow as a measure of the degree of per-pixel change between frames, which can indicate a moving object when the camera is stationary, such as in surveillance cameras. In such embodiments, a threshold is applied to the magnitude of the optical flow: assuming a stationary background, a pixel whose flow magnitude is less than the threshold is considered background; otherwise, the pixel is designated as foreground. The background and foreground probabilities PB and PF are then calculated as the ratio of background or foreground pixels to the total number of pixels in each RoI. If necessary, a lower limit can be set on the number of foreground pixels within an RoI to prevent PF from falling to 0.
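Under the same pixel-ratio idea, the sketch below thresholds the optical-flow magnitude to label pixels as background or foreground for a stationary camera; the threshold value and the names are assumptions for illustration.

```python
import numpy as np

def roi_probs_from_flow(flow, roi, threshold=1.0, min_foreground_pixels=1):
    """Estimate PF and PB for one RoI from dense optical flow (stationary camera assumed).

    flow: (H, W, 2) array of per-pixel flow vectors (dx, dy) between two frames.
    roi: (x1, y1, x2, y2) in flow-field coordinates.
    threshold: flow-magnitude cutoff below which a pixel is treated as background.
    """
    x1, y1, x2, y2 = [int(round(v)) for v in roi]
    region = flow[y1:y2 + 1, x1:x2 + 1, :]
    magnitude = np.linalg.norm(region, axis=-1)
    total = magnitude.size
    # As with the segmentation mask, the floor keeps PF above 0.
    foreground = max(int(np.count_nonzero(magnitude >= threshold)), min_foreground_pixels)
    p_foreground = foreground / total
    return p_foreground, 1.0 - p_foreground
```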

“In another embodiment, RPN scaling can also be combined with other scale factors such as those determined by semantic segmentation and optical flow to calculate the boosting.”
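The disclosure does not fix a formula for combining the RPN, segmentation, and optical-flow factors; the sketch below simply multiplies the per-source foreground and background probabilities and renormalizes, purely as one plausible combination rule.

```python
def combine_foreground_probs(*p_foregrounds, eps=1e-12):
    """Combine several independent foreground estimates into one (PF, PB) pair.

    Multiplying the per-source probabilities and renormalizing is only one
    plausible rule; it is an assumption, not a formula from the disclosure.
    """
    pf, pb = 1.0, 1.0
    for p in p_foregrounds:
        pf *= p
        pb *= (1.0 - p)
    total = pf + pb + eps
    return pf / total, pb / total

# Example: RPN says 0.7 foreground, segmentation says 0.6, optical flow says 0.8.
print(combine_foreground_probs(0.7, 0.6, 0.8))
```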

“Another embodiment allows for iterative refinement. In other words, when the classification head updates the scores and adjusts the regions, the scores for the updated RoI regions can be reused in the next iteration. Such iterative schemes consider only the detection candidates with the highest classification scores in the current iteration.

For example, let D0 = {(si, Bi)}, i = 1 . . . N, be the detections output by the network, where si and Bi are the score and bounding-box coordinates of the i-th predicted box. If the input to the RoI pooling layer is replaced by the boxes Bi and the network is run forward from the RoI pooling layer, a new set of detections D1 = {(s′i, B′i)}, i = 1 . . . N, corresponding to the new RoIs is obtained. Let D = D0 ∪ D1 and let N = NMS(D, τ), where NMS denotes the non-maximum suppression algorithm with overlap threshold τ, which suppresses detections with lower scores. The final output can then be refined by averaging each kept detection with the overlapping detection boxes from the first and second iterations, denoted AVG(N, D).
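The sketch below mirrors the iterative refinement just described: a greedy NMS over the union of the first- and second-iteration detections, followed by averaging each kept box with the detections it overlaps (the AVG step); the IoU threshold and the helper names are illustrative assumptions.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def refine(d0, d1, nms_thresh=0.5):
    """Merge detections D0 and D1 from the two iterations.

    d0, d1: lists of (score, box) pairs.  Greedy NMS keeps the highest-scoring box
    of each overlapping cluster; each kept box is then averaged with every detection
    in the union D = D0 U D1 that overlaps it.
    """
    d = sorted(d0 + d1, key=lambda sb: sb[0], reverse=True)
    kept = []
    for score, box in d:
        if all(iou(box, kb) < nms_thresh for _, kb in kept):
            kept.append((score, box))
    refined = []
    for score, box in kept:
        overlapping = [b for _, b in d if iou(box, b) >= nms_thresh]
        refined.append((score, np.mean(np.array(overlapping, dtype=float), axis=0)))
    return refined
```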

“FIG. 5 illustrates an exemplary diagram of the present apparatus according to one embodiment. The apparatus 500 includes at least one processor 510 and one or more non-transitory computer-readable media 520. When executing instructions stored on the one or more non-transitory computer-readable media 520, the at least one processor 510 performs the steps of: using an RPN to identify RoIs in an image and to assign a confidence level to each RoI; using the confidence levels to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class. The one or more non-transitory computer-readable media 520 store instructions for the at least one processor 510 to perform these steps.

“In another embodiment, the at least one processor 510, when executing instructions stored on the one or more non-transitory computer-readable media 520, uses at least one of the confidence levels assigned by an RPN that identifies RoIs in an image, semantic segmentation masks, and the magnitude of the optical flow to boost the background scores used by the downstream classifier. The one or more non-transitory computer-readable media 520 store instructions for the at least one processor 510 to perform this step.

“FIG. 6 illustrates an exemplary flowchart for manufacturing and testing the present apparatus according to one embodiment.

“At 650, the apparatus (in this instance, a chipset) is manufactured, including at least one processor and one or more non-transitory computer-readable media. When executing instructions stored on the one or more non-transitory computer-readable media, the at least one processor performs the steps of: using an RPN to identify RoIs in an image and to assign a confidence level to each RoI; using the confidence levels to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class. The one or more non-transitory computer-readable media store instructions for the at least one processor to perform those steps.

“At 660, the apparatus (in this instance, a chipset) is tested. Testing at 660 includes testing whether the apparatus has at least one processor which, when executing instructions stored on one or more non-transitory computer-readable media, performs the steps of: using an RPN to identify RoIs in an image and to assign a confidence level to each RoI; using the confidence levels to boost the background score assigned to each RoI by the downstream classifier; and using the boosted scores in a softmax function to calculate the final class probabilities for each object class; and testing whether the apparatus has the one or more non-transitory computer-readable media that store the instructions.

“In another embodiment, a chipset is manufactured that includes at least one processor and one or more non-transitory computer-readable media. When executing instructions stored on the one or more non-transitory computer-readable media, the at least one processor uses at least one of the confidence levels assigned by an RPN that identifies RoIs in an image, semantic segmentation masks, and the magnitude of the optical flow to boost the background scores used by the downstream classifier. The one or more non-transitory computer-readable media store instructions for the at least one processor to perform that step.

“In this embodiment, the chipset may be tested by testing whether the apparatus has at least one processor which, when executing instructions stored on one or more non-transitory computer-readable media, uses at least one of the confidence levels assigned by an RPN that identifies RoIs in an image, semantic segmentation masks, and the magnitude of the optical flow to boost the background scores used by the downstream classifier; and by testing whether the apparatus has the one or more non-transitory computer-readable media that store instructions for the at least one processor to perform that step.”

“In embodiments, the present disclosure provides a fully deep convolutional neural network approach to pedestrian detection based on the recently introduced R-FCN architecture. One aspect of the disclosure is that the RPN scores can be used to increase the performance of the downstream classifier.

The steps and/or operations described in relation to an embodiment may be performed in a different order, depending on the particular embodiment and/or implementation, as would be understood by one of ordinary skill in the art. Different embodiments may perform actions in different ways or by different means. As would be understood by one of ordinary skill in the art, some drawings are simplified representations of the actions performed; real-world implementations may require more steps and/or components and may vary depending on the requirements of the particular implementation. The drawings are simplified representations intended only to aid understanding of the present description.

“Similarly, some drawings show only the components relevant to the present description, and some of those components merely represent a function or operation well known in the field rather than an actual piece of hardware. Some or all of the components/modules may be implemented or provided in a variety of ways and/or combinations, including firmware and/or hardware, including, but not limited to, one or more application-specific integrated circuits (ASICs), standard integrated circuits, controllers executing appropriate instructions (including microcontrollers and/or embedded controllers), field-programmable gate arrays (FPGAs), and complex programmable logic devices (CPLDs). Some or all of the system components and/or data structures may also be stored as content (e.g., as executable or other machine-readable software instructions or structured data) on a non-transitory computer-readable medium (e.g., a hard disk, a memory, a computer network, a cellular wireless network, or another data transmission medium) so that the computer-readable medium and/or an associated computing system or device can be configured to execute or otherwise use or provide the content for at least

“One or more processors, simple microcontrollers, controllers, and the like may be used to execute sequences of instructions stored on non-transitory computer-readable media to implement embodiments of the present disclosure. In some embodiments, hard-wired circuitry may be used in place of, or in combination with, the software instructions. Thus, embodiments of the present disclosure are not limited to any specific combination of hardware circuitry, firmware, and/or software.

The term “computer-readable medium” as used herein refers to any medium that stores instructions which may be provided to a processor for execution. Such a medium may take many forms, including volatile and non-volatile media. Common forms of non-transitory computer-readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, and any other memory chip or cartridge from which a processor can execute instructions.

“Some embodiments may be implemented, at least in part, on a portable device. The terms “portable device” and/or “mobile device” as used herein refer to any portable or movable electronic device capable of receiving wireless signals, including, but not limited to, multimedia players, communication devices, computing devices, and navigating devices. Thus, mobile devices include, but are not limited to, user equipment (UE), tablet computers, personal digital assistants (PDAs), MP3 players, handheld PCs, instant messaging devices (IMDs), cellular telephones, global navigation satellite system (GNSS) receivers, watches, and any other such device that can be worn on or carried by a person.

“Various embodiments of the present disclosure may be implemented in an integrated circuit (IC), also called a microchip, silicon chip, computer chip, or simply a chip, as would be understood by one of ordinary skill in the art in view of the present disclosure. Such an IC may be, for example, a broadband and/or baseband modem chip.

While several embodiments have been described herein, it should be understood that various modifications may be made without departing from the scope of the present disclosure. Thus, it will be apparent to one of ordinary skill in the art that the scope of the present disclosure is not limited to any of the embodiments described herein, but is defined only by the appended claims and their equivalents.
