Invented by Sayed Mehdi Sajjadi Mohammadabadi, Berta Rodriguez Hervas, Hang Dou, Igor Tryndin, David Nister, Minwoo Park, Neda Cvijetic, Junghyun Kwon, Trung Pham, Nvidia Corp
The Nvidia Corp invention works as follows: In various examples, live perception from sensors on a vehicle can be used to detect and classify an intersection in the vehicle's environment in real time. A deep neural network (DNN), for example, may be trained to produce various outputs, including intersection bounding box coordinates, intersection coverage maps corresponding to those bounding boxes, and intersection attributes. The outputs can be decoded and/or post-processed to determine the final locations, distances, and/or characteristics of the detected intersections.
Background for Intersection detection in autonomous machine applications:
Autonomous systems and advanced driver assistance systems (ADAS) typically leverage various sensors to accomplish tasks such as lane keeping, lane changing, lane assignment, and camera calibration. For autonomous and ADAS functionality to operate independently and efficiently, the vehicle's environment must be understood in real time or near real time. This may include information about the locations of objects, barriers, lanes, and/or intersections in the environment with respect to various demarcations, such as lanes or road boundaries. A vehicle may use information about its surroundings to make decisions such as where, when, and for how long to stop.
As an example, information about the locations and attributes of intersections in the environment of an autonomous or semi-autonomous vehicle can be valuable for path planning, obstacle avoidance, and/or control decisions, such as when and where to stop or move, which path to take to safely cross an intersection, or where other vehicles and pedestrians might be located. This is especially important for vehicles operating in semi-urban and urban driving environments, where understanding intersections and planning paths through them is crucial. In a multi-lane, bi-directional driving environment where the vehicle must slow down and wait at an intersection, determining the intersection's location and type is critical to safe and effective semi-autonomous and/or autonomous driving.
In conventional systems, intersections may be interpreted using a combination of several individually detected characteristics of a vehicle's environment. To detect an intersection, for example, multiple objects, such as traffic lights, stop signs, vehicle positions, vehicle rotations, lanes, etc., may be detected separately, and the separately detected objects then combined to detect and classify a single intersection. Such solutions, however, require an accurate and detailed algorithm to identify the relevant features and combine them for each type of intersection. The more complex the intersection, the more detailed the annotations required, which increases the difficulty of detecting and classifying intersections and decreases the scalability of intersection detection. The compute resources needed for real-time or near real-time detection and classification of intersections are also increased, because multiple detection processes must be performed and later combined to reach a final classification. Further, a detection error relating to any single feature can lead to an incorrect classification of the intersection, or make the intersection difficult to detect at all.
Other conventional systems may interpolate intersections by comparing features detected by individual sensors to features in pre-stored high-definition (HD), three-dimensional (3D) maps of a vehicle's driving surface. Such map-based solutions depend on the accuracy and availability of the maps, and are unable to function when maps for certain areas are outdated or unavailable. The problem is compounded when the vehicle must be able to drive independently across different regions. Conventional systems also fail when there is a transient condition at an intersection (e.g., police directing traffic or a stopped school bus) that is not reflected in the maps.
Embodiments of the present disclosure relate to detecting and classifying intersections, e.g., for associated wait conditions, in autonomous machine applications. Systems and methods are disclosed that leverage outputs from various sensors of a vehicle to detect regions of an environment that correspond to intersections, and to classify those intersections holistically, looking at an intersection region of the environment as a whole, in real time or near real time.
Unlike the conventional systems described above, the current system may use the live perception of a vehicle to detect and classify one or more intersections in the vehicle's surroundings. To do so, the system may use information such as intersection locations, distances to intersections, and/or attributes corresponding to those intersections (e.g., classifications of wait conditions). Machine learning models, such as deep neural networks (DNNs), can be used to compute this information about an intersection, including intersection bounding boxes, coverage maps, and attributes, which can then be used to accurately and effectively determine intersection locations, attributes, and distances. The vehicle can use these outputs to navigate intersections accurately and effectively. The outputs from the DNN can be used, for example, to determine the location of each intersection, the distance to each intersection, the wait condition at each intersection, where to stop, how long to remain stopped, or the like.
Because the system can detect and classify each intersection in real time or near real time from live perception alone, the process of detecting and classifying intersections may be less time-consuming and less computationally intensive than in conventional approaches. No prior knowledge or experience of the intersection is required, nor must several separate features of the vehicle and the environment be detected and combined. As a result, the autonomous vehicle can travel more freely in cities, urban environments, and other places where HD maps are not readily available.
Systems and methods are disclosed related to intersection detection and classification in autonomous machine applications. The present disclosure may be described with respect to an example autonomous vehicle 700 (also referred to herein as "vehicle 700" or "ego-vehicle 700"), an example of which is described with respect to the figures herein. This is not intended to be limiting. The systems and methods described herein may, for example, be used with non-autonomous vehicles, semi-autonomous vehicles (e.g., with one or more adaptive driver assistance systems (ADAS)), robots (including warehouse vehicles), off-road vehicles, motorcycles, boats, shuttles, emergency response vehicles, electric or motorized bicycles, aircraft, construction vehicles, underwater craft, drones, and/or any other vehicle type. Nor is the present disclosure limited to intersection detection and classification in vehicle applications; the methods and systems described herein may also be used in augmented reality, virtual reality (VR), robotics, security, surveillance, autonomous and semi-autonomous machines, underwater craft, drones, and/or other technology areas.
As described in this document, current systems and techniques provide for detecting and classifying intersections using the outputs of vehicle sensors (e.g., cameras, RADAR sensors, LIDAR sensors, etc.), which supply the vehicle's live perception in real time or near real time. For each intersection, the live perception of the vehicle may be used to detect intersection locations, distances to intersections, and/or classifications or attributes corresponding to those intersections. Computer vision and/or machine learning model(s), e.g., deep neural networks such as convolutional neural networks (CNNs), may be trained to produce outputs that, after decoding in embodiments, result in detected intersections and distances to them, and/or classifications and attributes of the intersections. The outputs can then be used by the vehicle to navigate intersections effectively and accurately.
In some embodiments, to accurately classify intersections that visually overlap in image space, machine learning models may be trained to compute, for pixels in the bounding shapes, pixel distances to the edges of the corresponding bounding shapes, so that the bounding shapes can be generated from those distances. This allows each pixel to represent the shape and position of its bounding shape. A smaller encoding area (e.g., fewer pixels than actually correspond to the shape) can then be used to encode bounding shape edge locations, removing overlap between bounding shapes. In some embodiments, prediction accuracy and stability can be improved over time by using temporal processing, which allows previous detections and classifications to inform current predictions. To train the machine learning models to predict distances to intersections, ground truth information may be generated and associated, automatically in embodiments, with the ground truth sensor data. As a vehicle crosses a captured portion of an imaged environment, the vehicle's sensors can be used to compute the distance traveled to each intersection, and that distance may then be attributed to the corresponding sensor data.
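The per-pixel edge-distance idea above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation; the function name, box layout (top, left, bottom, right), and grid representation are all assumptions for the example.

```python
# Sketch: encode a bounding box as per-pixel distances to its four edges.
# Every pixel inside the box carries enough information to reconstruct
# the whole box, which is what lets a reduced (cropped) set of pixels
# still encode the full bounding shape.

def encode_edge_distances(height, width, box):
    """For each pixel inside `box` = (top, left, bottom, right), inclusive,
    store (dist to top edge, left edge, bottom edge, right edge);
    pixels outside the box stay None."""
    top, left, bottom, right = box
    targets = [[None] * width for _ in range(height)]
    for y in range(top, bottom + 1):
        for x in range(left, right + 1):
            targets[y][x] = (y - top, x - left, bottom - y, right - x)
    return targets

targets = encode_edge_distances(8, 8, (2, 1, 5, 6))
# A pixel at (y=3, x=4) lies inside the box and encodes its geometry:
print(targets[3][4])  # (1, 3, 2, 2)
```

From any single in-box pixel `(y, x)` with distances `(t, l, b, r)`, the box can be recovered as `(y - t, x - l, y + b, x + r)`, which is the decoding step described later in this document.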
The system can thus learn to detect and classify each intersection in real time or near real time using live perception, without requiring any prior experience with or knowledge of the intersection.
In deployment, sensor data (e.g., image data, LIDAR data, RADAR data, etc.) may be received and/or generated using sensors (e.g., cameras, LIDAR sensors, RADAR sensors, etc.) located on or otherwise disposed on an autonomous or semi-autonomous vehicle. The sensor data may be applied to a neural network (e.g., a deep neural network (DNN), such as a convolutional neural network (CNN)) trained to identify regions of interest pertaining to intersections (e.g., raised pavement markers (RPMs), rumble strips, colored lane dividers, sidewalks, cross-walks, turn-offs, etc.), as well as semantic information (e.g., wait conditions) and distance and/or location information. The neural network may be trained to compute data representing intersection locations, classifications, and/or distances to intersections. The computed outputs may be used, for example, to determine the locations of bounding shapes corresponding to intersections, confidence maps indicating whether pixels belong to intersections, distances to the intersections, classification and semantic data (e.g., attributes, wait conditions), and/or other information. In some examples, the computed location for an intersection (e.g., pixel distances to the left, right, top, and bottom edges of the corresponding bounding box) may be represented by a pixel-based coverage map, with each pixel representing an intersection being uniformly weighted. The distance to an intersection (e.g., the distance between the vehicle and the bottom edge of the intersection's bounding box) may likewise be represented by a pixel-based distance coverage map.
The DNN may be trained to predict a variety of information, e.g., via a number of channels, corresponding to intersection locations, distances, and attributes. The channels may represent, for example, intersection locations, intersection coverage maps, intersection attributes, and/or distances. During training, images or other representations of the sensor data may be labeled with bounding boxes for intersections and may include semantic information. The labeled bounding boxes and semantic information may be used by a ground truth encoder to produce intersection locations, coverage maps, confidence values, attributes, distances, and distance coverage maps.
Locations of intersection bounding shapes can be encoded to generate one or more intersection coverage maps. In certain embodiments, the DNN may be trained to respond to all pixels within each bounding shape, so that each pixel of a bounding shape receives a uniform mask. The intersection coverage maps may encode confidence values corresponding to the likelihood that each pixel belongs to an intersection. The pixels associated with a bounding shape may be used to determine the location information of the bounding shapes encoded in those pixels. The bounding shape information for a pixel determined to belong to a bounding shape may be used, at least in part, to determine the location and dimensions of the bounding shapes for a particular instance of sensor data, such as an image. The number of pixels in the coverage map for an intersection or bounding shape may be reduced, e.g., shrunk to crop off the top portion, as described herein. This allows the overlap region, e.g., where the top of one bounding shape may overlap the bottom of another, to be excluded from consideration when determining bounding shape locations and dimensions. The DNN can thereby be trained to detect intersections both nearer to and farther from the vehicle, without the bounding shapes for one intersection interfering with predictions for an intersection farther away.
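A coverage map with the top portion cropped off might be generated roughly as follows. This is a sketch under assumptions: the patent says only that the top portion is cropped, so the `crop_frac` parameter and the uniform 1.0 confidence are illustrative choices, not specified values.

```python
# Sketch: build a uniform intersection coverage map from bounding boxes,
# cropping the top of each box so a near intersection's box does not
# overlap the bottom of a farther intersection's box.

def coverage_map(height, width, boxes, crop_frac=0.5):
    """Mark pixels of each box (top, left, bottom, right, inclusive) with
    confidence 1.0, keeping only the bottom portion of each box.
    `crop_frac` is an assumed parameter controlling how much to crop."""
    cov = [[0.0] * width for _ in range(height)]
    for top, left, bottom, right in boxes:
        cropped_top = top + int((bottom - top + 1) * crop_frac)
        for y in range(cropped_top, bottom + 1):
            for x in range(left, right + 1):
                cov[y][x] = 1.0
    return cov

# A 4-row-tall box keeps only its bottom two rows in the coverage map.
cov = coverage_map(6, 6, [(0, 0, 3, 3)], crop_frac=0.5)
print(cov[3][1], cov[0][0])  # 1.0 0.0
```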
In certain examples, because different instances of sensor data contain different numbers of intersections, the ground truth data may be encoded using different coverage maps. Coverage maps for the close, middle, and/or distant ranges within each image can be encoded separately, so that the DNN may be trained using individual channels. The distance to each intersection in an image can be used to determine which range the intersection falls within.
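Assigning an intersection to a close, middle, or distant channel by its distance can be sketched as below. The range thresholds and units (meters) are assumptions for illustration; the document does not give concrete range boundaries.

```python
# Sketch: bin an intersection into a close / middle / distant ground truth
# channel by its distance. Thresholds are illustrative, not from the patent.

def assign_range_channel(distance,
                         ranges=((0.0, 30.0), (30.0, 80.0), (80.0, float("inf")))):
    """Return the index of the range channel containing `distance`:
    0 = close, 1 = middle, 2 = distant (with the assumed thresholds)."""
    for i, (lo, hi) in enumerate(ranges):
        if lo <= distance < hi:
            return i
    return len(ranges) - 1  # fall back to the farthest channel

print(assign_range_channel(12.0))   # 0 (close)
print(assign_range_channel(45.0))   # 1 (middle)
print(assign_range_channel(150.0))  # 2 (distant)
```

Each channel would then receive its own coverage map, so the DNN learns near and far intersections through separate output heads.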
The DNN outputs can be used to detect pixels that correspond to a bounding shape. The region in world space corresponding to the region of the bounding shape in image space can then be classified as having particular wait conditions and/or attributes. Once a confidence value is established for each pixel indicating whether it corresponds to a bounding shape, the bounding shape information associated with that pixel can be used to predict the location of the bounding shape. The bounding shape locations predicted by multiple pixels can be combined to create a final location prediction for the bounding shape. In some embodiments, a random sample consensus (RANSAC) algorithm, operating on the individual predictions for each pixel, may be used to determine the final prediction.
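The per-pixel-votes-to-final-box step can be illustrated with a minimal RANSAC-style consensus. The patent states that RANSAC may be used but gives no parameters, so the tolerance, iteration count, and averaging of inliers here are all assumptions.

```python
import random

# Sketch: RANSAC-style consensus over per-pixel bounding box proposals,
# each proposal a (top, left, bottom, right) tuple decoded from one pixel.

def ransac_box(proposals, tol=2.0, iters=50, seed=0):
    """Repeatedly pick a candidate proposal, count proposals agreeing with
    it within `tol` pixels on every edge, and return the average of the
    largest agreeing (inlier) set."""
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(iters):
        cand = rng.choice(proposals)
        inliers = [p for p in proposals
                   if all(abs(a - b) <= tol for a, b in zip(p, cand))]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Final box: per-edge mean over the consensus set.
    return tuple(sum(v) / len(best_inliers) for v in zip(*best_inliers))

# Eight consistent pixel votes plus one outlier: the outlier is rejected.
proposals = [(10, 5, 20, 15)] * 8 + [(40, 40, 50, 50)]
print(ransac_box(proposals))  # (10.0, 5.0, 20.0, 15.0)
```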
Further, the number of pixels sharing the same attribute can be used to determine the final attributes of the final bounding shape associated with an intersection. When a threshold number of pixels in a bounding shape are assigned the same attribute, the intersection may be determined to be associated with that attribute. This process can be repeated for every attribute type. The detected intersection may then be linked to a set of one or more attributes based on the attributes predicted by the pixels in its final bounding shape.
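The attribute vote described above amounts to a thresholded per-attribute tally. A small sketch follows; the 50% threshold and the attribute names are illustrative assumptions, since the document specifies only that a pixel-count threshold is used.

```python
from collections import Counter

# Sketch: assign an attribute to a detected intersection when at least
# `threshold` of the pixels in its final bounding shape predicted it.

def vote_attributes(pixel_attrs, threshold=0.5):
    """`pixel_attrs` is one set of predicted attributes per pixel.
    Returns the attributes whose vote fraction meets the threshold."""
    counts = Counter()
    for attrs in pixel_attrs:
        counts.update(attrs)
    n = len(pixel_attrs)
    return {a for a, c in counts.items() if c / n >= threshold}

# 7 of 10 pixels see a traffic light; all 10 see a four-way intersection.
pixels = [{"traffic_light", "four_way"}] * 7 + [{"four_way"}] * 3
print(sorted(vote_attributes(pixels)))  # ['four_way', 'traffic_light']
```

Raising the threshold makes the vote stricter: at 0.8, only `four_way` (10/10 pixels) would survive, while `traffic_light` (7/10) would be dropped.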
In some cases, a temporal analysis can be performed on the bounding shapes, distances, and/or attributes to increase the accuracy, stability, and robustness of the DNN's predictions. The current prediction of the DNN can be weighted against prior predictions of the DNN corresponding to previous instances of sensor data. In some cases, temporal filters (e.g., statistical filtering) can be applied to multiple predictions corresponding to consecutive frames of sensor data. The vehicle's motion between successive instances of sensor data can be used to perform the temporal filtering accurately.
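One simple way to weight a current prediction against prior ones is an exponential moving average over consecutive frames. The patent does not name its exact filter, so this EMA (and the `alpha` weight) is a stand-in to make the weighting idea concrete; a real system would also compensate for ego-motion between frames.

```python
# Sketch: exponentially weighted temporal smoothing of per-frame
# bounding box predictions (top, left, bottom, right).

def smooth(prev, current, alpha=0.7):
    """Blend the current frame's prediction with the previous smoothed
    estimate; `alpha` is the (assumed) weight on the current frame."""
    if prev is None:
        return current
    return tuple(alpha * c + (1 - alpha) * p for c, p in zip(current, prev))

est = None
for box in [(10, 5, 20, 15), (12, 5, 22, 15), (11, 6, 21, 16)]:
    est = smooth(est, box)
print(tuple(round(v, 2) for v in est))  # a jitter-damped box estimate
```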
In certain examples, the vehicle's motion may be tracked during deployment and associated with the sensor data to generate new data for training the DNN or another DNN. Similar to initial DNN training, the distance from the vehicle's world-space position when a given instance of sensor data (e.g., an image) was captured to the world-space location of the intersection (e.g., the entry point to the intersection) can be tracked and encoded as the distance to the intersection. The entry point to the intersection can then be projected backwards into previous instances of the sensor data to train the DNN to compute distances to intersections. The distances can be encoded to generate a coverage map corresponding to the distances and/or to encode a distance value for pixels (e.g., pixels at intersections, pixels outside intersections, pixels within intersections, etc.). This allows the DNN to be trained using ground truth data generated from the vehicle's motion.
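The back-projection of the entry point into earlier frames reduces, for distance labels, to simple odometry arithmetic: once the cumulative distance traveled at the entry point is known, every earlier frame's remaining distance follows. The sketch below assumes odometry readings in meters; function and variable names are illustrative.

```python
# Sketch: auto-label earlier frames with distance-to-intersection ground
# truth, given the odometry (cumulative distance traveled) at each frame
# capture and at the moment the vehicle reached the intersection entry.

def label_distances(frame_odometry, intersection_odometry):
    """For each frame captured before the intersection entry was reached,
    the remaining distance is (entry odometry - frame odometry)."""
    return [intersection_odometry - d
            for d in frame_odometry
            if d <= intersection_odometry]

# Frames captured after 0 m, 10 m, and 25 m of travel; entry reached at 40 m.
print(label_distances([0.0, 10.0, 25.0], 40.0))  # [40.0, 30.0, 15.0]
```

These per-frame distances would then be encoded into the distance coverage maps described above, so the DNN learns distances without any manual distance annotation.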
Referring to FIG. 1, according to some embodiments, FIG. 1 illustrates an example data flow diagram 100 of a process for training a neural network to detect and classify intersections using the outputs of one or more sensors of a vehicle. This and other arrangements described herein are set forth only as examples. The process 100 may include a system with one or more machine learning models 104 that receive one or more inputs, such as sensor data 102, and generate one or more outputs, such as output(s) 106. When used as training data, the sensor data 102 may be referred to as training data in some examples. Although the sensor data 102 is primarily described with respect to image data representing images, this is not intended to be limiting; other types of sensor data used for intersection pose determination, such as LIDAR data, SONAR data, RADAR data, or the like (e.g., from the sensors of FIGS. 7A-7D), may also be used.
The process 100 may include generating and/or receiving sensor data 102. The sensor data 102 may be received, as a non-limiting example, from one or more sensors of a vehicle (e.g., the vehicle 700 of FIGS. 7A-7C). The sensor data 102 may be used by the vehicle, within the process 100, to detect and create paths for navigating intersections in real time or near real time. The sensor data 102 may include, without limitation, sensor data 102 from any of the sensors of the vehicle, including, with reference to FIGS. 7A-7C, global navigation satellite systems (GNSS) sensor(s) 758 (e.g., Global Positioning System sensor(s)), RADAR sensor(s) 760, ultrasonic sensor(s) 762, LIDAR sensor(s) 764, inertial measurement unit (IMU) sensor(s) 766 (e.g., accelerometer(s), gyroscope(s), magnetic compass(es), magnetometer(s), etc.), and/or other sensor types. The sensor data 102 may also include virtual sensor data generated from any number of sensors of a virtual vehicle or other virtual object. In such an example, the virtual sensors may correspond to a virtual vehicle or other virtual object in a simulated or virtual environment (e.g., used for testing, training, and/or validating neural network performance), and the virtual sensor data may represent sensor data captured by the virtual sensors within that simulated or virtual environment. By using virtual sensor data, the machine learning model(s) 104 described herein may be trained, tested, and/or validated using simulated data in a simulated environment, which may allow for testing more extreme scenarios outside of a real-world environment, where such tests would be unavailable or unsafe.