Metaverse – Stefan Johannes Josef HOLZER, Yuheng Ren, Abhishek Kar, Alexander Jay Bruen Trevor, Krunal Ketan Chande, Martin Josef Nikolaus Saelzle, Radu Bogdan Rusu, Fyusion Inc

Abstract for “Real-time mobile device capture, generation of AR/VR material”

“Various embodiments provide systems and processes that can be used to generate AR/VR content. One aspect of the invention provides a method of generating a 3D projection of an object. A single-lens camera can be used to capture a sequence of images along a camera translation, where each image contains at least a portion of overlapping subject matter that includes the object. A trained neural network is used to semantically segment the object from the images, and the segmented images are then refined using fine-grained segmentation. Interpolation parameters for the object are calculated on the fly, and stereoscopic pairs for points along the camera translation are created from the refined sequence of segmented images. These pairs are used to display the object as a 3D projection within a virtual reality or augmented reality environment. The segmented image indices are then mapped into a rotation range for display in the virtual reality and augmented reality environments.

Background for “Real-time mobile device capture, generation of AR/VR material”

Modern computing platforms and technologies are shifting toward mobile and wearable devices that include camera sensors as native acquisition streams. This shift has made it increasingly apparent that people want to preserve digital moments in a format other than traditional flat (2D) images and videos. Traditional digital media formats typically limit their viewers to a passive experience; a flat 2D image, for example, can only be viewed from one angle and cannot be zoomed in or out. As a result, traditional digital media formats such as 2D flat images are not well suited to reproducing memories and events with high fidelity.

“Producing combined images, such as a panorama or a three-dimensional (3D) model or image, requires combining data from multiple images, which can involve interpolation or extrapolation. Most existing methods of interpolation and extrapolation require significant amounts of data in addition to what is available in the images themselves. This additional data must describe the scene structure in a dense way, such as a dense depth map (where depth values are stored for every pixel) or an optical flow map (which stores motion vectors between the images). Other options include computer generation of polygons, texture mapping over a three-dimensional mesh, and/or other 3D models. These methods require significant processing time and resources, and they are further limited by processing speed and transfer rates when the data is sent over a network. It is therefore desirable to have improved methods for extrapolating and presenting 3D image data.

“Provided are various mechanisms and processes that allow AR/VR content to be captured and generated in real time. One aspect of the invention, which may include at least some of the subject matter of any of the preceding or following examples and aspects, is a method of generating a 3D projection of an object in a virtual reality or augmented reality environment. The method involves obtaining a sequence of images using a single lens camera, where each image in the sequence contains at least a portion of overlapping subject matter that includes the object.

The method also involves semantically segmenting the object from the sequence of images using a trained neural network to create a sequence of segmented object images. The method may further include stabilizing the sequence of images using focal length and camera rotation values before semantically segmenting the object from the sequence.

The sequence of segmented object images can then be refined using fine-grained segmentation. A temporal conditional random field can be used for this refinement, using a graph of neighboring images for each image to be refined. Further, the method includes computing interpolation parameters on the fly. Interpolation parameters generated on the fly can be used to render interpolated images along any point of the camera translation in real time.

The method also involves generating stereoscopic pairs from the sequence of segmented object images for displaying the object as a 3D projection in the virtual reality or augmented reality environment. Stereoscopic pairs may be generated for one or more points along the camera translation, and a stereoscopic pair may include an interpolated virtual frame. One or more selected frames may be modified by image rotation so that each frame corresponds to a line of sight angled toward the object. The sequence of segmented object images is combined to create a projection of the object that provides a 3D view without polygon generation. The segmented image indices are then mapped into a rotation range for display in the virtual reality or augmented reality environment. Mapping the segmented image indices can include mapping physical viewing points to a frame index.

“Other embodiments of this disclosure include the corresponding devices, systems and computer programs that are configured to perform the described actions. A non-transitory computer-readable medium may include one or more programs that can be executed by a computer system. In some embodiments, one or more of the programs includes instructions for performing actions according to described methods and systems. Each of these other implementations can optionally include one or several of the following features.

“In another aspect, which may include at least a portion of the subject matter of any of the preceding or following examples and aspects, a system for generating a three-dimensional (3D) projection of an object in a virtual reality or augmented reality environment includes a single lens camera that captures a sequence of images along a camera translation, where each image in the sequence contains at least a portion of overlapping subject matter that includes the object. The system also includes a display module, a processor, and memory storing one or more programs for execution by the processor. The one or more programs include instructions for performing the actions of the described methods and systems.

These and other embodiments will be described in detail below, with reference to the Figures.

“We will now refer to specific examples of disclosure, including the best methods contemplated and used by the inventors in carrying out the disclosure. The accompanying drawings show examples of specific embodiments. Although the present disclosure is presented in conjunction with specific embodiments, it should be understood that it does not limit the disclosure to those embodiments. It is, however, intended to include all alternatives, modifications, or equivalents that may be included in the scope and spirit of the disclosure, as described by the appended claims.

“The following description provides a thorough understanding of the disclosure. Some embodiments of this disclosure can be implemented without some or all of these details. In other instances, well-known process operations have not been described in detail so as not to obscure the present disclosure.

A surround view, as described in U.S. Patent Application Ser. No. 14/530,669, titled ANALYSIS AND MANIPULATION OF IMAGES AND VIDEO TO GENERATE SURROUND VIEWS, allows a user, according to different embodiments, to adjust the viewpoint of the visual information on a screen.

U.S. Patent Application Ser. No. 14/800,638, filed by Holzer et al. on Jul. 15, 2015, titled ARTIFICIALLY RENDERING IMAGES USING INTERPOLATION OF TRACKED CONTROL POINTS, describes artificial rendering of images. According to different embodiments, artificial images can be interpolated between selected keyframes and/or captured image frames and can be used as one or more frames of a stereoscopic pair. This interpolation can be used in an infinite smoothing method to generate as many intermediate frames as necessary to achieve a smooth transition between frames.

U.S. Patent Application Ser. No. 15/408,270, titled STABILIZING IMAGE SERIES BASED ON CAMERA ROTATION AND FOCAL LENGTH, filed Jan. 17, 2017, is incorporated herein in its entirety. The systems and methods it describes may be used to create stereoscopic pairs of image frames that are presented to the user to provide a perception of depth.

“Overview”

According to different embodiments, a surround view is a multi-view interactive digital media representation. The surround view can contain content for virtual reality (VR) and augmented reality (AR) and may be presented to the user with a viewing device, such as a virtual reality headset. A structured concave sequence of images may be captured live around an object of particular interest, and the resulting surround view can then be presented as a hologram model when viewed with a viewing device. When referring to both virtual reality and augmented reality, the term 'AR/VR' shall be used herein.

The data used to create a surround view can come from many sources. For example, a surround view can be generated from data including two-dimensional (2D) images. These 2D images can be captured with a camera moving along a camera translation, which may or may not be uniform; the images can be captured at constant intervals of time or distance along the camera translation. The 2D images can include color image data streams, such as multiple sequences of images, video data, etc., or multiple images in any of various formats, depending on the application. A surround view can also be generated from location data, such as data from GPS, WiFi, IMUs (inertial measurement unit systems), accelerometers, and magnetometers. Depth images are another data source that can be used to create a surround view.

The data can then be fused together. In the current example embodiment, a surround view can be generated from a combination of 2D images and location data. In other embodiments, depth images and location data can be used together, and different combinations of data can be combined with location information, depending on the application. The data is then used to model content and context. The context can be defined as the scenery surrounding the object of interest, and the content as a three-dimensional model depicting the object of interest; in some embodiments, however, the content can be presented as a two-dimensional image. Likewise, the context can be represented as a two-dimensional model of the scenery surrounding the object. Although many contexts provide two-dimensional views of the scenery surrounding the object of interest, some embodiments allow the context to include three-dimensional aspects. The surround view can be generated using the systems and methods described in U.S. Patent Application Ser. No. 14/530,669, titled ANALYSIS AND MANIPULATION OF IMAGES AND VIDEO TO GENERATE SURROUND VIEWS.

“In the current example embodiment, one or more enhancement algorithms can be applied. In particular embodiments, various algorithms can be employed during capture of surround view data, regardless of the capture mode. These algorithms can be used to enhance the user experience; for example, automatic frame selection, image stabilization, and object segmentation can all be used for this purpose. These enhancement algorithms can be applied to image data after acquisition, or they can be applied to surround view data while the image data is being captured. Automatic frame selection can reduce storage requirements by saving only keyframes from all of the captured images, and it allows a more uniform spatial distribution of viewpoints. Image stabilization can be used to stabilize keyframes in a surround view to produce improvements such as smoother transitions and enhanced focus.

“View interpolation can also be used to improve the viewing experience. In particular, interpolation can be used to avoid sudden 'jumps' between stabilized frames by rendering synthetic, intermediate views on the fly. View interpolation may be applied only to the foreground, such as the object of interest. This process can be informed by content-weighted keypoint tracking and IMU information, as well as by denser pixel-to-pixel matches, and it may be simplified if depth information is available. In some embodiments, view interpolation is applied during capture of a surround view; in other embodiments, it is applied during surround view generation. These and other enhancement algorithms are described in U.S. Patent Application Ser. No. 14/800,638, titled ARTIFICIALLY RENDERING IMAGES USING INTERPOLATION OF TRACKED CONTROL POINTS, and U.S. Patent Application Ser. No. 14/860,983, titled ARTIFICIALLY RENDERING IMAGES UTILIZING VIEWPOINT INTERPOLATION AND EXTRAPOLATION.

The surround view data may then be used to generate content for AR and/or VR viewing. In various embodiments, additional image processing is used to create a stereoscopic three-dimensional view of the object of interest that can be presented to a user on a viewing device, such as a virtual reality headset. The subject matter in the images can be separated into context (background) and content (foreground) by semantic segmentation using neural networks, and fine-grained segmentation refinement using temporal conditional random fields may also be applied. This separation can be used to remove the background so that only the image portions corresponding to the object of interest are displayed. To create stereoscopic image pairs, image stabilization may be achieved by determining focal length and camera rotation, as described in U.S. Patent Application Ser. No. 15/408,270.

“View interpolation can also be used to smooth the transitions between image frames by generating any number of intermediate, artificial image frames. Additionally, interpolated frames and captured keyframes can be combined into stereoscopic pairs of image frames. Presenting the surround view with stereoscopic pairs allows the user to perceive depth, which enhances the user's experience of the 3D surround view. Each frame of a stereoscopic pair may be a 2D image that was used to create the surround view, and the two frames of a pair are separated by a spatial baseline. This baseline can be determined based on a predetermined angle of vergence and the distance from the object of interest. Image rotation can also be used to correct one or more frames of a stereoscopic pair so that the line of sight to the object of interest is perpendicular to the plane of the frame. Because stereoscopic pairs of frames are created from images captured in a single camera translation, depth can be experienced without capturing or storing additional images.

“The image frames are then mapped to a rotation range for display, so that the movement of the user and/or the viewing device determines which frames are displayed. Image indices can be matched to various physical viewing locations that correspond to positions along the camera translation around the object of interest. Thus, a user can view a three-dimensional surround view of the object of interest from various angles and focal lengths. The surround view provides a three-dimensional view of the object without rendering and/or storing an actual three-dimensional model; instead, the three-dimensional effect is created by stitching together actual two-dimensional images, or portions thereof, and grouping stereoscopic pairs.

According to different embodiments, surround views provide numerous advantages over traditional videos and two-dimensional images, including the ability for the user to interact with the surround view and change its viewpoint.

“In particular examples, the characteristics described above can be incorporated natively in the surround view representation and provide the capability for use in various applications. Surround views can be used in many fields, such as e-commerce, visual search, file sharing, user interaction, and entertainment. A surround view can also be displayed to the user as virtual reality (VR) or augmented reality (AR) content on a viewing device, such as a virtual reality headset. VR applications simulate a user's physical presence in an environment and enable the user to interact with that environment and any objects depicted within it. Images may also be presented as an AR view, which provides a direct or indirect view of a real-world environment in real time, whose elements are augmented (or supplemented) by computer-generated sensory input such as sound, video, and graphics. Such AR/VR content can be generated on the fly using the systems and methods described herein, thereby reducing the number of images and the amount of other data that must be stored. These systems and methods may also reduce processing time and power requirements, allowing AR and/or VR content to be generated more quickly, in real time or near real time.

“Example Embodiments”

“According to different embodiments of this disclosure, a surround view is a multi-view interactive digital media representation. With reference to FIG. 1, shown is an example of a system 100 that can be used to capture and generate augmented reality (AR) and/or virtual reality (VR) content in real time. In the current example embodiment, the system 100 is depicted as a flow sequence that can be used to generate a surround view for AR/VR. According to different embodiments, the data used to create a surround view can come from a variety of sources. In particular, a surround view can be generated from data including, but not limited to, two-dimensional (2D) images 104. These 2D images can include color image data streams, such as multiple image sequences or video data, as well as multiple images in different formats, depending on the application. Location information 106 is another source of data that can help to create a surround view; it can come from sources such as GPS, WiFi, IMUs, accelerometers, and magnetometers. Depth images are yet another source of data that can be used to create a surround view. These depth images include 3D or depth image data streams and can be captured by devices such as stereo cameras, time-of-flight cameras, three-dimensional cameras, and the like.

“In the current example embodiment, the data can then be fused together at sensor fusion block 110. In some embodiments, a surround view can be generated from a combination of data that includes both 2D images 104 and location information 106. In other embodiments, depth images 108 and location information 106 can be combined at sensor fusion block 110. Various combinations of image data can be combined with location information 106, depending on the application and the available data.

“In the current example embodiment, the data fused at sensor fusion block 110 can then be used for content modeling 112 and context modeling 114. As described in more detail with regard to FIG. 5, the subject matter featured in the images can be separated into content and context. The context can be defined as the scenery surrounding the object of interest. According to different embodiments, the content can be a three-dimensional model depicting an object of interest, although in some embodiments the content can be a two-dimensional image, as described in more detail with regard to FIG. 4. In some embodiments, the context can be a two-dimensional model depicting the scenery surrounding the object of interest. Although in many examples the context provides two-dimensional views of the scenery surrounding the object of interest, the context can also include three-dimensional aspects in some embodiments. For instance, the context can be described as a 'flat' image projected along a cylindrical 'canvas,' such that the 'flat' image appears on the surface of the cylinder. In addition, some examples include three-dimensional context modeling, which is useful when objects in the surroundings are identified as three-dimensional. According to different embodiments, the models provided by content modeling 112 and context modeling 114 can be generated by combining the image and location information data, as described in more detail with regard to FIG. 4.

According to different embodiments, the context and content of a surround view are determined based on a specified object of interest. In some examples, an object of interest is selected automatically based on processing of the image data and location information; for instance, if a dominant object is identified in a set of images, it can be chosen as the content. In other examples, a user-specified target 102, as shown in FIG. 1, can be selected. It should be noted, however, that a surround view can be generated without a user-specified target in certain applications.

“In the current example embodiment, one or more enhancement algorithms can be applied at enhancement algorithm(s) block 116. In particular example embodiments, various algorithms can be employed during capture of surround view data, regardless of the capture mode employed. These algorithms can be used to enhance the user experience; for example, automatic frame selection, stabilization, and view interpolation can all be used for this purpose. These enhancement algorithms can be applied to image data after acquisition, or they can be applied to the image data while surround view data is being captured.

“Automatic frame selection can be used, according to certain example embodiments, to create a more enjoyable surround view. Specifically, frames are automatically selected so that the transitions between them are smoother or more even. In some applications, this automatic frame selection can incorporate blur and overexposure detection, as well as more uniform sampling of poses so that they are more evenly distributed.
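The disclosure does not spell out the selection criteria, but a minimal sketch of blur- and overexposure-aware keyframe selection could look like the following, assuming OpenCV-style BGR frames; the sharpness threshold, the exposure limits, and the even sampling by frame index (rather than by pose) are illustrative simplifications, not the actual implementation.

```python
import cv2
import numpy as np

def sharpness(gray):
    # Variance of the Laplacian: low values indicate blur.
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def well_exposed(gray, low=0.05, high=0.95, max_clipped=0.2):
    # Reject frames where too many pixels are clipped to black or white.
    hist = np.histogram(gray, bins=256, range=(0, 255))[0] / gray.size
    clipped = hist[: int(256 * low)].sum() + hist[int(256 * high):].sum()
    return clipped < max_clipped

def select_keyframes(frames, num_keyframes=30, blur_threshold=50.0):
    """Keep sharp, well-exposed frames, then sample them evenly."""
    candidates = []
    for idx, frame in enumerate(frames):
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if sharpness(gray) >= blur_threshold and well_exposed(gray):
            candidates.append(idx)
    if not candidates:
        candidates = list(range(len(frames)))
    # Distribute the selected keyframes evenly over the capture.
    picks = np.linspace(0, len(candidates) - 1,
                        min(num_keyframes, len(candidates)))
    return [candidates[int(round(p))] for p in picks]
```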

“In some embodiments, image stabilization can be used for a surround view in a manner similar to that used for video. For example, keyframes in a surround view can be stabilized to produce improvements such as smoother transitions and enhanced focus on the content. Unlike video, however, a surround view can be stabilized using additional sources of information, such as IMU data, depth information, computer vision techniques, direct selection of an area to stabilize, face detection, and other methods.

IMU information, for example, can be very helpful for stabilization. In particular, IMU information provides an estimate, although sometimes a rough or noisy one, of the camera tremor present during image capture. This estimate can be used to cancel, reduce, or remove the effects of that camera tremor.
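As an illustration of how an IMU-derived rotation estimate could be used to counteract camera tremor, the sketch below builds a compensating warp under a pure-rotation model, H = K R_shakeᵀ K⁻¹. The intrinsic matrix K, the per-frame rotation R_shake, and the helper names are assumptions made for the example, not the application's actual stabilization pipeline.

```python
import cv2
import numpy as np

def shake_compensation_warp(K, R_shake):
    """Homography that removes a small unintended rotation R_shake.

    K is the 3x3 camera intrinsic matrix; R_shake is the 3x3 rotation
    (e.g. integrated from gyroscope samples) describing the tremor
    relative to the intended camera orientation.
    """
    return K @ R_shake.T @ np.linalg.inv(K)

def stabilize_frame(frame, K, R_shake):
    H = shake_compensation_warp(K, R_shake)
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))

# Illustrative numbers only: a 1-degree roll about the optical axis.
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
theta = np.deg2rad(1.0)
R_shake = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                    [np.sin(theta),  np.cos(theta), 0.0],
                    [0.0, 0.0, 1.0]])
```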

“In some examples, depth information, if available, can be used to provide stabilization for a surround view. Because points of interest in a surround view are three-dimensional rather than two-dimensional, tracking and matching these points is simplified as the search space is reduced. Furthermore, descriptors for points of interest can use both color and depth information, which makes them more discriminative. In addition, depth information can make automatic or semi-automatic content selection easier to provide. For instance, when a user selects a particular pixel of an image, the selection can be expanded to cover the entire surface that the pixel belongs to. Content can also be selected automatically by using foreground/background differentiation based on depth. In various examples, the content can remain relatively stable and visible in certain situations.
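A minimal sketch of the depth-based foreground/background differentiation and tap-to-expand selection mentioned above is shown below, assuming a per-pixel depth map aligned with the image; the median-depth split and the relative-depth tolerance are illustrative choices.

```python
import numpy as np
from scipy import ndimage

def split_foreground(depth, threshold=None):
    """Label pixels closer than a depth threshold as foreground (content)."""
    if threshold is None:
        threshold = np.median(depth)   # crude automatic split
    return depth < threshold           # boolean foreground mask

def expand_selection(depth, seed_yx, tolerance=0.05):
    """Grow a user-tapped pixel into the connected surface at similar depth."""
    seed_depth = depth[seed_yx]
    similar = np.abs(depth - seed_depth) < tolerance * seed_depth
    labels, _ = ndimage.label(similar)   # connected components
    return labels == labels[seed_yx]     # component containing the tap
```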

According to many examples, computer vision techniques can also be used to stabilize surround views. For example, keypoints can be detected and tracked. In certain scenes, however, such as dynamic scenes or static scenes with parallax, no simple warp exists that can stabilize everything. Consequently, there is a trade-off in which certain aspects of the scene receive more attention for stabilization while other parts receive less. Because a surround view is often focused on a particular object of interest, the stabilization can be content-weighted in certain examples so that the object of interest is maximally stabilized.

Another way to improve stabilization in a surround view is direct selection of a region on the screen. For instance, if a user taps an area of the screen to focus on and then records a convex surround view, the area that was tapped can be maximally stabilized. This allows stabilization algorithms to be focused on a particular area or object of interest.

“In some cases, face detection can be used to aid stabilization. For example, when recording with a front-facing camera, it is often likely that the user is the object of interest in the scene, so face detection can be used to weight stabilization around that area. If face detection is precise enough, facial features themselves (such as the eyes, nose, and mouth) can be used as the areas to stabilize, rather than generic keypoints.
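One hypothetical way to weight stabilization toward a detected face, using OpenCV's stock Haar cascade as a stand-in detector, is sketched below; the weighting scheme and the face_weight value are assumptions for illustration only.

```python
import cv2
import numpy as np

# Assumes an OpenCV installation that ships the standard Haar cascades.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def keypoint_weights(gray, keypoints, face_weight=5.0):
    """Weight keypoints inside detected face regions more heavily.

    keypoints: iterable of (x, y) pixel coordinates. Stabilization that
    consumes these weights would then favor the (likely) subject of a
    front-facing capture.
    """
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    weights = np.ones(len(keypoints))
    for i, (x, y) in enumerate(keypoints):
        for (fx, fy, fw, fh) in faces:
            if fx <= x <= fx + fw and fy <= y <= fy + fh:
                weights[i] = face_weight
                break
    return weights
```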

“According to various examples, view interpolation can be used to improve the viewing experience. In particular, to avoid sudden 'jumps' between stabilized frames, synthetic, intermediate views can be rendered on the fly. This can be informed by content-weighted keypoint tracks and IMU information, as well as by denser pixel-to-pixel matches, and the process may be simplified if depth information is available. In some embodiments, view interpolation is applied during capture of a surround view; in other embodiments, it is applied during surround view generation.

“In some cases, view interpolation can be described as infinite smoothing, which may also be used to improve the viewing experience by providing smoother transitions between image frames, whether those frames are actual captured frames or interpolated frames as described above. Infinite smoothing can include determining a predetermined number of possible transformations between frames. A Harris corner detector algorithm may be used to identify keypoints in the frames, that is, areas of high contrast, areas with low ambiguity in different dimensions, and/or areas with high 'cornerness,' and a predetermined number of keypoints with the highest Harris scores may be selected. The RANSAC (random sample consensus) algorithm can then be used to identify the most commonly occurring transformations among all of the possible keypoint transformations between frames. For example, a smooth flow space of eight possible transformations or motions may be defined, and different transformations may then be assigned to different pixels within a frame. Keypoint detection, keypoint tracking, and the RANSAC algorithm may be run on the fly; in certain embodiments, infinite smoothing algorithms can thus be executed in real time, so that as a user navigates to a particular translation position, the system generates an artificial image frame corresponding to that position.
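The sketch below illustrates the general recipe described above, using OpenCV primitives as stand-ins: Harris-based keypoint detection, tracking into the next frame, and repeated RANSAC fits to collect the most commonly occurring transformations. The grouping strategy (peel off inliers and refit on the remainder) and all thresholds are assumptions for illustration, not the disclosed algorithm.

```python
import cv2
import numpy as np

def candidate_transformations(img_a, img_b, max_candidates=8,
                              max_corners=400, ransac_thresh=3.0):
    """Collect the most common frame-to-frame transformations.

    Keypoints are detected with a Harris-based corner detector, tracked
    into the next frame with pyramidal Lucas-Kanade, and then grouped by
    repeatedly fitting a RANSAC homography and removing its inliers.
    """
    gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)

    corners = cv2.goodFeaturesToTrack(gray_a, maxCorners=max_corners,
                                      qualityLevel=0.01, minDistance=8,
                                      useHarrisDetector=True, k=0.04)
    if corners is None:
        return []
    tracked, status, _ = cv2.calcOpticalFlowPyrLK(gray_a, gray_b, corners, None)
    src = corners[status.ravel() == 1].reshape(-1, 2)
    dst = tracked[status.ravel() == 1].reshape(-1, 2)

    candidates = []
    while len(src) >= 8 and len(candidates) < max_candidates:
        H, mask = cv2.findHomography(src, dst, cv2.RANSAC, ransac_thresh)
        if H is None or mask.sum() < 8:
            break
        candidates.append(H)
        keep = mask.ravel() == 0      # drop inliers, refit on the rest
        src, dst = src[keep], dst[keep]
    return candidates
```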

“Filters can also be used during capture or generation of a surround view to improve the viewing experience. Just as many popular photo sharing services provide aesthetic filters that can be applied to static, two-dimensional images, aesthetic filters can similarly be applied to surround images. However, because a surround view is more expressive than a two-dimensional image, and there is more information available in a 3D surround view, these filters can be extended to include effects that are not possible in 2D photos. For instance, in a surround view, motion blur can be added to the background (i.e., the context) while the content remains crisp. In another example, a drop shadow can be added to the object of interest in a surround view.

“In various examples, compression can also be used as an enhancement algorithm. In particular, compression can be used to improve the user experience by reducing data upload and download costs. A surround view can be transmitted using far less data than a typical video while retaining the desired characteristics of the surround view. Specifically, the IMU information, keypoint tracks, and user input, combined with the view interpolation described above, can all reduce the amount of data that must be transferred to and from a device during upload or download of a surround view. For instance, if the object of interest can be properly identified, a variable compression style can be chosen for the content and the context. This variable compression style can use a lower quality resolution for the background information (i.e., the context) and a higher quality resolution for the foreground information (i.e., the content). In such examples, the amount of data transmitted can be reduced by sacrificing some of the context quality while maintaining the desired quality level for the content.
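A toy version of such variable compression is sketched below: the content region (given a foreground mask) is encoded at high JPEG quality while a downscaled copy of the full frame serves as the low-quality context. The two-buffer packaging, quality values, and scale factor are assumptions made for illustration.

```python
import cv2

def compress_variable(frame, content_mask, fg_quality=90, bg_quality=30,
                      bg_scale=0.5):
    """Encode the content region at high quality and the context at low quality.

    Returns the foreground crop's offset plus two JPEG buffers; a receiver
    would upscale the background and composite the foreground crop back in.
    """
    ys, xs = content_mask.nonzero()
    x0, x1, y0, y1 = xs.min(), xs.max(), ys.min(), ys.max()
    foreground = frame[y0:y1 + 1, x0:x1 + 1]

    small_bg = cv2.resize(frame, None, fx=bg_scale, fy=bg_scale,
                          interpolation=cv2.INTER_AREA)

    _, fg_jpg = cv2.imencode(".jpg", foreground,
                             [cv2.IMWRITE_JPEG_QUALITY, fg_quality])
    _, bg_jpg = cv2.imencode(".jpg", small_bg,
                             [cv2.IMWRITE_JPEG_QUALITY, bg_quality])
    return (x0, y0), fg_jpg, bg_jpg
```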

“In the present embodiment, a surround view 118 is generated after any enhancement algorithms are applied. The surround view provides a multi-view interactive digital media representation. In various examples, the surround view can include a three-dimensional model of the content and a two-dimensional model of the context. In some examples, however, the context can represent a 'flat' view of the scenery or background as projected along a surface, such as a cylindrical or other-shaped surface, so that the context is not purely two-dimensional. In yet other examples, the context can include three-dimensional aspects.

According to different embodiments, surround views provide numerous advantages over traditional videos and two-dimensional images. These include the ability to cope with moving scenery, a moving capture device, or both; the ability to remove redundant information; and the ability for the user to modify the view. The characteristics described above can be incorporated natively in the surround view representation and provide the capability for use in various applications. For example, surround views can be used in fields such as e-commerce, visual search, file sharing, user interaction, and entertainment.

According to different examples, once a surround view 118 is generated, user feedback for acquisition 120 of additional image data can be provided. In particular, if a surround view is determined to need additional views to provide a more accurate model of the content or context, a user may be prompted to provide additional views. Once these additional views are received by the surround view acquisition system 100, they can be processed by the system 100 and incorporated into the surround view.

“The surround view 118 can be further processed at AR/VR content generation block 122 to create content for various AR/VR systems. This block 122 may include processing modules used to segment the images, for instance to extract an object of interest from the background, as described in further detail with reference to FIGS. 11 and 12. Additional enhancement algorithms, as described with regard to block 116, may also be applied at AR/VR content generation block 122. For example, view interpolation can be applied to determine the parameters for any number of artificial intermediate images, resulting in an infinitely smooth transition between image frames, as described with reference to FIGS. 13-21. Stereoscopic pairs of image frames, which may provide the user with a perception of depth, can also be generated, as described with reference to FIGS. 22-24.

“With reference to FIG. 2, shown is an example of a method 200 for generating augmented reality and/or virtual reality content in real time. Method 200 may be implemented using system 100 and/or other methods described herein, such as process 300, and may be performed at AR/VR content generation block 122. Method 200 can produce AR/VR content that provides a user with a three-dimensional surround view of an object of interest with a perception of depth.

“A sequence of images is obtained at step 201. In some embodiments, the sequence of images includes 2D images, such as 2D images 104. In some embodiments, other data, such as location information 106 and depth information, may also be obtained from the camera and/or the user. At step 203, the images are stabilized and a set of image frames is selected. The selected image frames are referred to as keyframes. The keyframes may be used to generate a surround view through content modeling 112, context modeling 114, and/or enhancement algorithms 116, as described with reference to FIG. 1.

According to different aspects of the disclosure, AR/VR content can additionally be generated by extracting an object of interest or other content (e.g., a person) from a sequence of images to separate it from the background. This may be achieved by applying various segmentation algorithms and processes to the images. In an example embodiment, semantic segmentation of the foreground and background of each image in the keyframes is performed at step 205. Such semantic segmentation may be performed by a segmenting neural network that is trained to identify and label the pixels in each image frame, and is further described with reference to FIG. 11. At step 207, fine-grained segmentation of the keyframes may be refined. This step can enhance or improve the separation of the foreground from the background so that the object of interest is clearly and cleanly isolated, without artifacts. Fine-grained segmentation is described below with reference to FIG. 12.
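The disclosure does not name a specific network, but the semantic segmentation step can be illustrated with an off-the-shelf pretrained model. The sketch below uses torchvision's DeepLabV3 (assuming torchvision 0.13 or later) as a stand-in segmenting network and treats Pascal VOC class 15 ('person') as the object of interest; both choices are assumptions for the example.

```python
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.segmentation import (deeplabv3_resnet50,
                                              DeepLabV3_ResNet50_Weights)

# Stand-in segmentation network; the disclosure does not name a model.
model = deeplabv3_resnet50(weights=DeepLabV3_ResNet50_Weights.DEFAULT).eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def object_mask(image_path, class_id=15):
    """Return a boolean foreground mask for one keyframe.

    class_id=15 is 'person' in the Pascal VOC label set; a different id
    would be used for other object categories.
    """
    img = Image.open(image_path).convert("RGB")
    x = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        logits = model(x)["out"][0]          # (num_classes, H, W)
    labels = logits.argmax(0).cpu().numpy()  # per-pixel class ids
    return labels == class_id
```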

“At step 209, parameters for interpolation are computed. In some embodiments, the interpolation parameters are determined by computing a number of possible transformations between frames and then selecting the best transformation to apply to each pixel within an image frame. These parameters may be determined offline and used to render artificial images at runtime, as the surround view is being viewed. Further details regarding interpolation of keyframes and rendering of artificial frames are described with reference to FIGS. 13-21.
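One way the offline parameter computation and runtime rendering could fit together is sketched below: a similarity transform between two adjacent keyframes is fitted and decomposed into (scale, angle, translation) offline, and those parameters are linearly interpolated at runtime for any viewpoint x between the frames. The decomposition and helper names are illustrative assumptions.

```python
import cv2
import numpy as np

def similarity_params(src_pts, dst_pts):
    """Offline step: fit a similarity transform between matched keypoints
    (Nx2 float32 arrays) of two adjacent keyframes and store it as
    (scale, angle, tx, ty)."""
    M, _ = cv2.estimateAffinePartial2D(src_pts, dst_pts, method=cv2.RANSAC)
    scale = np.hypot(M[0, 0], M[1, 0])
    angle = np.arctan2(M[1, 0], M[0, 0])
    return scale, angle, M[0, 2], M[1, 2]

def interpolated_matrix(params, x):
    """Runtime step: rebuild the partial transform for a viewpoint x in [0, 1]
    between the two keyframes (x = 0 is the first frame, x = 1 the second)."""
    scale, angle, tx, ty = params
    s = 1.0 + x * (scale - 1.0)       # interpolate scale toward the target
    a = x * angle                     # interpolate rotation
    t = np.array([x * tx, x * ty])    # interpolate translation
    return np.array([[s * np.cos(a), -s * np.sin(a), t[0]],
                     [s * np.sin(a),  s * np.cos(a), t[1]]])
```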

“Stereoscopic pairs of image frames are generated at step 211. In some embodiments, stereoscopic pairs are created by determining which pair of frames will provide the desired perception of depth, based on the distance from the object of interest and the angle of vergence. One or more frames of a stereoscopic pair may be an artificially interpolated image. One or more frames of a stereoscopic pair may also be corrected by applying a rotation transformation so that the line of sight is perpendicular to the plane of the frame. Further details regarding the generation of stereoscopic pairs are described with reference to FIGS. 22-24.
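A simple sketch of pairing frames by vergence angle follows: given each keyframe's viewing angle around the object (e.g., derived from IMU data), the companion frame is the one whose angular offset from the chosen frame best matches the vergence angle implied by an assumed interpupillary distance and viewing distance. The geometry and parameter values are illustrative assumptions.

```python
import numpy as np

def target_vergence(ipd_m=0.064, distance_m=1.5):
    """Vergence angle (radians) subtended by the two eyes at the object."""
    return 2.0 * np.arctan2(ipd_m / 2.0, distance_m)

def stereo_pair(frame_angles, left_index, distance_m=1.5):
    """Pick the frame whose viewing angle around the object is closest to
    the left frame's angle plus the target vergence angle.

    frame_angles holds each keyframe's angle (radians) around the object.
    """
    desired = frame_angles[left_index] + target_vergence(distance_m=distance_m)
    right_index = int(np.argmin(np.abs(np.asarray(frame_angles) - desired)))
    return left_index, right_index
```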

“At step 213, the indices of the image frames are mapped to a rotation range for display. In some embodiments, the rotation range may be a concave arc around the object of interest; in other embodiments, a convex rotation may be used. Various rotation ranges can correspond to the various types of camera positions and translations described with reference to FIGS. 4, 6A-6B, 7A-7E, 8, 9, and 10. For example, in an image sequence of 150 images, keyframe 0 may be the first frame, corresponding to the beginning of the camera translation, and keyframe 150 may be the last frame, corresponding to the end of the camera translation. In some embodiments, the selected keyframes or captured frames are evenly distributed across the rotation range; in other embodiments, they may be distributed based on location and/or other IMU information.

“In different embodiments, the physical viewing position is matched to a frame index. For example, if the viewing device and/or user is positioned at the middle of the rotation range, an image frame corresponding to the middle of the camera translation should be displayed. This information may be loaded onto a viewing device, such as the headset 2500 described with reference to FIG. 25, so that an appropriate image and/or stereoscopic pair of images is displayed to the user based on the position of the headset 2500.
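A minimal sketch of mapping a physical viewing position to a frame index follows, assuming the keyframes are evenly distributed over a known yaw range; the range limits and the fractional return value (which could drive an interpolated frame) are illustrative.

```python
def frame_for_viewpoint(yaw_deg, num_frames, rotation_start_deg=-60.0,
                        rotation_end_deg=60.0):
    """Map a physical viewing angle to a frame index.

    yaw_deg is the headset/device yaw; the keyframes are assumed to be
    evenly distributed over [rotation_start_deg, rotation_end_deg].
    Returns (index, fraction), where fraction in [0, 1) can be used to
    request an interpolated frame between index and index + 1.
    """
    span = rotation_end_deg - rotation_start_deg
    t = (yaw_deg - rotation_start_deg) / span   # normalized position
    t = min(max(t, 0.0), 1.0)                   # clamp to the rotation range
    position = t * (num_frames - 1)
    index = int(position)
    return index, position - index

# Example: the middle of the rotation range maps to the middle keyframe.
print(frame_for_viewpoint(0.0, 151))   # -> (75, 0.0)
```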

“In different embodiments, AR/VR content generated using process flow 200 includes an object of interest that may be viewed by a user from various angles and/or viewpoints. In some embodiments, the surround view model is not an actual rendered three-dimensional model but rather a view that the user experiences as a three-dimensional model. For example, the surround view provides a three-dimensional view of the content without rendering and/or storing an actual three-dimensional model; in other words, there is no polygon generation or texture mapping over a three-dimensional mesh and/or polygon model. However, the user still perceives the content and/or context as an actual three-dimensional model. The three-dimensional effect provided by the surround view is generated simply through stitching of actual two-dimensional images and/or portions thereof. As used herein, the term 'three-dimensional model' is used interchangeably with this type of three-dimensional view.

“Surround View Generation”

“With reference to FIG. 3, shown is an example of a process flow 300 for generating a surround view. In the present example, a plurality of images is obtained at 302. According to different embodiments, the plurality of images can include images captured with various types of cameras. For example, the camera may be a digital camera in continuous shooting (or burst) mode that captures a certain number of frames in a given period of time, such as five frames per second. In other embodiments, the camera may be a camera on a smartphone. The camera can also be configured to capture the plurality of images as a continuous video.

According to different embodiments, the plurality of images can include two-dimensional (2D) images or data streams. These 2D images can include location information that can be used to generate a surround view, as described with regard to FIG. 1. In different examples, depth images can also include location information.

According to different embodiments, the plurality of images obtained at 302 can include a variety of sources and characteristics. For example, the plurality of images can be obtained from multiple users, such as a collection of 2D images or video obtained from different users and gathered from the internet. The plurality of images can also include images with different temporal information; in particular, the images can be taken of the same object of interest at different times. For instance, multiple images of a particular statue can be obtained at different times of day or in different seasons. In another example, the plurality of images can represent moving objects. For instance, the images may include an object of interest moving through scenery, such as a vehicle traveling along a road or a plane flying through the sky. In other instances, the images may include an object of interest that is itself in motion, such as a person running, dancing, or twirling.

“In the current example embodiment, the plurality of images is fused into content and context models at 304. According to different embodiments, the subject matter featured in the images can be separated into content and context. The context can be defined as the scenery surrounding the object of interest. In some embodiments, the content can be a three-dimensional model depicting an object of interest, while in other embodiments it can be a two-dimensional image.

“According to the present example embodiment, one or more enhancement algorithms can be applied to the content and context models at 306. These algorithms, such as automatic frame selection, stabilization, view interpolation, image rotation, infinite smoothing, filters, and/or compression, can be used to enhance the user experience. In some embodiments, these enhancement algorithms are applied to the image data during capture of the images; in other embodiments, they are applied after acquisition of the image data.

“In the present embodiment, a surround view is generated from the content and context models at 308. The surround view provides a multi-view interactive digital media representation, and in various examples it can include a three-dimensional model of the content and a model of the context. Depending on the mode of capture and the viewpoints of the images, the surround view model can include certain characteristics; examples of different styles of surround view include a locally concave surround view, a locally convex surround view, and a locally flat surround view. It should be noted, however, that surround views can include combinations of views and characteristics, depending on the application. In some embodiments, the surround view model is not an actual rendered three-dimensional model but rather a view that the user experiences as a three-dimensional model; for example, the surround view provides a three-dimensional view of the content without rendering and/or storing an actual three-dimensional model.

“With reference to FIG. 4, shown is an example of multiple camera frames that can be fused together to create a 3D model. According to many embodiments, multiple images can be captured from different viewpoints and fused together to provide a surround view. In the current example, three cameras 412, 414, and 416 are positioned at locations A 422, B 424, and X 426, respectively, in proximity to an object of interest 408. Scenery, such as object 410, can surround the object of interest 408. Frame A 402, frame B 404, and frame X 406 from their respective cameras 412, 414, and 416 include overlapping subject matter. Specifically, each frame 402, 404, and 406 includes the object of interest 408 and varying degrees of visibility of the scenery 410 surrounding the object. For instance, frame A 402 includes a view of the object of interest 408 in front of the cylinder that is part of the scenery surrounding the object 410, view 406 shows the object of interest 408 to one side of the cylinder, and view 404 shows the object of interest without any view of the cylinder.

In the present embodiment, the various frames, frame A 402, frame B 404, and frame X 406, along with their associated locations, location A 422, location B 424, and location X 426, provide a rich source of information about the object of interest 408 and the surrounding context. For instance, when viewed together, the various frames 402, 404, and 406 provide information about the different sides of the object of interest and its relationship to the scenery. According to different embodiments, this information can be used to parse out the object of interest 408 as content and the scenery as context. Furthermore, as described further with reference to the figures below, these viewpoints can be used to produce images that are immersive and interactive.

“In some embodiments, frame X 406 may be an artificially rendered image generated for a viewpoint at location X 426 on a trajectory between location A 422 and location B 424. In such an example, a single transform is used for viewpoint interpolation along the trajectory between the two frames, frame A 402 and frame B 404. Frame A 402 represents an image of objects 408 and 410 captured by camera 412 located at location A 422, and frame B 404 represents an image of object 408 captured by camera 414 located at location B 424. In the current example, the transformation between the two frames, denoted T_AB, is calculated; T_AB maps a pixel from frame A to frame B. This transformation can be performed using methods such as homography, affine, similarity, or translation.

“In this example, an artificially rendered image at location X 426 can be denoted by a viewpoint position x ∈ [0, 1] on the trajectory between frame A and frame B, where frame A is located at 0 and frame B is located at 1. The image is generated by interpolating the transformation and gathering and combining image information from frames A and B. In the current example, the transformation is interpolated to obtain T_AX and T_XB, which can be done by parameterizing the transformation T_AB and linearly interpolating those parameters. Note that this interpolation need not be linear; other interpolation methods are within the scope of this disclosure. Next, image information is gathered from both frames by transferring image information from frame A 402 to frame X 406 based on T_AX, and transferring image information from frame B 404 to frame X 406 based on T_XB. Finally, the image information gathered from both frames A and B is combined to generate an artificially rendered image at location X 426. Interpolation to render artificial frames is described further below with reference to FIGS. 13-21.
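The following sketch renders such an artificial frame from a homography T_AB, using a simple linear blend of matrices as the parameterization (one of many possibilities, which the disclosure leaves open) and combining the two warped frames weighted by their proximity to x.

```python
import cv2
import numpy as np

def render_intermediate(frame_a, frame_b, H_ab, x):
    """Render an artificial frame at position x in [0, 1] between A and B.

    H_ab is the 3x3 homography T_AB mapping frame A pixels to frame B.
    The transformation is parameterized here by a simple linear blend of
    matrices; image information is then gathered from both frames and
    combined, weighted by proximity to x.
    """
    h, w = frame_a.shape[:2]
    eye = np.eye(3)
    H_ab = H_ab / H_ab[2, 2]                            # normalize
    T_ax = (1.0 - x) * eye + x * H_ab                   # A -> X
    T_bx = x * eye + (1.0 - x) * np.linalg.inv(H_ab)    # B -> X

    warped_a = cv2.warpPerspective(frame_a, T_ax, (w, h))
    warped_b = cv2.warpPerspective(frame_b, T_bx, (w, h))
    return cv2.addWeighted(warped_a, 1.0 - x, warped_b, x, 0.0)
```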

“FIG. 5 illustrates an example of the separation of content and context in a surround view. According to different embodiments of this disclosure, a surround view is a multi-view interactive digital media representation of a scene 500. With reference to FIG. 5, shown is a user 502 located in a scene 500. The user 502 is capturing images of an object of interest, such as a statue. The digital visual data captured by the user can be used to create a surround view.

“According to different embodiments of this disclosure, the digital visual data included in a surround view can be, semantically and/or practically, separated into content 504 and context 506. In particular embodiments, content 504 can include the object(s), person(s), or scene(s) of interest, while the context 506 represents the remaining elements of the scene surrounding the content 504. In some examples, a surround view may represent the content 504 as three-dimensional data and the context 506 as a two-dimensional panoramic background. In other examples, a surround view may represent both the content 504 and the context 506 as two-dimensional panoramic scenes. In yet other examples, content 504 and context 506 may include three-dimensional components. In particular embodiments, the way in which the surround view depicts content 504 and context 506 depends on the capture mode used to acquire the images.

In some cases, including recordings of objects, people, or parts thereof where only the object, person, or part of them is visible, recordings of large flat areas, and recordings of scenes where no subject is close to the capture area, the content 504 and the context 506 may be the same. In these cases, the surround view produced may share some characteristics with other types of digital media, such as panoramas. According to different embodiments, however, surround views include additional features that distinguish them from these other types of digital media. For example, a surround view can represent moving data. Additionally, a surround view is not limited to a specific cylindrical, spherical, or translational movement; image data can be captured using a camera or any other capture device. Furthermore, unlike a stitched panorama, a surround view can display different sides of the same object.

“FIGS. 6A-6B illustrate examples of concave and convex views, respectively. These views are especially relevant when a camera phone is used, because the camera is located on the back of the phone and faces away from the user. In particular, concave and convex views can affect how the content and context are identified in a surround view.

“With reference to FIG. 6A, shown is an example of a concave view 600 in which a user is standing along a vertical axis 608. In this example, the user holds a camera such that camera location 602 does not leave axis 608 during image capture. However, as the user pivots about axis 608, the camera captures a panoramic view of the scene around the user, forming a concave view. In this embodiment, the object of interest 604 and the distant scenery 606 are viewed similarly because of the way in which the images are captured. In this example, all objects in the concave view appear at infinity, so the content is equal to the context.

“With reference to FIG. 6B, shown is an example of a convex view 620 in which a user changes position while capturing images of an object of interest 624. In this example, the user moves around the object of interest 624, taking pictures of its different sides from camera locations 628, 630, and 632. Each of the captured images includes a view of the object of interest as well as the distant scenery 626 in the background. In this example, the object of interest 624 represents the content, and the distant scenery 626 represents the context.

“FIGS. 7A-7E illustrate examples of various capture modes for surround views. Although various motions can be used to capture a surround view, and a surround view is not restricted to any particular type of motion, three general types of motion can be used to capture particular features or views described in conjunction with surround views. These three types of motion, respectively, can yield a locally concave surround view, a locally convex surround view, and a locally flat surround view. In some examples, a surround view can include various types of motion within the same surround view. In FIGS. 7A-7E, the type of surround view (for instance, concave or convex) is described with reference to the direction in which the camera view faces.

“With reference to FIG. 7A, shown is an example of a concave surround view being captured with a back-facing camera. According to different embodiments, a locally concave surround view is one in which the viewing angles of the camera or other capture device diverge. In one dimension, this can be likened to the motion required to capture a spherical 360 panorama (pure rotation), although the motion can be generalized to any curved sweeping motion in which the view faces outward. In the current example, the experience is that of a stationary observer looking out at a (possibly dynamic) context.

“In the current example embodiment, a user 702 is using a back-facing camera 706 to capture images towards world 700, and away from user 702. As described in various examples, a back-facing camera refers to a device with a camera that faces away from the user. The camera is moved in a concave motion 708, such that views 704a, 704b, and 704c capture various parts of capture area 709.

“With reference to FIG. 7B, shown is an example of a convex surround view being captured with a back-facing camera. According to different embodiments, a locally convex surround view is one in which the viewing angles converge toward a single object of interest. A locally convex surround view can give the viewer the experience of orbiting around a point, such that the viewer can see multiple sides of the same object. This object, which may be an 'object of interest,' can be segmented from the surround view to become the content, and any surrounding data can be segmented to become the context. Previous technologies failed to recognize this type of viewing angle.

“In the current example embodiment, a user 702 is using a back-facing camera 714 to capture images towards world 700, and away from user 702. The camera is moved in a convex motion 710, such that views 712a, 712b, and 712c capture various parts of capture area 711. As described above, the convex motion 710 can orbit around an object of interest, and views 712a, 712b, and 712c can include views of different sides of this object.

“With reference to FIG. 7C, shown is an example of a concave surround view being captured with a front-facing camera. A front-facing camera refers to a device with a camera that faces towards the user, such as the camera on the front of a smart phone. For instance, front-facing cameras are commonly used to take 'selfies' (i.e., self-portraits of the user).

In the current example embodiment, camera 720 is facing user 702. The camera follows a concave motion 706, such that views 718a, 718b, and 718c diverge from each other in an angular sense. The capture area 717 follows a concave shape that includes the user at a perimeter.

“With reference to FIG. 7D, shown is an example of a convex surround view being captured with a front-facing camera. In the current example embodiment, camera 726 is facing user 702. The camera follows a convex motion 722, such that views 724a, 724b, and 724c converge towards the user 702. The capture area follows a convex shape that surrounds the user 702.

“With reference to FIG. 7E, shown is an example of a flat view being captured with a back-facing camera. In particular embodiments, a locally flat surround view is one in which the rotation of the camera is small compared to its translation. In a locally flat surround view, the viewing angles remain roughly parallel, and the parallax effect dominates. This surround view can contain an 'object of interest,' but its position does not remain fixed across the different views. Previous technologies also failed to recognize this type of viewing angle in the media-sharing environment.

“In the current example embodiment, camera 732 is facing away from user 702, and towards world 700. The camera follows a generally linear motion 728, such that the capture area 729 follows a line. Views 730a, 730b, and 730c have generally parallel lines of sight. When an object is viewed in multiple views, the background scenery can appear to shift or change between views, and a slightly different side of the object may be visible in different views. Using the parallax effect, information about the position and characteristics of the object can be generated in a surround view that provides more information than any single static image.

As described above, various modes can be used to capture images for a surround view, including locally concave, locally convex, and locally linear motions. Such motions can be used to capture either separate, individual images or a continuous recording of a scene; a continuous recording can capture a series of images during a single session.

According to different embodiments of this disclosure, data can be acquired in a variety of ways to create a surround view. For example, data can be acquired by moving a camera through space, as shown in FIG. 7 of U.S. Patent Application Ser. No. 14/530,669. In particular, a user can tap a record button on a capture device to begin recording. As the capture device moves leftward, the object of interest appears to move in a generally rightward direction across the screen. The record button can be tapped again to stop recording; in other examples, the user can tap and hold the record button while recording and release it to stop. In the present embodiment, a series of images is captured during the recording that can be used to create a surround view.

According to different embodiments, a user can capture a sequence of images used to create a surround view by recording a scene or an object of interest. In other cases, multiple users can contribute to acquiring the set of images used to create a surround view. With reference to FIG. 8, shown is an example of a space-time surround view being simultaneously recorded by independent observers.

“In the current example embodiment, cameras 804, 806, 808, 810, and 812 are positioned at different locations. In some examples, these cameras can be associated with independent observers; for instance, the independent observers could be audience members at a concert, show, or other event. In other examples, the cameras 804, 806, 808, 810, and 812 could be placed on stands or tripods. In the present embodiment, the cameras are used to capture views 804a, 806a, 808a, 810a, and 812a, respectively, of an object of interest 800, with world 802 providing the background scenery. In some examples, the images captured by cameras 804, 806, 808, 810, and 812 can be aggregated and used together in a single surround view. Each of the cameras provides a different vantage point relative to the object of interest 800, so aggregating the images from these different locations provides information about different viewing angles of the object of interest 800. In addition, the cameras can provide a series of images captured from their respective locations over a span of time, so that the surround view generated from them can include temporal information and can also indicate movement over time.

“As described above with regard to different embodiments, a surround view can be associated with a variety of capture modes. In addition, a surround view can include different capture modes or different capture motions within the same surround view. Furthermore, in some examples, a surround view can be broken down into smaller parts, as shown in FIG. 10 of U.S. Patent Application Ser. No. 14/530,669. For example, a complex surround view can be separated into smaller, linear parts. In particular, a complex surround view may include a capture area that follows an L-shaped motion, which includes two separate linear motions; such a complex surround view can be broken down into two separate surround views corresponding to those linear motions. It should be noted that although the two linear motions of the complex surround view can be captured sequentially and continuously in some embodiments, they can also be captured in separate sessions in other embodiments.

“In certain embodiments, the two surround views can be processed independently and then combined with a transition between them to provide the user with a continuous experience. Breaking down motion into smaller linear components in this manner can provide various advantages. For example, these smaller linear components can be broken down into discrete, loadable parts, which aids compression of the data for bandwidth purposes. Similarly, non-linear surround views can also be separated into discrete components. In some examples, surround views can be broken down based on local capture motion: a complex motion may be broken down into a locally convex portion and a linear portion, or into separate, locally convex portions. It should be recognized that any number of motions can be included in a complex surround view, and that a complex surround view can be broken down into any number of separate parts, depending on the application.

“Although it may be desirable in some applications to separate complex surround views, in other applications it is desirable to combine multiple surround views. With reference to FIG. 9, shown is an example of a graph that combines multiple surround views into a multi-surround view 900. In this example, the rectangles represent the various surround views 902, 904, 906, 908, 910, 912, 914, and 916, and the length of each rectangle indicates the dominant motion of that surround view. Lines between the surround views indicate possible transitions 918, 920, 922, 924, 926, 928, and 930 between them.

“In some examples, surround views can be used to partition scenes both spatially and temporally in a very efficient manner. For very large-scale scenes, multi-surround view 900 data can be used. In particular, a multi-surround view 900 can include a collection of surround views connected together in a spatial graph. The individual surround views can be collected by a single source, such as a single user, or by multiple sources, such as multiple users. In addition, the individual surround views can be captured in sequence, in parallel, or completely uncorrelated at different times. However, in order to connect the individual surround views into a portion of the multi-surround view 900, there must be some overlap of content, context, and/or location. This overlap allows individual surround views to be linked together and stitched into a multi-surround view 900. According to different examples, any combination of front-facing, back-facing, or front-and-back-facing cameras can be used.

“Multi-surround views 900 can be used to capture entire environments in some embodiments. Similar to the way “photo tours” collect photographs into a graph of discrete, spatially-neighboring components, multiple surround views can be combined into an entire scene graph. This can be achieved using information such as image matching/tracking and depth matching/tracking. Within such a graph or multi-surround view, a user can switch between different surround views at particular points in the recorded motion. Multi-surround views can be more appealing than traditional “photo tours” because the user can navigate the surround views as desired, and because much more visual information can be stored in surround views. In contrast, traditional “photo tours” typically present a limited set of views, shown to the viewer either automatically or by allowing the viewer to pan through a panorama with a computer mouse or keystrokes.

According to various embodiments, a surround view is generated from a collection of images. These images can be captured by a user who intends to produce the surround view, or they can be retrieved from storage, depending on the application. Because a surround view is not limited or restricted to a particular amount of visibility, it can provide significantly more visual information about different views of the same object or scene. More specifically, although a single view may be ambiguous when used to describe a three-dimensional object, multiple views of that object can provide more specific and detailed information. Such multiple views can provide enough information to allow a visual search query to yield more accurate results. Because a surround view provides views of an object from many sides, distinctive views that are appropriate for search can be selected from the surround view or requested from the user if a distinctive view is not available. For instance, if the data captured or otherwise provided is not sufficient to allow recognition or generation of the object or scene of interest with sufficient certainty, the capturing system can guide the user to continue moving the capturing device or to provide additional image data. In particular embodiments, a user may be prompted to provide additional images when the surround view is determined to need them to produce a more accurate model.

A surround view can be used in many applications, depending on the particular embodiment. In one example, a surround view allows a user to navigate the surround view or otherwise interact with it. According to various embodiments, a surround view is intended to give the user the experience of being present in the scene as the user interacts with the surround view. The experience also depends on the type of surround view being viewed. In particular, a surround view does not have to have one fixed geometry overall; instead, different geometries, such as concave, flat, or convex, can be represented over particular segments of the surround view.

“In particular example embodiments, the mode of navigation is informed by the geometry of the surround view. For concave surround views, rotating a device, such as a smartphone, can be likened to rotating a stationary observer who is looking out at the surrounding scene. In some applications, swiping the screen in one direction can cause the view to rotate in the opposite direction, similar to a user standing inside a hollow cylinder and causing its walls to rotate around him. For convex surround views, rotating the device can cause the view to orbit in the direction in which it is leaned; in some applications, swiping the screen in one direction causes the viewing angle to rotate in that direction, creating the sensation that the object of interest is rotating about its axis. For flat views, the view can translate in the opposite direction of the user's movement; swiping the screen in one direction can cause the view to move in the opposite direction, as if the foreground objects were being pushed to the side.
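As an illustration only, the geometry-dependent swipe handling described above could be sketched as follows; the function name, the gain constant, and the sign conventions are assumptions made for this example rather than details taken from the present disclosure.

```python
def swipe_to_view_change(delta_x_px: float, geometry: str, gain: float = 0.25) -> float:
    """Map a horizontal swipe (in pixels) to a change in viewing angle (degrees).

    The sign conventions below are one possible choice; as noted above,
    different applications may invert them.
    """
    if geometry == "concave":
        # Stationary observer: the view rotates opposite the swipe direction.
        return -gain * delta_x_px
    if geometry == "convex":
        # Orbit the object of interest: the view rotates with the swipe.
        return gain * delta_x_px
    if geometry == "flat":
        # Translation-like motion: foreground appears pushed to the side.
        return -gain * delta_x_px
    raise ValueError(f"unknown geometry: {geometry}")
```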

“In some cases, the user may be able to navigate a multi-surround view or a graph of surround views. Individual surround views can be loaded in pieces, and additional surround views may be loaded when needed (e.g., when they are adjacent to or overlap the current surround view, or when the user navigates toward them). When the user reaches a point in a surround view where two or more surround views overlap, the user can choose which of the overlapping surround views to follow. In some instances, the surround view to follow can be selected based on the direction in which the device is moved or swiped.

“With reference to FIG. 10, an example of a process 1000 for navigating a surround view is shown. In the present example, a request is received from a user at 1002 to view an object of interest in a surround view. The request can also be a generic request to view a surround view without a particular object of interest, as in the case of a landscape or panoramic view. At 1004, a three-dimensional model of the object is accessed. The three-dimensional model can include all or a portion of a stored surround view, and in some applications it can include a segmented content view. An initial image is then sent from a first viewpoint to an output device at 1006. This first viewpoint serves as a starting point for viewing the surround view on the output device.

“In the present embodiment, a user action is then received at 1008 to view the object of interest from a second viewpoint. This user action can include moving (e.g., tilting, translating, rotating, etc.) an input device, swiping the screen, and so on, depending on the application. For instance, the user action can correspond to motion associated with a locally concave surround view, a locally convex surround view, or a locally flat surround view. Based on the characteristics of the user action, the three-dimensional model is processed at 1010. In some applications, the input device and the output device can both be included in a mobile device. In some cases, the requested image corresponds to an image captured prior to generation of the surround view; in other cases, the requested image is generated based on the three-dimensional model (e.g., by interpolation). An image from this viewpoint can then be sent to the output device at 1012. In certain embodiments, the image sent to the output device can be accompanied by a degree of certainty regarding its accuracy. For instance, when interpolation algorithms are used to generate an image from a particular viewpoint, the degree of certainty can vary and may be provided to the user in some applications. In other examples, a message can be sent to the output device indicating that the surround view contains insufficient information to provide the requested image.

“In some embodiments, intermediate images can be sent between the initial image sent at 1006 and the requested image sent at 1012. In particular, these intermediate images can correspond to viewpoints located between the first viewpoint associated with the initial image and the second viewpoint associated with the requested image. Furthermore, these intermediate images can be selected based on the characteristics of the user action. For instance, the intermediate images can follow the path of movement of the input device associated with the user action, such that the intermediate images provide a visual navigation of the object of interest.

“Segmentation of the Object of Interest and Background”

According to various aspects of the present disclosure, AR/VR content can also be generated by extracting an object of interest or other content, such as a person, from a sequence of images in order to separate it from the background and other context imagery. This can be achieved by applying various segmentation algorithms to the images. In some embodiments, semantic segmentation using neural networks is performed. In further embodiments, fine-grained segmentation refinement is performed; for example, fine-grained segmentation may use temporal conditional random fields.

“With reference to FIG. 11, an example of a method 1100 for semantic segmentation of image frames is shown, in accordance with one or more embodiments. In various embodiments of semantic segmentation, a neural network system is trained to identify and label pixels according to a corresponding category or class. In some embodiments, a convolutional neural network is used, and the neural network system may include multiple computational layers.
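As a minimal per-frame sketch of this kind of semantic segmentation, the following uses an off-the-shelf DeepLabV3 model from torchvision. The application does not specify a particular architecture, training set, or label set, so the choice of network, the "person" class index from the VOC label set, and the weight-loading argument are assumptions and may vary with the torchvision version.

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Pretrained segmentation network used purely for illustration.
model = torchvision.models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def segment_person(frame: Image.Image) -> torch.Tensor:
    """Return a boolean H x W mask that is True on pixels labeled 'person'."""
    x = preprocess(frame).unsqueeze(0)        # 1 x 3 x H x W
    with torch.no_grad():
        logits = model(x)["out"][0]           # num_classes x H x W scores
    labels = logits.argmax(dim=0)             # per-pixel class index
    return labels == 15                       # 15 is 'person' in the VOC label set

# masks = [segment_person(Image.open(p).convert("RGB")) for p in frame_paths]
```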

Summary for “Real-time mobile device capture, generation of AR/VR material”

Modern computing platforms and technologies are shifting to mobile and wearable devices, which include camera sensors as native acquisition streams. This has made it more obvious that people want to preserve digital moments in a different format than traditional flat (2D) images and videos. Digital media formats are often restricted to passive viewing. A 2D flat image, for example, can only be viewed from one side and cannot be zoomed in or out. Traditional digital media formats such as 2D flat images are not well suited to reproducing memories or events with high fidelity.

“Producing combined images such as a panorama or a 3-D (3D) model or image requires data from multiple images. This can involve interpolation or extrapolation. The majority of the existing methods of extrapolation and interpolation require significant amounts of additional data beyond what is available in the images. These additional data must describe the scene structure in dense ways, such as a dense depthmap (where depth values are stored for each pixel) or an optical flowmap (which stores the motion vector between all the images). Computer generation of polygons, texture mapping over a 3-dimensional mesh and/or 3D models are other options. These methods also require significant processing time and resources. These methods are limited in terms of processing speed and transfer rates when it is sent over a network. It is therefore desirable to have improved methods for extrapolating 3D image data and presenting it.

“Provided is a variety of mechanisms and processes that allow AR/VR content to be captured and generated in real-time. One aspect of the invention, which can include at least some of the subject matter from any of the preceding or following examples and aspects, is a method of generating a 3D projection of an object in virtual reality (or augmented reality) environment. This method involves obtaining sequences of images with a single lens camera. Each image in the sequence contains at most a portion of the overlapping subject matter, which also includes the object.

The method also involves semantically segmenting an object from a sequence using a trained neural networks to create a sequence segmented object images. This method could also include stabilizing the sequence using focal length and camera rotation values, before semantically segmenting an object from the sequence.

The sequence of segmented object images can then be refined using fine-grained segmentation. In some embodiments, a temporal conditional random field is used to refine the sequence of segmented object images; the temporal conditional random field uses a graph of neighboring images for each image to be refined. The method further includes computing interpolation parameters on the fly. In some embodiments, the on-the-fly interpolation parameters are used to generate interpolated images along any point in the camera translation in real time.

The method also involves generating stereoscopic pairs from the sequence of segmented object images for displaying the object as a 3D projection in the virtual reality or augmented reality environment. The stereoscopic pairs may be generated for one or more points along the camera translation, and a stereoscopic pair may include an interpolated virtual image frame. A selected frame may be modified by rotating its image such that the selected frames correspond to views angled toward the object. The sequence of segmented images is fused to generate a projection of the object, which displays a 3D view of the object without requiring polygon generation. The segmented image indices are then mapped to a rotation range for display in the virtual reality or augmented reality environment. Mapping the segmented image indices can include mapping physical viewing locations to a frame index.

“Other embodiments of this disclosure include the corresponding devices, systems and computer programs that are configured to perform the described actions. A non-transitory computer-readable medium may include one or more programs that can be executed by a computer system. In some embodiments, one or more of the programs includes instructions for performing actions according to described methods and systems. Each of these other implementations can optionally include one or several of the following features.

“In another aspect, which may include at least a portion of the subject matter of any of the preceding and/or following examples and aspects, a system for generating a three-dimensional (3D) projection of an object in a virtual reality or augmented reality environment includes a single lens camera for capturing a sequence of images along a camera translation, where each image in the sequence contains at least a portion of overlapping subject matter, which includes the object. The system also includes a display module, a processor, and memory storing one or more programs for execution by the processor. The one or more programs include instructions for performing the actions of the described methods and systems.

These and other embodiments will be described in detail below, with reference to the Figures.

“Reference will now be made to specific examples of the disclosure, including the best modes contemplated by the inventors for carrying out the disclosure. Examples of these specific embodiments are illustrated in the accompanying drawings. Although the present disclosure is described in conjunction with specific embodiments, it should be understood that this is not intended to limit the disclosure to those embodiments. On the contrary, it is intended to cover alternatives, modifications, and equivalents as may be included within the spirit and scope of the disclosure as defined by the appended claims.

“The following description is provided to give a thorough understanding of the disclosure. Some embodiments of the present disclosure may be implemented without some or all of these details. In other instances, well-known process operations have not been described in detail so as not to obscure the present disclosure.

U.S. Patent Application Ser. No. No. A surround view, according to different embodiments, allows a user to adjust the viewpoint of visual information on a screen.

U.S. Patent Application Ser. No. 14/800,638, filed by Holzer et al. on Jul. 15, 2015 and titled ARTIFICIALLY RENDERING IMAGES USING INTERPOLATION OF TRACKED CONTROL POINTS, describes, according to various embodiments, how artificial images can be interpolated between selected keyframes or captured image frames, and/or used in conjunction with one or more frames of a stereoscopic pair. This interpolation can be implemented as an infinite smoothing method to generate as many intermediate frames as necessary to achieve a smooth transition between frames.

U.S. Patent Application Ser. No. 15/408,270, titled STABILIZING IMAGE SERIES BASED ON CAMERA ROTATION AND FOCAL LENGTH and filed Jan. 17, 2017, is incorporated herein by reference in its entirety. The systems and methods described therein may be used to create stereoscopic pairs of image frames that can be presented to the user to provide a perception of depth.

“Overview”

According to various embodiments, a surround view is a multi-view interactive digital media representation. The surround view may include content for virtual reality (VR) and/or augmented reality (AR), and may be presented to a user with a viewing device, such as a virtual reality headset. In various embodiments, a structured concave sequence of images may be captured live around an object of interest, and the resulting surround view may then be presented as a hologram-like model when viewed with a viewing device. The term “AR/VR” shall be used herein when referring to both augmented reality and virtual reality.

The data used to generate a surround view can come from a variety of sources. In particular, data such as, but not limited to, two-dimensional (2D) images can be used to generate a surround view. These 2D images can be captured with a camera moving along a camera translation, which may or may not be uniform, and can be captured at constant intervals of time and/or distance along the camera translation. The 2D images can include color image data streams, such as multiple sequences of images, video data, etc., or multiple images in any of various formats, depending on the application. Another source of data that can be used to generate a surround view includes location information obtained from sources such as GPS, WiFi, IMUs (inertial measurement unit systems), accelerometers, and magnetometers. Yet another source of data that can be used to generate a surround view includes depth images.
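A minimal sketch of how such multi-sensor capture data might be organized before fusion is shown below; the class and field names are illustrative assumptions, not structures defined by the present disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
import numpy as np

@dataclass
class CaptureSample:
    """One sample recorded along the camera translation."""
    image: np.ndarray                          # H x W x 3 color frame
    timestamp: float                           # seconds since the start of capture
    gps: Optional[Tuple[float, float]] = None  # (latitude, longitude), if available
    gyro: Optional[np.ndarray] = None          # 3-axis rotation rate from the IMU
    accel: Optional[np.ndarray] = None         # 3-axis acceleration from the IMU
    depth: Optional[np.ndarray] = None         # H x W depth map, if a depth sensor exists

# A capture session is then an ordered list of samples taken along the camera
# translation, which downstream blocks can fuse into content and context models.
```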

In the current example embodiment, the data can then be fused together. In some embodiments, a surround view can be generated from a combination of 2D image data and location information; in other embodiments, depth images and location information can be used together. Various combinations of image data can be used with location information, depending on the application. In the present example, the fused data is then used to model the content and context. The context can be defined as the scenery surrounding the object of interest, and the content can be presented as a three-dimensional model depicting the object of interest, although in some embodiments the content can be presented as a two-dimensional image. Furthermore, in some embodiments, the context can be presented as a two-dimensional model depicting the scenery surrounding the object of interest. Although in many examples the context provides two-dimensional views of the scenery surrounding the object of interest, the context can also include three-dimensional aspects in some embodiments. The surround view can be generated using systems and methods described in U.S. Patent Application Ser. No. 14/530,669, titled ANALYSIS AND MANIPULATION OF IMAGES AND VIDEO TO GENERATE SURROUND VIEWS.

“In the current example embodiment, one or more enhancement algorithms can be applied. In particular example embodiments, various algorithms can be employed during capture of surround view data, regardless of the capture mode employed. These algorithms can be used to enhance the user experience; for instance, automatic frame selection, image stabilization, and object segmentation can be used. In some examples, these enhancement algorithms can be applied to image data after acquisition of the data; in other examples, these enhancement algorithms can be applied to surround view data while the image data is being captured. Automatic frame selection can be used to reduce storage requirements by saving only keyframes from all of the captured images, which also allows a more uniform spatial distribution of viewpoints. Image stabilization can be used to stabilize keyframes in a surround view to produce improvements such as smoother transitions and enhanced focus.

“View interpolation can also be used to improve the viewing experience. In particular, to avoid sudden “jumps” between stabilized frames, synthetic intermediate views can be rendered on the fly. In some embodiments, view interpolation may be applied only to the foreground, such as the object of interest. This process can be informed by content-weighted keypoint tracking and IMU information, as well as by denser pixel-to-pixel matches. The process can be simplified if depth information is available. In some embodiments, view interpolation may be applied during capture of a surround view; in other embodiments, view interpolation may be applied during surround view generation. These and other enhancement algorithms may be described with reference to U.S. Patent Application Ser. No. 14/800,638, titled ARTIFICIALLY RENDERING IMAGES USING INTERPOLATION OF TRACKED CONTROL POINTS, and U.S. Patent Application Ser. No. 14/860,983, titled ARTIFICIALLY RENDERING IMAGES UTILIZING VIEWPOINT INTERPOLATION AND EXTRAPOLATION.

The surround view data may be further processed to generate content for AR and/or VR viewing. In various embodiments, additional image processing can be used to create a stereoscopic three-dimensional view of an object of interest that is presented to a user on a viewing device, such as a virtual reality headset. In various examples, the subject matter in the images can be separated into content (the foreground) and context (the background) using semantic segmentation with neural networks and/or fine-grained segmentation refinement using temporal conditional random fields. The resulting separation may be used to remove the background imagery from the foreground such that only the images corresponding to the object of interest are displayed. To create stereoscopic image pairs, stabilization of the images may be achieved by determining the focal length and rotation of the images, as described in U.S. Patent Application Ser. No. 15/408,270, referenced above.

“View interpolation can also be used to infinitely smooth the transitions between image frames by generating any number of intermediate artificial image frames. In addition, captured keyframes and interpolated frames can be grouped into stereoscopic pairs of image frames. The surround view may be presented using these stereoscopic pairs, allowing the user to perceive depth and thereby enhancing the user's experience of the three-dimensional surround view. The frames in each stereoscopic pair may include 2D images used to create the surround view, and may be separated by a spatial baseline determined from a predetermined angle at a desired focal point and the distance from that focal point. One or more images in a stereoscopic pair may also be corrected by rotating the image such that the line of sight to the object of interest is perpendicular to the plane of the image frame. Because stereoscopic pairs of frames can be generated from the images captured along a single camera translation, a depth experience can be provided without the need to capture or store additional images.

“The image frames are then mapped to a rotation range for display, such that movement of the user and/or the viewing device determines which image frames to display. For example, the image indices can be matched to various physical viewing locations along the camera translation around the object of interest. Thus, a user can view a three-dimensional surround view of the object of interest from various angles and at various focal lengths. This surround view provides a three-dimensional view of the object without rendering and/or storing an actual three-dimensional model; instead, the surround view provides a three-dimensional effect simply by stitching together actual two-dimensional images, or portions thereof, and grouping stereoscopic pairs.

According to various embodiments, surround views provide numerous advantages over traditional two-dimensional images or videos, including the ability for a user to interact with the surround view and change its viewpoint.

“In particular examples, the characteristics described above can be incorporated natively in the surround view representation and provide the capability for use in various applications. For instance, surround views can be used in many fields, such as e-commerce, visual search, file sharing, user interaction, and entertainment. A surround view may also be displayed to a user as virtual reality (VR) and/or augmented reality (AR) content on a viewing device, such as a virtual reality headset. VR applications may simulate a user's physical presence in an environment and enable the user to interact with that environment and any objects depicted in it. Images may also be presented to a user as an AR view, in which a direct or indirect view of a physical, real-world environment is shown in real time with elements that are augmented (or supplemented) by computer-generated sensory input such as sound, video, or graphics. Such AR/VR content can be generated on the fly using the systems and methods described herein, thereby decreasing the number of images and the amount of other data that must be stored. In addition, these systems and methods may reduce processing time and power requirements, allowing AR and/or VR content to be generated more quickly, in real-time or near real-time.

“Example Embodiments”

“According to various embodiments of the present disclosure, a surround view is a multi-view interactive digital media representation. With reference to FIG. 1, an example of a system 100 that can be used to capture and generate augmented reality (AR) and/or virtual reality (VR) content in real time is shown. In the current example embodiment, the system 100 is depicted in a flow sequence that can be used to generate a surround view for AR/VR. According to various embodiments, the data used to generate a surround view can come from a variety of sources. In particular, data such as, but not limited to, two-dimensional (2D) images 104 can be used to generate a surround view. These 2D images can include color image data streams, such as multiple image sequences or video data, or multiple images in various formats, depending on the application. Another source of data that can help generate a surround view is location information 106, which can come from many sources, such as GPS, WiFi, and magnetometers. Yet another source of data that can be used to generate a surround view includes depth images. These depth images can include depth or 3D image data streams and can be captured by devices such as stereo cameras, time-of-flight cameras, three-dimensional cameras, and the like.

“In the current example embodiment, the data can then be fused together at sensor fusion block 110. In some embodiments, a surround view can be generated from a combination of 2D images 104 and location information 106. In other embodiments, depth images 108 can be combined with location information 106 at sensor fusion block 110. Depending on the application and the available data, various combinations of image data and location information can be fused.

“In the current example embodiment, the data fused at sensor fusion block 110 can then be used for content modeling 112 and context modeling 114. As described in more detail with reference to FIG. 5, the subject matter featured in the images can be separated into content and context. The context can be defined as the scenery surrounding the object of interest. According to various embodiments, the content can be a three-dimensional model depicting an object of interest, although in some cases the content can be a two-dimensional image, as described in more detail with reference to FIG. 4. Furthermore, in some embodiments, the context can be a two-dimensional model depicting the scenery surrounding the object of interest. Although in many examples the context provides two-dimensional views of the scenery surrounding the object of interest, the context can also include three-dimensional aspects in some embodiments. For instance, the context can be described as a “flat” image along a cylindrical “canvas,” such that the “flat” image appears on the surface of a cylinder. In addition, some examples may include three-dimensional context modeling, such as when objects are identified as three-dimensional objects in the surrounding scenery. According to various embodiments, the models generated by content modeling 112 and context modeling 114 can be created by combining the image and location data, as described in more detail with reference to FIG. 4.

According to various embodiments, the content and context of a surround view are determined based on a specified object of interest. In some examples, an object of interest is selected automatically based on processing of the image and location data; for instance, if a dominant object is identified in a set of images, that object can be chosen as the content. In other examples, a user-specified target 102 can be chosen, as shown in FIG. 1. It should be noted, however, that a surround view can be generated in certain applications without a user-specified target.

“In the current example embodiment, one or more enhancement algorithms can be applied at enhancement algorithm(s) block 116. In particular example embodiments, various algorithms can be employed during capture of surround view data, regardless of the capture mode employed. These algorithms can be used to enhance the user experience; for instance, automatic frame selection, stabilization, and view interpolation can all be used. In some examples, these enhancement algorithms can be applied to image data after acquisition; in other examples, these enhancement algorithms can be applied to image data while the surround view data is being captured.

“According to particular example embodiments, automatic frame selection can be used to create a more pleasant surround view. Specifically, frames are automatically selected so that the transitions between them are smoother or more even. In some applications, this automatic frame selection can incorporate blur and overexposure detection, as well as more uniform sampling of poses so that they are more evenly distributed.
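A minimal sketch of such automatic frame selection is shown below, assuming OpenCV is available. The sharpness and exposure thresholds, the use of the variance of the Laplacian as a blur measure, and the per-frame yaw estimates are illustrative assumptions rather than details specified by the present disclosure.

```python
import cv2
import numpy as np

def sharpness(gray: np.ndarray) -> float:
    # Variance of the Laplacian: low values indicate blur.
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def well_exposed(gray: np.ndarray, lo: int = 10, hi: int = 245) -> bool:
    # Reject frames dominated by crushed shadows or blown highlights.
    return lo < np.median(gray) < hi

def select_keyframes(frames: list, yaw_deg: list, step_deg: float = 3.0) -> list:
    """Pick roughly evenly spaced, sharp, well-exposed keyframes.

    frames   -- BGR images in capture order
    yaw_deg  -- estimated viewing angle for each frame (e.g., from the IMU)
    step_deg -- target angular spacing between keyframes
    """
    keyframes, last_yaw = [], None
    for frame, yaw in zip(frames, yaw_deg):
        if last_yaw is not None and abs(yaw - last_yaw) < step_deg:
            continue                                   # too close to the last keyframe
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if sharpness(gray) < 50.0 or not well_exposed(gray):
            continue                                   # blurry or badly exposed
        keyframes.append(frame)
        last_yaw = yaw
    return keyframes
```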

“In some embodiments, image stabilization can be used for a surround view in a manner similar to that used for video. In particular, keyframes in a surround view can be stabilized to produce improvements such as smoother transitions and improved or enhanced focus on the content. Unlike video, however, there are many additional sources of stabilization for a surround view, such as IMU information, depth information, computer vision techniques, direct selection of the area to be stabilized, face detection, and the like.

IMU information, for example, can be very helpful for stabilization. In particular, IMU information provides an estimate, although sometimes a rough or noisy one, of the camera tremor present during image capture. This estimate can be used to remove, cancel, or reduce the effects of such camera tremor.
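As a hedged illustration of how a gyroscope-based rotation estimate could be used to counteract camera tremor, the sketch below warps a frame with a pure-rotation homography. The intrinsic matrix K and the accumulated rotation vector are assumed to be available from the device; this is only one simple way to apply the idea described above.

```python
import cv2
import numpy as np

def cancel_rotation(frame: np.ndarray, rotvec: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Warp a frame to cancel a small camera rotation estimated from the gyroscope.

    rotvec -- axis-angle rotation (radians) accumulated from gyro samples
    K      -- 3x3 camera intrinsic matrix
    """
    R, _ = cv2.Rodrigues(rotvec)              # rotation to be removed
    H = K @ R.T @ np.linalg.inv(K)            # pure-rotation (infinite-depth) homography
    h, w = frame.shape[:2]
    return cv2.warpPerspective(frame, H, (w, h))

# K would typically be built from the device's reported focal length, and
# rotvec from integrating gyro readings between consecutive frames.
```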

“In some examples, depth information, if available, can be used to provide stabilization for a surround view. Because points of interest in a surround view are three-dimensional rather than two-dimensional, these points are easier to track and match as the search space shrinks. Descriptors for points of interest can use both color and depth information, making them more discriminative. In addition, automatic or semi-automatic content selection can be easier to provide with depth information. For instance, when a user selects a particular pixel of an image, the selection can be expanded to fill the entire surface that the pixel belongs to. Furthermore, content can also be selected automatically by using foreground/background differentiation based on depth. In certain situations, the content can then remain visible and stable.

According to various examples, computer vision techniques can also be used to provide stabilization for surround views. For instance, keypoints can be detected and tracked. However, certain scenes, such as dynamic scenes or static scenes with parallax, cannot be stabilized by a simple warp. Consequently, there is a trade-off in which certain aspects of the scene receive more attention for stabilization and other parts of the scene receive less. Because a surround view is often focused on a particular object of interest, the stabilization can be content-weighted in certain examples so that the object of interest is maximally stabilized.

Another way to improve stabilization in a surround view is direct selection of a specific region on a screen. For instance, if a user taps on a specific area of the screen to focus and then records a convex surround view, the region that was tapped can be maximally stabilized. This allows the stabilization algorithms to be focused on a particular area or object of interest.

“In some examples, face detection can be used to provide stabilization. For instance, when recording with a front-facing camera, it is often likely that the user is the subject of interest in the scene. Thus, face detection can be used to weight the stabilization toward that area. If face detection is precise enough, facial features themselves (such as the eyes, nose, and mouth) can be used as the areas to stabilize, rather than generic keypoints.

“According to various examples, view interpolation can be used to improve the viewing experience. In particular, to avoid sudden “jumps” between stabilized frames, synthetic intermediate views can be rendered on the fly. This can be informed by content-weighted keypoint tracking and IMU information, as well as by denser pixel-to-pixel matches. The process can be simplified if depth information is available. In some embodiments, view interpolation may be applied during capture of a surround view; in other embodiments, view interpolation may be applied during surround view generation.

“In some embodiments, view interpolation can be implemented as infinite smoothing, which may also be used to improve the viewing experience by providing smoother transitions between displayed frames, whether those frames are actual captured frames or interpolated frames as described above. Infinite smoothing may include determining a predetermined number of possible transformations between frames. To identify keypoints in the frames, a Harris corner detector algorithm may be used to detect salient areas of high contrast with low ambiguity in different dimensions, i.e., areas with high “cornerness.” A predetermined number of keypoints with the highest Harris scores may then be selected. The RANSAC (random sample consensus) algorithm may then be used to determine the transformations that occur most commonly among all possible keypoint transformations between the frames. For example, a smooth flow space of eight possible transformations and/or motions that can be applied to different pixels may be defined, and different transformations may be assigned to different pixels within a frame. Keypoint detection, keypoint tracking, and the RANSAC algorithm may be run online. In certain embodiments, infinite smoothing algorithms may be executed in real time on the fly; for example, when the user navigates to a particular translation position, the system can generate an artificial image frame corresponding to that particular translation position.
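The keypoint-and-RANSAC portion of this description could look like the following sketch, assuming OpenCV. Note that it fits only a single dominant homography between two frames, whereas the embodiment described above keeps a set of candidate transformations and assigns them per pixel.

```python
import cv2
import numpy as np

def dominant_transform(frame_a: np.ndarray, frame_b: np.ndarray, max_pts: int = 200):
    """Estimate the most common transformation between two frames.

    Corners are detected with the Harris-based detector in goodFeaturesToTrack,
    tracked into the next frame with pyramidal Lucas-Kanade optical flow, and a
    single dominant homography is then fit with RANSAC.
    """
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)

    pts_a = cv2.goodFeaturesToTrack(gray_a, maxCorners=max_pts,
                                    qualityLevel=0.01, minDistance=8,
                                    useHarrisDetector=True)
    if pts_a is None:
        raise ValueError("no corners detected")
    pts_b, status, _ = cv2.calcOpticalFlowPyrLK(gray_a, gray_b, pts_a, None)

    good = status.ravel() == 1
    H, inliers = cv2.findHomography(pts_a[good], pts_b[good], cv2.RANSAC, 3.0)
    return H, int(inliers.sum())
```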

“Filters can also be used during capture or generation of a surround view to enhance the viewing experience. Just as many popular photo sharing services provide aesthetic filters that can be applied to static, two-dimensional images, aesthetic filters can similarly be applied to surround images. However, because a surround view representation is more expressive than a two-dimensional image and more information is available in a three-dimensional surround view, these filters can be extended to include effects that are not possible in two-dimensional photos. For instance, in a surround view, motion blur can be added to the background (i.e., the context) while the content remains crisp. In another example, a drop shadow can be added to the object of interest in a surround view.

“In various examples, compression can also be used as an enhancement algorithm. In particular, compression can be used to enhance the user experience by reducing data upload and download costs. A surround view can be transmitted using far less data than a normal video while retaining the desired characteristics of the surround view. Specifically, the IMU information, keypoint tracks, and user input, combined with the view interpolation described above, can all reduce the amount of data that must be transferred to and from a device during upload or download of a surround view. For instance, if the object of interest can be properly identified, a variable compression style can be chosen for the content and the context. This variable compression style can use a lower quality resolution for the background information (i.e., the context) and a higher quality resolution for the foreground information (i.e., the content). In such examples, the amount of data transmitted can be reduced by sacrificing some context quality while maintaining the desired quality level for the content.
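A naive sketch of the variable compression idea is shown below, assuming OpenCV's JPEG encoder. The quality settings are arbitrary, and a production system would more likely use a video codec with region-of-interest control rather than per-frame JPEGs.

```python
import cv2
import numpy as np

def compress_variable(frame: np.ndarray, fg_mask: np.ndarray,
                      fg_quality: int = 90, bg_quality: int = 30):
    """Encode the foreground and background of a frame at different JPEG qualities.

    fg_mask is a boolean H x W mask for the object of interest (the content);
    everything else is treated as context and compressed more aggressively.
    """
    fg = np.where(fg_mask[..., None], frame, np.zeros_like(frame))
    bg = np.where(fg_mask[..., None], np.zeros_like(frame), frame)
    _, fg_bytes = cv2.imencode(".jpg", fg, [cv2.IMWRITE_JPEG_QUALITY, fg_quality])
    _, bg_bytes = cv2.imencode(".jpg", bg, [cv2.IMWRITE_JPEG_QUALITY, bg_quality])
    # Together with the mask, these two buffers are enough to reassemble the frame.
    return fg_bytes, bg_bytes
```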

“In the present embodiment, a surround view 118 is generated after any enhancement algorithms are applied. The surround view can provide a multi-view interactive digital media representation. In various examples, the surround view can include a three-dimensional model of the content and a two-dimensional model of the context. However, in some examples, the context can represent a “flat” view of the scenery or background projected along a surface, such as a cylindrical or other-shaped surface, such that the context is not purely two-dimensional. In yet other examples, the context can include three-dimensional aspects.

According to various embodiments, surround views provide numerous advantages over traditional two-dimensional images or videos. These include the ability to cope with moving scenery, a moving capture device, or both; the ability to remove redundant information; and the ability for a user to modify the view. The characteristics described above can be incorporated natively in the surround view representation and provide the capability for use in various applications. For instance, surround views can be used in many fields, such as e-commerce, visual search, file sharing, user interaction, and entertainment.

According to various examples, once a surround view 118 is generated, user feedback for acquisition 120 of additional image data can be provided. In particular, if a surround view is determined to need additional views to more accurately represent the content or context, a user may be prompted to provide additional views. Once these additional views are received by the surround view acquisition system 100, they can be processed by the system 100 and incorporated into the surround view.

“The surround view 118 can be further processed at AR/VR content generation block 122 to create content suitable for various AR/VR systems. This AR/VR content generation block 122 may include a processing module that segments the images in order to extract an object of interest from the background imagery, as further described with reference to FIGS. 11 and 12. Additional enhancement algorithms, as described above with reference to block 116, can also be applied at AR/VR content generation block 122. For example, view interpolation can be used to determine the parameters for any number of artificial intermediate images, resulting in an infinitely smooth transition between image frames, as further described with reference to FIGS. 13-21. Furthermore, stereoscopic pairs may be presented to the user to provide depth perception, as further described with reference to FIGS. 22-24.

“With reference to FIG. 2, an example of a process 200 for generating augmented reality and/or virtual reality content in real time is shown. Process 200 may be implemented using system 100 and/or in conjunction with other processes described herein, such as process 300. In some embodiments, process 200 can be performed at AR/VR content generation block 122. Process 200 can produce AR/VR content that provides a user with a three-dimensional surround view of an object of interest with a perception of depth.

“At step 201, a sequence of images is obtained. In some embodiments, the sequence of images can include 2D images, such as 2D images 104. In some embodiments, other data, such as location information 106 and depth information, may also be obtained from the camera or the user. At step 203, the images are stabilized and a set of image frames is selected. These selected image frames are referred to as keyframes. The keyframes can be processed into a surround view through content modeling 112, context modeling 114, and/or the various enhancement algorithms 116, as described above with reference to FIG. 1.

As described above, according to various aspects of the present disclosure, AR/VR content can also be generated by extracting an object of interest or other content, such as a person, from a sequence of images to separate it from the background and other context imagery. This can be achieved by applying various segmentation algorithms to the images. In an example embodiment, semantic segmentation of each keyframe into its foreground and background is performed at step 205. Such semantic segmentation may be performed by a segmenting neural network trained to identify and label the pixels in each image frame, and is further described below with reference to FIG. 11. At step 207, the keyframes may be further refined using fine-grained segmentation. Step 207 can enhance or improve the separation of the foreground from the background such that the object of interest is clearly and cleanly isolated from the background without artifacts. Fine-grained segmentation is described below with reference to FIG. 12.
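As a hedged sketch of fine-grained refinement, the following applies a single-frame dense CRF using the third-party pydensecrf package. The temporal variant described in the present disclosure would additionally connect each frame to its neighboring frames in a graph, which is omitted here, and the pairwise parameters shown are illustrative.

```python
import numpy as np
import pydensecrf.densecrf as dcrf                 # third-party 'pydensecrf' package
from pydensecrf.utils import unary_from_softmax

def refine_mask(image: np.ndarray, fg_prob: np.ndarray, iters: int = 5) -> np.ndarray:
    """Refine a coarse foreground probability map with a dense CRF.

    image   -- H x W x 3 uint8 frame
    fg_prob -- H x W foreground probability from the segmentation network
    """
    h, w = fg_prob.shape
    probs = np.stack([1.0 - fg_prob, fg_prob]).astype(np.float32)   # 2 x H x W

    crf = dcrf.DenseCRF2D(w, h, 2)
    crf.setUnaryEnergy(unary_from_softmax(probs))
    crf.addPairwiseGaussian(sxy=3, compat=3)                        # spatial smoothness
    crf.addPairwiseBilateral(sxy=60, srgb=10, compat=10,
                             rgbim=np.ascontiguousarray(image))     # color-aware edges
    q = crf.inference(iters)
    return np.argmax(q, axis=0).reshape(h, w).astype(bool)          # refined mask
```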

“At step 209, parameters for view interpolation are computed. In some embodiments, the parameters for interpolation may be determined by computing a number of possible transformations between keyframes and selecting the best transformation for each pixel within an image frame. These parameters may be determined offline and used to render artificial images at runtime, when the surround view is being viewed. Interpolation of keyframes and rendering of artificial frames are further described below with reference to FIGS. 13-21.

“At step 211, stereoscopic pairs of image frames are generated. In some embodiments, stereoscopic pairs may be generated by determining the pair of frames that will provide the desired perception of depth based on the distance from the object of interest and the angle of vergence. In some embodiments, one or more frames in a stereoscopic pair may be an artificially interpolated image. In some embodiments, one or more frames in a stereoscopic pair may be corrected by applying a rotation transformation such that the line of sight is perpendicular to the plane of the frame. The generation of stereoscopic pairs is further described below with reference to FIGS. 22-24.
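A simple sketch of selecting the two frames of a stereoscopic pair from their viewing angles is shown below. The vergence angle default and the assumption that a per-frame yaw estimate is available are illustrative, and the rotation correction mentioned above would be applied as a separate warp step.

```python
import numpy as np

def stereo_pair_indices(frame_yaws_deg: np.ndarray, center_idx: int,
                        vergence_deg: float = 4.0) -> tuple:
    """Pick left/right frame indices around a viewpoint to form a stereoscopic pair.

    frame_yaws_deg -- viewing angle of each (captured or interpolated) frame
                      along the camera translation
    center_idx     -- frame closest to the user's current viewpoint
    vergence_deg   -- desired angular baseline between the two eyes
    """
    center = frame_yaws_deg[center_idx]
    left = int(np.argmin(np.abs(frame_yaws_deg - (center - vergence_deg / 2))))
    right = int(np.argmin(np.abs(frame_yaws_deg - (center + vergence_deg / 2))))
    return left, right

# Each selected frame may additionally be corrected with a small rotation warp
# (see the homography sketch above) so that its line of sight is perpendicular
# to the image plane, as described in the present disclosure.
```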

“At step 213, the indices of the image frames are mapped to a rotation range for display. In some embodiments, the rotation range may be a concave arc around the object of interest; in other embodiments, the rotation range may be convex. Various rotation ranges can correspond to the various types of camera positions and translations described with reference to FIGS. 4, 6A-6B, 7A-7E, 8, 9 and 10. For example, in an image sequence that includes 150 images, keyframe 0 may be the first frame, corresponding to the beginning of the camera translation, and keyframe 150 may be the last frame, corresponding to the end of the camera translation. In some embodiments, the selected keyframes and/or captured frames may be evenly distributed across the rotation range; in other embodiments, they may be distributed across the rotation range based on location and/or other IMU information.

“In various embodiments, the physical viewing position is matched to the frame index. For example, if the viewing device and/or user is located at the middle of the rotation range, an image frame corresponding to the middle of the camera translation should be displayed. This information may be loaded onto a viewing device, such as the headset 2500 described with reference to FIG. 25, so that the appropriate image and/or stereoscopic pair can be shown to the user based on the position of the headset 2500.
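A minimal sketch of this mapping from a physical viewing angle to a frame index is shown below. The linear mapping and the angular bounds are assumptions made for illustration, since the actual distribution of frames may follow location and/or IMU information as noted above.

```python
def frame_for_yaw(yaw_deg: float, start_deg: float, end_deg: float,
                  num_frames: int) -> int:
    """Map a physical viewing angle to a frame index in the rotation range.

    The rotation range [start_deg, end_deg] corresponds to the camera
    translation; frame 0 is shown at start_deg and frame num_frames - 1 at
    end_deg, matching the even distribution described above.
    """
    yaw = min(max(yaw_deg, start_deg), end_deg)            # clamp to the range
    t = (yaw - start_deg) / (end_deg - start_deg)          # 0 .. 1
    return round(t * (num_frames - 1))

# Example: a 150-frame capture spread over a 90-degree arc.
# frame_for_yaw(45.0, 0.0, 90.0, 150) -> 74 (near the middle of the translation)
```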

“In various embodiments, AR/VR content generated using process flow 200 can include an object of interest that may be viewed by a user from various angles and/or viewpoints. In some embodiments, the surround view model is not an actual three-dimensional model that is rendered, but rather a view that the user experiences as a three-dimensional model. For example, the surround view provides a three-dimensional view of the content without rendering and/or storing an actual three-dimensional model; in other words, there is no polygon generation or texture mapping over a three-dimensional mesh and/or polygon model. However, the user still perceives the content and/or context as an actual three-dimensional model. The three-dimensional effect provided by the surround view is generated simply through stitching of actual two-dimensional images and/or portions thereof. As used herein, the term "three-dimensional model" is used interchangeably with this type of three-dimensional view.

“Surround View Generation”

“With reference to FIG. 3, an example of a process flow for generating a surround view is shown. In the present example, a plurality of images is obtained at 302. According to various embodiments, the plurality of images can be captured with various types of cameras. For example, the camera may be a digital camera in continuous shooting mode (or burst mode) capable of capturing a set number of frames in a given amount of time, such as five frames per second. In other embodiments, the camera may be a camera on a smartphone. In yet other embodiments, the camera may be set to capture the plurality of images as a continuous video.

According to various embodiments, the plurality of images can include two-dimensional (2D) images or data streams. These 2D images can include location information that can be used to generate a surround view, as described with reference to FIG. 1. In various examples, depth images can also be included along with location information.

According to various embodiments, the plurality of images obtained at 302 can include a variety of sources and characteristics. For instance, the plurality of images can be obtained from multiple users; these images can be a collection of 2D images or video gathered from the internet from different users. In some examples, the plurality of images can include images with different temporal information; in particular, the images can be taken of the same object of interest at different times. Multiple images of the same statue, for example, can be obtained at different times of day or in different seasons. In another example, the plurality of images can represent moving objects. For instance, the images may include an object of interest moving through scenery, such as a vehicle traveling along a road or a plane flying through the sky. In other instances, the images may include an object of interest that is itself in motion, such as a person running, dancing, or twirling, or a vehicle traveling along a road.

“In the current example embodiment, the plurality of images is fused into content and context models at 304. According to various embodiments, the subject matter featured in the images can be separated into content and context. The context can be defined as the scenery surrounding the object of interest. In some embodiments, the content can be a three-dimensional model depicting an object of interest, while in other embodiments the content can be a two-dimensional image.

“According to the present example embodiment, one or more enhancement algorithms can be applied to the content and context models at 306. These algorithms can be used to enhance the user experience and include, for example, automatic frame selection, stabilization, view interpolation, image rotation, infinite smoothing, filters, and/or compression. In some examples, these enhancement algorithms can be applied to the image data during capture of the images; in other examples, these enhancement algorithms can be applied to the image data after acquisition.

“In the present embodiment, a surround view is generated from the content and context models at 308. The surround view can provide a multi-view interactive digital media representation. In various examples, the surround view can include a three-dimensional model of the content and a model of the context. Depending on the mode of capture and the viewpoints of the images, the surround view model can include certain characteristics; for instance, some examples of different styles of surround views include a locally concave surround view, a locally convex surround view, and a locally flat surround view. However, it should be noted that surround views can include combinations of views and characteristics, depending on the application. In some embodiments, the surround view model is not an actual three-dimensional model that is rendered, but rather a view that the user experiences as a three-dimensional model. For example, the surround view provides a three-dimensional view of the content without rendering and/or storing an actual three-dimensional model.

“With reference to FIG. 4, an example of multiple camera frames that can be fused together into a three-dimensional (3D) model is shown. According to various embodiments, multiple images can be captured from various viewpoints and fused together to provide a surround view. In the current example, three cameras 412, 414, and 416 are located at positions A 422, B 424, and X 426, respectively, in proximity to an object of interest 408. Scenery, such as the object 410, can surround the object of interest 408. Frame A 402, frame B 404, and frame X 406 are captured by the respective cameras 412, 414, and 416 and include overlapping subject matter. Specifically, each frame 402, 404, and 406 includes the object of interest 408 and varying degrees of visibility of the scenery surrounding it 410. For instance, frame A 402 shows the object of interest 408 in front of the cylinder that is part of the surrounding scenery 410, while view 406 shows the object of interest 408 to one side of the cylinder, and view 404 shows the object of interest without any view of the cylinder.

In the present embodiment, frames A 402, B 404, and X 406, along with their respective locations (location A 422, location B 424, and location X 426), provide rich information about the object of interest 408 and the surrounding context that can be used to produce a surround view. When analyzed together, the various frames 402, 404, and 406 provide information about different sides of the object of interest and its relationship to the scenery. According to various embodiments, this information can be used to parse out the object of interest 408 into content and the scenery into context. Furthermore, as described with reference to other figures herein, the images captured from these viewpoints can be used to produce an immersive, interactive experience.

“Frame X 406, in some embodiments, may be an artificially rendered image generated for a viewpoint at location X 426 on a trajectory between location A 422 and location B 424. In such an example, a single transformation for viewpoint interpolation is used along the trajectory between the two frames, frame A 402 and frame B 404. Frame A 402 is an image of objects 408 and 410 captured by camera 412 at location A 422, and frame B 404 is an image of object 408 captured by camera 414 at location B 424. In the present example, a transformation T_AB between the two frames is estimated, where T_AB maps a pixel from frame A to frame B. This transformation can be performed using methods such as homography, affine, similarity, or translation transformations.

“In this example, an artificially rendered image at location X 426 can be denoted by its viewpoint location x in [0, 1] on the trajectory between frame A and frame B, where frame A is located at 0 and frame B at 1. The artificial image is generated by first interpolating the transformation, then gathering image information from frames A and B, and finally combining the two sets of image information. In the current example, the transformation is interpolated to obtain T_AX and T_XB. One way to interpolate this transformation is to parameterize the transformation T_AB and then linearly interpolate those parameters; however, the interpolation does not have to be linear, and other methods are possible within the scope of this disclosure. Next, image information is gathered from both frames A and B by transferring image information from frame A 402 to frame X 406 based on T_AX, and by transferring image information from frame B 404 to frame X 406 based on T_XB. Finally, the image information gathered from both frames A and B is combined to generate an artificially rendered image at location X 426. Interpolation for rendering artificial frames is further described below with reference to FIGS. 13-21.
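The interpolation and blending steps described above could be sketched as follows, assuming OpenCV and a homography H_ab estimated between frames A and B (for example, by the RANSAC sketch given earlier). Element-wise linear interpolation of the homography entries is used here purely for simplicity, in line with the note above that the interpolation need not be linear and that other parameterizations are possible.

```python
import cv2
import numpy as np

def render_intermediate(frame_a: np.ndarray, frame_b: np.ndarray,
                        H_ab: np.ndarray, x: float) -> np.ndarray:
    """Render an artificial frame at position x in [0, 1] between frames A and B.

    H_ab maps pixels of frame A onto frame B.  T_AX and T_XB are obtained by
    naive linear interpolation of the transformation parameters; both frames
    are then warped toward the intermediate viewpoint and blended.
    """
    eye = np.eye(3)
    H_ax = (1.0 - x) * eye + x * H_ab                   # identity at x=0, H_ab at x=1
    H_bx = (1.0 - x) * np.linalg.inv(H_ab) + x * eye    # inverse at x=0, identity at x=1

    h, w = frame_a.shape[:2]
    warped_a = cv2.warpPerspective(frame_a, H_ax, (w, h))
    warped_b = cv2.warpPerspective(frame_b, H_bx, (w, h))
    return cv2.addWeighted(warped_a, 1.0 - x, warped_b, x, 0.0)
```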

“FIG. 5 illustrates an example of the separation of content and context in a surround view. According to various embodiments of the present disclosure, a surround view is a multi-view interactive digital media representation of a scene 500. With reference to FIG. 5, a user 502 is shown located in a scene 500, capturing images of an object of interest, such as a statue. The images captured by the user constitute digital visual data that can be used to generate a surround view.

“According to various embodiments of the present disclosure, the digital visual data included in a surround view can be separated, semantically and/or practically, into content 504 and context 506. According to particular embodiments, the content 504 can include the object, person, or scene of interest, while the context 506 represents the remaining elements of the scene surrounding the content 504. In some examples, a surround view may represent the content 504 as three-dimensional data and the context 506 as a two-dimensional panoramic background. In other examples, a surround view may represent both the content 504 and the context 506 as two-dimensional panoramic scenes. In yet other examples, the content 504 and the context 506 may include three-dimensional components. In particular embodiments, the way in which the surround view depicts the content 504 and the context 506 depends on the capture mode used to acquire the images.

In some cases, such as recordings of objects, persons, or parts of objects or persons where only the object, person, or parts of them are visible, recordings of large flat areas, and recordings of scenes where no distinct subject is present within the captured area, the content 504 and the context 506 may be the same. The resulting surround views may then share some characteristics with other types of digital media, such as panoramas. However, according to various embodiments, surround views include additional features that distinguish them from these other types of digital media. For instance, a surround view can represent moving data. Additionally, a surround view is not limited to a specific cylindrical, spherical, or translational movement; image data can be captured using a camera or any other capture device, following a variety of motions. Furthermore, unlike a stitched panorama, a surround view can display different sides of the same object.

“FIGS. 6A-6B illustrate examples of concave and convex views, respectively. These views are particularly relevant when a camera phone is used, where the camera is located on the back of the phone and faces away from the user. In particular, concave and convex views can affect how the content and context are identified in a surround view.

“With reference to FIG. 6A, an example of a concave view 600 is shown in which a user is standing along a vertical axis 608. In this example, the user is holding a camera such that camera location 602 does not leave the axis 608 during image capture. However, as the user pivots about the axis 608, the camera captures a panoramic view of the scene around the user, forming a concave view. In this embodiment, the object of interest 604 and the distant scenery 606 are viewed similarly because of the way in which the images are captured. In this example, all objects in the concave view appear to be at infinity, so the content is equal to the context.

“With reference to FIG. 6B, an example of a convex view 620 is shown in which a user changes position while capturing images of an object of interest 624. In this example, the user moves around the object of interest 624, taking pictures of different sides of it from camera locations 628, 630, and 632. Each of the images obtained includes a view of the object of interest as well as the distant scenery 626 in the background. In this example, the object of interest 624 represents the content, and the distant scenery 626 represents the context.

“FIGS. 7A-7E illustrate examples of various capture modes for surround views. Although various motions can be used to capture a surround view, and a surround view is not limited to any particular type of motion, three general types of motion can be used to capture particular features or views described in conjunction with surround views. These three types of motion, respectively, can yield a locally concave surround view, a locally convex surround view, and a locally flat surround view. In some examples, a surround view can include various types of motions within the same surround view. In FIGS. 7A-7E, the type of surround view (for instance, concave or convex) is described with reference to the direction in which the camera view faces.

“With reference to FIG. 7A, an example of a back-facing, concave surround view being captured is shown. According to various embodiments, a locally concave surround view is one in which the viewing angles of the camera or other capture device diverge. In one dimension, this can be likened to the motion required to capture a spherical 360-degree panorama (pure rotation), although the motion can be generalized to any curved sweeping motion in which the view faces outward. In the present example, the experience is that of a stationary observer looking out at a (possibly dynamic) context.

“In the current example embodiment, a user 702 is using a back-facing camera 706 to capture images toward world 700 and away from the user 702. As described in various examples, a back-facing camera refers to a camera that faces away from the user, such as the camera on the back of a smartphone. The camera is moved in a concave motion 708, such that views 704a, 704b, and 704c capture various parts of capture area 709.

“With reference to FIG. 7B, an example of a back-facing, convex surround view being captured is shown. According to various embodiments, a locally convex surround view is one in which the viewing angles converge toward a single object of interest. In some examples, a locally convex surround view can provide the viewer the experience of orbiting about a point, such that the viewer can see multiple sides of the same object. This object, which may be an “object of interest,” can be segmented from the surround view to become content, and any surrounding data can be segmented to become context. Previous technologies fail to recognize this type of viewing angle in the media-sharing landscape.

“In the current example embodiment, a user 702 is using a back-facing camera 714 to capture images toward world 700 and away from the user 702. The camera is moved in a convex motion 710, such that views 712a, 712b, and 712c capture various parts of capture area 711. As described above, the convex motion 710 can orbit around an object of interest, and the views 712a, 712b, and 712c can show different sides of this object.

“With reference to FIG. 7C, an example of a front-facing, concave surround view being captured is shown. As described in various examples, a front-facing camera refers to a device with a camera that faces toward the user, such as the camera on the front of a smartphone. Front-facing cameras are commonly used to take “selfies” (i.e., self-portraits of the user).

“In the current example embodiment, camera 720 is facing the user 702. The camera follows a concave motion 706, such that the views 718a, 718b, and 718c diverge from each other in an angular sense. The capture area 717 follows a concave shape that includes the user at a perimeter.

“With reference to FIG. 7D, an example of a front-facing, convex surround view being captured is shown. In the current example embodiment, camera 726 is facing the user 702. The camera follows a convex motion 722, such that the views 724a, 724b, and 724c converge toward the user 702. The capture area 717 follows a convex shape that surrounds the user 702.

“With reference to FIG. 7E, shown is an example of a flat view captured from the back. In particular example embodiments, a locally flat surround view is one in which the rotation of the camera is small compared to its translation. In a locally flat surround view, the viewing angles remain roughly parallel and the parallax effect dominates. This type of surround view can contain an “object of interest,” but that object is not fixed at the same position across the different views. This type of viewing angle was likewise not recognized by previous technologies in the media-sharing landscape.

“In the present example embodiment, camera 732 faces away from user 702 and towards the world 700. The camera follows a generally linear motion 728, such that the capture area 729 follows a line. Views 730a, 730b, and 730c have generally parallel lines of sight. An object viewed in multiple of these views can appear to shift relative to the background scenery from one view to the next, and slightly different sides of the object may be visible in different views. The parallax effect yields information about the position and characteristics of the object relative to the background, so that surround views captured in this way provide more information than any single static image.
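
The paragraph above notes that parallax between views with roughly parallel lines of sight carries information about an object's position. As a minimal illustration (not taken from the patent), the standard depth-from-parallax relation Z = f·B/d is sketched below in Python; the function name and the focal length, baseline, and disparity values are purely illustrative.

```python
# Illustrative only: Z = f * B / d for roughly parallel lines of sight, where
# f is the focal length in pixels, B is the camera translation (baseline) between
# two views along the linear sweep, and d is the observed pixel disparity of a
# foreground point between those two views.

def depth_from_parallax(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Estimate the distance (in meters) to a point from its parallax between two views."""
    if disparity_px <= 0:
        raise ValueError("Point must exhibit positive disparity between the views.")
    return focal_px * baseline_m / disparity_px

# Example: a 1500 px focal length, 0.20 m of camera translation, and 30 px of
# apparent shift suggest the point is roughly 10 m away.
print(depth_from_parallax(1500.0, 0.20, 30.0))  # -> 10.0
```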

As described above, various modes can be used to capture images for a surround view, including locally concave, locally convex, and locally linear motions. These modes can be applied to individual images or to continuous recording of a scene, where a single recording session captures a series of images. A rough sketch of how such a sweep might be classified follows this paragraph.
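
The following is a hedged sketch, not taken from the patent, of how estimated camera poses might be used to label a sweep as locally concave, convex, or flat; the function name, the thresholds, and the convergence heuristic are all assumptions, and the terminology follows the definitions given above (concave = diverging, outward-facing views; convex = views converging on an object; flat = roughly parallel views).

```python
import numpy as np

def classify_capture_motion(positions, directions, angle_thresh_deg=5.0):
    """Heuristically label a capture sweep as locally 'flat', 'concave', or 'convex'.

    positions:  (N, 3) estimated camera centers along the sweep.
    directions: (N, 3) viewing directions for the corresponding poses.
    """
    positions = np.asarray(positions, dtype=float)
    directions = np.asarray(directions, dtype=float)
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    # If the first and last viewing directions are nearly parallel, rotation is
    # small compared to translation and the sweep is treated as locally flat.
    span_deg = np.degrees(np.arccos(np.clip(directions[0] @ directions[-1], -1.0, 1.0)))
    if span_deg < angle_thresh_deg:
        return "flat"

    # Points one unit ahead of each camera: if they are closer together than the
    # camera centers, the views converge (orbiting an object of interest);
    # otherwise they diverge (outward-facing sweep).
    ahead = positions + directions
    converging = np.linalg.norm(ahead[0] - ahead[-1]) < np.linalg.norm(positions[0] - positions[-1])
    return "convex" if converging else "concave"
```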

According to various embodiments of the present disclosure, the data used to create a surround view can be acquired in many ways. For example, data can be acquired by moving a camera through space, as described with reference to FIG. 7 of U.S. Patent Application Ser. No. 14/530,669. In particular, a user can tap a record button on a capture device to begin recording. As the capture device moves leftward, the object of interest may appear to move in a generally rightward direction across the screen. The user can tap the record button again to stop recording, or, in some implementations, hold the record button during capture and release it to stop recording. In the present embodiment, this recording captures a series of images that can be used to generate a surround view.

According to various embodiments, a series of images used to generate a surround view can be captured by a user recording a scene or an object of interest. In addition, in some cases, multiple users can contribute to acquiring the series of images used to generate a surround view. With reference to FIG. 8, shown is an example of a space-time surround view being simultaneously recorded by independent observers.

“In the present example embodiment, cameras 804, 806, 808, 810, and 812 are positioned at different locations. In some examples, these cameras can be associated with independent observers. For instance, the independent observers could be audience members at a concert, a show, or another event. In other examples, the cameras could be placed on tripods or stands. In the present embodiment, the cameras are used to capture views 804a, 806a, 808a, 810a, and 812a, respectively, of an object of interest 800, with the world 802 providing the background scenery. In some cases, the images captured by cameras 804, 806, 808, 810, and 812 can be aggregated into a single surround view. Each camera provides a different vantage point relative to the object of interest 800, so aggregating the images from these different locations provides information about different viewing angles of the object of interest 800. In addition, the cameras can provide a series of images captured from their respective locations over a span of time, such that the surround view can also include temporal information and indicate movement over time.
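
As a hedged sketch of one way such independently captured frames could be organized, the code below indexes frames by camera, capture time, and approximate viewing angle around the object of interest, and picks the frame that best matches a requested viewpoint and time. The class and field names are hypothetical and the cost weighting is arbitrary; this is not the patent's data structure.

```python
from dataclasses import dataclass, field

@dataclass
class Frame:
    camera_id: int         # e.g. one of the cameras 804, 806, 808, 810, 812
    timestamp: float       # capture time in seconds
    view_angle_deg: float  # estimated bearing of the camera around the object of interest
    image_path: str

@dataclass
class SpaceTimeSurroundView:
    frames: list = field(default_factory=list)

    def add(self, frame: Frame) -> None:
        self.frames.append(frame)

    def nearest(self, angle_deg: float, time_s: float) -> Frame:
        """Return the stored frame whose viewpoint and capture time best match a request."""
        def cost(f: Frame) -> float:
            angular = abs(f.view_angle_deg - angle_deg) % 360.0
            angular = min(angular, 360.0 - angular)            # wrap-around angular distance
            return angular + 10.0 * abs(f.timestamp - time_s)  # time weighting is arbitrary
        return min(self.frames, key=cost)
```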

“As mentioned above with regard to various embodiments, a surround view can be associated with a variety of capture modes. In addition, a surround view can include different capture modes or different capture motions within the same surround view. Furthermore, surround views can be separated into smaller parts in some cases, as described with reference to FIG. 10 of U.S. Patent Application Ser. No. 14/530,669. For example, a complex surround view can be separated into smaller, linear parts. In one instance, a complex surround view may include a capture area that follows an L-shaped motion comprising two separate linear motions of the camera; such a surround view can be separated into two separate surround views. It should be noted that although the linear motions of a complex surround view can be captured sequentially and continuously in some embodiments, they can also be captured in separate sessions in other embodiments.

“In certain embodiments, the two linear surround views can be processed independently and then combined with a transition to provide a continuous experience for the user. Breaking down motion into smaller linear components in this manner can provide various advantages. For example, these smaller linear components can be broken down into discrete, loadable parts, which aids data compression and helps keep bandwidth requirements manageable. Non-linear surround views can also be separated into discrete components. In some examples, surround views can be broken down based on local capture motion: a complex motion may be separated into a locally convex portion and a linear portion, or into numerous smaller locally convex portions. It should be noted that a complex surround view can include any number of motions, and such surround views can be broken down into any number of separate parts, depending on the application. A simple segmentation heuristic is sketched below.
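
The following is a minimal sketch, under assumptions not stated in the patent, of how an ordered camera path might be cut into approximately linear pieces (for example, at the corner of an L-shaped sweep) so that each piece can be processed or loaded separately. The function name and the turn threshold are hypothetical.

```python
import numpy as np

def split_into_linear_segments(positions, turn_thresh_deg=30.0):
    """Split an ordered camera path into approximately linear pieces.

    positions: (N, 2) or (N, 3) camera centers in capture order.
    Returns index ranges [(start, end), ...], cutting the path wherever the
    direction of travel turns by more than turn_thresh_deg.
    """
    positions = np.asarray(positions, dtype=float)
    cuts = [0]
    prev_dir = None
    for i in range(1, len(positions)):
        step = positions[i] - positions[i - 1]
        norm = np.linalg.norm(step)
        if norm < 1e-9:
            continue  # ignore stationary frames
        direction = step / norm
        if prev_dir is not None:
            angle = np.degrees(np.arccos(np.clip(direction @ prev_dir, -1.0, 1.0)))
            if angle > turn_thresh_deg and i - 1 > cuts[-1]:
                cuts.append(i - 1)  # start a new segment at the corner
        prev_dir = direction
    cuts.append(len(positions) - 1)
    return [(cuts[k], cuts[k + 1]) for k in range(len(cuts) - 1)]
```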

“While it may be desirable in some applications to separate complex surround views, in other applications it is desirable to combine multiple surround views. With reference to FIG. 9, shown is an example of a graph that combines multiple surround views into a multi-surround view 900. In this example, the rectangles represent the various surround views 902, 906, 908, 912, 914, and 916, and the length of each rectangle indicates the dominant motion of that surround view. The lines between the surround views indicate possible transitions 918, 920, 922, 924, 926, 928, and 930 between them.

“In some examples, a surround view can provide a way to partition a scene both spatially and temporally in a very efficient manner. For very large-scale scenes, multi-surround view 900 data can be used. In particular, a multi-surround view 900 can include a collection of surround views connected together in a spatial graph. The individual surround views can be collected by a single source, such as a single user, or by multiple sources, such as multiple users. In addition, the individual surround views can be captured in sequence, in parallel, or at entirely uncorrelated times. However, in order to connect the individual surround views, there must be some overlap of content, context, and/or location between them; any surround view needs some such overlap to provide a portion of the multi-surround view 900. This overlap allows individual surround views to be linked and stitched together into a multi-surround view 900. According to various examples, any combination of front-facing and back-facing cameras can be used.
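
As a hedged illustration of the spatial-graph idea, the sketch below models each surround view as a node and adds a transition edge only when two views share overlapping content or context. The class, method names, and metadata fields are assumptions, and the reference numerals in the usage example simply mirror FIG. 9; this is not the patent's implementation.

```python
class MultiSurroundView:
    """A spatial graph of surround views (nodes) joined by possible transitions (edges)."""

    def __init__(self):
        self.views = {}        # view_id -> metadata (content tags, context tags, ...)
        self.transitions = {}  # view_id -> set of view_ids reachable from it

    def add_view(self, view_id, metadata):
        self.views[view_id] = metadata
        self.transitions.setdefault(view_id, set())

    def link_if_overlapping(self, a, b):
        """Connect two surround views only if their content or context overlaps.

        Location overlap could be handled analogously.
        """
        ma, mb = self.views[a], self.views[b]
        overlap = (ma["content"] & mb["content"]) or (ma["context"] & mb["context"])
        if overlap:
            self.transitions[a].add(b)
            self.transitions[b].add(a)
        return bool(overlap)

# Hypothetical usage mirroring FIG. 9's reference numerals:
graph = MultiSurroundView()
graph.add_view(902, {"content": {"statue"}, "context": {"plaza"}})
graph.add_view(906, {"content": {"statue", "fountain"}, "context": {"plaza"}})
print(graph.link_if_overlapping(902, 906))  # True: shared content and context
```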

“In some embodiments, multi-surround views can be generalized to capture entire environments. Much as “photo tours” collect photographs into a graph of discrete, spatially-neighboring components, multiple surround views can be combined into an entire scene graph. This can be accomplished using information such as image matching/tracking and depth matching/tracking. Within such a graph or multi-surround view, a user can switch between different surround views at specific points in the recorded motion. Multi-surround views can be more compelling than “photo tours” because the user can navigate the surround views as desired and much more visual information can be stored in them. Traditional photo tours, in contrast, typically show limited views to the viewer, either automatically or by allowing the viewer to pan through a panorama with a computer mouse or keystrokes.

According to various embodiments, a surround view is generated from a set of images. These images can be captured by a user intending to produce a surround view, or they can be retrieved from storage, depending on the application. Because a surround view is not limited to a particular, fixed amount of visibility, it can provide significantly more information about different views of the same object or scene. More specifically, although a single view may be ambiguous or insufficient to adequately describe a three-dimensional object, multiple views of that object can provide more specific and detailed information. These multiple views can provide enough information, for example, to allow a visual search query to yield more accurate results. Because a surround view presents views of an object from many sides, a distinctive view that is appropriate for search can either be selected from the surround view or requested from the user if one is not available. For instance, if the data captured or otherwise provided is not sufficient to allow recognition or generation of the object or scene of interest, a capturing system can guide the user to continue moving the capture device or to provide additional image data. In particular embodiments, a user may be prompted to provide additional images when it is determined that additional views are needed to produce a more accurate model.

A surround view can be used in many applications, depending on the particular embodiment. For example, a surround view can allow a user to navigate the surround view or otherwise interact with it. According to various embodiments, a surround view is designed to give the user the feeling of being present in the scene as the user interacts with the surround view. The experience also depends on the type of surround view being viewed: a surround view need not have a single fixed geometry, and different portions of it can follow different geometries, such as concave, flat, or convex segments.

“In particular example embodiments, the mode of navigation is informed by the type of geometry represented in a surround view. For concave surround views, the act of rotating a device such as a smartphone can mimic that of a stationary observer rotating to look out at the surrounding scene. In some applications, swiping the screen in one direction can cause the view to rotate in the opposite direction, giving the sensation of standing inside a hollow cylinder and pushing its walls to rotate around the user. For convex surround views, rotating the device can cause the view to orbit around the object of interest in the direction the device is turned. In some applications, swiping the screen in one direction causes the viewing angle to rotate in the same direction, creating the sensation that the object of interest is rotating about its axis. For flat views, the view can translate in the direction opposite to the user's movements: swiping the screen in one direction can cause the view to translate in the opposite direction, as if foreground objects were being pushed to the side. A small sketch of this gesture mapping follows.
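
The sketch below, assuming nothing beyond the behaviors described in the preceding paragraph, maps a horizontal swipe to a signed change of frame index depending on the local geometry. The function name, the sign conventions, and the frames-per-pixel scale are hypothetical.

```python
def frame_delta_for_swipe(view_type: str, swipe_dx: float, frames_per_pixel: float = 0.05) -> int:
    """Translate a horizontal swipe into a signed change of frame index.

    view_type: 'concave' (stationary observer looking out), 'convex' (orbit
    around an object of interest), or 'flat' (parallel views, parallax dominates).
    swipe_dx: horizontal swipe distance in pixels (positive = rightward swipe).
    """
    step = int(round(swipe_dx * frames_per_pixel))
    if view_type == "concave":
        # Rotate the scene opposite to the swipe, like pushing the walls of a
        # hollow cylinder to spin around the viewer.
        return -step
    if view_type == "convex":
        # Orbit the object of interest in the swipe direction, so it appears
        # to rotate about its own axis.
        return step
    if view_type == "flat":
        # Translate the view against the swipe, as if pushing foreground
        # objects to the side.
        return -step
    raise ValueError(f"unknown view type: {view_type}")
```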

“In some examples, a user may also be able to navigate a multi-surround view or a graph of surround views in which individual surround views are loaded piecewise and additional surround views are loaded when needed (e.g., when they are adjacent to or overlap the current surround view, and/or when the user navigates towards them). When the user reaches a point in a surround view where two or more surround views overlap, the user can choose which of the overlapping surround views to follow. In some instances, the choice of which surround view to follow can be based on the direction in which the device is moved or the screen is swiped.

“With reference to FIG. 10, shown is an example of a process for navigating a surround view 1000. In the present example, a request is received from a user at 1002 to view an object of interest in a surround view. In some cases, the request can also be a generic request to view a surround view without a particular object of interest, such as a request to view a landscape or panoramic view. At 1004, a three-dimensional model of the object is accessed. This three-dimensional model can include all or a portion of a stored surround view; in some applications, for instance, it can include a segmented content view. An initial image is then sent from a first viewpoint to an output device at 1006. This first viewpoint serves as a starting point for viewing the surround view on the output device.

“In the present embodiment, a user action is then received at 1008 to view the object of interest from a second viewpoint. This user action can include moving (e.g., tilting, translating, rotating, etc.) an input device, swiping the screen, and so forth, depending on the application. For instance, the user action can correspond to motion associated with a locally concave surround view, a locally convex surround view, or a locally flat surround view. Based on the characteristics of the user action, the three-dimensional model is processed at 1010. Depending on the application, the input device and the output device can both be included in a single mobile device. In some examples, the requested image corresponds to an image captured prior to generation of the surround view. In other examples, the requested image is generated based on the three-dimensional model (e.g., by interpolation). An image from this viewpoint can then be sent to the output device at 1012. In some embodiments, the selected image can be provided to the output device along with a degree of certainty as to its accuracy. For instance, when interpolation algorithms are used to generate an image from a particular viewpoint, the degree of certainty can vary, and this information can be provided to the user in certain applications. In other examples, a message can be provided to the output device indicating that the surround view contains insufficient information to generate the requested image.

“In some embodiments, intermediate images can be sent between the initial image sent at 1006 and the requested image sent at 1012. In particular, these intermediate images can correspond to viewpoints located between the first viewpoint associated with the initial image and the second viewpoint associated with the requested image. Furthermore, these intermediate images can be selected based on the characteristics of the user action; for instance, the intermediate images can follow the path of movement of the input device associated with the user action, such that the intermediate images provide a visual navigation of the object of interest.
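
The following is a minimal sketch following the flow of FIG. 10, under the simplifying assumption that the "model" is just the set of captured frames and their viewpoints, so the requested image is the nearest captured one rather than an interpolated rendering. The class and method names are hypothetical.

```python
import numpy as np

class SurroundViewNavigator:
    """Serve images from a stored surround view as the user changes viewpoint (cf. FIG. 10)."""

    def __init__(self, images, viewpoints_deg):
        # images[i] was captured at viewpoints_deg[i] (angle around the object of interest).
        self.images = images
        self.viewpoints = np.asarray(viewpoints_deg, dtype=float)

    def _nearest_index(self, angle_deg: float) -> int:
        diff = np.abs((self.viewpoints - angle_deg + 180.0) % 360.0 - 180.0)
        return int(np.argmin(diff))

    def initial_image(self):
        """Step 1006: send an image from the first viewpoint as a starting point."""
        return self.images[0]

    def requested_image(self, target_angle_deg: float):
        """Steps 1008-1012: resolve a user action into a viewpoint and return an image.

        A real system might interpolate a new image from the three-dimensional
        model instead of returning the nearest captured frame.
        """
        return self.images[self._nearest_index(target_angle_deg)]

    def intermediate_images(self, start_angle_deg: float, end_angle_deg: float, steps: int = 5):
        """Images for viewpoints between the initial and requested viewpoints."""
        angles = np.linspace(start_angle_deg, end_angle_deg, steps)
        return [self.images[self._nearest_index(a)] for a in angles]
```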

“Segmentation of the Object of Interest and Background”

According to various aspects of the present disclosure, AR/VR content can also be generated by extracting an object or other content, such as a person, from a sequence of images in order to separate it from the background and other imagery. This can be achieved by applying various segmentation algorithms to the images. In some embodiments, semantic segmentation is performed using neural networks. In further embodiments, fine-grained segmentation refinement is performed, for example using temporal conditional random fields.
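
The refinement stage named above uses temporal conditional random fields; as a much simpler, hedged stand-in (and explicitly not the patent's method), the sketch below smooths per-frame soft masks over time with an exponential moving average and then thresholds them. The function name and parameters are assumptions.

```python
import numpy as np

def temporally_smooth_masks(soft_masks, momentum=0.7, threshold=0.5):
    """Smooth per-frame soft segmentation masks over time.

    soft_masks: list of (H, W) float arrays in [0, 1], one per frame, produced by
    the semantic segmentation stage. This exponential moving average is only a
    simplified stand-in for temporal conditional random field refinement.
    """
    smoothed, running = [], None
    for mask in soft_masks:
        running = mask if running is None else momentum * running + (1.0 - momentum) * mask
        smoothed.append((running > threshold).astype(np.uint8))
    return smoothed
```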

“With reference to FIG. 11, shown is an example of a method 1100 for semantic segmentation of image frames, in accordance with one or more embodiments. In various embodiments of semantic segmentation, a neural network system is trained to recognize and label the pixels of an image according to a particular category or class. In some embodiments, the neural network system is a convolutional neural network. In some embodiments, the neural network system may comprise multiple computational layers.
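
To illustrate the structure only, the sketch below (assuming PyTorch is available) defines a tiny, untrained fully convolutional network with a few computational layers that outputs per-pixel class scores; a production system would use a much deeper network with trained weights, and the layer sizes here are arbitrary.

```python
import torch
import torch.nn as nn

class TinySegmentationNet(nn.Module):
    """A minimal fully convolutional network that assigns class scores to every pixel."""

    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.features = nn.Sequential(              # several computational layers
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Conv2d(32, num_classes, kernel_size=1)  # per-pixel class scores

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# A single RGB frame (batch of 1, 3 channels, 240x320 pixels) yields a map of class
# scores at the same spatial resolution; argmax gives a per-pixel label.
frame = torch.rand(1, 3, 240, 320)
labels = TinySegmentationNet()(frame).argmax(dim=1)  # shape: (1, 240, 320)
print(labels.shape)
```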
