Alphabet – Bruce L. Davis, Tony F. Rodriguez, William Y. Conwell, Geoffrey B. Rhoads, Digimarc Corp

Abstract for “Intuitive computing methods, systems”

“Smart phones sense audio, imagery, and/or other stimuli in a user’s environment and act autonomously to satisfy anticipated or inferred user wishes. One aspect of the technology is phone-based cognition of a scene viewed by the phone’s camera. The image processing tasks applied to the scene can be selected from among various alternatives by reference to resource costs, resource constraints, other stimulus information (e.g., audio), task substitutability, etc. The phone can apply more or fewer resources to an image processing task depending on how the task progresses, or on the user’s apparent interest in it. In some arrangements, data may be referred to the cloud for analysis, or for gleaning. Cognition, and identification of the appropriate device response(s), can be aided by collateral information, such as context. A great number of other features and arrangements are also described.”

Background for “Intuitive computing methods, systems”

“The subject matter of this disclosure may be regarded as technologies that help users interact with their environments using computer devices. Owing to its broad scope, the technology is well suited for many applications.

Given the wide range of topics covered, it is difficult to present the information in an orderly manner. As will be apparent, many of the topics presented here build on, and are foundational to, other sections. The sections are thus presented in a somewhat arbitrary order, since some ordering is necessary. Both the general principles and the particular details of each section find application in other sections as well. To keep this disclosure from growing unmanageably long, the various combinations and permutations of the described features are not exhaustively detailed. The inventors intend to explicitly teach such combinations/permutations, but practicality requires that the detailed synthesis be left to those who ultimately implement systems in accordance with these teachings.”

“It is important to note that the presently described technology builds on, and extends, technology disclosed in the earlier-cited patent applications. The reader is directed to those documents, which detail arrangements in which applicants intend the present technology to be applied, and which technically supplement the present disclosure.

“Cognition, Disintermediated search”

“Mobile devices such as cell phones are evolving into cognition tools, not just communication tools. One aspect of cognition can be described as activity that informs a person about the person’s environment. Cognitive actions can include:

“Seeing and hearing are two of the many functions mobile devices can perform to help inform a person about his or her environment.”

“Mobile devices are proliferating at an incredible rate. Many countries (including Finland, Sweden, Norway, Russia, Italy, and the United Kingdom) are reported to have more cell phones than people. According to the GSM Association, there are currently approximately 4 billion GSM and 3G phones in use. According to the International Telecommunications Union, there were 4.9 billion mobile cellular subscriptions at the end of 2009. Upgrade cycles are short: devices are replaced, on average, once every 24 months.

Mobile devices have attracted enormous investment. Google, Microsoft, and Nokia have all invested huge sums in research and development to extend the functionality of their devices. That the technologies detailed herein did not emerge despite such intense and extensive efforts by industry giants speaks to their ingenuity.

“‘Disintermediated search,’ such as visual query, is believed to be one of the most appealing applications for the upcoming generation of mobile devices.

Disintermediated search can be defined as search that minimizes (or eliminates) the human’s role in initiating it. For example, a smart phone may constantly analyze the visual environment and provide interpretation and related information without being expressly queried.

Disintermediated search may be regarded as the next step beyond Google. Google built a massive, monolithic system to organize the textual information available on the web. But the visual world is too big, and too complex, for even Google to master alone. A myriad of parties are bound to be involved, each playing a specialized role; there will not be one search engine that does it all. (Given the likely involvement of countless parties, perhaps an alternative moniker would be ‘hyperintermediated search.’)

“As will become apparent from the following discussion, the present inventors believe that visual search is, in certain respects, very complicated. Producing a satisfactory experience requires intimate device/cloud orchestration and a highly interactive user interface on the mobile screen, and the utility of the results depends on user interaction and guidance. On the local device, a key challenge is deploying scarce CPU/memory/channel/power resources against a dizzying array of demands. On the cloud side, auction-based service models are expected to emerge to drive the technology’s evolution. While disintermediated search may initially be fielded as closed systems, to thrive it should be offered via extensible, open platforms. Ultimately, the technologies that provide the greatest value to users will prevail.

“Architectural View”

“FIG. 1 shows, in an architectural view, an embodiment employing certain principles of the present technology. (It should be noted that the division of functionality into blocks is somewhat arbitrary; actual implementation may not follow the organization depicted.)

“The ICP Baubles and Spatial Model component handles tasks involving the viewing space, the display, and their relationships. Relevant functions include pose estimation, tracking, and ortho-rectified mapping in connection with overlaying baubles on a visual scene.

“Baubles can be considered, in one aspect, as augmented reality icons displayed on the screen in association with features of captured imagery. Baubles can be interactive and user-tuned; for example, different baubles may appear on the screens of different users viewing the same scene.

“In some arrangements, a bauble indicates a first glimmer of recognition by the system: when the system begins to discern something of potential interest, it presents a bauble. As the system deduces more about the feature, the bauble’s size, shape, color, or brightness may change, making it more prominent and/or more informative. If the user taps the bauble, this signals interest, and the system’s resource manager (e.g., the ICP State Machine) can allocate disproportionately more processing resources to analysis of that feature than to other regions of the image. (Information about the user’s interaction, the feature, and the bauble can also be stored in a data store.)

“When a bauble first appears, little may be known about the visual feature except that it seems to constitute a visually discrete entity. At this level of understanding, a generic bauble (perhaps termed a ‘proto-bauble’) can be displayed, such as a small star or circle. As more information is deduced about the feature (e.g., it appears to be a face, or a bar code, or a leaf), a bauble graphic reflecting that increased understanding can be displayed.
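
To make this proto-bauble progression concrete, here is a minimal sketch (in Python; all names and thresholds are illustrative assumptions, as the disclosure defines no API) of a bauble record that graduates from a generic glyph to a typed, more prominent one as understanding grows:

```python
from dataclasses import dataclass

@dataclass
class Bauble:
    """Hypothetical on-screen bauble record (names are illustrative)."""
    feature_id: int          # the visual feature this bauble tracks
    kind: str = "proto"      # stays "proto" until recognition refines it
    confidence: float = 0.1  # system's certainty about the feature
    size_px: int = 8         # drawn small while little is known

    def refine(self, kind: str, confidence: float) -> None:
        """Update the bauble as recognition improves; grow it on screen."""
        self.kind = kind                           # e.g. "face", "barcode", "leaf"
        self.confidence = confidence
        self.size_px = int(8 + 40 * confidence)    # more certain -> more visible

b = Bauble(feature_id=17)             # first glimmer: a small generic star/circle
b.refine("barcode", confidence=0.7)   # later: a barcode-styled glyph, enlarged
```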

“Baubles can be commercial in nature. In some environments, the display screen could become overrun with different baubles vying for the user’s attention. To address this, there can be a user-settable control, a visual verbosity control, that throttles how much information is presented on the screen. In addition or alternatively, a control can set the ratio of commercial to non-commercial baubles. (As with Google, collecting raw data from the system may prove more valuable in the long term than presenting ads to users.)

“Desirably, the baubles selected for display are those that provide the greatest value to the user, given the various dimensions of the current context. In some cases, baubles (both commercial and non-commercial) may be selected based on auction processes conducted in the cloud. The user’s own behavior can also influence the final roster of displayed baubles: baubles with which the user interacts most become favorites and are more likely to be displayed in the future, while those the user repeatedly ignores or dismisses may not be shown again.

“Another GUI control can indicate the user’s current interest (e.g., shopping, hiking, socializing, navigating, eating), and the presentation of baubles can be tuned accordingly.

“In some respects, the analogy of an old car radio, with a volume knob on one side and a tuning knob on the other, is apt. The volume knob corresponds to the user-settable control of screen busyness (visual verbosity). The tuning knob corresponds to the sensors, stored data, and user input that indicate what type of content is currently relevant to the user (e.g., the user’s likely intent).

“The illustrated ICP Baubles & Spatial Model component may borrow from, or be built upon, existing software tools that perform related functions. One is the ARToolKit, a set of freely available software developed by the Human Interface Technology Lab at the University of Washington (hitl.washington.edu/artoolkit/), now being developed further by ARToolworks, Inc., of Seattle (artoolworks.com). Another set of related tools is MV Tools, a popular library of machine vision functions.”

“FIG. 1 also shows several Recognition Agents (RAs). RAs are components that extract features and meaning from sensor data (e.g., pixels) and/or derivatives thereof (e.g., ‘keyvector’ data; c.f. US20100048242, WO10022185). Generally speaking, they serve to recognize, and extract meaning from, the available information. In one aspect, some RAs may be likened to specialized search engines: one may search for bar codes, another for faces, and so on. Other types of RAs are also possible, e.g., for processing audio information, or GPS and magnetometer data, among other tasks.

“Depending on session needs, RAs can execute locally or remotely; they may be loaded remotely and operated per cloud-negotiated business rules. RAs commonly take keyvector data as input and deposit their results, also in keyvector form, into a shared data structure such as the ICP blackboard (described below). They may provide elemental services that are combined by the ICP state machine in accordance with a solution tree.
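
This division of labor might be sketched as follows (a hypothetical interface; the disclosure does not specify one). Each RA advertises what keyvector data it can consume and returns a result keyvector for the blackboard:

```python
from abc import ABC, abstractmethod

class RecognitionAgent(ABC):
    """Hypothetical base class for RAs; names are illustrative."""

    @abstractmethod
    def accepts(self, keyvector: dict) -> bool:
        """Whether this agent can make use of the given keyvector."""

    @abstractmethod
    def process(self, keyvector: dict) -> dict:
        """Extract meaning; return a result keyvector for the blackboard."""

class BarcodeAgent(RecognitionAgent):
    def accepts(self, keyvector: dict) -> bool:
        return keyvector.get("type") == "image_region"

    def process(self, keyvector: dict) -> dict:
        # A real agent would run a barcode decoder over the pixel region;
        # here we only illustrate the shape of the exchange.
        return {"type": "barcode_candidate", "source": keyvector["id"]}
```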

“As with baubles, there can be competition involving RAs; that is, overlapping functionality may be offered by multiple RAs from different providers. The choice of which RA to use on a particular device, in a particular context, can depend on user selection, third-party reviews, cost, system constraints, re-usability of output data, and other criteria. Eventually, a Darwinian winnowing may occur, with the RAs that best meet users’ needs becoming prevalent.

A smart phone vendor may initially provide the phone with a default set of RAs. Some vendors may retain control over RA selection (a walled-garden approach), while others may encourage users to discover different RAs. Online marketplaces such as the Apple App Store may evolve to serve the RA market, and packages of RAs catering to different customer groups may emerge. The system may also provide a menu by which users can configure different RAs to load at different times.

Depending on the circumstances, some or all of these RAs may push functionality to the cloud. For example, if the device has a fast connection to the cloud and its battery is nearly empty (or if the device is being used for gaming, consuming much of its CPU), the local RA may perform only a small fraction of its task locally (e.g., administration) and ship the remainder to a cloud counterpart for execution there.
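
A toy sketch of that local-versus-cloud decision (the thresholds and names are invented for illustration; the disclosure states only the qualitative trade-off):

```python
def plan_split(battery_frac: float, link_mbps: float, cpu_busy: bool) -> str:
    """Toy heuristic for dividing RA work between phone and cloud.

    Thresholds are illustrative assumptions, not values from the disclosure.
    """
    if battery_frac < 0.15 and link_mbps > 5.0:
        return "cloud"    # nearly-dead battery, fast pipe: ship the work out
    if cpu_busy and link_mbps > 1.0:
        return "cloud"    # e.g. gaming is consuming the CPU
    if link_mbps < 0.1:
        return "local"    # no useful pipe: do what we can on the device
    return "hybrid"       # administration local, heavy lifting remote

print(plan_split(battery_frac=0.10, link_mbps=20.0, cpu_busy=False))  # "cloud"
```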

“As described elsewhere, processor time and other resources devoted to RAs can be controlled dynamically. This oversight can be performed by the dispatcher component of the ICP state machine. The ICP state machine can also manage the division of RA operation between local components and cloud counterparts.

“The ICP state machine can employ aspects modeled on the Android open-source operating system (e.g., developer.android.com/guide/topics/fundamentals.html), as well as on the iPhone and Symbian SDKs.”

“To the right in FIG. 1 is the Cloud & Business Rules Component, which serves as an interface to cloud-related processes. It can also perform administration for cloud auctions. It communicates with the cloud via a service provider interface (SPI), which can employ essentially any communication channel and protocol.

“Although the particular rules will differ, exemplary rules-based systems that can serve as models for this aspect include the Movielabs Content Rules & Rights arrangement (e.g., movielabs.com/CRR/) and the CNRI Handle System (e.g., handle.net).”

“To the left is a context engine, which provides and processes context information used by the system (e.g., What is the current location? What actions has the user performed in the past minute? In the past hour? etc.). The context component can link to remote data across an interface; the remote data can comprise any information relating to the user, such as activities, friends, social networks, consumed media, and geography. For example, if the device includes a music recognition agent, it may consult the playlists of the user’s friends on Facebook, and use this information to refine the model of the music to which the user listens.

“The context engine, and the cloud & business rules components, can have cloud-side counterparts; that is, this functionality can be distributed, with part local and part in the cloud.

“Cloud-based interactions can employ many of the tools and software already published for related cloud computing, such as Google’s App Engine (e.g., code.google.com/appengine/) and Amazon’s Elastic Compute Cloud (e.g., aws.amazon.com/ec2/).”

“At the bottom of FIG. 1 is the Blackboard and Clustering Engine.

“The blackboard can serve several functions: as a shared data repository; as a means of interprocess communication, allowing multiple recognition agents to observe and contribute feature objects (e.g., keyvectors) and to collaborate; as a data model, e.g., maintaining a visual representation that aids feature extraction and association across multiple agents; and as a feature factory, supporting feature-object instantiation (creation, destruction, notification, serialization, in the form of keyvectors, etc.).

Blackboard functionality can be implemented using the open-source blackboard software GBBopen (gbbopen.org). Another open-source implementation, which runs on the Java Virtual Machine and supports scripting in JavaScript, is the Blackboard Event Processor (code.google.com/p/blackboardeventprocessor/).”

The blackboard construct was popularized by Daniel Corkill (see, e.g., Corkill, ‘Collaborating Software: Blackboard and Multi-Agent Systems & the Future,’ Proceedings of the International Lisp Conference, 2003). However, implementation of the present technology does not require adherence to any particular blackboard concept.
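
The repository and notification roles described above can be sketched in a few lines (a toy stand-in, not the actual API of GBBopen or the Blackboard Event Processor):

```python
from collections import defaultdict
from typing import Callable

class Blackboard:
    """Toy blackboard: shared store plus change notification."""

    def __init__(self) -> None:
        self._store: dict[str, list[dict]] = defaultdict(list)
        self._watchers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def post(self, topic: str, keyvector: dict) -> None:
        """An agent contributes a feature object; observers are notified."""
        self._store[topic].append(keyvector)
        for callback in self._watchers[topic]:
            callback(keyvector)

    def watch(self, topic: str, callback: Callable[[dict], None]) -> None:
        """An agent registers interest in a class of feature objects."""
        self._watchers[topic].append(callback)

bb = Blackboard()
bb.watch("face_candidate", lambda kv: print("face agent sees", kv))
bb.post("face_candidate", {"bbox": (40, 60, 120, 140)})
```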

The Clustering Engine groups items of content data (e.g., pixels) together, e.g., in keyvectors. In one aspect, keyvectors can be regarded as audio-visual counterparts to text keywords: groupings of elements that are input to a process to obtain related results.

“Clustering can also be performed by low-level processes that generate new features from the image data, features that can be represented as lists of points, vectors, or image regions. Recognition operations often look for clusters of similar features, as these are potentially objects of interest. Such features can be posted to the blackboard.
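
As one illustration of the clustering idea (a deliberately simple single-link grouping; the disclosure does not prescribe an algorithm), nearby feature points can be clumped into candidate objects of interest:

```python
import math

def cluster_points(points: list[tuple[float, float]], radius: float) -> list[list[int]]:
    """Greedy single-link clustering of feature points (illustrative only).

    A point closer than `radius` to an existing cluster member joins that
    cluster; each resulting cluster is a candidate keyvector grouping.
    """
    clusters: list[list[int]] = []
    for i, p in enumerate(points):
        for members in clusters:
            if any(math.dist(p, points[j]) <= radius for j in members):
                members.append(i)
                break
        else:
            clusters.append([i])   # start a new cluster
    return clusters

# Two tight groups of corner-like features, plus one stray point:
pts = [(10, 10), (12, 11), (11, 13), (200, 50), (202, 51), (400, 300)]
print(cluster_points(pts, radius=5.0))   # -> [[0, 1, 2], [3, 4], [5]]
```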

“The ARToolKit, which was previously mentioned, can be used as a foundation for certain functionality.”

“Aspects of what is described above are detailed in the following section and other sections.”

“Local Device & Cloud Processing.”

FIG. 2 illustrates that disintermediated search should draw on the strengths and attributes of both the local device and the cloud. (The cloud ‘pipe,’ i.e., the communication channel between them, also figures into the mix.)

“The distribution of functionality between the local device and the cloud can vary from one implementation to the next. In one implementation, it is divided as follows:

“Local Functionality:”

“Cloud roles could include, e.g.:”

“The cloud facilitates disintermediated search, and often serves as the search destination (except in OCR cases, where results can generally be furnished based solely on sensor data);

“The currently-detailed technology draws inspiration from a variety of sources, including:

“FIG. 3 relates features of the system to different aspects of cognition. The Intuitive Computing Platform (ICP) Context Engine, for instance, applies cognitive processes of association, problem-solving, and solution determination to the system’s context. That is, the ICP Context Engine attempts to determine the user’s intent from history and other factors, and to use that information to inform aspects of system operation. The ICP Baubles & Spatial Model components serve similar roles in presenting information to the user and receiving input from the user.

“The ICP Blackboard and keyvectors are data structures used, among other things, in connection with the orientation aspects of the system.”

The ICP State Machine & Recognition Agent Management, in conjunction with the recognition agents, oversee recognition processes and the composition of services associated with recognition. The state machine is typically a real-time operating system. (These processes also involve, e.g., the ICP Blackboard and keyvectors.)

“Cloud Management & Business Rules deals with cloud registration, association, and session operations.

“Local Functionality to Support Baubles.”

“Some functions that one or more software components can provide in relation to baubles include:

To drive suppliers toward excellence and business success, the cloud should function as a competitive marketplace for services and for high-value bauble results. This market may be fostered by establishing a cloud auction, alongside baseline-quality non-commercial services.

“Users will demand the highest-quality and most relevant baubles, with commercial intrusion subordinated to their intentions and actual queries.

On the other side, buyers of screen real estate may fall into two classes: those willing to sponsor non-commercial baubles or sessions (e.g., with the goal of building a brand and gaining customers), and those who want to ‘qualify’ the screen real estate (e.g., in terms of the demographics of the users who will view it) and bid only on the commercial opportunities it represents.

Google, naturally, has built a large business around monetizing keywords through its auction process and sponsored-hyperlink presentation arrangements. It seems unlikely, however, that any single entity will similarly dominate all aspects of visual search. More likely, a middle layer of companies will emerge to assist with matching user queries to buyers.

“The user interface can include a control by which the user dismisses baubles of no interest, removing them from the screen and terminating any ongoing recognition agent process devoted to developing further information about that visual feature. Information about dismissed baubles can be logged in a data store and used to augment the user’s profile. If the user dismisses baubles for both Starbucks and independent coffee shops, the system may come to infer a lack of interest in all coffee shops; if the user dismisses baubles only for Starbucks, a narrower lack of interest can be discerned. This data store can be consulted in deciding which baubles to display in the future: baubles that were previously dismissed (or repeatedly dismissed) may not ordinarily be displayed again.

“Similarly, if the user taps a bauble, indicating interest, then that type or class of bauble (e.g., Starbucks, or coffee shops generally) can be given a higher score in the future when evaluating which baubles (among many candidates) to display.

Historical information regarding user interactions with baubles can be used in conjunction with current context information. For example, if the user dismisses baubles relating to coffee shops in the afternoon but not in the morning, the system may continue to present coffee-related baubles in the morning.
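
A minimal sketch of such context-conditioned preference tracking (the class, keys, and weightings below are assumptions, not prescribed by the disclosure):

```python
from collections import defaultdict

class BaublePreferences:
    """Toy tap/dismiss history, keyed by (bauble type, context bucket)."""

    def __init__(self) -> None:
        self._score: dict[tuple[str, str], float] = defaultdict(float)

    def record(self, kind: str, context: str, tapped: bool) -> None:
        # A tap is an affirmation; a dismissal is a (weaker) negative vote.
        self._score[(kind, context)] += 1.0 if tapped else -0.5

    def score(self, kind: str, context: str) -> float:
        return self._score[(kind, context)]

prefs = BaublePreferences()
prefs.record("coffee_shop", "afternoon", tapped=False)  # dismissed after lunch
prefs.record("coffee_shop", "morning", tapped=True)     # tapped at breakfast
print(prefs.score("coffee_shop", "morning"))    # 1.0 -> keep showing mornings
print(prefs.score("coffee_shop", "afternoon"))  # -0.5 -> suppress afternoons
```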

“The inherent complexity of the visual query problem implies that many baubles will be of an interim, proto-bauble class, inviting and guiding the user toward human-level filtering, interaction, and navigation deeper into the query process. The progression of baubles on a scene is thus a function of human input as well as other factors.

“When a user taps, or otherwise expresses interest in, a bauble, this action usually initiates a session relating to that bauble’s subject matter. The details of the session depend on the particular bauble. Some sessions may be commercial in nature (e.g., tapping on a Starbucks bauble may yield a one-dollar-off coupon). Some may be informational (e.g., tapping on a bauble associated with a statue may lead to a Wikipedia entry about the statue or a photo of the sculptor). A bauble indicating recognition of a person in a captured photo might lead to a variety of operations, e.g., presenting the person’s profile from a social networking site such as LinkedIn, or posting a face-annotated copy of the photo to the Facebook page of the recognized person or of the user. Sometimes tapping a bauble summons a menu of several operations, from which the user can select the desired action.

“Tapping on a bauble represents a victory of sorts for that bauble over others. If the tapped bauble is commercial in nature, it has won a contest for the user’s attention and for temporary use of real estate on the user’s screen. In some instances, an associated payment may be made, perhaps to the user, or perhaps to another party (e.g., an entity that secured the ‘win’ for its customer).

“A tapped bauble also represents a vote of preference, a Darwinian nod toward that bauble over others. Such affirmation can influence the selection of baubles displayed to users in the future, hopefully steering bauble suppliers toward user-serving excellence. (How many television commercials would survive if continued airtime depended on user approval?)

“As indicated, a given scene may present opportunities for display of many different baubles, often many more than the screen can usefully contain. The process of narrowing this universe of possibilities down to a manageable set can begin with the user.

“A variety of user inputs can be employed, starting with the verbosity control noted earlier, which simply sets a baseline for how busily the user wants the screen populated with baubles. Other controls can indicate topical preferences and the desired mix of commercial to non-commercial baubles.

“Another dimension of control is the user’s real-time expression of interest in particular areas of the screen, e.g., indicating features about which the user wants to learn more or otherwise interact. This interest can be indicated by tapping proto-baubles overlaid on such features, although proto-baubles are not required (e.g., the user may tap an undifferentiated area of the screen to focus processor attention on that portion of the image frame).

“Additional user input is contextual, including the many types of information detailed elsewhere (e.g., computing context, physical environment, user context, temporal context, historical context, etc.).

“External data can also feed into the bauble selection process, for example, information about interactions by third parties: which baubles did others choose to interact with? The weight given this factor can depend on a distance measure between the other user(s) and the present user, and between their context and the present context. For example, bauble preferences expressed by the actions of the user’s social friends, in similar contexts, can be given much greater weight than the actions of strangers in different circumstances.

“Commercial considerations are another factor that can affect which baubles occupy the user’s screen real estate, e.g., what a third party is willing to pay to briefly lease a bit of it. As noted, such issues can be factored into a cloud-based auction arrangement. The auction can also take into account the popularity of particular baubles with other users. In implementing this aspect, reference may be made to the Google technology for auctioning online advertising real estate (see, e.g., Levy, ‘Secret of Googlenomics: Data-Fueled Recipe Brews Profitability,’ Wired Magazine, May 22, 2009), a variant of a generalized second-price auction. Applicants detailed cloud-based auction arrangements in published PCT application WO2010022185.

“Briefly, such cloud-based arrangements may resemble advertising models based on click-through rates (CTR): entities pay varying amounts (monetary or in subsidies) to ensure that their services are used and/or that their baubles appear on users’ screens. Desirably, a dynamic marketplace emerges for recognition services offered, commercially and non-commercially, by recognition agents (e.g., a logo recognition agent with Starbucks logos pre-cached). Lessons can also be drawn from search-informed advertising.
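
For concreteness, a generalized second-price auction of the kind referenced above can be sketched as follows (simplified; production systems such as Google's also weight bids by quality or CTR factors):

```python
def gsp_auction(bids: dict[str, float], slots: int) -> list[tuple[str, float]]:
    """Generalized second-price auction: each winner pays the next bid down.

    Returns (bidder, price) per awarded slot; a simplified illustration.
    """
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winners = []
    for i in range(min(slots, len(ranked))):
        bidder, _ = ranked[i]
        # Pay the bid of the next-ranked competitor (0 if none remains).
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        winners.append((bidder, price))
    return winners

# Three parties bid for two commercial bauble slots on a user's screen:
print(gsp_auction({"StoreA": 0.50, "StoreB": 0.30, "CafeC": 0.10}, slots=2))
# -> [('StoreA', 0.3), ('StoreB', 0.1)]
```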

“Generally speaking, the difficulty in such auctions lies not in conducting the auction but in addressing the many variables involved. Among these are:

“(In some cases, bauble promoters may compete harder to place baubles on the screens of affluent users, as judged, e.g., by device type: a user with the latest, most expensive device, or one subscribing to a premium data service, may be a more attractive target for commercial attention than a user with an older device or a trailing-edge service. Third parties can likewise use profile data about the user, or data inferred from the circumstances, to determine which screens are the best targets for their baubles.)

In one implementation, a certain number of baubles (e.g., one to eight) may be allocated to commercial promotions (e.g., as determined by a Google-like auction procedure, and subject to the user’s tuning of commercial versus non-commercial baubles), while others may be selected based on non-commercial factors such as those noted earlier. These latter baubles may be chosen in rule-based fashion, e.g., by an algorithm that weights the different factors noted above to obtain a score for each bauble. The competing scores are then ranked, and the N highest-scoring baubles are presented on the screen (where N may be set by the verbosity control), as sketched after the next paragraph.

“In another implementation, there is no a priori allocation of slots for commercial baubles. Instead, commercial baubles are scored in a manner akin to the non-commercial ones (typically using different criteria, but scaled to a similar range of scores). The N highest-scoring baubles are then presented, and these may be all commercial, all non-commercial, or a mix.
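“The rule-based ranking described in the last two paragraphs might look like this (the factor names, weights, and scores are invented for illustration):

```python
def rank_baubles(candidates: list[dict], weights: dict[str, float], n: int) -> list[dict]:
    """Score each candidate bauble as a weighted sum of its factors; keep top N.

    `n` plays the role of the verbosity control; factor names are illustrative.
    """
    def score(c: dict) -> float:
        return sum(weights[k] * c.get(k, 0.0) for k in weights)
    return sorted(candidates, key=score, reverse=True)[:n]

weights = {"context_fit": 0.5, "tap_history": 0.3, "auction_value": 0.2}
candidates = [
    {"name": "statue_info", "context_fit": 0.9, "tap_history": 0.4},
    {"name": "coffee_coupon", "context_fit": 0.2, "auction_value": 0.8},
    {"name": "face_tag", "context_fit": 0.6, "tap_history": 0.8},
]
for b in rank_baubles(candidates, weights, n=2):
    print(b["name"])   # statue_info (0.57), face_tag (0.54); coupon (0.26) dropped
```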

“In yet another implementation, the mix of commercial to non-commercial baubles is a function of the user’s subscription service. Users at an entry level, paying an introductory rate, are presented with commercial baubles that are large in size and/or number; users paying for premium service are presented with smaller and/or fewer commercial baubles, or are given latitude to set their own display parameters.

The graphical indicia representing a bauble can be visually tailored to indicate its feature association, and may include animated elements to attract the user’s attention. The bauble provider may supply indicia in a range of sizes, allowing the display to be enlarged if the user zooms into that area of the displayed imagery or otherwise expresses potential interest. In some cases, the system must act as a cop, declining to present a proffered bauble. The system can automatically scale baubles down to a suitable size, and substitute generic indicia (such as a star) for indicia that are unsuitable or otherwise unavailable.

“Baubles other than those corresponding to visual features in the imagery may also be presented. For example, a bauble may indicate that the device knows its geolocation, or that the device knows the identity of its user. Various operational feedback can thus be provided to the user, regardless of image content. Some baubles can likewise provide feedback about the imagery other than the identification of particular features.

Each bauble can comprise a bit-mapped representation, or can be defined as a collection of graphical primitives. Typically, the bauble indicia are defined in plan view. The spatial model component of the software can attend to mapping the bauble’s projection onto the screen in accordance with surfaces discerned in the captured imagery, e.g., seemingly inclining, and perhaps perspectively warping, a bauble associated with a storefront viewed obliquely. Such issues are discussed further in the following section.
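
Before turning to that, the plan-view-to-surface mapping can be sketched with OpenCV’s homography utilities (the corner coordinates below are invented; the disclosure names no specific routine):

```python
import cv2
import numpy as np

# A bauble's indicia, authored in plan view (a flat 64x64 sprite).
bauble = np.zeros((64, 64, 3), dtype=np.uint8)
cv2.circle(bauble, (32, 32), 28, (0, 255, 255), thickness=-1)

# Corners of the sprite, and the quadrilateral of the discerned surface
# (e.g., an obliquely viewed storefront) in screen coordinates.
src = np.float32([[0, 0], [64, 0], [64, 64], [0, 64]])
dst = np.float32([[300, 120], [360, 140], [358, 230], [302, 250]])

# Homography mapping the plan view onto the inclined surface, then warp.
H = cv2.getPerspectiveTransform(src, dst)
screen = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in for the camera frame
warped = cv2.warpPerspective(bauble, H, (640, 480))
screen = cv2.add(screen, warped)                   # composite the warped bauble
```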

“Spatial Model/Engine”

A satisfactory projection and display of the 3D world onto a 2D screen is important to a pleasing user experience. The preferred system therefore includes software components (variously referenced as spatial model or spatial engine components) to serve these purposes.

Rendering the 3D world in 2D starts with understanding something about the 3D world. Where, in a bare frame of pixels, does one start? How are objects discerned and categorized? How can the motion of the imaged scene be tracked, so that baubles are repositioned accordingly? Fortunately, these problems have been confronted many times in many situations. Machine vision and video motion encoding are two of several fields offering extensive prior art on which the artisan can draw in connection with the present application.

“By way of first principles:

“Below is one proposal for codifying spatial understanding as an orthogonal process stream, as well as a source of context items and attribute items. It employs the construct of three ‘spacelevels,’ i.e., stages of spatial understanding.

Spacelevel 1 comprises basic scene analysis and parsing. Pixels are clumped into initial groupings. There is some basic understanding of the captured scene real estate, as well as of the display screen real estate, and some rudimentary knowledge of the flow of scene real estate across frames.

Geometrically, Spacelevel 1 exists in the context of a simple 2D plane. Spacelevel 1 operations include generating lists of 2D objects from the pixel data. The elemental operations performed by the OpenCV vision library (discussed below) fall in this realm. The smart phone’s local software may be fully capable of handling Spacelevel 1 operations, yielding rich lists of 2D objects.
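
A sketch of such a Spacelevel 1 pass using elemental OpenCV operations (OpenCV 4.x; the parameter choices are illustrative):

```python
import cv2

def spacelevel1_objects(frame_bgr):
    """Clump pixels into an initial list of 2D objects (Spacelevel 1 sketch).

    Uses elemental OpenCV operations: threshold, then connected contours.
    """
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    objects = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h > 100:          # ignore tiny specks (threshold is arbitrary)
            objects.append({"bbox": (x, y, w, h), "area": cv2.contourArea(c)})
    return objects
```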

“Spacelevel 2 is transitional, making some sense of the Spacelevel 1 2D primitives, but not yet reaching full 3D understanding. This level of analysis includes tasks that seek to relate the different Spacelevel 1 primitives, discerning how objects relate in a 2D context and looking for clues to 3D understanding. Such operations include identifying groups of objects, tracing lines, and noting patterns such as vanishing points, horizons, and notions of ‘up/down.’ Hints of ‘closer/further’ may also be uncovered. For example, a face has generally known dimensions: if a set of elements that appears to be a face is 40 pixels tall in a scene that is 480 pixels tall, a ‘further’ attribute can likely be gathered (in contrast to a face collection that is 400 pixels tall).
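
That ‘further’ inference follows from the standard pinhole-camera relation (supplied here for concreteness; the disclosure states only the heuristic):

```latex
% Pinhole model: image height h (pixels) of an object of real height H at
% distance Z, for a camera of focal length f (pixels):
\[ h = \frac{fH}{Z} \quad\Longrightarrow\quad Z = \frac{fH}{h} \]
% For the same face (H fixed) seen by the same camera (f fixed), a 40-pixel
% face is ten times as distant as a 400-pixel face:
\[ \frac{Z_{40}}{Z_{400}} = \frac{400}{40} = 10 \]
```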

“The cacophony of Spacelevel 1 data is thus distilled into a shorter, more meaningful list of object-related entities.

Spacelevel 2 may impose a GIS-like organization onto the scene and scene sequences: each identified clump, object, or region of interest can be assigned its own logical data layer, possibly with overlapping areas, and each layer may have an associated store of metadata. At this level, object continuity from frame to frame can also be discerned.

Geometrically, Spacelevel 2 recognizes that the captured pixel data is a camera’s projection onto a 2D frame. The discerned primitives and objects are therefore not taken as a complete characterization of reality, but rather as one view. They are understood in the context of the camera lens, whose position provides the perspective from which the pixel data should be interpreted.

“Spacelevel 2 operations tend to rely more heavily on cloud processing than Spacelevel 1.”

“In the exemplary embodiment, the Spatial Model components are general-purpose, distilling pixels into a more usable form on which each recognition agent can draw in performing its tasks. A design decision is which operations are so common that they should be performed in this way as a matter of course, and which should instead be left to individual recognition agents (whose results can still be shared, e.g., via the blackboard). The line can be drawn wherever the designer chooses, and it can sometimes shift dynamically during a phone’s operation, e.g., when a recognition agent requests additional common-services support.

“Spacelevel 3 operations are grounded in 3D. Whether or not the data reveals full 3D relationships (it generally will not), the analyses are premised on the understanding that the pixels represent a 3D world. Such understanding is useful, even integral, to certain object recognition processes.

Spacelevel 3 builds on the preceding levels of understanding, extending out to world correlation. The user is understood to be an observer within a world model that has a definite projection and space-time trajectory. Transformation equations mapping scene-to-world and world-to-scene can be applied, so that the system understands both where it is in space and where objects are, and has a framework for how things relate. These analysis phases draw on work in the gaming industry and on augmented reality engines.
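
In standard form (an assumption; the disclosure does not write out the equations), such scene/world transformations are the usual rigid-motion and camera-projection maps:

```latex
% World-to-camera (rigid motion), camera-to-world (its inverse), and the
% projection of a world point X onto pixel coordinates x via intrinsics K:
\[ \mathbf{X}_{cam} = R\,\mathbf{X}_{world} + \mathbf{t}, \qquad
   \mathbf{X}_{world} = R^{\top}(\mathbf{X}_{cam} - \mathbf{t}), \qquad
   \mathbf{x} \sim K\,[R \mid \mathbf{t}]\,\mathbf{X}_{world} \]
```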

“Unlike operations associated with Spacelevel 1 (and some associated with Spacelevel 2), operations associated with Spacelevel 3 are typically so specialized that they are not routinely performed on incoming data (at least not with current technology). Rather, these tasks are left to the particular recognition tasks that require particular 3D information.

“Some recognition agents may construct a virtual model of the user’s environment and populate it with sensed objects in their 3D context. A vehicle-driving monitor, for example, might look out the windshield of the user’s car, noting items and actions relevant to traffic safety. It might maintain a 3D model of the traffic environment and the actions within it. It might take note of the user’s wife (identified by another software agent, which posted the identification to the blackboard) driving her Subaru through a red light, within view of the user. 3D modeling in support of such functionality is possible, but it is not something the phone’s general services would routinely perform.

FIG. 4 conceptually illustrates the progression of spatial understanding from Spacelevel 1 to Spacelevel 2 to Spacelevel 3.

Summary for “Intuitive computing methods, systems”

“The subject matter of this disclosure could be considered technologies that allow users to interact with their environment using computers. The technology is well-suited for many applications because of its broad scope.

It is hard to present the information in an organized manner due to the wide range of topics covered. Many of the topics presented here are both based on and foundational to other sections, as will be apparent. The sections are presented in an arbitrary order, as it is necessary. Both the general principles and particular details of each section can be applied to other sections. The various combinations and permutations of the features in the sections are not detailed to prevent this disclosure becoming too long. The inventors intend to explicitly teach such combinations/permutations, but practicality requires that the detailed synthesis be left to those who ultimately implement systems in accordance with such teachings.”

“It is important to note that the presently described technology builds upon, and extends technology disclosed in earlier-cited patent applications. These documents will provide details about how applicants intend to apply the present technology and technical supplements to the disclosure.

“Cognition, Disintermediated search”

“Mobile devices such as cell phones are becoming cognition tools rather than communication tools. One aspect of cognition can be described as an activity that provides information about a person’s environment. Cognitive actions include:

“Seeing and hearing are two of the many functions that mobile devices can perform to aid in the informing process of a person’s environment.”

“Mobile devices are growing at an incredible rate. According to reports, many countries, including Finland, Sweden and Norway, Russia, Italy and the United Kingdom, have more cell phones than people. According to the GSM Association there are currently approximately 4 billion GSM or 3G phones in use. According to the International Telecommunications Union, there were 4.9 billion mobile cellular subscribers at the end 2009. Devices are only replaced on average once every 24 months because of the short upgrade cycle.

Mobile devices have attracted a lot of investment. Google, Microsoft and Nokia have all invested huge amounts in research and development to expand the functionality of their devices. The ingeniousness of such technologies is evident despite the intense and extensive efforts made by industry giants.

“?Disintermediated search,? Visual query is one of the most appealing applications for the upcoming generation of mobile devices.

Disintermediated search can be defined as search that minimizes or eliminates the role of the human in initiating the search. A smart phone, for example, may be constantly analyzing the visual environment and providing interpretations and related information, without having to be explicitly queried.

Disintermediated search could be considered the next step beyond Google. Google created a massive, monolithic system to organize all of the textual information available on the web. Google cannot comprehend the complexity and size of the visual world. There will be many parties involved, each playing a different role. There won’t be one search engine that can do it all. (Given the possibility of multiple parties being involved, maybe a different moniker would be “hyperintermediated search .?)”

“As you will see from the discussion, the present inventors believe visual search is very complicated in some aspects. It requires an intimate device/cloud orchestration and a highly interactive user interface on a mobile screen to produce a satisfactory experience. The utility of the results is dependent on user interaction and guidance. On the local device, a key challenge is deploying scarce CPU/memory/channel/power resources against a dizzying array of demands. To drive the evolution of technology on the cloud side, it is expected that auction-based service models will emerge. Disintermediated search will initially be sold as closed systems. However, to thrive, it will be available via extensible, open platforms. The technologies that provide the greatest value to users will ultimately be the most successful.

“Architectural View”

“FIG. “FIG. (It is important to note that the division into blocks of functionality can be somewhat arbitrary. The actual implementation might not be as described.

“The ICP Baubles and Spatial Model component deals with tasks that involve the viewing space, display, and their relations. The relevant functions include tracking, pose estimation, and ortho-rectified map in connection with overlaying Baubles on a visual environment.

“Baubles can be considered, in one aspect as augmented reality icons that appear on the screen along with features of captured imagery. These baubles can be interactive and user-tuned. For example, different baubles might appear on different screens for different users viewing the same scene.

“In some arrangements, baubles seem to indicate a first glimmer recognition by the system. The system will present a bauble when it begins to recognize something of potential interest. The bauble’s size, shape, color, or brightness may change as the system learns more about the feature. This could make it more visible and/or informative. The system’s resource manager, e.g. the ICP State Machine, can allocate disproportionately more processing resources for analysis of this feature than to other areas if the user taps on the bauble. The data store also stores information about the feature and the bauble. This allows the system’s resource manager (e.g., the ICP State Machine) to allocate disproportionately more processing resources to the analysis of that particular feature than other regions.

“When a bauble appears for the first time, there is little information about its visual features, except that it seems to be a visually distinct entity. A generic bauble, perhaps called a “proto-bauble?” at this level of understanding is possible. A small circle or star can be displayed. Once more information about the feature is uncovered (e.g., it appears to be a face or bar code or leaf), then a graphic called bauble can be displayed that reflects this increased understanding.

“Baubles are commercially available. The display screen can become overcrowded with different baubles competing for attention in certain environments. A user-settable control, a visual verbosity controller?that limits the amount of information displayed on the screen can be used to address this problem. A control that allows users to set a limit on the number of non-commercial baubles and commercial baubles can also be available. “(As with Google, collecting raw data from the system might prove more valuable over presenting ads to users in the long-term.)

“Desirably the baubles chosen for display are those that provide the greatest value to the user based on different dimensions of the current context. In some cases?both commercial and non-commercial?baubles may be selected based on auction processes conducted in the cloud. You can influence the final list of displayed baubles. The baubles with which the user has most interaction become favorites. They are more likely to be displayed again in the future. Those that the user ignores or dismisses repeatedly may not be displayed again.

“Another GUI control is available to indicate the user?s current interest (e.g. shopping, hiking or social, navigating, eating etc. You can adjust the way baubles are presented to suit your needs.

“In some ways, the analogy of an older car radio with a volume knob on one side and a tuning knob the other?is appropriate. The volume knob is a user-settable control of screen busyness (visual verbosity). The tuning knob is made up of sensors, stored data and user input. These indicate the type of content that is currently relevant to the user (e.g., their likely intent).

“The illustrated ICP Baubles & Spatial Model may be borrowed from or built based upon existing software tools that perform similar functions. One of these is the ARToolKit?a set of freely-available software that was developed by the Human Interface Technology Lab at University of Washington (hitl). Washington edu/artoolkit/) is being developed further by AR Toolworks, Inc. of Seattle (artoolworks). com). Another set of related tools is MV Tools?a popular collection of machine vision functions.”

“FIG. “FIG. RAs are components that extract feature and form from sensor data (e.g. pixels) and/or their derivatives (e.g.?keyvector). data, c.f., US20100048242, WO10022185). They are generally used to recognize and extract meaning from available information. Some RAs can be compared to specialized search engines in one aspect. One could search for bar codes, one might search for faces, and so on. Other types of (RAs are also possible, such as processing audio information, providing GPS data and magnetometer data, among other tasks.

“RAs can execute remotely or locally depending on the needs of the session. They can be loaded remotely and operated per cloud-negotiated business rules. Keyvector data is often used by RAs to input keyvector data into a shared data structure, such as the ICP blackboard (described below). They might provide elemental services, which are combined by the ICP state machines in accordance to a solution tree.

“As with baubles there could be a competition involving RAs. This means that overlapping functionality can be provided by multiple RAs from different providers. The decision about which RA is to be used on a specific device in a given context may depend on user selection, third-party reviews, cost, system constraints and re-usability. Eventually, Darwinian winnowing will occur. The RAs that meet the users’ needs may become more common.

A smart phone vendor might initially supply a phone with a default set RAs. While some vendors might retain control over RA selection?a closed-door approach?some may allow users to discover different RAs. The RA market may be served by online marketplaces like the Apple App Store. There may be packages of RAs that cater to different customer groups. A menu may be provided by the system that allows users to set different RAs to load at different times.

Depending on the circumstances, some, or all, of these RAs might push functionality to cloud. If the device has a fast connection to the cloud and the battery is nearly empty (or if the device is being used for gaming), the local RA may only do a fraction of the task locally (e.g. administration) and then ship the remainder to a cloud counterpart.

“As described elsewhere, processor time and other resources can be controlled dynamically. This oversight can be performed by the ICP state machine’s dispatcher component. The ICP state can also manage the division between cloud counterparts and local RA components.

“The ICP state machine can use aspects that are modeled from Android open-source operating system (e.g. developer). android com/guide/topics/fundamentals.html), as well as from the iPhone and Symbian SDKs.”

“To the right, in FIG. 1. is the Cloud & Business Rules Component. It acts as an interface to cloud-related processes. It is also capable of performing administration for cloud auctions. It communicates with the cloud via a service provider interface, or SPI. This can use virtually any communication channel and protocol.

“Even though the rules may be different, there are exemplary rules-based systems that can serve as models for this aspect. These include the Movielabs Content Rules & Rights arrangement (e.g. movielabs). com/CRR/) and the CNRI Handle System, (e.g. handle net/).”

“To the left, a context engine provides context information to the system and processes it (e.g. What is the current place?). What actions did the user take in the last minute? What actions has the user taken in the last hour? etc.). A context component can be used to link to remote data via an interface. Remote data can include any information that is related to the user, such as activities, friends, social networks, consumed media, geography, or anything else. If the device has a music recognition agent it might consult the playlists of user friends on Facebook. This information may be used to refine the model of music the user listens too.

“Context engine and cloud & business rule components can have cloud-side counterparts.” This means that this functionality can be distributed with part local and one in the cloud.

“Cloud-based interactions can use many of the tools, software and software already published by Google’s App engine (e.g. code) Google http://www.appengine.com/) and Amazon Elastic Compute Cloud (e.g. aws). Amazon com/ec2/).”

“At the bottom of FIG. 1 is the Blackboard and Clustering Engine.

“The blackboard can serve various functions, including as a shared data repository, and as a means for interprocess communication?allowing multiple recognition agents to observe and contribute feature objects (e.g., keyvectors), and collaborate. It can also serve as a data model, such as maintaining a visual representation that aids in feature extraction and association across multiple agents. It can also be used as a feature factory and allow for feature object instantiation (creation, destruction, notification, serialization, in the form keyvectors etc ).

Blackboard functionality can be used with the open-source blackboard software GBBopen. org). The Blackboard Event Processor is another open-source implementation that runs on Java Virtual Machine and supports JavaScript scripting (code Google com/p/blackboardeventprocessor/).”

Daniel Corkill popularized the blackboard construction. Corkill, Collaborating Software?Blackboard, Multi-Agent Systems & the Future Proceedings of the International Lisp Conference 2003. Implementation of the current technology doesn’t require any particular concept.

The Clustering Engine group items of content data (e.g. pixels) together, e.g. in keyvectors. One aspect of keyvectors could be analogized to audio-visual counterparts to text keywords?a grouping elements that are input into a process to achieve related results.

“Clustering can also be done by low-level processes that create new features from image data.features that can represented as lists, vectors, or image regions. Recognition operations often look for clusters of similar features because they could potentially be objects of interest. These features can be added to the blackboard. ”

“The ARToolKit, which was previously mentioned, can be used as a foundation for certain functionality.”

“Aspects of what is described above are detailed in the following section and other sections.”

“Local Device & Cloud Processing.”

FIG. 2. Disintermediated search should be based on strengths/attributes both of the local device as well as the cloud. (The cloud?pipe? ”

“The distribution of functionality between local devices and cloud can vary from one implementation to the next. It is broken down as follows in one implementation:

“Local Functionality:”

“Cloud roles could include, e.g. :”

“The Cloud facilitates disintermediated searching, and often serves as the destination (except in OCR cases, where results can generally be provided solely on sensor data);

“The currently-detailed technology draws inspiration from a variety of sources, including:

“FIG. “FIG. An Intuitive Computing Platform’s (ICP) Context Engine, for instance, applies cognitive processes such as association, problem solving status and determining solutions to the context of the system. The ICP Context Engine, in other words, attempts to determine user intent using history and then use that information to inform system operations. The ICP Baubles & Spatial Model components also serve the same functions, in that they present information to the user and receive input from them.

“The ICP Blackboard, keyvectors, and keyvectors are data structure used, among other things, in conjunction with orientation aspects of this system.”

In conjunction with recognition agents, “ICP State Machine & Recognition Agent Management” oversees recognition processes and the composition of services related to recognition. The state machine is a typical real-time operating system. These processes include, e.g. the ICP Blackboard or keyvectors.

“Cloud Management & Business Rules” deals with cloud association, registration, and session operations.

“Local Functionality to Support Baubles.”

“Some functions that one or more software components can provide in relation to baubles include:

The cloud should be a competitive market for services and high-value bauble results. This will drive suppliers to excellence and ensure business success. This market could be driven by the establishment of a cloud auction with non-commercial, baseline quality services.

“Users demand the best quality and most relevant baubles. This is because they are more concerned about their intentions and real queries than any commercial intrusion.

On the other hand, screen real estate buyers may be divided into two groups: those who are willing to offer non-commercial baubles or sessions (e.g. with the goal to gain a customer for branding) and those who want to?qualify. The screen real estate (e.g. in terms of the demographics and users who will see it), and only bid on the commercial opportunities that it represents.

Google has, naturally, built a large business around monetizing its key word, auction process, and sponsored hyperlink presentation. arrangements. It seems unlikely, however, that one entity will dominate all aspects visual search. It is more likely that there will be a middle layer of companies who assist with the buyer-matchmaking and user queries.

“The user interface might include a control that allows the user to dismiss baubles of no interest. This will remove them from the screen and terminate any ongoing recognition agent process that is devoted to further information about that visual feature. The information about dismissed baubles can be stored in a data storage and used to enhance the user’s profile. The system might conclude that the user is not interested in independent coffee shops or Starbucks coffee shops if he/she dismisses all baubles. If the user only dismisses baubles for Starbucks coffee shops, it is possible to discern a narrower lack of interest. The data store can be used to help you find future displays of baubles. Baubles that were previously dismissed or repeatedly removed may not be displayed again.

“Similarly, if the user taps a bauble?indicating an interest?then that type of bauble (e.g. Starbucks, coffee shops, etc.) can be given higher scores in the future when evaluating which baubles (among many others) to display.

Historical information regarding user interactions with baubles may be combined with current context information. If the user rejects baubles related to coffee shops in afternoons but not mornings, the system might continue to display coffee-related baubles at morning.

“The inherent complexity of the visual query problem means that many baubles are of an interim, or protobauble class?” inviting and guiding the user for human-level filtering, interaction and navigation deeper into this query process. As such, the progression of baubles on a scene is a function of human input as well as other factors.

“When a user taps or otherwise expresses an interest in a bauble it usually starts a session related to that bauble’s subject matter. The particular bauble will determine the details of the session. Some sessions might be commercial (e.g. tapping on a Starbucks symbol may give you a $1 off coupon). Some sessions may be informative (e.g. tapping on a bauble that is associated with a statue could lead to a Wikipedia entry or a photo of the sculptor). A bauble that indicates recognition of a person in a captured photo might allow for a variety of operations, including the presentation of a profile from a social networking site, such as LinkedIn, or posting a face-annotated copy to the Facebook page of either the person or the user. Tapping a bauble can sometimes bring up a list of operations from which the user can choose the desired action.

“Tapping on a bauble is a win for the bauble over other. If the tapped Bauble is commercial, it has won the user’s attention and temporary use of screen real estate. An associated payment might be made in some cases?perhaps to the user or to another party (e.g. to an entity that secured a?win? ”

“A tapped Bauble also represents a vote for preference?” a Darwinian nod towards that bauble. This affirmation can influence the selections of baubles to be displayed to future users. This will hopefully lead to bauble suppliers moving in a positive direction towards user-serving excellence. “How many television commercials could survive if users only had regular airtime?

“As indicated by the user, a given scene may offer opportunities for display of many different baubles?often more than the screen can contain. This is where the user can start to narrow down the possibilities to a manageable number.

“A variety can be used to input user information, starting with the verbosity control mentioned earlier. This simply sets a baseline for how busy the user would like the screen to display baubles. Other controls can indicate topical preferences and a specific mix of commercial and non-commercial.

“Another dimension to control is the user’s actual-time expressions of interest in certain areas of the screen. For example, the user may tap on a specific area of the screen to indicate features that they are interested in learning more about or interact with. You can indicate your interest by tapping on protobaubles that are overlaid on these features. However, protobaubles do not have to be used (e.g., you may tap on an undifferentiated portion of the screen to direct processor attention to that area).

“Additional user input can also be called contextual” and includes the many types of information described elsewhere (e.g. computing context, physical environment, user context, user context, temporal context, historical context, etc.).

“External data can also inform the bauble-selection process. This could include information about interactions by third parties with baubles. The weight given this factor may depend on the distance between the other users and the current user, and on the similarity of their contexts: bauble preferences expressed through the actions of social friends, for instance, can be weighted much more heavily than actions taken by strangers in other circumstances.
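
A hedged sketch of such weighting follows. The particular formula, with its friend bonus, distance decay, and context bonus, is an assumption rather than the patent’s method:

```python
# Hypothetical weighting of a third party's bauble interaction.
def interaction_weight(is_friend, distance_km, same_context):
    w = 5.0 if is_friend else 1.0        # a friend's tap counts more
    w *= 1.0 / (1.0 + distance_km)       # decay with physical distance
    if same_context:
        w *= 2.0                         # similar context, e.g. same venue type
    return w

print(interaction_weight(is_friend=True, distance_km=0.2, same_context=True))
```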

“Commercial considerations are another factor that can affect the allocation of the user’s screen real estate, e.g., what price a third party is willing to pay to briefly lease a portion of it. Such issues can be resolved in a cloud-based auction arrangement, which may also take into account the popularity of particular baubles among other users. This aspect may be implemented with reference to Google’s technology for auctioning online advertising real estate, i.e., a generalized second-price auction (see, e.g., Levy, “Secret of Googlenomics: Data-Fueled Recipe Bakes Profitability,” Wired Magazine, May 22, 2009). Applicants detailed cloud-based auction arrangements in published PCT application WO2010022185.
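
For readers unfamiliar with it, a generalized second-price auction ranks the bids and charges each winner the next-highest bid for its slot. The sketch below illustrates that mechanism with made-up bidders; it is not drawn from WO2010022185:

```python
# Minimal generalized second-price (GSP) auction sketch.
def gsp_auction(bids, num_slots):
    """bids: {bidder: amount}; returns [(bidder, price_paid)] per slot won."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    results = []
    for i in range(min(num_slots, len(ranked))):
        bidder, _ = ranked[i]
        # Each winner pays the next bidder's amount (or 0 if none remain).
        price = ranked[i + 1][1] if i + 1 < len(ranked) else 0.0
        results.append((bidder, price))
    return results

print(gsp_auction({"Starbucks": 0.50, "LocalCafe": 0.30, "Kiosk": 0.10}, 2))
# -> [('Starbucks', 0.3), ('LocalCafe', 0.1)]
```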

“Briefly, it is anticipated that such cloud-based arrangements will resemble advertising models based on click-through rates (CTR): entities will pay various amounts, in money or subsidy, to ensure that their services are used and/or that their baubles appear on users’ screens. Desirably, a dynamic market arises for recognition services offered commercially and non-commercially by recognition agents (e.g., a logo-recognition agent with Starbucks logos pre-cached). Lessons can also be drawn from search-informed advertising.

“Generally speaking, the difficulty in such auctions lies not in conducting them, but in addressing the many variables involved. Consider, for example:

“In some cases, bauble promoters might try harder to place their baubles on the screens of well-heeled users, inferring affluence from device type: a user with the newest, most expensive device, or a premium data service, may attract more commercial attention than one with an older device or a trailing-edge service. Third parties can likewise use profile data about the user, or data inferred from the circumstances, to determine which screens make the best targets for their baubles.

In one implementation, a certain number of baubles (e.g., one to eight) may be allocated to commercial promotions (e.g., as determined by a Google-like auction procedure, and subject to user tuning of the commercial vs. non-commercial mix), while the remainder are selected on non-commercial factors such as those noted earlier. These baubles may be chosen in rule-based fashion: an algorithm weights the different factors to produce a score for each candidate bauble, the scores of the competing baubles are ranked, and the N highest-scoring baubles are presented on the screen (where N may be set by the verbosity control).
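
The rule-based scoring just described might look like the following; the factor names and weights are illustrative assumptions:

```python
# Hypothetical weighted scoring and top-N selection of baubles.
WEIGHTS = {"history": 2.0, "context": 1.5, "social": 1.0, "auction": 1.0}

def score(bauble):
    return sum(WEIGHTS[f] * bauble["factors"].get(f, 0.0) for f in WEIGHTS)

def select_baubles(candidates, verbosity_n):
    """Rank candidates by weighted score; show the N best (verbosity control)."""
    return sorted(candidates, key=score, reverse=True)[:verbosity_n]

candidates = [
    {"name": "coffee", "factors": {"history": 0.9, "auction": 0.5}},
    {"name": "statue", "factors": {"context": 0.8, "social": 0.4}},
    {"name": "bus_stop", "factors": {"context": 0.2}},
]
print([b["name"] for b in select_baubles(candidates, verbosity_n=2)])
# -> ['coffee', 'statue']
```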

“In another implementation there is no a priori allocation for commercial baubles. They are scored in a manner akin to the non-commercial baubles, albeit by different criteria scaled to a similar range, and the N highest-scoring baubles are displayed, whether commercial, non-commercial, or a mix.

“In another implementation, the mix of commercial to non-commercial baubles depends on the user’s subscription level. Users at an entry level, paying an introductory rate, are presented with commercial baubles that are large in size and/or number; users of premium services see smaller or fewer commercial baubles, and are also given latitude to set their own display parameters.

The graphical indicia of a bauble may be visually tailored to indicate the feature with which it is associated, and may include animated elements to attract the user’s attention. The bauble provider may supply indicia in a range of sizes, allowing them to scale up if the user zooms into that area of the displayed imagery or otherwise expresses interest. Sometimes the system must act as a cop and decline to present a proffered bauble; it may automatically scale baubles down to a suitable size, and substitute generic indicia (such as a star) for indicia that are unsuitable or otherwise unavailable.

“Baubles may also be presented other than in association with visual features discerned from the imagery. A bauble may, for example, show that the device knows its location, or knows the identity of its user. Various operational feedback can thus be given to the user, regardless of image content. Some baubles may likewise provide image feedback other than identification of particular features.

Each bauble can comprise a bit-mapped representation or a collection of graphical primitives. Typically the bauble indicia are defined in plan view; the spatial-model component of the software attends to mapping their projection onto the screen in accordance with surfaces discerned in the captured imagery, e.g., appearing to tilt, and perhaps perspectively warping, for a storefront viewed obliquely. Such issues are discussed in the next section.
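
Mapping a plan-view bauble onto an obliquely viewed surface is a standard perspective-warp operation. Here is a minimal OpenCV sketch, with made-up corner coordinates standing in for a detected storefront quad:

```python
# Sketch: warp a plan-view bauble icon onto a detected surface (coordinates
# are illustrative, not from the patent).
import numpy as np
import cv2

bauble = np.zeros((64, 64, 3), dtype=np.uint8)
cv2.circle(bauble, (32, 32), 28, (0, 255, 255), -1)   # stand-in plan-view icon

# Plan-view corners of the bauble, and the detected storefront quad on screen.
src = np.float32([[0, 0], [64, 0], [64, 64], [0, 64]])
dst = np.float32([[100, 120], [150, 110], [155, 180], [98, 175]])

H = cv2.getPerspectiveTransform(src, dst)             # 3x3 homography
frame = np.zeros((480, 640, 3), dtype=np.uint8)       # stand-in camera frame
warped = cv2.warpPerspective(bauble, H, (640, 480))
frame = cv2.add(frame, warped)                        # composite onto the frame
```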

“Spatial Model/Engine”

A satisfactory projection and display of the 3D world onto a 2D screen are important to a pleasing user experience. The preferred system accordingly includes software components (variously termed a spatial model or spatial engine) to serve these purposes.

Rendering the 3D world in 2D begins with understanding something of the 3D world itself. Where does one begin with a bare frame of pixels? How are objects distinguished and categorized? How is the motion of the imaged scene tracked, so that baubles can be repositioned accordingly? Happily, these problems have been confronted many times in many situations; machine vision and video motion encoding are two of several fields offering useful prior art on which the artisan can draw in connection with the present application.

“By way of first principles, what follows is a proposal to codify spatial understanding as an orthogonal stream of processes, together with context items and attribute items. It employs the construct of three ‘spacelevels,’ detailed below.

Spacelevel 1 comprises basic scene analysis and parsing. Pixels are clumped into initial groupings. There is some awareness of the captured scene’s real estate, as well as of display-screen real estate, and some notion of the flow of scene real estate across frames.

Geometrically, Spacelevel 1 exists in the context of a simple 2D plane. Spacelevel 1 operations include generating lists of 2D objects from pixel data; the elemental operations performed by the OpenCV vision library (discussed later) fall in this category. Local software on the smart phone may be adept at Spacelevel 1 operations, producing rich lists of 2D objects on the device itself.
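
By way of illustration, the kind of elemental OpenCV operation contemplated here might look like the following, which clumps pixels into connected regions and emits a list of 2D objects. The input file and area threshold are assumptions:

```python
# Sketch of a Spacelevel-1-style operation: threshold a frame and list the
# resulting 2D objects as bounding boxes.
import cv2

frame = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)   # hypothetical input
_, binary = cv2.threshold(frame, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)

objects_2d = [
    {"bbox": cv2.boundingRect(c), "area": cv2.contourArea(c)}
    for c in contours
    if cv2.contourArea(c) > 50       # ignore tiny pixel clumps
]
print(f"{len(objects_2d)} candidate 2D objects")
```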

“Spacelevel 2 is transitional, making some sense of the Spacelevel 1 2D primitives without yet reaching a full 3D understanding. This level of analysis includes tasks that seek to relate different Spacelevel 1 primitives: discerning how objects relate in a 2D context, and looking for clues to 3D understanding. Such operations include identifying groups of objects, tracing lines, noting patterns, and identifying vanishing points, horizons, and notions of ‘up/down.’ Hints of ‘closer/further’ may also be gleaned. A face, for example, has generally known dimensions; if a set of elements appearing to represent a face is only 40 pixels tall in a scene 480 pixels tall, a ‘further’ attribute is likely (in contrast to a facial collection 400 pixels tall).
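
The face example reduces to the pinhole-camera relation: distance is approximately focal length times real height divided by apparent height. A sketch, with an assumed focal length and face height:

```python
# Hypothetical 'closer/further' inference from apparent face size.
FOCAL_LENGTH_PX = 500        # assumed camera focal length, in pixels
FACE_HEIGHT_M = 0.24         # assumed typical adult face height, in meters

def face_distance_m(face_pixels):
    """Approximate subject distance from apparent face height in pixels."""
    return FOCAL_LENGTH_PX * FACE_HEIGHT_M / face_pixels

print(face_distance_m(40))   # ~3.0 m -> 'further'
print(face_distance_m(400))  # ~0.3 m -> 'closer'
```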

“The Spacelevel 1 cacophony is reduced to a smaller, more meaningful list of object-related entities.

Spacelevel 2 could impose a GIS-like organization onto the scene or scene sequences. For example, each identified clump, object, or region of interest may be assigned its own logical data layer, possibly with overlapping areas, and each layer may have its own metadata store. At this level, object continuity, frame to frame, can be discerned.”
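
Such a GIS-like organization might be modeled as per-object layers, each with its own region and metadata store; the structure below is an assumption for illustration:

```python
# Hypothetical GIS-like layering of a parsed scene.
from dataclasses import dataclass, field

@dataclass
class SceneLayer:
    object_id: int                 # stable across frames, for object continuity
    region: tuple                  # bounding box (x, y, w, h); layers may overlap
    metadata: dict = field(default_factory=dict)   # per-layer metadata store

layers = [
    SceneLayer(1, (100, 50, 80, 120), {"label": "face?", "confidence": 0.6}),
    SceneLayer(2, (90, 40, 200, 200), {"label": "storefront"}),  # overlaps layer 1
]
```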

Geometrically, Spacelevel 2 recognizes that the captured pixel data are a camera’s projection of the scene onto a 2D frame. The primitives and objects are not taken as a complete characterization of reality, but rather as one view from a particular perspective. The camera lens is the context from which the objects are seen; its position provides the vantage through which the pixel data can be understood.

“Spacelevel 2 operations tend to rely more heavily on cloud processing than Spacelevel 1.”

“In the exemplary embodiment, the spatial-model components are general purpose, distilling pixel data into a more usable form on which each recognition agent can draw in performing its tasks. A judgment call is required, however, as to which operations are so common that they should be performed as a matter of course, and which should be relegated to individual recognition agents (whose results can still be shared, e.g., via the blackboard). The designer may draw this line arbitrarily, with freedom to decide which operations fall on which side. Sometimes the line shifts dynamically in the course of a phone’s operation, e.g., when a recognition agent requests additional common-services support.
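
One way to picture this shifting line is a registry of common services that agents can petition to extend, with results shared via the blackboard. Everything in this sketch is hypothetical:

```python
# Hypothetical sketch of the common-services / recognition-agent divide.
blackboard = {}                 # results pool shared among agents
common_services = {"edges"}     # operations run on every frame by default

def request_common_service(op_name):
    """A recognition agent asks that an operation run as a matter of course."""
    common_services.add(op_name)    # the dividing line shifts dynamically

def process_frame(frame, operations):
    for name in common_services:
        blackboard[name] = operations[name](frame)   # shared via the blackboard

request_common_service("corners")   # an agent requests common-services support
```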

“Spacelevel 3 operations are based in 3D. The analyses assume that the pixels represent a 3D world, whether or not the data reveal the full 3D relationships (they usually won’t). Such understanding is important, even integral, to certain object-recognition processes.

Spacelevel 3 builds on the preceding levels of understanding, extending out to world correlation. The user is understood to be an observer within a world model that follows a certain projection and time trajectory. Transformation equations mapping scene-to-world and world-to-scene can be applied, so that the system understands both where it is and where objects are in space, and has a framework for understanding how they relate. These analysis phases draw on work from the gaming industry and augmented-reality engines.
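
Such scene-to-world and world-to-scene mappings are conventionally expressed as a 4x4 camera-pose matrix and its inverse. The pose values below are illustrative:

```python
# Sketch of world correlation via a camera-pose transform.
import numpy as np

# Camera pose: rotation (identity here, for simplicity) plus translation.
pose = np.eye(4)
pose[:3, 3] = [2.0, 0.0, 1.5]          # camera position in world coordinates

def camera_to_world(p_cam):
    """Map a point from the camera's frame into the world model."""
    return (pose @ np.append(p_cam, 1.0))[:3]

def world_to_camera(p_world):
    """Map a world point back into the camera's view."""
    return (np.linalg.inv(pose) @ np.append(p_world, 1.0))[:3]

p_world = camera_to_world([0.0, 0.0, 3.0])   # a point 3 m ahead of the camera
print(p_world, world_to_camera(p_world))     # round-trips back to [0, 0, 3]
```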

“Unlike operations associated with Spacelevel 1 (and some associated with Spacelevel 2), Spacelevel 3 operations are typically so specialized that they are not routinely performed on incoming data (at least not with current technology). These tasks are instead left to particular recognition agents that may need specific 3D information.

“Some recognition agents may construct a virtual model of the environment and populate it with the objects relevant to their 3D context. A vehicle-driving monitor, for example, might watch out the windshield of the user’s car, noting items and actions relevant to traffic safety and maintaining a 3D model of the traffic environment and the actions within it. It might take note of the user’s wife (identified by another software agent, which posted the identification to the blackboard) driving her Subaru through a red light within the user’s view. 3D modeling to support such functionality is possible, but it is not something the phone’s general services would routinely perform.

FIG. 4 conceptually illustrates the advance of spatial understanding from Spacelevel 1, to Spacelevel 2, to Spacelevel 3.
