Invented by Anson An-Chun Tsao, Yang Cao, Pusheng Zhang, Microsoft Technology Licensing LLC
The Microsoft Technology Licensing LLC invention works as follows:
Using the same source code for multiple data sources facilitates porting between local, cluster, and cloud execution. A data source identifier recited in the source code maps to different data sources on different execution targets. A resolution API is used to produce executable codes tailored to successive targets without changing the developer's source code. An editable, project-specific data source mapping is maintained in a file or project property and included in a distributable software package with the executable code. This reduces the burden on developers to determine execution targets and to explicitly handle different execution locations in source code. The source code can be freed of absolute-path data source identifiers and of code that detects the execution location. Source-to-source translation injects calls that create a folder, file, or cloud container. Data source resolution can be based on a relative path rooted at the data source identifier.
Background for Transformational context aware data source management
A developer can modify a computer program to run on multiple computers to benefit from additional processing power or memory. For example, a developer might want to run a program originally designed for a desktop computer on a cluster of computers or in a cloud. A developer may also wish to apply particular processing logic to different data sources without having to specify each one as a parameter in the logic.
A computing cluster is a collection of computers connected through a local area network (LAN) or another relatively fast communication mechanism. When viewed outside of the cluster, computers in a cluster work together to act like one powerful computer.
A computing cloud is a pool of configurable computing resources (e.g. servers, storage, software, and applications) that are shared over a network. Cloud resources can be made available to users quickly and released easily as a user's computing demands increase or decrease. Cloud servers can provide applications to a web browser, eliminating the need for a client-side installation.
Many definitions have been given of cloud and cluster. Here, a cluster is a group of fewer than 1,000 processing cores, or a group that resides in a single building. A cloud, on the other hand, has a thousand cores or more, or resides in more than one building.
The process of converting a program to run in a cluster or cloud can be arduous and error prone. It can also be difficult for a developer to run a program against datasets of different sizes. Some embodiments described in this document provide ways to improve program portability by giving access to a variety of datasets, for example by managing a single algorithm's source code with multiple alternative computation data sources.
For instance, in certain scenarios there is a data source mapping in which a data source identifier is mapped to multiple data sources of substantially different sizes on respective execution targets. For example, the identifier may be mapped to several hundred gigabytes of local storage, a few terabytes of cluster storage, or dozens or hundreds of terabytes of cloud storage. The embodiment receives source code that recites the data source identifier as a data source. After identifying a first execution target, the embodiment automatically produces a first executable code from the source code. When executed on the first target, this executable uses the first target's mapped data source as the data source identified by the data source identifier. After identifying a second execution target whose data source differs from that of the first target, the embodiment automatically produces, from the same source code, a second executable code tailored to the second target. When executed on the second target, the second executable uses the second target's mapped data source. This porting is achieved without the developer having to modify the source code.
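The scenario above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the mapping contents, the target names, and the functions summarize and produce_executable are all hypothetical, and a closure stands in for an "executable code".

```python
# Hypothetical data source mapping: one identifier, one data source per
# execution target. Paths and URIs are illustrative assumptions only.
DATA_SOURCE_MAPPING = {
    "sales_data": {
        "local": "C:/data/sales_sample",
        "cluster": "//headnode/share/sales_full",
        "cloud": "https://example.invalid/containers/sales_full",
    },
}


def summarize(data_source: str) -> str:
    """Stand-in for the developer's algorithm; it never names a target."""
    return f"summarizing records from {data_source}"


def produce_executable(algorithm, identifier: str, target: str):
    """Bind the unchanged algorithm to the data source mapped for `target`.

    The returned closure plays the role of a target-specific executable.
    """
    try:
        data_source = DATA_SOURCE_MAPPING[identifier][target]
    except KeyError as exc:
        raise LookupError(
            f"no data source mapped for {identifier!r} on {target!r}"
        ) from exc

    def executable() -> str:
        return algorithm(data_source)

    return executable


if __name__ == "__main__":
    # Same source (summarize), two tailored executables.
    first = produce_executable(summarize, "sales_data", "local")
    second = produce_executable(summarize, "sales_data", "cloud")
    print(first())
    print(second())
```

The point of the sketch is that summarize is written once and only the mapping lookup differs between the two products.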
In some embodiments, data source mappings are maintained in textual configuration files, while in others they are read from nontextual properties, project headers or other project-specific structures. In some embodiments the data source map is included in a software package that can be distributed, which allows portability, regardless of the execution locations supported by executable code within the package.
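As one possible reading of the textual configuration file variant, the sketch below keeps the mapping in a JSON file that a developer could edit by hand or through a small helper. The JSON format, the file name, and the helpers load_mapping, save_mapping, and remap are assumptions made for illustration; the source does not specify a file format.

```python
import json
import tempfile
from pathlib import Path


def load_mapping(path: Path) -> dict:
    """Read the data source mapping from a textual configuration file."""
    return json.loads(path.read_text(encoding="utf-8"))


def save_mapping(path: Path, mapping: dict) -> None:
    """Write the mapping back in a stable, human-editable layout."""
    path.write_text(json.dumps(mapping, indent=2, sort_keys=True), encoding="utf-8")


def remap(path: Path, identifier: str, target: str, data_source: str) -> dict:
    """Add or change the data source mapped to `identifier` for `target`."""
    mapping = load_mapping(path) if path.exists() else {}
    mapping.setdefault(identifier, {})[target] = data_source
    save_mapping(path, mapping)
    return mapping


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical per-project file that would ship inside the package.
        config = Path(tmp) / "datasources.json"
        remap(config, "sales_data", "local", "data/sales_sample")
        remap(config, "sales_data", "cloud", "https://example.invalid/sales_full")
        print(config.read_text(encoding="utf-8"))
```

Because the file travels with the package, the same package can be pointed at another execution location by editing the mapping rather than the source code.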
An embodiment can reduce or eliminate the burden on developers of determining execution targets and explicitly handling different execution locations within the source code. In some embodiments the source code does not include absolute data source identifiers. An embodiment can give a developer the agility to work quickly through iterations of source code against a small subset of data on one local computer, and then test selected program iterations against larger data in a cluster or cloud.
From the developer's point of view, certain embodiments obtain a data source mapping in which a specific data source identifier is mapped to multiple data sources at respective execution targets. The embodiments then automatically generate data source resolutions from the data source mapping and the source code, without requiring the developer to make any changes to the source code. Different data source resolutions correspond to different execution targets, for example a cloud data source resolution that uses a Uniform Resource Identifier (URI), or a cluster data source resolution. The developer may specify the execution targets, or they may be set as defaults. In some embodiments the data source resolution can be based on a relative path rooted at the data source identifier. Some embodiments accept modifications to the data source mapping from the developer.
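Resolution from a relative path rooted at the identifier might look like the following sketch. The ROOTS table, the DEFAULT_TARGET value, and the resolve function are hypothetical names introduced here; the roots and URIs are placeholders, and forward-slash paths are assumed throughout.

```python
import posixpath

# Hypothetical roots: where each identifier lives on each execution target.
ROOTS = {
    "sales_data": {
        "local": "/home/dev/data/sales_sample",
        "cluster": "file://headnode/share/sales_full",
        "cloud": "https://example.invalid/containers/sales_full",
    },
}

# Assumed default, used when the developer does not name a target.
DEFAULT_TARGET = "local"


def resolve(identifier: str, relative_path: str = "", target: str = DEFAULT_TARGET) -> str:
    """Resolve `relative_path`, rooted at `identifier`, for one execution target."""
    root = ROOTS[identifier][target]
    if posixpath.isabs(relative_path) or ".." in relative_path.split("/"):
        raise ValueError("path must be relative and stay under the data source root")
    return posixpath.join(root, relative_path) if relative_path else root


if __name__ == "__main__":
    for target in ("local", "cluster", "cloud"):
        print(target, "->", resolve("sales_data", "2011/q1.csv", target))
```

The developer's code only ever mentions "sales_data" and "2011/q1.csv"; the physical location comes entirely from the table and the chosen target.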
From an architectural perspective, certain embodiments include a logical processor and a memory in communication with the logical processor. The memory contains a data source mapping in which a data source identifier is mapped to different data sources at different execution targets. The data source mapping may be stored or persisted in various places, such as a textual configuration file, a nontextual project property, a project-specific structure, and/or a user-specific structure. The source code, which recites the data source identifier as a data source, also resides in the memory. A memory-resident executable code producer contains instructions that, upon execution, automatically produce multiple executable codes from the same source code at different times. Each executable code refers to a different data source.
In some embodiments the source code does not contain code that detects the execution location. In some embodiments the source code is free of absolute-path data source identifiers. In some embodiments, the executable code produced by the executable code producer contains injected code that was not present in the source code.
In some embodiments, an executable code producer includes a resolution API. The resolution API contains instructions that, when executed by the processor, resolve a relative path and an execution target into a data source resolution containing a physical path. Some executable code producers include a source-to-source translator. Different embodiments can take one or more of several general approaches. In a first approach, the source code is translated on a local computer into executable code or intermediate code for a specific execution target, and the intermediate or executable code is then deployed to the target and run. In a second approach, the source code is first deployed to an execution target, where it is translated on the spot by a producer designed for that target. In a third approach, the source code is translated as though the data were local. When the code is deployed in an environment where the data resides at a remote location, a pre-task downloads the data so that the code has access to it locally.
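A source-to-source translator of the kind described can be approximated with plain text substitution. The sketch below is an assumption-laden illustration: the data_source("...") marker syntax, the MAPPING table, the translate function, and the injected ensure_container call are all hypothetical, and a real producer would more likely work on a parsed representation than on raw text.

```python
import re

# Hypothetical mapping of an identifier to per-target output locations.
MAPPING = {
    "results": {
        "local": "out/results",
        "cloud": "https://example.invalid/containers/results",
    },
}

# Hypothetical marker the developer writes instead of a physical location.
MARKER = re.compile(r'data_source\("([^"]+)"\)')


def translate(source: str, target: str) -> str:
    """Return source text tailored to `target`; the input is left unchanged."""
    used = []

    def substitute(match: re.Match) -> str:
        resolved = MAPPING[match.group(1)][target]
        if resolved not in used:
            used.append(resolved)
        return repr(resolved)

    body = MARKER.sub(substitute, source)
    if target == "cloud":
        # Inject calls that were not in the developer's source. The helper
        # ensure_container is an assumed runtime function, not a real API.
        injected = "".join(f"ensure_container({loc!r})\n" for loc in used)
        return injected + body
    return body


DEVELOPER_SOURCE = 'write_rows(data_source("results"), rows)\n'

if __name__ == "__main__":
    print(translate(DEVELOPER_SOURCE, "local"))
    print(translate(DEVELOPER_SOURCE, "cloud"))
```

Only the translated output differs between targets; DEVELOPER_SOURCE itself contains no absolute path and no code that detects where it is running.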
The examples given are merely illustrative. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit its scope. Rather, this Summary is provided to introduce, in simplified form, some concepts that are described in greater detail below. The innovation is defined by the claims, and to the extent that this Summary conflicts with the claims, the claims prevail.
DESCRIPTION OF THE DRAWINGS
A more detailed description is given with reference to the drawings. The drawings illustrate only selected aspects and therefore do not fully determine coverage or scope.
Some fields are experiencing a data boom, so it is beneficial to use more powerful and faster processing for large data sets. This can be done in a cluster, or cloud, using distributed processing. As traditional desktops and single-node computers have limited processing power, they are not able to handle large volumes of data. Therefore, more developers and engineers want to run desktop applications using clusters and clouds that can accommodate larger data sets. A developer might want to create an algorithm locally using a subset of data, and then use the program on a cluster or cloud that contains the full set of data in the same format.