Microsoft – Prosenjit Sinha, Commvault Systems Inc

Abstract for “Synchronizing selected data elements in a storage management software”

Disclosed systems and methods leverage storage management system resources to partially synchronize primary data files by synchronizing designated portions of each file without regard to changes in non-designated portions. Any number of primary data files can be partially synchronized in this way, using auto-restore operations that restore backed-up data to the associated files. The approach uses storage management resources to identify the portions of source data to be kept synchronized across any number of targets, to detect changes to those portions, and to back the changes up to secondary storage. The changes are then distributed from secondary storage to the targets, with minimal impact on the primary data environment. The approach can be mutual, so that a change to any one of a group of associated source data files may likewise be detected, backed up, and distributed to the other members of the group.

Background for “Synchronizing selected data elements in a storage management software”

Businesses worldwide recognize the commercial value of their data and seek reliable, cost-effective ways to protect the information stored on their computer networks while minimizing the impact on productivity. A company might back up critical computing systems such as databases, file servers, web servers, and virtual machines as part of a daily, weekly, or monthly maintenance schedule. Given the ever-growing volume of data under their control, companies also continue to look for innovative ways to manage data growth. Collaboration environments in particular require the ability to coordinate data content while managing data growth efficiently.

In collaboration environments, there is a need to coordinate data content across multiple users and/or storage devices while keeping the use of network bandwidth and storage resources to a minimum. Many users might want to stay current with a rapidly changing data source, such as a software code base, a transactional database, or a media project in development. Traditional methods often copy the source data to a local workspace so that the user can work with it as needed. This can be costly and impractical if the source data is large or if many users want to stay current. Traditional approaches can cause network communication bottlenecks, may consume too much storage, and may not be suitable for multi-party updates. By tapping the source device too often for copies, they can also degrade production performance. Even when changes are copied incrementally, the copy may not cover all the changes that are needed. A more efficient and streamlined approach is therefore required.

The present inventor has devised methods and systems that leverage storage management system resources to partially synchronize primary data files, by synchronizing designated portions of each file without regard to changes in non-designated portions. Any number of primary data files can be partially synchronized using auto-restore operations that restore backed-up data. The approach uses storage management resources to: identify portions of source data to keep synchronized across multiple targets; detect any changes to those portions; back the changes up to secondary storage; and then distribute the changes from secondary storage directly to the targets, with minimal impact on the primary data environment. The approach can be mutual, so that a change to any one source data file in an associated group may be detected, backed up, and distributed to the other members of the group. Because synchronization is managed only for the designated portions (i.e., partial synchronization), the approach uses significantly less storage and communications resources than synchronizing entire files. Accordingly, it may be more economically practicable, may allow partial-synchronization operations to occur more frequently, and may provide more current data to the respective users.

The illustrative embodiment employs an enhanced storage manager and enhanced data agents that work with media agents in the storage management system to perform partial-synchronization operations across any number of primary data files. Administrators and users may designate, via the enhanced storage manager, which portions of their primary data files the storage management system should keep synchronized. The storage manager includes enhancements for storing metadata about the designated portions (hereinafter “synchronization portions”) and their mutual association, as well as enhancements for managing partial-synchronization operations throughout the storage management system. The enhanced storage manager may be notified when a synchronization portion of a primary data file has changed (for example, a new transaction, a modification to a portion of source code, or a change in a media file) and may then automatically launch a partial-synchronization operation to bring all the associated synchronization portions up to date with the detected change.

An enhanced data agent may include a monitoring function that tracks the designated synchronization portion of a particular data file that falls under its purview. When the enhanced data agent detects a change in the monitored synchronization portion, it may notify the storage manager. In response, the storage manager may instruct the data agent to back up the changed synchronization portion to secondary storage. Notably, only the synchronization portion is backed up, not the entire data file. The storage manager can then manage restore operations that restore the secondary copy to all the associated synchronization portions. No user interaction is needed to pull in the changes and synchronize a local file with another data file. The enhanced storage manager may manage the whole process, instructing media agents and data agents to perform the required operations.
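The flow described above (detect a change in a synchronization portion, back up only that portion, then auto-restore it to the associated files) can be illustrated with a minimal in-memory sketch. All names here (`PrimaryFile`, `StorageManager`, `on_change_detected`) are hypothetical and chosen for illustration only; a plain Python list stands in for secondary storage, and line ranges stand in for synchronization portions. It is not the disclosed implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PrimaryFile:
    """Hypothetical model of a primary data file with one synchronization portion."""
    name: str
    lines: List[str]
    sync_start: int  # index of the first line in the synchronization portion
    sync_end: int    # index one past the last line in the portion

    def sync_portion(self) -> List[str]:
        return self.lines[self.sync_start:self.sync_end]

    def restore_portion(self, portion: List[str]) -> None:
        # Overwrite only the synchronization portion; other lines are untouched.
        self.lines[self.sync_start:self.sync_end] = portion
        self.sync_end = self.sync_start + len(portion)


@dataclass
class StorageManager:
    """Hypothetical coordinator; secondary_copy stands in for secondary storage."""
    group: List[PrimaryFile]
    secondary_copy: List[str] = field(default_factory=list)

    def on_change_detected(self, source: PrimaryFile) -> None:
        # Back up only the synchronization portion, not the whole file.
        self.secondary_copy = list(source.sync_portion())
        # Auto-restore the backed-up portion to every other associated file.
        for target in self.group:
            if target is not source:
                target.restore_portion(self.secondary_copy)


a = PrimaryFile("a.txt", ["local-a", "shared v1", "tail-a"], 1, 2)
b = PrimaryFile("b.txt", ["shared v1", "local-b"], 0, 1)
manager = StorageManager(group=[a, b])

a.lines[1] = "shared v2"       # a change inside a's synchronization portion
manager.on_change_detected(a)  # in the text, a data agent would report this
```

After the call, `b` carries the updated shared line while its non-synchronized line is unchanged, which mirrors the point that changes outside the synchronization portions are ignored.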

Systems and methods are disclosed for partially synchronizing primary data files by synchronizing portions thereof via auto-restore operations that use backed-up data. These systems and methods are described in further detail with reference to FIGS. 2-6. Components and functionality for such partial synchronization may be configured and/or incorporated into information management systems such as those described with reference to FIGS. 1A-1H.

Information Management System Overview

With the growing importance of protecting and leveraging data, organizations simply cannot afford to lose critical data. Runaway data growth and other modern realities make protecting and managing data increasingly difficult. There is therefore a need for efficient, powerful, and user-friendly solutions for protecting and managing data.

Depending on the size of the organization, there may be many data production sources under the purview of tens, hundreds, or even thousands of employees. In the past, individual employees were sometimes responsible for managing and protecting their own data. In other cases, a patchwork of hardware and software point solutions was used. These solutions were often provided by different vendors and had limited or no interoperability.

Certain embodiments described herein provide systems and methods capable of addressing these and other shortcomings of prior approaches by implementing unified, organization-wide information management. FIG. 1A shows one such information management system 100, which generally includes combinations of hardware and software configured to protect and manage data and metadata generated by the various computing devices in information management system 100. The organization that employs the information management system 100 may be a company or other business entity, an educational institution, a household, a governmental agency, or the like.

Generally, the systems described herein may be compatible with and/or provide some or all of the functionality of the systems described in one or more U.S. patents and patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated by reference herein in its entirety.

The information management system 100 can include a variety of computing devices. For instance, as discussed in more detail below, the information management system 100 can include one or more client computing devices 102 and secondary storage computing devices 106.

Computing devices can include, without limitation, one or more of: workstations, personal computers, desktop computers, or other types of generally fixed computing systems such as mainframe computers and minicomputers. Other computing devices can include mobile or portable computing devices, such as laptops, tablet computers, personal data assistants, mobile phones (such as smartphones), and other mobile or portable computing devices such as embedded computers, set-top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

In some cases, a computing device includes virtualized and/or cloud computing resources. For instance, a third-party cloud service provider may provide one or more virtual machines to an organization. In some cases, computing devices can include one or more virtual machines running on a physical host computing device (or “host machine”) operated by the organization. As one example, the organization may use one virtual machine as a database server and another virtual machine as a mail server, both virtual machines running on the same host machine.

A virtual machine includes an operating system and associated virtual resources, and is hosted on a physical host computer (or host machine). A hypervisor, which is typically software and is also known as a virtual machine monitor or virtual machine manager (“VMM”), sits between the virtual machine and the hardware of the physical host machine. One example of a hypervisor used for virtualization is ESX Server, by VMware, Inc. of Palo Alto, Calif. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation, and Sun xVM by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be firmware or hardware, or a combination of software, firmware, and hardware.

The hypervisor provides each virtual operating system with virtual resources, such as a virtual processor, virtual memory, a virtual network device, and a virtual disk. Each virtual machine has one or more virtual disks. The hypervisor typically stores the data of the virtual disks in files on the file system of the physical host machine. These files are called virtual machine disk files (in the case of VMware virtual servers) or virtual hard disk image files (in the case of Microsoft virtual servers). For example, VMware's ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine disk files. A virtual machine reads data from and writes data to its virtual disk much the same way that a physical machine reads from and writes to a physical disk.

Examples of information management techniques in cloud computing environments are described in U.S. Pat. No. 8,285,681, which is incorporated by reference herein. Examples of techniques for managing information in virtualized computing environments are described in U.S. Pat. No. 8,307,177, also incorporated by reference herein.

The information management system 100 can also include a variety of storage devices, including primary storage devices 104 and secondary storage devices 108, for example. Storage devices can generally be of any suitable type, including, without limitation, disk drives, hard-disk arrays, semiconductor memory (e.g., solid state storage devices), network-attached storage (NAS) devices, tape libraries or other magnetic non-tape storage devices, optical media storage devices, DNA/RNA-based memory technology, and combinations thereof. In some instances, storage devices form part of a distributed file system. Some storage devices are provided in a cloud, such as a private cloud or one operated by a third-party vendor. In some cases, a storage device is a disk array or a portion thereof.

The illustrated information management system 100 includes one or more client computing devices 102 that execute at least one application 110, and one or more primary storage devices 104 that store primary data 112. The client computing device(s) 102 and the primary storage devices 104 may in some cases be referred to as a primary storage subsystem 117. A computing device in an information management system 100 that has a data agent 142 installed and running on it is generally referred to as a client computing device 102 (or, in the context of a component of the information management system 100, simply as a “client”).

Depending on the context, the term “information management system” can refer to all of the illustrated hardware and software components. In other instances, the term refers to only a subset of those components.

For instance, in some cases the information management system 100 refers to the set of components that protect, move, and manage the data and metadata generated by the client computing devices 102. In that usage, the information management system 100 does not necessarily include the components that generate and/or store the primary data 112, such as the client computing devices 102 themselves, the applications 110, and the primary storage devices 104. As an example, “information management system” may sometimes refer to one or more of the following components and their corresponding data structures: storage managers, data agents, and media agents. These components are described in greater detail below.

Client Computing Devices

An organization typically has a variety of sources that produce data to be protected and managed. As just one example, in a corporate environment such data sources can be employee workstations and company servers such as mail servers, web servers, database servers, transaction servers, and the like. In the information management system 100, the data generation sources include the one or more client computing devices 102.

The client computing devices 102 can include any of the types of computing devices described above, and in some cases a client computing device 102 is associated with one or more users and/or corresponding user accounts of employees or other individuals.

The information management system 100 generally addresses and handles the data management and protection needs for the data generated by the client computing devices 102. However, the use of the term “client” does not imply that the client computing devices 102 cannot be “servers” in other respects. For instance, a particular client computing device 102 may act as a server with respect to other devices, such as other client computing devices 102. As just a few examples, the client computing devices 102 can include mail servers, file servers, database servers, and web servers.

Each client computing device 102 may have one or more applications 110 (e.g., software applications) executing on it, which generate and manipulate the data that is to be protected from loss and managed. The applications 110 generally facilitate the operations of an organization (or multiple affiliated organizations) and can include, without limitation: file server applications, mail server applications (e.g., Microsoft Exchange Server), mail client applications (e.g., Microsoft Exchange Client), database applications (e.g., SQL, Oracle, SAP, Lotus Notes Database), word processing applications (e.g., Microsoft Word), spreadsheet applications, financial applications, presentation applications, graphics applications, web applications, mobile applications, entertainment applications, and so on.

The client computing devices 102 can have at least one operating system installed (e.g., Microsoft Windows, Mac OS X, iOS, IBM z/OS, Linux, or other Unix-based operating systems), which may support or host one or more file systems and other applications 110.

The client computing devices 102 and other components of the information management system 100 can be connected to one another via one or more communication pathways 114. For example, a first communication pathway 114 may connect a client computing device 102 and a secondary storage computing device 106; a second communication pathway 114 may connect the storage manager 140 and a client computing device 102; and a third communication pathway 114 may connect the storage manager 140 and a secondary storage computing device 106 (see, e.g., FIG. 1A and FIG. 1C). In some cases, the communication pathways 114 may also include application programming interfaces (APIs), such as cloud service provider APIs and virtual machine management APIs. The infrastructure underlying the communication pathways 114 may be wired and/or wireless, analog and/or digital, or any combination thereof, and the facilities used may be private, public, third-party provided, or any combination thereof.

Primary Data and Exemplary Primary Storage Devices

According to some embodiments, primary data 112 is production data or other “live” data generated by the operating system and/or the applications 110 running on a client computing device 102. Primary data 112 is generally stored on the primary storage device(s) 104 and is organized via a file system supported by the client computing device 102. For instance, the client computing device(s) 102 and the corresponding applications 110 may create, access, modify, write, delete, and otherwise use primary data 112. In some cases, some or all of the primary data 112 can be stored in cloud storage resources.

Primary data 112 is generally in the native format of the source application 110. According to certain aspects, primary data 112 is an initial or first stored copy of data generated by the source application 110 (e.g., created before any other copies, or before at least one other copy). In some cases, primary data 112 is created substantially directly from the data generated by the corresponding source application 110.

The primary storage devices 104 that store the primary data 112 may be relatively fast and/or expensive (e.g., disk drives, hard-disk arrays, solid state memory, etc.). In addition, primary data 112 may be highly changeable and/or intended for relatively short-term retention (e.g., hours, days, or weeks).

According to some embodiments, the client computing device 102 can access primary data 112 from the primary storage device 104 by making conventional file system calls via the operating system. Primary data 112 may include structured data, unstructured data, and/or semi-structured data. Some specific examples are described below with respect to FIG. 1B.

It can be useful in performing certain tasks to organize the primary data 112 into units of different granularities. In general, primary data 112 can include files, directories, file system volumes, data blocks, extents, or any other hierarchies or organizations of data objects. As used herein, a “data object” can refer to either (1) any file that is currently addressable by a file system or that was previously addressable by the file system (e.g., an archive file), or (2) a subset of such a file (e.g., a data block).

As will be described in further detail, it can also be useful in performing certain functions of the information management system 100 to access and modify metadata within the primary data 112. Metadata generally includes information about data objects or characteristics associated with the data objects. For simplicity, any reference herein to primary data 112 generally also includes its associated metadata, but references to the metadata do not include the primary data.

Metadata can include, without limitation, one or more of the following: the data owner (e.g., the client or user that generated the data object), the last modified time (e.g., the time of the most recent modification of the data object), the data object name, the file size (e.g., a number of bytes of data), information about the content (e.g., an indication as to the existence of a particular search term), user-supplied tags, to/from information for email (e.g., an email sender, recipient, etc.), the creation date, the file type (e.g., format or application type), the last accessed time, the application type (e.g., the type of application that generated the data object), location/network information (e.g., a current, past, or future location of the data object and network pathways to and from it), partition layouts, the file location within a file folder directory structure, permissions, owners, groups, access control lists (ACLs), and system metadata.
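As a concrete illustration of a few of the metadata fields listed above, the following minimal sketch collects file system metadata for a single file using Python's standard `os.stat` call. The `ObjectMetadata` record and the `collect_metadata` helper are hypothetical names introduced here for illustration; the text does not specify how the information management system stores metadata.

```python
import os
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ObjectMetadata:
    """Hypothetical record holding a small subset of the metadata listed above."""
    path: str
    size_bytes: int
    last_modified: datetime
    last_accessed: datetime
    file_type: str


def collect_metadata(path: str) -> ObjectMetadata:
    """Read file system metadata for one data object (here, a file)."""
    st = os.stat(path)
    return ObjectMetadata(
        path=os.path.abspath(path),
        size_bytes=st.st_size,
        last_modified=datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        last_accessed=datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
        # Use the file extension as a crude stand-in for the file type.
        file_type=os.path.splitext(path)[1].lstrip(".") or "unknown",
    )


# Usage: create a small sample file, collect its metadata, then clean up.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as handle:
    handle.write("hello")
    sample_path = handle.name

meta = collect_metadata(sample_path)
os.remove(sample_path)
```

Application-level metadata such as email sender and recipient would come from the application itself rather than from the file system, so it is outside the scope of this sketch.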

In addition to metadata generated by or related to file systems and operating systems, some applications 110 and/or other components of the information management system 100 maintain indices of metadata for data objects, e.g., metadata associated with individual email messages. Thus, each data object may be associated with corresponding metadata. The use of metadata to perform classification and other functions is described in greater detail below.

Each client computing device 102 is generally associated with and/or in communication with one or more of the primary storage devices 104 that store the corresponding primary data 112. A client computing device 102 may be considered to be “associated with” or “in communication with” a primary storage device 104 if it is capable of one or more of the following: routing and/or storing data (e.g., primary data 112) to the particular primary storage device 104; coordinating the routing and/or storing of data to the particular primary storage device 104; retrieving data from the particular primary storage device 104; coordinating the retrieval of data from the particular primary storage device 104; and modifying and/or deleting data retrieved from the particular primary storage device 104.

The primary storage devices 104 can include any of the types of storage devices described above, or some other kind of suitable storage device. The primary storage devices 104 may have relatively fast I/O times and/or may be relatively expensive in comparison to the secondary storage devices 108. For example, the information management system 100 may generally access data and metadata stored on the primary storage devices 104 quite frequently, whereas data stored on the secondary storage devices 108 is accessed relatively less frequently.

A primary storage device 104 may be dedicated or shared. In some cases, each primary storage device 104 is dedicated to an associated client computing device 102. For instance, in one embodiment the primary storage device 104 is a local disk drive of a corresponding client computing device 102. In other cases, one or more primary storage devices 104 can be shared by multiple client computing devices 102, e.g., via a network, such as in a cloud storage implementation. As one example, a primary storage device 104 can be a disk array shared by a group of client computing devices 102, such as an EMC Clariion, EMC Symmetrix, or EMC Celerra disk array.

The information management system 100 may also include hosted services (not shown), which may be hosted in some cases by an entity other than the organization that employs the other components of the information management system 100. For instance, the hosted services may be provided by various online service providers to the organization. Such service providers can offer services including social networking services, hosted email services, and hosted productivity applications. Hosted services may include software-as-a-service (SaaS), platform-as-a-service (PaaS), application service providers (ASPs), cloud services, or other mechanisms for delivering functionality via a network. As it provides services to users, each hosted service may generate additional data and metadata that may be managed by the information management system 100, e.g., as primary data 112. In some cases, the hosted services may be accessed using one of the applications 110. As an example, a hosted mail service may be accessed via a browser running on a client computing device 102. The hosted services may be implemented in a variety of computing environments, and in some cases in an environment similar to that of the information management system 100, where various physical and logical components are distributed over a network.

Secondary Copies and Exemplary Secondary Storage Devices

The primary data 112 stored on the primary storage devices 104 may be compromised in some cases, such as when an employee deliberately or accidentally deletes or overwrites primary data 112 during normal work. The primary storage devices 104 themselves can also be lost, damaged, or corrupted. For recovery and/or regulatory compliance purposes, it is therefore useful to create copies of the primary data 112. Accordingly, the information management system 100 includes one or more secondary storage computing devices 106 and one or more secondary storage devices 108 configured to create and store one or more secondary copies 116 of the primary data 112 and associated metadata. The secondary storage computing devices 106 and the secondary storage devices 108 may sometimes be referred to as a secondary storage subsystem 118.

Creating and storing secondary copies 116 can help with search and analysis efforts and can meet other information management goals. For example, it makes it possible to restore data and/or metadata if an original version (e.g., of primary data 112) is lost through deletion, corruption, or natural disaster, and it allows point-in-time recovery.

The client computing devices 102 access or receive primary data 112 and communicate the data, e.g., over one or more communication pathways 114, for storage in the secondary storage device(s) 108.

A secondary copy 116 can comprise a separate stored copy of application data that is derived from one or more earlier-created stored copies (e.g., derived from primary data 112 or from another secondary copy 116). Secondary copies 116 can include point-in-time data and may be intended for relatively long-term retention (e.g., weeks, months, or years) before some or all of the data is moved to other storage or discarded.

In some cases, a secondary copy 116 is a copy of application data that is created and stored after at least one other stored instance (e.g., after the corresponding primary data 112 or after another secondary copy 116), in a different storage device than at least one previously stored copy, and/or remotely from at least one previously stored copy. In other cases, secondary copies can be stored on the same storage device as primary data 112 and/or other previously stored copies. For example, in one embodiment a disk array capable of performing hardware snapshots stores primary data 112 and also creates and stores hardware snapshots of the primary data 112 as secondary copies 116. Secondary copies 116 may be kept in relatively low-cost storage, such as magnetic tape. A secondary copy 116 may be stored in a backup or archive format, or in some other format different from the native source application format or other primary data format.

In some cases, secondary copies 116 are indexed so that users can browse and restore them at a later time. After a secondary copy 116 representing certain primary data 112 has been created, a pointer or other location indicator (e.g., a stub) may be placed in the primary data 112, or otherwise associated with it, to indicate the current location of the secondary copy 116 on the secondary storage device(s) 108.
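The stub mechanism described above can be sketched with two in-memory dictionaries standing in for primary and secondary storage. This is only an illustration under stated assumptions: the `Stub` class and the `archive` and `read` helpers are hypothetical, and the sketch assumes the stub replaces the original content in primary storage, which the text allows but does not require.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class Stub:
    """Hypothetical pointer left in primary data after a secondary copy is made."""
    secondary_location: str


# In-memory stand-ins for primary storage 104 and secondary storage 108.
primary: Dict[str, Union[bytes, Stub]] = {"report.doc": b"quarterly numbers"}
secondary: Dict[str, bytes] = {}


def archive(name: str) -> None:
    """Create a secondary copy and leave a stub pointing at its location."""
    entry = primary[name]
    if isinstance(entry, Stub):
        return  # already archived; nothing to do
    location = f"secondary/{name}"
    secondary[location] = entry
    primary[name] = Stub(location)


def read(name: str) -> bytes:
    """Return the content, following the stub to secondary storage if needed."""
    entry = primary[name]
    if isinstance(entry, Stub):
        return secondary[entry.secondary_location]
    return entry


archive("report.doc")
content = read("report.doc")
```

The caller of `read` does not need to know whether the data currently lives in primary or secondary storage, which is the practical value of the location indicator.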

Because an instance of a data object or metadata in primary data 112 may change over time as it is modified by an application 110 (or by the operating system), the information management system 100 may create and manage multiple secondary copies 116 of a particular data object or metadata, each representing the state of the data object in primary data 112 at a particular point in time. Moreover, the information management system 100 may continue to manage point-in-time representations of a primary data object even after that object has been deleted from the primary storage device 104 and the file system.
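The idea of keeping several point-in-time secondary copies of one data object can be sketched as a small catalog that returns the newest copy taken at or before a requested time. The `PointInTimeCatalog` class and its method names are hypothetical, and integer timestamps are used purely to keep the example deterministic.

```python
import bisect
from typing import Dict, List, Optional, Tuple


class PointInTimeCatalog:
    """Hypothetical catalog of point-in-time secondary copies, keyed by object name."""

    def __init__(self) -> None:
        # For each object, a list of (timestamp, content) pairs kept sorted by time.
        self._versions: Dict[str, List[Tuple[int, bytes]]] = {}

    def add_copy(self, name: str, taken_at: int, content: bytes) -> None:
        versions = self._versions.setdefault(name, [])
        bisect.insort(versions, (taken_at, content))

    def restore_as_of(self, name: str, when: int) -> Optional[bytes]:
        """Return the newest copy taken at or before `when`, or None if none exists."""
        versions = self._versions.get(name, [])
        times = [taken_at for taken_at, _ in versions]
        index = bisect.bisect_right(times, when)
        return versions[index - 1][1] if index else None


catalog = PointInTimeCatalog()
catalog.add_copy("ledger.db", 100, b"v1")
catalog.add_copy("ledger.db", 200, b"v2")

before_second_copy = catalog.restore_as_of("ledger.db", 150)
latest = catalog.restore_as_of("ledger.db", 250)
too_early = catalog.restore_as_of("ledger.db", 50)
```

Because the catalog is independent of primary storage, the copies remain restorable even if the primary object is later deleted, which matches the behavior described above.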

For virtualized computing devices, the operating system and other applications 110 of the client computing device(s) 102 may execute within or under the management of virtualization software (e.g., a VMM), and the primary storage device(s) 104 may comprise a virtual disk created on a physical storage device. The information management system 100 may create secondary copies 116 of the files or other data objects in a virtual disk file and/or secondary copies 116 of the entire virtual disk file itself (e.g., of an entire .vmdk file).

Secondary copies 116 may be distinguished from the corresponding primary data 112 in a variety of ways, some of which are discussed here. As mentioned, secondary copies 116 may be stored in a different format than primary data 112 (e.g., a backup, archive, or other non-native format). For this and other reasons, secondary copies 116 may not be directly usable by the applications 110 of the client computing device 102.

In certain embodiments, secondary copies 116 are also stored on a secondary storage device 108 that is inaccessible to the applications 110 running on the client computing devices 102 (and/or hosted services). Some secondary copies 116 may be “offline copies,” in that they are not readily available (e.g., not mounted to tape or disk). Offline copies can include copies of data that the information management system 100 can access without human intervention.

The Use of Intermediate Devices for Creating Secondary Copies

Creating secondary copies can be challenging. For instance, hundreds or thousands of client computing devices 102 may be continually generating large volumes of primary data 112 to be protected. Creating secondary copies 116 can also involve significant overhead. Moreover, the secondary storage devices 108 may be special-purpose components, and interacting with them can require specialized intelligence.

In some cases, the client computing devices 102 interact directly with the secondary storage devices 108 to create the secondary copies 116. However, this approach can negatively affect the ability of the client computing devices 102 to serve the applications 110 and produce primary data 112. Further, the client computing devices 102 may not be optimized for interaction with the secondary storage devices 108.

Thus, in some embodiments, the information management system 100 includes one or more software and/or hardware components that generally act as intermediaries between the client computing devices 102 and the secondary storage devices 108. In addition to off-loading certain responsibilities from the client computing devices 102, these intermediate components can provide other benefits. For instance, as discussed with respect to FIG. 1D, distributing some of the work involved in creating secondary copies 116 can enhance scalability.

The intermediate components can include one or more secondary storage computing devices 106, as shown in FIG. 1A, and/or one or more media agents, which can be software modules operating on the corresponding secondary storage computing devices 106 (or other appropriate computing devices). Media agents are discussed below (e.g., with respect to FIGS. 1C-1E).

The secondary storage computing device(s) 106 can comprise any of the computing devices described above, without limitation. In some cases, the secondary storage computing device(s) 106 include specialized hardware and/or software componentry for interacting with the secondary storage devices 108.

To create a secondary copy 116, which involves copying data from the primary storage subsystem 117 to the secondary storage subsystem 118, the client computing device 102 in some embodiments communicates the primary data 112 to be copied (or a processed version thereof) to the designated secondary storage computing device 106 via the communication pathway 114. The secondary storage computing device 106 in turn conveys the received data (or a processed version thereof) to the secondary storage device 108. In some such configurations, the communication pathway 114 between the client computing device 102 and the secondary storage computing device 106 comprises a portion of a LAN, WAN, or SAN. In other cases, at least some client computing devices 102 communicate directly with the secondary storage devices 108 (e.g., via Fibre Channel or SCSI connections). In still other cases, one or more secondary copies 116 are created from existing secondary copies, such as in the case of an auxiliary copy operation.
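The data path described above (client to secondary storage computing device to secondary storage device) can be sketched as follows. The class names are hypothetical, a dictionary stands in for the secondary storage device 108, and `zlib` compression is used only as one example of a "processed version" of the data; the text does not say which processing is applied.

```python
import zlib
from typing import Dict


class SecondaryStorageDevice:
    """In-memory stand-in for a secondary storage device 108."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}


class SecondaryStorageComputingDevice:
    """Hypothetical intermediary that processes data on behalf of clients."""

    def __init__(self, device: SecondaryStorageDevice) -> None:
        self.device = device

    def receive(self, name: str, primary_data: bytes) -> None:
        # Store a processed (here: compressed) version in a non-native format.
        self.device.blobs[name] = zlib.compress(primary_data)

    def restore(self, name: str) -> bytes:
        # Reverse the processing to recover the original primary data.
        return zlib.decompress(self.device.blobs[name])


device = SecondaryStorageDevice()
intermediary = SecondaryStorageComputingDevice(device)

original = b"primary data 112 " * 10
intermediary.receive("mailbox", original)  # the client hands the data off
restored = intermediary.restore("mailbox")
```

The client only hands the data to the intermediary, so the cost of processing and of talking to the storage device is kept off the client, which is the off-loading benefit the text describes.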

Exemplary Secondary Data and Exemplary Primary Data

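FIG. 1B shows some specific examples of primary data stored on the primary storage device(s) 104 and secondary copy data stored on the secondary storage device(s) 108. Stored on the primary storage device(s) 104 are primary data objects including word processing documents 119A-B, spreadsheets 120, presentation documents 122, video files 124, image files 126, email mailboxes 128 (and corresponding email messages 129A-C), html/xml or other types of markup language files 130, databases 132, and corresponding tables or other data structures 133A-133C.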

Some or all of the primary data objects are associated with corresponding metadata (e.g., “Meta1-11”), which may be file system metadata and/or application-specific metadata. Secondary copy data objects 134A-C are stored on the secondary storage device(s) 108. These secondary copy data objects may contain copies of, or otherwise represent, corresponding primary data objects and metadata.

As shown in the figure, the secondary copy data objects 134A-C can each represent more than one primary data object. For example, secondary copy data object 134A represents three separate primary data objects 133C, 122, and 129C (represented as 133C′, 122′, and 129C′, respectively, and accompanied by the corresponding metadata Meta11, Meta3, and Meta8, respectively). As indicated by the prime mark (′), a secondary copy data object may store a representation of a primary data object and/or its metadata in a format different from the original. Likewise, secondary copy data object 134B represents primary data objects 120, 130B, and 119A, accompanied by the corresponding metadata Meta2, Meta10, and Meta1, respectively. Secondary copy data object 134C represents primary data objects 130A, 119B, and 129A, accompanied by the corresponding metadata Meta9, Meta5, and Meta6, respectively.
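
The relationship described above can be illustrated with a minimal sketch. It is only an assumption-laden example, not the disclosed implementation: the class names, the use of zlib compression as the "different format", and the sample contents are all hypothetical.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class PrimaryObject:
    object_id: str   # reference numeral used as an identifier, e.g. "133C"
    content: bytes
    metadata: str    # e.g. "Meta11"


@dataclass
class SecondaryCopyObject:
    copy_id: str                                 # e.g. "134A"
    entries: dict = field(default_factory=dict)  # primed id -> (blob, metadata)

    def add(self, obj: PrimaryObject) -> None:
        # Store a representation in a format that differs from the original
        # (compression is assumed here) together with its metadata.
        self.entries[obj.object_id + "'"] = (zlib.compress(obj.content), obj.metadata)

    def restore(self, object_id: str) -> PrimaryObject:
        blob, metadata = self.entries[object_id + "'"]
        return PrimaryObject(object_id, zlib.decompress(blob), metadata)


# One secondary copy object representing three primary data objects.
copy_134a = SecondaryCopyObject("134A")
originals = [
    PrimaryObject("133C", b"table rows", "Meta11"),
    PrimaryObject("122", b"presentation slides", "Meta3"),
    PrimaryObject("129C", b"email body", "Meta8"),
]
for primary in originals:
    copy_134a.add(primary)
```

Restoring any entry returns the original content and metadata even though the stored representation differs from the primary format.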

“Exemplary Information Management System Architecture”

The information management system 100 can incorporate a wide range of hardware and software components, which can be organized in many different ways depending on the embodiment. Specifying the functional responsibilities of the components, and the role each plays in the information management system 100, involves critical design decisions. As will be discussed, these decisions can significantly affect performance and the ability of the information management system 100 to adapt to data growth or other changing circumstances.

FIG. 1C illustrates an information management system 100 that includes: a storage manager 140, which is a centralized storage and/or information manager configured to perform certain control functions; one or more data agents 142 executing on the client computing device(s) 102 for processing primary data 112; and one or more media agents 144 executing on the secondary storage computing device(s) 106 for performing tasks involving the secondary storage devices 108. While distributing functionality across multiple computing devices can have certain advantages, in other contexts it can be beneficial to consolidate functionality on the same computing device. Accordingly, in various other embodiments, one or more of the components shown in FIG. 1C as being implemented on separate computing devices are implemented on the same computing device. In one configuration, a storage manager 140, one or more data agents 142, and one or more media agents 144 are all implemented on the same computing device. In another embodiment, one or more data agents 142 and one or more media agents 144 are implemented on the same computing device, while the storage manager 140 is implemented on a separate computing device, without limitation.

“Storage Manager”

As noted, the number of components in the information management system 100 and the amount of data under management can be quite large. Managing the components and data is therefore a significant task, and one that can grow in an often unpredictable fashion as the number of components and the data scale to meet the needs of the organization. For these and other reasons, according to certain embodiments, responsibility for controlling the information management system 100 is allocated to the storage manager 140. By distributing control functionality in this manner, the storage manager 140 can be adapted independently according to changing circumstances. Moreover, a computing device for hosting the storage manager 140 can be selected to best suit the functions of the storage manager 140. These and other advantages are described in further detail below with respect to FIG. 1D.

The storage manager 140 may be a software module or other application which, in certain embodiments, operates in conjunction with one or more associated data structures (e.g., a dedicated database such as the management database 146). The storage manager 140 may execute on a computing device comprising circuitry for executing computer instructions. The storage manager 140 is generally responsible for initiating, performing, coordinating, and/or controlling storage operations and other information management operations performed by the information management system 100, including protecting and controlling the primary data 112 as well as secondary copies 116 of data and metadata. In general, the storage manager 140 can be said to manage the information management system 100, which includes managing its constituent components (e.g., data agents and media agents).

As indicated by the dashed arrowed lines 114 in FIG. 1C, the storage manager 140 can communicate with and/or control some or all elements of the information management system 100, such as the data agents 142 and media agents 144. Thus, in certain embodiments, control information originates from the storage manager 140 and status reporting is transmitted to the storage manager 140 by the various managed components, whereas payload data and payload metadata are generally communicated between the data agents 142 and the media agents 144 (or otherwise between the client computing device(s) 102 and the secondary storage computing device(s) 106), e.g., at the direction of and under the management of the storage manager 140. Control information can generally include parameters and instructions for carrying out information management operations, such as instructions to perform a task associated with an operation, timing information specifying when to initiate a task, and data path information specifying which components to communicate with or access in carrying out an operation. Payload data, on the other hand, can include the actual data involved in the storage operation, such as content data written to a secondary storage device 108 in a secondary copy operation. Payload metadata can include any of the types of metadata described herein and may be written to a storage device along with the payload content data (e.g., in the form of a header).
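
As a minimal sketch of the last point, payload metadata might be written alongside payload content data as a header. The layout below (a 4-byte big-endian length prefix followed by a JSON header) and the function names are assumptions chosen for illustration; the text does not specify any particular header format.

```python
import json
import struct


def pack_payload(metadata: dict, content: bytes) -> bytes:
    """Prefix content data with a length-delimited metadata header."""
    header = json.dumps(metadata, sort_keys=True).encode("utf-8")
    # 4-byte big-endian header length, then the header, then the content.
    return struct.pack(">I", len(header)) + header + content


def unpack_payload(blob: bytes):
    """Split a packed payload back into (metadata, content)."""
    (header_len,) = struct.unpack(">I", blob[:4])
    header = json.loads(blob[4:4 + header_len].decode("utf-8"))
    return header, blob[4 + header_len:]


packed = pack_payload({"client": "102", "object": "119A"}, b"document bytes")
metadata, content = unpack_payload(packed)
```

Keeping the header self-describing in this way is one option that would let a component read the metadata without any separate control channel.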

In some embodiments, certain information management operations can be controlled by other components of the information management system 100 (e.g., the media agent(s) 144 or data agent(s) 142), in addition to or in combination with the storage manager 140.

According to certain embodiments, the storage manager 140 provides one or more of the following functions.

The storage manager 140 may maintain a database 146 (or “storage manager database 146” or “management database 146”) of management-related data and information management policies 148. The database 146 may include a management index 150 (or “index 150”) or other data structure that stores logical associations between components of the system, user preferences and/or profiles (e.g., preferences regarding encryption, compression, or deduplication; preferences regarding the scheduling, type, or other aspects of operations; mappings of information management users to specific computing devices or other components, etc.), management tasks, media containerization, or other useful data. For example, the storage manager 140 may use the index 150 to track logical associations between media agents 144 and secondary storage devices 108 and/or the movement of data from primary storage devices 104 to secondary storage devices 108. For instance, the index 150 may store data associating a client computing device 102 with a particular media agent 144 and/or secondary storage device 108, as specified in an information management policy 148 (described below).
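
The kind of logical association the index 150 may store can be sketched as a simple lookup structure. This is an illustrative assumption only: the class name, method names, and identifiers such as "client-A" are hypothetical and do not come from the disclosure.

```python
class ManagementIndex:
    """Hypothetical index associating clients with a media agent and device."""

    def __init__(self):
        # client id -> (media agent id, secondary storage device id)
        self._paths = {}

    def associate(self, client: str, media_agent: str, storage_device: str) -> None:
        self._paths[client] = (media_agent, storage_device)

    def data_path(self, client: str):
        """Return the (media agent, secondary storage device) for a client."""
        return self._paths[client]

    def clients_of(self, media_agent: str):
        """List the clients currently associated with a media agent."""
        return sorted(c for c, (agent, _) in self._paths.items() if agent == media_agent)


index = ManagementIndex()
index.associate("client-A", "media-agent-1", "tape-library")
index.associate("client-B", "media-agent-1", "disk-array")
index.associate("client-C", "media-agent-2", "disk-array")
```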

Administrators and other individuals may be able to configure and initiate certain information management operations on an individual basis. While this may be acceptable for some recovery operations or other relatively infrequent tasks, it is often not workable for ongoing, organization-wide data protection and management. Thus, the information management system 100 may use information management policies 148 to specify and execute information management operations (e.g., on an automated basis). Generally, an information management policy 148 can include a data structure or other information source that specifies a set of parameters (e.g., criteria and rules) associated with storage or other information management operations.
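
A policy of this kind can be pictured as a small data structure of criteria and rules. The sketch below is hedged accordingly: the field names, the example policies, and the `operations_due` helper are assumptions made for illustration, not parameters defined by the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationManagementPolicy:
    name: str
    data_types: frozenset   # criteria: which kinds of data the policy covers
    operation: str          # rule: e.g. "backup" or "archive"
    interval_hours: int     # rule: how often the operation should run
    retention_days: int     # rule: how long the resulting copies are kept


def operations_due(policies, data_type: str, hours_since_last_run: int):
    """Return the names of policies whose operation is due for a data type."""
    return [
        p.name
        for p in policies
        if data_type in p.data_types and hours_since_last_run >= p.interval_hours
    ]


policies = [
    InformationManagementPolicy("daily-db-backup", frozenset({"database"}), "backup", 24, 30),
    InformationManagementPolicy("weekly-mail-archive", frozenset({"email"}), "archive", 168, 365),
]
```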

The storage manager database 146 may maintain the information management policies 148 and associated data, although the information management policies 148 can be stored in any appropriate location. For instance, an information management policy 148 such as a storage policy may be stored as metadata in a media agent database 152 or in a secondary storage device 108 (e.g., as an archive copy) for use in restore operations or other information management operations, depending on the embodiment. Information management policies 148 are described further below.

According to some embodiments, the storage manager database 146 comprises a relational database (e.g., an SQL database) for tracking metadata, such as metadata associated with secondary copy operations (e.g., which client computing devices 102 were involved and the corresponding data). This and other metadata may additionally be stored in other locations, such as at the secondary storage computing devices 106 or on the secondary storage devices 108, allowing data recovery without the use of the storage manager 140 in some cases.

As shown in the figure, the storage manager 140 may include a jobs agent 156, a user interface 158, and a management agent 154, all of which may be implemented as interconnected software modules or application programs.

In some embodiments, the jobs agent 156 initiates, controls, and/or monitors the status of some or all storage operations or other information management operations that are being performed, or are scheduled to be performed, by the information management system 100. For instance, the jobs agent 156 may access the information management policies 148 to determine when and how to initiate and control secondary copy and other information management operations.

The user interface 158 may include information processing and display software, such as a graphical user interface (“GUI”) or an application program interface (“API”). Through the user interface 158, users may optionally issue instructions to the components of the information management system 100 regarding storage and recovery operations. For example, a user may modify a schedule concerning the number of pending secondary copy operations. As another example, a user may employ the GUI to view the status of pending storage operations or to monitor certain components of the information management system 100 (e.g., the amount of remaining storage capacity).

An “information management cell” (or “storage operation cell” or “cell”) may generally include a logical and/or physical grouping of a combination of hardware and software components associated with performing information management operations on electronic data, typically at least one storage manager 140, at least one client computing device 102 (comprising at least one data agent 142), and at least one media agent 144. For instance, the components shown in FIG. 1C may together form an information management cell. Multiple cells may be organized hierarchically. With this configuration, cells may inherit properties from hierarchically superior cells or be controlled by other cells in the hierarchy (automatically or otherwise). In some embodiments, cells may inherit or otherwise be associated with information management policies, preferences, information management metrics, or other properties or characteristics according to their relative position in a hierarchy of cells. Cells may also be organized hierarchically according to geography, architecture, function, or any other factor useful in information management operations. For example, a first cell may represent a geographic segment of an enterprise, such as a Chicago office, and a second cell may represent a different geographic segment, such as a New York office. Other cells may represent departments within a particular office. A first cell may perform one or more first types of information management operations (e.g., one or more first types of secondary or other copies), while a second cell may perform one or more second types of information management operations.
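
Inheritance of properties down a hierarchy of cells can be sketched as a lookup that walks from a cell toward its hierarchically superior cells. The `Cell` class, the property names, and the values are hypothetical and serve only to illustrate the idea under those assumptions.

```python
class Cell:
    """Hypothetical information management cell in a hierarchy."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent                # hierarchically superior cell, if any
        self.properties = dict(properties)  # settings defined on this cell itself

    def effective(self, key):
        """Return this cell's own setting, else the nearest ancestor's."""
        cell = self
        while cell is not None:
            if key in cell.properties:
                return cell.properties[key]
            cell = cell.parent
        raise KeyError(key)


enterprise = Cell("enterprise", retention_days=90, encryption="on")
chicago = Cell("Chicago office", parent=enterprise, retention_days=30)
new_york = Cell("New York office", parent=enterprise)
```

Here the Chicago cell overrides the retention setting while both offices inherit the encryption setting from the superior cell.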

The storage manager 140 may also track information that permits it to select, designate, or otherwise identify content indices, deduplication databases, or similar databases or resources within its information management cell (or another cell) to be searched in response to certain queries. Such queries may be entered by the user via the user interface 158. The management agent 154 allows multiple information management cells to communicate with one another. For example, the information management system 100 may in some cases be one information management cell in a network of multiple cells that are adjacent to one another or otherwise logically related in a WAN or LAN. With this arrangement, the cells may be connected to one another through their respective management agents 154.

For instance, the management agent 154 can provide the storage manager 140 with the ability to communicate with other components within the information management system 100 (and/or other cells within a larger information management system) via network protocols and application programming interfaces (“APIs”) including, e.g., HTTP, HTTPS, FTP, REST, and virtualization software APIs. Inter-cell communication and hierarchy are described in greater detail in U.S. Pat. Nos. 7,747,579 and 7,343,453, which are incorporated by reference herein.

“Data Agents”

As discussed, a variety of different types of applications 110 can reside on a given client computing device 102, including operating systems, database applications, e-mail applications, and virtual machines, to name a few. As part of the process of creating and restoring secondary copies 116, the client computing devices 102 may be tasked with processing and preparing the primary data 112 from these various applications 110. Moreover, the nature of the processing/preparation can differ across clients and application types, e.g., due to inherent structural and formatting differences among applications 110.

The one or more data agents 142 are therefore advantageously configured in some embodiments to assist in the performance of information management operations based on the type of data being protected, at a client-specific and/or application-specific level.

The data agent 142 may be a software module or component that is generally responsible for managing, initiating, or otherwise assisting in the performance of information management operations in the information management system 100, generally as directed by the storage manager 140. For instance, the data agent 142 may take part in performing data storage operations such as copying, archiving, and migrating primary data 112 stored in the primary storage device(s) 104. The data agent 142 may receive control information from the storage manager 140, such as commands to transfer copies of data objects, metadata, and other payload data to the media agents 144.

In some embodiments, a data agent 142 may be distributed between the client computing device 102 and the storage manager 140 (and any other intermediate components), or may be deployed from a remote location, or its functions may be approximated by a remote process that performs some or all of the functions of the data agent 142. In addition, a data agent 142 may perform some functions provided by a media agent 144, or may perform other functions such as encryption and deduplication.

As indicated, each data agent 142 may be specialized for a particular application 110, and the system can employ multiple application-specific data agents 142, each of which may perform information management operations (e.g., backup, migration, and data recovery) associated with a different application 110. For instance, different individual data agents 142 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Active Directory Objects data, Microsoft Windows file system data, and Microsoft SQL Server data.

A file system data agent, for example, may handle data files and/or other file system information. If a client computing device 102 has two or more types of data, a specialized data agent 142 may be used for each data type to back up, archive, migrate, and restore the client computing device 102 data. For example, to back up, migrate, and/or restore all of the data on a Microsoft Exchange server, the client computing device 102 may use a Microsoft Exchange Mailbox data agent 142, a Microsoft Exchange Database data agent 142, a Microsoft Exchange Public Folder data agent 142, and a Microsoft Windows File System data agent 142. In such embodiments, these specialized data agents 142 may be treated as four separate data agents 142 even though they operate on the same client computing device 102.
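
The use of one specialized data agent per data type can be sketched as a simple registry lookup. The dictionary keys and the `agents_for` helper are hypothetical names introduced for illustration; only the four agent names come from the example above.

```python
# Hypothetical registry mapping a data type to the specialized agent for it.
SPECIALIZED_AGENTS = {
    "exchange_mailbox": "Microsoft Exchange Mailbox data agent",
    "exchange_database": "Microsoft Exchange Database data agent",
    "exchange_public_folder": "Microsoft Exchange Public Folder data agent",
    "windows_file_system": "Microsoft Windows File System data agent",
}


def agents_for(data_types):
    """Return one specialized agent per data type present on a client."""
    missing = [t for t in data_types if t not in SPECIALIZED_AGENTS]
    if missing:
        raise ValueError(f"no specialized data agent for: {missing}")
    return [SPECIALIZED_AGENTS[t] for t in data_types]


# A single Exchange server client holding four types of data uses four agents.
exchange_server_agents = agents_for(
    ["exchange_mailbox", "exchange_database", "exchange_public_folder", "windows_file_system"]
)
```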

Other embodiments may employ one or more generic data agents 142 that can handle and process data from two or more different applications 110, or that can handle and process multiple data types, instead of or in addition to using specialized data agents 142. For example, one generic data agent 142 may be used to back up, migrate, and restore Microsoft Exchange Mailbox data and Microsoft Exchange Database data, while another generic data agent 142 may handle Microsoft Exchange Public Folder data and Microsoft Windows File System data.

Each data agent 142 may be configured to access data and/or metadata stored in the primary storage device(s) 104 and to process the data as appropriate. For example, the data agent 142 may arrange or assemble the data and metadata into one or more files having a certain format (e.g., a particular backup or archive format) before transferring the file(s) to a media agent 144 or other component. The file(s) may include a list of files or other metadata. Each data agent 142 can also assist in restoring data or metadata to the primary storage devices 104 from a secondary copy 116. For instance, the data agent 142 may operate in conjunction with the storage manager 140 and one or more of the media agents 144 to restore data from the secondary storage device(s) 108.

“Media Agents”

As indicated above with respect to FIG. 1A, off-loading certain responsibilities from the client computing devices 102 to intermediate components such as the media agent(s) 144 can provide a number of benefits, including improved client computing device 102 operation, faster secondary copy operation performance, and enhanced scalability. In one example discussed further below, the media agent 144 acts as a local cache of copied data and/or metadata that it has stored to the secondary storage device(s) 108, providing improved restore capabilities.

Generally speaking, a media agent 144 may be implemented as a software module that manages, coordinates, and facilitates the transmission of data between a client computing device 102 and one or more secondary storage devices 108. Whereas the storage manager 140 controls the operation of the information management system 100, the media agent 144 generally provides access to the secondary storage devices 108. For instance, other components in the system interact with the media agents 144 to gain access to data stored on the secondary storage devices 108, whether for the purpose of reading, writing, modifying, or deleting data. Moreover, media agents 144 can generate and store information relating to characteristics of the stored data and/or metadata, or can generate and store other types of information that generally provide insight into the contents of the secondary storage devices 108.

Media agents 144 can comprise separate nodes in the information management system 100 (e.g., nodes that are separate from the client computing devices 102, storage manager 140, and/or secondary storage devices 108). In general, a node within the information management system 100 can be a logically and/or physically separate component, and in some cases is a component that is individually addressable or otherwise identifiable. In addition, each media agent 144 may operate on a dedicated secondary storage computing device 106 in some cases, while in other embodiments multiple media agents 144 operate on the same secondary storage computing device 106.

A media agent 144 (and corresponding media agent database 152) may be considered to be “associated with” a particular secondary storage device 108 if that media agent 144 is capable of one or more of: retrieving data from the particular secondary storage device 108, coordinating the retrieval of data from the particular secondary storage device 108, and modifying and/or deleting data retrieved from the particular secondary storage device 108.

While the media agent(s) 144 are generally associated with one or more secondary storage devices 108, in some embodiments one or more media agents 144 are physically separate from the secondary storage devices 108. For instance, the media agents 144 may operate on secondary storage computing devices 106 having different housings or packages than the secondary storage devices 108. In one example, a media agent 144 operates on a first server computer and is in communication with secondary storage device(s) 108 operating in a separate, rack-mounted RAID-based system.

Where the information management system 100 includes multiple media agents 144 (see, e.g., FIG. 1D), a first media agent 144 may provide failover functionality for a second, failed media agent 144. In addition, media agents 144 can be dynamically selected for storage operations to provide load balancing. Failover and load balancing are described in greater detail below.
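
One possible way to combine failover and load balancing when selecting a media agent is sketched below. The selection rule (skip offline agents, then choose the agent with the fewest active jobs) is an assumption made for illustration; the text does not prescribe a particular selection method, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MediaAgent:
    name: str
    online: bool = True
    active_jobs: int = 0


def select_media_agent(agents):
    """Pick an online media agent with the fewest active jobs."""
    candidates = [a for a in agents if a.online]  # failover: skip failed agents
    if not candidates:
        raise RuntimeError("no media agent available")
    # Load balancing: prefer the least busy of the remaining agents.
    return min(candidates, key=lambda a: a.active_jobs)


agents = [
    MediaAgent("media-agent-1", active_jobs=5),
    MediaAgent("media-agent-2", active_jobs=2),
    MediaAgent("media-agent-3", active_jobs=7),
]
```

If the least loaded agent later fails, marking it offline causes the next selection to fall through to another agent without any change to the caller.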

In operation, a media agent 144 associated with a particular secondary storage device 108 may instruct the secondary storage device 108 to perform an information management operation. For instance, a media agent 144 may instruct a tape library to use a robotic arm or other retrieval means to load or eject certain storage media, and to subsequently archive, migrate, or retrieve data to or from that media, e.g., for the purpose of restoring the data to a client computing device 102. As another example, a secondary storage device 108 may include an array of hard disk drives or solid state drives organized in a RAID or other configuration, and the media agent 144 may forward a logical unit number (LUN) and other appropriate information to the array, which uses the received information to execute the desired storage operation. The media agent 144 may communicate with a secondary storage device 108 via a suitable communications link, such as a SCSI or Fibre Channel link.

As shown in the figure, each media agent 144 may maintain an associated media agent database 152. The media agent database 152 may be stored on a disk or other storage device (not shown) that is local to the secondary storage computing device 106 on which the media agent 144 operates. In other cases, the media agent database 152 is stored remotely from the secondary storage computing device 106.

Summary for “Synchronizing selected data elements in a storage management software”

The present inventor has developed methods and systems that leverage storage management system resources to partially synchronize primary data files. This involves synchronizing selected portions of a file without regard to changes in the non-synchronized portions. Any number of primary data files can be partially synchronized by using auto-restore operations that restore backup data. The method uses storage management resources to: identify portions of source data to keep synchronized across multiple targets; detect any changes to those portions; back up the changes to secondary storage; and then distribute the changes from secondary storage directly to the targets, with minimal impact on the primary data environment. The approach can be mutual, so that a change to any one source data file in an associated group is detected, backed up, and distributed to the other members of the group. By managing synchronization only for designated portions, i.e., partial synchronization, rather than trying to synchronize entire files, the present approach uses significantly less storage and communications resources. Accordingly, the present approach may be more economically practicable, may enable partial-synchronization operations to occur more frequently, and may provide more current data to the respective users.

The illustrative embodiment employs an enhanced storage manager and enhanced data agents that work with media agents in the storage management system to perform partial-synchronization operations across any number of primary data files. Through the enhanced storage manager, administrators and users may designate which portions of their primary data files they want the storage management system to keep synchronized. The storage manager includes enhancements for storing metadata about the designated portions (hereinafter “synchronization portions”) and their mutual association, as well as enhancements for managing partial-synchronization operations throughout the storage management system. The enhanced storage manager may be notified when a synchronization portion of a primary data file has changed (e.g., a new transaction, a modification to a portion of source code, or a change in a media file) and may then automatically launch a partial-synchronization operation to bring all the associated synchronization portions up to date with the detected change.

An enhanced data agent may include a monitor function that tracks the designated synchronization portion of a particular data file within its purview. When the enhanced data agent detects a change in the monitored synchronization portion, it may notify the storage manager. In response, the storage manager may instruct the data agent to back up the changed synchronization portion to secondary storage. Notably, only the synchronization portion is backed up, not the entire data file. The storage manager may then manage restore operations that restore the secondary copy to all the associated synchronization portions. Users need not take any action to pull in changes or to synchronize a local file with another data file. Instead, the enhanced storage manager may manage the process, instructing media agents and data agents to perform the required operations.
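
The monitor, backup, and auto-restore sequence described above can be sketched in a few lines. This is a simplified, single-process illustration under stated assumptions, not the disclosed system: synchronization portions are modeled as equal-length byte ranges, change detection uses SHA-256 digests, secondary storage is an in-memory value, and every class and method name is hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SyncPortion:
    file_name: str
    start: int    # byte offset of the designated portion
    length: int   # assumed equal across all portions in a group


class PartialSyncManager:
    """Hypothetical stand-in for the enhanced storage manager."""

    def __init__(self, files, group):
        self.files = files           # file name -> bytearray (primary data)
        self.group = group           # mutually associated SyncPortion objects
        self.secondary_copy = None   # last backed-up portion
        self.digests = {p.file_name: self._digest(p) for p in group}

    def _read(self, p):
        return bytes(self.files[p.file_name][p.start:p.start + p.length])

    def _digest(self, p):
        return hashlib.sha256(self._read(p)).hexdigest()

    def detect_changes(self):
        # Monitor function: only the designated portion is examined.
        return [p for p in self.group if self._digest(p) != self.digests[p.file_name]]

    def synchronize(self):
        changed = self.detect_changes()
        if not changed:
            return False
        source = changed[0]                       # simplification: first change wins
        self.secondary_copy = self._read(source)  # back up the portion, not the file
        for p in self.group:                      # auto-restore to every member
            self.files[p.file_name][p.start:p.start + p.length] = self.secondary_copy
            self.digests[p.file_name] = self._digest(p)
        return True


files = {"a": bytearray(b"HEAD-aaaa-TAIL"), "b": bytearray(b"xx-aaaa-yyyyyy")}
manager = PartialSyncManager(files, [SyncPortion("a", 5, 4), SyncPortion("b", 3, 4)])
```

Changing bytes 5 through 8 of file "a" and calling `manager.synchronize()` would copy only those four bytes into file "b", while edits elsewhere in either file are neither detected nor overwritten.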

Systems and methods are disclosed for partially synchronizing primary data files by synchronizing portions thereof via auto-restore operations from backup data. Examples of such systems and methods are described in further detail with reference to FIGS. 2-6. Components and functionality for partial synchronization of primary data files, based on synchronizing portions thereof via auto-restore operations from backup data, may be configured and/or incorporated into information management systems such as those shown in FIGS. 1A-1H.

“Information Management System Overview”

With the increasing importance of protecting and leveraging data, organizations simply cannot afford the risk of losing critical data. Moreover, runaway data growth and other modern realities make protecting and managing data an increasingly difficult task. There is therefore a need for efficient, powerful, and user-friendly solutions for protecting and managing data.

Depending on the size of the organization, there are typically many data production sources under the purview of hundreds or even thousands of employees or other individuals. In the past, individual employees were sometimes responsible for managing and protecting their own data. In other cases, a patchwork of hardware and software point solutions was used. These solutions were often provided by different vendors and had limited or no interoperability.

Certain embodiments described herein provide systems and methods capable of addressing these and other shortcomings of prior approaches by implementing unified, organization-wide information management. FIG. 1A shows one such information management system 100, which generally includes combinations of hardware and software configured to protect and manage data and metadata generated and used by the various computing devices in the information management system 100. The organization that employs the information management system 100 may be a corporation or other business entity, a non-profit organization, an educational institution, a household, a governmental agency, or the like.

Generally, the systems and associated components described herein may be compatible with and/or provide some or all of the functionality of the systems and components described in one or more U.S. patents and patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated in its entirety by reference herein.

The information management system 100 can include a variety of different computing devices. For instance, as will be described in greater detail herein, the information management system 100 can include one or more client computing devices 102 and secondary storage computing devices 106.

Computing devices can include, without limitation, one or more of: workstations, personal computers, desktop computers, or other types of generally fixed computing systems such as mainframe computers and minicomputers. Other computing devices can include mobile or portable computing devices, such as laptops, tablet computers, personal data assistants, mobile phones (such as smartphones), and other mobile or portable computing devices such as embedded computers, set top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

In some cases, a computing device includes virtualized and/or cloud computing resources. For instance, one or more virtual machines may be provided to the organization by a third-party cloud service vendor. In some embodiments, computing devices can include one or more virtual machines running on a physical host computing device (or “host machine”) operated by the organization. As one example, the organization may use one virtual machine as a database server and another virtual machine as a mail server, both virtual machines operating on the same host machine.

A virtual machine includes an operating system and associated virtual resources, and is hosted on a physical host computer (or host machine). A hypervisor (typically software, and also known in the art as a virtual machine monitor or a virtual machine manager or “VMM”) sits between the virtual machine and the hardware of the physical host machine. One example of a hypervisor used as virtualization software is ESX Server, by VMware, Inc. of Palo Alto, Calif. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation of Redmond, Wash., and Sun xVM by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be firmware or hardware, or a combination of software, firmware, and/or hardware.

The hypervisor provides each virtual operating system with virtual resources, such as a virtual processor, virtual memory, virtual network devices, and virtual disks. Each virtual machine has one or more virtual disks. The hypervisor typically stores the data of the virtual disks in files on the file system of the physical host machine, called virtual hard disk image files (in the case of Microsoft virtual servers) or virtual machine disk files (in the case of VMware virtual servers). For example, VMware's ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine disk files. A virtual machine reads data from and writes data to its virtual disk in much the same way that a physical machine reads data from and writes data to a physical disk.

Examples of information management techniques in cloud computing environments are described in U.S. Pat. No. 8,285,681, which is incorporated by reference herein. Examples of techniques for managing information in virtualized computing environments are described in U.S. Pat. No. 8,307,177, also incorporated by reference herein.

The information management system 100 can also include a variety of storage devices, including primary storage devices 104 and secondary storage devices 108, for example. Storage devices can generally be of any suitable type, including, without limitation, hard-disk arrays, semiconductor memory (e.g., solid state storage devices), network-attached storage (NAS) devices, tape libraries or other magnetic, non-tape storage devices, optical media storage devices, DNA/RNA-based memory technology, and combinations of the same. In some instances, storage devices form part of a distributed storage system. In some cases, storage devices are provided in a cloud (e.g., a private cloud or one operated by a third-party vendor). A storage device in some cases comprises a disk array or a portion thereof.

The illustrated information management system 100 includes one or more client computing devices 102 having at least one application 110 executing thereon, and one or more primary storage devices 104 storing primary data 112. The client computing device(s) 102 and the primary storage devices 104 may in some cases be referred to as a primary storage subsystem 117. A computing device in an information management system 100 that has a data agent 142 installed and operating on it is generally referred to as a client computing device 102 (or, in the context of a component of the information management system 100, simply as a “client”).

The meaning of the term “information management system” depends on the context. It can refer generally to all of the illustrated hardware and software components. In other instances, the term may refer to only a subset of the illustrated components.

For instance, in some cases, the information management system 100 generally refers to a combination of components used to protect, move, and manage data and metadata generated by the client computing devices 102. However, the information management system 100 in some cases does not include the underlying components that generate and/or store the primary data 112, such as the client computing devices 102 themselves, the applications 110, and the primary storage devices 104. As an example, “information management system” may sometimes refer to one or more of the following components and corresponding data structures: storage managers, data agents, and media agents. These components are described in further detail below.

Client Computing Devices

An organization may need to protect and manage data from a variety of sources. As just one example, in a corporate environment such data sources can be employee workstations and company servers such as mail servers, web servers, database servers, transaction servers, and the like. In the information management system 100, the data generation sources include the one or more client computing devices 102.

The client computing devices 102 may include any of the types of computing devices described above, without limitation, and in some cases the client computing devices 102 are associated with one or more users and/or corresponding user accounts, of employees or other individuals.

The information management system 100 generally addresses and handles the data management and protection needs for the data generated by the client computing devices 102. However, the use of the term “client” does not imply that the client computing devices 102 cannot be “servers” in other respects. For instance, a particular client computing device 102 may act as a server with respect to other devices, such as other client computing devices 102. As just a few examples, the client computing devices 102 can include mail servers, file servers, database servers, and web servers.

Each client computing device 102 may have one or more applications 110 (e.g., software applications) executing thereon, which generate and manipulate the data that is to be protected from loss and managed. The applications 110 generally facilitate the operations of an organization (or multiple affiliated organizations), and can include, without limitation, file server applications, mail server applications (e.g., Microsoft Exchange Server), mail client applications (e.g., Microsoft Exchange Client), database applications (e.g., SQL, Oracle, SAP, Lotus Notes Database), word processing applications (e.g., Microsoft Word), spreadsheet applications, financial applications, presentation applications, graphics applications, web applications, mobile applications, entertainment applications, and so on.

The client computing devices 102 can have at least one operating system (e.g., Microsoft Windows, Mac OS X, iOS, IBM z/OS, Linux, other Unix-based operating systems, etc.) installed thereon, which may support or host one or more file systems and other applications 110.

The client computing devices 102 and other components in the information management system 100 can be connected to one another via one or more communication pathways 114. For example, a first communication pathway 114 may connect client computing device 102 and secondary storage computing device 106; a second communication pathway 114 may connect storage manager 140 and client computing device 102; and a third communication pathway 114 may connect storage manager 140 and secondary storage computing device 106 (see, e.g., FIG. 1A and FIG. 1C). In some cases, the communication pathways 114 may also include application programming interfaces (APIs), such as cloud service provider APIs and virtual machine management APIs. The underlying infrastructure of the communication pathways 114 may be wired and/or wireless, analog and/or digital, or any combination thereof, and the facilities used may be private, public, third-party provided, or any combination thereof.

Primary Data and Exemplary Primary Storage Devices

Primary data 112 according to some embodiments is production data or other “live” data generated by the operating system and/or applications 110 executing on a client computing device 102. The primary data 112 is generally stored on the primary storage device(s) 104 and is organized via a file system supported by the client computing device 102. For instance, the client computing device(s) 102 and corresponding applications 110 may create, access, modify, write, delete, and otherwise use primary data 112. In some cases, some or all of the primary data 112 can be stored in cloud storage resources.

Primary data 112 is generally in the native format of the source application 110. According to certain aspects, primary data 112 is an initial or first stored copy of data generated by the source application 110 (e.g., created before any other copies or before at least one other copy). Primary data 112 in some cases is created substantially directly from data generated by the corresponding source applications 110.

The primary storage devices 104 storing the primary data 112 may be relatively fast and/or expensive (e.g., disk drives, hard-disk arrays, solid state memory, etc.). In addition, primary data 112 may be highly changeable and/or intended for relatively short-term retention (e.g., hours, days, or weeks).

According to some embodiments, the client computing device 102 can access primary data 112 from the primary storage device 104 by making conventional file system calls via the operating system. Primary data 112 may include structured data, unstructured data, and/or semi-structured data. Some specific examples are described below with respect to FIG. 1B.

It can be useful in performing certain tasks to organize the primary data 112 into units of different granularities. In general, primary data 112 can include files, directories, file system volumes, data blocks, extents, or any other hierarchies or organizations of data objects. As used herein, a “data object” can refer to either (1) any file that is currently addressable by a file system or that was previously addressable by the file system (e.g., an archive file), or (2) a subset of such a file (e.g., a data block).

As will be described in further detail, it can also be useful in performing certain functions of the information management system 100 to access and modify metadata within the primary data 112. Metadata generally includes information about data objects or characteristics associated with the data objects. For simplicity herein, any reference to primary data 112 generally also includes its associated metadata, but references to the metadata do not include the primary data.

Metadata can include, without limitation, one or more of the following: the data owner (e.g., the client or user that generated the data object), the last modified time (e.g., the time of the most recent modification of the data object), the data object size (e.g., a number of bytes of data), information about the content (e.g., an indication as to the existence of a particular search term), user-supplied tags, to/from information for email (e.g., an email sender, recipient, etc.), creation date, file type (e.g., format or application type), last accessed time, application type (e.g., the type of application that generated the data object), location/network (e.g., a current, past, or future location of the data object and network pathways to/from the data object), partition layouts, file location within a file folder directory structure, user permissions, owners, groups, access control lists (ACLs), and system metadata.
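The metadata categories listed above can be pictured as a simple per-object record. The following is only an illustrative sketch: the class name `DataObjectMetadata`, its field names, and the helper `modified_since` are hypothetical and are not taken from the disclosure, and the record covers only a subset of the listed metadata.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class DataObjectMetadata:
    """Hypothetical record of metadata kept for one data object."""
    owner: str                      # client or user that generated the object
    file_type: str                  # format or application type
    size_bytes: int                 # data object size
    created: datetime               # creation date
    last_modified: datetime         # time of most recent modification
    last_accessed: Optional[datetime] = None
    location: Optional[str] = None  # current location or network path
    tags: List[str] = field(default_factory=list)      # user-supplied tags
    acl: Dict[str, str] = field(default_factory=dict)  # principal -> permission


def modified_since(records, cutoff):
    """Return the records whose last modification is later than cutoff."""
    return [r for r in records if r.last_modified > cutoff]
```

A change-detection step could, for example, call `modified_since` with the time of the previous secondary copy operation to pick out candidate objects; that usage is an assumption made for illustration.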

In addition to metadata generated by or related to file systems and operating systems, some of the applications 110 and/or other components of the information management system 100 maintain indices of metadata for data objects, e.g., metadata associated with individual email messages. Thus, each data object may be associated with corresponding metadata. The use of metadata to perform classification and other functions is described in greater detail below.

Each of the client computing devices 102 is generally associated with and/or in communication with one or more of the primary storage devices 104 storing corresponding primary data 112. A client computing device 102 may be considered to be “associated with” or “in communication with” a primary storage device 104 if it is capable of one or more of: routing and/or storing data (e.g., primary data 112) to the particular primary storage device 104; coordinating the routing and/or storing of data to the particular primary storage device 104; retrieving data from the particular primary storage device 104; coordinating the retrieval of data from the particular primary storage device 104; and modifying and/or deleting data retrieved from the particular primary storage device 104.

The primary storage devices 104 can include any of the different types of storage devices described above, or some other kind of suitable storage device. The primary storage devices 104 may be relatively fast and/or expensive in comparison to the secondary storage devices 108. For example, the information management system 100 may access data and metadata stored on the primary storage devices 104 quite often, whereas data and metadata stored on the secondary storage devices 108 is accessed relatively less frequently.

Primary storage devices 104 may be dedicated or shared. In some cases, each primary storage device 104 is dedicated to an associated client computing device 102. For instance, a primary storage device 104 in one embodiment is a local disk drive of a corresponding client computing device 102. In other cases, one or more primary storage devices 104 can be shared by multiple client computing devices 102, e.g., via a network such as in a cloud storage implementation. As one example, a primary storage device 104 can be a disk array shared by a group of client computing devices 102, such as an EMC Clariion, EMC Symmetrix, or EMC Celerra disk array.

The information management system 100 may also include hosted services (not shown), which may be hosted in some cases by an entity other than the organization that employs the other components of the information management system 100. For instance, the hosted services may be provided by various online service providers to the organization. Such service providers can provide services including social networking services, hosted email services, and hosted productivity applications. Hosted services may include software-as-a-service (SaaS), platform-as-a-service (PaaS), application service providers (ASPs), cloud services, or other mechanisms for delivering functionality via a network. As it provides services to users, each hosted service may generate additional data and metadata under management of the information management system 100, e.g., as primary data 112. In some cases, the hosted services may be accessed using one of the applications 110. As an example, a hosted mail service may be accessed via a browser running on a client computing device 102. The hosted services may be implemented in a variety of computing environments. In some cases, they are implemented in an environment having a similar arrangement to the information management system 100, where various physical and logical components are distributed over a network.

Secondary Copies and Exemplary Secondary Storage Devices

The primary data 112 stored on the primary storage devices 104 may be compromised in some cases, such as when an employee deliberately or accidentally deletes or overwrites primary data 112 during the normal course of work. The primary storage devices 104 can also be damaged, lost, or otherwise corrupted. For recovery and/or regulatory compliance purposes, it is therefore useful to generate copies of the primary data 112. Accordingly, the information management system 100 includes one or more secondary storage computing devices 106 and one or more secondary storage devices 108 configured to create and store one or more secondary copies 116 of the primary data 112 and associated metadata. The secondary storage computing devices 106 and the secondary storage devices 108 may sometimes be referred to as a secondary storage subsystem 118.

Creation of secondary copies 116 can help in search and analysis efforts and meet other information management goals, such as restoring data and/or metadata if an original version (e.g., of primary data 112) is lost (e.g., by deletion, corruption, or disaster), and allowing point-in-time recovery.

The client computing devices 102 access or receive primary data 112 and communicate the data, e.g., over one or more communication pathways 114, for storage in the secondary storage device(s) 108.

A secondary copy 116 can comprise a separate stored copy of application data that is derived from one or more earlier-created, stored copies (e.g., derived from primary data 112 or another secondary copy 116). Secondary copies 116 can include point-in-time data, and may be intended for relatively long-term retention (e.g., weeks, months, or years) before some or all of the data is moved to other storage or is discarded.

In some cases, a secondary copy 116 is a copy of application data created and stored subsequent to at least one other stored instance (e.g., subsequent to corresponding primary data 112 or to another secondary copy 116), in a different storage device than at least one previous stored copy, and/or remotely from at least one previous stored copy. In some other cases, secondary copies can be stored in the same storage device as primary data 112 and/or other previously stored copies. For example, in one embodiment a disk array capable of performing hardware snapshots stores primary data 112 and creates and stores hardware snapshots of the primary data 112 as secondary copies 116. Secondary copies 116 may be stored in relatively low-cost storage (e.g., magnetic tape). A secondary copy 116 may be stored in a backup or archive format, or in some other format different from the native source application format or other primary data format.

In some cases, secondary copies 116 are indexed so users can browse and restore at a later time. After creation of a secondary copy 116 representative of certain primary data 112, a pointer or other location indicator (e.g., a stub) may be placed in primary data 112, or be otherwise associated with primary data 112, to indicate the current location of the secondary copy 116 on the secondary storage device(s) 108.

Since an instance of a data object or metadata in primary data 112 may change over time as it is modified by an application 110 (or the operating system), the information management system 100 may create and manage multiple secondary copies 116 of a particular data object or metadata, each representing the state of the data object in primary data 112 at a particular point in time. Moreover, since an instance of a data object in primary data 112 may eventually be deleted from the primary storage device 104 and the file system, the information management system 100 may continue to manage point-in-time representations of that data object, even though the instance in primary data 112 no longer exists.
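The point-in-time behavior described above can be sketched as a small catalog that keeps several copies per data object and answers the question "what did this object look like at time T?". This is a minimal sketch under stated assumptions: the class `PointInTimeCatalog` and its methods are hypothetical names, timestamps are plain comparable values, and copies are held in memory rather than on a secondary storage device.

```python
import bisect
from collections import defaultdict


class PointInTimeCatalog:
    """Hypothetical catalog of point-in-time secondary copies per data object."""

    def __init__(self):
        self._times = defaultdict(list)   # object id -> sorted timestamps
        self._copies = defaultdict(list)  # object id -> copies, same order

    def record_copy(self, object_id, timestamp, content):
        """Register a secondary copy taken at the given timestamp."""
        idx = bisect.bisect_right(self._times[object_id], timestamp)
        self._times[object_id].insert(idx, timestamp)
        self._copies[object_id].insert(idx, content)

    def state_at(self, object_id, timestamp):
        """Return the latest copy taken at or before timestamp, else None."""
        times = self._times.get(object_id, [])
        idx = bisect.bisect_right(times, timestamp)
        if idx == 0:
            return None
        return self._copies[object_id][idx - 1]
```

Because the catalog is keyed by object identifier rather than by a live file path, a lookup still succeeds after the corresponding instance has been deleted from primary data, which mirrors the last sentence of the paragraph above.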

For virtualized computing devices, the operating system and other applications 110 of the client computing device(s) 102 may execute within or under the management of virtualization software (e.g., a VMM), and the primary storage device(s) 104 may comprise a virtual disk created on a physical storage device. The information management system 100 may create secondary copies 116 of the files or other data objects in a virtual disk file and/or secondary copies 116 of the entire virtual disk file itself (e.g., of an entire .vmdk file).

Secondary copies 116 may be distinguished from corresponding primary data 112 in a variety of ways, some of which will now be described. First, as discussed, secondary copies 116 can be stored in a different format (e.g., backup, archive, or other non-native format) than primary data 112. For this or other reasons, secondary copies 116 may not be directly usable by the applications 110 of the client computing device 102.

Secondary copies 116 are also in some embodiments stored on a secondary storage device 108 that is inaccessible to the applications 110 running on the client computing devices 102 (and/or hosted services). Some secondary copies 116 may be “offline copies,” in that they are not readily available (e.g., not mounted to tape or disk). Offline copies can include copies of data that the information management system 100 can access without human intervention.

The Use of Intermediate Devices to Create Secondary Copies

Creating secondary copies can be a challenging task. For instance, there can be hundreds of client computing devices 102 continually generating large volumes of primary data 112 to be protected. There can also be significant overhead involved in the creation of secondary copies 116. Moreover, secondary storage devices 108 may be special-purpose components, and interacting with them can require specialized intelligence.

In some cases, the client computing devices 102 interact directly with the secondary storage device 108 to create the secondary copies 116. However, in view of the factors described above, this approach can negatively impact the ability of the client computing devices 102 to serve the applications 110 and produce primary data 112. Further, the client computing devices 102 may not be optimized for interaction with the secondary storage devices 108.

Thus, in some embodiments, the information management system 100 includes one or more software and/or hardware components which generally act as intermediaries between the client computing devices 102 and the secondary storage devices 108. In addition to off-loading certain responsibilities from the client computing devices 102, these intermediate components can provide other benefits. For instance, as discussed further below with respect to FIG. 1D, distributing some of the work involved in creating secondary copies 116 can enhance scalability.

The intermediate components can include one or more secondary storage computing devices 106 as shown in FIG. 1A and/or one or more media agents, which can be software modules operating on corresponding secondary storage computing devices 106 (or other appropriate computing devices). Media agents are discussed below (e.g., with respect to FIGS. 1C-1E).

The secondary storage computing device(s) 106 can comprise any of the computing devices described above, without limitation. In some cases, the secondary storage computing device(s) 106 include specialized hardware and/or software components for interacting with the secondary storage devices 108.

To create a secondary copy 116 involving the copying of data from the primary storage subsystem 117 to the secondary storage subsystem 118, the client computing device 102 in some embodiments communicates the primary data 112 to be copied (or a processed version thereof) to the designated secondary storage computing device 106 via the communication pathway 114. The secondary storage computing device 106 in turn conveys the received data (or a processed version thereof) to the secondary storage device 108. In some such configurations, the communication pathway 114 between the client computing device 102 and the secondary storage computing device 106 comprises a portion of a LAN, WAN, or SAN. In other cases, at least some client computing devices 102 communicate directly with the secondary storage devices 108 (e.g., via Fibre Channel or SCSI connections). In some other cases, one or more secondary copies 116 are created from existing secondary copies, such as in the case of an auxiliary copy operation.
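The data path described above (client, then intermediary, then secondary storage device) can be sketched as follows. This is a hedged illustration only: the class and function names are hypothetical, the secondary storage device is modeled as an in-memory keyed store, and zlib compression stands in for whatever "processed version" a real intermediary would produce.

```python
import zlib


class SecondaryStorageDevice:
    """Stand-in for a secondary storage device: a simple keyed blob store."""

    def __init__(self):
        self._blobs = {}

    def write(self, key, blob):
        self._blobs[key] = blob

    def read(self, key):
        return self._blobs[key]


class SecondaryStorageComputingDevice:
    """Intermediary that processes data before conveying it to storage."""

    def __init__(self, device):
        self._device = device

    def store_secondary_copy(self, key, primary_bytes):
        # Assumption: "processing" is modeled as compression, so the stored
        # copy is in a non-native format.
        processed = zlib.compress(primary_bytes)
        self._device.write(key, processed)
        return len(processed)

    def restore(self, key):
        # Reverse the processing to recover data in its original form.
        return zlib.decompress(self._device.read(key))


def client_backup(intermediary, name, primary_bytes):
    """Client side: hand primary data to the designated intermediary."""
    return intermediary.store_secondary_copy(name, primary_bytes)
```

In this sketch the client never touches the storage device directly, which reflects the off-loading rationale given earlier; a direct Fibre Channel or SCSI path would bypass the intermediary class.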

Exemplary Primary Data and Exemplary Secondary Data

FIG. 1B shows examples of primary data stored on the primary storage device(s) 104 and secondary copy data stored on the secondary storage device(s) 108. Stored on the primary storage device(s) 104 are primary data objects including word processing documents 119A-B, spreadsheets 120, presentation files 122, video files 124, image files 126, email mailboxes 128 (and corresponding email messages 129A-C), html/xml files 130, databases 132, and corresponding tables or other data structures 133A-133C.

Some or all primary data objects are associated with corresponding metadata (e.g., “Meta1-11”), which may include file system metadata and/or application-specific metadata. Stored on the secondary storage device(s) 108 are secondary copy data objects 134A-C, which may include copies of, or otherwise represent, corresponding primary data objects and metadata.

As shown, the secondary copy data objects 134A-C can individually represent more than one primary data object. For example, secondary copy data object 134A represents three separate primary data objects 133C, 122, and 129C (represented as 133C′, 122′, and 129C′, respectively, and accompanied by the corresponding metadata Meta11, Meta3, and Meta8, respectively). As indicated by the prime mark (′), a secondary copy object may store a representation of a primary data object and/or metadata differently than the original format. Likewise, secondary data object 134B represents primary data objects 120, 130B, and 119A, accompanied by the corresponding metadata Meta2, Meta10, and Meta1, respectively. Secondary data object 134C represents primary data objects 130A, 119B, and 129A, accompanied by the corresponding metadata Meta9, Meta5, and Meta6, respectively.

Exemplary Information Management System Architecture

The information management system 100 can incorporate a variety of different hardware and software components, which can in turn be organized with respect to one another in many different configurations, depending on the embodiment. There are critical design choices involved in specifying the functional responsibilities of the components and the role of each component in the information management system 100. As will be discussed, such design choices can impact performance as well as the adaptability of the information management system 100 to data growth or other changing circumstances.

FIG. 1C shows an information management system 100 designed according to these considerations, which includes: a storage manager 140, a centralized storage and/or information manager that is configured to perform certain control functions; one or more data agents 142 executing on the client computing device(s) 102 and configured to process primary data 112; and one or more media agents 144 executing on the one or more secondary storage computing devices 106 for performing tasks involving the secondary storage devices 108. While distributing functionality amongst multiple computing devices can have certain advantages, in other contexts it can be beneficial to consolidate functionality on the same computing device. As such, in various other embodiments, one or more of the components shown in FIG. 1C as being implemented on separate computing devices are implemented on the same computing device. In one configuration, a storage manager 140, one or more data agents 142, and one or more media agents 144 are all implemented on the same computing device. In another embodiment, one or more data agents 142 and one or more media agents 144 are implemented on the same computing device, while the storage manager 140 is implemented on a separate computing device, without limitation.

Storage Manager

As noted, the number of components in the information management system 100 and the amount of data under management can be quite large. Managing the components and data is therefore a significant task, and one that can grow in an often unpredictable fashion as the quantity of components and data scale to meet the needs of the organization. For these and other reasons, according to certain embodiments, responsibility for controlling the information management system 100, or at least a significant portion of that responsibility, is allocated to the storage manager 140. By distributing control functionality in this manner, the storage manager 140 can be adapted independently according to changing circumstances. Moreover, a computing device for hosting the storage manager 140 can be selected to best suit the functions of the storage manager 140. These and other advantages are described in further detail below with respect to FIG. 1D.

The storage manager 140 may be a software module or other application which, in certain embodiments, operates in conjunction with one or more associated data structures (e.g., a dedicated database such as management database 146). In some embodiments, storage manager 140 is a computing device that executes computer instructions. The storage manager generally initiates, performs, coordinates, and/or controls storage and other information management operations performed by the information management system 100, e.g., to protect and control the primary data 112 and secondary copies 116 of data and metadata. In general, storage manager 140 may be said to manage the information management system 100, which includes managing its constituent components (e.g., data agents and media agents).

As shown by the dashed arrowed lines 114 in FIG. 1C, the storage manager 140 may communicate with and/or control some or all elements of the information management system 100, such as the data agents 142 and media agents 144. Thus, in certain embodiments, control information originates from the storage manager 140 and status reporting is transmitted to the storage manager 140 by the various managed components, whereas payload data and payload metadata are generally communicated between the data agents 142 and the media agents 144 (or otherwise between the client computing device(s) 102 and the secondary storage computing device(s) 106), e.g., at the direction of and under the management of the storage manager 140. Control information can generally include parameters and instructions for carrying out information management operations, such as, without limitation, instructions to perform a task associated with an operation, timing information specifying when to initiate a task associated with an operation, and data path information specifying which components to communicate with or access in carrying out an operation. Payload data, on the other hand, can include the actual data involved in the storage operation, such as content data written to a secondary storage device 108 in a secondary copy operation. Payload metadata can include any of the types of metadata described herein, and may be written to a storage device along with the payload content data (e.g., in the form of a header).

In some embodiments, certain information management operations are controlled by other components of the information management system 100 (e.g., the media agent(s) 144 or data agent(s) 142), in addition to or in combination with the storage manager 140.

According to certain embodiments, the storage manager 140 provides one or more of the functions described below.

The storage manager 140 may maintain a database 146 (or “storage manager database 146” or “management database 146”) of management-related data and information management policies 148. The database 146 may include a management index 150 (or “index 150”) or other data structure that stores logical associations between components of the system, user preferences and/or profiles (e.g., preferences regarding encryption, compression, or deduplication, preferences regarding the scheduling, type, or other aspects of operations, mappings of particular information management users to certain computing devices or other components, etc.), management tasks, media containerization, or other useful data. For example, the storage manager 140 may use the index 150 to track logical associations between media agents 144 and secondary storage devices 108 and/or movement of data from primary storage devices 104 to secondary storage devices 108. For instance, the index 150 may store data associating a client computing device 102 with a particular media agent 144 and/or secondary storage device 108, as specified in an information management policy 148, which is described in more detail below.

Administrators and other individuals may be able to configure and initiate certain information management operations on an individual basis. While this may be acceptable for some recovery operations or other relatively infrequent tasks, it is often not workable for implementing ongoing organization-wide data protection and management. Thus, the information management system 100 may use information management policies 148 to specify and execute information management operations (e.g., on an automated basis). Generally, an information management policy 148 can include a data structure or other information source that specifies a set of parameters (e.g., criteria and rules) associated with storage or other information management operations.
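A policy expressed as a data structure, together with the index associations derived from it, can be sketched as below. The field names (`client_ids`, `media_agent`, `secondary_storage_device`, `retention_days`) and the function `build_index` are hypothetical choices made for illustration; the disclosure does not prescribe a particular schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationManagementPolicy:
    """Hypothetical policy: which clients go where, and for how long."""
    name: str
    client_ids: frozenset          # clients the policy applies to
    media_agent: str               # media agent that handles the copies
    secondary_storage_device: str  # destination for the secondary copies
    retention_days: int            # how long the copies are retained


def build_index(policies):
    """Map each client to its (media agent, storage device) association.

    If a client appears in several policies, the last policy wins; a real
    system would need an explicit precedence rule.
    """
    index = {}
    for policy in policies:
        for client_id in policy.client_ids:
            index[client_id] = (policy.media_agent,
                                policy.secondary_storage_device)
    return index
```

With such an index, a storage manager could look up the data path for a client without re-evaluating every policy each time an operation starts; that optimization is an assumption of this sketch rather than a statement about the described system.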

The storage manager database 146 may maintain the information management policies 148 and associated data, although the information management policies 148 can be stored in any appropriate location. For instance, an information management policy 148 such as a storage policy may be stored as metadata in a media agent database 152 or in a secondary storage device 108 (e.g., as an archive copy) for use in restore operations or other information management operations, depending on the embodiment. Information management policies 148 are described further below.

According to some embodiments, the storage manager database 146 comprises a relational database (e.g., an SQL database) for tracking metadata, such as metadata associated with secondary copy operations (e.g., which client computing devices 102 and corresponding data were protected). This and other metadata may additionally be stored in other locations, such as at the secondary storage computing devices 106 or on the secondary storage devices 108, allowing data recovery without the use of the storage manager 140 in some cases.

As shown, the storage manager 140 may include a jobs agent 156, a user interface 158, and a management agent 154, all of which may be implemented as interconnected software modules or application programs.

The jobs agent 156 in some embodiments initiates, controls, and/or monitors the status of some or all storage or other information management operations currently being performed or scheduled to be performed by the information management system 100. For instance, the jobs agent 156 may access information management policies 148 to determine when and how to initiate and control secondary copy and other information management operations.

The user interface 158 may include information processing and display software, such as a graphical user interface (“GUI”) or an application program interface (“API”). Via the user interface 158, users may optionally issue instructions to the components in the information management system 100 regarding performance of storage and recovery operations. For example, a user may modify a schedule concerning the number of pending secondary copy operations. As another example, a user may employ the GUI to view the status of pending storage operations or to monitor the status of certain components in the information management system 100 (e.g., the amount of capacity left in a storage device).

An “information management cell” (or “storage operation cell” or “cell”) may generally include a logical and/or physical grouping of a combination of hardware and software components associated with performing information management operations on electronic data, typically at least one storage manager 140, at least one client computing device 102 (comprising data agent(s) 142), and at least one media agent 144. For instance, the components shown in FIG. 1C may together form an information management cell. Multiple cells may be organized hierarchically. With this configuration, cells may inherit properties from hierarchically superior cells or be controlled by other cells in the hierarchy (automatically or otherwise). In some embodiments, cells may inherit or otherwise be associated with information management policies, preferences, information management metrics, or other properties or characteristics according to their relative position in a hierarchy of cells. Cells may also be delineated and/or organized hierarchically according to function, geography, architectural considerations, or other factors useful in performing information management operations. For example, a first cell may represent a geographic segment of an enterprise, such as a Chicago office, and a second cell may represent a different geographic segment, such as a New York office. Other cells may represent departments within a particular office. Where delineated by function, a first cell may perform one or more first types of information management operations (e.g., one or more first types of secondary or other copies), and a second cell may perform one or more second types of information management operations.
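The inheritance of properties through a hierarchy of cells can be sketched with a simple parent-pointer lookup. The class `Cell`, its `resolve` method, and the example property names are hypothetical and chosen only to illustrate the idea that a cell takes a value from the nearest hierarchically superior cell that defines it.

```python
class Cell:
    """Hypothetical information management cell in a hierarchy."""

    def __init__(self, name, parent=None, **properties):
        self.name = name
        self.parent = parent                # hierarchically superior cell
        self.properties = dict(properties)  # values set locally on this cell

    def resolve(self, key, default=None):
        """Return the value for key from this cell or its nearest ancestor."""
        cell = self
        while cell is not None:
            if key in cell.properties:
                return cell.properties[key]
            cell = cell.parent
        return default


# Example hierarchy: an enterprise cell, an office cell, a department cell.
enterprise = Cell("enterprise", retention_days=90, encryption=True)
chicago = Cell("chicago", parent=enterprise, retention_days=30)
finance = Cell("finance", parent=chicago)
```

Here the department cell sets nothing locally, so it picks up the office's retention setting and the enterprise-wide encryption setting; a local value on any cell overrides whatever its ancestors specify.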

“The storage manager 140 can also track information that allows it to select, designate, or otherwise identify content indexes, deduplication databases, or similar resources or data sets within its information management cell (or another cell) to be searched in response to certain queries. These queries can be entered via the user interface 158. The management agent 154 allows multiple information management cells to communicate with one another. In some cases, the information management system 100 may be one information management cell in a network of multiple cells that are adjacent to one another or otherwise logically related in a WAN or LAN. These cells can be connected to one another through their respective management agents 154.

“For example, the management agent 154 can give the storage manager 140 the ability to communicate with other components of the information management system 100 (and/or other cells within a larger information management system) via network protocols and application programming interfaces ("APIs"), including, e.g., HTTP, HTTPS, FTP, REST, and virtualization software APIs. Inter-cell communication and hierarchy are described in greater detail in U.S. Pat. Nos. 7,747,579 and 7,343,453, which are incorporated herein by reference.”

“Data Agents”

“As discussed, a variety of applications 110 can run on a given client computing device 102, including operating systems, database applications, e-mail applications, and virtual machines, to name just a few. As part of creating and restoring secondary copies 116, the client computing devices 102 may be tasked with processing and preparing the primary data 112 from these different applications 110. Moreover, the nature of the processing/preparation can differ across clients and application types, e.g., due to inherent structural and formatting differences among applications 110.”

“The one or more data agents 142 can therefore be advantageously configured in certain embodiments to assist in the performance of information management operations based on the type of data being protected, at a client-specific and/or application-specific level.”

“The data agent 142 may be a software module or component that is generally responsible for initiating, managing, or otherwise assisting in the performance of information management operations in the information management system 100, generally as directed by the storage manager 140. For instance, the data agent 142 may take part in data storage operations such as copying, archiving, and migrating primary data 112 stored in the primary storage device(s) 104. The data agent 142 may receive control information from the storage manager 140, such as commands to transfer copies of data objects and metadata to the media agents 144.

“In some embodiments, a data agent 142 may be distributed between the client computing device 102 and the storage manager 140 (and any other intermediate components), or it may be deployed from a remote location, or its functions may be approximated by a remote process that performs some or all of the functions of the data agent 142. In addition, a data agent 142 may perform some functions provided by a media agent 144, or may perform other functions such as encryption and deduplication.

Each data agent 142 can be specialized for a particular application 110, and the system can employ multiple application-specific data agents 142, each of which may perform information management operations (e.g., backup, migration, and data recovery) associated with a different application 110. For instance, different individual data agents 142 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Active Directory Objects data, Microsoft Windows file system data, and Microsoft SQL Server data.

A file system data agent, for example, may handle data files and/or other file system information. If a client computing device 102 has two or more types of data, a specialized data agent 142 may be used for each data type to back up, archive, migrate, and restore the data of the client computing device 102. For example, to back up, migrate, and/or restore all of the data on a Microsoft Exchange server, the client computing device 102 might use a Microsoft Exchange Mailbox data agent 142, a Microsoft Exchange Database data agent 142, a Microsoft Exchange Public Folder data agent 142, and a Microsoft Windows File System data agent 142. These specialized data agents 142 can be treated as four separate data agents 142 even though they all operate on the same client computing device 102.

“Other embodiments may employ one or more generic data agents 142 that can handle and process data from two or more different applications 110, or that can handle and process multiple data types, instead of or in addition to using specialized data agents 142. For example, one generic data agent 142 could be used to back up, migrate, and restore Microsoft Exchange Mailbox data and Microsoft Exchange Database data, while another generic data agent 142 might handle Microsoft Exchange Public Folder data and Microsoft Windows File System data.
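The choice between specialized and generic data agents described above can be sketched as a simple lookup. Everything named here is hypothetical: the data-type labels, the agent names, and the `select_agent` function are illustrative stand-ins, and the preference order is an assumption rather than anything the patent specifies.

```python
from typing import Optional

# Hypothetical registries; labels and agent names are illustrative only.
SPECIALIZED_AGENTS = {
    "exchange_mailbox": "ExchangeMailboxAgent",
    "exchange_database": "ExchangeDatabaseAgent",
}
GENERIC_AGENTS = {
    "GenericAgentA": {"exchange_mailbox", "exchange_database"},
    "GenericAgentB": {"exchange_public_folder", "windows_file_system"},
}


def select_agent(data_type: str, prefer_specialized: bool = True) -> Optional[str]:
    """Return the name of an agent able to handle `data_type`, or None if there is none."""
    specialized = SPECIALIZED_AGENTS.get(data_type)
    # First generic agent whose supported set contains the data type, if any.
    generic = next(
        (name for name, types in GENERIC_AGENTS.items() if data_type in types),
        None,
    )
    if prefer_specialized:
        return specialized or generic
    return generic or specialized


print(select_agent("exchange_mailbox"))
print(select_agent("exchange_mailbox", prefer_specialized=False))
print(select_agent("windows_file_system"))
```

Keeping both registries lets a deployment mix the two approaches: a data type with no specialized agent still resolves to a generic one.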

Each data agent 142 can be configured to access data and/or metadata stored in the primary storage device(s) 104 and process the data as appropriate. For example, the data agent 142 might arrange the data and metadata into one or more files having a certain format (e.g., a particular backup or archive format) before transferring the file(s) to a media agent 144 or another component. The file(s) may include a list of files or other metadata. Each data agent 142 can also assist in restoring data or metadata to primary storage devices 104 from a secondary copy 116. For instance, the data agent 142 may operate in conjunction with the storage manager 140 and one or more of the media agents 144 to restore data from secondary storage device(s) 108.
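As a rough illustration of arranging data and metadata into a single file that carries a list of its contents, the sketch below bundles in-memory files into a tar archive whose first member is a JSON manifest. The tar container, the `manifest.json` name, and both function names are assumptions chosen for the example; the patent does not name a specific backup or archive format.

```python
import io
import json
import tarfile


def package_for_backup(files: dict) -> bytes:
    """Bundle {path: bytes} into one tar archive that begins with a JSON manifest."""
    # The manifest plays the role of the "list of files or other metadata".
    manifest = [{"path": path, "size": len(files[path])} for path in sorted(files)]
    entries = [("manifest.json", json.dumps(manifest).encode("utf-8"))]
    entries += [(path, files[path]) for path in sorted(files)]

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, payload in entries:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_manifest(archive_bytes: bytes) -> list:
    """Read back only the manifest, without extracting the file payloads."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        return json.load(archive.extractfile("manifest.json"))


backup = package_for_backup({"b.txt": b"hello", "a.txt": b"hi"})
print(read_manifest(backup))
```

Placing the manifest in the archive itself means a later restore can list what a secondary copy contains before reading any payload.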

“Media Agents”

“As indicated above with respect to FIG. 1A, off-loading certain responsibilities from the client computing devices 102 to intermediate components such as the media agent(s) 144 can provide a number of benefits, including faster secondary copy operation performance and improved scalability. In one example, described further below, the media agent 144 can act as a local cache of copied data and/or metadata that it has stored to the secondary storage device(s) 108, providing improved restore capabilities.

A media agent 144 is a software module that coordinates and facilitates the transmission of data between client computing devices 102 and secondary storage devices 108. Whereas the storage manager 140 controls the operation of the information management system 100, the media agent 144 provides access to the secondary storage devices 108. For instance, other components in the system interact with the media agents 144 to read, write, modify, or delete data stored on the secondary storage devices 108. Media agents 144 can also generate and store information about the characteristics of the stored data (i.e., metadata), as well as additional information that provides insight into the contents of the secondary storage devices 108.
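The role described above, a single access point that also keeps its own record of what has been stored, can be sketched as follows. The `MediaAgent` class, its method names, and the use of a dictionary as a stand-in for a secondary storage device are all hypothetical; the sketch only assumes that metadata is recorded on every write so that the contents can later be summarized without touching the device.

```python
class MediaAgent:
    """Hypothetical media agent that fronts a secondary storage device."""

    def __init__(self, device: dict):
        self._device = device  # stand-in for a secondary storage device
        self._index = {}       # locally kept metadata about stored objects

    def write(self, key: str, data: bytes) -> None:
        """Store (or overwrite) an object and record its metadata."""
        self._device[key] = data
        self._index[key] = {"size": len(data)}

    def read(self, key: str) -> bytes:
        return self._device[key]

    def delete(self, key: str) -> None:
        del self._device[key]
        del self._index[key]

    def summary(self) -> dict:
        """Describe stored contents from the local index, without reading the device."""
        return dict(self._index)


agent = MediaAgent(device={})
agent.write("copy-001", b"payload bytes")
agent.write("copy-002", b"more")
agent.delete("copy-002")
print(agent.summary())
```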

Media agents 144 can comprise separate nodes in the information management system 100 (e.g., nodes that are separate from the client computing devices 102, the storage manager 140, and/or the secondary storage devices 108). In general, a node within the information management system 100 can be a logically and/or physically separate component, and in some cases is an individually addressable component. Each media agent 144 can operate on its own dedicated secondary storage computing device 106, or multiple media agents 144 can operate on the same secondary storage computing device 106.

“A media agent 144 (and its corresponding media agent database 152) may be considered to be "associated with" a particular secondary storage device 108 if that media agent 144 is capable of one or more of the following: retrieving data from the particular secondary storage device 108, coordinating the retrieval of data from the particular secondary storage device 108, and modifying and/or deleting data retrieved from the particular secondary storage device 108.

“While media agent(s) 144 are generally associated with one or more secondary storage devices 108, in some embodiments one or more of the media agents 144 are physically separate from the secondary storage devices 108. For instance, the media agents 144 may operate on secondary storage computing devices 106 that have different housings or packages than the secondary storage devices 108. In one example, a media agent 144 operates on a first server computer and communicates with secondary storage device(s) 108 housed in a separate, rack-mounted RAID-based system.

“Where the information management system 100 includes multiple media agents 144 (see, e.g., FIG. 1D), a first media agent 144 may provide failover functionality for a second, failed media agent 144. In addition, media agents 144 can be dynamically selected for storage operations to provide load balancing. Failover and load balancing are described in greater detail below.
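One simple way to picture failover and load balancing together is a selection routine that skips unavailable media agents and prefers the least busy one. This is a sketch under stated assumptions, not the patent's selection logic: the `MediaAgentStatus` record, the `active_jobs` load measure, and the least-loaded rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MediaAgentStatus:
    """Hypothetical health/load record for one media agent."""
    name: str
    online: bool
    active_jobs: int


def select_media_agent(agents):
    """Pick the online agent with the fewest active jobs; raise if none is online."""
    # Failover: agents that are offline are never candidates.
    candidates = [agent for agent in agents if agent.online]
    if not candidates:
        raise RuntimeError("no media agent available")
    # Load balancing: prefer the candidate with the least work in progress.
    return min(candidates, key=lambda agent: agent.active_jobs)


pool = [
    MediaAgentStatus("ma-1", online=False, active_jobs=0),  # failed agent
    MediaAgentStatus("ma-2", online=True, active_jobs=5),
    MediaAgentStatus("ma-3", online=True, active_jobs=2),
]
print(select_media_agent(pool).name)
```

Because the failed agent is filtered out before the load comparison, its idle job count never makes it look attractive.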

“In operation, a media agent 144 associated with a particular secondary storage device 108 may instruct that secondary storage device 108 to perform an information management operation. For instance, a media agent 144 may instruct a tape library to use a robotic arm or other retrieval device to load or eject certain storage media, and then to archive, migrate, or retrieve data to or from that media, e.g., for the purpose of restoring the data to a client computing device 102. As another example, a secondary storage device 108 may include an array of hard disk drives or solid state drives organized in a RAID configuration, and the media agent 144 may forward a logical unit number (LUN) and other relevant information to the array, which uses the received information to execute the storage operation. The media agent 144 may communicate with a secondary storage device 108 via a suitable communications link, such as a SCSI or Fiber Channel link.

“As shown in the figure, each media agent 144 may maintain an associated media agent database 152. The media agent database 152 may be stored on a disk or other storage device (not illustrated) that is local to the secondary storage computing device 106 on which the media agent 144 operates. In other cases, the media agent database 152 is stored remotely from the secondary storage computing device 106.
