Microsoft – Amit Mitkar, Andrei Erofeev, Commvault Systems Inc

Abstract for “Hybrid drive caching with a backup system and SSD deletion management”

Systems and methods may implement intelligent caching algorithms to reduce wear on an SSD and/or improve caching performance. These algorithms can increase storage utilization and I/O efficiency while taking the SSD’s write-wear limits into account. The systems and methods can cache data in the SSD but avoid writing to it too often, which extends the SSD’s lifespan. For example, they can write data to the SSD only after the data has been read multiple times from the hard drive or memory, so that data read only once is not written to the SSD. They can also write a large amount of data to the SSD at one time instead of one unit at a time, and may write to the SSD in a circular fashion.

Background for “Hybrid drive caching with a backup system and SSD deletion management”

Businesses worldwide recognize the commercial value of their information and seek cost-effective, reliable ways to protect it while minimizing the impact on productivity. Protecting information is often part of a routine process performed within an organization. As part of a daily, weekly, or monthly maintenance schedule, a company may back up critical computing systems such as web servers, file servers, and databases. A company may also protect the computing systems used by each of its employees, such as those in an accounting, marketing, or engineering department.

Companies continue to look for innovative ways to manage data growth and protect data, given the ever-growing volume of data under their control. Companies often use migration techniques to move data to cheaper storage and data reduction techniques to reduce redundant data, prune lower priority data, and so forth. Companies increasingly see their stored data as an asset. Customers are increasingly looking for ways to not only manage and protect their data but also make use of it. Solutions that provide data analysis capabilities, information management and improved data presentation and accessibility features are increasingly in demand.

Data storage devices such as optical disk drives and hybrid drives store data that can later be retrieved. These devices may use data caches, which can include high-speed semiconductor memory chips, so that the devices can quickly manage data and commands received from a host. Without caching, every read or write command received from a host computer would require access to the mass storage medium (e.g., a magnetic disk). Such access can incur significant delays because of the mechanical positioning of the head relative to the magnetic disk. Caching allows the storage device to buffer data that the host is likely to access, so that the data can be made available more quickly when it is actually needed.

Certain aspects, advantages, and novel features of the inventions are described herein. Not all of these advantages are necessarily achieved by every embodiment of the inventions. The inventions described herein can be embodied or carried out in a manner that achieves one advantage or group of advantages without necessarily achieving other advantages taught or suggested herein.

According to certain aspects, a method of caching in a storage system comprising a hard drive and a solid state drive (SSD) is described. The method can include receiving a first request to read a first page from a hybrid drive that includes a hard drive and an SSD, where the SSD is used as a cache for the hard drive and has a faster read speed than the hard drive. The method can include determining whether the first page is in the SSD. If it is not, the first page is read from the hard drive but is not cached in the SSD. This avoids writing the page to the SSD if the page is never requested again, which decreases wear on the SSD. The method can then include receiving a subsequent request to read the first page from the hybrid drive and, in response, reading the first page from the hybrid drive and marking the first page in memory as ready to cache in the SSD. To prevent unnecessary writes, the method waits for other pages to become ready to cache. After a predetermined number of pages are marked as ready to cache, the method may include writing the first page and the other ready pages to the SSD together, which writes the pages efficiently while reducing wear.
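The admission policy described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the class name `AdmissionCache`, the dictionaries standing in for the hard drive and SSD, and the `batch_size` parameter are all assumptions made for illustration. The sketch skips caching on a first read, marks a page as ready to cache on a second read, and writes ready pages to the simulated SSD only in batches.

```python
class AdmissionCache:
    """Toy model: cache a page only after a second read, and write in batches."""

    def __init__(self, hard_drive, batch_size=3):
        self.hard_drive = hard_drive   # dict: page id -> bytes (stand-in for the hard drive)
        self.ssd = {}                  # dict: page id -> bytes (stand-in for the SSD cache)
        self.seen_once = set()         # pages that have been read once and not cached
        self.ready = {}                # pages marked in memory as ready to cache
        self.batch_size = batch_size   # hypothetical "predetermined number" of pages
        self.ssd_write_batches = 0     # how many batched writes reached the SSD

    def read(self, page_id):
        if page_id in self.ssd:        # cache hit on the SSD
            return self.ssd[page_id]
        if page_id in self.ready:      # already staged in memory
            return self.ready[page_id]
        data = self.hard_drive[page_id]
        if page_id in self.seen_once:
            # Second read: mark as ready to cache, but do not write to the SSD yet.
            self.seen_once.discard(page_id)
            self.ready[page_id] = data
            if len(self.ready) >= self.batch_size:
                self._flush()
        else:
            # First read: remember the page but leave the SSD untouched.
            self.seen_once.add(page_id)
        return data

    def _flush(self):
        # Write every ready page to the SSD in a single batch.
        self.ssd.update(self.ready)
        self.ready.clear()
        self.ssd_write_batches += 1
```

In this sketch a page that is read only once, as in a one-pass scan, never reaches the SSD, and several pages share each SSD write.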

The method may also include caching the page first in a memory cache before caching it in the SSD. A hash table can be maintained in the memory. The hash table can be used to map the first page to the storage location of the SSD or memory. In some embodiments, the first and subsequent read requests relate to backup operations.

According to additional aspects, a system for caching in a storage system comprising a hard drive and a solid state drive is disclosed. A storage driver is implemented in a hardware processor and comprises executable instructions configured to receive a first request to read a first data element from a storage system comprising a hard drive and an SSD, where the SSD acts as a cache for the hard drive. In response to the first request, the storage driver can read the first data element from the hard drive. The storage driver can further be configured to receive a second request to read the first data element from the storage system and, in response to the second request, to read the first data element from the storage system. The storage driver can then determine whether a predetermined quantity of data elements, in addition to the first data element, has been indicated as ready to cache. If so, it can write those data elements, including the first data element, to the SSD.

In certain embodiments, the system may also include a hybrid drive. According to certain aspects, the storage driver can also be configured to maintain a data structure in memory. The data structure can map data elements that are ready to cache to a storage location in either the SSD or the memory. A hash table can be used to index the data structure, at least by data element identifier. In response to a write to the first data element, the storage driver can also be configured to delete the first data element from the SSD. To reduce wear on the SSD, the storage driver can also be configured to write data elements that are ready to cache to the SSD in a circular fashion.

The storage driver may be further configured to store the first data element in memory before writing the data element to the SSD. The storage driver can also be configured to delete the first data element from memory once the data element has been written to the SSD.
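A minimal Python sketch of such an index follows. It is an illustration under stated assumptions rather than the driver described in the patent: the class `CacheIndex`, its method names, and the slot-numbered dictionary standing in for the SSD are hypothetical. A Python `dict` serves as the hash table keyed by data element identifier; an element is staged in memory, removed from memory when it is written to the SSD, and deleted from the cache when the element is written to.

```python
class CacheIndex:
    """Hash-table index over data elements staged in memory or cached in the SSD."""

    def __init__(self):
        self.index = {}      # element id -> ("memory", None) or ("ssd", slot number)
        self.memory = {}     # element id -> data staged in memory
        self.ssd = {}        # slot number -> data (stand-in for SSD blocks)
        self.next_slot = 0

    def stage(self, element_id, data):
        # Hold the element in memory before it is written to the SSD.
        self.memory[element_id] = data
        self.index[element_id] = ("memory", None)

    def promote(self, element_id):
        # Write the element to the SSD and delete the in-memory copy.
        data = self.memory.pop(element_id)
        slot = self.next_slot
        self.next_slot += 1
        self.ssd[slot] = data
        self.index[element_id] = ("ssd", slot)

    def lookup(self, element_id):
        where = self.index.get(element_id)
        if where is None:
            return None
        tier, slot = where
        return self.memory[element_id] if tier == "memory" else self.ssd[slot]

    def on_write(self, element_id):
        # A write makes the cached copy stale, so drop it from wherever it lives.
        where = self.index.pop(element_id, None)
        if where is None:
            return
        tier, slot = where
        if tier == "memory":
            del self.memory[element_id]
        else:
            del self.ssd[slot]
```

Deleting on write, rather than updating the cached copy in place, keeps a write to the hard drive from triggering an extra write to the SSD.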

Depending on the embodiment, the predetermined quantity of data elements can correspond to one or both of the size of the data elements and the number of the data elements.

The storage driver may include an interface to one or both of a file system and a database, from which the first and second read requests are received.

According to further aspects, a system for caching is described. A hardware processor can be configured to: read a first data element from a hard drive; store in memory a first indication that the first data element is to be cached in the SSD, without yet caching it; read a second data element from the hard drive; store in memory a second indication that the second data element is to be cached in the SSD; and, after storing the second indication, cache the first and second data elements in the SSD.

The hardware processor can also be configured to store the first and second indications in a buffer within memory. The first and second indications can contain pointers to the first and second data elements. In certain cases, the hardware processor may be configured to cache the first and second data elements in the SSD in response to the buffer reaching capacity. In some cases, the hardware processor is configured to cache the first and second data elements in a memory cache before caching them in the SSD. The hardware processor can further be configured to cache the first and second data elements in the SSD when the memory cache reaches capacity, even if the buffer has not reached capacity.
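The two flush triggers described above can be sketched in Python as follows. This is an illustrative model only: the class `PendingCacheBuffer`, the byte-based memory limit, and the dictionary standing in for the SSD are assumptions, not details taken from the patent. The `pending` list plays the role of the buffer of indications, each entry being an identifier that points at an element held in the memory cache.

```python
class PendingCacheBuffer:
    """Buffer of 'to be cached' indications with two flush triggers."""

    def __init__(self, buffer_capacity, memory_cache_bytes):
        self.buffer_capacity = buffer_capacity        # max number of indications (assumed)
        self.memory_cache_bytes = memory_cache_bytes  # max bytes held in memory (assumed)
        self.memory_cache = {}   # element id -> data held in memory
        self.pending = []        # indications: ids pointing into memory_cache
        self.ssd = {}            # element id -> data (stand-in for the SSD)

    def mark_ready(self, element_id, data):
        if element_id in self.memory_cache or element_id in self.ssd:
            return  # already staged or already cached
        self.memory_cache[element_id] = data
        self.pending.append(element_id)
        buffer_full = len(self.pending) >= self.buffer_capacity
        used = sum(len(d) for d in self.memory_cache.values())
        memory_full = used >= self.memory_cache_bytes
        # Flush when either limit is hit, even if the other has not been reached.
        if buffer_full or memory_full:
            self.flush()

    def flush(self):
        # Move every pending element from memory to the SSD in one pass.
        for element_id in self.pending:
            self.ssd[element_id] = self.memory_cache.pop(element_id)
        self.pending.clear()
```

With a large buffer and a small memory cache, the memory limit triggers the flush first; with the sizes reversed, the buffer limit does.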

In certain embodiments, the hardware processor can be configured to delete the first data element from the SSD upon receiving a write request to that data element.

A hybrid drive may include both a hard drive and a solid state drive (SSD), with the SSD used as a cache for the hard drive. The SSD can store data from certain input/output (I/O) operations to the hard drive so that frequently accessed data can be read faster from the SSD. Because flash memory is generally more expensive than hard drive technology, the SSD is used as a cache rather than as the main store. A hybrid drive thus combines a fast SSD cache with a hard drive of large capacity, which can offer a compromise between price and storage capacity.

SSDs have a limited lifespan: only a limited number of writes can be made to each SSD cell. Existing logic layers can use wear-leveling algorithms to wear an SSD evenly. One example of a wear-leveling algorithm writes to different parts of the SSD at different times. This allows the SSD to be written many times without repeatedly overwriting the same cells, which would prematurely reduce its lifespan. However, wear-leveling algorithms alone might not sufficiently reduce wear when an SSD is used as a cache. If every data element read from the hard drive were stored in the SSD, regardless of whether it will be used again soon, the SSD’s lifespan would be significantly reduced. To avoid unnecessary writes, data that is read from the hard drive in a scanning operation should not be stored in the SSD. Backup operations can be especially costly in this respect, because many writes to the SSD may be performed during a backup.

To address these and other issues, aspects of the systems and methods described herein may implement intelligent caching algorithms that reduce wear on SSDs and/or improve caching performance. These algorithms can increase storage utilization and I/O efficiency while taking the SSD’s write-wear limitations into account. The systems and methods can store data in the SSD but avoid writing to it too often, which extends its lifespan. They can write data to the SSD only after the data has been read multiple times from the hard drive or memory, so that data read only once is not written. They can also write a large amount of data to the SSD at one time, instead of one unit at a time. The systems and methods may also write to the SSD in a circular fashion, overwriting older or less recently used data, in order to avoid, or try to avoid, overwriting the same area of the SSD multiple times in succession.
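The circular write pattern can be sketched in Python as a ring of slots with a moving cursor. This is a simplified illustration, not the patented design: the class `CircularSsdCache`, the fixed slot count, and the `write_counts` list used to show wear are assumptions introduced here. Each batch is written at the cursor, which wraps around and overwrites the oldest entries, so no slot is rewritten until every other slot has been written.

```python
class CircularSsdCache:
    """Ring of SSD slots written in order, overwriting the oldest data on wrap."""

    def __init__(self, num_slots):
        self.slots = [None] * num_slots     # each entry is (element id, data) or None
        self.where = {}                     # element id -> slot index
        self.cursor = 0                     # next slot to write
        self.write_counts = [0] * num_slots # per-slot write tally, to show even wear

    def write_batch(self, items):
        for element_id, data in items:
            # If this element is already cached elsewhere, retire the old copy.
            stale = self.where.pop(element_id, None)
            if stale is not None:
                self.slots[stale] = None
            # Evict whatever currently occupies the slot under the cursor.
            evicted = self.slots[self.cursor]
            if evicted is not None:
                del self.where[evicted[0]]
            self.slots[self.cursor] = (element_id, data)
            self.where[element_id] = self.cursor
            self.write_counts[self.cursor] += 1
            self.cursor = (self.cursor + 1) % len(self.slots)

    def get(self, element_id):
        slot = self.where.get(element_id)
        return None if slot is None else self.slots[slot][1]
```

After one full pass and one further write, the write counts differ by at most one across slots, which is the even-wear property the circular scheme aims for.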

“The systems and methods described herein for intelligent caching may also be used in information management systems, such as those shown in FIGS. 1A-1H.”

With the growing importance of protecting and leveraging data, organizations simply cannot afford to lose critical data. Runaway data growth and other modern realities make protecting and managing data increasingly difficult. There is therefore a need for efficient, powerful, and user-friendly solutions for protecting and managing data.

Depending on the size of the organization, there may be many data production sources under the control of hundreds or even thousands of employees. In the past, individual employees were sometimes responsible for managing and protecting their own data. In other cases, a patchwork of hardware and software point solutions was used, often provided by different vendors with little or no interoperability.

Certain embodiments described herein provide systems and methods capable of addressing these and other shortcomings of prior approaches by implementing unified information management across the organization. FIG. 1A illustrates one such information management system 100, which generally includes combinations of hardware and software used to protect and manage data and metadata generated by the various computing devices in information management system 100. The organization that uses the information management system 100 may be a company or other business entity, an educational institution, a household, a governmental agency, or the like.

Generally, the systems described herein may be compatible with and/or provide some of the functionality of the systems described in one or more U.S. patents and patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated in its entirety by reference herein.

The information management system 100 can include a wide range of computing devices. For example, as discussed in more detail below, the information management system 100 can include one or more client computing devices 102 and secondary storage computing devices 106.

Computing devices can include, without limitation, one or more of the following: workstations, personal computers, desktop computers, and other generally fixed computing systems such as mainframe computers and minicomputers. Other computing devices include mobile or portable computing devices such as laptops, tablet computers, personal data assistants, and mobile phones (such as smartphones), as well as embedded computers, set top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

In certain cases, a computing device includes virtualized and/or cloud computing resources. For instance, a third-party cloud service provider may provide one or more virtual machines to an organization. In some cases, computing devices include one or more virtual machines running on a physical host computing device (or “host machine”). As one example, the organization may use one virtual machine as a database server and another virtual machine as a mail server, with both virtual machines running on the same host machine.

A virtual machine includes an operating system and associated virtual resources, and is hosted on a physical host computer (or host machine). A hypervisor, which is typically software and is also known as a virtual machine monitor or virtual machine manager (“VMM”), sits between the virtual machine and the hardware of the physical host machine. ESX Server, by VMware, Inc. of Palo Alto, Calif., is one example of a hypervisor used for virtualization. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation, and Sun xVM by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be firmware or hardware, or a combination of software, firmware, and hardware.

The hypervisor provides each virtual operating system with virtual resources, such as a virtual processor, virtual memory, a virtual network device, and a virtual disk. Each virtual machine can have one or more virtual disks. The hypervisor typically stores the data of virtual disks in files on the file system of the physical host machine. These files are called virtual hard disk image files (in the case of Microsoft virtual servers) or virtual machine disk files (in the case of VMware virtual servers). For example, VMware’s ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine disk files. A virtual machine reads data from and writes data to its virtual disk in much the same way that a physical machine reads from and writes to a physical disk.

Examples of information management techniques in cloud computing environments are described in U.S. Pat. No. 8,285,681, which is incorporated by reference herein. Examples of techniques for managing information in virtualized computing environments are described in U.S. Pat. No. 8,307,177, also incorporated by reference herein.

The information management system 100 can also include a variety of storage devices, such as primary storage devices 104 and secondary storage devices 108. Storage devices can generally be of any suitable type, including, without limitation, hard-disk arrays, semiconductor memory (e.g., solid state storage), network-attached storage (NAS) devices, tape libraries or other magnetic non-tape storage devices, optical media storage devices, DNA/RNA-based memory technology, and combinations thereof. In some instances, storage devices form part of a distributed storage system. Some storage devices are provided in a cloud, such as a private cloud or one managed by a third-party vendor. In some cases, a storage device is a disk array or a portion thereof.

The illustrated information management system 100 includes one or more client computing devices 102 that execute at least one application 110, and one or more primary storage devices 104 that store primary data 112. In some cases, the client computing device(s) 102 and the primary storage devices 104 may be referred to as a primary storage subsystem 117. A computing device in the information management system 100 that has a data agent 142 installed and running on it is generally referred to as a client computing device 102 (or, in the context of a component of the information management system 100, simply as a “client”).

The meaning of the term “information management system” depends on the context. It can refer to generally all of the illustrated hardware and software components. In other cases, it may refer to only a subset of the illustrated components.

For instance, in some cases the information management system 100 refers to a set of components that protect, move, and/or manage data and metadata generated by the client computing devices 102. In that usage, the information management system 100 does not necessarily include the components that generate and/or store the primary data 112, such as the client computing devices 102, the applications 110, and the primary storage devices 104. As an example, “information management system” may sometimes refer to one or more of the following components and their corresponding data structures: storage managers, data agents, and media agents. These components are described in further detail below.

“Client Computing Devices.”

An organization typically has a variety of sources that produce data to be protected and managed. As just one example, in a corporate environment such data sources can include employee workstations and company servers such as mail servers, web servers, database servers, transaction servers, and the like. In the information management system 100, the data sources include the one or more client computing devices 102.

The client computing devices 102 can include any of the types of computing devices described above. In some cases, a client computing device 102 is associated with one or more users and/or corresponding user accounts of employees or other individuals.

The information management system 100 generally addresses and handles the data management and protection needs for the data generated by the client computing devices 102. However, the use of the term “client” does not imply that the client computing devices 102 cannot be “servers” in other respects. For instance, a particular client computing device 102 may act as a server with respect to other devices, such as other client computing devices 102. As just a few examples, the client computing devices 102 can include mail servers, file servers, database servers, and web servers.

Each client computing device 102 may have one or more applications 110 (e.g., software applications) executing on it. These applications generate and manipulate data that is to be protected from loss and managed. The applications 110 generally support the operations of an organization (or multiple affiliated organizations). They can include, without limitation, file server applications, mail server applications (e.g., Microsoft Exchange Server), mail client applications (e.g., Microsoft Exchange Client), database applications (e.g., SQL, Oracle, SAP, Lotus Notes Database), word processing applications (e.g., Microsoft Word), spreadsheet applications, financial applications, presentation applications, graphics applications, web applications, mobile applications, and entertainment applications.

The client computing devices 102 can have at least one operating system (e.g., Microsoft Windows, Mac OS X, iOS, IBM z/OS, Linux, or other Unix-based operating systems) installed on them. The operating system may support or host one or more file systems and other applications 110.

The client computing devices 102 and other components of the information management system 100 can be connected to one another via one or more communication pathways 114. For example, a first communication pathway 114 may connect a client computing device 102 and a secondary storage computing device 106; a second communication pathway 114 may connect the storage manager 140 and a client computing device 102; and a third communication pathway 114 may connect the storage manager 140 and a secondary storage computing device 106 (see, e.g., FIG. 1A and FIG. 1C). In some cases, the communication pathways 114 may also include application programming interfaces (APIs), such as cloud service provider APIs and virtual machine management APIs. The infrastructure underlying the communication pathways 114 can be wired and/or wireless, analog and/or digital, or any combination thereof. The underlying facilities may be private, public, or provided by a third party.

“Primary Data, Exemplary Primary Storage Devices”

According to some embodiments, primary data 112 is production data or other “live” data generated by the operating system and/or applications 110 running on a client computing device 102. Primary data 112 is generally stored on the primary storage device(s) 104 and is organized via a file system supported by the client computing device 102. The client computing device(s) 102 and corresponding applications 110 may create, access, modify, write, delete, and otherwise use primary data 112. In some cases, some or all of the primary data 112 can be stored in cloud storage resources.

Primary data 112 is generally in the native format of the source application 110. According to certain aspects, primary data 112 is an initial or first stored copy of data generated by the source application 110 (e.g., created before any other copies, or before at least one other copy). In some cases, primary data 112 is created substantially directly from the data generated by the corresponding source application 110.

The primary storage devices 104 that store the primary data 112 may be relatively fast and/or expensive (e.g., disk drives, hard-disk arrays, solid state memory, etc.). In addition, primary data 112 may be highly changeable and/or intended for relatively short-term retention (e.g., hours, days, or weeks).

According to some embodiments, the client computing device 102 can access primary data 112 from the primary storage device 104 by making conventional file system calls via the operating system. Primary data 112 may include structured data, unstructured data, and/or semi-structured data. Some specific examples are described below with respect to FIG. 1B.

It can be useful in performing certain tasks to organize the primary data 112 into units of different granularities. In general, primary data 112 can include files, directories, file system volumes, data blocks, extents, or any other hierarchies or organizations of data objects. As used herein, a “data object” can refer to either (1) any file that is currently addressable by a file system or that was previously addressable by the file system (e.g., an archive file) or (2) a subset of such a file (e.g., a data block).

As will be described in further detail, it can also be useful in performing certain functions of the information management system 100 to access and modify metadata within the primary data 112. Metadata generally includes information about data objects or characteristics associated with them. For simplicity, any reference herein to primary data 112 generally also includes its associated metadata, but a reference to the metadata does not include the primary data.

Metadata can include, without limitation, one or more of the following: the data owner (e.g., the client or user that generated the data object), the last modified time (e.g., the time of the most recent modification of the data object), the file size (e.g., a number of bytes of data), information about the content (e.g., an indication of the existence of a particular search term), user-supplied tags, to/from information for email (e.g., an email sender, recipient, etc.), the creation date, the file type (e.g., format or application type), the last accessed time, the application type (e.g., the type of application that generated the data object), location/network (e.g., a current, past, or future location of the data object and network pathways to/from the data object), partition layouts, file location within a file folder directory structure, user permissions, owners, groups, access control lists (ACLs), and system metadata.

In addition to metadata generated by or related to file systems and operating systems, some of the applications 110 and/or other components of the information management system 100 maintain indices of metadata for data objects, e.g., metadata associated with individual email messages. Thus, each data object may be associated with corresponding metadata. The use of metadata to perform classification and other functions is described in greater detail below.

Each of the client computing devices 102 is generally associated with and/or in communication with one or more of the primary storage devices 104 storing the corresponding primary data 112. A client computing device 102 may be considered to be “associated with” a primary storage device 104 if it is capable of one or more of the following: routing and/or storing data (e.g., primary data 112) to the particular primary storage device 104; coordinating the routing and/or storing of data to the particular primary storage device 104; retrieving data from the particular primary storage device 104; coordinating the retrieval of data from the particular primary storage device 104; and modifying and/or deleting data retrieved from the particular primary storage device 104.

The primary storage devices 104 can include any of the different types of storage devices described above, or some other kind of suitable storage device. The primary storage devices 104 may be relatively fast and/or relatively expensive in comparison to the secondary storage devices 108. For example, the information management system 100 may access data and metadata stored on the primary storage devices 104 relatively frequently, whereas data stored on the secondary storage devices 108 is accessed relatively less frequently.

A primary storage device 104 can be dedicated or shared. In some cases, each primary storage device 104 is dedicated to an associated client computing device 102. For instance, in one embodiment a primary storage device 104 is a local disk drive of a corresponding client computing device 102. In other cases, one or more primary storage devices 104 are shared by multiple client computing devices 102, e.g., via a network such as in a cloud storage implementation. As one example, a primary storage device 104 can be a disk array shared by a group of client computing devices 102, such as one of the following types of disk arrays: EMC Clariion, EMC Symmetrix, or EMC Celerra.

The information management system 100 may also include hosted services (not shown), which may be hosted in some cases by an entity other than the organization that uses the information management system 100. For instance, the hosted services may be provided by various online service providers to the organization. Such service providers can offer services including social networking services, hosted email services, and hosted productivity applications. Hosted services may include software-as-a-service (SaaS), platform-as-a-service (PaaS), application service providers (ASPs), cloud services, or other mechanisms for delivering functionality via a network. As it provides services to users, each hosted service may generate additional data and metadata that comes under management of the information management system 100, e.g., as primary data 112. In some cases, the hosted services may be accessed using one of the applications 110. As an example, a hosted mail service may be accessed via a browser running on a client computing device 102. The hosted services may be implemented in a variety of computing environments. In some cases, they are implemented in an environment similar to the information management system 100, where various physical and logical components are distributed over a network.

“Secondary copies and Exemplary Secondary Storage Devices”

The primary data 112 stored on the primary storage devices 104 may be compromised in some cases. For example, an employee may deliberately or accidentally delete or overwrite primary data 112 during the normal course of work. The primary storage devices 104 may also be lost, damaged, or corrupted. For recovery and/or regulatory compliance purposes, it is therefore useful to create copies of the primary data 112. Accordingly, the information management system 100 includes one or more secondary storage computing devices 106 and one or more secondary storage devices 108 that are used to create and store secondary copies 116 of the primary data 112 and associated metadata. The secondary storage computing devices 106 and the secondary storage devices 108 may sometimes be referred to as a secondary storage subsystem 118.

Creating secondary copies 116 can help with search and analysis efforts and meet other information management goals. For example, it allows data and/or metadata to be restored if an original version (e.g., of primary data 112) is lost due to deletion, corruption, or natural disaster, and it allows point-in-time recovery.

The client computing devices 102 access or receive primary data 112 and communicate the data, e.g., over one or more communication pathways 114, for storage in the secondary storage device(s) 108.

A secondary copy 116 can comprise a separate stored copy of application data that is derived from one or more earlier-created stored copies (e.g., derived from primary data 112 or another secondary copy 116). Secondary copies 116 can include point-in-time data and may be intended for relatively long-term retention (e.g., weeks, months, or years) before some or all of the data is moved to other storage or discarded.

In some cases, a secondary copy 116 is a copy of application data created and stored subsequent to at least one other stored instance (e.g., subsequent to corresponding primary data 112 or to another secondary copy 116), in a different storage device than at least one previous stored copy, and/or remotely from at least one previous stored copy. In some other cases, secondary copies can be stored in the same storage device as primary data 112 and/or other previously stored copies. For example, in one embodiment a disk array capable of performing hardware snapshots stores primary data 112 and creates and stores hardware snapshots of the primary data 112 as secondary copies 116. Secondary copies 116 may be stored in relatively low-cost storage, such as magnetic tape. A secondary copy 116 may be stored in a backup or archive format, or in some other format different from the native source application format or other primary data format.

In some cases, secondary copies 116 are indexed so that users can browse and restore them at a later time. After a secondary copy 116 representing certain primary data 112 is created, a pointer or other location indicator (e.g., a stub) may be placed in the primary data 112, or otherwise associated with it, to indicate the current location of the secondary copy 116 on the secondary storage device(s) 108.

Because an instance of a data object or metadata in primary data 112 may change over time as it is modified by an application 110 (or by the operating system), the information management system 100 may create and manage multiple secondary copies 116 of a particular data object or metadata, each representing the state of the data object at a particular point in time. Moreover, the information management system 100 may continue to manage point-in-time representations of a data object even after the instance in primary data 112 has been deleted from the primary storage device 104 and the file system.

For virtualized computing devices, the operating system and other applications 110 may execute within or under the management of virtualization software (e.g., a VMM), and the primary storage device(s) 104 may contain a virtual disk created on a physical storage device. The information management system 100 can create secondary copies 116 of the files and other data objects within a virtual disk, and/or secondary copies 116 of the entire virtual disk file itself (e.g., of an entire .vmdk file).

Secondary copies 116 can be distinguished from the corresponding primary data 112 in a variety of ways, some of which are discussed here. As mentioned, secondary copies 116 may be stored in a different format than primary data 112 (e.g., a backup, archive, or other non-native format). For this or other reasons, secondary copies 116 may not be directly usable by the applications 110 of the client computing device 102.

In certain embodiments, secondary copies 116 are also stored on a secondary storage device 108 that is not accessible to the applications 110 running on the client computing devices 102 (and/or hosted services). Some secondary copies 116 may be “offline copies,” in that they are not readily available (e.g., not mounted to tape or disk). Offline copies can include copies of data that the information management system 100 can access without human intervention.

“The Use Of Intermediate Devices To Create Secondary Copies”

Creating secondary copies can be challenging. Hundreds or more client computing devices 102 may be generating large amounts of primary data 112 that must be protected. Creating secondary copies 116 can also involve significant overhead. In addition, secondary storage devices 108 may be special-purpose components, so interacting with them can require specialized intelligence.

In certain cases, the client computing devices 102 interact directly with the secondary storage devices 108 to create the secondary copies 116. This approach, however, can reduce the ability of the client computing devices 102 to serve the applications 110 and produce primary data 112. In addition, the client computing devices 102 might not be optimized for interaction with the secondary storage devices 108.

In some embodiments, the information management system 100 may include one or more software and/or hardware components that act as intermediaries between the client computing devices 102 and the secondary storage devices 108. Besides off-loading certain responsibilities from the client computing devices 102, these intermediate components can provide other benefits. For example, as shown in FIG. 1D, distributing some of the work required to create secondary copies 116 can increase scalability.

The intermediate components may include one or more secondary storage computing devices 106, as shown in FIG. 1A, and/or one or more media agents, which can be software modules that operate on the corresponding secondary storage computing devices 106 (or other suitable computing devices). Media agents are discussed below (e.g., in relation to FIGS. 1C-1E).

The secondary storage computing device(s) 106 can include any of the computing devices described above. In some cases, the secondary storage computing device(s) 106 include specialized hardware and/or software components for interfacing with the secondary storage devices 108.

To create a secondary copy 116 that involves copying data from the primary storage subsystem 117 to the secondary storage subsystem 118, the client computing device 102 in some embodiments communicates the primary data 112 (or a processed version of it) to the designated secondary storage computing device 106 via the communication pathway 114. The secondary storage computing device 106 in turn transmits the received data (or a processed version of it) to the secondary storage device 108. In certain cases, the communication pathway 114 between the client computing device 102 and the secondary storage computing device 106 is part of a LAN, WAN, or SAN. In other cases, at least some client computing devices 102 communicate directly with the secondary storage devices 108 (e.g., via Fibre Channel or SCSI connections). In still other cases, one or more secondary copies 116 are created from existing secondary copies, as in an auxiliary copy operation.

“Exemplary Secondary Data and Exemplary Primary Data”

In the figure, the primary storage device(s) 104 contain primary data objects, including word processing documents 119A-B, spreadsheets 120, presentation files 122, video files 124, image files 126, email mailboxes 128 (with corresponding emails 129A-C), HTML/XML files 130, and databases 132 (with corresponding tables or other data structures 133A-133C).

Some or all primary data objects are associated with corresponding metadata (e.g., “Meta1-11”), which may be file system metadata and/or application-specific metadata. Secondary copy data objects 134A-C are stored on the secondary storage device(s) 108. These secondary copy data objects may contain copies of, or otherwise represent, corresponding primary data objects and metadata.

As shown, the secondary copy data objects 134A-C can each represent more than one primary data object. For example, secondary copy data object 134A represents three separate primary data objects 133C, 122, and 129C (represented as 133C′, 122′, and 129C′, respectively, and accompanied by the corresponding metadata Meta11, Meta3, and Meta8, respectively). As the prime mark (′) indicates, a secondary copy data object may store a representation of a primary data object and/or its metadata in a format different from the original. Likewise, secondary copy data object 134B represents primary data objects 120, 133B, and 119A as 120′, 133B′, and 119A′, respectively, accompanied by the corresponding metadata Meta2, Meta10, and Meta1, respectively. Secondary copy data object 134C represents primary data objects 133A, 119B, and 129A as 133A′, 119B′, and 129A′, respectively, accompanied by the corresponding metadata Meta9, Meta5, and Meta6, respectively.

“Exemplary Information Management System Architecture”

The information management system 100 can contain a wide range of hardware and software components, which can be organized in many different ways depending on the embodiment. Specifying the functional responsibilities and role of each component in the information management system 100 involves critical design choices. As discussed below, these design choices can significantly affect performance and the ability of the information management system 100 to adapt to data growth or other changing circumstances.

FIG. 1C illustrates an information management system 100 that includes: a storage manager 140, which is a centralized storage and/or information manager configured to perform certain control functions; one or more data agents 142 executing on the client computing device(s) 102 to process primary data 112; and one or more media agents 144 executing on the one or more secondary storage computing devices 106 to perform tasks involving the secondary storage devices 108. Although distributing functionality across multiple computing devices can have certain advantages, in other contexts it can be more beneficial to consolidate functionality on the same computing device. Accordingly, in various other embodiments, one or more of the components shown in FIG. 1C as being implemented on separate computing devices are implemented on the same computing device. In one configuration, a storage manager 140, one or more data agents 142, and one or more media agents 144 are all implemented on the same computing device. In another embodiment, one or more data agents 142 and one or more media agents 144 are implemented on the same computing device, while the storage manager 140 is implemented on a separate computing device, and so on without limitation.

“Storage Manager”

As noted, the information management system 100 can include a large number of components and a large amount of data under management. Managing the components and data is therefore a significant task, and one that can become more difficult as the number of components and the volume of data grow to meet the organization’s needs. For these and other reasons, according to certain embodiments, responsibility for controlling the information management system 100 is allocated to the storage manager 140. By distributing control functionality in this way, the storage manager 140 can be adapted independently according to changing circumstances. A computing device that hosts the storage manager 140 can also be selected to best suit the functions of the storage manager 140. These and other advantages are described in more detail below with respect to FIG. 1D.

The storage manager 140 may be a software module or other application that, in certain embodiments, operates in conjunction with one or more associated data structures (e.g., a dedicated database such as management database 146). In some embodiments, the storage manager 140 is a computing device that executes computer instructions. The storage manager 140 is generally responsible for initiating, performing, coordinating, and/or controlling storage operations and other information management operations performed by the information management system 100, for example to protect and control the primary data 112, secondary copies 116, and metadata. In general, the storage manager 140 manages the information management system 100, including its constituent components (e.g., data agents and media agents).

As indicated by the dashed arrowed lines 114 in FIG. 1C, the storage manager 140 can communicate with and/or control some or all elements of the information management system 100, such as the data agents 142 and media agents 144. Thus, in certain embodiments, control information originates from the storage manager 140 and status reporting is sent to the storage manager 140 by the various managed components, whereas payload data and payload metadata are generally communicated between the data agents 142 and the media agents 144 (or otherwise between the client computing device(s) 102 and the secondary storage computing device(s) 106), e.g., at the direction and under the management of the storage manager 140. Control information can generally include instructions and parameters for carrying out information management operations, such as instructions to perform a task associated with an operation, timing information that specifies when to initiate the task, and data path information that specifies which components to access or communicate with in order to complete the operation. Payload data, on the other hand, can include the actual data involved in the storage operation, such as content data written to a secondary storage device 108 during a secondary copy operation. Payload metadata can include any of the types of metadata described herein and may be written to a storage device along with the payload content data (e.g., in the form of a header).

In some embodiments, certain information management operations are controlled by other components of the information management system 100 (e.g., the media agent(s) 144 or data agent(s) 142), in addition to or in combination with the storage manager 140.

According to certain embodiments, the storage manager 140 provides one or more of the functions described below.

The storage manager 140 may maintain a database 146 (or “storage manager database 146” or “management database 146”) of management-related data and information management policies 148. The database 146 may include a management index 150 (or “index 150”) or another data structure that stores logical associations between components of the system; user preferences and/or profiles (e.g., preferences regarding encryption, compression, or deduplication; preferences regarding the scheduling, type, or other aspects of operations; and mappings of information management users to specific computing devices or other components); management tasks; media containerization; and other useful data. For example, the storage manager 140 may use the index 150 to track logical associations between media agents 144 and secondary storage devices 108 and/or the movement of data from primary storage devices 104 to secondary storage devices 108. The index 150 may also store data that associates a client computing device 102 with a particular media agent 144 and/or secondary storage device 108, as specified in an information management policy 148 (described below).

Administrators and other individuals may be able to configure and initiate certain information management operations individually. Although this may be acceptable for some recovery operations and other relatively infrequent tasks, it is often not practical for ongoing, organization-wide data protection and management. The information management system 100 can therefore use information management policies 148 to specify and execute information management operations (e.g., on an automated basis). In general, an information management policy 148 can include a data structure or other information source that specifies a set of parameters (e.g., criteria and rules) associated with storage or other information management operations.
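As a rough illustration of a policy expressed as a data structure, the following Python sketch models a policy as a small immutable record plus a scheduling check. The field names (`data_selector`, `media_agent`, `schedule_hours`, and so on) and the `is_due` helper are assumptions made for this example only; the text above does not specify which parameters a policy holds or how they are evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationManagementPolicy:
    """Hypothetical stand-in for an information management policy 148."""
    name: str
    data_selector: str       # which primary data the policy covers (assumed field)
    media_agent: str         # media agent that handles the copy (assumed field)
    secondary_storage: str   # target secondary storage device (assumed field)
    schedule_hours: int      # run a secondary copy every N hours (assumed field)
    retention_days: int      # keep secondary copies this long (assumed field)


def is_due(policy: InformationManagementPolicy, hours_since_last_copy: float) -> bool:
    """Return True when the policy's schedule says another copy should run."""
    return hours_since_last_copy >= policy.schedule_hours


policy = InformationManagementPolicy(
    name="file-servers-daily",
    data_selector="/srv/files",
    media_agent="media-agent-1",
    secondary_storage="tape-library-1",
    schedule_hours=24,
    retention_days=90,
)
```

A jobs agent could, under these assumptions, poll `is_due` for each stored policy and start a secondary copy operation whenever it returns `True`.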

The storage manager database 146 may contain the information management policies 148 and associated data, although the information management policies 148 can be stored in any appropriate location. For example, an information management policy 148 such as a storage policy may be stored as metadata in a media agent database 152 or in a secondary storage device 108 (e.g., as an archive copy) for use in restore operations or other information management operations, depending on the embodiment. Information management policies 148 are described further below.

According to some embodiments, the storage manager database 146 includes a relational database (e.g., an SQL database) that tracks metadata, such as metadata associated with secondary copy operations (e.g., which client computing devices 102 and corresponding data were protected). This and other metadata can also be stored in other locations, such as at the secondary storage computing devices 106 or on the secondary storage devices 108, which allows data recovery without the storage manager 140 in certain cases.

As shown in the figure, the storage manager 140 may include a jobs agent 156, a user interface 158, and a management agent 154, all of which may be implemented as interconnected software modules or application programs.

In some embodiments, the jobs agent 156 initiates, controls, and/or monitors the status of some or all storage operations or other information management operations that are currently being performed or are scheduled to be performed by the information management system 100. For example, the jobs agent 156 may access information management policies 148 to determine when and how to initiate and control secondary copy and other information management operations.

The user interface 158 can include information processing and display software, such as a graphical user interface (“GUI”) or an application program interface (“API”). Through the user interface 158, users can optionally issue instructions to components of the information management system 100 regarding storage and recovery operations. For example, a user might modify a schedule concerning the number of pending secondary copy operations. As another example, a user might use the GUI to view the status of pending storage operations or to monitor certain components of the information management system 100 (e.g., the remaining storage capacity).

An “information management cell” (or “storage operation cell” or “cell”) may generally include a logical and/or physical grouping of hardware and software components associated with performing information management operations on electronic data, typically one storage manager 140, at least one client computing device 102 (with its data agent(s) 142), and at least one media agent 144. For example, the components shown in FIG. 1C may together form an information management cell. Multiple cells can be organized hierarchically. With this configuration, cells may inherit properties from hierarchically superior cells or be controlled by other cells in the hierarchy (automatically or otherwise). In some embodiments, cells may inherit or otherwise be associated with information management policies, preferences, information management metrics, or other properties or characteristics according to their relative position in a hierarchy of cells. Cells can also be organized hierarchically according to geography, architecture, function, or any other factor that is useful in performing information management operations. For example, a first cell may represent a geographic segment of an enterprise, such as a Chicago office, and a second cell may represent a different geographic segment, such as a New York office. Other cells may represent departments within a particular office. Where cells are delineated by function, a first cell may perform one or more first types of information management operations (e.g., one or more first types of secondary or other copies), and a second cell may perform one or more second types of information management operations.
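The inheritance of properties from hierarchically superior cells can be pictured with a short Python sketch. The `Cell` class, its `resolve` method, and the property names (`retention_days`, `encryption`) are hypothetical and chosen only for illustration; the text does not describe how inheritance is actually implemented.

```python
from typing import Optional


class Cell:
    """Hypothetical information management cell in a hierarchy of cells."""

    def __init__(self, name: str, parent: Optional["Cell"] = None, **properties):
        self.name = name
        self.parent = parent                # hierarchically superior cell, if any
        self.properties = dict(properties)  # values set directly on this cell

    def resolve(self, key: str):
        """Look up a property locally, then walk up through superior cells."""
        cell = self
        while cell is not None:
            if key in cell.properties:
                return cell.properties[key]
            cell = cell.parent
        raise KeyError(key)


# Example hierarchy: enterprise -> Chicago office -> engineering department.
enterprise = Cell("enterprise", retention_days=90, encryption=True)
chicago = Cell("chicago-office", parent=enterprise, retention_days=30)
engineering = Cell("chicago-engineering", parent=chicago)
```

Here the engineering cell sets nothing itself, so it picks up `retention_days` from the Chicago cell (which overrides the enterprise default) and `encryption` from the enterprise cell.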

The storage manager 140 can also track information that allows it to select, designate, or otherwise identify content indexes, deduplication databases, or similar resources or data sets within its information management cell (or another cell) to be searched in response to certain queries. Such queries can be entered via the user interface 158. The management agent 154 allows multiple information management cells to communicate with one another. In some cases, the information management system 100 is one of multiple information management cells in a network of cells that are adjacent to one another or otherwise logically related in a WAN or LAN. These cells can be connected to one another through their respective management agents 154.

For example, the management agent 154 can give the storage manager 140 the ability to communicate with other components of the information management system 100 (and/or other cells within a larger system) via network protocols and application programming interfaces (“APIs”), including, e.g., HTTP, HTTPS, FTP, REST, and virtualization software APIs. Inter-cell communication and hierarchy are described in more detail in U.S. Pat. Nos. 7,747,579 and 7,343,453, which are incorporated herein by reference.

“Data Agents”

As discussed, a variety of applications 110 can run on a given client computing device 102, including operating systems, database applications, e-mail applications, and virtual machines, to name a few. As part of creating and restoring secondary copies 116, the client computing devices 102 may be tasked with processing and preparing the primary data 112 generated by these different applications 110. Moreover, the nature of the processing/preparation can differ across clients and application types, e.g., due to inherent structural and formatting differences among applications 110.

The one or more data agents 142 are therefore advantageously configured in certain embodiments to assist in performing information management operations based on the type of data being protected, at a client-specific and/or application-specific level.

The data agent 142 may be a software module or component that is generally responsible for managing, initiating, or otherwise assisting in the performance of information management operations in the information management system 100, generally as directed by the storage manager 140. For example, the data agent 142 may take part in data storage operations such as copying, archiving, and migrating primary data 112 stored in the primary storage device(s) 104. The data agent 142 may receive control information from the storage manager 140, such as commands to transfer copies of data objects and metadata to the media agents 144.

In some embodiments, a data agent 142 may be distributed between the client computing device 102 and the storage manager 140 (and any other intermediate components), or it may be deployed from a remote location, or its functions may be approximated by a remote process that performs some or all of the functions of the data agent 142. A data agent 142 can also perform some functions provided by a media agent 144, or other functions such as encryption and deduplication.

Each data agent 142 can be specialized for a particular application 110, and the system can employ multiple application-specific data agents 142, each of which performs information management operations (e.g., backup, migration, and data recovery) associated with a different application 110. For example, different individual data agents 142 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Active Directory Objects data, Microsoft Windows file system data, Microsoft SQL Server data, and other types of data.

A file system data agent, for example, may handle data files and/or other file system information. If a client computing device 102 has two or more types of data, a specialized data agent 142 can be used for each data type to back up, archive, migrate, and restore the data of the client computing device 102. For example, to back up, migrate, or restore all of the data on a Microsoft Exchange server, the client computing device 102 might use a Microsoft Exchange Mailbox data agent 142, a Microsoft Exchange Database data agent 142, a Microsoft Exchange Public Folder data agent 142, and a Microsoft Windows File System data agent 142. These specialized data agents 142 can be treated as four separate data agents 142, even though they all run on the same client computing device 102.
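One way to picture several specialized data agents on a single client is a registry keyed by data type. The sketch below is illustrative only: the `AGENTS` table, the `make_agent` helper, and the `back_up` function are hypothetical names, and each "agent" merely returns a descriptive string rather than performing a real backup.

```python
from typing import Callable, Dict, List, Tuple

DataAgent = Callable[[str], str]


def make_agent(label: str) -> DataAgent:
    """Build a stand-in agent that reports what it would have backed up."""
    def agent(item: str) -> str:
        return f"{label} agent backed up {item}"
    return agent


# Hypothetical registry: one specialized agent per data type on the same client.
AGENTS: Dict[str, DataAgent] = {
    "exchange_mailbox": make_agent("Exchange Mailbox"),
    "exchange_database": make_agent("Exchange Database"),
    "exchange_public_folder": make_agent("Exchange Public Folder"),
    "file_system": make_agent("File System"),
}


def back_up(items: List[Tuple[str, str]]) -> List[str]:
    """Route each (data_type, item) pair to the agent registered for its type."""
    results = []
    for data_type, item in items:
        agent = AGENTS.get(data_type)
        if agent is None:
            raise ValueError(f"no data agent registered for {data_type!r}")
        results.append(agent(item))
    return results
```

The four registry entries mirror the four-agent Exchange example above; an unknown data type raises an error instead of being silently skipped.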

Summary for “Hybrid drive caching with a backup system and SSD deletion management”

Data storage devices such as optical disk drives and hybrid drives can be used to store data for later retrieval. These devices may use data caches, which can include high-speed semiconductor memory chips, so that the devices can quickly manage the data and commands received from a host computer. Without caching, every read and write command received from a host computer would result in an access to the mass storage medium (e.g., a magnetic disk). Such accesses can cause significant delays due to the mechanical positioning of the head relative to the magnetic disk. With caching, the storage device can buffer data that is likely to be accessed by the host computer, so that the data is available more quickly when it is actually needed.

For purposes of summarizing the disclosure, certain aspects, advantages, and novel features of the inventions are described herein. Not all of these advantages will necessarily be achieved in accordance with any particular embodiment of the inventions disclosed herein. Thus, the inventions disclosed herein can be embodied or carried out in a manner that achieves or selects one advantage or group of advantages as taught herein, without necessarily achieving other advantages that may be taught or suggested herein.

According to certain aspects, a method of caching in a storage system comprising a hard drive and a solid state drive is described. The method can include receiving a first request to read a first page from a hybrid drive that includes a hard drive and a solid state drive (SSD), where the SSD is configured as a cache for the hard drive and has a faster read speed than the hard drive. The method can include determining whether the first page is stored in the SSD. If it is not, the method can include reading the first page from the hard drive and storing it in memory without caching it in the SSD. This avoids writing to the SSD when the first page may never be read again, which reduces wear on the SSD. The method can further include receiving a subsequent request to read the first page from the hybrid drive. In response to the subsequent request, the method can include reading the first page from the memory, marking, with a hardware processor, the first page as ready to cache in the SSD, and waiting to cache the first page until other pages are also ready to cache, to avoid unnecessary writes to the SSD. After a predetermined number of pages have been marked as ready to cache, the method may include writing the first page and the other ready pages to the SSD together, so that the pages are written efficiently in bulk and wear on the SSD is reduced.
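The read path described above (hard drive on the first read, memory on the second, SSD only in batches) can be sketched as a toy Python model. Everything here is an assumption made for illustration: plain dictionaries stand in for the hard drive, memory, and SSD, the class and attribute names are hypothetical, and the batch size of 3 is an arbitrary stand-in for the "predetermined number" of pages.

```python
from typing import Dict, List


class HybridDriveCache:
    """Toy model: dicts stand in for the hard drive, memory, and SSD."""

    def __init__(self, hard_disk: Dict[int, bytes], batch_size: int = 3):
        self.hard_disk = hard_disk
        self.memory: Dict[int, bytes] = {}  # pages read at least once, not yet in SSD
        self.ready: List[int] = []          # pages marked as ready to cache
        self.ssd: Dict[int, bytes] = {}     # pages cached in the SSD
        self.batch_size = batch_size        # assumed "predetermined number" of pages
        self.ssd_write_batches = 0          # how many bulk writes hit the SSD

    def read(self, page_id: int) -> bytes:
        if page_id in self.ssd:             # cache hit: serve from the SSD
            return self.ssd[page_id]
        if page_id in self.memory:          # subsequent read: serve from memory
            if page_id not in self.ready:
                self.ready.append(page_id)  # mark as ready to cache, but wait
            data = self.memory[page_id]
            if len(self.ready) >= self.batch_size:
                self._flush()               # enough pages waiting: write in bulk
            return data
        data = self.hard_disk[page_id]      # first read: go to the hard drive
        self.memory[page_id] = data         # keep in memory, do not touch the SSD
        return data

    def _flush(self) -> None:
        """Write all ready pages to the SSD in one batch and drop them from memory."""
        for page_id in self.ready:
            self.ssd[page_id] = self.memory.pop(page_id)
        self.ready.clear()
        self.ssd_write_batches += 1
```

In this sketch a page that is read only once (for example, during a one-pass scan) never reaches the SSD, and pages that are read twice reach it only as part of a batch.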

The method may also include caching the first page in a memory cache before caching it in the SSD. A hash table can be maintained in the memory to map the first page to a storage location in the SSD or in the memory. In some embodiments, the first and subsequent read requests relate to backup operations.

According to additional aspects, a system for caching in a storage system comprising a hard drive and a solid state drive is disclosed. The system can include a storage driver implemented in a hardware processor and comprising executable instructions configured to receive a first request to read a first data element from a storage system comprising a hard drive and an SSD, where the SSD is configured as a cache for the hard drive. In response to the first request, the storage driver can read the first data element from the hard drive. The storage driver can further be configured to receive a second request to read the first data element from the storage system and, in response to the second request, to indicate that the first data element is ready to cache in the SSD. The storage driver can then determine whether a predetermined quantity of data elements, in addition to the first data element, has been indicated as ready to cache. If so, the storage driver can write the data elements, including the first data element, to the SSD.

In some cases, the system also includes a hybrid drive. According to certain aspects, the storage driver can also be configured to maintain a data structure in memory that maps data elements that are ready to cache to a storage location in either the SSD or the memory. The data structure can be a hash table indexed at least by data element identifier. The storage driver can also be configured to delete the first data element from the SSD in response to a write to the first data element. To reduce wear on the SSD, the storage driver can further be configured to write the data elements that are ready to cache to the SSD in circular fashion.
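A minimal sketch of such an in-memory index follows, assuming a Python `dict` as the hash table. The `CacheIndex` class, the `"memory"`/`"ssd"` tier labels, and the integer slot numbers are hypothetical details introduced for this example; the text only states that the structure maps data elements to a location in the SSD or in memory and that a write to an element removes it from the SSD.

```python
from typing import Dict, Optional, Tuple

# A location is (tier, slot); tiers and slot numbers are illustrative assumptions.
Location = Tuple[str, int]


class CacheIndex:
    """Hash table keyed by data element identifier."""

    def __init__(self) -> None:
        self._table: Dict[str, Location] = {}  # a Python dict is a hash table

    def put(self, element_id: str, tier: str, slot: int) -> None:
        """Record where a data element currently lives."""
        if tier not in ("memory", "ssd"):
            raise ValueError(f"unknown tier: {tier}")
        self._table[element_id] = (tier, slot)

    def locate(self, element_id: str) -> Optional[Location]:
        """Return the cached location, or None if the hard drive must be read."""
        return self._table.get(element_id)

    def invalidate_on_write(self, element_id: str) -> bool:
        """Drop the cached entry when the element is overwritten by the host.

        Returns True if a stale entry was removed.
        """
        return self._table.pop(element_id, None) is not None
```

Removing the entry on a write keeps the cache from serving stale data; the fresh version would re-enter the cache only after it has been read again often enough.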

The storage driver may be further configured to store the first data element in memory before writing it to the SSD, and to delete the first data element from the memory once it has been written to the SSD.

Depending on the embodiment, the predetermined quantity of data elements can correspond to one or both of a size of the data elements and a number of the data elements.

The storage driver may include an interface to one or both of a file system and a database, from which the first and second read requests are received.

According to further aspects, a system for caching is described. The system can include a hardware processor configured to: read a first data element from a hard drive; store in memory a first indication that the first data element is to be cached in an SSD, without actually caching it in the SSD; read a second data element from the hard drive; store in the memory a second indication that the second data element is to be cached in the SSD; and, after storing the second indication in the memory, cache the first and second data elements in the SSD.

The hardware processor can also be configured to store the first and second indications in a buffer within the memory. The first and second indications can contain pointers to the first and second data elements, respectively. In certain cases, the hardware processor is configured to cache the first and second data elements in the SSD in response to the buffer reaching capacity. In some cases, the hardware processor is configured to cache the first and second data elements in a memory cache before caching them in the SSD. The hardware processor can further be configured to cache the first and second data elements in the SSD when the memory cache reaches capacity, even if the buffer has not reached capacity.
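The two flush triggers described above (the indication buffer filling up, or the memory cache filling up first) can be sketched as follows. This is a simplified model under stated assumptions: the class name, the `indicate` flag, and the use of dictionary keys as "pointers" into the memory cache are all hypothetical, and capacities are counted in elements rather than bytes.

```python
from typing import Dict, List


class IndicationBuffer:
    """Toy model of a buffer of 'to be cached' indications over a memory cache."""

    def __init__(self, buffer_capacity: int, memory_cache_capacity: int):
        self.buffer_capacity = buffer_capacity
        self.memory_cache_capacity = memory_cache_capacity
        self.memory_cache: Dict[str, bytes] = {}
        self.indications: List[str] = []  # keys act as pointers into memory_cache
        self.ssd: Dict[str, bytes] = {}

    def note_read(self, element_id: str, data: bytes, indicate: bool = True) -> None:
        """Hold the element in memory and optionally record that it should be cached."""
        self.memory_cache[element_id] = data
        if indicate and element_id not in self.indications:
            self.indications.append(element_id)
        buffer_full = len(self.indications) >= self.buffer_capacity
        memory_full = len(self.memory_cache) >= self.memory_cache_capacity
        if buffer_full or memory_full:
            self.flush()

    def flush(self) -> None:
        """Move every indicated element from the memory cache to the SSD at once."""
        for element_id in self.indications:
            self.ssd[element_id] = self.memory_cache.pop(element_id)
        self.indications.clear()
```

With a large buffer and a small memory cache, the memory-capacity trigger fires first, which corresponds to caching in the SSD "even if the buffer has not reached capacity."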

In certain embodiments, the hardware processor can be configured to delete the first data element from the SSD upon receiving a request to write to that data element.

A hybrid drive may include both a hard drive and a solid state drive (SSD), with the SSD used as a cache for the hard drive. The SSD can store certain input/output (I/O) to and from the hard drive so that frequently accessed data can be read more quickly from the SSD. Because flash memory is generally more expensive than hard drive technology, a relatively small SSD can be used as the cache. A hybrid drive thus combines a fast SSD cache with a large-capacity hard drive, offering a compromise between price and storage capacity.

SSDs have a limited lifespan: each SSD cell can sustain only a limited number of writes. Existing logic layers can use wear-leveling algorithms to wear the SSD evenly. For example, a wear-leveling algorithm may write to different portions of the SSD at different times, which avoids writing to the same portion repeatedly and prematurely reducing the SSD’s lifespan. However, the wear-leveling algorithms of an SSD might not be sufficient to reduce wear when the SSD is used as a cache. If every data element read from the hard drive were stored in the SSD, regardless of whether it will be used again soon, the SSD’s lifespan would be significantly reduced. For example, data that is read from the hard drive once in a scanning operation should not be stored in the SSD, to avoid unnecessary writes. Backup operations can be particularly costly in this respect, as many writes to the SSD may be performed during a backup.

Aspects of the systems and methods described herein may implement intelligent caching algorithms that address these and other issues in order to reduce wear on the SSD and/or improve caching performance. These algorithms can improve storage utilization and I/O efficiency while taking the SSD’s write-wear limits into account. For example, the systems and methods can cache data in the SSD while avoiding writing to it too often, thereby extending its lifespan. The systems and methods can write data to the SSD only once that data has been read multiple times from the hard drive or memory, to avoid (or attempt to avoid) writing data that is read only once. The systems and methods can also write many data units to the SSD at a time, instead of one unit at a time. The systems and methods may also write to the SSD in circular fashion, overwriting the oldest or least recent data first, to avoid (or attempt to avoid) writing to the same area of the SSD multiple times in succession.
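Circular writing can be illustrated with a fixed-size array and a write head that wraps around. The sketch below is an assumption-laden toy: `CircularSsdLog`, its slot granularity, and the per-slot `write_counts` counter are hypothetical and exist only to show that consecutive batches land on consecutive slots, so no slot is rewritten until every other slot has been written.

```python
from typing import List, Optional


class CircularSsdLog:
    """Toy SSD with a fixed number of slots, written in circular order."""

    def __init__(self, num_slots: int):
        self.slots: List[Optional[bytes]] = [None] * num_slots
        self.write_counts = [0] * num_slots  # per-slot wear, for illustration only
        self.head = 0                        # index of the next slot to write

    def write_batch(self, pages: List[bytes]) -> List[int]:
        """Write pages to consecutive slots, wrapping around at the end."""
        used = []
        for page in pages:
            self.slots[self.head] = page     # overwrites the oldest data in the log
            self.write_counts[self.head] += 1
            used.append(self.head)
            self.head = (self.head + 1) % len(self.slots)
        return used
```

Because the head only moves forward, the write counts of any two slots never differ by more than one in this model, which is the even-wear property the circular scheme is aiming for.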

“The systems and methods described herein for intelligent caching may also be used in information management systems, such as those shown in FIGS. 1A-1H.”

With the growing importance of protecting and leveraging data, organizations simply cannot afford to risk losing critical data. At the same time, runaway data growth and other modern realities make protecting and managing data increasingly difficult. There is therefore a need for efficient, powerful, and user-friendly solutions for protecting and managing data.

Depending on the size of the organization, there are typically many data production sources under the control of tens, hundreds, or even thousands of employees. In the past, individual employees were sometimes responsible for managing and protecting their own data. In other cases, a patchwork of hardware and software point solutions was used. These solutions were often provided by different vendors and had limited or no interoperability.

“Certain embodiments described herein provide systems and methods capable of addressing these and other shortcomings of prior approaches by implementing unified, organization-wide information management. FIG. 1A illustrates one such information management system 100, which generally includes combinations of hardware and software that are used to protect and manage data and metadata generated by the various computing devices in the information management system 100. The organization that uses the information management system 100 could be a company or other business entity, an educational institution, a household, or a governmental agency.

“Generally, the systems described herein may be compatible with and/or provide some of the functionality of the systems described in one or more U.S. patents or patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated in its entirety by reference herein.

“The information management system 100 can include a wide range of computing devices. For example, the information management system 100 could include one or more client computing devices 102 and secondary storage computing devices 106, as discussed in more detail below.

Computing devices may include, without limitation, one or more of the following: personal computers, workstations, desktop computers, or other types of generally fixed computing systems such as mainframe computers and minicomputers. Other computing devices include portable or mobile computing devices such as laptops, tablet computers, personal data assistants, mobile phones (such as smartphones), and other mobile or portable computing devices such as embedded computers, set top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

“In certain cases, a computing device may include virtualized and/or cloud computing resources. For example, a third-party cloud service provider may provide one or more virtual machines to an organization. In some cases, computing devices may include one or more virtual machines running on a physical host computing device (or “host machine”) operated by the organization. As one example, the organization might use one virtual machine as a database server and another virtual machine as a mail server, with both virtual machines running on the same host machine.

A virtual machine includes an operating system and associated virtual resources, and is hosted on a physical host computer (or host machine). A hypervisor, which is typically software and is also known as a virtual machine monitor or a virtual machine manager (“VMM”), sits between the virtual machine and the hardware of the physical host machine. ESX Server, by VMware, Inc., of Palo Alto, Calif., is one example of a hypervisor used for virtualization. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation, and Sun xVM by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be hardware or firmware.

The hypervisor provides each virtual operating system with virtual resources, such as a virtual processor, virtual memory, and virtual network devices. Each virtual machine can have one or more virtual disks. The hypervisor typically stores the data of virtual disks in files on the file system of the physical host machine. These files are called virtual machine disk image files (in the case of Microsoft virtual servers) or virtual machine disk files (in the case of VMware virtual servers). For example, VMware’s ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine disk files. A virtual machine reads data from and writes data to its virtual disk in much the same way that a physical machine reads from and writes to a physical disk.

Examples of information management techniques in cloud computing environments are described in U.S. Pat. No. 8,285,681, which is incorporated by reference herein. Examples of techniques for managing information in virtualized computing environments are described in U.S. Pat. No. 8,307,177, also incorporated by reference herein.

“The information management system 100 can also include a variety of storage devices, such as primary storage devices 104 and secondary storage devices 108. Storage devices can generally be of any suitable type, including, without limitation, disk drives, hard-disk arrays, semiconductor memory (e.g., solid state storage devices), network-attached storage (NAS) devices, tape libraries or other magnetic, non-tape storage devices, optical media storage devices, DNA/RNA-based memory technology, and combinations thereof. Storage devices may be part of a distributed storage system in some instances. Some storage devices can be provided in a cloud, such as a private cloud or one managed by a third-party vendor. In some cases, a storage device is a disk array or a portion thereof.

“The illustrated information management system 100 includes one or more client computing devices 102 that execute at least one application 110, and one or more primary storage devices 104 that store primary data 112. In some cases, the client computing device(s) 102 and the primary storage devices 104 may be referred to as a primary storage subsystem 117. A computing device that is part of an information management system 100 and has a data agent 142 installed and running on it is called a client computing device 102 (or, in the context of a component of the information management system 100, simply a “client”).

“The meaning of the term “information management system” depends on the context. In some cases, it refers to all of the software and hardware components. In other cases, it refers to only a subset of the components.

In some cases, the information management system 100 refers to a collection of components that protect, move, and manage data and metadata generated by the client computing devices 102. In such cases, the information management system 100 does not necessarily include the components that create and/or store the primary data 112, such as the client computing devices 102, the applications 110, and the primary storage devices 104. For example, the term “information management system” may sometimes refer to one or more of the following components and their corresponding data structures: storage managers, data agents, and media agents. These components are described in greater detail below.

“Client Computing Devices.”

An organization typically has a variety of sources that produce data to be protected and managed. For example, a company environment can have multiple data sources, including employee workstations and company servers such as mail servers, web servers, database servers, transaction servers, and the like. In the information management system 100, the data sources are the client computing devices 102.

“The client computing devices 102 can include any of the types of computing devices described above. In some cases, a client computing device 102 is associated with one or more users or corresponding user accounts of employees or other individuals.”

“The information management system 100 generally addresses the data management and protection needs of the data generated by the client computing devices 102. However, the use of the term “client” does not imply that the client computing devices 102 cannot be “servers” in other respects. For example, a particular client computing device 102 can act as a server with respect to other devices, such as other client computing devices 102. The client computing devices 102 can include file servers, mail servers, database servers, and web servers.

“Each client computing device 102 may have one or more applications 110 (e.g., software applications) executing on it. These applications generate and manipulate data that must be managed and protected from loss. The applications 110 generally support the operations of an organization or of multiple affiliated organizations. They can include file server applications, mail server applications (e.g., Microsoft Exchange Server), mail client applications (e.g., Microsoft Exchange Client), database applications (e.g., SQL, Oracle, SAP, and Lotus Notes Database), word processing applications (e.g., Microsoft Word), spreadsheet applications, financial applications, presentation applications, graphics applications, web applications, mobile applications, and entertainment applications.

“The client computing devices 102 may have at least one operating system (e.g., Microsoft Windows, Mac OS X, iOS, IBM z/OS, Linux, or other Unix-based operating systems) installed on them. One or more file systems or other applications 110 may also be installed on the client computing devices 102.

“The client computing devices 102 and the other components of the information management system 100 can be connected to one another via one or more communication pathways 114. For example, a first communication pathway 114 may connect a client computing device 102 and a secondary storage computing device 106, a second communication pathway 114 may connect the storage manager 140 and a client computing device 102, and a third communication pathway 114 may connect the storage manager 140 and a secondary storage computing device 106 (see, e.g., FIG. 1A and FIG. 1C). In some cases, the communication pathways 114 may also include application programming interfaces (APIs), such as cloud service provider APIs and virtual machine management APIs. The infrastructure that underlies the communication pathways 114 can be wired, wireless, analog, and/or digital, or any combination thereof. The underlying facilities may also be private, public, or provided by a third party.

“Primary Data, Exemplary Primary Storage Devices”

According to some embodiments, primary data 112 is production data or other “live” data generated by the operating system and/or the applications 110 running on a client computing device 102. Primary data 112 is usually stored on the primary storage device(s) 104 and is organized using a file system that is supported by the client computing device 102. For example, the client computing device(s) 102 and the corresponding applications 110 may create, access, modify, write, delete, or otherwise use primary data 112. In some cases, some or all of the primary data 112 can be stored in cloud storage resources.

“Primary data 112 is usually in the native format of the source application 110. Primary data 112 can be described as an initial or first stored copy of data generated by the source application 110 (e.g., created before any other copies, or before at least one other copy). In some cases, primary data 112 is created substantially directly from the data generated by the corresponding source application 110.

The primary storage devices 104 that store the primary data 112 can be relatively fast and/or expensive (e.g., disk drives, hard-disk arrays, solid state memories, etc.). Primary data 112 can also be highly changeable and/or intended for relatively short-term retention (e.g., hours, days, or weeks).

“According to some embodiments, the client computing devices 102 can access primary data 112 on the primary storage device 104 by making conventional file system calls through the operating system. Primary data 112 may include structured data, unstructured data, and/or semi-structured data. Some specific examples are described below with respect to FIG. 1B.”

For certain tasks, it can be useful to organize the primary data 112 into units of different granularities. In general, primary data 112 can include files, directories, file system volumes, data blocks, extents, or any other hierarchies or organizations of data objects. As used herein, a “data object” can refer to either (1) any file that is currently addressable by a file system or that was previously addressable by the file system (e.g., an archive file) or (2) a subset of such a file (e.g., a data block).

“As explained in detail below, it can also be useful, when performing certain functions of the information management system 100, to access and modify the metadata within the primary data 112. Metadata generally includes information about data objects or the characteristics associated with them. Note that any reference to primary data 112 generally includes its associated metadata, whereas references to the metadata do not include the primary data.

“Metadata may include, without limitation, one or more of the following: the data owner (e.g., the client or user that generated the data object), the last modified time (e.g., the time at which the data object was most recently modified), the data object size (e.g., a number of bytes of data), information about the content (e.g., an indication of the existence of a specific search term), user-supplied tags, to/from information for email (e.g., an email sender, recipient, etc.), the creation date, the file type (e.g., format or application type), the last accessed time, the application type (e.g., the type of application that created the data object), location/network information (e.g., a current, past, or future location of the data object and network paths to/from it), partition layouts, file location within a file folder directory structure, permissions, owners, groups, access control lists (ACLs), and system metadata.

“In addition to the metadata generated by or related to file systems and operating systems, some of the applications 110 and/or other components of the information management system 100 maintain indices of metadata for data objects, such as metadata associated with individual email messages. Each data object can thus be associated with corresponding metadata. The use of metadata to perform classification and other functions is described in greater detail below.

“Each client computing device 102 is generally associated with and/or in communication with one or more of the primary storage devices 104 that store the corresponding primary data 112. A client computing device 102 may be considered to be “associated with” or “in communication with” a primary storage device 104 if it is capable of one or more of the following: routing and/or storing data (e.g., primary data 112) to the particular primary storage device 104; coordinating the routing and/or storing of data to the particular primary storage device 104; retrieving data from the particular primary storage device 104; coordinating the retrieval of data from the particular primary storage device 104; and modifying and/or deleting data retrieved from the particular primary storage device 104.”

“The primary storage devices 104 may include any of the storage devices described above or another type of storage device. The primary storage devices 104 may be relatively fast and/or relatively expensive in comparison with the secondary storage devices 108. For example, the information management system 100 might access the data and metadata stored on the primary storage devices 104 quite often, while the data and metadata stored on the secondary storage devices 108 are accessed less frequently.

“Primary storage devices 104 can be shared or dedicated. In some cases, each primary storage device 104 is dedicated to an associated client computing device 102. For example, in one embodiment, the primary storage device 104 is a local disk drive of the corresponding client computing device 102. In other cases, one or more primary storage devices 104 can be shared by multiple client computing devices 102, e.g., via a network, such as in a cloud storage implementation. As one example, a primary storage device 104 could be a disk array shared by a group of client computing devices 102, such as an EMC Clariion, EMC Symmetrix, or EMC Celerra disk array.

“The information management system 100 could also include hosted services (not illustrated), which may be hosted by an entity other than the organization that uses the information management system 100. For example, hosted services can be provided to the organization by various online service providers. These service providers may offer services such as social networking, hosted email, and hosted productivity applications. Hosted services may include software-as-a-service (SaaS), platform-as-a-service (PaaS), application service providers (ASPs), cloud services, or other mechanisms for delivering functionality via a network. As it delivers services to users, each hosted service can generate additional data and metadata that may be managed by the information management system 100 (e.g., as primary data 112). In some cases, the hosted services are accessed using one of the applications 110. For example, a hosted mail service could be accessed using a browser running on a client computing device 102. Hosted services can be implemented in a variety of computing environments, including an environment similar to the information management system 100 in which various physical and logical components are distributed over a network.

“Secondary copies and Exemplary Secondary Storage Devices”

In some instances, the primary data 112 stored on the primary storage devices 104 may be compromised. For example, an employee might deliberately or accidentally delete or overwrite primary data 112 in the normal course of work. The primary storage devices 104 may also be lost, damaged, or corrupted. It is therefore useful to create copies of the primary data 112 for recovery purposes and/or regulatory compliance. Accordingly, the information management system 100 includes one or more secondary storage computing devices 106 and one or more secondary storage devices 108 that are used to create and store secondary copies 116 of the primary data 112 and associated metadata. The secondary storage computing devices 106 and the secondary storage devices 108 may sometimes be referred to as a secondary storage subsystem 118.

“Creating and storing secondary copies 116 can help with search and analysis efforts and with other information management goals. For example, secondary copies 116 allow data and/or metadata to be restored if a primary version (e.g., of primary data 112) is lost due to deletion, corruption, or natural disaster, and they also permit point-in-time recovery.

“The client computing devices 102 access or receive primary data 112 and communicate that data, e.g., over one or more communication pathways 114, for storage in the secondary storage device(s) 108.

“A secondary copy 116 can be a separate, stored copy of application data that is derived from one or more earlier-created, stored copies (e.g., derived from primary data 112 or from another secondary copy 116). Secondary copies 116 may contain point-in-time data and can be retained for a relatively long period (e.g., weeks, months, or years) before some or all of the data is moved to other storage or discarded.

“In some cases, a secondary copy 116 is a copy of application data that is created and stored after at least one other stored instance (e.g., after the corresponding primary data 112 or another secondary copy 116), in a different storage device than at least one previously stored copy, and/or remotely from at least one previously stored copy. In other cases, secondary copies can be stored on the same storage device as the primary data 112 and/or other previously stored copies. In one example, a disk array that can perform hardware snapshots stores the primary data 112 and also creates and stores hardware snapshots of the primary data 112 as secondary copies 116. Secondary copies 116 can be kept in relatively low-cost storage, such as magnetic tape. A secondary copy 116 could be kept in a backup format, an archive format, or some other format that differs from the primary data format or the native application format.

“Some secondary copies 116 can be indexed so that users can browse and restore them at a later time. After a secondary copy 116 representing certain primary data 112 has been created, a pointer or another location indicator (e.g., a stub) may be placed in the primary data 112, or otherwise associated with it, to indicate the current location of the secondary copy 116 on the secondary storage device(s) 108.

“Because an instance of a data object or metadata in primary data 112 can change over time as it is modified by an application 110 (or the operating system), the information management system 100 may create and manage multiple secondary copies 116 of that data object or metadata, each representing its state at a specific point in time. The information management system 100 can also continue to manage the point-in-time representations of a primary data object even after it has been deleted from the primary storage device 104 and the file system.
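The idea of keeping several point-in-time copies of one data object, and of still being able to restore it after the primary instance is gone, can be sketched as follows. This is an illustrative model only: the class `PointInTimeStore`, its methods, and the use of plain numbers as timestamps are assumptions and do not come from the patent text.

```python
import bisect


class PointInTimeStore:
    """Toy store of timestamped secondary copies (all names hypothetical)."""

    def __init__(self):
        self._times = {}    # object id -> sorted list of copy timestamps
        self._copies = {}   # object id -> copy contents, parallel to _times

    def create_copy(self, object_id, timestamp, content):
        # Keep the copies for each object ordered by timestamp.
        times = self._times.setdefault(object_id, [])
        copies = self._copies.setdefault(object_id, [])
        pos = bisect.bisect_right(times, timestamp)
        times.insert(pos, timestamp)
        copies.insert(pos, content)

    def restore(self, object_id, as_of):
        # Return the most recent copy made at or before `as_of`, or None
        # if no copy of the object existed at that time.
        times = self._times.get(object_id, [])
        pos = bisect.bisect_right(times, as_of)
        if pos == 0:
            return None
        return self._copies[object_id][pos - 1]
```

Because the store is independent of the primary data, `restore` keeps working for an object whose primary instance has since been deleted, which mirrors the behavior described in the paragraph above.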

“For virtualized computing devices, the operating system and other applications 110 may execute within or under the management of virtualization software (e.g., a VMM), and the primary storage device(s) 104 may contain a virtual disk created on a physical storage device. The information management system 100 can create secondary copies 116 of the files and other data objects within a virtual disk file, and/or secondary copies 116 of the entire virtual disk file itself (e.g., of an entire .vmdk file).

“Secondary copies 116 can be distinguished from the corresponding primary data 112 in a variety of ways, some of which are discussed here. As mentioned, secondary copies 116 may be stored in a different format than the primary data 112 (e.g., a backup, archive, or other non-native format). For this and other reasons, secondary copies 116 may not be directly usable by the applications 110 of the client computing device 102.

“In certain embodiments, secondary copies 116 are also stored on a secondary storage device 108 that is not accessible to the applications 110 running on the client computing devices 102 (and/or hosted services). Some secondary copies 116 could be “offline copies,” in that they are not readily accessible (e.g., they are not mounted to tape or disk). Offline copies can include copies of data that the information management system 100 is able to access without human intervention.

“The Use Of Intermediate Devices To Create Secondary Copies”

Creating secondary copies can be a challenging task. There can be hundreds of client computing devices 102 generating large volumes of primary data 112 that must be protected. The creation of secondary copies 116 can also involve significant overhead. Moreover, secondary storage devices 108 may be special-purpose components, so interacting with them may require specialized intelligence.

“In certain cases, the client computing devices 102 can interact directly with the secondary storage devices 108 to create the secondary copies 116. This approach, however, can negatively impact the ability of the client computing devices 102 to serve the applications 110 and produce primary data 112. In addition, the client computing devices 102 might not be optimized for interaction with the secondary storage devices 108.

“In some embodiments, the information management system 100 may include one or more software and/or hardware components that act as intermediaries between the client computing devices 102 and the secondary storage devices 108. In addition to off-loading certain responsibilities from the client computing devices 102, these intermediate components may provide other benefits. For example, as discussed below with respect to FIG. 1D, distributing some of the work required to create secondary copies 116 can increase scalability.

“The intermediate components may include one or more secondary storage computing devices 106, as shown in FIG. 1A, and/or one or more media agents, which can be software modules that operate on the secondary storage computing devices 106 or on other suitable computing devices. Media agents are discussed below (e.g., with respect to FIGS. 1C-1E).”

“The secondary storage computing device(s) 106 can include any of the computing devices described above. In some cases, the secondary storage computing devices 106 include specialized hardware and/or software components for interfacing with the secondary storage devices 108.

“Creating a secondary copy 116 involves copying data from the primary storage subsystem 117 to the secondary storage subsystem 118. In some embodiments, the client computing device 102 communicates the primary data 112 (or a processed version thereof) to the designated secondary storage computing device 106 via the communication pathway 114. The secondary storage computing device 106 then transmits the received data, or a processed version thereof, to the secondary storage device 108. In certain cases, the communication pathway 114 between the client computing device 102 and the secondary storage computing device 106 may be part of a LAN, WAN, or SAN. In other cases, at least one client computing device 102 communicates directly with the secondary storage devices 108 (e.g., via Fibre Channel or SCSI connections). In still other cases, one or more secondary copies 116 are created from existing secondary copies, as in the case of an auxiliary copy operation.

“Exemplary Secondary Data and Exemplary Primary Data”

“As shown in FIG. 1B, the primary storage device(s) 104 contain primary data objects, including word processing documents 119A-B, spreadsheets 120, presentation files 122, video files 124, image files 126, email mailboxes 128 (and corresponding emails 129A-C), html/xml files 130, and databases 132 with corresponding tables or other data structures 133A-133C.

“Some or all of the primary data objects are associated with corresponding metadata (e.g., “Meta1-11”). This metadata may include file system metadata and/or application-specific metadata. Secondary copy data objects 134A-C are stored on the secondary storage device(s) 108. These secondary copy data objects may contain copies of, or otherwise represent, the corresponding primary data objects and metadata.

“As shown, each of the secondary copy data objects 134A-C can represent more than one primary data object. For example, secondary copy data object 134A represents three separate primary data objects 133C, 122, and 129C (represented as 133C′, 122′, and 129C′, respectively, and accompanied by the corresponding metadata Meta11, Meta3, and Meta8, respectively). As indicated by the prime mark (′), a secondary copy data object may store a representation of a primary data object and/or its metadata in a format that differs from the original format. Likewise, secondary copy data object 134B represents primary data objects 120, 133B, and 119A as 120′, 133B′, and 119A′, respectively, accompanied by the corresponding metadata Meta2, Meta10, and Meta1, respectively. Secondary copy data object 134C represents primary data objects 133A, 119B, and 129A as 133A′, 119B′, and 129A′, respectively, accompanied by the corresponding metadata Meta9, Meta5, and Meta6, respectively.

“Exemplary Information Management System Architecture”

“The information management system 100 can include a wide range of hardware and software components, which can be organized in many different ways depending on the embodiment. Specifying the functional responsibilities and roles of the components in the information management system 100 involves critical design decisions. As discussed below, these design decisions can have a significant impact on performance and on the ability of the information management system 100 to adapt to data growth or other changing circumstances.

“FIG. 1C illustrates an information management system 100 that includes: a storage manager 140, which is a centralized storage and/or information manager configured to perform certain control functions; one or more data agents 142 that execute on the client computing device(s) 102 and process primary data 112; and one or more media agents 144 that execute on the secondary storage computing device(s) 106 and perform tasks involving the secondary storage devices 108. Although distributing functionality across multiple computing devices can have certain advantages, in other cases it can be beneficial to consolidate functionality on the same computing device. Accordingly, in various other embodiments, one or more of the components shown in FIG. 1C as being implemented on separate computing devices are implemented on the same computing device. In one configuration, a storage manager 140, one or more data agents 142, and one or more media agents 144 are all implemented on the same computing device. In another embodiment, one or more data agents 142 and one or more media agents 144 are implemented on the same computing device, while the storage manager 140 is implemented on a separate computing device, without limitation.

“Storage Manager”

“As noted, the number of components in the information management system 100 and the amount of data under management can be quite large. Managing the components and data is therefore a significant task, and one that can grow more difficult as the number of components and the amount of data scale to meet the organization’s needs. For these and other reasons, according to certain embodiments, responsibility for controlling the information management system 100 is allocated to the storage manager 140. By distributing control functionality in this way, the storage manager 140 can be adapted independently as circumstances change. In addition, the computing device that hosts the storage manager 140 can be chosen to best suit the functions of the storage manager 140. These and other benefits are described in more detail below with respect to FIG. 1D.”

“The storage manager 140 could be a software module or other application that, in certain embodiments, operates in conjunction with one or more associated data structures (e.g., a dedicated database such as management database 146). In some embodiments, the storage manager 140 is a computing device that executes computer instructions. The storage manager 140 generally initiates, performs, coordinates, and/or controls the storage operations and other information management operations performed by the information management system 100, e.g., to protect and control the primary data 112 and the secondary copies 116 of data and metadata. In general, the storage manager 140 is responsible for managing the information management system 100, which includes managing its constituent components (e.g., data agents and media agents).

“As indicated by the dashed arrowed lines 114 in FIG. 1C, the storage manager 140 can communicate with and/or control some or all of the elements of the information management system 100, such as the data agents 142 and the media agents 144. In certain embodiments, control information originates from the storage manager 140, and status reporting is sent to the storage manager 140 by the various managed components. Payload data and payload metadata, by contrast, are generally communicated between the data agents 142 and the media agents 144 (or otherwise between the client computing device(s) 102 and the secondary storage computing device(s) 106), e.g., at the direction of and under the management of the storage manager 140. The control information may include instructions and parameters for performing information management operations, such as instructions on when to start an operation, timing information that specifies when to initiate a task associated with an operation, and data path information that specifies which components to access or communicate with in order to carry out the operation. Payload data, on the other hand, can include the actual data involved in the storage operation, such as the content data written to a secondary storage device 108 during a secondary copy operation. Payload metadata may include any of the types of metadata described herein and can be written to a storage device along with the payload content data (e.g., in the form of a header).

“In some embodiments, certain information management operations can be controlled by other components of the information management system 100 (e.g., the media agent(s) 144 or the data agent(s) 142), instead of or in combination with the storage manager 140.”

“According to certain embodiments, the storage manager 140 provides one or more of the functions described below.

“The storage manager 140 could maintain a database 146 (or “storage manager database 146” or “management database 146”) of management-related data and information management policies 148. The database 146 may include a management index 150 (or “index 150”) or another data structure that stores logical associations between components of the system, user preferences and/or profiles (e.g., preferences regarding encryption, compression, or deduplication; preferences regarding the scheduling, type, or other aspects of operations; mappings of information management users to specific computing devices or other components, etc.), management tasks, media containerization, and other useful data. For example, the storage manager 140 may use the index 150 to track logical associations between the media agents 144 and the secondary storage devices 108, and/or the movement of data from the primary storage devices 104 to the secondary storage devices 108. The index 150 could, for instance, store data that associates a client computing device 102 with a specific media agent 144 and/or secondary storage device 108, as specified in an information management policy 148 (described below).
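The logical associations that the index 150 is described as tracking can be pictured as two small mappings. The sketch below is an assumption-laden illustration, not the actual structure of the index: the class name `ManagementIndex`, its method names, and the string identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ManagementIndex:
    """Toy model of logical associations between components (hypothetical)."""

    # media agent id -> set of secondary storage device ids it is associated with
    agent_devices: dict = field(default_factory=dict)
    # client id -> (media agent id, secondary storage device id)
    client_routes: dict = field(default_factory=dict)

    def attach_device(self, media_agent, device):
        # Record that a media agent is associated with a secondary storage device.
        self.agent_devices.setdefault(media_agent, set()).add(device)

    def associate_client(self, client, media_agent, device):
        # Only allow routes through agent/device pairs already recorded.
        if device not in self.agent_devices.get(media_agent, set()):
            raise ValueError(f"{media_agent} is not associated with {device}")
        self.client_routes[client] = (media_agent, device)

    def route_for(self, client):
        # Look up where a client's secondary copies are sent, if known.
        return self.client_routes.get(client)
```

A storage manager modeled this way could call `route_for("client-102")` to find out which media agent and secondary storage device should handle a secondary copy operation for that client.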

Administrators and other individuals may be able to configure and initiate certain information management operations on an individual basis. This may work for certain recovery operations and other tasks that are performed less frequently, but it is often not practical for ongoing, organization-wide data protection and management. The information management system 100 can therefore use information management policies 148 to specify and execute information management operations (e.g., on an automated basis). An information management policy 148 may include a data structure or other information source that specifies a set of parameters (e.g., criteria and rules) associated with storage or other information management operations.

The storage manager database 146 may contain the information management policies 148 and associated data, although the information management policies 148 can be stored in any appropriate location. For instance, depending on the embodiment, an information management policy 148 such as a storage policy may be stored as metadata in a media agent database 152 or in a secondary storage device 108 (e.g., as an archive copy) for use in restore operations or other information management operations. Information management policies 148 are described further below.

According to some embodiments, the storage manager database 146 includes a relational database (e.g., an SQL database) that tracks metadata such as metadata associated with secondary copy operations (e.g., which client computing devices were involved and the corresponding data). This and other metadata can also be stored in other locations, such as the secondary storage computing devices 106 or the secondary storage devices 108, allowing data recovery without the use of the storage manager 140 in certain cases.

“As shown in the figure, the storage manager 140 may include a jobs agent 156, a user interface 158, and a management agent 154, all of which may be implemented as interconnected software modules or application programs.

In some embodiments, the jobs agent 156 initiates, controls, and/or monitors the status of some or all storage operations or other information management operations that are being performed, or are scheduled to be performed, by the information management system 100. The jobs agent 156 might, for example, access the information management policies 148 to determine when and how to initiate and control secondary copy and other operations.
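One way to picture a jobs agent consulting policies to decide when to start operations is sketched below. The `Policy` fields, the interval-based scheduling rule, and the `due_operations` helper are assumptions for illustration; the source says only that policies specify criteria and rules and that the jobs agent may access them.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Policy:
    """Hypothetical information management policy with a simple schedule rule."""
    name: str
    operation: str       # e.g. "backup" or "archive"
    interval: timedelta  # how often the operation should run
    last_run: datetime   # when it last completed


def due_operations(policies, now):
    """Return (policy name, operation) pairs whose interval has elapsed."""
    return [
        (policy.name, policy.operation)
        for policy in policies
        if now - policy.last_run >= policy.interval
    ]


now = datetime(2024, 1, 8, 2, 0)
policies = [
    Policy("mail-daily", "backup", timedelta(days=1), datetime(2024, 1, 7, 2, 0)),
    Policy("files-weekly", "archive", timedelta(days=7), datetime(2024, 1, 5, 2, 0)),
]
due = due_operations(policies, now)
```

In this example only the daily policy is due, because exactly one day has elapsed since its last run, while the weekly policy last ran three days earlier.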

“The user interface 158 can include information processing and display software, such as a graphical user interface (“GUI”) or an application program interface (“API”). Users can optionally issue instructions to components of the information management system 100 via the user interface 158 regarding storage and recovery operations. For example, a user might modify a schedule concerning the number of pending secondary copy operations. As another example, a user might use the GUI to view the status of pending storage operations or to monitor certain components of the information management system 100 (e.g., the amount of storage capacity remaining).

“An “information management cell” (or “storage operation cell” or “cell”) may generally describe a logical and/or physical grouping of a combination of hardware and software components associated with performing information management operations on electronic data. Such a grouping typically includes at least one storage manager 140, at least one client computing device 102 with at least one data agent 142, and at least one media agent 144. For instance, the components shown in FIG. 1C may together form an information management cell. Multiple cells can be organized hierarchically. This configuration allows cells to inherit properties from hierarchically superior cells or to be controlled by other cells in the hierarchy (automatically or otherwise). In some embodiments, cells can inherit or otherwise be associated with information management policies, preferences, information management metrics, or other properties or characteristics according to their relative position in a hierarchy of cells. Cells can also be organized hierarchically according to geography, architecture, function, or any other factor useful in information management operations. One cell could represent a geographic segment of an enterprise, such as a Chicago office, and a second cell might represent a different geographic segment, such as a New York office. Other cells could represent different departments within an office. A first cell may perform one or more first types of information management operations (e.g., one or more first types of secondary or other copies), while a second cell may perform one or more second types of information management operations.
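The inheritance of properties from hierarchically superior cells can be sketched as a lookup that walks up a parent chain. The `Cell` class, the `resolve` method, and the setting names (`retention_days`, `encryption`) are hypothetical and are not taken from the source.

```python
class Cell:
    """Toy information management cell that may inherit settings from a parent."""

    def __init__(self, name, parent=None, **settings):
        self.name = name
        self.parent = parent
        self.settings = settings  # settings defined directly on this cell

    def resolve(self, key):
        """Return the nearest definition of a setting, searching up the hierarchy."""
        cell = self
        while cell is not None:
            if key in cell.settings:
                return cell.settings[key]
            cell = cell.parent
        raise KeyError(key)


# Assumed example hierarchy: enterprise -> Chicago office -> accounting department.
enterprise = Cell("enterprise", retention_days=90, encryption="on")
chicago = Cell("chicago-office", parent=enterprise, retention_days=30)
accounting = Cell("chicago-accounting", parent=chicago)
```

Here the accounting cell defines nothing itself, so it picks up `retention_days` from the Chicago office (which overrides the enterprise value) and `encryption` from the enterprise cell.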

“The storage manager 140 can also track information that permits it to select, designate, or otherwise identify content indexes, deduplication databases, or similar resources or data sets within its information management cell (or another cell) to be searched in response to certain queries. Such queries can be entered by the user via the user interface 158. The management agent 154 allows multiple information management cells to communicate with one another. In some cases, the information management system 100 may be one information management cell in a network of multiple cells that are adjacent to one another or otherwise logically related in a WAN or LAN. These cells can be connected to one another through their respective management agents 154.

“For example, the management agent 154 can give the storage manager 140 the ability to communicate with other components of the information management system 100 (and/or other cells within a larger information management system) via network protocols and application programming interfaces (“APIs”), including, e.g., HTTP, HTTPS, FTP, REST, and virtualization software APIs. Inter-cell communication and hierarchy are described in greater detail in U.S. Pat. Nos. 7,747,579 and 7,343,453, which are incorporated herein by reference.”

“Data Agents”

“As discussed, a variety of different types of applications 110 can run on a given client computing device 102, including operating systems, database applications, e-mail applications, and virtual machines, to name just a few. As part of creating and restoring secondary copies 116, the client computing devices 102 may be tasked with processing and preparing the primary data 112 generated by these different applications 110. Moreover, the nature of the processing/preparation can differ across clients and application types, e.g., due to inherent structural and formatting differences among applications 110.”

“The one or more data agents 142 are therefore advantageously configured in certain embodiments to assist in the performance of information management operations based on the type of data being protected, at a client-specific and/or application-specific level.”

“The data agent 142 may be a software module or component that is generally responsible for initiating, managing, or otherwise assisting in the performance of information management operations within the information management system 100, usually as directed by the storage manager 140. For instance, the data agent 142 might take part in data storage operations such as copying, archiving, and migrating primary data 112 stored in the primary storage device(s) 104. The data agent 142 may receive control information from the storage manager 140, including commands to send copies of data objects and metadata to the media agents 144.

“In some embodiments, a data agent 142 may be distributed between the client computing device 102 and the storage manager 140 (and any other intermediate components), or it may be deployed from a remote location, or its functions may be approximated by a remote process that performs some or all of the functions of the data agent 142. A data agent 142 can also perform some functions provided by a media agent 144, or perform other functions such as encryption and deduplication.

Each data agent 142 can be specialized for a particular application 110. The system can employ multiple application-specific data agents 142, each of which may perform information management operations (e.g., backup, migration, and data recovery) associated with a different application 110. For example, different data agents 142 could be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Active Directory Objects data, Microsoft Windows file system data, and Microsoft SQL Server data.

A file system data agent, for example, may handle data files and/or other file system information. If a client computing device 102 has two or more types of data, a specialized data agent 142 can be used for each data type to back up, archive, migrate, and restore the data of that client computing device 102. For example, to back up, migrate, and/or restore all of the data on a Microsoft Exchange server, the client computing device 102 might use a Microsoft Exchange Mailbox data agent 142, a Microsoft Exchange Database data agent 142, a Microsoft Exchange Public Folder data agent 142, and a File System data agent 142. These specialized data agents 142 can be treated as four separate data agents 142, even though they all operate on the same client computing device 102.
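The one-agent-per-data-type arrangement in the Exchange example can be sketched as a simple registry lookup. The registry keys, the agent labels, and the `agents_for` helper are hypothetical and chosen only to mirror the four agents named in the text.

```python
# Hypothetical registry: data type -> specialized data agent label.
AGENTS_BY_DATA_TYPE = {
    "exchange_mailbox": "Exchange Mailbox data agent",
    "exchange_database": "Exchange Database data agent",
    "exchange_public_folder": "Exchange Public Folder data agent",
    "file_system": "File System data agent",
}


def agents_for(data_types):
    """Return one specialized agent per data type found on a client device."""
    unknown = [t for t in data_types if t not in AGENTS_BY_DATA_TYPE]
    if unknown:
        raise ValueError(f"no data agent registered for: {unknown}")
    return [AGENTS_BY_DATA_TYPE[t] for t in data_types]


# An Exchange server holding all four kinds of data needs four agents.
exchange_server_types = [
    "exchange_mailbox",
    "exchange_database",
    "exchange_public_folder",
    "file_system",
]
selected = agents_for(exchange_server_types)
```

Raising an error for an unregistered data type, rather than silently skipping it, is a design choice in this sketch: it makes unprotected data visible instead of leaving it out of a backup without notice.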
