Microsoft – Amey Vijaykumar Karandikar, Anand Vibhor, Mrityunjay UPADHYAY, Commvault Systems Inc

Abstract for “Management of log data”

A system according to certain aspects can improve the management of log data. The system may receive a log file containing log lines with information about computing operations. At least some of the log lines may contain a static portion and a variable portion. The system may process a first log line to extract its static portion and determine a first value for the first log line based on that static portion. The system may likewise process a second log line to extract its static portion and determine a second value for the second log line based on that static portion. The system may then compare the first and second values and, based on the comparison, organize the first and second log lines together for presentation.

Background for “Management of log data”

Businesses worldwide recognize the commercial value of their data and seek reliable, cost-effective ways to protect their information while minimizing the impact on productivity. Information protection is often part of the routine work performed within an organization. As part of a daily, weekly, or monthly maintenance schedule, a company may back up critical computing systems such as web servers, file servers, and databases. A company may also protect the computing systems used by each of its employees, such as those in an accounting, marketing, or engineering department.

Companies continue to look for innovative ways to manage data growth and protect data, given the ever-growing volume of data under their control. Companies often use migration techniques to move data to cheaper storage and data reduction techniques to reduce redundant data, prune lower priority data, and so forth. Companies increasingly see their stored data as an asset. Customers are increasingly looking for ways to not only manage and protect their data but also make use of it. Solutions that provide data analysis capabilities, information management and improved data presentation and accessibility features are increasingly in demand.

One issue with managing a computing environment is the log data that applications and systems in the environment can automatically generate as a result of the execution of transactions in that environment. These log data could include one or more log files that contain transaction information about a system or application. The logs can be viewed by users who monitor the computing environment to see information about transactions, such as whether certain transactions were successful or failed. If there is a catastrophic failure, log data can be replayed to show certain transactions.

Such log data can be very large. A system or application may generate log data for tens, hundreds, or thousands of transactions per second, and each transaction may result in multiple log entries being recorded in a log file. This can lead to millions, or even billions, of log entries that must be organized and processed for presentation to the user. Users cannot review that volume of log data by hand in any meaningful manner, so computer systems are used to organize log entries into a format that users can understand. Some computer systems sort log entries alphabetically or chronologically for presentation. Even when log entries are sorted in this manner, the user may have difficulty understanding the information: to determine the types of errors the system encountered, the user may still need to go through every log entry. Some log entries may be overlooked, which further reduces the utility of the log data. For example, if a log file for a particular application contains 1,000,000 out-of-memory errors, 500,000 database warnings, and 10 critical hacking attempts, the user is unlikely to spot the hacking attempts without already knowing to look for them.

It is therefore desirable to have a better way of organizing log data so that the information in the log data is easier to use.

A system according to certain aspects can improve the processing and organization of log data generated by applications, such as logs generated during data storage operations. The system can store log data generated by an application and organize them into one or more groups that can be presented to users. One example is that the system categorizes log entries by computing a fingerprint value per log entry and then assigning log entries to the appropriate groups based upon the calculated fingerprint values.

One aspect of this disclosure describes a method of processing log data. The method may include receiving a log file containing one or more log lines. The log lines may contain information about one or more computing operations, and at least some of the log lines may contain a static portion and a variable portion. The method may include processing a first log line to identify its static portion, extracting that static portion, and determining a first value for the first log line based on the extracted static portion. The method may likewise include processing a second log line to identify its static portion, extracting that static portion, and determining a second value for the second log line based on the extracted static portion. The method may also include comparing the first and second values and, based on the comparison, organizing the first log line and the second log line together for presentation to the user.
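The method above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the variable portion of a line consists of timestamps, hexadecimal values, and numbers, and that the value is a SHA-1 digest of what remains; the disclosure does not specify either choice. The names `VARIABLE_PATTERNS`, `static_portion`, `fingerprint`, and `group_lines` are hypothetical.

```python
import hashlib
import re
from collections import defaultdict

# Assumed patterns for the variable portion of a log line (illustrative only).
VARIABLE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}"),  # timestamps
    re.compile(r"0x[0-9a-fA-F]+"),                           # hexadecimal values
    re.compile(r"\d+"),                                      # plain numbers
]


def static_portion(line: str) -> str:
    """Return the line with the assumed variable tokens removed."""
    for pattern in VARIABLE_PATTERNS:
        line = pattern.sub("", line)
    # Collapse the whitespace left behind by the removed tokens.
    return " ".join(line.split())


def fingerprint(line: str) -> str:
    """Determine a value for the line from its static portion only."""
    return hashlib.sha1(static_portion(line).encode("utf-8")).hexdigest()


def group_lines(lines):
    """Organize together the log lines whose values compare equal."""
    groups = defaultdict(list)
    for line in lines:
        groups[fingerprint(line)].append(line)
    return dict(groups)


log_lines = [
    "2015-03-30 10:15:02 Out of memory in process 4211",
    "2015-03-30 10:15:07 Out of memory in process 977",
    "2015-03-30 10:16:44 Database warning: slow query took 5120 ms",
]
groups = group_lines(log_lines)
for value, members in groups.items():
    # One summary row per group: short value, member count, static text.
    print(value[:8], len(members), static_portion(members[0]))
```

Because only the static text feeds the digest, the two out-of-memory lines fall into one group even though their timestamps and process numbers differ. Presenting one row per group is what would keep a handful of rare entries from being buried under millions of repetitive ones.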

The method of the preceding paragraph can include any sub-combination of the following features: where the first value uniquely identifies the static text portion of the first log line; where the organizing involves extracting a variable portion from a log line and determining a third value based on the extracted variable portion; where the first log line is added to a sub-group of a group of log lines based on the values determined for that line; and where the static portion indicates the type of transaction associated with the log line.

Another aspect provides a system for processing log data. The system may include a computing device comprising computer hardware. The computing device can be configured to receive a log data file containing one or more log lines. The log lines may include information relating to one or more computing operations, and at least some of the log lines may include a static portion and a variable portion. The computing device can be configured to process a first log line to identify its static portion, extract the static portion from the first log line, and determine a first value for the first log line based on the extracted static portion. The computing device can likewise be configured to process a second log line to identify its static portion, extract the static portion from the second log line, and determine a second value for the second log line based on the extracted static portion. The computing device can be configured to compare the first and second values and, based on the comparison, organize the first log line and the second log line together for presentation to the user.

The system of the preceding paragraph can include any combination of the following features: where the first value uniquely identifies the static text portion of the first log line; where the computing device is configured to organize the first log line and the second log line at least by adding them to a group of log lines based on the determined values; and where the computing device is configured to determine the first and second values at least by extracting the static portion from each log line, including extracting a first set of one or more trigrams from the first log line and a second set of one or more trigrams from the second log line.
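The text mentions extracting trigrams from a log line but does not say what kind of trigram is meant or how the resulting sets are compared. As a hedged illustration only, the sketch below assumes character trigrams over lowercased text and uses Jaccard similarity as the comparison; both choices, and the names `trigrams` and `jaccard`, are assumptions rather than the disclosed technique.

```python
def trigrams(text: str) -> set:
    """Return the set of character trigrams in text (assumed granularity)."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a: set, b: set) -> float:
    """Share of trigrams common to both sets (assumed comparison measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


first = trigrams("Out of memory in process")
second = trigrams("Out of memory in thread")
third = trigrams("Database warning: slow query")

# Static text that differs by one word scores far higher than unrelated text.
print(round(jaccard(first, second), 2))
print(round(jaccard(first, third), 2))
```

A similarity measure of this kind tolerates small wording differences between static portions, whereas an exact digest comparison treats any difference as a different group.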

Another aspect provides a non-transitory computer-readable medium containing code that, when executed by an apparatus, causes the apparatus to perform a process that includes receiving a log file containing one or more log lines. The log lines can contain information about one or more computing operations, and at least some of the log lines may include a static portion and a variable portion. The process may include processing a first log line to identify its static portion, extracting that static portion, and determining a first value for the first log line based on the extracted static portion. The process may likewise include processing a second log line to identify its static portion, extracting that static portion, and determining a second value for the second log line based on the extracted static portion. The process may also include comparing the first and second values and, based on the comparison, organizing the first log line and the second log line together for presentation to the user.

For purposes of summarizing the disclosure, certain aspects, advantages, and novel features of the inventions have been described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment. Thus, the invention may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other advantages.

According to certain embodiments, a method of processing log data is provided. The method may include receiving, by one or more computing devices comprising computer hardware, a log file containing one or more log lines. The method may also include extracting a static portion from at least one of the log lines, determining a first value for that log line based on the extracted static portion, and processing the log line according to the determined first value.

Systems and methods are disclosed for improving the management and storage of log data (e.g., log data files containing application or system error messages). These systems and methods are described in further detail with reference to FIGS. 2-7 and FIGS. 1A-1H.

With the growing importance of protecting and leveraging data, organizations simply cannot afford to lose critical data. Runaway data growth and other modern realities make protecting and managing data increasingly difficult. There is therefore a need for efficient, powerful, and user-friendly solutions for protecting and managing data.

Depending on the size of the organization, there may be many data production sources under the control of tens, hundreds, or even thousands of employees. In the past, individual employees were sometimes responsible for managing and protecting their own data. In other cases, a patchwork of hardware and software point solutions was used. These solutions were often provided by different vendors and had limited or no interoperability.

Certain embodiments described herein provide systems and methods capable of addressing these and other shortcomings of prior approaches by implementing unified, organization-wide information management. FIG. 1A shows one such information management system 100, which generally includes combinations of hardware and software configured to protect and manage data and metadata generated by the various computing devices in the information management system 100. The organization that employs the information management system 100 may be a company or other business entity, an educational institution, a household, a governmental agency, or the like.

Generally, the systems described herein may be compatible with and/or provide some or all of the functionality of the systems described in one or more U.S. patents and patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated by reference herein in its entirety.

The information management system 100 can include a variety of computing devices. For instance, as will be discussed in more detail, the information management system 100 can include one or more client computing devices 102 and secondary storage computing devices 106.

Computing devices can include, without limitation, one or more of the following: workstations, personal computers, desktop computers, or other types of generally fixed computing systems such as mainframe computers and minicomputers. Other computing devices can include mobile or portable computing devices, such as laptops, tablet computers, personal data assistants, mobile phones (such as smartphones), and other mobile or portable computing devices such as embedded computers, set top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

In some cases, a computing device includes virtualized and/or cloud computing resources. For instance, a third-party cloud service provider may provide one or more virtual machines to the organization. In other cases, computing devices can include one or more virtual machines running on a physical host computing device (or “host machine”). As one example, the organization may use one virtual machine as a database server and another virtual machine as a mail server, with both virtual machines running on the same host machine.

A virtual machine includes an operating system and associated virtual resources and is hosted on a physical host computer (or host machine). A hypervisor, which is typically software and is also known as a virtual machine monitor or virtual machine manager (“VMM”), acts as an interface between the virtual machine and the hardware of the host machine. ESX Server, by VMware, Inc. of Palo Alto, Calif., is one example of a hypervisor used for virtualization. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation, and Sun xVM, by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be implemented in hardware or firmware.

The hypervisor provides the operating system of each virtual machine with virtual resources, such as a virtual processor, virtual memory, and virtual network devices. Each virtual machine can have one or more virtual disks. The hypervisor typically stores the data of virtual disks in files on the file system of the physical host machine. These files are called virtual machine disk image files (in the case of Microsoft virtual servers) or virtual machine disk files (in the case of VMware virtual servers). For example, VMware’s ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine disk files. A virtual machine reads data from and writes data to its virtual disk in much the same way that a physical machine reads from and writes to a physical disk.

Examples of information management techniques in cloud computing environments are described in U.S. Pat. No. 8,285,681, which is incorporated by reference herein. Examples of techniques for managing information in virtualized computing environments are described in U.S. Pat. No. 8,307,177, also incorporated by reference herein.

The information management system 100 can also include a variety of storage devices, including primary storage devices 104 and secondary storage devices 108, for example. Storage devices can generally be of any suitable type, including, without limitation, hard-disk arrays, semiconductor memory (e.g., solid state storage), network-attached storage (NAS), tape libraries or other magnetic, non-tape storage devices, optical media storage devices, DNA/RNA-based memory technology, and combinations of the same. In some instances, storage devices form part of a distributed storage system. In some cases, storage devices are provided in a cloud, such as a private cloud or one managed by a third-party vendor. A storage device in some cases comprises a disk array or a portion thereof.

The illustrated information management system 100 includes one or more client computing devices 102 that execute at least one application 110, and one or more primary storage devices 104 that store primary data 112. In some cases, the client computing device(s) 102 and the primary storage devices 104 may be referred to as a primary storage subsystem 117. A computing device in the information management system 100 that has a data agent 42 installed and running on it is generally referred to as a client computing device 102 (or, in the context of a component of the information management system 100, simply as a “client”).

The meaning of the term “information management system” depends on the context. In some cases, it refers to all of the software and hardware components. In other cases, it refers to only a subset of those components.

In some cases, the information management system 100 refers to a collection of components that protect, move, and manage data and metadata generated by the client computing devices 102. In that usage, the information management system 100 does not necessarily include the components that generate and/or store the primary data 112, such as the client computing devices 102 themselves, the applications 110, and the primary storage devices 104. As an example, the term “information management system” may sometimes refer to one or more of the following components and their corresponding data structures: storage managers, data agents, and media agents. These components are described in further detail below.

“Client Computing Devices.”

An organization typically has a variety of sources that produce data to be protected and managed. As just one example, in a corporate environment such data sources can be employee workstations and company servers such as mail servers, web servers, database servers, transaction servers, and the like. In the information management system 100, the data sources include the one or more client computing devices 102.

The client computing devices 102 can include any of the types of computing devices described above and, in some cases, are associated with one or more users and/or corresponding user accounts of employees or other individuals.

The information management system 100 generally addresses and handles the data management and protection needs for the data generated by the client computing devices 102. However, the use of this term does not imply that the client computing devices 102 cannot be “servers” in other respects. For instance, a particular client computing device 102 may act as a server with respect to other devices, such as other client computing devices 102. As just a few examples, the client computing devices 102 can include mail servers, file servers, database servers, and web servers.

Each client computing device 102 may have one or more applications 110 (e.g., software applications) executing on it. These applications generate and manipulate data that is to be protected from loss and managed. The applications 110 generally facilitate the operations of an organization (or multiple affiliated organizations). They can include, without limitation, file server applications, mail server applications (e.g., Microsoft Exchange Server), mail client applications (e.g., Microsoft Exchange Client), database applications (e.g., SQL, Oracle, SAP, Lotus Notes Database), word processing applications (e.g., Microsoft Word), spreadsheet applications, financial applications, presentation applications, graphics applications, web applications, mobile applications, and entertainment applications.

The client computing devices 102 can have at least one operating system installed (e.g., Microsoft Windows, Mac OS X, iOS, IBM z/OS, Linux, or other Unix-based operating systems). The operating system may support or host one or more file systems and other applications 110 installed on the client computing devices 102.

The client computing devices 102 and other components of the information management system 100 can be connected to one another via one or more communication pathways 114. For example, a first communication pathway 114 may connect a client computing device 102 and a secondary storage computing device 106; a second communication pathway 114 may connect the storage manager 140 and a client computing device 102; and a third communication pathway 114 may connect the storage manager 140 and a secondary storage computing device 106 (see, e.g., FIG. 1A and FIG. 1C). In some cases, the communication pathways 114 may also include application programming interfaces (APIs), such as cloud service provider APIs and virtual machine management APIs. The infrastructure underlying the communication pathways 114 can be wired and/or wireless, analog and/or digital, or any combination thereof. The infrastructure may also be private, public, third-party provided, or any combination thereof.

“Primary Data, Exemplary Primary Storage Devices”

According to some embodiments, primary data 112 is production data or other “live” data generated by the operating system and/or applications 110 executing on a client computing device 102. Primary data 112 is generally stored on the primary storage device(s) 104 and is organized via a file system supported by the client computing device 102. For instance, the client computing device(s) 102 and corresponding applications 110 may create, access, modify, write, delete, and otherwise use primary data 112. In some cases, some or all of the primary data 112 can be stored in cloud storage resources.

Primary data 112 is generally in the native format of the source application 110. According to certain aspects, primary data 112 is an initial or first stored copy of data generated by the source application 110 (e.g., created before any other copies, or before at least one other copy). In some cases, primary data 112 is created substantially directly from the data generated by the corresponding source application 110.

The primary storage devices 104 that store the primary data 112 can be relatively fast and/or expensive (e.g., disk drives, hard-disk arrays, solid state memory, etc.). In addition, primary data 112 can be highly changeable and/or intended for relatively short-term retention (e.g., hours, days, or weeks).

According to some embodiments, the client computing devices 102 can access primary data 112 from the primary storage device 104 by making conventional file system calls via the operating system. Primary data 112 may include structured data, semi-structured data, and/or unstructured data. Some specific examples are described below with respect to FIG. 1B.

It can be useful in performing certain tasks to organize the primary data 112 into units of different granularities. In general, primary data 112 can include files, directories, file system volumes, data blocks, extents, or any other hierarchies or organizations of data objects. As used herein, a “data object” can refer to either (1) any file that is currently addressable by a file system or that was previously addressable by the file system (e.g., an archive file), or (2) a subset of such a file (e.g., a data block).

As will be described in further detail, it can also be useful in performing certain functions of the information management system 100 to access and modify metadata within the primary data 112. Metadata generally includes information about data objects or characteristics associated with the data objects. For simplicity, any reference herein to primary data 112 generally also includes its associated metadata, but references to the metadata do not include the primary data.

Metadata can include, without limitation, one or more of the following: the data owner (e.g., the client or user that generated the data object), the last modified time (e.g., the time of the most recent modification of the data object), the file size (e.g., a number of bytes of data), information about the content (e.g., an indication of the existence of a particular search term), user-supplied tags, to/from information for email (e.g., an email sender, recipient, etc.), the creation date, the file type (e.g., format or application type), the last accessed time, the application type (e.g., the type of application that generated the data object), location/network information (e.g., a current, past, or future location of the data object and network pathways to and from it), partition layouts, file location within a file folder directory structure, user permissions, owners, groups, access control lists (ACLs), and system metadata.
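As a rough illustration of a few of the metadata items listed above, the following Python sketch gathers the name, size, last modified time, last accessed time, and file type of a single file using the standard `os.stat` call. The `ObjectMetadata` record and the `collect_metadata` helper are hypothetical names chosen for this example and are not components of the described system.

```python
import os
import tempfile
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ObjectMetadata:
    """Hypothetical record holding a few file-level metadata items."""
    name: str
    size_bytes: int
    last_modified: datetime
    last_accessed: datetime
    file_type: str


def collect_metadata(path: str) -> ObjectMetadata:
    """Read file system metadata for one data object (here, a file)."""
    info = os.stat(path)
    return ObjectMetadata(
        name=os.path.basename(path),
        size_bytes=info.st_size,
        last_modified=datetime.fromtimestamp(info.st_mtime, tz=timezone.utc),
        last_accessed=datetime.fromtimestamp(info.st_atime, tz=timezone.utc),
        # Use the extension as a stand-in for the format or application type.
        file_type=os.path.splitext(path)[1].lstrip(".") or "unknown",
    )


# Create a small throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "report.txt")
    with open(sample, "w", encoding="utf-8") as handle:
        handle.write("hello")
    meta = collect_metadata(sample)
    print(meta)
```

Items such as the data owner, user-supplied tags, or email sender and recipient are not available from `os.stat` alone and would have to come from the operating system's account database or from the application itself.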

In addition to metadata generated by or related to file systems and operating systems, some of the applications 110 and/or other components of the information management system 100 maintain indices of metadata for data objects, for example metadata associated with individual email messages. Thus, each data object may be associated with corresponding metadata. The use of metadata to perform classification and other functions is described in greater detail below.

Each client computing device 102 is generally associated with and/or in communication with one or more of the primary storage devices 104 storing the corresponding primary data 112. A client computing device 102 may be considered to be “associated with” or “in communication with” a primary storage device 104 if it is capable of one or more of the following: routing and/or storing data (e.g., primary data 112) to the particular primary storage device 104; coordinating the routing and/or storing of data to the particular primary storage device 104; retrieving data from the particular primary storage device 104; coordinating the retrieval of data from the particular primary storage device 104; and modifying and/or deleting data retrieved from the particular primary storage device 104.

The primary storage devices 104 can include any of the types of storage devices described above, or some other kind of suitable storage device. The primary storage devices 104 may have relatively fast I/O times and/or be relatively expensive in comparison to the secondary storage devices 108. For example, the information management system 100 may generally access data and metadata stored on the primary storage devices 104 relatively frequently, whereas data and metadata stored on the secondary storage devices 108 is accessed relatively less frequently.

A primary storage device 104 may be dedicated or shared. In some cases, each primary storage device 104 is dedicated to an associated client computing device 102. For instance, in one embodiment a primary storage device 104 is a local disk drive of a corresponding client computing device 102. In other cases, one or more primary storage devices 104 can be shared by multiple client computing devices 102, e.g., via a network, such as in a cloud storage implementation. As one example, a primary storage device 104 can be a disk array shared by a group of client computing devices 102, such as one of the following types of disk arrays: EMC Clariion, EMC Symmetrix, and EMC Celerra.

The information management system 100 may also include hosted services (not shown), which may be hosted in some cases by an entity other than the organization that employs the other components of the information management system 100. For instance, the hosted services may be provided by various online service providers to the organization. Such service providers can provide services including social networking services, hosted email services, and hosted productivity applications. Hosted services may include software-as-a-service (SaaS), platform-as-a-service (PaaS), application service providers (ASPs), cloud services, or other mechanisms for delivering functionality via a network. As it provides services to users, each hosted service may generate additional data and metadata under management of the information management system 100 (e.g., as primary data 112). In some cases, the hosted services may be accessed using one of the applications 110. As an example, a hosted mail service may be accessed via a browser running on a client computing device 102. The hosted services may be implemented in a variety of computing environments. In some cases, they are implemented in an environment having an architecture similar to that of the information management system 100, where various physical and logical components are distributed over a network.

“Secondary copies and Exemplary Secondary Storage Devices”

The primary data 112 stored on the primary storage devices 104 may be compromised in some cases, such as when an employee deliberately or accidentally deletes or overwrites primary data 112 during the normal course of work. The primary storage devices 104 can also be damaged, lost, or otherwise corrupted. For recovery and/or regulatory compliance purposes, it is therefore useful to generate copies of the primary data 112. Accordingly, the information management system 100 includes one or more secondary storage computing devices 106 and one or more secondary storage devices 108 configured to create and store one or more secondary copies 116 of the primary data 112 and associated metadata. The secondary storage computing devices 106 and the secondary storage devices 108 may sometimes be referred to as a secondary storage subsystem 118.

Creation of secondary copies 116 can help in search and analysis efforts and in meeting other information management goals, such as restoring data and/or metadata if a primary version (e.g., of primary data 112) is lost through deletion, corruption, or natural disaster, and allowing point-in-time recovery.

The client computing devices 102 access or receive primary data 112 and communicate the data, e.g., over one or more communication pathways 114, for storage in the secondary storage device(s) 108.

A secondary copy 116 can comprise a separate stored copy of application data that is derived from one or more earlier-created, stored copies (e.g., derived from primary data 112 or from another secondary copy 116). Secondary copies 116 can include point-in-time data and may be intended for relatively long-term retention (e.g., weeks, months, or years) before some or all of the data is moved to other storage or is discarded.

In some cases, a secondary copy 116 is a copy of application data created and stored subsequent to at least one other stored instance (e.g., subsequent to corresponding primary data 112 or to another secondary copy 116), in a different storage device than at least one previously stored copy, and/or remotely from at least one previously stored copy. In some other cases, secondary copies can be stored in the same storage device as primary data 112 and/or other previously stored copies. For example, in one embodiment a disk array capable of performing hardware snapshots stores primary data 112 and creates and stores hardware snapshots of the primary data 112 as secondary copies 116. Secondary copies 116 may be stored in relatively slow and/or low-cost storage, such as magnetic tape. A secondary copy 116 may be stored in a backup or archive format, or in some other format different from the native source application format or other primary data format.

In some cases, secondary copies 116 are indexed so that users can browse and restore them at a later time. After a secondary copy 116 representing certain primary data 112 has been created, a pointer or other location indicator (e.g., a stub) may be placed in the primary data 112, or otherwise associated with the primary data 112, to indicate the current location of the secondary copy 116 on the secondary storage device(s) 108.

Since an instance of a data object or metadata in primary data 112 may change over time as it is modified by an application 110 (or by the operating system), the information management system 100 may create and manage multiple secondary copies 116 of a particular data object or metadata, each representing the state of the data object in primary data 112 at a particular point in time. Moreover, since an instance of a data object in primary data 112 may eventually be deleted from the primary storage device 104 and the file system, the information management system 100 may continue to manage point-in-time representations of that data object even though the instance in primary data 112 no longer exists.
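The idea of keeping several point-in-time copies of one data object, and still being able to return them after the primary instance is gone, can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions: copies are kept in memory, copy times are plain integers, and the `PointInTimeStore` class and its methods are hypothetical names.

```python
import bisect


class PointInTimeStore:
    """Keeps every secondary copy of an object, ordered by copy time."""

    def __init__(self):
        self._times = {}   # object name -> sorted list of copy times
        self._copies = {}  # object name -> list of contents, same order

    def create_copy(self, name, content, copy_time):
        """Record the state of the object as of copy_time."""
        times = self._times.setdefault(name, [])
        copies = self._copies.setdefault(name, [])
        index = bisect.bisect_right(times, copy_time)
        times.insert(index, copy_time)
        copies.insert(index, content)

    def restore(self, name, as_of):
        """Return the latest copy made at or before as_of, or None."""
        times = self._times.get(name, [])
        index = bisect.bisect_right(times, as_of)
        if index == 0:
            return None
        return self._copies[name][index - 1]


primary = {"report.txt": "draft 1"}
store = PointInTimeStore()
store.create_copy("report.txt", primary["report.txt"], copy_time=100)

primary["report.txt"] = "draft 2"           # the application modifies the object
store.create_copy("report.txt", primary["report.txt"], copy_time=200)

del primary["report.txt"]                   # the primary instance is deleted

print(store.restore("report.txt", as_of=150))  # state as of time 150
print(store.restore("report.txt", as_of=250))  # state as of time 250
```

The two restores return "draft 1" and "draft 2" respectively, even though the object no longer exists in the `primary` dictionary.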

For virtualized computing devices, the operating system and other applications 110 of the client computing device(s) 102 may execute within or under the management of virtualization software (e.g., a VMM), and the primary storage device(s) 104 may comprise a virtual disk created on a physical storage device. The information management system 100 may create secondary copies 116 of the files or other data objects in a virtual disk file and/or secondary copies 116 of the entire virtual disk file itself (e.g., of an entire .vmdk file).

Secondary copies 116 may be distinguished from the corresponding primary data 112 in a variety of ways, some of which will now be described. First, as discussed, secondary copies 116 can be stored in a different format (e.g., a backup, archive, or other non-native format) than primary data 112. For this or other reasons, secondary copies 116 may not be directly usable by the applications 110 of the client computing devices 102.

In certain embodiments, secondary copies 116 may also be stored on a secondary storage device 108 that is inaccessible to the applications 110 running on the client computing devices 102 (and/or hosted services). Some secondary copies 116 may be “offline copies,” in that they are not readily available (e.g., not mounted to tape or disk). Offline copies can include copies of data that the information management system 100 can access without human intervention.

“The Use Of Intermediate Devices to Create Secondary Copies”

Creating secondary copies can be challenging. There can be hundreds of client computing devices 102 generating large amounts of primary data 112 that must be protected. Creating secondary copies 116 can also involve significant overhead. In addition, secondary storage devices 108 may be special-purpose components, so interacting with them may require specialized intelligence.

In certain cases, the client computing devices 102 interact directly with the secondary storage devices 108 to create the secondary copies 116. This approach, however, can negatively affect the ability of the client computing devices 102 to serve the applications 110 and produce primary data 112. Moreover, the client computing devices 102 may not be optimized for interaction with the secondary storage devices 108.

In some embodiments, the information management system 100 includes one or more software and/or hardware components that act as intermediaries between the client computing devices 102 and the secondary storage devices 108. In addition to off-loading certain responsibilities from the client computing devices 102, these intermediate components can provide other benefits. For instance, as discussed further with respect to FIG. 1D, distributing some of the work required to create secondary copies 116 can increase scalability.

The intermediate components may include one or more secondary storage computing devices 106, as shown in FIG. 1A, and/or one or more media agents, which can be software modules operating on the secondary storage computing devices 106 or on other suitable computing devices. Media agents are discussed below (e.g., with respect to FIGS. 1C-1E).

The secondary storage computing device(s) 106 can include any of the computing devices described above. In some cases, the secondary storage computing device(s) 106 include specialized hardware and/or software components for interfacing with the secondary storage devices 108.

To create a secondary copy 116, which involves copying data from the primary storage subsystem 117 to the secondary storage subsystem 118, the client computing device 102 in some embodiments communicates the primary data 112 (or a processed copy thereof) to the designated secondary storage computing device 106 via the communication path 114. The secondary storage computing device 106 in turn conveys the received data (or a processed version thereof) to the secondary storage device 108. In some cases, the communication path 114 between the client computing device 102 and the secondary storage computing device 106 is part of a LAN, WAN, or SAN. In other cases, at least some client computing devices 102 communicate directly with the secondary storage devices 108 (e.g., via Fibre Channel or SCSI connections). In still other cases, one or more secondary copies 116 are created from existing secondary copies, as in an auxiliary copy operation.

“Exemplary Secondary Data and Exemplary Primary Data”

In the illustrated example, the primary storage device(s) 104 contain primary data objects including word processing documents 119A-B, spreadsheets 120, presentation files 122, video files 124, image files 126, email mailboxes 128 (and corresponding email messages 129A-C), HTML/XML files 130, databases 132, and corresponding tables or other data structures 133A-133C.

Some or all of the primary data objects are associated with corresponding metadata (e.g., “Meta1-11”), which may include file system metadata and/or application-specific metadata. Stored on the secondary storage device(s) 108 are secondary copy data objects 134A-C, which may include copies of, or otherwise represent, corresponding primary data objects and metadata.

As shown, the secondary copy data objects 134A-C can each represent more than one primary data object. For example, secondary copy data object 134A represents three separate primary data objects 133C, 122, and 129C (represented as 133C′, 122′, and 129C′, respectively, and accompanied by the corresponding metadata Meta11, Meta3, and Meta8, respectively). As indicated by the prime mark (′), a secondary copy data object may store a representation of a primary data object and/or its metadata in a format different from the original. Likewise, secondary data object 134B represents primary data objects 120, 130B, and 119A, accompanied by the corresponding metadata Meta2, Meta10, and Meta1, respectively. Secondary data object 134C represents primary data objects 130A, 119B, and 129A, accompanied by corresponding metadata including Meta9 and Meta5.
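A minimal sketch of one secondary copy data object holding several primary data objects, each with its metadata and each stored in a non-native form, might look as follows. The `StoredItem` type and function names are hypothetical, and zlib compression stands in for whatever backup or archive format an embodiment actually uses.

```python
import zlib
from dataclasses import dataclass


@dataclass
class StoredItem:
    source_id: str   # identifier of the primary data object, e.g. "133C"
    metadata: dict   # metadata carried along with the object
    payload: bytes   # representation in a modified (here: compressed) format


def build_secondary_object(primaries):
    """Bundle (source_id, metadata, raw_bytes) triples into one secondary object."""
    return [StoredItem(sid, dict(meta), zlib.compress(raw)) for sid, meta, raw in primaries]


def restore(item):
    """Recover the original bytes of one primary data object."""
    return zlib.decompress(item.payload)
```

The stored payload differs from the original bytes, which is why such a copy is not directly usable by the originating application until it is restored.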

“Exemplary Information Management System Architecture”

The information management system 100 can incorporate a variety of hardware and software components, which can be organized in many different ways depending on the embodiment. Specifying the functional responsibilities and role of each component in the information management system 100 involves critical design choices. As discussed below, these choices can significantly affect performance and the ability of the information management system 100 to adapt to data growth and other changing circumstances.

FIG. 1C shows an information management system 100 that includes: a storage manager 140, which is a centralized storage and/or information manager configured to perform certain control functions; one or more data agents 142 executing on the client computing device(s) 102 and configured to process primary data 112; and one or more media agents 144 executing on the secondary storage computing device(s) 106 for performing tasks involving the secondary storage devices 108. While distributing functionality among multiple computing devices can have certain advantages, in other cases it is more beneficial to consolidate functionality on the same computing device. Accordingly, in various other embodiments, some or all of the components shown in FIG. 1C as residing on separate computing devices are implemented on the same computing device. In one configuration, a storage manager 140, one or more data agents 142, and one or more media agents 144 are all implemented on the same computing device. In another embodiment, one or more data agents 142 and one or more media agents 144 are implemented on the same computing device, while the storage manager 140 is implemented on a separate computing device. These configurations are examples and are not limiting.

“Storage Manager”

As noted, the number of components in the information management system 100 and the amount of data under management can be quite large. Managing the components and data is therefore a significant task, and one that can become more difficult as the number of components and the volume of data grow to meet the needs of the organization. For these and other reasons, according to certain embodiments, responsibility for controlling the information management system 100 is allocated to the storage manager 140. By distributing control functionality in this manner, the storage manager 140 can be adapted independently as circumstances change. Moreover, a computing device for hosting the storage manager 140 can be selected to best suit the functions of the storage manager 140. These and other advantages are described in further detail with respect to FIG. 1D.

The storage manager 140 may be a software module or other application which, in certain embodiments, operates in conjunction with one or more associated data structures (e.g., a dedicated database such as management database 146). In some embodiments, the storage manager 140 is a computing device that executes computer instructions. The storage manager 140 generally initiates, performs, coordinates, and/or controls storage and other information management operations performed by the information management system 100, e.g., to protect and control the primary data 112 and the secondary copies 116 of data and metadata. In general, the storage manager 140 is responsible for managing the information management system 100, which includes managing its constituent components (e.g., data agents and media agents).

As indicated by the dashed arrowed lines 114 in FIG. 1C, the storage manager 140 can communicate with and/or control some or all elements of the information management system 100, such as the data agents 142 and the media agents 144. In certain embodiments, control information originates from the storage manager 140 and status reporting is transmitted to the storage manager 140 by the various managed components, whereas payload data and payload metadata are generally communicated between the data agents 142 and the media agents 144 (or otherwise between the client computing device(s) 102 and the secondary storage computing device(s) 106), e.g., at the direction and under the management of the storage manager 140. Control information can generally include parameters and instructions for carrying out information management operations, such as instructions to perform a task associated with an operation, timing information specifying when to initiate a task, and data path information specifying which components to communicate with or access in carrying out the operation. Payload data, on the other hand, can include the actual data involved in the storage operation, such as content data written to a secondary storage device 108 in a secondary copy operation. Payload metadata can include any of the types of metadata described herein, and may be written to a storage device along with the payload content data (e.g., in the form of a header).

In some embodiments, certain information management operations are controlled by other components of the information management system 100 (e.g., the media agent(s) 144 or data agent(s) 142), in addition to or in combination with the storage manager 140.

According to certain embodiments, the storage manager 140 provides one or more of the functions described below.

The storage manager 140 may maintain a database 146 (or “storage manager database 146” or “management database 146”) of management-related data and information management policies 148. The database 146 may include a management index 150 (or “index 150”) or other data structure that stores logical associations between components of the system, user preferences and/or profiles (e.g., preferences regarding encryption, compression, or deduplication; preferences regarding the scheduling, type, or other aspects of operations; mappings of information management users to specific computing devices or other components), management tasks, media containerization, and other useful data. For example, the storage manager 140 may use the index 150 to track logical associations between media agents 144 and secondary storage devices 108, and/or the movement of data from primary storage devices 104 to secondary storage devices 108. For instance, the index 150 may store data associating a client computing device 102 with a particular media agent 144 and/or secondary storage device 108, as specified in an information management policy 148, described below.

Administrators and other individuals may be able to configure and initiate certain information management operations individually. While this may be acceptable for some recovery operations and other infrequently performed tasks, it is often not workable for ongoing organization-wide data protection and management. The information management system 100 can therefore use information management policies 148 to specify and execute information management operations (e.g., on an automated basis). In general, an information management policy 148 can include a data structure or other information source that specifies a set of parameters (e.g., criteria and rules) associated with storage or other information management operations.
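As a rough illustration of a policy expressed as a data structure, the sketch below models a policy as a small immutable record and shows how a scheduler might select and evaluate policies. All field names (`applies_to`, `interval_hours`, `destination`, and so on) and both helper functions are assumptions made for illustration; the source does not define the policy schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformationManagementPolicy:
    name: str
    applies_to: frozenset    # data types the policy covers (hypothetical field)
    interval_hours: int      # how often a secondary copy should be made
    destination: str         # label of the target secondary storage
    compress: bool = True
    encrypt: bool = False


def matching_policies(policies, data_type):
    """Return the policies whose criteria cover the given data type."""
    return [p for p in policies if data_type in p.applies_to]


def is_due(policy, hours_since_last_copy):
    """Apply the policy's scheduling rule to decide whether a copy is due."""
    return hours_since_last_copy >= policy.interval_hours
```

Keeping the rules in data rather than in code is what lets operations run on an automated basis: a jobs component only needs to evaluate the stored parameters.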

The storage manager database 146 may maintain the information management policies 148 and associated data, although the information management policies 148 can be stored in any appropriate location. For instance, an information management policy 148 such as a storage policy may be stored as metadata in a media agent database 152 or in a secondary storage device 108 (e.g., as an archive copy) for use in restore operations or other information management operations, depending on the embodiment. Information management policies 148 are described further below.

According to some embodiments, the storage manager database 146 comprises a relational database (e.g., an SQL database) for tracking metadata, such as metadata associated with secondary copy operations (e.g., which client computing devices 102 were involved and the corresponding data). This and other metadata may additionally be stored in other locations, such as at the secondary storage computing devices 106 or on the secondary storage devices 108, allowing data recovery without the use of the storage manager 140 in certain cases.

As shown, the storage manager 140 may include a jobs agent 156, a user interface 158, and a management agent 154, all of which may be implemented as interconnected software modules or application programs.

In some embodiments, the jobs agent 156 initiates, controls, and/or monitors the status of some or all storage or other information management operations that were previously performed, are currently being performed, or are scheduled to be performed by the information management system 100. For instance, the jobs agent 156 may access information management policies 148 to determine when and how to initiate and control secondary copy and other information management operations.

The user interface 158 may include information processing and display software, such as a graphical user interface (“GUI”) or an application program interface (“API”). Via the user interface 158, users may optionally issue instructions to the components of the information management system 100 regarding the performance of storage and recovery operations. For example, a user may modify a schedule concerning the number of pending secondary copy operations. As another example, a user may use the GUI to view the status of pending storage operations or to monitor certain components of the information management system 100 (e.g., the remaining capacity of a storage device).

An “information management cell” (or “storage operation cell” or “cell”) may generally include a logical and/or physical grouping of a combination of hardware and software components associated with performing information management operations on electronic data, typically one storage manager 140 and at least one client computing device 102 (comprising data agent(s) 142) and at least one media agent 144. For instance, the components shown in FIG. 1C may together form an information management cell. Multiple cells may be organized hierarchically. With this configuration, cells may inherit properties from hierarchically superior cells or be controlled by other cells in the hierarchy (automatically or otherwise). In some embodiments, cells may inherit or otherwise be associated with information management policies, preferences, information management metrics, or other properties or characteristics according to their relative position in a hierarchy of cells. Cells may also be delineated and/or organized hierarchically according to function, geography, architectural considerations, or other factors useful or desirable in performing information management operations. A first cell may represent a geographic segment of an enterprise, such as a Chicago office, and a second cell may represent a different geographic segment, such as a New York office. Other cells may represent departments within a particular office. Where delineated by function, a first cell may perform one or more first types of information management operations (e.g., one or more first types of secondary or other copies), and a second cell may perform one or more second types of information management operations.
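Inheritance within a hierarchy of cells can be sketched as a simple parent lookup. The `Cell` class, the `effective` method, and the example setting names are hypothetical and are used only to illustrate the idea of a cell taking a property from a hierarchically superior cell unless it defines its own.

```python
class Cell:
    """Hypothetical information management cell in a hierarchy."""

    def __init__(self, name, parent=None, **settings):
        self.name = name
        self.parent = parent
        self.settings = settings  # properties defined directly on this cell

    def effective(self, key, default=None):
        # Walk up the hierarchy until some cell defines the property.
        cell = self
        while cell is not None:
            if key in cell.settings:
                return cell.settings[key]
            cell = cell.parent
        return default
```

For example, a Chicago cell created with `Cell("chicago", parent=enterprise, retention_days=90)` would report its own `retention_days` while inheriting every property it does not override from `enterprise`.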

The storage manager 140 may also track information that permits it to select, designate, or otherwise identify content indexes, deduplication databases, or similar resources or data sets within its information management cell (or another cell) to be searched in response to certain queries. Such queries may be entered by the user via the user interface 158. The management agent 154 allows multiple information management cells to communicate with one another. For example, the information management system 100 in some cases may be one information management cell of a network of multiple cells adjacent to one another or otherwise logically related in a WAN or LAN. With this arrangement, the cells may be connected to one another through their respective management agents 154.

For instance, the management agent 154 can provide the storage manager 140 with the ability to communicate with other components within the information management system 100 (and/or other cells within a larger information management system) via network protocols and application programming interfaces (“APIs”) including, e.g., HTTP, HTTPS, FTP, REST, and virtualization software APIs. Inter-cell communication and hierarchy are described in greater detail in U.S. Pat. Nos. 7,747,579 and 7,343,453, which are incorporated by reference herein.

“Data Agents”

As discussed, a variety of different types of applications 110 can reside on a given client computing device 102, including operating systems, database applications, e-mail applications, and virtual machines, to name just a few. As part of the process of creating and restoring secondary copies 116, the client computing devices 102 may be tasked with processing and preparing the primary data 112 from these different applications 110. Moreover, the nature of the processing/preparation can differ across clients and application types, e.g., due to inherent structural and formatting differences among applications 110.

The one or more data agents 142 are therefore advantageously configured in certain embodiments to assist in the performance of information management operations based on the type of data being protected, at a client-specific and/or application-specific level.

The data agent 142 may be a software module or component that is generally responsible for managing, initiating, or otherwise assisting in the performance of information management operations in the information management system 100, generally as directed by the storage manager 140. For instance, the data agent 142 may take part in performing data storage operations such as copying, archiving, and migrating primary data 112 stored in the primary storage device(s) 104. The data agent 142 may receive control information from the storage manager 140, such as commands to transfer copies of data objects and metadata to the media agents 144.

In some embodiments, a data agent 142 may be distributed between the client computing device 102 and the storage manager 140 (and any other intermediate components), or it may be deployed from a remote location or have its functions approximated by a remote process that performs some or all of the functions of the data agent 142. In addition, a data agent 142 may perform some functions provided by a media agent 144, or may perform other functions such as encryption and deduplication.

As indicated, each data agent 142 may be specialized for a particular application 110, and the system can employ multiple application-specific data agents 142, each of which may perform information management operations (e.g., backup, migration, and data recovery) associated with a different application 110. For instance, different individual data agents 142 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Active Directory Objects data, Microsoft Windows file system data, and Microsoft SQL Server data.

A file system data agent, for example, may handle data files and/or other file system information. If a client computing device 102 has two or more types of data, a specialized data agent 142 may be used for each data type to back up, archive, migrate, and restore the client computing device 102 data. For example, to back up, migrate, or restore all of the data on a Microsoft Exchange server, the client computing device 102 may use a Microsoft Exchange Mailbox data agent 142, a Microsoft Exchange Database data agent 142, a Microsoft Exchange Public Folder data agent 142, and a Microsoft Windows File System data agent 142. In such embodiments, these specialized data agents 142 may be treated as four separate data agents 142 even though they operate on the same client computing device 102.

Other embodiments may employ one or more generic data agents 142 that can handle and process data from two or more different applications 110, or that can handle and process multiple data types, instead of or in addition to using specialized data agents 142. For example, one generic data agent 142 may be used to back up, migrate, and restore Microsoft Exchange Mailbox data and Microsoft Exchange Database data, while another generic data agent 142 may handle Microsoft Exchange Public Folder data and Microsoft Windows File System data.

Each data agent 142 may be configured to access data and/or metadata stored in the primary storage device(s) 104 associated with the data agent 142 and to process the data as appropriate. For example, during a secondary copy operation, the data agent 142 may arrange or assemble the data and metadata into one or more files having a certain format (e.g., a particular backup or archive format) before transferring the file(s) to a media agent 144 or other component. The file(s) may include a list of files or other metadata. Each data agent 142 can also assist in restoring data or metadata to primary storage devices 104 from a secondary copy 116. For instance, the data agent 142 may operate in conjunction with the storage manager 140 and one or more of the media agents 144 to restore data from the secondary storage device(s) 108.
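The packaging step described above (assembling data into a single file with a particular format and an embedded file list) can be sketched with Python's standard `tarfile` module. The tar format, the `MANIFEST.json` entry name, and both function names are assumptions chosen for illustration; they do not represent the backup or archive format of any actual data agent.

```python
import io
import json
import tarfile


def package(files):
    """Assemble {name: bytes} into one tar archive that also carries a file list."""
    manifest = json.dumps(sorted(files)).encode("utf-8")
    # Assumption: no protected file is itself named MANIFEST.json.
    entries = {"MANIFEST.json": manifest, **files}
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as archive:
        for name, data in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unpack(archive_bytes):
    """Restore the file list and the original contents from the archive."""
    contents = {}
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r") as archive:
        for member in archive.getmembers():
            contents[member.name] = archive.extractfile(member).read()
    manifest = json.loads(contents.pop("MANIFEST.json"))
    return manifest, contents
```

Carrying the file list inside the archive means a restore can enumerate what the copy contains without consulting any external index.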

“Media Agents”

As indicated above with respect to FIG. 1A, off-loading certain responsibilities from the client computing devices 102 to intermediate components such as the media agent(s) 144 can provide a number of benefits, including faster secondary copy operation performance and improved scalability. As one example described below, the media agent 144 can act as a local cache of copied data and/or metadata that it has stored to the secondary storage device(s) 108, providing improved restore capabilities.

Summary for “Management of log data”

One issue in managing a computing environment is the log data that applications and systems in the environment automatically generate as transactions are executed. This log data may include one or more log files containing information about the transactions of a system or application. Users who monitor the computing environment can review the logs for information about transactions, such as whether particular transactions succeeded or failed. In the event of a catastrophic failure, log data can also be used to replay certain transactions.

This log data can be very large. A system or application may generate log data for tens, hundreds, or thousands of transactions per second, and each transaction may result in multiple log entries being recorded in a log file. This can lead to millions or even billions of log entries that must be organized and processed for presentation to the user. Users cannot review such a volume of log data manually in any meaningful way. Some computer systems sort log entries alphabetically or chronologically before presenting them, but even when sorted in this manner the information can remain difficult to understand. To determine the types of errors the system encountered, the user may simply have to read through all of the log entries, which is difficult, and entries that the user overlooks further reduce the utility of the log data. For example, if a log file for a particular application contains 1,000,000 out-of-memory errors, 500,000 database warnings, and 10 critical hacking attempts, the chances that the user will spot the hacking attempts are slim unless the entries are organized in a way that makes them identifiable. Computer systems described herein can organize and simplify log entries into a format that users can understand.

It is therefore desirable to have a better way of organizing log data so that the information it contains is easier to use.

A system according to certain aspects can improve the processing and organization of log data generated by applications, such as logs generated during data storage operations. The system can store log data generated by an application and organize it into one or more groups that can be presented to users. For example, the system may categorize log entries by computing a fingerprint value for each log entry and then assigning each entry to the appropriate group based on its calculated fingerprint value.
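A minimal sketch of fingerprint-based grouping follows. It assumes, purely for illustration, that the variable parts of a log entry are numbers, hexadecimal literals, and quoted strings, and that the fingerprint is a SHA-1 digest of the remaining static text; the source specifies neither the extraction rule nor the hash function.

```python
import hashlib
import re
from collections import defaultdict

# Assumption: variable parts are hex literals, digit runs, and quoted strings.
VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+|'[^']*'|\"[^\"]*\"")


def static_portion(line):
    """Mask the variable parts and normalize whitespace."""
    return " ".join(VARIABLE.sub("<*>", line).split())


def fingerprint(line):
    """Compute a fingerprint value from the static portion only."""
    return hashlib.sha1(static_portion(line).encode("utf-8")).hexdigest()


def group_lines(lines):
    """Assign each log line to a group keyed by its fingerprint."""
    groups = defaultdict(list)
    for line in lines:
        groups[fingerprint(line)].append(line)
    return groups
```

Presenting one row per group, with a count, is what would let a handful of rare entries stand out next to millions of repetitive ones.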

One aspect of this disclosure provides a method of processing log data. The method may include receiving a log data file containing one or more log lines. The log lines may contain information about one or more computing operations, and at least some of the log lines may include a static portion and a variable portion. A first log line may be processed to identify the static portion of the first log line, extract that static portion, and determine a first value for the first log line based on the extracted static portion. A second log line may be processed to identify the static portion of the second log line, extract that static portion, and determine a second value for the second log line based on the extracted static portion. The method may also include comparing the first and second values and, based on that comparison, organizing the first log line and the second log line together for presentation to the user.

The method of the preceding paragraph can include any sub-combination of the following features: where the first value uniquely identifies the static text portion of the first log line; where the method further includes extracting a variable portion from the first log line and determining a third value based on the extracted variable portion; where the first log line is added to a sub-group of the group of log lines based on the third value; and where the static portion indicates the type of transaction associated with the log line.
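The variable portion of a log line can drive a second level of grouping beneath the group chosen by the static portion. The sketch below is one hedged illustration: it assumes the variable portion consists only of digit runs, uses the masked static text directly as the group key, and uses the tuple of extracted values as the sub-group key. These choices and all names are assumptions, not the disclosed method.

```python
import re
from collections import defaultdict

NUMBER = re.compile(r"\d+")  # assumption: only digit runs are variable


def split_line(line):
    """Return (static_text, variable_values) for one log line."""
    variables = tuple(NUMBER.findall(line))
    static = NUMBER.sub("<*>", line)
    return static, variables


def group_with_subgroups(lines):
    """Group by static portion, then sub-group by the variable values."""
    groups = defaultdict(lambda: defaultdict(list))
    for line in lines:
        static, variables = split_line(line)
        groups[static][variables].append(line)
    return groups
```

With this layout, a viewer could first show one row per message type and then let the user expand a row to see which specific disks, jobs, or users were involved.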

Another aspect provides a system for processing log data. The system may include a computing device comprising computer hardware and configured to receive a log data file containing one or more log lines. The log lines may include information relating to one or more computing operations, and at least some of the log lines may include both a static portion and a variable portion. The computing device can be configured to process a first log line to identify the static portion of the first log line, extract that static portion, and determine a first value for the first log line based on the extracted static portion. The computing device can likewise process a second log line to identify the static portion of the second log line, extract that static portion, and determine a second value for the second log line based on the extracted static portion. The computing device can be configured to compare the first and second values and, based on that comparison, to organize the first log line and the second log line together for presentation to the user.

The system of the preceding paragraph can include any sub-combination of the following features: where the first value uniquely identifies the static text portion of the first log line; where the computing device is configured to organize the first log line and the second log line at least by adding them to a group based on the first and second values; and where the computing device is configured to determine the first and second values at least by extracting a first set of one or more trigrams from the static portion of the first log line and a second set of one or more trigrams from the static portion of the second log line.
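Trigram extraction can be sketched as follows. Two assumptions are made for illustration: the trigrams are character trigrams taken over the lower-cased, whitespace-normalized static text, and the two trigram sets are compared by Jaccard similarity against a hypothetical threshold. The source mentions trigrams but does not say how they are formed or compared.

```python
def trigrams(text):
    """Return the set of character trigrams of the normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def same_group(static_a, static_b, threshold=0.8):
    """Decide whether two static portions are similar enough to share a group."""
    return jaccard(trigrams(static_a), trigrams(static_b)) >= threshold
```

A similarity comparison of this kind tolerates small wording differences between static portions, whereas an exact hash would place such lines in separate groups.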

Another aspect provides a non-transitory computer-readable medium containing code that, when executed by a computing apparatus, causes it to perform a process that includes receiving a log data file containing one or more log lines. The log lines may contain information about one or more computing operations, and at least some of the log lines may include both a static portion and a variable portion. A first log line may be processed to identify the static portion of the first log line, extract that static portion, and determine a first value for the first log line based on the extracted static portion. A second log line may be processed to identify the static portion of the second log line, extract that static portion, and determine a second value for the second log line based on the extracted static portion. The process may further include comparing the first and second values and, based on that comparison, organizing the first log line and the second log line together for presentation to the user.

For purposes of summarizing the disclosure, certain aspects, advantages, and novel features of the inventions have been described herein. It is to be understood that not necessarily all such advantages may be achieved in accordance with any particular embodiment. Thus, the invention may be embodied or carried out in a manner that achieves or optimizes one advantage or group of advantages as taught herein without necessarily achieving other advantages.

According to certain embodiments, a method for processing log data is provided. The method may include receiving, by one or more computing devices comprising computer hardware, a log data file containing one or more log lines. The method may also include extracting a static portion from at least one of the log lines, determining a first value for that log line based on the extracted static portion, and processing the log line according to the determined first value.

Systems and methods are disclosed for improving the management and storage of log data (e.g., log data files containing application or system error messages). These systems and methods are described in further detail with reference to FIGS. 2-7, and may be incorporated into information management systems such as those shown in FIGS. 1A-1H.

With the increasing importance of protecting and leveraging data, organizations simply cannot afford to lose critical data. Runaway data growth and other modern realities make protecting and managing data an increasingly difficult task. There is therefore a need for efficient, powerful, and user-friendly solutions for protecting and managing data.

Depending on the size of the organization, there are typically many data production sources under the purview of tens, hundreds, or even thousands of employees. In the past, individual employees were sometimes responsible for managing and protecting their own data. In other cases, a patchwork of hardware and software point solutions was used. These solutions were often provided by different vendors and had limited or no interoperability.

Certain embodiments described herein provide systems and methods capable of addressing these and other shortcomings of prior approaches by implementing unified, organization-wide information management. FIG. 1A shows one such information management system 100, which generally includes combinations of hardware and software configured to protect and manage data and metadata generated by the various computing devices in the information management system 100. The organization that employs the information management system 100 may be a corporation or other business entity, an educational institution, a household, a governmental agency, or the like.

Generally, the systems described herein may be compatible with and/or provide some of the functionality of the systems described in one or more U.S. patents or patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated in its entirety by reference herein.

The information management system 100 can include a wide range of computing devices. For example, as discussed in more detail below, the information management system 100 could include one or more client computing devices 102 and secondary storage computing devices 106.

Computing devices may include, without limitation, one or more of the following: workstations, personal computers, desktop computers, or other types of generally fixed computing systems such as mainframe computers and minicomputers. Other computing devices include mobile or portable computing devices such as laptops, tablet computers, personal data assistants, and mobile phones (such as smartphones), as well as other mobile or portable devices such as embedded computers, set-top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

In certain cases, a computing device may include virtualized and/or cloud computing resources. For example, a third-party cloud service provider may provide one or more virtual machines to an organization. In some cases, computing devices may include one or more virtual machines running on a physical host computing device (or “host machine”). As one example, the organization might use one virtual machine as a database server and another as a mail server, with both virtual machines running on the same host machine.

A virtual machine includes an operating system and associated virtual resources and is hosted on a physical host computer (or host machine). A hypervisor, which is typically software and is also known as a virtual machine monitor or virtual machine manager (“VMM”), sits between the virtual machine and the hardware of the physical host machine. ESX Server, by VMware, Inc. of Palo Alto, Calif., is one example of a hypervisor used for virtualization. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation, and Sun xVM by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be hardware or firmware.

The hypervisor provides each virtual operating system with virtual resources, such as a virtual processor, virtual memory, and virtual network devices. Each virtual machine can have one or more virtual disks. The hypervisor typically stores the data of virtual disks in files on the file system of the physical host machine. These files are called virtual machine disk image files (in the case of Microsoft virtual servers) or virtual machine disk files (in the case of VMware virtual servers). VMware’s ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine disk files. A virtual machine reads data from and writes data to its virtual disk in much the same way that a physical machine reads from and writes to a physical disk.

Examples of information management techniques in cloud computing environments are described in U.S. Pat. No. 8,285,681, which is incorporated by reference herein. Examples of techniques for managing information in virtualized computing environments are described in U.S. Pat. No. 8,307,177, also incorporated by reference herein.

The information management system 100 can also include a variety of storage devices, such as primary storage devices 104 and secondary storage devices 108. Storage devices can generally be of any suitable type, including, without limitation, hard-disk arrays, semiconductor memory (e.g., solid state storage devices), network-attached storage (NAS) devices, tape libraries or other magnetic, non-tape storage devices, optical media storage devices, DNA/RNA-based memory technology, and combinations thereof. In some instances, storage devices form part of a distributed file system. Some storage devices can be provided in a cloud, such as a private cloud or one operated by a third-party vendor. In some cases, a storage device is a disk array or a portion thereof.

The illustrated information management system 100 includes one or more client computing devices 102 that execute at least one application 110, and one or more primary storage devices 104 that store primary data 112. In some cases, the client computing device(s) 102 and the primary storage devices 104 may be referred to as a primary storage subsystem 117. A computing device in the information management system 100 that has a data agent 142 installed and running on it is generally referred to as a client computing device 102 (or, in the context of a component of the information management system 100, simply as a “client”).

The meaning of the term “information management system” depends on the context. In some cases, it refers to all of the illustrated hardware and software components. In other cases, it refers to only a subset of those components.

In some cases, the information management system 100 refers to a set of components that protect, move, and manage the data and metadata generated by the client computing devices 102. In such cases, the information management system 100 does not necessarily include the components that generate and/or store the primary data 112, such as the client computing devices 102, the applications 110, and the primary storage devices 104. For example, the term “information management system” may sometimes refer to one or more of the following components and their corresponding data structures: storage managers, data agents, and media agents. These components are described in greater detail below.

“Client Computing Devices.”

An organization typically has a variety of sources that produce data to be protected and managed. In a corporate environment, for example, such data sources can include employee workstations and company servers such as mail servers, web servers, database servers, transaction servers, and the like. In the information management system 100, the data sources include the client computing devices 102.

The client computing devices 102 can include any of the types of computing devices described above, and in some cases a client computing device 102 is associated with one or more users and/or corresponding user accounts of employees or other individuals.

The information management system 100 generally addresses the data management and protection needs of the data generated by the client computing devices 102. This does not mean, however, that the client computing devices 102 cannot be “servers” in other respects. For example, a particular client computing device 102 can act as a server with respect to other devices, such as other client computing devices 102. The client computing devices 102 can thus include file servers, mail servers, database servers, and web servers.

Each client computing device 102 may have one or more applications 110 (e.g., software applications) executing on it. These applications generate and manipulate data that is to be protected from loss and managed. The applications 110 generally facilitate the operations of an organization (or of multiple affiliated organizations) and can include, without limitation, file server applications, mail server applications (e.g., Microsoft Exchange Server), mail client applications (e.g., Microsoft Exchange Client), database applications (e.g., SQL, Oracle, SAP, Lotus Notes Database), word processing applications (e.g., Microsoft Word), spreadsheet applications, financial applications, presentation applications, graphics applications, web applications, mobile applications, and entertainment applications.

The client computing devices 102 may have at least one operating system installed (e.g., Microsoft Windows, Mac OS X, iOS, IBM z/OS, Linux, or other Unix-based operating systems), which may support or host one or more file systems and other applications 110.

The client computing devices 102 and other components of the information management system 100 can be connected to one another via one or more communication pathways 114. For example, a first communication pathway 114 may connect a client computing device 102 and a secondary storage computing device 106; a second communication pathway 114 may connect the storage manager 140 and a client computing device 102; and a third communication pathway 114 may connect the storage manager 140 and a secondary storage computing device 106 (see, e.g., FIG. 1A and FIG. 1C). In some cases, the communication pathways 114 may also include application programming interfaces (APIs), such as cloud service provider APIs and virtual machine management APIs. The infrastructure underlying the communication pathways 114 may be wired and/or wireless, analog and/or digital, or any combination thereof, and the facilities used may be private, public, or third-party provided.

“Primary Data, Exemplary Primary Storage Devices”

According to some embodiments, primary data 112 is production data or other “live” data generated by the operating system and/or the applications 110 running on a client computing device 102. The primary data 112 is generally stored on the primary storage device(s) 104 and is organized via a file system supported by the client computing device 102. For example, the client computing device(s) 102 and the corresponding applications 110 may create, access, modify, write, delete, and otherwise use primary data 112. In some cases, some or all of the primary data 112 can be stored in cloud storage resources.

Primary data 112 is generally in the native format of the source application 110. Primary data 112 can be described as an initial or first stored copy of data generated by the source application 110 (e.g., created before any other copies or before at least one other copy). In some cases, primary data 112 is created substantially directly from the data generated by the corresponding source applications 110.

The primary storage devices 104 that store the primary data 112 can be relatively fast and/or expensive (e.g., disk drives, hard-disk arrays, solid state memory, etc.). In addition, primary data 112 may be highly changeable and/or intended for relatively short-term retention (e.g., hours, days, or weeks).

According to some embodiments, the client computing device 102 can access primary data 112 from the primary storage device 104 by making conventional file system calls via the operating system. Primary data 112 may include structured data, semi-structured data, and/or unstructured data. Some specific examples are described below with respect to FIG. 1B.

It can be useful in performing certain tasks to organize the primary data 112 into units of different granularities. In general, primary data 112 can include files, directories, file system volumes, data blocks, extents, or any other hierarchies or organizations of data objects. As used herein, a “data object” can refer both to (1) any file that is currently addressable by a file system or that was previously addressable by the file system (e.g., an archive file) and (2) a subset of such a file (e.g., a data block).

As will be described in further detail, it can also be useful in performing certain functions of the information management system 100 to access and modify metadata within the primary data 112. Metadata generally includes information about data objects or characteristics associated with them. For simplicity, any reference herein to primary data 112 generally includes its associated metadata, but a reference to the metadata does not include the primary data.

Metadata may include, without limitation, one or more of the following: the data owner (e.g., the client or user that generated the data object), the last modified time (e.g., the time of the most recent modification of the data object), the file size (e.g., a number of bytes of data), information about the content (e.g., an indication of the existence of a particular search term), user-supplied tags, to/from information for email (e.g., an email sender, recipient, etc.), the creation date, the file type (e.g., format or application type), the last accessed time, the application type (e.g., the type of application that generated the data object), location/network information (e.g., a current, past, or future location of the data object and network pathways to/from the data object), partition layouts, file location within a file folder directory structure, permissions, owners, groups, access control lists (ACLs), and system metadata.
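A few of the file system metadata items listed above can be read with ordinary operating system calls. The following is a minimal sketch, assuming a local file and using only the Python standard library; the function name `file_metadata` and the dictionary keys are hypothetical, and application-level metadata such as email sender information would require other sources.

```python
import os
import tempfile
from datetime import datetime, timezone


def file_metadata(path: str) -> dict:
    """Collect a small, illustrative subset of file system metadata for one file."""
    st = os.stat(path)
    return {
        "name": os.path.basename(path),
        "file_type": os.path.splitext(path)[1],  # extension as a rough type hint
        "size_bytes": st.st_size,
        "last_modified": datetime.fromtimestamp(st.st_mtime, tz=timezone.utc),
        "last_accessed": datetime.fromtimestamp(st.st_atime, tz=timezone.utc),
        "permissions": oct(st.st_mode & 0o777),
        "owner_uid": st.st_uid,  # reported as 0 on platforms without numeric owners
    }


# Create a throwaway file so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".txt", delete=False) as handle:
    handle.write(b"hello")
    example_path = handle.name

meta = file_metadata(example_path)
os.remove(example_path)
```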

In addition to metadata generated by or related to file systems and operating systems, some of the applications 110 and/or other components of the information management system 100 maintain indices of metadata for data objects, e.g., metadata associated with individual email messages. Thus, each data object may be associated with corresponding metadata. The use of metadata to perform classification and other functions is described in greater detail below.

Each client computing device 102 is generally associated with and/or in communication with one or more of the primary storage devices 104 storing the corresponding primary data 112. A client computing device 102 may be considered to be “associated with” a primary storage device 104 if it is capable of one or more of the following: routing and/or storing data (e.g., primary data 112) to the particular primary storage device 104; coordinating the routing and/or storing of data to that primary storage device 104; retrieving data from that primary storage device 104; coordinating the retrieval of data from that primary storage device 104; and modifying and/or deleting data retrieved from that primary storage device 104.

Primary storage devices 104 can include any of the types of storage devices described above, or some other kind of suitable storage device. The primary storage devices 104 may have relatively fast I/O times and/or be relatively expensive in comparison to the secondary storage devices 108. For example, the information management system 100 may regularly access data and metadata stored on the primary storage devices 104, whereas data and metadata stored on the secondary storage devices 108 are accessed relatively less frequently.

Primary storage devices 104 can be dedicated or shared. In some cases, each primary storage device 104 is dedicated to an associated client computing device 102. For instance, in one embodiment a primary storage device 104 is a local disk drive of a corresponding client computing device 102. In other cases, one or more primary storage devices 104 can be shared by multiple client computing devices 102, e.g., via a network, such as in a cloud storage implementation. As one example, a primary storage device 104 can be a disk array shared by a group of client computing devices 102, such as an EMC Clariion, EMC Symmetrix, or EMC Celerra disk array.

The information management system 100 may also include hosted services (not shown), which may be hosted in some cases by an entity other than the organization that uses the information management system 100. For instance, the hosted services may be provided to the organization by various online service providers. Such service providers can offer services including social networking services, hosted email services, and hosted productivity applications. Hosted services may include software-as-a-service (SaaS), platform-as-a-service (PaaS), application service providers (ASPs), cloud services, or other mechanisms for delivering functionality via a network. As it provides services to users, each hosted service may generate additional data and metadata under management of the information management system 100, e.g., as primary data 112. In some cases, the hosted services may be accessed using one of the applications 110. As an example, a hosted mail service may be accessed via a browser running on a client computing device 102. The hosted services may be implemented in a variety of computing environments. In some cases, they are implemented in an environment similar to the information management system 100, where various physical and logical components are distributed over a network.

“Secondary copies and Exemplary Secondary Storage Devices”

The primary data 112 stored on the primary storage devices 104 may be compromised in some cases, such as when an employee deletes or accidentally overwrites primary data 112 during the normal course of work. The primary storage devices 104 can also be damaged, lost, or otherwise corrupted. For recovery and/or regulatory compliance purposes, it is therefore useful to generate copies of the primary data 112. Accordingly, the information management system 100 includes one or more secondary storage computing devices 106 and one or more secondary storage devices 108 configured to create and store one or more secondary copies 116 of the primary data 112 and associated metadata. The secondary storage computing devices 106 and the secondary storage devices 108 may sometimes be referred to as a secondary storage subsystem 118.

Creating secondary copies 116 can help in search and analysis efforts and in meeting other information management goals. It allows data and/or metadata to be restored if an original version (e.g., of primary data 112) is lost (e.g., by deletion, corruption, or natural disaster), and it permits point-in-time recovery.

The client computing devices 102 access or receive primary data 112 and communicate the data, e.g., over one or more communication pathways 114, for storage in the secondary storage device(s) 108.

A secondary copy 116 can comprise a separate stored copy of application data that is derived from one or more earlier-created, stored copies (e.g., derived from primary data 112 or another secondary copy 116). Secondary copies 116 can include point-in-time data and may be intended for relatively long-term retention (e.g., weeks, months, or years) before some or all of the data is moved to other storage or discarded.

In some cases, a secondary copy 116 is a copy of application data created and stored subsequent to at least one other stored instance (e.g., subsequent to the corresponding primary data 112 or to another secondary copy 116), in a different storage device than at least one previously stored copy, and/or remotely from at least one previously stored copy. In other cases, secondary copies can be stored on the same storage device as primary data 112 and/or other previously stored copies. For example, in one embodiment a disk array capable of performing hardware snapshots stores primary data 112 and creates and stores hardware snapshots of the primary data 112 as secondary copies 116. Secondary copies 116 may be stored in relatively low-cost storage (e.g., magnetic tape). A secondary copy 116 may be stored in a backup or archive format, or in some other format different from the native source application format or other primary data format.

In some cases, secondary copies 116 are indexed so that users can browse and restore them at a later time. After a secondary copy 116 representing certain primary data 112 is created, a pointer or other location indicator (e.g., a stub) may be placed in the primary data 112 to indicate the current location of the secondary copy 116 on the secondary storage device(s) 108.

Since an instance of a data object or metadata in primary data 112 may change over time as it is modified by an application 110 (or by the operating system), the information management system 100 may create and manage multiple secondary copies 116 of a particular data object or metadata, each representing the state of the data object in primary data 112 at a particular point in time. Moreover, the information management system 100 may continue to manage point-in-time representations of a primary data object even after that object has been deleted from the primary storage device 104 and the file system.
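The idea of keeping several point-in-time copies of one data object can be sketched as a time-ordered list that is searched for the most recent copy at or before a requested time. This is only an illustration; the class `CopyHistory`, its methods, and the numeric timestamps are hypothetical, and the sketch assumes copies are recorded in chronological order.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class CopyHistory:
    """Hypothetical record of the secondary copies made of one data object."""
    times: list = field(default_factory=list)   # capture times, ascending
    states: list = field(default_factory=list)  # object state at each capture

    def add_copy(self, timestamp: float, state: bytes) -> None:
        # Assumes copies are added in chronological order.
        self.times.append(timestamp)
        self.states.append(state)

    def state_at(self, timestamp: float):
        """Return the most recent state captured at or before `timestamp`, or None."""
        index = bisect_right(self.times, timestamp)
        return self.states[index - 1] if index else None


history = CopyHistory()
history.add_copy(100.0, b"draft")
history.add_copy(200.0, b"final")
```

Because the history is held separately from the primary data, a lookup such as `history.state_at(150.0)` still works after the primary object has been deleted.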

For virtualized computing devices, the operating system and other applications 110 may execute within or under the management of virtualization software (e.g., a VMM), and the primary storage device(s) 104 may comprise a virtual disk created on a physical storage device. The information management system 100 can create secondary copies 116 of the files or other data objects in a virtual disk file and/or secondary copies 116 of the entire virtual disk file itself (e.g., of an entire .vmdk file).

Secondary copies 116 can be distinguished from the corresponding primary data 112 in a variety of ways, some of which are discussed here. As mentioned, secondary copies 116 can be stored in a different format than primary data 112 (e.g., a backup, archive, or other non-native format). For this and other reasons, secondary copies 116 may not be directly usable by the client computing devices 102.

In certain embodiments, secondary copies 116 are also stored on a secondary storage device 108 that is inaccessible to the applications 110 running on the client computing devices 102 (and/or hosted services). Some secondary copies 116 may be “offline copies,” in that they are not readily available (e.g., not mounted to tape or disk). Offline copies can include copies of data that the information management system 100 is able to access without human intervention.

“The Use Of Intermediate Devices to Create Secondary Copies”

Creating secondary copies can be a challenging task. For instance, there can be hundreds of client computing devices 102 generating large volumes of primary data 112 that must be protected. Creating secondary copies 116 can also involve significant overhead. Moreover, secondary storage devices 108 may be special-purpose components, and interacting with them can require specialized intelligence.

In certain cases, the client computing devices 102 interact directly with the secondary storage device 108 to create the secondary copies 116. This approach, however, can negatively affect the ability of the client computing devices 102 to serve the applications 110 and produce primary data 112. Further, the client computing devices 102 may not be optimized for interaction with the secondary storage devices 108.

In some embodiments, the information management system 100 includes one or more software and/or hardware components that act as intermediaries between the client computing devices 102 and the secondary storage devices 108. In addition to off-loading certain responsibilities from the client computing devices 102, these intermediate components can provide other benefits. For instance, as shown in FIG. 1D, distributing some of the work involved in creating secondary copies 116 can increase scalability.

The intermediate components can include one or more secondary storage computing devices 106, as shown in FIG. 1A, and/or one or more media agents, which can be software modules operating on the secondary storage computing devices 106 (or other suitable computing devices). Media agents are discussed below (e.g., with respect to FIGS. 1C-1E).

The secondary storage computing device(s) 106 can include any of the computing devices described above. In some cases, the secondary storage computing device(s) 106 include specialized hardware and/or software components for interfacing with the secondary storage devices 108.

To create a secondary copy 116, which involves copying data from the primary storage subsystem 117 to the secondary storage subsystem 118, the client computing device 102 in some embodiments communicates the primary data 112 to be copied (or a processed version thereof) to the designated secondary storage computing device 106 via the communication pathway 114. The secondary storage computing device 106 in turn conveys the received data (or a processed version thereof) to the secondary storage device 108. In certain cases, the communication pathway 114 between the client computing device 102 and the secondary storage computing device 106 comprises a portion of a LAN, WAN, or SAN. In other cases, at least some client computing devices 102 communicate directly with the secondary storage devices 108 (e.g., via Fibre Channel or SCSI connections). In still other cases, one or more secondary copies 116 are created from existing secondary copies, as in the case of an auxiliary copy operation.

“Exemplary Secondary Data and Exemplary Primary Data”

FIG. 1B shows some specific examples of primary data and secondary copy data. Stored on the primary storage device(s) 104 are primary data objects including word processing documents 119A-B, spreadsheets 120, presentation files 122, video files 124, image files 126, email mailboxes 128 (with corresponding emails 129A-C), html/xml files 130, and databases 132 with corresponding tables or other data structures 133A-133C.

Some or all of the primary data objects are associated with corresponding metadata (e.g., “Meta1-11”), which may include file system metadata and/or application-specific metadata. Stored on the secondary storage device(s) 108 are secondary copy data objects 134A-C, which may include copies of, or otherwise represent, corresponding primary data objects and metadata.

As shown, the secondary copy data objects 134A-C can each represent more than one primary data object. For example, secondary copy data object 134A represents three separate primary data objects 133C, 122C, and 129C (represented as 133C′, 122C′, and 129C′, respectively), accompanied by the corresponding metadata Meta11, Meta3, and Meta8. As indicated by the prime mark (′), a secondary copy data object may store a representation of a primary data object and/or its metadata in a format different from the original. Likewise, secondary copy data object 134B represents primary data objects 120, 130B, and 119A, accompanied by the corresponding metadata Meta2, Meta10, and Meta1. Secondary copy data object 134C represents primary data objects 130A, 119B, and 129A, accompanied by the corresponding metadata Meta9 and Meta5.

“Exemplary Information Management System Architecture”

The information management system 100 can incorporate a variety of hardware and software components, which can be organized in many different configurations depending on the embodiment. There are critical design choices involved in specifying the functional responsibilities of the components and the role of each component in the information management system 100. As will be discussed, such design choices can significantly affect performance as well as the ability of the information management system 100 to adapt to data growth or other changing circumstances.

FIG. 1C shows an information management system 100 designed according to these considerations. It includes: a storage manager 140, which is a centralized storage and/or information manager configured to perform certain control functions; one or more data agents 142 executing on the client computing device(s) 102 and configured to process primary data 112; and one or more media agents 144 executing on the secondary storage computing device(s) 106 for performing tasks involving the secondary storage devices 108. While distributing functionality among multiple computing devices can have certain advantages, in other contexts it can be beneficial to consolidate functionality on the same computing device. As such, in various other embodiments, one or more of the components shown in FIG. 1C as being implemented on separate computing devices are implemented on the same computing device. In one configuration, a storage manager 140, one or more data agents 142, and one or more media agents 144 are all implemented on the same computing device. In another embodiment, one or more data agents 142 and one or more media agents 144 are implemented on the same computing device, while the storage manager 140 is implemented on a separate computing device, without limitation.

“Storage Manager”

As noted, the number of components in the information management system 100 and the amount of data under management can be quite large. Managing the components and data is therefore a complex task, and one that can become more difficult as the number of components and the amount of data grow to meet the organization’s needs. For these and other reasons, according to certain embodiments, responsibility for controlling the information management system 100 is allocated to the storage manager 140. By distributing control functionality in this manner, the storage manager 140 can be adapted independently according to changing circumstances. Moreover, a computing device for hosting the storage manager 140 can be selected to best suit the functions of the storage manager 140. These and other advantages are described in further detail below with respect to FIG. 1D.

The storage manager 140 may be a software module or other application which, in some embodiments, operates in conjunction with one or more associated data structures (e.g., a dedicated database such as management database 146). In some embodiments, the storage manager 140 is a computing device that executes computer instructions. The storage manager 140 generally initiates, performs, coordinates, and/or controls storage and other information management operations performed by the information management system 100, e.g., to protect and control the primary data 112 and the secondary copies 116 of data and metadata. In general, the storage manager 140 may be said to manage the information management system 100, which includes managing its constituent components (e.g., data agents and media agents).

As indicated by the dashed arrowed lines 114 in FIG. 1C, the storage manager 140 can communicate with and/or control some or all elements of the information management system 100, such as the data agents 142 and media agents 144. Thus, in certain embodiments, control information originates from the storage manager 140 and status reporting is transmitted to the storage manager 140 by the various managed components, whereas payload data and payload metadata are generally communicated between the data agents 142 and the media agents 144 (or otherwise between the client computing device(s) 102 and the secondary storage computing device(s) 106), e.g., at the direction of and under the management of the storage manager 140. Control information can generally include parameters and instructions for carrying out information management operations, such as instructions to perform a task associated with an operation, timing information specifying when to initiate a task, and data path information specifying which components to communicate with or access in carrying out an operation. Payload data, on the other hand, can include the actual data involved in the storage operation, such as content data written to a secondary storage device 108 in a secondary copy operation. Payload metadata can include any of the types of metadata described herein and may be written to a storage device along with the payload content data (e.g., in the form of a header).

In some embodiments, certain information management operations are controlled by other components of the information management system 100 (e.g., the media agent(s) 144 or data agent(s) 142), in addition to or in combination with the storage manager 140.

According to certain embodiments, the storage manager 140 provides one or more of the following functions.

The storage manager 140 may maintain a database 146 (or “storage manager database 146” or “management database 146”) of management-related data and information management policies 148. The database 146 may include a management index 150 (or “index 150”) or other data structure that stores logical associations between components of the system, user preferences and/or profiles (e.g., preferences regarding encryption, compression, or deduplication, preferences regarding the scheduling, type, or other aspects of operations, mappings of information management users to specific computing devices or other components, etc.), management tasks, media containerization, or other useful data. For example, the storage manager 140 may use the index 150 to track logical associations between media agents 144 and secondary storage devices 108 and/or the movement of data from primary storage devices 104 to secondary storage devices 108. For instance, the index 150 may store data associating a client computing device 102 with a particular media agent 144 and/or secondary storage device 108, as specified in an information management policy 148, which is described further below.
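The logical associations tracked by such an index can be pictured as a small lookup table. The sketch below is an assumption-laden illustration rather than the disclosed data structure: the class names `Association` and `ManagementIndex`, their methods, and the example component names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """Hypothetical record linking a client to the components that serve it."""
    client: str
    media_agent: str
    secondary_storage: str


class ManagementIndex:
    """Illustrative in-memory stand-in for an index of logical associations."""

    def __init__(self):
        self._by_client = {}

    def associate(self, client: str, media_agent: str, secondary_storage: str) -> None:
        self._by_client[client] = Association(client, media_agent, secondary_storage)

    def route_for(self, client: str):
        """Return the association for a client, or None if none is recorded."""
        return self._by_client.get(client)

    def clients_of(self, media_agent: str):
        """List the clients currently associated with a given media agent."""
        return sorted(
            a.client for a in self._by_client.values() if a.media_agent == media_agent
        )


index = ManagementIndex()
index.associate("client-a", "media-agent-1", "tape-library-1")
index.associate("client-b", "media-agent-1", "disk-array-2")
index.associate("client-c", "media-agent-2", "tape-library-1")
```

A production index would typically live in a database rather than in memory, consistent with the management database described above.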

Administrators and other individuals may be able to configure and initiate certain information management operations on an individual basis. While this may be acceptable for some recovery operations or other relatively infrequent tasks, it is often not workable for ongoing, organization-wide data protection and management. Thus, the information management system 100 may use information management policies 148 to specify and execute information management operations (e.g., on an automated basis). Generally, an information management policy 148 can include a data structure or other information source that specifies a set of parameters (e.g., criteria and rules) associated with storage or other information management operations.
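A policy of this kind can be sketched as a plain data structure holding criteria and rules. The field names below (`data_selector`, `destination`, `frequency_hours`, `retention_days`, and so on) are illustrative assumptions, not parameters taken from the disclosure, and the glob-style selector is only one possible way to express which data a policy covers.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class InformationManagementPolicy:
    """Hypothetical set of parameters governing a secondary copy operation."""
    name: str
    data_selector: str      # assumed: glob pattern choosing the data covered
    destination: str        # assumed: label of the target secondary storage
    frequency_hours: int    # how often the operation should run
    retention_days: int     # how long resulting copies are kept
    compress: bool = False
    encrypt: bool = False

    def applies_to(self, path: str) -> bool:
        """Return True if the given path falls under this policy's selector."""
        return fnmatchcase(path, self.data_selector)


policy = InformationManagementPolicy(
    name="home-directories-daily",
    data_selector="/home/*",
    destination="tape-library-1",
    frequency_hours=24,
    retention_days=90,
    compress=True,
)
```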

The storage manager database 146 may contain the information management policies 148 and associated data, although the information management policies 148 can be stored in any appropriate location. For instance, an information management policy 148 such as a storage policy may be stored as metadata in a media agent database 152 or in a secondary storage device 108 (e.g., as an archive copy) for use in restore operations or other information management operations, depending on the embodiment. Information management policies 148 are described further below.

According to some embodiments, the storage manager database 146 includes a relational database (e.g., an SQL database) that tracks metadata, such as metadata associated with secondary copy operations (e.g., which client computing devices were involved and the corresponding data). These and other metadata can also be stored in other locations, such as the secondary storage computing device 106 or the secondary storage device 108. This allows data recovery without the use of the storage manager 140 in certain cases.
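As a rough illustration of tracking secondary copy metadata relationally, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table name, column names, and sample rows are hypothetical; the source does not specify a schema.

```python
import sqlite3

# In-memory database standing in for the storage manager database 146.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE secondary_copy_ops (
        job_id       INTEGER PRIMARY KEY,
        client       TEXT NOT NULL,
        media_agent  TEXT NOT NULL,
        device       TEXT NOT NULL,
        completed_at TEXT NOT NULL  -- ISO 8601 timestamps sort lexicographically
    )
    """
)
conn.executemany(
    "INSERT INTO secondary_copy_ops (client, media_agent, device, completed_at) "
    "VALUES (?, ?, ?, ?)",
    [
        ("client-102a", "media-agent-144a", "device-108a", "2024-01-01T02:00:00"),
        ("client-102a", "media-agent-144a", "device-108a", "2024-01-02T02:00:00"),
        ("client-102b", "media-agent-144b", "device-108b", "2024-01-01T03:00:00"),
    ],
)


def latest_copy(client):
    """Return (device, completed_at) for the client's newest copy, or None."""
    return conn.execute(
        "SELECT device, completed_at FROM secondary_copy_ops "
        "WHERE client = ? ORDER BY completed_at DESC LIMIT 1",
        (client,),
    ).fetchone()


print(latest_copy("client-102a"))
```

A query like `latest_copy` is the sort of lookup a restore operation would need in order to locate the most recent secondary copy for a client.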

As shown in the figure, the storage manager 140 may include a jobs agent 156, a user interface 158, and a management agent 154, all of which may be implemented as interconnected software modules or application programs.

In some embodiments, the jobs agent 156 initiates, controls, and/or monitors the status of some or all storage operations or other information management operations that were previously performed, are currently being performed, or are scheduled to be performed by the information management system 100. The jobs agent 156 might, for example, access information management policies 148 to determine when and how to initiate and control secondary copy and other information management operations.

The user interface 158 can include information processing and display software, such as a graphical user interface (“GUI”) or an application program interface (“API”). Via the user interface 158, users can optionally issue instructions to components of the information management system 100 regarding storage and recovery operations. For example, a user might modify a schedule concerning the number of pending secondary copy operations. As another example, a user might use the GUI to view the status of pending storage operations or to monitor certain components of the information management system 100 (e.g., the remaining storage capacity).

An “information management cell” (or “storage operation cell” or “cell”) may generally include a logical and/or physical grouping of hardware and software components associated with performing information management operations on electronic data: at least one storage manager 140, at least one client computing device 102 with at least one data agent 142, and at least one media agent 144. For instance, the components shown in FIG. 1C may together form an information management cell. Multiple cells can be organized hierarchically. With this configuration, cells may inherit properties from hierarchically superior cells or be controlled by other cells in the hierarchy (automatically or otherwise). In some embodiments, cells can inherit or otherwise be associated with information management policies, preferences, information management metrics, or other properties or characteristics according to their relative position within a hierarchy of cells. Cells can also be delineated and organized hierarchically according to geography, architecture, function, or any other factor that is useful in information management operations. For example, a first cell could represent a geographic segment of an enterprise, such as a Chicago office, and a second cell could represent a different geographic segment, such as a New York office. Other cells could represent departments within a particular office. Where delineated by function, a first cell can perform one or more first types of information management operations (e.g., one or more first types of secondary or other copies), while a second cell can perform one or more second types of information management operations.
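Property inheritance within a hierarchy of cells can be sketched as a lookup that walks from a cell up through its ancestors until some cell defines the requested setting. This is a hedged illustration only: the `Cell` class, the `effective` method, and the setting names and values are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Cell:
    """Hypothetical information management cell in a hierarchy."""

    name: str
    parent: Optional["Cell"] = None
    own_settings: dict[str, str] = field(default_factory=dict)

    def effective(self, key: str) -> Optional[str]:
        """Return the nearest definition of `key`, searching self then ancestors."""
        cell: Optional["Cell"] = self
        while cell is not None:
            if key in cell.own_settings:
                return cell.own_settings[key]
            cell = cell.parent
        return None


enterprise = Cell("enterprise", own_settings={"encryption": "aes-256", "retention_days": "30"})
chicago = Cell("chicago", parent=enterprise, own_settings={"retention_days": "90"})
accounting = Cell("chicago-accounting", parent=chicago)

print(accounting.effective("encryption"))      # inherited from the enterprise cell
print(accounting.effective("retention_days"))  # overridden at the Chicago cell
```

Resolving settings at lookup time, rather than copying them down when a cell is created, means a change made in a superior cell is seen immediately by every subordinate cell that has not overridden it.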

The storage manager 140 can also track information that permits it to select, designate, or otherwise identify content indexes, deduplication databases, or similar databases or resources within its information management cell (or another cell) to be searched in response to certain queries. These queries can be entered via the user interface 158. The management agent 154 allows multiple information management cells to communicate with one another. For example, the information management system 100 may in some cases be one information management cell in a network of multiple cells that are adjacent to one another or otherwise logically related in a WAN or LAN. These cells can be connected to one another through their respective management agents 154.

For example, the management agent 154 can give the storage manager 140 the ability to communicate with other components of the information management system 100 (and/or other cells within a larger information management system) via network protocols and application programming interfaces (“APIs”) including, e.g., HTTP, HTTPS, FTP, REST, and virtualization software APIs. Inter-cell communication and hierarchy are described in greater detail in U.S. Pat. Nos. 7,747,579 and 7,343,453, which are incorporated herein by reference.

Data Agents

As discussed, many different types of applications 110 can run on a given client computing device 102, including operating systems, database applications, e-mail applications, and virtual machines, to name just a few. As part of creating and restoring secondary copies 116, the client computing devices 102 may be tasked with processing and preparing the primary data 112 from these different applications 110. Moreover, the nature of the processing/preparation can differ across clients and application types, e.g., due to inherent structural and formatting differences among applications 110.

In certain embodiments, the one or more data agents 142 are advantageously configured to assist in the performance of information management operations based on the type of data being protected, at a client-specific and/or application-specific level.

The data agent 142 may be a software module or component that is generally responsible for initiating, managing, or otherwise assisting in the performance of information management operations in the information management system 100, usually as directed by the storage manager 140. For instance, the data agent 142 may take part in data storage operations such as copying, archiving, and migrating primary data 112 stored in the primary storage device(s) 104. The data agent 142 may receive control information from the storage manager 140, such as commands to transfer copies of data objects and metadata to the media agents 144.

In some embodiments, a data agent 142 may be distributed between the client computing device 102 and the storage manager 140 (and any other intermediate components), or it may be deployed from a remote location, or its functions may be approximated by a remote process that performs some or all of the functions of the data agent 142. A data agent 142 can also perform some functions that are otherwise provided by a media agent 144, or perform other functions such as encryption and deduplication.

Each data agent 142 can be specialized for a particular application 110. The system can employ multiple application-specific data agents 142, each of which may perform information management operations (e.g., backup, migration, and data recovery) associated with a different application 110. For instance, different individual data agents 142 may be designed to handle Microsoft Exchange data, Lotus Notes data, Microsoft Active Directory Objects data, Microsoft Windows file system data, and Microsoft SQL Server data.

A file system data agent, for example, may handle data files and/or other file system information. If a client computing device 102 has two or more types of data, a specialized data agent 142 can be used for each data type to back up, archive, migrate, and restore the client computing device 102 data. For example, to back up, migrate, and/or restore all of the data on a Microsoft Exchange server, the client computing device 102 might use a Microsoft Exchange Mailbox data agent 142, a Microsoft Exchange Database data agent 142, a Microsoft Exchange Public Folder data agent 142, and a Microsoft Windows File System data agent 142. These specialized data agents 142 can be treated as four separate data agents 142, even though they all operate on the same client computing device 102.

Other embodiments may employ one or more generic data agents 142 that can handle and process data from two or more different applications 110, or that can handle and process multiple data types, instead of or in addition to using specialized data agents 142. For example, one generic data agent 142 could be used to back up, migrate, and restore Microsoft Exchange Mailbox data and Microsoft Exchange Database data, while another generic data agent 142 might handle Microsoft Exchange Public Folder data and Microsoft Windows File System data.
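The choice between specialized and generic data agents can be pictured as a dispatch table keyed by data type, with a generic handler as the fallback. The sketch below is purely illustrative: the registry, the decorator, the data-type keys, and the returned strings are hypothetical, and real data agents would do far more than return a message.

```python
from typing import Callable

# Hypothetical registry: data type -> specialized agent function.
SPECIALIZED_AGENTS: dict[str, Callable[[str], str]] = {}


def register(data_type: str):
    """Decorator that registers a function as the agent for one data type."""
    def wrap(fn: Callable[[str], str]) -> Callable[[str], str]:
        SPECIALIZED_AGENTS[data_type] = fn
        return fn
    return wrap


@register("exchange_mailbox")
def mailbox_agent(item: str) -> str:
    return f"mailbox agent backed up {item}"


@register("exchange_database")
def database_agent(item: str) -> str:
    return f"database agent backed up {item}"


def generic_agent(item: str) -> str:
    """Fallback agent for data types without a specialized agent."""
    return f"generic agent backed up {item}"


def back_up(data_type: str, item: str) -> str:
    """Route an item to its specialized agent, or to the generic agent."""
    agent = SPECIALIZED_AGENTS.get(data_type, generic_agent)
    return agent(item)


print(back_up("exchange_mailbox", "user-a.pst"))
print(back_up("windows_file_system", "C:/reports"))
```

With this arrangement, adding support for a new application means registering one more function, while any unregistered data type still gets handled by the generic agent.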

Each data agent 142 can be configured to access data and/or metadata stored in the primary storage device(s) 104 and to process the data as appropriate. For example, during a secondary copy operation, the data agent 142 might arrange the data and metadata into one or more files with a particular format, such as a backup or archive format, before transferring the file(s) to a media agent 144 or another component. The file(s) may include a list of files and other metadata. Each data agent 142 can also assist in restoring data or metadata to the primary storage devices 104 from a secondary copy 116. For instance, the data agent 142 can operate in conjunction with the storage manager 140 and one or more media agents 144 to restore data from the secondary storage device(s) 108.

Media Agents

As indicated above with respect to FIG. 1A, shifting certain responsibilities from the client computing devices 102 to intermediate components such as the media agent(s) 144 can provide many benefits, including faster secondary copy operation performance and improved scalability. In one specific example described below, the media agent 144 acts as a local cache of the copied data and/or metadata that it has stored to the secondary storage device(s) 108, which provides improved restore capabilities.
