Anand Prahlad, Jeremy A. Schwartz, David Ngo, Brian Brockway, Marcus S. Muller, Commvault Systems Inc

Abstract for “Systems and methods of classifying and transmitting information in a storage system”

“Systems and methods to improve enterprise data management are described. These systems and methods evaluate and define data management operations using data characteristics rather than data location. Methods for creating a metadata data structure that describes system data and storage operations are also provided. This data structure can be used to identify changes in system data without scanning individual data files.”

Background for “Systems and methods of classifying and transmitting information in a storage system”

“Aspects of the invention relate to operations on electronic data within a computer network. In particular, the invention relates to detecting data interactions within a computer system and/or performing storage-related operations on a computer network according to a specified classification paradigm.

Current storage management systems perform a variety of storage operations on electronic data. Data can be stored as primary copies or as secondary copies, which include backup copies, snapshot copies, hierarchical storage management (“HSM”) copies, archive copies, and other types of copies.

“A primary copy is generally a production copy or other “live” version of the data that is used by a software application, and is usually in the native format of that application. Primary copy data may be maintained in local memory or another high-speed storage device that allows for relatively fast data access if necessary. Such primary copy data is typically intended for short-term retention (e.g., several hours or days) before some or all of the data is stored as one or more secondary copies.

“A secondary copy is a point-in-time copy of primary copy data. Secondary copies are usually intended for long-term retention (e.g., weeks, months, or years, depending upon retention criteria described in a storage policy) before some or all of the data is moved or discarded. Users can search for data in secondary copies and retrieve it at a later time. To indicate where the data is located, a pointer or other location indicator, such as a stub, may be placed in the primary copy.

A backup copy is one type of secondary copy. A backup copy is usually a point-in-time copy of primary copy data stored in a backup format rather than in the native application format. A backup copy may be stored in a format optimized for compression and long-term storage. Backup copies generally have a relatively long retention period and may be kept on media with slower retrieval times than other media. Backup copies may, in some instances, be kept at an offsite location.

A snapshot copy is another type of secondary copy. A snapshot can be viewed as an instant image of primary copy data at a given point in time; it captures the directory structure and contents of a primary copy volume at a particular moment. A snapshot can exist alongside an actual file system, and its record of files and directories is typically accessible to users as a read-only file system. Users can also restore primary copy data from a snapshot taken at a particular time, returning the current file system to the state it was in when the snapshot was created.

A snapshot can be created quickly and with little file space, yet it may still serve as a backup of the file system. Although a snapshot does not create a physical copy of all data, it may create pointers that map files and directories to particular disk blocks.

“In some embodiments, after a snapshot is taken, subsequent changes to the file system do not overwrite the blocks that were in use at the time of the snapshot. The initial snapshot may require only a small amount of disk space to store a mapping or other data structure that tracks the blocks corresponding to the current state of the file system. Additional disk space is typically required only when files or directories are actually modified later; moreover, when files are modified, typically only the pointers that map to blocks are copied, not the blocks themselves. In some embodiments, such as copy-on-write snapshots, when a block is about to change in primary storage, the block is copied to secondary storage before being overwritten in primary storage, and the snapshot's mapping of file data is updated to reflect the changed block(s).

HSM copies are generally copies of primary copy data, but they typically contain only a subset of that data and are usually stored in a format other than the native application format. For example, an HSM copy may contain only data that is larger than a given size threshold or older than a given age threshold, stored in a backup format. HSM data is often removed from the primary copy, and a stub is stored in the primary copy to indicate the data's new location. Systems use the stub to locate HSM data after it has been deleted or migrated, making recovery of the data transparent even though the HSM data may be stored in a location different from that of the remaining primary copy data.

An archive copy is similar to an HSM copy, except that data meeting the criteria for removal from the primary copy is usually completely deleted, with no stub left in the primary copy to indicate where the data has been moved. The archive copy of data is generally stored in a backup format or other non-native format. Archive copies are often retained for very long periods of time (e.g., years) and, in some cases, are never deleted. Such archive copies may be retained for compliance with regulations or for other permanent storage purposes.

“In certain embodiments, application data moves over its lifetime from more expensive, quick-access storage to less expensive, slower-access storage. This process of moving data through these various tiers of storage is sometimes referred to as information lifecycle management (“ILM”), and is how data is “aged.” Many types of secondary storage are available, some with faster access/restore times than others; data is typically moved to slower media as it becomes less critical or less important over time.

“Examples of different types of data, and of copies of such data, are described in the related applications referenced above, which are incorporated herein by reference in their entirety. One example of a system that stores electronic data and produces such copies is the QiNetix storage management system offered by CommVault Systems of Oceanport, N.J.

“The QiNetix system uses a modular storage management architecture that may include, among other things, storage manager components, client or data agent components, and media agent components, as further described in the related U.S. patent applications referenced above.

No matter where data is stored, traditional storage management systems perform storage operations according to location-specific criteria. For example, data generated by applications running on a client is typically copied according to location-specific criteria, such as from a specific folder or subfolder in a given data path. A module installed on the client may manage the transfer of the data from the client to another storage location. Similarly, when restoring data from secondary storage to primary storage, data transfers are made using location-specific criteria: to restore data, a user or process must specify the secondary storage device, media, archive file, and so forth. Thus, storage operations are specified or defined based on where data is located rather than on information relating to or describing the data itself, which limits the precision with which traditional storage management systems can perform storage operations on electronic data.

“Moreover, traditional storage systems scan files on clients or other computing devices to identify data objects on which storage operations are to be performed. This may involve locating file and/or folder attributes by traversing a client's file system before performing storage operations. This is a time-consuming process that can consume significant client resources; those resources could otherwise be spent on tasks related to production applications. There is therefore a need for systems and methods that perform storage operations more efficiently and precisely.

“Aspects of the invention generally concern systems and methods that analyze, classify, and store various kinds of data. Among other things, this facilitates identification, searching, storage, and retrieval of data that satisfies certain criteria. Although described in connection with certain specific embodiments, it will be understood that the inventions disclosed herein can be applied to any wired or wireless network or data transfer system that stores or conveys data, including enterprise networks, storage networks, and the like.

“Aspects of the invention include systems and methods that facilitate and improve enterprise data management. These systems and methods evaluate and define data management operations based on data characteristics rather than data location. Methods for creating a metadata data structure that describes system data and storage operations are also provided. This data structure can be used to identify changes in system data rather than scanning individual data files.

“Generally speaking, the methods and systems described below analyze data and other information in a computer network (sometimes referred to as a “data object”) and create a database, or metabase, of information. For example, a data collection agent may traverse a network file system to obtain certain attributes and characteristics of the data. In some embodiments, this database may be a collection of metadata and/or other information about the network data. Metadata is generally data or information about data, and may include, for example, data relating to storage operations or storage management, such as data locations, storage management components associated with the data, storage devices used to perform storage operations, index data, data type, or other data.
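The metabase-building scan described above can be sketched, under loose assumptions, as a traversal that collects per-file metadata into a list of records. The function and field names here are illustrative only and are not drawn from the patent:

```python
import os

def build_metabase(root):
    """Traverse a file system subtree and collect per-file metadata.

    A toy sketch of the data collection agent described above: it
    walks the tree once and records a few attributes per data object.
    """
    metabase = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
            metabase.append({
                "path": path,                     # location of the object
                "size": st.st_size,               # number of bytes
                "modified": st.st_mtime,          # last modified time
                "accessed": st.st_atime,          # last accessed time
                "type": os.path.splitext(name)[1] or "unknown",
            })
    return metabase
```

Once populated, the list can answer queries about network data without touching the source files again, which is the central point of the arrangement.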

This arrangement allows system administrators or other system processes to consult the metabase for information about network data rather than iteratively accessing and analyzing each data object in the network. This significantly reduces the time needed to obtain data object information, eliminates the need to access the source data, and minimizes the use of network resources, making the process more efficient and less burdensome on the host system.

“Various embodiments of the invention will now be described in detail. One skilled in the art, however, will understand that the invention may be practiced without many of these details. Some well-known structures or functions may not be shown or described in detail so as to avoid unnecessarily obscuring the various embodiments.

“The terminology used in the description below is intended to be interpreted in its broadest reasonable manner, even though it is used in conjunction with specific embodiments of this invention. Certain terms may even be emphasized below; however, any terminology intended to be interpreted in a restricted manner will be overtly and specifically defined as such in the Detailed Description section.

“FIG. 1 illustrates some of the steps involved in deploying the system. To perform certain functions described below, it may be necessary to install data classification software on computing devices within the network (step 102). This may be accomplished by installing the software on client computers or servers within the network. In some cases, the classification software may be installed globally on a computing device. The classification software can monitor data objects generated by the computers and classify that information as appropriate.

“Next, at step 104, a monitor agent may be initialized. This monitor agent, which may be deployed on each computing device in much the same way as the classification agents described above, may be installed and configured to monitor and record certain data interactions within each machine or process. For example, the monitor agent may include a filter driver program, may be deployed on an input/output port or data stack, and may operate in conjunction with a file management program to record interactions with computing device data. This may involve creating a data structure, such as a record or journal, for each interaction. These records can be stored in a journal data structure and may chronicle data interactions on an interaction-by-interaction basis. The journal may include information regarding the type of interaction and relevant properties of the data involved. One example of such a monitor program is Microsoft's Change Journal.
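The journal kept by such a monitor agent can be sketched as a structure that appends one record per observed interaction, loosely mirroring the kind of sequence-numbered, reason-coded records kept by Microsoft's Change Journal. The class and field names are hypothetical:

```python
import itertools
import time

class MonitorAgent:
    """Record one journal entry per observed data interaction."""

    def __init__(self):
        self._usn = itertools.count(1)  # unique, strictly increasing IDs
        self.journal = []

    def record(self, path, reason):
        entry = {
            "usn": next(self._usn),  # update sequence number
            "reason": reason,        # e.g. "write", "rename", "delete"
            "path": path,            # the data object involved
            "timestamp": time.time(),
        }
        self.journal.append(entry)
        return entry
```

A classification agent could later read `agent.journal` instead of rescanning the file system, which is the interaction-by-interaction chronicle the paragraph describes.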

“Prior to populating the metabase, portions of the network or subject systems may be quiesced so that data interactions are not permitted before completion of an optional scan of system files, as described in step 106 below. This is done to obtain an accurate point-in-time picture of the data being scanned and to maintain the referential integrity of the system. If the system were not quiesced, data interactions would continue and data would be allowed to move through to mass storage. In some cases, however, the subject system can continue to operate, with instructions or operations stored in a cache. These operations are generally performed after the scan is complete so that data interactions based on the cached operations can be captured by the monitor agent.

A data classification agent may perform the file scanning described at step 106. This may involve traversing the file system of a client to identify data objects, emails, or other information present within the system, and obtaining certain information, such as metadata, regarding each item. Such metadata may include information about data objects or characteristics associated with them, such as: the data owner (e.g., the client, user, or data manager that generated the object), last modified time (e.g., the time of the most recent modification), data size (e.g., the number of bytes of data), data content (e.g., the actual content of the data object), to/from information for emails (e.g., an email sender, recipient, or other group on an email distribution list), creation date (e.g., the date on which the data object was created), file type (e.g., format or application type), last accessed time (e.g., the time the data object was most recently accessed), application type (e.g., the application that generated the data object), location/network (e.g., a current, past, or future location of the data object and the network pathways to/from it), frequency of change (e.g., how often the data object is modified), aging information (e.g., a schedule, which may include a time or date at which the data object is migrated to long-term storage), etc. The scan results may be used to populate the metabase with information about network data (step 108).

“After the metabase is populated, the quiesced network or subject system may be released and normal operation resumed. At step 110, the monitor agent can monitor system operations and record any changes to system data in a change journal database. The change journal database may contain a database of metadata or data changes, as well as log files of data or metadata changes. The data classification agent may periodically consult the change journal database for new entries; any new entries may be examined and, if deemed relevant, written to the metabase (step 112). In other embodiments, change journal entries may be provided to the data classification agent substantially in parallel with their entry into the journal database. This allows the metabase to maintain substantially current information regarding the state of system data at any given time.

As mentioned above, one advantage of such a metabase is reduced time to obtain information, since it eliminates the need to directly access the source data. For example, suppose a system administrator wants to identify data objects that a certain user has accessed, or that have certain content or other characteristics. Rather than searching each file in each directory, a time-consuming process, the administrator can simply search the metabase for such objects and any properties (e.g., metadata) associated with them, resulting in substantial time savings.

“Moreover, using the metabase to satisfy data queries also reduces the demand on network resources, thereby reducing the processing load on the host system. For example, if an administrator wishes to identify certain data objects, querying the metabase rather than the file system effectively removes the host from the query process (i.e., no brute-force scanning of files or folders is required), allowing host computing devices to continue performing host tasks rather than search tasks.
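Querying the metabase in this way can be illustrated with a minimal in-memory sketch: the query runs entirely against stored metadata records, so the host file system is never touched. The function, field names, and sample records are illustrative only:

```python
def query_metabase(metabase, **criteria):
    """Return metabase entries whose fields match all given criteria."""
    return [
        entry for entry in metabase
        if all(entry.get(field) == value for field, value in criteria.items())
    ]

# A toy metabase: metadata records standing in for scanned data objects.
metabase = [
    {"path": "/data/a.doc", "owner": "alice", "type": ".doc"},
    {"path": "/data/b.doc", "owner": "bob", "type": ".doc"},
]
```

For example, `query_metabase(metabase, owner="alice")` returns only the records for objects associated with that user, with no directory traversal involved.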

“FIG. 2 illustrates one embodiment of a client 200 constructed in accordance with principles of the present invention. Client 200 may include a classification agent 202 and a monitor agent 206. In some embodiments, these agents may be combined into an update agent 204, a module that may contain the functionality of both agents. Client 200 may also contain an internal or external data store 209, a metabase 210, and a change journal 212.

Client 200 can be any computing device, or any portion of a computing device, that generates electronic data. Data store 209 generally stores application data, such as production volume data for client 200. Metabase 210, which may be internal or external to client 200, may contain information generated by classification agent 202. Similarly, change journal 212 may contain the interaction information generated by monitor agent 206, as described above.

“In operation, data interactions within client 200 are monitored by update agent 204 or monitor agent 206. Any relevant interactions may be recorded and written to change journal 212. Data classification agent 202 may scan or receive entries from monitor agent 206 and update metabase 210 accordingly. In embodiments where update agent 204 is present, monitored data interactions can be processed substantially in parallel with updates to change journal 212 and written to metabase 210 and data store 209. File system 207 may be used to conduct or process data transfers from client 200 to data store 209.

“FIG. 3 illustrates a system 300 constructed in accordance with principles of the present invention. System 300 may include a memory 302 and an update agent 304, which may include an integrated (or separate) monitor agent 306, along with classification agents 312a and 312b. A content agent 315 may also be included, as well as a monitor program index 310, a metabase 314, and a mass storage device 318.

In operation, monitor agent 306 may monitor data interactions between memory 302 and mass storage device 318. Memory 302 may include random access memory (RAM) or another memory device employed by the system in performing data processing tasks. Certain information in memory 302 may periodically be read from or written to mass storage device 318, which may include a magnetic or optical disk drive, a hard drive, or another storage device known in the art. Monitor agent 306 monitors these data interactions and may, in certain embodiments, include any suitable monitoring or journaling agent, as further described herein.

“As shown, system 300 may also include a file system manager program 316, which may be used to manage file system programs such as operating system file systems (e.g., FAT, NTFS, etc.) and to manage the movement of data between memory 302 and mass storage device 318. In operation, data may be written from memory 302 to mass storage device 318 via file system manager 316. Such an operation may occur, for example, to access data needed to service an application running on a computing device. In this case, monitor agent 306 may capture the interaction and create a record indicating that it occurred, storing the record in index 310. The data involved may be stored in mass storage 318 under the supervision of file system manager 316.

“As shown in FIG. 3, monitor agent 306 may analyze data interactions, such as those between memory 302 and mass storage 318 via file system manager 316, and may record such interactions in monitor index 310. Monitor index 310 may thus represent a list of data interactions, with each entry indicating a change to client data and providing certain information regarding the interaction. In an embodiment using the Microsoft Change Journal, such entries may include a unique identifier such as an update sequence number (USN), certain change journal reason codes identifying the reason(s) for the change, and data or metadata describing the data and data properties, such as data copy types.

As data moves from memory 302 to mass storage 318, or vice versa, monitor agent 306 can create and write an entry to index 310. The entry may then be analyzed by classification agent 312b and classified for entry into metabase 314. In some embodiments, classification agent 312a may be associated with mass storage device 318 (directly or through file system manager 316) and may write metadata entries to mass storage device 318 as well as to metabase 314; in such instances, the metabase information can be stored on mass storage device 318. In an alternative embodiment, classification agent 312b may periodically copy or back up metabase 314 to the storage device under the direction of, and/or pursuant to, a storage policy (not shown). This allows the information in metabase 314 to be quickly restored if it is lost, deleted, or otherwise unavailable.

In some embodiments, monitor agent 306 and optional classification agent 312a may operate together to classify data moving to mass storage device 318 before the data is written to the device, as described further herein. This arrangement allows the data to be written to mass storage device 318 along with its processed metadata. This may occur, for example, in embodiments in which monitor agent 306 and classification agent 312a are combined into update agent 304. Metadata written in this way can be retrieved or accessed from mass storage 318 if needed, such as when metabase 314 is missing certain information or is busy or otherwise unavailable.

“Content agent 315 may be used to obtain or filter data relating to the content of data being moved from memory 302 to mass storage 318. For example, content agent 315 may read the payload of a data item, generate metadata based on the operation for storage in metabase 314, and include a pointer to the data item in mass storage 318. The pointer information may optionally be stored in an index, and the content metadata can be stored in an index or with the data item in mass storage 318. Storing metadata relating to data content in metabase 314 allows content searches to be performed on entries in metabase 314 rather than on the underlying data in mass storage 318. The system can thereby quickly locate content satisfying a query via metabase 314 and, if necessary, retrieve it from mass storage 318.

“Moreover, such content metadata can be used to locate data based on content features throughout a storage system hierarchy: content metadata may be generated and stored at each level of the storage system (primary, secondary, tertiary, etc.) to facilitate the location and retrieval of content-based data. It will be understood that it is the functionality provided by content agent 315, classification agents 312a and 312b, and monitor agent 306 that is significant, rather than the particular modular arrangement shown; these modules can be combined into a single module, or implemented as separate modules, providing some or all of these functions.

“At step 355, the monitor program may be initialized. This may include instantiating a data structure or index for recording interaction entries and assigning a unique journal ID number that allows the system to distinguish among different journaling data structures. The monitor program may include a filter driver or another application that monitors data operations (step 360). In operation, the monitor agent may observe data interactions between memory and mass storage to determine whether certain defined data interactions have occurred; information relating to these interactions may then be recorded in the metabase. In some cases, only certain interactions, or certain aspects of interactions, are captured. The types and aspects to capture can be identified in an interaction definition, which may be, for example, a Microsoft Change Journal reason code, or a definition declared by a user or network administrator describing the data interactions to capture. Some interaction definitions may record all data interactions, regardless of whether any data actually changes. Such information can be used, for example, to identify users or processes that have touched, scanned, or otherwise accessed data without actually changing it.

Interaction definitions can be crafted to capture a wide or narrow range of operations, allowing a user to tailor the monitor program to specific goals. Interaction definitions may describe or define data movement, manipulations, or other interactions that may be of interest to a system administrator or user (e.g., any operation that “touches” data), and the data may be recorded along with the action or operation that caused the interaction (e.g., a read, write, copy, parse, etc.). Interaction definitions may also change over time and may be dynamic, based on entries in the index. If the expected results are not produced, interaction definitions can be modified or added until they are. This can be done by linking certain global interaction definition libraries and selectively enabling those libraries until satisfactory results are achieved; this may be performed upon activation and periodically thereafter, depending on changing requirements or objectives.
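Interaction definitions of varying breadth can be modeled as simple predicates over journal entries, with a set of enabled definitions acting like the selectively linked libraries described above. This is an illustrative sketch, not the patent's implementation; all names are invented:

```python
# Interaction definitions as predicates over a journal entry.
# A real monitor agent would match filter-driver reason codes instead.

def touches_any(entry):
    """Broad definition: record every interaction, even pure reads."""
    return True

def writes_only(entry):
    """Narrow definition: record only interactions that change data."""
    return entry["reason"] in {"write", "create", "delete"}

def monitor(entries, definitions):
    """Keep entries matching at least one enabled interaction definition."""
    return [e for e in entries if any(d(e) for d in definitions)]
```

Swapping `[touches_any]` for `[writes_only]` in the `definitions` list narrows what is captured, mirroring how definitions can be enabled or modified until the desired results are achieved.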

“Moreover, some embodiments may support the use of “user tags,” which allow certain types of information to be tagged so that they can be identified and tracked throughout the system. For example, a user might designate a particular type of data or information, such as project-related information, to be tracked across the system. A user interface (not shown) allows users to specify the information to be tagged, which may be done using any attribute available in the system, such as those mentioned above with respect to the filter drivers or classification agents used in the system. These or other attributes may be used to define tags, and they may be combined using Boolean or other logical operators to form a specific tag expression.

“For example, a user might define a tag by specifying a number of criteria, such as certain system users, data permission levels, a project, and so forth. These criteria can be combined conditionally using logical operators, such as OR and AND operators, to create a tag expression, and the system can then track all information satisfying those criteria. As data passes through monitor agent 306 (or another module within update agent 304), data meeting the criteria can be identified and tagged with a header, flag, or other identifying information known in the art. This identifying information may be carried into metabase 314 and mass storage 318 to allow quick identification. For example, the metabase may maintain entries tracking all items satisfying the criteria, including information about the operations performed on the information, metadata relating to the data content, and the data's location in mass storage 318. This allows the system to search the metabase at a specified level of storage to identify the information and quickly locate it within mass storage device 318 for possible retrieval.
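A tag expression of the kind described, combining attribute criteria with AND and OR operators, might be sketched as follows. The attribute names, the `apollo` project, and the user names are purely hypothetical:

```python
# Small combinators for building Boolean tag expressions over
# attribute criteria, as described above.

def and_(*preds):
    return lambda obj: all(p(obj) for p in preds)

def or_(*preds):
    return lambda obj: any(p(obj) for p in preds)

def attr_is(field, value):
    return lambda obj: obj.get(field) == value

# Hypothetical tag: project "apollo" data owned by alice OR bob.
apollo_tag = and_(
    attr_is("project", "apollo"),
    or_(attr_is("owner", "alice"), attr_is("owner", "bob")),
)

def tag_if_matches(entry, tag_name, tag_expr):
    """Attach an identifying tag to an entry satisfying the expression."""
    if tag_expr(entry):
        entry.setdefault("tags", []).append(tag_name)
    return entry
```

As data records pass through the monitor path, each one can be run through the enabled tag expressions so that matching items carry their tag into the metabase.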

“Next, at step 365, the monitor agent may continue to monitor data interactions based on the interaction definitions until an interaction satisfying one of those definitions occurs. That is, a system according to the present invention may continue monitoring data interactions at steps 360 and 365 until an interaction occurs that meets or corresponds to selection criteria, such as an interaction definition. When a defined interaction occurs, the monitor agent may create a record of the interaction and store it in a monitor index. In some embodiments, an interaction code describing the interaction observed on the data object may be assigned to the record. The monitor program may next identify the data object identifier associated with the data, typically a file reference number (FRN). The FRN may include information such as the location or path of the associated data object. To enrich or enhance the record, additional information associated with the FRN (e.g., data properties, copy properties, storage policy information, etc.) may also be obtained; this may involve obtaining information from master file tables (MFTs) in some cases to enhance the metabase entries. Additional formatting or processing of metabase entries may also be performed in accordance with certain classification paradigms to populate the metabase with the most useful or preferred information.
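The record-creation and FRN-enrichment step just described can be sketched as a lookup against an MFT-like table. The field names, the `complete` flag, and the shape of the `mft` mapping are assumptions made for illustration:

```python
def make_record(usn, interaction_code, frn, mft):
    """Build a journal record and enrich it from an MFT-like table.

    `mft` is a hypothetical dict mapping file reference numbers (FRNs)
    to file properties (path, size, storage policy, ...).
    """
    record = {"usn": usn, "interaction": interaction_code, "frn": frn}
    props = mft.get(frn)
    if props is not None:
        record.update(props)          # e.g. path, data/copy properties
        record["complete"] = True
    else:
        record["complete"] = False    # flag missing enrichment data
    return record
```

A record flagged as incomplete could later be completed, or closed as-is if the additional information never arrives, as the following paragraphs describe.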

“Next, at step 375, the record may be assigned a record identifier, such as a unique update sequence number (USN). This may be used to identify the entry within the index and, in certain embodiments, act as an index memory location, so that a data structure keyed on the USN can locate a specific record quickly. At step 380, additional data or metadata may be combined or otherwise associated with the information above to complete the record.

“In alternative embodiments, the information above may be organized and written to the index in an expected format, or it may be written directly to the record “as received,” including metadata or other information; thus, some records may contain more information than others. Once the record is constructed and deemed complete, it can be “closed” by the system at step 385. If the record is incomplete, the monitor agent or update agent may request or retrieve the additional information needed to complete it. If that information is not received, the monitor agent can place a flag in the record indicating that it contains incomplete information, and the record may be closed notwithstanding.

“At step 410, the classification agent may be initialized. This may include activating the agent, clearing buffers, and/or linking libraries associated with its deployment. Before scanning the interaction records generated by the monitor agent as described above, the classification agent may classify existing stored data; this may include traversing the file and directory structure of a subject system to initially populate the metabase.

“Next, at step 422, the classification agent can, during normal operation, scan the entries in the interaction journal to determine whether new entries have been created since the previous classification processing was completed. This may be done, for example, by determining whether the most recent journal entry is more recent than the last entry analyzed. There are several ways to accomplish this. One way is to scan the date and time information of journal entries and compare it with that of the last entry known to have been analyzed. This can be done iteratively, walking through the journal entries until the entry last analyzed by the classification agent is found; any entries with time information after that point may be considered new or unprocessed (step 440). If the time stamp of the most recent journal entry is the same as that of the last entry analyzed, the system concludes there are no new entries and may return to step 422.

“Another way to identify new journal entries is to compare record identifiers, such as the USN numbers assigned to each journal entry (step 433). A journal entry with a USN higher than that of the last entry analyzed may be considered new or unprocessed. If the USN of the most recent entry is the same as that of the last entry analyzed, the system concludes there are no new entries and returns to step 422 to continue monitoring. This may continue until new entries are found (step 440) or until it is concluded that no new entries are present.
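The USN-based test for new entries reduces to a comparison against the last processed record identifier, as in this minimal sketch (the journal is modeled as a list of dicts, as in the earlier examples):

```python
def new_entries(journal, last_processed_usn):
    """Return journal entries with a USN above the last one processed.

    Entries with higher USNs are new or unprocessed; if no entry
    exceeds `last_processed_usn`, there is nothing to classify.
    """
    return [e for e in journal if e["usn"] > last_processed_usn]
```

Because USNs increase monotonically, this check is cheap and avoids comparing time stamps entry by entry.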

“In other embodiments, rather than scanning the journal data structures for new entries, entries created by the journaling agent may be sent automatically to the classification agent, except in cases where such scanning may be necessary or desirable to verify certain information.

“Next, at step 450, assuming that new journal entries have been identified, the system may determine whether a metabase record already exists for the data object associated with those entries. This may be done by comparing data object identifiers (e.g., FRNs) in the metabase entries with data object identifiers (e.g., FRNs) in the journal entries. These and other data characteristics may be used to match and correlate journal and metabase entries.

If no corresponding metabase record exists, a new record may be created at step 456. This may involve creating a new metabase entry identifier, parsing the journal entry into a predetermined format, and copying portions of the parsed data into the new record (steps 460 and 470), as described further below. Additional metadata and file system information may be attached to enrich the content of the new entry, such as information from an FRN, information derived from an interaction code, etc. (step 480).

“If, on the other hand, a corresponding metabase entry is identified, the new journal entries may be processed as above and may overwrite some or all of the corresponding entry. An updated entry may be given a time stamp indicating that it is a new revision. In some embodiments, even though a corresponding entry exists, a new entry may be created, written to the metabase, and optionally associated with the existing record. The older record may be retained, for example, for historical or diagnostic purposes, and in some cases may also be marked as obsolete or superseded. Such corresponding entries may be linked together via a pointer or other mechanism, so that records relating to the history of a particular data object can be quickly obtained.
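The create-or-update logic of steps 450-480 can be sketched as follows. This is a hedged illustration only: the record fields, the `keep_history` flag, and the in-memory dictionary standing in for the metabase are all assumptions made for the example, not part of the described system.

```python
# Illustrative sketch (steps 450-480): match a journal entry to an
# existing metabase record by its data object identifier (e.g., FRN),
# then either overwrite the record or create a new revision linked to
# the superseded one via a pointer.
import time

metabase = {}  # FRN -> most recent record for that data object

def classify(journal_entry, keep_history=False):
    frn = journal_entry["frn"]
    record = {
        "frn": frn,
        "metadata": journal_entry["metadata"],
        "timestamp": time.time(),   # marks the record as a new revision
        "previous": None,
    }
    existing = metabase.get(frn)
    if existing is None:
        metabase[frn] = record          # step 456: create a new record
    elif keep_history:
        existing["obsolete"] = True     # mark the old record superseded
        record["previous"] = existing   # link revisions via a pointer
        metabase[frn] = record
    else:
        metabase[frn] = record          # overwrite the corresponding entry
    return record
```

Following the `previous` pointers yields the history of a particular data object, as the text above describes.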

“Next, at step 490, the system may process any additional new journal entries by repeating the steps above. If no new entries remain, the system may return to continue monitoring and to perform further scans of the journal data structure.

“FIG. 5 illustrates an embodiment of the present invention in which a secondary processor performs some or all of the data classification functions described herein. System 500 may include a manager module 505 (which may include an index 510), a first computing device 515 (which may include a first processor 520, a journaling agent 530, and a data classification agent 535), and a secondary computing device 540, which may include a second processor 545. System 500 may also include a data store 550, a journal 560, and a metabase 555.

“Computing devices 515 and 540 may be any computing devices described herein, including clients, servers, or other network computers running software such as programs or applications that create, store, or transfer electronic data. In some embodiments, journal 560 and metabase 555 may be stored locally on mass storage. In other embodiments, journal 560 and metabase 555 may be located outside computing device 515, or distributed between computing devices; for example, metabase 555 may be accessed via a network while journal 560 is accessed locally.

Computing device 515 may operate substantially as described for system 300 of FIG. 3, except that the second processor 545 of the second computing device 540 is used to perform certain functions. As shown in FIG. 5, data classification agent 535 and journaling agent 530 may perform substantially the same functions as in FIG. 3; that is, the journaling agent monitors data interactions on computing device 515 and records them in journal 560, and the classification agent processes the journal entries to populate metabase 555.

“However, some functions may be initiated and performed, in whole or in part, by second processor 545. For example, second processor 545 may direct computing operations related to journaling agent 530 or classification agent 535, and these operations may use support resources associated with secondary computing device 540, so that computing device 515 is not significantly impacted. This arrangement may be used to offload certain non-critical tasks from the host system 515 and have them performed by secondary computing device 540.

“For example, some embodiments may allow the secondary computing device to assume the processing burden of some or all of the following tasks: (1) the initial scanning of client files by classification agent 535 and population of metabase 555; (2) ongoing monitoring of data interactions on a computing device (e.g., 515) and generation of interaction records for storage in journal 560; (3) classification and processing of journal information to update metabase 555; and (4) searching or otherwise analyzing and accessing certain information in metabase 555 and/or journal 560. In some cases, however, it may be preferable to assign the secondary computing device certain tasks, such as searching metabase 555, while other tasks, such as updating journal 560 and metabase 555, are performed by the primary computing device.

“Performing such operations using another processor may be desirable, for example, when processor 520 is unavailable, overused, or otherwise heavily utilized, or when it is desired to relieve the primary processor of certain tasks such as those described above. For instance, it may be advantageous to have processor 545 access or search metabase 555 for certain information so that processor 520 remains free to perform other tasks related to programs operating on computing device 515, such as when computing device 515 is busy with other network-related functions.

“In some embodiments, the secondary processor may be located on computing device 515 (e.g., processor 525) and may perform operations in conjunction with processor 545. Some embodiments also include a manager module 505 that coordinates overall operations among the various computing devices. For example, manager module 505 may monitor, or otherwise be aware of, the processing load on each computing device and may assign tasks based on availability (e.g., load balancing). Thus, if processor 520 is idle or operating at low capacity, it may handle a request to search metabase 555, whereas manager 505 may assign the task to processor 545 if processor 520 is performing, or is scheduled to perform, other tasks. Manager 505 may thus act as an arbiter, assigning tasks between processors so that system 500 makes efficient use of its resources.
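The arbitration role of the manager module can be sketched with a simple load-based rule. The load values and the threshold below are invented for illustration; the described system does not specify how load is measured.

```python
# Hedged sketch of manager module 505 acting as an arbiter: route a
# metabase-search task to the primary processor only when it has spare
# capacity, otherwise offload it to the secondary processor (545).

def assign_search_task(primary_load, busy_threshold=0.75):
    """Return which processor should handle the metabase search.

    primary_load: assumed utilization of processor 520, in [0.0, 1.0].
    """
    if primary_load < busy_threshold:
        return "processor_520"   # primary is idle or lightly loaded
    return "processor_545"       # primary is busy; use the secondary
```

A real manager might also weigh the secondary processor's load, task priority, and scheduled work, as the text suggests; this sketch shows only the basic arbitration idea.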

“FIG. 6 is a flowchart illustrating some of the steps involved in searching within a system such as system 500 of FIG. 5. At step 610, the system may receive a query for certain information, and may process and analyze the request. The query may indicate which metabases are to be searched, and/or the manager module may consult an index that includes information about metabase content within the system. The identification process may involve searching for and identifying multiple computing devices in an enterprise or network that might contain information satisfying the search criteria.

“In some embodiments, search queries may automatically be referred to secondary processors to reduce processing demands on the computing devices that created, or are otherwise associated with, the identified metabase(s). Alternatively, the computing device associated with the identified metabases may create or participate in the search operations. To identify responsive metabases, the secondary computing device may consult an index or a manager associated with the other computing devices.

“Next, at step 640, the secondary processor may search the identified metabases for relevant data sets. It may be necessary to perform iterative searches that examine the results of previous searches: the secondary processor may search additional metabases, not previously identified, to locate responsive information missed during the initial search. The initial metabase search may thus serve as a starting point that is expanded based on the collected or returned results. At step 650, the results may be arranged and formatted in a way suitable for further use (e.g., with another application) or for viewing by a user.

“FIG. 7 shows a system 700, constructed according to principles of the invention, that uses a centralized metabase 760 to serve multiple computing devices. As shown, system 700 may include computing devices 715-725, each of which may include a journaling agent (730-740) and a classification agent (745-755), as well as the centralized metabase 760 and, in some embodiments, a manager module 705 and an index 710.

System 700 may operate in a manner similar to system 300 of FIG. 3, except that each computing device 715-725 stores classification entries in the centralized metabase 760 rather than in its own metabase. For example, data classification agents 745-755 may operate substantially as described above, analyzing and processing entries in the journals associated with journaling agents 730-740, but reporting their results to centralized metabase 760. With this arrangement, each classification agent may provide its metabase entries with an ID tag or other indicia identifying the computing device from which the entry originated, facilitating future searches and allowing the entry's owner to be identified.

“Also, each entry in metabase 760 may be assigned a unique identifier for management purposes; this number may, for example, indicate the index offset or location of the entry within centralized metabase 760. In some embodiments, entries may be sent from computing devices 715-725 to metabase 760 on a rolling basis and stored as they arrive. For example, metabase 760 may receive multiple entries from multiple computing devices 715-725 and may be responsible for arranging and queuing such entries for storage in the metabase.
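The queue-then-store behavior described above can be sketched as follows. The class and field names are assumptions for illustration; the described system does not specify a data layout.

```python
# Illustrative sketch: a centralized metabase that receives entries
# from multiple computing devices on a rolling basis, queues them,
# tags each with its source device, and assigns a unique sequential
# identifier that doubles as the entry's index offset.
from collections import deque

class CentralMetabase:
    def __init__(self):
        self.queue = deque()   # entries waiting to be stored
        self.entries = []      # position in this list = index offset

    def submit(self, device_id, metadata):
        # Entries arrive from devices (e.g., 715-725) and are queued.
        self.queue.append({"device_id": device_id, "metadata": metadata})

    def drain(self):
        # Arrange and store queued entries, assigning unique IDs.
        while self.queue:
            entry = self.queue.popleft()
            entry["entry_id"] = len(self.entries)
            self.entries.append(entry)
```

Because each stored entry carries both a `device_id` tag and a unique `entry_id`, later searches can locate an entry by offset and trace it back to its originating device, as the text describes.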

“In certain embodiments, system 700 may include manager module 705, which may be responsible for assigning or removing associations between certain computing devices 715-725 and a specific centralized metabase 760. For example, manager 705 may direct certain computing devices 715-725 to write classification entries to a specific centralized metabase in accordance with certain system preferences, and index 710 may contain information indicating the associations between metabase 760 and computing devices 715-725. This allows system 700 to reassign resources globally or locally to maximize system performance; for example, manager 705 may reassign certain computing devices 715-725 to another metabase by changing the destination address in the appropriate index.

“FIG. 8 is a flowchart 800 illustrating some of the steps involved in using a centralized metabase with multiple computing devices, such as the arrangement shown in FIG. 7. At step 810, a manager module may instantiate a centralized metabase in accordance with certain system management and provisioning policies. This may include securing the storage and management resources necessary for the task, loading certain routines into memory buffers, and notifying the management module that the metabase is ready for operation.

“Next, at step 820, the management module may review system resources, management policies, operating trends, and other information to identify computing devices to associate with the instantiated centralized metabase. This may include identifying the paths from the various computing devices to the metabase, locating the operational policies that govern those devices, and creating logical associations between the identified computing devices and the centralized metabase. Once created, these associations may be stored in a database or index for system management purposes.

“After the metabase is created and associated with the computing devices (step 825), classification agents in each computing device may scan files or other data and populate the centralized metabase as described further herein (step 830). During scanning, a computing device identifier or other indicia may be added to each entry, allowing the metabase to trace each entry back to its source computing device. The centralized metabase may then be populated with entries (step 855) and may communicate with the management module to establish and monitor the list of computing devices served by the centralized metabase. The system may monitor data interactions on the associated computing devices and report them to the centralized metabase on an ongoing, periodic, or rolling basis.

“In some circumstances, the centralized metabase may need to assimilate existing entries or integrate new entries from the computing devices. For example, the centralized metabase may become unavailable or disconnected for a time and then be required to merge a large number of queued entries. In this case, the management module or metabase may inspect the existing metabase entries and communicate with the computing devices to determine: (1) how long each computing device has been disconnected from the metabase; (2) whether queued entries exist at the computing devices that must be processed (e.g., entries that were cached after the centralized metabase became unavailable for write operations); (3) whether duplicate entries exist; and (4) which entries should be integrated (assuming queued entries are present on multiple computing devices).
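The duplicate-detection part of this merge can be sketched simply. This is a hedged illustration: the choice of (device ID, USN) as the duplicate key, and the entry layout, are assumptions made for the example.

```python
# Illustrative sketch of assimilating queued entries after the
# centralized metabase has been unavailable: entries already present
# (same source device, same record identifier) are treated as
# duplicates and skipped; the rest are integrated.

def assimilate(central_entries, queued_entries):
    seen = {(e["device_id"], e["usn"]) for e in central_entries}
    merged = list(central_entries)
    for entry in queued_entries:
        key = (entry["device_id"], entry["usn"])
        if key not in seen:          # skip duplicate entries
            seen.add(key)
            merged.append(entry)
    return merged
```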

Based on these criteria, the management module and centralized metabase may assimilate the relevant entries into the metabase, and the system may then return to normal operation. If the metabase and/or manager module detects a discontinuity in the metadata or index associated with the centralized storage device, the clients, computing devices, or other data sources may be rescanned to replace or repair the incorrect entries. Other embodiments may mark points of discontinuity and allow for interpolation or data healing to extract information from the unknown points.

“FIG. 9 shows a system 900, constructed according to principles of the invention, that includes a computing device that interacts with a network-attached storage (NAS) device. System 900 may include a manager module 905 and an index 910, computing devices 915-925, classification agents 930-940, journaling agents 945-955, and data stores 965. Metabases 970-980 may also be included. System 900 may also include NAS device 995, which may include NAS storage devices 990 and NAS file manager 985; computing device 925 may supervise data transfer to and from NAS 995.

System 900 may operate in a manner similar to system 300 of FIG. 3a, with the addition of the NAS portion on the right-hand side. For example, data classification agents 930-940 may operate substantially as described above, analyzing and processing the entries in the journals associated with journaling agents 945-955 and reporting their results to their respective metabases 970-980, possibly under the supervision of management module 905.

“Data from computing device 925 may be journaled using methods similar to those described herein. For example, journaling agent 955 may be located on computing device 925 and may track data interactions between NAS device 995 and other applications. Journaling agent 955 may be located outside NAS 995 because the proprietary nature of NAS 995 can make it difficult to run other programs on it.

The NAS portion of system 900 may operate somewhat differently. For example, computing device 925 may act as a NAS proxy that moves data files to and from NAS device 995 using a specialized protocol such as the Network Data Management Protocol (NDMP), an open network protocol that allows data backups to be performed over heterogeneous networks. NDMP can improve performance when transferring data across a network by separating the data and control paths while maintaining centralized backup administration.

Journaling agent 955 may record interactions between NAS data and external applications and store those interaction records on computing device 925. This journaling agent may include specialized routines for interpreting and processing data in NAS format. Data classification agent 940 may analyze the journal entries and populate metabase 980, both initially and on an ongoing basis, as described herein.

“Once the initial data has been populated, it may be useful to search the metabases in system 900 for certain information; this is discussed further in connection with FIG. 11. In some embodiments, manager 905 or another system process may handle such a search. It may initially evaluate the search request and consult index 910 or other information stores to determine which metabases within the system contain responsive information. The results of this evaluation may be provided to the computing device processing the search request, and may include pointers, identifiers (such as a metabase ID), or other indicia identifying the responsive metabases, allowing the computing device that submitted the request to contact the identified metabases directly. In other embodiments, manager 905 may process the request itself and provide substantive results to the computing device that submitted the query.

“FIG. 10 is a flowchart illustrating some of the steps involved in copying data to a NAS device in a system such as system 900 of FIG. 9. At step 1010, a copy operation may move data from a computing device to a NAS device. This may include identifying the data to be moved, for example based on a storage or data management policy, taking into account data size, the time of the last data transfer to the NAS, file owner, application type, and other factors.

“Computing device 925 may be used to route data from other network computing devices (not shown) to NAS device 995, supervising the data movement using specialized transfer programs (step 1020). As the data is routed through computing device 925, journaling agent 955 may monitor interactions with NAS 995 and create interaction entries (step 1030). This may include consulting NAS file manager 985 to identify the files on NAS 995 involved in a given data interaction, as described further herein (step 1040). Next, journal entries may be created or modified to reflect the data interactions, as described previously (step 1050). After the interaction journal is scanned, the classification process described herein may be performed to create metabase entries (step 1070), which may be assigned identifiers and used to populate metabase 980 (step 1080).

“As mentioned, in certain situations it may be desirable to search a system having multiple metabases, such as system 900 shown in FIG. 9, for certain information, regardless of whether a NAS device is included. FIG. 11 contains a flowchart 1100 illustrating some of the steps involved in searching multiple metabases in accordance with certain aspects of the invention.

Assume, for example, that a user wishes to find and copy all data satisfying specified criteria, such as data relating to a marketing project that was created and edited over a certain period of time by a particular group of users. The requestor may first construct such a request through a user interface (not shown) and submit it to the system for processing; in some cases, for example when the system performs certain management functions, this may be automated. The system then receives and analyzes the query (step 1110). In some embodiments, this analysis may be performed by the computing device supporting the user interface; in other embodiments, the computing device may simply transmit the request to the system, where a management module (or another system process or computing device) performs the analysis. The analysis may include identifying characteristics of metabase records that would satisfy the chosen criteria.

Next, the system may identify metabases likely to contain records related to the query. This may be done by consulting a management module, which may have a comprehensive view, index information, or a general overview of the metabases within the system. Once a set of metabases has been identified, the management module (or another computing device) may search them for data sets that answer the query and return a results set (step 1130). Optionally, normalization may be performed at step 1140; if normalization is not required, the results may be reported at step 1150. If normalization is required, the system may analyze the results for content and completeness, and, if other metabases are found that may contain information satisfying the search criteria, those metabases may be searched as well. The process may continue iteratively until a substantially complete set of results is obtained. The results may also be normalized even if no additional metabases are involved; this may include functions such as identifying and removing duplicate results, identifying network paths to the data objects, and formatting or arranging the results for further processing (whether by another computing process or by a user). The returned results may then be used, for example, to retrieve the responsive data objects, which may reside on primary or secondary storage devices within the system.
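The search-then-normalize flow of flowchart 1100 can be sketched briefly. The record layout, the match predicate, and the use of a data object's path as the deduplication key are assumptions made for this illustration.

```python
# Illustrative sketch of flowchart 1100: query several identified
# metabases, collect responsive records, then normalize the results
# (here, by removing duplicate records referring to the same data
# object path).

def search_metabases(metabases, predicate):
    results = []
    for mb in metabases:                 # step 1130: query each metabase
        results.extend(r for r in mb if predicate(r))
    return normalize(results)

def normalize(results):
    # step 1140: drop duplicate results, keyed by data-object path
    unique = {}
    for r in results:
        unique.setdefault(r["path"], r)
    return list(unique.values())
```

A fuller implementation might also resolve network paths to the data objects and format the results for a user or another application, as the text describes.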

“In some embodiments, systems and methods according to the present invention may be used to identify and track data interactions of users or groups. For example, a system administrator or user may wish to keep track of all data interactions involving some or all users or groups. These interactions may include read and write operations performed on behalf of the user or group; information, applications, or programs used or accessed by the user or group; and electronic gaming, chat, instant messaging, and other communication interactions. The system may identify, capture, classify, and otherwise track user or group interactions with electronic data, creating a data store or other repository of these interactions and their associated metadata. In some instances, this repository may serve as a “digital or electronic record” that effectively records and catalogs user and group interactions with electronic data and information over a period of time, as further described herein.

“For example, FIG. 11a shows a system that identifies and classifies electronic data and tracks user and group interactions according to principles of the invention. The system may include a computing device 1162, one or more classification agents 1164, one or more journaling agents 1165, a metabase 1166, a change record 1167, and a database 1168.

“In operation, computing device 1162 may be coupled to and interact with various applications, networks, and electronic information, such as multimedia applications 1170, instant messaging/chat applications 1172, network applications 1174 (such as enterprise WAN or LAN applications), the Internet 1176, and gaming applications 1178. These are merely illustrative examples; any other network, application, or type of electronic data suitable for the purposes set forth herein may be added if desired.

“Journaling agent 1165 and classification agent 1164 may work together to detect and record data interactions, as further explained herein. Each type of electronic data interaction (e.g., email, instant messaging, web surfing, Internet search activity, multimedia usage, electronic gaming, etc.) may be identified, captured, classified, and tracked by a different journaling agent 1165 and classification agent 1164, i.e., an interaction-specific journaling agent 1165 or classification agent 1164 dedicated to processing one type of interaction with electronic data. For example, the system may have a first journaling agent 1165 and first classification agent 1164 monitoring network traffic (not shown), and another journaling agent 1165 and classification agent 1164 monitoring another system resource, for example one used for electronic gaming interactions (e.g., recording and classifying games played, opponents played, win/loss records, etc.) or one directed to interactions related to the use of an Internet browser (e.g., tracking pages visited, content, use patterns, etc.). In some embodiments, a single agent may perform some or all of the functions of both a journaling agent 1165 and a classification agent 1164.

“As a user interacts with various types of electronic information, some or all of these interactions may be recorded and stored in database 1168. Metabase 1166 and change record 1167 may be used to record certain aspects of the interactions and may also serve as interaction logs of computing activities.

“For example, a user of computing device 1162 may interact with applications such as multimedia application 1170 or instant messaging application 1172. This may include receiving, viewing, responding to, and sending audio/video files in any format, as well as instant, text, or email messages. Journaling agent 1165 may detect the interactions between these applications and computing device 1162, and classification agent 1164 may classify the interactions and record information about them (e.g., metadata) in metabase 1166.

“Moreover, certain embodiments allow some or all of the content exchanged in, or associated with, these interactions to be captured and stored in database 1168 or in other storage locations within the system. For example, screen shots or summaries of data interactions may be captured, and the system may download content associated with web pages viewed so that it can recreate the original content and interaction without access to the source or original version of the page on the Internet. This may be useful, for instance, when the user wants to revisit previous interactions even though the content is no longer available. As another example, the system may capture or store data associated with other interactions, such as chat transcripts, video game replays, search queries and their results along with the associated content, and songs and movies accessed together with their metadata.

“Moreover, in certain embodiments, administrators or users may wish to track particular applications using specialized classification agents. For example, the multimedia and instant messaging applications described above may each have a dedicated classification agent that analyzes journal records to create entries in metabase 1166. Each classification agent may also have its own metabase or repository for source information (not shown), so that application histories and content can be quickly indexed, searched, and retrieved. In other embodiments, a “universal” classification agent may be used that recognizes the application type (e.g., based on journaling agent entries) and processes the interactions appropriately, which may include routing metadata to one or more specialized metabases.

“As shown in FIG. 11a, computing device 1162 may interact with network applications 1174, such as LAN and WAN applications. These interactions may include interactions with certain distributed programs, such as Microsoft Word or Outlook. Users may also interact with the Internet 1176 and download various web pages and other information. In accordance with an aspect of the present invention, interactions with these networks and applications may likewise be journaled as described above, with certain information regarding the interactions stored in metabase 1166. Database 1168 may also contain portions of the exchanged content, such as Word documents, emails, and web pages, and may thus serve as a record of user interactions with computing device 1162 or other system devices. Such user interactions may be tracked at any network computing device and recorded for any user identified by suitable identifiers.

“A user may retrieve the captured data, review or replay data exchanges, or save such records for future use. For example, a user could store instant messaging conversations for replay or transmission to others. In certain cases, it may be undesirable to record some interactions, such as those involving private or personal information; this may be achieved by “disabling” the appropriate classification agent for a specified period of time, etc.

“Similarly, interactions with gaming applications (networked or stand-alone) may also be recorded, with the appropriate information stored in metabase 1166 and database 1168. For example, a user may be able to retrieve, replay, and transmit saved gaming sequences to others.

“In some cases, database 1168 may become undesirably large. In such cases, some information may be moved to single-instance storage within database 1168, with a pointer placed at the logical address of the instanced information (not shown). Because some entries in database 1168 may be duplicates, this can conserve storage space.
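The single-instancing idea can be sketched with content hashing. This is a hedged illustration only: keying the instanced data by a SHA-256 digest, and the entry layout, are assumptions made for the example rather than details of the described system.

```python
# Illustrative sketch of single-instance storage: identical payloads
# are stored once, keyed by content hash, and each database entry
# keeps only a pointer (here, the hash) to the instanced data.
import hashlib

single_instance_store = {}   # content hash -> the single stored copy

def instance(entry):
    digest = hashlib.sha256(entry["content"]).hexdigest()
    single_instance_store.setdefault(digest, entry["content"])
    entry["content"] = None      # release the duplicate copy
    entry["pointer"] = digest    # logical address of the instanced data
    return entry
```

Two entries with identical content end up sharing one stored copy, with each entry retaining a pointer from which the content can be retrieved.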

Flowchart 1200 in FIG. 12 illustrates some of the steps involved in the above-described method, which may include the following. At the outset, a group or user of particular interest may be identified based on user-related information or other network characteristics (step 1210). These characteristics may include Active Directory privileges, network login, machine ID, or biometrics associated with a member of a group, and may be combined or linked together to create a profile of a user or group. These profiles may be stored in a database, index, or management module of the system and used to create classification definitions: the system may compare the data elements involved in an interaction against the profile information to determine whether they are related and to classify them accordingly (step 1220).

These associations may be stored in a metabase that keeps track of the interactions of a given user or group. In one embodiment, such a metabase represents a list of data interactions for a specific group or user, so that a list of data items touched by that user or group can be obtained quickly if desired.

“In operation, the system may monitor data interactions on a given computing device through the use of journaling agents or the like. A classification agent may analyze the interactions and associate them with one or more profiles (step 1230), and the association may be recorded in an identified metabase that tracks the interactions of the user or group (step 1240). The recorded association may include references to the data objects identified, the attributes compared, and the reason for the association. The journaling agent may continue to monitor data interactions throughout operation, ensuring that each metabase remains up to date and accurately represents the data touched by a particular user or group. The identified metabases may be associated with a user or group, e.g., by storing an indicator of the association in an index (step 1250).
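The profile-matching steps above can be sketched as follows. The profile fields (network logins and machine IDs) and the group name are invented for this illustration; a real profile might combine many more characteristics, such as Active Directory privileges or biometrics.

```python
# Illustrative sketch of the profile-based classification described
# above: compare the attributes of a data interaction against stored
# user/group profiles and record the association in the metabase that
# tracks that user's or group's interactions.

profiles = {
    "marketing": {
        "logins": {"alice", "bob"},
        "machine_ids": {"ws-01", "ws-02"},
    },
}

def associate(interaction, user_metabases):
    """Append the interaction to the metabase of every matching profile."""
    for group, profile in profiles.items():
        if (interaction["login"] in profile["logins"]
                or interaction["machine_id"] in profile["machine_ids"]):
            user_metabases.setdefault(group, []).append(interaction)
    return user_metabases
```

Each per-group list then serves as the quick "data items touched by this group" view that the text describes.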

“FIG. 13 shows a system 1300, constructed according to principles of the invention, for communicating metadata and/or data objects between multiple computing devices. System 1300 may generally include first and second computing devices 1310 and 1320, respectively, as well as associated data stores 1330 and 1340 and metabases 1350 and 1360. The computing devices of system 1300 may store data objects and metadata in their respective data stores and metabases, as described further herein. In certain circumstances, however, it may be desirable to transfer metadata between metabases 1350 and 1360, or data between data stores 1330 and 1340, for example to migrate certain data between computing devices, to instantiate an application in another location, or to copy or back up certain data objects and their associated metadata.

“FIG. 14 shows a flowchart 1400 illustrating some of the steps involved in moving data between computing devices. At step 1410, data objects and/or associated metadata may be identified for movement between computing devices. This may be done by querying for certain data; for example, a query may be constructed to search for particular data objects and/or associated metadata.

Summary for “Systems and methods of classifying and transmitting information in a storage system”

Aspects of the invention relate to operations on electronic data within a computer network. In particular, the invention relates to detecting data interactions within a computer system and/or performing storage-related operations on a computer network according to a specified classification paradigm.

Current storage management systems perform a variety of storage operations on electronic data. Data may be stored as a primary copy or as one or more secondary copies, including backup copies, snapshot copies, hierarchical storage management ("HSM") copies, archive copies, and other types of copies.

A primary copy is generally a production copy or other "live" copy of the data: the copy actually used by the software application, usually in the native application format. Primary copy data may be stored in local memory or another high-speed storage device that allows relatively fast data access when necessary. Such primary copy data is typically intended for short-term retention (e.g., several hours or days) before some or all of the data is stored as one or more secondary copies.

"Secondary copies" are point-in-time copies of data. They are typically intended for long-term retention (e.g., weeks, months, or years, depending upon retention criteria described in a storage policy) before some or all of the data is moved to other storage or discarded. Users can browse secondary copies and retrieve data from them at a later time. A pointer or other location indicator, such as a stub, may be placed in the primary copy to indicate the current location of the data.

A backup copy is one type of secondary copy. A backup copy is generally a point-in-time copy of primary copy data, stored in a backup format rather than in the native application format; for example, a backup copy may be stored in a backup format optimized for compression and long-term storage. Backup copies generally have relatively long retention periods and may be stored on media that is slower to retrieve than other media. In some cases, backup copies are stored at an offsite location.

A snapshot copy is another type of secondary copy. A snapshot may be thought of as an instant image of primary copy data at a given point in time: it captures the directory structure and contents of a primary copy volume at a particular moment. A snapshot may exist in parallel with an actual file system, and its record of files and directories is typically made accessible to users as a read-only file system. Users can also restore primary copy data from a snapshot taken at a given point in time, returning the current file system to the state it was in when the snapshot was created.

A snapshot can be created quickly and with very little file space, yet may still function as a backup of the file system. Although a snapshot does not create a physical copy of all data, it may create pointers that map files and directories to particular disk blocks.

In some embodiments, once a snapshot is taken, subsequent changes to the file system do not overwrite the blocks that were in use at the time of the snapshot. The initial snapshot may require only a small amount of disk space to store a mapping or other data structure that tracks the blocks corresponding to the current state of the file system. Additional disk space is typically needed only if files or directories are later modified; moreover, a modified file need only copy the pointers to changed blocks, not the blocks themselves. In some embodiments, such as copy-on-write snapshots, when a block is about to change in primary storage, the original block is copied to secondary storage before being overwritten in primary storage, and the snapshot's mapping of file data is updated to reflect the changed block(s) at that particular point in time.

An HSM copy is generally a copy of primary copy data, but it typically includes only a subset of the primary copy data and is usually stored in a format other than the native application format. For example, an HSM copy might include only data that is larger than a given size threshold or older than a given age threshold, stored in a backup format. HSM data is often removed from the primary copy, and a stub is stored in the primary copy to indicate its new location. Systems use the stub to locate the HSM data after it has been removed or migrated, making recovery of the data transparent even though the HSM data may be stored in a location different from that of the primary copy.
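The copy-on-write behavior described above can be sketched in a few lines of Python. This is an illustrative simplification, not an implementation of any particular product: dictionaries stand in for disk blocks, and only blocks about to be overwritten are copied aside:

```python
class CopyOnWriteSnapshot:
    """Sketch of copy-on-write: the snapshot holds pointers into primary
    storage until a block changes, then preserves only the changed blocks."""
    def __init__(self, primary):
        self.primary = primary      # block number -> data (primary storage)
        self.preserved = {}         # blocks copied aside before being overwritten

    def write(self, block, data):
        # Before the block is overwritten in primary storage, copy it once.
        if block not in self.preserved:
            self.preserved[block] = self.primary[block]
        self.primary[block] = data

    def read_snapshot(self, block):
        # Unchanged blocks are still read from primary via the pointer mapping.
        return self.preserved.get(block, self.primary[block])

primary = {0: "AAA", 1: "BBB"}
snap = CopyOnWriteSnapshot(primary)
snap.write(0, "XXX")
print(snap.read_snapshot(0), primary[0])  # AAA XXX
print(snap.read_snapshot(1))              # BBB
```

Note that the snapshot consumes space only for block 0, the one block that changed, while block 1 is never duplicated.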

An archive copy is similar to an HSM copy, except that data meeting the criteria for removal from the primary copy is usually completely removed, with no stub remaining in the primary copy to indicate where the data has been moved. An archive copy of data is generally stored in a backup format or another non-native application format. Archive copies are often retained for very long periods of time (e.g., years) and are sometimes never deleted; they may be kept for extended periods to satisfy regulatory compliance or other permanent-storage requirements.

In certain embodiments, application data moves over its lifetime from more expensive quick-access storage to less expensive slower-access storage, a process sometimes called information lifecycle management ("ILM"), in which data is said to be "aged." As data becomes less critical or less frequently used over time, it may be moved among various tiers of secondary storage, some offering faster access and restore times than others.

Examples of different types of data, and of copies of such data, are described in the related applications referenced above, which are incorporated herein by reference in their entirety. One example of a system that stores electronic data and produces such copies is the QiNetix storage management system offered by CommVault Systems of Oceanport, N.J.

The QiNetix storage system uses a modular storage management architecture that may include, among other things, storage manager components, client or data agent components, and media agent components, as further described in the U.S. patent applications referenced above.

Regardless of where data is stored, traditional storage management systems perform storage operations based on location-specific criteria. For example, data generated by applications running on a client is typically copied according to location-specific criteria, such as from a specific folder or subfolder in a given data path, and the transfer of that data to another storage location may be managed by a module installed on the client. Similarly, when restoring data from secondary storage to primary storage, data transfers are made according to location-specific criteria: a user or process restoring data must specify the secondary storage device, media, archive file, and other location-related criteria. Thus, storage operations are specified or defined based on data location rather than on information relating to or describing the data itself, which limits the precision with which traditional storage management systems can perform storage operations on electronic data.

Moreover, traditional storage systems typically scan files on clients or other computing devices to identify data objects on which to perform storage operations. This may involve discovering file and/or folder attributes through the client's file system before a storage operation is performed, a time-consuming process that can consume significant client resources, resources that might be better spent on tasks associated with production applications. Systems and methods are therefore needed that enable more efficient and precise storage operations.

Aspects of the invention generally relate to systems and methods for analyzing, classifying, and storing various kinds of data, which, among other things, facilitates the identification, searching, storage, and retrieval of data that satisfies certain criteria. Although described in connection with certain specific embodiments, it will be understood that the inventions disclosed herein have broad applicability to any wireless or wired network or data transfer system that stores or conveys data, including enterprise networks, storage networks, and the like.

Among these aspects are systems and methods that facilitate and improve enterprise data management. Systems and methods that evaluate and define data management operations based on data characteristics rather than data location are disclosed. Methods for generating a metadata data structure that describes system data and storage operations are also provided; this data structure may be consulted to identify changes in system data rather than scanning individual data files.

Generally speaking, the systems and methods described below analyze data and other information in a computer network (sometimes referred to herein as a "data object") and create a database of that information. For example, a data collection agent may traverse a network file system to obtain certain characteristics and other attributes of data in the system. In some embodiments, this database may be described as a collection of metadata and/or other information about the network data, referred to herein as a metabase. Metadata generally refers to data or information about data, and may include, for example, data relating to storage operations or storage management, such as data locations, storage management components associated with data, storage devices used to perform storage operations, index data, data type, or other data.
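A metabase record of the kind described above might be sketched as a small Python data structure. The field names below are hypothetical illustrations of the metadata categories mentioned in the text, not a schema from the described system:

```python
from dataclasses import dataclass, field

@dataclass
class MetabaseEntry:
    """Hypothetical metabase record: metadata *about* a data object,
    kept separately from the data object itself."""
    path: str                 # location of the data object
    size: int                 # number of bytes of data
    owner: str                # user or data manager associated with the object
    last_modified: str        # time of most recent modification
    data_type: str            # type of data (e.g. email, document)
    storage_device: str       # storage-management metadata: where copies reside
    extra: dict = field(default_factory=dict)  # any other attributes

metabase = []
metabase.append(MetabaseEntry(
    path="/clients/a/mail/inbox.pst", size=52_428_800, owner="alice",
    last_modified="2006-11-28T09:15:00", data_type="email",
    storage_device="primary-disk-01"))
print(len(metabase), metabase[0].data_type)  # 1 email
```

Queries about network data can then be answered from this collection of records without touching the underlying files.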

With this arrangement, system administrators or other system processes can consult the metabase for information about network data rather than iteratively accessing and analyzing each data item. This significantly reduces the time required to obtain data object information by substantially eliminating the need to access the source data, and it minimizes the use of network resources, making the process more efficient and less burdensome for the host system.

Various embodiments of the invention are described in detail below. Although specific details are given, one skilled in the art will understand that the invention may be practiced without them. Some well-known functions or structures may not be shown or described in detail, so as to avoid unnecessarily obscuring the various embodiments.

The terminology used in the description below is intended to be interpreted in its broadest reasonable manner, even where it is used in conjunction with certain specific embodiments of the invention. Certain terms may even be emphasized below; however, any terminology intended to be interpreted in a restricted manner will be overtly and specifically defined as such in this Detailed Description section.

FIG. 1 illustrates some of the steps involved in one embodiment. To perform certain functions, it may be necessary to deploy data classification software on computing devices within the network (step 102). This may be accomplished by installing the software on client computers or servers within the network; in some cases, the classification software may be installed globally, or on selected computing devices as desired. The classification software can monitor data objects generated by the computers and classify this information as appropriate.

Next, at step 104, a monitor agent may be initialized. The monitor agent, which may be deployed on each computing device in much the same way as the classification agents described above, may be installed and configured to monitor and record certain data interactions within each machine or process. For example, the monitor agent may include a filter driver program installed in an input/output stack, operating in conjunction with a file management program to record interactions with computing device data. This may involve creating a data structure, such as a record or journal, for each interaction; such records may be stored in a journal data structure and may chronicle data interactions on an interaction-by-interaction basis. A journal entry may include information regarding the type of interaction and certain properties of the data involved. One example of such a monitor program is the Change Journal provided by Microsoft, or a similar program.

Prior to populating the metabase, portions of the network and subject system may be quiesced so that no data interactions are permitted until the optional scan of system files described in step 106 is complete. This is done to obtain an accurate point-in-time picture of the data being scanned and to maintain the referential integrity of the system. If the system were not quiesced, data interactions would continue and data would move through to mass storage during the scan. In some embodiments, however, the subject system may continue to operate, with instructions or operations queued in a cache; these operations are typically performed after the scan is complete so that data interactions based on the cached operations are captured by the monitor agent.

The file scan described in step 106 may be performed by a data classification agent, and may involve traversing the file system of a client to identify data objects, emails, or other information present in the system and to obtain certain information (metadata) regarding that information. Such metadata may include information about data objects or their characteristics, such as: the application that generated the data object; the client, user, or data manager associated with it; creation date (e.g., the time at which the data object was created); last-modified time (e.g., the time of the most recent modification); data object size (e.g., the number of bytes of data); information about the content of the data; to/from information for emails (e.g., an email sender, recipient, or group on an email distribution list); file type (e.g., format or application type); last-accessed time; location/network (e.g., a current, past, or future location of the data object and network pathways to and from it); frequency of change; and aging information (e.g., a schedule, which may include a time or date at which the data object is to be migrated to long-term storage). The information collected by the scan may then be used to populate the metabase with information regarding network data (step 108).
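As an illustrative sketch only (the function and field names are hypothetical simplifications of the attributes listed above), such a scan might walk a directory tree and collect per-file metadata without reading file contents:

```python
import os
import time

def scan_file_system(root):
    """Sketch of the initial scan (step 106): traverse the tree and
    collect metadata only, for later insertion into a metabase."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            entries.append({
                "path": full,                              # location / pathway
                "size": st.st_size,                        # number of bytes
                "last_modified": time.ctime(st.st_mtime),  # most recent change
                "last_accessed": time.ctime(st.st_atime),
                "file_type": os.path.splitext(name)[1],    # format hint
            })
    return entries  # used to populate the metabase (step 108)
```

Because only `os.stat` attributes are gathered, the scan cost scales with the number of files rather than the volume of data.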

After the metabase is populated, the network or subject system may be released from its quiesced state and normal operation may resume. At step 110, the monitor agent may monitor system operations and record changes to system data in a change journal database, which may include a database of metadata or data changes and may comprise log files of such changes. The data classification agent may periodically consult the change journal database for new entries; new entries may be examined and, if deemed relevant, written to the metabase (step 112). In other embodiments, change journal entries may be provided substantially in parallel to the journal database and to the data classification agent, allowing the metabase to maintain substantially current information regarding the state of system data at any given point in time.

As mentioned above, one advantage of such a metabase is reduced time to obtain information, since it substantially eliminates the need to access the source data directly. For example, suppose a system administrator wishes to identify data objects that a certain user has accessed and that contain certain content or other characteristics. Rather than search each file in every directory, a very time-consuming process, the administrator may simply search the metabase for such data objects and any properties (e.g., metadata) associated with them, resulting in substantial time savings.

Moreover, using the metabase to satisfy data queries reduces the demand on network resources and the processing burden on the host system. For example, if an administrator wishes to identify certain data objects, querying the metabase rather than the file system effectively removes the host from the query process (i.e., no brute-force scanning of files and directories on the host is required), allowing host computing devices to continue performing host tasks rather than expending resources on search tasks.
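The administrator's query in the two paragraphs above reduces to a filter over metabase records. In this hypothetical sketch (field names assumed, not taken from the described system), the host file system is never consulted:

```python
def query_metabase(metabase, user=None, min_size=0):
    """Answer a query from metabase records alone; the host's file
    system is never touched, so no brute-force scan is needed."""
    return [e for e in metabase
            if (user is None or e["owner"] == user) and e["size"] >= min_size]

metabase = [
    {"path": "/data/a.doc", "owner": "alice", "size": 10_000},
    {"path": "/data/b.doc", "owner": "bob",   "size": 99_000},
    {"path": "/data/c.doc", "owner": "alice", "size": 250_000},
]
hits = query_metabase(metabase, user="alice", min_size=100_000)
print([e["path"] for e in hits])  # ['/data/c.doc']
```

The same question asked of the file system directly would require walking every directory and calling `stat` on every file.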

FIG. 2 illustrates one embodiment of a client 200 constructed in accordance with principles of the present invention. Client 200 may include a classification agent 202 and a monitor agent 206. In some embodiments, these agents may be combined into an update agent 204, a module that contains the functionality of both agents. Client 200 may also include or be associated with an internal or external data store 209, a metabase 210, and a change record 212.

Client 200 may be any computing device, or any portion of a computing device, that generates electronic data. Data store 209 generally stores production volume data used by client 200, such as application data. Metabase 210, which may be internal or external to client 200, may contain information generated by classification agent 202, and change record 212 may contain information generated by monitor agent 206 as described above.

In operation, data interactions occurring within client 200 may be monitored by update agent 204 or monitor agent 206. Any relevant interactions may be recorded and written to change record 212. Classification agent 202 may scan or receive the entries recorded by monitor agent 206 and update metabase 210 accordingly. In embodiments in which update agent 204 is present, monitored data interactions may be processed substantially in parallel with updates to change record 212 and written to metabase 210 and data store 209. Data moving between client 200 and data store 209 may be conducted or processed through file system 207.

FIG. 3 illustrates a system 300 constructed in accordance with principles of the present invention. System 300 may include a memory 302; an update agent 304, which may include a separate or integrated monitor agent 306; classification agents 312a and 312b; a content agent 315; a monitor program index 310; a metabase 314; and a mass storage device 318.

In operation, monitor agent 306 may monitor data interactions between memory 302 and mass storage device 318. Memory 302 may include random access memory (RAM) or another memory device used by the client to perform data processing tasks. Certain information stored in memory 302 may periodically be read from or written to mass storage device 318, which may include a magnetic or optical disk drive, a hard drive, or another storage device known in the art. Monitor agent 306 monitors these data interactions and, in certain embodiments, may include any suitable monitoring or journaling agent as further described herein.

As shown, system 300 may also include a file system manager program 316, which may manage file system operations (e.g., FAT, NTFS, etc.) and may be used to manage data movement to and from mass storage device 318. In operation, data may be written from memory 302 to mass storage device 318, and vice versa, using file system program 316, for example to access data needed to service an application running on a computing device. Monitor agent 306 may capture such an interaction and create a record indicating that it occurred; the record may be stored in index 310, while the data itself is stored in mass storage 318 under the supervision of file system manager 316.

As shown in FIG. 3, monitor agent 306 may analyze data interactions, such as those between memory 302 and mass storage 318 occurring via file system manager 316, and may record such interactions in monitor index 310. Monitor index 310 may thus represent a list of data interactions, with each entry indicating a change to client data and providing certain information regarding the interaction. In an embodiment using the Microsoft Change Journal, for example, these entries may include a unique identifier such as an update sequence number (USN), certain change journal reason codes identifying the reason(s) associated with the change, and data or metadata describing the data and data properties, such as data copy types.
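A monitor index of this kind can be sketched as follows. The class is hypothetical; the three reason-code values shown happen to mirror NTFS Change Journal reason flags (`USN_REASON_DATA_OVERWRITE`, `USN_REASON_DATA_EXTEND`, `USN_REASON_FILE_CREATE`), but the mapping here is only an illustrative subset:

```python
import itertools

class MonitorIndex:
    """Sketch of a change-journal-style index: one entry per data
    interaction, keyed by a monotonically increasing USN."""
    REASONS = {0x01: "DATA_OVERWRITE", 0x02: "DATA_EXTEND",
               0x100: "FILE_CREATE"}

    def __init__(self, journal_id):
        self.journal_id = journal_id   # distinguishes journaling structures
        self._usn = itertools.count(1) # unique update sequence numbers
        self.entries = []

    def record(self, frn, reason_code):
        usn = next(self._usn)
        self.entries.append({"usn": usn, "frn": frn,
                             "reason": self.REASONS.get(reason_code, "UNKNOWN")})
        return usn

idx = MonitorIndex(journal_id=7)
idx.record(frn=0x2A, reason_code=0x100)      # file created
usn = idx.record(frn=0x2A, reason_code=0x01) # same file overwritten
print(usn, idx.entries[-1]["reason"])  # 2 DATA_OVERWRITE
```

The file reference number (FRN) ties each journal entry back to a data object, while the USN gives later consumers a stable ordering over interactions.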

Thus, as data moves from memory 302 to mass storage 318, or vice versa, monitor agent 306 may create and write an entry to index 310, and the entry may then be analyzed and classified by classification agent 312b for entry into metabase 314. In some embodiments, classification agent 312a may be associated with the mass storage device (directly or through file system manager 316) and may write metadata entries to metabase 314 as well as to mass storage device 318; in some cases, metabase information may be stored on mass storage device 318. Alternatively or in addition, classification agent 312b may periodically copy or back up metabase 314 to the storage device under the direction of, or pursuant to, a storage policy (not shown). This allows the information in metabase 314 to be quickly restored if it is lost, deleted, or otherwise unavailable.

In some embodiments, monitor agent 306 and optional classification agent 312a may operate together to classify data moving to mass storage device 318, as described further herein, before the data is written to device 318. With this arrangement, data may be written to mass storage device 318 along with the processed metadata. This may occur, for example, in embodiments in which monitor agent 306 and classification agent 312a are combined into update agent 304. This allows metadata to be written such that it can be retrieved or accessed from mass storage 318 if needed, for example when metabase 314 is missing certain information or is busy or otherwise unavailable.

Content agent 315 may be used to obtain or filter data relating to the content of data being moved from memory 302 to mass storage 318. For example, content agent 315 may read the payload of the data involved in a given operation and generate content-based metadata for storage in metabase 314, which may include a pointer to the data item in mass storage 318. Optionally, the pointer information may instead be kept in an index, or the metadata may be stored with the data item in mass storage 318. Storing metadata relating to data content in metabase 314 allows content searches to be performed on entries in metabase 314 rather than on mass storage 318 itself. This allows the system to quickly locate content satisfying a query via metabase 314 and, if desired, retrieve it from mass storage 318.

Moreover, such metadata can be used to locate data based on content characteristics within a storage system hierarchy: content metadata may be generated and stored at the various levels of the storage system (primary, secondary, tertiary, etc.) to facilitate the location and retrieval of data based on content. It will be understood that the important aspect of content agent 315, classification agents 312a and 312b, and monitor agent 306 is the functionality they provide; these modules may be combined into a single module, or implemented as separate modules each providing some or all of the functionality.

In operation, the monitor program may first be initialized (step 355). This may include designating a data structure or index in which interaction entries are to be recorded, and assigning a unique journal ID number that allows the system to distinguish among multiple journaling data structures. The monitor program, which may include a filter driver or other application that monitors data operations (step 360), may observe data interactions between memory and mass storage to determine whether certain defined data interactions have occurred, and information relating to those interactions may be recorded in the metabase. In some cases, only certain interactions, or certain aspects of interactions, are captured. The types and aspects to capture may be specified in an interaction definition, which may be, for example, a Microsoft Change Journal reason code or a declaration by a user or network administrator of the data interactions to capture. Some interaction definitions may record all data interactions regardless of whether any data actually changes; such information can be used, for example, to identify users or processes that have "touched," scanned, or otherwise accessed data without actually changing it.

Interaction definitions may be constructed to capture a broad or narrow range of operations, allowing a user to tailor the monitor program to specific goals. Interaction definitions may describe or define data movement, manipulations, or other interactions that may be of interest to a system administrator or user (e.g., any operation that "touches" data), and may record the data involved along with the action or operation giving rise to the interaction (e.g., read, write, copy, parse, etc.). Interaction definitions may change over time and may be dynamic based on entries to the index. For example, if expected results are not obtained, interaction definitions may be modified or added until they are. This may be accomplished by linking certain global libraries of interaction definitions and selectively enabling them until satisfactory results are achieved, which may be done upon activation and periodically thereafter as requirements or objectives change.
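One simple way to model interaction definitions of this kind is as named predicates that can be enabled selectively, broadening or narrowing what gets captured. The names and structure below are purely illustrative assumptions:

```python
# Each interaction definition is a predicate over an operation; an
# interaction is captured when any *enabled* definition matches, so the
# set of enabled definitions can be tuned until results are satisfactory.
definitions = {
    "writes_only": lambda op: op["action"] in ("write", "delete"),
    "touch_anything": lambda op: True,  # capture even reads that change nothing
}
enabled = ["writes_only"]

def should_capture(op):
    return any(definitions[name](op) for name in enabled)

print(should_capture({"action": "read"}))   # False
enabled.append("touch_anything")            # broaden the definition set
print(should_capture({"action": "read"}))   # True
```

Enabling `touch_anything` corresponds to the case described above of recording accesses that do not change any data.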

Moreover, some embodiments provide for "user tagging" of data, whereby certain types of information are tagged so that they can be identified and tracked throughout the system. For example, a user may designate a particular type of data or information, such as project-related information, to be tracked across the system. A user interface (not shown) allows users to specify the information to be tagged, which may be defined in terms of any attribute available in the system, such as the attributes mentioned above with respect to the filter and classification agents. These or other attributes may be used to define tags, and may be combined with Boolean or other logical operators to create a specific tag expression.

For example, a user may define a tag by specifying a number of criteria, such as certain system users, data permission levels, a project, and so forth. These criteria may be combined conditionally with logical operators, such as AND and OR operators, to form a tag, and the system may track all information satisfying the specified criteria. As data moves through monitor agent 306 (or another module within update agent 304), data satisfying the criteria may be identified and tagged with a header, flag, or other identifying information as known in the art. This identifying information may be copied to metabase 314 or mass storage 318 to facilitate quick identification. For example, the metabase may contain entries tracking all items satisfying the criteria, which may include information describing the operations performed on the information, metadata relating to the data content, and the location of the data within mass storage 318. This allows the system to search the metabase at a given level of storage to identify such information and quickly locate it within mass storage device 318 for possible retrieval.
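A tag expression of the kind described, criteria conditionally combined with logical operators, might look like the following sketch. The criteria names and the tag label are hypothetical examples:

```python
def make_tag(*, users=None, project=None, min_permission=None):
    """Build a tag expression by AND-ing whichever criteria are supplied;
    omitted criteria are simply not constrained."""
    def matches(item):
        return ((users is None or item["user"] in users) and
                (project is None or item["project"] == project) and
                (min_permission is None or item["permission"] >= min_permission))
    return matches

audit_tag = make_tag(users={"alice", "bob"}, project="apollo")
item = {"user": "alice", "project": "apollo", "permission": 3}
if audit_tag(item):
    item["tag"] = "apollo-audit"  # flag the data so it is tracked system-wide
print(item["tag"])  # apollo-audit
```

OR-combinations can be expressed the same way by wrapping several such expressions in `any(...)`.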

Next, at step 365, the monitor agent may continue to monitor data interactions until an interaction satisfying a defined interaction definition occurs. That is, a system constructed according to the present invention may continue monitoring data interactions (steps 360 and 365) until an interaction occurs that meets or corresponds to defined selection criteria, such as an interaction definition. When a defined interaction occurs, the monitor agent may create a record of it, which may be stored in a monitor index. In some embodiments, an interaction code describing the interaction observed on the data object may be assigned to the record. The monitor program may next identify the data object identifier associated with the data involved, typically a file reference number (FRN), which may include information such as the path or location of the associated data object. Additional information associated with the FRN (e.g., data properties, copy properties, storage policy information, etc.) may also be obtained to enrich or enhance the record; in some cases, this may include consulting master file tables (MFTs) to enhance metabase entries. Additional formatting or processing of metabase entries may also be performed in accordance with certain classification paradigms so that the metabase is populated with optimal or preferred information.

Next, the record may be assigned a record identifier, such as a unique update sequence number (USN), which may identify the entry within the index and, in certain embodiments, may act as an index memory location, so that a data structure keyed by USN allows a particular record to be located quickly. At step 380, additional data or metadata may be combined with or added to the information described above to complete the record.

In alternative embodiments, the information described above may be written to the index and arranged into an expected format, or it may be written directly to the record "as received," including metadata or other information, such that some records contain more information than others. Once the record is constructed and deemed complete, it may be "closed" at step 385. If the record is incomplete, the monitor agent or update agent may request or otherwise obtain the additional information needed to complete it. If such information is not received, the monitor agent may place a flag in the record indicating that it contains incomplete information, and the record may then be closed notwithstanding the deficiency.

FIG. 4 illustrates some of the steps that may be involved in classifying journal entries according to one embodiment. At step 410, the classification agent may be initialized, which may include activating and/or clearing buffers and/or linking libraries associated with deployment of the agent. Before scanning the interaction records generated by the monitor agent as described above, the classification agent may classify existing stored data, which may include traversing the file and directory structure of the subject system to initially populate the metabase.

Next, at step 422, during normal operation the classification agent may scan the entries of the interaction journal to determine whether any new entries have been created since the last classification process was completed. This may be done, for example, by determining whether the most recent journal entry is older or more recent than the last entry analyzed, and may be accomplished in several ways. One way is to examine the time and date information of the last journal entry analyzed and compare it with the date of the most recent entry in the journal. If the most recent journal entry was created after the prior classification process, the journal may be traversed, entry by entry, to locate the last entry previously analyzed by the classification agent; any entries with time information after that point may be considered new or unprocessed (step 440). If the time stamp of the last entry analyzed is the same as that of the most recent journal entry, the system may conclude that no new entries exist and return to step 422.

Another way to identify new journal entries is to compare record identifiers, such as the USNs assigned to each journal entry (step 433). A journal entry with a USN higher than that of the last entry analyzed may be considered new or unprocessed. If, on the other hand, the USN of the most recent entry is the same as that of the last entry analyzed, the system may conclude that no new entries exist and return to step 422 to continue monitoring. This may continue until new entries are found (step 440).
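Because USNs increase monotonically, the "new entries" test described above reduces to a single comparison per journal entry. A hypothetical sketch:

```python
def new_entries(journal, last_processed_usn):
    """Entries whose USN exceeds the last one analyzed are new/unprocessed
    (step 440); an empty result means return to monitoring."""
    return [e for e in journal if e["usn"] > last_processed_usn]

journal = [{"usn": 101, "frn": 1}, {"usn": 102, "frn": 2},
           {"usn": 103, "frn": 1}]
fresh = new_entries(journal, last_processed_usn=101)
print([e["usn"] for e in fresh])  # [102, 103]
fresh2 = new_entries(journal, last_processed_usn=103)
print(len(fresh2))  # 0 -> no new entries; continue monitoring
```

Unlike the timestamp approach, this needs no walk back through the journal to find the last analyzed entry; the saved USN alone suffices.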

In other embodiments, rather than the classification agent scanning the journal data structure for new entries, entries created by the monitor agent may be automatically sent to the classification agent, except where such scanning is necessary or desirable to verify certain information.

“Next, at step 450, assuming new journal entries have been identified, the system may determine whether a metabase record already exists for the data object associated with those entries. This may be done by comparing data object identifiers (e.g., FRNs) in the metabase entries with those in the journal entries. These and other data characteristics may be used to match and correlate journal and metabase entries.

If no metabase record matches, a new record may be created at step 456. This may involve creating a new metabase entry ID, parsing the journal entry into a predetermined format, and copying portions of the parsed data into the new record (steps 460 and 470), as described further below. To enhance the content of the new entry, additional metadata and file system information may be attached to it, such as information from an FRN, information derived from an interaction code, etc. (step 480).”

“Conversely, if a corresponding metabase entry is identified, the new journal entries may be processed as above and may overwrite some or all of the corresponding entry. An updated entry may be given a time stamp to indicate that it is a new revision. In some embodiments, even though a corresponding entry exists, a new entry may be created, written to the metabase, and optionally associated with the existing record. The older record may be retained, for example, for historical or diagnostic purposes, and in some cases may be marked as obsolete or superseded. Corresponding entries may be linked together via a pointer or another mechanism so that records relating to the history of a particular data object can be quickly obtained.
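The record-matching and revision-linking behavior of steps 450 through 480 might be sketched as follows, assuming an in-memory metabase keyed by FRN. All field names and the revision-list layout are illustrative assumptions, not the patent's actual data structures.

```python
# Sketch: match a journal entry to a metabase record by FRN, create a new
# record when none exists, and link a superseding record to its predecessor.
import time

metabase = {}  # frn -> list of record revisions (newest last)

def classify(journal_entry):
    frn = journal_entry["frn"]
    record = {
        "frn": frn,
        "metadata": journal_entry["metadata"],
        "interaction_code": journal_entry["op"],
        "timestamp": time.time(),   # marks this as the newest revision
        "previous": None,
        "obsolete": False,
    }
    revisions = metabase.setdefault(frn, [])
    if revisions:                           # a corresponding entry exists
        revisions[-1]["obsolete"] = True    # mark old revision as superseded
        record["previous"] = revisions[-1]  # pointer for history lookups
    revisions.append(record)
    return record

classify({"frn": 7, "op": "create", "metadata": {"owner": "alice"}})
classify({"frn": 7, "op": "modify", "metadata": {"owner": "bob"}})
```

Following the `previous` pointers from the newest revision yields the full history of a data object, as the paragraph above describes.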

“Next, the system may process any additional new journal entries by repeating the steps above. If no new entries remain, the system may return to continue monitoring and perform further scans of the journal data structure (step 490).

“FIG. 5 illustrates an embodiment of the present invention in which a secondary processor performs some or all of the data classification functions described herein. System 500 may include a manager module 505, which may include an index 510; a first computing device 515, which may include a first processor 520, a journaling agent 530, and a data classification agent 535; and a secondary computing device 540, which may include a second processor 545. System 500 may also include a data store 550, a metabase 555, and a journal 560.

“Computing devices 515 and 540 may be any computing devices described herein, including clients, servers, or other network computers running software such as programs or applications that create, store, or transfer electronic data. In some embodiments, journal 560 and metabase 555 may be stored locally in local mass storage. In other embodiments, journal 560 and metabase 555 may be located outside of computing device 515 or distributed between locations; for example, metabase 555 may be accessed via a network while journal 560 is accessed locally.

Computing device 515 may operate in substantially the same manner as in system 300 of FIG. 3, except that the second processor 545 of the secondary computing device 540 may perform certain functions. As shown in FIG. 5, data classification agent 535 and journaling agent 530 may perform substantially the same functions as in FIG. 3: journaling agent 530 monitors data interactions on computing device 515 and records them in journal 560, and classification agent 535 processes the journal entries to populate metabase 555.

“However, some functions may be initiated and performed in whole or in part by second processor 545. For example, second processor 545 may direct computing operations related to journaling agent 530 or classification agent 535, and those operations may use support resources associated with secondary computing device 540. This helps ensure that computing device 515 is not significantly impacted by such operations, effectively transferring certain non-critical tasks from the host system 515 to the secondary computing device 540.

“For example, in some embodiments the secondary computing device may take on the processing burden of some or all of the following tasks: (1) the initial scanning of client files by classification agent 535 and the populating of metabase 555; (2) the ongoing monitoring of data interactions on a computing device (e.g., 515) and the generation of interaction records for storage in journal 560; (3) the classification and processing of journal information for updating metabase 555; and (4) searching for, or otherwise analyzing and accessing, certain information in metabase 555 and/or journal 560. In some cases, however, it may be preferable to give the secondary computing device certain tasks, such as searching metabase 555, while other tasks, such as updating journal 560 and metabase 555, are performed by the primary computing device.

“Performing such operations using another processor may be desirable, for example, when processor 520 is unavailable, overused, or otherwise heavily utilized, or when it is desired to relieve the primary processor of certain tasks such as those described above. For instance, it may be advantageous to use processor 545 to access or search metabase 555 for specific information, freeing processor 520 to perform other tasks related to programs running on computing device 515, such as when computing device 515 is busy with other network-related functions.

“In some embodiments, the secondary processor may be located on computing device 515 (e.g., processor 525) and may perform operations in conjunction with processor 545. Some embodiments include a manager module 505 that coordinates overall operations among the various computing devices. Manager module 505 may monitor, or otherwise be aware of, the processing load on each computing device and may assign tasks based on availability (e.g., load balancing). For example, if processor 520 is idle or operating at low capacity, it may handle a request to search metabase 555; if processor 520 is performing or scheduled to perform other tasks, manager 505 may assign the search task to processor 545. Manager 505 may thus act as an arbiter, assigning tasks between processors for efficient use of system 500’s resources.
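The arbiter role described above might look like the following minimal sketch. The load model and threshold are assumptions for illustration; a real manager module would presumably consider richer scheduling information.

```python
# Sketch of the manager's load-balancing decision: assign a task to the
# primary processor when its load is low, otherwise offload it to the
# secondary processor.

def assign_task(task, primary_load, secondary_load, threshold=0.75):
    """Return which processor should run the task, based on reported load."""
    if primary_load < threshold:
        return ("primary", task)
    # A fuller arbiter might also weigh secondary_load before offloading.
    return ("secondary", task)

# Primary processor busy with host applications: the search is offloaded.
owner, _ = assign_task("search metabase 555",
                       primary_load=0.9, secondary_load=0.2)
```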

“FIG. 6 illustrates some of the steps involved in performing searches with the system of FIG. 5. At step 610, the system may receive a query for specific information. The system may process and analyze the request; the query itself may indicate which metabases are to be searched, and/or the management module may consult an index that includes information about metabase content within the system. The identification process may involve searching for and identifying multiple computing devices in an enterprise or network that might contain information satisfying the search criteria.

“In some embodiments, search queries may automatically be referred to secondary processors to reduce processing demands on the computing devices that created, or are otherwise associated with, the identified metabase(s), since those devices may themselves be involved in the search operations. To identify responsive metabases, the secondary computing device may consult an index or a manager associated with other computing devices.

“Next, at step 640, the secondary processor may search the identified metabases for relevant data sets. It may be necessary to perform iterative searches that examine the results of previous searches; for example, the secondary processor may search additional metabases, not previously identified, to locate responsive information missed during the initial search. The initial metabase search may thus serve as a starting point, with search options expanding based on the results collected or returned. At step 650, the results may be arranged and formatted in a way suitable for further use (e.g., with another application) or for user viewing.
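The iterative expansion described above can be sketched as follows, assuming (as an illustration only) that a matching record may carry references to other metabases worth searching; the data layout is hypothetical.

```python
# Sketch of iterative metabase search: search an initial set of metabases,
# then follow references in the results to metabases not identified up front.

def iterative_search(metabases, initial_ids, predicate):
    results, to_visit, seen = [], list(initial_ids), set()
    while to_visit:
        mb_id = to_visit.pop(0)
        if mb_id in seen:
            continue
        seen.add(mb_id)
        for record in metabases.get(mb_id, []):
            if predicate(record):
                results.append(record)
                # A hit may point to another metabase worth searching.
                to_visit.extend(record.get("related_metabases", []))
    return results

metabases = {
    "mb1": [{"project": "apollo", "related_metabases": ["mb2"]}],
    "mb2": [{"project": "apollo"}, {"project": "zeus"}],
}
# mb2 is only searched because the hit in mb1 referenced it.
hits = iterative_search(metabases, ["mb1"], lambda r: r["project"] == "apollo")
```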

“FIG. 7 shows a system 700, constructed according to the principles of the invention, that uses a centralized metabase 760 to serve multiple computing devices 715-725. As shown, system 700 may include computing devices 715-725, each of which may include a journaling agent (730-740) and a classification agent (745-755), as well as the centralized metabase 760 and, in some embodiments, a manager module 705 and an index 710.

System 700 may operate in a manner similar to system 300 of FIG. 3, except that each computing device 715-725 stores classification entries in the centralized metabase 760 rather than in its own metabase. For example, data classification agents 745-755 may operate in the fashion described above, analyzing and processing entries in the journals associated with journaling agents 730-740, and report their results to centralized metabase 760. In this arrangement, each classification agent may provide its metabase entries with an ID tag or other indicia identifying the computing device from which the entry originated, facilitating future searches and allowing the owner of each entry to be identified.

“Also, each entry in metabase 760 may be assigned a unique identifier for management purposes. This number may indicate the index offset or location of the entry in the centralized metabase. In some embodiments, entries may be sent from computing devices 715-725 to metabase 760 on a rolling basis and stored as they arrive. Metabase 760, for example, may receive multiple entries from multiple computing devices 715-725 and may be responsible for arranging and queuing those entries for storage in the metabase.
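The queuing, device tagging, and identifier assignment just described might be sketched as follows; the class shape and field names are assumptions for illustration.

```python
# Sketch of a centralized metabase that queues entries arriving from several
# computing devices, tags each with its originating device, and assigns a
# monotonically increasing identifier that doubles as its index offset.
from collections import deque

class CentralMetabase:
    def __init__(self):
        self.queue = deque()    # entries received but not yet stored
        self.entries = []       # stored entries, indexed by their ID
        self.next_id = 0

    def receive(self, device_id, entry):
        # Entries may arrive from devices 715-725 on a rolling basis.
        self.queue.append({"device_id": device_id, **entry})

    def drain(self):
        # Arrange and store queued entries, assigning each a unique ID.
        while self.queue:
            entry = self.queue.popleft()
            entry["id"] = self.next_id
            self.next_id += 1
            self.entries.append(entry)

mb = CentralMetabase()
mb.receive(715, {"path": "/a"})
mb.receive(725, {"path": "/b"})
mb.drain()
```

Because the ID equals the entry's position in the list, it serves as the index offset mentioned above, and the `device_id` tag lets the entry's owner be identified later.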

“In certain embodiments, system 700 may include manager module 705, which may be responsible for assigning or removing associations between certain computing devices 715-725 and the centralized metabase 760. For example, manager 705 may direct certain computing devices 715-725 to write classification entries to a specific centralized metabase in accordance with certain system preferences. Index 710 may contain information indicating the associations between metabase 760 and computing devices 715-725. This allows system 700 to reassign resources globally or locally to maximize system performance; for instance, manager 705 may reassign certain computing devices 715-725 to another metabase by changing the destination address in the appropriate index.

“FIG. 8 is a flowchart 800 illustrating some of the steps involved in using a centralized metabase with multiple computing devices, such as the arrangement of FIG. 7. At step 810, a manager module may instantiate a centralized metabase in accordance with certain system management and provisioning policies. This may include securing the storage and management resources necessary to perform the task, loading certain routines into memory buffers, and notifying the manager module that the metabase is ready for operation.

“Next, at step 820, the manager module may review system resources, management policies, operating trends, and other information to identify computing devices to associate with the instantiated centralized metabase. This may include identifying the paths from the various computing devices to the metabase, locating operational policies that govern those computing devices, and creating logical associations between the identified computing devices and the centralized metabase. Once created, these associations may be stored in a database or index for system management purposes.

“After the metabase is created and associated with computing devices (step 825), classification agents on each computing device may scan files or other data and populate the centralized metabase as described further herein (step 830). During scanning, a computing device identifier or other indicia may be added to each entry, allowing the metabase to trace each entry back to its source computing device. The centralized metabase may then be filled with entries (step 855) and may communicate with the manager module to establish and monitor the list of computing devices served by the centralized metabase. During ongoing operation, the system monitors data interactions on the associated computing devices and may report them to the centralized metabase continuously, periodically, or on a rolling basis.

“In some circumstances, the centralized metabase may need to assimilate existing entries or integrate new entries from the computing devices. For example, the centralized metabase may become unavailable or disconnected for a time and then be required to merge a large number of queued entries. In this case, the manager module or the metabase itself may inspect existing metabase entries and communicate with the computing devices to determine: (1) how long each computing device has been disconnected from the metabase; (2) whether there are queued entries at the computing devices that must be processed (e.g., entries cached after the centralized metabase became unavailable for write operations); (3) whether duplicate entries exist; and (4) which entries should be integrated (assuming queued entries are present on multiple computing devices).

Based on these criteria, the manager module and the centralized metabase may assimilate the relevant entries into the metabase, and the system may then return to normal operation. If the metabase and/or manager module detects a discontinuity in the metadata or index associated with the centralized metabase, clients, computing devices, and other data sources may be rescanned to replace or repair the incorrect entries. Other embodiments may mark points of discontinuity and allow interpolation or other data-healing techniques to recover information at the unknown points.
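The reconciliation of queued entries after a disconnect might be sketched as follows. Identifying an entry by a (device, FRN, timestamp) key is an assumption made here for illustration; the patent does not prescribe a particular identity key.

```python
# Sketch of post-disconnect reconciliation: merge entries queued at the
# computing devices into the central metabase, skipping duplicates that
# were already assimilated before the outage.

def merge_queued(central, queued):
    """Integrate queued entries, using (device, frn, ts) as identity."""
    known = {(e["device"], e["frn"], e["ts"]) for e in central}
    merged = list(central)
    for entry in queued:
        key = (entry["device"], entry["frn"], entry["ts"])
        if key not in known:            # drop duplicate entries
            known.add(key)
            merged.append(entry)
    return merged

central = [{"device": 715, "frn": 1, "ts": 10}]
queued = [
    {"device": 715, "frn": 1, "ts": 10},   # duplicate, already present
    {"device": 715, "frn": 2, "ts": 12},   # cached while metabase was down
]
merged = merge_queued(central, queued)
```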

“FIG. 9 shows a system 900, constructed according to the principles of the invention, that includes a computing device interacting with a network-attached storage (NAS) device. System 900 may include a manager module 905 and an index 910, computing devices 915-925 (which may include data classification agents 930-940 and journaling agents 945-955), data stores 960 and 965, and metabases 970-980. System 900 may also include NAS device 995, which may include NAS storage device 990 and NAS file manager 985. Computing device 925 may supervise data transfers to and from NAS device 995.

System 900 may operate in a manner similar to system 300 of FIG. 3, with the exception of the NAS portion on the right-hand side. For example, data classification agents 930-940 may operate as described above, analyzing and processing the entries in the journals associated with journaling agents 945-955 and reporting their results to their respective metabases 970-980, under the supervision of manager module 905.

“Data interactions involving NAS device 995 may be journaled using methods similar to those described herein. For example, journaling agent 955 may be located on computing device 925 and may track data interactions between NAS device 995 and various applications. Journaling agent 955 may be located outside NAS device 995 because the proprietary nature of NAS device 995 may make it difficult to run other programs on it.

The NAS portion of system 900 may operate somewhat differently. For example, computing device 925 may act as a NAS proxy, moving data files to and from NAS device 995 using a specialized protocol such as the Network Data Management Protocol (NDMP), an open network protocol that allows data backups to be performed over heterogeneous networks. NDMP can improve performance when transferring data across a network by separating the data and control paths while maintaining centralized backup administration.

Journaling agent 955 may record interactions between NAS data and external applications and store those interaction records on computing device 925. This journaling agent may include specialized routines for interpreting and processing data in NAS format. Data classification agent 940 may analyze the journal entries and populate metabase 980, both initially and on an ongoing basis, as described herein.

“Once the initial data has been populated, it may be useful to search the metabases in system 900 for specific information. This is discussed further in connection with FIG. 11. In some embodiments, manager 905 or another system process may handle such a search, initially evaluating the search request and consulting index 910 or other information stores to determine which metabases within the system contain responsive information. The results of this evaluation may be presented to the computing device processing the search request and may include pointers, identifiers (such as a metabase ID), or other indicia identifying the relevant metabases, allowing the computing device that submitted the request to contact the identified metabases directly. In other embodiments, manager 905 may process the request itself and provide substantive results to the computing device that submitted the query.

“FIG. 10 illustrates some of the steps involved in copying data to a NAS device such as that shown in FIG. 9. At step 1010, a copy operation that moves data from a computing device to the NAS device may be initiated. This may include identifying the data to be moved, for example based on a storage or data management policy. Data size, the time of the last data transfer to the NAS device, file owner, application type, and other factors may be taken into consideration.

“Computing device 925 may be used to route data from other network computing devices (not shown) to NAS device 995, supervising the data movement using specialized transfer programs (step 1020). As the data is routed through computing device 925, journaling agent 955 may monitor the interactions with NAS device 995 and create interaction entries (step 1030). This may include consulting NAS file manager 985 to identify the files on NAS device 995 involved in a given data interaction, as described further herein (step 1040). Next, journal entries may be created or modified to reflect the data interactions as described previously (step 1050). After the interaction journal is scanned, the classification process described herein may be performed to create metabase entries (step 1070), which may be assigned identifiers and used to populate metabase 980 (step 1080).

“As mentioned, in certain situations it may be desirable to search a system having multiple metabases, such as system 900 shown in FIG. 9 (with or without the NAS components), for specific information. FIG. 11 contains a flowchart 1100 illustrating some of the steps involved in searching multiple metabases in accordance with certain aspects of the invention.

Assume, for example, that a user wishes to find and copy all data meeting specified criteria, such as data relating to a marketing project that was created and edited over a certain period of time by a particular group of users. The requestor may first construct such a request through a user interface (not shown) and submit it to the system for processing; if the system is performing management functions, this may be automated. The system then receives and analyzes the query (step 1110). In some embodiments this analysis may be performed by the computing device that supports the user interface; in other embodiments, that computing device may simply transmit the request to the system, where a management module (or another system process or computing device) performs the analysis. The analysis may include identifying characteristics in the metabases that might satisfy the chosen criteria.

The system may then identify metabases that may contain records related to the query or search request. This may be done by consulting a management module, which may have a comprehensive view of the metabases in the system, possibly including index information or a general overview of metabase content. Once a set of metabases has been identified, the management module (or another computing device) may search them for data responsive to the query and return a set of results (step 1130). Optionally, normalization may be performed at step 1140; if normalization is not required, the results may be reported at step 1150. If normalization is required, the system may analyze the results for content and completeness, and if other metabases are found that may contain information meeting the search criteria, those metabases may be searched as well. This process may continue iteratively until a substantially complete set of results is obtained. The results may then be normalized even if no other metabases are involved, which may include finding and removing duplicate results, identifying network paths to responsive data objects, and formatting or arranging the results for further processing (whether by another computing process or by a user). The returned results may be used, for example, to retrieve responsive data objects, which may reside on primary or secondary storage devices within the system.
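The normalization of step 1140 can be illustrated with a short sketch. Identifying duplicates by network path, and sorting by path for presentation, are both assumptions made for illustration.

```python
# Sketch of result normalization: remove duplicate results returned by
# different metabases and arrange the remainder for reporting.

def normalize(results):
    seen, unique = set(), []
    for record in results:
        key = record["path"]          # identify duplicates by network path
        if key not in seen:
            seen.add(key)
            unique.append(record)
    # Arrange for user viewing or further processing, e.g., by path.
    return sorted(unique, key=lambda r: r["path"])

raw = [
    {"path": "//srv/mkt/plan.doc", "metabase": "mb2"},
    {"path": "//srv/mkt/ad.ppt", "metabase": "mb1"},
    {"path": "//srv/mkt/plan.doc", "metabase": "mb1"},  # duplicate hit
]
report = normalize(raw)
```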

“In some embodiments, systems and methods according to the present invention may be used to track and identify data interactions by users or groups. For example, a system administrator or user may wish to keep track of all data interactions involving certain users or groups, which may include read and write operations performed on the user’s or group’s behalf, information and applications used or accessed by the user or group, electronic gaming interactions, chat, instant messaging, and other communication interactions. The system may identify, capture, classify, and otherwise track user or group interactions with electronic data, creating a data store or other repository of these interactions and their associated metadata. In some instances, this repository may serve as a “digital or electronic record” that effectively records and catalogs user and group interactions with electronic data and information over a period of time, as further described herein.

“For example, FIG. 11a shows a system that, according to the principles of the invention, identifies and classifies electronic data and tracks user and group interactions. The system may include a computing device 1162, one or more classification agents 1164, one or more journaling agents 1165, a metabase 1166, a change record 1167, and a database 1168.

“In operation, computing device 1162 may be coupled to, and interact with, various applications, networks, and sources of electronic information, such as multimedia applications 1170, instant messaging/chat applications 1172, network applications 1174 (such as an enterprise WAN or LAN), the Internet 1176, and gaming applications 1178. These are merely examples; any other network, application, or type of electronic data suitable for the purposes set forth herein may be added as necessary.

“Journaling agent 1165 and classification agent 1164 may work together to detect and record data interactions, as further explained herein. Each type of electronic data interaction (e.g., email, web surfing, Internet search activity, instant messaging, multimedia usage, etc.) may be identified, captured, classified, and tracked by a different journaling agent 1165 and classification agent 1164; that is, by an interaction-specific journaling agent 1165 or classification agent 1164 dedicated to processing one type of interaction with electronic data. For example, the system may have a first journaling agent 1165 and a first classification agent 1164 that monitor network traffic (not shown), and another journaling agent 1165 or classification agent 1164 that monitors system resources used for electronic gaming interactions (e.g., recording and classifying games played, opponents played, or win/loss records) or that is directed to interactions involving the use of an Internet browser (e.g., tracking pages visited, content, use patterns, etc.). In some embodiments, a single agent may combine and perform some or all of the functions of a journaling agent 1165 and a classification agent 1164.

“As a user interacts with various types of electronic information, some or all of these interactions may be recorded and stored in database 1168. Metabase 1166 and change record 1167 may record certain aspects of the interactions and may also serve as interaction logs of computing activities.

“For example, a user of computing device 1162 may interact with applications such as multimedia application 1170 or instant messaging application 1172. These interactions may include receiving, viewing, responding to, and sending audio/video files in any format, as well as instant, text, or email messages. Journaling agent 1165 may detect the interactions between these applications and computing device 1162, and classification agent 1164 may classify the interactions and record associated information (e.g., metadata) in metabase 1166.

“Moreover, in certain embodiments some or all of the content exchanged in, or associated with, these interactions may be captured and stored in database 1168 or other storage locations within the system. Screen shots and summaries of data interactions may be captured. For example, the system may download the content associated with web pages viewed, so that it can recreate the original content and interaction without access to the source or to the original version of the page on the Internet. This can be useful, for instance, if the user wishes to revisit previous interactions even though the content is no longer available. As other examples, the system may capture or store data associated with other interactions, such as chat transcripts, video game replays, search queries together with their results and associated content, and songs and movies accessed, along with their metadata.

“Moreover, in certain embodiments administrators or users may wish to track particular applications using specialized classification agents. For example, the multimedia and instant messaging applications described above might each have a dedicated classification agent that analyzes journal records to create metabase 1166 entries. Each classification agent may also have its own metabase or repository of source information (not shown), so that application histories and content can be quickly indexed, searched, and retrieved. In other embodiments, a “universal” classification agent may be used that recognizes the application type (e.g., based on journaling agent entries) and processes interactions appropriately, which may include routing metadata to one or more specialized metabases.
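The routing behavior of such a “universal” classification agent might be sketched as follows; the registry of specialized metabases and the record fields are assumptions for illustration.

```python
# Sketch of a "universal" classification agent: inspect the application type
# recorded by the journaling agent and route metadata to the corresponding
# specialized metabase.

specialized_metabases = {"chat": [], "multimedia": [], "gaming": []}

def universal_classify(journal_record):
    app_type = journal_record.get("app_type", "unknown")
    # Unknown application types get their own metabase on first sight.
    target = specialized_metabases.setdefault(app_type, [])
    target.append({"user": journal_record["user"],
                   "detail": journal_record["detail"]})
    return app_type

universal_classify({"app_type": "chat", "user": "alice",
                    "detail": "im session"})
universal_classify({"app_type": "gaming", "user": "alice",
                    "detail": "match replay"})
```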

“As shown in FIG. 11a, computing device 1162 may also interact with network applications 1174, such as LAN and WAN applications, including interactions with certain distributed programs such as Microsoft Word and Outlook. Users may likewise interact with the Internet 1176 and download various web pages and other information. In accordance with an aspect of the present invention, interactions with these networks and applications may also be journaled as described above, with certain information regarding the interactions stored in metabase 1166. Database 1168 may also contain portions of the exchanged content, such as Word documents, emails, and web pages, and may serve as a record of user interactions with computing device 1162 or other system devices. User interactions can be tracked at any network computing device and recorded for any user identified by appropriate identifiers.

“A user may retrieve captured data, review or replay data exchanges, or save such records for future use. For example, a user could store instant messaging conversations for replay or for transmission to others. In certain cases, such as those involving private or personal information, it may be desirable not to record some interactions. This may be achieved, for example, by disabling the appropriate classification agent for a specific period of time.

“Similarly, interactions with gaming applications (both networked and stand-alone) may also be recorded, with the appropriate information stored in metabase 1166 and database 1168. A user may then be able to retrieve, replay, and transmit saved gaming sequences to others.

“In some cases, database 1168 may become prohibitively large. In such cases, some information may be moved to single-instance storage within database 1168, with a pointer placed at the logical address of the instanced information (not shown). Because some entries in database 1168 may be duplicative, this can conserve storage space.
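Single-instance storage of duplicated content can be sketched with content hashing. Using SHA-256 as the identity function, and the two-table layout, are assumptions made for illustration.

```python
# Sketch of single-instance storage: identical payloads are stored once,
# keyed by content hash, and each logical entry keeps only a pointer
# to the instanced copy.
import hashlib

instance_store = {}     # content hash -> single stored copy
database = []           # logical entries holding pointers, not content

def store(content):
    digest = hashlib.sha256(content).hexdigest()
    if digest not in instance_store:
        instance_store[digest] = content   # first copy is kept
    database.append({"pointer": digest})   # later copies become pointers

store(b"weekly status report")
store(b"weekly status report")   # duplicate: only a pointer is added
store(b"chat transcript")
```

Three logical entries are recorded, but only two distinct payloads occupy storage; the duplicated report is reachable through its pointer.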

Flowchart 1200 in FIG. 12 illustrates some of the steps involved in the above-described method and may include the following. First, a user or group of particular interest may be identified based on user-related information or other network characteristics (step 1210). These characteristics may include Active Directory privileges, network logins, machine IDs, or biometrics associated with members of a group. Such characteristics may be combined or linked together to create a profile for a user or group. These profiles may be stored in a database, index, or management module of the system and used to create classification definitions. Using the profile information, the system may compare the data elements involved in an interaction to determine whether they are related to a profiled user or group and classify them accordingly (step 1220).

These associations may be stored in a metabase that keeps track of the interactions of a particular user or group. In one embodiment, the metabase amounts to a list of the data interactions for a specific user or group, so that a list of data items touched by that user or group can be obtained quickly if desired.

“In operation, the system may monitor data interactions on a particular computing device through the use of journaling agents or the like. A classification agent may analyze the interactions and associate them with one or more profiles (step 1230). The association may be recorded in an identified metabase that tracks the interactions of the user or group (step 1240), and may include references to the data objects identified, the attributes compared, and the reason for the association. The journaling agent may continue to monitor data interactions throughout operation, ensuring that each metabase remains up to date and accurately reflects the data touched by any particular user or group. The identified metabases may then be associated with the user or group, for example by storing an indicator of the association in an index (step 1250).
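The profile matching and association just described might be sketched as follows; the profile fields, group names, and record shapes are all hypothetical.

```python
# Sketch of profile-based classification: compare attributes of a data
# interaction against stored user/group profiles and record any association
# in the metabase that tracks that group, noting the reason for the match.

profiles = {
    "marketing": {"logins": {"alice", "bob"},
                  "machine_ids": {"wk-01", "wk-02"}},
}
group_metabases = {"marketing": []}

def associate(interaction):
    for group, profile in profiles.items():
        matched_login = interaction["login"] in profile["logins"]
        if matched_login or interaction["machine_id"] in profile["machine_ids"]:
            group_metabases[group].append({
                "object": interaction["object"],
                "matched_on": "login" if matched_login else "machine_id",
            })
            return group
    return None   # interaction matches no profiled user or group

group = associate({"login": "alice", "machine_id": "wk-09",
                   "object": "plan.doc"})
```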

“FIG. 13 shows a system 1300, constructed according to the principles of the invention, for communicating metadata and/or data objects between multiple computing devices. System 1300 may generally include first and second computing devices 1310 and 1320, respectively, along with associated data stores 1330 and 1340 and metabases 1350 and 1360. The computing devices of system 1300 may store data objects and metadata in their respective data stores and metabases, as described further below. Certain circumstances, however, may call for the transfer of metadata between metabases 1350 and 1360, or of data objects between data stores 1330 and 1340, for example to transfer certain data between computing devices, to instantiate an application in another location, or to copy or back up certain data objects and their associated metadata.

“FIG. 14 shows a flowchart 1400 illustrating some of the steps involved in moving data between computing devices. At step 1410, data objects and/or associated metadata may be identified for movement between computing devices. This may be done by requesting specific data; for example, a query may be constructed to search for the data objects and/or associated metadata of interest.
