Invented by Deepak Raghunath Attarde and Manoj Kumar Vijayan; assigned to Commvault Systems, Inc.

Data management is a crucial aspect of any organization, and deduplication is an essential part of it. Deduplication is the process of identifying and removing duplicate data from a storage system. It reduces storage space, improves data quality, and enhances data analysis. Managing deletions in a deduplication database, however, can be challenging, and the market for solutions that do so is growing.

The need for managing deletions in a deduplication database arises for several reasons. The first is data accuracy: duplicate data can lead to inaccurate analysis and, in turn, to wrong decisions, so duplicates must be removed from the database regularly. Deleting data from a deduplication database is complex, because the correct data must be identified and removed without affecting the quality of the remaining data. The second reason is compliance: many organizations are subject to regulatory requirements that mandate retaining specific data for a certain period, after which the data must be deleted. Managing deletions in a deduplication database helps organizations comply by ensuring that data is deleted at the appropriate time.

The market offers various solutions, including both software and services. One popular class of solution is software that automates the deletion process: it identifies duplicate data and deletes it automatically, reducing the risk of human error, and it can be customized to an organization's specific needs, such as compliance requirements. Another class is services that provide expert assistance: helping organizations identify duplicate data, determine which data to delete, and ensure that the deletion process complies with regulatory requirements.

In conclusion, managing deletions in a deduplication database is essential for maintaining data accuracy and compliance. As data volumes continue to grow, the need for managing deletions in a deduplication database will only increase, making it a lucrative market for solution providers.

The Commvault Systems, Inc. invention works as follows

An information management system can remove data block entries from a deduplicated data store by using working copies of those entries residing in local storage on a secondary storage computing device. The system uses the working copies to identify data blocks that need to be removed. After the working copies have been updated (e.g., using a transaction-based update scheme), the system queries the deduplication database for any database entries that are to be removed. Once those entries have been identified, the system can remove the corresponding data blocks from secondary storage.

Background for Managing Deletions in a Deduplication Database

Global businesses recognize the commercial value of their data and seek cost-effective, reliable ways to secure their information while minimizing the impact on productivity. Information protection is often part and parcel of routine organizational processes.

A company may back up important computing systems, such as databases, file servers, and web servers, as part of its daily, weekly, or monthly maintenance plan. A company might also protect the computing systems of individual employees, such as those used by its accounting, marketing, or engineering departments.

Given the ever-growing volume of data under their control, companies continue to look for innovative ways to manage data growth and protect data. They often use migration techniques to move data to cheaper storage, and data reduction techniques to reduce redundant data, prune lower-priority data, and so forth.

Data stored by enterprises is becoming an increasingly valuable asset. Customers are seeking solutions that not only manage and protect their data but also allow them to leverage it. Solutions that provide data analysis, enhanced data presentation, and easy access are increasingly in demand.

Data deduplication is a technique that storage system providers have developed to address these issues. Data deduplication reduces redundant data within a storage system, which improves storage utilization. Data is broken into units at a chosen granularity, such as files or sub-file data blocks, and the blocks can have a fixed or variable size. As new data is added to the system, each data unit is checked against existing data. If a data unit already exists, the storage system stores and/or transmits a reference to the existing unit instead of the data itself. Deduplication can thus improve storage utilization, reduce system traffic (e.g., over a networked storage system), or both.
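The signature check described above can be sketched in a few lines of Python. This is a minimal illustration of hash-based deduplication, not the patent's implementation; the function and variable names are invented for the example.

```python
import hashlib

def store_block(block: bytes, store: dict) -> str:
    """Store a data block only if its signature is new; otherwise
    just return a reference (the signature) to the existing copy."""
    signature = hashlib.sha256(block).hexdigest()
    if signature not in store:
        store[signature] = block  # first time we see this block: keep it
    return signature  # callers keep this reference instead of the raw data

store = {}
refs = [store_block(b, store) for b in [b"alpha", b"beta", b"alpha"]]
# three references exist, but only two unique blocks are stored
```

Note that the two identical `b"alpha"` writes yield the same signature, so the second write costs only a reference, which is the source of the storage and traffic savings.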

Even in systems that use deduplication, data management operations (including backup and restore) can put significant strain on available network bandwidth and system resources. These operations can incur significant delays, for example due to communication latency between primary storage (e.g., production storage) and secondary storage (e.g., non-production backup storage). Recovering from failures of devices or scripts involved in deduplication can also be very time-consuming.

For instance, some deduplication database pruning operations (e.g., data block deletion operations) are performed at irregular intervals. Over time, a long log of pending pruning operations accumulates, which the system "plays back" at the next pruning interval to implement the pruning operations in the deduplication database. In such cases it can be difficult to recover the entire pruning history.
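The deferred log-replay approach described above can be sketched as follows. This is an illustrative toy, with invented names, of the prior approach the patent improves on: deletions are queued in a log and only applied when the log is replayed.

```python
# Pending prune operations accumulate here between pruning intervals.
pending_log = []

def request_prune(signature: str) -> None:
    """Record a deletion request; it is deferred, not applied immediately."""
    pending_log.append(signature)

def play_back(database: dict) -> None:
    """At the next pruning interval, replay the log against the database."""
    while pending_log:
        database.pop(pending_log.pop(0), None)  # drop the entry if present
```

A crash between `request_prune` and `play_back` loses the whole pending log, which is why a long log makes recovery hard.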

Certain aspects described herein address these and other problems. One solution is to use a locally maintained database stored on a secondary storage computing device or another storage controller computer, for example in the main memory of the secondary storage computing device. Such a local data structure is an in-memory database (IMDB). A complete version of the deduplication database exists outside the secondary storage computing device, and the local database keeps working copies of its entries. The complete deduplication database can be stored on one or more secondary storage devices and is also known as an on-disk database (ODDB). Deduplication database entries can include information about deduplicated data blocks: the entry for a given data block may include a signature corresponding to the block, a pointer to a copy of the block stored on a secondary storage device, and a count of the number of deduplicated files that reference the block. The working copies in the local database may include all or some of the information contained in the full database entries.
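The entry layout described above could be modeled as a small record. The field names below are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class DedupEntry:
    """One deduplication database entry (illustrative fields)."""
    signature: str   # hash identifying the data block
    location: str    # pointer to the block's copy in secondary storage
    ref_count: int   # number of deduplicated files referencing the block

# A working copy in the local in-memory database may carry all of these
# fields or only a subset, e.g. just the signature and reference count.
working_copy = DedupEntry("ab12cd34", "/secondary/vol1/blk0007", 3)
```

When a deduplicated file is deleted, the corresponding `ref_count` values are decremented; an entry reaching zero signals that its block is no longer referenced.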

As storage operations are performed on secondary storage, the system can modify the working copies of database entries in the local database (e.g., those residing in main memory) without first modifying the full database entries. The working copies can later be flushed from the secondary storage computing device or merged with the full database. This flush or merge follows a transaction-based scheme: it happens when the local database reaches a threshold size or after a specified time period has elapsed. To preserve the integrity of the transaction, the working copies can be made read-only or otherwise blocked from writes during the merge. The threshold size or time period can be set so that transactions (e.g., additions or modifications to entries) are flushed to the deduplication database frequently enough to keep it substantially current (e.g., at intervals of 1, 2, or 5 minutes). This reduces the time it takes to rebuild the deduplication database and bring it back online in the case of a hardware or software failure.

The local database can also be used when pruning certain deduplication information from the information management system, which improves crash recovery. The system can query the merged database for entries that indicate a pruning event, such as deletion of data blocks and/or deduplication signatures. For example, an entry indicating that no deduplicated files reference a given data block is a sign that a pruning event has occurred. Because the local database is flushed to the full database at sufficient frequency, pruning operations are written to and reflected in the deduplication database relatively quickly. In certain embodiments some pruning events might still be missed due to a crash, for example if the local database is lost before merging with the full deduplication database. But because the transaction-based scheme is updated frequently, only a small number of pruning events can be lost in any one crash, and the system can reissue that small number in such cases.
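The zero-reference query described above might look like this, assuming the simplified representation of the full database as a mapping from block signature to reference count (an illustration, not the patent's schema):

```python
def find_prunable(full_db: dict) -> list:
    """After a merge, return signatures whose reference count has dropped
    to zero: their data blocks can be deleted from secondary storage."""
    return [sig for sig, ref_count in full_db.items() if ref_count == 0]

db = {"blk_a": 2, "blk_b": 0, "blk_c": 0}
prunable = find_prunable(db)  # blk_b and blk_c can be removed
```

Running this query against the merged database at each interval is what lets pruning proceed incrementally instead of replaying one long deferred log.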

This document describes systems and methods for implementing a transaction-based deduplication database management scheme that uses locally maintained deduplication information. These systems and methods use working copies of deduplication database entries, or other relevant deduplication information, kept in main memory. The working copies are merged on a regular basis with a deduplication database residing on a secondary storage device, such as when the data structure containing the working copies reaches a threshold size or after a time limit has expired. These techniques can reduce the time it takes to rebuild a deduplication database and bring it back online in the case of a hardware or software crash. These systems and methods are described in further detail with respect to FIGS. 2-9. It will also be appreciated that these components and functionality can be used with and/or integrated into information management systems such as those now described with respect to FIGS. 1A-1H.

Information Management System Overview

Because of the growing importance of protecting and leveraging data, organizations simply cannot afford to lose critical data. At the same time, runaway data growth and other modern realities make protecting and managing data more difficult, so user-friendly, efficient, and powerful solutions for managing and protecting data are imperative.

Depending on the organization's size, there may be many data production sources under the control of hundreds or even thousands of employees. In the past, individual employees were sometimes responsible for protecting and managing their own data; in other cases, a patchwork of hardware and software point solutions was used. These solutions were often offered by different vendors and sometimes had little or no interoperability.

Certain embodiments described herein offer systems and methods that address these and other shortcomings of prior approaches by implementing unified information management across the organization. FIG. 1A illustrates one such information management system 100. It generally includes combinations of hardware and software used to manage and protect data and metadata generated by the various computing devices in information management system 100. The organization using the information management system 100 could be a company or other business entity, an educational institution, a household, or a governmental agency.

Generally, the systems described herein may be compatible with and/or provide some of the functionality of one or more U.S. patents or patent application publications assigned to CommVault Systems, Inc., each of which is hereby incorporated by reference herein in its entirety.

The information management system 100 can include a wide range of computing devices. As an example, it could include one or more client computing devices 102 and secondary storage computing devices 106, as discussed in more detail below.

Computing devices may include, without limitation, one or more of the following: workstations, personal computers, desktop computers, or other types of generally fixed computing systems such as mainframe computers and minicomputers. Computing devices can also include portable or mobile devices, such as laptops, tablet computers, personal data assistants, mobile phones (such as smartphones), and other mobile or portable computing devices such as embedded computers, set-top boxes, and vehicle-mounted devices. Computing devices can also include servers, such as mail servers, file servers, database servers, and web servers.

In certain cases, a computing device may include virtualized and/or cloud computing resources. For instance, a third-party cloud service provider may provide one or more virtual machines to an organization. In some cases, a computing device may include one or more virtual machines running on a physical host computing device (or "host machine") operated by the organization. As one example, the organization might use one virtual machine as a database server and another virtual machine as a mail server, both running on the same host machine.

A virtual machine is an operating system and associated virtual resources hosted on a host computer or host machine under the management of a hypervisor. The hypervisor is typically software, and is also known as a virtual machine monitor, virtual machine manager, or "VMM". The hypervisor acts as an interface between the virtual machine and the hardware of its host machine. An example of a hypervisor used for virtualization is ESX Server, by VMware, Inc. of Palo Alto, Calif. Other examples include Microsoft Virtual Server and Microsoft Windows Server Hyper-V, both by Microsoft Corporation, and Sun xVM by Oracle America Inc. of Santa Clara, Calif. In some embodiments, the hypervisor may be firmware or hardware instead of, or in addition to, software.

The hypervisor provides each virtual operating system with virtual resources, such as a virtual processor, virtual memory, and virtual network devices. Each virtual machine can have one or more virtual disks. The hypervisor stores the data of the virtual disks in files on the file system of the physical host machine, called virtual machine disk files (in the case of VMware virtual servers) or virtual hard disk image files (in the case of Microsoft virtual servers). For example, VMware's ESX Server provides the Virtual Machine File System (VMFS) for the storage of virtual machine files. A virtual machine reads data from and writes data to its virtual disks in much the same way that a physical machine reads and writes to physical disks.
