De-duplication of files

I recently read an article on virtual tape libraries (VTLs) that claimed the killer app for the technology could be something it called “data de-duplication”. The article appeared to assume that the savings would come from not backing up the same file twice. While this is entirely valid within the realm of virtualized tape libraries, it misses the main benefit of de-duplication.

Organizations are rife with files stored in multiple places. E-mail is even worse, but at least most e-mails sent to huge lists of people are relatively small. Files are typically large enough to have a real impact on storage costs. Furthermore, most files that proliferate are ones that are read and not altered, such as PowerPoint presentations and PDFs.
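To get a rough feel for how much space duplicate files take up on a given volume, a small script can group files by a hash of their contents. This is only a sketch, not a tool from the article: the function names (file_digest, find_duplicates, wasted_bytes) are made up for illustration, and it assumes that two files with the same SHA-256 digest are identical.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Map each digest to the paths sharing that content (sets of 2+ only)."""
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so a link to a file is not counted as a copy.
            if os.path.isfile(path) and not os.path.islink(path):
                by_digest[file_digest(path)].append(path)
    return {d: sorted(p) for d, p in by_digest.items() if len(p) > 1}


def wasted_bytes(duplicates):
    """Bytes that keeping a single copy of each duplicate set would free."""
    return sum(
        os.path.getsize(paths[0]) * (len(paths) - 1)
        for paths in duplicates.values()
    )
```

Calling find_duplicates on a shared folder and passing the result to wasted_bytes gives an estimate of the storage being spent on redundant copies. Hashing every file is slow on large volumes; a more careful version would probably compare file sizes first and hash only the files whose sizes match.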

The duplicate file problem is one that should already be solved in most organizations by using a system to manage files. The most prevalent systems for this are digital asset management (DAM) systems and document management systems. They store files and their associated data (metadata) in a searchable database. These systems can also ensure that only a single copy of each file exists and control access to files. Some also have workflow features to control the approval and routing of documents.
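As a rough illustration of how a system might keep a single copy of each file while still letting many people refer to it, here is a toy in-memory store keyed by content hash. It is a sketch under stated assumptions and not how any particular DAM or document management product works: the class name SingleInstanceStore, its methods, and the metadata fields are all hypothetical.

```python
import hashlib


class SingleInstanceStore:
    """Toy store: each distinct content is kept once, with many references."""

    def __init__(self):
        self._blobs = {}      # digest -> file content, stored exactly once
        self._documents = {}  # document name -> digest plus metadata

    def add(self, name, content, **metadata):
        """Register a document; return True only if new content was stored."""
        digest = hashlib.sha256(content).hexdigest()
        is_new = digest not in self._blobs
        if is_new:
            self._blobs[digest] = content
        self._documents[name] = {"digest": digest, "metadata": metadata}
        return is_new

    def get(self, name):
        """Return the content a document name refers to."""
        return self._blobs[self._documents[name]["digest"]]

    def search(self, **criteria):
        """Return the names of documents whose metadata matches all criteria."""
        return sorted(
            name
            for name, doc in self._documents.items()
            if all(doc["metadata"].get(k) == v for k, v in criteria.items())
        )

    def stored_bytes(self):
        """Total bytes actually held, counting each distinct content once."""
        return sum(len(blob) for blob in self._blobs.values())
```

If two people add the same presentation under different names, the second add returns False and stored_bytes does not grow, yet both names can still be found through a metadata search and retrieved. A real system would keep the content on disk and the metadata in a database, and would layer access control on top.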

Using these systems to minimize duplication solves the problem at the earliest possible stage in the document’s life cycle. That removes the benefit, and therefore the necessity, of building these features into a backup system. A backup system should remain totally focused on backing up all data on the volumes in the backup set. If a backup system is allowed to choose what to back up based on criteria, it opens the door for it to compromise itself with those choices.

