dedup

From Cliquesoft
Revision as of 08:59, 27 March 2012 by Digitalpipe (Talk | contribs)

A simple utility script for de-duplicating data pools such as your pictures or documents. The original files are never deleted; instead, all unique data is copied to a separate directory so the originals can then be deleted, backed up, or handled however you like. Like our other bash-script-based software, this project relies on our clAPI framework for various functionality, so be sure this dependency is satisfied before using it. Note also that each OPTION must be placed after the ACTION (or parent script) it belongs to; the --help output shows which OPTIONs belong where. It may also help to read over the basics of clAPI to get a better understanding of running our software from the command line.


Terms

This project's codebase is licensed under the AGPLv3 unless a valid CPL has been purchased. More information about both licenses can be found under the "Our Licenses" link on our homepage.



ACTIONs

In addition to our standard 'help', 'version', and 'update' ACTIONs, this project contains two others: 'install' and 'sort'. The 'install' ACTION simply installs the script in the "~/.bin" directory on XiniX and in "/usr/bin" on typical GNU/Linux distros. To see how easy it is to install, see the Examples section.

The other ACTION, 'sort', performs most of the work. It walks the entire directory hierarchy under the --source OPTION and builds a database of MD5 hashes, one per file examined. If the database does not yet contain a file's MD5 hash, the file is unique and is copied to the --target directory, retaining its relative location. If the hash is already present in the database, the file is a duplicate and no further action is taken on it. Once 'dedup' has completed its run, all unique data is located under the --target directory.
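
The idea behind 'sort' can be illustrated with a minimal shell sketch. This is not the actual 'dedup' source and does not use clAPI; the function name dedup_sketch and the temporary hash file are assumptions made purely for illustration. It assumes GNU md5sum and mktemp are available, that file names contain no newlines, and that the target directory is not located inside the source directory.

```shell
# Hypothetical sketch of hash-based de-duplication (not the real dedup code).
# Usage: dedup_sketch SOURCE_DIR TARGET_DIR
# The body runs in a subshell "( ... )" so its variables stay local.
dedup_sketch() (
    src=${1%/}                      # drop one trailing slash, if any
    dst=$2
    mkdir -p "$dst" || exit 1
    db=$(mktemp) || exit 1          # plain text file standing in for the hash DB

    # Visit every regular file in a stable (sorted) order.
    find "$src" -type f | sort | while IFS= read -r file; do
        hash=$(md5sum < "$file" | cut -d' ' -f1)
        if grep -qxF "$hash" "$db"; then
            # Hash already recorded: duplicate, leave it alone.
            echo "duplicate: $file"
        else
            # First time this hash is seen: record it and copy the file,
            # keeping its path relative to the source directory.
            echo "$hash" >> "$db"
            rel=${file#"$src"/}
            mkdir -p "$dst/$(dirname "$rel")"
            cp "$file" "$dst/$rel"
            echo "unique: $file"
        fi
    done

    rm -f "$db"
)
```

Calling it as dedup_sketch /tmp/data /tmp/dedup would copy the first file found for each hash and skip later files with the same content. Which copy counts as the "first" depends on the traversal order, which in this sketch is simply the sorted path order.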


OPTIONs

Add these in



Examples

Installation is a simple 2-step process...

$ cd /path/to/uncompressed/package
$ ./dedup install


Showing the contents of the data pools prior to a 'dedup' run...

/tmp/data $ ls -alR
.:
total 36
drwxrwxr-x  4 dave users  4096 2012-03-27 11:00 .
drwxrwxrwt 18 dave users 20480 2012-03-27 11:37 ..
drwxrwxr-x  4 dave users  4096 2012-03-27 10:59 dataset1
drwxrwxr-x  2 dave users  4096 2012-03-27 11:39 dataset2

./dataset1:
total 140268
drwxrwxr-x 4 dave users      4096 2012-03-27 10:59 .
drwxrwxr-x 4 dave users      4096 2012-03-27 11:00 ..
drwxrwxr-x 3 dave users      4096 2012-03-13 16:43 a
-rw-r--r-- 1 dave users 143614369 2008-11-07 18:05 flash.tar.gz
drwxrwxr-x 2 dave users      4096 2012-03-13 16:03 original

./dataset1/a:
total 12
drwxrwxr-x 3 dave users 4096 2012-03-13 16:43 .
drwxrwxr-x 4 dave users 4096 2012-03-27 10:59 ..
drwxrwxr-x 2 dave users 4096 2012-03-27 10:36 b

./dataset1/a/b:
total 16
drwxrwxr-x 2 dave users 4096 2012-03-27 10:36 .
drwxrwxr-x 3 dave users 4096 2012-03-13 16:43 ..
-rwxr-xr-x 1 dave users  642 2010-07-22 11:45 test.sh
-rwxrwxr-x 1 dave users  517 2009-02-17 09:05 test.txt

./dataset1/original:
total 500744
drwxrwxr-x 2 dave users      4096 2012-03-13 16:03 .
drwxrwxr-x 4 dave users      4096 2012-03-27 10:59 ..
-rw-r--r-- 1 dave users 512753664 2011-05-10 09:37 flash.img

./dataset2:
total 140272
drwxrwxr-x 2 dave users      4096 2012-03-27 11:39 .
drwxrwxr-x 4 dave users      4096 2012-03-27 11:00 ..
-rw-rw-r-- 1 dave users        10 2012-03-27 11:39 a_new_one.txt
-rw-r--r-- 1 dave users 143614369 2008-11-07 18:05 flash.tar.gz
-rwxrwxr-x 1 dave users       517 2009-02-17 09:05 test.txt


Executing a de-duplication run...

$ dedup --noprompts sort --source=/tmp/data --target=/tmp/dedup

Beginning the de-duplication process @ Tue Mar 27 11:39:34 EDT 2012

Checking system environment...
  (i)   Directories...
           Temp: [checking] [exists] [writable] [success] [done]
  (i)   Variables: [done]

Beginning the 'sort' module...

Entering "/tmp/data"...
Entering "/tmp/data/dataset1"...
   Processing "flash.tar.gz": [unique] [checking] [creating] [success] [copying] [success] [done]
Entering "/tmp/data/dataset1/a"...
Entering "/tmp/data/dataset1/a/b"...
   Processing "test.sh": [unique] [checking] [creating] [success] [copying] [success] [done]
   Processing "test.txt": [unique] [copying] [success] [done]
   ** Finished, returning to "/tmp/data/dataset1/a".
   ** Finished, returning to "/tmp/data/dataset1".
Entering "/tmp/data/dataset1/original"...
   Processing "flash.img": [unique] [checking] [creating] [success] [copying] [success] [done]
   ** Finished, returning to "/tmp/data/dataset1".
   ** Finished, returning to "/tmp/data".
Entering "/tmp/data/dataset2"...
   Processing "a_new_one.txt": [unique] [checking] [creating] [success] [copying] [success] [done]
   Processing "flash.tar.gz": [duplicate]
   Processing "test.txt": [duplicate]
   ** Finished, returning to "/tmp/data".
   ** Finished, returning to "/tmp".

Calling exit routines for the modules...

  (i)   dedup script...
           Cleanup: [deleting] [success] [deleting] [success] [done]

The job has completed successfully @ Tue Mar 27 11:39:59 EDT 2012


Showing the contents of the de-duplicated data...

/tmp/dedup $ ls -alR
.:
total 44
drwxrwxr-x  4 dave users  4096 2012-03-27 11:39 .
drwxrwxrwt 18 dave users 20480 2012-03-27 11:45 ..
-rw-rw-r--  1 dave users   224 2012-03-27 11:39 20120327113934.db
-rw-rw-r--  1 dave users  1439 2012-03-27 11:39 20120327113934.log
drwxrwxr-x  4 dave users  4096 2012-03-27 11:39 dataset1
drwxrwxr-x  2 dave users  4096 2012-03-27 11:39 dataset2

./dataset1:
total 140272
drwxrwxr-x 4 dave users      4096 2012-03-27 11:39 .
drwxrwxr-x 4 dave users      4096 2012-03-27 11:39 ..
drwxrwxr-x 3 dave users      4096 2012-03-27 11:39 a
-rw-r--r-- 1 dave users 143614369 2012-03-27 11:39 flash.tar.gz
drwxrwxr-x 2 dave users      4096 2012-03-27 11:39 original

./dataset1/a:
total 12
drwxrwxr-x 3 dave users 4096 2012-03-27 11:39 .
drwxrwxr-x 4 dave users 4096 2012-03-27 11:39 ..
drwxrwxr-x 2 dave users 4096 2012-03-27 11:39 b

./dataset1/a/b:
total 16
drwxrwxr-x 2 dave users 4096 2012-03-27 11:39 .
drwxrwxr-x 3 dave users 4096 2012-03-27 11:39 ..
-rwxr-xr-x 1 dave users  642 2012-03-27 11:39 test.sh
-rwxrwxr-x 1 dave users  517 2012-03-27 11:39 test.txt

./dataset1/original:
total 500748
drwxrwxr-x 2 dave users      4096 2012-03-27 11:39 .
drwxrwxr-x 4 dave users      4096 2012-03-27 11:39 ..
-rw-r--r-- 1 dave users 512753664 2012-03-27 11:39 flash.img

./dataset2:
total 12
drwxrwxr-x 2 dave users 4096 2012-03-27 11:39 .
drwxrwxr-x 4 dave users 4096 2012-03-27 11:39 ..
-rw-rw-r-- 1 dave users   10 2012-03-27 11:39 a_new_one.txt
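

To double-check a result like the one above, one could list any MD5 hash that occurs more than once under a directory. The helper below is a hypothetical sketch (list_duplicate_hashes is not part of 'dedup') and assumes GNU md5sum. Note that it hashes every regular file it finds, so the .db and .log files in the target directory are included as well.

```shell
# Hypothetical helper: print each MD5 hash shared by two or more files
# under the given directory. Prints nothing when all files are unique.
# Usage: list_duplicate_hashes DIR
list_duplicate_hashes() {
    find "$1" -type f -exec md5sum {} + | cut -d' ' -f1 | sort | uniq -d
}
```

Running list_duplicate_hashes /tmp/dedup and getting no output would suggest that no two files under the target directory share the same content.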



Developers

Dave Henderson [dhenderson (at) cliquesoft (dot) org]