Gunther Oracle APEX und Datenbank Wiki

Inhaltsverzeichnis

Dateien in Python hashen - Performance Überlegungen - Dubletten suchen

Dateien in Python hashen - Performance Überlegungen - Dubletten suchen

Mit dem Packet „hashlib“ können die verschiedenen Hash Algorithmen in Python angewandt werden.

Import mit: import hashlib
Alle möglichen Algorithmen aufzeigen mit : hashlib.algorithms_available
Hash Class definieren : hashtask = hashlib.new(hashAlgo)
Hash ermitteln lassen: hashtask.update(buf)
File Hash anzeigen : filehash = hashtask.hexdigest()

Code um die Performance der verschiedenen Algorithmen zu prüfen

Die Test Datei wird über einen Buffer eingelesen und der Bufferwert wird gehasht:

__author__ = 'gpipperr'
import time, os
 
# Hashing
import hashlib
 
# graph
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
 
# check the possible hash libs
print(hashlib.algorithms_available)
 
# add your testfile here
file = 'D:\\temp\\20140818\\DSCN0025.TIF'
size = os.path.getsize(file)
 
# size of the file buffer
buffersize = 262144
 
results = []
for hashAlgo in hashlib.algorithms_available:
 
    # Remember the start time of the program
    start_time = time.clock()
 
    # define the hash algorithm
    hashtask = hashlib.new(hashAlgo)
 
    # Open the file and read and hash
    with open(file, 'rb') as afile:
        buffer = afile.read(buffersize)
        while len(buffer) > 0:
            hashtask.update(buffer)
            buffer = afile.read(buffersize)
 
    # get the hash af the file
    filehash = hashtask.hexdigest()
 
    # Remember Results
    results.append((hashAlgo, time.clock() - start_time))
    # print("-- Info  :: The file {0} has the {1} hash :: {2}".format(file,hashalgo,filehash))
    # print("-- Info  :: The singel run needs          :: {0:5.8f} seconds".format(time.clock() - start_time))
 
# sort the result list after the second parameter
results.sort(key=lambda x: x[1])
sorted_result = results
 
print("--" + 40 * "=")
print("-- Info  :: File {0} and size {1:3.3f} MB".format(file, size / 1024 / 1024))
print("-- Info  :: Use Buffersize {0}".format(buffersize))
 
for result in sorted_result:
    print("   Hash {0:15} | {1:2.6f} ".format(result[0], result[1]))
 
print("--" + 40 * "=")
 
# check the main algorithm  with different buffersizes
buffersize = 512
resultsBuffersize = []
 
# algoList=hashlib.algorithms_available
algoList = ['SHA1', 'MD5', 'DSA-SHA', 'MD4', 'SHA', 'DSA']
 
for i in range(12):
    buffersize = 512
    for i in range(10):
        buffersize = buffersize * 2
        for hashAlgo in algoList:
            # Remember the start time of the program
            start_time = time.clock()
            hashtask = hashlib.new(hashAlgo)
            with open(file, 'rb') as afile:
                buffer = afile.read(buffersize)
                while len(buffer) > 0:
                    hashtask.update(buffer)
                    buffer = afile.read(buffersize)
            filehash = hashtask.hexdigest()
            resultsBuffersize.append((hashAlgo, buffersize, time.clock() - start_time))
 
# sort the list with sorted after the second element
sorted_result = sorted(resultsBuffersize, key=lambda x: float(x[2]))
 
print("--" + 40 * "=")
print("-- Info  :: Results for Buffersize and Algorithm")
print(" {0:17} | {1:11} |  {2:10} ".format("Hash", "Buffer", "Time"))
 
for result in sorted_result:
    print("  {0:15} |  {1:8} |  {2:2.6f} ".format(result[0], result[1], result[2]))
 
print("--" + 40 * "=")
 
valuesX = []
valuesY = []
# graph the figures
for hashAlgo in algoList:
    for val in sorted_result:
        if val[0] == hashAlgo:
            valuesX.append(val[2])
            valuesY.append(val[1])
 
    plt.plot(valuesX, valuesY, label=hashAlgo, marker='x')  # --,marker='o', linestyle='--', color='r',
 
    # plot a smoth line with interpolation #must be implemented later
    # f2 = interp1d(valuesX, valuesY, kind='cubic')
    # plt.plot(f2, label=hashAlgo, marker='x')
 
    # Empty the list
    valuesX[:] = []
    valuesY[:] = []
 
 
############################################################
# Graph the figures
############################################################
# Runtime X axis
# plt.xscale('log')
plt.xlabel('Runtime s')
plt.xlim([0.04, 0.12])
 
# Buffersize Y axis
plt.ylabel('Buffersize Byte')
plt.yscale('log')
buffersize = 512
yticks = []
for i in range(10):
    buffersize = buffersize * 2
    yticks.append(buffersize)
plt.yticks(yticks, yticks)
 
# Diagram options
plt.title('Hash Performance for 10 runs of each algorithm with different buffersize')
legend = plt.legend(loc='upper right', shadow=True, fontsize='x-large')
legend.get_frame().set_facecolor('#FFFFE0')
plt.grid()
 
plt.show()
############################################################

Performance Überlegungen

In der folgenden Graphik der wichtigsten Hash Algorithmen ist der Zusammenhang zwischen der Puffer Größe beim Verarbeiten der Datei und der Laufzeit für die wichtigen Hash Algorithmen aufgezeigt.

Jeder Hash Versuch wird 10 mal mit der gleichen Puffergröße durchgeführt um Schwankungen auf dem System auszugleichen.

Die besten Werte ergeben sich bei folgenden Kombinationen für die Testdatei (Größe 12.588MB) auf einen Intel i7 Notebook:

Hash	Buffer	Time s
MD4	262144	0.044618
MD5	262144	0.046280
DSA-SHA	262144	0.049455
DSA	262144	0.049489
SHA1	262144	0.049500
SHA	262144	0.084308

D.h. im aktuellen Fall ist die Puffergröße 262144 Byte am besten.

Auch mit openssl kann die Performance der aktuellen Maschine für verschiedene Hashs überprüft werden:

openssl speed

Dient aber im Prinzip mehr dazu die Performance der Verschlüsselungsalgorithmen besser zu verstehen.

Einsatz - Erkennen von Dubletten in einen Verzeichnis Baum

Auslesen der Verzeichnisstruktur mit for folder, subFolders, files in os.walk(base_dir): und hashen aller Dateien, anschließend die Liste auf Dubletten prüfen.

Während des Laufes wird eine Progressbar angezeigt, daher muss leider zuvor die Anzahl der Dateien ermittelt werden.

Das sieht dann so aus:

D:\Python34\python.exe D:/entwicklung/work/python/ImageImp/deduplicateDirectoryTree.py
--========================================
-- Info :: start to read base directory d:\temp with 7116 files (each x for 88 files)
--========================================
--
-- [xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx]
--
--========================================
 
....
 ----------------------------------------
-- Info hash ffd7ac3677ce85775a40b195f4263d5d and this date 2014-10-07 10:43:06 ::  File d:\temp\old_images\Handy_Bilder\WhatsApp Images\IMG-20141007-WA0002.jpeg 
-- Info hash ffd7ac3677ce85775a40b195f4263d5d and this date 2014-10-07 09:43:06 ::  File d:\temp\old_images\WhatsApp Images\IMG-20141007-WA0002.jpeg 
 
--========================================
-- Info :: found 7116 duplicate entries in 7116 total files
-- Info  :: The run needs          :: 33.66735999 seconds
--========================================

Im nächsten Schritt stellt sich nun die Frage welche der doppelten Daten dann am Ende gelöscht werden sollen.

Werde das wohl so implementieren, das ein Verzeichnis als Quelle angeben wird und eine Baum als Ziel.

Der Anwender kann entscheiden, ob die Daten aus der Quelle oder aus dem Ziel Baum gelöscht werden sollen.

Code

__author__ = 'gpipperr'
 
import time, os, sys, datetime
 
# Hashing
import hashlib
 
 
# Get the change date of a file
def modification_date(filename):
    t = os.path.getmtime(filename)
    return datetime.datetime.fromtimestamp(t)
 
 
# Write the progressbar
def write_progress(step, width):
    if width > step:
        s = "X" * step + ">"
    else:
        s = "X" * step
    r = "." * (width - step)
    print("-- [{0:{1}}]".format(s + r, width), end="\r")
    sys.stdout.flush()
 
 
# Close the progressbar
def finish_progress(step):
    s = "x" * step
    print("-- [" + s + "]")
    sys.stdout.flush()
 
 
# after x files draw a point
progress_points = 10
progress_width = 80
# algoList=hashlib.algorithms_available
algoList = ['SHA1', 'MD5', 'DSA-SHA', 'MD4', 'SHA', 'DSA']
# Best Performance
hashAlgo = 'MD5'
# size of the file buffer
buffersize = 262144
 
# Remember the start time of the program
start_time = time.clock()
 
# read of a complete directory structure
# base_dir = "d:\\temp\\duptest"
base_dir = "d:\\temp"
 
fileRead = 0
# calculate how many files to compare
for folder, subFolders, files in os.walk(base_dir):
    # print( "folder {0}, subFolders {1}, files {2}".format( folder, subFolders, files ) )
    for file in files:
        fileRead += 1
 
# calculate the progressbar
progress_points = int(fileRead / progress_width)
 
print("--" + 40 * "=")
print("-- Info :: start to read base directory {0} with {1} files (each x for {2} files)".format(base_dir, fileRead,
                                                                                                 progress_points))
# print("-- Info  :: The prep needs          :: {0:5.8f} seconds".format(time.clock() - start_time))
print("--" + 40 * "=")
print("--")
 
fileRead = 0
step = 1
# get all MD5 Hashes of all Files in the directory structure
results = []
for folder, subFolders, files in os.walk(base_dir):
    # print( "folder {0}, subFolders {1}, files {2}".format( folder, subFolders, files ) )
    for file in files:
        fileFullName = folder + os.path.sep + file
        if os.path.isfile(fileFullName):
            fileRead += 1
 
            # progressbar
            if fileRead % progress_points == 0:
                write_progress(step, progress_width)
                step += 1
 
            # define the hash algorithm
            hashtask = hashlib.new(hashAlgo)
            try:
                # Open the file and read and hash
                with open(fileFullName, 'rb') as afile:
                    buffer = afile.read(buffersize)
                    while len(buffer) > 0:
                        hashtask.update(buffer)
                        buffer = afile.read(buffersize)
                # get the hash af the file
                filehash = hashtask.hexdigest()
                # Remember Results
                results.append([filehash, fileFullName, modification_date(fileFullName)])
                # print("-- Info  :: The file {0} has the {1} hash :: {2}".format(fileFullName, hashAlgo, filehash))
            except:
                print("-- Error :: 01 - Read File {0} :: error {1}:".format(fileFullName, sys.exc_info()))
                pass
# get duplicate md5 hash values
 
# close progressbar
finish_progress(step)
print("--")
 
# sort md4 hashes in the list
results.sort(key=lambda x: x[0])
 
# get only the element of a list
h = []
# for a,b in enumerate(results):
for b in results:
    h.append(b[0])
 
# get a unorderd but unique list
s = set(h)
print("--" + 40 * "=")
 
 
# search the duplicates
dub = []
for x in h:
    if x in s:
        s.remove(x)
    else:
        dub.append(x)
 
x = '-'
dub.sort(key=lambda x: x[0])
if len(dub) > 0:
    x = dub[0]
 
# show the duplicates entries
for res in results:
    if res[0] in dub:
        if res[0] != x:
            print("  " + 40 * "-")
        print("-- Info hash {0} and this date {2} ::  File {1} ".format(res[0], res[1], res[2]))
    x = res[0]
 
print("--" + 40 * "=")
print("-- Info :: found {1} duplicate entries in {0} total files".format(len(h) - len(s), fileRead))
print("-- Info  :: The run needs          :: {0:5.8f} seconds".format(time.clock() - start_time))
print("--" + 40 * "=")

Quellen

Hashing Allgemein

Duplikate finden in Listen

http://stackoverflow.com/questions/9835762/find-and-list-duplicates-in-python-list