Benutzer-Werkzeuge

Webseiten-Werkzeuge


prog:python_data_loader_oracle_bfile_text_index

Unterschiede

Hier werden die Unterschiede zwischen zwei Versionen angezeigt.

Link zu dieser Vergleichsansicht

Beide Seiten der vorigen Revision Vorhergehende Überarbeitung
prog:python_data_loader_oracle_bfile_text_index [2016/04/17 20:02]
gpipperr
prog:python_data_loader_oracle_bfile_text_index [2016/04/24 21:47] (aktuell)
gpipperr [Code DocumentLoader.py (Python 3.4 !)]
Zeile 1: Zeile 1:
 +=====Mit Python 3.4 Referenzen alle Dateien in einer Verzeichnisstruktur in der Oracle Datenbank speichern=====
 +
 +**Aufgabe:** 
 +
 +Eine Verzeichnisstruktur auf einen Windows Server soll über Oracle Text in der Oracle Datenbank indiziert werden.
 +
 +Dazu muss eine Referenz auf jede Datei in der Datenbank gespeichert werden.
 +
 +Beim Laden soll auch der Datei Typ, die Datei Endung, der MIME Type, der MD5 Hash, die Größe der Datei und  die Datumsfelder der Datei bestimmt und in die Ziel Tabelle geladen werden.
 +
 +Später soll noch ein Vorschaubild und eine Zusammenfassung der Meta Daten der Datei, die Autor etc und die Sprache des Dokuments ermittelt werden.
 +
 +
 +Nach dem Laden werden die Daten mit Oracle Text indiziert, siehe dazu => [[dba:oracle_text_index_binary_data|Oracle Text für die Indizierung binärer Daten verwenden]]
 +
 +
 +Über eine APEX Oberfläche kann dann eine Suche über die Dateien durchgeführt werden siehe dazu => [[prog:oracle_apex_oracle_text_document_archive|Mit Oracle APEX 5 und Oracle Text ein Dokumentenarchive für technische Dokumentation aufbauen]].
 +
 +----
 +
 +==== Vorbereitung ====
 +
 +=== Mit Unix Magic (file Befehl) den File Type bestimmen ====
 +
 +Leider ist es mir unter Windows unter Python nicht gelungen, Magic direkt einzusetzen.
 +
 +Normalerweise müssen nur die Windows DDL' von Magic zur Verfügung stehen und mit python-magic Package von https://github.com/ahupp/python-magic kann dann direkt in Python der File Typ und weitere Meta Daten ausgelesen werden.
 +
 +Leider hat das aber bei mir nicht funktioniert.
 +
 +<code python>
 +.\python -m pip install python-magic
 +</code>
 +
 +Libs von herunterladen:
 +
 +
 +  * http://gnuwin32.sourceforge.net/packages/file.htm
 +    * file über http://sourceforge.net/projects/gnuwin32/files/file/5.03/ 
 +  * Dependencies wie lib1.dll und regex2.dll über http://gnuwin32.sourceforge.net/downlinks/file-dep-zip.php
 +
 +
 +Test:
 +<code python>
 +import magic
 +magic_file_path= "D:\\tools\\file\\share\misc"
 +magic.Magic(magic_file =magic_file_path)
 +
 +#Fehler:
 +...
 +D:\tools\file\share\misc/magic.mgc, 1: Warning: offset `ñ' invalid
 +D:\tools\file\share\misc/magic.mgc, 1: Warning: type `ñ' invalid
 +...
 +#usw..
 +</code>
 +
 +Trotz einer weitere Versuche und Internet Recherche ist es mir nicht gelungen das hinzubekommen.
 +
 +Daher rufe ich nun magic über "file.exe" direkt als Programm auf, dazu alle dll's (magic1.ddl, zlib1.dll und regex2.dll ) in das bin Verzeichnis von file kopieren.
 +
 +Gerade bei alten Office Formaten kann magic sehr hilfreich sein, es wird zum Beispiel auch der Autor mit ausgelesen.
 +
 +=== Directory Objekt in der Datenbank anlegen und Zugriffs-Rechte vergeben ===
 +
 +Über ein Directory Object wird der Einstieg in die Verzeichnisstruktur auf der Festplatte definiert.
 +
 +In unseren Test Fall liegen die Daten unter "d:\data\info-archiv" in einer Verzeichnisstruktur.
 +
 +Anlegen als SYS und Rechte vergeben:
 +
 +<code sql>
 +sqlplus / as sysdba
 +
 +#Directory anlegen
 +create directory INFO_ARCHIVE as 'D:\data\info-archiv';
 +
 +#Rechte vergeben
 +grant read,write on directory INFO_ARCHIVE to GPI;
 +</code>
 +
 +
 +=== Dokumenten Tabelle anlegen ===
 +
 +In diese Tabelle werden die Informationen eingelesen:
 +<code sql>
 +
 +sqlplus gpi/gpi
 +
 +---------------------------------------------------------
 +-- Document table
 +---------------------------------------------------------
 +
 +CREATE TABLE Documents
 +  (
 +    ID                  NUMBER (30) NOT NULL ,
 +    Filename            VARCHAR2 (512 CHAR) NOT NULL ,
 +    FileTyp             VARCHAR2 (32 CHAR) ,
 +    FileDirectory       VARCHAR2 (2000 CHAR) NOT NULL ,
 +    FilePointer BFILE   NOT NULL ,
 +    MD5Hash             VARCHAR2 (32 CHAR) NOT NULL ,
 +    FileCreateDate      DATE ,
 +    FileLastModify      DATE ,
 +    Language            VARCHAR2 (32 CHAR) ,
 +    CreateDate          DATE default sysdate,
 +    CreateUser          VARCHAR2 (32 CHAR) default user,
 +    ChangeDate          DATE default sysdate,
 +    ChangeUser          VARCHAR2 (32 CHAR) default user,
 +    Theme_data_avaiable VARCHAR2 (1 CHAR)  DEFAULT 'N'
 +  ) ;
 +
 +CREATE UNIQUE INDEX IDX_Documents_ID_PK ON Documents (  ID ASC );
 +
 +ALTER TABLE Documents ADD CONSTRAINT Document_PK PRIMARY KEY ( ID ) ;
 +
 +COMMENT ON TABLE Documents IS   'Store the file information' ;
 +COMMENT ON COLUMN Documents.ID IS  'Primary Key' ;
 +COMMENT ON COLUMN Documents.Filename IS  'Name of the file on disk' ;
 +COMMENT ON COLUMN Documents.FileTyp IS  'Typ of the file' ;
 +COMMENT ON COLUMN Documents.FileDirectory IS  'Directory of the file' ;
 +COMMENT ON COLUMN Documents.FilePointer IS  'Bfile Pointer to the File' ;
 +COMMENT ON COLUMN Documents.MD5Hash IS  'Hash of the files to indentify dublicate files' ;
 +COMMENT ON COLUMN Documents.FileCreateDate IS  'File Create Time from the file' ;
 +COMMENT ON COLUMN Documents.FileLastModify IS  'Last modificatoin date from the file' ;
 +COMMENT ON COLUMN Documents.Language IS  'Language of the file (Oracle NLS Format String!)' ;
 +COMMENT ON COLUMN Documents.CreateDate IS  'Date when the record was created' ;
 +COMMENT ON COLUMN Documents.CreateUser IS  'User create the record' ;
 +COMMENT ON COLUMN Documents.ChangeDate IS  'Last Change on the record' ;
 +COMMENT ON COLUMN Documents.ChangeUser IS  'User change the record' ;
 +COMMENT ON COLUMN Documents.Theme_data_avaiable IS  'If Themdata is there => Y, if not N' ;
 +
 +---------------------------------------------------------
 +
 +create sequence documents_seq;
 +</code>
 +
 +
 +
 +----
 +
 +=== Oracle CX für den Datenbank Zugriff für Python 3.4 einrichten ===
 +
 +Siehe dazu => [[python:python_database|Python 2.7 / 3.4 und die Oracle Datenbank]]
 +
 +
 +----
 +
 +
 +==== Code DocumentLoader.py (Python 3.4 !) ====
 +
 +Aufruf mit "DocumentLoader.py  -s <src>  -r <recursive Level> -d <oracle_directory_path> -c <config_path>"
 +
 +Beim ersten Aufruf wird das Template für die Konfigurationsdatei "dataLoader.conf" erzeugt, diese Datei dann mit den entsprechenden Info's für die jeweilige Umgebung ausfüllen.
 +
 +
 +<code python DocumentLoader.py>
 +__author__ = 'gpipperr'
 +
 +import datetime, time
 +import glob, filecmp, ntpath, shutil
 +import os, errno, sys, getopt
 +import hashlib
 +import subprocess
 +import io
 +import cx_Oracle
 +import locale
 +import configparser
 +
 +
 +# get FileType
 +def getFileType(file, command, magicFilePath):
 +    # magicFilePath = "d:\\tools\\file\\bin\\file.exe"
 +    result = "-2"
 +    # Command + arguments  as sequence
 +    args = magicFilePath + " " + command + ' "' + file + '"'
 +    # execute the program and pipe the output to subprocess.PIPE
 +    prog = subprocess.Popen(args, stdout=subprocess.PIPE, shell=True)
 +    # wait until finished
 +    stat = prog.wait()
 +    # connect pipes together
 +    (output, err) = prog.communicate()
 +    result = str(output)
 +    result = str.replace(result, "b'", "")
 +    result = str.replace(result, "\\r", "")
 +    result = str.replace(result, "\\n", "")
 +    escFileName = str.replace(str(file), "\\", "\\\\") + "; "
 +    result = str.replace(result, escFileName, "")
 +    return result
 +
 +
 +# Get extension
 +# Get extension
 +def getFile_ext(filename):
 +    fparts = filename.split('.')
 +    c = len(fparts)
 +    # try to get .tar.gz
 +    v_return = fparts[-1]
 +    if c > 3:
 +        if len(fparts[-2]) < 5:
 +            chars = set('$,()[]_ ')
 +            if any((c in chars) for c in fparts[-2]):
 +                v_return = fparts[-1]
 +            else:
 +                v_return = fparts[-2] + "." + fparts[-1]
 +
 +    if v_return.find('.') < 0:
 +        v_return = "." + v_return
 +
 +    return v_return.lower()
 +
 +
 +# get the MD5 Hash the file
 +def getMD5(filename):
 +    blocksize = 65536
 +    if os.path.exists(filename) == False:
 +        return 'N/A'
 +    md5 = hashlib.md5()
 +    with open(filename, "rb") as f:
 +        for block in iter(lambda: f.read(blocksize), b""):
 +            md5.update(block)
 +    return md5.hexdigest()
 +
 +
 +# Remember the global Size of all copied files
 +def setStatisticTotalSize(size):
 +    global totalFileSize
 +    totalFileSize += size
 +
 +
 +# global for the total filesize
 +totalFileSize = 0
 +
 +
 +# print the file info
 +# later insert the record to the database
 +#   fileInfo=dict(filename=filename,filepath=fileDirectoryName,fileADate=fileAccessDate,fileMDate=fileModDate,fileCDate=fileCreateDate,md5=md5checkSum,fileBType=fileType,fileMtype=fileMimeType)
 +def insertFileInfo(fileInfo, ora_dir_path, connection):
 +    # strip the oracle path
 +    relative_doc_path = str.replace(fileInfo['filepath'], ora_dir_path, "")
 +    try:
 +        print("-- Info for the file ::{0}".format(fileInfo['filename'].encode('utf-8')))
 +        print("                 Path::{0}".format(fileInfo['filepath'].encode('utf-8')))
 +        print("         Relativ Path::{0}".format(relative_doc_path + os.sep + str(fileInfo['filename'])))
 +        print("          Access Date::{0}".format(fileInfo['fileADate']))
 +        print("          Modify Date::{0}".format(fileInfo['fileMDate']))
 +        print("          Create Date::{0}".format(fileInfo['fileCDate']))
 +        print("            File Size::{0}".format(fileInfo['fileSize']))
 +        print("                  MD5::{0}".format(fileInfo['md5']))
 +        print("     File Binary Type::{0}".format(fileInfo['fileBType']))
 +        print("       File Mime Type::{0}".format(fileInfo['fileMtype']))
 +        print("       File Extention::{0}".format(fileInfo['fileExtention']))
 +    except:
 +        print("Unexpected error with file infos")
 +
 +    # insert into the database
 +
 +    # Cursor auf die DB oeffenen
 +    cursor = connection.cursor()
 +
 +    # filename,filetyp,filedirectory,md5hash, FILECREATEDATE, FILELASTMODIFY
 +    # .encode('utf-8')
 +    # .encode('utf-8')
 +
 +    doc_rows = [(str(fileInfo['filename'])
 +                 , fileInfo['fileExtention']
 +                 , fileInfo['filepath']
 +                 , fileInfo['md5']
 +                 , fileInfo['fileCDate']
 +                 , fileInfo['fileMDate']
 +                 , relative_doc_path + os.sep + str(fileInfo['filename'])
 +                 , fileInfo['fileSize']
 +                 , fileInfo['fileADate']
 +                 )
 +                ]
 +
 +    # how many rows to insert
 +    cursor.bindarraysize = len(doc_rows)
 +
 +    # datatype - lenght of the inserted data
 +    # cursor.setinputsizes(int, 14, 13)
 +    try:
 +        # insert the whole record
 +        cursor.executemany(
 +            "insert into documents (ID , FILENAME, FILETYP, FILEDIRECTORY, MD5HASH, FILECREATEDATE, FILELASTMODIFY,FILEPOINTER,filesize,FILELASTACESS) values ( documents_seq.nextval, :1, :2, :3, :4, :5, :6,BFILENAME('INFO_ARCHIVE',replace(:7,';','')),:8,:9)",
 +            doc_rows)
 +    except cx_Oracle.DatabaseError as e:
 +        connection.rollback()
 +        print("-- Error - Oracle Database Error:" + str(e))
 +    except Exception as e:
 +        print("-- Error - Unexpected error:" + str(e))
 +        connection.rollback()
 +        # raise
 +    connection.commit()
 +
 +
 +# Main Script part
 +def main(argv):
 +    global totalFileSize
 +
 +
 +    # Parameter 1   - Import Directory
 +    # Parameter 2   - Subfolder Level
 +
 +
 +    # Remember the start time of the program
 +    start_time = time.clock()
 +
 +    usage_string = " DocumentLoader.py  -s <src>  -r <recursive Level> -d <oracle_directory_path> -c <config_path>"
 +
 +    path_name = '-'
 +    dest_name = '-'
 +    recursiveLevel = 0
 +    config_path = 'dataLoader.conf'
 +
 +    # -- Parameter from the config file
 +
 +    oracle_port = '-'
 +    oracle_host = '-'
 +    oracle_service = '-'
 +    oracle_user = '-'
 +    oracle_pwd = '-'
 +    magicFilePath = '-'
 +    ingoreFileExt = []
 +
 +    # Path of the oracle Info Archive
 +    ora_dir_path = '-'
 +
 +    try:
 +        opts, args = getopt.getopt(argv, "hs:d:r:d:c:", ["src=", "rec=", "dir=", "config="])
 +    except getopt.GetoptError:
 +        print("usage: {0}".format(usage_string))
 +        sys.exit(2)
 +    # read the parameter
 +    for opt, arg in opts:
 +        if opt == '-h':
 +            print("usage: {0}").format(usage_string)
 +            sys.exit()
 +        elif opt in ("-s", "--src"):
 +            path_name = arg
 +        elif opt in ("-d", "--dir"):
 +            ora_dir_path = arg
 +        elif opt in ("-c", "--config"):
 +            config_path = arg
 +        elif opt in ("-r", "--rec"):
 +            recursiveLevel = int(arg)
 +
 +    # read the config file
 +    config = configparser.ConfigParser()
 +    # check if the file exits
 +    if os.path.exists(config_path) == False:
 +        # use normal configparser to write the template
 +        print("--" + 80 * "!")
 +        print("-- Error to read file {0}".format(config_path))
 +        print("-- Error usage: {0}".format(usage_string))
 +        config['DEFAULT'] = {'MagicFile': 'd:\\tools\\file\\bin\\file.exe', 'ignoreFileExt': '.iso'}
 +        config['ORACLE_DB_CONNECT'] = {'Host': 'localhost', 'Port': '1521', 'Service': 'ORCL', 'DB_User': 'USER',
 +                                       'DB_Password': 'xxxxx'}
 +        with open(config_path, 'w') as configfile:
 +            config.write(configfile)
 +        print("--" + 80 * "!")
 +        print("-- Info create Configuration Template :: {0}".format(config_path))
 +        print("-- Info fillout the configuration file  with your personal values and start again!")
 +        print("--" + 80 * "!")
 +        sys.exit(2)
 +    else:
 +        print("-- Info read config file {0}".format(config_path))
 +        config.read(config_path)
 +        # Parameter of the application
 +        general_configuration = config['DEFAULT']
 +        magicFilePath = general_configuration['MagicFile']
 +        ingoreFileExtString = general_configuration['ignoreFileExt']
 +        ingoreFileExt = str.split(str.replace(ingoreFileExtString, ' ', ''), ',')
 +        # Oracle DB Connect
 +        oracle_db_configuration = config['ORACLE_DB_CONNECT']
 +        oracle_port = oracle_db_configuration['Port']
 +        oracle_host = oracle_db_configuration['Host']
 +        oracle_service = oracle_db_configuration['Service']
 +        oracle_user = oracle_db_configuration['DB_User']
 +        oracle_pwd = oracle_db_configuration['DB_Password']
 +
 +
 +    # check if Directory exists and if the * is necessary
 +    # BUG ! if more then 1 then the * not match the documents on root level??
 +    # FIX IT!
 +    # Source
 +    if os.path.isdir(path_name):
 +        if path_name.endswith(os.path.sep):
 +            path_name += ("*" + os.path.sep) * recursiveLevel
 +            path_name += "*.*"
 +        else:
 +            path_name += os.path.sep
 +            path_name += ("*" + os.path.sep) * recursiveLevel
 +            path_name += "*.*"
 +    else:
 +        print("-- Error :: 05 Source Directory (-s) {0} not found".format(path_name))
 +        print("usage: {0}").format(usage_string)
 +        sys.exit(2)
 +
 +    # connect to the database
 +    print("--" + 40 * "=")
 +    print("-- Info :: Oracle Client Library Version :: {0}".format(cx_Oracle.clientversion()))
 +    # get the connection to the database
 +    print("-- Info :: oracle_host {0} oracle_port {1} oracle_service {2} oracle_user {3} oracle_pwd ********".format(
 +        oracle_host, oracle_port, oracle_service, oracle_user, oracle_pwd))
 +    connection = cx_Oracle.connect(
 +        oracle_user + '/' + oracle_pwd + '@' + oracle_host + ':' + oracle_port + '/' + oracle_service)
 +    # Version der DB ausgeben
 +    print("-- Info :: Oracle Database Version :: {0}".format(connection.version))
 +    print("--" + 40 * "=")
 +
 +    print("-- Info  :: Environment Settings :: Language :: {0} - Char Set ::{1}".format(locale.getdefaultlocale()[0],
 +                                                                                        locale.getdefaultlocale()[1]))
 +
 +    print("--" + 40 * "=")
 +    print("-- Info  :: Read all files from {0}".format(path_name))
 +    print("-- Info  :: Copy files to       {0}".format(dest_name))
 +    print("-- Info  :: Not index this file types ::" + str(ingoreFileExt))
 +    print("--" + 40 * "=")
 +
 +    fileCount = 0
 +    fileExistsCount = 0
 +    dirCount = 0
 +    dirPathList = []
 +    totalFileSize = 0
 +    # Get the list of all Files
 +    fileList = glob.glob(path_name)
 +
 +    # remove Thumbs.db if exist from the list
 +    thumbsDBFile = "Thumbs.db"
 +    for file in fileList:
 +        if file.endswith(thumbsDBFile):
 +            fileList.remove(file)
 +
 +    # Loop one read files in Import Directory
 +    for file in fileList:
 +        fileCount += 1
 +
 +        # do the work
 +        try:
 +            # check if directoy
 +            if ntpath.isdir(file):
 +                print("-- Info :: found dirctory {0} ::".format(file))
 +                dirCount += 1
 +            else:
 +                # get only the filename without the path
 +                filename = ntpath.basename(file)
 +                # get directory
 +                fileDirectoryName = ntpath.dirname(file)
 +                # get Create date
 +
 +                fileAccessDate = datetime.datetime.fromtimestamp(ntpath.getatime(file))
 +                fileModDate = datetime.datetime.fromtimestamp(ntpath.getmtime(file))
 +                fileCreateDate = datetime.datetime.fromtimestamp(ntpath.getctime(file))
 +                # get md5 hash
 +                md5checkSum = getMD5(file)
 +                # get File Type over file from external, Python Lib magic not working, error with magic file!
 +                # now I implement this stupid solution
 +                fileType = getFileType(file, " ", magicFilePath)
 +                # Call file with -i to get the mime Type
 +                fileMimeType = getFileType(file, "-i", magicFilePath)
 +                # get Extenstion
 +                fileExt = getFile_ext(filename)
 +                # getFileSize
 +                fileSize = os.path.getsize(file)
 +                # Remember for statistic
 +                setStatisticTotalSize(fileSize)
 +
 +                # not add url files to the index
 +                # endswith(".url")
 +
 +                if fileExt in ingoreFileExt:
 +                    # encode the output with UTF-8 to avoid errors with stange things in filenames
 +                    print("-- Info :: Not index this file types ::" + str(ingoreFileExt))
 +                    print("-- Info :: Not index this file::{0}".format(repr(filename.encode('utf-8'))))
 +                    print("-- Info :: Not index this Dir::{0}".format(repr(fileDirectoryName.encode('utf-8'))))
 +                    print("-- --")
 +                else:
 +                    # record
 +                    fileInfo = dict(filename=str(filename), filepath=str(fileDirectoryName), fileADate=fileAccessDate,
 +                                    fileMDate=fileModDate, fileCDate=fileCreateDate, md5=md5checkSum,
 +                                    fileBType=fileType,
 +                                    fileMtype=fileMimeType, fileExtention=fileExt, fileSize=fileSize)
 +
 +                    ##print("-- Index this Dir::{0}".format(repr(fileDirectoryName.encode('utf-8'))))
 +                    # encode the output with UTF-8 to avoid errors with stange things in filenames
 +                    ##print("-- Index this file   ::{0}".format(repr(filename.encode('utf-8'))))
 +                    # write to DB
 +                    insertFileInfo(fileInfo, ora_dir_path, connection)
 +
 +        except OSError as exception:
 +            if exception.errno != errno.EEXIST:
 +                print("-- Error :: Error read file :: see error {1}".format(file, sys.exc_info()[0]))
 +
 +
 +    # print statistics
 +    print("--" + 40 * "=")
 +    print("-- Finish with           :: {0} files in {1} new directories".format(fileCount, dirCount))
 +    print("-- The run needs         :: {0:5.4f} seconds".format(time.clock() - start_time))
 +    print("-- Read size             :: {0:5.3f} MB".format(totalFileSize / 1024 / 1024))
 +    print("--" + 40 * "=")
 +
 +    # Close the DB Connection
 +    connection.close()
 +
 +
 +if __name__ == "__main__":
 +    main(sys.argv[1:]);
 +
 +
 +</code>
 +
 +Config file:
 +<code bash dataLoader.conf>
 +[DEFAULT]
 +# Path and name of the magic file program
 +MagicFile=d:\tools\file\bin\file.exe
 +
 +# Do Not read files with this extension
 +# use , to separate the file extensions
 +ignoreFileExt=.cmd ,.iso ,.url,.exe,.class,.dbf
 +
 +[ORACLE_DB_CONNECT]
 +Port = 1521
 +Host = 10.10.10.1
 +Service = GPI
 +DB_User= GPI
 +DB_Password = xxxxxx
 +
 +</code>
 +
 +
  
"Autor: Gunther Pipperr"
prog/python_data_loader_oracle_bfile_text_index.txt · Zuletzt geändert: 2016/04/24 21:47 von gpipperr