Benutzer-Werkzeuge

Webseiten-Werkzeuge


prog:python_data_loader_oracle_bfile_text_index

Mit Python 3.4 Referenzen alle Dateien in einer Verzeichnisstruktur in der Oracle Datenbank speichern

Aufgabe:

Eine Verzeichnisstruktur auf einen Windows Server soll über Oracle Text in der Oracle Datenbank indiziert werden.

Dazu muss eine Referenz auf jede Datei in der Datenbank gespeichert werden.

Beim Laden soll auch der Datei Typ, die Datei Endung, der MIME Type, der MD5 Hash, die Größe der Datei und die Datumsfelder der Datei bestimmt und in die Ziel Tabelle geladen werden.

Später soll noch ein Vorschaubild und eine Zusammenfassung der Meta Daten der Datei, die Autor etc und die Sprache des Dokuments ermittelt werden.

Nach dem Laden werden die Daten mit Oracle Text indiziert, siehe dazu ⇒ Oracle Text für die Indizierung binärer Daten verwenden

Über eine APEX Oberfläche kann dann eine Suche über die Dateien durchgeführt werden siehe dazu ⇒ Mit Oracle APEX 5 und Oracle Text ein Dokumentenarchive für technische Dokumentation aufbauen.


Vorbereitung

Mit Unix Magic (file Befehl) den File Type bestimmen

Leider ist es mir unter Windows unter Python nicht gelungen, Magic direkt einzusetzen.

Normalerweise müssen nur die Windows DDL' von Magic zur Verfügung stehen und mit python-magic Package von https://github.com/ahupp/python-magic kann dann direkt in Python der File Typ und weitere Meta Daten ausgelesen werden.

Leider hat das aber bei mir nicht funktioniert.

.\python -m pip install python-magic

Libs von herunterladen:

Test:

import magic
magic_file_path= "D:\\tools\\file\\share\misc"
magic.Magic(magic_file =magic_file_path)
 
#Fehler:
...
D:\tools\file\share\misc/magic.mgc, 1: Warning: offset `ñ' invalid
D:\tools\file\share\misc/magic.mgc, 1: Warning: type `ñ' invalid
...
#usw..

Trotz einer weitere Versuche und Internet Recherche ist es mir nicht gelungen das hinzubekommen.

Daher rufe ich nun magic über „file.exe“ direkt als Programm auf, dazu alle dll's (magic1.ddl, zlib1.dll und regex2.dll ) in das bin Verzeichnis von file kopieren.

Gerade bei alten Office Formaten kann magic sehr hilfreich sein, es wird zum Beispiel auch der Autor mit ausgelesen.

Directory Objekt in der Datenbank anlegen und Zugriffs-Rechte vergeben

Über ein Directory Object wird der Einstieg in die Verzeichnisstruktur auf der Festplatte definiert.

In unseren Test Fall liegen die Daten unter „d:\data\info-archiv“ in einer Verzeichnisstruktur.

Anlegen als SYS und Rechte vergeben:

sqlplus / AS sysdba
 
#Directory anlegen
CREATE directory INFO_ARCHIVE AS 'D:\data\info-archiv';
 
#Rechte vergeben
GRANT READ,WRITE ON directory INFO_ARCHIVE TO GPI;

Dokumenten Tabelle anlegen

In diese Tabelle werden die Informationen eingelesen:

sqlplus gpi/gpi
 
---------------------------------------------------------
-- Document table
---------------------------------------------------------
 
CREATE TABLE Documents
  (
    ID                  NUMBER (30) NOT NULL ,
    Filename            VARCHAR2 (512 CHAR) NOT NULL ,
    FileTyp             VARCHAR2 (32 CHAR) ,
    FileDirectory       VARCHAR2 (2000 CHAR) NOT NULL ,
    FilePointer BFILE   NOT NULL ,
    MD5Hash             VARCHAR2 (32 CHAR) NOT NULL ,
    FileCreateDate      DATE ,
    FileLastModify      DATE ,
    LANGUAGE            VARCHAR2 (32 CHAR) ,
    CreateDate          DATE DEFAULT sysdate,
    CreateUser          VARCHAR2 (32 CHAR) DEFAULT USER,
    ChangeDate          DATE DEFAULT sysdate,
    ChangeUser          VARCHAR2 (32 CHAR) DEFAULT USER,
    Theme_data_avaiable VARCHAR2 (1 CHAR)  DEFAULT 'N'
  ) ;
 
CREATE UNIQUE INDEX IDX_Documents_ID_PK ON Documents (  ID ASC );
 
ALTER TABLE Documents ADD CONSTRAINT Document_PK PRIMARY KEY ( ID ) ;
 
COMMENT ON TABLE Documents IS   'Store the file information' ;
COMMENT ON COLUMN Documents.ID IS  'Primary Key' ;
COMMENT ON COLUMN Documents.Filename IS  'Name of the file on disk' ;
COMMENT ON COLUMN Documents.FileTyp IS  'Typ of the file' ;
COMMENT ON COLUMN Documents.FileDirectory IS  'Directory of the file' ;
COMMENT ON COLUMN Documents.FilePointer IS  'Bfile Pointer to the File' ;
COMMENT ON COLUMN Documents.MD5Hash IS  'Hash of the files to indentify dublicate files' ;
COMMENT ON COLUMN Documents.FileCreateDate IS  'File Create Time from the file' ;
COMMENT ON COLUMN Documents.FileLastModify IS  'Last modificatoin date from the file' ;
COMMENT ON COLUMN Documents.Language IS  'Language of the file (Oracle NLS Format String!)' ;
COMMENT ON COLUMN Documents.CreateDate IS  'Date when the record was created' ;
COMMENT ON COLUMN Documents.CreateUser IS  'User create the record' ;
COMMENT ON COLUMN Documents.ChangeDate IS  'Last Change on the record' ;
COMMENT ON COLUMN Documents.ChangeUser IS  'User change the record' ;
COMMENT ON COLUMN Documents.Theme_data_avaiable IS  'If Themdata is there => Y, if not N' ;
 
---------------------------------------------------------
 
CREATE SEQUENCE documents_seq;

Oracle CX für den Datenbank Zugriff für Python 3.4 einrichten

Code DocumentLoader.py (Python 3.4 !)

Aufruf mit „DocumentLoader.py -s <src> -r <recursive Level> -d <oracle_directory_path> -c <config_path>“

Beim ersten Aufruf wird das Template für die Konfigurationsdatei „dataLoader.conf“ erzeugt, diese Datei dann mit den entsprechenden Info's für die jeweilige Umgebung ausfüllen.

DocumentLoader.py
__author__ = 'gpipperr'
 
import datetime, time
import glob, filecmp, ntpath, shutil
import os, errno, sys, getopt
import hashlib
import subprocess
import io
import cx_Oracle
import locale
import configparser
 
 
# get FileType
def getFileType(file, command, magicFilePath):
    # magicFilePath = "d:\\tools\\file\\bin\\file.exe"
    result = "-2"
    # Command + arguments  as sequence
    args = magicFilePath + " " + command + ' "' + file + '"'
    # execute the program and pipe the output to subprocess.PIPE
    prog = subprocess.Popen(args, stdout=subprocess.PIPE, shell=True)
    # wait until finished
    stat = prog.wait()
    # connect pipes together
    (output, err) = prog.communicate()
    result = str(output)
    result = str.replace(result, "b'", "")
    result = str.replace(result, "\\r", "")
    result = str.replace(result, "\\n", "")
    escFileName = str.replace(str(file), "\\", "\\\\") + "; "
    result = str.replace(result, escFileName, "")
    return result
 
 
# Get extension
# Get extension
def getFile_ext(filename):
    fparts = filename.split('.')
    c = len(fparts)
    # try to get .tar.gz
    v_return = fparts[-1]
    if c > 3:
        if len(fparts[-2]) < 5:
            chars = set('$,()[]_ ')
            if any((c in chars) for c in fparts[-2]):
                v_return = fparts[-1]
            else:
                v_return = fparts[-2] + "." + fparts[-1]
 
    if v_return.find('.') < 0:
        v_return = "." + v_return
 
    return v_return.lower()
 
 
# get the MD5 Hash the file
def getMD5(filename):
    blocksize = 65536
    if os.path.exists(filename) == False:
        return 'N/A'
    md5 = hashlib.md5()
    with open(filename, "rb") as f:
        for block in iter(lambda: f.read(blocksize), b""):
            md5.update(block)
    return md5.hexdigest()
 
 
# Remember the global Size of all copied files
def setStatisticTotalSize(size):
    global totalFileSize
    totalFileSize += size
 
 
# global for the total filesize
totalFileSize = 0
 
 
# print the file info
# later insert the record to the database
#   fileInfo=dict(filename=filename,filepath=fileDirectoryName,fileADate=fileAccessDate,fileMDate=fileModDate,fileCDate=fileCreateDate,md5=md5checkSum,fileBType=fileType,fileMtype=fileMimeType)
def insertFileInfo(fileInfo, ora_dir_path, connection):
    # strip the oracle path
    relative_doc_path = str.replace(fileInfo['filepath'], ora_dir_path, "")
    try:
        print("-- Info for the file ::{0}".format(fileInfo['filename'].encode('utf-8')))
        print("                 Path::{0}".format(fileInfo['filepath'].encode('utf-8')))
        print("         Relativ Path::{0}".format(relative_doc_path + os.sep + str(fileInfo['filename'])))
        print("          Access Date::{0}".format(fileInfo['fileADate']))
        print("          Modify Date::{0}".format(fileInfo['fileMDate']))
        print("          Create Date::{0}".format(fileInfo['fileCDate']))
        print("            File Size::{0}".format(fileInfo['fileSize']))
        print("                  MD5::{0}".format(fileInfo['md5']))
        print("     File Binary Type::{0}".format(fileInfo['fileBType']))
        print("       File Mime Type::{0}".format(fileInfo['fileMtype']))
        print("       File Extention::{0}".format(fileInfo['fileExtention']))
    except:
        print("Unexpected error with file infos")
 
    # insert into the database
 
    # Cursor auf die DB oeffenen
    cursor = connection.cursor()
 
    # filename,filetyp,filedirectory,md5hash, FILECREATEDATE, FILELASTMODIFY
    # .encode('utf-8')
    # .encode('utf-8')
 
    doc_rows = [(str(fileInfo['filename'])
                 , fileInfo['fileExtention']
                 , fileInfo['filepath']
                 , fileInfo['md5']
                 , fileInfo['fileCDate']
                 , fileInfo['fileMDate']
                 , relative_doc_path + os.sep + str(fileInfo['filename'])
                 , fileInfo['fileSize']
                 , fileInfo['fileADate']
                 )
                ]
 
    # how many rows to insert
    cursor.bindarraysize = len(doc_rows)
 
    # datatype - lenght of the inserted data
    # cursor.setinputsizes(int, 14, 13)
    try:
        # insert the whole record
        cursor.executemany(
            "insert into documents (ID , FILENAME, FILETYP, FILEDIRECTORY, MD5HASH, FILECREATEDATE, FILELASTMODIFY,FILEPOINTER,filesize,FILELASTACESS) values ( documents_seq.nextval, :1, :2, :3, :4, :5, :6,BFILENAME('INFO_ARCHIVE',replace(:7,';','')),:8,:9)",
            doc_rows)
    except cx_Oracle.DatabaseError as e:
        connection.rollback()
        print("-- Error - Oracle Database Error:" + str(e))
    except Exception as e:
        print("-- Error - Unexpected error:" + str(e))
        connection.rollback()
        # raise
    connection.commit()
 
 
# Main Script part
def main(argv):
    global totalFileSize
 
 
    # Parameter 1   - Import Directory
    # Parameter 2   - Subfolder Level
 
 
    # Remember the start time of the program
    start_time = time.clock()
 
    usage_string = " DocumentLoader.py  -s <src>  -r <recursive Level> -d <oracle_directory_path> -c <config_path>"
 
    path_name = '-'
    dest_name = '-'
    recursiveLevel = 0
    config_path = 'dataLoader.conf'
 
    # -- Parameter from the config file
 
    oracle_port = '-'
    oracle_host = '-'
    oracle_service = '-'
    oracle_user = '-'
    oracle_pwd = '-'
    magicFilePath = '-'
    ingoreFileExt = []
 
    # Path of the oracle Info Archive
    ora_dir_path = '-'
 
    try:
        opts, args = getopt.getopt(argv, "hs:d:r:d:c:", ["src=", "rec=", "dir=", "config="])
    except getopt.GetoptError:
        print("usage: {0}".format(usage_string))
        sys.exit(2)
    # read the parameter
    for opt, arg in opts:
        if opt == '-h':
            print("usage: {0}").format(usage_string)
            sys.exit()
        elif opt in ("-s", "--src"):
            path_name = arg
        elif opt in ("-d", "--dir"):
            ora_dir_path = arg
        elif opt in ("-c", "--config"):
            config_path = arg
        elif opt in ("-r", "--rec"):
            recursiveLevel = int(arg)
 
    # read the config file
    config = configparser.ConfigParser()
    # check if the file exits
    if os.path.exists(config_path) == False:
        # use normal configparser to write the template
        print("--" + 80 * "!")
        print("-- Error to read file {0}".format(config_path))
        print("-- Error usage: {0}".format(usage_string))
        config['DEFAULT'] = {'MagicFile': 'd:\\tools\\file\\bin\\file.exe', 'ignoreFileExt': '.iso'}
        config['ORACLE_DB_CONNECT'] = {'Host': 'localhost', 'Port': '1521', 'Service': 'ORCL', 'DB_User': 'USER',
                                       'DB_Password': 'xxxxx'}
        with open(config_path, 'w') as configfile:
            config.write(configfile)
        print("--" + 80 * "!")
        print("-- Info create Configuration Template :: {0}".format(config_path))
        print("-- Info fillout the configuration file  with your personal values and start again!")
        print("--" + 80 * "!")
        sys.exit(2)
    else:
        print("-- Info read config file {0}".format(config_path))
        config.read(config_path)
        # Parameter of the application
        general_configuration = config['DEFAULT']
        magicFilePath = general_configuration['MagicFile']
        ingoreFileExtString = general_configuration['ignoreFileExt']
        ingoreFileExt = str.split(str.replace(ingoreFileExtString, ' ', ''), ',')
        # Oracle DB Connect
        oracle_db_configuration = config['ORACLE_DB_CONNECT']
        oracle_port = oracle_db_configuration['Port']
        oracle_host = oracle_db_configuration['Host']
        oracle_service = oracle_db_configuration['Service']
        oracle_user = oracle_db_configuration['DB_User']
        oracle_pwd = oracle_db_configuration['DB_Password']
 
 
    # check if Directory exists and if the * is necessary
    # BUG ! if more then 1 then the * not match the documents on root level??
    # FIX IT!
    # Source
    if os.path.isdir(path_name):
        if path_name.endswith(os.path.sep):
            path_name += ("*" + os.path.sep) * recursiveLevel
            path_name += "*.*"
        else:
            path_name += os.path.sep
            path_name += ("*" + os.path.sep) * recursiveLevel
            path_name += "*.*"
    else:
        print("-- Error :: 05 Source Directory (-s) {0} not found".format(path_name))
        print("usage: {0}").format(usage_string)
        sys.exit(2)
 
    # connect to the database
    print("--" + 40 * "=")
    print("-- Info :: Oracle Client Library Version :: {0}".format(cx_Oracle.clientversion()))
    # get the connection to the database
    print("-- Info :: oracle_host {0} oracle_port {1} oracle_service {2} oracle_user {3} oracle_pwd ********".format(
        oracle_host, oracle_port, oracle_service, oracle_user, oracle_pwd))
    connection = cx_Oracle.connect(
        oracle_user + '/' + oracle_pwd + '@' + oracle_host + ':' + oracle_port + '/' + oracle_service)
    # Version der DB ausgeben
    print("-- Info :: Oracle Database Version :: {0}".format(connection.version))
    print("--" + 40 * "=")
 
    print("-- Info  :: Environment Settings :: Language :: {0} - Char Set ::{1}".format(locale.getdefaultlocale()[0],
                                                                                        locale.getdefaultlocale()[1]))
 
    print("--" + 40 * "=")
    print("-- Info  :: Read all files from {0}".format(path_name))
    print("-- Info  :: Copy files to       {0}".format(dest_name))
    print("-- Info  :: Not index this file types ::" + str(ingoreFileExt))
    print("--" + 40 * "=")
 
    fileCount = 0
    fileExistsCount = 0
    dirCount = 0
    dirPathList = []
    totalFileSize = 0
    # Get the list of all Files
    fileList = glob.glob(path_name)
 
    # remove Thumbs.db if exist from the list
    thumbsDBFile = "Thumbs.db"
    for file in fileList:
        if file.endswith(thumbsDBFile):
            fileList.remove(file)
 
    # Loop one read files in Import Directory
    for file in fileList:
        fileCount += 1
 
        # do the work
        try:
            # check if directoy
            if ntpath.isdir(file):
                print("-- Info :: found dirctory {0} ::".format(file))
                dirCount += 1
            else:
                # get only the filename without the path
                filename = ntpath.basename(file)
                # get directory
                fileDirectoryName = ntpath.dirname(file)
                # get Create date
 
                fileAccessDate = datetime.datetime.fromtimestamp(ntpath.getatime(file))
                fileModDate = datetime.datetime.fromtimestamp(ntpath.getmtime(file))
                fileCreateDate = datetime.datetime.fromtimestamp(ntpath.getctime(file))
                # get md5 hash
                md5checkSum = getMD5(file)
                # get File Type over file from external, Python Lib magic not working, error with magic file!
                # now I implement this stupid solution
                fileType = getFileType(file, " ", magicFilePath)
                # Call file with -i to get the mime Type
                fileMimeType = getFileType(file, "-i", magicFilePath)
                # get Extenstion
                fileExt = getFile_ext(filename)
                # getFileSize
                fileSize = os.path.getsize(file)
                # Remember for statistic
                setStatisticTotalSize(fileSize)
 
                # not add url files to the index
                # endswith(".url")
 
                if fileExt in ingoreFileExt:
                    # encode the output with UTF-8 to avoid errors with stange things in filenames
                    print("-- Info :: Not index this file types ::" + str(ingoreFileExt))
                    print("-- Info :: Not index this file::{0}".format(repr(filename.encode('utf-8'))))
                    print("-- Info :: Not index this Dir::{0}".format(repr(fileDirectoryName.encode('utf-8'))))
                    print("-- --")
                else:
                    # record
                    fileInfo = dict(filename=str(filename), filepath=str(fileDirectoryName), fileADate=fileAccessDate,
                                    fileMDate=fileModDate, fileCDate=fileCreateDate, md5=md5checkSum,
                                    fileBType=fileType,
                                    fileMtype=fileMimeType, fileExtention=fileExt, fileSize=fileSize)
 
                    ##print("-- Index this Dir::{0}".format(repr(fileDirectoryName.encode('utf-8'))))
                    # encode the output with UTF-8 to avoid errors with stange things in filenames
                    ##print("-- Index this file   ::{0}".format(repr(filename.encode('utf-8'))))
                    # write to DB
                    insertFileInfo(fileInfo, ora_dir_path, connection)
 
        except OSError as exception:
            if exception.errno != errno.EEXIST:
                print("-- Error :: Error read file :: see error {1}".format(file, sys.exc_info()[0]))
 
 
    # print statistics
    print("--" + 40 * "=")
    print("-- Finish with           :: {0} files in {1} new directories".format(fileCount, dirCount))
    print("-- The run needs         :: {0:5.4f} seconds".format(time.clock() - start_time))
    print("-- Read size             :: {0:5.3f} MB".format(totalFileSize / 1024 / 1024))
    print("--" + 40 * "=")
 
    # Close the DB Connection
    connection.close()
 
 
if __name__ == "__main__":
    main(sys.argv[1:]);

Config file:

dataLoader.conf
[DEFAULT]
# Path and name of the magic file program
MagicFile=d:\tools\file\bin\file.exe
 
# Do Not read files with this extension
# use , to separate the file extensions
ignoreFileExt=.cmd ,.iso ,.url,.exe,.class,.dbf
 
[ORACLE_DB_CONNECT]
Port = 1521
Host = 10.10.10.1
Service = GPI
DB_User= GPI
DB_Password = xxxxxx
Cookies helfen bei der Bereitstellung von Inhalten. Durch die Nutzung dieser Seiten erklären Sie sich damit einverstanden, dass Cookies auf Ihrem Rechner gespeichert werden. Weitere Information
"Autor: Gunther Pipperr"
prog/python_data_loader_oracle_bfile_text_index.txt · Zuletzt geändert: 2016/04/24 21:47 von Gunther Pippèrr