Benutzer-Werkzeuge

Webseiten-Werkzeuge

Action disabled: source

nosql:sqoop_erste_schritte

Erste Schritte mit Apache Sqoop 2 - v1.99

Mit Apache Sqoop (siehe ⇒ http://sqoop.apache.org/ ) können Daten zwischen Hadoop und einer Datenbank ausgetauscht werden.

Architektur:

Apache sqoop 1.99 Übersicht

In der Version 2 (v1.99) wird eine Client Server Architektur realisiert.

Installation Apache Sqoop

Voraussetzung:

Ablauf:

Ablauf:

# auspacken
 
cd  /opt/hadoop/product
 
wget http://mirror.netcologne.de/apache.org/sqoop/1.99.3/sqoop-1.99.3-bin-hadoop200.tar.gz
 
tar xfvz sqoop-1.99.3-bin-hadoop200.tar.gz
 
rm sqoop-1.99.3-bin-hadoop200.tar.gz
 
# Verlinken
 
ln -s sqoop-1.99.3-bin-hadoop200 sqoop
 
 
#Umgebung setzen
vi ~/.bashrc
 
..
#Sqoop
export SQOOP_HOME=/opt/hadoop/product/sqoop
export PATH=$SQOOP_HOME/bin:$PATH
..
 
. ~/.bashrc
 
 
#Copy Oracle JDBC Driver in das server Lib verzeichnis:
cp ~/ojdbc7.jar $SQOOP_HOME/server/lib/
 
#lib Verzeichnis anlegen
mkdir $SQOOP_HOME/lib
 
 
#1. test des Clients
 
sqoop.sh client
 
Sqoop home directory: /opt/hadoop/product/sqoop
Sqoop Shell: Type 'help' or '\h' for help.
 
sqoop:000> show version
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
sqoop:000>

Konfiguration Apache Sqoop

Für den Zugriff auf die Hadoop Lib in der Datei catalina.properties in $SQOOP_HOME/server/conf den common.loader Pfad anpassen, dazu ALLE hadoop jar Verzeichnisse mit in den Pfad aufnehmen!

vi $SQOOP_HOME/server/conf/catalina.properties
 
common.loader=${catalina.base}/lib,${catalina.base}/lib/*.jar,${catalina.home}/lib,${catalina.home}/lib/*.jar,${catalina.home}/../lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/common/*.jar,/opt/hadoop/product/hadoop/share/hadoop/common/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/hdfs/*.jar,/opt/hadoop/product/hadoop/share/hadoop/hdfs/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/mapreduce/*.jar,/opt/hadoop/product/hadoop/share/hadoop/mapreduce/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/tools/*.jar,/opt/hadoop/product/hadoop/share/hadoop/tools/lib/*.jar,/opt/hadoop/product/hadoop/share/hadoop/yarn/*.jar,/opt/hadoop/product/hadoop/share/hadoop/yarn/lib/*.jar

die Sqoop Server Konfigurieren befindet sich unter $SQOOP_HOME/server/conf in der Datei sqoop_bootstrap.properties und der Datei sqoop.properties.

In der sqoop.properties muss das Haddop Config Verzeichniss konfiguriert/angepasst werden:

vi $SQOOP_HOME/server/conf/sqoop.properties
 
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/product/hadoop/etc/hadoop/

Die Ports lassen sich mit der Datei $SQOOP_HOME/server/bin/setenv.sh über SQOOP_HTTP_PORT and SQOOP_ADMIN_PORT setzen, der Default ist 12000 and 12001.

Server starten:

$SQOOP_HOME/bin/sqoop.sh server start

Über die URL http://localhost:12000/sqoop/version läßt sich prüfen, ob der Server mit welcher Version läuft.

curl http://localhost:12000/sqoop/version

Server stoppen:

$SQOOP_HOME/bin/sqoop.sh server stop

Test des Servers und Beheben erster evlt. Fehler

Mit client am Server anmelden und testen:

 $SQOOP_HOME/bin/sqoop.sh client
 
Sqoop Shell: Type 'help' or '\h' for help.
 
# Am Server anmelden
sqoop:000> set server --host nosqldb01 --port 12000 --webapp sqoop
Server is set successfully
sqoop:000> show version
 
# Server abfragen
sqoop:000> show version --all
client version:
  Sqoop 1.99.3 revision 2404393160301df16a94716a3034e31b03e27b0b
  Compiled by mengweid on Fri Oct 18 14:15:53 EDT 2013
 
Exception has occurred during processing command
Exception: com.sun.jersey.api.client.UniformInterfaceException Message: GET http://nosqldb01:12000/sqoop/version returned a response status of 404 Not Found

Tritt dieser Fehler auf, liegt vermutlich ein Konfigurationsproblem vor.

Logfile unter $SQOOP_HOME/server/logs analyiseren:

vi localhost.2014-08-09.log
 
...
Caused by: org.apache.sqoop.common.SqoopException: MAPREDUCE_0002:Failure on submission engine initialization - Invalid Hadoop configuration directory (not a directory or permission issues): /etc/hadoop/conf/
 
...
 
Lösung:
 
cd $SQOOP_HOME/server/conf
 
vi sqoop.properties
 
# Hadoop configuration directory
org.apache.sqoop.submission.engine.mapreduce.configuration.directory=/opt/hadoop/product/hadoop/etc/hadoop/

Fehler beim Stoppen des Server:

Aug 09, 2014 2:44:41 PM org.apache.catalina.startup.ClassLoaderFactory validateFile
WARNING: Problem with directory [/opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/lib], exists: [false], isDirectory: [false], canRead: [false]
 
#Lösung 
mkdir /opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/lib

Repository Datenbank

In einer default Installation liegt die Repository Datenbank unter $SQOOP_HOME/@BASEDIR@/repository/db

Die ersten Daten zwischen der Datenbank und Hadoop kopieren

Es soll die Tabelle SYSTEM.AUD$ aus der Oracle Datenbank als Text Datei in das Hadoop Cluster importiert werden.

Ablauf:

  • Client starten und mit dem Sqoop Server verbinden
  • Connection Object anlegen
  • Job Object definieren
  • Job starten und Daten nach Hadoop importieren

Client starten und mit dem Sqoop Server

$SQOOP_HOME/bin/sqoop.sh client
 
 
sqoop:000> set server --host nosqldb01 --port 12000 --webapp sqoop

Connection Object anlegen

Für das Connection Object werden die folgenden Informationen benötigt:

  • Treibername - oracle.jdbc.driver.OracleDriver
  • JDBC URL - jdbc:oracle:thin:@//10.10.10.1:1521/gpi
  • Username und Password
sqoop:000> create connection --cid 1
Creating connection for connector with id 1
Please fill following values to create new connection object
Name: GPI DB
 
Connection configuration
 
JDBC Driver Class: oracle.jdbc.driver.OracleDriver
JDBC Connection String: jdbc:oracle:thin:@//10.10.10.1:1521/gpi
Username: system
Password: ******
JDBC Connection Properties:
There are currently 0 values in the map:
entry#

Security related configuration options
 
Max connections: 10
New connection was successfully created with validation status FINE and persistent id 1
 
sqoop:000> show connection --xid 1
1 connection(s) to show:
Connection with id 2 and name GPI DB (Enabled: true, Created by hadoop at 8/9/14 5:27 PM, Updated by hadoop at 8/9/14 5:27 PM)
Using Connector id 1
  Connection configuration
    JDBC Driver Class: oracle.jdbc.driver.OracleDriver
    JDBC Connection String: jdbc:oracle:thin:@//10.10.10.1:1521/gpi
    Username: system
    Password:
    JDBC Connection Properties:
  Security related configuration options
    Max connections: 10

Job Object definieren

Für das Job Object werden die folgenden Informationen benötigt:

  • Name und Schema der Tabelle
  • Speicherort im HDFS
    hdfs dfs -mkdir  /auditVault
    hdfs dfs -chmod 777  /auditVault
sqoop:000> create job --xid 1 --type import
Creating job for connection with id 1
Please fill following values to create new job object
Name: GPI AUD Import
 
Database configuration
 
Schema name: SYSTEM
Table name: AUD$
Table SQL:
Table column names:
Partition column name:USERID
Nulls in partition column:
Boundary query:
 
Output configuration
 
Storage type:
  0 : HDFS
Choose: 0
Output format:
  0 : TEXT_FILE
  1 : SEQUENCE_FILE
Choose: 1
Compression format:
  0 : NONE
  1 : DEFAULT
  2 : DEFLATE
  3 : GZIP
  4 : BZIP2
  5 : LZO
  6 : LZ4
  7 : SNAPPY
Choose: 0
Output directory: /auditVault/GPI_AUD
 
Throttling resources
 
Extractors:
Loaders:
 
New job was successfully created with validation status FINE  and persistent id 1
 
sqoop:000> show job
+----+----------------+--------+-----------+---------+
| Id |      Name      |  Type  | Connector | Enabled |
+----+----------------+--------+-----------+---------+
| 1  | GPI AUD Import | IMPORT | 1         | true    |
+----+----------------+--------+-----------+---------+

Job starten und Daten nach Hadoop importieren

sqoop:000> submission start --jid 1
Exception has occurred during processing command
Unknown command: No such property: start for class: groovysh_evaluate
 
 
????????
 
 
 
sqoop:000> start job --jid 1
Exception has occurred during processing command
Exception: org.apache.sqoop.common.SqoopException Message: CLIENT_0001:Server has returned exception
 
# Fehler meldung sichtbar machen:
sqoop:000> set option --name verbose --value true
sqoop:000> start job --jid 1
 
...
 
Caused by: Exception: java.lang.Throwable Message: GENERIC_JDBC_CONNECTOR_0005:No column is found to partition data
Stack trace:
 
...
sqoop:000>update job --jid 1
 
#alle alten werte bis auf 
Partition column name: TIMESTAMP#
#und
Output directory: /auditValut/gpi_aud
 
sqoop:000> start job --jid 1
2014-08-09 17:56:17 CEST: FAILURE_ON_SUBMIT
Exception: java.io.FileNotFoundException: File does not exist: hdfs://nosqldb01:9000/opt/hadoop/product/sqoop-1.99.3-bin-hadoop200/server/webapps/sqoop/WEB-INF/lib/sqoop-common-1.99.3.jar

Diese Anpassung an Hadoop Configuration löst dieser Verhalten:

vi $HADOOP_HOME/etc/hadoop/mapred-site.xml
 
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>

Nächster Test:

sqoop:000> start job --jid 1
2014-08-09 18:01:18 CEST: FAILURE_ON_SUBMIT
Exception: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create directory /tmp/hadoop-yarn/staging/hadoop/.staging. Name node is in safe mode.
The reported blocks 10 has reached the threshold 0.9990 of total blocks 10. The number of live datanodes 1 has reached the minimum number 0. In safe mode extension

# Lösung:

hdfs dfsadmin -safemode leave

Nächster Test:

sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://nosqldb01:12000/sqoop/
Created by: hadoop
Creation date: 2014-08-09 18:06:13 CEST
Lastly updated by: hadoop
External ID: job_1407600059752_0001
        http://nosqldb01:8088/proxy/application_1407600059752_0001/
2014-08-09 18:06:13 CEST: BOOTING  - Progress is not available
 
# was tut sich:
 
 
sqoop:000> start job --jid 1
Submission details
Job ID: 1
Server URL: http://nosqldb01:12000/sqoop/
Created by: hadoop
Creation date: 2014-08-11 21:17:42 CEST
Lastly updated by: hadoop
External ID: job_1407784150673_0001
        http://nosqldb01:8088/proxy/application_1407784150673_0001/
Connector schema: Schema{name=AUD$,columns=[
        Decimal{name=SESSIONID,nullable=null,precision=null,scale=null},
        Decimal{name=ENTRYID,nullable=null,precision=null,scale=null},
        Decimal{name=STATEMENT,nullable=null,precision=null,scale=null},
        Date{name=TIMESTAMP#,nullable=null,fraction=null,timezone=null},
        Text{name=USERID,nullable=null,size=null},
        Text{name=USERHOST,nullable=null,size=null},
        Text{name=TERMINAL,nullable=null,size=null},
        Decimal{name=ACTION#,nullable=null,precision=null,scale=null},
        Decimal{name=RETURNCODE,nullable=null,precision=null,scale=null},
        Text{name=OBJ$CREATOR,nullable=null,size=null},
        Text{name=OBJ$NAME,nullable=null,size=null},
        Text{name=AUTH$PRIVILEGES,nullable=null,size=null},
        Text{name=AUTH$GRANTEE,nullable=null,size=null},
        Text{name=NEW$OWNER,nullable=null,size=null},
        Text{name=NEW$NAME,nullable=null,size=null},
        Text{name=SES$ACTIONS,nullable=null,size=null},
        Decimal{name=SES$TID,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$LREAD,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$PREAD,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$LWRITE,nullable=null,precision=null,scale=null},
        Decimal{name=LOGOFF$DEAD,nullable=null,precision=null,scale=null},
        Date{name=LOGOFF$TIME,nullable=null,fraction=null,timezone=null},
        Text{name=COMMENT$TEXT,nullable=null,size=null},
        Text{name=CLIENTID,nullable=null,size=null},
        Text{name=SPARE1,nullable=null,size=null},
        Decimal{name=SPARE2,nullable=null,precision=null,scale=null},
        Binary{name=OBJ$LABEL,nullable=null,size=null},
        Binary{name=SES$LABEL,nullable=null,size=null},
        Decimal{name=PRIV$USED,nullable=null,precision=null,scale=null},
        Decimal{name=SESSIONCPU,nullable=null,precision=null,scale=null},
        Date{name=NTIMESTAMP#,nullable=null,fraction=null,timezone=null},
        Decimal{name=PROXY$SID,nullable=null,precision=null,scale=null},
        Text{name=USER$GUID,nullable=null,size=null},
        Decimal{name=INSTANCE#,nullable=null,precision=null,scale=null},
        Text{name=PROCESS#,nullable=null,size=null},
        Binary{name=XID,nullable=null,size=null},
        Text{name=AUDITID,nullable=null,size=null},
        Decimal{name=SCN,nullable=null,precision=null,scale=null},
        Decimal{name=DBID,nullable=null,precision=null,scale=null},
        FloatingPoint{name=SQLBIND,nullable=null,byteSize=null},
        FloatingPoint{name=SQLTEXT,nullable=null,byteSize=null},
        Text{name=OBJ$EDITION,nullable=null,size=null}]}
2014-08-11 21:17:42 CEST: BOOTING  - Progress is not available

Status in YARN:

yarn application -list
 
14/08/11 21:21:33 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/08/11 21:21:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
                Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
application_1407784150673_0001  Sqoop: GPI AUD Import              MAPREDUCE        hadoop         default                 RUNNING               UNDEFINED                  14%              http://nosqldb01:35645

Leider bricht der Ladevorgang am Ende immer ab:

Caused by: org.apache.sqoop.common.SqoopException: GENERIC_JDBC_CONNECTOR_0001:Unable to get a connection
	at org.apache.sqoop.connector.jdbc.GenericJdbcExecutor.<init>(GenericJdbcExecutor.java:51)
	at org.apache.sqoop.connector.jdbc.GenericJdbcImportExtractor.extract(GenericJdbcImportExtractor.java:42)
	at org.apache.sqoop.connector.jdbc.GenericJdbcImportExtractor.extract(GenericJdbcImportExtractor.java:31)
	at org.apache.sqoop.job.mr.SqoopMapper.run(SqoopMapper.java:96)
	... 7 more
Caused by: java.sql.SQLRecoverableException: No more data to read from socket

?? Suche wir weiter ??

Mehr Erfolg mit : Erste Schritte mit der Oracle Big Data Appliance

Einsatz von OraOop

Oraoop is a special plugin for sqoop that provides faster access to Oracle's RDBMS by using custom protocols that are not available publicly. Quest software partnered with Oracle to get those protocols, implemented them and created Oraoop.

See = > Quest:registered: Data Connector for Oracle and Hadoop

See : ⇒ http://blog.cloudera.com/wp-content/uploads/2011/01/oraoopuserguide-With-OraHive.pdf

??? Seit zwei Jahren nicht mehr gepflegt ????

Sqoop Client API

über die Client API läßt sich auch direkt auf den Server zugreifen um das zum Beispiel in eigene Programme zu integrieren.

http://sqoop.apache.org/docs/1.99.2/ClientAPI.html

Quellen

Cookies helfen bei der Bereitstellung von Inhalten. Durch die Nutzung dieser Seiten erklären Sie sich damit einverstanden, dass Cookies auf Ihrem Rechner gespeichert werden. Weitere Information
"Autor: Gunther Pipperr"
nosql/sqoop_erste_schritte.txt · Zuletzt geändert: 2014/11/17 11:41 von Gunther Pippèrr