I have been trying to import data from MySQL database to Hive using Sqoop utility. I got the table created and I have given the fetch-size as low as 10. Everytime I run the command, I am getting Java Heap Size Error and the job gets killed after 4 attempts. How can I fix this.
My sqoop command is as follows :
sqoop import --connect jdbc:mysql://my_local_ip/mydatabase --fetch-size 10 --username root -P --table table_name --hive-import --compression-codec=snappy --as-parquetfile -m 1
and I am getting :
16/08/29 07:06:24 INFO mapreduce.Job: The url to track the job: http://quickstart.cloudera:8088/proxy/application_1472465929944_0013/
16/08/29 07:06:24 INFO mapreduce.Job: Running job: job_1472465929944_0013
16/08/29 07:06:47 INFO mapreduce.Job: Job job_1472465929944_0013 running in uber mode : false
16/08/29 07:06:47 INFO mapreduce.Job: map 0% reduce 0%
16/08/29 07:07:16 INFO mapreduce.Job: Task Id : attempt_1472465929944_0013_m_000000_0, Status : FAILED
Error: Java heap space
16/08/29 07:07:37 INFO mapreduce.Job: Task Id : attempt_1472465929944_0013_m_000000_1, Status : FAILED
Error: Java heap space
16/08/29 07:07:59 INFO mapreduce.Job: Task Id : attempt_1472465929944_0013_m_000000_2, Status : FAILED
Error: Java heap space
16/08/29 07:08:21 INFO mapreduce.Job: map 100% reduce 0%
16/08/29 07:08:23 INFO mapreduce.Job: Job job_1472465929944_0013 failed with state FAILED due to: Task failed task_1472465929944_0013_m_000000
Try with
sqoop import -Dmapreduce.map.memory.mb=1024 -Dmapreduce.map.java.opts=-Xmx7200m -Dmapreduce.task.io.sort.mb=2400 --connect jdbc:mysql://local.ip/database_name --username root -P --hive-import --table table_name --as-parquetfile --warehouse-dir=/home/cloudera/hadoop --split-by 'id' -m 100
Initially, I have been using 10 mappers to process 10 million records and each chunk has a size of 1 million record. This was causing the error and as I fired 100 mapping jobs, it has processed the data successfully . The only thing I noticed is the time taken to complete the jobs. It has taken almost 1 hr to run all the 100 mapper jobs.
Related
this is the sqoop command which I am using to import data from SQL Server to Hive
sqoop-import-all-tables --connect "jdbc:sqlserver://ip.ip.ip.ip\MIGERATIONSERVER;port=1433;username=sa;password=blablaq;database=sqlserverdb" --create-hive-table --hive-import --hive-database hivemtdb
The problem is that sqlserverdb has about 100 tables but when i issue this command it is just importing 6 or 7 random tables to hive. This behavior is really strange for me. I am unable to find where I am doing mistake.
EDIT :1
Warning: /usr/hdp/2.4.3.0-227/accumulo does not exist! Accumulo imports will fail.
Please set $ACCUMULO_HOME to the root of your Accumulo installation.
16/10/13 13:17:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.6.2.4.3.0-227
16/10/13 13:17:38 INFO tool.BaseSqoopTool: Using Hive-specific delimiters for output. You can override
16/10/13 13:17:38 INFO tool.BaseSqoopTool: delimiters with --fields-terminated-by, etc.
16/10/13 13:17:38 INFO manager.SqlManager: Using default fetchSize of 1000
16/10/13 13:17:38 INFO tool.CodeGenTool: Beginning code generation
16/10/13 13:17:38 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [UserMessage] AS t WHERE 1=0
16/10/13 13:17:38 INFO orm.CompilationManager: HADOOP_MAPRED_HOME is /usr/hdp/2.4.3.0-227/hadoop-mapreduce
Note: /tmp/sqoop-sherry/compile/c809ee201c0aec1edf2ed5a1ef4aed4c/UserMessage.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
16/10/13 13:17:39 INFO orm.CompilationManager: Writing jar file: /tmp/sqoop-sherry/compile/c809ee201c0aec1edf2ed5a1ef4aed4c/UserMessage.jar
16/10/13 13:17:39 INFO mapreduce.ImportJobBase: Beginning import of UserMessage
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/hdp/2.4.3.0-227/zookeeper/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/10/13 13:17:40 INFO impl.TimelineClientImpl: Timeline service address: http://machine-02-xx:8188/ws/v1/timeline/
16/10/13 13:17:40 INFO client.RMProxy: Connecting to ResourceManager at machine-02-xx/xxx.xx.xx.xx:8050
16/10/13 13:17:42 INFO db.DBInputFormat: Using read commited transaction isolation
16/10/13 13:17:42 INFO mapreduce.JobSubmitter: number of splits:1
16/10/13 13:17:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1475746531098_0317
16/10/13 13:17:43 INFO impl.YarnClientImpl: Submitted application application_1475746531098_0317
16/10/13 13:17:43 INFO mapreduce.Job: The url to track the job: http://machine-02-xx:8088/proxy/application_1475746531098_0317/
16/10/13 13:17:43 INFO mapreduce.Job: Running job: job_1475746531098_0317
16/10/13 13:17:48 INFO mapreduce.Job: Job job_1475746531098_0317 running in uber mode : false
16/10/13 13:17:48 INFO mapreduce.Job: map 0% reduce 0%
16/10/13 13:17:52 INFO mapreduce.Job: map 100% reduce 0%
16/10/13 13:17:52 INFO mapreduce.Job: Job job_1475746531098_0317 completed successfully
16/10/13 13:17:52 INFO mapreduce.Job: Counters: 30
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=156179
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=87
HDFS: Number of bytes written=0
HDFS: Number of read operations=4
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Other local map tasks=1
Total time spent by all maps in occupied slots (ms)=3486
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1743
Total vcore-seconds taken by all map tasks=1743
Total megabyte-seconds taken by all map tasks=2677248
Map-Reduce Framework
Map input records=0
Map output records=0
Input split bytes=87
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=30
CPU time spent (ms)=980
Physical memory (bytes) snapshot=233308160
Virtual memory (bytes) snapshot=3031945216
Total committed heap usage (bytes)=180879360
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=0
16/10/13 13:17:52 INFO mapreduce.ImportJobBase: Transferred 0 bytes in 12.6069 seconds (0 bytes/sec)
16/10/13 13:17:52 INFO mapreduce.ImportJobBase: Retrieved 0 records.
16/10/13 13:17:52 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM [UserMessage] AS t WHERE 1=0
16/10/13 13:17:52 WARN hive.TableDefWriter: Column SendDate had to be cast to a less precise type in Hive
16/10/13 13:17:52 INFO hive.HiveImport: Loading uploaded data into Hive
Logging initialized using configuration in jar:file:/usr/hdp/2.4.3.0-227/hive/lib/hive-common-1.2.1000.2.4.3.0-227.jar!/hive-log4j.properties
OK
Time taken: 1.286 seconds
Loading data to table sqlcmc.usermessage
Table sqlcmc.usermessage stats: [numFiles=1, totalSize=0]
OK
Time taken: 0.881 seconds
Note: /tmp/sqoop-sherry/compile/c809ee201c0aec1edf2ed5a1ef4aed4c/DadChMasConDig.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Logging initialized using configuration in jar:file:/usr/hdp/2.4.3.0-227/hive/lib/hive-common-1.2.1000.2.4.3.0-227.jar!/hive-log4j.properties
OK
First of all import-all-tables will run import table for all the tables.
If you does not define, number of mapper in the job, Sqoop will pick by default 4 mappers. So, it needs table to have primary key or you specify --split-by column name.
If this is the case, you will see error like:
ERROR tool.ImportAllTablesTool: Error during import: No primary key could be found for table test. Please specify one with --split-by or perform a sequential import with '-m 1'.
So you can use 1 mapper which will make your import process slow.
Better way is to add --autoreset-to-one-mapper, it will import tables with primary key with the number of mappers mentioned in the command and it will automatically use 1 mapper for the tables without primary key.
Coming to your problem,
sqoop import failed for table DadChMasConDig.
I don't know why it is not logged on console.
In importing this table there could be exception like
Encountered IOException running import job: java.io.IOException: Hive does not support the SQL type for column <somecolumn>
For example, varbinary is not supported.
If you import data only in HDFS, it should not be a problem. You can try:
sqoop-import-all-tables --connect "jdbc:sqlserver://ip.ip.ip.ip\MIGERATIONSERVER;port=1433;username=sa;password=blablaq;database=sqlserverdb"
I had the same issue and the following worked for me. Though typically --create-hive-table and --hive-overwrite don't go together and doesn't make sense together. But no other combination worked and each time only 3 of 10 or a fraction of tables were getting imported
sqoop import-all-tables \
--connect jdbc:mysql://<mysql-url>/my_database \
--username sql_user \
--password sql_pwd \
--hive-import \
--hive-database test_hive \
--hive-overwrite \
--create-hive-table \
--warehouse-dir /apps/hive/warehouse/test_hive.db \
-m 1
I am new in the hadoop world. I have setup a hadoop cluster in Linux and I try to run a mapreduce job from windows.
I wordcount example work find, but when I try to run another job that I wrote myself, I have this error:
16/07/05 16:04:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/05 16:04:42 INFO client.RMProxy: Connecting to ResourceManager at /myIP:8050
16/07/05 16:04:42 WARN mapreduce.JobResourceUploader: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/05 16:04:42 WARN mapreduce.JobResourceUploader: No job jar file set. User classes may not be found. See Job or Job#setJar(String).
16/07/05 16:04:42 INFO input.FileInputFormat: Total input paths to process : 3
16/07/05 16:04:42 INFO mapreduce.JobSubmitter: number of splits:3
16/07/05 16:04:43 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1467727275250_0002
16/07/05 16:04:43 INFO mapred.YARNRunner: Job jar is not present. Not adding any jar to the list of resources.
16/07/05 16:04:43 INFO impl.YarnClientImpl: Submitted application application_1467727275250_0002
16/07/05 16:04:43 INFO mapreduce.Job: The url to track the job: http://hadoopMaster:8088/proxy/application_1467727275250_0002/
16/07/05 16:04:43 INFO mapreduce.Job: Running job: job_1467727275250_0002
16/07/05 16:04:46 INFO mapreduce.Job: Job job_1467727275250_0002 running in uber mode : false
16/07/05 16:04:46 INFO mapreduce.Job: map 0% reduce 0%
16/07/05 16:04:46 INFO mapreduce.Job: Job job_1467727275250_0002 failed with state FAILED due to: Application application_1467727275250_0002 failed 2 times due to AM Container for appattempt_1467727275250_0002_000002 exited with exitCode: 1
For more detailed output, check application tracking page:http://hadoopMaster:8088/cluster/app/application_1467727275250_0002Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1467727275250_0002_02_000001
Exit code: 1
Exception message: /bin/bash: Zeile 0: fg: Keine Job-Steuerung in dieser Shell.
Stack trace: ExitCodeException exitCode=1: /bin/bash: line 0: fg: no Job control in this Shell.
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
16/07/05 16:04:46 INFO mapreduce.Job: Counters: 0
Can someone help me please?
mapred-site.xml add property
<property>
<name>mapred.remote.os</name>
<value>Linux</value>
<description>Remote MapReduce framework's OS, can be either Linux or Windows</description>
</property>
<property>
<name>mapreduce.app-submission.cross-platform</name>
<value>true</value>
</property>
I have installed Hadoop 2.7.1 stable version. I followed Tom White's book for installation in Pseudodistributed mode. I did set all environment variables like JAVA_HOME, HADOOP_HOME, PATH etc.. I configured yarn-site.xml, hdfs-site.xml, core-site.xml, mapred-site.xml.
I copied the sample file file.txt using following command.
$hadoop fs -copyFromLocal textFiles/file.txt file.txt
which shows me
Found 2 items
-rw-r--r-- 1 RAMA supergroup 3737 2015-12-27 21:52 file.txt
drwxr-xr-x - RAMA supergroup 0 2015-12-27 22:17 input
When I am executing the wordcount program in hadoop-mapreduce-examples-2.7.1.jar using below command
RAMAs-MacBook-Pro:hadoop-2.7.1 RAMA$ hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount file.txt output
it is throwing me following exception, for which I am not able to find any feasible solution.
15/12/27 22:41:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
15/12/27 22:41:53 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
15/12/27 22:41:53 INFO input.FileInputFormat: Total input paths to process : 1
15/12/27 22:41:53 INFO mapreduce.JobSubmitter: number of splits:1
15/12/27 22:41:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1451216397139_0020
15/12/27 22:41:54 INFO impl.YarnClientImpl: Submitted application application_1451216397139_0020
15/12/27 22:41:54 INFO mapreduce.Job: The url to track the job: http://192.168.1.6:8088/proxy/application_1451216397139_0020/
15/12/27 22:41:54 INFO mapreduce.Job: Running job: job_1451216397139_0020
15/12/27 22:41:57 INFO mapreduce.Job: Job job_1451216397139_0020 running in uber mode : false
15/12/27 22:41:57 INFO mapreduce.Job: map 0% reduce 0%
15/12/27 22:41:57 INFO mapreduce.Job: Job job_1451216397139_0020 failed with state FAILED due to: Application application_1451216397139_0020 failed 2 times due to AM Container for appattempt_1451216397139_0020_000002 exited with exitCode: 127
For more detailed output, check application tracking page:http://192.168.1.6:8088/cluster/app/application_1451216397139_0020Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
Container id: container_1451216397139_0020_02_000001
Exit code: 127
Stack trace: ExitCodeException exitCode=127:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 127
Failing this attempt. Failing the application.
15/12/27 22:41:57 INFO mapreduce.Job: Counters: 0
Any suggestions/comments is of great help...
In hadoop-env.sh, explicitly add the java home path in JAVA_HOME variable.
I am trying to run wordcount example already provided in hadoop on the following env: (Pseudodistributed mode)
Windows 7
Hadoop 2.7.1
JDK 1.7.x
RAM 4 GB
The jps command returns
C:\deploy\hadoop-2.7.1>jps
2336 ResourceManager
7500 NameNode
4984 Jps
6900 NodeManager
4940 DataNode
The command I use for setting the hadoop heap size
set HADOOP_HEAPSIZE=512
The command I use from the hadoop home installation directory is
bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.1.jar wordcount /input /output
I see the following stack trace
C:\deploy\hadoop-2.7.1>bin\yarn jar share/hadoop/mapreduce/hadoop-mapreduce-exam
ples-2.7.1.jar wordcount /input /output
15/08/14 22:36:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0
:8032
15/08/14 22:36:27 INFO input.FileInputFormat: Total input paths to process : 1
15/08/14 22:36:28 INFO mapreduce.JobSubmitter: number of splits:1
15/08/14 22:36:28 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_14
39571873038_0001
15/08/14 22:36:28 INFO impl.YarnClientImpl: Submitted application application_14
39571873038_0001
15/08/14 22:36:28 INFO mapreduce.Job: The url to track the job: http://XXX-PC
:8088/proxy/application_1439571873038_0001/
15/08/14 22:36:28 INFO mapreduce.Job: Running job: job_1439571873038_0001
15/08/14 22:36:37 INFO mapreduce.Job: Job job_1439571873038_0001 running in uber
mode : false
15/08/14 22:36:37 INFO mapreduce.Job: map 0% reduce 0%
15/08/14 22:36:37 INFO mapreduce.Job: Job job_1439571873038_0001 failed with sta
te FAILED due to: Application application_1439571873038_0001 failed 2 times due
to AM Container for appattempt_1439571873038_0001_000002 exited with exitCode:
1
For more detailed output, check application tracking page:http://XXX-PC:8088/
cluster/app/application_1439571873038_0001Then, click on links to logs of each a
ttempt.
Diagnostics: Exception from container-launch.
Container id: container_1439571873038_0001_02_000001
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:
722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.la
unchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.C
ontainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.C
ontainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
.java:615)
at java.lang.Thread.run(Thread.java:744)
Shell output: 1 file(s) moved.
Container exited with a non-zero exit code 1
Failing this attempt. Failing the application.
15/08/14 22:36:37 INFO mapreduce.Job: Counters: 0
When I went to the Stderr logs as mentioned in the above stack trace, the actual error came out to be
Error: Could not create the Java Virtual Machine.
When I try to increase HADOOP_HEAPSIZE to 1024, the namenode, datanode and yarn daemons do not start at all and give me the same error of could not create java virtual machine
Has someone got the same problem ? How to solve this issue ?
You can solve this problem by editting the file /etc/hadoop/hadoop-env.sh like the one below:
export HADOOP_HEAPSIZE=3072
I use YARN on hadoop 2.6.0. When I ran an mapreduce job, I got an error like this :
15/03/12 22:22:59 INFO mapreduce.Job: Task Id : attempt_1426132548565_0003_m_000002_1, Status : FAILED
Error: Java heap space
15/03/12 22:22:59 INFO mapreduce.Job: Task Id : attempt_1426132548565_0003_m_000000_1, Status : FAILED
Error: Java heap space
15/03/12 22:23:20 INFO mapreduce.Job: Task Id : attempt_1426132548565_0003_m_000002_2, Status : FAILED
Error: Java heap space
Container killed by the ApplicationMaster.
Am I false to configure the java.opts propery. Is that error because of that configuration? Are there any connection between memory settings on yarn-site and mapred-site?
I very confused, I need your suggest all Thanks
When the container exceeds the memory/CPU usage, it is killed by the application master. In your case, the mappers might be using excess memory. Try adding the following configurations:
In mapred-site.xml:
<property>
<name> mapreduce.map.memory.mb </name>
<value>1024</value>
<description>Enter The amount of memory to request from the scheduler for each map task. </description>
</property>
The default value is 1024, try increasing it to 2048.
I would suggest you to restart the cluster after changing the configurations