Mahout : Cannot convert into sequence file - java

I'm trying to convert some text files into mahout sequence files. So I do
mahout seqdirectory -i inputFolder -o outputFolder
But I always get this exception
java.lang.Exception: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:403)
Caused by: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:164)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.<init>(CombineFileRecordReader.java:126)
at org.apache.mahout.text.MultipleTextFileInputFormat.createRecordReader(MultipleTextFileInputFormat.java:43)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:491)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:734)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.java:155)
... 11 more
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but class was expected
at org.apache.mahout.text.WholeFileRecordReader.<init>(WholeFileRecordReader.java:52)
... 16 more
I'm running Mahout 0.8 on Hadoop 2.2.0
Any ideas ?

The previous answers are incorrect. Mahout 0.8 had a MapReduce version of seqdirectory which was a new feature. A bug in the MR version was causing the exception you are seeing.
To execute seqdirectory with Mahout 0.8, please use the sequential version by specifying the -xm sequential option to your command line.
mahout seqdirectory -i inputFolder -o outputFolder -xm sequential
By default seqdirectory executes the MR version if none specified.
This issue has since been fixed in Mahout 0.9.

As I read somewhere that mahout 0.8 works with hadoop 1.2. I only downloaded mahout (it uses hadoop jar from lib/hadoop )

Related

How do I troubleshoot the installation of Apache Accumulo on Linux?

I am trying to install open source Accumulo on RHEL 7.x. I have two GB of swap space. I have installed Java 1.8, Hadoop 3, and Zookeeper. I have run the bootstrap_config.sh script for Accumulo 1.9.2.
I ran this (and expected it to work):
/bin/accumulo-1.9.2/bin/accumulo init
But I get this error:
[start.Main] ERROR: Uncaught exception
java.util.ServiceConfigurationError:
org.apache.accumulo.start.spi.KeywordExecutable: Provider
org.apache.accumulo.proxy.Proxy could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.accumulo.start.Main.checkDuplicates(Main.java:237)
at org.apache.accumulo.start.Main.getExecutables(Main.java:228)
at org.apache.accumulo.start.Main.main(Main.java:84) Caused by: java.lang.NoClassDefFoundError:
org/apache/commons/configuration/Configuration
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2671)
at java.lang.Class.getConstructor0(Class.java:3075)
at java.lang.Class.newInstance(Class.java:412)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 5 more Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.accumulo.start.classloader.AccumuloClassLoader$2.loadClass(AccumuloClassLoader.java:294)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 10 more
I used the Accumulo bootstrap_config.sh script to configure Hadoop version 3. How do I get "/bin/accumulo-1.9.2/bin/accumulo init" to work?
Accumulo 1.9.2 expects Hadoop 2 out of the box, but does have a build profile to rebuild a tarball specifically for use with Hadoop 3. You can build Accumulo with the Hadoop 3 profile by downloading the source tarball and doing:
mvn clean package -Dhadoop.profile=3 -DskipTests
If you're not interested in rebuilding from source, it may be possible to simply fix the class path issues by reading the error message, and adjusting your class path accordingly. In this case, it seems you're missing a commons-configuration jar.

Kafka Structured Streaming KafkaSourceProvider could not be instantiated

I am working on a streaming project where I have a kafka stream of ping statistics like so :
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=1 ttl=62 time=0.913 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=2 ttl=62 time=0.936 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=3 ttl=62 time=0.980 ms
64 bytes from vas.fractalanalytics.com (192.168.30.26): icmp_seq=4 ttl=62 time=0.889 ms
I am trying to read this as a structured stream in pyspark. I start pyspark with the following command :
pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.0
Pyspark version is 2.4, python version is 2.7 (tried with 3.6 as well)
And I get an error as soon as I send this piece of code (followed from Structured Streaming + Kafka Integration Guide):
df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "172.18.2.21:2181").option("subscribe", "ping-stats").load()
I run into the following error :
py4j.protocol.Py4JJavaError: An error occurred while calling o37.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.kafka010.KafkaSourceProvider could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:247)
at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
at scala.collection.AbstractTraversable.filter(Traversable.scala:104)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:161)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoSuchMethodError: org.apache.spark.internal.Logging.$init$(Lorg/apache/spark/internal/Logging;)V
at org.apache.spark.sql.kafka010.KafkaSourceProvider.<init>(KafkaSourceProvider.scala:44)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 23 more
Can someone help me out with this?
I managed to solve this by ensuring that the spark-sql-kafka package's version matches the spark version.
In my case, I am now using --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.1 for my spark version 2.4.1, thereafter the .format("kafka") part of the code can be resolved.
Also, v2.12 of the package (i.e., org.apache.spark:spark-sql-kafka-0-10_2.12:2.4.1) does not seem stable at the time of writing, and using it will also cause the above error.
*EDIT: v2.12 spark-sql-kafka packages seem to only work with Spark built with Scala v2.12. Hence, for Spark v2.X versions (pre-built with Scala v2.11 by default), there's a need to instead use Spark binaries built with Scala v2.12 (e.g. spark-2.4.1-bin-without-hadoop-scala-2.12.tgz) if you really want to use spark-sql-kafka v2.12 package. For Spark v3.X, they are pre-built with Scala v2.12 by default, hence you'll only see/use v2.12 of the package.
Issue solved when used the Spark version 2.3.2 with Scala version 2.11.11 and dependency org.apache.spark:sl-kafka-0-10_2.10:2.0.2
#spark-shell -i Process.scala --master local[2] --packages org.apache.spark:sl-kafka-0-10_2.10:2.0.2 ...

Starting HBASE, java.lang.ClassNotFoundException: org.apache.htrace.SamplerBuilder

I am trying to start HBASE with start-hbase.sh, however, I get the error: java.lang.ClassNotFoundException: org.apache.htrace.SamplerBuilder.
I have tried to add various .jar's to various folders (as suggested in other threads) but nothing works. I am using Hadoop 3.11 and HBase 2.10 Here is the (end of the) error log.
java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster.
at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2972)
at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:236)
at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:140)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:149)
at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:2983)
Caused by: java.lang.NoClassDefFoundError: org/apache/htrace/SamplerBuilder
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:635)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:94)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2703)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2685)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:373)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
at org.apache.hadoop.hbase.util.CommonFSUtils.getRootDir(CommonFSUtils.java:358)
at org.apache.hadoop.hbase.util.CommonFSUtils.isValidWALRootDir(CommonFSUtils.java:407)
at org.apache.hadoop.hbase.util.CommonFSUtils.getWALRootDir(CommonFSUtils.java:383)
at org.apache.hadoop.hbase.regionserver.HRegionServer.initializeFileSystem(HRegionServer.java:691)
at org.apache.hadoop.hbase.regionserver.HRegionServer.<init>(HRegionServer.java:600)
at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:484)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:488)
at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:2965)
... 5 more
Caused by: java.lang.ClassNotFoundException: org.apache.htrace.SamplerBuilder
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:190)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
... 25 more
HBase 2.1.0 release uses HTrace, that is an incubating Apache Foundation project.
There is a folder for 3rd-party libraries in HBase lib folder, client-facing-thirdparty. You need to copy htrace-core-3.1.0-incubating.jar from there to the HBase lib directory. (see reference)
There is also another solution at Cloudera Community that changes a configuration instead of adding the library manually.
There seems to have been a compatibility issue between Hbase and Hadoop, I reverted to using Hadoop 2.9.1 and Hbase 1.2.6 together with JDK 1.8.0
HBase 2.1.4 has no htrace-core-3.1.0-incubating.jar in client-facing-thirdparty
You need copy from 2.1.3 or download from https://mvnrepository.com/artifact/org.apache.htrace/htrace-core

Cannot setup Apache Spark 2.1.1 on Windows 10

I have installed Apache Spark 2.1.1 on Windows 10, with Java 1.8 and Python version 3.6 Anaconda 4.3.1. I have also downloaded the winutils.exe and setup environment avriables for JAVA_HOME, HADOOP_HOME and SPARK_HOME as well as updated the path variable. I have also run winutils.exe chmod -R 777 \tmp\hive. But I am getting the below error when running pyspark in cmd prompt.
Please please can someone help, let me know if I missed out any important detail
Thanks in advance!
c:\Spark>bin\pyspark
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "c:\Spark\python\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
File "c:\Spark\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: **An error occurred while calling o22.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':**
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
I still get errors when launching [spark-shell], but it looks like Spark launches since I get the 'Welcome to Spark' piece. The error I get is
C:\Spark>bin\spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/06/23 12:20:15 WARN General: Plugin (Bundle) "org.datanucleus.api.jdo" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/jars/datanucleus-api-jdo-3.2.6.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../jars/datanucleus-api-jdo-3.2.6.jar."
17/06/23 12:20:15 WARN General: Plugin (Bundle) "org.datanucleus.store.rdbms" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/jars/datanucleus-rdbms-3.2.9.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/bin/../jars/datanucleus-rdbms-3.2.9.jar."
17/06/23 12:20:15 WARN General: Plugin (Bundle) "org.datanucleus" is already registered. Ensure you dont have multiple JAR versions of the same plugin in the classpath. The URL "file:/C:/Spark/bin/../jars/datanucleus-core-3.2.10.jar" is already registered, and you are trying to register an identical plugin located at URL "file:/C:/Spark/jars/datanucleus-core-3.2.10.jar."
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$getOrCreate$5.apply(SparkSession.scala:878)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:878)
at org.apache.spark.repl.Main$.createSparkSession(Main.scala:96)
... 47 elided
Caused by: java.lang.reflect.InvocationTargetException:
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 58 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveExternalCatalog':
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:86)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession$$anonfun$sharedState$1.apply(SparkSession.scala:101)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:101)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:100)
at org.apache.spark.sql.internal.SessionState.<init>(SessionState.scala:157)
at org.apache.spark.sql.hive.HiveSessionState.<init>(HiveSessionState.scala:32)
... 63 more
Caused by: java.lang.reflect.InvocationTargetException: java.lang.reflect.InvocationTargetException: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:166)
... 71 more
Caused by: java.lang.reflect.InvocationTargetException:
java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:264)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:358)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
at org.apache.spark.sql.hive.HiveExternalCatalog.<init>(HiveExternalCatalog.scala:66)
... 76 more
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Ljava/lang/String;I)V
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode0(Native Method)
at org.apache.hadoop.io.nativeio.NativeIO$Windows.createDirectoryWithMode(NativeIO.java:524)
at org.apache.hadoop.fs.RawLocalFileSystem.mkOneDirWithMode(RawLocalFileSystem.java:478)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirsWithOptionalPermission(RawLocalFileSystem.java:532)
at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.java:509)
at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:305)
at org.apache.hadoop.hive.ql.session.SessionState.createPath(SessionState.java:639)
at org.apache.hadoop.hive.ql.session.SessionState.createSessionDirs(SessionState.java:561)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:508)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:188)
... 84 more
14: error: not found: value spark
import spark.implicits._
^
14: error: not found: value spark
import spark.sql
^
Welcome to
Setup that worked for me is as follows "i didn't use winutils.exe":-
install pyspark and findspark using "Anaconda Command Prompt" as
pip3 install pyspark
and
pip3 install findspark
as you have already downloaded spark setup. unzip it and keep it in "C" drive i.e. "C:\spark-2.2.0-bin-hadoop2.7" and create new environment variable "SPARK_HOME" and set it to "C:\spark-2.2.0-bin-hadoop2.7\bin" and open "path" variable in system variables and add the same there as well.
now open your command prompt and come from "C:\User*" to "C:\" by doing cd.. twice and run following command
set SPARK_HOME='spark-2.2.0-bin-hadoop2.7'
and you are good to go.
Now you just have to import the file location before importing pyspark in your jupyter notebook. use following code:-
import findspark
findspark.init('C:\spark-2.2.0-bin-hadoop2.7')
import pyspark

Troubles running Mahout 0.9 text-processing examples

TL/DR : are mahout 0.9 examples compatible with hadoop 2.4 ?
My problem:
I would like to classify a bunch of documents using Mahout 0.9. To do so, I'm following the example described here.
I'm on windows and trying to go full native (ie, no cygwin). I already dispose of a local hadoop 2.4.1 cluster.
I downloaded the mahout sources and compiled it according to the wiki :
mvn "-Dhadoop2.version=2.4.1" -DskipTests clean install
I then tried to execute the example with the following example :
hadoop jar $Env:mahout_home/examples/target/mahout-examples-0.9-job.jar org.apache.mahout.driver.MahoutDriver seqdirectory -i Decomposition -o output
It all seems to work : I'm getting logs showing the mapreduce job begins to run. However, I quickly get the following errors :
Error: java.lang.RuntimeException: java.lang.reflect.InvocationTargetException
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.ja
va:166)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.<init>(CombineFileRecordReader.java:126)
at org.apache.mahout.text.MultipleTextFileInputFormat.createRecordReader(MultipleTextFileInputFormat.java:43)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.<init>(MapTask.java:492)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:735)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1556)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:162)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.mapreduce.lib.input.CombineFileRecordReader.initNextRecordReader(CombineFileRecordReader.ja
va:157)
... 10 more
Caused by: java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.TaskAttemptContext, but c
lass was expected
at org.apache.mahout.text.WholeFileRecordReader.<init>(WholeFileRecordReader.java:59)
... 15 more
According to the various links I found, it seems to come from code intented for Hadoop 1.0.
Am I missing something, or are mahout provided examples not suited for a Hadoop 2.4 cluster ?
The problem is because the object TaskAttemptContext is a Interface in the version 2.4 of Hadoop and the job was expecting a class (version 1.1.2 Hadoop). TaskAttemptContext has been changed in version 2.0.

Categories

Resources