IndexOutOfBoundsException in Mahout - java

I'm trying to run the Mahout SGD classifier on a CSV file, and I'm getting this error:
[vineet@localhost bin]$ ./mahout trainlogistic --input ./filtered.csv --output model --target target --categories 33 \
--features 200 --passes 10 --predictors subject --types text --rate 50
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 6, Size: 4
at java.util.ArrayList.rangeCheck(ArrayList.java:604)
at java.util.ArrayList.get(ArrayList.java:382)
at org.apache.mahout.classifier.sgd.CsvRecordFactory.processLine(CsvRecordFactory.java:245)
at org.apache.mahout.classifier.sgd.TrainLogistic.mainToOutput(TrainLogistic.java:85)
at org.apache.mahout.classifier.sgd.TrainLogistic.main(TrainLogistic.java:65)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:601)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
The CSV file contains Unicode text and large text fields enclosed in quote characters.
I've tried the classifier on the sample donut.csv, and it works fine.
I also tried changing my header row to something like "id","subject","field2", etc., but it still doesn't work.
What am I doing wrong?

Some lines may be dirty: they only have 4 fields instead of the expected 6 (hence "Index: 6, Size: 4"). Check your data again, or try feeding in only one line of data to validate my guess.
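If you want to confirm that quickly, you can count the fields on each line before handing the file to Mahout. Here is a minimal sketch (the file name and the expected count of 6 are assumptions based on your error); note that it splits naively on commas, so quoted fields that themselves contain commas will also get flagged, and those are exactly the lines most likely to trip up the CSV parsing:
import java.io.BufferedReader;
import java.io.FileReader;

public class CsvFieldCountCheck {
    public static void main(String[] args) throws Exception {
        int expected = 6; // however many columns your header row declares
        int lineNo = 0;
        try (BufferedReader in = new BufferedReader(new FileReader("filtered.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                lineNo++;
                // Naive split: does NOT honor quoted fields, so a quoted
                // subject containing commas will show up here too.
                int fields = line.split(",", -1).length;
                if (fields != expected) {
                    System.out.println("line " + lineNo + ": " + fields + " fields");
                }
            }
        }
    }
}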

Related

Apache beam wildcard recursive search for files

I am using Spotify's Scio library for writing Apache Beam pipelines in Scala. I want to search for files under a directory recursively, on a filesystem which can be HDFS, Alluxio, or GCS; for example, *.jar should find all matching files under the provided directory and its sub-directories.
The Apache Beam SDK provides the org.apache.beam.sdk.io.FileIO class for this purpose, with which I can find files at one directory level using pipeline.apply(FileIO.match().filepattern(filesPattern)).
How can I make it recursive, to search for all files matching the provided pattern?
Currently I am trying another approach: I create a resourceId from the provided pattern, get the current directory of that pattern, and then try to resolve all sub-directories in the current directory using the resourceId.resolve() method. But it throws an exception.
val currentDir = FileSystems.matchNewResource(filesPattern, false).getCurrentDirectory
val childDir = currentDir.resolve("{#literal *}", StandardResolveOptions.RESOLVE_DIRECTORY)
For currentDir.resolve I am getting the following exception:
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error.
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:546)
at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:421)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:427)
at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:813)
at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:287)
at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:213)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1050)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: java.lang.IllegalArgumentException: Illegal character in path at index 0: {#literal *}/
at java.net.URI.create(URI.java:852)
at java.net.URI.resolve(URI.java:1036)
at org.apache.beam.sdk.io.hdfs.HadoopResourceId.resolve(HadoopResourceId.java:46)
at com.sparkcognition.foundation.ingest.jobs.copyjob.FileOperations$.findFiles(BinaryFilesSink.scala:110)
at com.sparkcognition.foundation.ingest.jobs.copyjob.BinaryFilesSink$.main(BinaryFilesSink.scala:39)
at com.sparkcognition.foundation.ingest.jobs.copyjob.BinaryFilesSink.main(BinaryFilesSink.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:529)
... 12 more
Caused by: java.net.URISyntaxException: Illegal character in path at index 0: {#literal *}/
at java.net.URI$Parser.fail(URI.java:2848)
at java.net.URI$Parser.checkChars(URI.java:3021)
at java.net.URI$Parser.parseHierarchical(URI.java:3105)
at java.net.URI$Parser.parse(URI.java:3063)
at java.net.URI.<init>(URI.java:588)
at java.net.URI.create(URI.java:850)
... 22 more
What is the right way to search for files recursively using Apache Beam?
References:
https://beam.apache.org/releases/javadoc/2.11.0/index.html?org/apache/beam/sdk/io/fs/ResourceId.html
It looks like you have some code copied from some faulty javadoc. Some old versions of the example code were published with errors around the asterisks.
To find all files inside currentDir:
val childDir = currentDir.resolve("**", StandardResolveOptions.RESOLVE_FILES)

Eclipse. Saving workspace takes forever

My problem is that after closing Eclipse, "saving workspace" didn't finish even after two hours, and I can't do anything in Eclipse at all. I checked
C:\..\workspace\.metadata\.log and the error is:
!ENTRY org.eclipse.jdt.ui 4 10001 2016-12-25 12:51:13.037
!MESSAGE Internal Error
!STACK 1
Java Model Exception: Core Exception [code 3] Some characters cannot be mapped using "Cp1252" character encoding.
Either change the encoding or remove the characters which are not supported by the "Cp1252" character encoding.
at org.eclipse.jdt.internal.ui.javaeditor.DocumentAdapter.save(DocumentAdapter.java:474)
at org.eclipse.jdt.internal.core.CommitWorkingCopyOperation.executeOperation(CommitWorkingCopyOperation.java:123)
at org.eclipse.jdt.internal.core.JavaModelOperation.run(JavaModelOperation.java:729)
at org.eclipse.core.internal.resources.Workspace.run(Workspace.java:2313)
at org.eclipse.jdt.internal.core.JavaModelOperation.runOperation(JavaModelOperation.java:794)
at org.eclipse.jdt.internal.core.CompilationUnit.commitWorkingCopy(CompilationUnit.java:391)
at org.eclipse.jdt.ui.wizards.NewTypeWizardPage.createType(NewTypeWizardPage.java:2233)
at org.eclipse.jdt.internal.ui.wizards.NewClassCreationWizard.finishPage(NewClassCreationWizard.java:71)
at org.eclipse.jdt.internal.ui.wizards.NewElementWizard$2.run(NewElementWizard.java:118)
at org.eclipse.jdt.internal.core.BatchOperation.executeOperation(BatchOperation.java:39)
at org.eclipse.jdt.internal.core.JavaModelOperation.run(JavaModelOperation.java:729)
at org.eclipse.core.internal.resources.Workspace.run(Workspace.java:2313)
at org.eclipse.jdt.core.JavaCore.run(JavaCore.java:5358)
at org.eclipse.jdt.internal.ui.actions.WorkbenchRunnableAdapter.run(WorkbenchRunnableAdapter.java:106)
at org.eclipse.jface.operation.ModalContext$ModalContextThread.run(ModalContext.java:122)
Caused by: java.nio.charset.UnmappableCharacterException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at java.nio.charset.CharsetEncoder.encode(Unknown Source)
at org.eclipse.core.internal.filebuffers.ResourceTextFileBuffer.commitFileBufferContent(ResourceTextFileBuffer.java:366)
at org.eclipse.core.internal.filebuffers.ResourceFileBuffer.commit(ResourceFileBuffer.java:327)
at org.eclipse.jdt.internal.ui.javaeditor.DocumentAdapter.save(DocumentAdapter.java:472)
... 14 more
Caused by: org.eclipse.core.runtime.CoreException: Some characters cannot be mapped using "Cp1252" character encoding.
Either change the encoding or remove the characters which are not supported by the "Cp1252" character encoding.
at org.eclipse.core.internal.filebuffers.ResourceTextFileBuffer.commitFileBufferContent(ResourceTextFileBuffer.java:378)
at org.eclipse.core.internal.filebuffers.ResourceFileBuffer.commit(ResourceFileBuffer.java:327)
at org.eclipse.jdt.internal.ui.javaeditor.DocumentAdapter.save(DocumentAdapter.java:472)
at org.eclipse.jdt.internal.core.CommitWorkingCopyOperation.executeOperation(CommitWorkingCopyOperation.java:123)
at org.eclipse.jdt.internal.core.JavaModelOperation.run(JavaModelOperation.java:729)
at org.eclipse.core.internal.resources.Workspace.run(Workspace.java:2313)
at org.eclipse.jdt.internal.core.JavaModelOperation.runOperation(JavaModelOperation.java:794)
at org.eclipse.jdt.internal.core.CompilationUnit.commitWorkingCopy(CompilationUnit.java:391)
at org.eclipse.jdt.ui.wizards.NewTypeWizardPage.createType(NewTypeWizardPage.java:2233)
at org.eclipse.jdt.internal.ui.wizards.NewClassCreationWizard.finishPage(NewClassCreationWizard.java:71)
at org.eclipse.jdt.internal.ui.wizards.NewElementWizard$2.run(NewElementWizard.java:118)
at org.eclipse.jdt.internal.core.BatchOperation.executeOperation(BatchOperation.java:39)
at org.eclipse.jdt.internal.core.JavaModelOperation.run(JavaModelOperation.java:729)
at org.eclipse.core.internal.resources.Workspace.run(Workspace.java:2313)
at org.eclipse.jdt.core.JavaCore.run(JavaCore.java:5358)
at org.eclipse.jdt.internal.ui.actions.WorkbenchRunnableAdapter.run(WorkbenchRunnableAdapter.java:106)
at org.eclipse.jface.operation.ModalContext$ModalContextThread.run(ModalContext.java:122)
Caused by: java.nio.charset.UnmappableCharacterException: Input length = 1
at java.nio.charset.CoderResult.throwException(Unknown Source)
at java.nio.charset.CharsetEncoder.encode(Unknown Source)
at org.eclipse.core.internal.filebuffers.ResourceTextFileBuffer.commitFileBufferContent(ResourceTextFileBuffer.java:366)
... 16 more
The problem is that I don't know how I can fix it without opening Eclipse (I can't open it, because it's stuck saving the workspace).
I'll be grateful for help.
PS. Sorry for my English, I'm not a native speaker.
In the top menu of Eclipse, go to the Window menu:
Window Menu -> Preferences -> General (expand it) -> Workspace (click on it).
Look for the box "Text File Encoding". The default will be "Cp1252".
Change the radio button to "Other" and select "UTF-8" from the combo box.
After setting this UTF-8 encoding, you can use UTF-8 in your code, and you won't get the "Cp1252" character encoding error any more.
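Since you mentioned you can't get into Eclipse while it is stuck saving, the same workspace-wide setting can also be changed on disk before reopening it. The path and key below are from memory, so treat them as a starting point rather than gospel:
# <workspace>/.metadata/.plugins/org.eclipse.core.runtime/.settings/org.eclipse.core.resources.prefs
eclipse.preferences.version=1
encoding=UTF-8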

Creating a threshold file with weka's command line

I need to get the threshold curve from my trained classifiers automatically, so I'm figuring out how to do this from the command line (currently using Weka's SimpleCLI). Following the output of java weka.classifiers.functions.Logistic -h, I'm trying to use the -threshold-file argument, described as:
-threshold-file <file>
The file to save the threshold data to.
The format is determined by the extensions, e.g., '.arff' for ARFF
format or '.csv' for CSV.
This is the line I'm trying to execute from the SimpleCLI (broken into two parts for readability):
java weka.classifiers.functions.Logistic -t ".\data\iris.arff" -no-cv -R 1.0E-8 -M -1 \
-threshold-file "C:\Temp\somefile.csv"
Which gives me either this:
java.lang.NullPointerException
or this (or both):
weka.classifiers.evaluation.ThresholdCurve.getCurve(ThresholdCurve.java:125)
weka.classifiers.evaluation.Evaluation.evaluateModel(Evaluation.java:1739)
weka.classifiers.Evaluation.evaluateModel(Evaluation.java:650)
weka.classifiers.AbstractClassifier.runClassifier(AbstractClassifier.java:359)
weka.classifiers.functions.Logistic.main(Logistic.java:1134)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
java.lang.reflect.Method.invoke(Unknown Source)
weka.gui.SimpleCLIPanel$ClassRunner.run(SimpleCLIPanel.java:199)
at weka.classifiers.evaluation.ThresholdCurve.getCurve(ThresholdCurve.java:125)
at weka.classifiers.evaluation.Evaluation.evaluateModel(Evaluation.java:1739)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:650)
at weka.classifiers.AbstractClassifier.runClassifier(AbstractClassifier.java:359)
at weka.classifiers.functions.Logistic.main(Logistic.java:1134)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at weka.gui.SimpleCLIPanel$ClassRunner.run(SimpleCLIPanel.java:199)
(Executing this from the Windows cmd.exe gives me roughly the same messages. Note that I'm using Weka 3.7.11 on a Windows 7 machine (64-bit) with Java 7 (Update 55).)
Note that deleting the last part (the -threshold-file argument) makes the command work fine, although without creating the desired threshold file.
I've tried many variants of this line, with the same result. I'm not familiar with Java. What I need to know is what I'm doing wrong.
Thanks in advance.
Update: it was pointed out to me that the line weka.classifiers.evaluation.ThresholdCurve.getCurve(ThresholdCurve.java:125) is:
if ((predictions.size() == 0)
|| (((NominalPrediction) predictions.get(0)).distribution().length <= classIndex)) {
return null;
}
(source)
So it seems that the problem arises because no predictions are made at all. I don't know why this would happen or how I can fix it.
The command you are using to invoke the classifier from the SimpleCLI does not generate any evaluation predictions, and the threshold curve is built from those.
You can do the following:
Remove the -no-cv parameter (so cross-validation generates predictions).
Or specify a test file using the -T option.
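For example, simply dropping -no-cv from your original command should be enough (paths unchanged from your question; the default 10-fold cross-validation then produces the predictions the curve needs):
java weka.classifiers.functions.Logistic -t ".\data\iris.arff" -R 1.0E-8 -M -1 \
    -threshold-file "C:\Temp\somefile.csv"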

YUICompressor crashes - stackoverflow error

I frequently get what appears to be a stackoverflow error ;-) from YUICompressor. The following is the first part of thousands of error lines that come from attempting to compress a 24074-byte CSS stylesheet (note the "Caused by: java.lang.StackOverflowError" about 8 lines down):
iMac1:src jas$ min ../style2.min.css style2.css
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at com.yahoo.platform.yui.compressor.Bootstrap.main(Bootstrap.java:21)
Caused by: java.lang.StackOverflowError
at java.lang.Character.codePointAt(Character.java:2335)
at java.util.regex.Pattern$CharProperty.match(Pattern.java:3344)
at java.util.regex.Pattern$Branch.match(Pattern.java:4114)
... (plus 1021 more error lines)
The errors usually happen after adding a couple of lines to the file being compressed. The CSS is fine and works perfectly in its uncompressed form. I don't see a particular pattern in the types of selectors added to the file that cause the errors. In this case, adding the following selector to a previously compressible file resulted in the errors:
#thisisatest
{
margin-left:87px;
}
I am wondering if there is perhaps a flag to java to enlarge the stack that might help. Or if that is not the problem, what is?
EDIT:
As I was posting this question, it dawned on me that I should check the java command to see if there was a parameter to enlarge the stack. Turns out there is: -Xss<size>, where <size> indicates the stack size. Its default value is 512k. So I tried 1024k, but that still led to the stackoverflow. Trying 2048k works, however, and I think this could be the solution.
EDIT 2:
While I no longer use this method for minification, to be more specific here is the full command (which I have set up as a shell alias), showing how the -Xss2048k parameter is used:
java -Xss2048k -jar ~/Documents/RepHunter/Website\ Materials/Code/Third\ Party\ Libraries/YUI\ Compressor/yuicompressor-2.4.8.jar --type css -o
As posted in my edit, the solution was to add the -Xss parameter to the java command. The clue was the fifth "at" line of the stack trace, as follows:
at com.yahoo.platform.yui.compressor.Bootstrap.main(Bootstrap.java:21)
Caused by: java.lang.StackOverflowError
Seeing that the issue was a "StackOverflowError" ;-) suggested trying to increase the stack size. The default is 512k. My first try of 1024k did not work, but increasing it to 2048k did, and I have had no further issues.
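For reference, a complete invocation of this form would look something like the following (jar location and file names are placeholders):
java -Xss2048k -jar yuicompressor-2.4.8.jar --type css -o style2.min.css style2.css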

Regular expression to parse a log file and find stacktraces

I'm working with a legacy Java app that has no logging and just prints all information to the console. Most exceptions are also "handled" by just calling printStackTrace().
In a nutshell, I've redirected the System.out and System.err streams to a log file, and now I need to parse that log file. So far so good, but I'm having problems trying to parse the log file for stack traces.
Some of the code is obfuscated as well, so I need to run the stack traces through a utility app to de-obfuscate them. I'm trying to automate all of this.
The closest I've come so far is to get the initial Exception line using this:
.+Exception[^\n]+
And finding the "at ..(..)" lines using:
(\t+\Qat \E.+\s+)+
But I can't figure out how to put them together to get the full stacktrace.
Basically, the log files look something like the following. There is no fixed structure, and the lines before and after stack traces are completely random:
Modem ERROR (AT
Owner: CoreTalk
) - TIMEOUT
IN []
Try Open: COM3
javax.comm.PortInUseException: Port currently owned by CoreTalk
at javax.comm.CommPortIdentifier.open(CommPortIdentifier.java:337)
...
at UniPort.modemService.run(modemService.java:103)
Handling file: C:\Program Files\BackBone Technologies\CoreTalk 2006\InputXML\notify
java.io.FileNotFoundException: C:\Program Files\BackBone Technologies\CoreTalk 2006\InputXML\notify (The system cannot find the file specified)
at java.io.FileInputStream.open(Native Method)
...
at com.gobackbone.Store.a.a.handle(Unknown Source)
at com.jniwrapper.win32.io.FileSystemWatcher.fireFileSystemEvent(FileSystemWatcher.java:223)
...
at java.lang.Thread.run(Unknown Source)
Load Additional Ports
... Lots of random stuff
IN []
[Fatal Error] .xml:6:114: The entity name must immediately follow the '&' in the entity reference.
org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(Unknown Source)
...
at com.gobackbone.Store.a.a.run(Unknown Source)
Looks like you just need to paste them together (and use a newline as glue):
.+Exception[^\n]+\n(\t+\Qat \E.+\s+)+
But I would change your regex a bit:
^.+Exception[^\n]++(\s+at .++)+
This combines the whitespace between the at... lines and uses possessive quantifiers to avoid backtracking.
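If it's useful, here is a small Java sketch that applies that pattern to the redirected log file and prints each stack trace it finds (the log file name is an assumption):
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StackTraceGrep {
    // Header line containing "Exception", then one or more "at ..." lines.
    private static final Pattern TRACE =
        Pattern.compile("^.+Exception[^\\n]++(\\s+at .++)+", Pattern.MULTILINE);

    public static void main(String[] args) throws Exception {
        String log = new String(Files.readAllBytes(Paths.get("console.log")));
        Matcher m = TRACE.matcher(log);
        while (m.find()) {
            System.out.println("---- stack trace ----");
            System.out.println(m.group());
        }
    }
}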
We have been using ANTLR to tackle the parsing of logfiles (in a different application area). It's not trivial but if this is a critical task for you it will be better than using regexes.
I get good results using
perl -n -e 'm/(Exception)|(\tat )/ && print' /var/log/jboss4.2/debian/server.log
It dumps all lines that contain Exception or \tat. Since the lines are matched in the order they are read, the ordering of the stack traces is preserved.
