Why is the result of running the StringToWordVector filter in the Weka GUI different from the equivalent Java code? I use the same attributes as in the GUI, but the tokenizer in Java doesn't seem to do a proper job. A Ph.D. student told me this is common, with no further explanation.
Please help. My project is stalled.
Here is my code:
DataSource tempSource = new DataSource("/home/r_omio/Dataset.arff");
Instances temp = tempSource.getDataSet();
NumericToBinary nbTemp = new NumericToBinary();
nbTemp.setInputFormat(temp);
temp = Filter.useFilter(temp, nbTemp);
StringToWordVector stringFilterTemp = new StringToWordVector(2500);
stringFilterTemp.setOptions(
weka.core.Utils.splitOptions("-R 1,2,3,4 -W 2500 -prune-rate -1.0 -N 1 -stemmer weka.core.stemmers.NullStemmer -M 1 -tokenizer weka.core.tokenizers.WordTokenizer -delimiters \" \\r\\n\\t.,;:\\\'\\\"()?![]_\"")
);
stringFilterTemp.setInputFormat(temp);
temp = Filter.useFilter(temp, stringFilterTemp);
I suspect your delimiters are incorrectly escaped. Try using the default delimiters in the GUI and leaving the tokenizer out in Java, which will use the default, and see if you get the same value.
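For comparison, here is a minimal sketch (the class name is just for illustration; the path and word count are taken from your question, everything else is left at its defaults) that omits -tokenizer and -delimiters entirely. If this produces the same vocabulary as the GUI, the escaping of your delimiter string is the culprit:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class DefaultTokenizerCheck {
    public static void main(String[] args) throws Exception {
        // Load the same dataset used above.
        Instances data = new DataSource("/home/r_omio/Dataset.arff").getDataSet();

        // Only the attribute range and the number of words to keep are set;
        // the tokenizer and delimiters stay at their defaults, as in a fresh GUI filter.
        StringToWordVector filter = new StringToWordVector();
        filter.setOptions(weka.core.Utils.splitOptions("-R 1,2,3,4 -W 2500"));
        filter.setInputFormat(data);
        Instances vectorised = Filter.useFilter(data, filter);

        System.out.println(vectorised.numAttributes() + " attributes generated");
    }
}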
What am I doing?
I am writing a data analysis program in Java which relies on R's arulesViz library to mine association rules.
What do I want?
My purpose is to store the rules in a String variable in Java so that I can process them later.
How does it work?
The code works through a combination of Java's String.format and rJava's eval calls; its behavior can be summarized as follows:
Given properly formatted Java data structures, it creates a data frame in R.
It converts the newly created data frame into a transaction list using the arules library.
It runs the apriori algorithm on the transaction list, with the necessary values passed as parameters.
It reorders the generated association rules.
Since the association rules cannot be printed directly, they are written to standard output with R's write method; the output is captured and stored in a variable. At this point the association rules have been converted into a string.
It returns the string.
The code is the following:
// Step 1
Rutils.rengine.eval("dataFrame <- data.frame(as.factor(c(\"Red\", \"Blue\", \"Yellow\", \"Blue\", \"Yellow\")), as.factor(c(\"Big\", \"Small\", \"Small\", \"Big\", \"Tiny\")), as.factor(c(\"Heavy\", \"Light\", \"Light\", \"Heavy\", \"Heavy\")))");
// Step 2
Rutils.rengine.eval("transList <- as(dataFrame, 'transactions')");
// Step 3
Rutils.rengine.eval(String.format("info <- apriori(transList, parameter = list(supp = %f, conf = %f, maxlen = 2))", supportThreshold, confidenceThreshold));
// Step 4
Rutils.rengine.eval("orderedRules <- sort(info, by = c('count', 'lift'), order = FALSE)");
// Step 5
REXP res = Rutils.rengine.eval("rulesAsString <- paste(capture.output(write(orderedRules, file = stdout(), sep = ',', quote = TRUE, row.names = FALSE, col.names = FALSE)), collapse='\n')");
// Step 6
return res.asString().replaceAll("'", "");
What's wrong?
Running the code on Linux works perfectly, but when I try to run it on Windows, I get the following error referring to the return line:
Exception in thread "main" java.lang.NullPointerException
This is a common error I get whenever the R code produces a null result and passes it to Java. There's no way to syntax-check the R code from inside Java, so whenever it's wrong, this error message appears.
However, when I run the quoted R code directly in the R command line on Windows, it works flawlessly, so both the syntax and the data flow are OK.
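For reference, here is a minimal defensive sketch of the return step (assuming the same JRI Rengine instance, Rutils.rengine, and the rulesAsString variable created in Step 5; the helper name readRules is just for illustration). It surfaces a null R result explicitly instead of failing with the NullPointerException:

import org.rosuda.JRI.REXP;
import org.rosuda.JRI.Rengine;

// Sketch only: 'engine' stands for the Rutils.rengine instance used above.
static String readRules(Rengine engine) {
    // Re-read the variable produced in Step 5.
    REXP res = engine.eval("rulesAsString");
    if (res == null) {
        // Rengine.eval returns null when the R expression failed or produced no
        // value, which is exactly what makes res.asString() throw a
        // NullPointerException on the original return line.
        throw new IllegalStateException("R returned no value for 'rulesAsString'");
    }
    return res.asString().replaceAll("'", "");
}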
Technical information
On Linux, I am using R with OpenJDK 10.
On Windows, I am currently using Oracle's latest JDK release; trying to run the program with OpenJDK 12 for Windows does not help either.
Everything is 64-bit.
The IDE used in both operating systems is IntelliJ IDEA 2019.
Screenshots
Linux run configuration:
Windows run configuration:
I am trying to create a scanner with JFlex. I created my .jflex file and it compiles fine. The problem is that when I try to test it, it sometimes throws an ArrayIndexOutOfBoundsException: 769 inside the .java class that JFlex generated.
I am also using the CUP parser generator. I don't know whether the problem is related to the CUP analysis part, but here is how I call my analyzers:
ScannerLexico lexico = new ScannerLexico(new BufferedReader(new StringReader( jTextPane1.getText())));
ParserSintactico sintaxis = new ParserSintactico(lexico);
I don't know how to fix it. Please help me.
Here are the links to my code:
JFlex File "ScannerFranklin.jflex"
Java Class generated "ScannerLexico.java"
The part where I have the problem is in the .java class generated by JFlex, in the next_token() function (line 899 in the Java file):
int zzNext = zzTransL[ zzRowMapL[zzState] + zzCMapL[zzInput] ];
if (zzNext == -1) break zzForAction;
zzState = zzNext;
Thanks.
According to its documentation, scanners generated by JFlex with the %7bit or %8bit/%full encoding options throw ArrayIndexOutOfBounds exceptions whenever they encounter Unicode characters. The documentation recommends always using the %unicode option instead, which is the default.
You are using the %unicode option, but you're also using %full. Apparently when you have both options, %full takes precedence. So remove %full and the error should go away.
I am using the Stanford POS tagger toolkit to tag lists of words from academic papers. Here is my code for this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize. The words come from normal academic papers, so there are usually several thousand of them (mostly 3000-4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set of 270 files, but when the number of files gets bigger, it fails with this error (Java heap space is set to 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately at the start of execution; it happens after the program has been running for some time. I really don't know the reason. Is it because my 3000-4000 words are too many? Thank you very much for your help! (Sorry for the bad editing; the full error output is too long to include.)
Here is my solution, after I ran into the same error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000:
def __init__(self, [...], java_options='-mx1000m')
To test whether the problem is the size of the data, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output them right after tagging.
I am using the WEKA API to perform 10-fold cross-validation on my dataset using a KNN (IBk) classifier. I have selected some values for K and have written some simple loops that perform CV for each value of K.
When I look at the results, I notice that they are all identical, even though I use a different K value each time. How is this possible? For clarification, here is the line of code where I specify the options that I feed to my classifier. This works perfectly when I'm doing normal classification (so no cross-validation).
"-K " + k_value + " -W 0 -I -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\""
As you can see, I change k_value in every iteration. Please help!
EDIT1
Some additional code that may help.
Evaluation evaluation = new Evaluation(dataset);
evaluation.crossValidateModel(knn, dataset, 10, new Random(1), new Object[] {});
System.out.println("Error-rate " + evaluation.errorRate());
writer.println("AUROC: " + evaluation.areaUnderROC(0));
writer.println(evaluation.toClassDetailsString());
writer.println(evaluation.toMatrixString());
writer.println(evaluation.toSummaryString("\nResults\n======\n", true));
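For reference, here is a minimal, self-contained sketch of the kind of loop I described above (the dataset path, class index, and K values are placeholders, not my real setup):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnnCrossValidation {
    public static void main(String[] args) throws Exception {
        // Placeholder dataset; the last attribute is assumed to be the class.
        Instances dataset = new DataSource("dataset.arff").getDataSet();
        dataset.setClassIndex(dataset.numAttributes() - 1);

        for (int k : new int[] {1, 3, 5, 10}) {          // placeholder K values
            IBk knn = new IBk();                          // fresh classifier for each K
            knn.setOptions(weka.core.Utils.splitOptions("-K " + k));
            Evaluation evaluation = new Evaluation(dataset);
            evaluation.crossValidateModel(knn, dataset, 10, new Random(1));
            System.out.println("K=" + k + " error rate: " + evaluation.errorRate());
        }
    }
}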
EDIT2
Instead of using k_value, I replaced it with a hard-coded value (here 10):
classification.setOptions(weka.core.Utils.splitOptions("-K 10 -W 0 -I -A \"weka.core.neighboursearch.LinearNNSearch -A \\\"weka.core.EuclideanDistance -R first-last\\\"\""));
And again I get exactly the same results as before with a different value of K. How can this happen? Is this how things work? Is this normal in Weka?
I am trying to find a Java version in HKLM\SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall by iterating over the subkeys of Uninstall. I am trying to match a regular expression against Java 7 Update 40, but the regex matches all DisplayName entries. Below is the code:
On Error Resume Next
Const HKEY_LOCAL_MACHINE = &H80000002
Dim oReg
Dim objShell
Set objShell = WScript.CreateObject("WScript.Shell")
Set oReg = GetObject("winmgmts:{impersonationLevel=impersonate}!\\.\root\default:StdRegProv")
Dim sPath, aSub, sKey
Set objRegEx = New RegExp
objRegEx.Pattern = "\w{4}\s\d{1}\s\w{6}\s\d+"
objRedEx.IgnoreCase = True
objRegEx.Global = False
sPath = "SOFTWARE\Wow6432Node\Microsoft\Windows\CurrentVersion\Uninstall"
oReg.EnumKey HKEY_LOCAL_MACHINE, sPath, aSub
For Each sKey In aSub
disName = "HKLM" & "\" & sPath & "\" & sKey & "\DisplayName"
unString = "HKLM" & "\" & sPath & "\" & sKey & "\UninstallString"
reDisName = objShell.RegRead(disName)
reUnString = objShell.RegRead(unString)
'Wscript.echo(reDisName)
If objRexEx.Test( reDisName ) Then
Wscript.echo "Match"
End If
'Wscript.echo ObjShell.RegRead(disName)
'Wscript.echo ObjShell.RegRead(unString)
Next
Sorry if the formatting is off; I put a Ctrl-K in front of each code line. This is my first time posting here, so go easy...
You should start all your scripts with Option Explicit and Dim all your variables. Then you wouldn't need sln's eagle eyes to spot your typo:
Option Explicit
Dim objRegEx : Set objRegEx = New RegExp
objRegEx.Pattern = "\w{4}\s\d{1}\s\w{6}\s\d+"
objRedEx.IgnoreCase = True
output:
cscript 19188400.vbs
...\19188400.vbs(4, 1) Microsoft VBScript runtime error: Variable is undefined: 'objRedEx'
If you insist on using a global On Error Resume Next (a most dangerous malpractice), then you should disable it until your script is thoroughly debugged. Keeping the OERN in a script known to have even the slightest problem is inviting disaster. Asking for help with code containing a global OERN is futile. So run your program without the OERN and see if the cause of its misbehaviour isn't obvious.
Diagnostic output should be as specific as possible. Your WScript.Echo "Match" just shows that the statement is executed; a WScript.Echo "Match", disname would be a bit better. Using .Execute and looking at the Match's details could be more revealing.
The .Pattern should be more specific, too. If you look for Java updates, anchoring a literal "Java" at the start of the string and asking for "Update" instead of "\w{6}" may help to avoid false positives. OTOH, my display names don't look like
Java 7 Update 19
but like
Java(TM) 6 Update 19
and who knows what the next owner of Java will put into the display name.
You seem to have a few typos:
objRedEx.IgnoreCase = True
...
If objRexEx.Test( reDisName ) Then