What am I doing?
I am writing a data analysis program in Java that relies on R's arulesViz library to mine association rules.
What do I want?
My purpose is to store the rules in a String variable in Java so that I can process them later.
How does it work?
The code combines Java's String.format with rJava's eval calls. Its behavior can be summarized as follows:
1. Given properly formatted Java data structures, it creates a data frame in R.
2. It converts that data frame into a transaction list using the arules library.
3. It runs the apriori algorithm on the transaction list, with the required thresholds passed as parameters.
4. It reorders the generated association rules.
5. Because the association rules cannot be printed directly, they are written to standard output with R's write method; the output is captured and stored in a variable, which turns the rules into a string.
6. It returns the string.
The code is the following:
// Step 1
Rutils.rengine.eval("dataFrame <- data.frame(as.factor(c(\"Red\", \"Blue\", \"Yellow\", \"Blue\", \"Yellow\")), as.factor(c(\"Big\", \"Small\", \"Small\", \"Big\", \"Tiny\")), as.factor(c(\"Heavy\", \"Light\", \"Light\", \"Heavy\", \"Heavy\")))");
//Step 2
Rutils.rengine.eval("transList <- as(dataFrame, 'transactions')");
//Step 3
Rutils.rengine.eval(String.format("info <- apriori(transList, parameter = list(supp = %f, conf = %f, maxlen = 2))", supportThreshold, confidenceThreshold));
// Step 4
Rutils.rengine.eval("orderedRules <- sort(info, by = c('count', 'lift'), order = FALSE)");
// Step 5
REXP res = Rutils.rengine.eval("rulesAsString <- paste(capture.output(write(orderedRules, file = stdout(), sep = ',', quote = TRUE, row.names = FALSE, col.names = FALSE)), collapse='\n')");
// Step 6
return res.asString().replaceAll("'", "");
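One thing worth ruling out in step 3, offered as an assumption rather than a confirmed cause: String.format without an explicit Locale uses the JVM's default locale, so on a machine whose locale uses a decimal comma, %f renders 0.5 as 0,500000, which R would not read as a single number. The sketch below pins the locale. The class and method names (RCommands, apriori) are hypothetical and not part of the original program.

```java
import java.util.Locale;

// Hypothetical helper: builds the R command from step 3 with a fixed locale.
final class RCommands {
    private RCommands() {}

    /** Returns the apriori call as R source, always using '.' as the decimal separator. */
    static String apriori(double supportThreshold, double confidenceThreshold) {
        // Locale.ROOT keeps %f independent of the operating system's regional settings.
        return String.format(
                Locale.ROOT,
                "info <- apriori(transList, parameter = list(supp = %f, conf = %f, maxlen = 2))",
                supportThreshold,
                confidenceThreshold);
    }
}
```

The returned string could then be passed to Rutils.rengine.eval in place of the inline String.format call, so that the same text reaches R on every machine.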
What's wrong?
The code runs perfectly on Linux, but when I run it on Windows I get the following error, which points to the return line:
Exception in thread "main" java.lang.NullPointerException
This is the error I get whenever the R code produces a null result and passes it to Java. There is no way to syntax-check the R code from inside Java, so this message appears whenever the R code is wrong.
However, when I run the same R code (the strings passed to eval) in the R command line on Windows, it works flawlessly, so both the syntax and the data flow are OK.
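Since a failed step only surfaces later as a NullPointerException, it can help to check every eval result immediately. The following is a minimal sketch, assuming (as described above) that eval returns null when the R code fails. The evaluator parameter is a hypothetical stand-in for a call such as Rutils.rengine::eval, and the RSteps class is not part of the original program.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: runs R snippets in order and fails fast on the first null result.
final class RSteps {
    private RSteps() {}

    /**
     * Evaluates each snippet with the given evaluator and returns the last result.
     * Throws as soon as one snippet yields null, naming the step and its R code.
     */
    static <T> T evalAll(Function<String, T> evaluator, List<String> steps) {
        if (steps.isEmpty()) {
            throw new IllegalArgumentException("At least one R step is required");
        }
        T last = null;
        for (int i = 0; i < steps.size(); i++) {
            String code = steps.get(i);
            last = evaluator.apply(code);
            if (last == null) {
                // Report the 1-based step number so it matches the "Step N" comments.
                throw new IllegalStateException(
                        "R step " + (i + 1) + " returned null: " + code);
            }
        }
        return last;
    }
}
```

With the six eval strings collected in a list, the exception message would show which step first returned null on Windows, instead of the failure appearing only at the return line.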
Technical information
In Linux, I am using R with OpenJDK 10.
In Windows, I am currently using Oracle's latest JDK release, but trying to run the program with OpenJDK 12 for Windows does not solve anything.
Everything is 64-bit.
The IDE used in both operating systems is IntelliJ IDEA 2019.
Screenshots
Linux run configuration:
Windows run configuration:
I want to parallelize my data writing process. I am writing a data frame to Oracle Database. This data has 4 million rows and 8 columns. It takes 6.5 hours without parallelizing.
When I try to run it in parallel, I get this error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: No running JVM detected. Maybe .jinit() would help.
I know this error, and I can solve it when I work in a single process. But I do not know how to tell the other cluster nodes where Java is located. Here is my code:
Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_181')
library(rJava)
library(RJDBC)
library(DBI)
library(compiler)
library(dplyr)
library(data.table)
jdbcDriver <- JDBC("oracle.jdbc.OracleDriver", classPath = "C:/Program Files/directory/ojdbc6.jar", identifier.quote = "\"")
jdbcConnection <- dbConnect(jdbcDriver, "jdbc:oracle:thin:@//XXXXX", "YYYYY", "ZZZZZ")
By using Sys.setenv(JAVA_HOME='C:/Program Files/Java/jre1.8.0_181') I solve the same problem for a single core. But the error comes back when I go parallel:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
clusterExport(cl, varlist = list("jdbcConnection", "brand3.merge.u"))
clusterEvalQ(cl, .libPaths("C:/Users/onur.boyar/Documents/R/win-library/3.5"))
clusterEvalQ(cl, library(RJDBC))
clusterEvalQ(cl, library(rJava))
parLapply(cl, 1:length(brand3.merge.u$CELL_PH_NUM), function(x) dbSendUpdate(jdbcConnection, "INSERT INTO xxnvdw.an_cust_analytics VALUES(?,?,?,?,?,?,?,?)", brand3.merge.u[x, 1], brand3.merge.u[x,2], brand3.merge.u[x,3],brand3.merge.u[x,4],brand3.merge.u[x,5],brand3.merge.u[x,6],brand3.merge.u[x,7],brand3.merge.u[x,8]))
#brand3.merge.u is my data frame that I try to write.
I get the above error and I do not know how to set my Java location for other nodes.
I want to use parLapply since it is faster than foreach. Any help would be appreciated. Thanks!
JAVA_HOME environment variable
If the problem really is with the location of Java, you could set the environment variable in your .Renviron file, which is likely located at ~/.Renviron. Add the following line to that file and it will be propagated to all R sessions that run under your user:
JAVA_HOME='C:/Program Files/Java/jre1.8.0_181'
Alternatively, you can just add that location to your PATH environment variable.
JVM Initialization via rJava
On the other hand, the error message may simply mean that no JVM has been initialized in the worker processes, which you can solve by calling .jinit inside each worker. A minimal example:
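library(parallel)
cl <- makeCluster(detectCores())
parallel::parLapply(cl, 1:5, function(x) {
  # Start a JVM inside the worker before making any Java call
  rJava::.jinit()
  rJava::.jnew(class = "java/lang/Integer", x)$toString()
})
stopCluster(cl)  # release the worker processes when done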
Working around Java use
This was not specifically asked, but you can also avoid the Java dependency altogether by using ODBC drivers, which should be available for Oracle:
con <- DBI::dbConnect(
odbc::odbc(),
Driver = "[your driver's name]",
...
)
I am using the Stanford POS tagger toolkit to tag lists of words from academic papers. Here is the relevant part of my code:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words produced by nltk.word_tokenize. The words come from normal academic papers, so there are usually several thousand of them (mostly 3000 - 4000). I need to process over 10000 files, so I keep calling these functions. My program works fine on a small test set of 270 files, but when the number of files gets bigger, it fails with this error (Java heap space set to 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately; it happens after the program has been running for some time. I really don't know the reason. Is it because 3000 - 4000 words are too many? Thank you very much for your help! (Sorry for the poor formatting; the error information is too long.)
Here is my solution, after I faced the same error. Basically, increasing the Java heap size solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print(tagger.tag(sentence.split()))
I assume you have tried increasing the Java heap size via the tagger settings, like so:
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf. the docs; the default is 1000 MB:
def __init__(self, [...], java_options='-mx1000m')
To test whether the size of the dataset is the problem, you can tokenize your text into sentences, e.g. using the Punkt tokenizer, and output each sentence right after tagging it.
When I try to compile the LaTeX document below from Java, my pdflatex run crashes:
\documentclass{article}
\usepackage{tikz}
\usetikzlibrary{arrows}
\begin{document}
\pagestyle{empty}
%
\tikzstyle{int}=[draw, fill=blue!20, minimum size=2em]
\tikzstyle{init} = [pin edge={to-,thin,black}]
\begin{tikzpicture}[node distance=2.5cm,auto,>=latex']
\node [int, pin={[init]above:$v_0$}] (a) {$\frac{1}{s}$};
\node (b) [left of=a,node distance=2cm, coordinate] {a};
\node [int, pin={[init]above:$p_0$}] (c) [right of=a] {$\frac{1}{s}$};
\node [coordinate] (end) [right of=c, node distance=2cm]{};
\path[->] (b) edge node {$a$} (a);
\path[->] (a) edge node {$v$} (c);
\draw[->] (c) edge node {$p$} (end) ;
\end{tikzpicture}
\end{document}
pdflatex doesn't just produce some error, but it simply freezes. The log file is cut off in the middle, even before an enclosing quotation mark is completed (but always at the same position, I think).
I use this Java command to execute pdflatex:
Process p = Runtime.getRuntime().exec(command);
p.waitFor();
The command executed is:
"C:\Program Files\MiKTeX 2.9\miktex\bin\x64\pdflatex.exe" -output-directory "C:\Eig\Lehre\Info2\ImagesTemp" "C:\Eig\Lehre\Info2\ImagesTemp\graph.tex"
Executing the command by hand in a command line works fine! Also, the Java execution works fine when I don't include tikz in the latex document. This seems quite strange to me - is there some bug or am I missing something?
I'm using MiKTeX 2.9 and Java 8 on Windows, and I've tried it on different Windows versions.
This problem is probably caused by not capturing the output of the process. You need to read every byte that the child process writes to standard output and standard error; otherwise the system buffer will fill up and the process will block the next time it attempts to write something.
Here's a related question: Capturing stdout when calling Runtime.exec
Which points to http://www.javaworld.com/article/2071275/core-java/when-runtime-exec---won-t.html for more information.
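A minimal sketch of that advice, assuming Java 8 and using ProcessBuilder instead of Runtime.exec so that standard error can be merged into standard output and drained with a single reader. The class and method names (DrainedProcess, run) are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper: runs a command and reads all of its output before waiting.
final class DrainedProcess {
    final int exitCode;
    final String output;

    private DrainedProcess(int exitCode, String output) {
        this.exitCode = exitCode;
        this.output = output;
    }

    /** Starts the command, drains merged stdout/stderr, then waits for the exit code. */
    static DrainedProcess run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader empties both pipes.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        StringBuilder captured = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), Charset.defaultCharset()))) {
            String line;
            // Reading until end-of-stream keeps the pipe buffer from filling up.
            while ((line = reader.readLine()) != null) {
                captured.append(line).append('\n');
            }
        }
        // Only wait after the output has been fully consumed.
        int exitCode = process.waitFor();
        return new DrainedProcess(exitCode, captured.toString());
    }
}
```

For the pdflatex call in the question, the list would hold the executable path, -output-directory, the output directory and the .tex file as separate elements, which also removes the need to quote paths that contain spaces.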
I am out of my R depth. I defined a function nGrams (using RWeka) that worked fine when I tried it out, and sometimes it still does. I do not know how to figure out what environment it works in, what environment I am in when I want to use it, etc. Any quick tips or can you point me to a webpage that could help? If I have to put in a change environment command every time I use it, that is just fine. I really do not understand the issue.
Here is what I see in my console:
blog2gramfreq <- nGrams(cleanblogs100000, 2)
Error in ls(envir = envir, all.names = private) :
invalid 'envir' argument
Called from: top level
Called from: top level
Browse[1]>
structure(function (this, private = FALSE, ...)
{
envir <- attr(this, ".env")
ls(envir = envir, all.names = private)
}, export = FALSE, S3class = "Object", modifiers = "public")
I do see nGrams in my Global Environment window.
This came up in a Coursera class blog, and I did not find an answer there, at least not for R. Here is what worked for me when I received the "OutOfMemoryError: not enough java heap space" error in R. Set the option before loading rJava or any package that depends on it (such as RWeka), because it is only read when the JVM starts:
options(java.parameters="-Xmx4000m")
I have been using the JavaImp.vim script to auto-import Java statements in Vim.
But even after trying different directories in JavaImpPaths, I am still unable to make JavaImp parse the Java files in the source tree so that auto imports work.
This is what my .vimrc looks like:
let g:JavaImpPaths = "~/Documents/android-sdks/sources/android-21/android/content/"
let g:JavaImpClassList = "~/.vim/JavaImp/JavaImp.txt"
let g:JavaImpJarCache = "~/.vim/JavaImp/cache/"
This is what I get when running JIG in a new Vim window:
:JIG
Do you want to create the directory ~/.vim/JavaImp/cache/?
Searching in path (package): ~/Documents/android-sdks/sources/android-21/android/content/ ()
Sorting the classes, this may take a while ...
Assuring uniqueness...
Error detected while processing function <SNR>10_JavaImpGenerate:
line 75:
E37: No write since last change (add ! to override)
Done. Found 1 classes (0 unique)
Press ENTER or type command to continue
It might be late, but if anyone else comes along this might help them...
I got it working with the following changes to the script:
line 181 from
close
to
close!
And lines 207/208 from
let l:javaList = glob(a:cpath . "/**/*.java", 1, 1)
let l:clssList = glob(a:cpath . "/**/*.class", 1, 1)
to
let l:javaList = split(glob(a:cpath . "/**/*.java"), "\n")
let l:clssList = split(glob(a:cpath . "/**/*.class"), "\n")