I'm running into a very curious problem. I've an R code which doesn't contain any print statement (except my explicit calls to log the time taken) and yet the entire json gets dumped on to the "R console". This is causing a serious performance issue with our module and I need your help to track down the problem.
Here is just a part of the R file (due to company policies, I cannot post the entire source code and I apologize for not giving much info)
#run time/online time series models
LAST_TIME_ID <- DATA[nrow(DATA),id];
LAST_TIME_ID <- strptime(LAST_TIME_ID,format="%d-%m-%Y %H:%M");
#timestamp tag computation using frequency
FREQUENCY_VEC <- rep(TIMESTAMP_FREQUENCY*60,PREDICTION_NUMBER);
FREQUENCY_VEC <- cumsum(FREQUENCY_VEC);
TIMESTAMP_TAGS <- LAST_TIME_ID + FREQUENCY_VEC;
TIMESTAMP_TAGS <- format(strptime(TIMESTAMP_TAGS,format="%Y-%m-%d %H:%M"),format="%d-%m-%Y %H:%M");
#prepare the prediction points data per tag into table format
PREDICTION_DATA <- NULL;
startTime <- Sys.time();
for (tag_index in 1:length(MODEL[,tag_id])) {
TEMP <- data.table(id=as.character(TIMESTAMP_TAGS),tag_id = MODEL[tag_index,tag_id], prediction = as.vector(MODEL[,Forecast][[tag_index]]));
PREDICTION_DATA <- data.table(rbind(PREDICTION_DATA,TEMP));
rm(TEMP);
};
endTime <- Sys.time();
print(paste("seconds consumed (prediction points data per tag into TEMP): ",(endTime-startTime)/1000));
#OUTPUT <- dcast(PREDICTION_DATA,id~tag_id); #into output
OUTPUT <- PREDICTION_DATA;
#compute final output in json format
js_object <- toJSON(OUTPUT,asIs = TRUE);
js_object;
I can assure you the rest of code looks the same (i.e. no prints). I'm running my R code through Java (1.8) using RServe (REngine.jar) on Windows 8.
Any ideas/clues would be greatly appreciated.
The last line of your choice snippet where you just execute:
js_object;
prints the variable to the console.
Remove such type of statements where nothing assigned to a variable.
When you type the js_object file name at the end, you are calling that object and R is printing the contents to the console. All you need to do is remove that an it should stop printing it out.
Related
What am I doing?
I am writing a data analysis program in Java which relies on R´s arulesViz library to mine association rules.
What do I want?
My purpose is to store the rules in a String variable in Java so that I can process them later.
How does it work?
The code works using a combination of String.format and eval Java and RJava instructions respectively, being its behavior summarized as:
Given properly formatted Java data structures, creates a data frame in R.
Formats the recently created data frame into a transaction list using the arules library.
Runs the apriori algorithm with the transaction list and some necessary values passed as parameter.
Reorders the generated association rules.
Given that the association rules cannot be printed, they are written to the standard output with R´s write method, capture the output and store it in a variable. We have converted the association rules into a string variable.
We return the string.
The code is the following:
// Step 1
Rutils.rengine.eval("dataFrame <- data.frame(as.factor(c(\"Red\", \"Blue\", \"Yellow\", \"Blue\", \"Yellow\")), as.factor(c(\"Big\", \"Small\", \"Small\", \"Big\", \"Tiny\")), as.factor(c(\"Heavy\", \"Light\", \"Light\", \"Heavy\", \"Heavy\")))");
//Step 2
Rutils.rengine.eval("transList <- as(dataFrame, 'transactions')");
//Step 3
Rutils.rengine.eval(String.format("info <- apriori(transList, parameter = list(supp = %f, conf = %f, maxlen = 2))", supportThreshold, confidenceThreshold));
// Step 4
Rutils.rengine.eval("orderedRules <- sort(info, by = c('count', 'lift'), order = FALSE)");
// Step 5
REXP res = Rutils.rengine.eval("rulesAsString <- paste(capture.output(write(orderedRules, file = stdout(), sep = ',', quote = TRUE, row.names = FALSE, col.names = FALSE)), collapse='\n')");
// Step 6
return res.asString().replaceAll("'", "");
What´s wrong?
Running the code in Linux Will work perfectly, but when I try to run it in Windows, I get the following error referring to the return line:
Exception in thread "main" java.lang.NullPointerException
This is a common error I have whenever the R code generates a null result and passes it to Java. There´s no way to syntax check the R code inside Java, so whenever it´s wrong, this error message appears.
However, when I run the R code in brackets in the R command line in Windows, it works flawlessly, so both the syntax and the data flow are OK.
Technical information
In Linux, I am using R with OpenJDK 10.
In Windows, I am currently using Oracle´s latest JDK release, but trying to run the program with OpenJDK 12 for Windows does not solve anything.
Everything is 64 bits.
The IDE used in both operating systems is IntelliJ IDEA 2019.
Screenshots
Linux run configuration:
Windows run configuration:
I am using stanford posttager toolkit to tag list of words from academic papers. Here is my codes of this part:
st = StanfordPOSTagger(stanford_tagger_path, stanford_jar_path, encoding = 'utf8', java_options = '-mx2048m')
word_tuples = st.tag(document)
document is a list of words derived from nltk.word_tokenize, they come from mormal academic papers so usually there are several thousand of words (mostly 3000 - 4000). I need to process over 10000 files so I keep calling these functions. My program words fine on a small test set with 270 files, but when the number of file gets bigger, the program gives out this error (Java heap space 2G):
raise OSError('Java command failed : ' + str(cmd))
OSError: Java command failed
Note that this error does not occur immediately after the execution, it happens after some time of running. I really don't know the reason. Is this because my 3000 - 4000 words are too much ? Thank you very much for help !(Sorry for the bad edition, the error information is too long)
Here is my solution to the code,after I too faced the error.Basically increasing JAVA heapsize solved it.
import os
java_path = "C:\\Program Files\\Java\\jdk1.8.0_102\\bin\\java.exe"
os.environ['JAVAHOME'] = java_path
from nltk.tag.stanford import StanfordPOSTagger
path_to_model = "stanford-postagger-2015-12-09/models/english-bidirectional-distsim.tagger"
path_to_jar = "stanford-postagger-2015-12-09/stanford-postagger.jar"
tagger=StanfordPOSTagger(path_to_model, path_to_jar)
tagger.java_options='-mx4096m' ### Setting higher memory limit for long sentences
sentence = 'This is testing'
print tagger.tag(sentence.split())
I assume you have tried increasing the Java stack via the Tagger settings like so
stanford.POSTagger([...], java_options="-mxSIZEm")
Cf the docs, default is 1000:
def __init__(self, [...], java_options='-mx1000m')
In order to test if it is a problem with the size of the dataset, you can tokenize your text into sentences, e.g. using the Punkt Tokenizer and output them right after tagging.
I am preparing an R wrapper for a java code that I didn't write myself (and in fact I don't know java). I am trying to use rJava for the first time and I am struggling to get the .jcall right.
Here is an extract of the java code for which I write a wrapper:
public class Model4R{
[...cut...]
public String[][] runModel(String dir, String initFileName, String[] variableNames, int numSims) throws Exception {
[...cut...]
dir and initFileName are character strings for the directory and file name with initial conditions, variable names is a list of character strings that I would write like this in R: c("var1", "var2", "var3", ...) and can be of length from one to five. Finally, numSim is an integer.
Here is my tentative R code for a wrapper function:
runmodel <- function(dir, inFile, varNames, numSim){
hjw <- .jnew("Model4R")
out <- .jcall(hjw, "[[Ljava/lang/String", "runModel", as.character(dir), as.character(inFile), as.vector(varNames), as.integer(numSim))
return(out)
}
The error in R is:
Error in .jcall(hjw, "[[Ljava/lang/String", "runModel", as.character(dir),
: method runModel with signature (Ljava/lang/String;Ljava/lang/String;Ljava/lang/String;II)[[Ljava/lang/String not found
I suspect that the JNI type isn't correct for String[][]. Anyhow, any help that could direct me towards a solution would be welcome!
You're missing a semicolon at the end of the JNI for String[][] - it should be "[[Ljava/lang/String;". Also, I think you need to call .jarray instead of as.vector on varNames. The R error is telling you that rJava thinks the class of the third argument is Ljava/lang/String; instead of [Ljava/lang/String;.
I have been using the JavaImp.vim script for auto importing Java statements in VIM
But trying out different directories in the JavaImpPaths, I am still unable to make JavaImp parse the Java files in the source to make auto imports possible
this is how my .vimrc looks like
let g:JavaImpPaths = "~/Documents/android-sdks/sources/android-21/android/content/"
let g:JavaImpClassList = "~/.vim/JavaImp/JavaImp.txt"
let g:JavaImpJarCache = "~/.vim/JavaImp/cache/"
This is what I get running JIG in new Vim window
:JIG
Do you want to create the directory ~/.vim/JavaImp/cache/?
Searching in path (package): ~/Documents/android-sdks/sources/android-21/android
/content/ ()
Sorting the classes, this may take a while ...
Assuring uniqueness...
Error detected while processing function <SNR>10_JavaImpGenerate:
line 75:
E37: No write since last change (add ! to override)
Done. Found 1 classes (0 unique)
Press ENTER or type command to continue
It might be late, but if anyone else comes along this might help them...
I got it working with the following changes to the script:
line 181 from
close
to
close!
And lines 207/208 from
let l:javaList = glob(a:cpath . "/**/*.java", 1, 1)
let l:clssList = glob(a:cpath . "/**/*.class", 1, 1)
to
let l:javaList = split(glob(a:cpath . "/**/*.java"), "\n")
let l:clssList = split(glob(a:cpath . "/**/*.class"), "\n")
I have a long log file generated with log4j, 10 threads writing to log.
I am looking for log analyzer tool that could find lines where user waited for a long time (i.e where the difference between log entries for the same thread is more than a minute).
P.S I am trying to use OtrosLogViewer, but it gives filtering by certain values (for example, by thread ID), and does not compare between lines.
PPS
the new version of OtrosLogViewer has a "Delta" column that calculates the difference between adj log lines (in ms)
thank you
This simple Python script may be enough. For testing, I analized my local Apache log, which BTW uses the Common Log Format so you may even reuse it as-is. I simply compute the difference between two subsequent requests, and print the request line for deltas exceeding a certain threshold (1 second in my test). You may want to encapsulate the code in a function which also accepts a parameter with the thread ID, so you can filter further
#!/usr/bin/env python
import re
from datetime import datetime
THRESHOLD = 1
last = None
for line in open("/var/log/apache2/access.log"):
# You may insert here something like
# if not re.match(THREAD_ID, line):
# continue
# Python does not support %z, hence the [:-6]
current = datetime.strptime(
re.search(r"\[([^]]+)]", line).group(1)[:-6],
"%d/%b/%Y:%H:%M:%S")
if last != None and (current - last).seconds > THRESHOLD:
print re.search('"([^"]+)"', line).group(1)
last = current
Based on #Raffaele answer, I made some fixes to work on any log file (skipping lines that doesn't begin with the requested date, e.g. Jenkins console log).
In addition, added Max / Min Threshold to filter out lines base on duration limits.
#!/usr/bin/env python
import re
from datetime import datetime
MIN_THRESHOLD = 80
MAX_THRESHOLD = 100
regCompile = r"\w+\s+(\d\d\d\d-\d\d-\d\d \d\d:\d\d:\d\d).*"
filePath = "C:/Users/user/Desktop/temp/jenkins.log"
lastTime = None
lastLine = ""
with open(filePath, 'r') as f:
for line in f:
regexp = re.search(regCompile, line)
if regexp:
currentTime = datetime.strptime(re.search(regCompile, line).group(1), "%Y-%m-%d %H:%M:%S")
if lastTime != None:
duration = (currentTime - lastTime).seconds
if duration >= MIN_THRESHOLD and duration <= MAX_THRESHOLD:
print ("#######################################################################################################################################")
print (lastLine)
print (line)
lastTime = currentTime
lastLine = line
f.closed
Apache Chainsaw has a time delta column.