I want to use Lucene to calculate Precision and Recall.
I did these steps:
Built an index. I ran the Indexer code over the .txt files in C:/inn (there are 4 text files in this folder) and wrote the index to the "outt" folder by setting the index path to C:/outt in the Indexer code.
Created a package called lia.benchmark with a class called "PrecisionRecall" inside it, and added the external jars (right-click --> Java build path --> add external jars): Lucene-benchmark-.3.2.0jar and Lucene-core-3.3.0jar
Set the topics file path in the code to C:/lia2e/src/lia/benchmark/topics.txt, the qrels file path to C:/lia2e/src/lia/benchmark/qrels.txt, and dir to "C:/outt".
Here is the code:
package lia.benchmark;
import java.io.File;
import java.io.PrintWriter;
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.benchmark.quality.*;
import org.apache.lucene.benchmark.quality.utils.*;
import org.apache.lucene.benchmark.quality.trec.*;
public class PrecisionRecall {
  public static void main(String[] args) throws Throwable {
    File topicsFile = new File("C:/lia2e/src/lia/benchmark/topics.txt");
    File qrelsFile = new File("C:/lia2e/src/lia/benchmark/qrels.txt");
    Directory dir = FSDirectory.open(new File("C:/outt"));
    IndexSearcher searcher = new IndexSearcher(dir, true);  // open the index read-only
    String docNameField = "filename";
    PrintWriter logger = new PrintWriter(System.out, true);
    TrecTopicsReader qReader = new TrecTopicsReader();
    QualityQuery qqs[] = qReader.readQueries(
        new BufferedReader(new FileReader(topicsFile)));
    Judge judge = new TrecJudge(new BufferedReader(
        new FileReader(qrelsFile)));
    judge.validateData(qqs, logger);
    QualityQueryParser qqParser = new SimpleQQParser("title", "contents");
    QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
    SubmissionReport submitLog = null;
    QualityStats stats[] = qrun.execute(judge,
        submitLog, logger);
    QualityStats avg = QualityStats.average(stats);
    avg.log("SUMMARY",2,logger, " ");
  }
}
Initialized qrels and topics. The documents folder (C:\inn) holds 4 txt files, 2 of which are relevant to my query (the query is apple), so I filled in qrels and topics accordingly.
The qrels file looks like this:
<num> Number: 0
<title> apple
<desc> Description:
<narr> Narrative:
and the topics file looks like this:
0 0 789.txt 1
0 0 101.txt 1
I also tried the full path format, for example "C:\inn\789.txt" instead of "789.txt",
but the results are still zero:
0 - contents:apple
0 Stats:
Search Seconds: 0.016
DocName Seconds: 0.000
Num Points: 2.000
Num Good Points: 0.000
Max Good Points: 2.000
Average Precision: 0.000
MRR: 0.000
Recall: 0.000
Precision At 1: 0.000
Search Seconds: 0.016
DocName Seconds: 0.000
Num Points: 2.000
Num Good Points: 0.000
Max Good Points: 2.000
Average Precision: 0.000
MRR: 0.000
Recall: 0.000
Precision At 1: 0.000
Can you tell me what I am doing wrong?
I really need to know why the results are zero.
I'm afraid that the qrels.txt format is wrong: the javadoc suggests the following:
Expected input format:
qnum 0 doc-name is-relevant
Two sample lines:
19 0 doc303 1
19 0 doc7295 0
(I know it's 2.3.0 javadoc, but the format wasn't changed in 3.0)
So it seems that you've swapped the files: TrecTopicsReader expects what you have in qrels.txt; TrecJudge expects what you have in topics.txt.
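To make the expected layout concrete, here is a minimal sketch of a line check for the "qnum 0 doc-name is-relevant" format quoted above. QrelsFormatCheck is a hypothetical helper written for illustration. It is not part of Lucene and does not reproduce how TrecJudge parses its input. It only checks that a line has four whitespace-separated fields with "0" in the second position.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: checks lines against the
// "qnum 0 doc-name is-relevant" layout quoted from the javadoc.
class QrelsFormatCheck {

    // True when the line has four whitespace-separated fields
    // and the second field is the literal "0".
    static boolean isQrelsLine(String line) {
        String[] parts = line.trim().split("\\s+");
        return parts.length == 4 && parts[1].equals("0");
    }

    // Collects the non-empty lines that do not match the layout,
    // so a file holding topic markup instead of judgments is easy to spot.
    static List<String> badLines(List<String> lines) {
        List<String> bad = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty() && !isQrelsLine(line)) {
                bad.add(line);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        System.out.println(isQrelsLine("0 0 789.txt 1"));  // prints true
        System.out.println(isQrelsLine("<title> apple"));  // prints false
    }
}
```

Running such a check on the file passed to TrecJudge would flag lines like "<title> apple" right away, which is the symptom of the two files being swapped.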
In my project I will use H2O's machine learning algorithms, but I cannot load the training data.
I tried the following ways.
var f = FileUtils.getFile("D:\\from_2017_2_13\\untitled2\\src\\main\\resources\\extdata\\iris_wheader.csv")
var frame = FrameUtils.parseFrame(Key.make("iris_weather.hex"),f)
The 11111 was printed, then the program kept running and never stopped.
The other way:
var f = FileUtils.getFile("D:\\from_2017_2_13\\untitled2\\src\\main\\resources\\extdata\\iris_wheader.csv")
val parserSetup = H2OFrame.defaultParserSetup()
val f3 = new H2OFrame(parserSetup, f)
This gives the error:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 65535
at water.DKV.get(DKV.java:202)
at water.DKV.get(DKV.java:175)
at water.parser.ParseSetup.createHexName(ParseSetup.java:594)
at water.fvec.H2OFrame.<init>(H2OFrame.scala:56)
at water.fvec.H2OFrame.<init>(H2OFrame.scala:84)
To load data into Scala as H2O Frame you can do the following:
import org.apache.spark.h2o._
import water.support.SparkContextSupport.addFiles
import org.apache.spark.SparkFiles
import java.io.File
val hc = H2OContext.getOrCreate(sc)
addFiles(sc, "/Users/avkashchauhan/smalldata/iris/iris.csv")
val irisData = new H2OFrame(new File(SparkFiles.get("iris.csv")))
Once data is loaded you can see the data frame as below:
scala> irisData
res1: water.fvec.H2OFrame =
Frame key: iris.hex
cols: 5
rows: 150
chunks: 1
size: 2454
Once you have ingested the data frame you can build a model with it. If you are looking for a sample of using the H2O library in Scala, see this blog for a full end-to-end Scala-based deep learning sample in H2O.
I've downloaded some grib data files from here: ftp://data-portal.ecmwf.int/20160721000000/ (file type is .bin) and want to extract the data from this file in my Java application (I want to load the extracted data into a database later). I'm just trying with the file ftp://wmo:essential#data-portal.ecmwf.int/20160721000000/A_HWXE85ECEM210000_C_ECMF_20160721000000_24h_em_ws_850hPa_global_0p5deg_grib2.bin.
Therefore I've created a new Java project and added the two libraries grib-8.0.29.jar and netcdfAll-4.6.6.jar. Documentation for the grib API can be found here: http://www.unidata.ucar.edu/software/decoders/grib/javadoc/. I need to open the downloaded files to get the data. Retrieving some metadata via Grib2Dump seems to work (see below). The Grib2Input instance also says that I have a valid GRIB file of version 2.
Here is my working code for retrieving some metadata:
public static void main(String[] args) throws IOException, InterruptedException {
    File srcDir = new File("C://test//");
    File[] localFiles = srcDir.listFiles();
    for (File tempFile : localFiles) {
        RandomAccessFile raf = new RandomAccessFile(tempFile.getAbsolutePath(), "r");
        System.out.println("======= Grib2GDSVariables ==========");
        Grib2GDSVariables gdsVariables = new Grib2GDSVariables(raf.readBytes(raf.read()));
        System.out.println("Gds key : " + gdsVariables.getGdsKey());
        System.out.println("======= Grib2Input ==========");
        Grib2Input input = new Grib2Input(raf);
        System.out.println("scan : " + input.scan(true, true));
        System.out.println("getGDSs.size: " + input.getGDSs().size());
        System.out.println("getProducts.size: " + input.getProducts().size());
        System.out.println("getRecords.size: " + input.getRecords().size());
        System.out.println("edition: " + input.getEdition());
        System.out.println("======= Grib2Dump ==========");
        Grib2Dump dump = new Grib2Dump();
        dump.gribDump(new String[] {tempFile.getAbsolutePath()});
        System.out.println("======= Grib2ExtractRawData ==========");
        Grib2ExtractRawData extractRawData = new Grib2ExtractRawData(raf);
        extractRawData.main(new String[] {tempFile.getAbsolutePath()});
    }
}
This produces the following output:
======= Grib2GDSVariables ==========
Gds key : -1732955898
======= Grib2Input ==========
scan : true
getGDSs.size: 0
getProducts.size: 0
getRecords.size: 0
edition: 2
======= Grib2Dump ==========
Header : GRIB2
Discipline : 0 Meteorological products
GRIB Edition : 2
GRIB length : 113296
Originating Center : 98 European Center for Medium-Range Weather Forecasts (RSMC)
Originating Sub-Center : 0
Significance of Reference Time : 1 Start of forecast
Reference Time : 2016-07-21T00:00:00Z
Product Status : 0 Operational products
Product Type : 1 Forecast products
Number of data points : 259920
Grid Name : 0 Latitude_Longitude
Grid Shape: 6 Earth spherical with radius of 6,371,229.0 m
Number of points along parallel: 720
Number of points along meridian: 361
Basic angle : 0
Subdivisions of basic angle: -9999
Latitude of first grid point : 90.0
Longitude of first grid point : 0.0
Resolution & Component flags : 48
Winds : True
Latitude of last grid point : -90.0
Longitude of last grid point : 359.5
i direction increment : 0.5
j direction increment : 0.5
Grid Units : degrees
Scanning mode : 0
Product Definition : 2 Derived forecast on all ensemble members at a point in time
Parameter Category : 2 Momentum
Parameter Name : 1 Wind_speed
Parameter Units : m s-1
Generating Process Type : 4 Ensemble Forecast
ForecastTime : 24
First Surface Type : 100 Isobaric surface
First Surface value : 85000.0
Second Surface Type : 255 Missing
Second Surface value : -9.999E-252
======= Grib2ExtractRawData ==========
I have been trying for two days now but couldn't get it to work! I can't obtain the content data (lat, lon, value) from the file...
Can someone give an example in Java?
You shouldn't use the GRIB classes in netCDF-java directly. Instead, open the file through the library's general file-access API (the NetcdfFile class covered in the tutorial linked below).
That will give you access through the CDM, giving you a straightforward interface with variables and attributes. There's a tutorial here: https://www.unidata.ucar.edu/software/thredds/current/netcdf-java/tutorial/NetcdfFile.html
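As a side note on the (lat, lon, value) triples the question asks for: once the values are available as a flat array in file order (an assumption here, however they are read), the coordinates can be derived from the grid description in the Grib2Dump output alone. The sketch below uses only the numbers printed in the question (720 points along a parallel, 361 along a meridian, first point at 90.0 / 0.0, 0.5 degree increments, scanning mode 0). LatLonGridSketch is a hypothetical name and no GRIB library is involved.

```java
// Sketch only: derives (lat, lon) for each position in a flat value array,
// using the grid description printed by Grib2Dump in the question.
class LatLonGridSketch {
    static final int POINTS_PER_PARALLEL = 720;  // "Number of points along parallel"
    static final int POINTS_PER_MERIDIAN = 361;  // "Number of points along meridian"
    static final double FIRST_LAT = 90.0;        // "Latitude of first grid point"
    static final double FIRST_LON = 0.0;         // "Longitude of first grid point"
    static final double STEP = 0.5;              // i and j direction increment

    // Scanning mode 0: each row runs west to east, rows are ordered north to south.
    static double latOf(int index) {
        return FIRST_LAT - (index / POINTS_PER_PARALLEL) * STEP;
    }

    static double lonOf(int index) {
        return FIRST_LON + (index % POINTS_PER_PARALLEL) * STEP;
    }

    public static void main(String[] args) {
        int last = POINTS_PER_PARALLEL * POINTS_PER_MERIDIAN - 1;  // 259919
        System.out.println(latOf(0) + ", " + lonOf(0));        // prints 90.0, 0.0
        System.out.println(latOf(last) + ", " + lonOf(last));  // prints -90.0, 359.5
    }
}
```

The last index maps to -90.0 / 359.5, which agrees with the "Latitude of last grid point" and "Longitude of last grid point" lines in the dump, and 720 x 361 gives the reported 259920 data points.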
I am new to Stanford NLP. I cannot find any good and complete documentation or tutorial. My task is sentiment analysis. I have a very large dataset of product reviews, which I have already split into positive and negative according to the "stars" given by the users. Now I need to find the most frequent positive and negative adjectives to use as the features of my algorithm. I understand how to do tokenization, lemmatization and POS tagging from here. I got files like this.
The review was
Don't waste your money. This is a short DVD and the host is boring and offers information that is common sense to any idiot. Pass on this and buy something else. Very generic
and the output was.
Sentence #1 (6 tokens):
Don't waste your money.
[Text=Do CharacterOffsetBegin=0 CharacterOffsetEnd=2 PartOfSpeech=VBP Lemma=do]
[Text=n't CharacterOffsetBegin=2 CharacterOffsetEnd=5 PartOfSpeech=RB Lemma=not]
[Text=waste CharacterOffsetBegin=6 CharacterOffsetEnd=11 PartOfSpeech=VB Lemma=waste]
[Text=your CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=PRP$ Lemma=you]
[Text=money CharacterOffsetBegin=17 CharacterOffsetEnd=22 PartOfSpeech=NN Lemma=money]
[Text=. CharacterOffsetBegin=22 CharacterOffsetEnd=23 PartOfSpeech=. Lemma=.]
Sentence #2 (21 tokens):
This is a short DVD and the host is boring and offers information that is common sense to any idiot.
[Text=This CharacterOffsetBegin=24 CharacterOffsetEnd=28 PartOfSpeech=DT Lemma=this]
[Text=is CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=VBZ Lemma=be]
[Text=a CharacterOffsetBegin=32 CharacterOffsetEnd=33 PartOfSpeech=DT Lemma=a]
[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]
[Text=and CharacterOffsetBegin=44 CharacterOffsetEnd=47 PartOfSpeech=CC Lemma=and]
[Text=the CharacterOffsetBegin=48 CharacterOffsetEnd=51 PartOfSpeech=DT Lemma=the]
[Text=host CharacterOffsetBegin=52 CharacterOffsetEnd=56 PartOfSpeech=NN Lemma=host]
[Text=is CharacterOffsetBegin=57 CharacterOffsetEnd=59 PartOfSpeech=VBZ Lemma=be]
[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]
[Text=and CharacterOffsetBegin=67 CharacterOffsetEnd=70 PartOfSpeech=CC Lemma=and]
[Text=offers CharacterOffsetBegin=71 CharacterOffsetEnd=77 PartOfSpeech=VBZ Lemma=offer]
[Text=information CharacterOffsetBegin=78 CharacterOffsetEnd=89 PartOfSpeech=NN Lemma=information]
[Text=that CharacterOffsetBegin=90 CharacterOffsetEnd=94 PartOfSpeech=WDT Lemma=that]
[Text=is CharacterOffsetBegin=95 CharacterOffsetEnd=97 PartOfSpeech=VBZ Lemma=be]
[Text=common CharacterOffsetBegin=98 CharacterOffsetEnd=104 PartOfSpeech=JJ Lemma=common]
[Text=sense CharacterOffsetBegin=105 CharacterOffsetEnd=110 PartOfSpeech=NN Lemma=sense]
[Text=to CharacterOffsetBegin=111 CharacterOffsetEnd=113 PartOfSpeech=TO Lemma=to]
[Text=any CharacterOffsetBegin=114 CharacterOffsetEnd=117 PartOfSpeech=DT Lemma=any]
[Text=idiot CharacterOffsetBegin=118 CharacterOffsetEnd=123 PartOfSpeech=NN Lemma=idiot]
[Text=. CharacterOffsetBegin=123 CharacterOffsetEnd=124 PartOfSpeech=. Lemma=.]
Sentence #3 (8 tokens):
Pass on this and buy something else.
[Text=Pass CharacterOffsetBegin=125 CharacterOffsetEnd=129 PartOfSpeech=VB Lemma=pass]
[Text=on CharacterOffsetBegin=130 CharacterOffsetEnd=132 PartOfSpeech=IN Lemma=on]
[Text=this CharacterOffsetBegin=133 CharacterOffsetEnd=137 PartOfSpeech=DT Lemma=this]
[Text=and CharacterOffsetBegin=138 CharacterOffsetEnd=141 PartOfSpeech=CC Lemma=and]
[Text=buy CharacterOffsetBegin=142 CharacterOffsetEnd=145 PartOfSpeech=VB Lemma=buy]
[Text=something CharacterOffsetBegin=146 CharacterOffsetEnd=155 PartOfSpeech=NN Lemma=something]
[Text=else CharacterOffsetBegin=156 CharacterOffsetEnd=160 PartOfSpeech=RB Lemma=else]
[Text=. CharacterOffsetBegin=160 CharacterOffsetEnd=161 PartOfSpeech=. Lemma=.]
Sentence #4 (2 tokens):
Very generic
[Text=Very CharacterOffsetBegin=162 CharacterOffsetEnd=166 PartOfSpeech=RB Lemma=very]
[Text=generic CharacterOffsetBegin=167 CharacterOffsetEnd=174 PartOfSpeech=JJ Lemma=generic]
I have already processed 10000 positive and 10000 negative files like this. Now how can I easily find the most frequent positive and negative features (adjectives)? Do I need to read all the processed output files and build a frequency count of the adjectives myself, or is there an easier way with Stanford CoreNLP?
Here is an example of processing an annotated review and storing the adjectives in a Counter.
In the example the movie review "The movie was great! It was a great film." has a sentiment of "positive".
I would suggest altering my code to load in each file and build an Annotation with the file's text and recording the sentiment for that file.
Then you can go through each file and build up a Counter with positive and negative counts for each adjective.
The final Counter has the adjective "great" with a count of 2.
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.stats.Counter;
import edu.stanford.nlp.stats.ClassicCounter;
import edu.stanford.nlp.util.CoreMap;
import java.util.Properties;
public class AdjectiveSentimentExample {
  public static void main(String[] args) throws Exception {
    Counter<String> adjectivePositiveCounts = new ClassicCounter<String>();
    Counter<String> adjectiveNegativeCounts = new ClassicCounter<String>();
    Annotation review = new Annotation("The movie was great! It was a great film.");
    String sentiment = "positive";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    // run the annotators so the sentences and tokens are available
    pipeline.annotate(review);
    for (CoreMap sentence : review.get(CoreAnnotations.SentencesAnnotation.class)) {
      for (CoreLabel cl : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
        if (cl.get(CoreAnnotations.PartOfSpeechAnnotation.class).equals("JJ")) {
          // count the adjective under the sentiment of the whole review
          if (sentiment.equals("positive")) {
            adjectivePositiveCounts.incrementCount(cl.word());
          } else if (sentiment.equals("negative")) {
            adjectiveNegativeCounts.incrementCount(cl.word());
          }
        }
      }
    }
    System.out.println("positive adjective counts");
    System.out.println(adjectivePositiveCounts);
  }
}
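If the reviews have already been processed into text output like the token lines shown in the question, a plain frequency count over those files is another option. The following is a minimal sketch under that assumption. AdjectiveLineCounter is a hypothetical name, it uses only the Java standard library rather than any CoreNLP API, and it relies on the token lines ending in "PartOfSpeech=JJ Lemma=...]" exactly as in the sample output.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts adjective lemmas in already-processed output lines.
class AdjectiveLineCounter {
    // Matches the tail of a token line tagged JJ, e.g.
    // [Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]
    private static final Pattern ADJECTIVE =
            Pattern.compile("PartOfSpeech=JJ Lemma=(\\S+)\\]");

    static Map<String, Integer> countAdjectives(List<String> outputLines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : outputLines) {
            Matcher m = ADJECTIVE.matcher(line);
            if (m.find()) {
                // add 1 to the running count for this lemma
                counts.merge(m.group(1), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "[Text=short CharacterOffsetBegin=34 CharacterOffsetEnd=39 PartOfSpeech=JJ Lemma=short]",
            "[Text=DVD CharacterOffsetBegin=40 CharacterOffsetEnd=43 PartOfSpeech=NN Lemma=dvd]",
            "[Text=boring CharacterOffsetBegin=60 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=boring]");
        Map<String, Integer> counts = countAdjectives(lines);
        System.out.println(counts.get("short"));   // prints 1
        System.out.println(counts.get("dvd"));     // prints null
    }
}
```

Keeping one map for the positive files and one for the negative files gives the two frequency lists. Like the equals("JJ") test in the answer's code, the pattern counts only the plain JJ tag.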
I have multiple text files that contain information about the popularity of different programming languages in different countries, based on Google searches. I have one text file for each year from 2004 to 2015. I also have a text file that breaks this down by week (called iot.txt), but this file does not include the country.
Example data from 2004.txt:
Region java c++ c# python JavaScript
Argentina 13 14 10 0 17
Australia 22 20 22 64 26
Austria 23 21 19 31 21
Belgium 20 14 17 34 25
Bolivia 25 0 0 0 0
example from iot.txt:
Week java c++ c# python JavaScript
2004-01-04 - 2004-01-10 88 23 12 8 34
2004-01-11 - 2004-01-17 88 25 12 8 36
2004-01-18 - 2004-01-24 91 24 12 8 36
2004-01-25 - 2004-01-31 88 26 11 7 36
2004-02-01 - 2004-02-07 93 26 12 7 37
My problem is that I am trying to write code that will output the number of countries that have exhibited 0 interest in python.
This is my current code for reading the text files, but I'm not sure of the best way to find the number of regions that have 0 interest in python across all the years 2004-2015. At first I thought the best way would be to create a list from all the text files except iot.txt and then search it for any entries that have 0 interest in python, but I have no idea how to do that.
Can anyone suggest a way to do this?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;
public class Starter{
    public static void main(String[] args) throws Exception {
        BufferedReader fh =
                new BufferedReader(new FileReader("iot.txt"));
        //First line contains the language names
        String s = fh.readLine();
        List<String> langs =
                new ArrayList<>(Arrays.asList(s.split("\t")));
        langs.remove(0); //Throw away the first word - "week"
        Map<String,HashMap<String,Integer>> iot = new TreeMap<>();
        while ((s=fh.readLine())!=null)
        {
            String [] wrds = s.split("\t");
            HashMap<String,Integer> interest = new HashMap<>();
            for(int i=0;i<langs.size();i++)
                interest.put(langs.get(i), Integer.parseInt(wrds[i+1]));
            iot.put(wrds[0], interest);
        }
        fh.close();
        //Region data for each year, keyed by year
        Map<Integer,HashMap<String,HashMap<String,Integer>>> regionsByYear = new HashMap<>();
        for (int i=2004;i<2016;i++)
        {
            BufferedReader fh1 =
                    new BufferedReader(new FileReader(i+".txt"));
            String s1 = fh1.readLine(); //Throw away the first line
            HashMap<String,HashMap<String,Integer>> year = new HashMap<>();
            while ((s1=fh1.readLine())!=null)
            {
                String [] wrds = s1.split("\t");
                HashMap<String,Integer>langMap = new HashMap<>();
                for(int j=1;j<wrds.length;j++){
                    langMap.put(langs.get(j-1), Integer.parseInt(wrds[j]));
                }
                year.put(wrds[0], langMap); //Region name -> interest per language
            }
            fh1.close();
            regionsByYear.put(i, year);
        }
    }
}
Create a Map<String, Integer> using a HashMap, and each time you find a new country while scanning the incoming data, add it to the map as country -> 0. Each time you find a usage of python, increment the value.
At the end, loop through the entrySet of the map and, for each entry where e.getValue() is zero, output e.getKey().
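The approach above can be sketched roughly as follows. This is a minimal illustration, not the asker's code: ZeroInterestSketch and its method names are hypothetical, the input is assumed to be tab-separated with a header row containing a "python" column, and a TreeMap is used instead of a HashMap only so that the output order is alphabetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: sums python interest per region over several yearly files.
class ZeroInterestSketch {

    // Adds one file's lines (header first, tab-separated) to the running totals.
    static void addYear(List<String> lines, Map<String, Integer> pythonTotals) {
        int pythonCol = Arrays.asList(lines.get(0).split("\t")).indexOf("python");
        if (pythonCol < 0) {
            throw new IllegalArgumentException("no python column in header");
        }
        for (String line : lines.subList(1, lines.size())) {
            String[] words = line.split("\t");
            // merge() stores the value for a new region, otherwise adds to the total
            pythonTotals.merge(words[0], Integer.parseInt(words[pythonCol]), Integer::sum);
        }
    }

    // Regions whose total over all added files is still zero.
    static List<String> regionsWithZero(Map<String, Integer> pythonTotals) {
        List<String> zero = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pythonTotals.entrySet()) {
            if (e.getValue() == 0) {
                zero.add(e.getKey());
            }
        }
        return zero;
    }

    public static void main(String[] args) {
        // Sample rows taken from the 2004.txt excerpt in the question.
        List<String> year2004 = Arrays.asList(
            "Region\tjava\tc++\tc#\tpython\tJavaScript",
            "Argentina\t13\t14\t10\t0\t17",
            "Australia\t22\t20\t22\t64\t26",
            "Bolivia\t25\t0\t0\t0\t0");
        Map<String, Integer> totals = new TreeMap<>();
        addYear(year2004, totals);
        System.out.println(regionsWithZero(totals));  // prints [Argentina, Bolivia]
    }
}
```

For the real data, addYear would be called once per file from 2004.txt to 2015.txt (for example with the lines returned by java.nio.file.Files.readAllLines), and the size of the list returned by regionsWithZero gives the number of regions.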
So I have some prolog...
cobrakai$more operator.pl
Which defines some infix operators. I run it using SWI prolog and get the following (perfectly expected) results
?- halt.
cobrakai$swipl -s operator.pl
% library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 3,992 bytes
% /Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator.pl compiled 0.00 sec, 992 bytes
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 5.10.5)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?- be(a,c).
?- a be c.
?- +=(a,c).
ERROR: toplevel: Undefined procedure: (+=)/2 (DWIM could not correct goal)
?- halt.
cobrakai$swipl -s operator.pl
% library(swi_hooks) compiled into pce_swi_hooks 0.00 sec, 3,992 bytes
% /Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator.pl compiled 0.00 sec, 1,280 bytes
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 5.10.5)
Copyright (c) 1990-2011 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details.
For help, use ?- help(Topic). or ?- apropos(Word).
?- be(a,c).
?- a be c.
?- +=(a,c).
?- a += c.
?- halt.
However, when I use Tuprolog to process the same file from Java (using the following code)
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import alice.tuprolog.Prolog;
import alice.tuprolog.SolveInfo;
import alice.tuprolog.Theory;
public class Testinfixoperatorconstruction {
    public static void main(String[] args) throws Exception {
        Prolog engine = new Prolog();
        engine.addTheory(new Theory(readFile("/Users/josephreddington/Documents/workspace/com.plancomps.prolog.helloworld/operator.pl")));
        SolveInfo info = engine.solve("be(a,c).");
        info = engine.solve("a be c.");
    }

    // Reads the whole file into one string, keeping the line breaks
    private static String readFile(String file) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(file));
        String line = null;
        StringBuilder stringBuilder = new StringBuilder();
        String ls = System.getProperty("line.separator");
        while ((line = reader.readLine()) != null) {
            stringBuilder.append(line);
            stringBuilder.append(ls);
        }
        reader.close();
        return stringBuilder.toString();
    }
}
The prolog file does not parse - failing on the '+=' token.
Exception in thread "main" alice.tuprolog.InvalidTheoryException: Unexpected token '+='
at alice.tuprolog.TheoryManager.consult(TheoryManager.java:193)
at alice.tuprolog.Prolog.addTheory(Prolog.java:242)
at Testinfixoperatorconstruction.main(Testinfixoperatorconstruction.java:14)
We can try a slightly different approach, adding the operator directly in the java code with...
public static void main(String[] args) throws Exception {
Prolog engine = new Prolog();
engine.getOperatorManager().opNew("be", "xfx", 35);
engine.getOperatorManager().opNew("+=", "xfx", 35);
engine.addTheory(new Theory(
SolveInfo info = engine.solve("be(a,c).");
info = engine.solve("a be c.");
but we get the same error... :(
Can anyone tell me why this is happening? (and solutions would also be welcome).
SWI-Prolog may be too permissive when parsing directives. Try enclosing the operators in parentheses:
Edit: I tried using 2p.jar, which allowed me to spot the problem. You need to quote the operator's atom:
:-op(35,xfx, '+=').
X += Y.
p :- a += b.
The interactive 2p console accepts this syntax. Note that 2p.jar loads the tuProlog libraries by default.