How to set PivotTable Field Number Format Cell with Apache POI - java

I'd like to set the number format of the pivot table value field Sum of Balance to # ##0.
The pivot table is created with code based on the official POI sample CreatePivotTable.
The code below creates and retrieves the CTPivotField pivotField, but how do I set its number format?
pivotTable.addColumnLabel(DataConsolidateFunction.SUM, 2);
CTPivotField pivotField = pivotTable
        .getCTPivotTableDefinition()
        .getPivotFields()
        .getPivotFieldArray(2);
In MS Excel this is done with the following steps (see screenshot):
right-click the Sum of Balance value in the pivot table
select Field Settings
click Number...
set Format Cells
Any advice or ideas would be appreciated.

The format of pivot table fields is set with CTDataField.setNumFmtId(long numFmtId) for values and CTPivotField.setNumFmtId(long numFmtId) for columns & rows.
numFmtId is the id of a number format code. The available format codes are listed in the Format Cells dialog, Custom category:
The predefined format codes, thanks to Ji Zhou - MSFT, are listed here:
1 0
2 0.00
3 #,##0
4 #,##0.00
5 $#,##0_);($#,##0)
6 $#,##0_);[Red]($#,##0)
7 $#,##0.00_);($#,##0.00)
8 $#,##0.00_);[Red]($#,##0.00)
9 0%
10 0.00%
11 0.00E+00
12 # ?/?
13 # ??/??
14 m/d/yyyy
15 d-mmm-yy
16 d-mmm
17 mmm-yy
18 h:mm AM/PM
19 h:mm:ss AM/PM
20 h:mm
21 h:mm:ss
22 m/d/yyyy h:mm
37 #,##0_);(#,##0)
38 #,##0_);[Red](#,##0)
39 #,##0.00_);(#,##0.00)
40 #,##0.00_);[Red](#,##0.00)
45 mm:ss
46 [h]:mm:ss
47 mm:ss.0
48 ##0.0E+0
49 @
Full list of predefined format codes in MSDN NumberingFormat Class
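If the exact format you need is not among the predefined codes (the # ##0 format from the question, with a space as thousands separator, is not), you can register a custom format on the workbook and use the id it returns. A minimal sketch, assuming the XSSFWorkbook wb and the setFormatDataField helper from the example below:

// Hedged sketch: obtain (or create) a numFmtId for a custom format string.
// createDataFormat().getFormat(...) returns the built-in id if the format already
// exists, otherwise it registers the format in the workbook styles and returns a new id.
short customFmtId = wb.createDataFormat().getFormat("# ##0");
setFormatDataField(pivotTable, 2, customFmtId);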
Here is an example of applying formats to pivot table fields:
package ru.inkontext.poi;

import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.SpreadsheetVersion;
import org.apache.poi.ss.usermodel.DataConsolidateFunction;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.util.AreaReference;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.usermodel.XSSFPivotTable;
import org.apache.poi.xssf.usermodel.XSSFSheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;
import org.openxmlformats.schemas.spreadsheetml.x2006.main.CTDataFields;

import java.io.FileOutputStream;
import java.io.IOException;
import java.util.List;
import java.util.Optional;

public class CreatePivotTableSimple {

    private static void setFormatPivotField(XSSFPivotTable pivotTable,
                                            long fieldIndex,
                                            Integer numFmtId) {
        Optional.ofNullable(pivotTable
                .getCTPivotTableDefinition()
                .getPivotFields())
                .map(pivotFields -> pivotFields
                        .getPivotFieldArray((int) fieldIndex))
                .ifPresent(pivotField -> pivotField
                        .setNumFmtId(numFmtId));
    }

    private static void setFormatDataField(XSSFPivotTable pivotTable,
                                           long fieldIndex,
                                           long numFmtId) {
        Optional.ofNullable(pivotTable
                .getCTPivotTableDefinition()
                .getDataFields())
                .map(CTDataFields::getDataFieldList)
                .map(List::stream)
                .ifPresent(stream -> stream
                        .filter(dataField -> dataField.getFld() == fieldIndex)
                        .findFirst()
                        .ifPresent(dataField -> dataField.setNumFmtId(numFmtId)));
    }

    public static void main(String[] args) throws IOException, InvalidFormatException {
        XSSFWorkbook wb = new XSSFWorkbook();
        XSSFSheet sheet = wb.createSheet();

        // Create some data to build the pivot table on
        setCellData(sheet);

        XSSFPivotTable pivotTable = sheet.createPivotTable(
                new AreaReference("A1:C6", SpreadsheetVersion.EXCEL2007),
                new CellReference("E3"));

        pivotTable.addRowLabel(1); // set second column as 1st level of rows
        setFormatPivotField(pivotTable, 1, 9); // set format of row field, numFmtId=9 0%
        pivotTable.addRowLabel(0); // set first column as 2nd level of rows
        pivotTable.addColumnLabel(DataConsolidateFunction.SUM, 2); // sum up the third column (Balance)
        setFormatDataField(pivotTable, 2, 3); // set format of value field, numFmtId=3 #,##0

        FileOutputStream fileOut = new FileOutputStream("stackoverflow-pivottable.xlsx");
        wb.write(fileOut);
        fileOut.close();
        wb.close();
    }

    private static void setCellData(XSSFSheet sheet) {
        String[] names = {"Jane", "Tarzan", "Terk", "Kate", "Dmitry"};
        Double[] percents = {0.25, 0.5, 0.75, 0.25, 0.5};
        Integer[] balances = {107634, 554234, 10234, 22350, 15234};

        Row row = sheet.createRow(0);
        row.createCell(0).setCellValue("Name");
        row.createCell(1).setCellValue("Percents");
        row.createCell(2).setCellValue("Balance");
        for (int i = 0; i < names.length; i++) {
            row = sheet.createRow(i + 1);
            row.createCell(0).setCellValue(names[i]);
            row.createCell(1).setCellValue(percents[i]);
            row.createCell(2).setCellValue(balances[i]);
        }
    }
}
https://github.com/stolbovd/PoiSamples

Related

Apply LOOCV in java splitting with a specific condition

I have a CSV file containing 24231 rows. I would like to apply LOOCV based on the project name instead of on the individual observations of the whole dataset.
So if my dataset contains information for 15 projects, I would like the training set to be based on 14 projects and the test set on the remaining project.
I have been relying on Weka's API; is there anything that automates this process?
For non-numeric attributes, Weka allows you to retrieve the unique values via Attribute.numValues() (how many there are) and Attribute.value(int) (the i-th value).
package weka;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

public class LOOByValue {

  /**
   * 1st arg: ARFF file to load
   * 2nd arg: 0-based index in ARFF to use for class
   * 3rd arg: 0-based index in ARFF to use for LOO
   *
   * @param args the command-line arguments
   * @throws Exception if loading/processing of data fails
   */
  public static void main(String[] args) throws Exception {
    // load data
    Instances full = ConverterUtils.DataSource.read(args[0]);
    full.setClassIndex(Integer.parseInt(args[1]));
    int looCol = Integer.parseInt(args[2]);
    Attribute looAtt = full.attribute(looCol);
    if (looAtt.isNumeric())
      throw new IllegalStateException("Attribute cannot be numeric!");

    // iterate unique values to create train/test splits
    for (int i = 0; i < looAtt.numValues(); i++) {
      String value = looAtt.value(i);
      System.out.println("\n" + (i+1) + "/" + full.attribute(looCol).numValues() + ": " + value);
      Instances train = new Instances(full, full.numInstances());
      Instances test = new Instances(full, full.numInstances());
      for (int n = 0; n < full.numInstances(); n++) {
        Instance inst = full.instance(n);
        if (inst.stringValue(looCol).equals(value))
          test.add((Instance) inst.copy());
        else
          train.add((Instance) inst.copy());
      }
      train.compactify();
      test.compactify();

      // TODO do something with the data
      System.out.println("train size: " + train.numInstances());
      System.out.println("test size: " + test.numInstances());
    }
  }
}
With Weka's anneal UCI dataset and the surface-quality attribute for the leave-one-out split, you can generate something like this:
1/5: ?
train size: 654
test size: 244
2/5: D
train size: 843
test size: 55
3/5: E
train size: 588
test size: 310
4/5: F
train size: 838
test size: 60
5/5: G
train size: 669
test size: 229

Output data in Excel not iterating properly in Spring Boot

I am trying to read an Excel file in Spring Boot. The file contains 10 sheets, and the code iterates over all the sheets successfully, but the row headers and cell data are only correct for the first sheet; the 2nd to last sheets all show the first sheet's information.
Also, the output is not well arranged; is there a way to make it cleaner?
Below is the code:
import org.apache.poi.openxml4j.exceptions.InvalidFormatException;
import org.apache.poi.ss.usermodel.*;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestMethod;
import org.springframework.web.bind.annotation.RestController;

import java.io.File;
import java.io.IOException;
import java.util.Iterator;

@RestController
@RequestMapping("datafile")
public class DataController {

    @RequestMapping(value = "getdata", method = RequestMethod.GET)
    public void createBus() throws IOException {
        final String SAMPLE_XLSX_FILE_PATH = "C:\\project\\transita\\src\\main\\resources\\transita.xlsx";

        // Creating a Workbook from an Excel file (.xls or .xlsx)
        Workbook workbook;
        {
            try {
                workbook = WorkbookFactory.create(new File(SAMPLE_XLSX_FILE_PATH));

                // Retrieving the number of sheets in the Workbook
                System.out.println("Workbook has " + workbook.getNumberOfSheets() + " Sheets : ");

                Iterator<Sheet> sheetIterator = workbook.sheetIterator();
                System.out.println("Retrieving Sheets using Iterator");
                while (sheetIterator.hasNext()) {
                    Sheet sheet = sheetIterator.next();
                    System.out.println("=> " + sheet.getSheetName());
                    sheet = workbook.getSheetAt(0);

                    // Create a DataFormatter to format and get each cell's value as String
                    DataFormatter dataFormatter = new DataFormatter();

                    // 1. You can obtain a rowIterator and columnIterator and iterate over them
                    System.out.println("\n\nIterating over Rows and Columns using Iterator\n");
                    Iterator<Row> rowIterator = sheet.iterator();
                    while (rowIterator.hasNext()) {
                        Row row = rowIterator.next();

                        // Now let's iterate over the columns of the current row
                        Iterator<Cell> cellIterator = row.cellIterator();
                        while (cellIterator.hasNext()) {
                            Cell cell = cellIterator.next();
                            String cellValue = dataFormatter.formatCellValue(cell);
                            System.out.print(cellValue + "\t");
                        }
                        System.out.println();
                    }

                    try {
                        workbook.close();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (InvalidFormatException e) {
                e.printStackTrace();
            }
        }
    }
}
Below is the output
1st sheet
=> partners
Iterating over Rows and Columns using Iterator
partner_id partner_code partner_name partner_logo partner_address partner_telephone partner_email partner_website External Agency Oportunity Search sell partner_id
001 ABC ABC Transport Km 5 MCC Uratta Rd, Owerri, Imo State -1111 2348139862090, 0700222872678 info@abctransport.com https://www.abctransport.com TRUE TRUE FALSE FALSE A4
002 CHT Chisco Transport Ltd 104, Funsho Williams Avenue, Iponri, Surulere, . 0816517669, 08089273799, 08113798985 Customercare@chiscogroupng.com https://www.chiscotransport.com.ng TRUE TRUE FALSE FALSE A5
003 LIB Libra Motors NIgeria Ltd Cele Okota Road
Lagos Nigeria 09031565022 info@libmot.com www.libmot.com TRUE TRUE FALSE FALSE A6
004 GIG GIGM Ltd 20 Ikorodu Express Road, Jibowu, Lagos. 08139851110 contact@gigm.com. https://gigm.com/ TRUE FALSE FALSE FALSE A7
005 GUO GUO Jibowu street along ikorodu express, Jibowu, Lagos. 2348144988273 info@guotransport.com https://www.guotransport.com TRUE FALSE FALSE FALSE A8
2nd sheet
=> p_policies
Iterating over Rows and Columns using Iterator
partner_id partner_code partner_name partner_logo partner_address partner_telephone partner_email partner_website External Agency Oportunity Search sell partner_id
001 ABC ABC Transport Km 5 MCC Uratta Rd, Owerri, Imo State -1111 2348139862090, 0700222872678 info@abctransport.com https://www.abctransport.com TRUE TRUE FALSE FALSE A4
002 CHT Chisco Transport Ltd 104, Funsho Williams Avenue, Iponri, Surulere. 0816517669, 08089273799, 08113798985 Customercare@chiscogroupng.com https://www.chiscotransport.com.ng TRUE TRUE FALSE FALSE A5
003 LIB Libra Motors NIgeria Ltd Cele Okota Road
Lagos Nigeria 09031565022 info@libmot.com www.libmot.com TRUE TRUE FALSE FALSE A6
004 GIG GIGM Ltd 20 Ikorodu Express Road, Jibowu, Lagos. 08139851110 contact@gigm.com. https://gigm.com/ TRUE FALSE FALSE FALSE A7
005 GUO GUO Jibowu street along ikorodu express, Jibowu, Lagos. 2348144988273 info@guotransport.com https://www.guotransport.com TRUE FALSE FALSE FALSE A8
3rd sheet
=> schedules
Iterating over Rows and Columns using Iterator
partner_id partner_code partner_name partner_logo partner_address partner_telephone partner_email partner_website External Agency Oportunity Search sell partner_id
001 ABC ABC Transport Km 5 MCC Uratta Rd, Owerri, Imo State -1111 2348139862090, 0700222872678 info@abctransport.com https://www.abctransport.com TRUE TRUE FALSE FALSE A4
002 CHT Chisco Transport Ltd 104, Funsho Williams Avenue, Iponri, Surulere, L. 0816517669, 08089273799, 08113798985 Customercare@chiscogroupng.com https://www.chiscotransport.com.ng TRUE TRUE FALSE FALSE A5
003 LIB Libra Motors NIgeria Ltd Cele Okota Road
Lagos Nigeria 09031565022 info@libmot.com www.libmot.com TRUE TRUE FALSE FALSE A6
004 GIG GIGM Ltd 20 Ikorodu Express Road, Jibowu, Lagos. 08139851110 contact@gigm.com. https://gigm.com/ TRUE FALSE FALSE FALSE A7
005 GUO GUO Jibowu street along ikorodu express, Jibowu, Lagos. 2348144988273 info@guotransport.com https://www.guotransport.com TRUE FALSE FALSE FALSE A8
The problem is in the first portion of code you've posted, which is:
// ...other stuff...
Iterator<Sheet> sheetIterator = workbook.sheetIterator();
System.out.println("Retrieving Sheets using Iterator");
while (sheetIterator.hasNext()) {
    Sheet sheet = sheetIterator.next();
    System.out.println("=> " + sheet.getSheetName());
    sheet = workbook.getSheetAt(0);

    // Create a DataFormatter to format and get each cell's value as String
    DataFormatter dataFormatter = new DataFormatter();
    // ...other stuff...
You're using the sheet iterator properly, retrieving a new Sheet from the iterator at the beginning of your while loop with the following line:
Sheet sheet = sheetIterator.next();
But then, mysteriously, you're overwriting the Sheet the iterator gives you with the Sheet at index zero (the first one of your Workbook) in the offending line of code:
sheet = workbook.getSheetAt(0);
So of course your while loop always iterates over the first Sheet of your Workbook (the Sheet at index zero). Remove that bad, bad line and the problem is solved:
// ...other stuff...
Iterator<Sheet> sheetIterator = workbook.sheetIterator();
System.out.println("Retrieving Sheets using Iterator");
while (sheetIterator.hasNext()) {
    Sheet sheet = sheetIterator.next();
    System.out.println("=> " + sheet.getSheetName());
    // sheet = workbook.getSheetAt(0); <-- BAD, BAD LINE!

    // Create a DataFormatter to format and get each cell's value as String
    DataFormatter dataFormatter = new DataFormatter();
    // ...other stuff...

Java Apache Spark flatMaps & Data Wrangling

I have to pivot the data in a file and then store it in another file. I am having some difficulty pivoting the data.
I have multiple files that contain data looking somewhat like what I show below. The columns are of variable length. I am trying to merge the files first, but for some reason the output is not correct. I haven't even tried the pivot step yet, and am not sure how to approach it either.
How can this be achieved?
File 1:
0,26,27,30,120
201008,100,1000,10,400
201009,200,2000,20,500
201010,300,3000,30,600
File 2:
0,26,27,30,120,145
201008,100,1000,10,400,200
201009,200,2000,20,500,100
201010,300,3000,30,600,150
File 3:
0,26,27,120,145
201008,100,10,400,200
201009,200,20,500,100
201010,300,30,600,150
Output:
201008,26,100
201008,27,1000
201008,30,10
201008,120,400
201008,145,200
201009,26,200
201009,27,2000
201009,30,20
201009,120,500
201009,145,100
.....
I am not quite familiar with Spark, but am trying to use flatMap and flatMapValues. I am not sure how to use them here, and would appreciate some guidance.
import org.apache.commons.lang.StringUtils;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.sql.SparkSession;

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class ExecutionTest {

    public static void main(String[] args) {
        Logger.getLogger("org.apache").setLevel(Level.WARN);
        Logger.getLogger("org.spark_project").setLevel(Level.WARN);
        Logger.getLogger("io.netty").setLevel(Level.WARN);

        log.info("Starting...");

        // Step 1: Create a SparkContext.
        boolean isRunLocally = Boolean.valueOf(args[0]);
        String filePath = args[1];
        SparkConf conf = new SparkConf().setAppName("Variable File").set("serializer",
                "org.apache.spark.serializer.KryoSerializer");
        if (isRunLocally) {
            log.info("System is running in local mode");
            conf.setMaster("local[*]").set("spark.executor.memory", "2g");
        }

        SparkSession session = SparkSession.builder().config(conf).getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

        jsc.textFile(filePath, 2)
                .map(new Function<String, String[]>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public String[] call(String v1) throws Exception {
                        return StringUtils.split(v1, ",");
                    }
                })
                .foreach(new VoidFunction<String[]>() {
                    private static final long serialVersionUID = 1L;

                    @Override
                    public void call(String[] t) throws Exception {
                        for (String string : t) {
                            log.info(string);
                        }
                    }
                });
    }
}
A solution in Scala, as I am not a Java person; you should be able to adapt it, and add sorting, caching, etc.
The data is as follows: 3 files, with a duplicate entry evident; get rid of it if you do not want it.
0, 5,10, 15 20
202008, 5,10, 15, 20
202009,10,20,100,200
8 rows generated above.
0,888,999
202008, 5, 10
202009, 10, 20
4 rows generated above.
0, 5
202009,10
1 row, which is a duplicate.
// Bit lazy with column names, but anyway.
import org.apache.spark.sql.functions.input_file_name
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._
val inputPath: String = "/FileStore/tables/g*.txt"
val rdd = spark.read.text(inputPath)
.select(input_file_name, $"value")
.as[(String, String)]
.rdd
val rdd2 = rdd.zipWithIndex
val rdd3 = rdd2.map(x => (x._1._1, x._2, x._1._2.split(",").toList.map(_.toInt)))
val rdd4 = rdd3.map { case (pfx, pfx2, list) => (pfx,pfx2,list.zipWithIndex) }
val df = rdd4.toDF()
df.show(false)
df.printSchema()
val df2 = df.withColumn("rankF", row_number().over(Window.partitionBy($"_1").orderBy($"_2".asc)))
df2.show(false)
df2.printSchema()
val df3 = df2.withColumn("elements", explode($"_3"))
df3.show(false)
df3.printSchema()
val df4 = df3.select($"_1", $"rankF", $"elements".getField("_1"), $"elements".getField("_2")).toDF("fn", "line_num", "val", "col_pos")
df4.show(false)
df4.printSchema()
df4.createOrReplaceTempView("df4temp")
val df51 = spark.sql("""SELECT hdr.fn, hdr.line_num, hdr.val AS pfx, hdr.col_pos
FROM df4temp hdr
WHERE hdr.line_num <> 1
AND hdr.col_pos = 0
""")
df51.show(100,false)
val df52 = spark.sql("""SELECT t1.fn, t1.val AS val1, t1.col_pos, t2.line_num, t2.val AS val2
FROM df4temp t1, df4temp t2
WHERE t1.col_pos <> 0
AND t1.col_pos = t2.col_pos
AND t1.line_num <> t2.line_num
AND t1.line_num = 1
AND t1.fn = t2.fn
""")
df52.show(100,false)
df51.createOrReplaceTempView("df51temp")
df52.createOrReplaceTempView("df52temp")
val df53 = spark.sql("""SELECT DISTINCT t1.pfx, t2.val1, t2.val2
FROM df51temp t1, df52temp t2
WHERE t1.fn = t2.fn
AND t1.line_num = t2.line_num
""")
df53.show(false)
returns:
+------+----+----+
|pfx |val1|val2|
+------+----+----+
|202008|888 |5 |
|202009|999 |20 |
|202009|20 |200 |
|202008|5 |5 |
|202008|10 |10 |
|202009|888 |10 |
|202008|15 |15 |
|202009|5 |10 |
|202009|10 |20 |
|202009|15 |100 |
|202008|20 |20 |
|202008|999 |10 |
+------+----+----+
What we see is data wrangling: massaging the data into temp views and JOINing them with SQL appropriately.
The key here is knowing how to massage the data to make things easy. Note there is no groupBy etc. With varying row lengths per file, JOINing was not attempted at the RDD level, as it is too inflexible. The rank shows the line number, so you know which line is the first one carrying the 0 header business.
This is what we call data wrangling. It is also what we call hard work for a few points on SO. This is one of my best efforts, and also one of the last of such efforts.
A weakness of the solution is that it takes a lot of work to get the first record of a file; there are alternatives. Preprocessing along the lines of https://www.cyberciti.biz/faq/unix-linux-display-first-line-of-file/ is what I would realistically consider.

Is it possible to create a list in java using data from multiple text files

I have multiple text files that contain information about different programming languages' popularity in different countries, based on Google searches. I have one text file for each year from 2004 to 2015. I also have a text file that breaks this down by week (called iot.txt), but this file does not include the country.
Example data from 2004.txt:
Region java c++ c# python JavaScript
Argentina 13 14 10 0 17
Australia 22 20 22 64 26
Austria 23 21 19 31 21
Belgium 20 14 17 34 25
Bolivia 25 0 0 0 0
etc
example from iot.txt:
Week java c++ c# python JavaScript
2004-01-04 - 2004-01-10 88 23 12 8 34
2004-01-11 - 2004-01-17 88 25 12 8 36
2004-01-18 - 2004-01-24 91 24 12 8 36
2004-01-25 - 2004-01-31 88 26 11 7 36
2004-02-01 - 2004-02-07 93 26 12 7 37
My problem is that I am trying to write code that outputs the number of countries that have exhibited 0 interest in python.
This is my current code for reading the text files, but I'm not sure of the best way to count the regions that have 0 interest in python across all the years 2004-2015. At first I thought the best approach would be to create a list from all the text files (excluding iot.txt) and then search it for any entries with 0 interest in python, but I have no idea how to do that.
Can anyone suggest a way to do this?
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.*;

public class Starter {

    public static void main(String[] args) throws Exception {
        BufferedReader fh =
                new BufferedReader(new FileReader("iot.txt"));

        // First line contains the language names
        String s = fh.readLine();
        List<String> langs =
                new ArrayList<>(Arrays.asList(s.split("\t")));
        langs.remove(0); // Throw away the first word - "week"

        Map<String, HashMap<String, Integer>> iot = new TreeMap<>();
        while ((s = fh.readLine()) != null) {
            String[] wrds = s.split("\t");
            HashMap<String, Integer> interest = new HashMap<>();
            for (int i = 0; i < langs.size(); i++)
                interest.put(langs.get(i), Integer.parseInt(wrds[i + 1]));
            iot.put(wrds[0], interest);
        }
        fh.close();

        HashMap<Integer, HashMap<String, HashMap<String, Integer>>>
                regionsByYear = new HashMap<>();
        for (int i = 2004; i < 2016; i++) {
            BufferedReader fh1 =
                    new BufferedReader(new FileReader(i + ".txt"));
            String s1 = fh1.readLine(); // Throw away the first line

            HashMap<String, HashMap<String, Integer>> year = new HashMap<>();
            while ((s1 = fh1.readLine()) != null) {
                String[] wrds = s1.split("\t");
                HashMap<String, Integer> langMap = new HashMap<>();
                for (int j = 1; j < wrds.length; j++) {
                    langMap.put(langs.get(j - 1), Integer.parseInt(wrds[j]));
                }
                year.put(wrds[0], langMap);
            }
            regionsByYear.put(i, year);
            fh1.close();
        }
    }
}
Create a Map<String, Integer> backed by a HashMap; each time you find a new country while scanning the incoming data, add it to the map as country -> 0. Each time you find some interest in python for that country, increment the value.
At the end, loop through the entrySet of the map and, for each entry where e.getValue() is zero, output e.getKey(). A minimal sketch of this is shown below.
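A sketch of that idea, assuming the regionsByYear map built in the question's code and the lowercase language key "python" from the file headers (the class and method names here are illustrative, not a fixed API):

import java.util.HashMap;
import java.util.Map;

public class PythonInterestCounter {

    // Hedged sketch: count regions whose python interest is 0 across every year.
    // "regionsByYear" is assumed to be the map built in the question's code;
    // the key "python" matches the language name taken from the file headers there.
    static long countZeroInterestRegions(
            Map<Integer, HashMap<String, HashMap<String, Integer>>> regionsByYear) {
        Map<String, Integer> pythonInterest = new HashMap<>();
        for (HashMap<String, HashMap<String, Integer>> year : regionsByYear.values()) {
            for (Map.Entry<String, HashMap<String, Integer>> region : year.entrySet()) {
                pythonInterest.putIfAbsent(region.getKey(), 0);        // new country -> 0
                Integer python = region.getValue().get("python");      // null if the column is absent
                if (python != null) {
                    pythonInterest.merge(region.getKey(), python, Integer::sum);
                }
            }
        }
        long zeroCount = 0;
        for (Map.Entry<String, Integer> e : pythonInterest.entrySet()) {
            if (e.getValue() == 0) {
                System.out.println(e.getKey());                        // region with no python interest
                zeroCount++;
            }
        }
        return zeroCount;
    }
}

In the question's main method this could be called after the year loop, e.g. System.out.println(PythonInterestCounter.countZeroInterestRegions(regionsByYear));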

Precision recall in lucene java

I want to use Lucene to calculate Precision and Recall.
I did these steps:
Made some index files. To do this I used my indexer code and indexed the .txt files that exist in C:/inn (there are 4 text files in this folder), putting the index into the "outt" folder by setting the index path to C:/outt in the indexer code.
Created a package called lia.benchmark and a class inside it called "PrecisionRecall", and added external jars (right-click --> Java Build Path --> Add External JARs): Lucene-benchmark-3.2.0.jar and Lucene-core-3.3.0.jar.
Set the topics file path in the code to C:/lia2e/src/lia/benchmark/topics.txt, the qrels file to C:/lia2e/src/lia/benchmark/qrels.txt, and dir to "C:/outt".
Here is the code:
package lia.benchmark;

import java.io.File;
import java.io.PrintWriter;
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.lucene.search.*;
import org.apache.lucene.store.*;
import org.apache.lucene.benchmark.quality.*;
import org.apache.lucene.benchmark.quality.utils.*;
import org.apache.lucene.benchmark.quality.trec.*;

public class PrecisionRecall {

    public static void main(String[] args) throws Throwable {
        File topicsFile = new File("C:/lia2e/src/lia/benchmark/topics.txt");
        File qrelsFile = new File("C:/lia2e/src/lia/benchmark/qrels.txt");
        Directory dir = FSDirectory.open(new File("C:/outt"));
        IndexSearcher searcher = new IndexSearcher(dir, true);
        String docNameField = "filename";

        PrintWriter logger = new PrintWriter(System.out, true);

        TrecTopicsReader qReader = new TrecTopicsReader();
        QualityQuery qqs[] = qReader.readQueries(
                new BufferedReader(new FileReader(topicsFile)));

        Judge judge = new TrecJudge(new BufferedReader(
                new FileReader(qrelsFile)));
        judge.validateData(qqs, logger);

        QualityQueryParser qqParser = new SimpleQQParser("title", "contents");

        QualityBenchmark qrun = new QualityBenchmark(qqs, qqParser, searcher, docNameField);
        SubmissionReport submitLog = null;
        QualityStats stats[] = qrun.execute(judge,
                submitLog, logger);

        QualityStats avg = QualityStats.average(stats);
        avg.log("SUMMARY", 2, logger, "  ");
        dir.close();
    }
}
Initialized qrels and topics. In the documents folder (C:\inn) I have 4 txt files, 2 of which are relevant to my query (the query is apple), so I filled in qrels and topics.
The qrels file looks like this:
<top>
<num> Number: 0
<title> apple
<desc> Description:
<narr> Narrative:
</top>
and the topics file like this:
0 0 789.txt 1
0 0 101.txt 1
I also tried the path format, for example "C:\inn\789.txt" instead of "789.txt",
but the results are zero:
0 - contents:apple
0 Stats:
Search Seconds: 0.016
DocName Seconds: 0.000
Num Points: 2.000
Num Good Points: 0.000
Max Good Points: 2.000
Average Precision: 0.000
MRR: 0.000
Recall: 0.000
Precision At 1: 0.000
SUMMARY
Search Seconds: 0.016
DocName Seconds: 0.000
Num Points: 2.000
Num Good Points: 0.000
Max Good Points: 2.000
Average Precision: 0.000
MRR: 0.000
Recall: 0.000
Precision At 1: 0.000
Can you tell me what is wrong?
I really need to know why the results are zero.
I'm afraid that the qrels.txt format is wrong: the javadoc suggests the following:
Expected input format:
qnum 0 doc-name is-relevant
Two sample lines:
19 0 doc303 1
19 0 doc7295 0
(I know it's 2.3.0 javadoc, but the format wasn't changed in 3.0)
So it seems that you've swapped the files: TrecTopicsReader expects what you have in qrels.txt; TrecJudge expects what you have in topics.txt.
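In other words, based on the data shown in the question, simply swapping the contents of the two files should give roughly this (a sketch using the question's query and file names):

topics.txt (read by TrecTopicsReader):
<top>
<num> Number: 0
<title> apple
<desc> Description:
<narr> Narrative:
</top>

qrels.txt (read by TrecJudge, one "qnum 0 doc-name is-relevant" line per judged document):
0 0 789.txt 1
0 0 101.txt 1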
