I want to read index from my Indexer file.
So the result that i want are all terms of each documents and number of TF-IDF.
Please suggest some example code for me. Thx :)
First things is to get a listing of documents. An alternative might be iterating through indexed terms, but the method IndexReader.terms() appears to have been removed from 4.0 (though it exists in AtomicReader, which could be worth looking at). The best method I'm aware of to get all documents is to simply loop through the documents by the document id:
//where reader is your IndexReader, however you go about opening/managing it
for (int i=0; i<reader.maxDoc(); i++) {
if (reader.isDeleted(i))
continue;
//operate on the document with id = i ...
}
Then you need a listing of all indexed terms. I'm assuming we have no interest in stored fields, since the data you want doesn't make sense for them. For retrieving the terms you can use IndexReader.getTermVectors(int). Note, I'm not actually retrieving the document, since we don't need to access it directly. Continuing from where we left off:
String field;
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
//To Simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity()
//numDocs and maxDoc are not the same thing:
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
for (int i=0; i<maxDoc; i++) {
if (reader.isDeleted(i))
continue;
fieldsiterator = reader.getTermVectors(i).iterator();
while (field = fieldsiterator.next()) {
termsiterator = fieldsiterator.terms().iterator();
while (terms.next()) {
//id = document id, field = field name
//String representations of the current term
String termtext = termsiterator.term().utf8ToString();
//Get idf, using docfreq from the reader.
//I haven't tested this, and I'm not quite 100% sure of the context of this method.
//If it doesn't work, idfalternate below should.
int idf = termsiterator.docfreq();
int idfalternate = freqcalculator.idf(reader.docFreq(field, termsiterator.term()), numDocs);
}
}
}
Related
I want to compute TF-IDF scores, which are normalized by field norm, for each term in a field COMBINED_FIELD of various documents that are found via Lucene. As you can see in the code below, I'm able to get the term frequency for each term in a field of a document, I can also get the document frequency, but I cannot find a way to get the norm of this field at query time. All approaches I have found so far rely on methods that only exist in older Lucene versions, but not for Lucene 6. The way to go might be the usage of LeafReader, but I didn't find a way of getting an instance of it.
Do you have an idea how I could get the norm of the field COMBINED_FIELD for each document?
Or can I use termVector.size() as replacement for the field length? Does size() consider the number of
occurences of each term or is every term counted only once?
Thanks in advance!
IndexSearcher iSearcher = null;
ScoreDoc[] docs = null;
try {
iSearcher = this.searchManager.acquire();
IndexReader reader = iSearcher.getIndexReader();
MultiFieldQueryParser parser = new MultiFieldQueryParser(this.getSearchFields(), this.queryAnalyzer);
parser.setDefaultOperator(QueryParser.Operator.OR);
Query query = parser.parse(QueryParser.escape(searchString));
docs = iSearcher.search(query, maxSearchResultNumber).scoreDocs;
for(int i=0; i < docs.length; i++) {
Terms termVector = reader.getTermVector(docs[i].doc, COMBINED_FIELD);
TermsEnum itr = termVector.iterator();
BytesRef term = null;
PostingsEnum postings = null;
while((term = itr.next()) != null){
String termText = term.utf8ToString();
postings = itr.postings(postings, PostingsEnum.FREQS);
postings.nextDoc();
int tf = postings.freq();
int docFreq = reader.docFreq(new Term(COMBINED_FIELD, term));
//HERE I WANT TO GET THE FIELD LENGTH OF THE CURRENT DOCUMENT
}
}
} catch (Exception e) {
// TODO Auto-generated catch block
e.printStackTrace();
} finally {
try {
this.searchManager.release(iSearcher);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Alternively, is there a way to get the TF-IDF or BM25 value for each term of a field directly from Lucene?
Lucene internally computes the norms during indexing in the method org.apache.lucene.search.similarities.Similarity#computeNorm, then encodes it and stores on the disk in the .nvm files. Later, during querying/scoring it has been decoded only.
I think, one possible way to do it programmatically in the Lucene is to extend Similarity class and somehow get this information during indexing and store somewhere. Doesn't sounds to me like a best way to go, but at least something.
On the other hand, BM25Similarity computes the length this way:
discountOverlaps ? state.getLength() - state.getNumOverlap() : state.getLength();
where getLength() is the number of terms in the field, which you could calculate by iterating in the while as you do in your example.
I want to query multiple candidates for a search string which could look like "My sear foo".
Now I want to look for documents which have a field that contains one (or more) of the entered strings (seen as splitted by whitespaces).
I found some code which allows me to do a search by pattern:
#View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(trim).endKey(trim + "\u9999");
return db.queryView(viewQuery, DeviceEntityCouch.class);
}
which works quite nice for looking just for one pattern. But how do I have to modify my code to get a multiple contains on doc.serialNumber?
EDIT:
This is the current workaround, but there must be a better way i guess.
Also there is only an OR logic. So an entry fits term1 or term2 to be in the list.
#View(name = "find_by_serial_pattern", map = "function(doc) { var i; if(doc.serialNumber) { for(i=0; i < doc.serialNumber.length; i+=1) { emit(doc.serialNumber.slice(i), doc);}}}")
public List<DeviceEntityCouch> findBySerialPattern(String serialNumber) {
String trim = serialNumber.trim();
if (StringUtils.isEmpty(trim)) {
return new ArrayList<>();
}
String[] split = trim.split(" ");
List<DeviceEntityCouch> list = new ArrayList<>();
for (String s : split) {
ViewQuery viewQuery = createQuery("find_by_serial_pattern").startKey(s).endKey(s + "\u9999");
list.addAll(db.queryView(viewQuery, DeviceEntityCouch.class));
}
return list;
}
Looks like you are implementing a full text search here. That's not going to be very efficient in CouchDB (I guess same applies to other databases).
Correct me if I am wrong but from looking at your code looks like you are trying to search a list of serial numbers for a pattern. CouchDB (or any other database) is quite efficient if you can somehow index the data you will be searching for.
Otherwise you must fetch every single record and perform a string comparison on it.
The only way I can think of to optimize this in CouchDB would be the something like the following (with assumptions):
Your serial numbers are not very long (say 20 chars?)
You force the search to be always 5 characters
Generate view that emits every single 5 char long substring from your serial number - more or less this (could be optimized and not sure if I got the in):
...
for (var i = 0; doc.serialNo.length > 5 && i < doc.serialNo.length - 5; i++) {
emit([doc.serialNo.substring(i, i + 5), doc._id]);
}
...
Use _count reduce function
Now the following url:
http://localhost:5984/test/_design/serial/_view/complex-key?startkey=["01234"]&endkey=["01234",{}]&group=true
Will return a list of documents with a hit count for a key of 01234.
If you don't group and set the reduce option to be false, you will get a list of all matches, including duplicates if a single doc has multiple hits.
Refer to http://ryankirkman.com/2011/03/30/advanced-filtering-with-couchdb-views.html for the information about complex keys lookups.
I am not sure how efficient couchdb is in terms of updating that view. It depends on how many records you will have and how many new entries appear between view is being queried (I understand couchdb rebuilds the view's b-tree on demand).
I have generated a view like that that splits doc ids into 5 char long keys. Out of over 1K docs it generated over 30K results - id being 32 char long, simple maths really: (serialNo.length - searchablekey.length + 1) * docscount).
Generating the view took a while but the lookups where fast.
You could generate keys of multiple lengths, etc. All comes down to your records count vs speed of lookups.
I am currently trying to make a naming convention. The idea behind this is parsing.
Lets say I obtain an xml doc. Everything can be used once, but these 2 in the code below can be submitted several times within the xml document. It could be 1, or simply 100.
This states that ItemNumber and ReceiptType will be grabbed for the first element.
ItemNumber1 = eElement.getElementsByTagName("ItemNumber").item(0).getTextContent();
ReceiptType1 = eElement.getElementsByTagName("ReceiptType").item(0).getTextContent();
This one states that it will grab the second submission if they were in their twice.
ItemNumber2 = eElement.getElementsByTagName("ItemNumber").item(1).getTextContent();
ReceiptType2 = eElement.getElementsByTagName("ReceiptType").item(1).getTextContent();
ItemNumber and ReceiptType must both be submitted together. So if there is 30 ItemNumbers, there must be 30 Receipt Types.
However now I would like to set this in an IF statement to create variables.
I was thinking something along the lines of:
int cnt = 2;
if (eElement.getElementsByTagName("ItemNumber").item(cnt).getTextContent();)
**MAKE VARIABLE**
Then make a loop which adds one to count to see if their is a third or 4th. Now here comes the tricky part..I need them set to a generated variable. Example if ItemNumber 2 existed, it would set it to
String ItemNumber2 = eElement.getElementsByTagName("ItemNumber").item(cnt).getTextContent();
I do not wish to make pre-made variable names as I don't want to code a possible 1000 variables if that 1000 were to happen.
KUDOS for anyone who can help or give tips on just small parts of this as in the naming convention etc. Thanks!
You don't know beforehand how many ItemNumbers and ReceiptTypes you'll get ? Maybe consider using two Lists (java.util.List). Here is an example.
boolean finished = ... ; // true if there is no more item to process
List<String> listItemNumbers = new ArrayList<>();
List<String> listReceiptTypes = new ArrayList<>();
int cnt = 0;
while(!finished) {
String itemNumber = eElement.getElementsByTagName("ItemNumber").item(cnt).getTextContent();
String receiptType = eElement.getElementsByTagName("ReceiptType").item(cnt).getTextContent();
listItemNumbers.add(itemNumber);
listReceiptTypes.add(receiptType);
++cnt;
// update 'finished' (to test if there are remaining itemNumbers to process)
}
// use them :
int indexYouNeed = 32; // for example
String itemNumber = listItemNumbers.get(indexYouNeed); // index start from 0
String receiptType = listReceiptTypes.get(indexYouNeed);
Im currently working on a program and any time i call Products[1] there is no null pointer error however, when i call Products[0] or Products[2] i get a null pointer error. However i am still getting 2 different outputs almost like there is a [0] and 1 or 1 and 2 in the array. Here is my code
FileReader file = new FileReader(location);
BufferedReader reader = new BufferedReader(file);
int numberOfLines = readLines();
String [] data = new String[numberOfLines];
Products = new Product[numberOfLines];
calc = new Calculator();
int prod_count = 0;
for(int i = 0; i < numberOfLines; i++)
{
data = reader.readLine().split("(?<=\\d)\\s+|\\s+at\\s+");
if(data[i].contains("input"))
{
continue;
}
Products[prod_count] = new Product();
Products[prod_count].setName(data[1]);
System.out.println(Products[prod_count].getName());
BigDecimal price = new BigDecimal(data[2]);
Products[prod_count].setPrice(price);
for(String dataSt : data)
{
if(dataSt.toLowerCase().contains("imported"))
{
Products[prod_count].setImported(true);
}
else{
Products[prod_count].setImported(false);
}
}
calc.calculateTax(Products[prod_count]);
calc.calculateItemTotal(Products[prod_count]);
prod_count++;
This is the output :
imported box of chocolates
1.50
11.50
imported bottle of perfume
7.12
54.62
This print works System.out.println(Products[1].getProductTotal());
This becomes a null pointer System.out.println(Products[2].getProductTotal());
This also becomes a null pointer System.out.println(Products[0].getProductTotal());
You're skipping lines containing "input".
if(data[i].contains("input")) {
continue; // Products[i] will be null
}
Probably it would be better to make products an ArrayList, and add only the meaningful rows to it.
products should also start with lowercase to follow Java conventions. Types start with uppercase, parameters & variables start with lowercase. Not all Java coding conventions are perfect -- but this one's very useful.
The code is otherwise structured fine, but arrays are not a very flexible type to build from program logic (since the length has to be pre-determined, skipping requires you to keep track of the index, and it can't track the size as you build it).
Generally you should build List (ArrayList). Map (HashMap, LinkedHashMap, TreeMap) and Set (HashSet) can be useful too.
Second bug: as Bohemian says: in data[] you've confused the concepts of a list of all lines, and data[] being the tokens parsed/ split from a single line.
"data" is generally a meaningless term. Use meaningful terms/names & your programs are far less likely to have bugs in them.
You should probably just use tokens for the line tokens, not declare it outside/ before it is needed, and not try to index it by line -- because, quite simply, there should be absolutely no need to.
for(int i = 0; i < numberOfLines; i++) {
// we shouldn't need data[] for all lines, and we weren't using it as such.
String line = reader.readLine();
String[] tokens = line.split("(?<=\\d)\\s+|\\s+at\\s+");
//
if (tokens[0].equals("input")) { // unclear which you actually mean.
/* if (line.contains("input")) { */
continue;
}
When you offer sample input for a question, edit it into the body of the question so it's readable. Putting it in the comments, where it can't be read properly, is just wasting the time of people who are trying to help you.
Bug alert: You are overwriting data:
String [] data = new String[numberOfLines];
then in the loop:
data = reader.readLine().split("(?<=\\d)\\s+|\\s+at\\s+");
So who knows how large it is - depends on the success of the split - but your code relies on it being numberOfLines long.
You need to use different indexes for the line number and the new product objects. If you have 20 lines but 5 of them are "input" then you only have 15 new product objects.
For example:
int prod_count = 0;
for (int i = 0; i < numberOfLines; i++)
{
data = reader.readLine().split("(?<=\\d)\\s+|\\s+at\\s+");
if (data[i].contains("input"))
{
continue;
}
Products[prod_count] = new Product();
Products[prod_count].setName(data[1]);
// etc.
prod_count++; // last thing to do
}
I have this code, it should find a pre known method's name in the chosen file:
String[] sorok = new String[listaZ.size()];
String[] sorokPlusz1 = new String[listaIdeig.size()];
boolean keresesiFeltetel1;
boolean keresesiFeltetel3;
boolean keresesiFeltetel4;
int ind=0;
for (int i = 0; i < listaZ.size(); i++) {
for (int id = 0; id < listaIdeig.size(); id++) {
sorok = listaZ.get(i);
sorokPlusz1 = listaIdeig.get(id);
for (int j = 0; j < sorok.length; j++) {
for (int jj = 1; jj < sorok.length; jj++) {
keresesiFeltetel3 = (sorok[j].equals(oldName)) && (sorokPlusz1[id].startsWith("("));
keresesiFeltetel4 = sorok[j].startsWith(oldNameV3);
keresesiFeltetel1 = sorok[j].equals(oldName) && sorok[jj].startsWith("(");
if (keresesiFeltetel1 || keresesiFeltetel3 || keresesiFeltetel4) {
Array.set(sorok, j, newName);
listaZarojeles.set(i, sorok);
}
}
System.out.println(ind +". index, element: " +sorok[j]);
}
ind++;
}
}
listaZ is an ArrayList, elements spearated by '(' and ' ', listaIdeig is this list, without the first line (because of the keresesifeltetel3)
oldNameV3 is: oldName+ ()
I'd like to find a method's name if this is looking like this:
methodname
() {...
To do this I need the next line in keresesifeltetel 3, but I can't get it working properly. It's not finding anything or dropping errors.
Right now it writes out the input file's element's about 15 times, then it should; and shows error on keresesifeltetel3, and:
Exception in thread "AWT-EventQueue-0" java.lang.ArrayIndexOutOfBoundsException: 0
I think your problem is here: sorokPlusz1[id]. id does not seem to span sorokPlusz1's range. I suspect you want to use jj and that jj should span sorokPlusz1's range instead of sorok's and that sorok[jj].startsWith("(") should be sorokPlusz1[jj].startsWith("(").
But note that I'm largely speculating as I'm not 100% sure what you're trying to do or what listaZ and listaIdeig look like.
You're creating sorok with size = listaZ's size, and then you do this: sorok = listaZ.get(i);. This is clearly not right. Not knowing the exact type of listaZ makes it difficult to tell you what's wrong with it. If it's ArrayList<String[]>, then change
String[] sorok = new String[listaZ.size()]; to String[] sorok = null; or String[] sorok;. If it's ArrayList<String> then you probably want to do something more like sorok[i] = listaZ.get(i);
Now for some general notes about asking questions here: (with some repetition of what was said in the comments) (in the spirit of helping you be successful in getting answers to questions on this site).
Your question is generally unclear. After reading through your question and the code, I still have little idea what you're trying to do and what the input variables (listaZ and listaIdeig) look like.
Using non-English variable names makes it more difficult for any English speaker to help. Even changing sorok to array and keresesiFeltetelX to bX would be better (though still not great). Having long variable names that aren't understandable makes it much more difficult to read.
Comment your code. Enough comments (on almost every line) makes it much easier to understand your code.
Examples. If you have difficulty properly explaining what you want to do (in English), you can always provide a few examples which would assist your explanation a great deal (and doing this is a good idea in general). Note that a good example is both providing the input and the desired output (and the actual output, if applicable).