Lucene 4.0 getFieldInfos - java

I am pretty much trying to do this on Lucene 4.0 (Java): How to incorporate multiple fields in QueryParser?
However, I'd like to search on all fields (not all of them are present on every document), and I don't know their names. Here is what I found:
QueryParser queryParser = new MultiFieldQueryParser(Version.LUCENE_29, ir.GetFieldNames(IndexReader.FieldOption.ALL).toArray(), analyzer)
However, getFieldNames() was replaced in 4.0 by "LUCENE-3679 Replace IndexReader.getFieldNames with IndexReader.getFieldInfos".
The problem is that neither getFieldNames nor any other getField* method is defined on IndexReader in 4.0.
I have been looking online for ages for a solution. What am I missing and how can I do this?

FieldInfos are only available on AtomicReader. You can get a FieldInfos view on a composite reader by calling MultiFields.getMergedFieldInfos.


Indexing external text data to lucene index in GraphDB

Is it possible to index data that is external to the RDF?
For example, the RDF contains a triple whose object is a link to an external file. Can the content of that file be indexed instead of the link value?
I suspect that the other answer misunderstood the question. The question refers to external content, i.e., whether GraphDB's Lucene is able to index the content available at http://example.org, rather than the RDF literal associated with it (and then return in searches the triple pointing to that content).
From what I was able to try: no, this is not currently supported.
Absolutely. Lucene is a core part of GraphDB, and it offers the standard functionality that comes with a standalone Lucene. The data will have to be provided as a string literal, for example:
<http://www.example.org/> rdfs:label "An example webpage url."@en .
Then you can configure a Lucene Index:
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
luc:index luc:setParam "uris" .
luc:include luc:setParam "literals" .
luc:moleculeSize luc:setParam "1" .
luc:includePredicates luc:setParam "http://www.w3.org/2000/01/rdf-schema#label" .
}
And once you have the configuration, you can create the index.
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
INSERT DATA {
luc:myTestIndex luc:createIndex "true" .
}
And, given the index and your data, you can query it.
PREFIX luc: <http://www.ontotext.com/owlim/lucene#>
SELECT * {
?subj luc:myTestIndex "web*"
}
Since you are asking about the subject of something which contains the string web*, you'll get <http://www.example.org/>. If you had other triples linking to this one, they might have also appeared.
More information about the way in which GraphDB interacts with Lucene and its Full-Text-Search capabilities can be found within the GraphDB documentation.

Issue with comparing XML documents in Java using oracle.xml.differ.XMLDiff

I have an issue trying to compare 2 XML documents in Java, using oracle.xml.differ.XMLDiff. The code is fully implemented and I expected it to be working fine, until I discovered an attribute change is not picked up in some instances. To demonstrate this, I have the following:
Setup:
DOMParser parser = new DOMParser();
parser.setPreserveWhitespace(false);
parser.parse(isCurrent);
XMLDocument currentXmlDoc = parser.getDocument();
parser.parse(isPrior);
XMLDocument priorXmlDoc = parser.getDocument();
XMLDiff xmlDiff = new XMLDiff();
xmlDiff.setDocuments(currentXmlDoc, priorXmlDoc);
In the first case, the attribute change in Strike is picked up fine. I have the following 2 XML files:
XML1
<Periods>
<Period Start="2011-03-28" End="2011-04-17" AverageStart="" AverageEnd="" Notional="6000000.0000" OptionType="Swap" Payment="2011-04-19" Strike="72.0934800" Underlying="ZA" ResetStrike="No" ResetNotional="No" QuotingDate="2011-04-17" Multiplier="1.000000" PlusConstant="0.000000" StopLossPercent="" StopLossLevel=""/>
</Periods>
XML2
<Periods>
<Period Start="2011-03-28" End="2011-04-17" AverageStart="" AverageEnd="" Notional="6000000.0000" OptionType="Swap" Payment="2011-04-19" Strike="0.0000000" Underlying="ZA" ResetStrike="No" ResetNotional="No" QuotingDate="2011-04-17" Multiplier="1.000000" PlusConstant="0.000000" StopLossPercent="" StopLossLevel=""/>
</Periods>
In the second case, the attribute change in Strike is not picked up. I have the following 2 XML files:
XML1
<Periods>
<Period Start="2011-03-28" End="2011-04-30" Payment="2011-05-02" Notional="5220000.000000" Strike="176.201900" StopLossPercent="" StopLossLevel=""/>
</Periods>
XML2
<Periods>
<Period Start="2011-03-28" End="2011-04-30" Payment="2011-05-02" Notional="5220000.000000" Strike="0.000000" StopLossPercent="" StopLossLevel=""/>
</Periods>
Does anyone know if I'm doing something wrong, or is there a bug in the XMLDiff package?
Alternatively, does anyone know a different tool that can be used in the same way, just identifying differences in nodes and attributes between XML files, regardless of the order?
Thanks,
Milena
UPDATE: As it's extremely time-consuming to get new external packages approved for use in our system, in the ideal case I'd like to find a solution to making oracle.xml.differ.XMLDiff work. Obviously if there really is a bug and this can't be bypassed I'll consider other tools.
UPDATE 2: Since nobody seems to know about the XMLDiff bug, I'll try implementing the suggested XMLUnit package, it should do the trick.
In a unit test I'm using org.custommonkey.xmlunit.Diff to compare XML content. See http://xmlunit.sourceforge.net/api/org/custommonkey/xmlunit/Diff.html
I'm comparing XML strings, but you can also compare W3C XML documents. I hope you can convert your XMLDocument to either a String or an org.w3c.dom.Document.
My test case looks like this:
String actualXML = SomeClass.getElement().asXML();
String expectedXML = IOUtils.toString(this.getClass().getResourceAsStream("/expected.xml"));
org.custommonkey.xmlunit.Diff myDiff = new Diff(StringUtils.deleteWhitespace(expectedXML), StringUtils.deleteWhitespace(actualXML));
assertTrue(MessageFormat.format("XML must be similar: {0}\nActual XML:\n{1}\n", myDiff, actualXML), myDiff.similar());
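P.S. I also use the Apache Commons StringUtils.deleteWhitespace() method, because I'm not interested in whitespace differences. Be aware that deleteWhitespace() removes every whitespace character, including the spaces that separate attributes, so for documents with attributes XMLUnit.setIgnoreWhitespace(true) is the safer way to ignore whitespace.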
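If getting XMLUnit approved takes a while, an order-independent attribute comparison can also be sketched with nothing but the JDK's DOM classes. The following is a minimal sketch, not a replacement for XMLDiff or XMLUnit: the class AttributeDiff and its method names are made up for illustration, and it compares only the attributes of the first element with a given tag name.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of oracle.xml.differ or XMLUnit.
class AttributeDiff {

    // Parses the XML string and returns the first element with the given tag name.
    static Element firstElement(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return (Element) doc.getElementsByTagName(tagName).item(0);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // Copies the attributes into a sorted map, so the order in the file is irrelevant.
    static Map<String, String> attributes(Element element) {
        Map<String, String> result = new TreeMap<>();
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            result.put(attrs.item(i).getNodeName(), attrs.item(i).getNodeValue());
        }
        return result;
    }

    // Returns attribute name -> "priorValue -> currentValue" for every attribute
    // whose value differs; an attribute missing on one side appears as "null".
    static Map<String, String> changedAttributes(Element prior, Element current) {
        Map<String, String> before = attributes(prior);
        Map<String, String> after = attributes(current);
        Set<String> names = new TreeSet<>(before.keySet());
        names.addAll(after.keySet());
        Map<String, String> changes = new TreeMap<>();
        for (String name : names) {
            if (!Objects.equals(before.get(name), after.get(name))) {
                changes.put(name, before.get(name) + " -> " + after.get(name));
            }
        }
        return changes;
    }
}
```

Calling changedAttributes on the two Period elements from the second case in the question reports a single entry for Strike, regardless of the order in which the attributes are written.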

From Lucene 2 to Lucene 4

I'm migrating my Java application from Lucene 2 to Lucene 4, and I cannot find any good way to convert my code. I also tried http://lucene.apache.org/core/4_0_0-ALPHA/MIGRATE.html, but the example code in it simply does not work for me (for example, the method reader.termDocsEnum does not exist on IndexReader or DirectoryReader, only on AtomicReader, which I had never heard of).
Given an IndexReader called indexReader, the old code was:
Term find = new Term("field", "value");
TermDocs termDocs = indexReader.termDocs(find);
while (termDocs.next()) {
// TermDocs.doc() returns the document number, not the Document itself
Document d = indexReader.document(termDocs.doc());
// do stuff
}
How can I convert that code?
Thanks!
The following should be relevant to your case:
The docs/positions enums cannot seek to a term. Instead, TermsEnum is able to seek, and then you request the docs/positions enum from that TermsEnum.
I guess you need this:
TermsEnum termsEnum = atomicReader.terms("fieldName").iterator(null); // in Lucene 4.0, iterator takes a TermsEnum to reuse (or null)
BytesRef text = new BytesRef("searchTerm");
if (termsEnum.seekExact(text, true)) {
...
}
The low-level API is now clearly oriented towards atomic (non-composite) readers, because this is the only way to get top performance. You might wrap the composite reader you acquire from the Directory in a SlowCompositeReaderWrapper but, as the class name already warns, it will be slow.

Lucene Highlighter Isn't Matching Prefixes

I'm using Lucene's Highlighter to highlight parts of a string. The code below seems to work fine for finding the stemmed words but not for prefix matching.
EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_34);
QueryParser parser = new QueryParser(Version.LUCENE_30, "", analyzer);
Query query = parser.parse(pQuery);
QueryScorer scorer = new QueryScorer(query);
Fragmenter fragmenter = new SimpleSpanFragmenter(scorer, 40);
Highlighter highlighter = new Highlighter(scorer);
highlighter.setTextFragmenter(fragmenter);
String[] frags = highlighter.getBestFragments(analyzer, "", pText, 4);
I've read in a few different places that I need to call Query.rewrite to get prefix matching to work. That method takes an IndexReader argument, though, and I'm not sure how to get one. None of the examples I've found that call Query.rewrite show where the IndexReader came from. I'll add that this is the only Lucene code I'm using: I'm not using Lucene to do the searching itself, just the highlighting.
How do I create an IndexReader, and is it even possible to create one given the way I'm using Lucene? Or perhaps there's a different way to get it to highlight the prefix matches? I'm very new to Lucene and I'm not sure what all of these pieces do or whether they're all necessary. I've just copied them from various examples I've found online, so if I'm doing anything else wrong, please let me know. Thanks.
Suppose you have the query field:abc*. What query.rewrite basically does is read the index (this is why you need an IndexReader), find all terms that start with abc, and change your query to, for example, field:abc1 field:abc2 field:abc3. If you know the location of the index, you can use IndexReader.open to get an IndexReader. If you don't have an index at all, you should search your pText, find all words that start with abc, and update your query accordingly.
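For the no-index case, the expansion described above can be done in plain Java before the query string reaches the parser. This is only a sketch under assumptions: PrefixExpander is a made-up helper, words are split on anything that is not a letter or digit, and matching is case-insensitive. It does not call Query.rewrite or any other Lucene API.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; it is not part of Lucene.
class PrefixExpander {

    // Collects the distinct words in the text that start with the prefix,
    // lowercased, in order of first appearance.
    static Set<String> matchingTerms(String prefix, String text) {
        String wanted = prefix.toLowerCase(Locale.ROOT);
        Set<String> terms = new LinkedHashSet<>();
        // Assumption: a "word" is a run of letters and digits.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty() && word.startsWith(wanted)) {
                terms.add(word);
            }
        }
        return terms;
    }

    // Builds a space-separated list of concrete terms to use in place of "prefix*".
    static String expand(String prefix, String text) {
        return String.join(" ", matchingTerms(prefix, text));
    }
}
```

The string returned by expand("abc", pText) can then be handed to parser.parse in place of abc*. The expanded words still go through the analyzer when parsed, so with EnglishAnalyzer they will be stemmed like any other query term.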

Saxon 8 (Java version) problem

I'll point out now, that I'm new to using saxon, and I've tried following the docs and examples in the package, but I'm just not having luck with this problem.
Basically, I'm trying to do some xml processing in java using saxon v8. In order to get something working, I took one of the sample files included in the package and modified to my needs. It works so long as I'm not using namespaces, and that is my question. How can I get around the namespace problem? I don't really care to use it, but it exists in my xml, so I either have to use it or ignore it. Either solution is fine.
Anyway, here is my starter code. It doesn't do anything but take an XPath query and try to run it against the hard-coded XML doc.
public static void main(String[] args) {
String query = args[0];
File XMLStream=null;
String xmlFileName="doc.xml";
OutputStream destStream=System.out;
XQueryExpression exp=null;
Configuration C=new Configuration();
C.setSchemaValidation(false);
C.setValidation(false);
StaticQueryContext SQC=new StaticQueryContext(C);
DynamicQueryContext DQC=new DynamicQueryContext(C);
QueryProcessor processor = new QueryProcessor(SQC);
Properties props=new Properties();
try{
exp=processor.compileQuery(query);
XMLStream=new File(xmlFileName);
InputSource XMLSource=new InputSource(XMLStream.toURI().toString());
SAXSource SAXs=new SAXSource(XMLSource);
DocumentInfo DI=SQC.buildDocument(SAXs);
DQC.setContextNode(DI);
SequenceIterator iter = exp.iterator(DQC);
while(true){
Item i = iter.next();
if(i != null){
System.out.println(i.getStringValue());
}
else break;
}
}
catch (Exception e){
System.err.println(e.getMessage());
}
}
An example XML file is here...
<?xml version="1.0"?>
<ns1:animal xmlns:ns1="http://my.catservice.org/">
<cat>
<catId>8889</catId>
<fedStatus>true</fedStatus>
</cat>
</ns1:animal>
If I run this with a query including the namespace, I get an error. For example:
/ns1:animal/cat/ gives the error: "Prefix ns1 has not been declared".
If I remove the ns1: from the query, it gives me nothing. If I doctor the xml to remove the "ns1:" prepended to "animal" I can run the query /animal/cat/ with success.
Any help would be greatly appreciated. Thanks.
The error message correctly points out that your XPath expression does not indicate what the namespace prefix "ns1" means (what it binds to). Just because the document being processed happens to declare a binding for "ns1" does not mean that binding should be used: in XML it is the namespace URI that matters, and prefixes are just convenient shortcuts for the real thing.
So: how do you define the binding? There are two generic ways: either provide a context that can resolve the prefix, or embed the actual URI within the XPath expression.
Regarding the first approach, this email from the Saxon author mentions the JAXP method XPath.setNamespaceContext(). Similarly, the Jaxen XPath processor FAQ has some sample code that could help.
That's not very convenient, as you have to implement NamespaceContext, but once you have an implementation you'll be set.
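As a rough illustration of the first approach, here is a minimal NamespaceContext that knows a single prefix, used with the JDK's built-in JAXP XPath engine rather than Saxon 8's own query classes. SinglePrefixContext and NamespaceXPathDemo are names invented for this sketch, and a complete NamespaceContext would also handle the reserved xml and xmlns prefixes.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical minimal context: resolves exactly one prefix to one namespace URI.
class SinglePrefixContext implements NamespaceContext {
    private final String prefix;
    private final String uri;

    SinglePrefixContext(String prefix, String uri) {
        this.prefix = prefix;
        this.uri = uri;
    }

    @Override
    public String getNamespaceURI(String requestedPrefix) {
        return prefix.equals(requestedPrefix) ? uri : XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceUri) {
        return uri.equals(namespaceUri) ? prefix : null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceUri) {
        return uri.equals(namespaceUri)
                ? Collections.singletonList(prefix).iterator()
                : Collections.<String>emptyIterator();
    }
}

// Hypothetical demo class: evaluates an XPath expression against an XML string.
class NamespaceXPathDemo {
    static String evaluate(String xml, String expression, NamespaceContext context) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required, or prefixed names will not match
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(context);
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }
}
```

With the sample document from the question and a SinglePrefixContext("ns1", "http://my.catservice.org/"), the expression /ns1:animal/cat/catId evaluates to 8889. The prefix used in the expression does not have to be the one used in the document; only the URI has to match. When the expression is compiled as XQuery, as in the question's code, declaring the binding in the query prolog should also work: declare namespace ns1 = "http://my.catservice.org/"; /ns1:animal/cat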
So the notation approach... let's see: Top Ten Tips to Using XPath and XPointer shows this example:
to match element declared with namespace like:
xmlns:book="http://my.example.org/namespaces/book"
you use XPath name like:
{http://my.example.org/namespaces/book}section
which hopefully is understood by Saxon (or Jaxen).
Finally, I would recommend upgrading to Saxon9 if possible, if you have any trouble using one of above solutions.
If you want to have something working out of the box, you can check out embedding-xquery-in-java. It is a GitHub project that uses Saxon to evaluate some sample XQuery expressions.
Regards
