Lucene 3.5 Custom Payloads - java

Working with a Lucene index, I have a standard document format that looks something like this:
Name: John Doe
Job: Plumber
Hobby: Fishing
My goal is to append a payload to the job field that would hold additional information about plumbing, for instance a Wikipedia link to the plumbing article. I do not want to put payloads anywhere else. Initially I found an example that covered what I'd like to do, but it used Lucene 2.2 and was never updated to reflect the changes in the token stream API.
After some more research, I came up with this little monstrosity to build a custom token stream for that field.
public static TokenStream tokenStream(final String fieldName, Reader reader, Analyzer analyzer, final String item) {
    final TokenStream ts = analyzer.tokenStream(fieldName, reader);
    TokenStream res = new TokenStream() {
        CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);

        public boolean incrementToken() throws IOException {
            while (true) {
                boolean hasNext = ts.incrementToken();
                if (hasNext) {
                    termAtt.append("test");
                    payAtt.setPayload(new Payload(item.getBytes()));
                }
                return hasNext;
            }
        }
    };
    return res;
}
When I take the token stream and iterate over all the results prior to adding it to a field, I can see that it successfully pairs each term with the payload. After calling reset() on the stream, I add it to a document field and index the document. However, when I print out the document and look at the index with Luke, my custom token stream didn't make the cut. The field name appears correctly, but the term value from the token stream does not appear, and neither view indicates that a payload was successfully attached.
This leads me to two questions. First, did I use the token stream correctly, and if so, why doesn't it tokenize when I add it to the field? Second, if I didn't use the stream correctly, do I need to write my own analyzer? This example was cobbled together using the Lucene StandardAnalyzer to generate the token stream and write the document. I'd like to avoid writing my own analyzer if possible, because I only wish to append the payload to one field!
Edit:
Calling code
TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
while (ts.incrementToken()) {
    System.out.println("Term = " + cta.toString());
    System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
ts.reset();

It's very hard to tell why the payloads are not saved; the reason may lie in the code that uses the method you presented.
The most convenient way to set payloads is in a TokenFilter -- I think taking this approach will give you much cleaner code and, in turn, make your scenario work correctly. It's most illustrative to take a look at a filter of this type in the Lucene source, e.g. TokenOffsetPayloadTokenFilter. You can find an example of how it should be used in the test for that class.
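For instance, here is a minimal sketch of such a filter (the class name and the idea of passing the link in through the constructor are my own assumptions, not taken from the Lucene sources):

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

public final class LinkPayloadTokenFilter extends TokenFilter {
    private final PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
    private final byte[] payloadBytes;

    public LinkPayloadTokenFilter(TokenStream input, String payloadValue) {
        super(input);
        this.payloadBytes = payloadValue.getBytes();
    }

    @Override
    public boolean incrementToken() throws IOException {
        // pass every token through unchanged, attaching the payload to each one
        if (!input.incrementToken()) {
            return false;
        }
        payAtt.setPayload(new Payload(payloadBytes));
        return true;
    }
}

You could then add the field with the Field(String, TokenStream) constructor, e.g. new Field("job", new LinkPayloadTokenFilter(analyzer.tokenStream("job", reader), wikiLink)), so that only the job field carries payloads.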
Please also consider whether there is a better place to store these hyperlinks than in payloads. Payloads have a very specific application, e.g. boosting some terms depending on their location or formatting in the original document, or on their part of speech. Their main purpose is to affect how search is performed, so they are normally numeric values, efficiently packed to cut down the index size.
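For what it's worth, this is roughly how payloads are normally consumed at search time in Lucene 3.x (a sketch using the question's job field):

// Sketch: payloads normally feed into scoring via a payload-aware query.
org.apache.lucene.search.payloads.PayloadTermQuery query =
    new org.apache.lucene.search.payloads.PayloadTermQuery(
        new org.apache.lucene.index.Term("job", "plumber"),
        new org.apache.lucene.search.payloads.MaxPayloadFunction());
// Each matching term's payload is passed to Similarity.scorePayload(...),
// which is why payloads are usually packed numeric values rather than text.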

I might be missing something, but...
You don't need a custom tokenizer to associate additional information with a Lucene document. Just store it in a separate field that is stored but not indexed.
doc.add(new Field("fname", "Joe", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("job", "Plumber", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("link", "http://www.example.com", Field.Store.YES, Field.Index.NO));
You can then get the "link" field just like any other field.
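For example, at search time (a sketch: searcher and scoreDoc stand in for an existing IndexSearcher and one of its hits):

// Reading the stored "link" field back from a search hit.
org.apache.lucene.document.Document hit = searcher.doc(scoreDoc.doc);
String link = hit.get("link"); // "http://www.example.com"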
Also, if you did need a custom tokenizer, then you would definitely need a custom analyzer to wire it in, for both index building and searching.
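If you do go that route, a rough sketch of the per-field wiring in Lucene 3.5 could look like this (class and field names are illustrative; LinkPayloadTokenFilter refers to the filter sketched above):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

public final class JobPayloadAnalyzer extends Analyzer {
    private final String payloadValue;

    public JobPayloadAnalyzer(String payloadValue) {
        this.payloadValue = payloadValue;
    }

    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream ts = new StandardTokenizer(Version.LUCENE_35, reader);
        ts = new LowerCaseFilter(Version.LUCENE_35, ts);
        // attach payloads only on the "job" field, leaving other fields untouched
        if ("job".equals(fieldName)) {
            ts = new LinkPayloadTokenFilter(ts, payloadValue);
        }
        return ts;
    }
}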

Related

Parsing RDF items

I have a couple lines of (I think) RDF data
<http://www.test.com/meta#0001> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class>
<http://www.test.com/meta#0002> <http://www.test.com/meta#CONCEPT_hasType> "BEAR"^^<http://www.w3.org/2001/XMLSchema#string>
Each line has 3 items in it. I want to pull out the item before and after the URL. So that would result in:
0001, type, Class
0002, CONCEPT_hasType, (BEAR, string)
Is there a library out there (Java or Scala) that would do this split for me, or do I just need to shove String.splits and assumptions into my code?
Most RDF libraries will have something to facilitate this. For example, if you parse your RDF data using Eclipse RDF4J's Rio parser, you will get back each line as a org.eclipse.rdf4j.model.Statement, with a subject, predicate and object value. The subject in both your lines will be an org.eclipse.rdf4j.model.IRI, which has a getLocalName() method you can use to get the part behind the last #. See the Javadocs for more details.
Assuming your data is in N-Triples syntax (which it seems to be given the example you showed us), here's a simple bit of code that does this and prints it out to STDOUT:
// parse the file into a Model object
InputStream in = new FileInputStream(new File("/path/to/rdf-data.nt"));
org.eclipse.rdf4j.model.Model model = Rio.parse(in, "", RDFFormat.NTRIPLES);
for (org.eclipse.rdf4j.model.Statement st : model) {
    org.eclipse.rdf4j.model.Resource subject = st.getSubject();
    if (subject instanceof org.eclipse.rdf4j.model.IRI) {
        System.out.print(((IRI) subject).getLocalName());
    } else {
        System.out.print(subject.stringValue());
    }
    // ... etc for predicate and object (the 2nd and 3rd elements in each RDF statement)
}
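A hedged sketch of that remaining part, matching the desired output from the question (a typed literal prints as its label plus the datatype's local name; st is the statement from the loop above):

org.eclipse.rdf4j.model.IRI predicate = st.getPredicate();
System.out.print(", " + predicate.getLocalName());

org.eclipse.rdf4j.model.Value object = st.getObject();
if (object instanceof org.eclipse.rdf4j.model.Literal) {
    // a typed literal such as "BEAR"^^xsd:string prints as (BEAR, string)
    org.eclipse.rdf4j.model.Literal lit = (org.eclipse.rdf4j.model.Literal) object;
    System.out.println(", (" + lit.getLabel() + ", " + lit.getDatatype().getLocalName() + ")");
} else if (object instanceof org.eclipse.rdf4j.model.IRI) {
    System.out.println(", " + ((org.eclipse.rdf4j.model.IRI) object).getLocalName());
} else {
    System.out.println(", " + object.stringValue());
}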
Update: if you don't want to read the data from a file but simply use a String, you can use a java.io.StringReader instead of an InputStream:
StringReader r = new StringReader("<http://www.test.com/meta#0001> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Class> .");
org.eclipse.rdf4j.model.Model model = Rio.parse(r, "", RDFFormat.NTRIPLES);
Alternatively, if you don't want to parse the data at all and just want to do String processing, there is an org.eclipse.rdf4j.model.util.URIUtil class which you can feed a string; it gives you back the index of the local name part:
String uri = "http://www.test.com/meta#0001";
String localpart = uri.substring(URIUtil.getLocalNameIndex(uri)); // will be "0001"
(disclosure: I am on the RDF4J development team)

Unusual output when using Jsoup in Java

I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
    String html = getUrl("http://www.wikipedia.org/" + url);
    Document doc = Jsoup.parse(html);
    String contentText = doc.select("*").first().text();
    System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given. Thanks in advance.
Here are some suggestions for you. When fetching a general web page that doesn't require HTTP header fields such as a cookie or user-agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This reads the web page using a GET request. When you select elements using *, it matches every element in the document. Hence, calling doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, making the line above redundant
System.out.println(doc.select("*").first() == doc); // prints true: they are the same object
I am assuming that you are just playing around to learn the API. Although selectors are very powerful, a good start would be the general document manipulation functions, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the Document and parse it using your getUrl() function!
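Putting those suggestions together, scrapeTopic() could look something like this (a sketch; the #mw-content-text selector is an assumption about Wikipedia's article markup, as is the en.wikipedia.org/wiki/ URL pattern):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void scrapeTopic(String url) throws java.io.IOException {
    // connect() + get() fetches and parses in one step
    Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/" + url)
                        .userAgent("Mozilla/5.0") // some sites reject the default agent
                        .get();
    // select the article body instead of "*", which matches every element
    String contentText = doc.select("#mw-content-text").text();
    System.out.println(contentText);
}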

How to prevent solr from decoding a url while indexing?

I'm using Solrj to index documents in Solr; one of the fields is a url. While creating the Solr document and subsequently passing it to a SolrServer, I'm not doing any explicit decoding, in order to keep the original format of the url. But once it's indexed, the urls are decoded.
Here's a test example which contains an encoded apostrophe.
http://test.com/test/Help/What%e2%80%99s_N1
In solr index, it's being decoded to
http://test.com/test/Help/What's_N1
Here's some sample code:
SolrServer solrServer = new StreamingUpdateSolrServer(solrPostUrl, solrQueueSize, solrThreads);
SolrInputDocument solrDoc = new SolrInputDocument();
solrDoc.addField("url", "http://test.com/test/Help/What%e2%80%99s_N1");
UpdateResponse solrResponse = solrServer.add(solrDoc);
I looked into the SolrInputDocument object, and it does have the right format, i.e. the encoded version.
I'd appreciate it if someone could provide pointers on this.
Thanks
I think it's because of your tokenizer:
A good general purpose tokenizer that strips many extraneous characters and sets token types to meaningful values. Token types are only useful for subsequent token filters that are type-aware of the same token types. There aren't any filters that use StandardTokenizer's types.
That's from the documentation on StandardTokenizer. You can change all of this behaviour in solr/schema.xml.
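For example, giving the url field a non-analyzed string type in solr/schema.xml keeps the indexed value verbatim (a sketch; your actual type and field names may differ):

<!-- solr.StrField is stored and indexed verbatim: no tokenizer, no filters -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- use it for the url field so the encoded form survives indexing -->
<field name="url" type="string" indexed="true" stored="true"/>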

Externalize XML construction from a stream of CSV in Java

I get a stream of values as CSV; based on some condition, I need to generate an XML document including only a subset of the values from the CSV. For example:
Input : a:value1, b:value2, c:value3, d:value4, e:value5.
if (condition1)
XML O/P = <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
else if (condition2)
XML O/P = <Request><ValueOfB>value2</ValueOfB><ValueOfD>value4</ValueOfD></Request>
I want to externalize the process in such a way that, given a template, the output XML is generated accordingly. String manipulation is the easiest way of implementing this, but I do not want to mess up the XML if special characters appear in the input, etc. Please suggest.
Perhaps you could benefit from a templating engine, something like Apache Velocity.
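A minimal sketch of that route (one caveat relevant to the question: Velocity does not XML-escape references by default, so untrusted input would still need something like its EscapeXmlReference event handler):

import java.io.StringWriter;
import org.apache.velocity.VelocityContext;
import org.apache.velocity.app.Velocity;

public class VelocityXmlSketch {
    public static void main(String[] args) {
        Velocity.init();
        VelocityContext ctx = new VelocityContext();
        ctx.put("a", "value1");
        ctx.put("e", "value5");

        StringWriter out = new StringWriter();
        // the template would normally live in an external file,
        // which is what externalizes the XML construction
        Velocity.evaluate(ctx, out, "csv-to-xml",
            "<Request><ValueOfA>$a</ValueOfA><ValueOfE>$e</ValueOfE></Request>");
        System.out.println(out); // <Request><ValueOfA>value1</ValueOfA>...
    }
}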
I would suggest creating an xsd and using JAXB to create the Java binding classes that you can use to generate the XML.
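As a hedged sketch of the JAXB route (class and element names are illustrative; annotated classes also work without writing an xsd first, though the xsd keeps the contract explicit):

import java.io.StringWriter;
import javax.xml.bind.JAXB;
import javax.xml.bind.annotation.XmlElement;
import javax.xml.bind.annotation.XmlRootElement;

public class JaxbSketch {
    @XmlRootElement(name = "Request")
    public static class Request {
        @XmlElement(name = "ValueOfA") public String valueOfA;
        @XmlElement(name = "ValueOfE") public String valueOfE;
    }

    public static void main(String[] args) {
        Request req = new Request();
        req.valueOfA = "value1";
        req.valueOfE = "value5";

        StringWriter out = new StringWriter();
        JAXB.marshal(req, out); // special characters in the values are escaped
        System.out.println(out); // prints a <Request> document with both values
    }
}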
I recommend my own templating engine (JATL, http://code.google.com/p/jatl/). Although it's geared to (X)HTML, it's also very good at generating XML.
I didn't bother solving the whole problem for you (that is, double splitting the input on "," and then ":"), but this is how you would use JATL:
final String a = "stuff";
HtmlWriter html = new HtmlWriter() {
    @Override
    protected void build() {
        // if condition1
        start("Request").start("ValueOfA").text(a).end().end();
    }
};
//Now write.
StringWriter writer = new StringWriter();
String results = html.write(writer).getBuffer().toString();
Which would generate
<Request><ValueOfA>stuff</ValueOfA></Request>
All the correct escaping is handled for you.

Parsing XML file with preserving information about the line number

I am creating a tool that analyzes some XML files (XHTML files to be precise). The purpose of this tool is not only to validate the XML structure, but also to check the value of some attributes.
So I created my own org.xml.sax.helpers.DefaultHandler to handle events during the XML parsing. One of my requirements is to have the information about the current line number. So I decided to add a org.xml.sax.helpers.LocatorImpl to my own DefaultHandler. This solves almost all my problems, except one regarding the XML attributes.
Let's take an example:
<rootNode>
    <foo att1="val1"/>
    <bar att2="val2"
         answerToEverything="43"
         att3="val3"/>
</rootNode>
One of my rules indicates that if the attribute answerToEverything is defined on the node bar, its value must be 42.
When encountering such XML, my tool should detect an error. As I want to give a precise error message to the user, such as:
Error in file "foo.xhtml", line #4: answerToEverything only allows "42" as a value.
my parser must be able to keep the line number during the parsing, even for attributes. If we consider the following implementation for my own DefaultHandler class:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
    System.out.println("Start element <" + qName + "> at " + locator.getLineNumber() + ":" + locator.getColumnNumber());
    for (int i = 0; i < attributes.getLength(); i++) {
        System.out.println("Att '" + attributes.getQName(i) + "' = '" + attributes.getValue(i) + "' at " + locator.getLineNumber() + ":" + locator.getColumnNumber());
    }
}
then for the node <bar>, it will display the following output:
Start element <bar> at 5:23
Att 'att2' = 'val2' at 5:23
Att 'answerToEverything' = '43' at 5:23
Att 'att3' = 'val3' at 5:23
As you can see, the line number is wrong, because the parser considers the whole node, including its attributes, as one block.
Ideally, if the interface ContentHandler defined startAttribute and startElementBeforeReadingAttributes methods, I wouldn't have any problem here :o)
So my question is how can I solve my problem?
For information, I am using Java 6.
PS: Maybe another title for this question could be Java SAX parsing with attribute parsing events, or something like that...
I think the only way to implement this is to create your own InputStream (or Reader) that counts lines and somehow communicates with your SAX handler. I have not tried to implement this myself, but I believe it is possible. Good luck, and I'd be glad if you succeed and post your results here.
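A rough sketch of that idea, with two loud assumptions: the "communication" is just a shared counter, and, because SAX parsers read ahead into internal buffers, the counter can run ahead of the event currently being handled:

import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.concurrent.atomic.AtomicInteger;

public class LineCountingReader extends FilterReader {
    private final AtomicInteger line; // shared with the SAX handler

    public LineCountingReader(Reader in, AtomicInteger line) {
        super(in);
        this.line = line;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c == '\n') line.incrementAndGet();
        return c;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        int n = super.read(cbuf, off, len);
        for (int i = 0; i < n; i++) {
            if (cbuf[off + i] == '\n') line.incrementAndGet();
        }
        return n;
    }
}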
Look for an open source XML editor; its parser might have this information.
Editors don't use the same kind of parser that an application that just uses XML for data would use. Editors need more information, like, as you say, line numbers, and I would also think information about whitespace characters. A parser for an editor should not lose any information about the characters in the file. That is how you can implement, for example, a format function or "select enclosing element" (Alt-Shift-Up in Eclipse).
In both XmlBeans and JAXB it is possible to preserve line number information. You could consider using one of these tools (it is easier in XmlBeans).
