Parsing XML file with preserving information about the line number

I am creating a tool that analyzes some XML files (XHTML files to be precise). The purpose of this tool is not only to validate the XML structure, but also to check the value of some attributes.
So I created my own org.xml.sax.helpers.DefaultHandler to handle events during the XML parsing. One of my requirements is to know the current line number, so I added an org.xml.sax.helpers.LocatorImpl to my DefaultHandler. This solves almost all my problems, except one regarding the XML attributes.
Let's take an example:
<rootNode>
<foo att1="val1"/>
<bar att2="val2"
answerToEverything="43"
att3="val3"/>
</rootNode>
One of my rules indicates that if the attribute answerToEverything is defined on the node bar, its value should not be different from 42.
When encountering such XML, my tool should detect an error. As I want to give a precise error message to the user, such as:
Error in file "foo.xhtml", line #4: answerToEverything only allow "42" as value.
my parser must be able to keep the line number during the parsing, even for attributes. If we consider the following implementation for my own DefaultHandler class:
public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
System.out.println("Start element <" + qName + "> at " + locator.getLineNumber() + ":" + locator.getColumnNumber());
for (int i = 0; i < attributes.getLength(); i++) {
System.out.println("Att '" + attributes.getQName(i) + "' = '" + attributes.getValue(i) + "' at " + locator.getLineNumber() + ":" + locator.getColumnNumber());
}
}
then for the node <bar>, it will display the following output:
Start element <bar> at 5:23
Att 'att2' = 'val2' at 5:23
Att 'answerToEverything' = '43' at 5:23
Att 'att3' = 'val3' at 5:23
As you can see, the line number is wrong because the parser will consider the whole node, including its attributes as one block.
Ideally, if the interface ContentHandler would have defined the startAttribute and startElementBeforeReadingAttributes methods, I wouldn't have any problem here :o)
So my question is how can I solve my problem?
For information, I am using Java 6
ps: Maybe another title for this question could be Java SAX parsing with attributes parsing events, or something like that...

I think the only way to implement this is to create your own InputStream (or Reader) that counts lines and somehow communicates with your SAX handler. I have not tried to implement this myself, but I believe it is possible. Good luck, and I would be glad if you succeed and post your results here.
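A rough sketch of a related idea, not a tested solution: instead of a counting Reader, keep the source text in memory and, in startElement, scan backwards from the position the Locator reports to find the line of each attribute. The class name AttributeLineFinder and the method lineOfAttribute are made up for illustration. It assumes the parser reports the position just after the closing '>' of the start tag, that lines end with '\n', and that the attribute name does not also appear as text inside another attribute's value in the same tag.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class AttributeLineFinder {

    // Returns the 1-based line of attrName inside the start tag that ends at
    // (endLine, endColumn), or -1 if it cannot be found.
    static int lineOfAttribute(String source, int endLine, int endColumn, String attrName) {
        // Convert the 1-based (line, column) pair into a character offset.
        int offset = 0;
        for (int line = 1; line < endLine; line++) {
            int newline = source.indexOf('\n', offset);
            if (newline < 0) {
                return -1;
            }
            offset = newline + 1;
        }
        offset = Math.min(source.length(), offset + Math.max(0, endColumn - 1));
        // '<' is not allowed inside attribute values, so the nearest '<' opens this tag.
        int tagStart = source.lastIndexOf('<', offset - 1);
        if (tagStart < 0) {
            return -1;
        }
        String tag = source.substring(tagStart, offset);
        Matcher m = Pattern.compile("\\s(" + Pattern.quote(attrName) + ")\\s*=").matcher(tag);
        if (!m.find()) {
            return -1;
        }
        // Count the newlines that precede the attribute name.
        int attrOffset = tagStart + m.start(1);
        int line = 1;
        for (int i = 0; i < attrOffset; i++) {
            if (source.charAt(i) == '\n') {
                line++;
            }
        }
        return line;
    }

    public static void main(String[] args) throws Exception {
        final String xml = "<rootNode>\n<foo att1=\"val1\"/>\n<bar att2=\"val2\"\n"
                + "answerToEverything=\"43\"\natt3=\"val3\"/>\n</rootNode>";
        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                this.locator = locator;
            }

            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                for (int i = 0; i < atts.getLength(); i++) {
                    int line = lineOfAttribute(xml, locator.getLineNumber(),
                            locator.getColumnNumber(), atts.getQName(i));
                    System.out.println(atts.getQName(i) + " is on line " + line);
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
    }
}
```

The backward scan only depends on the end position being inside or just after the tag, so a parser that reports a slightly different column should still work, but that is worth checking against the parser actually in use.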

Look for an open source XML editor, its parser might have this information.
Editors don't use the same kind of parser as an application that only uses XML for data. Editors need more information, such as the line numbers you mention and, I would think, information about whitespace characters. A parser for an editor should not lose any information about the characters in the file. That is how you can implement, for example, a format function or "select enclosing element" (Alt-Shift-Up in Eclipse).

In both XmlBeans and JAXB it is possible to preserve line number information. You could consider using one of these tools (it is easier in XmlBeans).

Related

XML File loses its format after reading and writing in Java

I'm writing a Java program that reads an XML file, makes some modifications, and then writes the file back in the same format.
The following is the code block that reads and writes the XML file:
final Document fileDocument = parseFileAsDocument(file);
final OutputFormat format = new OutputFormat(fileDocument);
try {
final FileWriter out = new FileWriter(file);
final XMLSerializer serializer = new XMLSerializer(out,format);
serializer.serialize(fileDocument);
}
catch (final IOException e) {
System.out.println(e.getMessage());
}
This is the method used to parse the file:
private Document parseFileAsDocument(final File file) {
Document inputDocument = null;
try {
inputDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
} catch (final Exception e) {
// exception handling elided in the original post
}
return inputDocument;
}
I'm noticing two changes after the file is written:
Before I had a node similar to this:
<instance ref='filter'>
<value></value>
</instance>
After reading and writing, the node looks like this:
<instance ref="filter">
<value/>
</instance>
As you can see from above, the 'filter' has been changed to "filter" with double quote.
The second change is <value></value> has been changed to <value/>. This change happens across the XML file whenever we have a node similar to <tag></tag> with no value in between. So if we have something like <tag>somevalue</tag>, there is no issue.
Any thoughts on how to keep the format of the XML nodes the same after writing?
I'd appreciate it!

You can't, and you shouldn't try. It's a bit like complaining that when you add 0123 and 0234, you get 357 without the leading zeroes. Leading zeroes in integers aren't considered significant, so arithmetic operations don't preserve them. The same happens to insignificant details of your XML, like the distinction between double quotes and single quotes, and the distinction between a self-closing tag and a start/end tag pair for an empty element. If any consumer of the XML is depending on these details, they need to be sent for retraining.
The most usual reason for asking for lexical details to be preserved is that you want to detect changes. But this means you are doing your comparisons the wrong way: you should be comparing at the logical level, not the physical level. One way to do comparisons is to canonicalize the XML, so whenever there is an arbitrary choice to be made between equivalent representations, it is made the same way.
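To illustrate comparing at the logical level, here is a minimal sketch that parses both versions into DOM trees and compares them with Node.isEqualNode. Note that this is a stand-in for full canonicalization, not an implementation of Canonical XML, and the class and method names (LogicalXmlCompare, parseRoot, sameContent) are invented for the example. Whitespace between elements still counts, so both inputs need the same layout of text nodes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class LogicalXmlCompare {

    // Parses the text and returns its root element with adjacent text nodes merged.
    static Element parseRoot(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            root.normalize();
            return root;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    // True when both documents have the same elements, attributes and text,
    // regardless of quote style or empty-element syntax.
    static boolean sameContent(String first, String second) {
        return parseRoot(first).isEqualNode(parseRoot(second));
    }

    public static void main(String[] args) {
        String before = "<instance ref='filter'>\n<value></value>\n</instance>";
        String after = "<instance ref=\"filter\">\n<value/>\n</instance>";
        System.out.println(sameContent(before, after));
    }
}
```

The two snippets from the question differ only in quote style and empty-element syntax, which the parser does not expose, so a comparison of this kind treats them as the same document.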

How to retrieve all the elements name from a xml schema

I am having a problem getting the names of schema elements in Java. I am creating a small XML editor which can load an XML schema and validate an XML file against it. I want to parse a schema, get every element's name, and put the names in my content assistant so the user can see all the available elements.
I already read the XSOM User's Guide, but I didn't understand much...
Can someone help me implement my addElementsFromSchema(File xsdfile) function? I got lost trying.
public static void addElementsFromSchema(File xsdfile){
}

It sounds like your primary need, at least for now, is to get the element names. You can get the element names with something like:
XSOMParser parser = new XSOMParser();
parser.parse(xsdfile);
XSSchemaSet schemas = parser.getResult();
Iterator<XSElementDecl> i = schemas.iterateElementDecls();
while (i.hasNext()) {
XSElementDecl element = i.next();
String name = element.getName();
// Add to editor
}
Showing element definitions is a lot more difficult, as element declarations in XML schemas can get quite complex.

Regex Email addresses out of xml

My question: What's a good way to parse the information below?
I have a Java program that gets its input from XML. I have a feature which will send an error email if there was any problem in the processing. Because parsing the XML could be a problem, I want to have a feature that would be able to regex the emails out of the XML (because if parsing was the problem then I couldn't get the error e-mails out of the XML normally).
Requirements:
I want to be able to parse the to, cc, and bcc attributes separately
There are other elements which have to, cc, and bcc attributes
Whitespace does not matter, so my example may show the attributes on a newline, but that's not always the case.
The order of the attributes does not matter.
Here's an example of the xml:
<error_options
to="your_email#your_server.com"
cc="cc_error#your_server.com"
bcc="bcc_error#your_server.com"
reply_to="someone_else#their_server.com"
from="bo_error#some_server.org"
subject="Error running System at ##TIMESTAMP##"
force_send="false"
max_email_size="10485760"
oversized_email_action="zip;split_all"
>
I tried this error_options.{0,100}?to="(.*?)", but that matched me down to reply_to. That made me think there are probably some cases I might miss, which is why I'm posting this as a question.

This piece will put all attributes from your String s="<error_options..." into a map:
Pattern p = Pattern.compile("\\s+?(.+?)=\"(.+?)\\s*?\"",Pattern.DOTALL);
Map<String, String> a = new HashMap<String, String>();
Matcher m = p.matcher(s) ;
while( m.find() ) {
String key = m.group(1).trim() ;
String val = m.group(2).trim() ;
a.put(key, val) ;
}
...then you can extract the values that you're interested in from that map.

This question is similar to "RegEx match open tags except XHTML self-contained tags". Never ever parse XML or HTML with regular expressions. There are many XML parser implementations in Java that do this task properly. Read the document and parse the attributes one by one.
Don't worry if the user's XML is not well-formed; the parsers can handle a lot of sloppiness.
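As a rough illustration of the parser route, the sketch below reads the to, cc and bcc attributes of the first error_options element with the JDK's DOM parser. The class name ErrorOptionsDom and the method recipients are made up for the example, and it assumes the input parses successfully; if it does not, the method throws instead of returning addresses.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ErrorOptionsDom {

    // Returns the to/cc/bcc attributes of the first <error_options> element.
    // Attributes that are absent are left out of the map.
    static Map<String, String> recipients(String xml) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("error_options");
            if (nodes.getLength() > 0) {
                Element options = (Element) nodes.item(0);
                for (String name : new String[] {"to", "cc", "bcc"}) {
                    if (options.hasAttribute(name)) {
                        result.put(name, options.getAttribute(name));
                    }
                }
            }
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<config><other to=\"x\"/>"
                + "<error_options to=\"a#b.com\" cc=\"c#d.com\" reply_to=\"g#h.com\"/></config>";
        System.out.println(recipients(xml));
    }
}
```

Because the lookup goes through the element name, to, cc and bcc attributes on other elements are ignored, and attribute order and whitespace do not matter.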

/<error_options(?=\s)[^>]*?(?<=\n)\s*to="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*cc="([^"]*)"/s;
/<error_options(?=\s)[^>]*?(?<=\n)\s*bcc="([^"]*)"/s;
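For the fallback case where the XML cannot be parsed, a possible Java rendering of the same idea works in two steps: first isolate the error_options start tag, then look for to, cc and bcc inside it. This is only a sketch with invented names (ErrorOptionsRegex, recipients), and the patterns are simplified rather than a translation of the expressions above. It assumes double-quoted values and no '>' inside any attribute value, and it can still be fooled by unusual values.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorOptionsRegex {

    // The whole start tag of error_options, up to the first '>'.
    private static final Pattern TAG = Pattern.compile("<error_options\\s[^>]*>");

    // An attribute named to, cc or bcc. The leading whitespace keeps
    // "reply_to" from being mistaken for "to".
    private static final Pattern ATTR =
            Pattern.compile("\\s(to|cc|bcc)\\s*=\\s*\"([^\"]*)\"");

    static Map<String, String> recipients(String xml) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        Matcher tag = TAG.matcher(xml);
        if (tag.find()) {
            Matcher attr = ATTR.matcher(tag.group());
            while (attr.find()) {
                result.put(attr.group(1), attr.group(2));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<other to=\"x\"/>\n<error_options\nto=\"a#b.com\"\ncc=\"c#d.com\"\n"
                + "bcc=\"e#f.com\"\nreply_to=\"g#h.com\"\n>";
        System.out.println(recipients(xml));
    }
}
```

Restricting the second pattern to the text of the matched tag is what keeps to, cc and bcc attributes of other elements out of the result.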

Extracting XML Attributes in Java(ISBNDB)

So I'm writing an Android app that needs to grab book price data from the web. I found isbndb.com, which seems to provide good resources and price comparison. The only issue is that their XML files are a bit complex.
I am new to parsing XML in Java and don't know too much. I know how to parse basic XML files with simple tags. I usually use DocumentBuilder and DocumentBuilderFactory. However, this is the part of the file which I'm trying to parse.
<Prices price_time="2012-04-08T20:05:49Z">
<Price store_isbn="" store_title="Discworld: Thief of Time" store_url="http://isbndb.com/x/book/thief_of_time/buy/isbn/ebay.html" store_id="ebay" currency_code="USD" is_in_stock="1" is_historic="0" check_time="2008-12-09T12:00:51Z" is_new="0" currency_rate="1" price="0.99"/>
<Price store_isbn="" store_title="" store_url="http://bookshop.blackwell.com/bobus/scripts/home.jsp?action=search&amp;type=isbn&amp;term=0061031321&amp;source=1154376025" store_id="blackwell" currency_code="USD" is_in_stock="0" is_historic="0" is_new="1" check_time="2011-11-08T02:54:15Z" currency_rate="1" price="7.99"/>
</Prices>
What I am trying to do is grab the info in the attribute values such as store_isbn or store_title. If anyone could help me with this I would really appreciate it.
Thanks

You can use the above-mentioned link for parsing XML, and for retrieving the attribute values you can use the following.
public void startElement(String uri, String localName,String qName,
Attributes attributes) throws SAXException {
System.out.println("Start Element :" + attributes.getValue("store_title"));
}
The attributes.getValue("store_title") call returns the value of the store_title attribute of the current element, or null if the element has no such attribute. Hope it helps.
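For completeness, here is a small self-contained sketch of how such a handler could be wired up with the JDK's SAX parser to collect one attribute from every Price element. The class name PriceAttributeReader and the method attributeValues are invented for this example, and the XML in main is a shortened stand-in for the real response.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class PriceAttributeReader {

    // Collects the value of the given attribute from every element with the given name.
    static List<String> attributeValues(String xml, final String element, final String attribute) {
        final List<String> values = new ArrayList<String>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                    Attributes attributes) {
                if (element.equals(qName)) {
                    String value = attributes.getValue(attribute);
                    if (value != null) {
                        values.add(value);
                    }
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        return values;
    }

    public static void main(String[] args) {
        String xml = "<Prices price_time=\"2012-04-08T20:05:49Z\">"
                + "<Price store_id=\"ebay\" price=\"0.99\"/>"
                + "<Price store_id=\"blackwell\" price=\"7.99\"/>"
                + "</Prices>";
        System.out.println(attributeValues(xml, "Price", "store_id"));
    }
}
```

The same method can be called with "store_title", "price" or any other attribute name from the response.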

Lucene 3.5 Custom Payloads

Working with a Lucene index, I have a standard document format that looks something like this:
Name: John Doe
Job: Plumber
Hobby: Fishing
My goal is to append a payload to the job field that would hold additional information about Plumbing, for instance, a wikipedia link to the plumbing article. I do not want to put payloads anywhere else. Initially, I found an example that covered what I'd like to do, but it used Lucene 2.2, and has no updates to reflect the changes in the token stream api.
After some more research, I came up with this little monstrosity to build a custom token stream for that field.
public static TokenStream tokenStream(final String fieldName, Reader reader, Analyzer analyzer, final String item) {
final TokenStream ts = analyzer.tokenStream(fieldName, reader) ;
TokenStream res = new TokenStream() {
CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
PayloadAttribute payAtt = addAttribute(PayloadAttribute.class);
public boolean incrementToken() throws IOException {
while(true) {
boolean hasNext = ts.incrementToken();
if(hasNext) {
termAtt.append("test");
payAtt.setPayload(new Payload(item.getBytes()));
}
return hasNext;
}
}
};
return res;
}
When I take the token stream and iterate over all the results, prior to adding it to a field, I see it successfully paired the term and the payload. After calling reset() on the stream, I add it to a document field and index the document. However, when I print out the document and look at the index with Luke, my custom token stream didn't make the cut. The field name appears correctly, but the term value from the token stream does not appear, nor does either indicate the successful attachment of a payload.
This leads me to two questions. First, did I use the token stream correctly, and if so, why doesn't it tokenize when I add it to the field? Second, if I didn't use the stream correctly, do I need to write my own analyzer? This example was cobbled together using the Lucene standard analyzer to generate the token stream and write the document. I'd like to avoid writing my own analyzer if possible because I only wish to append the payload to one field!
Edit:
Calling code
TokenStream ts = tokenStream("field", new StringReader("value"), a, docValue);
CharTermAttribute cta = ts.getAttribute(CharTermAttribute.class);
PayloadAttribute payload = ts.getAttribute(PayloadAttribute.class);
while(ts.incrementToken()) {
System.out.println("Term = " + cta.toString());
System.out.println("Payload = " + new String(payload.getPayload().getData()));
}
ts.reset();

It's very hard to tell why the payloads are not saved; the reason may lie in the code that calls the method you presented.
The most convenient way to set payloads is in a TokenFilter -- I think that taking this approach will give you much cleaner code and in turn make your scenario work correctly. I think that it's most illustrative to take a look at some filter of this type in Lucene source, e.g. TokenOffsetPayloadTokenFilter. You can find an example of how it should be used in the test for this class.
Please also consider if there is no better place to store these hyperlinks than in payloads. Payloads have very special application for e.g. boosting some terms depending on their location or formatting in the original document, part of speech... Their main purpose is to affect how the search is performed, so they are normally numeric values, efficiently packed to cut down the index size.

I might be missing something, but...
You don't need a custom tokenizer to associate additional information with a Lucene document. Just store it as an unanalyzed field.
doc.add(new Field("fname", "Joe", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("job", "Plumber", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("link","http://www.example.com", Field.Store.YES, Field.Index.NO));
You can then get the "link" field just like any other field.
Also, if you did need a custom tokenizer, then you would definitely need a custom analyzer to implement it, for both the index building and searching.
