Unmarshaling xml with html entities using JAXB - java

I need to load wikipedia revision histories into POJOs, so I'm using JAXB to unmarshall the wikipeida data dump (well, individual pages of it). The problem is that the text nodes occasionally contain entities that are not defined in the wikipedia xml dump. eg: ° (`°' pleases keep in mind that I do not know the complete set of entities that I need to be able to read. My input file is 3tb, so let's just assume that everything html can render is in there.).
How can I configure JAXB to handle entities that are not valid xml?
Here is the SAX Exception that JAXB throws when it encounters an undefined entity:
Exception in thread "main" javax.xml.bind.UnmarshalException
- with linked exception:
[org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.]
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.createUnmarshalException(AbstractUnmarshallerImpl.java:315)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.createUnmarshalException(UnmarshallerImpl.java:481)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:199)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal(UnmarshallerImpl.java:168)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:137)
at javax.xml.bind.helpers.AbstractUnmarshallerImpl.unmarshal(AbstractUnmarshallerImpl.java:184)
at com.stottlerhenke.tools.wikiparse.WikipediaIO.readPage(WikipediaIO.java:73)
at com.stottlerhenke.tools.wikiparse.WikipediaIO.main(WikipediaIO.java:53)
Caused by: org.xml.sax.SAXParseException: The entity "deg" was referenced, but not declared.
at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanEntityReference(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
at com.sun.xml.internal.bind.v2.runtime.unmarshaller.UnmarshallerImpl.unmarshal0(UnmarshallerImpl.java:195)
Edit: The input that triggered that exception is the complete revision history for the wikipedia article on the Arctic Circle. The XSD used to generate the JAXB classes is here: http://www.mediawiki.org/xml/export-0.3.xsd
Edit: The source of this problem was an error on my part -- I was using an initial extractor that did not maintain encoded entities properly. However, I did find a way around this, should anyone have the problem I thought I had. See below.

Resolving entities is not the job of JAXB's. It's the job of the underlying
XML parser.
What you could do is:
read the data yourself using DOM
replace all unresolved entities by something you wish
then, let JAXB handle the result

This is a hack, but it works in a pinch.
I downloaded the html entity definitions from w3.org, and set the doctype of the input xml file to xhtml-transitional, but directed the doctype url to a local dtd:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "xhtml1-transitional.dtd">
xhtml1-transitional.dtd, in turn, requires:
xhtml-lat1.ent
xhtml-special.ent
xhtml-symbol.ent
which I sucked down and put along side xhtml1-transitional.dtd
(All files are available at: http://www.w3.org/TR/xhtml1/DTD/ )
Like I said, ugly as hell, but it did seem to do the job.

Related

Exception when parsing structure.rdf.u8, using Jena

Model model = ModelFactory.createDefaultModel();
InputStream in = FileManager.get().open( "W:\\structure.rdf.u8" );
model.read(in, null);
model.write(System.out);
I use the above code, which is provided in the Jena documentation, to parse the ODP. First it gave some exception, so I added all the jar files in the Jena package and got the following long exception:
log4j:WARN No appenders could be found for logger (org.apache.jena.riot.system.stream.JenaIOEnvironment).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 5, col: 5 ] {E201} The attributes on this property element, are not permitted with any content; expecting end element tag.
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:128)
at org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.error(LangRDFXML.java:246)
at org.apache.jena.rdfxml.xmlinput.impl.ARPSaxErrorHandler.error(ARPSaxErrorHandler.java:37)
at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:196)
at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:173)
at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.warning(XMLHandler.java:168)
at org.apache.jena.rdfxml.xmlinput.impl.ParserSupport.warning(ParserSupport.java:194)
at org.apache.jena.rdfxml.xmlinput.states.Frame.warning(Frame.java:55)
at org.apache.jena.rdfxml.xmlinput.states.WantEmpty.characters(WantEmpty.java:33)
at org.apache.jena.rdfxml.xmlinput.impl.XMLHandler.characters(XMLHandler.java:137)
at org.apache.xerces.parsers.AbstractSAXParser.characters(Unknown Source)
at org.apache.xerces.impl.XMLNamespaceBinder.characters(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
`
I don't know if I need to remove some of the jar files to fix this or the code provided in the Apache site is wrong?
It's not legal RDF/XML; close but has errors.
(at least the one http://rdf.dmoz.org/rdf/structure.rdf.u8.gz isn't)
Top level 'RDF' is not the RDF marker RDF, it's http://dmoz.org/rdf/RDF It should be r:RDF but then ...
The r namespace is wrong (should be http://www.w3.org/1999/02/22-rdf-syntax-ns#, not http://www.w3.org/TR/RDF/).

Exception while trying to load openNLP POS models

I have been trying to use POS Models for POS tagging, but while loading the Models I get the following exception, and this happens for both maxent as well as perceptron models:
java.io.EOFException: Unexpected end of ZLIB input stream
at java.util.zip.InflaterInputStream.fill(Unknown Source)
at java.util.zip.InflaterInputStream.read(Unknown Source)
at java.util.zip.ZipInputStream.read(Unknown Source)
at java.io.DataInputStream.readFully(Unknown Source)
at java.io.DataInputStream.readLong(Unknown Source)
at java.io.DataInputStream.readDouble(Unknown Source)
at opennlp.model.BinaryFileDataReader.readDouble(BinaryFileDataReader.java:53)
at opennlp.model.AbstractModelReader.readDouble(AbstractModelReader.java:75)
at opennlp.model.AbstractModelReader.getParameters(AbstractModelReader.java:146)
at opennlp.perceptron.PerceptronModelReader.constructModel(PerceptronModelReader.java:69)
at opennlp.model.GenericModelReader.constructModel(GenericModelReader.java:59)
at opennlp.model.AbstractModelReader.getModel(AbstractModelReader.java:87)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:35)
at opennlp.tools.util.model.GenericModelSerializer.create(GenericModelSerializer.java:31)
at opennlp.tools.util.model.BaseModel.loadModel(BaseModel.java:231)
at opennlp.tools.util.model.BaseModel.(BaseModel.java:190)
at opennlp.tools.postag.POSModel.(POSModel.java:86)
at nlpcheck.NlpPOC.POSTag(NlpPOC.java:54)
at nlpcheck.NlpPOC.main(NlpPOC.java:86)
I have tried loading the tokenizaton model (en-token.bin) and Its loading and working fine.
Following is java snippet that I am using to load Model:
InputStream is = new FileInputStream(MODEL_PATH);
POSModel model = new POSModel(is);
I have downloaded the models (en-pos-perceptron.bin, en-pos-maxent.bin) from http://www.opennlp.org/models-1.5/.
It turns out the model file hosted on site mentioned above were corrupt, I was trying a different tool namely GATE(General architecture for Text Engineering) which was using the same model files so I copied them and put them on build path and it worked.

FontFactory (lowagie), Java, getting UnsupportedEncodingException when trying to use UniJIS-UCS2-H (for Japanese)

I am using com.lowagie.text.FontFactory in generating a PDF file and am trying to use a custom font, KozMinPro-Regular, which provides support for Japanese characters, as we have a need to support this. I have found examples from searching that show how to do this similar to how I am doing it below and these examples assume that UniJIS-UCS2-H encoding is supported but when I try this I am getting the exception below that says this encoding is not supported. I would appreciate if anyone may have any insight into this. Thanks
FontFactory.register("/usr/share/fonts/truetype/KozMinPro-Regular.ttf", "JapaneseCompatible");
contentFont = FontFactory.getFont("JapaneseCompatible", "UniJIS-UCS2-H", true, 11, Font.BOLD);
headerFont = FontFactory.getFont("JapaneseCompatible", "UniJIS-UCS2-H", true, 11, Font.BOLD);
The exception I get:
Exception: [.ReportPdfView] Exception caught during generation of pdf file. Cause: UniJIS-UCS2-H
ExceptionConverter: java.io.UnsupportedEncodingException: UniJIS-UCS2-H
at java.lang.StringCoding.encode(StringCoding.java:286)
at java.lang.String.getBytes(String.java:954)
at com.lowagie.text.pdf.PdfEncodings.convertToBytes(Unknown Source)
at com.lowagie.text.pdf.TrueTypeFont.<init>(Unknown Source)
at com.lowagie.text.pdf.BaseFont.createFont(Unknown Source)
at com.lowagie.text.pdf.BaseFont.createFont(Unknown Source)
at com.lowagie.text.pdf.BaseFont.createFont(Unknown Source)
at com.lowagie.text.FontFactoryImp.getFont(Unknown Source)
at com.lowagie.text.FontFactoryImp.getFont(Unknown Source)
at com.lowagie.text.FontFactory.getFont(Unknown Source)
at com.lowagie.text.FontFactory.getFont(Unknown Source)
You need iTextAsian.jar . It gives CJK support.
see...
http://itextpdf.sourceforge.net/ for earlier versions of iText or
http://sourceforge.net/projects/itext/files/extrajars/ for later version of iText.(extrajars.zip contains iTextAsian.jar)

InvalidClassException : class invalid for deserialization

I am using SSA parser library in my project. When I invoke main method of one of it's class using command prompt it works fine on my machine.
I execute following command from command prompt :
java -Xmx800M -cp %1 edu.stanford.nlp.parser.lexparser.LexicalizedParser -retainTMPSubcategories -outputFormat "penn,typedDependenciesCollapsed" englishPCFG.ser.gz %2
But when I tried to use the same class in my java program, I am getting Caused by: java.io.InvalidClassException: edu.stanford.nlp.stats.Counter; edu.stanford.nlp.stats.Counter; class invalid for deserialization exception.
Following line throws error :
LexicalizedParser _parser = new LexicalizedParser("C:\englishPCFG.ser.gz");
This englishPCFG.ser.gz file contains some classes or information which gets loaded when creating object of type LexicalizedParser.
Following is the stacktrace :
Loading parser from serialized file C:\englishPCFG.ser.gz ...
Exception in thread "main" java.lang.RuntimeException: Invalid class in file: C:\englishPCFG.ser.gz
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserDataFromSerializedFile(LexicalizedParser.java:822)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserDataFromFile(LexicalizedParser.java:603)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.<init>(LexicalizedParser.java:168)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.<init>(LexicalizedParser.java:154)
at com.tcs.srl.ssa.SSAInvoker.<init>(SSAInvoker.java:21)
at com.tcs.srl.ssa.SSAInvoker.main(SSAInvoker.java:53)
Caused by: java.io.InvalidClassException: edu.stanford.nlp.stats.Counter; edu.stanford.nlp.stats.Counter; class invalid for deserialization
at java.io.ObjectStreamClass.checkDeserialize(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.io.ObjectInputStream.readObject0(Unknown Source)
at java.io.ObjectInputStream.readObject(Unknown Source)
at edu.stanford.nlp.parser.lexparser.LexicalizedParser.getParserDataFromSerializedFile(LexicalizedParser.java:814)
... 5 more
Caused by: java.io.InvalidClassException: edu.stanford.nlp.stats.Counter; class invalid for deserialization
at java.io.ObjectStreamClass.initNonProxy(Unknown Source)
at java.io.ObjectInputStream.readNonProxyDesc(Unknown Source)
at java.io.ObjectInputStream.readClassDesc(Unknown Source)
... 17 more
I am new to Java world so I dont to why this error is coming and what should I do to avoid it.
I googled for this error then I found out that this error comes because of some version mismatch which I think is something similar to dll hell of windows API. Am I correct?
Anyone knows why this kind of error comes? and what should we do to avoid it?
Please enlighten !!!
It could be because the serialVersionUID of the classe has changed, and you are trying to read an object that was written with another version of the class.
You can force the version number by déclaring a serialVersionUID in your serializable class:
private static final long serialVersionUID = 1L;
The java word for dll hell is classpath hell ;-) But that's not your hell anyway.
Object serialization is a process of persisting java objects to files (or streams). The output format is binary. Deserialization (iaw: making java objects from serialized data) requires the same versions of the classes.
So it is possible, that you simply use an older or newer version of that Counter class. This input file should be shipped with a documentation that clearly says, which version of the parser is required. I'd investigate in that direction first.
OT: For the sake of completeness I ran into InvalidClassException ... class invalid for deserialization (and this question) when solving another problem.
(Since edu.stanford.nlp.stats.Counter is not anonymous, the case in this question is certainly not the same case as mine.)
I was sending a serialized class from server to client, the class had two anonymous classes. The jar with these classes was shared among server and client but for server it was compiled in Eclipse JDT, for client in javac. Compilers generated different ordering of names $1, $2 for anonymous classes, hence instance of $1 was sent by server but could not be received as $1 at client side. More info in blogpost (in Czech, though example is obvious).
Try using serialVer to generate the serialID of your old classes that you're trying to de-serialize and add it explicitly (private static final long serialVersionUID = (insert number from serialVer here)L;) in the new versions of the class. If you change anything in a class serialized and you haven't setted the serialID, java thinks the class you've serialized isn't compatible with the new one.
This error suggests that the serialized objects within C:\englishPCFG.ser.gz were serialized with using a older or newer definition of the class which unfortunately is different in such a way that it breaks the terms of compatible serialization from one version to another.
Please see http://download.oracle.com/javase/1.4.2/docs/api/java/io/InvalidClassException.html
Can you check to see when this file was produced and then if possible locate the version of the SSAParser library at the time of it's creation?

Why is WSDL parser still importing external documents?

I tried to turn off importing documents in WSDL4J (1.6.2) in the way suggested
by the API documentation:
wsdlReader.setFeature("javax.wsdl.importDocuments", false);
In fact, it stops importing XML schema files declared with wsdl:import tag, but does stop importing files declared with xs:import tags.
The following code snippet [see at the end of the letter] for the example file
http://www.ibspan.waw.pl/~gawinec/example.wsdl
returns the following exception:
javax.wsdl.WSDLException: WSDLException (at /definitions/types/xs:schema):
faultCode=OTHER_ERROR: An error occurred trying to resolve schema referenced
at 'EchoExceptions.xsd', relative to
'http://www.ibspan.waw.pl/~gawinec/example.wsdl'.:
java.io.FileNotFoundException: This file was not found:
http://www.ibspan.waw.pl/~gawinec/EchoExceptions.xsd
at com.ibm.wsdl.xml.WSDLReaderImpl.parseSchema(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.parseSchema(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.parseTypes(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.parseDefinitions(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at com.ibm.wsdl.xml.WSDLReaderImpl.readWSDL(Unknown Source)
at IsolatedExample.main(IsolatedExample.java:15)
Caused by: java.io.FileNotFoundException: This file was not found:
http://www.ibspan.waw.pl/~gawinec/EchoExceptions.xsd
at com.ibm.wsdl.util.StringUtils.getContentAsInputStream(Unknown Source)
... 10 more
Can you suggest me any solution to this problem? I just don't want to import
external XML schemata.
Regards,
Maciej
import javax.wsdl.WSDLException;
import javax.wsdl.factory.WSDLFactory;
import javax.wsdl.xml.WSDLReader;
public class IsolatedExample {
public static void main(String[] args) {
WSDLFactory wsdlFactory;
try {
wsdlFactory = WSDLFactory.newInstance();
WSDLReader wsdlReader = wsdlFactory.newWSDLReader();
wsdlReader.setFeature("javax.wsdl.verbose", false);
wsdlReader.setFeature("javax.wsdl.importDocuments", false);
wsdlReader.readWSDL("http://www.ibspan.waw.pl/~gawinec/example.wsdl");
} catch (WSDLException e) {
e.printStackTrace();
}
}
}
A quick look at WSDL4J (it's been a while since I've worked directly with this project) suggests that there is no option specifically to prevent the reading of imported schemas. You may have stumbled upon on a bug in WSDL4J's mechanism of deserializing schemas. That said, if you're not interested in the contents of any schemas, including those inlined in the WSDL document, you can register your own extension registry (simply modify the PopulatedExtensionRegistry class to leave out the SchemaDeserializer).
Specifically, leave out the following lines:
mapExtensionTypes(Types.class, SchemaConstants.Q_ELEM_XSD_1999,
SchemaImpl.class);
registerDeserializer(Types.class, SchemaConstants.Q_ELEM_XSD_1999,
new SchemaDeserializer());
registerSerializer(Types.class, SchemaConstants.Q_ELEM_XSD_1999,
new SchemaSerializer());
mapExtensionTypes(Types.class, SchemaConstants.Q_ELEM_XSD_2000,
SchemaImpl.class);
registerDeserializer(Types.class, SchemaConstants.Q_ELEM_XSD_2000,
new SchemaDeserializer());
registerSerializer(Types.class, SchemaConstants.Q_ELEM_XSD_2000,
new SchemaSerializer());
mapExtensionTypes(Types.class, SchemaConstants.Q_ELEM_XSD_2001,
SchemaImpl.class);
registerDeserializer(Types.class, SchemaConstants.Q_ELEM_XSD_2001,
new SchemaDeserializer());
registerSerializer(Types.class, SchemaConstants.Q_ELEM_XSD_2001,
new SchemaSerializer());
I haven't used Java for webservices, but have you tried setting an absolute path to the schemas you import? Perhaps it's trying to load a local file.
You could also try sniffing the wire to see if you're making a request, perhaps it's malformed.
$0.02

Categories

Resources