Is Scala/Java not respecting w3 "excess dtd traffic" specs?

I'm new to Scala, so I may be off base on this, but I want to know whether the problem is in my code. Given the Scala file httpparse, simplified to:
object Http {
  import java.io.InputStream;
  import java.net.URL;
  def request(urlString:String): (Boolean, InputStream) =
    try {
      val url = new URL(urlString)
      val body = url.openStream
      (true, body)
    }
    catch {
      case ex:Exception => (false, null)
    }
}
object HTTPParse extends Application {
  import scala.xml._;
  import java.net._;
  def fetchAndParseURL(URL:String) = {
    val (true, body) = Http request(URL)
    val xml = XML.load(body) // <-- Error happens here in .load() method
    "True"
  }
}
Which is run with (URL doesn't matter, this is a joke example):
scala> HTTPParse.fetchAndParseURL("http://stackoverflow.com")
The result invariably:
java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1187)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:973)
at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.startEntity(XMLEnti...
I've seen the Stack Overflow thread on this with respect to Java, as well as the W3C's System Team Blog entry about not trying to access this DTD via the web. I've also isolated the error to the XML.load() method, which is a Scala library method as far as I can tell.
My question: How can I fix this? Is this a byproduct of my code (cribbed from Raphael Ferreira's post), a byproduct of something Java-specific that I need to address as in the previous thread, or something that is Scala-specific? Where is this call happening, and is it a bug or a feature? ("Is it me? It's her, right?")

I've bumped into the same issue and haven't found an elegant solution (I'm thinking of posting the question to the Scala mailing list). Meanwhile, I found a workaround: extend SAXParserFactoryImpl so you can set parser features such as f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);. The good thing is that it doesn't require any change to the Scala code base (I agree that it should be fixed there, though).
First I'm extending the default parser factory:
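package mypackage;
import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXNotRecognizedException;
import org.xml.sax.SAXNotSupportedException;
// Parser factory that turns off validation and external DTD loading.
public class MyXMLParserFactory extends SAXParserFactoryImpl {
    public MyXMLParserFactory() throws SAXNotRecognizedException, SAXNotSupportedException, ParserConfigurationException {
        super();
        super.setFeature("http://xml.org/sax/features/validation", false);
        super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false);
        super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
    }
}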
Nothing special, I just want the chance to set the features.
(Note that this is plain Java code; most probably you can write the same in Scala too.)
And in your Scala code, you need to configure the JVM to use your new factory:
System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory");
Then you can call XML.load without validation
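For comparison, here is a minimal Java sketch of the same idea applied to a single parser instead of through the JVM-wide system property. It assumes the JDK's bundled Xerces-based parser, since the feature URIs are Xerces-specific. The class name NoDtdFetchDemo and the helper countElements are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NoDtdFetchDemo {
    // Builds a non-validating parser that is told not to load external DTDs.
    // Assumption: the underlying implementation understands the Xerces feature URIs.
    static SAXParser newParser() throws Exception {
        SAXParserFactory f = SAXParserFactory.newInstance();
        f.setValidating(false);
        f.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
        f.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        return f.newSAXParser();
    }

    // Hypothetical helper: counts start tags so there is something observable to check.
    static int countElements(String xml) throws Exception {
        final int[] count = {0};
        newParser().parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        });
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        // The DOCTYPE points at w3.org, but the parser should not request it.
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><head><title>t</title></head><body><p>hi</p></body></html>";
        System.out.println("elements: " + countElements(xhtml));
    }
}
```

Keep in mind that entities declared only in the external DTD are not available to the parser once the DTD is skipped.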

Setting the main problem aside for now, what do you expect to happen if the function request returns false below?
def fetchAndParseURL(URL:String) = {
val (true, body) = Http request(URL)
What will happen is that a scala.MatchError will be thrown, because the pattern (true, body) does not match the result. You could rewrite it this way, though:
def fetchAndParseURL(URL:String) = (Http request(URL)) match {
case (true, body) =>
val xml = XML.load(body)
"True"
case _ => "False"
}
Now, to fix the XML parsing problem, we'll disable DTD loading in the parser, as suggested by others:
def fetchAndParseURL(URL:String) = (Http request(URL)) match {
case (true, body) =>
val f = javax.xml.parsers.SAXParserFactory.newInstance()
f.setNamespaceAware(false)
f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
val MyXML = XML.withSAXParser(f.newSAXParser())
val xml = MyXML.load(body)
"True"
case _ => "False"
}
Now, I put that MyXML stuff inside fetchAndParseURL just to keep the structure of the example as unchanged as possible. For actual use, I'd separate it in a top-level object, and make "parser" into a def instead of val, to avoid problems with mutable parsers:
import scala.xml.Elem
import scala.xml.factory.XMLLoader
import javax.xml.parsers.SAXParser
object MyXML extends XMLLoader[Elem] {
override def parser: SAXParser = {
val f = javax.xml.parsers.SAXParserFactory.newInstance()
f.setNamespaceAware(false)
f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
f.newSAXParser()
}
}
Import the package it is defined in, and you are good to go.

This is a Scala problem. Native Java has an option to disable loading the DTD:
f.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
There is no equivalent in Scala.
If you want to fix it yourself, check scala/xml/parsing/FactoryAdapter.scala and put the line in:
def loadXML(source: InputSource): Node = {
  // create parser
  val parser: SAXParser = try {
    val f = SAXParserFactory.newInstance()
    f.setNamespaceAware(false)
    // <-- insert the setFeature line here
    f.newSAXParser()
  } catch {
    case e: Exception =>
      Console.err.println("error: Unable to instantiate parser")
      throw e
  }

GClaramunt's solution worked wonders for me. My Scala conversion is as follows:
package mypackage
import org.xml.sax.{SAXNotRecognizedException, SAXNotSupportedException}
import com.sun.org.apache.xerces.internal.jaxp.SAXParserFactoryImpl
import javax.xml.parsers.ParserConfigurationException
@throws(classOf[SAXNotRecognizedException])
@throws(classOf[SAXNotSupportedException])
@throws(classOf[ParserConfigurationException])
class MyXMLParserFactory extends SAXParserFactoryImpl() {
super.setFeature("http://xml.org/sax/features/validation", false)
super.setFeature("http://apache.org/xml/features/disallow-doctype-decl", false)
super.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false)
super.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
}
As mentioned in the original post, it is necessary to place the following line in your code somewhere:
System.setProperty("javax.xml.parsers.SAXParserFactory", "mypackage.MyXMLParserFactory")

It works. After some detective work, the details as best I can figure them:
Trying to parse a developmental RESTful interface, I build the parser and get the above (rather, a similar) error. I try various parameters to change the XML output, but get the same error. I try to connect to an XML document I quickly whip up (cribbed stupidly from the interface itself) and get the same error. Then I try to connect to anything, just for kicks, and get the same (again, likely only similar) error.
I started questioning whether it was an error with the sources or the program, so I started searching around, and it looks like an ongoing issue- with many Google and SO hits on the same topic. This, unfortunately, made me focus on the upstream (language) aspects of the error, rather than troubleshoot more downstream at the sources themselves.
Fast forward and the parser suddenly works on the original XML output. I confirmed that some additional work had been done server side (just a crazy coincidence?). I don't have either of the earlier XML outputs, but I suspect that it is related to the document identifiers being changed.
Now, the parser works fine on the RESTful interface, as well as on any well-formed XML I can throw at it. It also fails on all the XHTML DTDs I've tried (e.g. www.w3.org). This is contrary to what @SeanReilly expects, but seems to jibe with what the W3C states.
I'm still new to Scala, so I can't determine whether I have a special or a typical case. Nor can I be assured that this problem won't recur for me in another form down the line. It does seem that pulling XHTML will continue to cause this error unless one uses a solution similar to those suggested by @GClaramunt & @J-16 SDiZ. I'm not really qualified to know if this is a problem with the language or with my implementation of a solution (likely the latter).
For the immediate timeframe, I suspect that the best solution would have been for me to ensure that it was possible to parse that XML source, rather than see that others have had the same error and assume there was a functional problem with the language.
Hope this helps others.

There are two problems with what you are trying to do:
Scala's XML parser is trying to physically retrieve the DTD when it shouldn't. J-16 SDiZ seems to have some advice for this problem.
The Stack Overflow page you are trying to parse isn't XML. It's HTML 4 Strict.
The second problem isn't really possible to fix in your Scala code. Even once you get around the DTD problem, you'll find that the source just isn't valid XML (empty tags aren't closed properly, for example).
You have to either parse the page with something besides an XML parser, or investigate using a utility like tidy to convert the HTML to XML.

My knowledge of Scala is pretty poor, but couldn't you use ConstructingParser instead?
val xml = new java.io.File("xmlWithDtd.xml")
val parser = scala.xml.parsing.ConstructingParser.fromFile(xml, true)
val doc = parser.document()
println(doc.docElem)

For scala 2.7.7 I managed to do this with scala.xml.parsing.XhtmlParser

Setting Xerces switches only works if you are using Xerces. An entity resolver works for any JAXP parser.
There are more generalized entity resolvers out there, but this implementation does the trick when all I'm trying to do is parse valid XHTML.
http://code.google.com/p/java-xhtml-cache-dtds-entityresolver/
Shows how trivial it is to cache the DTDs and forgo the network traffic.
In any case, this is how I fix it. I always forget. I always get the error. I always go fetch this entity resolver. Then I'm back in business.
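A minimal sketch of the entity-resolver approach in plain JAXP, for illustration only. Instead of caching real DTD copies as the linked project does, this simplified resolver hands the parser an empty DTD for every external entity, which is enough to keep it off the network. The class name EmptyDtdResolverDemo and the method parseOffline are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class EmptyDtdResolverDemo {
    // Parses the document without network access and returns the system IDs
    // the parser asked the resolver for (useful to see what it wanted to fetch).
    static List<String> parseOffline(String xml) throws Exception {
        List<String> requested = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(new DefaultHandler());
        // Answer every external entity request with an empty in-memory document.
        reader.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            return new InputSource(new StringReader(""));
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return requested;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><body><p>hi</p></body></html>";
        System.out.println("intercepted: " + parseOffline(xhtml));
    }
}
```

The trade-off of the empty DTD is that named entities defined by the real DTD (such as &nbsp; in XHTML) are no longer declared. A resolver that returns locally cached copies of the DTDs avoids that.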

Related

How to Handle Breaking Change in Interface of docx4j Class "FlatOpcXmlCreator" in V11.3.2 (aka V8.3.0)

I'm about to update our project's dependencies and found out that docx4j has changed the interface of its class FlatOpcXmlCreator. The get() method is now not only deprecated but completely deactivated, as the following docx4j code snippet shows:
package org.docx4j.convert.out.flatOpcXml;
public class FlatOpcXmlCreator implements Output {
    ...
    @Deprecated
    public Package get() throws Docx4JException {
        throw new Docx4JException("Deprecated.");
    }
(IMHO they just skipped the usual deprecation step of still supporting the functionality, giving a warning, and documenting hints on how to upgrade to the new interface.) Anyway, the following code no longer works, since the last line calls the deprecated get() method:
JAXBContext jc = Context.jcXmlPackage;
Marshaller marshaller = jc.createMarshaller();
org.w3c.dom.Document doc = XmlUtils.neww3cDomDocument();
FlatOpcXmlCreator worker = new FlatOpcXmlCreator(wordPackage);
marshaller.marshal(worker.get(), doc);
Does anybody know how to fix it? I checked the release news. The change is mentioned here, but without more details:
https://www.docx4java.org/forums/announces/docx4j-8-3-0-released-t2992.html
Just some details on the actual version switch I tried (because the versioning here is not totally obvious):
The change happened in V8.3.0.
V11.3.2 is the Java 11 branch of V8.3.2.
We were using V11.2.9 before.
The error occurred when I switched to V11.3.2.
The release news of V8.3.0 mentioned that the change in FlatOpcXmlCreator was due to the following issue: https://github.com/plutext/docx4j/issues/444. I had a short look at the discussion but couldn't find any helpful information regarding my issue.
Edit: In my example, worker.get() was used as a shortcut to get a DOM representation of the document. Because that was a little bit hacky (it's not within FlatOpcXmlCreator's scope to expose its internal structure), the current version of docx4j just can't fulfill our needs. A solution would be to apply Jason's solution and parse the string ourselves.
Solution: Following JasonPlutext's suggestions (especially the one from the comment section) led to the following fix:
FlatOpcXmlCreator worker = new FlatOpcXmlCreator(wordPackage);
worker.populate();
var outputStream = new ByteArrayOutputStream();
worker.marshal(outputStream);
org.w3c.dom.Document doc = XmlUtils.getNewDocumentBuilder().parse(
new ByteArrayInputStream(outputStream.toByteArray()));

For an example of what to do now, please see https://github.com/plutext/docx4j/blob/master/docx4j-core/src/main/java/org/docx4j/openpackaging/packages/OpcPackage.java#L735
FlatOpcXmlCreator opcXmlCreator = new FlatOpcXmlCreator(this);
opcXmlCreator.populate();
opcXmlCreator.marshal(outStream);
Given your wordPackage, just wordPackage.save(outStream, Docx4J.FLAG_SAVE_FLAT_XML)

Reading the spss file java

SPSSReader reader = new SPSSReader(args[0], null);
Iterator it = reader.getVariables().iterator();
while (it.hasNext())
{
System.out.println(it.next());
}
I am using this SPSSReader to read the SPSS file. Here, every string is printed with some junk characters appended to it.
Obtained Result :
StringVariable: nameogr(nulltpc{)(10)
NumericVariable: weightppuo(nullf{nd)
DateVariable: datexsgzj(nulllanck)
DateVariable: timeppzb(null|wt{l)
DateVariable: datetimegulj{(null|ns)
NumericVariable: commissionyrqh(nullohzx)
NumericVariable: priceeub{av(nullvlpl)
Expected Result :
StringVariable: name (10)
NumericVariable: weight
DateVariable: date
DateVariable: time
DateVariable: datetime
NumericVariable: commission
NumericVariable: price
Thanks in advance :)

I tried recreating the issue and found the same thing.
Considering that the library requires a license (see here), I would assume that this is the developers' way of ensuring that a license is bought, as the regular download only contains a demo version for evaluation (see the licensing terms before the download).
As that library is rather old (copyright of the website is 2003-2008, requirement for the library is Java 1.2, no generics, Vectors are used, etc), I would recommend a different library as long as you are not limited to the one used in your question.
After a quick search, it turned out that there is an open source spss reader here which is also available through Maven here.
Using the example on the github page, I put this together:
import com.bedatadriven.spss.SpssDataFileReader;
import com.bedatadriven.spss.SpssVariable;
public class SPSSDemo {
public static void main(String[] args) {
try {
SpssDataFileReader reader = new SpssDataFileReader(args[0]);
for (SpssVariable var : reader.getVariables()) {
System.out.println(var.getVariableName());
}
} catch (Exception ex) {
ex.printStackTrace();
}
}
}
I wasn't able to find stuff that would print NumericVariable or similar things but as those were the classnames of the library you were using in the question, I will assume that those are not SPSS standardized. If they are, you will either find something like that in the library or you can open an issue on the github page.
Using the employees.sav file from here I got this output from the code above using the open source library:
resp_id
gender
first_name
last_name
date_of_birth
education_type
education_years
job_type
experience_years
monthly_income
job_satisfaction
No additional characters anymore!
Edit regarding the comment:
That is correct. I read through some SPSS material, though, and from my understanding there are only string and numeric variables, which are then formatted in different ways. The version published on Maven only gives you access to the type code of a variable (to be honest, I have no idea what that is), but the GitHub version (1.3-SNAPSHOT, which unfortunately does not appear to be published on Maven) gives you more, since write and print formats have been introduced there.
You can clone or download the library and run mvn clean package (assuming you have maven installed) and use the generated library (found under target\spss-reader-1.3-SNAPSHOT.jar) in your project to have the methods SpssVariable#getPrintFormat and SpssVariable#getWriteFormat available.
Those return an SpssVariableFormat, from which you can get more information. As I have no clue what all that is about, the best I can do is link you to the source here, where the references to what was implemented should help you further (I assume that the link referenced in the documentation of SpssVariableFormat#getType is probably the most helpful for determining what kind of format you have there).
If absolutely NOTHING works with that, I guess you could use the demo version of the library in the question to determine the type through it.next().getClass().getSimpleName() as well, but I would resort to that only if there is no other way to determine the format.

I am not sure, but looking at your code, it.next() is returning a Variable object.
There has to be some method to be chained to the Variable object, something like it.next().getLabel() or it.next().getVariableName(). toString() on an Object is not always meaningful. Check toString() method of Variable class in SPSSReader library.

recognize parameter change from git repository

I want to extract signature changes (method parameter changes to be exact) from commits to git repository by a java program. I have used the following code:
for (Ref branch : branches) {
String branchName = branch.getName();
for (RevCommit commit : commits) {
boolean foundInThisBranch = false;
RevCommit targetCommit = walk.parseCommit(repo.resolve(
commit.getName()));
for (Map.Entry<String, Ref> e : repo.getAllRefs().entrySet()) {
if (e.getKey().startsWith(Constants.R_HEADS)) {
if (walk.isMergedInto(targetCommit, walk.parseCommit(
e.getValue().getObjectId()))) {
String foundInBranch = e.getValue().getName();
if (branchName.equals(foundInBranch)) {
foundInThisBranch = true;
break;
}
}
}
}
I can extract the commit message, commit data and author name from that; however, I am not able to identify parameter changes. Is there any way to recognize them? It is not possible to recognize them from the commit messages written by programmers, so I am looking for something like a specific annotation or similar.
This is my code to extract differences:
CanonicalTreeParser oldTreeIter = new CanonicalTreeParser();
oldTreeIter.reset(reader, oldId);
CanonicalTreeParser newTreeIter = new CanonicalTreeParser();
newTreeIter.reset(reader, headId);
List<DiffEntry> diffs= git.diff()
.setNewTree(newTreeIter)
.setOldTree(oldTreeIter)
.call();
ByteArrayOutputStream out = new ByteArrayOutputStream();
DiffFormatter df = new DiffFormatter(out);
df.setRepository(git.getRepository());
The output is really huge, and it is impossible to extract method changes from it.

You show a way you've found to examine the diffs, but say that the output is too large and you can't extract the method signature changes. If by that you mean that you're asking about specific git support for telling you that a method signature changes, then no - no such support exists. This is because git does not "know" anything about the languages you may or may not have used in the files under source control. Everything is just content that is, or is not, different from other content.
Since a method signature could be split across lines in any number of ways, it's not even guaranteed that just because a method's signature changed its name would appear anywhere in the diff. What you would really have to do is perform a sort of "structural diff". That is, you would have to
check out the "old" version, and pass it to a java parser
check out the "new" version, and pass it to a java parser
compare the resulting parse trees, looking for methods that belong to the same object, but have changed
Even that won't be terribly easy, because methods could be renamed, and because method overloading could make it unclear which signature change goes with which version of a method.
From there what you have is a non-trivial coding problem, which is beyond the scope of SO to answer. If you decide to tackle this problem and run into specific programming questions along the way, of course you could post those questions and perhaps someone will be able to help.
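To make the structural diff described above a bit more concrete, here is a rough sketch that uses the parser shipped with the JDK (the Compiler Tree API in com.sun.source, which needs a full JDK rather than a bare JRE). It takes the old and new contents of one file as strings, for example as read from the two commits, and reports method names whose set of parameter lists differs. The class and method names are made up, and it deliberately ignores the hard parts mentioned above: renamed methods are not detected, and overloads are only compared as a group.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.MethodTree;
import com.sun.source.tree.VariableTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

public class SignatureDiffSketch {
    // Holds one version of a source file in memory so javac can parse it.
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String code) {
            super(URI.create("string:///Snippet.java"), Kind.SOURCE);
            this.code = code;
        }
        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    // Maps each method name to the set of parameter-type lists declared under that name.
    static Map<String, Set<String>> signatures(String code) throws Exception {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // null without a JDK
        JavacTask task = (JavacTask) compiler.getTask(
                null, null, null, null, null, List.of(new StringSource(code)));
        Map<String, Set<String>> result = new TreeMap<>();
        for (CompilationUnitTree unit : task.parse()) { // parse only, no type resolution
            new TreeScanner<Void, Void>() {
                @Override
                public Void visitMethod(MethodTree method, Void unused) {
                    List<String> types = new ArrayList<>();
                    for (VariableTree p : method.getParameters()) {
                        types.add(p.getType().toString());
                    }
                    result.computeIfAbsent(method.getName().toString(), k -> new TreeSet<>())
                          .add("(" + String.join(", ", types) + ")");
                    return super.visitMethod(method, unused);
                }
            }.scan(unit, null);
        }
        return result;
    }

    // Reports methods present in both versions whose parameter lists differ.
    static List<String> changedMethods(String oldCode, String newCode) throws Exception {
        Map<String, Set<String>> before = signatures(oldCode);
        Map<String, Set<String>> after = signatures(newCode);
        List<String> changed = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : before.entrySet()) {
            Set<String> now = after.get(e.getKey());
            if (now != null && !now.equals(e.getValue())) {
                changed.add(e.getKey() + ": " + e.getValue() + " -> " + now);
            }
        }
        return changed;
    }

    public static void main(String[] args) throws Exception {
        String oldSrc = "class A { void f(int x) {} void g() {} }";
        String newSrc = "class A { void f(int x, String y) {} void g() {} }";
        changedMethods(oldSrc, newSrc).forEach(System.out::println);
    }
}
```

Because only the parse phase runs, parameter types are compared as written in the source, so a change from List to java.util.List would show up as a difference even though the type is the same.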

How to know the Java interfaces an OpenOffice Calc UNO object supports (through queryInterface)

I'm developing a "macro" for OpenOffice Calc. As the language, I chose Java, in order to get code assistance in Eclipse. I even wrote a small ant build script that compiles and embeds the "macro" in an *.ods file. In general, this works fine and surprisingly fast; I'm already using some simple stuff quite successfully.
BUT
So often I get stuck because with UNO, I need to "query" an interface for any given non-trivial object, to be able to access data / call methods of that object. I.e., I literally need to guess which interfaces a given object may provide. This is not at all obvious and not even visible during Java development (through some sort of meta-information, reflection or the like), and also sparsely documented (I downloaded tons of stuff, but I don't find the source or maybe JavaDoc for the interfaces I'm using, like XButton, XPropertySet, etc. - XButton has setLabel, but not getLabel - what??).
There is online documentation (for the most fundamental concepts, which is not bad at all!), but it lacks many details that I'm faced with. It always magically stops exactly at the point I need to solve.
I'm willing to look at the C++ code to get a clue what interfaces an object (e.g. the button / event I'm currently stuck with) may provide. Confusingly, the C++ class and file names don't exactly match the Java interfaces. It's almost what I'm looking for, but then in Java I don't really find the equivalent, or calling queryInterface on a given object returns null.. It's becoming a bit frustrating.
How are the UNO Java interfaces generated? Is there some kind of documentation in the code that serves as the origin for the generated (Java) code?
I think I really need to know what interfaces are available at which point, in order to become a bit more fluent during Java-UNO-macro development.

For any serious UNO project, use an introspection tool.
As an example, I created a button in Calc, then used the Java Object Inspector to browse to the button.
Right-clicking and choosing "Add to Source Code" generated the following.
import com.sun.star.awt.XControlModel;
import com.sun.star.beans.XPropertySet;
import com.sun.star.container.XIndexAccess;
import com.sun.star.container.XNameAccess;
import com.sun.star.drawing.XControlShape;
import com.sun.star.drawing.XDrawPage;
import com.sun.star.drawing.XDrawPageSupplier;
import com.sun.star.sheet.XSpreadsheetDocument;
import com.sun.star.sheet.XSpreadsheets;
import com.sun.star.uno.AnyConverter;
import com.sun.star.uno.UnoRuntime;
import com.sun.star.uno.XInterface;
//...
public void codesnippet(XInterface _oUnoEntryObject){
try{
XSpreadsheetDocument xSpreadsheetDocument = (XSpreadsheetDocument) UnoRuntime.queryInterface(XSpreadsheetDocument.class, _oUnoEntryObject);
XSpreadsheets xSpreadsheets = xSpreadsheetDocument.getSheets();
XNameAccess xNameAccess = (XNameAccess) UnoRuntime.queryInterface(XNameAccess.class, xSpreadsheets);
Object oName = xNameAccess.getByName("Sheet1");
XDrawPageSupplier xDrawPageSupplier = (XDrawPageSupplier) UnoRuntime.queryInterface(XDrawPageSupplier.class, oName);
XDrawPage xDrawPage = xDrawPageSupplier.getDrawPage();
XIndexAccess xIndexAccess = (XIndexAccess) UnoRuntime.queryInterface(XIndexAccess.class, xDrawPage);
Object oIndex = xIndexAccess.getByIndex(0);
XControlShape xControlShape = (XControlShape) UnoRuntime.queryInterface(XControlShape.class, oIndex);
XControlModel xControlModel = xControlShape.getControl();
XPropertySet xPropertySet = (XPropertySet) UnoRuntime.queryInterface(XPropertySet.class, xControlModel);
String sLabel = AnyConverter.toString(xPropertySet.getPropertyValue("Label"));
}catch (com.sun.star.beans.UnknownPropertyException e){
e.printStackTrace(System.out);
//Enter your Code here...
}catch (com.sun.star.lang.WrappedTargetException e2){
e2.printStackTrace(System.out);
//Enter your Code here...
}catch (com.sun.star.lang.IllegalArgumentException e3){
e3.printStackTrace(System.out);
//Enter your Code here...
}
}
//...
Python-UNO may be better than Java because it does not require querying specific interfaces. Also XrayTool and MRI are easier to use than the Java Object Inspector.

Handling non-fatal errors in Java

I've written a program to aid the user in configuring 'mechs for a game. I'm dealing with loading the user's saved data. This data can (and sometimes does) become partially corrupt (either due to bugs on my side or due to changes in the game data/rules from upstream).
I need to be able to handle this corruption and load as much as possible. To be more specific, the contents of the save file are syntactically correct but semantically corrupt. I can safely parse the file and drop whatever entries that are not semantically OK.
Currently my data parser will just show a modal dialog with an appropriate warning message. However displaying the warning is not the job of the parser and I'm looking for a way of passing this information to the caller.
Some code to show approximately what is going on (in reality there is a bit more going on than this, but this highlights the problem):
class Parser{
public void parse(XMLNode aNode){
...
if(corrupted) {
JOptionPane.showMessageDialog(null, "Corrupted data found",
"error!", JOptionPane.WARNING_MESSAGE);
// Keep calm and carry on
}
}
}
class UserData{
static UserData loadFromFile(File aFile){
UserData data = new UserData();
Parser parser = new Parser();
XMLDoc doc = fromXml(aFile);
for(XMLNode entry : doc.allEntries()){
data.append(parser.parse(entry));
}
return data;
}
}
The thing here is that bar an IOException or a syntax error in the XML, loadFromFile will always succeed in loading something and this is the wanted behavior. Somehow I just need to pass the information of what (if anything) went wrong to the caller. I could return a Pair<UserData,String> but this doesn't look very pretty. Throwing an exception will not work in this case obviously.
Does anyone have any ideas on how to solve this?

Depending on what you are trying to represent, you can use a class, like SQLWarning from the java.sql package. When you have a java.sql.Statement and call executeQuery you get a java.sql.ResultSet and you can then call getWarnings on the result set directly, or even on the statement itself.
You can use an enum, like RefUpdate.Result, from the JGit project. When you have a org.eclipse.jgit.api.Git you can create a FetchCommand, which will provide you with a FetchResult, which will provide you with a collection of TrackingRefUpdates, which will each contain a RefUpdate.Result enum, which can be one of:
FAST_FORWARD
FORCED
IO_FAILURE
LOCK_FAILURE
NEW
NO_CHANGE
NOT_ATTEMPTED
REJECTED
REJECTED_CURRENT_BRANCH
RENAMED
In your case, you could even use a boolean flag:
class UserData {
    private boolean corrupt; // set by the loader when an entry had to be dropped
    public boolean isCorrupt() { return corrupt; }
}
But since you mentioned there is a bit more than that going on in reality, it really depends on your model of "corrupt". However, you will probably have more options if you have a UserDataReader that you can instantiate, instead of a static utility method.
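As a rough illustration of the SQLWarning-style idea, the loader could return a result object that carries both the usable data and a list of warnings, and leave it to the caller to decide whether to show a dialog. Everything here is hypothetical: LoadResult, load and isValid are stand-in names, entries are plain strings instead of XML nodes, and the validity check is a placeholder for the real semantic rules.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class LoadResultSketch {
    // Result of a load: the entries that survived plus one warning per dropped entry.
    static final class LoadResult {
        private final List<String> entries = new ArrayList<>();
        private final List<String> warnings = new ArrayList<>();
        List<String> getEntries() { return Collections.unmodifiableList(entries); }
        List<String> getWarnings() { return Collections.unmodifiableList(warnings); }
        boolean hasWarnings() { return !warnings.isEmpty(); }
    }

    // Placeholder semantic check; the real parser's rules would go here.
    static boolean isValid(String rawEntry) {
        return rawEntry != null && !rawEntry.trim().isEmpty();
    }

    // Keeps every valid entry and records a warning instead of showing a dialog.
    static LoadResult load(List<String> rawEntries) {
        LoadResult result = new LoadResult();
        for (int i = 0; i < rawEntries.size(); i++) {
            String raw = rawEntries.get(i);
            if (isValid(raw)) {
                result.entries.add(raw.trim());
            } else {
                result.warnings.add("Dropped corrupt entry at index " + i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        LoadResult result = load(Arrays.asList("atlas", " ", "raven"));
        System.out.println("loaded: " + result.getEntries());
        if (result.hasWarnings()) {
            // The caller, not the parser, decides how to present this (dialog, log, ...).
            result.getWarnings().forEach(System.out::println);
        }
    }
}
```

Compared with a Pair<UserData,String>, a dedicated result type can grow (severity levels, the offending node, and so on) without changing the method signature again.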
