Jena - Writing to OWL file - Unexpected result - Java

I created a file system that stores the metadata of files and folders in an OWL file.
For the file system, I am using the Java binding of FUSE, i.e. FUSE-JNA.
For OWL, I am using Jena.
Initially my file system runs fine with no errors, but after some time my program stops reading the .owl file and throws errors. One of them is below.
Errors I get while reading the .owl file:
SEVERE: Exception thrown: org.apache.jena.riot.RiotException: [line: 476, col: 52] The value of attribute "rdf:about" associated with an element type "File" must not contain the '<' character.
org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
org.apache.jena.riot.lang.LangRDFXML$ErrorHandlerBridge.fatalError(LangRDFXML.java:252)
com.hp.hpl.jena.rdf.arp.impl.ARPSaxErrorHandler.fatalError(ARPSaxErrorHandler.java:48)
com.hp.hpl.jena.rdf.arp.impl.XMLHandler.warning(XMLHandler.java:209)
com.hp.hpl.jena.rdf.arp.impl.XMLHandler.fatalError(XMLHandler.java:239)
org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
...
When I open my .owl file, I find that Jena is not writing it correctly. In the picture below, the error highlighted in blue as number 3 is incomplete; some code is missing there.
Secondly, the part highlighted in blue as number 2 is also written wrongly. In my ontology it is a property of File, and it should be written like the code highlighted as number 1.
Both the number 1 and the number 2 code were written by Jena. Most of the OWL code is written correctly by Jena, similar to number 1, but sometimes Jena writes it wrongly, similar to number 2 in the picture. I do not know why.
This is how I am writing to the .owl file using the Jena API:
public void setDataTypeProperty(String resourceURI, String propertyName, String propertyValue)
{
    // Creates a new datatype property. Accepts three arguments: the URI of the
    // resource, the property name (e.g. #hasPath) and the property value, all as strings.
    Model model = ModelFactory.createDefaultModel();

    // Read the model from file
    InputStream in = FileManager.get().open(inputFileName);
    if (in == null)
    {
        throw new IllegalArgumentException("File: " + inputFileName + " not found");
    }
    model.read(in, "");
    try {
        in.close();
    } catch (IOException e) {
        e.printStackTrace();
    }

    // Add the property to the model
    Resource resource = model.createResource(resourceURI);
    resource.addProperty(model.createProperty(baseURI + propertyName), model.createLiteral(propertyValue));

    // Write the model back to the file
    try {
        FileWriter out = new FileWriter(inputFileName);
        model.write(out, "RDF/XML-ABBREV");
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Please guide me on how to fix the number 2 and number 3 errors highlighted in blue.

There is an input-sanitation issue in your method. I cannot be certain that your input data is invalid, but it is certainly something that should be tested in any method that programmatically constructs URIs or literals.
URIs
For example, the following two lines are dangerous, because they can let through characters that are not allowed in a URI, or literal values containing characters that cannot be serialized as XML.
Resource resource = model.createResource(resourceURI);
resource.addProperty(model.createProperty(baseURI+propertyName), model.createLiteral(propertyValue));
To fix the problem with URIs, use URLEncoder to sanitize the URIs themselves:
final String uri = URLEncoder.encode(resourceURI, "UTF-8");
final String puri = URLEncoder.encode(baseURI + propertyName, "UTF-8");
final Resource resource = model.createResource(uri);
resource.addProperty(model.createProperty(puri), model.createLiteral(propertyValue));
To test for problems with URIs, you can use Jena's IRIFactory types to validate that the URI you are constructing adheres to a particular specification.
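A minimal sketch of such a check (assuming a Jena version where IRIFactory lives in org.apache.jena.iri; older 2.x releases expose the same API under com.hp.hpl.jena.iri), with rejection as one possible way to handle a violation:
IRIFactory iriFactory = IRIFactory.iriImplementation();
IRI iri = iriFactory.create(resourceURI);
if (iri.hasViolation(true)) {
    // the string violates the IRI specification (including warnings);
    // reject it or sanitise it before creating the resource
    throw new IllegalArgumentException("Invalid resource URI: " + resourceURI);
}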
Literals
Solving the problem with literals is a little more tricky. You are not getting an exception that indicates a bad literal value, but I am including this for completeness (so you can sanitize all inputs, not only the ones that may be causing a problem now).
Jena's writers do not test the values of literals until they are serialized as XML. The pattern they use to detect invalid XML characters covers only the characters that the RDF/XML specification requires to be replaced; Jena delegates the final validation (and exception throwing) to the underlying XML library. This makes sense, because a future RDF serialization could allow the expression of all characters. I was recently bitten by this (for example, by a string containing a backspace character), so I created a stricter pattern in order to detect the situation eagerly at runtime.
final Pattern elementContentEntities = Pattern.compile(
        "[\u0000-\u001F&&[^\n\t\r]]|\u007F|[\u0080-\u009F]|[\uD800-\uDFFF]|\uFFFF|\uFFFE");
final Matcher m = elementContentEntities.matcher(propertyValue);
if (m.find()) {
    // TODO sanitise your string literal, it contains characters that cannot be serialised as XML
} else {
    // TODO your string is good
}
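One possible follow-up for the first branch, as a sketch only (whether silently dropping the offending characters is acceptable depends on your data):
// Remove every character the pattern flags as non-serialisable, then store the cleaned value.
String safeValue = elementContentEntities.matcher(propertyValue).replaceAll("");
resource.addProperty(model.createProperty(baseURI + propertyName),
        model.createLiteral(safeValue));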

The nature of the truncation at #3 - "admi" - leads me to think that maybe this is a problem with your underlying data transport and storage, and has nothing to do with XML, RDF, Jena, or anything else up at this level. Maybe an ignored exception?

My main program was sometimes passing the resourceURI argument as blank or null to the setDataTypeProperty method. That is what was creating the problem.
So I have modified my code and added two lines at the start of the method:
public void setDataTypeProperty(String resourceURI, String propertyName, String propertyValue)
{
    if (resourceURI == null)
        return;
    ...
    ...
It has now been running for a few days and I have not faced the above-mentioned errors yet.
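Since the argument could also be blank rather than null, a slightly fuller guard is sketched below; treating a whitespace-only URI as invalid is an assumption, not part of the original code:
if (resourceURI == null || resourceURI.trim().isEmpty()) {
    // skip blank or null resource URIs instead of writing a broken rdf:about attribute
    return;
}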

Related

Creating a text file with java without using absolute path

Following up on the question I asked before, How to have my java project to use some files without using their absolute path?, I found a solution, but another problem popped up when creating the text files that I want to write into. Here's my code:
private String pathProvider() throws Exception {
    // finding the location where the jar file has been located
    String jarPath = URLDecoder.decode(getClass().getProtectionDomain().getCodeSource().getLocation().getPath(), "UTF-8");
    // creating the full and final path
    String completePath = jarPath.substring(0, jarPath.lastIndexOf("/")) + File.separator + "Records.txt";
    return completePath;
}

public void writeRecord() {
    try (Formatter writer = new Formatter(new FileWriter(new File(pathProvider()), true))) {
        writer.format("%s %s %s %s %s %s %s %s %n", whichIsChecked(), nameInput.getText(), lastNameInput.getText(),
                idInput.getText(), fieldOfStudyInput.getText(), date.getSelectedItem().toString(),
                month.getSelectedItem().toString(), year.getSelectedItem().toString());
        successful();
    } catch (Exception e) {
        failure();
    }
}
This works and creates the text file wherever the jar file is running from, but my problem is that when the information is written to the file, the numbers, symbols, and English characters are kept, while the Persian characters are turned into question marks, like: ????? 111 ????? ????. Running the app in Eclipse doesn't cause this problem; running the jar does.
Note: I found the code inside the pathProvider method in someone else's question.
Your pasted code and the linked question are complete red herrings - they have nothing whatsoever to do with the error you ran into. Also, that protection domain stuff is a hack and you've been told before not to write data files next to your jar files, it's not how OSes (are supposed to) work. Use user.home for this.
There is nothing in this method that explains the question marks - the string, as returned, has plenty of issues (see above), but NOT that it will result in question marks in the output.
Files are fundamentally bytes. Strings are fundamentally characters. Therefore, when you write code that writes a string to a file, some code somewhere is converting chars to bytes.
Make sure the place where that happens includes a charset encoding.
Use the new API (I think you've also been told to do this, by me, in an earlier question of yours), which defaults to UTF-8. Alternatively, specify UTF-8 when you write. Note that the UTF-8 you already use in pathProvider (in URLDecoder.decode) is only about the file name/path, not about the contents you write; Persian symbols in the contents are unaffected by it.
Because you didn't paste the code, I can't give you specific details as there are hundreds of ways to do this, and I do not know which one you used.
To write to a file given a String representing its path:
Path p = Paths.get(completePath);
Files.writeString(p, "Hello, World!"); // Java 11+; defaults to UTF-8
is all you need. This will write as UTF-8, which can handle Persian symbols (because the java.nio.file.Files API defaults to UTF-8 if you specify no encoding, unlike e.g. new File, FileOutputStream, FileWriter, etc.).
If you're using outdated APIs: new BufferedWriter(new OutputStreamWriter(new FileOutputStream(thePath), StandardCharsets.UTF_8)) - but note that this is a resource-leak bug unless you add the appropriate try-with-resources.
If you're using FileWriter: FileWriter is broken, never use this class. Use something else.
If you're converting the string on its own, it's str.getBytes(StandardCharsets.UTF_8), not str.getBytes().
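Applied to the writeRecord method from the question, a sketch (assuming you keep appending to the same file and keep the original field accessors) could look like this:
// Writes the record through a UTF-8 writer from java.nio.file instead of FileWriter,
// so Persian characters survive; CREATE and APPEND mirror the original "append" flag.
try (Formatter writer = new Formatter(Files.newBufferedWriter(
        Paths.get(pathProvider()),
        StandardCharsets.UTF_8,
        StandardOpenOption.CREATE, StandardOpenOption.APPEND))) {
    writer.format("%s %s %s %s %s %s %s %s %n", whichIsChecked(), nameInput.getText(), lastNameInput.getText(),
            idInput.getText(), fieldOfStudyInput.getText(), date.getSelectedItem().toString(),
            month.getSelectedItem().toString(), year.getSelectedItem().toString());
    successful();
} catch (Exception e) {
    failure();
}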

Comparing two PDF files' text using PDFBox fails even though both files have the same text

I am using PDFBox as a utility in my Selenium automation for export testing. We compare the actual exported PDF file with the expected one using PDFBox and then pass or fail the test accordingly. This works pretty smoothly. However, I recently came across an actual exported file that looks the same as the expected one (as far as the data is concerned), yet the comparison with PDFBox fails.
Expected pdf file
Actual pdf file
Below is the general utility I am using to compare PDF files:
private static void arePDFFilesEqual(File pdfFile1, File pdfFile2) throws IOException
{
    LOG.info("Comparing PDF files (" + pdfFile1 + "," + pdfFile2 + ")");
    PDDocument pdf1 = PDDocument.load(pdfFile1);
    PDDocument pdf2 = PDDocument.load(pdfFile2);
    PDPageTree pdf1pages = pdf1.getDocumentCatalog().getPages();
    PDPageTree pdf2pages = pdf2.getDocumentCatalog().getPages();
    try
    {
        if (pdf1pages.getCount() != pdf2pages.getCount())
        {
            String message = "Number of pages in the files (" + pdfFile1 + "," + pdfFile2
                    + ") do not match. pdfFile1 has " + pdf1pages.getCount()
                    + " pages, while pdfFile2 has " + pdf2pages.getCount() + " pages";
            LOG.debug(message);
            throw new TestException(message);
        }
        PDFTextStripper pdfStripper = new PDFTextStripper();
        LOG.debug("pdfStripper is :- " + pdfStripper);
        LOG.debug("pdf1pages.size() is :- " + pdf1pages.getCount());
        for (int i = 0; i < pdf1pages.getCount(); i++)
        {
            pdfStripper.setStartPage(i + 1);
            pdfStripper.setEndPage(i + 1);
            String pdf1PageText = pdfStripper.getText(pdf1);
            String pdf2PageText = pdfStripper.getText(pdf2);
            if (!pdf1PageText.equals(pdf2PageText))
            {
                String message = "Contents of the files (" + pdfFile1 + "," + pdfFile2
                        + ") do not match on page no: " + (i + 1)
                        + " pdf1PageText is : " + pdf1PageText
                        + " , while pdf2PageText is : " + pdf2PageText;
                LOG.debug(message);
                LOG.debug("pdf1PageText is " + pdf1PageText);
                LOG.debug("pdf2PageText is " + pdf2PageText);
                String difference = StringUtils.difference(pdf1PageText, pdf2PageText);
                LOG.debug("difference is " + difference);
                throw new TestException(message + " [[ Difference is ]] " + difference);
            }
        }
        LOG.info("Returning true, as PDF files (" + pdfFile1 + "," + pdfFile2 + ") match");
    } finally {
        pdf1.close();
        pdf2.close();
    }
}
Eclipse shows these differences in the console:
https://s3.amazonaws.com/uploads.hipchat.com/95223/845692/9Ex0QW2fFeRqu8s/upload.png
I can see it is failing because of symbols like curly braces {}, hash #, and the exclamation mark !, but I don't know how to fix this.
Can anyone please tell me how to fix this?
However, I recently came across an actual exported file which looks the same as the expected one (as far as the data is concerned), yet when comparing it with PDFBox, it fails.
That this might happen should not surprise you. After all, your test does not compare the looks of the pages in question but the results of text extraction.
While the look of textual data on the pages depends on the drawing instructions for the glyphs in question in the respective (in case of your files) embedded font file, the result of text extraction of the same textual data on the pages depends on the ToUnicode table or Encoding value of the PDF font information structures for that font file.
And indeed, while the textual data of the expected and the actual document use the same glyphs of the respective fonts, the ToUnicode tables in the expected and the actual document for one font claim that certain glyphs represent different Unicode code points.
The font in question has these three glyphs:
The ToUnicode map for that font in your expected document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F125> <F128> ]
which claim that these three characters correspond to U+0000, U+F125, and U+F128.
The ToUnicode map for that font in your actual document contains the mappings
<0000> <0000> <0000>
<0001> <0002> [<F126> <F129> ]
which claim that these three characters correspond to U+0000, U+F126, and U+F129.
Thus, your test has correctly found a difference between the expected and the actual document, so its failure result is correct. You don't have to fix anything; the software producing the actual document has an issue!
(One could argue that the differences are inside Unicode private use areas and don't matter. In that case you'd have to update your test to ignore differences of characters from Unicode private use areas. But that should have been told you before you started creating tests.)
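If you do decide to ignore such differences, here is a sketch of filtering private-use-area code points out of the extracted text before comparing (the ranges are the standard Unicode PUA blocks; treating them as ignorable is an assumption about your documents):
// Drops code points in the Unicode private use areas so that mappings such as
// U+F125 vs. U+F126 no longer cause the comparison to fail.
static String stripPrivateUseArea(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    s.codePoints()
     .filter(cp -> !((cp >= 0xE000 && cp <= 0xF8FF)            // BMP private use area
                  || (cp >= 0xF0000 && cp <= 0xFFFFD)          // supplementary PUA-A
                  || (cp >= 0x100000 && cp <= 0x10FFFD)))      // supplementary PUA-B
     .forEach(sb::appendCodePoint);
    return sb.toString();
}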
This is a tough one, since similar or even the same Unicode characters might have different byte representation, depending on font, encoding and other factors during PDF generation.
A possible solution I can think of if you can safely assume that the relevant text pieces are represented by 8 bit characters:
String stripUnicode(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    for (char c : s.toCharArray()) {
        if (c <= 0xFF) {
            sb.append(c);
        }
    }
    return sb.toString();
}
...
String pdf1PageText = pdfStripper.getText(pdf1);
String pdf2PageText = pdfStripper.getText(pdf2);
if (!stripUnicode(pdf1PageText).equals(stripUnicode(pdf2PageText)))
...
If you need Unicode support, you need to implement your own custom comparison algorithm that is able to identify similar characters and treat them as equal.

XML file loses its format after reading and writing in Java

I'm writing a program in Java that is going to read an XML file, make some modifications, and then write the file back with the same format.
The following is the code block that reads and writes the XML file:
final Document fileDocument = parseFileAsDocument(file);
final OutputFormat format = new OutputFormat(fileDocument);
try {
    final FileWriter out = new FileWriter(file);
    final XMLSerializer serializer = new XMLSerializer(out, format);
    serializer.serialize(fileDocument);
}
catch (final IOException e) {
    System.out.println(e.getMessage());
}
This is the method used to parse the file:
private Document parseFileAsDocument(final File file) {
    Document inputDocument = null;
    try {
        inputDocument = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(file);
    } //catching some exceptions{}
    return inputDocument;
}
I'm noticing two changes after the file is written:
Before I had a node similar to this:
<instance ref='filter'>
<value></value>
</instance>
After reading and writing, the node looks like this:
<instance ref="filter">
<value/>
</instance>
As you can see above, the single quotes around 'filter' have been changed to double quotes ("filter").
The second change is that <value></value> has been changed to <value/>. This change happens across the XML file wherever there is a node like <tag></tag> with no value in between; if there is something like <tag>somevalue</tag>, there is no issue.
Any thoughts on how to keep the XML node format the same after writing?
I'd appreciate it!
You can't, and you shouldn't try. It's a bit like complaining that when you add 0123 and 0234, you get 357 without the leading zeroes. Leading zeroes in integers aren't considered significant, so arithmetic operations don't preserve them. The same happens to insignificant details of your XML, like the distinction between double quotes and single quotes, and the distinction between a self-closing tags and a start/end tag pair for an empty element. If any consumer of the XML is depending on these details, they need to be sent for retraining.
The most usual reason for asking for lexical details to be preserved is that you want to detect changes. But this means you are doing your comparisons the wrong way: you should be comparing at the logical level, not the physical level. One way to do comparisons is to canonicalize the XML, so whenever there is an arbitrary choice to be made between equivalent representations, it is made the same way.
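As a sketch of what comparing at the logical level can look like, DOM's isEqualNode already ignores the quote style and the self-closing-tag distinction (switching on namespace awareness here is an assumption, and the method name is just a placeholder):
static boolean logicallyEqual(File f1, File f2) throws Exception {
    DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
    dbf.setNamespaceAware(true);
    dbf.setCoalescing(true);
    dbf.setIgnoringComments(true);
    DocumentBuilder db = dbf.newDocumentBuilder();

    // Compare the parsed trees, not the raw text, so 'filter' vs. "filter"
    // and <value></value> vs. <value/> no longer matter.
    Document a = db.parse(f1);
    Document b = db.parse(f2);
    a.normalizeDocument();
    b.normalizeDocument();
    return a.isEqualNode(b);
}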

How can I map the tagx calls in a jspx?

We have a large project which uses an immense number of tagx files of our own creation, and we are about to refactor the UI's underlying code. This means that many tagx files will be merged or thrown away and views (jspx files) rewritten. To be able to split the refactoring into reasonable pieces without conflicting with each other, we would like to "map" the tagx calls.
Is there an easy way, or maybe a tool, that goes through the jspx/tagx files and lists which tagx they call (not just the library, but the specific tagx)?
So for example:
create.jspx calls in its body:
c:if
form:create
form:dependency
myowntaglib1:myowntag1
myowntaglibN:myowntagN
etc
and the app lists these out.
The simplest way to do that would be to write a simple Java program that recursively goes through the directories searching for jspx files and, using an XML parser (here a StAX XMLStreamReader), listens for
XMLStreamConstants.START_ELEMENT
events and then displays
xmlReader.getName().getLocalPart();
Sample code:
XMLInputFactory xmlFactory = XMLInputFactory.newInstance();
try {
    XMLStreamReader xmlReader = xmlFactory.createXMLStreamReader(fname, stream);
    while (xmlReader.hasNext()) {
        // returns the event type
        int eventType = xmlReader.next();
        if (eventType == XMLStreamConstants.START_ELEMENT) {
            System.out.println(xmlReader.getName().getLocalPart());
        }
    }
} catch (XMLStreamException e) {
    e.printStackTrace();
}
stream should be a FileInputStream for the fname file.
Instead of displaying tag names, you can put them into a HashMap (or a Set) and display them after the whole file has been parsed; that way you won't get duplicates. A fuller sketch follows below.
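A slightly fuller sketch along these lines, walking a directory tree and collecting the prefixed tag names per file (the root directory src/main/webapp and the class name TagUsageScanner are placeholders):
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class TagUsageScanner {
    public static void main(String[] args) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        Map<String, Set<String>> tagsPerFile = new TreeMap<>();

        // Collect all .jspx and .tagx files under the web root.
        List<Path> files;
        try (Stream<Path> walk = Files.walk(Paths.get("src/main/webapp"))) {
            files = walk.filter(p -> p.toString().endsWith(".jspx") || p.toString().endsWith(".tagx"))
                        .collect(Collectors.toList());
        }

        // Record every start element as prefix:localName, e.g. form:create.
        for (Path file : files) {
            Set<String> tags = new TreeSet<>();
            try (InputStream in = Files.newInputStream(file)) {
                XMLStreamReader reader = factory.createXMLStreamReader(in);
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String prefix = reader.getName().getPrefix();
                        String local = reader.getName().getLocalPart();
                        tags.add(prefix.isEmpty() ? local : prefix + ":" + local);
                    }
                }
            }
            tagsPerFile.put(file.toString(), tags);
        }

        // e.g. create.jspx -> [c:if, form:create, form:dependency, ...]
        tagsPerFile.forEach((file, tags) -> System.out.println(file + " -> " + tags));
    }
}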

How to process a string with 823237 characters

I have a string that has 823237 characters in it. It's actually an XML file, and for testing purposes I want to return it as a response from a servlet.
I have tried everything I can possibly think of:
1) Creating a constant with the whole string: in this case Eclipse complains (with a red line under the servlet class name):
The type generates a string that requires more than 65535 bytes to encode in Utf8 format in the constant pool
2) Breaking the whole string into 20 string constants and writing them to the out object directly, something like:
out.println( CONSTANT_STRING_PART_1 + CONSTANT_STRING_PART_2 +
CONSTANT_STRING_PART_3 + CONSTANT_STRING_PART_4 +
CONSTANT_STRING_PART_5 + CONSTANT_STRING_PART_6 +
// add all the string constants till .... CONSTANT_STRING_PART_20);
In this case the build fails, complaining:
[javac] D:\xx\xxx\xxx.java:87: constant string too long
[javac] CONSTANT_STRING_PART_19 + CONSTANT_STRING_PART_20);
^
3) Reading the XML file as a string and writing it to the out object: in this case I get
SEVERE: Allocate exception for servlet MyServlet
Caused by: org.apache.xmlbeans.XmlException: error: Content is not allowed in prolog.
Finally, my question is: how can I return such a big string (as a response) from the servlet?
You can avoid loading all the text into memory by using streams:
InputStream is = new FileInputStream("path/to/your/file");
// or, if the file is on the classpath:
// InputStream is = MyServlet.class.getResourceAsStream("path/to/file/in/classpath");
byte[] buff = new byte[4 * 1024];
int read;
while ((read = is.read(buff)) != -1) {
    out.write(buff, 0, read);
}
is.close();
The second approach might work the following way:
out.print(CONSTANT_STRING_PART_1);
out.print(CONSTANT_STRING_PART_2);
out.print(CONSTANT_STRING_PART_3);
out.print(CONSTANT_STRING_PART_4);
// ...
out.print(CONSTANT_STRING_PART_N);
out.println();
You can do this in a loop of course (which is highly recommended ;)).
The way you do it now, you just temporarily create the large string again and then pass it to println(), which is the same problem as in the first approach.
Ropes: Theory and practice
Why and when to use Ropes for Java for string manipulations
You can read an 823K file into a String. Maybe not the most elegant method, but totally doable. Method 3 should have worked: there was an XML error, but that has nothing to do with reading a file into a String, or with the length of the data.
It has to be an external file, though, because it is too big to be inlined into a class file (there are size limits for those).
I recommend Commons IO FileUtils#readFileToString.
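A sketch of that approach inside the servlet (the file path is a placeholder, FileUtils is org.apache.commons.io.FileUtils, and response is the HttpServletResponse):
// Read the whole XML file as UTF-8 and send it as the servlet response.
String xml = FileUtils.readFileToString(new File("/path/to/test-data.xml"), StandardCharsets.UTF_8);
response.setContentType("text/xml;charset=UTF-8");
response.getWriter().write(xml);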
You could deal with a ByteArrayOutputStream instead of the String itself. If you want to send your String in the HTTP response, all you have to do is read from that byte array stream and write to the response stream, like this:
ByteArrayOutputStream baos = new ByteArrayOutputStream(823237);
baos.write(constant1.getBytes());
baos.write(constant2.getBytes());
...
baos.writeTo(response.getOutputStream());
Both problems 1) and 2) are due to the same fundamental issue. A String literal (or constant String expression) cannot be more than 65535 characters, because there is a hard limit on string constants in the class file format.
The third problem sounds like a bug in the way you've implemented it rather than a fundamental problem. In fact, it sounds like you are trying to load the XML as a DOM and then unparse it (which is unnecessary), and that somehow you have managed to mangle the XML in the process. (Or maybe it is mangled in the file you are trying to read ...)
The simple and elegant solution is to save the stuff in a file, and then read it as plain text.
Or ... less elegant, but just as effective:
String[] strings = {
    "longString1",
    "longString2",
    ...
    "longStringN"
};
for (String str : strings) {
    out.write(str);
}
Of course, the problem with embedding test data as string literals is that you have to escape certain characters in the string to keep the compiler happy. That's tedious if you have to do it by hand.
