I am using the following steps to read quads into a Jena Model:
InputStream in = FileManager.get().open(fn); // fn is the filename
Model md = ModelFactory.createDefaultModel();
md.read(in,null,"TTL");
The quads in the file are:
@prefix dbpedia: <http://dbpedia.org/resource/> .
dbpedia:53b56e90c8a15fcd48eb5001 dbpedia:type dbpedia:willtest dbpedia:1 .
dbpedia:53b56e90c8a15fcd48eb5001 dbpedia:end dbpedia:1404394351023 dbpedia:1 .
dbpedia:53b56e90c8a15fcd48eb5001 dbpedia:room dbpedia:Room202cen dbpedia:1 .
dbpedia:53debf266ad34658725225ed dbpedia:reading dbpedia:0 dbpedia:2 .
dbpedia:53debf206ad34658725225e5 dbpedia:begining dbpedia:1407106678270 dbpedia:3 .
But on running it, I get the following error:
Exception in thread "main" com.hp.hpl.jena.n3.turtle.TurtleParseException: Encountered " <DECIMAL> "1. "" at line 2, column 60.
Was expecting one of:
";" ...
"," ...
"." ...
The error is generated only for the quad file; a triples file is read without problems. Is there another method to read quads into a Jena Model?
UPDATE#1
I did as Christian mentioned in his answer, but now I get the following error:
Exception in thread "main" com.hp.hpl.jena.shared.JenaException:
org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1;
Content is not allowed in prolog.
The same data file can be found at link.
It looks like you are reading a Turtle file, and according to http://www.w3.org/TR/turtle/#abstract , Turtle is not compatible with N-Quads:
This document defines a textual syntax for RDF called Turtle that
allows an RDF graph to be completely written in a compact and natural
text form, with abbreviations for common usage patterns and datatypes.
Turtle provides levels of compatibility with the N-Triples [N-TRIPLES]
format as well as the triple pattern syntax of the SPARQL W3C
Recommendation.
What you are basically doing is telling the parser to parse a triples-syntax file while passing it a quads-syntax file.
Change your file extension to .nq and use md.read(in, null); instead. This should then automatically detect that the file is in quads syntax. And of course also make sure that your file conforms to the N-Quads syntax, as defined here: http://www.w3.org/TR/n-quads/
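For illustration, a minimal sketch of reading quads through ARQ's RDFDataMgr, assuming a file named data.nq and a Jena build that bundles RIOT (the named-graph IRI below is hypothetical):

import org.apache.jena.riot.RDFDataMgr;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.rdf.model.Model;

// Quads carry a graph name, so they belong in a Dataset rather than a single Model.
Dataset ds = RDFDataMgr.loadDataset("data.nq"); // format detected from the .nq extension
// The default graph and each named graph are still accessible as Models.
Model defaultModel = ds.getDefaultModel();
Model graphModel = ds.getNamedModel("http://example.org/graph1");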
I have an XML file which I need to parse using XMLInputFactory (javax.xml.stream).
The XML is of this type:
<SACL>
<Criteria>Dinner</Criteria>
<Value> Rice &amp; Beverage </Value>
</SACL>
I am parsing this using an XML event reader in Java, and my code is:
if (xmlEvent.asStartElement().getName().getLocalPart().equals("Value")) {
    xmlEvent = xmlEventReader.nextEvent();
    value = xmlEvent.asCharacters().getData().trim(); // the issue is inside this if block only
}
xmlEventReader = XMLInputFactory.newInstance().createXMLEventReader(new FileInputStream(file.getPath())); // using javax.xml.stream.XMLEventReader
But it parses the data as just "Rice" (missing "& Beverage").
Expected Output : Rice & Beverage
Can someone suggest what the issue with &amp; is and how it can be fixed?
I've worked on a project that did XML parsing recently, so I know almost exactly what's happening here: the parser sees &amp; as a separate event (XMLStreamConstants.ENTITY_REFERENCE).
Try setting property XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES to true in your XML parser's options. If the parser is properly implemented, the entity is replaced and made part of the text.
Keep in mind that the parser is allowed to split the text into multiple character events, especially for large pieces of text. Setting the property XMLInputFactory.IS_COALESCING to true should prevent that.
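For illustration, a minimal sketch of configuring both properties before creating the reader (the file variable is assumed from your snippet):

import java.io.FileInputStream;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;

XMLInputFactory factory = XMLInputFactory.newInstance();
// Resolve entity references such as &amp; into character data instead of separate events.
factory.setProperty(XMLInputFactory.IS_REPLACING_ENTITY_REFERENCES, true);
// Coalesce adjacent character events so the element text arrives as a single event.
factory.setProperty(XMLInputFactory.IS_COALESCING, true);
XMLEventReader xmlEventReader = factory.createXMLEventReader(new FileInputStream(file.getPath()));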
I have a Java program which processes XML files. When transforming one XML file into another based on a certain schema (XSD/XSL), it throws the following error.
The error is thrown only for one XML file, which contains a tag like this:
<abc>xxx yyyy “ggggg vvvv” uuuu</abc>
But after removing or re-typing the two quotes, it doesn't throw the error.
Can anybody help me resolve this issue?
java.io.CharConversionException: Character larger than 4 bytes are not supported: byte 0x93 implies a length of more than 4 bytes
at org.apache.xmlbeans.impl.piccolo.xml.UTF8XMLDecoder.decode(UTF8XMLDecoder.java:162)
<?xml version= “1.0’ encoding =“UTF-8” standalone =“yes “?><xyz xmlns=“http://pqr.yy”><Header><abc> aaa “cccc” aaaaa vvv</abc></Header></xyz>
As others have reported in comments, it has failed because the typographical quotation marks are encoded in Windows-1252, not in UTF-8, so the parser hasn't managed to decode them.
The encoding declared in the XML declaration must match the actual encoding used for the characters.
To find out how this error arose, and to prevent it happening again, we would need to know where this (wannabe) XML file came from, and how it was created.
My guess would be that someone used a "smart" editor; Microsoft editors in particular are notorious for changing what you type to what Microsoft think you wanted to type. If you're editing XML by hand it's best to use an XML-aware editor.
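If you cannot get the file produced correctly at the source, one workaround is to re-decode it using the encoding it was actually written in and save it as real UTF-8. A minimal sketch, assuming Windows-1252 and hypothetical file names:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Decode the raw bytes as Windows-1252 (the assumed actual encoding),
// then write the text back out as genuine UTF-8 to match the XML declaration.
byte[] raw = Files.readAllBytes(Paths.get("broken.xml"));
String text = new String(raw, Charset.forName("windows-1252"));
Files.write(Paths.get("fixed.xml"), text.getBytes(StandardCharsets.UTF_8));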
When parsing a set of ontologies, some of the files give me the following errors while others work well (note that I am using OWL API 5.1.6):
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1033)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:933)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadImports(OWLOntologyManagerImpl.java:1630)
....
Could not parse JSONLD org.eclipse.rdf4j.rio.jsonld.JSONLDParser.parse(JSONLDParser.java:110)
org.semanticweb.owlapi.rio.RioParserImpl.parseDocumentSource(RioParserImpl.java:172)
org.semanticweb.owlapi.rio.RioParserImpl.parse(RioParserImpl.java:125)
....
Stack trace:
org.eclipse.rdf4j.rio.RDFParseException: unqualified attribute 'class' not allowed [line 3, column 65]
org.semanticweb.owlapi.rio.RioParserImpl.parse(RioParserImpl.java:138)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyFactoryImpl.loadOWLOntology(OWLOntologyFactoryImpl.java:193)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.load(OWLOntologyManagerImpl.java:1071)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:1033)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadOntology(OWLOntologyManagerImpl.java:933)
uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.loadImports(OWLOntologyManagerImpl.java:1630)
....
and many errors like those.
Any idea how to fix these problems?
UPDATE:
The snippet that loads the ontology is:
File file = new File("C:\\vocabs\\" + Ontofile.getName());
OWLOntologyManager m = OWLManager.createOWLOntologyManager();
OWLOntology o;
o = m.loadOntologyFromOntologyDocument(file);
OWLDocumentFormat format = m.getOntologyFormat(o);
OWLOntologyXMLNamespaceManager nsManager = new OWLOntologyXMLNamespaceManager(o, format);
This error is saying that one of the ontologies you're parsing is not in valid JSON-LD format.
To fix this, you have to do two things:
Ensure the format being used is the one you expect: if no format is specified, OWLAPI will attempt to use all available parsers until one of them successfully parses the ontology
Fix the input data if the format is correct: in this case, for JSON-LD, the error is on line 3
If the format being used is not what it should be, you need to specify a format in your code (see the sketch below); to help with that, you'll have to add a snippet of the code you're using to parse your files.
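For illustration, a minimal sketch of passing a format hint when loading, assuming the files are actually RDF/XML (the file name is hypothetical):

import java.io.File;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.formats.RDFXMLDocumentFormat;
import org.semanticweb.owlapi.io.FileDocumentSource;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyManager;

// With a document source that carries a format hint, the matching parser
// is tried first instead of OWLAPI probing every registered parser.
OWLOntologyManager m = OWLManager.createOWLOntologyManager();
FileDocumentSource source = new FileDocumentSource(
        new File("C:\\vocabs\\example.owl"), new RDFXMLDocumentFormat());
OWLOntology o = m.loadOntologyFromOntologyDocument(source);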
Currently, I'm working on a feature that involves parsing XML that we receive from another product. I decided to run some tests against some actual customer data, and it looks like the other product is allowing input from users that should be considered invalid. Anyway, I still have to try and figure out a way to parse it. We're using javax.xml.parsers.DocumentBuilder and I'm getting an error on input that looks like the following.
<xml>
...
<description>Example:Description:<THIS-IS-PART-OF-DESCRIPTION></description>
...
</xml>
As you can tell, the description has what appears to be an invalid tag inside of it (<THIS-IS-PART-OF-DESCRIPTION>). Now, this description tag is known to be a leaf tag and shouldn't have any nested tags inside of it. Regardless, this is still an issue and yields an exception on DocumentBuilder.parse(...)
I know this is invalid XML, but it's predictably invalid. Any ideas on a way to parse such input?
That "XML" is worse than invalid – it's not well-formed; see Well Formed vs Valid XML.
An informal assessment of the predictability of the transgressions does not help. That textual data is not XML. No conformant XML tools or libraries can help you process it.
Options, most desirable first:
Have the provider fix the problem on their end. Demand well-formed XML. (Technically the phrase well-formed XML is redundant but may be useful for emphasis.)
Use a tolerant markup parser to clean up the problem ahead of parsing as XML:
Standalone: xmlstarlet has robust recovering and repair capabilities (credit: RomanPerekhrest):
xmlstarlet fo -o -R -H -D bad.xml 2>/dev/null
Standalone and C/C++: HTML Tidy works with XML too. Taggle is a port of TagSoup to C++.
Python: Beautiful Soup is Python-based. See notes in the Differences between parsers section. See also answers to this question for more suggestions for dealing with not-well-formed markup in Python, including especially lxml's recover=True option. See also this answer for how to use codecs.EncodedFile() to clean up illegal characters.
Java: TagSoup and JSoup focus on HTML. FilterInputStream can be used for preprocessing cleanup.
.NET:
XmlReaderSettings.CheckCharacters can
be disabled to get past illegal XML character problems.
@jdweng notes that XmlReaderSettings.ConformanceLevel can be set to ConformanceLevel.Fragment so that XmlReader can read XML well-formed parsed entities lacking a root element.
@jdweng also reports that XmlReader.ReadToFollowing() can sometimes be used to work around XML syntactical issues, but note the rule-breaking warning in #3 below.
Microsoft.Language.Xml.XMLParser is said to be “error-tolerant”.
Go: Set Decoder.Strict to false as shown in this example by @chuckx.
PHP: See DOMDocument::$recover and libxml_use_internal_errors(true). See nice example here.
Ruby: Nokogiri supports “Gentle Well-Formedness”.
R: See htmlTreeParse() for fault-tolerant markup parsing in R.
Perl: See XML::Liberal, a "super liberal XML parser that parses broken XML."
Process the data as text manually using a text editor or programmatically using character/string functions. Doing this programmatically can range from tricky to impossible, as what appears to be predictable often is not: rule breaking is rarely bound by rules.
For invalid character errors, use regex to remove/replace invalid characters:
PHP: preg_replace('/[^\x{0009}\x{000a}\x{000d}\x{0020}-\x{D7FF}\x{E000}-\x{FFFD}]+/u', ' ', $s);
Ruby: string.tr("^\u{0009}\u{000a}\u{000d}\u{0020}-\u{D7FF}\u{E000}-\u{FFFD}", ' ')
JavaScript: inputStr.replace(/[^\x09\x0A\x0D\x20-\xFF\x85\xA0-\uD7FF\uE000-\uFDCF\uFDE0-\uFFFD]/gm, '')
For ampersands, use regex to replace matches with &amp; (credit: blhsin, demo; a Java sketch appears after this list):
&(?!(?:#\d+|#x[0-9a-f]+|\w+);)
Note that the above regular expressions won't take comments or CDATA
sections into account.
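In Java, that ampersand fix might look like the following sketch (as noted above, comments and CDATA sections are not taken into account):

import java.util.regex.Pattern;

// Escape bare ampersands that are not already the start of an entity reference.
private static final Pattern BARE_AMP =
        Pattern.compile("&(?!(?:#\\d+|#x[0-9a-f]+|\\w+);)", Pattern.CASE_INSENSITIVE);

static String escapeBareAmpersands(String xml) {
    return BARE_AMP.matcher(xml).replaceAll("&amp;");
}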
A standard XML parser will NEVER accept invalid XML, by design.
Your only option is to pre-process the input to remove the "predictably invalid" content, or wrap it in CDATA, prior to parsing it.
The accepted answer is good advice, and contains very useful links.
I'd like to add that this, and many other cases of not-well-formed and/or DTD-invalid XML, can be repaired using SGML, the ISO-standardized superset of HTML and XML. In your case, what works is to declare the bogus THIS-IS-PART-OF-DESCRIPTION element as an SGML empty element and then use e.g. the osx program (part of the OpenSP/OpenJade SGML package) to convert it to XML. For example, if you supply the following to osx
<!DOCTYPE xml [
<!ELEMENT xml - - ANY>
<!ELEMENT description - - ANY>
<!ELEMENT THIS-IS-PART-OF-DESCRIPTION - - EMPTY>
]>
<xml>
<description>blah blah
<THIS-IS-PART-OF-DESCRIPTION>
</description>
</xml>
it will output well-formed XML for further processing with the XML tools of your choice.
Note, however, that your example snippet has another problem in that element names starting with the letters xml or XML or Xml etc. are reserved in XML, and won't be accepted by conforming XML parsers.
IMO these cases should be solved by using JSoup.
Below is a not-really-an-answer for this specific case, but I found this on the web (thanks to inuyasha82 on Coderwall). This snippet inspired me for another similar problem while dealing with malformed XML, so I share it here.
Please do not edit what is below, as it is shown as it appears on the original website.
To be valid, the XML format requires a unique root element declared in the document.
So, for example, a valid XML document is:
<root>
<element>...</element>
<element>...</element>
</root>
But if you have a document like:
<element>...</element>
<element>...</element>
<element>...</element>
<element>...</element>
This will be considered malformed XML, so many XML parsers just throw an exception complaining about a missing root element.
In this example there is a solution on how to solve that problem and successfully parse the malformed XML above.
Basically what we will do is add a root element programmatically.
So first of all you have to open the resource that contains your "malformed" XML (e.g. a file):
File file = new File(pathtofile);
Then open a FileInputStream:
FileInputStream fis = new FileInputStream(file);
If we try to parse this stream with any XML library at this point, we will get a malformed document exception.
Now we create a list of InputStream objects with three elements:
A ByteArrayInputStream that contains the string <root>
Our FileInputStream
A ByteArrayInputStream with the string </root>
So the code is:
List<InputStream> streams =
Arrays.asList(
new ByteArrayInputStream("<root>".getBytes()),
fis,
new ByteArrayInputStream("</root>".getBytes()));
Now, using a SequenceInputStream, we create a container for the list created above:
InputStream cntr =
    new SequenceInputStream(Collections.enumeration(streams));
Now we can use any XML parser library on cntr, and the document will be parsed without any problem (checked with the StAX library).
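For completeness, feeding the wrapped stream to a parser might look like this (a sketch using DocumentBuilder; the original author checked with StAX):

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// The synthetic <root> wrapper makes the stream well-formed,
// so a standard parser accepts it.
Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(cntr);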
I am getting an error while using the TextExtractor of the PDF Clown library. The code I used is:
TextExtractor textExtractor = new TextExtractor(true, true);
for (final Page page : file.getDocument().getPages())
{
    System.out.println("\nScanning page " + (page.getIndex() + 1) + "...\n");
    // Extract the page text!
    Map textStrings = textExtractor.extract(page);
}
A part of the error I got is:
Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.documents.contents.fonts.Encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader.<init>
<about 30 lines more>
I also found out that this happens when my PDF contains some bullets, for example:
item 1
item 2
item 3
Please help me extract the text from such PDFs.
(The following comment turned out to be the solution:)
Using your highlighter.java class (provided on your Google Drive in a comment) together with the current PDF Clown trunk version as a jar, the PDF was processed without incident, in particular without a NullPointerException (though some of the highlights were not at the right position).
After looking at your shared Google Drive contents, though, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.
The PDF Clown jar files contain additional resources, though, which your setup consequently did not include. Thus:
Your highlighter.java has to be used with pdfclown.jar in the classpath.