Parsing XML comments with DOM


I need to parse XML tags which are commented out, like this:
<DataType Name="SecureCode" Size="4" Type="NVARCHAR">
<!-- <Validation>
<Regex JavaPattern="^[0-9]*$" JSPattern="^[0-9]*$"/>
</Validation> -->
<UIType Size="4" UITableSize="4"/>
</DataType>
But all I found was setIgnoringComments(boolean):
Document doc = docBuilder.parse(new File(PathChecker.getDataTypesFile()));
docFactory.setIgnoringComments(true); // true or false, no difference
But it doesn't seem to change anything.
Is there any other way to parse these comments? I have to use DOM.
Regards

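The method setIgnoringComments controls whether comments are removed from the DOM tree during parsing. It has to be called on the factory before the DocumentBuilder is created and before parse() runs; calling it after parsing, as in the question, has no effect.
With setIgnoringComments(false) you can get the comment text like this:
NodeList nl = doc.getDocumentElement().getChildNodes();
for (int i = 0; i < nl.getLength(); i++) {
    // Comment nodes are kept in the tree only when comments are not ignored
    if (nl.item(i).getNodeType() == Node.COMMENT_NODE) {
        Comment comment = (Comment) nl.item(i);
        System.out.println(comment.getData());
    }
}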
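Building on the loop above, the text of a comment can itself be fed back into a DOM parser, which yields the commented-out tags as regular elements. The following is only a sketch under assumptions: the class name CommentedXmlDemo and its helper methods are made up for illustration, and it assumes each comment contains exactly one well-formed element (as in the question's example).

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Comment;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of the original question or answer.
class CommentedXmlDemo {

    // Parses an XML string; the factory option is set BEFORE the builder is created.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setIgnoringComments(false); // keep Comment nodes in the tree
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Parses every comment that is a direct child of 'parent' as its own small document.
    static List<Element> commentedOutElements(Element parent) throws Exception {
        List<Element> result = new ArrayList<>();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.COMMENT_NODE) {
                String text = ((Comment) child).getData().trim();
                // Assumption: the comment holds exactly one well-formed element.
                result.add(parse(text).getDocumentElement());
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<DataType Name=\"SecureCode\" Size=\"4\" Type=\"NVARCHAR\">"
                + "<!-- <Validation><Regex JavaPattern=\"^[0-9]*$\" JSPattern=\"^[0-9]*$\"/></Validation> -->"
                + "<UIType Size=\"4\" UITableSize=\"4\"/>"
                + "</DataType>";
        Element root = parse(xml).getDocumentElement();
        for (Element e : commentedOutElements(root)) {
            Element regex = (Element) e.getElementsByTagName("Regex").item(0);
            System.out.println(e.getTagName() + " -> " + regex.getAttribute("JavaPattern"));
        }
    }
}
```

A comment that holds plain text or several sibling elements would make the inner parse fail, so real code would probably want to catch the SAXException and skip such comments.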

Since there does not seem to be a "regular" way of solving the problem, I have just removed the comment markers:
// try-with-resources closes (and flushes) both files, which the original snippet did not do
try (BufferedReader br = new BufferedReader(new FileReader(new File(PathChecker.getDataTypesFile())));
     BufferedWriter bw = new BufferedWriter(new FileWriter(new File(PathChecker.getDataTypesFileWithoutComments())))) {
    String line;
    while ((line = br.readLine()) != null) {
        // Strip only the delimiters so the commented-out tags become regular elements
        line = line.replace("<!--", "").replace("-->", "") + "\n";
        bw.write(line);
    }
}

Related

How to parse this provided XML with java.xml.xpath?

I am trying to parse this XML:
<?xml version="1.0" encoding="UTF-8"?>
<veranstaltungen>
<veranstaltung id="201611211500#25045271">
<titel>Mal- und Zeichen-Treff</titel>
<start>2016-11-21 15:00:00</start>
<veranstaltungsort id="20011507">
<name>Freizeitclub - ganz unbehindert </name>
<anschrift>Macht los e.V.
Lipezker Straße 48
03048 Cottbus
</anschrift>
<telefon>xxxx xxxx </telefon>
<fax>0355 xxxx</fax>
[...]
</veranstaltungen>
As you can see, some of the texts have whitespace or even linebreaks. I am having issues with the text from the node anschrift, because I need to find the right location data in a database. Problem is, the returned String is:
Macht los e.V.Lipezker Straße 4803048 Cottbus
instead of:
Macht los e.V. Lipezker Straße 48 03048 Cottbus
I know the correct way to parse it should be with normalize-space(), but I cannot quite work out how to do it. I tried this:
// Does not work; afaik because XPath 1.0 normalizes just the first node
xPath.compile("normalize-space(veranstaltungen/veranstaltung[position()=1]/veranstaltungsort/anschrift/text())");
// Does not work
xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort[normalize-space(anschrift/text())]");
I also tried the solution given here: xpath-normalize-space-to-return-a-sequence-of-normalized-strings
xPathExpression = xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort");
NodeList result = (NodeList) xPathExpression.evaluate(doc, XPathConstants.NODESET);
String normalize = "normalize-space(.)";
xPathExpression = xPath.compile(normalize);
int length = result.getLength();
for (int i = 0; i < length; i++) {
System.out.println(xPathExpression.evaluate(result.item(i), XPathConstants.STRING));
}
System.out prints:
Macht los e.V.Lipezker Straße 4803048 Cottbus
What am I doing wrong?
Update
I have a workaround already, but this can't be the solution. The following few lines show how I put the String together from the HTTPResponse:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
final StringBuilder stringBuilder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
// stringBuilder.append(line);
// WORKAROUND: Add a space after each line
stringBuilder.append(line).append(" ");
}
// Work with the read lines
}
I would rather have a solid solution.

Originally, you seem to be using the following code for reading the XML:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
final StringBuilder stringBuilder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
stringBuilder.append(line);
}
}
This is where your newlines get eaten: readLine() returns each line without its trailing line terminator. If you then parse the contents of the stringBuilder object, you get an incorrect DOM in which the text nodes no longer contain the original newlines from the XML, so the last word of one line is glued to the first word of the next.
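The effect is easy to reproduce without any HTTP or XML involved. The sketch below uses a hypothetical ReadLineDemo class that joins lines the same way as the loop above, reading from an in-memory string instead of the HTTP response.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical demo class showing that readLine() drops line terminators.
class ReadLineDemo {

    // Concatenates the lines exactly like the loop in the question does.
    static String joinLines(String text) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line); // the line terminator is not part of 'line'
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // \u00DF is the German sharp s, escaped to stay independent of the source encoding
        String anschrift = "Macht los e.V.\nLipezker Stra\u00DFe 48\n03048 Cottbus\n";
        System.out.println(joinLines(anschrift));
    }
}
```

Because the line breaks are gone before the parser ever sees the text, normalize-space() has no whitespace left to turn into a single space, which matches the output described in the question.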

Thanks to Markus's help, I was able to solve the issue. The cause was the readLine() method of BufferedReader discarding line breaks. The following code snippet works for me (maybe it can be improved):
public Document getDocument() throws IOException, ParserConfigurationException, SAXException {
    final HttpResponse response = getResponse(); // returns an HttpResponse
    final HttpEntity entity = response.getEntity();
    final Charset charset = ContentType.getOrDefault(entity).getCharset();
    // try-with-resources closes the InputStreamReader (and the stream it wraps) after parsing
    try (InputStreamReader isr = new InputStreamReader(entity.getContent(), charset == null ? Charset.forName("UTF-8") : charset)) {
        return documentBuilderFactory.newDocumentBuilder().parse(new InputSource(isr));
    }
}
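Once the document is parsed from a stream that still contains its line breaks, normalize-space() behaves as expected. The following sketch uses only the standard javax.xml APIs; the class name NormalizeSpaceDemo and the shortened sample XML are assumptions made for illustration, and the XML is parsed from a string rather than from the HTTP entity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class: normalize-space() on a document that kept its line breaks.
class NormalizeSpaceDemo {

    // Returns the address of the first event with whitespace runs collapsed to single spaces.
    static String address(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        XPath xPath = XPathFactory.newInstance().newXPath();
        return xPath.evaluate(
                "normalize-space(/veranstaltungen/veranstaltung[1]/veranstaltungsort/anschrift)",
                doc);
    }

    public static void main(String[] args) throws Exception {
        String xml = "<veranstaltungen><veranstaltung><veranstaltungsort>"
                + "<anschrift>Macht los e.V.\n      Lipezker Stra\u00DFe 48\n      03048 Cottbus\n   </anschrift>"
                + "</veranstaltungsort></veranstaltung></veranstaltungen>";
        System.out.println(address(xml));
    }
}
```

The expression passes the anschrift element itself to normalize-space(), so the function works on the element's whole string value rather than on a text() node set.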

I am using the epublib and I am trying to get the entire chapter of a book at a time

I am trying to get one chapter of a book at a time. I am using Paul Siegmann's epublib. I am able to get all the text from the book, but I am not sure how to split it into chapters or where to go from here.
// Find the InputStream for the book
InputStream epubInputStream = assetManager
        .open("the_planet_mappers.epub");
// Load the Book from the InputStream
mThePlanetMappersBookEpubLib = (new EpubReader()).readEpub(epubInputStream);
Spine spine = new Spine(mThePlanetMappersBookEpubLib.getTableOfContents());
for (SpineReference bookSection : spine.getSpineReferences()) {
    Resource res = bookSection.getResource();
    try {
        InputStream is = res.getInputStream();
        BufferedReader r = new BufferedReader(new InputStreamReader(is));
        String line;
        while ((line = r.readLine()) != null) {
            line = Html.fromHtml(line).toString();
            Log.i("Read it ", line);
            mEntireBook.append(line);
        }
    } catch (IOException e) {
        // Do not swallow the exception silently
        Log.e("Read it ", "Could not read book section", e);
    }
}

I don't know if you're still looking for an answer, but...
I'm working on this too right now. This is the code I use to retrieve the content of the whole EPUB file, one list entry per page:
public ArrayList<String> getBookContent(Book bi) {
// GET THE CONTENTS OF ALL PAGES
StringBuilder string = new StringBuilder();
ArrayList<String> listOfPages = new ArrayList<>();
Resource res;
InputStream is;
BufferedReader reader;
String line;
Spine spine = bi.getSpine();
for (int i = 0; spine.size() > i; i++) {
res = spine.getResource(i);
try {
is = res.getInputStream();
reader = new BufferedReader(new InputStreamReader(is));
while ((line = reader.readLine()) != null) {
// FIRST PAGE LINE -> <?xml version="1.0" encoding="utf-8" standalone="no"?>
if (line.contains("<?xml version=\"1.0\" encoding=\"utf-8\" standalone=\"no\"?>")) {
string.delete(0, string.length());
}
// ADD THAT LINE TO THE FINAL STRING REMOVING ALL THE HTML
string.append(Html.fromHtml(formatLine(line)));
// LAST PAGE LINE -> </html>
if (line.contains("</html>")) {
listOfPages.add(string.toString());
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
return listOfPages;
}
private String formatLine(String line) {
if (line.contains("http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd")) {
line = line.substring(line.indexOf(">") + 1, line.length());
}
// REMOVE STYLES AND COMMENTS IN HTML
if ((line.contains("{") && line.contains("}"))
|| ((line.contains("/*")) && line.contains("*/"))
|| (line.contains("<!--") && line.contains("-->"))) {
line = line.substring(line.length());
}
return line;
}
As you may have noticed, I still need to improve the filter, but I have every chapter of the book in my ArrayList. Now I just need to read from that list, e.g. myList.get(0), and it is done.
To show the text properly, I'm using the bluejamesbond:textjustify library (https://github.com/bluejamesbond/TextJustify-Android).
It is easy to use and powerful.
I hope it helps you, and if anybody finds a better way to filter that HTML, please let me know.

Java JAXB UTF-8/ISO conversions

I have an XML file that contains non-standard characters (like a weird "quote").
I read the XML using UTF-8 / ISO / ASCII and unmarshalled it:
BufferedReader br = new BufferedReader(new InputStreamReader(
(conn.getInputStream()),"ISO-8859-1"));
String output;
StringBuffer sb = new StringBuffer();
while ((output = br.readLine()) != null) {
//fetch XML
sb.append(output);
}
try {
jc = JAXBContext.newInstance(ServiceResponse.class);
Unmarshaller unmarshaller = jc.createUnmarshaller();
ServiceResponse OWrsp = (ServiceResponse) unmarshaller
.unmarshal(new InputSource(new StringReader(sb.toString())));
I have an Oracle function that takes ISO-8859-1 codes and converts/maps them to "literal" symbols, i.e. "&#x2019" => "left single quote".
JAXB unmarshalling using ISO displays the characters with the ISO conversion fine, i.e. all weird single quotes are encoded as "&#x2019".
So suppose my string is: class of 10–11‐year‐olds (note the weird - between 11 and year).
jc = JAXBContext.newInstance(ScienceProductBuilderInfoType.class);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
//save a temp file
File file2 = new File("tmp.xml");
This saves the following in the file:
class of 10–11‐year‐olds. (This is what I want, so saving to a file works!)
[Side note: I have read the file using a Java FileReader, and it outputs the above string fine.]
The issue is that the String produced by the JAXB unmarshaller has weird output; for some reason I cannot get the string to represent –.
When I
1: check the unmarshalled XML output:
class of 10?11?year?olds
2: check the file output:
class of 10–11‐year‐olds
I even tried to read the saved XML file and then unmarshal that (in the hope of getting the – in my string):
String sCurrentLine;
BufferedReader br = new BufferedReader(new FileReader("tmp.xml"));
StringBuffer sb = new StringBuffer();
while ((sCurrentLine = br.readLine()) != null) {
sb.append(sCurrentLine);
}
ScienceProductBuilderInfoType rsp = (ScienceProductBuilderInfoType) unm
.unmarshal(new InputSource(new StringReader(sb.toString())));
To no avail.
Any ideas how to get the ISO-8859-1 encoded character in JAXB?

Solved using this tidbit of code found on Stack Overflow, which writes every character outside the Basic Latin block as a numeric character reference:
final class HtmlEncoder {
private HtmlEncoder() {}
public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
T out) throws java.io.IOException {
for (int i = 0; i < sequence.length(); i++) {
char ch = sequence.charAt(i);
if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
out.append(ch);
} else {
int codepoint = Character.codePointAt(sequence, i);
// handle supplementary range chars
i += Character.charCount(codepoint) - 1;
// emit entity
out.append("&#x");
out.append(Integer.toHexString(codepoint));
out.append(";");
}
}
return out;
}
}
String escaped = HtmlEncoder.escapeNonLatin(MYSTRING, new StringBuilder()).toString();
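One likely explanation for the question marks in the question is that characters such as the en dash (U+2013) and the hyphen (U+2010) simply do not exist in ISO-8859-1, so encoding them with that charset substitutes '?'. The sketch below illustrates this with a hypothetical UnmappableCharDemo class; it is not the asker's JAXB code, only a demonstration of the charset behaviour.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: what happens to characters a charset cannot represent.
class UnmappableCharDemo {

    // Encodes the text to bytes and decodes it again with the same charset.
    static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's replacement byte, which is '?' for ISO-8859-1.
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        // "10-11-year-olds" with U+2013 (en dash) and U+2010 (hyphen), as in the question
        String text = "10\u201311\u2010year\u2010olds";
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1));
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

This is why escaping such characters as numeric references (as HtmlEncoder does) or staying in UTF-8 end to end preserves them, while an ISO-8859-1 step in the middle does not.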

Replace String in file using marker in Java

I have a problem when trying to replace a String in a file.
In my file I have:
<!-- Header -->
<header fontName="Arial" size="24"/>
<!-- Content -->
<content>
<fontName="Arial" size="11"/>
</content>
How can I replace fontName and size only under <!-- Header -->?
This is my code for the replacement:
public class StringReplacement {
public static void main(String args[])
{
try
{
File file = new File("file.xml");
BufferedReader reader = new BufferedReader(new FileReader(file));
String line = "", oldtext = "";
while((line = reader.readLine()) != null)
{
oldtext += line + "\r\n";
}
reader.close();
// replace a word in a file
//String newtext = oldtext.replaceAll("drink", "Love");
//To replace a line in a file
String newtext = oldtext.replaceAll("Arial", "Times New Roman");
FileWriter writer = new FileWriter("file.xml");
writer.write(newtext);
writer.close();
}
catch (IOException ioe)
{
ioe.printStackTrace();
}
}
}
But it replaces every occurrence in the file, not just the one in the header.

If you are sure that this is the format of the file you can simply do the following:
String newtext = oldtext.replaceAll("header fontName=\"Arial\"", "header fontName=\"Times New Roman\"");
By the way, use a StringBuilder instead of += to append Strings in a loop.

In your read loop while((line = reader.readLine()) != null) you could test if you found the <!-- Header --> line (and not yet the <!-- Content --> line), and do your replace only in the header block.
boolean inHeader = false;
while ((line = reader.readLine()) != null) {
    // Track which marker comment was seen last
    if (line.equals("<!-- Header -->")) {
        inHeader = true;
    } else if (line.equals("<!-- Content -->")) {
        inHeader = false;
    }
    // Replace only while inside the header block
    if (inHeader) {
        line = line.replaceAll("Arial", "Times New Roman");
    }
    oldtext += line + "\r\n";
}
And remove the line
String newtext = oldtext.replaceAll("Arial", "Times New Roman");
EDIT: It would probably be cleaner to detect arbitrary tags rather than hardcoding Header and Content. That would require a regular expression to match <!-- (tag) --> and test if tag is equal to "Header", but this approach is easier, of course.
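The regular-expression variant mentioned in the EDIT could look roughly like the sketch below. The class name MarkerReplaceDemo and the method replaceInSection are hypothetical, and the sketch assumes that each marker comment sits on a line of its own and that the marker name is a single word.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: replace text only below a given marker comment.
class MarkerReplaceDemo {

    // Matches a line consisting only of a marker comment such as <!-- Header -->
    private static final Pattern MARKER =
            Pattern.compile("^\\s*<!--\\s*(\\w+)\\s*-->\\s*$");

    static String replaceInSection(String text, String section,
                                   String target, String replacement) {
        StringBuilder out = new StringBuilder();
        boolean inSection = false;
        for (String line : text.split("\\R")) {
            Matcher m = MARKER.matcher(line);
            if (m.matches()) {
                // Every marker line switches the current section on or off
                inSection = m.group(1).equals(section);
            } else if (inSection) {
                line = line.replace(target, replacement);
            }
            out.append(line).append(System.lineSeparator());
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<!-- Header -->\n"
                + "<header fontName=\"Arial\" size=\"24\"/>\n"
                + "<!-- Content -->\n"
                + "<content>\n"
                + "<fontName=\"Arial\" size=\"11\"/>\n"
                + "</content>\n";
        System.out.print(replaceInSection(xml, "Header", "Arial", "Times New Roman"));
    }
}
```

String.replace is used instead of replaceAll because the target here is a literal, not a regular expression, and the StringBuilder avoids the repeated string copying of += in a loop.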

stax to DOM converter

I want to read a DOM document using StAX stream readers and write it back using StAX stream writers.
I want to modify the XML file and change some element values.
I want the cursor to point at a certain element in the XML file before building the DOM tree.
I wrote this code, but the XML file is not modified.
Can anybody help me?
FileInputStream input = new FileInputStream("cv.xml");
XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(input);
BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
//-*-*- get new entries from input stream
System.out.println("<< CV >>\n -> Modify the first reference\n ** Modify The Name **");
System.out.print(" Enter degree : ");
String degree = in.readLine();
System.out.print(" Enter first name : ");
String fName = in.readLine();
System.out.print(" Enter last name : ");
String lName = in.readLine();
System.out.println(" ** Modify The Address ** ");
System.out.print(" Enter new city : ");
String newCity = in.readLine();
System.out.print(" Enter new country : ");
String newCountry = in.readLine();
//-*-*- let the reader point at the first "reference" element
int eventType;
boolean ref = false, fname = false;
while (reader.hasNext()) {
eventType = reader.next();
switch (eventType) {
case XMLEvent.START_ELEMENT:
if (reader.getLocalName().equalsIgnoreCase("references"))
return;
}
}
//-*-*- build DOM trees using Stax stream reader
Document doc = new DOMConverter().buildDocument(reader);
reader.close();
input.close();
//-*-*- start modification
Element firstRef = (Element)doc.getElementsByTagName("reference").item(0);
NodeList name = (NodeList)firstRef.getElementsByTagName("name");
//-*-*- modify the degree (Dr. , Eng. , Dev. ,etc)
Attr att = (Attr)name.item(0).getAttributes().item(0);
((Node)att).setNodeValue(degree);
//-*-*- modify first name
NodeList firstName = (NodeList)firstRef.getElementsByTagName("fname");
NodeList firstNameChilds = (NodeList)firstName.item(0).getChildNodes();
((Node)firstNameChilds.item(0)).setNodeValue(fName);
//-*-*- modify last name
NodeList lastName = (NodeList)firstRef.getElementsByTagName("lname");
NodeList lastNameChilds = (NodeList)lastName.item(0).getChildNodes();
((Node)lastNameChilds.item(0)).setNodeValue(lName);
//-*-*- modify city
NodeList city = (NodeList)firstRef.getElementsByTagName("city");
NodeList cityChilds = (NodeList)city.item(0).getChildNodes();
((Node)cityChilds.item(0)).setNodeValue(newCity);
//-*-*- modify country
NodeList country = (NodeList)firstRef.getElementsByTagName("country");
NodeList countryChilds = (NodeList)country.item(0).getChildNodes();
((Node)countryChilds.item(0)).setNodeValue(newCountry);
//-*-*- write DOM document
FileOutputStream out = new FileOutputStream("cv.xml");
XMLStreamWriter sw = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
new DOMConverter().writeDocument(doc, sw);
sw.close();
out.close();

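It's probably because you are returning when you find the "references" element: return leaves the whole method, so the code that builds, modifies and writes the DOM never runs. Maybe break is what you meant. Note that a plain break inside the switch only leaves the switch, not the surrounding while loop, so you need a labeled break or a flag to stop reading.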
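One way to avoid the return/break confusion is to move the positioning loop into a small helper, where return ends only the helper and the caller carries on. The sketch below uses just the standard javax.xml.stream API; the class name StaxSeekDemo, the seek method and the sample XML are assumptions for illustration, and the DOMConverter step from the question is left out.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: position a StAX cursor on an element without leaving the caller.
class StaxSeekDemo {

    // Advances the reader to the first start tag with the given local name.
    // Returns false if the end of the document is reached without a match.
    static boolean seek(XMLStreamReader reader, String localName) throws XMLStreamException {
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.START_ELEMENT
                    && reader.getLocalName().equalsIgnoreCase(localName)) {
                return true; // ends only this helper; the caller keeps running
            }
        }
        return false;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<cv><references><reference><name degree=\"Dr.\"/></reference></references></cv>";
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        if (seek(reader, "references")) {
            // The DOM building and writing from the question would go here.
            System.out.println("Positioned at <" + reader.getLocalName() + ">");
        }
        reader.close();
    }
}
```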
