How to read/write Java ASCII Characters value with XMLStreamReader? - java

I'd like to use XMLStreamReader to read an XML file that contains horizontal tab character references (&#9;), for example:
<tag>foo&#9;bar</tag>
and print it out or write it back to another XML file.
Google tells me to set javax.xml.stream.isCoalescing to true in XMLInputFactory, but my test code below does not work as expected.
public static void main(String[] args) throws IOException, XMLStreamException {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    factory.setProperty(factory.IS_COALESCING, true);
    System.out.println("IS_COALESCING supported ? " + factory.isPropertySupported(factory.IS_COALESCING));
    System.out.println("factory IS_COALESCING value is " + factory.getProperty(factory.IS_COALESCING));
    String rawString = "<tag>foo&#9;bar</tag>";
    XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rawString));
    System.out.println("reader IS_COALESCING value is " + reader.getProperty(factory.IS_COALESCING));
    PrintWriter pw = new PrintWriter(System.out, true);
    while (reader.hasNext())
    {
        reader.next();
        pw.print(reader.getEventType());
        if (reader.hasText())
            pw.append(' ').append(reader.getText());
        pw.println();
    }
}
The output is
IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo bar
2
8
But I want to keep the horizontal tab as the character reference it was written as, like:
IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo&#9;bar
2
8
What am I missing here? Thanks.

From what I can see, the parsing part is correct; the text is just not printed the way you expect. The XML reader interprets the character reference (&#9;) as \t and represents it accordingly in the Java string.
Using Guava's XmlEscapers, I can produce something similar to what you want:
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import com.google.common.xml.XmlEscapers;

public class Test {
    public static void main(String[] args) throws IOException, XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        System.out.println("IS_COALESCING supported ? " + factory.isPropertySupported(XMLInputFactory.IS_COALESCING));
        System.out.println("factory IS_COALESCING value is " + factory.getProperty(XMLInputFactory.IS_COALESCING));
        String rawString = "<tag>foo&#9;bar</tag>";
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(rawString));
        System.out.println("reader IS_COALESCING value is " + reader.getProperty(XMLInputFactory.IS_COALESCING));
        PrintWriter pw = new PrintWriter(System.out, true);
        while (reader.hasNext()) {
            reader.next();
            pw.print(reader.getEventType());
            if (reader.hasText()) {
                // Re-escape the parsed text; the attribute escaper also escapes tabs
                pw.append(' ').append(XmlEscapers.xmlAttributeEscaper().escape(reader.getText()));
            }
            pw.println();
        }
    }
}
The output looks like this:
IS_COALESCING supported ? true
factory IS_COALESCING value is true
reader IS_COALESCING value is true
1
4 foo&#x9;bar
2
8
Some remarks on this:
The library itself is marked as unstable; there may be alternatives.
\t does not need to be escaped in XML content, so I had to use the attribute escaper. It works, but there may be side effects.
Is a 100%-copy of the content really required? Otherwise, I would suggest to let the XML libraries do their work and have them create the correct encoding.
If you really want to have a 1:1 copy, is it an option to specify the input as CDATA?
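If adding Guava is not an option, the same idea can be sketched with the JDK alone. The example below is a minimal sketch under stated assumptions, not a definitive implementation: it parses the sample with IS_COALESCING enabled (as in the question), checks that the reader hands back a real tab character, and then re-escapes the text by hand. The class TabEscapeSketch and the helpers escapeText and firstText are hypothetical names introduced here, and escapeText only covers the characters listed in its switch.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TabEscapeSketch {

    // Hypothetical helper: escapes markup characters and writes tabs as &#9;.
    static String escapeText(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '\t': sb.append("&#9;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Returns the text of the first CHARACTERS event, or null if there is none.
    static String firstText(String xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Coalescing keeps the text around the character reference in one event.
        factory.setProperty(XMLInputFactory.IS_COALESCING, true);
        XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
        while (reader.hasNext()) {
            if (reader.next() == XMLStreamConstants.CHARACTERS) {
                return reader.getText();
            }
        }
        return null;
    }

    public static void main(String[] args) throws XMLStreamException {
        String text = firstText("<tag>foo&#9;bar</tag>");
        // The parsed text holds a real tab character, not the reference.
        System.out.println("contains tab: " + (text != null && text.indexOf('\t') >= 0));
        // Re-escaping by hand turns the tab back into a character reference.
        System.out.println(escapeText(text));
    }
}
```

The escaped string is meant for a plain Writer. Passing it to XMLStreamWriter.writeCharacters would escape the ampersand a second time.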

Related

What is the property IS_COALESCING in XMLInputFactory for?

I don't really understand the definition from the Oracle documentation:
The property that requires the parser to coalesce adjacent character
data sections
I've tried a few examples with this property set to both true and false, and there don't seem to be any noticeable changes.
Can anyone please provide me with a better explanation and maybe an example in which it matters?
It can make a difference, for example, if the text content of an element is a mix of plain &-encoded text and CDATA-encoded text.
Demo
public static void main(String[] args) throws Exception {
    test(false);
    test(true);
}
static void test(boolean coalesce) throws Exception {
    System.out.println("IS_COALESCING = " + coalesce + ":");
    String xml = "<Root>abc<![CDATA[def]]>ghi</Root>";
    XMLInputFactory xmlInputFactory = XMLInputFactory.newFactory();
    xmlInputFactory.setProperty(XMLInputFactory.IS_COALESCING, coalesce);
    XMLEventReader reader = xmlInputFactory.createXMLEventReader(new StringReader(xml));
    while (reader.hasNext()) {
        XMLEvent event = reader.nextEvent();
        if (event.isCharacters())
            System.out.println(" \"" + event.asCharacters().getData() + "\"");
    }
}
Output
IS_COALESCING = false:
"abc"
"def"
"ghi"
IS_COALESCING = true:
"abcdefghi"
If you parsed into DOM, the <Root> element would have 3 Node children:
Text where getData() returns "abc"
CDATASection where getData() returns "def"
Text where getData() returns "ghi"
The XMLInputFactory property works the same as the DocumentBuilderFactory.setCoalescing(boolean coalescing) method:
Specifies that the parser produced by this code will convert CDATA nodes to Text nodes and append it to the adjacent (if any) text node. By default the value of this is set to false
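The DOM side of that comparison can be checked with a short sketch. It parses the same <Root> document with DocumentBuilderFactory.setCoalescing(boolean) switched off and on and lists the child nodes of the root element. The class name DomCoalescingSketch and the helper rootChildren are hypothetical names used only for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomCoalescingSketch {

    // Parses the sample document and returns the children of <Root>.
    static NodeList rootChildren(boolean coalesce) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setCoalescing(coalesce);
        String xml = "<Root>abc<![CDATA[def]]>ghi</Root>";
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getChildNodes();
    }

    public static void main(String[] args) throws Exception {
        for (boolean coalesce : new boolean[] {false, true}) {
            NodeList children = rootChildren(coalesce);
            System.out.println("coalescing = " + coalesce + ": "
                    + children.getLength() + " child node(s)");
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Node names are "#text" or "#cdata-section" for character data.
                System.out.println("  " + child.getNodeName() + " = " + child.getNodeValue());
            }
        }
    }
}
```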

Parse simple XML document from URL to String variable

I am attempting to read XML from a server on http://localhost:8000, into a string variable.
The layout of the XML document is very simple, and when navigating to http://localhost:8000, the following is displayed:
<result>Hello World</result>
Is there a simple way to parse this into a String variable from the localhost URL, so that, for example, if I were to run:
System.out.println(XMLVariable)
(where XMLVariable is the String variable in which the content is stored), the output on the command line would simply be "Hello World"?
You need to parse the response from the server into an XML data structure of some sort.
The easiest way that I'm aware of (in Java) to do that is to use dom4j.
It can be as simple as this...
SAXReader reader = new SAXReader();
Document document = reader.read("http://localhost:8000/");
// Read the text of the root <result> element
System.out.println(document.getRootElement().getText());
You can use StAX for parsing the response:
private Optional<String> extractResultValue(String xml) throws XMLStreamException {
    final XMLInputFactory factory = XMLInputFactory.newInstance();
    final XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
    while (reader.hasNext()) {
        XMLEvent event = reader.nextEvent();
        if (event.isCharacters()) {
            return Optional.ofNullable(event.asCharacters().getData());
        }
    }
    return Optional.empty();
}
Example call:
extractResultValue("<Your data from server>")
extractResultValue("<result>Hello World</result>") // Optional[Hello World]
extractResultValue("<result></result>") // Optional.empty
extractResultValue("<test>value</test>") // Optional[value]
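The method above takes the XML as a String, while the question starts from a URL. A minimal sketch of reading straight from the response stream with StAX might look like the following. It assumes the server from the question is listening on http://localhost:8000 and that the element is named result; the class ResultFetchSketch and the method readResult are hypothetical names, not part of any library.

```java
import java.io.InputStream;
import java.net.URI;
import java.util.Optional;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class ResultFetchSketch {

    // Returns the text of the first <result> element in the stream, if any.
    static Optional<String> readResult(InputStream in) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && "result".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    return Optional.of(reader.getElementText());
                }
            }
            return Optional.empty();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Assumption: a server is running at this address, as in the question.
        try (InputStream in = URI.create("http://localhost:8000/").toURL().openStream()) {
            String xmlVariable = readResult(in).orElse("(no <result> element)");
            System.out.println(xmlVariable);
        }
    }
}
```

Unlike extractResultValue, this sketch matches on the element name, so a document without a <result> element yields an empty Optional.".getBytes(java.nio.charset.StandardCharsets.UTF_8));
if (ResultFetchSketch.readResult(other).isPresent()) throw new AssertionError("expected empty result");
</test>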

Xalan's SAX implementation - double encoding entities in string

I'm using SAX with the Xalan implementation (v. 2.7.2). I have a string in HTML format:
" <p>Test k&quot;nnen</p>"
and I have to pass it as the content of an XML tag.
The result is:
"&lt;p&gt;Test k&amp;quot;nnen&lt;/p&gt;"
Xalan encodes the ampersand even though it is part of an already escaped entity.
Does anyone know a way to make Xalan understand escaped entities and not escape their ampersand?
One possible solution is to call startCDATA() on the transformerHandler, but that is not something I can use in my code.
public class TestSax {
    public static void main(String[] args) throws TransformerConfigurationException, SAXException {
        TestSax t = new TestSax();
        System.out.println(t.createSAXXML());
    }
    public String createSAXXML() throws SAXException, TransformerConfigurationException {
        Writer writer = new StringWriter();
        StreamResult streamResult = new StreamResult(writer);
        SAXTransformerFactory transformerFactory =
                (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        String data = null;
        TransformerHandler transformerHandler =
                transformerFactory.newTransformerHandler();
        transformerHandler.setResult(streamResult);
        transformerHandler.startDocument();
        transformerHandler.startElement(null, "decimal", "decimal", null);
        data = " <p>Test k&quot;nnen</p>";
        transformerHandler.characters(data.toCharArray(), 0, data.length());
        transformerHandler.endElement(null, "decimal", "decimal");
        transformerHandler.endDocument();
        return writer.toString();
    }
}
If your input is XML, then you need to parse it. Then <p> and </p> will be recognized as tags, and &quot; will be recognized as an entity reference.
On the other hand, if you want to treat it as a string and pass it through XML machinery, then "<" and "&" are going to be preserved as ordinary characters, which means they will be escaped as &lt; and &amp; respectively.
If you want "<" treated as an ordinary character but "&" treated with its XML meaning, then you need software with some kind of split personality, and you're not going to get that off-the-shelf.
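Since nothing off the shelf does this, one workaround is to decode the entity references by hand before the string reaches characters(), so that the serializer escapes each character exactly once. The following is only a sketch of that idea, not something the answer above proposes: decodeEntities is a hypothetical helper that handles just the five predefined XML entities, and the class name EscapeOnceSketch is likewise made up for the illustration. It uses the JDK's default TransformerFactory rather than a specific Xalan version.

```java
import java.io.StringWriter;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

class EscapeOnceSketch {

    // Hypothetical helper: decodes the five predefined XML entities.
    // "&amp;" is handled last so that "&amp;lt;" becomes "&lt;", not "<".
    static String decodeEntities(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&");
    }

    // Serializes the text as the content of a <decimal> element.
    static String serialize(String text) throws Exception {
        StringWriter writer = new StringWriter();
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();
        handler.setResult(new StreamResult(writer));
        handler.startDocument();
        handler.startElement("", "decimal", "decimal", new AttributesImpl());
        handler.characters(text.toCharArray(), 0, text.length());
        handler.endElement("", "decimal", "decimal");
        handler.endDocument();
        return writer.toString();
    }

    public static void main(String[] args) throws Exception {
        String data = " <p>Test k&quot;nnen</p>";
        // Passed as-is, the ampersand of &quot; is escaped again.
        System.out.println(serialize(data));
        // Decoded first, every character is escaped only once by the serializer.
        System.out.println(serialize(decodeEntities(data)));
    }
}
```

The trade-off is that the decoding step knows nothing about other named entities or numeric character references; those would still be escaped as plain text.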

Reading from property file containing utf 8 character

I am reading a property file which consists of a message in the UTF-8 character set.
Problem
The output is not in the appropriate format. I am using an InputStream.
The property file looks like
username=LBSUSER
password=Lbs#123
url=http://localhost:1010/soapfe/services/MessagingWS
timeout=20000
message=Spanish character are = {á é í, ó,ú ,ü, ñ, ç, å, Á, É, Í, Ó, Ú, Ü, Ñ, Ç, ¿, °, 4° año = cuarto año, €, ¢, £, ¥}
And I am reading the file like this,
Properties props = new Properties();
props.load(new FileInputStream("uinsoaptest.properties"));
String username = props.getProperty("username", "test");
String password = props.getProperty("password", "12345");
String url = props.getProperty("url", "12345");
int timeout = Integer.parseInt(props.getProperty("timeout", "8000"));
String messagetext = props.getProperty("message");
System.out.println("This is soap msg : " + messagetext);
[Console output screenshot not preserved.]
In the console, the message appears after the line
{************************ SOAP MESSAGE TEST***********************}
I would be grateful for any help reading this file properly. I could read the file with another approach, but I am looking for a fix that needs minimal code changes.
Use an InputStreamReader with Properties.load(Reader reader):
FileInputStream input = new FileInputStream(new File("uinsoaptest.properties"));
props.load(new InputStreamReader(input, Charset.forName("UTF-8")));
As a method, this may resemble the following:
private Properties read( final Path file ) throws IOException {
    final var properties = new Properties();
    try( final var in = new InputStreamReader(
            new FileInputStream( file.toFile() ), StandardCharsets.UTF_8 ) ) {
        properties.load( in );
    }
    return properties;
}
Don't forget to close your streams. Java 7 introduced StandardCharsets.UTF_8.
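The reason the Reader matters is that Properties.load(InputStream) is documented to decode the bytes as ISO-8859-1, whereas Properties.load(Reader) uses whatever charset the Reader was created with. The sketch below shows the difference on a temporary file written in UTF-8. The class PropertiesEncodingSketch, its two helper methods, and the sample value are made up for the illustration; the non-ASCII characters are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class PropertiesEncodingSketch {

    // Loads through an InputStream: bytes are decoded as ISO-8859-1.
    static Properties loadAsStream(Path file) throws IOException {
        Properties props = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            props.load(in);
        }
        return props;
    }

    // Loads through a Reader that decodes the bytes as UTF-8.
    static Properties loadAsUtf8(Path file) throws IOException {
        Properties props = new Properties();
        try (Reader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("encoding-demo", ".properties");
        try {
            // Sample value: a-acute, n-tilde and the euro sign, stored as UTF-8.
            String line = "message=\u00e1 \u00f1 \u20ac\n";
            Files.write(file, line.getBytes(StandardCharsets.UTF_8));
            System.out.println("InputStream: " + loadAsStream(file).getProperty("message"));
            System.out.println("UTF-8 Reader: " + loadAsUtf8(file).getProperty("message"));
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```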
Use props.load(new FileReader("uinsoaptest.properties")) instead. By default it uses the encoding Charset.forName(System.getProperty("file.encoding")), which can be set to UTF-8 with the command-line parameter -Dfile.encoding=UTF-8. Calling System.setProperty("file.encoding", "UTF-8") at runtime is not reliable, because the JVM reads the default charset at startup.
If somebody uses the @Value annotation, they could try StringUtils:
@Value("${title}")
private String pageTitle;
public String getPageTitle() {
    return StringUtils.toEncodedString(pageTitle.getBytes(Charset.forName("ISO-8859-1")), Charset.forName("UTF-8"));
}
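The same repair can be sketched with the JDK alone, without a StringUtils dependency. The idea is that a value misread as ISO-8859-1 still holds the original bytes one-to-one, so it can be turned back into bytes and decoded again as UTF-8. The class RedecodeSketch and the method redecode are hypothetical names, and the sketch assumes the value really was UTF-8 that got decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class RedecodeSketch {

    // Hypothetical helper: undoes UTF-8 text that was decoded as ISO-8859-1.
    static String redecode(String misread) {
        byte[] originalBytes = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a\u00f1o"; // "año", written with a Unicode escape
        // Simulate the faulty load: UTF-8 bytes decoded as ISO-8859-1.
        String misread = new String(
                original.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
        System.out.println("misread length: " + misread.length());
        System.out.println("repaired equals original: " + redecode(misread).equals(original));
    }
}
```

This repairs the symptom after the fact; loading the file with the right charset in the first place avoids the need for it.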
You should specify the UTF-8 encoding when you read the file. FileInputStream has no constructor that takes an encoding, so wrap it in an InputStreamReader, which does:
new InputStreamReader(new FileInputStream("uinsoaptest.properties"), "UTF-8");
If you want the JVM to read UTF-8 files by default, set the JAVA_TOOL_OPTIONS environment variable to something like this:
-Dfile.encoding=UTF-8
If anybody comes across this problem in Kotlin, like me:
The accepted solution of @Würgspaß works here as well. The corresponding Kotlin syntax:
Instead of the usual
val properties = Properties()
filePath.toFile().inputStream().use { stream -> properties.load(stream) }
I had to use
val properties = Properties()
InputStreamReader(FileInputStream(filePath.toFile()), StandardCharsets.UTF_8).use { stream -> properties.load(stream) }
With this, special UTF-8 characters are loaded correctly from the properties file given in filePath.

Apache POI: find characters in Word document without spaces

I want to read the number of characters without spaces in a Word document using Apache POI.
I can get the number of characters with spaces using the SummaryInformation.getCharCount() method as in the following code:
public void countCharacters() throws FileNotFoundException, IOException {
    File wordFile = new File(BASE_PATH, "test.doc");
    POIFSFileSystem p = new POIFSFileSystem(new FileInputStream(wordFile));
    HWPFDocument doc = new HWPFDocument(p);
    SummaryInformation props = doc.getSummaryInformation();
    int numOfCharsWithSpaces = props.getCharCount();
    System.out.println(numOfCharsWithSpaces);
}
However there seems to be no method for returning the number of characters without spaces.
How do I find this value?
If you want to base this on the metadata of the document, all you will get are estimates (according to the Microsoft specs). There are essentially two values you can play around with:
GKPIDSI_CHARCOUNT (which is what you already accessed in your own code sample)
GKPIDDSI_CCHWITHSPACES
Don't ask me about the exact differences of those two values, though. I haven't designed this stuff...
Below is a code sample to illustrate the access to them (GKPIDDSI_CCHWITHSPACES is a little awkward):
HWPFDocument document = [...];
SummaryInformation summaryInformation = document.getSummaryInformation();
System.out.println("GKPIDSI_CHARCOUNT: " + summaryInformation.getCharCount());
DocumentSummaryInformation documentSummaryInformation = document.getDocumentSummaryInformation();
Integer count = null;
for (Property property : documentSummaryInformation.getProperties()) {
    if (property.getID() == 0x11) {
        count = (Integer) property.getValue();
        break;
    }
}
System.out.println("GKPIDDSI_CCHWITHSPACES: " + count);
The moment at which Word's internal algorithm that updates those values kicks in is rather unpredictable to me. So what you see in Word's own statistics may not necessarily be the same as when running the above code.
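If an exact figure matters more than matching Word's stored statistics, a different approach is to count the characters in the extracted text yourself. This is a swapped-in technique, not what the answer above describes. The sketch below covers only the counting step on a plain String; obtaining the text from the document (for example with POI's WordExtractor) is assumed and not shown. The class NonSpaceCountSketch and the method countNonWhitespace are hypothetical names.

```java
class NonSpaceCountSketch {

    // Counts code points that are neither Java whitespace (tab, newline, ...)
    // nor Unicode space characters (which include the non-breaking space).
    static long countNonWhitespace(String text) {
        return text.codePoints()
                .filter(cp -> !Character.isWhitespace(cp) && !Character.isSpaceChar(cp))
                .count();
    }

    public static void main(String[] args) {
        // Assumption: in real use this string would come from the Word document.
        String text = "foo bar\tbaz\n";
        System.out.println(countNonWhitespace(text));
    }
}
```

Text extracted from a .doc file can contain control characters such as field markers, so the result may still differ from the number Word displays.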
