Java JAXB UTF-8/ISO conversions - java

I have an XML file that contains non-standard characters (like a "curly" quote).
I read the XML using UTF-8 / ISO-8859-1 / ASCII and unmarshalled it:
BufferedReader br = new BufferedReader(new InputStreamReader(
        conn.getInputStream(), "ISO-8859-1"));
String output;
StringBuffer sb = new StringBuffer();
while ((output = br.readLine()) != null) {
    // fetch XML
    sb.append(output);
}
try {
    jc = JAXBContext.newInstance(ServiceResponse.class);
    Unmarshaller unmarshaller = jc.createUnmarshaller();
    ServiceResponse OWrsp = (ServiceResponse) unmarshaller
            .unmarshal(new InputSource(new StringReader(sb.toString())));
I have an Oracle function that takes ISO-8859-1 codes and converts/maps them to "literal" symbols, e.g. "&#x2019;" => "left single quote".
JAXB unmarshalling using ISO displays the characters with the ISO conversion fine, i.e. all the curly single quotes are encoded as "&#x2019;".
So suppose my string is: class of 10–11‐year‐olds (note the weird dash between 11 and year).
jc = JAXBContext.newInstance(ScienceProductBuilderInfoType.class);
Marshaller m = jc.createMarshaller();
m.setProperty(Marshaller.JAXB_ENCODING, "ISO-8859-1");
// save to a temp file (the object is then marshalled to file2)
File file2 = new File("tmp.xml");
This will save in the file:
class of 10–11‐year‐olds. (what I want... so file saving works!)
[side note: I have read the file back with a Java file reader, and it outputs the above string fine]
The issue I have is that the STRING representation from the JAXB unmarshaller has weird output; for some reason I cannot seem to get the string to contain –.
When I
1: check the unmarshalled XML output, I get:
class of 10?11?year?olds
2: check the file output, I get:
class of 10–11‐year‐olds
I even tried to read back the saved XML file and then unmarshal that (in hopes of getting the – into my string):
String sCurrentLine;
BufferedReader br = new BufferedReader(new FileReader("tmp.xml"));
StringBuffer sb = new StringBuffer();
while ((sCurrentLine = br.readLine()) != null) {
    sb.append(sCurrentLine);
}
ScienceProductBuilderInfoType rsp = (ScienceProductBuilderInfoType) unm
        .unmarshal(new InputSource(new StringReader(sb.toString())));
To no avail.
Any ideas how to get the ISO-8859-1 encoded character out of JAXB?

Solved: using this tidbit of code found on Stack Overflow:
final class HtmlEncoder {
    private HtmlEncoder() {}

    public static <T extends Appendable> T escapeNonLatin(CharSequence sequence,
            T out) throws java.io.IOException {
        for (int i = 0; i < sequence.length(); i++) {
            char ch = sequence.charAt(i);
            if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append(ch);
            } else {
                int codepoint = Character.codePointAt(sequence, i);
                // handle supplementary range chars
                i += Character.charCount(codepoint) - 1;
                // emit entity
                out.append("&#x");
                out.append(Integer.toHexString(codepoint));
                out.append(";");
            }
        }
        return out;
    }
}
String escaped = HtmlEncoder.escapeNonLatin(MYSTRING, new StringBuilder()).toString();
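In context, the escaper turns the raw unmarshalled string into one where every non-Latin character appears as a numeric entity, which is the form the Oracle mapping function above expects. A minimal sketch (the getter name is hypothetical, just to show where the string comes from):

// Hypothetical getter: pull the text from the unmarshalled object, then escape
// everything outside BASIC_LATIN as a numeric character reference.
String raw = rsp.getDescription();   // e.g. "class of 10–11‐year‐olds"
String escaped = HtmlEncoder.escapeNonLatin(raw, new StringBuilder()).toString();
// escaped is now "class of 10&#x2013;11&#x2010;year&#x2010;olds",
// which the ISO/entity-to-symbol mapping can work with.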

Related

String index out of range: -1 working with XML file

I downloaded an XML file from a web service. If I open the file from the file system it is formed correctly, but when I run my code it isn't.
Part of the XML file, formed correctly when opened from the file system:
<?xml version="1.0" encoding="UTF-8"?><ns3:FatturaElettronica xmlns:ns3="http://ivaservizi.agenziaentrate.gov.it/docs/xsd/fatture/v1.2" xmlns:ns2="http://www.w3.org/2000/09/xmldsig#" versione="FPR12">
Here is the same XML file as managed by my code:
ÿþ<
I can't copy the output, so I put an image of what I see in the Eclipse console.
I tried different ways to manage this file, but nothing worked.
This is the code that manages the files. I have kept all the ways I tried to solve the error.
private static String readFile(File file, Writer writerArg)
        throws FileNotFoundException, IOException, Exception {
    FileInputStream fis = null;
    InputStreamReader isr = null;
    String typeEncoding = null;

    /*
     * First way
     *
    BufferedReader br = new BufferedReader(new FileReader(fileName));
    String nextLine = "";
    StringBuffer sb = new StringBuffer();
    while ((nextLine = br.readLine()) != null) {
        // System.out.println("Writing: " + nextLine);
        writerArg.write(nextLine);
        // sb.append(nextLine);
        sb.append(nextLine + "\n");
    }
    // Convert the content into a string
    String clobData = sb.toString().trim();
    */

    /*
     * Second way
     *
    fis = new FileInputStream(file);
    isr = new InputStreamReader(fis);
    typeEncoding = isr.getEncoding();
    Charset inputCharset = Charset.forName(typeEncoding);
    BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(file), inputCharset));
    String str;
    String nextLine = "";
    StringBuffer sb = new StringBuffer();
    while ((nextLine = in.readLine()) != null) {
        System.out.println(nextLine);
        writerArg.write(nextLine);
        // sb.append(nextLine);
        sb.append(nextLine + "\n");
    }
    String clobData = sb.toString().trim();
    // Return the data.
    return clobData;
    */

    /* Third way */
    String data = "";
    data = new String(Files.readAllBytes(Paths.get(file.getAbsolutePath())));
    System.out.println(data);
    return data;
}
And when the code below receives the string, I get the error: "String index out of range: -1"
schema = stringXml.substring(0, stringXml.indexOf("<FatturaElettronicaHeader")).trim();
The first way has downloaded and managed thousands of files; only this file gives me this error. I've been looking for a way to solve it since yesterday.
Can anyone give me any ideas?
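For what it's worth, the ÿþ prefix in the broken output is exactly what a UTF-16 LE byte order mark looks like when its bytes are decoded with a single-byte charset. A minimal sketch of the third way with an explicit charset (the UTF-16 choice is an assumption based on that prefix):

// Decode explicitly as UTF-16: Java's UTF-16 charset reads the BOM, picks the
// right byte order and drops the BOM from the result, so
// indexOf("<FatturaElettronicaHeader") can find the tag again.
String data = new String(
        Files.readAllBytes(Paths.get(file.getAbsolutePath())),
        StandardCharsets.UTF_16);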

How to replace text in HTML to convert to PDF? Java

I want to modify an HTML file before converting it to PDF.
Currently I convert an HTML file to PDF using "ITextRenderer".
Currently:
OutputStream out = new FileOutputStream(htmlFileOutPutPath);
//Flying Saucer
ITextRenderer renderer = new ITextRenderer();
renderer.setDocument(htmlFilePath);
renderer.layout();
renderer.createPDF(out);
out.close();
// This succeeds: HTML file to PDF generated!
1- But later on I need to modify the HTML file before generating it as PDF. For this I thought I would extract the HTML file content into a string and then replace some text in that string:
public String htmlFileToString() throws IOException {
    StringBuilder contentBuilder = new StringBuilder();
    String path = "C:/Users/User1/Desktop/to_pdf_replace.html";
    BufferedReader in = new BufferedReader(new FileReader(path));
    String str;
    while ((str = in.readLine()) != null) {
        contentBuilder.append(str);
    }
    in.close();
    String content = contentBuilder.toString();
    return content;
}
2- Then replace the tags in the HTML string:
public String replaceHTMLcontent(String strSource) {
    String name = "Ana";
    String age = "23";
    strSource = strSource.replace("##Name##", name);
    strSource = strSource.replace("##Age##", age);
    // ## ## -> are my custom HTML tags to replace
    return strSource;
}
MAIN:
public static void main(String[] args) throws IOException {
    String stringFromHtml = new DocumentBLL().htmlFileToString();
    String stringFromHtmlReplaced = new DocumentBLL().replaceHTMLcontent(stringFromHtml);
}
But now I do not know how to get this new string into the PDF generation in place of the old HTML file content.
You can first convert the whole HTML file into a string and do
string.replace("What I want to replace", "What it will be replaced with");
Or, if you want to replace text1 and it's on a specific line, you can iterate through the file line by line (each line is read as a string), check whether it contains text1, and apply the same replace as above.
In addition, you can use this:
BufferedReader file = new BufferedReader(new FileReader("myFile.html"));
String line;
StringBuffer buffer = new StringBuffer();
while (line = file.readLine()) {
buffer.append(line);
buffer.append('\n');
}
String input = buffer.toString();
file.close();
input = input.replace("What I want to insert into", "What I (hi there) want to insert into");
FileOutputStream out = new FileOutputStream("myFile.html");
out.write(inputStr.getBytes());
out.close();
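To feed the replaced string straight back into the PDF step, without writing it over the HTML file first, Flying Saucer's ITextRenderer can also take the markup as a string. A minimal sketch, assuming the setDocumentFromString overload exists in your Flying Saucer version:

// Sketch: render the already-replaced HTML string directly to PDF,
// skipping the write-back to myFile.html.
public static void htmlStringToPdf(String replacedHtml, String pdfPath) throws Exception {
    try (OutputStream out = new FileOutputStream(pdfPath)) {
        ITextRenderer renderer = new ITextRenderer();
        renderer.setDocumentFromString(replacedHtml); // HTML as a string, not a file path
        renderer.layout();
        renderer.createPDF(out);
    }
}

Called with the output of replaceHTMLcontent, this keeps the original template file untouched.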

How to parse this provided XML with javax.xml.xpath?

I am trying to parse this XML:
<?xml version="1.0" encoding="UTF-8"?>
<veranstaltungen>
<veranstaltung id="201611211500#25045271">
<titel>Mal- und Zeichen-Treff</titel>
<start>2016-11-21 15:00:00</start>
<veranstaltungsort id="20011507">
<name>Freizeitclub - ganz unbehindert </name>
<anschrift>Macht los e.V.
Lipezker Straße 48
03048 Cottbus
</anschrift>
<telefon>xxxx xxxx </telefon>
<fax>0355 xxxx</fax>
[...]
</veranstaltungen>
As you can see, some of the text nodes contain whitespace or even line breaks. I am having issues with the text from the anschrift node, because I need it to find the right location data in a database. The problem is that the returned String is:
Macht los e.V.Lipezker Straße 4803048 Cottbus
instead of:
Macht los e.V. Lipezker Straße 48 03048 Cottbus
I know the correct way to handle it should be with normalize-space(), but I cannot quite work out how to do it. I tried this:
// Does not work; AFAIK because XPath 1.0 normalizes just the first node
xPath.compile("normalize-space(veranstaltungen/veranstaltung[position()=1]/veranstaltungsort/anschrift/text())");
// Does not work
xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort[normalize-space(anschrift/text())]");
I also tried the solution given here: xpath-normalize-space-to-return-a-sequence-of-normalized-strings
xPathExpression = xPath.compile("veranstaltungen/veranstaltung[position()=1]/veranstaltungsort");
NodeList result = (NodeList) xPathExpression.evaluate(doc, XPathConstants.NODESET);
String normalize = "normalize-space(.)";
xPathExpression = xPath.compile(normalize);
int length = result.getLength();
for (int i = 0; i < length; i++) {
    System.out.println(xPathExpression.evaluate(result.item(i), XPathConstants.STRING));
}
System.out prints:
Macht los e.V.Lipezker Straße 4803048 Cottbus
What am I doing wrong?
Update
I have a workaround already, but this can't be the solution. The following few lines show how I put the String together from the HTTPResponse:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
    final StringBuilder stringBuilder = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
        // stringBuilder.append(line);
        // WORKAROUND: Add a space after each line
        stringBuilder.append(line).append(" ");
    }
    // Work with the read lines
}
I would rather have a solid solution.
Originally, you seem to be using the following code for reading the XML:
try (BufferedReader reader = new BufferedReader(new InputStreamReader(response.getEntity().getContent(), Charset.forName(charset)))) {
    final StringBuilder stringBuilder = new StringBuilder();
    String line;
    while ((line = reader.readLine()) != null) {
        stringBuilder.append(line);
    }
}
This is where your newlines get eaten: readLine() does not return the trailing newline characters. If you then parse the contents of the stringBuilder object, you get an incorrect DOM, in which the text nodes do not contain the original newlines from the XML.
Thanks to Markus's help, I was able to solve the issue. The reason was the readLine() method of the BufferedReader discarding line breaks. The following code snippet works for me (maybe it can be improved):
public Document getDocument() throws IOException, ParserConfigurationException, SAXException {
    final HttpResponse response = getResponse(); // returns an HttpResponse
    final HttpEntity entity = response.getEntity();
    final Charset charset = ContentType.getOrDefault(entity).getCharset();
    // Not 100% sure if I have to close the InputStreamReader. But I guess so.
    try (InputStreamReader isr = new InputStreamReader(entity.getContent(), charset == null ? Charset.forName("UTF-8") : charset)) {
        return documentBuilderFactory.newDocumentBuilder().parse(new InputSource(isr));
    }
}
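With the document parsed that way the whitespace survives, and the original normalize-space() idea then gives the expected result. A minimal sketch:

// Evaluate normalize-space() against the correctly parsed document; the address
// comes back as "Macht los e.V. Lipezker Straße 48 03048 Cottbus".
XPath xPath = XPathFactory.newInstance().newXPath();
String anschrift = xPath.evaluate(
        "normalize-space(veranstaltungen/veranstaltung[1]/veranstaltungsort/anschrift)",
        getDocument());
System.out.println(anschrift);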

dynamic mapping in elasticsearch

I have a .txt file which contains data in JSON format. I would like to create a mapping using the method below, but it throws an exception:
Root type mapping not empty after parsing (while creating a mapping with a dynamic template).
I would be thankful if you could tell me what the mistake is and how to resolve it.
InputStream fileStream;
StringBuilder mapTemplate = new StringBuilder();
String line;
File mapFile = new File(mapFileBase); // mapFileBase is a string which holds the path of the .txt file
fileStream = new FileInputStream(mapFile);
BufferedReader br = new BufferedReader(new InputStreamReader(fileStream));
while ((line = br.readLine()) != null) {
    mapTemplate.append(line);
}
String mTemplate = mapTemplate.toString();
mTemplate = mTemplate.replaceAll("\n ", "").replaceAll("\\s+", "");
System.out.println(mTemplate);
createIndexRequestBuilder.addMapping(type, mappingBuilder);
// MAPPING DONE
createIndexRequestBuilder.execute().actionGet();
And I can't use
XContentBuilder mappingBuilder = XContentFactory.jsonBuilder()
        .startObject()
        .startObject(type)
        .startObject("properties")..........
as the JSON file is very large.
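As an aside, the snippet above never hands the file contents to the request: mappingBuilder is not built from mTemplate. A minimal sketch of wiring the string through, assuming a client version whose addMapping accepts the mapping source as a String:

// Sketch: pass the JSON read from the .txt file as the mapping source.
// Depending on the client version, addMapping may also require an explicit
// content type, e.g. addMapping(type, mTemplate, XContentType.JSON).
createIndexRequestBuilder.addMapping(type, mTemplate);
createIndexRequestBuilder.execute().actionGet();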

While calling grep in Java, it doesn't work for French characters

I'm calling grep from Java to separately count the occurrences of each word in a list in a corpus.
BufferedReader fb = new BufferedReader(
        new InputStreamReader(
                new FileInputStream("french.txt"), "UTF8"));
while ((l = fb.readLine()) != null) {
    String lpt = "\\b" + l + "\\b";
    String[] args = new String[]{"grep", "-ic", lpt, corpus};
    Process grepCommand = Runtime.getRuntime().exec(args);
    grepCommand.waitFor();
    BufferedReader grepInput = new BufferedReader(new InputStreamReader(grepCommand.getInputStream()));
    int tmp = Integer.parseInt(grepInput.readLine());
    System.out.println(l + "\t" + tmp);
}
This works well for my English word list and corpus. But I also have a French word list and corpus. It doesn't work for French, and a sample of the output on the Java console looks like this:
� bord 0
� c�t� 0
The correct forms are "à bord" and "à côté".
Now my question is: where is the problem? Should I fix my Java code, or is it a grep issue?
If so, how do I fix it? (I also can't see French characters correctly on my terminal, even though I changed the encoding to UTF-8.)
The problem is in your design. Do not call grep from Java. Use a pure Java implementation instead: read the file line by line and implement your own "grep" using the plain Java API.
But seriously, I believe the problem is in your shell. Did you try to run grep manually and filter for French characters? I believe it will not work for you. It depends on your shell configuration and therefore on the platform. Java can provide a platform-independent solution. To achieve this you should avoid, as much as possible, non-pure-Java techniques, including executing command-line utilities.
BTW, code that reads your file line by line and uses String.contains() or pattern matching to filter lines is even shorter than code that runs grep.
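A minimal sketch of that idea (file name and word are placeholders; note that contains() is case-sensitive, unlike grep -i):

// Count corpus lines containing the word, in pure Java, no grep involved
// (grep -c also counts matching lines, not individual matches).
long count;
try (Stream<String> lines = Files.lines(Paths.get("corpus.txt"), StandardCharsets.UTF_8)) {
    count = lines.filter(line -> line.contains("à bord")).count();
}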
I would suggest that you read the file line by line then call split on the word boundary to get the number of words.
public static void main(String[] args) throws IOException {
    final File file = new File("myFile");
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final String[] words = line.split("\\b");
            System.out.println(words.length + " words in line \"" + line + "\".");
        }
    }
}
This avoids calling grep from your program.
The odd characters you are getting may well be to do with using the wrong encoding. Are you sure your file is in UTF-8?
EDIT
The OP wants to read one file line by line and then search for occurrences of each read line in another file.
This can still be done more easily using Java. Depending on how big your other file is, you can either read it into memory first and search it, or search it line by line as well.
A simple example reading the file into memory:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
    final File corpusFile = new File("corpus");
    final String corpusFileContent = readFileToString(corpusFile);
    final File file = new File("myEngramFile");
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final int matches = countOccurencesOf(line, corpusFileContent);
        }
    }
}
private static String readFileToString(final File file) throws IOException {
    final StringBuilder stringBuilder = new StringBuilder();
    try (final FileChannel fc = new RandomAccessFile(file, "r").getChannel()) {
        final ByteBuffer byteBuffer = ByteBuffer.allocate(4096);
        final CharsetDecoder charsetDecoder = Charset.forName("UTF-8").newDecoder();
        while (fc.read(byteBuffer) > 0) {
            byteBuffer.flip();
            stringBuilder.append(charsetDecoder.decode(byteBuffer));
            // clear() (not reset(), which needs a mark) readies the buffer for the next read;
            // note that a multi-byte UTF-8 character split across 4096-byte chunks
            // would still trip up this simple chunked decode.
            byteBuffer.clear();
        }
    }
    return stringBuilder.toString();
}
private static int countOccurencesOf(final String countMatchesOf, final String inString) {
    final Matcher matcher = Pattern.compile("\\b" + countMatchesOf + "\\b").matcher(inString);
    int count = 0;
    while (matcher.find()) {
        ++count;
    }
    return count;
}
This should work fine if your "corpus" file is less than a hundred megabytes or so. Any bigger and you will want to change the "countOccurencesOf" method to something like this
private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
    final Pattern pattern = Pattern.compile("\\b" + countMatchesOf + "\\b");
    int count = 0;
    try (final BufferedReader bufferedReader =
            new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
        String line;
        while ((line = bufferedReader.readLine()) != null) {
            final Matcher matcher = pattern.matcher(line);
            while (matcher.find()) {
                ++count;
            }
        }
    }
    return count;
}
Now you would just pass your File object into the method rather than the stringified file contents.
Note that the streaming approach reads the file line by line and hence drops the line breaks; you need to add them back before matching the String if your Pattern relies on them being there.
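A minimal sketch of that adjustment, should the pattern need to span line breaks:

// Re-insert the separator that readLine() strips, then match on the whole text.
final StringBuilder text = new StringBuilder();
String line;
while ((line = bufferedReader.readLine()) != null) {
    text.append(line).append('\n');
}
// countOccurencesOf(word, text.toString()) then sees the original line breaks.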
