How to keep character "&" from ISO-8859-1 to UTF-8 - java

I've just written a Java file in Eclipse, saved with ISO-8859-1 encoding.
In this file, I want to create a String like this (in order to build XML content and save it to a database):
// <image><img src="path_of_picture"></image>
String xmlContent = "<image><img src=\"" + path_of_picture+ "\"></image>";
In another file, I get this String and create a new String with this constructor :
String myNewString = new String(xmlContent.getBytes(), "UTF-8");
To be understood by an XML parser, my XML content must be converted to:
<image><img src="path_of_picture"></image>
Unfortunately, I can't work out how to write xmlContent to get this result in myNewString.
I tried two methods:
// First :
String xmlContent = "<image><img src=\"" + content + "\"></image>";
// But the result is just myNewString = <image><img src="path_of_picture"></image>
// and my XML parser can't get the content of <image/>
//Second :
String xmlContent = "<image><img src=\"" + content + "\"></image>";
// But the result is just myNewString = <image>&lt;img src="path_of_picture"&gt;</image>
Do you have any ideas?

This is unclear, but note that Strings don't have an encoding. So when you write
String s = new String(someOtherString.getBytes(), someEncoding);
you will get different results depending on your platform's default charset, which is what the no-argument getBytes() method uses.
If you want to read a file encoded with ISO-8859-1, you simply do:
read the bytes from the file: byte[] bytes = Files.readAllBytes(path);
create a string using the file's encoding: String content = new String(bytes, "ISO-8859-1");
If you need to write back the file with a UTF-8 encoding you do:
convert the string to bytes with UTF-8 encoding: byte[] utfBytes = content.getBytes("UTF-8");
write the bytes to the file: Files.write(path, utfBytes);
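The read-then-write steps above can be sketched as one small program. This is a minimal illustration, not the asker's code: the class name TranscodeExample, the helper transcode, and the temporary files are assumptions made for the example.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TranscodeExample {

    // Decode the bytes with the source charset, then re-encode with the target charset.
    // (Hypothetical helper, introduced only for this sketch.)
    static byte[] transcode(byte[] input, Charset from, Charset to) {
        String text = new String(input, from);
        return text.getBytes(to);
    }

    public static void main(String[] args) throws IOException {
        // Temporary files stand in for the real source and target paths.
        Path source = Files.createTempFile("latin1-", ".txt");
        Path target = Files.createTempFile("utf8-", ".txt");

        // Write "&" followed by e-acute as ISO-8859-1: two bytes, 0x26 0xE9.
        Files.write(source, new byte[] {0x26, (byte) 0xE9});

        byte[] latin1Bytes = Files.readAllBytes(source);
        byte[] utf8Bytes = transcode(latin1Bytes,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        Files.write(target, utf8Bytes);

        System.out.println(latin1Bytes.length + " byte(s) in, " + utf8Bytes.length + " byte(s) out");
    }
}
```

Naming both charsets explicitly means the result no longer depends on the platform default.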

I don't think your question is related to encoding, but if you want to "create a String such like that (in order to create a XML content and save it into a database)", you can use this code:
public static Document loadXMLFromString(String xml) throws Exception
{
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
InputSource is = new InputSource(new StringReader(xml));
return builder.parse(is);
}
Refer to this SO answer.

Related

Why does my code return unicode characters?

String encodedInputText = URLEncoder.encode("input=" + question, "UTF-8");
urlStr = Parameters.getWebserviceURL();
URL url = new URL(urlStr + encodedInputText + "&sku=" + sku);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
jsonOutput = in.readLine();
in.close();
The problem is that the returned JSON string contains all unicodes like
"question":"\u51e0\u5339\u7684",
Not the actual Chinese characters. The "UTF-8" should solve the problem. Why doesn't it?
EDIT:
ObjectMapper mapper = new ObjectMapper();
ResponseList responseList = mapper.readValue(jsonOutput, ResponseList.class);
This is not an encoding problem; it is a problem with your data source. Encoding comes into play when you convert bytes into a String. You are expecting the charset to turn a string containing literal \uxxxx sequences into a different string, which is not going to happen.
The point is that the data source serializes the data this way, so the raw characters are gone and have been replaced with \uxxxx escapes.
You would now have to manually capture the \uxxxx sequences and convert them to actual characters.
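A rough sketch of that manual conversion, using only java.util.regex. The class name UnicodeUnescape and the method unescape are made up for the example, and it is deliberately simplified: it does not treat an escaped backslash specially. A JSON parser normally decodes these escapes itself when it parses string values, so this is only needed when working on the raw text.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeUnescape {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    // Replaces every matched escape with the UTF-16 code unit it denotes.
    static String unescape(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement guards against '$' or backslash in the replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The raw text as it arrives: escapes spelled out as six ASCII characters each.
        String raw = "\"question\":\"\\u51e0\\u5339\\u7684\"";
        System.out.println(unescape(raw));
    }
}
```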

Convert byte[] to String and back

I'm trying to save content of a pdf file in a json and thought of saving the pdf as String value converted from byte[].
byte[] byteArray = feature.convertPdfToByteArray(Paths.get("path.pdf"));
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
The result of the above code is as follows:
true
false
421371 vs 760998
The two Strings are equal while the two byte[]s are not. Why is that, and how do I correctly convert/save a PDF inside JSON?
You are probably using the wrong charset when reading from the PDF file.
For example, the character é (e with acute) encoded in ISO-8859-1 is the single byte 0xE9, which is not a valid UTF-8 sequence on its own:
byte[] byteArray = "é".getBytes(StandardCharsets.ISO_8859_1);
String byteString = new String(byteArray, StandardCharsets.UTF_8);
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
System.out.println(secondString.equals(byteString));
System.out.println(Arrays.equals(byteArray, newByteArray));
System.out.println(byteArray.length + " vs " + newByteArray.length);
Output :
true
false
1 vs 3
Why is that
If the byteArray indeed contains a PDF, it most likely is not valid UTF-8. Thus, wherever
String byteString = new String(byteArray, StandardCharsets.UTF_8);
stumbles over a byte sequence which is not valid UTF-8, it will replace that by a Unicode replacement character. I.e. this line damages your data, most likely beyond repair. So the following
byte[] newByteArray = byteString.getBytes(StandardCharsets.UTF_8);
does not result in the original byte array but instead a damaged version of it.
The newByteArray, on the other hand, is the result of UTF-8 encoding a given string, byteString. Thus, newByteArray is valid UTF-8 and
String secondString = new String(newByteArray, StandardCharsets.UTF_8);
does not need to replace anything outside the UTF-8 mappings, in particular byteString and secondString are equal.
how to correctly convert/save a pdf inside a json?
As @mammago explained in his comment,
JSON is not the appropriate format for binary content (like files). You should probably use something like base64 to create a string out of your PDF and store that in your JSON object.
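A minimal sketch of the Base64 approach using java.util.Base64 from the standard library. The class name PdfBase64Example, the helper methods, and the stand-in bytes are assumptions for illustration; a real PDF would be read with something like Files.readAllBytes.

```java
import java.util.Arrays;
import java.util.Base64;

public class PdfBase64Example {

    // Binary to text: the result contains only ASCII characters.
    static String encode(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Text back to the exact original bytes.
    static byte[] decode(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Stand-in for PDF content: includes bytes that are not valid UTF-8.
        byte[] original = {0x25, 0x50, 0x44, 0x46, (byte) 0xE9, (byte) 0xFF, 0x00};

        String encoded = encode(original);
        // The "pdf" field name is a placeholder chosen for this sketch.
        String json = "{\"pdf\":\"" + encoded + "\"}";
        System.out.println(json);

        byte[] restored = decode(encoded);
        System.out.println(Arrays.equals(original, restored));
    }
}
```

Unlike the new String(bytes, UTF_8) round trip, nothing is replaced along the way, so the byte arrays compare equal.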

HTML tags are getting converted

I have the following code snippet to have output from XML data which is stored in the database table
ServletOutputStream os = response.getOutputStream();
String contentDisposition = "attachment;filename=Test.HTML";
response.setHeader("Content-Disposition",contentDisposition);
response.setContentType("text/html");
XMLNode xmlNode = (XMLNode)am.invokeMethod("getDataXML");
ByteArrayOutputStream outputStream =
new ByteArrayOutputStream();
xmlNode.print(outputStream);
ByteArrayInputStream inputStream =
new ByteArrayInputStream(outputStream.toByteArray());
ByteArrayOutputStream pdfFile = new ByteArrayOutputStream();
TemplateHelper.processTemplate(((OADBTransactionImpl)pageContext.getApplicationModule(webBean).getOADBTransaction()).getAppsContext(),
"INV",
"MyTemplate",
((OADBTransactionImpl)pageContext.getApplicationModule(webBean).getOADBTransaction()).getUserLocale().getLanguage(),
((OADBTransactionImpl)pageContext.getApplicationModule(webBean).getOADBTransaction()).getUserLocale().getCountry(),
inputStream,
TemplateHelper.OUTPUT_TYPE_HTML,
null, pdfFile);
byte[] b = pdfFile.toByteArray();
response.setContentLength(b.length);
os.write(b, 0, b.length);
os.flush();
os.close();
pdfFile.flush();
pdfFile.close();
public XMLNode getDataXML() {
OAViewObject vo = (OAViewObject)findViewObject("DataVO");
ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
XMLNode xmlNode =
(XMLNode)vo.writeXML(4, XMLInterface.XML_OPT_ALL_ROWS);
return xmlNode;
}
I have HTML tags which are stored in the table as
<STRONG>this</STRONG> is only a test.
However the above is getting converted to
&lt;STRONG&gt;this&lt;/STRONG&gt;is only a test.
How can I preserve the original HTML tags when I execute the code, or how do I convert them back to the original form without any third-party libraries? We are restricted from using third-party libraries on the server.
take a look at this for more information
The HTML character encoder converts all applicable characters to their
corresponding HTML entities. Certain characters have special
significance in HTML and should be converted to their correct HTML
entities to preserve their meanings. For example, it is not possible
to use the < character as it is used in the HTML syntax to create and
close tags. It must be converted to its corresponding &lt; HTML entity
to be displayed in the content of an HTML page. HTML entity names are
case sensitive.
and then this may help you :
use the Apache Commons StringEscapeUtils.unescapeHtml4() for this:
Unescapes a string containing entity escapes to a string containing
the actual Unicode characters corresponding to the escapes. Supports
HTML 4.0 entities.
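Edit
Java itself has these methods:
URLDecoder.decode(String stringToDecode)
URLDecoder.decode(String stringToDecode, String charset);
Note, however, that URLDecoder reverses URL percent-encoding (for example %3C), not HTML entities (for example &lt;), so it will not restore the tags on its own.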
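Where a third-party library is not an option, the few entities that escape markup characters can be reversed with plain String.replace. This is only a sketch: the class name BasicHtmlUnescape is made up, and it covers just the five basic entities rather than the full HTML 4.0 set that StringEscapeUtils handles.

```java
public class BasicHtmlUnescape {

    // Reverses the five basic entities. "&amp;" is handled last so that
    // a doubly escaped "&amp;lt;" becomes "&lt;" rather than "<".
    static String unescape(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&#39;", "'")
                .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        String escaped = "&lt;STRONG&gt;this&lt;/STRONG&gt; is only a test.";
        System.out.println(unescape(escaped));
    }
}
```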

How to convert UTF-8 to GBK string in java

I retrieved an HTML string from a target site, and within it there is a section
class="f9t" name="Óû§Ãû:ôâÈ»12"
I know it's in GBK encoding, as I can see it from the FF browser display. But I do not know how to convert that name string into a readable GBK string (such as 上海 or 北京).
I am using
String sname = new String(name.getBytes(), "UTF-8");
byte[] gbkbytes = sname.getBytes("gb2312");
String gbkStr = new String( gbkbytes );
System.out.println(gbkStr);
but it's not printed right in GBK text
???¡ì??:????12
I have no clue how to proceed.
If you have already read the name with the wrong encoding and ended up with the wrong value "Óû§Ãû:ôâÈ»12", you can try this, as @Karol S suggested:
new String(name.getBytes("ISO-8859-1"), "GBK")
Or, if you read a GBK or GB2312 string from the internet or a file, use something like this to get the right string in the first place:
BufferedReader r = new BufferedReader(new InputStreamReader(is,"GBK")); name = r.readLine();
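The recovery trick works because ISO-8859-1 maps every byte value to exactly one character, so getBytes("ISO-8859-1") gives back the original bytes unchanged. A small sketch of the round trip follows; the class name GbkRecovery and the sample text are chosen for illustration, and it assumes the JRE ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkRecovery {

    // Undo a wrong ISO-8859-1 decode: recover the raw bytes, then decode them as GBK.
    static String recover(String misdecoded) {
        byte[] rawBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(rawBytes, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        // Two Chinese characters (the word for Shanghai), written as escapes
        // so the source file encoding does not matter.
        String original = "\u4e0a\u6d77";

        // Simulate the bug: GBK bytes decoded with the wrong charset.
        byte[] gbkBytes = original.getBytes(Charset.forName("GBK"));
        String garbled = new String(gbkBytes, StandardCharsets.ISO_8859_1);

        System.out.println(recover(garbled).equals(original));
    }
}
```

This only helps if the wrong decode really was ISO-8859-1; a wrong decode that replaces bytes (for example UTF-8) loses data that cannot be recovered this way.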
Assuming name.getBytes() returns GBK-encoded bytes, it is enough to create a String and specify the encoding of the byte array:
new String(gbkString.getBytes(), "GBK");
According to the documentation, the name of the encoding should be GBK.
Sample code:
String gbkString = "Óû§Ãû:ôâÈ»12";
String utfString = new String(gbkString.getBytes(), "GBK");
System.out.println(utfString);
Result (not 100% sure that it's correct :) ):
脫脙禄搂脙没:么芒脠禄12

How to handle (R) symbol during XML XSLT transformation

I have a UTF-8 XML document (passed as a string) which contains the following line:
<LongName>SomeName®</LongName>.
And it should be transformed into another UTF-8 XML after XSLT transformation. The only problem is with the ® symbol: it is transformed into two symbols: Â®
Here's the code:
public String transform (String inputXML) throws TransformerException {
TransformerFactory factory = TransformerFactory.newInstance();
OutputStream os = new ByteArrayOutputStream();
InputStream transformationFile = getClass().getResourceAsStream(TRANSFORMER_PATH);
Transformer transformer = factory.newTransformer(new StreamSource(transformationFile));
InputStream is = new ByteArrayInputStream(inputXML.getBytes(Charset.forName("UTF-8")));
Source input = new StreamSource(is);
transformer.transform(input, new StreamResult(os));
return os.toString();
}
So the question is - how to correctly transform ® to ® from UTF-8 to UTF-8 XML?
Your error is the last line:
return os.toString();
Since os is a ByteArrayOutputStream it has to convert the byte array to a String and it will use the current platform default encoding instead of UTF-8. You may use return os.toString("UTF-8");.
Instead of
InputStream is = new ByteArrayInputStream(inputXML.getBytes(Charset.forName("UTF-8")));
Source input = new StreamSource(is);
try
Source input = new StreamSource(new StringReader(inputXML));
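The effect of decoding the output stream with the wrong charset can be reproduced without any XSLT. In this sketch the class name ToStringCharsetExample and its helpers are made up, and ISO-8859-1 stands in for a non-UTF-8 platform default. Decoding with new String(os.toByteArray(), StandardCharsets.UTF_8) is used as an equivalent of os.toString("UTF-8") that avoids the checked exception.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class ToStringCharsetExample {

    // Correct: decode the buffered bytes with the charset they were written in.
    static String decodeUtf8(ByteArrayOutputStream os) {
        return new String(os.toByteArray(), StandardCharsets.UTF_8);
    }

    // Simulates os.toString() on a platform whose default charset is ISO-8859-1.
    static String decodeLatin1(ByteArrayOutputStream os) {
        return new String(os.toByteArray(), StandardCharsets.ISO_8859_1);
    }

    // Fills a stream with the UTF-8 bytes of the given text.
    static ByteArrayOutputStream utf8Stream(String text) {
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        os.write(bytes, 0, bytes.length);
        return os;
    }

    public static void main(String[] args) {
        // The registered sign is U+00AE; in UTF-8 it is the two bytes 0xC2 0xAE.
        ByteArrayOutputStream os = utf8Stream("SomeName\u00ae");
        System.out.println(decodeUtf8(os).length());   // 9 characters
        System.out.println(decodeLatin1(os).length()); // 10 characters: one extra before the sign
    }
}
```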
