Eliminating character e280a8 from a Java String

My application takes a Java string and puts it in a JSON response. It works in IE, but in Chrome and Firefox the data does not appear on the page. I get no console errors, and I can inspect the response object in Firebug and the Chrome debugging tools.
I am working with Java 6, and the String in question is created from a CLOB column from an Oracle DB:
4:42 PM<
This is the hex code of the above String as it is on Oracle:
34,3a,34,32,20,50,4d,e2,80,a8,3c
As you can see, between the "M" (4d) and the "<" (3c) we have the bytes e2,80,a8, which in UTF-8 encode a line separator. I tested my application by adding only the substring up to the "M", and it works in all browsers, but the moment I include one more character it breaks. So it is safe to say that this character is causing the issue.
The Java console outputs the string as:
4:42 PMâ€¨<
And its byte values as:
52,58,52,50,32,80,77,-30,-128,-88,60
Since I know that there should not be a line break or anything else between the "M" and "<", I think the solution would be to scrub that character, but desc = desc.replaceAll("â€¨", ""); doesn't seem to work.
Any suggestions?

The bytes are in UTF-8, and it is the Unicode line separator "\u2028". You are right.
desc = desc.replace("\u2028", "");
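A minimal sketch of the replacement above, using the byte values from the question. The class and method names are hypothetical. It also suggests why replaceAll("â€¨", "") had no effect: the string holds the single character U+2028, and the three-character sequence is only how a console using a single-byte code page displays the bytes e2 80 a8.

```java
import java.nio.charset.StandardCharsets;

class LineSeparatorScrubber {

    // Removes every U+2028 (LINE SEPARATOR) character from the input.
    static String scrub(String input) {
        return input.replace("\u2028", "");
    }

    public static void main(String[] args) {
        // The bytes from the question: "4:42 PM", then e2 80 a8, then "<".
        byte[] raw = {52, 58, 52, 50, 32, 80, 77, -30, -128, -88, 60};
        String desc = new String(raw, StandardCharsets.UTF_8);

        System.out.println(desc.length());        // 9: the three bytes decode to one char
        System.out.println(scrub(desc));          // 4:42 PM<
        System.out.println(scrub(desc).length()); // 8
    }
}
```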

Related

\n (new line) not working in creating text file

I want to download a text file by clicking a button, and everything works as expected. The problem is that the data I insert into the text file ends up on a single line.
String fileContent = "Simple Solution \nDownload Example 1";
Here, \n is not working. The resulting output is:
Simple Solution Download Example 1
Code snippets:
interface:
interface implementation in my service class:
controller:

Don't hardcode \n or \r\n: line separators are platform-specific (Windows differs from other operating systems).
What you can do is:
Use System.lineSeparator()
Build content with String.format() and replace \n with %n
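Both options can be sketched as follows. The class and method names are hypothetical. Note that the separator is the one of the JVM that runs the code (the server), not of the machine that later opens the file.

```java
class LineSeparatorDemo {

    // Joins two lines with the line separator of the platform running this JVM.
    static String withLineSeparator(String first, String second) {
        return first + System.lineSeparator() + second;
    }

    // Same result via String.format: %n expands to the platform line separator.
    static String withFormat(String first, String second) {
        return String.format("%s%n%s", first, second);
    }

    public static void main(String[] args) {
        String a = withLineSeparator("Simple Solution", "Download Example 1");
        String b = withFormat("Simple Solution", "Download Example 1");
        System.out.println(a.equals(b)); // true
        System.out.println(a);
    }
}
```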

The main problem is that the server and the client computer are independent of each other with respect to character encoding and line separators.
Defaults will not do.
As we are living in a Windows-centric world (I am a Linux user), use "\r\n".
Java strings can mix any Unicode scripts, but a file carries no information about its encoding.
If it originates on another computer or platform, that causes problems.
String fileContent = "Simple Solution façade, mañana, €\r\n"
+ "Download Обичам ĉĝĥĵŝŭ Example 1";
So the originating computer must define the encoding explicitly. It should not do:
fileContent.getBytes(); // Default platform encoding, Charset.defaultCharset().
Instead, the originating computer can do one of:
fileContent.getBytes(StandardCharsets.UTF_8); // UTF-8, full Unicode.
fileContent.getBytes("Windows-1252"); // MS Windows Latin 1, unmappable characters become ?.
The contentType can then be set accordingly: "text/plain;charset=UTF-8", or for Windows-1252 "text/plain;charset=windows-1252".
From that byte[] you should take the .length for the contentLength.
Writing to a file can use Files.writeString.
In that case use Files.size(exportedPath) for the content length.
Files.newInputStream(exportedPath) is the third goodie from Files.
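A rough sketch of the approach described above, assuming Java 11 or later for Files.writeString. The class name and the temporary file are illustrative only. The non-ASCII characters are written as Unicode escapes so that the example does not depend on the source file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitEncodingExport {

    // Encodes the text as UTF-8 so the result does not depend on platform defaults.
    static byte[] toUtf8(String content) {
        return content.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // c-cedilla, n-tilde and the euro sign, followed by an explicit CR LF.
        String fileContent = "Simple Solution fa\u00e7ade, ma\u00f1ana, \u20ac\r\n"
                + "Download Example 1";

        // Option 1: send the bytes directly and take the length from the array.
        byte[] body = toUtf8(fileContent);
        System.out.println("Content-Type: text/plain;charset=UTF-8");
        System.out.println("Content-Length: " + body.length);

        // Option 2: write a file first and take the length from the file.
        Path exportedPath = Files.createTempFile("export", ".txt"); // hypothetical location
        Files.writeString(exportedPath, fileContent, StandardCharsets.UTF_8);
        System.out.println("File size: " + Files.size(exportedPath));
        Files.delete(exportedPath);
    }
}
```

Both lengths are counted in bytes, not characters, which is why fileContent.length() is not a safe value for the content length.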

Character coding between mysql and java

I have a problem printing special characters in Java.
The system reads a product name from a MySQL database. When I check the database from the command line, the data displays the registered symbol ® correctly.
A Java program then reads order information from the database to print as a PDF, but in the printed output the ® symbol becomes 'fi'.
Is there a way to retain the MySQL character encoding when handling the data in Java?

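Before printing to PDF, you can replace the special characters with the Unicode characters, as below.
public static String specialCharactersConversion( String charString ) {
    // Guard against null or empty input before replacing.
    if( charString != null && !charString.isEmpty() ){
        // Replace the ASCII stand-in "(R)" with the registered sign U+00AE.
        charString = charString.replaceAll( "\\(R\\)", "\u00AE" );
    }
    return charString;
}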

There is an open source Java library, MgntUtils, that has a utility for converting Strings to Unicode sequences and vice versa:
String result = "Hello World";
result = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence(result);
System.out.println(result);
result = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(result);
System.out.println(result);
The output of this code is:
\u0048\u0065\u006c\u006c\u006f\u0020\u0057\u006f\u0072\u006c\u0064
Hello World
So before converting your text to PDF, you can convert the special characters or the entire text to Unicode sequences. This answer is copied with modifications from the question "Convert International String to \u Codes in java".
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and Javadoc.
Here is javadoc for the class StringUnicodeEncoderDecoder
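For readers who want to see what such a conversion involves without adding a dependency, here is a JDK-only sketch that produces the same style of output. It is not the MgntUtils implementation. The class and method names are hypothetical, and decode assumes its input consists solely of six-character escape sequences.

```java
class UnicodeEscapes {

    // Encodes each UTF-16 code unit as a backslash, 'u' and four lowercase hex digits.
    static String encode(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04x", (int) c));
        }
        return sb.toString();
    }

    // Reverses encode(). Assumes the input contains nothing but such escapes.
    static String decode(String escaped) {
        StringBuilder sb = new StringBuilder();
        // Each escape is 6 characters long; the hex digits are the last 4.
        for (int i = 0; i + 6 <= escaped.length(); i += 6) {
            sb.append((char) Integer.parseInt(escaped.substring(i + 2, i + 6), 16));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String encoded = encode("Hello World");
        System.out.println(encoded);
        System.out.println(decode(encoded)); // Hello World
    }
}
```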

Gradle/Eclipse: Different behavior of german "Umlaute" when using equality?

I am experiencing weird behavior with German "Umlaute" (ä, ö, ü, ß) when using Java's equality checks (either directly or indirectly).
Everything works as expected when running, debugging or testing from Eclipse and input containing "Umlaute" is treated as equal or not as expected.
However when I build the application using Spring Boot and run it, these equality checks fail for words that contain "Umlaute", i.e. for words like "Nationalität".
Input is retrieved from a webpage via Jsoup and content of a table is extracted for some keywords. The encoding of the page is UTF-8 and I have handling in place for Jsoup to convert it if this is not the case.
The encoding of the source files is UTF-8 as well.
Connection connection = Jsoup.connect(url)
        .header("accept-language", "de-de, de, en")
        .userAgent("Mozilla/5.0")
        .timeout(10000)
        .method(Method.GET);
Response response = connection.execute();
if(logger.isDebugEnabled())
    logger.debug("Encoding of response: " +response.charset());
Document doc;
if(response.charset().equalsIgnoreCase("UTF-8"))
{
    logger.debug("Response has expected charset");
    doc = Jsoup.parse(response.body(), baseURL);
}
else
{
    logger.debug("Response doesn't have expected charset and is converted");
    doc = Jsoup.parse(new String(response.bodyAsBytes(), "UTF-8"), baseURL);
}
logger.debug("Encoding of document: " +doc.charset());
if(!doc.charset().equals(Charset.forName("UTF-8")))
{
    logger.debug("Changing encoding of document from " +doc.charset());
    doc.updateMetaCharsetElement(true);
    doc.charset(Charset.forName("UTF-8"));
    logger.debug("Changed encoding of document to: " +doc.charset());
}
return doc;
Example log output (from deployed app) of reading content.
Encoding of response: utf-8
Response has expected charset
Encoding of document: UTF-8
Example input:
<tr><th>Nationalität:</th> <td> [...] </td> </tr>
Example code that fails for words containing ä, ö, ü or ß but works fine for other words:
Element header = row.select("th").first();
String text = header.ownText();
if("Nationalität:".equals(text))
{
    // goes here in eclipse
}
else
{
    // and here in deployed spring boot app
}
Is there any difference between running from Eclipse and a built & deployed app that I am missing? Where else could this behavior come from, and how can it be resolved?
As far as I can see this is not (directly) an encoding issue since the input shows "Umlaute" correctly...
Since this is not reproducible when debugging, I am having a hard time figuring out what exactly goes wrong.
Edit: While the input looks fine in the logs (i.e. diacritics show up correctly), I realized that they don't look correct in the console:
<th>Nationalit├ñt:</th>
I am currently using a Normalizer as suggested by Mirko like this:
Normalizer.normalize(input, Form.NFC);
(also tried it with NFD).
How do the (Spring Boot) console output and the (logback) log output differ?

Diacritics like umlauts can often be represented in two different ways in Unicode: as a single-codepoint character or as a composition of two characters. This isn't a problem of the encoding; it can happen in UTF-8, UTF-16, UTF-32 etc.
Java's equals method does not consider composite characters equal to single-codepoint characters, even though they look exactly the same.
Try looking at the binary representation of the strings you are comparing. That way you should be able to track down the differences.
You could also use the methods of the "Character" class to iterate through the strings and print out the properties of all the characters. That may help to figure out the differences, too.
In any case, it could help to use java.text.Normalizer on both "sides" of the "equals", to normalize the text to, for example, Unicode Normalization Form C. This way, differences like the aforementioned should be straightened out and the strings should compare as expected.
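The difference between the two representations, and the effect of normalizing both sides, can be sketched like this. The class and method names are hypothetical, and the umlaut is written with Unicode escapes so the example does not depend on the source file encoding.

```java
import java.text.Normalizer;

class UmlautNormalization {

    // Compares two strings after bringing both into Normalization Form C.
    static boolean equalsNormalized(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String composed = "Nationalit\u00e4t:";    // a-umlaut as the single code point U+00E4
        String decomposed = "Nationalita\u0308t:"; // 'a' followed by combining diaeresis U+0308

        System.out.println(composed.equals(decomposed));            // false
        System.out.println(equalsNormalized(composed, decomposed)); // true
    }
}
```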

Have you tried printing the character codes to the console to see if they actually match when compiled? Maybe Eclipse is handling the charset gracefully, but when it's compiled it's down to some Java/system settings?
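One way to print the character codes is to dump the code points of both strings. The sketch below uses hypothetical names. Its second example simulates, as an assumption about one possible cause, a UTF-8 umlaut whose bytes were read with a single-byte charset, which turns one character into two.

```java
import java.nio.charset.StandardCharsets;

class CodePointDump {

    // Renders every code point of the text in U+XXXX notation, separated by spaces.
    static String dump(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String expected = "\u00e4"; // a-umlaut
        // Simulated mis-decoding: UTF-8 bytes interpreted as ISO-8859-1.
        String misread = new String(expected.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        System.out.println(dump(expected)); // U+00E4
        System.out.println(dump(misread));  // U+00C3 U+00A4
    }
}
```

Logging dump(text) next to dump("Nationalität:") in the deployed application would show whether the literal or the parsed input is the side that differs.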

I think I tracked this down to the build of the standalone app being the culprit.
As described above, when running from Eclipse all is fine, the problem only occurred when I ran the standalone Spring Boot app.
This is being built with Gradle. In my build.gradle I have
compileJava.options.encoding = 'UTF-8'
in order to force UTF-8 to be used as the source encoding. This should (usually) be enough. However, I also use AspectJ (via the gradle-aspectj plugin), which apparently breaks this behavior (involuntarily?) and results in a default encoding being used instead of the one explicitly defined.
In order to solve this I added
compileAspect {
    additionalAjcArgs = ['encoding' : 'UTF-8']
}
to my build.gradle which passes the encoding option on to the ajc compiler. This seems to have fixed the problem for the regular build.
The problem still occurs however when tests are run from gradle. I was not yet able to find out what needs to be done there and why the above configuration is not enough.
This is now tracked in a separate question.

Multiple words not getting searched , not taking space

When I pass a string containing a space between words to the servlet and run the Android application, I get an error like this:
03-01 09:32:41.110: E/Excepiton(1301): java.io.FileNotFoundException: http//address of server:8088/First/MyServlet?ads_title=test test&city=Pune
Here ads_title=test test and city = Delhi.
It works fine when I pass a single-word string, like ads_title=test and city = Delhi.
When I run the SQL query directly with both values it works, so the query itself is fine.
String stringURL="http//laddress of server:8088/First/MyServlet" +
String.format("?ads_title=%s&city=%s",editText1.getText(),City);
That is where I am passing the values.

Data sent as part of a URL must be "encoded" so that it reaches the server intact and is interpreted correctly. Fortunately, Java provides a standard class, URLEncoder, and the encoding specified by the World Wide Web Consortium is "UTF-8". Encode each parameter value (not the whole URL, or the separators ?, & and = would be escaped too), for example:
String title = URLEncoder.encode(editText1.getText().toString(), "UTF-8");
(That way you don't have to know what the encoding is for each special character.)
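A minimal sketch of building the query string with encoded parameter values. The class name, method names and the host example.com are placeholders. Note that URLEncoder produces form encoding, so a space becomes + rather than %20; servlet containers decode both when reading query parameters.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class QueryStringBuilder {

    // Encodes a single parameter value as UTF-8 form data.
    static String encodeValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Appends the two encoded parameters to the base URL.
    static String buildUrl(String base, String adsTitle, String city) {
        return base + "?ads_title=" + encodeValue(adsTitle)
                + "&city=" + encodeValue(city);
    }

    public static void main(String[] args) {
        String url = buildUrl("http://example.com:8088/First/MyServlet", "test test", "Pune");
        System.out.println(url);
        // http://example.com:8088/First/MyServlet?ads_title=test+test&city=Pune
    }
}
```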

I agree with the comments (not sure why they weren't posted as an answer): try encoding your URL so that the space is handled correctly (%20).
See: Java URL encoding of query string parameters

Chinese character in URL with Java

I used the following line in Firefox's URL field :
http://www.baidu.com/s?wd=你
This line was generated by my Java program.
The last Chinese character in the URL field sometimes became: %C4%E3 [Correct]
Other times it became: %E4%BD%A0 [Incorrect]
I tried to use the URL with IE. It shows up still as 你, but the result page search field shows the character as 浣. Could this be a UTF-8 or UTF-16 encoding problem? How do I get the correct code %C4%E3 from the char 你 with my Java program?

URLEncoder.encode(string, encoding)
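The two results in the question correspond to two different charsets: %E4%BD%A0 is the UTF-8 encoding of 你, and %C4%E3 is its GBK (GB2312) encoding. A short sketch with hypothetical names follows. It assumes the JRE ships the GBK charset, which is part of the extended charsets and may be missing from trimmed-down runtimes.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.Charset;

class ChineseUrlEncoding {

    // Percent-encodes the text using the named charset.
    static String encode(String text, String charsetName) {
        try {
            return URLEncoder.encode(text, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Charset not available: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String ni = "\u4f60"; // the character from the question, written as an escape

        System.out.println(encode(ni, "UTF-8")); // %E4%BD%A0
        if (Charset.isSupported("GBK")) {
            System.out.println(encode(ni, "GBK")); // %C4%E3
        }
    }
}
```

Which value counts as correct depends on the charset the receiving site expects, so the charset name has to match the server rather than the platform default.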
