Cannot verify text which contains diacritics - java

I am using selenium webdriver with java and trying to verify some texts I find on a page.
The text contains diacritics like ţ ă etc.
The problem occurs when I run my tests from the command line using Maven; I need to do this because I will be integrating them into Jenkins.
So I have a simple assert in my test:
Assert.assertEquals("some text with ţ", driver.findElement(text).getText());
which fails, and I don't know the right way to make this work.
I have read that the default encoding for strings in Java is UTF-16, so when the text is taken from the page with getText, the string is already encoded, and I suppose that means the characters are lost. On the other hand, I don't know whether the expected text itself, "some text with ţ", is interpreted correctly.
Has anyone had problems similar to this? And how have you solved them?
Thanks

Maven is even issuing a warning specific to your error:
WARNING: character encoding not set. Using the platform default encoding, i.e., the
build is platform-dependent!
or a similar message.
The solution is to:
make sure you save the Java source code files in UTF-8;
make sure you explicitly configure the encoding in pom.xml (a setting on the Compiler plugin).
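A minimal sketch of the pom.xml side of this; the property names are the standard Maven ones, and setting `project.build.sourceEncoding` is picked up by both the Compiler and Resources plugins:

```xml
<properties>
  <!-- Read .java sources (and filtered resources) as UTF-8 -->
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
  <!-- Encoding for generated reports (Surefire, Javadoc) -->
  <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
</properties>
```

If the mismatch persists when the tests actually run, the Surefire plugin's `argLine` can additionally pass `-Dfile.encoding=UTF-8` to the forked test JVM.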

Related

Using iText to create PDF, special characters display properly when executed from command line don't display from app

I have a PDF that is being generated using iText 5 with Java. Recently, I discovered that special characters (° ≥ ≤ etc.) were not displaying. I have embedded a Unicode .ttf file and can generate a PDF with the necessary characters when executing the .jar from the command line, logged in as myself. When executing the .jar from a PHP file (using shell_exec(), exec(), and system()), the PDF is created and all of the content is there, except that the special characters have been replaced with ? symbols.
I have a feeling that the issue stems from the process running as the Apache user. To verify that it was not a PHP issue, I also started an interactive PHP session, logged in as myself, used the exact command that the web application executes (via shell_exec(), exec(), and system()), and the characters displayed correctly.
Additionally, I have checked httpd.conf to determine if the defaultCharSet was UTF-8 and found the AddDefaultCharset UTF-8 was present. I have also created a .htaccess file setting UTF-8 as the default charset.
Any ideas as to what is going on? I feel it has to be user/apache related, but have hit a wall.
UPDATE:
I have a button click event that triggers the php command:
$output = shell_exec("java -jar /data/eng/java/pdf-generator.jar $arg1 $arg2");
Everything else happens internally in the pdf-generator.jar program, which then makes a call to the server, which returns an XML string. I have validated the XML and know that it is not causing the issue. Additionally, I have written another test script that uses the PHP function:
echo mb_detect_encoding($response);
which returns the string UTF-8, so I know the string is encoded in UTF-8.
The PDF always updates regardless of the method of executing the jar, the only difference is the symbols.
Found a solution to my problem. Adding these three lines before the shell_exec() call forced UTF-8 encoding:
$locale='en_US.UTF-8';
setlocale(LC_ALL,$locale);
putenv('LC_ALL='.$locale);
$output = shell_exec("java -jar /data/eng/java/pdf-generator.jar $arg1 $arg2");
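The reason the environment matters on the Java side is that the JVM derives its default charset from the locale it inherits at startup. A small sketch of this (the reported charset depends on the machine, so no particular output is guaranteed), along with a way to force UTF-8 output regardless of locale:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.Charset;

public class CharsetDemo {
    public static void main(String[] args) throws Exception {
        // Under a bare environment (e.g. the Apache user with no LC_ALL set),
        // this often reports a plain-ASCII charset such as ANSI_X3.4-1968,
        // which cannot represent ° or ≥ and substitutes ? instead.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // Writing through an explicitly UTF-8 PrintStream sidesteps the
        // default entirely, whatever the locale says.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        PrintStream out = new PrintStream(buf, true, "UTF-8");
        out.print("\u00b0 \u2265 \u2264"); // ° ≥ ≤ written as escapes
        System.out.println("Bytes written: " + buf.size());
    }
}
```

Setting LC_ALL before shell_exec(), as above, fixes the same problem from the PHP side by changing what the JVM inherits.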

Java UTF8 encoding issues with Storm

Pretty desperate for help after 2 days trying to debug this issue.
I have some text that contains unicode characters, for example, the word:
korte støvler
If I run code that writes this word to a file on one of the problem machines, it works correctly. However, when I write the file exactly the same way in a storm bolt, it does not encode correctly, and the ø character is replaced with question marks.
In the storm_env.ini file I have set
STORM_JAR_JVM_OPTS:-Dfile.encoding=UTF-8
I also set the encoding as UTF-8 in the code, and in mvn when it is packaged.
I have run tests on the boxes to check JVM default encodings, and they are all UTF-8.
I have tried 3 different methods of writing the file and all cause the same issue, so it is definitely not that.
This issue was fixed simply by building another machine on EC2, even though it had exactly the same software versions and configuration as the boxes with issues.
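Rebuilding the box worked around the problem rather than explaining it. One way to take file.encoding out of the equation entirely is to name the charset at the point of writing; a minimal sketch, where a byte buffer stands in for the bolt's real file output:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUtf8 {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        // Naming the charset here bypasses -Dfile.encoding and the platform
        // default, so every machine writes identical bytes.
        try (Writer out = new OutputStreamWriter(buf, StandardCharsets.UTF_8)) {
            out.write("korte st\u00f8vler"); // \u00f8 is ø
        }
        System.out.println(buf.size() + " bytes"); // 12 ASCII chars + 2 for ø = 14
    }
}
```

Writers obtained via the no-charset convenience constructors (FileWriter, PrintWriter(String)) are the ones that silently pick up the platform default.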

Eclipse UTF-8-weird characters

I am writing a program in Java with the Eclipse IDE, and I want to write my comments in Greek. So I changed the encoding to UTF-8 under Window->Preferences->General->Content Types->Text->Java Source File. The comments in my code are fine, but when I run my program some words contain weird characters, e.g. San Germ�n (San Germán). If I change the encoding to ISO-8859-1, everything is fine when I run the program, but the comments in my code are not (weird characters!). So, what is going wrong?
Edit: My program uses Java Swing, and the weird characters with UTF-8 are strings in cells of a JTable.
EDIT (2): OK, I solved my problem. I keep the UTF-8 encoding for the Java file but change the encoding of the strings: String k = new String(myStringInByteArray, "ISO-8859-1");
This is most likely due to the compiler not using the correct character encoding when reading your source. This is a very common source of error when moving between systems.
The typical way to solve it is to use plain ASCII (which is identical in both Windows-1252 and UTF-8) together with the "\u1234" escape scheme (Unicode character 0x1234), but this is a bit cumbersome to handle, as Eclipse (last time I looked) did not transparently support it.
The property file editor does, though, so a reasonable suggestion could be that you put all your strings in a property file, and load the strings as resources when needing to display them. This is also an excellent introduction to Locales which are needed when you want to have your application be able to speak more than one language.
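As a sketch of the property-file approach (the key and value here are made up for illustration; normally the text would live in a messages.properties file on the classpath, loaded via ResourceBundle.getBundle):

```java
import java.io.StringReader;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class BundleDemo {
    public static void main(String[] args) throws Exception {
        // Inline stand-in for a messages.properties file. Writing the accented
        // character as a \u escape keeps the file pure ASCII, so it survives
        // any editor or compiler encoding; Properties decodes it on load.
        String props = "city=San Germ\\u00e1n";
        ResourceBundle bundle = new PropertyResourceBundle(new StringReader(props));
        System.out.println(bundle.getString("city")); // prints "San Germán"
    }
}
```

Note that before Java 9, .properties files read from disk are decoded as ISO-8859-1, which is exactly why the \uXXXX escapes (or the native2ascii tool) are needed there; since Java 9 they are read as UTF-8.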

How to configure content type and content encoding using glassfish server?

I am building a web application using JSF and Spring in Eclipse Indigo with GlassFish Server 3.1.2. Everything is going great, but Firebug is showing me an error in 2 JavaScript files.
When I checked those files I didn't find any illegal characters, but Firebug still shows the error.
I have used these files in an ASP.NET project, where they caused no problems, so I compared their content types in both projects. In the ASP.NET project these files have
Content-Type = application/x-javascript
while in my JSF-Spring (Java) project they have
Content-Type = text/javascript;charset=ISO-8859-1
So you can see that the same files are being served with a different content scheme. I found that this can be changed by configuration in GlassFish Server, so I want to change my JS files' Content-Type to match the ASP.NET one.
If anyone has any other solution, please share, because I haven't found anything other than changing the scheme in GlassFish Server. Thanks
Those strange characters you are seeing are the UTF-8 Byte Order Mark (BOM): a special sequence of bytes that indicates how a document is encoded. Unfortunately, when it is interpreted as ISO-8859-1, you wind up with the problem you have. There are two ways to resolve this.
The first way is to change the output character set to UTF-8. This can be done in your server configuration, in your web.xml configuration, or by setting the character set on the HTTP response object in code; for example: response.setCharacterEncoding("UTF-8");
The second way is to remove the BOM from your JavaScript files. You can do this by opening them in Notepad++, going to Encoding > Convert to ANSI, and then saving them. Alternatively, open them in Notepad, go to Save As, and ensure that the Encoding option is set to ANSI before saving. Note that this may cause issues if you have non-ISO-8859-1 text in your JavaScript files, although this is unlikely.
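If editing the files by hand is impractical, the BOM can also be stripped programmatically; a small sketch (the class and method names are made up for illustration):

```java
import java.nio.charset.StandardCharsets;

public class BomStrip {
    // The UTF-8 BOM is the three bytes EF BB BF at the start of a file.
    static byte[] stripBom(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            byte[] out = new byte[data.length - 3];
            System.arraycopy(data, 3, out, 0, out.length);
            return out;
        }
        return data; // no BOM present; leave untouched
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'v', 'a', 'r'};
        System.out.println(new String(stripBom(withBom), StandardCharsets.US_ASCII)); // prints "var"
    }
}
```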

JTable won't display Unicode correctly when the application is executed from the command line or a jar file. It works fine in eclipse, though

I'm writing an application that reads a text file containing a list of vocabulary words in both English and Chinese. These are then displayed in a JTable. When I run or debug the app in Eclipse, everything displays fine. I can see and read the characters and the English. However, when I execute the app from the command line or from an executable jar, it's all wrong. The characters show up as either squares or as gibberish.
I also have a text box that when I type Chinese into it, it displays correctly.
My first thought was that it was a font problem. I was using a font installed on my system. Since I can't guarantee that the person using this app will have that font, I moved it to a resource folder and load the font from a file. The font appears as though it's been loaded so I'm convinced it's not a font issue.
I found another question that suggested using -Dfile.encoding=utf-8. I've tried this and it did not work.
Would the brilliant folks at Stack Overflow have any advice on how to make this work?
I'm writing this on a non-chinese version of Windows.
Well then you won't ever be able to get a Java program to produce Chinese command-line output.
Java, like almost all languages, uses the C standard library, which has byte-based I/O. The Windows command prompt interprets byte-based I/O using the locale-specific default code page. That is never a UTF, so Unicode characters outside the current locale's default code page simply won't work.
(In theory you should be able to get it to work by changing your console fonts and using chcp 65001 (UTF-8) together with -Dfile.encoding=UTF-8, but in practice it doesn't work reliably due to bugs in the C runtime. Unicode on the command prompt is a long-standing sore point.)
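For the original JTable problem, the more robust fix is to read the vocabulary file with an explicit charset rather than relying on -Dfile.encoding. A minimal sketch, assuming the file is UTF-8; a byte array stands in for the real file here:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUtf8 {
    public static void main(String[] args) throws Exception {
        // Simulated UTF-8 vocabulary file; in the real app this would be
        // a FileInputStream over the actual word list.
        byte[] fileBytes = "\u4f60\u597d\thello".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(fileBytes),
                                      StandardCharsets.UTF_8))) {
            // Naming the charset makes the result independent of the platform
            // default, so the JTable gets intact Chinese text everywhere.
            System.out.println(in.readLine());
        }
    }
}
```

FileReader (before Java 11) is the usual culprit here: it always decodes with the platform default, which is why the app worked inside Eclipse (project encoding UTF-8) but not from a plain command line.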
