IntelliJ IDEA encoding problems in Gradle project - java

Normally I do not ask questions here, but the problem I am facing is so strange that I can't fight it alone any more; I'm exhausted. I'm going to describe everything I have found, in the hope that some of it helps someone help me.
Software versions:
- OS: Windows 10 Pro version: 1909 build: 18363.720
- IntelliJ IDEA: 2019.2.4 Ultimate
- Gradle wrapper version: 5.2.1-all
- jdk: 8
The problem lies in encodings, specifically in the console output of a Gradle project.
Here is my build.gradle file:
plugins {
    id 'java'
    id 'idea'
    id 'application'
}
group 'com.diceeee.mentoring'
version 'release'
sourceCompatibility = 1.8
application.mainClassName('D')
compileJava.options.encoding = 'utf-8'
tasks.withType(JavaCompile) {
    options.encoding = 'utf-8'
}
repositories {
    mavenCentral()
    jcenter()
}
dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
My sources are UTF-8 with CRLF line endings, so build.gradle tells the compiler to read them as utf-8 rather than my system default, windows-1251.
Here is D.java:
import java.io.FileWriter;
import java.io.IOException;

public class D {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        String testLine = "Проверка работоспособности И Ш";
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
Also I have gradle.properties with one line:
org.gradle.jvmargs=-Dfile.encoding=utf-8
I checked that this takes effect: the charset of the encoder behind System.out really did change to utf-8.
When I run my gradle project, I get this:
21:04:53: Executing task 'D.main()'...
> Task :compileJava UP-TO-DATE
> Task :processResources NO-SOURCE
> Task :classes UP-TO-DATE
> Task :D.main()
UTF-8
�������� ����������������� � �
Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.2.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 0s
2 actionable tasks: 1 executed, 1 up-to-date
21:04:54: Task execution finished 'D.main()'.
Here is more information.
1) It's no coincidence that I left the file output in the code. If we look at the file, we see this:
Проверка работоспособности И Ш
I'm not sure this is right, but I concluded that the problem lies somewhere in the console: if the default encoding were wrong, FileWriter would have used the wrong encoding for the file too, and both outputs would be equally broken. That is not what happens.
2) I have debugged internals of PrintStream, OutputStreamWriter and StreamEncoder classes. StreamEncoder really uses utf-8 charset, also it encoded utf-8 text to the right byte sequence:
String testLine = "Проверка работоспособности И Ш";
Every Cyrillic letter takes 2 bytes in UTF-8 and every space takes 1 byte; 27 letters and 3 spaces give 57 bytes.
Now, look here:
[Screenshot: encoder debugging screen with the resulting bytes]
As we can see, the first 57 bytes are ours (the rest come from other writes; the buffer is bounded by its limit):
[-48, -97, -47, -128, -48, -66, -48, -78, -48, -75, -47, -128, -48, -70, -48, -80, 32, -47, -128, -48, -80, -48, -79, -48, -66, -47, -126, -48, -66, -47, -127, -48, -65, -48, -66, -47, -127, -48, -66, -48, -79, -48, -67, -48, -66, -47, -127, -47, -126, -48, -72, 32, -48, -104, 32, -48, -88, 91]
This looks right: Cyrillic letters are encoded as [-48, -97], [-47, -128] and other 2-byte groups, and the spaces match too. So the encoder does its job. What is happening, then?
I don't know. Seriously. But there is more.
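For readers who want to check the byte arithmetic above without a debugger, here is a minimal sketch in Python (chosen only for brevity; it is not part of the original project). It encodes the same test string as UTF-8, counts the bytes, and converts them to the signed values that a Java debugger shows. The helper name to_signed is made up for illustration.

```python
# Sketch: reproduce the UTF-8 byte count and the signed byte values seen in the debugger.
test_line = "Проверка работоспособности И Ш"

encoded = test_line.encode("utf-8")


def to_signed(byte_value: int) -> int:
    """Map an unsigned byte (0..255) to Java's signed byte range (-128..127)."""
    return byte_value - 256 if byte_value > 127 else byte_value


signed = [to_signed(b) for b in encoded]

print(len(encoded))  # number of UTF-8 bytes in the test line
print(signed[:4])    # the first two Cyrillic letters, two bytes each
```

If the count and the leading values agree with the debugger, the encoding step itself is behaving as expected, which supports the conclusion that the corruption happens later.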
I created a clean Java project without Gradle, Maven, or anything else: just my JDK.
The program is the same:
package com.company;

import java.io.FileWriter;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        String testLine = "Проверка работоспособности И Ш";
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
I run it and what do I get?
"C:\Program Files\Java\jdk1.8.0_181\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2019.2.4\lib\idea_rt.jar=58901:C:\Program Files\JetBrains\IntelliJ IDEA 2019.2.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_181\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\rt.jar;C:\Users\<my_removed_name>\IdeaProjects\test\out\production\test" com.company.Main
UTF-8
Проверка работоспособности И Ш
Process finished with exit code 0
At that point I gave up trying to understand it. What is happening? Back to the Gradle project for a moment. I made a small modification:
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class D {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "windows-1251");
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
And output now is:
21:43:06: Executing task 'D.main()'...
> Task :compileJava
> Task :processResources NO-SOURCE
> Task :classes
> Task :D.main()
UTF-8
Проверка работоспособности �? Ш
Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.2.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 0s
2 actionable tasks: 2 executed
21:43:06: Task execution finished 'D.main()'.
In file:
Проверка работоспособности � Ш
This console output is also what first made me notice something was wrong: I was just coding and saw that the Cyrillic "И" was broken. I tried to fix it again and again, and now I'm here at a dead end. I have tried everything I found in similar questions and topics about encoding problems. I have read articles about the default encoding in Java, about the Windows console using cp866 while windows-1251 is the system default, and about setting the encoding explicitly with -Dfile.encoding=UTF-8. Nothing helps, and I don't even know what to look for. I thought Gradle did not pick up the property and the charset was still windows-1251, but debugging showed I was wrong.
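One plausible reading of why only "И" breaks (an assumption, not something the debugging above proves) is that the UTF-8 bytes pass through windows-1251 somewhere between the program and the console. "И" is U+0418, which is D0 98 in UTF-8, and byte 0x98 is unassigned in windows-1251, so it cannot survive such a round trip. The Python sketch below simulates the modified program followed by a console path that, by assumption, encodes the text as windows-1251 and displays those bytes as UTF-8. The function name is made up for illustration.

```python
# Sketch: simulate UTF-8 bytes being misread as windows-1251 and then converted back.
# The "console path" in step 2 is an assumption about the environment, not a known fact.

def through_cp1251(text: str) -> str:
    # Step 1: what the modified program does: UTF-8 bytes decoded as windows-1251.
    misread = text.encode("utf-8").decode("cp1251", errors="replace")
    # Step 2: assumed console path: encode as windows-1251, display the bytes as UTF-8.
    raw = misread.encode("cp1251", errors="replace")
    return raw.decode("utf-8", errors="replace")


for sample in ("Ш", "И", "Проверка работоспособности И Ш"):
    print(repr(sample), "->", repr(through_cp1251(sample)))
```

In this simulation every letter except "И" comes back intact, because each of its UTF-8 bytes maps to some windows-1251 character and back again; only the 0x98 byte is lost.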
Here is the complete list of things I have tried:
1) Set -Dfile.encoding=UTF-8 in idea.exe.vmoptions and idea64.exe.vmoptions and restarted. Didn't help.
2) Set UTF-8 everywhere in IntelliJ IDEA -> Settings -> Editor -> File Encodings. Didn't help.
3) Set the Gradle compiler encoding to utf-8. Didn't help.
4) Set the Gradle JVM option org.gradle.jvmargs=-Dfile.encoding=utf-8. Didn't help.
5) Checked that Windows uses Russian as the language for non-Unicode programs, for Cyrillic support. Didn't help.
I'm not sure what the problem with Gradle is, because the clean project without Gradle works fine and its console output is correct. With Gradle, Cyrillic characters come out wrong. I also tried to correct the console output with getBytes(charset) and new String(byte[], charset), in these variants:
String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "windows-1251");
Output:
Проверка работоспособности �? Ш
Not working.
String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "cp866");
Output:
?�?�???????�???? ?�???????�???�?????�?????????�?�?? ?� ?�
Not working.
String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "utf-8");
Output:
�������� ����������������� � �
This is the same result we get without any conversion.
I also tried one more thing: wrapping System.out to set a different console encoding.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintStream;

public class D {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        // Re-wrap System.out so that it encodes its output as UTF-8 explicitly.
        System.setOut(new PrintStream(System.out, true, "utf-8"));
        String testLine = "Проверка работоспособности И Ш";
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
The output still did not change at all:
> Task :D.main()
UTF-8
�������� ����������������� � �
Given all this, I think something is wrong with the console itself, because even the last run of the code above writes this to the file:
Проверка работоспособности И Ш
The file is UTF-8 and its content is correct. But System.out.println prints garbage to the console even though the encoder works correctly. I don't know what is going on. If the problem really is in Gradle, how can I check it? How can I make Gradle use a different encoding for console output? Or is it still something in IntelliJ IDEA, even though the output of the project without Gradle is correct?
I feel like a detective who has stalled on the case. I'd be grateful to anybody who can help.

Go to Run | Edit Configurations, select your run configuration, and enter -Dfile.encoding=UTF-8 in the VM options field. This resolved the issue for me.

I was experiencing a similar issue.
It's a problem specific to Gradle in IntelliJ on a Windows installation with a non-ASCII system language.
I solved it in the following way:
1) Set systemProp.file.encoding=utf-8 in the project's gradle.properties file.
2) In IntelliJ, go to Settings -> Tools -> Terminal -> Application Settings and set cmd.exe /K "chcp 65001" as the "Shell path". By default the shell path is just cmd.exe.
The property in gradle.properties should make the Gradle build work in IntelliJ, and the shell path setting fixes the encoding in the integrated terminal.
If you use cmd outside IntelliJ rather than the integrated terminal, simply run chcp 65001 in the console. This switches the cmd console's code page to UTF-8.

Change the font to one that is able to correctly display all the characters in Settings (Preferences on macOS) | Editor | Font | Font settings.

Related

python tarfile.py "file could not be opened successfully"

I have a tarball that I can't open using python:
>>> import tarfile
>>> tarfile.open('/tmp/bad.tar.gz')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tarfile.py", line 1672, in open
raise ReadError("file could not be opened successfully")
tarfile.ReadError: file could not be opened successfully
but I'm able to extract the file with no problem on the command line.
$ tar -xzvf /tmp/bad.tar.gz
I've traced the python tarfile code, and there's a function "nti" where they're converting bytes. It gets to this line:
obj.uid = nti(buf[108:116])
and blows up. The bytes for the UID come through as eight spaces. Not sure where to go from here...
Honestly it looks like the bug is in tarfile.py's nti function:
n = int(nts(s) or "0", 8)
The fall-through logic (or "0") does not kick in because s is a string of spaces, not an empty string, so int() blows up.
I copied tarfile.py from /var/lib/python2.7/ and wrapped that particular line in a try/except, which fixed it for me:
try:
    obj.uid = nti(buf[108:116])
except InvalidHeaderError:
    obj.uid = 0
It's a hack, though. Really I'd prefer that the Python folks took a look at it and fixed the or "0" logic.
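To make the failure mode concrete, here is a minimal sketch of a tolerant field parser. It is not tarfile's actual code: parse_octal_field is a hypothetical helper that treats a blank (all-space or empty) numeric header field as 0 instead of raising.

```python
def parse_octal_field(field: bytes) -> int:
    """Parse a NUL- or space-padded octal tar header field; a blank field counts as 0."""
    # Keep only the part before the first NUL byte, as tarfile's nts() helper does.
    text = field.split(b"\x00", 1)[0].decode("ascii")
    # Stripping spaces first lets the `or "0"` fallback also cover all-space fields.
    return int(text.strip() or "0", 8)


print(parse_octal_field(b"        "))     # the eight-space UID field from the question
print(parse_octal_field(b"0001750\x00"))  # a normally formatted octal field
```

The only difference from the expression quoted above is the strip() before the fallback, which is what the all-space field needs.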
Update
Turns out the tarball was created by the maven-assembly-plugin in a Java 6 project that had just been upgraded to Java 7. The issue was resolved by upgrading the maven-assembly-plugin to 2.5.3.

Java Command Fails in NLTK Stanford POS Tagger

I would appreciate help with the "Java command failed" error that is thrown whenever I try to tag an Arabic corpus of about 2 megabytes. I have searched the web and the Stanford POS tagger mailing list without finding a solution. Posts about similar problems suggest that Java runs out of memory, but I am not sure of that: I still have 19GB of free memory. I tried every solution offered, but the same error keeps showing.
I have an average command of Python and a good command of Linux. I am using Linux Mint 17 KDE 64-bit, Python 3.4, NLTK alpha and the Stanford POS tagger model for Arabic. This is my code:
import nltk
from nltk.tag.stanford import POSTagger
arabic_postagger = POSTagger("/home/mohammed/postagger/models/arabic.tagger", "/home/mohammed/postagger/stanford-postagger.jar", encoding='utf-8')
print("Executing tag_corpus.py...\n")
# Import corpus file
print("Importing data...\n")
file = open("test.txt", 'r', encoding='utf-8').read()
text = file.strip()
print("Tagging the corpus. Please wait...\n")
tagged_corpus = arabic_postagger.tag(nltk.word_tokenize(text))
If the corpus is smaller than 1MB (about 100,000 words), there is no error. But when I try to tag a 2MB corpus, the following error message is shown:
Traceback (most recent call last):
File "/home/mohammed/experiments/current/tag_corpus2.py", line 17, in <module>
tagged_lst = arabic_postagger.tag(nltk.word_tokenize(text))
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 59, in tag
return self.batch_tag([tokens])[0]
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/tag/stanford.py", line 81, in batch_tag
stdout=PIPE, stderr=PIPE)
File "/usr/local/lib/python3.4/dist-packages/nltk-3.0a3-py3.4.egg/nltk/internals.py", line 171, in java
raise OSError('Java command failed!')
OSError: Java command failed!
I intend to tag 300 Million words to be used in my Ph.D. research project. If I keep tagging 100 thousand words at a time, I will have to repeat the task 3000 times. It will kill me!
I really appreciate your kind help.
After your import lines, add this line:
nltk.internals.config_java(options='-Xmx2G')
This raises the maximum heap size that Java allows the Stanford POS Tagger to use. '-Xmx2G' sets the maximum to 2GB instead of the default 512MB. The flag is case-sensitive: it is -Xmx, with a capital X.
See What are the Xms and Xmx parameters when starting JVMs? for more information
If you're interested in how to debug your code, read on.
The command fails when handling a large amount of data, so the first thing to look at is how Java is initialized in NLTK before the Stanford tagger is called. From https://github.com/nltk/nltk/blob/develop/nltk/tag/stanford.py#L19 :
from nltk.internals import find_file, find_jar, config_java, java, _java_options
We see that the nltk.internals module handles the Java configuration and parameters.
Then we take a look at https://github.com/nltk/nltk/blob/develop/nltk/internals.py#L65 and see that no value is set for Java's memory allocation.
In version 3.9.2, the StanfordTagger class constructor accepts a parameter called java_options, which can be used to set the memory for the POSTagger and also the NERTagger.
E.g. pos_tagger = StanfordPOSTagger('models/english-bidirectional-distsim.tagger', path_to_jar='stanford-postagger-3.9.2.jar', java_options='-mx3g')
I found that the answer by @alvas did not work for me, because the StanfordTagger was overriding my memory setting with its built-in default of 1000m. Calling nltk.internals.config_java after initializing StanfordPOSTagger might work, but I haven't tried that.
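Independently of the heap setting, the "100 thousand words at a time" approach from the question does not have to be repeated by hand. The sketch below is only an illustration: tag_in_batches and its tag_batch parameter are hypothetical names, and tag_batch stands for whatever callable does the tagging (for example the tagger's tag method).

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def batches(tokens: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` tokens."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]


def tag_in_batches(
    tokens: Sequence[str],
    tag_batch: Callable[[Sequence[str]], List[Tuple[str, str]]],
    size: int = 100000,
) -> List[Tuple[str, str]]:
    """Tag a long token list chunk by chunk and concatenate the results."""
    tagged: List[Tuple[str, str]] = []
    for chunk in batches(tokens, size):
        tagged.extend(tag_batch(chunk))
    return tagged
```

Splitting at arbitrary token positions can cut a sentence in two, which may affect tags near the boundary; splitting on sentence boundaries would avoid that.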

Why do I see scrambled output when using JSch?

I am trying to use JSch. I tried the example here
Although I can connect the output is weird.
I get the following:
Last login: Thu Jan 31 19:44:25 2013 from 10.2.251.77
[1mcli:~ # [m
And if I do e.g. an ls I get:
[0m[01;34m.InstallAnywhere[0m [00m.bash_history [00m.bash_profile[0m
[01;34mbin[0m [00msles11-patched[0m
[01;34m.kbd[0m [00mindex.html[0m [00mtest.sql[0m
[00m.viminfo[0m [00;31mipvsadm-1.26-1.src.rpm[0m
[m[1mcli:~ # [m
These are the directory contents but why are they displayed like that?
I am running this from Eclipse, and this is what I see in the Eclipse console. If I run it from Windows CMD, it hangs.
Update:
I noticed that if I connect to a different linux the output is fine!
Only if I connect to a specific linux installation I see these weird characters! Any idea what is causing this?
Update2:
Following the link from @PeterMmm, I ran printf "äöü" | xxd. Both the "bad" system and the good one give:
0000000: e4f6 fc
I also did locale.
In the "bad" case:
# locale
LANG=POSIX
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
In the good system:
LANG=POSIX
LC_CTYPE=en_US.UTF-8
LC_NUMERIC="POSIX"
LC_TIME="POSIX"
LC_COLLATE="POSIX"
LC_MONETARY="POSIX"
LC_MESSAGES="POSIX"
LC_PAPER="POSIX"
LC_NAME="POSIX"
LC_ADDRESS="POSIX"
LC_TELEPHONE="POSIX"
LC_MEASUREMENT="POSIX"
LC_IDENTIFICATION="POSIX"
LC_ALL=
Configuration seems to be the same. So what could be causing this?
Please check
Funny Shell Output: [01;32mtestfile.txt[00m instead of testfile.txt
((ChannelShell) channel).setPtyType("dumb");
Does the trick.
They are escape sequences for the terminal emulation. I guess that there is no relation to the character encoding.
Update:
If ChannelShell#setPty(false) is invoked, no pseudo-terminal is allocated and the escape sequences do not appear.
Channel channel=session.openChannel("shell");
((ChannelShell)channel).setPty(false); // !!
...
channel.connect();
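If the pseudo-terminal has to stay enabled, another option is to filter the sequences out of the captured text. The fragments such as [01;34m and [0m in the listing above look like ANSI SGR color codes whose leading ESC character (0x1B) is not visible in the pasted output. A minimal Python sketch of such a filter, with a made-up function name, handling only color/style sequences and not other terminal controls:

```python
import re

# Matches SGR ("Select Graphic Rendition") sequences such as ESC[0m or ESC[01;34m.
SGR_SEQUENCE = re.compile(r"\x1b\[[0-9;]*m")


def strip_sgr(text: str) -> str:
    """Remove ANSI color/style escape sequences from terminal output."""
    return SGR_SEQUENCE.sub("", text)


print(strip_sgr("\x1b[01;34mbin\x1b[0m  \x1b[00msles11-patched\x1b[0m"))
```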

Reed-Solomon encoding and decoding implementation example in Java

I need to encode and decode some text using Reed-Solomon error correction codes. The implementation should be in Java.
I have gone through Sean Owen's implementation classes but was not able to build a working example from them.
Can somebody please post a working example of Reed-Solomon error correction codes, or any reference links?
This is a bit late, but there is a fully working example in Java on GitHub here:
https://github.com/alexbeutel/Error-Correcting-Codes/tree/master/src
It features the following classes:
Decoder.java <== R-S Decoder class
Encoder.java <== R-S Encoder class
ErrorCodesMain.java <== Fully working example
GF257.java <== Galois Fields(257) class
GF28.java <== Galois Fields(2^8) class
To build the project from the command line:
javac ErrorCodesMain.java Decoder.java Encoder.java GF257.java GF28.java
To run it:
java ErrorCodesMain
Here is the program's output:
# of Generators of GF(2^8): 128
# of Generators of GF(257): 128
Generator: 206
Erasures: 38, 1, 7, 15, 28, 16, 29, 28, 7, 8,
OUTPUT FROM O(nk) IN GF(2^8): Hello, my name is Alex Beutel.
FFT OUTPUT DECODED: Hello, my name is Alex Beutel.
OUTPUT FROM O(nk) IN GF(257): Hello, my name is Alex Beutel.

How to apply a patch

I have this patch, which I downloaded from a web article (Calling Matlab from Java):
http://www.cs.virginia.edu/~whitehouse/matlab/JavaMatlab.html
But I do not know how to apply it on my computer, which runs Windows XP.
What I'm trying to do is call a Matlab script file from Java. I have found the necessary source code and everything else, but this matter is holding me back.
Any help is highly appreciated. Thank you.
Here's the patch:
Index: MatlabControl.java
===================================================================
RCS file: /cvsroot/tinyos/tinyos-1.x/tools/java/net/tinyos/matlab/MatlabControl.java,v
retrieving revision 1.3
diff -u -r1.3 MatlabControl.java
--- MatlabControl.java 31 Mar 2004 18:43:50 -0000 1.3
+++ MatlabControl.java 16 Aug 2004 20:36:51 -0000
@@ -214,7 +214,8 @@
matlab.evalConsoleOutput(command);
}else{
- matlab.fevalConsoleOutput(command, args, 0, null);
+ // matlab.fevalConsoleOutput(command, args, 0, null);
+ matlab.fevalConsoleOutput(command, args);
}
} catch (Exception e) {
System.out.println(e.toString());
I'd download the standard UNIX patch tool and use:
patch -p0 <my_patch.diff
You need to apply that patch to the file MatlabControl.java. On Unix, you have the standard patch program to do that, but that of course isn't normally present on Windows.
But looking at the patch file, it's very small and you could easily do the change by hand. Look at the patch file: The lines with a - in the left column must be removed. The lines with a + must be added.
So you must look in MatlabControl.java and remove this line:
matlab.fevalConsoleOutput(command, args, 0, null);
And add these lines:
// matlab.fevalConsoleOutput(command, args, 0, null);
matlab.fevalConsoleOutput(command, args);
In other words, it's a very small and simple change, you just have to remove the last two arguments to the method call to fevalConsoleOutput().
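The manual procedure described above (drop the - lines, put the + lines in their place) can also be sketched in a few lines of Python. This is only an illustration of reading a single hunk, not a replacement for the patch tool; replace_lines is a hypothetical helper and it compares lines with surrounding whitespace ignored.

```python
from typing import List


def replace_lines(source: List[str], removed: List[str], added: List[str]) -> List[str]:
    """Replace the first occurrence of the `removed` lines with the `added` lines."""
    wanted = [line.strip() for line in removed]
    for start in range(len(source) - len(removed) + 1):
        window = [line.strip() for line in source[start:start + len(removed)]]
        if window == wanted:
            return source[:start] + added + source[start + len(removed):]
    raise ValueError("lines to remove were not found")


original = [
    "}else{",
    "    matlab.fevalConsoleOutput(command, args, 0, null);",
    "}",
]
patched = replace_lines(
    original,
    removed=["matlab.fevalConsoleOutput(command, args, 0, null);"],
    added=[
        "    // matlab.fevalConsoleOutput(command, args, 0, null);",
        "    matlab.fevalConsoleOutput(command, args);",
    ],
)
print("\n".join(patched))
```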
If you want the patch command (and lots of other Unix utilities) on Windows, you could download and install Cygwin.
If you use a development tool like Eclipse, you can apply the patch from the context menu: right-click and go to Team -> Apply Patch. It should work.
This patch is so small, you can easily apply it by hand.
So simply open the file MatlabControl.java, find the line prepended with - (the hunk starts at line 214), and change it to match the lines prepended with +.
After that your code should look like:
else{
// matlab.fevalConsoleOutput(command, args, 0, null);
matlab.fevalConsoleOutput(command, args);
}
JMI (Java-to-Matlab Interface)'s Matlab class and its fevalConsoleOutput method are explained here: http://UndocumentedMatlab.com/blog/jmi-java-to-matlab-interface/
With TortoiseSVN, you can apply the patch as follows: click Apply patch and browse to the patch file.
