Gujarati text in Java String - java

I have Gujarati Bible and trying to insert each verse in MySQL database using parser written in Java. When I assign Gujarati text to Java String variable it shows junks in debug.
E.g. This is my Gujarati text
હે યહોવા તું મારો દેવ છે;
I assign it to Java String variable as shown below
verse._verseText = "હે યહોવા તું મારો દેવ છે;";
What i see in debug window is all junk characters. Any help is appreciated. If need more information let me know and I will provide as and when asked.
UPDATE
Pasting my parser code here
private Boolean Insert(String _text)
{
BibleVerse verse = new BibleVerse();
String[] data = _text.split("\\|");
try
{
if (data[0].equals(bookName) || bookName.equals("All"))
{
verse._Version = "Gujarati";
verse._book = data[0];
verse._chapter = Integer.parseInt(data[1]);
verse._verse = Integer.parseInt(data[2]);
verse._verseText = new String(data[3].getBytes(), "UTF-8");
_bibleDatabase.Insert(verse);
pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - INSERTED.");
}
else
{
pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - SKIPPED.");
}
return true;
}
catch(Exception e)
{
pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
return false;
}
}
Here is the sample line from the text file
Isaiah|25|1|હે યહોવા તું મારો દેવ છે; હું તને મોટો માનીશ, હું તારા નામની સ્તુતિ કરીશ; કેમકે તેં અદભુત કાર્યો કર્યાં છે, તેં વિશ્વાસુપણે તથા સત્યતાથી પુરાતન સંકલ્પો પાર પાડ્યા છે.
UPDATE
Here is the code where I open & read file.
try
{
FileReader _file = new FileReader(this._filename);
_bufferedReader = new BufferedReader(_file);
SwingWorker parseWorker = new SwingWorker()
{
#Override
protected Object doInBackground() throws Exception
{
String line;
String[] data;
int lineno=0;
BibleVerse verse = new BibleVerse();
while ((line = _bufferedReader.readLine()) != null)
{
++lineno;
pcs.firePropertyChange("pgbupdate", null, lineno);
Insert(line);
}
_bufferedReader.close();
return null;
}
#Override
protected void done()
{
pcs.firePropertyChange("logupdate", null, "Parsing complete.");
}
};
parseWorker.execute();
}
catch (Exception e)
{
pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
}

The problem is this:
FileReader _file = new FileReader(this._filename);
This reads the file using the platform's default charset. If your data file is not encoded in that charset, you will get incorrect characters.
On Windows, the default charset is almost always UTF-16LE. On most other systems, it's UTF-8.
The easiest solution is to find out the actual encoding of your data file, so you can specify it explicitly in the code. The encoding of a file can be determined with the file command on Unix and Linux systems. In Windows, you may need to examine it with a binary editor, or install something like Cygwin, which has a file command of its own.
Once you know what it is, you should pass it explicitly to the construction of your Reader:
// Replace "UTF-8" with the actual encoding of your data file (if it's not UTF-8).
Reader _file = new InputStreamReader(new FileInputStream(this._filename), "UTF-8");
Once you've done that, there is no reason for any other part of your code to concern itself with bytes. You should replace this:
verse._verseText = new String(data[3].getBytes(), "UTF-8");
with this:
verse._verseText = data[3];

how to inject chinese characters using javascript?
not quite the same problem, but I think the same solution may work in this case.
If the script is inline (in the HTML file), then it's using the
encoding of the HTML file and you won't have an issue.
If the script is loaded from another file:
Your text editor must save the file in an appropriate encoding such as
utf-8 (it's probably doing this already if you're able to save it,
close it, and reopen it with the characters still displaying
correctly) Your web server must serve the file with the right http
header specifying that it's utf-8 (or whatever the enocding happens to
be, as determined by your text editor settings). Here's an example for
how to do this with php: Set http header to utf-8 php If you can't
have your webserver do this, try to set the charset attribute on your
script tag (e.g. > I tried to see what the spec said should happen
in the case of mismatching charsets defined by the tag and the http
headers, but couldn't find anything concrete, so just test and see if
it helps. If that doesn't work, place your script inline

It looks like if you want to store Gujarati text in Java string, you need to use unicode characters. See this: http://jrgraphix.net/r/Unicode/0A80-0AFF
So for example the first Gujarati character:
char example = '0A80';
String result = Character.toString((char)example);

Related

Problem with input from user saved to file by RandomAccessFile methods

I've got a problem with input from user. I need to save input from user into binary file and when I read it and show it on the screen it isn't working properly. I dont want to put few hundreds of lines, so I will try to dexcribe it in more compact form. And encoding in NetBeans in properties of project is "UTF-8"
I got input from user, in NetBeans console or cmd console. Then I save it to object made up of strings, then add it to ArrayList<Ksiazka> where Ksiazka is my class (basically a book's properties). Then I save whole ArrayList object to file baza.bin. I do it by looping through whole list of objects of class Ksiazka, taking each String one by one and saving it into file baza.bin using method writeUTF(oneOfStrings). When I try to read file baza.bin I see question marks instead of special characters (ą, ć, ę, ł, ń, ó, ś, ź). I think there is a problem in difference in encoding of file and input data, but to be honest I don't have any idea ho to solve that.
Those are attributes of my class Ksiazka:
private String id;
private String tytul;
private String autor;
private String rok;
private String wydawnictwo;
private String gatunek;
private String opis;
private String ktoWypozyczyl;
private String kiedyWypozyczona;
private String kiedyDoOddania;
This is method for reading data from user:
static String podajDana(String[] tab, int coPokazac){
System.out.print(tab[coPokazac]);
boolean podawajDalej = true;
String linia = "";
Scanner klawiatura = new Scanner(System.in, "utf-8");
do{
try {
podawajDalej = false;
linia = klawiatura.nextLine();
}
catch(NoSuchElementException e){
System.err.println("Wystąpił błąd w czasie podawania wartości!"
+ " Spróbuj jeszcze raz!");
}
catch(IllegalStateException e){
System.err.println("Wewnętrzny błąd programu typu 2! Zgłoś to jak najszybciej"
+ " razem z tą wiadomością");
}
}while(podawajDalej);
return linia;
}
String[] tab is just array of strings I want to be able to show on the screen, each set (array) has its own function, int coPokazac is number of line from an array I want to show.
and this one saves all data from ArrayList<Ksiazka> to file baza.bin:
static void zapiszZmiany(ArrayList<Ksiazka> bazaKsiazek){
try{
RandomAccessFile plik = new RandomAccessFile("baza.bin","rw");
for(int i = 0; i < bazaKsiazek.size(); i++){
plik.writeUTF(bazaKsiazek.get(i).zwrocId());
plik.writeUTF(bazaKsiazek.get(i).zwrocTytul());
plik.writeUTF(bazaKsiazek.get(i).zwrocAutor());
plik.writeUTF(bazaKsiazek.get(i).zwrocRok());
plik.writeUTF(bazaKsiazek.get(i).zwrocWydawnictwo());
plik.writeUTF(bazaKsiazek.get(i).zwrocGatunek());
plik.writeUTF(bazaKsiazek.get(i).zwrocOpis());
plik.writeUTF(bazaKsiazek.get(i).zwrocKtoWypozyczyl());
plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyWypozyczona());
plik.writeUTF(bazaKsiazek.get(i).zwrocKiedyDoOddania());
}
plik.close();
}
catch (FileNotFoundException ex){
System.err.println("Nie znaleziono pliku z bazą książek!");
}
catch (IOException ex){
System.err.println("Błąd zapisu bądź odczytu pliku!");
}
}
I think that there is a problem in one of those two methods (either I do something wrong while reading it or something wrong when it is saving data to file using writeUTF()) but even tho I tried few things to solve it, none of them worked.
After quick talk with lecturer I got information that I can use at most JDK 8.
You are using different techniques for reading and writing, and they are not compatible.
Despite the name, the writeUTF method of RandomAccessFile does not write a UTF-8 string. From the documentation:
Writes a string to the file using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the file, starting at the current file pointer, as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for each character.
writeUTF will write a two-byte length, then write the string as UTF-8, except that '\u0000' characters are written as two UTF-8 bytes and supplementary characters are written as two UTF-8 encoded surrogates, rather than single UTF-8 codepoint sequences.
On the other hand, you are trying to read that data using new Scanner(System.in, "utf-8") and klawiatura.nextLine();. This approach is not compatible because:
The text was not written as a true UTF-8 sequence.
Before the text was written, two bytes indicating its numeric length were written. They are not readable text.
writeUTF does not write a newline. It does not write any terminating sequence at all, in fact.
The best solution is to remove all usage of RandomAccessFile and replace it with a Writer:
Writer plik = new FileWriter(new File("baza.bin"), StandardCharsets.UTF_8);
for (int i = 0; i < bazaKsiazek.size(); i++) {
plik.write(bazaKsiazek.get(i).zwrocId());
plik.write('\n');
plik.write(bazaKsiazek.get(i).zwrocTytul());
plik.write('\n');
// ...

Image loads as base64 string when send via http/1.0?

I am trying to implement http/1.0 in a project with a website that's loaded with a serversocket i've coded. It works fine with character based files. But with image files that i've specified to return the base64 encoded version of the image doesn't work even though the right headers are set such as content-type: image/png and content-transfer-encoding: base64 RFC 2045. I've tried to look at the packets from chrome's networking tool and it looks like it's treating it as a document event though it's an image file. I have no clue whatsoever to do since i've been stuck on this issue for a couple of DAYS! I've searched all of stackoverflow, all of google and i am basically stuck.
I posted this question a day or 2 ago where it was recommended to use a byte reader (which i've also tried) without luck. Any visual inputs are of great appreciation.
I have 2 methods that are relevant.
The first one is the one where i choose the way to read the file depending on if it's an image or text.
public String readUri(String reqUri) {
returnFile = "";
if (this.fileExists(reqUri)) {
fileType = this.fileType(reqUri); // returns e.g image from image/png
if (fileType.equals("text")) {
// bufferedreader ...
} else if (fileType.equals("image")) {
File imgPath = new File(reqUri);
try {
FileInputStream fileInputStreamReader = new FileInputStream(imgPath);
byte[] bytes = new byte[(int)imgPath.length()];
fileInputStreamReader.read(bytes);
returnFile = Base64.getEncoder().encodeToString(bytes);
fileInputStreamReader.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return returnFile;
}
The second one collects this data from the above method. This method is called in my get request controller and sends back the data to the client through the serversocket.
StringBuilder response = new StringBuilder();
public String response(
String HTTPVersion, int statusCode, String fileContent, String contentType) {
response.append(
HTTPVersion + " " +
statusCode + " " +
this.getHTTPStatusText(statusCode) + "\n"
);
response.append("Content-transfer-encoding: BASE64");
response.append("Content-Type: " + contentType + "\n");
response.append("content-length: " + fileContent.length() + "\n");
response.append("Date: " + date() + "\n");
response.append("\n");
response.append(fileContent + "\n");
return response.toString();
}
Here is a request/response from chromes networking tool:
This is how the image is currently loaded with the base64 encoding:
HTTP IS NOT MIME
RFC 2045 is MIME, and although HTTP is similar in some respects to MIME, it is not MIME, and it differs in other respects. In particular it DOES NOT USE Content-Transfer-Encoding. It DOES USE Content-Encoding with a similar meaning. See https://www.rfc-editor.org/rfc/rfc1945#section-10.3 and https://www.rfc-editor.org/rfc/rfc1945#appendix-C.3 et seq.
Also, you are terminating the lines of the response header with only Java \n which is LF. The standards call for CR LF (Java \r\n) and always have. Some receivers are tolerant, following Postel's dictum, but you shouldn't rely on that. And worse your code doesn't appear to terminate the CTE line at all, although since Chrome parsed it okay I'm guessing you just posted the wrong code. Also you should NOT add a line terminator after the body that isn't counted in Content-Length, although if you are using original HTTP/1.0, i.e. without keepalive, this won't matter, because there can't be another request and response on the same transport connection.

Which encoding for ProcessBuilder parameters

Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send e.g. Arabic characters to Def program through the parameters. I have control of Abc code, but not of Def.
Using the normal way of ProcessBuilder without any playing with the encoding, it was mentioned here, it is not possible. Def recieves question marks "?????".
However, I am able to get some result, but different encodings can be used for different scenarios.
E.g. I am trying all encodings to send to the recipient, and comparing the result of what is expected.
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how to programmatically know which correct encoding to use, since I need to have something universal?
In other words, what is the relation between the default charset and the found ones?
public class Abc {
private static final Path PATH = Paths.get("."); // With maven: ./target/classes
public static void main(String[] args) throws Exception {
var string = "hello أحمد";
var bytes = string.getBytes();
System.out.println("Original string: " + string);
System.out.println("Default charset: " + Charset.defaultCharset());
for (var c : Charset.availableCharsets().values()) {
var newString = new String(bytes, c);
var process = new ProcessBuilder().command("java", "-cp",
PATH.toAbsolutePath().toString(),
"Def", newString).start();
process.waitFor();
var output = asString(process.getInputStream());
if (output.contains(string)) {
System.out.println("Found " + c + " " + output);
}
}
}
private static String asString(InputStream is) throws IOException {
try (var reader = new BufferedReader(new InputStreamReader(is))) {
var builder = new StringBuilder();
String line;
while ((line = reader.readLine()) != null) {
if (builder.length() != 0) {
builder.append(System.lineSeparator());
}
builder.append(line);
}
return builder.toString();
}
}
}
public class Def {
public static void main(String[] args) {
System.out.println(args[0]);
}
}
Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the java method that ends up turning characters into bytes to have an overload that lets you specify charset, but, for whatever reason, it does not exist here.
How it should work is thusly:
You pass a string to ProcessBuilder
PB will turn that string into bytes using Charset.defaultCharset() (why? Because PB is all about making the OS do things, and the default charset reflects the OS's preferred charset).
These bytes are then fed to the process.
The process starts up. If it is java, and we're talking the args in psv main(String[] args), the same is done in reverse: Java takes the bytes and turns them back to characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the provided charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.
One possibility is to convert your String to Unicode sequences string and then pass it to another process and there convert it back to a regular String. String of Unicode sequences will always contain ASCI characters only. Here is how it may look like:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد"));
The result will be that String encode will hold this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
Class StringUnicodeEncoderDecoder could be found in an Open Source library called MgntUtils. You can get this library as Maven Artifact or get it on Github (including source code and JavaDoc). JavaDoc online is available here
This library and this particular feature is used and well tested by multiple users.
Disclamer: This library is written by me

Decode alfresco file name or replace unicode[_x0020_] characters in String/fileName

I am using alfresco download upload services using java.
When I upload the file to alfreco server it gives me the following path :
/app:Home/cm:Company_x0020_Home/cm:Abc/cm:TestFile/cm:V4/cm:BC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
When I use the same file path and download using alfresco services I took the file name at the end of the path
i.e ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
How can I remove or decode the [Unicode] characters in fileName
String decoded = URLDecoder.decode(queryString, "UTF-8");
The above does not work .
These are some Unicode characters which appeared in my file name.
https://en.wikipedia.org/wiki/List_of_Unicode_characters
Please do not mark the question as duplicate as I have searched below links but non of those gave the solution.
Following are the links that I have searched for replacing unicode charectors in String with java.
Java removing unicode characters
Remove non-ASCII characters from String in Java
How can I replace a unicode character in java string
Java Replace Unicode Characters in a String
The solution given by Jeff Potts will be perfect .
But i had a situation where i was using file name in diffrent project where i wont use org.alfresco related jars
I had to take all those dependencies to use for a simple file decoding
So i used java native methods which uses regex to parse the file name and decode it,which gave me the perfect solution which was same from using
ISO9075.decode(test);
This is the code which can be used
public String decode_FileName(String fileName) {
System.out.println("fileName : " + fileName);
String decodedfileName = fileName;
String temp = "";
Matcher m = Pattern.compile("\\_x(.*?)\\_").matcher(decodedfileName); //rejex which matches _x0020_ kind of charectors
List<String> unicodeChars = new ArrayList<String>();
while (m.find()) {
unicodeChars.add(m.group(1));
}
for (int i = 0; i < unicodeChars.size(); i++) {
temp = unicodeChars.get(i);
if (isInteger(temp)) {
String replace_char = String.valueOf(((char) Integer.parseInt(String.valueOf(temp), 16)));//converting
decodedfileName = decodedfileName.replace("_x" + temp + "_", replace_char);
}
}
System.out.println("Decoded FileName :" + decodedfileName);
return decodedfileName;
}
And use this small java util to know Is integer
public static boolean isInteger(String s) {
try {
Integer.parseInt(s);
} catch (NumberFormatException e) {
return false;
} catch (NullPointerException e) {
return false;
}
return true;
}
So the above code works as simple as this :
Example :
0028 Left parenthesis U+0028 You can see in the link
https://en.wikipedia.org/wiki/List_of_Unicode_characters
String replace_char = String.valueOf(((char) Integer.parseInt(String.valueOf("0028"), 16)));
System.out.println(replace_char);
This code gives output : ( which is a Left parenthesis
This is what the logic i have used in my java program.
The above program will give results same as ISO9075.decode(test)
Output :
fileName : ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
Decoded FileName :ABC1X 0400 0109-(1-2)_v2.pdf
In the org.alfresco.util package you will find a class called ISO9075. You can use it to encode and decode strings according to that spec. For example:
String test = "ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf";
String out = ISO9075.decode(test);
System.out.println(out);
Returns:
ABC1X 0400 0109-(1-2)_v2.pdf
If you want to see what it does behind the scenes, look at the source.

How to get rid of "Rogue Chars" in an .txt encoded under UTF-8

My program is reading from a .txt encoded with UTF-8. The reason why I'm using UTF-8 is to handle the characters åäö. The problem I come across is when the lines are read is that there seems to be some "rogue" characters sneaking in to the string which causes problems when I'm trying to store those lines into variables. Here's the code:
public void Läsochlista()
{
String Content = "";
String[] Argument = new String[50];
int index = 0;
Log.d("steg1", "steg1");
try{
InputStream inputstream = openFileInput("text.txt");
if(inputstream != null)
{
Log.d("steg2", "steg2");
//InputStreamReader inputstreamreader = new InputStreamReader(inputstream);
//BufferedReader bufferreader = new BufferedReader(inputstreamreader);
BufferedReader in = new BufferedReader(new InputStreamReader(inputstream, "UTF-8"));
String reciveString = "";
StringBuilder stringbuilder = new StringBuilder();
while ((reciveString = in.readLine()) != null)
{
Argument[index] = reciveString;
index++;
if(index == 6)
{
Log.d(Argument[0], String.valueOf((Argument[0].length())));
AllaPlatser.add(new Platser(Float.parseFloat(Argument[0]), Float.parseFloat(Argument[1]), Integer.parseInt(Argument[2]), Argument[3], Argument[4], Integer.parseInt(Argument[5])));
Log.d("En ny plats skapades", Argument[3]);
Arrays.fill(Argument, null);
index = 0;
}
}
inputstream.close();
Content = stringbuilder.toString();
}
}
catch (FileNotFoundException e){
Log.e("Filen", " Hittades inte");
} catch (IOException e){
Log.e("Filen", " Ej läsbar");
}
}
Now, I'm getting the error
Invalid float: "61.193521"
where the line only contains the chars "61.193521". When i print out the length of the string as read within the program, the output shows "10" which is one more character than the string is supposed to contain. The question; How do i get rid of those invisible "Rouge" chars? and why are they there in the first place?
When you save a file as "UTF-8", your editor may be writing a byte-order mark (BOM) at the beginning of the file.
See if there's an option in your editor to save UTF-8 without the BOM.
Apparently the BOM is just a pain in the butt: What's different between UTF-8 and UTF-8 without BOM?
I know you want to be able to have extended characters in your data; however, you may want to pick a different encoding like Latin-1 (ISO 8859-1).
Or you can just read & discard the first three bytes from the input stream before you wrap it with the reader.
Unfortunately you have not provided the sample text file so testing with your code exactly is not possible and here is the theoretical answer based on guess, what could have been the reasons:
Looks like it is BOM related issue and you may have to treat this. Some related detail is given here: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
And some information here: What is XML BOM and how do I detect it?
Basically there are various situation:
In one of the situation we face issues when we don't read and write using correct encoding.
In another situation we use an editor or reader which doesn't support UTF-8
Third is when we are using correct encoding for reading and writing, we are not facing issue in a text editor but facing issue in some other application or program. I think your issues is related to third case.
In third situation we may have to remove the BOM using a program or deal with it according to our context.
Here is some solution I guess you may find interesting:
UTF-8 file reading: the first character issue
You can use code given in this threads answer or use apache commons to deal with it:
Byte order mark screws up file reading in Java

Categories

Resources