Java - Character conversion from text file (UTF-16 to UTF-8) - java

I am new to Java. I want Java code to convert a text file coming from Unix into a text file that goes to a Linux server, i.e. a character conversion from UTF-16 to UTF-8. The text file goes through an Oracle database before it reaches the Linux server. I need this conversion because some special symbols are getting converted to garbage values. Please help, Java experts :)
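A minimal sketch of such a conversion, assuming the input really is UTF-16 (the file paths are placeholders; StandardCharsets.UTF_16 honours a byte-order mark and defaults to big-endian without one, so use UTF_16LE if your files are little-endian and unmarked):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class Utf16ToUtf8 {
    public static void main(String[] args) throws IOException {
        // input.txt / output.txt are placeholder paths
        try (BufferedReader reader = Files.newBufferedReader(
                Paths.get("input.txt"), StandardCharsets.UTF_16);
             BufferedWriter writer = Files.newBufferedWriter(
                Paths.get("output.txt"), StandardCharsets.UTF_8)) {
            char[] buf = new char[8192];
            int n;
            // copy characters; the charsets on the reader and writer do the conversion
            while ((n = reader.read(buf)) != -1) {
                writer.write(buf, 0, n);
            }
        }
    }
}

Note that if the garbage values only appear after the file has passed through the Oracle database, the database character set is the more likely culprit, and a file-level conversion alone won't fix it.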

Related

Copy from Microsoft Outlook to JTextArea in Java results in illegal character

I have a Java Swing application for a client that works internationally. A few times per year a piece of text is copied from a Microsoft application like Outlook into a JTextArea, and then stored in a UTF-8 database. Upon retrieval, the database throws an SQL exception that it can't decode some character.
I'm trying to understand the copy-paste part of the process. Is there any information about how copying from Windows to Java works exactly? I can't find any.
Windows is set up to use CP1252, but the text in Outlook definitely contains non-CP1252 characters, so the copied data has some encoding. And when that is pasted into the JTextArea, does Java transcode it to UTF-16 (its internal string encoding)?
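There doesn't seem to be official documentation of the exact transcoding chain, but you can at least observe what Java receives: AWT delivers clipboard text as a java.lang.String, i.e. already UTF-16 chars. A small probe, using only standard AWT calls (nothing here is specific to the original application):

import java.awt.Toolkit;
import java.awt.datatransfer.Clipboard;
import java.awt.datatransfer.DataFlavor;

public class ClipboardProbe {
    public static void main(String[] args) throws Exception {
        Clipboard cb = Toolkit.getDefaultToolkit().getSystemClipboard();
        // Java always delivers text as a String (UTF-16 chars),
        // whatever the native clipboard format was
        String text = (String) cb.getData(DataFlavor.stringFlavor);
        for (char c : text.toCharArray()) {
            System.out.printf("U+%04X %c%n", (int) c, c);
        }
    }
}

If the characters print correctly here, the paste itself is fine and the corruption is more likely happening when the String is encoded to bytes on the way to the database.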

Decoding binary files stored inside a database after being uploaded from a browser

In migrating from a CMS that stored files in the database over to a system that stores them in AWS S3, I can't seem to find any options other than reverse-engineering the format from the Java code (the old system) and implementing it all myself from scratch in Python, using either the Java source or RFC 1867 as a reference.
I have database dumps containing long strings of encoded files.
I'm not 100% clear which binary file upload encoding has been used, but there is consistency between the first characters of each file type.
UEsDBBQA is the first 8 characters in a large number of the DOCX files, and UEsDBBQABgAIAAAA is the first 16 characters in more than 75% of the DOCX files.
JVBERi0xLj is the first 10 characters of many of the PDF files.
Every web application framework that allows file uploads has to decode these... so it's a known problem. But I cannot find a way to decode these strings with either Python (my language of choice) or some kind of command-line decoding tool...
file doesn't recognise them.
hachoir doesn't recognise them.
Are there any simple tools I can just install? I don't care if they are in C, Perl, Python, Ruby, JavaScript or Malbolge; I just want a tool that can take the encoded string as input (file, stdin, I don't care) and output the decoded original files.
Or am I overthinking the algorithm to decode these files? Is it simpler than it looks, and can someone show me how to decode them using pure Python?
The most commonly used encoding for representing binary data as text is Base64. I just did a quick test on a PDF file in Java and got exactly the same header character sequence when Base64-encoding it.
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.bind.DatatypeConverter;
byte[] bytes = Files.readAllBytes(Paths.get("/test/test.pdf"));
String base64 = DatatypeConverter.printBase64Binary(bytes);
System.out.println(base64.substring(0, 10)); // JVBERi0xLj
So, you're most likely looking for a Base64 decoder.
I don't do Python, so here's a Google search suggestion and the first Stack Overflow link which appeared in the search results to date: Python base64 data decode.
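Decoding those prefixes supports this: UEsDBB decodes to PK\x03\x04, the ZIP local-file header that DOCX uses, and JVBERi0x decodes to %PDF-1. Staying in Java rather than Python, the reverse step is a single call with java.util.Base64 (Java 8+; the file names below are placeholders):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class DecodeDump {
    public static void main(String[] args) throws Exception {
        // dump.b64 / out.docx are placeholder names for one dumped column value
        String base64 = new String(
                Files.readAllBytes(Paths.get("dump.b64")),
                StandardCharsets.US_ASCII).trim();
        byte[] original = Base64.getDecoder().decode(base64);
        Files.write(Paths.get("out.docx"), original);
    }
}

If the dumped strings contain line breaks, use Base64.getMimeDecoder() instead, which skips them.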

Java Unicode conversion on Linux not working on Mac OS X

I am writing a Java application on Ubuntu Linux that reads in a text file and creates an XML file from the data. Some of the text contains curly apostrophes and quotes, which I convert to straight apostrophes and quotes using the following code:
dataLine = dataLine.replaceAll( "[\u2018|\u2019]", "\u0027" ).replaceAll( "[\u201C|\u201D]", "\u005c\u0022" );
This works fine, but when I port the jar file to a Mac OS X machine, I get three question marks where I should get straight apostrophes and quotes. I created a test application on the Mac using the same line of code for the conversion and the same test file for input, and it worked fine. Why doesn't the jar file created on the Linux machine work correctly on a Mac? I thought Java was supposed to be cross-platform compatible.
Chances are you're not reading the file correctly to start with. You haven't shown how you're reading it, but my guess is that you're just using a FileReader, or an InputStreamReader without specifying the encoding. In that case the default platform encoding is used, and if that's not the actual encoding of the file, you won't be reading the right characters. You should be able to detect that without doing any replacement at all.
Instead, you should use a FileInputStream and wrap it in an InputStreamReader with the correct encoding - which is likely to be UTF-8 as it's XML. (You should be able to check this easily.)
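A sketch of that setup, with the question's replacement folded in (the file name is a placeholder; substitute the file's actual encoding if it turns out not to be UTF-8):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadWithEncoding {
    public static void main(String[] args) throws IOException {
        // data.txt is a placeholder for the input file
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new FileInputStream("data.txt"), StandardCharsets.UTF_8))) {
            String dataLine;
            while ((dataLine = in.readLine()) != null) {
                // same replacement as in the question, now fed correct characters
                dataLine = dataLine.replaceAll("[\u2018\u2019]", "'")
                                   .replaceAll("[\u201C\u201D]", "\"");
            }
        }
    }
}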

International characters with Java

I am building an app that takes information from Java and builds an Excel spreadsheet. Some of the information contains international characters. The issue is that Russian characters, for example, render correctly in Java, but when I send them to Excel they are not rendered properly. I initially thought it was an encoding problem, but I now think the problem is simply not having the Russian language pack loaded on my Windows 7 machine.
I need to know if there is a way for a Java application to "force" Excel to show international characters.
Thanks
Check the file encoding you're using if characters don't show up. Java defaults to the platform-native encoding (windows-1252 on Windows) instead of UTF-8. You can explicitly set your writers to use UTF-8 or a Cyrillic encoding.
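What to do about it depends on how the data gets to Excel, which the question doesn't say. If you're writing a CSV for Excel to open, one common trick is to start the file with a byte-order mark so Excel picks UTF-8 instead of the platform code page; a sketch with a made-up file name and sample data:

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteCsvForExcel {
    public static void main(String[] args) throws IOException {
        // report.csv and the Russian rows are placeholders
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("report.csv"), StandardCharsets.UTF_8)) {
            out.write('\uFEFF'); // BOM: lets Excel detect UTF-8
            out.write("Имя,Город\n");
            out.write("Иван,Москва\n");
        }
    }
}

If you're generating real .xlsx files through a library such as Apache POI, encoding is handled inside the format, and question marks there point at a missing font or language pack instead.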

Issue with encoding UTF-8 when FTPing files

I am able to have my application upload files via FTP using the FTPClient Java library.
(I happen to be uploading to an Oracle XML DB repository.)
Everything uploads fine unless the XML file has curly quotes in it, in which case I get the error:
LPX-00200: could not convert from encoding UTF-8 to UCS2
I can upload what I believe to be the same file using the Windows command-line FTP tool. I am wondering if there is some encoding setting the Windows tool uses that I need to set in my Java code as well.
Anyone know stuff about this? Thanks!!
I don't know that application, but you could try using -Dfile.encoding=UTF-8 on your JVM command line.
Not familiar with Oracle XML DB repositories—can they accept compressed uploads? Zipping or gzipping your file would save resources and frustrate any ASCII file type autodetection in use.
In binary mode this problem goes away.
client.setType(FTPClient.TYPE_BINARY);
http://www.sauronsoftware.it/projects/ftp4j/manual.php#3
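For context, the same call in a complete upload, following the ftp4j manual linked above (host, credentials and file name are placeholders):

import it.sauronsoftware.ftp4j.FTPClient;
import java.io.File;

public class BinaryFtpUpload {
    public static void main(String[] args) throws Exception {
        FTPClient client = new FTPClient();
        client.connect("ftp.example.com"); // placeholder host
        client.login("user", "password");  // placeholder credentials
        client.setType(FTPClient.TYPE_BINARY); // no character translation in transfer
        client.upload(new File("document.xml"));
        client.disconnect(true);
    }
}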
If your file contains curly quotes, in the windows-1252 character set (iso-8859-1 has no curly quotes) they are encoded as single bytes with the high-order bit set. In UTF-8, those same characters take three bytes each.
It's quite possible that you've accidentally encoded the XML file in windows-1252 instead of UTF-8. That would result in a conversion error, because a byte with its high-order bit set is only valid inside a multi-byte UTF-8 sequence.
On Windows, open the file in Notepad and try re-saving the document using Save As... with the UTF-8 encoding, then upload the changed file. On Unix, use iconv or a similar tool to convert from iso-8859-1 to UTF-8 before uploading.
If the XML document explicitly declares its encoding, make sure it declares the correct one (e.g. UTF-8). Many XML parsers can parse iso-8859-1 or windows-1252 encoded XML as long as it's declared as such.
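And if you'd rather do the iconv step in Java as part of your upload pipeline, a transcoding sketch (this assumes the file really is windows-1252; the paths are placeholders):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class TranscodeToUtf8 {
    public static void main(String[] args) throws Exception {
        // in.xml / out.xml are placeholder paths
        byte[] raw = Files.readAllBytes(Paths.get("in.xml"));
        String text = new String(raw, Charset.forName("windows-1252"));
        Files.write(Paths.get("out.xml"), text.getBytes(StandardCharsets.UTF_8));
    }
}

If the XML prolog names the old encoding, update that declaration too after converting.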
