Reading UTF-8 encoded text from InputStream - java

I'm having problems reading all Japanese/Chinese characters from an input stream.
Basically, I'm retrieving a JSON object from an API.
Below is my code:
try {
    URL url = new URL(string);
    BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    result = br.readLine();
    br.close();
} catch (Exception e) {
}
For some reason, not all characters are read by the input stream. What could be the problem?
To be specific, some characters appear correctly when I print them to the console, while others appear as black boxes with question marks. There are no such boxes when I view the actual JSON object in a browser.

What you see when "printing to a console" really has nothing to do with whether data was read or not, but has everything to do with the capabilities of your console.
If you are fetching data from a URL, and you know for sure that the bytes you have fetched represent UTF-8 encoded text, and the entire data fits on one line of text, then there is no reason why your code should not work.
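In case the response ever spans more than one line, here is a minimal sketch that accumulates all lines (using the same url as in the question):
StringBuilder sb = new StringBuilder();
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line).append('\n');
    }
}
String result = sb.toString();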
It sounds like you are not sure things work because you are trying to print text to your console. Perhaps your console is not set to render UTF-8 encoded text? Perhaps your console font does not have enough glyphs to cover the characters?
Here are two things you can try:
Instead of writing the text to your console, save it to a file (see the sketch after this list). Then use a command like hexdump -C (on a *nix system; I don't know the Windows equivalent) and inspect the binary representation to make sure all your expected characters are there.
Save your data to a text file, then open it in a web browser, since browsers probably have much richer font support than a console.
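For both suggestions, a minimal sketch that writes the retrieved string out as UTF-8 bytes (the path is just an example), ready for hexdump -C or a browser:
// Requires java.nio.file.Files and java.nio.file.Paths.
Files.write(Paths.get("/tmp/payload.json"),
        result.getBytes(StandardCharsets.UTF_8));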
If you still suspect you've read the remote data incorrectly, you can run your retrieved text through a JSON validator, just to make sure.

Try "ISO-8859-1" as the charset instead.

Related

Characters altered by Lotus when receiving a POST through a Java WebAgent with OpenURL command

I have a Java WebAgent in Lotus-Domino which runs through the OpenURL command (https://link.com/db.nsf/agentName?openagent). This agent is created for receiving a POST with XML content. Before even parsing or saving the (XML) content, the webagent saves the content into an in-memory document:
For an agent run from a browser with the OpenAgent URL command, the
in-memory document is a new document containing an item for each CGI
(Common Gateway Interface) variable supported by Domino®. Each item
has the name and current value of a supported CGI variable. (No design
work on your part is needed; the CGI variables are available
automatically.)
https://www.ibm.com/support/knowledgecenter/en/SSVRGU_9.0.1/basic/H_DOCUMENTCONTEXT_PROPERTY_JAVA.html
The content of the POST will be saved (by Lotus) into the request_content field. When receiving content with this character: é, like:
<Name xml:lang="en">tést</Name>
The é is changed by Lotus to a ?®. This is also what I see when reading out the request_content field in the document properties. Is it possible to have Lotus save the é as an é and not as ?®?
Solution:
The way I fixed it is via this post:
Link which helped me solve this problem
Here is the solution in Java:
/****** INITIALIZATION ******/
session = getSession();
AgentContext agentContext = session.getAgentContext();
Stream stream = session.createStream();

// Write the REQUEST_CONTENT string out declared as LMBCS, which puts
// the raw bytes Lotus received back on disk...
stream.open("C:\\Temp\\test.txt", "LMBCS");
stream.writeText(agentContext.getDocumentContext().getItemValueString("REQUEST_CONTENT"));
stream.close();

// ...then read those same bytes back in as the UTF-8 they really are.
stream.open("C:\\Temp\\test.txt", "UTF-8");
String content = stream.readText();
stream.close();

System.out.println("Content: " + content);
I've dealt with this before, but I no longer have access to the code so I'm going to have to work from memory.
This looks like a UTF-8 vs UTF-16 issue, but there are up to five charsets that can come into play: the charset used in the code that does the POST, the charset of the JVM the agent runs in, the charset of the Domino server code, the charset of the NSF - which is always LMBCS, and the charset of the Domino server's host OS.
If I recall correctly, REQUEST_CONTENT is treated as raw data, not character data. To get it right, you have to handle the conversion of REQUEST_CONTENT yourself.
The Notes API calls that you use to save data in the Java agent will automatically convert from Unicode to LMBCS and vice versa, but this only works if Java has interpreted the incoming data stream correctly. I think in most cases, the JVM running under Domino is configured for UTF-16 - though that may not be the case. (I recall some issue with a server in Japan, and one of the charsets that came into play was one of the JIS standard charsets, but I don't recall if that was in the JVM.)
So if I recall correctly, you need to take the REQUEST_CONTENT string, get its raw bytes with getBytes("UTF-8"), and then construct a new String from that byte array using new String(bytes, "UTF-16"). That's assuming the JVM really is configured for UTF-16. Then pass that string to NotesDocument.ReplaceItemValue() so the Notes API calls interpret it correctly.
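A rough sketch of that double conversion, hedged accordingly since the above is from memory (the target item name is hypothetical):
Document doc = agentContext.getDocumentContext();
String raw = doc.getItemValueString("REQUEST_CONTENT");
// Recover the raw bytes, then reinterpret them as UTF-16.
byte[] bytes = raw.getBytes(Charset.forName("UTF-8"));
String decoded = new String(bytes, Charset.forName("UTF-16"));
doc.replaceItemValue("ParsedContent", decoded); // hypothetical item name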
I may have some details wrong here. It's been a while. Years ago I built a database that shows the LMBCS, UTF-8 and UTF-16 values for all Unicode characters. If you can get down to the byte values, it can be a useful tool for looking at data like this and figuring out what's really going on. It's downloadable from OpenNTF here. In a situation like this, I recall writing code that got the byte array, converted it to hex, and wrote it to a NotesItem so that I could see exactly what was coming in and compare it to the database entries.
And, yes, as per the comments, it's much better if you let the XML tools on both sides handle the charset issues and encoding - but it's not always foolproof, because you're adding yet another layer of charsets into the process, and you have to get each of them right. If the goal is to store data in NotesItems, you still have to make sure that the server-side XML tools decode into the correct charset, which may not be the default.
My heart breaks looking at this. I also just passed through this hell and found the old advice, but... I just could not bring myself to write to disk to solve this trivial matter.
// Read REQUEST_CONTENT as raw bytes and decode them as UTF-8 directly,
// with no round trip through the file system.
Item item = agentContext.getDocumentContext().getFirstItem("REQUEST_CONTENT");
byte[] bytes = item.getValueCustomDataBytes("");
String content = new String(bytes, Charset.forName("UTF-8"));
Edited in response to a comment by the OP: there is an old post on this theme:
http://www-10.lotus.com/ldd/nd85forum.nsf/DateAllFlatWeb/ab8a5283e5a4acd485257baa006bbef2?OpenDocument (the same thread that OP used for his workaround)
The guy claims that the method fails when he uses a particular HTTP header.
Now, he was working with 8.5 and using LotusScript. In my case I cannot make it fail by sending an additional header (or as a function of the string argument).
How I Learned to Stop Worrying and Love the Notes/Domino:
For what it's worth, getValueCustomDataBytes() works only with very short payloads, and the limit depends on the content! Starting your text with an accented character such as 'é' increases the length it still works with... but whatever I tried, I could not get past 195 characters. Am I surprised? After all these years with Notes, I must admit I still am...
Well, admittedly it should not have worked in the first place as it is documented to be used only with User Defined Data fields.
Finally
Use IBM's icu4j and icu4j-charset packages - drop them in jvm/lib/ext. Then the code becomes:
// getText() returns the string as Domino mis-decoded it; encoding it
// back to LMBCS (via the ICU charset) recovers the original bytes,
// which are then decoded as the UTF-8 they really are.
byte[] bytes = item.getText().getBytes(CharsetICU.forNameICU("LMBCS"));
String content = new String(bytes, Charset.forName("UTF-8"));
and yes, it will need a permission in java.policy:
permission java.lang.RuntimePermission "charsetProvider";
Is this any better than passing through the file system? I don't know. But it kinda looks cleaner.

Java adds spaces when reading in a line?

So I'm at my wit's end with this program. I'm reading from a text file in Java. Setting aside everything I do with the string once I have it, this is the bare minimum code:
while ((lineIn = myReader.readLine()) != null) {
    System.out.println("LineIn: \"" + lineIn + "\"");
    System.out.println("Length: " + lineIn.length());
}
What it prints out, however, is very strange indeed. The line should read:
001 2014/06/09 09:40:24 0.000
But this is what I get:
LineIn: "�2�6�1�8� �2�0�1�4�/�0�7�/�1�0� �2�3�:�1�5�:�0�3� �0�.�0�0�0�"
Length: 61
On Stack Overflow it actually shows up fine. You may be able to copy and paste the "LineIn: ..." output into your address bar and see that there are little invisible characters between the digits. I have no idea why those are there, what they are, or where Java is getting them from. Opening the document it's sourced from in a simple text editor shows no such spacing, and copy-pasting from the text editor into the browser address bar has no superfluous spacing either. It's very peculiar and I hope someone can offer insight. I'm pulling my hair out here.
It looks like you're reading UTF-16 data as if it had an 8-bit encoding.
If you construct a java.io.InputStreamReader, you can specify the input text charset such as "UTF-16".
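A minimal sketch, assuming the file really is UTF-16 encoded (the file name is a placeholder):
BufferedReader myReader = new BufferedReader(new InputStreamReader(
        new FileInputStream("data.txt"), "UTF-16"));
String lineIn;
while ((lineIn = myReader.readLine()) != null) {
    System.out.println("LineIn: \"" + lineIn + "\"");
    System.out.println("Length: " + lineIn.length());
}
myReader.close();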
It could be due to the encoding your reader is using; try using Scanner instead.
Java is certainly not doing that; it might be a UTF-16 encoded file. Can you upload the file, or a small part of it, somewhere?

Trying to parse an iCal file with iCal4j: problems with newlines after the DESCRIPTION property

I'm trying to parse the iCal file here: http://www.dsek.se/kalender/ical.php?person=&dsek&tlth
with this code:
URL url = new URL("http://www.dsek.se/kalender/ical.php?person=&dsek&tlth");
calendar = Calendars.load(url);
Well, that is basically the gist of the calendar code.
But I'm running into problems. I think the "DESCRIPTION: text" somehow gets transformed into "DESCRIPTION:" followed by the text on a new line before getting parsed, and thus the parser won't work.
The problem only appears on the rows where there is whitespace after "DESCRIPTION:"; the rows that look like "DESCRIPTION:text" work fine. I've also tested another file that doesn't have these newlines, and that file works fine.
So I'm guessing that maybe it is some kind of character encoding problem? That the URL object changes the encoding of the file? The character encoding of the file is ISO-8859-15.
Or is it just that they have written the file with newlines after "DESCRIPTION:"? And if that is the case, how do I solve it? :S
If it matters somehow, the app is running on Android :)
The issue is that the DESCRIPTION field does not follow proper line folding. See https://www.rfc-editor.org/rfc/rfc5545#section-3.1
So wherever you have something like
DESCRIPTION:
some text
you should have instead
DESCRIPTION:
 some text
(please note the space after the linefeed and before the text) or simply
DESCRIPTION:some text
You might be able to get away with a simple Regex to fix that.
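For example, here is a rough sketch, assuming the downloaded iCal text is already in a String, that joins the stray continuation line back onto the DESCRIPTION line:
// If "DESCRIPTION:" is followed by a line break and the next line does
// not start with a space or tab (i.e. it is not a properly folded
// line), remove the break so the text directly follows the colon.
String fixed = rawIcs.replaceAll(
        "(?m)^(DESCRIPTION:)[ \\t]*\\r?\\n(?![ \\t\\r\\n])", "$1");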
The file is also missing line folding for DESCRIPTION values longer than 75 characters, but iCal4j should be fine with that.
Finally, regarding the character encoding: UTF-8 is the default (other encodings are actually deprecated; see https://www.rfc-editor.org/rfc/rfc5545#section-6), so the Calendars.load() method just assumes UTF-8.
So you will have to do something like this instead:
Reader r = new InputStreamReader(url.openStream(), "ISO-8859-15");
CalendarBuilder builder = new CalendarBuilder();
Calendar calendar = builder.build(r);
Of course, the best solution would be for the authors of those ics files to fix those issues (line folding AND content encoding) on their side.

How to implement UTF-8 format in a Swing application?

In my Swing chat application I have a Send button, a text area, and a text field.
When I press the Send button, I need to send the text from the text field to the text area. It works fine in English but not in my local language.
Please give me some idea or some code that will help me solve this.
First of all, the internal character representation of String is UTF-16, so you don't need to worry once you have the string in your JVM.
The problem is probably the conversion between the sequence of bytes that gets sent over the network and a String object. When decoding bytes into a string you need to provide the encoding, e.g. when using InputStreamReader, you have to pass the Charset parameter:
InputStreamReader(InputStream in, Charset cs)
Create an InputStreamReader that uses the given charset.
The encoding has to be provided, because Java can't magically guess the encoding of a byte sequence.
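For example, a minimal sketch of a chat connection (socket, textField and textArea are assumed to exist already; the point is the explicit UTF-8 charset on both sides):
// Decode incoming bytes and encode outgoing text explicitly as UTF-8,
// instead of relying on the platform default charset.
BufferedReader in = new BufferedReader(new InputStreamReader(
        socket.getInputStream(), StandardCharsets.UTF_8));
PrintWriter out = new PrintWriter(new OutputStreamWriter(
        socket.getOutputStream(), StandardCharsets.UTF_8), true);
out.println(textField.getText());        // send
textArea.append(in.readLine() + "\n");   // receive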

Add a language to my search engine: Arabic letters in Eclipse

I want to build a search engine in Arabic, and I already have code for searching in English; I only had to change the Analyzer. But when I type Arabic in the console (I changed it to UTF-8), I get 0 results. So I think Eclipse gives the Arabic word to the query in some encoding the query doesn't recognize. My question is: what can I do to make the Arabic word readable to the query?
QueryParser parser = new QueryParser(Version.LUCENE_30, "contents",
        new ArabicAnalyzer(Version.LUCENE_30));
Try looking in the project properties, in the "Resource" section. Set your text file encoding to UTF-8 and see if that fixes the problem. I am assuming you already have the right fonts installed.
I believe you are reading characters like this:
BufferedReader reader = new BufferedReader(new InputStreamReader(System.in));
try {
    String token = reader.readLine();
    System.out.println(token);
} catch (IOException e) {
    e.printStackTrace();
}
In that case the character encoding is exactly the same as the current system code page (at least on Windows). The problem is, Eclipse will let you paste Arabic letters into its console window but will lose information in the process. I am not sure whether setting the system code page (in the OS regional options) to windows-1256 would help, but it might. I have tried passing Charset.forName("windows-1256") as a second parameter to InputStreamReader and then typing something with an Arabic keyboard, but it does not work.
OK, but we are not so helpless after all. Since that is meant for testing (right?), you can follow one of two approaches to fix the problem:
Use some basic Swing UI (JFrame + JTextField + JLabel and maybe some button)
Provide an unescaping mechanism and enter characters as code points (e.g. \u0629), as in the sketch below
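A rough sketch of the second approach (the method name is made up): turn literal \uXXXX escapes typed at the console into real characters:
// Replace each \uXXXX escape with the character it denotes.
static String unescape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        if (s.charAt(i) == '\\' && i + 5 < s.length() && s.charAt(i + 1) == 'u') {
            sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
            i += 5; // skip the consumed escape
        } else {
            sb.append(s.charAt(i));
        }
    }
    return sb.toString();
}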
The best fix would be to fix Eclipse itself (which is broken here), for example by implementing Console (System.console()), but I am not so sure they would accept such a patch.
You can try entering Unicode escape sequences in the console instead of Arabic characters.
Use a converter like this one to convert your Arabic text to Unicode escapes.
