Java adds spaces when reading in a line? - java

So I'm at my wit's end with this program. I'm reading in from a text file in Java. Setting aside everything I do with the string once I have it, this is the bare minimum of code needed to show the problem.
while ((lineIn = myReader.readLine()) != null) {
    System.out.println("LineIn: \"" + lineIn + "\"");
    System.out.println("Length: " + lineIn.length());
}
What it prints out, however, is very strange indeed. The line should read:
001 2014/06/09 09:40:24 0.000
But this is what I get:
LineIn: "�2�6�1�8� �2�0�1�4�/�0�7�/�1�0� �2�3�:�1�5�:�0�3� �0�.�0�0�0�"
Length: 61
On Stack Overflow it actually shows up fine. You may be able to copy and paste the "LineIn: etc" into your address bar and see there are little invisible spaces in the numbering. I have no idea why those are there, what they are, and where Java is getting them from. Opening the document it's sourced from in a simple text editor shows no such spacing, and copy+pasting from the text editor into the browser address bar has no superfluous spacing either. It's very peculiar and I hope someone can offer insight. I'm pulling out my hair here.

It looks like you're reading UTF-16 data as if it had an 8-bit encoding.
If you construct a java.io.InputStreamReader yourself, you can specify the charset of the input text, such as "UTF-16".
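For example, a minimal sketch, assuming the file really is UTF-16 (the file name is a placeholder; StandardCharsets.UTF_16 honours a BOM if one is present):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class Utf16ReadExample {
    public static void main(String[] args) throws IOException {
        // Decode the bytes explicitly as UTF-16 instead of the platform default.
        try (BufferedReader myReader = new BufferedReader(
                new InputStreamReader(new FileInputStream("input.txt"), StandardCharsets.UTF_16))) {
            String lineIn;
            while ((lineIn = myReader.readLine()) != null) {
                System.out.println("LineIn: \"" + lineIn + "\"");
                System.out.println("Length: " + lineIn.length());
            }
        }
    }
}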

It could be due to the encoding that your reader is using; try using a Scanner instead.
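A Scanner only helps if you also tell it the charset, though. A quick sketch along those lines, assuming the file turns out to be UTF-16 (the file name is a placeholder):

import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class ScannerReadExample {
    public static void main(String[] args) throws FileNotFoundException {
        // The second argument is the important part: it forces the decoder
        // instead of relying on the platform default encoding.
        try (Scanner scanner = new Scanner(new File("input.txt"), "UTF-16")) {
            while (scanner.hasNextLine()) {
                System.out.println(scanner.nextLine());
            }
        }
    }
}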

Java certainly is not doing that; it might be a UTF-16 encoded file. Can you upload the file, or a small part of it, somewhere?

Related

Reading UTF-8 encoded text from InputStream

I'm having problems reading all Japanese/Chinese characters from an input stream.
Basically, I'm retrieving a JSON object from an API.
Below is my code:
try {
    URL url = new URL(string);
    BufferedReader br = new BufferedReader(
            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8));
    result = br.readLine();
    br.close();
} catch (Exception e) {
    // note: any exception is silently swallowed here
}
For some reason, not all characters are read by the input stream. What could be the problem?
To be specific, some characters appear when I print them out in the console, while some appear as black boxes with question marks. Also, there are no black boxes with question marks when I check the actual JSON object through a browser.
What you see when "printing to a console" really has nothing to do with whether data was read or not, but has everything to do with the capabilities of your console.
If you are fetching data from a URL, and you know for sure that the bytes you have fetched represent UTF-8 encoded text, and the entire data fits on one line of text, then there is no reason why your code should not work.
It sounds like you are not sure things work because you are trying to print text to your console. Perhaps your console is not set to render UTF-8 encoded text? Perhaps your console font does not have enough glyphs to cover the characters?
Here are two things you can try:
Instead of writing the text to your console, save it to a file. Then use a command like hexdump -C (on a *nix system, I have no idea how to do that in Windows) and look at the binary representation to make sure all your expected characters are there.
Save your data to a text file, then open it in a web browser, since browsers probably have much richer font support than a console.
If you still suspect you've read the remote data incorrectly, you can run your retrieved text through a JSON validator, just to make sure.
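A minimal sketch of the first suggestion, with a placeholder URL and output file name: copy the raw bytes to disk before any charset decoding, then inspect the file with hexdump -C (or open it in a browser, per the second suggestion).

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DumpResponseBytes {
    public static void main(String[] args) throws Exception {
        // Files.copy writes the stream byte-for-byte, so what ends up in
        // response.bin is exactly what the server sent.
        try (InputStream in = new URL("https://example.com/api").openStream()) {
            Files.copy(in, Paths.get("response.bin"));
        }
    }
}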
Try "ISO-8859-1" as the charset instead.

Passing Unicode line return characters set in Class to client side (DWR/HTML/UTF8) for InDesign Team

I've built a content management tool that allows a product team to create and manage product that gets exported to a website and for a different team of designers to create print ads for newspapers displaying the same product data.
My problem is with the InDesign graphic designers and the macros that they use within InDesign. The macros have the ability to take copy/pasted text/data and auto format the text inside InDesign based on the presence of certain characters. In particular the design team uses tab, "soft line break" (shift return), and regular line breaks (hard returns) in their macros.
Right now I generate a block of text with the records and the desired formatting characters in a Java class, and that's sent via DWR to the client side. When a tab character is required I send \t, a return is \r, and I was hoping that a soft line break would be \n; however, InDesign seems to regard both \r and \n as a regular line break.
I had given up on being able to pass a soft return until yesterday, when I came across Unicode \u2028 (soft line break) and \u2029 (regular line break). I've tried outputting both of these characters instead of \r and \n in the hope that InDesign may regard them differently. In the box that the designers copy the output from, it looks like there is no character there. There's no line break at all in the places where I've specified \u2028 to appear. When I copy/paste the output into a text editor it shows me that there is an unrecognized character there (it displays as a box with a question mark in it).
Platform is Java/MySQL running on Tomcat.
To date, I haven't had to deal too much with character encoding in this application. Header has <meta charset="utf-8" /> set but that's about it so far. I've tried setting this to utf-16 but it doesn't change the output. All of the tables in the MySQL database are set to utf8/utf8_general_ci.
Thoughts? How can I force InDesign to take copy/pasted text and recognize all of its macro capable characters? Actually, it's just the soft line breaks that it's not recognizing. HELP! :)
Thank you. Sorry this is so long!
Ryan V
I've been playing around with ID CS6 (OS X) for a while and I can't for the life of me get it to recognize a pasted LF as a forced line break. LF, CR, and CRLF all become paragraph breaks. U+2028 and U+2029 are displayed as empty glyphs, not breaks.
I'm a little wary of posting this as an answer, but I'll give it a go:
You might consider providing the text as a downloadable .txt file. CS5 introduced "Tagged Text" (a sort of XML-ish text document with full support for InDesign characters, attributes, etc.), so your designers will be able to place the text file and InDesign will treat everything as intended.
To turn your existing text into CS5+'s Tagged Text (see the reference here), plop a <ASCII-MAC> or <ASCII-WIN> (as appropriate) as the first line and escape any '<' or '>'s with a backslash, then you're free to use <0x000A> as a forced line break. (literally those 8 characters)
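A rough sketch of that conversion in Java, just to illustrate the idea; the method name is made up, and you'd want to double-check the escaping rules against the Tagged Text reference:

public class TaggedTextSketch {
    // Wrap a plain string as CS5+ Tagged Text: platform header first,
    // angle brackets escaped with a backslash, and <0x000A> (literally
    // those eight characters) wherever a forced line break is wanted.
    static String toTaggedText(String body) {
        String escaped = body.replace("<", "\\<").replace(">", "\\>");
        escaped = escaped.replace("\u2028", "<0x000A>");
        return "<ASCII-MAC>\n" + escaped; // use <ASCII-WIN> for Windows designers
    }

    public static void main(String[] args) {
        System.out.println(toTaggedText("Line one\u2028Line two"));
    }
}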
That's probably mega-overkill, but it's certainly the most stupidly reliable way I can think of. I'll edit if I get anything else working.
NB. "forced line break" is the term InDesign itself uses for the character produced by Shift+Enter, your "soft line break;" contrast with "paragraph break" for a standard carriage return. InDesign apparently represents forced breaks with LF (U+000A) and paragraph breaks with CR (U+000D).
I'm not sure how you were trying to transfer and print out your characters (if you post your DWR and javascript code I might be able to help more), but one thing I would try is to ensure that your java output is actual UTF-8 using something like:
String yourRecordString = "Some line 1. \u2028Some line 2.";
ByteBuffer bb = Charset.forName("UTF-8").encode(yourRecordString);
Then, you can write out the bytes in bb into an output stream/file and check them. (Make sure to write them as bytes and not as a String nor as chars.) For example, the UTF-8 encoding of \u2028 is E2 80 A8, so you should see that sequence at the appropriate place in your output. (I use hexmode in vim for things like this.)
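A self-contained sketch of that check, with a placeholder output file name:

import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class EncodeAndDump {
    public static void main(String[] args) throws IOException {
        String yourRecordString = "Some line 1. \u2028Some line 2.";
        ByteBuffer bb = Charset.forName("UTF-8").encode(yourRecordString);
        // Drain the buffer into a byte[] and write it out unchanged, so a hex
        // editor shows the exact UTF-8 bytes (E2 80 A8 for \u2028).
        byte[] bytes = new byte[bb.remaining()];
        bb.get(bytes);
        try (FileOutputStream out = new FileOutputStream("record-utf8.bin")) {
            out.write(bytes);
        }
    }
}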
Then, make sure that these bytes get received back on the javascript side. (While I'm not an expert with DWR, I might prefer to make your java function return something other than a String.)
This should at least help you diagnose where the problem lies. If you do see that sequence and if InDesign still isn't recognizing the soft line breaks, then you at least know the problem is with InDesign and that you will have to find some other solution (such as modifying the designer's macros to recognize other characters).
(Also, note that you can see the default encoding for your JVM using Charset.defaultCharset(). My guess is that your default is not UTF-8, and that InDesign may have also had a problem with the UTF-16 you tried due to endianness or something like that.)
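For what it's worth, checking the default is a one-liner:

System.out.println(java.nio.charset.Charset.defaultCharset());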

Trying to parse an ical file with ical4j: problems with newlines after the description property

I'm trying to parse the ical here: http://www.dsek.se/kalender/ical.php?person=&dsek&tlth
with this code:
URL url=new URL("http://www.dsek.se/kalender/ical.php?person=&dsek&tlth");
calendar=Calendars.load(url);
Well, that is basically the gist of the calendar code.
But I'm running into problems. I think that somehow "DESCRIPTION: text" gets transformed into "DESCRIPTION:" followed by a newline and then the text before it gets parsed, and thus the parser won't work.
The problem only appears on the rows where there is a whitespace after "DESCRIPTION:"; the rows that look like "DESCRIPTION:text" work fine. I've also tested another file that doesn't have these newlines, and that file works fine.
So I'm guessing that maybe it is some kind of character encoding problem? That the URL object changes the encoding of the file? The character encoding of the file is ISO-8859-15.
Or is it just that they have written the file with newlines after "DESCRIPTION:"? And if that is the case, how do I solve this? :S
If it matters somehow, the app is running on Android :)
The issue is that the DESCRIPTION field does not follow proper line folding. See https://www.rfc-editor.org/rfc/rfc5545#section-3.1
So wherever you have something like
DESCRIPTION:
some text
you should have instead
DESCRIPTION:
 some text
(please note the space after the linefeed and before the text) or simply
DESCRIPTION:some text
You might be able to get away with a simple Regex to fix that.
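A rough sketch of such a regex, run on the raw ics text before handing it to the parser (the rawIcs variable is illustrative):

// Join a bare "DESCRIPTION:" line with the unfolded line that follows it.
// The negative lookahead skips lines that are already folded correctly,
// i.e. continuation lines that start with a space or tab.
String fixed = rawIcs.replaceAll("(?m)^DESCRIPTION:\\r?\\n(?![ \\t])", "DESCRIPTION:");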
Then, the file is also missing line folding for those DESCRIPTION lines that are longer than 75 characters, but iCal4j should be fine with that.
Finally, regarding the character encoding, UTF-8 is the default (other encodings are actually deprecated; see https://www.rfc-editor.org/rfc/rfc5545#section-6), so the Calendars.load() method just assumes UTF-8.
So, you will have to do something like:
Reader r = new InputStreamReader(url.openStream(), "ISO-8859-15");
CalendarBuilder builder = new CalendarBuilder();
Calendar calendar = builder.build(r);
Of course, the best solution would be for the authors of those ics files to fix those issues (line folding AND content encoding) on their side.

How can I output data with special characters visible?

I have a text file that was provided to me and no one knows the encoding on it. Looking at it in a text editor, everything looks fine, aligned properly into neat columns.
However, I'm seeing some anomalies when I read the data. Even though, visually, the field "Foo" appears in the same columns in the text file (for instance, in columns 15-20), when I try to pull it out using substring(15,20) my data varies wildly. Sometimes I'll pull bytes 11-16, sometimes 18-23, sometimes 15-20...there's no consistency between records.
I suspect that there are some special characters, invisible to my text editor, but readable by (and counted in the index of) the String methods. Is there any way in Java to dump the contents of the file with any special characters visible, so I can see what Strings I need to replace with a regex?
If not in Java, can anyone recommed a tool that may be able to help me out?
I would start by having a look at the file directly. Any code adds a layer of doubt. Open it in Total Commander (or an equivalent on your platform), view the file (F3) and switch to hex mode. You suggest that the special-character behavior is not even consistent between lines, so you should get some visual clue about the format before you even attempt to fix it algorithmically.
Have you tried printing the contents of the file as individual integers or bytes? That way you can see if there are any hidden characters.
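A small, self-contained sketch of that idea (the path is a placeholder): print every byte of the file in hex, sixteen per line, so invisible characters become obvious.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class FileHexDump {
    public static void main(String[] args) throws IOException {
        byte[] data = Files.readAllBytes(Paths.get("input.txt"));
        for (int i = 0; i < data.length; i++) {
            // Mask to 0..255 and print as two hex digits, e.g. a tab shows up as 09.
            System.out.printf("%02X ", data[i] & 0xFF);
            if ((i + 1) % 16 == 0) {
                System.out.println();
            }
        }
        System.out.println();
    }
}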

How do I decipher garbled/gibberish characters in my networking program

I am working on a client-server networking app in Java SE. I am sending strings terminated by a newline from client to server, and the server responds with a null-terminated string.
In the output window of Netbeans IDE I am finding some gibberish characters amongst the strings that I send and receive.
I can't figure out what these characters are; they mostly look like a rectangular box. When I paste the line containing such a character into Notepad++, all the characters following and including that character disappear.
How can I find out what characters are appearing in the output screen of the IDE?
If the response you are getting back from the server is supposed to be human readable text, then this is probably a character encoding problem. For example, if the client and server are both written in Java, it is likely that they using/assuming different character encodings for the text. (It is also possible that the response is not supposed to be human readable text. In that case, the client should not be trying to interpret it as text ... so this question is moot.)
You typically see boxes (splats) when a program tries to render a character code that it does not understand. This may be a real character (e.g. a Japanese character, a mathematical symbol or the like) or it could be an artifact caused by a mismatch between the character sets used for encoding and decoding the text.
To try and figure out what is going on, try modifying your client-side code to read the response as bytes rather than characters, and then output the bytes to the console in hexadecimal. Then post those bytes in your question, together with the displayed characters that you currently see.
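A rough sketch of that diagnostic, assuming the client talks over a plain Socket (the host, port and request line are placeholders):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ResponseHexDump {
    public static void main(String[] args) throws IOException {
        try (Socket socket = new Socket("localhost", 12345)) {
            OutputStream out = socket.getOutputStream();
            // Send one newline-terminated request, as the protocol describes.
            out.write("HELLO\n".getBytes(StandardCharsets.US_ASCII));
            out.flush();

            // Read the reply as raw bytes and print them in hex, without ever
            // decoding them into a String.
            InputStream in = socket.getInputStream();
            int b;
            while ((b = in.read()) != -1 && b != 0) { // reply is null-terminated
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }
}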
If you understand character set naming and have some ideas what the character sets are likely to be, the UNIX / Linux iconv utility may be helpful. Emacs has extensive support for displaying / editing files in a wide range of character encodings. (Even plain old Wordpad can help, if this is just a problem with line termination sequences; e.g. "\n" versus "\r\n" versus "\n\r".)
(I'd avoid trying to diagnose this by copy and pasting. The copy and paste process may itself "mangle" the data, causing you further confusion.)
This is probably just binary data. Most of it will look like gibberish when interpreted as ASCII. Make sure you are writing the exact number of bytes to the socket, and not some nice round number like 4096. It would be best if you could post your code so we can help you find the error(s).
