I have a text file that was provided to me and no one knows the encoding on it. Looking at it in a text editor, everything looks fine, aligned properly into neat columns.
However, I'm seeing some anomalies when I read the data. Even though, visually, the field "Foo" appears in the same columns in the text file (for instance, in columns 15-20), when I try to pull it out using substring(15,20) my data varies wildly. Sometimes I'll pull bytes 11-16, sometimes 18-23, sometimes 15-20...there's no consistency between records.
I suspect that there are some special characters, invisible to my text editor, but readable by (and counted in the index of) the String methods. Is there any way in Java to dump the contents of the file with any special characters visible, so I can see which strings I need to replace with a regex?
If not in Java, can anyone recommend a tool that may be able to help me out?
I would start by looking at the file directly. Any code adds a layer of doubt. Take Total Commander (or an equivalent on your platform), view the file (F3) and switch to hex mode. You suggest that the behavior of the special characters is not even consistent between lines, so you should get some visual clue about the format before you even attempt to fix it algorithmically.
Have you tried printing the contents of the file as individual integers or bytes? That way you can see if there are any hidden characters.
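A minimal sketch of that idea, assuming the file path is passed as the first command-line argument (the class and method names are made up for illustration). Because the encoding is unknown, it decodes the bytes as ISO-8859-1, which maps every byte to exactly one char, so nothing is merged or dropped. Anything outside printable ASCII is shown as a code point:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class CharDump {
    // Keeps printable ASCII as-is and renders everything else as <U+XXXX>.
    static String reveal(String line) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("<U+%04X>", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path of the file to inspect.
        byte[] bytes = Files.readAllBytes(Paths.get(args[0]));
        // ISO-8859-1 is used only as a lossless byte-to-char mapping here.
        String text = new String(bytes, StandardCharsets.ISO_8859_1);
        for (String line : text.split("\n", -1)) {
            System.out.println(reveal(line));
        }
    }
}
```

Tabs would show up as <U+0009> and stray carriage returns as <U+000D>, which are common reasons for columns that look aligned in an editor but have different character offsets.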
Related
I am making a Java program where I input answers for a friendship survey. It spits out each student's top ten friends. However, I need to print out the results and give them to the students. The old way of doing it was to have the Java program write HTML; then we would open each file one at a time and print out the page. However, doing that for 400+ students takes a while.
So, since I am remaking the program, I would like it to write Word files so I can print them all out at once. However, I don't know how to write to a Word file, and Notepad isn't stylish enough. Does anyone know how to make this possible, or know another way that is easier?
I did a similar thing some years ago, using Rich Text Format. Its advantage is that it's a plain text format that can easily be manipulated.
I created the form document in Word with some unique placeholder strings where I'd later fill in the actual data and saved it as RTF.
With a text editor, I made sure that Word didn't split the placeholders by inserting some junk formatting directives, and corrected that manually where necessary.
Then, filling in the actual data just meant doing some simple text replacement (in my case, there was no risk of interfering with the formatting directives) and saving the resulting RTF file.
As Word typically opens RTF files just as easily as DOC or DOCX ones, this was an easy working solution for me.
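The replacement step could be sketched roughly as follows. The placeholder syntax (@@NAME@@), the class name and the inline template are assumptions for illustration; a real template would be the RTF file saved from Word. Unlike the case described above, this sketch also escapes the characters RTF treats as control syntax (backslash and braces) and writes non-ASCII characters as \uN? escapes, in case the data is less well behaved:

```java
import java.util.LinkedHashMap;
import java.util.Map;

class RtfFill {
    // Escapes RTF control characters and encodes non-ASCII chars as \uN? escapes.
    static String escapeRtf(String value) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '\\' || c == '{' || c == '}') {
                sb.append('\\').append(c);
            } else if (c < 0x80) {
                sb.append(c);
            } else {
                // \u takes a signed 16-bit decimal; '?' is the fallback character.
                sb.append("\\u").append((int) (short) c).append('?');
            }
        }
        return sb.toString();
    }

    // Replaces every placeholder key in the template with its escaped value.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            result = result.replace(e.getKey(), escapeRtf(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical one-line template; a real one would be read from the .rtf file.
        String template = "{\\rtf1\\ansi Top friends of @@NAME@@: @@FRIENDS@@}";
        Map<String, String> values = new LinkedHashMap<>();
        values.put("@@NAME@@", "Alice");
        values.put("@@FRIENDS@@", "Bob, Carol");
        System.out.println(fill(template, values));
    }
}
```

Line breaks inside a value are not handled here; they would need to be turned into RTF paragraph directives separately.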
As part of a larger task we need to extract information from fixed field length text files. The data files were originally developed for EDI but are in widespread use throughout the industry, so we can't ask for a more modern way of encoding data.
I wrote a Java program, implemented as a user defined function, to do the file parsing. It works properly when run locally within jDeveloper. One of the things it does is remove all the newline characters so I can count record lengths accurately. When I try to run it from a simple composite application on the SOA server I find a strange problem: the newlines are gone but I get a new whitespace character in their place.
I do not know what can cause this or how to reliably deal with it. My composite is very simple; I just paste the file content into the "input" field using the test console on the SOA server and this sends the string to the user defined function which parses the file and outputs an XML fragment I can then read.
If I manually strip all the newlines and paste that in, all is well and it works fine, but if I send the data with newlines I get extra whitespace.
Is the composite trying to normalize newlines for me? If so, I would like to know how to make it stop.
So I'm working with last.fm API. Sometimes, the query results in tracks that contain characters like these:
Æther, é, Hṛṣṭa
or non-English characters like these:
水鏡.
When debugging in Eclipse, I see them just fine (as-is) but printing on console prints these as ??? - which is OK for me.
Now, how do I handle these? At first I thought I could remove every song that has any character other than the ones in the English language. I used the regex ^\\w+$ but it didn't work. I also tried \\w+. That didn't work either.
Then I thought further about how to handle these properly. Can anyone help me out? I am perfectly fine with leaving these tracks out of the equation, i.e. I'm fine with having only tracks with English characters.
Another question: What is the best way to display these characters on the console and/or in a Swing GUI?
First, you must ensure that you use the correct encoding when reading your input.
Second, ensure that the font used in Eclipse on the platform you are developing on is able to display all these characters. Swing should display Unicode characters if you read them correctly.
You will likely want to use UTF-8 everywhere.
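A small sketch of both options, assuming the response arrives as UTF-8 bytes (the class and method names are illustrative, not part of any last.fm client). It decodes with an explicit charset, prints through a UTF-8 PrintStream, and filters with \p{ASCII} instead of \w. Note that \w does not match spaces or punctuation, so ^\\w+$ rejects even a plain English title that contains a space:

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class TrackNames {
    // True if every character is 7-bit ASCII (letters, digits, spaces, punctuation).
    static boolean isAsciiOnly(String s) {
        return s.matches("\\p{ASCII}*");
    }

    // Returns only the titles made up entirely of ASCII characters.
    static List<String> keepAscii(List<String> titles) {
        List<String> kept = new ArrayList<>();
        for (String t : titles) {
            if (isAsciiOnly(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Stand-in for bytes received from the API; assumed to be UTF-8 encoded.
        byte[] raw = "\u00c6ther".getBytes(StandardCharsets.UTF_8);
        // Decode with an explicit charset rather than the platform default.
        String title = new String(raw, StandardCharsets.UTF_8);

        // Print through a stream that encodes as UTF-8.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(title + " ascii-only? " + isAsciiOnly(title));
    }
}
```

Whether the console then shows the characters still depends on the console's own encoding and font settings, which this code cannot control.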
I found this example for creating a text file, and it helped me, but all the examples I see just write a plain string to the file, as you would in Notepad.
Can anyone tell me how to create a table-formatted text file on Android?
I want to create an invoice file.
This is most likely going to involve some slightly messy string processing. Assuming you have your data in an acceptable format (such as string arrays), you should be able to construct a single Java string representing the whole table, and then use the code you found already to write it to a file. Use the escape character \t to separate columns and \n to separate rows.
That would be TSV format, and it is very easy to generate. Just add a TAB after every field, and a CR/LF pair after every record.
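As a rough sketch of the approach described in these two answers, assuming the invoice rows are already available as string arrays (the class name and sample data are made up). It places a TAB between fields rather than after every one, and ends each record with CR/LF:

```java
class TsvTable {
    // Builds one string for the whole table: TAB between fields, CR/LF after each row.
    static String toTsv(String[][] rows) {
        StringBuilder sb = new StringBuilder();
        for (String[] row : rows) {
            for (int i = 0; i < row.length; i++) {
                if (i > 0) {
                    sb.append('\t');
                }
                sb.append(row[i]);
            }
            sb.append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical invoice lines.
        String[][] invoice = {
            {"Item", "Qty", "Price"},
            {"Pen", "2", "3.00"},
            {"Notebook", "1", "4.50"}
        };
        System.out.print(toTsv(invoice));
    }
}
```

The resulting string can be written with whatever file-writing code is already in place. Fields that themselves contain tabs or line breaks would need to be cleaned first, since this sketch does no escaping.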
There are some restricted characters (and even full filenames, in Windows), for file and directory names. This other question covers them already.
Is there a way, in Java, to retrieve this list of forbidden characters, which would depend on the system (a bit like retrieving the line separator)? Or do I have to build the list myself, checking for the system?
Edit: More background on my particular situation, aside from the general question.
I use a default name, coming from some data (no real control over their content), and this name is given to a JFileChooser, as a default file name to use (with setSelectedFile()). However, this one truncates anything prior to the last invalid character.
These default names occasionally end with dates in a "mm/dd/yy" format, which leaves only the "yy", in the default name, because "/" are forbidden. As such, checking for Exceptions is not really an option there, because the file itself is not even created yet.
Edit bis: Hmm, that makes me think: if JFileChooser is truncating the name, it probably has access to a list of such characters. It could be interesting to check that further.
Edit ter: Ok, checking sources from JFileChooser shows something completely simple. For the text field, it uses file.getName(). It doesn't actually check for invalid characters, it's simply that it takes the "/" as a path separator, and keeps only the end, the "actual filename". Other forbidden characters actually go through.
When it comes to dealing with "forbidden" characters I'd rather be overcautious and ban/replace all "special" characters that may cause a problem on any filesystem.
Even if technically allowed, sometimes those characters can cause weirdness.
For example, we had an issue where PDF files were being written (successfully) to a SAN, but when served up via a web server from that location some of the characters would cause issues when we were embedding the PDF in an HTML page that was being rendered in Firefox. It was fine if the PDF was accessed directly and it was fine in other browsers. Some weird error with how Firefox and Adobe Reader interact.
Summary: "Special" characters in file names -> weird errors waiting to happen
Ultimately, the only way to be sure is to use a white-list.
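A minimal white-list sketch, assuming that ASCII letters, digits, dot, underscore and hyphen are the only characters worth keeping and that everything else can become an underscore (both are choices made for this example, not a rule from any specification):

```java
class FileNames {
    // Replaces every character outside the white-list with an underscore.
    static String sanitize(String name) {
        String cleaned = name.replaceAll("[^A-Za-z0-9._-]", "_");
        // Avoid returning an empty file name.
        return cleaned.isEmpty() ? "_" : cleaned;
    }

    public static void main(String[] args) {
        // A default name ending in a mm/dd/yy date, as in the question.
        System.out.println(sanitize("Report 03/15/12.pdf"));
    }
}
```

With the slashes replaced, a name ending in a date is no longer split at the path separator when it is handed to setSelectedFile(). The sketch does not cover names that are reserved as a whole on Windows, which the question also mentions.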
Having certain "forbidden characters" is just one of many things that can go wrong when creating a file (others are access rights and file and path name lengths).
It doesn't really make sense to try and catch some of these early when there are others you can't catch until you actually try to create the file. Just handle the exceptions properly.
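One way to read that advice in code, as a sketch using java.nio.file (the class and method names are assumptions): attempt the creation and treat both a rejected path and an I/O failure as "could not create", instead of pre-validating the name.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;

class SafeCreate {
    // Tries to create the file; returns false instead of throwing when it cannot.
    static boolean tryCreate(Path dir, String name) {
        try {
            Files.createFile(dir.resolve(name));
            return true;
        } catch (InvalidPathException e) {
            // The name cannot be turned into a path on this file system.
            return false;
        } catch (IOException e) {
            // Access rights, existing file, missing directory, name too long, etc.
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempDirectory("safe-create-demo");
        System.out.println(tryCreate(tmp, "invoice.txt"));
    }
}
```

A real application would probably report the exception message to the user rather than collapse everything into a boolean.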
Have you tried using File.getCanonicalPath and comparing it to the original file name (or whatever is retrieved from getAbsolutePath)?
This will not give you the actual characters, but it may help you in determining whether this is a valid filename in the OS you're running on.
Have a look at this link for some info on how to get the OS the application is running on. Basically you need to use System.getProperty("os.name") and do an equals() or contains() to find out the operating system.
Something to be wary of, though, is that knowing the OS does not necessarily tell you the underlying file system being used; for example, a Mac can read and write to a FAT32 file system.
source: http://www.mkyong.com/java/how-to-detect-os-in-java-systemgetpropertyosname/
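A short sketch of that check. The family names returned here and the substrings tested are assumptions for illustration; the exact values of os.name vary between JVMs and OS versions:

```java
import java.util.Locale;

class OsInfo {
    // Maps an os.name value to a coarse family using substring checks.
    static String family(String osName) {
        String n = osName.toLowerCase(Locale.ROOT);
        if (n.contains("mac")) {
            return "mac";
        }
        if (n.contains("win")) {
            return "windows";
        }
        if (n.contains("nix") || n.contains("nux") || n.contains("aix")) {
            return "unix";
        }
        return "other";
    }

    public static void main(String[] args) {
        String osName = System.getProperty("os.name");
        System.out.println(osName + " -> " + family(osName));
    }
}
```

As noted above, the result describes the operating system only, not the file system a given path lives on.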