Java BufferedWriter Creating Null Characters - java

I've been using Java's BufferedWriter to write to a file to parse out some input. When I open the file after, however, there seems to be added null characters. I tried specifying the encoding as "US-ASCII" and "UTF8" but I get the same result. Here's my code snippet:
Scanner fileScanner = new Scanner(original);
BufferedWriter out = new BufferedWriter(
        new OutputStreamWriter(new FileOutputStream(file), "US-ASCII"));
while (fileScanner.hasNextLine())
{
    String next = fileScanner.nextLine();
    next = next.replaceAll(".*\\x0C", ""); // remove up to ^L
    out.write(next);
    out.newLine();
}
out.flush();
out.close();
Maybe the issue isn't even with the BufferedWriter?
I've narrowed it down to this code block because if I comment it out, there are no null-characters in the output file. If I do a regex replace in VIM the file is null-character free (:%s/.*^L//g).
Let me know if you need more information.
Thanks!
EDIT:
hexdump of a normal line looks like:
0000000 5349 2a41 3030 202a
But when this code is run the hexdump looks like:
0000000 5330 2a49 4130 202a
I'm not sure why things are getting mixed up.
EDIT:
Also, even if the file doesn't match the regex and runs through that block of code, it comes out with null characters.
EDIT:
Here's a hexdump of the first few lines of a diff:
http://pastie.org/pastes/8964701/text
command was: diff -y testfile.hexdump expectedoutput.hexdump
The rest of the lines are different like the last two.

EDIT: Looking at the hexdump diff you gave, the only difference is that one has LF line endings (0A) and the other has CRLF line endings (0D 0A). All the other data in your diff is shifted ahead to accommodate the extra byte.
The CRLF is the default line ending on the OS you're using. If you want a specific line ending in your output, write the string "\n" or "\r\n".
Previously I noted that the Scanner doesn't specify a charset. It should specify the appropriate one that the input is known to be encoded in. However, this isn't the source of the unexpected output.

Scanner.nextLine() is eating the existing line endings.
The javadoc for nextLine states:
This method returns the rest of the current line, excluding any line separator at the end.
The javadoc for BufferedWriter.newLine explains:
Writes a line separator. The line separator string is defined by the system property line.separator, and is not necessarily a single newline ('\n') character.
In your case, your system's default line separator is "\n", while the EDI file you are parsing uses "\r\n".
Using the system-defined line separator isn't the appropriate thing to do in this case. The newline separator to use is dictated by the file format and should be put in a format-specific static constant somewhere.
Change out.newLine(); to out.write("\r\n");
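A minimal sketch of that change, writing the separator the format dictates instead of the platform default (the ISA/GS segment strings and the EDI_LINE_SEP constant name are illustrative, not taken from your file):

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.util.List;

// Sketch: emit an explicit, format-dictated CRLF separator instead of
// relying on the platform-dependent BufferedWriter.newLine().
public class CrlfWriter {
    // Dictated by the file format, not by the OS the program runs on.
    static final String EDI_LINE_SEP = "\r\n";

    static String joinWithCrlf(Iterable<String> lines) throws IOException {
        StringWriter buffer = new StringWriter();
        try (BufferedWriter out = new BufferedWriter(buffer)) {
            for (String line : lines) {
                out.write(line);
                out.write(EDI_LINE_SEP); // deterministic on every platform
            }
        }
        return buffer.toString();
    }

    public static void main(String[] args) throws IOException {
        String result = joinWithCrlf(List.of("ISA*00*", "GS*PO*"));
        // Escape the control characters so the separator is visible:
        System.out.println(result.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The same idea applies when writing to a FileOutputStream: every `out.newLine()` becomes `out.write(EDI_LINE_SEP)`.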

I think what is going on is the following
All lines that contain ^L (FF) get modified to remove everything before the ^L, but in addition you have the side effect in 1 that all \r (CR) characters also get removed. Moreover, if a CR appears before the ^L, nextLine() treats it as a line break of its own. Note how, in the files below, the input contains six line terminators (counting CR, NL, and CRNL each as one) and the output also contains six, but they are all NL; the line with c gets preserved because it is treated as a separate line from the one holding the ^L. Probably not what you want. See below.
Some observations
The source file is being generated on a system that uses \r\n to define a new line, and your program is being run on a system that does not. Because of this all occurrences of 0xd are going to be removed. This will make the two files different sizes even if there are no ^L.
But you probably overlooked #1 because vim operates in DOS mode (recognizing \r\n as the newline separator) or non-DOS mode (only \n) depending on what it finds when it opens the file, and it hides the difference from the user when it can. In fact, to test this I had to force a \r into the file with ^v^m, because I was editing on Linux.
Your means to test is probably od -x (for hex, right)? But that outputs two-byte shorts, which swaps byte pairs on little-endian machines and is not what you want here. Consider the following input and output files after your program runs, as viewed in vi.
Input file
a
b^M
c^M^M ^L
d^L
Output file
a
b
c
Well, maybe that's right; let's see what od has to say.
od -x of input File
0a61 0d62 630a 0d0d 0c20 640a 0a0c
od -x of output File
0a61 0a62 0a63 0a0a 000a
Huh, where did that null come from? But wait, from the man page of od:
-t type  Specify the output format. type is a string containing one or more of the following kinds of type specifiers:
    a    Named characters (ASCII). Control characters are displayed using their names.
-h, -x   Output hexadecimal shorts. Equivalent to -t x2.
-a       Output named characters. Equivalent to -t a.
So -x prints two-byte shorts, and a file with an odd byte count gets its last short padded with a trailing zero byte, which is where that 00 came from. OK, so instead use the -a option:
od -a of input
a nl b cr nl c cr cr sp ff nl d ff nl
od -a of output
a nl b nl c nl nl nl nl
Forcing java to ignore \r
And finally, all that being said, you really have to overcome Java's implicit assumption that \r delimits a line, even contrary to the documentation. Even when explicitly setting the scanner to a \r-ignoring pattern, it still operates contrary to the documentation, and you must override that again by setting the delimiter (see below). I've found that the following will probably do what you want by insisting on Unix line semantics. I also added some logic to not output blank lines.
public static void repl(File original, File file) throws IOException
{
    Scanner fileScanner = new Scanner(original);
    Pattern pattern1 = Pattern.compile("(?d).*");
    fileScanner.useDelimiter("(?d)\\n");
    BufferedWriter out = new BufferedWriter(
            new OutputStreamWriter(new FileOutputStream(file), "UTF8"));
    while (fileScanner.hasNext(pattern1))
    {
        String next = fileScanner.next(pattern1);
        next = next.replaceAll("(?d)(.*\\x0C)|(\\x0D)", "");
        if (next.length() != 0)
        {
            out.write(next);
            out.newLine();
        }
    }
    out.flush();
    out.close();
}
With this change, the output above changes to:
od -a of input
a nl b cr nl c cr cr sp ff nl d ff nl
od -a of output
a nl b nl

Stuart Caie provided the answer; here is some code if you are looking for a way to avoid these characters.
The basic issue is that the original file uses one line separator and the new file uses a different one.
One easy way is to find the original file's separator character and use the same one in the new file.
try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
     Scanner fileScanner = new Scanner(original)) {
    String lineSep = null;
    boolean lineSepFound = false;
    while (fileScanner.hasNextLine()) {
        if (!lineSepFound) {
            MatchResult matchResult = fileScanner.match();
            if (matchResult != null) {
                lineSep = matchResult.group(1);
                if (lineSep != null) {
                    lineSepFound = true;
                }
            }
        } else {
            out.write(lineSep);
        }
        String next = fileScanner.nextLine();
        next = next.replaceAll(".*\\x0C", ""); // remove up to ^L
        out.write(next);
    }
} catch (IOException e) {
    e.printStackTrace();
}
Note: MatchResult matchResult = fileScanner.match(); provides the MatchResult for the last match performed. In our case we used hasNextLine(), and Scanner uses its internal linePattern to find the next line (see the Scanner.hasNextLine() source), so the line separator is captured in group(1). Unfortunately there is no public way to get the line separator back directly, so I used their own pattern's match result to get lineSep just once, and then used that lineSep when creating the new file.
Also, per your code, you would end up with an extra line separator at the end of the file; that is corrected here.
Let me know if that works.

Related

Strings act weirdly when reading them from a file with the java.util.Scanner and using linebreaks as delimiter

I try to read data from a file using the java.util.Scanner. When I try to use \n as a delimiter, then the resulting Strings react weirdly when I try to add more text to them.
I have a file called "test.txt" and try to read data from it. I then want to add more text to each String, similar to how this would print Hello World!:
String helloWorld = "Hello "+"World!";
System.out.println(helloWorld);.
I tried combining data with +, I tried += and I tried String.concat(), this has worked for me before and usually still works.
I also tried to use different delimiters, or no delimiter at all, both of those work as I expect, but I need the Strings to be separated at line breaks.
The test.txt file for the minimal reproducible example contains this text (there is a space at the end of each line):
zero:
one:
two:
three:
void minimalReproducibleExample() throws Exception { // may throw if the test.txt file can't be found
    String[] data = new String[4];
    java.io.File file = new java.io.File("test.txt");
    java.util.Scanner scanner = new java.util.Scanner(file).useDelimiter("\n");
    for (int i = 0; i < 4; i++) {
        data[i] = scanner.next();    // read the next line
        data[i] += i;                // add a number at the end of the String
        System.out.println(data[i]); // print the String with the number
    }
    scanner.close();
}
I expect this code to print out these lines:
zero: 0
one: 1
two: 2
three: 3
I get this output instead:
0ero:
1ne:
2wo:
three: 3
Why do I not get the expected output when using \n as a delimiter?
The test.txt is most likely using the Windows end-of-line representation \r\n, which results in the carriage return \r still being present in the String after it's read.
Ensure that test.txt uses \n as the line delimiter, or use the Windows \r\n in Scanner.useDelimiter().
The problem is most likely with a wrong delimiter. If you use Windows 10, the new line separator is \r\n.
Just to be platform-independent use System.getProperty("line.separator") instead of hardcoded \n.
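Putting the two suggestions together, a delimiter that tolerates both endings works regardless of how test.txt was saved. A small sketch (the sample input simulates a Windows-encoded file; the method name readLines is illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Sketch: use the delimiter "\r?\n" so both Unix (\n) and Windows (\r\n)
// line endings are consumed, leaving no stray \r at the end of each token.
public class DelimiterDemo {
    static List<String> readLines(String input) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(input).useDelimiter("\r?\n")) {
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Simulates the Windows-encoded test.txt from the question:
        List<String> data = readLines("zero: \r\none: \r\ntwo: \r\nthree: ");
        for (int i = 0; i < data.size(); i++) {
            System.out.println(data.get(i) + i); // prints "zero: 0", "one: 1", ...
        }
    }
}
```

With the plain "\n" delimiter the \r stays in the token and the carriage return makes the appended number overwrite the start of the line on the console, which is exactly the "0ero:" effect in the question.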

\n Not working when reading from File to List and then to output. Java

I have a minor problem, the \n's in my file isn't working in my output I tried two methods:
PLEASE NOTE:
*The text in the file here is a much simplified example. That is why I do not just use output.append("\n\n"); in the second method. Also, the \n's in the file are not always at the END of the line; i.e., a line in the file could be Stipulation 1.1\nUnder this Stipulation...etc. *
The \n's in the file need to work. Also both JOptionPane.showMessageDialog(null,rules); and System.out.println(rules); give the same formatted output
Text in File:
A\n
B\n
C\n
D\n
Method 1:
private static void setGameRules(File f) throws FileNotFoundException, IOException
{
rules = Files.readAllLines(f.toPath(), Charset.defaultCharset());
JOptionPane.showMessageDialog(null,rules);
}
Output 1:
A\nB\nC\nD\n
Method 2:
private static void setGameRules(File f) throws FileNotFoundException, IOException
{
rules = Files.readAllLines(f.toPath(), Charset.defaultCharset());
StringBuilder output = new StringBuilder();
for (String s : rules)
{
output.append(s);
output.append("\n\n");//these \n work but the ones in my file do not
}
System.out.println(output);
}
Output 2:
A\n
B\n
C\n
D\n
The character sequence \n is simply a human readable representation of an unprintable character.
When reading it from a file, you get two characters a '\' and an 'n', not the line break character.
As such, you'll need to replace the placeholders in your file with a 'real' line break character.
Using the method I mentioned earlier: s = s.replaceAll( "\\\\n", System.lineSeparator() ); is one way, I'm sure there are others.
Perhaps in readAllLines you can add add the above line of code to do the replacement before, or as, you stick the line in the rules array.
Edit:
The reason this doesn't work the way you expect is because you're reading it from a file. If it was hardcoded into your class, the compiler would see the '\n' sequence and say "Oh boy! A line separator! I'll just replace that with (char)0x0A".
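A minimal sketch of the replacement this answer suggests (the sample string is illustrative; the four backslashes in the regex source are needed to match one literal backslash):

```java
// Sketch: turn the two-character sequence backslash + 'n', as read from a
// file, into a real line break. The file contains '\' followed by 'n',
// not the unprintable 0x0A character the compiler would produce.
public class LiteralNewlineDemo {
    static String expandLiteralNewlines(String s) {
        // Regex "\\\\n" in source = regex \\n = one literal '\' then 'n'.
        return s.replaceAll("\\\\n", System.lineSeparator());
    }

    public static void main(String[] args) {
        // As read from the file: backslash, 'n' (not a line feed).
        String fromFile = "Stipulation 1.1\\nUnder this Stipulation";
        System.out.println(expandLiteralNewlines(fromFile));
    }
}
```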
What do you mean by "it is not working"? In what way is it not working? Do you expect to see a line break? I am not sure whether you actually have the characters '\' and 'n' at the end of each line, or the line feed character (0x0A). The reason your '\n' works in the Java source is that it is the escape for the line feed character. Tell us a little about your input file: how is it generated?
The second thing I notice is that you print the text to the console in the second method. I am not certain that the JOptionPane will even display line breaks this way; I think it uses a JLabel (see Java: Linebreaks in JLabels? for that). The console does interpret \n as a line break.
The final answer looks like this:
private static void setGameRules(File f) throws FileNotFoundException, IOException {
    rules = Files.readAllLines(f.toPath(), Charset.defaultCharset());
    for (int i = 0; i != rules.size(); i++) {
        rules.set(i, rules.get(i).replaceAll("\\\\n", "\n"));
    }
}
As @Ray said, the \n in the file was just being read as the chars \ and n, not as the line separator \n.
I just added a for-loop to run through the list and replace them using:
rules.set(i, rules.get(i).replaceAll("\\\\n", "\n"));

Scanner's nextLine(), Only fetching partial

So, using something like:
for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory() && files[i].canRead()) {
        try {
            Scanner scan = new Scanner(files[i]);
            System.out.println("Generating Categories for " + files[i].toPath());
            while (scan.hasNextLine()) {
                count++;
                String line = scan.nextLine();
                System.out.println(" ->" + line);
                line = line.split("\t", 2)[1];
                System.out.println("!- " + line);
                JsonParser parser = new JsonParser();
                JsonObject object = parser.parse(line).getAsJsonObject();
                Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                exploreSet(entrySet);
            }
            scan.close();
            // System.out.println(keyset);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
as one goes over a Hadoop output file, one of the JSON objects in the middle is breaking... because scan.nextLine() is not fetching the whole line before it brings it to split. I.e., the output is:
->0 {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
Most of the above data has been sanitized (not the URL (for the most part) however... )
and the URL continues as:
$(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007
in the file....
So it's slightly miffing.
This is also entry #112, and I have had other files parse without errors... but this one is screwing with my mind, mostly because I don't see how scan.nextLine() isn't working...
By debug output, the JSON error is caused by the string not being split properly.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
EDIT:
Also blows up if I remove the offending line in about the same place.
Attempted with JVM 1.6 and 1.7
Workaround Solution:
BufferedReader scan = new BufferedReader(new FileReader(files[i]));
instead of scanner....
Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().
The criteria for an end-of-line are:
Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]" or
The end of the input stream
You say that the file continues after the "~~", so let's put EOF aside and look at the regex. That will match any of the following:
The usual line separators:
<CR>
<NL>
<CR><NL>
... and three unusual forms of line separator that Scanner also recognizes.
0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
0x2028 is the Unicode "line separator" character
0x2029 is the Unicode "paragraph separator" character
My theory is that you've got one of the "unusual" forms in your input file, and this is not showing up in .... whatever tool it is that you are using to examine the files.
I suggest that you examine the input file using a tool that can show you the actual bytes of the file; e.g. the od utility on a Linux / Unix system. Also, check that this isn't caused by some kind of character encoding mismatch ... or trying to read or write binary data as text.
If these don't help, then the next step should be to run your application using your IDE's Java debugger, and single-step it through the Scanner.hasNextLine() and nextLine() calls to find out what the code is actually doing.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
That's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.
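The theory above is easy to check in isolation. A small sketch showing that Scanner.nextLine() also breaks on the unusual separators (the sample strings are illustrative, not your actual data):

```java
import java.util.Scanner;

// Sketch: Scanner.nextLine() breaks not only on \n, \r, and \r\n but also
// on NEL (\u0085), LINE SEPARATOR (\u2028) and PARAGRAPH SEPARATOR
// (\u2029), which ASCII-oriented viewers may not display as line endings.
public class UnusualSeparators {
    static int countLines(String input) {
        int lines = 0;
        try (Scanner scan = new Scanner(input)) {
            while (scan.hasNextLine()) {
                scan.nextLine();
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        System.out.println(countLines("ab"));         // no separator: 1 line
        System.out.println(countLines("a\nb"));       // ordinary newline: 2 lines
        System.out.println(countLines("a\u2028b"));   // Unicode line separator: 2 lines
        System.out.println(countLines("a\u0085b"));   // NEL control code: 2 lines
    }
}
```

If a stray \u2028 (or similar) is embedded in the URL, nextLine() would stop right there while most hex/diff tools show nothing unusual, matching the symptom described. BufferedReader.readLine() only recognizes \n, \r, and \r\n, which would explain why the workaround behaves differently.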

How to preserve correct offset of string which is read from a file

I have a text.txt file which contains following txt.
Kontagent Announces Partnership with Global Latino Social Network Quepasa
Released By Kontagent
I read this text file into a string documentText.
documentText.subString(0,9) gives Kontagent, which is good.
But documentText.substring(87,96) gives y Kontage on Windows (IntelliJ IDEA) and gives Kontagent in a Unix environment. I am guessing it is happening because of the blank line in the file (after which the offset got shifted). But I cannot understand why I get two different results. I need to get the same result in both environments.
To read file as string I used all the functions talked about here
How do I create a Java string from the contents of a file? . But, I still get same results after using any of the functions.
Currently I am using this function to read the file into documentText String:
public static String readFileAsString(String fileName)
{
    File file = new File(fileName);
    StringBuilder fileContents = new StringBuilder((int) file.length());
    Scanner scanner = null;
    try {
        scanner = new Scanner(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String lineSeparator = System.getProperty("line.separator");
    try {
        while (scanner.hasNextLine()) {
            fileContents.append(scanner.nextLine() + lineSeparator);
        }
        return fileContents.toString();
    } finally {
        scanner.close();
    }
}
EDIT: Is there a way to write a general function which will work for both windows and UNIX environments. Even if file is copied in text mode.
Because, unfortunately, I cannot guarantee that everyone who is working on this project will always copy files in binary mode.
The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.
EDIT: in fact, you are the one which appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method which does it for you:
String s = CharStreams.toString(new FileReader(fileName));
On Windows, a newline character \n is preceded by a carriage return character \r. This sequence does not occur on Linux. Transferring the file from one operating system to the other will not strip/append such characters, but occasionally text editors will auto-format them for you.
Because your file does not include \r characters (presumably it was transferred straight from Linux), System.getProperty("line.separator") on Windows returns \r\n and inserts \r characters that were never in the file. This is why your offsets end up 2 characters behind.
Good luck!
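If Guava is not available, the same offset-preserving read can be sketched with the standard library alone (the method name readFileAsString mirrors the question; the sample text in main is illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch: read the file in one pass, byte for byte, so the original line
// endings (and hence every character offset) are preserved exactly as
// stored on disk, instead of re-joining lines with an OS-specific EOL.
public class ReadFileAsIs {
    static String readFileAsString(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("text", ".txt");
        Files.write(tmp, "Kontagent Announces\n\nReleased By Kontagent\n"
                .getBytes(StandardCharsets.UTF_8));
        String documentText = readFileAsString(tmp);
        System.out.println(documentText.substring(0, 9)); // prints "Kontagent"
        Files.delete(tmp);
    }
}
```

Because no per-line reassembly happens, the same offsets work on Windows and Unix as long as the file's bytes are identical (i.e., it was transferred in binary mode).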
Based on the input you guys provided, I wrote something like this:
documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");
to strip off the extra \r characters if a file has them.
Now,I am getting expect result in windows environment as well as unix. Problem solved!!!
It works fine irrespective of what mode file has been copied.
:) I wish I could chose both of your answer, but stackoverflow doesn't allow.

java line breaks in file

I am trying to write a file line by line using Apache FileUtils.writeLines().
When I try to open the file afterwards (Notepad++, EditPlus), it has no line breaks.
(I am passing null as the encoding.)
Thanks.
FileUtils.writeLines(new File(INDICES_AND_WEIGHTS_FILENAME), indicesAndWeightsParams.indicesParams,";");
where indicesParams is a list
public List indicesParams;
Check which version of writeLines() you're calling. There is an overload that allows you to specify the line ending; if you pass an empty string there, you won't get any:
public static void writeLines(File file, Collection<String> lines, String lineEnding) throws IOException
The encoding parameter always comes between the file and the collection; the line separator is always the last parameter.
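If you cannot rely on the overload's behavior, the same effect can be sketched in plain Java with an explicit separator (no commons-io dependency; the class and method names here are illustrative):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Plain-Java sketch mirroring FileUtils.writeLines(file, encoding, lines,
// lineEnding): every line is followed by the separator you pass in,
// never the platform default.
public class WriteLinesSketch {
    static String joinLines(List<String> lines, String lineEnding) {
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            sb.append(line).append(lineEnding);
        }
        return sb.toString();
    }

    static void writeLines(Path file, List<String> lines, String lineEnding) throws IOException {
        Files.write(file, joinLines(lines, lineEnding).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempFile("indices", ".txt");
        writeLines(out, List.of("aaa", "bbbbb"), "\r\n"); // explicit CRLF so Notepad shows breaks
        System.out.println(Files.readString(out).replace("\r", "cr ").replace("\n", "nl "));
        Files.delete(out);
    }
}
```

Passing "\r\n" explicitly also sidesteps the original problem of an editor that only recognizes CRLF.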
Are you using code similar to this?
List<String> ls = new ArrayList<String>();
ls.add("aaa");
ls.add("bbbbb");
FileUtils.writeLines(new File("newfile.txt"), "UTF-8", ls); // same effect as with "null" as encoding
After this code, newFile.txt does have newlines.
Using od -a newfile.txt generates (on Windows):
0000000 a a a cr nl b b b b b cr nl
0000014
which shows that newlines really do exist.
Just add "\r\n" wherever you want the breaks.
