Scanner's nextLine() only fetching a partial line - Java

So, using something like:
for (int i = 0; i < files.length; i++) {
    if (!files[i].isDirectory() && files[i].canRead()) {
        try {
            Scanner scan = new Scanner(files[i]);
            System.out.println("Generating Categories for " + files[i].toPath());
            while (scan.hasNextLine()) {
                count++;
                String line = scan.nextLine();
                System.out.println(" ->" + line);
                line = line.split("\t", 2)[1];
                System.out.println("!- " + line);
                JsonParser parser = new JsonParser();
                JsonObject object = parser.parse(line).getAsJsonObject();
                Set<Entry<String, JsonElement>> entrySet = object.entrySet();
                exploreSet(entrySet);
            }
            scan.close();
            // System.out.println(keyset);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
as this goes over a Hadoop output file, one of the JSON objects in the middle breaks... because scan.nextLine() is not fetching the whole line before handing it to split. I.e., the output is:
->0 {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
Most of the above data has been sanitized (though, for the most part, not the URL...)
and the URL continues as:
$(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007
in the file....
So it's slightly miffing.
This is also entry #112, and I have had other files parse without errors... but this one is screwing with my mind, mostly because I don't see how scan.nextLine() isn't working...
By debug output, the JSON error is caused by the string not being split properly.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
EDIT:
Also blows up if I remove the offending line in about the same place.
Attempted with JVM 1.6 and 1.7
Workaround Solution:
BufferedReader scan = new BufferedReader(new FileReader(files[i]));
instead of a Scanner....
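The workaround can be sketched as follows (a StringReader stands in for the actual file, and the class and method names here are illustrative): BufferedReader.readLine() recognizes only \n, \r, and \r\n as line terminators, so Scanner's "unusual" Unicode separators stay embedded in the returned line instead of splitting it.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReadLines {
    // BufferedReader.readLine() treats only \n, \r, and \r\n as line
    // terminators, so a U+2028 (or U+0085/U+2029) buried in the data
    // stays inside the returned line instead of ending it early.
    static List<String> readAll(Reader source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // A U+2028 would end a Scanner "line", but not a BufferedReader one:
        List<String> lines = readAll(new StringReader("a\u2028b\nc"));
        System.out.println(lines.size()); // 2
    }
}
```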

Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().
The criteria for an end-of-line are:
Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]" or
The end of the input stream
You say that the file continues after the "~~", so let's put EOF aside and look at the regex. That will match any of the following:
The usual line separators:
<CR>
<NL>
<CR><NL>
... and three unusual forms of line separator that Scanner also recognizes.
0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
0x2028 is the Unicode "line separator" character
0x2029 is the Unicode "paragraph separator" character
My theory is that you've got one of the "unusual" forms in your input file, and this is not showing up in .... whatever tool it is that you are using to examine the files.
I suggest that you examine the input file using a tool that can show you the actual bytes of the file; e.g. the od utility on a Linux / Unix system. Also, check that this isn't caused by some kind of character encoding mismatch ... or trying to read or write binary data as text.
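As a quick programmatic check (a hypothetical helper, not part of the original code), you can locate those "unusual" separators once the file has been read into a String with the correct charset:

```java
import java.util.ArrayList;
import java.util.List;

public class FindOddSeparators {
    // Reports the offsets of Scanner's three "unusual" line separators
    // (NEL U+0085, LS U+2028, PS U+2029) in a chunk of text; a quick way
    // to confirm the theory without a hex-dump tool.
    static List<Integer> oddSeparatorOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\u0085' || c == '\u2028' || c == '\u2029') {
                offsets.add(i);
            }
        }
        return offsets;
    }

    public static void main(String[] args) {
        System.out.println(oddSeparatorOffsets("ab\u0085cd\u2028")); // [2, 5]
    }
}
```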
If these don't help, then the next step should be to run your application using your IDE's Java debugger, and single-step it through the Scanner.hasNextLine() and nextLine() calls to find out what the code is actually doing.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
That's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.

Related

Strings act weirdly when reading them from a file with the java.util.Scanner and using linebreaks as delimiter

I try to read data from a file using the java.util.Scanner. When I try to use \n as a delimiter, then the resulting Strings react weirdly when I try to add more text to them.
I have a file called "test.txt" and try to read data from it. I then want to add more text to each String, similar to how this would print Hello World!:
String helloWorld = "Hello "+"World!";
System.out.println(helloWorld);
I tried combining data with +, I tried += and I tried String.concat(), this has worked for me before and usually still works.
I also tried to use different delimiters, or no delimiter at all, both of those work as I expect, but I need the Strings to be separated at line breaks.
The test.txt file for the minimal reproducible example contains this text (there is a space at the end of each line):
zero:
one:
two:
three:
void minimalReproducibleExample() throws Exception { // may throw an exception if the test.txt file can't be found
    String[] data = new String[4];
    java.io.File file = new java.io.File("test.txt");
    java.util.Scanner scanner = new java.util.Scanner(file).useDelimiter("\n");
    for (int i = 0; i < 4; i++) {
        data[i] = scanner.next();    // read the next line
        data[i] += i;                // add a number at the end of the String
        System.out.println(data[i]); // print the String with the number
    }
    scanner.close();
}
I expect this code to print out these lines:
zero: 0
one: 1
two: 2
three: 3
I get this output instead:
0ero:
1ne:
2wo:
three: 3
Why do I not get the expected output when using \n as a delimiter?
The test.txt is most likely using the Windows end-of-line representation \r\n, which results in the carriage return \r still being present in each String after it is read. That also explains the garbled output: when a string like "zero: \r0" is printed, the \r moves the cursor back to the start of the line, so the appended digit overwrites the first character.
Ensure that test.txt is using \n as the line delimiter, or use the Windows \r\n in Scanner.useDelimiter().
The problem is most likely with a wrong delimiter. If you use Windows 10, the new line separator is \r\n.
Just to be platform-independent use System.getProperty("line.separator") instead of hardcoded \n.
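A sketch of the fix, with a String standing in for the file contents: widening the delimiter to accept an optional \r handles both Unix and Windows line endings, so no stray carriage return is left in the token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class DelimiterDemo {
    public static void main(String[] args) {
        // Simulated Windows-style contents of test.txt: lines end with \r\n.
        String windowsText = "zero: \r\none: \r\ntwo: \r\nthree: ";
        List<String> data = new ArrayList<>();
        // "\r?\n" matches both \n and \r\n, so tokens carry no trailing \r.
        Scanner scanner = new Scanner(windowsText).useDelimiter("\r?\n");
        for (int i = 0; i < 4; i++) {
            data.add(scanner.next() + i); // append a number, as in the question
        }
        scanner.close();
        System.out.println(data.get(0)); // zero: 0
    }
}
```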

.contains not working when reading from a text file?

Recently started Java and have been trying to make a database sort of program which reads from a preset text file; the user can either search for a definition using the term, or by keywords/terms within the definition itself. Searching by term works fine, but the key term search always outputs not found.
FileReader fr = new FileReader("text.txt");
BufferedReader br = new BufferedReader(fr);
boolean found = false;
String line = br.readLine();    // first line, so the term itself
String lineTwo = br.readLine(); // second line, which is the definition
do {
    if (lineTwo.toLowerCase().contains(keyterm.toLowerCase())) {
        found = true;
        System.out.println("Found " + keyterm);
        System.out.println(line);
        System.out.println(lineTwo);
    }
} while ((br.readLine() != null) & (!found));
if (!found) { System.out.println("Not Found"); }
br.close();
fr.close();
This is my method used to check for the key term, which works partially: it seems to be able to find the first two lines. This causes it to output the definition of the first term if the key term is there, but it doesn't work for any of the other terms.
edit
The text file it reads from looks something like this:
term
definition
term
definition
Each have their own line.
Edit 2
Thanks to #Matthew Kerian it now checks through the whole file, changing the end of the do while loop to
while (((lineTwo = br.readLine())!=null)&(!found));
It now finds the actual definition but is now outputting the wrong term with it.
Edit 3 The key term is defined by the users input
Edit 4 If it wasn't clear the output in the end I am looking for is either the definition of the term/key term if it is in the txt file or just not found if its not found.
Edit 5 Tried to look at what it was outputting and noticed it was outputting "array" (the first term in the text file) after every lineTwo; it seems as though line is not updating.
Final Edit Managed to crudely solve the problem by making another text file with it flipped in the way it goes term definition it now goes definition term, lets me call upon the next line once the definition is found so it reads properly.
lineTwo is not being refreshed with new data. Something like this would work better:
do {
    if (lineTwo.toLowerCase().contains(keyterm.toLowerCase())) {
        found = true;
        System.out.println("Found " + keyterm);
        System.out.println(line);
        System.out.println(lineTwo);
    }
} while (((lineTwo = br.readLine()) != null) & (!found));
We're still checking for EOF by checking for null, but by assigning the result to lineTwo we're constantly refreshing our buffer.
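To also fix the wrong-term output mentioned in Edit 2, the term and the definition need to advance together. A sketch under that assumption (the class, method, and file contents here are illustrative, not the asker's actual code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class TermSearch {
    // Reads term/definition pairs and returns the term whose definition
    // contains the key term (null if not found). Both lines are advanced
    // together in the loop condition, so the returned term always matches
    // the definition that was tested.
    static String findTerm(BufferedReader br, String keyterm) throws IOException {
        String term, definition;
        while ((term = br.readLine()) != null && (definition = br.readLine()) != null) {
            if (definition.toLowerCase().contains(keyterm.toLowerCase())) {
                return term;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String file = "array\na container of fixed size\nlist\na resizable sequence";
        BufferedReader br = new BufferedReader(new StringReader(file));
        System.out.println(findTerm(br, "resizable")); // list
    }
}
```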

Java BufferedWriter Creating Null Characters

I've been using Java's BufferedWriter to write to a file to parse out some input. When I open the file after, however, there seems to be added null characters. I tried specifying the encoding as "US-ASCII" and "UTF8" but I get the same result. Here's my code snippet:
Scanner fileScanner = new Scanner(original);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "US-ASCII"));
while (fileScanner.hasNextLine())
{
    String next = fileScanner.nextLine();
    next = next.replaceAll(".*\\x0C", ""); // remove up to ^L
    out.write(next);
    out.newLine();
}
out.flush();
out.close();
Maybe the issue isn't even with the BufferedWriter?
I've narrowed it down to this code block because if I comment it out, there are no null-characters in the output file. If I do a regex replace in VIM the file is null-character free (:%s/.*^L//g).
Let me know if you need more information.
Thanks!
EDIT:
hexdump of a normal line looks like:
0000000 5349 2a41 3030 202a
But when this code is run the hexdump looks like:
0000000 5330 2a49 4130 202a
I'm not sure why things are getting mixed up.
EDIT:
Also, even if the file doesn't match the regex and runs through that block of code, it comes out with null characters.
EDIT:
Here's a hexdump of the first few lines of a diff:
http://pastie.org/pastes/8964701/text
command was: diff -y testfile.hexdump expectedoutput.hexdump
The rest of the lines are different like the last two.
EDIT: Looking at the hexdump diff you gave, the only difference is that one has LF line endings (0A) and the other has CRLF line endings (0D 0A). All the other data in your diff is shifted ahead to accommodate the extra byte.
The CRLF is the default line ending on the OS you're using. If you want a specific line ending in your output, write the string "\n" or "\r\n".
Previously I noted that the Scanner doesn't specify a charset. It should specify the appropriate one that the input is known to be encoded in. However, this isn't the source of the unexpected output.
Scanner.nextLine() is eating the existing line endings.
The javadoc for nextLine states:
This method returns the rest of the current line, excluding any line separator at the end.
The javadoc for BufferedWriter.newLine explains:
Writes a line separator. The line separator string is defined by the system property line.separator, and is not necessarily a single newline ('\n') character.
In your case your system's default newline separator is "\n". The EDI file you are parsing uses "\r\n".
Using the system-defined newLine separator isn't the appropriate thing to do in this case. The newline separator to use is dictated by the file format and should be put in a format-specific static constant somewhere.
Change "out.newLine();" to "out.write("\r\n");"
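A self-contained sketch of that change (a StringWriter stands in for the output file, and the constant name is illustrative): the line ending is pinned to what the file format requires rather than the platform default.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;

public class FixedEol {
    // Dictated by the file format (EDI here), not by the OS.
    static final String EDI_EOL = "\r\n";

    static String rewrite(String input) throws IOException {
        StringWriter sink = new StringWriter();
        try (BufferedWriter out = new BufferedWriter(sink)) {
            for (String line : input.split("\r\n|\n")) {
                out.write(line.replaceAll(".*\\x0C", "")); // strip up to ^L
                out.write(EDI_EOL); // instead of out.newLine()
            }
        }
        return sink.toString();
    }

    public static void main(String[] args) throws IOException {
        // CRLF endings survive the round trip regardless of the platform:
        System.out.println(rewrite("abc\r\ndef\r\n").equals("abc\r\ndef\r\n")); // true
    }
}
```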
I think what is going on is the following:
All lines that contain ^L (ff) get modified to remove everything before the ^L, but in addition you have the side effect in 1 that all \r (cr) also get removed. However, if a cr appears before a ^L, nextLine() treats that as a line break too. Note how, in the files below, the count of cr + nl characters is 6 in the input file and also 6 in the output file, but in the output they're all nl, so the line with c gets preserved because it is being treated as a different line than the ^L. Probably not what you want. See below.
Some observations
The source file is being generated on a system that uses \r\n to define a new line, and your program is being run on a system that does not. Because of this all occurrences of 0xd are going to be removed. This will make the two files different sizes even if there are no ^L.
But you probably overlooked #1 because vim will operate in DOS mode (recognizing \r\n as the newline separator) or non-DOS mode (only \n) depending on what it reads when it opens the file, and it hides the fact from the user if it can. In fact, to test, I had to brute-force in a \r using ^v^m, because I was editing on Linux using vim.
Your means to test is probably od -x (for hex, right)? But that outputs two-byte words, which is not what you want. Consider the following input file and output file after your program runs, as viewed in vi.
Input file
a
b^M
c^M^M ^L
d^L
Output file
a
b
c
Well, maybe that's right; let's see what od has to say.
od -x of input File
0a61 0d62 630a 0d0d 0c20 640a 0a0c
od -x of output File
0a61 0a62 0a63 0a0a 000a
Huh, what? Where did that null come from? But wait, from the man page of od:
-t type   Specify the output format. type is a string containing one or more of the following kinds of type specifiers:
          a   Named characters (ASCII). Control characters are displayed using the following names:
-h, -x    Output hexadecimal shorts. Equivalent to -t x2.
-a        Output named characters. Equivalent to -t a.
Oh, ok so instead use the -a option
od -a of input
a nl b cr nl c cr cr sp ff nl d ff nl
od -a of output
a nl b nl c nl nl nl nl
Forcing java to ignore \r
And finally, all that being said, you really have to overcome Java's implicit understanding that \r delimits a line, even contrary to the documentation. Even when explicitly setting the Scanner to use a \r-ignoring pattern, it still operates contrary to the documentation, and you must override that again by setting the delimiter (see below). I've found the following will probably do what you want by insisting on Unix line semantics. I also added some logic to not output a blank line.
public static void repl(File original, File file) throws IOException
{
    Scanner fileScanner = new Scanner(original);
    Pattern pattern1 = Pattern.compile("(?d).*");
    fileScanner.useDelimiter("(?d)\\n");
    BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF8"));
    while (fileScanner.hasNext(pattern1))
    {
        String next = fileScanner.next(pattern1);
        next = next.replaceAll("(?d)(.*\\x0C)|(\\x0D)", "");
        if (next.length() != 0)
        {
            out.write(next);
            out.newLine();
        }
    }
    out.flush();
    out.close();
}
With this change, the output above changes to.
od -a of input
a nl b cr nl c cr cr sp ff nl d ff nl
od -a of output
a nl b nl
Stuart Caie provided the answer. If you are looking for code to avoid these characters:
The basic issue is that the original file uses one line separator and the new file uses a different line separator character.
One easy way is to find the original file's separator character and use the same one in the new file.
try (BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
     Scanner fileScanner = new Scanner(original)) {
    String lineSep = null;
    boolean lineSepFound = false;
    while (fileScanner.hasNextLine())
    {
        if (!lineSepFound) {
            MatchResult matchResult = fileScanner.match();
            if (matchResult != null) {
                lineSep = matchResult.group(1);
                if (lineSep != null) {
                    lineSepFound = true;
                }
            }
        } else {
            out.write(lineSep);
        }
        String next = fileScanner.nextLine();
        next = next.replaceAll(".*\\x0C", ""); // remove up to ^L
        out.write(next);
    }
} catch (IOException e) {
    e.printStackTrace();
}
Note: MatchResult matchResult = fileScanner.match() provides the MatchResult for the last match performed. In our case we have used hasNextLine(); Scanner uses its linePattern to find the next line (the Scanner.hasNextLine() source code finds the line separator), but unfortunately there is no public way to get the line separator back. So I have used their code to get lineSep only once, and used that lineSep when creating the new file.
Also, per your code, you would have an extra line separator at the end of the file. That is corrected here.
Let me know if that works.

\n Not working when reading from File to List and then to output. Java

I have a minor problem: the \n's in my file aren't working in my output. I tried two methods:
PLEASE NOTE:
*The text in the file here is a much simplified example. That is why I do not just use output.append("\n\n"); in the second method. Also, the \n's in the file are not always at the END of a line, i.e. a line in the file could be Stipulation 1.1\nUnder this Stipulation...etc. *
The \n's in the file need to work. Also, both JOptionPane.showMessageDialog(null, rules); and System.out.println(rules); give the same formatted output.
Text in File:
A\n
B\n
C\n
D\n
Method 1:
private static void setGameRules(File f) throws FileNotFoundException, IOException
{
    rules = Files.readAllLines(f.toPath(), Charset.defaultCharset());
    JOptionPane.showMessageDialog(null, rules);
}
Output 1:
A\nB\nC\nD\n
Method 2:
private static void setGameRules(File f) throws FileNotFoundException, IOException
{
    rules = Files.readAllLines(f.toPath(), Charset.defaultCharset());
    StringBuilder output = new StringBuilder();
    for (String s : rules)
    {
        output.append(s);
        output.append("\n\n"); // these \n work, but the ones in my file do not
    }
    System.out.println(output);
}
Output 2:
A\n
B\n
C\n
D\n
The character sequence \n is simply a human readable representation of an unprintable character.
When reading it from a file, you get two characters a '\' and an 'n', not the line break character.
As such, you'll need to replace the placeholders in your file with a 'real' line break character.
Using the method I mentioned earlier: s = s.replaceAll( "\\\\n", System.lineSeparator() ); is one way, I'm sure there are others.
Perhaps in readAllLines you can add the above line of code to do the replacement before, or as, you stick the line in the rules array.
Edit:
The reason this doesn't work the way you expect is because you're reading it from a file. If it was hardcoded into your class, the compiler would see the '\n' sequence and say "Oh boy! A line separator! I'll just replace that with (char)0x0A".
What do you mean by "it is not working"? In what way is it not working? Do you expect to see a line break? I am not sure whether you actually have the characters '\' and 'n' at the end of each line, or the line feed character (0x0A). The reason your '\n' works in the Java source is that it is a way to escape the line feed character. Tell us a little about your input file: how is it generated?
The second thing I notice is that you print the text to the console in the second method. I am not certain that the JOptionPane will even display line breaks this way. I think it uses a JLabel, see Java: Linebreaks in JLabels? for that. The console does interpret \n as a line break.
The final Answer looks like this:
private static void setGameRules(File f) throws FileNotFoundException, IOException {
    rules = Files.readAllLines(f.toPath(), Charset.defaultCharset());
    for (int i = 0; i != rules.size(); i++) {
        rules.set(i, rules.get(i).replaceAll("\\\\n", "\n"));
    }
}
As #Ray said, the \n in the file was just being read as the chars \ and n, not as the line separator \n.
I just added a for-loop to run through the list and replace them using:
rules.set(i, rules.get(i).replaceAll("\\\\n", "\n"));
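The escaping involved can be seen in a quick demonstration (the string literal here stands in for what readAllLines returns): the file yields the two characters '\' and 'n', and the regex \\\\n (four backslashes in source, i.e. the pattern \\n) matches exactly that pair.

```java
public class ReplaceDemo {
    public static void main(String[] args) {
        // What readAllLines hands back for a file line containing A\nB:
        // the two characters '\' and 'n', not a line break.
        String fromFile = "A\\nB";
        // The pattern \\n (written "\\\\n" in source) matches a literal
        // backslash followed by n; the replacement "\n" is a real newline.
        String replaced = fromFile.replaceAll("\\\\n", "\n");
        System.out.println(replaced.contains("\n")); // true
    }
}
```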

JLine, how to get the full filename in Windows

Dear all JLine users,
I have recently been developing a console application where I use JLine to provide command and file name completion.
It works pretty well with FileNameCompleter; however, I cannot get the full file name right.
My code is like below:
List<Completer> loadCompleter =
    Arrays.asList(
        new StringsCompleter(commands),
        new FileNameCompleter(),
        new NullCompleter());
console.addCompleter(new ArgumentCompleter(loadCompleter));
while ((line = console.readLine()) != null) {
    line = line.trim();
    // here I print out the line char by char
    char[] result = line.toCharArray();
    for (int i = 0; i < result.length; i++) {
        System.out.println(result[i] + " : " + (int) result[i]);
    }
}
In the last part of my code I am printing out the line I got from the console, if for example, I have received
myCommand test\new\test.txt
the output is myCommand testnewtest.txt
The backward slash is gone for some reason and I never got the right file path.
This is not an issue when I am testing in Unix like system since forward slash seems ok.
Can anyone help me on the right way of getting the full filename? Many thanks.
Si.
JLine eats backslashes because they are used to escape special characters such as !. You can disable special characters (and the loss of backslashes) by adding the following into your ConsoleReader initialization:
console.setExpandEvents(false);
Alternatively, if you do want to retain special characters, you need to double up your backslashes (so instead of foo\bar, input foo\\bar).
It works for me now; I am using jline-2.11.jar.
See https://github.com/Qatar-Computing-Research-Institute/NADEEF/blob/master/console/src/qa/qcri/nadeef/console/Console.java
