This is a simple "file writing program" that reads a text file from a given directory and stores its content in a string named "words". While the content is being concatenated into the string, the spaces and the line breaks are preserved. However, when I write that string to a file, the spaces are preserved but the line breaks are lost.
For example:
Original String:
Hello. I am stuck in UMT.
Get me out of here.
File Writing:
Hello, I am stuck in UMT. Get me out of here.
Notice the new line is not preserved? No line break? No " \n" ?
package testing;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.FileWriter;
import java.io.IOException;
import java.util.Scanner;
import java.util.logging.Level;
import java.util.logging.Logger;
public class Reader {
public static void main(String[] args) {
String words="this is a a new file created by me :D \n";
File file = new File("C:/file.txt");
try {
Scanner scan= new Scanner(file);
while(scan.hasNextLine()){
words= words.concat(scan.nextLine() + "\r ");
}
} catch (FileNotFoundException ex) {
Logger.getLogger(Reader.class.getName()).log(Level.SEVERE, null,
ex);
}
System.out.println(words);
try {
FileWriter writer = new FileWriter("D:/newfile.txt");
writer.write(words);
writer.close();
} catch (IOException ex) {
Logger.getLogger(Reader.class.getName()).log(Level.SEVERE, null, ex);
}
}}
\r is a carriage return and \n is a line feed (new line). As a general rule, you have to check which line-ending convention your file uses, because the new line sequence differs from one platform to another.
CR, LF, and CRLF are all commonly used end-of-line characters. In the world of PCs, CRLF is common amongst Windows apps, CR is common on the Classic Mac OS, and LF is common on the modern macOS and Unix-oriented OSes (BSD, Linux, etc). See Wikipedia.
Some more info
If you run into these problems, keep in mind that the program that wrote the file and the program that displays it may disagree about the line-ending convention as well as the character encoding. For example, Word may use UTF-8, Notepad may use ISO-8859-1, and your system default may be something else again. Those encodings store \r and \n as the same bytes, so a missing line break usually comes from the convention rather than the encoding. If you write only \n, a Unix-style editor shows a new line, but a viewer that expects \r\n (or a lone \r) may show everything on one line.
Try:
words= words.concat(scan.nextLine() + "\r\n" );
nextLine() returns each line without its line break, so you have to add one back manually.
Also, on Linux and macOS the line separator is \n, but on Windows it is \r\n.
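A minimal sketch of that fix, for illustration only. The class name LineCopy and the helper joinLines are made up here, the input is an in-memory string rather than C:/file.txt, and the output goes to a temporary file instead of D:/newfile.txt.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Scanner;

class LineCopy {
    // Re-joins the lines of `input`, appending `eol` after each one,
    // because Scanner.nextLine() drops the original line terminator.
    static String joinLines(String input, String eol) {
        StringBuilder sb = new StringBuilder();
        try (Scanner scan = new Scanner(input)) {
            while (scan.hasNextLine()) {
                sb.append(scan.nextLine()).append(eol);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "Hello. I am stuck in UMT.\nGet me out of here.";
        // "\r\n" forces Windows line endings; System.lineSeparator() would follow the platform.
        String words = joinLines(original, "\r\n");
        Path out = Files.createTempFile("newfile", ".txt"); // hypothetical output location
        Files.write(out, words.getBytes(StandardCharsets.UTF_8));
        System.out.print(words);
    }
}
```

Using a StringBuilder instead of repeated concat() is only a convenience; the part that matters is that a real line terminator is appended after every line.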
I have successfully used the ISO8859-13 character encoding before but this time it doesn't seem to be working.
Based on the web site https://en.wikipedia.org/wiki/ISO/IEC_8859-13 it is a valid character.
These are the 3 characters stored in the file.
äää
Here is the code being used.
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
public class ReadFile
{
public static void main(String[] arguments)
{
try
{
File inFile = new File("C:\\Downloads\\MyFile.txt");
if (inFile.exists())
{
System.out.println("File found");
BufferedReader in = new BufferedReader(
new InputStreamReader(new FileInputStream(inFile), "ISO8859-13"));
String line = null;
while ( (line = in.readLine()) != null )
{
System.out.println("Line Read: >" + line + "<");
}
}
else
{
System.out.println("File not found");
}
}
catch (IOException e)
{
// Report the failure instead of silently swallowing it.
e.printStackTrace();
}
}
}
The output is the same on both Windows and Linux, with or without Eclipse:
Line Read: >?¤?¤?¤<
This previously worked for a number of other characters but why not for this?
There are many explanations possible for what you are observing. The two most likely ones, along with some code you can use to confirm that you've found the cause:
Option #1: Terminal issues
Maybe you are writing this to a terminal that either cannot render ä, or, there is a terminal transfer issue (terminals are, in the end, just a bunch of streams and pipes hooked together, they are bytes under the hood, so if one part of the process thinks all are agreed that all bytes are to be interpreted as UTF-8 encoded text, and another as ISO-8859-13 encoded, you get problems). Given that you see the exact same output on windows as on linux this is unlikely (it would be particularly likely if you are seeing this in the 'console' view in an IDE, or different outputs on different systems for the same code). If you want to test it, run instead: System.out.println("unicode codepoint of the first character: " + (int) line.charAt(0)); - this should print 228, which is the unicode codepoint for ä. If it doesn't, then you can be certain this isn't the (only) problem.
If this is it, the fix is to, well, use another terminal or mess with settings, I'd just ask another SO question and give plenty of detail on your setup (which OS, which terminal client, what does SET print, does the client have encoding options, etcetera).
Option #2: It's not actually ISO-8859-13
This, too, is simple to test: comment out your BufferedReader in = .... line (and the loop that uses it) and replace it with: System.out.println(new FileInputStream(inFile).read()); - this should print 228. If it prints anything else, your input file is not actually ISO-8859-13.
If this is it, find out what the encoding actually is and use that instead. For example, in UTF-8 encoding, ä would end up as 2 bytes in a file. That would already imply that your input file containing just äää and not even a newline afterwards is 6 bytes large (in ISO-8859-13, it would be 3), and that the raw bytes, as you read them with fileInputStream.read(), are, in order: 195 164 195 164 195 164. So, if you run the above code and it prints 195 instead of 228 - your input is probably in UTF-8; it's definitely not in ISO-8859-13.
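To see the effect without touching the file, here is a small in-memory sketch. It assumes the file holds the six UTF-8 bytes listed above; the class name EncodingCheck and its members are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCheck {
    // The six bytes a UTF-8 encoded file containing "äää" would hold (assumed input).
    static final byte[] UTF8_BYTES = {
        (byte) 195, (byte) 164, (byte) 195, (byte) 164, (byte) 195, (byte) 164
    };

    // Decodes raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String asUtf8 = decode(UTF8_BYTES, StandardCharsets.UTF_8);
        String asIso = decode(UTF8_BYTES, Charset.forName("ISO-8859-13"));
        // Decoded as UTF-8, every pair of bytes becomes one character.
        System.out.println(asUtf8.length() + " chars, first code point " + (int) asUtf8.charAt(0));
        // Decoded as ISO-8859-13, every byte becomes its own character.
        System.out.println(asIso.length() + " chars, second code point " + (int) asIso.charAt(1));
    }
}
```

If the real file behaves like the UTF-8 case here, switching the InputStreamReader charset to UTF-8 is the likely fix.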
I need to write a program that can write UTF-8 data to a file.
I found examples on the internet, but I have not been able to get the desired result.
Code:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
public class UTF8WriterDemo {
public static void main(String[] args) {
Writer out = null;
try {
out = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream("c://java//temp.txt"), "UTF-8"));
String text = "This texáát will be added to File !!";
out.write(text);
out.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Everything runs successfully, but in the end the special characters do not show properly:
This texáát will be added to File !!
I tried several examples from the internet with the same result.
I use Visual Studio Code.
Where could the problem be?
Thank you
Your code is correct. You probably already have a file named temp.txt, and therefore Java writes the text to the existing file (replacing the previous content). The problem may be the encoding that your editor already associates with that file.
In other words, you can't (or at least shouldn't) write UTF-8 text to a file that is then treated as, for example, WINDOWS-1250, or you will get exactly the result you have described.
If you didn't have this file, Java would automatically create a file with UTF-8 encoding.
Possible solutions:
Change the encoding of your current file (usually you can open it in any text editor, use Save as, and then specify UTF-8 as the encoding).
Remove this file and Java will create it automatically with proper encoding.
By the way, you should use StandardCharsets class instead of using String charsetName in order to avoid UnsupportedEncodingException:
new OutputStreamWriter(new FileOutputStream("temp.txt"), StandardCharsets.UTF_8)
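As a rough sketch of that suggestion, the snippet below writes the text with StandardCharsets.UTF_8 and then reads the raw bytes back, which helps tell a badly written file from a badly displayed one. The class name Utf8WriteSketch, the helper roundTrip, and the temporary file are assumptions made for the example; Files.newBufferedWriter is used here instead of the OutputStreamWriter chain purely for brevity.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8WriteSketch {
    // Writes `text` as UTF-8 to a fresh temporary file and returns the bytes found on disk.
    static byte[] roundTrip(String text) {
        try {
            Path target = Files.createTempFile("temp", ".txt"); // hypothetical location
            try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                out.write(text);
            }
            return Files.readAllBytes(target);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "This tex\u00e1\u00e1t will be added to File !!";
        byte[] bytes = roundTrip(text);
        // Each accented character should occupy two bytes in the file.
        System.out.println(text.length() + " chars -> " + bytes.length + " bytes");
        // Decoding with the same charset should give back the original text.
        System.out.println(new String(bytes, StandardCharsets.UTF_8).equals(text));
    }
}
```

If the bytes decode back to the original text, the file itself is fine and the viewer is the place to look.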
When you say "I see special characters not showing properly", where are you seeing them?
What you say/show next looks like the string, utf-8 encoded (i.e. the accented a's are each represented by 2 chars, in what appears to be the appropriate encoding).
What I would expect the issue to be is that the java code is not outputting a BOM at the beginning of the file, leaving the interpretation of utf-8 sequences up to the discretion of the reader.
I'm writing a simple program that writes data to a selected file.
Everything works except the line breaks (\n): the string is written to the file, but without line breaks.
I've tried \n and \n\r, but nothing changed.
The program:
public void prepare(){
String content = "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>\n\r<data>\n\r<user><username>root</username><password>root</password></user>\n\r</data>";
FileOutputStream fos = null;
try {
fos = new FileOutputStream(file);
} catch (FileNotFoundException ex) {
System.out.println("File Not Found .. prepare()");
}
byte b[] = content.getBytes();
try {
fos.write(b);
fos.close();
} catch (IOException ex) {
System.out.println("IOException .. prepare()");
}
}
public static void main(String args[]){
File f = new File("D:\\test.xml");
Database data = new Database(f);
data.prepare();
}
Line endings for Windows follow the form \r\n, not \n\r. However, you may want to use platform-dependent line endings. To determine the standard line endings for the current platform, you can use:
System.lineSeparator()
...if you are running Java 7 or later. On earlier versions, use:
System.getProperty("line.separator")
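For illustration, this is roughly how the content string from the question could be assembled with a chosen separator. The class name LineSeparatorSketch and the helper buildXml are hypothetical, and String.join needs Java 8 or later.

```java
class LineSeparatorSketch {
    // Builds the XML from the question, ending every line with the given separator.
    static String buildXml(String eol) {
        return String.join(eol,
                "<?xml version=\"1.0\" encoding=\"UTF-8\" standalone=\"no\"?>",
                "<data>",
                "<user><username>root</username><password>root</password></user>",
                "</data>") + eol;
    }

    public static void main(String[] args) {
        // Platform-dependent line endings; pass "\r\n" instead to force the Windows form.
        String content = buildXml(System.lineSeparator());
        System.out.print(content);
    }
}
```

Passing the separator as a parameter keeps the choice between a platform-dependent and a fixed line ending in one place.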
My guess is that you're using Windows. Write \r\n instead of \n\r - as \r\n is the linebreak on Windows.
I'm sure you'll find that the characters you're writing into the file are there - but you need to understand that different platforms use different default line breaks... and different clients will handle things differently. (Notepad on Windows only understands \r\n, other text editors may be smarter.)
The correct linebreak sequence on windows is \r\n not \n\r.
Also your viewer may interpret them differently. For example, notepad will only display CRLF linebreaks, but Write or Word have no problem displaying CR or LF alone.
You should use System.lineSeparator() if you want to find the linebreak sequence for the current platform. You are correct in writing them explicitly if you are attempting to force a linebreak format regardless of the current platform.
So, using something like:
for (int i = 0; i < files.length; i++) {
if (!files[i].isDirectory() && files[i].canRead()) {
try {
Scanner scan = new Scanner(files[i]);
System.out.println("Generating Categories for " + files[i].toPath());
while (scan.hasNextLine()) {
count++;
String line = scan.nextLine();
System.out.println(" ->" + line);
line = line.split("\t", 2)[1];
System.out.println("!- " + line);
JsonParser parser = new JsonParser();
JsonObject object = parser.parse(line).getAsJsonObject();
Set<Entry<String, JsonElement>> entrySet = object.entrySet();
exploreSet(entrySet);
}
scan.close();
// System.out.println(keyset);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}
as one goes over a Hadoop output file, one of the JSON objects in the middle is breaking... because scan.nextLine() is not fetching the whole line before it brings it to split. ie, the output is:
->0 {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
!- {"Flags":"0","transactions":{"totalTransactionAmount":"0","totalQuantitySold":"0"},"listingStatus":"NULL","conditionRollupId":"0","photoDisplayType":"0","title":"NULL","quantityAvailable":"0","viewItemCount":"0","visitCount":"0","itemCountryId":"0","itemAspects":{ ... "sellerSiteId":"0","siteId":"0","pictureUrl":"http://somewhere.com/45/x/AlphaNumeric/$(KGrHqR,!rgF!6n5wJSTBQO-G4k(Ww~~
Most of the above data has been sanitized (not the URL (for the most part) however... )
and the URL continues as:
$(KGrHqZHJCgFBsO4dC3MBQdC2)Y4Tg~~60_1.JPG?set_id=8800005007
in the file....
So it's slightly miffing.
This is also entry #112, and I have had other files parse without errors... but this one is screwing with my mind, mostly because I don't see how scan.nextLine() isn't working...
By debug output, the JSON error is caused by the string not being split properly.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
EDIT:
Also blows up if I remove the offending line in about the same place.
Attempted with JVM 1.6 and 1.7
Workaround Solution:
BufferedReader scan = new BufferedReader(new FileReader(files[i]));
instead of scanner....
Based on your code, the best explanation I can come up with is that the line really does end after the "~~" according to the criteria used by Scanner.nextLine().
The criteria for an end-of-line are:
Something that matches this regex: "\r\n|[\n\r\u2028\u2029\u0085]" or
The end of the input stream
You say that the file continues after the "~~", so let's put EOF aside, and look at the regex. That will match any of the following:
The usual line separators:
<CR>
<NL>
<CR><NL>
... and three unusual forms of line separator that Scanner also recognizes.
0x0085 is the <NEL> or "next line" control code in the "ISO C1 Control" group
0x2028 is the Unicode "line separator" character
0x2029 is the Unicode "paragraph separator" character
My theory is that you've got one of the "unusual" forms in your input file, and this is not showing up in .... whatever tool it is that you are using to examine the files.
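If that theory holds, it would also explain why the BufferedReader workaround helps, since readLine() only treats \n, \r and \r\n as line terminators. A small sketch of the difference, using a made-up string with U+2028 in the middle (the class and method names are hypothetical):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Scanner;

class SeparatorDemo {
    // Unicode "line separator", one of the unusual forms Scanner recognizes.
    static final char LINE_SEPARATOR = (char) 0x2028;

    // Counts lines the way Scanner.nextLine() sees them.
    static int countWithScanner(String text) {
        int lines = 0;
        try (Scanner scan = new Scanner(text)) {
            while (scan.hasNextLine()) {
                scan.nextLine();
                lines++;
            }
        }
        return lines;
    }

    // Counts lines the way BufferedReader.readLine() sees them.
    static int countWithBufferedReader(String text) {
        int lines = 0;
        try (BufferedReader in = new BufferedReader(new StringReader(text))) {
            while (in.readLine() != null) {
                lines++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Invented record: one logical line with U+2028 hidden after the "~~".
        String record = "0\t{\"pictureUrl\":\"x~~" + LINE_SEPARATOR + "y.JPG\"}\n";
        System.out.println("Scanner lines: " + countWithScanner(record));
        System.out.println("BufferedReader lines: " + countWithBufferedReader(record));
    }
}
```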
I suggest that you examine the input file using a tool that can show you the actual bytes of the file; e.g. the od utility on a Linux / Unix system. Also, check that this isn't caused by some kind of character encoding mismatch ... or trying to read or write binary data as text.
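If no byte-level tool is at hand, a few lines of Java can do a similar check once the text has been decoded. This is only a sketch, and the names are invented; it reports the positions of the three unusual separators listed above.

```java
import java.util.ArrayList;
import java.util.List;

class UnusualSeparatorFinder {
    // Returns the indexes of NEL (U+0085), LINE SEPARATOR (U+2028)
    // and PARAGRAPH SEPARATOR (U+2029) in the given text.
    static List<Integer> find(String text) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == 0x0085 || c == 0x2028 || c == 0x2029) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up sample with one hidden separator after "~~".
        String sample = "Ww~~" + (char) 0x2028 + "60_1.JPG";
        System.out.println("Unusual separators at: " + find(sample));
    }
}
```

Keep in mind that a wrong charset while decoding can hide or invent such characters, so the raw-byte check is still the more reliable one.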
If these don't help, then the next step should be to run your application using your IDE's Java debugger, and single-step it through the Scanner.hasNextLine() and nextLine() calls to find out what the code is actually doing.
And almost forgot, it also works JUST FINE if I attempt to put the offending line in its own file and parse just that.
That's interesting. But if the tool you are using to extract the line is the same one that is not showing the (hypothesized) unusual line separator, then this evidence is not reliable. The process of extraction may be altering the "stuff" that is causing the problems.
I have a text.txt file which contains the following text.
Kontagent Announces Partnership with Global Latino Social Network Quepasa
Released By Kontagent
I read this text file into a string documentText.
documentText.substring(0,9) gives Kontagent, which is good.
But documentText.substring(87,96) gives "y Kontage" on Windows (IntelliJ IDEA) and Kontagent in a Unix environment. I am guessing it happens because of the blank line in the file (after which the offsets are shifted), but I cannot understand why I get two different results. I need to get the same result in both environments.
To read the file as a string, I tried all the functions discussed in "How do I create a Java string from the contents of a file?", but I still get the same results with any of them.
Currently I am using this function to read the file into documentText String:
public static String readFileAsString(String fileName)
{
File file = new File(fileName);
StringBuilder fileContents = new StringBuilder((int)file.length());
Scanner scanner = null;
try {
scanner = new Scanner(file);
} catch (FileNotFoundException e) {
e.printStackTrace();
}
String lineSeparator = System.getProperty("line.separator");
try {
while(scanner.hasNextLine()) {
fileContents.append(scanner.nextLine() + lineSeparator);
}
return fileContents.toString();
} finally {
scanner.close();
}
}
EDIT: Is there a way to write a general function that works in both Windows and Unix environments, even if the file was copied in text mode?
Unfortunately, I cannot guarantee that everyone working on this project will always copy files in binary mode.
The Unix file probably uses the native Unix EOL char: \n, whereas the Windows file uses the native Windows EOL sequence: \r\n. Since you have two EOLs in your file, there is a difference of 2 chars. Make sure to use a binary file transfer, and all the bytes will be preserved, and everything will run the same way on both OSes.
EDIT: in fact, you are the one who appends an OS-specific EOL (System.getProperty("line.separator")) at the end of each line. Just read the file as a char array using a Reader, and everything will be fine. Or use Guava's method, which does it for you:
String s = CharStreams.toString(new FileReader(fileName));
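Without Guava, something along these lines should behave the same way with only the JDK. It is a sketch that assumes the file is UTF-8; the class name ReadWholeFile and the demo helper writeThenRead are made up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadWholeFile {
    // Reads the whole file into a String without touching its line endings.
    static String readFileAsString(Path path) {
        try {
            return new String(Files.readAllBytes(path), StandardCharsets.UTF_8); // UTF-8 is an assumption
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Demo helper: writes `content` to a temporary file and reads it back.
    static String writeThenRead(String content) {
        try {
            Path tmp = Files.createTempFile("text", ".txt"); // hypothetical file
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return readFileAsString(tmp);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String windowsStyle = "first line\r\n\r\nsecond line";
        // The \r\n pairs come back exactly as they were written.
        System.out.println(writeThenRead(windowsStyle).equals(windowsStyle));
    }
}
```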
On Windows, a newline character \n is preceded by \r, a carriage return character. Linux does not use it. Transferring the file from one operating system to the other will not strip or add such characters, but text editors will occasionally convert them for you.
Because your file does not include \r characters (presumably it was transferred straight from Linux), appending System.getProperty("line.separator"), which is \r\n on Windows, adds \r characters that were never in the file. With two line breaks before the text you are looking for, your offsets end up 2 characters off.
Good luck!
Based on input you guys provided, I wrote something like this
documentText = CharStreams.toString(new FileReader("text.txt"));
documentText = documentText.replaceAll("\\r", "");
to strip off any extra \r the file may contain.
Now I am getting the expected result in the Windows environment as well as Unix. Problem solved!!!
It works fine regardless of the mode in which the file was copied.
:) I wish I could choose both of your answers, but Stack Overflow doesn't allow it.
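For anyone landing here later, a slightly more defensive variant of the same idea, as a sketch: instead of deleting every \r, convert \r\n and any lone \r to \n, so that files with old Mac-style line endings keep their line breaks. The class name EolNormalizer is made up, and the sample text assumes the blank line between the two lines of text.txt.

```java
class EolNormalizer {
    // Converts \r\n and lone \r to \n so that offsets match on every platform.
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String line1 = "Kontagent Announces Partnership with Global Latino Social Network Quepasa";
        String line2 = "Released By Kontagent";
        String unixCopy = line1 + "\n\n" + line2;
        String windowsCopy = line1 + "\r\n\r\n" + line2;
        // After normalization both copies give the same substring at offsets 87..96.
        System.out.println(normalize(unixCopy).substring(87, 96));
        System.out.println(normalize(windowsCopy).substring(87, 96));
    }
}
```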