When I try to get the text from a document, characters such as ™ (trademark) or © (copyright) get garbled after I write the text to a file. As an example, consider the following:
If we have Apache™ Hadoop™! and we write it out using a FileOutputStream, the result looks like Apacheâ Hadoopâ, where the â is nonsense to me. In general, I want a way to detect such characters in the text and just skip them when writing. Is there a solution to this?
If you want just the printable ASCII range, then iterate over your string character by character building a new string. Include the character only if it's within the range 0x20 to 0x7E.
final StringBuilder buff = new StringBuilder();
for (char c : string.toCharArray()) {
    if (c >= 0x20 && c <= 0x7E) {
        buff.append(c);
    }
}
final FileWriter w = new FileWriter(...);
w.write(buff.toString());
w.close();
If you want to keep carriage returns and newlines, you also need to consider 0x0A and 0x0D.
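The same filtering can also be done with a single regular expression; a sketch (the class name is made up), where the character class keeps printable ASCII plus carriage return and line feed:

```java
public class AsciiFilter {
    // Strip everything outside printable ASCII (0x20-0x7E), keeping CR and LF.
    public static String filter(String s) {
        return s.replaceAll("[^\\x20-\\x7E\\r\\n]", "");
    }

    public static void main(String[] args) {
        // The TM signs (U+2122) are removed; the rest passes through.
        System.out.println(filter("Apache\u2122 Hadoop\u2122!")); // Apache Hadoop!
    }
}
```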
I misread the question originally and didn't notice you wanted to skip them. I'll leave this here for now and will delete it if someone posts something better.
To deal with the characters properly, you can explicitly set the charset, e.g. to ISO-8859-1. To do this, you'll need to use something like an OutputStreamWriter.
final OutputStreamWriter writer = new OutputStreamWriter(
        new FileOutputStream(file), Charset.forName("ISO-8859-1"));
writer.write(string);
writer.close();
This won't skip them, but should encode them properly.
The cause is a character-encoding problem: before you write the string to a file, you need to decide how its characters are encoded.
You can do it like this:
Writer out = new OutputStreamWriter(
        new FileOutputStream(new File("D:/helloWorld.txt")), "UTF-8");
String tm = "Apache™ Hadoop™";
out.write(tm);
out.close();
I'm importing a file into my code and trying to print it. the file contains
i don't like cake.
pizza is good.
i don’t like "cookies" to.
17.
29.
the second don't has a right single quotation mark (’), and when I print it the output is
don�t
The replacement character (�) is printed as a blank square. Is there a way to convert it to a regular apostrophe?
EDIT:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;

public class Somethingsomething {
    public static void main(String[] args) throws FileNotFoundException, IOException {
        ArrayList<String> list = new ArrayList<String>();
        File file = new File("D:\\project1Test.txt");
        if (file.exists()) { // checks if the file exists
            FileInputStream fileStream = new FileInputStream(file);
            InputStreamReader input = new InputStreamReader(fileStream);
            BufferedReader reader = new BufferedReader(input);
            String line;
            while ((line = reader.readLine()) != null) {
                list.add(line);
            }
            reader.close();
            for (int i = 0; i < list.size(); i++) {
                System.out.println(list.get(i));
            }
        }
    }
}
It should print normally, but the second "don't" has a white block where the apostrophe should be.
this is the file I'm using https://www.mediafire.com/file/8rk7nwilpj7rn7s/project1Test.txt
Edit: if it helps even more, the full document where the character is found is here:
https://www.nytimes.com/2018/03/25/business/economy/labor-professionals.html
It's all about character encoding. The way characters are represented isn't always the same, and they tend to get misinterpreted.
Characters are usually stored as numbers that depend on the encoding standard (and there are many of them). For example, in ASCII "a" is 97, which UTF-8 encodes as the single byte 0x61; other characters may be encoded differently, or across several bytes, from one standard to another.
Now when you see funny characters such as the question mark (called the replacement character) in this case, it's usually a sign that one encoding standard is being misinterpreted as another, and the replacement character is substituted for the unknown or misinterpreted character.
To fix your problem you need to tell your reader to read your file using a specific character encoding, say SOME-CHARSET.
Replace this:
InputStreamReader input = new InputStreamReader(fileStream);
with this:
InputStreamReader input = new InputStreamReader(fileStream, "SOME-CHARSET");
A list of charsets is available in the Java Charset documentation. Unfortunately, you might have to try them one by one; UTF-8, windows-1252, and ISO-8859-1 are the most common ones.
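The misinterpretation described above can be reproduced directly. A sketch (the class and method names are made up) that decodes a string's UTF-8 bytes with the wrong charset, producing exactly the kind of garbage from the question:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Decode a string's UTF-8 bytes with some other charset, reproducing
    // the garbled output seen when the reader guesses the wrong encoding.
    public static String misdecode(String s, String wrongCharset) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName(wrongCharset));
    }

    public static void main(String[] args) {
        String s = "don\u2019t"; // don’t, with a right single quotation mark
        // U+2019 is the three bytes E2 80 99 in UTF-8; windows-1252 maps
        // them to the three characters â € ™.
        System.out.println(misdecode(s, "windows-1252")); // donâ€™t
        // Decoding the same bytes as UTF-8 restores the original text.
        System.out.println(new String(s.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.UTF_8)); // don’t
    }
}
```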
Your problem is almost certainly the encoding scheme you are using. You can read a file in almost any encoding scheme you want; just tell Java how your input was encoded. UTF-8 is common on Linux; Western European Windows typically defaults to windows-1252 (Cp1252).
This is the sort of problem you have all the time if you are processing files created on a different OS.
I'll give you a different approach...
Use the appropriate means for reading plain text files. Try this:
public static String getTxtContent(String path) {
    try (BufferedReader br = new BufferedReader(new FileReader(path))) {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null) {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        return sb.toString();
    } catch (IOException fex) {
        return null;
    }
}
I'm trying to read in a file and modify the text, but I need to keep new lines when doing so. For example, if I were to read in a file that contained:
This is some text.
This is some more text.
It would just read in as
This is some text.This is some more text.
How do I keep that space? I think it has something to do with the /n escape character. I've seen examples using BufferedReader and FileReader, but we haven't learned those in my class yet, so is there another way? What I've tried is something like this:
if (ch == 10)
{
ch = '\n';
fileOut.print(ch);
}
10 is the ASCII table code for a new line, so I thought Java could recognize it as that, but it doesn't.
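One way around this: Reader.read() returns each character as an int, including '\n' (10) and '\r' (13), so a plain copy loop preserves line breaks without needing BufferedReader. A minimal sketch (class and method names are made up; StringReader/StringWriter stand in for the real file streams, and FileReader/FileWriter would slot into the same method):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

public class CopyKeepNewlines {
    // Copy every character unchanged. read() returns -1 at end of input;
    // otherwise it returns the character, newlines included.
    public static void copy(Reader in, Writer out) throws IOException {
        int ch;
        while ((ch = in.read()) != -1) {
            out.write(ch);
        }
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        copy(new StringReader("This is some text.\nThis is some more text.\n"), out);
        System.out.print(out); // the line break between the sentences survives
    }
}
```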
In Java 8:
You can read lines using:
List<String> yourFileLines = Files.readAllLines(Paths.get("your_file"));
Then collect the lines into a single string (note StringUtils is from Apache Commons Lang, and the filter also drops blank lines; joining on the line separator keeps the line breaks):
String collect = yourFileLines.stream().filter(StringUtils::isNotBlank).collect(Collectors.joining(System.lineSeparator()));
The problem is that you (probably) want to read your file a line at a time, and then write it back a line at a time, keeping empty lines.
The following source does that: it reads the input file one line at a time and writes it back one line at a time, keeping empty lines.
The only caveat is that it may change the newline style: if you read a Unix file and write on Windows (or vice versa), the output uses the line separator of the system you run on, not that of the source file.
Preserving the original newlines verbatim adds a lot of complexity; read the BufferedReader and PrintWriter API docs for more information.
public void process(File input, File output) {
    try (BufferedReader reader = new BufferedReader(
             new InputStreamReader(new FileInputStream(input), "UTF-8"));
         PrintWriter writer = new PrintWriter(
             new OutputStreamWriter(new FileOutputStream(output), "UTF-8"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            String processed = process(line);
            writer.println(processed);
        }
    } catch (IOException e) {
        // Some exception management
    }
}

public String process(String line) {
    return line;
}
/n should be \n
if (ch == 10)
{
ch = '\n';
fileOut.print(ch);
}
Is that a typo?
ch = '/n';
otherwise use
ch = '\n';
I am trying to read a file containing Greek words in UTF-8
with the following code
reader = new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF8"));
while ((line = reader.readLine()) != null) {
    tokenizer = new StringTokenizer(line, delimiter);
    while (tokenizer.hasMoreTokens()) {
        currentToken = tokenizer.nextToken();
        map.put(currentToken, 1);
    }
}
On every forum I looked at I saw this same new FileInputStream(file), "UTF8") construction,
but the printed results still look like ����.
P.S. When I print a variable containing a Greek word from inside the code, the print is successful; that means the problem is in the file reading.
Any ideas?
In Java, "UTF8" is only a legacy alias; the canonical charset name is "UTF-8":
new InputStreamReader(new FileInputStream(file), "UTF-8"))
Or use StandardCharsets.UTF_8 instead to avoid any ambiguity:
new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))
That being said, make sure the file is actually UTF-8 encoded. If it has a UTF-8 BOM in front, you will have to either strip it off from the file itself, or manually skip it when reading the file before then reading the lines. Java readers do not recognize or skip BOMs automatically.
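Skipping a UTF-8 BOM by hand could look like the following sketch (the class and method names are made up). Once decoded, the BOM is the single character U+FEFF, so it is enough to peek at the first character and rewind if it is anything else:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

public class BomSkipper {
    // Peek at the first character; discard it if it is the BOM (U+FEFF),
    // otherwise reset so the caller still sees every character.
    public static BufferedReader skipBom(Reader in) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        reader.mark(1);
        int first = reader.read();
        if (first != 0xFEFF) {
            reader.reset(); // no BOM: rewind to the start
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // StringReader stands in for an InputStreamReader over the real file.
        BufferedReader r = skipBom(new StringReader("\uFEFF\u03B1\u03B2\u03B3"));
        System.out.println(r.readLine()); // the Greek letters, without the BOM
    }
}
```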
Use this for a proper conversion; it repairs a string whose UTF-8 bytes were mistakenly decoded as ISO-8859-1:
public String to_utf8(String fieldvalue) throws UnsupportedEncodingException {
    // Recover the original bytes, then decode them as UTF-8.
    String fieldvalue_utf8 = new String(fieldvalue.getBytes("ISO-8859-1"), "UTF-8");
    return fieldvalue_utf8;
}
I have been unable to find the reason for this. The only problem I am having in this code is that when the FileWriter tries to put the new value into the text file, it instead puts a ?. I have no clue why, or even what it means. Here is the code:
if (secMessage[1].equalsIgnoreCase("add")) {
    if (secMessage.length == 2) {
        try {
            String deaths = readFile("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt", Charset.defaultCharset());
            FileWriter write = new FileWriter("C:/Users/Samboni/Documents/Stuff For Streaming/deaths.txt");
            int comb = Integer.parseInt(deaths) + 1;
            write.write(comb);
            write.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
And here is the readFile method:
static String readFile(String path, Charset encoding) throws IOException {
    byte[] encoded = Files.readAllBytes(Paths.get(path));
    return new String(encoded, encoding);
}
Also, the secMessage array is an array of strings containing the words of an IRC message split into individual words, that way the program can react to the commands on a word-by-word basis.
You're calling Writer.write(int). That writes a single UTF-16 code unit to the file, taking just the bottom 16 bits of the int. If your platform default encoding isn't able to represent that character, it will write '?' as a replacement character.
I suspect you actually want to write out a text representation of the number, in which case you should use:
write.write(String.valueOf(comb));
In other words, turn the value into a string and then write it out. So if comb is 123, you'll get three characters ('1', '2', '3') written to the file.
Personally I'd avoid FileWriter though - I prefer using OutputStreamWriter wrapping FileOutputStream so you can control the encoding. Or in Java 7, you can use Files.newBufferedWriter to do it more simply.
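The Files.newBufferedWriter variant mentioned above might look like this sketch (the path and the hard-coded value stand in for the ones from the question):

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class DeathCounter {
    public static void main(String[] args) throws IOException {
        int comb = 123; // stand-in for the parsed-and-incremented value
        // Files.newBufferedWriter lets you pick the encoding explicitly,
        // and try-with-resources flushes and closes the writer for you.
        try (Writer write = Files.newBufferedWriter(
                Paths.get("deaths.txt"), StandardCharsets.UTF_8)) {
            write.write(String.valueOf(comb)); // writes "123", not one char
        }
    }
}
```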
write.write(Integer.toString(comb));
You can convert the int into a string. Otherwise the int is treated as a character code, which only lines up with the digits for a small subset of values, so it is not recommended.
I'm trying to familiarize myself with the different types of stream IOs Java has to offer, so I wrote this little piece of code here.
public static void main(String[] args) throws IOException {
String str = "English is being IOed!\nLine 2 has a number.\n中文字體(Chinese)";
FileOutputStream fos = new FileOutputStream("ByteIO.txt");
Scanner fis = new Scanner(new FileInputStream("ByteIO.txt"));
FileWriter fw = new FileWriter("CharIO.txt");
Scanner fr = new Scanner(new FileReader("CharIO.txt"));
BufferedOutputStream bos = new BufferedOutputStream(new FileOutputStream("BufferedByteIO.txt"));
Scanner bis = new Scanner(new BufferedInputStream(new FileInputStream("BufferedByteIO.txt")));
BufferedWriter bw = new BufferedWriter(new FileWriter("BufferedCharIO.txt"));
Scanner br = new Scanner(new BufferedReader(new FileReader("BufferedCharIO.txt")));
DataOutputStream dos = new DataOutputStream(new BufferedOutputStream((new FileOutputStream("DataBufferedByteIO.txt"))));
Scanner dis = new Scanner(new DataInputStream(new BufferedInputStream((new FileInputStream("DataBufferedByteIO.txt")))));
try {
System.out.printf("ByteIO:\n");
fos.write(str.getBytes());
while (fis.hasNext())
System.out.print(fis.next());// in the form of a String
System.out.printf("\nCharIO:\n");
fw.write(str);
while (fr.hasNext())
System.out.print(fr.next());
System.out.printf("\nBufferedByteIO:\n");
bos.write(str.getBytes());
bos.flush();// buffer is not full, so you'll need to flush it
while (bis.hasNext())
System.out.print(bis.next());
System.out.printf("\nBufferedCharIO:\n");
bw.write(str);
bw.flush();// buffer is not full, so you'll need to flush it
while (br.hasNext())
System.out.print(br.next());
System.out.printf("\nDataBufferedByteIO:\n");
dos.write(str.getBytes());
//dos.flush();// dos doesn't seem to need this...
while (dis.hasNext())
System.out.print(dis.next());
} finally {
fos.close();
fis.close();
fw.close();
fr.close();
bos.close();
br.close();
dos.close();
dis.close();
}
}
All it does is just write a pre-defined string into the file and then read it. The problem arises when I run the code, I get this:
ByteIO:
EnglishisbeingIOed!Line2hasanumber.中文字體(Chinese)
CharIO:
//<--Empty line here
BufferedByteIO:
EnglishisbeingIOed!Line2hasanumber.中文字體(Chinese)
BufferedCharIO:
EnglishisbeingIOed!Line2hasanumber.中文字體(Chinese)
DataBufferedByteIO:
//<--Empty line here
The files are all populated with the correct data, so I suppose something is wrong with the scanner, but I just don't know what went wrong, and I hope somebody can point the mistake out for me.
The files are all populated with the same data. That's weird: according to the Java I/O Streams documentation, byte streams can only process single bytes, and only character streams can process Unicode, so shouldn't byte streams spit out gibberish when processing Chinese characters, which are UTF-16 (I think)? What exactly is the difference between a byte stream and a character stream (fos vs fw)?
On a partially unrelated topic, I thought Byte Streams were used to work with binary data such as music and images, I also thought that the data Byte Streams spit out should be illegible, but I seem to be wrong, am I? Exactly which I/O Stream Class(es) should I work with if I'm dealing with binary data?
An important concept to understand here is that of encoding.
String/char[]/Writer/Reader are used to deal with textual data of any kind.
byte[]/OutputStream/InputStream are used to deal with binary data. Also, a file on your disk only ever stores binary data (yes, that's true; it will hopefully be a bit clearer in a minute).
Whenever you convert between those two worlds some kind of encoding will be in play. In Java, there are several ways to convert between those worlds without specifying an encoding. In this case, the platform default encoding will be used (which one this is depends on your platform and configuration/locale). [*]
The task of an encoding is to convert some given binary data (usually from a byte[]/ByteBuffer/InputStream) to textual data (usually into char[]/CharBuffer/Writer) or the other way around.
How exactly this happens depends on the encoding used. Some encodings (such as the ISO-8859-* family) are a simple mapping from byte values to corresponding unicode codepoints, others (such as UTF-8) are more complex and a single unicode codepoint can be anything from 1 to 4 bytes.
There's a quite nice article that gives a basic overview over the whole encoding issue titled: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
[*] Using the platform default encoding is usually not desired, because it makes your program un-portable and hard to use, but that's beside the point for this post.
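The byte/text boundary described above can be made concrete with a small round-trip (a sketch; the sample string is the one from the question):

```java
import java.nio.charset.StandardCharsets;

public class EncodingRoundTrip {
    public static void main(String[] args) {
        String text = "\u4E2D\u6587\u5B57\u9AD4"; // 中文字體, four characters
        // Encoding: text -> bytes. The same four characters become twelve
        // bytes in UTF-8 (three each) but eight in UTF-16BE (two each).
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);
        System.out.println(utf8.length);  // 12
        System.out.println(utf16.length); // 8
        // Decoding: bytes -> text. Round-tripping with the same charset
        // reproduces the original string exactly.
        System.out.println(new String(utf8, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

This is also why the files "look right" when opened: the editor applies some decoding of its own, so you never see raw bytes directly.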
Using BufferedInputStream and DataInputStream does not alter the content of the data.
A byte stream is for reading binary data; it is not suitable here.
A character stream is for reading text. Note that Scanner's default delimiter is whitespace, so next() returns one word at a time and drops the spaces and newlines between them, which is why your output runs together.
If I run
String str = "English is being IOed!\nLine 2 has a number.\n\u4E2D\u6587\u5b57\u9ad4(Chinese)\n";
Writer fw = new OutputStreamWriter(new FileOutputStream("ReaderWriter.txt"), "UTF-8");
fw.write(str);
fw.close();

Reader fr = new InputStreamReader(new FileInputStream("ReaderWriter.txt"), "UTF-8");
Scanner scanner = new Scanner(fr);
String next = "";
while (scanner.hasNext()) {
    next = scanner.next();
    System.out.println(next);
}
for (int i = 0; i < next.length(); i++)
    System.out.println(Integer.toHexString((int) next.charAt(i)));
fr.close();
I get
English
is
being
IOed!
Line
2
has
a
number.
????(Chinese)
4e2d
6587
5b57
9ad4
28
43
68
69
6e
65
73
65
29
You can see that the original characters are preserved. The '?' means the character could not be displayed by my terminal or its character encoding. (I don't know why.)