Java: reading text from a file results with strange formatting

Usually, when I read text files, I do it like this:
File file = new File("some_text_file.txt");
Scanner scanner = new Scanner(new FileInputStream(file));
StringBuilder builder = new StringBuilder();
while(scanner.hasNextLine()) {
builder.append(scanner.nextLine());
builder.append('\n');
}
scanner.close();
String text = builder.toString();
There may be better ways, but this method has always worked for me perfectly.
For what I am working on right now, I need to read a large text file (over 700 kilobytes in size). Here is a sample of the text when opened in Notepad (the one that comes standard with any Windows operating system):
"lang"
{
"Language" "English"
"Tokens"
{
"DOTA_WearableType_Daggers" "Daggers"
"DOTA_WearableType_Glaive" "Glaive"
"DOTA_WearableType_Weapon" "Weapon"
"DOTA_WearableType_Armor" "Armor"
However, when I read the text from the file using the method that I provided above, the output is:
I could not paste the output for some reason. I have also tried to read the file like so:
File file = new File("some_text_file.txt");
Path path = file.toPath();
String text = new String(Files.readAllBytes(path));
... with no change in result.
How come the output is not as expected? I also tried reading a text file that I wrote and it worked perfectly fine.

This looks like an encoding problem. Open the file in a tool that can detect encodings (such as Notepad++) to find out how it is encoded. Then use the Scanner constructor that takes an encoding name:
Scanner scanner = new Scanner(new FileInputStream(file), encoding);
Or you can simply experiment with it, trying different encodings. It looks like UTF-16 to me.

final Scanner scanner = new Scanner(new FileInputStream(file), "UTF-16");
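A minimal sketch of the same idea for the readAllBytes variant from the question, assuming the file really is UTF-16 (the answers only guess at this). The class name ReadWithCharset and the helper readText are made up for illustration; the point is only that the charset is passed explicitly instead of relying on the platform default.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ReadWithCharset {
    // Decode the whole file with an explicit charset instead of the platform default.
    static String readText(Path path, Charset charset) throws IOException {
        return new String(Files.readAllBytes(path), charset);
    }

    public static void main(String[] args) throws IOException {
        // "some_text_file.txt" is the placeholder file name used in the question.
        Path path = Paths.get("some_text_file.txt");
        // Assumption: the file is UTF-16. This charset uses the byte-order mark,
        // if one is present, to pick the byte order and does not return it as text.
        System.out.println(readText(path, StandardCharsets.UTF_16));
    }
}
```

If the output is still garbled, the guess about the encoding was wrong and another charset should be tried.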

Related

Vscode doesn't recognize Umlaute (äöü) when reading and writing files with Java

I have a Java project which reads from a .txt file and counts the frequency of every word and saves every word along with its frequency in a .stat file. The way I do this is by reading the file with a BufferedReader, using replaceAll to replace all special characters with spaces and then iterating through the words and finally writing into a .stat with a PrintWriter.
This program works fine if I run it in Eclipse.
However, if I run it in VSCode, the Umlaute (äöü) get recognized as Special characters and are removed from the words.
If I don't use a replaceAll and leave all the special characters in the text, they will get recognized and displayed normally in the .stat.
If I use replaceAll("[^\\p{IsAlphabetic}+]"), the Umlaute will get replaced by all kinds of weird Unicode characters (for Example Ăbermut instead of Übermut).
If I use replaceAll("[^a-zA-ZäöüÄÖÜß]"), the Umlaute will just get replaced by spaces. The same happens if I mention the Umlaute via their Unicode.
This has to be a problem with the encoding in VSCode or perhaps Powershell, as it works fine in other IDEs.
I already checked if Eclipse and VSCode use the same Jdk version, which they did. It's 17.0.5 and the only one installed on my machine.
I also tried out all the different encoding settings in VSCode and I recreated the project from scratch after changing the settings, to no avail.
Here's the code of the minimal reproducible problem:
import java.io.*;
public class App {
static String s;
public static void main(String[] args) {
Reader reader = new Reader();
reader.readFile();
}
}
public class Reader {
public void readFile() {
String s = null;
File file = new File("./src/textfile.txt");
try (FileReader fileReader = new FileReader(file);
BufferedReader bufferedReader = new BufferedReader(fileReader);) {
s = bufferedReader.readLine();
} catch (FileNotFoundException ex) {
// TODO: handle exception
} catch (IOException ex) {
System.out.println("IOException");
}
System.out.println(s);
System.out.println(s.replaceAll("[a-zA-ZäöüÄÖÜß]", " "));
}
}
My textfile.txt contains the line "abcABCäöüÄÖÜß".
The above program outputs
ï»¿abcABCÃ¤Ã¶Ã¼Ã?Ã?Ã?Ã?
ï»¿ Ã¤Ã¶Ã¼Ã?Ã?Ã?Ã?
This shows that the problem is presumably in the Reader, as the gibberish symbols don't get picked up by the replaceAll.

I solved it by explicitly turning all java files and all .txt files into UTF-8 encoding (in the bottom bar in VSCode), setting UTF-8 as the standard encoding in the VSCode settings and modifying both the FileReader and FileWriter to work with the UTF-8 encoding like this:
FileReader fileReader = new FileReader(file, Charset.forName("UTF-8"));
FileWriter fileWriter = new FileWriter(file, Charset.forName("UTF-8"));
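One more detail that may matter here: the ï»¿ at the start of the output in the question is what a UTF-8 byte-order mark looks like when it is decoded with a single-byte charset. Java's UTF-8 decoder does not strip that mark, so even when reading as UTF-8 the first line can start with the invisible character U+FEFF. A small sketch of removing it, with the class and method names (Utf8FirstLine, stripBom, readFirstLine) invented for this example:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8FirstLine {
    // Remove a leading U+FEFF (byte-order mark) if the line starts with one.
    static String stripBom(String line) {
        return (line != null && line.startsWith("\uFEFF")) ? line.substring(1) : line;
    }

    // Read the first line as UTF-8, without the byte-order mark.
    static String readFirstLine(Path path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
            return stripBom(reader.readLine());
        }
    }
}
```

Without this step a word at the very start of the file would carry the extra character, which could skew a word-frequency count.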

hebrew characters from txt file does not display on word after compile to JAR file using spire.doc.jar

I wrote Java code with spire.doc.jar that uses a BufferedReader to take some words from a txt file and display them in a Word document at the end.
This is how it reads the txt file:
BufferedReader abc = new BufferedReader(new FileReader("carNumbers.txt"));
Everything works well, but when I export all my code into a jar file and run it in CMD,
the Word document comes out with weird characters instead of Hebrew:
הריני לאשר בזאת
I get:
�׳™׳×׳•׳¨ ׳•׳©׳�׳™׳˜
Hebrew words that are added to the Word file with finalText.appendText like this:
finalText.appendText(",בכבוד רב" );
are added to the Word document just fine.
What do I need to do to fix this, please?

I fixed that by changing this:
BufferedReader abc = new BufferedReader(new FileReader("carNumbers.txt"));
to this:
BufferedReader abc = new BufferedReader(new InputStreamReader
(new FileInputStream("carNumbers.txt"),"UTF-8"));
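A likely reason the fix works (my reading, the thread does not spell it out): FileReader(String) decodes with the JVM's default charset, which before Java 18 depended on the platform and on how the program was launched, so an IDE run and a CMD run could differ. The sketch below prints that default and shows what happens to UTF-8 bytes decoded with a mismatched charset. The class name DefaultCharsetCheck and the use of ISO-8859-1 as the "wrong" charset are assumptions for illustration only.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetCheck {
    // Decode UTF-8 encoded bytes with the given charset; a mismatch garbles the text.
    static String decode(byte[] utf8Bytes, Charset charset) {
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // The charset used when a reader is created without naming one.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // First word of the Hebrew sample from the question, written as escapes.
        String original = "\u05D4\u05E8\u05D9\u05E0\u05D9";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        System.out.println(decode(bytes, StandardCharsets.UTF_8).equals(original));      // true
        System.out.println(decode(bytes, StandardCharsets.ISO_8859_1).equals(original)); // false
    }
}
```

Running this once from the IDE and once from CMD should show whether the default charset really differs between the two.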

using Scanner to read a file

I found the following useful in the past for reading in text files:
new Scanner(file).useDelimiter("\\Z").next();
However I came across a file today that was only partially read in with this syntax. I'm not sure what makes this file special, it's just a .jsp
I found the below worked in this instance but I'd like to know why the previous method didn't work.
Scanner in = new Scanner(new FileReader(file));
String text = in.useDelimiter("\\Z").next();

Save the JSP file as .txt and try to read it using your first method. If that works, I suspect the size may be the issue.
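The thread does not establish why the first form stopped early. One possibility worth ruling out (an assumption on my part, not something confirmed above) is a decoding error: Scanner's reading methods do not throw IOException but record it, and as far as I know a Scanner built directly from a File reports malformed bytes, while one wrapping a FileReader silently replaces them. A sketch for checking this, with the class and method names invented for the example:

```java
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

class ScannerErrorCheck {
    // Read the file with the "\\Z" idiom and return the IOException that
    // Scanner recorded while reading, or null if there was none.
    static IOException swallowedError(File file, String charsetName) throws IOException {
        try (Scanner in = new Scanner(file, charsetName)) {
            in.useDelimiter("\\Z");
            if (in.hasNext()) {
                in.next(); // content is discarded; only the error state matters here
            }
            return in.ioException();
        }
    }
}
```

If the returned exception is not null for the .jsp file, the file probably contains bytes that are invalid in the charset being used, and reading it with the right charset should give the full text.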

java output html code to file

I have a chunk of HTML code that should be output as a .html file in Java. The pre-written HTML code is the header and table for a page, and I need to generate some data and output it to the same .html file. Is there an easier way to print the HTML code than calling println() line by line? Thanks

You can look at some Java libraries for parsing HTML code. A quick Google search turns up a few. Read in the HTML, use their queries to manipulate the DOM as needed, and then write it back out.
e.g. http://jsoup.org/

Try using a templating engine, MVEL2 or FreeMarker, for example. Both can be used by standalone applications outside of a web framework. You lose time upfront but it will save you time in the long run.

JSP (Java Server Pages) allows you to write HTML files which have some Java code easily embedded within them. For example
<html><head><title>Hi!</title></head><body>
<% some java code here that outputs some stuff %>
</body></html>
Though that requires that you have an enterprise Java server installed. But if this is on a web server, that might not be unreasonable to have.
If you want to do it in normal Java, that depends. I don't fully understand which part you meant you will be outputting line by line. Did you mean you are going to do something like
System.out.println("<html>");
System.out.println("<head><title>Hi!</title></head>");
System.out.println("<body>");
// etc
Like that? If that's what you meant, then don't do that. You can just read in the data from the template file and output all the data at once. You could read it into a multiline text string, or you could read the data in from the template and output it directly to the new file. Something like
while( (strInput = templateFileReader.readLine()) != null)
newFileOutput.println(strInput);
Again, I'm not sure exactly what you mean by that part.
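For what it's worth, here is a more complete sketch of that copy loop, under the assumption that the template contains one marker line where the generated data should go. The class name TemplateCopy, the render method and the idea of a marker line are all made up for this example; nothing in the question says the template looks like this.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class TemplateCopy {
    // Copy the template to the output file line by line, writing the generated
    // text in place of any line that consists only of the marker.
    static void render(Path template, Path output, String marker, String generated)
            throws IOException {
        try (BufferedReader in = Files.newBufferedReader(template, StandardCharsets.UTF_8);
             PrintWriter out = new PrintWriter(
                     Files.newBufferedWriter(output, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                out.println(line.trim().equals(marker) ? generated : line);
            }
        }
    }
}
```

A call such as TemplateCopy.render(templatePath, outputPath, "<!-- placeholder -->", rowsHtml) would then produce the finished page, where rowsHtml holds whatever table rows were generated.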

HTML is simply a way of marking up text, so to write a HTML file, you are simply writing the HTML as text to a file with the .html extension.
There are plenty of tutorials out there for reading and writing files, as well as getting a list of files from a directory. (Google 'java read file', 'java write file', 'java list directory' - that is basically everything you need.) The important thing is the use of BufferedReader/BufferedWriter for pulling the text out of and pushing it into the files, and realising that there is no particular code science involved in writing HTML to a file.
I'll reiterate; HTML is nothing more than <b>text with tags</b>.
Here's a really crude example that will output two files to a single file, wrapping them in an <html></html> tag.
import java.io.*;

public class HtmlJoiner {
    static BufferedReader getReaderForFile(String filename) throws IOException {
        FileInputStream in = new FileInputStream(filename);
        return new BufferedReader(new InputStreamReader(in));
    }

    public static void main(String[] args) throws IOException {
        // Open both input files and the output file; try-with-resources
        // flushes and closes them, so the output is actually written
        try (BufferedReader myheader = getReaderForFile("myheader.txt");
             BufferedReader contents = getReaderForFile("contentfile.txt");
             BufferedWriter out = new BufferedWriter(new FileWriter("mypage.html"))) {
            out.write("<html>");
            out.newLine();
            for (String line = myheader.readLine(); line != null; line = myheader.readLine()) {
                out.write(line);
                out.newLine(); // readLine() strips the line terminator
            }
            for (String line = contents.readLine(); line != null; line = contents.readLine()) {
                out.write(line);
                out.newLine(); // readLine() strips the line terminator
            }
            out.write("</html>");
        }
    }
}

To build a simple HTML text file, you don't have to read your input file line by line.
File theFile = new File("file.html");
byte[] content = new byte[(int) theFile.length()];
You can use "RandomAccessFile.readFully" to read files entirely as a byte array:
// Read file function:
RandomAccessFile file = null;
try {
file = new RandomAccessFile(theFile, "r");
file.readFully(content);
} finally {
if(file != null) {
file.close();
}
}
Make your modifications on the text content:
String text = new String(content);
text = text.replace("<!-- placeholder -->", "generated data");
content = text.getBytes();
Writing is also easy:
// Write file content:
RandomAccessFile file = null;
try {
file = new RandomAccessFile(theFile, "rw");
file.write(content);
} finally {
if(file != null) {
file.close();
}
}

How to print the content of a tar.gz file with Java?

I have to implement an application that permits printing the content of all files within a tar.gz file.
For Example:
if I have three files like this in a folder called testx:
A.txt contains the words "God Save The queen"
B.txt contains the words "Ubi maior, minor cessat"
C.txt.gz is a file compressed with gzip that contain the file c.txt with the words "Hello America!!"
So I compress testx, obtain the compressed tar file: testx.tar.gz.
So with my Java application I would like to print in the console:
"God Save The queen"
"Ubi maior, minor cessat"
"Hello America!!"
I have implemented the ZIP version and it works well, but with the tar library from Apache Ant http://commons.apache.org/compress/, I noticed that it is not as easy as the Java ZIP utilities.
Could someone help me?
I have started looking on the net to understand how to accomplish my aim, so I have the following code:
GZIPInputStream gzipInputStream=null;
gzipInputStream = new GZIPInputStream( new FileInputStream(fileName));
TarInputStream is = new TarInputStream(gzipInputStream);
TarEntry entryx = null;
while((entryx = is.getNextEntry()) != null) {
if (entryx.isDirectory()) continue;
else {
System.out.println(entryx.getName());
if ( entryx.getName().endsWith("txt.gz")){
is.copyEntryContents(out);
// out is a OutputStream!!
}
}
}
In the line is.copyEntryContents(out), it is possible to save the stream to a file by passing an OutputStream, but that is not what I want. In the ZIP version, after getting the first entry (a ZipEntry), we can extract the stream from the compressed root folder, testx.tar.gz, and then create a new ZipInputStream and work with it to obtain the content.
Is it possible to do this with the tar.gz file?
Thanks.

Surfing the net, I came across an interesting idea at http://hype-free.blogspot.com/2009/10/using-tarinputstream-from-java.html.
After converting our TarEntry to a stream, we can adopt the same idea used with ZIP files:
InputStream tmpIn = new StreamingTarEntry(is, entryx.getSize());
// use BufferedReader to get one line at a time
BufferedReader gzipReader = new BufferedReader(
new InputStreamReader(
new GZIPInputStream(
tmpIn )));
// readLine() returns null at the end of the stream
String line;
while ((line = gzipReader.readLine()) != null) { System.out.println(line); }
gzipReader.close();
With this code you can print the content of the file testx.tar.gz ^_^

To avoid writing to a File, use a ByteArrayOutputStream and call its public String toString(String charsetName) method with the correct encoding.
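A minimal sketch of that ByteArrayOutputStream idea using only java.util.zip, assuming the entry's text is UTF-8. The class GzipToString and its two methods are hypothetical names for this example; the tar handling itself is left out, and the input stream would be whatever stream gives access to the .gz entry's bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class GzipToString {
    // Decompress a gzip stream fully into memory and decode it as UTF-8 text.
    static String gunzipToString(InputStream compressed) throws IOException {
        // Deliberately not closed here: closing it would also close the
        // underlying stream, which the caller may still need (e.g. a tar stream).
        GZIPInputStream in = new GZIPInputStream(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
        return out.toString("UTF-8");
    }

    // Helper for trying the method out: gzip a string in memory.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(text.getBytes("UTF-8"));
        }
        return bytes.toByteArray();
    }
}
```

For example, gunzipToString(new ByteArrayInputStream(gzip("Hello America!!"))) gives back the original sentence without touching the file system.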
