OpenCSV reads strange text out of a file - Java

I am using Android Studio and my application has to read a CSV file which looks like this:
"Anmeldung";"1576017126809898";"1547126680978123";"";"";"Frau"
"Anmeldung";"1547126680911112";"1547126680978123";"";"";"Frau"
But as you can see in the following picture, OpenCSV reads some strange characters, and my List contains senseless Strings that are not in the file it read.
This is how I read the Data out of my file:
try {
    FileReader filereader = new FileReader(filePath);
    CSVParser parser = new CSVParserBuilder().withSeparator(';').build();
    CSVReader csvReader = new CSVReaderBuilder(filereader)
            .withSkipLines(1)
            .withCSVParser(parser)
            .build();
    List<String[]> allData = csvReader.readAll();
    MainActivity.setAllData(allData);
}
catch (Exception e) {
    e.printStackTrace();
}
Thank you

It looks like there is an encoding problem.
Make sure to open and parse the file with the proper encoding (for example UTF-8 or UTF-16). The same applies when viewing the data.
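Here is a minimal sketch of reading the file with an explicit charset; the path and the UTF-8 choice are assumptions, so substitute whatever encoding the file was actually saved with:

import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;

public class CsvWithCharset {
    public static void main(String[] args) throws Exception {
        // Files.newBufferedReader lets you name the charset explicitly,
        // unlike FileReader, which silently uses the platform default.
        try (Reader reader = Files.newBufferedReader(Paths.get("data.csv"), StandardCharsets.UTF_8)) {
            CSVParser parser = new CSVParserBuilder().withSeparator(';').build();
            CSVReader csvReader = new CSVReaderBuilder(reader)
                    .withSkipLines(1)
                    .withCSVParser(parser)
                    .build();
            List<String[]> allData = csvReader.readAll();
            System.out.println(allData.size() + " rows read");
        }
    }
}

If the rows still come out garbled, try StandardCharsets.UTF_16 or Charset.forName("windows-1252") instead.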

I figured it out. It might sound strange, but I took the file and replaced every ; with a freshly typed ; (the original separators looked identical but were encoded as different bytes).
I think the data I got was exported with a UTF-16 encoding or from a Linux device.
tl;dr The file had the wrong encoding; the way I opened and viewed it was correct.

Related

How to get file content properly from a jpg file?

I'm trying to get content from a jpg file so I can encrypt that content and save it in another file that is later decrypted.
I'm trying to do so by reading the jpg file as if it were a text file with this code:
String aBuffer = "";
try {
    File myFile = new File(pathRoot);
    FileInputStream fIn = new FileInputStream(myFile);
    BufferedReader myReader = new BufferedReader(new InputStreamReader(fIn));
    String aDataRow = "";
    while ((aDataRow = myReader.readLine()) != null) {
        aBuffer += aDataRow;
    }
    myReader.close();
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
But this doesn't give me the file's content, just a short string, and weirdly enough it also looks like just reading the file corrupts it.
What could I do so I can achieve the desired behavior?
Image files aren't text - but you're treating the data as textual data. Basically, don't do that. Use the InputStream to load the data into a byte array (or preferably, use Files.readAllBytes(Path) to do it rather more simply).
Keep the binary data as binary data. If you absolutely need a text representation, you'll need to encode it in a way that doesn't lose data - where hex or base64 are the most common ways of doing that.
You mention encryption early in the question: encryption also generally operates on binary data. Any encryption methods which provide text options (e.g. string parameters) are just convenience wrappers which encode the text as binary data and then encrypt it.
"and weirdly enough it also looks like just reading the file corrupts it."
I believe you're mistaken about that. Just reading from a file will not change it in any way. Admittedly you're not using try-with-resources statements, so you could end up keeping the file handle open, potentially preventing another process from reading it - but the content of the file itself won't change.
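A minimal sketch of the byte-array approach described above ("image.jpg" is a placeholder path):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;

public class BinaryRead {
    public static void main(String[] args) throws Exception {
        Path path = Paths.get("image.jpg");
        // Binary-safe read: no charset is involved, so no bytes can be mangled.
        byte[] data = Files.readAllBytes(path);
        // If a text representation is needed, Base64 round-trips without loss.
        String encoded = Base64.getEncoder().encodeToString(data);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        // decoded is byte-for-byte identical to data; encrypt data (or encoded) as needed.
    }
}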

How to change text encoding to utf-8 while using apache tika text parsing (most specifically for .txt files)

I'm using Apache Tika for text extraction. It was working fine for almost all file types until I tested it on a Chinese machine with a .txt document written in Chinese. The file was not saved in UTF-8, and Tika started producing wrong characters. This seems to be an encoding issue. I tried setting the encoding type like this:
metadata.add(Metadata.CONTENT_ENCODING, "UTF_8")
but still no luck. I've seen Java methods that convert text from one encoding to another, but only when the source encoding is known. In my case I'm not sure of the client's encoding and can't force them to use UTF-8. Kindly help me with this!
Thanks in advance :)
I had the same issue, though when converting PowerPoint to text, and I found that by using a content handler for which you can specify the output encoding, the encoding works correctly.
The metadata you are adding changes nothing for the conversion; it just adds a line to the headers of the HTML output.
Here is my code:
public String transformPowerpointToText(File file) throws IOException, TikaException {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    // ToTextContentHandler writes to the stream using the given encoding
    ToTextContentHandler toTextContentHandler = new ToTextContentHandler(byteArrayOutputStream, "UTF-8");
    AutoDetectParser parser = new AutoDetectParser();
    Metadata metadata = new Metadata();
    try (InputStream stream = new FileInputStream(file)) {
        parser.parse(stream, toTextContentHandler, metadata);
        // decode with the same encoding the handler used to write
        return byteArrayOutputStream.toString("UTF-8");
    } catch (SAXException e) {
        throw new TikaException("Parsing failed", e);
    }
}
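Since the question also mentions not knowing the client's encoding at all: Tika can guess the charset of plain-text input. A sketch using Tika's AutoDetectReader (the file name is a placeholder, and detection is heuristic, so treat the result as a best guess):

import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.tika.detect.AutoDetectReader;

public class DetectCharset {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new FileInputStream("unknown.txt");
             AutoDetectReader reader = new AutoDetectReader(in)) {
            // The reader decodes using the charset it detected from the byte stream.
            System.out.println("Detected charset: " + reader.getCharset());
            System.out.println(reader.readLine());
        }
    }
}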

How to read / write into docx file using commons.io.FileUtils?

Need some quick help. I am trying to write a java program to generate a report. I have the report template in a docx file.
What I want to do is use that docx file as a template, insert data into it multiple times for various records, and write the result to a new docx file. The main thing is that I want to maintain the formatting and indentation of the contents inside the docx file; they are bulleted lists, and that's where the problem is.
Below is the piece of code handling the above operation,
public void readWriteDocx(HashMap<String, String> detailsMap) {
    try {
        File reportTemplateFile = new File("ReportTemplate.docx");
        File actualReportFile = new File("ActualReport.docx");
        StringBuilder preReport = new StringBuilder();
        preReport.append("Some details about pre report goes here...: ");
        preReport.append(System.lineSeparator());
        String docxContent = "";
        for (Map.Entry<String, String> entry : detailsMap.entrySet()) {
            docxContent = FileUtils.readFileToString(reportTemplateFile, StandardCharsets.UTF_8);
            // code to fetch and get data to insert into docxContent
            docxContent = docxContent.replace("$filename", keyFilename);
            docxContent = docxContent.replace("$expected", expectedFile);
            docxContent = docxContent.replace("$actual", actualFile);
            docxContent = docxContent.replace("$reportCount", String.valueOf(reportCount));
            docxContent = docxContent.replace("$diffMessage", key);
            FileUtils.writeStringToFile(actualReportFile, docxContent, StandardCharsets.UTF_8, true);
        }
        preReport.append(FileUtils.readFileToString(actualReportFile, StandardCharsets.UTF_8));
        System.out.print(preReport.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
As you can see, I am using FileUtils read and write methods with UTF_8 encoding. That's just a guess; I am not sure it's right. I am appending the newly generated docx file's contents to a StringBuilder and printing them on the console, but that's secondary. The main thing is that the docx should be written properly. But no luck.
When it prints, it's all weird characters and nothing is readable. When I try to open the newly generated docx file, it doesn't even open.
Any idea what I should do to get the data in the proper format? I am attaching an image of how my ReportTemplate.docx looks, which I am using as the template to generate this report. I am using commons-io-2.4.jar.
Please guide if you can. Thanks a lot.
You can use Apache POI or docx4j for creating and editing .doc/.docx files. There is no simple way to edit them without such a library: a .docx is a zipped archive of XML parts, not plain text, so reading and writing it as a string corrupts it.
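For illustration, a minimal sketch of the template-replacement idea with Apache POI's XWPF API (the placeholder token and file names are taken from the question; note that Word may split a placeholder across several runs, in which case a simple per-run replace will miss it):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class DocxTemplateFiller {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("ReportTemplate.docx");
             XWPFDocument doc = new XWPFDocument(in)) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                for (XWPFRun run : p.getRuns()) {
                    String text = run.getText(0);
                    if (text != null && text.contains("$filename")) {
                        // setText(..., 0) replaces the run's existing text in place,
                        // so formatting and bullets are preserved.
                        run.setText(text.replace("$filename", "report.csv"), 0);
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("ActualReport.docx")) {
                doc.write(out);
            }
        }
    }
}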

CSVWriter behaves differently on a Unix machine (Tomcat server) for a huge file (size 5000 KB) and creates an empty file; the same code works fine on Windows. Why?

I am writing a CSV file with the help of CSVWriter (Java), but when executing the code on a Unix box with a huge number of records (around 9,000) it creates an empty file.
When I run the same code locally (Eclipse) on Windows, it works fine for the same huge file. Why?
I noticed one thing: if there are around 3,000 records, it works fine on the Unix box as well.
The issue occurs only with the huge file.
I tried using the writer.writeNext() method instead of writeAll(), but the same issue occurs on the Unix box. :(
Note: the file does not have any special characters; it's in English.
Code -->
CSVReader reader = new CSVReader(new FileReader(inputFile), ',', '"');
List<String[]> csvBody = reader.readAll();
int listSize = csvBody.size();
if (listSize > 0) {
    String renameFileNamePath = outputFolder + "//" + existingFileName.replaceFirst("file1", "file2");
    File newFile = new File(renameFileNamePath);
    CSVWriter writer = new CSVWriter(new FileWriter(newFile), ',');
    for (int row = 1; row < listSize; row++) {
        String timeKeyOrTransactionDate = year + "-" + month + "-" + day + " 00:00:00";
        csvBody.get(row)[0] = timeKeyOrTransactionDate;
    }
    // Write to the CSV file which is open
    writer.writeAll(csvBody);
    writer.flush();
    writer.close();
}
reader.close();
The readAll and writeAll methods should only be used with small datasets; otherwise avoid them like the plague. Use the readNext and writeNext methods instead so you don't have to read the entire file into memory.
Note that readNext returns null once there is no more data (end of stream or end of file). I will have to update the javadocs to mention that.
Disclaimer: I am the maintainer of the opencsv project, so please take "avoid like the plague" seriously. Really, that was only put there because most files are small and fit in memory, but when in doubt about how big your dataset will be, avoid putting it all in memory.
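A minimal sketch of that streaming approach, assuming opencsv 5.x (where readNext also declares CsvValidationException) and placeholder file names:

import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import com.opencsv.CSVReader;
import com.opencsv.CSVWriter;

public class StreamCsv {
    public static void main(String[] args) throws Exception {
        try (Reader in = Files.newBufferedReader(Paths.get("file1.csv"), StandardCharsets.UTF_8);
             Writer out = Files.newBufferedWriter(Paths.get("file2.csv"), StandardCharsets.UTF_8);
             CSVReader reader = new CSVReader(in);
             CSVWriter writer = new CSVWriter(out)) {
            String[] row;
            // Only one row is held in memory at a time; readNext returns null at end of input.
            while ((row = reader.readNext()) != null) {
                writer.writeNext(row);
            }
        }
    }
}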
A data error. The Linux machine probably uses the UTF-8 Unicode encoding, which throws an error on the first malformed UTF-8 byte sequence it encounters, whereas a single-byte Windows encoding simply accepts it.
You are using the old utility class FileReader (there is also the equally flawed FileWriter), which uses the platform default encoding and therefore makes the software platform dependent.
You need to do:
Charset charset = Charset.forName("Windows-1252"); // Windows Latin-1
For reading
BufferedReader br = Files.newBufferedReader(inputFile.toPath(), charset);
For writing
Path newFile = Paths.get(renameFileNamePath);
BufferedWriter bw = Files.newBufferedWriter(newFile, charset);
CSVWriter writer = new CSVWriter(bw, ',');
The above uses one particular single-byte encoding, but will probably work with most other single-byte encodings too.
A pity that the file is not in UTF-8, which would allow any script.
The issue has been resolved. It turned out the output directory was also shared with a loader application, which checks for files every minute; the loader picked up the CSV before it was fully written and loaded a 0 KB file into the DB.
So I used a BufferedWriter instead of a FileWriter, wrote the data to a temporary file first, and then renamed it to file2. That worked.
Thanks to all of you for your help and valuable suggestions.
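The write-to-a-temp-file-then-rename trick can be made explicit with java.nio; a sketch, assuming the loader only watches for the final file name:

import java.io.BufferedWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class SafePublish {
    public static void main(String[] args) throws Exception {
        Path tmp = Paths.get("file2.csv.tmp");
        Path target = Paths.get("file2.csv");
        try (BufferedWriter bw = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            bw.write("time_key,other_column");
            bw.newLine();
            // ... write all remaining rows here ...
        }
        // The move is atomic on most filesystems, so a watcher polling the
        // directory never sees a partially written file2.csv.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}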

BufferedReader, read chars in an edittext gives strange chars

OK, I am reading a .docx file via a BufferedReader and want to store the text in an EditText. The .docx is not in English but in a different language (Greek). I use:
File file = new File(file_Path);
try {
    BufferedReader br = new BufferedReader(new FileReader(file));
    String line;
    StringBuilder text = new StringBuilder();
    while ((line = br.readLine()) != null) {
        text.append(line);
    }
    br.close();
    et1.setText(text);
} catch (IOException e) {
    e.printStackTrace();
}
And the result I get is this: [screenshot of garbled characters]
If the characters are in the English language, it works fine. But in my case they aren't. How can I fix this? Thanks a lot.
"Ok, I am reading a .docx file via a BufferedReader"
Well, that's the first problem. BufferedReader is for plain text files. docx files are binary files in a specific format (assuming you mean the kind of file that Microsoft Word saves). You can't just read them like text files. Open the file up in Notepad (not WordPad) and you'll see what I mean.
You might want to look at Apache POI.
From comments:
"Testing to read a .txt file with the same text gave same results too"
That's probably due to using the wrong encoding. FileReader always uses the platform default encoding, which is annoying. Assuming you're using Java 7 or higher, you'd be better off with Files.newBufferedReader:
try (BufferedReader reader = Files.newBufferedReader(path, StandardCharsets.UTF_8)) {
...
}
Adjust the charset to match the one you used when saving your text file, of course - if you have the option of using UTF-8, that's a pretty good choice. (Aside from anything else, pretty much everything can handle UTF-8.)
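And if you do go the Apache POI route for the original .docx, a minimal sketch of extracting its text (the file name is a placeholder):

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxText {
    public static void main(String[] args) throws Exception {
        try (FileInputStream in = new FileInputStream("document.docx");
             XWPFDocument doc = new XWPFDocument(in)) {
            XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
            // getText() returns the document text as proper Unicode,
            // so Greek characters survive intact.
            System.out.println(extractor.getText());
        }
    }
}

et1.setText(...) can then be given that string directly.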
