I am trying to find a specific word in a list of files, and these files can be ASCII, Unicode, or some other format. So far I can only work on ASCII files. Is there any way to do the same operation with other file encodings?
Scanner s = null;
try {
    s = new Scanner(new BufferedReader(new FileReader("C:\\New Microsoft Word Document.docx")));
    while (s.hasNext()) {
        // final String lineFromFile = s.nextLine();
        // if (lineFromFile.contains("DE")) {
        System.out.println(s.next());
        //     break;
        // }
    }
} finally {
    if (s != null) {
        s.close();
    }
}
I get the following results:
Q[µM¡°‰”Ø÷Þ3{:½¹®’)xTÖä¬?µXFÚB™QÎÞ‡Ïé=K0SˆÊÈÙ?õº×W?áÂ&¤6˜³qî?s”cÐ3ëÀÐJi½?^ýˆ;!¿Äøm«uÇ¥5LHCô`ÝΔbR…¤?§Ï+gF,y\í‹Q9S:êãw~Pá¡Â=‰p®RRª?OM±Ç•®™2R.÷àX9¼!ð#
qe—i;`{¥fzU#2>¼Mä|f}Á
+'šªÎNÛ
docx is not a text format with a different encoding; it's a completely different, non-text file format. Basically, it's a zip archive of various files and folders (with the main data in some XML files). You can't just read it as a text file; you need to use a library such as Apache POI, or some kind of file converter, to obtain the text from it.
This has nothing to do with a different text encoding.
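For instance, Apache POI can pull the plain text out of the .docx so the word search can run on that. A minimal sketch, assuming POI's XWPF module is on the classpath (the file path and the word "DE" are taken from the question):

import java.io.FileInputStream;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class DocxSearch {
    public static void main(String[] args) throws Exception {
        try (XWPFDocument doc = new XWPFDocument(
                new FileInputStream("C:\\New Microsoft Word Document.docx"));
             XWPFWordExtractor extractor = new XWPFWordExtractor(doc)) {
            // getText() returns the document text, roughly one paragraph per line
            for (String line : extractor.getText().split("\\R")) {
                if (line.contains("DE")) {
                    System.out.println(line);
                }
            }
        }
    }
}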
docx is a special format from Microsoft which holds various information about a document (packed as a zip archive).
You could read the file using Java's ZipFile and get the entry word/document.xml.
document.xml contains the text of the Word document. You can then read through this file and output specific lines.
Pseudocode:
ZipFile file = new ZipFile("doc.docx");
InputStream input = file.getInputStream(file.getEntry("word/document.xml"));
input now contains the XML that holds the text information.
EDIT: document.xml contains the text of the document, but there are many XML tags which you would have to filter out; see the sketch below.
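A rough, runnable version of the idea (Java 9+ for readAllBytes; the regex tag-stripping is a crude stand-in for proper XML parsing or Apache POI):

import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipFile;

public class DocxZipRead {
    public static void main(String[] args) throws Exception {
        try (ZipFile zip = new ZipFile("doc.docx");
             InputStream input = zip.getInputStream(zip.getEntry("word/document.xml"))) {
            String xml = new String(input.readAllBytes(), StandardCharsets.UTF_8);
            // naive: replace every XML tag with a space, leaving only the text
            System.out.println(xml.replaceAll("<[^>]+>", " "));
        }
    }
}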
I have a weird problem with files.
I intend to modify the timing of an .srt file, but writing the new file turns out to be a weird task.
Here's a sample of the code I wrote:
import java.io.*;
import java.nio.charset.Charset;

public class ReaderWriter {
    public static void main(String[] args) throws IOException {
        InputStream inputStream = new FileInputStream("D:\\E\\Movies\\English\\1960's\\TheApartment1960.srt");
        Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-8"));

        OutputStream outputStream = new FileOutputStream("output.srt");
        Writer writer = new OutputStreamWriter(outputStream, Charset.forName("UTF-8"));

        int data = reader.read();
        while (data != -1) {
            char theChar = (char) data;
            writer.write(theChar);
            data = reader.read();
        }
        reader.close();
        writer.close();
    }
}
This is an image from the original file:
However, the resulting file looks like this:
I searched a lot for a solution but in vain. Any help, please.
First a few points:
There is nothing wrong with your Java code. If I use it to read an input file containing Arabic text encoded in UTF-8 it creates the output file encoded in UTF-8 with no problems.
I don't think there is a font issue. Since you can successfully display the content of the input file there is no reason you cannot also successfully display the content of a valid output file.
Those black diamonds with question marks in the output file are replacement characters which are "used to replace an incoming character whose value is unknown or unrepresentable in Unicode". This indicates that the input file you are reading is not UTF-8 encoded, even though the code explicitly states that it is. I can reproduce similar results to yours if the input file is UTF-16 encoded, but specified as UTF-8 in the code.
Alternatively, the problem can be reproduced in the other direction: if the input file truly is UTF-8 encoded, specify it as UTF-16 in the code. For example, here is a valid UTF-8 input file with some Arabic text, where the code (incorrectly) stated Reader reader = new InputStreamReader(inputStream, Charset.forName("UTF-16"));:
يونكود في النظم القائمة وفيما يخص التطبيقات الحاسوبية، الخطوط، تصميم النصوص والحوسبة متعددة اللغات.
And here is the output file, containing the replacement characters because the input stream of the UTF-8 file was incorrectly processed as UTF-16:
���⃙臙訠���ꟙ蓙苘Ꟙꛙ藘ꤠ���諘께딠�����ꟙ蓘귘Ꟙ동裘꣙諘꧘谠����뗙藙諙蔠���⃙裘ꟙ蓘귙裘돘꣘ꤠ���⃘ꟙ蓙蓘뫘Ꟙꨮ�
Given all that, simply ensuring that the encoding of the input file is specified correctly in the InputStreamReader() constructor should solve your problem. To verify this, just create another input file and save it with UTF-8 character encoding, then run your code. If it works, then you know that the problem was that the encoding of the input file was not UTF-8.
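For example, if the input file turns out to be UTF-16 (a common result of saving a file as "Unicode" in Windows Notepad), only the reader's charset needs to change. A minimal sketch under that assumption:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class ReaderWriterFixed {
    public static void main(String[] args) throws IOException {
        // assumption: the .srt file is really UTF-16; adjust to whatever it actually is
        try (Reader reader = new InputStreamReader(
                new FileInputStream("TheApartment1960.srt"), StandardCharsets.UTF_16);
             Writer writer = new OutputStreamWriter(
                new FileOutputStream("output.srt"), StandardCharsets.UTF_8)) {
            int data;
            while ((data = reader.read()) != -1) {
                writer.write(data);
            }
        }
    }
}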
I am reading from a text file (a file in which I wrote some sentences) and printing its contents. Everything was going well until I copied the same sentences from a PDF file; now nothing is printed to the console, and I just get "build successful".
this is my code:
File f = new File("input.txt");
Scanner sc = new Scanner(f);
while (sc.hasNext()) {
    String line = sc.nextLine();
    int i = 0;
    while (i < line.length()) {
        char c = line.charAt(i);
        System.out.println(c);
        i++;
    }
}
sc.close();
The contents of the text file (whether I write the sentences myself or copy them from the PDF):
{sample program in TINY language- computes factorial}
read x;{input an integer}
if 0<x then {don’t compute if x<=0}
fact:=1;
repeat
fact:=fact*x;
x:=x-1
What am I doing wrong, and what should I do if I want to copy sentences from a PDF file into the text file I am reading from?
It's working well for me. I even created a PDF from your text input, copied the text back to a .txt file, and it still worked. I used Notepad++ to create my .txt file.
If you want to extract text or other information from a PDF file, use Apache's PDFBox library.
PDFBox is the best library for this purpose; it's comprehensive and really quite easy to use if you're just doing basic text extraction.
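For reference, a minimal extraction sketch assuming the PDFBox 2.x API (the input filename is illustrative):

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextExtract {
    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File("input.pdf"))) {
            // PDFTextStripper walks the page content and returns plain text
            System.out.println(new PDFTextStripper().getText(doc));
        }
    }
}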
Need some quick help. I am trying to write a Java program to generate a report. I have the report template in a docx file.
What I want to do is use that docx file as a template, fill data into it multiple times for various records, and write the result to a new docx file. The main thing is that I want to maintain the formatting and indentation of the contents inside the docx file; they are bulleted data, and that's where the problem is.
Below is the piece of code handling the above operation:
public void readWriteDocx(HashMap<String, String> detailsMap) {
    try {
        File reportTemplateFile = new File("ReportTemplate.docx");
        File actualReportFile = new File("ActualReport.docx");
        StringBuilder preReport = new StringBuilder();
        preReport.append("Some details about pre report goes here...: ");
        preReport.append(System.lineSeparator());
        String docxContent = "";
        for (Map.Entry<String, String> entry : detailsMap.entrySet()) {
            docxContent = FileUtils.readFileToString(reportTemplateFile, StandardCharsets.UTF_8);
            // code to fetch and get data to insert into docxContent
            docxContent = docxContent.replace("$filename", keyFilename);
            docxContent = docxContent.replace("$expected", expectedFile);
            docxContent = docxContent.replace("$actual", actualFile);
            docxContent = docxContent.replace("$reportCount", String.valueOf(reportCount));
            docxContent = docxContent.replace("$diffMessage", key);
            FileUtils.writeStringToFile(actualReportFile, docxContent, StandardCharsets.UTF_8, true);
        }
        preReport.append(FileUtils.readFileToString(actualReportFile, StandardCharsets.UTF_8));
        System.out.print(preReport.toString());
    } catch (IOException e) {
        e.printStackTrace();
    }
}
As you can see, I am using the FileUtils read and write methods with UTF-8 encoding. That's just a guess; I am not sure it's right. I am appending the newly generated docx file's contents to a StringBuilder and printing it to the console, but that's secondary. The main thing is that the docx should be written properly, but no luck.
When it prints, it's all weird characters and nothing is readable. When I try to open the newly generated docx file, it doesn't even open.
Any idea what I should do to get the data in the proper format? I am attaching an image of how my ReportTemplate.docx looks; I am using it as a template to generate this report. I am using commons-io-2.4.jar.
Please guide if you can. Thanks a lot.
You can use Apache POI or docx4j for creating and editing .doc and .docx files; there is no simple way to edit them without such libraries. A .docx is a zip archive, not plain text, so reading and writing it with FileUtils as a UTF-8 string corrupts it.
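A minimal sketch of the placeholder-replacement idea using POI's XWPF API. The replacement value here is illustrative, and note one caveat: Word sometimes splits a placeholder like $filename across several runs, in which case the runs must be merged first.

import java.io.FileInputStream;
import java.io.FileOutputStream;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

public class DocxTemplateFill {
    public static void main(String[] args) throws Exception {
        try (XWPFDocument doc = new XWPFDocument(new FileInputStream("ReportTemplate.docx"))) {
            for (XWPFParagraph p : doc.getParagraphs()) {
                for (XWPFRun run : p.getRuns()) {
                    String text = run.getText(0);
                    if (text != null && text.contains("$filename")) {
                        // replacing the run's text keeps its formatting (bullets, fonts) intact
                        run.setText(text.replace("$filename", "example.txt"), 0);
                    }
                }
            }
            try (FileOutputStream out = new FileOutputStream("ActualReport.docx")) {
                doc.write(out);
            }
        }
    }
}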
I need to read an Excel (.xls) file that I'm receiving.
Using the regular charsets like UTF-8, Cp1252, ISO-8859-1, and UTF-16LE, none of these helped me; the characters are still malformed.
So I searched and ended up using juniversalchardet. It told me that the charset was MacCyrillic, so I used MacCyrillic to read the file, but got the same weird outcome.
When I open the file in Excel everything is fine, all the characters are fine; since it's Portuguese, it's filled with Ç, ~, and such. But opening it with Notepad or through Java, the file is all messed up.
But if I open the file in Excel and then save it again as .txt, it becomes readable.
My method to find the charset:
public static void lerCharset(String fileName) throws IOException {
    byte[] buf = new byte[50000000];
    FileInputStream fis = new FileInputStream(fileName);

    // (1)
    UniversalDetector detector = new UniversalDetector(null);

    // (2)
    int nread;
    while ((nread = fis.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, nread);
    }

    // (3)
    detector.dataEnd();

    // (4)
    String encoding = detector.getDetectedCharset();
    if (encoding != null) {
        System.out.println("Detected encoding = " + encoding);
    } else {
        System.out.println("No encoding detected.");
    }

    // (5)
    detector.reset();
    fis.close();
}
How can I discover the correct charset?
Should I try a different approach, like making my Java code re-save the Excel file and then start reading?
If I'm understanding your question, you're trying to read the Excel file like a text file.
The challenge is that .xls files are actually binary files containing the text, formatting, sheet information, macro information, etc., so no text charset will decode them correctly.
You'd either need to save the files as .csv (either via Excel before running your program or through your program directly), upgrade them to .xlsx (which numerous libraries can then read as XML), or use a library (such as Apache POI or similar); you could even query the data out using ADO.
Good luck, and I hope that's what you were asking in your question.
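As an illustration, a minimal sketch reading a binary .xls with Apache POI's HSSF API (the filename is illustrative); POI decodes the characters internally, so no charset guessing is needed:

import java.io.FileInputStream;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;

public class XlsRead {
    public static void main(String[] args) throws Exception {
        try (HSSFWorkbook workbook = new HSSFWorkbook(new FileInputStream("file.xls"))) {
            for (Row row : workbook.getSheetAt(0)) {
                for (Cell cell : row) {
                    System.out.print(cell + "\t"); // toString() gives decoded cell text
                }
                System.out.println();
            }
        }
    }
}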
Code (this is JExcelAPI's jxl.WorkbookSettings; note that setEncoding is an instance method, not a static one):
WorkbookSettings workbookSettings = new WorkbookSettings();
workbookSettings.setEncoding("Cp1252");
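Fleshed out into usable form, assuming the JExcelAPI (jxl) library and that the file's strings really are Cp1252-encoded:

import java.io.File;
import jxl.Sheet;
import jxl.Workbook;
import jxl.WorkbookSettings;

public class JxlRead {
    public static void main(String[] args) throws Exception {
        WorkbookSettings workbookSettings = new WorkbookSettings();
        workbookSettings.setEncoding("Cp1252"); // charset used for strings inside the .xls
        Workbook workbook = Workbook.getWorkbook(new File("file.xls"), workbookSettings);
        Sheet sheet = workbook.getSheet(0);
        System.out.println(sheet.getCell(0, 0).getContents()); // column 0, row 0
        workbook.close();
    }
}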
I have a few directories/files with Japanese characters in their names. If I try to read a filename (not the contents) containing, for example, a ク, I receive a String containing a �. If I try to create a file/directory whose name contains a ク, a file/directory appears containing a ? instead.
As an example, I list the files with:
File file = new File(".");
String[] filesAndDirs = file.list();
The filesAndDirs array now contains the directories with the special characters, but the String only contains ����. It seems there is nothing to decode, because getBytes() shows only "-17 -65 -67" for every such char in the filename, even for different characters.
I use Mac OS 10.8.2, Java 7_10 and NetBeans.
Any ideas?
Thank you in advance :)
Those bytes are 0xef 0xbf 0xbd, which is the UTF-8-encoded form of the \ufffd character you're seeing instead of the Japanese characters. It appears whatever OS function Java is using to list the files is in fact returning those incorrect characters.
Perhaps Files.newDirectoryStream will be more reliable. Try this instead:
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

try (DirectoryStream<Path> dir = Files.newDirectoryStream(Paths.get("."))) {
    for (Path child : dir) {
        String filename = child.getFileName().toString();
        System.out.println("name=" + filename);
        // print each char as hex to spot U+FFFD replacement characters
        for (char c : filename.toCharArray()) {
            System.out.printf("%04x ", (int) c);
        }
        System.out.println();
    }
}
It's a bug in the old Java File API (maybe just on a Mac). Anyway, it's all fixed in the new java.nio.
I have several files containing Unicode characters in the filename and content that failed to load using java.io.File and related classes. After converting all my code to use java.nio.Path, everything started working. And I replaced org.apache.commons.io.FileUtils (which has the same problem) with java.nio.Files...
...and be sure to read and write the content of the file using an appropriate charset, for example: Files.readAllLines(myPath, StandardCharsets.UTF_8)
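A minimal sketch of that combination, using a hypothetical Japanese filename:

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class NioRead {
    public static void main(String[] args) throws Exception {
        Path myPath = Paths.get("ク.txt"); // hypothetical file with a Japanese name
        // an explicit charset makes the decoding predictable
        for (String line : Files.readAllLines(myPath, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}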