Use FileOutputStream to Create a UTF-8 PDF File

Use FileOutputStream to Create a UTF-8 PDF File - java

I am using JasperReports and DynamicReports with this piece of java code to create a report in pdf format which contains utf-8 characters, the problem is generated pdf file does not contain utf-8 characters at all, like if they have been replaced with "". is there any thing that i should be aware of when using OutputStream to create a utf-8 file?
public void toPdf(String path){
OutputStream outHtml;
try {
outHtml = new FileOutputStream(path);
jasperBuilder.toPdf(outHtml);
} catch (Exception e1) {
logger.error("failed to create PDF", e1);
}
}
this may be notable that creating XLS and HTML file faces no such problem.
note that there are lots of lines of code under jasperBuilder.toPdf(outHtml); that i have traced and no where in those lines my utf-8 characters are being eliminated. so i guess the devil is in outHtml = new FileOutputStream(path);

I managed to solve it. It was a font and encoding problem. Just followed tutorial here, but change <pdfEncoding>UTF-8</pdfEncoding> to <pdfEncoding>Identity-H</pdfEncoding> in fonts.xml
<fontFamilies>
<fontFamily name="FreeUniversal">
<normal>/home/moien/tahoma.ttf</normal>
<bold>/home/moien/tahoma.ttf</bold>
<italic>/home/moien/tahoma.ttf</italic>
<boldItalic>/home/moien/tahoma.ttf</boldItalic>
<pdfEncoding>Identity-H</pdfEncoding>
<pdfEmbedded>true</pdfEmbedded>
</fontFamily>
</fontFamilies>
Now I have another challenge to solve, making font URL relative!

A FileOutputStream is completely agnostic of the "stuff" that gets written to it. It just writes bytes. If characters are being eliminated or mangled, then this is being caused by whatever is generating the bytes to be written to the stream.
In this case, my money would be on the way that you have configured / used the jasperBuilder object prior to running this code.

Related

Creating a text file with java without using absolute path

following the question I asked before How to have my java project to use some files without using their absolute path? I found the solution but another problem popped up in creating text files that I want to write into.here's my code:
private String pathProvider() throws Exception {
//finding the location where the jar file has been located
String jarPath=URLDecoder.decode(getClass().getProtectionDomain().getCodeSource().getLocation().getPath(), "UTF-8");
//creating the full and final path
String completePath=jarPath.substring(0,jarPath.lastIndexOf("/"))+File.separator+"Records.txt";
return completePath;
}
public void writeRecord() {
try(Formatter writer=new Formatter(new FileWriter(new File(pathProvider()),true))) {
writer.format("%s %s %s %s %s %s %s %s %n", whichIsChecked(),nameInput.getText(),lastNameInput.getText()
,idInput.getText(),fieldOfStudyInput.getText(),date.getSelectedItem().toString()
,month.getSelectedItem().toString(),year.getSelectedItem().toString());
successful();
} catch (Exception e) {
failure();
}
}
this works and creates the text file wherever the jar file is running from but my problem is that when the information is been written to the file, the numbers,symbols, and English characters are remained but other characters which are in Persian are turned into question marks. like: ????? 111 ????? ????.although running the app in eclipse doesn't make this problem,running the jar does.
Note:I found the code ,inside pathProvider method, in some person's question.

Your pasted code and the linked question are complete red herrings - they have nothing whatsoever to do with the error you ran into. Also, that protection domain stuff is a hack and you've been told before not to write data files next to your jar files, it's not how OSes (are supposed to) work. Use user.home for this.
There is nothing in this method that explains the question marks - the string, as returned, has plenty of issues (see above), but NOT that it will result in question marks in the output.
Files are fundamentally bytes. Strings are fundamentally characters. Therefore, when you write code that writes a string to a file, some code somewhere is converting chars to bytes.
Make sure the place where that happens includes a charset encoding.
Use the new API (I think you've also been told to do this, by me, in an earlier question of yours) which defaults to UTF-8. Alternatively, specify UTF-8 when you write. Note that the usage of UTF-8 here is about the file name, not the contents of it (as in, if you put persian symbols in the file name, it's not about persian symbols in the contents of the file / in the contents you want to write).
Because you didn't paste the code, I can't give you specific details as there are hundreds of ways to do this, and I do not know which one you used.
To write to a file given a String representing its path:
Path p = Paths.get(completePath);
Files.write("Hello, World!", p);
is all you need. This will write as UTF_8, which can handle persian symbols (because the Files API defaults to UTF-8 if you specify no encoding, unlike e.g. new File, FileOutputStream, FileWriter, etc).
If you're using outdated APIs: new BufferedWriter(new OutputStreamWriter(new FileOutputStream(thePath), StandardCharsets.UTF-8) - but note that this is a resource leak bug unless you add the appropriate try-with-resources.
If you're using FileWriter: FileWriter is broken, never use this class. Use something else.
If you're converting the string on its own, it's str.getBytes(StandardCharsets.UTF_8), not str.getBytes().

Getting text from PDF using Apache PDFBox

How can I get infromation about the structure of pdf, I mean text or pic? I need my programm to move pdf without text in other folder, but now I'm getting just an empty txt file.
try (FileWriter writer = new FileWriter(outputFile)) {
PDDocument document = new PDDocument().load(file);
PDFTextStripper pdfTextStripper = new PDFTextStripper();
String text = pdfTextStripper.getText(document);
writer.write(text);
document.close();
} catch (IOException e){
e.printStackTrace();
}
Also, have a problem with getting text from saved in pdf web-pages. It looks like:
I think there is something wrong with encoding, but don't know what to do

Your code works alright, your text viewer assumes a wrong encoding.
Using your code and the same PDFBox version as you I get proper extracted text:
But when I force my viewer to assume UTF-16 encoding, I get something very similar to what you get:
The file itself does not indicate any specific encoding by a BOM or anything:
Thus, your text viewer either incorrectly guesses UTF-16 encoding or is configured to use it.
Thus, either switch your text viewer to use UTF-8 or explicitly tell your FileWriter to use UTF-16.
Depending on your specific installation, the file encoding might actually be different. As my UTF-16 view looks so very much like yours, though, the encoding very likely is at least similar to UTF-8, probably some ISO 8859-x...

java utf8 encoding outputstream not working

I need to write a program which is able to write UTF-8 data into a file.
I found out examples on the internet, however, I am not able to progress to desired result.
Code:
import java.io.BufferedWriter;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
public class UTF8WriterDemo {
public static void main(String[] args) {
Writer out = null;
try {
out = new BufferedWriter(
new OutputStreamWriter(new FileOutputStream("c://java//temp.txt"), "UTF-8"));
String text = "This texáát will be added to File !!";
out.write(text);
out.close();
} catch (Exception e) {
e.printStackTrace();
}
}
}
Everything run succesfully, but at the end I see special characters not showing properly:
This texÃ¡Ã¡t will be added to File !!
I tried several examples from the internet with the same result.
I use Visual Studio code.
Where could be the problem please?
Thank you

Your code is correct. You probably already have a file named temp.txt, and therefore Java writes text to the existing file (replacing previous content). What can be a problem is an encoding, that you already have set in your file.
In other words, you can't (or at least shouldn't) write UTF-8 text to the file with for example WINDOWS-1250 encoding or you would get an exact result as you have described.
If you didn't have this file, Java would automatically create a file with UTF-8 encoding.
Possible solutions:
Change encoding of your current file (usually you can open it in any text editor, use Save as and then specify encoding as UTF-8.
Remove this file and Java will create it automatically with proper encoding.
By the way, you should use StandardCharsets class instead of using String charsetName in order to avoid UnsupportedEncodingException:
new OutputStreamWriter(new FileOutputStream("temp.txt"), StandardCharsets.UTF_8)

When you say "I see special characters not showing properly", where are you seeing them?
What you say/show next looks like the string, utf-8 encoded (i.e. the accented a's are each represented by 2 chars, in what appears to be the appropriate encoding).
What I would expect the issue to be is that the java code is not outputting a BOM at the beginning of the file, leaving the interpretation of utf-8 sequences up to the discretion of the reader.

Filtering Wikipedia's XML dump: error on some accents

I'm trying to index Wikpedia dumps. My SAX parser make Article objects for the XML with only the fields I care about, then send it to my ArticleSink, which produces Lucene Documents.
I want to filter special/meta pages like those prefixed with Category: or Wikipedia:, so I made an array of those prefixes and test the title of each page against this array in my ArticleSink, using article.getTitle.startsWith(prefix). In English, everything works fine, I get a Lucene index with all the pages except for the matching prefixes.
In French, the prefixes with no accent also work (i.e. filter the corresponding pages), some of the accented prefixes don't work at all (like Catégorie:), and some work most of the time but fail on some pages (like Wikipédia:) but I cannot see any difference between the corresponding lines (in less).
I can't really inspect all the differences in the file because of its size (5 GB), but it looks like a correct UTF-8 XML. If I take a portion of the file using grep or head, the accents are correct (even on the incriminated pages, the <title>Catégorie:something</title> is correctly displayed by grep). On the other hand, when I rectreate a wiki XML by tail/head-cutting the original file, the same page (here Catégorie:Rock par ville) gets filtered in the small file, not in the original…
Any idea ?
Alternatives I tried:
Getting the file (commented lines were tried wihtout success*):
FileInputStream fis = new FileInputStream(new File(xmlFileName));
//ReaderInputStream ris = ReaderInputStream.forceEncodingInputStream(fis, "UTF-8" );
//(custom function opening the stream,
//reading it as UFT-8 into a Reader and returning another byte stream)
//InputSource is = new InputSource( fis ); is.setEncoding("UTF-8");
parser.parse(fis, handler);
Filtered prefixes:
ignoredPrefix = new String[] {"Catégorie:", "Modèle:", "Wikipédia:",
"Cat\uFFFDgorie:", "Mod\uFFFDle:", "Wikip\uFFFDdia:", //invalid char
"CatÃ©gorie:", "ModÃ¨le:", "WikipÃ©dia:", // UTF-8 as ISO-8859-1
"Image:", "Portail:", "Fichier:", "Aide:", "Projet:"}; // those last always work
* ERRATUM
Actually, my bad, that one I tried work, I tested the wrong index:
InputSource is = new InputSource( fis );
is.setEncoding("UTF-8"); // force UTF-8 interpretation
parser.parse(fis, handler);

Since you write the prefixes as plain strings into your source file, you want to make sure that you save that .java file in UTF-8, too (or any other encoding that supports the special characters you're using). Then, however, you have to tell the compiler which encoding the file is in with the -encoding flag:
javac -encoding utf-8 *.java
For the XML source, you could try
Reader r = new InputStreamReader(new FileInputStream(xmlFileName), "UTF-8");
InputStreams do not deal with encodings since they are byte-based, not character-based. So, here we create a Reader from an FileInputStream - the latter (stream) doesn't know about encodings, but the former (reader) does, because we give the encoding in the constructor.

Convert from Codepage 1252 (Windows) to Java, in Java

I have some strings in Java (originally from an Excel sheet) that I presume are in Windows 1252 codepage. I want them converted to Javas own unicode format. The Excel file was parsed using the JXL package, in case that matter.
I will clarify: apparently the strings gotten from the Excel file look pretty much like it already is some kind of unicode.
WorkbookSettings ws = new WorkbookSettings();
ws.setCharacterSet(someInteger);
Workbook workbook = Workbook.getWorkbook(new File(filename), ws);
Sheet s = workbook.getSheet(sheet);
row = s.getRow(4);
String contents = row[0].getContents();
This is where contents seems to contain something unicode, the åäö are multibyte characters, while the ASCII ones are normal single byte characters. It is most definitely not Latin1. If I print the "contents" string with printLn and redirect it to a hello.txt file, I find that the letter "ö" is represented with two bytes, C3 B6 in hex. (195 and 179 in decimal.)
[edit]
I have tried the suggestions with different codepages etc given below, tried converting from Cp1252 etc. There was some kind of conversion, because I would get some other kind of gibberish instead. As reference I always printed an "ö" string hand coded into the source code, to verify that there was not something wrong with my terminal or typefaces or anything. The manually typed "ö" always worked.
[edit]
I also tried WorkBookSettings as suggested in the comments, but I looked in the code for JXL and characterSet seems to be ignored by parsing code. I think the parsing code just looks at whatever encoding the XLS file is supposed to be in.

WorkbookSettings ws = new WorkbookSettings();
ws.setEncoding("CP1250");
Worked for me.

If none of the answer above solve the problem, the trick might be done like this:
String myOutput = new String (myInput, "UTF-8");
This should decode the incoming string, whatever its format.

When Java parses a file it uses some encoding to read the bytes on the disk and create bytes in memory. The default encoding varies from platform to platform. Java's internal String representation is Unicode already, so if it parses the file with the right encoding then you are already done; just write out the data in any encoding you want.
If your strings appear corrupted when you look at them in Java, it is probably because you are using the wrong encoding to read the data. Excel is probably using UTF-16 (Little-Endian I think) but I'd expect a library like JXL should be able to detect it appropriately. I've looked at the Javadocs for JXL and it doesn't do anything with character encodings. I imagine it auto-detects any encodings as it needs to.
Do you just need to write the already loaded strings to a text file? If so, then something like the following will work:
String text = getCP1252Text(); // doesn't matter what the original encoding was, Java always uses Unicode
FileOutputStream fos = new FileOutputStream("test.txt"); // Open file
OutputStreamWriter osw = new OutputStreamWriter(fos, "UTF-16"); // Specify character encoding
PrintWriter pw = new PrintWriter(osw);
pw.print(text ); // repeat as needed
pw.close(); // cleanup
osw.close();
fos.close();
If your problem is something else please edit your question and provide more details.

You need to specify the correct encoding when the file is parsed - once you have a Java String based on the wrong encoding, it's too late.
JXL allows you to specify the encoding by passing a WorkbookSettings object to the factory method.

"windows-1252"/"Cp1252" is not required to be supported by JREs, but is by Sun's (and presumably most others). See the "Supported Encodings" in your JDK documentation. Then it's just a matter of using String, InputStreamReader or similar to decode the bytes into chars.

FileInputStream fis = new FileInputStream (yourFile);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis,"CP1250"));
And do with reader whatever you'd do directly with file.

Your description indicates that the encoding is UTF-8 and indeed C3 B6 is the UTF-8 encoding for 'ö'.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.