How to detect the encoding of a PPTX file? - java

My question is, how can I get the encoding of a pptx file in Java?
(I'm using apache poi)
File f = new File(filename);
XMLSlideShow ppt = new XMLSlideShow(new FileInputStream(f));
The reason I need to know the encoding is that later on I POST some data from the file, which I have saved in a JSON string, and it is at this stage that my problem occurs.
When doing an HTTP POST the encoding is changed, and I figured this problem could be solved if I knew the encoding of the data in my JSON string. Then I could set this encoding in my HTTP POST.
EDIT/CLARIFICATION:
The problem is the Swedish letters å, ä and ö.
å becomes Ã¥
ä becomes Ã¤
ö becomes Ã¶
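Garbling of this shape is the usual signature of UTF-8 bytes being decoded as ISO-8859-1 (or windows-1252) somewhere between sender and receiver. A minimal sketch that reproduces the effect with the standard library only; the class and method names are made up for illustration and are not part of POI:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: reproduces the garbling described in the question.
class MojibakeDemo {

    // Encodes the text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Reverses the damage; this only works if no byte was lost along the way.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String garbled = misdecode("\u00e5\u00e4\u00f6"); // the letters å, ä, ö
        System.out.println(garbled);
        System.out.println(repair(garbled));
    }
}
```

If this is what happens here, the encoding of the PPTX file is probably not the issue: Java strings are already Unicode once the file has been parsed, so the likely fix is to encode the POST body as UTF-8 and declare that charset in the request's Content-Type header.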

Java and POI aside, to get at the encoding of a PowerPoint PPTX file, you have to examine the underlying XML for the slides:
Unzip the pptx file (for a manual look, any zip utility like 7-zip will do).
Under the zip root, find the ppt/slides directory.
Typically each slide is slide#.xml; open the one you want to examine.
Read the first line: <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
In most cases, I would expect the encoding to be the same across all slides (meaning that you could probably use the root-level "[Content_Types].xml" file as a proxy for encoding of the entire archive).
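The manual steps above can be sketched in plain Java, without POI. This is only an illustration under a few assumptions: the class name `PptxEncodingProbe` and its method are hypothetical, and the slides are assumed to sit directly under ppt/slides as slide#.xml, as described above. It uses the JDK's zip and StAX APIs to read the encoding named in each slide's XML declaration:

```java
import java.io.InputStream;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper: lists the declared encoding of every slide part.
class PptxEncodingProbe {

    static Map<String, String> slideEncodings(String pptxPath) throws Exception {
        Map<String, String> result = new LinkedHashMap<>();
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try (ZipFile zip = new ZipFile(pptxPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                // Assumption: slides are stored as ppt/slides/slide#.xml.
                if (!name.matches("ppt/slides/slide\\d+\\.xml")) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    XMLStreamReader reader = factory.createXMLStreamReader(in);
                    // Encoding from the <?xml ... ?> declaration, or null if none is declared.
                    result.put(name, reader.getCharacterEncodingScheme());
                    reader.close();
                }
            }
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        slideEncodings(args[0])
                .forEach((name, enc) -> System.out.println(name + " -> " + enc));
    }
}
```

Run it with the path to a .pptx file as the only argument; each slide is printed with the encoding it declares.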

Related

ServletFileUpload encoding issue

I am using the Apache Commons FileUpload API for uploading a file from the UI. In the file there is an entry as given below:
<name>m²</name>
But after loading the file, the character turns into the following:
<name>mÂ²</name>
I am not sure whether this has something to do with encoding. Please advise.
It appears the uploaded file is XML. In the first place, the file should be plain text and use entity encoding for the superscript 2. Ideally <name>m²</name> should be <name>m&#178;</name>, where &#178; is the superscript 2 character. XML processors are supposed to translate this entity accordingly.
See the Wikipedia article on the ISO-8859-1 charset, which shows the character chart under its Codepage layout section.
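As a rough illustration of the entity encoding suggested above, the following sketch replaces every non-ASCII character in already well-formed XML text with a decimal numeric character reference. The class and method names are hypothetical, and the sketch deliberately does not escape markup characters such as < or &:

```java
// Hypothetical helper: rewrites non-ASCII characters as numeric character references.
class XmlEntityEscaper {

    static String escapeNonAscii(String xmlText) {
        StringBuilder out = new StringBuilder();
        xmlText.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);               // plain ASCII is copied unchanged
            } else {
                out.append("&#").append(cp).append(';'); // e.g. U+00B2 becomes &#178;
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00b2 is the superscript 2 character from the question.
        System.out.println(escapeNonAscii("<name>m\u00b2</name>"));
    }
}
```

A file written this way contains only ASCII bytes, so it survives a wrong charset guess on the receiving side.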

Blank PDF while downloading

I am facing a very strange issue. I am trying to send a PDF file as an attachment from my Struts application using the code below:
JasperReport jrReport = (JasperReport) JRLoader.loadObject(jasperReport);
JasperPrint jasperPrint = JasperFillManager.fillReport(jrReport, parameters, dataSource);
jasperPrint.setName(fileNameTobeGivenToExportedReport);
response.reset();
response.setContentType("application/pdf");
response.setHeader("Content-Disposition", "attachment; filename=\"" + fileNameTobeGivenToExportedReport + ".pdf" + "\"");
response.setHeader("Cache-Control", "private");
JasperExportManager.exportReportToPdfStream(jasperPrint, response.getOutputStream());
But the PDF that is downloaded comes with no data, meaning it shows a blank page.
When I added the lines below to the above code to save the PDF file in my D: directory:
File pdf = new File("D:\\sample22.pdf");
JasperExportManager.exportReportToPdfStream(jasperPrint, new FileOutputStream(pdf));
The file that gets generated is fine, meaning it has all the data. One thing I noticed is that the file downloaded from the browser and "sample22.pdf" have the same size.
I read an article that says it might be an issue with server configuration, as our server might be corrupting the output stream. This is the article that I read: Creating PDF from Servlet.
The article says:
This can happen when your server flattens all bytes with a value higher than 127. Consult your web (or application) server manual to find out how to make sure binary data is sent correctly to the browser.
I am using Struts 1.x, JBoss 6, iReport 1.2.
Suppose that you have a simple "Hello World" PDF document:
When you open this document, you see that the file structure uses ASCII characters, but that the actual content of the page is compressed to a binary stream:
You don't see the words "Hello World" anywhere; they are compressed, along with the PDF syntax that contains the info needed to draw these words on the page, into this stream:
xœ+är
á26S°00SIá2PÐ5´ 1ôÝBÒ¸4<RsròÂó‹rR5C²€j#*\C¸¹ Çq°
Now suppose that a process flattens all the non-ASCII characters into ASCII. I've done this manually, as you can see in the next screenshot:
I can still open the document, because I didn't change anything in the file structure: there is still a /Pages tree with a single /Page dictionary. From the syntactical point of view, the file looks OK, so I can open it in Adobe Reader:
As you can see, the words "Hello World" are gone. The stream containing the syntax to render these words was corrupted (in my case manually, in your case by the server, or by Struts, or by whatever process you are using that thinks you are creating plain text instead of a binary file).
What you need to do is find the place where this happens. Maybe Struts is the culprit. Maybe you are (unintentionally) using Struts as if you were creating a plain text file. It is hard to tell remotely. This is a typical problem caused by a configuration issue. Only somebody with access to your configuration can solve this.
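The kind of corruption described above can be simulated in a few lines. This is only a sketch of the effect, not of what Struts or JBoss actually do internally; the class and method names are hypothetical. It pushes binary data through a text layer that only knows ASCII and shows that the size stays the same while the content changes, which matches the observation that both PDF files have the same size:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo: what happens when binary data is treated as ASCII text.
class BinaryAsTextDemo {

    static byte[] flattenToAscii(byte[] data) {
        // One char per byte, so nothing is lost at this step.
        String asText = new String(data, StandardCharsets.ISO_8859_1);
        // Every char above 127 cannot be encoded in ASCII and is replaced by '?'.
        return asText.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // First bytes of the compressed stream shown above, as windows-1252 byte values.
        byte[] original = {0x78, (byte) 0x9C, 0x2B, (byte) 0xE4, 0x72};
        byte[] flattened = flattenToAscii(original);
        System.out.println("same length:  " + (original.length == flattened.length));
        System.out.println("same content: " + Arrays.equals(original, flattened));
    }
}
```

When hunting for the culprit, a reasonable first check is whether anything between the export call and the browser wraps the response in a Writer or otherwise treats it as text.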

Handle ligatures in Apache Tika

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.
Any idea (not only with Tika) how to extract PDF text while converting ligatures to separate characters?
File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file);
Edit
My PDF file is UTF-8 encoded (that's what InputStreamReader.getEncoding() says), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8, it is not working.
For instance, I'm supposed to have:
"différentes implémentations"
...and this is what I really get:
"di��erentes impl�ementations"

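On the second half of the Tika question above (turning ligatures into separate characters): when the extracted text still contains the ligature code points, Unicode compatibility normalization from the JDK splits them. This is a sketch under that assumption, with a hypothetical class name; it cannot recover characters that have already been replaced by question marks during extraction:

```java
import java.text.Normalizer;

// Hypothetical helper: splits ligature code points via NFKC normalization.
class LigatureSplitter {

    static String splitLigatures(String text) {
        // NFKC maps compatibility characters such as U+FB00 (ff ligature) to "ff".
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "di" + U+FB00 (ff ligature) + "\u00e9rentes"
        System.out.println(splitLigatures("di\uFB00\u00e9rentes"));
    }
}
```

Note that NFKC also rewrites other compatibility characters, for example a superscript 2 becomes a plain 2, so it may be too aggressive for some texts.
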
CSV encoding specification

I am creating a CSV and writing its content in UTF-8 to support German and English, specifying the encoding as below:
BufferedWriter outFile = new BufferedWriter( new OutputStreamWriter( outputStream, "UTF-8" ) );
The above works fine until I add the separator indication (;) below in the header of the CSV:
outFile.write( "sep=;" );
outFile.newLine();
Without this delimiter (;) my CSV will be wrong, but when I include it the encoding fails and UTF-8 is not applied.
Is there any other keyword like "sep=" that I can put in the header of the CSV to specify the encoding?
I tried encoding="UTF-8" and it does not work.
Thanks.
You cannot open a UTF-8 CSV file with Excel 2007. Microsoft has no understanding of the word "standards". Because of this, it is notoriously difficult to generate a CSV file that opens in every possible application that reads .csv files and keeps the correct encoding.
If you must use Excel 2007, I would suggest using Microsoft's own "windows-1252" encoding, as it supports German characters. Don't use the header, and also look into using tab as a separator. Yes, I know the C stands for comma, but tab seems to be more consistent with Excel 2007 if you save the file back again.
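The suggestion above could look roughly like this. It is a sketch, not a complete CSV writer: the class and method names are hypothetical, it assumes the JDK in use ships the windows-1252 charset (common, but not guaranteed by the Java specification), and it assumes that no field contains a tab or a line break, so no quoting is done:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;

// Hypothetical helper: writes tab-separated rows in windows-1252, without a "sep=" header.
class ExcelCsvWriter {

    static void writeRows(OutputStream out, String[][] rows) throws IOException {
        // Assumption: the runtime provides the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, cp1252))) {
            for (String[] row : rows) {
                writer.write(String.join("\t", row)); // tab as the field separator
                writer.write("\r\n");                 // Windows line ending
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String[][] rows = {{"Name", "Street"}, {"M\u00fcller", "Hauptstra\u00dfe 1"}};
        writeRows(System.out, rows);
    }
}
```

Characters that windows-1252 cannot represent are replaced by a question mark on output, so this only fits data limited to Western European text.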

Character encoding

I get an HTML file which I need to read and parse. This file can be in plain English, Japanese, or any other language, with the character encoding required for that language. The problem occurs when the file is in Japanese with any of these encodings:
Shift JIS
EUC-JP
ISO-2022-JP
I tried reading the file with FileReader, but the result is all garbage characters. I also tried using FileInputStream with a hard-coded Japanese encoding to check whether a Japanese file is read correctly, but the result is not as expected.
FileInputStream fis = new FileInputStream(htmlFile);
InputStreamReader isr = new InputStreamReader(fis, " ISO-2022-JP");
I don't have much experience with character encoding and internationalization. Any suggestions on how I can read/write files with different encodings?
One more thing: I don't know how to get the character encoding of the HTML file I am reading. I understand that I need to write the file in the same encoding, but I am not sure how to get the original file's encoding.
Thanks,
Forget that FileReader exists; it implicitly uses the platform default encoding, which makes it pretty much useless.
Your code with the hardcoded encoding is correct except for the encoding name itself, which has a leading space. If you remove it, the code should correctly read ISO-2022-JP encoded files.
As for getting the character encoding of the HTML file, there are a number of ways it can be transmitted:
on the HTTP level in a Content-Type HTTP header - but this is only available when you read the file from the webserver, not when it's saved as a file
as a corresponding META HTML tag: <META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
or, if the document type is XHTML, in the XML declaration: <?xml version="1.0" encoding="UTF-8"?>
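For files saved to disk, the last two options above can be checked with a small amount of code. The following is a simplified sketch, not a full implementation of HTML encoding detection: the class name and the regular expressions are made up for illustration, and it assumes the markup at the start of the file is ASCII-compatible (which does not hold for UTF-16, for example):

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: looks for a declared charset in the first bytes of an HTML file.
class HtmlCharsetSniffer {

    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*encoding\\s*=\\s*[\"']([\\w.:-]+)[\"']", Pattern.CASE_INSENSITIVE);
    private static final Pattern META = Pattern.compile(
            "<meta[^>]*charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null if neither declaration is found.
    static String sniff(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so decoding never fails here.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher xml = XML_DECL.matcher(text);
        if (xml.find()) {
            return xml.group(1);
        }
        Matcher meta = META.matcher(text);
        return meta.find() ? meta.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><META http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=EUC-JP\"></head></html>";
        System.out.println(sniff(html.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

The returned name can then be passed to Charset.forName and on to the InputStreamReader constructor in place of the hardcoded encoding; when nothing is declared, a fallback has to be chosen by other means.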
