Handle ligatures in Apache Tika - java

Tika doesn't seem to recognize ligatures (fi, ff, fl...) in PDF files and replaces them with question marks.
Any ideas (not only with Tika) on how to extract PDF text while converting ligatures to separate characters?
File file = new File("path/to/file.pdf");
String text = new Tika().parseToString(file);
Edit
My PDF file is UTF-8 encoded (that's what InputStreamReader.getEncoding() says), and my platform encoding is also UTF-8. Even with -Dfile.encoding=UTF8, it does not work.
For instance, I should get:
"différentes implémentations"
...and this is what I actually get:
"di��erentes impl�ementations"
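One possible approach, sketched here with the JDK only: if the extractor returns the real ligature code points (for example U+FB00 for "ff"), Unicode compatibility normalization splits them into separate letters. The class and method names below are hypothetical. This does not help when the extracted text already contains replacement characters, because the original glyph information is lost at that point.

```java
import java.text.Normalizer;

final class LigatureExpander {

    // NFKC applies compatibility decomposition (the "ff" ligature U+FB00 becomes
    // the two letters "ff") and then recomposes accented letters.
    // Caveat: it also rewrites other compatibility characters, e.g. a
    // superscript two becomes a plain digit 2.
    static String expandLigatures(String extracted) {
        return Normalizer.normalize(extracted, Normalizer.Form.NFKC);
    }
}
```

It could be applied to the result of the call shown in the question, e.g. LigatureExpander.expandLigatures(new Tika().parseToString(file)).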

Related

Unzip files that contain Chinese characters

I have a zip file that contains some files. The files contain Chinese characters, so I used
ZipInputStream zipStream = new ZipInputStream(
new BufferedInputStream(new FileInputStream(zipFilePath), BUFFER_SIZE),
Charset.forName("ISO-8859-1")
);
......
FileOutputStream fileOutput = new FileOutputStream(uncompressedFileName);
while (zipStream.available() > 0) {
fileOutput.write(zipStream.read());
}
Extraction runs successfully. After that I want to use the encodingDetect method to find the encoding, but now the service does not work: it returns nomatch. If I send the files directly to the service, it works and detects the charset properly, e.g. UTF-8.
I guess that Charset.forName("ISO-8859-1") extracts the files but corrupts their format. Do you have any idea?
The problem is the Charset of the file names in the zip. UTF-8 raises an error (the file names are evidently not in UTF-8), because UTF-8 requires a special format for multi-byte sequences, and evidently the names contain invalid "multi-byte" sequences.
ISO-8859-1 is a single-byte encoding, so it accepts any garbage.
What you should do is try the small number of Chinese Charsets until the file name strings are decoded correctly. A Java String holds Unicode, so it can represent text from any Charset. Help from someone who reads Chinese would probably be useful.
Then try writing files with those names. If that does not work on your PC, you must use artificial file names, maybe a transliteration from Chinese.
A translation table from the original Chinese file name to the actual file name may be created as a UTF-8 text file, maybe with a BOM, '\uFEFF', at the beginning of the file.
The ISO-8859-1 charset most definitely does not support the Chinese language. Use UTF-8 instead of ISO-8859-1.
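The "try a few Charsets" advice can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class and method names are hypothetical, and the candidate list (for Chinese archives perhaps GBK, GB18030, Big5, then UTF-8) is an assumption. A single-byte charset such as ISO-8859-1 never fails to decode, so a charset that "works" still needs a human check of the resulting names. The copy helper stops at -1; a loop that writes the result of read() unconditionally, as in the question, also writes the final -1 as a stray 0xFF byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

final class ZipNameCharsetProbe {

    // Decodes every entry name with the given charset.
    // Throws if a name is not valid in that charset.
    static List<String> entryNames(byte[] zipBytes, Charset charset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Returns the first candidate that decodes all names without error, or null if none does.
    static Charset firstDecodingCharset(byte[] zipBytes, List<Charset> candidates) throws IOException {
        for (Charset candidate : candidates) {
            try {
                entryNames(zipBytes, candidate);
                return candidate;
            } catch (IllegalArgumentException | ZipException malformedName) {
                // This charset cannot decode the names; try the next candidate.
            }
        }
        return null;
    }

    // Copies the current entry's data; the loop ends on -1, so nothing extra is written.
    static void copyEntry(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }
}
```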

ServletFileUpload encoding issue

I am using the Apache Commons FileUpload API to upload a file from the UI. The file contains an entry like the one below:
<name>m²</name>
But after loading the file, the character turns into the following:
<name>m²</name>
I am not sure whether this has something to do with encoding. Please advise.
It appears the uploaded file is XML. In the first place, the file should be plain text and use entity encoding for the superscript 2. Ideally <name>m²</name> should be <name>m&#178;</name>, where &#178; is the superscript 2 character. XML processors are supposed to translate this character reference accordingly.
See the Wikipedia article on ISO-8859-1, which shows the character chart under the "Codepage layout" section.
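To make the entity suggestion concrete, here is a small sketch using only the JDK. The class and method names are hypothetical. The first method replaces every non-ASCII character with a numeric character reference; the second reproduces one plausible cause of the corruption, under the assumption that the file is UTF-8 but is decoded as ISO-8859-1 somewhere in the upload path.

```java
import java.nio.charset.StandardCharsets;

final class XmlCharDemo {

    // Replaces every non-ASCII code point with a decimal numeric character
    // reference, so the file survives any ASCII-compatible encoding.
    static String escapeNonAscii(String text) {
        StringBuilder sb = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 0x80) {
                sb.appendCodePoint(cp);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        });
        return sb.toString();
    }

    // Assumption for illustration: UTF-8 bytes decoded with the wrong charset.
    // Each byte of a multi-byte sequence then shows up as its own character.
    static String decodeUtf8AsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }
}
```

If the wrong decoder turns out to be the cause, reading the uploaded bytes with an explicit charset (for example new String(bytes, StandardCharsets.UTF_8)) avoids the problem without touching the file.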

CSV encoding specification

I am creating a CSV file and writing its content in UTF-8 to support German and English, specifying the encoding as below:
BufferedWriter outFile = new BufferedWriter( new OutputStreamWriter( outputStream, "UTF-8" ) );
This works fine until I add the separator indication (;) below to the header of the CSV:
outFile.write( "sep=;" );
outFile.newLine();
Without this delimiter (;) my CSV is wrong, but when I include it the encoding fails and UTF-8 is not applied.
Is there any other keyword like "sep=" that I can put in the CSV header to specify the encoding?
I tried encoding="UTF-8" and it does not work.
Thanks.
You cannot open a UTF-8 CSV file with Excel 2007. Microsoft has no understanding of the word "standards". Because of this, it is notoriously difficult to generate a CSV file that opens in every possible application that reads .csv files and keeps the correct encoding.
If you must use Excel 2007, I would suggest using Microsoft's own "windows-1252" encoding, as it supports German characters. Don't use the header, and also look into using tab as a separator. Yes, I know the C stands for comma, but tab seems to be more consistent with Excel 2007 if you save the file back again.
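A minimal sketch of that suggestion, assuming the JDK's windows-1252 charset and a hypothetical helper class: no "sep=" header, tab as the separator, and one encoder for the whole stream. It does no quoting, so it is only suitable for fields that contain no tabs or line breaks.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.List;

final class ExcelFriendlyCsv {

    // The charset suggested in the answer; it covers German umlauts and sharp s.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Writes one tab-separated line per row, terminated by CRLF.
    static void writeRows(OutputStream out, List<List<String>> rows) throws IOException {
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, WINDOWS_1252));
        for (List<String> row : rows) {
            writer.write(String.join("\t", row));
            writer.write("\r\n");
        }
        writer.flush(); // flush only; closing the stream is left to the caller
    }
}
```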

Spanish character óé display error in Java properties

When I process a properties file with the Spanish characters ó and é, the characters are displayed as ?. I tried different ways to fix this, but they all failed:
I tried to use \uxxxx
I tried to use InputStreamReader with encoding UTF-8
I tried to convert string to bytes and then create a new String from those bytes:
new String( val.getBytes("UTF-8"), "UTF-8")
Nothing worked. What should I do next to fix this issue? Japanese and Russian are still OK.
The properties file needs to be in the proper encoding. By default some IDEs, like Eclipse, save the content using CP1252, but you are reading the file as UTF-8. The same applies to your Java source code.
If you use \uxxxx escapes but your application works with CP1252 by default, the conversion of the escape code results in a bad character.
If you use an InputStreamReader to force reading as UTF-8 but your code and/or your file is not actually UTF-8, the result is a bad character.
If you convert a string via UTF-8 but your source code is CP1252, you will have the same problem.
Related previous answer about source code : Should source code be saved in UTF-8 format
Notepad++ has a "Format" menu that shows the file's current encoding and lets you either view the file as if it were in another encoding or convert it to another encoding such as UTF-8.
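Once the file really is saved as UTF-8, reading it through a Reader with an explicit charset looks roughly like this. The class name is hypothetical; Properties.load(Reader) and InputStreamReader are standard JDK APIs.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

final class Utf8PropertiesLoader {

    // Decodes the stream as UTF-8 explicitly instead of relying on the
    // ISO-8859-1 decoding that Properties.load(InputStream) applies.
    static Properties load(InputStream in) throws IOException {
        Properties props = new Properties();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            props.load(reader);
        }
        return props;
    }
}
```

If the characters are still wrong after this, the file on disk is probably not UTF-8, which is the situation the answer describes.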

UTF-8Characters not displayed correctly in

We are working on a project for school. The project must be trilingual (Dutch, English and French), so the answer "change to English" will not do.
All our classes and resource files are encoded in UTF-8, and all non-English characters are displayed correctly in the classes themselves.
The problem is that once we try to display our text, all non-English characters are distorted.
We hear a lot that this is due to an encoding issue, but I sincerely doubt that, since our whole project is encoded in UTF-8.
Here is an extract from the French resource bundle:
VIDEOSETTINGS = Réglages du Vidéo
SOUNDSETTINGS = Réglages du son
KEYBINDSETTINGS = Keybind Paramètres
LANGUAGESETTINGS = Paramètres de langue
DIFFICULTYSETTINGS = Paramètres de Difficulté
EXITSETTINGS = Sortie les paramètres
This results in the following displayed strings:
(Screenshot: display result for the provided resource bundle extract.)
I would be most grateful for a solution to this problem.
EDIT
For extra info: we are building a desktop app using Swing.
This is due to an encoding issue.
You are using the wrong decoder (probably ISO-8859-1) on UTF-8 encoded bytes.
Are these strings stored in a file? How are you loading the file? Via the Properties class? The Properties class always applies ISO-8859-1 decoding when loading the plain text format from an InputStream. If you are using Properties, use the load(Reader) overload, switch to the XML format, or rewrite the file with the matching encoding. Also, if you are using ResourceBundle.getBundle() to load a properties file, you must write that file in ISO-8859-1, escaping any characters outside that encoding.
Since this is an encoding issue, it would be most helpful if you posted the code you have used to select the character encoding.
You didn't show the code where you read the resource files. But if you use PropertyResourceBundle with an InputStream in the constructor, the InputStream must be encoded in ISO-8859-1. In that case, characters that cannot be represented in ISO-8859-1 must be written as Unicode escapes.
You can use native2ascii or AnyEdit as tools to convert properties files to Unicode escapes; see "Use cyrillic .properties file in eclipse project".
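For illustration, the kind of conversion those tools perform can be sketched in a few lines. This is a hypothetical helper, not the actual native2ascii implementation: every character above ASCII is rewritten as a \uXXXX escape, which Properties and PropertyResourceBundle decode again when loading from an InputStream.

```java
final class UnicodeEscaper {

    // Rewrites each non-ASCII UTF-16 unit as a backslash-u escape with four
    // hex digits, leaving ASCII characters untouched.
    static String toUnicodeEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }
}
```

Applied to a value from the bundle in the question, "Réglages du Vidéo" becomes R\u00e9glages du Vid\u00e9o, which is pure ASCII and therefore safe under the ISO-8859-1 decoding described above.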
