Determining File Type from Extension - java

Is there a simple way in Java to translate file extensions to specific file types? That is, I'd like to translate ".doc" to "Microsoft Word Document." But I also do NOT want to look inside each file to determine the MIME type, etc., because of the performance hit that entails.
Is there a library or database file listing all the currently accepted extensions and their meanings? Something I can load programmatically and then search when I need to?

Microsoft has a support article on common file name extensions.
The wikipedia page on 'list of file formats' seems pretty exhaustive.
It should be easy to copy / paste one (or both) of those lists, tweak the formatting with a text editor, and push it into your code via a hard-coded array or external resource file.
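For instance, a minimal sketch of that approach (the entries and the resource file name below are hypothetical examples, not a complete list):
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

public class ExtensionNames {
    private static final Map<String, String> NAMES = new HashMap<>();

    static {
        // A few hypothetical hard-coded entries, copied from one of the lists above.
        NAMES.put(".doc", "Microsoft Word Document");
        NAMES.put(".xls", "Microsoft Excel Worksheet");
        NAMES.put(".png", "Portable Network Graphics image");

        // Optionally merge in entries from a resource file on the classpath,
        // e.g. lines like ".pdf=Adobe Acrobat Document" in extensions.properties.
        try (InputStream in = ExtensionNames.class.getResourceAsStream("/extensions.properties")) {
            if (in != null) {
                Properties props = new Properties();
                props.load(in);
                props.forEach((ext, name) -> NAMES.put(ext.toString(), name.toString()));
            }
        } catch (IOException e) {
            // Fall back to the hard-coded entries only.
        }
    }

    public static String describe(String extension) {
        return NAMES.getOrDefault(extension.toLowerCase(), "Unknown file type");
    }
}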

Just keep in mind that if you go this route then you are trusting that the files contain the type of data that the extension claims. This is going to be a somewhat brittle solution because it will only take one upstream error to break things.
Much better IMO is to read the magic numbers (i.e. the file signature) from the first handful of bytes in the file itself. Just about any production or commercial software will do this (at an absolute minimum) instead of trusting the file extension. It would still be possible for a malicious user to fake the signature, but that requires intentional action rather than just a coding error.
The cost of doing this versus checking the extension is really not substantially different; unless you are processing an enormous number of files or have some super tight deadlines (and in that case, Java might not be the best choice), both approaches require not much beyond asking the OS to read a handful of bytes from disk. Checking the magic numbers will just require a few more bytes to be read, and the overhead of opening/closing a stream on each file.
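As a rough illustration of the magic-number approach, here is a minimal sketch; the three signatures shown are just a small sample, and a real table would cover many more formats:
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public final class MagicNumbers {
    // A few well-known signatures; this is far from exhaustive.
    private static final byte[] PDF = { '%', 'P', 'D', 'F' };
    private static final byte[] PNG = { (byte) 0x89, 'P', 'N', 'G' };
    private static final byte[] ZIP = { 'P', 'K', 0x03, 0x04 }; // also .docx, .jar, ...

    public static String sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int read;
        try (InputStream in = Files.newInputStream(file)) {
            read = in.read(header);
        }
        if (read < 4) {
            return "Unknown (file too short)";
        }
        if (startsWith(header, PDF)) return "PDF document";
        if (startsWith(header, PNG)) return "PNG image";
        if (startsWith(header, ZIP)) return "ZIP-based container";
        return "Unknown";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
Note that java.nio.file.Files.probeContentType(path) can also return a MIME type string out of the box, although its behaviour is platform-dependent and it may simply fall back to the extension.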

As far as I know, those names are only available in the registry. I’ve seen attempts to use Swing’s FileView to get them, but I wouldn’t rely on that, especially for headless code.
public static String getFullFileTypeName(String extension)
        throws IOException, InterruptedException {
    if (!extension.startsWith(".")) {
        extension = "." + extension;
    }
    String progID = readDefaultValue(extension);
    String name = readDefaultValue(progID);
    return name;
}

private static String readDefaultValue(String node)
        throws IOException, InterruptedException {
    String registryPath = "HKEY_CLASSES_ROOT\\" + node;

    ProcessBuilder builder = new ProcessBuilder(
        "reg.exe", "query", registryPath, "/ve");
    builder.redirectError(ProcessBuilder.Redirect.INHERIT);

    String value = null;
    Process process = builder.start();
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(process.getInputStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            int regSZIndex = line.indexOf("REG_SZ");
            if (regSZIndex >= 0) {
                value = line.substring(regSZIndex + 6).trim();
                break;
            }
        }

        // Consume remaining output.
        if (line != null) {
            int c;
            do {
                c = reader.read();
            } while (c >= 0);
            // As of Java 11, the above loop can be replaced with:
            //reader.transferTo(Writer.nullWriter());
        }
    }

    int returnCode = process.waitFor();
    if (returnCode != 0) {
        throw new IOException("Got return code " + returnCode
                + " from " + builder.command());
    }

    if (value == null) {
        throw new IOException(
                "Could not find value for \"" + registryPath + "\"");
    }

    return value;
}
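A hypothetical call on a Windows machine would then look something like this (the exact display name depends on what is installed and registered locally):
String typeName = getFullFileTypeName("doc");
System.out.println(typeName); // e.g. "Microsoft Word 97 - 2003 Document"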

Related

Which encoding for ProcessBuilder parameters

Using ProcessBuilder, I need to be able to send non-ASCII parameters to another Java program.
In this case, a program Abc needs to send e.g. Arabic characters to Def program through the parameters. I have control of Abc code, but not of Def.
Using ProcessBuilder in the normal way, without any playing with the encoding, it is not possible (as was mentioned here): Def receives question marks "?????".
However, I am able to get some results, but different encodings have to be used in different scenarios.
For example, I try every available encoding when sending to the recipient, and compare the result with what is expected.
Windows, IntelliJ console:
Default charset: UTF-8
Found charsets: windows-1252, windows-1254 and windows-1258
Windows, command prompt:
Default charset: windows-1252
Found charsets: CESU-8 and UTF-8
Ubuntu, command prompt:
Default charset: ISO-8859-1
Found charsets: ISO-2022-CN, ISO-2022-KR, ISO-8859-1, ISO-8859-15, ISO-8859-9, x-IBM1129, x-ISO-2022-CN-CNS and x-ISO-2022-CN-GB
My question is: how to programmatically know which correct encoding to use, since I need to have something universal?
In other words, what is the relation between the default charset and the found ones?
public class Abc {
    private static final Path PATH = Paths.get("."); // With maven: ./target/classes

    public static void main(String[] args) throws Exception {
        var string = "hello أحمد";
        var bytes = string.getBytes();
        System.out.println("Original string: " + string);
        System.out.println("Default charset: " + Charset.defaultCharset());
        for (var c : Charset.availableCharsets().values()) {
            var newString = new String(bytes, c);
            var process = new ProcessBuilder().command("java", "-cp",
                    PATH.toAbsolutePath().toString(),
                    "Def", newString).start();
            process.waitFor();
            var output = asString(process.getInputStream());
            if (output.contains(string)) {
                System.out.println("Found " + c + " " + output);
            }
        }
    }

    private static String asString(InputStream is) throws IOException {
        try (var reader = new BufferedReader(new InputStreamReader(is))) {
            var builder = new StringBuilder();
            String line;
            while ((line = reader.readLine()) != null) {
                if (builder.length() != 0) {
                    builder.append(System.lineSeparator());
                }
                builder.append(line);
            }
            return builder.toString();
        }
    }
}

public class Def {
    public static void main(String[] args) {
        System.out.println(args[0]);
    }
}
Under the hood, what's actually being passed around is bytes, not chars. Normally, you'd expect the Java method that ends up turning characters into bytes to have an overload that lets you specify a charset, but, for whatever reason, it does not exist here.
Here is how it works:
You pass a string to ProcessBuilder
PB will turn that string into bytes using Charset.defaultCharset() (why? Because PB is all about making the OS do things, and the default charset reflects the OS's preferred charset).
These bytes are then fed to the process.
The process starts up. If it is java, and we're talking the args in psv main(String[] args), the same is done in reverse: Java takes the bytes and turns them back to characters via Charset.defaultCharset(), again.
This does show an immediate issue: If the default charset is not capable of representing a certain character, then in theory you are out of luck.
That would strongly suggest that using java to fire up java.exe should ordinarily mean you can pass whatever you want (unless the characters involved aren't representable in the system's charset).
Your code is odd. In particular, this line is the problem:
var bytes = string.getBytes();
This is short for string.getBytes(Charset.defaultCharset()). So now you have your bytes in the default charset.
var newString = new String(bytes, c);
and now you're taking those bytes and turning them into a string using a completely different charset. I'm not sure what you're trying to accomplish with this. Pure gobbledygook would come out.
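As a minimal illustration of why mixing charsets like that produces garbage (this sketch just hard-codes two different charsets rather than using the platform default):
import java.nio.charset.StandardCharsets;

public class MixedCharsets {
    public static void main(String[] args) {
        String original = "hello أحمد";
        // Encode with one charset...
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        // ...and decode with a different one: the Arabic characters come out mangled.
        String garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1);
        System.out.println(garbled);                  // mojibake, not the original
        System.out.println(garbled.equals(original)); // false
    }
}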
In other words, what is the relation between the default charset and the found ones?
What do you mean by 'found ones'? The string "Found charsets" appears nowhere in your code. If you mean: What Charset.availableCharsets() returns - there is no relationship at all. availableCharsets isn't relevant for ProcessBuilder.
One possibility is to convert your String to a string of Unicode sequences, pass that to the other process, and convert it back to a regular String there. A string of Unicode sequences will always contain ASCII characters only. Here is how it may look:
String encoded = StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("hello أحمد");
The result will be that String encoded will hold this value:
"\u0068\u0065\u006c\u006c\u006f\u0020\u0623\u062d\u0645\u062f"
This String you can safely pass to another process. In that other process, you can do the following:
String originalString = StringUnicodeEncoderDecoder.decodeUnicodeSequenceToString(encodedString);
And the result will be that originalString will now hold this value:
"hello أحمد"
Class StringUnicodeEncoderDecoder can be found in an open-source library called MgntUtils. You can get this library as a Maven artifact or on GitHub (including source code and JavaDoc). JavaDoc online is available here.
This library and this particular feature are used and well tested by multiple users.
Disclaimer: this library is written by me.
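If you would rather not add a dependency, a rough sketch of the same escape/unescape idea in plain Java (not the library's implementation, just an illustration) could look like this:
// Escape every char to a backslash-u hex sequence (ASCII only), and parse such a sequence back.
static String toUnicodeSequence(String s) {
    StringBuilder sb = new StringBuilder();
    for (char ch : s.toCharArray()) {
        sb.append(String.format("\\u%04x", (int) ch));
    }
    return sb.toString();
}

static String fromUnicodeSequence(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i + 6 <= s.length(); i += 6) {
        // Each token is six characters long: backslash, 'u', and four hex digits.
        sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
    }
    return sb.toString();
}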

How to use OpenNLP parser models in an Android app?

I went through this link for Java NLP: https://www.tutorialspoint.com/opennlp/index.htm
I tried the below code in Android:
try {
    File file = copyAssets();
    // InputStream inputStream = new FileInputStream(file);
    ParserModel model = new ParserModel(file);
    // Creating a parser
    Parser parser = ParserFactory.create(model);
    // Parsing the sentence
    String sentence = "Tutorialspoint is the largest tutorial library.";
    Parse topParses[] = ParserTool.parseLine(sentence, parser, 1);
    for (Parse p : topParses) {
        p.show();
    }
} catch (Exception e) {
}
I downloaded the file **en-parser-chunking.bin** from the internet and placed it in the assets of the Android project, but the code stops on the third line, i.e. ParserModel model = new ParserModel(file);, without giving any exception. I need to know how I can make this work in Android. If it can't work, is there any other support for NLP on Android that doesn't consume any services?
The reason the code stalls/breaks at runtime is that you need to use an InputStream instead of a File to load the binary file resource. Most likely, the File instance is null when you "load" it the way indicated in line 2. In theory, this constructor of ParserModel should detect this and an IOException should be thrown. Yet, sadly, the JavaDoc of OpenNLP is not precise about this kind of situation, and you are not handling this exception properly in the catch block.
Moreover, the code snippet you presented should be improved, so that you know what actually went wrong.
Therefore, loading a ParserModel from within an Activity should be done differently. Here is a variant that takes care of both aspects:
AssetManager assetManager = getAssets();
InputStream in = null;
try {
    in = assetManager.open("en-parser-chunking.bin");
    if (in != null) {
        ParserModel parserModel = new ParserModel(in);
        // From here, <parserModel> is initialized and you can start playing with it...

        // Creating a parser
        Parser parser = ParserFactory.create(parserModel);

        // Parsing the sentence
        String sentence = "Tutorialspoint is the largest tutorial library.";
        Parse[] topParses = ParserTool.parseLine(sentence, parser, 1);
        for (Parse p : topParses) {
            p.show();
        }
    }
    else {
        // resource file not found - whatever you want to do in this case
        Log.w("NLP", "OpenNLP binary model file could not be found in assets.");
    }
}
catch (Exception ex) {
    Log.e("NLP", "message: " + ex.getMessage(), ex);
    // proper exception handling here...
}
finally {
    if (in != null) {
        try {
            in.close();
        } catch (IOException ioe) {
            // ignore
        }
    }
}
This way, you're using an InputStream approach and at the same time you take care of proper exception and resource handling. Moreover, you can now use a debugger in case something remains unclear with the resource path references of your model files. For reference, see the official JavaDoc of AssetManager#open(String resourceName).
Note well:
Loading OpenNLP's binary resources can consume quite a lot of memory. For this reason, it may be that your Android app's request to allocate the memory needed for this operation will not be granted by the actual runtime (i.e., smartphone) environment.
Therefore, carefully monitor the amount of requested/required RAM while parserModel = new ParserModel(in); is invoked.
Hope it helps.

Gujarati text in Java String

I have a Gujarati Bible and am trying to insert each verse into a MySQL database using a parser written in Java. When I assign Gujarati text to a Java String variable, it shows junk characters in the debugger.
E.g. This is my Gujarati text
હે યહોવા તું મારો દેવ છે;
I assign it to Java String variable as shown below
verse._verseText = "હે યહોવા તું મારો દેવ છે;";
What I see in the debug window is all junk characters. Any help is appreciated. If you need more information let me know and I will provide it as and when asked.
UPDATE
Pasting my parser code here
private Boolean Insert(String _text)
{
    BibleVerse verse = new BibleVerse();
    String[] data = _text.split("\\|");
    try
    {
        if (data[0].equals(bookName) || bookName.equals("All"))
        {
            verse._Version = "Gujarati";
            verse._book = data[0];
            verse._chapter = Integer.parseInt(data[1]);
            verse._verse = Integer.parseInt(data[2]);
            verse._verseText = new String(data[3].getBytes(), "UTF-8");
            _bibleDatabase.Insert(verse);
            pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - INSERTED.");
        }
        else
        {
            pcs.firePropertyChange("logupdate", null, data[0] + " " + data[1] + "," + data[2] + " - SKIPPED.");
        }
        return true;
    }
    catch(Exception e)
    {
        pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
        return false;
    }
}
Here is the sample line from the text file
Isaiah|25|1|હે યહોવા તું મારો દેવ છે; હું તને મોટો માનીશ, હું તારા નામની સ્તુતિ કરીશ; કેમકે તેં અદભુત કાર્યો કર્યાં છે, તેં વિશ્વાસુપણે તથા સત્યતાથી પુરાતન સંકલ્પો પાર પાડ્યા છે.
UPDATE
Here is the code where I open & read file.
try
{
    FileReader _file = new FileReader(this._filename);
    _bufferedReader = new BufferedReader(_file);
    SwingWorker parseWorker = new SwingWorker()
    {
        @Override
        protected Object doInBackground() throws Exception
        {
            String line;
            String[] data;
            int lineno = 0;
            BibleVerse verse = new BibleVerse();
            while ((line = _bufferedReader.readLine()) != null)
            {
                ++lineno;
                pcs.firePropertyChange("pgbupdate", null, lineno);
                Insert(line);
            }
            _bufferedReader.close();
            return null;
        }

        @Override
        protected void done()
        {
            pcs.firePropertyChange("logupdate", null, "Parsing complete.");
        }
    };
    parseWorker.execute();
}
catch (Exception e)
{
    pcs.firePropertyChange("logupdate", null, "ERROR : " + e.getMessage());
}
The problem is this:
FileReader _file = new FileReader(this._filename);
This reads the file using the platform's default charset. If your data file is not encoded in that charset, you will get incorrect characters.
On Windows, the default charset is usually a locale-specific "ANSI" code page such as windows-1252. On most other systems, it's UTF-8.
The easiest solution is to find out the actual encoding of your data file, so you can specify it explicitly in the code. The encoding of a file can be determined with the file command on Unix and Linux systems. In Windows, you may need to examine it with a binary editor, or install something like Cygwin, which has a file command of its own.
Once you know what it is, you should pass it explicitly to the construction of your Reader:
// Replace "UTF-8" with the actual encoding of your data file (if it's not UTF-8).
Reader _file = new InputStreamReader(new FileInputStream(this._filename), "UTF-8");
Once you've done that, there is no reason for any other part of your code to concern itself with bytes. You should replace this:
verse._verseText = new String(data[3].getBytes(), "UTF-8");
with this:
verse._verseText = data[3];
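On Java 7 and later, an equivalent and arguably tidier way to open the file with an explicit charset is via java.nio.file (UTF-8 is assumed here; substitute the actual encoding of your file):
BufferedReader reader = Files.newBufferedReader(
        Paths.get(this._filename), StandardCharsets.UTF_8);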
how to inject chinese characters using javascript?
not quite the same problem, but I think the same solution may work in this case.
If the script is inline (in the HTML file), then it's using the encoding of the HTML file and you won't have an issue.
If the script is loaded from another file:
Your text editor must save the file in an appropriate encoding such as utf-8 (it's probably doing this already if you're able to save it, close it, and reopen it with the characters still displaying correctly).
Your web server must serve the file with the right http header specifying that it's utf-8 (or whatever the encoding happens to be, as determined by your text editor settings). Here's an example of how to do this with php: Set http header to utf-8 php.
If you can't have your webserver do this, try to set the charset attribute on your script tag (e.g. <script src="..." charset="utf-8">). I tried to see what the spec said should happen in the case of mismatching charsets defined by the tag and the http headers, but couldn't find anything concrete, so just test and see if it helps.
If that doesn't work, place your script inline.
It looks like if you want to store Gujarati text in a Java string, you need to use Unicode characters. See this: http://jrgraphix.net/r/Unicode/0A80-0AFF
So for example the first Gujarati character:
char example = '\u0A80';
String result = Character.toString(example);

How Can I Read The Next Row From A CSV Data Set Config In JMeter?

I am in the process of creating a test plan in JMeter which visits a random number of pages (from 2 to 10), whose URLs are to be fetched from a CSV Data Set. I have created the CSV Data Set and the samplers, which are working fine, except that only one row is read from the Data Set per thread, which is not what I need - I want a new row to be read after the sampler has completed (or before, I'm not fussed).
I saw that this question is very similar and the solution was to use the Raw Data Source Pre-Processor, which does work but requires arduous alterations to the file in question (adding chunk sizes before each line), which is a bit of a pain when the file is about 500 lines long.
Is there a way I can set the CSV Data Set to advance to the next row on reading, or use some post or pre processor, such as beanshell, in order to do this? I have seen people state that CSVRead can do this, but that states that access is per-thread, which would be no good for me.
As a side note - ultimately all I want to do is access a random line in the file which gets passed to a HTTP sampler, if there is an easier or better way to do this I'm open to suggestions.
For this you can use BeanShell (= Java) code executed from a BeanShell Sampler / BeanShell PostProcessor / BeanShell PreProcessor.
The following code will read all the lines from your file and then select a single random one:
import java.text.*;
import java.io.*;
import java.util.*;

String[] params = Parameters.split(",");
String csvTest = params[0];
String csvDir = params[1];

ArrayList strList = new ArrayList();

try {
    File file = new File(System.getProperty("user.dir") + File.separator + csvDir + File.separator + csvTest);
    if (!file.exists()) {
        throw new Exception("ERROR: file " + csvTest + " not found in " + csvDir + " directory.");
    }

    BufferedReader bufRdr = new BufferedReader(new FileReader(file));
    String line = null;
    while ((line = bufRdr.readLine()) != null) {
        strList.add(line);
    }
    bufRdr.close();

    Random rnd = new java.util.Random();
    vars.put("csvUrl", strList.get(rnd.nextInt(strList.size())));
}
catch (Exception ex) {
    IsSuccess = false;
    log.error(ex.getMessage());
    System.err.println(ex.getMessage());
}
catch (Throwable thex) {
    System.err.println(thex.getMessage());
}
Then you can access extracted URL via variable (${csvUrl} in this example).
My only doubt is whether reading the full file on each iteration (if you have to execute this in a loop) is a good solution from a performance point of view.

Java : How to determine the correct charset encoding of a stream

With reference to the following thread:
Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programmatically determine the correct charset encoding of an InputStream/file?
I have tried using the following:
File in = new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());
But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.
You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings: an encoding is a mapping between byte values and their character representation, so every encoding "could" be the right one.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for every char. In English the char e appears very often, but ê appears very, very seldom. In an ISO-8859-1 stream there are usually no 0x00 chars, but a UTF-16 stream has a lot of them.
Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
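A crude sketch of the zero-byte heuristic mentioned above (nothing more than a rough guess between UTF-16 and a single-byte encoding; the 10% threshold is an arbitrary choice):
// Rough heuristic only: many 0x00 bytes suggest UTF-16, almost none suggest a single-byte encoding.
static String guessUtf16OrSingleByte(byte[] sample) {
    int zeros = 0;
    for (byte b : sample) {
        if (b == 0x00) {
            zeros++;
        }
    }
    return (zeros > sample.length / 10) ? "probably UTF-16" : "probably a single-byte encoding";
}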
I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet
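A minimal usage sketch with juniversalchardet might look roughly like this (the file name is just an example, and the detector may return null when it cannot decide):
import java.io.FileInputStream;
import java.io.InputStream;
import org.mozilla.universalchardet.UniversalDetector;

// Feed the detector chunks of the stream until it is done (or the stream ends).
try (InputStream in = new FileInputStream("some-file.txt")) {
    UniversalDetector detector = new UniversalDetector(null);
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) > 0 && !detector.isDone()) {
        detector.handleData(buf, 0, n);
    }
    detector.dataEnd();
    String encoding = detector.getDetectedCharset(); // may be null
    System.out.println("Detected encoding: " + encoding);
}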
Check this out:
http://site.icu-project.org/ (icu4j)
They have libraries for detecting the charset from an IOStream.
It could be as simple as this:
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();

if (cm != null) {
    reader = cm.getReader();
    charset = cm.getName();
} else {
    throw new UnsupportedCharsetException("charset could not be detected");
}
Here are my favorites:
TikaEncodingDetector
Dependency:
<dependency>
    <groupId>org.apache.any23</groupId>
    <artifactId>apache-any23-encoding</artifactId>
    <version>1.1</version>
</dependency>
Sample:
public static Charset guessCharset(InputStream is) throws IOException {
    return Charset.forName(new TikaEncodingDetector().guessEncoding(is));
}
GuessEncoding
Dependency:
<dependency>
    <groupId>org.codehaus.guessencoding</groupId>
    <artifactId>guessencoding</artifactId>
    <version>1.4</version>
    <type>jar</type>
</dependency>
Sample:
public static Charset guessCharset2(File file) throws IOException {
    return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
}
You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
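For example, a small sketch of that validation approach with CharsetDecoder (a decoding failure proves the charset is wrong; a clean decode only proves it is possible):
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Returns true if the bytes decode cleanly in the given charset (not proof that it is the right one).
static boolean decodesCleanly(byte[] data, Charset charset) {
    CharsetDecoder decoder = charset.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
    try {
        decoder.decode(ByteBuffer.wrap(data));
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}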
Which library to use?
As of this writing, there are three libraries that emerge:
GuessEncoding
ICU4j
juniversalchardet
I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.
How to tell which one has detected the right charset (or as close as possible)?
It's impossible to certify the charset detected by each of the above libraries. However, it's possible to ask them in turn and score the returned responses.
How to score the returned response?
Each response can be assigned one point. The more points a response has, the more confidence the detected charset has. This is a simple scoring method. You can elaborate others.
Is there any sample code?
Here is a full snippet implementing the strategy described in the previous lines.
public static String guessEncoding(InputStream input) throws IOException {
    // Load input data
    long count = 0;
    int n = 0, EOF = -1;
    byte[] buffer = new byte[4096];
    ByteArrayOutputStream output = new ByteArrayOutputStream();

    while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
        output.write(buffer, 0, n);
        count += n;
    }

    if (count > Integer.MAX_VALUE) {
        throw new RuntimeException("Inputstream too large.");
    }

    byte[] data = output.toByteArray();

    // Detect encoding
    Map<String, int[]> encodingsScores = new HashMap<>();

    // * GuessEncoding
    updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());

    // * ICU4j
    CharsetDetector charsetDetector = new CharsetDetector();
    charsetDetector.setText(data);
    charsetDetector.enableInputFilter(true);
    CharsetMatch cm = charsetDetector.detect();
    if (cm != null) {
        updateEncodingsScores(encodingsScores, cm.getName());
    }

    // * juniversalchardet
    UniversalDetector universalDetector = new UniversalDetector(null);
    universalDetector.handleData(data, 0, data.length);
    universalDetector.dataEnd();
    String encodingName = universalDetector.getDetectedCharset();
    if (encodingName != null) {
        updateEncodingsScores(encodingsScores, encodingName);
    }

    // Find winning encoding
    Map.Entry<String, int[]> maxEntry = null;
    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
            maxEntry = e;
        }
    }

    String winningEncoding = maxEntry.getKey();
    //dumpEncodingsScores(encodingsScores);
    return winningEncoding;
}

private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
    String encodingName = encoding.toLowerCase();
    int[] encodingScore = encodingsScores.get(encodingName);

    if (encodingScore == null) {
        encodingsScores.put(encodingName, new int[] { 1 });
    } else {
        encodingScore[0]++;
    }
}

private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
    System.out.println(toString(encodingsScores));
}

private static String toString(Map<String, int[]> encodingsScores) {
    String GLUE = ", ";
    StringBuilder sb = new StringBuilder();

    for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
        sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
    }

    int len = sb.length();
    sb.delete(len - GLUE.length(), len);

    return "{ " + sb.toString() + " }";
}
Improvements:
The guessEncoding method reads the inputstream entirely. For large inputstreams this can be a concern. All these libraries would read the whole inputstream. This would imply a large time consumption for detecting the charset.
It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
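For example, a small sketch of that improvement (the byte limit is an arbitrary choice passed in by the caller, not something prescribed by the libraries):
// Read at most the first maxBytes of the stream and run detection on that sample only.
public static String guessEncodingFromPrefix(InputStream input, int maxBytes) throws IOException {
    byte[] buffer = new byte[maxBytes];
    int total = 0, n;
    while (total < maxBytes && (n = input.read(buffer, total, maxBytes - total)) != -1) {
        total += n;
    }
    byte[] sample = java.util.Arrays.copyOf(buffer, total);
    // Hand the sample to guessEncoding (or to any of the detectors) instead of the full stream.
    return guessEncoding(new java.io.ByteArrayInputStream(sample));
}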
As far as I know, there is no general library in this context that is suitable for all types of problems. So, for each problem you should test the existing libraries and select the best one that satisfies your problem's constraints, but often none of them is appropriate. In these cases you can write your own encoding detector, as I have done ...
I've written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as the built-in components. You can find my tool here; please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.
Below are some helpful observations from my own experience:
Charset detection is not a foolproof process, because it is essentially based on statistical data and what actually happens is guessing, not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy did not differ meaningfully from icu4j in my tests (at most 1%, as I remember)
icu4j is much more general than jchardet; icu4j is just a bit biased towards IBM-family encodings, while jchardet is strongly biased towards utf-8
Due to the widespread use of UTF-8 in the HTML world, jchardet is a better choice than icu4j overall, but it is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet perform poorly when dealing with HTML pages in Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian, and Windows-1256 aka cp1256 is widely used for Arabic
Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input
Some encodings are essentially the same with only partial differences, so in some cases the guessed or detected encoding may be false and at the same time true, as with Windows-1252 and ISO-8859-1 (refer to the last paragraph under section 5.2 of my paper)
The libs above are simple BOM detectors which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which actually scans the text.
If you use ICU4J (http://icu-project.org/apiref/icu4j/), here is my code:
String charset = "ISO-8859-1"; // Default charset, put whatever you want
byte[] fileContent = null;
FileInputStream fin = null;

// create FileInputStream object
fin = new FileInputStream(file.getPath());

/*
 * Create byte array large enough to hold the content of the file.
 * Use File.length to determine size of the file in bytes.
 */
fileContent = new byte[(int) file.length()];

/*
 * To read content of the file in byte array, use
 * int read(byte[] byteArray) method of java FileInputStream class.
 */
fin.read(fileContent);
byte[] data = fileContent;

CharsetDetector detector = new CharsetDetector();
detector.setText(data);

CharsetMatch cm = detector.detect();
if (cm != null) {
    int confidence = cm.getConfidence();
    System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
    // Here you have the encoding name and the confidence.
    // In my case, if the confidence is > 50 I return the encoding, else I return the default value.
    if (confidence > 50) {
        charset = cm.getName();
    }
}
Remember to put in all the try-catch blocks it needs.
I hope this works for you.
If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.
I found a nice third party library which can detect actual encoding:
http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
I didn't test it extensively but it seems to work.
For ISO8859_1 files, there is not an easy way to distinguish them from ASCII. For Unicode files however one can generally detect this based on the first few bytes of the file.
UTF-16 files usually include a Byte Order Mark (BOM) at the very beginning of the file, and UTF-8 files sometimes do. The BOM is a zero-width non-breaking space.
Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using Unix or Cygwin, you can check the BOM with the file command. For example:
$ file sample2.sql
sample2.sql: Unicode text, UTF-16, big-endian
For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
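For reference, a tiny sketch of such a BOM check in plain Java (it only covers the common UTF-8/UTF-16 marks and returns null when no BOM is present):
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Returns the charset implied by a BOM, or null if the file has no recognizable BOM.
static String charsetFromBom(Path file) throws IOException {
    byte[] bom = new byte[3];
    int read;
    try (InputStream in = Files.newInputStream(file)) {
        read = in.read(bom);
    }
    if (read >= 3 && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF) {
        return "UTF-8";
    }
    if (read >= 2 && bom[0] == (byte) 0xFE && bom[1] == (byte) 0xFF) {
        return "UTF-16BE";
    }
    if (read >= 2 && bom[0] == (byte) 0xFF && bom[1] == (byte) 0xFE) {
        return "UTF-16LE";
    }
    return null;
}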
An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.
Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();
A good strategy to handle this is to auto-detect the input charset.
I use org.xml.sax.InputSource in Java 11 to solve it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
inputStreamReader = new InputStreamReader(
        inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
In plain Java:
final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };
List<String> lines;

for (String encoding : encodings) {
    try {
        lines = Files.readAllLines(path, Charset.forName(encoding));
        for (String line : lines) {
            // do something...
        }
        break;
    } catch (IOException ioe) {
        System.out.println(encoding + " failed, trying next.");
    }
}
This approach will try the encodings one by one until one works or we run out of them.
(BTW, my encodings list has only those items because they are the charset implementations required on every Java platform; see https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)
You can pick the appropriate charset in the constructor:
new InputStreamReader(new FileInputStream(in), "ISO8859_1");
