Using boilerpipe to extract non-english articles - java

I am trying to use the boilerpipe Java library to extract news articles from a set of websites.
It works great for English text, but for text with special characters, for example words with accent marks (história), these special characters are not extracted correctly. I think it is an encoding problem.
The boilerpipe FAQ says "If you extract non-English text you might need to change some parameters" and then refers to a paper. I found no solution in that paper.
My question is: are there any parameters in boilerpipe where I can specify the encoding? Is there any way to work around this and get the text correctly?
How I'm using the library:
(first attempt, based on the URL):
URL url = new URL(link);
String article = ArticleExtractor.INSTANCE.getText(url);
(second attempt, based on the HTML source code):
String article = ArticleExtractor.INSTANCE.getText(html_page_as_string);

You don't have to modify inner Boilerpipe classes.
Just pass an InputSource object to the ArticleExtractor.INSTANCE.getText() method and force the encoding on that object. For example:
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
Regards!

Well, from what I see, when you use it like that, the library will automatically choose which encoding to use. From the HTMLFetcher source:
public static HTMLDocument fetch(final URL url) throws IOException {
final URLConnection conn = url.openConnection();
final String ct = conn.getContentType();
Charset cs = Charset.forName("Cp1252");
if (ct != null) {
Matcher m = PAT_CHARSET.matcher(ct);
if(m.find()) {
final String charset = m.group(1);
try {
cs = Charset.forName(charset);
} catch (UnsupportedCharsetException e) {
// keep default
}
}
}
Try debugging their code a bit, starting with ArticleExtractor.getText(URL), and see if you can override the encoding.
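To see why the fallback charset in that excerpt matters, here is a minimal stand-alone sketch. It is not boilerpipe code, and the class and method names are made up for illustration. It assumes the page was sent as UTF-8 but without a charset in its Content-Type header, so the bytes end up being decoded with the Cp1252 (windows-1252) default:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    /** Encodes the text as UTF-8 and decodes those bytes with another charset. */
    static String decodeUtf8BytesWith(String text, Charset decodingCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, decodingCharset);
    }

    public static void main(String[] args) {
        // U+00F3 is the accented o; the escape keeps this source file pure ASCII.
        String original = "hist\u00f3ria";
        Charset cp1252 = Charset.forName("windows-1252");

        // Wrong charset: the two UTF-8 bytes of the accented letter become two characters.
        String garbled = decodeUtf8BytesWith(original, cp1252);
        System.out.println(original.length() + " chars became " + garbled.length());

        // Right charset: the text survives the round trip.
        System.out.println(decodeUtf8BytesWith(original, StandardCharsets.UTF_8).equals(original));
    }
}
```

The garbled result is the classic two-characters-per-accent pattern, which is usually a good hint that UTF-8 bytes were decoded with a single-byte charset.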

Ok, got a solution.
As Andrei said, I had to change the class HTMLFetcher, which is in the package de.l3s.boilerpipe.sax.
What I did was convert all the fetched text to UTF-8.
At the end of the fetch function, I added two lines and changed the last one:
final byte[] data = bos.toByteArray(); // stays the same
byte[] utf8 = new String(data, cs).getBytes("UTF-8"); // new line (conversion)
cs = Charset.forName("UTF-8"); // set the charset to UTF-8
return new HTMLDocument(utf8, cs); // edited line
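Outside of boilerpipe, the conversion step on its own can be sketched like this. The class and method names are hypothetical, and the result is only correct if the source charset passed in really is the one the bytes were written in:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8Transcoder {

    /** Decodes the bytes with the given source charset and re-encodes them as UTF-8. */
    static byte[] toUtf8(byte[] data, Charset source) {
        return new String(data, source).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xF3 is the accented o of "história" in ISO-8859-1 (one byte).
        byte[] latin1 = { (byte) 0xF3 };
        byte[] utf8 = toUtf8(latin1, StandardCharsets.ISO_8859_1);
        // In UTF-8 the same character takes two bytes: 0xC3 0xB3.
        System.out.println(Arrays.toString(utf8));
    }
}
```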

Boilerpipe's ArticleExtractor uses some algorithms that have been specifically tailored to English, such as measuring the number of words in average phrases. In any language that is more or less verbose than English (i.e. every other language) these algorithms will be less accurate.
Additionally, the library uses some English phrases to try and find the end of the article (comments, post a comment, have your say, etc) which will clearly not work in other languages.
This is not to say that the library will outright fail - just be aware that some modification is likely needed for good results in non-English languages.

Java:
import java.net.URL;
import org.xml.sax.InputSource;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
public class Boilerpipe {
public static void main(String[] args) {
try{
URL url = new URL("http://www.azeri.ru/az/traditions/kuraj_pehlevanov/");
InputSource is = new InputSource();
is.setEncoding("UTF-8");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);
System.out.println(text);
}catch(Exception e){
e.printStackTrace();
}
}
}
Eclipse:
Run > Run Configurations > Common Tab. Set Encoding to Other(UTF-8), then click Run.

I had the same problem; cnr's solution works great. Just change the UTF-8 encoding to ISO-8859-1. Thanks.
URL url = new URL("http://some-page-with-utf8-encodeing.tld");
InputSource is = new InputSource();
is.setEncoding("ISO-8859-1");
is.setByteStream(url.openStream());
String text = ArticleExtractor.INSTANCE.getText(is);

Related

How to edit MS Word documents using Java?

I have a few Word templates, and my requirement is to replace some of the words/placeholders in the document based on user input, using Java. I tried a lot of libraries, including 2-3 versions of docx4j, but nothing worked well; they all just didn't do anything!
I know this question has been asked before, but I have tried all the options I know. So, which Java library can I use to "really" replace/edit these templates? My preference goes to "easy to use / few lines of code" type libraries.
I am using Java 8 and my MS Word templates are in MS Word 2007 format.
Update
This code is written by using the code sample provided by SO member Joop Eggen
public Main() throws URISyntaxException, IOException, ParserConfigurationException, SAXException
{
URI docxUri = new URI("C:/Users/Yohan/Desktop/yohan.docx");
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties);
Path documentXmlPath = zipFS.getPath("/word/document.xml");
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(Files.newInputStream(documentXmlPath));
byte[] content = Files.readAllBytes(documentXmlPath);
String xml = new String(content, StandardCharsets.UTF_8);
//xml = xml.replace("#DATE#", "2014-09-24");
xml = xml.replace("#NAME#", StringEscapeUtils.escapeXml("Sniper"));
content = xml.getBytes(StandardCharsets.UTF_8);
Files.write(documentXmlPath, content);
}
However, this returns the error below:
java.nio.file.ProviderNotFoundException: Provider "C" not found
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:341)
at java.nio.file.FileSystems.newFileSystem(FileSystems.java:276)
For docx (a zip with XML and other files), one may use a Java zip file system together with XML or text processing.
URI docxUri = ,,, // "jar:file:/C:/... .docx"
Map<String, String> zipProperties = new HashMap<>();
zipProperties.put("encoding", "UTF-8");
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties)) {
Path documentXmlPath = zipFS.getPath("/word/document.xml");
When using XML:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.parse(Files.newInputStream(documentXmlPath));
//Element root = doc.getDocumentElement();
You can then use XPath to find the places, and write the XML back again.
It might even be that you do not need XML at all and could simply replace the placeholders:
byte[] content = Files.readAllBytes(documentXmlPath);
String xml = new String(content, StandardCharsets.UTF_8);
xml = xml.replace("#DATE#", "2014-09-24");
xml = xml.replace("#NAME#", StringEscapeUtils.escapeXml("Sniper"));
...
content = xml.getBytes(StandardCharsets.UTF_8);
Files.delete(documentXmlPath);
Files.write(documentXmlPath, content);
For fast development, rename a copy of the .docx to a name with the .zip file extension, and inspect the files.
Files.write should already apply StandardOpenOption.TRUNCATE_EXISTING, but I have added Files.delete because some error occurred. See comments.
Try Apache POI. POI can work with doc and docx, but docx is better documented and therefore better supported.
UPD: You can use XDocReport, which uses POI. I also recommend using xlsx for templates because it is more suitable and better documented.
I spent a few days on this issue until I found that what makes the difference is the try-with-resources on the FileSystem instance, which appears in Joop Eggen's snippet but not in the question's snippet:
try (FileSystem zipFS = FileSystems.newFileSystem(docxUri, zipProperties))
Without such a try-with-resources block, the FileSystem resource will not be closed (as explained in the Java tutorial), and the Word document will not be modified.
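Putting the two points together (a jar: URI and a try-with-resources block), a minimal sketch could look like the following. The class and helper names are made up, it works on a small zip created on the fly rather than a real .docx, and it does no XML escaping of the inserted value. Note that the question's URI, "C:/Users/...", has no jar:file: prefix, so java.net.URI reads "C" as the scheme, which is where the Provider "C" error comes from.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

final class DocxPlaceholderSketch {

    /** Opens a zip-based file (a .docx is one) as a file system. */
    private static FileSystem open(Path zipFile, boolean create) throws IOException {
        // Path.toUri() yields "file:///..."; the "jar:" prefix selects the zip provider.
        URI uri = URI.create("jar:" + zipFile.toUri());
        Map<String, String> env = new HashMap<>();
        env.put("create", String.valueOf(create));
        env.put("encoding", "UTF-8");
        return FileSystems.newFileSystem(uri, env);
    }

    /** Replaces a placeholder inside one entry of the archive. */
    static void replaceInEntry(Path zipFile, String entry, String placeholder, String value)
            throws IOException {
        try (FileSystem zipFs = open(zipFile, false)) {
            Path p = zipFs.getPath(entry);
            String xml = new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
            // The value is inserted as-is; real user input should be XML-escaped first.
            Files.write(p, xml.replace(placeholder, value).getBytes(StandardCharsets.UTF_8));
        } // closing the file system is what writes the change back into the archive
    }

    /** Reads one entry of the archive as UTF-8 text. */
    static String readEntry(Path zipFile, String entry) throws IOException {
        try (FileSystem zipFs = open(zipFile, false)) {
            return new String(Files.readAllBytes(zipFs.getPath(entry)), StandardCharsets.UTF_8);
        }
    }

    /** Creates a throwaway archive with a single entry, standing in for a template. */
    static Path createSample(String entry, String content) throws IOException {
        Path zip = Files.createTempFile("sample", ".zip");
        Files.delete(zip); // let the zip provider create the archive itself
        try (FileSystem zipFs = open(zip, true)) {
            Path p = zipFs.getPath(entry);
            Files.createDirectories(p.getParent());
            Files.write(p, content.getBytes(StandardCharsets.UTF_8));
        }
        return zip;
    }

    public static void main(String[] args) throws IOException {
        Path zip = createSample("/word/document.xml", "<w:t>Hello #NAME#</w:t>");
        replaceInEntry(zip, "/word/document.xml", "#NAME#", "Sniper");
        System.out.println(readEntry(zip, "/word/document.xml"));
        Files.delete(zip);
    }
}
```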
Stepping back a bit, there are about 4 different approaches for editing words/placeholders:
MERGEFIELD or DOCPROPERTY fields (if you are having problems with this in docx4j, then you have probably not set up your input docx correctly)
content control databinding
variable replacement on the document surface (either at the DOM/SAX level, or using a library)
do stuff as XHTML, then import that
Before choosing one, you should decide whether you also need to be able to handle:
repeating data (eg adding table rows)
conditional content (eg entire paragraphs which will either be present or absent)
adding images
If you need these, then MERGEFIELD or DOCPROPERTY fields are probably out (though you can also use IF fields, if you can find a library which supports them). And adding images makes DOM/SAX manipulation as advocated in one of the other answers, messier and error prone.
The other things to consider are:
your authors: how technical are they? What does that imply for the authoring UI?
the "user input" you mention for variable replacement, is this given, or is obtaining it part of the problem you are solving?
Please try this to edit or replace the word in document
public class UpdateDocument {
public static void main(String[] args) throws IOException {
UpdateDocument obj = new UpdateDocument();
obj.updateDocument(
"c:\\test\\template.docx",
"c:\\test\\output.docx",
"Piyush");
}
private void updateDocument(String input, String output, String name)
throws IOException {
try (XWPFDocument doc = new XWPFDocument(
Files.newInputStream(Paths.get(input)))
) {
List<XWPFParagraph> xwpfParagraphList = doc.getParagraphs();
//Iterate over paragraph list and check for the replaceable text in each paragraph
for (XWPFParagraph xwpfParagraph : xwpfParagraphList) {
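for (XWPFRun xwpfRun : xwpfParagraph.getRuns()) {
String docText = xwpfRun.getText(0);
// getText(0) returns null for runs that carry no text, so skip those
if (docText == null) {
continue;
}
//replacement and setting position
docText = docText.replace("${name}", name);
xwpfRun.setText(docText, 0);
}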
}
// save the docs
try (FileOutputStream out = new FileOutputStream(output)) {
doc.write(out);
}
}
}
}

Specific characters not rendering properly in Java

I have an issue when displaying strings received from a server in a JTable. Some specific characters appear as little white squares instead of "é" or "à" etc. I tried a lot of things but none of them fixed my problem. I'm working with Eclipse under Windows. The server was developed using Visual Studio 2010.
The server sends an XML file using tinyXML2, the client uses JDom to read it. The font used is "Dialog". The server takes the strings from an Oracle database.
I assume this is an encoding problem, but I haven't been able to fix it yet.
Does anyone have an idea ?
Thx
Arnaud
EDIT : As requested, this is how I use JDom
public static Player fromXML(Element e)
{
Player result = new Player();
String e_text = null;
try
{
e_text = e.getChildText(XMLTags.XML_Player_playerId);
if (e_text != null) result.setID(Integer.parseInt(e_text));
e_text = e.getChildText(XMLTags.XML_Player_lastName);
if (e_text != null) result.setName(e_text);
e_text = e.getChildText(XMLTags.XML_Player_point_scored);
if (e_text != null) result.addSpecial(STAT_SCORED, Double.parseDouble(e_text));
e_text = e.getChildText(XMLTags.XML_Player_point_scored_last);
if (e_text != null) result.addSpecial(STAT_SCORED_LAST, Double.parseDouble(e_text));
}
catch (Exception ex) {
ex.printStackTrace();
}
return result;
}
public static Document load(String filename) {
File XMLFile = new File(CLIENT_TO_SERVER, filename);
SAXBuilder sxb = new SAXBuilder();
Document document = new Document();
try
{
document = sxb.build(new File(XMLFile.getPath()));
} catch(Exception e){e.printStackTrace();}
return document;
}
Read the file using the correct encoding, something like:
document = sxb.build(new BufferedReader(new InputStreamReader(new FileInputStream(XMLFile.getPath()), "UTF8")));
Notes:
1. First determine which character encoding is used in that file, and specify that charset instead of UTF8 above.
2. In case the encoding is not known, or the file is generated by various systems with different encodings, you may use Mozilla's encoding detector library; see https://code.google.com/p/juniversalchardet/
3. You need to handle UnsupportedEncodingException.
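The same idea, handing the parser a Reader that already decodes with a known charset, can be tried without JDom. The sketch below swaps in the JDK's own DOM parser so that it runs stand-alone; the class name, the sample XML and the choice of ISO-8859-1 are assumptions for illustration only:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class ExplicitCharsetXml {

    /** Parses XML bytes, decoding them with the given charset instead of letting the parser guess. */
    static Document parse(byte[] xmlBytes, Charset charset) throws Exception {
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(xmlBytes), charset)) {
            // With a character stream the parser receives decoded text directly.
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(reader));
        }
    }

    public static void main(String[] args) throws Exception {
        // U+00E9 is e with acute accent; the bytes below are ISO-8859-1, not UTF-8.
        byte[] xml = "<player><lastName>Andr\u00e9</lastName></player>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(xml, StandardCharsets.ISO_8859_1);
        String name = doc.getDocumentElement().getTextContent();
        System.out.println(name.equals("Andr\u00e9")); // prints true
    }
}
```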

how can I detect charset of a web page

I want to fetch the source of a web page in Java, with the correct encoding. I am able to get the content of a web page so far, but for some web pages the content comes back with garbled characters. So I need to detect the charset of that web page.
From my little research I found that there is a jChardet library to do this, but I couldn't import it into my project. Can someone please help me?
By the way, the code below is what reads the web page content:
StringBuilder builder = new StringBuilder();
InputStream is = fURL.openStream();
BufferedReader buffer = null;
buffer = new BufferedReader(new InputStreamReader(is, encodingType));
int byteRead;
while ((byteRead = buffer.read()) != -1) {
builder.append((char) byteRead);
}
buffer.close();
return builder;
Read the Content-Type header of the HTTP response, it's the best way to get the charset. Only apply guessing when you have no alternatives - you do.
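As a rough sketch of that, the header value (for example the string returned by URLConnection.getContentType()) can be parsed with a small regular expression. The class name, method name and fallback parameter are made up for illustration; only JDK classes are used:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContentTypeCharset {

    // Matches charset=xyz or charset="xyz" anywhere in the header value.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type value, or the fallback if none is usable. */
    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(contentType);
        if (!m.find()) {
            return fallback;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            // Covers both illegal and unsupported charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8;
        System.out.println(charsetOf("text/html; charset=ISO-8859-1", fallback)); // prints ISO-8859-1
        System.out.println(charsetOf("text/html", fallback));                     // prints UTF-8
    }
}
```

The resulting Charset can then be passed to the InputStreamReader in place of encodingType.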
You can also use http://jchardet.sourceforge.net/
private static String detectCharset(byte[] body) {
nsDetector det = new nsDetector(nsPSMDetector.ALL);
det.Init(new nsICharsetDetectionObserver() {
public void Notify(String charset) {
HtmlCharsetDetector.found = true;
}
});
boolean done = false;
boolean isAscii = true;
if (isAscii) {
isAscii = det.isAscii(body, body.length);
}
// DoIt if non-ascii and not done yet.
if (!isAscii && !done) {
done = det.DoIt(body, body.length, false);
}
return det.getProbableCharsets()[0];
}
Minimally, you would need to read and parse the HTTP headers to see whether they declare the encoding in HTTP headers and, in the absence of such a declaration (rather common), parse the document itself to find a meta tag that declares the encoding. For XHTML documents, you would need to check the XML declaration and default to utf-8. This would still leave a considerable amount of pages with undeclared encoding, so some heuristics would be needed. You might check the section on encodings in the HTML5 draft, which contains some heuristic overrides too (e.g., treating iso-8859-1 as windows-1252).
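The meta tag step could be sketched roughly as follows. This is a simplification, not the HTML5 sniffing algorithm: it assumes an ASCII-compatible encoding (so it will not work on UTF-16 pages), it only looks at whatever bytes it is given, and the class and method names are invented for the example:

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaCharsetSniffer {

    // Covers both <meta charset="..."> and <meta http-equiv=... content="...; charset=...">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([\\w.:+-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String sniff(byte[] head) {
        // ISO-8859-1 maps every byte to a char, so this decoding step cannot fail;
        // the markup being searched for is plain ASCII anyway.
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page)); // prints utf-8
    }
}
```

A null result means the page declares nothing in its markup, which is the point where a guessing library would take over.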

Checking HTML (Website) tags within Java Code

I have a system in PHP where the user enters a website URL and we download the HTML and check values in tags. I have to rewrite it in Java now. I have been searching for days and can't find any easy way to do the following tasks.
1) download HTML based on URL
2) After downloading HTML check values in tags
THIS WILL NOT BUILD! CAN SOMEONE HELP ME
public String tagValue(String inHTML, String tag) throws DataNotFoundException
{
String value = null;
String searchFor = "/<" + tag + ">(.*?)<\/" + tag + "\>/";
Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
Matcher matcher = pattern.matcher(inHTML);
return value;
}
check out http://download.oracle.com/javase/6/docs/api/java/net/URLConnection.html
google "java html parser" for options. you could also use regular expressions if the requirements are fairly simple and straightforward.
An example follows. It took me a while, I haven't worked with these APIs for a long time.
jcomeau@intrepid:~/tmp$ cat test.java; javac test.java; java test
import java.util.regex.*;
import java.net.*;
import java.io.*;
public class test {
public static void main(String args[]) throws Exception {
URL target = new URL("http://www.example.com/");
URLConnection connection = target.openConnection();
connection.connect();
String html = "", line = null;
BufferedReader input = new BufferedReader(new InputStreamReader(
connection.getInputStream()));
while ((line = input.readLine()) != null) html += line;
Pattern pattern = Pattern.compile("<a href=([^ >]*)[^>]*>([^<]*)");
Matcher matcher = pattern.matcher(html);
System.out.println("href\ttext");
while (matcher.find()) {
System.out.println(matcher.group(1) + "\t" + matcher.group(2));
}
}
}
href text
"/"
"/domains/" Domains
"/numbers/" Numbers
"/protocols/" Protocols
"/about/" About IANA
"/go/rfc2606" RFC 2606
"/about/" About
"/about/presentations/" Presentations
"/about/performance/" Performance
"/reports/" Reports
"/domains/" Domains
"/domains/root/" Root Zone
"/domains/int/" .INT
"/domains/arpa/" .ARPA
"/domains/idn-tables/" IDN Repository
"/protocols/" Protocols
"/numbers/" Number Resources
"/abuse/" Abuse Information
"http://www.icann.org/" Internet Corporation for Assigned Names and Numbers
"mailto:iana@iana.org?subject=General%20website%20feedback" iana@iana.org
1) download HTML based on URL
There are various options. There are some helper libraries, e.g. Apache HTTPComponents. You can also just use Java's built-in classes. See e.g. java code to download a file from server .
2) After downloading HTML check values in tags
You probably want to use an HTML parser. For very simple cases, you could use regular expressions (as it seems you are trying to in your example), but this quickly leads to problems. See this famous question: RegEx match open tags except XHTML self-contained tags
THIS WILL NOT BUILD! CAN SOMEONE HELP ME
To put a "\" (backslash) into a literal Java string, you need to double it (because \ is used to introduce special sequences in a Java string literal). So to get a string with just a "\", write it as
String myBackslash = "\\";
See e.g. How can I print "\t" (as it looks) in Java?
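For what it's worth, here is one way the tagValue method from the question could be made to compile. It is a sketch for very simple cases only, with the caveats about regular expressions and HTML mentioned above. It returns null instead of throwing the question's DataNotFoundException, since that class is not shown, and the PHP-style "/" delimiters are dropped because Java patterns do not use them:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagValue {

    /** Returns the text between the first <tag ...> and </tag>, or null if the tag is absent. */
    static String tagValue(String html, String tag) {
        String quoted = Pattern.quote(tag);
        // (?:\s[^>]*)? allows attributes but stops "a" from matching "<abbr>".
        Pattern pattern = Pattern.compile(
                "<" + quoted + "(?:\\s[^>]*)?>(.*?)</" + quoted + "\\s*>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Example Domain</title></head></html>";
        System.out.println(tagValue(html, "title")); // prints Example Domain
    }
}
```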

Java : How to determine the correct charset encoding of a stream

With reference to the following thread:
Java App : Unable to read iso-8859-1 encoded file correctly
What is the best way to programmatically determine the correct charset encoding of an inputstream/file?
I have tried using the following:
File in = new File(args[0]);
InputStreamReader r = new InputStreamReader(new FileInputStream(in));
System.out.println(r.getEncoding());
But on a file which I know to be encoded with ISO8859_1 the above code yields ASCII, which is not correct, and does not allow me to correctly render the content of the file back to the console.
You cannot determine the encoding of an arbitrary byte stream. This is the nature of encodings. An encoding is a mapping between byte values and their representation, so every encoding "could" be the right one.
The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.
Some streams tell you which encoding was used to create them: XML, HTML. But not an arbitrary byte stream.
Anyway, you could try to guess an encoding on your own if you have to. Every language has a typical frequency for every char. In English the char e appears very often but ê appears very seldom. In an ISO-8859-1 stream there are usually no 0x00 bytes, but a UTF-16 stream has a lot of them.
Or: you could ask the user. I've already seen applications which present you a snippet of the file in different encodings and ask you to select the "correct" one.
I have used this library, similar to jchardet for detecting encoding in Java:
https://github.com/albfernandez/juniversalchardet
check this out:
http://site.icu-project.org/ (icu4j)
they have libraries for detecting charset from IOStream
could be simple like this:
BufferedInputStream bis = new BufferedInputStream(input);
CharsetDetector cd = new CharsetDetector();
cd.setText(bis);
CharsetMatch cm = cd.detect();
if (cm != null) {
reader = cm.getReader();
charset = cm.getName();
}else {
throw new UnsupportedCharsetException("unknown"); // no charset could be detected
}
Here are my favorites:
TikaEncodingDetector
Dependency:
<dependency>
<groupId>org.apache.any23</groupId>
<artifactId>apache-any23-encoding</artifactId>
<version>1.1</version>
</dependency>
Sample:
public static Charset guessCharset(InputStream is) throws IOException {
return Charset.forName(new TikaEncodingDetector().guessEncoding(is));
}
GuessEncoding
Dependency:
<dependency>
<groupId>org.codehaus.guessencoding</groupId>
<artifactId>guessencoding</artifactId>
<version>1.4</version>
<type>jar</type>
</dependency>
Sample:
public static Charset guessCharset2(File file) throws IOException {
return CharsetToolkit.guessEncoding(file, 4096, StandardCharsets.UTF_8);
}
You can certainly validate the file for a particular charset by decoding it with a CharsetDecoder and watching out for "malformed-input" or "unmappable-character" errors. Of course, this only tells you if a charset is wrong; it doesn't tell you if it is correct. For that, you need a basis of comparison to evaluate the decoded results, e.g. do you know beforehand if the characters are restricted to some subset, or whether the text adheres to some strict format? The bottom line is that charset detection is guesswork without any guarantees.
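A minimal sketch of that validation using only the JDK (the class and method names are made up). As said above, a true result only means the bytes are not invalid in that charset, not that the charset is the right one:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class CharsetValidator {

    /** Returns false if decoding hits malformed input or an unmappable character. */
    static boolean isValid(byte[] data, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is e with acute accent in ISO-8859-1.
        byte[] latin1 = { 'c', 'a', 'f', (byte) 0xE9 };
        System.out.println(isValid(latin1, StandardCharsets.UTF_8));      // prints false
        System.out.println(isValid(latin1, StandardCharsets.ISO_8859_1)); // prints true
    }
}
```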
Which library to use?
As of this writing, three libraries stand out:
GuessEncoding
ICU4j
juniversalchardet
I don't include Apache Any23 because it uses ICU4j 3.4 under the hood.
How to tell which one has detected the right charset (or as close as possible)?
It's impossible to certify the charset detected by each of the above libraries. However, it's possible to ask them in turn and score the returned responses.
How to score the returned response?
Each response can be assigned one point. The more points a response has, the more confidence we can have in the detected charset. This is a simple scoring method; you can devise others.
Is there any sample code?
Here is a full snippet implementing the strategy described in the previous lines.
public static String guessEncoding(InputStream input) throws IOException {
// Load input data
long count = 0;
int n = 0, EOF = -1;
byte[] buffer = new byte[4096];
ByteArrayOutputStream output = new ByteArrayOutputStream();
while ((EOF != (n = input.read(buffer))) && (count <= Integer.MAX_VALUE)) {
output.write(buffer, 0, n);
count += n;
}
if (count > Integer.MAX_VALUE) {
throw new RuntimeException("Inputstream too large.");
}
byte[] data = output.toByteArray();
// Detect encoding
Map<String, int[]> encodingsScores = new HashMap<>();
// * GuessEncoding
updateEncodingsScores(encodingsScores, new CharsetToolkit(data).guessEncoding().displayName());
// * ICU4j
CharsetDetector charsetDetector = new CharsetDetector();
charsetDetector.setText(data);
charsetDetector.enableInputFilter(true);
CharsetMatch cm = charsetDetector.detect();
if (cm != null) {
updateEncodingsScores(encodingsScores, cm.getName());
}
// * juniversalchardet
UniversalDetector universalDetector = new UniversalDetector(null);
universalDetector.handleData(data, 0, data.length);
universalDetector.dataEnd();
String encodingName = universalDetector.getDetectedCharset();
if (encodingName != null) {
updateEncodingsScores(encodingsScores, encodingName);
}
// Find winning encoding
Map.Entry<String, int[]> maxEntry = null;
for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
if (maxEntry == null || (e.getValue()[0] > maxEntry.getValue()[0])) {
maxEntry = e;
}
}
String winningEncoding = maxEntry.getKey();
//dumpEncodingsScores(encodingsScores);
return winningEncoding;
}
private static void updateEncodingsScores(Map<String, int[]> encodingsScores, String encoding) {
String encodingName = encoding.toLowerCase();
int[] encodingScore = encodingsScores.get(encodingName);
if (encodingScore == null) {
encodingsScores.put(encodingName, new int[] { 1 });
} else {
encodingScore[0]++;
}
}
private static void dumpEncodingsScores(Map<String, int[]> encodingsScores) {
System.out.println(toString(encodingsScores));
}
private static String toString(Map<String, int[]> encodingsScores) {
String GLUE = ", ";
StringBuilder sb = new StringBuilder();
for (Map.Entry<String, int[]> e : encodingsScores.entrySet()) {
sb.append(e.getKey() + ":" + e.getValue()[0] + GLUE);
}
int len = sb.length();
sb.delete(len - GLUE.length(), len);
return "{ " + sb.toString() + " }";
}
Improvements:
The guessEncoding method reads the inputstream entirely. For large inputstreams this can be a concern. All these libraries would read the whole inputstream. This would imply a large time consumption for detecting the charset.
It's possible to limit the initial data loading to a few bytes and perform the charset detection on those few bytes only.
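One possible way to do that is to read only a bounded prefix of the stream and hand that byte array to the detectors. The helper below is a sketch with made-up names. Keep in mind that a cut-off point can fall in the middle of a multi-byte character, and that a short prefix gives statistical detectors less to work with:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

final class HeadReader {

    /** Reads at most maxBytes from the stream, or fewer if the stream ends earlier. */
    static byte[] readHead(InputStream input, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        int remaining = maxBytes;
        while (remaining > 0) {
            int n = input.read(buffer, 0, Math.min(buffer.length, remaining));
            if (n == -1) {
                break; // end of stream reached before the limit
            }
            out.write(buffer, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream big = new ByteArrayInputStream(new byte[10_000]);
        System.out.println(readHead(big, 4096).length); // prints 4096
    }
}
```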
As far as I know, there is no general library in this context that is suitable for all types of problems. So for each problem you should test the existing libraries and select the one that best satisfies your problem's constraints, but often none of them is appropriate. In those cases you can write your own encoding detector, as I have done.
I've written a meta Java tool for detecting the charset encoding of HTML web pages, using IBM ICU4j and Mozilla JCharDet as built-in components. Here you can find my tool; please read the README section before anything else. Also, you can find some basic concepts of this problem in my paper and in its references.
Below are some helpful observations from my work:
Charset detection is not a foolproof process, because it is essentially based on statistical data; what actually happens is guessing, not detecting
icu4j is the main tool in this context by IBM, imho
Both TikaEncodingDetector and Lucene-ICU4j use icu4j, and their accuracy did not differ meaningfully from that of icu4j in my tests (by at most 1%, as I remember)
icu4j is much more general than jchardet; icu4j is just a bit biased toward IBM family encodings, while jchardet is strongly biased toward UTF-8
Due to the widespread use of UTF-8 in the HTML world, jchardet is overall a better choice than icu4j, but it is not the best choice!
icu4j is great for East Asian specific encodings like EUC-KR, EUC-JP, SHIFT_JIS, BIG5 and the GB family encodings
Both icu4j and jchardet fail badly on HTML pages with Windows-1251 and Windows-1256 encodings. Windows-1251 aka cp1251 is widely used for Cyrillic-based languages like Russian, and Windows-1256 aka cp1256 is widely used for Arabic
Almost all encoding detection tools use statistical methods, so the accuracy of the output strongly depends on the size and the contents of the input
Some encodings are essentially the same with only partial differences, so in some cases the guessed or detected encoding may be false but at the same time be true, as with Windows-1252 and ISO-8859-1 (refer to the last paragraph of section 5.2 of my paper)
The libs above are simple BOM detectors, which of course only work if there is a BOM at the beginning of the file. Take a look at http://jchardet.sourceforge.net/, which does scan the text.
If you use ICU4J (http://icu-project.org/apiref/icu4j/), here is my code:
String charset = "ISO-8859-1"; // Default charset, put whatever you want
// Read the whole file into a byte array. Unlike a single FileInputStream.read call,
// Files.readAllBytes always reads the full content and closes the file afterwards.
byte[] data = Files.readAllBytes(file.toPath());
CharsetDetector detector = new CharsetDetector();
detector.setText(data);
CharsetMatch cm = detector.detect();
if (cm != null) {
int confidence = cm.getConfidence();
System.out.println("Encoding: " + cm.getName() + " - Confidence: " + confidence + "%");
//Here you have the encode name and the confidence
//In my case if the confidence is > 50 I return the encode, else I return the default value
if (confidence > 50) {
charset = cm.getName();
}
}
Remember to add all the try-catch blocks that are needed.
I hope this works for you.
If you don't know the encoding of your data, it is not so easy to determine, but you could try to use a library to guess it. Also, there is a similar question.
I found a nice third party library which can detect actual encoding:
http://glaforge.free.fr/wiki/index.php?wiki=GuessEncoding
I didn't test it extensively but it seems to work.
For ISO8859_1 files, there is no easy way to distinguish them from ASCII. For Unicode files, however, one can often detect this from the first few bytes of the file.
UTF-16 files usually, and UTF-8 files sometimes, start with a Byte Order Mark (BOM) at the very beginning of the file. The BOM is the character U+FEFF, a zero-width non-breaking space.
Unfortunately, for historical reasons, Java does not detect this automatically. Programs like Notepad will check the BOM and use the appropriate encoding. Using unix or Cygwin, you can check the BOM with the file command. For example:
$ file sample2.sql
sample2.sql: Unicode text, UTF-16, big-endian
For Java, I suggest you check out this code, which will detect the common file formats and select the correct encoding: How to read a file and automatically specify the correct encoding
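Checking for a BOM by hand takes only a few lines. The sketch below (class and method names are hypothetical) covers just the UTF-8 and UTF-16 marks and deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes as UTF-16LE. A null result means "no BOM", not "not Unicode":

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSniffer {

    /** Returns the charset indicated by a leading BOM, or null if there is none. */
    static Charset detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return StandardCharsets.UTF_8;
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return StandardCharsets.UTF_16BE;
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned values, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf16be = { (byte) 0xFE, (byte) 0xFF, 0x00, 0x41 };
        System.out.println(detect(utf16be)); // prints UTF-16BE
    }
}
```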
An alternative to TikaEncodingDetector is to use Tika AutoDetectReader.
Charset charset = new AutoDetectReader(new FileInputStream(file)).getCharset();
A good strategy to handle this, is with a way to auto detect the input charset.
I use org.xml.sax.InputSource in Java 11 to solve it:
...
import org.xml.sax.InputSource;
...
InputSource inputSource = new InputSource(inputStream);
inputStreamReader = new InputStreamReader(
inputSource.getByteStream(), inputSource.getEncoding()
);
Input sample:
<?xml version="1.0" encoding="utf-16"?>
<rss xmlns:dc="https://purl.org/dc/elements/1.1/" version="2.0">
<channel>
...
In plain Java:
final String[] encodings = { "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16" };
List<String> lines;
for (String encoding : encodings) {
try {
lines = Files.readAllLines(path, Charset.forName(encoding));
for (String line : lines) {
// do something...
}
break;
} catch (IOException ioe) {
System.out.println(encoding + " failed, trying next.");
}
}
This approach will try the encodings one by one until one works or we run out of them. Note that ISO-8859-1 can decode any byte sequence, so it never fails and the entries listed after it are never tried.
(BTW my encodings list has only those items because they are the charset implementations required on every Java platform, https://docs.oracle.com/javase/9/docs/api/java/nio/charset/Charset.html)
Can you pick the appropriate charset in the constructor?
new InputStreamReader(new FileInputStream(in), "ISO8859_1");
