I'm scraping Wikipedia pages with Java in order to extract information contained within infoboxes.
All works fine, except for the character encoding.
Wikipedia pages use "UTF-8" encoding.
The Ubuntu eclipse console uses "UTF-8" as default encoding as well.
However, the Eclipse console shows some weird symbols in the scraped output (e.g. Smith Â· Ricardo instead of Smith · Ricardo).
This is the function I use to read data (it traverses all descendants of a node and joins their text content at the end):
private String getTextContent(Node node) {
    String text = "";
    List<Node> children = null;
    if (isTextNode(node)) {
        return node.getNodeValue();
    }
    else if (!node.hasChildNodes()) {
        return "";
    }
    else {
        children = toList(node.getChildNodes());
        for (Node childNode : children) {
            text += getTextContent(childNode);
        }
    }
    return text;
}
I forgot to mention that I'm using the JTidy library for scraping.
The console might be correctly interpreting UTF-8, but if you've got the wrong encoding when you read the data over the network, then you're going to run into problems.
Specify UTF-8 as the encoding for JTidy to use.
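For example, a minimal sketch (setInputEncoding/setOutputEncoding exist in recent JTidy releases; the URL and method name here are placeholders):

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

// Parse a page with JTidy, forcing UTF-8 for the network stream.
Document parseUtf8(String pageUrl) throws IOException {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");   // decode the bytes read over the network as UTF-8
    tidy.setOutputEncoding("UTF-8");
    try (InputStream in = new URL(pageUrl).openStream()) {
        return tidy.parseDOM(in, null);
    }
}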
Right-click the project in Eclipse > Run Configurations... > Common tab, and make sure UTF-8 is selected as the console encoding there.
I ran into some problems when using PDFBox to extract text. There are embedded Type3 fonts in my PDF, and the digits in that part cannot be extracted correctly. Can someone give me some guidance? Thank you.
My PDFBox version is 2.0.22.
The correct output would be [USD-001]; the actual (wrong) output is [USD- ].
import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.io.RandomAccessBufferedFileInputStream;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public static String readPDF(File file) throws IOException {
    RandomAccessBufferedFileInputStream rbi = null;
    PDDocument pdDocument = null;
    String text = "";
    try {
        rbi = new RandomAccessBufferedFileInputStream(file);
        PDFParser parser = new PDFParser(rbi);
        parser.setLenient(false);
        parser.parse();
        pdDocument = parser.getPDDocument();
        PDFTextStripper textStripper = new PDFTextStripper();
        text = textStripper.getText(pdDocument);
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // close the document too, and guard against an NPE if opening the stream failed
        if (pdDocument != null) {
            pdDocument.close();
        }
        if (rbi != null) {
            rbi.close();
        }
    }
    return text;
}
I tried using PDFBox to convert the PDF to an image and everything rendered fine. I just want to get the content as normal text.
PDFDebugger output
The PDF file: http://tmp.link/f/6249a07f6e47f
There are a number of aspects of this file making text extraction difficult.
First of all the font itself boycotts text extraction. In its ToUnicode stream we find the mappings:
1 begincodespacerange
<00> <ff> endcodespacerange
2 beginbfchar
<22> <0000> <23> <0000> endbfchar
I.e. the two character codes of interest are both mapped to U+0000, not to U+0030 ('0') and U+0031 ('1') as they should have been.
Also the Encoding is not helping at all:
<</Type/Encoding/Differences[ 0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g0/g121/g122]>>
The glyph names /g121 and /g122 don't have a standardized meaning either.
PDFBox text extraction works with these two properties of a font and, therefore, fails here.
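You can confirm the broken mapping yourself; here is a minimal sketch against the PDFBox 2.x API (the file argument and method name are placeholders):

import java.io.File;
import java.io.IOException;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public static void dumpToUnicode(File file) throws IOException {
    try (PDDocument doc = PDDocument.load(file)) {
        for (PDPage page : doc.getPages()) {
            PDResources res = page.getResources();
            for (COSName fontName : res.getFontNames()) {
                PDFont font = res.getFont(fontName);
                // 0x22 and 0x23 are the codes mapped to U+0000 above
                for (int code : new int[] { 0x22, 0x23 }) {
                    System.out.printf("%s %02X -> %s%n",
                            font.getName(), code, font.toUnicode(code));
                }
            }
        }
    }
}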
Adobe Acrobat, on the other hand, also makes use of ActualText during text extraction.
In the file there are such entries. Unfortunately, though, they are erroneous, like this for the digit '0':
/P <</MCID 23>>/Span <</ActualText<FEFF0030>>>BDC
The BDC instruction only expects a single name and a single dictionary. The above sequence of name, dictionary, name, and dictionary, therefore, is invalid.
Because of that, Adobe Acrobat also used to not extract the actual text here. Only recently, probably as recently as the early 2022 releases, did Acrobat start extracting a '0' here.
Actually, one known "trick" to prevent one's PDFs from being text-extracted by regular text extractor programs is to add incorrect ToUnicode and Encoding information but correct ActualText entries.
So it's possible the error in your file is actually an application of this trick, maybe even by design, with the erroneous ActualText twist added to lead text extractors with some ActualText support astray while still allowing copy & paste from Adobe Acrobat.
I am having trouble displaying the "velar nasal" character (ŋ) (but I assume the same problem would arise with other rare characters).
I have a MySQL table which contains a word with this character.
When my code retrieves it to display in my HTML page, it is displayed as a question mark.
I have tried a number of things:
1) Tried using MySQL's CONVERT to convert the retrieved string to UTF-8 because I understood that the string is stored in my table as "Latin1":
SELECT CONVERT(Name USING utf8)
Instead of:
SELECT Name
This did not help. Also, when I saved a string with the problematic word ("Yolŋu") directly in my Java code and passed that String through the rest of the code, the problem still occurred (i.e. the problem does not lie in the different character encoding that my DB uses).
2) I also tried creating a new String from bytes:
new String(name.getBytes("UTF-8"));
The String is being passed from Java to the HTML via a JSONObject that is passed to a JavaScript file:
Relevant JSON code:
JSONArray names = new JSONArray();
for (int iD : iDs)
{
    JSONObject nameData = new JSONObject();
    String name = NameDB.getNameName(iD);
    nameData.put("label", name);
    nameData.put("value", iD);
    names.put(nameData);
}
return names;
Relevant servlet code:
response.setContentType("application/json");
try (PrintWriter out = response.getWriter())
{
    out.print(namesJSONArray);
}
Relevant js code:
An ajax call to the servlet is made via jquery ui's autocomplete "source" option.
I am pretty new to coding in general and very new to the character encoding topic.
Thank you.
First, in Java a String already holds correct Unicode, so new String(string.getBytes(...), ...) is a hack, with its own troubles.
1. The database
It would be nice if the database held the text in UTF-8. The encoding can be set on database, table and column level. The first thing is to investigate how the text is stored. A table dump (mysqldump) would be least error prone.
If you can use UTF-8, this must be set for MySQL on the database engine, and for the data transfer in the Java driver.
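For MySQL Connector/J the transfer encoding can be set in the connection URL; a minimal sketch (host, port, database, and method name are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

Connection openUtf8Connection(String user, String password) throws SQLException {
    // useUnicode/characterEncoding tell Connector/J to transfer text as UTF-8
    String url = "jdbc:mysql://localhost:3306/mydb"
            + "?useUnicode=true&characterEncoding=UTF-8";
    return DriverManager.getConnection(url, user, password);
}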
In any case you can check a round trip in Java via JDBC by filling a table field and reading it back, and also by reading the existing troublesome field.
Dump the code points of the string.
String dump(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); ) {
        int cp = s.codePointAt(i);
        if (32 < cp && cp < 128) {
            sb.append((char) cp);
        } else {
            sb.append("U+").append(Integer.toHexString(cp));
        }
        sb.append(' ');
        i += Character.charCount(cp);
    }
    return sb.toString();
}
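For instance, if the troublesome word survived the round trip intact, it should dump like this:

System.out.println(dump("Yolŋu"));   // prints: Y o l U+14b u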
2. The output
Here probably lies the error. Call at the beginning:
response.setCharacterEncoding("UTF-8");
... response.getWriter(); // now converts Java's Unicode text to UTF-8
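Applied to the servlet code from the question, that would look like this (note that setCharacterEncoding must be called before getWriter, or it has no effect):

response.setContentType("application/json");
response.setCharacterEncoding("UTF-8");   // must precede getWriter()
try (PrintWriter out = response.getWriter())
{
    out.print(namesJSONArray);
}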
For HTML a charset specification is in order too, e.g. <meta charset="UTF-8"> in the head. Especially when the HTML page is saved to the file system, the encoding header would otherwise be lost.
You should be sure about the following things:
Your JVM must run with the file.encoding=UTF-8 parameter
Your MySQL table that contains the special characters must be parameterized with UTF-8 encoding
Your web UI should specify a meta tag with the encoding you saved the web page in from your editor, i.e. UTF-8
If the problem persists, try using HTML entities (e.g. &#331; for ŋ) instead.
I am trying to read tags of Russian songs in Java using mp3agic:
Mp3File song;
String title = null;
String artist = null;
try {
    song = new Mp3File(newURI);
    if (song.hasId3v2Tag()) {
        ID3v2 id3v2tag = song.getId3v2Tag();
        title = id3v2tag.getTitle();
        artist = id3v2tag.getArtist();
    } else if (song.hasId3v1Tag()) {
        ID3v1 id3v1tag = song.getId3v1Tag();
        title = id3v1tag.getTitle();
        artist = id3v1tag.getArtist();
    }
} catch (IOException | UnsupportedTagException | InvalidDataException e) {
    // Mp3File's constructor declares these checked exceptions
    e.printStackTrace();
}
However, I get "??-2????????? ?????" instead of "Би-2Скользкие Улицы".
What can I do to resolve this issue?
An explanation of this issue can be found at: https://github.com/mpatric/mp3agic/issues/39
In summary, the problem is that the text encoding is windows-1251 (also known as cp1251). ID3v2 tags with windows-1251 encoded strings (or any other encoding that is not one of the 4 supported encodings for ID3v2) are not valid. Programmatically differentiating windows-1251 from iso-8859-1 is not easy, so automatically detecting such strings in order to transcode them might be tricky.
Some interesting comments here: https://superuser.com/questions/495775/how-to-translate-wacky-metadata-to-readable-format
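If you know your tags are windows-1251, a common workaround sketch (not an mp3agic API; it assumes the library decoded the raw tag bytes as ISO-8859-1, and it only helps if those bytes survived intact; literal '?' characters in the string mean the data was already lost earlier, e.g. during a lossy conversion or console output):

import java.nio.charset.Charset;

static String fromCp1251(String tagValue) {
    // Reverse the ISO-8859-1 decode (lossless for all 256 byte values)
    // and reinterpret the same bytes as windows-1251.
    return new String(tagValue.getBytes(Charset.forName("ISO-8859-1")),
            Charset.forName("windows-1251"));
}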
I'm using the following method to read in a line of text from an XML document via the web:
public static String getCharacterDataFromElement(Element e) {
    Node child = e.getFirstChild();
    if (child instanceof CharacterData) {
        CharacterData cd = (CharacterData) child;
        return cd.getData();
    }
    return "";
}
It works fine, but if it comes across a character such as an ampersand that is not escaped as &amp;, it will completely ignore that character and the rest of the line. What can I do to rectify this?
The only proper solution is to correct the XML, so that the & is written as &amp;, or the texts are wrapped in <![CDATA[ ... ]]>.
It's not actually XML unless you escape ampersands or use CDATA.
I suspect the talk of the input not being well-formed is a red herring. If the source document contains entity references then an element may contain multiple text node children, and your code is only reading the first of them. It needs to read them all.
(I think there are easier ways of getting the text content of a Node in DOM. But I'm not sure, I never use the DOM if I can avoid it because it makes everything so difficult. You're much better off with JDOM or XOM.)
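For example, a sketch that reads all child nodes instead of only the first (in DOM Level 3, e.getTextContent() does the same in one call):

public static String getCharacterDataFromElement(Element e) {
    StringBuilder sb = new StringBuilder();
    // an entity reference can split the text into several CharacterData children
    for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
        if (child instanceof CharacterData) {
            sb.append(((CharacterData) child).getData());
        }
    }
    return sb.toString();
}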
How can I convert HTML to text, keeping line breaks (produced by elements like br, p, div, ...), possibly using NekoHTML or any decent enough HTML parser?
Example:
Hello<br/>World
to:
Hello\n
World
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}

private static StringBuffer buildStringFromNode(Node node) {
    StringBuffer buffer = new StringBuffer();
    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }
    return buffer;
}
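A quick check with the example from the question (wrapping the string in a stream, since the method takes an InputStream):

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

InputStream in = new ByteArrayInputStream(
        "Hello<br/>World".getBytes(StandardCharsets.UTF_8));
System.out.println(htmlToText(in));   // prints "Hello" and "World" on separate lines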
w3m -dump -no-cookie input.html > output.txt
I did find a relatively clever solution in html2txt: THE ASCIINATOR, which does an admirable job of producing nroff-like output (e.g. like man ls rendered on a terminal). It produces output in the Markdown style that Stack Overflow uses as input.
For moderately complex pages like this page, the output is somewhat scattered, as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.
If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.
HTML to Text:
I am taking this statement to mean that all HTML formatting, except line breaks, will be abandoned.
What I have done for such a venture is use a regexp to detect any tag enclosure.
If the value within the tags is br or br/, a line break is inserted; otherwise the tag is discarded.
It works only for simple HTML pages. Tables will obviously be linearised.
I had been thinking of detecting the title between the title tags, so that the converter automatically places it at the top of the page. That needs a little more work on the algorithm. But my time is better spent with ...
I am reading up on using the Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or why text, when I could do PDF? But you have to get a Google account if you don't already have one.
Google docs data download/export
Google docs data api for java
Does it matter what language you use? You could always use pattern matching. Basically, the HTML line-break tags (br, p, div, ...) can be replaced with "\n", and all the other tags removed. You could store the tags in an array so you can easily check them as you go through the HTML file. Then any other tags and all the end tags (/p, ...) can be replaced with an empty string, giving you your result.
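A rough sketch of that idea in Java (regexes are fragile on real-world HTML, so as noted above this is only reasonable for simple markup; the method name is a placeholder):

static String htmlToTextViaRegex(String html) {
    return html
            .replaceAll("(?i)</?(br|p|div)\\b[^>]*>", "\n")  // line-breaking tags -> newline
            .replaceAll("<[^>]+>", "")                       // strip all remaining tags
            .replaceAll("\\n{2,}", "\n");                    // collapse runs of blank lines
}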