I have to decode, using Java, HTML strings which contain the following entities: "&amp;#39;" and "&apos;".
I'm using Apache Commons Lang, but it doesn't decode those two entities, so I'm currently doing the following, but I'm looking for the fastest way to do what I want.
import org.apache.commons.lang.StringEscapeUtils;

public class StringUtil {
    public static String decodeHTMLString(String s) {
        return StringEscapeUtils.unescapeHtml(s.replace("&amp;#39;", "'").replace("&apos;", "'"));
    }
}
I searched for older questions, but none seems to answer my question.
Well, I would imagine that part of the problem is that one of your entities is double encoded: "&amp;#39;". That will not be turned into an apostrophe by any decoder in a single pass.
As for "&apos;", apparently that one is not *technically* part of the HTML entity set.
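As the answer above notes, a double-encoded entity needs two decoding passes. A minimal stdlib-only sketch of the idea (the class and method names here are made up for illustration, and only a few entities are handled; a real decoder should cover the full entity set):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecodeDemo {
    private static final Pattern DECIMAL = Pattern.compile("&#(\\d+);");

    // One decoding pass: decimal numeric entities first, then &apos;,
    // and &amp; last so that the freshly produced "&" cannot trigger
    // further decoding within the same pass.
    public static String decodeOnce(String s) {
        Matcher m = DECIMAL.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(Character.toChars(cp))));
        }
        m.appendTail(out);
        return out.toString().replace("&apos;", "'").replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // "&amp;#39;" is double encoded, so it takes two passes to become "'"
        String s = "it&amp;#39;s &apos;ok&apos;";
        System.out.println(decodeOnce(decodeOnce(s))); // it's 'ok'
    }
}
```

The first pass turns "&amp;#39;" into "&#39;" (and handles "&apos;"); the second pass resolves the now-visible numeric entity.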
Related
When I try to call a method with a parameter written in my Polish language, e.g.
node.call("ąćęasdasdęczć")
I get these characters as input:
Ä?Ä?Ä?asdasdÄ?czÄ
I don't know where to set the correct encoding: in the Maven pom.xml, or in my IDE? I tried to change UTF-8 to ISO-8859-2 in my IDE settings, but it didn't work. I searched similar questions, but I didn't find the answer.
#Edit 1
Sample code:
public void findAndSendKeys(String vToSet, By vLocator) {
    WebElement element;
    element = webDriverWait.until(ExpectedConditions.presenceOfElementLocated(vLocator));
    element.sendKeys(vToSet);
}
By nameLoc = By.id("First_Name");
findAndSendKeys("ąćęasdasdęczć" , nameLoc );
Then in the input field I get Ä?Ä?Ä?asdasdÄ?czÄ. Converting the string to Basic Latin in my IDE helps, but it's not the solution I need.
I also have problems with fields in classes, e.g. I have a class in which I have to convert a String to Basic Latin:
public class Contacts {
    private static final By LOC_ADDRESS_BTN = By.xpath("//button[contains(@aria-label,'Wybór adresu')]");
    // it doesn't work; I have to use Basic Latin and replace "ó" with "\u00f3" in my IDE
}
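Since the question asks where the encoding can be set in the Maven pom.xml: the conventional place is the standard Maven properties, plus (for tests run through Surefire, as Selenium tests usually are) an explicit file.encoding for the forked JVM. This is a sketch, not a guaranteed fix for this particular symptom; plugin version numbers are omitted:

```xml
<properties>
  <!-- encoding used to read .java source files -->
  <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
</properties>

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-surefire-plugin</artifactId>
      <configuration>
        <!-- makes the JVM that runs the tests default to UTF-8 -->
        <argLine>-Dfile.encoding=UTF-8</argLine>
      </configuration>
    </plugin>
  </plugins>
</build>
```

Both the compile-time encoding (how the literals in the source are read) and the runtime default encoding have to match for the Polish characters to survive end to end.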
#Edit 2 - Changed encoding, but problem still exists
So I'm trying to scrape a grammar website that gives you conjugations of verbs, but I'm having trouble accessing the pages that require accents, such as the page for the verb "fág".
Here is my current code:
String url = "http://www.teanglann.ie/en/gram/"+ URLEncoder.encode("fág","UTF-8");
System.out.println(url);
I've tried this both with and without the URLEncoder.encode() method, and it just keeps giving me a '?' in place of the 'á', and my URL search returns nothing. Basically, I was wondering if there is something similar to Python's urllib.parse.quote_plus. I've tried searching and tried many different methods from StackOverflow, all to no avail. Any help would be greatly appreciated.
Eventually, I'm going to replace the given string with a user-supplied argument; I'm just using it to test at the moment.
Solution: It wasn't Java, but IntelliJ.
Summary from comment
The test code works fine.
import java.io.UnsupportedEncodingException;
import static java.net.URLEncoder.encode;

public class MainApp {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String url = "http://www.teanglann.ie/en/gram/" + encode("fág", "UTF-8");
        System.out.println(url);
    }
}
It emits the following:
http://www.teanglann.ie/en/gram/f%C3%A1g
which goes to the correct page.
The correct steps are:
Ensure that the source code encoding is correct (IntelliJ probably cannot guess it all correctly).
Run the program with the appropriate encoding (UTF-8 in this case).
(See "What is the default encoding of the JVM?" for a relevant discussion.)
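As a quick check for the second step, you can print the encoding the JVM actually picked up. A trivial sketch; the printed value depends on your OS and JVM flags, so no particular output is guaranteed:

```java
import java.nio.charset.Charset;

public class DefaultCharsetCheck {
    public static void main(String[] args) {
        // e.g. UTF-8 on most Linux/macOS setups, windows-1252 on many Windows JVMs
        System.out.println(Charset.defaultCharset().name());
    }
}
```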
Edit from Wyzard's comment:
The code above works by accident (the string happens to contain no whitespace; URLEncoder is meant for HTML form data, not URLs). The correct way to get an encoded URL is as below:
..
String url = "http://www.teanglann.ie/en/gram/fág";
System.out.println(new URI(url).toASCIIString());
This uses URI.toASCIIString(), which adheres to RFC 2396, "Uniform Resource Identifiers (URI): Generic Syntax".
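A self-contained version of that snippet (using a \u escape for the accented character so the source-file encoding cannot interfere; the expected output follows from URI's UTF-8 percent-encoding of non-ASCII characters):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriEncodeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // \u00e1 is 'á'
        String url = "http://www.teanglann.ie/en/gram/f\u00e1g";
        // toASCIIString() percent-encodes non-ASCII characters as UTF-8
        System.out.println(new URI(url).toASCIIString()); // http://www.teanglann.ie/en/gram/f%C3%A1g
    }
}
```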
I have some strings that contain XHTML character entities:
"They&apos;re quite varied"
"Sometimes the string &isin; XML standard, sometimes &isin; HTML4 standard"
"Therefore -&gt; I need an XHTML entity decoder."
"Sadly, some strings are not valid XML &amp; are not-quite-so-valid HTML &lt;- but I want them to work, too."
Is there any easy way to decode the entities? (I'm using Java)
I'm currently using StringEscapeUtils.unescapeHtml4(myString.replace("&apos;", "'")) as a temporary hack. Sadly, org.apache.commons.lang3.StringEscapeUtils has unescapeHtml4 and unescapeXml, but no unescapeXhtml.
EDIT: I do want to handle invalid XML, for example I want "&amp;&xyzzy;" to decode to "&&xyzzy;"
EDIT: I think HTML5 has almost the same character entities as XHTML, so I think HTML 5 decoder would be fine too.
This may not be directly relevant, but you may wish to adopt JSoup, which handles things like that, albeit at a higher level. It includes web page cleaning routines.
Have you tried implementing an XHTMLStringEscapeUtils based on the facilities provided by org.apache.commons.text.StringEscapeUtils?
import org.apache.commons.text.StringEscapeUtils;
import org.apache.commons.text.translate.*;

public class XHTMLStringEscapeUtils {
    public static final CharSequenceTranslator ESCAPE_XHTML =
        new AggregateTranslator(
            new LookupTranslator(EntityArrays.BASIC_ESCAPE),
            new LookupTranslator(EntityArrays.ISO8859_1_ESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_ESCAPE)
        ).with(StringEscapeUtils.ESCAPE_XML11);

    public static final CharSequenceTranslator UNESCAPE_XHTML =
        new AggregateTranslator(
            new LookupTranslator(EntityArrays.BASIC_UNESCAPE),
            new LookupTranslator(EntityArrays.ISO8859_1_UNESCAPE),
            new LookupTranslator(EntityArrays.HTML40_EXTENDED_UNESCAPE),
            new NumericEntityUnescaper(),
            new LookupTranslator(EntityArrays.APOS_UNESCAPE)
        );

    public static String escape(final String input) {
        return ESCAPE_XHTML.translate(input);
    }

    public static String unescape(final String input) {
        return UNESCAPE_XHTML.translate(input);
    }
}
Thanks to the modular design of the Apache commons-text library, it's easy to create custom escape utils.
You can find a full project with tests here: xhtml-string-escape-utils
Before, I used MongoDB 2.0.6 and everything was fine.
Recently I started to use MongoDB 2.4.8 with the Java Play framework, and I found that when I try to save Chinese text to MongoDB, it is actually stored as an unreadable string, such as &#21457;&#29983;; what shows on the web is the same string. Does anyone know why?
What should I do? How do I convert it back to readable Chinese?
I think your string gets converted to an unreadable string somewhere in between, as I tested this on the console and it works fine for me.
$ mongo test
MongoDB shell version: 2.4.8
connecting to: test
> var doc = { "message" :"你好" }
> db.ChineseWord.save(doc)
> db.ChineseWord.find().pretty()
{ "_id" : ObjectId("529da2018170273efa43e181"), "message" : "你好" }
From what you have posted I suspect that this may be an artefact of the Play Framework, as both these characters can be stored directly in MongoDB.
> db.test1.insert({x:"𡑗 and 𩦃"})
> db.test1.find();
{ "_id" : ObjectId("52a12237e7c9d6190f6feb95"), "x" : "𡑗 and 𩦃" }
Assuming that the characters you posted as &#21457; and &#29983; above are really meant to be 𡑗 and 𩦃, then I would suspect that the Play Framework is converting them into a representation of their extended Unicode values. In this case those two characters would be from the "CJK Unified Ideographs Extension B" section.
You can view the whole set of characters here: http://codepoints.net/cjk_unified_ideographs_extension_b
This looks to be a similar issue to the one here in the play-framework Google group.
I just wrote a quick test and this works just fine.
package com.mongodb;

import com.mongodb.util.TestCase;
import org.junit.Assert;
import org.junit.Test;

public class EncodingTest extends TestCase {
    String chinese = "你好";

    @Test
    public void saveChinese() {
        DBCollection collection = getDatabase().getCollection("chinese");
        collection.insert(new BasicDBObject().append("message", chinese));
        DBObject object = collection.findOne();
        Assert.assertEquals(chinese, object.get("message"));
    }
}
That text saves and loads without error. It would help to see what code you're using to test.
While I have no experience with the Play framework specifically, the general approach to resolving your issue is to log/dump the string right before it's passed to your MongoDB driver. If:
the string is still encoded as UTF-8, not entities (&#...), you need to check whether your MongoDB driver for 2.4 was updated with some new options that convert UTF-8 into entities;
the string is already converted to entities, then you have at least ruled out the MongoDB driver and should track down the conversion within the Play framework instead.
As others have mentioned, MongoDB itself does not care whether your inputs are entities or not, as long as they are UTF-8 encoded. It's more likely the Play framework or the MongoDB driver is to blame.
PS: I assume "unreadable" means they were converted to entities (&#...), not encoded incorrectly.
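To answer the "how to convert it back" part of the question: if strings already stored as decimal numeric entities need to be made readable again, a small stdlib-only pass suffices. A sketch, not Play-specific; the class and method names are made up for illustration:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityToChinese {
    private static final Pattern DEC_ENTITY = Pattern.compile("&#(\\d+);");

    // Replace each decimal numeric entity with the code point it names.
    public static String decode(String s) {
        Matcher m = DEC_ENTITY.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int cp = Integer.parseInt(m.group(1));
            m.appendReplacement(out, Matcher.quoteReplacement(new String(Character.toChars(cp))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // &#21457; is 发 (U+53D1), &#29983; is 生 (U+751F)
        System.out.println(decode("&#21457;&#29983;")); // 发生
    }
}
```

That said, the better fix is to stop the entities being produced in the first place, as the answers above discuss.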
I have an HTML string value and I want to get one attribute (id) value from that HTML string.
Can you help me with how to do it?
String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
The result should be: highlight40001
Try using this regular expression pattern:
\bid='([^']*)'
And then extract the string captured by group 1. This is not foolproof; using regex to parse HTML never is. You can try to complicate the regex to make it more flexible, or you can just use an HTML parser. I recommend the latter.
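Putting that pattern to work on the string from the question (the class and helper names here are just for illustration; the regex assumes the id value is single-quoted, as in the sample):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractId {
    private static final Pattern ID_ATTR = Pattern.compile("\\bid='([^']*)'");

    // Returns the value of the first single-quoted id attribute, or null if absent.
    public static String extract(String html) {
        Matcher m = ID_ATTR.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' "
                + "style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
        System.out.println(extract(msHTMLFile)); // highlight40001
    }
}
```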
Also not so clean, but this should work for you.
You can treat it as XML and parse it using JAXB:
ABBR.java:
import javax.xml.bind.annotation.XmlAttribute;

public class ABBR
{
    @XmlAttribute public String id;
}
Main.java:
[..]
String msHTMLFile = "<ABBR class='HighlightClass' id='highlight40001' style=\"BACKGROUND-COLOR: yellow\" >Fetal/Neonatal Morbidity and Mortality</ABBR>";
ABBR obj = JAXB.unmarshal(new StringReader(msHTMLFile), ABBR.class);
System.out.println(obj.id);
[..]
If you're lucky and your HTML source produces XML-compliant HTML, JAXB or other XML parsers will do fine with it. A lot of people aren't writing particularly well-formed HTML (unclosed tags, etc), though some of my coworkers have gotten good results parsing HTML with HotSAX: http://sourceforge.net/projects/hotsax/