Convert HTML Character Back to Text Using Java Standard Library - java

I would like to convert some HTML character entities back to plain text using the Java standard library. Is there any library that would do this?
/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    // "Happy & Sad" in HTML form.
    String s = "Happy &amp; Sad";
    System.out.println(s);
    try {
        // Change to "Happy & Sad". DOESN'T WORK!
        s = java.net.URLDecoder.decode(s, "UTF-8");
        System.out.println(s);
    } catch (java.io.UnsupportedEncodingException ex) {
        // UTF-8 is always supported, so this cannot happen.
    }
}

I think the StringEscapeUtils.unescapeHtml3() and unescapeHtml4() methods are what you are looking for. They originated in the Apache Commons Lang library and now live in Apache Commons Text. See https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/StringEscapeUtils.html.

Just add the jsoup jar file to your application's lib directory and then use this code.
import org.jsoup.Jsoup;
public class Encoder {
    public static void main(String args[]) {
        // The entities are decoded when the text is extracted.
        String s = Jsoup.parse("&lt;Fran&ccedil;ais&gt;").text();
        System.out.print(s);
    }
}
Link to download jsoup: http://jsoup.org/download

java.net.URLDecoder deals only with the application/x-www-form-urlencoded MIME format (e.g. "%20" represents space), not with HTML character entities. I don't think there's anything on the Java platform for that. You could write your own utility class to do the conversion, like this one.
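A hand-written utility of the kind suggested above might look like the following sketch. It uses only the standard library. The class name HtmlEntityDecoder and its entity table are assumptions made for illustration: the table covers just six named entities, whereas HTML defines hundreds, so a real utility would need a much larger table. Numeric references (decimal and hexadecimal) are handled generally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK or of any library named in this thread.
public class HtmlEntityDecoder {
    // Deliberately tiny table of named entities; extend as needed.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
        NAMED.put("nbsp", "\u00A0");
    }

    // Matches &#123; or &#x7B; or &name; (the trailing semicolon is required here).
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group(); // unknown entity: leave it untouched
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or null if the entity is not recognised.
    private static String resolve(String body) {
        if (body.charAt(0) != '#') {
            return NAMED.get(body);
        }
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        } catch (NumberFormatException e) {
            return null; // too many digits to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("Happy &amp; Sad")); // prints: Happy & Sad
    }
}
```

The input is scanned in a single pass, so "&amp;lt;" decodes to "&lt;" rather than to "<", which is the usual behaviour for entity decoding.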

The URL decoder should only be used for decoding strings from the urls generated by html forms which are in the "application/x-www-form-urlencoded" mime type. This does not support html characters.
After a search I found a Translate class within the HTML Parser library.
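To make the difference concrete, here is a small sketch of what URLDecoder does and does not decode. The class and method names (UrlDecoderDemo, decodeForm) are made up for this illustration; only URLDecoder.decode itself is a JDK call.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class UrlDecoderDemo {
    // Decodes application/x-www-form-urlencoded text: "+" becomes a space, "%XX" becomes a byte.
    public static String decodeForm(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Form-encoded input is decoded as expected.
        System.out.println(decodeForm("Happy+%26+Sad"));   // prints: Happy & Sad
        // HTML entities are not part of that format, so they pass through unchanged.
        System.out.println(decodeForm("Happy &amp; Sad")); // prints: Happy &amp; Sad
    }
}
```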

You can use the class org.apache.commons.lang.StringEscapeUtils:
String s = StringEscapeUtils.unescapeHtml("Happy &amp; Sad");
It works.

I'm not aware of any way to do it using the standard library. But I do know and use this class that deals with html entities.
"HTMLEntities is an Open Source Java class that contains a collection of static methods (htmlentities, unhtmlentities, ...) to convert special and extended characters into HTML entitities and vice versa."
http://www.tecnick.com/public/code/cp_dpage.php?aiocp_dp=htmlentities

Or you can use unescapeHtml4:
String miCadena = "GU&Iacute;A TELEF&Oacute;NICA";
System.out.println(StringEscapeUtils.unescapeHtml4(miCadena));
This code prints the line:
GUÍA TELEFÓNICA

As @jem suggested, it is possible to use jsoup.
With jsoup 1.8.3 it is possible to use the method Parser.unescapeEntities, which retains the original HTML.
import org.jsoup.parser.Parser;
...
String html = Parser.unescapeEntities(original_html, false);
It seems that this method is not present in some earlier releases.

Related

How to print the escape characters as it is while using PrettyPrintWriter?

I am using PrettyPrintWriter to pretty-print an XML file.
In the generated XML file the ' (apostrophe) is written as &apos;
I want it to be printed as '
I am pretty-printing with:
xstream.marshal(obj, new PrettyPrintWriter(writer))
Any suggestions on how to print the escape characters as they are?
You can provide your own implementation of PrettyPrintWriter, which extends that class and overrides its writeText(QuickWriter, String) method.
In its most basic form that would be something like this:
import com.thoughtworks.xstream.core.util.QuickWriter;
import com.thoughtworks.xstream.io.xml.PrettyPrintWriter;
import java.io.Writer;
public class MyPrettyPrintWriter extends PrettyPrintWriter {
    public MyPrettyPrintWriter(Writer writer) {
        super(writer);
    }
    @Override
    public void writeText(QuickWriter writer, String string) {
        // Write the text as-is, without any escaping.
        writer.write(string);
    }
}
You would use this as follows:
String s = "Foo'Bar";
XStream xstream = new XStream();
FileWriter writer = new FileWriter("my_demo.xml");
xstream.marshal(s, new MyPrettyPrintWriter(writer));
The output file contains the following:
<string>Foo'Bar</string>
This is basic - it just passes the tag contents through to the file unchanged - nothing is escaped.
You are OK for content containing ", ' and >. But this will be a problem for text containing < and &, which should still be escaped. So you can enhance your writeText method to handle those cases as needed. See What characters do I need to escape in XML documents? for more details.
Note also this is only for text values - not for XML attributes. There is a separate writeAttributeValue method for that (probably not needed in your scenario).
It is worth adding: There should be no need to do any of this. The XML is valid, with escaped values such as &apos;. Any process (any half-way decent XML library or tool) reading that data should handle them correctly.
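The enhancement to writeText mentioned above could be sketched as a small standard-library helper that escapes only the two characters that must never appear raw in XML text (& and <) and leaves quotes and apostrophes alone. The class name XmlTextEscaper and the method escapeText are hypothetical names chosen for this example.

```java
// Hypothetical helper; it is not part of XStream.
public class XmlTextEscaper {
    // Escapes & and < for use in XML element text; ', " and > are passed through.
    public static String escapeText(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':
                    out.append("&amp;");
                    break;
                case '<':
                    out.append("&lt;");
                    break;
                default:
                    out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeText("Foo'Bar & <Baz>")); // prints: Foo'Bar &amp; &lt;Baz>
    }
}
```

Inside the overridden writeText method, the call would then become writer.write(XmlTextEscaper.escapeText(string)). Text that may contain the sequence ]]> would additionally need > escaped.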

parse java string in java

I want to parse Java code using Java.
The problem is that when I pass the Java code to the parse method, it is not treated as a string. How do I escape the code to be parsed?
public class JavaParser {
    private int noOfLines;
    public void parse(String javaCode) {
        // Split on any run of line terminators and print each line.
        String[] lines = javaCode.split("[\r\n]+");
        for (String line : lines)
            System.out.println(line);
    }
    public static void main(String[] args) {
        JavaParser a = new JavaParser();
        a.parse("java code;");
    }
}
You need to read the Java source file as a text file, either line by line or all at once, for example:
// IOUtils comes from Apache Commons IO.
String everything;
FileInputStream inputStream = new FileInputStream("foo.java");
try {
    everything = IOUtils.toString(inputStream, "UTF-8");
} finally {
    inputStream.close();
}
Then you can parse the everything string.
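If adding Commons IO is not an option, the same read can be sketched with java.nio.file alone. The class name SourceFileReader is hypothetical, and UTF-8 is assumed as the source file encoding.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper showing both "all at once" and "line by line" reads.
public class SourceFileReader {
    // Reads the whole file into one string (assumes UTF-8).
    public static String readAll(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    // Reads the file as a list of lines, without line terminators (assumes UTF-8).
    public static List<String> readLines(Path file) throws IOException {
        return Files.readAllLines(file, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // "foo.java" is the placeholder file name used in the answer above.
        for (String line : readLines(Paths.get("foo.java"))) {
            System.out.println(line);
        }
    }
}
```

Because the source is read from a file rather than written as a string literal, nothing in it needs to be escaped.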
Maybe you can describe what you're trying to achieve?
In general, parsing Java source files is the job of the Java compiler (such as javac).
Quick googling revealed this project, which may suit your needs.
As of Java 6 you can invoke the compiler from your own code (Java exposes the compiler API). This can be helpful if you're trying to compile the code after you read it. You can also read this article; maybe you'll find it helpful.
Hope this helps

Externalize XML construction from a stream of CSV in Java

I get a stream of values as CSV. Based on some condition, I need to generate an XML document that includes only a subset of the values from the CSV. For example:
Input : a:value1, b:value2, c:value3, d:value4, e:value5.
if (condition1)
XML O/P = <Request><ValueOfA>value1</ValueOfA><ValueOfE>value5</ValueOfE></Request>
else if (condition2)
XML O/P = <Request><ValueOfB>value2</ValueOfB><ValueOfD>value4</ValueOfD></Request>
I want to externalize the process in a way that given a template the output XML is generated accordingly. String manipulation is the easiest way of implementing this but I do not want to mess up the XML if some special characters appear in the input, etc. Please suggest.
Perhaps you could benefit from a templating engine, something like Apache Velocity.
I would suggest creating an XSD and using JAXB to create the Java binding classes that you can use to generate the XML.
I recommend my own templating engine (JATL http://code.google.com/p/jatl/). Although it's geared to (X)HTML, it's also very good at generating XML.
I didn't bother solving the whole problem for you (that is, double splitting the input on "," and then ":"), but this is how you would use JATL.
final String a = "stuff";
HtmlWriter html = new HtmlWriter() {
    @Override
    protected void build() {
        // If condition1
        start("Request").start("ValueOfA").text(a).end().end();
    }
};
// Now write.
StringWriter writer = new StringWriter();
String results = html.write(writer).getBuffer().toString();
Which would generate
<Request><ValueOfA>stuff</ValueOfA></Request>
All the correct escaping is handled for you.
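For comparison, a similar escaping guarantee is available from the JDK's own StAX writer (javax.xml.stream), with no extra library. The following is only a sketch under assumptions: the class name RequestXmlBuilder is made up, and the caller is assumed to have already chosen which element names and values to include based on its conditions.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: wraps the selected values in a <Request> element.
public class RequestXmlBuilder {
    // Keys are element names (e.g. "ValueOfA"), values are the raw CSV values.
    public static String build(Map<String, String> selected) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        xml.writeStartElement("Request");
        for (Map.Entry<String, String> entry : selected.entrySet()) {
            xml.writeStartElement(entry.getKey());
            xml.writeCharacters(entry.getValue()); // special characters are escaped here
            xml.writeEndElement();
        }
        xml.writeEndElement();
        xml.flush();
        xml.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        Map<String, String> selected = new LinkedHashMap<>();
        selected.put("ValueOfA", "value1");
        selected.put("ValueOfE", "a < b & c");
        System.out.println(build(selected));
    }
}
```

Which keys go into the map is where condition1, condition2 and so on from the question would be applied; that selection logic could itself be read from an external configuration.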

How to work with HTML code read in Java?

I know how to read the HTML code of a website. For example, the following Java code reads all the HTML code from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a website that lists all the football players of F.C. Barcelona.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class ReadWebPage {
public static void main(String[] args) throws IOException {
String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
URL url = new URL(urltext);
BufferedReader in = new BufferedReader(new InputStreamReader(url
.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
// Process each line.
System.out.println(inputLine);
}
in.close();
}
}
OK, but now I need to work with the HTML code: I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each player on the team. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and fill these lists with the names and positions of all the players.
How can I do it? I can't find a code example for this on Google. Code examples are welcome.
Thanks
I would recommend using HtmlUnit, which will give you access to the DOM tree of the HTML page, and even execute JavaScript in case the data are dynamically put in the page using AJAX.
You could also use JSoup: no JavaScript, but more lightweight and support for CSS selectors.
I think that the best approach is first to clean up the HTML code into valid XHTML and then apply an XSL transformation. To retrieve specific pieces of information you can use XPath expressions. The best available HTML tag balancer is, in my opinion, NekoHTML (http://nekohtml.sourceforge.net/).
You might like to take a look at htmlparser
I used this for something similar.
Usage something like this:
Parser fullWebpage = new Parser("WEBADDRESS");
// First narrow the page down to the tags of interest.
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
// Then search recursively inside those nodes for links.
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();
Java has its own built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. Although it lives in the javax.swing.text.html.parser package, it shares nothing with Swing (and with text only as much as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. The code example (written as a ParserDelegator test) can be found here. Some say it is a remnant of the HotJava browser. The only problem is that it does not seem to have been updated to the most recent versions of HTML.
The simple code example would be
Reader reader; // read HTML from somewhere
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // MyCallBack extends HTMLEditorKit.ParserCallback.
ParserDelegator delegator = new ParserDelegator();
delegator.parse(reader, callback, false);
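A callback for the snippet above might look like the following sketch, which collects the text found inside every a element. The class name LinkTextExtractor is hypothetical, and the choice of the a tag is only an assumption: which tags actually hold the player names and positions depends on the markup of the real page, which is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example: gathers the text of all <a> elements in a document.
public class LinkTextExtractor {
    public static List<String> extractLinkTexts(Reader html) throws IOException {
        final List<String> texts = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private boolean insideAnchor = false;

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    insideAnchor = true;
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A) {
                    insideAnchor = false;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                // Only keep text that appears between <a ...> and </a>.
                if (insideAnchor) {
                    texts.add(new String(data));
                }
            }
        };
        // The last argument tells the parser to ignore charset declarations in the page.
        new ParserDelegator().parse(html, callback, true);
        return texts;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html><body><a href=\"p1\">Valdes, Victor</a>"
                + "<p>Goalkeeper</p><a href=\"p2\">Pinto, Jose Manuel</a></body></html>";
        System.out.println(extractLinkTexts(new StringReader(page)));
    }
}
```

The same pattern extends to other tags or to attribute checks through the attrs parameter, for instance to keep only links with a particular class attribute.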
I've found a link that is just what you were looking for:
http://tiny-url.org/work_with_html_java

Convert HTML symbols and HTML names to HTML number using Java

I have an XML file which contains many special symbols like ® (HTML number &#174) etc.
and HTML names like &atilde (HTML number &#227) etc.
I am trying to replace these HTML symbols and HTML names with the corresponding HTML numbers using Java. To do this, I first read the XML file into a string and then used the replaceAll method:
File fn = new File("myxmlfile.xml");
String content = FileUtils.readFileToString(fn);
content = content.replaceAll("®", "&\#174");
FileUtils.writeStringToFile(fn, content);
But this is not working.
Can anyone please tell how to do it.
Thanks !!!
The signature for the replaceAll method is:
public String replaceAll(String regex, String replacement)
You have to be careful that your first parameter is a valid regular expression. The Java Pattern class describes the constructs used in a Java regular expression.
Based on what I see in the Pattern class description, I don't see what's wrong with:
content = content.replaceAll("®", "&\#174");
You could try quoting the search string so that none of its characters is treated as regex syntax:
content = content.replaceAll(Pattern.quote("®"), "&#174;");
and see if that works better.
I don't think that \# is a valid escape sequence.
BTW, what's wrong with "&#174" ?
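Since the search text here is a literal and not a pattern, String.replace("®", "&#174;") would also do and avoids regular expressions entirely. More generally, every non-ASCII character can be turned into its numeric reference in one pass, as in this standard-library sketch. The class name NumericEntityEncoder is hypothetical, and the sketch does not convert named entities such as &atilde; that are already present in the text; that would need a lookup table from names to numbers.

```java
// Hypothetical helper: replaces each non-ASCII character with a decimal character reference.
public class NumericEntityEncoder {
    public static String toNumericReferences(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate over code points so characters outside the BMP are handled as one unit.
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00AE is the registered sign, \u00E3 is a with tilde.
        System.out.println(toNumericReferences("Acme\u00AE S\u00E3o")); // prints: Acme&#174; S&#227;o
    }
}
```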
If you want HTML numbers, try escaping for XML first.
Use StringEscapeUtils from Apache Commons Lang.
Java may have trouble dealing with it, so I prefer to escape for Java first, and after that for XML or HTML.
String escapedStr = StringEscapeUtils.escapeJava(yourString);
escapedStr = StringEscapeUtils.escapeXml(yourString);
escapedStr = StringEscapeUtils.escapeHtml(yourString);
