how to convert HTML text to plain text? [duplicate]

This question already has answers here:
Remove HTML tags from a String
(35 answers)
Closed 1 year ago.
Friends,
I have to parse a description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?

Yes, Jsoup is the better option. The following converts the whole HTML text to plain text:
String plainText = Jsoup.parse(your_html_text).text();

Just getting rid of HTML tags is simple:
// replace all occurrences of one or more HTML tags with optional
// whitespace inbetween with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");
But unfortunately the requirements are never that simple:
Usually, <p> and <div> elements need separate handling, and there may be CDATA blocks containing > characters (e.g. JavaScript) that break the regex.
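To illustrate both points, here is a minimal sketch using only the JDK. The class name TagStripSketch and the choice of which tags count as line breaks are assumptions made for this example, not part of the answer above. It turns <br> and a few closing block-level tags into newlines before stripping the remaining tags.

```java
import java.util.regex.Pattern;

final class TagStripSketch {
    // Assumption: <br> and these closing block-level tags should end a line.
    private static final Pattern BLOCK_BREAK = Pattern.compile(
            "<br\\s*/?>|</(p|div|li|h[1-6])\\s*>", Pattern.CASE_INSENSITIVE);
    // Anything else that looks like a tag; DOTALL is not needed because
    // the negated class [^>] already matches line breaks.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    static String strip(String html) {
        String text = BLOCK_BREAK.matcher(html).replaceAll("\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        // Collapse runs of spaces and tabs, then drop spaces around line breaks.
        text = text.replaceAll("[ \\t]+", " ").replaceAll(" ?\\n ?", "\n");
        return text.trim();
    }

    public static void main(String[] args) {
        // Block-level tags become line breaks.
        System.out.println(strip("<div><p>First</p><p>Second<br>line</p></div>"));
        // A '>' inside an attribute value ends the "tag" too early,
        // so part of the attribute leaks into the output.
        System.out.println(strip("<a title=\"x > y\">link</a>"));
    }
}
```

The second call shows the limitation: the regex has no notion of quoting, so a real parser such as Jsoup remains the safer choice for anything beyond simple markup.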

You can use this single line to remove the html tags and display it as plain text.
htmlString=htmlString.replaceAll("\\<.*?\\>", "");

Use Jsoup.
Add the dependency
<dependency>
<!-- jsoup HTML parser library # https://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.13.1</version>
</dependency>
Now in your java code:
public static String html2text(String html) {
return Jsoup.parse(html).wholeText();
}
Call html2text with the HTML text and it returns the plain text.

Use an HTML parser such as HtmlCleaner.
For a detailed answer, see: How to remove HTML tag in Java

I'd recommend parsing the raw HTML through jTidy which should give you output which you can write xpath expressions against. This is the most robust way I've found of scraping HTML.

If you want the text rendered the way a browser would display it, use Jericho:
import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;
public class RenderToText {
public static void main(String[] args) throws Exception {
String sourceUrlString="data/test.html";
if (args.length==0)
System.err.println("Using default argument of \""+sourceUrlString+'"');
else
sourceUrlString=args[0];
if (sourceUrlString.indexOf(':')==-1) sourceUrlString="file:"+sourceUrlString;
Source source=new Source(new URL(sourceUrlString));
String renderedText=source.getRenderer().toString();
System.out.println("\nSimple rendering of the HTML document:\n");
System.out.println(renderedText);
}
}
I hope this also helps with rendering tables the way a browser would.
Thanks,
Ganesh

I needed a plain text representation of some HTML which included FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried htmlCleaner (sourceforge), but that left the HTML header and style content (tags removed).
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();
The maxLineLength ensures lines are not artificially wrapped at 80 characters.
The setNewLine(null) uses the same new line character(s) as the source.

I use HTMLUtil.textFromHTML(value)
from
<dependency>
<groupId>org.clapper</groupId>
<artifactId>javautil</artifactId>
<version>3.2.0</version>
</dependency>

Using Jsoup, I got all the text in the same line.
So I used the following block of code to parse HTML and keep new lines:
private String parseHTMLContent(String toString) {
String result = toString.replaceAll("\\<.*?\\>", "\n");
String previousResult = "";
while(!previousResult.equals(result)){
previousResult = result;
result = result.replaceAll("\n\n","\n");
}
return result;
}
Not the best solution but solved my problem :)

Related

How to preserve the meaning of tags like <br>, <ul> , <li> , <p> etc when reading them in Java using JSOUP library?

I am writing a program that extracts certain information from local HTML files. That information is then shown in a Java JFrame and exported to an Excel file. (I am using the JSoup 1.9.2 library for HTML parsing.)
I am running into an issue where whenever I extract anything from an HTML file, JSoup is not taking HTML tags like break tags, line tags etc. into account and so, all the information is being extracted like a big chunk of data without any proper newlines or formatting.
To show you an example, if this is the data that I want to read:
Title
Line 1
Line 2
Unordered List
element 1
element 2
The data is coming back as:
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)
This is the piece of code that I am using for reading in :
private String getTitle(Document doc) { // doc is the local HTML file
Elements title = doc.select(".title");
for (Element id : title) {
return id.text();
}
return "No Title Available ";
}
Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data in the JFrame and export it to Excel in a more readable format?
Thanks.

Just to give everyone an update, I was able to find a solution (more like a workaround) to the formatting issue. What I am doing now is extracting the complete HTML using id.html(), which I store in a String object. Then I use the String method replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything onto a single line. The replaceAll() call looks something like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks something like:
private String processHTML(String initial) { // initial is the String with all the HTML tags
String modified = initial;
modified = modified.replaceAll("\\<[^>]*>", ""); // strip anything that looks like a tag
modified = modified.trim(); // remove unwanted space before and after the needed data
// The replaceAll() calls below decode HTML entities that might be left in the data extracted from the HTML
modified = modified.replaceAll("&nbsp;", " ");
modified = modified.replaceAll("&lt;", "<");
modified = modified.replaceAll("&gt;", ">");
modified = modified.replaceAll("&quot;", "\"");
modified = modified.replaceAll("&apos;", "'");
modified = modified.replaceAll("&cent;", "¢");
modified = modified.replaceAll("&copy;", "©");
modified = modified.replaceAll("&reg;", "®");
// Decode &amp; last so that text such as "&amp;lt;" is not decoded twice
modified = modified.replaceAll("&amp;", "&");
return modified;
}
Thank you all again for helping me with this.
Cheers.
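A fixed list of named entities like the one above does not cover numeric character references such as &#169; or &#xAE;. As a rough sketch of how those could be decoded with the JDK alone (the class and method names are made up for this example, and it deliberately ignores named entities):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntitySketch {
    // Matches decimal (&#169;) and hexadecimal (&#xAE;) character references.
    private static final Pattern NUMERIC_REF =
            Pattern.compile("&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));");

    static String decodeNumeric(String s) {
        Matcher m = NUMERIC_REF.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = m.group(1) != null
                    ? Integer.parseInt(m.group(1), 16)
                    : Integer.parseInt(m.group(2));
            // Leave the reference untouched if it is outside the Unicode range.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            // quoteReplacement keeps '$' and '\' in the result literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Calling decodeNumeric on the tag-stripped text before or after the named-entity replacements should work either way, as long as &amp; is still decoded last.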

converting HTML to String without TextView

I am having problems filling my TextView.
I have an HTML String that needs to be converted from HTML to a plain String, and then some characters need to be replaced.
The problem is that I can only convert it directly with:
TextView.setText(Html.fromHtml(sampleText));
But I need to alter the converted sampleText before giving it to the TextView.
E.g.:
String sampleText = "<b>Some Text</b>";
newSampleText = Html.fromHtml(sampleText);
newSampleText.replace(char1, char2);
TextView.setText(newSampleText);
Does anyone know how to convert the HTML saved inside the String?

If you don't need formatting, use Html.fromHtml(sampleText).toString().
Otherwise, you need to extract the text from the HTML with jsoup to find and change it.

Please try this:
You need to use Html.fromHtml() to use HTML in your XML strings. Simply referencing a String with HTML in your layout XML will not work.
Try the setText(CharSequence, TextView.BufferType) overload with the SPANNABLE buffer type.

Extracting html tags based on attribute

I have a crawled page and I have retrieved html of the page into String object.
Now I want to parse this string and extract all tags that have itemprop defined into an associative array, for example:
String[] itemprops;
itemprops['title'] = "Some title";
itemprops['description'] = "Some description";
Can I do this with regex somehow, or is there a library that can do this?

Look at JSoup. It's an HTML scraping and parsing library that's exactly what you want.
In your case, you can do something like:
Document doc = Jsoup.parse(HTMLString);
String title = doc.select("title").text();
String description = doc.select("meta[name=description]").attr("content");
The select() function uses CSS selectors to get elements.

Also make sure that the HTML you use follows strict syntax, because broken syntax may cause parsing exceptions or data loss.
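If the page really is well-formed XHTML, even the JDK's own XML parser can collect the itemprop values into a map, which is the closest Java equivalent of the associative array in the question. The following is only a sketch under that assumption: the class name ItempropSketch is invented here, and input with a DOCTYPE, unclosed tags, or entities like &nbsp; will make the parser fail. For real-world HTML, Jsoup's attribute selector select("[itemprop]") is the more forgiving route.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ItempropSketch {
    // Maps each itemprop name to its "content" attribute, or to the
    // element's text when there is no such attribute.
    static Map<String, String> itemprops(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//*[@itemprop]", doc, XPathConstants.NODESET);
            Map<String, String> result = new LinkedHashMap<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element el = (Element) nodes.item(i);
                String value = el.hasAttribute("content")
                        ? el.getAttribute("content")
                        : el.getTextContent().trim();
                // A later element with the same itemprop overwrites an earlier one.
                result.put(el.getAttribute("itemprop"), value);
            }
            return result;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

With this, itemprops(page).get("title") plays the role of itemprops['title'] from the question.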

Using JSoup to parse text between two different tags

I have the following HTML...
<h3 class="number">
<span class="navigation">
6:55 <b>»</b>
</span>**This is the text I need to parse!**</h3>
I can use the following code to extract the text from h3 tag.
Element h3 = doc.select("h3").get(0);
Unfortunately, that gives me everything in that tag.
6:55 » This is the text I need to parse!
Can I use Jsoup to parse between different tags? Is there a best practice for doing this (regex?)

(regex?)
No, as you can read in the answers of this question, you can't parse HTML using a regular expression.
Try this:
Element h3 = doc.select("h3").get(0);
String h3Text = h3.text();
String spanText = h3.select("span").get(0).text();
String textBetweenSpanEndAndH3End = h3Text.replace(spanText, "");

No, JSoup wasn't made for this. It's supposed to parse something hierarchical. Searching for text that sits between an end tag and a start tag, or the other way around, wouldn't make any sense for JSoup. That's what regular expressions are for.
But you should of course narrow it down as much as you can using JSoup first, before you shoot with a regex at the string.

Just use ownText()
@Test
void innerTextCase() {
String sample = "<h3 class=\"number\">\n" +
"<span class=\"navigation\">\n" +
"6:55 <b>»</b>\n" +
"</span>**This is the text I need to parse!**</h3>\n";
Assertions.assertEquals("**This is the text I need to parse!**",
Jsoup.parse(sample).select("h3").first().ownText());
}

Get a part of a html file in java [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have a HTML file looking like this:
<html>
<head>
<title>foobar</title>
</head>
<body>
bla bla<br />
{[CONTAINER]}
Hello
{[/CONTAINER]}
</body>
</html>
How do I get the "Hello" inside the container out of the rest of the HTML file? I did this in PHP years ago, and I remember a regex function that calls a defined class function and passes the content of the container as a parameter.
Can someone tell me how to do this in Java?

You can use regex that matches everything between {[CONTAINER]} and {[/CONTAINER]}. Example:
// Lookbehind for the open tag. Lookarounds are zero-width, so the tag itself is not included in the match.
String open = "(?<=\\{\\[CONTAINER\\]\\})";
// Content between open and close tag (reluctant, so it stops at the first close tag).
String inside = ".*?";
// Lookahead for the close tag, also zero-width.
String close = "(?=\\{\\[/CONTAINER\\]\\})";
// Final regex
String regex = open + inside + close;
String text = "<html>..."; // your string here
// Usage
Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
while (matcher.find()) {
String content = matcher.group().trim();
System.out.println(content);
}
But you must be careful, because it works only for {[CONTAINER]} and {[/CONTAINER]}. Attributes on these custom tags aren't supported.
You must also be aware that it doesn't handle HTML tags in any specific way, so any HTML tags between your CONTAINER tags will be included in the result.
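The same extraction can also be written with a single capturing group instead of lookarounds, which some readers find easier to follow. This is a sketch of that variant, not a correction of the code above; the class name ContainerSketch and the contents method are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContainerSketch {
    // Group 1 is the text between the two markers; DOTALL lets '.' span lines.
    private static final Pattern CONTAINER = Pattern.compile(
            "\\{\\[CONTAINER\\]\\}(.*?)\\{\\[/CONTAINER\\]\\}", Pattern.DOTALL);

    static List<String> contents(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = CONTAINER.matcher(text);
        while (m.find()) {
            found.add(m.group(1).trim());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<body>\nbla bla<br />\n{[CONTAINER]}\nHello\n{[/CONTAINER]}\n</body>";
        System.out.println(contents(html)); // prints [Hello]
    }
}
```

The same caveats apply: markers with attributes are not recognised, and any HTML between the markers is returned untouched.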

You can parse the HTML using jsoup.

Why do you want to use Java?
You can simply use the DOM API with JavaScript:
document.getElementById("id_container").firstChild.data; // beware of \n char
or in a less efficient way:
document.getElementById("id_container").innerHTML;
However, if your file is built on the server, you can also use the same API there:
http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/package-summary.html
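As a hedged illustration of that server-side route, the sketch below uses the JDK's org.w3c.dom classes to find an element by its id attribute and return its text. It assumes the container is a real element such as <div id="id_container"> (the question uses {[CONTAINER]} markers instead) and that the file is well-formed XML. The class name DomByIdSketch is made up for the example. Document.getElementById is avoided on purpose, because without a DTD or schema the parser does not know which attribute is an ID.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class DomByIdSketch {
    // Returns the trimmed text of the first element whose id attribute
    // equals the given id, or null if there is none.
    static String textOfId(String xhtml, String id) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList all = doc.getElementsByTagName("*"); // every element
            for (int i = 0; i < all.getLength(); i++) {
                Element el = (Element) all.item(i);
                if (id.equals(el.getAttribute("id"))) {
                    return el.getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }
}
```

Trimming the result takes care of the line breaks that the JavaScript comment above warns about.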
