Is there a function that converts HTML to plaintext? - java

Is there a "hocus-pocus" function, suitable for Android, that converts HTML to plaintext?
I am referring to a function like the clipboard conversion operation found in browsers like Internet Explorer, Firefox, etc: If you select all rendered HTML inside the browser and copy/paste it to a text editor, you will receive (most of) the text, without any HTML tags or headers.
In a similar thread, I saw a reference to html2text but it's in Python. I am looking for an Android/Java function.
Is there something like this available or must I do this myself, using Jsoup or Jtidy?

I'd try something like:
String html = "<b>hola</b>";
String plain = Html.fromHtml(html).toString();

Using jsoup:
String plain = new HtmlToPlainText().getPlainText(Jsoup.parse(html));
Without jsoup:
String html = "htmltext";
String newHtml = html.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ").trim();
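For reference, the regex approach above can be wrapped in a small runnable class. This is only a sketch: the class and method names (TagStripper, stripTags) are made up for illustration, and the extra whitespace collapse is an addition, since replacing each run of tags with a space can leave double spaces behind. It does not decode entities and will misbehave on a ">" inside an attribute value.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of jsoup or Android.
final class TagStripper {
    // One tag plus any directly following tags, so a run of tags becomes a single space.
    private static final Pattern TAG_RUN = Pattern.compile("(?s)<[^>]*>(\\s*<[^>]*>)*");

    static String stripTags(String html) {
        String spaced = TAG_RUN.matcher(html).replaceAll(" ");
        // Collapse the whitespace left behind where tags were removed.
        return spaced.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));
    }
}
```

All structure (paragraphs, list items, line breaks) is flattened into single spaces, which is fine for search indexing or previews but not for readable output.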

Related

How to preserve the meaning of tags like <br>, <ul>, <li>, <p> etc. when reading them in Java using the JSoup library?

I am writing a program that extracts certain information from local HTML files. That information is then shown on a Java JFrame and exported to an Excel file. (I am using the JSoup 1.9.2 library for HTML parsing.)
I am running into an issue: whenever I extract anything from an HTML file, JSoup does not take HTML tags such as line breaks, paragraphs and list items into account, so all the information comes out as one big chunk of data without any newlines or formatting.
To show you an example, if this is the data that I want to read:
Title
Line 1
Line 2
Unordered List
element 1
element 2
The data is coming back as:
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)
This is the piece of code that I am using to read the data in:
private String getTitle(Document doc) { // doc is the local HTML file
    Elements title = doc.select(".title");
    for (Element id : title) {
        return id.text(); // returns the text of the first match only
    }
    return "No Title Available ";
}
Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data on the JFrame and export it to Excel in a more readable format?
Thanks.
Just to give everyone an update, I was able to find a solution (more like a workaround) to the formatting issue. What I am doing now is extracting the complete HTML using id.html() and storing it in a String. Then I use String.replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything onto a single line. The call looks something like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks something like:
private String processHTML(String initial) { // initial is the String with all the HTML tags
    String modified = initial;
    modified = modified.replaceAll("\\<[^>]*>", ""); // regular expression that strips the tags
    modified = modified.trim(); // get rid of any unwanted space before and after the needed data
    // The replaceAll() calls below decode any HTML entities that might be left in the data extracted from the HTML
    modified = modified.replaceAll("&nbsp;", " ");
    modified = modified.replaceAll("&lt;", "<");
    modified = modified.replaceAll("&gt;", ">");
    modified = modified.replaceAll("&amp;", "&");
    modified = modified.replaceAll("&quot;", "\"");
    modified = modified.replaceAll("&apos;", "\'");
    modified = modified.replaceAll("&cent;", "¢");
    modified = modified.replaceAll("&copy;", "©");
    modified = modified.replaceAll("&reg;", "®");
    return modified;
}
Thank you all again for helping me with this.
Cheers.
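A variation on the workaround above, as a rough sketch rather than a tested solution: turn the line-level tags into newlines first, then strip what is left. The class and method names (HtmlLineBreaks, toText), the choice of tags, and the "- " bullet prefix are all assumptions made for illustration. It also decodes &amp; last, so that input such as "&amp;lt;" is not decoded twice.

```java
// Hypothetical helper; uses only java.lang.String, no jsoup.
final class HtmlLineBreaks {
    static String toText(String html) {
        String s = html;
        // Turn line-level tags into newlines before the generic tag removal.
        s = s.replaceAll("(?i)<br\\b[^>]*>", "\n");
        s = s.replaceAll("(?i)</?(p|div|ul|ol|h[1-6])\\b[^>]*>", "\n");
        // Start each list item on its own line with a simple bullet.
        s = s.replaceAll("(?i)<li\\b[^>]*>", "\n- ");
        // Remove every remaining tag.
        s = s.replaceAll("(?s)<[^>]*>", "");
        // Decode a few common entities; &amp; goes last to avoid double decoding.
        s = s.replace("&nbsp;", " ").replace("&lt;", "<").replace("&gt;", ">")
             .replace("&quot;", "\"").replace("&apos;", "'").replace("&amp;", "&");
        // Trim each line and drop the empty ones.
        StringBuilder out = new StringBuilder();
        for (String line : s.split("\n")) {
            String t = line.trim();
            if (!t.isEmpty()) {
                if (out.length() > 0) out.append('\n');
                out.append(t);
            }
        }
        return out.toString();
    }
}
```

The resulting string can go into a JTextArea as-is, or be split on "\n" when each line should land in its own spreadsheet cell.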

Receive data from html tag with Java in Selenium

I have the following HTML tag and I want to read "name":"test_1476979972086" from my Java Selenium code.
How can I achieve this?
I already tried the getText and getAttribute functions, but without any success.
<a data-ng-href="#/devices"
target="_blank"
class="ng-binding"
href="#/devices">
{"name":"test_1476979972086"}
</a>
getText() is always empty. The XPath is unique; newDevice.created is unique on the page.
final By successMessageBy = By.xpath("//p[@data-ng-show='newDevice.created']/a");
final WebElement successMessage = wait.until(ExpectedConditions.presenceOfElementLocated(successMessageBy));
final String msg = successMessage.getText();
WebElement#getText() returns only visible text. The element may already be present in the DOM while its text only becomes visible later.
So if WebElement#getText() doesn't work as expected, try getAttribute("textContent") instead:
successMessage.getAttribute("textContent");
At first glance, the line below should work. The fact that what you've tried doesn't work leads me to believe that you aren't selecting the correct element. Since I can't see the rest of your HTML, this selector might not be unique; you'll have to play around with it, or share the surrounding HTML.
String json = driver.findElement(By.cssSelector("a[href$='/devices']")).getText();
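Whichever call ends up returning the text, the string will still contain the surrounding whitespace and braces from the anchor body. A minimal sketch for pulling out the name value with a regular expression follows; the class and method names (NameExtractor, extractName) are hypothetical, and for anything beyond this one flat field a real JSON parser would be the safer choice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; independent of Selenium, it only works on the retrieved string.
final class NameExtractor {
    // Captures the quoted value that follows "name": in the text.
    private static final Pattern NAME = Pattern.compile("\"name\"\\s*:\\s*\"([^\"]*)\"");

    // Returns the value of the "name" field, or null when the text has no such field.
    static String extractName(String textContent) {
        Matcher m = NAME.matcher(textContent);
        return m.find() ? m.group(1) : null;
    }
}
```

Here the input would be the result of successMessage.getAttribute("textContent") from the answer above.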

converting HTML to String without TextView

I am having problems filling my TextView.
I have an HTML String that needs to be converted from HTML to a plain String, and then I need to replace some characters.
The problem is that I can convert it directly with:
TextView.setText(Html.fromHtml(sampleText));
But I need to alter the converted sampleText before giving it to the TextView.
E.g.:
String sampleText = "<b>Some Text</b>";
newSampleText = Html.fromHtml(sampleText);
newSampleText.replace(char1, char2);
TextView.setText(newSampleText);
Does anyone know how to convert the HTML saved inside the String?
If you don't need formatting, use Html.fromHtml(sampleText).toString().
Otherwise, you need to extract the text from the HTML with jsoup to find and change it.
Please try this:
You need to use Html.fromHtml() to use HTML in your XML strings. Simply referencing a String with HTML in your layout XML will not work.
Also try the version of setText that takes a buffer type, and use the SPANNABLE buffer type.

Extracting html tags based on attribute

I have a crawled page and I have retrieved the HTML of the page into a String object.
Now I want to parse this string and extract all tags that have itemprop defined into an associative array, for example:
String[] itemprops;
itemprops['title'] = "Some title";
itemprops['description'] = "Some description";
Can I do this with regex somehow, or is there some library that can do this?
Look at JSoup. It's an HTML scraping and parsing library that's exactly what you want.
In your case, you can do something like:
Document doc = Jsoup.parse(HTMLString);
String title = doc.select("title").text();
String description = doc.select("meta[name=description]").attr("content");
The select() function uses CSS selectors to get elements.
Also make sure that the HTML you use follows strict syntax, because broken syntax may cause a parsing exception or loss of data.
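On the "associative array" part of the question: Java arrays cannot be indexed by strings, so a Map<String, String> is the usual substitute. The sketch below fills such a map with a regular expression instead of JSoup, purely to show the shape of the result. The names (ItempropSketch, extract) are made up, and the pattern only handles simple, well-formed elements whose value is their text content with a double-quoted itemprop attribute; it ignores self-closing tags such as <meta itemprop=... content=...> and breaks on nested tags of the same name, which is why a real parser remains the better choice.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class ItempropSketch {
    // group 1: tag name, group 2: itemprop value, group 3: inner HTML up to the matching close tag.
    private static final Pattern ITEM = Pattern.compile(
        "<(\\w+)[^>]*\\bitemprop\\s*=\\s*\"([^\"]*)\"[^>]*>(.*?)</\\1>", Pattern.DOTALL);

    static Map<String, String> extract(String html) {
        Map<String, String> props = new LinkedHashMap<>(); // keeps document order
        Matcher m = ITEM.matcher(html);
        while (m.find()) {
            // Strip any inline tags from the element body and keep the text.
            String text = m.group(3).replaceAll("<[^>]*>", "").trim();
            props.put(m.group(2), text);
        }
        return props;
    }
}
```

With this, props.get("title") plays the role of itemprops['title'] from the question.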

Convert HTML to DOC with images in Java

I am stuck on a Java application.
Is there any way to convert an HTML template, including the images in the HTML file, to a DOC template using Java?
I have tried the Aspose API, but I can't use it because it is not open source.
I fetch the HTML template from a database and store the whole template in a String, and now I want to output this String as a Word DOC including the images.
Here is my piece of code:
proc_stmt = con.prepareCall("{call PROCEDURECALL(?)}");
proc_stmt.registerOutParameter(1, Types.CLOB);
proc_stmt.execute();
String htmltemplate = proc_stmt.getString(1);
I am storing the HTML template in a String and now I want to convert it to a Word DOC.
It also has an image whose src is a local path. The rest of the template works fine, but the image does not appear in the output. Can anyone help me with this?
Thank you all for the time and help.
I tried the docx4j API 2.8.1 and it works wonderfully.
It has a ConvertInXHTMLFile sample, and it works fine.
If anyone wants the code I will post it.
Here is the link that helped me :
https://github.com/plutext/docx4j/blob/master/src/main/java/org/docx4j/samples/ConvertInXHTMLFile.java
Once again, Thank you all.
Vrinda.
