Drawing a box around sub-strings of a document - java

I was hoping to get some help in how I should approach a program I have attempted to write a few times now.
I have a number of folders. In each folder, there is an HTML file and a .txt file that contains the text of the HTML file, stripped of all HTML tags.
As an example, a simplified HTML file may be
<html><head></head><body><p>This is some <b>text</b></p><p>Please ignore me</p></body></html>
And within a .txt in the same folder, I have "This is some text".
From these two files, I would like to create a new HTML file with a box drawn around "This is some text".
The obvious problem here is that the pretty-printed text files do not contain any mark-up, and so finding it within the HTML document is difficult.
My idea thus far has been :
-Save the .txt contents in a variable.
-Grab the HTML contents and strip them of all HTML tags :
public static String html2text(String html) {
return Jsoup.parse(html).text();
}
I'm unsure how to proceed from this point. I mean...I could try to add a div with a class surrounding the text, and then add a border style to this...but how do I find the sub-string in the HTML reliably, retaining all of the markup within the HTML ?
I'm sure there is a simple way to do this and I am just overthinking it, I would usually have a chat with a friend about this and solve it but everyone seems to be offline - so I come to you for guidance here.
Can anyone offer any feedback please? Thanks.

This should work for you:
More information on selectors and setting attribute values
private void test(){
//replace with your stored variables
String html = "<html><head></head><body><p>This is some <b>text</b></p><p>Please ignore me</p></body></html>";
String txt = "This is some text";
Document doc = Jsoup.parse(html);
String query = "p:contains(" + txt + ")";
Elements htmlTxt = doc.select(query); //selects all the paragraph elements with your target txt
//Loop through each element and add a red border around it
for(Element e : htmlTxt){
System.out.println("e: " + e.toString());
e.attr("style", "border:3px; border-style:solid; border-color:#FF0000; padding: 1em;");
}
}
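If the target text can span several elements, one alternative is to map each character of the tag-stripped text back to its offset in the original HTML. The following is a minimal sketch using only the JDK, not jsoup; `PlainTextLocator` and `locate` are hypothetical names. It assumes the .txt content matches the HTML text character for character (no entity decoding, no whitespace normalization).

```java
import java.util.ArrayList;
import java.util.List;

class PlainTextLocator {
    /**
     * Returns {start, endExclusive} offsets into html that cover the first
     * occurrence of txt in the tag-stripped text, or null if there is none.
     */
    static int[] locate(String html, String txt) {
        if (txt.isEmpty()) {
            return null;
        }
        StringBuilder plain = new StringBuilder();
        List<Integer> offsets = new ArrayList<>(); // offsets.get(i) = index in html of plain char i
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag) {
                plain.append(c);
                offsets.add(i);
            }
        }
        int at = plain.indexOf(txt);
        if (at < 0) {
            return null;
        }
        int start = offsets.get(at);
        int end = offsets.get(at + txt.length() - 1) + 1;
        return new int[] { start, end };
    }
}
```

For the sample document, the returned range covers `This is some <b>text`. Because such a range can start in one element and end in another, wrapping it directly in a div may produce mis-nested markup; styling the nearest common ancestor element is usually safer.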

Related

Split text within span if there is an img tag between them - selenium - java

I'm having a scenario where within a span tag I have two strings, separated by an img tag.
<span>
text
<img/>
text
</span>
When I try to find this span using Selenium and XPath, I find it, but the getText() method of the span element returns "texttext". My intention is to get "text text".
driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getText();
My XPath is fine (I'm getting the right web element), but how can I get the string as described here? I want to append a space wherever there is an img tag.
Will be glad for your help,
Thanks!
There is no direct way to do it using .getText(). You can use .getAttribute("innerHTML") and then you will need to replace whatever is between the two "text" strings (IMG, etc.) with a space.
Here's a simple example based on your HTML that will probably work.
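String s = driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getAttribute("innerHTML"); // e.g. text<img>text (the span's contents, without the <span> tags)
s = s.replaceAll("<img[^>]*>", " "); // matches both <img> and <img/>, since browsers usually serialize <img/> as <img>
System.out.println(s);
For that input, this prints
text text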
To retrieve the text from the first child node and the text from the third child node, you can call getAttribute("innerHTML"), split() the result on line breaks, and print the two parts with a space between them. This assumes each child node sits on its own line in the returned markup:
String my_string = driver.findElement(By.xpath("MY_XPATH_TO_FIND_THAT_SPAN")).getAttribute("innerHTML");
String[] stringParts = my_string.split("\n");
String partA = stringParts[0];
String partB = stringParts[2];
System.out.println(partA + " " + partB);
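Both answers come down to treating the img tag as a separator in the innerHTML string. A minimal sketch of that idea with plain string handling follows; `ImgSplitter` and `joinAroundImg` are hypothetical names, and the input is assumed to be whatever getAttribute("innerHTML") returned. Trimming each piece makes the result independent of where the line breaks fall.

```java
import java.util.ArrayList;
import java.util.List;

class ImgSplitter {
    /** Splits the markup at every img tag and joins the non-empty pieces with one space. */
    static String joinAroundImg(String innerHtml) {
        List<String> parts = new ArrayList<>();
        for (String piece : innerHtml.split("(?i)<img[^>]*>")) {
            String trimmed = piece.trim(); // drop the line breaks around each text node
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return String.join(" ", parts);
    }
}
```

With the span from the question, `ImgSplitter.joinAroundImg(element.getAttribute("innerHTML"))` would yield "text text", assuming the span contains no other tags.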

Strip HTML from text but also specific content wrapped in html in Java

Let's say I have the following text:
<blockquote>
<div>This is text and html in a blockquote<\/div>
More text in a block quote.
<\/blockquote>
Here's some content <b> bolded </b> and <i> other random HTML tags </i>
I'd like to strip the entire blockquote out, and keep the content in other html tags. So the output would be:
Here's some bolded and other random HTML tags.
I know there are a hundred or more answers to "Stripping HTML from content", but I can't find one on stripping HTML tags and also removing the content wrapped in specific HTML tags.
How can I get the desired output in Java?
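You could use simple regular expressions: .replaceAll("(?s)<blockquote>.*?</blockquote>", "").replaceAll("<[^>]+>", ""). The (?s) flag lets . match line breaks, which a multi-line blockquote needs, and the lazy .*? stops at the first closing tag. It should be enough.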
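As a self-contained sketch of the regex approach (`BlockquoteStripper` is a hypothetical name; it assumes blockquotes are not nested and that the input has no tags split in unusual ways):

```java
class BlockquoteStripper {
    /** Removes whole blockquote elements, then all remaining tags, then collapses whitespace. */
    static String strip(String html) {
        return html
                .replaceAll("(?is)<blockquote\\b.*?</blockquote>", "") // (?s): '.' also matches newlines
                .replaceAll("<[^>]+>", "")                              // drop the remaining tags
                .replaceAll("\\s+", " ")                                // collapse runs of whitespace
                .trim();
    }
}
```

Regular expressions cannot handle nested or malformed markup reliably, so an HTML parser remains the more robust choice for arbitrary input.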
It may seem like an overhead but you could use Jsoup to parse the HTML and operate with the Elements.
Maybe there is something more lightweight for your problem, but Jsoup should do the job just fine. You can select elements using CSS selectors, remove the ones you don't need, and get the plain text (without tags) out of the rest.
Here is a simple sample:
final String html = "<html><div><bq>i do not want this</bq>but this <b>should</b> all <i>get</i> read</div></html>";
final Document document = Jsoup.parse(html);
final Elements div = document.select("div");
div.select("bq").remove();
System.out.println(div.text()); // prints but this should all get read
You could also use JSoup in this way:
String text = "<blockquote>\n" +
" <div>This is text and html in a blockquote</div>\n" +
" More text in a block quote.\n" +
"</blockquote> \n" +
"Here's some content <b> bolded </b> and <i> other random HTML tags </i>";
Whitelist whitelist = Whitelist.none();
String cleanStr = Jsoup.clean(text, whitelist);

How to preserve the meaning of tags like <br>, <ul> , <li> , <p> etc when reading them in Java using JSOUP library?

I am writing a program that extracts some certain information from local HTML files. That information is then shown on a Java JFrame and is exported to an excel file. (I am using JSoup 1.9.2 library for the HTML parsing purposes)
I am running into an issue where whenever I extract anything from an HTML file, JSoup is not taking HTML tags like break tags, line tags etc. into account and so, all the information is being extracted like a big chunk of data without any proper newlines or formatting.
To show you an example, if this is the data that I want to read :
Title
Line 1
Line 2
Unordered List
element 1
element 2
The data is coming back as :
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)
This is the piece of code that I am using for reading in :
private String getTitle(Document doc) { // doc is the local HTML file
Elements title = doc.select(".title");
for (Element id : title) {
return id.text();
}
return "No Title Available ";
}
Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data on the JFrame and export it to Excel in a more readable format?
Thanks.
Just to give everyone an update, I was able to find a solution (more like a workaround) to the formatting issue. What I am doing now is extracting the complete HTML using id.html(), which I store in a String object. Then, I use the String function replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything onto a single line. The replaceAll() call looks something like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks something like :
private String processHTML(String initial) { //initial is the String with all the HTML tags
String modified = initial;
modified = modified.replaceAll("\\<[^>]*>",""); //regular expression used
modified = modified.trim(); //To get rid of any unwanted space before and after the needed data
//All the replaceAll() functions below are to get rid of any HTML entities that might be left in the data extracted from the HTML
modified = modified.replaceAll("&nbsp;", " ");
modified = modified.replaceAll("&lt;", "<");
modified = modified.replaceAll("&gt;", ">");
modified = modified.replaceAll("&amp;", "&");
modified = modified.replaceAll("&quot;", "\"");
modified = modified.replaceAll("&apos;", "\'");
modified = modified.replaceAll("&cent;", "¢");
modified = modified.replaceAll("&copy;", "©");
modified = modified.replaceAll("&reg;", "®");
return modified;
}
Thank you all again for helping me with this.
Cheers.
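One detail worth watching in a hand-rolled entity replacement like the one above is ordering: if "&amp;" is decoded before the other entities, text such as "&amp;lt;" is decoded twice and ends up as "<". A minimal sketch that decodes "&amp;" last (`EntityDecoder` is a hypothetical name, and only a handful of entities are covered):

```java
class EntityDecoder {
    /** Decodes a few common HTML entities; "&amp;" goes last to avoid double decoding. */
    static String decode(String s) {
        return s.replace("&nbsp;", " ")
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&apos;", "'")
                .replace("&amp;", "&"); // last, so "&amp;lt;" becomes "&lt;" rather than "<"
    }
}
```

String.replace() is used instead of replaceAll() because the entities are literal strings, not regular expressions.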

How to keep line breaks when using Jsoup.parse?

This is not a duplicate. There was a similar question, but none of its answers can deal with a real HTML file. One can save any HTML page, even this one, and try any of the solutions from that question; none of them solves the problem completely.
The question is
I have a saved .htm file on my desktop. I need to get pure text from it . However I do need to keep the line breaks so that the text is not on just one or couple of lines.
I tried the following and all methods from here
FileInputStream in = new FileInputStream("C:\\...myfile.htm");
String htmlText = IOUtils.toString(in);
for (String line : htmlText.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
This does preserve only lines of html file. However, the text is still messed up, because such things as </br> , <p> got removed. How can I parse so that the text preserves all natural line breaks.
I've noticed this difference between jsoup and, say, Selenium: when extracting text, Selenium keeps the line breaks and jsoup does not. With that said, I think the best route is to get the inner HTML of the node you are extracting text from, then do a replaceAll on it to replace </br> and <p> with line breaks.
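A rough sketch of that replaceAll idea on a plain HTML string, using only the JDK (`LineBreakKeeper` is a hypothetical name; it treats br tags and closing p tags as line breaks and ignores every other tag):

```java
class LineBreakKeeper {
    /** Turns br tags and paragraph ends into newlines, then strips the remaining tags. */
    static String toText(String html) {
        String s = html.replaceAll("(?i)</?br\\s*/?>", "\n"); // <br>, <br/>, <br /> and </br>
        s = s.replaceAll("(?i)</p\\s*>", "\n");               // end of a paragraph
        s = s.replaceAll("<[^>]+>", "");                      // everything else is dropped
        return s.trim();
    }
}
```

Other block-level tags (li, div, headings) could be handled the same way by adding one replaceAll per tag.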
As a more complete solution, instead of reading the text file line by line, is it possible to traverse the html text more natively? Your best bet would be to traverse the tree using something like a recursive function and when you hit a TextNode, add that text to the stripped variable from your example. Then when you hit a <p> or </br> element, you can add a linefeed as need be.
Something like:
Document doc = Jsoup.parse(htmlText);
Then pass that in a recursive function for each child node:
String getText(Element parentElement) {
String working = "";
for (Node child : parentElement.childNodes()) {
if (child instanceof TextNode) {
working += ((TextNode) child).text(); // Node has no text() method, so cast to TextNode first
}
if (child instanceof Element) {
Element childElement = (Element)child;
// do more of these for p or other tags you want a new line for
if (childElement.tag().getName().equalsIgnoreCase("br")) {
working += "\n";
}
working += getText(childElement);
}
}
return working;
}
Then you can just call the function to strip the text.
String strippedText = getText(doc);
Not the simplest solution, but one I can think of that should work if you want to extract all text from an HTML document. I haven't run this code, I just wrote it now, so I apologize if I missed something. But it should give you the general idea.
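The same traversal idea can be tried without jsoup. The sketch below swaps in the JDK's built-in XML DOM parser, so it only works on well-formed XHTML, which real-world HTML often is not; `DomTextWalker` and `extract` are hypothetical names. It adds a newline at each br and after each p.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomTextWalker {
    /** Recursively collects text, emitting a newline at br and after p. */
    static String getText(Node parent) {
        StringBuilder out = new StringBuilder();
        NodeList children = parent.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                out.append(child.getNodeValue());
            } else if (child.getNodeType() == Node.ELEMENT_NODE) {
                String name = child.getNodeName();
                if (name.equalsIgnoreCase("br")) {
                    out.append('\n');
                }
                out.append(getText(child));
                if (name.equalsIgnoreCase("p")) {
                    out.append('\n');
                }
            }
        }
        return out.toString();
    }

    /** Parses well-formed XHTML and returns its text with line breaks kept. */
    static String extract(String xhtml) {
        try {
            org.w3c.dom.Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
            return getText(doc.getDocumentElement());
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XHTML", e);
        }
    }
}
```

Using a StringBuilder instead of repeated String concatenation keeps the traversal linear in the size of the document.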

How to add "..." ellipsis to lengthy text in a WebView?

I am using a WebView to align RTL text correctly.
Simply, I want to add "..." ellipsis when the text is lengthy.
In a TextView, this is achieved using android:ellipsize="end".
Is there a way to achieve "..." ellipsis or to control the number of lines in a WebView?
Here is the code:
String header = "<html><head><meta http-equiv=\"Content-Type\" " + "content=\"text/html; charset=UTF-8\" /></head>";
String dt = "<body dir=\"rtl\">" + o.get(p).getTitle() +"</body></html>";
webView.loadData(URLEncoder.encode(header + dt,"utf-8").replaceAll("\\+"," "), "text/html", "UTF-8");
You could parse the html and find out which is the last visible html element on the page. You could then use the following by appending the #ellipsis id to that element.
CSS Text Wrapping:
http://jsfiddle.net/6HcWM/
The problem will be finding the last visible html element as the screen size can vary, also the zoom level could presumably vary. I'm guessing that you'll need to append JavaScript to the html to discover this...
Maybe this would work:
How to tell if a DOM element is visible in the current viewport?
You have two simple options:
Use CSS and the text-overflow property on your HTML element: directions here.
If you need a bit better control where the ellipsis appears and have the means — include jQuery and a jQuery ellipsis plugin in your HTML via script tags: directions here.
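For the CSS option, a minimal sketch of how the page string from the question could be built so that a single-line title is cut off with an ellipsis. `EllipsisPage`, `build` and the `title` class name are hypothetical, and the title is assumed to be HTML-safe already.

```java
class EllipsisPage {
    /** Builds an RTL page whose title stays on one line and is truncated with "...". */
    static String build(String title) {
        // text-overflow only takes effect on a block that hides its overflow and does not wrap
        String css = ".title{white-space:nowrap;overflow:hidden;text-overflow:ellipsis;}";
        return "<html><head>"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />"
                + "<style>" + css + "</style></head>"
                + "<body dir=\"rtl\"><div class=\"title\">" + title + "</div></body></html>";
    }
}
```

The resulting string can be passed to the WebView in place of header + dt. This covers the single-line case only; limiting the text to a given number of lines needs a different CSS technique.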
