This is not a duplicate. There was a similar question, but none of its answers can deal with a real HTML file. You can save any HTML page, even this one, and try any of the solutions from that question; none of them solves the problem completely.
The question is
I have a saved .htm file on my desktop. I need to get pure text from it. However, I do need to keep the line breaks so that the text does not end up on just one or a couple of lines.
I tried the following and all methods from here
FileInputStream in = new FileInputStream("C:\\...myfile.htm");
String htmlText = IOUtils.toString(in);
for (String line : htmlText.split("\n")) {
String stripped = Jsoup.parse(line).text();
System.out.println(stripped);
}
This preserves only the line breaks of the HTML file itself. However, the text is still messed up, because things such as <br> and <p> are removed. How can I parse the file so that the text preserves all natural line breaks?
This is a difference I've noticed between jsoup and, say, Selenium: when extracting text, Selenium keeps the line breaks and jsoup does not. With that said, I think the best route is to get the inner HTML of the node you are extracting text from, then do a replaceAll on it to replace <br> and <p> with line breaks.
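A minimal sketch of that replaceAll idea, using only the JDK so it does not depend on a particular jsoup version. The class name BreakPreservingStripper, the method toText and the list of tags treated as line breaks are assumptions made for illustration, and a regex like this will not cope with every real-world page.

```java
import java.util.regex.Pattern;

final class BreakPreservingStripper {
    // Tags that should end a line. This list is an assumption; extend it as needed.
    private static final Pattern BREAKS =
            Pattern.compile("(?i)<br\\s*/?>|</br>|</p>|</div>|</li>|</h[1-6]>");
    // Any remaining tag is simply dropped.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");

    static String toText(String innerHtml) {
        String withBreaks = BREAKS.matcher(innerHtml).replaceAll("\n");
        return TAGS.matcher(withBreaks).replaceAll("").trim();
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>First line<br>second line</p><p>Third</p>"));
    }
}
```

The inner HTML string passed to toText would come from whatever parser you use, for example the html() value of a jsoup element.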
As a more complete solution, instead of reading the text file line by line, is it possible to traverse the HTML more natively? Your best bet would be to traverse the tree with a recursive function: when you hit a TextNode, append its text to the stripped variable from your example, and when you hit a <p> or <br> element, add a line feed as needed.
Something like:
Document doc = Jsoup.parse(htmlText);
Then pass that in a recursive function for each child node:
String getText(Element parentElement) {
String working = "";
for (Node child : parentElement.childNodes()) {
if (child instanceof TextNode) {
working += ((TextNode) child).text(); // Node itself has no text() method, so cast to TextNode
}
if (child instanceof Element) {
Element childElement = (Element)child;
// do more of these for p or other tags you want a new line for
if (childElement.tag().getName().equalsIgnoreCase("br")) {
working += "\n";
}
working += getText(childElement);
}
}
return working;
}
Then you can just call the function to strip the text.
String strippedText = getText(doc);
Not the simplest solution, but one I can think of that should work if you want to extract all text from an HTML document. I haven't run this code, I just wrote it now, so I apologize if I missed something. But it should give you the general idea.
Related
I'm trying to print the text content located at the second br tag with the following XPath, but the text from all the br tags is printed to the console. What might be the reason?
driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]")).getText();
The reason you can't get the text is because the text is not in the br tag.
In <br />, the < opens the tag and the /> closes it, so nothing sits in between.
Additionally, if you read a bit more about it, even the /> is surplus to requirements. If you had just <br> the text wouldn't be contained within it because:
The <br> tag is an empty tag which means that it has no end tag.
The point is, all your text is in the h2. You need to deal with that as best you can.
To solve your issue you'll need to:
.getAttribute("innerHTML") - this will give you all the text of the h2 along with the br tags
split your string on the string <br> (please note that in Chrome my <br /> becomes <br> - you might need to adjust this)
either select item [2] or use a lambda to select the item that contains your text (whichever you feel more comfortable with)
And those steps look like this:
//Get the element,
var h2Element = driver.findElement(By.xpath("//*[text()[contains(.,'Telefon')]]"));
var myTextArray = h2Element.getAttribute("innerHTML").split("<br>");
//approach 1 - just print the [2] item
var approach1Text = myTextArray[2];
System.out.println(approach1Text);
//approach 2 - use a lambda to select by contains
var approach2Text = Arrays.stream(myTextArray).filter(a -> a.contains("Telefon")).findFirst().get();
System.out.println(approach2Text);
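Since the browser may hand back <br>, <br/> or <br />, a small sketch of the split step that tolerates all three forms might look like this. It uses only the JDK; BrSplitter, splitOnBr and firstContaining are hypothetical names, and the sample string in main is made up rather than taken from the page in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class BrSplitter {
    // Splits inner HTML on any spelling of the br tag and drops empty pieces.
    static List<String> splitOnBr(String innerHtml) {
        return Arrays.stream(innerHtml.split("(?i)<br\\s*/?>"))
                .map(String::trim)
                .filter(part -> !part.isEmpty())
                .collect(Collectors.toList());
    }

    // Returns the first piece containing the given text, if there is one.
    static Optional<String> firstContaining(String innerHtml, String needle) {
        return splitOnBr(innerHtml).stream()
                .filter(part -> part.contains(needle))
                .findFirst();
    }

    public static void main(String[] args) {
        String innerHtml = "Adres<br>Street 1<br />Telefon: 123"; // made-up sample
        System.out.println(firstContaining(innerHtml, "Telefon").orElse("not found"));
    }
}
```

Returning an Optional avoids the exception that a bare findFirst().get() throws when nothing matches.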
As a bonus note - you probably had fun getting your xpath to work because the br tag splits the text into separate text nodes. As a result your h2 actually has multiple text() values. It has text(), text()[2], text()[3], etc - one more than there are brs
I put together a simple page to test this for you - just to show you what's going on: (note the xpath in dev tools)
This is text()[3] because xpaths are indexed from 1 (compared to the Java code above, which starts at 0). However - that's just an example of why it's tricky, I wouldn't recommend you do it that way.
The easy way to eliminate the effect of <br> (and other tags!) on text is to use normalize-space().
An xpath like this works and is relatively simple to follow.
//h2[contains(normalize-space(),'Telefon')]
Maps to my sample page OK:
I share this extra bit in case you have any more text-split objects and it helps you down the line.
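To try these expressions without a browser, here is a small sketch that evaluates them with the XPath 1.0 engine bundled with the JDK. The SAMPLE markup is a made-up, well-formed stand-in for the page in the question, and NormalizeSpaceDemo and eval are hypothetical names; Selenium evaluates XPath in the browser, so treat this only as a way to explore the expressions.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class NormalizeSpaceDemo {
    // Made-up sample: one h2 whose text is split into three text nodes by br tags.
    static final String SAMPLE =
            "<html><body><h2>Adres<br/>Street 1<br/>Telefon: 123</h2></body></html>";

    // Evaluates an XPath expression against SAMPLE and returns its string value.
    static String eval(String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(SAMPLE)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // Third text node of the h2 (XPath positions start at 1).
        System.out.println(eval("//h2/text()[3]"));
        // Number of h2 elements whose whole normalized text contains 'Telefon'.
        System.out.println(eval("count(//h2[contains(normalize-space(),'Telefon')])"));
    }
}
```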
...All that said - good work on getting your original xpath to work. That's good too.
driver.findElement() returns only the first element matching the given XPath (driver.findElements() returns all of them). To target one specific element, you can pass its full XPath: driver.findElement(By.xpath(fxp)), where fxp is the full XPath of that element.
Try changing your xpath expression to
//h2/br/following-sibling::text()[contains(.,'Telefon')]
and see if it works.
I am writing a program that extracts some certain information from local HTML files. That information is then shown on a Java JFrame and is exported to an excel file. (I am using JSoup 1.9.2 library for the HTML parsing purposes)
I am running into an issue where whenever I extract anything from an HTML file, JSoup is not taking HTML tags like break tags, line tags etc. into account and so, all the information is being extracted like a big chunk of data without any proper newlines or formatting.
To show you an example, if this is the data that I want to read :
Title
Line 1
Line 2
Unordered List
element 1
element 2
The data is coming back as :
Title Line 1 Line 2 Unordered List element 1 element 2 (i.e. all the HTML tags are ignored)
This is the piece of code that I am using for reading in :
private String getTitle(Document doc) { // doc is the local HTML file
Elements title = doc.select(".title");
for (Element id : title) {
return id.text();
}
return "No Title Available ";
}
Can anyone suggest a way to preserve the meaning of the HTML tags, so that I can both display the data on the JFrame and export it to Excel in a more readable format?
Thanks.
Just to give everyone an update, I was able to find a solution (more like a workaround) to the formatting issue. What I am doing now is extracting the complete HTML using id.html(), which I store in a String object. Then I use the String function replaceAll() with a regular expression to get rid of all the HTML tags without pushing everything onto a single line. The replaceAll() call looks something like replaceAll("\\<[^>]*>",""). My whole processHTML() function looks something like :
private String processHTML(String initial) { //initial is the String with all the HTML tags
String modified = initial;
modified = modified.replaceAll("\\<[^>]*>",""); //regular expression used
modified = modified.trim(); //To get rid of any unwanted space before and after the needed data
//All the replaceAll() calls below get rid of any HTML entities that might be left in the data extracted from the HTML
modified = modified.replaceAll("&nbsp;", " ");
modified = modified.replaceAll("&lt;", "<");
modified = modified.replaceAll("&gt;", ">");
modified = modified.replaceAll("&amp;", "&");
modified = modified.replaceAll("&quot;", "\"");
modified = modified.replaceAll("&apos;", "'");
modified = modified.replaceAll("&cent;", "¢");
modified = modified.replaceAll("&copy;", "©");
modified = modified.replaceAll("&reg;", "®");
return modified;
}
Thank you all again for helping me with this.
Cheers.
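A compact variant of the same workaround, sketched with a lookup table instead of one replaceAll() per entity. The names EntityDecoder and stripAndDecode are hypothetical and the table covers only a handful of named entities; it uses String.replace, which takes literal text, and decodes the ampersand entity last so that already-escaped text is not decoded twice.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EntityDecoder {
    // Small, deliberately incomplete table of named entities (an assumption).
    private static final Map<String, String> ENTITIES = new LinkedHashMap<>();
    static {
        ENTITIES.put("&nbsp;", " ");
        ENTITIES.put("&lt;", "<");
        ENTITIES.put("&gt;", ">");
        ENTITIES.put("&quot;", "\"");
        ENTITIES.put("&apos;", "'");
        ENTITIES.put("&cent;", "\u00A2");
        ENTITIES.put("&copy;", "\u00A9");
        ENTITIES.put("&reg;", "\u00AE");
        ENTITIES.put("&amp;", "&"); // kept last on purpose
    }

    // Removes tags, trims the result and decodes the entities listed above.
    static String stripAndDecode(String html) {
        String text = html.replaceAll("<[^>]*>", "").trim();
        for (Map.Entry<String, String> entity : ENTITIES.entrySet()) {
            text = text.replace(entity.getKey(), entity.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(stripAndDecode("<p>Fish &amp; chips &lt;3</p>"));
    }
}
```

Like the original workaround, this keeps whatever line breaks are already present in the HTML string and adds none of its own.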
I'm collecting links from HTML and using jsoup to get the links which I add to a collection. The problem is that I need each link to be on one line so when written to a file it can be parsed line by line.
The input is a WARC record and for each record I want to get all links.
The getContentUTF8() and getHeaderMetadataItem() methods come from a WarcRecord API found here.
The code:
String baseURL = getHeaderMetadataItem("WARC-Target-URI");
Vector<String> retVec = new Vector<String>();
Document doc = Jsoup.parse(getContentUTF8(), baseURL);
Elements links = doc.select("a[href]");
for (Element link : links){
String newLink = link.absUrl("href").replace("\n", "");
retVec.add(newLink);
System.out.println(newLink);
}
When writing stdout to file some links get split over two lines, for example:
1: http://somelink.com/submit?url=
2: http://someotherlink.net
While others look like this (the way I want them):
1: http://somesite.com/submit?url=http://someothersite.com/
It looks like it always happens after a =.
EDIT: Added more information. Removing both carriage returns and newlines fixed some cases. However, I am now encountering tab characters within the absolute URLs from jsoup. I checked some source sites and they actually have tabs after href. It seems there can be a lot of cases to handle. Is there a general solution for catching them?
<a class="MenuButton " href="/ features"> <em> Features </em> </a>
turns into the absolute URL:
http://archinect.com/ features
Since I store it to a file in the form URI \t <list of links>, this will break when I parse it.
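One general way to catch all of these cases at once is to strip every whitespace character instead of listing \n, \r and \t separately. The sketch below assumes that dropping the whitespace is acceptable for your data (percent-encoding it would be the alternative); LinkSanitizer and singleLine are hypothetical names.

```java
final class LinkSanitizer {
    // Removes every run of whitespace (spaces, tabs, CR, LF) from a URL string
    // so that it always fits on one line and never contains a tab separator.
    static String singleLine(String url) {
        return url.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        System.out.println(singleLine("http://somelink.com/submit?url=\r\nhttp://someotherlink.net"));
        System.out.println(singleLine("http://archinect.com/\tfeatures"));
    }
}
```

In the loop from the question, this would be applied to the value returned by link.absUrl("href") before adding it to the collection.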
I was hoping to get some help in how I should approach a program I have attempted to write a few times now.
I have a number of folders. In each folder, there is an HTML file and a .txt file which contains text from the HTML file, stripped of all HTML tags.
As an example, a simplified HTML file may be
<html><head></head><body><p>This is some <b>text</b></p><p>Please ignore me</p></body></html>
And within a .txt in the same folder, I have "This is some text".
From these two files, I would like to create a new HTML file with a box drawn around "This is some text", like so :
The obvious problem here is that the pretty-printed text files do not contain any mark-up, and so finding it within the HTML document is difficult.
My idea thus far has been :
-Save the .txt contents in a variable.
-Grab the HTML contents, strip of all HTML tags :
public static String html2text(String html) {
return Jsoup.parse(html).text();
}
I'm unsure how to proceed from this point. I mean...I could try to add a div with a class surrounding the text, and then add a border style to this...but how do I find the sub-string in the HTML reliably, retaining all of the markup within the HTML ?
I'm sure there is a simple way to do this and I am just overthinking it, I would usually have a chat with a friend about this and solve it but everyone seems to be offline - so I come to you for guidance here.
Can anyone offer any feedback please? Thanks.
This should work for you:
More information on selectors and setting attribute values
private void test(){
//replace with your stored variables
String html = "<html><head></head><body><p>This is some <b>text</b></p><p>Please ignore me</p></body></html>";
String txt = "This is some text";
Document doc = Jsoup.parse(html);
String query = "p:contains(" + txt + ")";
Elements htmlTxt = doc.select(query); //selects all the paragraph elements with your target txt
//Loop through each element and add a red border around it
for(Element e : htmlTxt){
System.out.println("e: " + e.toString());
e.attr("style", "border:3px; border-style:solid; border-color:#FF0000; padding: 1em;");
}
}
How do I replace text in an XML document using Java?
Source:
<body>
<title>Home Owners Agreement</title>
<p>The <b>good</b> thing about a Home Owners Agreement is that...</p>
</body>
Desired output:
<body>
<title>Home Owners Agreement</title>
<p>The <b>good</b> thing about a HOA is that...</p>
</body>
I only want text in <p> tags to be replaced. I tried the following:
void replaceText(String term, String replaceWith, org.w3c.dom.Node p){
p.setTextContent(p.getTextContent().replace(term, replaceWith));
}
The problem with the above code is that all the child nodes of p get lost.
Okay, I figured out the solution.
The key to this is that you don't want to replace the text of the actual node. There is actually a child node that represents just the text. I was able to accomplish what I needed with this code:
private static void replace(Node root){
    //only text nodes are modified, so element children such as <b> survive
    if (root.getNodeType() == Node.TEXT_NODE){
        root.setTextContent(root.getTextContent().replace("Home Owners Agreement", "HOA"));
    }
    for (int i = 0; i < root.getChildNodes().getLength(); i++){
        replace(root.getChildNodes().item(i)); //recurse into each child
    }
}
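For completeness, here is a self-contained sketch of the same text-node idea that also limits the replacement to <p> elements, as the question asked. It uses only the JDK's DOM and Transformer classes; ParagraphTextReplacer and its method names are hypothetical, and it assumes the input is well-formed XML.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class ParagraphTextReplacer {
    // Parses the XML, rewrites text nodes below every <p>, and serializes the result.
    static String replaceInParagraphs(String xml, String term, String replacement) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList paragraphs = doc.getElementsByTagName("p");
            for (int i = 0; i < paragraphs.getLength(); i++) {
                replaceInTextNodes(paragraphs.item(i), term, replacement);
            }
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not process XML", e);
        }
    }

    // Only text nodes are changed, so child elements such as <b> are left alone.
    private static void replaceInTextNodes(Node node, String term, String replacement) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            node.setNodeValue(node.getNodeValue().replace(term, replacement));
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            replaceInTextNodes(children.item(i), term, replacement);
        }
    }
}
```

Called with the source document from the question, the term "Home Owners Agreement" and the replacement "HOA", it should leave the <title> untouched and change only the paragraph text.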
The problem here is that you actually want to replace the node, not only the text.
You can traverse the children of the current node, add them to a new node, and then swap the nodes.
But that requires a lot of work and is very sensitive to your document structure. For example, if somebody wraps your <p> tag in a div, you will have to change your parsing.
Moreover, this approach is very inefficient in terms of CPU and memory: you have to parse the whole document to change a couple of words in it.
My suggestion is the following: try regular expressions. In most cases they are powerful enough. For example, code like
xml.replaceFirst("(<p>.*?</p>)", "<p>The <b>good</b> thing about a HOA is that...</p>")
will work (at least in your case).
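A slightly more general sketch of the regex route, which replaces the term inside every <p>...</p> span and leaves the rest of the document alone. RegexParagraphReplacer and replaceInParagraphs are hypothetical names; the pattern assumes plain <p> tags without attributes and no nested paragraphs, so it is a shortcut for simple documents rather than a real XML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexParagraphReplacer {
    // Reluctant match of one paragraph; DOTALL lets it span line breaks.
    private static final Pattern PARAGRAPH = Pattern.compile("<p>.*?</p>", Pattern.DOTALL);

    static String replaceInParagraphs(String xml, String term, String replacement) {
        Matcher matcher = PARAGRAPH.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String paragraph = matcher.group().replace(term, replacement);
            // quoteReplacement keeps any $ or \ in the paragraph literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(paragraph));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<body><title>Home Owners Agreement</title>"
                + "<p>The <b>good</b> thing about a Home Owners Agreement is that...</p></body>";
        System.out.println(replaceInParagraphs(xml, "Home Owners Agreement", "HOA"));
    }
}
```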