extract paragraphs from HTML page - java

Using Jsoup, I want to extract all paragraphs from an HTML page, i.e. whatever is between <p> and </p>.
How do I accomplish this?

Can't you just do:
myDocument.getElementsByTag("p")
JSoup getElementsByTag
You can then iterate over the returned elements and get their data/text/ownText / whatever you think is most relevant for what you want to do.
JSoup Element.text()

Related

How to save HTML code in Sql Server database for correct display?

My website is just like Stack Overflow and under development. I am using plain textarea to take text input as I do not have any WMD editor like Stack Overflow's.
When I take HTML code as input and store it in database table in a text or nvarchar(max) column, it is stored successfully. But when I call that data for display, it displays the corresponding HTML page instead of that HTML code on screen. I am not able to resolve it. For better understanding I'm putting here input page and output page images of my website.
This is image of input page:
This is the image of output page:
What is going wrong here?
One easy way is to replace
< with &lt; and > with &gt;
in the HTML string which you retrieved, and then display it on the page.
Have you tried that?
You need to escape the HTML so it's not interpreted by the browser. How to do that depends on the view technology you're using.
With JSP and JSTL the escaping is automatically done with <c:out value="${myString}"/>. If you're not using JSTL yet, now's the time to start (there's a lot of other helpful things in there too).
You can save the HTML code just like text, using a varchar(max) column in the table. How the code is displayed depends on the browser. But if you use the nvarchar type, that will cause problems in display.
Another possible solution is to replace the HTML tags before storing the text in the database. What I did is:
text = text.replaceAll("<", "&lt;");
text = text.replaceAll(">", "&gt;");
and then I stored the text in the database, and it's working. Thanks to Bibin Mathew.
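The replacement approach described in the answers above can be sketched with the standard library alone. The class name `HtmlEscaper` and the method `escape` are hypothetical names chosen for illustration; the sketch also escapes `&` and quotes, which the answers do not mention, on the assumption that the stored text may contain them.

```java
class HtmlEscaper {

    /** Returns a copy of text in which HTML-significant characters are replaced by entities. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints &lt;p&gt;Tom &amp; Jerry&lt;/p&gt;
        System.out.println(escape("<p>Tom & Jerry</p>"));
    }
}
```

Walking the string once avoids a pitfall of chained replace calls: if `&` were replaced after `<`, the `&` in an already produced `&lt;` would be escaped a second time. Escaping at display time, rather than before storing, keeps the original text intact in the database.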

Print html portion into pdf using Java

community!
My project is simple: I have a link to a website that has a lot of information on different chemical substances, and I want to extract some data and put it into a PDF. The thing is that I want to keep the formatting of the original HTML (using its CSS, of course).
Example of substance: http://www.molbase.com/en/msds_1659-31-0-moldata-2.html#tabs
I used jsoup to read the HTML of the table on the bottom of the page, the MSDS one, containing multiple sections with different information about the substance, but I really don't know how to save the exact HTML format into my pdf file. I have tried with iText too, but it gives me "missing ending tag" error, and if it worked, it would print the full page, not only that msds table.
Here is what I have tried, but it isn't effective:
Document docu = Jsoup.connect(urlbun).get();
Element tableHeader = docu.select("div[class=\"msds\"]").first();
String[] finSyn = tableHeader.text().split(" ");
String moreText =" ";
I tried to split the text that the webpage has under that div (class="msds"), but I cannot find a way to split it properly.
Could you please give me a hint on what to do? Even if the formatting is not the same, I would like to be able to display the information in the same way, with indentation and such.
Thank you!
You can put the content that you want to convert to PDF inside an element with a CSS ID (such as a div) and then use the PDFmyURL API to convert only that section to PDF.
Please refer to this on our website about how to select pieces from a page to convert to PDF
Disclosure: I work for the company that owns this site

How would you parse this String to an object?

Note that this question is not about implementation, but for programming tips.
I'm trying to read some HTML code, and then create one or several objects in order to paint it back again, changing the format.
For example. Imagine this html:
<body>
Hello, this is some plain and I'm going to attach an image.
<img src="someimage.jpg" />
And after the image I keep writting.
And as this is a forum message, you can add a div to quote like the following:
<div class="post-quote"> Some user said something</div>
And that was it!
</body>
As you can see, there are several elements, like <img> and <div>.
My overall goal is to have everything split up like:
Text
Image
Text
Div(quote class)
Text
And then, in programming terms, it could be a List of contentElements.
With this list, I could paint those elements back onto the screen with custom formatting and positioning.
However, I can't find out how to divide the HTML String using some logical method.
Do you guys have any tips? How would you split this String to achieve the previously explained issue?
Thanks!!
Questions are welcome!
Edit
JSOUP is a parser. I'm not looking for a parser. I'm looking for TIPS about how I can keep the order of the parsed elements. Reread my question, please!
You should use an HTML parser such as jsoup.
Example on your HTML:
Document doc = Jsoup.parse(html);
print(doc.select("img").attr("src")); // ==> someimage.jpg
print(doc.select("div.post-quote").text()); // ==> Some user said something
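Since the question is about keeping the order of the pieces, here is a minimal standard-library sketch of the "List of contentElements" idea. It is not a general HTML parser: it assumes the input is the inner markup of the body, that the only tags are `<img>` and non-nested `<div class="...">` as in the example message, and that attributes use double quotes. The names `ContentSplitter`, `ContentElement` and `split` are hypothetical. With a real parser the same ordered list could be built by walking the body's child nodes in document order (jsoup exposes them through `childNodes()`).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentSplitter {

    /** One piece of the message, stored in the order it appeared. */
    public static final class ContentElement {
        public final String kind;     // "text", "image" or "div"
        public final String value;    // the text content, or the image's src
        public final String cssClass; // class attribute for a div, otherwise ""

        public ContentElement(String kind, String value, String cssClass) {
            this.kind = kind;
            this.value = value;
            this.cssClass = cssClass;
        }
    }

    // Matches an <img ... src="..."> tag, or a non-nested <div class="...">...</div>.
    private static final Pattern TAG = Pattern.compile(
            "<img\\b[^>]*?\\bsrc=\"([^\"]*)\"[^>]*>"
            + "|<div\\b[^>]*?\\bclass=\"([^\"]*)\"[^>]*>(.*?)</div>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    /** Splits the markup into text, image and div elements, preserving their order. */
    public static List<ContentElement> split(String html) {
        List<ContentElement> out = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            // Everything between the previous tag and this one is plain text.
            addText(out, html.substring(last, m.start()));
            if (m.group(1) != null) {
                out.add(new ContentElement("image", m.group(1), ""));
            } else {
                out.add(new ContentElement("div", m.group(3).trim(), m.group(2)));
            }
            last = m.end();
        }
        addText(out, html.substring(last));
        return out;
    }

    // Adds a text element unless the span is empty or whitespace only.
    private static void addText(List<ContentElement> out, String raw) {
        String text = raw.trim();
        if (!text.isEmpty()) {
            out.add(new ContentElement("text", text, ""));
        }
    }

    public static void main(String[] args) {
        String html = "Hello <img src=\"someimage.jpg\" /> more text "
                + "<div class=\"post-quote\"> Some user said something</div> And that was it!";
        for (ContentElement e : split(html)) {
            System.out.println(e.kind + ": " + e.value);
        }
    }
}
```

The point of the design is that the list index carries the position, so a renderer can loop over the list and decide per `kind` how to draw each element.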

Extract String from html page which is in Assets folder?

I have a couple of HTML pages in my assets folder. I am able to open them and read them into a string. My problem comes after that: I just want to extract the text between certain tags. For example, say I have a line in my HTML page like <h3>Hello have a nice day</h3>, inside an h3 tag.
I just want to get "Hello have a nice day". So far I have tried string functions, but with no success. How can I achieve this?
UPDATE
I got the solution from link
Use Html.fromHtml(), pass the HTML source, and it will return only the text.
check http://developer.android.com/reference/android/text/Html.html
If you are able to read HTML files, then everything should be easy. If it's a simple HTML page, you can use XPath to parse it and retrieve whatever you want, or you can use a library such as jsoup to parse the HTML.
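As a rough illustration of the XPath suggestion, the sketch below uses only the `javax.xml` classes. It assumes the page is well-formed XML (every tag closed, no HTML-only entities such as `&nbsp;`), which many real HTML pages are not; for those, a tolerant parser such as jsoup is the safer choice. The names `H3Extractor` and `firstH3Text` are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class H3Extractor {

    /** Returns the text of the first h3 element, or "" if the page has none. */
    static String firstH3Text(String xhtml) {
        try {
            // Parse the page as XML; this throws if the markup is not well formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            // Evaluating a node-set expression as a string yields the text
            // of the first matching node in document order.
            return XPathFactory.newInstance().newXPath().evaluate("//h3", doc);
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><h3>Hello have a nice day</h3></body></html>";
        // prints Hello have a nice day
        System.out.println(firstH3Text(page));
    }
}
```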

Displaying xml in textarea without rendering htmlentities

I've been reading about this for a while now and can't find the solution.
This looks like the solution I need:
How to stop html textarea from interpreting html entities into their characters
But when I do this I just get in the textarea. What gives?
This is my first time trying to use jstl. Please help.
Use jQuery's text() method and assign the result to the textarea.
