How would you parse this String to an object? - java

Note that this question is not about implementation, but about programming tips.
I'm trying to read some HTML code and then create one or several objects so that I can render it again with a different format.
For example, imagine this HTML:
<body>
Hello, this is some plain and I'm going to attach an image.
<img src="someimage.jpg" />
And after the image I keep writting.
And as this is a forum message, you can add a div to quote like the following:
<div class="post-quote"> Some user said something</div>
And that was it!
</body>
As you can see, there are several elements, like <img> and <div>.
My overall goal is to have everything split up like:
Text
Image
Text
Div(quote class)
Text
In programming terms, this could be a List of contentElements.
With this list, I could paint those elements back onto the screen with custom formatting and positioning.
However, I can't work out how to divide the HTML String in a logical way.
Do you guys have any tips? How would you split this String to achieve the goal described above?
Thanks!!
Questions are welcome!
Edit
JSOUP is a parser. I'm not looking for a parser. I'm looking for TIPS about how I can keep the order of the parsed elements. Reread my question, please!

You should use an HTML parser such as jsoup.
Example on your HTML:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
Document doc = Jsoup.parse(html);
System.out.println(doc.select("img").attr("src")); // someimage.jpg
System.out.println(doc.select("div.post-quote").text()); // Some user said something
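To keep the pieces in document order, which is what the question's edit asks about, one option is to walk the body from start to end instead of querying by selector. With jsoup that would mean iterating the child nodes of doc.body() rather than calling select. The sketch below shows the same idea with only the standard library so it runs without dependencies. It is a regex-based tokenizer that only copes with flat markup like the example (no nesting, no comments), not a general HTML parser. The names OrderedSplit, ContentElement, Kind and split are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OrderedSplit {
    enum Kind { TEXT, IMAGE, QUOTE }

    // Hypothetical value type: one piece of the message, kept in document order.
    static final class ContentElement {
        final Kind kind;
        final String value;
        ContentElement(Kind kind, String value) { this.kind = kind; this.value = value; }
    }

    // Matches an <img> tag (group 1 = src) or a post-quote div (group 2 = inner text).
    private static final Pattern SPECIAL = Pattern.compile(
        "<img\\s+src=\"([^\"]*)\"\\s*/?>"
        + "|<div\\s+class=\"post-quote\">(.*?)</div>",
        Pattern.DOTALL);

    static List<ContentElement> split(String bodyHtml) {
        List<ContentElement> out = new ArrayList<>();
        Matcher m = SPECIAL.matcher(bodyHtml);
        int last = 0;
        while (m.find()) {
            // Everything between the previous match and this one is plain text.
            addText(out, bodyHtml.substring(last, m.start()));
            if (m.group(1) != null) {
                out.add(new ContentElement(Kind.IMAGE, m.group(1)));
            } else {
                out.add(new ContentElement(Kind.QUOTE, m.group(2).trim()));
            }
            last = m.end();
        }
        addText(out, bodyHtml.substring(last));
        return out;
    }

    // Adds a TEXT element unless the chunk is only whitespace.
    private static void addText(List<ContentElement> out, String raw) {
        String text = raw.trim();
        if (!text.isEmpty()) {
            out.add(new ContentElement(Kind.TEXT, text));
        }
    }

    public static void main(String[] args) {
        String html = "Hello <img src=\"someimage.jpg\" /> more text "
            + "<div class=\"post-quote\"> Some user said something</div> And that was it!";
        for (ContentElement e : split(html)) {
            System.out.println(e.kind + ": " + e.value);
        }
    }
}
```

Because the list is built in a single left-to-right pass, its order is the order of the source, so it can be rendered back element by element.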

Related

java Selenium Chrome - keyword searcher

I'm almost done creating a Supreme bot. Now I need a keyword searcher. It should search for a keyword on the page and then click on it.
For example:
Illegal Business Hooded Sweatshirt Red
The bot should search for the keyword but also for the color. I uploaded a screenshot (from the Supreme page) and need your help.
Screenshot from the source code (Supreme):
The code I tried:
driver.findElement(By.xpath("//h1[text()='Illegal Business Hooded Sweatshirt']/p[text()='Red']")).click();
Since there are encoded symbols in between the text, I believe you can't find those items directly with XPath.
I suggest you find all <article> tags, then for every article tag search for the <h1> inside, retrieve its text (filtering out those weird symbols), and compare the text you want with the text the article tag actually has.
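The "retrieve the text, filter, compare" step suggested above could be sketched as follows. This assumes the odd symbols are invisible Unicode format characters such as U+FEFF or U+200B, which the thread does not confirm. With Selenium the strings would come from getText() on the h1 and p elements of each article; plain strings are used here so the sketch runs without a browser. KeywordMatch, Item and findItem are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

class KeywordMatch {
    // Drops Unicode format characters (general category Cf) and collapses
    // runs of whitespace. Which symbols the real page uses is an assumption.
    static String normalize(String raw) {
        return raw.replaceAll("\\p{Cf}", "")
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    // Hypothetical holder for the two texts read from one <article>.
    static final class Item {
        final String name;
        final String color;
        Item(String name, String color) { this.name = name; this.color = color; }
    }

    // Returns the index of the first item whose cleaned name and color match, or -1.
    static int findItem(List<Item> items, String name, String color) {
        for (int i = 0; i < items.size(); i++) {
            Item it = items.get(i);
            if (normalize(it.name).equalsIgnoreCase(name)
                    && normalize(it.color).equalsIgnoreCase(color)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<Item> items = Arrays.asList(
            new Item("Illegal Business Hooded\uFEFF Sweatshirt", "Black"),
            new Item("Illegal Business Hooded\uFEFF Sweatshirt", "Red"));
        System.out.println(findItem(items, "Illegal Business Hooded Sweatshirt", "Red"));
    }
}
```

The returned index identifies which article to click, so the click itself can stay a separate Selenium call.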
The p tag is not inside the h1 tag, and "Red" is inside an anchor tag (a).
So you can use this XPath: //h1/a[text()='Illegal Business Hooded Sweatshirt']/ancestor::div/p/a[text()='Red']

Print html portion into pdf using Java

community!
My project is simple: I have a link to a website that has information on different chemical substances, and I want to extract some data and put it into a PDF. The thing is that I want to keep the formatting of the original HTML (using its CSS, of course).
Example of substance: http://www.molbase.com/en/msds_1659-31-0-moldata-2.html#tabs
I used jsoup to read the HTML of the table at the bottom of the page (the MSDS one), which contains multiple sections with different information about the substance, but I really don't know how to save the exact HTML format into my PDF file. I have tried iText too, but it gives me a "missing ending tag" error, and even if it worked, it would print the full page, not only that MSDS table.
Here is what I have tried, but it isn't effective:
Document docu = Jsoup.connect(urlbun).get();
Element tableHeader = docu.select("div[class=\"msds\"]")
.first();
String[] finSyn = tableHeader.text().split(" ");
String moreText =" ";
I tried to split the text that the webpage has under that div (class="msds"), but I cannot find a good way to split it.
Could you please give me a hint on what to do? Even if the formatting is not exactly the same, I would like to be able to display the information in the same way, with indentation and such.
Thank you!
You can put the content that you want to convert to PDF inside an element with a CSS ID (such as a DIV) and then use the PDFmyURL API to convert only that section to PDF.
Please refer to this on our website about how to select pieces from a page to convert to PDF
Disclosure: I work for the company that owns this site

Find text region which include article content in HTML

I want to extract information from HTML source using Java. The basic need is to get the main content area of the HTML.
For example, take the following HTML source:
<html>
<head>
<title>
chinese character --中文
</title>
</head>
<body>
<div>
this is some area including Chinese characters, like a menu I don't need,
</div>
<div>
this is some area including Chinese characters, like ads I don't need,
</div>
<div>
this is main content, include the content I need. almost every content is filled by many Chinese characters. Like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!
</div>
<div>
this is foot area, also including Chinese characters, but I don't need.
</div>
</body>
</html>
This HTML source is a simple one; there are many different and more complex sources. I want to extract the div (or other element) that contains the main content, using Java. The result I want is:
<div>
This is main content, include the content I need. almost every content is filled by many Chinese character like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!
</div>
There are tens of thousands of divs with different content in them, and the div id is unknown or different each time. The divs also vary in structure, for example some contain p tags. Is there a way to use the presence or distribution of Chinese characters to pick out the content?
I can't say I'm that confident I understand the question, but it seems like you want to scrape a certain div in an HTML page via Java?
I had to do this to scrape some data from a legacy system to test a new one - have a look at http://htmlunit.sourceforge.net/ . Basically it allows you to hit the page you want as if it were in a browser (so even if you would normally have to fill out a form to get to that page you can do it), then scrape the contents of different parts of the page in a bunch of different ways - you can get a collection of all the divs, and pick the third one, for instance, or pick the div with the right CSS class, or just use XPath.
I can't say that I know for certain what you're going for, but one good place to start would probably be Apache's HTTPComponents package. There are a lot of tools there for making HTTP requests and getting the data back in a string buffer (which is what I think you're going for).
Check it out here:
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html#d5e43
Also, on the HTTPComponents main page, there are Chinese translations of most of the tutorials--you know, if that's something that would be useful to you :D
http://hc.apache.org/
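On the question's idea of using the distribution of Chinese characters: once the text of each candidate block has been extracted (with any HTML library), a simple heuristic is to count the Han-script characters per block and pick the block with the most. The sketch below shows only that scoring step. MainContentGuess, countHan and pickMainBlock are hypothetical names, and real pages with Chinese menus or ads would likely need a stronger signal such as text length or link density.

```java
import java.util.Arrays;
import java.util.List;

class MainContentGuess {
    // Counts the code points that belong to the Han script.
    static int countHan(String text) {
        return (int) text.codePoints()
            .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
            .count();
    }

    // Returns the index of the block with the most Han characters,
    // or -1 if no block contains any. Ties go to the earliest block.
    static int pickMainBlock(List<String> blockTexts) {
        int best = -1;
        int bestCount = 0;
        for (int i = 0; i < blockTexts.size(); i++) {
            int count = countHan(blockTexts.get(i));
            if (count > bestCount) {
                bestCount = count;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file ASCII-only.
        List<String> blocks = Arrays.asList(
            "menu area",
            "ads area",
            "main content \u597D\u597D\u5B66\u4E60 \u5929\u5929\u5411\u4E0A",
            "foot area");
        System.out.println("main block index: " + pickMainBlock(blocks));
    }
}
```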

Displaying xml in textarea without rendering htmlentities

I've been reading about this for a while now and can't find the solution.
This looks like the solution I need:
How to stop html textarea from interpreting html entities into their characters
But when I do this I just get in the textarea. What gives?
This is my first time trying to use jstl. Please help.
Use jQuery's text() method and assign the result to the textarea.
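A server-side alternative, sketched here under the assumption that the XML is available as a Java String before the page is rendered, is to escape the markup characters so the browser shows them literally inside the textarea. In JSTL, c:out escapes XML by default, which may already be enough. The helper below does the same by hand; TextareaEscape and escapeForTextarea are hypothetical names.

```java
class TextareaEscape {
    // Escapes the characters that HTML would otherwise interpret, so that
    // an entity such as &amp; is displayed as typed instead of as "&".
    static String escapeForTextarea(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeForTextarea("<note>Tom &amp; Jerry</note>"));
    }
}
```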

How to display part of an HTML document in Java

I have an application where I need to show one specific section of an HTML document within a Swing JPanel. The section to be shown depends on what the user is doing at any given time.
I know that JEditorPane can display simple HTML, and in fact in terms of HTML support this is more than enough for my needs. However I don't think I can use this to display only part of the original HTML file.
I thought of putting each section within a div, then hiding all divs with CSS (display: none), and showing only the target section by setting display: block on the section I wanted to show. Unfortunately JEditorPane has limited CSS support, and this does not seem to include the "display" property.
Before I go and implement something more elaborate, is there any simple way to achieve this goal?
Thanks.
You may try Cobra:
http://lobobrowser.org/cobra.jsp
Override the ViewFactory and replace DIV views. If they should be hidden let them return 0 from getXXXSpan methods.
See for example the section folding related code http://java-sl.com/collapse_area.html
I didn't find a way to do what I wanted relying on the CSS support from the JEditorPane. What I ended up doing is manually parsing the HTML document and splitting it in "fragments" (top-level DIVs representing sections), then displaying each section as required via JEditorPane.setText.
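The thread does not include the splitting code, so the following is only a rough sketch of how the manual approach described above might look: it scans for div tags, tracks nesting depth, and returns each top-level div as one fragment. It assumes reasonably well-formed markup with no div-like text inside comments or scripts. DivFragments and topLevelDivs are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DivFragments {
    // Matches opening and closing div tags; group 1 is "/" for a closing tag.
    private static final Pattern DIV_TAG =
        Pattern.compile("<(/?)div\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns every top-level <div>...</div>, nested divs included, in order.
    static List<String> topLevelDivs(String html) {
        List<String> fragments = new ArrayList<>();
        Matcher m = DIV_TAG.matcher(html);
        int depth = 0;
        int start = -1;
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                if (depth == 0) {
                    start = m.start();   // a new top-level section begins
                }
                depth++;
            } else if (depth > 0) {
                depth--;
                if (depth == 0) {
                    fragments.add(html.substring(start, m.end()));
                }
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        String html = "<div id='a'>one<div>inner</div></div><p>x</p><div id='b'>two</div>";
        for (String fragment : topLevelDivs(html)) {
            System.out.println(fragment);
        }
    }
}
```

Each fragment could then be handed to JEditorPane.setText on a pane whose content type is text/html, switching fragments as the user moves between sections.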
