Find the text region that contains the article content in HTML - Java

I want to extract information from HTML source using Java. The basic requirement is to get the main content area of the page.
For example, given the following HTML source:
<html>
<head>
<title>
Chinese character --中文
</title>
</head>
<body>
<div>
This is some area including Chinese characters, like a menu, that I don't need.
</div>
<div>
This is some area including Chinese characters, like ads, that I don't need.
</div>
<div>
This is the main content, including the content I need. Almost all of it is made up of Chinese characters, like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!
</div>
<div>
This is the footer area, also including Chinese characters, which I don't need.
</div>
</body>
</html>
This HTML source is a simple one; real pages are far more varied and complex. I want to use Java to extract the div (or other element) that contains the main content. The result I want is:
<div>
This is the main content, including the content I need. Almost all of it is made up of Chinese characters, like: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!
</div>
There are tens of thousands of divs with different content, and the div id is unknown or varies from page to page. The divs also differ in structure; for example, some contain p tags. Is there a way to use the occurrence or distribution of Chinese characters to locate the main content?

I can't say I'm that confident I understand the question, but it seems like you want to scrape a certain div in an HTML page via Java?
I had to do this to scrape some data from a legacy system in order to test a new one. Have a look at HtmlUnit (http://htmlunit.sourceforge.net/). It lets you request a page as if you were a browser (so even if you would normally have to fill out a form to reach that page, you can do that), and then scrape different parts of the page in several ways: you can get a collection of all the divs and pick the third one, for instance, or pick the div with the right CSS class, or just use XPath.
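Once you have the text of each div, one crude way to use the distribution of Chinese characters is to count the Han characters in each block and keep the block with the most. The following is only a minimal sketch of that idea, not a tested extraction method: the names HanDensity, countHan and indexOfMainContent are made up for illustration, and it assumes the menu, ad and footer blocks contain much less Chinese text than the article body.

```java
import java.util.List;

/** Hypothetical helper: picks the text block with the most Han (Chinese) characters. */
final class HanDensity {

    /** Counts the code points whose Unicode script is HAN (punctuation is not counted). */
    static long countHan(String text) {
        return text.codePoints()
                .filter(cp -> Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN)
                .count();
    }

    /** Returns the index of the block with the most Han characters, or -1 if no block has any. */
    static int indexOfMainContent(List<String> blocks) {
        int best = -1;
        long bestCount = 0;
        for (int i = 0; i < blocks.size(); i++) {
            long count = countHan(blocks.get(i));
            if (count > bestCount) {
                bestCount = count;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Assumed input: the text of each div, already extracted by an HTML library.
        List<String> divTexts = List.of(
                "menu area I don't need",
                "ads area I don't need",
                "main content: 好好学习,天天向上。 我爱stackoverflow.谢谢你的帮助,非常感谢!",
                "footer area I don't need");
        System.out.println(indexOfMainContent(divTexts));
    }
}
```

On real pages a raw count can be fooled by long menus or comment sections, so a ratio of Han characters to total text (or to link text) may work better; treat that as an assumption to check against your own pages.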

I can't say that I know for certain what you're going for, but a good place to start would probably be Apache's HttpComponents package. It has a lot of tools for making HTTP requests and getting the data back in a string buffer (which I think is what you're after).
Check it out here:
http://hc.apache.org/httpcomponents-client-ga/tutorial/html/fundamentals.html#d5e43
Also, the HttpComponents main page has Chinese translations of most of the tutorials, in case that would be useful to you :D
http://hc.apache.org/

Related

In AEM, how can I take values from a component in the body section and write them to the head section of the HTML?

Say you have a component that is used in the body section of a page. I want to take specific values from the text fields of that component and write them to the <head> of the HTML file.
I have a tile list component, with a title text field and a description rich text field.
<div class="col box-title">${tileitems.properties.title @ context='html'}</div>
<div data-sly-test='${description}' class="col">
<p>${tileitems.properties.rte || tileitems.properties.description @ context='html'}</p>
</div>
How can I take the values authored in the component and write them to the head of the same page?
Do I need a Java model for this? Any examples would be appreciated, thanks.
You have (at least) two options:
Append the needed markup to head from your content component using JavaScript.
Create a Use Object (Java/Sling Model), walk the descendant tree searching for the specific resource type and properties, and expose the data you need. Use this to append the needed markup in the head.
Another option is to extend the page properties instead of using a component in the body. Then you can read them directly in head.html.

Invisible html element on "view page source"

I am trying to read an HTML page in my Java code using the jsoup library. This is the link to the page: http://www.alkhaleej.ae/
The part of the page I am interested in is the horizontal menu bar at the top (which has the news categories). When I right-click on that menu bar and choose inspect element, the HTML elements of interest are visible under the tag <div id="MainMenuCenter">. However, when I run my code, this tag turns out to be empty and all of its children are missing. I also viewed the complete document using "view page source" on the webpage. To my surprise, this element was empty there too (no children), as shown below.
<div id="MainMenuCenter">
</div>
Therefore, I am not able to access the information I need in my code. What is really going on? Did the developers hide the children of this element on purpose? Can you suggest a way to make the children visible to my code? Thank you.
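The children are most likely added by JavaScript after the page loads, which is why they are missing from "view page source" and from what jsoup fetches (jsoup does not execute JavaScript). You can find where the data comes from by looking at the network traffic under
Inspect element -> Network
Check the requests one by one, or use the search tool.
Once you find the request whose response contains the data, you can fetch it yourself by requesting the URL that serves it.
It might look like: http://example.com/serve.php?category=car&page=1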

Selenium findElement(s) issues when searching for an anchor tag

I have a few nav bar items that I am trying to find with driver.findElement(by.id("menu-news-menu-item")) and driver.findElements(by.id("menu-news-menu-item")). It can't find them for some reason. I have verified that the id is correct on the site but it still can't be found. I know there are other ways to get to the info, but it is my understanding that using the id is the best way to go about finding elements. Below I have included an HTML snippet of what I am trying to search for. If I need to provide any more information please let me know.
<div class="navbar-collapse collapse">
<li>
<a id="menu-news-menu-item" href="/novus/news">News</a>
</li>
</div>
From looking at your HTML I see one potential problem. There may be more.
The top-level DIV you posted has the classes navbar-collapse collapse. That suggests to me that the DIV is collapsible and is currently collapsed, which means that all of its children will be hidden. Selenium was designed to let the user interact only with visible elements. This means that if you search for your A tag by ID while it is a child of a collapsed DIV, Selenium may not be able to work with it. What you need to do before you search for the A tag is unhide it. I don't know for sure how to do this, but it probably involves clicking the collapsible DIV.
With this info, try to figure the rest out on your own. You should be able to investigate the page HTML, try some code, and see what happens. If it doesn't work and you get stuck, come back and post some more of the surrounding HTML, the code you tried, and the result (error messages, etc.) and we'll try to help you more.

How would you parse this String to an object?

Note that this question is not about implementation; I am asking for programming tips.
I'm trying to read some HTML code and then create one or more objects from it, so that I can paint it back again with a different format.
For example, imagine this HTML:
<body>
Hello, this is some plain text and I'm going to attach an image.
<img src="someimage.jpg" />
And after the image I keep writing.
And as this is a forum message, you can add a div to quote like the following:
<div class="post-quote"> Some user said something</div>
And that was it!
</body>
As you can see, there are several elements, like <img> and <div>.
My overall goal, is to have everything split up like:
Text
Image
Text
Div(quote class)
Text
In programming terms, it could be a List of contentElements.
With this list, I could paint those elements back onto the screen with custom formatting and positioning.
However, I can't figure out a logical way to divide the HTML string.
Do you have any tips? How would you split this string to achieve what I described above?
Thanks!!
Questions are welcome!
Edit
jsoup is a parser. I'm not looking for a parser. I'm looking for TIPS on how I can keep the order of the parsed elements. Please reread my question!
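You should use an HTML parser such as jsoup.
Example on your HTML:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// html holds the markup from the question
Document doc = Jsoup.parse(html);
System.out.println(doc.select("img").attr("src")); // someimage.jpg
System.out.println(doc.select("div.post-quote").text()); // Some user said something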

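To keep the order the question asks for, one approach is to walk the child nodes of the body in document order instead of selecting by tag. Below is a minimal sketch of that walk. It uses the JDK's built-in XML DOM parser rather than jsoup so that it runs without dependencies, which means it assumes the input is well-formed XML (as the sample in the question happens to be). The names OrderedContent and split are hypothetical, and the string labels stand in for whatever contentElement classes you would define. With jsoup, the same loop could run over Element.childNodes(), checking for TextNode and Element.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: splits markup into an ordered list of labelled parts. */
final class OrderedContent {

    /** Walks the root element's direct children in document order and labels each one. */
    static List<String> split(String xhtml) {
        Element root;
        try {
            root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
        List<String> parts = new ArrayList<>();
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node node = children.item(i);
            if (node.getNodeType() == Node.TEXT_NODE) {
                String text = node.getTextContent().trim();
                if (!text.isEmpty()) {          // skip whitespace between elements
                    parts.add("Text: " + text);
                }
            } else if (node.getNodeType() == Node.ELEMENT_NODE) {
                Element el = (Element) node;
                if (el.getTagName().equals("img")) {
                    parts.add("Image: " + el.getAttribute("src"));
                } else {
                    // Any other element is kept with its tag name and class attribute.
                    parts.add(el.getTagName() + "(" + el.getAttribute("class") + "): "
                            + el.getTextContent().trim());
                }
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String html = "<body>Hello, some text.<img src=\"someimage.jpg\" />More text."
                + "<div class=\"post-quote\"> Some user said something</div>And that was it!</body>";
        split(html).forEach(System.out::println);
    }
}
```

Because the list is built in a single pass over the children, its order is the order of the markup, so painting the elements back is just a loop over the list.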
Java HTML rendering using Cobra

I am currently using Cobra: Java HTML Renderer & Parser to render an HTML page that is dynamically generated based on user choices in a java app.
In my app the user can choose from hundreds of items. The items are displayed as uniquely colored symbols, and the user can select more than one item.
Once a number of items are selected, their written descriptions are dynamically generated, formatted with CSS2 and HTML4 markup, and loaded into the Cobra HTMLPanel for display.
I wish to display the image of the symbol with the written description of an item in the HTMLPanel.
One way to do this would be to save the BufferedImage to a file using ImageIO.write and then include an img tag in the dynamically generated HTML document that is loaded into HTMLPanel. Unfortunately this is unacceptable, as the user may select hundreds of symbols, which would result in hundreds of ImageIO.write calls and a severe drop in my app's performance.
An alternative would be to convert the BufferedImage to Base64 and place the encoding directly into the HTML document, as follows:
<img alt="Embedded Image" src="..." />
Unfortunately HTMLPanel appears to ignore the data URI scheme.
Does anyone know a solution?
Use an embedded servlet container like Jetty. Point the URLs to "http://localhost:somePort/imageId", and then serve those URLs up from memory.
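A minimal sketch of the serve-from-memory idea follows. It swaps Jetty for the JDK's built-in com.sun.net.httpserver.HttpServer so that it needs no extra dependency. The class name InMemoryImageServer and its register and stop methods are hypothetical, and it assumes the HTML panel can load images from http:// URLs on the loopback interface.

```java
import com.sun.net.httpserver.HttpServer;
import java.awt.image.BufferedImage;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import javax.imageio.ImageIO;

/** Hypothetical helper: serves PNG-encoded images from memory over local HTTP. */
final class InMemoryImageServer {
    private final Map<String, byte[]> images = new ConcurrentHashMap<>();
    private final HttpServer server;

    InMemoryImageServer() {
        try {
            // Port 0 asks the OS for any free port; bind to loopback only.
            server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        server.createContext("/", exchange -> {
            // The request path is "/<imageId>"; strip the leading slash.
            String id = exchange.getRequestURI().getPath().substring(1);
            byte[] png = images.get(id);
            if (png == null) {
                exchange.sendResponseHeaders(404, -1);   // -1 means no response body
            } else {
                exchange.getResponseHeaders().set("Content-Type", "image/png");
                exchange.sendResponseHeaders(200, png.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(png);
                }
            }
            exchange.close();
        });
        server.start();
    }

    /** Encodes the image as PNG in memory and returns the URL to put in the img tag. */
    String register(String id, BufferedImage image) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            ImageIO.write(image, "png", buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        images.put(id, buffer.toByteArray());
        return "http://127.0.0.1:" + server.getAddress().getPort() + "/" + id;
    }

    void stop() {
        server.stop(0);
    }
}
```

The URL returned by register would go into the src attribute of the generated img tag. Note that this sketch still encodes each image once, but to a byte array rather than to disk; if the encoding itself turns out to be the bottleneck, it could be deferred until the image is first requested.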
