How to scrape or parse iframe content to get specific values - Java

I get an iframe link, http://abc.com?=blahblahiframelink, from a third-party REST service. I want to extract multiple values from the content of that iframe.
Here is the simplified HTML. Please understand that the real HTML is far more complex, with multiple nested divs and tables:
<!-- css stuff -->
<html>
<div>
<p> NEED THIS INFO </p>
....
blah blah
<img src="NEED THIS INFO">
</div>
</html>
I marked "NEED THIS INFO" in the code above as what I want to extract, to show that I want attribute values as well as element text.
I am thinking of first storing that iframe content in a Java String in my REST service, then using a crazy regex to pull out the information I want.
Before I attempt that, I want to check whether there is a more efficient way to do this. Is there an HTML parser I can use to get the content in a structured format?
If not, please tell me how to store the iframe content in a Java String.
Please let me know if you need more info.

There are a couple of ways to do this, for those coming here. The most efficient is to write the iframe content to a string like the following, using HttpURLConnection or HttpsURLConnection (conn is the connection). An iframe's content can be fetched directly from its src link.
BufferedReader br = new BufferedReader(new InputStreamReader(conn.getInputStream()));
String line;
String html = "";
while ((line = br.readLine()) != null) {
    html = html + line + "\n";
}
br.close();
The most efficient approach is, of course, to limit the number of middlemen (such as Mechanize) and the number of URL calls, etc.
It is possible to do this with Java's powerful java.net or java.nio packages, just by creating an HttpURLConnection (or an HttpsURLConnection from javax.net.ssl for HTTPS) to get your page, the cookies, etc. From there the answer unfolds.
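As a sketch of that setup: the helper below opens the connection and does the read-into-a-string loop in one place. The class and method names are illustrative, and for http:// URLs the connection returned is in fact an HttpURLConnection, so headers and cookies can be set on it before reading.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class PageFetcher {
    // Fetches the content behind a URL (an iframe's src, for example)
    // into a single string, one line at a time.
    public static String fetch(String urlString) throws IOException {
        URLConnection conn = new URL(urlString).openConnection();
        StringBuilder html = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
        return html.toString();
    }
}
```

For an HTTPS iframe link, cast the connection to HttpURLConnection first if you need to add request headers such as cookies.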
To parse the page in Java, you have several options, with A and B being the better ones I know:
A. Create an XML document and run an XPath query against it. I am time-limited, so I've posted a resource for you. All you need is a string and you can do this. This fits your needs if you are not looking for something too specific: once you get the page, just pull out everything you need.
http://www.mkyong.com/tutorials/java-xml-tutorials/
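For option A, here is a minimal sketch using only the JDK's built-in DOM and XPath APIs. The sample string is the simplified HTML from the question; note this only works when the page happens to be well-formed XML, which real-world HTML often is not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtractor {
    // Parses the page string as XML and evaluates one XPath expression,
    // returning the result as a string.
    public static String evalXPath(String page, String expr) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));
        return (String) XPathFactory.newInstance().newXPath()
                .evaluate(expr, doc, XPathConstants.STRING);
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><div><p>NEED THIS INFO</p>"
                + "<img src=\"NEED THIS INFO\"/></div></html>";
        System.out.println(evalXPath(page, "//p/text()"));  // element text
        System.out.println(evalXPath(page, "//img/@src"));  // attribute value
    }
}
```

Both calls print NEED THIS INFO, demonstrating element text and attribute extraction from the same document.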
B. Regex. Look online for a good solution (I am limited to two links here). Also, MyRegexTester is a great free resource for learning and testing regex, which is less daunting than you think, especially in Java. Use those wildcards and lookaheads.
C. Better yet, use a parser like Jsoup if you are not resource-constrained, though that appears not to be the case here. Jsoup does the HTML parsing for you (and can be switched to its XML parser if you need XML output) and lets you pull out results with CSS selectors.
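A sketch of option C, assuming the Jsoup library is on the classpath. In real code you would load the page with Jsoup.connect(iframeUrl).get(); here the simplified sample from the question is parsed from a string so the example is self-contained:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupExtractor {
    public static void main(String[] args) {
        // In real code: Document doc = Jsoup.connect(iframeUrl).get();
        String page = "<html><div><p> NEED THIS INFO </p>"
                + "<img src=\"NEED THIS INFO\"></div></html>";
        Document doc = Jsoup.parse(page);

        // CSS selectors rather than XPath:
        System.out.println(doc.select("div > p").text());   // element text
        System.out.println(doc.select("img").attr("src"));  // attribute value
    }
}
```

Unlike the DOM/XPath route, Jsoup tolerates the tag soup found on real pages, which is why it fits this use case well.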
D. Use HttpUnit, or a GUI-less browser like Mechanize in Python (http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet/), Perl, or Ruby. My favorite is Python, since there are more ready-made modules and the speeds are about the same. Python also has Jsoup-style parsing libraries.

Related

Parsing html text to obtain input fields

So I currently have a big blob of html text, and I want to generate an input form based on what is contained in that text. For example, if the text contains '[%Name%]', I want to be able to read that in and recognize 'Name' is there, and so in turn enable a form field for name. There will be multiple tags ([%age%], [%height%], etc.)
I was thinking about using regex, but after doing some research it seems that regex is a horrible idea for parsing HTML. I came across parsing HTML pages with Groovy, but it is not strictly applicable to my implementation. I am storing the HTML-formatted text (which I am creating using CKEditor) in a database.
Is there an efficient way to do this in Java/Groovy? Or should I just create an algorithm similar to the examples shown here? (I'm not too sure how effective the given algorithms would be, as they seem to be built around relatively small strings, whereas my string to parse may end up being quite large, a 15-20 page document.)
Thanks in advance
Instead of reinventing the wheel, I think it's better to use Jsoup. It is an excellent tool for your task and makes it easy to obtain anything in an HTML page using its selector syntax. Check out the examples of usage in their cookbook.
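One note on the [%Name%] placeholders specifically: since they are plain text tokens rather than HTML structure, a small regex is actually safe for that part, even if a parser like Jsoup handles the surrounding markup. A minimal sketch (the class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderScanner {
    // Matches tokens of the form [%fieldName%] and captures the name.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[%(\\w+)%\\]");

    // Returns the field names found in the text, e.g. "Name", "age", "height".
    public static List<String> findFields(String text) {
        List<String> fields = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            fields.add(m.group(1));
        }
        return fields;
    }
}
```

Calling findFields("<p>Hello [%Name%], age [%age%]</p>") yields [Name, age], which you can then map to form fields. Matcher streams through the input, so a 15-20 page string is not a problem.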

Parse javascript generated content using Java

http://support.xbox.com/en-us/contact-us uses JavaScript to create some lists. I want to be able to parse these lists for their text. So for the above page I want to return the following:
Billing and Subscriptions
Xbox 360
Xbox LIVE
Kinect
Apps
Games
I tried using Jsoup for a while before noticing the lists were generated using JavaScript. I have no idea how to go about parsing a page for its JavaScript-generated content.
Where do I begin?
You'll want to use an HTML+JavaScript library like Cobra. It'll parse the DOM elements in the HTML as well as apply any DOM changes caused by JavaScript.
You could always import the whole page, split it into separate strings (on newlines, etc.), look for the string containing the information, and then pull the pieces you want out of that string. That is the dirty way of doing it; not sure if there is a clean way.
I don't think that text is generated by JavaScript... If I disable JavaScript, those options can still be found inside the HTML at this location (a jQuery selector, just because it was easier to hand-write than figuring out the XPath without JavaScript enabled :)):
'div#ShellNavigationBar ul.NavigationElements li ul li a'
Regardless, in direct answer to your query: you'd have to evaluate the JavaScript within the scope of the document, which I expect would be rather complex in Java. You'd have more luck identifying the JavaScript file that generates the relevant content and just parsing that directly.
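If the text really is in the served HTML, Jsoup can apply essentially that same selector. The sketch below runs against an inline snippet shaped like that markup, since the live page's structure may have changed; in real code you would use Jsoup.connect(url).get() instead of Jsoup.parse(page):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class NavScraper {
    // Applies the selector from the answer above to an HTML string
    // and collects the anchor texts it matches.
    public static List<String> navTexts(String html) {
        List<String> texts = new ArrayList<>();
        for (Element a : Jsoup.parse(html)
                .select("div#ShellNavigationBar ul.NavigationElements li ul li a")) {
            texts.add(a.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Inline snippet mimicking the assumed page structure:
        String page = "<div id=\"ShellNavigationBar\"><ul class=\"NavigationElements\">"
                + "<li><ul><li><a>Billing and Subscriptions</a></li>"
                + "<li><a>Xbox 360</a></li></ul></li></ul></div>";
        System.out.println(navTexts(page)); // [Billing and Subscriptions, Xbox 360]
    }
}
```

Jsoup's selector syntax is close enough to jQuery's that selectors prototyped in a browser console usually carry over unchanged.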

Is HTML parsing (in Java/Android) then extracting data from it, an effective way of getting a webpage's content?

So, I'm using HTTP POST requests in Android Java to log into a website, before extracting the entire HTML code. After that, I use Pattern/Matcher (regex) to find all the elements I need, extract them from the HTML data, and delete everything unnecessary. For instance, when I extract this:
String extractions = "<td>Good day sir</td>";
Then I use:
extractions = extractions.replaceAll("<td>", "").replaceAll("</td>", ""); // Strings are immutable, so reassign the result
I do this multiple times until I have all the data needed from that site, before I display it in some kind of list.
I'm not particularly stuck on anything, but please, can you tell me if this is an effective/efficient/fast way of getting data from a page and processing it, or are there ways to do this faster? Because sometimes it's like my program takes a lot of time to get certain data (although mostly that's when I'm on 3G on my phone).
Like others have said, regex is not the best tool for this job. But in this case, the particular way you use regex is even more inefficient than it would normally be.
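For comparison, here is the same extraction done with a parser (a sketch assuming Jsoup): one parse, then select the cells, instead of repeated replaceAll passes that each copy the whole string.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

public class CellExtractor {
    // One parse of the document, then CSS selection of the <td> cells --
    // no repeated full-string copies.
    public static List<String> cellTexts(String html) {
        List<String> cells = new ArrayList<>();
        for (Element td : Jsoup.parse(html).select("td")) {
            cells.add(td.text());
        }
        return cells;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Good day sir</td><td>Another cell</td></tr></table>";
        System.out.println(cellTexts(html)); // [Good day sir, Another cell]
    }
}
```

On Android, Jsoup is a single small jar, so it is a realistic option there too.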
In any case, let me offer one more possible solution (depending on your use case).
It's called YQL (Yahoo Query Language).
http://developer.yahoo.com/yql/
Here is a console for it so you can play around with it.
http://developer.yahoo.com/yql/console/
YQL is the lazy developer's way to build your own api on the fly. The main inconvenience is that you have to use Yahoo as a go-between, but if you're ok with that, then I'd suggest you go that route. Using YQL is probably the quickest way to get that kind of work done (especially if the html you're targeting keeps on changing and if its html tags are not always valid).
Using regex to parse a website is always a bad idea:
How to use regular expressions to parse HTML in Java?
Using regular expressions to parse HTML: why not?
Have a look at the Apache Tika library for extracting text from HTML; many other parsers are also available, for PDF and other formats: http://tika.apache.org/

How to access an Html Form using a normal Java Application?

How can I access an HTML form and its components, say the Wikipedia search pane, using a normal Java application, and enter some keywords? How would one usually handle this task?
I already figured out that a combination of URL, URLConnection, and BufferedReader, called "chaining", allows me to read in a page, like this:
URL oracle = new URL("http://de.selfhtml.org/index.htm");
URLConnection yc = oracle.openConnection();
BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
But this certainly does not allow me to write to this HTML page.
Although I know about its structure now, and could address its components.
And I need to address the search pane component, as it is sitting on one of Wikipedia's servers.
Having, I don't know, an HtmlComponentOutputStream would be nice.
In such a way that, the only things left to do would be calling:
HtmlComponentOutputStream.setText( "Penguin" );
HtmlComponentOutputStream.sendHtmlMessage( HtmlMessage.ENTER );
Thanks for reading this far; I'm grateful for any advice about how one would usually do this in Java.
It seems that you want to automate browser actions using Java.
If so, you can use Selenium RC (Selenium WebDriver), a Java library for automating HTML pages.
Here is the link:
http://seleniumhq.org/download/
I don't know how you would normally do it, but I'd issue a POST request and set the form fields as key-value pairs, i.e. the key would be the form field name and the value would be the field's value.
If you don't know the form fields, you might just read the form and extract the input fields and the form's target action, then issue the POST request.
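A sketch of that approach with HttpURLConnection. The field names used in real calls (for Wikipedia's search pane, for example) must be taken from the page's actual <form> markup; nothing here is specific to any site.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;

public class FormPoster {
    // Builds an application/x-www-form-urlencoded body from key-value pairs,
    // exactly as a browser would serialize the form's input fields.
    public static String buildFormBody(Map<String, String> fields)
            throws UnsupportedEncodingException {
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (body.length() > 0) body.append('&');
            body.append(URLEncoder.encode(e.getKey(), "UTF-8"))
                .append('=')
                .append(URLEncoder.encode(e.getValue(), "UTF-8"));
        }
        return body.toString();
    }

    // POSTs the fields to the form's action URL.
    public static void post(String action, Map<String, String> fields) throws IOException {
        byte[] body = buildFormBody(fields).getBytes("UTF-8");
        HttpURLConnection conn = (HttpURLConnection) new URL(action).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```

This is the programmatic equivalent of typing "Penguin" into the field and pressing Enter: the browser does nothing more than send this kind of request.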
Try using an HTML parser such as the Cobra project's HTML parser.
They have a specific example of how to submit a form.

retrieve information from a url

I want to make a program that will retrieve some information from a URL.
For example, I give the URL below, from
librarything
How can I retrieve all the words below the "TAGS" tab, like
Black Library, fantasy, Thanquol & Boneripper, Thanquol and Bone Ripper, Warhammer?
I am thinking of using Java and designing a data-mining wrapper, but I am not sure how to start. Can anyone give me some advice?
EDIT:
You gave me excellent help, but I want to ask something else.
For every tag, we can see how many times it has been used when we press the "number" button. How can I retrieve that number as well?
You could use an HTML parser like Jsoup. It allows you to select HTML elements of interest using simple CSS selectors:
E.g.
Document document = Jsoup.connect("http://www.librarything.com/work/9767358/78536487").get();
Elements tags = document.select(".tags .tag a");
for (Element tag : tags) {
System.out.println(tag.text());
}
which prints
Black Library
fantasy
Thanquol & Boneripper
Thanquol and Bone Ripper
Warhammer
Please note that you should read the website's robots.txt, if any, and the website's terms of service, if any, or your server might be IP-banned sooner or later.
I've done this before using PHP with a page scrape, then parsing the HTML as a string using regular expressions.
Example here
I imagine there's something similar in Java and other languages. The concept would be the same:
Load the page data.
Parse the data (e.g. with a regex, or via the DOM model using CSS selectors or XPath selectors).
Do what you want with the data :)
It's worth remembering that some people might not appreciate you data mining their site and profiting from / redistributing it on a large scale.
