Retrieving known invalid XML via Selenium WebDriver - java

I have a Selenium test that needs to get some raw XML from a web server. The problem I'm having is that one of the XML documents is known to be invalid because it is missing a root element. I'd like to get the raw source of the invalid XML and tack on my own root element, but every WebDriver flavor I've tried attempts to parse the XML and returns some form of error message. In short, I'm doing this:
WebDriver driver = new FirefoxDriver();
driver.get("http://some_URL_that_returns_xml_data");
String source = driver.getPageSource();
The source string contains the invalid-XML error message the browser rendered, rather than the actual raw source I would see via View Source.
Does anyone know of a trick to get around this?

The standard way of doing this is to skip the browser entirely and make the request yourself with an HTTP client such as Apache HttpClient. In your HTTP request, send the appropriate Accept header, probably application/xml; the response body you get back is then the raw XML rather than a browser rendering of it.
If the XML is invalid, the browser might only render part of the document, so if you want all of the text, you might want to ask for a text content type in the request instead.
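A minimal sketch of that idea using Java 11's built-in java.net.http client (any HTTP client works; the Accept header is an assumption about what the server expects):
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

HttpClient client = HttpClient.newHttpClient();
HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://some_URL_that_returns_xml_data"))
        .header("Accept", "application/xml")
        .build();
// No browser is involved, so nothing tries to parse (and reject) the XML
String rawXml = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
String wrapped = "<root>" + rawXml + "</root>"; // tack on your own root element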

If the only thing wrong with the XML is the absence of a wrapper element, then it is a "well-formed external parsed entity", and you can retrieve it using an entity reference. Create a dummy document like this:
<!DOCTYPE doc [
<!ENTITY e SYSTEM "http://uri.com/realdata.xml">
]>
<doc>&e;</doc>
(where the string after "SYSTEM" is the location of your XML), and pass this dummy document to your XML parser. (But not in the browser, where XML parsers typically ignore external entities).
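A hedged sketch of feeding that dummy document to a standard JAXP parser; note that many parsers now disable external general entities by default (XXE hardening), so the feature may have to be switched back on:
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
// Re-enable external general entities so &e; is actually fetched
dbf.setFeature("http://xml.org/sax/features/external-general-entities", true);
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse("wrapper.xml"); // wrapper.xml: the dummy document above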

Try:
WebElement element = driver.findElement(By.tagName("body"));
String elHtml = element.getAttribute("innerHTML");
OR:
String elHtml = driver.findElement(By.tagName("body")).getText();

Related

parse text from xml

I have the following link:
https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true
I want to parse this XML to get only the text, like:
Provider: HERO - 2.xx
DBvendor=EPA
Text-encoding=UTF-8
How can I parse it?
Well, it's not a text file, it's an XML file. If you open the file in a browser and select View Source you will see the text enclosed in <char> tags.
When it's opened in a browser, these tags and the other markup are interpreted and the output is rendered on the page (that's why it looks like plain text). If you want to implement similar behavior in Java, look into PhantomJS and/or Jsoup examples.
It looks like a text file, but it is an XML file; the browser just displays its text content.
To verify, right-click and look at the page source.
You can use a library like Jsoup to parse the file and get the contents:
https://jsoup.org/cookbook/introduction/parsing-a-document
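For example, a minimal sketch (ignoreContentType tells Jsoup not to reject the non-HTML MIME type; treat the rest as an assumption about what you want out of the document):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true")
        .ignoreContentType(true)
        .get();
// text() drops the <char> tags and returns only the text content
String text = doc.body().text();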

Jsoup not getting full html

I am trying to use Jsoup to parse the HTML from the URL http://www.threadflip.com/shop/search/john%20hardy
Jsoup seems to only get the data from the line
<![CDATA[ window.gon= ..............
Does anyone know why this would be?
Document doc = Jsoup.connect("http://www.threadflip.com/shop/search/john%20hardy").get();
The site you are trying to parse loads most of its contents asynchronously via AJAX calls. Jsoup does not interpret JavaScript and therefore does not act like a browser. It seems that the store is filled by calling their API:
http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30
So you may need to load the API URL directly in order to read the stuff you want. Note that the response is JSON, not HTML, so the Jsoup HTML parser is not of much help here. But there are great JSON libraries available; I use JSON-Simple.
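A rough sketch of that approach, using Jsoup's connection merely as a convenient HTTP client and handing the body to json-simple:
import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.jsoup.Jsoup;

String json = Jsoup.connect("http://www.threadflip.com/api/v3/items?attribution%5Bapp%5D=web&item_collection_id=&q=john+hardy&page=1&page_size=30")
        .ignoreContentType(true)  // the endpoint returns JSON, not HTML
        .execute()
        .body();
// json-simple gives you plain Map/List-style objects to walk
JSONObject root = (JSONObject) new JSONParser().parse(json);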
Alternatively, you may switch to Selenium webdriver, which actually remote controls a real browser. This should have no trouble accessing all items from the page.
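Sketched with Selenium and then handed back to Jsoup (assuming the AJAX calls have finished by the time you read the source):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

WebDriver driver = new FirefoxDriver();
driver.get("http://www.threadflip.com/shop/search/john%20hardy");
// getPageSource() returns the DOM after JavaScript has run
Document doc = Jsoup.parse(driver.getPageSource());
driver.quit();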

HTML parsing using JSoup

I am new to Jsoup. I want to parse HTML, but the problem is with the URL we have to specify in Jsoup.connect(url): I will only receive this URL at runtime, in the response from some other page. Is there any way to pass the received URL into Jsoup.connect? I had read something like:
String html = response.getContentAsString();
Document document = Jsoup.parse(html);
But I am not getting exactly how to use it. I would also love to know if some other way of doing this is better than Jsoup.
The "parse" function accepts an HTML content string, not a URL.
According to the Jsoup javadoc, the following should solve your problem:
Document doc = Jsoup.connect("http://example.com").get();
Pay attention to the fact that the "connect" method returns a Connection object but does not, in fact, connect; hence the additional call to "get" (or "post", depending on the handler type on the server side).
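Putting the two together, a minimal sketch (receivedUrl is a placeholder for whatever URL the earlier response hands you at runtime):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String receivedUrl = "http://example.com"; // placeholder: the URL received at runtime
Document doc = Jsoup.connect(receivedUrl).get();
// or, if you already have the raw markup as a string (html as in the snippet above):
String html = "<p>already-fetched markup</p>";
Document docFromHtml = Jsoup.parse(html);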

Using java to extract a single value from an html page:

I am continuing work on a project that I've been at for some time now, and I have been struggling to pull some data from a website. The website has an iframe that pulls in some data from an unknown source. The data is in the iframe in a tag something like this:
<DIV id="number_forecast"><LABEL id="lblDay">9,000</LABEL></DIV>
There is a BUNCH of other crap above it but this div id / label is totally unique and is not used anywhere else in the code.
Jsoup is probably what you want; it excels at extracting data from an HTML document.
There are many examples available showing how to use the API: http://jsoup.org/cookbook/extracting-data/selector-syntax
The process will be in two steps:
parse the page and find the url of the iframe
parse the content of the iframe and extract the information you need
The code would look like this:
// let's find the iframe
Document document = Jsoup.parse(inputstream, "iso-8859-1", url);
Elements elements = document.select("iframe");
Element iframe = elements.first();
// now load the iframe
URL iframeUrl = new URL(iframe.absUrl("src"));
document = Jsoup.parse(iframeUrl, 15000);
// extract the div
Element div = document.getElementById("number_forecast");
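And then, to read the value itself (using the label id from the question's markup):
String value = document.getElementById("lblDay").text(); // "9,000"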
In the page that contains the iframe, change the src of the iframe to your own URL. That URL will be handled by your own controller, which reads the content, parses it, extracts everything you need, and writes the result to the response. If the references in your iframe are absolute, this should work.

Getting web content - browser does not support frames

I have a snippet of code like this:
webUrl = new URL(url);
reader = new BufferedReader(new InputStreamReader(webUrl.openStream()));
When I try to get the HTML content of some pages, I get a response saying that my browser doesn't support frames, so I do not get the real HTML of the page.
Is there a workaround?
Maybe I can tell the program to identify itself as some browser?
For me it is critical only to get the HTML; then I want to parse it.
EDIT: I cannot get the src of the frame from the HTML in the browser. It is hidden in JS.
The "You don't support frames and we haven't put sensible alternative content here" message will be in the <noframes> element. You need to access the appropriate <frame> element, access its src attribute, resolve the URI in it, and then fetch data from there.
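A sketch of that with Jsoup, under the assumption that the frame's src is actually present in the served HTML (the asker's edit suggests it may be injected by JavaScript, in which case this won't find it):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document outer = Jsoup.connect(url).get(); // url as in the question's snippet
// absUrl() resolves the frame's src against the page's base URI
String frameSrc = outer.select("frame").first().absUrl("src");
Document inner = Jsoup.connect(frameSrc).get();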
You must set a user-agent string in your HTTP request so that the server thinks you support frames. I suggest something like HtmlUnit or Apache HttpClient for this.
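A sketch of the user-agent idea with plain java.net, building on the question's own snippet (the UA string is just an example):
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

URLConnection conn = new URL(url).openConnection();
// Present ourselves as a mainstream browser so the server serves the real page
conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64)");
BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));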
