How to get crawled content in Crawljax - Java

I have crawled a dynamic web page using Crawljax. I am able to get the current crawl id, status and DOM, but I can't get the website content. Can anyone help me?
CrawljaxConfigurationBuilder builder =
        CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {
    @Override
    public String toString() {
        return "Our example plugin";
    }

    @Override
    public void onNewState(CrawlerContext cc, StateVertex sv) {
        LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
        String name = cc.getCurrentState().getName();
        String url = cc.getBrowser().getCurrentUrl();
        System.out.println(cc.getCurrentState().getDom());
        System.out.println("New State: " + name + "; url: " + url);
    }
});
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
How do I get the content of a dynamic/JavaScript web page?

You can get the website source code with
cc.getBrowser().getStrippedDom() or cc.getCurrentState().getDocument();
but these calls return the page source (including CSS/JavaScript).
Getting the rendered content is not possible this way, because Crawljax is a testing tool: it only checks that text is available and assigns temporary data to fields.

To get the website content, use the following function:
cc.getCurrentState().getDom()
Despite its name, this function does not return a DOM node; it returns the page's HTML text as a String. This is the right function to use if you want the page content; the name getDom is a bit of a misnomer. To get a DOM node instead, use:
cc.getCurrentState().getDocument()
which returns the Document DOM node.
You can retrieve the page content with:
cc.getCurrentState().getDocument().getTextContent()
(EDIT: This won't work -- getTextContent always returns null when called on Documents.)
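A possible workaround (just a sketch, not part of the original answer): call getTextContent() on the document element rather than on the Document node itself. Note that this concatenates all text nodes, including the contents of script and style tags, so some post-filtering may be needed. Inside the onNewState method from the question:
try {
    org.w3c.dom.Document w3cDoc = cc.getCurrentState().getDocument();
    // getTextContent() returns null on the Document node, but works on the root element.
    String pageText = w3cDoc.getDocumentElement().getTextContent();
    System.out.println(pageText);
} catch (Exception e) {
    // getDocument() parses the stored DOM string and may fail on malformed markup.
    e.printStackTrace();
}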

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text (it doesn't matter what the text is); it just needs to be non-empty, i.e. not an image, etc.
Example of links I want:
<a href="...">Link to Some Page</a>
since it contains the text "Link to Some Page".
Links I don't want:
<a href="..."><img src="someimage.jpg"/></a>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: the function text() gets you clean text, so if there are any HTML code fragments inside it, it won't return them.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with it
}
I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
    String link = page.attr("abs:href");
    // I do stuff with the link
}
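For a self-contained sketch combining the fetch with the text filter from the answers above (the URL is just a placeholder):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class TextualLinks {
    public static void main(String[] args) throws IOException {
        // Placeholder URL; replace with the page you actually want to scan.
        Document document = Jsoup.connect("https://example.com/").get();
        Elements linksOnPage = document.select("a[href]");
        for (Element page : linksOnPage) {
            // Skip anchors whose rendered text is empty (e.g. image-only links).
            if (page.text().trim().isEmpty())
                continue;
            String link = page.attr("abs:href");
            System.out.println(page.text() + " -> " + link);
        }
    }
}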

Jsoup Scraping HTML dynamic content

I'm new to Jsoup and I have been trying to write a small piece of code that gets the names of the items in a Steam inventory using Jsoup.
public Element getItem(String user) throws IOException {
    Document doc = Jsoup.connect("http://steamcommunity.com/id/" + user + "/inventory").get();
    Element element = doc.getElementsByClass("hover_item_name").first();
    return element;
}
This method returns:
<h1 class="hover_item_name" id="iteminfo0_item_name"></h1>
and I want the information between the h1 tags, which is generated when you click on a specific window.
Thank you in advance.
You can use the .select(String cssQuery) method:
doc.select("h1") gives you all h1 Elements.
If you need the actual text in these tags, use .text() on each Element.
If you need an attribute like class or id, use .attr(String attributeKey) on an Element, e.g.:
doc.getElementsByClass("hover_item_name").first().attr("id")
gives you "iteminfo0_item_name"
But if you need to perform clicks on a website, you can't do that with Jsoup, because Jsoup is an HTML parser and not a browser replacement. Jsoup can't handle dynamic content.
What you could do instead is first scrape the relevant data from your h1 tags and then send a new .post() request, i.e. reproduce the AJAX call yourself.
If you would rather use a real webdriver, have a look at Selenium.
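A minimal Selenium sketch (not part of the original answer), assuming chromedriver is installed; the .itemHolder selector is an assumption about the inventory page's markup, and in practice an explicit wait may be needed before the name is filled in:
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;

public static String getItemWithSelenium(String user) {
    WebDriver driver = new ChromeDriver(); // requires chromedriver on the PATH
    try {
        driver.get("http://steamcommunity.com/id/" + user + "/inventory");
        // Assumed selector for an inventory cell; clicking it makes the page's
        // JavaScript fill in the hover_item_name heading.
        driver.findElement(By.cssSelector(".itemHolder")).click();
        return driver.findElement(By.className("hover_item_name")).getText();
    } finally {
        driver.quit();
    }
}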
Use .text() and return a String, i.e.:
public String getItem(String user) throws IOException {
    Document doc = Jsoup.connect("http://steamcommunity.com/id/" + user + "/inventory").get();
    Element element = doc.getElementsByClass("hover_item_name").first();
    String text = element.text();
    return text;
}

Is it possible to display text on a rich text item in a DocumentContext?

I have a form that runs a Java agent on the WebQueryOpen event. This agent pulls data from a DB2 database and puts it into computed text fields I have placed on the form, which are displayed whenever I open the form in the browser. This is working for me. However, when I try to use RichTextFields I get a ClassCastException error. No document is actually saved; I just open the form in the browser using this Domino URL - https://company.com/database.nsf/sampleform?OpenForm
Sample code for a simple text field - displayed without problems:
Document sampledoc = agentContext.getDocumentContext();
String samplestr = "sample data from db2";
sampledoc.replaceItemValue("sampletextfield", samplestr);
When I tried using a rich text field:
Document sampledoc = agentContext.getDocumentContext();
String samplestr = "sample data from db2";
RichTextItem rtsample = (RichTextItem) sampledoc.getFirstItem("samplerichtextfield");
rtsample.appendText(samplestr); // ClassCastException error
Basically, I wanted to use a rich text field so that it could accommodate more characters in case I pull very long string data.
Screenshot of the field (as you can see, it's a RichText field)
The problem is that you're trying to access a regular Item as a RichTextItem.
RichTextItems are special fields that are created with their own method, like this:
RichTextItem rtsample = (RichTextItem) sampledoc.createRichTextItem("samplerichtextfield");
This is different from regular Items, which can be created with a simple sampledoc.replaceItemValue(...).
So, if you want to check whether an item is a RichTextItem, and create it if it does not exist, you can do this:
RichTextItem rti = null;
Item item = doc.getFirstItem("somefield");
if (item != null) {
    if (item instanceof RichTextItem) {
        // Yay!
        rti = (RichTextItem) item;
    } else {
        // :-(
    }
} else {
    rti = doc.createRichTextItem("somefield");
    // etc.
}
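Applied to the field from the question, a usage sketch could look like this (assuming it runs inside the agent's NotesMain method where NotesException is already handled, and using the agentContext and field name from the question):
Document sampledoc = agentContext.getDocumentContext();
String samplestr = "sample data from db2";
Item item = sampledoc.getFirstItem("samplerichtextfield");
RichTextItem rtsample = null;
if (item == null) {
    // The item does not exist yet on the in-memory DocumentContext, so create it.
    rtsample = sampledoc.createRichTextItem("samplerichtextfield");
} else if (item instanceof RichTextItem) {
    rtsample = (RichTextItem) item;
}
if (rtsample != null) {
    rtsample.appendText(samplestr);
}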

Trying to find specific links while web crawling

I am modifying the code given in crawler4j. I want to find specific links while crawling a web site. For example, I am crawling www.cmu.edu and I am trying to get the link for the directory search. Here is my code for it:
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    // System.out.println("URL: " + url);
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        String html = htmlParseData.getHtml();
        System.out.println(html.matches(".*<a href.*."));
        if (html.matches(".*.<a href=.*.>Directory Search</a>.*."))
            System.out.println("***********Hello*********************");
        // System.out.println("----------"+html);
        return;
        // List<WebURL> links = htmlParseData.getOutgoingUrls();
    }
}
This code does not work. I am not getting the ***********Hello********************* on my console. Just to check, I printed the html string to the console, copied the anchor tag that contains the directory search, and wrote this simple two-line test:
String test2 = "<li class=\"first\"><a href=\"...\">Directory Search</a></li>";
System.out.println("*******" + test2.matches(".*.<a href=.*.>Directory Search</a>.*."));
This works. The value of String test2 was copied from the console. What am I doing wrong in the first part of the code?
Try this (you have to use (?s) so that . also matches newline characters):
String test2 = "qwert\n\n<li class=\"first\"><a href=\"...\">Directory Search</a></li>";
System.out.println("*******" + test2.matches("(?s).*.<a href=.*.>Directory Search</a>.*."));

Get title and description dynamically by using a URL

I need to get the title and description of a URL dynamically. What do I need to use in order to do this?
Take for example the following URL: http://en.wikipedia.org/wiki/Stack_overflow
I need to extract the title of the URL and its description. Would you prefer Jsoup extraction, as below?
url.select("title");
If yes, how do I extract the description of the URL?
I think you need an HTML parser like Jericho.
Take a look at this example:
http://jericho.htmlparser.net/samples/console/src/ExtractText.java
especially these two methods:
private static String getTitle(Source source) {
    Element titleElement = source.getFirstElement(HTMLElementName.TITLE);
    if (titleElement == null) return null;
    // TITLE element never contains other tags so just decode it collapsing whitespace:
    return CharacterReference.decodeCollapseWhiteSpace(titleElement.getContent());
}

private static String getMetaValue(Source source, String key) {
    for (int pos = 0; pos < source.length();) {
        StartTag startTag = source.getNextStartTag(pos, "name", key, false);
        if (startTag == null) return null;
        if (startTag.getName() == HTMLElementName.META)
            return startTag.getAttributeValue("content"); // Attribute values are automatically decoded
        pos = startTag.getEnd();
    }
    return null;
}
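If you would rather stick with Jsoup, as suggested in the question, a minimal sketch for the same extraction could look like this (the URL is the Wikipedia example from the question):
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static void printTitleAndDescription() throws IOException {
    Document doc = Jsoup.connect("http://en.wikipedia.org/wiki/Stack_overflow").get();
    String title = doc.title(); // text of the <title> element
    // The meta description may be absent, in which case first() returns null.
    Element descTag = doc.select("meta[name=description]").first();
    String description = (descTag != null) ? descTag.attr("content") : "";
    System.out.println(title);
    System.out.println(description);
}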
