How to Get Crawl content in Crawljax


I have crawled a dynamic web page using Crawljax. I am able to get the current crawl state's id, status, and DOM, but I can't get the website content. Can anyone help?
CrawljaxConfigurationBuilder builder =
        CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {
    @Override
    public String toString() {
        return "Our example plugin";
    }

    @Override
    public void onNewState(CrawlerContext cc, StateVertex sv) {
        LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
        String name = cc.getCurrentState().getName();
        String url = cc.getBrowser().getCurrentUrl();
        System.out.println(cc.getCurrentState().getDom());
        System.out.println("New State: " + name + "; url: " + url);
    }
});
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
How can I get the content of a dynamic (JavaScript-rendered) web page?

You can get the website's source code with:
cc.getBrowser().getStrippedDom() or cc.getCurrentState().getDocument()
These calls return the page source (including its CSS and JavaScript references).
Getting only the content is not possible, because Crawljax is a testing tool. It only checks whether text is present and assigns temporary data to form fields.

To get the website content, use the following function:
cc.getCurrentState().getDom()
Despite its name, getDom() does not return a DOM node; it returns the page's HTML as text. The name is a misnomer, but this is the right function to use if you want the page content. To get a DOM node instead, use:
cc.getCurrentState().getDocument()
which returns the Document DOM node.
You might expect to retrieve the page text with:
cc.getCurrentState().getDocument().getTextContent()
(EDIT: This won't work; getTextContent() always returns null when called on a Document node.)
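
To illustrate the EDIT above: in the standard org.w3c.dom API, getTextContent() is defined to return null for Document nodes, while calling it on the root element returns the concatenated text of all descendants. Here is a minimal sketch using only the JDK's XML parser. The class name DomTextSketch and the helper methods are made up for the example, and it assumes you already have an org.w3c.dom.Document (such as the one getDocument() is described as returning above); the tiny XHTML string just stands in for a crawled page.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Crawljax.
final class DomTextSketch {

    // Returns the concatenated text of the root element, or "" if there is none.
    static String visibleText(Document doc) {
        if (doc == null || doc.getDocumentElement() == null) {
            return "";
        }
        // Element.getTextContent() joins the text of all descendant nodes.
        return doc.getDocumentElement().getTextContent().trim();
    }

    // Demo-only helper: parses a small, well-formed XHTML string.
    static Document parse(String xhtml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><h1>Hello</h1><p>World</p></body></html>");
        System.out.println(doc.getTextContent()); // null for Document nodes
        System.out.println(visibleText(doc));     // HelloWorld
    }
}
```

Note that the element text also includes the contents of script and style elements if the page has any, so you may want to remove those nodes first.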

Related

Use JSoup to get all textual links

I'm using JSoup to grab content from web pages.
I want to get all the links on a page that contain some text. It doesn't matter what the text is; it just needs to be non-empty (not only an image, etc.).
Example of links I want:
Link to Some Page
Since it contains the text "Link to Some Page"
Links I don't want:
<img src="someimage.jpg"/>
My code looks like this. How can I modify it to only get the first type of link?
Document document = // I get my document object
Elements linksOnPage = document.select("a[href]");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

You could do something like this.
It does its job, though it's probably not the fanciest solution out there.
Note: text() returns the element's clean text, so any HTML fragments inside it are not returned.
Document doc = // get the doc
Elements linksOnPage = doc.select("a");
for (Element pageElem : linksOnPage) {
    // skip links without visible text (e.g. image-only links)
    if (pageElem.text().trim().isEmpty())
        continue;
    String link = pageElem.attr("abs:href");
    // do something with it
}

I am using this and it's working fine:
Document document = // I get my document object
Elements linksOnPage = document.select("a:matches(([^\\s]+))");
for (Element page : linksOnPage) {
String link = page.attr("abs:href");
// I do stuff with the link
}

Jsoup Scraping HTML dynamic content

I'm new to Jsoup and I have been trying to write a small piece of code that gets the names of the items in a Steam inventory using Jsoup.
public Element getItem(String user) throws IOException{
Document doc;
doc = Jsoup.connect("http://steamcommunity.com/id/"+user+"/inventory").get();
Element element = doc.getElementsByClass("hover_item_name").first();
return element;
}
This method returns:
<h1 class="hover_item_name" id="iteminfo0_item_name"></h1>
I want the information between the h1 tags, which is generated when you click on a specific window.
Thank you in advance.

You can use the .select(String cssQuery) method:
doc.select("h1") gives you all h1 elements.
If you need the actual text in these tags, call .text() on each Element.
If you need an attribute such as class or id, use .attr(String attributeKey) on an Element, e.g.:
doc.getElementsByClass("hover_item_name").first().attr("id")
gives you "iteminfo0_item_name".
But if you need to perform clicks on a website, you can't do that with Jsoup, because Jsoup is an HTML parser and not a browser replacement. It does not execute JavaScript, so it can't handle dynamic content.
What you could do is first scrape the relevant data from the page and then send the follow-up request yourself (a .post() request that mimics the page's Ajax call).
If you would rather use a real web driver, have a look at Selenium.

Use .text() and return a String, i.e.:
public String getItem(String user) throws IOException{
Document doc;
doc = Jsoup.connect("http://steamcommunity.com/id/"+user+"/inventory").get();
Element element = doc.getElementsByClass("hover_item_name").first();
String text = element.text();
return text;
}

Is it possible to display text on a rich text item in a DocumentContext?

I have a form that runs a Java agent on the WebQueryOpen event. This agent pulls data from a DB2 database and puts it into the computed text fields I have placed on the form, which are displayed whenever I open the form in the browser. This works. However, when I try to use rich text fields I get a ClassCastException. No document is actually saved; I just open the form in the browser using this Domino URL: https://company.com/database.nsf/sampleform?OpenForm
Sample code with a simple text field (displays without problems):
Document sampledoc = agentContext.getDocumentContext();
String samplestr = "sample data from db2";
sampledoc.replaceItemValue("sampletextfield", samplestr);
When I tried using a rich text field:
Document sampledoc = agentContext.getDocumentContext();
String samplestr = "sample data from db2";
RichTextItem rtsample = (RichTextItem) sampledoc.getFirstItem("samplerichtextfield");
rtsample.appendText(samplestr); // throws ClassCastException
Basically, I want to use a rich text field so that it can accommodate more characters in case I pull a very long string.
Screenshot of the field (As you can see it's a RichText)

The problem is that you're trying to access a regular Item as a RichTextItem.
A RichTextItem is a special kind of field that is created with its own method, like this:
RichTextItem rtsample = sampledoc.createRichTextItem("samplerichtextfield");
This is different from regular Items, which can be created with a simple sampledoc.replaceItemValue(etc).
So, if you want to check whether an item is a RichTextItem, and create it if it does not exist, you can do this:
RichTextItem rti = null;
Item item = doc.getFirstItem("somefield");
if (item != null) {
    if (item instanceof RichTextItem) {
        //Yay!
        rti = (RichTextItem) item;
    } else {
        //:-(
    }
} else {
    rti = doc.createRichTextItem("somefield");
    //etc.
}

Trying to find specific links while web crawling

I am modifying the code given in crawler4j. I want to find specific links while crawling a website. For example, I am crawling www.cmu.edu and trying to get the link for the directory search. Here is my code for it:
public void visit(Page page) {
String url = page.getWebURL().getURL();
// System.out.println("URL: " + url);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
System.out.println(html.matches(".*<a href.*."));
if (html.matches(".*.<a href=.*.>Directory Search</a>.*."))
System.out.println("***********Hello*********************");
// System.out.println("----------"+html);
return;
// List<WebURL> links = htmlParseData.getOutgoingUrls();
}
}
This code does not work: I never get the ***********Hello********************* line on my console. To check, I printed the html string to the console, copied the anchor tag that contains the directory search, and wrote this simple two-line test:
String test2="<li class=\"first\">Directory Search</li>";
System.out.println("*******"+test2.matches(".*.<a href=.*.>Directory Search</a>.*."));
This works. The value of String test2 is copied from the console. What am I doing wrong in the first part of the code?

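Try this. By default, . does not match line terminators, so String.matches() can never match a multi-line HTML string with this pattern; the (?s) flag (DOTALL) makes . match newline characters as well: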
String test2="qwert\n\n<li class=\"first\">Directory Search</li>";
System.out.println("*******"+test2.matches("(?s).*.<a href=.*.>Directory Search</a>.*."));
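
To make the difference concrete, here is a small self-contained sketch that runs the question's pattern with and without (?s) against a multi-line string. The class name DotAllSketch and the href value "/directory" are made up for the demo; the real page markup will differ.

```java
// Hypothetical demo class, not part of crawler4j.
final class DotAllSketch {
    // Without flags, '.' does not match line terminators.
    static final String PLAIN = ".*<a href=.*>Directory Search</a>.*";
    // (?s) switches on DOTALL, so '.' also matches '\n' and '\r'.
    static final String DOTALL = "(?s)" + PLAIN;

    // String.matches() requires the regex to match the WHOLE input.
    static boolean containsLink(String html, String regex) {
        return html.matches(regex);
    }

    public static void main(String[] args) {
        // Made-up markup spanning several lines, like real page HTML.
        String html = "<ul>\n"
                + "<li class=\"first\"><a href=\"/directory\">Directory Search</a></li>\n"
                + "</ul>";
        System.out.println(containsLink(html, PLAIN));  // false
        System.out.println(containsLink(html, DOTALL)); // true
    }
}
```

An alternative that avoids the leading and trailing .* altogether is Pattern.compile(...).matcher(html).find(), which looks for the pattern anywhere in the input instead of requiring a full match.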

Get title and description dynamically by using an URL

I need to get the title and description of a URL dynamically. What do I need to use in order to do this?
Take for example the following URL: http://en.wikipedia.org/wiki/Stack_overflow
I need to extract the title of the page and its description. Would you recommend jsoup extraction, as below?
url.select("title");
If yes, how to extract description of the url?

I think that you need an HTML parser like Jericho.
Take a look at this example:
http://jericho.htmlparser.net/samples/console/src/ExtractText.java
especially these two methods:
private static String getTitle(Source source) {
Element titleElement=source.getFirstElement(HTMLElementName.TITLE);
if (titleElement==null) return null;
// TITLE element never contains other tags so just decode it collapsing whitespace:
return CharacterReference.decodeCollapseWhiteSpace(titleElement.getContent());
}
private static String getMetaValue(Source source, String key) {
for (int pos=0; pos<source.length();) {
StartTag startTag=source.getNextStartTag(pos,"name",key,false);
if (startTag==null) return null;
if (startTag.getName()==HTMLElementName.META)
return startTag.getAttributeValue("content"); // Attribute values are automatically decoded
pos=startTag.getEnd();
}
return null;
}
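
If you just want to see the idea without pulling in a parser library, here is a rough, dependency-free sketch using java.util.regex. This is not the Jericho or jsoup approach; the class name PageSummarySketch is made up, and the description pattern assumes the meta tag is written as name="description" followed directly by content="...". A real HTML parser copes with attribute order, entities, and malformed markup far better. In jsoup, the usual route would be Document.title() plus a meta[name=description] selector with attr("content").

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based sketch; only suitable for simple, well-behaved pages.
final class PageSummarySketch {
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(?<value>.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Assumes name=... comes immediately before content=... inside the tag.
    private static final Pattern DESCRIPTION = Pattern.compile(
            "<meta\\s+name=[\"']description[\"']\\s+content=(?<q>[\"'])(?<value>.*?)\\k<q>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static Optional<String> title(String html) {
        return firstValue(TITLE, html);
    }

    static Optional<String> description(String html) {
        return firstValue(DESCRIPTION, html);
    }

    // Returns the first match with surrounding and repeated whitespace collapsed.
    private static Optional<String> firstValue(Pattern pattern, String html) {
        Matcher m = pattern.matcher(html);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(m.group("value").trim().replaceAll("\\s+", " "));
    }

    public static void main(String[] args) {
        String html = "<html><head><title> Stack\n overflow </title>"
                + "<meta name=\"description\" content=\"It's an error\"></head></html>";
        System.out.println(title(html).orElse("(no title)"));
        System.out.println(description(html).orElse("(no description)"));
    }
}
```

Both methods return Optional.empty() when the tag is missing, which mirrors the null returns in the Jericho sample above.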
