Trying to find specific links while web crawling

Trying to find specific links while web crawling - java

I am modifying the code given in [crawler4j][1]. I want to find specific links while crawling a web site. For ex I am crawling on www.cmu.edu and I am trying to get the link for directory search. Here is my code for it -
public void visit(Page page) {
String url = page.getWebURL().getURL();
// System.out.println("URL: " + url);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
System.out.println(html.matches(".*<a href.*."));
if (html.matches(".*.<a href=.*.>Directory Search</a>.*."))
System.out.println("***********Hello*********************");
// System.out.println("----------"+html);
return;
// List<WebURL> links = htmlParseData.getOutgoingUrls();
}
}
This code does not work. I am not getting the *******Helo********* on my console. Just to check I printed the html string in console and I copied the anchor tag that contains the directory sreach and I wrote this simple two line code -
String test2="<li class=\"first\">Directory Search</li>";
System.out.println("*******"+test2.matches(".*.<a href=.*.>Directory Search</a>.*."));
This works. The value of String test2 is copied from the console. What am I doing wrong in the first part of the code?
[1]

Try this (you have to use (?s) to match also new line characters)
String test2="qwert\n\n<li class=\"first\">Directory Search</li>";
System.out.println("*******"+test2.matches("(?s).*.<a href=.*.>Directory Search</a>.*."));

Related

Java jsoup link ignore

I have the following code:
private static final Pattern FILE_FILTER = Pattern.compile(
".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
"|rm|smil|wmv|swf|wma|zip|rar|gz))$");
private boolean isRelevant(String url) {
if (url.length() < 1) // Remove empty urls
return false;
else if (FILE_FILTER.matcher(url).matches()) {
return false;
}
else
return TLSpecific.isRelevant(url);
}
I am using this part when i am parsing a web site to check whether it contains links that contains some of the patterns declared, but I dont know is there a way to do it directly through jsoup and optimize the code. For example given a web page how I can ignore all of them with jsoup?

how I can ignore all of them with jsoup?
Let's say we want any element not having jpg or jpeg extension in their hrefor src attribute.
String filteredLinksCssQuery = "[href]:not([href~=(?i)\\.jpe?g$]), " + //
"[src]:not([src~=(?i)\\.jpe?g$])";
String html = "<a href='foo.jpg'>foo</a>" + //
"<a href='bar.svg'>bar</a>" + //
"<script src='baz.js'></script>";
Document doc = Jsoup.parse(html);
for(Element e: doc.select(filteredLinksCssQuery)) {
System.out.println(e);
}
OUTPUT
bar
<script src="baz.js"></script>
[href] /* Select any element having an href attribute... */
:not([href~=(?i)\.jpe?g$]) /* ... but exclude those matching the regex (?i)\.jpe?g$ */
, /* OR */
[src] /* Select any element having a src attribute... */
:not([src~=(?i)\.jpe?g$]) /* ... but exclude those matching the regex (?i)\.jpe?g$ */
You can add more extensions to filter. You may want to write some code for generating filteredLinksCssQuery automatically because this CSS query can quickly become unmaintainable.

How to Get Crawl content in Crawljax

I have crawl Dynamic webpage using Crawljax. i can able to get crawl current id, status and dom. but i can't get the Website content.. Any one help me??
CrawljaxConfigurationBuilder builder =
CrawljaxConfiguration.builderFor("http://demo.crawljax.com/");
builder.addPlugin(new OnNewStatePlugin() {
#Override
public String toString() {
return "Our example plugin";
}
#Override
public void onNewState(CrawlerContext cc, StateVertex sv) {
LOG.info("Found a new dom! Here it is:\n{}", cc.getBrowser().getStrippedDom());
String name = cc.getCurrentState().getName();
String url = cc.getBrowser().getCurrentUrl();
System.out.println(cc.getCurrentState().getDom());
System.out.println("New State: " + name + "; url: " + url);
}
});
CrawljaxRunner crawljax = new CrawljaxRunner(builder.build());
crawljax.call();
How to get dynamic/java script Webpage content..

We can able to get website source code
cc.getBrowser().getStrippedDom()); or cc.getCurrentState().getDocument();
This coding are Return Source code (css/java script file)..
Not possible.Because its testing tool.This tool only check Text are available, assign temp data to Fields.

To get the website content, use the following function:
cc.getCurrentState().getDom()
This function does not return a DOM node, but actually returns the page's HTML text instead. This is the right function to use if you want the page content, but it sounds like it returns a DOM node, so the name getDom is a misnomer. To get a DOM node instead, use:
cc.getCurrentState().getDocument()
which returns the Document DOM node.
You can retrieve the page content with:
cc.getCurrentState().getDocument().getTextContent()
(EDIT: This won't work -- getTextContent always returns null when called on Documents.)

How to read the public URL in GWT?

I m new in GWT and I m generating a web application in which i have to create a public URL.
In this public URL i have to pass hashtag(#) and some parameters.
I am finding difficulty in achieving this task.
Extracting the hashtag from the URL.
Extracting the userid from the URL.
My public URL example is :: http://www.xyz.com/#profile?userid=10003

To access the URL in GWT you can use the History.getToken() method. It will give you the entire string that follows the hashtag ("#").
In your case (http://www.xyz.com/#profile?userid=10003) it will return a string "profile?userid=10003". After you have this you can parse it however you want. You can check if it contains("?") and u can split it by "?" or you can get a substring. How you get the information from that is really up to you.

I guess you already have the URL. I'm not that good at Regex, but this should work:
String yourURL = "http://www.xyz.com/#profile?userid=10003";
String[] array = yourURL.split("[\\p{Lower}\\p{Upper}\\p{Punct}}]");
int userID = 0;
for (String string : array) {
if (!string.isEmpty()) {
userID = Integer.valueOf(string);
}
}
System.out.println(userID);

To get the parameters:
String userId = Window.Location.getParameter("userid");
To get the anchor / hash tag:
I don't think there is something, you can parse the URL: look at the methods provided by Window.Location.

get a substring with regex [duplicate]

I need a regex pattern for finding web page links in HTML.
I first use #"(<a.*?>.*?</a>)" to extract links (<a>), but I can't fetch href from that.
My strings are:
<a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="www.example.com/page.php/404" ....></a>
1, 2 and 3 are valid and I need them, but number 4 is not valid for me
(? and = is essential)
Thanks everyone, but I don't need parsing <a>. I have a list of links in href="abcdef" format.
I need to fetch href of the links and filter it, my favorite urls must be contain ? and = like page.php?id=5
Thanks!

I'd recommend using an HTML parser over a regex, but still here's a regex that will create a capturing group over the value of the href attribute of each links. It will match whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex at here.
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
console.log(textToMatchInput.value.match(linkRx));
});
<label>
Text to match:
<input type="text" name="textToMatch" value='<a href="google.com"'>
<button>Match</button>
</label>

Using regex to parse html is not recommended
regex is used for regularly occurring patterns.html is not regular with it's format(except xhtml).For example html files are valid even if you don't have a closing tag!This could break your code.
Use an html parser like htmlagilitypack
You can use this code to retrieve all href's in anchor tag using HtmlAgilityPack
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var hrefList = doc.DocumentNode.SelectNodes("//a")
.Select(p => p.GetAttributeValue("href", "not found"))
.ToList();
hrefList contains all href`s

Thanks everyone (specially #plalx)
I find it quite overkill enforce the validity of the href attribute with such a complex and cryptic pattern while a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
My final regex string:
First use one of this:
st = #"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+ \w\d:##%/;$()~_?\+-=\\\.&]*)";
st = #"<a href[^>]*>(.*?)</a>";
st = #"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)";
st = #"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:##%/;$()~_?\+,\-=\\.&]+)";
st = #"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = #"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:##%/;$()~_?\+-=\\\.&]*)";
st = #"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = #"(<a.*?>.*?</a>)";
st = #"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = #"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = #"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = #"(http|https)://([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = #"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)";
st = #"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = #"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$";
st = #"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";
my choice is
#"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
Second Use this:
st = "(.*)?(.*)=(.*)";
Problem Solved. Thanks every one :)

Try this :
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
var res = Find(html);
}
public static List<LinkItem> Find(string file)
{
List<LinkItem> list = new List<LinkItem>();
// 1.
// Find all matches in file.
MatchCollection m1 = Regex.Matches(file, #"(<a.*?>.*?</a>)",
RegexOptions.Singleline);
// 2.
// Loop over each match.
foreach (Match m in m1)
{
string value = m.Groups[1].Value;
LinkItem i = new LinkItem();
// 3.
// Get href attribute.
Match m2 = Regex.Match(value, #"href=\""(.*?)\""",
RegexOptions.Singleline);
if (m2.Success)
{
i.Href = m2.Groups[1].Value;
}
// 4.
// Remove inner tags from text.
string t = Regex.Replace(value, #"\s*<.*?>\s*", "",
RegexOptions.Singleline);
i.Text = t;
list.Add(i);
}
return list;
}
public struct LinkItem
{
public string Href;
public string Text;
public override string ToString()
{
return Href + "\n\t" + Text;
}
}
}
Input:
string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";
Result:
[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
C# Scraping HTML Links
Scraping HTML extracts important page elements. It has many legal uses
for webmasters and ASP.NET developers. With the Regex type and
WebClient, we implement screen scraping for HTML.
Edited
Another easy way:you can use a web browser control for getting href from tag a,like this:(see my example)
public Form1()
{
InitializeComponent();
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
}
private void Form1_Load(object sender, EventArgs e)
{
webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
List<string> href = new List<string>();
foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
{
href.Add(el.GetAttribute("href"));
}
}

Try this regex:
"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"
You will get more help from discussions over:
Regular expression to extract URL from an HTML link
and
Regex to get the link in href. [asp.net]
Hope its helpful.

HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
public IHTMLAnchorElement imageElementHref;
imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;
Simply try this code

I came up with this one, that supports anchor and image tags, and supports single and double quotes.
<[a|img]+\\s+(?:[^>]*?\\s+)?[src|href]+=[\"']([^\"']*)['\"]
So
click here
Will match:
Match 1: /something.ext
And
<a href='/something.ext'>click here</a>
Will match:
Match 1: /something.ext
Same goes for img src attributes

I took a much simpler approach. This one simply looks for href attributes, and captures the value (between apostrophes) trailing it into a group named url:
href=['"](?<url>.*?)['"]

I think in this case it is one of the simplest pregmatches
/<a\s*(.*?id[^"]*")/g
gets links with the variable id in the address
starts from href including it, gets all characters/signs (. - excluding new line signs)
until first id occur, including it, and next all signs to nearest next " sign ([^"]*)

HTML Parsing using Jsoup.Jar

Document doc = Jsoup.connect("http://reviews.opentable.com/0938/9/reviews.htm").get();
Element part = doc.body();
Elements parts = part.getElementsByTag("span");
String attValue;
String html;
for(Element ent : parts)
{
if(ent.hasAttr("class"))
{
attValue = ent.attr("class");
if(attValue=="BVRRReviewText description")
{
System.out.println("\n");
html=ent.text();
System.out.println(html);
}
}
}
Am using Jsoup.jar for the above program.
I am accessing the webpage and my aim is to the print the text that is found within the tag <span class="BVRRReviewText description">text</span>.
But nothing is getting printed as output. There is no contents added to the String html in the program. But attValue is getting all the attribute values of the span tag.
Where must I have went wrong? Please advise.

if(attValue=="BVRRReviewText description")
should be
if(attValue.equals("...")) surely?
This is Java, not Javascript.

Change
attValue=="BVRRReviewText description"
for
attValue.matches("...")

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Trying to find specific links while web crawling - java

Try this (you have to use (?s) to match also new line characters) String test2="qwert\n\n<li class=\"first\">Directory Search</li>"; System.out.println("*******"+test2.matches("(?s)..<a href=..>Directory Search</a>.*."));

Related

Java jsoup link ignore

How to Get Crawl content in Crawljax

How to read the public URL in GWT?

get a substring with regex [duplicate]

HTML Parsing using Jsoup.Jar

Categories

Resources

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

Trying to find specific links while web crawling - java

Try this (you have to use (?s) to match also new line characters) String test2="qwert\n\n<li class=\"first\">Directory Search</li>"; System.out.println("*******"+test2.matches("(?s).*.<a href=.*.>Directory Search</a>.*."));

Related

Java jsoup link ignore

How to Get Crawl content in Crawljax

How to read the public URL in GWT?

get a substring with regex [duplicate]

HTML Parsing using Jsoup.Jar

Categories

Resources

Try this (you have to use (?s) to match also new line characters) String test2="qwert\n\n<li class=\"first\">Directory Search</li>"; System.out.println("*******"+test2.matches("(?s)..<a href=..>Directory Search</a>.*."));