Java jsoup link ignore - java

I have the following code:
private static final Pattern FILE_FILTER = Pattern.compile(
".*(\\.(css|js|bmp|gif|jpe?g|png|tiff?|mid|mp2|mp3|mp4|wav|avi|mov|mpeg|ram|m4v|pdf" +
"|rm|smil|wmv|swf|wma|zip|rar|gz))$");
private boolean isRelevant(String url) {
if (url.length() < 1) // Remove empty urls
return false;
else if (FILE_FILTER.matcher(url).matches()) {
return false;
}
else
return TLSpecific.isRelevant(url);
}
I use this when parsing a web site to check whether its links match any of the declared patterns, but I don't know whether there is a way to do this directly through jsoup and simplify the code. For example, given a web page, how can I ignore all such links with jsoup?

how I can ignore all of them with jsoup?
Let's say we want any element that does not have a jpg or jpeg extension in its href or src attribute.
String filteredLinksCssQuery = "[href]:not([href~=(?i)\\.jpe?g$]), " + //
"[src]:not([src~=(?i)\\.jpe?g$])";
String html = "<a href='foo.jpg'>foo</a>" + //
"<a href='bar.svg'>bar</a>" + //
"<script src='baz.js'></script>";
Document doc = Jsoup.parse(html);
for(Element e: doc.select(filteredLinksCssQuery)) {
System.out.println(e);
}
OUTPUT
<a href="bar.svg">bar</a>
<script src="baz.js"></script>
[href] /* Select any element having an href attribute... */
:not([href~=(?i)\.jpe?g$]) /* ... but exclude those matching the regex (?i)\.jpe?g$ */
, /* OR */
[src] /* Select any element having a src attribute... */
:not([src~=(?i)\.jpe?g$]) /* ... but exclude those matching the regex (?i)\.jpe?g$ */
You can add more extensions to filter. You may want to write some code for generating filteredLinksCssQuery automatically because this CSS query can quickly become unmaintainable.
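As a rough sketch of the generation step mentioned above, the query can be assembled from a list of attributes and a list of extension patterns. The class and method names here (`LinkFilterQuery`, `buildQuery`) are made up for illustration, and the grouped alternation `(jpe?g|png)` is assumed to be accepted by jsoup's selector parser in the same way as the single pattern shown earlier; it is worth checking against the jsoup version you use.

```java
import java.util.ArrayList;
import java.util.List;

class LinkFilterQuery {

    // Builds one "[attr]:not([attr~=(?i)\.(ext1|ext2)$])" clause per attribute
    // and joins the clauses with ", " (the CSS "OR").
    // Hypothetical helper, not part of jsoup.
    static String buildQuery(List<String> attributes, List<String> extensionPatterns) {
        String regex = "(?i)\\.(" + String.join("|", extensionPatterns) + ")$";
        List<String> clauses = new ArrayList<>();
        for (String attr : attributes) {
            clauses.add("[" + attr + "]:not([" + attr + "~=" + regex + "])");
        }
        return String.join(", ", clauses);
    }

    public static void main(String[] args) {
        String query = buildQuery(List.of("href", "src"), List.of("jpe?g", "png", "gif"));
        // The resulting string is what you would pass to doc.select(...).
        System.out.println(query);
    }
}
```

Keeping the extensions in a list means the selector string never has to be edited by hand when a new file type is added.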

Related

How to use Selenium get text from an element not including its sub-elements

HTML
<div id='one'>
<button id='two'>I am a button</button>
<button id='three'>I am a button</button>
I am a div
</div>
Code
driver.findElement(By.id("one")).getText();
I've seen this question pop up a few times in the last maybe year or so and I've wanted to try writing this function... so here you go. It takes the parent element and removes each child's textContent until what remains is the textNode. I've tested this on your HTML and it works.
/**
* Takes a parent element and strips out the textContent of all child elements and returns textNode content only
*
* @param e
* the parent element
* @return the text from the child textNodes
*/
public static String getTextNode(WebElement e)
{
String text = e.getText().trim();
List<WebElement> children = e.findElements(By.xpath("./*"));
for (WebElement child : children)
{
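// replaceFirst takes a regex, so quote the child text to treat it literally
text = text.replaceFirst(java.util.regex.Pattern.quote(child.getText()), "").trim();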
}
return text;
}
and you call it
System.out.println(getTextNode(driver.findElement(By.id("one"))));
Warning: the initial solution (far below) won't work. I opened an enhancement request (2840) against Selenium WebDriver and another one against the W3C WebDriver specification; the more votes, the sooner they'll get enough attention (one can hope). Until then, the solution suggested by @shivansh in the other answer (executing JavaScript via Selenium) remains the only alternative. Here's the Java adaptation of that solution (collects all text nodes, discards those that are whitespace-only, and separates the rest with \t):
WebElement e=driver.findElement(By.xpath("//*[@id='one']"));
if(driver instanceof JavascriptExecutor) {
String jswalker=
"var tw = document.createTreeWalker("
+ "arguments[0],"
+ "NodeFilter.SHOW_TEXT,"
+ "{ acceptNode: function(node) { return NodeFilter.FILTER_ACCEPT;} },"
+ "false"
+ ");"
+ "var ret=null;"
+ "while(tw.nextNode()){"
+ "var t=tw.currentNode.wholeText.trim();"
+ "if(t.length>0){" // skip over all-white text values
+ "ret=(ret ? ret+'\t'+t : t);" // if many, tab-separate them
+ "}"
+ "}"
+ "return ret;" // will return null if no non-empty text nodes are found
;
Object val=((JavascriptExecutor) driver).executeScript(jswalker, e);
// ---- Pass the context node here ------------------------------^
String textNodesTabSeparated=(null!=val ? val.toString() : null);
// ----^ --- this is the result you want
}
References:
TreeWalker - supported by all browsers
Selenium Javascript Executor
Initial suggested solution - not working - see enhancement request: 2840
driver.findElement(By.id("one")).findElement(By.xpath("./text()")).getText();
In a single search
driver.findElement(By.xpath("//*[@id='one']/text()")).getText();
See the child::text() selector in the XPath spec (Location Paths).
I use a function like below:
private static final String ALL_DIRECT_TEXT_CONTENT =
"var element = arguments[0], text = '';\n" +
"for (var i = 0; i < element.childNodes.length; ++i) {\n" +
" var node = element.childNodes[i];\n" +
" if (node.nodeType == Node.TEXT_NODE" +
" && node.textContent.trim() != '')\n" +
" text += node.textContent.trim();\n" +
"}\n" +
"return text;";
public String getText(WebDriver driver, WebElement element) {
return (String) ((JavascriptExecutor) driver).executeScript(ALL_DIRECT_TEXT_CONTENT, element);
}
var outerElement = driver.FindElement(By.XPath("a"));
var outerElementTextWithNoSubText = outerElement.Text.Replace(outerElement.FindElement(By.XPath("./*")).Text, "");
Similar solution to the ones given, but instead of JavaScript or setting text to "", I remove elements in the XML and then get the text.
Problem:
Need text from 'root element without children' where children can be x levels deep and the text in the root can be the same as the text in other elements.
The solution treats the webelement as an XML and replaces the children with voids so only the root remains.
The result is then parsed. In my cases this seems to be working.
I only verified this code in an environment with Groovy. I have no idea whether it will work in Java without modifications. Essentially, you need to replace the Groovy XML libraries with Java libraries, and that should be it, I guess.
As for the code itself, I have two parameters:
WebElement el
boolean strict
When strict is true, then really only the root is taken into account. If strict is false, then markup tags will be left. I included in this whitelist p, b, i, strong, em, mark, small, del, ins, sub, sup.
The logic is:
Manage whitelisted tags
Get element as string (XML)
Parse to an XML object
Set all child nodes to void
Parse and get text
Up until now this seems to be working out.
You can find the code here: GitHub Code
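The steps listed above can be approximated in plain Java with the JDK's built-in XML parser. This is only a sketch under assumptions, not the linked Groovy code: it expects the element's markup as a well-formed XML string (with Selenium you might obtain it from the element's outer HTML, which is not guaranteed to be well-formed XML), the names `DirectText` and `directText` are invented for illustration, and only direct children of the root are inspected.

```java
import java.io.StringReader;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DirectText {

    // Markup tags that are kept when strict is false (the whitelist described above).
    private static final Set<String> INLINE_TAGS = Set.of(
            "p", "b", "i", "strong", "em", "mark", "small", "del", "ins", "sub", "sup");

    // Returns the text of the root element with its child elements removed.
    // Hypothetical helper; xml must be well-formed.
    static String directText(String xml, boolean strict) {
        try {
            Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList children = root.getChildNodes();
            // Walk backwards: the NodeList is live and shrinks as nodes are removed.
            for (int i = children.getLength() - 1; i >= 0; i--) {
                Node child = children.item(i);
                boolean isElement = child.getNodeType() == Node.ELEMENT_NODE;
                boolean keep = !strict && INLINE_TAGS.contains(child.getNodeName().toLowerCase());
                if (isElement && !keep) {
                    root.removeChild(child);
                }
            }
            // Collapse runs of whitespace left behind by the removed elements.
            return root.getTextContent().trim().replaceAll("\\s+", " ");
        } catch (Exception ex) {
            throw new IllegalArgumentException("Input is not well-formed XML", ex);
        }
    }

    public static void main(String[] args) {
        String xml = "<div id='one'><button id='two'>I am a button</button>"
                + "<button id='three'>I am a button</button>I am a div</div>";
        System.out.println(directText(xml, true));
    }
}
```

Because whole child elements are removed rather than string-replaced, text in the root that happens to equal a child's text is left alone.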

How can I use Jsoup to turn unallowed html tag delimiter into entities where there are unallowed tags

Using Jsoup clean is it possible to convert this string:
Here is some <b>important</b> stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> movie
in the output
to this:
Here is some <b>important</b> stuff that can't have
&lt;script&gt;javascript&lt;/script&gt; or the following embed tag
&lt;embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"&gt;
movie in the output
so it renders
Here is some important stuff that can't have
<script>javascript</script> or the following embed tag
<embed src="helloworld.swf" type="application/vnd.adobe.flash-movie">
movie in the output
Here the bold tag is allowed and left alone, but the delimiters of the script and embed tags change from < and > to &lt; and &gt; so that they are treated as plain text and not as real HTML elements.
What settings are necessary to accomplish this? I have:
private static String limitHtml(String value) {
String result = value;
if (value != null && !value.isEmpty()) {
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
// what other settings ???
Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
result = Jsoup.clean(value, "", whitelist, settings);
}
return result;
}
Is there a similar Java lib that can accomplish this if Jsoup can't?
Jsoup definitely has your back here. The trick is to use a dummy document (the transitional variable in the code) containing a single pre element.
We simply load each unallowed element we find into this pre element as text.
Then we replace the unallowed element in the initial value with its escaped HTML code.
CODE
// Comma separated list of allowed tags.
private static String ALLOWED_HTML_TAGS_CSS_QUERY = "b,span";
private static String limitHtml(String value) {
String result = value;
if (value != null && !value.isEmpty()) {
// Build a side document. It will help us escape unallowed tags.
Document transitional = Jsoup.parse("<pre></pre>");
// Parse the actual value for finding unallowed tags
Document doc = Jsoup.parseBodyFragment(value, "");
Elements unallowedElements = doc.select("*:not("+ALLOWED_HTML_TAGS_CSS_QUERY+")");
for (Element e : unallowedElements) {
switch (e.tagName()) {
case "#root": case "html": case "head": case "body":
// Those tags are added automatically by Jsoup. Nothing to do...
break;
default:
// Load the unallowed element to escape its html code in the transitional document
Element pre = transitional.select("pre").first().text(e.outerHtml());
// Replace unallowed element with its escape html code
e.replaceWith(new TextNode(pre.text(), ""));
}
}
// Get the final sanitized value
Document.OutputSettings settings = new Document.OutputSettings();
settings.prettyPrint(false);
Whitelist whitelist = Whitelist.none().addTags(ALLOWED_HTML_TAGS);
whitelist.addAttributes(":all", ALLOWED_HTML_ATTRIBUTES);
result = Jsoup.clean(doc.body().html(), "", whitelist, settings);
}
return result;
}
SAMPLE USAGE
String unsanitizedHtml = "Here is some <b>important</b> stuff that can't have " + //
"<script>javascript</script> or the following embed tag " + //
"<embed src=\"helloworld.swf\" type=\"application/vnd.adobe.flash-movie\"> movie" + //
"in the output";
System.out.println("BEFORE:\n" + unsanitizedHtml);
System.out.println();
System.out.println("AFTER:\n" + limitHtml(unsanitizedHtml));
OUTPUT
BEFORE:
Here is some <b>important</b> stuff that can't have <script>javascript</script> or the following embed tag <embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"> moviein the output
AFTER:
Here is some <b>important</b> stuff that can't have &lt;script&gt;javascript&lt;/script&gt; or the following embed tag &lt;embed src="helloworld.swf" type="application/vnd.adobe.flash-movie"&gt; moviein the output

Trying to find specific links while web crawling

I am modifying the code given in crawler4j. I want to find specific links while crawling a web site. For example, I am crawling www.cmu.edu and trying to get the link for the directory search. Here is my code for it:
public void visit(Page page) {
String url = page.getWebURL().getURL();
// System.out.println("URL: " + url);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
System.out.println(html.matches(".*<a href.*."));
if (html.matches(".*.<a href=.*.>Directory Search</a>.*."))
System.out.println("***********Hello*********************");
// System.out.println("----------"+html);
return;
// List<WebURL> links = htmlParseData.getOutgoingUrls();
}
}
This code does not work. I am not getting the ***********Hello********************* line on my console. Just to check, I printed the html string to the console, copied the anchor tag that contains the directory search, and wrote this simple two-line test:
String test2="<li class=\"first\">Directory Search</li>";
System.out.println("*******"+test2.matches(".*.<a href=.*.>Directory Search</a>.*."));
This works. The value of String test2 is copied from the console. What am I doing wrong in the first part of the code?
Try this. By default, . does not match line terminators, so the pattern fails on multi-line HTML; the (?s) flag makes . match newline characters as well.
String test2="qwert\n\n<li class=\"first\">Directory Search</li>";
System.out.println("*******"+test2.matches("(?s).*.<a href=.*.>Directory Search</a>.*."));
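To make the effect of the flag visible, here is a small self-contained comparison. It is only a sketch: the sample HTML and its href value (/directory) are invented, and the pattern is a slightly simplified form of the one in the question. `Pattern.DOTALL` is the programmatic equivalent of the inline `(?s)` flag.

```java
import java.util.regex.Pattern;

class DotallDemo {

    // Simplified version of the pattern from the question.
    static final String LINK_REGEX = ".*<a href=.*>Directory Search</a>.*";

    // Default mode: '.' stops at line terminators, and matches() needs the whole input.
    static boolean matchesDefault(String html) {
        return html.matches(LINK_REGEX);
    }

    // DOTALL mode: '.' also matches line terminators.
    static boolean matchesDotall(String html) {
        return Pattern.compile(LINK_REGEX, Pattern.DOTALL).matcher(html).matches();
    }

    public static void main(String[] args) {
        // Hypothetical page content spanning several lines.
        String html = "<ul>\n<li class=\"first\"><a href=\"/directory\">Directory Search</a></li>\n</ul>";
        System.out.println("default: " + matchesDefault(html));
        System.out.println("dotall:  " + matchesDotall(html));
    }
}
```

The single-line test string in the question matched only because it contained no newlines, whereas a real page does.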

get a substring with regex [duplicate]

I need a regex pattern for finding web page links in HTML.
I first use @"(<a.*?>.*?</a>)" to extract links (<a>), but I can't fetch href from that.
My strings are:
<a href="www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="http://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="https://www.example.com/page.php?id=xxxx&name=yyyy" ....></a>
<a href="www.example.com/page.php/404" ....></a>
The first three are valid and I need them, but the fourth is not valid for me
(? and = are essential).
Thanks everyone, but I don't need to parse <a>. I have a list of links in href="abcdef" format.
I need to fetch the href of each link and filter it; the URLs I want must contain ? and =, like page.php?id=5
Thanks!
I'd recommend using an HTML parser over a regex, but here's a regex that creates a capturing group over the value of the href attribute of each link. It matches whether double or single quotes are used.
<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1
You can view a full explanation of this regex here.
Snippet playground:
const linkRx = /<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1/;
const textToMatchInput = document.querySelector('[name=textToMatch]');
document.querySelector('button').addEventListener('click', () => {
console.log(textToMatchInput.value.match(linkRx));
});
<label>
Text to match:
<input type="text" name="textToMatch" value='<a href="google.com"'>
<button>Match</button>
</label>
Using regex to parse html is not recommended
Regexes are meant for regular patterns. HTML is not regular in its format (except XHTML). For example, HTML files are valid even without a closing tag, which could break your code.
Use an HTML parser like HtmlAgilityPack.
You can use this code to retrieve the href of every anchor tag using HtmlAgilityPack:
HtmlDocument doc = new HtmlDocument();
doc.Load(yourStream);
var hrefList = doc.DocumentNode.SelectNodes("//a")
.Select(p => p.GetAttributeValue("href", "not found"))
.ToList();
hrefList contains all the href values.
Thanks everyone (especially @plalx)
I find it overkill to enforce the validity of the href attribute with such a complex and cryptic pattern when a simple expression such as
<a\s+(?:[^>]*?\s+)?href="([^"]*)"
would suffice to capture all URLs. If you want to make sure they contain at least a query string, you could just use
<a\s+(?:[^>]*?\s+)?href="([^"]+\?[^"]+)"
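For illustration, here is how the second expression could be applied in Java (the thread itself uses C#, so this is a translation, not the original poster's code). The class and method names are made up, and the pattern is tightened slightly so that the captured URL must contain both ? and a later =, as the question requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QueryLinkExtractor {

    // Captures the href value only when it contains '?' followed somewhere by '='.
    private static final Pattern HREF_WITH_QUERY = Pattern.compile(
            "<a\\s+(?:[^>]*?\\s+)?href=\"([^\"]+\\?[^\"]*=[^\"]*)\"");

    // Returns every matching href value found in the given HTML text.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF_WITH_QUERY.matcher(html);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
                "<a href=\"www.example.com/page.php?id=xxxx&name=yyyy\" ....></a>\n"
              + "<a href=\"http://www.example.com/page.php?id=xxxx&name=yyyy\" ....></a>\n"
              + "<a href=\"https://www.example.com/page.php?id=xxxx&name=yyyy\" ....></a>\n"
              + "<a href=\"www.example.com/page.php/404\" ....></a>";
        // The fourth link has no query string and is skipped.
        extract(html).forEach(System.out::println);
    }
}
```

This handles only double-quoted attributes, like the expression it is based on; a real HTML parser remains the more robust choice.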
My final regex string:
First, use one of these:
st = @"((www\.|https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+ \w\d:##%/;$()~_?\+-=\\\.&]*)";
st = @"<a href[^>]*>(.*?)</a>";
st = @"((([A-Za-z]{3,9}:(?:\/\/)?)(?:[-;:&=\+\$,\w]+#)?[A-Za-z0-9.-]+|(?:www.|[-;:&=\+\$,\w]+#)[A-Za-z0-9.-]+)((?:\/[\+~%\/.\w-_]*)?\??(?:[-\+=&;%#.\w_]*)#?(?:[\w]*))?)";
st = @"((?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)[\w\d:##%/;$()~_?\+,\-=\\.&]+)";
st = @"(?:(?:https?|ftp|gopher|telnet|file|notes|ms-help):(?://|\\\\)(?:www\.)?|www\.)";
st = @"(((https?|ftp|gopher|telnet|file|notes|ms-help):((//)|(\\\\))+)|(www\.)[\w\d:##%/;$()~_?\+-=\\\.&]*)";
st = @"href=[""'](?<url>(http|https)://[^/]*?\.(com|org|net|gov))(/.*)?[""']";
st = @"(<a.*?>.*?</a>)";
st = @"(?:hrefs*=)(?:[s""']*)(?!#|mailto|location.|javascript|.*css|.*this.)(?.*?)(?:[s>""'])";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s)?://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
st = @"(http|https)://([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"((http|ftp|https):\/\/[\w\-_]+(\.[\w\-_]+)+([\w\-\.,#?^=%&:/~\+#]*[\w\-\#?^=%&/~\+#])?)";
st = @"http://([\\w+?\\.\\w+])+([a-zA-Z0-9\\~\\!\\#\\#\\$\\%\\^\\&\\*\\(\\)_\\-\\=\\+\\\\\\/\\?\\.\\:\\;\\'\\,]*)?";
st = @"http(s?)\:\/\/[0-9a-zA-Z]([-.\w]*[0-9a-zA-Z])*(:(0-9)*)*(\/?)([a-zA-Z0-9\-\.\?\,\'\/\\\+&%\$#_]*)?$";
st = @"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*";
My choice is
@"(?<Protocol>\w+):\/\/(?<Domain>[\w.]+\/?)\S*"
Second, use this (the ? is escaped so that it matches a literal question mark):
st = @"(.*)\?(.*)=(.*)";
Problem solved. Thanks, everyone :)
Try this :
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}
private void Form1_Load(object sender, EventArgs e)
{
var res = Find(html);
}
public static List<LinkItem> Find(string file)
{
List<LinkItem> list = new List<LinkItem>();
// 1.
// Find all matches in file.
MatchCollection m1 = Regex.Matches(file, @"(<a.*?>.*?</a>)",
RegexOptions.Singleline);
// 2.
// Loop over each match.
foreach (Match m in m1)
{
string value = m.Groups[1].Value;
LinkItem i = new LinkItem();
// 3.
// Get href attribute.
Match m2 = Regex.Match(value, @"href=\""(.*?)\""",
RegexOptions.Singleline);
if (m2.Success)
{
i.Href = m2.Groups[1].Value;
}
// 4.
// Remove inner tags from text.
string t = Regex.Replace(value, @"\s*<.*?>\s*", "",
RegexOptions.Singleline);
i.Text = t;
list.Add(i);
}
return list;
}
public struct LinkItem
{
public string Href;
public string Text;
public override string ToString()
{
return Href + "\n\t" + Text;
}
}
}
Input:
string html = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> 2.<a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a> ";
Result:
[0] = {www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
[1] = {http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx}
C# Scraping HTML Links
Scraping HTML extracts important page elements. It has many legal uses
for webmasters and ASP.NET developers. With the Regex type and
WebClient, we implement screen scraping for HTML.
Edited
Another easy way: you can use a WebBrowser control to get the href from each a tag, like this (see my example):
public Form1()
{
InitializeComponent();
webBrowser1.DocumentCompleted += new WebBrowserDocumentCompletedEventHandler(webBrowser1_DocumentCompleted);
}
private void Form1_Load(object sender, EventArgs e)
{
webBrowser1.DocumentText = "<a href=\"www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"http://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"https://www.aaa.xx/xx.zz?id=xxxx&name=xxxx\" ....></a><a href=\"www.aaa.xx/xx.zz/xxx\" ....></a>";
}
void webBrowser1_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
List<string> href = new List<string>();
foreach (HtmlElement el in webBrowser1.Document.GetElementsByTagName("a"))
{
href.Add(el.GetAttribute("href"));
}
}
Try this regex:
"href\\s*=\\s*(?:\"(?<1>[^\"]*)\"|(?<1>\\S+))"
You will get more help from discussions over:
Regular expression to extract URL from an HTML link
and
Regex to get the link in href. [asp.net]
Hope it's helpful.
HTMLDocument DOC = this.MySuperBrowser.Document as HTMLDocument;
public IHTMLAnchorElement imageElementHref;
imageElementHref = DOC.getElementById("idfirsticonhref") as IHTMLAnchorElement;
Simply try this code
I came up with this one, that supports anchor and image tags, and supports single and double quotes.
<(?:a|img)\\s+(?:[^>]*?\\s+)?(?:src|href)=[\"']([^\"']*)['\"]
So
<a href="/something.ext">click here</a>
Will match:
Match 1: /something.ext
And
<a href='/something.ext'>click here</a>
Will match:
Match 1: /something.ext
Same goes for img src attributes
I took a much simpler approach. This one simply looks for href attributes and captures the value that follows (between single or double quotes) into a group named url:
href=['"](?<url>.*?)['"]
I think in this case it is one of the simplest patterns:
/<a\s*(.*?id[^"]*")/g
It gets links that have the variable id in the address.
The group starts right after <a and takes all characters (. matches anything except newlines)
up to and including the first occurrence of id, and then everything up to and including the nearest following " character ([^"]*").

adding text before and after a link jSoup

I've just started learning Jsoup and working through the cookbook on their website, but I'm a bit stuck with adding text to an element I've parsed.
try{
Document doc = Jsoup.connect(url).get();
Element add = doc.prependText("a href") ;
Elements links = add.select("a[href]");
for (Element link : links) {
PrintStream sb = System.out.format("%n %s",link.attr("abs:href"));
System.out.print("<br>");
}
}
catch(Exception e){
System.out.print("error --> " + e);
}
Example run with google.com I get
http://www.google.ie/imghp?hl=en&tab=wi<br>
http://maps.google.ie/maps?hl=en&tab=wl<br>
https://play.google.com/?hl=en&tab=w8<br>
But I really want
<a href> http://www.google.ie/imghp?hl=en&tab=wi<br></a>
<a href> http://maps.google.ie/maps?hl=en&tab=wl<br></a>
<a href> https://play.google.com/?hl=en&tab=w8<br></a>
With this code I've gotten all the links off the page, but I also want to get the <a> and </a> tags so I can then create my own webpage. I've tried adding a string and prepending text, but I just can't seem to get it right.
Thanks
With link.attr(...) you get only the attribute value.
But you need the whole tag:
Document doc = Jsoup.connect(...).get();
for( Element e : doc.select("a[href]") ) // Select all 'a'-Tags with 'href' attribute
{
String wholeTag = e.toString(); // Get a string as the element is
/* Now you can use the html - in this example for a simple output */
System.out.println(wholeTag);
}
