Find URLs in String with Jsoup - java

I have a String value that is plain text (not HTML) and contains URLs:
Blabla http://www.example.com/foo1/ blabla
http://www.example.com/foo2/ blabla...
I need to grab all these urls from the string using Jsoup.
Is it possible?

No, Jsoup won't do this for you. Jsoup parses HTML tags, not arbitrary strings.
If you had an HTML document containing a bunch of link tags (<a href="http://example.com/page.html">link text</a>), you could use Jsoup to parse the tags and extract the href attribute.
If you just have a string with some links, you probably want to use regular expressions, as suggested in a comment by PeterMmm.
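As a rough sketch of the regular-expression approach, the snippet below collects every run of non-whitespace characters that starts with http:// or https://. The class name UrlFinder, the method name findUrls, and the pattern itself are illustrative assumptions, not something from the original answer. The pattern is deliberately simple, so punctuation that directly follows a URL (a trailing period or a closing parenthesis, say) would be captured along with it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls http/https URLs out of plain text.
class UrlFinder {
    // Assumption: a URL is "http://" or "https://" followed by non-whitespace characters.
    private static final Pattern URL = Pattern.compile("https?://\\S+");

    static List<String> findUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher matcher = URL.matcher(text);
        // find() advances to the next non-overlapping match on each call.
        while (matcher.find()) {
            urls.add(matcher.group());
        }
        return urls;
    }
}
```

Calling UrlFinder.findUrls(text) on the sample text from the question should yield the two example.com URLs in the order they appear.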

Related

How to select an element in Jsoup using its html content?

I want to select an element in Jsoup using its html content.
Example: LOCATION:
How can I do it? I couldn't find any appropriate selector methods directly. Is there any workaround available?
Using the Jsoup library, you can pull a value out of HTML by the element's name, ID, or class.
String html = "<html><head><title>Title</title></head> <body><div id='location'>Mumbai, India</div></body></html>";
Document document= Jsoup.parse(html);
String content = document.getElementById("location").outerHtml();
Happy Coding :-)

Remove a given tag from a html string without replace

I'd like to filter an html String before loading it in a WebView:
I'd like to remove all the img tags with the param:
data-custom:'delete'
For example:
<img src="https://..." data-custom:'delete'/>
How can I do this in Android in an elegant way (without external libraries if possible)?
I'm going to go for a nice and simple:
String element = "<img src='https://...' data-custom:'delete'/>";
String attributeRemoved = element.replaceAll("data-custom:['\"][^'\"]*['\"]", "");
Updated based on comment
If you want to remove the whole tag you can do this:
String elementRemoved = element.replaceAll("<[^>]*data-custom:['\"][^'\"]*['\"][^>]*>", "");
If you only want to do it for <img> tags you can do:
String imgElementRemoved = element.replaceAll("<img[^>]*data-custom:['\"][^'\"]*['\"][^>]*>", "");
A much more reliable way would be to parse the HTML as an XML document and use XPath to find all elements with a data-custom attribute and remove them from the document, then save the updated document. While you can do this stuff with regex, it's not normally a good idea...
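The XML/XPath route could look roughly like the sketch below, which uses only the javax.xml and org.w3c.dom packages. It rests on two assumptions that go beyond the question: the input is well-formed XML (XHTML), and the marker is written as a regular attribute, data-custom="delete", rather than the data-custom:'delete' form shown above, which an XML parser would reject. The class name ImgFilter and method name removeMarkedImages are made up for illustration.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: removes <img data-custom="delete"> elements from well-formed XHTML.
class ImgFilter {
    static String removeMarkedImages(String xhtml) {
        try {
            // Parse the markup as XML; this fails on markup that is not well-formed.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // Select every img element whose data-custom attribute equals "delete".
            NodeList hits = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate("//img[@data-custom='delete']", doc, XPathConstants.NODESET);

            // Detach each match from its parent.
            for (int i = 0; i < hits.getLength(); i++) {
                Node node = hits.item(i);
                node.getParentNode().removeChild(node);
            }

            // Serialize the updated document back to a string, without an XML declaration.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Input could not be processed as XML", e);
        }
    }
}
```

Real-world HTML is often not well-formed XML, so this approach only suits markup you control; for arbitrary pages a lenient HTML parser is likely the safer choice.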

Open Link in HTML with JSOUP

I have a table in an HTML page, and I have to iterate through it and follow each link to a second page that holds all the information. On that page I extract the data I need and then return to the original page.
How do I change pages with the framework JSoup in Java? Is it actually possible?
If you look at the JSoup Cookbook, they have an example of getting all the links inside of an HTML element. Iterate the Elements from this example and do a Document doc = Jsoup.connect(<url from Elements>).get();. You can then do String htmlFromLink = doc.toString(); and get the HTML from the link.

Java: How to strip text content from HTML tags?

Let's consider this HTML code:
<html><body><p><b>Hi there</b></p>click here</html>
What I want is to remove the content between the HTML tags and keep only the HTML structure, like this:
<html><body><p><b></b></p></html>
Would this satisfy?
txt.replaceAll(">[^<]*<","><")
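To see what that one-liner does, here is a small self-contained sketch. The class name TextStripper and method name stripText are illustrative, and the sample input in the usage note is a well-formed variant of the markup in the question. The pattern matches a ">" followed by any run of characters that are not "<", up to the next "<", and replaces the whole match with "><", so text sitting between two tags disappears while the tags stay.

```java
// Hypothetical helper wrapping the replaceAll call from the answer.
class TextStripper {
    static String stripText(String html) {
        // ">[^<]*<" covers the end of one tag, any text after it, and the start of the next tag.
        return html.replaceAll(">[^<]*<", "><");
    }
}
```

For example, TextStripper.stripText("<html><body><p><b>Hi there</b></p>click here</body></html>") returns "<html><body><p><b></b></p></body></html>". Text before the first tag or after the last one is left untouched, and a literal ">" or "<" inside an attribute value or a script would confuse the pattern.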

Extracting html tags based on attribute

I have a crawled page and I have retrieved html of the page into String object.
Now i want to parse this string and to extract all tags that have itemprop defined into an array that would be associative for example
String[] itemprops;
itemprops['title'] = "Some title";
itemprops['description'] = "Some description";
Can I do this with regex somehow or is there some library that can do this.
Look at JSoup. It's an HTML scraping and parsing library that's exactly what you want.
In your case, you can do something like:
Document doc = Jsoup.parse(HTMLString);
String title = doc.select("title").text();
String description = doc.select("meta[name=description]").attr("content");
The select() function uses CSS selectors to get elements.
Also check that the HTML you feed it is reasonably well-formed. Jsoup is lenient and will try to repair broken markup, but the repaired tree may not match what you expect, so selectors can miss data.
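For the itemprop case in the question specifically, a sketch along these lines might be closer to the associative array being asked for. It uses the CSS attribute selector [itemprop] and collects the results into a Map, since Java arrays cannot be indexed by string. The class name ItempropExtractor, the method name extract, and the rule of preferring a content attribute over the element text are assumptions made for illustration; it needs the jsoup library on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: maps each itemprop name to its value.
class ItempropExtractor {
    static Map<String, String> extract(String html) {
        Document doc = Jsoup.parse(html);
        // LinkedHashMap keeps the properties in document order.
        Map<String, String> props = new LinkedHashMap<>();
        // "[itemprop]" selects every element that has an itemprop attribute.
        for (Element el : doc.select("[itemprop]")) {
            // Assumption: elements such as <meta> carry the value in a content
            // attribute; otherwise fall back to the element's text.
            String value = el.hasAttr("content") ? el.attr("content") : el.text();
            // A repeated itemprop name overwrites the earlier value.
            props.put(el.attr("itemprop"), value);
        }
        return props;
    }
}
```

With this, extract(html).get("title") plays the role of itemprops['title'] in the question. If a page repeats the same itemprop name, a Map<String, List<String>> would be needed to keep every value.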
