I'm trying to parse a website (e.g. Google).
In Chrome's local storage I see variables (key-value pairs),
and the document content depends on these variables.
Can I set them using jsoup?
Or should I use other tools for this?
No, you cannot read or set Chrome's local storage from your own code (unless you can find an exploit for that). You'll have to fetch the site's content with the appropriate GET or POST request made with jsoup and then parse the content yourself. Just remember that jsoup only loads the HTML of the site and does not execute JavaScript, so if your content is produced by JS (which is the only thing that can read local storage), you'll have to find another way to get it.
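Local storage never leaves the browser, so a server can only react to what travels in the request itself: the URL, cookies and headers. If the page variant you want is selected server-side, a minimal sketch of "the proper GET request" could look like the following. It uses the JDK's java.net.http client instead of jsoup so that it runs without extra dependencies; the URL and the cookie string are placeholders, not values taken from any real site. In jsoup the equivalent would be along the lines of Jsoup.connect(url).cookie(name, value).get().

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class FetchWithState {
    // Builds a GET request that carries the state as a cookie.
    // The cookie string (e.g. "lang=en") is an assumption for illustration.
    static HttpRequest buildRequest(String url, String cookie) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("Cookie", cookie)
                .header("User-Agent", "Mozilla/5.0")
                .GET()
                .build();
    }

    // Sends the request and returns the raw HTML; no JavaScript is executed.
    static String fetch(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

If the page reads local storage purely in client-side JavaScript, no request parameter will reproduce that, and a tool that actually runs the scripts is needed instead.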
I am creating a news app and have the URL of each article, e.g. http://www.bbc.co.uk/news/technology-33379571, and I need a way to extract the content from the article.
I have tried jsoup, but that gives me all the HTML tags, and there is one <main-article-body>, but that gives the link to the article which I am trying to extract. I know boilerpipe does exactly this, but it doesn't work on Android. I am really stuck with this problem.
Any help will be much appreciated.
I have worked on a few data extraction applications in .NET (C#) and have used regular expressions to extract content from news websites.
The basic idea is to first extract all a href links (as needed), then fetch the detail pages by making a web request for each link, and finally use regular expressions to extract the news body.
Note: A problem with this process is that you will need to change your regular expressions whenever the source site changes its markup.
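As a rough illustration of the first step (collecting the a href links), here is a minimal sketch in Java rather than C#, since most of this thread is about Java and Android. The class and method names are made up for the example, and the pattern only handles quoted href values, so treat it as a starting point rather than a robust extractor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractor {
    // Matches href="..." or href='...' inside an <a ...> tag, ignoring case.
    // Group 1 is the opening quote, group 2 is the link itself.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every quoted href value found in the given HTML, in document order.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }
}
```

The body extraction would follow the same pattern with a site-specific expression, which is exactly the part that breaks when the site is redesigned.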
I am looking for a way to extract all resource links from an HTML page in Java (URL links, links to files, etc.).
I first thought of extracting all elements inside src, href attributes, but the list will not be exhaustive. There is an example of code here: Jsoup, extract links, images, from website. Exception on runtime.
As a tricky example, I want to be able to detect links hidden inside JavaScript (which can also be hidden anywhere in the HTML DOM):
<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">
EDIT:
1) I am not looking for a regex-based solution, because regexes are not reliable for dealing with HTML documents.
2) I have tried an HTML DOM parser such as jsoup. It extracts tags and their attributes quite well. However, I have not found a way to detect links inside JavaScript with it.
3) Maybe there is an API available that tries to render the page and detects which resources need to be loaded?
Do you have any thoughts?
Thanks.
If you are willing to use PHP and have a bit of programming knowledge, here is a library:
http://simplehtmldom.sourceforge.net/
I used this library to extract info from tags, even from properties of tags. This is exactly what you need to do what you want without working with complicated code.
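For the JavaScript case from the question, a DOM parser only gets you as far as the attribute value (for example the onmouseover string) or the text of a script element. That string is JavaScript rather than HTML, so scanning it for URL literals is a reasonable heuristic. A minimal Java sketch, with a made-up class name and a deliberately simple pattern:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ScriptUrlScanner {
    // An http(s) URL literal: runs until whitespace, a quote, a bracket or a parenthesis.
    private static final Pattern URL = Pattern.compile("https?://[^\\s'\"<>()]+");

    // Returns the URL literals that appear verbatim in a JavaScript snippet.
    static List<String> findUrls(String script) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(script);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

Applied to the value window.open('http://www.evil.com/image.jpg'), this returns the single URL inside the quotes. It will miss URLs that the script builds at runtime (string concatenation, relative paths, data loaded later); those can only be found by actually rendering the page.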
I have an HTML file containing some JavaScript tags. When I open this file in a browser such as IE, some content is fetched from its source and displayed in the browser (for example, the weather for some cities). How can I run this HTML file and get the content of the web page as it was displayed in the browser? I don't want to display the content in my application; I want to parse the returned data and extract specific items (for example, the weather for each city).
Can anyone guide me, please?
What you're trying to do is called HTML scraping.
Your best option is to get help in the form of a library, since this is a common and complex task.
See this question: Options for HTML scraping?
Selenium is a good bet. It supports HtmlUnit, Firefox, Chrome amongst other browsers.
Link: http://seleniumhq.org/
I would like to parse web content and get only the text from it. I am getting the web content as HTML/JavaScript, and I need only the text from that content.
Can someone help me with this? I am using an HTML parser to do it.
For example, I need the text content of the file below, which is in bold.
The URLConnection class contains many methods that let you communicate with
the URL over the network. URLConnection is an HTTP-centric class; that
is, many of its methods are useful only when you are working with HTTP
URLs. However, most URL protocols allow you to read from and write to
the connection. This section describes both functions.
Can someone suggest an approach or provide some sample code to do this?
Thanks in advance.
You could use an HTML parser. A safe choice would be HtmlParser.
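To show the idea of "parse the HTML, keep only the text nodes", here is a minimal sketch. It deliberately uses the HTML parser that ships with the JDK (javax.swing.text.html) instead of HtmlParser, so it runs without any extra library. That parser is old and only understands roughly HTML 3.2, so take this as an illustration of the technique, not a recommendation for messy real-world pages. The class and method names are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class TextExtractor {
    // Parses the HTML and concatenates the text nodes, dropping all tags.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text between tags.
                text.append(data).append(' ');
            }
        };
        try {
            // The last argument tells the parser to ignore any charset declaration.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString().trim();
    }
}
```

With a third-party parser the same step is usually a one-liner; in jsoup, for instance, Jsoup.parse(html).text() returns the combined text of the document.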
An unorthodox method that I like is to use a tool like HtmlUnit, which is basically meant for unit testing but has advanced XPath parsing capabilities and also provides automatic login and session handling.
I recommend using HtmlUnit for downloading the page and jsoup as the HTML/XML parser.
I use them to extract information from websites (Google search too).
In my GWT application, on the client side, I have a string containing HTML. Is there a good way to parse it, find specific HTML tags within it, and return the ids of those tags?
Any help would be much appreciated, thanks!
Check out GWT query. It is a jQuery-like API for GWT that makes it easy to traverse and manipulate HTML.
You could attach your HTML string to the DOM using Element.setInnerHTML(yourString). That way you're using the browser's parser. Attaching it to an invisible element or an invisible iframe should hide what's happening from the user.
For the querying you can use GWT's DOM functions if you want to stick with plain GWT. Using JavaScript directly, or a JavaScript library like jQuery, is also an option. GWT query might work as well, but I haven't used it yet.
UPDATE:
This approach can be abused by XSS (cross site scripting) attacks - so you must either trust or sanitize the HTML string.