I've already tried using HttpConnection and HttpClient to get the content of an index.html from a website. The problem is that I only get the original HTML; the client doesn't keep reading or wait for later changes. The site I want to read keeps changing its HTML, because the page updates itself after a while when you have it open.
How can I open a "session" in Java so that it retrieves the content after it updates, or at least waits a little longer for the website to load? The image elements of the website are generated through PHP afterwards.
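One way to approximate "waiting" with a plain HTTP client is to request the page again until the response body changes. The sketch below assumes Java 11+ (java.net.http) and assumes the server itself returns different HTML over time; ChangePoller, awaitChange and httpFetcher are hypothetical names, not part of any library. If the page is updated by scripts running in the browser, re-fetching will keep returning the same HTML and this approach will not help.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;
import java.util.function.Supplier;

// Hypothetical helper: re-fetches a page until its body differs from the first response.
final class ChangePoller {

    // Returns the first body that differs from the initial one,
    // or empty if nothing changed within maxAttempts re-fetches.
    static Optional<String> awaitChange(Supplier<String> fetch, int maxAttempts, Duration pause) {
        String first = fetch.get();
        for (int i = 0; i < maxAttempts; i++) {
            try {
                Thread.sleep(pause.toMillis());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt flag and give up
                return Optional.empty();
            }
            String next = fetch.get();
            if (!next.equals(first)) {
                return Optional.of(next);
            }
        }
        return Optional.empty();
    }

    // Builds a fetcher that performs a plain GET and returns the response body as text.
    static Supplier<String> httpFetcher(String url) {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        return () -> {
            try {
                return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        };
    }
}
```

For example, `ChangePoller.awaitChange(ChangePoller.httpFetcher("https://example.com/index.html"), 10, Duration.ofSeconds(2))` would re-fetch up to ten times, two seconds apart; the URL here is only a placeholder.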
Related
I'm trying to parse a website (e.g. Google).
In Chrome's local storage I see variables (key-value pairs),
and the document content depends on these variables.
Can I set them using jsoup?
Or should I use other tools for this?
No, you cannot access Chrome's content with your own code (unless you can find an exploit for that). You'll have to get the site's content with the proper get or post request made with jsoup and then parse the content by yourself. Just remember that jsoup only loads the HTML of the site, so if your content is loaded from JS, you'll have to find another way to get it.
The link I'm working with is: https://www.openml.org/t/31
If you scroll down to the bottom of the page, you should see an "Excel" button which downloads the data as an Excel spreadsheet.
I understand how to download a file from a URL using HttpURLConnection, but this is a bit different. This Excel link doesn't really lead anywhere. From what I found, this download doesn't have a "header" that I can use java.net.URLConnection with, nor does it appear in the source code.
I am using JSoup to download some information from the webpage, and one thing I need to download is this Excel spreadsheet. I've been stuck on this for several days. The "link address" of this download just takes me back to the OpenML homepage.
Although not appearing in the source code, inspect element says:
<a class="dt-button buttons-excel buttons-html5" tabindex="0" aria-controls="tasktable" href="#"><span>Excel</span></a>
But this HREF "#" takes me back to the homepage, so I can't use that as the link to download.
I'm fairly new to JSoup and HTML. Am I looking in the right place, or am I missing some obvious information right in front of me?
Is using JSoup even needed in this case? Would Apache be more useful?
Thanks
Suppose there is a blog website.
I want to know whether saving the large HTML content directly to the database is a good approach, or whether there is a better way to process the HTML content.
On a website such as a blog, a post's title and text are typically stored in a database as plain text, without the HTML. When you view a page, the web framework fetches the page's text content from the database and uses a template to format it into an HTML page.
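A minimal sketch of that idea in Java, assuming the title and body have already been read from the database as plain strings. BlogPage, its template and its method names are made up for illustration; a real site would normally use its framework's template engine instead.

```java
// Hypothetical sketch: the post is stored as plain text and only becomes HTML at render time.
final class BlogPage {

    // The HTML lives in the template, not in the database rows.
    private static final String TEMPLATE = "<article><h1>%s</h1><p>%s</p></article>";

    // Escapes characters that have a special meaning in HTML so stored text stays text.
    static String escape(String text) {
        return text.replace("&", "&amp;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;")
                   .replace("\"", "&quot;");
    }

    // Fills the template with the escaped title and body.
    static String render(String title, String body) {
        return String.format(TEMPLATE, escape(title), escape(body));
    }
}
```

Calling `BlogPage.render("Hi", "a < b")` yields `<article><h1>Hi</h1><p>a &lt; b</p></article>`.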
You should find a book to get into all this. How all this works is a far too broad question for this site.
I have an HTML file containing some JavaScript tags. When I open this file in a browser such as IE, some content is loaded from its source and displayed in the browser (for example, the weather of some cities). How can I run this HTML file and get the contents of the web page as they were displayed in the browser? I don't want to display the contents in my application; I want to parse the returned data and extract some specific content (for example, the weather of each city).
Can anyone guide me, please?
What you're trying to do is called HTML scraping.
Your best option is to get help in the form of a library, since this is a common and complex task.
See this question: Options for HTML scraping?
Selenium is a good bet. It supports HtmlUnit, Firefox, Chrome amongst other browsers.
Link: http://seleniumhq.org/
This question is related to another one I've posted recently: Check printing with Java/JSP
We're looking for alternatives to how we currently print checks in a Java web application via an applet. The consensus seems to be to use PDF for printing, and iText offers the ability to do so with Java.
However, it's important in our particular case that the checks are "print-only" - the user should not have any ability in the application to save the check (I know a savvy user could do a PrintScreen but we want to cover our rears and make no native functionality in the app to save checks).
I haven't been able to find out by browsing the web whether it's possible to create a PDF with iText in this manner. I have seen posts on restricting permissions in a PDF, but what I'm really looking for is a way to disable the ability to save a PDF locally using iText.
Does this functionality exist? If so, could you point me to documentation/code samples on it?
I'm presuming that you're serving this PDF and wish to print it from within a web application / web site where no out-of-the-ordinary client-side plug-ins are installed.
If printing the PDF using conventional means (e.g. Adobe Reader), the PDF MUST be downloaded to the browser's cache to be opened and printed. There is no way around that.
Now you can probably prevent the average Joe from saving the PDF locally via the following technique, but any savvy user will be able to inspect your HTML's source and download the PDF directly.
Output your PDF in iText such that when the PDF is opened, a print action automatically occurs
Put an invisible IFRAME on your HTML page which loads this PDF, but is not visible in the browser to your user
When the user loads your HTML page, the PDF will be loaded in the IFRAME and sent to the user's printer (presuming that Adobe Reader is installed in the browser). Yes, the PDF will end up in the browser cache, but the user would have to be savvy enough to both recognize this and then hunt it down in their browser's cache.
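The IFRAME step could be sketched server-side as follows. PrintFrame and hiddenIframe are hypothetical names, and the sketch assumes the PDF at the given URL was already generated with an open-time print action. Whether a particular browser still loads and prints a PDF in a frame hidden this way is something you would need to verify.

```java
// Hypothetical helper that emits the markup for an invisible IFRAME pointing at the PDF.
final class PrintFrame {

    // Builds the iframe tag; the URL is escaped for use inside a double-quoted attribute.
    static String hiddenIframe(String pdfUrl) {
        String safeUrl = pdfUrl.replace("&", "&amp;").replace("\"", "&quot;");
        return "<iframe src=\"" + safeUrl + "\""
                + " style=\"width:0;height:0;border:0;visibility:hidden\"></iframe>";
    }
}
```

A JSP or servlet could write `PrintFrame.hiddenIframe("/checks/print.pdf")` into the page; the path is a placeholder.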
If this is not acceptable, you're going to have to look at converting the PDF to another format (e.g. rendering the pages to images displayed in the browser, or perhaps using a Flash / Java object that sends each page in the document directly to the printer).
The PdfWriter class in iText provides static constants for certain permission options.
And here is another SO post that might help: iText disable printing/Copying/Saving