I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:
myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();
How should I write this object to an HTML file?
The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.
Some information inside a javascript element is lost when parsing it, for example the "timestamp" value in the source of an Instagram media page.
Use doc.outerHtml().
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.jsoup.Connection.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    // Execute the request, parse the response, and write the full HTML to disk as UTF-8
    final Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();
    final File f = new File("filename.html");
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}
Don't forget to catch exceptions. Add the Apache commons-io dependency (or download the library) for an easy and quick way to save files in UTF-8.
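If you'd rather avoid the commons-io dependency, here is a minimal plain-JDK sketch (assuming Java 7+ NIO) that does the same thing:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPagePlainJdk() throws IOException {
    Document doc = Jsoup.connect("http://www.example.net").get();
    // Write the serialized document as UTF-8 using only the JDK
    Files.write(Paths.get("filename.html"),
            doc.outerHtml().getBytes(StandardCharsets.UTF_8));
}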
The fact that some elements are ignored must be due to Jsoup's attempt at normalization.
To get the server's exact output without any form of normalization, use this:
Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());
I have a tiny problem with Jsoup.
I am using this snippet of code
Response response = getAccount(); // GET Response in HTML format
Document document = Jsoup.parse(response.body().prettyPrint());
And it prints the whole response to the console, which is very messy since the response is in HTML format. I read that prettyPeek() does not log the response, but prettyPeek() returns a Response rather than a String, and even prettyPeek().toString() doesn't work in my code. Please tell me what snippet will work the same way as mine, but without logging to the console.
To parse HTML into a Document just parse the body:
Document document = Jsoup.parse(response.body());
And that's it.
Also, do you really need the response as a Response object?
You can get the Document by simply calling:
Document document = Jsoup.connect("http://example.com/").get();
Take a look at these very simple examples to see if there's a better way to do what you're trying to achieve:
https://jsoup.org/cookbook/input/load-document-from-url
I'm trying to fetch an entire webpage so I can extract some data. I'm using (or trying to use) HtmlUnit.
The result I want is the ENTIRE generated code produced from all sources, not the raw source code. I want a result like the 'Inspect element' window in Chrome. Any ideas? Is this even possible?
Should I use another library?
I'm posting a sample code that DIDN'T help me.
final WebClient webClient = new WebClient(BrowserVersion.CHROME);
final HtmlPage page = webClient.getPage("https://www.bet365.com");
System.out.println(page.asXml());
If you mean to extract all the data from a website's server/database (which is what it sounds like), then it isn't possible because those files are protected.
If you just want the source code, try this solution: How do you Programmatically Download a Webpage in Java
page.getWebResponse().getContentAsString() returns the content as returned from the server.
page.asXml() returns the XHTML of the page, after JavaScript modifications.
page.save(File) saves the page recursively with dependencies.
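For example, a small sketch (the URL and output file name are just placeholders) that exercises all three:

import java.io.File;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public void dumpPage() throws Exception {
    try (final WebClient webClient = new WebClient(BrowserVersion.CHROME)) {
        final HtmlPage page = webClient.getPage("https://www.example.com");
        String raw = page.getWebResponse().getContentAsString(); // bytes exactly as the server sent them
        String rendered = page.asXml();                          // XHTML after JavaScript has run
        page.save(new File("saved_page.html"));                  // page plus images/CSS/JS next to it
        System.out.println(raw.length() + " raw chars, " + rendered.length() + " rendered chars");
    }
}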
You can also extract all sources returned from the web server by intercepting the request/response:
new WebConnectionWrapper(webClient) {
    @Override
    public WebResponse getResponse(WebRequest request) throws IOException {
        WebResponse response = super.getResponse(request);
        if (request.getUrl().toExternalForm().contains("my_url")) {
            String content = response.getContentAsString();
            // change or save the content here
            WebResponseData data = new WebResponseData(content.getBytes(),
                    response.getStatusCode(), response.getStatusMessage(),
                    response.getResponseHeaders());
            response = new WebResponse(data, request, response.getLoadTime());
        }
        return response;
    }
};
I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
    String html = getUrl("http://www.wikipedia.org/" + url);
    Document doc = Jsoup.parse(html);
    String contentText = doc.select("*").first().text();
    System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestions. When fetching a general webpage that doesn't require HTTP header fields such as cookies or a user-agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This reads the webpage with a GET request. When you select elements with *, it matches every element in the document. Hence, doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, so the line above is redundant
System.out.println(doc.select("*").first() == doc); // checks whether they are equal; prints true
I assume you are just playing around to learn this API. Although selectors are very powerful, a good start would be the general document-manipulation methods, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the Document and parse it using your getUrl() function!
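For instance, a minimal sketch (the Wikipedia URL and the "p" selector are only illustrative) that prints the paragraph text instead of the raw #root element:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public static void scrapeTopicReadable() throws Exception {
    Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Web_scraping").get();
    // Select only <p> elements rather than "*", which matches every element
    for (Element p : doc.select("p")) {
        System.out.println(p.text());
    }
}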
I am using the NekoHTML framework with Xerces 2.11.0 to parse an HTML document.
But I am having a problem with this simple code:
DOMParser parser = new DOMParser();
System.out.println(parser.getClass().toString());
InputSource url = new InputSource("http://www.cbgarden.org");
try {
    parser.parse(url);
    Document document = parser.getDocument();
    System.out.println(document.hasChildNodes());
    System.out.println(document.getBaseURI());
    System.out.println(document.getNodeName());
    System.out.println(document.getNodeValue());
} catch (Exception e) {
    e.printStackTrace();
}
Now I put here the result of the multiple prints:
class org.cyberneko.html.parsers.DOMParser
true
http://www.cbgarden.org
document
null
So my question is: what could be wrong?
No exception is thrown, and I am following the usage rules defined for NekoHTML. My build path libraries are in this order of precedence:
nekohtml.jar
nekohtmlSamples.jar
xercesImpl.jar
xercesSamples.jar
xml-apis.jar
I guess your question is about the null?
The document node has no value; it only has child nodes (like <html>, which contains <head> and <body>).
But if you want the whole page source as a String, you can simply download it using a URL and its openStream() method.
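A minimal sketch of that approach (assuming Java 9+ for readAllBytes) could look like this:

import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

static String fetchSource(String address) throws Exception {
    URL url = new URL(address);
    try (InputStream in = url.openStream()) {
        // Read the raw page source and decode it as UTF-8
        return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
}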
I know how to read the HTML code of a website. For example, the following Java code reads all the HTML code from http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html, a page that lists all the football players of F.C. Barcelona.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class ReadWebPage {
public static void main(String[] args) throws IOException {
String urltext = "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html";
URL url = new URL(urltext);
BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {
// Process each line.
System.out.println(inputLine);
}
in.close();
}
}
OK, but now I need to work with the HTML code. I need to obtain the names ("Valdés, Victor", "Pinto, José Manuel", etc.) and the positions (Goalkeeper, Defence, Midfield, Striker) of each of the team's players. For example, I need to create an ArrayList<String> PlayerNames and an ArrayList<String> PlayerPositions and put all the players' names and positions into these lists.
How can I do it? I can't find a code example on Google that does this... code examples are welcome.
thanks
I would recommend using HtmlUnit, which will give you access to the DOM tree of the HTML page, and even execute JavaScript in case the data are dynamically put in the page using AJAX.
You could also use JSoup: no JavaScript, but more lightweight and support for CSS selectors.
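As a rough illustration with Jsoup, something like the sketch below could work; the CSS selectors ("table.squad tr", "td.name a", "td.position") are placeholders, since the real class names depend on the page's markup and need to be checked with your browser's inspector:

import java.util.ArrayList;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class PlayerScraper {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect(
                "http://www.transfermarkt.co.uk/en/fc-barcelona/startseite/verein_131.html").get();
        ArrayList<String> playerNames = new ArrayList<>();
        ArrayList<String> playerPositions = new ArrayList<>();
        // Placeholder selectors - adjust them to the actual squad table structure
        for (Element row : doc.select("table.squad tr")) {
            Element name = row.selectFirst("td.name a");
            Element position = row.selectFirst("td.position");
            if (name != null && position != null) {
                playerNames.add(name.text());
                playerPositions.add(position.text());
            }
        }
        System.out.println(playerNames);
        System.out.println(playerPositions);
    }
}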
I think the best approach is to first purify the HTML into valid XHTML and then apply an XSL transformation; to retrieve specific pieces of information you can use XPath expressions. The best available HTML tag balancer is, in my opinion, NekoHTML (http://nekohtml.sourceforge.net/).
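A sketch of that approach using NekoHTML's DOMParser plus the JDK's XPath API might look like this (the URL and the //A expression are only examples; NekoHTML upper-cases element names by default):

import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NekoXPathExample {
    public static void main(String[] args) throws Exception {
        // NekoHTML balances the tags so the DOM can be queried like XML
        DOMParser parser = new DOMParser();
        parser.parse(new InputSource("http://www.example.com"));
        Document document = parser.getDocument();

        // Example expression: the text of every link (element names are upper-case by default)
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList links = (NodeList) xpath.evaluate("//A/text()", document, XPathConstants.NODESET);
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(links.item(i).getNodeValue());
        }
    }
}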
You might like to take a look at htmlparser
I used this for something similar.
Usage something like this:
Parser fullWebpage = new Parser("WEBADDRESS");
NodeList nl = fullWebpage.extractAllNodesThatMatch(new TagNameFilter("<insert html tag>"));
NodeList tds = nl.extractAllNodesThatMatch(new TagNameFilter("a"), true);
String data = tds.toHtml();
Java has its own, built-in HTML parser. A positive feature of this parser is that it is error tolerant and will assume some tags even if they are missing or misspelled. While called swing.text.html.Parser, it actually has nothing to do with Swing (and with text only as much as HTML is text). Use ParserDelegator. You need to write a callback for use with this parser; otherwise it is not complex to use. A code example (written as a ParserDelegator test) can be found here. Some say it is a reminder of the HotJava browser. Its only problem is that it does not seem to have been upgraded to the most recent versions of HTML.
The simple code example would be
import java.io.Reader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

Reader reader; // read HTML from somewhere
HTMLEditorKit.ParserCallback callback = new MyCallBack(); // extend HTMLEditorKit.ParserCallback
ParserDelegator delegator = new ParserDelegator();
delegator.parse(reader, callback, false);
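For completeness, a minimal sketch of what the MyCallBack placeholder might look like (it simply collects text and prints link targets; ParserCallback is a class, so you extend it):

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;

class MyCallBack extends HTMLEditorKit.ParserCallback {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void handleText(char[] data, int pos) {
        // Collect the text content of the page
        text.append(data).append('\n');
    }

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.A) {
            System.out.println("link: " + attrs.getAttribute(HTML.Attribute.HREF));
        }
    }

    public String getText() {
        return text.toString();
    }
}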
I've found a link that is just what you were looking for:
http://tiny-url.org/work_with_html_java