I have a tiny problem with Jsoup.
I am using this snippet of code:
Response response = getAccount(); // GET Response in HTML format
Document document = Jsoup.parse(response.body().prettyPrint());
And it prints the entire response to the console, which is very messy, as the response is in HTML format. I read that prettyPeek() does not log the response, but its return type is Response rather than String, and even prettyPeek().toString() doesn't make my code work. Please tell me what snippet will work the same way as mine, but without logging to the console.
To parse the HTML into a Document, just parse the body as a string:
Document document = Jsoup.parse(response.body().asString());
And that's it.
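For context, here is a minimal end-to-end sketch of the no-logging flow (the URL is a placeholder, since getAccount() isn't shown):
import io.restassured.RestAssured;
import io.restassured.response.Response;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseWithoutLogging {
    public static void main(String[] args) {
        // Stand-in for getAccount(); any endpoint returning HTML works
        Response response = RestAssured.get("http://example.com/account");
        // asString() just returns the body; nothing is written to the console
        Document document = Jsoup.parse(response.body().asString());
        System.out.println(document.title());
    }
}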
Also, do you really need the response as a Response object?
You can get the Document by simply calling:
Document document = Jsoup.connect("http://example.com/").get();
Take a look at these very simple examples to see if there's a better way to do what you're trying to achieve:
https://jsoup.org/cookbook/input/load-document-from-url
I am getting a response from my RestAssured call with ContentType text/plain;charset=UTF-8.
I searched the internet but could not find a clean way to get the content out of the message; the snippet below works, but it is not very nice:
String content = response.then().extract().body().htmlPath().get().children().get(0).toString();
How can I extract the contents of this response a little more nicely?
You can directly use .asString() to get the body content of the response, regardless of the returned Content-Type.
You can try something like this:
response.then().extract().body().asString();
or directly:
response.asString();
You can try:
String responseAsString = given().get("http://services.groupkt.com/state/get/IND/UP").asString();
System.out.println("My responseAsString is: " + responseAsString);
You can also extract values from the response using JsonPath, even though the Content-Type is ContentType.TEXT:
Response response = given().contentType(ContentType.TEXT).get("http://localhost:3000/posts");
// Convert the response to a String and index into the JSON array
JsonPath jsonPath = new JsonPath(response.asString());
String title = jsonPath.getString("title[2]");
String author = jsonPath.getString("author[2]");
This is the way it worked for me:
import io.restassured.response.Response;
response.getBody().prettyPrint();
I really need help extracting Microdata embedded in HTML5. My purpose is to get structured data from a webpage, just like Google's tool: http://www.google.com/webmasters/tools/richsnippets. I have searched a lot but found no workable solution.
Currently I use the any23 library, but I can't find any documentation, only javadocs, which don't provide enough information for me.
I use any23's Microdata Extractor but get stuck at the third parameter: "org.w3c.dom.Document in". I can't parse HTML content into a w3c DOM. I have tried JTidy as well as JSoup, but the DOM objects in these libraries are not compatible with the Extractor constructor. In addition, I also have doubts about the 2nd parameter of the Microdata Extractor.
I hope that someone can help me with any23 or suggest another library that can solve this extraction issue.
Edit: I found the solution myself by using the same approach as the any23 command line tool. Here is the snippet of code:
// Fetch the page over HTTP ("value" holds the page URL)
HTTPDocumentSource doc = new HTTPDocumentSource(DefaultHTTPClient.createInitializedHTTPClient(), value);
InputStream documentInputInputStream = doc.openInputStream();
// TagSoup turns real-world HTML into a well-formed org.w3c.dom.Document
TagSoupParser tagSoupParser = new TagSoupParser(documentInputInputStream, doc.getDocumentURI());
Document document = tagSoupParser.getDOM();
// Write the extracted microdata as JSON into an in-memory buffer
ByteArrayOutputStream byteArrayOutput = new ByteArrayOutputStream();
MicrodataParser.getMicrodataAsJSON(document, new PrintStream(byteArrayOutput));
String result = byteArrayOutput.toString("UTF-8");
These lines of code only extract the microdata from HTML and write it out in JSON format. I tried to use MicrodataExtractor, which can change the output format to others (RDF, Turtle, ...), but its input document seems to only accept XML format: it throws "Document didn't start" when I pass in an HTML document.
If anyone has found the way to use MicrodataExtractor, please leave the answer here.
Thank you.
XPath is generally the way to consume HTML or XML.
Have a look at: How to read XML using XPath in Java
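For instance, here is a minimal sketch using the JDK's built-in XPath support (it assumes the markup is already well-formed XHTML; raw HTML would first need a tolerant parser such as TagSoup, as in the snippet above):
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public static void printItemprops(String xhtml) throws Exception {
    Document doc = DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xhtml)));
    XPath xpath = XPathFactory.newInstance().newXPath();
    // Select every element that carries a microdata itemprop attribute
    NodeList nodes = (NodeList) xpath.evaluate("//*[@itemprop]", doc, XPathConstants.NODESET);
    for (int i = 0; i < nodes.getLength(); i++) {
        System.out.println(nodes.item(i).getAttributes().getNamedItem("itemprop").getNodeValue()
                + " = " + nodes.item(i).getTextContent());
    }
}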
I have used this method to retrieve a webpage into an org.jsoup.nodes.Document object:
myDoc = Jsoup.connect(myURL).ignoreContentType(true).get();
How should I write this object to a HTML file?
The methods myDoc.html(), myDoc.text() and myDoc.toString() don't output all elements of the document.
Some information in a JavaScript element can be lost when parsing it, for example the "timestamp" in the source of an Instagram media page.
Use doc.outerHtml().
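For instance, a minimal sketch using only the JDK (the filename is a placeholder):
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// outerHtml() serializes the whole document, including the root <html> element
Files.write(Paths.get("page.html"), myDoc.outerHtml().getBytes(StandardCharsets.UTF_8));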
import java.io.File;
import java.nio.charset.StandardCharsets;
import org.apache.commons.io.FileUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public void downloadPage() throws Exception {
    final Connection.Response response = Jsoup.connect("http://www.example.net").execute();
    final Document doc = response.parse();
    final File f = new File("filename.html");
    // outerHtml() serializes the complete document, including the <html> element
    FileUtils.writeStringToFile(f, doc.outerHtml(), StandardCharsets.UTF_8);
}
Don't forget to catch exceptions. Add the dependency or download the Apache commons-io library for an easy and quick way to save files in UTF-8 format.
The fact that some elements are ignored must be due to Jsoup's attempt at normalization.
In order to get the server's exact output without any form of normalization, use this:
Connection.Response html = Jsoup.connect("PUT_URL_HERE").execute();
System.out.println(html.body());
I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
    String html = getUrl("http://www.wikipedia.org/" + url);
    Document doc = Jsoup.parse(html);
    String contentText = doc.select("*").first().text();
    System.out.println(contentText);
}
It appears to get all the information, but in the wrong format!
I appreciate any help given.
Thanks in advance.
Here are some suggestions for you. When fetching a general webpage that doesn't require HTTP header fields such as cookies or a user-agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This method reads the webpage using a GET request. When you select elements with *, it returns every element, that is, all elements of the document. Hence, calling doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, so the line above is redundant
System.out.println(doc.select("*").first() == doc); // checks whether they are equal; prints true
I assume that you are just playing around to learn this API. Although selectors are very powerful, a good start would be the general document manipulation functions, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the Document and parse it using your getUrl() function!
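For example, a more targeted selection could look like this (the #mw-content-text selector matches Wikipedia's current article layout, so treat it as an illustration that may change):
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Java").get();
// Grab only the paragraphs of the article body instead of selecting "*"
String contentText = doc.select("#mw-content-text p").text();
System.out.println(contentText);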
I am reading from an API provided by a company, but one of the accounts I am getting data from returns around 22,000 JSON objects. It reads fine with small amounts of data, I would say up to 8,000 records, but beyond that I get issues such as the JSON not being well formatted, besides the problem of being able to read the response at all.
The response comes this way:
<?xml version="1.0" encoding="utf-8"?>
<string xmlns="http://ywers.com">
[{"Name":"Edward", "LastName":"Jones", "Address":"{accepted}"}
,{"Name":"Carlos", "LastName":"Ramirez", "Address":"{Rejected}"}, ....... 22k more records here]</string>
I asked for some help on here earlier about the best way to do this, and I got a response suggesting I read it using an XML parser and then a JSON parser. I am using GSON.
String XML = "<Your XML Response>";
XPathExpression xpath = XPathFactory.newInstance()
.newXPath().compile("/*[local-name()='string']");
String json = xpath.evaluate(new InputSource(new StringReader(XML)));
and then
JSONArray jsonRoot = new JSONArray(json.trim());
System.out.println(jsonRoot.getJSONObject(0).getString("Address")); // {accepted}
The problem I am having with this approach is that it throws errors when reading the XML; it starts reading, but after a while it stops with errors like:
java.lang.OutOfMemoryError
at java.lang.AbstractStringBuilder.enlargeBuffer(AbstractBuilder.java:94)
at java.lang.StringBuffer.append(StringBuffer.java:219)
at org.apache.harmony.xml.dom.CharacterDataImpl.appendData(CharacterDataImpl.java:43)
......
I would appreciate any advice on how to proceed with this; I am kind of new to Android.
I don't know who would wrap 22k objects inside an XML string, but apparently someone is doing that. From my experience, your OutOfMemoryError happens because you try to convert the whole response to a string, and the response is too big to be handled that way. I recommend that you stream the JSON data instead. You can stream the JSON directly from the response InputStream you get from your HTTP post, but you need to skip the XML part first, by reading past the wrapper before handing the stream to the JSON parser.
Before I used the streaming API from Google GSON, I also got OOM errors because my JSON data was very big (many images and sounds in Base64 encoding), but with GSON streaming I overcame that error because it reads the data per token instead of all at once. As an alternative you can also use the Jackson JSON library; I think it also has a streaming API, and using it is almost the same as my implementation with Google GSON. I hope my answer helps, and if you have another question about it, feel free to ask in a comment :)
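To make that concrete, here is a rough sketch of the approach (the stream scanning and the "Address" field come from the sample response above; it assumes the JSON inside the <string> element is not XML-escaped, so treat it as a starting point, not production code):
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.nio.charset.StandardCharsets;
import com.google.gson.stream.JsonReader;

public static void printAddresses(InputStream body) throws Exception {
    // Skip the XML wrapper by scanning forward to the '[' that starts the JSON array
    PushbackReader reader = new PushbackReader(
            new InputStreamReader(body, StandardCharsets.UTF_8));
    int c;
    while ((c = reader.read()) != -1 && c != '[') {
        // discard the XML prolog and the opening <string> tag
    }
    reader.unread('[');

    // Stream the array one token at a time so the 22k objects never sit in memory at once
    JsonReader json = new JsonReader(reader);
    json.beginArray();
    while (json.hasNext()) {
        json.beginObject();
        while (json.hasNext()) {
            if (json.nextName().equals("Address")) {
                System.out.println(json.nextString());
            } else {
                json.skipValue();
            }
        }
        json.endObject();
    }
    json.endArray();
    json.close();
}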