I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
    String html = getUrl("http://www.wikipedia.org/" + url);
    Document doc = Jsoup.parse(html);
    String contentText = doc.select("*").first().text();
    System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestions for you. When fetching a general web page that doesn't require HTTP header fields (such as a cookie or user agent) to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This method reads the web page using a GET request. When you select elements with *, it matches every element, that is, all the elements of the document. Hence, calling doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first());           // prints the whole document
System.out.println(doc);                               // also prints the whole document, so the line above is redundant
System.out.println(doc.select("*").first() == doc);    // checks whether they are equal; prints true
I am assuming that you are just playing around to learn this API. Although selectors are very powerful, a good start would be trying the general document manipulation methods, e.g. doc.getElementsByTag().
However, on my local machine I was able to fetch the document and parse it successfully using your getUrl() function.
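For example, here is a minimal sketch of that approach (the article URL and user agent string are just examples; some sites reject requests that carry no user agent):

// fetch and parse in one step
Document doc = Jsoup.connect("https://en.wikipedia.org/wiki/Java_(programming_language)")
        .userAgent("Mozilla/5.0")
        .get();

// print each paragraph on its own line instead of flattening the whole tree
for (Element p : doc.getElementsByTag("p")) {
    System.out.println(p.text());
}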
I am using the JSoup library in Java to sanitize input to prevent XSS attacks. It works well for simple inputs like <script>alert('vulnerable')</script>.
Example:
String data = "<script>alert('vulnerable')</script>";
data = Jsoup.clean(data, , Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data); //StringEscapeUtils from apache-commons lib
System.out.println(data);
Output: ""
However, if I tweak the input to the following, JSoup cannot sanitize the input.
String data = "<<b>script>alert('vulnerable');<</b>/script>";
data = Jsoup.clean(data, , Whitelist.none());
data = StringEscapeUtils.unescapeHtml4(data);
System.out.println(data);
Output: <script>alert('vulnerable');</script>
This output is obviously still prone to XSS attacks. Is there a way to fully sanitize the input so that all HTML tags are removed from it?
Not sure if this is the best solution, but a temporary workaround would be parsing the raw text into a Document and then cleaning the combined text of the Document element and all its children:
String unsafe = "<<b>script>alert('vulnerable');<</b>/script>";
Document doc = Jsoup.parse(unsafe);
String safe = Jsoup.clean(doc.text(), Whitelist.none());
System.out.println(safe);
Wait for someone else to come up with the best solution.
The problem is that you are unescaping the safe HTML that jsoup has produced. The output of the Cleaner is HTML. The none safelist passes no tags through, only the text nodes, as HTML.
So the input:
<<b>script>alert('vulnerable');<</b>/script>
run through the Cleaner returns:
&lt;script&gt;alert('vulnerable');&lt;/script&gt;
which is perfectly safe to present as HTML. See https://try.jsoup.org/~hfn2nvIglfl099_dVxLQEPxekqg
Just don't include the unescape line.
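In other words, a minimal sketch of the corrected flow (same input, no unescape step):

String data = "<<b>script>alert('vulnerable');<</b>/script>";

// The none whitelist strips every tag and keeps only the text nodes,
// escaped as HTML entities, which a browser renders as literal text
// rather than executing.
String safe = Jsoup.clean(data, Whitelist.none());
System.out.println(safe); // &lt;script&gt;alert('vulnerable');&lt;/script&gt;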
I am parsing an HTML file with Jsoup, like this:
Document doc = Jsoup.parse(file,"UTF-8");
Elements eles = doc.getElementsByTag("style");
How can I get the lineNumber of eles[0] in the file?
There is no way to do it with the Jsoup API. I have checked their source code: org.jsoup.parser.Parser maintains no position information for an element in the original input.
Please refer to the sources on Grep Code.
Given that Jsoup is built for extracting and manipulating data, I don't believe they will add such a feature in the future, as it is ambiguous what an element's position is after manipulation, and costly to maintain the actual references.
There is no direct way, but there is an indirect way.
Once you find the point of interest, such as an attribute, simply add a token as HTML before the element and write the document out to a temporary file. The next step is to search for the token using text editing tools.
The code is as follows.
Step-1:
// walk every element and inspect its attributes
for (Element element : doc.getAllElements()) {
    for (Attribute attribute : element.attributes()) {
        String myAttr = attribute.getKey();
        if (myAttr.equals("some-attribute-name-of-interest")) {
            System.out.println(attribute.getKey() + "::" + attribute.getValue());
            // mark the element so it can be located later in the output file
            element.before("<!-- My Special Token : ABCDEFG -->");
        }
    }
}
Step-2:
// write the doc back to a temporary file
// see: How to save a jsoup document as text file
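For instance, a minimal sketch of this step (the file name temp.html and UTF-8 are just assumptions):

try (java.io.PrintWriter out = new java.io.PrintWriter("temp.html", "UTF-8")) {
    out.print(doc.outerHtml());
}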
Step-3:
The last step is to search for "My Special Token : ABCDEFG" in the output file using a text editing tool.
jsoup is a nice library. I thought this would help others.
I'm using Jsoup to get HTML from websites. I'm using
String url="http://www.example.com";
Document doc=Jsoup.connect(url).get();
this code to get the HTML. But when I use some Turkish letters in the link, like this:
String url="http://www.example.com/?q=Türkçe";
Document doc=Jsoup.connect(url).get();
Jsoup sends the request like this: "http://www.example.com/?q=Trke"
So I can't get the correct result. How can I solve this problem?
Working solution: if the encoding is UTF-8, then simply use
Document document = Jsoup.connect("http://www.example.com")
.data("q", "Türkçe")
.get();
with the result:
URL=http://www.example.com?q=T%C3%BCrk%C3%A7e
For a custom encoding, this can be used:
String encodedUrl = URLEncoder.encode("http://www.example.com/q=Türkçe", "ISO-8859-3");
String encodedBaseUrl = URLEncoder.encode("http://www.example.com/q=", "ISO-8859-3");
String query = encodedUrl.replace(encodedBaseUrl, "");
Document doc= Jsoup.connect("http://www.example.com")
.data("q", query)
.get();
Unicode characters are not allowed in URLs per the specification. We're used to seeing them because browsers display them in address bars, but they are not sent to servers in that form.
You have to URL-encode your path before passing it to Jsoup.
Jsoup.connect("http://www.example.com").data("q", "Türkçe"), as proposed by MariuszS, does just that.
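If you would rather build the URL by hand, here is a minimal sketch with java.net.URLEncoder (UTF-8 assumed) that encodes only the query value, never the whole URL:

String query = URLEncoder.encode("Türkçe", "UTF-8"); // T%C3%BCrk%C3%A7e
Document doc = Jsoup.connect("http://www.example.com/?q=" + query).get();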
I found this on Google: http://turkishbasics.com/resources/turkish-characters-html-codes.php
Maybe you can add it like this:
String url="http://www.example.com/?q=Türkçe";
Document doc=Jsoup.connect(url).get();
I am using the JSoup library to extract text from web pages. The following is my code:
Document doc;
try {
    URL url = new URL(text);
    doc = Jsoup.parse(url, 70000);
    Elements paragraphs = doc.select("p");
    for (Element p : paragraphs) {
        textField.append(p.text());
        textField.append("\n");
    }
}
catch (Exception ex) {
    ex.printStackTrace();
}
Here, I am only able to get text from p tags, but I need all the text on the page. How can I do that? It might be by looping through the nodes, but I have only just started using JSoup.
Try this:
String text = Jsoup.parse(new URL("https://www.google.com"), 10000).text();
System.out.println(text);
Here, 10000 is the timeout in milliseconds.
You might want to use Boilerpipe, because you don't need HTML parsing, only text extraction. This should be faster and less CPU-intensive.
Example:
URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
Taken from: https://code.google.com/p/boilerpipe/wiki/QuickStart
Perhaps a different approach entirely. I'm not sure exactly what you are doing, so I don't know exactly what you need, but you could grab the entire raw source of the web page and then use a regexp to delete all of the HTML tags. I did something similar (though in PHP) for a text-to-code-ratio tool once.
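A rough sketch of that idea (assuming the page source is already in a string; regex stripping like this is fragile and only suits simple markup):

String html = "<html><body><p>Hello <b>World</b></p></body></html>";

// Delete anything that looks like a tag. Comments, scripts and malformed
// markup will trip this up on real-world pages.
String text = html.replaceAll("<[^>]*>", "");
System.out.println(text); // Hello World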
How can I convert HTML to text while keeping the line breaks produced by elements like br, p, and div, possibly using NekoHTML or any decent HTML parser?
Example:
Hello<br/>World
to:
Hello\n
World
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}

private static StringBuffer buildStringFromNode(Node node) {
    StringBuffer buffer = new StringBuffer();

    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }

    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }

    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }

    return buffer;
}
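A hypothetical usage example, assuming the markup is in a local file page.html:

try (InputStream in = new FileInputStream("page.html")) {
    System.out.println(htmlToText(in));
}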
w3m -dump -no-cookie input.html > output.txt
I did find a relatively clever solution in html2txt: THE ASCIINATOR, which does an admirable job of producing nroff-like output (e.g. like man ls rendered on a terminal). It produces output in the Markdown style that Stack Overflow uses as input.
For moderately complex pages like this one, the output is somewhat scattered, as it tries mightily to turn a non-linear layout into something linear. The output from less complicated markup is pretty readable.
If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.
HTML to Text:
I am taking this statement to mean that all HTML formatting, except line breaks, will be abandoned.
What I have done for such a venture is to use a regexp to detect any set of tag enclosures.
If the value within the tags is br or br/, a line break is inserted; otherwise the tag is discarded.
It works only for simple HTML pages. Tables will obviously be linearised.
I had been thinking of detecting the title value between the title tag enclosure, so that the converter automatically places the title at the top of the page. That needs a little more algorithm, but my time is better spent with ...
I have been reading about using the Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or why text, when I could do PDF? But you have to get a Google account if you don't already have one.
Google docs data download/export
Google docs data api for java
Does it matter what language you use? You could always use pattern matching. Basically, HTML line-break tags (br, p, div, ...) can be replaced with "\n", and all the other tags removed. You could store the tags in an array so you can easily check them as you go through the HTML file. Then any other tags, and all the end tags (/p, ...), can be replaced with an empty string, giving you your result.
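A rough sketch of that idea in Java (the sample string and tag list are just examples; a plain regex like this only holds up for simple markup):

String html = "Hello<br/>World<p>Paragraph</p>";

String text = html
        .replaceAll("(?i)<(br|p|div)\\b[^>]*>", "\n") // line-break tags become newlines
        .replaceAll("<[^>]*>", "");                   // every other tag is dropped

System.out.println(text); // Hello / World / Paragraph on separate lines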