Avoid spaceless concatenation with JSoup

Suppose I have a div as such:
<div>
This is a paragraph
written by someone
on the internet.
</div>
The problem is that when JSoup parses this, it puts it all on one line, so that when I call text() it reads as such:
This is a paragraphwritten by someoneon the internet.
Now, I realize this isn't really a JSoup problem, since the actual HTML doesn't contain a space. However, is there any way to use JSoup (perhaps some override, or an option I haven't seen) so that it adds a space between lines as it parses? I imagine it must be possible (I can inspect the element in Chrome, unselect word wrap, and get what I want), but I'm not sure JSoup can do this.
Any thoughts?

Can you provide a full example of your code? What version of jsoup are you using?
In the current version (1.6.1), this code:
Document doc = Jsoup.parse("<div>\n" +
"This is a paragraph\n" +
"written by someone\n" +
"on the internet.\n" +
"</div>");
System.out.println(doc.text());
Produces:
This is a paragraph written by someone on the internet.
I.e., \n (and \r\n etc) are converted to text as spaces.
Happy to fix or improve it, if I can replicate :)
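
The effect described above, where \n and \r\n end up as single spaces in the text, can be imitated with plain string handling. This is only a sketch of the observable behaviour, not jsoup's actual implementation; the class and method names are made up for illustration.

```java
class WhitespaceNormalizer {
    // Collapses every run of whitespace (spaces, tabs, \n, \r\n) into a
    // single space and trims the ends. Hypothetical helper, not a jsoup API.
    static String normalize(String raw) {
        return raw.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // The text content of the div from the question, line breaks included.
        String divBody = "\nThis is a paragraph\nwritten by someone\non the internet.\n";
        System.out.println(normalize(divBody));
    }
}
```

If the words still run together after a step like this, the line breaks were probably lost before the text reached the parser.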

The following post shows how to get everything, including the line breaks:
Removing HTML entities while preserving line breaks with JSoup
The answer in the following post, and the comment on it, describe another way:
Remove HTML tags from a String
And this one offers yet another way if you read through all the answers and comments:
How do I preserve line breaks when using jsoup to convert html to plain text?

Related

How can I count the Comments and the lines of a webpage using Jsoup?

Hello guys, I am trying to make an HTML parser using jsoup. How can I count the comments and the lines of an HTML document?

As already answered, you can iterate over every Node, check whether it is an instance of Comment, and count.
Counting the lines of the HTML can be done by splitting it at every line break. Note that doc.html() returns jsoup's re-serialized markup, so the result can differ from the line count of the original file:
int lines = doc.html().split("\\R").length;
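
If the raw page source is available as a String, both counts can also be approximated without a parser. The sketch below is an assumption-laden alternative, not the node walk described above: the class and method names are hypothetical, and the comment regex will miscount when something that looks like a comment appears inside a script or style block.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlCounts {
    // DOTALL lets a comment span several lines.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);

    // Counts lines in the raw source; \R matches \n, \r\n and \r (Java 8+).
    static int countLines(String html) {
        if (html.isEmpty()) {
            return 0;
        }
        return html.split("\\R").length;
    }

    // Counts <!-- ... --> sequences in the raw source (rough approximation).
    static int countComments(String html) {
        Matcher m = COMMENT.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String html = "<html>\n<!-- one -->\n<body><!-- two\nlines --></body>\n</html>";
        System.out.println(countLines(html) + " lines, " + countComments(html) + " comments");
    }
}
```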

Use the selector syntax API for the tags that are related to comments. (They are not the same tags on every website.) You can also find the tags you need to parse with your browser's dev tools (Firebug, Chrome DevTools, etc.).
Selector syntax for jsoup
Good luck...

Parsing HTML with Java with HTMLCleaner; How can I recognize "<" char within attributes?

I'm parsing some pretty bad html code. I've had good success, until I noticed that with some elements, the attributes contain "<".
Ex:
40
will result as
<a href="#Anchor-">
<ht-42368>40</ht-42368>
</a>
This will render fine in the browser, but HTMLCleaner thinks it is the start of a new tag. It adds a '">' before beginning the new tag, which I don't want.
What is the best way to fix this? I'm not sure if HTMLCleaner has any properties that I can configure to manage this. If not, how should I preprocess the HTML data to fix these characters?
EDIT: fixed example
EDIT: I'm thinking I could apply a replaceAll() with a regex before going into HTMLCleaner. Maybe something like ="[^"]*" to find the attribute values, then check whether each one contains "<" and, if it does, replace it with the escaped HTML entity (&lt;). Would that work?
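
The preprocessing idea from the last edit could be sketched as follows. It assumes attribute values are always double-quoted and never contain a double quote themselves, exactly as the ="[^"]*" pattern implies. The class name, method name, and sample input are hypothetical, and the lambda form of replaceAll needs Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeEscaper {
    // Double-quoted attribute values, as proposed in the question: ="[^"]*"
    private static final Pattern QUOTED_VALUE = Pattern.compile("=\"[^\"]*\"");

    // Escapes every '<' that appears inside a double-quoted attribute value.
    static String escapeAngleBrackets(String html) {
        return QUOTED_VALUE.matcher(html).replaceAll(
                // quoteReplacement keeps '$' and '\' in the value literal.
                m -> Matcher.quoteReplacement(m.group().replace("<", "&lt;")));
    }

    public static void main(String[] args) {
        // Hypothetical input; the question's original markup is not shown.
        String broken = "<a href=\"#Anchor-<ht-42368>\">40</a>";
        System.out.println(escapeAngleBrackets(broken));
    }
}
```

Text outside attribute values is left untouched, so real tags still reach HTMLCleaner unchanged.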

Regex remove only certain tags from html

I want to remove only a set of HTML tags (b, i, p, and their end tags) from a given HTML string.
Pattern p = Pattern.compile("<[^bip/](.*?)>");
However, this also removes the img tag because of the .*. What should I change to prevent removal of img?
EDIT: I'm doing this in an Android app. I know regex is the worst way, but the built-in spannable classes are not working as expected and I can't import a library just for HTML parsing. My purpose is just to detect whether other tags exist or not. Also, the HTML is pretty small (up to 10 lines max), so performance shouldn't be a problem.

This has been said a million times on Stack Overflow.
Don't process HTML, XHTML or XML with regexes. They aren't regular languages; they are context-free languages and can't be correctly processed with regular expressions.

Trying to process XML (or HTML) with a regex is a bad idea: you definitely want to use a parser.
In your case, you want to match:
<\s*/?\s*[bip]\s*>
This removes a single-letter tag (and its matching closing tag) and takes into account that some spaces are valid; you also need to run your regex as multiline.
It might work, but it's dangerous and you might get unexpected side effects.
EDIT:
I understood that you just want to remove the tags, not the actual content inside them.
EDIT2:
The current pattern matches the 3 tags, not their content. In a substitution regex (replacing with nothing), it would remove these formatting tags but not the embedded content.

If you want to remove only the <b>, <p>, <i> and </b>, </p>, </i> tags, then you can use the following regex:
(</?b>|</?p>|</?i>)

I am not sure I understand your regex; it seems very different from what you say you want. Use something like the one below:
<([bip])>.*?</\1>
And if possible, don't use the above or any other regex. There are various better ways to do this. Search here or on Google.

Most of the sample regexes only check that a tag starts with a certain name. For instance, you may want to remove <b> but not <br>. With most of the samples, if you add b to the tag list, <br> is removed as well. I use /<\/?(font|div|b)(\/|>|\s.*?>)/g. This regex avoids the starts-with issue: it finds only font, div and b, and does not match br.
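
Putting the answers together for the original b, i, p case, a Java version might look like the sketch below. It is a regex approximation for small, simple inputs only, with the usual caveats from the answers above; the class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class FormattingTagStripper {
    // Opening or closing b, i or p tag. The tag name must be followed by
    // whitespace (attributes), an optional '/', or '>', so <br> and <img>
    // are not matched.
    private static final Pattern B_I_P_TAG = Pattern.compile(
            "</?(?:b|i|p)(?:\\s[^>]*)?/?>", Pattern.CASE_INSENSITIVE);

    // Anything that still looks like a tag once b, i and p are removed.
    private static final Pattern ANY_TAG_START = Pattern.compile("</?[a-zA-Z]");

    // Removes the b, i and p tags but keeps the text they wrap.
    static String strip(String html) {
        return B_I_P_TAG.matcher(html).replaceAll("");
    }

    // True if the input contains tags other than b, i and p.
    static boolean hasOtherTags(String html) {
        return ANY_TAG_START.matcher(strip(html)).find();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>bold</b><br><img src=\"x.png\"> <i>it</i></p>";
        System.out.println(strip(html));
        System.out.println(hasOtherTags(html));
    }
}
```

The hasOtherTags helper covers the stated goal of only detecting whether other tags exist.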

Text Processing - Detecting if you are inside an HTML tag in Java

I have a program that does text processing on an HTML-formatted document, based on information from the same document without the HTML. Basically, I locate a word or phrase in the unformatted document, then find the corresponding word in the formatted document and alter its appearance using HTML tags to make it stand out (e.g. make it bold or change its color).
Here is my problem. Occasionally, I want to format a word or phrase that might also be part of an HTML tag (for example, I may want to format the word "font", but only if it is a word that is not inside an HTML tag). Is there an easy way to detect whether a string in a block of text is part of an HTML tag or not?
By the way, I can't just strip out the html tags in the document and do my processing on the remaining text because I need to preserve the html in the result. I need to add to the existing html but I need to reliably distinguish between strings that are part of tags and strings that are not.
Any ideas?
Thank you,
Elliott

You could do a few things:
1) Write a regular expression for what you're doing. There are plenty of prewritten ones you can find on Google.
2) Find a library to parse the document (e.g., http://htmlparser.sourceforge.net/) and only replace text.
The first is likely to be the fastest and easiest, but the second will be more reliable.

Use the following regex to detect whether it has HTML tags: "<.*?>"
And here you can learn how to effectively use regex in your Java code.
Happy coding ;)

If you have parsed the document into a DOM (which you have, if you are doing it correctly), then ask the current node for the parent tag that contains it, and keep walking up until you reach the tag you are looking for.
If you use some custom search or regex to parse HTML, then check the best answer to this question:
RegEx match open tags except XHTML self-contained tags (It has +4000 upvotes for a reason)
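
For the narrow task in the question (format a word only when it is not part of a tag), one possible regex sketch is to match either a whole tag or the word, and rewrite only the word matches. It assumes tags contain no '>' inside attribute values and ignores comments and script blocks. The names are hypothetical, and the lambda form of replaceAll needs Java 9 or later.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OutsideTagHighlighter {
    // Wraps each whole-word occurrence of `word` in <b>...</b>, skipping
    // occurrences that sit inside a tag such as <font color="red">.
    static String highlight(String html, String word) {
        // Group 1 captures a complete tag; the alternative is the word itself.
        Pattern tagOrWord = Pattern.compile(
                "(<[^>]*>)|\\b" + Pattern.quote(word) + "\\b");
        return tagOrWord.matcher(html).replaceAll(m -> Matcher.quoteReplacement(
                m.group(1) != null
                        ? m.group()                        // a tag: copy it unchanged
                        : "<b>" + m.group() + "</b>"));    // the word in text: format it
    }

    public static void main(String[] args) {
        System.out.println(highlight("<font color=\"red\">font size</font>", "font"));
    }
}
```

Because tags are consumed whole before the word alternative is tried, the existing markup is preserved as-is.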

How is web browser search implemented?

I want to implement, in a Java desktop application, searching for and highlighting multiple phrases in HTML files the way web browsers do it. HTML tags (within < and >) are ignored, but not all of them: a tag like <b> should not break a match, while a tag like <p> should. For example, when searching for each table, the text ...each <b>table</b> has name... will be highlighted, but the text ...has each</p><p> Table is... will not be highlighted, because the <p> tag interrupts the meaning of the text.
This is somehow implemented in web browsers; how can I get at that implementation? Or is there some source on the net? I tried Google, but without success :(

Instead of searching inside the actual HTML file the browsers search on the rendered output of that HTML.
Get a suitable HTML renderer and get its output as text. Then search on that text output using appropriate string searching algorithms.
The example that you highlighted in your question would result in a newline character in the rendered HTML output and hence a normal string searching algorithm will behave as you expect.

As Faisal stated, browsers search in rendered content only. For doing so you'll need to remove the HTML tags before doing the actual search:
This code might help you:
http://www.dotnetperls.com/remove-html-tags
Of course, you'll need to add some checks and exclusions, such as script tags and other things that are not rendered in the browser.
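
A minimal Java sketch of this tag-removal approach is shown below. It turns flow-breaking tags into newlines so that a phrase cannot match across them, then drops the remaining tags. The list of block tags is an assumption and is incomplete, script and style content is not excluded, and the class and method names are hypothetical. A real HTML parser or renderer would be more robust.

```java
import java.util.Locale;
import java.util.regex.Pattern;

class TagAwareSearch {
    // Tags assumed to interrupt the text flow (an incomplete, illustrative list).
    private static final Pattern BLOCK_TAG = Pattern.compile(
            "</?(?:p|div|br|li|tr|h[1-6])\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    // Block tags become newlines; inline tags such as <b> simply disappear.
    static String toSearchableText(String html) {
        String withBreaks = BLOCK_TAG.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withBreaks).replaceAll("");
    }

    // Case-insensitive phrase search on the tag-free text.
    static boolean containsPhrase(String html, String phrase) {
        return toSearchableText(html).toLowerCase(Locale.ROOT)
                .contains(phrase.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(containsPhrase("...each <b>table</b> has name...", "each table"));
        System.out.println(containsPhrase("...has each</p><p> Table is...", "each table"));
    }
}
```

This only answers whether a phrase occurs; highlighting in the original markup would also need a mapping from positions in the stripped text back to positions in the HTML.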

This seems pretty easy.
1) Search for the last word in the string.
2) Look at what's before the last word.
3) Decide if what's before the last word constitutes an interruption (<p>, <br />, <div>).
4) If interruption, continue
5) Else evaluate previous word against the search query.
I don't know if this is how browsers perform this operation, but this approach should work.

Try using the javax.swing.text.html package in Java.
