I have a string like this:
<html>
<body>
<p>Hello</p>
</body>
</html>
and in Java I want to select part of it so that the resulting string looks like this:
<p>Hello</p>
How?
If you are dealing with HTML parsing then go for JSoup
String html = "<html><body><p>Hello</p></body></html>";
Document doc = Jsoup.parseBodyFragment(html);
Elements fragment = doc.select("p"); // p tag
System.out.println(fragment.outerHtml()); // outerHtml() keeps the <p> tags; html() would print only "Hello"
This one line should do it:
String text = input.replaceAll("(?s).*(<p>.*</p>).*", "$1");
Or to just get everything in the <body> tag, do this:
String text = input.replaceAll("(?s).*<body>(.*)</body>.*", "$1");
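Two things to keep in mind with the one-liners above: when the pattern does not match, replaceAll returns the input unchanged, and because the leading .* is greedy, an input with several paragraphs yields the last one. A minimal sketch of an alternative using Pattern and Matcher, which reports a missing match explicitly and returns the first paragraph; the class and method names are made up for illustration:

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FirstParagraph {
    // DOTALL lets '.' cross line breaks; the reluctant '.*?' stops at the first closing tag.
    private static final Pattern P_TAG = Pattern.compile("<p>.*?</p>", Pattern.DOTALL);

    /** Returns the first <p>...</p> block, or empty if the input contains none. */
    static Optional<String> firstParagraph(String input) {
        Matcher m = P_TAG.matcher(input);
        return m.find() ? Optional.of(m.group()) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>";
        System.out.println(firstParagraph(input).orElse("(no <p> found)")); // prints <p>Hello</p>
    }
}
```

This only handles a bare <p> tag without attributes or nesting; for anything beyond that a real parser is the safer choice.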
You can use JSoup; it will be quite handy for you.
I'm trying to find all elements inside this kind of html:
<body>
My text without tag
<br>Some title</br>
<img class="image" src="url">
My second text without tag
<p>Some Text</p>
<p class="MsoNormal">Some text</p>
<ul>
<li>1</li>
<li>2</li>
</ul>
</body>
I need to get all elements, including the parts without any tag. How can I get them?
P.S.: I need an array of "Element" objects, one for each element.
I'm not quite sure whether you are asking how to retrieve all the text within the HTML. To do that, you can simply do the following:
String html; // your html code
Document doc = Jsoup.parse(html); //parse the string
System.out.println(doc.text()); // get all the text from tags.
OUTPUT:
My text without tag Some title My second text without tag Some Text
Some text 1 2
In case you are using an HTML file, you can use the code below and retrieve each tag that you need. The API is Jsoup. You can find more examples at http://jsoup.org/
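File input = new File(htmlFilePath); // htmlFilePath: path to your HTML file
InputStream is = new FileInputStream(input);
String html = IOUtils.toString(is, "UTF-8"); // IOUtils is from Apache Commons IO; UTF-8 is assumed here
Document htmlDoc = Jsoup.parse(html);
Elements pElements = htmlDoc.select("p"); // all <p> elements
Element pElement1 = pElements.get(0); // the first <p>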
I have the following html:
<html>
<head>
</head>
<body>
<div id="content" >
<p>text <strong>text</strong> text <em>text</em> text </p>
</div>
</body>
</html>
How can I replace "text" with "word" in each tag using the Jsoup library?
I want to see:
<html>
<head>
</head>
<body>
<div id="content" >
<p>word <strong>word</strong> word <em>word</em> word </p>
</div>
</body>
</html>
Thank you for any suggestions!
UPD:
Thanks for the answers, but I found a more versatile way:
Element entry = doc.select("div").first();
Elements tags = entry.getAllElements();
for (Element tag : tags) {
for (Node child : tag.childNodes()) {
if (child instanceof TextNode && !((TextNode) child).isBlank()) {
System.out.println(child); //text
((TextNode) child).text("word"); //replace to word
}
}
}
Document doc = Jsoup.connect(url).get();
String str = doc.toString();
str = str.replace("text", "word");
Try it.
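One caveat with replacing on the serialized string: str.replace also rewrites matches inside tag names and attribute values. A rough sketch of a regex workaround, using a negative lookahead that skips any match followed by '>' before the next '<'; the class and method names are invented for illustration, and the heuristic breaks if the text itself contains a literal '>':

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ReplaceOutsideTags {
    /**
     * Replaces target with replacement, except where the match sits inside a tag,
     * i.e. where a '>' follows before any '<' or '>' (heuristic, not a parser).
     */
    static String replaceInText(String html, String target, String replacement) {
        String regex = Pattern.quote(target) + "(?![^<>]*>)";
        return html.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String html = "<p class=\"text\">text <strong>text</strong> text</p>";
        // Plain replace also changes the attribute value:
        System.out.println(html.replace("text", "word"));
        // The lookahead version leaves class="text" alone:
        System.out.println(replaceInText(html, "text", "word"));
    }
}
```

Walking the text nodes with Jsoup, as in the update above, avoids the problem altogether because it never touches markup.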
A quick search turned up this code:
Elements strongs = doc.select("strong");
Element f = strongs.first();
Element l = strongs.last();
... 1, siblings.lastIndexOf(l));
etc
First, what you want to do is understand how the library works and what features it contains, and then figure out how to use the library to do what you need. The code above seems to allow you to select a strong element, at which point you could update its inner text, but I'm sure there are a number of ways you could accomplish the same.
In general, most libraries which parse xml are able to select any given element in the document object model, or any list of elements, and either manipulate the elements themselves, or their inner text, attributes and the like.
Once you gain more experience working with different libraries, your starting point is to look for the documentation of the library to see what that library does. If you see a method that says it does something, that's what it does, and you can expect to use it to accomplish that goal. Then, instead of writing a question on Stack Overflow, you just need to parse the functionality of the library you're using, and figure out how to use it to do what you want.
String html = "<html> ...";
Document doc = Jsoup.parse(html);
Elements p = doc.select("div#content > p");
p.html(p.html().replaceAll("text", "word"));
System.out.println(doc.toString());
div#content > p selects the <p> elements that are direct children of the <div> element whose id is content.
If you want to replace the text only in <strong>text</strong>:
Elements p = doc.select("div#content > p > strong");
p.html(p.html().replaceAll("text", "word"));
I want to split a single string containing normal text as well as HTML code into an array of strings. I searched on Google but did not find any suitable suggestion.
Consider the following string:
blahblahblahblahblahblahblahblahblahblah
blahblah First para blahblahblahblah
blahblahblahblahblahblahblahblahblahblah
<html>
<body>
<p>hello</p>
</body>
</html>
blahblahblahblahblahblahblahblahblahblah
blahblah Second Para lahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
this becomes:
s[0]=whole first para
s[1]=html code
s[2]=whole second para
Is it possible with jsoup, or do I need another API?
It is possible with jQuery. Here is a code snippet.
var str = "blablabla <html><body><p>hello</p></body></html> blabla";
var parsedHTML = $.parseHTML(str);
myList = [];
// loop through parsed text and put it into text based on its type
$.each(parsedHTML, function( i, el ) {
if (el.nodeType < 3) myList[i] = el.nodeName;
else myList[i] = el.data;
});
// use myList ...
Here is a fiddle which shows you that it works. The only disadvantage is that the <html> and <body> tags themselves are consumed by the parser and do not appear in parsedHTML.
jsfiddle example
This can be done with JSoup
Simple use example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Then you can navigate the DOM structure to extract the information.
update
To get the text with all the tags you could wrap the entire string in <meta> ... </meta> tags; then parse it, access the individual components, and finally serialize the components back into strings.
Alternatively if you believe the code is well formed (with matching beginning and end tags) you could search for the first match of the regex
/<(html|body)\s*>/
Depending on what the contents of the first tag (match) are you then look for the last occurrence of the matching close tag.
More manual, more prone to error, not recommended. But since you have a non-standard problem it seems you might want a non-standard solution.
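The manual regex approach could be sketched roughly as follows, using only java.util.regex. It assumes well-formed input with a single HTML block, reuses the opening-tag regex from above (so tags with attributes are not recognized), and the class and method names are hypothetical:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HtmlSplitter {
    // Same opening-tag pattern as suggested above, compiled case-insensitively.
    private static final Pattern OPEN =
            Pattern.compile("<(html|body)\\s*>", Pattern.CASE_INSENSITIVE);

    /** Returns {before, markup, after}, or {input} if no opening tag with a matching close is found. */
    static String[] split(String input) {
        Matcher open = OPEN.matcher(input);
        if (!open.find()) {
            return new String[] { input };
        }
        // Look for the LAST occurrence of the matching closing tag after the opening tag.
        Matcher close = Pattern
                .compile("</" + open.group(1) + "\\s*>", Pattern.CASE_INSENSITIVE)
                .matcher(input)
                .region(open.end(), input.length());
        int end = -1;
        while (close.find()) {
            end = close.end();
        }
        if (end < 0) {
            return new String[] { input };
        }
        return new String[] {
            input.substring(0, open.start()).trim(),
            input.substring(open.start(), end),
            input.substring(end).trim()
        };
    }

    public static void main(String[] args) {
        String s = "First para\n<html>\n<body>\n<p>hello</p>\n</body>\n</html>\nSecond para";
        for (String part : split(s)) {
            System.out.println("[" + part + "]");
        }
    }
}
```

The surrounding text is trimmed so that s[0] and s[2] do not carry the line breaks next to the markup; drop the trim() calls if the whitespace matters.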
I have the following HTML...
<h3 class="number">
<span class="navigation">
6:55 <b>»</b>
</span>**This is the text I need to parse!**</h3>
I can use the following code to extract the text from h3 tag.
Element h3 = doc.select("h3").get(0);
Unfortunately, that gives me everything in that tag.
6:55 » This is the text I need to parse!
Can I use Jsoup to parse between different tags? Is there a best practice for doing this (regex?)
(regex?)
No, as you can read in the answers of this question, you can't parse HTML using a regular expression.
Try this:
Element h3 = doc.select("h3").get(0);
String h3Text = h3.text();
String spanText = h3.select("span").get(0).text();
String textBetweenSpanEndAndH3End = h3Text.replace(spanText, "");
No, JSoup wasn't made for this. It's supposed to parse something hierarchical. Searching for text which is between an end-tag and a start-tag, or the other way around, wouldn't make any sense for JSoup. That's what regular expressions are for.
But you should of course narrow it down as much as you can using JSoup first, before you shoot with a regex at the string.
Just use ownText()
@Test
void innerTextCase() {
String sample = "<h3 class=\"number\">\n" +
"<span class=\"navigation\">\n" +
"6:55 <b>»</b>\n" +
"</span>**This is the text I need to parse!**</h3>\n";
Assertions.assertEquals("**This is the text I need to parse!**",
Jsoup.parse(sample).select("h3").first().ownText());
}
My input is a plain text string and the requirement is to remove all HTML tags except a few specific tags like:
<p>
<li>
<u>
<li>
If these specific tags have attributes like class or id, I want to remove these attributes.
A few examples:
<a href="...">Link</a> -> Link
<p>paragraph</p> -> <p>paragraph</p>
<p class="class1">paragraph</p> -> <p>paragraph</p>
I have gone through this Remove HTML tags from a String but it does not answer my question completely.
Can it be handled by a set of regex's or could I make use of some library?
I tried JSoup and it seems to be able to handle all such cases. Here is some example code.
public String clean(String unsafe){
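Whitelist whitelist = Whitelist.none(); // start with no tags allowed (newer jsoup releases call this class Safelist)
whitelist.addTags("p", "br", "ul"); // tags to keep; their attributes are still stripped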
String safe = Jsoup.clean(unsafe, whitelist);
return StringEscapeUtils.unescapeXml(safe);
}
For input string
String unsafe = "<p class='p1'>paragraph</p>< this is not html > <a link='#'>Link</a> <![CDATA[<sender>John Smith</sender>]]>";
I get the following output, which is pretty much what I require.
<p>paragraph</p>< this is not html > Link <sender>John Smith</sender>
For simple HTML, this may be sufficient:
// remove any <script> blocks ((?s) lets . match across line breaks)
html = html.replaceAll("(?is)<script.*?</script>", "");
// this removes any attributes
html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>");
// this removes any tags (not li and p); \\b keeps <pre>, <link> etc. from being mistaken for <p> or <li>
html = html.replaceAll("(?i)<(?!/?(?:li|p)\\b)[^>]*>", "");
Hope that helps.
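For reference, the three steps can be wrapped into one method, sketched below under the same "simple HTML" assumption. The class and method names are made up, and the last pattern uses a word boundary so that tags such as <pre> or <link> are not kept by accident:

```java
final class SimpleTagCleaner {
    /** Keeps only bare <p> and <li> tags; every other tag and all attributes are removed. */
    static String clean(String html) {
        // 1. Drop <script> blocks; (?s) lets '.' span line breaks.
        html = html.replaceAll("(?is)<script.*?</script>", "");
        // 2. Strip attributes from opening tags: <p class="x"> becomes <p>.
        html = html.replaceAll("(?i)<([a-zA-Z0-9_-]*)(\\s[^>]*)>", "<$1>");
        // 3. Remove every tag that is not <p>, </p>, <li> or </li>.
        html = html.replaceAll("(?i)<(?!/?(?:li|p)\\b)[^>]*>", "");
        return html;
    }

    public static void main(String[] args) {
        String unsafe = "<p class=\"class1\">paragraph</p> <a href=\"#\">Link</a>";
        System.out.println(clean(unsafe)); // prints <p>paragraph</p> Link
    }
}
```

Unlike the Jsoup approach, this does not understand comments, CDATA sections or a stray '<' in the text, so it should not be relied on for untrusted input.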