Get a part of a html file in java [duplicate] - java

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
RegEx match open tags except XHTML self-contained tags
I have a HTML file looking like this:
<html>
<head>
<title>foobar</title>
</head>
<body>
bla bla<br />
{[CONTAINER]}
Hello
{[/CONTAINER]}
</body>
</html>
How do I get the "Hello" in the container out of the rest of the HTML file? I did this in PHP years ago, and I remember a regex function that calls a defined class function and passes the content of the container as a parameter.
Can someone tell me how to do this in Java?

You can use a regex that matches everything between {[CONTAINER]} and {[/CONTAINER]}. Example:
// Lookbehind for the open tag. A lookaround is zero-width, so the tag itself is not included in the match.
String open = "(?<=\\{\\[CONTAINER\\]\\})";
// Content between open and close tag.
String inside = ".*?";
// Lookahead for the close tag (also zero-width, so it is not part of the match).
String close = "(?=\\{\\[/CONTAINER\\]\\})";
// Final regex
String regex = open + inside + close;
String text = "<html>..."; // your string here
// Usage
Matcher matcher = Pattern.compile(regex, Pattern.DOTALL).matcher(text);
while (matcher.find()) {
String content = matcher.group().trim();
System.out.println(content);
}
But you must be careful: it works only for the exact tags {[CONTAINER]} and {[/CONTAINER]}. Attributes on these custom tags aren't supported.
You must also be aware that it doesn't handle HTML tags in any specific way. If there are HTML tags between your CONTAINER tags, they will be included in the result.
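The question mentions a PHP regex function that hands each match to a callback. A rough Java counterpart, assuming Java 9 or later, is Matcher.replaceAll with a function argument. The following is only a sketch: the class name ContainerCallback and the helper replaceContainers are made up for illustration, and a capturing group is used here instead of the lookarounds above so that the tags themselves get replaced too.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class ContainerCallback {
    // Group 1 captures the content between the custom tags.
    private static final Pattern CONTAINER = Pattern.compile(
            "\\{\\[CONTAINER\\]\\}(.*?)\\{\\[/CONTAINER\\]\\}", Pattern.DOTALL);

    // Replaces every container (tags included) with whatever the handler
    // returns for the trimmed content.
    static String replaceContainers(String text, UnaryOperator<String> handler) {
        return CONTAINER.matcher(text).replaceAll(
                // quoteReplacement keeps '$' and '\' in the handler's result literal
                m -> Matcher.quoteReplacement(handler.apply(m.group(1).trim())));
    }

    public static void main(String[] args) {
        String text = "<body>bla bla<br />\n{[CONTAINER]}\nHello\n{[/CONTAINER]}\n</body>";
        System.out.println(replaceContainers(text, String::toUpperCase));
    }
}
```

The handler can be any method reference or lambda, so a method of your own class can receive the container content as its parameter.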

You can parse the HTML using jsoup.

Why do you want to use Java?
You can simply use the DOM API with JavaScript:
document.getElementById("id_container").firstChild.data; // beware of \n char
or in a less efficient way:
document.getElementById("id_container").innerHTML;
However, if your file is built on the server, you can also use the same API from Java:
http://docs.oracle.com/javase/6/docs/api/org/w3c/dom/package-summary.html
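As a rough illustration of the server-side variant, here is a minimal sketch that uses the JDK's built-in XML parser together with org.w3c.dom. It assumes the input is well-formed XHTML (the sample in the question is); ordinary tag-soup HTML will be rejected. The class name BodyText and the method bodyText are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; assumes well-formed XHTML input.
class BodyText {
    // Returns the text content of the first <body> element, or "" if there is none.
    static String bodyText(String xhtml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            NodeList bodies = doc.getElementsByTagName("body");
            return bodies.getLength() == 0 ? "" : bodies.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>foobar</title></head>"
                + "<body>bla bla<br />{[CONTAINER]}Hello{[/CONTAINER]}</body></html>";
        System.out.println(bodyText(xhtml));
    }
}
```

The {[CONTAINER]} markers are plain text as far as the DOM is concerned, so they still have to be located in the returned string by other means, for example with the regex from the first answer.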

Related

separate html coded string and normal string

I want to split a single string containing normal text as well as HTML code into an array of strings. I searched on Google but did not find any suitable suggestion.
Consider the following string:
blahblahblahblahblahblahblahblahblahblah
blahblah First para blahblahblahblah
blahblahblahblahblahblahblahblahblahblah
<html>
<body>
<p>hello</p>
</body>
</html>
blahblahblahblahblahblahblahblahblahblah
blahblah Second Para lahblahblahblahblah
blahblahblahblahblahblahblahblahblahblah
this becomes:
s[0]=whole first para
s[1]=html code
s[2]=whole second para
Is this possible with jsoup, or do I need another API?
It is possible with jQuery. Here below is a code snippet.
var str = "blablabla <html><body><p>hello</p></body></html> blabla";
var parsedHTML = $.parseHTML(str);
var myList = [];
// loop through parsed text and put it into text based on its type
$.each(parsedHTML, function( i, el ) {
if (el.nodeType < 3) myList[i] = el.nodeName;
else myList[i] = el.data;
});
// use myList ...
Here is a fiddle which shows that it works. The only disadvantage is that the <html> and <body> tags are consumed by the parser and do not appear in parsedHTML.
jsfiddle example
This can be done with JSoup
Simple use example:
String html = "<html><head><title>First parse</title></head>"
+ "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
Then you can navigate the DOM structure to extract the information.
update
To get the text with all the tags you could wrap the entire string in <meta> ... </meta> tags; then parse it, access the individual components, and finally serialize the components back into strings.
Alternatively if you believe the code is well formed (with matching beginning and end tags) you could search for the first match of the regex
/<(html|body)\s*>/
Depending on what the contents of the first tag (match) are you then look for the last occurrence of the matching close tag.
More manual, more prone to error, not recommended. But since you have a non-standard problem, it seems you might want a non-standard solution.
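The manual approach described above (find the first opening <html> or <body> tag, then the last matching close tag) could be sketched in Java roughly as follows. This is only an illustration under the same assumption of well-formed markup; the class name HtmlSplitter and the method split are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes the embedded HTML has matching open and close tags.
class HtmlSplitter {
    private static final Pattern OPEN_TAG =
            Pattern.compile("<(html|body)\\s*>", Pattern.CASE_INSENSITIVE);

    // Returns {textBefore, html, textAfter}, or a one-element array holding
    // the unchanged input when no embedded HTML block is found.
    static String[] split(String input) {
        Matcher open = OPEN_TAG.matcher(input);
        if (!open.find()) {
            return new String[] { input };
        }
        // Look for the LAST close tag that matches the first open tag.
        Matcher close = Pattern
                .compile("</" + open.group(1) + "\\s*>", Pattern.CASE_INSENSITIVE)
                .matcher(input);
        close.region(open.end(), input.length());
        int closeEnd = -1;
        while (close.find()) {
            closeEnd = close.end();
        }
        if (closeEnd < 0) {
            return new String[] { input };
        }
        return new String[] {
            input.substring(0, open.start()).trim(),
            input.substring(open.start(), closeEnd),
            input.substring(closeEnd).trim()
        };
    }

    public static void main(String[] args) {
        String input = "First para\n<html>\n<body>\n<p>hello</p>\n</body>\n</html>\nSecond para";
        for (String part : split(input)) {
            System.out.println("[" + part + "]");
        }
    }
}
```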

JAVA Regex to remove html tag and content [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
How to remove HTML tag in Java
RegEx match open tags except XHTML self-contained tags
I want to remove specific HTML tag with its content.
For example, if the html is:
<span style='font-family:Verdana;mso-bidi-font-family:
"Times New Roman";display:none;mso-hide:all'>contents</span>
If the tag contains "mso-*", it must remove the whole tag (opening, closing and content).
As Dave Newton pointed out in his comment, a html parser is the way to go here. If you really want to do it the hard way, here's a regex that works:
String html = "FOO<span style='font-family:Verdana;mso-bidi-font-family:"
+ "\"Times New Roman\";display:none;mso-hide:all'>contents</span>BAR";
// regex matches every opening tag that contains 'mso-' in an attribute name
// or value, the contents and the corresponding closing tag
String regex = "<(\\S+)[^>]+?mso-[^>]*>.*?</\\1>";
String replacement = "";
System.out.println(html.replaceAll(regex, replacement)); // prints FOOBAR
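One limitation of the regex as written is that "." does not match line breaks by default, so an element whose content spans several lines is left untouched. A possible variant, sketched below with a precompiled pattern and the DOTALL and CASE_INSENSITIVE flags, covers that case. It still cannot cope with nested elements of the same name, which is one more reason to prefer a parser. The class name MsoStripper is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; a regex-based sketch, not a substitute for an HTML parser.
class MsoStripper {
    // Opening tag with "mso-" somewhere in its attributes, then the content,
    // then the closing tag with the same name (backreference \1).
    private static final Pattern MSO_ELEMENT = Pattern.compile(
            "<(\\w+)\\b[^>]*?mso-[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    static String strip(String html) {
        return MSO_ELEMENT.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "FOO<span style='font-family:Verdana;mso-hide:all'>line one\n"
                + "line two</span>BAR";
        System.out.println(strip(html));
    }
}
```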

How to parse HTML and get CSS styles

I need to parse HTML and find corresponding CSS styles. I can parse HTML and CSS separataly, but I can't combine them. For example, I have an XHTML page like this:
<html>
<head>
<title></title>
</head>
<body>
<div class="abc">Hello World</div>
</body>
</html>
I have to search for "hello world" and find its class name, and after that I need to find its style from an external CSS file. Answers using Java, JavaScript, and PHP are all okay.
Use the jsoup library, an HTML parser for Java.
For example, you can do something like this:
String html="<<your html content>>";
Document doc = Jsoup.parse(html);
Element ele = doc.getElementsContainingOwnText("Hello World").first().clone(); // first element whose own text contains "Hello World"
Set<String> classNames = ele.classNames(); // class names of that element ("class" is a reserved word, so it cannot be a variable name)
You can explore the library further to fit your needs.
Similar question: Can jQuery get all CSS styles associated with an element? CSS optimizers may also do what you want; take a look at unused-css.com, which is an online tool but also lists other tools.
As I understand it, you are able to parse the style sheet from the external file, which makes your task easy to solve. First, parse the HTML file with jsoup, which supports a jQuery-like selector syntax that makes complicated HTML files easier to handle. Then check this previous solution for parsing the CSS file. I am not going to give a full solution; these libraries do the heavy lifting internally, and the only thing you have to write is the glue code that combines the two.
Using Java java.util.regex
String s = "<body>...<div class=\"abc\">Hello World</div></body>";
Pattern p = Pattern.compile("<div.+?class\\s*?=\\s*['\"]?([^ '\"]+).*?>Hello World</div>", Pattern.CASE_INSENSITIVE);
Matcher m = p.matcher(s);
if (m.find()) {
System.out.println(m.group(1));
}
prints abc
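Once the class name is known, the remaining step from the question is to look up its rule in the external style sheet. The following is a deliberately naive sketch that only handles simple rules of the form ".abc { ... }" in CSS text that has already been read into a string. It ignores comments, @media blocks, and specificity, so a real CSS parser is the safer choice. The names CssRuleLookup and declarationsFor are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; handles only simple ".className { ... }" rules.
class CssRuleLookup {
    // Returns the declarations of the first rule whose selector ends
    // with ".className", or an empty Optional if no such rule exists.
    static Optional<String> declarationsFor(String css, String className) {
        Pattern rule = Pattern.compile(
                "\\." + Pattern.quote(className) + "\\s*\\{([^}]*)\\}");
        Matcher m = rule.matcher(css);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        String css = ".abc { color: red; font-weight: bold; }\n.other { color: blue; }";
        System.out.println(declarationsFor(css, "abc").orElse("(no rule found)"));
    }
}
```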

Regular expression for getting HREF based on span tag [duplicate]

I have a requirement where I need to get the last HREF in the HTML code, that is, the HREF in the footer of the page.
Is there a regular expression that does this directly?
No regex, use the :last jQuery selector instead.
demo :
foo
bar
var link = $("a:last");
You could use plain JavaScript for this (if you don't need it to be a jQuery object):
var links = document.links;
var lastLink = links[links.length - 1];
var lastHref = lastLink.href;
alert(lastHref);
JS Fiddle demo.
Disclaimer: the above code only works using JavaScript; as HTML itself has no regex, or DOM manipulation, capacity. If you need to use a different technology please leave a comment or edit your question to include the relevant tags.
It's not a good idea to parse HTML with regular expressions. Have a look at HtmlParser to parse HTML.
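If the page has to be processed in Java and a regex is acceptable despite the caveat above, one option is to iterate over all matches and keep the last one. The sketch below assumes quoted href values inside <a> tags; the class name LastHref and the method lastHref are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; assumes href values are quoted with ' or ".
class LastHref {
    // Group 1 is the quote character, group 2 the href value.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href of the last <a> tag in the document, if any.
    static Optional<String> lastHref(String html) {
        Matcher m = HREF.matcher(html);
        String last = null;
        while (m.find()) {
            last = m.group(2);
        }
        return Optional.ofNullable(last);
    }

    public static void main(String[] args) {
        String html = "<a href=\"/foo\">foo</a><p>text</p><a class='x' href='/bar'>bar</a>";
        System.out.println(lastHref(html).orElse("(no link found)"));
    }
}
```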

how to convert HTML text to plain text? [duplicate]

This question already has answers here:
Remove HTML tags from a String
(35 answers)
Closed 1 year ago.
I have to parse the description from a URL, and the parsed content contains a few HTML tags. How can I convert it to plain text?
Yes, Jsoup is the better option. Do the following to convert the whole HTML text to plain text:
String plainText = Jsoup.parse(your_html_text).text();
Just getting rid of HTML tags is simple:
// replace all occurrences of one or more HTML tags with optional
// whitespace in between with a single space character
String strippedText = htmlText.replaceAll("(?s)<[^>]*>(\\s*<[^>]*>)*", " ");
But unfortunately the requirements are never that simple:
Usually, <p> and <div> elements need a separate handling, there may be cdata blocks with > characters (e.g. javascript) that mess up the regex etc.
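To make the point about separate handling concrete, here is a rough sketch that removes script and style elements together with their content, turns <br> and closing <p>/<div> tags into line breaks, and only then strips the remaining tags. It does not decode entities or handle CDATA sections, and the class name TagStripper is hypothetical.

```java
// Hypothetical helper; a regex-based sketch that ignores entities and CDATA.
class TagStripper {
    static String toPlainText(String html) {
        return html
                // drop <script> and <style> elements including their content
                .replaceAll("(?is)<(script|style)\\b.*?</\\1\\s*>", "")
                // line breaks and the end of a paragraph or div become newlines
                .replaceAll("(?i)<br\\s*/?>|</(p|div)\\s*>", "\n")
                // remove every remaining tag
                .replaceAll("<[^>]*>", "")
                // collapse runs of spaces and tabs
                .replaceAll("[ \\t]+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<div>Hello <b>World</b></div><p>Second<br/>line</p>";
        System.out.println(toPlainText(html));
    }
}
```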
You can use this single line to remove the html tags and display it as plain text.
htmlString=htmlString.replaceAll("\\<.*?\\>", "");
Use Jsoup.
Add the dependency
<dependency>
<!-- jsoup HTML parser library # https://jsoup.org/ -->
<groupId>org.jsoup</groupId>
<artifactId>jsoup</artifactId>
<version>1.13.1</version>
</dependency>
Now in your java code:
public static String html2text(String html) {
return Jsoup.parse(html).wholeText();
}
Just call html2text with the HTML text, and it will return the plain text.
Use a HTML parser like htmlCleaner
For detailed answer : How to remove HTML tag in Java
I'd recommend parsing the raw HTML through jTidy which should give you output which you can write xpath expressions against. This is the most robust way I've found of scraping HTML.
If you want the text as a browser would display it, use:
import net.htmlparser.jericho.*;
import java.util.*;
import java.io.*;
import java.net.*;
public class RenderToText {
public static void main(String[] args) throws Exception {
String sourceUrlString="data/test.html";
if (args.length==0)
System.err.println("Using default argument of \""+sourceUrlString+'"');
else
sourceUrlString=args[0];
if (sourceUrlString.indexOf(':')==-1) sourceUrlString="file:"+sourceUrlString;
Source source=new Source(new URL(sourceUrlString));
String renderedText=source.getRenderer().toString();
System.out.println("\nSimple rendering of the HTML document:\n");
System.out.println(renderedText);
}
}
I hope this also helps with rendering tables the way a browser displays them.
Thanks,
Ganesh
I needed a plain text representation of some HTML which included FreeMarker tags. The problem was handed to me with a JSoup solution, but JSoup was escaping the FreeMarker tags, thus breaking the functionality. I also tried htmlCleaner (sourceforge), but that left the HTML header and style content (tags removed).
http://stackoverflow.com/questions/1518675/open-source-java-library-for-html-to-text-conversion/1519726#1519726
My code:
return new net.htmlparser.jericho.Source(html).getRenderer().setMaxLineLength(Integer.MAX_VALUE).setNewLine(null).toString();
The maxLineLength ensures lines are not artificially wrapped at 80 characters.
The setNewLine(null) uses the same new line character(s) as the source.
I use HTMLUtil.textFromHTML(value)
from
<dependency>
<groupId>org.clapper</groupId>
<artifactId>javautil</artifactId>
<version>3.2.0</version>
</dependency>
Using Jsoup, I got all the text in the same line.
So I used the following block of code to parse HTML and keep new lines:
private String parseHTMLContent(String html) {
String result = html.replaceAll("\\<.*?\\>", "\n"); // every tag becomes a line break
String previousResult = "";
while(!previousResult.equals(result)){
previousResult = result;
result = result.replaceAll("\n\n","\n");
}
return result;
}
Not the best solution but solved my problem :)
