I am trying to search a .txt file that contains HTML. I need to find specific HTML tags in the file and then grab the few characters that follow each one. I am new to Java, but am willing to learn whatever I need to.
For example, say I have the code <span class="date">Apr 13</span> and all I need is the date (Apr 13). How do I go about doing this?
Thanks a lot!
Have a look at the String class docs and try to find a method that searches within a string.
Since you said you are getting it from an HTML file, have a look at Jsoup, an HTML parser that makes searching HTML documents a lot easier.
With jsoup, you can do it like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
// selects every span; use "span.date" to match only <span class="date">
Elements spans = doc.select("span");
for (Element element : spans) {
    System.out.println(element.html());
}
Try this:
Matcher m = Pattern.compile(">(.*?)<").matcher(s);
while (m.find()) {
    // text between a '>' and the next '<'
    String text = m.group(1);
}
If you want something basic (which I thought might suit you, as you are new), you can use this:
if (s.indexOf("span class=\"date\"") != -1)
    s = s.substring(s.indexOf(">") + 1, s.lastIndexOf("<"));
But this answer is specific to your question rather than a general solution.
String yourString = "<span class=\"date\">Apr 13</span>";
String date = yourString.split("class=\"date\">")[1].split("</sp")[0];
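For the specific case in the question, a regular expression anchored on the class attribute sits somewhere between the generic >(.*?)< pattern and the manual index arithmetic above. The following is only a minimal sketch using the standard library. The class name DateSpanExtractor and the method extractDates are made up for illustration, and it assumes the tag is written exactly as <span class="date"> with no other attributes and no nested tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class DateSpanExtractor {
    // Captures the text between <span class="date"> and the next </span>.
    // Assumption: no extra attributes and no nested tags inside the span.
    private static final Pattern DATE_SPAN =
            Pattern.compile("<span class=\"date\">(.*?)</span>");

    static List<String> extractDates(String html) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE_SPAN.matcher(html);
        while (m.find()) {
            dates.add(m.group(1).trim());
        }
        return dates;
    }

    public static void main(String[] args) {
        String html = "<p>Posted <span class=\"date\">Apr 13</span> by me</p>";
        System.out.println(extractDates(html)); // prints [Apr 13]
    }
}
```

If the markup is less regular than this (extra attributes, single quotes, line breaks inside the tag), a parser such as Jsoup is likely the safer choice.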
I am using the JSoup library to extract text from web pages. Following is my code:
Document doc;
try {
    URL url = new URL(text);
    doc = Jsoup.parse(url, 70000);
    Elements paragraphs = doc.select("p");
    for (Element p : paragraphs) {
        textField.append(p.text());
        textField.append("\n");
    }
}
catch (Exception ex) {
    ex.printStackTrace();
}
Here, I am only able to get text from "p" tags, but I need all the text in the page. How can I do it? It might involve looping through nodes, but I have only just started using JSoup.
Try this:
String text = Jsoup.parse(new URL("https://www.google.com"), 10000).text();
System.out.println(text);
Here, 10000 is the timeout in milliseconds.
You might want to use Boilerpipe, because you don't need full HTML parsing, only text extraction. This should be faster and use less CPU.
Example:
URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
Taken from: https://code.google.com/p/boilerpipe/wiki/QuickStart
Perhaps a different approach entirely. I'm not sure exactly what you are doing, so I don't know exactly what you need. But you could grab the raw source of the whole web page and then use a regexp to delete all of the HTML tags. I did something similar (though in PHP) for a text-to-code ratio tool once.
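A rough sketch of that strip-the-tags idea in Java, using only String.replaceAll. The class and method names are hypothetical, and the approach is deliberately naive: it does not decode entities, and it will mangle pages that have a < or > inside scripts, comments, or attribute values.

```java
// Hypothetical helper, not part of any library.
class TagStripper {
    static String stripTags(String html) {
        return html
                // replace anything that looks like a tag with a space;
                // [^>]* also matches line breaks, so multi-line tags are covered
                .replaceAll("<[^>]*>", " ")
                // collapse the runs of whitespace left behind
                .replaceAll("\\s+", " ")
                .trim();
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1>\n<p>Some <b>bold</b> text.</p></body></html>";
        System.out.println(stripTags(html)); // prints: Title Some bold text.
    }
}
```

Replacing each tag with a space rather than an empty string keeps words from adjacent elements from running together.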
I have a string that contains some values, as given below. I want to replace the HTML img tags containing a specific customerId with some new text. I tried a small Java program, which is not giving me the expected output. Here is the program info.
My input string is
String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
+ "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
Regex is
String regex = "(?s)\\<img.*?customerId=3340.*?>";
The new text I want to put inside the input string is
EDIT Starts:
String newText = "<img src=\"getCustomerNew.do\">";
EDIT ENDS:
Now I am doing
String outputText = inputText.replaceAll(regex, newText);
The output is
Starting here.. Replacing Text ..Ending here
but my expected output is
Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p><p>someText</p>Replacing Text ..Ending here
Please note that in my expected output only the img tag containing customerId=3340 is replaced with Replacing Text. I don't understand why both img tags are being replaced in the actual output.
You've got "wildcard"/"any" patterns (.*?) in there, and . matches any character, including >. Making them reluctant does not stop them from crossing tag boundaries: the match starts at the first <img, and the first .*? keeps extending, past the end of that tag, until it reaches customerId=3340 in the second tag. The final > then closes the match, so both tags are swallowed.
You should be able to fix this by changing the .*? parts to something like [^>]+ so that the matching won't span past the first > character.
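As a sketch of that fix, here is the question's input run through a pattern that uses a negated character class instead of the dot. The class and method names are invented for illustration. It uses [^>]* (zero or more) rather than [^>]+, and adds a (?!\d) check so that 3340 does not also match 33401; both are my own choices, not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class ImgTagReplacer {
    static String replaceCustomerImg(String input, String customerId, String replacement) {
        // [^>]* cannot cross a '>', so the whole match stays inside one tag.
        String regex = "<img[^>]*customerId=" + Pattern.quote(customerId) + "(?!\\d)[^>]*>";
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return input.replaceAll(regex, Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String inputText = "Starting here.. <img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123/></p>"
                + "<p>someText</p><img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456/> ..Ending here";
        // Only the second img tag is replaced; the first one is left untouched.
        System.out.println(replaceCustomerImg(inputText, "3340", "Replacing Text"));
    }
}
```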
Parsing HTML with regular expressions is bound to cause pain.
As other people have told you in the comments, HTML is not a regular language, so manipulating it with regexes is usually painful. Your best option is to use an HTML parser. I haven't used Jsoup before, but after googling a little it seems you need something like:
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;
public class MyJsoupExample {
    public static void main(String args[]) {
        String inputText = "<html><head></head><body><p><img src=\"getCustomers.do?custCode=2&customerId=3334&param1=123\"/></p>"
            + "<p>someText <img src=\"getCustomers.do?custCode=2&customerId=3340&param2=456\"/></p></body></html>";
        Document doc = Jsoup.parse(inputText);
        // select img elements whose src attribute contains the given substring
        Elements myImgs = doc.select("img[src*=customerId=3340]");
        for (Element element : myImgs) {
            // recent jsoup versions take just the text: new TextNode("my replaced text")
            element.replaceWith(new TextNode("my replaced text", ""));
        }
        System.out.println(doc.toString());
    }
}
Basically the code gets the list of img nodes with a src attribute containing a given string
Elements myImgs = doc.select("img[src*=customerId=3340]");
then loops over the list and replaces those nodes with some text.
UPDATE
If you don't want to replace the whole img node with text but instead you need to give a new value to its src attribute then you can replace the block of the for loop with:
element.attr("src", "my new value");
or if you want to change just a part of the src value then you can do:
String srcValue = element.attr("src");
element.attr("src", srcValue.replace("getCustomers.do", "getCustomerNew.do"));
which is very similar to what I posted in this thread.
What happens is that your regex starts matching at the first img tag, then consumes everything (regardless of whether the quantifier is greedy or not) until it finds customerId=3340, and then continues consuming everything until it finds >.
If you want it to consume just the img with customerId=3340, think about what makes this tag different from the other tags it might match.
In this particular case, one possible solution is to look at what is behind that img tag using a look-behind operator (which doesn't consume a match). This regex will work:
String regex = "(?<=</p>)<img src=\".*?customerId=3340.*?>";
I have a file with the following contents
<div name="hello"></div>
and I need Java code that will read this file and print only the word hello.
This is what I have come up with:
while ((line = bf.readLine()) != null)
{
    linecount++;
    int indexfound = line.indexOf("<div name");
    if (indexfound > -1) {
        Pattern p = Pattern.compile("\"([^\"]*)\"");
        Matcher m = p.matcher(line);
        while (m.find()) { System.out.println(m.group(1)); }
    }
}
bf.close();
}} catch (IOException e) {
e.printStackTrace();
}}}
The problem with this code is that if I change the file so that it looks like this
<div name="hello" value="hi"></div>
then hi also gets printed, but I want only hello to be printed.
While the best answer to questions like this is to advocate the use of an HTML or XML parser to extract attributes, it's worthwhile pointing out the issue in your question.
You are getting both attribute values printed because your pattern matches anything surrounded by double quotes, and you print every match inside the while loop.
Furthermore, you only want the value of the name attribute. So your pattern should be formed as follows:
Pattern.compile("name=\"([^\"]*)\"");
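To make the difference concrete, here is a small self-contained sketch that applies a name-specific pattern to the line from the question. The class and method names are hypothetical, and the pattern is slightly stricter than the one above: it also requires the attribute to sit inside a <div ...> tag, which is my own assumption about what is wanted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class NameAttributeReader {
    // Group 1 is the value of the first name="..." attribute inside a <div ...> tag.
    private static final Pattern DIV_NAME =
            Pattern.compile("<div[^>]*?\\bname=\"([^\"]*)\"");

    // Returns the name value, or null when the line has no matching div.
    static String firstDivName(String line) {
        Matcher m = DIV_NAME.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(firstDivName("<div name=\"hello\" value=\"hi\"></div>")); // prints hello
    }
}
```

Using a single find() instead of a while loop returns only the first match, so later quoted values on the same line are ignored.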
You can use any of the DOM libraries available in Java, such as JDOM or dom4j. The file you are trying to parse is an XML (HTML) file, and these DOM libraries were developed to parse such files. It's easy to get started; follow the tutorials on this site: http://www.java-samples.com/showtutorial.php?tutorialid=152
Your code might work for the change you have made to the XML, but you may need to change your code again with every further change to the XML. This can be exhausting, so I suggest that the best way to read an XML document in Java is to use a parser. In Java there are two kinds of parser I have come across recently: DOM and SAX. You should find a lot of tutorials and examples on the internet; these are where I learned a lot:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
and
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
I am using a regex to remove HTML tags. I do something like this:
result.replaceAll("\\<.*?\\>", "");
However, it does not get rid of the img tags in the HTML. Any idea what a good way to do that is?
If you cannot use HTML parsers/cleaners, then I would at least suggest using the Pattern.DOTALL flag to take care of multi-line HTML blocks. Consider code like this:
String str = "123 <img \nsrc='ping.png'>abd foo";
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sb, "");
}
matcher.appendTail(sb);
System.out.println("Output: " + sb);
OUTPUT
Output: 123 abd foo
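The same effect can be had with replaceAll by putting the inline flag (?s), which is equivalent to Pattern.DOTALL, at the start of the regex. The sketch below contrasts the two behaviours; the class and method names are made up for illustration.

```java
// Hypothetical helper, not part of any library.
class MultiLineTagRemover {
    // Without DOTALL, '.' stops at a line break, so a tag that spans
    // several lines is never matched and stays in the output.
    static String removeTagsSingleLine(String html) {
        return html.replaceAll("<.*?>", "");
    }

    // (?s) turns on DOTALL inline, letting '.' match line breaks as well.
    static String removeTags(String html) {
        return html.replaceAll("(?s)<.*?>", "");
    }

    public static void main(String[] args) {
        String str = "123 <img \nsrc='ping.png'>abd foo";
        System.out.println(removeTagsSingleLine(str).equals(str)); // prints true: nothing was removed
        System.out.println(removeTags(str));                       // prints 123 abd foo
    }
}
```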
To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.
Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.
Another suggestion is HtmlCleaner
I'm just reiterating what others have said already, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are a thousand similar questions about this on SO. Use a proper HTML parser; it will make your life so much easier, and it is far more robust and reliable. Take a look at Dom4j, Jericho, JSoup. Please.
So, a piece of code for you.
I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and quite straightforward to use.
Basically it looks like this:
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
...
String html; /* read your HTML into variable 'html' */
String result=null;
....
try {
Parser p = new Parser(html);
NodeList nodes = p.parse(null);
result = nodes.asString();
} catch (ParserException e) {
e.printStackTrace();
}
That will give you plain text stripped of tags (but entities such as &amp; will not be decoded). And of course you can do plenty more with this library, like applying filters and visitors, iterating over nodes, and so on.
Use an HTML parser instead. Iterate over the resulting object, print it however you like, and you will get the best result.
I was able to do this with the code snippet below.
String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");
I used the above regex to clean the img tags in my RSS content.
How may I convert HTML to text while keeping line breaks (produced by elements like br, p, div, ...), possibly using NekoHTML or any decent enough HTML parser?
Example:
Hello<br/>World
to:
Hello\n
World
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}
private static StringBuffer buildStringFromNode(Node node) {
    StringBuffer buffer = new StringBuffer();
    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }
    return buffer;
}
w3m -dump -no-cookie input.html > output.txt
I did find a relatively clever solution in html2txt: THE ASCIINATOR which does an admirable job of producing nroff like output (e.g. like man ls run on a terminal). It produces output in the Markdown style that StackOverflow uses as input.
For moderately complex pages like this page, the output is somewhat scattered as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.
If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.
HTML to Text:
I am taking this statement to mean that all HTML formatting, except line-breaks, will be abandoned.
What I have done for such a venture is use a regexp to detect every tag.
If the name within the tag is br or br/, a line break is inserted; otherwise the tag is discarded.
It works only for simple HTML pages. Tables will obviously be linearised.
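A minimal Java sketch of that single-pass idea follows: every tag is detected by one regex, br becomes a newline, and anything else is dropped. The class and method names are hypothetical. It assumes reasonably simple markup; comments, scripts, and entities are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
class BrAwareStripper {
    // Group 1 is the tag name, including a leading '/' for closing tags.
    private static final Pattern TAG = Pattern.compile("<\\s*(/?\\w+)[^>]*>");

    static String toText(String html) {
        Matcher m = TAG.matcher(html);
        // StringBuffer because appendReplacement accepts it on every Java version.
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase();
            // <br>, <br/> and <br /> become a newline; every other tag is discarded.
            m.appendReplacement(out, name.equals("br") ? "\n" : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toText("<p>Hello<br/>World</p>")); // prints Hello and World on separate lines
    }
}
```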
I had been thinking of detecting the text between the title tags, so that the converter automatically places the title at the top of the page. That needs a little more logic, but my time is better spent with ...
I am reading on using Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or, why text, when I could do pdf. But you have to get a Google account if you don't already have one.
Google docs data download/export
Google docs data api for java
Does it matter what language you use? You could always use pattern matching. Basically, you can replace the HTML line break tags (br, p, div, ...) with "\n" and remove all the other tags. You could store the line break tags in an array so you can easily check against them as you go through the HTML file. Then any other tags, and all the end tags (/p, ...), can be replaced with an empty string, which gives you your result.
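In Java, the array-of-tags idea could look roughly like the sketch below. The class name, method name, and the particular list of break tags are assumptions for illustration. As described above, only the opening forms of the listed tags produce a newline, while end tags and all other tags are removed.

```java
// Hypothetical helper, not part of any library.
class BlockTagConverter {
    // Tags treated as line breaks; this list is an assumption, extend as needed.
    private static final String[] BREAK_TAGS = {"br", "p", "div"};

    static String toText(String html) {
        // Builds br|p|div from the array.
        String alternatives = String.join("|", BREAK_TAGS);
        // Step 1: opening break tags (with or without attributes) become "\n".
        // \b stops "p" from matching the start of a longer name such as "pre".
        String withBreaks = html.replaceAll("(?i)<\\s*(?:" + alternatives + ")\\b[^>]*>", "\n");
        // Step 2: every remaining tag, including end tags, is removed.
        return withBreaks.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        System.out.println(toText("Hello<br/>World")); // prints Hello and World on separate lines
    }
}
```

A block that starts with p or div yields a leading newline, so calling trim() on the result may be worthwhile.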