Problem with extracting values from XML file using Java and regex

I have a file with the following contents
<div name="hello"></div>
and I need Java code that will read this file and print only the word hello.
This is what I have come up with:
try {
    while ((line = bf.readLine()) != null) {
        linecount++;
        int indexfound = line.indexOf("<div name");
        if (indexfound > -1) {
            Pattern p = Pattern.compile("\"([^\"]*)\"");
            Matcher m = p.matcher(line);
            while (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }
    bf.close();
} catch (IOException e) {
    e.printStackTrace();
}
but the problem with this code is that if I change the file so that it looks like this
<div name="hello" value="hi"></div>
then hi also gets printed, but I want only hello to be printed.

While the best answer to questions like this is to advocate the use of an HTML or XML parser to extract attributes, it's worth pointing out the issue in your question.
You are getting both attribute values printed because your pattern matches anything surrounded by double quotes, and you print every match inside the while loop.
Furthermore, you only want the value of the name attribute, so your pattern should be formed as follows:
Pattern.compile("name=\"([^\"]*)\"");

You can use any of the DOM libraries available in Java, such as jDOM or Dom4j. The file you are trying to parse is an XML (HTML) file, and these DOM libraries are built to parse exactly such files. It's easy to get started. Follow the tutorial on this site: http://www.java-samples.com/showtutorial.php?tutorialid=152
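To illustrate the DOM approach (a sketch only, not taken from the linked tutorial: it uses the DOM parser that ships with the JDK rather than jDOM or Dom4j, and the class and file names are made up):
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ReadNameAttribute {
    public static void main(String[] args) throws Exception {
        // Parse the XML file into a DOM tree.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xml")); // hypothetical file holding the <div> from the question
        // Grab the first <div> element and read its name attribute.
        Element div = (Element) doc.getElementsByTagName("div").item(0);
        System.out.println(div.getAttribute("name")); // prints: hello
    }
}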

Your code might work for the change you have made in the XML; however, you may need to change your code with every other change to your XML. This can be exhausting, so I suggest that the best way to read an XML document in Java is to use a parser. In Java there are two parsers I have come across recently: DOM and SAX. You should find a lot of tutorials and examples on the internet; these are where I learned a lot:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
and
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
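For a rough idea of the SAX approach (a sketch only, not taken from the tutorials above; the class and file names are made up), the handler below prints the name attribute of every div element it encounters:
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class PrintDivNames {
    public static void main(String[] args) throws Exception {
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                // Called for every start tag; print the name attribute of <div> elements.
                if ("div".equals(qName)) {
                    System.out.println(attributes.getValue("name")); // prints: hello
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new File("input.xml"), handler); // hypothetical file name
    }
}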

Related

Replacing HWPFDocument paragraph text using Java results in strange output

I need to replace the text of an HWPFDocument paragraph in a .doc file if it contains a particular text, using Java. It replaces the text, but the process writes the output text in a strange way. Please help me rectify this issue.
Code snippet used:
public static HWPFDocument processChange(HWPFDocument doc)
{
    try
    {
        Range range = doc.getRange();
        for (int i = 0; i < range.numParagraphs(); i++)
        {
            Paragraph paragraph = range.getParagraph(i);
            if (paragraph.text().contains("Place Holder"))
            {
                String text = paragraph.text();
                paragraph.replaceText(text, "*******");
            }
        }
    }
    catch (Exception ex)
    {
        ex.printStackTrace();
    }
    return doc;
}
Input:
Place Holder
Textvalue1
Textvalue2
Textvalue3
Output:
*******Textvalue1
Textvalue1
Textvalue2
Textvalue3
The HWPF library is not in a perfect state for changing / writing .doc files (at least the last time I looked; some time ago I developed a custom variant of HWPF for my client which - among many other things - provides correct replace and save operations, but that library is not publicly available).
If you absolutely must use .doc files and Java, you may get away with replacing strings with strings of exactly the same length. For instance "12345" -> "abc__" (_ being spaces or whatever works for you). It might make sense to find the absolute location of the string to be replaced in the .doc file (using HWPF) and then change it in the .doc file directly (without using HWPF).
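As a minimal illustration of the same-length idea (plain string handling only, nothing HWPF-specific):
String original = "12345";
String replacement = "abc";
// Pad the replacement with spaces so it occupies exactly as many characters as the original.
String padded = String.format("%-" + original.length() + "s", replacement);
System.out.println("[" + padded + "]"); // prints: [abc  ]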
The Word file format is very complicated and "doing it right" is not a trivial task. Unless you are willing to spend many man-months, it will also not be possible to fix part of the library so that just saving works. Many data structures must be handled very precisely, and a single slip-up makes Word crash on the generated output file.

java get next few words in string

I am trying to search a .txt file that contains HTML. I need to search the file for specific HTML tags and then grab the next few characters of code. I am new to Java, but am willing to learn what I need to.
For example, say I have the code <span class="date">Apr 13</span> and all I need is the date (Apr 13). How do I go about doing this?
Thanks a lot!
Have a look at the String class docs and try to find the method to search a string.
Since you said you are getting it from an HTML file, you can have a look at Jsoup, which is an HTML parser and will make searching for strings in HTML documents a lot easier.
With Jsoup, you can do it like this:
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements spans = doc.select("span");
for (Element element : spans) {
    System.out.println(element.html());
}
Try this:
Matcher m = Pattern.compile(">(.*?)<").matcher(s);
while (m.find()) {
    String value = m.group(1); // for the example input this is "Apr 13"
    System.out.println(value);
}
If what you want is something basic (I thought it would be good as you are new), you can use this:
if (s.indexOf("span class=\"date\"") != -1)
    s = s.substring(s.indexOf(">") + 1, s.lastIndexOf("<"));
But this answer is more specific to your question than a broad one.
String yourString = "<span class=\"date\">Apr 13</span>";
String date = yourString.split("class=\"date\">")[1].split("</sp")[0];

How to extract all the text from a webpage?

I am using the JSoup library to extract text from webpages. The following is my code:
Document doc;
try {
    URL url = new URL(text);
    doc = Jsoup.parse(url, 70000);
    Elements paragraphs = doc.select("p");
    for (Element p : paragraphs) {
        textField.append(p.text());
        textField.append("\n");
    }
}
catch (Exception ex) {
    ex.printStackTrace();
}
Here I am only able to get the text from "p" tags, but I need all the text on the page. How can I do it? It might be by looping through nodes, but I have just started using JSoup.
Try this:
String text = Jsoup.parse(new URL("https://www.google.com"), 10000).text();
System.out.println(text);
Here, 10000 is the timeout in milliseconds.
You might want to use Boilerpipe, because you don't need HTML parsing, only text extraction. This should be faster and less CPU-intensive.
Example:
URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
Taken from: https://code.google.com/p/boilerpipe/wiki/QuickStart
Perhaps a different approach entirely: I'm not sure exactly what you are doing, so I don't know exactly what you need, but you could grab the raw source of the entire web page and then use a regexp to delete all of the HTML tags. I did something similar (though in PHP) for a text-to-code-ratio tool once.
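A rough sketch of that approach (the URL is a placeholder, Java 9 or newer is assumed for readAllBytes, and the tag-stripping regex is as fragile as the other answers warn):
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class RawTextDump {
    public static void main(String[] args) throws Exception {
        try (InputStream in = new URL("https://www.example.com/").openStream()) {
            // Grab the raw page source, then delete anything that looks like a tag.
            String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            String text = html.replaceAll("<[^>]*>", ""); // [^>] also matches newlines, so multi-line tags go too
            System.out.println(text);
        }
    }
}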

Regex to remove html does not get rid of img tag

I am using a regex to remove HTML tags. I do something like this:
result = result.replaceAll("<.*?>", "");
However, it does not help me get rid of the img tags in the HTML. Any idea what is a good way to do that?
If you cannot use HTML parsers/cleaners, then I would at least suggest you use the Pattern.DOTALL flag to take care of multi-line HTML blocks. Consider code like this:
String str = "123 <img \nsrc='ping.png'>abd foo";
Pattern pt = Pattern.compile("<.*?>", Pattern.DOTALL);
Matcher matcher = pt.matcher(str);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
matcher.appendReplacement(sb, "");
}
matcher.appendTail(sb);
System.out.println("Output: " + sb);
OUTPUT
Output: 123 abd foo
To give a more concrete recommendation, use JSoup (or NekoHTML) to parse the HTML into a Java object.
Once you've got a Document object it can easily be traversed to remove the tags. This cookbook recipe shows how to get attributes and text from the DOM.
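For example, a small Jsoup sketch (the HTML string here is made up) that drops every img element and keeps the remaining text:
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class StripImages {
    public static void main(String[] args) {
        String html = "<p>Hello <img src='ping.png'> world</p>";
        Document doc = Jsoup.parse(html);
        doc.select("img").remove();     // remove every img element from the DOM
        System.out.println(doc.text()); // prints: Hello world
    }
}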
Another suggestion is HtmlCleaner
I'm just reiterating what others have said already, but this point cannot be overstated: DO NOT USE REGEXES TO PARSE HTML. There are 1,000 similar questions on this on SO. Use a proper HTML parser; it will make your life so much easier, and it is far more robust and reliable. Take a look at Dom4j, Jericho, or JSoup. Please.
So, a piece of code for you.
I use http://htmlparser.sourceforge.net/ to parse HTML. It is not overcomplicated and is quite straightforward to use.
Basically it looks like this:
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
...
String html; /* read your HTML into variable 'html' */
String result=null;
....
try {
    Parser p = new Parser(html);
    NodeList nodes = p.parse(null);
    result = nodes.asString();
} catch (ParserException e) {
    e.printStackTrace();
}
That will give you plain text stripped of tags (though entities such as &amp; would not be converted). And of course you can do plenty more with this library, like applying filters and visitors, iterating over nodes, and so on.
Use an HTML parser instead. Iterate over the object, print it however you like, and get the best result.
I have been able to achieve this with the code snippet below.
String htmlContent = values.get(position).getContentSnippet();
String plainTextContent = htmlContent.replaceAll("<img .*?/>", "");
I used the above regex to clean the img tags in my RSS content.

How to convert HTML to text keeping linebreaks

How may I convert HTML to text keeping line breaks (produced by elements like br, p, div, ...), possibly using NekoHTML or any decent enough HTML parser?
Example:
Hello<br/>World
to:
Hello\n
World
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
public static String htmlToText(InputStream html) throws IOException {
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}

private static StringBuffer buildStringFromNode(Node node) {
    StringBuffer buffer = new StringBuffer();
    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }
    return buffer;
}
w3m -dump -no-cookie input.html > output.txt
I did find a relatively clever solution in html2txt: THE ASCIINATOR, which does an admirable job of producing nroff-like output (e.g. like man ls run on a terminal). It produces output in the Markdown style that Stack Overflow uses as input.
For moderately complex pages like this page, the output is somewhat scattered as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.
If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.
HTML to Text:
I am taking this statement to mean that all HTML formatting, except line-breaks, will be abandoned.
What I have done for such a task is use a regexp to detect any tag enclosure.
If the value within the tags is br or br/, a line break is inserted; otherwise the tag is discarded.
It works only for simple HTML pages. Tables will obviously be linearised.
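A minimal sketch of that idea (a hypothetical helper; being regex-based, it has exactly the limitations described above):
// Keep <br> and <br/> as line breaks, discard every other tag.
static String htmlToTextKeepingBreaks(String html) {
    return html
            .replaceAll("(?i)<br\\s*/?>", "\n") // <br>, <br/>, <BR /> -> newline
            .replaceAll("<[^>]*>", "");         // drop all remaining tags
}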
I had been thinking of detecting the title value between the title tags, so that the converter automatically places the title at the top of the page. That needs a little more work on the algorithm, but my time is better spent with ...
I am reading up on using the Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or, why text, when I could do PDF? But you have to get a Google account if you don't already have one.
Google docs data download/export
Google docs data api for java
Does it matter what language you use? You could always use pattern matching: basically, HTML line-break tags (br, p, div, ...) can be replaced with "\n", and all the other tags removed. You could store the tags in an array so you can easily check them as you go through the HTML file. Then any other tags and all the end tags (/p, ...) can be replaced with an empty string, giving you your result.
