How may I convert HTML to text while keeping line breaks (produced by elements like br, p, div, ...), possibly using NekoHTML or any decent enough HTML parser?
Example:
Hello<br/>World
to:
Hello\n
World
Here is a function I made to output text (including line breaks) by iterating over the nodes using Jsoup.
public static String htmlToText(InputStream html) throws IOException {
    // With a null charset, Jsoup detects the encoding from a meta tag or falls back to UTF-8.
    Document document = Jsoup.parse(html, null, "");
    Element body = document.body();
    return buildStringFromNode(body).toString();
}

private static StringBuilder buildStringFromNode(Node node) {
    StringBuilder buffer = new StringBuilder();
    // Text nodes contribute their trimmed text.
    if (node instanceof TextNode) {
        TextNode textNode = (TextNode) node;
        buffer.append(textNode.text().trim());
    }
    // Recurse into the children, in document order.
    for (Node childNode : node.childNodes()) {
        buffer.append(buildStringFromNode(childNode));
    }
    // After the content of a p or br element, emit a line break.
    if (node instanceof Element) {
        Element element = (Element) node;
        String tagName = element.tagName();
        if ("p".equals(tagName) || "br".equals(tagName)) {
            buffer.append("\n");
        }
    }
    return buffer;
}
w3m -dump -no-cookie input.html > output.txt
I did find a relatively clever solution in html2txt: THE ASCIINATOR, which does an admirable job of producing nroff-like output (like man ls run in a terminal). It produces output in the Markdown style that StackOverflow uses as input.
For moderately complex pages like this page, the output is somewhat scattered as it tries mightily to turn non-linear layout into something linear. The output from less complicated markup is pretty readable.
If you don't mind hard-wrapped/designed-for-monospace output, lynx -dump produces good plain text from HTML.
HTML to Text:
I am taking this statement to mean that all HTML formatting, except line-breaks, will be abandoned.
What I have done for this is use a regexp to detect every tag.
If the tag is br or br/, a line break is inserted; otherwise the tag is discarded.
It works only for simple HTML pages. Tables will obviously be linearised.
I had been thinking of detecting the text inside the title tag, so that the converter automatically places the title at the top of the page. That needs a little more logic. But my time is better spent with ...
I am reading on using Google Data APIs to upload a document to Google Docs and then using the same API to download/export it as text. Or, why text, when I could do pdf. But you have to get a Google account if you don't already have one.
Google docs data download/export
Google docs data api for java
Does it matter what language you use? You could always use pattern matching. Basically, replace the HTML line-break tags (br, p, div, ...) with "\n" and remove all the other tags. You could store the line-break tags in an array so you can easily check against them as you go through the HTML file. Any other tags, and all the end tags (/p, ...), can then be replaced with an empty string, which gives you your result.
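Here is a minimal sketch of that pattern-matching idea in Java, using only java.util.regex. The class name TagStripper, the method toText and the exact tag list are my own assumptions for illustration. Like any regex approach, it will trip over a ">" inside an attribute value, comments, or script blocks, so treat it as suitable for simple markup only.

```java
import java.util.regex.Pattern;

final class TagStripper {
    // Tags whose opening form is turned into a line break (assumed list; extend as needed).
    private static final String[] BREAK_TAGS = {"br", "p", "div"};

    // Matches <br>, <br/>, <p class="x">, <div> and so on, case-insensitively.
    // The \b keeps <pre> from being mistaken for <p>.
    private static final Pattern BREAK = Pattern.compile(
            "(?i)<\\s*(?:" + String.join("|", BREAK_TAGS) + ")\\b[^>]*>");

    // Matches any remaining tag, opening or closing.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]+>");

    static String toText(String html) {
        String withBreaks = BREAK.matcher(html).replaceAll("\n");
        return ANY_TAG.matcher(withBreaks).replaceAll("");
    }

    public static void main(String[] args) {
        // Prints "Hello" and "World" on separate lines.
        System.out.println(toText("Hello<br/>World"));
    }
}
```

Note that every opening p or div also yields a "\n", so block elements start on a new line rather than end with one.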
Related
I am trying to search a .txt file that contains HTML. I need to search the file for specific HTML tags, then grab the next few characters of code that follow. I am new to Java, but am willing to learn what I need to.
For example: Say I have the code: <span class="date">Apr 13</span> and all I need is the date(Apr 13). How do I go about doing this?
Thanks a lot!
Have a look at String class docs and try to find the method to search the string.
Since you said you are getting it from an HTML file, have a look at Jsoup, an HTML parser that makes searching HTML documents a lot easier.
With jsoup, you can do it like this
File input = new File("input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");
Elements spans = doc.select("span");
for (Element element : spans) {
System.out.println(element.html());
}
Try this:
Matcher m = Pattern.compile(">(.*?)<").matcher(s);
while (m.find()) {
    String text = m.group(1); // the text between '>' and the next '<'
}
If what you want is something basic (I thought that would be good as you are new), you can use this:
if (s.indexOf("span class=\"date\"") != -1) // indexOf returns -1 when there is no match
    s = s.substring(s.indexOf(">") + 1, s.lastIndexOf("<"));
But this answer is more specific to your question than a general one.
String yourString = "<span class=\"date\">Apr 13</span>";
String date = yourString.split("class=\"date\">")[1].split("</sp")[0];
I am getting this output when trying to use Jsoup to extract text from Wikipedia:
I don't have enough rep to post pictures as I am new to this site, but it's basically like this:
[]{k[]q[]f[]d[]d etc..
Here is part of my code:
public static void scrapeTopic(String url)
{
String html = getUrl("http://www.wikipedia.org/" + url);
Document doc = Jsoup.parse(html);
String contentText = doc.select("*").first().text();
System.out.println(contentText);
}
It appears to get all the information but in the wrong format!
I appreciate any help given
Thanks in advance
Here are some suggestions for you. To fetch a general webpage that doesn't require HTTP header fields such as cookie or user-agent to be set, just call:
Document doc = Jsoup.connect("givenURL").get();
This function reads the webpage using a GET request. When you select elements using *, it matches any element, that is, all the elements of the document. Hence, calling doc.select("*").first() returns the #root element. Try printing it to see:
System.out.println(doc.select("*").first().tagName()); // #root
System.out.println(doc.select("*").first()); // prints the whole document
System.out.println(doc); // also prints the whole document, so the select above is pointless
System.out.println(doc.select("*").first() == doc);
// checks whether they are the same object; prints true
I am assuming that you are just playing around to learn this API. Although selectors are much more powerful, a good start would be to try the general document-manipulation functions, e.g., doc.getElementsByTag().
However, on my local machine, I was able to fetch the Document and parse it using your getUrl() function!
I am using JSoup library to extract texts in webpages. Following is my code
Document doc;
try {
URL url = new URL(text);
doc = Jsoup.parse(url, 70000);
Elements paragraphs = doc.select("p");
for(Element p : paragraphs)
{
textField.append(p.text());
textField.append("\n");
}
}
catch (Exception ex)
{
ex.printStackTrace();
}
Here, I am only able to get text from "p" tags. But I need all the texts in the page. How can I do it? That might be by looping through nodes, but I just started using JSoup.
Try this:
String text = Jsoup.parse(new URL("https://www.google.com"), 10000).text();
System.out.println(text);
Here, 10000 is the timeout in milliseconds.
You might want to use Boilerpipe, because you don't need HTML parsing, but only text extraction. This should be faster and less CPU-consuming.
Example:
URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
Taken from: https://code.google.com/p/boilerpipe/wiki/QuickStart
Perhaps a different approach entirely. I'm not sure exactly what you are doing, thus I don't know exactly what you need. But you could grab the entire raw source of the entire web page. Then use regexp to delete all of the html tags. I did something similar (though in php) for a text to code ratio tool once.
I have a file with the following contents
<div name="hello"></div>
and I need Java code that will read this file and print only the word hello.
This is what I have come up with:
while (( line = bf.readLine()) != null)
{
linecount++;
int indexfound = line.indexOf("<div name");
if (indexfound > -1) {
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m = p.matcher(line);
while (m.find()) { System.out.println(m.group(1)); }
}
}
bf.close();
}} catch (IOException e) {
e.printStackTrace();
}}}
But the problem with this code is that if I change the file so that it looks like this:
<div name="hello" value="hi"></div>
then hi also gets printed, but I want only hello to be printed.
While the best answer to questions like this is to advocate the use of an HTML or XML parser to extract attributes, it's worthwhile pointing out the issue in your question.
You are getting both attributes printed because you are printing inside a while loop. You are printing everything surrounded by double quotes.
Furthermore, you only want the value of the name attribute. So your pattern should be formed as follows:
Pattern.compile("name=\"([^\"]*)\"");
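For illustration, here is a small self-contained sketch of that pattern applied to the line from the question. The class name NameAttribute and the helper firstName are hypothetical names I made up. It uses if-style matching (a single find) instead of a while loop, so only the first name attribute on the line is returned.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NameAttribute {
    // Captures whatever sits between the quotes of name="...".
    private static final Pattern NAME = Pattern.compile("name=\"([^\"]*)\"");

    /** Returns the value of the first name attribute on the line, or null if there is none. */
    static String firstName(String line) {
        Matcher m = NAME.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints hello; the value attribute is ignored.
        System.out.println(firstName("<div name=\"hello\" value=\"hi\"></div>"));
    }
}
```

Be aware that this pattern would also match an attribute whose name merely ends in name (for example data-name), which is one more reason to prefer a real parser for anything beyond a quick fix.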
You can use any of the DOM libraries available in Java, such as JDOM or dom4j. The file you are trying to parse is an XML (HTML) file, and these DOM libraries were developed to parse such XML files. It's easy to get started. Follow the tutorials on this site: http://www.java-samples.com/showtutorial.php?tutorialid=152
Your code might work for the change you have made in the XML; however, you may need to change your code with every other change in your XML. This can be exhausting, and hence I suggest the best way to read an XML doc in Java is to use parsers. In Java there are two parsers I have come across recently: DOM and SAX. You should find a lot of tutorials and examples on the internet; these were where I learned a lot:
http://www.mkyong.com/java/how-to-read-xml-file-in-java-sax-parser/
and
http://www.mkyong.com/java/how-to-read-xml-file-in-java-dom-parser/
I'm using the following method to read in a line of text from an XML document via the web:
public static String getCharacterDataFromElement(Element e) {
Node child = ((Node) e).getFirstChild();
if (child instanceof CharacterData) {
CharacterData cd = (CharacterData) child;
return cd.getData();
}
return "";
}
It works fine, but if it comes across a character such as an ampersand that is not written as &amp; etc., it completely ignores that character and the rest of the line. What can I do to rectify this?
The only proper solution is to correct the XML, so that the & is written as &amp;, or the texts are wrapped in <![CDATA[ ... ]]>.
It's not actually XML unless you escape ampersands or use CDATA.
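To make the point concrete, here is a rough sketch using the DOM parser that ships with the JDK. That is an assumption on my part, since the question does not say which parser is in use, and the class AmpersandDemo and helper rootText are made-up names. The JDK parser rejects a bare ampersand outright instead of silently dropping text, while the escaped and CDATA forms both parse.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class AmpersandDemo {
    /** Returns the root element's text, or null if the input is not well-formed XML. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Parse failures (such as a bare '&') end up here.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText("<t>Tom &amp; Jerry</t>"));         // Tom & Jerry
        System.out.println(rootText("<t><![CDATA[Tom & Jerry]]></t>")); // Tom & Jerry
        System.out.println(rootText("<t>Tom & Jerry</t>"));             // null
    }
}
```

For the third input the parser may also log a fatal error to stderr before the exception is caught.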
I suspect the talk of the input not being well-formed is a red herring. If the source document contains entity references then an element may contain multiple text node children, and your code is only reading the first of them. It needs to read them all.
(I think there are easier ways of getting the text content of a Node in DOM. But I'm not sure, I never use the DOM if I can avoid it because it makes everything so difficult. You're much better off with JDOM or XOM.)
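A sketch of what reading all the text children could look like with the standard org.w3c.dom API follows. The class name ElementText and the rootText demo helper are hypothetical, and I check for Text rather than CharacterData so that comments are skipped, which is a small deviation from the method in the question. If your DOM implementation supports DOM Level 3, Node.getTextContent() is a shorter route, though it also descends into child elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;

final class ElementText {
    /** Concatenates every direct text or CDATA child of the element, not just the first. */
    static String getCharacterDataFromElement(Element e) {
        StringBuilder text = new StringBuilder();
        for (Node child = e.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child instanceof Text) { // CDATASection is a subtype of Text
                text.append(((Text) child).getData());
            }
        }
        return text.toString();
    }

    /** Demo helper (hypothetical): parses a string and reads the root element's text. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return getCharacterDataFromElement(doc.getDocumentElement());
        } catch (Exception ex) {
            throw new IllegalStateException("could not parse demo input", ex);
        }
    }

    public static void main(String[] args) {
        // The root has up to three children here: text, a CDATA section, and more text.
        System.out.println(rootText("<t>Tom <![CDATA[& ]]>Jerry</t>")); // Tom & Jerry
    }
}
```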