android get news article content - java

I am creating a news app and have the URL to the article's page, e.g. http://www.bbc.co.uk/news/technology-33379571, and I need a way to extract the content of the article.
I have tried Jsoup, but that gives me all the HTML tags; there is one <main-article-body> element, but that only gives the link to the article I am trying to extract. I know Boilerpipe does exactly this, but it doesn't work on Android. I am really stuck on this problem.
Any help will be much appreciated.

I have worked on a few data-extraction applications in .NET (C#) and have used regular expressions to extract content from news websites.
The basic idea is to first extract all the <a href> links you need, then fetch the detail pages by making web requests, and finally use regular expressions to extract the news body.
Note: a drawback of this approach is that you will need to update your regular expressions whenever the source site's markup changes.
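The same idea translates directly to Java with java.util.regex. A minimal stdlib-only sketch (the patterns and the story-body class name are illustrative; every real site needs its own patterns):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExtractor {
    // Illustrative patterns; a real news site needs its own, and they
    // break whenever the site's markup changes.
    private static final Pattern LINK =
        Pattern.compile("<a\\s+[^>]*href=\"([^\"]+)\"");
    private static final Pattern BODY =
        Pattern.compile("<div class=\"story-body\">(.*?)</div>", Pattern.DOTALL);

    // Step 1: collect all href targets from a listing page.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Step 3: pull the article body out of a detail page and strip tags.
    public static String extractBody(String html) {
        Matcher m = BODY.matcher(html);
        return m.find() ? m.group(1).replaceAll("<[^>]+>", "").trim() : "";
    }
}
```

Step 2 (fetching each detail page) would use HttpURLConnection or any HTTP client before calling extractBody.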

Related

Jsoup example to read data from different HTML pages (or) websites? Should be reusable for any website

I have a requirement to read data from different websites using the Jsoup HTML parser (e.g. name, city, state, zip, etc.). I am able to read the data from one website, but the problem is that my code should be reusable for other websites, where the elements and their positions differ from the first site. How can I achieve this? Please suggest a pattern or some examples. Thanks.
I'll get this straight for you: there is NO WAY to write a general parser that scrapes all websites. I've worked at a company where I had to scrape 30 websites, and I literally had to write one parser for each.
You can, however, create general utility classes that help you process the data you've parsed.
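One pattern that works well for this: one parser implementation per site behind a shared interface, with the common cleanup code in a utility class. A sketch (the interface and class names are made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Each site gets its own parser implementation...
interface SiteParser {
    Map<String, String> parse(String html);
}

// ...but shared post-processing lives in one utility class.
final class ScrapeUtils {
    private ScrapeUtils() {}

    // Strip leftover tags and collapse whitespace.
    static String clean(String raw) {
        return raw.replaceAll("<[^>]+>", "").replaceAll("\\s+", " ").trim();
    }
}

// Site-specific extraction rule; a second site gets its own class.
class SiteAParser implements SiteParser {
    private static final Pattern NAME =
        Pattern.compile("<span id=\"name\">(.*?)</span>", Pattern.DOTALL);

    @Override
    public Map<String, String> parse(String html) {
        Map<String, String> record = new HashMap<>();
        Matcher m = NAME.matcher(html);
        if (m.find()) {
            record.put("name", ScrapeUtils.clean(m.group(1)));
        }
        return record;
    }
}
```

A Jsoup-based version would keep the same structure, with each parser holding its own CSS selectors.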

Java: extract all resources links from HTML

I am looking for a way to extract all resource links from an HTML page in Java (URL links, links to files, ...).
My first thought was to extract the values of all src and href attributes, but that list would not be exhaustive. There is an example of code here: Jsoup, extract links, images, from website. Exception on runtime.
As a tricky example, I want to be able to detect links hidden inside JavaScript (which can appear anywhere in the HTML DOM):
<IMG onmouseover="window.open('http://www.evil.com/image.jpg')">
EDIT:
1) I am not looking for a regex-based solution, because regexes are not reliable for dealing with HTML documents.
2) I have tried HTML DOM parsers such as Jsoup. They allow the extraction of tags and their attributes quite well, but I have not found a way to detect links inside JavaScript with them.
3) Maybe there is an API that renders the page and detects which resources need to be loaded?
Do you have any thoughts?
Thanks.
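For the JavaScript case specifically, no DOM parser can recover URLs that are assembled at runtime; only a tool that actually executes the scripts (a headless browser such as HtmlUnit or Selenium) can do that. A best-effort heuristic is to scan the raw markup, including event-handler attributes and script bodies, for URL literals. A stdlib-only sketch of that heuristic:

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HiddenLinkFinder {
    // Heuristic: any absolute http(s) URL literal appearing anywhere in the
    // markup, including inside onmouseover-style attributes and <script> tags.
    private static final Pattern URL =
        Pattern.compile("https?://[^\\s'\"<>)]+");

    public static Set<String> findUrls(String html) {
        Set<String> urls = new LinkedHashSet<>();
        Matcher m = URL.matcher(html);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }
}
```

This complements, rather than replaces, DOM-based extraction of src/href attributes: it catches string literals but will still miss URLs built by concatenation in JavaScript.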
If you are willing to use PHP and have a bit of programming knowledge, here is a library:
http://simplehtmldom.sourceforge.net/
I used this library to extract info from tags, and even from tag attributes. It is exactly what you need, without resorting to complicated code.

How a dynamic HTML parser works

I am building a small online platform that aggregates news from different news websites.
I am using Java for my website and the Jsoup parser for parsing HTML.
My approach is simple: first I download the HTML pages to a local folder, then I extract the content with selectors and filters, e.g. doc.select("img").
But if the design of a target website changes, the parser will stop working.
Another big problem is needing a different parser for each website.
I want to build a parser something like Google News or the HTC news feed.
Can anyone point me in the right direction?
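Truly generic extraction in the style of Google News relies on content heuristics such as text-density analysis, which is what libraries like Boilerpipe implement. Short of that, a common mitigation for both problems is to move the site-specific part into configuration, so a redesign only means updating a rule, not the code. A stdlib-only sketch using one regex per host (a Jsoup-based version would store a CSS selector per host instead):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NewsExtractor {
    // Per-host extraction rules; in practice these would be loaded from a
    // config file so a site redesign only means editing the config.
    private final Map<String, Pattern> rules = new HashMap<>();

    public NewsExtractor() {
        rules.put("example.com",
            Pattern.compile("<article>(.*?)</article>", Pattern.DOTALL));
        rules.put("other-news.example",
            Pattern.compile("<div id=\"story\">(.*?)</div>", Pattern.DOTALL));
    }

    public String extract(String host, String html) {
        Pattern p = rules.get(host);
        if (p == null) {
            return ""; // No rule for this site yet.
        }
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).replaceAll("<[^>]+>", " ").trim() : "";
    }
}
```

The hosts and patterns above are placeholders; the point is the shape, one table of rules instead of one parser class per site.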

Java program to capture images from a webpage and output their links to an Excel file

I want a Java app that captures all the images (and preferably the data in other tags too) from a webpage and writes their links to an Excel file.
While I know my way around Excel files and Java, I was wondering whether there is any way to capture images from web pages.
A quick Google search didn't help.
Obviously there is.
Since the images are in the source code, you can start with the simplest solution: get the page source, retrieve the image links, and download them.
KISS ;-)
You probably need to parse the HTML of the webpage and pull the image links from the respective HTML tags.
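A stdlib-only sketch of that pipeline: pull the img src values out of the markup and write them to a CSV file, which Excel opens directly (for a genuine .xlsx you would bring in Apache POI). The regex is a simplification; a real implementation would use Jsoup's doc.select("img[src]") instead:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ImageLinkDumper {
    // Simplified: grabs src="..." from <img> tags; misses single-quoted
    // and unquoted attribute values.
    private static final Pattern IMG_SRC =
        Pattern.compile("<img\\s+[^>]*src=\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static List<String> imageLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = IMG_SRC.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // CSV opens directly in Excel; for a real .xlsx use Apache POI instead.
    public static void writeCsv(List<String> links, Path out) throws IOException {
        try (PrintWriter w = new PrintWriter(Files.newBufferedWriter(out))) {
            w.println("image_url");
            links.forEach(w::println);
        }
    }
}
```

Downloading the images themselves is then a loop over the links with HttpURLConnection, resolving relative URLs against the page URL first.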

Convert HTML page into MS word using java or any API

I want to convert an HTML page into MS Word. I would like to know which APIs would be helpful, and whether there are other options for doing the same.
The entire page should be converted to .doc (e.g. if there is a table in the HTML page, a similar table must be created in the Word document).
Apache POI does not provide an option to format the Word document as in the HTML page.
I need something that can give me a completely formatted Word document.
Some of the options I have looked at are Jsoup, docx4j, JasperReports, and JODConverter.
I tried parsing the HTML page using Jsoup, and I get the contents of the page in my Java program. Now I need to pass these contents to a .doc/.docx file. Can docx4j help me get a formatted .docx file?
Please help.
Thank you.
I would go with Ashwini Raman's suggestion, though it won't work in every scenario. For a complex HTML document with many images and the like, Word will not do a good job, but for most cases it should be fine. Otherwise there is a complex task ahead of you: you will have to parse your HTML document, using the Jsoup library for example, and then use the docx4j library to create your Word document.
Links to both are here:
http://www.docx4java.org/trac/docx4j
http://jsoup.org/
Even then, the formatting might be iffy.
To answer your original question: no, there is no ready-made library that does what you are expecting. At least I haven't come across any.
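A sketch of that Jsoup + docx4j pipeline, assuming the docx4j, docx4j-ImportXHTML, and jsoup jars are on the classpath (the ImportXHTML module is what carries the formatting across; plain docx4j only gives you the document object model). The URL and filename are placeholders:

```java
import java.io.File;

import org.docx4j.convert.in.xhtml.XHTMLImporterImpl;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class HtmlToDocx {
    public static void main(String[] args) throws Exception {
        // 1. Clean the page with Jsoup and emit well-formed XHTML,
        //    which is what the importer expects.
        Document doc = Jsoup.connect("http://example.com/page.html").get();
        doc.outputSettings().syntax(Document.OutputSettings.Syntax.xml);

        // 2. Import the XHTML into a new Word package via docx4j-ImportXHTML.
        WordprocessingMLPackage pkg = WordprocessingMLPackage.createPackage();
        XHTMLImporterImpl importer = new XHTMLImporterImpl(pkg);
        pkg.getMainDocumentPart().getContent()
           .addAll(importer.convert(doc.html(), doc.baseUri()));

        // 3. Save the result.
        pkg.save(new File("page.docx"));
    }
}
```

Tables and basic inline styling survive this route reasonably well; heavy CSS layout generally does not.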
I found a workaround for this. First I get the parsed objects using Jsoup and pass them to a document template. I am now looking for options that make it easy to create templates and generate the document dynamically.
I have asked another question about this.
