What should I use to crawl many news articles? - Java

I have a project in natural language processing, but for that I need to crawl many web articles from sources like Yahoo News, Google News, or blogs...
I'm a Java developer (so I'd rather use Java tools). I guess I could parse each source website on my own and extract the articles with HttpClient / XPath, but I'm a bit lazy :) Is there a way to avoid writing one parser per source?
(I'm not only interested in new articles, but also in articles from 2000 to now.)

The hardest part of NLP is getting data you can use. Everything else is just math.
It may be hard to find a large collection of news articles other than on each news source's website because of all the copyright issues involved. If you don't need recent news, your best bet is probably to look at the Linguistic Data Consortium's English Gigaword corpus; if you are at a university, there may already be an existing relationship for you to use the data for free.
If you need to actually crawl and parse websites, for now you'll probably find you have to write specific parsers for the various news websites to make sure you get the right text. However, as more websites adopt HTML5, it will become easier to pull out the relevant text via the article tag.
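To illustrate, here is a minimal sketch with jsoup, one of several Java HTML parsers (my choice of library and the URL are assumptions for the example):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class ArticleExtractor {
        public static void main(String[] args) throws Exception {
            // Fetch and parse the page; jsoup copes with messy real-world HTML.
            Document doc = Jsoup.connect("http://example.com/some-news-story").get();

            // On HTML5 pages the main story is often wrapped in an <article> tag.
            for (Element article : doc.select("article")) {
                System.out.println(article.text());
            }
        }
    }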
To do the actual crawling, this previous question can point you in some useful directions.

Related

Extracting information using XPaths

Good afternoon dear community,
I have finally compiled a list of working XPaths required to scrape all of the information I need from the URLs.
I would like to ask for your suggestions: for a newbie in coding, what is the best way to scrape around 50k links using only XPaths (around 100 XPaths per link)?
Import.io is my best tool at the moment, along with SEO Tools for Excel, but they both have their limitations: Import.io is expensive, and SEO Tools for Excel isn't suited to extracting more than 1000 links.
I am willing to learn whatever system you suggest, but please recommend a good way of scraping for my project!
UPDATE
SOLVED! The SEO Tools crawler is actually super useful, and I believe I've found what I need. I guess I'll hold off on Python or Java until I encounter another tough obstacle.
Thank you all!
That strongly depends on what you mean by "scraping information". What exactly do you want to mine from the websites? All major languages (certainly Java and Python, which you mentioned) have good solutions for connecting to websites, reading content, parsing HTML into a DOM, and using XPath to extract certain fragments. For example, Java has JTidy, which lets you parse even "dirty" HTML from websites into a DOM and manipulate it. However, the tools you need will depend on the exact data-processing requirements of your project.
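As a rough sketch of that DOM-plus-XPath approach, here is JTidy combined with the standard javax.xml.xpath API (the URL and the XPath expression are placeholders):

    import java.io.InputStream;
    import java.net.URL;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.NodeList;
    import org.w3c.tidy.Tidy;

    public class XPathScraper {
        public static void main(String[] args) throws Exception {
            // Clean up the page's HTML and turn it into a W3C DOM.
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);
            tidy.setShowWarnings(false);
            try (InputStream in = new URL("http://example.com/page").openStream()) {
                Document dom = tidy.parseDOM(in, null);

                // Evaluate one of your prepared XPaths against the DOM.
                XPath xpath = XPathFactory.newInstance().newXPath();
                NodeList nodes = (NodeList) xpath.evaluate(
                        "//div[@class='price']", dom, XPathConstants.NODESET);
                for (int i = 0; i < nodes.getLength(); i++) {
                    System.out.println(nodes.item(i).getTextContent());
                }
            }
        }
    }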
I would encourage you to use Python (I use 2.7.x) w/ Selenium. I routinely automate scraping and testing of websites with this combo (in both a headed and headless manner), and Selenium unlocks the opportunity to interact with scripted sites that do not have explicit webcalls for each and every page.
Here is a good, quick tutorial from the Selenium docs: 2. Getting Started
There are a lot of great sources out there, and it would take forever to post them all; but, you will find the Python community very helpful and you'll likely see that Python is a great language for this type of web interaction.
Good luck!

Information retrieval in DBpedia using Spotlight

I have recently come across dbpedia-spotlight and I want to do information retrieval with it. I have a set of queries and the DBpedia data, and using information retrieval I need to produce the output. I was not able to understand the documentation, so could you give me some sample code to start working from?
I have tried Terrier, but that was equally difficult.
Terrier is more popular as a research tool, where you can try out various standard IR models against standard test collections, e.g. TREC, ClueWeb, etc.
If you want to quickly develop a reasonably functional search system, Lucene is the best thing to try. Go through the "Lucene in 5 minutes" tutorial. I guess it should be fairly simple to use.
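The core of that tutorial boils down to roughly the following sketch (Lucene's API shifts between major versions, so the details here assume a reasonably recent release and may need adjusting):

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneInFiveMinutes {
        public static void main(String[] args) throws Exception {
            StandardAnalyzer analyzer = new StandardAnalyzer();
            Directory index = new ByteBuffersDirectory(); // in-memory index for the demo

            // Index a couple of tiny documents.
            try (IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(analyzer))) {
                for (String title : new String[] { "Lucene in Action", "Managing Gigabytes" }) {
                    Document doc = new Document();
                    doc.add(new TextField("title", title, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }

            // Parse a free-text query and run it against the index.
            Query query = new QueryParser("title", analyzer).parse("lucene");
            try (DirectoryReader reader = DirectoryReader.open(index)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("title"));
                }
            }
        }
    }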

How to scrape newspaper articles from multiple newspaper websites with help of RSS feeds

I have to do some data collection, that is, obtain a large stream of relatively clean corpus data. The corpus is simply a collection of webpages (HTML), each page corresponding to a news article with associated information such as the date of its publication, the edition it appears in, the section in which it appears, etc.
I have to develop a crawler that can crawl newspaper websites in different languages in parallel.
Let us fix two languages (English and Hindi) and write a crawler to scrape articles from the websites of these newspapers. We have to collect data for one month.
What we are interested in is collecting a large number of news articles in multiple languages from the websites of various newspapers as they are published.
Instead of writing a full-fledged scraper, I have been told to use sources like RSS feeds.
The idea is to obtain parallel corpora - i.e., newspaper articles that are in different languages and in sync with each other.
After building the crawler, we have to set it up on a server to obtain the newspaper stream.
Please let me know which tools and programming language I should use to build this crawler.
I know Java, so I would preferably like to work with Java libraries.
I know that RSS feeds are in XML.
I am not sure what exactly the question is, but yes, RSS feeds are probably the way to go (at least as a signal), and yes, Java has great tools for dealing with feeds :)
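As one example, here is a minimal sketch using the ROME feed library (my choice of library and the feed URL are assumptions, not something from the question):

    import java.net.URL;
    import com.rometools.rome.feed.synd.SyndEntry;
    import com.rometools.rome.feed.synd.SyndFeed;
    import com.rometools.rome.io.SyndFeedInput;
    import com.rometools.rome.io.XmlReader;

    public class FeedReader {
        public static void main(String[] args) throws Exception {
            // Download and parse the feed; ROME handles both RSS and Atom.
            URL feedUrl = new URL("http://example.com/news/rss.xml");
            SyndFeed feed = new SyndFeedInput().build(new XmlReader(feedUrl));

            // Each entry carries the article link plus metadata such as the
            // publication date -- usually enough to fetch and file the full page.
            for (SyndEntry entry : feed.getEntries()) {
                System.out.println(entry.getPublishedDate() + "  "
                        + entry.getTitle() + "  " + entry.getLink());
            }
        }
    }

From there, the crawler is mostly a matter of polling each paper's feeds on a schedule and downloading the linked pages.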

Library for text classification in Java

I have a set of categorized text files. I want to categorize another large set of text files to use in my research. Is there a good way to compare them?
I think SVM-based methods would be useful, but is there a simple, well-documented library for using such algorithms?
I don't know much about SVMs, but LingPipe might be really helpful for you; it has a tutorial specifically about categorization of documents (automatic or guided).
Also, look into the inter-related search products Lucene (a search library), Solr (search server app), and Carrot2 (for 'clustering' search results). There should be some interesting work in that space for you.
Mallet is another awesome library to look into. It has good command-line tools to help you get started, and a Java API for once you start integrating it with the rest of your system.
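To give a feel for that Java API, here is a rough sketch that trains a classifier from a directory of text files, one subdirectory per category (the path, the pipeline, and the Naive Bayes trainer are illustrative choices, not the only options):

    import java.io.File;
    import java.util.regex.Pattern;
    import cc.mallet.classify.Classifier;
    import cc.mallet.classify.NaiveBayesTrainer;
    import cc.mallet.pipe.CharSequence2TokenSequence;
    import cc.mallet.pipe.FeatureSequence2FeatureVector;
    import cc.mallet.pipe.Input2CharSequence;
    import cc.mallet.pipe.Pipe;
    import cc.mallet.pipe.SerialPipes;
    import cc.mallet.pipe.Target2Label;
    import cc.mallet.pipe.TokenSequence2FeatureSequence;
    import cc.mallet.pipe.TokenSequenceLowercase;
    import cc.mallet.pipe.iterator.FileIterator;
    import cc.mallet.types.InstanceList;

    public class TrainCategorizer {
        public static void main(String[] args) {
            // Standard Mallet import pipeline: label each file by its directory
            // name, read its text, tokenize, lowercase, and build feature vectors.
            Pipe pipe = new SerialPipes(new Pipe[] {
                new Target2Label(),
                new Input2CharSequence("UTF-8"),
                new CharSequence2TokenSequence(Pattern.compile("\\p{L}+")),
                new TokenSequenceLowercase(),
                new TokenSequence2FeatureSequence(),
                new FeatureSequence2FeatureVector()
            });

            // Expects a layout like data/sports/*.txt, data/politics/*.txt, ...
            InstanceList instances = new InstanceList(pipe);
            instances.addThruPipe(new FileIterator(
                    new File[] { new File("data") },
                    f -> f.getName().endsWith(".txt"),
                    FileIterator.LAST_DIRECTORY));

            Classifier classifier = new NaiveBayesTrainer().train(instances);
            System.out.println("Training accuracy: " + classifier.getAccuracy(instances));
        }
    }

As far as I know, Mallet does not ship an SVM trainer, so if you specifically need SVMs you would pair it with something like LIBSVM; for many text problems, though, Naive Bayes or MaxEnt is a solid baseline.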

Generic Article Extraction from web pages

I am about to begin my work on article extraction.
The task I will be doing is to extract the hotel reviews that are posted on different web pages (e.g. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html )
I need to do the task in Java, and I have only been working with Java for the past couple of months.
Here are my questions regarding this:
Is there a way to extract the reviews alone from different web pages in a generic way?
Kindly let me know if there are any APIs that support this task in Java.
Also, let me know your thoughts/sources that would help me accomplish the task mentioned above.
UPDATE
If any related examples are available on the net, please post them, since they could be of great use.
You probably need a screen-scraping utility for Java like TagSoup or NekoHTML. jsoup is also popular.
However, you also have a bigger legal consideration here when extracting data from a third-party website like TripAdvisor: does their policy allow it?
