I am about to begin my work on article extraction.
The task is to extract hotel reviews posted on different web pages (e.g. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html )
I need to do this in Java, and I have only been working with Java for the past couple of months.
Here are my questions:
Is it possible to extract just the reviews from different web pages in a generic way?
Please let me know if there are any Java APIs that support this task.
Also, let me know of any thoughts/sources that would help me accomplish the task described above.
UPDATE
If any related examples are available on the net, please post them, since they could be of great use.
You probably need a screen scraping utility for Java like TagSoup or NekoHTML. JSoup is also popular.
However, you also have a bigger legal consideration here when extracting data from a third-party website like TripAdvisor. Does their policy allow it?
I am working on IBM RTC and I need to import a .csv file into RTC using Java. Is there a way to do this? If so, could someone help me with it?
Parsing CSV data is something that you definitely do not want to implement yourself; there are plenty of libraries for that (see here).
RTC offers a wide range of APIs that can be used; see:
rsjazz.wordpress.com or
jazz.net
In that sense: you can write Java code that reads CSV data, and RTC has a rich API that allows you push "content" into the system.
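In code, the CSV-reading half might look like this. A naive sketch using only the JDK; as said above, a real library should handle quoting and escaping for you, and the column names here are just illustrative:

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CsvReadDemo {
    public static void main(String[] args) throws Exception {
        // Inline sample data; in practice you would read from a file.
        String csv = "id,summary,owner\n1,Fix login bug,alice\n2,Update docs,bob";
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(csv))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line.split(","));  // naive: no quoted-field handling
            }
        }
        for (String[] row : rows) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Each `String[]` row would then be the input you hand to the RTC side of your program.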
But a word of warning: I used that Java API some years ago to manipulate information within our RTC instance. That was a very painful experience. I found the APIs to be badly documented and extremely hard to use. It took me several days to arrive at working code that would make just a few small updates to our stories/tasks.
Maybe things have improved since then, but be prepared for, as said, a painful experience.
EDIT, regarding your comment on "other options":
Well, I don't see any: you want to push data you have in CSV into your RTC instance, so if you still want to do that, you have to use the means that are available to you! And don't let my words discourage you. A) It was some time back when I did my programming with RTC, so maybe their APIs are better structured and more intuitive today. B) There is some documentation out there (for example here). And I think everybody can register at jazz.net, so when you have further, specific questions, you might find "better" answers there!
All I wanted to say was: I know that other products such as Jenkins or SonarQube have great APIs; you work with those, and it's all nice, easy, fun. You can get things working with RTC, too. Just the path there maybe isn't that nice and easy.
My personal recommendation: start with the RTC part first. Meaning: just try to write a small program that authenticates against the server and then pushes some example data into the system. If that works nicely for you, then spend the time on pulling/transforming the real data that you have in mind!
Good afternoon dear community,
I have finally compiled a list of working XPaths required to scrape all of the information from the URLs that I need.
I would like to ask for your suggestion: for a newbie to coding, what is the best way to scrape around 50k links using only XPaths (around 100 XPaths for each link)?
Import.io is my best tool at the moment, or even SEO Tools for Excel, but they both have their own limitations. Import.io is expensive, and SEO Tools for Excel isn't suited to extracting more than 1000 links.
I am willing to learn whatever system you suggest, but please suggest a good way of scraping for my project!
UPDATE
SOLVED! The SEO Tools crawler is actually super useful and I believe I've found what I need. I guess I'll hold off on Python or Java until I encounter another tough obstacle.
Thank you all!
That strongly depends on what you mean by "scraping information". What exactly do you want to mine from the websites? All major languages (certainly Java and Python that you mentioned) have good solutions for connecting to websites, reading content, parsing HTML using a DOM and using XPath to extract certain fragments. For example, Java has JTidy, which allows you to parse even "dirty" HTML from websites into a DOM and manipulate it somewhat. However, the tools needed will depend on the exact data processing needs of your project.
I would encourage you to use Python (I use 2.7.x) with Selenium. I routinely automate scraping and testing of websites with this combo (in both a headed and headless manner), and Selenium unlocks the opportunity to interact with scripted sites that do not have explicit web calls for each and every page.
Here is a good, quick tutorial from the Selenium docs: 2. Getting Started
There are a lot of great sources out there, and it would take forever to post them all; but, you will find the Python community very helpful and you'll likely see that Python is a great language for this type of web interaction.
Good luck!
I need to parse a String from HTML to Textile.
I've been looking at Textile4J, Textile-J, JTextile, PLextile.
But so far, none of them provide the functionality I'm looking for.
They do provide the reverse functionality (Textile to HTML).
Worst case scenario, I can use another programming language, but I have not really looked into that.
For now, I don't believe the functionality I want is available in any java Textile library.
I'll try and update this post if and when that changes.
Based on the libraries mentioned above, I have created my own (limited) functionality.
There are also several solutions available in python / ruby.
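For the record, a home-grown conversion in that "limited" spirit can be as small as a few regex replacements. This is only a sketch, not code from any of the libraries above; the handled tag set is illustrative, and a real converter should use a proper HTML parser:

```java
public class HtmlToTextile {
    // Deliberately minimal: handles only a few tags via regex replacement.
    public static String convert(String html) {
        return html
                .replaceAll("(?s)<strong>(.*?)</strong>", "*$1*")
                .replaceAll("(?s)<em>(.*?)</em>", "_$1_")
                .replaceAll("(?s)<h1>(.*?)</h1>", "h1. $1\n")
                .replaceAll("(?s)<p>(.*?)</p>", "$1\n");
    }

    public static void main(String[] args) {
        System.out.println(convert(
                "<h1>Title</h1><p>Some <strong>bold</strong> text.</p>"));
    }
}
```

This prints `h1. Title` followed by `Some *bold* text.` in Textile markup.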
I've been programming in Java for a little while and I've found no real way to even come close to this goal. My googling has been pretty fruitless as well.
I'm looking for a way to download the current weather (or other data, but weather is a good start, I suppose) and save the current temp/humidity/dewpoint and the next-day forecast for those numbers into an array of strings.
I have no idea where to start, but I figure that this will be a good place to start learning how to use Java to fetch data.
Thanks!
How would you approach this task in another language?
In the case of weather you would probably look for some API exposed by the site you're trying to get the weather from.
Here come some clues:
1. If you want to just issue an HTTP request, get a result (Ajax-style) and parse the web page, you can use the java.net package, or, if you want a (much more powerful) third-party lib, use Apache HTTP Client.
2. If you're looking for an API exposed via web services (which I believe is a better approach here), then they're language agnostic, so you just use web services (SOAP/REST) in Java like in any other language.
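To illustrate clue 1, here is a minimal sketch of issuing an HTTP request with the JDK's own `java.net.http` client (Java 11+). The tiny local server just stands in for whatever real weather API you pick, and the JSON shape is invented:

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class HttpFetchDemo {
    public static void main(String[] args) throws Exception {
        // Stand-in for a weather site: a tiny local server returning JSON.
        HttpServer server = HttpServer.create(new InetSocketAddress(0), 0);
        server.createContext("/weather", exchange -> {
            byte[] body = "{\"temp\":17,\"humidity\":62}".getBytes("UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();

        // The actual fetch: this part is all you need against a real API.
        String url = "http://localhost:" + server.getAddress().getPort() + "/weather";
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
        server.stop(0);
    }
}
```

From the response body you would then pull out the temp/humidity/dewpoint values into your array of strings.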
I know the answer is a little generic, so please clarify how you are planning to solve this issue (even in any other language)...
Hope this helps.
A good source for weather information is METAR. There is also a Java library, jweather, which should encapsulate all the network/protocol/API issues into a limited set of methods for retrieving the required weather information.
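As a taste of what decoding METAR involves (a library like jweather would do this for you), here is a small sketch that pulls the temperature/dewpoint group out of a raw report. The sample report string is illustrative:

```java
public class MetarDemo {
    // Extract temperature and dewpoint (degrees C) from a raw METAR report.
    // The "17/12" group means 17 C temperature, 12 C dewpoint; an "M"
    // prefix marks negative values (e.g. M05 = -5 C).
    public static int[] tempAndDewpoint(String metar) {
        for (String token : metar.split(" ")) {
            if (token.matches("M?\\d{2}/M?\\d{2}")) {
                String[] parts = token.split("/");
                return new int[] { parse(parts[0]), parse(parts[1]) };
            }
        }
        throw new IllegalArgumentException("no temperature group found");
    }

    private static int parse(String s) {
        return s.startsWith("M")
                ? -Integer.parseInt(s.substring(1))
                : Integer.parseInt(s);
    }

    public static void main(String[] args) {
        String metar = "KSFO 261756Z 28012KT 10SM FEW012 17/12 A3012";
        int[] td = tempAndDewpoint(metar);
        System.out.println("temp=" + td[0] + "C dewpoint=" + td[1] + "C");
    }
}
```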
I'd like to access some data from web pages that are arranged like a catalog/shop, from an Android app.
For a concrete example: this is the URL for Amazon's listing of Mark Twain's books:
http://www.amazon.com/s/ref=nb_sb_noss/180-5768314-5501168?url=search-alias%3Daps&field-keywords=mark+tain&x=0&y=0#/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=mark+twain&rh=i%3Aaps%2Ck%3Amark+twain
1) If I have the above URL, how do I obtain, e.g.,
the number of entries and
for each entry the line with the title (and maybe the image)? Which probably includes how to iterate through all the follow-up pages and access each entry.
What is the best (correct + compatible + efficient) way to do this?
I got the impression that jQuery might be of use. But so far my knowledge of HTML and JavaScript is just about basic.
2) How do I query for the URL for all of Mark Twain's books?
3) Any suggested readings for this and similar kind of topics?
Thanks for your time and have a good day!
Thomas
You would be very well advised not to "screen scrape" other web sites. Besides being difficult to maintain (as the web site changes, etc.), this will actually be against the terms of use/service (TOS) for many web sites.
Instead, see if the desired web sites offer a web service that you can use. These will return data in a much more consumable format, such as JSON or XML. You'll usually also get your own developer key (to track requests against), as well as other possible features that you wouldn't get if going directly against the HTML.
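For example, consuming an XML response from such a web service is straightforward with the JDK alone. The response shape below is hypothetical; each service defines its own schema, but the parsing pattern is the same:

```java
import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class XmlResponseDemo {
    public static void main(String[] args) throws Exception {
        // Hypothetical web-service response for a book search.
        String xml = "<items>"
                + "<item><title>The Adventures of Tom Sawyer</title></item>"
                + "<item><title>Adventures of Huckleberry Finn</title></item>"
                + "</items>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
        NodeList titles = doc.getElementsByTagName("title");
        // Number of entries, then the title line for each entry.
        System.out.println(titles.getLength() + " results");
        for (int i = 0; i < titles.getLength(); i++) {
            System.out.println(titles.item(i).getTextContent());
        }
    }
}
```

Compare that with fishing the same data out of Amazon's HTML: far less fragile, and it answers both the "number of entries" and the "title per entry" parts of the question.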
Amazon, in particular, certainly offers this. See https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html for details. (Don't be confused by the naming of "advertising".)