Using Web crawler for price comparison

Using Web crawler for price comparison - java

I need a open source java based web crwaler which I can extend for price comparison?
How do I do the price comparison?
Is there any open source code for that?

Take a look at web harvest, you will have to use it's slightly odd and peculiar syntax for processing web pages, but it should be fairly to extend it to do some price comparison:
http://web-harvest.sourceforge.net/samples.php?num=2

Building something that scrapes price information from a large number of different sites is going to be a lot of work, whether you scrape from the stores themselves or from existing comparison sites.
Everyone's website layout will be different, requiring you to configure your crawler separately for each one.
Some websites may present the price information in ways that make scraping difficult; e.g. using AJAX.
Some website owners will put the relevant pages into their robots.txt files to tell you to stay away. And if you ignore that, there are various things they can do to make life difficult for you.
Scraping lots of people's websites without permission is likely to make you unpopular. It might attract threats of lawsuits, or actual lawsuits from people who perceive that you are harming their business model. Or other responses ...
Are you really sure you want to do this? Really??

Any reason you can't just get your data from one of the hundreds of price comparison sites already out there? Seems like would be simpler to scrape nextag or froogle or whatever instead of writing a crawler to scrape billions of store websites.

Nobody wants their site to get overloaded without getting any benefit. I think you should create a crawler for your need. However, be aware that most of them may block you or make your responses slower. you need to behave like you are not one and eating their bandwidth...

Someone here wrote about the legal issues. The legal issues are not simple. Stephen C wrote about lawsuits but that goes both ways. There is a large body of law related to anti-competitive conduct. If someone wants their prices to be not reported because they are involved in price-fixing or making false claims, then the websites themselves face severe penalties. The law is not something to trivially quote. You can google price fixing and see the large fines already imposed on countless companies.

Related

newbie: how to access content from a shop/catalog like website?

I like to access some data from web pages that are arranged like a catalog/shop from an android app.
For a concrete example: This is the URL for Amazons listing on Mark Twains books:
http://www.amazon.com/s/ref=nb_sb_noss/180-5768314-5501168?url=search-alias%3Daps&field-keywords=mark+tain&x=0&y=0#/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=mark+twain&rh=i%3Aaps%2Ck%3Amark+twain
1) If I have the above URL how do I obtain e.g.
the number of entries and
for each entry the line with the title (and maybe the image)? Which probably includes how to iterate through all the follow-up pages and access each entry.
What is the best (correct + compatible + efficient) way to do this?
I got the impression that jquery might be of use. But so far my knowledge of HTML and Javascript is just about basic.
2) How to query for the URL for all of Mark Twains books?
3) Any suggested readings for this and similar kind of topics?
Thanks for your time and have a good day!
Thomas

You would be very well advised to not "screen scrape" other web sites. Besides being difficult to maintain (as the web site changes, etc.) - this will actually be against the terms of use / service (TOS) for many web sites.
Instead, see if the desired web sites offer a web service that you can use. These will return data in a much more consumable format, such as JSON or XML. You'll usually also get your own developer key (to track requests against), as well as other possible features that you wouldn't get if going directly against the HTML.
Amazon, in particular, certainly offers this. See https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html for details. (Don't be confused by the naming of "advertising".)

Java API for financial data [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
Improve this question
I am working on my Master's project and I am looking for a substantial amount of financial data about a particular company.
Example: let's say "Apple". I want the historic prices, current market price / ratios, quarterly results and the analyst calls.
I saw couple of posts on StackOverflow about YQL. I think I can get current price and various ratios from Yahoo Finance for free. However for other data, there are companies like Thomson Reuters, Bloomberg, etc. but they seem to have a closed system.
Where can I get an API to fetch various data? Is there anything which will help me get that data? I am fine with raw data as well in any format. Whatever I can get. Could you guys please suggest any API?

A Java library under development is IdylFin, which has convenience methods to download historical data.
Disclaimer: I am the author of this library.

Stephen is right on the money, if you really want a real wealth of data, you're probably gonna have to pay for it.
however, I've been successful on my own private projects by using the "API" spelled out here:
http://www.gummy-stuff.org/Yahoo-data.htm
I've pulled down all the stocks from the S&P 500 quite often, but if you ever publish that data, talk with yahoo. you'll probably have to license it.
btw, all this data is in CSV format, so get a CSV reader/converter etc. their easy to find

This is a Yahoo finance Historical data for "Apple"
http://in.finance.yahoo.com/q/hp?s=AAPL
There is a link at the bottom to download the data. May be this could help

I will suggest a couple of APIs that have financial data that is sometimes hard to find (e.g. quarterly results, analyst calls):
1) http://www.zacksdata.com/zacks-data-api
2) http://www.mergent.com/servius
Both have free trials available.
(Disclosure: My company manages both of these APIs)

A Java example to fetch data from Yahoo finance it given here Obba Tutorial: Using a Java class which fetches stock quotes from finance.yahoo.com

I have tackled this problem in the past.
For price history data, I used yahoo's API. When I say API, I mean I was making an HTTP get request for a CSV file of price history data. Unfortunately, that only gets you data for one company at a time, for a time span you specify. So I first made a list of all the ticker symbols, and iterated over that, calling yahoo's API for each. You might be able to find a website that lists ticker symbols too, and just periodically download that list.
Do this too often and too fast, and their website just might block you. I added some code to limit how frequently I made http requests. I also persisted my data so I would not have to get it again. I would always persist the raw/unprocessed form of data, your code could change in ways that make it tough to use anything else. Avro/Thrift might be an exception, since those support schema evolution.
For other kinds of data, you may not have any API that gives you nice CSV files. I had to cope with that problem many times. Here is my advice.
Sometimes a website calls a restful web service behind the scenes, you can discover that by using firebug. Sometimes it will also require certain headers, which you can also discover using firebug.
If you are forced to work with HTML, there are several java libraries that can help you. apache.commons.http is a library you can use to easily make http requests and handle their responses. Google has an http-client jar too, which is probably worth investigating.
The JSoup API is excellent at parsing HTML data, even when it is poorly formatted, and not XHTML. It works with XML too. Instead of traversing or visiting nodes in the jsoup hierarchy, learn XPath and use that to select what you want. The website may periodically change the format of its web page, that should be easy to cope with and fix if you're using JSoup, and tough to cope with otherwise.
If you have to work with JSON, use the Jackson library to parse it.
If you have to work with CSV, use the OpenCSV library to parse and handle it.
Also, always store the data in the raw, and avoid making unnecessary HTTP requests so you don't get blocked. I have been blocked by google finance a couple times, they can do it. Fortunately the block does expire. You might even want to add a random wait period between requests.

Have you tried Google Finance API. (Please google it ;). I am using it for tracking my portfolio. Could you try http://code.google.com/apis/finance/docs/finance-gadgets.html? There is an example of custom widget and it might tell you if you are barking under the right tree.

You are really asking about a free financial data service ... rather than an API.
The problem is that the data is a valuable commodity. It probably has cost the providers a lot of money to set up their systems, and it costs them even more money to keep those systems running. Naturally, they want a return on their investment, and they do this (in part) by selling their data / services.
(In the case of Yahoo, Google, etc, the data is bought from someone else, and Yahoo/Google will be subject to restrictions on how they can use it. Those restrictions will be reflected in respective ToS; e.g. you are only allowed to access the services "for personal use".)
I think your best bet would be to approach a number of the financial data providers, and ask if they can provide you with free access (subject to whatever restrictions they might want to impose) to their data services. You could get lucky ...

Good data is not free. Its as simple as that. The reason is that all data is ultimately licensed from an exchange like NYSE or NASDAQ.
If you can get some money high resolution historical data is available from Automated Trader.
You should also talk to the business school at your school. If they have finance masters/phd students or masters in financial engineering they should have large repositories of high resolution data for their students.
If you make your question more detailed I can provide a more detailed answer.

This is something that I kick myself for at least once a week. Way back when the internet consisted of Gopher and all that, you were able to log into FTP servers at the NASDAQ and NYSE, and download all kinds of stock history files for free. I had done it, even had it imported to a database and did some stuff with it....but that was probably 10 computers ago, its LONG gone now.

How to implement a social media/website monitoring service?

i would like to implement some kind of service my customers can use to find their company on
a. blogs, forums
b. facebook, twitter
c. review sites
a. blogs, forums
This can only be done by a crawler, right? A crawler looking for the robots.txt on a forum/blog and than optionally reading the content (and of course links) of the forum/blog.
But where to start? Can i use a set of sites to start with crawling? do i have to predefine them or can i use some other searchengine first? E.g. searching in Google for that company and then crawl the SERPs? Legal?
b. facebook, twitter
They have APIs, so hat should not be a problem i think.
c. review sites
I looked at some review site's TOS and they wrote that using an automated software crawling their sites is not permitted. On the other hand, the sites that are relevant to me are not disallowed in their robots.txt. What matters here?
Any other hints are welcome.
Thanks in advance :-)

Honestly, the easiest way to do it would be to start with the search engines. They all have APIs for doing automated searches, so that'd probably give yout he highest return for your time on getting back links/mentions of your client's products or brand.
That won't handle things behind authentication, only public stuff (of course). But it'll give you a good baseline to start with. From there, you could (if you want) use API's or custom-written bots that are given auth creds on the sites, but honestly I think at that point you're missnig the core question, I think.
Is the core question, "Where are we mentioned?" or is the core question really... "What sites are getting traffic to come to us?" In most cases, it's the latter, in which case you can ignore all of what I said previously and just use Google Analytics, or similar software on your client's site to determine where traffic's coming from.
Edit
Ok, so if it's where are we mentioned, I'd still start w/ the search engines as stated. Google's api is pretty easy and it has a SOAP based one that you can pull in as a web reference if you want; example
Re: review sites. If the site's TOS says you can't use automated bots, then it's a good idea not to use automated bots. The robots.txt is not legally binding (it's sort of a good-neighbor thing), and so I wouldn't not use the lack of exclusion there to be permission. Some review sites (more modern ones) might disallow automated scraping of their site, but they might still publish RSS feeds or Atom feeds or have some other API that you can hook into, that's worth checking.

Generic Article Extraction from web pages

Am going to begin my work in article extraction.
The task that I will be doing is to extract the hotel reviews that is posted in different web pages(eg. 1. http://www.tripadvisor.ca/Hotel_Review-g32643-d1097955-Reviews-San_Mateo_County_Memorial_Park_Campground-Loma_Mar_California.html, 2. http://www.travelpod.com/hotel/Comfort_Suites_Sfo_Airport-San_Mateo.html )
I need to do the task in Java and I am just working with Java for the past couple of months alone..
And here comes my questions regarding these.
Is there possibility to extract reviews alone from different web pages in a generic way.
Kindly let me know if there are any API that supports the task in Java.
Also, let me know of your thoughts/sources which will be more helpful for me to attain the task mentioned above.
UPDATE
If any sort of related examples available in net, please post the same since that could be of great use.

You probably need a screen scraping utility for Java like TagSoup or NekoHTML. JSoup is also popular.
However, you also have a bigger legal consideration here when extracting data from a 3rd party website like tripadvisor. Does their policy allow it?

Making commercial Java software (DRM)

I intend to make some software to be sold over internet. I've only created open-source before, so I have really no idea of how to protect it from being cracked and distributed as warez. Bearing in mind that I know like two programms that aren't either cracked or not really useful I decided that the only more or less reliable way may look like this:
Connect to a server and provide licensing info and some sort of hardware summary info
If everything is fine, the server returns some crucial missing parts of the program bound to that certain pc along with the usage limit of say 2 days
That crucial stuff is not saved to hard drive, so it is downloaded every time the program starts, if the programm runs more than 2 days, data is downloaded again
If the same info is used from different computers, suspend the customer account
What do you think about this? It may seem a bit to restrictive, but I'd better make less sales at first then eventually see my precious killer app downloaded for free. Anyways, first I need some basic theory/tutorials/guides about how to ensure that user only uses a certain Java app if he has paid for it, so please suggest some.
Thanks

I work for a company selling protected Java software.
I won't comment on the scheme for user authentication, but I can comment on the online license check.
Don't make it even "work for two days": that's how I pirate most software... Virtual Machine set "back in time" and externally firewalled so that it doesn't "phone home" anymore (that is: only allowing it to contact the server once, to get the trial key), always reimaged from the point where the software got freshly installed and bingo, the 30-days trial (or two days trial) has become a lifetime trial. Why do I do this? To learn how to better protect our app of course ;) (ok, ok, I do it just for fun too)
What we do in our commercial Java software is to check the license at every startup.
We've got hundreds of customers and nobody ever bitched about it. Not once. We generate a unique class at each run, which is different at every run, which depends both on things unique for that launch on the client side and on things generated once on the server side.
In addition to that having the app contact your server at every launch is a great way to gather analytics: download to trial ratio, nb average launches per trial, etc. And it's not nasty anymore than having an Urchin/Google JavaScript tracker on each webpage is nasty.
Simply make it clear to people that your software performs the online licence check: we'got a huge checkbox either on or off saying: "Online licence verification: OK/Failed". And that's it. People know there's a check. If they don't like it, they go use inferior competitor products and life is good.
People are used to live in a wired world.
How often can you not access GMail because your Internet connection is down? How often can you not access FaceBook or SO because your Internet connection is down?
Point is: make as much computation as possible dependent on the server side:
licence check
save user preferences
backup of the data generated by your app
etc.
Nobody will complain. You'll have 0.1% of your user complain and anyway you don't want these users: they're the one who would complain about other things and post negative feedback about your app online. You better have them not to use your software at all and complain about the fact that it requires an always-on Internet connection (which 99.99% of your target demographic and hence they won't care about the complain) rather than actually have them use the app, and complain about other things related to your app.
Regarding decompiling, .class can usually be decompiled back to .java unless you're using a code flow obfuscator that produces valid bytecode but that it impossible to be generated from .java file (hence it is impossible to get back a valid .java file).
String obfuscator helps make it harder to figure out.
Source code obfuscator helps make it harder to figure out.
Bytecode obfuscator like the free Proguard makes it harder (and produce faster code, especially noticeable in the mobile world) to figure out.
If you're shipping Windows/Linux only then you can use a Java-to-native converter like Excelsior Jet (not free and kinda expensive for startups, but it produces native code from which you simply cannot find the .java files back).
As a funny side note you'll see people trying to mess with your online server... At about 30 beta-testers we had already people (which we know where part of the trial) trying to pirate our online servers.

I am sorry to turn you down, but first you should have an idea of what you want to build; then you should prove that your idea not only works, but is also loved by users to the point where they want to pirate it. Thirdly, you have to make sure that the time you are investing in making it "secure" is actually worth the value of the application.
If you sell it for a dollar, and you only sell ten copies, and you spent 100 hours making it secure, you do the math and tell me if your time was worth that little money.
The take-home message here is: everything can be cracked or copied. At the end there are much brighter people than us doing this (iPhone cracking, digital TV, games, etc) and nobody found the silver bullet. Only thing you can do is make it harder to crack your application (often at the expenses of usability, ease of installation, and by cutting corners for some use scenarios). Asking yourself if it's worth the hassle it's always a good starting point.

Don't bother.
The gaming industry has been battling piracy for decades. Online multiplayer games with a central server typically require a subscription to play. That model is fairly resistant to piracy. Pretty much all other games are heavily pirated, despite innumerable attempts at DRM.
Your app will be cracked and pirated, no matter what language you write it in and what tools you use to prevent it. If your DRM actually works, the people who would have pirated it still won't buy it. Furthermore, legitimate users will prefer other products that don't have intrusive DRM. If there are no competing products and yours has any market to speak of, someone will create one.

Unless your application is specifically web based your users will find it to be a huge hassle to require an internet connection in order that they might access the product. What you are suggesting will work, unless it gets broken, like all DRM systems do. I understand the want to protect your intellectual property, but with many companies as examples, these systems are usually broken or the product does much worse because of them.

I have really no idea of how to
protect it from being cracked and
distributed as warez.
First, you'd be better choosing a language besides Java, if this is a concern. This is why C++ is still alive and well in the commercial apps world. Unless you are going to use an actual Java compiler to native exe, I'd reconsider Java for IP protection reasons.
For that matter, even C++ isn't impervious to cracking, but IP prorection vs. cracking are two separate, important concerns.

That's a really tricky task, especially with something running in a VM.
I would say you might be thinking in the right direction. Obfuscating it to make it reasonably hard to modify might prevent people from circumventing the built in licence checks.
However, ultimately, if your application is self-contained, it will always be crackable. If you can build it so that it uses services you provide, than you can probably command it's use.

To paraphrase a Mr Jeff Atwood, it is better to make it easier for your customer to pay you than it is to crack your app. In other words, I think you are attacking the wrong problem. Make buying your product REALLY easy and then your customers won't feel they need to go to the effort of cracking it.

I would have a look at the backlash from the game Spore before deciding on a licensing scheme. They had it phone home, and only allowed so-many installations, etc. etc. etc. Spore was supposed to be their "Killer App" and it really had a hard time simply because of the licensing. You say you are willing to have fewer sales than see people using it for free, but you may want to be careful what you ask for. I was really looking forward to spore (and so were my children) but I never did buy it because of the DRM scheme.
No matter what you do, it'll be cracked in very short order especially if the program really is worth anything.
If you do go with a licensing scheme, make it simple and usable so you are not punishing those that have actually paid for your software. Also, I would avoid any phone-home style checks, that way your customers will be able to continue to use the software even if you don't want to keep paying for that domain 3 years from now.

I see a specific weakness in your example, besides the comment most people already put in that DRM is hard(impossible) to implement and often simple to circumvent.
In your second point:
If everything is fine, the server
returns some crucial missing parts of
the program bound to that certain pc
along with the usage limit of say 2
days
This 2 (or X) days limit will most likely be extremely simple to circumvent, this would just a few minutes to find and patch (crack).
If you really want to have a DRM model the only reasonable way to go is to put at significant part of the application as a web service and require constant connection from the users.
Before you try any of this, be sure to read Exploiting Software and you will think twice before trying to do DRM.

I think, given the context, the most effective form of protection for now would be the limited demo/license key approach: it would give people time to fall in love with your application so that they are willing to pay for it, yet prevent casual copying.
Once you know that your app hit it big, and that crackers provably siphon off a significant portion of your possible earnings, then you can still add additional checks.
Another thing to consider is where your app is going to be used: if it's something people would put on the their laptops to use on the go, network connectivity is not a given.

That is some of the harshest DRM I've ever heard of, your users would hate it.
Also, keep in mind that there are a lot of good Java decompilers out there due to the nature of the language and someone determined enough could just find areas of the program dealing with your DRM and bypass/disable it then recompile it (according to this a recompilation would be unrealistic)... so you would even have to go out of your way to implement your code as complex as possible to prevent a hacker from being successful. (Which could be done with one of those code obfuscation tools they may have out there.)

As long as it's an Internet application, you could restrict it in that manner. Short of cracking the program, this would work fine except for replay attacks.
For example, if I can capture the traffic that is going to your server, and simply replay it back to my program each time, I'm still good. For example, I could create my own "web server" and ensure the program hits that instead of your server.

You should read "Surreptitious Software" from Collberg and Nagra. This book is really good to help you understand how software protection mechanisms work (such as Code obfuscation, watermarking, birthmarking, etc...).
As lorenzog said, total security doesn't exist and software security is like a constant race between software vendors and pirates.
You should use cheap obfuscating transformations (so the overhead they incur isn't killing the performances) to prevent as many attackers (remember most of them are script kiddies) as possible to "steal" your killer algorithms or any secret data.
If you're willing to push the security further you can birthmark your algorithms and watermark your copies in order to find who leaked your creation. But even if you do, this doesn't mean your software is 100% secured. Plus the time you'll spend adding these mechanisms might not be worth the effort.
These concepts are really well explained in the book I mentioned before which is worth reading.

If I had enough reputation points, I'd vote this question down. Commercial software protection is a waste of time, money, and effort for many reasons. Concentrate on making a piece of software worth buying. If your software is popular enough to maintain wide seeding by pirates, you're probably successful enough at that point that you won't even worry about piracy. Anyway, crackers crack software protection mostly for fun. The stronger your protection, the better the challenge it presents and the more they want to crack it. Your best effort will cost you thousands, take months, and be cracked in only days.

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.