I would like to implement some kind of service my customers can use to find their company on
a. blogs, forums
b. facebook, twitter
c. review sites
a. blogs, forums
This can only be done by a crawler, right? A crawler looking for the robots.txt on a forum/blog and then optionally reading the content (and of course the links) of the forum/blog.
But where do I start? Can I use a set of sites to start crawling? Do I have to predefine them, or can I use some other search engine first? E.g. searching Google for that company and then crawling the SERPs? Is that legal?
b. facebook, twitter
They have APIs, so that should not be a problem, I think.
c. review sites
I looked at some review sites' TOS, and they wrote that using automated software to crawl their sites is not permitted. On the other hand, the sites that are relevant to me are not disallowed in their robots.txt. What matters here?
Any other hints are welcome.
Thanks in advance :-)
Honestly, the easiest way to do it would be to start with the search engines. They all have APIs for doing automated searches, so that'd probably give you the highest return for your time in getting back links/mentions of your client's products or brand.
That won't handle things behind authentication, only public stuff (of course). But it'll give you a good baseline to start with. From there, you could (if you want) use APIs or custom-written bots that are given auth creds on the sites, but honestly, at that point I think you're missing the core question.
Is the core question, "Where are we mentioned?" or is the core question really "What sites are sending traffic to us?" In most cases, it's the latter, in which case you can ignore all of what I said previously and just use Google Analytics, or similar software, on your client's site to determine where traffic is coming from.
Edit
OK, so if it's "where are we mentioned", I'd still start with the search engines as stated. Google's API is pretty easy, and it has a SOAP-based one that you can pull in as a web reference if you want (there's example code available).
Re: review sites. If the site's TOS says you can't use automated bots, then it's a good idea not to use automated bots. The robots.txt file is not legally binding (it's more of a good-neighbor convention), so I wouldn't take the lack of an exclusion there as permission. Some review sites (more modern ones) might disallow automated scraping of their site but still publish RSS or Atom feeds, or have some other API that you can hook into; that's worth checking.
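For the feed route, here is a minimal sketch in Java using only the built-in DOM parser. The feed content and the brand name are made up for illustration; in practice you would fetch each site's feed URL over HTTP on a schedule and store the new hits.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FeedMentionScanner {

    // Returns the titles of feed items whose title or description mentions the brand.
    public static List<String> findMentions(String rssXml, String brand) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(rssXml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            String needle = brand.toLowerCase();
            List<String> hits = new ArrayList<>();
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String title = childText(item, "title");
                String description = childText(item, "description");
                if (title.toLowerCase().contains(needle)
                        || description.toLowerCase().contains(needle)) {
                    hits.add(title);
                }
            }
            return hits;
        } catch (Exception e) {
            throw new RuntimeException("could not parse feed", e);
        }
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() > 0 ? nodes.item(0).getTextContent() : "";
    }
}
```

Polling feeds like this stays inside what the sites deliberately publish, which sidesteps most of the TOS questions that scraping raises.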
Related
I am considering using the EWS Java API to access Exchange; specifically, to insert appointments into users' calendars. Now, the online documentation that I immediately find is... sparse, so I would very much appreciate it if someone could give some input on:
1. Is this an API worth using?
2. Are there any online resources for it (that a rather superficial Google search doesn't find)?
3. Are there any alternatives I should use instead? (SyncEx, j-exchange)
I've seen previously posted links to the alternatives. What I am hoping for, however, is that someone can share experience with what is actually good to work with.
Thanks a lot,
Håkan
I would like to access, from an Android app, some data from web pages that are arranged like a catalog/shop.
For a concrete example, this is the URL for Amazon's listing of Mark Twain's books:
http://www.amazon.com/s/ref=nb_sb_noss/180-5768314-5501168?url=search-alias%3Daps&field-keywords=mark+tain&x=0&y=0#/ref=nb_sb_noss_1?url=search-alias%3Daps&field-keywords=mark+twain&rh=i%3Aaps%2Ck%3Amark+twain
1) If I have the above URL how do I obtain e.g.
the number of entries and
for each entry the line with the title (and maybe the image)? Which probably includes how to iterate through all the follow-up pages and access each entry.
What is the best (correct + compatible + efficient) way to do this?
I got the impression that jQuery might be of use. But so far my knowledge of HTML and JavaScript is just about basic.
2) How do I query for the URL for all of Mark Twain's books?
3) Any suggested readings for this and similar kind of topics?
Thanks for your time and have a good day!
Thomas
You would be very well advised not to "screen scrape" other web sites. Besides being difficult to maintain (as the web site changes, etc.), this will actually be against the terms of use/service (TOS) for many web sites.
Instead, see if the desired web sites offer a web service that you can use. These will return data in a much more consumable format, such as JSON or XML. You'll usually also get your own developer key (to track requests against), as well as other possible features that you wouldn't get if going directly against the HTML.
Amazon, in particular, certainly offers this. See https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html for details. (Don't be confused by the naming of "advertising".)
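Once an API is returning structured results, iterating through the follow-up pages that the question asks about reduces to a simple paging loop. Here is a sketch with a stand-in page-fetching function where the real (authenticated, signed) API call would go; the stopping condition and page numbering are assumptions that depend on the actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

public class PagedFetcher {

    // Collects items across result pages until a page comes back empty.
    // fetchPage is a stand-in for one real API request per result page
    // (e.g. one signed Product Advertising API call).
    public static List<String> fetchAll(IntFunction<List<String>> fetchPage) {
        List<String> all = new ArrayList<>();
        for (int page = 1; ; page++) {
            List<String> items = fetchPage.apply(page);
            if (items.isEmpty()) {
                break; // no more results
            }
            all.addAll(items);
        }
        return all;
    }
}
```

Real APIs usually also report a total result count or a total page count; when they do, looping to that limit is safer than probing for an empty page.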
How do I use OAuth within my Java GWT application?
In particular, I want to get a list of users in my Google Apps domain, using this API:
http://code.google.com/googleapps/domain/profiles/developers_guide_protocol.html
I know this sounds like a question that has probably been asked many times before, but I couldn't find any Java code showing how to realize the OAuth steps described in the API above.
I would be glad if someone could share some code, or point me to the right docs.
This tutorial by Matt Raible is easily the best one I've seen so far on OAuth and GWT. He also has a very good picture depicting the authentication flow, which I always find helpful. However, as Matt himself says, the solution is not 100% reliable, but it might still get you part of the way.
With this in mind, it might be better to just go with a pure JavaScript implementation of it. You'll find one such implementation right here. This SO thread might come in handy if you choose that path.
Best of luck to you.
What do you mean in your GWT application?
Do you mean client-side only?
Because on the server you can easily use the Scribe OAuth library.
It has good documentation and is fairly simple to use.
For integrating OAuth and GWT, you should start with Scribe, which handles the OAuth implementation:
https://github.com/fernandezpablo85/scribe-java
Next, you need to create a GWT widget that handles the user's interactions to acquire permission to access their account. Then grab the response token and make the API requests to the external site.
There's no point re-implementing OAuth when Scribe already does it for you; you just need to wire it up. I'd probably aim to use a GWT popup for doing the authentication:
http://gwt.google.com/samples/Showcase/Showcase.html#!CwBasicPopup
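For intuition about what Scribe handles for you, here is a simplified sketch of the HMAC-SHA1 request signing that OAuth 1.0a libraries perform internally, using only the JDK. It omits the oauth_* protocol parameters and some encoding edge cases, so in real code use the library rather than this:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Map;
import java.util.TreeMap;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class OAuthSigner {

    // Builds the OAuth 1.0a signature base string from the sorted request
    // parameters, then HMAC-SHA1 signs it with the concatenated secrets.
    public static String sign(String method, String url,
                              Map<String, String> params,
                              String consumerSecret, String tokenSecret) {
        try {
            // Parameters must be sorted and percent-encoded before signing.
            TreeMap<String, String> sorted = new TreeMap<>(params);
            StringBuilder paramString = new StringBuilder();
            for (Map.Entry<String, String> e : sorted.entrySet()) {
                if (paramString.length() > 0) {
                    paramString.append('&');
                }
                paramString.append(enc(e.getKey())).append('=').append(enc(e.getValue()));
            }
            String baseString = method.toUpperCase() + "&" + enc(url)
                    + "&" + enc(paramString.toString());
            String key = enc(consumerSecret) + "&" + enc(tokenSecret);
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(key.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            return Base64.getEncoder()
                    .encodeToString(mac.doFinal(baseString.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    private static String enc(String s) throws Exception {
        // OAuth requires RFC 3986 percent-encoding; URLEncoder is close but
        // differs on a few characters, so patch those up.
        return URLEncoder.encode(s, "UTF-8")
                .replace("+", "%20").replace("*", "%2A").replace("%7E", "~");
    }
}
```

The resulting signature goes into the Authorization header alongside the other oauth_* parameters; that assembly is exactly the fiddly part Scribe takes care of.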
I have used MS Money for several years now, and due to my "coding interest" it would be great to know where to start learning the basics of programming such an application. Better said: it's not about how to design and write an application, it's about the "bank details". (Just displaying the balance of a certain bank account would be a pleasant first aim for me.)
I would like to do it in C++ or Java, since I'm used to these languages.
Will it be "too big" for a hobby project? I do not know much about all the security issues, the bank server interfaces/technology, etc.
In the first place, assuming the answer is "no", I need a reliable source for learning.
Most of the apps I've worked with read in a file exported from the bank's website, which is relatively straightforward.
If that's the road you're looking to go down you'll need to write code to:
Login to the bank's website to download the file via HTTPS
Either get specs for the file format or reverse engineer it
Apply whatever business rules you choose to the resulting data
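As a sketch of steps 2 and 3, here is a parser for a hypothetical "date,description,amount" CSV export. Real bank export formats (CSV, QIF, OFX) vary widely, so treat the column layout here as an assumption you would replace with the actual spec or your reverse-engineered one:

```java
import java.util.ArrayList;
import java.util.List;

public class BankExportParser {

    public static class Transaction {
        public final String date;
        public final String description;
        public final long amountCents;

        public Transaction(String date, String description, long amountCents) {
            this.date = date;
            this.description = description;
            this.amountCents = amountCents;
        }
    }

    // Parses lines of a hypothetical "date,description,amount" export.
    public static List<Transaction> parse(List<String> lines) {
        List<Transaction> txs = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            // Store money as integer cents to avoid floating-point rounding.
            long cents = Math.round(Double.parseDouble(parts[2]) * 100);
            txs.add(new Transaction(parts[0], parts[1], cents));
        }
        return txs;
    }

    // A first "business rule": the running balance over all transactions.
    public static long balanceCents(List<Transaction> txs) {
        long sum = 0;
        for (Transaction t : txs) {
            sum += t.amountCents;
        }
        return sum;
    }
}
```

Keeping amounts in integer cents (or a decimal type) from the moment of parsing is the one habit worth adopting early in any money-handling code.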
The first thing to remember is that trying to programmatically interact with a banking website without express written permission from the bank will MOST LIKELY be a violation of the website's use agreement, and may land you in more trouble than it's worth.
Second, you DON'T want to start 'learning' programming by trying to tackle something that massive and sensitive. Not that there is anything wrong with the eventual goal, but that's a journey of a thousand leagues and you need to take your first step.
I would say start with a simple programming environment, like Python or Perl. Reason: you don't have to worry about linking, libraries, code generation, etc. Get used to the basics of what you want to achieve functionally; then reimplementing that in C++ or Java would be the next step.
To begin with, focus on learning client-server programming.
Write a client, write a server, learn all about sockets, learn all about TCP programming,
then learn about Secure Sockets Layer (SSL) and Transport Layer Security (TLS).
Once you've done this, try switching to C++ or Java and see if you can repeat the exercise.
There are TONS of tutorials on these topics.
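To give a feel for the socket basics described above, here is a minimal client/server round trip in Java; the one-line echo protocol is just an illustration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class EchoDemo {

    // Starts a one-shot echo server on an ephemeral port and returns the port.
    public static int startServer() {
        try {
            final ServerSocket server = new ServerSocket(0);
            Thread t = new Thread(() -> {
                try (ServerSocket ss = server;
                     Socket client = ss.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    out.println(in.readLine()); // echo one line back
                } catch (Exception e) {
                    e.printStackTrace();
                }
            });
            t.setDaemon(true);
            t.start();
            return server.getLocalPort();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Connects to the server, sends one line, and returns the reply.
    public static String roundTrip(int port, String message) {
        try (Socket socket = new Socket("localhost", port);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()))) {
            out.println(message);
            return in.readLine();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Wrapping the same sockets in SSLServerSocket/SSLSocket is the natural next step once the plain-TCP version works.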
Once you have become used to that, learn what tools and libraries are already available to do the most common things. For example, libcurl is great for creating clients for common internet application protocols (HTTP, HTTPS, FTP and the like).
See if you can create an interactive program that you can "log in to" using your web browser which outputs stuff in XML and formats it using cascading style sheets.
This should lead you into the JavaScript world, where there are powerful tools such as jQuery. If you mix and match these correctly, you will find that development can be a LOT of fun and quite rapid.
:-)
Happy journeying.
I think it's quite a reasonable hobby project; start with a simple ledger, and then you can add features.
A few things I would do to begin such a project:
Decide on an initial feature set. A good start might be just one of the ledgers/accounts - basically balancing a checkbook. Make this general enough that you can have several.
Design a data model. What fields will your ledger have? What restrictions on the values of each?
Choose technologies. What language do you want to program in? How will you persist the data? What GUI do you want: a fat client like MS Money, or a web app?
From there, write up some design notes if warranted and start coding!
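As a sketch of the first two steps (feature set and data model), here is a minimal checkbook-style ledger; the fields chosen are just one possible starting point:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

public class Ledger {

    public static class Entry {
        final String date;
        final String payee;
        final BigDecimal amount; // positive = deposit, negative = withdrawal

        Entry(String date, String payee, BigDecimal amount) {
            this.date = date;
            this.payee = payee;
            this.amount = amount;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    public void add(String date, String payee, String amount) {
        // BigDecimal avoids binary floating-point rounding on money values.
        entries.add(new Entry(date, payee, new BigDecimal(amount)));
    }

    public BigDecimal balance() {
        BigDecimal total = BigDecimal.ZERO;
        for (Entry e : entries) {
            total = total.add(e.amount);
        }
        return total;
    }
}
```

From here, "several accounts" is just a map from account name to Ledger, and persistence is a matter of serializing the entry list.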
You might look into OFX4J, an implementation of the Open Financial Exchange specification, mentioned here and in a comment by #nicerobot.
Are you looking for something mint.com-ish? From my understanding of their security policy this is how they do it: You give them your online account credentials which they give immediately to the bank and get back a "read-only" account login. They then throw away the credentials you provided and use "read-only" credentials to update your metrics every 24 hours. I don't know how they do this or if they have a special relationship with the banks, but it is possible.
I don't think many (if any) banks provide APIs.
Online budget apps in Sweden seem to rely either on exporting transactions in some Excel format, or simply have you "mark all transactions in the bank system, Ctrl-C, then Ctrl-V into a textbox", which is then parsed.
I'm writing a Java program, and I want a function that, given a string, returns the number of Google hits a search formed from that query returns. How can I do this? (Bonus points for the same answer but with Bing instead.)
For instance, googleHits("Has anyone really been far even as decided to use even go want to do look more like?") would return 131,000,000 (or however many there are).
Related: How can I programmatically access the "did you mean" suggestion? (eg searching "teh circuz" returns "did you mean the circus?")
found it: http://code.google.com/apis/ajaxsearch/documentation/#fonje
The Google Terms of Service say this:
5.3 You agree not to access (or attempt to access) any of the Services by any means other than through the interface that is provided by Google, unless you have been specifically allowed to do so in a separate agreement with Google. You specifically agree not to access (or attempt to access) any of the Services through any automated means (including use of scripts or web crawlers) and shall ensure that you comply with the instructions set out in any robots.txt file present on the Services.
Google has ways of making life unpleasant for you / your company if you violate the Terms of Service ...
UPDATE: The second sentence is about the way that you use Google's services, including their published APIs. It is not entirely clear from the wording what is allowed and what is forbidden; literally speaking, "any automated means" is very broad. However, a Java app that performed Google searches, screen-scraped the results, and repackaged them to provide some value-added service would (IMO) be a violation of the TOS. And using Google's published APIs to do the same thing would (IMO) also be a violation.
But that's my opinion, not Google's. And it is the Google opinion that matters. If anyone is thinking of doing something like this, they should contact Google and check that what they are proposing is OK.
The point is that Google is not going to assist people to subvert their search business model. Anyone who thinks they can get away with it based on some clever interpretation of the TOS is going to get burned.
For the first part of the answer, try reading the TOS; for the "did you mean" part, see: http://norvig.com/spell-correct.html
You may be able to do it "legally" using the Google Java Client Library. I don't know for sure, but they may have some methods similar to what you're looking for, and you won't be violating their TOS.
Google Data APIs Library
You can legally access the Google AJAX Feed API through its RESTful interface:
http://code.google.com/apis/ajaxfeeds/documentation/#fonje
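Here is a sketch of the parsing side, assuming the estimatedResultCount field that the AJAX Search API's JSON responses carried; in real code you would fetch the JSON from the REST endpoint first and use a proper JSON parser rather than a regex:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GoogleHits {

    private static final Pattern COUNT =
            Pattern.compile("\"estimatedResultCount\"\\s*:\\s*\"(\\d+)\"");

    // Pulls the estimated hit count out of a search API JSON response.
    public static long estimatedResultCount(String json) {
        Matcher m = COUNT.matcher(json);
        if (!m.find()) {
            throw new IllegalArgumentException("no estimatedResultCount in response");
        }
        return Long.parseLong(m.group(1));
    }
}
```

The same extract-one-number-from-JSON shape works for Bing's API too, just with whatever field name its response format uses.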
Bing still has a developer program where you can call their API in a JSON/XML or SOAP manner:
http://www.bing.com/developers