I'm working on a site containing real estate listings in Spring MVC. I would like to prevent scripts from stealing the content by scraping the site. Does anyone have experience with techniques that can easily be plugged into a Spring MVC environment?
User-agent checking is too simple to circumvent.
One idea I had was to keep track of two counters on the server side:
ipaddress --> (counter xhr request, counter page request)
The page-request counter is incremented by a filter; the XHR-request counter is incremented by a request fired on document ready. If a filter notices the two counters are totally out of sync, the IP is blocked.
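Roughly what I have in mind, as a plain servlet filter. This is only a sketch: the /ping URL (the endpoint the document-ready XHR would hit) and the threshold of 50 are placeholders.

    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import java.io.IOException;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    public class ScrapeDetectionFilter implements Filter {

        private static class Counters {
            final AtomicInteger pageRequests = new AtomicInteger();
            final AtomicInteger xhrRequests = new AtomicInteger();
        }

        private final ConcurrentHashMap<String, Counters> countersByIp =
                new ConcurrentHashMap<>();

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            HttpServletRequest request = (HttpServletRequest) req;
            Counters c = countersByIp.computeIfAbsent(
                    request.getRemoteAddr(), ip -> new Counters());

            if ("/ping".equals(request.getServletPath())) {
                // fired by an XHR from document ready on every page
                c.xhrRequests.incrementAndGet();
            } else {
                c.pageRequests.incrementAndGet();
            }

            // Block when the counters drift too far apart; 50 is arbitrary.
            if (c.pageRequests.get() - c.xhrRequests.get() > 50) {
                ((HttpServletResponse) res).sendError(HttpServletResponse.SC_FORBIDDEN);
                return;
            }
            chain.doFilter(req, res);
        }

        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}
    }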
Could this work or are there easier techniques?
Cheers
Edit:
I am aware that if scrapers are persistent they will find a way to get the content. However, I'd like to make it as hard as possible.
Off the top of my head:
Look for patterns in how your pages are requested. Requests at regular intervals are a flag; so is a regular frequency (four times a day, even at different times during the day).
Require login. Nothing gets shown until the user logs in, so at least the scraper has to have an account.
Mix up the tag names around the content every once in a while. It might break their script. Do this enough times and they'll search for greener pastures.
You can't stop it completely, but you can make it as hard as possible.
One way to make it harder is to change your content URLs frequently, based on time, appending some encrypted flag to the URL.
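A minimal sketch of that idea, assuming an HMAC secret held on the server; the parameter name "t" and the hourly window are arbitrary choices:

    import javax.crypto.Mac;
    import javax.crypto.spec.SecretKeySpec;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class UrlTokens {

        private static final byte[] SECRET = "change-me".getBytes(StandardCharsets.UTF_8);

        // Token changes every hour, so yesterday's scraped URLs stop working.
        public static String tokenFor(String path) throws Exception {
            long window = System.currentTimeMillis() / (60 * 60 * 1000L);
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(SECRET, "HmacSHA256"));
            byte[] sig = mac.doFinal((path + ":" + window).getBytes(StandardCharsets.UTF_8));
            return path + "?t=" + Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
        }

        // Server side: recompute the expected URL and compare before serving content.
        public static boolean isValid(String path, String token) throws Exception {
            return tokenFor(path).equals(path + "?t=" + token);
        }
    }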
Some suggestions are in the links below:
http://blog.screen-scraper.com/2009/08/17/further-thoughts-on-hindering-screen-scraping/
http://www.hyperarts.com/blog/the-definitive-guide-to-blog-content-scraping-how-to-stop-it/
Load the content via AJAX.
Make the AJAX request dynamic, so they can't just go and scrape the AJAX endpoint directly.
Only sophisticated scrapers support the execution of JavaScript. Most scrapers don't run the pages through a real browser, so you can try to use that to your advantage.
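A sketch of what "dynamic" could mean in practice, assuming Spring MVC: the page view issues a one-time token that the document-ready AJAX call must echo back, so a client that never runs the page's JavaScript never reaches the content. All the names here are illustrative.

    import org.springframework.stereotype.Controller;
    import org.springframework.ui.Model;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RequestMethod;
    import org.springframework.web.bind.annotation.RequestParam;
    import org.springframework.web.bind.annotation.ResponseBody;
    import javax.servlet.http.HttpSession;
    import java.util.UUID;

    @Controller
    public class ListingController {

        @RequestMapping(value = "/listings", method = RequestMethod.GET)
        public String page(HttpSession session, Model model) {
            // Fresh token on every page view; the JSP writes it into the
            // document-ready AJAX call.
            String token = UUID.randomUUID().toString();
            session.setAttribute("ajaxToken", token);
            model.addAttribute("ajaxToken", token);
            return "listings";
        }

        @RequestMapping(value = "/listings/data", method = RequestMethod.GET)
        @ResponseBody
        public String data(@RequestParam("token") String token, HttpSession session) {
            if (!token.equals(session.getAttribute("ajaxToken"))) {
                throw new IllegalArgumentException("missing or stale token");
            }
            return "...the listing markup...";
        }
    }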
I have a multiple-page app with login, users, permissions and functional areas. How should I structure my Ember routes? Should I put all the screens in one page? That does not look ideal.
Should I create Ember routes for each page, e.g. for users: create/edit/delete? That looks reasonable to me; I am thinking along these lines.
How should you transition from one page to another, like from the users page to the permissions page? Should I use window.location.replace or something like that, based on a condition, to move to another page?
How can I pass parameters like userId and sessionId to other pages? I don't want to use the GET method. I could use local storage, but I am not sure whether there are better ways and what common practice is generally followed.
I know it depends on the project, but it would be nice to understand: what did you use?
All of your questions are answered in the basic Ember guides: http://emberjs.com/guides/
The big advantage of Ember is that it contains a very sophisticated state engine that manages your routes for you. Once you learn the basics of Ember, your questions will be answered. Another hint: stop thinking in terms of "pages" and start thinking in terms of resources, nested resources and routes. Getting stuck on the concept of a "page" will just get in your way.
I am attempting to capture client/server response time as recorded by the browser using Selenium WebDriver. My Selenium test cases are written in Java. I don't control the code I am testing, and I have tried a variety of solutions, laid out below, but none of them meets my requirements 100%.
At the end of the day, I am looking to be able to surround a test step with start() and stop() logic and save the client/server response time as recorded by the browser to a database for reporting.
If I am missing something obvious, please suggest a different approach!
Things I've tried:
1.) Manually surround the test step with start() and stop() timer logic.
PROS: Simplest solution and it works for both page loads and ajax calls.
CONS: Does not capture the true response time from the browser, and if there is an unusually long wait time on the Selenium side, it falsely inflates the numbers. It also counts things like user input as part of the transaction, which I don't want. I don't necessarily control the Page Objects I am dealing with, so that is not easy to work around.
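For reference, the timer logic in question is nothing more than the following fragment (the locators are placeholders, and driver is an already-initialised WebDriver):

    import java.util.concurrent.TimeUnit;
    import org.openqa.selenium.By;
    import org.openqa.selenium.support.ui.ExpectedConditions;
    import org.openqa.selenium.support.ui.WebDriverWait;

    long start = System.nanoTime();
    driver.findElement(By.id("search")).click();          // the step under test
    new WebDriverWait(driver, 30).until(
            ExpectedConditions.visibilityOfElementLocated(By.id("results")));
    long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);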
2.) Using the Navigation Timing API
PROS: This works great for page loads
CONS: Does not work for AJAX calls. AJAX calls are simply added to the overall page load time and the getEvents() call is not available in Firefox for me to attempt to manually calculate the ajax time.
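For anyone trying option 2, the page-load number can be pulled from Java via the standard Navigation Timing fields:

    import org.openqa.selenium.JavascriptExecutor;

    JavascriptExecutor js = (JavascriptExecutor) driver;
    long pageLoadMs = (Long) js.executeScript(
            "var t = window.performance.timing;"
          + "return t.loadEventEnd - t.navigationStart;");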
3.) Using BrowserMob Proxy
PROS: Can surround a transaction, not just a request, and save it in HAR format.
CONS: I had high hopes for this, but the numbers are not reported from a browser perspective and are thus just as inaccurate as (1). There is also setup overhead in creating a proxy server, and the resulting HAR file does not have client/server response times broken down.
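For reference, option 3 is typically wired up like this with the BrowserMob Proxy Java API (net.lightbody.bmp); note that the HAR it captures is proxy-side, not browser-side, which is exactly the limitation above:

    import net.lightbody.bmp.BrowserMobProxy;
    import net.lightbody.bmp.BrowserMobProxyServer;
    import net.lightbody.bmp.client.ClientUtil;
    import net.lightbody.bmp.core.har.Har;
    import org.openqa.selenium.Proxy;

    BrowserMobProxy proxy = new BrowserMobProxyServer();
    proxy.start(0);
    Proxy seleniumProxy = ClientUtil.createSeleniumProxy(proxy);
    // ...register seleniumProxy in the capabilities used to create the driver...

    proxy.newHar("transaction-name");   // start a named transaction
    // ...drive the steps of the transaction...
    Har har = proxy.getHar();           // proxy-side timings for every request in between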
4.) Firefox and Networking Export plugin
PROS: Nice automated solution
CONS: The export functionality creates a new file for each request but cannot aggregate multiple requests into a transaction. There is also no way to specify the file name, which makes it impossible to read the files back in; they are simply named with an appended timestamp.
5.) Relying on "framework" response times.
PROS: Works, and at least on the surface appears accurate.
CONS: Does not work across frameworks and thus cannot be considered a scalable solution for a busy production site where multiple frameworks are in use.
Things I haven't tried:
1.) JavaScript injection
PROS: Perhaps I could inject JavaScript, like the boomerang plugin, into the site to measure response times.
CONS: May be difficult, and I worry about losing my injection through page events that I may not be aware of or control.
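If I did try it, the injection itself would just be something like the following after each navigation (the script location is assumed; surviving the page's own events is the open question):

    ((JavascriptExecutor) driver).executeScript(
            "var s = document.createElement('script');"
          + "s.src = '/js/boomerang.js';"   // wherever the library is hosted
          + "document.head.appendChild(s);");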
2.) Relying on HTTPWatch plugin
PROS: Appears to do what I want
CONS: There is no Java plugin, and I don't know if I am up for creating a COM-based integration layer when I don't even know if it will suit my needs. I do like its ability to start/stop transactions, though, as opposed to timing individual requests.
3.) YSlow, Google Page Speed and WebPageTest
PROS: Seamless?
CONS: A non-starter since I am behind a firewall, although I am intrigued by how they attach to the requests.
I have a Spring MVC project in Java. This web app can be accessed by multiple users in different browsers. I haven't coded any session bean in my program.
Now I want to 'crash'/'time out' the browsing of one of the users, while the other users go on with their normal, expected browsing. I want to do this to see whether it has any effect on the shared variables.
What kind of coding I need to do for this? Thanks in advance!
It is not at all clear what you are trying to achieve here, but I'm assuming that you are doing this as an experiment ... to see what happens.
You could modify the webapp to implement some special request, or request parameter, or request parameter value that tells the webapp to crash or freeze the request being processed. Then send that request from one browser while others are doing "normal" things.
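A sketch of that special parameter as a servlet filter; the parameter names here are made up, and you would want to guard this so it can never run in production:

    import javax.servlet.*;
    import javax.servlet.http.HttpServletRequest;
    import java.io.IOException;

    public class FaultInjectionFilter implements Filter {

        @Override
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            String fault = ((HttpServletRequest) req).getParameter("simulateFault");
            if ("timeout".equals(fault)) {
                try {
                    Thread.sleep(120_000);   // hold the request past any client timeout
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            } else if ("crash".equals(fault)) {
                throw new ServletException("simulated crash");
            }
            chain.doFilter(req, res);
        }

        @Override public void init(FilterConfig cfg) {}
        @Override public void destroy() {}
    }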
Whether this is going to reveal anything interesting is ... questionable.
Another interpretation is that you are aiming to include timed-out requests and other such things in your normal testing regime. To achieve that, you would need to implement some kind of test harness to automate the sending of requests to your server; i.e. to simulate a number of simultaneous users doing things. There are various test tools for doing that kind of thing.
I have created a JSP page with database connectivity. This page has both HTML content and Java programming. My database consists of a list of IP addresses.
My Java code fetches each IP address and checks whether it is currently alive on the network. So my JSP page loads only after this Java code has performed checks on all the IP addresses. This is why my page loads very late.
Is there any remedy for this, so that my page loads quicker?
You can load all the IP addresses from the DB into an ArrayList, load all the IPs which are alive into another ArrayList, and compare the two lists. This should be much faster.
Separating JSP from Java code is one best practice, but the idea I'll describe here is more generally about separating the retrieval and updating of data from the rendering of the data, which is a common problem to solve.
What you need to do is separate the java code making all the network calls from the JSP which is being rendered. You can have the network calls all being run in one thread, checking each address once per minute or every few minutes, and updating each address' database record with a status. Then when the JSP is called, the JSP just grabs the latest data from the database and displays it (which is how JSP's should be used).
Now, there are numerous ways to accomplish this. If I were doing it myself, I would use the Spring Framework and put the network-calling code in a method annotated with @Scheduled; the network calls and the database update could both be done from that method. Details on how to use Spring are outside the scope of this answer, but hopefully this gives you an idea of the overall approach and one technology you could start investigating.
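A rough sketch of what that could look like. IpAddressRepository is a stand-in for whatever persistence you already have, and you would also need @EnableScheduling (or the XML equivalent) in your configuration:

    import org.springframework.scheduling.annotation.Scheduled;
    import org.springframework.stereotype.Component;
    import java.net.InetAddress;

    @Component
    public class IpStatusChecker {

        private final IpAddressRepository repository; // hypothetical DAO

        public IpStatusChecker(IpAddressRepository repository) {
            this.repository = repository;
        }

        // Runs every minute in the background; the JSP only ever reads the
        // stored results, so it renders immediately.
        @Scheduled(fixedDelay = 60_000)
        public void refreshStatuses() throws Exception {
            for (String ip : repository.findAllAddresses()) {
                boolean alive = InetAddress.getByName(ip).isReachable(2_000);
                repository.updateStatus(ip, alive);
            }
        }
    }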
I think there are two issues:
binding your JSP directly into your actual functionality. It would be preferable to implement some MVC structuring, allowing the JSP to issue commands and to display whether those commands are being executed, whether results are available, etc. E.g. a command from the JSP to your servlet would initiate the processing (in a separate thread) and set state such that the JSP can report that processing is in progress.
Your core functionality is to interrogate the different IP addresses. That could easily be parallelised, such that you issue each IP query on a separate thread (a naive solution, admittedly). Check out the Executor framework for more info.
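A sketch of that naive fan-out with an ExecutorService; the pool size and the per-host timeout are arbitrary:

    import java.net.InetAddress;
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.*;

    public class ParallelPinger {

        public static Map<String, Boolean> pingAll(List<String> ips)
                throws InterruptedException {
            ExecutorService pool = Executors.newFixedThreadPool(20);
            try {
                List<Callable<Boolean>> tasks = new ArrayList<>();
                for (String ip : ips) {
                    tasks.add(() -> InetAddress.getByName(ip).isReachable(2_000));
                }
                // Total wait is bounded by the slowest host, not the sum of all hosts.
                List<Future<Boolean>> results = pool.invokeAll(tasks);

                Map<String, Boolean> statuses = new LinkedHashMap<>();
                for (int i = 0; i < ips.size(); i++) {
                    try {
                        statuses.put(ips.get(i), results.get(i).get());
                    } catch (ExecutionException e) {
                        statuses.put(ips.get(i), false); // treat lookup errors as "down"
                    }
                }
                return statuses;
            } finally {
                pool.shutdown();
            }
        }
    }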
-You should load the JSP page with only the IP list; after it's loaded, you can fetch the IP address statuses with AJAX requests.
-The earlier-mentioned idea of caching statuses is a great one.
-You can also improve the interface (paging, lazy-loading lists, etc.) to reduce the number of IP addresses to check at once.
I'm using Java + Struts2 + JSP as the web application framework.
I have to pass some huge objects through Struts actions to my JSP pages. This makes the pages very heavy to load, and on top of that it eats up the server's bandwidth.
Is there any way to send compressed objects via struts2 to a jsp page and decompress them there?
The question is a bit vague on how the objects are passed from the action classes to the JSP pages, but it appears to me that instead of forwarding the request during its execution, the application is issuing a client-side redirect to a new page.
In the JSP/servlet model, forwards are internal to the server, and do not result in a new request by the client. On the other hand, redirects will result in the browser being forced to go the new page as indicated by the server.
If possible, you should investigate the use of forwards, which is the default mechanism in Struts to display the view. This alone will reduce the server's bandwidth requirements.
On the topic of the large memory consumption in JSP pages, you might want to profile the application to determine whether the 'huge' load time of the JSPs is due to these objects or to the additional client request explained above. Without such a profile report indicating CPU and memory usage, it is presumptuous to claim that object bloat is responsible for the high page load times.
If you need to move data inside your server side, check this:
http://www.google.de/search?q=java+gzip&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-GB:official&client=firefox-a
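For completeness, here is what that search points at: round-tripping a byte payload through java.util.zip.

    import java.io.*;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class Gzip {

        public static byte[] compress(byte[] data) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(data);
            }
            return bos.toByteArray();
        }

        public static byte[] decompress(byte[] gzipped) throws IOException {
            try (GZIPInputStream gz =
                         new GZIPInputStream(new ByteArrayInputStream(gzipped))) {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = gz.read(buf)) > 0) {
                    bos.write(buf, 0, n);
                }
                return bos.toByteArray();
            }
        }
    }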
If you want to improve download speed for clients, enable gzip compression in your webserver.
Sounds like you need to unzip files with JavaScript. This answer actually provides a link to just such JavaScript. I don't know how practical the idea is, though.