I have some problems with a Java app I'm developing. I'm using the HtmlCleaner 2.2 library (the one used in the web-harvest project) and have no problem getting the source of a page.
My problem starts when I want to recursively browse the site and build a tree of categories with products as children. I guess that each time the script visits a page it counts as a user entering the site, so after it visits 15 or 20 category or product pages, the website's firewall blocks my IP for about an hour.
Two solutions come to mind. First: use proxies, so I don't get banned and can download faster using threads. Second: open only one connection. I suspect using proxies is a bad idea, so I want to ask: in simple code, what is the best way to recursively visit about 300,000 products of a website without being banned? Fast and simple, please.
Putting the source in a string is enough to count a page as visited.
I don't want a debate about the best way, only a well-justified answer.
Clarification: this is a school task, I'm not making any profit from it, and I'm trying to be as harmless to the site as possible.
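For reference, here is a minimal sketch of the single-connection, throttled approach mentioned above, using HtmlCleaner's clean(URL) call. The delay, depth limit, and starting URL are arbitrary placeholders, and a real crawler would also track already-visited pages:

import java.net.URL;

import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;

public class PoliteCrawler {
    private static final long DELAY_MS = 2000; // arbitrary pause between requests

    private final HtmlCleaner cleaner = new HtmlCleaner();

    // Fetch one page, extract what you need, wait, then follow its links.
    public void crawl(String url, int depth) throws Exception {
        if (depth <= 0) {
            return;
        }
        TagNode root = cleaner.clean(new URL(url));
        // ... pull the category/product data you need out of root here ...

        Thread.sleep(DELAY_MS); // throttle so requests arrive at a human-like pace
        for (TagNode link : root.getElementsByName("a", true)) {
            String href = link.getAttributeByName("href");
            if (href != null && href.startsWith("http")) {
                crawl(href, depth - 1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        new PoliteCrawler().crawl("http://example.com/catalog", 2);
    }
}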
If your spidering provides legitimate business value to the site you are scraping, you could contact the website owner and ask for either a data feed or an exclusion from their banning algorithm (after all, it's often beneficial for people to have their products exposed to prospective buyers).
UPDATE
Based on your statement that this is a school task, ask your teacher for assistance in finding a website that is willing to be bombarded with traffic in the interest of education, or reach out to the website owner, explain what you are doing, and ask for permission.
I'm looking for a database that would allow me to store most of my objects in memory. Basically I want to keep everything in memory except some rarely used data (history of changes, etc.).
I'm looking for:
a simple API for Java, preferably non-ORM
ACID is not required (well, D is)
some support for queries, but nothing fancy
The idea is to operate on a model in memory, store any "command" that mutates the model in the database, and periodically synchronize the model to the database (like Prevayler does).
Which database matches my needs? (I'll use Postgres or H2 if there isn't anything simpler.)
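For illustration, the Prevayler-style idea described above can be sketched like this. The Command interface, the Model class, and the journal format are invented for the example; a real version would serialize the command objects and replay the journal at startup:

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

interface Command {
    void applyTo(Model model);
}

class Model {
    final Map<String, String> data = new HashMap<String, String>();
}

class PutCommand implements Command {
    private final String key, value;
    PutCommand(String key, String value) { this.key = key; this.value = value; }
    public void applyTo(Model model) { model.data.put(key, value); }
    public String toString() { return "PUT " + key + "=" + value; }
}

public class CommandJournal {
    private final Model model = new Model();

    // Apply the command to the in-memory model, then append it to a journal
    // so the model can be rebuilt later by replaying the journal.
    public synchronized void execute(Command cmd) throws IOException {
        cmd.applyTo(model);
        FileWriter journal = new FileWriter("journal.log", true);
        try {
            journal.write(cmd + "\n");
        } finally {
            journal.close();
        }
    }

    public static void main(String[] args) throws IOException {
        CommandJournal store = new CommandJournal();
        store.execute(new PutCommand("user:1", "alice"));
    }
}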
You need one of the object databases: http://en.wikipedia.org/wiki/Comparison_of_object_database_management_systems
You should use Terracotta. It is usually used for caching, but it's exactly what you are asking for, except that its querying abilities are sparse.
Update:
The previous link was to their "enterprise" edition, but they also have the open source project Ehcache, which fits your needs and which their enterprise product is based on.
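If it helps, basic Ehcache usage looks roughly like this (the cache name, sizes, and keys are made up; the six-argument constructor is the Ehcache 2.x one):

import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class EhcacheExample {
    public static void main(String[] args) {
        CacheManager manager = CacheManager.create(); // default configuration
        // name, max in-memory elements, overflowToDisk, eternal, TTL seconds, TTI seconds
        Cache cache = new Cache("model", 100000, false, true, 0, 0);
        manager.addCache(cache);

        cache.put(new Element("user:1", "alice"));
        Element hit = cache.get("user:1");
        System.out.println(hit != null ? hit.getObjectValue() : "miss");

        manager.shutdown();
    }
}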
I'm new to ElasticSearch, which I'm trying to use to help a cool startup that needs a search engine.
My use case is:
Each user of the website has a personal space where he can create text documents
Each user can choose whether to share his content with a limited set of people (friends)
Each user can create public content
Users may be from different countries
Users may search for things other than posts (for example, search for another user)
Our data is hosted in CouchDB.
1) Should I create one single index, or is it good practice to create an index per user?
I've read that it's not a bad idea to put everything in the same index so you can search across many different things at the same time.
But I noticed that ES provides the ability to search on multiple indexes, so why not create an index per user?
Is it a problem because the index names are passed in the URL and the maximum URL size is limited, or is it something else?
2) Should I create one index or one type per JSON document?
I mostly have two different types of documents to index: posts and users.
If I want to be able to search both of them at the same time, am I supposed:
to create an index for posts and an index for users, and search both indexes?
to create one index with two different types, and search both types of that index?
I don't really know what the difference would be.
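To make the difference concrete, with the (pre-2.x) Java client both options end up as a small change in how you build the search request; the index, type, and query values below are made up:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;

public class MultiIndexSearch {
    public static void main(String[] args) {
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Option A: one index per document kind, search both indexes at once.
        SearchResponse acrossIndexes = client.prepareSearch("posts", "users")
                .setQuery(QueryBuilders.queryString("john"))
                .execute().actionGet();

        // Option B: one index with two types, restrict the search to both types.
        SearchResponse acrossTypes = client.prepareSearch("content")
                .setTypes("post", "user")
                .setQuery(QueryBuilders.queryString("john"))
                .execute().actionGet();

        System.out.println(acrossIndexes.getHits().getTotalHits());
        System.out.println(acrossTypes.getHits().getTotalHits());
        client.close();
    }
}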
3) Is it normal to have to create multiple rivers of the same type?
For example, the CouchDB river provides a "filter" attribute so that you receive only the documents matching your filter.
So if I want to index my posts and my users into two separate indexes or types, my first attempt would be to create two CouchDB rivers, each with a different filter and a different index and/or type.
Is that the way to do it?
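If I recall correctly, a river was registered by indexing a _meta document into the _river index, so two filtered CouchDB rivers would be set up roughly like this (the host, database, filter, index, and type names are all hypothetical, and the river mechanism has since been removed from Elasticsearch):

import org.elasticsearch.client.Client;

public class CreateCouchDbRivers {
    // One river per target index, each with its own CouchDB filter.
    public static void createRivers(Client client) {
        String postsRiver = "{"
                + "\"type\":\"couchdb\","
                + "\"couchdb\":{\"host\":\"localhost\",\"port\":5984,\"db\":\"appdb\",\"filter\":\"app/posts_only\"},"
                + "\"index\":{\"index\":\"posts\",\"type\":\"post\"}"
                + "}";
        String usersRiver = "{"
                + "\"type\":\"couchdb\","
                + "\"couchdb\":{\"host\":\"localhost\",\"port\":5984,\"db\":\"appdb\",\"filter\":\"app/users_only\"},"
                + "\"index\":{\"index\":\"users\",\"type\":\"user\"}"
                + "}";
        client.prepareIndex("_river", "posts_river", "_meta").setSource(postsRiver).execute().actionGet();
        client.prepareIndex("_river", "users_river", "_meta").setSource(usersRiver).execute().actionGet();
    }
}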
I'm doing some text mining on web pages. Currently I'm working with Java, but maybe there are languages more appropriate for what I want to do.
Examples of some things I want to do:
Determine the character type of a word based on its parts (letters, digits, symbols, etc.), such as Alphabetic, Number, Alphanumeric, Symbol, etc. (there are more types).
Discover stop words based on statistics.
Discover some grammatical classes (verb, noun, preposition, conjunction) based on statistics and some logic.
I was thinking about using Prolog and R (I don't know much about these languages), but I don't know if they are good for this, or whether another language would be more appropriate.
Which should I use? Good libraries for Java are welcome too.
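As a concrete example of the first item (classifying a token by its character makeup), here is a plain-Java sketch; the category names are just placeholders:

public class TokenTypes {
    enum CharType { ALPHABETIC, NUMERIC, ALPHANUMERIC, SYMBOL, MIXED }

    // Inspect every character of the token and report what kinds it contains.
    static CharType classify(String token) {
        boolean hasLetter = false, hasDigit = false, hasOther = false;
        for (char c : token.toCharArray()) {
            if (Character.isLetter(c)) hasLetter = true;
            else if (Character.isDigit(c)) hasDigit = true;
            else hasOther = true;
        }
        if (hasOther) return (hasLetter || hasDigit) ? CharType.MIXED : CharType.SYMBOL;
        if (hasLetter && hasDigit) return CharType.ALPHANUMERIC;
        return hasLetter ? CharType.ALPHABETIC : CharType.NUMERIC;
    }

    public static void main(String[] args) {
        System.out.println(classify("hello")); // ALPHABETIC
        System.out.println(classify("2012"));  // NUMERIC
        System.out.println(classify("mp3"));   // ALPHANUMERIC
        System.out.println(classify("$%&"));   // SYMBOL
    }
}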
Python. It has a huge number of libraries in this area. I have no knowledge of Prolog or R, but Python is definitely a lot better than Java for text mining and AI work.
I highly recommend Perl. It has a lot of text-processing features, web search and parsing modules, and much more. Take a look at the available modules (over 23,000 and growing) at CPAN.
I think Apache Solr and Nutch provide the framework for that, and on top of it you can extend things for your requirements.
Java has some basic support, but nothing like those two products; they are awesome!
HtmlUnit might give you some good APIs for fetching web pages and traversing DOM elements by XPath. I have used it for some time to perform simple to fairly complex operations.
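A rough example of the HtmlUnit fetch-and-XPath pattern mentioned above; the URL and the XPath expression are placeholders:

import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitXPathExample {
    public static void main(String[] args) throws Exception {
        WebClient webClient = new WebClient();
        // Fetch a page and pull elements out of the DOM by XPath.
        HtmlPage page = webClient.getPage("http://example.com/");
        List<?> headings = page.getByXPath("//h1");
        for (Object node : headings) {
            System.out.println(((HtmlElement) node).asText());
        }
    }
}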
I am new to this kind of computing. I don't know the existing distance functions that could help me calculate the distance between two arrays of doubles. Can someone suggest at least 10 distance functions so that I can select the few among them that best suit my problem domain? I just want to calculate the distance between two sets for a scientific approach to my problem domain. I also want to know whether I have to implement them manually or whether there is a Java API that covers most distance functions. Suggestions would help me minimize my effort and save time. :)
Providing you with code is not really going to help. What you need to do is read up on the mathematics of the various measures of distance, and figure out which is most appropriate based on that knowledge.
You could start by reading the Wikipedia page on Distance and the linked pages and resources.
Only when you've decided on an appropriate measure do you need to go looking for code. In a lot of cases, it is probably simplest to implement the measure yourself.
Alternatively, if you want us to provide sensible suggestions of measures that are appropriate to your problem domain, tell us what the problem domain is.
Are we talking about statistical distance between two samples? If so, there is an abundance of methods, each one suiting a different problem.
If your problem domain is simple, subtracting the sample means (averages) could suffice. For more complex data, the Earth Mover's Distance is common, though newer and more robust methods (such as kernel functions) are available.
Coding is the least of your problems. You must provide a more accurate definition of your problem before I can further assist you.
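For concreteness, here is what a few of the standard vector distances look like on plain double arrays in Java (the inputs in main are arbitrary). If I remember correctly, Apache Commons Math also ships ready-made implementations of several such measures.

public class Distances {
    // Euclidean (L2) distance between two equal-length arrays.
    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Manhattan (L1) distance.
    static double manhattan(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    // Chebyshev (L-infinity) distance.
    static double chebyshev(double[] a, double[] b) {
        double max = 0.0;
        for (int i = 0; i < a.length; i++) {
            max = Math.max(max, Math.abs(a[i] - b[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        double[] y = {4.0, 6.0, 3.0};
        System.out.println(euclidean(x, y)); // 5.0
        System.out.println(manhattan(x, y)); // 7.0
        System.out.println(chebyshev(x, y)); // 4.0
    }
}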
I'm wondering what a good online judge is for just practicing algorithms. I'm currently not very good at writing algorithms, so something easy (and not too frustrating) would probably be good.
I've tried the UVa online judge, but it took me about 20 tries to get the first example question right; there was absolutely no documentation on how to read the input, etc. I've read about TopCoder, but I'm not really looking to compete, merely to practice.
Take a better look at TopCoder. Yes, they have competitions, but you can still easily just "play" by yourself. You are given a goal and a time limit, you choose your language, and then you code it. You can view the source code of the best coders to improve yourself.
I have used TopCoder for a while and have never been in any competition. Check it out.
You may also want to check out Project Euler. Not a judge, but there are mathematical problems and solutions available for many languages.
Have a look at SPOJ
CodingBat might give you some good practice. It responds instantly with test results.
This is a year old by now, so my answer is for future stumblers.
The ACM-ICPC Live Archive has a lot of great problems, in a lot of different areas. (Project Euler is also great, but the problems are all number-theoretic.) And hoop-jumping is normal with these things... last I checked, Facebook Puzzles required you to email a zip file containing the code and an Ant buildfile, and they took a long time to get back to you.
I've only sent Java code to UVa, so I'll elaborate a little on the Java particulars for anyone else who's struggling. Your class must be called Main, and its entry point must be the main method. You read from System.in. If you're on a Unix-y platform, after compiling you can use
java Main < input.txt
to test your program.
The presentation has to be exact. For example, if they say "outputs should be separated by a blank line," that does not mean, "follow each output with a blank line." Finally, don't be afraid to check out their forums.
Reference: http://online-judge.uva.es/board/viewtopic.php?t=7429
(In their sample code, they read the input byte-by-byte. Don't do that; use Scanner instead. It's also not necessary to have the main method create an instance of the class. You can go 100% static, and often the problems are small enough that OOP doesn't buy you anything.)
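Putting those points together, a typical UVa Java submission has roughly this shape (the "sum two integers per line" task is invented just to show the structure):

import java.util.Scanner;

// Class must be named Main; read from System.in, write to System.out.
public class Main {
    public static void main(String[] args) {
        Scanner in = new Scanner(System.in);
        StringBuilder out = new StringBuilder();
        while (in.hasNextInt()) {
            int a = in.nextInt();
            int b = in.nextInt();
            out.append(a + b).append('\n');
        }
        System.out.print(out);
    }
}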