I'm new to ElasticSearch, which I'm trying to use to help a startup that needs a search engine.
My use case is:
Each user of the website has a personal space where they can create text documents
Each user can choose whether to share their content with a limited set of people (friends)
Each user can create public content
Users may be from different countries
Users may search for things other than posts (for example, for another user)
Our data is hosted in CouchDB.
1) Should I create one single index, or is it good practice to create an index per user?
I've read it's not a bad idea to put everything in the same index so you can search over many different things at the same time.
But I noticed ES can search across multiple indices, so why not create an index per user?
Is it a problem because index names are passed in the URL, whose maximum length is limited, or is there something else?
2) Should I create one index or one type per kind of JSON document?
I mostly have 2 different kinds of documents to index: posts and users.
If I want to be able to search on both of them at the same time, am I supposed:
To create an index for posts and an index for users, and search both of them?
To create one index with 2 different types, and search on both types of that index?
I don't really see what the difference would be.
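To make it concrete, as far as I understand both layouts can be queried in a single request; a rough sketch of the two options (index and type names are only placeholders):

    # Option A: two indexes, one search request
    curl -XGET 'http://localhost:9200/posts,users/_search?q=john'

    # Option B: one index with two types, one search request
    curl -XGET 'http://localhost:9200/myindex/post,user/_search?q=john'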
3) Is it normal to have to create multiple rivers of the same type?
For example, the CouchDB river provides a "filter" attribute so that you only receive the documents matching your filter.
So if I want to index my posts and my users in 2 separate indexes or types, my first try would be to create 2 CouchDB rivers, each with a different filter and a different index and/or type.
Is that the right way to do it?
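To illustrate, that first try would look roughly like these two river definitions (just a sketch: the database, filter and index names are placeholders, and each filter points to a CouchDB filter function I would still have to write):

    curl -XPUT 'http://localhost:9200/_river/posts_river/_meta' -d '{
        "type" : "couchdb",
        "couchdb" : { "host" : "localhost", "port" : 5984, "db" : "mydb", "filter" : "app/posts_only" },
        "index" : { "index" : "posts", "type" : "post" }
    }'

    curl -XPUT 'http://localhost:9200/_river/users_river/_meta' -d '{
        "type" : "couchdb",
        "couchdb" : { "host" : "localhost", "port" : 5984, "db" : "mydb", "filter" : "app/users_only" },
        "index" : { "index" : "users", "type" : "user" }
    }'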
I'm looking for a database that would allow me to store most of the objects in memory. Basically I want to keep everything in memory except some rarely used data (history of changes, etc.).
I'm looking for:
simple API for java, preferably non-ORM
ACID is not required (well, D is)
some support for queries, but nothing fancy
The idea is to operate on a model in memory, store any "command" that mutates the model in the database, and periodically synchronize the model to the database (like Prevayler does); roughly the idea sketched below.
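A minimal, self-contained sketch of that idea (all names here are made up for illustration, not taken from any particular library):

    import java.util.ArrayList;
    import java.util.List;

    // Mutate the in-memory model only through commands, and record every
    // command in durable storage before applying it.
    interface Command {
        void applyTo(List<String> model);
    }

    class AddUser implements Command {
        private final String name;
        AddUser(String name) { this.name = name; }
        public void applyTo(List<String> model) { model.add(name); }
    }

    class CommandLog {
        private final List<String> model = new ArrayList<String>();   // the in-memory model
        private final List<Command> store = new ArrayList<Command>(); // stand-in for the database

        void execute(Command command) {
            store.add(command);     // persist the command first (here: just kept in a list)
            command.applyTo(model); // then mutate the in-memory model
        }
    }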
Which database matches my needs? (I'll use postgres or H2 if there isn't anything simpler).
You need an object database; see this comparison: http://en.wikipedia.org/wiki/Comparison_of_object_database_management_systems
You should use Terracotta. It is usually used for caching, but it's exactly what you are asking for, except that its "querying" abilities are sparse.
Update:
The previous link was to their "enterprise" edition, but they have the open source project Ehcache, which fits your needs and which their enterprise product is based on.
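A minimal sketch of what using it might look like, assuming Ehcache 2.x on the classpath and its default (failsafe) configuration:

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class ModelStore {
        public static void main(String[] args) {
            CacheManager manager = CacheManager.create();    // default configuration
            manager.addCache("model");                        // in-memory cache with default settings
            Cache cache = manager.getCache("model");

            cache.put(new Element("user:42", "some value"));  // keep the hot objects in memory
            Element e = cache.get("user:42");
            Object value = (e != null) ? e.getObjectValue() : null;
            System.out.println(value);

            manager.shutdown();
        }
    }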
I'll preface this by saying that I understand how a hash table works; however, I'm not sure how I would go about implementing one from scratch using only primitives.
Would anyone be able to provide a Java code implementation of a hash table using only arrays?
How would I even start writing a hash table in Java?
How would I code a linked-list hash table, again using only primitives?
Cheers!
The code in OpenJDK can be pretty hard to understand, so I'll sketch a short idea of how to do it...
One way I did it recently was to use the array itself as a symbol table. The indices of the array are then the keys (hash keys) and the elements are the values (whatever you want to store). Since arrays have a fixed size and hash keys can be any integer, we are faced with a challenge: cropping the hash values so they fall in the same range as the size of the array. If, say, the array has a length of 5, the keys need to be between 0 and 4; otherwise we would place values into slots outside of the array, which means lots and lots of exceptions.
This challenge becomes especially fun when you'd like to avoid collisions...
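Here is a minimal sketch of that idea with separate chaining (a tiny linked list per slot) to handle collisions; String keys and int values are chosen just to keep it short, and there is no resizing:

    public class SimpleHashTable {

        // One node of the per-slot linked list
        private static class Entry {
            final String key;
            int value;
            Entry next;
            Entry(String key, int value, Entry next) {
                this.key = key;
                this.value = value;
                this.next = next;
            }
        }

        private final Entry[] buckets;

        public SimpleHashTable(int capacity) {
            buckets = new Entry[capacity];
        }

        // Crop the hash code into the range 0 .. buckets.length - 1
        private int index(String key) {
            return (key.hashCode() & 0x7fffffff) % buckets.length;
        }

        public void put(String key, int value) {
            int i = index(key);
            for (Entry e = buckets[i]; e != null; e = e.next) {
                if (e.key.equals(key)) {          // key already present: overwrite
                    e.value = value;
                    return;
                }
            }
            buckets[i] = new Entry(key, value, buckets[i]);   // prepend new node
        }

        public Integer get(String key) {
            for (Entry e = buckets[index(key)]; e != null; e = e.next) {
                if (e.key.equals(key)) {
                    return e.value;
                }
            }
            return null;                          // not found
        }
    }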
A lot of help can be found on this Princeton page.
Good luck!
I have some problems with a Java app I'm developing. I'm using the HtmlCleaner 2.2 library (the one used in the web-harvest project) and have no problem getting the source of a page.
My problem starts when I want to recursively browse the site and get a tree of categories with products as children. I guess that each time the script visits a page it counts as a user entering the site, so when it visits 15 or 20 category or product pages, the website firewall blocks my IP for about an hour.
Two solutions come to mind: first, use proxies, so I don't get banned and can download faster using threads; second, open only one connection. I guess it's a bad idea to use proxies, so I want to ask: in simple code, what is the best way to recursively visit about 300,000 products of a website without being banned? Fast and simple.
Putting the source in a string is enough to count as visited.
I don't want a debate about the best way, only a well-justified one.
Clarification: this is a school task, I'm not making any profit from it, and I'm trying to be as harmless as possible to the site.
If your spidering provides legitimate business value to the site you are scraping, you could contact the website owner and ask for either a data feed or an exclusion from their banning algorithm (after all, it's often beneficial for people to have their products exposed to prospective buyers).
UPDATE
Based on your statement that this is a school task, ask your teacher for assistance in finding a website that is willing to be bombarded with traffic in the interest of education, or reach out to the website owner, explain what you are doing, and ask for permission.
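If you do get permission, the gentlest approach is a single sequential connection with a fixed delay between requests; a minimal sketch with HtmlCleaner (the 2-second delay is just an assumption, use whatever rate the owner agrees to):

    import org.htmlcleaner.HtmlCleaner;
    import org.htmlcleaner.TagNode;
    import java.net.URL;

    public class PoliteFetcher {
        private static final long DELAY_MS = 2000;      // assumed delay between requests
        private final HtmlCleaner cleaner = new HtmlCleaner();

        // Fetch one page, then pause so the next request is not immediate.
        public TagNode fetch(String url) throws Exception {
            TagNode root = cleaner.clean(new URL(url));  // one request at a time, no threads
            Thread.sleep(DELAY_MS);
            return root;
        }
    }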
I'm doing some text mining on web pages. Currently I'm working with Java, but maybe there are more appropriate languages for what I want to do.
Example of some things I want to do:
Determine the character type of a word based on its parts (letters, digits, symbols, etc.) as Alphabetic, Number, Alphanumeric, Symbol, etc. (there are more types); see the sketch after this list.
Discover stop words based on statistics.
Discover grammatical classes (verb, noun, preposition, conjunction) based on statistics and some logic.
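For the first item, something like this rough Java sketch is what I have in mind (the category names are just my own labels):

    public class CharTypeClassifier {

        // Classify a token by the kinds of characters it contains.
        public static String classify(String word) {
            boolean hasLetter = false, hasDigit = false, hasOther = false;
            for (char c : word.toCharArray()) {
                if (Character.isLetter(c)) {
                    hasLetter = true;
                } else if (Character.isDigit(c)) {
                    hasDigit = true;
                } else {
                    hasOther = true;
                }
            }
            if (hasLetter && !hasDigit && !hasOther) return "Alphabetic";
            if (hasDigit && !hasLetter && !hasOther) return "Number";
            if (hasLetter && hasDigit && !hasOther) return "Alphanumeric";
            if (hasOther && !hasLetter && !hasDigit) return "Symbol";
            return "Mixed";
        }

        public static void main(String[] args) {
            System.out.println(classify("hello"));   // Alphabetic
            System.out.println(classify("42"));      // Number
            System.out.println(classify("h2o"));     // Alphanumeric
            System.out.println(classify("!?"));      // Symbol
        }
    }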
I was thinking about using Prolog and R (I don't know much about these languages), but I don't know if they are good for this, or if maybe another language would be more appropriate.
Which should I use? Good libraries for Java are welcome too.
Python! It has a huge number of libraries in this area.
I've got no knowledge of Prolog or R, but Python is definitely a lot better than Java for text mining and AI stuff...
I highly recommend Perl. It has a lot of text-processing features, web search and parsing, and much more. Take a look at the available modules (over 23,000 and growing) at CPAN.
I think Apache Solr and Nutch provide the framework for that, and on top of it you can extend them for your requirements.
Java itself has some basic support, but nothing like the above two products; they are awesome!
HtmlUnit might give you some good APIs for fetching web pages and traversing DOM elements by XPath. I have used it for some time to perform simple to more complex operations.
I'm working on a project that consists of a website that connects to NCBI (National Center for Biotechnology Information) and searches for articles there. The thing is that I have to do some text mining on all the results.
I'm using Java for the text mining and AJAX with ICEfaces for the development of the website.
What I have:
A list of articles returned from a search.
Each article has an ID and an abstract.
The idea is to get keywords from each abstract text.
Then compare all the keywords from all the abstracts and find the ones that are the most repeated, so that I can show on the website the related words for the search.
Any ideas ?
I searched a lot on the web, and I know there is Named Entity Recognition, Part-Of-Speech tagging, and the GENIA thesaurus for NER on genes and proteins; I have already tried stemming, stop word lists, etc.
I just need to know the best approach to solve this problem.
Thanks a lot.
I would recommend using a combination of string tokenizing and POS tagging to extract all the nouns out of each abstract, then using some sort of dictionary/hash to count the frequency of each of these nouns, and finally outputting the N most prolific ones. Combining that with some other intelligent filtering mechanisms should do reasonably well at giving you the important keywords from the abstract.
For POS tagging, check out the POS tagger at http://nlp.stanford.edu/software/index.shtml
However, if you are expecting a lot of multi-word terms in your corpus, then instead of extracting just nouns you could take the most prolific n-grams for n = 2 to 4.
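A minimal sketch of the counting-and-ranking step (the noun list is assumed to come from whatever POS tagger you pick):

    import java.util.*;

    public class KeywordCounter {

        // nouns: all nouns extracted from all abstracts, e.g. by a POS tagger
        public static List<String> topKeywords(List<String> nouns, int n) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String noun : nouns) {
                String key = noun.toLowerCase();
                Integer c = counts.get(key);
                counts.put(key, c == null ? 1 : c + 1);
            }
            // Sort entries by frequency, descending
            List<Map.Entry<String, Integer>> entries =
                    new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
            Collections.sort(entries, new Comparator<Map.Entry<String, Integer>>() {
                public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                    return b.getValue() - a.getValue();
                }
            });
            List<String> top = new ArrayList<String>();
            for (int i = 0; i < n && i < entries.size(); i++) {
                top.add(entries.get(i).getKey());
            }
            return top;
        }
    }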
There's an Apache project for that: OpenNLP, an open source Apache project. I haven't used it, and it's in the incubator, so it may be a bit raw.
This post from Jeff's search engine cafe has a number of other suggestions.
This might be relevant as well:
https://github.com/jdf/cue.language
It has stop words, word and ngram frequencies, ...
It's part of the software behind Wordle.
I ended up using Alias-i's LingPipe.