Low-level Java-based file repository [closed]

Does anyone know of a library that will help me build a file store in Java? I am not looking for something like JCR. Rather, I need to build something that stores millions of files/terabytes of data, de-duped by hash, with metadata for each file. Metadata might include MIME type, filenames, dates, size, etc. (A hash might correspond to various filenames, dates, etc.)
I know this is not overly difficult, but I don't want to reinvent the wheel if the wheel already exists. For example, files have to be sorted into a directory hierarchy on disk based on part of the hash to avoid exceeding the maximum number of files the OS will allow per directory. A web service needs to be written to provide access to files, etc. Some other data structure (RDBMS?) needs to store the metadata. A mechanism is needed for loading new content.
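To illustrate the kind of fan-out I mean, here is a rough sketch; the SHA-256 choice and two-level split are arbitrary, not a requirement:

```java
import java.nio.file.Path;
import java.security.MessageDigest;

public class HashStoreLayout {

    // Resolve the on-disk path for a content hash, e.g. "ab/cd/abcdef...".
    // Two levels of fan-out (256 * 256 buckets) keeps each directory well
    // under the per-directory file limits of common filesystems.
    static Path pathFor(Path root, String hexHash) {
        return root.resolve(hexHash.substring(0, 2))
                   .resolve(hexHash.substring(2, 4))
                   .resolve(hexHash);
    }

    static String sha256Hex(byte[] data) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
        StringBuilder sb = new StringBuilder(digest.length * 2);
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```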
Everything I am finding is higher level, JCR or JCR-ish, but I figured it was worth checking with the experts here before going off to build it. Thanks in advance.

Definitely don't reinvent the wheel... an RDBMS might work, but perhaps something like Apache Jackrabbit? If you want really low level, there is Peter Lawrey's Chronicle.

All the existing content management solutions are too high level for my purposes. SQL Server with FILESTREAM is about as close as I could find to something that promises reasonable performance. However, this is not a difficult thing to build, and then I won't have to rely on a specific RDBMS for the solution, so that is the route I'm going to take.

Related

Is Python the best option to load a huge text file to SQL Server? [closed]

I have a few text files in CSV format. Some of them are over 500 MB but less than 1 GB. I need to load each of them into a SQL Server 2008 R2 database as a table.
I considered using Python. Is Python a good option (performance-wise) for things like this? Should I use any particular Python library? I am more of a Java person; how does Python compare to Java here?
Does anyone have experience with this? Thanks!
Cheers, Alex
In general, no scripting language performs as well as native utilities for loading bulk data.
Unless your CSV is malformed and requires pre-scrubbing and transformation, there is no need to limit your choices to programming languages. Use a tool instead. SSIS, BCP, DTS all come to mind for CSV.
If you need customized load logic, or a client-based load, then by all means Python, Perl, Java, or C# can all do it. But none of them will load as fast as a tool already built for the job (and speed seems to be what you are concerned with).
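For the customized-load case in Java, a rough JDBC batching sketch; the table, columns, and connection string here are placeholders, and the naive split assumes no quoted fields:

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvLoader {
    public static void main(String[] args) throws Exception {
        // Placeholder connection string -- adjust server, database, credentials.
        String url = "jdbc:sqlserver://localhost;databaseName=mydb;user=sa;password=secret";
        try (Connection con = DriverManager.getConnection(url);
             BufferedReader in = Files.newBufferedReader(Paths.get("data.csv"))) {
            con.setAutoCommit(false); // commit in chunks, not per row
            PreparedStatement ps = con.prepareStatement(
                    "INSERT INTO my_table (col1, col2) VALUES (?, ?)");
            String line;
            int n = 0;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",", -1); // naive: breaks on quoted commas
                ps.setString(1, f[0]);
                ps.setString(2, f[1]);
                ps.addBatch();
                if (++n % 10_000 == 0) { ps.executeBatch(); con.commit(); }
            }
            ps.executeBatch();
            con.commit();
        }
    }
}
```

Even so, expect BCP or SSIS to beat this comfortably on a clean 500 MB file.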

Java OpenSource library for versioning files [closed]

My task description is something like this:
"Application should be able to store text/binary files in some filesystem storage. Every file has an author and date of uploading. Application should store all versions of files and provide abilities to review history/versioning tree".
We can't use a DB solution here because another application processes the uploaded files and requires the original file version (a build script that uses the javac command). Also, it is not a good idea to store files in a database.
So I'm looking for a ready-to-use solution; I want to avoid writing my own storage implementation.
I've googled for solutions and found the Jackrabbit library as one option. It implements the JCR specification, but I have also seen some negative comments about the JCR concept.
Can you advise me on something else, or is JCR good enough for my task?
That requirement sounds like source-code version control. There are APIs for Git and Subversion, and probably for other, less-used systems; http://svnkit.com is one example, and a search for "git api" or "subversion api" will turn up others.
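For the Git route, a minimal JGit sketch of storing one version of a file (the repository location and author details are made up; it assumes org.eclipse.jgit on the classpath):

```java
import java.io.File;
import java.nio.file.Files;
import org.eclipse.jgit.api.Git;

public class VersionedStore {
    public static void main(String[] args) throws Exception {
        File repoDir = new File("/tmp/file-store"); // hypothetical storage root
        repoDir.mkdirs();
        try (Git git = Git.init().setDirectory(repoDir).call()) {
            // Write (or overwrite) the file, then record it as a new version.
            Files.write(repoDir.toPath().resolve("build.xml"), "v1".getBytes());
            git.add().addFilepattern("build.xml").call();
            // Each commit records the author and date, which covers the
            // "who uploaded it, and when" requirement; the commit log is
            // the version history.
            git.commit().setMessage("upload build.xml")
               .setAuthor("alice", "alice@example.com")
               .call();
        }
    }
}
```

Because files stay on the filesystem inside the working tree, the build script can still run javac against the current version directly.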

How to practice Hadoop online? [closed]

Is there a way to find an online Hadoop database and practice on it using Java?
I found that you can practice on www.gethue.com, but I don't think you can do it using Java.
Thank you
You can try Cloudera Live.
It's in beta, but seems to work pretty well.
I made a small list of free offers that let you manage your own Hadoop cluster. These aren't hosted databases as such, but you can fill the clusters with whatever data you want.
Here is the list:
Microsoft Azure HDInsight: they offer you 150€ to spend on their products. You can rent a Hadoop cluster and work on it.
Qubole: they give you preconfigured Hadoop clusters, with 75 computing hours for free.
Joyent: you can have one VM for free for a year.
You may also try Amazon's Elastic MapReduce, although I'm not sure this specific offer is included in their free trial. An advantage of using it is that you can access free datasets more easily (for instance, this one).
Please also note that all these services (except Qubole) require a credit card for registration.
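Whichever offer you pick, a classic first Java exercise to run on the cluster is WordCount; the standard MapReduce version looks like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emit (word, 1) for every whitespace-separated token in the input line.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws java.io.IOException, InterruptedException {
            for (String tok : value.toString().split("\\s+")) {
                if (tok.isEmpty()) continue;
                word.set(tok);
                ctx.write(word, ONE);
            }
        }
    }

    // Sum the counts for each word; also used as the combiner.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws java.io.IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```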

Using a NoSQL database for a very large dataset with small records, heavy writes and moderate reads [closed]

What is a good NoSQL database for building a system that records advertisement data, at roughly 50 to 200 million inserts per day? Aggregations of the data will be used to show patterns of how users engage with the ads. I really like MongoDB, but it seems that major industry players are picking Riak for this job. It also seems that Mongo has ironed out some caveats over the last two releases, and the current version looks pretty good for the task. Any ideas?
It seems MongoDB with Hadoop (http://docs.mongodb.org/ecosystem/tutorial/getting-started-with-hadoop/) fits your data requirements. You can store the data in MongoDB and run aggregation jobs (map/reduce) on a Hadoop cluster.
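For illustration, a minimal sketch with the MongoDB Java driver (the database, collection, and field names are made up): it inserts one ad event and counts events per ad with the aggregation pipeline.

```java
import java.util.Arrays;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class AdEvents {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Hypothetical collection of raw ad-impression/click events.
            MongoCollection<Document> events =
                    client.getDatabase("ads").getCollection("events");
            events.insertOne(new Document("adId", "ad-42")
                    .append("action", "click")
                    .append("ts", System.currentTimeMillis()));
            // Count events per ad via the aggregation pipeline.
            for (Document d : events.aggregate(Arrays.asList(
                    new Document("$group", new Document("_id", "$adId")
                            .append("count", new Document("$sum", 1)))))) {
                System.out.println(d.toJson());
            }
        }
    }
}
```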
I'd use Berkeley DB Java Edition from Oracle. It is very powerful and easy to use; open source, but not free.
This is not the type of question that can be answered with "Product1" or "Product2"; too little information is given. There is no info about the environment the system will run in, what type of data will be inserted, or how you are going to aggregate it.
The best way is to try:
write a test using Product1,
write the same thing with Product2,
start inserting data that looks as close as possible to the data you expect in the real environment,
measure speed and whatever other factors you need,
and only based on that will you be able to determine what suits you. A minimal harness is sketched below.
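One shape such a harness could take (the Store interface and payload size are placeholders; wire in the real client for each product you test):

```java
import java.util.concurrent.ThreadLocalRandom;

public class InsertBenchmark {

    // Stand-in for whichever store is being evaluated; implement it with
    // the real client (MongoDB, Riak, ...) for each candidate.
    interface Store { void insert(String key, byte[] value); }

    static void run(String name, Store store, int n) {
        byte[] payload = new byte[200]; // small records, as in the question
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            ThreadLocalRandom.current().nextBytes(payload);
            store.insert("ad-event-" + i, payload);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.printf("%s: %d inserts in %d ms (%.0f/s)%n",
                name, n, elapsedMs, n * 1000.0 / Math.max(1, elapsedMs));
    }

    public static void main(String[] args) {
        // No-op baseline shows the harness overhead by itself.
        run("no-op baseline", (k, v) -> {}, 1_000_000);
    }
}
```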

Which key-value store has the best performance? [closed]

I think it was about two months ago that I found an open-source project from Google that can store key-value pairs with high performance, but I forget the name. Could anybody tell me, or suggest something else? I have been using BerkeleyDB, but I found that BerkeleyDB is not fast enough for my program. However, BerkeleyDB is convenient to use, since it ships as a Java library jar that integrates with my program seamlessly. My program is also written in Java.
Two strong competitors in the DHT (Distributed Hash Table) 'market':
Cassandra (created by Facebook, in use by Digg and Twitter)
HBase
Here is a presentation about Cassandra. On slide 20 you'll see some speed benchmarks: 0.12 ms per write.
(You can search around for the whole presentation, including Eric Evans talking.)
Nobody mentions LevelDB, and yet this post is at the top when searching for "good key value store". LevelDB is, in my experience, simply awesome. It's so fast I couldn't believe it.
I've been trying quite a few databases for a task I was doing. I tried:
Windows Azure table storage (expensive; value size max 1 MB, and each property max 64 KB)
Redis (awesome if you have as much RAM as you please)
MongoDB (awesome as long as there is enough RAM; breaks after that point)
SQL Server (expensive, needs maintenance such as rebuilding indexes, and eventually still not fast enough)
SQLite (free, but not as simple to use as LevelDB, and not fast)
LevelDB. If you can model your job as reading large consecutive chunks of data through an iterator, you'll get great speed. Writing is also pretty fast. Combine it with an SSD and you'll love it.
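For reference, a minimal LevelDB round-trip from Java, assuming the pure-Java iq80 port (org.iq80.leveldb) on the classpath:

```java
import java.io.File;
import org.iq80.leveldb.DB;
import org.iq80.leveldb.DBIterator;
import org.iq80.leveldb.Options;
import static org.iq80.leveldb.impl.Iq80DBFactory.*;

public class LevelDbDemo {
    public static void main(String[] args) throws Exception {
        Options options = new Options();
        options.createIfMissing(true);
        // Opens (or creates) an on-disk store under ./example-db.
        try (DB db = factory.open(new File("example-db"), options)) {
            db.put(bytes("user:1"), bytes("alice"));
            System.out.println(asString(db.get(bytes("user:1"))));
            // Iterate keys in sorted order -- the access pattern
            // LevelDB is fastest at.
            try (DBIterator it = db.iterator()) {
                for (it.seekToFirst(); it.hasNext(); it.next()) {
                    System.out.println(asString(it.peekNext().getKey()));
                }
            }
        }
    }
}
```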
Bigtable?
Redis
http://code.google.com/p/redis/
Maybe you should describe what features you need. If it doesn't need to be distributed (does it?), then I would try the H2 database. For those who think "it can't be fast because it uses SQL": note that with prepared statements, SQL parsing is only done once. Disclaimer: I'm the main author of H2.
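A tiny embedded H2 sketch of the key-value pattern with prepared statements (the table and URL are just examples; use "jdbc:h2:~/data/kv" for an on-disk database):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class H2KeyValue {
    public static void main(String[] args) throws Exception {
        // Embedded, in-memory database; no server process needed.
        try (Connection con = DriverManager.getConnection("jdbc:h2:mem:kv")) {
            con.createStatement().execute(
                    "CREATE TABLE kv(k VARCHAR PRIMARY KEY, v VARCHAR)");
            // Parsed once, reused for every row (upsert by primary key).
            PreparedStatement put = con.prepareStatement(
                    "MERGE INTO kv(k, v) KEY(k) VALUES (?, ?)");
            put.setString(1, "user:1");
            put.setString(2, "alice");
            put.execute();
            PreparedStatement get = con.prepareStatement(
                    "SELECT v FROM kv WHERE k = ?");
            get.setString(1, "user:1");
            try (ResultSet rs = get.executeQuery()) {
                if (rs.next()) System.out.println(rs.getString(1));
            }
        }
    }
}
```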
Many answers seem to automatically assume a need for distribution, which seems odd given that the question refers to BDB.
With that in mind, beyond Redis and H2 (which are both good), there is also Tokyo Cabinet to consider, which seems to offer benefits over BDB. One more, newer possibility is Krati.
I think you saw Guava or Google Collections.
