I'm writing a Java program right now which reads files and writes the content of these files (after some modifications) into a relational database.
My problem right now is that the program should support a wide range of databases and not only one.
So in my program I create SQL statements and commit them to the DB (SAP HANA) - no problem.
Now I want to add another DB (MySQL) and have to slightly change the SQL syntax of the query before committing.
My solution right now is copying the code block that creates the statements and making the DB-specific changes to the copy. But that obviously can't be it (too many databases -> 80% of the code never used). I probably need some kind of mapper that converts my SQL to a dialect the chosen DB understands.
Now, I found out about Hibernate and other mappers, but I don't think they fit my needs. The problem is that they expect a Java object (POJO) and convert that. But since I don't know what kind of data my program is going to load, I cannot create static objects for each column, for example.
Sometimes I need to create 4 columns, sometimes 10. Sometimes they are Integers, sometimes Strings/varchar. And all of the time they have different names. So all tutorials I found on Hibernate start from a point where the program is certain what kind of data is going to be inserted into the DB, which my program is not.
Moreover, I need to insert a large number of lines per table (like a billion+) and I think it might be slow to create an object for each insert.
I hope anyone understands my problem and can give me some hints - maybe a mapper that just converts SQL without the need to create an object first.
thank you very much! : )
Edit: to make it more clear: the purpose of the program is to fill up a relational DB with data that is stored / described in files (like CSV and XML). So the DB is not used as a tool to store the data; storing the data there is the main aim. I need a relational DB filled up with data that the user provides, and not only one DB, but different kinds of RDBMSs.
I think you are describing a perfect use case for a file system. Or, if you want to go with a filesystem abstraction,
have a look at the Apache Jackrabbit project.
So basically you want to write a tool that writes an arbitrary text file (some kind of CSV, I assume) into an arbitrary database system, creating tables and content on the fly depending on the structure of the text file?
Using a high-level abstraction layer like Hibernate is not going to take you anywhere soon. What you want to do is low-level database interaction. As long as you don't need any specific DBMS-dependent features you should get a long way with ANSI SQL. If that is not enough, I don't see an easy way out of this. Maybe it is an option to write your own abstraction layer that handles DBMS-specific formatting of the SQL statements. Doesn't sound nice though.
A different thing to think about is the large number of lines per table (like a billion+). Using single-row INSERT statements is not a good idea. You have to make use of efficient mass data interfaces - which are strongly DBMS dependent! Prepared statements with batching are the minimum measure here.
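To illustrate that last point, a plain-JDBC batch insert with a prepared statement could look roughly like this (table, columns and batch size are made up; the really fast paths, such as COPY or vendor bulk loaders, remain DBMS specific):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.util.List;

    public class BatchInsertSketch {

        // Inserts rows in batches instead of issuing one INSERT per row.
        // "my_table" and its columns are placeholders.
        public static void insertRows(Connection conn, List<Object[]> rows) throws Exception {
            String sql = "INSERT INTO my_table (col_a, col_b) VALUES (?, ?)";
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                int count = 0;
                for (Object[] row : rows) {
                    ps.setObject(1, row[0]);
                    ps.setObject(2, row[1]);
                    ps.addBatch();
                    if (++count % 10_000 == 0) {   // flush every 10k rows
                        ps.executeBatch();
                    }
                }
                ps.executeBatch();                 // flush the remainder
                conn.commit();
            }
        }
    }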
We have a table of vocabulary items that we use to search text documents. The Java program that uses this table currently reads it from a database, stores it in memory and then searches documents for individual items in the table. The table is brought into memory for performance reasons. This has worked for many years, but the table has grown quite large over time and now we are starting to see Java Heap Space errors.
There is a brute force approach to solving this problem which is to upgrade to a larger server, install more memory, and then allocate more memory to the Java heap. But I'm wondering if there are better solutions. I don't think an embedded database will work for our purposes because the tables are constantly being updated and the application is hosted on multiple sites suggesting a maintenance nightmare. But, I'm uncertain about what other techniques are out there that might help in this situation.
Some more details: there are currently over a million vocabulary items (think of these items as short text strings, not individual words). The documents are read from a directory by our application, and then each document is analyzed to determine if any of the vocabulary is present in the document. If it is, we note which items are present and store them in a database. The vocabulary itself is stored and maintained in an MS SQL relational database that we have been growing for years. Since all vocabulary items must be analyzed for each document, repeatedly reading from the database is inefficient. And the number of documents that need to be analyzed each day can at some of our installations be quite large (on the order of 100K documents a day). The documents are typically 2 to 3 pages long, although we occasionally see documents as large as 100 pages.
In the hopes of making your application more performant, you're taking all the data out of a database that is designed with efficient data operations in mind and putting it into your application's memory. This works fine for small data sets, but as those data sets grow, you are eventually going to run out of resources in the application to handle the entire dataset.
The solution is to use a database, or at least a data tier, that's appropriate for your use case. Let your data tier do the heavy lifting instead of replicating the data set into your application. Databases are incredible, and their ability to crunch through huge amounts of data is often underrated. You don't always get blazing fast performance for free (you might have to think hard about indexes and models), but few are the use cases where Java code is going to be able to pull an entire data set down and process it more efficiently than a database can.
You don't say much about which database technologies you're using, but most relational databases offer a lot of useful tools for full text searching. I've seen well designed relational databases perform text searches very effectively. But if you're constrained by your database technology, or your table really is so big that a relational database text search isn't feasible, you should put your data into a searchable cache such as Elasticsearch. If you model and index your data effectively, you can build a very performant text search platform that will scale reliably. Tom's suggestion of Lucene is another good one. There are a lot of cloud technologies that can help with this kind of thing too: S3 + Athena comes to mind, if you're into AWS.
I'd look at http://lucene.apache.org - it should be a good fit for what you've described.
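For what it's worth, a minimal Lucene sketch along those lines might look like the following; the field name, analyzer choice and index path are assumptions, and the exact API differs a bit between Lucene versions:

    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class VocabularyIndexSketch {

        // Index each vocabulary item once as an untokenized "term" field.
        static void buildIndex(Iterable<String> vocabulary) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("vocab-index"));
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                for (String item : vocabulary) {
                    Document doc = new Document();
                    doc.add(new StringField("term", item, Field.Store.YES));
                    writer.addDocument(doc);
                }
            }
        }

        static IndexSearcher openSearcher() throws Exception {
            return new IndexSearcher(DirectoryReader.open(FSDirectory.open(Paths.get("vocab-index"))));
        }

        // Exact-match lookup of a candidate string against the index.
        static boolean isVocabularyItem(IndexSearcher searcher, String candidate) throws Exception {
            TopDocs hits = searcher.search(new TermQuery(new Term("term", candidate)), 1);
            return hits.scoreDocs.length > 0;
        }
    }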
I had the same issue with a table containing more than a million rows of data, and a client wanted to export all of it. My solution was very simple: I followed this question. But there was still a little issue - with more than 100k records it went to Heap Space. So I just ran my queries in chunks WITH (NOLOCK) (I know this can return some inconsistent data, but I needed it because without that hint the query was blocking the DB). I hope this approach helps you.
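To make the chunking idea concrete, here is a hedged sketch against SQL Server (the OFFSET/FETCH paging syntax needs SQL Server 2012+; the table, columns, conn and process(...) are all placeholders):

    // Reads the table in fixed-size chunks so that only one chunk is in memory at a time.
    String sql = "SELECT id, term FROM vocabulary WITH (NOLOCK) "
               + "ORDER BY id OFFSET ? ROWS FETCH NEXT ? ROWS ONLY";
    int chunkSize = 50000;
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        for (int offset = 0; ; offset += chunkSize) {
            ps.setInt(1, offset);
            ps.setInt(2, chunkSize);
            int read = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    read++;
                    process(rs.getLong("id"), rs.getString("term"));   // your own handling here
                }
            }
            if (read < chunkSize) {
                break;   // last (partial or empty) chunk reached
            }
        }
    }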
When you had a small table, you probably implemented an approach of looping over the words in the table and for each one looking it up in the document to be processed.
Now the table has grown to the point where you have trouble loading it all in memory. I expect that the processing of each document has also slowed down due to having more words to look up in each document.
If you flip the processing around, you have more opportunities to optimize this process. In particular, to process a document you first identify the set of words in the document (e.g., by adding each word to a Set). Then you loop over each document word and look it up in the table. The simplest implementation simply does a database query for each word.
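In rough Java, the flipped loop with the naive one-query-per-word lookup might look like this (conn, matches, the "vocabulary" table and its "term" column are all made-up names standing in for your own):

    // Tokenize the document, then ask the database about each distinct word.
    Set<String> words = new HashSet<>(Arrays.asList(documentText.toLowerCase().split("\\W+")));
    String sql = "SELECT 1 FROM vocabulary WHERE term = ?";
    try (PreparedStatement ps = conn.prepareStatement(sql)) {
        for (String word : words) {
            ps.setString(1, word);
            try (ResultSet rs = ps.executeQuery()) {
                if (rs.next()) {
                    matches.add(word);   // the word exists in the vocabulary table
                }
            }
        }
    }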
To optimize this, without loading the whole table in memory, you will want to implement an in-memory cache of some kind. Your database server will actually automatically implement this for you when you query the database for each word; the efficacy of this approach will depend on the hardware and configuration of your database server as well as the other queries that are competing with your word look-ups.
You can also implement an in-memory cache of the most-used portion of the table. You will limit the size of the cache based on how much memory you can afford to give it. All words that you look up that are not in the cache need to be checked by querying the database. Your cache might use a least-recently-used eviction strategy so that you keep the most common words in the cache.
While you could cache only words that exist in the table, you might achieve better performance if you cache the result of the lookup itself. The cache then holds the most common words that show up in the documents, each with a boolean value that indicates whether the word is or is not in the table.
There are several really good open source in-memory caching implementations available in Java, which will minimize the amount of code you need to write to implement a caching solution.
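Before reaching for a library, note that java.util.LinkedHashMap alone can serve as a small LRU cache of lookup results; here is a minimal sketch (the size limit and the isInTable fallback are placeholders):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Caches the boolean result of "is this word in the vocabulary table?"
    // and evicts the least recently used entry once MAX_ENTRIES is exceeded.
    public class LookupCache {

        private static final int MAX_ENTRIES = 100_000;   // tune to the memory you can afford

        private final Map<String, Boolean> cache =
            new LinkedHashMap<String, Boolean>(MAX_ENTRIES, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > MAX_ENTRIES;
                }
            };

        public boolean contains(String word) {
            Boolean hit = cache.get(word);
            if (hit == null) {
                hit = isInTable(word);   // fall back to a database query on a cache miss
                cache.put(word, hit);
            }
            return hit;
        }

        private boolean isInTable(String word) {
            // placeholder: issue the per-word SELECT shown earlier
            throw new UnsupportedOperationException("wire up your database lookup here");
        }
    }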
I have a use case where in I need to read rows from a file, transform them using an engine and then write the output to a database (that can be configured).
While I could write a query builder of my own, I was interested in knowing if there's already an available solution (library).
I searched online and could find the jOOQ library, but it looks like it is type-safe and has a code-gen tool, so it is probably suited for static database schemas. In my use case, databases can be configured dynamically and the metadata is programmatically read and made available for write purposes (so a list of tables is made available, the user can select the columns to write, and the insert script for these columns needs to be created dynamically).
Is there any library that could help me with the use case?
If I understand correctly, you need to query the database structure, display the result via a GUI, and have the user map data from a file to that structure?
Assuming this is the case, you're not looking for a 'library', you're looking for an ETL tool.
Alternatively, if you're set on writing something yourself, the (very) basic way to do this is:
Read the structure of the database using Connection.getMetaData(). The exact usage can vary between drivers, so you'll need to create an abstraction layer that meets your needs - I'd assume you're just interested in the table structure here.
Map the format of the file to a structure similar to the tables.
Provide a GUI that allows the user to connect elements from the file to columns in the table, including any type mapping that is needed.
Create a parametrized insert statement based on the file-element-to-column mapping - this is just a simple bit of string concatenation (a rough sketch follows below).
Loop through the rows in the file, performing a batch insert for each.
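A very rough sketch of the metadata, insert-building and batching steps above (driver behaviour varies, and every name here is a placeholder):

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    public class DynamicInsertSketch {

        // Read the column names of a table via DatabaseMetaData.
        static List<String> columnsOf(Connection conn, String table) throws Exception {
            List<String> columns = new ArrayList<>();
            DatabaseMetaData meta = conn.getMetaData();
            try (ResultSet rs = meta.getColumns(null, null, table, null)) {
                while (rs.next()) {
                    columns.add(rs.getString("COLUMN_NAME"));
                }
            }
            return columns;
        }

        // Build "INSERT INTO t (a, b) VALUES (?, ?)" and batch the file rows into it.
        static void insert(Connection conn, String table, List<String> columns,
                           List<Object[]> rows) throws Exception {
            String cols = String.join(", ", columns);
            String params = String.join(", ", Collections.nCopies(columns.size(), "?"));
            String sql = "INSERT INTO " + table + " (" + cols + ") VALUES (" + params + ")";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                for (Object[] row : rows) {
                    for (int i = 0; i < row.length; i++) {
                        ps.setObject(i + 1, row[i]);
                    }
                    ps.addBatch();
                }
                ps.executeBatch();
            }
        }
    }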
My advice: get an ETL tool. This sounds like a simple problem, but it's full of idiosyncrasies - getting even an 80% solution will be tough and time consuming.
jOOQ (the library you referenced in your question) can be used without code generation as indicated in the jOOQ manual:
http://www.jooq.org/doc/latest/manual/getting-started/use-cases/jooq-as-a-standalone-sql-builder
http://www.jooq.org/doc/latest/manual/sql-building/plain-sql
When searching through the user group, you'll find other users leveraging jOOQ in the way you intend.
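For instance, an untested sketch of jOOQ as a standalone SQL builder (table, columns and dialect are placeholders; no generated classes are involved):

    import static org.jooq.impl.DSL.field;
    import static org.jooq.impl.DSL.table;

    import org.jooq.DSLContext;
    import org.jooq.SQLDialect;
    import org.jooq.impl.DSL;

    public class JooqBuilderSketch {

        public static String buildInsert() {
            // Standalone SQL builder: no code generation and no live connection required.
            DSLContext create = DSL.using(SQLDialect.MYSQL);
            return create.insertInto(table("my_table"), field("col_a"), field("col_b"))
                         .values("someValue", 42)
                         .getSQL();   // SQL rendered for the chosen dialect, with bind placeholders
        }
    }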
The steps you need to do are:
read the rows
build each row into an object
transform the above object to target object
insert the target object into the db
Among the above 4 steps, the only one you really need to implement yourself is step 3.
And for the above purpose, you can use Transmorph, EZMorph, Commons-BeanUtils, Dozer, etc.
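For step 3, here is a hedged sketch using Commons BeanUtils (SourceRow and TargetRow are invented POJOs standing in for whatever your engine produces and your target table expects):

    import org.apache.commons.beanutils.BeanUtils;

    public class TransformSketch {

        // Two hypothetical POJOs whose matching property name ("name") gets copied.
        public static class SourceRow {
            private String name;
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
        }

        public static class TargetRow {
            private String name;
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
        }

        // Step 3: copy matching properties from the source row onto the target row.
        public static TargetRow transform(SourceRow source) throws Exception {
            TargetRow target = new TargetRow();
            // Note the Commons BeanUtils argument order: (destination, source).
            BeanUtils.copyProperties(target, source);
            return target;
        }
    }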
I'm designing a database and a Java application to do the following:
1. Allow user to query the database via an API.
2. Allow a user to save a query and identify it via a 'query-id'. The user can then pass in the 'query-id' on the next call to the API, which will execute the query associated with that id, but only retrieve data from after the last time that specific query was requested.
- Along with this, I would also need to save the query-id information for each UserID.
Information regarding the Database
The database of choice is PostgreSQL and the information to be requested by user will be stored in various tables.
My question: any suggestions/advice/tips on how to go about implementing requirement No. 2?
Is there an existing design pattern, SQL query, or built-in DB function for saving a query and fetching information from multiple tables starting from the last returned results?
Note:
My initial thought so far is to store the last row read from each table (each row in all the tables will have a primary key) into a data structure, save this data structure for each saved query, and use it when retrieving data again.
For storing the user and query-id information, I was thinking of creating a separate table to store the UserName, UserUUID, SavedQuery, LastInfoRetrieved.
Thanks.
This is quite a question. The obvious tool to use here would be prepared statements, but since these are planned on first run, they can run into problems when run multiple times with multiple parameters. Consider the difference, assuming that id ranges from 1 to 1000000, between:
SELECT * FROM mytable WHERE id > 999900;
and
SELECT * FROM mytable WHERE id > 10;
The first should use an index while the second should do a physical-order scan of the table.
A second possibility would be to have functions which return refcursors. This would mean the query is actually run when the refcursor is returned.
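On the Java side, the PostgreSQL JDBC driver's documented pattern for consuming a refcursor looks roughly like this (the function name fetch_saved_query and the queryId parameter are invented for illustration, and conn is an open Connection):

    // Refcursors only live inside a transaction, so autocommit must be off.
    conn.setAutoCommit(false);
    try (CallableStatement cs = conn.prepareCall("{ ? = call fetch_saved_query(?) }")) {
        cs.registerOutParameter(1, Types.OTHER);   // the refcursor comes back as OTHER
        cs.setInt(2, queryId);                     // hypothetical saved-query id parameter
        cs.execute();
        try (ResultSet rs = (ResultSet) cs.getObject(1)) {
            while (rs.next()) {
                // process the rows produced when the cursor was opened
            }
        }
    }
    conn.commit();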
A third possibility would be to have a schema of tables that could be used for this, per session, holding results. Ideally these would be temporary tables in pg_temp, but if you have to preserve across sessions, that may be less desirable. Building such a solution is a lot more work and adds a lot of complexity (read: things that can go wrong) so it is really a last choice.
From what you say, refcursors sound like the way to do this, but keep in mind PostgreSQL needs to know what data types to return, so you can run into some difficulties in this regard (read the documentation thoroughly before proceeding); and if prepared statements get you where you need to go, that might be simpler.
Let me first brief you about the scenario. The database is Sybase. There are some 2-3k stored procedures. Stored procedures might return huge data (around a million records). There will be a service (servlet / Spring controller) which will call the required procedure and flush the data back to the client in XML format.
I need to apply filtering (on multiple columns & multiple conditions) / sorting (based on some dynamic criteria); this I have done.
The issue is that, as the data is huge, doing all the filtering / sorting in-memory is not good. I have thought of the options below.
Option 1:
Once I get the ResultSet object, read some X number of records, filter them, store them in some file, and repeat this process until all the data is read. Then just read the file and flush the data to the client.
I need to figure out how to sort the data in the file and how to store objects in the file so that the filtering/sorting is fast.
Option 2:
Look for some Java API which takes the data, filters it & sorts it based on the given criteria and returns it back as a stream.
Option 3:
Use an in-memory database like HSQLDB or H2. But I think this will add overhead instead of helping: I will need to insert the data first and then query it, and this will also in turn use the file system.
Note: I don't want to modify the stored procedures, so doing the filtering/sorting in the database is not an option, or might be the last option if nothing else works.
Also, if it helps: every record that I read from the ResultSet is stored in a Map with the column names as keys, and these Maps are stored in a List, on which I apply the filtering & sorting.
Which option do you think will be good in terms of memory footprint, scalability and performance, or is there any other option that would be good for this scenario?
Thanks
I would recommend your Option 3 but it doesn't need to be an in-memory database; you could use a proper database instead. Any other option would be just a more specific solution to the general problem of sorting huge amounts of data. That is, after all, exactly what a database is for and it does it very well.
If you really believe your Option 3 is not a good solution then you could implement a sort/merge solution. Gather your Maps as you already do but whenever you reach a limit of records (say 10,000 perhaps) sort them, write them to disk and clear them down from memory.
Once your data is complete, you can open all the files you wrote and perform a merge on them.
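Here is a stripped-down sketch of that sort/merge approach, assuming each record can be reduced to one sortable line of text (serialization of your Maps, multi-column comparators and error handling are left out):

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.PriorityQueue;

    public class ExternalSortSketch {

        // Phase 1: sort a fixed-size chunk in memory and spill it to its own temp file.
        static Path spillChunk(List<String> chunk, int index) throws IOException {
            Collections.sort(chunk);
            Path file = Files.createTempFile("chunk-" + index + "-", ".txt");
            Files.write(file, chunk);
            return file;
        }

        // Phase 2: k-way merge of the sorted chunk files into one sorted output file.
        static void merge(List<Path> chunks, Path out) throws IOException {
            List<BufferedReader> readers = new ArrayList<>();
            // The queue keeps the smallest current line of each chunk on top: [line, readerIndex].
            PriorityQueue<String[]> queue = new PriorityQueue<>((a, b) -> a[0].compareTo(b[0]));
            for (Path chunk : chunks) {
                readers.add(Files.newBufferedReader(chunk));
            }
            for (int i = 0; i < readers.size(); i++) {
                String line = readers.get(i).readLine();
                if (line != null) queue.add(new String[] { line, String.valueOf(i) });
            }
            try (BufferedWriter writer = Files.newBufferedWriter(out)) {
                while (!queue.isEmpty()) {
                    String[] head = queue.poll();
                    writer.write(head[0]);
                    writer.newLine();
                    String next = readers.get(Integer.parseInt(head[1])).readLine();
                    if (next != null) queue.add(new String[] { next, head[1] });
                }
            }
            for (BufferedReader r : readers) r.close();
        }
    }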
Is Hadoop applicable to your problem?
You should filter the data in the database itself. You can write an aggregation procedure which will execute all the other procedures and combine or filter their data. However, the best option is to modify the 2-3 thousand stored procedures so that they return only the needed data.
I have to go through a database and modify it according to a certain logic. The problem looks something like this: I have a history table in my database and I have to modify it.
Before modifying anything I have to look at whether an object (which has several rows in the history table) had a certain state, say 4 or 9. If it had state 4 or 9 then I have to check the rows between the currently found row and the next state 4 or 9 row. If such a row (between those states) has a specific value in a specific column then I do something in the next row. I hope this is simple enough to give you an idea. I have to do this check for all the objects. Keep in mind that any object can be modified anywhere in its life cycle (of course until it reaches a final state).
I am using SQL Server 2005 and Hibernate. AFAIK I cannot do such a complicated check in Transact-SQL! So what would you recommend I do? So far I have been thinking of doing it as a JUnit test. This would have the advantage of having Hibernate to help me do the modifications, and I would have Java for lists and other data structures I might need that don't exist in SQL. If I do it as a JUnit test I am not losing my mapping files!
I am curious what approaches you would use.
I think you should be able to use cursors to manage the complicated checks in SQL Server. You didn't mention how frequently you need to do this, but if this is a one-time thing, you can either do it in Java or SQL Server, depending on your comfort level.
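If you do go the Java route, the per-object pass described in the question might be shaped roughly like this; HistoryRow, getState, hasSpecialValue and applyFix are all hypothetical, since only you know the real schema and the action to take:

    // Walks one object's history rows in chronological order and applies the rule:
    // between two rows in state 4 or 9, if a row has the special column value,
    // do something to the row that follows it.
    static void processObjectHistory(List<HistoryRow> rowsInOrder) {
        for (int start = 0; start < rowsInOrder.size(); start++) {
            if (!isCheckpointState(rowsInOrder.get(start))) {
                continue;                                  // not a state 4/9 row
            }
            int end = start + 1;
            while (end < rowsInOrder.size() && !isCheckpointState(rowsInOrder.get(end))) {
                end++;                                     // find the next state 4/9 row
            }
            for (int i = start + 1; i < end; i++) {
                if (hasSpecialValue(rowsInOrder.get(i)) && i + 1 < rowsInOrder.size()) {
                    applyFix(rowsInOrder.get(i + 1));      // "do something in the next row"
                }
            }
            start = end - 1;                               // resume from the next checkpoint
        }
    }

    static boolean isCheckpointState(HistoryRow r) { return r.getState() == 4 || r.getState() == 9; }
    static boolean hasSpecialValue(HistoryRow r)  { /* check the specific column here */ return false; }
    static void applyFix(HistoryRow next)         { /* the modification you need to make */ }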
If this check needs to be applied on every CRUD operation, perhaps a database trigger is the way to go. If the logic may change frequently over time, I would much rather write the checks in Hibernate, assuming no one will hit the database directly.