Does anyone know of a tool that can inspect a specified schema and generate random data based on the tables and columns of that schema?
Another alternative is Swingbench Data Generator.
The SAMPLE clause is also useful (for example, for generating order lines from a random combination of orders and products).
This is an interesting question. It is easy enough to generate random values - a simple loop round the data dictionary with calls to DBMS_RANDOM would do the trick.
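For a single table, a minimal sketch of that naive approach (the connection details and the CUSTOMERS table are made up for illustration, and the Oracle JDBC driver is assumed to be on the classpath; DBMS_RANDOM and the CONNECT BY row generator do the real work):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class NaiveRandomFill {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; substitute your own.
        try (Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "scott", "tiger");
             Statement st = con.createStatement()) {
            // One INSERT ... SELECT per target table, driven by DBMS_RANDOM.
            st.executeUpdate(
                "INSERT INTO customers (name, created) " +
                "SELECT dbms_random.string('U', 10), " +
                "       SYSDATE - dbms_random.value(0, 365) " +
                "FROM dual CONNECT BY level <= 1000");
        }
    }
}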
Except for two things.
One is, as @FrustratedWithForms points out, there is the complication of foreign key constraints. Let's tip lookup values (reference data) into the mix too.
The second is, random isn't very realistic. The main driver for using random data is a need for large volumes of data, probably for performance testing. But real datasets aren't random: they contain skews and clumps, variable string lengths, and of course patterns (especially where dates are concerned).
So, rather than trying to generate random data I suggest you try to get a real dataset. Ideally your user/customer will be able to provide one, preferably anonymized. Otherwise try taking something which is already in the public domain, and massage it to fit your specific requirements. The Info Chimps are the top bananas when it comes to these matters. Check them out.
Allround Automations' PL/SQL Developer has a data generator tool. But be warned: it's a bit flaky - it seems to work fine on a single-table basis but gets tripped up when there are dependencies between tables.
I admit that eventually I just started writing my own SQL scripts to generate data. Turned out to be much more stable.
Have a look at Databene Benerator.
It's a bit complicated to do the initial setup but is quite powerful.
Bit of a wild card this one but thought I would mention it.
If you have data in a production environment that you can't use because it may contain sensitive information, Oracle have a product called "Oracle Data Masking" that will replace the sensitive information with realistic values.
I don't know the cost of this product but if you want more information, it can be found here.
I have about a year of experience in coding in Java. To hone my skills I'm trying to write a Calendar/journal entry desktop app in Java. I've realized that I still have no experience in data persistence and still don't really understand what the data persistence options would be for this program -- so perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each operation).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is because of the answer provided by this following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: What's the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use a NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
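A minimal sketch of the idea (the dates and entry strings are made up; subMap() does the range selection, half-open as described above):

import java.time.LocalDate;
import java.util.NavigableMap;
import java.util.TreeMap;

public class JournalDemo {
    public static void main(String[] args) {
        NavigableMap<LocalDate, String> journal = new TreeMap<>();
        journal.put(LocalDate.of(2023, 3, 1), "Wrote the persistence layer");
        journal.put(LocalDate.of(2023, 3, 5), "Refactored the UI");
        journal.put(LocalDate.of(2023, 3, 9), "Profiled range queries");

        // Half-open range [from, to): include the lower bound, exclude the upper.
        LocalDate from = LocalDate.of(2023, 3, 1);
        LocalDate to = LocalDate.of(2023, 3, 8);
        journal.subMap(from, true, to, false)
               .forEach((date, text) -> System.out.println(date + ": " + text));
    }
}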
Depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient" and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs etc. It might do, but you only know when you run your code.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work", but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between the List and JSON. You could then simply save it to a file for persistence.
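A minimal sketch of that approach (the Entry type, its fields, and the file name are all made up; dates are kept as ISO-8601 strings to avoid extra date-module setup):

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;
import java.util.List;

public class JsonStore {
    // Hypothetical entry type; public fields keep the sketch short.
    public static class Entry {
        public String date;
        public String text;
    }

    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        File store = new File("journal.json");

        Entry e = new Entry();
        e.date = "2023-03-01";
        e.text = "First entry";

        mapper.writeValue(store, List.of(e));                 // save the whole list
        List<Entry> loaded = mapper.readValue(
                store, new TypeReference<List<Entry>>() {});  // load it back
        System.out.println(loaded.size() + " entries loaded");
    }
}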
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will need to have a database server running alongside your application. You can use JDBC and suitable drivers for your database vendor (e.g. MySQL) to connect to, read from and write to the database.
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
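For illustration, a minimal Spring Boot sketch of such an endpoint (the path and the placeholder lookup are made up; a real version would delegate to a repository):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class JournalApp {
    // GET /entries/2023-03-01 -> the journal entry for that date.
    @GetMapping("/entries/{date}")
    public String entry(@PathVariable String date) {
        // Placeholder: look the entry up in your persistence layer instead.
        return "Entry for " + date;
    }

    public static void main(String[] args) {
        SpringApplication.run(JournalApp.class, args);
    }
}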
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas about what data structure to use for range queries:
Data structure for range query
It might even be easier to use a database and an API to query for the desired range.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
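A minimal sketch of the idea (the times and event names are made up; Guava's TreeRangeMap is the stock implementation):

import com.google.common.collect.Range;
import com.google.common.collect.RangeMap;
import com.google.common.collect.TreeRangeMap;
import java.time.Instant;

public class ScheduleDemo {
    public static void main(String[] args) {
        RangeMap<Instant, String> schedule = TreeRangeMap.create();
        Instant start = Instant.parse("2023-03-01T09:00:00Z");
        Instant end = Instant.parse("2023-03-01T10:00:00Z");

        // Half-open interval: the event occupies [start, end).
        schedule.put(Range.closedOpen(start, end), "Stand-up meeting");

        // "What event is occurring at time T?"
        Instant t = Instant.parse("2023-03-01T09:30:00Z");
        System.out.println(schedule.get(t)); // Stand-up meeting (or null if free)
    }
}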
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
The problem is how to store (and search) a set of items a user likes and dislikes. Although each user may have 2-100 items in their set, the possible values for the items number in the tens of thousands (and the list is expanding).
Associated with each item is a value say from 10 (like) to 0 (neutral) to -10 (dislike).
So given a user with a particular set, how to find users with similar sets (say a percentage overlap on the intersection)? Ideally the set of matches could be reduced via a filter that includes only items with like/dislike values within a certain percentage.
I don't see how to use a key/value or column store for this, and walking a relational table of items for each user would seem to consume too many resources. Making the sets into documents would seem to lose clarity.
The web app is in Java. I've searched ORMs, NoSQL, ElasticSearch and related tools and databases. Any suggestions?
OK, it seems that the actual storage isn't the problem; rather, you want to build a suggestion system based on the likes/dislikes.
You can store things however you want, even in SQL - most SQL RDBMSs will be good enough for your data store, but you can of course also use anything else you want. The point is that no SQL solution (which I know of) will give you good results with this by itself. What you are looking for is a suggestion system based on artificial intelligence, and the best one for distributed systems, with many libraries already implemented, is Apache Mahout.
According to what I've learned about it so far, it can do what you need basically out of the box. I know that it's based on Hadoop and YARN, but I'm not sure if you can import data from anywhere you want, or need to have it in HDFS.
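As a hedged sketch of the direction (this uses Mahout's older single-machine "Taste" API, which runs without Hadoop; the ratings.csv file and the user IDs are made up):

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class SimilarUsersDemo {
    public static void main(String[] args) throws Exception {
        // ratings.csv: one "userID,itemID,value" row per rating, e.g. "1,42,8.0"
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        GenericUserBasedRecommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Users most similar to user 1, and suggestions derived from them.
        long[] similarUsers = recommender.mostSimilarUserIDs(1, 5);
        List<RecommendedItem> suggestions = recommender.recommend(1, 5);
        System.out.println(java.util.Arrays.toString(similarUsers));
        suggestions.forEach(System.out::println);
    }
}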
The other option would be to implement a machine-learning algorithm on your own, which would run on only one machine, but you just won't get the results you want with a simple query in any SQL system.
The reason you need machine-learning algorithms, and a query with some numbers won't be enough in most cases, is the diversity of users you are facing. What if you have a user B who liked/disliked everything he has in common with user A in exactly the same way - but the coverage is only 15%? On the other hand, you have a user C who is pretty similar to A (while not at 100%, the directions are pretty much the same) and who has marked over 90% of the things A also marked. In this scenario C is much closer to A than B would be, even though B agrees 100%. There are many other scenarios where simple percentages won't be enough, and that's why many companies with suggestion systems (Amazon, Netflix, Spotify, ...) use Apache Mahout and similar systems to get those done.
I'm not really sure if my question is correct per se to post here, but I thought I'd give it a go.
I'm working on a project where I take text data from a public knowledge base and want to use this text to automatically expand tag based search queries with additional terms that are supposed to be relevant to the original query. The public knowledge base is basically a collection of data from Wikipedia; in my case the abstracts of 3.74 million articles.
In the beginning I simply performed a search based on an original query, fetched the words used in articles describing the matches from my query and did a simple term frequency calculation to get the N most used terms.
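In code, that first pass looked roughly like this (a simplified sketch - the tokenization is deliberately naive):

import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TopTerms {
    // Count term frequencies over the matching abstracts, then keep the N most used.
    static List<String> topTerms(List<String> matchingAbstracts, int n) {
        Map<String, Long> freq = new HashMap<>();
        for (String abstractText : matchingAbstracts)
            for (String term : abstractText.toLowerCase().split("\\W+"))
                if (!term.isEmpty())
                    freq.merge(term, 1L, Long::sum);
        return freq.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(topTerms(List.of("the cat sat", "the cat ran"), 2));
    }
}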
It seemed to be a simple idea that worked to begin with, but as I tested more queries I started running into problems. It's clear that I need some kind of semantic analysis of my custom text collection, but I have no idea where to even begin with something like this. Any tools I find online that are supposed to do semantic analyses like this only work on a predefined collection of texts. As stated: I need something that can process a custom collection and later use that index to perform searches.
Any ideas or suggestions?
This is a design question involving both Java and MySQL.
The client requires the addition of 14 boolean flags (T/F) to keep track of some new information in an existing class/table.
I can add these flags to the existing table, or I could create a new class and table just for this data. Adding the 14 boolean flags to the existing table will give it quite a few attributes, which I'm inclined to avoid (especially if the number of flags increases in time). Creating a new class/table is cleaner, but is it really necessary in this case?
Alternatively, I could use a 16-bit integer with masks to multiplex the data, so that I'm only adding one variable to the existing class/table.
My primary question is this: is it more efficient to store 14 individual boolean variables in a MySQL database and load them into the class, or would it be better to store a single integer and then (in Java) multiplex the flags using bit manipulation (i.e. masks)?
Secondary question: if individual flags are more efficient, is it better to have lots of attributes in one table or to split them? What is the penalty for storing lots of boolean flags in a table that already has quite a few attributes?
If the primary question's answer is "integer + multiplex" then the second question becomes moot.
Thanks.
-R
I personally like to have separate columns. The only place I might consider masking is when the database and the application are running under extreme conditions, or on low-memory and low-storage devices where any use of memory or space is crucial.
1- Space should not be a consideration unless the class/table can grow to huge volumes. To simulate boolean flags, a TINYINT(1) is enough; all you need is 0/1 values.
2- It becomes much harder for anyone wanting to run queries on the table or write reports against it. And if your client does access the database, I am quite sure masking won't be acceptable in most cases.
3- It will be much harder to build indexes on this column when they are needed, if that is possible at all (depending on the database).
4- Working more and writing more code should not be an issue. You work more now, but you will work less in the future; thinking it is less work for the programmer/DBA is just an illusion, IMHO. Here are some considerations:
a- It will be harder to maintain the code and write database queries. Maybe you do everything in your Java code now, but you never know what the future holds.
b- Structural changes become harder. What if the customer requires the removal of two flags and the addition of four? Do you keep the two bits that held the removed flags and add four more bits? Or do you reuse them for two of the new flags and add two more? How would this affect code that is already written, and how easy would it be to track down all the affected places and actually make the changes?
In a small application this is not a big problem, but applications grow with time, and if the table comes to be widely used this is very dangerous. If you had code working with the 7th and 8th flags, they were removed, and the decision (by some other programmer, let's say) was to reuse the same bit positions, any code that used to access the 7th and 8th bits would keep functioning (incorrectly) until that was noticed - it could already have done harmful things by the time the issue was spotted and fixed. If you had separate columns and dropped them, the error would pop up to the surface on the very first use of that code, as the columns would no longer be there.
c- It will without a doubt be harder for the DBA to write scripts that upgrade the data and/or change the structure. An experienced DBA will not sit and write the column names one after the other; he will use his tools to generate scripts. With bit manipulation, he will have to work by hand and make no mistakes in the expressions he produces in the various SELECTs/UPDATEs.
5- All of the above is database related. Once the data reaches your application, you are free: you can read the flags from the database, pack them into an integer, and from then on your code can use bit manipulation on it, saving time by writing the functions that deal with it once and reusing them. I personally think it is better not to do so here either, but anyway, that's your choice.
I know I am not focused and that I might have repeated myself here and there, but I hope I have helped you see the longer-term considerations that will help you make the right choice for your case.
Take a look at MySQL's SET column type.
You can use EnumSet. It's the best way to emulate flags - much clearer in design, with almost the same performance as an int. It can easily be translated to an int (to read from / write to the database). For more information, see the EnumSet item in the book Effective Java.
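A minimal sketch of that translation (the Flag constants are made up; note that relying on ordinal() makes the enum declaration order part of the storage format):

import java.util.EnumSet;

public class FlagCodec {
    // Hypothetical flags; in practice, one constant per boolean the client needs.
    enum Flag { ACTIVE, VERIFIED, NEWSLETTER, ARCHIVED }

    // Pack an EnumSet into an int bitmask for the database column.
    static int toBits(EnumSet<Flag> flags) {
        int bits = 0;
        for (Flag f : flags) bits |= 1 << f.ordinal();
        return bits;
    }

    // Unpack the stored int back into an EnumSet.
    static EnumSet<Flag> fromBits(int bits) {
        EnumSet<Flag> flags = EnumSet.noneOf(Flag.class);
        for (Flag f : Flag.values())
            if ((bits & (1 << f.ordinal())) != 0) flags.add(f);
        return flags;
    }

    public static void main(String[] args) {
        int stored = toBits(EnumSet.of(Flag.ACTIVE, Flag.ARCHIVED));
        System.out.println(stored + " -> " + fromBits(stored)); // 9 -> [ACTIVE, ARCHIVED]
    }
}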
In the primary question you ask what is more efficient rather than what is better. That complicates the answer.
From the point of view of the developer and the DBA, the single column is the more efficient solution, because you save space and, using masks, you speed up inserts and updates.
From the point of view of a data analyst, separate columns are the more efficient solution: each column has a specified role.
As for me, I prefer the masks:
- Fewer changes in code
- Better management (though the limited capacity of the integer is a risk here)
We have a system which performs a 'coarse search' by invoking an interface on another system which returns a set of Java objects. Once we have received the search results I need to be able to further filter the resulting Java objects based on certain criteria describing the state of the attributes (e.g. from the initial objects return all objects where x.y > z && a.b == c).
The criteria used to filter the set of objects each time is partially user configurable, by this I mean that users will be able to select the values and ranges to match on but the attributes they can pick from will be a fixed set.
The data sets are likely to contain <= 10,000 objects for each search. The search will be executed manually by the application user base probably no more than 2000 times a day (approx). It's probably worth mentioning that all the objects in the result set are known domain object classes which have Hibernate and JPA annotations describing their structure and relationship.
Possible Solutions
Off the top of my head I can think of 3 ways of doing this:
For each search persist the initial result set objects in our database, then use Hibernate to re-query them using the finer grained criteria.
Use an in-memory Database (such as hsqldb?) to query and refine the initial result set.
Write some custom code which iterates the initial result set and pulls out the desired records.
Option 1
Option 1 seems to involve a lot of toing and froing across a network to a physical Database (Oracle 10g) which might result in a lot of network and disk activity. It would also require the results from each search to be isolated from other result sets to ensure that different searches don't interfere with each other.
Option 2
Option 2 seems like a good idea in principle as it would allow me to do the finer query in memory and would not require the persistence of result data which would only be discarded after the search was complete. Gut feeling is that this could be pretty performant too but might result in larger memory overheads (which is fine as we can be pretty flexible on the amount of memory our JVM gets).
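To make option 2 concrete, a minimal sketch with HSQLDB's in-memory mode (the result table, its columns, and the sample values are all made up; in practice the rows would come from the coarse search):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class InMemoryRefine {
    public static void main(String[] args) throws Exception {
        // jdbc:hsqldb:mem:... creates a private in-memory database, discarded on exit.
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:search", "SA", "");
             Statement st = con.createStatement()) {
            st.execute("CREATE TABLE result (x_y INT, a_b VARCHAR(20))");
            try (PreparedStatement ins = con.prepareStatement("INSERT INTO result VALUES (?, ?)")) {
                ins.setInt(1, 42);
                ins.setString(2, "c");
                ins.executeUpdate();
            }
            // The finer-grained criteria, e.g. x.y > z && a.b == c, expressed as SQL.
            try (ResultSet rs = st.executeQuery("SELECT * FROM result WHERE x_y > 10 AND a_b = 'c'")) {
                while (rs.next())
                    System.out.println(rs.getInt("x_y") + " " + rs.getString("a_b"));
            }
        }
    }
}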
Option 3
Option 3 could be very performant but is something I would like to avoid, as any code we write would require such careful testing that the time taken to achieve something flexible and robust enough would probably be prohibitive.
I don't have time to prototype all 3 ideas so I am looking for comments people may have on the 3 options above, plus any further ideas I have not considered, to help me decide which idea might be most suitable. I'm currently leaning toward option 2 (in memory database) so would be keen to hear from people with experience of querying POJOs in memory too.
Hopefully I have described the situation in enough detail but don't hesitate to ask if any further information is required to better understand the scenario.
Cheers,
Edd
Options 1 and 2 are quite compatible: having implemented one, you can replace it with the other through a simple reconfiguration of persistence.xml (given that the in-memory database is JPA compatible, e.g. JavaDB/Derby, HSQLDB, etc.).
Option 3 means re-implementing both third-party software (the database) and your own code (the existing JPA entities). You have also listed its advantages alongside the concerns yourself. It's clearly a less feasible option in your case; I can't think of anything else to promote Option 3 either.
Given the use cases and their time span, the in-memory database seems more suitable. If the requirements evolve into less transient ones, you can then switch to Oracle.
If your expressions are not too complex, you can use an expression language to evaluate string queries against your Java objects (POJOs). I can recommend MVEL: http://mvel.codehaus.org .
The idea is that you put your objects into the MVEL context, provide a query string written in MVEL's simple notation, and finally evaluate the expression.
Example taken from MVEL site:
import java.util.HashMap;
import java.util.Map;
import org.mvel2.MVEL; // MVEL 2.x; older releases used the org.mvel package

Map<String, Object> vars = new HashMap<>();
vars.put("x", 5);
vars.put("y", 10);

// Evaluate the string expression against the variables in the map.
Integer result = (Integer) MVEL.eval("x * y", vars);
assert result.intValue() == 50;
Expression languages usually support traversing your object graph (collections) and accessing members in JSP EL style (dot notation).
Also, I can suggest looking at OGNL (google it, I can't add more than one link)
How complex are the refining criteria? If the majority are quite simple, I'd be tempted to go for option (3) to start with, but make sure it's encapsulated behind a suitable interface so that if you come across something that is too complex or inefficient to code up yourself you can switch to the in-memory DB at that point (either wholesale for all queries, or just for the complex ones if there's an overhead in setting up the temporary tables).
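If you do start with option (3), a minimal sketch of that interface boundary (the Result type, its fields, and the criteria values are all made up):

import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class RefineDemo {
    // Hypothetical domain object standing in for the real coarse-search results.
    static class Result {
        final int y;
        final String b;
        Result(int y, String b) { this.y = y; this.b = b; }
        @Override public String toString() { return "Result(" + y + ", " + b + ")"; }
    }

    // The boundary to hide the strategy behind: swap this lambda for an
    // in-memory-DB-backed implementation later without touching callers.
    interface ResultFilter {
        List<Result> filter(List<Result> coarseResults);
    }

    public static void main(String[] args) {
        // Criteria built from the user's configured values, e.g. x.y > z && a.b == c.
        int z = 10;
        String c = "c";
        Predicate<Result> criteria = r -> r.y > z && c.equals(r.b);
        ResultFilter filter = coarse ->
                coarse.stream().filter(criteria).collect(Collectors.toList());

        List<Result> coarse = List.of(
                new Result(42, "c"), new Result(5, "c"), new Result(42, "d"));
        System.out.println(filter.filter(coarse)); // [Result(42, c)]
    }
}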
Option 2 seems good, since you can toggle between 1 and 2 as the need arises. Option 3 is also restricted in terms of future data-sizing issues: querying objects would imply a greater dependency on the code structure for storage and querying.
It would probably be a good idea to include some caching mechanism (Ehcache/Memcached) along with option 2, and then profile to check the performance difference.