I am trying to implement a real-time updating framework, where changing input data automatically leads to recalculating all dependent results. So I need a kind of subscription mechanism, but a clever one, as I have to handle enormous amounts of data. I like to think about the mechanism as a "calculation tree" or directed graph, with the nodes representing the results, and the edges representing the functions.
Something similar must have been implemented in MS Excel, with the cells being the nodes, but Excel will not fulfill my needs as it is not able to handle large amounts of data, and is not flexible enough.
While in principle I want to be able to browse through the complete calculation tree (including all results in the complete depth of the tree), I know that this could mean storing several Terabytes of data. So I need to be able to forget or skip nodes if the computer runs out of memory, and then recalculate them as needed. And not to forget: while programming the (short!) functions, I don't want to be bothered with endless technical subscribe stuff (ideally this should be taken care of automatically in the framework).
Do you think it's doable, and if so, how would you attack it? Do you know of any component / library which one could use for this type of thing? I have thought about publish/subscribe mechanisms and message brokers, but fear they are going to slow down my calculations.
Thx in advance for your responses!
Calle
I have about a year of experience in coding in Java. To hone my skills I'm trying to write a Calendar/journal entry desktop app in Java. I've realized that I still have no experience in data persistence and still don't really understand what the data persistence options would be for this program -- So perhaps I'm jumping the gun, and the design choices that I'm hoping to implement aren't even applicable once I get into the nitty gritty.
I mainly want to write a calendar app that allows you to log daily journal entries with associated activity logs for time spent on daily tasks. In terms of adding, editing and viewing the journal entries, using a hash table with the dates of the entries as keys and the entries themselves as the values seems most Big-Oh efficient (O(1) average case for each using a hash table).
However, I'm also hoping to implement a feature that could, given a certain range of dates, provide a simple analysis of average amount of time spent on certain tasks per day. If this is one of the main features I'm interested in, am I wrong in thinking that perhaps a sorted array would be more Big-Oh efficient? Especially considering that the data entries are generally expected to already be added date by date.
Or perhaps there's another option I'm unaware of?
The reason I'm asking is the answer to the following question: Why not use hashing/hash tables for everything?
And the reason I'm unsure if I'm even asking the right question is because of the answer to the following question: Whats the best data structure for a calendar / day planner?
If so, I would really appreciate being directed to other resources on data persistence in Java.
Thank you for the help!
Use the NavigableMap interface (implemented by TreeMap, a red-black tree).
This allows you to easily and efficiently select date ranges and traverse over events in key order.
As an aside, if you consider time or date intervals to be "half-open" it will make many problems easier. That is, when selecting events, include the lower bound in results, but exclude the upper. The methods of NavigableMap, like subMap(), are designed to work this way, and it's a good practice when you are working with intervals of any quantity, as it's easy to define a sequence of intervals without overlap or gaps.
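A minimal sketch of the half-open range query described above, assuming each journal day is reduced to a number of minutes spent on one task. The class name `JournalRangeDemo`, the helper names, and the sample data are hypothetical and only serve the illustration.

```java
import java.time.LocalDate;
import java.util.NavigableMap;
import java.util.OptionalDouble;
import java.util.TreeMap;

class JournalRangeDemo {

    /** Average minutes per logged day in the half-open range [from, to). */
    static OptionalDouble averageMinutes(NavigableMap<LocalDate, Integer> minutesByDay,
                                         LocalDate from, LocalDate to) {
        // Lower bound inclusive, upper bound exclusive: adjacent ranges
        // neither overlap nor leave gaps.
        return minutesByDay.subMap(from, true, to, false)
                .values().stream()
                .mapToInt(Integer::intValue)
                .average();
    }

    /** Hypothetical sample data: minutes spent on one task per day. */
    static NavigableMap<LocalDate, Integer> sampleLog() {
        NavigableMap<LocalDate, Integer> log = new TreeMap<>();
        log.put(LocalDate.of(2024, 1, 1), 30);
        log.put(LocalDate.of(2024, 1, 2), 45);
        log.put(LocalDate.of(2024, 1, 3), 60);
        log.put(LocalDate.of(2024, 1, 8), 20);
        return log;
    }

    public static void main(String[] args) {
        NavigableMap<LocalDate, Integer> log = sampleLog();
        LocalDate jan1 = LocalDate.of(2024, 1, 1);
        LocalDate jan8 = LocalDate.of(2024, 1, 8);
        LocalDate jan15 = LocalDate.of(2024, 1, 15);
        // 8 January belongs to the second week only.
        System.out.println(averageMinutes(log, jan1, jan8));  // OptionalDouble[45.0]
        System.out.println(averageMinutes(log, jan8, jan15)); // OptionalDouble[20.0]
    }
}
```

Because the upper bound is excluded, the same date can serve as the end of one week and the start of the next without any entry being counted twice.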
Depends on how serious you want your project to be. In all cases, be careful of premature optimization. This is when you try too hard to make your code "efficient", and sacrifice readability/maintainability in the process. For example, there is likely a way of doing manual memory management with native code to make a more efficient implementation of a data structure for your calendar, but it likely does not outweigh the benefits of using familiar APIs etc. It might, but you will only know once you run your code and measure it.
Write readable code
Run it, test for performance issues
Use a profiler (e.g. JProfiler) to identify the code that is responsible for poor performance
Optimise that code
Repeat
For code that will "work", but will not be very scalable, a simple List will usually do fine. You can use JSON to store your objects, and a library such as Jackson Databind to map between a List and JSON. You could then simply save it to a file for persistence.
For an application that you want to be more robust and protected against data corruption, a database is probably better. With this, you can guarantee that, for example, data is not partially written, concurrent access to the same data will not result in corruption, and a whole host of other benefits. However, you will usually need to have a database server running alongside your application (unless you pick an embedded database such as H2 or SQLite). You can use JDBC and a suitable driver for your database vendor (e.g. MySQL) to connect to, read from and write to the database.
For a serious application, you will probably want to create an API for your persistence. A framework like Spring is very helpful for this, as it allows you to declare REST endpoints using annotations, and introduces useful programming concepts, such as containers, IoC/Dependency Injection, Testing (unit tests and integration tests), JPA/ORM systems and more.
Like I say, this is all context dependent, but above all else, avoid premature optimization.
This thread might give you some ideas what data structure to use for Range Queries.
Data structure for range query
It might even be easier to use a database and query for the desired range through its API.
If you are using (or are able to use) Guava, you might consider using RangeMap (*).
This would allow you to use, say, a RangeMap<Instant, Event>, which you could then query to say "what event is occurring at time T".
One drawback is that you wouldn't be able to model concurrent events (e.g. when you are double-booked in two meetings).
(*) I work for Google, Guava is Google's open-sourced Java library. This is the library I would use, but others with similar range map offerings are available.
I have a graph that I want to explore in different ways. This graph is going to be explored by users and I cannot know in advance what information they want to retrieve from the graph. I like Cypher very much and I was wondering if I can use it as a front-end while keeping my own representation of the graph.
Let me explain that: I cannot transform my graph into a Neo4j Graph for performance reasons. Hence, I was thinking that maybe I can use Cypher and a modification of Neo4j to explore the graph using my own representation of Node, Labels, Properties and so on.
I think this solution would be good because I can:
Reuse the parser and semantic checker of the language
Partially reuse the optimization engine, let's say the platform independent part.
I was exploring the source code on GitHub and it seems tightly coupled to a specific implementation.
My questions:
Are you aware of some project using Cypher/Neo4j like this?
Are you aware of another graph database with a good query language that can be used like that?
Any suggestions on how to approach the modifications to Neo4j?
Just to explain a little bit why I cannot copy the graph: it is a graph that is already produced by another system. It changes a lot and it easily has 10000 nodes. I cannot monitor the graph modifications to update the Neo4j graph because that is, once again, time consuming. Even worse, I have to provide a mechanism to query the graph every five seconds.
Let me describe the problem. A lot of suppliers send us data files in various formats (with various headers). We do not have any control over the data format (what columns the suppliers send us). This data then needs to be converted to our standard transactions (this standard is constant and defined by us).
The challenge here is that we do not have any control over what columns suppliers send us in their files. The destination standard is constant. Now I have been asked to develop a framework through which the end users can define their own data transformation rules through a UI (say, field A in the destination transaction is equal to columnX+columnY, or to the first 3 characters of columnZ from the input file). There will be many such data transformation rules.
The goal is that the users should be able to add all these supplier files (and convert all their data to my company's data from the front-end UI with minimum code change). Please suggest some frameworks for this (preferably Java-based).
Worked in a similar field before. Not sure if I would trust customers/suppliers to use such a tool correctly and design 100% bulletproof transformations. Mapping columns is one thing, but how about formatting problems in dates, monetary values and the like? You'd probably need to manually check their creations anyway or you'll end up with some really nasty data consistency issues. Errors caused by faulty data transformation are little beasts hiding in the dark and jumping at you when you need them the least.
If all you need is a relatively simple, graphical way to design data conversions, check out something like Talend Open Studio (just google it). It calls itself an ETL tool, but we used it for all kinds of stuff.
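To make the two example rules from the question concrete (field A = columnX+columnY, or the first 3 characters of columnZ), here is a rough sketch of how such rules could be represented as data-driven functions. Everything in it is an assumption for illustration: the `Rule` type, the helper names, reading "+" as string concatenation, and modelling an input row as a `Map` from column header to value. It says nothing about how Talend or any other tool works internally.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class MappingRulesDemo {

    /** Hypothetical rule type: computes one destination field from one input row. */
    interface Rule extends Function<Map<String, String>, String> {}

    /** Destination = left column followed by right column (missing columns count as empty). */
    static Rule concat(String left, String right) {
        return row -> row.getOrDefault(left, "") + row.getOrDefault(right, "");
    }

    /** Destination = first `length` characters of a column (shorter values are kept whole). */
    static Rule prefix(String column, int length) {
        return row -> {
            String value = row.getOrDefault(column, "");
            return value.substring(0, Math.min(length, value.length()));
        };
    }

    /** Applies every rule to one input row, producing one destination record. */
    static Map<String, String> transform(Map<String, Rule> rules, Map<String, String> row) {
        Map<String, String> out = new LinkedHashMap<>();
        rules.forEach((field, rule) -> out.put(field, rule.apply(row)));
        return out;
    }

    /** Hypothetical supplier row and rule set, as a UI might have stored them. */
    static Map<String, String> demo() {
        Map<String, Rule> rules = new LinkedHashMap<>();
        rules.put("fieldA", concat("columnX", "columnY"));
        rules.put("fieldB", prefix("columnZ", 3));
        Map<String, String> row = Map.of("columnX", "AB", "columnY", "12", "columnZ", "London");
        return transform(rules, row);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // {fieldA=AB12, fieldB=Lon}
    }
}
```

Note that this only covers column mapping. The date and money formatting problems mentioned above would still need validation on top of it.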
I have some entries with dates in my database. What is best?:
Fetch them with a sql statement and also apply order by.
Get the list with sql, and order them within the application with collection.sort or so?
Thanks
This is a very broad question that is very difficult to answer, and it depends a lot on what you mean by "best".
From a performance perspective, you will simply have to measure to determine what part of your system is the bottleneck. Databases are usually very efficient, but it could still be relevant to off-load that work to the client.
From a separation of concern perspective, it depends on how the sorting matters in the application and how the application is layered.
Ask yourself: "Where does the knowledge that the data is sorted belong?" and "What would happen if I were to change from relational database storage to something different?"
To some extent, it depends on how many values are in the complete collection. If it is, say, 20-30 values then you can sort anywhere — even a relatively poor sorting algorithm can do that quickly (avoid Stooge Sort though; that's terrible) — as that is the sort of size of data chunk which you might expect to actually fetch in one service response.
But once you get into larger datasets you need to plan much more carefully. In particular, you want to avoid moving data around if you don't have to. If the data is currently only present in the database, you really don't want to fetch it all into the client just to sort it (a relatively expensive operation) and then throw virtually all of it away. It's far better to actually keep the data sorted in the database to start with, so that picking it up in order is trivial; in relational database terms, keeping the data sorted is functionally identical to maintaining an index on the data. Indeed, you can have multiple indices on the data, which can make even rather complex queries quick. (NoSQL DBs are more varied; some don't even support the concept of keeping data sorted.) The downside of maintaining indices is that they take up more space and they take time to maintain, particularly when the data is being created in the first place.
So… to return to your question, you probably want to try to not sort the data in the application: for most data, an appropriate index can be much more efficient as it lets your code not even look at unwanted data. But if you have to fetch it all into your application for some other reason and you can't bring it in pre-sorted, there's no reason to avoid sorting it yourself: Java's sorting algorithms are efficient and stable. But you should measure whether fetching it from the DB in the new order is faster. (The question is whether the DB overheads exceed the super-linear costs of re-sorting; lots of problems are in the domain where “maybe; hard to tell” is the answer.)
The other thing to balance is whether it is simpler for your code to not do sorting itself and instead always delegate that to the DB. Keeping your code simpler (and more bug-free) is a good goal to have…
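For the case where the data is already in the application, a sketch of sorting it there might look like the following. The `Entry` class, its fields, and the table and column names in the SQL comment are hypothetical; they stand in for whatever the real schema is.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class SortEntriesDemo {

    /** Hypothetical row type: one dated entry fetched from the database. */
    static final class Entry {
        final LocalDate date;
        final String title;

        Entry(LocalDate date, String title) {
            this.date = date;
            this.title = title;
        }
    }

    /**
     * Returns a copy sorted by date. List.sort is stable, so entries with the
     * same date keep the order in which they were fetched.
     *
     * The database-side alternative would be a query along the lines of
     *   SELECT entry_date, title FROM entries ORDER BY entry_date
     * (table and column names assumed), ideally backed by an index.
     */
    static List<Entry> sortedByDate(List<Entry> entries) {
        List<Entry> copy = new ArrayList<>(entries);
        copy.sort(Comparator.comparing((Entry e) -> e.date));
        return copy;
    }

    public static void main(String[] args) {
        List<Entry> fetched = Arrays.asList(
                new Entry(LocalDate.of(2024, 3, 2), "b"),
                new Entry(LocalDate.of(2024, 3, 1), "a"),
                new Entry(LocalDate.of(2024, 3, 2), "c"));
        for (Entry e : sortedByDate(fetched)) {
            System.out.println(e.date + " " + e.title);
        }
    }
}
```

Which of the two is faster for a given data size is exactly what should be measured rather than assumed.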
Database management systems (DBMS) are optimized for these tasks, so I think you should stick with them. Especially if you are accessing the database from a script written in PHP (or another scripting language), it might be slower to perform that task in the script. You might also hit PHP's memory limit if you sort the array in the script.
I don't mean to raise a question of performance of different programming languages, just want to point out that it is a very good practice to rely on the DBMS whenever you can.
This is a very interesting question to me, and I want to present the other side of the accepted answer, which BTW is a very good answer with which I don't necessarily *dis*agree. Just want to present the other side.
When I started in my career, I was working on mainframe DB2, and the old-timers that taught me were VERY INSISTENT that sorting be done OUTSIDE of the db. Their rationale for this was that it's work that CAN be offloaded, and this leaves the DB free to service other requests.
Of course, it's far more nuanced than this. In general, I'd say the factors you're weighing are:
A) How busy, or central to your system, is your database? If your db is very busy, if you have a lot of OLTP processing on clients or app servers, and your client or application servers have lots of excess capacity, why not sort on the app server or client? Even if it's less efficient, it spreads the work through the system and gets you more throughput from a whole-systems perspective.
B) How big is the sort? It would be silly to, say, blow your call stack or Java heap because you sorted a gazillion MB of data.
C) Will sorting in your app or app server cause pauses, latency, etc? In other words, if your particular programming language has REALLY bad sorting libraries, and you don't want to write your own, maybe letting the DB take 0.5 seconds is better than making your application take 5.0 seconds.
So, as with all things, "it depends" ;-). But, I think these are the things upon which it depends.
I'm working on a Java application; one of its functions is to show detailed information in graph form, with the odd statistic and "top 10" list here and there.
The data is generated live by the application. Consider it an internet "honeypot": the data is the result of external attacks. The graphs will need to be of varying forms, such as:
1. Overall Statistics (Charts showing frequency of attacks per minute/hour/day, No. of attacks today, No. of attack-type attacks, Top 10 attackers)
2. Per Sensor (Charts showing frequency of attacks per minute/hour/day, Sensor 1 attacks today, No. of attack-type attacks, Top 10 attackers)
3. Per Attack-Type (Pie Chart)
The information for each attack type can vary quite a bit and there will be other information some have and some don't (e.g. a DoS will have an attacker-address whereas a Remote Exploit to upload a file will have attacker-address and file-name).
Initially I approached this by creating classes: there is a DoS data structure in which all the details of that attack can be stored, and these are stored inside a vector, but this became a serious headache very fast.
The obvious solution to me is to create a database (MySQL?) with a table for each attack type, from this, gaining all the 1., 2. and 3. information is merely an SQL query away.
However, I can't help but feel that my database solution is a tad nasty and that I'm missing something here, so after hitting my head against the problem I'm asking here.
Any pointers greatly appreciated!
I'd lean towards building the entire concept of 'attack' out as a class composed of all of the potential objects and fields necessary to describe any type of attack. You could specify interfaces as necessary to specify the contract of each particular attack type (for factory creation, etc) but then persist the entire object to a database with a schema pretty much identical to your implementation class structure. This should probably give you a pretty good ability to do the reporting that you want and I think implementation would be reasonably straightforward.
Without knowing just how large your attack tree is, it's a little difficult to be sure my approach is correct, but maybe this will be useful.
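As a rough sketch of that idea, assuming a single `Attack` class that carries the common fields plus the optional ones (left as `null` when an attack type does not have them): the class, field, and method names are all hypothetical, and the in-memory aggregation stands in for what would be a GROUP BY query once the objects are persisted.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class AttackStatsDemo {

    /** One class for every attack type; optional fields are null when absent. */
    static final class Attack {
        final String type;            // e.g. "DoS", "RemoteExploit"
        final String sensor;
        final String attackerAddress;
        final String fileName;        // only set for attacks that upload a file

        Attack(String type, String sensor, String attackerAddress, String fileName) {
            this.type = type;
            this.sensor = sensor;
            this.attackerAddress = attackerAddress;
            this.fileName = fileName;
        }
    }

    /** Number of attacks per attack type (input for the pie chart). */
    static Map<String, Long> countByType(List<Attack> attacks) {
        return attacks.stream()
                .collect(Collectors.groupingBy(a -> a.type, TreeMap::new, Collectors.counting()));
    }

    /** Most frequent attacker addresses; ties are broken alphabetically. */
    static List<String> topAttackers(List<Attack> attacks, int limit) {
        Map<String, Long> counts = attacks.stream()
                .collect(Collectors.groupingBy(a -> a.attackerAddress, Collectors.counting()));
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Long>comparingByKey()))
                .map(Map.Entry::getKey)
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Hypothetical sample data. */
    static List<Attack> sample() {
        return Arrays.asList(
                new Attack("DoS", "sensor-1", "10.0.0.1", null),
                new Attack("DoS", "sensor-2", "10.0.0.1", null),
                new Attack("RemoteExploit", "sensor-1", "10.0.0.2", "payload.bin"),
                new Attack("DoS", "sensor-1", "10.0.0.3", null));
    }

    public static void main(String[] args) {
        System.out.println(countByType(sample()));     // {DoS=3, RemoteExploit=1}
        System.out.println(topAttackers(sample(), 2)); // [10.0.0.1, 10.0.0.2]
    }
}
```

With one class (and one table) for all attack types, the overall, per-sensor, and per-type statistics differ only in which field is grouped or filtered on.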
Not sure but what you're describing looks like an OLAP cube so maybe consider using a star schema or a snowflake schema and have a look at something like Pentaho:
A complete Business Intelligence platform that includes reporting, analysis (OLAP), dashboards, data mining and data integration (ETL).