We have a Java-based system with PostgreSQL as the database. For various reasons we want to propagate certain changes on a timed basis (say, every hour) to a different location. The two broad approaches are:
Logging all the changes to a file as and when they happen. However, this approach will scatter logging code all over the codebase.
Somehow finding the incremental changes in PostgreSQL between two timestamps from some log files and sending those. However, I am not sure how feasible this approach is.
Does anyone have any thoughts/ideas around this?
Provided that the database size is not very great, you could do it quick and dirty by just:
Dumping the entire PostgreSQL database to a text file.
Sorting the text file (if the dump file is not already sorted *1).
Creating a diff against the previous dump file.
Of course, I would only advise this for a situation where your database is going to stay relatively small and you are just going to use it for a couple of servers.
*1: I do not know whether the dump is sorted; check the docs.
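A minimal sketch of that loop driven from Java, assuming pg_dump, sort, and diff are on the PATH; the database name and file names are placeholders:

import java.io.File;

public class DumpDiff {
    public static void main(String[] args) throws Exception {
        // --column-inserts gives one INSERT per row, which diffs more cleanly
        run(new ProcessBuilder("pg_dump", "--data-only", "--column-inserts",
                "-f", "dump.new.sql", "mydb"));
        run(new ProcessBuilder("sort", "-o", "dump.new.sql", "dump.new.sql"));
        // diff exits with code 1 when the files differ; that is not a failure here
        run(new ProcessBuilder("diff", "dump.old.sql", "dump.new.sql")
                .redirectOutput(new File("changes.diff")));
        // ship changes.diff, then keep dump.new.sql as the new baseline
    }

    static void run(ProcessBuilder pb) throws Exception {
        pb.redirectError(ProcessBuilder.Redirect.INHERIT); // surface tool errors
        pb.start().waitFor();
    }
}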
There are a few different options available:
Depending on the amount of data being written, you could give Bucardo a try.
Otherwise it is also possible to do something with PgQ in combination with Londiste.
Or create something yourself using triggers that populate some kind of audit table (a small sketch follows below).
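To illustrate the trigger idea, here is a minimal sketch run over JDBC, assuming PostgreSQL 9.5+ and a hypothetical orders table; an hourly job would then select everything newer than its last run:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AuditSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/mydb", "user", "pass");
             Statement st = conn.createStatement()) {
            // audit table capturing each change as JSON plus a timestamp
            st.execute("CREATE TABLE IF NOT EXISTS orders_audit ("
                    + " changed_at timestamptz NOT NULL DEFAULT now(),"
                    + " op char(1) NOT NULL, row_data jsonb NOT NULL)");
            st.execute("CREATE OR REPLACE FUNCTION orders_audit_fn() RETURNS trigger AS $$"
                    + " BEGIN INSERT INTO orders_audit(op, row_data)"
                    + " VALUES (left(TG_OP, 1), to_jsonb(NEW)); RETURN NEW; END"
                    + " $$ LANGUAGE plpgsql");
            st.execute("CREATE TRIGGER orders_audit_trg AFTER INSERT OR UPDATE ON orders"
                    + " FOR EACH ROW EXECUTE PROCEDURE orders_audit_fn()");
            // hourly export: SELECT * FROM orders_audit WHERE changed_at > :last_run
        }
    }
}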
There are many pre-packaged approaches, so you probably don't need to develop your own. Many of the options are summarized and compared on this Wiki page:
http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
Many of them use triggers to capture the data, with the triggers generated automatically through a more user-friendly interface.
Instead of writing your own solution, I would advise leveraging work already done by others. In the case you described I would go for PgQ + Londiste (both part of the Skytools package), which are easy to set up and use. If you do not want streaming replication, you can still use PgQ/Londiste to easily capture DML and write it to a file that you can load when needed. This would also let you expand your setup and processing as new requirements come along.
I need to handle a big CSV file with around 750,000+ rows of data. Each line has around 1,000+ characters and ~50 columns, and I am really not sure what the best (or at least a good and sufficient) way is to handle and manipulate this kind of data.
I need to do the following steps:
Compare the values of two columns and write the result to a new column (this one seems easy).
Compare values of two lines and act on the result (e.g., delete a line if a value is duplicated).
Compare values of two different files.
My problem is that this is currently done with PHP and/or Excel; their limits are nearly exceeded, processing takes a long time, and it will no longer be possible once the files get even bigger.
I have 3 different possibilities in mind:
Use MySQL: create a table (or two) and do the comparing, adding, or deleting there. (I am not really familiar with SQL and would have to learn it; it also has to run automatically, and there is the problem that you can't create tables directly from CSV files.)
Use Java, loading the data into an ArrayList or LinkedList of objects, and do "the stuff" there (the operations would be easy, but handling that much data will probably be the problem).
(Is it even possible to hold that much data in Java, or does it crash? Is there a good tool for it, etc.?)
Use Clojure along with MongoDB: load the CSV data into MongoDB and read it back using Mongo.
(Name additional possibilities if you have another idea.)
All in all, I am not a pro in any of these, but I would like to solve this problem and get some hints, or even your opinion.
Thanks in advance
Since we work a lot with huge CSV files in our company, here are some ideas:
Because these files are, in our case, always exported from some other relational database, we always use PostgreSQL, MySQL, or Go + SQLite so that we can use plain SQL queries, which are in these cases the simplest and most reliable solution.
The number of rows you describe is quite low from the point of view of all these databases, so do not worry.
All of them have a native solution for CSV import/export, which works much quicker than anything created manually.
For repeated standard checks I use Go + SQLite with a :memory: database; this is definitely the quickest solution (a Java version of the idea is sketched after this list).
MySQL is definitely very good and quick for the checks you described, but the choice of database also depends on how sophisticated an analysis you need to do afterwards. For example, MySQL up to 5.7 still does not have window functions, which you might need later, so consider PostgreSQL in some cases too...
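The same in-memory idea carried over to Java, as a minimal sketch; it assumes the xerial sqlite-jdbc driver on the classpath, and the table and column names are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MemoryChecks {
    public static void main(String[] args) throws Exception {
        // everything lives in RAM and disappears when the connection closes
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite::memory:");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE csv_data (id INTEGER, col_a TEXT, col_b TEXT)");
            // real code would batch-insert the parsed CSV rows via PreparedStatement
            st.execute("INSERT INTO csv_data VALUES (1, 'x', 'x'), (2, 'x', 'y')");
            // example check: rows where two columns carry the same value
            try (ResultSet rs = st.executeQuery(
                    "SELECT id FROM csv_data WHERE col_a = col_b")) {
                while (rs.next()) {
                    System.out.println("match on id " + rs.getInt(1));
                }
            }
        }
    }
}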
I normally use PostgreSQL for this kind of task. PostgreSQL's COPY allows importing CSV data easily. You then get a table with your CSV data and the power of SQL (and a reasonable database) to do basically anything you want with it.
I am pretty sure MySQL has similar capabilities for importing CSV; I just generally prefer PostgreSQL.
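For the COPY route from Java, here is a minimal sketch using the pgJDBC CopyManager API; the connection details, table, and file name are placeholders, and the target table must already exist:

import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

public class CsvImport {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/csvdb", "user", "pass")) {
            // created beforehand, e.g.: CREATE TABLE csv_data (id int, col_a text, ...)
            CopyManager copy = new CopyManager((BaseConnection) conn);
            long loaded = copy.copyIn(
                    "COPY csv_data FROM STDIN WITH (FORMAT csv, HEADER true)",
                    new FileReader("filename.csv"));
            System.out.println(loaded + " rows imported");
        }
    }
}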
I would not use Java for the CSV processing itself. It would be too much code, and unless you take care of indexing, the processing will not be performant. An SQL database is much better equipped for tabular data processing (which should not be a surprise).
I wouldn't use MongoDB; my impression is that it is less powerful for update operations than an SQL database. But this is just an opinion, take it with a grain of salt.
You should try Python with the pandas package. On a machine with enough memory (say 16 GB) it should be able to handle your CSV files with ease. The main thing is that anyone with some pandas experience will be able to develop a quick script for you and tell you within a few minutes whether your job is doable. To get you started:
import pandas
df = pandas.read_csv('filename.csv')
You might need to specify the column type if you get into memory issues.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I'd suggest using Spark. Even on a standalone machine the performance is incredible. You can use Scala, Python, or Java to handle your data. It's flexible, and you can do processing that would be very cumbersome in hand-rolled Java or in a relational database.
The other choices are great too, but I'd consider Spark for all analytics needs from now on.
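A minimal sketch of the two checks from the question using Spark's Java API on a single machine; the column names (col_a, col_b, col_key) are placeholders:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

import static org.apache.spark.sql.functions.col;

public class CsvWithSpark {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("csv-checks")
                .master("local[*]") // standalone machine, all cores
                .getOrCreate();

        Dataset<Row> df = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("filename.csv");

        // compare two columns and write the result into a new column
        df = df.withColumn("col_match", col("col_a").equalTo(col("col_b")));
        // drop rows duplicated on a key column
        df = df.dropDuplicates(new String[] {"col_key"});

        df.write().option("header", "true").csv("out");
        spark.stop();
    }
}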
I have a MongoDB replica set and want to know if it is possible to distribute queries evenly among the members of the set to increase performance. If it is possible, please let me know how to achieve this.
I'm using the 10gen Java driver 2.12.1.
Thanks for your help.
It is possible, but you would need to implement the query distribution at the driver level using a ReadPreference.
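For example, with the 2.x Java driver it looks roughly like this; the host names, database, and collection are placeholders:

import java.util.Arrays;

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.ReadPreference;
import com.mongodb.ServerAddress;

public class SecondaryReads {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient(Arrays.asList(
                new ServerAddress("rs1.example.com"),
                new ServerAddress("rs2.example.com")));
        // spread reads across secondaries when possible, fall back to the primary
        client.setReadPreference(ReadPreference.secondaryPreferred());

        DB db = client.getDB("mydb");
        DBCollection orders = db.getCollection("orders");
        // a read preference can also be set per collection (or per query)
        orders.setReadPreference(ReadPreference.secondary());
        System.out.println(orders.count());
        client.close();
    }
}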
You should carefully consider the fact that secondary reads may return stale data. You should also carefully consider whether or not the added complexity of implementing the query distribution and tolerating stale data are warranted. I would suggest that you start by doing some performance tests using the primary (default) ReadPreference.
It is also worth pointing out that what you are trying to accomplish (read scaling) can be better accomplished using sharding, without the need for additional client logic and use of a secondary ReadPreference.
Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file locally and put it on HDFS? Since we already know what the keys and values are, we could do it locally without loading the server.
Can someone point to an example of how an HFile can be created (any language is fine)?
regards
Nothing actually stops you from preparing HFiles 'by hand', but doing so makes you dependent on HFile compatibility issues. According to the bulk-load documentation (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file from the HBase sources. It is a standard MapReduce OutputFormat. What you need as the basis for this is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Check its writer logic for this; you need only that part.
- OK, you also need to build an efficient solution for handling the Put -> KeyValue stream for the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you do all these steps, you will have a solution that can take a sequence of Puts (it is no issue to generate them from any data) and produce a local HFile. It looks like this should take up to a week to get something reasonably working.
I don't go this way myself, because with a good InputFormat and a data-transforming mapper (which I wrote long ago) I can use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything working at full cluster power. Confused by a 10 GB SQL dump loaded in 5 minutes? Not me. You can't beat such speed with a single server.
OK, this solution required careful design of the SQL queries against the source database for the ETL process, but now it's an everyday procedure. A rough sketch of that standard MapReduce route follows.
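This is only a sketch, assuming an HBase 1.x-style API; the table name, column family, paths, and the "rowkey,value" CSV line format are all placeholders. configureIncrementalLoad wires in PutSortReducer, TotalOrderPartitioner, and HFileOutputFormat2 for you:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // turns "rowkey,value" lines into Puts; sorting and region-aligned
    // partitioning are handled by configureIncrementalLoad below
    public static class CsvToPutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] f = line.toString().split(",", 2);
            Put put = new Put(Bytes.toBytes(f[0]));
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(f[1]));
            ctx.write(new ImmutableBytesWritable(Bytes.toBytes(f[0])), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("mytable"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("mytable"))) {

            Job job = Job.getInstance(conf, "hfile-bulk-load");
            job.setJarByClass(BulkLoadDriver.class);
            job.setMapperClass(CsvToPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/input/csv"));
            FileOutputFormat.setOutputPath(job, new Path("/staging/hfiles"));
            // sets reducer, partitioner, and output format for region-aligned HFiles
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (job.waitForCompletion(true)) {
                // the same step the completebulkload tool performs
                new LoadIncrementalHFiles(conf).doBulkLoad(
                        new Path("/staging/hfiles"), conn.getAdmin(), table, locator);
            }
        }
    }
}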
I have a desktop application for managing restaurants' front-of-house operations such as reservations, guest data, and table turnover, with support for online reservations.
The problem I am trying to solve is how to capture customer spend and table state by integrating with MICROS. I would like to find out when a table is busy, when a check is printed, and what the total value of the check paid by the customer is.
Any help on how or where to start would be appreciated. The MICROS website is quite vague about what can be done.
-Thanks
One way to track this information is to create a polling application that runs on the Micros server. You would need read access to the database, and in the best-case scenario full DBA access. The schema is quite complicated, but if you Google something like "micros pos 3700 schema pdf" you'll come across some resources to get you going. Also check out http://www.tek-tips.com/ and search for Micros if you go this route. There are examples of SQL there, and other users who have faced the same task of integrating with Micros. You can query things like open checks and when a check was closed; that may give you an idea of when it was printed if you cannot find that out specifically.
I have never used MICROS specifically, but I have integrated with many systems before, and I generally find that if you call the vendor and tell them you want to integrate, they will usually be willing to tell you where their data is stored. Also, using their software for purposes other than what they intended could be copyright infringement unless you ask; you would also unofficially be a data processor for MICROS, and you don't want to get sued, so it's probably best to ask.
Generally speaking, though, you can probably find the data you want by performing a single action before you open (so as not to confuse matters) and looking through the files in the install directory until you find information on the action you just performed; take notes and repeat for each action. Then you can watch the directory for changes, and if a changed file is one of the ones you care about, process it (a small sketch of this follows). The best candidates are often logs, as they are usually plain text, updated in real time, easy to access, and you can usually pick out the patterns you want quite easily.
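Watching the directory is straightforward with Java's WatchService; a minimal sketch, where the directory path and the .log filter are placeholders:

import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchEvent;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;

public class LogWatcher {
    public static void main(String[] args) throws Exception {
        Path dir = Paths.get("C:/Micros/logs"); // hypothetical install directory
        WatchService watcher = FileSystems.getDefault().newWatchService();
        dir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE,
                StandardWatchEventKinds.ENTRY_MODIFY);
        while (true) {
            WatchKey key = watcher.take(); // blocks until something changes
            for (WatchEvent<?> event : key.pollEvents()) {
                Path changed = dir.resolve((Path) event.context());
                if (changed.toString().endsWith(".log")) {
                    // read the new lines and look for the patterns you noted
                    System.out.println("changed: " + changed);
                }
            }
            key.reset(); // re-arm the key or no further events arrive
        }
    }
}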
You do need to keep in mind, though, that some data may only be output at the end of the day or of the transaction in a format you can use, so again I really recommend calling and asking.
Let's say you have a database with a lot of products/customers/orders, and the codebase (Java/C#) contains all the business logic. During the night, several batch jobs are needed to export data to flat files and then FTP them to a proprietary system.
How should we do this write-database-to-a-flat-file step? What are the best practices?
Some thoughts:
we could create a stored procedure and use, for example, SSIS to fetch the data? Maybe we can do this if we have a "batch-output" database table, but not if we have to apply logic before the file is written?
we could do all the logic in managed code, using the same repositories/business logic as the rest of the domain (this could be slow compared to the stored-procedure solution)
What if the only interface to the domain services is web services (where each request could take a "long" time)? Would the best practices change then?
I personally prefer to use normal (managed) code to implement feeds instead of stored procs, mainly because:
1) It's usually easier to interface with the other system (even if it is only shared drive)
2) It is easy to log everything you need and debug if something goes wrong
3) You can reuse the same code you use for normal business logic (it's beneficial even if you just reference the same projects, etc.)
4) Often you need to enrich data with some information from other systems and this again is much easier to do from managed code.
5) It's much easier to test managed code and have all the unit tests, automated builds, etc.
I am not sure why it needs to be much slower than doing it all in a stored procedure. You just need to write a good stored procedure to extract the data you need, and the C#/Java app will do all the transformations, enrichment, etc. A minimal sketch of that split follows.
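In Java it could look roughly like this; the connection URL, query, cutoff timestamp, and file layout are placeholders, and the extraction query stands in for the "good stored procedure":

import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Timestamp;

public class NightlyFeed {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://localhost/shop", "feed", "secret");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT order_id, customer_id, total FROM orders WHERE created >= ?");
             BufferedWriter out = Files.newBufferedWriter(Paths.get("orders.csv"))) {
            ps.setTimestamp(1, Timestamp.valueOf("2024-01-01 00:00:00"));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // enrichment / transformation / business rules go here
                    out.write(rs.getLong(1) + ";" + rs.getLong(2) + ";" + rs.getBigDecimal(3));
                    out.newLine();
                }
            }
        }
        // then FTP orders.csv to the proprietary system
    }
}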
EDIT: Answering the comment:
I don't think it is possible to say whether you should reuse the existing stored procedures, tweak them, or create new ones. I think that if the performance hit or the needed changes are not too big, I would try to use one set of procedures, to avoid duplicating logic. But if the differences are substantial, then the cost of maintaining extra procedures will probably be lower than changing and releasing the existing ones.
Go with the repository code you've already got. Do a few performance tests and see if it meets the performance requirements. If there is a significant performance issue that can be nailed down to too much DB I/O, then go for the sproc or implement a bulk-export repository.