I'm processing info in Google Cloud Dataflow. We tried to use JPA to insert or update the data into our MySQL database, but these queries brought down our server, so we've decided to change our approach...
I want to generate a MySQL dump or .sql file so we can write out the new info processed through Dataflow. I want to know if there is an implemented way to do this, or whether I have to do it myself.
Let me explain a little more: we have an input from an XML file, and we process the info into Java classes. We have a JSON dump of the DB, so we can see what we have online without making so many calls. With this in mind, we compare the new info with the info we already have and decide whether it's new or just an update.
How can I do this via Java/Maven? I need code to generate this file...
Yes, Cloud Dataflow processes data in parallel on many machines. As such, it is not very surprising that other services may not be able to keep up or that some quotas are hit.
Depending on your specific use case, you may be able to slow or throttle Dataflow down without changing your approach. You might limit the number of workers, limit parallelism, use the IntraBundleParallelization API, etc. This might be a better path overall. We are also working on more explicit ways to throttle Dataflow.
Now, it is not really feasible for any system to automatically generate a .sql file for your database. However, it should be pretty straightforward to use primitives like ParDo and TextIO.Write to generate such a file via a Dataflow pipeline.
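For example, here is a minimal sketch in the Dataflow Java SDK 1.x style (the GCS paths, input format, table, and column names are all invented for illustration): a ParDo formats each processed record as an INSERT or UPDATE statement, and TextIO.Write emits the statements as sharded .sql files.

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class SqlFileGenerator {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());

    p.apply(TextIO.Read.from("gs://my-bucket/input/records.csv")) // hypothetical input
     .apply("FormatAsSql", ParDo.of(new DoFn<String, String>() {
        @Override
        public void processElement(ProcessContext c) {
          // Hypothetical input line format: id,name,isNew
          String[] f = c.element().split(",");
          String id = f[0], name = f[1];
          boolean isNew = Boolean.parseBoolean(f[2]);
          // Emit an INSERT for new records, an UPDATE for existing ones.
          c.output(isNew
              ? String.format("INSERT INTO my_table (id, name) VALUES ('%s', '%s');", id, name)
              : String.format("UPDATE my_table SET name = '%s' WHERE id = '%s';", name, id));
        }
      }))
     // One statement per line; the output is written as sharded files on GCS.
     .apply(TextIO.Write.to("gs://my-bucket/output/statements").withSuffix(".sql"));

    p.run();
  }
}
```

In your case the input to the formatting ParDo would be the PCollection produced by your existing XML-parsing and comparison steps rather than a CSV read; real code should also escape the values instead of interpolating them directly.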
We have a requirement where we need to pull data from multiple REST API services, transform it, and populate it into a new database. A huge number of records might have to be fetched, transformed, and updated this way. But it is a one-time activity: once all the data we get from the REST calls has been transformed and populated into the new DB, we will not have to re-run the transformation at any later time. What is the best way to achieve this in Spring?
Could Spring Batch be a possible solution if it has to be a one-time execution?
If it is a one-time thing I wouldn't bother using Spring Batch. I would simply call the external APIs, get the data, transform it and then persist it in your database. You could trigger the process either by exposing an endpoint in your own API to start it or relying on a Scheduled task.
Keeping things as simple as possible (but never simpler) is one of the greatest assets you can have while developing software, but it is also one of the hardest things for us to achieve as software engineers, simply because we usually overthink our solutions.
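As a rough sketch of that simpler approach (the endpoint URL, table, and column names are made up, and it assumes @EnableScheduling is configured): fetch the records with RestTemplate, apply the transformation, and persist them with JdbcTemplate.

```java
import java.util.Map;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class OneTimeMigration {

  private final RestTemplate restTemplate = new RestTemplate();
  private final JdbcTemplate jdbcTemplate;

  public OneTimeMigration(JdbcTemplate jdbcTemplate) {
    this.jdbcTemplate = jdbcTemplate;
  }

  // Runs once shortly after startup; could equally be triggered from a REST endpoint.
  @Scheduled(initialDelay = 10_000, fixedDelay = Long.MAX_VALUE)
  public void migrate() {
    // Hypothetical external API returning a JSON array of objects.
    Map[] records = restTemplate.getForObject("https://api.example.com/records", Map[].class);

    for (Map<?, ?> r : records) {
      // Example transformation: normalize the name before persisting.
      String name = String.valueOf(r.get("name")).trim();
      jdbcTemplate.update("INSERT INTO target_table (external_id, name) VALUES (?, ?)",
          r.get("id"), name);
    }
  }
}
```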
For this kind of problem it would be better to use an ETL (extract, transform, and load) tool or framework. My recommendation is Kafka; check this link, I think it will be helpful: Link
In our application, one of our microservices queries the DB, gets the result (100k rows), and generates an Excel file using Apache POI. A couple of other services do the same thing (get DB rows and generate Excel). Since the Excel generation process is common, is it a good design to separate this Excel generation into its own microservice and use it from all the other services?
The challenge is passing the data (100k rows) between microservices over HTTP.
How can we achieve it ?
I personally never make the export feature a separate service.
When serving this kind of tabular data, I provide a paged table view of the data and also offer an export function that streams the result as octet-stream data without a paging limit. Export can be just another type of view.
I've used the Apache POI library for report rendering before, but only for small pages and complex shapes. POI also provides streaming versions of its workbook classes, such as SXSSFWorkbook.
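For instance, a minimal sketch of streaming a large result set into an XLSX file with SXSSFWorkbook, keeping only a small window of rows in memory (the sheet name and row layout are invented):

```java
import java.io.OutputStream;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class ExcelStreamExporter {

  // Streams rows into an XLSX file while keeping only 100 rows in memory at a time.
  public void export(Iterable<String[]> rows, OutputStream out) throws Exception {
    try (SXSSFWorkbook workbook = new SXSSFWorkbook(100)) { // row access window of 100
      Sheet sheet = workbook.createSheet("export");
      int rowIndex = 0;
      for (String[] values : rows) {
        Row row = sheet.createRow(rowIndex++);
        for (int col = 0; col < values.length; col++) {
          row.createCell(col).setCellValue(values[col]);
        }
      }
      workbook.write(out);  // e.g. the HTTP response's output stream
      workbook.dispose();   // delete the temporary files backing the streamed rows
    }
  }
}
```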
To be a microservice, it should have a proper reason to exist as an external system. If the system only provides an export of something: no, it's too simple and would be overkill. If you're considering adding versioning, permissions, distribution, folder zipping, or... storage management, well, that could be an option.
By the way, when exporting such big data to a file, keep in mind that Excel has a maximum of roughly 1M rows per sheet, so you may hit that limit if your data grows.
Why not just use a CSV format? Easy to use, easy to jump around in, easy to process.
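If CSV is acceptable, the export code becomes almost trivial; a rough sketch (the header and column layout are invented):

```java
import java.io.PrintWriter;
import java.io.Writer;
import java.util.List;

public class CsvExporter {

  // Streams rows straight to the target writer (e.g. the HTTP response), with no row limit.
  public void export(Iterable<List<String>> rows, Writer target) {
    try (PrintWriter out = new PrintWriter(target)) {
      out.println("id,name,amount"); // hypothetical header
      for (List<String> row : rows) {
        out.println(String.join(",", row)); // real code should escape commas and quotes
      }
    }
  }
}
```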
You need to ask yourself what defines a service. Does reading chunks of data from a file count as a service?
When I think about separating my services, I think along multiple lines: what does this module need to do, who will be using it, what dependencies does it have, how will it need to scale in the future, and, above all, which business team will own it. I tend to divide modules based on the answers I get to these questions.
In your case I see this less as a service and more as a utility function that can be put in a jar and shared. A new service would be more along the lines of, say, a reporting service that reads legacy Excel files to create reports, or a migration service that uses such a utility to read Excel.
Also, there is no final answer; you need to keep questioning your design until you are happy with it.
I have a use case in which my data is present in MySQL.
For each new row inserted into MySQL, I have to perform analytics on the new data.
How I am currently solving this problem is:
My application is a Spring Boot application, in which I have used a scheduler that checks for new rows entered into the database every 2 seconds.
The problem with the current approach is:
Even if there is no new data available in the MySQL table, the scheduler fires a MySQL query to check whether new data is available or not.
One way to solve this type of problem in any SQL database is triggers.
But so far I have not been successful in creating MySQL triggers that can call a Java-based Spring application, or even a simple Java application.
My question is:
Is there any better way to solve my above use case? I am even open to switching to another storage (database) system if it is built for this type of use case.
This fundamentally sounds like an architecture issue. You're essentially using a database as an API which, as you can see, causes all kinds of issues. Ideally, this db would be wrapped in a service that can manage the notification of systems that need to be notified. Let's look at a few different options going forward.
Continue to poll
You didn't outline what the actual issue is with your current polling approach. Is running the job when it's not needed causing some kind of problem? I'd be a proponent of just leaving it alone unless you're interested in making a larger change.
Database Trigger
While I'm unaware of a way to launch a java process via a db trigger, you can do an HTTP POST from one. With that in mind, you can have your batch job staged in a web app that uses a POST to launch the job when the trigger fires.
Wrap existing datastore in a service
This is, IMHO, the best option. This allows there to be a system of record that provides an API that can be versioned, etc. It would also allow any logic around who to notify to be encapsulated in this service.
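A minimal sketch of such a wrapper (the endpoint, table, DTO, and event types are all hypothetical): a Spring controller that owns every write and publishes an in-process event that the analytics component can subscribe to.

```java
import org.springframework.context.ApplicationEventPublisher;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class RecordService {

  // Hypothetical DTO and event types, declared here only to keep the sketch self-contained.
  public static class RecordDto { public String payload; }
  public static class RecordCreatedEvent {
    public final RecordDto record;
    public RecordCreatedEvent(RecordDto record) { this.record = record; }
  }

  private final JdbcTemplate jdbcTemplate;
  private final ApplicationEventPublisher events;

  public RecordService(JdbcTemplate jdbcTemplate, ApplicationEventPublisher events) {
    this.jdbcTemplate = jdbcTemplate;
    this.events = events;
  }

  // All writes go through this endpoint, so the service always knows when a new row arrives.
  @PostMapping("/records")
  public void create(@RequestBody RecordDto dto) {
    jdbcTemplate.update("INSERT INTO records (payload) VALUES (?)", dto.payload);
    // Notify interested components (e.g. the analytics job) instead of having them poll MySQL.
    events.publishEvent(new RecordCreatedEvent(dto));
  }
}
```

The analytics side can then receive RecordCreatedEvent with a plain @EventListener method; if the consumers live in other processes, the service would publish to a message broker instead.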
Replace data store with something that allows for better notifications
Without any real information on what the data being stored is, it's hard to say how practical this is. But something like Apache Kafka or Apache Geode would both be options that provide the ability to be notified when new data is persisted (Kafka by listening to the topic, Geode via a continuous query).
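For the Kafka route, a rough sketch using spring-kafka (the topic and group names are made up, and a configured broker connection is assumed); the analytics code runs as soon as the writing side publishes a message, with no polling:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class NewRowListener {

  // Fires as soon as the producing side publishes a record for the new row.
  @KafkaListener(topics = "new-rows", groupId = "analytics")
  public void onNewRow(String message) {
    // Run the analytics for the newly persisted row here.
    System.out.println("Running analytics for: " + message);
  }
}
```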
For the record, I'd advocate for the wrapping of the existing database in a service. That service would be the only way into the db and take on responsibility for any notifications required.
Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file (locally) and put it on HDFS? Seeing as we already know what the keys and values are, we can do it locally without loading the server.
Can someone point to an example of how an HFile can be created (any language is fine)?
Regards
Nothing actually stops anyone from preparing HFiles 'by hand', but by doing so you start to depend on HFile compatibility issues. According to this (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file from the HBase sources. It is a standard MapReduce OutputFormat. What you really need as the basis for this is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Check its writer logic for this; you only need that part.
- OK, you also need to build an efficient solution for handling the Put -> KeyValue stream for the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you do all these steps, you will have a solution that can take a sequence of Puts (there is no issue generating them from any data) and produce a local HFile as a result. It looks like this should take up to a week to get something reasonably working.
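As a rough illustration of the end result (the exact API differs between HBase versions, and the output path, column family, and qualifier here are invented), writing a local HFile directly with HFile.Writer looks roughly like this:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileContext;
import org.apache.hadoop.hbase.io.hfile.HFileContextBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class LocalHFileWriter {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.getLocal(conf);
    Path path = new Path("/tmp/myfamily/hfile-0001"); // hypothetical local output path

    HFileContext context = new HFileContextBuilder().withBlockSize(64 * 1024).build();
    HFile.Writer writer = HFile.getWriterFactory(conf, new CacheConfig(conf))
        .withPath(fs, path)
        .withFileContext(context)
        .create();

    byte[] family = Bytes.toBytes("cf");
    byte[] qualifier = Bytes.toBytes("q");
    long now = System.currentTimeMillis();

    // Cells MUST be appended in strict key order (row, family, qualifier, timestamp).
    for (int i = 0; i < 1000; i++) {
      byte[] row = Bytes.toBytes(String.format("row-%08d", i));
      writer.append(new KeyValue(row, family, qualifier, now, Bytes.toBytes("value-" + i)));
    }
    writer.close();
    // The resulting file can then be moved to HDFS and loaded with completebulkload.
  }
}
```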
I didn't go this way, because by just having a good InputFormat and a data-transforming mapper (which I built long ago) I can now use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything work using the full cluster's power. Surprised by a 10G SQL dump being loaded in 5 minutes? Not me. You can't beat such speed using a single server.
OK, this solution required careful design of the SQL queries used to extract data from the SQL DB for the ETL process. But now it's an everyday procedure.
I have developed a small Swing desktop application. This app needs data from another database, so I've created a small Java process that gets the info (using JDBC) from the remote DB and copies it (using JPA) to the local database. The problem is that this process takes a lot of time. Is there another way to do it, in order to make this task faster?
Please let me know if I am not clear, I'm not a native speaker.
Thanks
Diego
One good option is to use the Replication feature in MySQL. Please refer to the MySQL manual here for more information.
JPA is less suited here, as object-relational mapping is costly, and this is a bulk data transfer. You probably also do not need database replication here.
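If you stay with plain JDBC on both sides, batching the inserts and committing in chunks usually makes the biggest difference; a rough sketch (connection strings, table, and columns are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BatchCopy {

  public static void main(String[] args) throws Exception {
    // Hypothetical connection strings; rewriteBatchedStatements lets the MySQL driver
    // collapse each batch into multi-row INSERTs.
    try (Connection source = DriverManager.getConnection(
             "jdbc:mysql://remote-host/sourcedb", "user", "pass");
         Connection target = DriverManager.getConnection(
             "jdbc:mysql://localhost/localdb?rewriteBatchedStatements=true", "user", "pass")) {

      target.setAutoCommit(false); // commit in chunks instead of per row

      try (Statement select = source.createStatement();
           ResultSet rs = select.executeQuery("SELECT id, name FROM items");
           PreparedStatement insert = target.prepareStatement(
               "INSERT INTO items (id, name) VALUES (?, ?)")) {

        int count = 0;
        while (rs.next()) {
          insert.setLong(1, rs.getLong("id"));
          insert.setString(2, rs.getString("name"));
          insert.addBatch();
          if (++count % 1000 == 0) { // flush every 1000 rows
            insert.executeBatch();
            target.commit();
          }
        }
        insert.executeBatch(); // flush the final partial batch
        target.commit();
      }
    }
  }
}
```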
Maybe a backup is a solution: several different approaches are listed there.
In general, one can also run a mysqldump (of a single table, for instance) from a cron task, compress the dump, and retrieve it.