I'd like to ask about the drawbacks of calling an external API while running a MapReduce job. What are they?
Some examples: inside the mapper we need to geocode an address and call the Google Maps API, or we call an external DB to get elements related to an item, etc.
It's perfectly OK to make a call to an external API as long as there are no DB calls in the external API. In many ways this is preferred to writing your logic all over again. Often you want your MapReduce jobs to be nothing more than wrappers around logic written in a non-MapReduce context. This makes for more testable code.
However, making external DB calls is STRONGLY discouraged. It will drastically reduce the speed of your MapReduce jobs, as every call is a random-access call. In addition, having several thousand map/reduce tasks hitting your DB at the same time could bring the DB to its knees. If you need related elements, it's preferable to have all the elements on HDFS and do a join in MapReduce. If the DB you're talking about is a NoSQL store such as Cassandra or HBase, they have batch export features to export the entire table onto HDFS.
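To illustrate the "thin wrapper" idea, here is a minimal sketch of a mapper that only delegates to plain Java code. The Geocoder interface and the GoogleGeocoder class are hypothetical stand-ins for whatever client you already use; the point is that the geocoding logic can be unit-tested without Hadoop.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GeocodeMapper extends Mapper<LongWritable, Text, Text, Text> {

    /** Plain-Java abstraction that can be tested and mocked on its own. */
    public interface Geocoder {
        String geocode(String address);
    }

    private Geocoder geocoder;

    @Override
    protected void setup(Context context) {
        // Build the external client once per task, not once per record.
        // GoogleGeocoder is a hypothetical wrapper around the Google Maps API.
        geocoder = new GoogleGeocoder(context.getConfiguration().get("geocoder.api.key"));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String address = value.toString();
        // The mapper does nothing but delegate and emit.
        context.write(new Text(address), new Text(geocoder.geocode(address)));
    }
}
```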
Related
We have a requirement where we need to pull data from multiple REST API services, transform it, and populate it into a new database. A huge number of records might have to be fetched, transformed, and updated this way. However, it is a one-time activity: once all the data we get from the REST calls has been transformed and populated into the new DB, we would not have to re-run the transformation later. What is the best way to achieve this in Spring?
Can Spring Batch be a possible solution if it has to be a one-time execution?
If it is a one-time thing I wouldn't bother using Spring Batch. I would simply call the external APIs, get the data, transform it and then persist it in your database. You could trigger the process either by exposing an endpoint in your own API to start it or relying on a Scheduled task.
Keeping things as simple as possible (but never simpler) is one of the greatest assets you can have while developing software, but it is also one of the hardest things for us to achieve as software engineers, simply because we usually overthink our solutions.
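As a rough sketch of that simple approach (the names LegacyApiClient, RecordTransformer, SourceRecord and TargetRepository are hypothetical placeholders for your own REST client, mapping code and Spring Data repository):

```java
import java.util.List;
import java.util.stream.Collectors;

import org.springframework.stereotype.Service;

@Service
public class OneTimeMigrationService {

    private final LegacyApiClient apiClient;      // hypothetical wrapper around RestTemplate/WebClient calls
    private final RecordTransformer transformer;  // pure transformation logic, easy to unit-test
    private final TargetRepository repository;    // Spring Data repository for the new database

    public OneTimeMigrationService(LegacyApiClient apiClient,
                                   RecordTransformer transformer,
                                   TargetRepository repository) {
        this.apiClient = apiClient;
        this.transformer = transformer;
        this.repository = repository;
    }

    // Trigger this once, e.g. from a controller endpoint or a @Scheduled method.
    public void migrate() {
        int page = 0;
        List<SourceRecord> batch;
        while (!(batch = apiClient.fetchPage(page++)).isEmpty()) {
            repository.saveAll(batch.stream()
                                    .map(transformer::toTargetEntity)
                                    .collect(Collectors.toList()));
        }
    }
}
```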
For this kind of problem it would be better to use an ETL (extract, transform, and load) tool or framework. My recommendation is Kafka; check this link, I think it will be helpful: Link
Our application uses java / sql server.
We have ETL jobs (around 35, for different upstreams) using Spring Batch. Some of the code is in Java and some in the database. We want to track the lifecycle of a job from the database: e.g. when a job started, when a particular component got called, when a method / stored procedure got called, and how much time it took. The purpose is to do a health check of which component is taking more time, so that if some stored procedure takes a lot of time in production we can query the database for it. Moreover, we also want to store intermediate calculations for audit and debugging purposes.
This time tracking and these intermediate calculations would be stored in addition to normal application logging.
The current solution we have implemented is a set of normalized tables in the database (e.g. Job, Task, Status, etc.), for which we have stored procedure wrappers, plus Java classes to call those stored procedures.
We are not redesigning our application, so I wanted to check what the best approach is to track such information. AOP? But I believe that usually only gets invoked before and after a method; what about the intermediate calculations we want to store?
Our current approach works, but it clutters the code: each method does logging and auditing instead of just concentrating on its main logic.
A free and open-source tool you should consider is JAMon, a comprehensive monitoring framework that provides a lot of useful features:
JAMon allows developers to track their applications' performance and behavior using predefined modules. There are modules that automatically monitor: SQL, HTTP page requests, Spring beans, method invocations, Log4j, and exceptions. Other modules are often easy to build. JAMon keeps track of the following metrics for any of the items it tracks in the modules: hits, total, average, min, max and concurrency (average, max, current/active), to name a few.
Now, about storing calculations: I would suggest breaking your methods into smaller sub-methods and then using AOP (or any other tool) to capture the returned values and perform whatever operation you want on that data.
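For example, a minimal Spring AOP sketch of that idea; the CalculationAuditDao and the com.mycompany.etl pointcut package are assumptions you would adapt to your own audit tables and code:

```java
import org.aspectj.lang.JoinPoint;
import org.aspectj.lang.annotation.AfterReturning;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class CalculationAuditAspect {

    // Hypothetical DAO that writes into your existing audit tables.
    private final CalculationAuditDao auditDao;

    public CalculationAuditAspect(CalculationAuditDao auditDao) {
        this.auditDao = auditDao;
    }

    // Capture the value returned by every "calculate*" sub-method and store it.
    @AfterReturning(
            pointcut = "execution(* com.mycompany.etl..calculate*(..))",
            returning = "result")
    public void storeIntermediateResult(JoinPoint joinPoint, Object result) {
        auditDao.save(joinPoint.getSignature().toShortString(), String.valueOf(result));
    }
}
```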
In addition, if you need more detail on the database layer, I would recommend log4jdbc, which will give you nice auditing and metrics around JDBC calls. For example, you'll be able to get the execution time, the in and out parameters of called procedures, and the parameters provided to any statements.
You can even extend this tool to provide custom behavior (audit only some procedures, do something specific with the collected data, etc.).
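Wiring log4jdbc in is mostly a matter of swapping the driver and prefixing the JDBC URL. Here is a hedged sketch for a SQL Server DataSource; the exact driver class and logger names depend on which log4jdbc variant you pick, and the host/credentials are placeholders:

```java
import javax.sql.DataSource;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

@Configuration
public class AuditDataSourceConfig {

    @Bean
    public DataSource dataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        // log4jdbc proxy driver wraps the real SQL Server driver.
        ds.setDriverClassName("net.sf.log4jdbc.DriverSpy");
        // Note the "log4jdbc" prefix in front of the normal JDBC URL.
        ds.setUrl("jdbc:log4jdbc:sqlserver://dbhost:1433;databaseName=etl");
        ds.setUsername("etl_user");
        ds.setPassword("secret");
        // Then enable the jdbc.sqltiming / jdbc.audit loggers in your logging
        // configuration to capture execution times and procedure parameters.
        return ds;
    }
}
```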
Aspects are a very good way to isolate the timing code in one place.
Stored procedures seem unnecessary to me. A simple SQL INSERT ought to do the trick. It's fine if you're using the stored proc as an interface to hide the schema from users, but I doubt that this table will evolve much.
Logging, timing, and auditing are the "hello world" of aspect oriented programming.
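As an illustration, here is a sketch of such a timing aspect that records durations with a plain INSERT via JdbcTemplate; the task_timing table and the com.mycompany.etl package are placeholders for your own schema and code:

```java
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class TimingAspect {

    private final JdbcTemplate jdbcTemplate;

    public TimingAspect(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    // Time every method in the ETL packages and record the duration with a plain INSERT.
    @Around("execution(* com.mycompany.etl..*(..))")
    public Object time(ProceedingJoinPoint pjp) throws Throwable {
        long start = System.currentTimeMillis();
        try {
            return pjp.proceed();
        } finally {
            jdbcTemplate.update(
                "INSERT INTO task_timing (component, duration_ms, recorded_at) VALUES (?, ?, GETDATE())",
                pjp.getSignature().toShortString(),
                System.currentTimeMillis() - start);
        }
    }
}
```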
I'm processing info in Google Cloud Dataflow. We tried to use JPA to insert or update the data in our MySQL database, but these queries brought our server down, so we've decided to change our approach...
I want to generate a MySQL .sql file so we can write out the new info processed through Dataflow. I want to know if there is an already-implemented way to do this, or do I have to do it myself?
Let me explain a little more: we have an XML input that we process into Java classes. We also have a JSON dump of the DB, so we can see what we have online without making so many calls. With this in mind, we compare the new info with the info we already have and decide whether each item is new or just an update.
How can I do this via Java/Maven? I need code to generate this file...
Yes, Cloud Dataflow processes data in parallel on many machines. As such, it is not very surprising that other services may not be able to keep up or that some quotas are hit.
Depending on your specific use case, you may be able to slow/throttle Dataflow down without changing your approach. One might limit the number of workers, limit parallelism, use IntraBundleParallelization API, etc. This might be a better path, overall. We are also working on more explicit ways to throttle Dataflow.
Now, it is not really feasible for any system to automatically generate a .sql file for your database. However, it should be pretty straightforward to use primitives like ParDo and TextIO.Write to generate such a file via a Dataflow pipeline.
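For example, a rough sketch against the (old) Dataflow 1.x Java SDK that IntraBundleParallelization belongs to; the bucket paths and the toSqlStatement() helper are hypothetical and would need to match your own record format:

```java
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class SqlFileGenerator {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply(TextIO.Read.from("gs://my-bucket/processed-records/*"))   // whatever your pipeline already produces
         .apply(ParDo.of(new DoFn<String, String>() {
             @Override
             public void processElement(ProcessContext c) {
                 // Turn each processed record into an INSERT/UPDATE statement
                 // instead of executing it against MySQL.
                 c.output(toSqlStatement(c.element()));
             }
         }))
         .apply(TextIO.Write.to("gs://my-bucket/output/statements"));     // sharded text files of SQL statements

        p.run();
    }

    // Hypothetical mapping from your record format to a SQL statement.
    private static String toSqlStatement(String record) {
        return "INSERT INTO items (payload) VALUES ('" + record.replace("'", "''") + "');";
    }
}
```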
Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file (locally) and put it on HDFS? Since we already know what the keys and values are, we can do it locally without loading the server.
Can someone point me to an example of how an HFile can be created (any language is fine)?
Regards
Nothing actually stops anyone from preparing HFiles 'by hand', but by doing so you start to depend on HFile compatibility issues. According to this (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files onto HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file from the HBase sources. It is a standard MapReduce OutputFormat. What you really need as a basis is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Check its writer logic for this; you only need that part.
- You also need to build an efficient solution for handling the Put -> KeyValue stream for the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you do all these steps, you'll have a solution that takes a sequence of Puts (it's no problem to generate them from any data) and produces a local HFile as a result. It should take up to a week to get something working reasonably well.
I don't go this way, because with a good InputFormat and a data-transforming mapper (which I wrote long ago) I can use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything work using the full power of the cluster. Confused by a 10 GB SQL dump loaded in 5 minutes? Not me. You can't beat such speed with a single server. The sketch below shows this standard path.
OK, this solution required careful design of the SQL requests against the SQL DB the ETL process reads from, but now it's an everyday procedure.
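A rough sketch of that in-cluster path, assuming HBase 1.x APIs; MyPutMapper (a mapper emitting ImmutableBytesWritable/Put pairs), the table name and the paths are placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hfile-bulk-load");
        job.setJarByClass(BulkLoadDriver.class);
        job.setMapperClass(MyPutMapper.class);                  // emits (ImmutableBytesWritable, Put)
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        FileInputFormat.addInputPath(job, new Path("/input/sql-dump"));
        FileOutputFormat.setOutputPath(job, new Path("/tmp/hfiles"));

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("my_table"));
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("my_table"))) {

            // Wires in TotalOrderPartitioner, PutSortReducer and HFileOutputFormat2.
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);

            if (job.waitForCompletion(true)) {
                // Equivalent of the completebulkload step: moves the HFiles into the regions.
                new LoadIncrementalHFiles(conf).doBulkLoad(
                        new Path("/tmp/hfiles"), conn.getAdmin(), table, locator);
            }
        }
    }
}
```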
I have been reading about Neo4j for the last few days. I got very confused about whether I need to use the REST API or whether I can go with the Java APIs.
My need is to create millions of nodes which will have some connections among them. I want to add indexes on a few of the node attributes for searching. Initially I started with the embedded mode of the graph DB using the Java API, but I soon hit OutOfMemory while indexing a few nodes. So I thought it would be better if Neo4j were running as a service and I connected to it through the REST API; it would then do all the memory management by itself, swapping data in and out of the underlying files. Is my assumption right?
Further, I have plans to scale my solution to billions of nodes, which I believe won't be possible with a single machine's Neo4j installation. I also believe Neo4j has the capability of running in distributed mode. For this reason, too, I thought continuing with the REST API implementation was the best idea.
However, I couldn't find any good documentation about how to run Neo4j in a distributed environment.
Can I do things like batch insertion, etc. with the REST API as well, as I do with the Java APIs when the graph DB is running in embedded mode?
Do you know why you are getting your OutOfMemory exception? It sounds like you are creating all these nodes in the same transaction, which causes them to live in memory. Try committing small chunks at a time so that Neo4j can write them to disk. You don't have to manage Neo4j's memory aside from things like the cache.
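A minimal sketch of that chunked-commit pattern with the embedded Java API (Neo4j 3.x style here; the label, property, counts and batch size are just placeholders):

```java
import java.io.File;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class ChunkedImport {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase(new File("data/graph.db"));
        int batchSize = 10_000;
        Transaction tx = db.beginTx();
        try {
            for (int i = 0; i < 1_000_000; i++) {
                Node node = db.createNode(Label.label("Item"));
                node.setProperty("itemId", i);
                if ((i + 1) % batchSize == 0) {
                    // Commit this chunk and start a fresh transaction so the
                    // whole import never has to fit in memory at once.
                    tx.success();
                    tx.close();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.close();
            db.shutdown();
        }
    }
}
```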
Distributed mode is a master/slave architecture, so you'll still have a copy of the entire DB on each system. Neo4j is very efficient in terms of disk storage: a node takes 9 bytes, a relationship takes 33 bytes, and properties are variable.
There is a batch REST API, which will group many calls into the same HTTP call; however, making REST calls is still slower than if this were embedded.
There are some disadvantages to using the REST API that you did not mention, such as transactions. If you need atomic operations, where you create several nodes and relationships and change properties, and commit nothing if any step fails, you cannot do this via the REST API.