I am working on a project where I receive around 10 files, each about 200 GB in size. The requirement is to extract data from each file, join it with the other files, and produce the output.
For example, file 1.txt contains an Account Id, and file 2.txt contains an Account Id and an Account Name. Using the Account Id from the first file, I need to extract the matching data from the second file.
I need to perform this kind of manipulation on each of the 10 files and create the final output files.
I am currently doing this in Java, which is a very time-consuming process; it takes approximately 4 to 5 hours.
Can I improve the performance somehow? Is there any technology, tool, or framework I can integrate with Java to speed this up?
I have tried the following approaches.
1) Apache Drill - I ran join queries.
Result: Drill throws a Drillbit-down exception because the files are too large.
2) Apache Beam - I performed the join on the files using parallel processing.
Result: it gives an Out Of Memory exception at the group-by step.
I am reading this data from Hadoop.
I would suggest using Hadoop with Spark, because Spark's in-memory execution model is faster than MapReduce.
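For illustration, here is a minimal sketch of how such a join could look with Spark SQL in Java; the HDFS paths, file layout (CSV with headers), and column name are assumptions, not taken from the question:

```java
// Minimal sketch: join two large files on Account Id with Spark SQL so the
// shuffle spills to disk across the cluster instead of exhausting one JVM.
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AccountJoin {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("account-join")
                .getOrCreate();

        // Assumes CSV-like files with headers; adjust options to the real layout.
        Dataset<Row> accounts = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/1.txt");          // Account Id
        Dataset<Row> accountNames = spark.read()
                .option("header", "true")
                .csv("hdfs:///data/2.txt");          // Account Id, Account Name

        // Join on the shared key and write the result back to HDFS.
        accounts.join(accountNames, "AccountId")
                .write()
                .option("header", "true")
                .csv("hdfs:///data/output");

        spark.stop();
    }
}
```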
Maybe these two links will help you:
https://content.pivotal.io/blog/3-key-capabilities-necessary-for-text-analytics-natural-language-processing-in-the-era-of-big-data
https://community.hortonworks.com/articles/84781/spark-text-analytics-uncovering-data-driven-topics.html
In our application, one of our microservices queries the DB, gets the result (100k rows), and generates an Excel file using Apache POI. A couple of other services do the same thing (get DB rows and generate Excel). Since the Excel-generation process is common, is it good design to split it out as a separate microservice and use it from all the other services?
The challenge is passing the data (100k rows) between microservices over HTTP.
How can we achieve this?
Personally, I never make the export feature a separate service.
When serving table-based data like this, I provide a paged table view of the data and also offer an export function as an octet stream without the paging limit. Export can be thought of as just another type of view.
I've used the Apache POI library for report rendering before, but only for small pages and complex shapes. POI also provides streaming versions of the workbook classes, such as SXSSFWorkbook.
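As a rough sketch of that streaming approach (the row data and sheet name below are placeholders, not the poster's schema), an export with SXSSFWorkbook could look like this:

```java
// Sketch: stream ~100k rows into an .xlsx with POI's SXSSFWorkbook so only a
// small window of rows is kept in memory; older rows are flushed to temp files.
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.List;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

public class StreamingExcelExport {
    public static void export(List<String[]> rows, OutputStream out) throws Exception {
        // Keep only 100 rows in memory at a time.
        try (SXSSFWorkbook workbook = new SXSSFWorkbook(100)) {
            Sheet sheet = workbook.createSheet("export");
            int rowNum = 0;
            for (String[] values : rows) {
                Row row = sheet.createRow(rowNum++);
                for (int col = 0; col < values.length; col++) {
                    row.createCell(col).setCellValue(values[col]);
                }
            }
            workbook.write(out);
            workbook.dispose(); // remove the temporary files backing the stream
        }
    }

    public static void main(String[] args) throws Exception {
        try (OutputStream out = new FileOutputStream("export.xlsx")) {
            export(List.of(new String[]{"id", "name"},
                           new String[]{"1", "Alice"}), out);
        }
    }
}
```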
To justify being a microservice, there should be a proper reason for it to be an external system. If the system does nothing but export, then no: it's too simple, and a separate service is overkill. If you're considering adding versioning, permissions, distribution, folder zipping, or storage management, then it could be an option.
By the way, when exporting such a large amount of data to a file, keep in mind that Excel has a hard limit of 1,048,576 rows per sheet, so you may hit that limit as your data grows.
Why not just use CSV format? It's easy to use, easy to scan, and easy to process.
You need to ask yourself what defines a service. Does reading chunks of data from a file count as a service?
When I think about separating my services, I think along multiple lines: what does this module need to do, who will be using it, what dependencies does it have, how will it need to scale in future, and, above all, which business team will own it. I tend to divide modules based on the answers to these questions.
In your case, I see this less as a service and more as a utility function that can be put in a JAR and shared. A new service would be more along the lines of, say, a reporting service that reads legacy Excel files to create reports, or a migration service that uses a utility to read Excel.
Also, there is no final answer; you need to keep questioning your design until you are happy with it.
I'm processing info in Google Cloud Dataflow. We tried to use JPA to insert or update the data in our MySQL database, but these queries shut down our server, so we've decided to change our approach.
I want to generate a MySQL dump or .sql file that the new info processed through Dataflow gets written to. I want to know whether there is a built-in way to do this, or whether I have to do it myself.
Let me explain a little more: we have XML input that we process into Java classes, and we have a JSON dump of the DB so we can see what is online without making so many calls. With this in mind, we compare the new info with the info we already have and decide whether each record is new or just an update.
How can I do this via Java/Maven? I need code to generate this file.
Yes, Cloud Dataflow processes data in parallel on many machines. As such, it is not very surprising that other services may not be able to keep up or that some quotas are hit.
Depending on your specific use case, you may be able to slow or throttle Dataflow down without changing your approach. One might limit the number of workers, limit parallelism, use the IntraBundleParallelization API, etc. This might be a better path overall. We are also working on more explicit ways to throttle Dataflow.
Now, it is not really feasible for any system to automatically generate a .sql file for your database. However, it should be pretty straightforward to use primitives like ParDo and TextIO.Write to generate such a file via a Dataflow pipeline.
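As a hedged sketch of that idea, written against the Apache Beam SDK (the older Dataflow SDK offers equivalent ParDo/TextIO primitives); the bucket paths, record format, and table name are assumptions, not the poster's schema:

```java
// Sketch: turn each processed record into an INSERT statement and write the
// statements out as a plain-text .sql file via a Dataflow/Beam pipeline.
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class SqlFilePipeline {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        p.apply("ReadRecords", TextIO.read().from("gs://my-bucket/records/*.csv"))
         .apply("ToSqlStatement", ParDo.of(new DoFn<String, String>() {
             @ProcessElement
             public void processElement(ProcessContext c) {
                 // Assumes "id,name" lines; emit one statement per record.
                 String[] parts = c.element().split(",", 2);
                 c.output(String.format(
                     "INSERT INTO accounts (id, name) VALUES ('%s', '%s');",
                     parts[0], parts[1]));
             }
         }))
         .apply("WriteSqlFile",
                TextIO.write().to("gs://my-bucket/output/load").withSuffix(".sql"));

        p.run();
    }
}
```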
I have a requirement to create a Java application that will read data from 52 database tables, copy all data older than 3 years to flat files (CSV or TXT), delete this data from the tables, and store the files on an SFTP server. The database in this case is Sybase ASE 15.
I also need to restore this data into temporary tables when certain reports involving the archived data need to be prepared.
If I make this a single-threaded application, it will take many hours to complete, so I need to make it multi-threaded.
Please suggest whether I should use only core Java or a framework like Spring Batch, and how to achieve multi-threading in each case.
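As a small illustration only, assuming plain core Java: one common pattern is to submit each table's archive work to a fixed-size thread pool. The table names and the archiveTable body below are placeholders, not a real implementation:

```java
// Sketch: process each table on its own worker thread with an ExecutorService.
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TableArchiver {
    public static void main(String[] args) throws InterruptedException {
        List<String> tables = List.of("orders", "payments", "audit_log"); // placeholder names
        ExecutorService pool = Executors.newFixedThreadPool(8);           // tune to DB capacity

        for (String table : tables) {
            pool.submit(() -> archiveTable(table));
        }

        pool.shutdown();
        pool.awaitTermination(12, TimeUnit.HOURS);
    }

    private static void archiveTable(String table) {
        // Placeholder: SELECT rows older than 3 years, write them to a CSV,
        // upload the file via SFTP, then DELETE the archived rows.
        System.out.println("Archiving " + table);
    }
}
```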
Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file locally and then put it on HDFS? Since we already know what the keys and values are, we could do it locally without loading the server.
Can someone point me to an example of how an HFile can be created (any language is fine)?
Nothing actually stops anyone from preparing an HFile 'by hand', but by doing so you start to depend on HFile compatibility issues. According to this (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file in the HBase sources. It is a standard MapReduce OutputFormat. What you really need as the basis for this is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Look at its writer logic; you only need that part.
- You also need to build an efficient way to turn a stream of Puts into KeyValues for the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you do all these steps, you will have a solution that can take a sequence of Puts (there is no issue generating them from any data) and produce a local HFile. It should take up to a week to get something working reasonably well.
I didn't go this way, because once I had a good InputFormat and a data-transforming mapper (which I've had for a long time), I could use the standard TotalOrderPartitioner and HFileOutputFormat2 inside the MapReduce framework and have everything work using the full power of the cluster. Surprised by a 10 GB SQL dump being loaded in 5 minutes? Not me. You can't beat that kind of speed with a single server.
OK, this solution required careful design of the SQL queries used to drive the ETL process from the SQL DB, but now it's an everyday procedure.
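For reference, here is a hedged sketch of that in-cluster route (the table name, paths, column family, and input format are assumptions): a driver that lets HFileOutputFormat2.configureIncrementalLoad() wire up the sorting and partitioning so the job emits region-aligned HFiles ready for completebulkload.

```java
// Sketch: MapReduce driver that prepares HFiles for HBase bulk load.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Hypothetical mapper: parses "rowkey,value" lines into Puts for family "cf".
    public static class PutMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "hfile-bulk-prepare");
        job.setJarByClass(BulkLoadDriver.class);

        job.setMapperClass(PutMapper.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        job.setMapOutputValueClass(Put.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/input/dump"));
        FileOutputFormat.setOutputPath(job, new Path("/staging/hfiles"));

        try (Connection conn = ConnectionFactory.createConnection(conf)) {
            TableName table = TableName.valueOf("my_table");
            // Wires up PutSortReducer, the total-order partitioner and HFileOutputFormat2.
            HFileOutputFormat2.configureIncrementalLoad(
                    job, conn.getTable(table), conn.getRegionLocator(table));
        }

        // After the job succeeds, run completebulkload against /staging/hfiles
        // to move the HFiles into the table's regions.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```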
I'm currently working on applying genetic algorithms to a particular application, and the issue is that there is a large amount of data that I need to analyze, graph, and tabulate. Up to this point I have been using CSV files, but they have been somewhat limiting, since I still have to generate charts manually, and that's a problem when it needs to be done for over 100 documents.
Are there any other options for output logging in Java, for analysis, other than CSV files? A link to any relevant API would also be useful.
P.S.: (The question seems common enough to have been asked already, but I couldn't find it.) I'm not asking how to log data in Java or how to redirect it to a file, but whether there are any existing ways to easily tabulate and graph large amounts of output.
The kind of data I'm working with is mostly numerical: the attributes of different generations and of the organisms within those generations. I'm trying to find and interpret trends in the numerical data, which means I need to generate separate graphs for different populations or test runs, and also find representative values for each file and graph those against specific test-run conditions.
There is also a time parameter that measures the speed of the algorithm. Which methods let me log output without the post-processing and disk access affecting my test runs? Is that possible?
You could use Apache POI to write out an Excel spreadsheet directly. You can also have it start with a spreadsheet already containing macros and whatever else you need to display your information.
There are plenty of choices for exporting reports and data. There is an open-source project called JasperReports that can export charts, PDF, XML, CSV, and plain text. It is a somewhat involved process, but it does offer a Java API to accomplish the task.
How about writing the table into a database, like MySQL?
That way you can search your data by much better means than in a text file.
Have a look here: http://www.vogella.de/articles/MySQLJava/article.html
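For example, a minimal JDBC sketch (the database URL, credentials, and table/column names are assumptions) that writes per-generation results to MySQL instead of a CSV file:

```java
// Sketch: log one generation's summary statistics to a MySQL table via JDBC.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class RunLogger {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:mysql://localhost:3306/ga_results"; // assumed database
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "INSERT INTO generation_stats (run_id, generation, best_fitness, avg_fitness) "
                     + "VALUES (?, ?, ?, ?)")) {
            stmt.setInt(1, 42);        // test-run identifier
            stmt.setInt(2, 100);       // generation number
            stmt.setDouble(3, 0.93);   // best fitness in the generation
            stmt.setDouble(4, 0.71);   // average fitness
            stmt.executeUpdate();
        }
    }
}
```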
On Linux, you also have the option of keeping your CSV files and generating plots from them using gnuplot scripts.
It sounds like you should save the data to a database and then use JasperReports to build a report containing the graphs and whatever else you need from the stored data, instead of trying to use Excel. Jasper is fairly easy to use, and your Java application can generate the report for you once the data is in the database.