Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file locally and put it on HDFS? Since we already know the keys and values, we could build it locally without loading the server.
Can someone point to an example of how an HFile can be created (any language will be fine)?
regards
Nothing actually stops anyone from preparing an HFile 'by hand', but doing so exposes you to HFile compatibility issues. According to the documentation (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file in the HBase sources. It is a standard MapReduce OutputFormat. All it really needs as input is a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Look at its writer logic; that is the only part you need.
- You also need an efficient way to turn a stream of Put objects into a KeyValue stream for the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you complete all these steps, you have a solution that takes a sequence of Puts (which are easy to generate from any data) and produces a local HFile. I would expect it to take up to a week to get something working reasonably well.
I didn't go this way because, with a good InputFormat and a data-transforming mapper (which I have had for a long time), I can use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything working with the full power of the cluster. Surprised by a 10G SQL dump loaded in 5 minutes? Not me. You can't beat that speed with a single server.
OK, this solution required carefully designing the SQL queries against the source database for the ETL process. But now it's an everyday procedure.
I need to handle a big CSV file with around 750,000+ rows of data. Each line has 1000+ characters and ~50 columns, and I am really not sure what the best (or at least a good and sufficient) way to handle and manipulate this kind of data is.
I need to do the following steps:
Compare the values of two columns and write the result to a new column (this one seems easy)
Compare the values of two lines and do stuff (e.g. delete a line if a value is duplicated).
Compare values of two different files.
My problem is that this is currently done with PHP and/or Excel, and the limits are nearly exceeded. It also takes a long time to process and will no longer be possible when the files get even bigger.
I have 3 different possibilities in mind:
Use MySQL, create a table (or two) and do the comparing, adding or deleting part. (I am not really familiar with SQL and would have to learn it. It should also run automatically, so there is the problem that you can't create tables from CSV files.)
Use Java, creating objects in an ArrayList or LinkedList, and do "the stuff" (the operations would be easy, but handling that much data will probably be the problem).
(Is it even possible to save that many files in Java or does it crash / is there a good tool etc.?)
Use Clojure along with MongoDB to add files from CSV to MongoDB and read files using Mongo.
(Name additional possibilities if you have another idea ..)
All in all I am not a pro in any of these, but I would like to solve this problem and get some hints or even your opinion.
Thanks in advance
Since we work a lot with huge CSV files in our company, here are some ideas:
Because in our case these files are always exported from some other relational database, we always use PostgreSQL, MySQL or golang + SQLite so that we can use plain SQL queries, which in these cases are the simplest and most reliable solution.
The number of rows you describe is quite low from the point of view of all these databases, so do not worry.
All of them have a native built-in solution for CSV import/export, which works much quicker than anything written by hand.
For repeated standard checks I use golang + SQLite with a :memory: database. This is definitely the quickest solution.
MySQL is definitely very good and quick for the checks you described, but the choice of database also depends on how sophisticated the analysis you need to do later will be. For example, MySQL up to 5.7 still does not have window functions, which you might need later, so consider using PostgreSQL in some cases too...
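To give an idea of the in-memory SQLite approach, here is a minimal sketch in Python (not Go) using only the standard library. The sample data, the table name `csv_rows` and the column names `id`, `a`, `b` and `same` are made up for illustration; a real file would be opened with `open(..., newline='')` instead of the in-memory string.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the real CSV file.
SAMPLE_CSV = "id,a,b\n1,x,x\n2,x,y\n2,x,y\n3,z,z\n"


def load_csv(conn, csv_file, table):
    """Create a table from the CSV header and bulk-insert all data rows."""
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


conn = sqlite3.connect(":memory:")
load_csv(conn, io.StringIO(SAMPLE_CSV), "csv_rows")

# Step 1: compare two columns and store the result in a new column.
conn.execute("ALTER TABLE csv_rows ADD COLUMN same INTEGER")
conn.execute("UPDATE csv_rows SET same = (a = b)")

# Step 2: delete duplicated lines, keeping the first occurrence of each id.
conn.execute(
    "DELETE FROM csv_rows WHERE rowid NOT IN "
    "(SELECT MIN(rowid) FROM csv_rows GROUP BY id)"
)

result = conn.execute("SELECT id, a, b, same FROM csv_rows ORDER BY id").fetchall()
print(result)
```

Comparing two files would work the same way: load the second file into another table and join the two on the key column.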
I normally use PostgreSQL for this kind of task. PostgreSQL COPY allows importing CSV data easily. Then you get a table with your CSV data and the power of SQL (and a reasonable database) to do basically anything you want with the data.
I am pretty sure MySQL has similar capabilities for importing CSV; I just generally prefer PostgreSQL.
I would not use Java for CSV processing. It will be too much code, and unless you take care of indices, the processing will not be performant. An SQL database is much better equipped for tabular data processing (which should not be a surprise).
I wouldn't use MongoDB; my impression is that it is less powerful in update operations than an SQL database. But this is just an opinion, take it with a grain of salt.
You should try Python with the pandas package. On a machine with enough memory (say 16GB) it should be able to handle your CSV files with ease. The main thing is - anyone with some experience with pandas will be able to develop a quick script for you and tell you in a few minutes if your job is doable or not. To get you started:
import pandas
df = pandas.read_csv('filename.csv')
You might need to specify the column type if you get into memory issues.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
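A slightly longer sketch of how the three steps from the question might look in pandas. The two inline "files" and the column names `id`, `price_old` and `price_new` are invented for the example; with real data you would pass file names to `read_csv` instead.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the two real files.
FILE_A = "id,price_old,price_new\n1,10,10\n2,20,25\n2,20,25\n3,30,30\n"
FILE_B = "id,price_new\n1,10\n3,31\n4,40\n"

# Explicit dtypes skip type inference and can reduce memory use.
dtypes = {"id": "int64", "price_old": "int64", "price_new": "int64"}
df = pd.read_csv(io.StringIO(FILE_A), dtype=dtypes)

# Compare two columns and write the result to a new column.
df["changed"] = df["price_old"] != df["price_new"]

# Drop lines whose id is duplicated, keeping the first occurrence.
df = df.drop_duplicates(subset="id", keep="first")

# Compare against a second file: an outer merge with indicator=True marks
# each id as present in the left file only, the right file only, or both.
other = pd.read_csv(io.StringIO(FILE_B), dtype={"id": "int64", "price_new": "int64"})
merged = df.merge(other, on="id", how="outer", suffixes=("_a", "_b"), indicator=True)
only_in_b = merged.loc[merged["_merge"] == "right_only", "id"].tolist()
```

If the file does not fit into memory even with explicit dtypes, `read_csv` also accepts a `chunksize` argument so the file can be processed piece by piece.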
I'd suggest using Spark. Even on a standalone machine the performance is incredible. You can use Scala and Python to handle your data. It's flexible, and you can do processing that is hard to do in plain Java or a relational database.
The other choices are great too, but I'd consider Spark for all analytics needs from now on.
In our application, one of our microservices queries the DB, gets the result (100k rows) and generates an Excel file using Apache POI. A couple of other services do the same thing (get DB rows and generate Excel). Since the Excel generation is common, is it the right design to split it out into a separate microservice and use it from all the other services?
The challenge is passing the data (100k rows) between microservices over HTTP.
How can we achieve this?
I personally never make the export feature a separate service.
For table-based data like this, I provide a table view of the data with paging, and also offer an export function that returns the data as an octet stream without the paging limit. Export can be seen as just another type of view.
I've previously used the Apache POI library for report rendering, but only for small pages with complex shapes. POI also provides streaming versions of the workbook classes, such as SXSSFWorkbook.
To be a microservice, it should have a proper reason to be an external system. If the system only exports something, then no: it's too simple and overkill. If you're considering adding versioning, permissions, distribution, folder zipping, or... storage management, well... then it could be an option.
By the way, when exporting this much data into a file, keep in mind that Excel has a maximum of 1,048,576 rows per worksheet, so you may hit the limit if your data grows.
Why not just use CSV? Easy to use, easy to jump around in, easy to process.
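To illustrate the streamed CSV export idea, here is a rough sketch in Python (the thread is about Java, so take it as pseudocode for the pattern rather than something to drop in). `stream_csv` and `fake_query` are hypothetical names; `fake_query` stands in for a database cursor that yields rows one at a time.

```python
import csv
import io


def stream_csv(rows, header, chunk_rows=1000):
    """Yield CSV text in chunks so the full export is never held in memory."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header)
    for count, row in enumerate(rows, start=1):
        writer.writerow(row)
        if count % chunk_rows == 0:
            yield buffer.getvalue()
            # Reset the buffer so memory use stays bounded by chunk_rows.
            buffer.seek(0)
            buffer.truncate(0)
    # Flush the header (for empty results) and any remaining rows.
    if buffer.tell():
        yield buffer.getvalue()


def fake_query(n):
    """Hypothetical stand-in for a database cursor yielding n rows."""
    for i in range(n):
        yield (i, f"name-{i}")


chunks = list(stream_csv(fake_query(2500), ["id", "name"], chunk_rows=1000))
```

In a web framework each yielded chunk would be written straight to the HTTP response, so 100k rows never have to sit in memory or be passed between services as one payload.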
You need to ask yourself what defines a service. Does reading chunks of data from a file count as a service?
When I think about separating my services, I think along multiple lines: what this module needs to do, who will be using it, what dependencies it has, how I need to scale it in the future and, above all, which business team will be taking care of it. I tend to divide the modules based on the answers I get to these questions.
In your case I see this less as a service and more as a utility function that can be put in a jar and shared. A new service would be more along the lines of, say, a reporting service that reads legacy Excel files to create reports, or a migration service that uses a utility to read Excel.
Also, there is no final answer; you need to keep questioning your design until you are happy with it.
I'm processing data in Google Cloud Dataflow. We tried to use JPA to insert or update the data in our MySQL database, but these queries shut down our server. So we've decided to change our approach...
I want to generate a MySQL dump or .sql file to which we can write the new data processed through Dataflow. I want to know if there is an existing way to do so, or do I have to do this myself?
Let me explain a little more. We have XML input, which we process into Java classes. We also have a JSON dump of the DB, so we can see what we have online without making so many calls. With this in mind, we compare the new data with the data we already have and decide whether each item is new or just an update.
How can I do this via Java/Maven? I need code to generate this file...
Yes, Cloud Dataflow processes data in parallel on many machines. As such, it is not very surprising that other services may not be able to keep up or that some quotas are hit.
Depending on your specific use case, you may be able to slow/throttle Dataflow down without changing your approach. One might limit the number of workers, limit parallelism, use IntraBundleParallelization API, etc. This might be a better path, overall. We are also working on more explicit ways to throttle Dataflow.
Now, it is not really feasible for any system to automatically generate a .sql file for your database. However, it should be pretty straightforward to use primitives like ParDo and TextIO.Write to generate such a file via a Dataflow pipeline.
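As a rough illustration of the per-record step such a ParDo could perform, here is a sketch in plain Python with no Dataflow dependency. Everything named here is an assumption for the example: the helpers `sql_literal` and `upsert_statement`, the table `items` and its columns. The escaping is deliberately simplistic and is not a complete MySQL escaper, so treat it as a starting point only.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal.

    Simplified escaping for illustration only (an assumption, not a
    complete MySQL escaper).
    """
    if value is None:
        return "NULL"
    if isinstance(value, bool):
        return "1" if value else "0"
    if isinstance(value, (int, float)):
        return repr(value)
    text = str(value).replace("\\", "\\\\").replace("'", "''")
    return f"'{text}'"


def upsert_statement(table, key_column, record):
    """Build one INSERT ... ON DUPLICATE KEY UPDATE line for a record dict.

    Assumes the record has at least one column besides the key.
    """
    columns = ", ".join(f"`{name}`" for name in record)
    values = ", ".join(sql_literal(v) for v in record.values())
    updates = ", ".join(
        f"`{name}` = {sql_literal(v)}"
        for name, v in record.items()
        if name != key_column
    )
    return (
        f"INSERT INTO `{table}` ({columns}) VALUES ({values}) "
        f"ON DUPLICATE KEY UPDATE {updates};"
    )


# Hypothetical processed records.
records = [
    {"id": 1, "name": "O'Brien", "stock": 3},
    {"id": 2, "name": None, "stock": 0},
]
lines = [upsert_statement("items", "id", r) for r in records]
```

Each element of `lines` is one statement; in a pipeline, the text sink would write them out one per line to form the .sql file. Using an upsert also means the pipeline does not have to decide up front whether a record is new or an update.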
We have a Java-based system with Postgres as the database. For some reasons we want to propagate certain changes periodically (say every hour) to a different location. The two broad approaches are:
Logging all the changes to a file as and when they happen. However, this approach will scatter the code everywhere.
Somehow finding the incremental changes in Postgres between two timestamps in some log files and sending those. However, I am not sure how feasible this approach is.
Does anyone have any thoughts/ideas on this?
Provided that the database is not very large, you could do it quick and dirty by just:
Dumping the entire PostgreSQL database to a text file.
Sorting the text file (if the dump file is not already sorted *1).
Creating a diff file against the previous dump file.
Of course, I would only advise this for a situation where your database is going to stay relatively small and you are just going to use it for a couple of servers.
*1: I do not know if it is somehow sorted, check the docs.
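The sort-and-diff step could look roughly like this in Python, using only the standard library. The two lists are made-up stand-ins for the lines of two consecutive dump files, and `changed_lines` is a hypothetical helper name.

```python
import difflib
from itertools import islice

# Hypothetical contents of two consecutive text dumps (one row per line).
previous_dump = ["3,carol", "1,alice", "2,bob"]
current_dump = ["1,alice", "2,robert", "4,dave", "3,carol"]


def changed_lines(old_lines, new_lines):
    """Return (removed, added) lines between two dumps after sorting both."""
    diff = difflib.unified_diff(
        sorted(old_lines), sorted(new_lines), lineterm="", n=0
    )
    removed, added = [], []
    # The first two diff lines are the '---' / '+++' file headers; skip them
    # by position so data lines starting with '--' are not mistaken for them.
    for line in islice(diff, 2, None):
        if line.startswith("@@"):
            continue  # hunk marker
        if line.startswith("-"):
            removed.append(line[1:])
        elif line.startswith("+"):
            added.append(line[1:])
    return removed, added


removed, added = changed_lines(previous_dump, current_dump)
```

An updated row shows up as one removed line plus one added line, so the receiving side has to match them up by key if it needs to tell updates from inserts and deletes.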
There are a few different options available:
Depending on the amount of data being written you could give Bucardo a try.
Otherwise it is also possible to do something with PgQ in combination with Londiste
Or create something yourself by using triggers so you can generate some kind of audit table
There are many pre-packaged approaches, so you probably don't need to develop your own. Many of the options are summarized and compared on this Wiki page:
http://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
Many of them are based on the use of triggers to capture the data, with automatic generation of the triggers based on a more user-friendly interface.
Instead of writing your own solution, I would advise leveraging work already done by others. In the case you described I would go for PgQ + Londiste (both part of the Skytools package), which are easy to set up and use. If you do not want streaming replication, you could still use PgQ / Londiste to easily capture DML statements and write them to a file that you can load when needed. This would allow you to expand your setup / processing when new requirements come.
I have a Java utility for database imports. I'd like to be able to use sqlldr for performance on oracle. I could create the control and data files, but that doesn't seem like The Right Thing™ to do. I should be able to stream the data by providing INFILE "-" in the control file (q1 - how? from command line, I can pipe "echo <data...>" to the sqlldr, but there must be a way to just stream the string into the input stream for the process? never used Java for this before). I can't see how to stream the control file itself (q2 - or am I missing something obvious?). I could use named pipes, but I have no idea how to instantiate and use them from Java in windows (q3 - would that work and how?).
<moan>why must oracle be so complicated? it was trivial in mysql...<moan>
"why must oracle be so complicated? it was trivial in mysql"
What you must remember is, Oracle is a venerable product. SQL Loader as a utility must be twenty years old, maybe more. So naturally it is harder to work with than some newer tools.
And that is why you should stop trying to fit SQL Loader into your new-fangled Java app :-) Look at external tables instead. Because these are database objects we can use SQL SELECTs against them, so it's a whole lot easier to automate load processes with them. I wrote a bit more about external tables in my answer to another question. Check it out.
Fundamentally SQLLDR is about getting data from one or more files into a database table. It is powerful in that role, especially when dealing with multiple files or parallel loads from a single file (it can have multiple threads/processes reading from the same file at the same time).
Not all of these fit well with reading from something that isn't a real file. If your data stream is coming from a web service, then I'd pull it using UTL_HTTP. If it is coming from FTP, then I'd FTP straight into the database as a CLOB/BLOB and process it from there.
Depending on your version, also look at the preprocessor capabilities of external tables