Cleaning big data with Hive - java

I'm using Hive to query data that I have. The problem is, this data needs to be cleaned, and it's way too big for me to try to process it on my computer (hence using Hadoop and Hive). Is there a way for me to do this with Hive? I looked into user-defined functions, but my understanding is that they operate row by row, so they might not be an optimal way to clean the data.
Thanks

You should clean your data using a MapReduce program. You probably don't even need a reducer, which would improve your performance.
A MapReduce program works like a buffered file reader, reading one line of data at a time. You can perform your cleaning operation on each line and then insert it into a Hive table for querying.
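For illustration, here is a minimal map-only sketch of that idea; the class name and the cleaning rule are hypothetical placeholders for whatever scrubbing your data actually needs:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only cleaning: each call gets one raw input line, cleans it, and
    // writes it back out. With job.setNumReduceTasks(0) the mapper output
    // lands directly in HDFS, where a Hive external table can point at it.
    public class CleaningMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String cleaned = cleanLine(line.toString());
            if (cleaned != null) {                      // drop unsalvageable rows
                context.write(NullWritable.get(), new Text(cleaned));
            }
        }

        // Hypothetical example rule: trim whitespace and skip blank lines.
        private String cleanLine(String raw) {
            String trimmed = raw.trim();
            return trimmed.isEmpty() ? null : trimmed;
        }
    }

Skipping the reducer (job.setNumReduceTasks(0)) avoids the shuffle and sort entirely, which is where the performance gain comes from.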

What is your data size?
What is your cleaning operation?
Only go for MapReduce/Pig if your cleaning operation cannot be done with the help of Hive.
If your problem is the performance of Hive, try to optimize it.
Optimization depends on your cleaning operation; you can use the distributed cache, map-side joins, etc.

Related

How to persist a key-value map for fast lookup?

I want to create a hashmap-like structure for fast lookups between IDs and assigned names.
The number of entries will be a few hundred thousand, so I don't want to keep everything in memory. At the same time, since performance counts in this process, I don't want to make a database query for each ID.
So, what are my chances? How could I get fast lookups on large datasets?
A quick search found these:
Production-ready:
MapDB - I can personally recommend this. It's the successor to JDBM, if you found that one while googling.
Chronicle-Map
Voldemort
Probably not production-ready, but worth looking at:
https://github.com/aloksingh/disk-backed-map
https://github.com/reines/persistenthashmap
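As a taste of what this looks like in practice, here is a minimal sketch using MapDB (API as of MapDB 3.x; the file and map names are arbitrary examples):

    import java.util.concurrent.ConcurrentMap;
    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;

    public class DiskBackedLookup {
        public static void main(String[] args) {
            // Disk-backed store: entries live in the file, not on the heap.
            DB db = DBMaker.fileDB("lookup.db").make();
            ConcurrentMap<Long, String> idToName = db
                    .hashMap("idToName", Serializer.LONG, Serializer.STRING)
                    .createOrOpen();

            idToName.put(42L, "example-name");     // persisted write
            System.out.println(idToName.get(42L)); // fast keyed lookup
            db.close();
        }
    }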
Well, there are a couple of solutions in my mind:
1) Go for Lucene -> store in files
2) Make views in the database -> store in the database
So it's up to you which one you go for!
I had a similar requirement a few years back and was avoiding databases, thinking they would have high lookup times. Similar to you, I had a large set of values and could not use in-memory data structures, so I decided to sequentially parse the file system. It was a bit slow, but I could do nothing about it.
Then I explored DBs more and used a DB for my application, just to test. Initially it was slower compared to the file system, but after indexing the table and optimizing the database, it proved to be at least 10-15 times faster than the file system. I can't remember the exact performance results, but it took just 150-200 ms to read data from a large dataset (around 700 MB of data on the file system), whereas the same read from the file system took 3.5 seconds.
I used a DB2 database, and this guide for performance tuning of DB2.
Besides, once the DB is set up, you can reuse it for multiple applications over the network.
If you are looking for a fast solution, the answer is an in-memory database:
Redis, Memcached, Hazelcast, VoltDB, etc.

Database query vs java processing

Is it better to make two database calls, or one database call with some Java processing?
One database call gets only the relevant data, which is then to be separated into two different lists, requiring a few lines of Java.
Database round-trips are always expensive operations. If you can manage with one DB fetch plus some Java processing, that should be the better and faster choice for you.
But you may have to analyze, for your scenario, which one turns out to be the more efficient choice. I assume a single DB fetch with Java processing should be better.
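As a sketch of that single-fetch option, assuming a hypothetical Row type standing in for whatever the query returns, the split really is only a few lines of Java:

    import java.util.List;
    import java.util.Map;
    import java.util.stream.Collectors;

    public class SplitDemo {
        // Hypothetical row type standing in for the query result.
        record Row(long id, boolean active) {}

        public static void main(String[] args) {
            // Pretend this list came back from the single database call.
            List<Row> all = List.of(new Row(1, true), new Row(2, false), new Row(3, true));

            // Split the one result set into the two lists in memory.
            Map<Boolean, List<Row>> split = all.stream()
                    .collect(Collectors.partitioningBy(Row::active));
            List<Row> activeRows   = split.get(true);
            List<Row> inactiveRows = split.get(false);
            System.out.println(activeRows + " / " + inactiveRows);
        }
    }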
Testing is key. Some questions you may want to ask yourself:
How big is each database call?
How much bigger/smaller would the calls be if I combined them?
Should I push the processing to the client?
Timing:
How time-critical is the processing?
Do you need to swarm the DB, or is it okay to piggyback on the client?
Is the difference negligible?
Java processing is much faster than an SQL fetch. I had the same problem, so I recommend fetching the data once and doing some processing in Java. The time difference between the two options may be minor, but some machines take a long time to fetch data from the DB, so I suggest a single fetch with some Java processing.
Generally, Java processing is better if it's not some simple DB query that you are doing.
I would recommend trying them both, measuring time and load, and seeing what fits your application best.
It all depends on how intensive your processing is and how your database is set up. For instance, an Oracle database running on a native file system will most likely outperform your own Java processing code for complex operations. Note that most built-in operations on well-known databases are highly optimized and usually very performant.

Hbase Client Scanner Hangs

I have been using HBase for months and have loaded an HBase table with more than 6 GB of data. When I try to scan the rows using the Java client, it hangs and reports the following error:
Could not seek StoreFileScanner[HFileScanner for reader reader=hdfs
Furthermore, if I log in to the shell and scan, it works perfectly; even the Java client scanner works fine for HBase tables holding small amounts of data.
Any workaround for this?
For large data you can write MapReduce code; simple Java programs are not very effective when it comes to big data. You can also look into Pig scripts to achieve this; a minimal MapReduce sketch follows the links below.
Check out these for further help:
http://sujee.net/tech/articles/hadoop/hbase-map-reduce-freq-counter/
http://wiki.apache.org/hadoop/Hbase/MapReduce
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/package-summary.html
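For a flavor of what those links describe, here is a minimal sketch of driving the scan from MapReduce instead of a single client scanner (the table name and the empty mapper body are placeholders):

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.mapreduce.TableMapper;
    import org.apache.hadoop.mapreduce.Job;

    public class ScanJob {

        static class RowMapper extends TableMapper<ImmutableBytesWritable, Result> {
            @Override
            protected void map(ImmutableBytesWritable rowKey, Result row, Context context)
                    throws IOException, InterruptedException {
                // process one row here; the framework parallelizes across regions
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            Job job = Job.getInstance(conf, "scan-example");
            job.setJarByClass(ScanJob.class);

            Scan scan = new Scan();
            scan.setCaching(500);        // rows per RPC; tune for your row size
            scan.setCacheBlocks(false);  // full scans shouldn't evict the block cache

            TableMapReduceUtil.initTableMapperJob(
                    "my_table", scan, RowMapper.class,
                    ImmutableBytesWritable.class, Result.class, job);
            job.setNumReduceTasks(0);    // map-only
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }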
Alternatively, you can also give Pig scripts a try for your MapReduce programs:
http://pig.apache.org/docs/r0.9.1/api/org/apache/pig/backend/hadoop/hbase/HBaseTableInputFormat.html
One more option: increase the HBase timeout property and try again. For the different HBase configuration settings you can refer to:
http://hbase.apache.org/docs/r0.20.6/hbase-conf.html
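Before rewriting anything as MapReduce, it may also be worth tuning the plain client scanner; here is a sketch (table name is an example, old-style HTable API assumed) where a smaller caching value keeps each RPC short enough to finish inside the region server's lease window:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;

    public class TunedScan {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "my_table");   // example table name
            Scan scan = new Scan();
            scan.setCaching(100);   // fewer rows per RPC, so each call returns
                                    // before the scanner lease expires
            ResultScanner scanner = table.getScanner(scan);
            try {
                for (Result row : scanner) {
                    // stream rows one batch at a time instead of buffering 6 GB
                }
            } finally {
                scanner.close();
                table.close();
            }
        }
    }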
But when it comes to large data, MapReduce code is always better, and you can also search for optimization guidelines/best practices for HBase.

How to improve my software project's speed?

I'm doing a school software project with my classmates in Java.
We store the info on a remote db.
When we start the application we pull all the information from the database and transform it into objects to use in our application (using plain Java SQL statements).
In the application we edit some of these objects, and then when we exit the application
we save or update the information in the database using Hibernate.
As you can see, we don't use Hibernate for pulling in information; we use it just for saving and updating.
We have two very similar problems:
The loading of objects (when we start the app) and the saving of objects with Hibernate (when closing the app) take too much time.
And our project is not a huge enterprise application; it's quite a small app. We just manage some students, teachers, homework and tests, so our DB is also very, very small.
How could we increase performance?
Later edit: with a local database it runs very quickly; it is only slow with remote databases.
Are you saying you are loading the entire database into memory and then manipulating it? If that is the case, why don't you instead simply use the database as a storage device, and do lookups and manipulation as necessary (using Hibernate if you like, or something else if you don't)? The key there is to make sure that you are using connection pooling, as that will reduce the connection time.
If this is what you are doing, then you could be running into memory issues as well: by not caching the entire database in memory you will reduce memory usage, and you will spread the network load out over the application's lifetime instead of concentrating it at startup and shutdown.
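As one example of the pooling idea (using c3p0; the URL and credentials are placeholders), the pool keeps connections open, so each lookup pays a cheap checkout rather than a full connect-and-authenticate round trip to the remote server:

    import java.sql.Connection;
    import com.mchange.v2.c3p0.ComboPooledDataSource;

    public class PooledAccess {
        public static void main(String[] args) throws Exception {
            ComboPooledDataSource pool = new ComboPooledDataSource();
            pool.setJdbcUrl("jdbc:mysql://db.example.com/school"); // placeholder URL
            pool.setUser("app");
            pool.setPassword("secret");
            pool.setMaxPoolSize(10);

            Connection conn = pool.getConnection(); // fast: reused, not re-established
            try {
                // run only the query this screen actually needs
            } finally {
                conn.close();   // returns the connection to the pool
            }
        }
    }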
These two sentences are red flags for me:
When we start the application we pull all the information from the database and transform it into objects to use in our application (using plain Java SQL statements). In the application we edit some of these objects, and then when we exit the application we save or update the information in the database using Hibernate.
Is there a requirements reason that you are loading all the information from the database into memory at startup, or that you're waiting until shutdown to save changes back to the database?
If not, I'd suggest a design change. If you've already got Hibernate mappings for the tables in the DB, I'd use Hibernate for all of your CRUD (create, read, update, delete) operations. And I'd only load the data that each page in your app needs, as it needs it.
If you can't make that kind of design change at this point, I think you've got to look closely at how you're managing the database connections. Are you using connection pools? Are you opening up multiple connections? Forgetting to release them?
Something else to look at. How are you using Hibernate to save the entities to the db? Are you doing a getHibernateTemplate().get on each one and then doing an entity.save or entity.update on each one? If so, that means you are also causing Hibernate to run a select query for each database object before it does a save or update. So, essentially, you'd be loading each database object twice (once at the beginning of the program, once before saving). To see if that's what's happening, you can turn on the show_sql property or use P6Spy to see exactly what queries Hibernate is running.
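If that is what's happening, here is a sketch of the alternative (the entity type and the passed-in objects are hypothetical): update the detached object directly, so Hibernate issues just the UPDATE with no preceding SELECT:

    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class DirectUpdate {
        // 'Student' stands in for one of the project's mapped entities.
        static void saveChanges(SessionFactory sessionFactory, Student student) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            try {
                session.update(student);   // no get() first, so no extra SELECT
                tx.commit();
            } catch (RuntimeException e) {
                tx.rollback();
                throw e;
            } finally {
                session.close();
            }
        }
    }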
For what you are doing, you may very well be better off serializing your objects and writing them out to a flat file.
But, much more likely, you should just read / update objects directly from your database as needed instead of all at once, for all the reasons aperkins gives.
Also, consider what happens if your application crashes. If all of your updates are held only in memory until the application is closed, everything would be lost if the app closes unexpectedly.
The difference in loading everything from a remote DB server versus loading everything from a local DB server is the network latency / pipe size. The network is a much smaller pipe than anything else. Two questions: first, how much data are we really talking about? Second, what is your network speed? 10/100/1000? Figure between 10 and 20% of your pipe size is going to be overhead due to everything from networking protocols to the actual queries themselves.
As others have stated, the way you've architected is usually high on the list of "don't do". When starting, pull only enough data to initialize the app. As the user works through it, pull what you need for that task.
The ONLY time you pull everything is when they are working in a disconnected state. In that case, you still don't load everything as objects in the application, you just work from a local data store which gets sync'ed with the remote server every so often.
The project is pretty much complete; we can't do large refactoring on it now.
I tried to use a second-level cache for Hibernate when saving: EhCacheProvider.
In hibernate.xml:
net.sf.ehcache.hibernate.EhCacheProvider
I have done a config for the cache in ehcache.xml,
I have put the cache JAR in the project build path,
and I have set the Hibernate cache property for every class in the mapping.
But this cache doesn't seem to have an effect; I don't know whether it is even being used.
Try minimising the number of SQL queries, since every query has its own overhead.
You can enable database compression, which should speed things up when there is a lot of data.
Maybe you are connecting to the database many times?
Check the ping time of remote database server - it might be the problem.
As your application is just slow when running on a remote database server, I'd assume that the performance loss is due to:
Connecting to the server: try to reuse connections (pass the instance around) or use connection pooling
Query round-trip time: use as few queries as possible, see here in case of a hand-written DAL:
Preferred way of retrieving row with multiple relating rows
For Hibernate you may use its batch functionality and adjust hibernate.jdbc.batch_size; a sketch follows at the end of this answer.
In all cases, especially when you can't refactor larger parts of the codebase, use a profiler (method timings or SQL queries) to find the bottleneck. I bet you'll find thousands of queries, each taking 10 ms RTT, which could be merged into one.
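Here is that batching pattern as a sketch, assuming hibernate.jdbc.batch_size is set to 50 in the configuration and 'Student' is one of the project's mapped entities:

    import java.util.List;
    import org.hibernate.Session;
    import org.hibernate.SessionFactory;
    import org.hibernate.Transaction;

    public class BatchedSave {
        static void saveAll(SessionFactory sessionFactory, List<Student> entities) {
            Session session = sessionFactory.openSession();
            Transaction tx = session.beginTransaction();
            for (int i = 0; i < entities.size(); i++) {
                session.saveOrUpdate(entities.get(i));
                if (i % 50 == 0) {      // match hibernate.jdbc.batch_size
                    session.flush();    // push the current JDBC batch
                    session.clear();    // keep the session from growing
                }
            }
            tx.commit();
            session.close();
        }
    }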
Some other things you can look into:
You can allocate more memory to the JVM
Use the jconsole tool to investigate what the bottlenecks are.
Why don't you have two separate threads?
Thread 1 will load your objects one by one.
Thread 2 will process objects as they are loaded.
Your app will seem more interactive at startup.
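A minimal sketch of that pipeline, with strings standing in for the rows fetched from the database:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelinedStartup {
        private static final String EOF = "__EOF__";   // marks end of stream

        public static void main(String[] args) throws InterruptedException {
            BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);

            // Thread 1: pretend each element is a row fetched from the remote DB.
            Thread loader = new Thread(() -> {
                try {
                    for (int i = 0; i < 10; i++) queue.put("row-" + i);
                    queue.put(EOF);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            // Thread 2: build objects while rows are still arriving.
            Thread processor = new Thread(() -> {
                try {
                    String row;
                    while (!(row = queue.take()).equals(EOF)) {
                        System.out.println("built object from " + row);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            loader.start();
            processor.start();
            loader.join();
            processor.join();
        }
    }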
It never hurts to review the basics:
Improving speed means reducing time (obviously), and to do that, you find activities that take significant time but can be eliminated or replaced with something that uses less time. What I mean by activity is almost always a function call, method call, or property call, performed on a specific line of code for a specific purpose. It may invoke I/O, or it may invoke computation, or both. If its purpose is not essential, then it can be optimized.
Many people use profilers to try to find these time-wasting lines of code, but most profilers miss the target because they look at functions, not lines, they go to sleep during I/O, and they worry about "self time".
Many more people try to guess what could be the problem, or they ask others to guess, such as by asking on SO. Such guesses, in the nature of guesses, are sometimes right - more often not, but people still invest time and resources in them.
There's a very simple way to find out for sure, without guessing, what could fruitfully be optimized, and here is one way to do it in Java.
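A rough self-contained sketch of that sampling idea (the interval and the busy-work method are arbitrary examples): pause periodically, capture the busy thread's stack, and whatever call site shows up in most samples is where the time goes:

    public class PoorMansSampler {
        public static void main(String[] args) throws InterruptedException {
            Thread worker = new Thread(PoorMansSampler::busyWork, "worker");
            worker.start();

            for (int sample = 0; sample < 5; sample++) {
                Thread.sleep(200);   // arbitrary sampling interval
                System.out.println("--- sample " + sample + " ---");
                for (StackTraceElement frame : worker.getStackTrace()) {
                    System.out.println("  at " + frame);
                }
            }
            worker.interrupt();
        }

        // Stand-in for the real workload being investigated.
        private static void busyWork() {
            double x = 0;
            while (!Thread.currentThread().isInterrupted()) {
                x += Math.sqrt(x + 1);
            }
        }
    }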
Thanks for your answers. They were more than helpful.
We completely solved this problem like so:
Refactored the LOAD code. Now it uses Hibernate with lazy fetching.
Refactored the SAVE code. Now it saves just the data that was modified, right after it is modified. This way we don't have a HUGE save at the end.
I'm amazed at how well it all went. The amount of new code we had to write was very, very small.

Too Little CPU Utilization in Java

Hey stackoverflow community!
I'm having an issue where a highly involved algorithmic program shows TOO LITTLE CPU utilization: somewhere between 3 and 4%. It is taking very long to return results, and I believe it's just not working hard enough.
Do any of you geniuses have any ideas why this would occur? If anything, I would expect 100% utilization. One additional detail: the program makes inserts into an SQLite3 database, and thus, yes, I believe there are a lot of JNI calls via the sqlite3jdbc library. (Note that I wanted to defer these inserts with a PreparedStatement batch earlier, but this caused major memory problems; there's a lot of data.)
Thanks in advance
UPDATE: Fixed. Yeah, I was just being a doofus, but I didn't expect that SQLite would start a new transaction per insert and incur so much overhead.
I now use a PreparedStatement and queue 32768 entries before each insert; that seemed like a good enough number to me.
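For anyone hitting the same thing, here is a sketch of that fix (table, columns, and the SQLite JDBC URL are placeholders): one explicit transaction per large batch, so SQLite doesn't open and fsync a fresh transaction for every row:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchedInserts {
        public static void main(String[] args) throws Exception {
            Connection conn = DriverManager.getConnection("jdbc:sqlite:results.db");
            conn.setAutoCommit(false);   // one transaction per batch, not per row
            PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO results(id, value) VALUES (?, ?)");

            int pending = 0;
            for (long i = 0; i < 1_000_000; i++) {   // stand-in for algorithm output
                ps.setLong(1, i);
                ps.setDouble(2, Math.sqrt(i));
                ps.addBatch();
                if (++pending == 32768) {            // batch size from the update above
                    ps.executeBatch();
                    conn.commit();
                    pending = 0;
                }
            }
            ps.executeBatch();   // flush the remainder
            conn.commit();
            ps.close();
            conn.close();
        }
    }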
I would never recommend using a JDBC driver that relies on JNI if a type 4, 100% Java version is available. Google found this one.
With that aside, I can't tell anything without more info. Are the app and the database running on the same hardware?
What is so "intensive" about INSERTs?
I'd recommend profiling and getting some real data rather than guessing. Faith-based computing never works for me.
Obviously the database calls are causing delays. Isn't it an option to create smaller batches and test whether that helps? Maybe you could also parallelize the algorithm: have a queue somewhere collecting results and another thread cleaning out that queue?
edit:
There are also some other problem areas:
Database optimization (model)
Database server configuration
Disk speed
All these factors should be taken into account
If you're writing a lot of data, then it sounds like you may be disk-bound. Take a look at the disk I/O stats on the machine, and if that's actually the bottleneck, either find hardware with better I/O or figure out how to do fewer writes.
The disk is slowing down your app. INSERTs use the disk, the disk is slow, and the OS needs to wait for the write operations to finish.
Can't you use two threads, one for the algorithm and another for the inserts?
If you only do inserts, you could also write them to a text file and execute them at a later time.
