I need to know whether there is any way to import data from MySQL into HDFS; there are some conditions I need to mention.
I know HBase, Hive and Sqoop can help me, but I don't want any extra layers, just MapReduce and the Hadoop Java API.
I also need to update HDFS as the data is updated in MySQL.
I need to know the best way to import MySQL data into HDFS and keep it updated in real time.
Why don't you want to use Sqoop? It does exactly what you would otherwise have to do yourself (open a JDBC connection, get the data, write it to Hadoop). See this presentation from Hadoop World '09.
You can do real-time imports using CDC (change data capture) and Talend:
http://www.talend.com/talend-big-data-sandbox
Yes, you can access the database via JDBC connectors and HDFS via the Hadoop Java API.
But in MapReduce, things will be out of your control when accessing a database:
Each mapper/reducer tries to establish a separate connection to the database, which eventually impacts database performance.
There is no telling which mapper/reducer executes which portion of the query result set.
If a single mapper/reducer is used to access the database, Hadoop's parallelism is lost.
A fault-tolerance mechanism has to be implemented in case any mapper/reducer fails.
The list goes on...
To overcome all these hurdles, Sqoop was developed to transfer data between an RDBMS and HDFS.
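For reference, here is a minimal sketch of what the hand-rolled approach looks like with Hadoop's built-in JDBC support (DBInputFormat/DBConfiguration from org.apache.hadoop.mapreduce.lib.db). The "users" table, its columns, and the connection details are hypothetical:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class MySqlToHdfs {

  // One row of the (hypothetical) "users" table.
  public static class UserRecord implements Writable, DBWritable {
    long id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
      id = rs.getLong("id");
      name = rs.getString("name");
    }
    public void write(PreparedStatement st) throws SQLException {
      st.setLong(1, id);
      st.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
      id = in.readLong();
      name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
      out.writeLong(id);
      out.writeUTF(name);
    }
  }

  // Each mapper gets a slice of the result set and writes it out as text.
  public static class ExportMapper
      extends Mapper<LongWritable, UserRecord, Text, NullWritable> {
    protected void map(LongWritable key, UserRecord row, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(row.id + "\t" + row.name), NullWritable.get());
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // JDBC connection details -- every map task opens its own connection.
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "user", "password");

    Job job = Job.getInstance(conf, "mysql-to-hdfs");
    job.setJarByClass(MySqlToHdfs.class);
    job.setMapperClass(ExportMapper.class);
    job.setNumReduceTasks(0);                       // map-only copy
    job.setInputFormatClass(DBInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(NullWritable.class);

    // Hadoop splits the result set across mappers with LIMIT/OFFSET queries,
    // ordered by the "id" column.
    DBInputFormat.setInput(job, UserRecord.class,
        "users", /* conditions */ null, /* orderBy */ "id", "id", "name");
    FileOutputFormat.setOutputPath(job, new Path("/data/users"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Split sizing, retries, and incremental loads (the hurdles listed above) are exactly what you would have to add yourself on top of this.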
I know that it is possible to distribute jobs over a Hadoop cluster. I also know that it is possible to read from and write semi-directly to SQL databases from within a Hadoop system.
My question is more along these lines: is it done, in the real world, that data is read from files and from a relational database within Hadoop jobs and then, after processing, written back to the relational database? That is, using Hadoop directly as a process balancer, with something like Hibernate and without the use of HDFS.
Thanks
This is not possible, because you won't have access to the records in the setup and cleanup tasks of the mapper and reducer.
Outside of HDFS, the only way to run the jobs is to read the input from and write the output to the local file system.
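If you go that route, pointing a job at the local file system is just a matter of file:// URIs and the local job runner. A rough sketch, assuming Hadoop 2 (the paths are made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalFsJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");            // local file system instead of HDFS
    conf.set("mapreduce.framework.name", "local");   // local job runner instead of YARN

    Job job = Job.getInstance(conf, "local-fs-job");
    job.setJarByClass(LocalFsJob.class);
    // set mapper/reducer classes here as usual (identity map/reduce if omitted)

    // Both input and output stay on the local disk.
    FileInputFormat.addInputPath(job, new Path("file:///tmp/job-input"));
    FileOutputFormat.setOutputPath(job, new Path("file:///tmp/job-output"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```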
I am writing an application that collects a huge amount of data and stores it in Neo4j. For this I'm using Java code.
In order to quickly analyze the data, I want to use a terminal Neo4j server to connect to the same database and then use the Neo4j console to query it with Cypher.
This seems to be a lot of hassle. I have already changed neo4j-server.properties to point to the directory where my Java code is collecting the data, and I have also set the flag allow_store_upgrade=true in neo4j.properties.
However, I am still facing issues because of locks.
Is there a standard way to achieve this?
You need to have neo4j-shell-<version>.jar on your classpath and set remote_shell_enabled='true' as a config option while initializing your embedded instance.
I wrote a blog post on this some time ago: http://blog.armbruster-it.de/2014/01/using-remote-shell-combined-with-neo4j-embedded/
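In code that amounts to something like the following rough sketch, assuming the Neo4j 2.x embedded API and the neo4j-shell jar on the classpath (the store path is hypothetical):

```java
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;
import org.neo4j.shell.ShellSettings;

public class EmbeddedWithRemoteShell {
  public static void main(String[] args) {
    // Embedded database that also exposes a remote shell, so a separate
    // `neo4j-shell -port 1337` process can run Cypher against the same store
    // without fighting over the store lock.
    GraphDatabaseService db = new GraphDatabaseFactory()
        .newEmbeddedDatabaseBuilder("/path/to/graph.db")   // hypothetical store path
        .setConfig(ShellSettings.remote_shell_enabled, "true")
        .setConfig(ShellSettings.remote_shell_port, "1337")
        .newGraphDatabase();

    // Shut the store down cleanly when the collecting application exits.
    Runtime.getRuntime().addShutdownHook(new Thread(db::shutdown));
  }
}
```

This avoids the lock problem entirely: only the embedded instance owns the store, and the shell talks to it over RMI instead of opening the store files itself.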
I am trying to write to 3 different databases: MySQL, Oracle and MongoDB. The requirement is that all 3 databases should be in a consistent state. For example, if the writes to MySQL and Oracle succeeded but the Mongo write failed (e.g. a network failure), then there should be a way to write the failed record back to Mongo to keep all 3 records consistent. What's the best way to do this? Should I implement a queue to store failed records and have some background process read records from the queue and try to write them again to the failed database?
Your best bet would probably be the Java Transaction API (JTA). I have not personally used it but it seems to be the Java "industry standard" for distributed transactions.
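For the two relational databases, a minimal sketch of what JTA usage looks like is below. It assumes a container or standalone JTA provider (e.g. Atomikos or Bitronix) exposing UserTransaction and XA data sources via JNDI; the JNDI names and the table are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

public class ConsistentWrite {

  public void writeRecord(long id, String payload) throws Exception {
    InitialContext ctx = new InitialContext();
    // JNDI names are hypothetical; both pools must be XA-capable data sources.
    UserTransaction tx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");
    DataSource mysql  = (DataSource) ctx.lookup("jdbc/mysqlXA");
    DataSource oracle = (DataSource) ctx.lookup("jdbc/oracleXA");

    tx.begin();
    try (Connection m = mysql.getConnection();
         Connection o = oracle.getConnection()) {
      insert(m, id, payload);
      insert(o, id, payload);
      tx.commit();          // both relational writes commit or roll back together
    } catch (Exception e) {
      tx.rollback();
      throw e;
    }
    // MongoDB is not an XA resource, so its write stays outside the JTA
    // transaction; the retry queue idea from the question covers that part.
  }

  private void insert(Connection c, long id, String payload) throws Exception {
    try (PreparedStatement ps =
             c.prepareStatement("INSERT INTO records (id, payload) VALUES (?, ?)")) {
      ps.setLong(1, id);
      ps.setString(2, payload);
      ps.executeUpdate();
    }
  }
}
```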
Hi, I am new to Hadoop and NoSQL technologies. I started learning with the word-count program, reading a file stored in HDFS and processing it. Now I want to use Hadoop with MongoDB; I started with the program from here.
Now my confusion is that it stores the MongoDB data on my local file system, reads the data from the local file system into HDFS in map/reduce, and then writes it back to MongoDB on the local file system. When I studied HBase, I saw that it can be configured to store its data on HDFS, and Hadoop can then process it directly on HDFS (map/reduce). How do I configure MongoDB to store its data on HDFS?
I think it is a better approach to store the data in HDFS for fast processing, not in the local file system. Am I right? Please correct my understanding if I am going in the wrong direction.
MongoDB isn't built to work on top of HDFS, and it's not really necessary, since Mongo already has its own approach to scaling horizontally and working with data stored across multiple machines.
A better approach, if you need to work with MongoDB and Hadoop, is to use MongoDB as the source of your data but process everything in Hadoop (which will use HDFS for any temporary storage). Once you're done processing the data, you can write it back to MongoDB, S3, or wherever you want.
I wrote a blog post that goes into a little more detail about how you can work with Mongo and Hadoop here: http://blog.mortardata.com/post/43080668046/mongodb-hadoop-why-how
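As a rough sketch of that pattern, assuming the mongo-hadoop connector is on the classpath (the Mongo URI, collection, and the "category" field are hypothetical, and the connector's class names should be checked against the version you use):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class MongoToHdfs {

  // Reads documents straight from MongoDB and counts them per category.
  public static class DocMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    protected void map(Object id, BSONObject doc, Context ctx)
        throws IOException, InterruptedException {
      // "category" is a hypothetical field in the collection.
      ctx.write(new Text(String.valueOf(doc.get("category"))), ONE);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mongo is the source; the aggregated results land in HDFS.
    MongoConfigUtil.setInputURI(conf, "mongodb://dbhost:27017/mydb.events");

    Job job = Job.getInstance(conf, "mongo-to-hdfs");
    job.setJarByClass(MongoToHdfs.class);
    job.setInputFormatClass(MongoInputFormat.class);
    job.setMapperClass(DocMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileOutputFormat.setOutputPath(job, new Path("/output/category-counts"));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```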
HDFS is a distributed file system, while HBase is a NoSQL database that uses HDFS as its file system, providing fast and efficient integration with Hadoop that has been proven to work at scale. Being able to work with HBase data directly in Hadoop, or push it into HDFS, is one of the big advantages of picking HBase as a NoSQL database solution; I don't believe MongoDB provides such tight integration with Hadoop and HDFS, which would mitigate the performance and efficiency concerns of moving data to/from a database.
Please look at this blog post for a detailed analysis of how well MongoDB integrates with Hadoop; one of its conclusions was that writes to HDFS from MongoDB didn't perform well: http://www.ikanow.com/how-well-does-mongodb-integrate-with-hadoop/
I have developed a small Swing desktop application. This app needs data from another database, so I've created a small Java process that gets the data (using JDBC) from the remote DB and copies it (using JPA) to the local database. The problem is that this process takes a lot of time. Is there another way to do it that would make this task faster?
Please let me know if I am not clear; I'm not a native speaker.
Thanks
Diego
One good option is to use the Replication feature in MySQL. Please refer to the MySQL manual here for more information.
JPA is less suited here, as object-relational mapping is costly and this is a bulk data transfer. You probably also do not need database replication here.
Maybe a backup is a solution: several different approaches are listed there.
In general, one can also run mysqldump (on a single table, for instance) from a cron task, compress the dump, and retrieve it.
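If you stay with a programmatic copy, replacing JPA on the write side with plain JDBC batch inserts usually helps a lot. A minimal sketch, where the connection strings, table, and columns are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class BulkCopy {
  private static final int BATCH_SIZE = 1000;

  public static void main(String[] args) throws Exception {
    try (Connection src = DriverManager.getConnection(
             "jdbc:mysql://remotehost/sourcedb", "user", "password");
         Connection dst = DriverManager.getConnection(
             "jdbc:mysql://localhost/localdb", "user", "password")) {

      dst.setAutoCommit(false);  // commit once per batch, not once per row

      try (Statement read = src.createStatement();
           ResultSet rs = read.executeQuery("SELECT id, name, price FROM products");
           PreparedStatement write = dst.prepareStatement(
               "INSERT INTO products (id, name, price) VALUES (?, ?, ?)")) {

        int pending = 0;
        while (rs.next()) {
          write.setLong(1, rs.getLong("id"));
          write.setString(2, rs.getString("name"));
          write.setBigDecimal(3, rs.getBigDecimal("price"));
          write.addBatch();
          if (++pending == BATCH_SIZE) {   // flush a full batch
            write.executeBatch();
            dst.commit();
            pending = 0;
          }
        }
        write.executeBatch();              // flush the remainder
        dst.commit();
      }
    }
  }
}
```

Committing once per batch instead of once per row, and skipping the object-relational mapping entirely, is where most of the time is saved.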