Comparing two Excel files using Hadoop MapReduce (Java)

I am new to Hadoop and MapReduce. I have a requirement to compare two Excel files using MapReduce, and I have to use MapReduce because these files will be very big (>10 GB). My question is: how do I read the two input files through different mappers and compare them in the reducer? I also have to package the application as a jar and run it on Amazon EMR. I was not able to find a suitable tutorial for this on the web. Kindly give me some ideas on how to do this.

I think that the distributed cache would be useful in your situation. I haven't used the distributed cache with large files, but please explore it and let me know if it works for you.

Based on your answers to my comment, here is how I think it should be done if you implement it with plain MapReduce:
- Create a custom InputFormat that reads the Excel spreadsheet.
- As part of this you need a RecordReader that reads the Excel spreadsheet and outputs the cell location (A1, for example) as the key and its content as the value (a rough sketch follows below).
- Once both files are read, do an inner join of the two datasets on the key (which is the cell location).
- After the join, you can compare the contents of each cell.
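Since the answer only names the pieces, here is a minimal sketch of what such a RecordReader could look like. It assumes Apache POI (3.12 or later, for CellAddress) is on the job's classpath, that the matching InputFormat marks Excel files as non-splittable so each split is one whole workbook, and that a single workbook fits in a mapper's memory. The class name and the "first sheet only" choice are my own assumptions, not part of the original answer.

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

// Emits (cell location, cell content) pairs, e.g. ("A1", "42"), for one workbook.
public class ExcelRecordReader extends RecordReader<Text, Text> {

    private final List<Text[]> cells = new ArrayList<>();   // [key, value] pairs
    private int pos = -1;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        Path path = ((FileSplit) split).getPath();
        FileSystem fs = path.getFileSystem(context.getConfiguration());
        try (InputStream in = fs.open(path)) {
            Workbook workbook = WorkbookFactory.create(in);
            Sheet sheet = workbook.getSheetAt(0);            // first sheet only
            for (Row row : sheet) {
                for (Cell cell : row) {
                    cells.add(new Text[] {
                        new Text(cell.getAddress().formatAsString()),  // "A1"
                        new Text(cell.toString())                      // cell content as text
                    });
                }
            }
        } catch (Exception e) {
            throw new IOException("Failed to read Excel file " + path, e);
        }
    }

    @Override
    public boolean nextKeyValue() { return ++pos < cells.size(); }

    @Override
    public Text getCurrentKey() { return cells.get(pos)[0]; }

    @Override
    public Text getCurrentValue() { return cells.get(pos)[1]; }

    @Override
    public float getProgress() {
        return cells.isEmpty() ? 1.0f : (float) (pos + 1) / cells.size();
    }

    @Override
    public void close() { }
}

The mapper on top of this would typically tag each value with the name of the file it came from (available from the FileSplit), so that the reducer receives, for every cell location, one value per input file and can compare them.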
It may also be helpful to take a look at Apache Pig or Cascading, which are APIs that abstract over plain MapReduce.
Regards,
Amit

Related

Update a huge text file using Apache Spark

I have around 300 GB of full data, and every day I get an update of around 10 GB. Both files are in text format. I would like to update the full data based on the updates. How do I handle this with Apache Spark in a distributed manner?
I have tried creating a JavaRDD with a map function that overrides the call method, and converting that to a Dataset<Row> for each of the two files. Now I am planning to run Spark SQL join queries over the datasets. Is this the right approach? Can anyone guide me here, as this is my first step with Apache Spark?
How do I achieve the parallel processing here?
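For what it's worth, here is a small sketch of the join-based merge described above, written with the Spark Java API. It assumes both files are tab-delimited text with a header row and a unique key column; the column name "id", the delimiter, and the paths are placeholders, and unionByName requires both datasets to share the same schema.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DailyMerge {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-merge")
                .getOrCreate();

        Dataset<Row> full = spark.read()
                .option("header", "true")
                .option("delimiter", "\t")
                .csv("hdfs:///data/fulldata");            // ~300 GB base data

        Dataset<Row> updates = spark.read()
                .option("header", "true")
                .option("delimiter", "\t")
                .csv("hdfs:///data/daily_update");        // ~10 GB delta

        // Keep every full-data row whose key is NOT in the update (left anti
        // join), then append the updated rows. Spark parallelises the join
        // and the write across the cluster automatically.
        Dataset<Row> merged = full
                .join(updates, full.col("id").equalTo(updates.col("id")), "left_anti")
                .unionByName(updates);

        merged.write().option("header", "true").csv("hdfs:///data/fulldata_new");
        spark.stop();
    }
}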

Which is the best way to handle big CSV files (Java, MySQL, MongoDB)

I need to handle a big CSV file with around 750,000 rows of data. Each line has around 1,000+ characters and ~50 columns, and I am really not sure what the best (or at least a good and sufficient) way is to handle and manipulate this kind of data.
I need to do the following steps:
Compare the values of two columns and write the result to a new column (this one seems easy).
Compare the values of two lines and do stuff (e.g. delete a line if one value is duplicated).
Compare values of two different files.
My problem is that this is currently done with PHP and/or Excel; the limits are nearly exceeded, it takes a long time to process, and it will no longer be possible once the files get even bigger.
I have 3 different possibilities in mind:
Use MySQL: create a table (or two) and do the comparing, adding, and deleting there. (I am not really familiar with SQL and would have to learn it; it should also run automatically, and I am not sure you can create tables directly from CSV files.)
Use Java, creating objects in an ArrayList or LinkedList, and do "the stuff" (the operations would be easy, but handling that much data will probably be the problem).
(Is it even possible to hold that much data in Java without it crashing, or is there a good tool for it?)
Use Clojure along with MongoDB to load the CSV files into MongoDB and read them with Mongo.
(Name additional possibilities if you have another idea ..)
All in all, I am not a pro in any of these, but I would like to solve this problem and get some hints, or even your opinion.
Thanks in advance
Since we work a lot with huge CSV files in our company, here are some ideas:
- Because these files are, in our case, always exported from some other relational database, we always use PostgreSQL, MySQL, or golang + SQLite so that we can use plain SQL queries, which in these cases are the simplest and most reliable solution.
- The number of rows you describe is quite low from the point of view of any of these databases, so do not worry.
- All of them have a native solution for importing/exporting CSV, which works much faster than anything created manually.
- For repeated standard checks I use golang + SQLite with a :memory: database; this is definitely the quickest solution.
- MySQL is very good and quick for the checks you described, but the choice of database also depends on how sophisticated an analysis you will need later. For example, MySQL up to 5.7 still has no window functions, which you might need, so consider using PostgreSQL in some cases too.
I normally use PostgreSQL for this kind of task. PostgreSQL's COPY allows importing CSV data easily. You then get a table with your CSV data and the power of SQL (and a reasonable database) to do basically anything you want with the data.
I am pretty sure MySQL has similar capabilities for importing CSV; I just generally prefer PostgreSQL.
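As an illustration of the COPY route from Java, here is a small sketch that uses the PostgreSQL JDBC driver's client-side copy support (CopyManager). The connection URL, table name, and file path are placeholders, and the target table is assumed to already exist with columns matching the CSV.

import java.io.FileReader;
import java.io.Reader;
import java.sql.Connection;
import java.sql.DriverManager;

import org.postgresql.PGConnection;
import org.postgresql.copy.CopyManager;

public class CsvToPostgres {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/mydb", "user", "password");
             Reader csv = new FileReader("/data/big_file.csv")) {

            // Client-side COPY: streams the local file to the server.
            CopyManager copy = conn.unwrap(PGConnection.class).getCopyAPI();
            long rows = copy.copyIn(
                "COPY my_table FROM STDIN WITH (FORMAT csv, HEADER true)", csv);
            System.out.println("Imported " + rows + " rows");
        }
    }
}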
I would not use Java for the CSV processing. It would be too much code, and unless you take care of indices the processing will not be performant. An SQL database is much better equipped for tabular data processing (which should not be a surprise).
I wouldn't use MongoDB; my impression is that it is less powerful for update operations than an SQL database. But this is just an opinion, take it with a grain of salt.
You should try Python with the pandas package. On a machine with enough memory (say 16GB) it should be able to handle your CSV files with ease. The main thing is - anyone with some experience with pandas will be able to develop a quick script for you and tell you in a few minutes if your job is doable or not. To get you started:
import pandas
df = pandas.read_csv('filename.csv')
You might need to specify the column type if you get into memory issues.
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
I'd suggest using Spark. Even on a standalone machine the performance is impressive. You can use Scala or Python to handle your data. It's flexible, and you can do processing that would be hard to express in plain Java or in a relational database.
The other choices are also good, but I'd consider Spark for all analytics needs from now on.

HBase: Bulk load (is my understanding correct?)

Bulk load usually uses MapReduce to create a file on HDFS, and this file is then associated with a region.
If that's the case, can my client create this file (locally) and put it on HDFS? Since we already know what the keys and values are, we could do it locally without loading the server.
Can someone point me to an example of how an HFile can be created (in any language)?
regards
Nothing actually stops anyone from preparing HFiles 'by hand', but by doing so you start to depend on HFile compatibility issues. According to this (https://hbase.apache.org/book/arch.bulk.load.html), you just need to put your files on HDFS ('closer' to HBase) and call completebulkload.
Proposed strategy:
- Check the HFileOutputFormat2.java file from the HBase sources. It is a standard MapReduce OutputFormat. What you really need as a base is just a sequence of KeyValue elements (or Cell, if we speak in terms of interfaces).
- You need to free HFileOutputFormat2 from MapReduce. Check its writer logic for this; you need only that part.
- You also need to build an efficient solution for turning a stream of Puts into KeyValues for the HFile. The first places to look are TotalOrderPartitioner and PutSortReducer.
If you do all these steps, you have a solution that can take a sequence of Puts (it is no problem to generate them from any data) and produce a local HFile as a result. It looks like this should take up to a week to get something reasonably working.
I didn't go this way, because just by having a good InputFormat and a data-transforming mapper (which I wrote long ago), I can now use the standard TotalOrderPartitioner and HFileOutputFormat2 INSIDE the MapReduce framework and have everything working using the full cluster power. Confused by a 10 GB SQL dump loaded in 5 minutes? Not me. You can't beat such speed using a single server.
OK, this solution required careful design of the SQL queries used for the ETL process from the SQL database, but now it's an everyday procedure.
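To make the "inside MapReduce" route more concrete, here is a condensed sketch of such a driver against the HBase 1.x/2.x client API. The table name, paths, column family, and the trivial mapper are placeholders; the real transformation logic (the SQL-dump parsing mentioned above) is not shown.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadDriver {

    // Placeholder mapper: expects "rowkey<TAB>value" lines and writes one cell per line.
    public static class TransformMapper
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws java.io.IOException, InterruptedException {
            String[] parts = line.toString().split("\t", 2);
            byte[] row = Bytes.toBytes(parts[0]);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("v"), Bytes.toBytes(parts[1]));
            ctx.write(new ImmutableBytesWritable(row), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName name = TableName.valueOf("my_table");
        boolean ok;
        try (Connection hbase = ConnectionFactory.createConnection(conf);
             Table table = hbase.getTable(name);
             RegionLocator regions = hbase.getRegionLocator(name)) {

            Job job = Job.getInstance(conf, "hfile-bulk-load");
            job.setJarByClass(BulkLoadDriver.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setMapperClass(TransformMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            FileInputFormat.addInputPath(job, new Path("/input/dump"));
            FileOutputFormat.setOutputPath(job, new Path("/staging/hfiles"));

            // Wires in TotalOrderPartitioner + PutSortReducer and configures
            // HFileOutputFormat2 against the table's current region boundaries.
            HFileOutputFormat2.configureIncrementalLoad(job, table, regions);

            ok = job.waitForCompletion(true);
        }
        // Afterwards: hbase completebulkload /staging/hfiles my_table
        System.exit(ok ? 0 : 1);
    }
}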

Simplest and fastest way to find rows in a large CSV

I have several CSV files and I need to load them and search for rows by column value.
Someone suggested using the OpenCSV project to load the CSV files, but I don't know if this is the best way.
Does OpenCSV provide some search/filter utility?
Is there a better way to do what I need?
You can load the data from your CSV files into your favourite SQL engine, e.g. MySQL or SQLite, and use SQL to filter conveniently and fast. This is a common task, so databases have ready-to-use tools for importing data from CSV files; this is how you can do it in SQLite: http://www.sqlite.org/cvstrac/wiki?p=ImportingFiles
If your CSV files are too big to keep in memory and you don't want to resort to storing everything in a database first (which would be a tedious disk-to-memory-to-disk operation), then there is another approach nobody seems to have mentioned: streaming.
The approach consists of reading a number of rows from the file, processing them, and then discarding the ones that don't match your search. You could do this with Apache Commons FileUtils, for example. It could be that some of the existing CSV APIs offer this; I haven't checked.
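As one possible shape for that streaming loop, here is a small sketch using Apache Commons CSV (a different library from the FileUtils mentioned above, but it parses records lazily while you iterate, so memory use stays flat). The file name, column name, and match value are placeholders.

import java.io.Reader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

public class CsvStreamSearch {
    public static void main(String[] args) throws Exception {
        List<CSVRecord> matches = new ArrayList<>();
        try (Reader in = Files.newBufferedReader(Paths.get("big_file.csv"));
             CSVParser parser = CSVFormat.DEFAULT
                     .withFirstRecordAsHeader()
                     .parse(in)) {
            for (CSVRecord record : parser) {          // streams row by row
                if ("42".equals(record.get("customer_id"))) {
                    matches.add(record);               // keep only matching rows
                }
            }
        }
        System.out.println("Found " + matches.size() + " matching rows");
    }
}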
Use an embedded database, separating CSV from search functionality.
Something like Apache Commons CSV will simply give you a two-dimensional string array of values. I doubt any solution will give you much more than this (given that there is no type/schema info in a CSV file), and I suspect a well-crafted loop over these results is all you need. That will be the simplest and fastest approach (as requested).
If you want to do more, you can run the standard Java-provided Java DB (Derby) database in-JVM, load the results into that, and perform SQL queries without an external data source/service.
Note that memory may be a problem if you load a sizable CSV, but just how big are these files? Memory is very cheap these days.

Output logging in Java

I'm currently working on applying genetic algorithms to a particular application, and the issue is that there is a large amount of data that I need to analyze, graph and, simply, tabulate. Up to this point I have been using CSV files, but they have been somewhat limited, as I still have to generate charts manually, and that is an issue when it needs to be done for over 100 documents.
Are there any other options for output logging in Java, for analysis, other than CSV files? A link to an API of any kind would also be useful.
P.S: (The question seems common enough to have been asked already, but I couldn't find it.) I'm not asking about how to log data in Java, or how to redirect it to a file, but if there are any existing ways to easily tabulate and graph large amounts of output.
The kind of data I'm working with involves a lot of numerical data, specifically the attributes of different generations and different organisms inside those generations. I'm trying to find and interpret trends within the numerical data which would mean that I need to generate separate graphs for different populations or test runs, and also find representative values for each file and graph those against specific test run conditions.
Also, there is a time parameter which measures the speed of the algorithm. Which methods let me log output without the post-processing and disk access affecting my test runs? Is that possible?
You could use Apache POI to write out an Excel spreadsheet directly. You can also have it start from a spreadsheet that already contains macros and whatever else you need to display your information.
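For example, a minimal sketch of writing tabulated run data straight to an .xlsx file with POI could look like this (the file name, sheet name, and sample numbers are made up):

import java.io.FileOutputStream;

import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

public class ResultsToExcel {
    public static void main(String[] args) throws Exception {
        double[][] fitnessPerGeneration = {
            {1, 0.42}, {2, 0.57}, {3, 0.63}    // generation, best fitness (dummy data)
        };

        try (Workbook workbook = new XSSFWorkbook();
             FileOutputStream out = new FileOutputStream("ga_results.xlsx")) {

            Sheet sheet = workbook.createSheet("run-1");
            Row header = sheet.createRow(0);
            header.createCell(0).setCellValue("generation");
            header.createCell(1).setCellValue("best fitness");

            for (int i = 0; i < fitnessPerGeneration.length; i++) {
                Row row = sheet.createRow(i + 1);
                row.createCell(0).setCellValue(fitnessPerGeneration[i][0]);
                row.createCell(1).setCellValue(fitnessPerGeneration[i][1]);
            }
            workbook.write(out);    // charts/macros can live in a template workbook instead
        }
    }
}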
There are plenty of choices for exporting reports/data. There is an open-source project called JasperReports that can export charts, PDF, XML, CSV, and plain text. It is an involved process, but it does offer a Java API to accomplish that task.
How about writing the table into a database, like MySQL?
That way you can search your data far more easily than in a text file.
Have a look here: http://www.vogella.de/articles/MySQLJava/article.html
On Linux, you also have the possibility of sticking with your CSV files and generating plots using gnuplot scripts.
It sounds like you need to save the data to a database and then use JasperReports to build a report containing the graphs and whatever else you need from the stored data, instead of trying to use Excel. Jasper is fairly easy to use, and your Java application can generate the report for you once the data is in the database.
