I have some Pig output files and want to read them on another machine (without a Hadoop installation). I just want to read each tab-separated plain-text line and parse it into a Java object. I am guessing we should be able to use pig.jar as a dependency and read it that way, but I could not find relevant documentation. I think this class could be used? How can we provide the schema as well?
I suggest you store the data in the Avro serialization format. It is Pig-independent and it can handle complex data structures like the ones you described (so you don't need to write your own parser). See this article for examples.
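For what it's worth, reading an Avro data file back does not require Hadoop at all. Here is a minimal sketch, assuming only the Avro jar is on the classpath; the file name is a made-up placeholder:

```java
import java.io.File;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroReadSketch {
    public static void main(String[] args) throws Exception {
        // Placeholder name; point this at one of your Avro output files.
        File avroFile = new File("part-r-00000.avro");

        // The generic reader picks up the schema embedded in the file itself,
        // so the reading side needs no separate schema definition.
        try (DataFileReader<GenericRecord> reader =
                new DataFileReader<>(avroFile, new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);   // or map the fields to your own Java object
            }
        }
    }
}
```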
Your Pig output files are just text files, right? Then you don't need any Pig or Hadoop jars.
The last time I worked with Pig was on Amazon's EMR platform, and the output files were stashed in an S3 bucket. They were just text files, and standard Java can read them in.
That class you referenced is for reading data into Pig from some text format.
Are you asking for a library to parse the Pig data model into Java objects, i.e. the text representation of tuples, bags, etc.? If so, then it's probably easier to write it yourself. It's a VERY simple data model with only three-ish datatypes.
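If the output was written by the default PigStorage (one tuple per line, fields separated by tabs), the parser really is only a few lines of plain Java. A minimal sketch; the Record class, the field layout, and the file name are assumptions you would adapt to your own schema:

```java
import java.io.BufferedReader;
import java.io.FileReader;

public class PigOutputParser {

    // Hypothetical target object; adjust the fields to your actual schema.
    static class Record {
        String name;
        int count;
        double score;

        Record(String name, int count, double score) {
            this.name = name;
            this.count = count;
            this.score = score;
        }
    }

    public static void main(String[] args) throws Exception {
        // "part-r-00000" is just a placeholder for one of your Pig output files.
        try (BufferedReader in = new BufferedReader(new FileReader("part-r-00000"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // PigStorage's default delimiter is a tab; -1 keeps trailing empty fields.
                String[] fields = line.split("\t", -1);
                Record r = new Record(fields[0],
                                      Integer.parseInt(fields[1]),
                                      Double.parseDouble(fields[2]));
                System.out.println(r.name + " -> " + r.count + ", " + r.score);
            }
        }
    }
}
```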
I am learning to build neural nets and I came across this code on GitHub:
https://github.com/PavelJunek/back-propagation-java
There is a training set and a validation set that need to be used, but I don't know where to input the files. The README doesn't quite explain how to use them. How do I test this code with the different CSV files I have?
How so? It tells you exactly what to do. The program expects two CSV files: one containing all the training data and a second containing all of the validation data.
If you have a look at the Program.java file (in the main method), you'll see that you need to pass both files as command-line arguments.
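I haven't verified the exact argument order the repository uses, so take this only as a hedged sketch of what such a main method typically looks like; the actual Program.java in the repo is the authority:

```java
public class Program {
    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("Usage: java Program <training.csv> <validation.csv>");
            System.exit(1);
        }
        String trainingCsv = args[0];    // path to the training data file
        String validationCsv = args[1];  // path to the validation data file
        // ... load both files and run training/validation ...
    }
}
```

From the command line that would be invoked roughly as java Program training.csv validation.csv (adjust the class name and classpath to the project).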
Has anyone ever dealt with exporting JSON like the one in my sample, made only for the purposes of this question:
https://gist.github.com/slavisah/97b57a5826dc0b49ee22895035eb244a
It represents a list of material objects (wood, metal, etc.). The requirement is that every material has to be written on one line of the CSV file, with all of its behaviors and properties, and their related sub-lists, in the same row. Every list is of size N.
My question is: how should I structure that CSV file for the easiest export/import in my application? Is anyone familiar with a Java library capable of doing things like this?
Thanks.
Some good libraries for working with CSV files in Java:
http://www.beanio.org/
http://super-csv.github.io/super-csv/index.html
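As a rough sketch of the Super CSV route: the MaterialRow bean below is an assumption, and the N-sized sub-lists are pre-flattened into single delimited strings so that each material fits on one row (pick your own inner delimiter):

```java
import java.io.FileWriter;
import java.util.Arrays;
import java.util.List;

import org.supercsv.io.CsvBeanWriter;
import org.supercsv.io.ICsvBeanWriter;
import org.supercsv.prefs.CsvPreference;

public class MaterialCsvExport {

    // Hypothetical bean; the list-valued fields are joined into single strings
    // (e.g. "burns|floats") so every material occupies exactly one CSV row.
    public static class MaterialRow {
        private String name;
        private String behaviors;
        private String properties;

        public MaterialRow(String name, String behaviors, String properties) {
            this.name = name;
            this.behaviors = behaviors;
            this.properties = properties;
        }

        public String getName()       { return name; }
        public String getBehaviors()  { return behaviors; }
        public String getProperties() { return properties; }
    }

    public static void main(String[] args) throws Exception {
        String[] header = {"name", "behaviors", "properties"};
        List<MaterialRow> rows = Arrays.asList(
                new MaterialRow("wood",  "burns|floats",   "density=0.7"),
                new MaterialRow("metal", "conducts|rusts", "density=7.8"));

        try (ICsvBeanWriter writer =
                new CsvBeanWriter(new FileWriter("materials.csv"),
                                  CsvPreference.STANDARD_PREFERENCE)) {
            writer.writeHeader(header);
            for (MaterialRow row : rows) {
                writer.write(row, header);
            }
        }
    }
}
```

Importing is the mirror image with CsvBeanReader, splitting the delimited columns back into lists.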
I suggest you use Apache POI, the Java API for Microsoft Documents: https://poi.apache.org/
I have created a Python script for predictive analytics using pandas, NumPy, etc. I want to send my result set to a Java application. Is there a simple way to do it? I found that we can use Jython for Java-Python integration, but it doesn't support many of the data analysis libraries. Any help will be great. Thank you.
Have you tried using XML to transfer the data between the two applications?
My next suggestion would be to output the data in JSON format to a text file and then call the Java application, which reads the JSON from that file.
A better approach here is to pipe the Python output straight into Java, e.g. python pythonApp.py | java read. The output of the Python application can be used as the input of the Java application as long as the data format is consistent and known. The above solution of creating a file and then reading it also works, but it is more error-prone.
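A minimal sketch of the Java side of such a pipe, assuming the Python script prints one tab-separated record per line (the format is an assumption; adjust the split to whatever your script actually emits):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class PipeReader {
    public static void main(String[] args) throws Exception {
        // Reads whatever the Python script writes to stdout, e.g.:
        //   python pythonApp.py | java PipeReader
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumed format: one tab-separated record per line.
                String[] fields = line.split("\t", -1);
                System.out.println("got " + fields.length + " fields: " + line);
            }
        }
    }
}
```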
I've used Apache Flume to pipe a large number of tweets into the HDFS of Hadoop. I was trying to do sentiment analysis on this data - just something simple to begin with, like a positive vs. negative word comparison.
My problem is that all the guides I find showing me how to do it have a text file of positive and negative words and then a huge text file with every tweet.
As I used Flume, all my data is already in Hadoop. When I access it using localhost:50070 I can see the data, in separate files according to month/day/hour, with each file containing three or four tweets. I have maybe 50 of these files for every hour. Although it doesn't say anywhere, I'm assuming they are in JSON format.
Bearing this in mind, how can I perform my analysis on them? In all the examples I've seen where the Mapper and Reducer have been written, the analysis has been performed on a single file, not a large collection of small JSON files. What should my next step be?
This example should get you started
https://github.com/cloudera/cdh-twitter-example
Basically, use a Hive external table to map your JSON data and query it using HiveQL.
When you want to process all the files in a directory, you can just specify the path of the directory as the input of your Hadoop job, so that it will consider all the files in that directory as its input.
For example, if your small files are in the directory /user/flume/tweets/...., then in your Hadoop job you can just specify /user/flume/tweets/ as your input path.
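A minimal driver sketch illustrating that; the mapper/reducer wiring and the output path are placeholders for your own classes:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TweetSentimentDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "tweet sentiment");
        job.setJarByClass(TweetSentimentDriver.class);

        // job.setMapperClass(...); job.setReducerClass(...);  // your own classes go here

        // Pointing the input at the directory makes Hadoop read all files in it.
        FileInputFormat.addInputPath(job, new Path("/user/flume/tweets/"));

        // If Flume wrote into nested month/day/hour subdirectories, newer Hadoop
        // versions also need the recursive flag so those files are picked up too.
        FileInputFormat.setInputDirRecursive(job, true);

        FileOutputFormat.setOutputPath(job, new Path("/user/output/sentiment")); // placeholder

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```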
If you want to automate the analysis to run every hour, you will need to write an Oozie workflow.
You can refer to the link below for sentiment analysis in Hive:
https://acadgild.com/blog/sentiment-analysis-on-tweets-with-apache-hive-using-afinn-dictionary/
I'm trying to read an Excel file and pass all the data to a DB. I found a few code examples, but all of them required external jars. How can I read Excel files using only the standard library?
If you don't want to use a library, then you will have to download the Excel file format specs from MS and write an Excel parser yourself (which is extremely complicated and could take one developer more than 10 years). For the OpenXML format spec, see here and here.
Thus I really recommend using a library for that...
Try Apache POI, a free Java library for dealing with MS Office documents.
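A minimal POI sketch for reading cells, assuming the poi and poi-ooxml jars are on the classpath; the file name is a made-up placeholder:

```java
import java.io.File;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

public class ExcelReadSketch {
    public static void main(String[] args) throws Exception {
        DataFormatter formatter = new DataFormatter();
        // "data.xlsx" is a placeholder; WorkbookFactory handles both .xls and .xlsx.
        try (Workbook workbook = WorkbookFactory.create(new File("data.xlsx"))) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                StringBuilder line = new StringBuilder();
                for (Cell cell : row) {
                    line.append(formatter.formatCellValue(cell)).append('\t');
                }
                System.out.println(line);   // or bind each row to your DB insert
            }
        }
    }
}
```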
You can save the Excel file as *.csv, separated by ";". Then you can read the file line by line and get the columns from the tokens on each line.
Microsoft Excel uses a binary format to save its data, so manually reading Excel files might be a hassle. If you can convert the Excel file (xls) to a comma-separated values (CSV) file, then you can just read the file and split your input on the commas.
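A minimal standard-library sketch of that CSV route; the file name and delimiter are assumptions, and note that a plain split breaks if a cell value itself contains the delimiter or quotes:

```java
import java.io.BufferedReader;
import java.io.FileReader;

public class CsvReadSketch {
    public static void main(String[] args) throws Exception {
        // "data.csv" is a placeholder; use ";" instead if that is how you exported it.
        try (BufferedReader in = new BufferedReader(new FileReader("data.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] columns = line.split(",", -1);
                // Each columns[i] can now be bound to a parameter of your DB insert.
                System.out.println(String.join(" | ", columns));
            }
        }
    }
}
```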
This is a difficult problem. First off, it is not as simple as "adding a third-party library". There are no existing Excel-reading libraries that do not cost money, and the one that I know works is very expensive AND has bugs in it.
One strategy is to create an Excel add-in that reads the data and transfers it to your application via OLE, the clipboard, a TCP/IP port, or a temporary file. If you look in the source code for OPeNDAP.org's ODC project, you can find an Excel add-in and the TCP capability to do this.
You can try referring to the reader in OpenOffice, which is open source; however, in my opinion that code is not easily refactorable into a private project, for various reasons.
Microsoft has components and tools to open Excel files and expose them via COM objects.
You can also learn the BIFF format and write your own parser. You would probably want to write a parser for BIFF5, but be forewarned: this is a BIG project, even if you only parse a limited number of data types.