Writing one file per group in Pig Latin - java

The Problem:
I have numerous files that contain Apache web server log entries. Those entries are not in date-time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date-time, then write them to files named for the day and hour of the entries they contain.
Setup:
Once I have loaded my files, I use a regex to extract the date field, then truncate it to the hour. This produces a relation with the full record in one field and the date truncated to the hour in another. From there I group on the date-hour field.
First Attempt:
My first thought was to use the STORE command while iterating through my groups using a FOREACH, and I quickly found out that Pig is not cool with that (STORE cannot be nested inside a FOREACH).
Second Attempt:
My second try was to use the MultiStorage() storage function in the PiggyBank, which worked great until I looked at the output. The problem is that MultiStorage writes all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.

Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100%. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. Either copy the source code for MultiStorage or subclass it as a starting point. Then modify the putNext method to strip out the group value but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rebuild the entire tuple. Or, if all you have is the original string, just pull that out and output it wrapped in a Tuple.
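A minimal sketch of that tuple-rebuilding part, assuming you are working in a copy of the PiggyBank MultiStorage source and the group key sits at a known index (the helper class and method names here are made up for illustration):

import java.io.IOException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public final class TupleUtils {
    private static final TupleFactory TF = TupleFactory.getInstance();

    // Returns a copy of the tuple with the field at dropIndex removed,
    // since Tuple itself has no remove/delete method.
    public static Tuple stripField(Tuple tuple, int dropIndex) throws IOException {
        Tuple out = TF.newTuple(tuple.size() - 1);
        int j = 0;
        for (int i = 0; i < tuple.size(); i++) {
            if (i == dropIndex) continue; // skip the group key
            out.set(j++, tuple.get(i));
        }
        return out;
    }
}

In your modified putNext you would then hand the stripped tuple to the writer while still using the original group value to choose the output file, so the day-hour file naming is preserved.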
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions

Related

I want to transfer information from Google Docs to Google Spreadsheet

I currently have a large amount of information sorted into table form on Google Docs; an example can be seen below:
I would like to transfer all of this information into Google Spreadsheet form, with lines 1-5 going across columns B-F, respectively, and the information going underneath each respective column.
Would I need to use a script to accomplish this task? If so, what type of script should I use, and where can I access such a script (i.e., potentially find a freelance programmer who can write it for me, if necessary)? Are there any other ways this task could be accomplished? All of the information in the Google Doc is very standardized, so there is not any variation that could complicate a script. If a script could transfer one set of 5, it would work on all of the sets.
Thank you, let me know if you need any more information.
This can be done with a lot of different languages. I would approach this using Java just because I am most familiar with it. I would start by downloading the Google Doc as plain text (.txt), then run it through line by line, parsing it into .csv format. From there you can import it directly into Google Sheets.
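A minimal sketch of that line-by-line pass, assuming each record is five consecutive non-blank lines (the columns B-F from the question) separated by blank lines; the file names are placeholders:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class DocToCsv {
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get("export.txt"), StandardCharsets.UTF_8);
        StringBuilder csv = new StringBuilder();
        List<String> record = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                flush(record, csv); // blank line ends a record
            } else {
                record.add(line.trim());
            }
        }
        flush(record, csv); // the last record may not be followed by a blank line
        Files.write(Paths.get("import.csv"), csv.toString().getBytes(StandardCharsets.UTF_8));
    }

    private static void flush(List<String> record, StringBuilder csv) {
        if (record.isEmpty()) return;
        // Quote each field so embedded commas don't break the CSV.
        List<String> quoted = new ArrayList<>();
        for (String f : record) quoted.add("\"" + f.replace("\"", "\"\"") + "\"");
        csv.append(String.join(",", quoted)).append('\n');
        record.clear();
    }
}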
You can do this with Notepad++ or an equivalent editor, using the Find and Replace tool in extended search mode.
For example, to replace a line break, search for \r\n and replace it with whatever you need.
If you can place \t (a tab) between fields, you can simply paste them onto the sheet and they will align into columns.
So here you can replace double line breaks with some placeholder symbol, then single line breaks with \t, and then replace the placeholder symbol with a single line break. You get all the data in column structure.

Android Intents and Lists

I plan on reading several files when my app/game is created and using the information from them for the entirety of the app. I also have to write to the file at one point.
I have two files. One is a 2-column text file that I'll turn into a dictionary for fast searching. The other is a text file that has 11 columns. I'll make a dictionary out of two of the columns, and the other data I need kept as is so I can write to the columns to count the number of times something happens in different circumstances for datamining.
Currently, I've turned the second file into a list of lists of strings, i.e. a List<List<String>>. I can't figure out how to pass that around in Intents; .putStringArrayListExtra() only works for an ArrayList of Strings.
Am I going about this the wrong way entirely? This is my first real Android app.
In order to store a data structure in an Intent, it has to be either Serializable or Parcelable. If your data structure is neither, you can create a wrapper class that implements Serializable and manages it. A good example can be found here.
Once done, you can use Intent.putExtra(...) with a Serializable value (or Bundle.putSerializable(...)) to store your data structure. See this:
Using putSerializable in Android
In addition, if you can convert your structure into JSON, you are already done, since it can be passed around as a plain String. If not, the above solution should be easy to implement.
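A minimal sketch of that wrapper approach, assuming the payload is the List<List<String>> from the question (the class name and extra key are made up):

import java.io.Serializable;
import java.util.List;

// Wrapper so the nested list can travel as a single Intent extra.
// ArrayList and String are already Serializable, so this works as long
// as the concrete lists you use are serializable (e.g. ArrayList).
public class TableData implements Serializable {
    private static final long serialVersionUID = 1L;
    private final List<List<String>> rows;

    public TableData(List<List<String>> rows) { this.rows = rows; }
    public List<List<String>> getRows() { return rows; }
}

// Sending side:
//   intent.putExtra("table", new TableData(rows));
// Receiving side:
//   TableData table = (TableData) getIntent().getSerializableExtra("table");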

Validate Format Tricky File Using Java

I need to parse and validate a file whose format is a little bit tricky.
Basically the file comes in this format:
\n -- just to make clear it may have empty lines
CLIENT_ID
A_NUMERIC_VALUE
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT
\n
\n
CLIENT_ID_2
A_NUMERIC_VALUE_2
ONE_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
ANOTHER_LINE_OF_SOME_RANDOM_COMMENT_ABOUT_THE_CLIENT_2
OHH_THIS_ONE_HAS_THREE_LINES_OF_COMMENTS
The file will seldom be big (10 MB is probably the biggest file I've ever seen; usually they are around 900 KB-1 MB).
So I have two problems:
1) How can I effectively validate the format of the file? Using regex + Scanner? (I see this as a very feasible option if I can transform each client entry into a single string, so I can apply the regex to it.)
2) I need to transform each of the entries in the file into Client objects. Should I validate the whole file before transforming it into Java objects? Or should I validate the file as I go on transforming its entry into Java objects? (Bear in mind that if any client entry is invalid, the processing halts immediately and an exception is thrown - hence any object that was created will be discarded).
I'm really keen to see your suggestions about question #1. Question #2 is more a curiosity of mine on how you would handle this situation. Ignore #2 if you will, but please answer #1 =)
By the way, does anyone know of a framework that could help me handle the file?
Thanks.
Update:
I saw this question and the problem is very similar to mine, but I'm not sure regex is the best way out of this problem. There might be quite a lot of "\n" throughout the file, a varying number of comments for each client entry, and an optional ID, hence the regex would have to be quite complex. That's why I mentioned transforming each entry into one row in question #1, because that way it would be much easier to create a regex to validate... nevertheless, this solution does not sound very elegant to my ears :(
Cheers.
If you intend to fail the batch if any part is found invalid, then validate the file first.
There are several advantages. One is that validation and processing need not be synchronous. If, for example, you process batches daily but receive files throughout the day, you can validate them as they arrive and notify the sender to correct problems before your scheduled processing. Another is that validating whether a file is well-formed is very fast.
A short, simple Perl script would certainly do the job. No need to transform the data, if I understand the pattern correctly, and it's all read-forward:
read past any newlines
read and validate a client id
read and validate a numeric value
read and validate one or more comments until a blank line is found
repeat the above four steps until EOF or invalid data detected
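Since the question asks for Java, here is the same read-forward loop as a minimal Java sketch; the exact ID and numeric formats are assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ClientFileValidator {

    public static void validate(String path) throws IOException {
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = readPastBlankLines(in)) != null) {
                // 1) client id (format is an assumption: any non-blank line)
                String clientId = line;
                // 2) numeric value
                String value = in.readLine();
                if (value == null || !value.trim().matches("\\d+(\\.\\d+)?")) {
                    throw new IOException("Invalid numeric value for client " + clientId);
                }
                // 3) one or more comment lines, terminated by a blank line or EOF
                List<String> comments = new ArrayList<>();
                String comment;
                while ((comment = in.readLine()) != null && !comment.trim().isEmpty()) {
                    comments.add(comment);
                }
                if (comments.isEmpty()) {
                    throw new IOException("Missing comments for client " + clientId);
                }
                // 4) outer loop repeats until EOF or invalid data
            }
        }
    }

    // Skips empty lines; returns the next non-blank line, or null at EOF.
    private static String readPastBlankLines(BufferedReader in) throws IOException {
        String line;
        while ((line = in.readLine()) != null && line.trim().isEmpty()) { /* skip */ }
        return line;
    }
}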

Regex: Negating a whole word (needed to optimize a file)

I am trying to make a simple weather widget for Android that provides temperatures just for my country (Jordan). The website I am using for the weather records provides a JSON file with country-region data for many countries. The problem is that the file contains 2500+ objects, and it takes a really long time to parse. Thus, as I actually need <100 of them (the regions of my country), I thought I could optimize the file before passing it to the JSON parser by stripping out all of the records I don't need. I don't know if it's a good solution, but it was what I thought of. Anyway, my problem now is getting the right regex.
This is the URL of the JSON file.
As you can see, every object has four items. The one I need to check for is "icon", which specifies the country of that region.
EXAMPLE:
{"value":"khalda","icon":"Jordan","label":"khalda","desc":"Amman & Madaba"},
What I have come up with so far is the pattern for the objects I actually need. However, I need to match the ones I don't need so I can delete them. Here is the pattern: \{[^\{]*Jordan*[^\}]*\}, (This has to be modified so it matches when "Jordan" does NOT appear, which I couldn't figure out.)
Any help/hint is highly appreciated.
Thanks.
Rather than matching and deleting the objects you don't need, match and extract the single(?) object that you do need. It will be faster.
(And I agree with minitech's comment. Parsing the JSON file is unlikely to be the real bottleneck.)
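For instance, a minimal Java sketch of the extract-instead-of-delete idea, assuming the feed's objects are flat (no nested braces), which holds for the example shown; it rebuilds a valid JSON array from the matches:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JordanFilter {
    public static String filter(String json) {
        // Match one {...} object whose "icon" field is Jordan; objects in
        // this feed never nest, so [^{}]* is safe.
        Pattern p = Pattern.compile("\\{[^{}]*\"icon\":\"Jordan\"[^{}]*\\}");
        Matcher m = p.matcher(json);
        StringBuilder out = new StringBuilder("[");
        while (m.find()) {
            if (out.length() > 1) out.append(',');
            out.append(m.group());
        }
        return out.append(']').toString();
    }
}

The result stays a valid JSON array, so it can go straight into the same parser you were already using.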

What technologies are there for formatted, structured data input and output?

I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The second portion of this is what if we want to get a "reduced" resume out of the database. In other words, I want to search the uploaded content I now have and only print out new resumes for people who have Java programming experience, let's say. So I can go into my database, find the people who originally had Java as a skill, and output a new set of resumes that are still in a nice templated format and only have the relevant info in them, instead of ALL the content.
I have been writing some software to do this in Java that basically uses a docx template, overwriting the items in the customXML which are bound to the content controls in the doc, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, let's say my template has a place for 3 skills, and the particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit source code to change that additional data's input XML tag to bold instead of italic.
I was doing some reading on using InfoPath to create a form that I could use to get the input, connecting to some SharePoint data source or something to store the stripped-out data. However, I can't seem to find out whether it is possible, using SharePoint, to get the data back out in a nicely formatted way. What would the general steps for this be? I couldn't find very much about this topic with any quick Googling.
Thanks
You could set up the skills in your custom XML like this:
<skills>
<skill>..</skill>
<skill>..</skill>
</skills>
and use a "repeat" content control pointing to the <skills> container. The repeat would then handle any number of <skill> entries without template changes.
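If you stay with docx4j for the output side, the high-level binding call looks roughly like this; this is a sketch from memory of the docx4j 3.x API, with file names made up, so double-check the flags against the docs:

import java.io.File;
import java.io.FileInputStream;
import org.docx4j.Docx4J;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;

public class ResumeMerge {
    public static void main(String[] args) throws Exception {
        // Template whose content controls are bound to the <skills> XML;
        // the repeat control is cloned once per <skill> element, so a
        // candidate with 8 skills needs no template changes.
        WordprocessingMLPackage pkg = Docx4J.load(new File("resume-template.docx"));
        try (FileInputStream xml = new FileInputStream("candidate.xml")) {
            Docx4J.bind(pkg, xml, Docx4J.FLAG_BIND_INSERT_XML | Docx4J.FLAG_BIND_BIND_XML);
        }
        Docx4J.save(pkg, new File("resume-out.docx"), Docx4J.FLAG_NONE);
    }
}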
