Convert timestamp to weekNumber mapReduce - java

I am looking to preprocess timestamps to obtain the corresponding week numbers using MapReduce, as the dataset has hundreds of millions of instances that need to be processed. So far I have figured out that the first MR job needs to preprocess and sort each line, with the timestamp as the key and the rest of the line as the value.
The second job then appends the corresponding date to each timestamp object.
However, I do not know how to perform the third task, which is to create a continuous timeline of week numbers. That is, if my minimum timestamp corresponds to the date 03/10/2000, I would like to tag it with the number 10 (indicating the 10th week of the year 2000; let's assume it is, even if that is not actually the case). Then say the next timestamp corresponds to 02/01/2011: if we assume 52 weeks in the year 2000 and that 02/01/2011 falls in the 5th week of 2011, I would like to tag this date as week 57 and not as week 5. I would like to know how to achieve this last step in MapReduce. Assume I have the following input file:
sorted_timestamp1::date::vals....
sorted_timestamp2::date::vals...
...
...
...
sorted_timestampn::date::vals.....
Simple pseudocode with map and reduce in java would suffice for my case, actual code would be great also.
Thanks in advance for your help!

I think you can separate the two problems:
1) Map reduce logic:
What do you really want to calculate with map reduce? You have to choose the keys depending on that.
Just a guess from my side: if you want to do some aggregations on a weekly level, the mapper should take each line of input (think of the line number as the key) and write out the data with a new key representing the week (I'll give you some remarks in point 2).
The reducer will then have access to all data sets with the same week key, and you can aggregate them however you like and write the results out.
2) Week calculations:
Using a java.util.Calendar object you can easily calculate the week of a Timestamp/Date. To get a continuous week value you can calculate the week offset from a minimum reference date. To keep things simple, I propose using January 1 of a sensible year as the reference. To calculate the difference in weeks you can, for example, use
the static method Weeks.weeksBetween from the Joda-Time package.
If the concrete value of the "week" key is not of special interest you can also use a composite key like
year*100+week
which is much simpler to evaluate and therefore is faster. If you really need the special week timeline think about using the simple key first (just used for aggregations in map reduce) and do the more expensive week timeline evaluations later after the reducer has generated its result with much less data.
Good luck + regards
Martin
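As a rough sketch of the two key choices described above: the snippet below uses java.time instead of the Calendar/Joda classes mentioned in the answer, and it assumes that the timestamps are epoch milliseconds, that dates are taken in UTC, and that 1 January 2000 is the reference date. The class and method names (WeekKeys, continuousWeek, compositeKey) are made up for illustration. A mapper would call one of these methods on each line's timestamp and emit the result as its output key.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.ChronoUnit;
import java.time.temporal.IsoFields;

final class WeekKeys {
    // Assumed reference point: 1 January of the earliest year in the data set.
    static final LocalDate REFERENCE = LocalDate.of(2000, 1, 1);

    // Converts an epoch-millisecond timestamp to a calendar date in UTC (assumption).
    static LocalDate toDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneOffset.UTC).toLocalDate();
    }

    // Continuous timeline: number of complete weeks since REFERENCE, starting at 0.
    static long continuousWeek(long epochMillis) {
        return ChronoUnit.WEEKS.between(REFERENCE, toDate(epochMillis));
    }

    // Cheaper composite key: ISO week-based year * 100 + ISO week number.
    static int compositeKey(long epochMillis) {
        LocalDate date = toDate(epochMillis);
        return date.get(IsoFields.WEEK_BASED_YEAR) * 100
                + date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
    }

    public static void main(String[] args) {
        long ts = LocalDate.of(2000, 3, 10).atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
        System.out.println(continuousWeek(ts)); // complete weeks since the reference date
        System.out.println(compositeKey(ts));   // year * 100 + week
    }
}
```

Add 1 to continuousWeek if the timeline should start at week 1 rather than week 0. The composite key sorts correctly but is not continuous across year boundaries.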

Related

Time - Specify X seconds / minutes, and basic arithmetic

I am wondering how one would go about specifying a certain amount of time, say X seconds. I'm writing the behaviour for a class that represents a Till (as in, a supermarket till), and wish to specify how long it takes to check out 1 item.
I'm doing this so once I receive the number of items a customer has, the time taken to serve the customer is simply:
ITEM_CHECKOUT_TIME * NumberOfItems;
ITEM_CHECKOUT_TIME would be a constant, and what I wish to specify. Some basic arithmetic would be done on this constant, like above.
Sure, I could use a double to represent the time, but I was wondering if it's actually possible with the Time classes, or anything else specifically for this task.
Thanks!
I would not use a double to represent time. I would probably represent it as a whole number of milliseconds (or nanoseconds). If you're looking for something fancier, you might want to look at the Duration class in the Joda-Time library:
http://joda-time.sourceforge.net/key_duration.html
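For comparison, here is a minimal sketch of the same idea using java.time.Duration from the JDK (Java 8 and later) rather than the Joda-Time class linked above. The 3-second checkout time and the method name serviceTime are placeholders, not values from the question.

```java
import java.time.Duration;

final class Till {
    // Hypothetical value: 3 seconds to check out one item.
    static final Duration ITEM_CHECKOUT_TIME = Duration.ofSeconds(3);

    // Time to serve a customer is the per-item time multiplied by the item count.
    static Duration serviceTime(int numberOfItems) {
        return ITEM_CHECKOUT_TIME.multipliedBy(numberOfItems);
    }

    public static void main(String[] args) {
        Duration total = serviceTime(5);
        System.out.println(total.getSeconds() + " seconds");
    }
}
```

Duration stores whole seconds plus nanoseconds internally, so the arithmetic stays exact, which is the main reason to prefer it over a double.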

ID generation using AtomicLong, how to start from 0 each day

I am using the java.util.concurrent.atomic.AtomicLong class to generate sequence numbers for id generation. I need to start this number from 1 each day. What logic or methods can I use for this?
Here's an answer on how to get the beginning of a day using Joda Time. Then use the method AtomicLong#set(long) to reset the value.
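One possible shape for this, as a sketch only: it uses java.time.LocalDate instead of Joda Time to detect the day change, and the class name DailySequence is invented for the example. The day check and the reset are guarded by synchronized so that two threads cannot both reset the counter; with that lock in place the AtomicLong is no longer strictly necessary, but it is kept here to stay close to the question.

```java
import java.time.LocalDate;
import java.util.concurrent.atomic.AtomicLong;

final class DailySequence {
    private final AtomicLong counter = new AtomicLong(0);
    private LocalDate currentDay;

    DailySequence(LocalDate startDay) {
        this.currentDay = startDay;
    }

    // Returns the next id for the given day; the first id of each day is 1.
    synchronized long next(LocalDate today) {
        if (!today.equals(currentDay)) {
            currentDay = today;
            counter.set(0); // AtomicLong#set resets the sequence for the new day
        }
        return counter.incrementAndGet();
    }

    // Convenience overload that uses the system clock and default time zone.
    long next() {
        return next(LocalDate.now());
    }
}
```

Passing the date in explicitly makes the reset easy to test; production code would normally call next() without arguments.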

Protocol Buffer: How to define Date type?

I'm trying to write a proto file that has a Date field, but Date is not a defined type in Protocol Buffers.
I have read the following post but I couldn't figure out a proper solution that suits me:
What the best ways to use decimals and datetimes with protocol buffers?.
I'm trying to generate a Java class from the proto file.
My answer in the linked post relates mainly to protobuf-net; however, since you are coming at this from java I would recommend: keep it simple.
For dates, I would suggest just using the time (perhaps milliseconds) into an epoch (1 Jan 1970 is traditional). For times, just the size in that same unit (milliseconds etc). For decimal, maybe use fixed point simply by scaling - so maybe treat 1.05 as the long 1050 and assert always exactly 3dp (hence fixed point).
This is simple and pragmatic, and covers most common scenarios without making things complicated.
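On the Java side, the conversions described above might look roughly like the following. The class and method names (ProtoScalars, toFixedPoint and so on) are illustrative, and the proto fields themselves would simply be 64-bit integers.

```java
import java.math.BigDecimal;
import java.time.Instant;

final class ProtoScalars {
    // Fixed number of decimal places, as in the 1.05 -> 1050 example.
    static final int SCALE = 3;

    // Dates and times: milliseconds since 1 Jan 1970 UTC.
    static long toEpochMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    static Instant fromEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    // Decimals: scale by 10^SCALE; throws ArithmeticException if digits would be lost.
    static long toFixedPoint(BigDecimal value) {
        return value.movePointRight(SCALE).longValueExact();
    }

    static BigDecimal fromFixedPoint(long scaled) {
        return BigDecimal.valueOf(scaled, SCALE);
    }
}
```

Using longValueExact means a value with more than three decimal places is rejected instead of being silently rounded, which matches the "assert always exactly 3dp" rule.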
I'm not sold on this idea, but I'm really not sold on the idea of storing dates (which aren't instants in time) as a timestamp, so here's my suggestion.
Convert your date into a human-readable integer (e.g. 2014-11-3 becomes 20141103) and store this integer value. It contains exactly the data you need, is simple to create and parse, and takes up minimal space. Additionally, it is ordered and has a one-to-one mapping of dates to valid values (granted, invalid numbers are possible, such as 20149999, but these are easy to detect). In contrast, there are approximately 86400 valid timestamps that represent each day.
NB: There is a discussion on DBA SE criticizing this method of date storage, but in that context a specialized date type exists, which obviously isn't the case here.
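A small sketch of that integer encoding in Java, assuming four-digit positive years; the class name DateInts is made up for the example. Decoding relies on LocalDate.of to reject impossible values such as 20149999.

```java
import java.time.LocalDate;

final class DateInts {
    // 2014-11-03 becomes 20141103.
    static int encode(LocalDate date) {
        return date.getYear() * 10000 + date.getMonthValue() * 100 + date.getDayOfMonth();
    }

    // Throws java.time.DateTimeException if the integer is not a real calendar date.
    static LocalDate decode(int yyyymmdd) {
        int year = yyyymmdd / 10000;
        int month = (yyyymmdd / 100) % 100;
        int day = yyyymmdd % 100;
        return LocalDate.of(year, month, day);
    }
}
```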

Distinguishing and Parsing Dates in Java

I know this topic isn't new, but I have to dig it up again.
I have already searched the Web numerous times (including some threads here on Stack Overflow) but haven't found a satisfying answer so far.
(Among others, I checked
Parsing Ambiguous Dates in Java and
http://www.coderanch.com/t/375367/java/java/Handling-Multiple-Date-Formats-Elegantly)
I am currently writing a Dateparser in Java, which takes a date and generates a format-String which can be used by SimpleDateFormat for parsing the date.
The dates are parsed via regex (yes, it's an ugly one xD) from Logfiles (IBM Websphere, Tomcat, Microsoft Exchange, ....). Because we have customers in (at least 2) different Locales, there is no way to simply "throw" the String against the parse-method of SimpleDateFormat and expect it to work properly.
Furthermore, there is the problem with the position of day and month (i.e. formats "dd/MM/yyyy" or "MM/dd/yyyy"), which cannot be solved if I don't have at least two datasets where the day digit has changed.
So my current approach would be to store the date formats for a specific software installed on a specific customer's systems in a database (mysql / xml / ... ) and to force the user to at least specify customer name and software name, so there is enough context to narrow down the number of possible formats.
This "subset" then would be used to try to parse the logfiles of the specified software.
(The subset is stored in a HashMap in a HashMap in the form
HashMap<Integer, HashMap<String, ...>> map;
The Integer-Key is the length of the formatstring and the String Key of the second Hashmap specifies a datesignature only containing the separating characters.
(i.e. ".. ::." for a date with format "dd.MM.yyyy 11:11:11.111")
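To illustrate how such a lookup key could be derived (a guess at the idea, not the original parser's code): stripping every ASCII letter and digit from the date text leaves only the separators, and the text length serves as the outer key. The class name DateSignature is hypothetical.

```java
final class DateSignature {
    // Removes all ASCII letters and digits, keeping only separator characters.
    static String of(String dateText) {
        return dateText.replaceAll("\\p{Alnum}", "");
    }

    public static void main(String[] args) {
        String sample = "dd.MM.yyyy 11:11:11.111";
        System.out.println(sample.length());       // outer key: length of the string
        System.out.println("[" + of(sample) + "]"); // inner key: the separator signature
    }
}
```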
I also take into account the value of the digits, i.e. a digit > 12 has to be a day because there is no 13th month. But this only works reliably for date strings later than the 12th of a month.
Is there any chance to avoid implementing prior knowledge about the environment out of which the logfile came, thus enabling the parser to reliably parse one date without having to refer a second datestring for comparison?
I'm stuck on that for almost 3 months now -.-
Any suggestions would be very welcome =)
Edit:
Okay guys this thread can be closed. I now came up with a different solution for my specific problem. For those who are interested:
I am writing a Logreader in Java. As we have regular maintenance I have to read many logfiles.
But it's not just the plain text information that's written in the file.
Imagine a server has just crashed, it's Sunday night, and the next person to notice is the head of the customer's IT department. Then on the following day I have to do maintenance and check the logfiles. Judging by content, everything seemed okay, nothing unusual. Half an hour after sending the maintenance report I receive a mail from the above-mentioned head of IT, ranting that the server had crashed and it seemed to have gone unnoticed.
The point is, you can't keep track of both the content and the timestamps in logfiles with several thousand lines. So I developed a component which reads a logfile and calculates the time between two consecutive log entries. Each log line is parsed into a java.util.Date, from which I later get the timestamp for a high-resolution view of the log intervals. I then plot the differences on a line graph, which makes longer pauses between two log lines visible as a big spike relative to the rest of the file.
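The interval calculation itself is small. Here is a sketch of it using java.time.Instant rather than java.util.Date, with invented names (LogGaps, gaps, indexOfLongestGap) and assuming the timestamps have already been parsed and are in file order:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

final class LogGaps {
    // Time between each pair of consecutive log entries.
    static List<Duration> gaps(List<Instant> timestamps) {
        List<Duration> result = new ArrayList<>();
        for (int i = 1; i < timestamps.size(); i++) {
            result.add(Duration.between(timestamps.get(i - 1), timestamps.get(i)));
        }
        return result;
    }

    // Index of the longest gap (the "spike" on the graph), or -1 if there are no gaps.
    static int indexOfLongestGap(List<Duration> gaps) {
        int best = -1;
        for (int i = 0; i < gaps.size(); i++) {
            if (best < 0 || gaps.get(i).compareTo(gaps.get(best)) > 0) {
                best = i;
            }
        }
        return best;
    }
}
```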
My solution now will be to completely throw away the date-half of the String and insert a dummy-Date with a predefined format. The date only has to change if the Hour and minute approach 23:59.
The original date later is presented on the graph with the "fake-data" lying beneath.
I thank all of you for your suggestions and feedback =)
(And I hope my English has been understandable so far ;) )
My suggestion is to store all dates as 'ambiguous' until such time that the ambiguity can be resolved. (This assumes that a particular customer will always supply data in the same format.) As soon as you get a log from a customer for which you can unambiguously identify the date format, you would then be able to retrospectively apply this format to previous files.
To do this, you would need a table mapping each customer to their date format with some marker (e.g. NULL) to indicate that format is not yet established. You will probably also need to create your own date representation such that you can model these ambiguous dates.
So, as an example, if the possible date formats are:
dd/mm/yyyy
mm/dd/yyyy
yyyy/mm/dd
yyyy/dd/mm
Given dates, you should always be able to identify the year (permitting two digit years would make this problem considerably harder). So you should be able to map dates as follows:
25/01/2011 -> UNAMBIGUOUS_DD_MM_YYYY
12/01/2011 -> AMBIGUOUS_XX_XX_YYYY
2011/03/03 -> AMBIGUOUS_YYYY_XX_XX
03/30/2011 -> UNAMBIGUOUS_MM_DD_YYYY
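A minimal sketch of that classification, assuming '/'-separated fields and a four-digit year in either the first or the last position. The two UNAMBIGUOUS_YYYY_* constants are added here by analogy and do not appear in the list above; the class name DateShapes is also invented.

```java
final class DateShapes {
    enum Shape {
        UNAMBIGUOUS_DD_MM_YYYY, UNAMBIGUOUS_MM_DD_YYYY, AMBIGUOUS_XX_XX_YYYY,
        UNAMBIGUOUS_YYYY_MM_DD, UNAMBIGUOUS_YYYY_DD_MM, AMBIGUOUS_YYYY_XX_XX
    }

    static Shape classify(String text) {
        String[] parts = text.split("/");
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected three '/'-separated fields: " + text);
        }
        // The four-digit field is the year; the other two are day and month in unknown order.
        boolean yearFirst = parts[0].length() == 4;
        int first = Integer.parseInt(parts[yearFirst ? 1 : 0]);
        int second = Integer.parseInt(parts[yearFirst ? 2 : 1]);
        if (first > 12) {  // cannot be a month, so it must be the day
            return yearFirst ? Shape.UNAMBIGUOUS_YYYY_DD_MM : Shape.UNAMBIGUOUS_DD_MM_YYYY;
        }
        if (second > 12) { // cannot be a month, so the first field is the month
            return yearFirst ? Shape.UNAMBIGUOUS_YYYY_MM_DD : Shape.UNAMBIGUOUS_MM_DD_YYYY;
        }
        return yearFirst ? Shape.AMBIGUOUS_YYYY_XX_XX : Shape.AMBIGUOUS_XX_XX_YYYY;
    }
}
```

Once a customer's logs yield one unambiguous shape, that shape can be recorded in the customer table and applied to the dates that were stored as ambiguous.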
If possible, you can ask the customers to pass the dateformat string also along with their actual date strings.
i.e. in their log files, they would need to have one more column
..... , '03/11/2011' , 'MM/DD/YYYY' , ...
I think the strategy you are going for (i.e. analysing a bigger set of data) is the best you can get.
From a single line of logfile you will never know if 3/5/11 is the 3rd of may in 2011 or the 5th of march in 2011. (I guess there might also be locales that might interpret this as 11th of may in 2003...)
I had these problems myself some time ago, and I also could only try to introduce some sort of context by either looking at numbers > 12, or at what changes quickest (must be "day"). But you already stated that yourself...

LibSVM Input format

I want to represent a set of labelled instances (data) in a file to be fed in to LibSVM as training data. For the problem mentioned in this question. It will include,
Login date
Login time
Location (country code?)
Day of the week
Authenticity (0 - Non Authentic, 1 - Authentic) - The Label
How can I format this data to be input to the SVM?
Are you asking about the data format or how to convert the data? For the latter you're going to have to experiment to find the right way to do this. The general idea is to convert your data into a nominal or ordinal value attribute. Some of these are simple - #4, #6 - some of these are going to be tough - #1-#3.
For example, you could represent #1 as three attributes of day, month and year, or just one by converting it to a UNIX like timestamp.
The IP is even harder - there's no straightforward way to convert that into a meaningful ordinal value. Using every IP as a nominal attribute might not be useful depending on your problem.
Once you figure this out, convert your data and check the LibSVM docs. The general format is the label followed by index:value pairs, i.e., +1 1:0 2:0 .. etc
I believe there is an unstated assumption in the previous answers. The unstated assumption is that users of libSVM know that they should avoid putting categorical data into the classifier.
For example, libSVM will not know what to do with country codes. If you are trying to predict which visitors are most likely to buy something on your site then you could have problems if USA is between Chad and Niger in your country code list. The bulge from USA will likely skew predictions for the countries located near it.
To fix this I would create one category for each country under consideration (and perhaps an 'other' category). Then for each instance you want to classify, I would set all the country categories to zero except the one to which the instance belongs. (To do this with the libSVM sparse file format, this isn't really a big deal).
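As a sketch of what that looks like in the sparse file format: each known country gets its own feature index (starting at 1), one extra index stands for 'other', and only the feature that is set to 1 is written out. The class name, method name and country list are placeholders, and a real training line would also carry the other attributes such as login time and day of the week.

```java
import java.util.Arrays;
import java.util.List;

final class LibSvmLines {
    // Builds "label index:1" where index identifies the instance's country.
    // Countries not in the list map to the extra 'other' index.
    static String oneHotLine(int label, String country, List<String> knownCountries) {
        int position = knownCountries.indexOf(country);
        int featureIndex = (position >= 0 ? position : knownCountries.size()) + 1;
        return label + " " + featureIndex + ":1";
    }

    public static void main(String[] args) {
        List<String> countries = Arrays.asList("TD", "US", "NE"); // hypothetical country list
        System.out.println(oneHotLine(1, "US", countries));
        System.out.println(oneHotLine(0, "FR", countries));
    }
}
```

Because zero-valued features are simply omitted in the sparse format, the one-hot encoding costs one index:value pair per instance regardless of how many countries there are.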
