LibSVM Input format - java

I want to represent a set of labelled instances (data) in a file to be fed into LibSVM as training data, for the problem mentioned in this question. It will include:
Login date
Login time
Location (country code?)
Day of the week
Authenticity (0 - Non Authentic, 1 - Authentic) - The Label
How can I format this data to be input to the SVM?

Are you asking about the data format or how to convert the data? For the latter you're going to have to experiment to find the right way to do this. The general idea is to convert your data into a nominal or ordinal value attribute. Some of these are simple - #4, #6 - some of these are going to be tough - #1-#3.
For example, you could represent #1 as three attributes of day, month and year, or just one by converting it to a UNIX like timestamp.
The IP is even harder - there's no straightforward way to convert that into a meaningful ordinal value. Using every IP as a nominal attribute might not be useful depending on your problem.
Once you figure this out, convert your data and check the LibSVM docs. The general format is a label followed by index:value pairs, i.e., +1 1:0 2:0 .. etc
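As a rough illustration of that format, here is a minimal sketch that turns one numeric feature vector into a LibSVM line. The class and method names are made up for this example, and it assumes the features have already been converted to numbers (for instance hour of day and day of week). Zero-valued features are left out, which the sparse format allows.

```java
import java.util.StringJoiner;

final class LibSvmLine {
    // Formats one instance as "<label> <index>:<value> ...".
    // Indices are 1-based and ascending; zero-valued features are omitted (sparse format).
    static String format(int label, double[] features) {
        StringJoiner line = new StringJoiner(" ");
        line.add(Integer.toString(label));
        for (int i = 0; i < features.length; i++) {
            if (features[i] != 0.0) {
                line.add((i + 1) + ":" + features[i]);
            }
        }
        return line.toString();
    }
}
```

For example, `LibSvmLine.format(1, new double[] {14.5, 3, 0})` gives `1 1:14.5 2:3.0`.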

I believe there is an unstated assumption in the previous answers. The unstated assumption is that users of libSVM know that they should avoid putting categorical data into the classifier.
For example, libSVM will not know what to do with country codes. If you are trying to predict which visitors are most likely to buy something on your site then you could have problems if USA is between Chad and Niger in your country code list. The bulge from USA will likely skew predictions for the countries located near it.
To fix this I would create one category for each country under consideration (and perhaps an 'other' category). Then for each instance you want to classify, I would set all the country categories to zero except the one to which the instance belongs. (With the libSVM sparse file format, this isn't really a big deal.)
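The one-category-per-country idea could be sketched like this. The country list, the class name and the `firstIndex` parameter are assumptions for illustration only. Because the sparse format lets you skip zeros, only the single active category has to be written.

```java
import java.util.Arrays;
import java.util.List;

final class CountryOneHot {
    // Hypothetical list of countries under consideration; anything else maps to "other".
    static final List<String> COUNTRIES = Arrays.asList("US", "TD", "NE");

    // Returns the sparse LibSVM fragment for a country, e.g. "5:1".
    // firstIndex is the 1-based feature index reserved for the first country slot.
    static String encode(String countryCode, int firstIndex) {
        int slot = COUNTRIES.indexOf(countryCode);
        if (slot < 0) {
            slot = COUNTRIES.size(); // the extra "other" slot after the known countries
        }
        return (firstIndex + slot) + ":1";
    }
}
```

With country features starting at index 4, "TD" becomes `5:1` and an unlisted country becomes `7:1`; all other country slots stay implicitly zero.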

Related

Is there a good way to identify whether there is date information contained in a String

I had this problem of trying to identify whether there is date information contained in a paragraph. So here are the issues:
We don't know where the date string might appear. A paragraph would be something like "We would like set the appointment at Nov. 15th. Then we would .....". So we cannot directly use DateTime.parse()
The format of the date is arbitrary, it can be more formal forms like "Nov. 15th" or "08/21/1988" or "5th in this month".
It is unlikely that I can cover all the cases, given that the date information can take various forms; I just want to cover as many cases as possible. The lightweight solution I can come up with would be regular expressions, I guess. And again, that would be a huge expression. Does anyone know if there are better solutions or available regular expressions for this?
(P.S. I would prefer more lightweight approaches; methods like machine learning might be more general but are not applicable to my task here.)
I'd probably approach it with a regular expression (or multiple) as well.
I'd make the regular expression match regions that look date-like by matching everything around "th", "nd", "st", month/day names and abbreviations, dot/line/slash/colon separated numbers or such things. Experiment with that and see how well it finds dates with a ton of test cases.
Parsing the possible dates is another story. I guess you'd need something as powerful as PHP's strtotime.
Another approach is to just clearly define a big collection of possible formats. Then, when one is detected, you can easily parse it. Feels too brute-force for me though.
As a starting point, there are seven pages of date regexes over at http://regexlib.com. If you don't know which one you're looking for, I would create an array and apply them one at a time. You'll still have a problem with dates like 11/12/2015 vs. 12/11/2015 so some kind of process for clarification is still necessary (e.g., automatically mail back and ask "Do you mean December 11 or November 12?").
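To make the "match date-like regions" idea concrete, here is a small sketch with `java.util.regex`. The three alternatives (slash/dot/dash separated numbers, month name followed by a day, and a bare ordinal such as "5th") are my own illustrative picks, not a complete list, and the class name is made up. It only locates candidates; parsing and disambiguation are still separate steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DateCandidates {
    private static final String MONTH =
        "(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?"
        + "|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)";

    private static final Pattern DATE_LIKE = Pattern.compile(
        "\\b\\d{1,2}[/.-]\\d{1,2}[/.-]\\d{2,4}\\b"                     // 08/21/1988, 21.08.88
        + "|\\b" + MONTH + "\\.?\\s+\\d{1,2}(?:st|nd|rd|th)?\\b"       // Nov. 15th, March 3
        + "|\\b\\d{1,2}(?:st|nd|rd|th)\\b");                           // 5th

    // Returns every substring that looks like a date, in order of appearance.
    static List<String> findAll(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = DATE_LIKE.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```

Running it on "Set it at Nov. 15th, not 08/21/1988 or the 5th." yields "Nov. 15th", "08/21/1988" and "5th". Expect false positives (any ordinal matches), so treat the hits as candidates to be checked.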

Convert timestamp to weekNumber mapReduce

I am looking to preprocess timestamps to obtain the corresponding weeknumbers using mapreduce as the dataset has hundreds of millions of instances that need to be processed. I have so far figured out that the first MR job needs to preprocess and sort each line according to timestamp as the key and the rest of the line as value.
The second job then appends the corresponding date to each timestamp object.
However, I do not know how to perform the third task, which is to create a continuous timeline of week numbers. Meaning, if my minimum timestamp corresponds to the date 03/10/2000, I would like to tag it with the number 10 (indicating that this is the 10th week of the year 2000; let's assume it is, even if it is not in this case). Then let's say the next timestamp corresponds to 02/01/2011: if we assume 52 weeks in the year 2000 and that 02/01/2011 is in the 5th week of 2011, I would like to tag this date as week 57 and not as week 5. I would like to know how to achieve this last step in mapreduce. Assuming I have the following input file:
sorted_timestamp1::date::vals....
sorted_timestamp2::date::vals...
...
...
...
sorted_timestampn::date::vals.....
Simple pseudocode with map and reduce in java would suffice for my case, actual code would be great also.
Thanks in advance for your help!
I think you can separate the two problems:
1) map reduce logic:
What do you really want to calculate with map reduce. Depending on this information you have to choose the key values.
Just a guess from my side: if you want to do some aggregations on a weekly level, the mapper should take each line of input (think of the line number as a key) and write out the data with a new key representing the week (I'll give you some remarks in point 2).
The reducer will then have all data sets with equal week key in access and you can do whatever you want to do / aggregate and write the results out.
2) Week calculations:
Using a java.util.Calendar object you can easily calculate the week of a Timestamp/Date. To get a continuous week value you can calculate the week offset from a minimum reference date. To keep things simple I propose using January 1 of a sensible year. To calculate the difference in weeks you can, for example, use
the Joda package static method Weeks.weeksBetween
If the concrete value of the "week" key is not of special interest you can also use a composite key like
year*100+week
which is much simpler to evaluate and therefore is faster. If you really need the special week timeline think about using the simple key first (just used for aggregations in map reduce) and do the more expensive week timeline evaluations later after the reducer has generated its result with much less data.
Good luck + regards
Martin
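Both key variants from point 2 can be sketched in plain Java. Note that this uses `java.time` (Java 8+) instead of the Joda `Weeks.weeksBetween` call mentioned above, and that the class name, the Monday alignment and the choice of ISO week numbering are my assumptions. In a real job the mapper would call one of these per line and emit the result as the key.

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;
import java.time.temporal.TemporalAdjusters;
import java.time.temporal.WeekFields;

final class WeekKeys {
    // Continuous week index: whole weeks elapsed since the Monday on or before
    // the reference date (assumption: weeks start on Monday).
    static long continuousWeek(LocalDate reference, LocalDate date) {
        LocalDate start = reference.with(TemporalAdjusters.previousOrSame(DayOfWeek.MONDAY));
        return ChronoUnit.WEEKS.between(start, date);
    }

    // Cheaper composite key year*100+week, here using ISO week numbering.
    static int compositeKey(LocalDate date) {
        WeekFields iso = WeekFields.ISO;
        return date.get(iso.weekBasedYear()) * 100 + date.get(iso.weekOfWeekBasedYear());
    }
}
```

With a reference of 2000-01-01, every date in the same Monday-to-Sunday block gets the same index and the index keeps growing across year boundaries. The composite key is ordered too but not gap-free, which is fine if it is only used for grouping.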

Protocol Buffer: How to define Date type?

I'm trying to write a proto file that has a Date field, which is not defined as a type in Protocol Buffers.
I have read the following post but I couldn't figure out a proper solution that suits me :
What the best ways to use decimals and datetimes with protocol buffers?.
I'm trying to convert the proto file to a Java class.
My answer in the linked post relates mainly to protobuf-net; however, since you are coming at this from java I would recommend: keep it simple.
For dates, I would suggest just using the time (perhaps milliseconds) into an epoch (1 Jan 1970 is traditional). For times, just the size in that same unit (milliseconds etc). For decimal, maybe use fixed point simply by scaling - so maybe treat 1.05 as the long 1050 and assert always exactly 3dp (hence fixed point).
This is simple and pragmatic, and covers most common scenarios without making things complicated.
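On the Java side, the conversions for this approach might look like the sketch below. It assumes the proto message simply carries `int64` fields for both values; the class name and the fixed scale of 3 are illustrative choices taken from the 1.05 -> 1050 example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;
import java.time.Instant;

final class ProtoScalars {
    static final int SCALE = 3; // always exactly 3 decimal places (fixed point)

    // Date/time as milliseconds since the epoch (1 Jan 1970 UTC).
    static long toEpochMillis(Instant instant) {
        return instant.toEpochMilli();
    }

    static Instant fromEpochMillis(long millis) {
        return Instant.ofEpochMilli(millis);
    }

    // Decimal as a scaled long; throws ArithmeticException if more than 3 decimals are present.
    static long toFixedPoint(BigDecimal value) {
        return value.setScale(SCALE, RoundingMode.UNNECESSARY).unscaledValue().longValueExact();
    }

    static BigDecimal fromFixedPoint(long scaled) {
        return BigDecimal.valueOf(scaled, SCALE);
    }
}
```

Refusing to round in `toFixedPoint` is a deliberate choice here: silently dropping digits would break the "always exactly 3dp" assertion.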
I'm not sold on this idea, but I'm really not sold on the idea of storing dates (which aren't instants in time) as a timestamp, so here's my suggestion.
Convert your date into a human-readable integer (e.g. 2014-11-3 becomes 20141103) and store this integer value. It contains exactly the data you need, is simple to create and parse, and takes up minimal space. Additionally, it is ordered and has a one-to-one mapping of dates to valid values (granted, invalid numbers are possible, such as 20149999, but these are easy to detect). In contrast, there are approximately 86400 valid timestamps that represent each day.
NB: There is a discussion on DBA SE criticizing this method of date storage, but in that context a specialized date type exists, which obviously isn't the case here.
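A minimal sketch of that integer encoding, assuming `java.time.LocalDate` on the Java side (the class name is made up). Decoding goes through `LocalDate.of`, so impossible values such as 20149999 are rejected with a `DateTimeException`.

```java
import java.time.LocalDate;

final class DateAsInt {
    // 2014-11-03 -> 20141103
    static int encode(LocalDate date) {
        return date.getYear() * 10000 + date.getMonthValue() * 100 + date.getDayOfMonth();
    }

    // 20141103 -> 2014-11-03; throws java.time.DateTimeException for invalid values.
    static LocalDate decode(int value) {
        return LocalDate.of(value / 10000, (value / 100) % 100, value % 100);
    }
}
```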

Distinguishing and Parsing Dates in Java

I know this topic isn't new, though I have to dig it up again.
I already searched the Web numerous times (including some Threads here on stackoverflow) but haven't found a satisfying answer so far.
Amongst others, I checked
Parsing Ambiguous Dates in Java and
http://www.coderanch.com/t/375367/java/java/Handling-Multiple-Date-Formats-Elegantly
I am currently writing a Dateparser in Java, which takes a date and generates a format-String which can be used by SimpleDateFormat for parsing the date.
The dates are parsed via regex (yes, it's an ugly one xD) from Logfiles (IBM Websphere, Tomcat, Microsoft Exchange, ....). Because we have customers in (at least 2) different Locales, there is no way to simply "throw" the String against the parse-method of SimpleDateFormat and expect it to work properly.
Furthermore, there is the problem with the position of day and month (i.e. formats "dd/MM/yyyy" or "MM/dd/yyyy"), which cannot be solved if I don't have at least two datasets where the day digit has changed.
So my current approach would be storing the dateformats for a specific software installed at a specific customer's systems in a database (mysql / xml / ... ) and forcing the user to at least specify customername and softwarename so there is enough context to break down the amount of possibilites the format may be given in.
This "subset" then would be used to try to parse the logfiles of the specified software.
The subset is stored in a HashMap inside a HashMap, in the form
HashMap<Integer, HashMap<String, ...>> map;
The Integer key is the length of the format string, and the String key of the second HashMap specifies a date signature containing only the separating characters
(i.e. ".. ::." for a date with format "dd.MM.yyyy 11:11:11.111").
I also take into account the value of the digits, i.e. a digit > 12 has to be a day because there is no 13th month. But this only works reliably for Date-Strings later than the 12th of a month..
Is there any chance to avoid implementing prior knowledge about the environment out of which the logfile came, thus enabling the parser to reliably parse one date without having to refer a second datestring for comparison?
I'm stuck on that for almost 3 months now -.-
Any suggestions would be very welcome =)
Edit:
Okay guys this thread can be closed. I now came up with a different solution for my specific problem. For those who are interested:
I am writing a Logreader in Java. As we have regular maintenance I have to read many logfiles.
But it's not just the plain text information that's written in the file.
Imagine a server has just crashed, it's Sunday night, and the next person to notice is the head of the customer's IT department. Then on the following day I have to do maintenance and check the logfiles. Judging by content, everything seemed okay, nothing unusual. Half an hour after sending the maintenance report I receive a mail from the above-mentioned head of the IT department, ranting that the server had crashed and it seemed to have gone unnoticed.
The point is, you can't keep track of both content and timestamps for logfiles with several thousand lines. So I developed a component which reads a logfile and calculates the time between two different log entries. Each log line gets parsed into a java.util.Date, from which I later take the timestamp for high resolution regarding the log intervals. I then plot the differences on a line graph, which makes longer timeouts between two log lines visible as a big spike relative to the rest of the file.
My solution now will be to completely throw away the date-half of the String and insert a dummy-Date with a predefined format. The date only has to change if the Hour and minute approach 23:59.
The original date later is presented on the graph with the "fake-data" lying beneath.
I thank all of you for your suggestions and feedback =)
(And I hope my English has been understandable so far ;) )
My suggestion is to store all dates as 'ambiguous' until such time that the ambiguity can be resolved. (This assumes that a particular customer will always supply data in the same format.) As soon as you get a log from a customer for which you can unambiguously identify the date format, you would then be able to retrospectively apply this format to previous files.
To do this, you would need a table mapping each customer to their date format with some marker (e.g. NULL) to indicate that format is not yet established. You will probably also need to create your own date representation such that you can model these ambiguous dates.
So, as an example, if the possible date formats are:
dd/mm/yyyy
mm/dd/yyyy
yyyy/mm/dd
yyyy/dd/mm
Given dates, you should always be able to identify the year (permitting two digit years would make this problem considerably harder). So you should be able to map dates as follows:
25/01/2011 -> UNAMBIGUOUS_DD_MM_YYYY
12/01/2011 -> AMBIGUOUS_XX_XX_YYYY
2011/03/03 -> AMBIGUOUS_YYYY_XX_XX
03/30/2011 -> UNAMBIGUOUS_MM_DD_YYYY
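The mapping above could be sketched roughly as follows, assuming "/" as the only separator and four-digit years as in the examples. The class name, the `INVALID` value and the two `UNAMBIGUOUS_YYYY_...` values are added by analogy and are not part of the list above.

```java
final class DateShape {
    enum Kind {
        UNAMBIGUOUS_DD_MM_YYYY, UNAMBIGUOUS_MM_DD_YYYY, AMBIGUOUS_XX_XX_YYYY,
        UNAMBIGUOUS_YYYY_MM_DD, UNAMBIGUOUS_YYYY_DD_MM, AMBIGUOUS_YYYY_XX_XX,
        INVALID
    }

    static Kind classify(String text) {
        String[] parts = text.split("/");
        if (parts.length != 3) return Kind.INVALID;
        boolean yearFirst = parts[0].length() == 4;
        boolean yearLast = parts[2].length() == 4;
        if (yearFirst == yearLast) return Kind.INVALID; // exactly one 4-digit year expected
        int a, b; // the two non-year fields, in order of appearance
        try {
            a = Integer.parseInt(yearFirst ? parts[1] : parts[0]);
            b = Integer.parseInt(yearFirst ? parts[2] : parts[1]);
        } catch (NumberFormatException e) {
            return Kind.INVALID;
        }
        if (a < 1 || b < 1 || a > 31 || b > 31 || (a > 12 && b > 12)) return Kind.INVALID;
        // A value above 12 cannot be a month, so it must be the day.
        if (a > 12) return yearFirst ? Kind.UNAMBIGUOUS_YYYY_DD_MM : Kind.UNAMBIGUOUS_DD_MM_YYYY;
        if (b > 12) return yearFirst ? Kind.UNAMBIGUOUS_YYYY_MM_DD : Kind.UNAMBIGUOUS_MM_DD_YYYY;
        return yearFirst ? Kind.AMBIGUOUS_YYYY_XX_XX : Kind.AMBIGUOUS_XX_XX_YYYY;
    }
}
```

Once one line from a customer classifies as unambiguous, that format can be stored against the customer and applied to the lines that were ambiguous.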
If possible, you can ask the customers to pass the dateformat string also along with their actual date strings.
i.e. in their log files, they would need to have one more column
..... , '03/11/2011' , 'MM/DD/YYYY' , ...
I think the strategy you are going for (i.e. analysing a bigger set of data) is the best you can get.
From a single line of a logfile you will never know if 3/5/11 is the 3rd of May 2011 or the 5th of March 2011. (I guess there might also be locales that would interpret this as the 11th of May 2003...)
I had these problems myself some time ago, and I also could only try to introduce some sort of context by either looking at numbers > 12, or at what changes quickest (must be "day"). But you already stated that yourself...

Natural language processing to recognise numerical data

My requirement is to recognize and extract numerical data from a natural language sentence (English only) in response to queries. Platform is Java. For example if the user query is "What is the height of mount Everest" and we have a paragraph as:
In 1856, the Great Trigonometric Survey of British India established the first published height of Everest, then known as Peak XV, at 29,002 ft (8,840 m). In 1865, Everest was given its official English name by the Royal Geographical Society upon recommendation of Andrew Waugh, the British Surveyor General of India at the time, who named it after his predecessor in the post, and former chief, Sir George Everest.[4] Chomolungma had been in common use by Tibetans for centuries, but Waugh was unable to propose an established local name because Nepal and Tibet were closed to foreigners. (Pasted from wikipedia)
For a user query "Height of mount Everest" from the paragraph I need to get 29002 ft or 8840 m as the answer. Can anyone please suggest any possible ways of doing it in Java? Are there any open source libraries for the same?
Obviously, doing this well is extremely difficult. If it's an assignment though, then I'm guessing the expectation is a bit lower. Here are some thoughts to hopefully get you started:
I'd split the problem into 2 parts: parsing the question block and then parsing the answer block. From the question block, you need to know 2 pieces of information: the noun of what you're searching for, and the type of the answer. In this case the noun is Everest and the type is height. "Types" of data you can build a dictionary for fairly quickly to search your input string for (e.g. "height", "weight", "distance", "age"). The nouns are more difficult, so I'd say to just assume that every non-type in the question is a potential noun, perhaps removing a dictionary of known non-nouns (such as "at", "the", "of" etc.).
Once you've identified the noun and type from the question, you can begin scanning your answer block. I'd begin by breaking that up into sentences. Then scan each sentence for each of your nouns. If one is found in that sentence, you need to scan the sentence again for numbers (taking into account possible whitespace or comma delimiting). Finally, you need to look "around" any numbers you find for a measurement type. So in this case, your "type" that we parsed from the question was "height". You would need to create a mapping of types to measurements, so "height" would map "km, ft, in, cm, m" etc. If the number has one of these types around it, then return the number and measurement type as the answer.
Hope that gets you started. As stated above, this is not intended to be a robust, commercial solution. It's homework-level.
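The answer-block scan could be sketched like this, assuming the noun and type have already been pulled out of the question. The class name and the unit table are illustrative (only "height" and "weight" are filled in), and the sentence splitter is deliberately naive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MeasurementFinder {
    // Hypothetical mapping from answer type to the units that may follow a number.
    static final Map<String, List<String>> UNITS = new HashMap<>();
    static {
        UNITS.put("height", Arrays.asList("km", "ft", "in", "cm", "m"));
        UNITS.put("weight", Arrays.asList("kg", "lb", "g"));
    }

    // Returns "number unit" pairs found in sentences that mention the noun.
    static List<String> find(String noun, String type, String text) {
        List<String> results = new ArrayList<>();
        List<String> units = UNITS.get(type);
        if (units == null) return results;
        // A number with optional thousands separators, optional whitespace, then a unit.
        Pattern p = Pattern.compile(
            "(\\d[\\d,]*(?:\\.\\d+)?)\\s*(" + String.join("|", units) + ")\\b");
        for (String sentence : text.split("(?<=[.!?])\\s+")) {
            if (!sentence.toLowerCase(Locale.ROOT).contains(noun.toLowerCase(Locale.ROOT))) {
                continue;
            }
            Matcher m = p.matcher(sentence);
            while (m.find()) {
                results.add(m.group(1).replace(",", "") + " " + m.group(2));
            }
        }
        return results;
    }
}
```

On a sentence like "... the height of Everest at 29,002 ft (8,840 m)." with noun "Everest" and type "height", this returns "29002 ft" and "8840 m". Short units such as "in" will also match ordinary words after a number, so a real version would need extra checks.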
