Look for multiple items within a string - Java

I'm parsing out a bunch of employee incident reports for reporting purposes.
The incident reports themselves are free text, and I have to categorize the injuries by body location. I'm trying to avoid if{}elseif{}elseif{}....}else{}.
Example incident reports:
Employee slipped on wet stairs and injured her knee and right arm, and struck her head on the handrail.
Should add "knee", "arm", and "head" to affected area.
Employee was lifting boxes without approved protective equipment resulting in a back strain.
Should add "back" to affected area.
While attempting to unjam copier, employee got right index finger caught in machinery resulting in a 1-inch cut.
Should add "finger" to affected area.
Right now, I have:
private static StaffInjuryData setAffectedAreas(String incident, StaffInjuryData sid){
    incident = incident.toUpperCase(); //eliminate case issues
    if(incident.contains("HEAD")){
        sid.addAffectedArea("HEAD");
    }else if(incident.contains("FACE")){
        sid.addAffectedArea("FACE");
    }else if(incident.contains("EYE")){
        sid.addAffectedArea("EYE");
    }else if(incident.contains("NOSE")){
        sid.addAffectedArea("NOSE");
    }
    //etc, etc, etc
    return sid;
}
Is there an easier/more efficient way than if-elseif ad infinitum to do this?

One approach is to construct a regular expression from the individual body parts, use it for searching the string, and add the individual matches to the list:
Pattern bodyParts = Pattern.compile("\\b(head|face|eye|nose)\\b", Pattern.CASE_INSENSITIVE);
Use of \b on both ends prevents partial matches, e.g. finding "head" in text containing "forehead", or "eye" inside "eyelid".
This Q&A explains how to search text using regex in Java.
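A minimal sketch of how that pattern could drive setAffectedAreas (the alternation below is illustrative; extend it with whatever body parts you track):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

private static final Pattern BODY_PARTS = Pattern.compile(
        "\\b(head|face|eye|nose|knee|arm|back|finger)\\b", Pattern.CASE_INSENSITIVE);

private static StaffInjuryData setAffectedAreas(String incident, StaffInjuryData sid) {
    Matcher m = BODY_PARTS.matcher(incident);
    while (m.find()) {                                  // every mention, not just the first
        sid.addAffectedArea(m.group(1).toUpperCase());  // normalize to upper case as before
    }
    return sid;
}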

Add a Set<String> parameter through which you provide all the expected keywords:
private static StaffInjuryData setAffectedAreas(String incident, StaffInjuryData sid, Set<String> keywords){
    incident = incident.toUpperCase(); //eliminate case issues
    for (String keyword : keywords){
        if(incident.contains(keyword)){
            sid.addAffectedArea(keyword);
        }
    }
    return sid;
}

Perhaps create a list containing all parts {neck, shoulder, back, etc.} and then check whether the entry contains any of those values?

You might be able to create some sort of container (like a list or set) with all of the different parts (i.e. Head, Face, Eye, Nose, Finger, etc.), split the string using the .split() method, and then compare each word of that string to each item in your container.
This might be easier, but could possibly be less efficient.
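A rough sketch of that container-plus-split idea (the set of parts is illustrative, and splitting on \W+ discards punctuation):

import java.util.Set;

Set<String> parts = Set.of("HEAD", "FACE", "EYE", "NOSE", "FINGER", "KNEE", "ARM", "BACK");
for (String word : incident.toUpperCase().split("\\W+")) { // split on anything that isn't a word character
    if (parts.contains(word)) {
        sid.addAffectedArea(word);
    }
}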

Related

Use Regex in Java to split up topleveldomains that are stored in a variable

I'm currently working on a system for a game, to prevent publicity from other servers.
I have come up with some RegEx that allowed me to block a lot of publicity, but these **** keep adapting, of course. They are using domains of their servers, that consist of things like "builder.de", "myserver.com" and stuff, paired with some promises.
In order to track them, I think the only way is to get a decent filter for the domains, since they can endlessly change their promises, but buying expensive domains will at least strongly annoy them. We have gotten so quick at blocking domains that they seem to rather find ways to sneak through our filter.
Now they have come up with a new thing, posting domains like this: "builder . *d*e". My Preprocessing of the message turns it into "builder d e". My blocked domains and topleveldomains are stored in two lists of strings, and I iterate over them:
public static boolean checkForAdverts(String check) {
    check = preprocessString(check);
    for (String domain : DataManager.domains) {
        for (String topleveldomain : DataManager.topleveldomains) {
            if (DataManager.isPairInWhitelist(domain, topleveldomain)) continue;
            if (CENSORED) { //Sorry, in case smb of these people find this post...
                return true;
            }
        }
    }
    return false;
}
I have this variable tldomain that contains "de", but I want to allow one other character in it. If I was dealing with a plain string, I would just have done "d(.)?e", but that is impossible, since I never know what is inside my tldomain. Also, these top-level domains may consist of three letters as well, e.g. "com". I want to get a match if there are 1 or 2 characters "hidden" somewhere in my top-level domain.
So I want:
Matches: "de" "dee" "dxe" "d e", or "coom" "com" "cxom" "cdom" "c om" "co m" "c o m"
I have no idea how to do that if my top-level domain is stored in a variable. How can I do that?
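One possible sketch (not an answer from that thread): quote each letter of the stored top-level domain and allow up to two arbitrary characters between consecutive letters. The helper name buildTldPattern is made up for illustration:

import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Builds a pattern for a TLD stored in a variable, tolerating up to two
// "filler" characters between consecutive letters (so "de" also matches "d e").
static Pattern buildTldPattern(String tld) {
    String body = tld.chars()
            .mapToObj(c -> Pattern.quote(String.valueOf((char) c)))
            .collect(Collectors.joining(".{0,2}?"));
    return Pattern.compile(body, Pattern.CASE_INSENSITIVE);
}

// buildTldPattern("de").matcher("builder d e").find()   -> true
// buildTldPattern("com").matcher("c o m").find()        -> true

Note this allows up to two fillers between each pair of letters rather than two in total, which is usually acceptable for a blocklist but worth keeping in mind.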

Trying to use an Esper Lambda Expression

I am trying to expand on an example from the Esper documentation for the where enumeration method, and I am having issues. Here is the example in question:
select items.where(i => i.location.x = 0 and i.location.y = 0) as zeroloc
from LocationReport
What I would like to do seems pretty simple. Instead of selecting items that match this expression :
I want to select LocationReports that contain at least one item that matches the expression.
Do it over a time_batch window (a non-batched time window is a possibility as well).
So every n number of seconds I would receive a collection of LocationReports in which each report contains at least one zero location in its items List.
For Reference, here is the structure of the Java objects used in the Esper example:
public class LocationReport { List items; ... }
public class Item {
    String assetId;           // passenger or luggage asset id
    Location location;        // (x,y) location
    boolean luggage;          // true if this item is a luggage piece
    String assetIdPassenger;  // if the item is luggage, contains the associated passenger
    ...
}
public class Location { int x; int y; ... }
Background detail: Assume LocationReport is the actual object I am interested in... Using EPL like in the example above, the where logic works, but the problem is that, in returning only the items member, I do not see the LocationReport it came from, which contains other properties besides items that my UpdateListener needs.
Also, probably not relevant, but in my case, I am receiving a high rate of messages where many LocationReports are duplicates (or close enough to be considered duplicates), and my where clause will need to make that determination and only forward "new" messages.
Thanks!
You could add the "*" to the select and that gives you the event objects alongside. select *, items.where(...) from LocationReport
You could add "output every N seconds" to output. Add "#time(...)" for the time window.

Best algorithm for analyzing unique sentences and filtering them?

I am in the middle of writing some code to filter sentences into different groups.
The sentences are formed from the descriptions of incident tickets that my service desk has processed.
I have to filter them based on 5 categories: Laptop, Telephony, Network, Printer, Application.
An example of a description from the Application category is: "Please can you install CMS on XXXX YYYYYYY laptop"
I understand that it is impossible to get this perfect, but I was wondering what the best way to tackle this is. As you can see from the example, it falls into the Application category but contains the keyword "laptop".
If there's any more information I can provide you with, please let me know. Every little helps. Thanks
Maintain a different list or queue for each category.
When you receive a sentence, check for keyword occurrences in that sentence and add/push it to the appropriate list/queue.
You can maintain a map which tells you which list/queue to use for which keyword.
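A minimal sketch of that map-plus-lists idea (the keywords and category names are examples only):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TicketClassifier {
    // Which category each keyword points to (illustrative keywords)
    private final Map<String, String> categoryByKeyword = Map.of(
            "install", "Application",
            "laptop", "Laptop",
            "printer", "Printer",
            "vpn", "Network",
            "phone", "Telephony");

    // One list of sentences per category
    private final Map<String, List<String>> ticketsByCategory = new HashMap<>();

    void classify(String sentence) {
        String lower = sentence.toLowerCase();
        for (Map.Entry<String, String> e : categoryByKeyword.entrySet()) {
            if (lower.contains(e.getKey())) {
                ticketsByCategory
                        .computeIfAbsent(e.getValue(), k -> new ArrayList<>())
                        .add(sentence);
                return; // first matching keyword wins in this sketch
            }
        }
    }
}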
Interesting question! As seen in your example, there can be multiple keywords within the same sentence, making it difficult to decipher which category the sentence will belong to.
In order to get around this, I would suggest possibly using a separate priority queue for each category, containing keywords for each category in order of priority.
For example, you would have a priority queue of keywords for the Application category, and (within that priority queue) "install" would be of higher priority than "laptop" or "computer", because "install" is more closely related to applications than "laptop".
In your algorithm for choosing which category a sentence is part of, I would do a round-robin search through all five priority queues until a match is found - the highest priority match out of all five categories takes the sentence. This is one possible solution I can think of.
NOTE: For this to work properly, of course it is important to pick and choose carefully which keywords go into which categories; for example, in the Laptop category, it may seem natural to have "laptop" be the highest priority keyword - however, this would cause lots of collisions because laptop will probably be a very commonly used word in sentences. You should have very specific keywords pertaining to each category, rather than having broad/surface level keywords like "laptop" (or have "laptop" be a very low priority keyword).
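One way to sketch that priority idea without literal PriorityQueues is to give each keyword a weight per category and let the highest-weighted matching keyword decide. The weights and keywords below are invented for illustration, and `sentence` is assumed to hold the ticket description:

import java.util.Map;

// category -> (keyword -> priority); higher number = stronger signal
Map<String, Map<String, Integer>> weights = Map.of(
        "Application", Map.of("install", 10, "cms", 8, "laptop", 1),
        "Laptop", Map.of("screen", 5, "laptop", 2),
        "Printer", Map.of("printer", 10, "toner", 8));

String bestCategory = null;
int bestScore = Integer.MIN_VALUE;
String lower = sentence.toLowerCase();
for (Map.Entry<String, Map<String, Integer>> cat : weights.entrySet()) {
    for (Map.Entry<String, Integer> kw : cat.getValue().entrySet()) {
        if (lower.contains(kw.getKey()) && kw.getValue() > bestScore) {
            bestScore = kw.getValue();
            bestCategory = cat.getKey();
        }
    }
}
// bestCategory stays null if no keyword matched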
This is actually a machine learning problem (text categorization) that you could solve using several algorithms: support vector machines, multinomial logistic regression, naive bayes and more.
There are many libraries that will help you; here is one for Java:
http://alias-i.com/lingpipe/demos/tutorial/classify/read-me.html
Python also has a very good library:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html#training-a-classifier
If you want to take this approach, you are going to need a training dataset, meaning that you need to manually label a set of documents that the algorithm will use to automatically learn which keywords are important.
Hope it helps!
If all you need to do is receive these sentences and apply some logic to them,
why not just filter them with a regex?
See, for example:
Regex to find a specific word in a string in java
e.g.
List<String> laptopList = new ArrayList<String>();
for (String item : sentenceList) {
    if (item.matches(".*\\blaptop\\b.*")) {
        laptopList.add(item);
    }
}
You are looking at the keyword "laptop". But there is also the keyword "install", which primarily indicates installation of some application.
So you can try like
if( sentence.contains("install") || (sentence.contains("install") && sentence.contains("laptop")) )
{
    applicationTickets.add(sentence);
}
else if( sentence.contains("laptop") || /* other conditions */ )
{
    laptopTickets.add(sentence);
}
else if( ... )
    ..........
else if( ... )
    ..........
If you observe the code, the Application category is placed first because its sentences can also match Laptop terms; checking it first keeps such a sentence from falling into the Laptop category.
You can use loops to check all the conditions. The keywords can be added to a specific list for each category.

What is the optimized implementation of conflict graph for combinatorial auction?

Given m bids that may share a subset of n items, I want to find the best way to store conflicts among bids and check whether two bids are conflicting (i.e., they share at least one item). So far, I have tried a matrix of dimension m x m, which isn't optimal. My problem may have thousands of bids, so I frequently get a "Java out of memory" error when I use the square-matrix implementation. Then I tried a triangular matrix (because the original conflict matrix is symmetric), but that didn't get rid of the memory issue!
Any better idea?
The best way to code?
Thanks.
One solution is to use Guava's Table combined with a List<Bid> containing all bids. Note: the code below uses a lot of Guava goodness:
final List<Bid> allBids = Lists.newArrayList();
final Table<Bid, Bid, Boolean> conflicts = HashBasedTable.create(); // HashBasedTable rejects null values, so use a Boolean marker rather than Void
When you store a new bid, you'll do (this supposes that Bid has an .items() method which returns a Set<Item>):
for (final Bid bid: allBids)
    if (!Sets.intersection(newBid.items(), bid.items()).isEmpty())
        conflicts.put(newBid, bid, Boolean.TRUE); // any non-null marker works here
allBids.add(newBid);
Then when you want to detect a conflict between two bids:
return conflicts.contains(bid1, bid2) || conflicts.contains(bid2, bid1);
You could even store the conflicting items by replacing the marker values with Set<Item>s instead and putting the result of Sets.intersection() in them:
// Table becomes Table<Bid, Bid, Set<Item>>
Sets.SetView<Item> set;
for (final Bid bid: allBids) {
    set = Sets.intersection(newBid.items(), bid.items());
    if (!set.isEmpty())
        conflicts.put(newBid, bid, set.immutableCopy());
}
allBids.add(newBid);
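If even a table of conflicting pairs is too big to keep in memory, a different sketch (plain JDK, no Guava) is to store, per bid, just a BitSet of the item indices it touches and test conflicts on demand; that needs only n bits per bid instead of anything quadratic in the number of bids. The itemBits() accessor below is hypothetical:

import java.util.BitSet;

// Assumes items are numbered 0..n-1 and each Bid can expose its items as a BitSet.
static boolean conflicts(Bid a, Bid b) {
    // true if the two bids share at least one item index
    return a.itemBits().intersects(b.itemBits());
}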

Using Java to parse a CSV, then save it in a 2D array

Okay, so I am working on a game based on a trading card game in Java. I scraped all of the game pieces' "information" into a CSV file where each row is a game piece and each column is a type of attribute for that piece. I have spent hours upon hours writing code with BufferedReader etc., trying to extract the information from my CSV file into a 2D array, but to no avail. My CSV file is linked here: http://dl.dropbox.com/u/3625527/MonstersFinal.csv I have one year of computer science under my belt, but I still cannot figure out how to do this.
So my main question is: how do I place this into a 2D array in a way that keeps the rows and columns?
Well, as mentioned before, some of your strings contain commas, so initially you're starting from a bad place, but I do have a solution and it's this:
1) If possible, rescrape the site, but perform a simple encoding operation when you do. You'll want to do something like what you'll notice tends to be done in autogenerated XML files which contain HTML: reserve a 'control character' (a printable character works best here, for reasons of debugging and... well... sanity) that, once encoded, is never meant to be read directly as an instance of itself. Ampersand is what I like to use because it's uncommon enough but still printable, but really what character you want to use is up to you. What I would do is write the program so that, at every instance of ",", that comma would be replaced by "&c" before being written to the CSV, and at every instance of an actual ampersand on the site, that "&" would be replaced by "&a". That way, you would never have the issue of accidentally separating a single value into two in the CSV, and you could simply decode each value after you've separated them by the method I'm about to outline next.
2) Assuming you know how many columns will be in each row, you can use the StringTokenizer class (look it up - it's awesome and built into Java. A good place to look for information is, as always, the Java Tutorials) to give you the values you need, one token at a time.
It works like this: you pass in a string and a delimiter (in this case, the delimiter would be ','), and it spits out all the substrings which were separated by those commas. If you know how many pieces there are in total from the get-go, you can instantiate a 2D array at the beginning and just plug in each row the StringTokenizer gives you. If you don't, it's still okay, because you can use an ArrayList. An ArrayList is nice because it's a higher-level abstraction of an array that automatically asks for more memory, such that you can continue adding to it and know that retrieval time will always be constant. However, if you plan on dynamically adding pieces, and doing that more often than retrieving them, you might want to use a LinkedList instead, because it has a linear retrieval time but a much better relation than an ArrayList for add-remove time. Or, if you're awesome, you could use a SkipList instead. I don't know if they're implemented by default in Java, but they're awesome. Fair warning, though: the cost of speed on retrieval, removal, and placement comes with increased overhead in terms of memory. Skip lists maintain a lot of pointers.
If you know there should be the same number of values in each row, and you want them to be positionally organized, but for whatever reason your scraper doesn't handle the lack of a value for a row, and just doesn't put that value, you've some bad news... it would be easier to rewrite the part of the scraper code that deals with the lack of values than it would be to write a method that interprets varying length arrays and instantiates a Piece object for each array. My suggestion for this would again be to use the control character and fill empty columns with &n (for 'null') to be interpreted later, but then specifics are of course what will individuate your code and coding style so it's not for me to say.
edit: I think the main thing you should focus on is learning the different standard library datatypes available in Java, and maybe learn to implement some of them yourself for practice. I remember implementing a binary search tree - not an AVL tree, but alright. It's fun enough, good coding practice, and, more importantly, necessary if you want to be able to do things quickly and efficiently. I don't know exactly how Java implements arrays, because the definition is "a contiguous section of memory", yet you can allocate memory for them in Java at runtime using variables... but regardless of the specific Java implementation, arrays often aren't the best solution. Also, knowing regular expressions makes everything much easier. For practice, I'd recommend working them into your Java programs, or, if you don't want to have to compile and jar things every time, your bash scripts (if you're using *nix) and/or batch scripts (if you're using Windows).
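To make the control-character encoding from suggestion 1) concrete, an encode/decode pair might look like this (using the "&c"/"&a" codes proposed there; note the replacements must be undone in the opposite order):

// Encode while scraping: escape real ampersands first, then commas
static String encodeField(String raw) {
    return raw.replace("&", "&a").replace(",", "&c");
}

// Decode after splitting the CSV line on commas: undo in the opposite order
static String decodeField(String encoded) {
    return encoded.replace("&c", ",").replace("&a", "&");
}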
I think the way you've scraped the data makes this problem more difficult than it needs to be. Your scrape seems inconsistent and difficult to work with given that most values are surrounded by quotes inconsistently, some data already has commas in it, and not each card is on its own line.
Try re-scraping the data in a much more consistent format, such as:
R1C1|R1C2|R1C3|R1C4|R1C5|R1C6|R1C7|R1C8
R2C1|R2C2|R2C3|R2C4|R2C5|R2C6|R2C7|R3C8
R3C1|R3C2|R3C3|R3C4|R3C5|R3C6|R3C7|R3C8
R4C1|R4C2|R4C3|R4C4|R4C5|R4C6|R4C7|R4C8
A/D Changer|DREV-EN005|Effect Monster|Light|Warrior|100|100|You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
Where each line is definitely its own card (as opposed to the example CSV you posted, with new lines in odd places) and the delimiter is never used in a data field as something other than a delimiter.
Once you've gotten the input into a consistently readable state, it becomes very simple to parse through it:
BufferedReader br = new BufferedReader(new FileReader(new File("MonstersFinal.csv")));
String line = "";
ArrayList<String[]> cardList = new ArrayList<String[]>(); // Use an ArrayList because we might not know how many cards we need to parse.
while((line = br.readLine()) != null) { // Read a single line from the file until there are no more lines to read
    StringTokenizer st = new StringTokenizer(line, "|"); // "|" is the delimiter of our input file.
    String[] card = new String[8]; // Each card has 8 fields, so we need room for the 8 tokens.
    for(int i = 0; i < 8; i++) { // For each token in the line that we've read:
        String value = st.nextToken(); // Read the token
        card[i] = value; // Place the token into the ith "column"
    }
    cardList.add(card); // Add the card's info to the list of cards.
}
for(int i = 0; i < cardList.size(); i++) {
    for(int x = 0; x < cardList.get(i).length; x++) {
        System.out.printf("card[%d][%d]: ", i, x);
        System.out.println(cardList.get(i)[x]);
    }
}
Which would produce the following output for my given example input:
card[0][0]: R1C1
card[0][1]: R1C2
card[0][2]: R1C3
card[0][3]: R1C4
card[0][4]: R1C5
card[0][5]: R1C6
card[0][6]: R1C7
card[0][7]: R1C8
card[1][0]: R2C1
card[1][1]: R2C2
card[1][2]: R2C3
card[1][3]: R2C4
card[1][4]: R2C5
card[1][5]: R2C6
card[1][6]: R2C7
card[1][7]: R3C8
card[2][0]: R3C1
card[2][1]: R3C2
card[2][2]: R3C3
card[2][3]: R3C4
card[2][4]: R3C5
card[2][5]: R3C6
card[2][6]: R3C7
card[2][7]: R3C8
card[3][0]: R4C1
card[3][1]: R4C2
card[3][2]: R4C3
card[3][3]: R4C4
card[3][4]: R4C5
card[3][5]: R4C6
card[3][6]: R4C7
card[3][7]: R4C8
card[4][0]: A/D Changer
card[4][1]: DREV-EN005
card[4][2]: Effect Monster
card[4][3]: Light
card[4][4]: Warrior
card[4][5]: 100
card[4][6]: 100
card[4][7]: You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
I hope re-scraping the information is an option here and I hope I haven't misunderstood anything; Good luck!
On a final note, don't forget to take advantage of OOP once you've gotten things worked out. A Card class could make working with the data even simpler.
I'm working on a similar problem for use in machine learning, so let me share what I've been able to do on the topic.
1) If you know before you start parsing the row - whether it's hard-coded into your program or whether you've got some header in your file that gives you this information (highly recommended) - how many attributes per row there will be, you can reasonably split it by comma; for example, the first attribute will be RowString.substring(0, RowString.indexOf(',')), the second attribute will be the substring from the first comma to the next comma (writing a function to find the nth instance of a comma, or simply chopping off bits of the string as you go through it, should be fairly trivial), and the last attribute will be RowString.substring(RowString.lastIndexOf(',') + 1) (the + 1 skips the comma itself). The String class's methods are your friends here.
2) If you are having trouble distinguishing between commas which are meant to separate values, and commas which are part of a string-formatted attribute, then (if the file is small enough to reformat by hand) do what Java does - represent characters with special meaning that are inside of strings with '\,' rather than just ','. That way you can search for the index of ',' and not '\,' so that you will have some way of distinguishing your characters.
3) As an alternative to 2), CSVs (in my opinion) aren't great for strings, which often include commas. There is no real common format to CSVs, so why not make them colon-separated-values, or dash-separated-values, or even triple-ampersand-separated-values? The point of separating values with commas is to make it easy to tell them apart, and if commas don't do the job there's no reason to keep them. Again, this applies only if your file is small enough to edit by hand.
4) Looking at your file for more than just the format, it becomes apparent that you can't do it by hand. Additionally, it would appear that some strings are surrounded by triple double quotes ("""string""") and some are surrounded by single double quotes ("string"). If I had to guess, I would say that anything enclosed in quotes is a single attribute - there are, for example, no pairs of quotes that start in one attribute and end in another. So I would say that you could:
Make a class with a method to break a string into each comma-separated fields.
Write that method such that it ignores commas preceded by an odd number of double quotes (this way, if the quote-pair hasn't been closed, it knows that it's inside a string and that the comma is not a value separator). This strategy, however, fails if the creator of your file did something like enclose some strings in double double quotes (""string""), so you may need a more comprehensive approach.
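A sketch of such a method, tracking whether the scanner is inside a quoted section (it skips commas while an odd number of double quotes has been seen; it keeps the quote characters in the fields and does not handle the double-double-quote edge case mentioned above):

import java.util.ArrayList;
import java.util.List;

// Splits one CSV line on commas, ignoring commas that appear inside double quotes.
static List<String> splitCsvLine(String line) {
    List<String> fields = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    boolean insideQuotes = false;
    for (char c : line.toCharArray()) {
        if (c == '"') {
            insideQuotes = !insideQuotes;   // toggle on every quote
            current.append(c);              // keep the quote character in the field
        } else if (c == ',' && !insideQuotes) {
            fields.add(current.toString()); // field boundary
            current.setLength(0);
        } else {
            current.append(c);
        }
    }
    fields.add(current.toString());         // last field
    return fields;
}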
