Can someone explain how I should structure these classes? I have made an attempt but am not sure it is correct.
https://imgur.com/a/7inqMQ5 - UML
https://imgur.com/a/slb80uR - Dataset
A. List all crime types (by name) for which you have data.
B. For a given crime type and LSOA, display details of all crimes, for which you have data.
C. For a given LSOA, determine how many crimes are presently “Under investigation” or “Investigation complete; no suspect identified” for a given month.
D. Find the LSOA with the highest average total crime frequency
E. Find the LSOA with the highest average frequency of crimes under investigation.
F. For a given crime, find which LSOA has the most occurrences.
Your UML for the first class is great, although if the strings for a property are repeated or come from a fixed list, I'd recommend looking into enums.
The second class is also along the right lines but I've made an example one: see this which you can add to/rename slightly to better fit your requirements.
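To illustrate the enum suggestion, here is a minimal sketch. The enum name Outcome, its constant names and the fromLabel helper are all made up for illustration; only the two labels come from task C, and the real dataset may well contain other values.

```java
import java.util.Arrays;
import java.util.Optional;

// Hypothetical enum for the outcome column; only the two labels from task C are listed.
enum Outcome {
    UNDER_INVESTIGATION("Under investigation"),
    NO_SUSPECT_IDENTIFIED("Investigation complete; no suspect identified");

    private final String label;

    Outcome(String label) {
        this.label = label;
    }

    String label() {
        return label;
    }

    // Maps raw text from the dataset to a constant; empty if the text is missing or not recognised.
    static Optional<Outcome> fromLabel(String text) {
        if (text == null) {
            return Optional.empty();
        }
        return Arrays.stream(values())
                .filter(o -> o.label.equalsIgnoreCase(text.trim()))
                .findFirst();
    }
}
```

With something like this, the crime class can hold an Outcome field instead of a String, so a typo in a label shows up when the data is parsed rather than later as a wrong count.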
Related
I am having trouble getting through my Java labs. I don't have the best instructor and they are due tomorrow. If someone could help me and give a brief summary of the code, it would be a great help. I have posted the lab description below and attached the code of the runner file in the comments.
Lab Description: Write several array manipulation methods. One method will sum up a section of a provided array, another method will count up how many of a certain number occur in the array, and the last method will remove all of a certain value from the array.
I have tried a for loop to iterate through the array and select the elements, but I am stumped on how to select the range and add them up.
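For reference, the three methods from the lab description could look roughly like this. This is only a sketch: the runner file is not shown here, so the class name, the method names, the use of int arrays and the inclusive start/end range are all assumptions.

```java
// Hypothetical helper class; names and signatures are guesses, not the lab's actual API.
final class ArrayTools {
    // Sums the elements from index start to index end, both inclusive (assumed convention).
    static int sumSection(int[] values, int start, int end) {
        int sum = 0;
        for (int i = start; i <= end; i++) {
            sum += values[i];
        }
        return sum;
    }

    // Counts how many times target occurs in the array.
    static int countValue(int[] values, int target) {
        int count = 0;
        for (int v : values) {
            if (v == target) {
                count++;
            }
        }
        return count;
    }

    // Returns a new, shorter array with every occurrence of target left out.
    static int[] removeValue(int[] values, int target) {
        int[] result = new int[values.length - countValue(values, target)];
        int next = 0;
        for (int v : values) {
            if (v != target) {
                result[next++] = v;
            }
        }
        return result;
    }
}
```

The range is selected simply by starting the loop at start instead of 0 and stopping at end instead of the array length.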
I have a List<String[]> of customer records in Java (from a database). I know from manually eyeballing the data that 25%+ are duplicates.
The duplicates are far from exact though. Sometimes they have different zips, but the same name and address. Other times the address is missing completely, etc...
After a day of research, I'm still really stumped as to how to even begin to attack this problem.
What are the "terms" that I should be googling for that describe this area (from a solve-this-in-Java perspective)? And I don't suppose there is a fuzzymatch.jar out there that makes it all just too easy?
I've done similar systems before for matching place information and people information. These are complex objects with many features and figuring out whether two different objects describe the same place or person is tricky. The way to do it is to break it down to the essentials.
Here's a few things that you can do:
0) If this is a one-off, load the data into OpenRefine and fix things interactively. At best this solves your problem; at worst it will show you where your possible matches are.
1) There are several ways you can compare strings. Basically they differ in how reliable they are with respect to false positives and false negatives. A false positive is when it matches when it shouldn't have. A true positive is when it should match and does. String equals will not produce false positives but will miss a lot of potential matches due to slight variations. Levenshtein with a small factor is slightly better. N-grams produce a lot of matches, but many of them will be false. There are a few more algorithms; take a look at e.g. the OpenRefine code to find various ways of comparing and clustering strings. Lucene implements a lot of this stuff in its analyzer framework but is a bit of a beast to work with if you are not very familiar with its design.
2) Separate the process of comparing stuff from the process of deciding whether you have a match. What I did in the past was qualify my comparisons using a simple numeric score, e.g. this field matched exactly (100) but that field was a partial match (75) and that field did not match at all (0). The resulting vector of qualified comparisons, e.g. (100, 75, 0, 25), can be compared to a reference vector that defines your perfect or partial match criteria. For example, if first name, last name, and street match, the two records are the same regardless of the rest of the fields. Or if phone numbers and last names match, that's a valid match too. You can encode such perfect matches as a vector and then simply compare it with your comparison vectors to determine whether it was a match, not a match, or a partial match. This is sort of a manual version of what machine learning does, which is to extract vectors of features and then build up a probability model of which vectors mean what from reference data. Doing it manually can work for simple problems.
3) Build up a reference data set with test cases that you know to match or not match and evaluate your algorithm against that reference set. That way you will know whether you are improving things or making things worse when you tweak e.g. the factor that goes into Levenshtein or whatever.
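Points 1) and 2) can be sketched together in a few lines of Java. This is a minimal illustration, not production code: the class name RecordMatcher, the 0 to 100 scoring formula and the "minimum score per field" rule format are my own assumptions, and both records are assumed to have the same columns in the same order.

```java
// Hypothetical sketch: score each field pair, then check the score vector against a rule vector.
final class RecordMatcher {
    // Classic dynamic-programming Levenshtein edit distance, keeping only two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Scores one field pair from 0 (no match) to 100 (equal, ignoring case and outer spaces).
    static int fieldScore(String a, String b) {
        String x = a == null ? "" : a.trim().toLowerCase();
        String y = b == null ? "" : b.trim().toLowerCase();
        if (x.isEmpty() || y.isEmpty()) {
            return 0; // a missing value gives no evidence of a match
        }
        int longest = Math.max(x.length(), y.length());
        return 100 - (100 * levenshtein(x, y)) / longest;
    }

    // Builds the comparison vector for two records with identical column layout.
    static int[] compare(String[] r1, String[] r2) {
        int[] scores = new int[r1.length];
        for (int i = 0; i < r1.length; i++) {
            scores[i] = fieldScore(r1[i], r2[i]);
        }
        return scores;
    }

    // A rule holds the minimum score required per field; 0 means the field does not matter.
    static boolean satisfies(int[] scores, int[] rule) {
        for (int i = 0; i < rule.length; i++) {
            if (scores[i] < rule[i]) {
                return false;
            }
        }
        return true;
    }
}
```

For records laid out as (first name, last name, street, zip), comparing {"John", "Smith", "12 High St", "90210"} with {"Jon", "Smith", "12 High St", ""} gives the vector (75, 100, 100, 0), which satisfies a rule such as (70, 100, 100, 0). Keeping the rules as plain data makes it easy to run them against the reference set from point 3) and see what a tweak changes.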
Jilles' answer is great and comes from experience. I've also had to work on cleaning up large messy tables and sadly didn't know much about my options at that time (I ended up using Excel and a lot of autofilters). Wish I'd known about OpenRefine.
But if you get to the point where you have to write custom code to do this, I want to make a suggestion as to how: The columns are always the same, right? For instance, the first String is always the key, the second is the First name, the sixth is the ZIP code, tenth is the fax number, etc.?
Assuming there's not an unreasonable number of fields, I would start with a custom Record type which has each DB field as a member rather than a position in an array. Something like
class CustomerRow {
    public final String id;
    public final String firstName;
    // ... one field per remaining column

    public CustomerRow(String[] data) {
        id = data[0];         // the first String is the key
        firstName = data[1];  // the second is the first name
        // ... assign the remaining fields
    }
}
You could also include some validation code in the constructor, if you knew there to be garbage values you always want to filter out.
(Note that you're basically doing what an ORM would do automatically, but getting started with one would probably be more work than just writing the Record type.)
Then you'd implement some Comparator<CustomerRow>s which only look at particular fields, or define equality in fuzzy terms (there's where the edit distance algorithms would come in handy), or do special sorts.
Java uses a stable sort for objects, so to sort by e.g. name, then address, then key, you would just do each sort, but choose your comparators in the reverse order.
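As a rough sketch of the comparator idea, with a cut-down, hypothetical Row type standing in for the full record (the class names, field names and the choice of exact string comparison are assumptions for illustration):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical cut-down record with just the three fields used for sorting.
final class Row {
    final String id;
    final String name;
    final String address;

    Row(String id, String name, String address) {
        this.id = id;
        this.name = name;
        this.address = address;
    }
}

final class RowSorting {
    static final Comparator<Row> BY_NAME = Comparator.comparing((Row r) -> r.name);
    static final Comparator<Row> BY_ADDRESS = Comparator.comparing((Row r) -> r.address);
    static final Comparator<Row> BY_ID = Comparator.comparing((Row r) -> r.id);

    // Name, then address, then id: three stable sorts, least significant key first.
    static List<Row> sortInPasses(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(BY_ID);
        copy.sort(BY_ADDRESS);
        copy.sort(BY_NAME);
        return copy;
    }

    // The same ordering in a single pass by chaining the comparators.
    static List<Row> sortChained(List<Row> rows) {
        List<Row> copy = new ArrayList<>(rows);
        copy.sort(BY_NAME.thenComparing(BY_ADDRESS).thenComparing(BY_ID));
        return copy;
    }
}
```

Both methods produce the same order. The chained version is usually the easier one to read; the multi-pass version is handy when the sort keys are chosen one at a time, for example by a user clicking column headers.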
Also if you have access to the actual database, and it's a real relational database, I'd recommend doing some of your searches as queries where possible. And if you need to go back and forth between your Java objects and the DB, then using an ORM may end up being a good option.
I am new to java, and a new student of Computer Science.
I have a question: How can I find the most common name in an array that contains objects
with information about trips?
The array holds objects, each of which contains information about a trip, including the name of the guide.
By logic, I understand that I first need to get all guide names, then count each name,
then compare the counters of each name, find the maximum counter, and return the guide that
has that maximum counter, but how do I do this?
Any suggestions?
There are a lot of ways to do it. You are on the right track in your approach. Here's a little more detail about how you could do the things that you mention, in java.
"get all guide names"
This means you have to write a loop over your array, and collect the names in some kind of data structure. Which data structure to use depends on what you want to do (more on this below).
"count each name"
Aha, so your data structure that collects the names should be able to also store the count for each name. One of the most versatile data structures in java is the Map. In this case, you could have a Map to store the count of each name.
"compare the counters", "find the maximum"
You can do this after you've collected the names into a Map, but it's probably simpler to just do it as you go through the loop. As you loop over the items in the array, and get the name to update the count, you can also keep track of the "maximum count so far" and the name that goes with it. Any time you get a name whose new count is greater than this maximum, you would then have a new maximum and corresponding name (at least until you find a bigger one). Then at the end of the loop you will have the name that you are looking for.
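Put together, the loop described above might look like this. It is a sketch only: the Trip class with a guideName field is an assumption, since the actual trip class was not posted.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the trip class from the question.
final class Trip {
    final String guideName;

    Trip(String guideName) {
        this.guideName = guideName;
    }
}

final class GuideStats {
    // Returns the guide name that appears most often, or null for an empty array.
    // On a tie, the name that reached the top count first wins.
    static String mostCommonGuide(Trip[] trips) {
        Map<String, Integer> counts = new HashMap<>();
        String bestName = null;
        int bestCount = 0;
        for (Trip trip : trips) {
            // merge adds 1 to the existing count (or stores 1) and returns the new count
            int newCount = counts.merge(trip.guideName, 1, Integer::sum);
            if (newCount > bestCount) {
                bestCount = newCount;
                bestName = trip.guideName;
            }
        }
        return bestName;
    }
}
```

If Map.merge looks unfamiliar, it can be replaced by a containsKey check followed by put; the logic stays the same.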
Given two names that have variations in the way they are represented, is there any API/tool/algorithm that can give a score of how similar/different the names are?
Tim O' Reilly is one input and T Reilly is another input. The score returned for these two should be lower than the score for Tim O' Reilly and Tim Reilly.
I am looking for such score calculation mechanisms. A few challenges that the algorithm should be capable of handling are:
1) The first names and last names could be swapped when a name is given as input
2) There might be initials in place of names
3) One of the names may not have the last name while the other may have both first name and last name.
... and so on which are common errors in name representations.
Two libraries including a handful of distance scores for name similarity are:
SimPack:
SecondString
No single method covers the cases that you mention, but for 1) and 3) feature and set similarity measures (Jaccard, TF-IDF for instance) work. For 2), besides Soundex (as mentioned by #houman001), you may consider Levenshtein or Jaro. Experiment with some examples of your use case and combine.
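To give an idea of what a set similarity measure does for cases 1) and 3), here is a minimal hand-rolled Jaccard score over name tokens. It is not the SimPack or SecondString API; the class and method names are invented, and the tokenizer assumes plain ASCII names.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of token-set Jaccard similarity: |A intersect B| / |A union B|.
final class NameSimilarity {
    // Splits a name into lower-case tokens, dropping punctuation such as apostrophes and dots.
    static Set<String> tokens(String name) {
        Set<String> result = new HashSet<>();
        for (String t : name.toLowerCase().split("[^a-z]+")) {
            if (!t.isEmpty()) {
                result.add(t);
            }
        }
        return result;
    }

    // Returns a score between 0.0 (no shared tokens) and 1.0 (same token set).
    static double jaccard(String a, String b) {
        Set<String> x = tokens(a);
        Set<String> y = tokens(b);
        if (x.isEmpty() && y.isEmpty()) {
            return 1.0;
        }
        Set<String> common = new HashSet<>(x);
        common.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return (double) common.size() / union.size();
    }
}
```

Because sets ignore order, "Reilly Tim" and "Tim Reilly" score 1.0, and "Tim O' Reilly" scores higher against "Tim Reilly" than against "T Reilly". Initials are not handled at all here ("T" and "Tim" are simply different tokens), which is where a character-level measure would have to be combined in.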
For the "API/tool/algorithm that can give a score of how similar/different the names are" part, I can give you a hint:
There are a few heuristic libraries that search engines use, but there is also an encoding called Soundex that computes a short code (a letter followed by three digits) from a word. Words with the same Soundex code are those that sound alike even though they are spelled slightly differently. There are some Java implementations around as well.
On the points you mentioned later about names, look for contact management libraries/utilities and do some coding as these requirements are pretty specific.
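For the curious, American Soundex is small enough to sketch by hand. The version below is my own minimal implementation for illustration (the class name is made up, and non-letters are simply dropped); for real use, an existing library implementation is probably the safer choice.

```java
// Hypothetical minimal American Soundex: first letter plus three digits, e.g. "Robert" -> "R163".
final class Soundex {
    // Digit for each letter A..Z; '0' marks vowels and H, W, Y, which get no digit.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        StringBuilder letters = new StringBuilder();
        for (char c : word.toUpperCase().toCharArray()) {
            if (c >= 'A' && c <= 'Z') {
                letters.append(c); // ignore spaces, apostrophes and other non-letters
            }
        }
        if (letters.length() == 0) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            char code = CODES.charAt(c - 'A');
            if (code != '0') {
                if (code != last) {
                    out.append(code); // adjacent letters with the same digit are coded once
                }
                last = code;
            } else if (c != 'H' && c != 'W') {
                last = '0'; // a vowel or Y separates equal digits; H and W do not
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

Two names can then be treated as candidates for a match when Soundex.encode gives the same code for both, for example "Reilly" and "Riley".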
I am looking for a solution for timetable generation using Genetic Algorithms (GA).
In my scenario I view a timetable of 6 days, Monday to Saturday.
Each day is divided into a number of lectures/time slots (at most 6 lectures a day; each time slot is 1 hour, so that means 6 hours a day).
I have tried to represent a Class consisting of a Teacher, a Student Group (set), and a lecture.
I maintain a pool of possible teachers, possible subjects and possible student groups,
and I randomly assign them to these Class objects.
So a Class is a collection of all these references.
Each time slot then holds one such Class object.
Similarly, a day is made up of a number of these lecture Class objects,
and so on, with the week being made up of 6 days.
A set of possible constraints that I have is:
1. A teacher can take only one lecture in one time slot
2. A teacher can take a (finite) set of subjects
3. A teacher can be unavailable on a certain day
4. A teacher can be unavailable in a certain time slot
Other constraints may be added later.
Can anyone give me an idea of how to represent or handle these constraints, and how to calculate the fitness scores depending on them?
EDIT : The implementation is here https://github.com/shridattz/dynamicTimeTable
UPDATE:
The code can be found here
github.com/shridattz/dynamicTimeTable
In my timetable generation I have used a timetable object. This object consists of ClassRoom objects and the timetable schedule for each of them, as well as a fitness score for the timetable.
The fitness score corresponds to the number of clashes the timetable has with respect to the other schedules for various classes.
A ClassRoom object consists of Week objects. Week objects consist of Days, and Days consist of TimeSlots. A TimeSlot has a lecture, which associates a subject, the student group attending the lecture and the professor teaching the subject.
This way I have represented the timetable as a chromosome.
As for the constraints, I have used the composite design pattern, which makes it easy to add or remove as many constraints as needed.
In each constraint class, the condition specified in my question is checked between two timetable objects.
If the condition is satisfied, i.e. a clash is present, the score is incremented by one.
This way the timetable with the lowest score is the best we can get.
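A stripped-down sketch of the composite idea is shown below. It is not the code from the linked repository: it works on a flat list of lectures instead of the ClassRoom/Week/Day/TimeSlot hierarchy, and every class and method name in it is made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, flattened view of one scheduled lecture.
final class Lecture {
    final String teacher;
    final int day;  // 0 = Monday ... 5 = Saturday
    final int slot; // 0..5

    Lecture(String teacher, int day, int slot) {
        this.teacher = teacher;
        this.day = day;
        this.slot = slot;
    }
}

// Every constraint reports how many clashes it finds; lower is better.
interface Constraint {
    int violations(List<Lecture> timetable);
}

// Constraint 1: a teacher can take only one lecture in one time slot.
final class OneLecturePerSlot implements Constraint {
    public int violations(List<Lecture> timetable) {
        Set<String> seen = new HashSet<>();
        int clashes = 0;
        for (Lecture l : timetable) {
            if (!seen.add(l.teacher + "|" + l.day + "|" + l.slot)) {
                clashes++; // same teacher already booked for this day and slot
            }
        }
        return clashes;
    }
}

// Constraint 3: a teacher can be unavailable on a certain day.
final class UnavailableOnDay implements Constraint {
    private final String teacher;
    private final int day;

    UnavailableOnDay(String teacher, int day) {
        this.teacher = teacher;
        this.day = day;
    }

    public int violations(List<Lecture> timetable) {
        int clashes = 0;
        for (Lecture l : timetable) {
            if (l.teacher.equals(teacher) && l.day == day) {
                clashes++;
            }
        }
        return clashes;
    }
}

// Composite: behaves like a single constraint and sums the clashes of its children.
final class CompositeConstraint implements Constraint {
    private final List<Constraint> children = new ArrayList<>();

    CompositeConstraint add(Constraint c) {
        children.add(c);
        return this;
    }

    public int violations(List<Lecture> timetable) {
        int total = 0;
        for (Constraint c : children) {
            total += c.violations(timetable);
        }
        return total;
    }
}
```

The fitness score of a chromosome is then just the result of calling violations on the top-level composite, and adding a new rule means writing one more small class and adding it to the composite.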
For this problem there is no efficient solution. I think you realised that too, since you are using genetic algorithms. I wrote a framework for genetic algorithms myself some months ago.
I think you missed one thing: every class has a list of lessons per week, and only one lesson can happen at a time. Now you can randomly combine teachers and classes for the time slots.
In the fitness function I'd give a huge plus if a class gets all of its lessons for the week. A big minus would be if teachers don't have similar loads (teacher A has two lessons a week and teacher B has 12, for example). You might make this relative if a teacher only has to work 20 hours a week (use %).
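The relative load idea could be measured like this. It is only a sketch under my own assumptions: the class and method names are invented, loads are given as hours per week per teacher, and every contracted value is assumed to be greater than zero.

```java
import java.util.Map;

// Hypothetical helper: how unevenly teachers are loaded, relative to their contracted hours.
final class LoadBalance {
    // Both maps go from teacher name to hours per week.
    // Returns the gap, in percentage points, between the most and least utilised teacher.
    static double loadSpreadPercent(Map<String, Integer> assigned, Map<String, Integer> contracted) {
        if (contracted.isEmpty()) {
            return 0.0;
        }
        double min = Double.MAX_VALUE;
        double max = 0.0;
        for (Map.Entry<String, Integer> e : contracted.entrySet()) {
            int hours = assigned.getOrDefault(e.getKey(), 0);
            double used = 100.0 * hours / e.getValue(); // utilisation in percent
            min = Math.min(min, used);
            max = Math.max(max, used);
        }
        return max - min;
    }
}
```

A teacher with 2 of 20 hours (10%) and another with 12 of 40 hours (30%) give a spread of 20 percentage points, which the fitness function could subtract, possibly with a weight, from the score.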
All in all it is not that trivial and you might look for an experienced co-worker or mentor to help you with this topic.
If you want more specific advice, please make your question more specific.