Using Java to parse a CSV then save in a 2D array - Java
Okay, so I am working on a game based on a trading card game in Java. I scraped all of the game pieces' "information" into a CSV file where each row is a game piece and each column is a type of attribute for that piece. I have spent hours upon hours writing code with BufferedReader and the like, trying to extract the information from my CSV file into a 2D array, but to no avail. My CSV file is linked here: http://dl.dropbox.com/u/3625527/MonstersFinal.csv I have one year of computer science under my belt, but I still cannot figure out how to do this.
So my main question is: how do I place this into a 2D array so that I can keep the rows and columns?
Well, as mentioned before, some of your strings contain commas, so you're starting from a bad place, but I do have a solution, and it's this:
--------- If possible, re-scrape the site, but perform a simple encoding operation when you do. You'll want to do something like what tends to be done in autogenerated XML files that contain HTML: reserve a 'control character' (a printable character works best here, for reasons of debugging and... well... sanity) that, once encoded, is never meant to be read directly as an instance of itself. Ampersand is what I like to use because it's uncommon enough but still printable, but which character you use is really up to you. What I would do is write the program so that every instance of "," is replaced by "&c" before being written to the CSV, and every instance of an actual ampersand on the site is replaced by "&a". That way, you would never have the issue of accidentally splitting a single value in two in the CSV, and you could simply decode each value after you've separated them by the method I'm about to outline in...
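For instance, a minimal sketch of that encode/decode pass might look like this (the method names are mine, just for illustration; the important detail is the ordering - encode the ampersand first and decode it last, so the escape sequences it introduces never get re-encoded):

// Encode a raw field before writing it to the CSV.
// "&" must be handled first so the "&c" we introduce is never re-encoded.
static String encode(String field) {
    return field.replace("&", "&a").replace(",", "&c");
}

// Decode a field read back from the CSV, reversing the steps in the opposite order.
static String decode(String field) {
    return field.replace("&c", ",").replace("&a", "&");
}

For example, decode(encode("draw & discard, then end")) round-trips to the original string, and the encoded form contains no bare commas to confuse the parser.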
-------- Assuming you know how many columns will be in each row, you can use the StringTokenizer class (look it up; it's built into Java, and a good place to look for information is, as always, the Java Tutorials) to pull out, one by one, the values you need.
It works by your passing in a string and a delimiter (in this case, the delimiter would be ','), and it spitting out all the substrings that were separated by those commas. If you know how many pieces there are in total from the get-go, you can instantiate a 2D array at the beginning and just plug in each row as the StringTokenizer gives it to you. If you don't, that's still okay, because you can use an ArrayList. An ArrayList is nice because it's a higher-level abstraction over an array that automatically requests more memory as needed, so you can keep adding to it and know that retrieval time will always be constant. However, if you plan on adding and removing pieces more often than retrieving them, you might want a LinkedList instead: it has linear retrieval time, but constant-time insertion and removal once you're at the right position. Or, if you're feeling fancy, you could use a skip list instead (Java ships concurrent versions as ConcurrentSkipListMap and ConcurrentSkipListSet). Fair warning, though: the speed of retrieval, removal, and insertion comes with increased memory overhead, since skip lists maintain a lot of pointers.
If you know there should be the same number of values in each row, and you want them positionally organized, but for whatever reason your scraper doesn't handle a missing value and simply omits it, you have some bad news... it would be easier to rewrite the part of the scraper that deals with missing values than to write a method that interprets arrays of varying length and instantiates a Piece object from each one. My suggestion would again be to use the control character and fill empty columns with "&n" (for 'null'), to be interpreted later; but the specifics are of course what will individuate your code and coding style, so it's not for me to say.
edit: I think the main thing you should focus on is learning the different standard-library data types available in Java, and maybe implementing some of them yourself for practice. I remember implementing a binary search tree (not an AVL tree, but still). It's fun, it's good coding practice, and, more importantly, it's necessary if you want to be able to do things quickly and efficiently. In Java, arrays are objects allocated on the heap at runtime, which is why you can size them with a variable; but regardless of the specific implementation, arrays often aren't the best solution. Also, knowing regular expressions makes everything much easier. For practice, I'd recommend working them into your Java programs or, if you don't want to compile and jar things every time, into your bash scripts (if you're using *nix) and/or batch scripts (if you're using Windows).
I think the way you've scraped the data makes this problem more difficult than it needs to be. Your scrape seems inconsistent and difficult to work with: values are quoted inconsistently, some data already contains commas, and not every card is on its own line.
Try re-scraping the data in a much more consistent format, such as:
R1C1|R1C2|R1C3|R1C4|R1C5|R1C6|R1C7|R1C8
R2C1|R2C2|R2C3|R2C4|R2C5|R2C6|R2C7|R2C8
R3C1|R3C2|R3C3|R3C4|R3C5|R3C6|R3C7|R3C8
R4C1|R4C2|R4C3|R4C4|R4C5|R4C6|R4C7|R4C8
A/D Changer|DREV-EN005|Effect Monster|Light|Warrior|100|100|You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
Where each line is definitely its own card (as opposed to the example CSV you posted, which has newlines in odd places) and the delimiter is never used in a data field as anything other than a delimiter.
Once you've gotten the input into a consistently readable state, it becomes very simple to parse through it:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.StringTokenizer;

// (The reading code below can throw IOException, so call it from a method
// that declares "throws IOException" or wrap it in try/catch.)
BufferedReader br = new BufferedReader(new FileReader(new File("MonstersFinal.csv")));
String line;
ArrayList<String[]> cardList = new ArrayList<String[]>(); // Use an ArrayList because we might not know how many cards we need to parse.
while ((line = br.readLine()) != null) { // Read a single line from the file until there are no more lines to read
    StringTokenizer st = new StringTokenizer(line, "|"); // "|" is the delimiter of our input file.
    String[] card = new String[8]; // Each card has 8 fields, so we need room for the 8 tokens.
    for (int i = 0; i < 8; i++) { // For each token in the line that we've read:
        card[i] = st.nextToken(); // Read the token and place it into the ith "column"
    }
    cardList.add(card); // Add the card's info to the list of cards.
}
br.close();

for (int i = 0; i < cardList.size(); i++) {
    for (int x = 0; x < cardList.get(i).length; x++) {
        System.out.printf("card[%d][%d]: ", i, x);
        System.out.println(cardList.get(i)[x]);
    }
}
Which would produce the following output for my given example input:
card[0][0]: R1C1
card[0][1]: R1C2
card[0][2]: R1C3
card[0][3]: R1C4
card[0][4]: R1C5
card[0][5]: R1C6
card[0][6]: R1C7
card[0][7]: R1C8
card[1][0]: R2C1
card[1][1]: R2C2
card[1][2]: R2C3
card[1][3]: R2C4
card[1][4]: R2C5
card[1][5]: R2C6
card[1][6]: R2C7
card[1][7]: R2C8
card[2][0]: R3C1
card[2][1]: R3C2
card[2][2]: R3C3
card[2][3]: R3C4
card[2][4]: R3C5
card[2][5]: R3C6
card[2][6]: R3C7
card[2][7]: R3C8
card[3][0]: R4C1
card[3][1]: R4C2
card[3][2]: R4C3
card[3][3]: R4C4
card[3][4]: R4C5
card[3][5]: R4C6
card[3][6]: R4C7
card[3][7]: R4C8
card[4][0]: A/D Changer
card[4][1]: DREV-EN005
card[4][2]: Effect Monster
card[4][3]: Light
card[4][4]: Warrior
card[4][5]: 100
card[4][6]: 100
card[4][7]: You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
I hope re-scraping the information is an option here, and I hope I haven't misunderstood anything. Good luck!
On a final note, don't forget to take advantage of OOP once you've gotten things worked out. A Card class could make working with the data even simpler.
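For instance, here is a minimal sketch of what such a Card class might look like (the field names are my guesses based on the eight columns in the example row above):

public class Card {
    private final String name;        // e.g. "A/D Changer"
    private final String setNumber;   // e.g. "DREV-EN005"
    private final String cardType;    // e.g. "Effect Monster"
    private final String attribute;   // e.g. "Light"
    private final String monsterType; // e.g. "Warrior"
    private final int attack;
    private final int defense;
    private final String effectText;

    // Build a Card from one 8-element row produced by the parsing loop above.
    public Card(String[] row) {
        name        = row[0];
        setNumber   = row[1];
        cardType    = row[2];
        attribute   = row[3];
        monsterType = row[4];
        attack      = Integer.parseInt(row[5]);
        defense     = Integer.parseInt(row[6]);
        effectText  = row[7];
    }

    public String getName() { return name; }
    // ... getters for the other fields as needed
}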
I'm working on a similar problem for use in machine learning, so let me share what I've been able to do on the topic.
1) If you know, before you start parsing a row, how many attributes it will contain - whether that's hard-coded into your program or comes from a header in your file (highly recommended) - you can reasonably split it by comma. For example, the first attribute will be RowString.substring(0, RowString.indexOf(',')), the second will be the substring from just after the first comma to the next comma (writing a function to find the nth instance of a comma, or simply chopping off bits of the string as you go through it, should be fairly trivial), and the last attribute will be RowString.substring(RowString.lastIndexOf(',') + 1) (note the + 1, so the comma itself isn't included). The String class's methods are your friends here.
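As a rough sketch of that chop-as-you-go approach (assuming, for now, that no field itself contains a comma):

// Split one row into fields by repeatedly chopping at the next comma.
static String[] splitRow(String rowString, int numFields) {
    String[] fields = new String[numFields];
    String rest = rowString;
    for (int i = 0; i < numFields - 1; i++) {
        int comma = rest.indexOf(',');
        fields[i] = rest.substring(0, comma);
        rest = rest.substring(comma + 1); // chop off the field we just read
    }
    fields[numFields - 1] = rest; // whatever is left is the last field
    return fields;
}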
2) If you are having trouble distinguishing commas that are meant to separate values from commas that are part of a string-valued attribute, then (if the file is small enough to reformat by hand) do what Java does: represent characters with special meaning inside strings as '\,' rather than just ','. That way you can search for instances of ',' that are not preceded by '\', and you have a way of telling your separators apart.
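With that convention in place, a regex split with a negative lookbehind can do the separating (a sketch; the doubled backslashes are Java string escaping, and the sample row is made up):

// Split on commas that are NOT preceded by a backslash, then un-escape.
String row = "A/D Changer,DREV-EN005,Select 1 monster\\, then change its position.";
String[] fields = row.split("(?<!\\\\),");
for (int i = 0; i < fields.length; i++) {
    fields[i] = fields[i].replace("\\,", ","); // restore the literal commas
}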
3) As an alternative to 2), CSVs (in my opinion) aren't great for strings, which often contain commas. There is no single common standard for CSVs, so why not make them colon-separated values, or dash-separated values, or even triple-ampersand-separated values? The point of separating values with commas is to make them easy to tell apart, and if commas don't do the job, there's no reason to keep them. Again, this applies only if your file is small enough to edit by hand.
4) Looking at your file for more than just the format, it becomes apparent that you can't fix it by hand. Additionally, it would appear that some strings are surrounded by triple double quotes ("""string""") and some by single double quotes ("string"). If I had to guess, I would say that anything enclosed in quotes is a single attribute - there are, for example, no quote pairs that start in one attribute and end in another. So I would say that you could:
Make a class with a method that breaks a string into its comma-separated fields.
Write that method so that it ignores commas preceded by an odd number of double quotes (that way, if the quote pair hasn't been closed, it knows it's inside a string and the comma is not a value separator); a sketch follows below. This strategy fails, however, if the creator of your file did something like enclose some strings in doubled double quotes (""string""), so you may need a more comprehensive approach.
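Here is a rough sketch of that quote-counting idea. It tracks whether the number of quote characters seen so far is odd, and only splits on commas outside quotes; it does not attempt to handle the doubled-quote case mentioned above, and the quote characters themselves are dropped from the output:

import java.util.ArrayList;
import java.util.List;

// Split a CSV row on commas, ignoring commas inside double-quoted sections.
static List<String> splitQuoted(String row) {
    List<String> fields = new ArrayList<String>();
    StringBuilder current = new StringBuilder();
    boolean insideQuotes = false; // flips on every '"', i.e. tracks odd/even quote count
    for (char c : row.toCharArray()) {
        if (c == '"') {
            insideQuotes = !insideQuotes;
        } else if (c == ',' && !insideQuotes) {
            fields.add(current.toString());
            current.setLength(0); // start the next field
        } else {
            current.append(c);
        }
    }
    fields.add(current.toString()); // the last field has no trailing comma
    return fields;
}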
Related
Java - How to join many list values into a single string with delimiter at end of each value
How to join a list of millions of values into a single String by appending '\n' at the end of each line - The input data is in a List:

list[0] = And the good south wind still blew behind,
list[1] = But no sweet bird did follow,
list[2] = Nor any day for food or play
list[3] = Came to the mariners' hollo!

The code below joins the list into a string by appending a newline character at the end of each element:

String joinedStr = list.stream().collect(Collectors.joining("\n", "{", "}"));

But the problem is that if the list has millions of entries, the joining fails. My guess is that a String object can't handle millions of lines due to the large size. Please give suggestions.
The problem with trying to compose a gigantic string is that you have to keep the entire thing in memory before you can do anything further with it. If the string is too big to fit in memory, you have only two options: increase the available memory, or avoid keeping a huge string in memory in the first place. This string is presumably destined for some further processing - maybe it's being written to a blob in a database, or maybe it's the body of an HTTP response; it's not being constructed just for fun. It is probably much preferable to write to some kind of stream (maybe an implementation of OutputStream) that can be read a little at a time. The consumer can optionally buffer based on the delimiter if they are aware of the context of what you're sending, or they can wait until they have the entire thing. Preferably you would use something that supports backpressure, so that you can pause writing if the consumer is too slow. Exactly how this looks will depend on what you're trying to accomplish.
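As a minimal sketch of the streaming idea (here writing to a file via BufferedWriter, but any Writer or OutputStream destination works the same way; the method and path are illustrative, and no full joined string ever exists in memory):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.List;

static void writeJoined(List<String> list, String path) throws IOException {
    BufferedWriter out = new BufferedWriter(new FileWriter(path));
    try {
        for (String s : list) {
            out.write(s);
            out.write('\n'); // the delimiter goes straight to the stream
        }
    } finally {
        out.close(); // flushes the buffer and releases the file handle
    }
}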
Maybe you can do it with a StringBuilder, which is designed specifically for building large Strings. Here's how I'd do it:

StringBuilder sb = new StringBuilder();
for (String s : list)
    sb.append(s).append("\n");
return sb.toString();

I haven't tested this code, but it should work.
How to find specific substrings (that can be very similar) and do different things with them in Java
I am writing a program that takes in a file and extracts data from a single string within the file. I run into a problem when I try to separate the substrings in the way that I want. The goal is to separate the larger chunks of the line from other large chunks without separating the smaller chunks within each larger chunk (which are also separated by commas). An example of the file contents would be this (although it is slightly long, the files I have may vary from short lists like this to 50 or even 100 blocks of item sets):

{"timeStamp":1477474644345,"itemSets":[{"mode":"any","sortrank":4999,"type":"custom","priority":false,"isGlobalForMaps":true,"uid":"LOL_D957E9EC-39E4-943E-C55E-52B63E05D99C","isGlobalForChampions":false,"associatedMaps":[],"associatedChampions":[40],"blocks":[{"type":"starting","items":[{"id":"3303","count":1},{"id":"2031","count":1},{"id":"1082","count":1},{"id":"3340","count":1},{"id":"3363","count":1},{"id":"2043","count":1},{"id":"3364","count":1}]},{"type":"Support Build Items","items":[{"id":"2049","count":1},{"id":"1001","count":1},{"id":"3165","count":1},{"id":"3117","count":1},{"id":"2301","count":1},{"id":"3089","count":1},{"id":"3135","count":1},{"id":"3504","count":1}]},{"type":"AP Build Items","items":[{"id":"3165","count":1},{"id":"3020","count":1},{"id":"3089","count":1},{"id":"3135","count":1},{"id":"3285","count":1},{"id":"3116","count":1}]},{"type":"Other Items (Situational Items)","items":[{"id":"3026","count":1},{"id":"3285","count":1},{"id":"3174","count":1},{"id":"3001","count":1},{"id":"3504","count":1}]}],"title":"Janna Items","map":"any"},{"mode":"any","sortrank":0,"type":"custom","priority":false,"isGlobalForMaps":false,"uid":"LOL_F265D25A-EA44-5B86-E37A-C91BD73ACB4F","isGlobalForChampions":true,"associatedMaps":[10],"associatedChampions":[],"blocks":[{"type":"Searching","items":[{"id":"3508","count":1},{"id":"3031","count":1},{"id":"3124","count":1},{"id":"3072","count":1},{"id":"3078","count":1},{"id":"3089","count":1}]}],"title":"TEST","map":"any"}]}

The code I have attempted to write tries to separate this into meaningful chunks. Here is what I have written so far:

cutString = dataFromFile.substring(dataFromFile.indexOf("itemSets\":") + 11, dataFromFile.indexOf("},{"));
stringContinue = dataFromFile.substring(cutString.length());
while (stringContinue.contains("},{")) {
    // Do string manipulation to cut every part and re-attach it, then re-check to find if this ("},{\"id") is not there
    if (stringContinue.contains("},{\"id")) {
        //if(stringContinue.equals(anObject))
        cutString = cutString + stringContinue.substring(0, stringContinue.indexOf("},{\"id"));
    } else if (stringContinue.contains("},{\"count")) {
        cutString = cutString + stringContinue.substring(0, stringContinue.indexOf("},{\"count"));
    } else if (stringContinue.contains("},{")) {
        cutString = cutString + stringContinue.substring(0, stringContinue.indexOf("},{"));
    }
    stringContinue = stringContinue.substring(cutString.length());
    // Check if we see a string pattern that is the cut-off point
    //if()
    //System.out.println(stringContinue);
    System.out.println(cutString);
}

But when I run it, I get output like this:

{"mode":"any","sortrank":4999,"type":"custom","priority":false,"isGlobalForMaps":true,"uid":"LOL_D957E9EC-39E4-943E-C55E-52B63E05D99C","isGlobalForChampions":false,"associatedMaps":[],"associatedChampions":[40],"blocks":[{"type":"starting","items":[{"id":"3303","count":1arting","items":[{"id":"3303","count":1

The output I want to achieve is this:
{"mode":"any","sortrank":4999,"type":"custom","priority":false,"isGlobalForMaps":true,"uid":"LOL_D957E9EC-39E4-943E-C55E-52B63E05D99C","isGlobalForChampions":false,"associatedMaps":[],"associatedChampions":[40],"blocks":[{"type":"starting","items":[{"id":"3303","count":1},{"id":"2031","count":1},{"id":"1082","count":1},{"id":"3340","count":1},{"id":"3363","count":1},{"id":"2043","count":1},{"id":"3364","count":1}]},{"type":"Support Build Items","items":[{"id":"2049","count":1},{"id":"1001","count":1},{"id":"3165","count":1},{"id":"3117","count":1},{"id":"2301","count":1},{"id":"3089","count":1},{"id":"3135","count":1},{"id":"3504","count":1}]},{"type":"AP Build Items","items":[{"id":"3165","count":1},{"id":"3020","count":1},{"id":"3089","count":1},{"id":"3135","count":1},{"id":"3285","count":1},{"id":"3116","count":1}]},{"type":"Other Items (Situational Items)","items":[{"id":"3026","count":1},{"id":"3285","count":1},{"id":"3174","count":1},{"id":"3001","count":1},{"id":"3504","count":1}]}],"title":"Janna Items","map":"any"} {"mode":"any","sortrank":0,"type":"custom","priority":false,"isGlobalForMaps":false,"uid":"LOL_F265D25A-EA44-5B86-E37A-C91BD73ACB4F","isGlobalForChampions":true,"associatedMaps":[10],"associatedChampions":[],"blocks":[{"type":"Searching","items":[{"id":"3508","count":1},{"id":"3031","count":1},{"id":"3124","count":1},{"id":"3072","count":1},{"id":"3078","count":1},{"id":"3089","count":1}]}],"title":"TEST","map":"any"} So then my question is how do I check for the point where I can separate the blocks without getting java to detect the same pattern that it uses to separate the smaller chunks? Basically I am looking for a pattern like this ("},{"), but not this ("},{\"id:") or this ("},{\count:"). Is there any other things that the String Class can offer for functionality that is similar that i am not aware of? Edit: Although using a json parser would make things easier and convenient for this type of problem, another one rises because it would make the program only take in json files. This question is more for string manipulation and trying to find a part of the string that can separate the large blocks of information without touching or changing (very minimally as possible) the smaller blocks that have the same way of separation. So far regex and splitting strings to be re-attached later seems to be the way to go unless there is a more clear-cut answer.
You could split the string into an array based on a regex like this:

// fileString is the String you get from your file
String[] chunksIWant = fileString.split("\\},\\{");

This will return the String array chunksIWant, split into the chunks you want. It does get rid of the matched delimiter itself, in this case "},{", so if you need those characters back for some reason you will have to re-add them afterwards.
You are getting this data from the file in JSON format. So when you get that data on the Java side, use a JsonParser to convert the data into a JsonArray. Then you can iterate over that JsonArray to get each element as a JsonObject, and read its values by field name as required.
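A minimal sketch of that approach, assuming the Gson library is on the classpath (JsonParser.parseString is the entry point in recent Gson versions; older versions use new JsonParser().parse(...) instead; the field names match the JSON in the question):

import com.google.gson.JsonArray;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

// jsonText is the whole file contents as a String.
JsonObject root = JsonParser.parseString(jsonText).getAsJsonObject();
JsonArray itemSets = root.getAsJsonArray("itemSets");
for (int i = 0; i < itemSets.size(); i++) {
    JsonObject itemSet = itemSets.get(i).getAsJsonObject();
    System.out.println(itemSet.get("title").getAsString()); // e.g. "Janna Items"
}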
Efficiency of Very Long StringBuilders
I am currently developing an Android app in Java. Obviously there are very many situations where I have to do lots of String manipulation. I prefer using the class SpannableStringBuilder since it has the most functionality and makes it very easy to append to the string or add spans to it. However, I was curious about the internal implementation of the class, so I read parts of the Java source of SpannableStringBuilder. I was quite surprised to see that when I use the append() or replace() functions on the builder, it just creates a new char array of the resulting length and copies all the characters from the original char array, start to end, into the new array along with the changed part. Now I am worried, since what I am working on is quite similar to a memo pad, and I append() every input the user types to a SpannableStringBuilder. But since it's a memo pad, the user might type more than several tens of thousands of characters. In that case, the append() or replace() function has to copy a whole bunch of characters every time the user types something on the keyboard... I think this could have a huge impact on performance. Is this something I should be worried about? Are there other ways to make continuously appendable string builders?
Java - Trying to create an ArrayList of Strings but the ArrayList gets full(?)
I might just be doing something stupid here, but I'm trying to write a program that will take all the text from an XML file, put it in an ArrayList as strings, then find certain recurring strings and count them. It basically works, but for some reason it won't go through the entire XML file. It's a pretty large file with over 15000 lines (ideally I'd like it to be able to handle any number of lines, though). I did a test to output everything it was putting in the ArrayList to a .txt file, and eventually the last line simply says "no", with still much more text/lines to go through. This is the code I'm using to build the ArrayList (lines is the number of lines in the file):

// make array of strings
for (int i = 0; i < lines; i++) {
    strList.add(fin2.next());
}
fin2.close();

Then I'm searching for the desired strings with:

// find strings
for (String string : strList) {
    if (string.matches(identifier)) {
        count++;
    }
}
System.out.println(count);
fout.println(count);

It basically works (the PrintWriter and Scanners work, the line count works, etc.) except the ArrayList won't take all the text from the .xml file, so of course the count at the end is inaccurate. Is an ArrayList not the best solution for this problem?
This is a BAD practice. Each time you put a string into an ArrayList and keep it there, you increase memory usage. The bigger the file, the more memory is used, up to the point where you're wondering why your application is using 75% of your memory. You don't need to store the lines in an ArrayList in order to see if they match. You can simply read each line and compare it to whatever text you're matching against. Here is your code modified:

String nextString = "";
while (fin2.hasNext()) {
    nextString = fin2.next();
    if (nextString.matches(identifier) || nextString.matches(identifier2)) {
        count++;
    }
}
fin2.close();
System.out.println(count);

This eliminates looping through everything twice, saves you a ton of memory, and gives you accurate results. Also, I'm not sure whether you mean to read entire lines or individual tokens. If you want to read entire lines, change hasNext to hasNextLine and next to nextLine. Edit: Modified the code to show what it would look like when looking for multiple strings.
Have you tried using a map, like HashMap? Since your goal is to count the occurrences of words in an XML file, a HashMap will make your life easier.
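A minimal sketch of that idea (the file name and lookup string are illustrative):

import java.io.File;
import java.util.HashMap;
import java.util.Scanner;

// Count how often each token occurs in the file.
// (new Scanner(File) can throw FileNotFoundException; handle or declare it.)
HashMap<String, Integer> counts = new HashMap<String, Integer>();
Scanner in = new Scanner(new File("data.xml"));
while (in.hasNext()) {
    String word = in.next();
    Integer current = counts.get(word);
    counts.put(word, current == null ? 1 : current + 1);
}
in.close();
System.out.println(counts.get("someRecurringString")); // occurrences of one string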
The problem is not with your ArrayList but with your for loop. What's happening is that you're using the number of lines in the file as your sentinel value, but rather than advancing by one line per iteration, fin2.next() advances by one word. Therefore, not all the words are added to your ArrayList, because your loop terminates earlier than expected. Hope this helps! EDIT: I don't know what object you are using right now to read the contents of this XML file, but I would suggest using a Scanner instead (passing the File as a parameter in the constructor) and replacing the current for loop with a while loop that uses while (nameOfScanner.hasNextLine())
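That loop might look something like this (the scanner name, file name, and list are illustrative):

import java.io.File;
import java.util.ArrayList;
import java.util.Scanner;

// (new Scanner(File) can throw FileNotFoundException; handle or declare it.)
Scanner fileScanner = new Scanner(new File("input.xml"));
ArrayList<String> strList = new ArrayList<String>();
while (fileScanner.hasNextLine()) { // stop when the file runs out, not at a precomputed count
    strList.add(fileScanner.nextLine());
}
fileScanner.close();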
comparing "the likes" smartly
Suppose you need to perform some kind of comparison between 2 files. You only need to do it when it makes sense; in other words, you wouldn't want to compare a JSON file with a Property file, or a .txt file with a .jar file. Additionally, suppose that you have a mechanism in place to sort all of these things out, and what it comes down to now is the actual file name. You would want to compare "myFile.txt" with "myFile.txt", but not with "somethingElse.txt". The goal is to be as close to "apples to apples" rules as possible. So here we are: on one side you have "myFile.txt", and on the other you have "_myFile.txt", "_m_y_f_i_l_e.txt" and "somethingReallyClever.txt". The task is to pick the closest name to compare against later. Unfortunately, an identical name is not found. Looking at the character composition, it is not hard to figure out what the relationship is. My algorithm says:

_myFile.txt to _m_y_f_i_l_e.txt 0.312
_myFile.txt to somethingReallyClever.txt 0.16

So _m_y_f_i_l_e.txt is closer to _myFile.txt than somethingReallyClever.txt. Fantastic. But it also says that it is only 2 times closer, whereas in reality we can look at the 2 files and would never think to compare somethingReallyClever.txt with _myFile.txt. Why? What logic would you suggest I apply to not only figure out likelihood by having characters in the same place, but also to test whether the determined weight makes sense? In my example, somethingReallyClever.txt should have had a weight of 0.0. I hope I am being clear. Please share your experience and thoughts on this. (Whatever approach you suggest should not depend on the number of characters the filename consists of.)
Possibly helpful previous question that highlights several possible algorithms: Word comparison algorithm. These algorithms are based on how many changes would be needed to get from one string to the other, where a change is adding a character, deleting a character, or replacing a character. Certainly any sensible metric here should treat a low score as meaning close (think of the score as a distance between the two strings) and larger scores as meaning not so close.
Sounds like you want the Levenshtein distance, perhaps modified by first converting both words to the same case and normalizing spaces (e.g. replacing all spaces and underscores with the empty string).
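A sketch of the classic dynamic-programming implementation, with that normalization applied first (the normalization choices are just the ones suggested above):

// Levenshtein distance: the minimum number of single-character insertions,
// deletions, and substitutions needed to turn a into b.
static int levenshtein(String a, String b) {
    // Normalize first: same case, and drop spaces/underscores.
    a = a.toLowerCase().replaceAll("[ _]", "");
    b = b.toLowerCase().replaceAll("[ _]", "");
    int[][] d = new int[a.length() + 1][b.length() + 1];
    for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete everything
    for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert everything
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
            d[i][j] = Math.min(Math.min(
                    d[i - 1][j] + 1,         // deletion
                    d[i][j - 1] + 1),        // insertion
                    d[i - 1][j - 1] + cost); // substitution (or match)
        }
    }
    return d[a.length()][b.length()];
}

// With the normalization, levenshtein("_myFile.txt", "_m_y_f_i_l_e.txt") == 0,
// while levenshtein("_myFile.txt", "somethingReallyClever.txt") stays large.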