I am processing a file which I need to split based on the separator.
The following code shows the separators defined for the files I am processing
private static final String component = Character.toString((char) 31);
private static final String data = Character.toString((char) 29);
private static final String segment = Character.toString((char) 28);
Can someone please explain the significance of these specific separators?
Looking at the ASCII codes, these separators are file, group and unit separators. I don't really understand what this means.
Found this here. Cool website!
28 – FS – File separator The file
separator FS is an interesting control
code, as it gives us insight in the
way that computer technology was
organized in the sixties. We are now
used to random access media like RAM
and magnetic disks, but when the ASCII
standard was defined, most data was
serial. I am not only talking about
serial communications, but also about
serial storage like punch cards, paper
tape and magnetic tapes. In such a
situation it is clearly efficient to
have a single control code to signal
the separation of two files. The FS
was defined for this purpose.
29 – GS – Group separator
Data storage was one
of the main reasons for some control
codes to get in the ASCII definition.
Databases are most of the time setup
with tables, containing records. All
records in one table have the same
type, but records of different tables
can be different. The group separator
GS is defined to separate tables in a
serial data storage system. Note that
the word table wasn't used at that
moment and the ASCII people called it
a group.
30 – RS – Record separator
Within a group (or table) the records
are separated with RS or record
separator.
31 – US – Unit separator
The smallest data items to be stored
in a database are called units in the
ASCII definition. We would call them
field now. The unit separator
separates these fields in a serial
data storage environment. Most current
database implementations require that
fields of most types have a fixed
length. Enough space in the record is
allocated to store the largest
possible member of each field, even if
this is not necessary in most cases.
This costs a large amount of space in
many situations. The US control code
allows all fields to have a variable
length. If data storage space is
limited—as in the sixties—this is a
good way to preserve valuable space.
On the other hand is serial storage
far less efficient than the table
driven RAM and disk implementations of
modern times. I can't imagine a
situation where modern SQL databases
are run with the data stored on paper
tape or magnetic reels...
The ascii control characters range from 28-31. (0x1C to 0x1F)
31 Unit Separator
30 Record Separator
29 Group Separator
28 File Separator
Sample invocation:
char record_separator = 0x1F;
String s = "hello" + record_separator + "world"
These characters are control characters. They're not meant to be written or read by humans, but by computers. You should treat them in your program like any other character.
Related
I'm learning to code by myself (Java and Android) and I'm working on creating an application for Android.
It's a language application (Grammar) to analyze verb.
What I have currently is I have a list of 1000 verb and each one has it's own way of forms and I'm stuck to find way to store them as strings but don't want to use a lot of string or too long array.
I was thinking of array of string but an array to have 1000 string not sure if this is really practical.
I thought of creating what I need in an excel and then to use this excel as the storage where the app can use it to search in it and show the results found there in a TextView but again not quite sure if this will work with Android.
Let's say I have the below 3 verbs in infinitive
Akl
ktv
hlk
Now the first verb can come in another 2 forms (Nakl - Hikel) and the other verbs too have their own forms.
What I want to do is, when a user type the verb whether in past or present for example (Akled "past" or hikeling "present") the system will substring only the verb by removing the ending and then use what is left (for example: akled ---> akl) to show the other forms, in this case if it is "akled" then system will use "akl" and show nakl - hikel.
Example:
User type in text box (akled) and press analyse
System will do the following:
substring the verb only (akl)
then based on this verb will show other forms, which are (nakl - hikel).
Is this doable with huge number of verbs? Let's say each verb has only 2 other forms, so based on this the 1000 verb has 2000 other forms.
Don't bother about loading too many strings in the memory. Strings are internally represented as array of characters and char type in Java takes 2 bytes. So, if you were to keep 100,000 strings (each 20 characters long), then the total memory occupied by the String[] of 100,000 elements will be 100,000 * 20 * 2 = 4,000,000 bytes = 4 MB. And JVM heap size is usually in Gigabytes, so you shouldn't be bothered about whether you should load this much strings in memory or not. Even if you load 10 times the above, i.e. 1,000,000 strings, you'll be occupying only 40 MB of memory.
you can create your own String resource files in android, where you can store there all the strings used on the app: https://developer.android.com/guide/topics/resources/string-resource
above link will show you the format to use, where to put that file of strings in your android project and how to use it.
on this link you have an example of how to get an string array from resources:
https://www.android-examples.com/get-string-array-from-strings-xml-in-android/
I'm creating an android application that needs a massive database (70mb but the application has to work offline...). The largest table has two columns, a keyword and a definition. The definitions themselves are relatively short, usually under 2000 characters, so compressing each one individually wouldn't save me very much since compression libraries store the rules decompress the strings as part of the compressed string.
However if I could compress all of these strings with the same set of rules and then store just the compressed data in the DB and the rules elsewhere, I could save a lot of space. Does anyone know of a library that will let me do something like this?
Desired behavior:
public String getDefinition(String keyword) {
DecompressionObject decompresser = new DecompressionObject(RULES_FILE);
byte[] data = queryDatabase(keyword);
return decompresser.decompress(keyword);
}
The "rules" as you call them is not why you are getting limited compression efficacy. The Huffman code table that precedes the data in a deflate stream is around 80 bytes, and so is not significant compared to your 2000 byte string.
What is limiting the compression efficacy is simply a lack of history from which to draw matching strings. The only place to look for matching strings is in the 2000 characters, and then only in the preceding characters at any point in the compression.
What you could do to improve compression would be to create a dictionary of common strings that would be used as history to precede each string you are compressing. Then that same dictionary is provided to the decompressor ahead of time for it to use to decompress each string. This assumes that there is some commonality of content in your ensemble of strings.
zlib provides these functions in deflateSetDictionary() and inflateSetDictionary().
I am working on an engine that does OCR post-processing, and currently I have a set of organizations in the database, including Chamber of Commerce Numbers.
Also from the OCR output I have a list of possible Chamber of Commerce (COC) numbers.
What would be the best way to search the most similar one? Currently I am using Levenshtein Distance, but the result range is simply too big and on big databases I really doubt it's feasibility. Currently it's implemented in Java, and the database is a MySQL database.
Side note: A Chamber of Commerce number in The Netherlands is defined to be an 8-digit number for every company, an earlier version of this system used another 4 digits (0000, 0001, etc.) to indicate an establishment of an organization, nowadays totally new COC numbers are being given out for those.
Example of COCNumbers:
30209227
02045251
04087614
01155720
20081288
020179310000
09053023
09103292
30039925
13041611
01133910
09063023
34182B01
27124701
List of possible COCNumbers determined by post-processing:
102537177
000450093333
465111338098
NL90223l30416l
NLfl0737D447B01
12juni2013
IBANNL32ABNA0242244777
lncassantNL90223l30416l10000
KvK13041611
BtwNLfl0737D447B01
A few extra notes:
The post-processing picks up words and word groups from the invoice, and those word groups are being concatenated in one string. (A word group is at it says, a group of words, usually denoted by a space between them).
The condition that the post-processing uses for it to be a COC number is the following: The length should be 8 or more, half of the content should be numbers and it should be alphanumerical.
The amount of possible COCNumbers determined by post-processing is relatively small.
The database itself can grow very big, up to 10.000s of records.
How would I proceed to find the best match in general? (In this case (13041611, KvK13041611) is the best (and moreover correct) match)
Doing this matching exclusively in MySQL is probably a bad idea for a simple reason: there's no way to use a regular expression to modify a string natively.
You're going to need to use some sort of scoring algorithm to get this right, in my experience (which comes from ISBNs and other book-identifying data).
This is procedural -- you probably need to do it in Java (or some other procedural programming language).
Is the candidate string found in the table exactly? If yes, score 1.0.
Is the candidate string "kvk" (case-insensitive) prepended to a number that's found in the table exactly? If so, score 1.0.
Is the candidate string the correct length, and does it match after changing lower case L into 1 and upper case O into 0? If so, score 0.9
Is the candidate string the correct length after trimming all alphabetic characters from either beginning or the end, and does it match? If so, score 0.8.
Do both steps 3 and 4, and if you get a match score 0.7.
Trim alpha characters from both the beginning and end, and if you get a match score 0.6.
Do steps 3 and 6, and if you get a match score 0.55.
The highest scoring match wins.
Take a visual look at the ones that don't match after this set of steps and see if you can discern another pattern of OCR junk or concatenated junk. Perhaps your OCR is seeing "g" where the input is "8", or other possible issues.
You may be able to try using Levenshtein's distance to process these remaining items if you match substrings of equal length. They may also be few enough in number that you can correct your data manually and proceed.
Another possibility: you may be able to use Amazon Mechanical Turk to purchase crowdsourced labor to resolve some difficult cases.
i'm making an Highscore implementation for my game. Here there's what i want to do:
I have object Score which contains String Name and Integer score.
Now :
if Name isn't already in the file, add it
if Name is on the file, after a space take the String and convert into integer, so i got the score.
Now, if score is better than the actual, i have to OVERWRITE it on the file...
and here's my problem..i can i do that? how can i write exactly a string over another in a certain point of the file?
Generally it's considered too fiddly to replace text in text files for this kind of requirement, so the usual way to do it is just to read in the whole file, make the replacement and write a new version of the whole file. If you have large amounts of data you would use a database or a NoSQL solution instead.
P.S. consider using serialization, it can make things easier.
I have object Score which contains String Name and Integer score.
Use a Properties file for this instead. It provides an easy interface to load & save the file, get the keys (Name) and set or retrieve values (Score). String values are stored, but they can be converted to/from integer easily.
I concur that this is best done by fully re-serializing the entire database. On modern computers, you can push 30MB/s per disk for linear writes (more if there's sufficient cache). And if you're dealing with more than 30MB of data, you REALLY need a DB (HSQLDB, DerbyDB, BerkleyDB) are trivial DBs. Or go all the way to postgres/mysql.
However, the way to overwrite a FIXED sized section of an existing file (or rather, the way to emulate doing so), is to use:
RandomAccessFile raf = new RandomAccessFile(fileName, "rw");
try {
raf.seek(position);
raf.writeInt(newScore);
raf.close();
} finally { raf.close(); }
Note using the writeInt instead of raf.write(Integer.toHexString(newScore).getBytes()), because you really really need that to be fixed in size.
Now if the text file is intrinsicly ascii (e.g. humans will read the file), and thus the value can't be binary.. Perhaps you could keep it HexString (because that will be fixed in size), or you could a zero-padded decimal string:
But what you absoluately positively can not do is grow the string by 1 byte.
So:
bob 15
joe 7
nina 981
Can't have joe's score replaced with 10, UNLESS you've padded a bunch of spaces.
If this is your data-file, then you will absolutely have to rewrite the whole file (even if you write the extra code to only rewrite from the point of change on - statistically that'll be 50% of the file and thus not worth bothering.
One other thing - if you do rewrite, you have the risk of shortening the file.. For that you need to call
raf.setLength(0);
before writing the first byte.. Otherwise, you'll see phantom text beyond the end of your new file.
final Map<String,Long> symbol2Position = new ConcurrentHashMap<>();
final Map<String,Integer> symbol2Score = new ConcurrentHashMap<>();
final String fileName;
final RandomAccessFile raf;
// ... skipped code
void storeFull() {
raf.position(0);
raf.setLength(0);
for (Map.Entry<String,Long> e : symbol2Position.entrySet()) {
raf.writeUTF8(e.getKey());
raf.write(',');
symbol2Score.put(e.getKey(), raf.position());
raf.writeUTF8(String.format("%06d",e.getValue()));
raf.write('\n');
}
}
void updateScore(String key, int newScore) {
if (symbol2Score.containsKey(key)) {
symbol2Score.put(key, newScore);
raf.position(symbol2Position.get(key));
raf.writeString(String.format("%06d", newScore));
} else {
symbol2Score.put(key, newScore);
long fileLen = raf.length();
symbol2Position.put(key, fileLen);
raf.position(fileLen);
raf.writeString(String.format("%06d", newScore));
}
}
I'd probably rather use a DB or binary map file.. meaning file with 8B per field, 4B pointing to user-name position, and 4B representing score. But this allows for a human readable data-file, while making updates faster than just rewriting the property file.
Check out LevelDB - fastest damn DB on the planet for embedded systems. :) Main thing it has over the above, is thousands / millions of updates per second without the rand-seek-rewrite cost of updating 6 bytes randomly across a multi-GB file.
Just a thought, any specific reason for the file storage of names and scores ?
Seems like a Map<String, Integer> would serve you much better...
Okay so i am working on a game based on a Trading card game in java. I Scraped all of the game peices' "information" into a csv file where each row is a game peice and each column is a type of attribute for that peice. I have spent hours upon hours writing code with Buffered reader and etc. trying to extract the information from my csv file into a 2d Array but to no avail. My csv file is linked Here: http://dl.dropbox.com/u/3625527/MonstersFinal.csv I have one year of computer science under my belt but I still cannot figure out how to do this.
So my main question is how do i place this into a 2D array that way i can keep the rows and columns?
Well, as mentioned before, some of your strings contain commas, so initially you're starting from a bad place, but I do have a solution and it's this:
--------- If possible, rescrape the site, but perform a simple encoding operation when you do. You'll want to do something like what you'll notice tends to be done in autogenerated XML files which contain HTML; reserve a 'control character' (a printable character works best, here, for reasons of debugging and... well... sanity) that, once encoded, is never meant to be read directly as an instance of itself. Ampersand is what I like to use because it's uncommon enough but still printable, but really what character you want to use is up to you. What I would do is write the program so that, at every instance of ",", that comma would be replaced by "&c" before being written to the CSV, and at every instance of an actual ampersand on the site, that "&" would be replaced by "&a". That way, you would never have the issue of accidentally separating a single value into two in the CSV, and you could simply decode each value after you've separated them by the method I'm about to outline in...
-------- Assuming you know how many columns will be in each row, you can use the StringTokenizer class (look it up- it's awesome and built into Java. A good place to look for information is, as always, the Java Tutorials) to automatically give you the values you need in the form of an array.
It works by your passing in a string and a delimiter (in this case, the delimiter would be ','), and it spitting out all the substrings which were separated by those commas. If you know how many pieces there are in total from the get-go, you can instantiate a 2D array at the beginning and just plug in each row the StringTokenizer gives them to you. If you don't, it's still okay, because you can use an ArrayList. An ArrayList is nice because it's a higher-level abstraction of an array that automatically asks for more memory such that you can continue adding to it and know that retrieval time will always be constant. However, if you plan on dynamically adding pieces, and doing that more often than retrieving them, you might want to use a LinkedList instead, because it has a linear retrieval time, but a much better relation than an ArrayList for add-remove time. Or, if you're awesome, you could use a SkipList instead. I don't know if they're implemented by default in Java, but they're awesome. Fair warning, though; the cost of speed on retrieval, removal, and placement comes with increased overhead in terms of memory. Skip lists maintain a lot of pointers.
If you know there should be the same number of values in each row, and you want them to be positionally organized, but for whatever reason your scraper doesn't handle the lack of a value for a row, and just doesn't put that value, you've some bad news... it would be easier to rewrite the part of the scraper code that deals with the lack of values than it would be to write a method that interprets varying length arrays and instantiates a Piece object for each array. My suggestion for this would again be to use the control character and fill empty columns with &n (for 'null') to be interpreted later, but then specifics are of course what will individuate your code and coding style so it's not for me to say.
edit: I think the main thing you should focus on is learning the different standard library datatypes available in Java, and maybe learn to implement some of them yourself for practice. I remember implementing a binary search tree- not an AVL tree, but alright. It's fun enough, good coding practice, and, more importantly, necessary if you want to be able to do things quickly and efficiently. I don't know exactly how Java implements arrays, because the definition is "a contiguous section of memory", yet you can allocate memory for them in Java at runtime using variables... but regardless of the specific Java implementation, arrays often aren't the best solution. Also, knowing regular expressions makes everything much easier. For practice, I'd recommend working them into your Java programs, or, if you don't want to have to compile and jar things every time, your bash scripts (if your using *nix) and/or batch scripts (if you're using Windows).
I think the way you've scraped the data makes this problem more difficult than it needs to be. Your scrape seems inconsistent and difficult to work with given that most values are surrounded by quotes inconsistently, some data already has commas in it, and not each card is on its own line.
Try re-scraping the data in a much more consistent format, such as:
R1C1|R1C2|R1C3|R1C4|R1C5|R1C6|R1C7|R1C8
R2C1|R2C2|R2C3|R2C4|R2C5|R2C6|R2C7|R3C8
R3C1|R3C2|R3C3|R3C4|R3C5|R3C6|R3C7|R3C8
R4C1|R4C2|R4C3|R4C4|R4C5|R4C6|R4C7|R4C8
A/D Changer|DREV-EN005|Effect Monster|Light|Warrior|100|100|You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
Where each line is definitely its own card (As opposed to the example CSV you posted with new lines in odd places) and the delimiter is never used in a data field as something other than a delimiter.
Once you've gotten the input into a consistently readable state, it becomes very simple to parse through it:
BufferedReader br = new BufferedReader(new FileReader(new File("MonstersFinal.csv")));
String line = "";
ArrayList<String[]> cardList = new ArrayList<String[]>(); // Use an arraylist because we might not know how many cards we need to parse.
while((line = br.readLine()) != null) { // Read a single line from the file until there are no more lines to read
StringTokenizer st = new StringTokenizer(line, "|"); // "|" is the delimiter of our input file.
String[] card = new String[8]; // Each card has 8 fields, so we need room for the 8 tokens.
for(int i = 0; i < 8; i++) { // For each token in the line that we've read:
String value = st.nextToken(); // Read the token
card[i] = value; // Place the token into the ith "column"
}
cardList.add(card); // Add the card's info to the list of cards.
}
for(int i = 0; i < cardList.size(); i++) {
for(int x = 0; x < cardList.get(i).length; x++) {
System.out.printf("card[%d][%d]: ", i, x);
System.out.println(cardList.get(i)[x]);
}
}
Which would produce the following output for my given example input:
card[0][0]: R1C1
card[0][1]: R1C2
card[0][2]: R1C3
card[0][3]: R1C4
card[0][4]: R1C5
card[0][5]: R1C6
card[0][6]: R1C7
card[0][7]: R1C8
card[1][0]: R2C1
card[1][1]: R2C2
card[1][2]: R2C3
card[1][3]: R2C4
card[1][4]: R2C5
card[1][5]: R2C6
card[1][6]: R2C7
card[1][7]: R3C8
card[2][0]: R3C1
card[2][1]: R3C2
card[2][2]: R3C3
card[2][3]: R3C4
card[2][4]: R3C5
card[2][5]: R3C6
card[2][6]: R3C7
card[2][7]: R4C8
card[3][0]: R4C1
card[3][1]: R4C2
card[3][2]: R4C3
card[3][3]: R4C4
card[3][4]: R4C5
card[3][5]: R4C6
card[3][6]: R4C7
card[3][7]: R4C8
card[4][0]: A/D Changer
card[4][1]: DREV-EN005
card[4][2]: Effect Monster
card[4][3]: Light
card[4][4]: Warrior
card[4][5]: 100
card[4][6]: 100
card[4][7]: You can remove from play this card in your Graveyard to select 1 monster on the field. Change its battle position.
I hope re-scraping the information is an option here and I hope I haven't misunderstood anything; Good luck!
On a final note, don't forget to take advantage of OOP once you've gotten things worked out. a Card class could make working with the data even simpler.
I'm working on a similar problem for use in machine learning, so let me share what I've been able to do on the topic.
1) If you know before you start parsing the row - whether it's hard-coded into your program or whether you've got some header in your file that gives you this information (highly recommended) - how many attributes per row there will be, you can reasonably split it by comma, for example the first attribute will be RowString.substring(0, RowString.indexOf(',')), the second attribute will be the substring from the first comma to the next comma (writing a function to find the nth instance of a comma, or simply chopping off bits of the string as you go through it, should be fairly trivial), and the last attribute will be RowString.substring(RowString.lastIndexOf(','), RowString.length()). The String class's methods are your friends here.
2) If you are having trouble distinguishing between commas which are meant to separate values, and commas which are part of a string-formatted attribute, then (if the file is small enough to reformat by hand) do what Java does - represent characters with special meaning that are inside of strings with '\,' rather than just ','. That way you can search for the index of ',' and not '\,' so that you will have some way of distinguishing your characters.
3) As an alternative to 2), CSVs (in my opinion) aren't great for strings, which often include commas. There is no real common format to CSVs, so why not make them colon-separated-values, or dash-separated-values, or even triple-ampersand-separated-values? The point of separating values with commas is to make it easy to tell them apart, and if commas don't do the job there's no reason to keep them. Again, this applies only if your file is small enough to edit by hand.
4) Looking at your file for more than just the format, it becomes apparent that you can't do it by hand. Additionally, it would appear that some strings are surrounded by triple double quotes ("""string""") and some are surrounded by single double quotes ("string"). If I had to guess, I would say that anything included in a quotes is a single attribute - there are, for example, no pairs of quotes that start in one attribute and end in another. So I would say that you could:
Make a class with a method to break a string into each comma-separated fields.
Write that method such that it ignores commas preceded by an odd number of double quotes (this way, if the quote-pair hasn't been closed, it knows that it's inside a string and that the comma is not a value separator). This strategy, however, fails if the creator of your file did something like enclose some strings in double double quotes (""string""), so you may need a more comprehensive approach.