I am having a problem with word boundary identification. I have removed all the markup from a Wikipedia document, and now I want to get a list of entities (meaningful terms). I am planning to take bi-grams and tri-grams of the document and check whether each one exists in a dictionary (WordNet). Is there a better way to achieve this?
Below is the sample text. I want to identify entities (shown surrounded by double quotes).
Vulcans are a humanoid species in the fictional "Star Trek" universe who evolved on the planet Vulcan and are noted for their attempt to live by reason and logic with no interference from
emotion They were the first extraterrestrial species officially to make first contact with Humans and later became one of the founding members of the "United Federation of Planets"
I think what you're talking about is really still a subject of burgeoning research rather than a simple matter of applying well-established algorithms.
I can't give you a simple "do this" answer, but here are some pointers off the top of my head:
I think using WordNet could work (not sure where bigrams/trigrams come into it though), but you should view WordNet lookups as part of a hybrid system, not the be-all and end-all to spotting named entities
then, start by applying some simple, common-sense criteria (sequences of capitalised words; try to accommodate frequent lower-case function words like 'of' within these; sequences consisting of a "known title" plus capitalised word(s));
look for sequences of words that statistically you wouldn't expect to appear next to one another by chance as candidates for entities;
can you build in dynamic web lookup? (Your system spots the capitalised sequence "IBM" and checks whether it finds, e.g., a Wikipedia entry with the text pattern "IBM is ... [organisation|company|...]".)
see if anything here and in the "information extraction" literature in general gives you some ideas: http://www-nlpir.nist.gov/related_projects/muc/proceedings/muc_7_toc.html
The truth is that when you look at what literature there is out there, it doesn't seem like people are using terribly sophisticated, well-established algorithms. So I think there's a lot of room for looking at your data, exploration and seeing what you can come up with... Good luck!
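As a rough illustration of the "sequences of capitalised words" criterion above, here is a minimal Java sketch. The class name, the pattern and the short list of function words (of, the, and) are assumptions made for this example. It only proposes multi-word candidates, so single-word entities and anything that needs the statistical or lookup steps are out of its reach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CapitalisedSequences {
    // A capitalised word followed by one or more further capitalised words,
    // each optionally preceded by a lower-case function word (assumed list).
    private static final Pattern CANDIDATE = Pattern.compile(
            "\\b[A-Z][a-z]+(?:\\s+(?:(?:of|the|and)\\s+)?[A-Z][a-z]+)+");

    // Returns every candidate sequence in the order it appears in the text.
    static List<String> candidates(String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = CANDIDATE.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String text = "Vulcans are a humanoid species in the fictional Star Trek universe "
                + "and later became one of the founding members of the United Federation of Planets";
        // Prints [Star Trek, United Federation of Planets]
        System.out.println(candidates(text));
    }
}
```

Candidates found this way would still need filtering, for example against WordNet or with the web lookup idea, since sentence-initial capitals can produce false positives.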
If I understand correctly, you're looking to extract substrings delimited by double-quotation marks ("). You could use capture-groups in regular expressions:
String text = "Vulcans are a humanoid species in the fictional \"Star Trek\"" +
" universe who evolved on the planet Vulcan and are noted for their " +
"attempt to live by reason and logic with no interference from emotion" +
" They were the first extraterrestrial species officially to make first" +
" contact with Humans and later became one of the founding members of the" +
" \"United Federation of Planets\"";
String[] entities = new String[10]; // An array to hold matched substrings
Pattern pattern = Pattern.compile("[\"](.*?)[\"]"); // The regex pattern to use
Matcher matcher = pattern.matcher(text); // The matcher - our text - to run the regex on
int startFrom = text.indexOf('"'); // The index position of the first " character
int endAt = text.lastIndexOf('"'); // The index position of the last " character
int count = 0; // An index for the array of matches
while (startFrom <= endAt) { // startFrom will be changed to the index position of the end of the last match
matcher.find(startFrom); // Run the regex find() method, starting at the first " character
entities[count++] = matcher.group(1); // Add the match to the array, without its " marks
startFrom = matcher.end(); // Update the startFrom index position to the end of the matched region
}
OR write a "parser" with String functions:
int startFrom = text.indexOf('"'); // The index-position of the first " character
int nextQuote = text.indexOf('"', startFrom+1); // The index-position of the next " character
int count = 0; // An index for the array of matches
while (startFrom > -1) { // Keep looping as long as there is another " character (indexOf returns -1 when there is none)
entities[count++] = text.substring(startFrom+1, nextQuote); // Retrieve the substring and add it to the array
startFrom = text.indexOf('"', nextQuote+1); // Find the next " character after nextQuote
nextQuote = text.indexOf('"', startFrom+1); // Find the next " character after that
}
In both, the sample-text is hard-coded for the sake of the example and the same variable is presumed to be present (the String variable named text).
If you want to test the contents of the entities array:
int i = 0;
while (i < count) {
System.out.println(entities[i]);
i++;
}
I have to warn you, there may be issues with border/boundary cases (i.e. when a " character is at the beginning or end of a string). These examples will not work as expected if the number of " characters in the text is odd. You could use a simple parity check beforehand:
static int countQuoteChars(String text) {
int nextQuote = text.indexOf('"'); // Find the first " character
int count = 0; // A counter for " characters found
while (nextQuote != -1) { // While there is another " character ahead
count++; // Increase the count by 1
nextQuote = text.indexOf('"', nextQuote+1); // Find the next " character
}
return count; // Return the result
}
static boolean quoteCharacterParity(int numQuotes) {
if (numQuotes % 2 == 0) { // If the number of " characters modulo 2 is 0
return true; // Return true for even
}
return false; // Otherwise return false
}
Note that if numQuotes happens to be 0 this method still returns true (because 0 modulo any number is 0, so (numQuotes % 2 == 0) will be true), though you wouldn't want to go ahead with the parsing if there are no " characters, so you'd want to check for this condition somewhere.
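For comparison, here is a minimal sketch of the regex variant that sidesteps both the fixed-size array and the zero-quote case by looping on the boolean result of find(). The class and method names are made up for the example. With an odd number of " characters it simply ignores the dangling one, which may or may not be what you want.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuotedEntities {
    // Lazily captures whatever sits between one " and the next.
    private static final Pattern QUOTED = Pattern.compile("\"(.*?)\"");

    static List<String> extract(String text) {
        List<String> entities = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(text);
        // find() returns false when no further pair of quotes exists,
        // so no index bookkeeping or parity check is needed here.
        while (matcher.find()) {
            entities.add(matcher.group(1));
        }
        return entities;
    }

    public static void main(String[] args) {
        String text = "the fictional \"Star Trek\" universe and the \"United Federation of Planets\"";
        for (String entity : extract(text)) {
            System.out.println(entity);
        }
    }
}
```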
Hope this helps!
Someone else asked a similar question about how to find "interesting" words in a corpus of text. You should read the answers. In particular, Bolo's answer points to an interesting article which uses the density of appearance of a word to decide how important it is, based on the observation that when a text talks about something, it usually refers to that something fairly often. This article is interesting because the technique does not require prior knowledge of the text that is being processed (for instance, you don't need a dictionary targeted at the specific lexicon).
The article suggests two algorithms.
The first algorithm rates single words (such as "Federation", or "Trek", etc.) according to their measured importance. It is straightforward to implement, and I could even provide a (not very elegant) implementation in Python.
The second algorithm is more interesting as it extracts noun phrases (such as "Star Trek", etc.) by completely ignoring whitespace and using a tree structure to decide how to split noun phrases. The results given by this algorithm when applied to Darwin's seminal text on evolution are very impressive. However, I admit implementing this algorithm would take a bit more thought, as the description given by the article is fairly elusive, and what's more, the authors seem a bit difficult to track down. That said, I did not spend much time, so you may have better luck.
To give some context: I recently started playing Dungeons and Dragons with a group of friends. I decided I wanted to try to make a program that allowed me to search for spells by level, school of magic, etc. To do this, I took a text file with every spell and its information listed alphabetically by spell name, and created a few regex expressions to sort through it all. I finally got it to give me the correct results for every attribute. But once I put it in a loop to get everything at once, I get a long list of errors, beginning with StackOverflowError. As far as I'm aware, this is supposed to happen when you get infinite loops, but mine definitely terminates. Moreover, I can go farther looping manually (with a loop that checks a boolean that I set with the keyboard at the end of each loop) than I can with a simple for or while loop.
The code I'm using is below. I didn't include the Spell class because it's just standard getters/setters and variable declarations. The School type I have is just an enum with the eight schools.
Map<String, Spell> allSpells = new HashMap<String, Spell>();
ArrayList<Spell> spellArray = new ArrayList<Spell>();
int finalLevel;
int lastMatch = 0;
int startIndex = 0;
Matcher match;
String finalTitle;
Spell.School finalSchool;
String finalDescription;
String fullList;
String titleString = ".+:\\n"; //Finds the titles of spells
Pattern titlePattern = Pattern.compile(titleString);
String levelString = "\\d\\w+-level"; //Finds the level of spells
Pattern levelPattern = Pattern.compile(levelString);
String schoolString = "(C|c)onjuration|(A|a)bjuration|(E|e)nchantment|(N|n)ecromancy|(E|e)vocation|(D|d)ivination|(I|i)llusion|(T|t)ransmutation"; //Finds the school of spells
Pattern schoolPattern = Pattern.compile(schoolString);
String ritualString = "\\(ritual\\)"; //Finds if a spell is a ritual
Pattern ritualPattern = Pattern.compile(ritualString);
String descriptionString = "\nCasting Time: (.|\\n)+?(\\n\\n)"; //Finds the description of spells
Pattern descriptionPattern = Pattern.compile(descriptionString);
try
{
BufferedReader in = new BufferedReader(new FileReader("Spell List.txt"));
// buffer for storing file contents in memory
StringBuffer stringBuffer = new StringBuffer("");
// for reading one line
String line = null;
// keep reading till readLine returns null
while ((line = in.readLine()) != null)
{
// keep appending last line read to buffer
stringBuffer.append(line + "\n");
}
fullList = stringBuffer.toString(); //Convert stringBuffer to a normal String. Used for setting fullList = a substring
boolean cont = true;
for(int i = 0; i < 100; i++) //This does not need to be set to 100. This is just a temporary number. Anything over 4 gives me this error, but under 4 I am fine.
{
//Spell Title
match = titlePattern.matcher(fullList);
match.find(); //Makes match point to the first title found
finalTitle = match.group().substring(0, match.group().length()-1); //finalTitle is set to found group, without the newline at the end
allSpells.put(finalTitle, new Spell()); //Creates unnamed Spell object tied to the matched title in the allSpells map
spellArray.add(allSpells.get(finalTitle)); //Adds the unnamed Spell object to a list.
//To be used for iterating through all Spells to find properties matching criteria
//Spell Level
match = levelPattern.matcher(fullList.substring(match.end(), match.end()+50)); //Gives an approximate region in which this could appear
if(match.find()) //Accounts for cantrips. If no match for a level is found, it is set to 0
{
finalLevel = Integer.valueOf(match.group().substring(0, 1));
}
else
{
finalLevel = 0;
}
allSpells.get(finalTitle).setSpellLevel(finalLevel);
//Spell School
match = schoolPattern.matcher(fullList);
match.find();
finalSchool = Spell.School.valueOf(match.group().substring(0, 1).toUpperCase() + match.group().substring(1, match.group().length())); //Capitalizes matched school
allSpells.get(finalTitle).setSpellSchool(finalSchool);
//Ritual?
match = ritualPattern.matcher(fullList.substring(0, 75));
if(match.find())
{
allSpells.get(finalTitle).setRitual(true);
}
else
allSpells.get(finalTitle).setRitual(false);
//Spell Description
match = descriptionPattern.matcher(fullList);
match.find();
finalDescription = match.group().substring(1); //Gets rid of the \n at the beginning of the description
allSpells.get(finalTitle).setDescription(finalDescription);
lastMatch = match.end();
System.out.println(finalTitle);
fullList = fullList.substring(lastMatch);
}
}
catch (Exception e)
{
e.printStackTrace();
}
If it helps, I have the list I'm using here.
As I mentioned in the comments of the code, going through the loop more than 4 times gives me this error, but under 4 does not. I have tried doing this as a while loop as well, and I get the same error.
I have tried searching for a solution online, but everything I see about this error just talks about recursive calls. If anyone has a solution for this I would greatly appreciate it. Thanks.
EDIT: The error list I'm getting is huge, so I put it in a text file here. I know people are asking for stack traces, and I hope that this is what they mean. I'm still relatively new to Java and have never had to work with stack traces before.
EDIT 2: I have found that if I simply replace the description regex with "\nCasting Time:" that it runs through the whole thing without errors. The only problem, of course, is that it doesn't collect all the information I want it to. Hopefully this information will help determine the problem though.
FINAL EDIT: I did a bit more searching once I found the specific line causing the problem, and found that increasing the stack size fixed the problem.
By increasing the stack size, you're treating a symptom and leaving the problem unsolved. In this case, the problem is an inefficient regex.
First, if you want to match anything including newlines, you should always use the DOTALL option. An alternation like .|\n is much less efficient. (It's also incorrect. The dot matches anything that's not a line terminator, which can be much more than just \n.)
Second, that alternation is inside a capturing group, with the quantifier outside the group: (.|\n)+?. That means you're capturing one character at a time, only to overwrite the captured character with the next character, and so on. You're making the regex engine do a lot of unnecessary work.
Here's the regex I would use:
"(?ms)^Casting Time: (.+?)\n\n"
The DOTALL option can be activated with the inline modifier, (?s). I also used the MULTILINE option, which lets me anchor the match to the beginning of the line with ^. That way, there's no need to consume the leading \n, only to chop it off later. In fact, if you use group(1) instead of group(), the trailing \n\n will be excluded as well.
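To make that concrete, here is a small sketch of how the pattern might be applied with group(1). The class name and the two spell entries are invented for the example (the real file layout is only known from the question), so treat this as an illustration rather than a drop-in replacement.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CastingTimeRegex {
    // (?s) lets the dot match line terminators, (?m) lets ^ match at line starts.
    private static final Pattern DESCRIPTION =
            Pattern.compile("(?ms)^Casting Time: (.+?)\n\n");

    // Collects every description, without the label and without the blank line.
    static List<String> descriptions(String fullList) {
        List<String> result = new ArrayList<>();
        Matcher matcher = DESCRIPTION.matcher(fullList);
        while (matcher.find()) {
            result.add(matcher.group(1));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample data in roughly the shape the question describes.
        String fullList = "Acid Splash:\nConjuration cantrip\nCasting Time: 1 action\nRange: 60 feet\n\n"
                + "Aid:\n2nd-level abjuration\nCasting Time: 1 action\nDuration: 8 hours\n\n";
        for (String description : descriptions(fullList)) {
            System.out.println("[" + description + "]");
        }
    }
}
```

Because the matcher keeps its own position between find() calls, there is also no need to keep cutting fullList down with substring().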
As for RegExr, it uses a different regex flavor than Java's, one with far fewer features. Most Java regexes will work on the excellent Regex101 site with the pcre (php) option selected. For absolute compatibility, there's RegexPlanet's Java page, or a code testing site like Ideone.
I'm really really really not sure what is the best way to approach this. I've gotten as far as I can, but I basically want to scan a user response with an array of words and search for matches so that my AI can tell what mood someone is in based off the words they used. However, I've yet to find a clear or helpful answer. My code is pretty cluttered too because of how many different methods I've tried to use. I either need a way to compare sections of arrays to each other or portions of strings. I've found things for finding a part of an array. Like finding eggs in green eggs and ham, but I've found nothing that finds a section of an array in a section of another array.
public class MoodCompare extends Mood1 {
public static void MoodCompare(String inputMood){
int inputMoodLength = inputMood.length();
int HappyLength = Arrays.toString(Happy).length();
boolean itWorks = false;
String[] inputMoodArray = inputMood.split(" ");
if(Arrays.toString(Happy).contains(Arrays.toString(inputMoodArray)) == true)
System.out.println("Success!");
inputMood is the data the user has entered, which should contain keywords that hint at their mood. Happy is an array from the class Mood1, which is being extended. This is only a small piece of the class, much less the program, but it should be all I need to make a valid comparison to complete the class.
If anyone can help me with this, you will save me hours of work. So THANK YOU!!!
Manipulating strings is nicer when you do not use the relatively primitive arrays, where you have to walk through the elements yourself, etcetera. As a Dutch proverb says: not seeing the wood for the trees.
In this case it seems you check words of the input against a set of words for some mood.
Let's use Java collections:
Turning an input string into a list of words:
String input = "...";
List<String> sentence = new ArrayList<>(Arrays.asList(input.split("\\W+"))); // Copy into an ArrayList: the list from Arrays.asList is fixed-size, so remove() could throw
sentence.remove("");
\\W+ is a sequence of one or more non-word characters. Mind that "word" characters means A-Za-z0-9_.
Now a mood would be a set of unique words:
Set<String> moodWords = new HashSet<>();
Collections.addAll(moodWords, "happy", "wow", "hurray", "great");
Evaluation could be:
int matches = 0;
for (String word : sentence) {
if (moodWords.contains(word)) {
++matches;
}
}
int percent = sentence.isEmpty() ? 0 : matches * 100 / sentence.size();
System.out.printf("Happiness: %d %%%n", percent);
In Java 8 it is even more compact:
int matches = (int) sentence.stream().filter(moodWords::contains).count(); // count() returns a long, hence the cast
Explanation:
The foreach-word-in-sentence takes every word. For every word it checks whether it is contained in moodWords, the set of all mood words.
The percentage is the share of words in the sentence that are mood words. The boundary condition of an empty sentence is handled by the if-then-else expression ... ? ... : ... - an empty sentence is given the arbitrary percentage 0%.
The printf format used %d for the integer, %% for the percent sign % (self-escaped) and %n for the line break character(s).
If I'm understanding your question correctly, you mean something like this?
String words[] = {"green", "eggs", "and", "ham"};
String response = "eggs or ham";
Mood mood = new Mood();
for(String foo : words)
{
if(response.contains(foo))
{
//Check if happy etc...
if(foo.equals("green"))
mood.sad++;
...
}
}
System.out.println("Success");
...
//CheckMood() etc... other methods.
Try to use tokens.
Every time that the program needs to compare the contents of a row from one array to the other array, just tokenize the contents in parallel and compare them.
Visit the following Javadoc page for further reference: http://docs.oracle.com/javase/7/docs/api/java/util/StringTokenizer.html
or even view the following web pages:
http://introcs.cs.princeton.edu/java/72regular/Tokenizer.java.html
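One way the token idea could look in practice is sketched below. The class name, the delimiter set and the case-insensitive comparison are assumptions for the example; only StringTokenizer itself comes from the links above.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

final class TokenMoodCheck {
    // Counts how many tokens of the response appear in the keyword array.
    static int countMatches(String response, String[] keywords) {
        Set<String> keywordSet = new HashSet<>();
        for (String keyword : keywords) {
            keywordSet.add(keyword.toLowerCase());
        }
        // Split on whitespace and some common punctuation (assumed delimiter set).
        StringTokenizer tokens = new StringTokenizer(response, " \t\n\r\f.,;:!?");
        int matches = 0;
        while (tokens.hasMoreTokens()) {
            if (keywordSet.contains(tokens.nextToken().toLowerCase())) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        String[] happy = {"happy", "great", "wow"};
        System.out.println(countMatches("I feel great, really Happy today!", happy));
    }
}
```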
I'm new to Java. I thought I would write a program to count the occurrences of a character or a sequence of characters in a sentence. I wrote the following code. But I then saw there are some ready-made options available in Apache Commons.
Anyway, can you look at my code and say if there is any rookie mistake? I tested it for a couple of cases and it worked fine. I can think of one case where if the input is a big text file instead of a small sentence/paragraph, the split() function may end up being problematic since it has to handle a large variable. However this is my guess and would love to have your opinions.
private static void countCharInString() {
//Get the sentence and the search keyword
System.out.println("Enter a sentence\n");
Scanner in = new Scanner(System.in);
String inputSentence = in.nextLine();
System.out.println("\nEnter the character to search for\n");
String checkChar = in.nextLine();
in.close();
//Count the number of occurrences
String[] splitSentence = inputSentence.split(checkChar);
int countChar = splitSentence.length - 1;
System.out.println("\nThe character/sequence of characters '" + checkChar + "' appear(s) '" + countChar + "' time(s).");
}
Thank you :)
Because of edge cases, split() is the wrong approach.
Instead, use replaceAll() to remove all other characters then use the length() of what's left to calculate the count:
int count = input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
FYI, the regex created (for example when check = 'xyz') looks like ".*?(xyz|$)", which means "everything up to and including 'xyz' or end of input", and each match is replaced by the captured text (either 'xyz', or nothing if it's the end of input). This leaves just a string of 0-n copies of the check string. Dividing its length by the length of check then gives you the total.
To protect against the check being null or zero-length (causing a divide-by-zero error), code defensively like this:
int count = check == null || check.isEmpty() ? 0 : input.replaceAll(".*?(" + check + "|$)", "$1").length() / check.length();
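Two caveats with the one-liner are worth a sketch: check is spliced into the regex unescaped, and neither the dot nor $ behaves as hoped when the input contains line breaks. The variant below is one possible hardening, not the only one. It uses Pattern.quote, the (?s) flag and \z (true end of input); the class and method names are made up for the example.

```java
import java.util.regex.Pattern;

final class ReplaceAllCount {
    static int count(String input, String check) {
        if (input == null || check == null || check.isEmpty()) {
            return 0; // also avoids the divide-by-zero
        }
        // Pattern.quote makes check literal; (?s) lets the dot cross line breaks;
        // \z matches only at the very end of the input.
        String regex = "(?s).*?(" + Pattern.quote(check) + "|\\z)";
        return input.replaceAll(regex, "$1").length() / check.length();
    }

    public static void main(String[] args) {
        System.out.println(count("a.b.c", "."));   // 2
        System.out.println(count("x\nx\n", "x"));  // 2
    }
}
```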
A flaw that I can immediately think of is the case where your inputSentence consists only of a single occurrence of checkChar. In this case split() will return an empty array and your count will be -1 instead of 1.
An example interaction:
Enter a sentence
onlyme
Enter the character to search for
onlyme
The character/sequence of characters 'onlyme' appear(s) '-1' time(s).
A better way would be to use the .indexOf() method of String to count the occurrences like this:
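int i = 0; // Position from which to search next
int count = 0; // Number of occurrences found so far
while ((i = inputSentence.indexOf(checkChar, i)) != -1) {
count++;
i = i + checkChar.length();
}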
split is the wrong approach for a number of reasons:
String.split takes a regular expression
Regular expressions have characters with special meanings, so you cannot use it for all characters (without escaping them). This requires an escaping function.
Performance: String.split is optimized for single characters. If this were not the case, you would be creating and compiling a regular expression every time. Still, String.split creates one object for the String[] and one object for each String in it, every time that you call it. And you have no use for these objects; all you want to know is the count. Although a future all-knowing HotSpot compiler might be able to optimize that away, the current one does not - it is roughly 10 times as slow as simply counting characters as below.
It will not count correctly if you have repeating instances of your checkChar
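The first and last points are easy to reproduce. The following sketch wraps the question's split-based count in a hypothetical helper (countBySplit is a name invented here) and runs it on a few inputs:

```java
final class SplitPitfalls {
    // The counting approach from the question: pieces minus one.
    static int countBySplit(String sentence, String check) {
        return sentence.split(check).length - 1;
    }

    public static void main(String[] args) {
        // Plain separator in the middle: works.
        System.out.println(countBySplit("a,b,c", ","));  // 2
        // "." is a regex metacharacter matching every character,
        // so all pieces are empty and get dropped.
        System.out.println(countBySplit("a.b.c", "."));  // -1, not 2
        // Trailing empty strings are discarded by split().
        System.out.println(countBySplit("abcc", "c"));   // 0, not 2
    }
}
```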
A better approach is much simpler: just go and count the characters in the string that match your checkChar. If you think about the steps you need to take to count characters, that's what you'd end up with by yourself:
public static int occurrences(String str, char checkChar) {
int count = 0;
for (int i = 0, l = str.length(); i < l; i++) {
if (str.charAt(i) == checkChar)
count++;
}
return count;
}
If you want to count the occurrences of a sequence of multiple characters, it becomes slightly trickier to write with some efficiency, because you don't want to create a new substring every time.
public static int occurrences(String str, String checkChars) {
int count = 0;
int offset = 0;
while ((offset = str.indexOf(checkChars, offset)) != -1) {
offset += checkChars.length();
count++;
}
return count;
}
That's still 10-12 times as fast as String.split() when matching a two-character string.
Warning: Performance timings are ballpark figures that depend on many circumstances. Since the difference is an order of magnitude, it's safe to say that String.split is slower in general. (Tests performed on jdk 1.8.0-b28 64-bit, using 10 million iterations, verified that results were stable and the same with and without -Xcomp, after performing tests 10 times in same JVM instances.)
I have a set of elements of size about 100-200. Let a sample element be X.
Each of the elements is a set of strings (number of strings in such a set is between 1 and 4). X = {s1, s2, s3}
For a given input string (about 100 characters), say P, I want to test whether any of the X is present in the string.
X is present in P iff, for all s belonging to X, s is a substring of P.
The set of elements is available for pre-processing.
I want this to be as fast as possible within Java. Possible approaches which do not fit my requirements:
Checking whether all the strings s are substring of P seems like a costly operation
Because s can be any substring of P (not necessarily a word), I cannot use a hash of words
I cannot directly use regex as s1, s2, s3 can be present in any order and all of the strings need to be present as substring
Right now my approach is to construct a huge regex out of each X with all possible permutations of the order of strings. Because number of elements in X <= 4, this is still feasible. It would be great if somebody can point me to a better (faster/more elegant) approach for the same.
Please note that the set of elements is available for pre-processing and I want the solution in java.
You can use regex directly:
Pattern regex = Pattern.compile(
"^ # Anchor search to start of string\n" +
"(?=.*s1) # Check if string contains s1\n" +
"(?=.*s2) # Check if string contains s2\n" +
"(?=.*s3) # Check if string contains s3",
Pattern.DOTALL | Pattern.COMMENTS);
Matcher regexMatcher = regex.matcher(subjectString);
foundMatch = regexMatcher.find();
foundMatch is true if all three substrings are present in the string.
Note that you might need to escape your "needle strings" if they could contain regex metacharacters.
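Since the question has 100-200 such sets available for pre-processing, the pattern can be built once per set. Below is a minimal sketch of that, with Pattern.quote taking care of the escaping. The class and method names are assumptions for the example.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.regex.Pattern;

final class AllSubstringsRegex {
    // Builds ^(?=.*s1)(?=.*s2)... with every needle quoted literally.
    static Pattern allOf(Collection<String> needles) {
        StringBuilder regex = new StringBuilder("^");
        for (String needle : needles) {
            regex.append("(?=.*").append(Pattern.quote(needle)).append(")");
        }
        return Pattern.compile(regex.toString(), Pattern.DOTALL);
    }

    public static void main(String[] args) {
        Pattern x = allOf(Arrays.asList("a.b", "c"));
        System.out.println(x.matcher("xc a.b").find()); // true
        System.out.println(x.matcher("xc aXb").find()); // false: "." is literal here
    }
}
```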
It sounds like you're prematurely optimising your code before you've actually discovered a particular approach is actually too slow.
The nice property of your set of strings is that the string must contain all elements of X as substrings, meaning we can fail fast if we find one element of X that is not contained within P. This might turn out to be a better time-saving approach than others, especially if the elements of X are typically longer than a few characters and contain no or only a few repeating characters. For instance, a regex engine need only check 20 characters in a 100-character string when checking for the presence of a 5-character string with non-repeating characters (e.g. coast). And since X has 100-200 elements you really, really want to fail fast if you can.
My suggestion would be to sort the strings in order of length and check for each string in turn, stopping early if one string is not found.
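A plain-Java sketch of this suggestion follows. It assumes "order of length" means longest first (on the guess that longer strings are the least likely to occur, so they fail soonest) and uses String.contains for the substring test. All names are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FailFastMatcher {
    // Pre-processing step: sort each element's strings, longest first (assumption).
    static List<List<String>> preprocess(List<List<String>> elements) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> x : elements) {
            List<String> sorted = new ArrayList<>(x);
            sorted.sort((a, b) -> Integer.compare(b.length(), a.length()));
            result.add(sorted);
        }
        return result;
    }

    // X is present in p only if every string of X is a substring of p.
    static boolean isPresent(List<String> x, String p) {
        for (String s : x) {
            if (!p.contains(s)) {
                return false; // stop at the first missing string
            }
        }
        return true;
    }

    static boolean anyPresent(List<List<String>> elements, String p) {
        for (List<String> x : elements) {
            if (isPresent(x, p)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> elements = preprocess(Arrays.asList(
                Arrays.asList("sea", "coast"), Arrays.asList("hill")));
        System.out.println(anyPresent(elements, "the coast by the sea")); // true
        System.out.println(anyPresent(elements, "coast only"));           // false
    }
}
```

Measuring this against the regex versions on real data would show whether anything more elaborate is needed.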
Looks like a perfect case for the Rabin–Karp algorithm:
Rabin–Karp is inferior for single pattern searching to Knuth–Morris–Pratt algorithm, Boyer–Moore string search algorithm and other faster single pattern string searching algorithms because of its slow worst case behavior. However, Rabin–Karp is an algorithm of choice for multiple pattern search.
When the preprocessing time doesn't matter, you could create a hash table which maps every one-letter, two-letter, three-letter etc. combination which occurs in at least one string to a list of strings in which it occurs.
The algorithm to index a string would look like this (untested):
HashMap<String, Set<String>> indexes = new HashMap<String, Set<String>>();
for (int pos = 0; pos < string.length(); pos++) {
// sublen starts at 1 so that the empty substring is never indexed
for (int sublen = 1; sublen <= string.length() - pos; sublen++) {
String substring = string.substring(pos, pos + sublen); // substring() takes a start and an end index
Set<String> stringsForThisKey = indexes.get(substring);
if (stringsForThisKey == null) {
stringsForThisKey = new HashSet<String>();
indexes.put(substring, stringsForThisKey);
}
stringsForThisKey.add(string);
}
}
Indexing each string that way would be quadratic in the length of the string, but it only needs to be done once for each string.
But the result would be constant-speed access to the list of strings in which a specific string occurs.
You are probably looking for the Aho-Corasick algorithm, which constructs an automaton (trie-like) from the set of strings (the dictionary) and tries to match the input string against the dictionary using this automaton.
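For orientation, a compact and unoptimised sketch of such an automaton is given below. It is written from the textbook description (trie plus failure links built breadth-first), not taken from any particular library, and all names are made up. findAll returns the set of dictionary strings that occur in the text, so an element X would be present when that set contains all of X.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class AhoCorasick {
    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        final List<String> outputs = new ArrayList<>(); // patterns ending here
        Node fail;                                      // longest proper suffix state
    }

    private final Node root = new Node();

    AhoCorasick(Collection<String> patterns) {
        // 1. Build the trie.
        for (String pattern : patterns) {
            if (pattern.isEmpty()) continue;
            Node node = root;
            for (char c : pattern.toCharArray()) {
                node = node.next.computeIfAbsent(c, k -> new Node());
            }
            node.outputs.add(pattern);
        }
        // 2. Breadth-first pass to set failure links and inherit outputs.
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Map.Entry<Character, Node> entry : current.next.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node f = current.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null) ? target : root;
                child.outputs.addAll(child.fail.outputs);
                queue.add(child);
            }
        }
    }

    // Scans the text once and returns every pattern that occurs in it.
    Set<String> findAll(String text) {
        Set<String> found = new HashSet<>();
        Node node = root;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            while (node != root && !node.next.containsKey(c)) node = node.fail;
            Node step = node.next.get(c);
            node = (step != null) ? step : root;
            found.addAll(node.outputs);
        }
        return found;
    }

    public static void main(String[] args) {
        AhoCorasick ac = new AhoCorasick(Arrays.asList("he", "she", "his", "hers"));
        Set<String> found = ac.findAll("ushers");
        System.out.println(found.containsAll(Arrays.asList("she", "hers"))); // true
        System.out.println(found.contains("his"));                            // false
    }
}
```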
You might want to consider using a "Suffix Tree" as well. I haven't used this code, but there is one described here
I have used proprietary implementations (that I no longer even have access to) and they are very fast.
One way is to generate every possible substring and add this to a set. This is pretty inefficient.
Instead you can create all the strings from any point to the end into a NavigableSet and search for the closest match. If the closest match starts with the string you are looking for, you have a substring match.
static class SubstringMatcher {
final NavigableSet<String> set = new TreeSet<String>();
SubstringMatcher(Set<String> strings) {
for (String string : strings) {
for (int i = 0; i < string.length(); i++)
set.add(string.substring(i));
}
// remove duplicates.
String last = "";
for (String string : set.toArray(new String[set.size()])) {
if (string.startsWith(last))
set.remove(last);
last = string;
}
}
public boolean findIn(String s) {
String s1 = set.ceiling(s);
return s1 != null && s1.startsWith(s);
}
}
public static void main(String... args) {
Set<String> strings = new HashSet<String>();
strings.add("hello");
strings.add("there");
strings.add("old");
strings.add("world");
SubstringMatcher sm = new SubstringMatcher(strings);
System.out.println(sm.set);
for (String s : "ell,he,ow,lol".split(","))
System.out.println(s + ": " + sm.findIn(s));
}
prints
[d, ello, ere, hello, here, ld, llo, lo, old, orld, re, rld, there, world]
ell: true
he: true
ow: false
lol: false
I have a (huge) set of similar data files. The set is constantly growing. The size of a single file is about 10K. Each file must be compressed on its own. The compression is done with the zlib library, which is used by the java.util.zip.Deflater class. When passing a dictionary to the Deflate algorithm using setDictionary, I can improve the compression ratio.
Is there a way (algorithm) to find the 'optimal' dictionary, i.e. a dictionary with the overall optimal compression ratio?
See zlib manual
John Reiser explained on comp.compression:
For the dictionary: make a histogram of short substrings, sort by payoff (number of occurrences times number of bits saved when compressed) and put the highest-payoff substrings into the dictionary. For example, if k is the length of the shortest substring that can be compressed (usually 3==k or 2==k), then make a histogram of all the substrings of lengths k, 1+k, 2+k, and 3+k. Of course there is some art to placing those substrings into the dictionary, taking advantage of substrings, overlapping, short strings nearer to the high-address end, etc.
The Linux kernel uses a similar technique to compress names of symbols that are used for printing backtraces of the subroutine calling stack. See the file scripts/kallsyms.c. For instance, https://code.woboq.org/linux/linux/scripts/kallsyms.c.html
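A very crude Java sketch of the histogram-and-payoff recipe quoted above is shown here. Several things in it are assumptions for the sake of the example: payoff is approximated as occurrences times substring length (a stand-in for "bits saved"), samples are treated as strings instead of raw bytes, substrings seen only once are dropped, and the "art" of overlapping entries is reduced to skipping candidates already contained in a chosen one. The result places the highest-payoff substrings at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryBuilder {
    static String build(List<String> samples, int k, int maxSize) {
        // Histogram of all substrings of lengths k, k+1, k+2 and k+3.
        Map<String, Integer> counts = new HashMap<>();
        for (String sample : samples) {
            for (int len = k; len <= k + 3; len++) {
                for (int pos = 0; pos + len <= sample.length(); pos++) {
                    counts.merge(sample.substring(pos, pos + len), 1, Integer::sum);
                }
            }
        }
        // Rank by assumed payoff (count * length), highest first.
        List<Map.Entry<String, Integer>> ranked = new ArrayList<>(counts.entrySet());
        ranked.removeIf(e -> e.getValue() < 2);
        ranked.sort((a, b) -> {
            int payoffA = a.getValue() * a.getKey().length();
            int payoffB = b.getValue() * b.getKey().length();
            return payoffA != payoffB
                    ? Integer.compare(payoffB, payoffA)
                    : a.getKey().compareTo(b.getKey());
        });
        // Pick candidates until the size budget is used up.
        List<String> chosen = new ArrayList<>();
        int size = 0;
        for (Map.Entry<String, Integer> entry : ranked) {
            String candidate = entry.getKey();
            if (size + candidate.length() > maxSize) continue;
            boolean covered = false;
            for (String c : chosen) {
                if (c.contains(candidate)) { covered = true; break; }
            }
            if (!covered) {
                chosen.add(candidate);
                size += candidate.length();
            }
        }
        // Highest payoff goes last, i.e. nearest to the data being compressed.
        Collections.reverse(chosen);
        return String.join("", chosen);
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList("abcabcabc", "xxabcxx"), 3, 32));
    }
}
```

Whether a dictionary built like this actually helps is something to measure on your own files, for example by compressing a held-out set with and without it.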
The zlib manual recommends placing the most common occurrences at the end of the dictionary.
The dictionary should consist of strings (byte sequences) that are likely to be encountered later in the data to be compressed, with the most commonly used strings preferably put towards the end of the dictionary. Using a dictionary is most useful when the data to be compressed is short and can be predicted with good accuracy; the data can then be compressed better than with the default empty dictionary.
This is because LZ77 uses a sliding window, so substrings near the end of the dictionary remain reachable further into your stream of data than the first few.
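On the Java side, using such a dictionary might look like the sketch below. It relies only on java.util.zip.Deflater and Inflater; the class name, buffer size and sample strings are assumptions. Note that the decompressing side has to supply the very same dictionary when the Inflater asks for it.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

final class DictionaryRoundTrip {
    static byte[] compress(byte[] data, byte[] dictionary) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setDictionary(dictionary); // must be set before deflating
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    static byte[] decompress(byte[] compressed, byte[] dictionary) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[1024];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0) {
                    if (inflater.needsDictionary()) {
                        inflater.setDictionary(dictionary); // same bytes as on compression
                    } else if (inflater.needsInput()) {
                        throw new IllegalArgumentException("truncated input");
                    }
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("corrupt data or wrong dictionary", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] dictionary = "stringsdictionarythe".getBytes(StandardCharsets.UTF_8);
        byte[] data = "the dictionary should consist of strings".getBytes(StandardCharsets.UTF_8);
        byte[] packed = compress(data, dictionary);
        System.out.println(Arrays.equals(data, decompress(packed, dictionary)));
    }
}
```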
I'd play with generating the dictionary with a higher level language with good support of strings. A crude JavaScript example:
var str = "The dictionary should consist of strings (byte sequences) that"
+ " are likely to be encountered later in the data to be compressed,"
+ " with the most commonly used strings preferably put towards the "
+ "end of the dictionary. Using a dictionary is most useful when the"
+ " data to be compressed is short and can be predicted with good"
+ " accuracy; the data can then be compressed better than with the "
+ "default empty dictionary.";
// Extract words, remove punctuation (extra: replace(/\s/g, " "))
var words = str.replace(/[,\;.:\(\)]/g, "").split(" ").sort();
var wcnt = [], w = "", cnt = 0; // pairs, current word, current word count
for (var i = 0, cnt = 0, w = ""; i < words.length; i++) {
if (words[i] === w) {
cnt++; // another match
} else {
if (w !== "")
wcnt.push([cnt, w]); // Push a pair (count, word)
cnt = 1; // Start counting for this word
w = words[i]; // Start counting again
}
}
if (w !== "")
wcnt.push([cnt, w]); // Push last word
wcnt.sort(); // Greater matches at the end
for (var i in wcnt)
wcnt[i] = wcnt[i][1]; // Just take the words
var dict = wcnt.join("").slice(-70); // Join the words, take last 70 chars
Then dict is a string of 70 chars with:
rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe
You can try it copy-paste-run here (add: "print(dict)")
That's just whole words, not substrings. Also there are ways to overlap common substrings to save space on the dictionary.