I'm building a process which extracts data from 6 csv-style files and two poorly laid out .txt reports and builds output CSVs, and I'm fully aware that there's going to be some overhead searching through all that whitespace thousands of times, but I never anticipated converting about 50,000 records would take 12 hours.
Excerpt of my manual matching code (I know it's horrible that I use lists of tokens like that, but it was the best thing I could think of):
public static String lookup(Pattern tokenBefore,
List<String> tokensAfter)
{
String result = null;
while(_match(tokenBefore)) { // block until all input is read
if(id.hasNext())
{
result = id.next(); // capture the next token that matches
if(_matchImmediate(tokensAfter)) // try to match tokensAfter to this result
return result;
} else
return null; // end of file; no match
}
return null; // no matches
}
private static boolean _match(List<String> tokens)
{
return _match(tokens, true);
}
private static boolean _match(Pattern token)
{
if(token != null)
{
return (id.findWithinHorizon(token, 0) != null);
} else {
return false;
}
}
private static boolean _match(List<String> tokens, boolean block)
{
if(tokens != null && !tokens.isEmpty()) {
if(id.findWithinHorizon(tokens.get(0), 0) == null)
return false;
for(int i = 1; i <= tokens.size(); i++)
{
if (i == tokens.size()) { // matches all tokens
return true;
} else if(id.hasNext() && !id.next().matches(tokens.get(i))) {
break; // break to blocking behaviour
}
}
} else {
return true; // empty list always matches
}
if(block)
return _match(tokens); // loop until we find something or nothing
else
return false; // return after just one attempted match
}
private static boolean _matchImmediate(List<String> tokens)
{
if(tokens != null) {
for(int i = 0; i <= tokens.size(); i++)
{
if (i == tokens.size()) { // matches all tokens
return true;
} else if(!id.hasNext() || !id.next().matches(tokens.get(i))) {
return false; // doesn't match, or end of file
}
}
return false; // we have some serious problems if this ever gets called
} else {
return true; // empty list always matches
}
}
Basically, I'm wondering how I would work in an efficient string search (Boyer-Moore or similar). My Scanner id is scanning a java.lang.String; I figured buffering the file into memory would reduce I/O, since the search here is performed thousands of times on a relatively small file. The performance increase compared to scanning a BufferedReader(FileReader(File)) was probably less than 1%, and the process still looks to be taking a LONG time.
I've also traced execution, and the slowness of my overall conversion process is definitely between the first and last line of the lookup method. In fact, so much so that I ran a shortcut process to count the number of occurrences of various identifiers in the .csv-style files (I use 2 lookup methods, this is just one of them), and the process completed indexing approx 4 different identifiers for 50,000 records in less than a minute. Compared to 12 hours, that's instant.
Some notes (updated 6/6/2010):
I still need the pattern-matching behaviour for tokensBefore.
All ID numbers I need don't necessarily start at a fixed position in a line, but it's guaranteed that after the ID token is the name of the corresponding object.
I would ideally want to return a String, not the start position of the result as an int or something.
Anything to help me out, even if it saves 1ms per search, will help, so all input is appreciated. Thank you!
Usage scenario 1: I have a list of objects in file A which, in the old-style system, have an ID number that is not in file A. It is, however, POSSIBLY in another csv-style file (file B), or possibly in a .txt report (file C). Each of those also contains a lot of other information that is not useful here. So file B needs to be searched for the object's full name (1 token, since it would reside within the second column of any given line), and the first column of that line should then be the ID number. If that doesn't work, we have to split the search token by whitespace into separate tokens and search file C for those tokens instead.
Generalised code:
String field;
for (/* each record in file A */)
{
/* construct the rest of this object from file A info */
// now to find the ID, if we can
List<String> objectName = new ArrayList<String>(1);
objectName.add(Pattern.quote(thisObject.fullName));
field = lookup(objectSearchToken, objectName); // search file B
if(field == null) // not found in file B
{
lookupReset(false); // initialise scanner to check file C
objectName.clear(); // not using the full name
String[] tokens = thisObject.fullName.split(id.delimiter().pattern());
for(String s : tokens)
objectName.add(Pattern.quote(s));
field = lookup(objectSearchToken, objectName); // search file C
lookupReset(true); // back to file B
} else {
/* found it, file B specific processing here */
}
if(field != null) // found it in B or C
thisObject.ID = field;
}
The objectName tokens are all uppercase words with possible hyphens or apostrophes in them, separated by spaces (a person's name).
As per aioobe's answer, I have pre-compiled the regex for my constant search tokens, which in this case is just \r\n. The speedup noticed was about 20x in another one of the processes, where I compiled [0-9]{1,3}\\.[0-9]%|\r\n|0|[A-Z'-]+, although it was not noticed in the above code with \r\n. Working along these lines, it has me wondering:
Would it be better for me to match \r\n[^ ] if the only usable matches will be on lines beginning with a non-space character anyway? It may reduce the number of _match executions.
Another possible optimisation is this: concatenate all tokensAfter, and put a (.*) beforehand. It would reduce the number of regexes (all of which are literal anyway) that would be compiled by about 2/3, and also hopefully allow me to pull out the text from that grouping instead of keeping a "potential token" from every line with an ID on it. Is that also worth doing?
The above situation could be resolved if I could get java.util.Scanner to return the token previous to the current one after a call to findWithinHorizon.
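The combined-regex idea from the notes above might look roughly like the sketch below. It assumes (this is not stated in the question) that the ID is the first token on its line and is separated from the name by spaces or tabs; CombinedLookup, build and lookup are made-up names for illustration only. The capturing group plays the role of the (.*), so the ID comes straight out of the match and no "previous token" is needed from the Scanner.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedLookup {
    // Builds ONE pattern: start of line, captured ID token, then the
    // (literal) name tokens in order, each separated by spaces or tabs.
    static Pattern build(List<String> nameTokens) {
        StringBuilder sb = new StringBuilder("(?m)^(\\S+)");
        for (String token : nameTokens) {
            sb.append("[ \\t]+").append(Pattern.quote(token));
        }
        sb.append("(?!\\S)"); // the last name token must end here
        return Pattern.compile(sb.toString());
    }

    // Returns the ID found in front of the name, or null if there is none.
    static String lookup(String fileText, List<String> nameTokens) {
        Matcher m = build(nameTokens).matcher(fileText);
        return m.find() ? m.group(1) : null;
    }
}
```

Calling CombinedLookup.lookup(fileText, Arrays.asList("JOHN", "SMITH")) compiles one regex per record instead of one per token per line. Whether that is actually faster on the real files would need measuring.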
Something to start with: Every single time you run id.next().matches(tokens.get(i)) the following code is executed:
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
return m.matches();
Compiling a regular expression is non-trivial and you should consider compiling the patterns once and for all in your program:
pattern[i] = Pattern.compile(tokens.get(i));
And then simply invoke something like
pattern[i].matcher(str).matches()
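Put together, the advice above could be sketched like this. PrecompiledTokens, compileAll and matchesAll are hypothetical names, and the word list stands in for the tokens the Scanner would return; the point is only that Pattern.compile runs once per token instead of once per comparison.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class PrecompiledTokens {
    // Compile every token regex exactly once, up front.
    static List<Pattern> compileAll(List<String> tokens) {
        List<Pattern> patterns = new ArrayList<Pattern>(tokens.size());
        for (String token : tokens) {
            patterns.add(Pattern.compile(token));
        }
        return patterns;
    }

    // True if the first words match the patterns one by one, in order.
    static boolean matchesAll(List<Pattern> patterns, List<String> words) {
        if (words.size() < patterns.size()) {
            return false; // ran out of input before all tokens matched
        }
        for (int i = 0; i < patterns.size(); i++) {
            if (!patterns.get(i).matcher(words.get(i)).matches()) {
                return false;
            }
        }
        return true;
    }
}
```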
I am trying to run a MapReduce job on Hadoop which reads the fifth entry of a tab-delimited file (the fifth entry is the user review) and then does some sentiment analysis and word count on it.
However, as you know with user reviews, they usually include line breaks and empty lines. My code iterates through the words of each review to find keywords and check sentiment if keyword is found.
The problem is that as the code iterates through the review, it gives me an ArrayIndexOutOfBoundsException because of these line breaks and empty lines within one review.
I have tried using replaceAll("\r", " ") and replaceAll("\n", " ") to no avail.
I have also tried if(tokenizer.countTokens() == 2){
word.set(tokenizer.nextToken());}
else {
}
also to no avail. Below is my code:
public class KWSentiment_Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
ArrayList<String> keywordsList = new ArrayList<String>();
ArrayList<String> posWordsList = new ArrayList<String>();
ArrayList<String> tokensList = new ArrayList<String>();
int e;
@Override
public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String[] line = value.toString().split("\t");
String Review = line[4].replaceAll("[\\-\\+\\\\)\\.\\(\"\\{\\$\\^:,]", "").toLowerCase();
StringTokenizer tokenizer = new StringTokenizer(Review);
while (tokenizer.hasMoreTokens()) {
// 1- first read the review line and store the tokens in an arraylist, 2-
// iterate through review to check for KW if found
// 3-check if there's PosWord near (upto +3 and -2)
// 4- setWord & context.write 5- null the review line arraylist
String CompareString = tokenizer.nextToken();
tokensList.add(CompareString);
}
{
for (int i = 0; i < tokensList.size(); i++)
{
for (int j = 0; j < keywordsList.size(); j++) {
boolean flag = false;
if (tokensList.get(i).startsWith(keywordsList.get(j)) == true) {
for (int e = Math.max(0, i - 2); e < Math.min(tokensList.size(), i + 4); e++) {
if (posWordsList.contains(tokensList.get(e))) {
word.set(keywordsList.get(j));
context.write(word, one);
flag = true;
break; // breaks out of e loop }}
}
}
}
if (flag)
break;
}
}
tokensList.clear();
}
}
Expected results are such that:
Take these two cases of reviews where error occurs:
Case 1: "Beautiful and spacious!
I highly recommend this place and great host."
Case 2: "The place in general was really silent but we didn't feel stayed.
Aside from this, the bathroom is big and the shower is really nice but there problem. "
The system should read the whole review as one line and iterate through the words in it. However, it just stops as it finds a line break or an empty line as in case 2.
Case 1 should be read such as: "Beautiful and spacious! I highly recommend this place and great host."
Case 2 should be:"The place in general was really silent but we didn't feel stayed. Aside from this, the bathroom is big and the shower is really nice but there problem. "
I am running out of time and would really appreciate help here.
Thanks!
So, I hope I am understanding what you are trying to do...
If I am reading what you have above correctly, the value of 'value' passed into your map function above contains the delimited value that you would like to parse the user reviews out of. If that is the case, I believe we can make use of the escaping functionality in the opencsv library using tabs as your delimiting character instead of commas to correctly populate the user review field:
http://opencsv.sourceforge.net
In this example we are reading one line from the input that is passed in, parsing it into 'columns' based on the tab character, and placing the results in the 'nextLine' array. This will allow us to use the escaping functionality of the CSVReader without reading an actual file, using the value of the text passed into your map function instead.
StringReader reader = new StringReader(value.toString());
CSVReader csvReader = new CSVReader(reader, '\t', '\"', '\\', 0);
String [] nextLine = csvReader.readNext();
if(nextLine != null && nextLine.length >= 5) {
// Do some stuff
}
In the example that you pasted above, I think even that split("\t") will be problematic, as a tab within a user review splits the review into two fields, in addition to new lines being treated as new records. Both of these characters are legal, though, as long as they are inside a quoted value (as they should be in a properly escaped file, and as they are in your example). CSVReader should handle all of these.
Validate each line at the start of the map method, so that you know line[4] exists and isn't null.
if (value == null || value.toString() == null) {
return;
}
String[] line = value.toString().split("\t");
if (line == null || line.length < 5 || line[4] == null) {
return;
}
As for line breaks, you'll need to show some sample input. By default MapReduce passes each line into the map method independently, so if you do want to read multiple lines as one message, you'll have to write a custom InputFormat (with its own RecordReader), or pre-format your data so that all data for each review is on the same line.
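The column-count guard can be tried outside Hadoop with a plain-Java helper. This is only a sketch: FieldGuard and fifthField are hypothetical names, and it takes a String rather than a Hadoop Text. Passing -1 as the split limit keeps trailing empty columns, so a row whose last fields are blank still reports its real column count.

```java
class FieldGuard {
    // Returns the fifth tab-separated column, or null if the row is too short.
    static String fifthField(String row) {
        if (row == null) {
            return null;
        }
        String[] columns = row.split("\t", -1); // -1 keeps trailing empty columns
        return columns.length >= 5 ? columns[4] : null;
    }
}
```

A continuation line such as "I highly recommend this place and great host." has no tabs at all, so fifthField returns null for it and the mapper can skip it instead of throwing.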
Basically I want to create a program which simulates the 'Countdown' game on Channel 4. In effect, a user must input 9 letters and the program will search for the largest word in the dictionary that can be made from these letters. I think a tree structure would be better to go with than hash tables. I already have a file which contains the words in the dictionary and will be using file io.
This is my file io class:
public static void main(String[] args){
FileIO reader = new FileIO();
String[] contents = reader.load("dictionary.txt");
}
This is what I have so far in my Countdown class
public static void main(String[] args) throws IOException{
Scanner scan = new Scanner(System.in);
String letters = scan.nextLine();
}
I get totally lost from here. I know this is only the start, but I'm not looking for answers. I just want a small bit of help and maybe a pointer in the right direction. I'm new to Java and found this question in an interview book and thought I should give it a go.
Thanks in advance
Welcome to the world of Java :)
The first thing I see is that you have two main methods; you don't actually need that. In most cases your program will have a single entry point, which then does all its logic and handles user input and everything.
You're thinking of a tree structure which is good, though there might be a better idea to store this. Try this: http://en.wikipedia.org/wiki/Trie
What your program has to do is read all the words from the file line by line, and in this process build your data structure, the tree. When that's done you can ask the user for input and after the input is entered you can search the tree.
Since you asked specifically not to provide answers I won't put code here, but feel free to ask if you're unclear about something
There are only about 800,000 words in the English language, so an efficient solution would be to store those 800,000 words as 800,000 arrays of 26 1-byte integers that count how many times each letter is used in the word. For an input of 9 characters, you convert the query to the same 26-integer count format; a word can then be formed from the query letters if the query vector is greater than or equal to the word vector component-wise. You could easily process on the order of 100 queries per second this way.
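A minimal sketch of that count-vector comparison follows. LetterCounts and its methods are made-up names, int[] is used instead of 1-byte integers for simplicity, and the word counts are recomputed on every query rather than stored up front as the answer suggests, just to keep the example short. Characters outside a-z are ignored.

```java
import java.util.List;
import java.util.Locale;

class LetterCounts {
    // Counts how often each of 'a'..'z' occurs; other characters are ignored.
    static int[] counts(String s) {
        int[] c = new int[26];
        for (char ch : s.toLowerCase(Locale.ROOT).toCharArray()) {
            if (ch >= 'a' && ch <= 'z') {
                c[ch - 'a']++;
            }
        }
        return c;
    }

    // True if every letter count of the word fits within the available letters.
    static boolean canForm(int[] word, int[] available) {
        for (int i = 0; i < 26; i++) {
            if (word[i] > available[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns the longest word that can be formed, or null if none can.
    static String longest(List<String> dictionary, String letters) {
        int[] available = counts(letters);
        String best = null;
        for (String w : dictionary) {
            boolean longer = best == null || w.length() > best.length();
            if (longer && canForm(counts(w), available)) {
                best = w;
            }
        }
        return best;
    }
}
```

Unlike a set-membership test, this respects how many copies of each letter the player actually has.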
I would write a program that starts with all the two-letter words, then does the three-letter words, the four-letter words and so on.
When you do the two-letter words, you'll want some way of picking the first letter, then picking the second letter from what remains. You'll probably want to use recursion for this part. Lastly, you'll check it against the dictionary. Try to write it in a way that means you can re-use the same code for the three-letter words.
I believe, the power of Regular Expressions would come in handy in your case:
1) Create a regular expression string with a symbol class like: /^[abcdefghi]*$/ with your letters inside instead of "abcdefghi".
2) Use that regular expression as a filter to get a strings array from your text file.
3) Sort it by length. The longest word is what you need!
Check the Regular Expressions Reference for more information.
UPD: Here is a good Java Regex Tutorial.
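The three steps can be sketched as below. RegexFilter and longestMatch are hypothetical names, and the letters are assumed to be plain a-z so they can be dropped into the character class unescaped. One caveat worth keeping in mind: a character class only checks which letters appear, not how many times, so this filter accepts words that reuse a letter more often than it was supplied.

```java
import java.util.List;
import java.util.regex.Pattern;

class RegexFilter {
    // Returns the longest word built only from the given letters
    // (repeats allowed), or null if no word qualifies.
    static String longestMatch(List<String> words, String letters) {
        if (!letters.matches("[a-z]+")) {
            throw new IllegalArgumentException("letters must be a-z only");
        }
        // matches() anchors at both ends, so ^ and $ are not needed here.
        Pattern allowed = Pattern.compile("[" + letters + "]*");
        String best = null;
        for (String w : words) {
            boolean longer = best == null || w.length() > best.length();
            if (longer && allowed.matcher(w).matches()) {
                best = w;
            }
        }
        return best;
    }
}
```

For example, with the letters "abn" this returns "banana" even though only one 'a' was supplied, so a per-letter count check is still needed for the real game.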
A first approach could be using a tree with all the letters present in the wordlist.
If a node is the end of a word, then it is marked as an end-of-word node.
In the picture above, the longest word is banana. But there are other words, like ball, ban, or banal.
So, a node must have:
A character
If it is the end of a word
A list of children. (max 26)
The insertion algorithm is very simple: In each step we "cut" the first character of the word until the word has no more characters.
public class TreeNode {
public char c;
private boolean isEndOfWord = false;
private TreeNode[] children = new TreeNode[26];
public TreeNode(char c) {
this.c = c;
}
public void put(String s) {
if (s.isEmpty())
{
this.isEndOfWord = true;
return;
}
char first = s.charAt(0);
int pos = position(first);
if (this.children[pos] == null)
this.children[pos] = new TreeNode(first);
this.children[pos].put(s.substring(1));
}
public String search(char[] letters) {
String word = "";
String w = "";
for (int i = 0; i < letters.length; i++)
{
TreeNode child = children[position(letters[i])];
if (child != null)
w = child.search(letters);
//this is not efficient. It should be optimized.
if (w.contains("%")
&& w.substring(0, w.lastIndexOf("%")).length() > word
.length())
word = w;
}
// if a node its end-of-word we add the special char '%'
return c + (this.isEndOfWord ? "%" : "") + word;
}
//if 'a' returns 0, if 'b' returns 1...etc
public static int position(char c) {
return ((byte) c) - 97;
}
}
Example:
public static void main(String[] args) {
//root
TreeNode t = new TreeNode('R');
//for skipping words with "'" in the wordlist
Pattern p = Pattern.compile(".*\\W+.*");
int nw = 0;
try (BufferedReader br = new BufferedReader(new FileReader(
"files/wordsEn.txt")))
{
for (String line; (line = br.readLine()) != null;)
{
if (p.matcher(line).find())
continue;
t.put(line);
nw++;
}
// line is not visible here.
br.close();
System.out.println("number of words : " + nw);
String res = null;
// substring (1) because of the root
res = t.search("vuetsrcanoli".toCharArray()).substring(1);
System.out.println(res.replace("%", ""));
}
catch (Exception e)
{
// TODO Auto-generated catch block
e.printStackTrace();
}
}
Output:
number of words : 109563
counterrevolutionaries
Notes:
The wordlist is taken from here
the reading part is based on another SO question : How to read a large text file line by line using Java?
I'm trying to make the following algorithm work. What I want to do is split the given string into substrings consisting of either a series of numbers or an operator.
So for this string = "22+2", I would get an array in which [0]="22" [1]="+" and [2]="2".
This is what I have so far, but I get an index out of bounds exception:
public static void main(String[] args) {
String string = "114+034556-2";
int k,a,j;
k=0;a=0;j=0;
String[] subStrings= new String[string.length()];
while(k<string.length()){
a=k;
while(((int)string.charAt(k))<=57&&((int)string.charAt(k))>=48){
k++;}
subStrings[j]=String.valueOf(string.subSequence(a,k-1)); //exception here
j++;
subStrings[j]=String.valueOf(string.charAt(k));
j++;
}}
I would rather be told what's wrong with my reasoning than be offered an alternative, but of course I will appreciate any kind of help.
I'm deliberately not answering this question directly, because it looks like you're trying to figure out a solution yourself. I'm also assuming that you're purposefully not using the split or the indexOf functions, which would make this pretty trivial.
A few things I've noticed:
If your input string is long, you'd probably be better off working with a char array and StringBuilder, so you can avoid memory problems arising from immutable strings
Have you tried catching the exception, or printing out what the value of k is that causes your index out of bounds problem?
Have you thought through what happens when your string terminates? For instance, have you run this through a debugger when the input string is "454" or something similarly trivial?
You could use a regular expression to split the numbers from the operators using lookahead and lookbehind assertions
String equation = "22+2";
String[] tmp = equation.split("(?=[+\\-/])|(?<=[+\\-/])");
System.out.println(Arrays.toString(tmp));
If you're interested in the general problem of parsing, then I'd recommend thinking about it on a character-by-character level, and moving through a finite state machine with each new character. (Often you'll need a terminator character that cannot occur in the input, such as the \0 in C strings, but we can get around that.)
In this case, you might have the following states:
initial state
just parsed a number.
just parsed an operator.
The characters determine the transitions from state to state:
You start in state 1.
Numbers transition into state 2.
Operators transition into state 3.
The current state can be tracked with something like an enum, changing the state after each character is consumed.
With that setup, then you just need to loop over the input string and switch on the current state.
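// Compilable version of the sketch; needs java.util.ArrayList and java.util.List.
enum State { INIT_STATE, JUST_NUM, JUST_OP }

static List<String> parse(String inputString) {
    State state = State.INIT_STATE;
    StringBuilder acc = new StringBuilder();
    List<String> subStrs = new ArrayList<String>();
    for (char c : inputString.toCharArray()) {
        State next;
        if (Character.isDigit(c)) {
            next = State.JUST_NUM;
        } else {
            next = State.JUST_OP;
        }
        if (state != next && acc.length() > 0) {
            // state change, so save and reset the accumulator:
            subStrs.add(acc.toString());
            acc.setLength(0);
        }
        // in both cases the current character joins the accumulator:
        acc.append(c);
        // update the state
        state = next;
    }
    // flush whatever is left once the input ends:
    if (acc.length() > 0) {
        subStrs.add(acc.toString());
    }
    return subStrs;
}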
With a structure like that, you can more easily add new features / constructs by adding new states and updating the behavior depending on the current state and incoming character. For example, you could add a check to throw errors if letters appear in the string (and include offset locations, if you wanted to track that).
If your criterion is simply "anything that is not a number", then you can use some simple regex stuff, if you don't mind working with parallel arrays:
String[] operands = string.split("\\D"); // split around anything that is NOT a digit
char[] operators = string.replaceAll("\\d", "").toCharArray(); // replace all digits with "" and turn into char array
String input="22+2-3*212/21+23";
String number="";
String op="";
List<String> numbers=new ArrayList<String>();
List<String> operators=new ArrayList<String>();
for(int i=0;i<input.length();i++){
char c=input.charAt(i);
if(i==input.length()-1){
number+=String.valueOf(c);
numbers.add(number);
}else if(Character.isDigit(c)){
number+=String.valueOf(c);
}else{
if(c=='+' || c=='-' || c=='*' ||c=='/'){
op=String.valueOf(c);
operators.add(op);
numbers.add(number);
op="";
number="";
}
}
}
for(String x:numbers){
System.out.println("number="+x+",");
}
for(String x:operators){
System.out.println("operators="+x+",");
}
This will be the output (println actually puts each entry on its own line; they are joined here for brevity):
number=22,number=2,number=3,number=212,number=21,number=23,operators=+,operators=-,operators=*,operators=/,operators=+,
Is it possible to have multiple arguments for a .contains? I am searching an array to ensure that each string contains one of several characters. I've hunted all over the web, but found nothing useful.
for(String s : fileContents) {
if(!s.contains(syntax1) && !s.contains(syntax2)) {
found.add(s);
}
}
for (String s : found) {
System.out.println(s); // print array to cmd
JOptionPane.showMessageDialog(null, "Note: Syntax errors found.");
}
How can I do this with multiple arguments? I've also tried a bunch of ||s on their own, but that doesn't seem to work either.
No, it can't have multiple arguments, but the || should work.
!s.contains(syntax1+"") || !s.contains(syntax2+"") means s doesn't contain syntax1 or it doesn't contain syntax2.
This is just a guess but you might want s contains either of the two:
s.contains(syntax1+"") || s.contains(syntax2+"")
or maybe s contains both:
s.contains(syntax1+"") && s.contains(syntax2+"")
or maybe s contains neither of the two:
!s.contains(syntax1+"") && !s.contains(syntax2+"")
If syntax1 and syntax2 are already strings, you don't need the +""'s.
I believe s.contains("") should always return true, so you can remove it.
It seems that what you described can be done with a regular expression.
In regular expression, the operator | marks you need to match one of several choices.
For example, the regex (a|b) means a or b.
The regex ".*(a|b).*" means a string that contains a or b, and other than that, anything goes (it assumes a one-line string, but that can be dealt with easily as well if needed).
Code example:
String s = "abc";
System.out.println(s.matches(".*(a|d).*")); // true: contains a
s = "abcd";
System.out.println(s.matches(".*(a|d).*")); // true: contains a and d
s = "fgh";
System.out.println(s.matches(".*(a|d).*")); // false: contains neither
Regular expressions are a powerful tool that I recommend learning. Have a look at this tutorial, you might find it helpful.
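If the strings being searched for may contain regex metacharacters (brackets, braces, asterisks and so on, which seems likely for "syntax" checks), the alternation can be built from quoted literals. This is a sketch with made-up names (ContainsAnyRegex, anyOf); Pattern.quote and Matcher.find are standard java.util.regex calls, and find() removes the need for the surrounding .* pieces.

```java
import java.util.regex.Pattern;

class ContainsAnyRegex {
    // Builds a pattern that finds any one of the given literal strings.
    static Pattern anyOf(String... needles) {
        if (needles.length == 0) {
            throw new IllegalArgumentException("need at least one string");
        }
        StringBuilder sb = new StringBuilder();
        for (String needle : needles) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(needle)); // treat the text literally
        }
        return Pattern.compile(sb.toString());
    }
}
```

Compile it once, for example Pattern syntax = ContainsAnyRegex.anyOf(";", "{"), and then call syntax.matcher(s).find() for each line.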
There is no such thing as a multiple-argument contains.
If you need to validate that a list of strings is included in some other string, you must iterate through them all and check.
public static boolean containsAll(String input, String... items) {
if(input == null) throw new IllegalArgumentException("Input must not be null"); // We validate the input
if(input.length() == 0) {
return items.length == 0; // if empty contains nothing then true, else false
}
boolean result = true;
for(String item : items) {
result = result && input.contains(item);
}
return result;
}
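The "any of" counterpart follows the same shape. The sketch below uses hypothetical names (ContainsAny, containsAny, missingAll) and mirrors the loop from the question: it collects the lines that contain none of the given strings.

```java
import java.util.ArrayList;
import java.util.List;

class ContainsAny {
    // True as soon as one of the items occurs in the input.
    static boolean containsAny(String input, String... items) {
        for (String item : items) {
            if (input.contains(item)) {
                return true;
            }
        }
        return false;
    }

    // Collects the lines that contain none of the items.
    static List<String> missingAll(List<String> lines, String... items) {
        List<String> found = new ArrayList<String>();
        for (String s : lines) {
            if (!containsAny(s, items)) {
                found.add(s);
            }
        }
        return found;
    }
}
```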
I have here a String that contains the source code of a class. Now I have another String that contains the full name of a method in this class. The method name is e.g.
public void (java.lang.String test)
Now I want to retrieve the source code of this method from the string with the class' source code. How can I do that? With String#indexOf(methodName) I can find the start of the method source code, but how do I find the end?
====EDIT====
I used the count curly-braces approach:
internal void retrieveSourceCode()
{
int startPosition = parentClass.getSourceCode().IndexOf(this.getName());
if (startPosition != -1)
{
String subCode = parentClass.getSourceCode().Substring(startPosition, parentClass.getSourceCode().Length - startPosition);
for (int i = 0; i < subCode.Length; i++)
{
String c = subCode.Substring(0, i);
int open = c.Split('{').Count() - 1;
int close = c.Split('}').Count() - 1;
if (open == close && open != 0)
{
sourceCode = c;
break;
}
}
}
Console.WriteLine("SourceCode for " + this.getName() + "\n" + sourceCode);
}
This works more or less fine. However, if a method is defined without a body, it fails. Any hints on how to solve that?
Counting braces and stopping when the count decreases to 0 is indeed the way to go. Of course, you need to take into account braces that appear as literals and should thus not be counted, e.g. braces in comments and strings.
Overall this is kind of a thankless endeavour, comparable in complexity to say, building a command line parser if you want to get it working really reliably. If you know you can get away with it you could cut some corners and just count all the braces, although I do not recommend it.
Update:
Here's some sample code to do the brace counting. As I said, this is a thankless job and there are tons of details you have to get right (in essence, you're writing a mini-lexer). It's in C#, as this is the closest to Java I can write code in with confidence.
The code below is not complete and probably not 100% correct (for example: verbatim strings in C# do not allow spaces between the @ and the opening quote, but did I know that for a fact or just forgot about it?).
// sourceCode is a string containing all the source file's text
var sourceCode = "...";
// methodStartIndex is the index of the char AFTER the opening brace
// for the method we are interested in
var methodStartIndex = 42;
var openBraces = 1;
var insideLiteralString = false;
var insideVerbatimString = false;
var insideBlockComment = false;
var lastChar = ' '; // White space is ignored by the C# parser,
// so a space is a good "neutral" character
for (var i = methodStartIndex; openBraces > 0; ++i) {
var ch = sourceCode[i];
switch (ch) {
case '{':
if (!insideBlockComment && !insideLiteralString && !insideVerbatimString) {
++openBraces;
}
break;
case '}':
if (!insideBlockComment && !insideLiteralString && !insideVerbatimString) {
--openBraces;
}
break;
case '"':
if (insideBlockComment) {
continue;
}
if (insideLiteralString) {
// "Step out" of the string if this is the closing quote
insideLiteralString = lastChar == '\\'; // stay inside only if the quote was escaped
}
else if (insideVerbatimString) {
// If this quote is part of a two-quote pair, do NOT step out
// (it means the string contains a literal quote)
// This can throw, but only for source files with syntax errors
// I'm ignoring this possibility here...
var nextCh = sourceCode[i + 1];
if (nextCh == '"') {
++i; // skip that next quote
}
else {
insideVerbatimString = false;
}
}
else {
if (lastChar == '@') {
insideVerbatimString = true;
}
else {
insideLiteralString = true;
}
}
break;
case '/':
if (insideLiteralString || insideVerbatimString) {
continue;
}
// TODO: parse this
// It can start a line comment, if followed by /
// It can start a block comment, if followed by *
// It can end a block comment, if preceded by *
// Line comments are intended to be handled by just incrementing i
// until you see a CR and/or LF, hence no insideLineComment flag.
break;
}
lastChar = ch;
}
// From the values of methodStartIndex and i we can now do sourceCode.Substring and get the method source
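For Java source, the same brace counting might be sketched as follows. MethodExtractor and extract are hypothetical names, and this is a rough illustration under stated limits: it skips line comments, block comments, and string and char literals, but does not handle Java text blocks ("""). It also treats a ';' seen before any '{' as the end of a body-less declaration, which addresses the case the question says fails.

```java
class MethodExtractor {
    // Returns the method source beginning at 'start' (the index of the
    // signature), or null if the braces never balance.
    static String extract(String src, int start) {
        int depth = 0;
        boolean seenBrace = false;
        int i = start;
        while (i < src.length()) {
            char ch = src.charAt(i);
            boolean hasNext = i + 1 < src.length();
            if (ch == '/' && hasNext && src.charAt(i + 1) == '/') {
                int nl = src.indexOf('\n', i);       // line comment
                i = (nl < 0) ? src.length() : nl;
                continue;
            } else if (ch == '/' && hasNext && src.charAt(i + 1) == '*') {
                int end = src.indexOf("*/", i + 2);  // block comment
                i = (end < 0) ? src.length() : end + 2;
                continue;
            } else if (ch == '"' || ch == '\'') {
                // string or char literal: move to the closing quote
                i++;
                while (i < src.length() && src.charAt(i) != ch) {
                    if (src.charAt(i) == '\\') {
                        i++;                         // skip the escaped char
                    }
                    i++;
                }
            } else if (ch == '{') {
                depth++;
                seenBrace = true;
            } else if (ch == '}') {
                depth--;
                if (seenBrace && depth == 0) {
                    return src.substring(start, i + 1);
                }
            } else if (ch == ';' && !seenBrace) {
                return src.substring(start, i + 1);  // declaration without body
            }
            i++;
        }
        return null;
    }
}
```

The start index would come from something like src.indexOf(methodName), as in the question.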
Have a look at:- Parser for C#
It recommends using NRefactory to parse and tokenise source code, you should be able to use that to navigate your class source and pick out methods.
You will probably have to know the order in which the methods are listed in the code file, so that you can look for the closing brace } of the method's scope, which should be right above the start of the next method.
So your code might look like:
nStartOfMethod = String.indexOf(methodName)
nStartOfNextMethod = String.indexOf(NextMethodName)
Look for .LastIndexOf(yourMethodTerminator /*probably a}*/,...) between a string of nStartOfMethod and nStartOfNextMethod
In this case, if you don't know the order of the methods, you might end up skipping over a method in between while looking for the ending brace.