Java String variable becomes corrupt at runtime - java

Hey all, I've got a weird bug in a small Java program I'm writing for a school project. I am well aware of how sloppy the code is (it is still a work in progress), but somehow my String variable "year" becomes corrupted after breaking out of a loop. I am using Java with MapReduce and Hadoop to count unigrams and bigrams and sort them by year/author. Using print statements, I have determined that "year" is indeed set when I assign temp to it, but at any point after the loop in which it is set, the variable is corrupted somehow. The year number is replaced with a huge amount of whitespace (at least that's how it appears in the console). I have tried year=year.trim() and the regex year=year.replaceAll("[^0-9]",""); neither works. Anybody have any ideas?
I have only included the map class, as that is where the problem is. Note that the text files being parsed are from Project Gutenberg. I am working with a small sample of about 40 random texts from the project.
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class WordCount {
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public synchronized void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
String line = value.toString();
line = line.toLowerCase();
line = line.replaceAll("[^0-9a-z\\s-*]", "").replaceAll("\\s+", " ");
String year=""; // variable to hold date -- somehow this gets cleared out before I need it
String temp=""; // variable to hold each token
StringTokenizer tokenizer = new StringTokenizer(line); // Splits document into individual words for parsing
while (tokenizer.hasMoreTokens()) {
temp = tokenizer.nextToken(); // grab first token of document
if (temp.equals("***")) // hit first triple star, break out and move to next while loop
break;
if (temp.equals("release")&&tokenizer.hasMoreTokens()){ // if token is "release" followed by "date", extract year
if (tokenizer.nextToken().equals("date")){
while(tokenizer.hasMoreTokens()){
temp = tokenizer.nextToken();
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(0))){
if (temp.length()>3||Integer.parseInt(temp)>=40){
year = temp; // set year = token if token is a number greater than 40 or has >3 digits
break;
}
}
}
if (!year.equals("")){ //if date isn't an empty string, it means we have date and break
break; // out of first while loop
}
}
System.out.println("\n"+year+"\n");// year will still print here
}
} // but it is gone if I try to print past this point
}
while (tokenizer.hasMoreTokens()){ // keep grabbing tokens until hit another "***", then break and
temp = tokenizer.nextToken(); // can begin counting unigrams/bigrams
if (temp.equals("***"))
break;
}
line = line.substring(line.indexOf(temp)); // form a new document starting from location of previous "***"
line = line.replaceAll("[^a-z\\s-]", "").replaceAll("\\s+", " ");
line = line.replaceAll("-+", "-"); /*Many calls to remove excess whitespace and punctuation from entire document*/
line = line.replaceAll(" - ", " ");
line = line.replaceAll("- ", " ");
line = line.replaceAll(" -", " ");
line = line.replaceAll("\\s+", " ");
StringTokenizer toke = new StringTokenizer(line); //start a new tokenizer with re-formatted file
while(toke.hasMoreTokens()){//continue to grab tokens until EOF
temp = toke.nextToken();
//System.out.println(date);
if (temp.charAt(0)=='-')
temp = temp.substring(1);//if word starts or ends with hyphen, remove it
if (temp.length()>1&&temp.charAt(temp.length()-1)=='-')
temp = temp.replace('-', ' ');
if ((!temp.equals(" "))){
word.set(temp+"\t"+year);
context.write(word,one);
}
}
}
}

Your code assigns year = temp, so what ends up in year depends on the tokens in your input.
Possible bug:
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(0))){
You probably mean i instead of 0 in charAt:
for (int i = 0; i<temp.length();i++){
if (Character.isDigit(temp.charAt(i))){
Also consider not using StringTokenizer. From its Javadoc:
StringTokenizer is a legacy class that is retained for compatibility
reasons although its use is discouraged in new code. It is recommended
that anyone seeking this functionality use the split method of String
or the java.util.regex package instead.
The following example illustrates how the String.split method can be
used to break up a string into its basic tokens:
String[] result = "this is a test".split("\\s");
for (int x=0; x<result.length; x++)
System.out.println(result[x]);
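As a rough sketch of how the year check could look with String.split instead of StringTokenizer: the class and method names below (YearFinder, findYear) are made up for illustration, and the code assumes a year token consists only of digits. Requiring every character to be a digit also avoids a NumberFormatException from Integer.parseInt on tokens such as "1st" that merely start with a digit.

```java
import java.util.regex.Pattern;

final class YearFinder {
    // Assumption: a "year" token consists only of digits.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    /** Returns the first token that looks like a year, or "" if none does. */
    static String findYear(String line) {
        for (String token : line.trim().split("\\s+")) {
            if (!DIGITS.matcher(token).matches()) {
                continue; // skip tokens containing any non-digit character
            }
            // Same rule as in the question: more than 3 digits, or a value of at least 40.
            if (token.length() > 3 || Integer.parseInt(token) >= 40) {
                return token;
            }
        }
        return "";
    }

    public static void main(String[] args) {
        System.out.println(findYear("release date march 12 1998 ebook 1234")); // prints 1998
    }
}
```

Returning the value from a helper method instead of breaking out of nested loops also makes it easier to see where year is assigned.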

Found your white space...
The 2 statements that print out the year variable add a couple newlines:
System.out.println("\n"+year+"\n")
or a tab:
word.set(temp+"\t"+year);
context.write(word,one);
Try removing the \n and \t.

Related

Load CSV and split attributes

I'm trying to load a csv file and split 'timespan' into 'begin' and 'end'. If the timespan consists of one date 'begin' and 'end' are the same.
timespan,someOtherField, ...
27.03.2017 - 31.03.2017,someOtherValue, ...
31.03.2017,someOtherValue, ...
Result:
begin,end,someOtherField
27.03.2017,31.03.2017,someOtherValue, ...
31.03.2017,31.03.2017,someOtherValue, ...
At the moment I'm loading the file line by line using OpenCSV. This works pretty well, but I don't know how to split one attribute. Probably I have to parse the CSV into an array?
For any line l you can use StringTokenizer to get the tokens separated by ,:
StringTokenizer tokens = new StringTokenizer(l, ",");
The first token represents timespan, so:
String timespan = tokens.nextToken();
Then you can split timespan based on " - ", so:
String[] startEnd = timespan.split(" - ");
Finally, check the length of startEnd: if startEnd.length == 1, begin and end coincide, so the output is startEnd[0],startEnd[0];
otherwise the output is startEnd[0],startEnd[1].
I hope this helps you solve the problem.
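The steps above could be put together roughly as follows. This is only a sketch: TimespanSplitter, splitTimespan and rewriteLine are hypothetical names, and it assumes the timespan is always the first field and contains no quoted commas. It uses String.split with a limit of 2 rather than StringTokenizer so that the rest of the line is carried over unchanged.

```java
final class TimespanSplitter {
    /** Returns {begin, end}; both hold the same date when there is no " - " separator. */
    static String[] splitTimespan(String timespan) {
        String[] parts = timespan.split(" - ");
        String begin = parts[0].trim();
        String end = parts.length > 1 ? parts[1].trim() : begin;
        return new String[] {begin, end};
    }

    /** Rewrites one CSV line so that the leading timespan field becomes begin,end. */
    static String rewriteLine(String line) {
        // Limit 2: cut off only the first field and leave the remainder as it is.
        String[] fields = line.split(",", 2);
        String[] range = splitTimespan(fields[0]);
        String rest = fields.length > 1 ? "," + fields[1] : "";
        return range[0] + "," + range[1] + rest;
    }

    public static void main(String[] args) {
        System.out.println(rewriteLine("27.03.2017 - 31.03.2017,someOtherValue"));
        System.out.println(rewriteLine("31.03.2017,someOtherValue"));
    }
}
```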
Thanks for your answer! I parsed the csv into an extra class and created an object for each record. The code below shows the splitting of the timespan. I will now rebuild a new csv file from all objects.
// Load CSV as Booking objects
ArrayList<Booking> bookings = Utils.readCSV(csvClean);
for (int i = 0; i < bookings.size(); i++) {
String timespan = bookings.get(i).getTimespan();
String begin = "";
String end = "";
if (timespan.contains(" - ")) {
// Split timespan and set values
String[] parts = timespan.split(" - ");
begin = parts[0].trim();
end = parts[1].trim();
bookings.get(i).setBegin(begin);
bookings.get(i).setEnd(end);
} else {
bookings.get(i).setBegin(timespan.trim());
bookings.get(i).setEnd(timespan.trim());
} // end if else
} // end for

How to merge many List<String> elements in one based on double quote delimiter in java

I have a CSV file generated on another platform (Salesforce). By default Salesforce does not seem to handle line breaks in some large text fields during file generation, so my CSV file has some rows with line breaks like this, which I need to fix:
"column1","column2","my column with text
here the text continues
more text in the same field
here we finish this","column3","column4"
Same idea using this piece of code:
List<String> listWords = new ArrayList<String>();
listWords.add("\"Hi all");
listWords.add("This is a test");
listWords.add("of how to remove");
listWords.add("");
listWords.add("breaklines and merge all in one\"");
listWords.add("\"This is a new Line with the whole text in one row\"");
In this case I would like to merge the elements. My first approach was to look for lines where the last character is not a double quote ("), concatenate the next line, and continue like that until the last character is another double quote.
This is a non-working sample of what I was trying to achieve, but I hope it gives you an idea:
String[] csvLines = csvContent.split("\n");
Integer iterator = 0;
String mergedRows = "";
for(String row:csvLines){
newCsvfile.add(row);
if(row != null){
if(!row.isEmpty()){
String lastChar = String.valueOf(row.charAt(row.length()-1));
if(!lastChar.contains("\"")){
//row += row+" "+csvLines[iterator+1].replaceAll("\r", "").replaceAll("\n", "").replaceAll("","").replaceAll("\r\n?|\n", "");
mergedRows += row+" "+csvLines[iterator+1].replaceAll("\r", "").replaceAll("\n", "").replaceAll("","").replaceAll("\r\n?|\n", "");
row = mergedRows;
csvLines[iterator+1] = null;
}
}
newCsvfile.add(row);
}
iterator++;
}
My final result should look like (based on the list sample):
"Hi all This is a test of how to remove break lines and merge all in one"
"This is a new Line with the whole text in one row".
What is the best approach to achieve this?
In case you don't want to use a CSV reading library as @RealSkeptic suggested...
Going from your listWords to your expected solution is fairly simple:
List<String> listSentences = new ArrayList<>();
String tmp = "";
for (String s : listWords) {
tmp = tmp.concat(" " + s);
if (s.endsWith("\"")){
listSentences.add(tmp);
tmp = "";
}
}
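A self-contained variant of the same idea, as a sketch: it skips empty lines and only inserts a space between parts, so the merged rows do not start with a blank. LineMerger and mergeQuotedLines are illustrative names. Like the loop above, it relies on the heuristic that a record is complete once a line ends with a double quote, which will not hold for every CSV file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LineMerger {
    /** Joins consecutive lines until one ends with a double quote. */
    static List<String> mergeQuotedLines(List<String> lines) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // drop blank lines inside a field
            }
            if (current.length() > 0) {
                current.append(' '); // separate the parts with a single space
            }
            current.append(line);
            if (line.endsWith("\"")) {
                merged.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            merged.add(current.toString()); // keep an unterminated trailing record
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> listWords = Arrays.asList(
                "\"Hi all", "This is a test", "of how to remove", "",
                "breaklines and merge all in one\"",
                "\"This is a new Line with the whole text in one row\"");
        for (String row : mergeQuotedLines(listWords)) {
            System.out.println(row);
        }
    }
}
```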

Java Regex : How to search a text or a phrase in a large text

I have a large text file and I need to search a word or a phrase in the file line by line and output the line with the text found in it.
For example, the sample text is
And the earth was without form,
Where [art] thou?
If the user searches for the word thou, the only line displayed should be
Where [art] thou?
and if the user searches for the earth, the first line should be displayed.
I tried using the contains function, but it also displays the line containing without when I search only for thou.
This is my sample code :
String[] verseList = TextIO.readFile("pentateuch.txt");
Scanner kbd = new Scanner(System.in);
int counter = 0;
for (int i = 0; i < verseList.length; i++) {
String[] data = verseList[i].split("\t");
String[] info3 = data[3].split(" ");
System.out.print("Search for: ");
String txtSearch = kbd.nextLine();
LinkedList<String> searchedList = new LinkedList<String>();
for (String bible : verseList){
if (bible.contains(txtSearch)){
searchedList.add(bible);
counter++;
}
}
if (searchedList.size() > 0){
for (String s : searchedList){
String[] searchedData = s.split("\t");
System.out.printf("%s - %s - %s - %s \n",searchedData[0], searchedData[1], searchedData[2], searchedData[3]);
}
}
System.out.print("Total: " + counter);
So I am thinking of using regex but I don't know how.
Can anyone help? Thank you.
Since sometimes variables have non-word characters at boundary positions, you cannot rely on \b word boundary.
In such cases, it is safer to use look-arounds (?<!\w) and (?!\w), i.e. in Java, something like:
"(?<!\\w)" + searchedData[n] + "(?!\\w)"
To match a String that contains a word, use this code:
String txtSearch; // eg "thou"
if (str.matches(".*?\\b" + txtSearch + "\\b.*"))
// it matches
This code builds a regex that only matches if both ends of txtSearch fall at the start or end of a word in the string, by using \b, which means "word boundary".
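The two suggestions can be combined into a small line filter, sketched below. WholeWordSearch and search are hypothetical names. Pattern.quote treats the search text as a literal, so characters such as [ or ? in the user's input are not interpreted as regex syntax, and Matcher.find() avoids having to wrap the pattern in .* on both sides. The match is case-sensitive in this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class WholeWordSearch {
    /** Returns the lines that contain the phrase as a whole word or phrase. */
    static List<String> search(List<String> lines, String phrase) {
        // Look-arounds: no word character directly before or after the phrase.
        Pattern pattern = Pattern.compile("(?<!\\w)" + Pattern.quote(phrase) + "(?!\\w)");
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> verses = Arrays.asList(
                "And the earth was without form,",
                "Where [art] thou?");
        System.out.println(search(verses, "thou"));      // only the second line
        System.out.println(search(verses, "the earth")); // only the first line
        System.out.println(search(verses, "[art]"));     // only the second line
    }
}
```

The last call shows why the look-arounds are safer than \b here: there is no word boundary between a space and [, so a \b-based pattern would not find [art].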

Having an issue with formatting a String input

I'm trying to get the input that the user enters to go to lower-case and then put the first character in the input to upper-case. For example, If I enter aRseNAL for my first input, I want to format the input so that it will put "Arsenal" into the data.txt file, I'm also wondering if there's a way to put each first character to upper-case if there's more than one word for a team ie. mAN uNiTeD formatted to Man United to be written to the file.
The code below is what I tried, and I cannot get it to work. Any advice or help would be appreciated.
import java.io.*;
import javax.swing.*;
public class write
{
public static void main(String[] args) throws IOException
{
FileWriter aFileWriter = new FileWriter("data.txt");
PrintWriter out = new PrintWriter(aFileWriter);
String team = "";
for(int i = 1; i <= 5; i++)
{
boolean isTeam = true;
while(isTeam)
{
team = JOptionPane.showInputDialog(null, "Enter a team: ");
if(team == null || team.equals(""))
JOptionPane.showMessageDialog(null, "Please enter a team.");
else
isTeam = false;
}
team.toLowerCase(); //Put everything to lower-case.
team.substring(0,1).toUpperCase(); //Put the first character to upper-case.
out.println(i + "," + team);
}
out.close();
aFileWriter.close();
}
}
In Java, strings are immutable (cannot be changed) so methods like substring and toLowerCase generate new strings - they don't modify your existing string.
So rather than:
team.toLowerCase();
team.substring(0,1).toUpperCase();
out.println(team);
You'd need something like:
String first = team.substring(0,1).toUpperCase();
String rest = team.substring(1,team.length()).toLowerCase();
out.println(first + rest);
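Similar to what @DNA suggested, with an explicit length check added. Strictly speaking, substring(1, team.length()) on a one-character string returns an empty string rather than throwing, so the check is defensive; substring(0,1) would throw only on an empty string, which the input loop already rules out.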
String output = team.substring(0,1).toUpperCase();
// if team length is >1 then only put 2nd part
if (team.length()>1) {
output = output+ team.substring(1,team.length()).toLowerCase();
}
out.println(i + "," + output);
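For the second part of the question (capitalising each word of a multi-word team name), one possible sketch is to split on whitespace and apply the same first/rest logic to every word. TeamNameFormatter and toTitleCase are made-up names, and Locale.ROOT is an assumption that avoids locale-specific case rules.

```java
import java.util.Locale;

final class TeamNameFormatter {
    /** Lower-cases the input and upper-cases the first letter of each word. */
    static String toTitleCase(String input) {
        StringBuilder out = new StringBuilder();
        for (String word : input.trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // happens only when the input is blank
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(word.substring(0, 1).toUpperCase(Locale.ROOT))
               .append(word.substring(1).toLowerCase(Locale.ROOT));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toTitleCase("aRseNAL"));    // prints Arsenal
        System.out.println(toTitleCase("mAN uNiTeD")); // prints Man United
    }
}
```

In the program from the question, the result would be written with out.println(i + "," + TeamNameFormatter.toTitleCase(team)).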

string tokenizer in Java

I have a text file which contains data separated by '|'. I need to get each field (separated by '|') and process it. The text file looks like this:
ABC|DEF||FGHT
I am using StringTokenizer (JDK 1.4) to get each field value. The problem is that I should get an empty string after DEF. However, I am not getting the empty string between DEF and FGHT.
My result should be - ABC,DEF,"",FGHT but I am getting ABC,DEF,FGHT
From StringTokenizer documentation :
StringTokenizer is a legacy class that
is retained for compatibility reasons
although its use is discouraged in new
code. It is recommended that anyone
seeking this functionality use the
split method of String or the
java.util.regex package instead.
The following code should work :
String s = "ABC|DEF||FGHT";
String[] r = s.split("\\|");
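To make the behaviour of split concrete, here is a small sketch (PipeSplit and fields are illustrative names). Two details are worth knowing: split keeps empty strings between delimiters but drops trailing empty strings unless a negative limit is passed, and Pattern.quote, used here to escape the delimiter, requires Java 5 or later, so on JDK 1.4 the delimiter would have to be escaped by hand as "\\|".

```java
import java.util.Arrays;
import java.util.regex.Pattern;

final class PipeSplit {
    /** Splits on a literal delimiter and keeps every empty field, including trailing ones. */
    static String[] fields(String line, String delimiter) {
        // A limit of -1 stops split from discarding trailing empty strings.
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(fields("ABC|DEF||FGHT", "|"))); // prints [ABC, DEF, , FGHT]
        System.out.println(fields("ABC|DEF||", "|").length);               // prints 4
        System.out.println("ABC|DEF||".split("\\|").length);               // prints 2
    }
}
```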
Use the returnDelims flag and check two subsequent occurrences of the delimiter:
String str = "ABC|DEF||FGHT";
String delim = "|";
StringTokenizer tok = new StringTokenizer(str, delim, true);
boolean expectDelim = false;
while (tok.hasMoreTokens()) {
String token = tok.nextToken();
if (delim.equals(token)) {
if (expectDelim) {
expectDelim = false;
continue;
} else {
// unexpected delim means empty token
token = null;
}
}
System.out.println(token);
expectDelim = true;
}
this prints
ABC
DEF
null
FGHT
The API isn't pretty and is therefore considered legacy (i.e. "almost obsolete"). Use it only where pattern matching is too expensive (which should only be the case for extremely long strings) or where an API expects an Enumeration.
In case you switch to String.split(String), make sure to quote the delimiter, either manually ("\\|") or automatically using string.split(Pattern.quote(delim));
StringTokenizer ignores empty elements. Consider using String.split, which is also available in 1.4.
From the javadocs:
StringTokenizer is a legacy class that
is retained for compatibility reasons
although its use is discouraged in new
code. It is recommended that anyone
seeking this functionality use the
split method of String or the
java.util.regex package instead.
You can use the constructor that takes an extra 'returnDelims' boolean and pass true to it.
This way you will receive the delimiters, which will allow you to detect this condition.
Alternatively, you can just implement your own string tokenizer that does what you need; it's not that hard.
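A hand-written tokenizer along those lines could look like the sketch below. SimpleTokenizer and tokenize are hypothetical names, and the code assumes a single-character delimiter. It walks the string with indexOf and emits every field, including empty ones.

```java
import java.util.ArrayList;
import java.util.List;

final class SimpleTokenizer {
    /** Splits text on a single-character delimiter, keeping empty fields. */
    static List<String> tokenize(String text, char delimiter) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        int next;
        while ((next = text.indexOf(delimiter, start)) >= 0) {
            tokens.add(text.substring(start, next)); // empty when two delimiters are adjacent
            start = next + 1;
        }
        tokens.add(text.substring(start)); // the field after the last delimiter
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("ABC|DEF||FGHT", '|')); // prints [ABC, DEF, , FGHT]
    }
}
```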
Here is another way to solve this problem
String str = "ABC|DEF||FGHT";
StringTokenizer s = new StringTokenizer(str,"|",true);
String currentToken="",previousToken="";
while(s.hasMoreTokens())
{
//Get the current token from the tokenize strings
currentToken = s.nextToken();
//Check for the empty token in between ||
if(currentToken.equals("|") && previousToken.equals("|"))
{
//We denote the empty token so we print null on the screen
System.out.println("null");
}
else
{
//We only print the tokens except delimiters
if(!currentToken.equals("|"))
System.out.println(currentToken);
}
previousToken = currentToken;
}
Here is a way to split a string into tokens (a token is one or more letters)
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
String s = scan.nextLine();
s = s.replaceAll("[^A-Za-z]", " ");
StringTokenizer arr = new StringTokenizer(s, " ");
int n = arr.countTokens();
System.out.println(n);
while(arr.hasMoreTokens()){
System.out.println(arr.nextToken());
}
scan.close();
}
package com.java.String;
import java.util.StringTokenizer;
public class StringWordReverse {
public static void main(String[] kam) {
String s;
String sReversed = "";
System.out.println("Enter a string to reverse");
s = "THIS IS ASHIK SKLAB";
StringTokenizer st = new StringTokenizer(s);
while (st.hasMoreTokens()) {
sReversed = st.nextToken() + " " + sReversed;
}
System.out.println("Original string is : " + s);
System.out.println("Reversed string is : " + sReversed);
}
}
Output:
Enter a string to reverse
Original string is : THIS IS ASHIK SKLAB
Reversed string is : SKLAB ASHIK IS THIS
