I've been working on this problem for a while without success. When I run the code I get this error message: incompatible types: edu.duke.StorageResource cannot be converted to java.lang.String on the line String geneList = FMG.storeAll(dna);. Does this mean I'm trying to assign an edu.duke object to a java.lang.String variable? How would I go about resolving this issue?
Here's my code so far:
package coursera_java_duke;
import java.io.*;
import edu.duke.FileResource;
import edu.duke.StorageResource;
import edu.duke.DirectoryResource;
public class FindMultiGenes5 {
public int findStopIndex(String dna, int index) {
int stop1 = dna.indexOf("TGA", index);
if (stop1 == -1 || (stop1 - index) % 3 != 0) {
stop1 = dna.length();
}
int stop2 = dna.indexOf("TAA", index);
if (stop2 == -1 || (stop2 - index) % 3 != 0) {
stop2 = dna.length();
}
int stop3 = dna.indexOf("TAG", index);
if (stop3 == -1 || (stop3 - index) % 3 != 0) {
stop3 = dna.length();
}
return Math.min(stop1, Math.min(stop2, stop3));
}
public StorageResource storeAll(String dna) {
//CATGTAATAGATGAATGACTGATAGATATGCTTGTATGCTATGAAAATGTGAAATGACCCAdna = "CATGTAATAGATGAATGACTGATAGATATGCTTGTATGCTATGAAAATGTGAAATGACCCA";
String geneAL = new String();
String sequence = dna.toUpperCase();
StorageResource store = new StorageResource();
int index = 0;
while (true) {
index = sequence.indexOf("ATG", index);
if (index == -1)
break;
int stop = findStopIndex(sequence, index + 3);
if (stop != sequence.length()) {
String gene = dna.substring(index, stop + 3);
store.add(gene);
//index = sequence.substring(index, stop + 3).length();
index = stop + 3; // start at the end of the stop codon
}else{ index = index + 3;
}
}
return store;//System.out.println(sequence);
}
public void testStorageFinder() {
DirectoryResource dr = new DirectoryResource();
StorageResource dnaStore = new StorageResource();
for (File f : dr.selectedFiles()) {
FileResource fr = new FileResource(f);
String s = fr.asString();
dnaStore = storeAll(s);
printGenes(dnaStore);
}
System.out.println("size = " + dnaStore.size());
}
public String readStrFromFile(){
FileResource readFile = new FileResource();
String DNA = readFile.asString();
//System.out.println("DNA: " + DNA);
return DNA;
}//end readStrFromFile() method;
public float calCGRatio(String gene){
gene = gene.toUpperCase();
int len = gene.length();
int CGCount = 0;
for(int i=0; i<len; i++){
if(gene.charAt(i) == 'C' || gene.charAt(i) == 'G')
CGCount++;
}//end for loop
System.out.println("CGCount " + CGCount + " Length: " + len + " Ratio: " + (float)CGCount/len);
return (float)CGCount/len;
}//end of calCGRatio() method;
public void printGenes(StorageResource sr){
//create a FindMultiGenesFile object FMG
FindMultiGenes5 FMG = new FindMultiGenes5();
//read a DNA sequence from file
String dna = FMG.readStrFromFile();
String geneList = FMG.storeAll(dna);
//store all genes into a document
StorageResource dnaStore = new StorageResource();
System.out.println("\n There are " + geneList.size() + " genes. ");
int longerthan60 = 0;
int CGGreaterthan35 = 0;
for(int i=0; i<geneList.size(); i++){
if(!dnaStore.contains(geneList.get(i)))
dnaStore.add(geneList.get(i));
if(geneList.get(i).length() > 60) longerthan60++;
if(FMG.calCGRatio(geneList.get(i)) > 0.35) CGGreaterthan35++;
}
System.out.println("dnaStore.size: " + dnaStore.size());
System.out.println("\n There are " + dnaStore.size() + " genes. ");
System.out.println("There are " + longerthan60 + " genes longer than 60.");
System.out.println("There are " + CGGreaterthan35 + " genes with CG ratio greater than 0.35.");
}//end main();
}
I found your post because I am also taking a similar Duke course that uses the edu.duke libraries.
I get that error message when I use the wrong type or method to access the result. storeAll returns a StorageResource, not a String, so geneList has to be declared as a StorageResource.
Then try geneList.data() to get an iterable of all of the gene strings.
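To illustrate iterating over data(), here is a minimal sketch. GeneStore is a hypothetical stand-in for edu.duke.StorageResource so the example compiles without the course library; it assumes the real class offers add, size and data() in this shape. GeneReport and countLongerThan are also names made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for edu.duke.StorageResource (assumption: the real
// class exposes add, size and data() like this).
class GeneStore {
    private final List<String> items = new ArrayList<>();

    void add(String s) { items.add(s); }

    int size() { return items.size(); }

    Iterable<String> data() { return items; }
}

class GeneReport {
    // Counts the genes longer than minLength by looping over data().
    static int countLongerThan(GeneStore genes, int minLength) {
        int count = 0;
        for (String gene : genes.data()) {
            if (gene.length() > minLength) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Declared as the store type, not as String.
        GeneStore geneList = new GeneStore();
        geneList.add("ATGTAA");
        geneList.add("ATGAAATGA");
        System.out.println("There are " + geneList.size() + " genes.");
        System.out.println(countLongerThan(geneList, 6) + " genes longer than 6.");
    }
}
```

With the real library, the same for-each loop over data() should work once geneList is declared as a StorageResource.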
I have 2 files as below :
1.txt
first|second|third
fourth|fifth|sixth
2.txt
first1|second1|third1
fourth1|fifth1|sixth1
Now I want to join them both :
first|first1|second1|third1|second|third
fourth|fourth1|fifth1|sixth1|fifth|sixth
I am trying to use Scanner but am not able to join them. Any suggestions?
Scanner scanner = new Scanner(new File(("F:\\1.txt")));
Scanner scanner2 = new Scanner(new File(("F:\\2.txt")));
while(scanner.hasNext()) {
while(scanner2.hasNext()) {
System.out.println(scanner.next() + "|" + scanner2.next() + "|");
}
}
// output
first|second|third|first1|second1|third1|
fourth|fifth|sixth|fourth1|fifth1|sixth1|
Scanner scanner = new Scanner(new File(("F:\\1.txt")));
Scanner scanner2 = new Scanner(new File(("F:\\2.txt")));
String[] line1, line2, res;
while (scanner.hasNext() && scanner2.hasNext()) {
line1 = scanner.next().split("\\|");
line2 = scanner2.next().split("\\|");
int len = Math.min(line1.length,line2.length);
res= new String[line1.length + line2.length];
for(int index = 0, counter = 0; index < len; index++){
res[counter++] = line1[index];
res[counter++] = line2[index];
}
if(line1.length > line2.length){
for(int jIndex = 2*line2.length, counter = 0;jIndex < (line1.length+line2.length);jIndex++ ){
res[jIndex] = line1[line2.length + (counter++)];
}
}else if(line2.length > line1.length){
for(int jIndex = 2*line1.length, counter = 0;jIndex < (line1.length+line2.length);jIndex++ ){
res[jIndex] = line2[line1.length + (counter++)];
}
}
String result = Arrays.asList(res).toString().replaceAll("(^\\[|\\]$)", "").replace(", ", "|");
System.out.println(result);
}
scanner.close();
scanner2.close();
You can drop the if conditions if both lines always contain the same number of tokens.
This gives the output:
first|first1|second|second1|third|third1
fourth|fourth1|fifth|fifth1|sixth|sixth1
And
String[] line1, line2, res;
while (scanner.hasNext() && scanner2.hasNext()) {
line1 = scanner.next().split("\\|");
line2 = scanner2.next().split("\\|");
res= new String[line1.length + line2.length];
int counter = 0;
res[counter++] = line1[0];
for(int index = 0; index < line2.length; index++){
res[counter++] = line2[index];
}
for(int index = 1; index < line1.length; index++){
res[counter++] = line1[index];
}
String result = Arrays.asList(res).toString().replaceAll("(^\\[|\\]$)", "").replace(", ", "|");
System.out.println(result);
}
scanner.close();
scanner2.close();
will give output as
first|first1|second1|third1|second|third
fourth|fourth1|fifth1|sixth1|fifth|sixth
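The second variant can also be written with a List and String.join (Java 8 or later), which avoids the manual counter and the toString/replaceAll step. This is only a sketch; LineMerger and merge are hypothetical names, and it assumes each line has at least one field and no field contains a '|'.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class LineMerger {
    // Builds: first field of a, then every field of b, then the rest of a.
    static String merge(String a, String b) {
        String[] left = a.split("\\|");
        String[] right = b.split("\\|");
        List<String> out = new ArrayList<>();
        out.add(left[0]);
        out.addAll(Arrays.asList(right));
        out.addAll(Arrays.asList(left).subList(1, left.length));
        return String.join("|", out);
    }

    public static void main(String[] args) {
        System.out.println(merge("first|second|third", "first1|second1|third1"));
        System.out.println(merge("fourth|fifth|sixth", "fourth1|fifth1|sixth1"));
    }
}
```

Inside the while loop you would call merge(scanner.next(), scanner2.next()) and print the result.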
I'm trying to read certain numbers from a text file. It was working a few days ago, and now suddenly it's not reading the numbers after certain words. Here are my Java methods for writing and reading. Note that this is used in a JSP:
public void writeToFile() {
try {
File aFile = new File(nameOfFile + "MAIN" + ".txt");
aFile.createNewFile();
PrintWriter writeTo = new PrintWriter(aFile);
writeTo.print("Matrix 1" + "\nRows: " + row1 + "\n");
writeTo.print("Columns: " + column1 + "\n");
writeTo.println("Method used: " + operationName);
if("DotProduct".equals(operationName))
writeTo.println("Vector: " + vectorRow1);
writeTo.println();
for(int i = 0; i < getRow1(); i++) {
int counter = 0;
for(int j = 0; j < getCol1(); j++, counter++)
writeTo.print(matrixToFile[i][j] + "\t");
if(counter == getCol1()) {
writeTo.print("\n");
}
}
writeTo.close();
File aFile2 = new File(nameOfFile + "SECOND.txt");
aFile2.createNewFile();
writeTo = new PrintWriter(aFile2);
writeTo.print("Matrix 2" + "\nRows: " + row2 + "\n");
writeTo.print("Columns: " + column2 + "\n");
if("DotProduct".equals(operationName))
writeTo.println("Vector: " + vectorCol2);
writeTo.println();
for(int i = 0; i < getRow2(); i++) {
int counter = 0;
for(int j = 0; j < getCol2(); j++, counter++)
writeTo.print(matrix2ToFile[i][j] + "\t");
if(counter == getCol2()) {
writeTo.print("\n");
}
}
writeTo.close();
File aFile3 = new File(nameOfFile + "RESULT.txt");
aFile3.createNewFile();
writeTo = new PrintWriter(aFile3);
writeTo.print("Result" + "\nRows: " + row3 + "\n");
writeTo.print("Columns: " + column3 + "\n");
writeTo.println();
if("DotProduct".equals(operationName))
writeTo.println("Result:" + resultVector);
else {
for(int i = 0; i < getRow3(); i++) {
int counter = 0;
for(int j = 0; j < getCol3(); j++, counter++)
writeTo.print(result[i][j] + "\t");
if(counter == getCol3()) {
writeTo.print("\n");
}
}
}
writeTo.close();
nameOfFile = "/var/lib/tomcat7/webapps/Matrices/WEB-INF/MatrixData/";
errors.put("dataStored", "Your results have been stored!");
} catch(IOException ex) {
System.out.printf("Error!");
}
}
This is the read file:
public void readFromFile(String resultOrNot, String fileName) {
nameOfFile = "/var/lib/tomcat7/webapps/Matrices/WEB-INF/MatrixData/";
workMe = fileName;
try {
Scanner readFile = new Scanner(new File(nameOfFile + workMe));
readFile.useDelimiter("Rows: ");
while(readFile.hasNextInt()) {
stringRow = readFile.next();
}
setRow1VIAString(stringRow);
readFile.useDelimiter("Columns: ");
while(readFile.hasNextInt()) {
stringCol = readFile.next();
}
setColumn1VIAString(stringCol);
readFile.useDelimiter("Method used: ");
while(readFile.hasNext()) {
operationName = readFile.next();
}
readFile.useDelimiter("\n");
for(int i = 0; i < row1 || readFile.hasNextDouble(); i++)
for(int j = 0; j < column1; j++)
matrix1[i][j] = readFile.nextDouble();
readFile.close();
workMe = workMe.replace("MAIN", "SECOND");
readFile = new Scanner(new File(nameOfFile + workMe));
readFile.useDelimiter("Rows: ");
while(readFile.hasNextInt()) {
stringRow = readFile.next();
}
setRow2VIAString(stringRow);
readFile.useDelimiter("Columns: ");
while(readFile.hasNextInt()) {
stringCol = readFile.next();
}
setColumn2VIAString(stringCol);
readFile.useDelimiter("\n");
for(int i = 0; i < row1 || readFile.hasNextDouble(); i++)
for(int j = 0; j < column1; j++)
matrix1[i][j] = readFile.nextDouble();
readFile.close();
if("giveMeResult".equals(resultOrNot)) {
workMe = workMe.replace("SECOND", "RESULT");
readFile = new Scanner(new File(nameOfFile + workMe));
readFile.useDelimiter("Rows: ");
while(readFile.hasNextInt()) {
stringRow = readFile.next();
}
setRow3VIAString(stringRow);
readFile.useDelimiter("Columns: ");
while(readFile.hasNextInt()) {
stringCol = readFile.next();
}
setColumn3VIAString(stringCol);
readFile.useDelimiter("\n");
for(int i = 0; i < row3 || readFile.hasNextDouble(); i++)
for(int j = 0; j < column3; j++)
matrix1[i][j] = readFile.nextDouble();
}
} catch(IOException ex) {
ex.printStackTrace();
}
}
The converting function:
public void setRow1VIAString(String aRow) {
row1 = Integer.parseInt(aRow);
}
And this is the error:
java.lang.NumberFormatException: null
java.lang.Integer.parseInt(Integer.java:454)
java.lang.Integer.parseInt(Integer.java:527)
matrixcalculator.MatrixCalculator.setRow1VIAString(MatrixCalculator.java:171)
matrixcalculator.MatrixCalculator.readFromFile(MatrixCalculator.java:728)
org.apache.jsp.choosing_jsp._jspService(choosing_jsp.java:84)
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:432)
org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:390)
org.apache.jasper.servlet.JspServlet.service(JspServlet.java:334)
javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
I know it means that the string I'm trying to parse is null; what I don't understand is why it is null now. It's been working fine and now suddenly it's doing this. I'm aware of the whitespace, but I accounted for it so that there wouldn't be any problems.
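A likely explanation, going only by the code shown: with the delimiter set to "Rows: ", the first token the Scanner sees is everything before the first "Rows: " (the "Matrix 1" title line). That is not an int, so hasNextInt() is false, the loop body never runs, stringRow stays null, and Integer.parseInt(null) throws NumberFormatException: null. Reading line by line and matching the label at the start of each line avoids depending on delimiter switching. The sketch below assumes the layout written by writeToFile above; MatrixFileParser and its fields are hypothetical names, and it parses from a String so it can be tried without the files.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class MatrixFileParser {
    int rows = -1;                 // stays -1 if no "Rows:" line is found
    int cols = -1;                 // stays -1 if no "Columns:" line is found
    String method;                 // stays null if no "Method used:" line is found
    final List<double[]> data = new ArrayList<>();

    static MatrixFileParser parse(String text) {
        MatrixFileParser p = new MatrixFileParser();
        try (Scanner in = new Scanner(text)) {
            while (in.hasNextLine()) {
                String line = in.nextLine().trim();
                if (line.startsWith("Rows:")) {
                    p.rows = Integer.parseInt(line.substring("Rows:".length()).trim());
                } else if (line.startsWith("Columns:")) {
                    p.cols = Integer.parseInt(line.substring("Columns:".length()).trim());
                } else if (line.startsWith("Method used:")) {
                    p.method = line.substring("Method used:".length()).trim();
                } else if (!line.isEmpty()
                        && (Character.isDigit(line.charAt(0)) || line.charAt(0) == '-')) {
                    // A data row: numbers separated by tabs or spaces.
                    String[] cells = line.split("\\s+");
                    double[] row = new double[cells.length];
                    for (int i = 0; i < cells.length; i++) {
                        row[i] = Double.parseDouble(cells[i]);
                    }
                    p.data.add(row);
                }
                // Any other line (title, blank line, "Vector: ...") is skipped.
            }
        }
        return p;
    }

    public static void main(String[] args) {
        String sample = "Matrix 1\nRows: 2\nColumns: 3\nMethod used: Add\n\n"
                + "1.0\t2.0\t3.0\t\n4.0\t5.0\t6.0\t\n";
        MatrixFileParser p = parse(sample);
        System.out.println(p.rows + " x " + p.cols + ", method " + p.method
                + ", data rows read: " + p.data.size());
    }
}
```

To read from disk, the same loop works with new Scanner(new File(...)) in place of new Scanner(text).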
I'm not sure why it gives me a NullPointerException. Please help.
I am pretty sure all the arrays are full, and I restricted all the loops so they don't go past the empty slots.
import java.util.*;
import java.io.*;
public class TextAnalysis {
public static void main (String [] args) throws IOException {
String fileName = args[0];
File file = new File(fileName);
Scanner fileScanner = new Scanner(file);
int MAX_WORDS = 10000;
String[] words = new String[MAX_WORDS];
int unique = 0;
System.out.println("TEXT FILE STATISTICS");
System.out.println("--------------------");
System.out.println("Length of the longest word: " + longestWord(fileScanner));
read(words, fileName);
System.out.println("Number of words in file wordlist: " + wordList(words));
System.out.println("Number of words in file: " + countWords(fileName) + "\n");
System.out.println("Word-frequency statistics");
lengthFrequency(words);
System.out.println();
System.out.println("Wordlist dump:");
wordFrequency(words,fileName);
}
public static void wordFrequency(String[] words, String fileName) throws IOException{
File file = new File(fileName);
Scanner s = new Scanner(file);
int [] array = new int [words.length];
while(s.hasNext()) {
String w = s.next();
if(w!=null){
for(int i = 0; i < words.length; i++){
if(w.equals(words[i])){
array[i]++;
}
}
for(int i = 0; i < words.length; i++){
System.out.println(words[i] + ":" + array[i]);
}
}
}
}
public static void lengthFrequency (String [] words) {
int [] lengthTimes = new int[10];
for(int i = 0; i < words.length; i++) {
String w = words[i];
if(w!=null){
if(w.length() >= 10) {
lengthTimes[9]++;
} else {
lengthTimes[w.length()-1]++;
}
}
}
for(int j = 0; j < 10; j++) {
System.out.println("Word-length " + (j+1) + ": " + lengthTimes[j]);
}
}
public static String longestWord (Scanner s) {
String longest = "";
while (s.hasNext()) {
String word = s.next();
if (word.length() > longest.length()) {
longest = word;
}
}
return (longest.length() + " " + "(\"" + longest + "\")");
}
public static int countWords (String fileName) throws IOException {
File file = new File(fileName);
Scanner fileScanner = new Scanner(file);
int count = 0;
while(fileScanner.hasNext()) {
String word = fileScanner.next();
count++;
}
return count;
}
public static void read(String[] words, String fileName) throws IOException{
File file = new File(fileName);
Scanner s = new Scanner(file);
while (s.hasNext()) {
String word = s.next();
int i;
for ( i=0; i < words.length && words[i] != null; i++ ) {
words[i]=words[i].toLowerCase();
if (words[i].equals(word)) {
break;
}
}
words[i] = word;
}
}
public static int wordList(String[] words) {
int count = 0;
while (words[count] != null) {
count++;
}
return count;
}
}
There are two problems with this code:
1. You didn't do a null check, although the array contains null values.
2. Your array indexes run from 0 to 8; if you try to access the element at index 9 it will throw an ArrayIndexOutOfBoundsException.
Your code should look like this:
public static void lengthFrequency (String [] words) {
// 10 buckets (indexes 0-9): lengths 1 to 9, plus one bucket for 10 or more
int [] lengthTimes = new int [10];
for(int i = 0; i < words.length; i++) {
String w = words[i];
if(null!=w && w.length() > 0) // null check added: unused slots are null
{
if(w.length() >= 10) {
lengthTimes[9]++;
} else {
lengthTimes[w.length()-1]++;
}
}
}
// traverse up to the length of the array instead of a hard-coded 10
for(int i = 0; i < lengthTimes.length; i++) {
System.out.println("Word-length " + (i+1) + ": " + lengthTimes[i]);
}
}
Your String array String[] words = new String[MAX_WORDS]; is allocated, but its elements are never initialized. Every element is null until you assign it, so calling the length method on an element at line 31 will give you a NullPointerException.
Simple mistake. When you declare an array of size n, its indexes run from 0 to n-1. This array only has indexes from 0 to 8.
int [] lengthTimes = new int [9];
//some code here
lengthTimes[9]++; // <- this is an error (this is line 29)
for(int i = 0; i < 10; i++) {
System.out.println("Word-length " + (i+1) + ": " + lengthTimes[i]); // <- same error when i is 9. This is line 37
When you declare:
String[] words = new String[MAX_WORDS];
You're creating an array of MAX_WORDS nulls. If your input file doesn't fill them all, you'll get a NullPointerException at what I think is line 37 in your original file:
if(w.length() >= 10) { // if w is null this would throw Npe
To fix it you may use a List instead:
List<String> words = new ArrayList<String>();
...
words.add( aWord );
Or perhaps you can use a Set if you don't want to have repeated words.
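Here is a rough sketch of how the List and Set suggestions might look for the length statistics. WordLengths, lengthFrequency and unique are hypothetical names, and the bucket layout (lengths 1 to 9, then 10 or more) is taken from the question's lengthFrequency method. A List only holds the words actually added, so there are no null slots to guard against.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class WordLengths {
    // Buckets 0..8 count lengths 1..9; bucket 9 counts lengths of 10 or more.
    static int[] lengthFrequency(List<String> words) {
        int[] counts = new int[10];
        for (String w : words) {
            if (w.isEmpty()) {
                continue; // nothing to count for an empty string
            }
            counts[Math.min(w.length(), 10) - 1]++;
        }
        return counts;
    }

    // A Set drops repeated words; lower-casing makes "To" and "to" the same word.
    static Set<String> unique(List<String> words) {
        Set<String> result = new LinkedHashSet<>();
        for (String w : words) {
            result.add(w.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> words = new ArrayList<>();
        words.add("a");
        words.add("to");
        words.add("To");
        words.add("extraordinary");
        int[] counts = lengthFrequency(words);
        for (int i = 0; i < counts.length; i++) {
            System.out.println("Word-length " + (i + 1) + ": " + counts[i]);
        }
        System.out.println("Unique words: " + unique(words).size());
    }
}
```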
I used LingPipe for sentence detection, but I don't know whether there is a better tool. As far as I understand, there is no way to compare two sentences and see if they mean the same thing.
Is there any other good source of a pre-built method for comparing two sentences to see if they are similar?
My requirement is as below:
String sent1 = "Mary and Meera are my classmates.";
String sent2 = "Meera and Mary are my classmates.";
String sent3 = "I am in Meera and Mary's class.";
// several sentences will be formed and basically what I need to do is
// this
boolean bothAreEqual = compareOf(sent1, sent2);
sop(bothAreEqual); // should print true
bothAreEqual = compareOf(sent2, sent3);
sop(bothAreEqual);// should print true
How to test whether two sentences mean the same thing is too open-ended a question.
However, there are methods for comparing two sentences to see if they are similar. There are many possible definitions of similarity that can be tested with pre-built methods.
See for example http://en.wikipedia.org/wiki/Levenshtein_distance
Distance between
'Mary and Meera are my classmates.'
and 'Meera and Mary are my classmates.':
6
Distance between
'Mary and Meera are my classmates.'
and 'Alice and Bobe are not my classmates.':
14
Distance between
'Mary and Meera are my classmates.'
and 'Some totally different sentence.':
29
code:
public class LevenshteinDistance {
private static int minimum(int a, int b, int c) {
return Math.min(Math.min(a, b), c);
}
public static int computeDistance(CharSequence str1,
CharSequence str2) {
int[][] distance = new int[str1.length() + 1][str2.length() + 1];
for (int i = 0; i <= str1.length(); i++){
distance[i][0] = i;
}
for (int j = 0; j <= str2.length(); j++){
distance[0][j] = j;
}
for (int i = 1; i <= str1.length(); i++){
for (int j = 1; j <= str2.length(); j++){
distance[i][j] = minimum(
distance[i - 1][j] + 1,
distance[i][j - 1] + 1,
distance[i - 1][j - 1]
+ ((str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0 : 1));
}
}
int result = distance[str1.length()][str2.length()];
//log.debug("distance:"+result);
return result;
}
public static void main(String[] args) {
String sent1="Mary and Meera are my classmates.";
String sent2="Meera and Mary are my classmates.";
String sent3="Alice and Bobe are not my classmates.";
String sent4="Some totally different sentence.";
System.out.println("Distance between \n'"+sent1+"' \nand '"+sent2+"': \n"+computeDistance(sent1, sent2));
System.out.println("Distance between \n'"+sent1+"' \nand '"+sent3+"': \n"+computeDistance(sent1, sent3));
System.out.println("Distance between \n'"+sent1+"' \nand '"+sent4+"': \n"+computeDistance(sent1, sent4));
}
}
Here is what I have come up with. This is just a substitute until I get to the real thing, but it might be of some help to people out there.
package com.examples;
import java.io.File;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import com.aliasi.sentences.MedlineSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.util.Files;
public class SentenceWordAnalysisAndLevenshteinDistance {
private static int minimum(int a, int b, int c) {
return Math.min(Math.min(a, b), c);
}
public static int computeDistance(CharSequence str1, CharSequence str2) {
int[][] distance = new int[str1.length() + 1][str2.length() + 1];
for (int i = 0; i <= str1.length(); i++) {
distance[i][0] = i;
}
for (int j = 0; j <= str2.length(); j++) {
distance[0][j] = j;
}
for (int i = 1; i <= str1.length(); i++) {
for (int j = 1; j <= str2.length(); j++) {
distance[i][j] = minimum(
distance[i - 1][j] + 1,
distance[i][j - 1] + 1,
distance[i - 1][j - 1]
+ ((str1.charAt(i - 1) == str2.charAt(j - 1)) ? 0
: 1));
}
}
int result = distance[str1.length()][str2.length()];
return result;
}
static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
static final SentenceModel SENTENCE_MODEL = new MedlineSentenceModel();
public static void main(String[] args) {
try {
ArrayList<String> sentences = null;
sentences = new ArrayList<String>();
// Reading from text file
// sentences = readSentencesInFile("D:\\sam.txt");
// Giving sentences
// ArrayList<String> sentences = new ArrayList<String>();
sentences.add("Mary and Meera are my classmates.");
sentences.add("Mary and Meera are my classmates.");
sentences.add("Meera and Mary are my classmates.");
sentences.add("Alice and Bobe are not my classmates.");
sentences.add("Some totally different sentence.");
// Self-implemented
wordAnalyser(sentences);
// Internet referred
// levenshteinDistance(sentences);
} catch (Exception e) {
// TODO: handle exception
e.printStackTrace();
}
}
private static ArrayList<String> readSentencesInFile(String path) {
ArrayList<String> sentencesList = new ArrayList<String>();
try {
System.out.println("Reading file from : " + path);
File file = new File(path);
String text = Files.readFromFile(file, "ISO-8859-1");
System.out.println("INPUT TEXT: ");
System.out.println(text);
List<String> tokenList = new ArrayList<String>();
List<String> whiteList = new ArrayList<String>();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
text.toCharArray(), 0, text.length());
tokenizer.tokenize(tokenList, whiteList);
System.out.println(tokenList.size() + " TOKENS");
System.out.println(whiteList.size() + " WHITESPACES");
String[] tokens = new String[tokenList.size()];
String[] whites = new String[whiteList.size()];
tokenList.toArray(tokens);
whiteList.toArray(whites);
int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens,
whites);
System.out.println(sentenceBoundaries.length
+ " SENTENCE END TOKEN OFFSETS");
if (sentenceBoundaries.length < 1) {
System.out.println("No sentence boundaries found.");
return new ArrayList<String>();
}
int sentStartTok = 0;
int sentEndTok = 0;
for (int i = 0; i < sentenceBoundaries.length; ++i) {
sentEndTok = sentenceBoundaries[i];
System.out.println("SENTENCE " + (i + 1) + ": ");
StringBuffer sentenceString = new StringBuffer();
for (int j = sentStartTok; j <= sentEndTok; j++) {
sentenceString.append(tokens[j] + whites[j + 1]);
}
System.out.println(sentenceString.toString());
sentencesList.add(sentenceString.toString());
sentStartTok = sentEndTok + 1;
}
} catch (IOException e) {
// TODO: handle exception
e.printStackTrace();
}
return sentencesList;
}
private static void levenshteinDistance(ArrayList<String> sentences) {
System.out.println("\nLevenshteinDistance");
for (int i = 0; i < sentences.size(); i++) {
System.out.println("Distance between \n'" + sentences.get(0)
+ "' \nand '" + sentences.get(i) + "': \n"
+ computeDistance(sentences.get(0),
sentences.get(i)));
}
}
private static void wordAnalyser(ArrayList<String> sentences) {
System.out.println("No.of Sentences : " + sentences.size());
List<String> stopWordsList = getStopWords();
List<String> tokenList = new ArrayList<String>();
ArrayList<List<String>> filteredSentences = new ArrayList<List<String>>();
for (int i = 0; i < sentences.size(); i++) {
tokenList = new ArrayList<String>();
List<String> whiteList = new ArrayList<String>();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(sentences.get(i)
.toCharArray(), 0, sentences.get(i).length());
tokenizer.tokenize(tokenList, whiteList);
System.out.print("Sentence " + (i + 1) + ": " + tokenList.size()
+ " TOKENS, ");
System.out.println(whiteList.size() + " WHITESPACES");
filteredSentences.add(filterStopWords(tokenList, stopWordsList));
}
for (int i = 0; i < sentences.size(); i++) {
System.out.println("\n" + (i + 1) + ". Comparing\n '"
+ sentences.get(0) + "' \nwith\n '" +
sentences.get(i)
+ "' : \n");
System.out.println(filteredSentences.get(0) + "\n and \n"
+ filteredSentences.get(i));
System.out.println("Percentage of similarity: "
+ calculateSimilarity(filteredSentences.get(0),
filteredSentences.get(i))
+ "%");
}
}
private static double calculateSimilarity(List<String> list1,
List<String> list2) {
int length1 = list1.size();
int length2 = list2.size();
int count1 = 0;
int count2 = 0;
double result1 = 0.0;
double result2 = 0.0;
int least, highest;
if (length2 > length1) {
least = length1;
highest = length2;
} else {
least = length2;
highest = length1;
}
// computing result1
for (String string1 : list1) {
if (list2.contains(string1))
count1++;
}
result1 = (count1 * 100.0) / length1; // 100.0 avoids integer division
// computing result2
for (String string2 : list2) {
if (list1.contains(string2))
count2++;
}
result2 = (count2 * 100.0) / length2; // 100.0 avoids integer division
double avg = (result1 + result2) / 2;
return avg;
}
private static List<String> getStopWords() {
String stopWordsString = ".,a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your";
List<String> stopWordsList = new ArrayList<String>();
List<String> stopWordTokenList = new ArrayList<String>();
List<String> whiteList = new ArrayList<String>();
Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(
stopWordsString.toCharArray(), 0, stopWordsString.length());
tokenizer.tokenize(stopWordTokenList, whiteList);
for (int i = 0; i < stopWordTokenList.size(); i++) {
// System.out.println((i + 1) + ":" + tokenList.get(i));
if (!stopWordTokenList.get(i).equals(",")) {
stopWordsList.add(stopWordTokenList.get(i));
}
}
System.out.println("No.of stop words: " + stopWordsList.size());
return stopWordsList;
}
private static List<String> filterStopWords(List<String> tokenList,
List<String> stopWordsList) {
List<String> filteredSentenceWords = new ArrayList<String>();
for (String sentenceToken : tokenList) {
if (!stopWordsList.contains(sentenceToken)) {
filteredSentenceWords.add(sentenceToken);
}
}
return filteredSentenceWords;
}
}
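For anyone who wants to try the word-overlap idea without LingPipe, here is a minimal sketch using only the JDK. It is not the code above: it swaps in Jaccard similarity (shared words divided by all distinct words) for the averaged percentage, splits on anything that is not a letter, digit or apostrophe, and does no stop-word filtering. WordOverlap, words and jaccard are hypothetical names.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class WordOverlap {
    // Lower-cases the sentence and collects its distinct words.
    static Set<String> words(String sentence) {
        Set<String> out = new HashSet<>();
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^a-z0-9']+")) {
            if (!token.isEmpty()) {
                out.add(token);
            }
        }
        return out;
    }

    // Jaccard similarity: |A intersect B| / |A union B|, between 0.0 and 1.0.
    static double jaccard(String a, String b) {
        Set<String> wordsA = words(a);
        Set<String> wordsB = words(b);
        if (wordsA.isEmpty() && wordsB.isEmpty()) {
            return 1.0; // two empty sentences are treated as identical
        }
        Set<String> common = new HashSet<>(wordsA);
        common.retainAll(wordsB);
        Set<String> all = new HashSet<>(wordsA);
        all.addAll(wordsB);
        return (double) common.size() / all.size();
    }

    public static void main(String[] args) {
        String sent1 = "Mary and Meera are my classmates.";
        String sent2 = "Meera and Mary are my classmates.";
        String sent4 = "Some totally different sentence.";
        System.out.println(jaccard(sent1, sent2));
        System.out.println(jaccard(sent1, sent4));
    }
}
```

Because it compares sets of words, this measure ignores word order, so reordered sentences score as identical. It says nothing about meaning, and a threshold for calling two sentences "similar" would have to be chosen by experiment.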