word count frequency in document - java

I have a directory containing 1000 .txt files. For every word, I want to know how many of the 1000 documents it occurs in. So even if the word "cow" occurs 100 times in document X, that still counts as one; if it also occurs in a different document, the count is incremented by one. The maximum is therefore 1000, if "cow" appears in every single document. How do I do this the easy way, without using any external library? Here's what I have so far:
private Hashtable<String, Integer> getAllWordCount()
{
    Hashtable<String, Integer> result = new Hashtable<String, Integer>();
    HashSet<String> words = new HashSet<String>();
    try {
        for (int j = 0; j < fileDirectory.length; j++) {
            File theDirectory = new File(fileDirectory[j]);
            File[] children = theDirectory.listFiles();
            for (int i = 0; i < children.length; i++) {
                Scanner scanner = new Scanner(new FileReader(children[i]));
                while (scanner.hasNext()) {
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    if (!words.contains(text)) {
                        if (result.get(text) == null)
                            result.put(text, 1);
                        else
                            result.put(text, result.get(text) + 1);
                        words.add(text);
                    }
                }
            }
            words.clear();
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(result.size());
    return result;
}

You also need a HashSet<String> in which you store each unique word you've read from the current file.
Then, after reading each word, check whether it's in the set; if it isn't, increment the corresponding value in the result map (or add a new entry if there was none, as you already do) and add the word to the set.
Don't forget to reset the set when you start reading a new file, though.

how about this?
private Hashtable<String, Integer> getAllWordCount()
{
    Hashtable<String, Integer> result = new Hashtable<String, Integer>();
    HashSet<String> words = new HashSet<String>();
    try {
        for (int j = 0; j < fileDirectory.length; j++) {
            File theDirectory = new File(fileDirectory[j]);
            File[] children = theDirectory.listFiles();
            for (int i = 0; i < children.length; i++) {
                Scanner scanner = new Scanner(new FileReader(children[i]));
                while (scanner.hasNext()) {
                    String text = scanner.next().replaceAll("[^A-Za-z0-9]", "");
                    words.add(text);
                }
                for (String word : words) {
                    Integer count = result.get(word);
                    if (count == null) {
                        result.put(word, 1);
                    } else {
                        result.put(word, count + 1);
                    }
                }
                words.clear();
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    System.out.println(result.size());
    return result;
}
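For comparison, here is a minimal self-contained sketch of the same document-frequency idea using `Map.merge` (Java 8+). The tokenization regex is carried over from the code above; for simplicity the documents are passed in as plain strings rather than read from files, which also makes the core logic easy to test in isolation:

```java
import java.util.*;

// Document-frequency count: each document contributes at most 1 per word,
// because a document's words are deduplicated into a Set before counting.
public class DocFrequency {
    public static Map<String, Integer> countDocumentFrequency(List<String> documents) {
        Map<String, Integer> result = new HashMap<>();
        for (String doc : documents) {
            Set<String> seen = new HashSet<>();
            for (String token : doc.split("\\s+")) {
                String word = token.replaceAll("[^A-Za-z0-9]", "");
                if (!word.isEmpty()) seen.add(word);     // dedupe within this document
            }
            for (String word : seen) result.merge(word, 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> df = countDocumentFrequency(
                List.of("the cow cow jumps", "the moon", "cow moon"));
        System.out.println(df.get("cow")); // 2: "cow" appears in two documents
    }
}
```

`merge` replaces the null-check-then-put dance with one call, which is the main readability win over the Hashtable version.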


IndexOutOfBoundsException for automation

I am trying to automate an application. For that, I am using a HashMap for the Excel data set, and I have created methods for performing actions on that data.
The test class is shown below:
@Test
public void testLAP_Creamix() throws Exception {
    try {
        launchMainApplication();
        Lapeyre_frMain Lapeyre_frMainPage = new Lapeyre_frMain(tool, test, user, application);
        HashMap<String, ArrayList<String>> win = CreamixWindowsDataset.main();
        SortedSet<String> keys = new TreeSet<>(win.keySet());
        for (String i : keys) {
            System.out.println("########### Test = " + win.get(i).get(0) + " ###########");
            Lapeyre_frMainPage.EnterTaille(win.get(i).get(1));
            Lapeyre_frMainPage.SelectCONFIGURATION(win.get(i).get(2));
            Lapeyre_frMainPage.SelectPLANVASQUE(win.get(i).get(3));
            Lapeyre_frMainPage.SelectCOULEUR(win.get(i).get(4));
            Lapeyre_frMainPage.SelectPOIGNEES(win.get(i).get(5));
            Lapeyre_frMainPage.SelectTYPE_DE_MEUBLE(win.get(i).get(6));
            Lapeyre_frMainPage.VerifyPanierPrice(win.get(i).get(7));
            Lapeyre_frMainPage.VerifyECO_PARTPrice(win.get(i).get(8));
            Lapeyre_frMainPage.ClickCREAMIXReinit();
            System.out.println("########### Test End ##############");
        }
        test.setResult("pass");
    } catch (AlreadyRunException e) {
    } catch (Exception e) {
        verificationErrors.append(e.getMessage());
        throw e;
    }
}
HashMap code:
public static HashMap<String, ArrayList<String>> main() throws IOException {
    final String DatasetSheet = "src/test/resources/CreamixDataSet.xlsx";
    final String DatasetTab = "Creamix";
    Object[][] ab = DataLoader.ReadMyExcelData(DatasetSheet, DatasetTab);
    int rowcount = DataLoader.myrowCount(DatasetSheet, DatasetTab);
    int colcount = DataLoader.mycolCount(DatasetSheet, DatasetTab);
    HashMap<String, ArrayList<String>> map = new HashMap<String, ArrayList<String>>();
    // i = 2 to avoid column names
    for (int i = 2; i < rowcount; i++) {
        ArrayList<String> mycolvalueslist = new ArrayList<String>();
        for (int j = 0; j < colcount; j++) {
            mycolvalueslist.add(ab[i][j].toString());
        }
        map.put(ab[i][0].toString(), mycolvalueslist);
    }
    return map;
}
Query: I was able to run this code a few days back, but after adding some new columns it now gives the error below:
IndexOutOfBoundsException: Index 7 out of bounds for length 7
I am not able to trace the issue. What should I look for? Please help!
Set a breakpoint inside the loop and watch the size of each row list: the message "Index 7 out of bounds for length 7" means a row list holds only 7 values, while the code calls get(7) and get(8). Most likely the new columns were not read for every row of the sheet.
for (String i : keys) {
    ArrayList<String> arr = win.get(i); // debug here, watch its size
    Lapeyre_frMainPage.EnterTaille(arr.get(1));
}
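To turn the raw IndexOutOfBoundsException into an error message that names the problem, a small guard helper can be used before each access. This is an illustration only; `RowGuard` and `cell` are hypothetical names, not part of the project above:

```java
import java.util.List;

// Hypothetical guard illustrating the failure mode: the row lists loaded
// from the sheet hold only 7 cells, while the test asks for index 7 and 8.
public class RowGuard {
    public static String cell(List<String> row, int index) {
        if (index >= row.size()) {
            throw new IllegalStateException(
                "Row has " + row.size() + " cells but index " + index
                + " was requested -- check the newly added sheet columns");
        }
        return row.get(index);
    }

    public static void main(String[] args) {
        List<String> row = List.of("name", "taille", "config");
        System.out.println(cell(row, 1)); // in range: returns "taille"
    }
}
```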

Clustering many sentences using the Weka lib in Java

I have 5 text files.
I merge these files into 1 file, which contains about 60 sentences.
I want to cluster that file into 5 clusters.
I am using Weka for the clustering.
public static void doClustering(String pathSentences, int numberCluster) throws IOException {
    Helper.deleteAllFileInFolder("results");
    // number of clusters = number of sentences in the file / average sentences per file
    HashMap<Integer, String> sentences = new HashMap<>();
    HashMap<Integer, Integer> clustering = new HashMap<>();
    try {
        StringToWordVector filter = new StringToWordVector();
        SimpleKMeans kmeans = new SimpleKMeans();
        FastVector atts = new FastVector(5);
        atts.addElement(new Attribute("text", (FastVector) null));
        Instances docs = new Instances("text_files", atts, 0);
        Scanner sc = new Scanner(new File(pathSentences));
        while (sc.hasNextLine()) {
            String content = sc.nextLine();
            double[] newInst = new double[1];
            newInst[0] = (double) docs.attribute(0).addStringValue(content);
            docs.add(new SparseInstance(1.0, newInst));
            sentences.put(sentences.size(), content);
            clustering.put(clustering.size(), -1);
        }
        NGramTokenizer tokenizer = new NGramTokenizer();
        tokenizer.setNGramMinSize(10);
        tokenizer.setNGramMaxSize(10);
        tokenizer.setDelimiters("\\W");
        filter.setTokenizer(tokenizer);
        filter.setInputFormat(docs);
        filter.setLowerCaseTokens(true);
        filter.setWordsToKeep(1);
        Instances filteredData = Filter.useFilter(docs, filter);
        kmeans.setPreserveInstancesOrder(true);
        kmeans.setNumClusters(numberCluster);
        kmeans.buildClusterer(filteredData);
        int[] assignments = kmeans.getAssignments();
        int i = 0;
        for (int clusterNum : assignments) {
            clustering.put(i, clusterNum);
            i++;
        }
        PrintWriter[] pw = new PrintWriter[numberCluster];
        for (int j = 0; j < numberCluster; j++) {
            pw[j] = new PrintWriter(new File("results/result" + j + ".txt"));
        }
        sentences.entrySet().stream().forEach((entry) -> {
            Integer key = entry.getKey();
            String value = entry.getValue();
            Integer cluster = clustering.get(key);
            pw[cluster].println(value);
        });
        for (int j = 0; j < numberCluster; j++) {
            pw[j].close();
        }
    } catch (Exception e) {
        System.out.println("Error K means " + e);
    }
}
When I change the order of the input file, the clustering results also vary.
Can you help me fix this? Thank you so much.
k-means is a randomized algorithm.
It picks some instances as initial seeds, then searches for a local optimum.
So of course it will produce different results!
If they vary a lot, this indicates it did not work well. If your data is good for k-means, then most runs will produce very similar results (except for permutation of labels).
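If you need repeatable runs, you can fix the random seed: Weka's SimpleKMeans exposes this via setSeed(int), so the same seed always yields the same initial centroids. The following self-contained sketch demonstrates the principle with a hypothetical centroid-picking helper (plain java.util.Random, not Weka itself):

```java
import java.util.Arrays;
import java.util.Random;

// Illustration only: a randomized algorithm becomes reproducible once its
// seed is fixed. Weka's SimpleKMeans offers the same knob via setSeed(int).
public class SeedDemo {
    // Hypothetical stand-in for k-means initial centroid selection.
    public static int[] pickInitialCentroids(int numInstances, int k, long seed) {
        Random rnd = new Random(seed);            // fixed seed -> fixed sequence
        int[] picks = new int[k];
        for (int i = 0; i < k; i++) {
            picks[i] = rnd.nextInt(numInstances); // index of a chosen instance
        }
        return picks;
    }

    public static void main(String[] args) {
        int[] first = pickInitialCentroids(60, 5, 42L);
        int[] second = pickInitialCentroids(60, 5, 42L);
        System.out.println(Arrays.equals(first, second)); // true: identical picks
    }
}
```

Note that a fixed seed makes results reproducible but not necessarily better; if different seeds give wildly different clusterings, the data is likely not well suited to k-means, as the answer above says.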

assign HashMap values to dynamic jComboBoxes

I am loading text file contents into a GUI and counting HashMap values using this code:
Map<String, ArrayList<String>> sections = new HashMap<>();
Map<String, String> sections2 = new HashMap<>();
String s = "", lastKey = "";
try (BufferedReader br = new BufferedReader(new FileReader("input.txt"))) {
    while ((s = br.readLine()) != null) {
        String k = s.substring(0, 10).trim();
        String v = s.substring(10, s.length() - 50).trim();
        if (k.equals(""))
            k = lastKey;
        ArrayList<String> authors = null;
        if (sections.containsKey(k)) {
            authors = sections.get(k);
        } else {
            authors = new ArrayList<String>();
            sections.put(k, authors);
        }
        authors.add(v);
        lastKey = k;
    }
} catch (IOException e) {
}
// to get the number of authors
int numOfAuthors = sections.get("AUTHOR").size();
// to count HashMap value
jButton1.addActionListener(new Clicker(numOfAuthors));
jButton1.doClick();
// convert the list to a string to load it in a GUI
String authors = "";
for (String a : sections.get("AUTHOR")) {
    authors += a;
}
jcb1.setSelectedItem(authors);
The ActionListener of jButton1 was borrowed from here.
Now I want to assign the AUTHOR values (the list in the HashMap has 12 items, so jButton1 will add 12 dynamic jComboBoxes) to the dynamically created jComboBoxes.
I have tried this code:
BufferedReader br = new BufferedReader(new FileReader("input.txt"));
String str = null;
int i = 0;
while ((str = br.readLine()) != null) {
    String v = str.substring(12, str.length() - 61).trim();
    if (i == 0) {
        jcb1.setSelectedItem(v);
    } else {
        SubPanel panel = (SubPanel) jPanel2.getComponent(i - 1);
        JComboBox jcb = panel.getJcb();
        jcb.setSelectedItem(v);
    }
    i++;
}
But this code reads all 70 lines from input.txt, while I only want to assign the 12 values from the AUTHOR field and show them on the jcb.
How can I solve this?
You shouldn't have to re-read the entire text file in order to complete the setup of the GUI. I would read the text file once, then use the Map<String, ArrayList<String>> sections object to complete the setup of the GUI.
This could be the process for you:
1) Read the entire file and return the sections HashMap.
2) Set up jPanel2 by adding the SubPanels to it (e.g. based on the number of authors).
3) Set up the JComboBoxes by adding the data stored in the HashMap (e.g. the mapped ArrayLists).
For number 1), I would just create a method that reads the file and returns the HashMap.
Read The File
Example (Adapted from your other question here):
public Map<String, ArrayList<String>> getSections()
{
    Map<String, ArrayList<String>> sections = new HashMap<>();
    String s = "", lastKey = "";
    try (BufferedReader br = new BufferedReader(new FileReader("input.txt")))
    {
        while ((s = br.readLine()) != null)
        {
            String k = s.substring(0, 10).trim();
            String v = s.substring(10, s.length() - 50).trim();
            if (k.equals(""))
                k = lastKey;
            ArrayList<String> authors = null;
            if (sections.containsKey(k))
            {
                authors = sections.get(k);
            }
            else
            {
                authors = new ArrayList<String>();
                sections.put(k, authors);
            }
            // don't add empty strings
            if (v.length() > 0)
            {
                authors.add(v);
            }
            lastKey = k;
        }
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return sections;
}
GUI Setup
Note: This code can be put wherever you are setting up the GUI now; I'm just placing it all in the method below as an example.
public void setupGUI()
{
    // read the file and get the map
    Map<String, ArrayList<String>> sections = getSections();
    // get the authors
    ArrayList<String> authors = sections.get("AUTHOR");
    // set up jPanel2 by adding the SubPanels
    int num = authors.size();
    jButton1.addActionListener(new Clicker(num));
    jButton1.doClick();
    // set up the JComboBoxes by adding the data stored in the map
    for (int i = 0; i < authors.size(); i++)
    {
        int index = i;
        // not sure if getComponent() is zero- or 1-based, so adjust the index accordingly
        SubPanel panel = (SubPanel) jPanel2.getComponent(index);
        // Not sure if you already have the JComboBox in the SubPanel.
        // If not, you can add them here.
        JComboBox jcb = panel.getJcb();
        jcb.setSelectedItem(authors.get(i));
    }
}
Side Note: I'm not sure why you are creating 12 separate SubPanels, each with its own JComboBox. Maybe you want to consider how you can better lay out the GUI. Just a consideration. In either case, you can use the above examples as a starting point.

List<String[]> always adding the same values

In my Java project, I want to read values from a txt file into a List. The values look like:
1 kjhjhhkj 788
4 klkkld3 732
89 jksdsdsd 23
The number of rows is changeable. I have tried this code and am getting the same values at all indexes.
What can I do?
String[] dizi = new String[3];
List<String[]> listOfLists = new ArrayList<String[]>();
File f = new File("input.txt");
try {
    Scanner s = new Scanner(f);
    while (s.hasNextLine()) {
        int i = 0;
        while (s.hasNext() && i < 3) {
            dizi[i] = s.next();
            i++;
        }
        listOfLists.add(dizi);
    }
} catch (FileNotFoundException e) {
    System.out.println("An error occurred while opening the file");
}
int q = listOfLists.size();
for (int z = 0; z < q; z++) {
    for (int k = 0; k < 3; k++) {
        System.out.print(listOfLists.get(z)[k] + " ");
    }
}
String[] dizi = new String[3];
dizi is declared outside the loop and gets overwritten every time through it. That's why you are getting the same values at all indexes.
Make a new instance every time before adding to the list.
You are putting the same reference into the list each time; create a new array inside the while loop:
while (s.hasNextLine()) {
    String[] dizi = new String[3]; // new array
    int i = 0;
    while (s.hasNext() && i < 3) {
        dizi[i] = s.next();
        i++;
    }
    listOfLists.add(dizi);
}
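A minimal standalone demonstration of why the original version misbehaves: both list entries point at the same array, so the last values written are visible through every entry. The helper name `buildBuggy` and the sample values are illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Demonstrates the aliasing bug: adding the same array reference twice
// means later writes show up in both list entries.
public class AliasDemo {
    public static List<String[]> buildBuggy() {
        List<String[]> list = new ArrayList<>();
        String[] shared = new String[3];
        shared[0] = "1"; shared[1] = "kjhjhhkj"; shared[2] = "788";
        list.add(shared);
        shared[0] = "4"; shared[1] = "klkkld3"; shared[2] = "732";
        list.add(shared); // same reference added again
        return list;
    }

    public static void main(String[] args) {
        List<String[]> list = buildBuggy();
        // Both rows now show the second line's values.
        System.out.println(list.get(0)[1] + " " + list.get(1)[1]);
    }
}
```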

How to find the latest version of each jar with a Java program?

My project has 40 to 50 jar files, and it takes a lot of time to find the latest version of each jar every time. Can anyone help me write a Java program for this?
You may want to just use Maven: http://maven.apache.org/
Or another dependency manager, like Ivy.
At ant-build time, call this method:
public void ExpungeDuplicates(String filePath) {
    Map<String, Integer> replaceJarsMap = null;
    File folder = null;
    File[] listOfFiles = null;
    List<String> jarList = new ArrayList<String>();
    String files = "";
    File deleteFile = null;
    Iterator<String> mapItr = null;
    try {
        folder = new File(filePath);
        listOfFiles = folder.listFiles();
        for (int i = 0; i < listOfFiles.length; i++) {
            if (listOfFiles[i].isFile()) {
                files = listOfFiles[i].getName();
                jarList.add(files);
            }
        }
        if (jarList.size() > 0) {
            replaceJarsMap = PatternClassifier.findDuplicatesOrLowerVersion(jarList);
            System.err.println("Duplicate / Lower Version - Total Count : " + replaceJarsMap.size());
            mapItr = replaceJarsMap.keySet().iterator();
            while (mapItr.hasNext()) {
                String key = mapItr.next();
                int repeat = replaceJarsMap.get(key);
                System.out.println(key + " : " + repeat);
                for (int i = 0; i < repeat; i++) {
                    deleteFile = new File(filePath + System.getProperty("file.separator") + key);
                    try {
                        if (deleteFile != null && deleteFile.exists()) {
                            if (deleteFile.delete()) {
                                System.err.println(key + " deleted");
                            }
                        }
                    } catch (Exception e) {
                    }
                }
            }
        }
    } catch (Exception e) {
        // TODO: handle exception
    }
}
You only need to give the path of your lib directory to this function; it will find all the duplicate or lower-version files.
The crucial function is given below. It finds the duplicates from the list of files you provide.
public static Map<String, Integer> findDuplicatesOrLowerVersion(List<String> fileNameList) {
    List<String> oldJarList = new ArrayList<String>();
    String[] cmprTemp = null;
    boolean match = false;
    String regex = "", regexFileType = "", verInfo1 = "", verInfo2 = "",
           compareName = "", tempCompareName = "", tempJarName = "";
    Map<String, Integer> duplicateEntryMap = new HashMap<String, Integer>();
    int count = 0;
    Collections.sort(fileNameList, Collections.reverseOrder());
    try {
        int size = fileNameList.size();
        for (int i = 0; i < size; i++) {
            cmprTemp = fileNameList.get(i).split("[0-9\\._]*");
            for (String s : cmprTemp) {
                compareName += s;
            }
            regex = "^" + compareName + "[ajr0-9_\\-\\.]*";
            regexFileType = "[0-9a-zA-Z\\-\\._]*\\.jar$";
            if (fileNameList.get(i).matches(regexFileType) && !oldJarList.contains(fileNameList.get(i))) {
                for (int j = i + 1; j < size; j++) {
                    cmprTemp = fileNameList.get(j).split("[0-9\\._]*");
                    for (String s : cmprTemp) {
                        tempCompareName += s;
                    }
                    match = (fileNameList.get(j).matches(regexFileType) && tempCompareName.matches(regex));
                    if (match) {
                        cmprTemp = fileNameList.get(i).split("[a-zA-Z\\-\\._]*");
                        for (String s : cmprTemp) {
                            verInfo1 += s;
                        }
                        verInfo1 += "000";
                        cmprTemp = fileNameList.get(j).split("[a-zA-Z\\-\\._]*");
                        for (String s : cmprTemp) {
                            verInfo2 += s;
                        }
                        verInfo2 += "000";
                        int length = 0;
                        if (verInfo1.length() > verInfo2.length()) {
                            length = verInfo2.length();
                        } else {
                            length = verInfo1.length();
                        }
                        if (Long.parseLong(verInfo1.substring(0, length)) >= Long.parseLong(verInfo2.substring(0, length))) {
                            count = 0;
                            if (!oldJarList.contains(fileNameList.get(j))) {
                                oldJarList.add(fileNameList.get(j));
                                duplicateEntryMap.put(fileNameList.get(j), ++count);
                            } else {
                                count = duplicateEntryMap.get(fileNameList.get(j));
                                duplicateEntryMap.put(fileNameList.get(j), ++count);
                            }
                        } else {
                            tempJarName = fileNameList.get(i);
                        }
                        match = false;
                        verInfo1 = "";
                        verInfo2 = "";
                    }
                    tempCompareName = "";
                }
                if (tempJarName != null && !tempJarName.equals("")) {
                    count = 0;
                    if (!oldJarList.contains(fileNameList.get(i))) {
                        oldJarList.add(fileNameList.get(i));
                        duplicateEntryMap.put(fileNameList.get(i), ++count);
                    } else {
                        count = duplicateEntryMap.get(fileNameList.get(i));
                        duplicateEntryMap.put(fileNameList.get(i), ++count);
                    }
                    tempJarName = "";
                }
            }
            compareName = "";
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    return duplicateEntryMap;
}
What findDuplicatesOrLowerVersion(List<String> fileNameList) does: it finds the duplicates and returns a map containing the name of each file and the number of times a lower version repeats.
Try this. The files remaining in the folder should then be the latest ones, with no duplicates. I use this for finding the oldest files; on that basis it finds the old ones and deletes them.
Note that this only compares by name; further improvements can be made.
PatternClassifier is the class containing the second method given here.
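As one such improvement, version strings could be compared component by component instead of concatenating all digits into a single number, which avoids problems like "1.2.10" sorting below "1.2.9". A minimal sketch, not tied to the code above:

```java
// Compares dotted version strings (e.g. "1.2.10" vs "1.2.9") numerically,
// component by component; missing components are treated as 0.
public class VersionCompare {
    public static int compare(String v1, String v2) {
        String[] a = v1.split("\\.");
        String[] b = v2.split("\\.");
        int n = Math.max(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = i < a.length ? Integer.parseInt(a[i]) : 0;
            int y = i < b.length ? Integer.parseInt(b[i]) : 0;
            if (x != y) return Integer.compare(x, y);
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compare("1.2.10", "1.2.9")); // positive: 1.2.10 is newer
    }
}
```

This assumes purely numeric components; real jar names would first need the version substring extracted, e.g. with a regex like the ones used above.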
