counting unique occurrences of string in document

counting unique occurrences of string in document - java

I am reading a log file into Java. For each line in the log file, I check whether the line contains an IP address. If it does, I want to add 1 to the count of the number of times that IP address has shown up in the log file. How can I accomplish this in Java?
The code below successfully extracts the ip address from each line that contains an ip address, but the process for counting occurrences of ip addresses does not work.
void read(String fileName) throws IOException {
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)));
int counter = 0;
ArrayList<IPHolder> ips = new ArrayList<IPHolder>();
try {
String line;
while ((line = br.readLine()) != null) {
if(!getIP(line).equals("0.0.0.0")){
if(ips.size()==0){
IPHolder newIP = new IPHolder();
newIP.setIp(getIP(line));
newIP.setCount(0);
ips.add(newIP);
}
for(int j=0;j<ips.size();j++){
if(ips.get(j).getIp().equals(getIP(line))){
ips.get(j).setCount(ips.get(j).getCount()+1);
}else{
IPHolder newIP = new IPHolder();
newIP.setIp(getIP(line));
newIP.setCount(0);
ips.add(newIP);
}
}
if(counter % 1000 == 0){System.out.println(counter+", "+ips.size());}
counter+=1;
}
}
} finally {br.close();}
for(int k=0;k<ips.size();k++){
System.out.println("ip, count: "+ips.get(k).getIp()+" , "+ips.get(k).getCount());
}
}
public String getIP(String ipString){//extracts an ip from a string if the string contains an ip
String IPADDRESS_PATTERN =
"(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)";
Pattern pattern = Pattern.compile(IPADDRESS_PATTERN);
Matcher matcher = pattern.matcher(ipString);
if (matcher.find()) {
return matcher.group();
}
else{
return "0.0.0.0";
}
}
The holder class is:
public class IPHolder {
private String ip;
private int count;
public String getIp(){return ip;}
public void setIp(String i){ip=i;}
public int getCount(){return count;}
public void setCount(int ct){count=ct;}
}

The keyword to search for in this case is HashMap.
A HashMap stores key-value pairs (in this case, pairs of IPs and their counts).
"192.168.1.12" - 12
"192.168.1.13" - 17
"192.168.1.14" - 9
and so on.
It is much easier to use and access than to always iterate over your array of container objects to find out whether there already is a container for that ip or not.
BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(/*Your file */)));
HashMap<String, Integer> occurrences = new HashMap<String, Integer>();
String line = null;
while( (line = br.readLine()) != null) {
// Iterate over lines and search for ip address patterns
String[] addressesFoundInLine = ...;
for(String ip: addressesFoundInLine ) {
// Have you already seen this address earlier in the file? If yes, increase its counter by 1
if(occurrences.containsKey(ip))
occurrences.put(ip, occurrences.get(ip)+1);
// If not, create a new entry for this address
else
occurrences.put(ip, 1);
}
}
// A TreeMap keeps its entries sorted by key when the keys implement 'Comparable', which is the case for strings and integers
TreeMap<Integer, ArrayList<String>> turnedAround = new TreeMap<Integer, ArrayList<String>>();
Set<Entry<String, Integer>> es = occurrences.entrySet();
// Switch keys and values of HashMap and create a new TreeMap (in case there are two ips with the same count, add them to a list)
for(Entry<String, Integer> en: es) {
if(turnedAround.containsKey(en.getValue()))
turnedAround.get(en.getValue()).add(en.getKey());
else {
ArrayList<String> ips = new ArrayList<String>();
ips.add(en.getKey());
turnedAround.put(en.getValue(), ips);
}
}
// Print out the values (if two ips have the same count they are printed in no particular order; ordering them would require another sorting step)
for(Entry<Integer, ArrayList<String>> entry: turnedAround.entrySet()) {
for(String s: entry.getValue())
System.out.println(s + " - " + entry.getKey());
}
In my case the output was the following:
192.168.1.19 - 4
192.168.1.18 - 7
192.168.1.27 - 19
192.168.1.13 - 19
192.168.1.12 - 28
I answered this question about half an hour ago and I guess that is exactly what you are searching for, so if you need some example code, take a look at it.

Here is some code that uses a HashMap to store the IPs and a regex to match them in each line. It uses try-with-resources to automatically close the file.
EDIT: I added code to print in descending order like you asked in the other answer.
void read(String fileName) throws IOException {
//Step 1 find and register IPs and store their occurrence counts
HashMap<String, Integer> ipAddressCounts = new HashMap<>();
try (BufferedReader br = new BufferedReader(new InputStreamReader(new FileInputStream(fileName)))) {
Pattern findIPAddrPattern = Pattern.compile("((\\d+\\.){3}\\d+)"); // the dot is escaped so it only matches a literal '.'
String line;
while ((line = br.readLine()) != null) {
Matcher matcher = findIPAddrPattern.matcher(line);
while (matcher.find()) {
String ipAddr = matcher.group(0);
if ( ipAddressCounts.get(ipAddr) == null ) {
ipAddressCounts.put(ipAddr, 1);
}
else {
ipAddressCounts.put(ipAddr, ipAddressCounts.get(ipAddr) + 1);
}
}
}
}
//Step 2 reverse the map to store IPs by their frequency
HashMap<Integer, HashSet<String>> countToAddrs = new HashMap<>();
for (Map.Entry<String, Integer> entry : ipAddressCounts.entrySet()) {
Integer count = entry.getValue();
if ( countToAddrs.get(count) == null )
countToAddrs.put(count, new HashSet<String>());
countToAddrs.get(count).add(entry.getKey());
}
//Step 3 sort and print the ip addresses, most frequent first
ArrayList<Integer> allCounts = new ArrayList<>(countToAddrs.keySet());
Collections.sort(allCounts, Collections.reverseOrder());
for (Integer count : allCounts) {
for (String ip : countToAddrs.get(count)) {
System.out.println("ip, count: " + ip + " , " + count);
}
}
}
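The two answers above can be condensed a little. The following is only a minimal sketch, assuming Java 8 or later: it counts with Map.merge and sorts the entries by value instead of inverting the map. The class name IpCounter, the method names, the sample log lines and the loose IPv4 pattern (it does not range-check the octets) are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IpCounter {
    // Loose IPv4 matcher (an assumption for this sketch; octets are not range-checked).
    private static final Pattern IP = Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");

    /** Counts every IP-like token found in the given lines. */
    static Map<String, Integer> countIps(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            Matcher m = IP.matcher(line);
            while (m.find()) {
                // Inserts 1 for a new key, otherwise adds 1 to the stored count.
                counts.merge(m.group(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the entries ordered by count, most frequent first. */
    static List<Map.Entry<String, Integer>> byCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }

    public static void main(String[] args) {
        // Hypothetical log lines; in the question these would come from br.readLine().
        List<String> log = Arrays.asList(
                "GET /index.html from 192.168.1.12",
                "no address on this line",
                "GET /about.html from 192.168.1.13",
                "POST /login from 192.168.1.12");
        for (Map.Entry<String, Integer> e : byCountDescending(countIps(log))) {
            System.out.println("ip, count: " + e.getKey() + " , " + e.getValue());
        }
    }
}
```

Sorting the entry list avoids the second map entirely; entries with equal counts still come out in no guaranteed order.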

Related

Input/Output blank space

In this code, I read all the words from a file and count them. After that, I write them and their frequencies to a file.
This code does exactly what I want, but it also counts the blank strings and writes them to the file, too. How can I exclude them?
String line;
BigDecimal count = new BigDecimal(0);
ArrayList<String> words = new ArrayList<String>();
Pattern pattern = Pattern.compile("[^a-zA-Z]", Pattern.CASE_INSENSITIVE);
while ((line = reader.readLine()) != null) {
String string1 = line.toLowerCase();
String string[] = pattern.split(string1);
for (String s : string) {
words.add(s);
}
}
Map<String, BigDecimal> map = new HashMap<String, BigDecimal>();
for (String s : words) {
BigDecimal x = new BigDecimal(1);
if (map.containsKey(s)) {
count = map.get(s);
map.put(s, count.add(x));
} else if (!map.containsKey(s)) {
map.put(s, x);
}
}
Map<String, BigDecimal> wordHistogram = map;
List<Entry<String, BigDecimal>> sortedWordHistogram = new LinkedList<Entry<String, BigDecimal>>(
wordHistogram.entrySet());
Collections.sort(sortedWordHistogram, (o1, o2) -> o2.getValue().compareTo(o1.getValue()));
Map<String, BigDecimal> inTxt = map;
for (Entry<String, BigDecimal> entry : sortedWordHistogram) {
inTxt.put(entry.getKey(), entry.getValue());
writer.write(entry.getKey() + " : " + entry.getValue() + "\n");
}
I believe it is efficient enough, but any adjustment that makes it better or more efficient is welcome.

Simply replace your regex ([^a-zA-Z]) with \\s+.
This treats any run of whitespace between words as a single delimiter when splitting a line.
Also, you can simplify your code further by replacing the following lines:
Pattern pattern = Pattern.compile("[^a-zA-Z]", Pattern.CASE_INSENSITIVE);
while ((line = reader.readLine()) != null) {
String string1 = line.toLowerCase();
String string[] = pattern.split(string1);
for (String s : string) {
words.add(s);
}
}
with
while ((line = reader.readLine()) != null) {
String string[] = line.trim().toLowerCase().split("\\s+");
for (String s : string) {
words.add(s);
}
}
Note that I have also used trim() additionally in order to remove the leading and trailing whitespace characters from the line before splitting it.
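One thing to be aware of: splitting on \\s+ leaves punctuation attached to the words, whereas the original pattern stripped it. If that matters, a possible alternative (just a sketch, not the only way) is to keep the letter-based pattern, add a + so that a run of non-letters counts as one delimiter, and skip any empty token that remains. The class and method names below are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordSplitter {
    /** Splits a line into lower-case words, dropping empty tokens. */
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        // The "+" makes a run of non-letters act as a single delimiter.
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            // A delimiter at the very start of the line still yields one empty token.
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [hello, world, hello, again]
        System.out.println(words("  Hello,  world! Hello again."));
    }
}
```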

number of element occurence

I'm trying to find the number of occurrences of each element using a TreeSet and a HashMap.
When I run the program, the value in the HashMap does not increase.
I've tried map.put(data,map.get(data)+1), but it causes a NullPointerException.
public class ReadData {
public static void main(String[] args) {
File f = new File("E:\\new1.txt");
try {
BufferedReader br = new BufferedReader(new FileReader(f));
String data = "";
int count =1;
HashMap<String,Integer> map = null;
TreeSet<String> set = new TreeSet<String>();
set.add("");
while((data=br.readLine())!=null) {
map = new HashMap<String,Integer>();
if(set.contains(data)) {
map.put(data,map.get(data)+1);
System.out.println("correct");
System.out.println(count+1);
}else
{
map.put(data,count);
set.add(data);
System.out.println("Not correct");
}
//System.out.println(map);
Set sets = map.entrySet();
Iterator iterator = sets.iterator();
while(iterator.hasNext()) {
Map.Entry mentry = (Map.Entry)iterator.next();
System.out.print("key is: "+ mentry.getKey() + " & Value is: ");
System.out.println(mentry.getValue());
}
}
}catch(Exception e) {
System.out.println(e);
}
}
}
input:- orange
apple
orange
orange
expected output: key is orange & value is 3
key is apple & value is 1
The output is key is: orange & Value is: 1
key is: apple & Value is: 1
java.lang.NullPointerException

You can do it more cleanly using streams, with Collectors.groupingBy() and Collectors.counting(). You should also use the try-with-resources construct and the newer Files class:
String delimiter = " ";
Path p = Paths.get("E:", "file.txt");
try (BufferedReader br = Files.newBufferedReader(p)) {
Map<String, Long> result = br.lines()
.flatMap(l -> Arrays.stream(l.split(delimiter)))
.collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
System.out.println(result);
}
For orange apple orange orange input this code will print {orange=3, apple=1}.

Please note that
HashMap<String,Integer> map = null;
is not the same as an empty map. First of all, you must create a new map before you use it.
In this case use, for example,
HashMap<String,Integer> map = new HashMap<String,Integer>();
Also, you are creating a new map inside the loop on every iteration, so the counts from earlier lines are lost (and map.get(data) returns null, which causes the NullPointerException when 1 is added to it). I would suggest instantiating the map together with the set and removing
map = new HashMap<String,Integer>();
from inside the while loop.
Your code should look like this:
HashMap<String, Integer> map = new HashMap<String, Integer>();
TreeSet<String> set = new TreeSet<String>();
set.add("");
while ((data = br.readLine()) != null) {
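To round that off, here is a minimal sketch of the whole counting loop with the map created once, outside the loop. It assumes Java 8 for getOrDefault, drops the TreeSet (the map's own keys already record what has been seen), and takes the lines as a list so that it runs without a file. The class and method names are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LineCounter {
    /** Counts how often each line occurs. */
    static Map<String, Integer> countLines(List<String> lines) {
        Map<String, Integer> map = new HashMap<>(); // created once, before the loop
        for (String data : lines) {
            // getOrDefault returns 0 for an unseen key, so there is no null to unbox.
            map.put(data, map.getOrDefault(data, 0) + 1);
        }
        return map;
    }

    public static void main(String[] args) {
        // Same input as in the question.
        Map<String, Integer> counts = countLines(Arrays.asList("orange", "apple", "orange", "orange"));
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            System.out.println("key is: " + entry.getKey() + " & Value is: " + entry.getValue());
        }
    }
}
```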

You can also use TreeMap instead of using HashMap + TreeSet.
public class ReadData {
public static void main(String[] args) {
try {
File f = new File("E:\\new1.txt");
BufferedReader br = new BufferedReader(new FileReader(f));
TreeMap<String,Integer> map = new TreeMap<String,Integer>();
String data; // declared outside the condition; a declaration is not allowed inside it
while((data=br.readLine()) != null) {
String[] fruitNames = data.split(" "); // or the regex \\s+ can also be used
for(String fruitName : fruitNames){
Integer count = map.get(fruitName);
Integer newVal = count == null ? 1 : count+1 ;
map.put(fruitName, newVal);
}
// iterate over keys in TreeMap
}
}catch(Exception e) {
System.out.println(e);
}
}
}

If you want to count the occurrences of a string, you can simply use StringUtils.countMatches from Apache Commons Lang.
//First get all the words from your line -
String[] allWords = data.split("\\s");
//Retrieve unique strings
String[] uniqueStrings = Arrays.stream(allWords).distinct().toArray(String[]::new);
// Print the occurrence of each string in data
for (String word: uniqueStrings){
System.out.println("Count of occurrences for the word " + word + " is: " + StringUtils.countMatches(data, word));
}
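Keep in mind that countMatches counts substring matches, so a word that appears inside a longer word is counted as well (for example "apple" inside "pineapple"). If only whole words should count, one option that needs no extra library is Collections.frequency on the token list. This is a minimal sketch; the class and method names are made up.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class TokenFrequency {
    /** Counts how many whitespace-separated tokens of data equal word exactly. */
    static int occurrences(String data, String word) {
        List<String> tokens = Arrays.asList(data.trim().split("\\s+"));
        return Collections.frequency(tokens, word); // compares with equals()
    }

    public static void main(String[] args) {
        String data = "apple pineapple apple";
        // Whole-word count: prints 2 (a substring count would also hit "pineapple").
        System.out.println(occurrences(data, "apple"));
    }
}
```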

Assign a unique key to repeated Arraylist items. and Keep track of Ordering in java

I have data like this:
I am collecting names in an ArrayList of Strings.
Example:
souring.add(some word);
Later, souring contains something like {a,b,c,d,d,e,e,e,f}.
I want to assign each element a key like this:
0=a
1=b
2=c
3=d
3=d
4=e
4=e
4=e
5=f
and then I store all the ordering keys in an array, like:
array= [0,1,2,3,3,4,4,4,5]
Here is the code I am working on:
public void parseFile(String path){
String myData="";
try {
BufferedReader br = new BufferedReader(new FileReader(path)); {
int remainingLines = 0;
String stringYouAreLookingFor = "";
for(String line1; (line1 = br.readLine()) != null; ) {
myData = myData + line1;
if (line1.contains("relation ") && line1.endsWith(";")) {
remainingLines = 4;//<Number of Lines you want to read after keyword>;
stringYouAreLookingFor += line1;
String everyThingInsideParentheses = stringYouAreLookingFor.replaceFirst(".*\\((.*?)\\).*", "$1");
String[] splitItems = everyThingInsideParentheses.split("\\s*,\\s*");
String[] sourceNode = new String[10];
String[] destNode = new String[15];
int i=0;
int size = splitItems.length;
int no_of_sd=size;
tv.setText(tv.getText()+"size " + size + "\n"+"\n"+"\n");
sourceNode[0]=splitItems[i];
// here I want to check and assign keys and track order...
souring.add(names);
if(size==2){
destNode[0]=splitItems[i+1];
tv.setText(tv.getText()+"dest node = " + destNode[0] +"\n"+"\n"+"\n");
destination.add(destNode[0]);
}
else{
tv.setText(tv.getText()+"dest node = No destination found"+"\n"+"\n"+"\n");
}
} else if (remainingLines > 0) {
remainingLines--;
stringYouAreLookingFor += line1;
}
}
br.close();
}
} catch (IOException e) {
e.printStackTrace();
}
}
How can I do this? Can anyone help me with this?

I would advise you to use ArrayList instead of String[]
So, if you want to add an element you just write
ArrayList<String> list = new ArrayList<String>();
list.add("whatever you want");
Then, if you want to avoid repetitions just use the following concept:
if(!list.contains(someString)){
list.add(someString);
}
And if you want to reach some element you just type:
list.get(index);
Or you can easily find an index of an element
int index=list.indexOf(someString);
Hope it helps!
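Putting that idea together for the example in the question, the sketch below gives each distinct string the next free key the first time it is seen and reuses that key for repeats. It uses a HashMap for the lookup instead of list.indexOf (a substitution on my part, so the lookup does not scan the list each time); the class and method names are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeyAssigner {
    /** Returns, for each item in order, the key of its first occurrence (0, 1, 2, ...). */
    static int[] assignKeys(List<String> items) {
        Map<String, Integer> keyOf = new HashMap<>();
        int[] keys = new int[items.size()];
        for (int i = 0; i < items.size(); i++) {
            String item = items.get(i);
            Integer key = keyOf.get(item);
            if (key == null) {          // first time this item is seen
                key = keyOf.size();     // next unused key
                keyOf.put(item, key);
            }
            keys[i] = key;
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> souring = Arrays.asList("a", "b", "c", "d", "d", "e", "e", "e", "f");
        // prints [0, 1, 2, 3, 3, 4, 4, 4, 5]
        System.out.println(Arrays.toString(assignKeys(souring)));
    }
}
```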

Why don't you give this a try? It took some time to understand what you actually want.
HashMap<Integer,String> storeValueWithKey=new HashMap<>();
// let x=4 be an existing key and y="x" be the new value you want to insert
if(storeValueWithKey.containsKey(x))
storeValueWithKey.put(x, storeValueWithKey.get(x)+","+y); // append to the values already stored under x
else
storeValueWithKey.put(z,y); //Here z is the new key
//Then, for searching, let key=4 be the key and searchValue="a" be the value to look for
ArrayList<String> searchIn=new ArrayList<>(Arrays.asList(storeValueWithKey.get(key).split(",")));
if(searchIn.contains(searchValue)) { /* the value is stored under this key */ }
If the problem still persists, leave a comment.

OutOfMemoryError: Java heap space-ArrayLists Java

for(int i=0; i<words.size(); i++){
for(int j=0; j<Final.size(); j++){
if(words.get(i)==Final.get(j)){
temp=times.get(j);
temp=temp+1;
times.set(j, temp);
}
else{
Final.add(words.get(i));
times.add(1);
}
}
}
I want to create two ArrayLists: times (Integers) and Final (Strings). The ArrayList "words" contains the words of a string, and some words appear multiple times. What I'm trying to do is add every word of "words" to "Final" (but just once), and add the number of times this word appears in "words" to "times". Is something wrong?
Because I get OutOfMemoryError: Java heap space

I also think using a HashMap is the best solution.
There is also an error in your code, and your problem may be here: == compares object references, not the contents of the strings.
Replace the following:
if(words.get(i)==Final.get(j)){
By :
if(words.get(i).equals(Final.get(j))){
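There is probably a second issue that explains the memory use: the else branch runs for every element of Final that does not match, not once after the whole search has failed, so Final keeps growing while it is being iterated. If you want to keep the two lists, a minimal sketch of the fix could look like this (the class name and the parameter name unique, standing in for Final, are made up; indexOf compares with equals()):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class TwoListCounter {
    /** Fills unique with each distinct word once and times with the matching counts. */
    static void count(List<String> words, List<String> unique, List<Integer> times) {
        for (String word : words) {
            int index = unique.indexOf(word);   // -1 if the word has not been seen yet
            if (index >= 0) {
                times.set(index, times.get(index) + 1);
            } else {
                // Added only once, after the whole list has been searched.
                unique.add(word);
                times.add(1);
            }
        }
    }

    public static void main(String[] args) {
        List<String> unique = new ArrayList<>();
        List<Integer> times = new ArrayList<>();
        count(Arrays.asList("a", "b", "a"), unique, times);
        System.out.println(unique + " " + times); // prints [a, b] [2, 1]
    }
}
```

Each word still costs a scan of the list, so for large inputs a HashMap remains the better fit.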

You don't need two lists to find each word and its count. You can get this information from a single HashMap, where the key is the word and the value is its count.
For example, declare one HashMap:
Map<String, Integer> words = new HashMap<String, Integer>();
and then use the map in the following way:
try {
//getting content from file.
Scanner inputFile = new Scanner(new File("d:\\test.txt"));
//reading line by line
while (inputFile.hasNextLine()) {
// StringTokenizer splits the string on whitespace by default.
StringTokenizer tokenizer = new StringTokenizer(
inputFile.nextLine());
while (tokenizer.hasMoreTokens()) {
String word = tokenizer.nextToken();
// If the HashMap already contains the key, increment the
// value
if (words.containsKey(word)) {
words.put(word, words.get(word) + 1);
}
// Otherwise, set the value to 1
else {
words.put(word, 1);
}
}
}
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}// Loop through the HashMap and print the results
for (Entry<String, Integer> entry : words.entrySet()) {
String key = entry.getKey();
Integer value = entry.getValue();
System.out.println("Word " + key + ": its occurrence " + value);
}

If you run out of memory, it is most likely from reading all the words into a collection. I suggest you do not do this and instead count the words as you read them.
e.g.
Map<String, Integer> freq = new HashMap<>();
try(BufferedReader br = new BufferedReader(new FileReader(filename))) {
for(String line; (line = br.readLine()) != null; ) {
for(String word : line.trim().split("\\s+")) {
Integer count = freq.get(word);
freq.put(word, count == null ? 1 : 1 + count);
}
}
}

try this example.
String[] words = {"asdf","zvcxc", "asdf", "zxc","zxc", "zxc"};
Map<String, Integer> result = new HashMap<String, Integer>();
for (String word : words) {
if (!result.containsKey(word)) {
result.put(word, 1);
} else {
result.put(word, result.get(word) + 1);
}
}
//print result
for (Map.Entry<String, Integer> entry : result.entrySet()) {
System.out.println(String.format("%s -- %s times", entry.getKey(), entry.getValue()));
}
Output:
zvcxc -- 1 times
zxc -- 3 times
asdf -- 2 times

How can I count the number of cities per country from the data file?

How can I count the number of cities per country from the data file? I would also like to display the value as percentage of the total.
import java.util.StringTokenizer;
import java.io.*;
public class city
{
public static void main(String[] args)
{
String[] city = new String[120];
String country = null;
String[] latDegree =new String[120];
String lonDegree =null;
String latMinute =null;
String lonMinute =null;
String latDir = null;
String lonDir = null;
String time = null;
String amORpm = null;
try
{
File myFile = new File("CityLongandLat.txt");
FileReader fr = new FileReader(myFile);
BufferedReader br = new BufferedReader(fr);
String line = null;
int position =0;
int latitude=0;
while( (line = br.readLine()) != null)
{
// System.out.println(line);
StringTokenizer st = new StringTokenizer(line,",");
while(st.hasMoreTokens())
{
city[position] = st.nextToken();
country = st.nextToken();
latDegree[latitude] =st.nextToken();
latMinute =st.nextToken();
latDir = st.nextToken();
lonDegree =st.nextToken();
lonMinute =st.nextToken();
lonDir = st.nextToken();
time = st.nextToken();
amORpm = st.nextToken();
}
if(city.length<8)
{
System.out.print(city[position] + "\t\t");
}
else
{
System.out.print(city[position] + "\t");
}
if(country.length()<16)
{
System.out.print(country +"\t\t");
}
else
{
System.out.print(country);
}
System.out.print(latDegree + "\t");
System.out.print(latMinute + "\t");
System.out.print(latDir + "\t");
System.out.print(lonDegree + "\t");
System.out.print(lonMinute + "\t");
System.out.print(lonDir + "\t");
System.out.print(time + "\t");
System.out.println(amORpm + "\t");
position++;
}
br.close();
}
catch(Exception ex)
{
System.out.println("Error !!!");
}
}
}

One easy way that comes to my mind would be as follows...
Create a HashMap object where the key is a String (the country) and the value is an Integer (the number of cities found for that country), so it would be something like
Map<String, Integer> countryResultsFoundMap = new HashMap<String, Integer>();
In short, for each row you would pick the country (I would recommend that you .trim() and .toLowerCase() the value first) and check whether it already exists in the HashMap. If not, add the entry with countryResultsFoundMap.put(country, 1), since this row is the first city found for it; otherwise, take the current value from the HashMap and store it back incremented by 1.
Eventually you will have all the values stored in the map and you can have access to that data for your calculations.
Hope that helps
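As a rough sketch of those steps, including the percentage asked for in the question: the code below assumes each line is comma-separated with the city first and the country second (as in the question's tokenizer), trims the country but does not lower-case it, and uses a TreeMap only so that the output is alphabetical. The class name, method names and sample rows are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

class CityStats {
    /** Counts the cities per country; the country is expected in the second field. */
    static Map<String, Integer> citiesPerCountry(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String[] tokens = line.split(",");
            if (tokens.length < 2) {
                continue; // skip rows without a country field
            }
            counts.merge(tokens[1].trim(), 1, Integer::sum);
        }
        return counts;
    }

    /** Share of count in total, as a percentage. */
    static double percentage(int count, int total) {
        return total == 0 ? 0.0 : 100.0 * count / total;
    }

    public static void main(String[] args) {
        // Hypothetical rows, shortened to the first fields.
        List<String> lines = Arrays.asList(
                "Aberdeen,Scotland,57", "Adelaide,Australia,34", "Edinburgh,Scotland,55");
        Map<String, Integer> counts = citiesPerCountry(lines);
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            System.out.println(String.format(Locale.ROOT, "%s: %d (%.1f%%)",
                    e.getKey(), e.getValue(), percentage(e.getValue(), total)));
        }
    }
}
```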

"here are some of the output from the data file from my programme"
Aberdeen Scotland 57 2 [Ljava.lang.String;@33906773 9 N [Ljava.lang.String;@4d77c977 9 W 05:00 p.m. Adelaide Australia 34 138 [Ljava.lang.String;@33906773 55 S [Ljava.lang.String;@4d77c977 36 E 02:30 a.m...
The reason why you're getting that output is that you're trying to print the array object latDegree itself.
String[] latDegree
...
System.out.print(latDegree + "\t");
Also, you have latitude = 0; but you never increment it, so it will always use index 0 for the array. You need to increment it, like you did with position++.
So for the print statement, print the value at index latitude, not the entire array.
Try this
System.out.print(latDegree[latitude] + "\t");
...
latitude++;
If for some reason you do want to print the array, then use Arrays.toString(array); or just iterate through it
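A small sketch of the difference, with made-up values and a made-up class name:

```java
import java.util.Arrays;

class ArrayPrinting {
    /** Renders the contents of the array instead of its type name and hash code. */
    static String render(String[] values) {
        return Arrays.toString(values);
    }

    public static void main(String[] args) {
        String[] latDegree = {"57", "34"};
        System.out.println(latDegree);          // the array object: type name plus a hash code
        System.out.println(latDegree[0]);       // one element: 57
        System.out.println(render(latDegree));  // all elements: [57, 34]
    }
}
```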

I would also start with a map and group the cities by country.
Map<String, List<String>>
Here the key is the country and the value is the list of cities in this country. With the size() method of each list you can compute the number of cities per country and the percentage of the total.
When you read a line, you check whether the key (country) already exists. If not, you create a new list and add the city; otherwise you add the city to the existing list.
As a starter you could use the following snippet. However, this sample assumes that the content of the file has already been read and is passed as an argument to the method.
Map<String,List<String>> groupByCountry(List<String> lines){
Map<String,List<String>> group = new HashMap<>();
for (String line : lines) {
String[] tokens = line.split(",");
String city = tokens[0];
String country = tokens[1];
...
if(group.containsKey(country)){
group.get(country).add(city);
}else{
List<String> cities = new ArrayList<>();
cities.add(city);
group.put(country, cities);
}
}
return group;
}
