Indexing book in Java

I'm trying to write a program that takes a text file as input and adds the words in it as keys; the values associated with the words should be the page numbers they are located on. The text looks like this:
Page1
blah bla bl
Page2
some blah
So for the word "blah" the output must be
blah : [1,2].
I only inserted the keys, but I can't figure out how to insert associated values to them. Here's what I have so far:
BufferedReader reader = new BufferedReader(input);
try {
    Map<String, List<Integer>> library
            = new TreeMap<String, List<Integer>>();
    String line = reader.readLine();
    while (line != null) {
        String[] tokens = line.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            String word = tokens[i];
            if (!library.containsKey(word)
                    && !word.startsWith("Page")) {
                library.put(word, new LinkedList<Integer>());
                if (tokens[0].startsWith("Page")
                        && library.containsKey(word)) {
                    List<Integer> pages = library.get(word);
                    int page = getNum(tokens[0]);
                    pages.add(page);
                    page++;
                }
            }
        }
        line = reader.readLine();
    }
}
To get the page number I use this method:
private static int getNum(String s) {
    int result = 0;
    int p = 1;
    int i = s.length() - 1;
    while (i >= 0) {
        int d = s.charAt(i) - '0';
        if (d >= 0 && d <= 9) {
            result += d * p;
        } else {
            break;
        }
        i--;
        p *= 10;
    }
    return result;
}
Thanks for all your ideas!

The pages variable is declared inside the scope of your inner if statement. Once that block ends the variable goes out of scope. If you want to use the list of pages later then it needs to be declared as a class variable.
I assume you are using pages to later generate a table of contents. But it's not strictly necessary as you can generate it later from your word index - I'll demonstrate how to do that below.
You also need to declare a currentPage variable which holds the latest 'PageN' value you have seen. There's no need to increment this manually: you should just store the number from the text (which copes with blank pages).
Page numbers seem to always be on their own line, so page detection should be done on the line text, not on each word (which copes with situations where a line reads 'for more information see Page72').
It's also worth checking that there's a valid page number before your first word.
So putting that all together your code should be structured something like the following:
Map<String, Set<Integer>> index = new TreeMap<>();
int currentPage = -1;
String currentLine;
while ((currentLine = reader.readLine()) != null) {
    if (isPage(currentLine)) {
        currentPage = getPageNum(currentLine);
    } else {
        assert currentPage > 0;
        for (String word : words(currentLine)) {
            if (!index.containsKey(word))
                index.put(word, new TreeSet<>());
            index.get(word).add(currentPage);
        }
    }
}
I've separated out the methods words, isPage and getPageNum, but you seem to have working code for all of those.
I've also changed the List of pages to a Set to reflect the fact that you only want a word-page reference once in the index.
To get an ordered list of all pages from the index use:
index.values().stream()
        .flatMap(Set::stream).distinct().sorted()
        .collect(Collectors.toList());
That's assuming Java 8, but it's not too hard to convert if you don't have streams.
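For example, a pre-Java 8 equivalent could look like this (a minimal sketch, assuming the index map built above):
Set<Integer> allPages = new TreeSet<Integer>();
for (Set<Integer> pages : index.values()) {
    allPages.addAll(pages); // TreeSet keeps the pages distinct and sorted
}
List<Integer> sortedPages = new ArrayList<Integer>(allPages);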
If you are going to generate a reverse index (pages to words) then for efficiency reasons you should probably create the reverse map (Map<Integer, List<String>>) as you are processing the words.
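As a rough sketch (assuming the same reading loop as above), the reverse map can be filled in at the point where each word is indexed:
Map<Integer, List<String>> reverseIndex = new TreeMap<>();
// inside the else-branch of the reading loop, next to the index update:
for (String word : words(currentLine)) {
    if (!reverseIndex.containsKey(currentPage))
        reverseIndex.put(currentPage, new ArrayList<>());
    reverseIndex.get(currentPage).add(word); // may contain duplicates per page
}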

You should try something like this. I'm not totally sure how you're using the pages, but this code will check whether library contains the word (like you already have); if it does not, it creates the list, and otherwise it adds the page number to the list for that word.
if (!library.containsKey(word) && !word.startsWith("Page")) {
    library.put(word, new LinkedList<Integer>());
} else {
    // the list is already in the map, so just append to it
    library.get(word).add(page);
}

Your problem seems to be in this piece of logic:
if (tokens[0].startsWith("Page")
&& library.containsKey(word)) {
Clearly you are adding page numbers only when the line starts with "Page"; otherwise the logic inside the if condition is not executed, so you never update the page numbers for any word.
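For instance, a minimal restructuring along those lines (a sketch reusing the question's getNum and library, and assuming a currentPage variable initialized before the loop):
if (tokens[0].startsWith("Page")) {
    currentPage = getNum(tokens[0]); // remember which page we are on
} else {
    for (String word : tokens) {
        if (!library.containsKey(word)) {
            library.put(word, new LinkedList<Integer>());
        }
        library.get(word).add(currentPage); // record the page for every word
    }
}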


Count occurrences in 2D Array

I'm trying to count the occurrences per line from a text file containing a large number of codes (numbers).
Example of text file content:
9045,9107,2376,9017
2387,4405,4499,7120
9107,2376,3559,3488
9045,4405,3559,4499
I want to compare a similar set of numbers that I get from a text field, for example:
9107,4405,2387,4499
The only result I'm looking for is whether a line contains more than 2 of the numbers from the text field. So in this case it will be true, because:
9045,9107,2376,9017 - false (1)
2387,4405,4499,7120 - true (3)
9107,2387,3559,3488 - false (2)
9045,4425,3559,4490 - false (0)
From what I understand, the best way to do this, is by using a 2d-array, and I've managed to get the file imported successfully:
Scanner in = null;
try {
    in = new Scanner(new File("areas.txt"));
} catch (FileNotFoundException ex) {
    Logger.getLogger(NewJFrame.class.getName()).log(Level.SEVERE, null, ex);
}
List<String[]> lines = new ArrayList<>();
while (in.hasNextLine()) {
    String line = in.nextLine().trim();
    String[] splitted = line.split(", ");
    lines.add(splitted);
}
String[][] result = new String[lines.size()][];
for (int i = 0; i < result.length; i++) {
    result[i] = lines.get(i);
}
System.out.println(Arrays.deepToString(result));
The result I get:
[[9045,9107,2376,9017], [2387,4405,4499,7120], [9107,2376,3559,3488], [9045,4405,3559,4499], [], []]
From here I'm a bit stuck on checking the codes individually per line. Any suggestions or advice? Is the 2d-array the best way of doing this, or is there maybe an easier or better way of doing it?
The expected number of inputs defines the type of searching algorithm you should use.
If you aren't searching through thousands of lines then a simple algorithm will do just fine. When in doubt favour simplicity over complex and hard to understand algorithms.
While it is not an efficient algorithm, in most cases a simple nested for-loop will do the trick.
A simple implementation would look like this:
final int FOUND_THRESHOLD = 2;
String[] comparedCodes = {"9107", "4405", "2387", "4499"};
String[][] allInputs = {
    {"9045", "9107", "2376", "9017"}, // This should not match
    {"2387", "4405", "4499", "7120"}, // This should match
    {"9107", "2376", "3559", "3488"}, // This should not match
    {"9045", "4405", "3559", "4499"}, // This should match
};
List<String[]> results = new ArrayList<>();
for (String[] input : allInputs) {
    int numFound = 0;
    // Compare the codes
    for (String code : input) {
        for (String c : comparedCodes) {
            if (code.equals(c)) {
                numFound++;
                break; // Breaking out here prevents unnecessary work
            }
        }
        if (numFound >= FOUND_THRESHOLD) {
            results.add(input);
            break; // Breaking out here prevents unnecessary work
        }
    }
}
for (String[] result : results) {
    System.out.println(Arrays.toString(result));
}
which provides us with the output:
[2387, 4405, 4499, 7120]
[9045, 4405, 3559, 4499]
To expand on my comment, here's a rough outline of what you could do:
String textFieldContents = ... // get it
// build a set of the user input by splitting at commas;
// a stream is used to be able to trim the elements before collecting them into a set
Set<String> userInput = Arrays.stream(textFieldContents.split(","))
        .map(String::trim).collect(Collectors.toSet());
// stream the lines in the file
List<Boolean> matchResults = Files.lines(Path.of("areas.txt"))
        // map each line to true/false
        .map(line -> {
            // split the line and stream the parts
            return Arrays.stream(line.split(","))
                    // trim each part
                    .map(String::trim)
                    // select only those contained in the user input set
                    .filter(part -> userInput.contains(part))
                    // count matching elements and return whether there are more than 2 or not
                    .count() > 2L;
        })
        // collect the results into a list; each element position corresponds to the zero-based line number
        .collect(Collectors.toList());
If you need to collect the matching lines instead of a flag per line you could replace map() with filter() (same content) and change the result type to List<String>.
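For instance, a minimal sketch of that variant (same userInput set and file as above):
List<String> matchingLines = Files.lines(Path.of("areas.txt"))
        .filter(line -> Arrays.stream(line.split(","))
                .map(String::trim)
                .filter(userInput::contains)
                .count() > 2L)
        .collect(Collectors.toList());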

How to remove line breaks and empty lines from String

I am trying to run a MapReduce job on Hadoop which reads the fifth entry of a tab-delimited file (the fifth entry is a user review) and then does some sentiment analysis and word counting on it.
However, as you know with user reviews, they usually include line breaks and empty lines. My code iterates through the words of each review to find keywords and checks sentiment if a keyword is found.
The problem is that as the code iterates through the review, it gives me an ArrayIndexOutOfBoundsException because of these line breaks and empty lines in a review.
I have tried using replaceAll("\r", " ") and replaceAll("\n", " ") to no avail.
I have also tried:
if (tokenizer.countTokens() == 2) {
    word.set(tokenizer.nextToken());
} else {
}
also to no avail. Below is my code:
public class KWSentiment_Mapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    ArrayList<String> keywordsList = new ArrayList<String>();
    ArrayList<String> posWordsList = new ArrayList<String>();
    ArrayList<String> tokensList = new ArrayList<String>();
    Text word = new Text();                  // output key
    IntWritable one = new IntWritable(1);    // output value

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String[] line = value.toString().split("\t");
        String Review = line[4].replaceAll("[\\-\\+\\\\)\\.\\(\"\\{\\$\\^:,]", "").toLowerCase();
        StringTokenizer tokenizer = new StringTokenizer(Review);
        while (tokenizer.hasMoreTokens()) {
            // 1- read the review line and store the tokens in an arraylist,
            // 2- iterate through the review to check for a keyword; if found,
            // 3- check if there's a posWord near it (up to +3 and -2),
            // 4- set word & context.write, 5- clear the review token arraylist
            String CompareString = tokenizer.nextToken();
            tokensList.add(CompareString);
        }
        for (int i = 0; i < tokensList.size(); i++) {
            boolean flag = false;
            for (int j = 0; j < keywordsList.size(); j++) {
                if (tokensList.get(i).startsWith(keywordsList.get(j))) {
                    for (int e = Math.max(0, i - 2); e < Math.min(tokensList.size(), i + 4); e++) {
                        if (posWordsList.contains(tokensList.get(e))) {
                            word.set(keywordsList.get(j));
                            context.write(word, one);
                            flag = true;
                            break; // breaks out of the e loop
                        }
                    }
                }
                if (flag)
                    break; // breaks out of the j loop
            }
        }
        tokensList.clear();
    }
}
Expected results are such that:
Take these two cases of reviews where error occurs:
Case 1: "Beautiful and spacious!
I highly recommend this place and great host."
Case 2: "The place in general was really silent but we didn't feel stayed.
Aside from this, the bathroom is big and the shower is really nice but there problem. "
The system should read the whole review as one line and iterate through the words in it. However, it just stops when it finds a line break or an empty line, as in case 2.
Case 1 should be read such as: "Beautiful and spacious! I highly recommend this place and great host."
Case 2 should be:"The place in general was really silent but we didn't feel stayed. Aside from this, the bathroom is big and the shower is really nice but there problem. "
I am running out of time and would really appreciate help here.
Thanks!
So, I hope I am understanding what you are trying to do....
If I am reading what you have above correctly, the value of 'value' passed into your map function above contains the delimited value that you would like to parse the user reviews out of. If that is the case, I believe we can make use of the escaping functionality in the opencsv library using tabs as your delimiting character instead of commas to correctly populate the user review field:
http://opencsv.sourceforge.net
In this example we are reading one line from the input that is passed in and parsing it into 'columns' based on the tab character, placing the results in the 'nextLine' array. This will allow us to use the escaping functionality of the CSVReader without reading an actual file, instead using the value of the text passed into your map function.
StringReader reader = new StringReader(value.toString());
CSVReader csvReader = new CSVReader(reader, '\t', '\"', '\\', 0);
String[] nextLine = csvReader.readNext();
if (nextLine != null && nextLine.length >= 5) {
    // Do some stuff
}
In the example that you pasted above, I think even the split("\t") will be problematic, as tabs within a user review would split it into extra fields, in addition to new lines being treated as new records. But both of these characters are legal as long as they are inside a quoted value (as they should be in a properly escaped file, and as they are in your example). CSVReader should handle all of these.
Validate each line at the start of the map method, so that you know line[4] exists and isn't null.
if (value == null || value.toString() == null) {
    return;
}
String[] line = value.toString().split("\t");
if (line == null || line.length < 5 || line[4] == null) {
    return;
}
As for line breaks, you'll need to show some sample input. By default MapReduce passes each line into the map method independently, so if you do want to read multiple lines as one message, you'll have to write a custom InputSplit, or pre-format your data so that all data for each review is on the same line.
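If you take the pre-formatting route, here is a minimal sketch for collapsing the line breaks inside an already-isolated review string (review is a hypothetical variable; \R matches any line-break sequence since Java 8):
String flattened = review.replaceAll("\\R+", " ").trim();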

Splitting a line and filling an array skipping blank values in Java

I have an array of lines, which looks somewhat like below.
Here's an example:
A-NUMBER ROUTINF ACO AO L MISCELL
0-0 0 1-20
0-00
0-01 FDS 3-20
0-02 6 7 3-20
0-03 4 3-20
1-0 F=PRE
ANT=3
NAPI=1
1-1 F=PRE
ANT=3
I need to parse each line column by column, skipping the columns which have blank values, and create a new line like below:
ANUM = 0-0, ACO=0, L=1-20;
ANUM = 0-00;
ANUM = 0-01, ROUTINF=FDS, L=3-20;
ANUM = 0-02, ACO=6, AO=7, L=3-20;
ANUM = 0-03, AO=4, L=3-20;
ANUM = 1-0, F=PRE, ANT=3, NAPI=1;
ANUM = 1-1, F=PRE, ANT=3;
I can split the line but my code can't remember which column the value belongs to and when to skip the values.
String[] splitted = null;
for (Integer i = 0; i < lines.size(); i++) {
    splitted = lines.get(i).split("\\s+");
    for (String str : splitted)
        if (!(splitted.length == 1)) {
            anum = splitted[0];
            routinf = splitted[1];
            aco = splitted[2];
            ao = splitted[3];
            l = splitted[4];
        } else {
            miscell = splitted[0];
        }
}
The columns in your file seem to be of fixed length (I don't see any other way to distinguish each column). If that is the case, then I would recommend using substring(start, end) instead of split.
Create a class to hold one single record.
class Record {
    String aNumber;
    List<String> routingf, aco, ao, l, miscell;

    public Record(String aNumber) {
        this.aNumber = aNumber;
        this.routingf = new ArrayList<>();
        // init other lists like above ...
    }

    public void addRoutingf(String routingf) {
        // add only if not null and not empty when trimmed
        if (routingf != null && routingf.trim().length() > 0) {
            this.routingf.add(routingf);
        }
    }

    // implement add-methods for other lists like above ...
}
While parsing each line remember the last created record. If in the actual line A-NUMBER is empty then use the last created record to store the values, otherwise create a new record and remember it as last/actual so you can use it for the upcoming lines if necessary.
Save all records in a list:
List<Record> records = new ArrayList<>();
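Putting it together, a rough parsing-loop sketch (the column offsets (0, 12) and (12, 20) are hypothetical and must be adjusted to the real file layout; safeSub is a small helper defined below):
Record current = null;
for (String line : lines) {
    String aNumber = safeSub(line, 0, 12).trim();
    if (!aNumber.isEmpty()) {
        current = new Record(aNumber); // a new A-NUMBER starts a new record
        records.add(current);
    }
    if (current != null) {
        current.addRoutingf(safeSub(line, 12, 20));
        // ... call the other add-methods for the remaining columns
    }
}

static String safeSub(String s, int start, int end) {
    // substring that tolerates lines shorter than the column boundary
    if (start >= s.length()) return "";
    return s.substring(start, Math.min(end, s.length()));
}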
What is the common separator? Just split on that... Your + at the moment will consume any amount of white space. \s{1,4} will limit it to between 1 and 4 characters. Find the right numbers for your data.
If your input uses exactly one separator character (for instance a tab) between columns, your code is almost OK:
String[] splitted = null;
for (Integer i = 0; i < lines.size(); i++) {
    splitted = lines.get(i).split("\\s");
    if (!(splitted.length == 1)) {
        anum = splitted[0];
        routinf = splitted[1];
        aco = splitted[2];
        ao = splitted[3];
        l = splitted[4];
    } else {
        miscell = splitted[0];
    }
}
// print only non-empty fields
Please note the removal of the unnecessary for loop and the change of the split pattern from \s+ to \s.
Just a thought, but you could also experiment with keeping the empty values in the split result to help determine which column a value belongs to.
lines.get(i).split(yourDelimiter, -1);
It's hard to tell if this helps without knowing exactly what your origin files look like, but you could give it a try.
e.g. if the values are always at a certain point in the splitted string with whitespaces, you could easily tell which column it belongs to and extract them.
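For example (a small sketch assuming a single-space separator, which you would need to verify against the real data):
String row = "0-02  6 7 3-20"; // hypothetical row with one blank column
String[] cols = row.split(" ", -1);
// the limit of -1 keeps trailing empty strings as well, so cols[i] still
// lines up with column i and "" marks a blank column: [0-02, , 6, 7, 3-20]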

String split and compare - fastest method

I have a string like:
1,2,3:3,4,5
The string on the left side of the delimiter needs to be compared to the string on the right side of the delimiter (:). By compare, I actually mean finding whether the elements in the right part (3,4,5) are present in the elements of the left part (1,2,3). The right part can contain duplicates and that's fine (evidently meaning I cannot use a HashSet). I've accomplished this (details below) but I need the fastest way to split and compare the above mentioned strings.
This is purely a performance based question to find out which method can be faster since the actual input that I will be using is huge (on either side). There would be only a single line and it will be read through stdin.
How I've accomplished this:
Read stdin.
Split using string.split and store the left part in a HashSet.
Store the right part in an ArrayList.
Iterate through the array list use contains() to check if the element is present in the HashSet.
Read the input into a byte[] array so that the read pointer stays on the side of your code.
Read byte by byte, computing integer elements on the way:
int b = inputBytes[p++];
int d = b - '0';
if (0 <= d) {
    if (d <= 9) {
        element = element * 10 + d;
    } else {
        // b == ':'
    }
} else {
    // b == ','
    // add element to the hash; element = 0;
    ...
}
if (p == inputBytesLength) {
    inputBytesLength = in.read(inputBytes);
    if (inputBytesLength == 0) { ... }
    p = 0;
}
Use an int[] whose length is a sufficiently big power of two as the hash:
// as add()
int h = element * 0x9E3779B9;
int i = h >>> (32 - hashSizePower);
while (hash[i] != 0) {
    if (--i < 0) i += hashSize;
}
hash[i] = element;
// contains() similarly
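For completeness, a sketch of the matching contains() under the same assumptions (hashSize == 1 << hashSizePower, and 0 is not a valid element):
boolean contains(int element) {
    int h = element * 0x9E3779B9;
    int i = h >>> (32 - hashSizePower);
    while (hash[i] != 0) {
        if (hash[i] == element) return true;
        if (--i < 0) i += hashSize;
    }
    return false;
}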
Assuming a line of input fits in JVM heap, three common approaches to parsing strings from input in Java are:
java.util.Scanner
java.io.BufferedReader#readLine & java.util.StringTokenizer
java.io.BufferedReader#readLine & java.lang.String#split
It wasn’t obvious to me which approach was best for this problem, so I decided to try it out. I generated test data, implemented a parser for each approach, and timed the results.
Test Data
I generated 4 files of test data:
testdata_1k.txt - size 20KB
testdata_10k.txt - size 205KB
testdata_100k.txt - size 2MB
testdata_1000k.txt - size 20MB
The files I generated matched the format you described. Each , delimited element is a random integer. The number in the file name describes the number of elements on each side of the :. For example, testdata_1k.txt has 1,000 elements on the left and 1,000 elements on the right.
Test Code
Here's the code I used to test each approach. Please note, these are not examples of production quality code.
Scanner Code
public Map<String, Boolean> scanner(InputStream stream) {
    final Scanner in = new Scanner(new BufferedInputStream(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
    final HashSet<String> left = new HashSet<String>();
    in.useDelimiter(",");
    boolean leftSide = true;
    while (in.hasNext()) {
        String token = in.next();
        if (leftSide) {
            int delim = token.indexOf(':');
            if (delim >= 0) {
                left.add(token.substring(0, delim));
                String rightToken = token.substring(delim + 1, token.length());
                result.put(rightToken, left.contains(rightToken));
                leftSide = false;
            } else {
                left.add(token);
            }
        } else {
            result.put(token, left.contains(token));
        }
    }
    return result;
}
StringTokenizer Code
public Map<String, Boolean> stringTokenizer(InputStream stream) throws IOException {
    final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
    final StringTokenizer lineTokens = new StringTokenizer(in.readLine(), ":");
    final HashSet<String> left = new HashSet<String>();
    if (lineTokens.hasMoreTokens()) {
        final StringTokenizer leftTokens = new StringTokenizer(lineTokens.nextToken(), ",");
        while (leftTokens.hasMoreTokens()) {
            left.add(leftTokens.nextToken());
        }
    }
    if (lineTokens.hasMoreTokens()) {
        final StringTokenizer rightTokens = new StringTokenizer(lineTokens.nextToken(), ",");
        while (rightTokens.hasMoreTokens()) {
            String token = rightTokens.nextToken();
            result.put(token, left.contains(token));
        }
    }
    return result;
}
String.split Code
public Map<String, Boolean> split(InputStream stream) throws IOException {
    final BufferedReader in = new BufferedReader(new InputStreamReader(stream));
    final HashMap<String, Boolean> result = new HashMap<String, Boolean>();
    final String[] splitLine = in.readLine().split(":");
    final HashSet<String> left = new HashSet<String>(Arrays.asList(splitLine[0].split(",")));
    for (String element : splitLine[1].split(",")) {
        result.put(element, left.contains(element));
    }
    return result;
}
Timing
I ran each approach 6 times against each file. I threw the first sample out. The following represents the average of the remaining 5 samples.
Scanner
testdata_1k.txt - 23.2948 millis
testdata_10k.txt - 39.5036 millis
testdata_100k.txt - 240.5626 millis
testdata_1000k.txt - 2671.5132 millis
StringTokenizer
testdata_1k.txt - 31.2344 millis
testdata_10k.txt - 14.7926 millis
testdata_100k.txt - 102.6412 millis
testdata_1000k.txt - 1353.073 millis
String.split
testdata_1k.txt - 8.9596 millis
testdata_10k.txt - 7.8396 millis
testdata_100k.txt - 63.4854 millis
testdata_1000k.txt - 947.8384 millis
Conclusion
Assuming your data fits in JVM heap, it’s hard to beat the parsing speed of String.split compared to StringTokenizer and Scanner.

Print even and odd lines from file

I am trying to read from a file, then print the even-numbered lines first, followed by the odd-numbered lines. Is it best to read the lines and store them in one list for even and another for odd, then print each? Or is there a more efficient way around this?
The snippet of code below is the method in which I am doing this sorting... As of now, it simply stores the input into a list and prints it. Is there an efficient way to print even-numbered lines followed by odd-numbered lines?
public static void test(BufferedReader r, PrintWriter w) throws IOException {
    ArrayList<String> s = new ArrayList<String>();
    String line;
    int n = 0;
    while ((line = r.readLine()) != null) {
        s.add(line);
        n++;
    }
    Iterator<String> i = s.iterator();
    while (i.hasNext()) {
        w.println(i.next());
    }
}
thanks in advance for any help/input!
Well, your best bet is to print the even lines as you read them, and store the odd lines for later printing.
ArrayList<String> s = new ArrayList<String>();
String line;
int n = 0;
while ((line = r.readLine()) != null) {
    if (n % 2 == 0) {
        s.add(line);
    } else {
        w.println(line);
    }
    n++;
}
Iterator<String> i = s.iterator();
while (i.hasNext()) {
    w.println(i.next());
}
That will halve the amount of space required. Another option might be to print the odd lines to a string, then print that value to the output stream - that might be more efficient for shorter inputs.
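A rough sketch of that variant (assuming the same r, w and counter as above, where n % 2 == 0 marks the lines to defer):
StringBuilder deferred = new StringBuilder();
String line;
int n = 0;
while ((line = r.readLine()) != null) {
    if (n % 2 == 0) {
        deferred.append(line).append(System.lineSeparator()); // buffer these lines
    } else {
        w.println(line); // print the others immediately
    }
    n++;
}
w.print(deferred); // then flush the buffered lines in order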
You can change your loop as follows:
while (i.hasNext()) {
    String odd = i.next();
    if (i.hasNext()) {
        String even = i.next();
        w.println(even);
        w.println(odd);
    } else {
        w.println(odd);
    }
}
For small files what you're doing is fine - just iterate over your list twice printing alternate lines, evens on the first pass, odds on the second.
For large files, read the file twice and print alternate lines as before. What's a large file? That's system dependent.
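A minimal two-pass sketch of the small-file approach, over the in-memory list s from the question (counting the first line as line 1):
for (int k = 1; k < s.size(); k += 2) {
    w.println(s.get(k)); // even-numbered lines (2, 4, ...)
}
for (int k = 0; k < s.size(); k += 2) {
    w.println(s.get(k)); // odd-numbered lines (1, 3, ...)
}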
