How to get the positions of all matches in a String? - java

I have a text document and a query (the query may be more than one word). I want to find the positions of all occurrences of the query in the document.
I thought of using documentText.indexOf(query) or a regular expression, but I could not make either work.
I ended up with the following method.
First, I created a data type called QueryOccurrence:
public class QueryOccurrence implements Serializable{
public QueryOccurrence(){}
private int start;
private int end;
public QueryOccurrence(int nameStart,int nameEnd,String nameText){
start=nameStart;
end=nameEnd;
}
public int getStart(){
return start;
}
public int getEnd(){
return end;
}
public void SetStart(int i){
start=i;
}
public void SetEnd(int i){
end=i;
}
}
Then, I have used this datatype in the following method:
public static List<QueryOccurrence>FindQueryPositions(String documentText, String query){
// Normalize do the following: lower case, trim, and remove punctuation
String normalizedQuery = Normalize.Normalize(query);
String normalizedDocument = Normalize.Normalize(documentText);
String[] documentWords = normalizedDocument.split(" ");
String[] queryArray = normalizedQuery.split(" ");
List<QueryOccurrence> foundQueries = new ArrayList<>();
QueryOccurrence foundQuery = new QueryOccurrence();
int index = 0;
for (String word : documentWords) {
if (word.equals(queryArray[0])){
foundQuery.SetStart(index);
}
if (word.equals(queryArray[queryArray.length-1])){
foundQuery.SetEnd(index);
if((foundQuery.getEnd()-foundQuery.getStart())+1==queryArray.length){
//add the found query to the list
foundQueries.add(foundQuery);
//flush the foundQuery variable to use it again
foundQuery= new QueryOccurrence();
}
}
index++;
}
return foundQueries;
}
This method returns a list of all occurrences of the query in the document, each with its position.
Could you suggest an easier and faster way to accomplish this task?
Thanks

Your first approach was a good idea, but String.indexOf does not support regular expressions.
An easier way that follows a similar approach, but in two steps (compile a pattern, then iterate over its matches), is as follows:
List<Integer> positions = new ArrayList<>();
Pattern p = Pattern.compile(queryPattern); // insert your pattern here
Matcher m = p.matcher(documentText);
while (m.find()) {
positions.add(m.start());
}
Here positions holds the start offsets of all the matches.
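If the query is plain text rather than a pattern, the same loop can also report where each match ends. The following is a minimal sketch, not the asker's method: the class name MatchPositions and the choice to match case-insensitively are assumptions made for illustration. Pattern.quote is used so that any regex metacharacters in the query are matched literally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
final class MatchPositions {
    // Returns {start, end} pairs (end exclusive) for every non-overlapping
    // occurrence of the literal query in text, ignoring case.
    static List<int[]> findAll(String text, String query) {
        List<int[]> spans = new ArrayList<>();
        if (query.isEmpty()) {
            return spans; // an empty query would match at every offset
        }
        // Pattern.quote makes regex metacharacters in the query match literally
        Pattern p = Pattern.compile(Pattern.quote(query), Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        while (m.find()) {
            spans.add(new int[] { m.start(), m.end() });
        }
        return spans;
    }
}
```

For example, MatchPositions.findAll("The cat and the hat", "the") yields two spans, from 0 to 3 and from 12 to 15. These are character offsets, not word indexes as in the question's own method.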

Related

Return the first index from arraylist where string was found logic confusion

I am trying to write the following method and am having a hard time understanding the logic. These are the requirements for the method:
public int search(String str) – search the list for parameter str.
Searches should work regardless of case. For example, “TOMATO” is
equivalent to “tomato.”
Hint: the String class has a method called
equalsIgnoreCase. If the string str appears more than once in the
ArrayList, return the first index where the string str was found or
return -1 if the string str was not found in the ArrayList.
This is what I have so far; I am not sure if this is the right way to do it. My ArrayList is named words.
To solve this, I am thinking of using a for-each statement to iterate through the ArrayList, then an if to check whether the words match, and then returning the index of the match, but I am getting an error. My other confusion is how to return only the first index. Maybe I am doing this wrong. Any help or direction is appreciated.
public int search(String str)
{
for(String s : words)
if(s.contains(s.equalsIgnoreCase(str)))
return s.get(s.equalsIgnoreCase(str));
}
The first answer unnecessarily has to search through the list of words to find the index once it has determined that the word is in the list. The code should be able to already know the index. This is the more efficient approach:
public int search(String str) {
int i = 0;
for (String s : words) {
if (s.equalsIgnoreCase(str))
return i;
i++;
}
return -1;
}
There is also the more classic approach, the way it might have been done before the enhanced for loop was added to the Java language:
public int search(String str) {
for (int i = 0; i < words.size(); i++)
if (words.get(i).equalsIgnoreCase(str))
return i;
return -1;
}
You actually overcomplicated it a little bit:
public int search(String str) {
for(String s : words) {
if(s.equalsIgnoreCase(str)) {
return words.indexOf(s);
}
}
return -1;
}
Since the return statement exits the method immediately, it will always return the index of the first matching word.
You can also use a stream to solve this problem (note that this version only reports whether a match exists, not its index):
public boolean search(List<String> words, String wordToMatch)
{
Predicate<String> equalityPred = s -> s.equalsIgnoreCase(wordToMatch);
return words.stream().anyMatch(equalityPred);
}
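The stream version above returns a boolean rather than the index the exercise asks for. A minimal sketch of an index-returning variant follows; the class name WordSearch is made up for illustration, and the list is passed in as a parameter instead of being read from a words field. It streams over the indexes rather than the elements, so the first matching index is available directly.

```java
import java.util.List;
import java.util.stream.IntStream;

// Hypothetical helper, named here only for illustration.
final class WordSearch {
    // Returns the first index whose element equals str ignoring case, or -1.
    static int search(List<String> words, String str) {
        return IntStream.range(0, words.size())
                .filter(i -> words.get(i).equalsIgnoreCase(str))
                .findFirst()   // range() is ordered, so this is the lowest index
                .orElse(-1);   // no element matched
    }
}
```

With the list ["apple", "Tomato", "TOMATO"], searching for "tomato" returns 1, and searching for a word that is not in the list returns -1.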

Working with Substring In JAVA from right hand direction

Is it possible to get a substring from the right-hand (reverse) direction using substring() in Java?
Example:
Suppose String S="abcdef".
Can I get the substring "fedc" using S.substring(S.length()-1,3)?
If that is not correct, please suggest how to get a substring from the right-hand end (reverse direction).
You could reverse the string and then use substring. Unfortunately String has no reverse method, but StringBuilder does, e.g.
new StringBuilder("abcdef").reverse().toString().substring(0,4);
You can reverse the string and find the substring
// reverse
String s = "abcdef";
StringBuilder builder = new StringBuilder(s);
String substring = builder.reverse().substring(0,3);
Java doesn't support extension methods like C# does, so I would build a function for this. This way you can control how much of the reverse substring you want with a parameter.
public class StackOverflow {
public static void main(String[] args) {
String data = "abcdef";
for (int i = 0; i < data.length(); i++) {
System.out.println(reverseSubstring(data, i+1));
}
}
public static String reverseSubstring(String data, int length) {
return new StringBuilder(data).reverse().substring(0, length);
}
}
Result:
f
fe
fed
fedc
fedcb
fedcba
UPDATE
Another approach is to create a wrapper class to String. This way you can call a class method like how you're asking in your question with the example S.substring(S.length()-1,3). This will also allow you to still have all the String methods after using the wrapper's get() method.
String Wrapper
public class MyString {
private String theString;
public MyString(String s) {
theString = s;
}
public String get() {
return theString;
}
public String reverseSubstring(int length) {
return new StringBuilder(theString).reverse().substring(0, length);
}
}
Usage
public class StackOverflow {
public static void main(String[] args) {
MyString data = new MyString("abcdef");
for (int i = 0; i < data.get().length(); i++) {
System.out.println(data.reverseSubstring(i+1));
}
}
}
Results:
f
fe
fed
fedc
fedcb
fedcba

Java - How to measure a Matcher processing

Suppose that I have the brilliant idea of writing an HTML link tag parser in order to explore the internet, and I use a regex to parse and capture each occurrence of a link in a page. This code currently works fine, but I would like to add some members that reflect the "operation status".
public class LinkScanner {
private static final Pattern hrefPattern = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");
public Collection<String> scan(String html) {
ArrayList<String> links = new ArrayList<>();
Matcher hrefMatcher = hrefPattern.matcher(html);
while (hrefMatcher.find()) {
String link = hrefMatcher.group(1);
links.add(link);
}
return links;
}
}
How can I measure this process?
For example, consider this hypothetical measurement implementation:
public class LinkScannerWithStatus {
private int matched;
private int total;
public Collection<String> scan(String html) {
ArrayList<String> links = new ArrayList<>();
Matcher hrefMatcher = hrefPattern.matcher(html);
total = hrefMatcher.getFindCount(); // Assume getFindCount exists
while (hrefMatcher.find()) {
String link = hrefMatcher.group(1);
links.add(link);
matched++; // assume is a linear measurement mechanism
}
return links;
}
}
I don't know where to start.. I don't even know if the conjunction "Matcher processing" is grammatically valid :S
Unfortunately Matcher doesn't have a listener interface to measure progress. It would probably be prohibitively expensive to have one.
If you have the full page as String instance then you can use region to select regions of the page. You can use this to scan these regions in sequence. Then you can report to the user which part you are currently scanning. You may have to backtrack a bit to allow overlap of the regions.
You could optimize if you backtrack by using hitEnd to check if a match was ongoing. If it wasn't then you don't need to backtrack.
One problem is that URLs are not really limited in size, so you need to decide what size of URL you care to support.
If you create a good regular expression then you should not really have to report back the progress, unless you are processing truly huge files. Even in that case the I/O should have more overhead than the scanning for HTML anchors.
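For a rough progress figure without splitting the page into regions, one option is to look at how far into the input the last match ended. This is a simpler technique than the region-based scan described above, and it only updates when a match is found, so a long stretch without links reports nothing. In this sketch the class name ProgressLinkScanner and the onPercent callback are made up for illustration; Matcher.end() is the standard method that returns the offset just past the current match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the question's LinkScanner.
final class ProgressLinkScanner {
    private static final Pattern HREF =
            Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");

    // Collects link targets and reports an approximate percentage after each match.
    static List<String> scan(String html, IntConsumer onPercent) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
            // m.end() is the offset just past the current match;
            // html cannot be empty here because a match was found
            onPercent.accept((int) (100L * m.end() / html.length()));
        }
        onPercent.accept(100); // the rest of the input contained no more links
        return links;
    }
}
```

A caller could pass something like p -> System.out.println(p + "%") as the callback.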
Performance and memory issues aside, you can use a DOM parser to evaluate the HTML, that way, while you walk the DOM you can perform a given action.
Another possibility is to interpret the given HTML as XML and use SAX. This is efficient but assumes a structure that may not be there.
As requested by Victor I'll post another answer. In this case CharSequence is implemented as a wrapper around another CharSequence. As the Matcher instance requests characters the CountingCharSequence reports to a listener interface.
It's slightly dangerous to do this as CharSequence.toString() method returns a true String instance which cannot be monitored. On the other hand, it seems that the current implementation is relatively simple to implement and it does work. toString() is called, but that seems to be to populate the groups when a match has been found. Better write some unit tests around it though.
Oh, and as I have to print the "100%" mark manually there is probably a rounding error or off-by-one error. Happy debugging :P
public class RegExProgress {
// the original LinkScanner provided by Victor
public static class LinkScanner {
private static final Pattern hrefPattern = Pattern.compile("<a\\b[^>]*href=\"(.*?)\".*?>(.*?)</a>");
public Collection<String> scan(CharSequence html) {
ArrayList<String> links = new ArrayList<>();
Matcher hrefMatcher = hrefPattern.matcher(html);
while (hrefMatcher.find()) {
String link = hrefMatcher.group(1);
links.add(link);
}
return links;
}
}
interface ProgressListener {
void listen(int characterOffset);
}
static class SyncedProgressListener implements ProgressListener {
private final int size;
private final double blockSize;
private final double percentageOfBlock;
private int block;
public SyncedProgressListener(int max, int blocks) {
this.size = max;
this.blockSize = (double) size / (double) blocks - 0.000_001d;
this.percentageOfBlock = (double) size / blockSize;
this.block = 0;
print();
}
public synchronized void listen(int characterOffset) {
if (characterOffset >= blockSize * (block + 1)) {
this.block = (int) ((double) characterOffset / blockSize);
print();
}
}
private void print() {
System.out.printf("%d%%%n", (int) (block * percentageOfBlock));
}
}
static class CountingCharSequence implements CharSequence {
private final CharSequence wrapped;
private final int start;
private final int end;
private ProgressListener progressListener;
public CountingCharSequence(CharSequence wrapped, ProgressListener progressListener) {
this.wrapped = wrapped;
this.progressListener = progressListener;
this.start = 0;
this.end = wrapped.length();
}
public CountingCharSequence(CharSequence wrapped, int start, int end, ProgressListener pl) {
this.wrapped = wrapped;
this.progressListener = pl;
this.start = start;
this.end = end;
}
@Override
public CharSequence subSequence(int start, int end) {
// this may not be needed, as charAt() has to be called eventually
System.out.printf("subSequence(%d, %d)%n", start, end);
int newStart = this.start + start;
// end is relative to this sequence, so it is offset by this.start as well
int newEnd = this.start + end;
progressListener.listen(newStart);
return new CountingCharSequence(wrapped, newStart, newEnd, progressListener);
}
@Override
public int length() {
System.out.printf("length(): %d%n", end - start);
return end - start;
}
@Override
public char charAt(int index) {
//System.out.printf("charAt(%d)%n", index);
int realIndex = start + index;
progressListener.listen(realIndex);
return this.wrapped.charAt(realIndex);
}
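@Override
public String toString() {
System.out.printf(" >>> toString() <<< [%d, %d)%n", start, end);
// return only the window this sequence covers, not the whole wrapped text
return wrapped.subSequence(start, end).toString();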
}
}
public static void main(String[] args) throws Exception {
LinkScanner scanner = new LinkScanner();
String content = new String(Files.readAllBytes(Paths.get("regex - Java - How to measure a Matcher processing - Stack Overflow.htm")));
SyncedProgressListener pl = new SyncedProgressListener(content.length(), 10);
CountingCharSequence ccs = new CountingCharSequence(content, pl);
Collection<String> urls = scanner.scan(ccs);
// OK, I admit, this is because of an off-by one error
System.out.printf("100%% - %d%n", urls.size());
}
}
So, to measure your progress through a document, you want to find the total number of matches first; then, as you go match by match, you update the progress and add each link to the list of stored links.
You can count the total number of matches using StringUtils.countMatches from Apache Commons Lang:
int countMatches = StringUtils.countMatches(text, target);
So just look for the string "href", or the tag, or some other component of a link. That should give you a hopefully accurate picture of how many links there are, and then you can parse them one by one. It's not ideal, because countMatches does not accept a regex as the target parameter.
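Because the literal count only approximates the number of links, another way to get the total is a first counting pass with the same Pattern that will later be used for extraction. This is a sketch using only java.util.regex; the class name MatchCounter is made up for illustration, and the obvious cost is that the input is scanned twice.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
final class MatchCounter {
    // Counts the non-overlapping matches of pattern in text.
    static int count(Pattern pattern, CharSequence text) {
        Matcher m = pattern.matcher(text);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }
}
```

The result can be stored in the total field from the question before the extraction loop starts, so that matched / total gives the fraction of links processed so far.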

Java storing both line number and value from a file

I have a set of data that look like this.
1:2:3:4:5
6:7:8:9:10
I have managed to use an ArrayList to store the information, using ":" as the delimiter.
However, I would like to store the line number of each item together with it in the ArrayList.
class test
{
String items;
String linenumber;
}
Example:
test(1,1)
test(2,1)
test(6,2)
test(7,2)
Here is my current code.
Scanner fileScanner = new Scanner(new File(fname));
fileScanner.useDelimiter("\n");
int counter = 0; String scounter;
String test;
String events;
while(fileScanner.hasNext())
{
events = fileScanner.next();
scounter = Integer.toString(counter);
Base obj = new Base(scounter, events);
baseArrayList.add(obj);
}
fileScanner.close();
I have tried using the delimiter "\n" and then splitting the string, but it was not very successful.
Any advice would be appreciated.
public void Base_Seperator()
{
String temp, temp2;
String[] split;
String days, events;
for(int i = 0; i < baseArrayList.size(); i++)
{
temp = baseArrayList.get(i).events;
temp2 = baseArrayList.get(i).days;
split = temp.split(":");
}
}
Although the code in @Alex's answer may solve your problem, your attempt is already close to what you want. Now you only need to create Test instances and store them in a container, usually a List. I'll add the necessary code, starting from yours:
//it is better to return the List instead of declaring it as a static field
public List<Test> Base_Seperator() {
//try to declare variables in the narrowest scope possible
//String temp, temp2;
//String[] split;
//String days, events;
//this variable must be recognized in all the paths of this method
List<Test> testList = new ArrayList<Test>();
for(int i = 0; i < baseArrayList.size(); i++) {
//these variables should only work within the for statement
String temp = baseArrayList.get(i).events;
String temp2 = baseArrayList.get(i).days;
String[] split = temp.split(":");
//you have split the String by :
//now you have every element between : as an item stored in split array
//go through each one and create a new Test instance
//first, let's create the lineNumber variable as String
String lineNumber = Integer.toString(i+1);
//using enhanced for to go through these elements
for (String value : split) {
//now, let's create Test instance
Test test = new Test(value, lineNumber);
//store the instance in testList
testList.add(test);
}
}
//now just return the list with the desired values
return testList;
}
Not part of your question, but some advice:
There are plenty of other ways to write code that achieves the same result (take @Alex's answer as an example). I didn't post any of them because it looks like you're in a learning phase, so it will be better for you to first achieve what you're looking for through your own effort (and a little help).
I'm not sure whether you're doing it, but you should not use raw types. That is, you should always provide a type argument when the class/interface is generic. For example, it is better to declare a variable as ArrayList<MyClass> myClassList rather than ArrayList myClassList, so that the class is parameterized and the compiler can help you avoid problems at runtime.
It is better to program to interfaces/abstract classes. This means declaring variables as an interface or abstract class rather than as the specific implementation class. This is the case for ArrayList and List:
List<String> stringList = new ArrayList<String>();
//above is better than
ArrayList<String> stringList2 = new ArrayList<String>();
In case you need to use a different implementation of the interface/abstract class, you will have to change the object initialization only (hopefully).
More info:
What is a raw type and why shouldn't we use it?
What does it mean to "program to an interface"?
Looks like you want to store days instead of lineNumber in your Test instances:
//comment this line
//Test test = new Test(value, lineNumber);
//use this one instead
Test test = new Test(value, days);
First of all you don't need to keep line number info in the test object because it can be inferred from the ArrayList that holds them. If you must though, it should be changed to an int. So,
class test
{
ArrayList<Integer> items;
int linenumber;
public test(int line, String[] input){
items=new ArrayList<>();
linenumber=line;
//populate with the line read by the Scanner
for(int i=0; i<input.length; i++)
items.add(Integer.parseInt(input[i]));
}
}
I use an ArrayList inside test because you don't know how many elements you'll be handling. Moving on to the scanner
Scanner fileScanner = new Scanner(new File(fname));
// fileScanner.useDelimiter("\n"); You don't need this!
String tmp[];
int line=0; //number of lines
while(fileScanner.hasNext()) {
line++;
//nextLine() returns the entire line, that's why you don't need useDelimiter()
//the line is then split on ':' as in your data; Pattern.quote is not strictly
//needed for ':' but guards against characters that are special in a regex
tmp=fileScanner.nextLine().split(Pattern.quote(":"));
baseArrayList.add(new test(line, tmp));
}
fileScanner.close();
Here I use test to store the objects you read, I'm not sure what Base is supposed to be.
You need a Java bean that holds the day and the item together. The following code reads the text file; each line is split into a List of items, which are then added to the dayItems collection along with their line number.
public class DayItem {
private int day;
private String item;
public int getDay() {
return day;
}
public void setDay(final int day) {
this.day = day;
}
public String getItem() {
return item;
}
public void setItem(final String item) {
this.item = item;
}
}
And the main code:
public class ReadFile {
private static final List<DayItem> dayItems = new ArrayList<DayItem>();
public static void main(String args[]) throws FileNotFoundException{
final BufferedReader bufferReader = new BufferedReader(new FileReader("items.txt"));
int lineNumber=0;
try
{
String currentLine;
while ((currentLine = bufferReader.readLine()) != null) {
lineNumber++;
List<String> todaysItems = Arrays.asList(currentLine.split(":"));
addItems(todaysItems,lineNumber);
}
} catch (IOException e) {
e.printStackTrace();
}
}
private static void addItems(final List<String> todaysItems,final int day){
int listSize = todaysItems.size();
for(int i=0;i<listSize;i++){
String item = todaysItems.get(i);
DayItem dayItem = new DayItem();
dayItem.setDay(day);
dayItem.setItem(item);
dayItems.add(dayItem);
}
}
}
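The core of all three answers is the same two nested loops: one over lines, one over the fields of each line. A compact sketch of just that step is shown below, assuming the lines have already been read into a List<String> and that every field is an integer, as in the sample data. The class name LineItems and the {value, lineNumber} pair layout are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, named here only for illustration.
final class LineItems {
    // Each int[] is {value, lineNumber}, with line numbers starting at 1.
    static List<int[]> parse(List<String> lines) {
        List<int[]> out = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            for (String field : lines.get(i).trim().split(":")) {
                if (!field.isEmpty()) { // skip blank lines and empty fields
                    out.add(new int[] { Integer.parseInt(field), i + 1 });
                }
            }
        }
        return out;
    }
}
```

For the two sample lines 1:2:3:4:5 and 6:7:8:9:10 this produces ten pairs, starting with {1, 1} and {2, 1}, and continuing on the second line with {6, 2} and {7, 2}, which matches the test(1,1), test(2,1), test(6,2), test(7,2) example in the question.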

How to iterate over regexp compliant strings

What is the easiest way to implement a class (in Java) that would serve as an iterator over the set of all values which conform to a given regexp?
Let's say I have a class like this:
public class RegexpIterator
{
private String regexp;
public RegexpIterator(String regexp) {
this.regexp = regexp;
}
public boolean hasNext() {
...
}
public String next() {
...
}
}
How do I implement it? The class assumes some linear ordering on the set of all conforming values and the next() method should return the i-th value when called for the i-th time.
Ideally the solution should support full regexp syntax (as supported by the Java SDK).
To avoid confusion, please note that the class is not supposed to iterate over matches of the given regexp over a given string. Rather it should (eventually) enumerate all string values that conform to the regexp (i.e. would be accepted by the matches() method of a matcher), without any other input string given as argument.
To further clarify the question, let's show a simple example.
RegexpIterator it = new RegexpIterator("ab?cd?e");
while (it.hasNext()) {
System.out.println(it.next());
}
This code snippet should have the following output (the order of lines is not relevant, even though a solution which would list shorter strings first would be preferred).
ace
abce
acde
abcde
Note that with some regexps, such as ab[A-Z]*cd, the set of values over which the class is to iterate is infinite. The preceding code snippet would run forever in these cases.
Do you need to implement a class? This pattern works well:
Pattern p = Pattern.compile("[0-9]+");
Matcher m = p.matcher("123, sdfr 123kjkh 543lkj ioj345ljoij123oij");
while (m.find()) {
System.out.println(m.group());
}
output:
123
123
543
345
123
For a more generalized solution:
public static List<String> getMatches(String input, String regex) {
List<String> retval = new ArrayList<String>();
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher(input);
while (m.find()) {
retval.add(m.group());
}
return retval;
}
which then can be used like this:
public static void main(String[] args) {
List<String> matches = getMatches("this matches _all words that _start _with an _underscore", "_[a-z]*");
for (String s : matches) { // List implements the Iterable interface
System.out.println(s);
}
}
which produces this:
_all
_start
_with
_underscore
more information about the Matcher class can be found here: http://docs.oracle.com/javase/6/docs/api/java/util/regex/Matcher.html
Here is another working example. It might be helpful :
public class RegxIterator<E> implements RegexpIterator {
private Iterator<E> itr = null;
public RegxIterator(Iterator<E> itr, String regex) {
ArrayList<E> list = new ArrayList<E>();
while (itr.hasNext()) {
E e = itr.next();
if (Pattern.matches(regex, e.toString()))
list.add(e);
}
this.itr = list.iterator();
}
@Override
public boolean hasNext() {
return this.itr.hasNext();
}
@Override
public String next() {
return this.itr.next().toString();
}
}
If you want to use it for other data types (Integer, Float, etc., or other classes where toString() is meaningful), declare next() to return Object instead of String. You may then be able to cast the return value back to the actual type.
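Neither snippet above enumerates the strings that a regexp accepts, which is what the question asks for: both filter input that already exists. A brute-force sketch of the enumeration itself is shown below. It assumes a caller-supplied finite alphabet and a maximum length, neither of which is part of the question, and the class name RegexEnumerator is made up for illustration. It generates every candidate string in order of increasing length and keeps those that Matcher.matches() accepts, so it is exponential in the maximum length and only practical for small inputs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, named here only for illustration.
final class RegexEnumerator {
    // Lists all strings over alphabet, up to maxLength characters long,
    // that the regex fully matches; shorter strings come first.
    static List<String> enumerate(String regex, String alphabet, int maxLength) {
        Pattern p = Pattern.compile(regex);
        List<String> result = new ArrayList<>();
        List<String> current = new ArrayList<>();
        current.add(""); // the only string of length 0
        for (int len = 0; len <= maxLength; len++) {
            for (String s : current) {
                if (p.matcher(s).matches()) {
                    result.add(s);
                }
            }
            if (len == maxLength) {
                break;
            }
            // extend every candidate by one character to get the next length
            List<String> next = new ArrayList<>();
            for (String s : current) {
                for (int i = 0; i < alphabet.length(); i++) {
                    next.add(s + alphabet.charAt(i));
                }
            }
            current = next;
        }
        return result;
    }
}
```

Calling RegexEnumerator.enumerate("ab?cd?e", "abcde", 5) returns ace, abce, acde, abcde, in that order. For a pattern with an infinite language, such as ab[A-Z]*cd, the length bound is what makes the call terminate.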
