Find closest string in Java - java

I am trying to find strings from one text file that are present in another. I have two text files, file1.txt and file2.txt, with the following contents:
file1.txt
Hello
Second Line
Text line
Final Line
file2.txt
Final Linee
Text llline
line 3 of file2
Helloo
The code I have is below:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Regex {
    public static void main(String[] args) throws IOException {
        BufferedReader inputFile = new BufferedReader(new FileReader("file1.txt"));
        String line;
        String pattern;
        while ((line = inputFile.readLine()) != null) {
            System.out.println(line);
            // file2.txt is re-read for every line of file1.txt
            BufferedReader patternsFile = new BufferedReader(new FileReader("file2.txt"));
            while ((pattern = patternsFile.readLine()) != null) {
                Pattern r = Pattern.compile(pattern);
                System.out.println(r);
                Matcher m = r.matcher(line);
                if (m.find()) {
                    System.out.println("Line corresponding to pattern in file1.txt : " + line);
                }
            }
        }
    }
}
However, the above code returns all the lines from file1.txt that match some pattern from file2.txt. What I want instead is to find the closest strings, within an edit distance of n letters. For example, if n=1, the output should be:
Hello
Final Line
and if n=2, it should output:
Hello
Final Line
Text line
I am starting out with Java, and have absolutely no experience with it. Therefore any and all help would be appreciated.
Thank you

Okay, I can give two tips.
First, you may want to look at Apache Lucene if you are writing a text analyser or something similar, or if you need strong matching features.
Second, if you are looking for something more minimal, you can implement a cosine similarity algorithm, which is really interesting and well worth a look.
You can then re-implement it and adapt it to your code.
You can find an implementation in Apache Commons Text.
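Since the question asks specifically for an edit distance of n letters, a Levenshtein distance check is another option alongside the two tips above. The following is a minimal sketch in plain Java with no external libraries. The class name `ClosestLines` and the methods `editDistance` and `withinDistance` are made up for illustration, and the sample lists stand in for the contents of file1.txt and file2.txt (they could equally be loaded with `Files.readAllLines`).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ClosestLines {

    // Dynamic-programming Levenshtein distance, keeping only two rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning an empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into an empty string takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns every entry of 'lines' that is within n edits of at least one candidate.
    static List<String> withinDistance(List<String> lines, List<String> candidates, int n) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            for (String candidate : candidates) {
                if (editDistance(line, candidate) <= n) {
                    result.add(line);
                    break; // report each line only once
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample data copied from the question.
        List<String> file1 = Arrays.asList("Hello", "Second Line", "Text line", "Final Line");
        List<String> file2 = Arrays.asList("Final Linee", "Text llline", "line 3 of file2", "Helloo");
        for (String line : withinDistance(file1, file2, 1)) {
            System.out.println(line); // prints Hello, then Final Line
        }
    }
}
```

With the sample data, n=1 yields Hello and Final Line, and n=2 additionally yields Text line; results come out in file1.txt order. If a library is preferred, Apache Commons Text also ships a `LevenshteinDistance` class.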

Related

Need to tokenize a file containing one line only

I need to open a file, test.txt. The file contains only one sentence, all on a single line. My job is to separate each word and display all the words that are misspelled.
I've tried using a BufferedReader and FileReader, but it just prints out the name of the file. I want it to read the first line and essentially put all the words in an array. If anyone could explain how exactly I should be using BufferedReader or FileReader, that would be great.
This is test.txt:
The warst drought in the United States in neearly a century is expected to drive up the price of milk, beef and pork next yeer, the government said Wednesdaay, as consumers bear some of the bruntt of the sweltering heat that is drivng up the cost of feed corrn.
Note: This appears as one single line in the editor.
This is what I tried:
FileReader fr = new FileReader("test.txt");
BufferedReader br = new BufferedReader(fr);
StringBuilder sb = new StringBuilder();
String s;
while ((s = br.readLine()) != null) {
    sb.append(s);
}
br.close();
String text = sb.toString(); // the whole file as one string
Thanks for your help.
for (String line : Files.readAllLines(Paths.get("filepath.txt"))) {
// ...
}
Java: How to read a text file
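To get from the single line to a list of words, one option is to split the line on anything that is not a letter. A minimal sketch follows. The class name `WordSplitter`, the methods `words` and `misspelled`, and the tiny `KNOWN_WORDS` set standing in for a real dictionary are all assumptions for illustration, and the splitting only handles ASCII letters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class WordSplitter {

    // Hypothetical stand-in for a real dictionary of correctly spelled words.
    static final Set<String> KNOWN_WORDS = new HashSet<>(Arrays.asList(
            "the", "drought", "in", "a", "century"));

    // Splits on runs of non-letter characters and drops empty pieces.
    static List<String> words(String line) {
        List<String> result = new ArrayList<>();
        for (String piece : line.split("[^A-Za-z]+")) {
            if (!piece.isEmpty()) {
                result.add(piece);
            }
        }
        return result;
    }

    // Returns the words that are not in KNOWN_WORDS (compared in lower case).
    static List<String> misspelled(String line) {
        List<String> result = new ArrayList<>();
        for (String word : words(line)) {
            if (!KNOWN_WORDS.contains(word.toLowerCase())) {
                result.add(word);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "The warst drought in neearly a century";
        System.out.println(words(line));      // [The, warst, drought, in, neearly, a, century]
        System.out.println(misspelled(line)); // [warst, neearly]
    }
}
```

The line itself can come from `BufferedReader.readLine()` or from the first element of `Files.readAllLines(...)`.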

Strategy for Processing a Text File with a Header using Reg Exp in Java

I have a file that contains a header with comments (e.g. [Comment] This is a comment) and a subsequent data section. The data starts at "Mk1=".
The program I am working on should:
Copy the header contents
Search and replace only in the data section of the file
Write header and data to a new file
I am currently using:
StringBuffer
Scanner
regex.Pattern;
In my code so far (reduced to its essentials):
public static void main(String[] args) {
    File file = readFile("file.ext");
    Scanner inputScanner = null;
    try {
        inputScanner = new Scanner(file);
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    }
    String currentLine = "";
    while (inputScanner.hasNext()) {
        currentLine = inputScanner.findInLine(regexpPattern);
        if (currentLine != null) {
            fileOutput.append(currentLine + "\n");
        }
    }
}
Because the Scanner works like a queue, I have trouble figuring out which strategy to use. I have found examples that use a Matcher instead of a Scanner. As I understand it, I would also have to work with boolean flags because of the queue-like structure of Scanner. The findWithinHorizon() method does not seem helpful, as I want the regular expression to apply only beyond the horizon. Is there perhaps a "hack" for the Scanner's delimiter, assuming I know the character sequences that start and end the header?
File Example
[Comment]
Text goes here.
[Another Comment]
;Instructions: Below you will find Mk1= where the data can be assigned.
;More text.
Mk1=data
Mk2=data
Mk3=data
What strategy should I use?
Assuming you can use java.nio.file.Files (available since Java 7) and your text file isn't too big, I'd read all lines at once and use a Matcher:
Charset charset = Charset.forName("UTF-8");
List<String> lines = Files.readAllLines(file.toPath(), charset);
for (String line : lines) {
    Matcher matcher = regexpPattern.matcher(line);
    if (matcher.matches()) {
        // do something
    }
}
Using regex groups will prove useful for retrieving parameter-value pairs:
Pattern dataPattern = Pattern.compile("^Mk(\\d+)=(.*)$");
Matcher dataMatcher = dataPattern.matcher(line);
// group() may only be called after a successful match
if (dataMatcher.matches()) {
    int mk = Integer.parseInt(dataMatcher.group(1));
    String data = dataMatcher.group(2);
}
Parsing is a two-step process: a tokenizer recognizes patterns in the input, and a parser reads the tokens while keeping state so that it knows where it is.
You can use regular expressions for the tokenizing part of the problem, but you also need a parser that remembers "I have seen [Comment]" so it knows what could or should come next.
Related:
https://class.coursera.org/compilers/lecture
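The tokenizer-plus-state idea can be sketched with a single boolean flag: lines are copied unchanged until the first line starting with `Mk1=` is seen, and the search-and-replace is applied only from there on. The class name `HeaderAwareReplace`, the method `process`, and the sample replacement of `data` with `value` are assumptions for illustration, not part of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class HeaderAwareReplace {

    // Copies header lines unchanged; applies the replacement from the "Mk1=" line onward.
    static List<String> process(List<String> lines, Pattern search, String replacement) {
        List<String> out = new ArrayList<>();
        boolean inData = false; // parser state: has the data section started?
        for (String line : lines) {
            if (!inData && line.startsWith("Mk1=")) {
                inData = true;
            }
            if (inData) {
                out.add(search.matcher(line).replaceAll(replacement));
            } else {
                out.add(line);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "[Comment]",
                ";Instructions: Below you will find Mk1= where the data can be assigned.",
                "Mk1=data",
                "Mk2=data");
        for (String line : process(lines, Pattern.compile("data"), "value")) {
            System.out.println(line);
        }
    }
}
```

In the sample run the word "data" inside the comment line stays untouched, while `Mk1=data` and `Mk2=data` become `Mk1=value` and `Mk2=value`. The input could be loaded with `Files.readAllLines` and the result written back with `Files.write`.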

Parsing Individual Lines of Multi-Line Text File?

I have a question about something I've done in the past, but never really thought if it was the most efficient method to use.
Let's say I have a text file where each line contains something important, and there are multiple sets of these lines, each corresponding to a unique environment. For example:
1
String that I need to parse for specific tokens..
2
String that I need to parse for specific tokens..
String that I need to parse for specific tokens..
3
String that I need to parse for specific tokens..
String that I need to parse for specific tokens..
String that I need to parse for specific tokens..
So given the above input file, my past way of solving this would be something similar to the following (semi-pseudocode!):
BufferedReader inputFile = new BufferedReader(new FileReader("file.txt"));
String rawLine;
while ((rawLine = inputFile.readLine()) != null)
{
    Scanner line = new Scanner(rawLine);
    //parse the line looking for tokens
}
inputFile.close();
My issue with this is it seems incredibly inefficient to create a new Scanner object for every line I have in my BufferedReader.
Is there a better way to achieve this functionality?
One suggestion may be to scan the whole document by tokens, but my issue with that is that I won't be able to keep track of how many strings are a part of each subset (indicated by the integer); at least, I can't think of a solution other than decrementing a counter every time I look at a new line.
Thanks in advance!
Check this out:
public static void main(String[] args) throws IOException {
    BufferedReader bf = new BufferedReader(new FileReader(new File("d:/sample.txt")));
    LineNumberReader lr = new LineNumberReader(bf);
    String line = "";
    while ((line = lr.readLine()) != null) {
        System.out.println("Line Number " + lr.getLineNumber() +
                ": " + line);
    }
}
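For the grouped layout in the question (a count line followed by that many strings), one option is to read the count with `Integer.parseInt`, then read exactly that many lines and split each with `String.split` instead of creating a `Scanner` per line. This is a rough sketch only. The class name `GroupedLines` and the method `readGroups` are hypothetical, and it assumes every count line is a well-formed integer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class GroupedLines {

    // Reads "count, then count lines" groups until the input ends.
    static List<List<String[]>> readGroups(BufferedReader reader) throws IOException {
        List<List<String[]>> groups = new ArrayList<>();
        String countLine;
        while ((countLine = reader.readLine()) != null) {
            if (countLine.trim().isEmpty()) {
                continue; // skip blank lines between groups
            }
            int count = Integer.parseInt(countLine.trim());
            List<String[]> group = new ArrayList<>();
            for (int i = 0; i < count; i++) {
                String line = reader.readLine();
                if (line == null) {
                    break; // input ended before the group was complete
                }
                group.add(line.trim().split("\\s+")); // whitespace-separated tokens
            }
            groups.add(group);
        }
        return groups;
    }

    public static void main(String[] args) throws IOException {
        String input = "1\nalpha beta\n2\ngamma\ndelta epsilon zeta\n";
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            List<List<String[]>> groups = readGroups(reader);
            for (int g = 0; g < groups.size(); g++) {
                System.out.println("Group " + (g + 1) + ": " + groups.get(g).size() + " line(s)");
            }
        }
    }
}
```

The count line replaces the manual counter mentioned in the question: the inner loop knows how many lines belong to the current group before it reads them.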

Split lines into two Strings using BufferedReader

I want to split each line into two separate strings when reading through the txt file I'm using and later store them in a HashMap. But right now I can't seem to read through the file properly. This is what a small part of my file looks like:
....
CPI Clock Per Instruction
CPI Common Programming Interface [IBM]
.CPI Code Page Information (file name extension) [MS-DOS]
CPI-C Common Programming Interface for Communications [IBM]
CPIO Copy In and Out [Unix]
....
And this is what my code looks like:
try {
    BufferedReader br = new BufferedReader(new FileReader("akronymer.txt"));
    String line;
    String akronym;
    String betydning;
    while ((line = br.readLine()) != null) {
        String[] linje = line.split("\\s+");
        akronym = linje[0];
        betydning = linje[1];
        System.out.println(akronym + " || " + betydning);
    }
} catch (Exception e) {
    System.out.println("Feilen som ble fanget opp: " + e);
}
What I want is to store the acronym in one String and the definition in another String
The problem is that whitespace in the definition is interpreted as additional fields. You're getting only the first word of the definition in linje[1] because the other words are in other array elements:
["CPI", "Clock", "Per", "Instruction"]
Supply a limit parameter in the two-arg overload of split, to stop at 2 fields:
String[] linje = line.split("\\s+", 2);
E.g. linje[0] will be CPI and linje[1] will be Clock Per Instruction.
If you want to limit your split to only two parts, use split("\\s+", 2). Right now you are splitting the line on every run of whitespace, so every word ends up in a different array position.
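Putting the two-argument `split` together with the map from the question might look like the sketch below. Because the sample file lists `CPI` twice, a plain `HashMap<String, String>` would keep only the last definition, so this version maps each acronym to a list of definitions. The class name `AcronymMap` and the method `parse` are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class AcronymMap {

    // Splits each line into acronym and definition and groups definitions by acronym.
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> result = new HashMap<>();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+", 2);
            if (parts.length < 2) {
                continue; // skip lines without a definition, e.g. "...."
            }
            result.computeIfAbsent(parts[0], k -> new ArrayList<>()).add(parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = parse(Arrays.asList(
                "CPI Clock Per Instruction",
                "CPI Common Programming Interface [IBM]",
                "CPIO Copy In and Out [Unix]"));
        System.out.println(map.get("CPI"));  // [Clock Per Instruction, Common Programming Interface [IBM]]
        System.out.println(map.get("CPIO")); // [Copy In and Out [Unix]]
    }
}
```

The length check also avoids the ArrayIndexOutOfBoundsException that `linje[1]` would throw on a line containing a single field.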

Reading a specific text in Java

This is kind of a follow-up to my other question, "simple Java Regex read between two".
Now my code looks like this. I am reading the contents of a file and scanning for whatever is between src and -t1. Running this code returns 1 correct link, but the source file contains 10, and I can't figure out the loop. I thought another way might be to write to a second file on disk and remove the first link from the original source, but I can't code that either:
File workfile = new File("page.txt");
BufferedReader br = new BufferedReader(new FileReader(workfile));
String line;
while ((line = br.readLine()) != null) {
    //System.out.println(line);
    String url = line.split("<img src=")[1].split("-t1")[0];
    System.out.println(url);
}
br.close();
I think you want something like:
import java.util.regex.*;
Pattern urlPattern = Pattern.compile("<img src=(.*?)-t1");
while ((line = br.readLine()) != null) {
    Matcher m = urlPattern.matcher(line);
    while (m.find()) {
        System.out.println(m.group(1));
    }
}
The regular expression looks for strings beginning with <img src= and ending with -t1 (and looks for the shortest substrings possible, so that more than one can be found in the line). The part in parentheses is a "capture group" to capture the text that gets matched; this is called group 1. Then, for each line, we loop on find() to find all occurrences in each line. Each time we find one, we print what's in group 1.
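A self-contained version of the loop above, run against a made-up line holding two image tags. The sample HTML, the class name `ImgLinks`, and the method `extract` are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImgLinks {

    // Reluctant (.*?) stops at the first "-t1", so several matches per line are possible.
    private static final Pattern URL_PATTERN = Pattern.compile("<img src=(.*?)-t1");

    // Collects group 1 of every match found in the line.
    static List<String> extract(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(line);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "<img src=http://example.com/a-t1.jpg><img src=http://example.com/b-t1.jpg>";
        System.out.println(extract(line)); // [http://example.com/a, http://example.com/b]
    }
}
```

Unlike the `split(...)[1]` approach in the question, this returns an empty list for lines without a match instead of throwing an exception.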
