Here is my basic problem: I am reading some lines in from a file. The format of each line in the file is this:
John Doe 123
There is a tab between Doe and 123.
I'm looking for a regex such that I can "pick off" the John Doe. Something like scanner.next(regular expression) that would give me the John Doe.
This is probably very simple, but I can't seem to get it to work. Also, I'm trying to figure this out without having to rely on the tab being there.
I've looked here: Regular Expression regex to validate input: Two words with a space between. But none of these answers worked. I kept getting runtime errors.
Some Code:
while(inFile.hasNextLine()){
String s = inFile.nextLine();
Scanner string = new Scanner(s);
System.out.println(s); // check to make sure I got the string
System.out.println(string.next("[A-Za-z]+ [A-Za-z]+")); //This
//doesn't work for me
System.out.println(string.next("\\b[A-Za-z ]+\\b"));//Nor does
//this
}
Are you required to use regex for this? You could simply use a split method across \t on each line and just grab the first or second element (I'm not sure which you meant by 'pick off' john doe).
It would help if you provided the code you're trying that is giving you runtime errors.
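A minimal sketch of the split approach suggested above, assuming the name is everything before the first tab. The class and method names (`TabSplitDemo`, `nameField`) are made up for illustration.

```java
class TabSplitDemo {
    // Returns the text before the first tab, or the whole line if there is no tab.
    static String nameField(String line) {
        // The limit of 2 stops splitting after the first tab.
        return line.split("\t", 2)[0].trim();
    }

    public static void main(String[] args) {
        System.out.println(nameField("John Doe\t123"));
    }
}
```

This does rely on the tab being present, so it only fits if the file format is guaranteed.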
You could use regex:
[A-Za-z]+ [A-Za-z]+
if you always knew your name was going to be two words.
You could also try
\b[A-Za-z ]+\b
which matches any number of words made up of letters, while the '\b' word boundaries make sure it captures whole words, so it returns "John Doe" instead of "John Doe " (with the trailing space too). Don't forget that backslashes need to be escaped in Java string literals.
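One likely reason these patterns fail with `Scanner.next(pattern)`: that method tests the pattern against the next token, and tokens are delimited by whitespace by default, so the token is just "John" and a pattern containing a space cannot match it (the result is typically an `InputMismatchException`). A hedged sketch using `Scanner.findInLine`, which ignores delimiters, could look like this. The class and method names are illustrative only.

```java
import java.util.Scanner;

class FindInLineDemo {
    // Returns the first "word space word" run in the line, or null if there is none.
    static String leadingName(String line) {
        try (Scanner sc = new Scanner(line)) {
            // findInLine searches the raw input and does not care about token delimiters.
            return sc.findInLine("[A-Za-z]+ [A-Za-z]+");
        }
    }

    public static void main(String[] args) {
        System.out.println(leadingName("John Doe\t123"));
    }
}
```

Because the pattern only describes the name itself, it does not depend on whether a tab or a space follows it.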
This basically works to isolate John Doe from the rest...
public String isolateAndTrim( String candidate ) {
// This pattern isolates "John Doe" as a group...
Pattern pattern = Pattern.compile( "(\\w+\\s+\\w+)\\s+\\d*" );
Matcher matcher = pattern.matcher( candidate );
String clean = "";
if ( matcher.matches() ) {
clean = matcher.group( 1 );
// This replace all reduces away extraneous whitespace...
clean = clean.replaceAll( "\\s+", " " );
}
return clean;
}
The grouping parentheses allow you to "pick off" the name portion from the digit portion: "John Doe", "Jane Austin", whatever. You should learn how grouping works in regex, as it works great for problems just like this one.
The trick to remove the extra whitespace comes from How to remove duplicate white spaces in string using Java?
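To illustrate the grouping idea a little further, here is a sketch that captures both the name and the number in separate groups. It assumes every line is exactly two words followed by digits; the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Group 1 is the two-word name, group 2 is the trailing number.
    private static final Pattern LINE = Pattern.compile("(\\w+\\s+\\w+)\\s+(\\d+)");

    // Returns {name, number}, or null if the line does not have the expected shape.
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    public static void main(String[] args) {
        String[] parts = parse("John Doe\t123");
        System.out.println(parts[0] + " / " + parts[1]);
    }
}
```

Note that `\w` also accepts digits and underscores, so a stricter pattern may be preferable for real names.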
Do you prefer simplicity and readability? If so, consider the following solution
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
public class MyLineScanner
{
public static void readLine(String source_file) throws FileNotFoundException
{
File source = new File(source_file);
Scanner line_scanner = new Scanner(source);
while(line_scanner.hasNextLine())
{
String line = line_scanner.nextLine();
// check to make sure the line exists
System.out.println(line);
// this works for me
Scanner words_scanner = new Scanner(line);
words_scanner.useDelimiter("\t");
while (words_scanner.hasNext())
{
System.out.format("word : %s %n", words_scanner.next());
}
}
}
public static void main(String[] args) throws FileNotFoundException
{
readLine("source.txt");
}
}
I want to split paragraphs into sentences. For example, "Mary had a little lamb. Its fleece was white." should be split into:
"Mary had a little lamb."
"Its fleece was white."
Currently I have tried text.split("[.]"); and got this result:
"Mary had a little lamb" (no full stop present, and I need it)
" Its fleece was white" (a space before the sentence and, again, no full stop)
What I intend to do is split this paragraph into proper sentences and put them into an array.
String text = sc.nextLine();
String[] sentence = text.split("[.]");
Please help!
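You can just append the full stop '.' to each string after splitting. Note that reassigning the loop variable in a for-each loop does not change the array, so use an indexed loop.
Something like:
String[] splitString = theString.split("[.]");
for (int i = 0; i < splitString.length; i++) {
    // trim the space left over from the split, then restore the full stop
    splitString[i] = splitString[i].trim() + ".";
}
Something in that direction.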
Assuming that there is at least a chance that sentence splitting is not the last bit of natural language processing required, you should consider using a natural language processing (NLP) library like OpenNLP. You can try out OpenNLP through a web interface thanks to the Gate project, who have made an English language processing pipeline available as a web page. Make sure to use the "Customise Annotations" button to get to see the sentence structure.
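If pulling in an NLP library is too heavy, the JDK's own `java.text.BreakIterator` is a lighter alternative (a different technique from OpenNLP, and not a full NLP pipeline). The sketch below assumes English text; the class and method names are made up for illustration.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceSplitDemo {
    // Splits text at the sentence boundaries reported by BreakIterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.US);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            // Each segment keeps its full stop; trim drops the trailing space.
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : sentences("Mary had a little lamb. Its fleece was white.")) {
            System.out.println(s);
        }
    }
}
```

How it treats abbreviations and other edge cases depends on the JDK's built-in rules, so it is worth testing against your own text.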
Something like this should work:
public class Main {
    public static void main(String[] args) {
        String paragraph = "Mary had a little lamb. Its fleece was white.";
        // Split on the whitespace that follows a full stop; the lookbehind
        // keeps the full stop attached to its sentence.
        String[] sentences = paragraph.split("(?<=[.])\\s+");
        for (String sentence : sentences) {
            System.out.println(sentence);
        }
    }
}
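Consider this sentence:
String sResult = "This is a test. This is a T.L.A. test.";
You may be better off trying this, which splits only on a period that follows a lowercase letter and is followed by whitespace: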
String sResult = "This is a test. This is a T.L.A. test.";
String[] sSentence = sResult.split("(?<=[a-z])\\.\\s+");
Result:
This is a test
This is a T.L.A. test.
Note that there are abbreviations that do not end with capital letters, such as abbrev., Mr., etc. And there are also sentences that don't end in periods!
I'm currently working on a SIC assembler and scanning lines from the following file:
begin START 0
main LDX zero
copy LDCH str1, x
STCH str2, x
TIX eleven
JLT copy
str1 BYTE C'TEST STRING'
str2 RESB 11
zero WORD 0
eleven WORD 11
END main
I'm using, as you might have already guessed, a regex to extract the fields from each line of code. Right now, I'm just testing whether the lines match the regex (as they're supposed to). If they do, the program prints them. The problem is that it recognizes only the first line and ignores the rest (i.e. from the second line on, they do not match the regex).
Here's the code so far:
public static void main(String args[]) throws FileNotFoundException {
Scanner scan = new Scanner(new File("/home/daniel/test.asm"));
Pattern std = Pattern.compile("(^$|[a-z0-9\\-\\_]*)(\\s+)([A-Z]+)(\\s+)([a-z0-9\\-\\_]*)");
String lineFromFile;
lineFromFile = scan.nextLine();
Matcher standard = std.matcher(lineFromFile);
while (standard.find()) {
System.out.println(lineFromFile);
lineFromFile = scan.nextLine();
}
}
It prints just the first line:
begin START 0
The weird thing comes here: if I copy the second line directly from the file, and declare a String object with it, and test it manually, it does work! And the same with the rest of the other lines. Something like:
public static void main(String args[]) throws FileNotFoundException {
Scanner scan = new Scanner(new File("/home/daniel/test.asm"));
Pattern std = Pattern.compile("(^$|[a-z0-9\\-\\_]*)(\\s+)([A-Z]+)(\\s+)([a-z0-9\\-\\_]*)");
String lineFromFile;
lineFromFile = "main LDX zero";
Matcher standard = std.matcher(lineFromFile);
if (standard.find())
System.out.println(lineFromFile);
}
And it does print it!
main LDX zero
I don't know whether it has something to do with the regex or the file. I'd really appreciate it if any of you could help me find the error.
Thanks for your time! :)
Note: I am assuming your regex is correct.
You need to update the Matcher object for every line you read from input. (For demonstration, I have just updated your code to read line by line from console and not file.)
Java Code
String pattern = "(^$|[a-z0-9\\-\\_]*)(\\s+)([A-Z]+)(\\s+)([a-z0-9\\-\\_]*)";
Pattern r = Pattern.compile(pattern);
Matcher m;
// tmp is the Scanner that supplies the input lines
while (tmp.hasNextLine()) {
    String line = tmp.nextLine();
    // create a fresh Matcher for every line
    m = r.matcher(line);
    while (m.find()) {
        System.out.println(m.group(1) + m.group(2) + m.group(3) + m.group(4) + m.group(5));
    }
}
Ideone Demo
Using if instead of while is sufficient here, unless there can be multiple matches on a single line:
if(m.find()) {
System.out.println(m.group(1) + m.group(2)+ m.group(3)+ m.group(4)+ m.group(5));
}
EDIT
Assuming there are only three parts in each input line, you can use this regex instead:
^((?:\w+)?\s+)(\w+\s+)(.*)$
Regex Demo
Your regex does appear to be incorrect, but that's not your immediate problem. Your while loop has to iterate through all lines, not just the ones that match. If you're using a Scanner, the test condition is the hasNextLine() method. You do the matching inside the loop. You can still create the Matcher ahead of time and apply it to each line using the reset() method:
Scanner sc = new Scanner(new File("test.asm"));
Pattern p = Pattern.compile("^([a-z0-9_-]*)\\s+([A-Z]+)\\s+(.*)");
Matcher m = p.matcher("");
while (sc.hasNextLine()) {
String lineFromFile = sc.nextLine();
if (m.reset(lineFromFile).find()) {
System.out.printf("%-8s %-6s %s%n", m.group(1), m.group(2), m.group(3));
}
}
As for your regex, the last part seemed to be too restrictive--it doesn't match your sample data, anyway. I changed it to consume everything after the second whitespace gap. I also simplified the first part and got rid of the unnecessary groups.
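To make the effect of the revised pattern concrete, here is a small sketch that applies it to individual lines. It assumes that lines without a label start with whitespace in the real file (the indentation is not visible in the post as shown); the class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SicLineDemo {
    // Optional label, mnemonic, then everything else as the operand.
    private static final Pattern LINE =
            Pattern.compile("^([a-z0-9_-]*)\\s+([A-Z]+)\\s+(.*)");

    // Returns {label, mnemonic, operand}, or null if the line does not match.
    static String[] fields(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] samples = { "begin START 0", "    STCH str2, x", "str1 BYTE C'TEST STRING'" };
        for (String s : samples) {
            String[] f = fields(s);
            System.out.printf("[%s] [%s] [%s]%n", f[0], f[1], f[2]);
        }
    }
}
```

A label-less line with no leading whitespace, such as "END main" typed flush left, does not match, because the pattern requires at least one whitespace character before the mnemonic.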
I have a String :
"Hello world... I am here. Please respond."
and I would like to count the number of sentences within the String. I had an idea to use a Scanner as well as the useDelimiter method to split any String into sentences.
Scanner in = new Scanner(file);
in.useDelimiter("insert here");
I'd like to create a regular expression which can go through the String I have shown above and identify it to have two sentences. I initially tried using the delimiter:
[^?.]
It gets hung up on the ellipses.
You could use a regular expression that checks for a non end of sentence, followed by an end of sentence like:
[^?!.][?!.]
Although, as @Gabe Sechan points out, a regular expression may not be accurate when the sentence includes abbreviations such as Dr., Rd., St., etc.
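As a rough sketch of how that pattern could be used for counting, the code below counts its matches with a `Matcher` instead of using it as a `Scanner` delimiter. The class and method names are illustrative, and whether an ellipsis should count as a sentence end is left to the reader.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceCountDemo {
    // A non-terminator character immediately followed by a terminator.
    private static final Pattern SENTENCE_END = Pattern.compile("[^?!.][?!.]");

    static int countSentences(String text) {
        Matcher m = SENTENCE_END.matcher(text);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countSentences("Hello world... I am here. Please respond."));
    }
}
```

Because only the first dot of "..." is preceded by a non-terminator, the ellipsis is counted once. Abbreviations are still miscounted: "Dr. Smith left." yields 2.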
this could help :
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" );
return tokens.size();
}
You can also add the newline as a separator, and keep it independent of your OS, with the following line of code:
String pattern = System.getProperty("line.separator") + " ";
You can find more about matching newlines here: Java regex: newline + white space
Finally, the method becomes:
public int getNumSentences()
{
List<String> tokens = getTokens( "[^!?.]+" + pattern + "+" );
return tokens.size();
}
hope this could help :) !
A regular expression probably isn't the right tool for this. English is not a regular language, so regular expressions get hung up a lot. For one thing, you can't even be sure a period in the middle of the text is the end of a sentence: abbreviations (like Mr.), acronyms with periods, and initials will trip you up as well. It's not the right tool.
For your sentence : "Hello world... I am here. Please respond."
The code will be :
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class JavaRegex {
public static void main(String[] args) {
int count=0;
String sentence = "Hello world... I am here. Please respond.";
Pattern pattern = Pattern.compile("\\..");
Matcher matcher = pattern.matcher(sentence);
while(matcher.find()) {
count++;
}
System.out.println("No. of sentence = "+count);
}
}
I am reading an extremely messy tab-delimited text file line by line and trying to get the unique column names out of it.
The problem is that it uses tabs as the field separator, but some column names have spaces in them! I am using
String[] cols = line.split("\\t");
which does not seem to work properly, since in some cases it treats the spaces as separators! Is using a regex a good solution? If so, could you advise which regex removes spaces from a string but keeps the tabs?
Data is like:
Sever ID Name
12221 zxsz
A tab in a string literal is just "\t". "\\t" is a literal backslash followed by a "t", which the regex engine itself interprets as a tab. That is why either form works for me:
public class Scratch2 {
public static void main(String[] args) {
String welk = "anna one\tanna two\tanna three";
System.out.println("\\t");
String[] annas = welk.split("\t");
for (String anna : annas) {
System.out.println(anna);
}
System.out.println("\\\\t");
annas = welk.split("\\t");
for (String anna : annas) {
System.out.println(anna);
}
}
}
Output:
\t
anna one
anna two
anna three
\\t
anna one
anna two
anna three
The simplest explanation is that your input strings don't contain the whitespace characters you think they do.
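One way to check that is to make the whitespace visible before splitting. The helper below is a hypothetical debugging sketch (the class name, method name, and marker format are made up for illustration), not part of any library.

```java
class WhitespaceDebugDemo {
    // Replaces each whitespace character with a visible marker.
    static String showWhitespace(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\t') {
                sb.append("<TAB>");
            } else if (c == ' ') {
                sb.append("<SP>");
            } else if (Character.isWhitespace(c) || Character.isSpaceChar(c)) {
                // Any other kind of space, e.g. a non-breaking space.
                sb.append(String.format("<U+%04X>", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(showWhitespace("Col A\tCol B"));
    }
}
```

Printing each header line through such a helper shows whether the separators are really tabs or runs of spaces.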
I'm trying to use the Scanner class to parse all the words in a file. The file contains ordinary text, but I only want the words, excluding all the punctuation.
The solution I have so far is not complete, but it is already giving me a problem:
Scanner fileScan= new Scanner(file);
String word;
while(fileScan.hasNext("[^ ,!?.]+")){
word= fileScan.next();
this.addToIndex(word, filename);
}
Now if I use this on a sentence like "hi my name is mario!" it returns just "hi", "my", "name" and "is". It's not matching "mario!" (obviously), but it's not matching "mario" either, as I think it should.
Can you explain why that is, and help me find a better solution if you have one?
Thank you
This works:
import java.util.*;
class S {
public static void main(String[] args) {
Scanner fileScan= new Scanner("hi my name is mario!").useDelimiter("[ ,!?.]+");
String word;
while(fileScan.hasNext()){
word= fileScan.next();
System.out.println(word);
}
} // end of main()
}
javac -g S.java && java S
hi
my
name
is
mario
Since you want to get rid of the punctuation, you can simply replace all punctuation marks before adding to the index:
word = word.replaceAll("\\p{Punct}", "");
In the case of hyphens or other isolated punctuation marks, just check word.isEmpty() before adding.
Of course, you'd have to get rid of your custom delimiter.
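Put together, the approach described above might look like the following sketch. It reads from a String rather than a file to stay self-contained, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class WordScanDemo {
    // Returns the words of the text with all punctuation stripped.
    static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        // Default delimiter: tokens are separated by whitespace.
        try (Scanner sc = new Scanner(text)) {
            while (sc.hasNext()) {
                String word = sc.next().replaceAll("\\p{Punct}", "");
                // Tokens that were pure punctuation become empty and are skipped.
                if (!word.isEmpty()) {
                    result.add(word);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(words("hi my name is mario!"));
    }
}
```

One side effect to be aware of: punctuation inside a word is removed too, so "don't" becomes "dont".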