Java tokenizer for strings

I have a text file and want to tokenize its lines -- but only the sentences with the # character.
For example, given...
Buah... Molt bon concert!! #Postconcert #gintonic
...I want to print only #Postconcert #gintonic.
I have already tried this code with some changes...
public class MyTokenizer {
/**
* @param args
*/
public static void main(String[] args) {
tokenize("Europe3.txt","allo.txt");
}
public static void tokenize(String sFile,String sFileOut) {
String sLine="", sToken="";
MyBufferedReaderWriter f = new MyBufferedReaderWriter();
f.openRFile(sFile);
MyBufferedReaderWriter fOut = new MyBufferedReaderWriter();
fOut.openWFile(sFileOut);
while ((sLine=f.readLine()) != null) {
//StringTokenizer st = new StringTokenizer(sLine, "#");
String[] tokens = sLine.split("\\#");
for (String token : tokens)
{
fOut.writeLine(token);
//System.out.println(token);
}
/*while (st.hasMoreTokens()) {
sToken = st.nextToken();
System.out.println(sToken);
}*/
}
f.closeRFile();
}
}
Can anyone help?

You can try something like this with a regex:
package com.stackoverflow.answers;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HashExtractor {
public static void main(String[] args) {
String strInput = "Buah... Molt bon concert!! #Postconcert #gintonic";
String strPattern = "(?:\\s|\\A)[##]+([A-Za-z0-9-_]+)";
Pattern pattern = Pattern.compile(strPattern);
Matcher matcher = pattern.matcher(strInput);
while (matcher.find()) {
// The match includes the whitespace consumed by (?:\\s|\\A), so trim it before printing
System.out.println(matcher.group().trim());
}
}
}

As per the given example, when using the split() function the values would be stored something like this:
tokens[0]=Buah... Molt bon concert!!
tokens[1]=Postconcert
tokens[2]=gintonic
So you just need to skip the first value and prepend '#' (if you need it) to each of the remaining values.
Hope this helps.
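A minimal sketch of the split() idea described above, assuming a hashtag ends at the first whitespace character after the '#'. The class and method names (SplitHashtags, hashtags) are made up for illustration and are not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;

class SplitHashtags {
    // Returns the hashtags found in one line, each with its '#' put back.
    static List<String> hashtags(String line) {
        List<String> result = new ArrayList<>();
        String[] tokens = line.split("#");
        // tokens[0] is the text before the first '#', so start at index 1.
        for (int i = 1; i < tokens.length; i++) {
            String piece = tokens[i];
            // Keep only the leading run of non-whitespace characters (assumption:
            // anything after the first space is ordinary text, not part of the tag).
            int end = 0;
            while (end < piece.length() && !Character.isWhitespace(piece.charAt(end))) {
                end++;
            }
            if (end > 0) {
                result.add("#" + piece.substring(0, end));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [#Postconcert, #gintonic]
        System.out.println(hashtags("Buah... Molt bon concert!! #Postconcert #gintonic"));
    }
}
```

A line without any '#' produces an empty list, so such lines can simply be skipped when writing the output file.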

You have not specifically asked for this, but I assume you are trying to extract all the #hashtags from your text file.
To do this, Regex is your friend:
String text = "Buah... Molt bon concert!! #Postconcert #gintonic";
System.out.println(getHashTags(text));
public Collection<String> getHashTags(String text) {
Pattern pattern = Pattern.compile("(#\\w+)");
Matcher matcher = pattern.matcher(text);
Set<String> htags = new HashSet<>();
while (matcher.find()) {
htags.add(matcher.group(1));
}
return htags;
}
The pattern is #\w+: a # followed by one or more (+) word characters (\w).
In a Java string literal the backslash has to be escaped, which gives \\w.
Finally, the expression is wrapped in parentheses, (#\w+), so that the matched text is available as group 1.
For every match, the first group is added to the set htags, which ends up holding all the hashtags:
[#gintonic, #Postconcert]
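Note that a HashSet does not keep insertion order, which is why the tags above come out reversed. If the order matters, something along these lines might work: a sketch that walks over several lines and collects distinct hashtags in order of first appearance. The names HashtagLines and collect are hypothetical, and the lines are passed in as a list rather than read from the asker's file.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HashtagLines {
    private static final Pattern HASHTAG = Pattern.compile("#\\w+");

    // Collects every distinct hashtag from the given lines, in order of first appearance.
    static Set<String> collect(Iterable<String> lines) {
        Set<String> tags = new LinkedHashSet<>();
        for (String line : lines) {
            // Lines without a '#' simply produce no match and are skipped.
            Matcher matcher = HASHTAG.matcher(line);
            while (matcher.find()) {
                tags.add(matcher.group());
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        // prints [#Postconcert, #gintonic]
        System.out.println(collect(Arrays.asList(
                "Buah... Molt bon concert!! #Postconcert #gintonic",
                "No tags on this line")));
    }
}
```

For a real file, the list of lines could come from java.nio.file.Files.readAllLines.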

Related

Finding six consecutive integers in three lines of string

I have written an OCR program in Java that scans documents and finds all the text in them. My primary task is to find the invoice number, which can be six or more digits long.
I used substring, but that is not reliable because the position of the number changes with every document. It is, however, always present in the first three lines of the OCR text.
I want to write code in Java 8 that iterates through the first three lines and extracts these six consecutive digits.
I am using Tesseract for OCR.
Example:
,——— ————i_
g DAILYW RK SHE 278464
E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha
From this, I need to extract the number 278464.
Please help!!
Try the following code, which uses a regex.
import java.util.regex.Pattern;
import java.util.regex.Matcher;
public class Test
{
public static void main(String[] args)
{
Pattern pattern = Pattern.compile("(?<=\\D)\\d{6}(?!\\d)");
String str = "g DAILYW RK SHE 278464";
Matcher matcher = pattern.matcher(str);
if(matcher.find()){
String s = matcher.group();
//278464
System.out.println(s);
}
}
}
(?<=\\D) is a lookbehind: the character just before the match must be a non-digit, but it is not included in the match
\\d{6} matches exactly six digits
(?!\\d) is a negative lookahead: the character just after the match must not be a digit, and it is not included in the match
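One thing to watch for: (?<=\\D) requires that a non-digit character actually exists before the number, so a number at the very start of a line would not match. The sketch below swaps in a negative lookbehind, (?<!\\d), which only requires that no digit precedes the match. This is a variation on the pattern above, not the answer's original code, and the class and method names are invented for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceLookaround {
    // (?<!\d) : no digit immediately before the match
    // \d{6}   : exactly six digits
    // (?!\d)  : no digit immediately after the match
    private static final Pattern SIX_DIGITS = Pattern.compile("(?<!\\d)\\d{6}(?!\\d)");

    // Returns the first standalone six-digit run in the text, or null if there is none.
    static String firstSixDigitRun(String text) {
        Matcher matcher = SIX_DIGITS.matcher(text);
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstSixDigitRun("g DAILYW RK SHE 278464")); // prints 278464
        System.out.println(firstSixDigitRun("278464 at line start"));   // prints 278464
        System.out.println(firstSixDigitRun("SHE 2784647"));            // prints null (seven digits)
    }
}
```

If invoice numbers may be longer than six digits, as the question suggests, \\d{6,} is the simpler choice.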
It can be solved simply with \\d{6,} as shown below:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main(String args[]) {
// Tests
String[] textArr1 = { ",——— ————i_", "g DAILYW RK SHE 2784647",
"E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha" };
String[] textArr2 = { ",——— ————i_", "g DAILYW RK SHE ——— ————",
"E C 0 mp] on THE 278464 POUJER Hello, Mumbai, Co. Maha" };
String[] textArr3 = { ",——— 278464————i_", "g DAILYW RK SHE POUJER",
"E C 0 mp] on THE POUJER Hello, Mumbai, Co. Maha" };
System.out.println(getInvoiceNumber(textArr1));
System.out.println(getInvoiceNumber(textArr2));
System.out.println(getInvoiceNumber(textArr3));
}
static String getInvoiceNumber(String[] textArr) {
String invoiceNumber = "";
Pattern pattern = Pattern.compile("\\d{6,}");
for (String text : textArr) {
Matcher matcher = pattern.matcher(text);
if (matcher.find()) {
invoiceNumber = matcher.group();
}
}
return invoiceNumber;
}
}
Output:
2784647
278464
278464
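Check this code.
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Test {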
private static final Pattern p = Pattern.compile("(\\d{6,})");
public static void main(String[] args) {
try {
Scanner scanner = new Scanner(new File("here put your file path"));
System.out.println("done");
while (scanner.hasNextLine()) {
String line = scanner.nextLine();
// create matcher for pattern p and given string
Matcher m = p.matcher(line);
// if an occurrence of the pattern was found in the line...
if (m.find()) {
System.out.println(m.group(1)); // group 1 holds the matched digits
}
}
scanner.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
}
}
}

Splitting string on spaces unless in double quotes but double quotes can have a preceding string attached

I need to split a string in Java (first remove whitespaces between quotes and then split at whitespaces.)
"abc test=\"x y z\" magic=\" hello \" hola"
becomes:
firstly:
"abc test=\"xyz\" magic=\"hello\" hola"
and then:
abc
test="xyz"
magic="hello"
hola
Scenario :
I am getting a string like the one above as input and I want to break it into parts as shown. One approach is to first remove the spaces between quotes and then split at spaces, but the text attached in front of the quotes complicates this. A second approach is to split at spaces unless they are inside quotes, and then remove the spaces from each piece. I tried capturing the quoted parts with "\"([^\"]+)\"" but I'm not able to capture just the spaces inside the quotes. I tried some more but had no luck.
We can do this using a formal pattern matcher. The secret sauce of the answer below is the rarely used Matcher#appendReplacement method. We pause at each match, and then append a custom replacement for anything appearing between a pair of quotes. The custom method removeSpaces() strips all whitespace from each quoted term.
public static String removeSpaces(String input) {
return input.replaceAll("\\s+", "");
}
String input = "abc test=\"x y z\" magic=\" hello \" hola";
Pattern p = Pattern.compile("\"(.*?)\"");
Matcher m = p.matcher(input);
StringBuffer sb = new StringBuffer("");
while (m.find()) {
m.appendReplacement(sb, "\"" + removeSpaces(m.group(1)) + "\"");
}
m.appendTail(sb);
String[] parts = sb.toString().split("\\s+");
for (String part : parts) {
System.out.println(part);
}
abc
test="xyz"
magic="hello"
hola
The big caveat here, as the above comments hinted at, is that we are really using a regex engine as a rudimentary parser. To see where my solution would fail fast, just remove one of the quotes by accident from a quoted term. But if you are sure your input is well formed, as in the example you have shown us, this answer might work for you.
I wanted to mention Java 9's Matcher.replaceAll overload that takes a lambda:
// Find quoted strings and remove their whitespace:
s = Pattern.compile("\"[^\"]*\"").matcher(s)
.replaceAll(mr -> mr.group().replaceAll("\\s", ""));
// Turn the remaining whitespace into commas and wrap everything in braces.
s = '{' + s.trim().replaceAll("\\s+", ", ") + '}';
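For completeness, here is a self-contained sketch of the same Java 9+ approach that ends with a plain split instead of the brace-wrapped string. The class name QuotedSplit and its split method are illustrative only. Matcher.quoteReplacement is added as a precaution, because the string returned by the lambda is still interpreted as a replacement string, where $ and \ have special meaning.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedSplit {
    private static final Pattern QUOTED = Pattern.compile("\"[^\"]*\"");

    static List<String> split(String input) {
        // Strip whitespace inside each quoted section (lambda overload needs Java 9+).
        String compact = QUOTED.matcher(input).replaceAll(
                mr -> Matcher.quoteReplacement(mr.group().replaceAll("\\s", "")));
        // Quoted sections no longer contain whitespace, so a plain split is enough.
        return Arrays.asList(compact.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        // prints [abc, test="xyz", magic="hello", hola]
        System.out.println(split("abc test=\"x y z\" magic=\" hello \" hola"));
    }
}
```

Like the other regex answers, this assumes the quotes in the input are balanced.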
Probably the other answer is better but still I have written it so I will post it here ;) It takes a different approach
public static void main(String[] args) {
String test="abc test=\"x y z\" magic=\" hello \" hola";
Pattern pattern = Pattern.compile("([^\\\"]+=\\\"[^\\\"]+\\\" )");
Matcher matcher = pattern.matcher(test);
int lastIndex=0;
while(matcher.find()) {
String[] parts=matcher.group(0).trim().split("=");
boolean newLine=false;
for (String string : parts[0].split("\\s+")) {
if(newLine)
System.out.println();
newLine=true;
System.out.print(string);
}
System.out.println("="+parts[1].replaceAll("\\s",""));
lastIndex=matcher.end();
}
System.out.println(test.substring(lastIndex).trim());
}
Result is
abc
test="xyz"
magic="hello"
hola
It sounds like you want to write a basic parser/tokenizer. My bet is that after you make something that can deal with pretty-printing in this structure, you will soon want to start validating that there aren't any mismatched quotes.
But in essence, you have a few stages for this particular problem, and Java has a built in tokenizer that can prove useful.
import java.util.LinkedList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.stream.Collectors;
public class Q50151376{
private static class Whitespace{
Whitespace(){ }
@Override
public String toString() {
return "\n";
}
}
private static class QuotedString {
public final String string;
QuotedString(String string) {
this.string = "\"" + string.trim() + "\"";
}
@Override
public String toString() {
return string;
}
}
public static void main(String[] args) {
String test = "abc test=\"x y z\" magic=\" hello \" hola";
StringTokenizer tokenizer = new StringTokenizer(test, "\"");
boolean inQuotes = false;
List<Object> out = new LinkedList<>();
while (tokenizer.hasMoreTokens()) {
final String token = tokenizer.nextToken();
if (inQuotes) {
out.add(new QuotedString(token));
} else {
out.addAll(TokenizeWhitespace(token));
}
inQuotes = !inQuotes;
}
System.out.println(joinAsStrings(out));
}
private static String joinAsStrings(List<Object> out) {
return out.stream()
.map(Object::toString)
.collect(Collectors.joining());
}
public static List<Object> TokenizeWhitespace(String in){
List<Object> out = new LinkedList<>();
StringTokenizer tokenizer = new StringTokenizer(in, " ", true);
boolean ignoreWhitespace = false;
while (tokenizer.hasMoreTokens()){
String token = tokenizer.nextToken();
boolean whitespace = token.equals(" ");
if(!whitespace){
out.add(token);
ignoreWhitespace = false;
} else if(!ignoreWhitespace) {
out.add(new Whitespace());
ignoreWhitespace = true;
}
}
return out;
}
}

How to replace tokens in java using regex?

I have a string template containing $variables that need to be replaced.
String Template: "hi my name is $name.\nI am $age old. I am $sex"
The solution I tried does not work in my Java program.
http://regexr.com/3dtq1
Further, I referred to https://www.regex101.com/, where I could not check whether the pattern works for Java. But while going through one of the tutorials I found that "$" matches the end of a line. What is the best way to replace the tokens in the template with the variables?
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class PatternCompiler {
static String text = "hi my name is $name.\nI am $age old. I am $sex";
static Map<String,String> replacements = new HashMap<String,String>();
static Pattern pattern = Pattern.compile("\\$\\w+");
static Matcher matcher = pattern.matcher(text);
public static void main(String[] args) {
replacements.put("name", "kumar");
replacements.put("age", "26");
replacements.put("sex", "male");
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
String replacement = replacements.get(matcher.group(1));
if (replacement != null) {
// matcher.appendReplacement(buffer, replacement);
// see comment
matcher.appendReplacement(buffer, "");
buffer.append(replacement);
}
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
}
}
You are using matcher.group(1) but you didn't define any group in the regexp (( )), so you can use only group() for the whole matched string, which is what you want.
Replace line:
String replacement = replacements.get(matcher.group(1));
With:
String replacement = replacements.get(matcher.group().substring(1));
Notice the substring: your map contains only the bare words, but the matcher also matches the $, so you need to look up "$age".substring(1) in the map while still replacing the whole $age.
You can try replacing the pattern string with
\\$(\\w+)
and the variable replacement works. Your current pattern only has group 0 (the entire pattern) but not group 1. Adding the parenthesis makes the first group the variable name and the replacement will replace the dollar sign and the variable name.
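Putting the capture-group pattern together with appendReplacement could look roughly like this. It is a sketch, not the asker's final code: TemplateSketch and render are invented names, leaving unknown tokens untouched is an assumption about the desired behaviour, and Matcher.quoteReplacement is used so that a value containing $ or \ is inserted literally.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TemplateSketch {
    // Group 1 is the variable name without the leading $.
    private static final Pattern TOKEN = Pattern.compile("\\$(\\w+)");

    static String render(String template, Map<String, String> values) {
        Matcher matcher = TOKEN.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = values.get(matcher.group(1));
            // Assumption: tokens with no entry in the map are left exactly as written.
            String replacement = (value != null) ? value : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> values = new HashMap<>();
        values.put("name", "kumar");
        values.put("age", "26");
        values.put("sex", "male");
        System.out.println(render("hi my name is $name.\nI am $age old. I am $sex", values));
    }
}
```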
Your code has just minor glitches.
static Map<String,String> replacements = new HashMap<>();
static Pattern pattern = Pattern.compile("\\$\\w+\\b"); // \b not really needed
// With no parentheses (...) in the pattern there is no group(1); drop the leading $ to match the map keys
String replacement = replacements.get(matcher.group().substring(1));
You're not using the right thing as your key. Change to group(), and change the map keys to "$name" etc.:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class HelloWorld {
static String text = "hi my name is $name.\nI am $age old. I am $sex";
static Map<String,String> replacements = new HashMap<String,String>();
static Pattern pattern = Pattern.compile("\\$\\w+");
static Matcher matcher = pattern.matcher(text);
public static void main(String[] args) {
replacements.put("$name", "kumar");
replacements.put("$age", "26");
replacements.put("$sex", "male");
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
String replacement = replacements.get(matcher.group());
System.out.println(replacement);
if (replacement != null) {
// matcher.appendReplacement(buffer, replacement);
// see comment
matcher.appendReplacement(buffer, "");
buffer.append(replacement);
}
}
matcher.appendTail(buffer);
System.out.println(buffer.toString());
}
}

How do you replace groups in a regular expression?

How, exactly, do you replace groups while appending them to a string buffer?
For Example:
(a)(b)(c)
How can you replace group 1 with d, group 2 with e and so on?
I'm working with the Java regex engine.
Thanks in advance.
You could use Matcher's appendReplacement
Here is an example using:
input: "hello bob How is your cat?"
regular expression: "(bob|cat)"
output: "hello alice How is your dog?"
public static void main(String[] args) {
Pattern p = Pattern.compile("(bob|cat)");
Matcher m = p.matcher("hello bob How is your cat?");
StringBuffer s = new StringBuffer();
while (m.find()) {
m.appendReplacement(s, doReplace(m.group(1)));
}
m.appendTail(s);
System.out.println(s.toString());
}
public static String doReplace(String s) {
if(s.equals("bob")) {
return "alice";
}
if(s.equals("cat")) {
return "dog";
}
return "";
}
You could use Matcher#start(group) and Matcher#end(group) to build a generic replacement method:
public static String replaceGroup(String regex, String source, int groupToReplace, String replacement) {
return replaceGroup(regex, source, groupToReplace, 1, replacement);
}
public static String replaceGroup(String regex, String source, int groupToReplace, int groupOccurrence, String replacement) {
Matcher m = Pattern.compile(regex).matcher(source);
for (int i = 0; i < groupOccurrence; i++)
if (!m.find()) return source; // pattern not met, may also throw an exception here
return new StringBuilder(source).replace(m.start(groupToReplace), m.end(groupToReplace), replacement).toString();
}
public static void main(String[] args) {
// replace with "%" what was matched by group 1
// input: aaa123ccc
// output: %123ccc
System.out.println(replaceGroup("([a-z]+)([0-9]+)([a-z]+)", "aaa123ccc", 1, "%"));
// replace with "!!!" what was matched the 4th time by the group 2
// input: a1b2c3d4e5
// output: a1b2c3d!!!e5
System.out.println(replaceGroup("([a-z])(\\d)", "a1b2c3d4e5", 2, 4, "!!!"));
}
Are you looking for something like this?
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Program1 {
public static void main(String[] args) {
Pattern p = Pattern.compile("(a)(b)(c)");
String str = "111abc222abc333";
String out = null;
Matcher m = p.matcher(str);
out = m.replaceAll("z$3y$2x$1");
System.out.println(out);
}
}
This gives 111zcybxa222zcybxa333 as output.
You can probably see what this example does.
As far as I know, though, there is no ready-made built-in method that lets you say, e.g.:
- replace group 3 with zzz
- replace group 2 with yyy
- replace group 1 with xxx
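That said, a small helper can get close by using Matcher#start(group) and Matcher#end(group). The following is only a sketch under stated assumptions: the groups are not nested, they appear left to right, and every group takes part in each match. The names GroupReplaceSketch and replaceGroups are made up for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupReplaceSketch {
    // Replaces capture group i (1-based) of every match with replacements[i - 1].
    // Text between the groups, and groups without a replacement, are copied unchanged.
    static String replaceGroups(String regex, String source, String... replacements) {
        Matcher m = Pattern.compile(regex).matcher(source);
        StringBuilder out = new StringBuilder();
        int last = 0; // end of the part of source already copied into out
        while (m.find()) {
            for (int g = 1; g <= m.groupCount() && g <= replacements.length; g++) {
                out.append(source, last, m.start(g)); // text before this group
                out.append(replacements[g - 1]);      // literal replacement for the group
                last = m.end(g);
            }
        }
        out.append(source, last, source.length()); // remainder after the last group
        return out.toString();
    }

    public static void main(String[] args) {
        // prints 111def222def333
        System.out.println(replaceGroups("(a)(b)(c)", "111abc222abc333", "d", "e", "f"));
    }
}
```

Because the replacements are appended directly, $ and \ in them are treated as ordinary characters.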

Java Pattern match

I have a long template from which I need to extract certain strings based on certain patterns. When I went through some examples I found that quantifiers are useful in such situations. For example, the following is my template, from which I need to extract while and doWhile.
This is a sample document.
$while($variable)This text can be repeated many times until do while is called.$endWhile.
Some sample text follows this.
$while($variable2)This text can be repeated many times until do while is called.$endWhile.
Some sample text.
I need to extract the whole text, starting from $while($variable) till $endWhile. I then need to process the value of $variable. After that I need to insert the text between $while and $endWhile to the original text.
I've the logic of extracting the variable. But I'm not sure how to use quantifiers or pattern match here.
Can someone please provide me a sample code for this? Any help will be greatly appreciated
You can use a rather simple regex-based solution here with a Matcher:
Pattern pattern = Pattern.compile("\\$while\\((.*?)\\)(.*?)\\$endWhile", Pattern.DOTALL);
Matcher matcher = pattern.matcher(yourString);
while(matcher.find()){
String variable = matcher.group(1); // this will include the $
String value = matcher.group(2);
// now do something with variable and value
}
If you want to replace the variables in the original text, you should use the Matcher.appendReplacement() / Matcher.appendTail() solution:
Pattern pattern = Pattern.compile("\\$while\\((.*?)\\)(.*?)\\$endWhile", Pattern.DOTALL);
Matcher matcher = pattern.matcher(yourString);
StringBuffer sb = new StringBuffer();
while(matcher.find()){
String variable = matcher.group(1); // this will include the $
String value = matcher.group(2);
// now do something with variable and value
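// quoteReplacement keeps any $ or \ inside value from being read as a group reference
matcher.appendReplacement(sb, Matcher.quoteReplacement(value));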
}
matcher.appendTail(sb);
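As a fuller illustration of "do something with variable and value", the sketch below expands each block by repeating its body. The repetition rule and the counts map are purely hypothetical, since the question does not say how $variable should be processed. The class and method names are invented as well.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhileBlockSketch {
    private static final Pattern BLOCK =
            Pattern.compile("\\$while\\((.*?)\\)(.*?)\\$endWhile", Pattern.DOTALL);

    // Hypothetical rule: each block body is repeated as many times as the count
    // stored for its variable (0 times if the variable is unknown).
    static String expand(String template, Map<String, Integer> counts) {
        Matcher matcher = BLOCK.matcher(template);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            String variable = matcher.group(1); // e.g. "$variable", including the $
            String body = matcher.group(2);
            int times = counts.getOrDefault(variable, 0);
            StringBuilder repeated = new StringBuilder();
            for (int i = 0; i < times; i++) {
                repeated.append(body);
            }
            // quoteReplacement: the body may itself contain $ characters.
            matcher.appendReplacement(sb, Matcher.quoteReplacement(repeated.toString()));
        }
        matcher.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("$variable", 2);
        // prints: Intro. Row. Row.  Outro.
        System.out.println(expand("Intro. $while($variable)Row. $endWhile Outro.", counts));
    }
}
```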
Reference:
Methods of the Pattern Class (Sun Java Tutorial)
Methods of the Matcher Class (Sun Java Tutorial)
Pattern JavaDoc
Matcher JavaDoc
public class PatternInString {
static String testcase1 = "what i meant here";
static String testcase2 = "here";
public static void main(String args[])throws StringIndexOutOfBoundsException{
PatternInString testInstance= new PatternInString();
boolean result = testInstance.occurs(testcase1,testcase2);
System.out.println(result);
}
//write your code here
public boolean occurs(String str1, String str2)throws StringIndexOutOfBoundsException
{ int i;
boolean result=false;
int num7=str1.indexOf(" ");
int num8=str1.lastIndexOf(" ");
String str6=str1.substring(num8+1);
String str5=str1.substring(0,num7);
if(str5.equals(str2))
{
result=true;
}
else if(str6.equals(str2))
{
result=true;
}
int num=-1;
try
{
for(i=0;i<str1.length()-1;i++)
{ num=num+1;
num=str1.indexOf(" ",num);
int num1=str1.indexOf(" ",num+1);
String str=str1.substring(num+1,num1);
if(str.equals(str2))
{
result=true;
break;
}
}
}
catch(Exception e)
{
}
return result;
}
}
