How can I parse this kind of text?


This is the format my text is in:
15no16no17yes the parents who have older children always tell you the next stage is worse.18yes using only their hands and feet make some of the worst movies in the history of the world.19no
So the basic format is this:
number yes|no text(may/may not be there) repeated
The text after yes or no can be empty, or can start with a space. (I have tried to illustrate this above).
The code I have works for this format:
number yes|no repeated
More examples of text to parse:
30no31yesapproximately 278 billion miles from anything.32no33no34no
30no31yesapproximately 278 billion miles from anything32no33yessince the invention of call waiting34yesGravity is a contributing factor in 73 percent of all accidents involving falling objects.
35yesanybody who owns hideous clothing36yes if you take it from another person's plate37yes172 miles per hour upside down38yesonly more intelligent39yes any product including floor wax that has fat in it
35no36yestake it from another person's plate37yes172 miles per hour upside down38no39no
35no36no37yes172 miles per hour38no39no
35no36no37yesupside down38no39no
How do I modify my code?
String regex = "^(\\d+)(yes|no)";
Pattern p = Pattern.compile(regex);
while(input.hasNextLine()) {
String line = input.nextLine();
String myStr = line;
Matcher m = p.matcher(myStr);
while(m.find()) {
String all = m.group();
String digits = m.group(1);
String bool = m.group(2);
// do stuff
myStr = myStr.substring(all.length());
m.reset(myStr);
} // end while
} // end while
I tried using String regex = "^(\\d+)(yes|no)(.*)"; but the problem is that it captures everything after a yes or no.
What do I do?
PS: Please let me know if anything is unclear and I'll provide more explanations.

Try this; I think it will work. At the end of parsing you will have a List of Answer objects, so you only need to modify the code to return this List and use its contents. My algorithm detects every answer marker and where it starts in the main String, then uses that information to slice the text. So the algorithm has two steps (1: bounds detection, 2: string slicing). I added some comments to the code. Hope it works.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
*
* @author David Buzatto
*/
public class ASimpleParser {
public static void main( String[] args ) {
new ASimpleParser().exec();
}
public void exec() {
String[] in = {
"30no31yesapproximately 278 billion miles from anything.32no33no34no",
"30no31yesapproximately 278 billion miles from anything32no33yessince the invention of call waiting34yesGravity is a contributing factor in 73 percent of all accidents involving falling objects.",
"35yesanybody who owns hideous clothing36yes if you take it from another person's plate37yes172 miles per hour upside down38yesonly more intelligent39yes any product including floor wax that has fat in it",
"35no36yestake it from another person's plate37yes172 miles per hour upside down38no39no",
"35no36no37yes172 miles per hour38no39no",
"35no36no37yesupside down38no39no"
};
Pattern p = Pattern.compile( "(\\d+)(yes|no)" );
List<Answer> allAnswers = new ArrayList<Answer>();
for ( String s : in ) {
List<Answer> answers = new ArrayList<Answer>();
Matcher m = p.matcher( s );
// step 1: detecting answer bounds (start)
while ( m.find() ) {
Answer a = new Answer();
a.answerStart = m.group();
a.number = m.group( 1 );
a.yesOrNo = m.group( 2 );
a.startAt = m.start(); // position of this match; s.indexOf( a.answerStart ) could hit an earlier identical substring
answers.add( a );
}
// step 2: slicing
for ( int i = 0; i < answers.size(); i++ ) {
Answer a = answers.get( i );
// needs to compare to the next answer (to the right), which provides the right bound
if ( i != answers.size() - 1 ) {
Answer rightAnswer = answers.get( i + 1 );
a.text = s.substring( a.startAt + a.answerStart.length(), rightAnswer.startAt );
} else { // in the last answer, the right bound is the end of the main String; s.length() may be omitted.
a.text = s.substring( a.startAt + a.answerStart.length(), s.length() );
}
}
allAnswers.addAll( answers );
}
// just iterating over the answers to show them.
for ( Answer a : allAnswers ) {
System.out.println( a );
}
}
// a private class to contain the answers data
private class Answer {
String answerStart;
String number;
String yesOrNo;
String text;
int startAt;
@Override
public String toString() {
return "Answer{" + "number=" + number + ", answer=" + yesOrNo + ", text=" + text + ", startAt=" + startAt + '}';
}
}
}
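As an alternative to the two-step slicing above, the question's own attempt with (.*) can be kept if the text group is made lazy and bounded by a lookahead. This is only a sketch: the class name LazyAnswerParser and the method parse are made up for illustration, and it assumes the free text never itself contains a number directly followed by "yes" or "no".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyAnswerParser {
    // number, yes|no, then as little text as possible up to the
    // next "number yes|no" marker or the end of the line
    private static final Pattern ENTRY =
            Pattern.compile("(\\d+)(yes|no)(.*?)(?=\\d+(?:yes|no)|$)");

    /** Returns one {number, yesOrNo, text} triple per entry found in the line. */
    static List<String[]> parse(String line) {
        List<String[]> entries = new ArrayList<>();
        Matcher m = ENTRY.matcher(line);
        while (m.find()) {
            entries.add(new String[] { m.group(1), m.group(2), m.group(3) });
        }
        return entries;
    }

    public static void main(String[] args) {
        String line = "35no36yestake it from another person's plate37yes172 miles per hour upside down38no39no";
        for (String[] e : parse(line)) {
            System.out.println(e[0] + " | " + e[1] + " | '" + e[2] + "'");
        }
    }
}
```

The lazy quantifier (.*?) stops at the first position where the lookahead sees the next marker, so "172 miles per hour" stays inside the text of entry 37 because "172" is not followed by yes or no.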

Related

How to print a substring with only the matching elements of a string?

Given a String that lists metadata about a book line by line, how do I print out only the lines that match the data I am looking for?
To do this, I've been trying to create a substring for each line using indexes. Each substring starts at the beginning of a line and ends before a "\n". I have not seen lists, arrays or BufferedReader yet.
For each substring, I check whether it contains my pattern. If it does, I add it to a string that holds only my results.
Here is an example of my list (in French); I'd like to match, say, all the books written in 2017.
Origine D. Brown 2017 Thriller Policier
Romance et de si belles fiancailles M. H. Clark 2018 thriller policier Romance
La fille du train P. Hawkins 2015 Policier
There is a flaw in how I am doing this and I am stuck with an IndexOutOfBounds exception that I can't figure out. Definitely new in creating algorithms like this.
public static String search() {
String list;
int indexLineStart = 0;
int indexLineEnd = list.indexOf("\n");
int indexFinal = list.length()-1;
String listToPrint = "";
while (indexLineStart <= indexFinal){
String listCheck = list.substring(indexLineStart, indexLineEnd);
if (listCheck.contains(dataToMatch)){
listToPrint = listToPrint + "\n" + listCheck;
}
indexLineStart = indexLineEnd +1 ;
indexLineEnd = list.indexOf("\n", indexLineStart);
}
return listeToPrint;
}

Regardless of the comments about using split() and String[], which do have merit :-)
The IndexOutOfBounds exception I believe is being caused by the second of these two lines:
indexLineStart = indexLineEnd +1 ;
indexLineEnd = list.indexOf("\n", indexLineStart);
You want them swapped around (I believe).
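Another possible cause, for what it's worth: String.indexOf returns -1 when no further "\n" exists, and substring(start, -1) then throws. A minimal sketch that guards against this without arrays or lists is shown here; the class name LineFilter and the two parameters are assumptions, since the original method reads its data from variables that are not shown.

```java
class LineFilter {
    /** Returns the lines of list that contain dataToMatch, separated by '\n'. */
    static String search(String list, String dataToMatch) {
        String result = "";
        int lineStart = 0;
        while (lineStart < list.length()) {
            int lineEnd = list.indexOf('\n', lineStart);
            if (lineEnd == -1) {
                // No newline left: the last line runs to the end of the string.
                lineEnd = list.length();
            }
            String line = list.substring(lineStart, lineEnd);
            if (line.contains(dataToMatch)) {
                // Only put a separator between lines, not before the first one.
                result = result.isEmpty() ? line : result + "\n" + line;
            }
            lineStart = lineEnd + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String list = "Origine D. Brown 2017 Thriller Policier\n"
                + "La fille du train P. Hawkins 2015 Policier";
        System.out.println(search(list, "2017"));
    }
}
```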

You don't need this much logic with String.substring(). Instead, use String.split() to turn your string into an array in which each index holds one book. Then search each entry for your matching criterion and add the book to finalString if it matches.
Working Code:
public class stackString
{
public static void main(String[] args)
{
String list = "Origine D. Brown 2017 Thriller Policier\n Romance et de si belles fiancailles M. H. Clark 2018 thriller policier Romance\n La fille du train P. Hawkins 2015 Policier\n";
String[] listArray = list.split("\n"); // split into an array: one book per index
String finalString = ""; // accumulates the books that match the search
String matchCondition = "2017";
for(int i =0; i<listArray.length;i++)
if(listArray[i].contains(matchCondition))
finalString += listArray[i]+"\n";
System.out.println(finalString);
}
}

Here is a solution using pattern matching
public static List<String> search(String input, String keyword)
{
Pattern pattern = Pattern.compile(".*" + keyword + ".*");
Matcher matcher = pattern.matcher(input);
List<String> linesContainingKeyword = new LinkedList<>();
while (matcher.find())
{
linesContainingKeyword.add(matcher.group());
}
return linesContainingKeyword;
}
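One caveat with the pattern-matching version: the keyword is inserted into the regex as is, so a keyword such as "P." would also match "Po". A hedged variant that treats the keyword literally via Pattern.quote could look like this (the class name KeywordLines is made up for the example):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class KeywordLines {
    /** Returns every line of input that contains keyword as literal text. */
    static List<String> search(String input, String keyword) {
        // Pattern.quote escapes regex metacharacters in the keyword.
        // '.' does not match '\n' by default, so each match stays on one line.
        Pattern pattern = Pattern.compile(".*" + Pattern.quote(keyword) + ".*");
        Matcher matcher = pattern.matcher(input);
        List<String> lines = new ArrayList<>();
        while (matcher.find()) {
            lines.add(matcher.group());
        }
        return lines;
    }

    public static void main(String[] args) {
        String input = "Origine D. Brown 2017 Thriller Policier\n"
                + "La fille du train P. Hawkins 2015 Policier";
        System.out.println(search(input, "P."));
    }
}
```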

Since I wasn't allowed to use lists and arrays, I got this to be functional this morning.
public static String linesWithPattern (String pattern){
String library;
library = library + "\n"; // Add an end of line at the end of the text so the last line is parsed like the others.
String substring = "";
String substringWithPattern = "";
char endOfLine = '\n';
int nbrLines = countNbrLines(library, endOfLine); //Method to count number of '\n'
int lineStart = 0;
int lineEnd = 0;
for (int i = 0; i < nbrLines ; i++){
lineStart = lineEnd;
if (lineStart == 0){
lineEnd = library.indexOf('\n');
} else if (lineStart != 0){
lineEnd = library.indexOf('\n', (lineEnd + 1));
}
substring = library.substring(lineStart, lineEnd);
if (substring.toLowerCase().contains(pattern.toLowerCase())){
substringWithPattern = substring + substringWithPattern + '\n';
}
if (!library.toLowerCase().contains(pattern.toLowerCase())){
substringWithPattern = "\nNO ENTRY FOUND \n";
}
}
if (library.toLowerCase().contains(pattern.toLowerCase())){
substringWithPattern = "This or these books were found in the library \n" +
"--------------------------" + substringWithPattern;
}
return substringWithPattern;
}

An IndexOutOfBounds exception is thrown when an index lies outside the valid range of the string (or array). Going through your code, the exception most likely comes from the line below, where the value of indexLineEnd is probably outside the bounds of list, assuming list is not null (your code doesn't show the list variable being initialized).
String listCheck = list.substring(indexLineStart, indexLineEnd);
Run the application in debug mode to see the exact values passed to the method and understand why it throws the exception.
You need to be careful when calculating the value of indexLineEnd.

How to pattern match and transform string to generate certain output?

The code below takes input that has lots of whitespace between, before and after the important strings. So far I have been able to filter the whitespace out. After preparing the string, I want to process it.
Here is an example of the input I may get and the output I want:
Input
+--------------+
EDIT example.mv Starter web-onyx-01.example.net.mv
Notice the whitespace before and after the domain; the amount of whitespace can be considered random.
Output
+--------------+
example.mv. in ns web-onyx-01.example.net.mv.
In the output the important bit is the whitespace between the domain (Example.), the keyword (in), the keyword (ns) and the host (web-onyx-01.example.net.mv.).
Also notice the period (".") after the domain and the host. In addition, if the domain is a (.mv) ccTLD, we have to remove that part from the string.
What I would like to achieve is this transformation for multiple lines of text: I want to batch process an unordered, chaotic list of strings and produce clean-looking output.
The code is by no means good design, but it is what I have come up with so far. NOTE: I am a beginner who is still learning to program. I would like suggestions for improving the code as well as for solving the problem at hand, i.e. transforming the input into the desired output.
P.S The output is for zone files in DNS, so errors can be very problematic.
So far my code is accepting text from a textarea and outputs the text into another textarea which shows the output.
My code works for as long as the array length is 2 and 3 but fails at anything larger. So how do I go about being able to process the input to the output dynamically for as big as the list/array may become in the future?
String s = jTextArea1.getText();
Pattern p = Pattern.compile("ADD|EDIT|DELETE|Domain|Starter|Silver|Gold|ADSL Business|Pro|Lite|Standard|ADSL Multi|Pro Plus", Pattern.MULTILINE);
Matcher m = p.matcher(s);
s = m.replaceAll("");
String ms = s.replaceAll("(?m)(^\\s+|[\\t\\f ](?=[\\t\\f ])|[\\t\\f ]$|\\s+\\z)", "");
String[] last = ms.split(" ");
for (String test : last){
System.out.println(test);
}
System.out.println("The length of array is: " +last.length);
if (str.isContain(last[0], ".mv")) {
if (last.length == 2) {
for(int i = 0; i < last.length; i++) {
last[0] = last[0].replaceFirst(".mv", "");
System.out.println(last[0]);
last[i] += ".";
if (last[i] == null ? last[0] == null : last[i].equals(last[0])) {
last[i]+= " in ns ";
}
String str1 = String.join("", last);
jTextArea2.setText(str1);
System.out.println(str1);
}
}
else if (last.length == 3) {
for(int i = 0; i < last.length; i++) {
last[0] = last[0].replaceFirst(".mv", "");
System.out.println(last[0]);
last[i] += ".";
if (last[i] == null ? last[0] == null : last[i].equals(last[0])) {
last[i]+= " in ns ";
}
if (last[i] == null ? last[1] == null : last[i].equals(last[1])){
last[i] += "\n";
}
if (last[i] == null ? last[2] == null : last[i].equals(last[2])){
last[i] = last[0] + last[2];
}
String str1 = String.join("", last);
jTextArea2.setText(str1);
System.out.println(str1);
}
}
}

As I understand your question you have multiple lines of input in the following form:
whitespace[command]whitespace[domain]whitespace[label]whitespace[target-domain]whitespace
You want to convert that to the following form such that multiple lines are aligned nicely:
[domain]. in ns [target-domain].
To do that I'd suggest the following:
1. Split your input into multiple lines.
2. Use a regular expression to check the line format (e.g. for a valid command etc.) and extract the domains.
3. Store the maximum length of both domains separately.
4. Build a format string using the maximum lengths.
5. Iterate over the extracted domains and build a string for each line using the format defined in step 4.
Example:
String input = " EDIT domain1.mv Starter example.domain1.net.mv \n" +
" DELETE long-domain1.mv Silver long-example.long-domain1.net.mv \n" +
" ADD short-domain1.mv ADSL Business ex.sdomain1.net.mv \n";
//step 1: split the input into lines
String[] lines = input.split( "\n" );
//step 2: build a regular expression to check the line format and extract the domains - which are the (\S+) parts
Pattern pattern = Pattern.compile( "^\\s*(?:ADD|EDIT|DELETE)\\s+(\\S+)\\s+(?:Domain|Starter|Silver|Gold|ADSL Business|Pro|Lite|Standard|ADSL Multi|Pro Plus)\\s+(\\S+)\\s*$" );
List<String[]> lineList = new LinkedList<>();
int maxLengthDomain = 0;
int maxLengthTargetDomain = 0;
for( String line : lines )
{
//step 2: check the line
Matcher matcher = pattern.matcher( line );
if( matcher.matches() ) {
//step 2: extract the domains
String domain = matcher.group( 1 );
String targetDomain = matcher.group( 2 );
//step 3: get the maximum length of the domains
maxLengthDomain = Math.max( maxLengthDomain, domain.length() );
maxLengthTargetDomain = Math.max( maxLengthTargetDomain, targetDomain.length() );
lineList.add( new String[] { domain, targetDomain } );
}
}
//step 4: build the format string with variable lengths
String formatString = String.format( "%%-%ds in ns %%-%ds", maxLengthDomain + 5, maxLengthTargetDomain + 2 );
//step 5: build the output
for( String[] line : lineList ) {
System.out.println( String.format( formatString, line[0] + ".", line[1] + "." ) );
}
Result:
domain1.mv.           in ns example.domain1.net.mv.
long-domain1.mv.      in ns long-example.long-domain1.net.mv.
short-domain1.mv.     in ns ex.sdomain1.net.mv.
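Step 4 is the least obvious part, so here it is in isolation. The sketch below only illustrates how the doubled percent signs work; the class name AlignDemo and the helper buildFormat are invented for the example, and the widths 21 and 34 are the values the code above computes for the sample input.

```java
class AlignDemo {
    /** Builds a format such as "%-21s in ns %-34s" from the two column widths. */
    static String buildFormat(int domainWidth, int targetWidth) {
        // "%%" yields a literal '%', while "%d" is replaced by the width,
        // so the result is itself a format string with fixed column widths.
        return String.format("%%-%ds in ns %%-%ds", domainWidth, targetWidth);
    }

    public static void main(String[] args) {
        String format = buildFormat(21, 34);
        System.out.println(format); // prints %-21s in ns %-34s
        // The trailing '|' makes the padding of the second column visible.
        System.out.println(String.format(format, "domain1.mv.", "example.domain1.net.mv.") + "|");
    }
}
```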

Regex for start and end of sentence

Is there a way to match the start and end of a sentence in Java? The easiest case is a sentence ending with a simple dot (.). In some other cases it could end with a colon (:) or with an abbreviation followed by a colon (.:).
For example some random news text:
Cliffs have collapsed in New Zealand during an earthquake in the city
of Christchurch on the South Island. No serious damage or fatalities
were reported in the Valentine's Day quake that struck at 13:13 local
time. Based on the med. report everybody were ok.
My goal is to get an abbreviation or word plus its context, if possible only the sentence to which it belongs.
So a successful output for me would be something like this:
selected word -> collapsed
context -> Cliffs have collapsed in New Zealand during an earthquake in the city of Christchurch on the South Island.
selected word -> med.
context -> Based on the med. report everybody were ok.
Thanks

You can spot a sentence easily. It starts with a capital letter and ends with one of the characters .:!? followed either by a space and another capital letter or by the end of the whole string.
Compare "time. Based" with "med. report": only the first is followed by a capital letter, so only the first ends a sentence.
So the regex capturing the whole sentence should look like this:
([A-Z][a-z].*?[.:!?](?=$| [A-Z]))
Take a look! Regex101
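In Java, that regex could be used roughly as follows to pull out the sentence around a given word. This is a sketch, not a robust sentence splitter: the class name SentenceFinder and the method sentenceContaining are made up, and an abbreviation followed by a capitalized word (for example "Dr. Smith") would still be split incorrectly.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceFinder {
    // Capital letter, lowercase letter, then as little as possible up to
    // .:!? followed by the end of the input or by a space and a capital letter.
    private static final Pattern SENTENCE =
            Pattern.compile("[A-Z][a-z].*?[.:!?](?=$| [A-Z])");

    /** Returns the first sentence of text that contains word, or null if none does. */
    static String sentenceContaining(String text, String word) {
        Matcher m = SENTENCE.matcher(text);
        while (m.find()) {
            if (m.group().contains(word)) {
                return m.group();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String text = "Cliffs have collapsed in New Zealand during an earthquake in the "
                + "city of Christchurch on the South Island. No serious damage or "
                + "fatalities were reported in the Valentine's Day quake that struck "
                + "at 13:13 local time. Based on the med. report everybody were ok.";
        System.out.println(sentenceContaining(text, "collapsed"));
        System.out.println(sentenceContaining(text, "med."));
    }
}
```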

What you are looking for is a natural language processing toolkit. For Java you can use CoreNLP,
and there are already some example cases on its tutorials page.
You can certainly write a regex that looks for all characters in between the set of characters (.:? etc.), and it would look something like this:
\.*?(?=[\.\:])\
Then you would have to loop through the matched results and find the relevant sentences that contain your words. But I recommend you use an NLP library to achieve this.

The code:
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class Main {
public static void main( String[] args ) {
final Map<String, String> dict = new HashMap<>();
dict.put( "med", "medical" );
final String text =
"Cliffs have collapsed in New Zealand during an earthquake in the "
+ "city of Christchurch on the South Island. No serious damage or "
+ "fatalities were reported in the Valentine's Day quake that struck "
+ "at 13:13 local time. Based on the med. report everybody were ok.";
final Pattern p = Pattern.compile( "[^\\.]+\\W+(\\w+)\\." );
final Matcher m = p.matcher( text );
int pos = 0;
while(( pos < text.length()) && m.find( pos )) {
pos = m.end() + 1;
final String word = m.group( 1 );
if( dict.containsKey( word )) {
final String repl = dict.get( word );
final String beginOfSentence = text.substring( m.start(), m.end());
final String endOfSentence;
if( m.find( pos )) {
endOfSentence = text.substring( m.start() - 1, m.end());
}
else {
endOfSentence = text.substring( m.start() - 1);
}
System.err.printf( "Replace '%s.' in '%s%s' with '%s'\n",
word, beginOfSentence, endOfSentence, repl );
final String sentence =
( beginOfSentence + endOfSentence ).replace( word + ".", repl ); // literal replace: in replaceAll the '.' would match any character
System.err.println( sentence );
}
}
}
}
The execution:
Replace 'med.' in 'Based on the med. report everybody were ok.' with 'medical'
Based on the medical report everybody were ok.

Java output formatting for Strings

I was wondering if someone could show me how to use the format method for Java Strings.
For instance, suppose I always want my output columns to have the same width:
Name =              Bob
Age =               27
Occupation =        Student
Status =            Single
In this example, all the output values are neatly aligned under each other. How would I accomplish this with the format method?

System.out.println(String.format("%-20s= %s" , "label", "content" ));
Where %s is a placeholder for your string.
The '-' makes the result left-justified.
20 is the minimum width of the first string.
The output looks like this:
label               = content
As a reference I recommend Javadoc on formatter syntax

If you want a minimum of 4 characters, for instance,
System.out.println(String.format("%4d", 5));
// Results in "   5" (three spaces, then 5): a minimum of 4 characters
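For comparison, here is a small sketch of how the common width flags behave. The class name WidthDemo and the helper samples are made up; the brackets printed in main are only there to make the padding visible.

```java
class WidthDemo {
    /** Formats the same kind of value with different width flags. */
    static String[] samples() {
        return new String[] {
            String.format("%4d", 5),      // right-justified: "   5"
            String.format("%-4d", 5),     // left-justified:  "5   "
            String.format("%04d", 5),     // zero-padded:     "0005"
            String.format("%4d", 12345)   // width is only a minimum: "12345"
        };
    }

    public static void main(String[] args) {
        for (String s : samples()) {
            System.out.println("[" + s + "]");
        }
    }
}
```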

To answer your updated question you can do
String[] lines = ("Name = Bob\n" +
"Age = 27\n" +
"Occupation = Student\n" +
"Status = Single").split("\n");
for (String line : lines) {
String[] parts = line.split(" = +");
System.out.printf("%-19s %s%n", parts[0] + " =", parts[1]);
}
prints
Name =              Bob
Age =               27
Occupation =        Student
Status =            Single

EDIT: This is an extremely primitive answer, but I can't delete it because it was accepted. See the other answers for a better solution.
Why not just generate a whitespace string dynamically to insert into the statement?
So if you want them all to start on the 50th character...
String key = "Name =";
String space = "";
for(int i = 0; i<(50-key.length()); i++)
{space = space + " ";}
String value = "Bob\n";
System.out.println(key+space+value);
Put all of that in a loop and set the "key" and "value" variables before each iteration and you're golden. I would also use the StringBuilder class, which is more efficient.
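A minimal sketch of the StringBuilder variant mentioned here, assuming a hypothetical helper padRight and a column width of 20 chosen only for the demo:

```java
class PadDemo {
    /** Pads key with spaces on the right until it is at least width characters long. */
    static String padRight(String key, int width) {
        StringBuilder sb = new StringBuilder(key);
        while (sb.length() < width) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[][] rows = { { "Name =", "Bob" }, { "Age =", "27" },
                            { "Occupation =", "Student" }, { "Status =", "Single" } };
        for (String[] row : rows) {
            // Every value starts in the same column because each key is padded to 20.
            System.out.println(padRight(row[0], 20) + row[1]);
        }
    }
}
```

The result of padRight(key, 20) is the same as String.format("%-20s", key), so the loop is mainly useful for seeing what the format flag does.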

@Override
public String toString() {
return String.format("%15s %n %15d %n %15s %n %15s", name, age, Occupation, status);
}

For decimal values you can use DecimalFormat
import java.text.*;
public class DecimalFormatDemo {
static public void customFormat(String pattern, double value ) {
DecimalFormat myFormatter = new DecimalFormat(pattern);
String output = myFormatter.format(value);
System.out.println(value + " " + pattern + " " + output);
}
static public void main(String[] args) {
customFormat("###,###.###", 123456.789);
customFormat("###.##", 123456.789);
customFormat("000000.000", 123.78);
customFormat("$###,###.###", 12345.67);
}
}
and output will be:
123456.789 ###,###.### 123,456.789
123456.789 ###.## 123456.79
123.78 000000.000 000123.780
12345.67 $###,###.### $12,345.67
For more details look here:
http://docs.oracle.com/javase/tutorial/java/data/numberformat.html
