How to grab text from a messy string in Java?

I am reading a text file which contains movie titles, year, language etc.
I am trying to grab those attributes.
Suppose some of the strings look like this:
String s = "\"A Fatal Inversion\" (1992)";
String d = "(aka \"Verhngnisvolles Erbe\" (1992)) (Germany)";
String f = "\"#Yaprava\" (2013) ";
String g = "(aka \"Love Heritage\" (2002)) (International: English title)";
How can I grab the title, the year, the country (if specified), and the kind of title (if specified) from these?
I am not very good with regex and patterns, and I don't know how to tell what kind of attribute a value is when it is not labelled. I am doing this because I am trying to generate XML from a text file. I have the DTD for it, but I'm not sure I need to use it in this case.
Edit: Here is what I have tried.
String pattern;
Pattern p = Pattern.compile("\"([^\"]*)\"");
Matcher m;
Pattern number = Pattern.compile("\\d+");
Matcher num;
m = p.matcher(s);
num = number.matcher(s);
if(m.find()){
System.out.println(m.group(1));
}
if(num.find()){
System.out.println(num.group(0));
}

I suggest you extract the year first as this seems fairly consistent. Then I'd extract the country (if present) and the rest I'll assume is the title.
For extracting the countries I'd recommend you hardcode a regex pattern with the names of known countries. It might take some iterating to determine what these are as they seem to be pretty inconsistent.
This code is a bit ugly (but then so is the data!):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Extraction {
public final String original;
public String year = "";
public String title = "";
public String country = "";
private String remaining;
public Extraction(String s) {
this.original = s;
this.remaining = s;
extractBracketedYear();
extractBracketedCountry();
this.title = remaining;
}
private void extractBracketedYear() {
Matcher matcher = Pattern.compile(" ?\\(([0-9]+)\\) ?").matcher(remaining);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
this.year = matcher.group(1);
matcher.appendReplacement(sb, "");
}
matcher.appendTail(sb);
remaining = sb.toString();
}
private void extractBracketedCountry() {
Matcher matcher = Pattern.compile("\\((Germany|International: English.*?)\\)").matcher(remaining);
StringBuffer sb = new StringBuffer();
while (matcher.find()) {
this.country = matcher.group(1);
matcher.appendReplacement(sb, "");
}
matcher.appendTail(sb);
remaining = sb.toString();
}
public static void main(String... args) {
for (String s : new String[] {
"A Fatal Inversion (1992)",
"(aka \"Verhngnisvolles Erbe\" (1992)) (Germany)",
"\"#Yaprava\" (2013) ",
"(aka \"Love Heritage\" (2002)) (International: English title)"}) {
Extraction extraction = new Extraction(s);
System.out.println("title = " + extraction.title);
System.out.println("country = " + extraction.country);
System.out.println("year = " + extraction.year);
System.out.println();
}
}
}
Produces:
title = A Fatal Inversion
country =
year = 1992
title = (aka "Verhngnisvolles Erbe")
country = Germany
year = 1992
title = "#Yaprava"
country =
year = 2013
title = (aka "Love Heritage")
country = International: English title
year = 2002
Once you've got this data, you can manipulate it further (e.g. "International: English title" -> "England").
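As a rough alternative sketch, the four sample lines can also be matched with one anchored pattern, and the trailing note split on its first colon to separate a region from the kind of title. The class name `TitleLineParser`, the method `parse`, and the assumption that every line carries a four-digit year in parentheses are mine, not part of the answer above; real data will probably need more cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: the names and the exact pattern are illustrative assumptions.
class TitleLineParser {
    // Optional "(aka ", optional quotes around the title, "(yyyy)",
    // optional closing ")" of the aka group, optional trailing "(note)".
    private static final Pattern LINE = Pattern.compile(
        "^\\s*(\\(aka\\s+)?\"?([^\"()]+?)\"?\\s*\\((\\d{4})\\)\\)?\\s*(?:\\(([^)]*)\\))?\\s*$");

    /** Returns {title, year, region, titleType, isAka}, or null if the line does not fit. */
    static String[] parse(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String note = m.group(4) == null ? "" : m.group(4).trim();
        int colon = note.indexOf(':');
        // "International: English title" -> region "International", type "English title"
        String region = colon < 0 ? note : note.substring(0, colon).trim();
        String titleType = colon < 0 ? "" : note.substring(colon + 1).trim();
        return new String[] {
            m.group(2).trim(), m.group(3), region, titleType,
            String.valueOf(m.group(1) != null)
        };
    }

    public static void main(String[] args) {
        String[] samples = {
            "\"A Fatal Inversion\" (1992)",
            "(aka \"Verhngnisvolles Erbe\" (1992)) (Germany)",
            "\"#Yaprava\" (2013) ",
            "(aka \"Love Heritage\" (2002)) (International: English title)"
        };
        for (String s : samples) {
            System.out.println(String.join(" | ", parse(s)));
        }
    }
}
```

Lines that do not match return null, so they can be logged and the pattern extended as new variants turn up.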

Related

Separate into column without using split function

I am trying to separate these values into ID, FullName and Phone. I know we can split them using Java's split function, but is there any other way to separate them? Values:
1 Peater John 2522523254
10 Neal Tom 2522523254
11 Tom Jackson 2522523254
111 Jack Smith 2522523254
12 Brownson Black 2522523254
I tried to use the substring method, but it doesn't work properly.
String id = line.substring(0, 3);
If I do this, it works up to the 4th line, but the others don't come out right.
If it is fixed length you can use String.substring(). But you should also trim() the result before you try to convert it to numeric:
String idTxt=line.substring(0,4);
Long id=Long.parseLong(idTxt.trim());
String name=line.substring(5,25).trim(); // or whatever the size is of name column.
You can use a regex with Pattern and Matcher:
Pattern pattern = Pattern.compile("(\\d+)\\s+(.+?)\\s+(\\d+)");
Matcher matcher = pattern.matcher(content);
if (matcher.find()) {
// group(0) is the whole match; the captured fields start at group(1)
String id = matcher.group(1);
String name = matcher.group(2);
String phone = matcher.group(3);
}
package Generic;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
class Main
{
public static void main(String[] args)
{
String txt=" 12 Brownson Black 2522523254";
String re1=".*?"; // Non-greedy match on filler
String re2="(\\d+)"; // Integer Number 1
String re3="(\\s+)"; // White Space 1
String re4="((?:[a-z][a-z]+))"; // Word 1
String re5="(\\s+)"; // White Space 2
String re6="((?:[a-z][a-z]+))"; // Word 2
String re7="(\\s+)"; // White Space 3
String re8="(\\d+)"; // Integer Number 2
Pattern p = Pattern.compile(re1+re2+re3+re4+re5+re6+re7+re8,Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher m = p.matcher(txt);
if (m.find())
{
int id = Integer.parseInt(m.group(1));
String name =m.group(3) + " ";
name = name+m.group(5);
long phone = Long.parseLong(m.group(7));
System.out.println(id);
System.out.println(name);
System.out.println(phone);
}
}
}
What about this:
int first_space;
int last_space;
first_space = my_string.indexOf(' ');
last_space = my_string.lastIndexOf(' ');
if ((first_space > 0) && (last_space > first_space))
{
long id;
String full_name;
String phone;
id = Long.parseLong(my_string.substring(0, first_space));
full_name = my_string.substring(first_space + 1, last_space);
phone = my_string.substring(last_space + 1);
}
Use a regexp:
private static final Pattern RE = Pattern.compile(
"^\\s*(\\d+)\\s+(\\S+(?: \\S+)*)\\s+(\\d+)\\s*$");
Matcher matcher = RE.matcher(s);
if (matcher.matches()) {
System.out.println("ID: " + matcher.group(1));
System.out.println("FullName: " + matcher.group(2));
System.out.println("Phone: " + matcher.group(3));
}
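To see how that pattern behaves on the rows from the question, here is a small self-contained sketch. The class name `RowSplitter` and the method `splitRow` are made up for illustration; the pattern itself is the one from the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern above; the names are illustrative.
class RowSplitter {
    private static final Pattern ROW = Pattern.compile(
        "^\\s*(\\d+)\\s+(\\S+(?: \\S+)*)\\s+(\\d+)\\s*$");

    /** Returns {id, fullName, phone}, or null when the row does not match. */
    static String[] splitRow(String row) {
        Matcher m = ROW.matcher(row);
        return m.matches() ? new String[] { m.group(1), m.group(2), m.group(3) } : null;
    }

    public static void main(String[] args) {
        String[] rows = {
            "1 Peater John 2522523254",
            "10 Neal Tom 2522523254",
            "11 Tom Jackson 2522523254",
            "111 Jack Smith 2522523254",
            "12 Brownson Black 2522523254"
        };
        for (String row : rows) {
            String[] cols = splitRow(row);
            System.out.println(cols == null ? "no match: " + row : String.join(" | ", cols));
        }
    }
}
```

Because the first and last groups only accept digits, the middle group ends up with the whole name regardless of how many words it contains, as long as the words are separated by single spaces.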
You can use a StringTokenizer for this. You won't have to worry about amount of spaces and/or tabs before or after your values, and no need for complex regex expressions:
String line = " 1 Peater John\t2522523254 ";
StringTokenizer st = new StringTokenizer(line, " \t");
String id = "";
String name = "";
String phone = "";
// The first token is your id, you can parse it to an int if you like or need it
if(st.hasMoreTokens()) {
id = st.nextToken();
}
// Loop over the remaining tokens
while(st.hasMoreTokens()) {
String token = st.nextToken();
// As long as there are other tokens, you're processing the name
if(st.hasMoreTokens()) {
if(name.length() > 0) {
name = name + " ";
}
name = name + token;
}
// If there are no more tokens, you've reached the phone number
else {
phone = token;
}
}
System.out.println(id);
System.out.println(name);
System.out.println(phone);

How to split the String by symbol name and date in this case

I have got a String in this format
FUTSTKACC28-APR-2016
ACC is a symbol and 28-APR-2016 is an expiry date.
FUTSTK is a predefined word.
How can I retrieve the symbol and the date in this case?
For example, how do I get
ACC
and
28-APR-2016
Some sample data:
FUTSTKACC26-MAY-2016
FUTSTKACC28-APR-2016
FUTSTKACC30-JUN-2016
FUTSTKADANIENT26-MAY-2016
FUTSTKADANIENT28-APR-2016
FUTSTKADANIENT30-JUN-2016
You have a fixed length prefix word and a fixed length date. You can remove the prefix, and then take the substrings from the right by the 11 characters in your dates. Something like,
String[] sample = { "FUTSTKACC26-MAY-2016", "FUTSTKACC28-APR-2016",
"FUTSTKACC30-JUN-2016", "FUTSTKADANIENT26-MAY-2016",
"FUTSTKADANIENT28-APR-2016", "FUTSTKADANIENT30-JUN-2016" };
String predefWord = "FUTSTK";
for (String input : sample) {
if (input.startsWith(predefWord)) {
input = input.substring(predefWord.length());
// There are 11 characters in the date format
String symbol = input.substring(0, input.length() - 11);
String dateStr = input.substring(input.length() - 11);
System.out.printf("symbol=%s, date=%s%n", symbol, dateStr);
}
}
Output is
symbol=ACC, date=26-MAY-2016
symbol=ACC, date=28-APR-2016
symbol=ACC, date=30-JUN-2016
symbol=ADANIENT, date=26-MAY-2016
symbol=ADANIENT, date=28-APR-2016
symbol=ADANIENT, date=30-JUN-2016
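If the expiry needs to be a real date rather than a string, the same right-hand slice can be fed to a parser. The following is only a sketch, assuming Java 8+ and `java.time`; the class `ContractCode` and its method names are hypothetical. `parseCaseInsensitive()` is used because the month abbreviations in the data are upper case.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.util.Locale;

// Hypothetical helper; the class and method names are illustrative assumptions.
class ContractCode {
    private static final String PREFIX = "FUTSTK";
    private static final int DATE_LENGTH = 11; // length of "dd-MMM-yyyy"

    // Case-insensitive so that "APR" is accepted as well as "Apr".
    private static final DateTimeFormatter EXPIRY = new DateTimeFormatterBuilder()
        .parseCaseInsensitive()
        .appendPattern("dd-MMM-yyyy")
        .toFormatter(Locale.ENGLISH);

    /** Everything between the fixed prefix and the fixed-length date. */
    static String symbolOf(String code) {
        if (!code.startsWith(PREFIX) || code.length() <= PREFIX.length() + DATE_LENGTH) {
            throw new IllegalArgumentException("unexpected format: " + code);
        }
        return code.substring(PREFIX.length(), code.length() - DATE_LENGTH);
    }

    /** The last 11 characters parsed as a date. */
    static LocalDate expiryOf(String code) {
        return LocalDate.parse(code.substring(code.length() - DATE_LENGTH), EXPIRY);
    }

    public static void main(String[] args) {
        String code = "FUTSTKADANIENT26-MAY-2016";
        System.out.println(symbolOf(code) + " expires " + expiryOf(code));
    }
}
```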
Something like this should work:
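final String PATTERN = "(FUTSTK)(.+)(\\d\\d-\\w\\w\\w-\\d\\d\\d\\d)";
Pattern p = Pattern.compile(PATTERN);
Matcher m = p.matcher("FUTSTKACC28-APR-2016");
if (m.matches()) {
// group(1) is the FUTSTK prefix, group(2) the symbol, group(3) the date
String symbol = m.group(2);
DateFormat format = new SimpleDateFormat("dd-MMM-yyyy", Locale.ENGLISH);
// parse() throws ParseException if the date text is malformed
Date date = format.parse(m.group(3));
}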
final String str = "FUTSTKACCCCCCC28-APR-2016";
final String[] strArr = str.split("-");
// The two characters before the first '-' are the day of the month
final String day = strArr[0].substring(strArr[0].length() - 2);
// Everything before the day; note this still includes the FUTSTK prefix
final String word = strArr[0].substring(0, strArr[0].length() - 2);
System.out.println("word: " + word);
System.out.println("date: " + day + "-" + strArr[1] + "-" + strArr[2]);
A regex approach (bits stolen from @ElliottFrisch), assuming you know the predefined word:
String[] sample = { "FUTSTKACC26-MAY-2016", "FUTSTKACC28-APR-2016",
"FUTSTKACC30-JUN-2016", "FUTSTKADANIENT26-MAY-2016",
"FUTSTKADANIENT28-APR-2016", "FUTSTKADANIENT30-JUN-2016" };
String predefined = "FUTSTK";
Pattern p = Pattern.compile(Pattern.quote(predefined) + "(\\w+)(\\d\\d-\\w\\w\\w-\\d\\d\\d\\d)");
for (String s: sample) {
Matcher m = p.matcher(s);
if (m.matches()) {
System.out.println(m.group(1) + " " + m.group(2));
}
}
output:
ACC 26-MAY-2016
ACC 28-APR-2016
ACC 30-JUN-2016
ADANIENT 26-MAY-2016
ADANIENT 28-APR-2016
ADANIENT 30-JUN-2016

How to match the text file against multiple regex patterns and count the number of occurences of these patterns?

I want to find and count all the occurrences of the words unit, device, method, module in every line of the text file separately. This is what I've done so far, but I don't know how to use multiple patterns or how to count the occurrences of each word in a line separately. Right now it only counts the occurrences of all the words together for each line. Thank you in advance!
private void countPaterns() throws IOException {
Pattern nom = Pattern.compile("unit|device|method|module|material|process|system");
String str = null;
BufferedReader r = new BufferedReader(new FileReader("D:/test/test1.txt"));
while ((str = r.readLine()) != null) {
Matcher matcher = nom.matcher(str);
int countnomen = 0;
while (matcher.find()) {
countnomen++;
}
//intList.add(countnomen);
System.out.println(countnomen + " davon ist das Wort System");
}
r.close();
//return intList;
}
Better to use a word boundary and use a map to keep counts of each matched keyword.
Pattern nom = Pattern.compile("\\b(unit|device|method|module|material|process|system)\\b");
String str = null;
BufferedReader r = new BufferedReader(new FileReader("D:/test/test1.txt"));
Map<String, Integer> counts = new HashMap<>();
while ((str = r.readLine()) != null) {
Matcher matcher = nom.matcher(str);
while (matcher.find()) {
String key = matcher.group(1);
int c = 0;
if (counts.containsKey(key))
c = counts.get(key);
counts.put(key, c+1);
}
}
r.close();
System.out.println(counts);
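The code above keeps one map for the whole file. Since the question asks for counts per line, a possible variation is to build a fresh map for each line. This is only a sketch: the class `PerLineCounts` and the method `countLine` are hypothetical names, and the sample lines stand in for the file contents.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names and the sample lines are illustrative assumptions.
class PerLineCounts {
    private static final Pattern WORDS = Pattern.compile(
        "\\b(unit|device|method|module|material|process|system)\\b");

    /** Counts each keyword in a single line; keywords that do not occur are absent. */
    static Map<String, Integer> countLine(String line) {
        Map<String, Integer> counts = new TreeMap<>(); // sorted keys, stable printing
        Matcher m = WORDS.matcher(line);
        while (m.find()) {
            counts.merge(m.group(1), 1, Integer::sum); // add 1, or start at 1
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] lines = {
            "the unit and the device talk to another unit",
            "method, module, method",
            "nothing relevant here"
        };
        for (int i = 0; i < lines.length; i++) {
            System.out.println("line " + (i + 1) + ": " + countLine(lines[i]));
        }
    }
}
```

When reading from a file, the same `countLine` call would go inside the `readLine()` loop, with the result printed or stored per line.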
Here's a Java 9 (and above) solution:
public static void main(String[] args) {
List<String> expressions = List.of("(good)", "(bad)");
String phrase = " good bad bad good good bad bad bad";
for (String regex : expressions) {
Pattern gPattern = Pattern.compile(regex);
Matcher matcher = gPattern.matcher(phrase);
long count = matcher.results().count();
System.out.println("Pattern \"" + regex + "\" appears " + count + (count == 1 ? " time" : " times"));
}
}
Outputs
Pattern "(good)" appears 3 times
Pattern "(bad)" appears 5 times

Words inside square brackets - RegExp

String linkPattern = "\\[[A-Za-z_0-9]+\\]";
String text = "[build]/directory/[something]/[build]/";
RegExp reg = RegExp.compile(linkPattern,"g");
MatchResult matchResult = reg.exec(text);
for (int i = 0; i < matchResult.getGroupCount(); i++) {
System.out.println("group" + i + "=" + matchResult.getGroup(i));
}
I am trying to get all the blocks that are enclosed in square brackets from a path string, using the code above,
but I only get group0="[build]". What I want is:
1:"[build]" 2:"[something]" 3:"[build]"
EDIT:
Just to be clear, the words inside the brackets are randomly generated text:
public static String genText()
{
final int LENGTH = (int)(Math.random()*12)+4;
StringBuffer sb = new StringBuffer();
for (int x = 0; x < LENGTH; x++)
{
sb.append((char)((int)(Math.random() * 26) + 97));
}
String str = sb.toString();
str = str.substring(0,1).toUpperCase() + str.substring(1);
return str;
}
EDIT 2:
JDK works fine, GWT RegExp gives this problem
SOLVED:
Answer from Didier L
String linkPattern = "\\[[A-Za-z_0-9]+\\]";
String result = "";
String text = "[build]/directory/[something]/[build]/";
RegExp reg = RegExp.compile(linkPattern,"g");
MatchResult matchResult = null;
while((matchResult=reg.exec(text)) != null){
if(matchResult.getGroupCount()==1)
System.out.println( matchResult.getGroup(0));
}
I don't know which regex library you are using but using the one from the JDK it would go along the lines of
String linkPattern = "\\[[A-Za-z_0-9]+\\]";
String text = "[build]/directory/[something]/[build]/";
Pattern pat = Pattern.compile(linkPattern);
Matcher mat = pat.matcher(text);
while (mat.find()) {
System.out.println(mat.group());
}
Output:
[build]
[something]
[build]
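If only the words are wanted, without the surrounding brackets, a capturing group can be added to the same pattern. This is a minimal sketch using the JDK regex classes (not GWT's RegExp); the class `BracketWords` and the method `wordsIn` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper built on java.util.regex; the names are illustrative.
class BracketWords {
    // Same character class as in the question, wrapped in a capturing group.
    private static final Pattern BRACKETED = Pattern.compile("\\[([A-Za-z_0-9]+)\\]");

    /** Returns the text inside every [..] block, in order of appearance. */
    static List<String> wordsIn(String path) {
        List<String> words = new ArrayList<>();
        Matcher m = BRACKETED.matcher(path);
        while (m.find()) {
            words.add(m.group(1)); // group(1) excludes the brackets
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(wordsIn("[build]/directory/[something]/[build]/"));
    }
}
```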
Try:
String linkPattern = "(\\[[A-Za-z_0-9]+\\])*";
EDIT:
Second try:
String linkPattern = "\\[(\\w+)\\]+";
Third try, see http://rubular.com/r/eyAQ3Vg68N

Java regex failing

I have a string in this format:
;1=2011-10-23T16:16:53+0530;2=2011-10-23T16:16:53+0530;3=2011-10-23T16:16:53+0530;4=2011-10-23T16:16:53+0530;
I have written the following code to extract the string 2011-10-23T16:16:53+0530 from ;1=2011-10-23T16:16:53+0530;
Pattern pattern = Pattern.compile("(;1+)=(\\w+);");
String strFound= "";
Matcher matcher = pattern.matcher(strindData);
while (matcher.find()) {
strFound= matcher.group(2);
}
But it is not working as expected. Can you please give me any hint?
"Can you please give me any hint?"
Yes. Neither -, nor :, nor + are part of \w.
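One way to act on that hint is to replace `\\w+` with "anything up to the next semicolon". The sketch below assumes the field layout shown in the question; the class `FieldLookup` and the method `valueFor` are hypothetical names, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the names are illustrative assumptions.
class FieldLookup {
    /** Returns the text between ";<key>=" and the next ';', or null if the key is absent. */
    static String valueFor(String data, int key) {
        // [^;]+ accepts '-', ':' and '+', which \w does not
        Matcher m = Pattern.compile(";" + key + "=([^;]+);").matcher(data);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String data = ";1=2011-10-23T16:16:53+0530;2=2011-10-23T16:16:53+0530;";
        System.out.println(valueFor(data, 1));
    }
}
```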
Do you have to use a regex? Why not call String.split() to break up the string on semi-colon boundaries. Then call it again to break up the chunks by the equals sign. At that point you'll have an integer and the date in string form. From there you can parse the date string.
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
public final class DateScan {
private static final String INPUT = ";1=2011-10-23T16:16:53+0530;2=2011-10-23T16:16:53+0530;3=2011-10-23T16:16:53+0530;4=2011-10-23T16:16:53+0530;";
public static void main(final String... args) {
final SimpleDateFormat parser = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ssZ");
final String[] pairs = INPUT.split(";");
for (final String pair : pairs) {
if ("".equals(pair)) {
continue;
}
final String[] integerAndDate = pair.split("=");
final Integer integer = Integer.parseInt(integerAndDate[0]);
final String dateString = integerAndDate[1];
try {
final Date date = parser.parse(dateString);
System.out.println(integer + " -> " + date);
} catch (final ParseException pe) {
System.err.println("bad date: " + dateString + ": " + pe);
}
}
}
}
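On Java 8 or later, the same split-then-parse idea can keep the +0530 offset by parsing into `java.time.OffsetDateTime` instead of `Date`. This is a sketch under that assumption; the class `OffsetDateScan` and its `parse` method are hypothetical names, and the pattern string is the one used above.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical java.time variant of the split approach; the names are illustrative.
class OffsetDateScan {
    // A single 'Z' matches an offset without a colon, such as +0530.
    private static final DateTimeFormatter FORMAT =
        DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ssZ");

    /** Maps each integer key to its parsed timestamp, in input order. */
    static Map<Integer, OffsetDateTime> parse(String input) {
        Map<Integer, OffsetDateTime> result = new LinkedHashMap<>();
        for (String pair : input.split(";")) {
            if (pair.trim().isEmpty()) {
                continue; // skip the empty chunk before the leading ';'
            }
            String[] keyAndDate = pair.split("=", 2);
            result.put(Integer.parseInt(keyAndDate[0].trim()),
                       OffsetDateTime.parse(keyAndDate[1].trim(), FORMAT));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse(";1=2011-10-23T16:16:53+0530;2=2011-10-23T16:17:53+0530;"));
    }
}
```

A malformed date makes `OffsetDateTime.parse` throw `DateTimeParseException`, which callers may want to catch per pair, much like the `ParseException` handling above.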
I've changed the input a bit, but only for presentation reasons.
You can try this:
String input = " ;1=2011-10-23T16:16:53+0530; 2=2011-10-23T16:17:53+0530;3=2011-10-23T16:18:53+0530;4=2011-10-23T16:19:53+0530;";
Pattern p = Pattern.compile("(;\\d+?)?=(.+?);");
Matcher m = p.matcher(input);
while(m.find()){
System.out.println(m.group(2));
}
