Can't match SRT subtitles using regex in Java - java

I am trying to parse an SRT subtitle file with this code:
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Scanner;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class MatchArray {
    public static void main(String args[]) {
        File file = new File(
                "C:/Users/Thiago/workspace/SubRegex/src/Dirty Harry VOST - Clint Eastwood.srt");
        try {
            Scanner in = new Scanner(file);
            try {
                // Read the whole file into one string, joining the lines with "\n".
                String contents = in.nextLine();
                while (in.hasNextLine()) {
                    contents = contents + "\n" + in.nextLine();
                }
                String pattern = "([\\d]+)\r([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})[\\s]*-->[\\s]*([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})\r(([^|\r]+(\r|$))+)";
                Pattern r = Pattern.compile(pattern);
                // Now create matcher object.
                Matcher m = r.matcher(contents);
                ArrayList<String> start = new ArrayList<String>();
                while (m.find()) {
                    start.add(m.group(1));
                    start.add(m.group(2));
                    start.add(m.group(3));
                    start.add(m.group(4));
                    start.add(m.group(5));
                    start.add(m.group(6));
                    start.add(m.group(7));
                    System.out.println(start);
                }
            } finally {
                in.close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
But when I run it, it doesn't capture any groups. When I try to capture only the times with this pattern:
([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})[\\s]*-->[\\s]*([\\d]{2}:[\\d]{2}:[\\d]{2}),([\\d]{3})
It works. So how do I make it capture the entire subtitle?

I'm not sure I fully understand what you need, but this may help.
Please try this regex:
(\\d+?)\\s*(\\d+?:\\d+?:\\d+?,\\d+?)\\s+-->\\s+(\\d+?:\\d+?:\\d+?,\\d+?)\\s+(.+)
I tried it on http://www.myregextester.com/index.php and it worked.
I hope this helps.
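The question's pattern looks for "\r", but the reading loop joins the lines with "\n", so the full pattern cannot match the joined string. The sketch below is one possible way to capture whole cues. It assumes each cue is an index line, a timing line, and one or more text lines that end at a blank line or at the end of the input. The class name SrtRegexSketch, the parse helper, and the sample text are made up for illustration, and \R (any line break) requires Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SrtRegexSketch {
    // Groups: 1 = index, 2/3 = start time and millis, 4/5 = end time and millis,
    // 6 = one or more non-empty text lines.
    private static final Pattern CUE = Pattern.compile(
            "(\\d+)\\R"
            + "(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\s*-->\\s*(\\d{2}:\\d{2}:\\d{2}),(\\d{3})\\R"
            + "((?:.+(?:\\R|$))+)");

    // Returns one array per cue: {index, start, end, text}.
    static List<String[]> parse(String contents) {
        List<String[]> cues = new ArrayList<>();
        Matcher m = CUE.matcher(contents);
        while (m.find()) {
            cues.add(new String[] {
                m.group(1),
                m.group(2) + "," + m.group(3),
                m.group(4) + "," + m.group(5),
                m.group(6).trim()
            });
        }
        return cues;
    }

    public static void main(String[] args) {
        // Hypothetical sample input, not taken from the file in the question.
        String sample = "1\n00:00:01,000 --> 00:00:04,000\nHello\nworld\n\n"
                + "2\n00:00:05,500 --> 00:00:07,250\nBye\n";
        for (String[] cue : parse(sample)) {
            System.out.println(String.join(" | ", cue).replace("\n", " / "));
        }
    }
}
```

Because "." does not match line terminators by default, the text group stops at the first blank line, which is what separates one cue from the next.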

Related

List<String> with entities to encode to UTF-8

I have a method that extracts values with a regex and adds them to a List:
private static List<String> listaOfQuestion(Scanner sc, List<File> listaQuestion) {
    List<String> question = new ArrayList<String>();
    for (File input1 : listaQuestion) {
        try {
            sc = new Scanner(input1);
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        while (sc.hasNextLine()) {
            Scanner s = new Scanner(sc.nextLine());
            while (s.hasNext()) {
                String words = s.nextLine();
                try {
                    question.add(getTagValuesQ(words).toString());
                } catch (Exception e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            }
        }
    }
    return question;
}
I want to convert all the values, like
List
Bielańska
Wyziński
Wciślik
To
List
Bielańska
Wyzińska
Wciślik
that is, to UTF-8. I searched the forum, but either I didn't find a solution or I didn't understand it.
I appreciate any help, but since I'm new, a simple standard example that I can understand would be best.
I solved my problem; I needed to use:
<...>
Scanner s = new Scanner(sc.nextLine());
while(s.hasNext()){
String words = s.nextLine();
String decoded = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(words);
<...>
I tried Apache Commons Lang and it solved the problem:
String s = "Bielańska Wyziński Wciślik";
String decoded = org.apache.commons.lang3.StringEscapeUtils.unescapeHtml4(s);
System.out.println(decoded);
Output:
Bielańska Wyziński Wciślik
https://commons.apache.org/proper/commons-lang/download_lang.cgi
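If adding a library is not an option, a much narrower alternative can be sketched with the standard library alone. This is not what unescapeHtml4 does internally: it only decodes decimal numeric character references, and it assumes the input looks like "Biela&#324;ska", which is a guess about the original data. The class name NumericEntitySketch and the decode helper are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NumericEntitySketch {
    // Matches decimal numeric character references such as &#324;
    // (at most 7 digits, so the number always fits in an int).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#(\\d{1,7});");

    static String decode(String text) {
        Matcher m = DECIMAL_REF.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave the reference untouched if it is not a valid code point.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed input form; named entities such as &amp; are NOT handled here.
        System.out.println(decode("Biela&#324;ska Wyzi&#324;ski Wci&#347;lik"));
    }
}
```

Named entities and hexadecimal references are left as they are, so for real HTML input the Commons Lang call above remains the more complete choice.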

Regex extract string: why doesn't my pattern work?

I have a long string in this format (a long single line in file):
"1":"Aname","2":"AnotherName","3":"Sempronio"
I want to extract each number and name and save them in a Map.
I tried this:
FileReader fileReader = null;
BufferedReader br = null;
File file = new File("./SingleLineFileNames.txt");
try {
    fileReader = new FileReader(file);
    br = new BufferedReader(fileReader);
    String line;
    Pattern p = Pattern.compile("\"(\\d+)\":\"([\\w-.' ]+)\"");
    Matcher matcher;
    while ((line = br.readLine()) != null) {
        matcher = p.matcher(line);
        String name;
        int i = 1;
        while ((name = matcher.group(i)) != null) {
            // save in map
            i++;
        }
    }
}
catch (FileNotFoundException e) {
    e.printStackTrace();
}
catch (IOException e) {
    e.printStackTrace();
}
finally {
    try {
        br.close();
        fileReader.close();
    }
    catch (IOException e) {
        e.printStackTrace();
    }
}
return null;
The result is java.lang.IllegalStateException: No match found
Is this the right way to iterate over groups?
Where am I going wrong?
First split the string at , (String#split), then split each resulting array element at : to get the key and value. With input this simple, I wonder what kind of masochism drives developers to crack such small nuts with a regex sledgehammer.
If you use a hyphen inside [], always place it first or last.
Pattern p = Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");
                                            ^ here
Also, the way you are calling group() is not correct: you have to call find() first, and each successful find() gives you the groups of one match. Check here:
while (matcher.find()) {
    System.out.println(matcher.group(1));
    System.out.println(matcher.group(2));
}
Remove the broken square bracket construct ([\\w-.' ]+). If the names contain only word characters, (\\w+) is enough.
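Putting the find() loop together with the Map the question asks for could look like the sketch below. The class name PairExtractorSketch, the extract helper, and the choice of a LinkedHashMap (to keep the pairs in input order) are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PairExtractorSketch {
    // Group 1 = the number, group 2 = the name (hyphen placed first in the class).
    private static final Pattern PAIR = Pattern.compile("\"(\\d+)\":\"([-\\w.' ]+)\"");

    static Map<Integer, String> extract(String line) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        // Each successful find() advances to the next "number":"name" pair.
        while (m.find()) {
            result.put(Integer.valueOf(m.group(1)), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "\"1\":\"Aname\",\"2\":\"AnotherName\",\"3\":\"Sempronio\"";
        System.out.println(extract(line)); // prints {1=Aname, 2=AnotherName, 3=Sempronio}
    }
}
```

The group numbers stay fixed at 1 and 2; it is the repeated find() call, not an increasing group index, that walks through the line.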

Problems trying to retrieve information from txt file

I'm stuck on an issue in my application. I have a text file containing a piece of code that I need to retrieve and store in a String variable. What is the best way to do this? I tried the samples below, but they are logically incorrect or incomplete. Take a look:
Reading line by line:
BufferedReader bfr = new BufferedReader(new FileReader(Node));
String line = null;
try {
    while ((line = bfr.readLine()) != null) {
        line.contentEquals("d.href");
        System.out.println(line);
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Reading character by character:
BufferedReader bfr = new BufferedReader(new FileReader(Node));
int i = 0;
try {
    while ((i = bfr.read()) != -1) {
        char ch = (char) i;
        System.out.println(Character.toString(ch));
    }
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Reading with a Scanner:
BufferedReader bfr = new BufferedReader(new FileReader(Node));
int wordCount = 0, totalcount = 0;
Scanner s = new Scanner(googleNode);
while (s.hasNext()) {
    totalcount++;
    if (s.next().contains("(?=d.href).*?(}=?)")) wordCount++;
}
System.out.println(wordCount+" "+totalcount);
With (1), I have difficulty finding d.href, which marks the start of the code piece.
With (2), I can't find a way to store d.href as a string and retrieve the rest of the information.
With (3), I can find d.href correctly, but I can't retrieve the pieces of the text.
Could anyone help me, please?
To answer my own question: I used a Scanner to read the text file word by word. next() returns the next word as a String, and .contains("window.maybeRedirectForGBV") returns a boolean. I stopped the search one word before the piece I wanted, then advanced once more to store the next word in a String variable. From that point you only need to process the string however you want. Hope this helps.
String stringSplit = null;
Scanner s = new Scanner(Node);
while (s.hasNext()) {
    if (s.next().contains("window.maybeRedirectForGBV")) {
        stringSplit = s.next();
        break;
    }
}
You can use a regular expression like this:
Pattern pattern = Pattern.compile("^\\s*d\\.href([^=]*)=(.*)$");
// Groups:                                      1-----1 2--2
// Possibly spaces, "d.href", any characters not '=', the '=', any chars.
....
Matcher m = pattern.matcher(line);
if (m.matches()) {
    String dHrefSuffix = m.group(1);
    String value = m.group(2);
    System.out.println(value);
    break;
}
BufferedReader will do.
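Combining the two answers, a line-by-line search with the regex above might be sketched as follows. The class name DHrefSketch, the firstDHrefValue helper, and the sample lines are hypothetical; the lines are passed in as an Iterable so the example runs without a file.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DHrefSketch {
    // Group 1 = anything between "d.href" and '=', group 2 = everything after '='.
    private static final Pattern D_HREF = Pattern.compile("^\\s*d\\.href([^=]*)=(.*)$");

    // Returns the trimmed text after '=' on the first line that starts with d.href.
    static Optional<String> firstDHrefValue(Iterable<String> lines) {
        for (String line : lines) {
            Matcher m = D_HREF.matcher(line);
            if (m.matches()) {
                return Optional.of(m.group(2).trim());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Made-up sample lines standing in for the text file.
        List<String> sample = Arrays.asList(
                "var d = document.location;",
                "  d.href = \"http://example.com\";");
        System.out.println(firstDHrefValue(sample).orElse("(not found)"));
    }
}
```

For a real file, the lines could come from a BufferedReader loop as in the question, or from Files.readAllLines.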

Parse a string line by opening a file using Regex

This is the text file (log.txt) I am opening; I need to match each line using a regular expression.
Jerty|gas|petrol|2.42
Tree|planet|cigar|19.00
Karie|entertainment|grocery|9.20
So I wrote this regular expression, but it does not match as expected.
public static String pattern = "(.*?)|(.*?)|(.*?)|(.*?)";
public static void main(String[] args) {
    File file = new File("C:\\log.txt");
    try {
        Pattern regex = Pattern.compile(pattern);
        Scanner scanner = new Scanner(file);
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            Matcher m = regex.matcher(line);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    } catch (FileNotFoundException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
Any suggestions will be appreciated.
The | is a special regex symbol that means 'or', so you have to escape it:
public static String pattern = "(.*?)\\|(.*?)\\|(.*?)\\|(.*?)";
You can greatly simplify this. Since the data appears to be pipe-separated, you can just split on the pipe character. You'll end up with an array of fields which you can parse further as needed:
String[] fields = line.split("\\|");
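As a small runnable illustration of the split approach, using one of the sample lines from the question: the class name PipeSplitSketch and the fields helper are made up, and the -1 limit is an optional extra that keeps trailing empty fields.

```java
import java.util.Arrays;

final class PipeSplitSketch {
    static String[] fields(String line) {
        // "\\|" escapes the pipe so it is treated as a literal separator;
        // the -1 limit keeps trailing empty fields instead of dropping them.
        return line.split("\\|", -1);
    }

    public static void main(String[] args) {
        String[] parts = fields("Jerty|gas|petrol|2.42");
        System.out.println(Arrays.toString(parts)); // prints [Jerty, gas, petrol, 2.42]
        System.out.println(Double.parseDouble(parts[3])); // prints 2.42
    }
}
```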

Multiline regexp matcher

There is an input file with this content:
XX00002200000
XX00003300000
regexp:
(.{6}22.{5}\W)(.{6}33.{5})
I tried it in The Regex Coach (an app for regexp testing), and the strings matched OK.
Java:
pattern = Pattern.compile(patternString);
inputStream = resource.getInputStream();
scanner = new Scanner(inputStream, charsetName);
scanner.useDelimiter("\r\n");
patternString is the regexp mentioned above, set as a bean property from an .xml file.
It fails from Java.
Simple solution: ".{6}22.{5}\\s+.{6}33.{5}". Note that \s+ matches one or more consecutive whitespace characters, including line breaks.
Here's an example:
public static void main(String[] argv) {
    String input = "yXX00002200000\r\nXX00003300000\nshort";
    String regex = ".{6}22.{5}\\s+.{6}33.{5}";
    Pattern pattern = Pattern.compile(regex);
    Matcher m = pattern.matcher(input);
    // find() scans the whole input, so a match may span the line break.
    while (m.find()) {
        System.out.println(m.group());
    }
}
With output:
XX00002200000
XX00003300000
To play around with Java Regex you can use: Regular Expression Editor (free online editor)
Edit: I think you are changing the input while reading the data. Try:
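public static String readFile(String filename) throws FileNotFoundException {
    Scanner sc = new Scanner(new File(filename));
    StringBuilder sb = new StringBuilder();
    // nextLine() strips the line terminator, so add one back;
    // otherwise \s+ in the regex has nothing to match between the lines.
    while (sc.hasNextLine())
        sb.append(sc.nextLine()).append('\n');
    sc.close();
    return sb.toString();
}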
Or
static String readFile(String path) throws IOException {
    // try-with-resources closes the stream and the channel even if mapping fails.
    try (FileInputStream stream = new FileInputStream(new File(path));
         FileChannel channel = stream.getChannel()) {
        MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_ONLY, 0,
                channel.size());
        // Decode the mapped bytes with the platform default charset.
        return Charset.defaultCharset().decode(buffer).toString();
    }
}
With imports like:
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Try this change in the delimiter:
scanner.useDelimiter("\\s+");
Also, why don't you use a more general regex like this:
".{6}[0-9]{2}.{5}"
The regex you mentioned above spans two lines. Since you set the delimiter to a newline, you should use a regex suited to a single line.
Pardon my ignorance, but I am still not sure what exactly you are trying to search for. If you are trying to find the string (with newlines)
XX00002200000
XX00003300000
then why are you reading it delimited by newlines?
To read the above string as it is, the following code works:
Pattern p = Pattern.compile(".{6}22.{5}\\W+.{6}33.{5}");
// try-with-resources closes the stream even if reading fails.
try (FileInputStream in = new FileInputStream("C:\\new.txt")) {
    byte[] f = new byte[100];
    // read() returns the number of bytes actually read (-1 at end of file).
    int n = in.read(f);
    String s = new String(f, 0, Math.max(n, 0));
    Matcher m = p.matcher(s);
    if (m.find())
        System.out.println(m.group());
} catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
NB: here the file new.txt contains the string
XX00002200000
XX00003300000
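If the Scanner from the question has to stay, another option is Scanner.findWithinHorizon, which searches the input while ignoring the delimiter. The sketch below assumes that approach fits the original bean setup; the class name TwoLineMatchSketch and the findPair helper are hypothetical, and a String stands in for the input stream.

```java
import java.util.Scanner;
import java.util.regex.Pattern;

final class TwoLineMatchSketch {
    private static final Pattern TWO_LINES = Pattern.compile(".{6}22.{5}\\s+.{6}33.{5}");

    // Returns the first two-line match, or null if there is none.
    static String findPair(String input) {
        try (Scanner scanner = new Scanner(input)) {
            scanner.useDelimiter("\r\n"); // same delimiter as in the question
            // findWithinHorizon ignores delimiters; a horizon of 0 means no limit.
            return scanner.findWithinHorizon(TWO_LINES, 0);
        }
    }

    public static void main(String[] args) {
        System.out.println(findPair("XX00002200000\r\nXX00003300000\r\n"));
    }
}
```

With the "\r\n" delimiter, next() only ever returns single lines, which is why a pattern spanning two lines cannot match an individual token.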
