I'm using the YouTube Data API v3 to upload videos and then request their caption files generated by automatic captioning. I got back the following file with non-sequential timing:
1
00:00:00,000 --> 00:00:06,629
good weekend uh how was my weekend we
2
00:00:05,549 --> 00:00:14,960
don't do this we are
3
00:00:06,629 --> 00:00:14,960
yeah it's good Roman yeah well I gotta
Sample video : https://youtu.be/F2TVsMD_bDQ
So why doesn't the end time of each subtitle slot match the start time of the next one?
After searching for days and digging through the YouTube documentation I found nothing that fixes this issue, so I solved it on my own: I wrote code that uses regular expressions to repair the subtitle timing order. I have tested it against 5 videos and it worked perfectly:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * @author youans
 */
public class SubtitleCorrector {

    /**
     * @param args the command line arguments
     */
    public static void main(String[] args) {
        try {
            String fileContent;
            File inFile = new File("/IN_DIRECTORY/Test Video Bad Format.srt");
            BufferedReader br = new BufferedReader(new FileReader(inFile));
            try {
                StringBuilder sb = new StringBuilder();
                String line = br.readLine();
                while (line != null) {
                    sb.append(line);
                    sb.append("\n");
                    line = br.readLine();
                }
                fileContent = sb.toString();
            } finally {
                br.close();
            }

            // Collect every timestamp in the file, sorted and de-duplicated.
            String regex = "\\d{2}:\\d{2}:\\d{2},\\d{3}";
            List<String> slotsTiming = new ArrayList<>(new TreeSet<>(getAllMatches(fileContent, regex)));
            System.out.println(slotsTiming.size());

            // Match each block: its index line, its timing line, and its text.
            String timingRegex = "(((^1\n)|(\\n\\d+\n))(\\d{2}:\\d{2}:\\d{2},\\d{3}.*\\d{2}:\\d{2}:\\d{2},\\d{3}))";
            regex = timingRegex + "[A-Za-z-,;'\"\\s]+";
            List<String> subtitleSlots = getAllMatches(fileContent, regex);

            // Keep only the subtitle text of each block.
            List<String> textOnlySlots = new ArrayList<>();
            for (String subtitleSlot : subtitleSlots) {
                textOnlySlots.add(subtitleSlot.replaceAll(timingRegex + "|\n", ""));
            }

            // Rebuild the file: slot i runs from timestamp i to timestamp i + 1,
            // so every slot now ends exactly where the next one starts.
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < textOnlySlots.size(); i++) {
                sb.append(i + 1).append("\n")
                  .append(slotsTiming.get(i)).append(" --> ").append(slotsTiming.get(i + 1)).append("\n")
                  .append(textOnlySlots.get(i)).append("\n\n");
            }

            File outFile = new File("/OUT_DIRECTOR/" + inFile.getName().replaceFirst("[.][^.]+$|bad format", "") + "_edited.SRT");
            PrintWriter pw = new PrintWriter(outFile);
            pw.write(sb.toString());
            pw.flush();
            pw.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }

    public static List<String> getAllMatches(String text, String regex) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(text);
        while (m.find()) {
            matches.add(m.group(1));
        }
        return matches;
    }
}
Related
I am supposed to revise my already finished code because it is too long. The suggested idea was to put the search terms in an ArrayList and then run them against a .txt file, whose matching lines are stored in another ArrayList. Duplicate lines are to be skipped and not read in.
boolean allegefunden = false;
BufferedReader reader;
String zeile = null;
ArrayList<String> arr = new ArrayList<String>();
try {
    reader = new BufferedReader(new FileReader("C:\\Dev\\lesenUndSchreibenInput.txt"));
    zeile = reader.readLine();
    while (zeile != null) {
        // problem: suchbegriffe is a list of search terms, not a CharSequence
        if (zeile.contains((CharSequence) suchbegriffe)) {
            arr.add(zeile);
            allegefunden = true;
        } else if (allegefunden == true && zeile.contains((CharSequence) suchbegriffe)) {
            // duplicate, skip it
        } else {
            arr.add(zeile);
        }
        zeile = reader.readLine();
    }
However, the normal contains method does not work.
I'm not sure I understand fully what you're trying to do, but to replace the contains method, you could build a regexp at the beginning of the search:
private static Pattern reFromWords(List<String> searchWords) {
    String s = searchWords.stream()
            .map(Pattern::quote)
            .collect(Collectors.joining("|"));
    return Pattern.compile(s);
}
Searching in a file would then look like this:
Pattern regexp = reFromWords(suchbegriffe);
try (Reader fileReader = new FileReader("yourfile.txt");
     BufferedReader reader = new BufferedReader(fileReader)) {
    String line = reader.readLine();
    while (line != null) {
        Matcher matcher = regexp.matcher(line);
        if (matcher.find()) {
            String foundWord = matcher.group();
            System.out.println("Found " + foundWord + " in line: " + line);
        }
        line = reader.readLine();
    }
} catch (IOException ex) {
    Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
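If you also need to skip duplicate lines, as your assignment requires, a minimal sketch of the whole thing could look like the one below (the term list and file path are just placeholders, swap in however you actually fill suchbegriffe). A LinkedHashSet keeps the order of first appearance and silently drops repeated lines:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class SuchbegriffSuche {

    private static Pattern reFromWords(List<String> searchWords) {
        // same helper as above: one alternation of all (quoted) search terms
        String s = searchWords.stream()
                .map(Pattern::quote)
                .collect(Collectors.joining("|"));
        return Pattern.compile(s);
    }

    public static void main(String[] args) {
        List<String> suchbegriffe = Arrays.asList("foo", "bar"); // placeholder terms
        Pattern regexp = reFromWords(suchbegriffe);

        Set<String> treffer = new LinkedHashSet<>(); // drops duplicate lines automatically
        try (Reader fileReader = new FileReader("C:\\Dev\\lesenUndSchreibenInput.txt");
             BufferedReader reader = new BufferedReader(fileReader)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (regexp.matcher(line).find()) {
                    treffer.add(line);
                }
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }

        List<String> arr = new ArrayList<>(treffer); // ArrayList, if the assignment needs one
        arr.forEach(System.out::println);
    }
}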
I want to read the file and add each entry to an ArrayList, one entry per date. The date itself should also be included in the entry.
File Example:
15.09.2002 Hello, this is the first entry.
\t this line, I also need in the first entry.
\t this line, I also need in the first entry.
\t this line, I also need in the first entry.
17.10.2020 And this is the next entry
I tried this, but the reader only reads the first line:
public class versuch1 {

    public static void main(String[] args) {
        ArrayList<String> liste = new ArrayList<String>();
        String lastLine = "";
        String str_all = "";
        String currLine = "";
        try {
            FileReader fstream = new FileReader("test.txt");
            BufferedReader br = new BufferedReader(fstream);
            while ((currLine = br.readLine()) != null) {
                Pattern p = Pattern
                        .compile("[0-3]?[0-9].[0-3]?[0-9].(?:[0-9]{2})?[0-9]{2} [0-2]?[0-9]:[0-6]?[0-9]:[0-5]");
                Matcher m = p.matcher(currLine);
                if (m.find()) {
                    lastLine = currLine;
                    liste.add(lastLine);
                } else {
                    str_all = currLine + " " + lastLine;
                    liste.set(liste.indexOf(currLine), str_all);
                }
            }
            br.close();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
        System.out.print(liste.get(0) + " " + liste.get(1));
    }
}
I have solved my problem :)
public class versuch1 {

    public static void main(String[] args) {
        ArrayList<String> liste = new ArrayList<String>();
        String lastLine = "";
        String currLine = "";
        String str_all = "";
        try {
            FileReader fstream = new FileReader("test.txt");
            BufferedReader br = new BufferedReader(fstream);
            currLine = br.readLine();
            while (currLine != null) {
                Pattern p = Pattern
                        .compile("[0-3]?[0-9].[0-3]?[0-9].(?:[0-9]{2})?[0-9]{2} [0-2]?[0-9]:[0-6]?[0-9]:[0-5]");
                Matcher m = p.matcher(currLine);
                if (m.find()) {
                    liste.add(currLine);
                    lastLine = currLine;
                } else {
                    liste.set(liste.size() - 1, str_all);
                    lastLine = str_all;
                }
                currLine = br.readLine();
                str_all = lastLine + currLine;
            }
            br.close();
        } catch (Exception e) {
            System.err.println("Error: " + e.getMessage());
        }
        System.out.print(liste.get(1) + " ");
    }
}
While reading the lines, keep a "current entry".
If the line read begins with a date, then it belongs to a new entry. In this case add the current entry to the list of entries and create a new current entry consisting of the read line.
If the line did not begin with a date, just add it to the current entry.
For this to work, you need to read the first line into the current entry before the loop. And after the loop you need to add the current entry to the list of entries. This in turn only works if there is at least one line and the first line begins with a date. So handle the special case of no lines specially (use if-else). And report an error if the first line does not begin with a date.
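A minimal sketch of that approach (assuming the file is called test.txt and that entries start with a dd.MM.yyyy date, as in your example):
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class EntryReader {

    // A line starts a new entry when it begins with a date like 15.09.2002.
    private static final Pattern DATE_START = Pattern.compile("^\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(new FileReader("test.txt"))) {
            String line = br.readLine();
            if (line == null) {
                System.out.println("File is empty.");
                return;
            }
            if (!DATE_START.matcher(line).find()) {
                System.err.println("First line does not begin with a date.");
                return;
            }
            StringBuilder currentEntry = new StringBuilder(line);
            while ((line = br.readLine()) != null) {
                if (DATE_START.matcher(line).find()) {
                    entries.add(currentEntry.toString());   // previous entry is finished
                    currentEntry = new StringBuilder(line); // start the next one
                } else {
                    currentEntry.append(" ").append(line.trim()); // continuation line
                }
            }
            entries.add(currentEntry.toString()); // don't forget the last entry
        } catch (IOException e) {
            System.err.println("Error: " + e.getMessage());
        }
        entries.forEach(System.out::println);
    }
}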
Happy coding.
I am currently working on an independent project, but I am having trouble converting a text file into the proper format. Currently my program reads line by line and assumes a line equals a sentence, which is problematic because someone could paste in a paragraph with punctuation scattered all over the place. What I want to do is put each sentence on its own line and then read from that file. I didn't want to come empty-handed, so I tried it the only way I could and got it to work with short strings, but once I moved on to longer text files I had to use streams and I ran into this error: (File name too long)
Example:
Input: This is a dummy sentence. Hello this is one too. And this one too.
Output:
This is a dummy sentence.
Hello this is one too.
And this one too.
This one works:
public static void main(String args[]) {
    String text = "Joanne had one requirement: Her child must be" +
            " adopted by college graduates. So the doctor arranged" +
            " for the baby to be placed with a lawyer and his wife." +
            " Paul and Clara named their new baby Steven Paul Jobs.";
    Pattern pattern = Pattern.compile("\\?|\\.|\\!|\\¡|\\¿");
    Matcher matcher = pattern.matcher(text);
    StringBuilder text_fixed = new StringBuilder();
    String withline = "";
    int starter = 0;
    String overall = "";
    String blankspace = " ";
    while (matcher.find()) {
        int holder = matcher.start();
        System.out.println("=========> " + holder);
        withline = text.substring(starter, holder + 1);
        withline = withline + "\r\n";
        overall = overall + withline;
        System.out.println(withline);
        starter = holder + 2;
    }
    System.out.println(overall);
    //return overall;
}
This one has issues:
public static void main(String[] args) throws IOException {
    final String INPUT_FILE = "practice.txt";
    InputStream in = new FileInputStream(INPUT_FILE);
    String fixread = getStringFromInputStream(in);
    String fixedspace = fixme(fixread);
    File ins = new File(fixedspace);
    BufferedReader reader = new BufferedReader(new FileReader(ins));
    Pattern p = Pattern.compile("\n");
    String line, sentence;
    String[] t;
    while ((line = reader.readLine()) != null) {
        t = p.split(line); /** hold curr sentence and remove it from OG txt file since you will reread. */
        sentence = t[0];
        indiv_sentences.add(sentence);
    }
    //putSentencestoTrie(indiv_sentences);
    //runAutocompletealt();
}

private static String fixme(String fixread) {
    Pattern pattern = Pattern.compile("\\?|\\.|\\!|\\¡|\\¿");
    String actString = fixread.toString();
    Matcher matcher = pattern.matcher(actString);
    String withline = "";
    int starter = 0;
    String overall = "";
    while (matcher.find()) {
        int holder = matcher.start();
        withline = actString.substring(starter, holder + 1);
        withline = withline + "\r\n";
        overall = overall + withline;
        starter = holder + 2;
    }
    return overall;
}
/**this is not my code, this was provided by an outside source, I do not take credit*/
/**http://www.mkyong.com/java/how-to-convert-inputstream-to-string-in-java/*/
private static String getStringFromInputStream(InputStream is) {
    BufferedReader br = null;
    StringBuilder sb = new StringBuilder();
    String line;
    try {
        br = new BufferedReader(new InputStreamReader(is));
        while ((line = br.readLine()) != null) {
            sb.append(line);
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (br != null) {
            try {
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    return sb.toString();
}
https://github.com/ChristianCSE/Phrase-Finder
I am pretty sure this is all the code I use for this section, but if you need to see the rest of my code I provided a link to my repository. Thanks!
The problem is that you are creating the file with a name that is supposed to be its content, which is too long for a filename.
String fixedspace = fixme(fixread);
File ins = new File(fixedspace);//this is the issue, you gave the content as its name
Give the file a proper name and write the output to it. One sample is below.
String fixedspace = fixme(fixread);
File out = new File("output.txt");
FileWriter fr = new FileWriter(out);
fr.write(fixedspace);
fr.close(); // flush the content to disk before reading it back
Then read it and continue.
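For completeness, a minimal read-back sketch (assuming the output.txt written above, with one sentence per line) might be:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReadBack {
    public static void main(String[] args) throws IOException {
        List<String> indivSentences = new ArrayList<>();
        // output.txt is the file written by the FileWriter above, one sentence per line.
        try (BufferedReader reader = new BufferedReader(new FileReader("output.txt"))) {
            String sentence;
            while ((sentence = reader.readLine()) != null) {
                indivSentences.add(sentence);
            }
        }
        System.out.println(indivSentences.size() + " sentences read.");
    }
}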
I have a NetBeans Module Application that runs fine when executed from my NetBeans IDE.
But when I run the distribution executable from the generated unzipped folder, the application's SwingWorker task stops after a while. It loops through a couple of the files and then stops.
My best guess is that I have to do something about the loop where I process the CSV files? Any idea or hint would be most appreciated.
The files have between 2,000 and 600,000 rows and contain 5 time series of doubles.
I store the datasets in a collection.
Here is my method with the while loop:
protected XYDataset generateDataSet(String filePath) {
    TimeSeriesCollection dataset = null;
    try {
        dataset = new TimeSeriesCollection();
        boolean isHeaderSet = false;
        String fileRow;
        StringTokenizer tokenizer;
        BufferedReader br;
        List<String> headers;
        String encoding = "UTF-8";
        br = new BufferedReader(new InputStreamReader(new FileInputStream(filePath), encoding));
        //br = new BufferedReader(new FileReader(filePath));
        if (!br.ready()) {
            throw new FileNotFoundException();
        }
        fileRow = br.readLine();
        // Loop starts here
        while (fileRow != null) {
            if (!isHeaderSet) {
                headers = getHeaders(fileRow);
                for (String string : headers) {
                    dataset.addSeries(new TimeSeries(string));
                }
                isHeaderSet = true;
            }
            if (fileRow.startsWith("#")) {
                fileRow = br.readLine();
            }
            String timeStamp = null;
            String theTok1 = null;
            String theTok2;
            tokenizer = new StringTokenizer(fileRow);
            if (tokenizer.hasMoreTokens()) {
                theTok1 = tokenizer.nextToken().trim();
            }
            if (tokenizer.hasMoreTokens()) {
                theTok2 = tokenizer.nextToken().trim();
                timeStamp = theTok1 + " " + theTok2;
            }
            Millisecond m = null;
            if (timeStamp != null) {
                m = getMillisecond(timeStamp);
            }
            int serieNumber = 0;
            br.mark(201);
            if (br.readLine() == null) {
                br.reset();
                while (tokenizer.hasMoreTokens()) {
                    if (dataset.getSeriesCount() > serieNumber) {
                        // On the very last CSV row I add with notify = true; otherwise the dataset
                        // fires a change event every time I add a value to a series, and it is
                        // enough to do that once, on the last row.
                        dataset.getSeries(serieNumber).add(m, parseDouble(tokenizer.nextToken().trim()), true);
                    } else {
                        tokenizer.nextToken();
                    }
                    serieNumber++;
                }
            } else {
                br.reset();
                while (tokenizer.hasMoreTokens()) {
                    if (dataset.getSeriesCount() > serieNumber) {
                        dataset.getSeries(serieNumber).add(m, parseDouble(tokenizer.nextToken().trim()), false);
                    } else {
                        tokenizer.nextToken();
                    }
                    serieNumber++;
                }
            }
            fileRow = br.readLine();
        }
        br.close();
    } catch (FileNotFoundException ex) {
        printStackTrace(ex);
    } catch (IOException | ParseException ex) {
        printStackTrace(ex);
    }
    return dataset;
}
Here are also the methods I use for processing the headers and the timestamp, called from the code above (sometimes the CSV file is missing its headers).
/**
 * If the start char "#" is missing then the headers will all be "NA".
 *
 * @param fileRow a row with any number of headers
 * @return ArrayList with headers
 */
protected List<String> getHeaders(String fileRow) {
    List<String> returnValue = new ArrayList<>();
    StringTokenizer tokenizer;
    if (fileRow.startsWith("#")) {
        tokenizer = new StringTokenizer(fileRow.substring(1));
    } else {
        tokenizer = new StringTokenizer(fileRow);
        tokenizer.nextToken();
        tokenizer.nextToken(); //date and time is one header but two tokens
        while (tokenizer.hasMoreTokens()) {
            returnValue.add("NA");
            tokenizer.nextToken();
        }
        return returnValue;
    }
    tokenizer.nextToken();
    while (tokenizer.hasMoreTokens()) {
        returnValue.add(tokenizer.nextToken().trim());
    }
    return returnValue;
}

/**
 * @param timeStamp must match the pattern "yyyy-MM-dd HH:mm:ss.SSS"
 * @return the corresponding Millisecond
 * @throws ParseException if the timestamp cannot be parsed
 */
public Millisecond getMillisecond(String timeStamp) throws ParseException {
    Date date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").parse(timeStamp);
    return new Millisecond(date);
}
Assuming that you invoke generateDataSet() from your implementation of doInBackground(), alterations to dataset will typically fire events on the background thread, a violation of Swing's single thread rule. Instead, publish() interim results and process() them as shown here.
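A minimal sketch of that publish()/process() pattern follows; the Row class, worker name and the parsing step are illustrative, not from the original code. Parsing happens on the background thread, and only process() touches the dataset on the EDT:
import java.util.List;
import javax.swing.SwingWorker;
import org.jfree.data.time.Millisecond;
import org.jfree.data.time.TimeSeries;
import org.jfree.data.time.TimeSeriesCollection;

// Illustrative value object for one parsed CSV value.
class Row {
    final int seriesIndex;
    final Millisecond time;
    final double value;

    Row(int seriesIndex, Millisecond time, double value) {
        this.seriesIndex = seriesIndex;
        this.time = time;
        this.value = value;
    }
}

class CsvLoadWorker extends SwingWorker<Void, Row> {

    private final TimeSeriesCollection dataset; // created on the EDT, only touched in process()
    private final String filePath;

    CsvLoadWorker(TimeSeriesCollection dataset, String filePath) {
        this.dataset = dataset;
        this.filePath = filePath;
    }

    @Override
    protected Void doInBackground() throws Exception {
        // Parse filePath here (as in generateDataSet), but instead of calling
        // dataset.getSeries(i).add(...) directly, hand each value to publish():
        // publish(new Row(seriesIndex, millisecond, value));
        return null;
    }

    @Override
    protected void process(List<Row> rows) {
        // Runs on the EDT, so it is safe to mutate the dataset here.
        for (Row r : rows) {
            TimeSeries series = dataset.getSeries(r.seriesIndex);
            series.add(r.time, r.value);
        }
    }
}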
Can anyone suggest how to use StringTokenizer in Java to read all the data in a file and display only some of its contents? For example, if I have
apple = 23456, mango = 12345, orange= 76548, guava = 56734
I need to select apple, and the value corresponding to apple should be displayed in the output.
This is the code
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.StringTokenizer;

public class ReadFile {

    public static void main(String[] args) {
        try {
            String csvFile = "Data.txt";
            //create BufferedReader to read csv file
            BufferedReader br = new BufferedReader(new FileReader(csvFile));
            String line = "";
            StringTokenizer st = null;
            int lineNumber = 0;
            int tokenNumber = 0;
            //read comma separated file line by line
            while ((line = br.readLine()) != null) {
                lineNumber++;
                //use comma as token separator
                st = new StringTokenizer(line, ",");
                while (st.hasMoreTokens()) {
                    tokenNumber++;
                    //display csv values
                    System.out.print(st.nextToken() + " ");
                }
                System.out.println();
                //reset token number
                tokenNumber = 0;
            }
        } catch (Exception e) {
            System.err.println("CSV file cannot be read : " + e);
        }
    }
}
This is the file I'm working on:
ImageFormat=GeoTIFF
ProcessingLevel=GEO
ResampCode=CC
NoScans=10496
NoPixels=10944
MapProjection=UTM
Ellipsoid=WGS_84
Datum=WGS_84
MapOriginLat=0.00000000
MapOriginLon=0.00000000
ProdULLat=18.54590200
ProdULLon=73.80059300
ProdURLat=18.54653200
ProdURLon=73.90427600
ProdLRLat=18.45168500
ProdLRLon=73.90487900
ProdLLLat=18.45105900
ProdLLLon=73.80125300
ProdULMapX=373416.66169100
ProdULMapY=2051005.23286800
ProdURMapX=384360.66169100
ProdURMapY=2051005.23286800
ProdLRMapX=373416.66169100
ProdLRMapY=2040509.23286800
ProdLLMapX=384360.66169100
ProdLLMapY=2040509.23286800
Out of this, I need to display only the following:
NoScans
NoPixels
ProdULLat
ProdULLon
ProdLRLat
ProdLRLon
public class Test {

    public String getValue(String str, String strDelim, String keyValueDelim, String key) {
        StringTokenizer tokens = new StringTokenizer(str, strDelim);
        String sentence;
        while (tokens.hasMoreElements()) {
            sentence = tokens.nextToken();
            if (sentence.contains(key)) {
                return sentence.split(keyValueDelim)[1];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(new Test().getValue("apple = 23456, mango = 12345, orange= 76548, guava = 56734", ",", "=", "apple"));
    }
}
" I noticed you have edited your question and added your code. for your new version question you can still simply call method while reading the String from the file and get your desire value ! "
I have written this code assuming you have already stored the data from the file in a String:
public static void main(String[] args) {
    try {
        String[] CONSTANTS = {"apple", "guava"};
        String input = "apple = 23456, mango = 12345, orange= 76548, guava = 56734";
        String[] token = input.split(",");
        for (String eachToken : token) {
            String[] subToken = eachToken.split("=");
            // checking whether this data is required or not
            if (subToken[0].trim().equals(CONSTANTS[0]) || subToken[0].trim().equals(CONSTANTS[1])) {
                // this key is one of the required ones, so display it
                System.out.println(subToken[0] + " " + subToken[1]);
            } else {
                // not required, nothing to do
            }
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
}
Read a complete line using BufferedReader and pass it to a StringTokenizer with "=" as the delimiter (as used in your file).
For more help, please paste your file and what you have tried so far.
ArrayList<String> list = new ArrayList<String>();
list.add("NoScans");
list.add("NoPixels");
list.add("ProdULLat");
list.add("ProdULLon");
list.add("ProdLRLat");
list.add("ProdLRLon");

//read a line from the file.
while ((line = br.readLine()) != null) {
    lineNumber++;
    //use 'equal to' as token separator
    st = new StringTokenizer(line, "=");
    //check for tokens from the above string tokenizer.
    while (st.hasMoreTokens()) {
        String key = st.nextToken();   //this will give the first token, e.g. NoScans
        String value = st.nextToken(); //this will give the second token, e.g. 10496
        //check whether the key is present in the list. If it is present then print
        //the value, else leave it as it is.
        if (list.contains(key)) {
            //display the key/value pair
            System.out.print(key + "=" + " " + value);
        }
    }
}