How to fix a text file wrt to punctuation?


I am working on an independent project and am having trouble converting a text file into the proper format. My program currently reads the file line by line and assumes that one line is one sentence, but this breaks when someone inserts a paragraph with punctuation scattered throughout. What I want to do is put each sentence on its own line and then read from that file. I didn't want to come empty-handed, so I tried it the only way I could. It works with short strings, but for longer text files I had to use streams, and I ran into this error: (File name too long)
Example:
Input: This is a dummy sentence. Hello this is one too. And this one too.
Output:
This is a dummy sentence.
Hello this is one too.
And this one too.
This works:
public static void main(String args[])
{
String text = "Joanne had one requirement: Her child must be" +
" adopted by college graduates. So the doctor arranged" +
" for the baby to be placed with a lawyer and his wife." +
" Paul and Clara named their new baby Steven Paul Jobs.";
Pattern pattern = Pattern.compile("\\?|\\.|\\!|\\¡|\\¿");
Matcher matcher = pattern.matcher(text);
StringBuilder text_fixed = new StringBuilder();
String withline = "";
int starter = 0;
String overall = "";
String blankspace = " ";
while (matcher.find())
{
int holder = matcher.start();
System.out.println("=========> " + holder);
/***/
withline = text.substring(starter, holder + 1);
withline = withline + "\r\n";
overall = overall + withline;
System.out.println(withline);
starter = holder + 2;
}
System.out.println(overall);
//return overall;
}
This version fails with the (File name too long) error:
public static void main(String[] args) throws IOException
{
final String INPUT_FILE = "practice.txt";
InputStream in = new FileInputStream(INPUT_FILE);
String fixread = getStringFromInputStream(in);
String fixedspace = fixme(fixread);
File ins = new File(fixedspace);
BufferedReader reader = new BufferedReader(new FileReader(ins));
Pattern p = Pattern.compile("\n");
String line, sentence;
String[] t;
while ((line = reader.readLine()) != null )
{
t = p.split(line); /**hold curr sentence and remove it from OG txt file since you will reread.*/
sentence = t[0];
indiv_sentences.add(sentence);
}
//putSentencestoTrie(indiv_sentences);
//runAutocompletealt();
}
private static String fixme(String fixread)
{
Pattern pattern = Pattern.compile("\\?|\\.|\\!|\\¡|\\¿");
String actString = fixread.toString();
Matcher matcher = pattern.matcher(actString);
String withline = "";
int starter = 0;
String overall = "";
while (matcher.find())
{
int holder = matcher.start();
withline = actString.substring(starter, holder + 1);
withline = withline + "\r\n";
overall = overall + withline;
starter = holder + 2;
}
return overall;
}
/**this is not my code, this was provided by an outside source, I do not take credit*/
/**http://www.mkyong.com/java/how-to-convert-inputstream-to-string-in-java/*/
private static String getStringFromInputStream(InputStream is) {
BufferedReader br = null;
StringBuilder sb = new StringBuilder();
String line;
try {
br = new BufferedReader(new InputStreamReader(is));
while ((line = br.readLine()) != null) {
sb.append(line);
}
} catch (IOException e) {
e.printStackTrace();
} finally {
if (br != null) {
try {
br.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
return sb.toString();
}
https://github.com/ChristianCSE/Phrase-Finder
I am pretty sure this is all the code I use for this section, but if you need to see the rest of my code I provided a link to my repository. Thanks!

The problem is that you are creating a File whose name is the text that was supposed to be its content, and that text is too long for a file name.
String fixedspace = fixme(fixread);
File ins = new File(fixedspace); // this is the issue: the content is passed as the file name
Instead, give the file a short name and write the output into it. One example:
String fixedspace = fixme(fixread);
File out = new File("output.txt");
// try-with-resources closes the writer, which flushes the text to disk
try (FileWriter fr = new FileWriter(out)) {
fr.write(fixedspace);
}
Then read from output.txt and continue.
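For reference, here is a minimal self-contained sketch of the whole round trip: split the text into sentences, write them to a named file, and read them back one per line. The class name SentenceSplitter, the method splitSentences, and the temp-file name are made up for illustration, and the regex is a simplification that treats every '.', '?' or '!' as the end of a sentence (so abbreviations such as "Dr." would be split, and trailing text without a terminator is dropped).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SentenceSplitter {
    // Hypothetical helper: returns one trimmed sentence per list element.
    // A "sentence" here is any run of text that ends in '.', '?' or '!'.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        Matcher m = Pattern.compile("[^.?!]+[.?!]").matcher(text);
        while (m.find()) {
            sentences.add(m.group().trim());
        }
        return sentences;
    }

    public static void main(String[] args) throws IOException {
        String text = "This is a dummy sentence. Hello this is one too. And this one too.";

        // The file gets a short name; the sentences are its content.
        Path out = Files.createTempFile("sentences", ".txt");
        Files.write(out, splitSentences(text), StandardCharsets.UTF_8);

        // Reading it back yields one sentence per line.
        for (String line : Files.readAllLines(out, StandardCharsets.UTF_8)) {
            System.out.println(line);
        }
    }
}
```

Because the sentences are kept in a List, the intermediate file is only needed if something else has to read it; otherwise the list can be used directly.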

Related

ArrayList with search terms check a .txt file for duplicates

I have to revise my finished code because it is too long. The idea that was suggested to me is to put the search terms in an ArrayList and run them against a .txt file, whose lines are then stored in another ArrayList. Duplicates should be skipped rather than read in.
boolean allegefunden = false;
BufferedReader reader;
String zeile = null;
ArrayList<String> arr = new ArrayList<String>();
try {
reader = new BufferedReader(new FileReader("C:\\Dev\\lesenUndSchreibenInput.txt"));
zeile = reader.readLine();
while (zeile != null) {
if (zeile.contains((CharSequence) suchbegriffe)) {
arr.add(zeile);
allegefunden = true;
} else if (allegefunden == true && zeile.contains((CharSequence) suchbegriffe)) {
} else
arr.add(zeile);
However, the normal contains method does not work.

I'm not sure I understand fully what you're trying to do, but to replace the contains method, you could build a regexp at the beginning of the search:
private static Pattern reFromWords(List<String> searchWords) {
String s = searchWords.stream()
.map(Pattern::quote)
.collect(Collectors.joining("|"));
return Pattern.compile(s);
}
Searching in a file would then look like this:
Pattern regexp = reFromWords(suchbegriffe);
try (Reader fileReader = new FileReader("yourfile.txt");
BufferedReader reader = new BufferedReader(fileReader)) {
String line = reader.readLine();
while (line != null) {
Matcher matcher = regexp.matcher(line);
if (matcher.find()) {
String foundWord = matcher.group();
System.out.println("Found " + foundWord + " in line: " + line);
}
line = reader.readLine();
}
} catch (IOException ex) {
Logger.getLogger(Main.class.getName()).log(Level.SEVERE, null, ex);
}
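If the goal is to skip repeated hits, the same pattern can be combined with a Set. The sketch below is only one reading of the question: it assumes that a line is a "duplicate" when it contains a search term that an earlier line already contained, and that lines without any search term are always kept. DuplicateFilter and filterLines are hypothetical names, the search term list is assumed to be non-empty, and only the first match in each line is considered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class DuplicateFilter {
    // Keeps every line without a search term, plus the first line for each term.
    static List<String> filterLines(List<String> lines, List<String> searchWords) {
        Pattern regexp = Pattern.compile(
                searchWords.stream().map(Pattern::quote).collect(Collectors.joining("|")));
        Set<String> seen = new HashSet<>();
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            Matcher m = regexp.matcher(line);
            // Set.add returns false when the term was already seen,
            // so later lines with the same term are skipped.
            if (!m.find() || seen.add(m.group())) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("apple 1", "pear", "apple 2", "plum 1", "plum 2");
        List<String> terms = Arrays.asList("apple", "plum");
        System.out.println(filterLines(lines, terms));
    }
}
```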

Java: Read file as long as a new date is found

I want to read the file and add each entry to an ArrayList, one entry per date. The date itself should be part of the entry.
File Example:
15.09.2002 Hello, this is the first entry.
\t this line, I also need in the first entry.
\t this line, I also need in the first entry.
\t this line, I also need in the first entry.
17.10.2020 And this ist the next entry
I tried this, but the reader reads only the first line:
public class versuch1 {
public static void main(String[] args) {
ArrayList<String> liste = new ArrayList<String>();
String lastLine = "";
String str_all = "";
String currLine = "";
try {
FileReader fstream = new FileReader("test.txt");
BufferedReader br = new BufferedReader(fstream);
while ((currLine = br.readLine()) != null) {
Pattern p = Pattern
.compile("[0-3]?[0-9].[0-3]?[0-9].(?:[0-9]{2})?[0-9]{2} [0-2]?[0-9]:[0-6]?[0-9]:[0-5]");
Matcher m = p.matcher(currLine);
if (m.find() == true) {
lastLine = currLine;
liste.add(lastLine);
} else if (m.find() == false) {
str_all = currLine + " " + lastLine;
liste.set((liste.indexOf(currLine)), str_all);
}
}
br.close();
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
}
System.out.print(liste.get(0) + " " + liste.get(1));
}
}

I have solved my problem :)
public class versuch1 {
public static void main(String[] args) {
ArrayList<String> liste = new ArrayList<String>();
String lastLine = "";
String currLine = "";
String str_all = "";
try {
FileReader fstream = new FileReader("test.txt");
BufferedReader br = new BufferedReader(fstream);
currLine = br.readLine();
while (currLine != null) {
Pattern p = Pattern
.compile("[0-3]?[0-9].[0-3]?[0-9].(?:[0-9]{2})?[0-9]{2} [0-2]?[0-9]:[0-6]?[0-9]:[0-5]");
Matcher m = p.matcher(currLine);
if (m.find() == true) {
liste.add(currLine);
lastLine = currLine;
} else if (m.find() == false) {
liste.set((liste.size() - 1), (str_all));
lastLine = str_all;
}
currLine = br.readLine();
str_all = lastLine + currLine;
}
} catch (Exception e) {
System.err.println("Error: " + e.getMessage());
}
System.out.print(liste.get(1) + " ");
}
}

While reading the lines, keep a "current entry".
If the line read begins with a date, then it belongs to a new entry. In this case add the current entry to the list of entries and create a new current entry consisting of the read line.
If the line did not begin with a date, just add it to the current entry.
For this to work, you need to read the first line into the current entry before the loop. And after the loop you need to add the current entry to the list of entries. This in turn only works if there is at least one line and the first line begins with a date. So handle the special case of no lines specially (use if-else). And report an error if the first line does not begin with a date.
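The steps above can be sketched as follows. This is a minimal illustration, not a drop-in replacement: EntryGrouper and groupEntries are hypothetical names, the lines are passed in as a List (in a real program they could come from Files.readAllLines), and the date format dd.MM.yyyy at the start of a line is an assumption taken from the example file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class EntryGrouper {
    // Assumed date format: dd.MM.yyyy at the very start of the line.
    private static final Pattern DATE_START =
            Pattern.compile("^\\d{1,2}\\.\\d{1,2}\\.\\d{4}");

    static List<String> groupEntries(List<String> lines) {
        List<String> entries = new ArrayList<>();
        if (lines.isEmpty()) {
            return entries; // special case: no lines, no entries
        }
        if (!DATE_START.matcher(lines.get(0)).find()) {
            throw new IllegalArgumentException("First line does not begin with a date");
        }
        // The first line starts the current entry before the loop.
        StringBuilder current = new StringBuilder(lines.get(0));
        for (String line : lines.subList(1, lines.size())) {
            if (DATE_START.matcher(line).find()) {
                // A date starts a new entry: store the finished one first.
                entries.add(current.toString());
                current = new StringBuilder(line);
            } else {
                // Continuation line: append it to the current entry.
                current.append(' ').append(line.trim());
            }
        }
        entries.add(current.toString()); // the last entry is still pending
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "15.09.2002 Hello, this is the first entry.",
                "\t this line, I also need in the first entry.",
                "17.10.2020 And this is the next entry");
        for (String entry : groupEntries(lines)) {
            System.out.println(entry);
        }
    }
}
```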
Happy coding.

Search for multiline String in a text file

I have a text file in which I am trying to search for a string that spans multiple lines. I am able to search for a single-line string, but I need to search for a multi-line string.
The single-line search below works fine.
public static void main(String[] args) throws IOException
{
File f1=new File("D:\\Test\\test.txt");
String[] words=null;
FileReader fr = new FileReader(f1);
BufferedReader br = new BufferedReader(fr);
String s;
String input="line one";
// here i want to search for multilines as single string like
// String input ="line one"+
// "line two";
int count=0;
while((s=br.readLine())!=null)
{
words=s.split("\n");
for (String word : words)
{
if (word.equals(input))
{
count++;
}
}
}
if(count!=0)
{
System.out.println("The given String "+input+ " is present for "+count+ " times ");
}
else
{
System.out.println("The given word is not present in the file");
}
fr.close();
}
And below are the file contents.
line one
line two
line three
line four

Use a StringBuilder for that: read every line from the file and append it to the StringBuilder, followed by the line separator.
StringBuilder lineInFile = new StringBuilder();
while((s=br.readLine()) != null){
lineInFile.append(s).append(System.lineSeparator());
}
Now build the search string the same way and check whether lineInFile contains it:
StringBuilder searchString = new StringBuilder();
searchString.append("line one");
searchString.append(System.lineSeparator());
searchString.append("line two");
System.out.println(lineInFile.toString().contains(searchString));
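Since the question asks for a count rather than a yes/no answer, the contains check can be extended with indexOf. The following is a small sketch under a few assumptions: MultiLineSearch and countOccurrences are hypothetical names, the file content is passed in as a list of lines (for example from Files.readAllLines), lines are joined with "\n" on both sides so the platform line separator does not matter, and occurrences are counted without overlap. Note that this is a plain substring match, so the first and last search lines may also match in the middle of a longer line.

```java
import java.util.Arrays;
import java.util.List;

public class MultiLineSearch {
    // Counts non-overlapping occurrences of the search lines inside the file lines.
    static int countOccurrences(List<String> fileLines, List<String> searchLines) {
        String haystack = String.join("\n", fileLines);
        String needle = String.join("\n", searchLines);
        if (needle.isEmpty()) {
            return 0; // avoid an endless loop on an empty search string
        }
        int count = 0;
        int from = 0;
        while ((from = haystack.indexOf(needle, from)) >= 0) {
            count++;
            from += needle.length(); // continue after this match
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> fileLines = Arrays.asList("line one", "line two", "line three", "line four");
        List<String> searchLines = Arrays.asList("line one", "line two");
        System.out.println(countOccurrences(fileLines, searchLines));
    }
}
```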

A more complicated solution in the style of C (the code is based on code from the book «The C Programming Language»):
final String searchFor = "Ich reiß der Puppe den Kopf ab\n" +
"Ja, ich reiß' ich der Puppe den Kopf ab";
int found = 0;
try {
String fileContent = new String(Files.readAllBytes(
new File("puppe-text").toPath()
));
int i, j, k;
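// Stop early enough that the search string still fits, so charAt stays in bounds
for (i = 0; i + searchFor.length() <= fileContent.length(); i++) {
// Advance k and j only while the characters match
for (k = i, j = 0; j < searchFor.length() && fileContent.charAt(k) == searchFor.charAt(j); k++, j++) {
// nothing: the loop header does the comparison
}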
if (j == searchFor.length()) {
++found;
}
}
} catch (IOException ignore) {}
System.out.println(found);

Why don't you just normalize all the lines of the file into one string variable and then count the occurrences of the input in it? I have used a regex to count the occurrences, but it can be done in any way you find suitable.
public static void main(String[] args) throws IOException
{
File f1=new File("test.txt");
String[] words=null;
FileReader fr = new FileReader(f1);
BufferedReader br = new BufferedReader(fr);
String s;
String input="line one line two";
// here i want to search for multilines as single string like
// String input ="line one"+
// "line two";
int count=0;
String fileStr = "";
while((s=br.readLine())!=null)
{
// Normalizing the whole file to be stored in one single variable
fileStr += s + " ";
}
// Now count the occurrences; Pattern.quote makes the input match as literal text
Pattern p = Pattern.compile(Pattern.quote(input));
Matcher m = p.matcher(fileStr);
while (m.find()) {
count++;
}
System.out.println(count);
fr.close();
}
For large files, use a StringBuilder instead of += for efficient string concatenation.

Try Scanner.findWithinHorizon():
String pathToFile = "/home/user/lines.txt";
String s1 = "line two";
String s2 = "line three";
String pattern = String.join(System.lineSeparator(), s1, s2);
int count = 0;
try (Scanner scanner = new Scanner(new FileInputStream(pathToFile))) {
while (scanner.hasNext()) {
String withinHorizon = scanner.findWithinHorizon(pattern, pattern.length());
if (withinHorizon != null) {
count++;
} else {
scanner.nextLine();
}
}
} catch (FileNotFoundException e) {
e.printStackTrace();
}
System.out.println(count);

Try this:
public static void main(String[] args) throws IOException {
File f1 = new File("./src/test/test.txt");
FileReader fr = new FileReader(f1);
BufferedReader br = new BufferedReader(fr);
String input = "line one";
int count = 0;
String line;
while ((line = br.readLine()) != null) {
if (line.contains(input)) {
count++;
}
}
if (count != 0) {
System.out.println("The given String " + input + " is present for " + count + " times ");
} else {
System.out.println("The given word is not present in the file");
}
fr.close();
}

YouTube auto generated caption file has non sequential timing

I'm using the YouTube API v3 to upload videos. When I then request the auto-generated caption file, I get the following file with non-sequential timing:
1
00:00:00,000 --> 00:00:06,629
good weekend uh how was my weekend we
2
00:00:05,549 --> 00:00:14,960
don't do this we are
3
00:00:06,629 --> 00:00:14,960
yeah it's good Roman yeah well I gotta
Sample video : https://youtu.be/F2TVsMD_bDQ
So why is the end of each subtitle slot not the start of the next one?

After searching for days and digging through the YouTube documentation, I found nothing that fixes this issue, so I solved it on my own. I wrote code that uses regular expressions to fix the subtitle timing order. I have tested it against 5 videos and it worked perfectly:
/**
*
* @author youans
*/
public class SubtitleCorrector {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
try {
String fileContent = null;
File inFile = new File("/IN_DIRECTORY/Test Video Bad Format.srt");
BufferedReader br = new BufferedReader(new FileReader(inFile));
try {
StringBuilder sb = new StringBuilder();
String line = br.readLine();
while (line != null) {
sb.append(line);
sb.append("\n");
line = br.readLine();
}
fileContent = sb.toString();
} finally {
br.close();
}
String ragex = "\\d{2}:\\d{2}:\\d{2},\\d{3}";
List<String> slotsTiming = new ArrayList(new TreeSet(getAllMatches(fileContent, ragex)));
System.out.println(slotsTiming.size());
String timingRagex = "(((^1\n)|(\\n\\d+\n))(\\d{2}:\\d{2}:\\d{2},\\d{3}.*\\d{2}:\\d{2}:\\d{2},\\d{3}))";
ragex = timingRagex + "[A-Za-z-,;'\"\\s]+";
List<String> subtitleSlots = getAllMatches(fileContent, ragex);
List<String> textOnlySlots = new ArrayList();
for (String subtitleSlot : subtitleSlots) {
textOnlySlots.add(subtitleSlot.replaceAll(timingRagex + "|\n", ""));
}
StringBuilder sb = new StringBuilder("");
for (int i = 0; i < textOnlySlots.size(); i++) {
sb.append((i + 1)).append("\n").append(slotsTiming.get(i)).append(" --> ").append(slotsTiming.get(i + 1)).append("\n").append(textOnlySlots.get(i)).append("\n\n");
}
File outFile = new File("/OUT_DIRECTOR/" + inFile.getName().replaceFirst("[.][^.]+$|bad format", "") + "_edited.SRT");
PrintWriter pw = new PrintWriter(outFile);
pw.write(sb.toString());
pw.flush();
pw.close();
} catch (Exception ex) {
ex.printStackTrace();
}
}
public static List<String> getAllMatches(String text, String regex) {
List matches = new ArrayList<>();
Matcher m = Pattern.compile("(?=(" + regex + "))").matcher(text);
while (m.find()) {
matches.add(m.group(1));
}
return matches;
}
}
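A simpler variant of the same idea, shown here only as a sketch and not as the method used above: leave the text of each slot untouched and shorten the end time of a slot whenever it runs past the start of the next one. CueClamper and clampOverlaps are hypothetical names. The sketch assumes every timing line has the fixed-width form HH:MM:SS,mmm --> HH:MM:SS,mmm, which means the timestamps can be compared as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CueClamper {
    private static final Pattern TIMING = Pattern.compile(
            "(\\d{2}:\\d{2}:\\d{2},\\d{3}) --> (\\d{2}:\\d{2}:\\d{2},\\d{3})");

    static String clampOverlaps(String srt) {
        // First pass: collect the start time of every slot in file order.
        List<String> starts = new ArrayList<>();
        Matcher m = TIMING.matcher(srt);
        while (m.find()) {
            starts.add(m.group(1));
        }

        // Second pass: copy the file, rewriting end times that overlap the next slot.
        StringBuilder out = new StringBuilder();
        m = TIMING.matcher(srt);
        int slot = 0;
        int last = 0;
        while (m.find()) {
            String end = m.group(2);
            // Fixed-width timestamps sort lexicographically in time order.
            if (slot + 1 < starts.size() && starts.get(slot + 1).compareTo(end) < 0) {
                end = starts.get(slot + 1);
            }
            out.append(srt, last, m.start())
               .append(m.group(1)).append(" --> ").append(end);
            last = m.end();
            slot++;
        }
        out.append(srt.substring(last)); // text after the last timing line
        return out.toString();
    }

    public static void main(String[] args) {
        String srt = "1\n00:00:00,000 --> 00:00:06,629\ngood weekend uh how was my weekend we\n\n"
                + "2\n00:00:05,549 --> 00:00:14,960\ndon't do this we are\n\n"
                + "3\n00:00:06,629 --> 00:00:14,960\nyeah it's good Roman yeah well I gotta\n";
        System.out.print(clampOverlaps(srt));
    }
}
```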

Trying to extract a substring from a buffered reader that reads between certain tags

I'm reading 5 web pages with a BufferedReader, each separated by a blank line. I want to use substring to extract each page's URL, HTML, source, and date, but I need guidance on how to use substring properly to achieve this. Cheers.
public static List<WebPage> readRawTextFile(Context ctx, int resId) {
InputStream inputStream = ctx.getResources().openRawResource(
R.raw.pages);
InputStreamReader inputreader = new InputStreamReader(inputStream);
BufferedReader buffreader = new BufferedReader(inputreader);
String line;
StringBuilder text = new StringBuilder();
try {
while ((line = buffreader.readLine()) != null) {
if (line.length() == 0) {
// ignore for now
//Will be used when blank line is encountered
}
if (line.length() != 0) {
//here I want the substring to pull out the correctStrings
int sURL = line.indexOf("<!--");
int eURL = line.indexOf("-->");
line.substring(sURL,eURL);
// Problem is here
}
}
} catch (IOException e) {
return null;
}
return null;
}

I think what you want is something like this:
public class Test {
public static void main(String args[]) {
String text = "<!--Address:google.co.uk.html-->";
String converted1 = text.replaceAll("\\<!--", "");
String converted2 = converted1.replaceAll("\\-->", "");
System.out.println(converted2);
}
}
Result: Address:google.co.uk.html

In the catch block, don't just return null; call e.printStackTrace(). It will help you find out if something went wrong.
String str1 = "<!--Address:google.co.uk.html-->";
// Approach 1
int st = str1.indexOf("<!--"); // gives index which starts from <
int en = str1.indexOf("-->"); // gives index which starts from -
str1 = str1.substring(st + 4, en);
System.out.println(str1);
// Approach 2
String str2 = "<!--Address:google.co.uk.html-->";
str2 = str2.replaceAll("[<>!-]", "");
System.out.println( str2);
Note: be aware that replaceAll treats its first argument as a regex, so the character class in approach 2 removes every <, >, ! and - anywhere in the string, not just the comment delimiters.
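A slightly more defensive version of approach 1 is sketched below. It assumes that a line may not contain the delimiters at all, in which case indexOf returns -1 and substring would throw. CommentExtractor and extractComment are hypothetical names.

```java
import java.util.Optional;

public class CommentExtractor {
    // Returns the text between "<!--" and "-->", or empty if either delimiter is missing.
    static Optional<String> extractComment(String line) {
        int start = line.indexOf("<!--");
        if (start < 0) {
            return Optional.empty();
        }
        // Search for the closing delimiter only after the opening one.
        int end = line.indexOf("-->", start + 4);
        if (end < 0) {
            return Optional.empty();
        }
        return Optional.of(line.substring(start + 4, end).trim());
    }

    public static void main(String[] args) {
        System.out.println(extractComment("<!--Address:google.co.uk.html-->").orElse("(none)"));
        System.out.println(extractComment("no comment here").orElse("(none)"));
    }
}
```

Inside the read loop from the question, the result could then be checked with isPresent() before it is stored, so blank or malformed lines are skipped instead of crashing the parser.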
