Parse a text file into multiple text files - Java

I want to split one input file into multiple files using Java.
The input file contains thousands of protein sequences in FASTA format, and I want to generate the raw format of each sequence (i.e., without commas, semicolons, or extra symbols such as ">", "[", "]", etc.).
A FASTA record starts with a ">" symbol followed by a description of the protein, and then the protein sequence itself.
For example:
>lcl|NC_000001.10_cdsid_XP_003403591.1 [gene=LOC100652771]
[protein=hypothetical protein LOC100652771] [protein_id=XP_003403591.1] [location=join(12190..12227,12595..12721,13403..13639)]
MSESINFSHNLGQLLSPPRCVVMPGMPFPSIRSPELQKTTADLDHTLVSVPSVAESLHHPEITFLTAFCL
PSFTRSRPLPDRQLHHCLALCPSFALPAGDGVCHGPGLQGSCYKGETQESVESRVLPGPRHRH
The input file contains thousands of protein sequences in the format above. I have to generate thousands of raw files, each containing a single protein sequence without any special symbols or gaps.
I have written Java code for this, but the output is "cannot open a file" followed by "cannot find file".
Please help me solve this problem.
Regards
Vijay Kumar Garg
Varanasi
Bharat (India)
The code is
/*Java code to convert FASTA format to a raw format*/
import java.io.*;
import java.util.*;
import java.util.regex.*;
import java.io.FileInputStream;
// java package for using regular expression
public class Arrayren
{
public static void main(String args[]) throws IOException
{
String a[]=new String[1000];
String b[][] =new String[1000][1000];
/*open the id file*/
try
{
File f = new File ("input.txt");
//opening the text document containing genbank ids
FileInputStream fis = new FileInputStream("input.txt");
//Reading the file contents through inputstream
BufferedInputStream bis = new BufferedInputStream(fis);
// Writing the contents to a buffered stream
DataInputStream dis = new DataInputStream(bis);
//Method for reading Java Standard data types
String inputline;
String line;
String separator = System.getProperty("line.separator");
// reads a line till next line operator is found
int i=0;
while ((inputline=dis.readLine()) != null)
{
i++;
a[i]=inputline;
a[i]=a[i].replaceAll(separator,"");
//replaces unwanted patterns like /n with space
a[i]=a[i].trim();
// trims out if any space is available
a[i]=a[i]+".txt";
//takes the file name into an array
try
// to handle run time error
/*take the sequence in to an array*/
{
BufferedReader in = new BufferedReader (new FileReader(a[i]));
String inline = null;
int j=0;
while((inline=in.readLine()) != null)
{
j++;
b[i][j]=inline;
Pattern q=Pattern.compile(">");
//Compiling the regular expression
Matcher n=q.matcher(inline);
//creates the matcher for the above pattern
if(n.find())
{
/*appending the comment line*/
b[i][j]=b[i][j].replaceAll(">gi","");
//identify the pattern and replace it with a space
b[i][j]=b[i][j].replaceAll("[a-zA-Z]","");
b[i][j]=b[i][j].replaceAll("|","");
b[i][j]=b[i][j].replaceAll("\\d{1,15}","");
b[i][j]=b[i][j].replaceAll(".","");
b[i][j]=b[i][j].replaceAll("_","");
b[i][j]=b[i][j].replaceAll("\\(","");
b[i][j]=b[i][j].replaceAll("\\)","");
}
/*printing the sequence in to a text file*/
b[i][j]=b[i][j].replaceAll(separator,"");
b[i][j]=b[i][j].trim();
// trims out if any space is available
File create = new File(inputline+"R.txt");
try
{
if(!create.exists())
{
create.createNewFile();
// creates a new file
}
else
{
System.out.println("file already exists");
}
}
catch(IOException e)
// to catch the exception and print the error if cannot open a file
{
System.err.println("cannot create a file");
}
BufferedWriter outt = new BufferedWriter(new FileWriter(inputline+"R.txt", true));
outt.write(b[i][j]);
// printing the contents to a text file
outt.close();
// closing the text file
System.out.println(b[i][j]);
}
}
catch(Exception e)
{
System.out.println("cannot open a file");
}
}
}
catch(Exception ex)
// catch the exception and prints the error if cannot find file
{
System.out.println("cannot find file ");
}
}
}
If you provide corrected code, it will be much easier for me to understand.

This code will not win prizes, due to a lack of Java expertise. For instance, I would expect an OutOfMemoryError even if it were correct.
A rewrite would be best. Nevertheless, we all started small.
Give the full path to the file. The directory is probably also missing from the output file name.
Better to use BufferedReader etc. instead of DataInputStream.
Initialize i with -1. Better still, use for (int i = 0; i < a.length; ++i).
Compile the Pattern outside the loop. Better yet, remove the Matcher; you can use if (s.contains(">")) as well.
One does not need to create a new file.
Code:
final String encoding = "Windows-1252"; // Or "UTF-8" or leave away.
File f = new File("C:/input.txt");
BufferedReader dis = new BufferedReader(new InputStreamReader(
new FileInputStream(f), encoding));
...
int i= -1; // So i++ starts with 0.
while ((inputline=dis.readLine()) != null)
{
i++;
a[i]=inputline.trim();
//replaces unwanted patterns like /n with space
// Not needed a[i]=a[i].replaceAll(separator,"");
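A rewrite along the lines suggested above might look like the following sketch. It is not the asker's code: it assumes that the input is a single file holding many FASTA records, that every header fits on one line starting with ">", and that output files can simply be numbered. The class name FastaSplitter, the method names, and the sequence_00001.txt naming scheme are all made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the original question.
class FastaSplitter {

    /**
     * Returns one raw sequence per FASTA record: header lines are dropped,
     * sequence lines are joined, and everything except letters is removed.
     * Assumes each header is a single line starting with ">".
     */
    static List<String> extractSequences(List<String> lines) {
        List<String> sequences = new ArrayList<>();
        StringBuilder current = null;
        for (String rawLine : lines) {
            String line = rawLine.trim();
            if (line.startsWith(">")) {
                // A new record begins: store the previous one, if any.
                if (current != null) {
                    sequences.add(current.toString());
                }
                current = new StringBuilder();
            } else if (current != null) {
                // Keep only letters, so spaces and stray symbols disappear.
                current.append(line.replaceAll("[^A-Za-z]", ""));
            }
        }
        if (current != null) {
            sequences.add(current.toString());
        }
        return sequences;
    }

    /** Writes each sequence to its own numbered file inside outputDir. */
    static void split(Path input, Path outputDir) throws IOException {
        List<String> sequences =
                extractSequences(Files.readAllLines(input, StandardCharsets.UTF_8));
        Files.createDirectories(outputDir);
        for (int i = 0; i < sequences.size(); i++) {
            // The file naming scheme is an assumption made for this sketch.
            Path out = outputDir.resolve(String.format("sequence_%05d.txt", i + 1));
            Files.write(out, sequences.get(i).getBytes(StandardCharsets.UTF_8));
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FastaSplitter <input file> <output directory>
        split(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Because the exceptions are allowed to propagate out of main, a missing input file would show up with its real path and stack trace instead of a generic message. The whole file is read into memory here, which should be fine for a few thousand proteins but would need streaming for very large inputs.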

Your code contains the following two catch blocks:
catch(Exception e)
{
System.out.println("cannot open a file");
}
catch(Exception ex)
// catch the exception and prints the error if cannot find file
{
System.out.println("cannot find file ");
}
Both of these swallow the exception and print a generic "it didn't work" message, which tells you that the catch block was entered, but nothing more than that.
Exceptions often contain useful information that would help you track down where the real problem is. By ignoring them, you're making it much harder to diagnose your problem. Worse still, you're catching Exception, which is the superclass of a lot of exceptions, so these catch blocks are catching lots of different types of exceptions and ignoring them all.
The simplest way to get information out of an exception is to call its printStackTrace() method, which prints the exception type, exception message and stack trace. Add a call to this within both of these catch blocks, and that will help you see more clearly what exception is being thrown and from where.
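As a minimal sketch of that advice (the class and method names here are invented for illustration and are not from the question), each catch block below names the file involved and prints the stack trace, and the two failure modes are caught separately instead of through Exception:

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;

// Hypothetical example class, not part of the original code.
class ReportingCatch {

    /** Returns the first line of the file, or null if it could not be read. */
    static String firstLine(String fileName) {
        try (BufferedReader in = new BufferedReader(new FileReader(fileName))) {
            return in.readLine();
        } catch (FileNotFoundException e) {
            // Say which file was missing and where the failure happened.
            System.err.println("cannot find file: " + fileName);
            e.printStackTrace();
            return null;
        } catch (IOException e) {
            // Any other read failure is reported separately.
            System.err.println("cannot read file: " + fileName);
            e.printStackTrace();
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(firstLine("no-such-file.txt"));
    }
}
```

With output like this, it should become visible which file name the program actually tried to open, which is likely the quickest route to the real cause.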

Related

CsvMalformedLineException: Unterminated quoted field at end of CSV line

I am writing code to process a list of tar.gz files, inside which there are multiple, csv files. I have encountered the error below
com.opencsv.exceptions.CsvMalformedLineException: Unterminated quoted field at end of CSV line. Beginning of lost text: [,,,,,,
]
at com.opencsv.CSVReader.primeNextRecord(CSVReader.java:245)
at com.opencsv.CSVReader.flexibleRead(CSVReader.java:598)
at com.opencsv.CSVReader.readNext(CSVReader.java:204)
at uk.ac.shef.inf.analysis.Test.readAllLines(Test.java:64)
at uk.ac.shef.inf.analysis.Test.main(Test.java:42)
And the code causing this problem is below, on line B.
public class Test {
public static void main(String[] args) {
try {
Path source = Paths.get("/home/xxxx/Work/data/amazon/labelled/small/Books_5.json.1.tar.gz");
InputStream fi = Files.newInputStream(source);
BufferedInputStream bi = new BufferedInputStream(fi);
GzipCompressorInputStream gzi = new GzipCompressorInputStream(bi);
TarArchiveInputStream ti = new TarArchiveInputStream(gzi);
CSVParser parser = new CSVParserBuilder().withStrictQuotes(true)
.withQuoteChar('"').withSeparator(',')
.withEscapeChar('|') // Line A
.build();
BufferedReader br = null;
ArchiveEntry entry;
entry = ti.getNextEntry();
while (entry != null) {
br = new BufferedReader(new InputStreamReader(ti)); // Read directly from tarInput
System.out.format("\n%s\t\t > %s", new Date(), entry.getName());
try{
CSVReader reader = new CSVReaderBuilder(br).withCSVParser(parser)
.build();
List<String[]> r = readAllLines(reader);
} catch (Exception ioe){
ioe.printStackTrace();
}
System.out.println(entry.getName());
entry=ti.getNextEntry(); // Line B
}
}catch (Exception e){
e.printStackTrace();
}
}
private static List<String[]> readAllLines(CSVReader reader) {
List<String[]> out = new ArrayList<>();
int line=0;
try{
String[] lineInArray = reader.readNext();
while(lineInArray!=null) {
//System.out.println(Arrays.asList(lineInArray));
out.add(lineInArray);
line++;
lineInArray=reader.readNext();
}
}catch (Exception e){
System.out.println(line);
e.printStackTrace();
}
System.out.println(out.size());
return out;
}
}
I also attach a screenshot of the actual line within the csv file that caused this problem here, look at line 5213. I also include a test tar.gz file here: https://drive.google.com/file/d/1qHfWiJItnE19-BFdbQ3s3Gek__VkoUqk/view?usp=sharing
While debugging, I have some questions.
I think the issue is the \ character in the data file (line 5213 above), which is the escape character in Java. I verified this idea by adding line A to my code above, and it works. However, I obviously don't want to hardcode this, as there can be other characters in the data causing the same issue. So my question 1 is: is there any way to tell Java to ignore escape characters? Something like the opposite of withEscapeChar('|')? UPDATE: the answer is to use '\0', thanks to the first comment below.
When debugging, I notice that my program stops working on the next .csv file within the tar.gz file as soon as it hit the above exception. To explain what I mean, inside the tar.gz file included in the above link, there are two csvs: _10.csv and _110.csv. The problematic line is in _10.csv. When my program hit that line, an exception is thrown and the program moves on to the next file _110.csv (entry=ti.getNextEntry();). This file is actually fine, but the method readAllLines that is supposed to read this next csv file will throw the same exception immediately on the first line. I don't think my code is correct, especially the while loop: I suspect the input stream was still stuck at the previous position that caused the exception. But I don't know how to fix this. Help please?
Using RFC4180Parser worked for me.

Java -- Need help to enhance the code

I wrote a simple program to read the content from text/log file to html with conditional formatting.
Below is my code.
import java.io.*;
import java.util.*;
class TextToHtmlConversion {
public void readFile(String[] args) {
for (String textfile : args) {
try{
//command line parameter
BufferedReader br = new BufferedReader(new FileReader(textfile));
String strLine;
//Read File Line By Line
while ((strLine = br.readLine()) != null) {
Date d = new Date();
String dateWithoutTime = d.toString().substring(0, 10);
String outputfile = new String("Test Report"+dateWithoutTime+".html");
FileWriter filestream = new FileWriter(outputfile,true);
BufferedWriter out = new BufferedWriter(filestream);
out.write("<html>");
out.write("<body>");
out.write("<table width='500'>");
out.write("<tr>");
out.write("<td width='50%'>");
if(strLine.startsWith(" CustomerName is ")){
//System.out.println("value of String split Client is :"+strLine.substring(16));
out.write(strLine.substring(16));
}
out.write("</td>");
out.write("<td width='50%'>");
if(strLine.startsWith(" Logged in users are ")){
if(!strLine.substring(21).isEmpty()){
out.write("<textarea name='myTextBox' cols='5' rows='1' style='background-color:Red'>");
out.write("</textarea>");
}else{
System.out.println("else if block:");
out.write("<textarea name='myTextBox' cols='5' rows='1' style='background-color:Green'>");
out.write("</textarea>");
} //closing else block
//out.write("<br>");
out.write("</td>");
}
out.write("</td>");
out.write("</tr>");
out.write("</table>");
out.write("</body>");
out.write("</html>");
out.close();
}
//Close the input stream
br.close();
}catch (Exception e){//Catch exception if any
System.err.println("Error: " + e.getMessage());
e.printStackTrace();
}
}
}
public static void main(String args[]) {
TextToHtmlConversion myReader = new TextToHtmlConversion();
String fileArray[] = {"D:/JavaTesting/test.log"};
myReader.readFile(fileArray);
}
}
I was thinking of enhancing my program, and I am unsure whether I should use Maps or a properties file to store the search strings. I am also looking for an approach that avoids the substring method with hardcoded line indexes. Any suggestions are truly appreciated.
From top to bottom:
Don't use wildcard imports.
Don't use the default package.
Restructure your readFile method into several smaller methods.
Use the new Java 7 file API to read files.
Try to use a try block with a resource (your file).
I wouldn't write to the file continuously; write it once at the end.
Don't catch the general Exception.
Use a finally block to close resources (or the try block mentioned before).
And in general: don't create HTML by appending strings; this is a bad pattern in its own right. But well, it seems that is what you want to do.
Edit
Oh, one more: your text file contains some data, right? If your data represents some entities (or objects), it would be good to create a POJO for this. I think your text file contains users (right?). Then create a class called User and parse the text file to get a list of all users in it. Something like:
List<User> users = User.parse("your-file.txt");
Afterwards you have a nice user object and all your ugly parsing is in one central point.
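A rough sketch of such a class follows. It is only a guess at the log format: it assumes each "CustomerName is" line is followed by a "Logged in users are" line, and the class name CustomerStatus, its fields, and its methods are invented for illustration. Using the length of a named prefix instead of a hardcoded index is one way to avoid calls like substring(16).

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical POJO; the log format is assumed from the question's code.
final class CustomerStatus {
    private static final String NAME_PREFIX = "CustomerName is";
    private static final String USERS_PREFIX = "Logged in users are";

    final String customerName;
    final String loggedInUsers;

    CustomerStatus(String customerName, String loggedInUsers) {
        this.customerName = customerName;
        this.loggedInUsers = loggedInUsers;
    }

    boolean hasLoggedInUsers() {
        return !loggedInUsers.isEmpty();
    }

    /** Parses log lines; a users line is paired with the preceding name line. */
    static List<CustomerStatus> parse(List<String> lines) {
        List<CustomerStatus> result = new ArrayList<>();
        String name = null;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith(NAME_PREFIX)) {
                // The prefix length replaces the hardcoded index 16.
                name = line.substring(NAME_PREFIX.length()).trim();
            } else if (line.startsWith(USERS_PREFIX) && name != null) {
                String users = line.substring(USERS_PREFIX.length()).trim();
                result.add(new CustomerStatus(name, users));
                name = null;
            }
        }
        return result;
    }

    /** Reads the file with the Java 7 file API and parses its lines. */
    static List<CustomerStatus> parse(String fileName) throws IOException {
        return parse(Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8));
    }
}
```

The HTML generation could then loop over the returned list and pick red or green from hasLoggedInUsers(), keeping all of the string handling in one place.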

How do I quadruple the integer values in a text file? [closed]

I guess this comes down to reading and writing to the same file. I would like to be able to return the same text file as is input, but with all integer values quadrupled. Should I even be attempting this with Java, or is it better to write to a new file and overwrite the original .txt file?
In essence, I'm trying to transform This:
12
fish
55 10 yellow 3
into this:
48
fish
220 40 yellow 12
Here's what I've got so far. Currently, it doesn't modify the .txt file.
import java.io.*;
import java.util.Scanner;
public class CharacterStretcher
{
public static void main(String[] args)
{
Scanner keyboard = new Scanner( System.in );
System.out.println("Copy and paste the path of the file to fix");
// get which file you want to read and write
File file = new File(keyboard.next());
File file2 = new File("temp.txt");
BufferedReader reader;
BufferedWriter writer;
try {
// new a writer and point the writer to the file
FileInputStream fstream = new FileInputStream(file);
// Use DataInputStream to read binary NOT text.
reader = new BufferedReader(new InputStreamReader(fstream));
writer = new BufferedWriter(new FileWriter(file2, true));
String line = "";
String temp = "";
int var = 0;
int start = 0;
System.out.println("000");
while ((line = reader.readLine()) != null)
{
System.out.println("a");
if(line.contains("="))
{
System.out.println("b");
var = 0;
temp = line.substring(line.indexOf('='));
for(int x = 0; x < temp.length(); x++)
{
System.out.println(temp.charAt(x));
if(temp.charAt(x)>47 && temp.charAt(x)<58) //if 0<=char<=9
{
if(start==0)
start = x;
var*=10;
var+=temp.indexOf(x)-48; //converts back into single digit
}
else
{
if(start!=0)
{
temp = temp.substring(0, start) + var*4 + temp.substring(x);
//writer.write(line.substring(0, line.indexOf('=')) + temp);
//TODO: Currently writes a bunch of garbage to the end of the file, how to write in the middle?
//move x if var*4 has an extra digit
if((var<10 && var>2)
|| (var<100 && var>24)
|| (var<1000 && var>249)
|| (var<10000 && var>2499))
x++;
}
//start = 0;
}
System.out.println(temp + " " + start);
}
if(start==0)
writer.write(line);
else
writer.write(temp);
}
}
System.out.println("end");
// writer the content to the file
//writer.write("I write something to a file.");
// always remember to close the writer
writer.close();
//writer = null;
file2.renameTo(file); //TODO: Not sure if this works...
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
Given that this is a pretty quick and simple hack of a formatted text file, I don't think you need to be too clever about it.
Your logic for deciding whether you are looking at a number is pretty complex and I'd say it's overkill.
I've written up a basic outline of what I'd do in this instance.
It's not very clever or impressive, but should get the job done I think.
I've left out the overwriting and reading the input from the console so you get to do some of the implementation yourself ;-)
import java.io.*;
public class CharacterStretcher {
public static void main(String[] args) {
//Assumes the input is at c:\data.txt
File inputFile = new File("c:\\data.txt");
//Assumes the output is at c:\temp.txt
File outputFile = new File("c:\\temp.txt");
try {
//Construct a file reader and writer
final FileInputStream fstream = new FileInputStream(inputFile);
final BufferedReader reader = new BufferedReader(new InputStreamReader(fstream));
final BufferedWriter writer = new BufferedWriter(new FileWriter(outputFile, false));
//Read the file line by line...
String line;
while ((line = reader.readLine()) != null) {
//Create a StringBuilder to build our modified lines that will
//go into the output file
StringBuilder newLine = new StringBuilder();
//Split each line from the input file by spaces
String[] parts = line.split(" ");
//For each part of the input line, check if it's a number
for (String part : parts) {
try {
//If we can parse the part as an integer, we assume
//it's a number because it almost certainly is!
int number = Integer.parseInt(part);
//We add this to our new line, but multiply it by 4
newLine.append(String.valueOf(number * 4));
} catch (NumberFormatException nfEx) {
//If we couldn't parse it as an integer, we just add it
//to the new line - it's going to be a String.
newLine.append(part);
}
//Add a space between each part on the new line
newLine.append(" ");
}
//Write the new line to the output file remembering to chop the
//trailing space off the end, and remembering to add the line
//breaks
writer.append(newLine.toString().substring(0, newLine.toString().length() - 1) + "\r\n");
writer.flush();
}
//Close the file handles.
reader.close();
writer.close();
} catch (FileNotFoundException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
}
}
You may want to consider one of these:
1. Build the new file in memory, rather than trying to write to the same file you are reading from. You could use StringBuilder for this.
2. Write to a new file, then overwrite the old file with the new one. This SO Question may help you there.
With both of these, you will be able to see your whole output, separate from the input file.
Additionally, with option (2), you don't have the risk of the operation failing in the middle and giving you a messed up file.
Now, you certainly can modify the file in-place. But it seems like unnecessary complexity for your case, unless you have really huge input files.
At the very least, if you try it this way first, you can narrow down on why the more complicated version is failing.
You cannot read and simultaneously write to the same file, because this would modify the text you are currently reading. This means you must first write a modified new file and later rename it to the original one. You probably need to remove the original file before renaming.
For renaming, you can use File.renameTo or see one of the many SO's questions
You seem to parse integers in your code by collecting single digits and adding them up. You should consider using either a Scanner.nextInt or employ Integer.parseInt.
You can read your file line by line, split the words at white space and then parse them and check if it is either an integer or some other word.
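Put together, the suggestions above (split on spaces, try Integer.parseInt on each word, write to a temporary file, then replace the original) could be sketched as follows. The class name Quadrupler and its method names are made up for this example, and Files.move with REPLACE_EXISTING is used here in place of File.renameTo as one possible way to do the replacement.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch, not the asker's or any answerer's original code.
class Quadrupler {

    /** Multiplies every space-separated integer in the line by 4. */
    static String quadrupleLine(String line) {
        // The -1 limit keeps trailing empty parts so spacing is preserved.
        String[] parts = line.split(" ", -1);
        StringBuilder result = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                result.append(' ');
            }
            try {
                result.append(Integer.parseInt(parts[i]) * 4);
            } catch (NumberFormatException e) {
                // Not an integer: copy the word unchanged.
                result.append(parts[i]);
            }
        }
        return result.toString();
    }

    /** Rewrites the file through a temporary file in the same directory. */
    static void quadrupleFile(Path file) throws IOException {
        Path temp = Files.createTempFile(file.toAbsolutePath().getParent(), "quad", ".tmp");
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(temp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(quadrupleLine(line));
                writer.newLine();
            }
        }
        // The original is only replaced once the new file is complete.
        Files.move(temp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        quadrupleFile(Paths.get(args[0]));
    }
}
```

Note that this sketch does not guard against integer overflow for very large values, and it treats a word like "+5" as the number 5.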

Search text file for a specific line

I want to search for specific lines of text in a text file. If the piece of text I am looking for is on a specific line, I would like to read further on that line for more input.
So far I have 3 tags I am looking for.
#public
#private
#virtual
If I find any of these on a line, I would like to read what comes next so for example I could have a line like this:
#public double getHeight();
If I determine that the tag I found is #public then I have to take the following part after the white-space until I reach the semicolon. The problem is, that I can't really think of an efficient way to do this without excessive use of charAt(..) which neither looks pretty but probably isn't good either in the long run for a large file, or for multiple files in a row.
I would like help to solve this efficiently as I currently can't comprehend how I would do it. The code itself is used to parse comments in a C++ file, to later generate a Header file. The Pseudo Code part is where I am stuck. Some people suggest BufferedReader, others say Scanner. I went with Scanner as that seems to be the replacement for BufferedReader.
public void run() {
Scanner scanner = null;
String filename, path;
StringBuilder puBuilder, prBuilder, viBuilder;
puBuilder = new StringBuilder();
prBuilder = new StringBuilder();
viBuilder = new StringBuilder();
for(File f : files) {
try {
filename = f.getName();
path = f.getCanonicalPath();
scanner = new Scanner(new FileReader(f));
} catch (FileNotFoundException ex) {
System.out.println("FileNotFoundException: " + ex.getMessage());
} catch (IOException ex) {
System.out.println("IOException: " + ex.getMessage());
}
String line;
while((line = scanner.nextLine()) != null) {
/**
* Pseudo Code
* if #public then
* puBuilder.append(line.substring(after white space)
* + line.substring(until and including the semicolon);
*/
}
}
}
I may be misunderstanding you.. but are you just looking for String.contains()?
if(line.contains("#public")){}
String tag = "";
if(line.startsWith("#public")){
tag = "#public";
}else if{....other tags....}
line = line.substring(tag.length(), line.indexOf(";")).trim();
This gives you a string that goes from the end of the tag (which in this case is #public) to the character preceding the semicolon, and then trims off the whitespace on the ends.
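A self-contained variant of the startsWith/substring idea, covering all three tags from the question, might look like this. The class name TagParser and its methods are invented for illustration; unlike the snippet above, this version keeps the semicolon, since the question's pseudo code asks for the text "until and including the semicolon".

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, written only to illustrate the approach.
class TagParser {
    private static final List<String> TAGS =
            Arrays.asList("#public", "#private", "#virtual");

    /** Returns the tag the line starts with, or null if there is none. */
    static String tagOf(String line) {
        String trimmed = line.trim();
        for (String tag : TAGS) {
            // Require a space after the tag so "#publicX" does not match.
            if (trimmed.startsWith(tag + " ")) {
                return tag;
            }
        }
        return null;
    }

    /** Returns the text after the tag up to and including the semicolon, or null. */
    static String declarationOf(String line) {
        String tag = tagOf(line);
        if (tag == null) {
            return null;
        }
        String trimmed = line.trim();
        int end = trimmed.indexOf(';', tag.length());
        if (end < 0) {
            return null; // tagged line without a terminating semicolon
        }
        return trimmed.substring(tag.length(), end + 1).trim();
    }

    public static void main(String[] args) {
        String line = "#public double getHeight();";
        System.out.println(tagOf(line) + " -> " + declarationOf(line));
    }
}
```

Each line is scanned a small, fixed number of times by startsWith and indexOf, so no character-by-character charAt loop is needed, and the result can be appended to the matching StringBuilder based on the returned tag.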
if (line.startsWith("#public")) {
...
}
If you are allowed to use open source libraries, I suggest using the Apache commons-io and commons-lang libraries. These are widely used Java libraries that will make your life a lot simpler.
String text = null;
InputStream in = null;
List<String> lines = null;
for(File f : files) {
try{
in = new FileInputStream(f);
lines = IOUtils.readLines(in);
for (String line: lines){
if (line.contains("#public")) {
text = StringUtils.substringBetween(line, "#public", ";");
...
}
}
}
catch (Exception e){
...
}
finally{
// always remember to close the resource
IOUtils.closeQuietly(in);
}
}

BufferedReader not reading file (Android)

I am having a problem reading files with BufferedReader. I am trying to read in a dictionary file where every word is on its own line. It works for one file I have, but when I tried adding a larger wordlist file (the enable wordlist), the first read, 'while ((currentLine=br.readLine()) != null)', causes an exception with no description. Please help!
try
{
InputStream is = this.getResources().openRawResource(R.raw.enable1);
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String currentLine=null;
while ((currentLine=br.readLine()) != null)
{
dictionaryList.add(currentLine);
}
br.close();
}
catch (Exception e)
{
//error here
}
*Looks like there is a file size limit of 1048576 bytes; otherwise it crashes.
As I said in the edit, the new wordlist was over 1048576 bytes and was causing an IO exception without any error message (I had a string set to e.getMessage() in the catch block, but the message was null).
What I did was divide the wordlist into separate files based on word size (by the way, there are 26 different files! Message me if you want them).
Then, depending on the size of the word I have, I load the specific wordlist; all of the files are named in the format enable# (# is the word size). If anyone wants to know, I am doing it like this:
int wordListID=0;
String wordList="enable"+goodText.length();
try {
Class res = R.raw.class;
Field field = res.getField(wordList);
wordListID= field.getInt(null);
}
catch (Exception e) {
//something
}
I then send that specific wordListID to:
InputStream is = this.getResources().openRawResource(wordListID);
and now I have a small enough file, which actually helps my performance too!
*This is my first application so I may not be doing things the correct way... just trying to get the hang of things
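The reflective lookup described above can be tried outside Android with a stand-in class. In the sketch below, RawIds is a hypothetical substitute for the generated R.raw class, and its constants and the helper name idForName are made up; only the getField/getInt mechanism mirrors the answer.

```java
import java.lang.reflect.Field;

// Hypothetical stand-in for the generated R.raw class; the IDs are invented.
class RawIds {
    public static final int enable3 = 1003;
    public static final int enable4 = 1004;
}

// Hypothetical helper wrapping the reflective lookup from the answer.
class ResourceLookup {

    /** Returns the value of the public static int field, or 0 if it is missing. */
    static int idForName(Class<?> holder, String name) {
        try {
            Field field = holder.getField(name);
            // Static field, so no instance is needed.
            return field.getInt(null);
        } catch (NoSuchFieldException | IllegalAccessException e) {
            return 0; // treated as "no such resource" in this sketch
        }
    }

    public static void main(String[] args) {
        String word = "word";
        System.out.println(idForName(RawIds.class, "enable" + word.length()));
    }
}
```

On Android itself, Resources.getIdentifier(name, "raw", getPackageName()) looks like a possible alternative to reflection for the same name-to-ID lookup, though that has not been checked against this particular project.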
