Task 1: Read each row of a csv file into a separate txt file.
Task 2: The reverse: read the text from each txt file in a folder and put it into a row of a single csv file. So, read all txt files into one csv file.
How would you do this? Would Java or Python be a good choice to get this done quickly?
Update:
For Java, there are already some quite useful libraries you can use, for example opencsv or javacsv. If you have no knowledge of CSV, have a look at the Wikipedia article on it first. And this post tells you all the possibilities in Java.
Note: Due to the simplicity of the question, someone assumed this is homework. I hereby declare it is not.
More background: I am working on my own machine-learning experiments and setting up a large-scale test set. I need crawling, scraping, and file-type conversion as the basic utilities for the experiment. I am building a lot of things by myself for now, and suddenly want to learn Python because of some recent discoveries and the feeling that Python is more concise than Java for many parsing and file-handling situations. Hence this question.
I just want to save time for both of us by getting to the gist without stating the not-so-related background. And my question is more about the second part, "Java vs Python". I ran across a few lines of Python code using some csv library (? not sure, that's why I asked), but I just do not know how to use Python. Those are all the reasons why I asked this question. Thanks.
From what you write, there is little need to use anything CSV-specific. In particular for Task 1, this is a pure data I/O operation on text files. In Python, for instance:
for i, l in enumerate(open(the_file)):
    f = open('new_file_%i.csv' % i, 'w')
    f.write(l)
    f.close()
For Task 2, if you can guarantee that each file has the same structure (same number of fields per row) it is again a pure data I/O operation:
import os
from glob import glob

# glob files
files = glob('file_*.csv')
target = open('combined.csv', 'w')
for f in files:
    target.write(open(f).read())
    target.write(os.linesep)  # newline separator for your platform
target.close()
Whether you do this in Java or Python depends only on availability on the target system and your personal preference.
In that case I would use Python, since it is often more concise than Java.
Plus, CSV files are really easy to handle in Python without installing anything. I don't know about Java.
Task 1
It would roughly be this, based on an example from the official documentation:
import csv

with open('some.csv', 'r') as f:
    reader = csv.reader(f)
    rownumber = 0
    for row in reader:
        g = open("anyfile" + str(rownumber) + ".txt", "w")
        g.write(','.join(row))  # row is a list of fields, so join it back into one line
        rownumber = rownumber + 1
        g.close()
Task 2
import os

f = open("csvfile.csv", "w")
dirList = os.listdir(path)  # path of the folder containing the txt files
for fname in dirList:
    if fname.endswith(".txt"):
        g = open(os.path.join(path, fname))
        for line in g:
            f.write(line)
        g.close()
f.close()
In Python:
Task 1:
import csv

with open('file.csv', 'r') as df:
    reader = csv.reader(df)
    for rownumber, row in enumerate(reader):
        with open(str(rownumber) + '.txt', 'w') as f:
            f.write(','.join(row))  # row is a list of fields
Task 2:
from glob import glob

with open('output.csv', 'w') as output:
    for f in glob('*.txt'):
        with open(f) as myFile:
            rows = myFile.readlines()
            output.writelines(rows)  # write() expects a string; writelines() takes a list
You will need to adjust these for your use cases.
Related
I need to build an application which scans through a large number of files. These files contain blocks with some data about a session, in which each line has a different value. E.g.: "=ID: 39487".
At that point I have that line, but the problem I now face is that I need the value n lines above that ID. I was thinking about an Iterator, but it only has forward methods. I also thought about saving the results in a List, but that defeats the purpose of using a Stream, and some files are huge, so that would cause memory problems.
I was wondering whether something like this is possible using the Stream API (Files). Or perhaps a better question: is there a better way to approach this?
Stream<String> lines = Files.lines(Paths.get(file.getName()));
Iterator<String> search = lines.iterator();
You can't arbitrarily read backwards and forwards through the file with the same reader (no matter if you're using streams, iterators, or a plain BufferedReader.)
If you need:
m lines before a given line
n lines after the given line
You don't know the value of m and n in advance, until you reach that line
...then you essentially have three options:
Read the whole file once, keep it in memory, and then your task is trivial (but this uses the most memory.)
Read the whole file once, mark the line numbers that you need, then do a second pass where you extract the lines you require.
Read the whole file once, storing some form of metadata about line lengths as you go, then use a RandomAccessFile to extract the specific bits you need without having to read the whole file again.
Given that the files are huge, I'd suggest the second option here is probably the most realistic. The third will probably give you better performance, but will require much more development effort.
As an alternative, if you can guarantee that both n and m are below a certain value, and that value is a reasonable size, you could also just keep a certain number of lines in a buffer as you're processing the file, and read through that buffer when you need to read lines "backwards" (see the sketch below).
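For instance, a minimal sketch of that buffering approach; the file name, the ID string, and the fixed look-back of m lines are placeholders, not from your code:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayDeque;
import java.util.Deque;

public class LookBack {
    public static void main(String[] args) throws IOException {
        int m = 5; // maximum number of lines we ever need to look back
        Deque<String> lastLines = new ArrayDeque<>(m);
        try (BufferedReader reader = Files.newBufferedReader(Paths.get("sessions.log"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains("=ID: 39487")) {
                    // lastLines now holds up to m lines preceding the match, oldest first,
                    // so we can effectively read "backwards" from here
                    lastLines.forEach(System.out::println);
                }
                if (lastLines.size() == m) {
                    lastLines.removeFirst(); // evict the oldest buffered line
                }
                lastLines.addLast(line);
            }
        }
    }
}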
Try my library, abacus-util:
try (Reader reader = new FileReader(yourFile)) {
    StreamEx.of(reader)
            .sliding(n, n, ArrayList::new)
            .filter(l -> l.get(l.size() - 1).contains("=ID: 39487"))
            . /* then do your work */
}
It doesn't matter how big your file is, as long as n is a small number (not millions).
I have gone through lots of Stack Overflow questions and Google search results and read many discussion threads, but I couldn't find a proper answer to my question. I have a sparse matrix in .mat format containing 36600 nodes (a 36600x36600 adjacency matrix) that I need to read and manipulate (e.g. matrix-vector multiplication) in a Java environment. I applied many answers discussed here, but I always got NullPointerException errors, although there was data in those .mat files. (Some say this is because of the size of the data.) I applied the following code to my .mat file, which returned null and a NullPointerException.
MatFileReader matfilereader = new MatFileReader("sourceData.mat");
MLArray mlArrayRetrieved = matfilereader.getMLArray("data");
System.out.println(mlArrayRetrieved);
System.out.println(mlArrayRetrieved.contentToString());
I have also tried many times to convert the .mat file to .csv or .xls in the MATLAB environment and in a Python environment in a Jupyter notebook, but I did not get any results there either.
That .mat file is an adjacency matrix and will be the source for a specific algorithm in a Cytoscape project. Hence I must use it in a Java environment, and I have decided to use the COLT library for matrix manipulation. Suggestions and advice would help me a lot. Thanks for reading.
Just use find to get the rows, columns, and values of the nonzero elements, and save these as text, csv, or similar:
[row, col, v] = find(my_sparse_matrix);
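On the Java side, here is a hedged sketch of loading such a triplet file into a COLT sparse matrix; the file name triplets.csv and the comma separator are assumptions about how you export the triplets from MATLAB:

import java.io.BufferedReader;
import java.io.FileReader;
import cern.colt.matrix.impl.SparseDoubleMatrix2D;

public class LoadTriplets {
    public static void main(String[] args) throws Exception {
        int n = 36600; // the 36600x36600 adjacency matrix
        SparseDoubleMatrix2D matrix = new SparseDoubleMatrix2D(n, n);
        try (BufferedReader in = new BufferedReader(new FileReader("triplets.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] t = line.split(","); // row, col, value
                // MATLAB indices are 1-based, COLT's are 0-based
                matrix.set(Integer.parseInt(t[0].trim()) - 1,
                           Integer.parseInt(t[1].trim()) - 1,
                           Double.parseDouble(t[2].trim()));
            }
        }
        System.out.println("non-zero entries: " + matrix.cardinality());
    }
}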
Below is a code snippet using MFL that would result in a MATLAB-like printout of all values in your sparse matrix:
Mat5.readFromFile("sourceData.mat")
    .getSparse("data")
    .forEach((row, col, real, imag) -> {
        System.out.println(String.format("(%d,%d) \t %1.4f", row + 1, col + 1, real));
    });
The CSV workaround will work fine for the mentioned 750KB matrix, but it would likely become difficult to work with once data sets go beyond 50MB. MAT files store sparse data in a (binary) Compressed Sparse Column (CSC) format, which can be loaded with significantly less overhead than CSV files.
I was searching for free translation dictionaries. Freedict (freedict.org) provides the ones I need, but I don't know how to parse the *.index and *.dict files. I also don't really know what to google to find useful information about these formats.
The *.index files look like this:
00databasealphabet QdGI l
00databasedictfmt1121 B b
00databaseinfo c 5o
00databaseshort 6E u
00databaseurl 6y c
00databaseutf8 A B
a BHO M
a bad risc BHa u
a bag of nerves BII 2
[...]
and the *.dict files:
[Lot of info stuff]
German-English FreeDict Dictionary ver. 0.3.4
Pipi machen /piːpiːmaxən/
to pee; to piss
(Aktien) zusammenlegen /aktsiːəntsuːzamənleːgən/
to merge (with)
[...]
I would be glad to see some example projects (preferably in python, but java, c, c++ are also ok) to understand how to handle these files.
This is quite late; however, I hope it can be useful for others like me.
JGoerzen wrote the dictdlib library. There you can see in more detail how he parses the .index and .dict files:
https://github.com/jgoerzen/dictdlib/blob/master/dictdlib.py
dictd considers its .index and .dict[.dz] format private, to reserve itself the right to change it in the future.
If you want to process it directly anyway: the index contains the headwords, and the .dict[.dz] contains the definitions; the latter is optionally compressed with a special modified gzip algorithm providing almost-random access, which gzip normally does not. The index contains 3 columns per line, tab separated:
The headword for looking up the definition.
The absolute byte position of the definition in the .dict[.dz] file, base64 encoded.
The length of the definition in bytes, base64 encoded.
For more details, see the dict(8) man page (section Database Format), which you should have found in your research before asking your question. To process the headwords correctly, you'd have to consider encoding and character collation.
Eventually it would be better to use an existing library to read dictd databases. But that really depends on whether the library is good (I have no experience here).
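If you do decide to parse it directly, here is a rough Java sketch of a lookup; the file names are placeholders, and it only works on an uncompressed .dict, not a .dict.dz:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.RandomAccessFile;

public class DictdLookup {
    // dictd's base64 digit alphabet: 'A' = 0 ... '/' = 63
    static final String B64 =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

    // offsets and lengths are base-64 numbers, most significant digit first
    static long decode(String s) {
        long v = 0;
        for (int i = 0; i < s.length(); i++) {
            v = v * 64 + B64.indexOf(s.charAt(i));
        }
        return v;
    }

    public static void main(String[] args) throws Exception {
        String word = args[0];
        try (BufferedReader idx = new BufferedReader(new FileReader("dict.index"));
             RandomAccessFile dict = new RandomAccessFile("dict.dict", "r")) {
            String line;
            while ((line = idx.readLine()) != null) {
                String[] cols = line.split("\t"); // headword, offset, length
                if (cols[0].equals(word)) {
                    byte[] buf = new byte[(int) decode(cols[2])];
                    dict.seek(decode(cols[1]));
                    dict.readFully(buf);
                    System.out.println(new String(buf, "UTF-8"));
                }
            }
        }
    }
}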
Finally, as you noted yourself, XML is made exactly for easy processing. You could extract the headwords and translations using XPath, leaving out all the grammatical stuff, with no need to bother parsing anything yourself.
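For illustration, a minimal Java XPath sketch; the file name and the element paths (form/orth for the headword, sense/cit/quote for the translation) are assumptions to be checked against the actual FreeDict TEI source:

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class TeiExtract {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("deu-eng.tei"); // TEI source of the dictionary
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList entries = (NodeList) xpath.evaluate("//entry", doc, XPathConstants.NODESET);
        for (int i = 0; i < entries.getLength(); i++) {
            String headword = xpath.evaluate("form/orth", entries.item(i));
            String translation = xpath.evaluate("sense/cit/quote", entries.item(i));
            System.out.println(headword + " -> " + translation);
        }
    }
}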
After getting this far, the next problem would be that there is no one-to-one mapping between words in different languages...
I'm new to Java programming, and I ran into this problem:
I'm creating a program that reads a .csv file, converts its lines into objects and then manipulates these objects.
To be more specific, the application reads every line, giving it an index, and also reads certain values from those lines and stores them in TRIE trees.
The application then can read indexes from the values stored in the trees and then retrieve the full information of the corresponding line.
My problem is that, even though I've been researching for the last couple of days, I don't know how to write these structures to binary files, nor how to read them.
I want to write the lines (with their indexes) in a binary indexed file and read only the exact index that I retrieved from the TRIEs.
For the tree writing, I was looking for something like this (in C):
fwrite(tree, sizeof(struct TrieTree), 1, file)
For the "binary indexed file", I was thinking on writing objects like the TRIEs, and maybe reading each object until I've read enough to reach the corresponding index, but this probably wouldn't be very efficient.
Recapitulating, I need help in writing and reading objects in binary files and solutions on how to create an indexed file.
I think you are (for starters) best off trying to do this with serialization.
Here is just one example from stackoverflow: What is object serialization?
(I think copy&paste of the code does not make sense, please follow the link to read)
Admittedly this does not yet solve your index creation problem.
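To make the idea concrete, here is a minimal sketch; the TrieNode fields are hypothetical, the point being that a single writeObject call persists the whole tree, much like the fwrite call above:

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// The node (and everything it references) only needs to implement Serializable
class TrieNode implements Serializable {
    Map<Character, TrieNode> children = new HashMap<>();
    long lineIndex = -1; // index of the csv line this word points to
}

public class TrieIO {
    static void save(TrieNode root) throws Exception {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream("trie.bin"))) {
            out.writeObject(root); // writes the whole object graph in one call
        }
    }

    static TrieNode load() throws Exception {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream("trie.bin"))) {
            return (TrieNode) in.readObject();
        }
    }
}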
Here is an alternative to Java native serialization, Google Protocol Buffers.
I am going to write direct quotes from documentation mostly in this answer, so be sure to follow the link at the end of answer if you are interested into more details.
What is it:
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
In other words, you can serialize your structures in Java and deserialize them in .NET, Python, etc. This you don't have with Java's native serialization.
Performance:
This may vary according to use case, but in principle GPB should be faster, as it's built with performance and interchangeability in mind.
Here is stack overflow link discussing Java native vs GPB:
High performance serialization: Java vs Google Protocol Buffers vs ...?
How does it work:
You specify how you want the information you're serializing to be structured by defining protocol buffer message types in .proto files. Each protocol buffer message is a small logical record of information, containing a series of name-value pairs. Here's a very basic example of a .proto file that defines a message containing information about a person:
message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;

  enum PhoneType {
    MOBILE = 0;
    HOME = 1;
    WORK = 2;
  }

  message PhoneNumber {
    required string number = 1;
    optional PhoneType type = 2 [default = HOME];
  }

  repeated PhoneNumber phone = 4;
}
Once you've defined your messages, you run the protocol buffer compiler for your application's language on your .proto file to generate data access classes. These provide simple accessors for each field (like name() and set_name()) as well as methods to serialize/parse the whole structure to/from raw bytes.
You can then use this class in your application to populate, serialize, and retrieve Person protocol buffer messages. You might then write some code like this:
Person john = Person.newBuilder()
    .setId(1234)
    .setName("John Doe")
    .setEmail("jdoe@example.com")
    .build();
FileOutputStream output = new FileOutputStream(args[0]);
john.writeTo(output);
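Reading the message back later mirrors this, using the parseFrom method that is generated for every message type (a fragment in the same style as the documentation's example above):

// deserialize the Person written above
Person john = Person.parseFrom(new FileInputStream(args[0]));
System.out.println("Name: " + john.getName());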
Read all about it here:
https://developers.google.com/protocol-buffers/
You could look at GPB as an alternative to XML structures described by XSD, just more compact and with faster serialization.
Backstory:
I am working with LTspice, creating a circuit with over 1000 resistors.
There are 9 different types of resistors, and I need to change the value of each type many times. I can do this manually, but I don't want to. The file is like a text file and can be read by a program like Notepad. The file type is .asc.
I was going to create a java program to help me with this.
File Snippet:
SYMATTR InstName RiMC3
SYMATTR Value 0.01
SYMBOL res -1952 480 R90
WINDOW 0 0 56 VBottom 2
WINDOW 3 32 56 VTop 2
SYMATTR InstName RiMA3
SYMATTR Value 0.01
SYMBOL res -2336 160 R0
SYMATTR InstName ReC3
SYMATTR Value 8
Question:
How can I change a word that I don't know in a file, when I know where it is relative to another word that I do know?
An example:
I know the word "RiMC3"; I need to change the third word after it to "0.02".
In the file snippet the value is "0.01", but this will not always be the case.
My Solution:
I need a place to start.
Is this called something special? I have not found anything like this on Google.
If you want to do this programmatically, you need to think about the limitations and requirements.
We don't know exactly how you want to do this, or in what context. But you can write this out on paper, in English, to give you a place to start.
For example, if we are going to make a standalone Java program (or class) to do this, and given simple line-oriented text, a naive approach might be:
Open the file for read
Open a file for write
Scan the file line by line
For each line:
Match the pattern or regular expression you are looking for and, if it matches, modify the line in memory
Write out the possibly modified line to the output file
Finish up:
Close the files
Rename the output file to the input file
Buffering, error handling, and application-domain specifics are left as an exercise for the reader.
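A minimal Java sketch of that approach; the file name is a placeholder, and it assumes (as in your snippet) that the SYMATTR Value line directly follows the SYMATTR InstName line:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

public class ResistorUpdate {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("circuit.asc");
        Path output = Paths.get("circuit.asc.tmp");
        List<String> result = new ArrayList<>();
        boolean valueLineExpected = false;
        for (String line : Files.readAllLines(input)) {
            if (valueLineExpected && line.startsWith("SYMATTR Value")) {
                line = "SYMATTR Value 0.02"; // the new value for this resistor type
                valueLineExpected = false;
            }
            if (line.equals("SYMATTR InstName RiMC3")) {
                valueLineExpected = true; // the matching Value line follows this one
            }
            result.add(line);
        }
        Files.write(output, result);
        Files.move(output, input, StandardCopyOption.REPLACE_EXISTING);
    }
}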