Mahout: converting one large text file to SequenceFile format - java

I have done a lot of searching on the web for this, but I've found nothing, even though I feel like it has to be somewhat common. I have used Mahout's seqdirectory command to convert a folder containing text files (each file is a separate document) in the past. But in this case there are so many documents (in the 100,000s) that I have one very large text file in which each line is a document. How can I convert this large file to SequenceFile format so that Mahout understands that each line should be considered a separate document? Thank you very much for any help.

Yeah, it is not quite apparent or very intuitive how to do this, although (lucky for you :P) I have answered that exact question several times here on Stack Overflow, for instance here. Have a look ;)
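In case that link goes stale: the usual approach is to read the big file line by line and append each line to a SequenceFile under a synthetic document key. A minimal sketch, assuming hadoop-common is on the classpath; the "/doc-N" key scheme and the file names are just invented conventions, Mahout only needs a unique Text key per document:

```java
import java.io.BufferedReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class LinesToSequenceFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path out = new Path("documents.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(Text.class));
             BufferedReader reader = Files.newBufferedReader(
                     Paths.get("big-input.txt"), StandardCharsets.UTF_8)) {
            String line;
            long id = 0;
            while ((line = reader.readLine()) != null) {
                // One key/value pair per line: each line becomes its own document.
                writer.append(new Text("/doc-" + id++), new Text(line));
            }
        }
    }
}
```

The resulting documents.seq can then be fed to seq2sparse and the rest of the Mahout pipeline just like the output of seqdirectory.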

Related

Best file type to save budget data in Java?

I'm starting on a new budgeting/allowance project and I'm trying to figure out how to save the data between opening and closing the program. Arrays seem clunky and I don't know how I would save the array data to a file anyway. I've really only worked with text files before so this is new to me.
I'm assuming a database of some sort is what I need but I don't know what I should be looking for.
I know this has to be a simple issue but I honestly don't know where to start; any help is greatly appreciated.
That is entirely for you to decide; there is no one right answer here.
You can save to a text file (e.g. CSV if the data is very simple, otherwise JSON or XML are common choices), to a binary file (e.g. Java serialized objects), or some embedded database file might do.
It really depends on how complex the data is, how big the file can become, whether you want to be able to edit the file directly in a text editor, and how important load/save performance is to you.
Since it's a new project and you seem fairly new to this, I'd suggest JSON or XML, whichever of the two you are more familiar with. But that's just my opinion.
It's entirely your choice.
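To make the text-file option concrete, here is a minimal sketch using plain CSV with only java.nio.file — the BudgetStore name and the date/category/amount-in-cents row layout are invented for illustration, and real CSV would need quoting if fields can contain commas:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class BudgetStore {
    // One row per entry, e.g. {"2024-01-01", "food", "1250"} for $12.50 on food.
    public static void save(Path file, List<String[]> rows) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String[] row : rows) {
            lines.add(String.join(",", row));
        }
        Files.write(file, lines);
    }

    public static List<String[]> load(Path file) throws IOException {
        List<String[]> rows = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            rows.add(line.split(",", -1)); // -1 keeps trailing empty fields
        }
        return rows;
    }
}
```

If you later outgrow this, the save/load signatures stay the same and only the storage backend changes, which makes switching to JSON or an embedded database fairly painless.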

How to detect mistakes in IRIs in an RDF file?

I am trying to make an RDF corrector. One of the things I specifically want to correct is IRIs. My question is: irrespective of the RDF format, is there anything I can do to correct mistakes in an IRI? I understand there can be many kinds of mistakes, but what are the most common ones that I can fix?
I am using ANTLR to make the corrector. I have extended the BaseErrorListener so that it gives out the errors made in the IRI in particular.
In my experience, the errors made in the real world depend on the source. A source may be systematically creating IRIs with spaces in them, or the data may have been binary-copied between ISO-8859-1 ("Latin-1") and UTF-8 (the correct encoding), which corrupts the UTF-8. These low-level errors are best fixed with a text editor on the input file (and by correcting the code that generates them).
Try a few sample IRIs at http://www.sparql.org/iri-validator.html, which prints out warnings and errors, and is the same code as the parsers.
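For a quick programmatic pre-check before full RDF parsing, java.net.URI already rejects many of the low-level mistakes mentioned above, such as embedded spaces. Note that java.net.URI implements the URI grammar rather than full IRI support, so treat this as a rough filter; the IriCheck helper below is invented for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class IriCheck {
    /** Returns null if the IRI parses cleanly, otherwise a short error description. */
    public static String validate(String iri) {
        try {
            URI u = new URI(iri);
            if (u.getScheme() == null) {
                return "relative IRI (no scheme)";
            }
            return null;
        } catch (URISyntaxException e) {
            // e.getIndex() points at the offending character, handy for a corrector.
            return e.getReason() + " at index " + e.getIndex();
        }
    }
}
```

Hooking this into your BaseErrorListener gives you both the reason and the character offset, which is exactly what you need to attempt an automatic fix such as percent-encoding the offending character.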

Security of uploading and parsing Named Binary Tag files (NBT) via PHP

I'm building an application that deals with uploading/downloading Named Binary Tag (NBT) files.
After they're uploaded I need to parse them and get some information.
I'm a bit concerned security-wise, as I don't have the necessary knowledge to properly understand how they're built or what kind of data to expect from them.
What are some sanity checks I can perform, when the files are uploaded, to make sure that they are indeed NBT files?
Should I be concerned when parsing them?
If there's anything else I should be concerned with, please, do tell.
I realize these are vague questions. There aren't a lot of answers on Google, else I wouldn't be here.
The file format for NBT is really simple and compact. It's a binary stream (uncompressed or gzipped), which was specified by Notch.
One "problem" comes with specially crafted NBT files that contain a lot of empty lists and lists of lists ... the memory overhead of parsing these may result in service failure (mostly because the objects created for each entry fill up your memory).
One solution could be to limit the number of entries you read and, when that limit is reached, drop the parsed file.
I recently published a Java library for reading NBT files (though without such a limit); maybe it helps you understand the file format.
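A related guard is to cap the decompressed size before handing the bytes to a full parser, and to check that the stream starts with a TAG_Compound root (tag id 0x0A), which every well-formed NBT file has. A sketch using only the JDK — the NbtUpload name and the 8 MiB cap are arbitrary choices, not part of the format:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class NbtUpload {
    static final int TAG_COMPOUND = 0x0A;                 // root tag of well-formed NBT
    static final int MAX_DECOMPRESSED = 8 * 1024 * 1024;  // arbitrary 8 MiB cap

    /** Decompresses (if gzipped) with a size cap, then checks the root tag id. */
    public static byte[] readChecked(byte[] uploaded) throws IOException {
        InputStream in = new ByteArrayInputStream(uploaded);
        // gzip streams start with the magic bytes 1F 8B
        if (uploaded.length >= 2
                && (uploaded[0] & 0xFF) == 0x1F && (uploaded[1] & 0xFF) == 0x8B) {
            in = new GZIPInputStream(in);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n, total = 0;
        while ((n = in.read(buf)) != -1) {
            total += n;
            if (total > MAX_DECOMPRESSED) {
                throw new IOException("decompressed NBT exceeds size cap - rejecting");
            }
            out.write(buf, 0, n);
        }
        byte[] data = out.toByteArray();
        if (data.length == 0 || data[0] != TAG_COMPOUND) {
            throw new IOException("not an NBT file: root tag is not TAG_Compound");
        }
        return data;
    }
}
```

The size cap is what defuses the lists-of-lists bomb: a tiny gzipped upload can expand enormously, so you must count decompressed bytes, not the upload size.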
edit: forgot to share this website about the "exploit": http://arstechnica.com/security/2015/04/just-released-minecraft-exploit-makes-it-easy-to-crash-game-servers/

Java NIO - How to efficiently parse a file containing both ascii and binary data?

I have some data files looking something like this:
    text
    header
    "lots of binary data here"
    /header
    more text
    header
    "more binary data"
    /header
    ....
Most of the files are around 1-5MB in size. It's very unlikely that I will have to deal with any files larger than approximately 30MB.
I'm fairly new to Java NIO and the API looks a bit like a jungle to me. Could anyone give me some pointers on how I should go about parsing a file like this?
Would it be possible to have multiple threads consuming data from different parts of the file? The file will just be open for reading.
Redesign the file. That's a terrible design.
The question is how you would know whether you're reading text or binary data. If there is a clear demarcation between the text and binary regions (like a marker, or a defined block size), then I suspect Preon would be able to help you out. Preon has support for reading both text and binary data in a useful way. And since I'm pretty sure your binary data represents something else, you might also be able to decode the binary bits into a more useful data structure than just an array.
You can use FileChannel.map() and read the file like an array.
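As a sketch of that approach, you can map the file and scan the buffer for your ASCII markers to locate where each text/binary region starts. The SectionScanner name and the naive byte-by-byte scan are illustrative; a real parser would then slice the buffer between marker offsets, and for 1-30 MB files a single mapped read is cheap enough that multiple threads are unlikely to be worth the complexity:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class SectionScanner {
    /** Returns every byte offset at which the ASCII marker occurs in the file. */
    public static List<Integer> findMarker(Path file, String marker) throws IOException {
        byte[] m = marker.getBytes(StandardCharsets.US_ASCII);
        List<Integer> hits = new ArrayList<>();
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            for (int i = 0; i + m.length <= buf.limit(); i++) {
                int j = 0;
                while (j < m.length && buf.get(i + j) == m[j]) {
                    j++;
                }
                if (j == m.length) {
                    hits.add(i); // random-access reads, no copying of the file
                }
            }
        }
        return hits;
    }
}
```

Once you have the offsets of "header" and "/header", everything between a pair is your binary payload and everything outside is text, so you never have to guess which kind of data you are currently reading.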

Generate Images for formulas in Java

I'd like to generate an image file showing some mathematical expression, taking a String like "(x+a)^n=∑_(k=0)^n" as input and getting a more (human-)readable image file as output. AFAIK stuff like that is used on Wikipedia, for example. Are there any Java libraries that do that?
Or maybe I use the wrong approach. What would you do if the requirement was to enable pasting of formulas from MS Word into an HTML-document? I'd ask the user to just make a screenshot himself, but that would be the lazy way^^
Edit: Thanks for the answers so far, but I really do not control the input. What I get is some messy Word-style formula, not a clean LaTeX-formatted one.
Edit2: http://www.panschk.de/text.tex
Looks a bit like LaTeX doesn't it? That's what I get when I do
clipboard.getContents(RTFTransfer.getInstance()) after having pasted a formula from Word07.
First and foremost you should familiarize yourself with TeX (and LaTeX) - a famous typesetting system created by Donald Knuth. Typesetting mathematical formulae is an advanced topic with many opinions and much attention to detail - therefore use something that builds upon TeX. That way you are sure to get it right ;-)
Edit: Take a look at texvc
It can output to PNG, HTML, MathML. Check out the README
Edit #2 Convert that messy Word-stuff to TeX or MathML?
My colleague found a surprisingly simple solution for this very specific problem: When you copy formulas from Word 2007, they are also stored as "HTML" in the clipboard. As representing formulas in HTML isn't easy either, Word just creates a temporary image file on the fly and embeds it into the HTML code. You can then simply take the temporary formula image and copy it somewhere else. Problem solved ;)
What you're looking for is LaTeX.
MiKTeX is a nice little application for churning out images using LaTeX.
I'd like to look into creating them on-the-fly though...
Steer clear of LaTeX. Seriously.
Check out JEuclid. It can convert MathML expressions into images.
