Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
EDIT: I'm mostly parsing "comma-separated values"; fuzzy brought that term to my attention.
Interpreting the blocks of CSV is the main question here.
I know how to read the file into something like a String[] and some of the basic features of String, but I don't think using methods like contains() and analyzing everything character by character will work.
What are some ways I can do this in a smarter way?
Example of a line:
-barfoob: boobs, foob, "foo bar"
There's a reason that everyone assumes you're talking about XML: inventing a proprietary text-based file format requires very strong justification in the face of the maturity and easy availability of XML parsers.
And your question indicates that you have very little prior knowledge about parsers (otherwise you'd be writing an ANTLR or JavaCC grammar instead of asking this question) - which is another strong argument against rolling your own, except as a learning experience.
Since the input is "formatted similarly to HTML", then it is likely that your data is best represented using a tree-like structure, and also, it is likely that it is XML or similar to XML.
If this is the case, I propose the smartest way to parse your file is to use an XML parser.
Here are some resources you may find helpful:
A chapter on XML parsing from Sun: http://java.sun.com/developer/Books/xmljava/ch03.pdf
An article that might help you get started quickly: http://onjava.com/pub/a/onjava/2002/06/26/xml.html
HTH
If the document is valid XML, then any of the other answers will work. If it's not, you'll have to lex.
You should look at ANTLR. Even if you want to write the parser yourself, ANTLR is a great alternative. Or at least look at YAML.
This and digging through wikipedia for related articles will probably suffice.
I think the java.util.Scanner will help you. Have a look at http://java.sun.com/javase/6/docs/api/java/util/Scanner.html
Depending on how complicated your "schema" is, a regular expression might be what you want. If there is a lot of nesting then it might be easiest to convert to XML or JSON and use a prebuilt parser.
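As a rough illustration of the regular-expression route, here is a minimal sketch for the sample line from the question. The class name RegexLineParser and the exact patterns are assumptions made for this example; it assumes every line looks like "-key: v1, v2, ..." and that quoted values contain no escaped quotes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class RegexLineParser {
    // "-key: rest of line"; group 1 is the key, group 2 the value list.
    private static final Pattern LINE = Pattern.compile("-(\\w+):\\s*(.*)");
    // Either a double-quoted value (group 1) or a bare value up to the next comma (group 2).
    private static final Pattern VALUE = Pattern.compile("\"([^\"]*)\"|([^,\\s\"][^,\"]*)");

    static String key(String line) {
        return match(line).group(1);
    }

    static List<String> values(String line) {
        List<String> result = new ArrayList<>();
        Matcher v = VALUE.matcher(match(line).group(2));
        while (v.find()) {
            // Quoted values are taken verbatim; bare values lose trailing spaces.
            result.add(v.group(1) != null ? v.group(1) : v.group(2).trim());
        }
        return result;
    }

    private static Matcher match(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unexpected line format: " + line);
        }
        return m;
    }

    public static void main(String[] args) {
        String line = "-barfoob: boobs, foob, \"foo bar\"";
        System.out.println(key(line) + " -> " + values(line));
    }
}
```

Because the quoted alternative is tried first, a comma inside quotes stays part of the value. Nested structures would still need a real parser.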
People are right about standard formats being best practice, but let's set that aside.
Assuming that the example you give is representative, the task is pretty trivial.
You show a line with an initial token, delimited by a colon-space, then a list of comma-separated values. Separate at that first colon-space, and then use split() on the part to the right. Handling of the quotes is trivial, too.
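The split-based approach described here could look roughly like the following sketch. The class name LineSplitter, the decision to drop the leading "-", and the quote stripping are assumptions for illustration. Note that a plain split() on commas would also cut a quoted value that itself contains a comma.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper illustrating "split at the first colon-space, then split()".
final class LineSplitter {
    static String key(String line) {
        String token = line.substring(0, separatorIndex(line));
        // Assumption: the leading '-' is a marker, not part of the key.
        return token.startsWith("-") ? token.substring(1) : token;
    }

    static List<String> values(String line) {
        String rest = line.substring(separatorIndex(line) + 2);
        List<String> result = new ArrayList<>();
        for (String raw : rest.split(",")) {
            String value = raw.trim();
            // Strip one pair of surrounding double quotes, if present.
            if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                value = value.substring(1, value.length() - 1);
            }
            result.add(value);
        }
        return result;
    }

    private static int separatorIndex(String line) {
        int index = line.indexOf(": ");
        if (index < 0) {
            throw new IllegalArgumentException("No colon-space in line: " + line);
        }
        return index;
    }

    public static void main(String[] args) {
        String line = "-barfoob: boobs, foob, \"foo bar\"";
        System.out.println(key(line) + " -> " + values(line));
    }
}
```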
After looking at your sample input, I fail to see any resemblance to HTML or XML:
-barfoob: boobs, foob, "foo bar"
If this is what you want to parse, I have an alternative suggestion, to use the Java properties parser (comes with standard Java), and then parse the remainder of each line using your own custom code. You will need to refactor your format somewhat in order for this to work, so it's up to you.
barfoob=boobs, foob, "foo bar"
Java properties will be able to return barfoob as the property name, and boobs, foob, "foo bar" as the property value. That's where you can use your custom code to split the property value into boobs, foob and foo bar.
I'd strongly advise you not to reinvent the wheel and to use an existing solution like Flatworm, Fixedformat4j or jFFP, all of which can parse positional or comma-separated values files (personally, I recommend Flatworm).
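A minimal sketch of the properties-based idea, assuming the file has been rewritten into key=value form as suggested. The class name PropertiesLineReader is made up for this example; java.util.Properties itself is standard Java.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical wrapper around the standard java.util.Properties parser.
final class PropertiesLineReader {
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            // Properties.load splits each line at the first '=' (or ':').
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = load("barfoob=boobs, foob, \"foo bar\"\n");
        // The raw value still needs custom code to split it into items.
        System.out.println(props.getProperty("barfoob"));
    }
}
```

The quotes are left in the value untouched, so the custom splitting step still has to deal with them.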
You may be able to use the Neko HTML parser to some degree. It depends on how it handles the non-standard HTML.
If the XML is valid, I personally prefer using http://www.xom.nu simply because it features a nice DOM model. As pointed out, though, there are parsers in J2SE.
Related
In the thread What’s your favorite “programmer ignorance” pet peeve?, the following answer appears, with a large number of upvotes:
Programmers who build XML using string concatenation.
My question is, why is building XML via string concatenation (such as a StringBuilder in C#) bad?
I've done this several times in the past, as it's sometimes the quickest way for me to get from point A to point B when it comes to the data structures/objects I'm working with. So far, I have come up with a few reasons why this isn't the greatest approach, but is there something I'm overlooking? Why should this be avoided?
Probably the biggest reason I can think of is you need to escape your strings manually, and most new programmers (and even some experienced programmers) will forget this. It will work great for them when they test it, but then "randomly" their apps will fail when someone throws an & symbol in their input somewhere. Ok, I'll buy this, but it's really easy to prevent the problem (SecurityElement.Escape to name one).
When I do this, I usually omit the XML declaration (i.e. <?xml version="1.0"?>). Is this harmful?
Performance penalties? If you stick with proper string concatenation (i.e. StringBuilder), is this anything to be concerned about? Presumably, a class like XmlWriter will also need to do a bit of string manipulation...
There are more elegant ways of generating XML, such as using XmlSerializer to automatically serialize/deserialize your classes. Ok sure, I agree. C# has a ton of useful classes for this, but sometimes I don't want to make a class for something really quick, like writing out a log file or something. Is this just me being lazy? If I am doing something "real" this is my preferred approach for dealing w/ XML.
You can end up with invalid XML, but you will not find out until you parse it again - and then it is too late. I learned this the hard way.
I think readability, flexibility and scalability are important factors. Consider the following piece of Linq-to-Xml:
XDocument doc = new XDocument(
    new XDeclaration("1.0", "UTF-8", "yes"),
    new XElement("products",
        from p in collection
        select new XElement("product",
            new XAttribute("guid", p.ProductId),
            new XAttribute("title", p.Title),
            new XAttribute("version", p.Version))));
Can you find a way to do it easier than this? I can output it to a browser, save it to a document, add attributes/elements in seconds and so on ... just by adding couple lines of code. I can do practically everything with it without much of effort.
Actually, I find the biggest problem with string concatenation is not getting it right the first time, but rather keeping it right during code maintenance. All too often, a perfectly-written piece of XML using string concat is updated to meet a new requirement, and string concat code is just too brittle.
As long as the alternatives were XML serialization and XmlDocument, I could see the simplicity argument in favor of string concat. However, ever since XDocument et al., there is just no reason to use string concat to build XML anymore. See Sander's answer for the best way to write XML.
Another benefit of XDocument is that XML is actually a rather complex standard, and most programmers simply do not understand it. I'm currently dealing with a person who sends me "XML", complete with unquoted attribute values, missing end tags, improper case sensitivity, and incorrect escaping. But because IE accepts it (as HTML), it must be right! Sigh... Anyway, the point is that string concatenation lets you write anything, but XDocument will force standards-complying XML.
I wrote a blog entry back in 2006 moaning about XML generated by string concatenation; the simple point is that if an XML document fails to validate (encoding issues, namespace issues and so on) it is not XML and cannot be treated as such.
I have seen multiple problems with XML documents that can be directly attributed to generating XML documents by hand using string concatenation, and nearly always around the correct use of encoding.
Ask yourself this; what character set am I currently encoding my document with ('ascii7', 'ibm850', 'iso-8859-1' etc)? What will happen if I write a UTF-16 string value into an XML document that has been manually declared as 'ibm850'?
Given the richness of the XML support in .NET with XmlDocument and now especially with XDocument, there would have to be a seriously compelling argument for not using these libraries over basic string concatenation IMHO.
I think the problem is that you aren't treating the XML file as a logical data store, but as a simple text file into which you write strings.
It's obvious that those libraries do the string manipulation for you, but reading and writing XML should be approached much like saving data to a database or something logically similar.
If you need trivial XML then it's fine. It's just that the maintainability of string concatenation breaks down when the XML becomes larger or more complex. You pay either at development or at maintenance time. The choice is always yours, but history suggests that maintenance is always more costly, and thus anything that makes it easier is generally worthwhile.
You need to escape your strings manually. That's right. But is that all? Sure, you can put the XML spec on your desk and double-check every time that you've considered every possible corner-case when you're building an XML string. Or you can use a library that encapsulates this knowledge...
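The thread is about C#, but the same point can be illustrated in Java with the standard StAX writer (javax.xml.stream.XMLStreamWriter). In this sketch the class name ProductXml and the element and attribute names are made up for the example. The concatenating method passes an ampersand straight through, while the writer escapes it.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical example class comparing the two approaches.
final class ProductXml {
    // Naive approach: the caller's text ends up in the markup unescaped.
    static String byConcatenation(String title) {
        return "<product title=\"" + title + "\"></product>";
    }

    // Library approach: the writer knows the escaping rules for attribute values.
    static String byWriter(String title) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("product");
            writer.writeAttribute("title", title);
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(byConcatenation("Fish & Chips")); // not well-formed XML
        System.out.println(byWriter("Fish & Chips"));
    }
}
```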
Another point against using string concatenation is that the hierarchical structure of the data is not clear when reading the code. In Sander's example of Linq-to-XML, for example, it's clear to what parent element the "product" element belongs, to what element the "title" attribute applies, etc.
As you said, it's just awkward to build XML correctly using string concatenation, especially now that you have LINQ to XML, which allows for simple construction of an XML graph and will get namespaces, etc. correct.
Obviously context and how it is being used matters, such as in the logging example string.Format can be perfectly acceptable.
But too often people ignore these alternatives when working with complex XML graphs and just use a StringBuilder.
The main reason is DRY: Don't Repeat Yourself.
If you use string concat to do XML, you will constantly be repeating the functions that keep your string as a valid XML document. All the validation would be repeated, or not present. Better to rely on a class that is written with XML validation included.
I've always found creating an XML to be more of a chore than reading in one. I've never gotten the hang of serialization - it never seems to work for my classes - and instead of spending a week trying to get it to work, I can create an XML file using strings in a mere fraction of the time and write it out.
And then I load it in using an XMLReader tree. And if the XML file doesn't read as valid, I go back and find the problem within my saving routines and correct it. But until I get a working save/load system, I refuse to perform mission-critical work until I know my tools are solid.
I guess it comes down to programmer preference. Sure, there are different ways of doing things, for sure, but for developing/testing/researching/debugging, this would be fine. However I would also clean up my code and comment it before handing it off to another programmer.
Because regardless of the fact you're using StringBuilder or XMLNodes to save/read your file, if it is all gibberish mess, nobody is going to understand how it works.
Maybe it won't ever happen, but what if your environment switches to XML 2.0 someday? Your string-concatenated XML may or may not be valid in the new environment, but XDocument will almost certainly do the right thing.
Okay, that's a reach, but especially if your not-quite-standards-compliant XML doesn't specify an XML version declaration... just saying.
Closed 10 years ago.
I want to know if there is an API to do text analysis in Java. Something that can extract all words in a text, separate words, expressions, etc. Something that can inform if a word found is a number, date, year, name, currency, etc.
I'm just starting with text analysis, so I only need an API to kick off. I made a web crawler, and now I need something to analyze the downloaded data. I need methods to count the number of words in a page, find similar words, detect data types, and other resources related to the text.
Are there APIs for text analysis in Java?
EDIT: Text mining: I want to mine the text, and I'm looking for a Java API that provides this.
It looks like you're looking for a Named Entity Recogniser.
You have got a couple of choices.
CRFClassifier from the Stanford Natural Language Processing Group, is a Java implementation of a Named Entity Recogniser.
GATE (General Architecture for Text Engineering), an open source suite for language processing. Take a look at the screenshots at the page for developers: http://gate.ac.uk/family/developer.html. It should give you a brief idea what this can do. The video tutorial gives you a better overview of what this software has to offer.
You may need to customise one of them to fit your needs.
You also have other options:
simple text extraction via Web services: e.g. Tagthe.net and Yahoo's Term Extractor.
part-of-speech (POS) tagging: extracting part-of-speech (e.g. verbs, nouns) from the text. Here is a post on SO: What is a good Java library for Parts-Of-Speech tagging?.
In terms of training for CRFClassifier, you could find a brief explanation at their FAQ:
...the training data should be in tab-separated columns, and you
define the meaning of those columns via a map. One column should be
called "answer" and has the NER class, and existing features know
about names like "word" and "tag". You define the data file, the map,
and what features to generate via a properties file. There is
considerable documentation of what features different properties
generate in the Javadoc of NERFeatureFactory, though ultimately you
have to go to the source code to answer some questions...
You can also find a code snippet at the javadoc of CRFClassifier:
Typical command-line usage
For running a trained model with a provided serialized classifier on a
text file:
java -mx500m edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier conll.ner.gz -textFile samplesentences.txt
When specifying all parameters in a properties file (train, test, or
runtime):
java -mx1g edu.stanford.nlp.ie.crf.CRFClassifier -prop propFile
To train and test a simple NER model from the command line:
java -mx1000m edu.stanford.nlp.ie.crf.CRFClassifier -trainFile trainFile -testFile testFile -macro > output
For example, you might use some classes from the standard java.text package, or use StreamTokenizer (which you can customize to your requirements). But as you know, text from internet sources usually contains many spelling mistakes, so for better results you need something like a fuzzy tokenizer; java.text and the other standard utilities are too limited in that context.
So I'd advise you to use regular expressions (java.util.regex) and create your own kind of tokenizer according to your needs.
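As one possible starting point for such a regex-based tokenizer, the sketch below counts word occurrences in a page. The class name WordCounter and the definition of a "word" (letters or digits, optionally joined by an apostrophe or hyphen) are assumptions for illustration; it does no fuzzy matching of misspellings.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer built only on java.util.regex.
final class WordCounter {
    // A word: runs of letters/digits, optionally joined by ' or - (e.g. "cat's", "web-crawler").
    private static final Pattern WORD =
            Pattern.compile("[\\p{L}\\p{Nd}]+(?:['-][\\p{L}\\p{Nd}]+)*");

    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            // Case-insensitive counting: "The" and "the" are the same word.
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("The cat saw the other cat's hat."));
    }
}
```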
P.S.
According to your needs - you might create state-machine parser for recognizing templated parts in raw texts. You might see simple state-machine recognizer on the picture below (you can construct more advanced parser, which could recognize much more complex templates in text).
If you're dealing with large amounts of data, maybe Apache's Lucene will help with what you need.
Otherwise it might be easiest to just create your own Analyzer class that leans heavily on the standard Pattern class. That way, you can control what text is considered a word, boundary, number, date, etc. E.g., is 20110723 a date or number? You might need to implement a multiple-pass parsing algorithm to better "understand" the data.
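A hand-rolled classifier along these lines might look like the sketch below. The class name TokenClassifier, the set of categories, and the policy of treating an eight-digit token as a date whenever it is a valid yyyyMMdd calendar date are all assumptions for this example; a real analyzer would likely need context from a second pass to decide.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Pattern;

// Hypothetical classifier based on java.util.regex.Pattern.
final class TokenClassifier {
    enum Kind { DATE, NUMBER, CURRENCY, WORD, OTHER }

    private static final Pattern EIGHT_DIGITS = Pattern.compile("\\d{8}");
    private static final Pattern NUMBER = Pattern.compile("[+-]?\\d+(\\.\\d+)?");
    // \p{Sc} is the Unicode "currency symbol" category ($ and similar).
    private static final Pattern CURRENCY = Pattern.compile("\\p{Sc}\\d+(\\.\\d{2})?");
    private static final Pattern WORD = Pattern.compile("\\p{L}+");

    static Kind classify(String token) {
        // Assumed policy: a valid yyyyMMdd date wins over a plain number.
        if (EIGHT_DIGITS.matcher(token).matches() && isCalendarDate(token)) return Kind.DATE;
        if (NUMBER.matcher(token).matches()) return Kind.NUMBER;
        if (CURRENCY.matcher(token).matches()) return Kind.CURRENCY;
        if (WORD.matcher(token).matches()) return Kind.WORD;
        return Kind.OTHER;
    }

    private static boolean isCalendarDate(String token) {
        try {
            LocalDate.parse(token, DateTimeFormatter.BASIC_ISO_DATE);
            return true;
        } catch (DateTimeParseException e) {
            return false; // e.g. month 13 or day 32
        }
    }

    public static void main(String[] args) {
        for (String t : new String[] {"20110723", "20111345", "$9.99", "hello"}) {
            System.out.println(t + " -> " + classify(t));
        }
    }
}
```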
I recommend looking at LingPipe too. If you are OK with webservices then this article has a good summary of different APIs
I'd adapt Lucene's Analysis and Stemmer classes rather than reinvent the wheel. They have the vast majority of cases covered. See also the additional and contrib classes.
I want to write Java code to build a LALR parser for my grammar. Can someone please suggest some books or some links where I can learn how to write Java code for a LALR parser?
Writing a LALR parser by hand is difficult, but it can be done. If you want to learn the theory behind constructing such parsers by hand, consider looking into "Parsing Techniques: A Practical Guide" by Grune and Jacobs. It's an excellent book on general parsing techniques, and the chapter on LR parsing is particularly good.
If you're more interested in just getting a LALR parser that is written in Java, consider looking into Java CUP, which is a general purpose parser generator for Java.
Hope this helps!
You can split the LALR functionality in two parts: preparation of the tables and parsing the input.
The first part is complex and error-prone, so even if you like knowing how it works, I suggest using a proven table generator for the LALR states (and for the tokenizer DFA as well).
The second part consists of consuming those tables using some quite simple algorithms to tokenize and process the input into a parse tree/concrete syntax tree. This is easier to implement yourself if you like to do so, and you still have full control over how it works and what it does.
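To give a feel for how small the table-consuming part can be, here is a sketch of a shift/reduce driver loop. The tables are written by hand for a toy grammar chosen for this example (S -> ( S ) | x) rather than produced by a generator, single characters stand in for tokens, and the loop only accepts or rejects instead of building a tree. The class name TinyLrParser is made up.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical table-driven LR driver for the toy grammar:
//   rule 1: S -> ( S )      rule 2: S -> x
final class TinyLrParser {
    private static final int ACCEPT = Integer.MAX_VALUE;
    private static final int END = 3; // column of the end-of-input marker

    // Number of symbols on the right-hand side of each rule (index 0 unused).
    private static final int[] RHS_LENGTH = {0, 3, 1};

    // ACTION[state][column]: n > 0 shift to state n, n < 0 reduce by rule -n, 0 error.
    // Columns: '(' = 0, ')' = 1, 'x' = 2, end of input = 3.
    private static final int[][] ACTION = {
        { 2,  0,  3,  0},      // state 0: start
        { 0,  0,  0, ACCEPT},  // state 1: S seen at top level
        { 2,  0,  3,  0},      // state 2: after '('
        { 0, -2,  0, -2},      // state 3: after 'x'
        { 0,  5,  0,  0},      // state 4: after '(' S
        { 0, -1,  0, -1},      // state 5: after '(' S ')'
    };

    // GOTO on the nonterminal S for each state (-1 means no transition).
    private static final int[] GOTO_S = {1, -1, 4, -1, -1, -1};

    static boolean parse(String input) {
        Deque<Integer> states = new ArrayDeque<>();
        states.push(0);
        int pos = 0;
        while (true) {
            int col = pos < input.length() ? column(input.charAt(pos)) : END;
            if (col < 0) return false; // character outside the token set
            int action = ACTION[states.peek()][col];
            if (action == ACCEPT) return true;
            if (action > 0) {          // shift: consume the token
                states.push(action);
                pos++;
            } else if (action < 0) {   // reduce: pop the rule body, then follow GOTO
                for (int i = 0; i < RHS_LENGTH[-action]; i++) states.pop();
                states.push(GOTO_S[states.peek()]);
            } else {
                return false;          // error entry
            }
        }
    }

    private static int column(char c) {
        switch (c) {
            case '(': return 0;
            case ')': return 1;
            case 'x': return 2;
            default:  return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("((x))"));
        System.out.println(parse("(x"));
    }
}
```

With a real generator the ACTION and GOTO tables would be loaded from its output, and each reduce step would also build a tree node.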
When doing parsing tasks, I personally use the free GOLD Parsing System, which has a nice UI for creating and debugging the grammar and it does also generate table files which can then be loaded and processed by an existing engine or your own implementation (the file format for these CGT files is well documented).
As previously stated, you would always use a parser generator to produce an LALR parser. A few such tools for Java are:
SableCC (my personal favourite)
CUP
Beaver3
SJPT
Gold
Just want to mention that my project CookCC ( http://coconut2015.github.io/cookcc/ ) is a LALR(1) parser + Lexer (much like flex).
The unique feature of CookCC is that you can write your lexer and parser in Java using Java annotations. See the calculator example here: https://github.com/coconut2015/cookcc/blob/master/tests/javaap/calc/Calculator.java
I would like to be able to parse XML that isn't necessarily well-formed. I'd be looking for a fuzzy rather than a strict parser, able to recover from badly nested tags, for example. I could write my own but it's worth asking here first.
Update:
What I'm trying to do is extract links and other info from HTML. In the case of well-formed XML I can use the Scala XML API. In the case of ill-formed XML, it would be nice to somehow convert it into correct XML (somehow) and deal with it the same way, otherwise I'd have to have two completely different sets of functions for dealing with documents.
Obviously because the input is not well-formed and I'm trying to create a well-formed tree, there would have to be some heuristic involved (such as when you see <parent><child></parent> you would close the <child> first and when you then see a <child> you ignore it). But of course this isn't a proper grammar and so there's no correct way of doing it.
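The heuristic described above can be sketched with a stack of open tags. The sketch is in Java rather than Scala, the class name TagBalancer is made up, and it only recognizes bare tags without attributes, so it is an illustration of the idea rather than a usable HTML cleaner.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stack-based repair of badly nested tags.
final class TagBalancer {
    // Group 1: "/" for a closing tag; group 2: tag name. Attributes are not handled.
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)\\s*>");

    static String balance(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();
        Matcher m = TAG.matcher(input);
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // copy text between tags unchanged
            last = m.end();
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);
                out.append('<').append(name).append('>');
            } else if (open.contains(name)) {
                // Close everything opened after the matching tag, then the tag itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // Otherwise: a closing tag with no open partner is dropped.
        }
        out.append(input.substring(last));
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // close what is left open
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<parent><child></parent></child>"));
    }
}
```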
What you're looking for would not be an XML parser. XML is very strict about nesting, closing, etc. One of the other answers suggests Tag Soup. This is a good suggestion, though technically it is much closer to a lexer than a parser. If all you want from XML-ish content is an event stream without any validation, then it's almost trivial to roll your own solution. Just loop through the input, consuming content which matches regular expressions along the way (this is exactly what Tag Soup does).
The problem is that a lexer is not going to be able to give you many of the features you want from a parser (e.g. production of a tree-based representation of the input). You have to implement that logic yourself because there is no way that such a "lenient" parser would be able to determine how to handle cases like the following:
<parent>
<child>
</parent>
</child>
Think about it: what sort of tree would you expect to get out of this? There's really no sane answer to that question, which is precisely why a parser isn't going to be of much help.
Now, that's not to say that you couldn't use Tag Soup (or your own hand-written lexer) to produce some sort of tree structure based on this input, but the implementation would be very fragile. With tree-oriented formats like XML, you really have no choice but to be strict, otherwise it becomes nearly impossible to get a reasonable result (this is part of why browsers have such a hard time with compatibility).
Try the parser on the XHtml object. It is much more lenient than the one on XML.
Take a look at htmlcleaner. I have used it successfully to convert "HTML from the wild" to valid XML.
Try Tag Soup.
JTidy does something similar but only for HTML.
I mostly agree with Daniel Spiewak's answer. This is just another way to create "your own parser".
While I don't know of any Scala-specific solution, you can try using Woodstox, a Java library that implements the StAX API. (Since it is an event-based API, I am assuming it will be more fault tolerant than a DOM parser.)
There is also a Scala wrapper around Woodstox called Frostbridge, developed by the same guy who made the Simple Build Tool for Scala.
I had mixed opinions about Frostbridge when I tried it, but perhaps it is more suitable for your purposes.
I agree with the answers that turning invalid XML into "correct" XML is impossible.
Why don't you just do a regular text search for the hrefs if that's all you're interested in? One issue would be commented out links, but if the XML is invalid, it might not be possible to tell what is intended to be commented out!
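A plain text search for hrefs could be as simple as the following sketch, shown in Java. The class name HrefFinder and the pattern are assumptions for illustration; it only finds quoted attribute values and, as noted above, cannot tell whether a link is commented out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper that scans raw markup for href attribute values.
final class HrefFinder {
    // Matches href="..." or href='...', ignoring case and spaces around '='.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')", Pattern.CASE_INSENSITIVE);

    static List<String> links(String markup) {
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(markup);
        while (m.find()) {
            result.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(links("<a href=\"http://a.example/\">x</a><A HREF='b.html'><p>"));
    }
}
```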
Caucho has a JAXP compliant XML parser that is a little bit more tolerant than what you would usually expect. (Including support for dealing with escaped character entity references, AFAIK.)
Find JavaDoc for the parsers here
A related topic (with my solution) is listed below:
Scala and html parsing
When I try to validate an XML file against an XSD in Java (see this example), there are some incompatibilities between the regular expressions given in the XSD file and the regular expressions in Java.
If there is a regular expression like "[ab-]" in the XSD (meaning any of the characters "a", "b" or "-"), Java complains about a syntax error in the expression.
This is a known bug since 28-MAR-2005, see Sun bug database.
What can I do to work around this bug? So far I have tried to "correct" the XSD file by replacing "[ab-]" with "[ab\-]", but sometimes this is not an option.
If you have problems with this bug, too, please vote for it at the Sun bug database!
Since a bug is already filed, I'd recommend you try a different XML Schema processor. There's not going to be a lot you can do about it.
If you can preprocess the stream the XSD is coming in on, then you could create a parser which understands the basic regular expression structure and can fix anything of the form [.*-] (where the .* is not a literal in this case).
Although it may not be the best solution in the world, you could consider using the SAX parser. I have used it for over 3 years now; however, I have not done much regex validation with it, so I cannot speak to its robustness in that respect.
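A very naive version of that preprocessing step is sketched below: it rewrites an unescaped hyphen that sits directly before a closing bracket, turning "[ab-]" into "[ab\-]". The class name XsdPatternFixer is made up, and the sketch works on a single pattern string only. It does not really parse the expression, so a hyphen preceded by an escaped backslash, or a "-]" outside a character class, would be handled incorrectly.

```java
// Hypothetical preprocessing helper for XSD pattern strings.
final class XsdPatternFixer {
    static String escapeTrailingHyphen(String xsdPattern) {
        // "(?<!\\)-\]" : a '-' right before ']' that is not already preceded by a backslash.
        // The replacement string "\\\\-]" produces a literal backslash followed by "-]".
        return xsdPattern.replaceAll("(?<!\\\\)-\\]", "\\\\-]");
    }

    public static void main(String[] args) {
        System.out.println(escapeTrailingHyphen("[ab-]"));
        System.out.println(escapeTrailingHyphen("[a-z]"));
    }
}
```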
Other than that, I think Kaleb is probably correct on the preprocessing side (which is anything but ideal) - you might be able to use a regex for any of the incoming regex'es to do a replace.... although that has quite the code smell about it.
Edit:
An additional thought that just came to me. If the regex does not need to be in the xsd - i.e. it is there simply because that was "easiest" in the past - you could do the regex validation outside of the xsd. But, if other systems use the xsd, that is likely not the correct solution, and you can forget I said anything.