SGML parser in Java? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 1 year ago.
I'm looking for a parser in Java that can parse a document formatted in SGML.
For duplicate monitors:
I'm aware of the two other threads that discuss this topic:
Parsing Java String with SGML
Java SGML to XML conversion?
But neither has a resolution, hence the new topic.
For people that confuse XML with SGML:
Please read this: http://www.w3.org/TR/NOTE-sgml-xml-971215#null
(in short, there are enough subtle differences to make an XML parser unusable on SGML in its vanilla form)
For people who are fond of asking posters to Google it:
I already did and the closest I could come up with was the widely popular SAXParser: http://download.oracle.com/javase/1.4.2/docs/api/javax/xml/parsers/SAXParser.html
But that of course is meant to be an XML parser. I'm looking around to see if anyone has implemented a modification of the SAX Parser to accommodate SGML.
Lastly, I cannot use SX (the SGML-to-XML converter that ships with SP), as I'm looking for a Java solution.
Thanks! :)

I have a few approaches to this problem:
The first is what you did: check whether the SGML document is close enough to XML for the standard SAX parser to work.
The second is to do the same with HTML parsers. The trick here is to find one that doesn't ignore non-HTML elements.
I did find some Java SGML parsers, mostly in academia, when searching for "SGML parser Java". I do not know how well they work.
The last is to take a standard (non-Java) SGML parser and transform the documents into something you can read in Java.
It looks like you were able to get the first approach working.
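For the first approach, a quick way to test whether a given document is close enough to XML is simply to throw it at the JDK's SAX parser and see whether it gets through; a minimal sketch (the sample document string and handler are only illustrative):

import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class NearXmlCheck {
    public static void main(String[] args) throws Exception {
        // Real SGML (implied end tags, unquoted attributes, etc.) will fail here,
        // which is exactly the signal you want from this quick check.
        String doc = "<DOC><TITLE>Hello</TITLE><BODY>world</BODY></DOC>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(doc)), new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes attributes) {
                System.out.println("element: " + qName);
            }
        });
        System.out.println("parsed as XML without errors");
    }
}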

I use OpenSP via JNI, as it seems there is no pure Java SGML parser. I've written an experimental SAX-like wrapper that is available at http://sourceforge.net/projects/sasgml (it has all the drawbacks of JNI, of course, but it was enough for my requirements).
Another approach is to convert the document to XML using sx from OpenSP, and then run a traditional SAX parser.
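A minimal sketch of that second route, driving the converter as an external process and then parsing the result with the JDK; the binary name ("osx" in OpenSP, "sx" in the original SP) and the file names here are assumptions:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class SxConvertThenParse {
    public static void main(String[] args) throws Exception {
        // Run OpenSP's SGML-to-XML converter and capture its stdout in a file.
        File xmlOut = new File("converted.xml");
        Process p = new ProcessBuilder("osx", "input.sgml")
                .redirectOutput(xmlOut)
                .start();
        if (p.waitFor() != 0) {
            throw new IllegalStateException("converter exited with " + p.exitValue());
        }
        // The output is now plain XML, so any standard Java parser will do.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlOut);
        System.out.println("root element: " + doc.getDocumentElement().getNodeName());
    }
}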

There is no API for parsing SGML in Java at this time, and there isn't any API or library for converting SGML to XML and then parsing it with Java either. With SGML having been supplanted by XML in all the projects I've worked on until now, I don't think there will ever be any work done in this area, but that is only a guess.
Here is some open source code from a university that does it; however, I haven't tried it and you would have to search to find the other dependent classes. I believe the only viable solution in Java would require regular expressions.
Also, here is a link for public SGML/XML software.
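To make the regular-expression idea concrete, here is a deliberately rough sketch; the pattern and sample input are illustrative only, and this is nowhere near a real SGML parser, but it can be enough for simple, line-oriented documents:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RoughTagScan {
    public static void main(String[] args) {
        String sgmlText = "<TYPE>485APOS\n<FILING-DATE>20161220"; // illustrative input
        // Grab a start tag and whatever text follows it on the same line.
        Matcher m = Pattern.compile("<([A-Za-z0-9-]+)>([^<\\r\\n]*)").matcher(sgmlText);
        while (m.find()) {
            System.out.println(m.group(1) + " = " + m.group(2).trim());
        }
    }
}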

Java SE includes an HTML parser in the javax.swing.text.html.parser package. Its documentation claims it is a general SGML parser, but then advises that you should only use it with the provided HTML DTD class.
If you put it in lenient mode and your SGML documents don't have a lot of implied end tags, you may get reasonable results.
Read about the parser in its JavaDoc, here: http://docs.oracle.com/javase/6/docs/api/javax/swing/text/html/parser/DocumentParser.html
Create an instance like this:
new DocumentParser(DTD.getDTD("html32"))
Or you could ignore the warnings against using a custom DTD with DocumentParser, and create a subclass of DTD that matches the rules of your own SGML format.
This is clearly not an industrial strength SGML parser, but it should be a good starting point for a one-time data migration effort. I've found it useful in previous projects for parsing HTML.
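For concreteness, a minimal sketch of the callback-style usage described above; the sample markup and handler bodies are illustrative, and how much of the html32 DTD getDTD() actually loads can vary between JDKs:

import java.io.Reader;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.DTD;
import javax.swing.text.html.parser.DocumentParser;

public class SwingParserSketch {
    public static void main(String[] args) throws Exception {
        Reader in = new StringReader("<p>Hello <b>world</b></p>");
        DocumentParser parser = new DocumentParser(DTD.getDTD("html32"));
        parser.parse(in, new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                System.out.println("start: " + t);
            }
            @Override
            public void handleText(char[] data, int pos) {
                System.out.println("text:  " + new String(data));
            }
            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                System.out.println("end:   " + t);
            }
        }, true);
    }
}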

If it's HTML that you're parsing, this might do:
http://ccil.org/~cowan/XML/tagsoup/
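TagSoup presents itself as an ordinary SAX XMLReader that never rejects bad markup, so it drops into existing SAX code; a minimal sketch (the sample markup and handler are illustrative):

import java.io.StringReader;
import org.ccil.cowan.tagsoup.Parser;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class TagSoupSketch {
    public static void main(String[] args) throws Exception {
        XMLReader reader = new Parser(); // TagSoup's SAX-compatible parser
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                System.out.println("start: " + qName);
            }
        });
        // Unclosed tags are tolerated and normalised rather than rejected.
        reader.parse(new InputSource(new StringReader("<p>Hello <b>world")));
    }
}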

Though it's a very old post, and I'm not claiming that the answer I am providing is perfect, it served my purpose. So I am sharing the code I wrote, which uses a stack to get the data in the way my case required. I hope it may be helpful for others.
// Fields assumed to be declared and initialised elsewhere in the class:
//   Stack<String> tagStack, historyStack;   List<String> tagList;
//   String line, cik, accessionNumber, dateFiled, formType, fileName;
//   int i, headerTagsCounter;   boolean countStart;
try (BufferedReader br = new BufferedReader(new FileReader(new File(fileName)))) {
    while ((line = br.readLine()) != null) {
        line = line.trim();
        int startOfTag = line.indexOf("<");
        int endOfTag = line.indexOf(">");
        String currentTag = "";
        if (startOfTag > -1 && endOfTag > -1) {
            if (countStart)
                headerTagsCounter++;
            currentTag = line.substring(startOfTag + 1, endOfTag);
            String currentData = line.substring(endOfTag + 1, line.length());
            if (i == 1) {
                tagStack.push(currentTag);
                i++;
            }
            if (currentData.isEmpty()) { // no data after the tag, so it is a parent tag
                if (!currentTag.contains("/")) { // an opening tag
                    switch (currentTag) { // these tags are useless in my case, so just skip them
                        case "CORRECTION":
                        case "PAPER":
                        case "PRIVATE-TO-PUBLIC":
                        case "DELETION":
                        case "CONFIRMING-COPY":
                        case "CAPTION":
                        case "STUB":
                        case "COLUMN":
                        case "TABLE-FOOTNOTES-SECTION":
                        case "FOOTNOTES":
                        case "PAGE":
                            break;
                        default: {
                            countStart = false;
                            int tagCounterNumber = 0;
                            String historyTagToRemove = "";
                            for (String historyTag : historyStack) {
                                String tagCounter = "";
                                // If it's a repeating tag, work out the counter to append
                                // and remember which history entry to replace.
                                if (historyTag.contains(currentTag)) {
                                    historyTagToRemove = historyTag;
                                    if (historyTag.equalsIgnoreCase(currentTag)) {
                                        tagCounterNumber = 1;
                                    } else if (historyTag.length() > currentTag.length()) {
                                        tagCounter = historyTag.substring(currentTag.length());
                                        if (tagCounter != null && !tagCounter.isEmpty()) {
                                            tagCounterNumber = Integer.parseInt(tagCounter) + 1;
                                        }
                                    }
                                }
                            }
                            if (tagCounterNumber > 0)
                                currentTag += tagCounterNumber;
                            if (historyTagToRemove != null && !historyTagToRemove.isEmpty()) {
                                historyStack.remove(historyTagToRemove);
                                historyStack.push(currentTag);
                            }
                            tagStack.push(currentTag);
                            break;
                        }
                    }
                } else {
                    // End of a tag: compare it with the top of the stack and, on a match, pop it.
                    currentTag = currentTag.substring(1);
                    String tagRemoved = "";
                    String topStackTag = tagStack.lastElement();
                    if (topStackTag.contains(currentTag)) {
                        tagRemoved = tagStack.pop();
                        historyStack.push(tagRemoved);
                    }
                    if (tagStack.size() < 2)
                        cik = "";
                    if (tagStack.size() == 2 && cik != null && !cik.isEmpty())
                        for (int j = headerTagsCounter - 1; j < tagList.size(); j++) {
                            String item = tagList.get(j);
                            if (!item.contains("##")) {
                                item += "##" + cik;
                                tagList.remove(j);
                                tagList.add(j, item);
                            }
                        }
                }
            } else { // the current tag has some data
                currentData = currentData.trim();
                String stackValue = "";
                for (String tag : tagStack) {
                    if (stackValue != null && !stackValue.isEmpty())
                        stackValue = stackValue + "||" + tag;
                    else
                        stackValue = tag;
                }
                switch (currentTag) {
                    case "ACCESSION-NUMBER":
                        accessionNumber = currentData;
                        break;
                    case "FILING-DATE":
                        dateFiled = currentData;
                        break;
                    case "TYPE":
                        formType = currentData;
                        break;
                    case "CIK":
                        cik = currentData;
                        break;
                }
                tagList.add(stackValue + "$$" + currentTag + "::" + currentData);
            }
        }
    }
    // Now all your data is available in tagList: the stack path is separated by ||,
    // the key by $$ and the value by ::
} catch (Exception e) {
    // TODO Auto-generated catch block
}
Output:
Source of file: http://10k-staging.s3.amazonaws.com/edgar0105/2016/12/20/935015/000119312516799070/0001193125-16-799070.hdr.sgml
Output of code:
SEC-HEADER$$SEC-HEADER::0001193125-16-799070.hdr.sgml : 20161220
SEC-HEADER$$ACCEPTANCE-DATETIME::20161220172458
SEC-HEADER$$ACCESSION-NUMBER::0001193125-16-799070
SEC-HEADER$$TYPE::485APOS
SEC-HEADER$$PUBLIC-DOCUMENT-COUNT::9
SEC-HEADER$$FILING-DATE::20161220
SEC-HEADER$$DATE-OF-FILING-DATE-CHANGE::20161220
SEC-HEADER||FILER||COMPANY-DATA$$CONFORMED-NAME::ARTISAN PARTNERS FUNDS INC##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$CIK::0000935015##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$IRS-NUMBER::391811840##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$STATE-OF-INCORPORATION::WI##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$FISCAL-YEAR-END::0930##0000935015
SEC-HEADER||FILER||FILING-VALUES$$FORM-TYPE::485APOS##0000935015
SEC-HEADER||FILER||FILING-VALUES$$ACT::33##0000935015
SEC-HEADER||FILER||FILING-VALUES$$FILE-NUMBER::033-88316##0000935015
SEC-HEADER||FILER||FILING-VALUES$$FILM-NUMBER::162062197##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$STATE::WI##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$ZIP::53202##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$PHONE::414-390-6100##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$STATE::WI##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$ZIP::53202##0000935015
SEC-HEADER||FILER||FORMER-COMPANY$$FORMER-CONFORMED-NAME::ARTISAN FUNDS INC##0000935015
SEC-HEADER||FILER||FORMER-COMPANY$$DATE-CHANGED::19950310##0000935015
SEC-HEADER||FILER||FORMER-COMPANY1$$FORMER-CONFORMED-NAME::ZIEGLER FUNDS INC##0000935015
SEC-HEADER||FILER||FORMER-COMPANY1$$DATE-CHANGED::19950109##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$CONFORMED-NAME::ARTISAN PARTNERS FUNDS INC##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$CIK::0000935015##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$IRS-NUMBER::391811840##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$STATE-OF-INCORPORATION::WI##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$FISCAL-YEAR-END::0930##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FORM-TYPE::485APOS##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$ACT::40##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FILE-NUMBER::811-08932##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FILM-NUMBER::162062198##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$STATE::WI##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$ZIP::53202##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$PHONE::414-390-6100##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$STATE::WI##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$ZIP::53202##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY2$$FORMER-CONFORMED-NAME::ARTISAN FUNDS INC##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY2$$DATE-CHANGED::19950310##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY3$$FORMER-CONFORMED-NAME::ZIEGLER FUNDS INC##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY3$$DATE-CHANGED::19950109##0000935015
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS$$OWNER-CIK::0000935015
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES$$SERIES-ID::S000056665
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES$$SERIES-NAME::Artisan Thematic Fund
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES||CLASS-CONTRACT$$CLASS-CONTRACT-ID::C000179292
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES||CLASS-CONTRACT$$CLASS-CONTRACT-NAME::Investor Shares
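To make the tagList format concrete, here is a small illustrative snippet showing how one of the entries above splits back apart (the sample entry is copied from the output above; || separates the stack path, $$ precedes the key and :: precedes the value):

// Each tagList entry is: STACK||PATH$$TAG::VALUE, optionally followed by ##CIK.
String entry = "SEC-HEADER||FILER||COMPANY-DATA$$CIK::0000935015##0000935015";
String path  = entry.substring(0, entry.indexOf("$$"));
String key   = entry.substring(entry.indexOf("$$") + 2, entry.indexOf("::"));
String value = entry.substring(entry.indexOf("::") + 2);
System.out.println(path + " | " + key + " | " + value);
// prints: SEC-HEADER||FILER||COMPANY-DATA | CIK | 0000935015##0000935015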

Related

OpenOffice xSentenceCursor stuck at end of paragraph

I am using this routine to iterate over sentences in an OpenOffice document:
while (moreParagraphsOO) {
    while (moreSentencesOO) {
        xSentenceCursor.gotoEndOfSentence(true);
        textSentence = xSentenceCursor.getString();
        xTextViewCursor.gotoRange(xSentenceCursor.getStart(), false);
        xTextViewCursor.gotoRange(xSentenceCursor.getEnd(), true);
        if (!textSentence.equals("")) {
            return textSentence;
        }
        moreSentencesOO = xSentenceCursor.gotoNextSentence(false);
    }
    moreParagraphsOO = xParagraphCursor.gotoNextParagraph(false);
    moreSentencesOO = xSentenceCursor.gotoStartOfSentence(false);
}
It works fine unless it finds a paragraph which ends with ". ", that is, a period with one or several whitespace characters after it. In that case it enters an infinite loop, executing the
while (moreSentencesOO)
...
moreSentencesOO = xSentenceCursor.gotoNextSentence(false);
endlessly. I am not so proficient with the OpenOffice API, and I am quite stuck here. Any ideas?
Thanks.
EDIT: I have come up with a somewhat awkward patch consisting of checking the current position of the cursor and, if it does not advance between two iterations, jumping to the next paragraph:
while (moreParagraphsOO) {
    while (moreSentencesOO) {
        /**********************************/
        int previousPosX = xTextViewCursor.getPosition().X;
        int previousPosY = xTextViewCursor.getPosition().Y;
        /**********************************/
        xSentenceCursor.gotoEndOfSentence(true);
        textSentence = xSentenceCursor.getString();
        xTextViewCursor.gotoRange(xSentenceCursor.getStart(), false);
        xTextViewCursor.gotoRange(xSentenceCursor.getEnd(), true);
        if (!textSentence.equals("")) {
            return textSentence;
        }
        moreSentencesOO = xSentenceCursor.gotoNextSentence(false);
        /**********************************/
        if (previousPosX == xTextViewCursor.getPosition().X &&
                previousPosY == xTextViewCursor.getPosition().Y) {
            xParagraphCursor.gotoNextParagraph(false);
        }
        /**********************************/
    }
    moreParagraphsOO = xParagraphCursor.gotoNextParagraph(false);
    moreSentencesOO = xSentenceCursor.gotoStartOfSentence(false);
}
It seems to work, but I am unsure whether it could introduce future problems. I would prefer a more "elegant" solution.
According to the documentation for gotoNextSentence(), it should only return true if the cursor was moved, so this is a bug. Consider filing a report.
The problem seems to occur when isEndOfSentence() but not isStartOfSentence(). So test for that instead of getPosition().
Here is Andrew Pitonyak's Basic macro that I modified to include this fix.
Sub CountSentences
    oCursor = ThisComponent.Text.createTextCursor()
    oCursor.gotoStart(False)
    Do
        nSentences = nSentences + 1
        If oCursor.isEndOfSentence() And Not oCursor.isStartOfSentence() Then
            oCursor.goRight(1, False)
        End If
    Loop While oCursor.gotoNextSentence(False)
    MsgBox nSentences & " sentences."
End Sub
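An untested adaptation of the same fix for the Java loop in the question (xSentenceCursor is the com.sun.star.text.XSentenceCursor already used there; goRight() comes from its XTextCursor base interface):

// When the cursor reports the end of a sentence but not the start of one
// (the ". " case), nudge it one character to the right before moving on,
// so that gotoNextSentence() can actually advance.
if (xSentenceCursor.isEndOfSentence() && !xSentenceCursor.isStartOfSentence()) {
    xSentenceCursor.goRight((short) 1, false);
}
moreSentencesOO = xSentenceCursor.gotoNextSentence(false);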

How to figure out which character doesn't map to UTF-8

I maintain a small java servlet-based webapp that presents forms for input, and writes the contents of those forms to MariaDB.
The app runs on a Linux box, although the users visit the webapp from Windows.
Some users paste text into these forms that was copied from MSWord docs, and when that happens, they get internal exceptions like the following:
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException:
Incorrect string value: '\xC2\x96 for...' for column 'ssimpact' at row 1
For instance, I tested it with text like the following:
Project – for
Where the dash is a "long dash" from the MSWord document.
I don't think it's possible to convert the wayward characters in this text to the "correct" characters, so I'm trying to figure out how to produce a reasonable error message that shows a substring of the bad text in question, along with the index of the first bad character.
I noticed postings like this: How to determine if a String contains invalid encoded characters .
I thought this would get me close, but it's not quite working.
I'm trying to use the following method:
private int findUnmappableCharIndex(String entireString) {
    int charIndex;
    for (charIndex = 0; charIndex < entireString.length(); ++charIndex) {
        String currentChar = entireString.substring(charIndex, charIndex + 1);
        CharBuffer out = CharBuffer.wrap(new char[currentChar.length()]);
        CharsetDecoder decoder = Charset.forName("utf-8").newDecoder();
        CoderResult result = decoder.decode(ByteBuffer.wrap(currentChar.getBytes()), out, true);
        if (result.isError() || result.isOverflow() || result.isUnderflow() || result.isMalformed() || result.isUnmappable()) {
            break;
        }
        CoderResult flushResult = decoder.flush(out);
        if (flushResult.isOverflow()) {
            break;
        }
    }
    if (charIndex == entireString.length() + 1) {
        charIndex = -1;
    }
    return charIndex;
}
This doesn't work. I get "underflow" on the first character, which is a valid character. I'm sure I don't fully understand the decoder mechanism.
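No answer was recorded here, but for what it's worth, a minimal sketch of one way to report the index of the first character a target charset cannot represent, using CharsetEncoder.canEncode. The choice of US-ASCII as the target and the sample text are assumptions; the target would need to be whatever character set the database column actually uses, and surrogate pairs would need extra care:

import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

public class FirstUnmappableChar {
    // Returns the index of the first char the target charset cannot encode, or -1 if all map.
    static int findUnmappableCharIndex(String s, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        for (int i = 0; i < s.length(); i++) {
            if (!encoder.canEncode(s.charAt(i))) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "Project \u2013 for"; // an en dash pasted from Word
        int bad = findUnmappableCharIndex(text, StandardCharsets.US_ASCII);
        System.out.println(bad < 0 ? "all characters map" : "first unmappable character at index " + bad);
    }
}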

Java XML parser error Invalid character Unicode 0x1A when copy/paste from Word

Sorry to double post. But my earlier post was based on Flex:
Flex TextArea - copy/paste from Word - Invalid unicode characters on xml parsing
But now I'm posting this on the Java side.
The issue is:
We have an email functionality (part of our application) where we create an XML string & put it on the queue. Another application picks it up, parses the XML & sends out emails.
We get an XML parser exception when the email text (<BODY>....</BODY>) is copy/pasted from Word:
Invalid character in attribute value BODY (Unicode: 0x1A)
As we use Java as well, I'm trying to remove the invalid characters from the String using:
body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");
// Strip invalid characters
public String stripNonValidXMLCharacters(String in) {
    StringBuffer out = new StringBuffer(); // Used to hold the output.
    char current; // Used to reference the current character.
    if (in == null || ("".equals(in))) {
        return ""; // vacancy test.
    }
    for (int i = 0; i < in.length(); i++) {
        // NOTE: No IndexOutOfBoundsException caught here; it should not happen.
        current = in.charAt(i);
        if ((current == 0x9)
                || (current == 0xA)
                || (current == 0xD)
                || ((current >= 0x20) && (current <= 0xD7FF))
                || ((current >= 0xE000) && (current <= 0xFFFD))
                || ((current >= 0x10000) && (current <= 0x10FFFF)))
            out.append(current);
    }
    return out.toString();
}

// Strip once more
private String stripNonValidXMLCharacter(String in) {
    if (in == null || ("".equals(in))) {
        return null;
    }
    StringBuffer out = new StringBuffer(in);
    for (int i = 0; i < out.length(); i++) {
        if (out.charAt(i) == 0x1a) {
            out.setCharAt(i, '-');
        }
    }
    return out.toString();
}

// Replace the special characters if any
emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C"
        + "\\u000E-\\u001F"
        + "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
        + "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
emailText = emailText.replaceAll("[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
emailText = emailText.replaceAll("\\p{C}", "");
But they still do not work. Also the XML string starts with:
<?xml version="1.0" encoding="UTF-8"?>
<EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd">
I think the issue occurs when there are multiple tabs in the Word doc. For example:
Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>
The resulting xml string is:
<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t#t.com" DEST="t#t.com" CC="" BCC="t#t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems. The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.? It still is the same. ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>
Please note that the "?" is where there are multiple tabs in the Word doc. I hope my question is clear and someone can help in resolving the issue.
Thanks
Have you tried using an XML library such as TagSoup / JSoup / JTidy to sanitize your XML?
The invalid (hidden) character was coming from the UI (the Flex TextArea), so it had to be taken care of in the UI so that it does not pass over to Java at all. I handled and removed it using the changing handler on the Flex TextArea to restrict the characters.

Is there a smart way to write a fixed-length flat file?

Is there any framework/library to help with writing fixed-length flat files in Java?
I want to write a collection of beans/entities into a flat file without worrying about conversions, padding, alignment, fillers, etc.
For example, I'd like to parse a bean like:
public class Entity {
    String name = "name";       // length = 10; align left; fill with spaces
    Integer id = 123;           // length = 5; align left; fill with spaces
    Integer serial = 321;       // length = 5; align to right; fill with '0'
    Date register = new Date(); // length = 8; convert to yyyyMMdd
}
... into ...
name      123  0032120110505
mikhas    5000 0122120110504
superuser 1    0000120101231
...
You're not likely to encounter a framework that can cope with a "legacy" system's format. In most cases, legacy systems don't use standard formats, but frameworks expect them. As a maintainer of legacy COBOL systems and a Java/Groovy convert, I encounter this mismatch frequently. "Worrying about conversions, padding, alignment, fillers, etc." is primarily what you do when dealing with a legacy system. Of course, you can encapsulate some of it away into handy helpers, but most likely you'll need to get real familiar with java.util.Formatter.
For example, you might use the Decorator pattern to create decorators to do the conversion. Below is a bit of groovy (easily convertible into Java):
class Entity {
    String name = "name";       // length = 10; align left; fill with spaces
    Integer id = 123;           // length = 5; align left; fill with spaces
    Integer serial = 321        // length = 5; align to right; fill with '0'
    Date register = new Date(); // length = 8; convert to yyyyMMdd
}
class EntityLegacyDecorator {
    Entity d
    EntityLegacyDecorator(Entity d) { this.d = d }
    String asRecord() {
        return String.format('%-10s%-5d%05d%tY%<tm%<td',
                d.name, d.id, d.serial, d.register)
    }
}
def e = new Entity(name: 'name', id: 123, serial: 321, register: new Date('2011/05/06'))
assert new EntityLegacyDecorator(e).asRecord() == 'name      123  0032120110506'
This is workable if you don't have too many of these and the objects aren't too complex. But pretty quickly the format string gets intolerable. Then you might want decorators for Date, like:
class DateYMD {
    Date d
    DateYMD(d) { this.d = d }
    String toString() { return d.format('yyyyMMdd') }
}
so you can format with %s:
String asRecord() {
    return String.format('%-10s%-5d%05d%s',
            d.name, d.id, d.serial, new DateYMD(d.register))
}
But for a significant number of bean properties the string is still too gross, so you want something that understands columns and lengths and looks like the COBOL spec you were handed, so you'll write something like this:
class RecordBuilder {
    final StringBuilder record
    RecordBuilder(recordSize) {
        record = new StringBuilder(recordSize)
        record.setLength(recordSize)
    }
    def setField(pos, length, String s) {
        record.replace(pos - 1, pos + length, s.padRight(length))
    }
    def setField(pos, length, Date d) {
        setField(pos, length, new DateYMD(d).toString())
    }
    def setField(pos, length, Integer i, boolean padded) {
        if (padded)
            setField(pos, length, String.format("%0" + length + "d", i))
        else
            setField(pos, length, String.format("%-" + length + "d", i))
    }
    String toString() { record.toString() }
}
class EntityLegacyDecorator {
    Entity d
    EntityLegacyDecorator(Entity d) { this.d = d }
    String asRecord() {
        RecordBuilder record = new RecordBuilder(28)
        record.setField(1, 10, d.name)
        record.setField(11, 5, d.id, false)
        record.setField(16, 5, d.serial, true)
        record.setField(21, 8, d.register)
        return record.toString()
    }
}
After you've written enough setField() methods to handle your legacy system, you'll briefly consider posting it on GitHub as a "framework" so the next poor sap doesn't have to do it again. But then you'll consider all the ridiculous ways you've seen COBOL store a "date" (MMDDYY, YYMMDD, YYDDD, YYYYDDD) and numerics (assumed decimal, explicit decimal, sign as trailing separate or sign as leading floating character). Then you'll realize why nobody has produced a good framework for this and occasionally post bits of your production code into SO as an example... ;)
If you are still looking for a framework, check out BeanIO at http://www.beanio.org
uniVocity-parsers goes a long way to support tricky fixed-width formats, including lines with different fields, paddings, etc.
Check out this example to write imaginary client & accounts details. This uses a lookahead value to identify which format to use when writing a row:
FixedWidthFields accountFields = new FixedWidthFields();
accountFields.addField("ID", 10); //account ID has length of 10
accountFields.addField("Bank", 8); //bank name has length of 8
accountFields.addField("AccountNumber", 15); //etc
accountFields.addField("Swift", 12);
//Format for clients' records
FixedWidthFields clientFields = new FixedWidthFields();
clientFields.addField("Lookahead", 5); //clients have their lookahead in a separate column
clientFields.addField("ClientID", 15, FieldAlignment.RIGHT, '0'); //let's pad client ID's with leading zeroes.
clientFields.addField("Name", 20);
FixedWidthWriterSettings settings = new FixedWidthWriterSettings();
settings.getFormat().setLineSeparator("\n");
settings.getFormat().setPadding('_');
//If a record starts with C#, it's a client record, so we associate "C#" with the client format.
settings.addFormatForLookahead("C#", clientFields);
//Rows starting with A# should be written using the account format
settings.addFormatForLookahead("A#", accountFields);
StringWriter out = new StringWriter();
//Let's write
FixedWidthWriter writer = new FixedWidthWriter(out, settings);
writer.writeRow(new Object[]{"C#",23234, "Miss Foo"});
writer.writeRow(new Object[]{"A#23234", "HSBC", "123433-000", "HSBCAUS"});
writer.writeRow(new Object[]{"A#234", "HSBC", "222343-130", "HSBCCAD"});
writer.writeRow(new Object[]{"C#",322, "Mr Bar"});
writer.writeRow(new Object[]{"A#1234", "CITI", "213343-130", "CITICAD"});
writer.close();
System.out.println(out.toString());
The output will be:
C#___000000000023234Miss Foo____________
A#23234___HSBC____123433-000_____HSBCAUS_____
A#234_____HSBC____222343-130_____HSBCCAD_____
C#___000000000000322Mr Bar______________
A#1234____CITI____213343-130_____CITICAD_____
This is just a rough example. There are many other options available, including support for annotated java beans, which you can find here.
Disclosure: I'm the author of this library, it's open-source and free (Apache 2.0 License)
The library Fixedformat4j is a pretty neat tool to do exactly this: http://fixedformat4j.ancientprogramming.com/
Spring Batch has a FlatFileItemWriter, but that won't help you unless you use the whole Spring Batch API.
But apart from that, I'd say you just need a library that makes writing to files easy (unless you want to write the whole IO code yourself).
Two that come to mind are:
Guava
Files.write(stringData, file, Charsets.UTF_8);
Commons / IO
FileUtils.writeStringToFile(file, stringData, "UTF-8");
Don't know of any framework, but you can just use RandomAccessFile. You can position the file pointer anywhere in the file to do your reads and writes.
I've just found a nice library that I'm using:
http://sourceforge.net/apps/trac/ffpojo/wiki
Very simple to configure with XML or annotations!
A simple way to write beans/entities to a flat file is to use ObjectOutputStream.
public static void writeToFile(File file, Serializable object) throws IOException {
    ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file));
    oos.writeObject(object);
    oos.close();
}
You can write to a fixed length flat file with
FileUtils.writeByteArrayToFile(new File(filename), new byte[length]);
You need to be more specific about what you want to do with the file. ;)
Try the FFPOJO API, as it has everything you need to create a flat file with fixed lengths, and it will also convert a file to an object and vice versa.
@PositionalRecord
public class CFTimeStamp {

    String timeStamp;

    public CFTimeStamp(String timeStamp) {
        this.timeStamp = timeStamp;
    }

    @PositionalField(initialPosition = 1, finalPosition = 26, paddingAlign = PaddingAlign.RIGHT, paddingCharacter = '0')
    public String getTimeStamp() {
        return timeStamp;
    }

    @Override
    public String toString() {
        try {
            FFPojoHelper ffPojo = FFPojoHelper.getInstance();
            return ffPojo.parseToText(this);
        } catch (FFPojoException ex) {
            trsLogger.error(ex.getMessage(), ex);
        }
        return null;
    }
}

ROME API to parse RSS/Atom

I'm trying to parse RSS/Atom feeds with the ROME library. I am new to Java, so I am not in tune with many of its intricacies.
Does ROME automatically use its modules to handle different feeds as it comes across them, or do I have to ask it to use them? If so, any direction on this.
How do I get to the correct 'source'? I was trying to use item.getSource(), but it is giving me fits. I guess I am using the wrong interface. Some direction would be much appreciated.
Here is the meat of what I have for collecting my data.
I noted two areas where I am having problems, both revolving around getting the source information of the feed. By source, I mean CNN, or FoxNews, or whoever, not the author.
Judging from my reading, .getSource() is the correct method.
List<String> feedList = theFeeds.getFeeds();
List<FeedData> feedOutput = new ArrayList<FeedData>();

for (String sites : feedList) {
    URL feedUrl = new URL(sites);
    SyndFeedInput input = new SyndFeedInput();
    SyndFeed feed = input.build(new XmlReader(feedUrl));
    List<SyndEntry> entries = feed.getEntries();

    for (SyndEntry item : entries) {
        String title = item.getTitle();
        String link = item.getUri();
        Date date = item.getPublishedDate();

        // Problem here --> **
        SyndEntry source = item.getSource();

        String description;
        if (item.getDescription() == null) {
            description = "";
        } else {
            description = item.getDescription().getValue();
        }
        String cleanDescription = description.replaceAll("\\<.*?>", "").replaceAll("\\s+", " ");

        FeedData feedData = new FeedData();
        feedData.setTitle(title);
        feedData.setLink(link);

        // And here --> **
        feedData.setSource(link);

        feedData.setDate(date);
        feedData.setDescription(cleanDescription);
        String preview = createPreview(cleanDescription);
        feedData.setPreview(preview);
        feedOutput.add(feedData);

        // lets print out my pieces.
        System.out.println("Title: " + title);
        System.out.println("Date: " + date);
        System.out.println("Text: " + cleanDescription);
        System.out.println("Preview: " + preview);
        System.out.println("*****");
    }
}
getSource() is definitely wrong: it returns the SyndFeed to which the entry in question belongs. Perhaps what you want is getContributors()?
As far as modules go, they should be selected automatically. You can even write your own and plug it in as described here.
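If the goal is just a human-readable publisher name ("CNN", "FoxNews", ...), one option, sketched against the question's own variables (feed, feedData) rather than anything ROME-specific beyond its public getters, is to take it from the feed you already built:

// The SyndFeed built from the URL already carries the publisher-level metadata,
// so its title (or link, as a fallback) can serve as the "source" of every entry.
String sourceName = feed.getTitle();   // e.g. "CNN.com - Top Stories"
feedData.setSource(sourceName != null ? sourceName : feed.getLink());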
What about trying to regex the source out of the URL, without using the API?
That was my first thought; anyway, I checked against the standardized RSS format itself to get an idea of whether this option is actually available at that level, and then tried to trace its implementation upwards...
In RSS 2.0 I have found the source element; however, it appears that it doesn't exist in previous versions of the spec: not good news for us!
<source> is an optional sub-element of <item>.
Its value is the name of the RSS channel that the item came from, derived from its <title>. It has one required attribute, url, which links to the XMLization of the source.
