how to figure out which character doesn't map to utf-8

how to figure out which character doesn't map to utf-8 - java

I maintain a small java servlet-based webapp that presents forms for input, and writes the contents of those forms to MariaDB.
The app runs on a Linux box, although the users visit the webapp from Windows.
Some users paste text into these forms that was copied from MSWord docs, and when that happens, they get internal exceptions like the following:
Caused by: org.mariadb.jdbc.internal.util.dao.QueryException:
Incorrect string value: '\xC2\x96 for...' for column 'ssimpact' at row
1
For instance, I tested it with text like the following:
Project – for
Where the dash is a "long dash" from the MSWord document.
I don't think it's possible to convert the wayward characters in this text to the "correct" characters, so I'm trying to figure out how to produce a reasonable error message that shows a substring of the bad text in question, along with the index of the first bad character.
I noticed postings like this: How to determine if a String contains invalid encoded characters .
I thought this would get me close, but it's not quite working.
I'm trying to use the following method:
private int findUnmappableCharIndex(String entireString) {
int charIndex;
for (charIndex = 0; charIndex < entireString.length(); ++ charIndex) {
String currentChar = entireString.substring(charIndex, charIndex + 1);
CharBuffer out = CharBuffer.wrap(new char[currentChar.length()]);
CharsetDecoder decoder = Charset.forName("utf-8").newDecoder();
CoderResult result = decoder.decode(ByteBuffer.wrap(currentChar.getBytes()), out, true);
if (result.isError() || result.isOverflow() || result.isUnderflow() || result.isMalformed() || result.isUnmappable()) {
break;
}
CoderResult flushResult = decoder.flush(out);
if (flushResult.isOverflow()) {
break;
}
}
if (charIndex == entireString.length() + 1) {
charIndex = -1;
}
return charIndex;
}
This doesn't work. I get "underflow" on the first character, which is a valid character. I'm sure I don't fully understand the decoder mechanism.

Related

How can I set pageToken to get item lists from Google Cloud Storage via Java SDK?

I want to set pageToken to get items stored at Google Cloud Storage. I'm using Google API Client Library for Java v1.19.x.
I have no idea to generate pageToken from file path(or file name).
2 files stored in bucket.
my-bucket
/test.csv
/test2.csv
When I tried Google APIs Explorer with following parameters, I could get nextPageToken Cgh0ZXN0LmNzdg==.
And I found out that I can get test.csv string by decoding nextPageToken with base64.
bucket: my-bucket
pageToken:
prefix: test
maxResults: 1
{"kind": "storage#objects", "nextPageToken": "Cgh0ZXN0LmNzdg==", ...}
But How can I get Cgh0ZXN0LmNzdg== from test.csv?
Although I tried Base64 encoding, result didn't match.
import com.google.api.client.repackaged.org.apache.commons.codec.binary.Base64;
String lastFile = "test.csv"
String token = Base64.encodeBase64String(lastFile.getBytes());
String bucket = "my-bucket"
String prefix = "test"
Storage.Objects.List listObjects = client.objects().list(bucket);
listObjects.setPrefix(prefix);
listObjects.setPageToken(token);
long maxResults = 1;
listObjects.setMaxResults(maxResults);
do {
Objects objects = listObjects.execute();
List<StorageObject> items = objects.getItems();
token = objects.getNextPageToken();
listObjects.setPageToken(token);
} while (token != null);

I could get next token from file path string using following codes by myself.
How to get nextToken from path string
String nextToken = base64encode(0x0a + asciiCode + pathString)
asciiCode can be taken between 0x01(SOH) and 0x7f(DEL). It seems to depend on path length.
my-bucket/
a/a(3byte) 0x03
a/ab(4byte) 0x04
test.txt(8byte) 0x08
Notice
If path length is longer than 1024 byte, another rule seems to apply. But I couldn't found out rules.
See also Object Name Requirements
import com.google.common.io.BaseEncoding;
String lastFile = "test.csv"
String token = base64Encode(lastFile);
String bucket = "my-bucket"
String prefix = "test"
Storage.Objects.List listObjects = client.objects().list(bucket);
listObjects.setPrefix(prefix);
listObjects.setPageToken(token);
long maxResults = 1;
listObjects.setMaxResults(maxResults);
do {
Objects objects = listObjects.execute();
List<StorageObject> items = objects.getItems();
token = objects.getNextPageToken();
listObjects.setPageToken(token);
} while (token != null);
private String base64Encode(String path) {
byte[] encoding;
byte[] utf8 = path.getBytes(Charsets.UTF_8);
encoding = new byte[utf8.length + 2];
encoding[0] = 0x0a;
encoding[1] = new Byte(String.valueOf(path.length()));
String s = BaseEncoding.base64().encode(encoding);
return s;
}

I know this question is already answered and is applied to Java, I'd like to mention that this question applies to PHP as well.
With the help of the approved post from sakama above I figured out a PHP version of his solution.
The PHP equivalent for generating the token is as follow:
base64_encode(pack('c', 0x0a) . pack('c', $path_string_length) . pack('a*', $path_string));
The byte pattern seems indeed (as sakama already mentioned) to be:
<line feed><line data length><line data>

Java XML parser error Invalid character Unicode 0x1A when copy/paste from Word

Sorry to double post. But my earlier post was based on Flex:
Flex TextArea - copy/paste from Word - Invalid unicode characters on xml parsing
But now I'm posting this on the Java side.
The issue is:
We have an email functionality (part of our application) where we create an XML string & put it on the queue. Another application picks it up, parses the XML & sends out emails.
We get an XML parser exception when the email text (<BODY>....</BODY) is copy/pasted from Word:
Invalid character in attribute value BODY (Unicode: 0x1A)
As we use Java as well, I'm trying to remove the invalid characters from the String using:
body = body.replaceAll("‘", "");
body = body.replaceAll("’", "");
//Strip invalid characters
public String stripNonValidXMLCharacters(String in) {
StringBuffer out = new StringBuffer(); // Used to hold the output.
char current; // Used to reference the current character.
if (in == null || ("".equals(in))) {
return ""; // vacancy test.
}
for (int i = 0; i < in.length(); i++) {
//NOTE: No IndexOutOfBoundsException caught here; it should not happen.
current = in.charAt(i);
if ((current == 0x9)
|| (current == 0xA)
|| (current == 0xD)
|| ((current >= 0x20) && (current <= 0xD7FF))
|| ((current >= 0xE000) && (current <= 0xFFFD))
|| ((current >= 0x10000) && (current <= 0x10FFFF)))
out.append(current);
}
return out.toString();
}
//Strip once more
private String stripNonValidXMLCharacter(String in) {
if (in == null || ("".equals(in))) {
return null;
}
StringBuffer out = new StringBuffer(in);
for (int i = 0; i < out.length(); i++) {
if (out.charAt(i) == 0x1a) {
out.setCharAt(i, '-');
}
}
return out.toString();
}
//Replace the special characters if any
emailText = emailText.replaceAll("[\\u0000-\\u0008\\u000B\\u000C"
+ "\\u000E-\\u001F"
+ "\\uD800-\\uDFFF\\uFFFE\\uFFFF\\u00C5\\u00D4\\u00EC"
+ "\\u00A8\\u00F4\\u00B4\\u00CC\\u2211]", " ");
emailText = emailText.replaceAll("[\\x00-\\x1F]", "");
emailText = emailText.replaceAll(
"[\\x00-\\x08\\x0b\\x0c\\x0e-\\x1f]", "");
emailText = emailText.replaceAll("\\p{C}", "");
But they still do not work. Also the XML string starts with:
<?xml version="1.0" encoding="UTF-8"?>
<EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\\SMTPSchema.xsd\">
I think the issue occurs when there are multiple Tabs in the Word doc. Like for eg.
Text......text
<newLine>
<tab><tab><tab> text...text
<newLine>
The resulting xml string is:
<?xml version="1.0" encoding="UTF-8"?> <EMAILS xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNameSpaceSchemaLocation=".\SMTPSchema.xsd"> <EMAIL SOURCE="t#t.com" DEST="t#t.com" CC="" BCC="t#t.com" SUBJECT="test 61" BODY="As such there was no mechanism constructed to migrate the enrollment user base to Data Collection or to keep security attributes for common users in sync between the two systems. The purpose of this document is to outline two strategies for bring the user base between the two applications into sync.? It still is the same. ** Please note: This e-mail message was sent from a notification-only address that cannot accept incoming e-mail. Please do not reply to this message."/> </EMAILS>
Please note then the "?" is where there are multiple tabs in the Word doc. Hope my question is clear & someone can help in resolving the issue
Thanks

Have you tried using an XML library such as TagSoup / JSoup / JTidy to sanitize your XML?

The invalid (hidden) character was from the UI (Flex TextArea). So had to take care of that in the UI so that it does not pass over to Java as well. Handled & removed it using the chagingHandler in the Flex textArea to restrict the characters.

Design generic process using template pattern

I have a routine that I repeatedly doing for many projects and I want to generalized it. I used iText for PDF manipulation.
Let say that I have 2000 PDFs inside a folder, and I need to zip these together. Let say the limit is 1000 PDFs per zip. So the name of the zip would follow this rule: job name + job sequence. For example, the zip name of the first 1000 PDF would be XNKXMN + AA and the second zip name would be XNKXMN + AB. Before zipping these PDFs, I need to add some text to each PDF. Text look something like this job name + job sequence + pdf sequence. So the first PDF inside the first zip will have this text XNKXMN + AA + 000001, and the one after that is XNKXMN + AA + 000002. Here is my attempt
First I have abstract clas GenericText that represent my text.
public abstract class GenericText {
private float x;
private float y;
private float rotation;
/**
* Since the text that the user want to insert onto the Pdf might vary
* from page to page, or from logical document to logical document, we allow
* the user to write their own implementation of the text. To give the user enough
* flexibility, we give them the reference to the physical page index, the logical page index.
* #param physcialPage The actual page number that the user current looking at
* #param logicalPage A Pdf might contain multiples sub-documents, <code>logicalPage</code>
* tell the user which logical sub-document the system currently looking at
*/
public abstract String generateText(int physicalPage, int logicalPage);
GenericText(float x, float y, float rotation){
this.x = x;
...
}
}
JobGenerator.java: my generic API to do what I describe above
public String generatePrintJob(List<File> pdfList, String outputPath,
String printName, String seq, List<GenericText> textList, int maxSize)
for (int currentPdfDocument = 0; currentPdfDocument < pdfList.size(); currentPdfDocument++) {
File pdf = pdfList.get(currentPdfDocument);
if (currentPdfDocument % maxSize != 0) {
if(textList != null && !textList.isEmpty()){
for(GenericText gt : textList){
String text = gt.generateText(currentPdfDocument, currentPdfDocument)
//Add the text content to the PDF using PdfReader and PdfWriter
}
}
...
}else{
//Close the current output stream and zip output stream
seq = Utils.getNextSeq(seq);
jobPath = outputPath + File.separator + printName + File.separator + seq + ".zip"
//Open new zip output stream with the new <code>jobPath</code>
}
}
}
So now in my main class I would just do this
final String printName = printNameLookup.get(baseOutputName);
String jobSeq = config.getPrintJobSeq();
final String seq = jobSeq;
GenericText keyline = new GenericText(90, 640, 0){
#Override
public String generateText(int physicalPage, int logicalPage) {
//if logicalPage = 1, Utils.right(String.valueOf(logicalPage), 6, '0') -> 000001
return printName + seq + " " + Utils.right(String.valueOf(logicalPage), 6, '0');
}
};
textList.add(keyline);
JobGenerator pjg = new JobGenerator();
pjg.generatePrintJob(...,..., printName, jobSeq, textList, 1000);
The problem that I am having with this design is that, even though I process archive the PDF into two zip correctly, the text is not correctly reflect. The print and the sequence does not change accordingly, it stay XNKXMN + AA for 2000 PDF instead of XNKXMN + AA for the first 1000 and change to XNKXMN + AB for the later 1000. There seems to be flawed in my design, please help
EDIT:
After looking at toto2 code, I see my problem. I create GenericText with the hope of adding text anywhere on the pdf page without affecting the basic logic of the process. However, the job sequence is by definition depending on the logic,as it need to increment if there are too many PDFs for one ZIP to handle (> maxSize). I need to rethink this.

When you create an anonymous GenerateText, the final seq which you use in the overridden generateText method is truly final and will always remain the value given at creation time. The update you carry on seq inside the else in generatePrintJob does nothing.
On a more general note, your code looks very complex and you should probably take a step back and do some major refactoring.
EDIT:
I would instead try something different, with no template method pattern:
int numberOfZipFiles =
(int) Math.ceil((double) pdfList.size() / maxSize);
for (int iZip = 0; iZip < numberOfZipFiles; iZip++) {
String batchSubName = generateBatchSubName(iZip); // gives AA, AB,...
for (int iFile = 0; iFile < maxSize; iFile++) {
int fileNumber = iZip * maxSize + iFile;
if (fileNumber >= pdfList.size()) // can happen for last batch
return;
String text = jobName + batchSubName + iFile;
... add "text" to pdfList.get(fileNumber)
}
}
However, you might also want to maintain the template pattern. In that case, I would keep the for-loops I wrote above, but I would change the generating method to genericText.generateText(iZip, iFile) where iZip = 0 gives AA and iZip = 1 gives AB, etc:
for (int iZip = 0; iZip < numberOfZipFiles; iZip++) {
for (int iFile = 0; iFile < maxSize; iFile++) {
int fileNumber = iZip * maxSize + iFile;
if (fileNumber >= pdfList.size()) // can happen for last batch
return;
String text = genericText.generateText(iZip, iFile);
... add "text" to pdfList.get(fileNumber)
}
}
It would be possible also to have genericText.generateText(fileNumber) which could itself decompose the fileNumber in AA000001, etc. But that would be somewhat dangerous because maxSize would be used in two different places and it might be bug prone to have duplicate data like that.

Is there an smart way to write a fixed length flat file?

Is there any framework/library to help writing fixed length flat files in java?
I want to write a collection of beans/entities into a flat file without worrying with convertions, padding, alignment, fillers, etcs
For example, I'd like to parse a bean like:
public class Entity{
String name = "name"; // length = 10; align left; fill with spaces
Integer id = 123; // length = 5; align left; fill with spaces
Integer serial = 321 // length = 5; align to right; fill with '0'
Date register = new Date();// length = 8; convert to yyyyMMdd
}
... into ...
name 123 0032120110505
mikhas 5000 0122120110504
superuser 1 0000120101231
...

You're not likely to encounter a framework that can cope with a "Legacy" system's format. In most cases, Legacy systems don't use standard formats, but frameworks expect them. As a maintainer of legacy COBOL systems and Java/Groovy convert, I encounter this mismatch frequently. "Worrying with conversions, padding, alignment, fillers, etcs" is primarily what you do when dealing with a legacy system. Of course, you can encapsulate some of it away into handy helpers. But most likely, you'll need to get real familiar with java.util.Formatter.
For example, you might use the Decorator pattern to create decorators to do the conversion. Below is a bit of groovy (easily convertible into Java):
class Entity{
String name = "name"; // length = 10; align left; fill with spaces
Integer id = 123; // length = 5; align left; fill with spaces
Integer serial = 321 // length = 5; align to right; fill with '0'
Date register = new Date();// length = 8; convert to yyyyMMdd
}
class EntityLegacyDecorator {
Entity d
EntityLegacyDecorator(Entity d) { this.d = d }
String asRecord() {
return String.format('%-10s%-5d%05d%tY%<tm%<td',
d.name,d.id,d.serial,d.register)
}
}
def e = new Entity(name: 'name', id: 123, serial: 321, register: new Date('2011/05/06'))
assert new EntityLegacyDecorator(e).asRecord() == 'name 123 0032120110506'
This is workable if you don't have too many of these and the objects aren't too complex. But pretty quickly the format string gets intolerable. Then you might want decorators for Date, like:
class DateYMD {
Date d
DateYMD(d) { this.d = d }
String toString() { return d.format('yyyyMMdd') }
}
so you can format with %s:
String asRecord() {
return String.format('%-10s%-5d%05d%s',
d.name,d.id,d.serial,new DateYMD(d.register))
}
But for significant number of bean properties, the string is still too gross, so you want something that understands columns and lengths that looks like the COBOL spec you were handed, so you'll write something like this:
class RecordBuilder {
final StringBuilder record
RecordBuilder(recordSize) {
record = new StringBuilder(recordSize)
record.setLength(recordSize)
}
def setField(pos,length,String s) {
record.replace(pos - 1, pos + length, s.padRight(length))
}
def setField(pos,length,Date d) {
setField(pos,length, new DateYMD(d).toString())
}
def setField(pos,length, Integer i, boolean padded) {
if (padded)
setField(pos,length, String.format("%0" + length + "d",i))
else
setField(pos,length, String.format("%-" + length + "d",i))
}
String toString() { record.toString() }
}
class EntityLegacyDecorator {
Entity d
EntityLegacyDecorator(Entity d) { this.d = d }
String asRecord() {
RecordBuilder record = new RecordBuilder(28)
record.setField(1,10,d.name)
record.setField(11,5,d.id,false)
record.setField(16,5,d.serial,true)
record.setField(21,8,d.register)
return record.toString()
}
}
After you've written enough setField() methods to handle you legacy system, you'll briefly consider posting it on GitHub as a "framework" so the next poor sap doesn't have to to it again. But then you'll consider all the ridiculous ways you've seen COBOL store a "date" (MMDDYY, YYMMDD, YYDDD, YYYYDDD) and numerics (assumed decimal, explicit decimal, sign as trailing separate or sign as leading floating character). Then you'll realize why nobody has produced a good framework for this and occasionally post bits of your production code into SO as an example... ;)

If you are still looking for a framework, check out BeanIO at http://www.beanio.org

uniVocity-parsers goes a long way to support tricky fixed-width formats, including lines with different fields, paddings, etc.
Check out this example to write imaginary client & accounts details. This uses a lookahead value to identify which format to use when writing a row:
FixedWidthFields accountFields = new FixedWidthFields();
accountFields.addField("ID", 10); //account ID has length of 10
accountFields.addField("Bank", 8); //bank name has length of 8
accountFields.addField("AccountNumber", 15); //etc
accountFields.addField("Swift", 12);
//Format for clients' records
FixedWidthFields clientFields = new FixedWidthFields();
clientFields.addField("Lookahead", 5); //clients have their lookahead in a separate column
clientFields.addField("ClientID", 15, FieldAlignment.RIGHT, '0'); //let's pad client ID's with leading zeroes.
clientFields.addField("Name", 20);
FixedWidthWriterSettings settings = new FixedWidthWriterSettings();
settings.getFormat().setLineSeparator("\n");
settings.getFormat().setPadding('_');
//If a record starts with C#, it's a client record, so we associate "C#" with the client format.
settings.addFormatForLookahead("C#", clientFields);
//Rows starting with #A should be written using the account format
settings.addFormatForLookahead("A#", accountFields);
StringWriter out = new StringWriter();
//Let's write
FixedWidthWriter writer = new FixedWidthWriter(out, settings);
writer.writeRow(new Object[]{"C#",23234, "Miss Foo"});
writer.writeRow(new Object[]{"A#23234", "HSBC", "123433-000", "HSBCAUS"});
writer.writeRow(new Object[]{"A#234", "HSBC", "222343-130", "HSBCCAD"});
writer.writeRow(new Object[]{"C#",322, "Mr Bar"});
writer.writeRow(new Object[]{"A#1234", "CITI", "213343-130", "CITICAD"});
writer.close();
System.out.println(out.toString());
The output will be:
C#___000000000023234Miss Foo____________
A#23234___HSBC____123433-000_____HSBCAUS_____
A#234_____HSBC____222343-130_____HSBCCAD_____
C#___000000000000322Mr Bar______________
A#1234____CITI____213343-130_____CITICAD_____
This is just a rough example. There are many other options available, including support for annotated java beans, which you can find here.
Disclosure: I'm the author of this library, it's open-source and free (Apache 2.0 License)

The library Fixedformat4j is a pretty neat tool to do exactly this: http://fixedformat4j.ancientprogramming.com/

Spring Batch has a FlatFileItemWriter, but that won't help you unless you use the whole Spring Batch API.
But apart from that, I'd say you just need a library that makes writing to files easy (unless you want to write the whole IO code yourself).
Two that come to mind are:
Guava
Files.write(stringData, file, Charsets.UTF_8);
Commons / IO
FileUtils.writeStringToFile(file, stringData, "UTF-8");

Don't know of any frame work but you can just use RandomAccessFile. You can position the file pointer to anywhere in the file to do your reads and writes.

I've just find a nice library that I'm using:
http://sourceforge.net/apps/trac/ffpojo/wiki
Very simple to configurate with XML or annotations!

A simple way to write beans/entities to a flat file is to use ObjectOutputStream.
public static void writeToFile(File file, Serializable object) throws IOException {
ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream(file));
oos.writeObject(object);
oos.close();
}
You can write to a fixed length flat file with
FileUtils.writeByteArrayToFile(new File(filename), new byte[length]);
You need to be more specific about what you want to do with the file. ;)

Try FFPOJO API as it has everything which you need to create a flat file with fixed lengths and also it will convert a file to an object and vice versa.
#PositionalRecord
public class CFTimeStamp {
String timeStamp;
public CFTimeStamp(String timeStamp) {
this.timeStamp = timeStamp;
}
#PositionalField(initialPosition = 1, finalPosition = 26, paddingAlign = PaddingAlign.RIGHT, paddingCharacter = '0')
public String getTimeStamp() {
return timeStamp;
}
#Override
public String toString() {
try {
FFPojoHelper ffPojo = FFPojoHelper.getInstance();
return ffPojo.parseToText(this);
} catch (FFPojoException ex) {
trsLogger.error(ex.getMessage(), ex);
}
return null;
}
}

SGML parser in Java? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 1 year ago.
Improve this question
I'm looking for a parser in Java that can parse a document formatted in SGML.
For duplicate monitors:
I'm aware of the two other threads that discuss this topic:
Parsing Java String with SGML
Java SGML to XML conversion?
But neither has a resolution, hence the new topic.
For people that confuse XML with SGML:
Please read this: http://www.w3.org/TR/NOTE-sgml-xml-971215#null
(in short, there are enough subtle differences to at least make it unusable in it's vanilla form)
For people who are fond of asking posters to Google it:
I already did and the closest I could come up with was the widely popular SAXParser: http://download.oracle.com/javase/1.4.2/docs/api/javax/xml/parsers/SAXParser.html
But that of course is meant to be an XML parser. I'm looking around to see if anyone has implemented a modification of the SAX Parser to accommodate SGML.
Lastly, I cannot use SX as I'm looking for a Java solution.
Thanks! :)

I have a few approaches to this problem
The first is what you did -- check to see if the sgml document is close enough to XML for the standard SAX parser to work.
The second is to do the same with HTML parsers. The trick here is to find one that doesn't ignore non-HTML elements.
I did find some Java SGML parsers, more in acedemia, when searching for "sgml parser Java". I do not know how well they work.
The last step is to take a standard (non Java) SGML parser and transform the documents into something you can read in Java.
It looks like you were able to work with the first step.

I use OpenSP via JNI, as it seems there is no pure Java SGML parser. I've written an experimental SAX-like wrapper that is available at http://sourceforge.net/projects/sasgml (of course, it has all the drawbacks of JNI... but was enough for my requirements).
Another approach is converting the document to XML by using sx from Open SP, and then run a traditional SAX parser.

There is no api for parsing SGML using Java at this time. There also isn't any api or library for converting SGML to XML and then parsing it using Java. With the status of SGML being supplanted by XML for all the projects I've worked on until now, I don't think there will every be any work done in this area, but that is only a guess.
Here is some open source code code from a University that does it, however I haven't tried it and you would have to search to find the other dependent classes. I believe the only viable solution in Java would require Regular Expressions.
Also, here is a link for public SGML/XML software.

Java SE includes an HTML parser in the javax.swing.text.html.parser package. It claims in its documentation to be a general SGML parser, but then counterclaims in the documentation that you should only use it with the provided HTML DTD class.
If you put it in lenient mode and your SGML documents don't have a lot of implied end tags, you may get reasonable results.
Read about the parser in its JavaDoc, here: http://docs.oracle.com/javase/6/docs/api/javax/swing/text/html/parser/DocumentParser.html
Create an instance like this:
new DocumentParser(DTD.getDTD("html32"))
Or you could ignore the warnings against using a custom DTD with DocumentParser, and create a subclass of DTD that matches the rules of your own SGML format.
This is clearly not an industrial strength SGML parser, but it should be a good starting point for a one-time data migration effort. I've found it useful in previous projects for parsing HTML.

If its HTML that you're parsing, this might do:
http://ccil.org/~cowan/XML/tagsoup/

Though its a very old post and I'm not claiming that the answer I am providing is perfect but it served my purpose. So I am keeping this code I wrote using stack to get the data in a way was required in my case. I hope it may be helpful for others.
try (BufferedReader br = new BufferedReader(new FileReader(new File(
fileName)))) {
while ((line = br.readLine()) != null) {
line = line.trim();
int startOfTag = line.indexOf("<");
int endOfTag = line.indexOf(">");
String currentTag = "";
if (startOfTag > -1 && endOfTag > -1) {
if (countStart)
headerTagsCounter++;
currentTag = line.substring(startOfTag + 1, endOfTag);
String currentData = line.substring(endOfTag + 1,
line.length());
if (i == 1) {
tagStack.push(currentTag);
i++;
}
if (currentData.isEmpty() || currentData == "") {//If there is no data, its a parent tag...
if (!currentTag.contains("/")) {// if its an opening tag...
switch (currentTag) {// these tags are useless in my case, so just skipping these tags.
case "CORRECTION":
case "PAPER":
case "PRIVATE-TO-PUBLIC":
case "DELETION":
case "CONFIRMING-COPY":
case "CAPTION":
case "STUB":
case "COLUMN":
case "TABLE-FOOTNOTES-SECTION":
case "FOOTNOTES":
case "PAGE":
break;
default: {
countStart = false;
int tagCounterNumber = 0;
String historyTagToRemove = "";
for (String historyTag : historyStack) {
String tagCounter = "";
if (historyTag.contains(currentTag)) {//if it's a repeating tag..Append the counter and update the same in history tag..
historyTagToRemove = historyTag;
if (historyTag
.equalsIgnoreCase(currentTag)) {
tagCounterNumber = 1;
} else if (historyTag.length() > currentTag
.length()) {
tagCounter = historyTag
.substring(currentTag
.length());
if (tagCounter != null
&& !tagCounter.isEmpty()) {
tagCounterNumber = Integer
.parseInt(tagCounter) + 1;
}
}
}
}
if (tagCounterNumber > 0)
currentTag += tagCounterNumber;
if (historyTagToRemove != null
&& !historyTagToRemove.isEmpty()) {
historyStack.remove(historyTagToRemove);
historyStack.push(currentTag);
}
tagStack.push(currentTag);
break;
}
}
} else// if its end of a tag... Match the current tag with top of stack and if its a match, pop it out
{
currentTag = currentTag.substring(1);
String tagRemoved = "";
String topStackTag = tagStack.lastElement();
if (topStackTag.contains(currentTag)) {
tagRemoved = tagStack.pop();
historyStack.push(tagRemoved);
}
if (tagStack.size() < 2)
cik = "";
if (tagStack.size() == 2 && cik != null
&& !cik.isEmpty())
for (int j = headerTagsCounter - 1; j < tagList.size(); j++) {
String item = tagList.get(j);
if (!item.contains("##")) {
item += "##" + cik;
tagList.remove(j);
tagList.add(j, item);
}
}
}
} else {// if current tag has some data...
currentData = currentData.trim();
String stackValue = "";
for (String tag : tagStack) {
if (stackValue != null && !stackValue.isEmpty()
&& stackValue != "")
stackValue = stackValue + "||" + tag;
else
stackValue = tag;
}
switch (currentTag) {
case "ACCESSION-NUMBER":
accessionNumber = currentData;
break;
case "FILING-DATE":
dateFiled = currentData;
break;
case "TYPE":
formType = currentData;
break;
case "CIK":
cik = currentData;
break;
}
tagList.add(stackValue + "$$" + currentTag + "::"+ currentData);
}
}
}
// Now all your data is available with in tagList, stack is separated by ||, key is separated by $$ and value is separated by ::
}
} catch (Exception e) {
// TODO Auto-generated catch block
}
}
Output:
Source of file: http://10k-staging.s3.amazonaws.com/edgar0105/2016/12/20/935015/000119312516799070/0001193125-16-799070.hdr.sgml
Output of code:
SEC-HEADER$$SEC-HEADER::0001193125-16-799070.hdr.sgml : 20161220
SEC-HEADER$$ACCEPTANCE-DATETIME::20161220172458
SEC-HEADER$$ACCESSION-NUMBER::0001193125-16-799070
SEC-HEADER$$TYPE::485APOS
SEC-HEADER$$PUBLIC-DOCUMENT-COUNT::9
SEC-HEADER$$FILING-DATE::20161220
SEC-HEADER$$DATE-OF-FILING-DATE-CHANGE::20161220
SEC-HEADER||FILER||COMPANY-DATA$$CONFORMED-NAME::ARTISAN PARTNERS FUNDS INC##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$CIK::0000935015##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$IRS-NUMBER::391811840##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$STATE-OF-INCORPORATION::WI##0000935015
SEC-HEADER||FILER||COMPANY-DATA$$FISCAL-YEAR-END::0930##0000935015
SEC-HEADER||FILER||FILING-VALUES$$FORM-TYPE::485APOS##0000935015
SEC-HEADER||FILER||FILING-VALUES$$ACT::33##0000935015
SEC-HEADER||FILER||FILING-VALUES$$FILE-NUMBER::033-88316##0000935015
SEC-HEADER||FILER||FILING-VALUES$$FILM-NUMBER::162062197##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$STATE::WI##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$ZIP::53202##0000935015
SEC-HEADER||FILER||BUSINESS-ADDRESS$$PHONE::414-390-6100##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$STATE::WI##0000935015
SEC-HEADER||FILER||MAIL-ADDRESS$$ZIP::53202##0000935015
SEC-HEADER||FILER||FORMER-COMPANY$$FORMER-CONFORMED-NAME::ARTISAN FUNDS INC##0000935015
SEC-HEADER||FILER||FORMER-COMPANY$$DATE-CHANGED::19950310##0000935015
SEC-HEADER||FILER||FORMER-COMPANY1$$FORMER-CONFORMED-NAME::ZIEGLER FUNDS INC##0000935015
SEC-HEADER||FILER||FORMER-COMPANY1$$DATE-CHANGED::19950109##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$CONFORMED-NAME::ARTISAN PARTNERS FUNDS INC##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$CIK::0000935015##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$IRS-NUMBER::391811840##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$STATE-OF-INCORPORATION::WI##0000935015
SEC-HEADER||FILER1||COMPANY-DATA1$$FISCAL-YEAR-END::0930##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FORM-TYPE::485APOS##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$ACT::40##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FILE-NUMBER::811-08932##0000935015
SEC-HEADER||FILER1||FILING-VALUES1$$FILM-NUMBER::162062198##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$STATE::WI##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$ZIP::53202##0000935015
SEC-HEADER||FILER1||BUSINESS-ADDRESS1$$PHONE::414-390-6100##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$STREET1::875 EAST WISCONSIN AVE STE 800##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$CITY::MILWAUKEE##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$STATE::WI##0000935015
SEC-HEADER||FILER1||MAIL-ADDRESS1$$ZIP::53202##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY2$$FORMER-CONFORMED-NAME::ARTISAN FUNDS INC##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY2$$DATE-CHANGED::19950310##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY3$$FORMER-CONFORMED-NAME::ZIEGLER FUNDS INC##0000935015
SEC-HEADER||FILER1||FORMER-COMPANY3$$DATE-CHANGED::19950109##0000935015
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS$$OWNER-CIK::0000935015
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES$$SERIES-ID::S000056665
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES$$SERIES-NAME::Artisan Thematic Fund
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES||CLASS-CONTRACT$$CLASS-CONTRACT-ID::C000179292
SEC-HEADER||SERIES-AND-CLASSES-CONTRACTS-DATA||NEW-SERIES-AND-CLASSES-CONTRACTS||NEW-SERIES||CLASS-CONTRACT$$CLASS-CONTRACT-NAME::Investor Shares

Develop Reference

Java is a programming language and computing platform first released by Sun Microsystems in 1995.

how to figure out which character doesn't map to utf-8 - java

Related

How can I set pageToken to get item lists from Google Cloud Storage via Java SDK?

Java XML parser error Invalid character Unicode 0x1A when copy/paste from Word

Design generic process using template pattern

Is there an smart way to write a fixed length flat file?

SGML parser in Java? [closed]

Categories

Resources