Mainframe comp-3 field reading using JRecord - java

I am trying to read a mainframe file. Everything works except the COMP-3 field. The program below prints strange values: it cannot read the salary value (a decimal), and instead prints values like 2020202020.20. I don't know what I am missing. Please help me find it.
Program:
public final class Readcopybook {
    private String dataFile = "EMPFILE.txt";
    private String copybookName = "EMPCOPYBOOK.txt";

    public Readcopybook() {
        super();
        AbstractLine line;
        try {
            ICobolIOBuilder iob = JRecordInterface1.COBOL.newIOBuilder(copybookName)
                    .setFileOrganization(Constants.IO_BINARY_IBM_4680)
                    .setSplitCopybook(CopybookLoader.SPLIT_NONE);
            AbstractLineReader reader = iob.newReader(dataFile);
            while ((line = reader.read()) != null) {
                System.out.println(line.getFieldValue("EMP-NO").asString() + " "
                        + line.getFieldValue("EMP-NAME").asString() + " "
                        + line.getFieldValue("EMP-ADDRESS").asString() + " "
                        + line.getFieldValue("EMP-SALARY").asString() + " "
                        + line.getFieldValue("EMP-ZIPCODE").asString());
            }
            reader.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        new Readcopybook();
    }
}
EMPCOPYBOOK:
001700 01 EMP-RECORD.
001900 10 EMP-NO PIC 9(10).
002000 10 EMP-NAME PIC X(30).
002100 10 EMP-ADDRESS PIC X(30).
002200 10 EMP-SALARY PIC S9(8)V9(2) COMP-3.
002200 10 EMP-ZIPCODE PIC 9(4).
EMPFILE:
0000001001suneel kumar r bangalore e¡5671
0000001002JOSEPH WHITE FIELD rrn4500
Output:
1001 suneel kumar r bangalore 20200165a10 5671
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
2020202020.20
0.00
1002 JOSEPH WHITE FIELD 202072726e0 4500

One problem is that an EBCDIC-to-ASCII conversion has been done on the file.
The 2020... is a dead giveaway: x'20' is the ASCII space character.
This answer deals with the problems caused by doing an EBCDIC-to-ASCII conversion.
You need to do a binary transfer from the mainframe and read the file as EBCDIC. You will also need to check the RECFM on the mainframe. If the RECFM is:
FB - no problems, just transfer
VB - either convert to FB on the mainframe or include the RDW (Record Descriptor Word) option in the transfer
Other - convert to FB/VB on the mainframe
Updated Java code:
int fileOrg = Constants.IO_FIXED_LENGTH_RECORDS; // or Constants.IO_VB
ICobolIOBuilder iob = JRecordInterface1.COBOL
.newIOBuilder(copybookName)
.setFileOrganization(fileOrg)
.setFont("Cp037")
.setSplitCopybook(CopybookLoader.SPLIT_NONE);
Note: IO_BINARY_IBM_4680 is for IBM 4690 registers.
There is a wiki entry here, or see this question:
How do you generate java~jrecord code for a Cobol copybook
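For reference, a minimal sketch of the corrected read loop, assuming the file has been re-transferred in binary as fixed-length EBCDIC records. The file name EMPFILE.bin is a placeholder, and asBigDecimal() is assumed to be available on JRecord field values (asString(), as in the original program, works as well):

    ICobolIOBuilder iob = JRecordInterface1.COBOL
            .newIOBuilder("EMPCOPYBOOK.txt")
            .setFileOrganization(Constants.IO_FIXED_LENGTH_RECORDS)
            .setFont("Cp037")                              // interpret the bytes as EBCDIC
            .setSplitCopybook(CopybookLoader.SPLIT_NONE);

    AbstractLineReader reader = iob.newReader("EMPFILE.bin");   // binary-transferred file
    AbstractLine line;
    try {
        while ((line = reader.read()) != null) {
            // COMP-3 (packed decimal) fields come back as numeric values,
            // so the salary can be retrieved directly as a decimal.
            System.out.println(line.getFieldValue("EMP-NO").asString() + " "
                    + line.getFieldValue("EMP-SALARY").asBigDecimal());
        }
    } finally {
        reader.close();
    }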

Related

Apply LOOCV in java splitting with a specific condition

I have a CSV file containing 24231 rows. I would like to apply LOOCV based on the project name instead of on the individual observations of the whole dataset.
So if my dataset contains information for 15 projects, I would like to have the training set based on 14 projects and the test set based on the remaining project.
I was relying on Weka's API; is there anything that automates this process?
For non-numeric attributes, Weka allows you to retrieve the unique values via Attribute.numValues() (how many there are) and Attribute.value(int) (the i-th value).
package weka;

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils;

public class LOOByValue {

  /**
   * 1st arg: ARFF file to load
   * 2nd arg: 0-based index in ARFF to use for class
   * 3rd arg: 0-based index in ARFF to use for LOO
   *
   * @param args the command-line arguments
   * @throws Exception if loading/processing of data fails
   */
  public static void main(String[] args) throws Exception {
    // load data
    Instances full = ConverterUtils.DataSource.read(args[0]);
    full.setClassIndex(Integer.parseInt(args[1]));
    int looCol = Integer.parseInt(args[2]);
    Attribute looAtt = full.attribute(looCol);
    if (looAtt.isNumeric())
      throw new IllegalStateException("Attribute cannot be numeric!");

    // iterate unique values to create train/test splits
    for (int i = 0; i < looAtt.numValues(); i++) {
      String value = looAtt.value(i);
      System.out.println("\n" + (i + 1) + "/" + full.attribute(looCol).numValues() + ": " + value);
      Instances train = new Instances(full, full.numInstances());
      Instances test = new Instances(full, full.numInstances());
      for (int n = 0; n < full.numInstances(); n++) {
        Instance inst = full.instance(n);
        if (inst.stringValue(looCol).equals(value))
          test.add((Instance) inst.copy());
        else
          train.add((Instance) inst.copy());
      }
      train.compactify();
      test.compactify();

      // TODO do something with the data
      System.out.println("train size: " + train.numInstances());
      System.out.println("test size: " + test.numInstances());
    }
  }
}
With Weka's anneal UCI dataset and the surface-quality attribute for leave-one-out, you can generate output like this:
1/5: ?
train size: 654
test size: 244
2/5: D
train size: 843
test size: 55
3/5: E
train size: 588
test size: 310
4/5: F
train size: 838
test size: 60
5/5: G
train size: 669
test size: 229
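As for the TODO inside the loop, here is a minimal sketch that trains and evaluates a classifier on each split. J48 is just a placeholder (any Weka Classifier could be used), and the class attribute is assumed to be nominal:

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;

    // inside the for-loop, after train/test have been compactified:
    Classifier cls = new J48();                  // placeholder classifier
    cls.buildClassifier(train);                  // fit on the projects kept in the training set
    Evaluation eval = new Evaluation(train);
    eval.evaluateModel(cls, test);               // evaluate on the held-out project
    System.out.println(eval.toSummaryString());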

Retrieve text content by matching start word and end word

I am getting a text file with contents like the one below. I want to retrieve the data present between start_word=Tax% and end_word="ErrorMessage".
ParsedText:
Tax%
63 2 .90 0.00 D INTENS SH 80ML(48) 9.00% 9.00%
23 34013090 0.0 DS PURE WHIT 1 COG (24) 9.00% 9.00%
"ErrorMessage":"","ErrorDetails":""
After retrieving it, the output would be:
63 2 .90 0.00 D INTENS SH 80ML(48) 9.00% 9.00%
23 34013090 0.0 DS PURE WHIT 1 COG (24) 9.00% 9.00%
Please help.
I am using Camel to read the text; then I want to retrieve the data to process it further as per my requirement.
import org.apache.camel.Exchange;
import org.apache.camel.Processor;

public class DataExtractor implements Processor {

    @Override
    public void process(Exchange exchange) throws Exception {
        String textContent = (String) exchange.getIn().getBody();
        System.out.println("TextContents >>>>>>" + textContent);
    }
}
In the text content I am getting the content that I have given above. I need help with retrieving the data in Java.
Below is the code snippet to extract the desired output:
String[] strArr = textContent.split("\\r?\\n");
StringBuilder stringBuilder = new StringBuilder();
boolean appendLines = false;
for (String strLines : strArr) {
    if (strLines.contains("Tax%")) {
        appendLines = true;
        continue;
    }
    if (strLines.contains("\"ErrorMessage\"")) {
        break;
    }
    if (appendLines) {
        stringBuilder.append(strLines);
        stringBuilder.append(System.getProperty("line.separator"));
    }
}
textContent = stringBuilder.toString();
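For completeness, a minimal sketch of the same snippet wired into the DataExtractor processor from the question, so that the extracted block replaces the message body for the rest of the route (the route configuration itself is assumed to exist elsewhere):

    import org.apache.camel.Exchange;
    import org.apache.camel.Processor;

    public class DataExtractor implements Processor {

        @Override
        public void process(Exchange exchange) throws Exception {
            String textContent = exchange.getIn().getBody(String.class);
            StringBuilder stringBuilder = new StringBuilder();
            boolean appendLines = false;
            for (String line : textContent.split("\\r?\\n")) {
                if (line.contains("Tax%")) {              // start marker
                    appendLines = true;
                    continue;
                }
                if (line.contains("\"ErrorMessage\"")) {  // end marker
                    break;
                }
                if (appendLines) {
                    stringBuilder.append(line).append(System.getProperty("line.separator"));
                }
            }
            // replace the body with just the extracted lines
            exchange.getIn().setBody(stringBuilder.toString());
        }
    }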

Can PDF documents contain "unreachable" content?

I am investigating Java PDF libraries.
I have tried:
org.apache.pdfbox
File file = new File("file.pdf");
PDDocument document = PDDocument.load(file);
// Instantiate PDFTextStripper class
PDFTextStripper pdfStripper = new PDFTextStripper();
// Retrieving text from PDF document
String text = pdfStripper.getText(document);
System.out.println(text);
// Closing the document
document.close();
com.itextpdf.text.pdf
public static final String SRC = "file.pdf";
public static final String DEST = "streams";

public static void main(final String[] args) throws IOException {
    File file = new File(DEST);
    new BruteForce().parse(SRC, DEST);
}

public void parse(final String src, final String dest) throws IOException {
    PdfReader reader = new PdfReader(src);
    PdfObject obj;
    for (int i = 1; i <= reader.getXrefSize(); i++) {
        obj = reader.getPdfObject(i);
        if ((obj != null) && obj.isStream()) {
            PRStream stream = (PRStream) obj;
            byte[] b;
            try {
                b = PdfReader.getStreamBytes(stream);
            } catch (UnsupportedPdfException e) {
                b = PdfReader.getStreamBytesRaw(stream);
            }
            FileOutputStream fos = new FileOutputStream(String.format(dest, i));
            fos.write(b);
            fos.flush();
            fos.close();
        } else {
            final PdfDictionary pdfDictionary = (PdfDictionary) obj;
            System.out.println("\t>>>>> " + pdfDictionary + "\t\t" + pdfDictionary.getKeys());
            final Set<PdfName> pdfNames = pdfDictionary.getKeys();
            for (final PdfName pdfName : pdfNames) {
                final PdfObject pdfObject = pdfDictionary.get(pdfName);
                final int type = pdfObject.type();
                switch (type) {
                    case PdfObject.NULL:
                        System.out.println("\t NULL " + pdfObject);
                        break;
                    case PdfObject.BOOLEAN:
                        System.out.println("\t BOOLEAN " + pdfObject);
                        break;
                    case PdfObject.NUMBER:
                        System.out.println("\t NUMBER " + pdfObject);
                        break;
                    case PdfObject.STRING:
                        System.out.println("\t STRING " + pdfObject);
                        break;
                    case PdfObject.NAME:
                        System.out.println("\t NAME " + pdfObject);
                        break;
                    case PdfObject.ARRAY:
                        System.out.println("\t ARRAY " + pdfObject);
                        break;
                    case PdfObject.DICTIONARY:
                        System.out.println("\t DICTIONARY " + ((PdfDictionary) pdfObject).getKeys());
                        break;
                    case PdfObject.STREAM:
                        System.out.println("\t STREAM " + pdfObject);
                        break;
                    case PdfObject.INDIRECT:
                        System.out.println("\t INDIRECT " + pdfObject.getIndRef());
                        break;
                    default:
                }
                System.out.println("\t\t--- " + pdfObject.type());
            }
        }
    }
}
com.snowtide.pdf
String pdfFilePath = "file.pdf";
Document pdf = PDF.open(pdfFilePath);
final List<Annotation> annotations = pdf.getAllAnnotations();
for (final Annotation annotation : annotations) {
    System.out.println(annotation.pageNumber());
}
System.out.println(pdf.getAttributeMap());
System.out.println(pdf.getAttributeKeys());
System.out.println("=============================");
StringBuilder text = new StringBuilder(1024);
pdf.pipe(new OutputTarget(text));
pdf.close();
System.out.println(text);
I can extract all visible PDF content including links, text, and images apart from what appears to be a "Watermark" that appears on every page.
Can PDF documents contain "unreachable" content?
Is there no way to extract ALL content from a PDF file?
UPDATE
Thinking the "watermark" was an image, I tried this code:
File fileW = new File("file.pdf");
PDDocument document = PDDocument.load(fileW);
PDPageTree list = document.getPages();
for (PDPage page : list) {
    PDResources pdResources = page.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        System.out.println("????? ::>>>" + c);
        PDXObject o = pdResources.getXObject(c);
        if (o instanceof org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) {
            File file = new File("Temp/" + System.nanoTime() + ".png");
            ImageIO.write(((org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject) o).getImage(), "png", file);
        } else {
        }
    }
}
The PDF does contain images of the authors; however, the "watermark" is not reached with this approach.
The page content streams of the example document provided by the OP have the following structure from page 2 onward:
A textual header line "www.electrophoresis-journal.com Page X Electrophoresis":
BT
/F1 9.12 Tf
1 0 0 1 72.024 798.46 Tm
/GS7 gs
0 g
0 G
[(w)11(w)11(w)11(.)-12(e)-2(l)15(e)-2(c)23(t)-10(r)-8(o)26(pho)26(r)-8(e)-2(s)21(i)-10(s)] TJ
ET
[...]
BT
1 0 0 1 441.53 798.46 Tm
[(E)6(l)-10(e)-2(c)23(t)-10(r)-8(o)26(pho)26(r)-8(e)23(s)-5(i)15(s)] TJ
ET
BT
1 0 0 1 497.47 798.46 Tm
[( )] TJ
ET
BT
1 0 0 1 72.024 787.9 Tm
[( )] TJ
ET
This text can easily be extracted using normal iText or PDFBox text extraction.
A textual multi-line footer "Received: ... All rights reserved."
BT
1 0 0 1 72.024 109.7 Tm
[(R)9(e)-2(c)23(e)-2(i)-10(v)26(e)-2(d:)41( )] TJ
ET
[...]
BT
1 0 0 1 72.024 47.76 Tm
[(T)6(hi)-10(s)21( )-12(a)23(r)-8(t)15(i)-10(c)23(l)-10(e)23( )13(i)-10(s)21( )-12(pr)-8(o)26(t)15(e)-2(c)23(t)-10(e)-2(d)26( )-12(by)53( )-12(c)-2(o)26(p)-25(y)53(r)-8(i)-10(g)26(ht)-10(.)-12( )-12(A)38(l)-10(l)15( )13(r)-8(i)-10(g)26(ht)15(s)-5( )13(r)-8(e)23(s)-5(e)-2(r)-8(v)26(e)-2(d)26(.)] TJ
ET
BT
1 0 0 1 278.52 47.76 Tm
[( )] TJ
ET
BT
1 0 0 1 72.024 37.2 Tm
[( )] TJ
ET
This text also can easily be extracted using normal iText or PDFBox text extraction.
A set of PDF path creation and filling operations using a custom graphics state forming the transparent "Accepted Article" writing on the left of the page:
/GS8 gs
0 g
39.605 266.51 m
39.605 261.29 39.605 256.06 39.605 250.84 c
42.197 249.94 44.776 248.99 47.367 248.09 c
49.296 247.41 50.704 247.08 51.649 247.08 c
52.413 247.08 53.058 247.38 53.609 247.97 c
54.191 248.54 54.548 249.82 54.729 251.77 c
55.18 251.77 55.624 251.77 56.075 251.77 c
56.075 247.51 56.075 243.26 56.075 239.02 c
55.624 239.02 55.18 239.02 54.729 239.02 c
54.36 240.72 53.903 241.8 53.314 242.3 c
52.144 243.3 49.809 244.47 46.247 245.67 c
32.719 250.33 19.286 255.25 5.7645 259.91 c
5.7645 260.26 5.7645 260.61 5.7645 260.95 c
19.43 265.57 33.014 270.43 46.679 275.05 c
49.984 276.16 52.075 277.24 53.064 278.15 c
54.053 279.06 54.623 280.36 54.729 282 c
55.18 282 55.624 282 56.075 282 c
56.075 276.68 56.075 271.35 56.075 266.03 c
55.624 266.03 55.18 266.03 54.729 266.03 c
54.623 267.64 54.303 268.75 53.753 269.31 c
53.202 269.88 52.519 270.15 51.718 270.15 c
50.666 270.15 48.97 269.75 46.679 268.95 c
44.319 268.15 41.971 267.31 39.605 266.51 c
h
36.92 265.67 m
30.284 263.43 23.686 261.05 17.045 258.81 c
23.686 256.5 30.284 254.07 36.92 251.77 c
36.92 256.4 36.92 261.04 36.92 265.67 c
h
f*
[...]
35.361 630.34 m
40.294 630.31 44.156 631.32 46.967 633.29 c
49.784 635.27 51.18 637.63 51.18 640.31 c
51.18 642.1 50.573 643.67 49.364 645 c
48.156 646.3 46.141 647.43 43.236 648.31 c
43.48 648.62 43.712 648.93 43.962 649.24 c
47.261 648.83 50.253 647.57 52.989 645.6 c
55.731 643.62 57.089 641.06 57.089 638.05 c
57.089 634.76 55.549 631.92 52.413 629.63 c
49.302 627.3 45.158 626.1 39.899 626.1 c
34.203 626.1 29.802 627.33 26.585 629.71 c
23.405 632.07 21.834 635.12 21.834 638.73 c
21.834 641.8 23.048 644.34 25.496 646.28 c
27.981 648.22 31.267 649.24 35.361 649.24 c
35.361 642.94 35.361 636.64 35.361 630.34 c
h
33.258 630.34 m
33.258 634.56 33.258 638.78 33.258 643 c
31.117 642.91 29.633 642.7 28.763 642.37 c
27.417 641.87 26.341 641.14 25.571 640.16 c
24.801 639.19 24.406 638.13 24.406 637.06 c
24.406 635.42 25.158 633.91 26.729 632.64 c
28.306 631.34 30.466 630.55 33.258 630.34 c
h
f*
(The instructions I quoted draw the initial 'A' and the final 'e'.)
This writing cannot be extracted using normal iText or PDFBox text extraction as it neither is drawn using text instruction nor is marked with an ActualText entry. (The latter could be recognized using customized iText or PDFBox text extraction.)
But you can extract this writing as the sequence of path creation and drawing commands it consists of using an implementation of the iText ExtRenderListener interface or a subclass of the PDFBox PDFGraphicsStreamEngine.
The actual text content of the article, opaque, using text drawing instructions, e.g.
BT
/F2 10.08 Tf
1 0 0 1 72.024 760.78 Tm
/GS7 gs
0 g
[(H)-7(I)8(G)16(H)-7( )-106(TH)-6(R)32(O)-7(U)8(G)16(H)-7(P)16(U)8(T )-106(M)-7(U)8(LTI)] TJ
ET
BT
1 0 0 1 212.98 760.78 Tm
[(-)] TJ
ET
BT
1 0 0 1 216.1 760.78 Tm
[(O)-7(R)8(G)-7(A)8(N)32( )-130(M)15(ETA)32(BO)-6(LO)16(M)-7(I)8(C)8(S)8( )-130(I)8(N)8( )-106(TH)-6(E)24( )-130(A)8(P)16(P)16(/)-7(P)16(S)8(1 )-106(M)-7(O)-7(U)8(S)8(E)24( )-130(M)15(O)-7(D)8(EL)24( )-106(O)-7(F)16( )] TJ
ET
This text also can easily be extracted using normal iText or PDFBox text extraction.
Concerning the OP's questions, therefore,
I can extract all visible PDF content including links, text, and images apart from what appears to be a "Watermark" that appears on every page.
Can PDF documents contain "unreachable" content?
That content is not "unreachable", it merely is not text drawn using text drawing instructions but instead text drawn like an arbitrary shape.
Is there no way to extract ALL content from a PDF file?
You can extract that content, merely not as text but instead as a collection of path creation and drawing instructions. Whenever you suspect such instructions to actually draw letter shapes, you can try to determine the text by rendering these paths as a bitmap and applying OCR.
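To make the last point concrete, here is a minimal sketch (assuming PDFBox 2.x) of a PDFGraphicsStreamEngine subclass that simply logs the path construction and painting operations of each page; collecting the paths and rendering them to a bitmap for OCR is left out, and the class name and file name are placeholders:

    import java.awt.geom.Point2D;
    import java.io.File;
    import java.io.IOException;

    import org.apache.pdfbox.contentstream.PDFGraphicsStreamEngine;
    import org.apache.pdfbox.cos.COSName;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.graphics.image.PDImage;

    public class PathLogger extends PDFGraphicsStreamEngine {

        private Point2D current = new Point2D.Float();

        protected PathLogger(PDPage page) {
            super(page);
        }

        // path construction operators (m, l, c, re, h)
        @Override public void moveTo(float x, float y) { current = new Point2D.Float(x, y); System.out.println("m " + x + " " + y); }
        @Override public void lineTo(float x, float y) { current = new Point2D.Float(x, y); System.out.println("l " + x + " " + y); }
        @Override public void curveTo(float x1, float y1, float x2, float y2, float x3, float y3) {
            current = new Point2D.Float(x3, y3);
            System.out.println("c ... " + x3 + " " + y3);
        }
        @Override public void appendRectangle(Point2D p0, Point2D p1, Point2D p2, Point2D p3) { System.out.println("re " + p0 + " " + p2); }
        @Override public void closePath() { System.out.println("h"); }
        @Override public Point2D getCurrentPoint() { return current; }

        // path painting operators (f, S, B, n) - the "Accepted Article" letters are filled here
        @Override public void fillPath(int windingRule) { System.out.println("f"); }
        @Override public void strokePath() { System.out.println("S"); }
        @Override public void fillAndStrokePath(int windingRule) { System.out.println("B"); }
        @Override public void endPath() { System.out.println("n"); }

        // not needed for this experiment
        @Override public void clip(int windingRule) { }
        @Override public void shadingFill(COSName shadingName) { }
        @Override public void drawImage(PDImage pdImage) { }

        public static void main(String[] args) throws IOException {
            try (PDDocument doc = PDDocument.load(new File("file.pdf"))) {
                for (PDPage page : doc.getPages()) {
                    new PathLogger(page).processPage(page);
                }
            }
        }
    }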

How to train Chunker in Opennlp?

I need to train the Chunker in OpenNLP to classify the training data as noun phrases. How do I proceed? The online documentation does not explain how to do it from within a program rather than from the command line. It says to use en-chunker.train, but how do you make that file?
EDIT: @Alaye
After running the code you gave in your answer, I get the following error that I cannot fix:
Indexing events using cutoff of 5
Computing event counts... done. 3 events
Dropped event B-NP:[w_2=bos, w_1=bos, w0=He, w1=reckons, w2=., w_1=bosw0=He, w0=Hew1=reckons, t_2=bos, t_1=bos, t0=PRP, t1=VBZ, t2=., t_2=bost_1=bos, t_1=bost0=PRP, t0=PRPt1=VBZ, t1=VBZt2=., t_2=bost_1=bost0=PRP, t_1=bost0=PRPt1=VBZ, t0=PRPt1=VBZt2=., p_2=bos, p_1=bos, p_2=bosp_1=bos, p_1=bost_2=bos, p_1=bost_1=bos, p_1=bost0=PRP, p_1=bost1=VBZ, p_1=bost2=., p_1=bost_2=bost_1=bos, p_1=bost_1=bost0=PRP, p_1=bost0=PRPt1=VBZ, p_1=bost1=VBZt2=., p_1=bost_2=bost_1=bost0=PRP, p_1=bost_1=bost0=PRPt1=VBZ, p_1=bost0=PRPt1=VBZt2=., p_1=bosw_2=bos, p_1=bosw_1=bos, p_1=bosw0=He, p_1=bosw1=reckons, p_1=bosw2=., p_1=bosw_1=bosw0=He, p_1=bosw0=Hew1=reckons]
Dropped event B-VP:[w_2=bos, w_1=He, w0=reckons, w1=., w2=eos, w_1=Hew0=reckons, w0=reckonsw1=., t_2=bos, t_1=PRP, t0=VBZ, t1=., t2=eos, t_2=bost_1=PRP, t_1=PRPt0=VBZ, t0=VBZt1=., t1=.t2=eos, t_2=bost_1=PRPt0=VBZ, t_1=PRPt0=VBZt1=., t0=VBZt1=.t2=eos, p_2=bos, p_1=B-NP, p_2=bosp_1=B-NP, p_1=B-NPt_2=bos, p_1=B-NPt_1=PRP, p_1=B-NPt0=VBZ, p_1=B-NPt1=., p_1=B-NPt2=eos, p_1=B-NPt_2=bost_1=PRP, p_1=B-NPt_1=PRPt0=VBZ, p_1=B-NPt0=VBZt1=., p_1=B-NPt1=.t2=eos, p_1=B-NPt_2=bost_1=PRPt0=VBZ, p_1=B-NPt_1=PRPt0=VBZt1=., p_1=B-NPt0=VBZt1=.t2=eos, p_1=B-NPw_2=bos, p_1=B-NPw_1=He, p_1=B-NPw0=reckons, p_1=B-NPw1=., p_1=B-NPw2=eos, p_1=B-NPw_1=Hew0=reckons, p_1=B-NPw0=reckonsw1=.]
Dropped event O:[w_2=He, w_1=reckons, w0=., w1=eos, w2=eos, w_1=reckonsw0=., w0=.w1=eos, t_2=PRP, t_1=VBZ, t0=., t1=eos, t2=eos, t_2=PRPt_1=VBZ, t_1=VBZt0=., t0=.t1=eos, t1=eost2=eos, t_2=PRPt_1=VBZt0=., t_1=VBZt0=.t1=eos, t0=.t1=eost2=eos, p_2B-NP, p_1=B-VP, p_2B-NPp_1=B-VP, p_1=B-VPt_2=PRP, p_1=B-VPt_1=VBZ, p_1=B-VPt0=., p_1=B-VPt1=eos, p_1=B-VPt2=eos, p_1=B-VPt_2=PRPt_1=VBZ, p_1=B-VPt_1=VBZt0=., p_1=B-VPt0=.t1=eos, p_1=B-VPt1=eost2=eos, p_1=B-VPt_2=PRPt_1=VBZt0=., p_1=B-VPt_1=VBZt0=.t1=eos, p_1=B-VPt0=.t1=eost2=eos, p_1=B-VPw_2=He, p_1=B-VPw_1=reckons, p_1=B-VPw0=., p_1=B-VPw1=eos, p_1=B-VPw2=eos, p_1=B-VPw_1=reckonsw0=., p_1=B-VPw0=.w1=eos]
Indexing... done.
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at opennlp.tools.ml.model.AbstractDataIndexer.sortAndMerge(AbstractDataIndexer.java:89)
at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:105)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
at opennlp.tools.ml.model.TrainUtil.train(TrainUtil.java:53)
at opennlp.tools.chunker.ChunkerME.train(ChunkerME.java:253)
at com.oracle.crm.nlp.CustomChunker2.main(CustomChunker2.java:91)
Sorting and merging events... Process exited with exit code 1.
(My en-chunker.train had only the first 2 and last line of your sample data set.)
Could you please tell me why this is happening and how to fix it?
EDIT2: I got the Chunker to work, however it gives an error when I change the sentence in the training set to any sentence other than the one you've given in your answer. Can you tell me why that could be happening?
As said in the OpenNLP documentation:
Sample sentence of the training data:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
This is how you make your en-chunker.train file, and you can create the corresponding .bin file using the CLI:
$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding
or using API
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;
import java.util.Objects;

import opennlp.tools.chunker.ChunkSample;
import opennlp.tools.chunker.ChunkSampleStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.chunker.DefaultChunkerContextGenerator;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainer {

    public static void trainModel(String inputFile, String modelFile)
            throws IOException {
        Objects.requireNonNull(inputFile);
        Objects.requireNonNull(modelFile);

        // read the training data from inputFile, line by line
        MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(
                new File(inputFile));
        Charset charset = Charset.forName("UTF-8");
        ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
        ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream);

        ChunkerModel model;
        try {
            model = ChunkerME.train("en", sampleStream,
                    new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams());
        } finally {
            sampleStream.close();
        }

        // serialize the trained model to modelFile
        OutputStream modelOut = null;
        try {
            modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
            model.serialize(modelOut);
        } finally {
            if (modelOut != null)
                modelOut.close();
        }
    }
}
and the main method will be:
public class Main {

    public static void main(String args[]) throws IOException {
        String inputFile = "//path//to//data.train";
        String modelFile = "//path//to//.bin";
        SentenceTrainer.trainModel(inputFile, modelFile);
    }
}
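Once the model has been written, a minimal sketch of loading it and chunking a POS-tagged sentence could look like this (the model file name is a placeholder, and the tokens/tags are taken from the sample sentence above):

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.chunker.ChunkerME;
    import opennlp.tools.chunker.ChunkerModel;

    public class ChunkerDemo {

        public static void main(String[] args) throws Exception {
            try (InputStream modelIn = new FileInputStream("en-chunker.bin")) {
                ChunkerME chunker = new ChunkerME(new ChunkerModel(modelIn));

                // tokens and their POS tags (from the training sample above)
                String[] tokens = { "He", "reckons", "the", "current", "account", "deficit" };
                String[] tags   = { "PRP", "VBZ", "DT", "JJ", "NN", "NN" };

                // predicted chunk labels, e.g. B-NP, B-VP, I-NP ...
                String[] chunks = chunker.chunk(tokens, tags);
                for (int i = 0; i < tokens.length; i++) {
                    System.out.println(tokens[i] + " " + tags[i] + " " + chunks[i]);
                }
            }
        }
    }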
Reference: this blog.
Hope this helps!
PS: collect/write the data as above in a .txt file and rename it with a .train extension (even trainingdata.txt will work); that is how you make a .train file.

How to validate all preflight errors for PDF/A-1a in PDFBox

I am working on validating PDF/A-1a. I followed the code which already exists in this link: PDFbox Preflight PDF/A-1b check not working properly in java version 1.8
public class test
{
    public static void main(final String[] args) throws Exception
    {
        File pdfa = new File("D:/DMC-B787-A-00-40-07-00A-008B-D.pdf"); // error pdf
        isPDFAdocument(pdfa);
        System.out.println("sucess");
    }

    private static void isPDFAdocument(File pdfa)
    {
        ValidationResult result = null;
        PreflightParser parser;
        try
        {
            parser = new PreflightParser(pdfa);
            parser.parse(Format.PDF_A1A);
            PreflightDocument documentt = parser.getPreflightDocument();
            result = documentt.getResult();
            System.out.println("result" + result);
            documentt.close();
        }
        catch (SyntaxValidationException e)
        {
            result = e.getResult();
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }

        if (result.isValid())
        {
            System.out.println("The file " + pdfa + " is a valid PDF/A-1a file");
        }
        else
        {
            System.out.println("The file" + pdfa + " is not valid, error(s) :");
            for (ValidationError error : result.getErrorsList())
            {
                System.out.println(error.getErrorCode() + " : " + error.getDetails());
            }
        }
    }
}
It is not checking the errors mentioned below. If they are present it should report them, but the validation still succeeds.
Kindly suggest how to validate all possible preflight errors below, and how to check for them in PDFBox.
Error
CharSet incomplete for Type 1 font (2 matches on 1 page) - 2
Width information for rendered glyphs is inconsistent (2 matches on 1 page) - 2
Document information
File name: "DMC-B787-A-00-40-07-00A-008B-D.pdf"
Path: "C:\Users\wm751e\Documents\Feb19\Synchronize print\Only WDM\Archived doctypes
latest\EA_TBC2016-02-2115.57.49IPD\EA_TBC2016-02-2115.57.49IPD\00"
PDF version number: "1.4"
File size (KB): 114.2
Title: "Illustrated Parts Data - Service Bulletin/Modification List"
Author: "The Boeing Company (PRINTENGINEWEB_BUILD_1.7.49.5.0.0; s1000d_merged_v6.5.36_4.xsl; JobID:)"
Creator: "AH XSL Formatter V6.0 MR7 for Linux64 : 6.0.8.9416 (2013/02/26 10:36JST)"
Producer: "Antenna House PDF Output Library 6.0.389 (Linux64)"
Created: "2/21/2016 3:56 PM"
Modified: "2/21/2016 3:56 PM"
Trapping: "False"
Number of plates: 4
Names of plates: "(Cyan) (Magenta) (Yellow) (Black) "
Environment
Preflight, 15.0.0 (151)
Acrobat version: 15.60
Operating system: Microsoft Windows 7 Service Pack 1 (Build 7601
