How to read bullets from an RTF file - Java

I have an RTF file that contains some text with bullets, as shown in the screenshot below.
I want to extract the text along with the bullets, but when I print it to the console I get junk values. How do I print exactly what is in the file to the console?
This is what I tried:
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

public static void main(String[] args) throws IOException, BadLocationException {
    // Parse the RTF file into a Swing Document, then print its plain text
    RTFEditorKit rtf = new RTFEditorKit();
    Document doc = rtf.createDefaultDocument();
    FileInputStream fis = new FileInputStream("C:\\Users\\Guest\\Desktop\\abc.rtf");
    InputStreamReader in = new InputStreamReader(fis, "UTF-8");
    rtf.read(in, doc, 0);
    System.out.println(doc.getText(0, doc.getLength()));
}
Console output: (junk characters in place of the bullets)
I assumed the junk values were due to the console not supporting the charset, so I tried to generate a PDF file instead, but in the PDF I also get the same junk values.
This is the PDF code:
// smallNormal_11 is a Font defined elsewhere in my code
Paragraph de = new Paragraph();
Phrase pde = new Phrase();
pde.add(new Chunk(getText("C:\\Users\\Guest\\Desktop\\abc.rtf"), smallNormal_11));
de.add(pde);
de.getFont().setStyle(BaseFont.IDENTITY_H);
document.add(de);

// reads the RTF file at the given path and returns its plain text
public static String getText(String path) throws IOException, BadLocationException {
    RTFEditorKit rtf = new RTFEditorKit();
    Document doc = rtf.createDefaultDocument();
    FileInputStream fis = new FileInputStream(path);
    InputStreamReader in = new InputStreamReader(fis, "UTF-8");
    rtf.read(in, doc, 0);
    return doc.getText(0, doc.getLength());
}

Despite what you said, my guess is that it is a console encoding problem.
Anyway, you can easily check it. Just replace this line:
System.out.println(doc.getText(0, doc.getLength()));
with these two lines:
PrintStream ps = new PrintStream(System.out, true, "UTF-8");
ps.println(doc.getText(0, doc.getLength()));
This will force console encoding to UTF-8.
If it is still wrong, I would suspect your file is not fully RTF-compliant.
I ran some tests and your code works fine under Linux (the console version; I did not try the PDF one), but there the console is natively UTF-8.
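If you want to rule the console out entirely, you can also write the extracted text straight to a UTF-8 file and open it in an editor. A minimal sketch (the output path is just an example):
// Write the extracted text to a UTF-8 encoded file, bypassing the console
Writer w = new OutputStreamWriter(new FileOutputStream("out.txt"), "UTF-8");
w.write(doc.getText(0, doc.getLength()));
w.close();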

Related

Encoding for FontFactory.getFont() [duplicate]

Hiyas
I'm trying to display this string:
λλλλλλλλλλλλλλλλλλλλλλλλ
which is read from an RTF file, parsed, and put into this variable. It is NOT used as a constant in the code.
Font pdfFont = FontFactory.getFont(font.getFont().getName(), BaseFont.IDENTITY_H, embed, font.getFont().getSize2D(), style);
Phrase phrase = new Phrase("λλλλλλλλλλλλλλλλλλλλλλλλ", pdfFont);
ColumnText.showTextAligned(content[i], alignment, phrase, x, y, rotation);
I also tried CP1252 (and basically all the other encodings I found) together with a simple ArialMT.ttf font, but that damn string is never displayed. I can see that the conversion to the byte array inside iText (we use 5.5.0) always returns a zero-length byte array, which explains why the text is not used, but I don't understand why. What encoding would I need to use to make this visible in a PDF?
Thanks a lot
I suppose that you want to get a result that looks like this:
That's easy. I first tried the SunCharacter example from the official documentation. That example was written in answer to the question: iText : Unable to print mathematical characters like ∈, ∩, ∑, ∫, ∆ √, ∠
I then changed the TEXT to:
public static final String TEXT = "Always use the Unicode notation for special characters: \u03bb";
As you can see, I don't use λ in my source code (that's bad practice). Instead, I use \u03bb, which is the Unicode notation for λ.
The result looked like this:
That's not what you want; you want ArialMT. So I changed the FONT to:
public static final String FONT = "c:/windows/fonts/arial.ttf";
This gave me the desired PDF.
This is the full code sample:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.Document;
import com.itextpdf.text.DocumentException;
import com.itextpdf.text.Font;
import com.itextpdf.text.Paragraph;
import com.itextpdf.text.pdf.BaseFont;
import com.itextpdf.text.pdf.PdfWriter;

public class LambdaCharacter {

    public static final String DEST = "results/fonts/lambda_character.pdf";
    public static final String FONT = "c:/windows/fonts/arial.ttf";
    public static final String TEXT = "Always use the Unicode notation for special characters: \u03bb";

    public static void main(String[] args) throws IOException, DocumentException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new LambdaCharacter().createPdf(DEST);
    }

    public void createPdf(String dest) throws IOException, DocumentException {
        Document document = new Document();
        PdfWriter.getInstance(document, new FileOutputStream(dest));
        document.open();
        BaseFont bf = BaseFont.createFont(FONT, BaseFont.IDENTITY_H, BaseFont.EMBEDDED);
        Font f = new Font(bf, 12);
        Paragraph p = new Paragraph(TEXT, f);
        document.add(p);
        document.close();
    }
}
It works just fine.
Maybe you aren't really using Arial. Maybe font.getFont().getName() doesn't give you the correct name of the font. Or maybe it gives you the correct name of the font, but you forgot to register the font. In that case, you will see that Helvetica is used. Helvetica can't render a lambda. You need Arial or Cardo-Regular or Arial Unicode or another font, as long as that font knows how to render a lambda.
If you don't know how to register a font, read:
How to load custom font in FontFactory.register in iText or
Creating fonts from *.ttf files using iText or
Using Fonts in System with iTextSharp or
Get list of supported fonts in ITextSharp or
Why is my font not applied when I create a PDF document? or... (there are just too many hits when I search for an answer to that question)
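For what it's worth, registering and using a TTF with FontFactory can be as small as this sketch (the Arial path and the alias are just examples; adjust them for your system):
// Register the TTF under an alias, then ask FontFactory for an
// embedded, Unicode-aware (IDENTITY_H) instance of it
FontFactory.register("c:/windows/fonts/arial.ttf", "my-arial");
Font f = FontFactory.getFont("my-arial", BaseFont.IDENTITY_H, BaseFont.EMBEDDED, 12);
Phrase p = new Phrase("\u03bb", f);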

Extracting an embedded object from a PDF

I embedded a byte array into a PDF file (Java).
Now I am trying to extract that same array.
The array was embedded as a "MOVIE" file.
I couldn't find any clues on how to do that...
Any ideas?
Thanks!
EDIT
I used this code to embed the byte array:
public static void pack(byte[] file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(RESULT));
    writer.setPdfVersion(PdfWriter.PDF_VERSION_1_7);
    writer.addDeveloperExtension(PdfDeveloperExtension.ADOBE_1_7_EXTENSIONLEVEL3);
    document.open();
    RichMediaAnnotation richMedia = new RichMediaAnnotation(writer, new Rectangle(0, 0, 0, 0));
    PdfFileSpecification fs
        = PdfFileSpecification.fileEmbedded(writer, null, "test.avi", file);
    PdfIndirectReference asset = richMedia.addAsset("test.avi", fs);
    RichMediaConfiguration configuration = new RichMediaConfiguration(PdfName.MOVIE);
    RichMediaInstance instance = new RichMediaInstance(PdfName.MOVIE);
    RichMediaParams flashVars = new RichMediaParams();
    instance.setAsset(asset);
    configuration.addInstance(instance);
    RichMediaActivation activation = new RichMediaActivation();
    richMedia.setActivation(activation);
    PdfAnnotation richMediaAnnotation = richMedia.createAnnotation();
    richMediaAnnotation.setFlags(PdfAnnotation.FLAGS_PRINT);
    writer.addAnnotation(richMediaAnnotation);
    document.close();
}
I have written a brute force method to extract all streams in a PDF and store them as a file without an extension:
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import com.itextpdf.text.exceptions.UnsupportedPdfException;
import com.itextpdf.text.pdf.PRStream;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfReader;

public class ExtractStreams {

    public static final String SRC = "resources/pdfs/image.pdf";
    public static final String DEST = "results/parse/stream%s";

    public static void main(String[] args) throws IOException {
        File file = new File(DEST);
        file.getParentFile().mkdirs();
        new ExtractStreams().parse(SRC, DEST);
    }

    public void parse(String src, String dest) throws IOException {
        PdfReader reader = new PdfReader(src);
        PdfObject obj;
        for (int i = 1; i <= reader.getXrefSize(); i++) {
            obj = reader.getPdfObject(i);
            if (obj != null && obj.isStream()) {
                PRStream stream = (PRStream) obj;
                byte[] b;
                try {
                    // decode the stream (e.g. /FlateDecode) where iText can
                    b = PdfReader.getStreamBytes(stream);
                }
                catch (UnsupportedPdfException e) {
                    // fall back to the raw bytes for unsupported filters
                    b = PdfReader.getStreamBytesRaw(stream);
                }
                FileOutputStream fos = new FileOutputStream(String.format(dest, i));
                fos.write(b);
                fos.flush();
                fos.close();
            }
        }
    }
}
Note that I get all PDF objects that are streams as a PRStream object. I also use two different methods:
When I use PdfReader.getStreamBytes(stream), iText looks at the filter. For instance: page content streams consist of PDF syntax that is compressed using /FlateDecode. By using PdfReader.getStreamBytes(stream), you get the uncompressed PDF syntax.
Not all filters are supported in iText. Take for instance /DCTDecode, which is the filter used to store JPEGs inside a PDF. Why and how would you "decode" such a stream? You wouldn't, and that's when we use PdfReader.getStreamBytesRaw(stream), which is also the method you need to get your AVI bytes from your PDF.
This example already gives you the methods you'll certainly need to extract PDF streams. Now it's up to you to find the path to the stream you need. That calls for iText RUPS. With iText RUPS you can look at the internal structure of a PDF file. In your case, you need to find the annotations as is done in this question: All links of existing pdf change the action property to inherit zoom - iText library
You loop over the page dictionaries, then over the /Annots array of each page dictionary (if it's present), but instead of checking for /Link annotations (which is what was asked in the question I refer to), you have to check for /RichMedia annotations and from there examine the assets until you find the stream that contains the AVI file. RUPS will show you how to dive into the annotation dictionary.
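Here is a rough sketch of that traversal, assuming the /RichMedia structure produced by the embedding code above (iText 5; the input path is hypothetical, and you should verify the exact tree with RUPS first):
// Walk each page's /Annots array, pick out /RichMedia annotations, and
// dump the raw bytes of every asset in their /Assets name tree
PdfReader reader = new PdfReader("rich_media.pdf");
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    PdfArray annots = reader.getPageN(i).getAsArray(PdfName.ANNOTS);
    if (annots == null) continue;
    for (int j = 0; j < annots.size(); j++) {
        PdfDictionary annot = annots.getAsDict(j);
        if (!PdfName.RICHMEDIA.equals(annot.getAsName(PdfName.SUBTYPE))) continue;
        PdfDictionary content = annot.getAsDict(PdfName.RICHMEDIACONTENT);
        // /Assets is a name tree: /Names holds [name, file spec, ...] pairs
        PdfArray names = content.getAsDict(PdfName.ASSETS).getAsArray(PdfName.NAMES);
        for (int k = 0; k < names.size(); k += 2) {
            PdfDictionary fileSpec = names.getAsDict(k + 1);
            PRStream stream = fileSpec.getAsDict(PdfName.EF).getAsStream(PdfName.F);
            byte[] avi = PdfReader.getStreamBytesRaw(stream);
            // write 'avi' to disk as needed
        }
    }
}
reader.close();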

Create a New Page During Text to PDF Conversion Using iText

I am converting a text file to PDF using iText. The conversion works fine, but I need a new PDF page to be started whenever the BufferedReader encounters a certain text. This is what I have tried, but a new page is not started when that text is encountered. My sample code is below (just the relevant part).
Document output = new Document(PageSize.B3);
FileInputStream fs = new FileInputStream("C:/ABC Statements final/File.TXT");
FileOutputStream file = new FileOutputStream(new File("C:/Pdf Statements/File.PDF"));
BufferedReader br = new BufferedReader(new InputStreamReader(fs));
PdfWriter writer = PdfWriter.getInstance(output, file);
output.open();
writer.open();
.............................
String pageend = "Page Total";
String trimmedend = br.readLine().trim();
if (trimmedend.startsWith(pageend)) {
    output.newPage();
}
Maybe you need to change your if-statement to something like this:
String pageend = "page total";
...
if (trimmedend.toLowerCase().contains(pageend)) {
...
}
This way, you avoid case-sensitivity and the problem of having characters that aren't considered white space before "page total". Of course, this is just an educated guess: I don't know what your original data stream looks like.
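For context, a complete read loop along those lines could look like this sketch (using the file names from your question; error handling omitted):
// Read the text file line by line, add each line to the PDF, and start
// a new page after any line that mentions "page total"
Document output = new Document(PageSize.B3);
PdfWriter.getInstance(output, new FileOutputStream("C:/Pdf Statements/File.PDF"));
output.open();
BufferedReader br = new BufferedReader(new FileReader("C:/ABC Statements final/File.TXT"));
String pageend = "page total";
String line;
while ((line = br.readLine()) != null) {
    output.add(new Paragraph(line));
    if (line.trim().toLowerCase().contains(pageend)) {
        output.newPage();
    }
}
br.close();
output.close();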

Displaying JTidy error/warning messages in a GUI JTextArea

I am writing a program that uses JTidy to clean up HTML source obtained from a URL. I want to display the errors and warnings in a GUI, in a JTextArea. How would I "reroute" the warnings from printing to stdout to the JTextArea? I've looked over the JTidy API and don't see anything that does what I want. Does anyone know how I can do this, or if it's even possible?
// testing JTidy options
public void test(String U) throws MalformedURLException, IOException
{
    Tidy tidy = new Tidy();
    InputStream URLInputStream = new URL(U).openStream();
    File file = new File("test.html");
    FileOutputStream fop = new FileOutputStream(file);
    tidy.setShowWarnings(true);
    tidy.setShowErrors(0);
    tidy.setSmartIndent(true);
    tidy.setMakeClean(true);
    tidy.setXHTML(true);
    Document doc = tidy.parseDOM(URLInputStream, fop);
}
Assuming JTidy prints errors and warnings to stdout, you can just temporarily change where System.out calls go:
PrintStream originalOut = System.out;
ByteArrayOutputStream baos = new ByteArrayOutputStream();
PrintStream myOutputStream = new PrintStream(baos);
System.setOut(myOutputStream);
// your JTidy code here
String capturedOutput = new String(baos.toByteArray(), StandardCharsets.UTF_8);
System.setOut(originalOut);
// Send capturedOutput to a JTextArea
myTextArea.append(capturedOutput);
There is an analogous method if you need to do this for System.err instead/as well.
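The System.err variant is the same pattern with System.setErr, reusing the same baos:
PrintStream originalErr = System.err;
System.setErr(new PrintStream(baos));
// ... your JTidy code here ...
System.setErr(originalErr);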

Error in converting Word document to PDF using iText

Below is the code that I used to convert a Word document to PDF. After compiling the code, the PDF file is generated, but the file contains some junk characters along with the Word document's content. Please help me understand what modification I should make to get rid of the junk characters.
The code I used is:
import com.lowagie.text.Document;
import com.lowagie.text.Paragraph;
import com.lowagie.text.pdf.PdfWriter;
import java.io.File;
import java.io.FileOutputStream;

public class PdfConverter
{
    private void createPdf(String inputFile, String outputFile)//, boolean isPictureFile)
    {
        Document pdfDocument = new Document();
        String pdfFilePath = outputFile;
        try
        {
            FileOutputStream fileOutputStream = new FileOutputStream(pdfFilePath);
            PdfWriter writer = null;
            writer = PdfWriter.getInstance(pdfDocument, fileOutputStream);
            writer.open();
            pdfDocument.open();
            /*if (isPictureFile)
            {
                pdfDocument.add(com.lowagie.text.Image.getInstance(inputFile));
            }
            else
            { */
            File file = new File(inputFile);
            pdfDocument.add(new Paragraph(org.apache.commons.io.FileUtils.readFileToString(file)));
            //}
            pdfDocument.close();
            writer.close();
            System.out.println("PDF has been generated");
        }
        catch (Exception exception)
        {
            System.out.println("Document Exception!" + exception);
        }
    }

    public static void main(String args[])
    {
        PdfConverter pdfConversion = new PdfConverter();
        pdfConversion.createPdf("C:/test.doc", "C:/test.pdf");//, true);
    }
}
Thanks for your help.
Just because you name your class PdfConverter doesn't mean you have one. All you do is read the binary content as a String and write it out as one paragraph (and that's what you see). This approach will definitely not be successful. See https://stackoverflow.com/questions/437394 for a similar question.
If you are interested just in the content of your Word document, you might want to give Apache POI - the Java API for Microsoft Documents - a try, to read your document not at the binary level but at a higher abstraction level. If your Word document has a simple (and I mean a really simple) structure, you might get reasonable results.
To do this, you will have to read the .doc file correctly and then use the extracted data to create the PDF file.
What you are doing right now is reading data from the .doc file, which yields garbage since you are not using a proper API to read it, and then storing that garbage in the PDF file. Hence the issue.
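A minimal sketch of the POI route for a simple .doc file, using POI's HWPFDocument and WordExtractor (from org.apache.poi.hwpf) together with iText; paths are taken from the question, and this assumes the old binary .doc format, not .docx:
// Extract plain text from the .doc with Apache POI's HWPF module,
// then write that text into a PDF paragraph with iText
FileInputStream fis = new FileInputStream("C:/test.doc");
WordExtractor extractor = new WordExtractor(new HWPFDocument(fis));
String text = extractor.getText();
fis.close();

Document pdfDocument = new Document();
PdfWriter.getInstance(pdfDocument, new FileOutputStream("C:/test.pdf"));
pdfDocument.open();
pdfDocument.add(new Paragraph(text));
pdfDocument.close();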
