Java get plain Text from RTF

My database has a column that holds text in RTF format.
How can I extract only the plain text from it using Java?

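This should work, using the RTFEditorKit that ships with Swing (javax.swing.text.rtf):
RTFEditorKit rtfParser = new RTFEditorKit();
Document document = rtfParser.createDefaultDocument();
// rtfBytes holds the RTF content read from the database column
rtfParser.read(new ByteArrayInputStream(rtfBytes), document, 0);
String text = document.getText(0, document.getLength());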
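The RTFEditorKit approach can be sketched as a small self-contained program. The class name RtfToText, the helper rtfToPlainText and the sample RTF string are made up for illustration. The sketch also assumes the column value is available as a String, which the question does not say.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

class RtfToText {

    // Hypothetical helper: parses RTF markup and returns only its text content.
    static String rtfToPlainText(String rtf) throws IOException, BadLocationException {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        // RTF is a 7-bit format, so a single-byte charset is enough to get the bytes.
        byte[] bytes = rtf.getBytes(StandardCharsets.ISO_8859_1);
        kit.read(new ByteArrayInputStream(bytes), doc, 0);
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws Exception {
        // Sample input invented for this sketch.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
                + "\\f0\\pard Hello, {\\b world}!\\par}";
        System.out.println(rtfToPlainText(rtf).trim());
    }
}
```

The parser may leave a trailing line break in the result, so trimming the returned string before storing or comparing it is probably a good idea.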

If you can, try AdvancedRTFEditorKit; it might suit you. See http://java-sl.com/advanced_rtf_editor_kit.html
I have used it to build a complete RTF editor with all the features MS Word supports.

Apache POI reads Microsoft Word formats. Note that HWPFDocument expects the binary .doc format; as far as I know it rejects input that is really RTF, so this approach only helps if the data is actually a Word document.
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
public String getRtfText(String fileName) throws IOException {
    // try-with-resources closes the file stream even if parsing fails
    try (FileInputStream inStream = new FileInputStream(fileName)) {
        // HWPFDocument parses the Word document from the stream
        HWPFDocument doc = new HWPFDocument(inStream);
        WordExtractor extractor = new WordExtractor(doc);
        // getParagraphText() returns one array entry per paragraph
        StringBuilder text = new StringBuilder();
        for (String paragraph : extractor.getParagraphText()) {
            text.append(paragraph);
        }
        return text.toString();
    }
}

This works if the RTF text is in a JEditorPane:
String s = getPlainText(aJEditorPane.getDocument());
String getPlainText(Document doc) {
    try {
        return doc.getText(0, doc.getLength());
    } catch (BadLocationException ex) {
        System.err.println(ex);
        return null;
    }
}

Related

Unable to read unicode character in pdf using java

I am trying to convert a PDF document that contains Tamil Unicode characters into a Word document while retaining all the formatting. I cannot read the Unicode characters from the PDF; they appear as junk characters in Word. I am using the code below. Can someone please help?
public static void main(String[] args) throws IOException {
    System.out.println("Document converted started");
    XWPFDocument doc = new XWPFDocument();
    String pdf = "D:\\sample1.pdf";
    PdfReader reader = new PdfReader(pdf);
    // InputStreamReader isr = new InputStreamReader(reader,"UTF8");
    PdfReaderContentParser parser = new PdfReaderContentParser(reader);
    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        TextExtractionStrategy strategy = parser.processContent(i,
                new SimpleTextExtractionStrategy());
        System.out.println(strategy.getResultantText());
        String text = strategy.getResultantText();
        XWPFParagraph p = doc.createParagraph();
        XWPFRun run = p.createRun();
        // run.setFontFamily(new Font("Arial"));
        run.setFontSize(14);
        run.setText(text);
        // run.addBreak(BreakType.PAGE);
    }
    FileOutputStream out = new FileOutputStream("D:\\tamildoc.docx");
    doc.write(out);
    out.close();
    reader.close();
    System.out.println("Document converted successfully");
}

You can use the Apache PDFBox library (https://pdfbox.apache.org/download.cgi). Calling getText(PDDocument doc) on a PDFTextStripper returns a single String with the text content of the PDF file.
Here is an example:
UploadedFile file = new UploadedFile(fileName);
// any InputStream over the PDF works here
InputStream is = file.getInputStream();
// PDDocument.load is the PDFBox 2.x API; PDFBox 3.x uses Loader.loadPDF instead
PDDocument doc = PDDocument.load(is);
String content = new PDFTextStripper().getText(doc);
doc.close();
After that, you can write the text to your file.

How to extract urls from an html file stored on my computer using java?

I need to find all the URLs in an HTML file stored on my own computer, extract the links, and store them in a variable. I'm using the code below to scan the file and read its contents, but I'm having a hard time extracting just the links. I would appreciate it if someone could help me out.
Scanner htmlScanner = new Scanner(new File(args[0]));
PrintWriter output = new PrintWriter(new FileWriter(args[1]));
while (htmlScanner.hasNext()) {
    output.print(htmlScanner.next());
}
System.out.println("\nDone");
htmlScanner.close();
output.close();

You can actually do this with the Swing HTML parser. Though the Swing parser only understands HTML 3.2, tags introduced in later HTML versions will simply be treated as unknown, and all you actually want are links anyway.
static Collection<String> getLinks(Path file)
throws IOException,
       MimeTypeParseException,
       BadLocationException {
    HTMLEditorKit htmlKit = new HTMLEditorKit();
    HTMLDocument htmlDoc;
    try {
        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        try (Reader reader =
                Files.newBufferedReader(file, StandardCharsets.ISO_8859_1)) {
            htmlKit.read(reader, htmlDoc, 0);
        }
    } catch (ChangedCharSetException e) {
        // The document declares its own charset: re-read it with that charset
        MimeType mimeType = new MimeType(e.getCharSetSpec());
        String charset = mimeType.getParameter("charset");
        htmlDoc = (HTMLDocument) htmlKit.createDefaultDocument();
        htmlDoc.putProperty("IgnoreCharsetDirective", true);
        try (Reader reader =
                Files.newBufferedReader(file, Charset.forName(charset))) {
            htmlKit.read(reader, htmlDoc, 0);
        }
    }
    Collection<String> links = new ArrayList<>();
    for (HTML.Tag tag : Arrays.asList(HTML.Tag.LINK, HTML.Tag.A)) {
        HTMLDocument.Iterator it = htmlDoc.getIterator(tag);
        while (it.isValid()) {
            String link = (String)
                it.getAttributes().getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                links.add(link);
            }
            it.next();
        }
    }
    return links;
}
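If building a full HTMLDocument is more than you need, the same Swing parser can also be driven directly. This is a different technique from the iterator approach above: a rough sketch that uses ParserDelegator with a ParserCallback and collects href values as tags are parsed. The names LinkExtractor and extractLinks and the sample HTML are invented for illustration, and charset detection is left out entirely.

```java
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

class LinkExtractor {

    // Hypothetical helper: returns the href of every <a> and <link> tag, in document order.
    static List<String> extractLinks(Reader html) throws IOException {
        List<String> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                collect(tag, attrs);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Empty elements such as <link> are reported here rather than as start tags.
                collect(tag, attrs);
            }

            private void collect(HTML.Tag tag, MutableAttributeSet attrs) {
                if (tag == HTML.Tag.A || tag == HTML.Tag.LINK) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };
        // true = ignore charset directives, since the caller already supplies a Reader
        new ParserDelegator().parse(html, callback, true);
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup invented for this sketch.
        String html = "<html><head><link rel=\"stylesheet\" href=\"style.css\"></head>"
                + "<body><a href=\"http://example.com/a\">A</a>"
                + "<a name=\"x\">no href</a><a href=\"b.html\">B</a></body></html>";
        for (String link : extractLinks(new StringReader(html))) {
            System.out.println(link);
        }
    }
}
```

For a file on disk, a Reader from Files.newBufferedReader can be passed instead of the StringReader.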

Displaying contents of doc file in jTextPane

I'm trying to display the contents of a .doc file in a jTextPane, but it shows only the last line of the document, while the console shows the whole document.
I'm using the Apache POI library.
File file = null;
WordExtractor extractor = null ;
try {
    file = new File("C:\\Users\\Siddique Ansari\\Documents\\CV Parser\\Siddique_Resume.doc");
    FileInputStream fis=new FileInputStream(file.getAbsolutePath());
    HWPFDocument document=new HWPFDocument(fis);
    extractor = new WordExtractor(document);
    String [] fileData = extractor.getParagraphText();
    for(int i=0;i<fileData.length;i++){
        System.out.println(fileData[i]);
        jTextPane1.setText(fileData[i]);
    }
}
catch(Exception exep){}

jTextPane1.setText(fileData[i]); replaces the pane's entire text on every iteration, so only the last paragraph remains.
Instead, append to the underlying document:
Document doc = jTextPane1.getDocument();
// ... in your loop:
doc.insertString(doc.getLength(), fileData[i], null);
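The append pattern can be tried without any GUI or Word file. The sketch below uses a DefaultStyledDocument, the document type a JTextPane uses by default, and inserts each paragraph at the current end of the document. AppendDemo, appendAll and the sample paragraphs are hypothetical, and the explicit "\n" is an assumption: it is only needed if the paragraph strings do not already end with a line break.

```java
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;
import javax.swing.text.Document;

class AppendDemo {

    // Hypothetical helper: appends every paragraph to the end of the document.
    static String appendAll(Document doc, String[] paragraphs) throws BadLocationException {
        for (String paragraph : paragraphs) {
            // doc.getLength() is the current end, so each insert lands after the previous one
            doc.insertString(doc.getLength(), paragraph + "\n", null);
        }
        return doc.getText(0, doc.getLength());
    }

    public static void main(String[] args) throws BadLocationException {
        Document doc = new DefaultStyledDocument();
        // Stand-in for the array returned by extractor.getParagraphText()
        String[] fileData = {"First paragraph", "Second paragraph"};
        System.out.print(appendAll(doc, fileData));
    }
}
```

In the real program, jTextPane1.getDocument() would take the place of the standalone document.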

Instead of:
for(int i=0;i<fileData.length;i++){
    System.out.println(fileData[i]);
    jTextPane1.setText(fileData[i]);
}
try collecting the text first and setting it once after the loop:
StringBuilder content = new StringBuilder();
for (int i = 0; i < fileData.length; i++) {
    System.out.println(fileData[i]);
    content.append(fileData[i]).append("\n");
}
jTextPane1.setText(content.toString());
Also,
catch(Exception exep){}
is never a good idea. At least write:
catch(Exception exep) { exep.printStackTrace(); }
so you know what's going on when an exception occurs.

Updating an MSWord document with Apache POI

I'm trying to update a Microsoft Word document using Apache POI. The msword document is a template that contains a number of placeholders in the form "${place.holder}" and all I need to do is to replace the holders with specific values. What I've got so far is
private void start() throws FileNotFoundException, IOException {
    POIFSFileSystem fsfilesystem = null;
    HWPFDocument hwpfdoc = null;
    InputStream resourceAsStream = getClass().getResourceAsStream("/path/to/document/templates/RMA FORM.doc");
    try {
        fsfilesystem = new POIFSFileSystem(resourceAsStream );
        hwpfdoc = new HWPFDocument(fsfilesystem);
        Range range = hwpfdoc.getRange();
        range.replaceText("${rma.number}","08739");
        range.replaceText("${customer.name}", "Roger Swann");
        FileOutputStream fos = new FileOutputStream(new File("C:\\temp\\updatedTemplate.doc"));
        hwpfdoc.write(fos);
        fos.flush();
        fos.close();
    } catch (FileNotFoundException e) {
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    }
}
The program runs without errors. If I look in the output file with a Hex editor I can see that the placeholders have been replaced by the program. However, when I try to open the document with MSWord, MSWord crashes.
Is there a step (series of steps) that I'm missing, or am I basically out of luck with this? Do I need to adjust any counters because the length of the replacement text is not the same as the length of the replaced text?
Regards

Use new FileInputStream() instead of getClass().getResourceAsStream("/path/to/document/templates/RMA FORM.doc").

How to extract paragraphs instead of whole texts only for XWPFWordExtractor (POI Library) Java

I know the following code extracts the whole text of a docx document, but I need to extract it paragraph by paragraph instead. Is there a way to do that?
public static String extractText(InputStream in) throws Exception {
    JOptionPane.showMessageDialog(null, "Start extracting docx");
    XWPFDocument doc = new XWPFDocument(in);
    XWPFWordExtractor ex = new XWPFWordExtractor(doc);
    String text = ex.getText();
    return text;
}
Any help would be much appreciated. I need this urgently.

This is just a guess after a brief look at the API:
doc.getParagraphs()
Link to the API: http://poi.apache.org/apidocs/org/apache/poi/xwpf/usermodel/XWPFDocument.html#getParagraphs()

I wrote a utility method for this:
public static List<String> getParagraphs(File file)
{
    List<String> paragraphs = new ArrayList<>();
    // try-with-resources closes the file stream even if parsing fails
    try (FileInputStream fis = new FileInputStream(file))
    {
        XWPFDocument xdoc = new XWPFDocument(OPCPackage.open(fis));
        List<XWPFParagraph> paragraphList = xdoc.getParagraphs();
        for (XWPFParagraph paragraph : paragraphList)
        {
            paragraphs.add(paragraph.getText());
        }
    }
    catch (Exception ex)
    {
        ex.printStackTrace();
    }
    return paragraphs;
}

Although the question is very old, I am answering in the hope that it helps anyone who lands here looking for an answer.
XWPFDocument document = new XWPFDocument(fis);
List<XWPFParagraph> paragraphs = document.getParagraphs();
for (XWPFParagraph paragraph : paragraphs) {
    System.out.println("Text in this paragraph: " + paragraph.getText());
}
System.out.println("Total no of paragraph in Docx : "+paragraphs.size());
Hope this helps!
