Splitting paragraphs that end with "." and a newline after the dot in Java - java

I am trying to read text from a PDF file, split it into paragraphs, put them into an ArrayList, and print the list's elements, but I get no output.
String path = "E:\\test.pdf";
PDFTextStripper pdfStripper = null;
PDDocument pdDoc = null;
COSDocument cosDoc = null;
File file = new File(path);
PDFParser parser = new PDFParser(new FileInputStream(file));
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
pdfStripper.setStartPage(1);
pdfStripper.setEndPage(1);
String page = pdfStripper.getText(pdDoc);
String[] paragraph = page.split("\n");
ArrayList<String> ramy = new ArrayList<>();
String p = "";
for (String x : paragraph) {
    if ((x.endsWith("\\.")) || (x.endsWith("\\." + "\\s+"))) {
        p += x;
        ramy.add(p);
        p = "";
    } else {
        p += x;
    }
}
for (String x : ramy) {
    System.out.print(x + "\n\n");
}
Note: I am using NetBeans 8.0.2, Windows 8.1, and the PDFBox library to read from the PDF file.

The most crippling bug you have is that you are calling endsWith() with "\\.", which is two characters: a literal backslash and a literal dot (not an escaped dot), and again with "\\.\\s+" (again, all literal characters). It's clear you (incorrectly) believed that the method accepts a regex, which it doesn't.
Assuming your logic is sound, change your test to use a regex-based test:
if (x.matches(".*\\.\\s*"))
This single test combines the intent of both of your original checks.
Note that you don't need to end the regex with $, because matches() must match the whole string to return true, so ^ and $ are implied at the start/end of the pattern.
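For reference, a minimal sketch of the accumulation loop with the regex-based test in place (assuming the rest of your logic is what you intended):
List<String> ramy = new ArrayList<>();
StringBuilder p = new StringBuilder();
for (String x : page.split("\n")) {
    p.append(x);
    // matches() anchors at both ends, so this reads: the line ends with "." plus optional trailing whitespace
    if (x.matches(".*\\.\\s*")) {
        ramy.add(p.toString());
        p.setLength(0);
    }
}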


Replace string with Unicode text in a PDF file using PDFBox?

I need to read the strings from a PDF file and replace them with Unicode text. If it is ASCII characters, everything is fine, but with Unicode characters it shows question marks/junk text. There is no problem with the font file (ttf): I am able to write Unicode text to the PDF file with a different class (PDPageContentStream), but with that class there is no option to replace text; we can only add new text.
Sample Unicode text:
Bɐɑɒ
issue (Address column)
https://drive.google.com/file/d/1DbsApTCSfTwwK3txsDGW8sXtDG_u-VJv/view?usp=sharing
I am using PDFBox.
Please help me with this. Here is the code I am using:
public static PDDocument _ReplaceText(PDDocument document, String searchString, String replacement)
        throws IOException {
    if (StringUtils.isEmpty(searchString) || StringUtils.isEmpty(replacement)) {
        return document;
    }
    for (PDPage page : document.getPages()) {
        PDResources resources = new PDResources();
        PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
        //PDFont font2 = PDType0Font.load(document, new File("avenir-next-regular.ttf"));
        resources.add(font);
        //resources.add(font2);
        //resources.add(PDType1Font.TIMES_ROMAN);
        page.setResources(resources);
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        List tokens = parser.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            Object next = tokens.get(j);
            if (next instanceof Operator) {
                Operator op = (Operator) next;
                String pstring = "";
                int prej = 0;
                // Tj and TJ are the two operators that display strings in a PDF
                if (op.getName().equals("Tj")) {
                    // Tj takes one operand, the string to display, so update that operand
                    COSString previous = (COSString) tokens.get(j - 1);
                    String string = previous.getString();
                    string = string.replaceFirst(searchString, replacement);
                    previous.setValue(string.getBytes());
                } else if (op.getName().equals("TJ")) {
                    COSArray previous = (COSArray) tokens.get(j - 1);
                    for (int k = 0; k < previous.size(); k++) {
                        Object arrElement = previous.getObject(k);
                        if (arrElement instanceof COSString) {
                            COSString cosString = (COSString) arrElement;
                            String string = cosString.getString();
                            if (j == prej) {
                                pstring += string;
                            } else {
                                prej = j;
                                pstring = string;
                            }
                        }
                    }
                    if (searchString.equals(pstring.trim())) {
                        COSString cosString2 = (COSString) previous.getObject(0);
                        cosString2.setValue(replacement.getBytes());
                        int total = previous.size() - 1;
                        for (int k = total; k > 0; k--) {
                            previous.remove(k);
                        }
                    }
                }
            }
        }
        // now that the tokens are updated, replace the page content stream
        PDStream updatedStream = new PDStream(document);
        OutputStream out = updatedStream.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter tokenWriter = new ContentStreamWriter(out);
        tokenWriter.writeTokens(tokens);
        out.close();
        page.setContents(updatedStream);
    }
    return document;
}
Your code utterly breaks the PDF, as the Adobe Preflight output shows.
The cause is obvious: your code
PDResources resources = new PDResources();
PDFont font = PDType0Font.load(document, new File("arial-unicode-ms.ttf"));
resources.add(font);
page.setResources(resources);
drops the pre-existing page Resources, and your replacement contains only a single font whose name you let PDFBox choose arbitrarily.
You must not drop existing resources as they are used in your document.
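A minimal sketch of the safe variant (PDFBox 2.x), extending the page's existing resources instead of replacing them:
PDResources existing = page.getResources();
// add() picks a name that is not yet in use and returns it
COSName newFontName = existing.add(PDType0Font.load(document, new File("arial-unicode-ms.ttf")));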
Inspecting the content of your PDF page, it becomes obvious that the encoding of the originally used fonts T1_0 and T1_1 is either a single-byte encoding or a mixed single/multi-byte encoding; the lower single-byte values appear to be encoded ASCII-like.
I would assume that the encoding is WinAnsiEncoding or a subset thereof. As a corollary your task
to read the strings from PDF file and replace it with the Unicode text
cannot be implemented as a simple replacement, at least not with arbitrary Unicode code points in mind.
What you can implement instead is:
First run your source PDF through a customized text stripper which, instead of extracting the plain text, searches for your strings to replace and returns their positions. There are numerous questions and answers here that show how to determine coordinates of strings in PDFTextStripper subclasses.
Next remove those original strings from your PDF. In your case an approach similar to your original code above (without dropping the resources, obviously), replacing the strings with equally long strings of spaces, might work, even if it is a dirty hack.
Finally add your replacements at the determined positions using a PDPageContentStream in append mode; for this, add your new font to the existing resources, as sketched below.
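A hedged sketch of this last step (PDFBox 2.x; x, y, and fontSize stand for the values recovered by the text stripper in the first step, and font is the PDType0Font loaded as shown earlier):
page.getResources().add(font);   // extend the existing resources, don't replace them
try (PDPageContentStream cs = new PDPageContentStream(document, page,
        PDPageContentStream.AppendMode.APPEND, true, true)) {
    cs.beginText();
    cs.setFont(font, fontSize);    // fontSize: assumption, taken from step 1
    cs.newLineAtOffset(x, y);      // x, y: assumption, taken from step 1
    cs.showText(replacement);
    cs.endText();
}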
Please be aware, though, that PDF is not designed to be used like this. Template PDFs can be used as background for new content, but attempting to replace content therein usually is a bad design leading to trouble. If you need to mark positions in the template, use annotations which can easily be dropped during fill-in. Or use AcroForm forms, the native PDF form technology, to start with.

Text associated to PDF paragraph in document content object with PDFBox

I'm trying to get the text associated with a paragraph by navigating through the content tree of a PDF file. I am using PDFBox and cannot find the link between the paragraph and the text it contains (see code below):
public class ReadPdf {
    public static void main(String[] args) throws IOException {
        MyBufferedWriter out = new MyBufferedWriter(new FileWriter(new File(
                "C:/Users/wip.txt")));
        RandomAccessFile raf = new RandomAccessFile(new File(
                "C:/Users/mypdf.pdf"), "r");
        PDFParser parser = new PDFParser(raf);
        parser.parse();
        COSDocument cosDoc = parser.getDocument();
        out.write(cosDoc.getXrefTable().toString());
        out.write(cosDoc.getObjects().toString());
        PDDocument document = parser.getPDDocument();
        PDStructureTreeRoot treeRoot = document.getDocumentCatalog().getStructureTreeRoot();
        for (Object kid : treeRoot.getKids()) {
            for (Object kid2 : ((PDStructureElement) kid).getKids()) {
                PDStructureElement kid2c = (PDStructureElement) kid2;
                if ("P".equals(kid2c.getStandardStructureType())) {
                    for (Object kid3 : kid2c.getKids()) {
                        if (kid3 instanceof PDStructureElement) {
                            PDStructureElement kid3c = (PDStructureElement) kid3;
                        } else {
                            // Print all the keys in the paragraph COSDictionary
                            for (Entry<COSName, COSBase> entry : kid2c.getCOSObject().entrySet()) {
                                System.out.println(entry.getKey().toString());
                                System.out.println(entry.getValue().toString());
                            }
                        }
                    }
                }
            }
        }
        out.close();
    }
}
When I print the contents I get the following Keys:
/P : Reference to Parent
/A : Format of the paragraph
/K : Position of the paragraph in the section
/C : Name of the paragraph (!= text)
/Pg : Reference to the page
Example output:
COSName{K}
COSInt{2}
COSName{Pg}
COSObject{12, 0}
COSName{C}
COSName{Normal}
COSName{A}
COSObject{434, 0}
COSName{S}
COSName{Normal}
COSName{P}
COSObject{421, 0}
Now none of these points to the actual text inside the paragraph.
I know that the relation can be obtained, as Acrobat resolves it when I open the document there.
I found a way to do this through the parsing of the Content Stream from a page.
Navigating through the PDF specification, chapter 10.6.3, there is a link between the numbering of each text stream, which comes under /P /MCID, and an attribute of the tag (PDStructureElement in PDFBox) which can be found in the COSObject.
1) To get the text and the MCID:
PDPage pdPage;
Iterator<PDStream> inputStream = pdPage.getContentStreams();
String tokenString = "";
while (inputStream.hasNext()) {
    try {
        PDFStreamParser parser2 = new PDFStreamParser(inputStream.next());
        parser2.parse();
        List<Object> tokens = parser2.getTokens();
        for (int j = 0; j < tokens.size(); j++) {
            tokenString = tokenString + tokens.get(j).toString();
        }
        // here comes the parsing of the string. Chapter 5 of the specification
        // describes the operators: Tj (actual text), Tm, BDC (carries the MCID), BT, ET, EMC
    } catch (IOException e) {
        e.printStackTrace();
    }
}
Then, to get the tag and its attribute that matches the MCID:
PDStructureElement pDStructureElement;
int mcid = pDStructureElement.getCOSObject().getInt(COSName.K);
That should do it. In documents without tags (where document.getDocumentCatalog().getStructureTreeRoot() has no children) this match cannot be performed, but the text can still be read using step 1.
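For illustration, a hedged sketch of pulling the MCID out of the token list from step 1 (PDFBox 2.x; it assumes, per the content stream syntax, that the BDC operator's properties dictionary immediately precedes the operator in the token list):
for (int j = 0; j < tokens.size(); j++) {
    Object t = tokens.get(j);
    if (t instanceof Operator && "BDC".equals(((Operator) t).getName())) {
        Object props = tokens.get(j - 1);   // BDC operands: tag name, properties dictionary
        if (props instanceof COSDictionary) {
            int mcid = ((COSDictionary) props).getInt(COSName.MCID, -1);
            // text shown by Tj/TJ until the matching EMC belongs to this MCID,
            // and can be matched against the structure element's /K value
        }
    }
}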

Java, Lucene: Search with numbers as String not working

I am working on integrating Lucene into our Spring-MVC based project, and it currently works well, except for searches that contain numbers.
Whenever I search for something like 123Ab or 123, or anything with numbers in it, I get no search results.
As soon as I remove the numbers, though, it works fine.
Any suggestions? Thank you.
Code:
public List<Integer> searchLucene(String text, long groupId, boolean type) {
    List<Integer> objectIds = new ArrayList<>();
    if (text != null) {
        // Escape Lucene query syntax by hand; QueryParser.escape(text) does the same in one call
        //String specialChars = "+ - && || ! ( ) { } [ ] ^ \" ~ * ? : \\ /";
        text = text.replace("+", "\\+");
        text = text.replace("-", "\\-");
        text = text.replace("&&", "\\&&");
        text = text.replace("||", "\\||");
        text = text.replace("!", "\\!");
        text = text.replace("(", "\\(");
        text = text.replace(")", "\\)");
        text = text.replace("{", "\\{");
        text = text.replace("}", "\\}");
        text = text.replace("[", "\\[");
        text = text.replace("]", "\\]");
        text = text.replace("^", "\\^");
        // text = text.replace("\"", "\\\"");
        text = text.replace("~", "\\~");
        text = text.replace("*", "\\*");
        text = text.replace("?", "\\?");
        text = text.replace(":", "\\:");
        // if enabled, the backslash escape must run before all the others:
        //text = text.replace("\\", "\\\\");
        text = text.replace("/", "\\/");
        try {
            Path path;
            //Set system path code
            Directory directory = FSDirectory.open(path);
            IndexReader indexReader = DirectoryReader.open(directory);
            IndexSearcher indexSearcher = new IndexSearcher(indexReader);
            QueryParser queryParser = new QueryParser("contents", new SimpleAnalyzer());
            Query query = queryParser.parse(text + "*");
            TopDocs topDocs = indexSearcher.search(query, 50);
            for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
                org.apache.lucene.document.Document document = indexSearcher.doc(scoreDoc.doc);
                objectIds.add(Integer.valueOf(document.get("id")));
                System.out.println("id " + document.get("id"));
                System.out.println("content " + document.get("contents"));
            }
            indexSearcher.getIndexReader().close();
            directory.close();
            return objectIds;
        } catch (Exception ignored) {
        }
    }
    return null;
}
Indexing code:
@Override
public void saveIndexes(String text, String tagFileName, String filePath, long groupId, boolean type, int objectId) {
    try {
        // indexing directory
        Path path;
        if (type) {
            // System path code
            Directory directory = org.apache.lucene.store.FSDirectory.open(path);
            IndexWriterConfig config = new IndexWriterConfig(new SimpleAnalyzer());
            IndexWriter indexWriter = new IndexWriter(directory, config);
            org.apache.lucene.document.Document doc = new org.apache.lucene.document.Document();
            if (filePath != null) {
                File file = new File(filePath); // current directory
                doc.add(new TextField("path", file.getPath(), Field.Store.YES));
            }
            doc.add(new StringField("id", String.valueOf(objectId), Field.Store.YES));
            // doc.add(new TextField("id", String.valueOf(objectId), Field.Store.YES));
            if (text == null) {
                if (filePath != null) {
                    FileInputStream is = new FileInputStream(filePath);
                    BufferedReader reader = new BufferedReader(new InputStreamReader(is));
                    StringBuilder stringBuffer = new StringBuilder();
                    String line;
                    while ((line = reader.readLine()) != null) {
                        stringBuffer.append(line).append("\n");
                    }
                    stringBuffer.append("\n").append(tagFileName);
                    reader.close();
                    doc.add(new TextField("contents", stringBuffer.toString(), Field.Store.YES));
                }
            } else {
                text = text + "\n" + tagFileName;
                doc.add(new TextField("contents", text, Field.Store.YES));
            }
            indexWriter.addDocument(doc);
            indexWriter.commit();
            indexWriter.flush();
            indexWriter.close();
            directory.close();
        }
    } catch (Exception ignored) {
    }
}
I have tried with and without the wildcard (*). Thank you.
The issue is in your indexing code.
Your contents field is a TextField and you are using a SimpleAnalyzer. The SimpleAnalyzer documentation says:
An Analyzer that filters LetterTokenizer with LowerCaseFilter
LetterTokenizer keeps only letters, so wherever this analyzer is applied to a tokenized field, the numbers are dropped.
Now look at the TextField code: a TextField is always tokenized, irrespective of it being TYPE_STORED or TYPE_NOT_STORED.
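You can see this directly by running the analyzer by hand (a small sketch, just to demonstrate the behavior):
try (Analyzer analyzer = new SimpleAnalyzer();
     TokenStream ts = analyzer.tokenStream("contents", "123Ab")) {
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
        System.out.println(term.toString());   // prints only "ab"; the digits are gone
    }
    ts.end();
}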
So if you wish to index letters and numbers together, you need to use a StringField instead of a TextField.
The StringField documentation says:
A field that is indexed but not tokenized: the entire String value is
indexed as a single token. For example this might be used for a
'country' field or an 'id' field, or any field that you intend to use
for sorting or access through the field cache.
A StringField is never tokenized irrespective of it being TYPE_STORED or TYPE_NOT_STORED
So after indexing, the numbers have been stripped from the contents field: it is indexed without them, which is why you don't find those patterns while searching.
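A hedged sketch of that swap in the indexing code from the question (note the caveat: a StringField is one single token, so only exact or wildcard matches on the whole value will hit):
doc.add(new StringField("contents", text, Field.Store.YES));   // instead of new TextField(...)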
Instead of QueryParser and complicated searches, first use a query like the one below to verify your indexed terms:
Query wildcardQuery = new WildcardQuery(new Term("contents", searchString));
TopDocs hits = searcher.search(wildcardQuery, 20);
Also, to decide whether debugging should focus on the indexer side or the searcher side, use the Luke tool to see whether the terms are created as per your need. If the terms are there, you can focus on the searcher code.

Remove special characters from text/PDF with Apache Tika

I am parsing a PDF file to extract text with Apache Tika.
//Create a body content handler
BodyContentHandler handler = new BodyContentHandler();
//Metadata
Metadata metadata = new Metadata();
//Input file path
FileInputStream inputstream = new FileInputStream(new File(faInputFileName));
//Parser context. It is used to parse InputStream
ParseContext pcontext = new ParseContext();
try
{
//parsing the document using PDF parser from Tika.
PDFParser pdfparser = new PDFParser();
//Do the parsing by calling the parse function of pdfparser
pdfparser.parse(inputstream, handler, metadata,pcontext);
}catch(Exception e)
{
System.out.println("Exception caught:");
}
String extractedText = handler.toString();
The above code works and the text from the PDF is extracted.
There are some special characters in the PDF file (like #, &, £, the trademark sign, etc.). How can I remove those special characters during or after the extraction process?
Since PDF uses Unicode code points, you may well have strings that contain surrogate pairs, combining forms (e.g. for diacritics) etc., and you may wish to preserve these as their closest ASCII equivalent, e.g. normalise é to e. If so, you can do something like this:
import java.text.Normalizer;
String normalisedText = Normalizer.normalize(handler.toString(), Normalizer.Form.NFD);
If you are simply after ASCII text, then once normalised you could filter the string you get from Tika with a regular expression, as per this answer:
extractedText = normalisedText.replaceAll("[^\\p{ASCII}]", "");
However, since regular expressions can be slow (particularly on large strings), you may want to avoid the regex and do a simple substitution (as per this answer):
public static String flattenToAscii(String string) {
char[] out = new char[string.length()];
String normalized = Normalizer.normalize(string, Normalizer.Form.NFD);
int j = 0;
for (int i = 0, n = normalized.length(); i < n; ++i) {
char c = normalized.charAt(i);
if (c <= '\u007F') out[j++] = c;
}
return new String(out);
}
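Usage with the Tika output from above (a one-line sketch):
String asciiOnly = flattenToAscii(handler.toString());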

Identify hidden text Word 2003/2007 using Apache POI

I am converting a Word (2003 and 2007) document to HTML format. I have managed to read the text, formatting etc. from the Word document, but the document contains some hidden text, like 'Header Change History', which need not be displayed on the page. Is there any way to identify hidden text in a Word document?
Any help will be much appreciated.
I am not sure if this is a complete (or even accurate) solution, but for files in the DOCX format, it seems that you can check whether a character run is hidden like this:
for (XWPFParagraph p : document.getParagraphs()) {
    for (XWPFRun cr : p.getRuns()) {
        // getRPr() may be null, so guard before checking the vanish element
        if (cr.getCTR().getRPr() != null && cr.getCTR().getRPr().getVanish() != null) {
            // it is hidden
        }
    }
}
Got this from reverse-engineering the XML, and at least in my usage it seems to work. Would be very glad for additional (more informed) input, and a way to do the same thing in the old binary file format.
The following code snippet helps to identify whether text is hidden in the old binary .doc format:
POIFSFileSystem fs = null;
boolean isHidden = false;
try {
    fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String[] paragraphs = we.getParagraphText();
    System.out.println("Word Document has " + paragraphs.length + " paragraphs");
    Range range = doc.getRange();
    for (int k = 0; k < range.numParagraphs(); k++) {
        org.apache.poi.hwpf.usermodel.Paragraph paragraph = range.getParagraph(k);
        for (int j = 0; j < paragraph.numCharacterRuns(); j++) {
            org.apache.poi.hwpf.usermodel.CharacterRun cr = paragraph.getCharacterRun(j);
            if (cr.isVanished()) {
                // it is hidden
                System.out.println("text is hidden");
                isHidden = true;
                break;
            }
        }
    }
} catch (IOException e) {
    e.printStackTrace();
}
