StringComparison in java - java

We developed a PDF reader desktop app using iTextSharp, and we are now developing an Android app. This is my C# code; I want to know what can be used in place of StringComparison in Java.
public final void PDFReferenceGetter(String pSearch, StringComparison SC, String sourceFile, String destinationFile)
{
//
}
the full code
public final void PDFReferenceGetter(String pSearch, String SC, String sourceFile, String destinationFile)
{
PdfStamper stamper = null;
PdfContentByte contentByte;
Rectangle refRectangle = null;
int refPage = 0;
//this.Cursor = Cursors.WaitCursor;
if ((new java.io.File(sourceFile)).isFile())
{
PdfReader pReader = new PdfReader(sourceFile);
stamper = new PdfStamper(pReader, new FileOutputStream(destinationFile));
for (int page = 1; page <= pReader.getNumberOfPages(); page++)
{
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
contentByte = stamper.getUnderContent(page);
//Send some data contained in PdfContentByte, looks like the first is always cero for me and the second 100, but i'm not sure if this could change in some cases
strategy._UndercontentCharacterSpacing = contentByte.getCharacterSpacing();
strategy._UndercontentHorizontalScaling = contentByte.getHorizontalScaling();
//It's not really needed to get the text back, but we have to call this line ALWAYS,
//because it triggers the process that will get all chunks from PDF into our strategy Object
String currentText = PdfTextExtractor.getTextFromPage(pReader, page, strategy);
//The real getter process starts in the following line
java.util.ArrayList<Rectangle> matchesFound = strategy.GetTextLocations("References", SC);
//Set the fill color of the shapes, I don't use a border because it would make the rect bigger
//but maybe using a thin border could be a solution if you see the currect rect is not big enough to cover all the text it should cover
contentByte.setColorFill(BaseColor.PINK);
//MatchesFound contains all text with locations, so do whatever you want with it, this highlights them using PINK color:s
for (Rectangle rect : matchesFound)
{
refRectangle = rect;
refPage = page;
}
contentByte.fill();
}
for (int page = 1; page <= pReader.getNumberOfPages(); page++)
{
LocationTextExtractionStrategy strategy = new LocationTextExtractionStrategy();
contentByte = stamper.getUnderContent(page);
//Send some data contained in PdfContentByte, looks like the first is always cero for me and the second 100, but i'm not sure if this could change in some cases
strategy._UndercontentCharacterSpacing = contentByte.getCharacterSpacing();
strategy._UndercontentHorizontalScaling = contentByte.getHorizontalScaling();
//It's not really needed to get the text back, but we have to call this line ALWAYS,
//because it triggers the process that will get all chunks from PDF into our strategy Object
String currentText = PdfTextExtractor.getTextFromPage(pReader, page, strategy);
String text = currentText;
String patternString = pSearch;
Pattern pattern = Pattern.compile(patternString);
Matcher matcher = pattern.matcher(text);
boolean matches = matcher.matches();
if(matches == true)
{
ArrayList<String> mc = new ArrayList<String>();
mc.add(text);
//MatchCollection mc = Regex.Matches(currentText, pSearch);
java.util.ArrayList<Rectangle> matchesFound = new java.util.ArrayList<Rectangle>();
for (String m : mc)
{
matchesFound = strategy.getTextLocations(m.toString(), SC);
for (Rectangle rect : matchesFound)
{
contentByte.rectangle(rect.getLeft(), rect.getBottom(), rect.getWidth(), rect.getHeight());
PdfDestination pdfdest = new PdfDestination(PdfDestination.XYZ, refRectangle.LEFT, refRectangle.TOP, 0);
PdfAnnotation annot = PdfAnnotation.createLink(stamper.getWriter(), rect, PdfAnnotation.HIGHLIGHT_INVERT, refPage, pdfdest);
stamper.addAnnotation(annot, page);
}
}
//The real getter process starts in the following line
//Set the fill color of the shapes, I don't use a border because it would make the rect bigger
//but maybe using a thin border could be a solution if you see the currect rect is not big enough to cover all the text it should cover
contentByte.setColorFill(BaseColor.LIGHT_GRAY);
//MatchesFound contains all text with locations, so do whatever you want with it, this highlights them using PINK color:
contentByte.fill();
}
stamper.close();
pReader.close();
}
//this.Cursor = Cursors.Default;
}
}

The StringComparison enum is described in more detail here.
The short answer is that there is, unfortunately, no suitable type in the java libraries.
The easy solution
Create your own Java enum mirroring the C# one. You also have to write your own string comparison method that takes the StringComparison value into account, e.g. ignoring case or not, depending on the value.
The best solution
I would avoid using StringComparison in the interface of a method. Instead, search for usages of the method. I'm guessing it is only used to ignore case in some calls and not in others, or that it is completely unused. In the latter case, simply remove it and you're done. In the former case, pass a boolean to the method instead. Remember to update the C# code to keep the ports somewhat in sync.
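For the "easy solution", a minimal sketch could look like the following. The enum name mirrors the C# type, but the constants and the methods areEqual and indexOf are hypothetical names chosen for illustration, and only the two ordinal modes are covered. Culture-sensitive modes would need something like java.text.Collator and are left out here.

```java
/** Hypothetical Java stand-in for a subset of C#'s StringComparison. */
enum StringComparison {
    ORDINAL,
    ORDINAL_IGNORE_CASE;

    /** Returns true if the two strings are equal under this comparison mode. */
    boolean areEqual(String a, String b) {
        if (a == null || b == null) {
            return a == b; // equal only if both are null
        }
        return this == ORDINAL ? a.equals(b) : a.equalsIgnoreCase(b);
    }

    /** Returns the index of needle in haystack under this mode, or -1 if absent. */
    int indexOf(String haystack, String needle) {
        if (this == ORDINAL) {
            return haystack.indexOf(needle);
        }
        // regionMatches with ignoreCase=true avoids building lower-cased copies.
        int max = haystack.length() - needle.length();
        for (int i = 0; i <= max; i++) {
            if (haystack.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

A ported method could then keep its C#-like signature and call, for example, StringComparison.ORDINAL_IGNORE_CASE.indexOf(pageText, "references").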

If you have one string:
String myString = "somestring";
And another one:
String anotherString = "somestringelse";
You can use the built-in equals() method like this:
if(myString.equals(anotherString)) {
//Do code
}

You can use the basic Java string comparison methods.
To compare two strings as a whole:
String string1 = "abcd", string2 = "abcd";
if (string1.equals(string2)) // true, because the strings are equal; otherwise false
To compare two strings as a whole while ignoring case:
String string1 = "abcd", string2 = "AbCd";
if (string1.equalsIgnoreCase(string2)) // true, although the case differs; otherwise false
If you do not want to compare whole strings, other methods are available.
See the following link for all the string comparison methods in Java:
http://docs.oracle.com/javase/tutorial/java/data/comparestrings.html
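As a rough illustration of comparing only parts of strings, here is a small sketch using standard java.lang.String methods. The class name PartialCompareDemo and the helper startsWithIgnoreCase are made up for this example.

```java
final class PartialCompareDemo {
    /** True if text starts with prefix, ignoring case differences. */
    static boolean startsWithIgnoreCase(String text, String prefix) {
        // regionMatches returns false if the prefix is longer than the text.
        return text.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    public static void main(String[] args) {
        String s = "References and notes";
        System.out.println(s.startsWith("Ref"));             // prefix check, case-sensitive
        System.out.println(s.endsWith("notes"));             // suffix check
        System.out.println(s.contains("and"));               // substring check
        System.out.println(startsWithIgnoreCase(s, "REFER")); // prefix check, ignoring case
        System.out.println("abcd".compareToIgnoreCase("ABCE") < 0); // ordering, ignoring case
    }
}
```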

Related

How to extract data from PDF and split into particluar categories using java

I am trying to extract data from a PDF and split it into certain categories. I am able to extract the data and split it into categories on the basis of font size. For example, say there are three categories: country, capital, and city. I am able to put all countries, capitals, and cities into their respective categories, but I am not able to map which capital belongs to which city and country, or which country belongs to which city and capital.
It is reading data randomly. How can I read data from bottom to top without breaking the sequence, so that I can put the first word in the first category, the second into the second, and so on?
Or does anyone know a more efficient way, so that I can put text into the respective categories and also map it?
I am using Java, and here is my code:
public class readPdfText {
public static void main(String[] args) {
try{
PdfReader reader = null;
String src = "pdffile.pdf";
try {
reader = new PdfReader(src);
} catch (IOException e1) {
// TODO Auto-generated catch block
e1.printStackTrace();
}
SemTextExtractionStrategy smt = new SemTextExtractionStrategy();
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
PdfTextExtractor.getTextFromPage(reader, i, smt);
}
}catch(Exception e){
}
}
}
SemTextExtractionStrategy class:
public class SemTextExtractionStrategy implements TextExtractionStrategy {
private String text;
StringBuffer str = new StringBuffer();
StringBuffer item = new StringBuffer();
StringBuffer cat = new StringBuffer();
StringBuffer desc = new StringBuffer();
float temp = 0;
@Override
public void beginTextBlock() {
}
@Override
public void renderText(TextRenderInfo renderInfo) {
text = renderInfo.getText();
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1),
topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
compare(text, curFontSize);
}
private void add(String text2, float curFontSize) {
str.append(text2);
System.out.println("str: " + str);
}
public void compare(String text2, float curFontSize) {
// text2.getFont().getBaseFont().Contains("bold");
// temp = curFontSize;
boolean flag = check(text);
if (temp == curFontSize) {
str.append(text);
/*
* if (curFontSize == 11.222168){ item.append(str);
* System.out.println(item); }else if (curFontSize == 10.420532){
* desc.append(str); }
*/
// str.append(text);
} else {
if (temp>9.8 && temp<10){
String Contry= str.toString();
System.out.println("Contry: "+Contry);
}else if(temp>8 && temp <9){
String itemPrice= str.toString();
System.out.println("itemPrice: "+itemPrice);
}else if(temp >7 && temp< 7.2){
String captial= str.toString();
System.out.println("captial: "+captial);
}else if(temp >7.2 && temp <8){
String city= str.toString();
System.out.println("city: "+city);
}else{
System.out.println("size: "+temp+" "+"str: "+str);
}
temp = curFontSize;
// System.out.println(temp);
str.delete(0, str.length());
str.append(text);
}
}
private boolean check(String text2) {
return true;
}
@Override
public void endTextBlock() {
}
@Override
public void renderImage(ImageRenderInfo renderInfo) {
}
@Override
public String getResultantText() {
return text;
}
}
It is reading data randomly, How I can Read data from bottom to Top without breaking the sequence, so I can Put first word in first category, 2nd into second and so on.
No, not randomly but instead in the order of the corresponding drawing operations in the content stream.
Your TextExtractionStrategy implementation SemTextExtractionStrategy simply uses the text in the order in which it is forwarded to it which is the order in which it is drawn. The order of the drawing operations does not need to be the reading order, though, as each drawing operation may start at a custom position on the page; if multiple fonts are used on one page, e.g., the text may be drawn grouped by font.
If you want to analyze the text from such a document, you first have to collect and sort the text fragments you get, and only when all text from the page is parsed, you can start analyzing it.
The LocationTextExtractionStrategy (included in the iText distribution) can be taken as an example of a strategy doing just that. It uses its inner class TextChunk for collecting the fragments, though, and this class does not carry the text ascent information you use in your code.
A SemLocationTextExtractionStrategy, therefore, would have to use an extended TextChunk class to also keep that information (or some information derived from it, e.g. a text category).
Furthermore the LocationTextExtractionStrategy only sorts top to bottom, left to right. If your PDF has a different design, e.g. if it is multi-columnar, either your sorting has to be adapted or you have to use filters and analyze the page column by column.
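The "collect first, sort, then analyze" idea can be sketched without any iText types. Everything here is hypothetical: the class SortedChunks, its Chunk fields, and the category string are illustrative names, and a real strategy would fill them from renderText(). The sort compares y coordinates exactly, whereas real PDFs usually need a small tolerance to treat fragments as being on the same line.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SortedChunks {
    /** Hypothetical fragment: text, baseline start point, and a derived category. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        final String category;

        Chunk(String text, float x, float y, String category) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.category = category;
        }
    }

    private final List<Chunk> chunks = new ArrayList<>();

    /** Call once per fragment while the page is parsed; no analysis happens here. */
    void add(Chunk c) {
        chunks.add(c);
    }

    /** Returns fragments top to bottom (PDF y grows upward), then left to right. */
    List<Chunk> inReadingOrder() {
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingDouble((Chunk c) -> -c.y)
                .thenComparingDouble(c -> c.x));
        return sorted;
    }
}
```

Only after the whole page has been parsed would the analysis walk the sorted list and pair, say, each country with the capital and city that follow it.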
BTW, your code to determine the font size
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1),
topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
does not return the actual font size but only the ascent above the base line. And even that only for unrotated text; as soon as rotation is part of the game, your code only returns the height of the rectangle enveloping the line from the start of the base line to the end of the ascent line. The length of the line from base line start to ascent line start would at least be independent from rotation.
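The rotation-independent measure mentioned above is just the Euclidean distance between the two start points. A minimal sketch in plain Java, with the class name AscentMath and the parameter names being assumptions; in the question's code the coordinates would come from renderInfo.getBaseline().getStartPoint() and renderInfo.getAscentLine().getStartPoint():

```java
final class AscentMath {
    /**
     * Distance from the baseline start to the ascent-line start.
     * Unlike the height of an enveloping rectangle, this does not
     * change when the text is rotated.
     */
    static double ascent(double baseX, double baseY, double ascX, double ascY) {
        return Math.hypot(ascX - baseX, ascY - baseY);
    }
}
```

For unrotated text with the baseline starting at (10, 100) and the ascent line starting at (10, 108) this gives 8; for the same text rotated by 90 degrees, with the ascent line starting at (2, 100), it still gives 8. This is still the ascent, not the font size.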
Or anyone know some more efficient way?
Your task seems to depend very much on the PDF you are trying to extract information from. Without that PDF, therefore, tips for more efficient ways will remain vague.

Remove Highlight matching String content

OK, a few days ago I made a post about removing highlighted text in a JTextArea:
Removing Highlight from specific word - Java
That time I wrote code that removes highlights by matching their size, but now I have a lot of words with the same size in my app, and obviously the application isn't working correctly.
So I ask: does anyone know a library or a way to do this removal by matching the content of each highlighted string?
You could write a method to get the text for a given highlight:
private static String highlightedText(Highlighter.Highlight h, Document d) throws BadLocationException {
    int start = h.getStartOffset();
    int end = h.getEndOffset();
    return d.getText(start, end - start);
}
Then your removeHighlights method would look like this:
public void removeHighlights(JTextComponent c, String toBlackOut) {
    Highlighter highlighter = c.getHighlighter();
    Highlighter.Highlight[] highlights = highlighter.getHighlights();
    Document d = c.getDocument();
    for (Highlighter.Highlight h : highlights) {
        try {
            // Remove only highlights painted by TextHighLighter whose text matches.
            if (h.getPainter() instanceof TextHighLighter && highlightedText(h, d).equals(toBlackOut))
                highlighter.removeHighlight(h);
        } catch (BadLocationException e) {
            // The highlight no longer maps to valid document offsets; skip it.
        }
    }
}

caret position into the html of JEditorPane

The getCaretPosition method of JEditorPane gives an index into the text only part of the html control. Is there a possibility to get the index into the html text?
To be more specific suppose I have a html text (where | denotes the caret position)
abcd<img src="1.jpg"/>123|<img src="2.jpg"/>
Now getCaretPosition gives 8 while I would need 25 as a result to read out the filename of the image.
I had mostly the same problem and solved it with the following method (I used JTextPane, but it should be the same for JEditorPane):
public int getCaretPositionHTML(JTextPane pane) {
HTMLDocument document = (HTMLDocument) pane.getDocument();
String text = pane.getText();
String x;
Random RNG = new Random();
while (true) {
x = RNG.nextLong() + "";
if (text.indexOf(x) < 0) break;
}
try {
document.insertString(pane.getCaretPosition(), x, null);
} catch (BadLocationException ex) {
ex.printStackTrace();
return -1;
}
text = pane.getText();
int i = text.indexOf(x);
pane.setText(text.replace(x, ""));
return i;
}
It just assumes your JTextPane won't contain all possible Long values ;)
The underlying model of the JEditorPane (some subclass of StyledDocument, in your case HTMLDocument) doesn't actually hold the HTML text as its internal representation. Instead, it has a tree of Elements containing style attributes. It only becomes HTML once that tree is run through the HTMLWriter. That makes what you're trying to do kinda tricky! I could imagine putting some flag attribute on the character element that you're currently on, and then using a specially crafted subclass of HTMLWriter to write out until that marker and count the characters, but that sounds like something of an epic hack. There is probably an easier way to get what you want there, though it's a bit unclear to me what that actually is.
I had the same problem, and solved it with the following code:
editor.getDocument().insertString(editor.getCaretPosition(),"String to insert", null);
I don't think you can make the caret position count tags as characters. If your final aim is to read the image filename, you should use the HTMLEditorKit:
HTMLEditorKit (JEditorPane.getEditorKitForContentType("text/html") );
For more information about utilisation see Oracle HTMLEditorKit documentation and this O'Reilly PDF that contains interesting examples.

Parsing a "rgb (x, x, x)" String Into a Color Object

Is there an efficient way or existing solution for parsing the string "rgb (x, x, x)" (where x in this case is 0-255) into a color object? I'm planning to use the color values to convert them into the equivalent hex color.
I would prefer there to be a GWT option for this. I also realize that it would be easy to use something like Scanner.nextInt. However I was looking for a more reliable manner to get this information.
As far as I know there's nothing like this built-in to Java or GWT. You'll have to code your own method:
public static Color parse(String input)
{
Pattern c = Pattern.compile("rgb *\\( *([0-9]+), *([0-9]+), *([0-9]+) *\\)");
Matcher m = c.matcher(input);
if (m.matches())
{
return new Color(Integer.valueOf(m.group(1)), // r
Integer.valueOf(m.group(2)), // g
Integer.valueOf(m.group(3))); // b
}
return null;
}
You can use that like this
// java.awt.Color[r=128,g=32,b=212]
System.out.println(parse("rgb(128,32,212)"));
// java.awt.Color[r=255,g=0,b=255]
System.out.println(parse("rgb (255, 0, 255)"));
// throws IllegalArgumentException:
// Color parameter outside of expected range: Red Blue
System.out.println(parse("rgb (256, 1, 300)"));
For those of us who don't understand regex:
public class Test
{
public static void main(String args[]) throws Exception
{
String text = "rgb(255,0,0)";
String[] colors = text.substring(4, text.length() - 1 ).split(",");
Color color = new Color(
Integer.parseInt(colors[0].trim()),
Integer.parseInt(colors[1].trim()),
Integer.parseInt(colors[2].trim())
);
System.out.println( color );
}
}
Edit: I knew someone would comment on error checking. I was leaving that up to the poster. It is easily handled by doing:
if (text.startsWith("rgb(") && text.endsWith(")"))
// do the parsing
if (colors.length == 3)
// build and return the color
return null;
The point is you don't need a complicated regex that nobody understands at first glance. Adding error conditions is a simple task.
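Putting the substring approach and the error checks together might look like the sketch below. It returns the hex string directly, since that is what the question ultimately wants, and it avoids java.awt.Color, which is not available in GWT client code. The class name RgbParser, the method toHex, and the choice to return null on malformed input are assumptions for illustration.

```java
final class RgbParser {
    /** Returns "#rrggbb" for input like "rgb(255, 0, 0)", or null if malformed. */
    static String toHex(String text) {
        String s = text.trim();
        if (!s.startsWith("rgb") || !s.endsWith(")")) {
            return null;
        }
        int open = s.indexOf('(');
        // Only whitespace may appear between "rgb" and the opening parenthesis.
        if (open < 0 || !s.substring(3, open).trim().isEmpty()) {
            return null;
        }
        String[] parts = s.substring(open + 1, s.length() - 1).split(",", -1);
        if (parts.length != 3) {
            return null;
        }
        StringBuilder hex = new StringBuilder("#");
        for (String part : parts) {
            int value;
            try {
                value = Integer.parseInt(part.trim());
            } catch (NumberFormatException e) {
                return null; // not an integer
            }
            if (value < 0 || value > 255) {
                return null; // outside the 0-255 range
            }
            if (value < 16) {
                hex.append('0'); // pad single hex digits to two characters
            }
            hex.append(Integer.toHexString(value));
        }
        return hex.toString();
    }
}
```

With this sketch, toHex("rgb (128, 32, 212)") yields "#8020d4", while "rgb (256, 1, 300)" and "rgb(1,2)" both yield null.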
I still prefer the regex solution (and voted accordingly) but camickr does make a point that regex is a bit obscure, especially to kids today who haven't used Unix (when it was a manly-man's operating system with only a command line interface -- Booyah!!). So here is a high-level solution that I'm offering up, not because I think it's better, but because it acts as an example of how to use some of the nifty Guava functions:
package com.stevej;
import com.google.common.base.CharMatcher;
import com.google.common.base.Splitter;
import com.google.common.collect.Iterables;
public class StackOverflowMain {
public static void main(String[] args) {
Splitter extractParams = Splitter.on("rgb").omitEmptyStrings().trimResults();
Splitter splitParams =
Splitter.on(CharMatcher.anyOf("(),").or(CharMatcher.WHITESPACE)).omitEmptyStrings()
.trimResults();
final String test1 = "rgb(11,44,88)";
System.out.println("test1");
for (String param : splitParams.split(Iterables.getOnlyElement(extractParams.split(test1)))) {
System.out.println("param: [" + param + "]");
}
final String test2 = "rgb ( 111, 444 , 888 )";
System.out.println("test2");
for (String param : splitParams.split(Iterables.getOnlyElement(extractParams.split(test2)))) {
System.out.println("param: [" + param + "]");
}
}
}
Output:
test1
param: [11]
param: [44]
param: [88]
test2
param: [111]
param: [444]
param: [888]
It's regex-ee-ish without the regex.
It is left as an exercise to the reader to add checks that (a) "rgb" appears in the beginning of the string, (b) the parentheses are balanced and correctly positioned, and (c) the correct number of correctly formatted rgb integers are returned.
And the C# form:
public static bool ParseRgb(string input, out Color color)
{
var regex = new Regex("rgb *\\( *([0-9]+), *([0-9]+), *([0-9]+) *\\)");
var m = regex.Match(input);
if (m.Success)
{
color = Color.FromArgb(int.Parse(m.Groups[1].Value), int.Parse(m.Groups[2].Value), int.Parse(m.Groups[3].Value));
return true;
}
color = new Color();
return false;
}

Why is the size of this vector 1?

When I use System.out.println to show the size of a vector after calling the following method then it shows 1 although it should show 2 because the String parameter is "7455573;photo41.png;photo42.png" .
private void getIdClientAndPhotonames(String csvClientPhotos)
{
Vector vListPhotosOfClient = new Vector();
String chainePhotos = "";
String photoName = "";
String photoDirectory = new String(csvClientPhotos.substring(0, csvClientPhotos.indexOf(';')));
chainePhotos = csvClientPhotos.substring(csvClientPhotos.indexOf(';')+1);
chainePhotos = chainePhotos.substring(0, chainePhotos.lastIndexOf(';'));
if (chainePhotos.indexOf(';') == -1)
{
vListPhotosOfClient.addElement(new String(chainePhotos));
}
else // aaa;bbb;...
{
for (int i = 0 ; i < chainePhotos.length() ; i++)
{
if (chainePhotos.charAt(i) == ';')
{
vListPhotosOfClient.addElement(new String(photoName));
photoName = "";
continue;
}
photoName = photoName.concat(String.valueOf(chainePhotos.charAt(i)));
}
}
}
So the vector should contain the two String photo41.png and photo42.png , but when I print the vector content I get only photo41.png.
So what is wrong in my code ?
The answer is not valid for this question anymore, because it has been retagged to java-me. Still true if it was Java (like in the beginning): use String#split if you need to handle csv files.
It'd be far easier to split the string:
String[] parts = csvClientPhotos.split(";");
This will give a string array:
{"7455573","photo41.png","photo42.png"}
Then you'd simply copy parts[1] and parts[2] to your vector.
You have two immediate problems.
The first is with your initial manipulation of the string. The two lines:
chainePhotos = csvClientPhotos.substring(csvClientPhotos.indexOf(';')+1);
chainePhotos = chainePhotos.substring(0, chainePhotos.lastIndexOf(';'));
when applied to 7455573;photo41.png;photo42.png will end up giving you photo41.png.
That's because the first line removes everything up to the first ; (7455573;) and the second strips off everything from the final ; onwards (;photo42.png). If your intent is to just get rid of the 7455573; bit, you don't need the second line.
Note that fixing this issue alone will not solve all your ills, you still need one more change.
Even though your input string (to the loop) is the correct photo41.png;photo42.png, you still only add an item to the vector each time you encounter a delimiting ;. There is no such delimiter at the end of that string, meaning that the final item won't be added.
You can fix this by putting the following immediately after the for loop:
if (! photoName.equals(""))
vListPhotosOfClient.addElement(new String(photoName));
which will catch the case of the final name not being terminated with the ;.
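With both fixes applied, the method could look roughly like the sketch below. It returns the vector so the result can be inspected; the class name PhotoNames and the method name photoNames are made up for this example. Generics are used for desktop Java; on Java ME (CLDC) the type parameters would have to be dropped.

```java
import java.util.Vector;

final class PhotoNames {
    /** Returns the names after the leading client id: "id;a.png;b.png" gives [a.png, b.png]. */
    static Vector<String> photoNames(String csvClientPhotos) {
        Vector<String> names = new Vector<String>();
        // Fix 1: drop only the leading "id;" part; do not cut at the last ';'.
        String rest = csvClientPhotos.substring(csvClientPhotos.indexOf(';') + 1);
        StringBuffer current = new StringBuffer();
        for (int i = 0; i < rest.length(); i++) {
            char ch = rest.charAt(i);
            if (ch == ';') {
                names.addElement(current.toString());
                current.setLength(0);
            } else {
                current.append(ch);
            }
        }
        // Fix 2: the last name has no trailing ';', so add it after the loop.
        if (current.length() > 0) {
            names.addElement(current.toString());
        }
        return names;
    }
}
```

This version handles input both with and without a trailing ; after the last name.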
These two lines are the problem:
chainePhotos = csvClientPhotos.substring(csvClientPhotos.indexOf(';') + 1);
chainePhotos = chainePhotos.substring(0, chainePhotos.lastIndexOf(';'));
After the first one, chainePhotos contains "photo41.png;photo42.png", but the second one turns it into photo41.png, which triggers the if branch and ends the method with only one element in the vector.
EDITED: what a mess.
I ran it with correct input (as provided by the OP) and made a comment above.
I then fixed it as suggested above, while accidentally changing the input to 7455573;photo41.png;photo42.png; which worked, but is probably incorrect and doesn't match the explanation above input-wise.
I wish someone would un-answer this.
You can split the string manually. If the string contains the ; separator, you can do it like this:
private void getIdClientAndPhotonames(String csvClientPhotos)
{
Vector vListPhotosOfClient = split(csvClientPhotos);
}
private Vector split(String original) {
Vector nodes = new Vector();
String separator = ";";
// Parse nodes into vector
int index = original.indexOf(separator);
while(index>=0) {
nodes.addElement( original.substring(0, index) );
original = original.substring(index+separator.length());
index = original.indexOf(separator);
}
// Get the last node
nodes.addElement( original );
return nodes;
}
