How can I convert a sentence into HTML using a Java program? Suppose I have bold, underlined, or superscript words: how can I conditionally add HTML tags before and after those particular words?
What I have in mind is to read the Excel file and then use a StringBuilder to append the whole sentence inside a p tag.
I am very new to this, and I don't understand how to iterate through each word and check the various conditions.
Here is what I tried. It just adds the p tag at the start and end, but it does not convert bold, or bold and underlined, text:
public String getGeneralValue(Cell currentCell){
switch (currentCell.getCellType()){
case STRING:
return currentCell.getStringCellValue();
case NUMERIC:
return String.valueOf((int) currentCell.getNumericCellValue());
case BOOLEAN:
return String.valueOf(currentCell.getBooleanCellValue()).toUpperCase();
default: System.out.println("other than numeric and String");
}
return "";
}
public String setPType(Cell currentCell){
StringBuilder sb = new StringBuilder();
sb.append("<p>");
sb.append(getGeneralValue(currentCell));
sb.append("</p>");
return sb.toString();
}
Here is an image of an example.
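The conditional tagging described above can be sketched without any Excel library. This is a minimal sketch, not a definitive implementation: the `Run` class and the `RunsToHtml` name are hypothetical, and it assumes the cell text has already been split into runs that each carry their own bold, underline, and superscript flags (in Apache POI those flags would presumably come from the font of each formatting run of the cell's rich-text value).

```java
import java.util.Arrays;
import java.util.List;

class RunsToHtml {

    // Hypothetical holder for one formatting run: a piece of text plus its style flags.
    static final class Run {
        final String text;
        final boolean bold, underline, superscript;

        Run(String text, boolean bold, boolean underline, boolean superscript) {
            this.text = text;
            this.bold = bold;
            this.underline = underline;
            this.superscript = superscript;
        }
    }

    // Escape the characters that are special in HTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Wrap each run in the tags its flags call for, then wrap the whole sentence in <p>.
    static String toHtml(List<Run> runs) {
        StringBuilder sb = new StringBuilder("<p>");
        for (Run r : runs) {
            if (r.bold) sb.append("<b>");
            if (r.underline) sb.append("<u>");
            if (r.superscript) sb.append("<sup>");
            sb.append(escape(r.text));
            // Close in reverse order so the tags nest correctly.
            if (r.superscript) sb.append("</sup>");
            if (r.underline) sb.append("</u>");
            if (r.bold) sb.append("</b>");
        }
        return sb.append("</p>").toString();
    }

    public static void main(String[] args) {
        List<Run> runs = Arrays.asList(
                new Run("Area is 5 m", false, false, false),
                new Run("2", false, false, true),
                new Run(" and ", false, false, false),
                new Run("important", true, true, false));
        // prints <p>Area is 5 m<sup>2</sup> and <b><u>important</u></b></p>
        System.out.println(toHtml(runs));
    }
}
```

Working per run rather than per word keeps the logic simple: the formatting can change in the middle of a word, and a run is exactly the stretch of text over which it does not change.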
I'm using PDFBox in Java and have successfully loaded a PDF. Now I want to search for a specific word and retrieve only the number that follows it. To be concrete, I want to search for Tax and retrieve the number that is the tax amount. The two strings seem to be separated by a tab.
My code currently looks like this:
File file = new File("yes.pdf");
// try-with-resources closes the document even if reading fails
try (PDDocument document = PDDocument.load(file)) {
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
System.out.println(text);
// search for the word tax
// retrieve the number after the word "Tax"
} catch (IOException e) {
e.printStackTrace();
}
I have used something similar in my project. I hope it helps you.
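import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
// PDFBox imports, assuming the 2.x package layout
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractNumber {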
public static void main(String[] args) throws IOException {
PDDocument doc = PDDocument.load(new File("yourFile location"));
PDFTextStripper stripper = new PDFTextStripper();
List<String> digitList = new ArrayList<String>();
// Read the text from the PDF
String string = stripper.getText(doc);
// A letter immediately followed by one or more digits
Pattern mainPattern = Pattern.compile("[a-zA-Z]\\d+");
// Match against the extracted text
Matcher mainMatcher = mainPattern.matcher(string);
while (mainMatcher.find()) {
//Get only numbers
Pattern subPattern = Pattern.compile("\\d+");
String subText = mainMatcher.group();
Matcher subMatcher = subPattern.matcher(subText);
subMatcher.find();
digitList.add(subMatcher.group());
}
if (doc != null) {
doc.close();
}
if(digitList != null && digitList.size() > 0 ) {
for(String digit: digitList) {
System.out.println(digit);
}
}
}
}
The regular expression [a-zA-Z]\d+ finds a letter immediately followed by one or more digits in the PDF text.
The \d+ expression then extracts just the digits from each of those matches.
You can also use a different regular expression to find a specific number of digits.
You can get a better idea from this tutorial.
The best way to do something like that is to use regular expressions. I often use this tool to write my regular expressions. Your regex should probably look something like: tax\s([0-9]+). You can take a look at this tutorial on how to use regex in Java.
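The regex suggested above could be applied along these lines. This is only a sketch under a few assumptions: the `TaxFinder` class and `findTax` method are names made up for illustration, the pattern is made case-insensitive and extended to accept an optional decimal part, and the sample string stands in for whatever `pdfStripper.getText(document)` actually returns.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TaxFinder {
    // "Tax", then whitespace (a tab counts), then the number, captured as group 1.
    // The optional decimal part is an assumption about how the amount is printed.
    private static final Pattern TAX =
            Pattern.compile("\\btax\\s+([0-9]+(?:[.,][0-9]+)?)", Pattern.CASE_INSENSITIVE);

    // Returns the first number that follows the word "Tax", if there is one.
    static Optional<String> findTax(String text) {
        Matcher m = TAX.matcher(text);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from the PDF.
        String text = "Subtotal\t100\nTax\t25\nTotal\t125";
        System.out.println(findTax(text).orElse("not found")); // prints 25
    }
}
```

Capturing the number in a group avoids the second pattern pass: `m.group(1)` is already just the digits.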
I am using the Alfresco upload and download services from Java.
When I upload a file to the Alfresco server, it gives me the following path:
/app:Home/cm:Company_x0020_Home/cm:Abc/cm:TestFile/cm:V4/cm:BC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
When I download the file through the Alfresco services using that same path, I take the file name from the end of the path,
i.e. ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
How can I remove or decode the [Unicode] escape sequences in the file name?
String decoded = URLDecoder.decode(queryString, "UTF-8");
The above does not work.
These are some of the Unicode characters that appear in my file name:
https://en.wikipedia.org/wiki/List_of_Unicode_characters
Please do not mark the question as a duplicate; I have searched the links below, but none of them gave a solution.
These are the links I found for replacing Unicode characters in a String with Java:
Java removing unicode characters
Remove non-ASCII characters from String in Java
How can I replace a unicode character in java string
Java Replace Unicode Characters in a String
The solution given by Jeff Potts is perfect.
However, I needed the file name in a different project that does not use the org.alfresco jars,
and I did not want to pull in all of those dependencies just to decode a file name.
So I used plain Java methods with a regex to parse and decode the file name, which gave me the same result as
ISO9075.decode(test);
Here is the code:
public String decode_FileName(String fileName) {
System.out.println("fileName : " + fileName);
String decodedfileName = fileName;
String temp = "";
Matcher m = Pattern.compile("\\_x(.*?)\\_").matcher(decodedfileName); // regex that matches escape sequences such as _x0020_
List<String> unicodeChars = new ArrayList<String>();
while (m.find()) {
unicodeChars.add(m.group(1));
}
for (int i = 0; i < unicodeChars.size(); i++) {
temp = unicodeChars.get(i);
if (isInteger(temp)) {
String replace_char = String.valueOf(((char) Integer.parseInt(String.valueOf(temp), 16))); // parse the hex code and convert it to a char
decodedfileName = decodedfileName.replace("_x" + temp + "_", replace_char);
}
}
System.out.println("Decoded FileName :" + decodedfileName);
return decodedfileName;
}
This small Java utility checks whether the captured text is a valid hexadecimal number:
public static boolean isInteger(String s) {
try {
Integer.parseInt(s, 16); // the escape codes are hexadecimal, e.g. 002F
} catch (NumberFormatException e) {
return false;
} catch (NullPointerException e) {
return false;
}
return true;
}
The code above works as follows.
Example:
0028 is the left parenthesis, U+0028, as you can see at
https://en.wikipedia.org/wiki/List_of_Unicode_characters
String replace_char = String.valueOf(((char) Integer.parseInt(String.valueOf("0028"), 16)));
System.out.println(replace_char);
This code prints ( which is a left parenthesis.
This is the logic I used in my Java program.
The program above gives the same result as ISO9075.decode(test).
Output:
fileName : ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf
Decoded FileName :ABC1X 0400 0109-(1-2)_v2.pdf
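The same decoding can also be done in a single pass with `Matcher.appendReplacement`. The sketch below is a simplified stand-in, not the Alfresco implementation: the `Iso9075Names` class name is made up for illustration, and it only handles `_xHHHH_` sequences with exactly four hexadecimal digits.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Iso9075Names {
    // Matches _xHHHH_ where HHHH is exactly four hexadecimal digits.
    private static final Pattern ESCAPE = Pattern.compile("_x([0-9A-Fa-f]{4})_");

    static String decode(String name) {
        Matcher m = ESCAPE.matcher(name);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Parse the four hex digits and turn the code into a character.
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps '$' and '\' from being read as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints ABC1X 0400 0109-(1-2)_v2.pdf
        System.out.println(decode("ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf"));
    }
}
```

Restricting the capture group to hex digits means no separate validity check is needed, and decoding in one pass ensures that already-decoded text is never scanned again.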
In the org.alfresco.util package you will find a class called ISO9075. You can use it to encode and decode strings according to that spec. For example:
String test = "ABC1X_x0020_0400_x0020_0109-_x0028_1-2_x0029__v2.pdf";
String out = ISO9075.decode(test);
System.out.println(out);
Returns:
ABC1X 0400 0109-(1-2)_v2.pdf
If you want to see what it does behind the scenes, look at the source.
I'm using an API that sends and receives raw bytes.
But I have a problem displaying the Arabic words that come over the API: they show up as diamond question marks, "���".
I've tried converting the string from and to UTF-8.
This example returns plain question marks, without the black diamond, "??? ???":
String str = new String(originalStr.getBytes("ISO-8859-1"), "UTF-8");
This one returns an empty string:
String str = new String(originalStr.getBytes("WINDOWS-1256"), "UTF-8");
And this one also returns an empty string:
String str = new String(originalStr.getBytes("WINDOWS-1252"), "UTF-8");
I've succeeded in displaying the Arabic words in PHP by converting from cp1256 to UTF-8:
echo iconv('cp1256', 'utf-8', $string);
The correct character encoding for this Arabic text is cp1256.
How can I achieve the same in Java?
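A Java counterpart of the iconv call might look like the sketch below. It assumes the API really delivers cp1256 bytes (Java's name for that charset is windows-1256) and that the raw byte array is still available; once the bytes have been turned into a String with the wrong charset, the original data may already be lost. The `ArabicDecoding` class name is made up for illustration.

```java
import java.nio.charset.Charset;

class ArabicDecoding {
    // Assumption: the API sends windows-1256 (cp1256) bytes, as the PHP iconv call suggests.
    private static final Charset CP1256 = Charset.forName("windows-1256");

    // Decode the raw bytes directly, without an intermediate String in another charset.
    static String decode(byte[] rawBytes) {
        return new String(rawBytes, CP1256);
    }

    public static void main(String[] args) {
        // 0xC7 and 0xC8 are alef and beh in windows-1256.
        byte[] fromApi = {(byte) 0xC7, (byte) 0xC8};
        String text = decode(fromApi);
        System.out.println(text.equals("\u0627\u0628")); // prints true
    }
}
```

A Java String is always Unicode internally, so there is no separate "convert to UTF-8" step; UTF-8 only comes into play when the String is written out again, for example with `text.getBytes(StandardCharsets.UTF_8)`.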
This XML (the file extension is .rdf, but it is XML) was generated by an automatic tool, but unfortunately it has various "unescaped" strings like
<tag xml:lang="fr">L'insuline (du latin insula, île) </tag>
And the parser (and the reasoner software) crashes on this...
Java or PHP solutions are both fine for me!
Thanks,
Celso
Here's a general method that I use a lot to make sure a String is escaped properly for XML.
private static final String AMP = "&amp;";
private static final String LT = "&lt;";
private static final String GT = "&gt;";
private static final String QUOTE = "&quot;";
private static final String APOS = "&apos;";
public static String encodeEntities(String dirtyString) {
StringBuffer buff = new StringBuffer();
char[] chars = dirtyString.toCharArray();
for (int i = 0; i < chars.length; i++) {
if (chars[i] > 0x7f) {
buff.append("&#" + (int) chars[i] + ";");
continue;
}
switch (chars[i]) {
case '&':
buff.append(AMP);
break;
case '<':
buff.append(LT);
break;
case '\'':
buff.append(APOS);
break;
case '"':
buff.append(QUOTE);
break;
case '>':
buff.append(GT);
break;
default:
buff.append(chars[i]);
break;
}
}
return buff.toString();
}
The XML given by the OP is well-formed: the single quote character is valid and so is the i-circumflex, and neither needs escaping. I would make sure you're using a text encoding such as UTF-8. Here's a quick Java example that does an identity transformation:
public static void main(String[] args) throws Exception {
Transformer t = TransformerFactory.newInstance().newTransformer();
StreamResult s = new StreamResult(System.out);
t.transform(new StreamSource(new StringReader("<tag xml:lang=\"fr\">L'insuline (du latin insula, île) </tag>")), s);
}
The XML fragment given by the OP looks well-formed. Neither the apostrophe nor the i-circumflex needs escaping. The most likely problem is that the XML is encoded using iso-8859-1 but lacks an XML declaration, so the parser thinks it is in UTF-8 encoding. The solution then is to add the XML declaration <?xml version="1.0" encoding="iso-8859-1"?>, which tells the parser how to decode the characters. (For a document containing only ASCII characters, iso-8859-1 and utf-8 are indistinguishable, so this problem only surfaces when you use characters outside the ASCII range.)
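The effect of the declaration can be tried out with the JDK's built-in parser. This is a sketch under the assumption made above, namely that the file is stored as iso-8859-1; the `XmlDeclarationDemo` class and its `parseText` method are names invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class XmlDeclarationDemo {
    // The fragment from the question; \u00EE is the i-circumflex.
    static final String BODY = "<tag xml:lang=\"fr\">L'insuline (du latin insula, \u00EEle) </tag>";

    // Parse raw bytes; the parser takes the encoding from the XML declaration
    // and falls back to UTF-8 when there is none.
    static String parseText(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        // With the declaration, the iso-8859-1 bytes are decoded correctly.
        String declared = "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?>" + BODY;
        System.out.println(parseText(declared.getBytes(StandardCharsets.ISO_8859_1)));

        try {
            // The same bytes without the declaration are read as UTF-8.
            parseText(BODY.getBytes(StandardCharsets.ISO_8859_1));
        } catch (Exception e) {
            System.out.println("Parse failed: " + e);
        }
    }
}
```

Feeding the parser bytes rather than a String matters here: a String has already been decoded, so the declaration would have no effect on it.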
A word of advice: if you had given the error message generated by the parser, you wouldn't have got so many incorrect answers.
I've got some HTML files that need to be parsed and cleaned, and they occasionally have content with special characters like <, >, ", etc. which have not been properly escaped.
I have tried running the files through jTidy, but the best I can get it to do is just omit the content it sees as malformed html. Is there a different library that will just escape the malformed fragments instead of omitting them? If not, any recommendations on what library would be easiest to modify?
Clarification:
Sample input: <p> blah blah <M+1> blah </p>
Desired output: <p> blah blah &lt;M+1&gt; blah </p>
You can also try TagSoup. TagSoup emits regular old SAX events so in the end you get what looks like a well-formed XML document.
I have had very good luck with TagSoup and I'm always surprised at how well it handles poorly constructed HTML files.
Ultimately I solved this by running a regular expression first and an unmodified TagSoup second.
Here is my regular expression code to escape unknown tags like <M+1>
private static String escapeUnknownTags(String input) {
Scanner scan = new Scanner(input);
StringBuilder builder = new StringBuilder();
while (scan.hasNext()) {
String s = scan.findWithinHorizon("[^<]*</?[^<>]*>?", 1000000);
if (s == null) {
builder.append(escape(scan.next(".*")));
} else {
processMatch(s, builder);
}
}
return builder.toString();
}
private static void processMatch(String s, StringBuilder builder) {
if (!isKnown(s)) {
String escaped = escape(s);
builder.append(escaped);
}
else {
builder.append(s);
}
}
private static String escape(String s) {
s = s.replaceAll("<", "&lt;");
s = s.replaceAll(">", "&gt;");
return s;
}
private static boolean isKnown(String s) {
Scanner scan = new Scanner(s);
if (scan.findWithinHorizon("[^<]*</?([^<> ]*)[^<>]*>?", 10000) == null) {
return false;
}
MatchResult mr = scan.match();
try {
String tag = mr.group(1).toLowerCase();
if (HTML.getTag(tag) != null) {
return true;
}
}
catch (Exception e) {
// Should never happen
e.printStackTrace();
}
return false;
}
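For comparison, the same pre-processing idea can be sketched more compactly with `Matcher.appendReplacement`. This is not the code above rewritten line for line: the `UnknownTagEscaper` class is hypothetical, the `KNOWN_TAGS` whitelist is a small made-up stand-in for the `HTML.getTag` lookup, and a stray `<` that is never closed by `>` is left untouched.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnknownTagEscaper {
    // Hypothetical whitelist; extend it to whatever tags the documents legitimately use.
    private static final Set<String> KNOWN_TAGS =
            new HashSet<>(Arrays.asList("p", "b", "i", "u", "a", "br", "div", "span"));

    // Something shaped like a tag: '<', optional '/', a name, optional rest, '>'.
    private static final Pattern TAG_LIKE = Pattern.compile("<(/?)([^<>\\s/]*)([^<>]*)>");

    static String escapeUnknownTags(String input) {
        Matcher m = TAG_LIKE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String match = m.group();
            // Group 2 is the tag name; anything not on the whitelist gets escaped.
            if (!KNOWN_TAGS.contains(m.group(2).toLowerCase())) {
                match = match.replace("<", "&lt;").replace(">", "&gt;");
            }
            m.appendReplacement(out, Matcher.quoteReplacement(match));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // prints <p> blah blah &lt;M+1&gt; blah </p>
        System.out.println(escapeUnknownTags("<p> blah blah <M+1> blah </p>"));
    }
}
```

Note that with this small whitelist, comments and doctype declarations would be escaped as well, so the set would need to be adapted before running it on real pages.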
HTML cleaner
HtmlCleaner is open-source HTML parser written in Java. HTML found on
Web is usually dirty, ill-formed and unsuitable for further
processing. For any serious consumption of such documents, it is
necessary to first clean up the mess and bring the order to tags,
attributes and ordinary text. For the given HTML document, HtmlCleaner
reorders individual elements and produces well-formed XML. By default,
it follows similar rules that the most of web browsers use in order to
create Document Object Model. However, user may provide custom tag and
rule set for tag filtering and balancing.
OK, I suspect the HTML class used in the code above is this one. Importing it should help:
javax.swing.text.html.HTML