Apache POI - Read and store Rich Text content in DB

We have a new requirement in our Java application where users upload an Excel file.
One of the columns in the Excel file will be formatted with bold, italics, bullet points, colored text, etc.
We need to read this Excel file and store these values in an Oracle DB table.
Later, we also need to extract the data and download it into an Excel sheet with the formatting preserved.
We planned to use Apache POI for this, but we are stuck at the point where we have an HSSFRichTextString object that needs to be converted into a format we can store in an Oracle table.
The toString() method of HSSFRichTextString gives the string, but the formatting is lost.
Can someone please suggest how to convert this HSSFRichTextString object into an Oracle data type (preferably CLOB)?

You are right that the toString() method just returns the unformatted String contents of the HSSFRichTextString.
Here is a way to extract the other important data from the HSSFRichTextString so it can be stored alongside the string value.
Very similar to my answer to a related question about appending rich text strings: extract the rich text formatting information from the HSSFRichTextString, and store that data in a class you'll create, FormattingRun.
public class FormattingRun {
    private int beginIdx;
    private int length;
    private short fontIdx;
    public FormattingRun(int beginIdx, int length, short fontIdx) {
        this.beginIdx = beginIdx;
        this.length = length;
        this.fontIdx = fontIdx;
    }
    public int getBegin() { return beginIdx; }
    public int getLength() { return length; }
    public short getFontIndex() { return fontIdx; }
}
Then, call Apache POI methods to extract that data.
numFormattingRuns() - Returns the number of formatting runs in the HSSFRichTextString.
getFontOfFormattingRun(int) - Returns the short font index used by the formatting run at the specified run index.
Now, the actual extraction of the data:
List<FormattingRun> formattingRuns = new ArrayList<FormattingRun>();
int numFormattingRuns = richTextString.numFormattingRuns();
for (int fmtIdx = 0; fmtIdx < numFormattingRuns; fmtIdx++)
{
    int begin = richTextString.getIndexOfFormattingRun(fmtIdx);
    short fontIndex = richTextString.getFontOfFormattingRun(fmtIdx);
    // Walk the string to determine the length of the formatting run.
    int length = 0;
    for (int j = begin; j < richTextString.length(); j++)
    {
        short currFontIndex = richTextString.getFontAtIndex(j);
        if (currFontIndex == fontIndex)
            length++;
        else
            break;
    }
    formattingRuns.add(new FormattingRun(begin, length, fontIndex));
}
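As a possible simplification of the inner loop above: if the runs are reported in ascending order of start index (an assumption worth verifying against your POI version), each run ends where the next one begins, so the lengths can be derived from the start indices alone. The sketch below uses plain arrays in place of the POI calls; RunLengths and lengthsFromStarts are hypothetical names, not part of POI.

```java
import java.util.Arrays;

public class RunLengths {
    // starts[i] is the index where run i begins (ascending order assumed);
    // textLength is the length of the whole string.
    static int[] lengthsFromStarts(int[] starts, int textLength) {
        int[] lengths = new int[starts.length];
        for (int i = 0; i < starts.length; i++) {
            // A run ends at the start of the next run, or at the end of the text.
            int end = (i + 1 < starts.length) ? starts[i + 1] : textLength;
            lengths[i] = end - starts[i];
        }
        return lengths;
    }

    public static void main(String[] args) {
        // Hypothetical 16-character string with runs starting at 0, 6 and 10.
        int[] lengths = lengthsFromStarts(new int[] {0, 6, 10}, 16);
        System.out.println(Arrays.toString(lengths));
    }
}
```

In real code, starts[i] would come from getIndexOfFormattingRun(i) and textLength from length().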
To store this data in the database, first recognize that there is a one-to-many relationship between a HSSFRichTextString and FormattingRun. So in whatever Oracle table you're planning on storing the rich text string data, you will need to create a foreign key relationship to another new table that stores the formatting run data. Something like this:
Table: rich_text_string
rts_id NUMBER
contents VARCHAR2(4000)
with rts_id being the primary key, and:
Table: rts_formatting_runs
rts_id NUMBER
run_id NUMBER
run_pos NUMBER
run_len NUMBER
font_index NUMBER
with (rts_id, run_id) being the primary key, and rts_id referring back to the rich_text_string table.
Using your favorite Java-to-database framework (JDBC, Hibernate, etc.), store the String value into contents in rich_text_string, and the associated FormattingRun object data into rts_formatting_runs.
Just be careful - the font index is only valid within the workbook. You'll need to store the font information from the HSSFWorkbook also, to give the font_index meaning.
It's not stored as a CLOB, but the data are arguably more meaningful stored this way.

Related

How to generate random string with no duplicates in java

I read some answers; usually they use a set or some other data structure to ensure there are no duplicates. But in my situation, I have already stored a lot of random strings in a database, and I have to make sure that a newly generated random string does not already exist there.
I don't think retrieving all the random strings from the database into a set and then generating the new string is a good idea...
I found that System.currentTimeMillis() will generate a "random" number, but how to translate that number into a random string is a question... I need a string of length 8.
Any suggestion will be appreciated.

You can use the Apache Commons Lang library for this: RandomStringUtils
RandomStringUtils.randomAlphanumeric(8).toUpperCase() // for alphanumeric
RandomStringUtils.randomAlphabetic(8).toUpperCase() // for pure alphabets
randomAlphabetic(int count)
Creates a random string whose length is the number of characters specified, chosen from the set of alphabetic characters.
randomAlphanumeric(int count)
Creates a random string whose length is the number of characters specified, chosen from the set of alphanumeric characters.

So there are two issues here - creating the random string, and making sure there's no duplicate already in the db.
If you are not bound to 8 characters, you can use a UUID, as the commenter above suggested. The UUID class returns a string that is statistically very unlikely to duplicate a previously generated UUID, so you can use it for this purpose without checking whether it's already in your database.
UUID.randomUUID().toString();
which produces a string that looks something like this:
5e0013fd-3ed4-41b4-b05d-0cdf4324bb19
Or, if you don't care what the unique id is as long as it's unique, you could use an identity or auto-increment field, which pretty much all DBs support. If you do that, though, you have to read the record back after you commit it to get the identity assigned by the db.
If you have to have an 8-character string as your unique id and you don't want to import the Apache library, you can generate a random 8-character string like this:
final String alpha = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
final Random rand = new Random();
public String myUID() {
    int i = 8;
    String uid = "";
    while (i-- > 0) {
        uid += alpha.charAt(rand.nextInt(26));
    }
    return uid;
}
To make sure it's not a duplicate, you should add a unique index to the column in the db that contains it.
You can either query the db first to make sure that no row has that id before you insert the row, or catch the exception and retry if you've generated a duplicate.
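The catch-and-retry variant can be sketched as follows. This is only an illustration under stated assumptions: an in-memory Set stands in for the unique index in the database, and UniqueIdGenerator, randomId and insertUnique are hypothetical names. With a real database, the add() call would be an INSERT, and a unique-constraint violation would trigger the retry.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

public class UniqueIdGenerator {
    private static final String ALPHA = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    private static final SecureRandom RAND = new SecureRandom();

    // Builds a random string of the given length from the letters A-Z.
    static String randomId(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(ALPHA.charAt(RAND.nextInt(ALPHA.length())));
        }
        return sb.toString();
    }

    // 'taken' stands in for the unique index: add() returns false for a
    // duplicate, much as an INSERT would fail on a unique constraint.
    static String insertUnique(Set<String> taken, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            String id = randomId(8);
            if (taken.add(id)) {
                return id;
            }
        }
        throw new IllegalStateException("no free id after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        Set<String> taken = new HashSet<>();
        System.out.println(insertUnique(taken, 10));
    }
}
```

Bounding the number of attempts keeps the loop from spinning forever if the id space ever fills up.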

The method currentTimeMillis() returns the current time in milliseconds as a long. Convert the long to a String; for a 13-digit timestamp, s.substring(5, s.length()) gives you the last 8 digits, which are different for every millisecond (although they repeat every 100,000,000 ms, roughly every 28 hours).
public static void main(String[] args) {
    String s = String.valueOf(System.currentTimeMillis());
    System.out.println(s.substring(5, s.length()));
}
You still have to check each time whether this string already exists in your database.

Fast value access from string-based key path

I'm currently implementing a generic model for pivot-like data visualization in ColdFusion 9.
I'm not interested in supporting multiple measures and the model exposes a numeric valueAt(string colKey, string rowKey) function that can be called by a view in order to retrieve the resulting aggregation of a measure based on column and row dimensions.
For example, with the data set below, if the measure was AVG(Age) and the column dimension Rank, then model.valueAt('3', '') would return 2.33.
Wine Age Rank
WineA 3 3
WineB 4 2
WineC 2 3
WineD 2 3
Now, the data structure that naturally came to my mind was to use a java.util.HashMap to store the computed data, using a combination of column and row values converted to string as keys. This means that depending on the data set, I might potentially have a very large number of keys that will start with the same prefix.
I purposely created a large data set (1 million entries) with multiple strings having the same prefix and checked the percentage of bucket collisions I would get using the default java String.hashCode() algorithm and MurmurHash3.
Here's how I build the data set sample:
<cfset maxItemsCount = 1000000>
<cfset tokens = ['test', 'one', 'two', 'tree', 'four', 'five']>
<cfset tokensLen = arrayLen(tokens)>
<cfset items = []>
<cfset loopCount = 1>
<cfloop condition="arrayLen(items) lt maxItemsCount">
    <cfset item = ''>
    <cfloop from="1" to="#tokensLen#" index="i">
        <cfset item = listAppend(item, tokens[i] & loopCount, '_')>
        <cfset arrayAppend(items, item)>
    </cfloop>
    <cfset ++loopCount>
</cfloop>
With an array initialized to 2 * entries count, I got 27% collisions with String.hashCode() and 22% for Murmur. It took around 2580 milliseconds with java.util.HashMap only to store and retrieve keys once.
I'm looking for ideas on how to improve performance, whether by using a different data structure (perhaps nested hash maps?) or by finding a way to reduce the number of collisions without compromising the API signature.
Thanks!

With a million entries, there will always be some collisions (unless your array is much longer than 1e12 entries :D). I guess that MurmurHash does a perfect job here, but you could try MD5 for comparison (which is sort of guaranteed to do a perfect job).
"Now, the data structure that naturally came to my mind was to use a java.util.HashMap to store the computed data, using a combination of column and row values converted to string as keys. This means that depending on the data set, I might potentially have a very large number of keys that will start with the same prefix."
You're concatenating Strings and so producing quite a bit of garbage. It may be better to create a
@Value static class Key {
    private final String row;
    private final String column;
}
as a key for your HashMap, where @Value is a Lombok annotation generating all the boring stuff like equals, hashCode and the constructor.
You can easily do it without Lombok, and even a bit better:
static class Key {
    Key(String row, String column) {
        // Do NOT use 31 as a multiplier as it increases the number of collisions!
        // Try Murmur, too.
        hashCode = row.hashCode() + 113 * column.hashCode();
        this.row = row;
        this.column = column;
    }
    public int hashCode() {
        return hashCode;
    }
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Key)) return false;
        Key that = (Key) o;
        // Check hashCode first.
        if (this.hashCode != that.hashCode) return false;
        if (!this.row.equals(that.row)) return false;
        if (!this.column.equals(that.column)) return false;
        return true;
    }
    private final int hashCode;
    private final String row;
    private final String column;
}
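A short usage sketch, assuming the computed aggregations live in a HashMap keyed by such a composite key. PivotModel and put are hypothetical names; valueAt follows the signature given in the question, and returning NaN for a missing cell is an arbitrary choice made for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class PivotModel {
    // Composite key: equality and hash depend on both parts,
    // and no concatenated string is created per lookup.
    static final class Key {
        private final String row;
        private final String column;
        private final int hashCode;

        Key(String row, String column) {
            this.row = row;
            this.column = column;
            this.hashCode = row.hashCode() + 113 * column.hashCode();
        }

        @Override public int hashCode() { return hashCode; }

        @Override public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Key)) return false;
            Key that = (Key) o;
            return hashCode == that.hashCode
                && row.equals(that.row)
                && column.equals(that.column);
        }
    }

    private final Map<Key, Double> values = new HashMap<>();

    void put(String colKey, String rowKey, double value) {
        values.put(new Key(rowKey, colKey), value);
    }

    // Same signature as in the question; NaN marks a missing cell.
    double valueAt(String colKey, String rowKey) {
        Double v = values.get(new Key(rowKey, colKey));
        return v == null ? Double.NaN : v;
    }

    public static void main(String[] args) {
        PivotModel model = new PivotModel();
        model.put("3", "", 7.0 / 3); // AVG(Age) for Rank 3: (3 + 2 + 2) / 3
        System.out.println(model.valueAt("3", ""));
    }
}
```

Each lookup still allocates one small Key object, but it avoids building and hashing a concatenated string.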

How can i read and print Lucene index 4.0

I want to read the index from my Indexer file.
The result I want is all the terms of each document, along with their TF-IDF values.
Please suggest some example code for me. Thx :)

The first thing is to get a listing of documents. An alternative might be iterating through indexed terms, but the method IndexReader.terms() appears to have been removed from 4.0 (though it exists in AtomicReader, which could be worth looking at). The best method I'm aware of to get all documents is to simply loop through the documents by document id:
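//where reader is your IndexReader, however you go about opening/managing it
//isDeleted(int) is not available on the 4.0 IndexReader; check the live docs instead.
//liveDocs is null when the index has no deletions.
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i=0; i<reader.maxDoc(); i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    //operate on the document with id = i ...
}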
Then you need a listing of all indexed terms. I'm assuming we have no interest in stored fields, since the data you want doesn't make sense for them. For retrieving the terms you can use IndexReader.getTermVectors(int). Note, I'm not actually retrieving the document, since we don't need to access it directly. Continuing from where we left off:
String field;
FieldsEnum fieldsiterator;
TermsEnum termsiterator;
//To simplify, you can rely on DefaultSimilarity to calculate tf and idf for you.
DefaultSimilarity freqcalculator = new DefaultSimilarity();
//numDocs and maxDoc are not the same thing:
int numDocs = reader.numDocs();
int maxDoc = reader.maxDoc();
//liveDocs is null when the index has no deletions.
Bits liveDocs = MultiFields.getLiveDocs(reader);
for (int i=0; i<maxDoc; i++) {
    if (liveDocs != null && !liveDocs.get(i))
        continue;
    Fields vectors = reader.getTermVectors(i);
    if (vectors == null)
        continue; //no term vectors were stored for this document
    fieldsiterator = vectors.iterator();
    while ((field = fieldsiterator.next()) != null) {
        termsiterator = fieldsiterator.terms().iterator(null);
        while (termsiterator.next() != null) {
            //i = document id, field = field name
            //String representation of the current term
            String termtext = termsiterator.term().utf8ToString();
            //Get idf, using docFreq from the terms enum.
            //I haven't tested this, and I'm not quite 100% sure of the context of this method.
            //If it doesn't work, idfalternate below should.
            float idf = freqcalculator.idf(termsiterator.docFreq(), numDocs);
            float idfalternate = freqcalculator.idf(reader.docFreq(field, termsiterator.term()), numDocs);
        }
    }
}
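For sanity-checking the numbers without an index, the classic formulas that DefaultSimilarity applies can be reproduced in plain Java. This is a sketch based on my understanding of the 4.0-era DefaultSimilarity (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1))); ClassicTfIdf is a hypothetical class name, and the within-document frequency still has to be read from the index separately.

```java
public class ClassicTfIdf {
    // Square root of the within-document term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // 1 + ln(numDocs / (docFreq + 1)); the +1 avoids division by zero.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // A term that occurs 4 times in a document and appears in 9 of 1000 documents.
        float tfIdf = tf(4) * idf(9, 1000);
        System.out.println(tfIdf);
    }
}
```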

Is it possible to append 2 rich text strings?

I need to append two HSSFRichTextStrings in Java with Apache POI. How can I do this?
Specifically, I'm getting the rich text string already present in a cell, and I'm trying to append an additional rich text string to it and write it back to the cell.
Please tell me how to do this.

It is possible to append two HSSFRichTextStrings, but you will have to do most of the work yourself. You will need to take advantage of the following methods in HSSFRichTextString:
numFormattingRuns() - Returns the number of formatting runs in the HSSFRichTextString.
getFontOfFormattingRun(int) - Returns the short font index used by the formatting run at the specified run index.
applyFont(int, int, short) - Applies the font referred to by the short font index between the given start index (inclusive) and end index (exclusive).
First, create a little class to store formatting run stats:
public class FormattingRun {
    private int beginIdx;
    private int length;
    private short fontIdx;
    public FormattingRun(int beginIdx, int length, short fontIdx) {
        this.beginIdx = beginIdx;
        this.length = length;
        this.fontIdx = fontIdx;
    }
    public int getBegin() { return beginIdx; }
    public int getLength() { return length; }
    public short getFontIndex() { return fontIdx; }
}
Next, gather all of the formatting run statistics for each of the two strings. You'll have to walk the strings yourself to determine how long each formatting run lasts.
List<FormattingRun> formattingRuns = new ArrayList<FormattingRun>();
int numFormattingRuns = richTextString.numFormattingRuns();
for (int fmtIdx = 0; fmtIdx < numFormattingRuns; fmtIdx++)
{
    int begin = richTextString.getIndexOfFormattingRun(fmtIdx);
    short fontIndex = richTextString.getFontOfFormattingRun(fmtIdx);
    // Walk the string to determine the length of the formatting run.
    int length = 0;
    for (int j = begin; j < richTextString.length(); j++)
    {
        short currFontIndex = richTextString.getFontAtIndex(j);
        if (currFontIndex == fontIndex)
            length++;
        else
            break;
    }
    formattingRuns.add(new FormattingRun(begin, length, fontIndex));
}
Next, concatenate the two String values yourself and create the result HSSFRichTextString.
HSSFRichTextString result = new HSSFRichTextString(
richTextString1.getString() + richTextString2.getString());
Last, apply both sets of formatting runs, with the second set of runs being offset by the first string's length.
for (FormattingRun run1 : formattingRuns1)
{
    int begin = run1.getBegin();
    int end = begin + run1.getLength();
    short fontIdx = run1.getFontIndex();
    result.applyFont(begin, end, fontIdx);
}
for (FormattingRun run2 : formattingRuns2)
{
    // offset by string length 1
    int begin = run2.getBegin() + richTextString1.length();
    int end = begin + run2.getLength();
    short fontIdx = run2.getFontIndex();
    result.applyFont(begin, end, fontIdx);
}
That should do it for concatenating HSSFRichTextStrings.
If you ever want to concatenate XSSFRichTextStrings, found in .xlsx files, the process is very similar. One difference is that XSSFRichTextString#getFontOfFormattingRun will return an XSSFFont instead of a short font index. That's okay, because calling applyFont on an XSSFRichTextString takes an XSSFFont anyway. Another difference is that getFontOfFormattingRun may throw a NullPointerException if there is no font applied for the formatting run, which occurs when there is no different font applied than the font that is already there for the CellStyle for the entire Cell.
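The offset arithmetic from the HSSF steps above can be checked in isolation, without POI on the classpath. The sketch below is only an illustration: RunMerger and its Run class are hypothetical stand-ins for the FormattingRun class above, and merge produces the run list that would then be fed to applyFont on the concatenated string.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunMerger {
    // Minimal stand-in for the FormattingRun class above.
    static final class Run {
        final int begin;
        final int length;
        final short fontIdx;

        Run(int begin, int length, short fontIdx) {
            this.begin = begin;
            this.length = length;
            this.fontIdx = fontIdx;
        }
    }

    // Runs of the concatenated string: first-string runs are unchanged,
    // second-string runs are shifted right by the first string's length.
    static List<Run> merge(List<Run> first, List<Run> second, int firstLength) {
        List<Run> merged = new ArrayList<>(first);
        for (Run r : second) {
            merged.add(new Run(r.begin + firstLength, r.length, r.fontIdx));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Run> first = Arrays.asList(new Run(0, 5, (short) 1));
        List<Run> second = Arrays.asList(new Run(2, 3, (short) 2));
        // The first string is 6 characters long in this made-up example.
        for (Run r : merge(first, second, 6)) {
            System.out.println(r.begin + ".." + (r.begin + r.length) + " font " + r.fontIdx);
        }
    }
}
```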

If you're using XSSFRichTextStrings, you can't directly concatenate two RichTextStrings.
However, you can do so indirectly by taking the text value of the second RichTextString and then using the append method to append that string value with an applied font (rich text, in essence).
XSSFRichTextString rt1 = new XSSFRichTextString("Apache POI is");
rt1.applyFont(plainArial);
XSSFRichTextString rt2 = new XSSFRichTextString(" great!");
rt2.applyFont(boldArial);
String text = rt2.getString();
// append(String, XSSFFont) modifies rt1 in place and returns nothing.
rt1.append(text, boldArial);
cell1.setCellValue(rt1);

displaytag external paging/sorting and getting true row number

I'm using external paging/sorting with a custom TableDecorator and the following DisplayTag table in a JSP:
<display:table id="ixnlist" name="pageScope.itemList" sort="external"
decorator="org.mdibl.ctd.pwa.displaytag.decorator.IxnTableWrapper">
<display:column title="Row" property="rowNum" />
...more columns...
</display:table>
In the table decorator, getListIndex() returns the row number relative only to the current page, not to the overall list (i.e., if we're displaying 100 objects per page, then getListIndex() returns "0" at the top of page 2, not "100").
/**
 * Returns the row number data for the current row.
 *
 * @return String containing row number heading.
 */
public String getRowNum() {
    final StringBuilder out = new StringBuilder(8);
    out.append(nf.format(getListIndex() + 1))
       .append('.');
    return out.toString();
}
Is it possible in the table decorator to somehow get the row number reflecting the correct offset? Displaytag is aware of the offset someplace, as it uses it to format the pagination links.
The displaytag docs do not address this question, and the ${row_rowNum} implicit object works identically to getListIndex() in the decorator.
Yes, it's possible to do this by adding a row-number column to the paginated SQL and having the TableDecorator use that if available, but I'd rather not rely on the DAO for that kind of metadata. The following TableDecorator method takes advantage of a rownum column if it exists, otherwise it uses getListIndex():
/**
 * Returns the row number data for the current row.
 *
 * @return String containing row number heading.
 */
public String getRowNum() {
    final StringBuilder out = new StringBuilder(8);
    final Map row = (Map) getCurrentRowObject();
    // Use 'rnum' column for external pagination if it exists.
    // Kludgy way of doing this.
    if (row.get("rnum") != null) {
        out.append(nf.format(row.get("rnum")));
    } else {
        out.append(nf.format(getListIndex() + 1));
    }
    out.append('.');
    return out.toString();
}
Thanks.
/mcr

You should be able to calculate the correct overall index value by referencing the page number, which is in the request.
Code something like this in your TableDecorator class should work:
public String getIndex() {
    int numItemsPerPage = 100;
    // A missing "page" parameter is treated as the first page.
    String pageParam = getPageContext().getRequest().getParameter("page");
    int page = (pageParam == null) ? 1 : Integer.parseInt(pageParam);
    int index = getListIndex();
    return String.valueOf(((page - 1) * numItemsPerPage) + index + 1);
}
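The arithmetic itself can be checked in isolation. A minimal sketch, assuming 1-based page numbers and the 0-based index that getListIndex() returns; RowNumbers and absoluteRowNumber are hypothetical names.

```java
public class RowNumbers {
    // page is 1-based; listIndex is the 0-based index within the current page.
    static int absoluteRowNumber(int page, int pageSize, int listIndex) {
        return (page - 1) * pageSize + listIndex + 1;
    }

    public static void main(String[] args) {
        // Top row of page 2 with 100 rows per page.
        System.out.println(absoluteRowNumber(2, 100, 0)); // prints 101
    }
}
```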
