I'm using Lucene for an Eclipse plugin. Currently I iterate over my indexed terms like this:
I get a Terms instance using IndexReader.getTermVector(id, field)
I iterate over this instance using TermsEnum like this: while ((text = vectorEnum.next()) != null)
What I additionally want is to get the first n terms of a field. I figured I have to use PostingsEnum to accomplish this, but I don't understand how to use it. I guess I can obtain one by calling postings() on my TermsEnum, but I don't know what to do with it.
Edit:
That's the important part of my code I guess:
Terms vector = indexReader.getTermVector(id, field);
BytesRef text = null;
if (vector != null) {
    TermsEnum vectorEnum = vector.iterator();
    while ((text = vectorEnum.next()) != null) {
        String term = text.utf8ToString();
        [do stuff]
    }
}
And that's the FieldType:
FieldType fieldType = new FieldType();
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
fieldType.setStored(true);
fieldType.setStoreTermVectors(true);
fieldType.setTokenized(true);
Not sure why, but requesting positions via setIndexOptions alone doesn't seem to work; you have to explicitly call setStoreTermVectorPositions. You still have to set the index options to something other than NONE, but it doesn't seem necessary to use DOCS_AND_FREQS_AND_POSITIONS, i.e.
fieldType.setIndexOptions(IndexOptions.DOCS);
fieldType.setStoreTermVectorPositions(true);
then you can access the positions:
Terms vector = indexReader.getTermVector(id, field);
if (vector != null) {
    TermsEnum vectorEnum = vector.iterator();
    BytesRef text;
    while ((text = vectorEnum.next()) != null) {
        String term = text.utf8ToString();
        // request a postings enum that includes position information
        PostingsEnum postings = vectorEnum.postings(null, PostingsEnum.POSITIONS);
        while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int freq = postings.freq();
            while (freq-- > 0)
                logger.info("Position: {}", postings.nextPosition());
        }
    }
}
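Back to the original question of getting the first n terms of a field: the term vector enumerates terms in term (alphabetical) order, while the positions give document order, so one option is to invert the mapping. This is only a sketch building on the code above; termsByPosition, firstN and n are names invented here:
// Sketch: map position -> term, then read off the terms at the first n positions.
SortedMap<Integer, String> termsByPosition = new TreeMap<>();
Terms vector = indexReader.getTermVector(id, field);
if (vector != null) {
    TermsEnum vectorEnum = vector.iterator();
    BytesRef text;
    while ((text = vectorEnum.next()) != null) {
        String term = text.utf8ToString();
        PostingsEnum postings = vectorEnum.postings(null, PostingsEnum.POSITIONS);
        while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int freq = postings.freq();
            while (freq-- > 0) {
                termsByPosition.put(postings.nextPosition(), term);
            }
        }
    }
}
List<String> firstN = new ArrayList<>();
for (String term : termsByPosition.values()) {
    if (firstN.size() >= n) break; // n = how many leading terms you want
    firstN.add(term);
}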
I've got a pptx file with a simple presentation. It has a background image, white text on it, and this text has a shadow. I need to simplify the presentation and remove all these things (set the background to white, the font color to black, and remove the shadows).
Changing background and font colors is pretty straightforward, like this:
SlideShow ppt = SlideShowFactory.create(inputStream);
List<Slide> slides = ppt.getSlides();
for (int i = 0; i < slides.size(); i++) {
    Slide slide = slides.get(i);
    ((XSLFSlide) slide).getBackground().setFillColor(Color.white);
    XSLFTextShape[] shapes = ((XSLFSlide) slide).getPlaceholders();
    for (XSLFTextShape textShape : shapes) {
        List<XSLFTextParagraph> textparagraphs = textShape.getTextParagraphs();
        for (XSLFTextParagraph para : textparagraphs) {
            List<XSLFTextRun> textruns = para.getTextRuns();
            for (XSLFTextRun incomingTextRun : textruns) {
                incomingTextRun.setFontColor(Color.black);
            }
        }
    }
}
But I can't figure out how to remove the shadows. Here is an example of before and after.
I tried calling the getShadow() method on TextShape, but it returns null, and XSLFTextRun has no methods to manage text shadows. For HSLF I saw that there is setShadowed() on TextRun.
But how do I deal with shadows in XSLF?
Thanks!
UPDATE:
Thanks Axel Richter for the really valuable answer.
In my document I found two cases of shadowed text.
The first is as Axel described; the solution is to clear the shadow from the CTRegularTextRun. I also found out that XSLFTextParagraph.getTextRuns() may contain line-break objects, so before casting XSLFTextRun.getXmlObject() it's a good idea to check that it's an instance of CTRegularTextRun and not CTTextLineBreak.
Code:
private void clearShadowFromTextRun(XSLFTextRun run) {
    if (run.getXmlObject() instanceof CTRegularTextRun) {
        CTRegularTextRun cTRun = (CTRegularTextRun) run.getXmlObject();
        if (cTRun.getRPr() != null) {
            if (cTRun.getRPr().getEffectLst() != null) {
                if (cTRun.getRPr().getEffectLst().getOuterShdw() != null) {
                    cTRun.getRPr().getEffectLst().unsetOuterShdw();
                }
            }
        }
    }
}
Second case: the SlideMaster contains some style definitions for body and title. So if we want to remove all shadows completely, we should clear them too.
Code:
private void clearSlideMastersShadowStyles(XMLSlideShow ppt) {
    List<XSLFSlideMaster> slideMasters = ppt.getSlideMasters();
    for (XSLFSlideMaster slideMaster : slideMasters) {
        CTSlideMaster ctSlideMaster = slideMaster.getXmlObject();
        if (ctSlideMaster.getTxStyles() != null) {
            if (ctSlideMaster.getTxStyles().getTitleStyle() != null) {
                clearShadowsFromStyle(ctSlideMaster.getTxStyles().getTitleStyle());
            }
            if (ctSlideMaster.getTxStyles().getBodyStyle() != null) {
                clearShadowsFromStyle(ctSlideMaster.getTxStyles().getBodyStyle());
            }
            if (ctSlideMaster.getTxStyles().getOtherStyle() != null) {
                clearShadowsFromStyle(ctSlideMaster.getTxStyles().getOtherStyle());
            }
        }
    }
}

private void clearShadowsFromStyle(CTTextListStyle ctTextListStyle) {
    if (ctTextListStyle.getLvl1PPr() != null) {
        if (ctTextListStyle.getLvl1PPr().getDefRPr() != null)
            if (ctTextListStyle.getLvl1PPr().getDefRPr().getEffectLst() != null)
                if (ctTextListStyle.getLvl1PPr().getDefRPr().getEffectLst().getOuterShdw() != null)
                    ctTextListStyle.getLvl1PPr().getDefRPr().getEffectLst().unsetOuterShdw();
    }
    // same stuff for the other 8 levels. Ugly, uhh... (see the loop-based sketch below)
}
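Since CTTextListStyle exposes the nine levels only through separate getLvl1PPr() ... getLvl9PPr() accessors, one way to avoid repeating that block nine times is to look the accessors up via reflection. This is just a sketch of that idea (not something XSLF provides), assuming the same ooxml-schemas classes as above:
private void clearShadowsFromStyle(CTTextListStyle ctTextListStyle) {
    for (int lvl = 1; lvl <= 9; lvl++) {
        try {
            // CTTextListStyle has getLvl1PPr() ... getLvl9PPr(); look each one up by name
            java.lang.reflect.Method getter =
                    CTTextListStyle.class.getMethod("getLvl" + lvl + "PPr");
            CTTextParagraphProperties pPr =
                    (CTTextParagraphProperties) getter.invoke(ctTextListStyle);
            if (pPr != null
                    && pPr.getDefRPr() != null
                    && pPr.getDefRPr().getEffectLst() != null
                    && pPr.getDefRPr().getEffectLst().getOuterShdw() != null) {
                pPr.getDefRPr().getEffectLst().unsetOuterShdw();
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e); // should not happen for levels 1-9
        }
    }
}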
Setting the text shadow is not yet implemented in XSLFTextRun. But of course it is set in the XML.
A run having shadowed text looks like:
<a:r>
  <a:rPr lang="de-DE" smtClean="0" dirty="0" b="1">
    <a:effectLst>
      <a:outerShdw dir="2700000" algn="tl" dist="38100" blurRad="38100">
        <a:srgbClr val="000000">
          <a:alpha val="43137"/>
        </a:srgbClr>
      </a:outerShdw>
    </a:effectLst>
  </a:rPr>
  <a:t>The text...</a:t>
</a:r>
As you see, there is an rPr (run properties) element containing an effectLst, which in turn contains an outerShdw element. We can use ooxml-schemas classes and methods to set and unset this.
...
incomingTextRun.setFontColor(Color.black);
org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun cTRun =
        (org.openxmlformats.schemas.drawingml.x2006.main.CTRegularTextRun) incomingTextRun.getXmlObject();
if (cTRun.getRPr() != null) {
    if (cTRun.getRPr().getEffectLst() != null) {
        if (cTRun.getRPr().getEffectLst().getOuterShdw() != null) {
            cTRun.getRPr().getEffectLst().unsetOuterShdw();
        }
    }
}
...
I'm using Apache PDFBox from Java, and I have a source PDF with multiple optional content groups. What I want to do is export a version of the PDF that includes only the standard content and the optional content groups that were enabled. It is important for my purposes that I preserve any dynamic aspects of the original, so text fields are still text fields, vector images are still vector images, etc. The reason this is required is that I ultimately intend to use a PDF form editor program that does not know how to handle optional content and would blindly render all of it, so I want to preprocess the source PDF and use the form editing program on a less cluttered destination PDF.
I've been trying to find something on Google that could give me any hints on how to do this, but to no avail. I don't know if I'm just using the wrong search terms, or if this is just something outside of what the PDFBox API was designed for; I rather hope it's not the latter. The info shown here does not seem to work (converting the C# code to Java), because despite the PDF I'm trying to import having optional content, there do not seem to be any OC resources when I examine the tokens on each page.
for (PDPage page : pages) {
    PDResources resources = page.getResources();
    PDFStreamParser parser = new PDFStreamParser(page);
    parser.parse();
    Collection tokens = parser.getTokens();
    ...
}
I'm truly sorry for not having any more code to show what I've tried so far, but I've just been poring over the java API docs for about 8 hours now trying to figure out what I might need to do this, and just haven't been able to figure it out.
What I DO know how to do is add text, lines, and images to a new PDPage, but I do not know how to retrieve that information from a given source page to copy it over, nor how to tell which optional content group such information is part of (if any). I am also not sure how to copy form fields in the source pdf over to the destination, nor how to copy the font information over.
Honestly, if there's a web page out there that I wasn't able to find with google with the searches that I tried, I'd be entirely happy to read up more about it, but I am really quite stuck here, and I don't know anyone personally that knows about this library.
Please help.
EDIT:
Trying what I understand from what was suggested below, I've written a loop to examine each XObject on the page as follows:
PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for (COSName name : names) {
    PDXObject xobj = resources.getXObject(name);
    PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
    parser.parse();
    Object[] tokens = parser.getTokens().toArray();
    for (int i = 0; i < tokens.length - 1; i++) {
        Object obj = tokens[i];
        if (obj instanceof COSName && obj.equals(COSName.OC)) {
            i++;
            obj = tokens[i]; // reassign rather than redeclare obj
            if (obj instanceof COSName) {
                PDPropertyList props = resources.getProperties((COSName) obj);
                if (props != null) {
                    ...
However, after an OC key, the next entry in the tokens array is always an Operator tagged as "BMC". Nowhere am I finding anything I can relate to the named optional content groups.
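As a quick sanity check before parsing content streams, the optional content groups defined at the document level can be listed via the catalog's OCProperties. A small sketch, assuming PDFBox 2.x and a placeholder file name in.pdf:
// Sketch (PDFBox 2.x): list the OCGs defined in the document catalog.
try (PDDocument doc = PDDocument.load(new File("in.pdf"))) {
    PDOptionalContentProperties ocProps = doc.getDocumentCatalog().getOCProperties();
    if (ocProps == null) {
        System.out.println("No optional content defined in this document");
    } else {
        for (String name : ocProps.getGroupNames()) {
            // isGroupEnabled reflects the default visibility configuration
            System.out.println(name + " enabled=" + ocProps.isGroupEnabled(name));
        }
    }
}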
Here's a robust solution for removing marked content blocks (open to feedback if anyone finds anything that isn't working right). You should be able to adjust for OC blocks...
This code properly handles nesting and removal of resources (xobject, graphics state and fonts - easy to add others if needed).
public class MarkedContentRemover {

    private final MarkedContentMatcher matcher;

    public MarkedContentRemover(MarkedContentMatcher matcher) {
        this.matcher = matcher;
    }

    public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
        ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
        PDResources pdResources = page.getResources();
        PDFStreamParser pdParser = new PDFStreamParser(page);
        PDStream newContents = new PDStream(doc);
        OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);

        List<Object> operands = new ArrayList<>();
        Operator operator = null;
        Object token;
        int suppressDepth = 0;
        boolean resumeOutputOnNextOperator = false;
        int removedCount = 0;

        while (true) {
            operands.clear();
            token = pdParser.parseNextToken();
            while (token != null && !(token instanceof Operator)) {
                operands.add(token);
                token = pdParser.parseNextToken();
            }
            operator = (Operator) token;
            if (operator == null) break;

            if (resumeOutputOnNextOperator) {
                resumeOutputOnNextOperator = false;
                suppressDepth--;
                if (suppressDepth == 0)
                    removedCount++;
            }

            if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
                    || OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
                COSName contentId = (COSName) operands.get(0);
                final COSDictionary properties;
                if (operands.size() > 1) {
                    Object propsOperand = operands.get(1);
                    if (propsOperand instanceof COSDictionary) {
                        properties = (COSDictionary) propsOperand;
                    } else if (propsOperand instanceof COSName) {
                        properties = pdResources.getProperties((COSName) propsOperand).getCOSObject();
                    } else {
                        properties = new COSDictionary();
                    }
                } else {
                    properties = new COSDictionary();
                }
                // while already suppressing, count nested BDC/BMC as well, so an
                // inner EMC does not end suppression of the outer matched block
                if (suppressDepth > 0 || matcher.matches(contentId, properties)) {
                    suppressDepth++;
                }
            }

            if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
                if (suppressDepth > 0)
                    resumeOutputOnNextOperator = true;
            }
            else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
            }
            else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
            }
            else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
            }

            if (suppressDepth == 0) {
                newContentWriter.writeTokens(operands);
                newContentWriter.writeTokens(operator);
            }
        }
        if (resumeOutputOnNextOperator)
            removedCount++;
        newContentOutput.close();
        page.setContents(newContents);
        resourceSuppressionTracker.updateResources(pdResources);
        return removedCount;
    }
    private static class ResourceSuppressionTracker {
        // TRUE means the resource should be removed; FALSE means it must be preserved
        private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();

        public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
            if (!(resourceNameOperand instanceof COSName)) return;
            if (preserve) {
                markForPreservation(resourceType, (COSName) resourceNameOperand);
            } else {
                markForRemoval(resourceType, (COSName) resourceNameOperand);
            }
        }

        public void markForRemoval(COSName resourceType, COSName refId) {
            if (!resourceIsPreserved(resourceType, refId)) {
                getResourceTracker(resourceType).put(refId, Boolean.TRUE);
            }
        }

        public void markForPreservation(COSName resourceType, COSName refId) {
            getResourceTracker(resourceType).put(refId, Boolean.FALSE);
        }

        public void updateResources(PDResources pdResources) {
            for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
                for (Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
                    if (refEntry.getValue().equals(Boolean.TRUE)) {
                        // remove the entry from the dictionary of its own resource type
                        // (XObject, ExtGState or Font)
                        COSDictionary resourceDict =
                                pdResources.getCOSObject().getCOSDictionary(resourceEntry.getKey());
                        if (resourceDict != null) {
                            resourceDict.removeItem(refEntry.getKey());
                        }
                    }
                }
            }
        }

        private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
            return getResourceTracker(resourceType).getOrDefault(refId, Boolean.FALSE);
        }

        private Map<COSName, Boolean> getResourceTracker(COSName resourceType) {
            if (!tracker.containsKey(resourceType)) {
                tracker.put(resourceType, new HashMap<>());
            }
            return tracker.get(resourceType);
        }
    }
}
Helper interface:
public interface MarkedContentMatcher {
    boolean matches(COSName contentId, COSDictionary props);
}
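A usage sketch (not from the original answer): for optional content, the BDC tag is /OC and the resolved property list is the OCG dictionary, whose /Name entry identifies the layer. The file names and the layer name "Layer1" below are placeholders:
// Hypothetical usage: suppress all content tagged with the OCG named "Layer1".
try (PDDocument doc = PDDocument.load(new File("in.pdf"))) {
    MarkedContentRemover remover = new MarkedContentRemover(
            (contentId, props) -> COSName.OC.equals(contentId)
                    && "Layer1".equals(props.getString(COSName.NAME)));
    for (PDPage page : doc.getPages()) {
        remover.removeMarkedContent(doc, page);
    }
    doc.save(new File("out.pdf"));
}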
Optional Content Groups are marked with BDC and EMC. You will have to navigate through all of the tokens returned from the parser and remove the "section" from the array. Here is some C# code that was posted a while ago: How to delete an optional content group along with its content from pdf using pdfbox?
I investigated that (converting to Java) but couldn't get it to work as expected. I managed to remove the content between BDC and EMC and then save the result using the same technique as the sample, but the PDF was corrupted. Perhaps that is my lack of C# knowledge (related to Tuples etc.).
Here is what I came up with; as I said, it doesn't work. Perhaps you or someone else (mkl, Tilman Hausherr) can spot the flaw.
void OCGDelete(PDDocument doc, int pageNum, String OCName) {
    PDPage pdPage = doc.getDocumentCatalog().getPages().get(pageNum);
    PDResources pdResources = pdPage.getResources();
    PDFStreamParser pdParser = new PDFStreamParser(pdPage);
    pdParser.parse();

    int ocgStart = -1;
    int ocgLength = -1;
    int startIndex = -1;
    List<Object> newTokens = new ArrayList<>(pdParser.getTokens());

    try {
        for (int index = 0; index < newTokens.size(); index++) {
            Object obj = newTokens.get(index);
            if (obj instanceof COSName && obj.equals(COSName.OC)) {
                // found optional content
                startIndex = index;
                index++;
                if (index < newTokens.size()) {
                    obj = newTokens.get(index);
                    if (obj instanceof COSName) {
                        PDPropertyList prop = pdResources.getProperties((COSName) obj);
                        if (prop instanceof PDOptionalContentGroup) {
                            if (((PDOptionalContentGroup) prop).getName().equals(OCName)) {
                                System.out.println("Found the layer to be deleted: " + OCName);
                                index++;
                                if (index < newTokens.size()) {
                                    obj = newTokens.get(index);
                                    if (obj instanceof Operator && ((Operator) obj).getName().equals("BDC")) {
                                        ocgStart = index;
                                        System.out.println("OCG start " + ocgStart);
                                        ocgLength = -1;
                                        index++;
                                        while (index < newTokens.size()) {
                                            ocgLength++;
                                            obj = newTokens.get(index);
                                            System.out.println("Loop through relevant OCG tokens " + obj);
                                            if (obj instanceof Operator && ((Operator) obj).getName().equals("EMC")) {
                                                System.out.println("OCG end " + (ocgLength++));
                                                break;
                                            }
                                            index++;
                                        }
                                        System.out.println("End index was " + (startIndex + ocgLength));
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    } catch (Exception ex) {
        System.out.println(ex.getMessage());
    }

    // remove the tokens of the marked-content section
    for (int i = ocgStart; i < ocgStart + ocgLength; i++) {
        newTokens.remove(i);
    }

    PDStream newContents = new PDStream(doc);
    OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(output);
    writer.writeTokens(newTokens);
    output.close();
    pdPage.setContents(newContents);
}
I'm just a Lucene starter and I got stuck on a problem during a change from a RAMDirectory to an FSDirectory.
First, my code:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
        new StandardAnalyzer(Version.LUCENE_43));

Directory DIR = FSDirectory.open(new File(INDEXLOC)); // INDEXLOC = "path/to/dir/"
// RAMDirectory DIR = new RAMDirectory();

// Index some made up content
IndexWriter writer = new IndexWriter(DIR, iwc);

// Store both position and offset information
FieldType type = new FieldType();
type.setStored(true);
type.setStoreTermVectors(true);
type.setStoreTermVectorOffsets(true);
type.setStoreTermVectorPositions(true);
type.setIndexed(true);
type.setTokenized(true);

IDocumentParser p = DocumentParserFactory.getParser(f);
ArrayList<ParserDocument> DOCS = p.getParsedDocuments();
for (int i = 0; i < DOCS.size(); i++) {
    Document doc = new Document();
    Field id = new StringField("id", "doc_" + i, Field.Store.YES);
    doc.add(id);
    Field text = new Field("content", DOCS.get(i).getContent(), type);
    doc.add(text);
    writer.addDocument(doc);
}
writer.close();

// Get a searcher
IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(DIR));

// Do a search using SpanQuery
SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "zahl"));
TopDocs results = searcher.search(fleeceQ, 10);
for (int i = 0; i < results.scoreDocs.length; i++) {
    ScoreDoc scoreDoc = results.scoreDocs[i];
    System.out.println("Score Doc: " + scoreDoc);
}

IndexReader reader = searcher.getIndexReader();
AtomicReader wrapper = SlowCompositeReaderWrapper.wrap(reader);
Map<Term, TermContext> termContexts = new HashMap<Term, TermContext>();
Spans spans = fleeceQ.getSpans(wrapper.getContext(), new Bits.MatchAllBits(reader.numDocs()), termContexts);

int window = 2; // get the words within two of the match
while (spans.next() == true) {
    Map<Integer, String> entries = new TreeMap<Integer, String>();
    System.out.println("Doc: " + spans.doc() + " Start: " + spans.start() + " End: " + spans.end());
    int start = spans.start() - window;
    int end = spans.end() + window;
    Terms content = reader.getTermVector(spans.doc(), "content");
    TermsEnum termsEnum = content.iterator(null);
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
        // could store the BytesRef here, but String is easier for this example
        String s = new String(term.bytes, term.offset, term.length);
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null);
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            int i = 0;
            int position = -1;
            while (i < positionsEnum.freq() && (position = positionsEnum.nextPosition()) != -1) {
                if (position >= start && position <= end) {
                    entries.put(position, s);
                }
                i++;
            }
        }
    }
    System.out.println("Entries:" + entries);
}
It's just some code I found on a great website and wanted to try. Everything works great using the RAMDirectory, but if I change it to my FSDirectory it gives me a NullPointerException like:
Exception in thread "main" java.lang.NullPointerException at
com.org.test.TextDB.myMethod(TextDB.java:184) at
com.org.test.Main.main(Main.java:31)
The statement Terms content = reader.getTermVector(spans.doc(), "content"); seems to get no result and returns null, hence the exception. But why? With my RAMDirectory everything works fine.
It seems that the IndexWriter or the reader (I really don't know which) didn't write or didn't read the field "content" properly from the index. But I really don't know why it's 'written' in a RAMDirectory and not in an FSDirectory.
Anybody got an idea?
Gave this a quick test run, and I can't reproduce your issue.
I think the most likely issue here is old documents in your index. The way this is written, every time it runs, more documents are added to your index. Old documents from previous runs won't get deleted or overwritten; they'll just stick around. So if you have run this before on the same directory (say, before you added the line type.setStoreTermVectors(true);), some of your results may be these old documents without term vectors, and reader.getTermVector(...) returns null if the document does not store term vectors.
Of course, anything indexed in a RAMDirectory will be dropped as soon as execution finishes, so the issue would not occur in that case.
A simple solution would be to delete the index directory and run it again.
If you want to start with a fresh index when you run this, you can set that up through the IndexWriterConfig:
private static IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_43,
new StandardAnalyzer(Version.LUCENE_43));
iwc.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
That's a guess, of course, but seems consistent with the behavior you've described.
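Independent of the open mode, it may also be worth guarding against documents that lack term vectors before using the result. A minimal defensive tweak to the loop in the question (same 4.3-era API, only adding a null check):
Terms content = reader.getTermVector(spans.doc(), "content");
if (content == null) {
    // document was indexed without term vectors (e.g. by an older run); skip it
    continue;
}
TermsEnum termsEnum = content.iterator(null);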
I have most of a parent/child-doc solution for a problem I'm working on, but I ran into a hitch: from inside a facet that iterates over the child docs I need to access the value of a parent doc field. I have (or I can get) the parent doc ID (from the _parent field of the child doc, or worst case by indexing it again as a normal field) but that's an "external" ID, not the node-internal ID that I need to load the field value from the field cache. (I'm using default routing so the parent doc is definitely in the same shard as the children.)
More concretely, here's what I have in the FacetCollector so far (ES 0.20.6):
protected void doSetNextReader(IndexReader reader, int docBase) throws IOException {
    /* not sure this will work, otherwise I can index the field separately */
    parentFieldData = (LongFieldData) fieldDataCache.cache(FieldDataType.DefaultTypes.LONG, reader, "_parent");
    parentSpringinessFieldData = (FloatFieldData) fieldDataCache.cache(FieldDataType.DefaultTypes.FLOAT, reader, "springiness");
    /* ... */
}

protected void doCollect(int doc) throws IOException {
    long parentID = parentFieldData.value(doc); // or whatever the correct equivalent here is
    // here's the problem:
    parentSpringiness = parentSpringinessFieldData.value(parentID);
    // type error: expected int (node-internal ID), got long (external ID)
}
Any suggestions? (I can't upgrade to 0.90 yet but would be interested to hear if that would help.)
Honking great disclaimer: (1) I ended up not using this approach at all, so this is only slightly-tested code, and (2) as far as I can see it will be pretty horribly inefficient, and it has the same memory overhead as parent queries. If another approach will work for you, do consider it (for my use case I ended up using nested documents, with a custom facet collector that iterates over both the nested and the parent documents, to have easy access to the field values of both).
The example within the ES code to work from is org.elasticsearch.index.search.child.ChildCollector. The first element you need is in the Collector initialisation:
try {
    context.idCache().refresh(context.searcher().subReaders());
} catch (Exception e) {
    throw new FacetPhaseExecutionException(facetName, "Failed to load parent-ID cache", e);
}
This makes possible the following line in doSetNextReader():
typeCache = context.idCache().reader(reader).type(parentType);
which gives you a lookup of the parent doc's UId in doCollect(int childDocId):
HashedBytesArray postingUid = typeCache.parentIdByDoc(childDocId);
The parent document won't necessarily be found in the same reader as the child doc: when the Collector initialises you also need to store all readers (needed to access the field value) and for each reader an IdReaderTypeCache (to resolve the parent doc's UId to a reader-internal docId).
this.readers = new Tuple[context.searcher().subReaders().length];
for (int i = 0; i < readers.length; i++) {
    IndexReader reader = context.searcher().subReaders()[i];
    readers[i] = new Tuple<IndexReader, IdReaderTypeCache>(reader, context.idCache().reader(reader).type(parentType));
}
this.context = context;
Then when you need the parent doc field, you have to iterate over the reader/typecache pairs looking for the right one:
int parentDocId = -1;
for (Tuple<IndexReader, IdReaderTypeCache> tuple : readers) {
    IndexReader indexReader = tuple.v1();
    IdReaderTypeCache idReaderTypeCache = tuple.v2();
    if (idReaderTypeCache == null) { // might be if we don't have that doc with that type in this reader
        continue;
    }
    parentDocId = idReaderTypeCache.docById(postingUid);
    if (parentDocId != -1 && !indexReader.isDeleted(parentDocId)) {
        FloatFieldData parentSpringinessFieldData = (FloatFieldData) fieldDataCache.cache(
                FieldDataType.DefaultTypes.FLOAT,
                indexReader,
                "springiness");
        parentSpringiness = parentSpringinessFieldData.value(parentDocId);
        break;
    }
}
if (parentDocId == -1) {
    throw new FacetPhaseExecutionException(facetName, "Parent doc " + postingUid + " could not be found!");
}
I use HtmlCleaner to parse HTML files. Here is an example from an HTML file:
.......<div class="name">Name</div>;......
I get the word Name using this construction in my code:
HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
props.setAllowHtmlInsideAttributes(true);
props.setAllowMultiWordAttributes(true);
props.setRecognizeUnicodeChars(true);
props.setOmitComments(true);
rootNode = cleaner.clean(htmlPage);
TagNode linkElements[] = rootNode.getElementsByName("div", true);
for (int i = 0; linkElements != null && i < linkElements.length; i++)
{
    String classType = linkElements[i].getAttributeByName("class");
    if (classType != null)
    {
        if (classType.equals(CSSClassname)) { linkList.add(linkElements[i]); }
    }
    System.out.println("TagNode" + linkElements[i].getText());
    linkList.add(linkElements[i]);
}
and then I add all of these names to a ListView using linkElements[i].getText().toString();
But I don't understand how I can get the link in my example. I want to get the link http://exxample.com, but I don't know what to do.
Please help me. I read the tutorial and used the function but couldn't get it to work.
P.S. Sorry for my bad English.
I don't use HtmlCleaner, but according to the javadoc you do it this way:
List<String> links = new ArrayList<String>();
for (TagNode aTag : linkElements[i].getElementListByName("a", false))
{
    String link = aTag.getAttributeByName("href");
    if (link != null && link.length() > 0) links.add(link);
}
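If the anchor tag is nested somewhere deeper inside the div rather than being a direct child, a slightly broader sketch is to search the div's whole subtree. This assumes the link actually lives inside a div with class "name", which the truncated HTML sample doesn't show:
List<String> links = new ArrayList<String>();
for (TagNode div : rootNode.getElementListByName("div", true)) {
    if (!"name".equals(div.getAttributeByName("class"))) continue;
    // search the entire subtree of the matching div for anchor tags
    for (TagNode aTag : div.getElementListByName("a", true)) {
        String link = aTag.getAttributeByName("href");
        if (link != null && link.length() > 0) links.add(link);
    }
}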
P.S.: you posted clearly uncompilable code
P.P.S.: why don't you use some library that creates an ordinary DOM tree from HTML? That way you'd be able to work with the parsed document using a commonly-known API.