How do I choose the best k-means clustering in Weka (Java)?

As you can see from the two results below, I get two different clusterings when using different seeds. I would like to choose the better of the two.
I know that a lower within-cluster sum of squared errors is better. However, the two runs show almost the same squared error even though I used different seeds. I want to know why the squared errors are so similar. I also want to know what else I need to consider when selecting the best clustering.
*******************************************************************
kMeans
======
Number of iterations: 10
Within cluster sum of squared errors: 527.6988818392938
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2781) (2117)
=====================================================
fixedacidity 6.8548 6.9565 6.7212
volatileacidity 0.2782 0.2826 0.2725
citricacid 0.3342 0.3389 0.3279
residualsugar 6.3914 8.2678 3.9265
chlorides 0.0458 0.0521 0.0374
freesulfurdioxide 35.3081 38.6897 30.8658
totalsulfurdioxide 138.3607 155.2585 116.1627
density 0.994 0.9958 0.9916
pH 3.1883 3.1691 3.2134
sulphates 0.4898 0.492 0.4871
alcohol 10.5143 9.6325 11.6726
quality 5.8779 5.4779 6.4034
Time taken to build model (full training data) : 0.19 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2781 ( 57%)
1 2117 ( 43%)
***********************************************************************
kMeans
======
Number of iterations: 7
Within cluster sum of squared errors: 527.6993178146143
Missing values globally replaced with mean/mode
Cluster centroids:
Cluster#
Attribute Full Data 0 1
(4898) (2122) (2776)
=====================================================
fixedacidity 6.8548 6.7208 6.9572
volatileacidity 0.2782 0.2723 0.2828
citricacid 0.3342 0.3281 0.3389
residualsugar 6.3914 3.9451 8.2614
chlorides 0.0458 0.0374 0.0522
freesulfurdioxide 35.3081 30.9105 38.6697
totalsulfurdioxide 138.3607 116.2175 155.2871
density 0.994 0.9917 0.9958
pH 3.1883 3.2137 3.1689
sulphates 0.4898 0.4876 0.4916
alcohol 10.5143 11.6695 9.6312
quality 5.8779 6.4043 5.4755
Time taken to build model (full training data) : 0.15 seconds
=== Model and evaluation on training set ===
Clustered Instances
0 2122 ( 43%)
1 2776 ( 57%)

Define "best result".
By the definition of k-means, a lower sum of squares is better.
Anything else is worse by the k-means criterion, but that doesn't mean a different quality criterion (or a different clustering algorithm) couldn't be more helpful for your actual problem.

Using different seeds does not guarantee you will get different clusterings. In your case the two runs converged to essentially the same partition with the cluster labels swapped (compare the centroids: cluster 0 of the first run matches cluster 1 of the second), which is why the within-cluster sums of squared errors are nearly identical.
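If you want to compare runs programmatically rather than by eye, a minimal sketch against Weka's SimpleKMeans API could loop over seeds and keep the run with the lowest within-cluster sum of squared errors (the dataset path is a hypothetical placeholder):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BestSeedKMeans {
    public static void main(String[] args) throws Exception {
        // hypothetical dataset path; replace with your own ARFF file
        Instances data = DataSource.read("winequality-white.arff");

        SimpleKMeans best = null;
        double bestError = Double.MAX_VALUE;
        for (int seed = 1; seed <= 20; seed++) {
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2);
            km.setSeed(seed);
            km.buildClusterer(data);
            // within-cluster sum of squared errors for this run
            double err = km.getSquaredError();
            if (err < bestError) {
                bestError = err;
                best = km;
            }
        }
        System.out.println("Best within-cluster SSE: " + bestError);
        System.out.println(best);
    }
}

Keep in mind that a lower SSE only means a better k-means solution for the same k; it says nothing about whether k-means (or k = 2) is the right model for your data.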

Related

How to fix search returning hits for missing data in Liferay?

In Liferay we are looking for articles that satisfy certain conditions by using the following code:
Hits hits = indexSearcherHelper.search(searchContext, query);
The search query we use is defined as:
BooleanFilter filter = new BooleanFilter();
filter.addRequiredTerm(Field.GROUP_ID, globalSiteId);
filter.addRequiredTerm(Field.STATUS, WorkflowConstants.STATUS_APPROVED);
filter.addRequiredTerm("ddmStructureKey", "TEST");
filter.addRequiredTerm("head", true);
MatchAllQuery query = new MatchAllQuery();
query.setPreBooleanFilter(filter);
and this search finds multiple hits.
Then we attempt to get the article like this:
JournalArticleResource journalArticleResource = journalArticleResourceLocalService.getArticleResource(GetterUtil.getLong(hits.toList().get(0).get(Field.ENTRY_CLASS_PK)));
JournalArticle article = journalArticleLocalService.getArticle(journalArticleResource.getGroupId(), journalArticleResource.getArticleId());
However, this produces the following error:
No JournalArticleResource exists with the primary key 809477.
In 95% of cases, this code works as expected. But in some cases (on some environments), it appears that the index search finds results that are not valid. Why does this happen?
Can it be that the index has some stale records from old, already deleted articles? Do we need to reindex the database?
UPDATE 1: I have observed a very strange behaviour of the index search:
The following code:
for (int counter = 0; counter < 10; counter++) {
    System.out.println(counter);
    System.out.println(indexSearcherHelper.search(searchContext, query).toList().size());
}
produces this result:
0
0
1
4
2
7
3
0
4
4
5
7
6
0
7
4
8
7
9
0
In reality there is only one matching result to be found. On all other environments this code finds exactly one result in all 10 searches, since we added only one article.
In this case, however, it keeps finding zero, four, or seven results, repeating the same pattern.
What is going on here? Is the database corrupted? Is it a Liferay bug? How can the same search return a different number of results each time?
(By the way, last year we did a live database migration from one server to another, that is, a migration of the database while Liferay was up and running [not a good idea] to reduce the production downtime, so I am afraid that we might be experiencing database corruption here.)
UPDATE 2: As requested in the comments, here is the version of Liferay we are using and an example search hit, with the values of some fields modified since this is a production example from a closed-source application.
Version:
Liferay Community Edition Portal 7.0.4 GA5 (Wilberforce / Build 7004 / October 23, 2017)
System.out.println(hits.toList().get(0));
{
ddmTemplateKey=[673861],
entryClassPK=[809477],
ddm__keyword__673858__LActive_hr_HR=[true],
publishDate=[20211116063000],
ddm__keyword__673858__SActive_hr_HR=[false],
ddm__keyword__673858__GNA_en_US_String_sortable=[ne],
ddm__text__673858__OList_hr_HR_String_sortable=[32554651079],
classNameId=[0],
ddm__keyword__673858__SActive_en_US_String_sortable=[false],
ddm__keyword__673858__O_hr_HR_String_sortable=[opis pop upa],
modified_sortable=[1637050218921],
title_hr_HR=[Test ss n],
ddm__keyword__673858__O_en_US=[Opis pop upa],
version=[2.4],
ddm__keyword__673858__B_en_US=[grey],
ddm__keyword__673858__SActive_hr_HR_String_sortable=[false],
ddm__keyword__673858__OAll_en_US_String_sortable=[false],
status=[0],
ddm__keyword__673858__GPA_en_US=[OK],
publishDate_sortable=[1637044200000],
content_hr_HR=[OK 32554651079 NE true Opis pop upa all true Test pop najnoviji Utorak grey false all false /ervices],
ddm__keyword__673858__TR_en_US=[all],
ddm__keyword__673858__B_hr_HR=[grey],
uid=[com.liferay.journal.model.JournalArticle_PORTLET_811280],
localized_title_en_US_sortable=[test ss n],
layoutUuid=[],
ddm__text__673858__OList_en_US=[32554651079],
ddm__keyword__673858__GNA_hr_HR=[NE],
ddm__keyword__673858__TR_en_US_String_sortable=[all],
ddm__keyword__673858__GNA_hr_HR_String_sortable=[ne],
createDate=[20211115132217],
ddm__keyword__673858__OAll_hr_HR_String_sortable=[false],
displayDate_sortable=[1637044200000],
ddm__keyword__673858__O_en_US_String_sortable=[opis pop upa],
entryClassName=[com.liferay.journal.model.JournalArticle],
ddm__keyword__673858__N_en_US=[Test pop najnoviji Utorak],
ddm__keyword__673858__S_hr_HR_String_sortable=[all],
userId=[30588],
localized_title_en_US=[test ss n],
ddm__keyword__673858__N_hr_HR_String_sortable=[test pop najnoviji utorak],
ddm__keyword__673858__OListActive_hr_HR=[true],
ddm__keyword__673858__GPA_hr_HR_String_sortable=[ok],
treePath=[, 673853],
ddm__keyword__673858__B_en_US_String_sortable=[grey],
ddm__keyword__673858__S_hr_HR=[all],
groupId=[20152],
ddm__keyword__673858__B_hr_HR_String_sortable=[grey],
createDate_sortable=[1636982537964],
classPK=[0],
ddm__keyword__673858__S_en_US_String_sortable=[all],
ddm__keyword__673858__GPA_hr_HR=[OK],
scopeGroupId=[20152],
articleId_String_sortable=[809475],
ddm__keyword__673858__OAll_hr_HR=[false],
modified=[20211116081018],
ddm__keyword__673858__LActive_hr_HR_String_sortable=[true],
ddm__keyword__673858__L_hr_HR=[/ervices],
localized_title_hr_HR_sortable=[test ss n],
ddm__keyword__673858__L_en_US=[/ervices],
visible=[true],
ddmStructureKey=[TEST],
ddm__keyword__673858__OAll_en_US=[false],
defaultLanguageId=[hr_HR],
ddm__keyword__673858__L_hr_HR_String_sortable=[/ervices],
viewCount_sortable=[0],
folderId=[673853],
classTypeId=[673858],
ddm__text__673858__OList_hr_HR=[32554651079],
ddm__keyword__673858__TR_hr_HR_String_sortable=[all],
companyId=[20116],
rootEntryClassPK=[809477],
ddm__keyword__673858__LA_en_US_String_sortable=[true],
displayDate=[20211116063000],
ddm__keyword__673858__OListActive_hr_HR_String_sortable=[true],
ddm__keyword__673858__SActive_en_US=[false],
ddm__keyword__673858__OListActive_en_US=[true],
ddm__keyword__673858__LActive_en_US=[true],
content=[OK 32554651079 NE true Opis pop upa all true Test pop najnoviji Utorak grey false all false /ervices],
head=[true],
ddm__keyword__673858__GPA_en_US_String_sortable=[ok],
ddm__keyword__673858__OListActive_en_US_String_sortable=[true],
ratings=[0.0],
expirationDate_sortable=[9223372036854775807],
viewCount=[0],
ddm__text__673858__OList_en_US_String_sortable=[32554651079],
localized_title_hr_HR=[test ss n],
expirationDate=[99950812133000],
ddm__keyword__673858__N_en_US_String_sortable=[test pop najnoviji utorak],
roleId=[20123, 20124, 20126],
ddm__keyword__673858__S_en_US=[all],
articleId=[809475],
ddm__keyword__673858__N_hr_HR=[Test pop najnoviji Utorak],
userName=[tuser%40admin -],
localized_title=[test ss n],
stagingGroup=[false],
headListable=[true],
ddm__keyword__673858__L_en_US_String_sortable=[/ervices],
ddm__keyword__673858__O_hr_HR=[Opis pop upa],
ddm__keyword__673858__TR_hr_HR=[all],
ddm__keyword__673858__GNA_en_US=[NE]
}
You might be using the wrong service; try using the journalArticleLocalService.
The id of the journal article resource is the id of the journal article plus 1, so if you have more than one article, in most cases it won't produce the error but will return the wrong article.
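For JournalArticle hits it is usually simpler to resolve the article straight from the search hit; a hedged sketch, assuming (as your code already does) that entryClassPK carries the article's resourcePrimKey:

// entryClassPK of a JournalArticle hit is the article's resourcePrimKey
long resourcePrimKey = GetterUtil.getLong(
        hits.toList().get(0).get(Field.ENTRY_CLASS_PK));
// fetch the newest version of the article without going through
// journalArticleResourceLocalService
JournalArticle article = journalArticleLocalService.getLatestArticle(resourcePrimKey);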
Perhaps you are hitting some inconsistencies in your Elasticsearch index: JournalArticles that no longer exist in the database but still exist in Elasticsearch.
You can double-check this and correct it using my Liferay Index Checker, see https://github.com/jorgediaz-lr/index-checker#readme
Once you have installed it, you have to:
- Check the "Display orphan index entries" option
- Click on "Check Index"
If you have any orphan results, you can remove them by clicking the "Remove orphans" button.

Hazelcast SQL interface slow performance HZ 4.2.2 vs HZ 5.0.2

Situation:
- We have a product with approximately 30 attributes (String, Enum, Double values).
- We have an IMap with indexes for all attributes: IndexType.HASH for string values and IndexType.SORTED for double values (900 MB together).
- We have 300k products in the map (approx 500 MB).
- We use a local data grid with one member.
- JVM config: -Xms6G -Xmx8G
- For HZ 5 we enabled the Jet engine: config.getJetConfig().setEnabled(true);
- We use AdoptOpenJDK 11.0.8.
When invoking an SQL query with pagination in HZ 4 we get a response in approximately 20-50 ms, but the same query in Hazelcast 5 returns results in 2000-2500 ms:
...ORDER BY param1 ASC LIMIT 20 OFFSET 0...
SqlResult sqlRows = hazelcastInstance.getSql().execute(sqlBuilder.toString());
When we tried to use predicates on the same map, in both HZ 4 and HZ 5 we got the same result: about 2000-2500 ms to get a predicated page.
PagingPredicate<Long, Product> pagingPredicate = Predicates.pagingPredicate(predicate, ProductComparatorFactory.getEntryComparator(sortName), max);
pagingPredicate.setPage(from / max);
// get the final list of products
List<Product> selectedPageA = new ArrayList<>(productMap.getAll(productMap.keySet(pagingPredicate)).values());
For HZ 5 we added the mapping:
hazelcastInstance.getSql().execute(
    "CREATE MAPPING \"ProductScreenerRepositoryProductMap\" "
        + "EXTERNAL NAME \"ProductScreenerRepositoryProductMap\" "
        + "TYPE IMap "
        + "OPTIONS ("
        + "  'keyFormat' = 'java',"
        + "  'keyJavaClass' = 'java.lang.Long',"
        + "  'valueFormat' = 'java',"
        + "  'valueJavaClass' = 'com.finmason.finriver.product.Product'"
        + ")");
The SQL used is:
SELECT * FROM ProductScreenerRepositoryProductMap
WHERE doubleValue1 >= -0.9624378795139998
AND doubleValue1 <= 0.9727269574354098
AND doubleValue2 >= -0.9
AND doubleValue2 <= 0.9
ORDER BY doubleValue3 ASC LIMIT 20 OFFSET 0
And Product uses simple serialization.
Please upgrade to Hazelcast 5.1 (currently planned for February 23).
The issue should be fixed by https://github.com/hazelcast/hazelcast/pull/20681
Actually, this case will be sped up by 3 separate PRs in 5.1:
https://github.com/hazelcast/hazelcast/pull/20681 - this one makes your query use the index
https://github.com/hazelcast/hazelcast/pull/20402 - this one does less deserialization on the cluster side
https://github.com/hazelcast/hazelcast/pull/20398 - this one makes deserialization on the client side faster for multi-column queries
There are two cases not resolved in 5.1; they are described in https://github.com/hazelcast/hazelcast/pull/20796. They should not be a problem in your case, but if someone else sees this post, they may apply. I hope that fix will be delivered in 5.1.1.
If you have the possibility to upgrade to the full 5.1 release, I strongly recommend you do it.
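For reference, the sorted indexes the SQL engine can pick up for this query would be declared roughly as in the sketch below; the map name and attribute names are taken from the question's query and may need adjusting to the real Product fields:

import com.hazelcast.config.IndexConfig;
import com.hazelcast.config.IndexType;
import com.hazelcast.map.IMap;

// sorted indexes on the columns used in WHERE and ORDER BY
IMap<Long, Product> map = hazelcastInstance.getMap("ProductScreenerRepositoryProductMap");
map.addIndex(new IndexConfig(IndexType.SORTED, "doubleValue1"));
map.addIndex(new IndexConfig(IndexType.SORTED, "doubleValue2"));
map.addIndex(new IndexConfig(IndexType.SORTED, "doubleValue3"));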

How to compute pseudorange from the parameters fetched via Google GNSSLogger?

The official GNSS raw measurements fetched via the GNSS Logger app provide the following parameters:
TimeNanos
LeapSecond
TimeUncertaintyNanos
FullBiasNanos
BiasNanos
BiasUncertaintyNanos
DriftNanosPerSecond
DriftUncertaintyNanosPerSecond
HardwareClockDiscontinuityCount
Svid
TimeOffsetNanos
State
ReceivedSvTimeNanos
ReceivedSvTimeUncertaintyNanos
Cn0DbHz
PseudorangeRateMetersPerSecond
PseudorangeRateUncertaintyMetersPerSecond
I'm looking for the raw pseudorange measurement, PR, from the above data. A little help?
Reference 1: https://github.com/google/gps-measurement-tools
Reference 2: https://developer.android.com/guide/topics/sensors/gnss
Pseudorange[m] = (AverageTravelTime[s] + delta_t[s]) * speedOfLight[m/s]
where: m - meters, s - seconds.
Try this way:
Select satellites from one constellation (at first, try GPS).
Choose the max value of ReceivedSvTimeNanos.
Calculate delta_t for each satellite as the max ReceivedSvTimeNanos minus the current ReceivedSvTimeNanos (delta_t = maxRst - curRst).
The average travel time is 70 milliseconds and the speed of light is 299792458 m/s; use these for the calculation.
Don't forget to convert all values to the same units.
For details refer to this pdf and the UserPositionVelocityWeightedLeastSquare class.
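A minimal Java sketch of this approximate method (satellites of one constellation only; the class and method names are hypothetical, and the 70 ms figure is the rough average from the steps above):

public final class ApproxPseudorange {
    static final double SPEED_OF_LIGHT_M_PER_S = 299792458.0;
    static final double AVG_TRAVEL_TIME_S = 0.070; // ~70 ms average travel time

    // receivedSvTimeNanos: ReceivedSvTimeNanos values for satellites of ONE constellation
    static double[] compute(long[] receivedSvTimeNanos) {
        long maxRst = Long.MIN_VALUE;
        for (long rst : receivedSvTimeNanos) {
            maxRst = Math.max(maxRst, rst);
        }
        double[] pr = new double[receivedSvTimeNanos.length];
        for (int i = 0; i < pr.length; i++) {
            double deltaT = (maxRst - receivedSvTimeNanos[i]) * 1e-9; // ns -> s
            pr[i] = (AVG_TRAVEL_TIME_S + deltaT) * SPEED_OF_LIGHT_M_PER_S;
        }
        return pr;
    }
}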
Unfortunately, Android doesn't provide the pseudorange directly from the API; you have to calculate it yourself.
The EU GSA has a great document that explains in detail how to use GNSS raw measurements, in section 2.4:
https://www.gsa.europa.eu/system/files/reports/gnss_raw_measurement_web_0.pdf
Specifically, section 2.4.2 explains how to calculate the pseudorange from the data given by the Android APIs. It's literally pages of text, so I won't copy the whole thing inline here, but here is the Example 1 they share: a Matlab snippet to compute the pseudorange for Galileo, GPS and BeiDou signals when the time of week is encoded:
% Select GPS + GAL with TOW decoded (state bit 3 enabled)
pos = find( (gnss.Const == 1 | gnss.Const == 6) & bitand(gnss.State,2^3) );
% Generate the measured time in full GNSS time
tRx_GNSS = gnss.timeNano(pos) - (gnss.FullBiasNano(1) + gnss.BiasNano(1));
% Change the valid range from full GNSS to TOW (WEEKSEC = 604800 seconds per week)
tRx = mod(tRx_GNSS, WEEKSEC*1e9);
% Generate the satellite time
tTx = gnss.ReceivedSvTime(pos) + gnss.TimeOffsetNano(pos);
% Generate the pseudorange (tRx - tTx is in nanoseconds, despite the variable name)
prMilliSeconds = (tRx - tTx);
pr = prMilliSeconds * Constant.C * 1e-9;

Regex - Get text between two strings

I have a large text file which contains many abstracts (7k of them). I want to separate them. They have the following properties:
a number at the beginning with a period right after it:
123.
and they always end with:
[PubMed - indexed for MEDLINE]
It would be even better if I could also extract the title and abstract from each separated string. I am fine with splitting the articles first and then splitting the texts.
In the example below, the title is on the third line:
Effects of propofol and isoflurane on haemodynamics and the inflammatory response in cardiopulmonary bypass surgery.
The abstract starts on the 8th line:
Cardiopulmonary bypass (CPB) causes reperfusion injury...
I have tried the following regex on this text:
Regex:
[0-9\.]*\s*(((?![0-9\.]*|MEDLINE).)+)\s*MEDLINE
Text:
1. Br J Biomed Sci. 2015;72(3):93-101.
Effects of propofol and isoflurane on haemodynamics and the inflammatory response
in cardiopulmonary bypass surgery.
Sayed S, Idriss NK, Sayyedf HG, Ashry AA, Rafatt DM, Mohamed AO, Blann AD.
Cardiopulmonary bypass (CPB) causes reperfusion injury that when most severe is
clinically manifested as a systemic inflammatory response syndrome. The
anaesthetic propofol may have anti-inflammatory properties that may reduce such a
response. We hypothesised differing effects of propofol and isoflurane on
inflammatory markers in patients having CBR Forty patients undergoing elective
CPB were randomised to receive either propofol or isoflurane for maintenance of
anaesthesia. CRP, IL-6, IL-8, HIF-1α (ELISA), CD11 and CD18 expression (flow
cytometry), and haemoxygenase (HO-1) promoter polymorphisms (PCR/electrophoresis)
were measured before anaesthetic induction, 4 hours post-CPB, and 24 hours later.
There were no differences in the 4 hours changes in CRP, IL-6, IL-8 or CD18
between the two groups, but those in the propofol group had higher HIF-1α (P =
0.016) and lower CD11 expression (P = 0.026). After 24 hours, compared to the
isoflurane group, the propofol group had significantly lower levels of CRP (P <
0.001), IL-6 (P < 0.001) and IL-8 (P < 0.001), with higher levels CD11 (P =
0.009) and CD18 (P = 0.002) expression. After 24 hours, patients on propofol had
increased expression of shorter HO-1 GT(n) repeats than patients on isoflurane (P
= 0.001). Use of propofol in CPB is associated with a less adverse inflammatory
profile than is isofluorane, and an increased up-regulation of HO-1. This
supports the hypothesis that propofol has anti-inflammatory activity.
PMID: 26510263 [PubMed - indexed for MEDLINE]
Two useful solutions have been proposed, by Mariano and stribizhev:
Mariano's solution: use the split method with the typical ending:
(?m)\[PubMed - indexed for MEDLINE\]$
DEMO: http://ideone.com/Qw5ss2
Works on Java 4+.
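In Java, the split approach might look like this minimal sketch (assuming Java 11+ for Files.readString; the abstracts.txt file name is a placeholder):

import java.nio.file.Files;
import java.nio.file.Path;

// split the file into individual abstract records at the MEDLINE end marker
String text = Files.readString(Path.of("abstracts.txt"));
String[] records = text.split("(?m)\\[PubMed - indexed for MEDLINE\\]$");
for (String record : records) {
    System.out.println(record.trim());
}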
stribizhev's solution: fully extract the data from the text:
(?m)^\s*\d+\..*\R{2}                 # Get to the title
(?<title>[^\n]*(?:\n(?!\n)[^\n]*)*)  # Capture the title
\R{2}                                # Get to the authors
[^\n]*(?:\n(?!\R)[^\n]*)*            # Consume the authors
(?<abstract>[^\[]*(?:\[(?!PubMed[ ]-[ ]indexed[ ]for[ ]MEDLINE\])[^\[]*)*) # Capture the abstract
DEMO: https://regex101.com/r/sG2yQ2/2
Works on Java 8+ (the \R escape requires Java 8).
Try this:
"^[0-9]+\..*\s+(.*)\s+.*\s+((?:\s|.)*?)\[PubMed - indexed for MEDLINE\]"
The first group would be the title; the second would be the abstract.
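Applied with java.util.regex, that could look like the sketch below (backslashes doubled for the Java string literal; compiling with MULTILINE is my addition so that ^ matches at the start of each record; text holds the file contents as in the split sketch above):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern p = Pattern.compile(
        "^[0-9]+\\..*\\s+(.*)\\s+.*\\s+((?:\\s|.)*?)\\[PubMed - indexed for MEDLINE\\]",
        Pattern.MULTILINE);
Matcher m = p.matcher(text);
while (m.find()) {
    System.out.println("Title:    " + m.group(1));
    System.out.println("Abstract: " + m.group(2));
}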

SearchContextMissingException Failed to execute fetch phase [search/phase/fetch/id]

Cluster: I am using Elasticsearch 1.3.1 with 6 nodes on different servers, all connected over a LAN. The bandwidth is high and each node has 45 GB of RAM.
Configuration: The heap size we allocated for each node is 10g. We have the default Elasticsearch configuration except for the unique discovery, the cluster name, and the node names, and we have 2 zones: 3 nodes belong to one zone and the others to the second zone.
Indices: 15; the total size of the indices is 76 GB.
Nowadays I am facing the SearchContextMissingException as a DEBUG log. It looks like some search query has taken too much time to fetch, but I checked the queries and there was no query producing a high load on the cluster. I am wondering why this happens.
Issue: Due to this issue, one by one all the nodes start to run GC, which results in OOM :(
Here is my exception. Please kindly explain two things:
What is SearchContextMissingException? Why does it happen?
How can we prevent the cluster from running these types of queries?
The Error:
[YYYY-MM-DD HH:mm:ss,039][DEBUG][action.search.type ] [es_node_01] [5031530]
Failed to execute fetch phase
org.elasticsearch.transport.RemoteTransportException: [es_node_02][inet[/1x.x.xx.xx:9300]][search/phase/fetch/id]
Caused by: org.elasticsearch.search.SearchContextMissingException: No search context found for id [5031530]
at org.elasticsearch.search.SearchService.findContext(SearchService.java:480)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:450)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:793)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchFetchByIdTransportHandler.messageReceived(SearchServiceTransportAction.java:782)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
at java.lang.Thread.run(Thread.java:722)
If you can, update to 1.4.2. It fixes some known resilience issues, including cascading failures like the one you describe.
Regardless of that, the default configuration will definitely get you into trouble. At a minimum, you may need to set up circuit breakers, e.g. for the field data cache.
Here's a snippet lifted from our production configuration. I assume you have also configured the Linux file handle limits correctly: see here
# prevent swapping
bootstrap.mlockall: true
indices.breaker.total.limit: 70%
indices.fielddata.cache.size: 70%
# make elasticsearch work harder to migrate/allocate indices on startup (we have a lot of shards due to logstash); default was 2
cluster.routing.allocation.node_concurrent_recoveries: 8
# enable cors
http.cors.enabled: true
http.cors.allow-origin: /https?:\/\/(localhost|kibana.*\.linko\.io)(:[0-9]+)?/
index.query.bool.max_clause_count: 4096
The same error (or debug statement) still occurs in 1.6.0, and it is not a bug.
When you create a new scroll request:
SearchResponse scrollResponse = client.prepareSearch(index).setTypes(types)
        .setSearchType(SearchType.SCAN)
        .setScroll(new TimeValue(60000))
        .setSize(maxItemsPerScrollRequest)
        .setQuery(ElasticSearchQueryBuilder.createMatchAllQuery())
        .execute().actionGet();
String scrollId = scrollResponse.getScrollId();
a new scroll ID is created (apart from the scrollId, the response is empty). To fetch the results:
long resultCounter = 0L;   // to keep track of the number of results retrieved
Long nResultsTotal = null; // total number of items we are expecting
do {
    final SearchResponse response = client.prepareSearchScroll(scrollId)
            .setScroll(new TimeValue(600000)).execute().actionGet();
    // each response may carry a fresh scroll id; keep using the latest one
    scrollId = response.getScrollId();
    // handle result
    if (nResultsTotal == null) // if not initialized
        nResultsTotal = response.getHits().getTotalHits(); // total number of documents
    resultCounter += response.getHits().getHits().length;  // track the items retrieved
} while (resultCounter < nResultsTotal);
This approach works regardless of the number of shards you have. Another option is to add a break statement when:
boolean breakIf = response.getHits().getHits().length < (nShards * maxItemsPerScrollRequest);
The number of items to be returned is maxItemsPerScrollRequest (per shard!), so we would expect the number of items requested multiplied by the number of shards. But when we have multiple shards and one of them runs out of documents while the others do not, the former method will still give us all available documents, whereas the latter will stop prematurely, I expect (I haven't tried it!).
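Separately, when you are done iterating, it may help to release the scroll's search context explicitly instead of letting it live until the timeout expires; a minimal sketch against the 1.x Java client, reusing the scrollId from above:

// free the server-side search context held by this scroll
client.prepareClearScroll().addScrollId(scrollId).execute().actionGet();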
Another way to stop seeing this exception (since it is 'only' DEBUG) is to open the logging.yml file in the config directory of Elasticsearch, then change:
action: DEBUG
to
action: INFO
