Edit: I changed the title because there is a similar SO question, "How do I fix SpreadSheetAddRows function crashing when adding a large query?", that describes my issue, so I phrased mine more succinctly. The issue is that SpreadsheetAddRows for my query result brings down the entire server at what I consider a moderate size (1,600 rows, 27 columns), which is considerably less than his 18,000 rows.
I am using an Oracle stored procedure, accessed via ColdFusion 9.0.1 cfstoredproc, that on completion creates a spreadsheet for the user to download.
The issue is that result sets greater than roughly 1,200 rows return a 500 Internal Server Error, while 700 rows return fine, so I am guessing it is a memory problem.
The only message I received other than the 500 Internal Server Error on the standard ColdFusion error page was, in small print, "GC overhead limit exceeded", and that appeared only once on a page refresh; it refers to the underlying JVM.
I am not even sure how to go about diagnosing this.
Here is the end of the cfstoredproc and the spreadsheet object:
<!--- variables assigned correctly above --->
<cfprocresult name="RC1">
</cfstoredproc>
<cfset sObj = spreadsheetNew("reconcile","yes")>
<cfset SpreadsheetAddRow(sObj, "Column_1, ... , Column27")>
<cfset SpreadsheetFormatRow(sObj, {bold=TRUE, alignment="center"}, 1)>
<cfset spreadsheetAddRows(sObj, RC1)>
<cfheader name="content-disposition" value="attachment; filename=report_#Dateformat(NOW(),'MMDDYYYY')#.xlsx">
<cfcontent type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" variable="#spreadsheetReadBinary(sObj)#">
My answer lies with ColdFusion and one simple fact: do not use SpreadsheetAddRows or any of the related functions like SpreadsheetFormatRows.
My solution was to execute the query, create an XLS file, use the cfspreadsheet tag to write the query result to the newly created file, then serve it to the browser and delete it after serving.
Using SpreadsheetAddRows, the runtime crashed the server on 1,000+ rows and took 5+ minutes on 700 rows.
Using the method outlined above: 1-1.5 seconds.
If you are interested in more code, just comment and I can provide it. I am using the ColdBox framework, so I didn't think the specifics would help beyond the new workflow.
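For anyone who wants a concrete shape of that workflow, here is a rough sketch; the temp-file location is a placeholder and RC1 is the query name from the question, so adjust to your own setup:
<cfset fileName = "report_#DateFormat(NOW(),'MMDDYYYY')#.xls">
<cfset filePath = GetTempDirectory() & fileName>
<!--- Write the whole query result to the file in one shot instead of using the row-by-row functions --->
<cfspreadsheet action="write" query="RC1" filename="#filePath#" overwrite="true">
<cfheader name="content-disposition" value="attachment; filename=#fileName#">
<!--- Serve the file and delete it after serving --->
<cfcontent type="application/vnd.ms-excel" file="#filePath#" deletefile="yes">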
I have seen and tried many existing Stack Overflow posts regarding this issue, but none work. I guess my Java heap space is not as large as expected for my large dataset; my dataset contains 6.5M rows. My Linux instance has 64 GB of RAM and 4 cores. As per this suggestion I need to fix my code, but I think making a dictionary from a PySpark dataframe should not be very costly. Please advise me if there is any other way to compute it.
I just want to make a Python dictionary from my PySpark dataframe. This is the content of my dataframe; property_sql_df.show() shows:
+--------------+------------+--------------------+--------------------+
| id|country_code| name| hash_of_cc_pn_li|
+--------------+------------+--------------------+--------------------+
| BOND-9129450| US|Scotron Home w/Ga...|90cb0946cf4139e12...|
| BOND-1742850| US|Sited in the Mead...|d5c301f00e9966483...|
| BOND-3211356| US|NEW LISTING - Com...|811fa26e240d726ec...|
| BOND-7630290| US|EC277- 9 Bedroom ...|d5c301f00e9966483...|
| BOND-7175508| US|East Hampton Retr...|90cb0946cf4139e12...|
+--------------+------------+--------------------+--------------------+
What I want is to make a dictionary with hash_of_cc_pn_li as key and id as a list value.
Expected Output
{
  "90cb0946cf4139e12": ["BOND-9129450", "BOND-7175508"],
  "d5c301f00e9966483": ["BOND-1742850", "BOND-7630290"]
}
What I have tried so far:
%%time
duplicate_property_list = {}
for ind in property_sql_df.collect():
    hashed_value = ind.hash_of_cc_pn_li
    property_id = ind.id
    if hashed_value in duplicate_property_list:
        duplicate_property_list[hashed_value].append(property_id)
    else:
        duplicate_property_list[hashed_value] = [property_id]
What I get now on the console:
java.lang.OutOfMemoryError: Java heap space
and this error is shown in the Jupyter notebook output:
ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:33097)
making a dictionary from pyspark dataframe should not be very costly
This is true in terms of runtime, but this will easily take up a lot of space. Especially if you're doing property_sql_df.collect(), at which point you're loading your entire dataframe into driver memory. At 6.5M rows, you'll already hit 65GB if each row has 10KB, or 10K characters, and we haven't even gotten to the dictionary yet.
First, you can collect just the columns you need (e.g. not name). Second, you can do the aggregation upstream in Spark, which will save some space depending on how many ids there are per hash_of_cc_pn_li:
from pyspark.sql.functions import collect_set

rows = property_sql_df.groupBy("hash_of_cc_pn_li") \
.agg(collect_set("id").alias("ids")) \
.collect()
duplicate_property_list = { row.hash_of_cc_pn_li: row.ids for row in rows }
Adding the accepted answer from the linked post for posterity. It solves the problem by leveraging the write.json method and preventing the collection of a too-large dataset to the driver:
https://stackoverflow.com/a/63111765/12378881
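The gist of that linked answer, as it applies here, is to keep the aggregation on the cluster and write the result out as JSON files instead of collecting it to the driver. A minimal sketch of that idea (the output path is made up):
from pyspark.sql import functions as F

(property_sql_df
    .groupBy("hash_of_cc_pn_li")
    .agg(F.collect_set("id").alias("ids"))
    .write
    .mode("overwrite")
    .json("/tmp/hash_to_ids"))  # hypothetical output location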
Here's how to make a sample DataFrame with your data:
data = [
("BOND-9129450", "90cb"),
("BOND-1742850", "d5c3"),
("BOND-3211356", "811f"),
("BOND-7630290", "d5c3"),
("BOND-7175508", "90cb"),
]
df = spark.createDataFrame(data, ["id", "hash_of_cc_pn_li"])
Let's aggregate the data in a Spark DataFrame to limit the number of rows that are collected on the driver node. We'll use the two_columns_to_dictionary function defined in quinn to create the dictionary.
import quinn
from pyspark.sql import functions as F

agg_df = df.groupBy("hash_of_cc_pn_li").agg(F.max("hash_of_cc_pn_li").alias("hash"), F.collect_list("id").alias("id"))
res = quinn.two_columns_to_dictionary(agg_df, "hash", "id")
print(res) # => {'811f': ['BOND-3211356'], 'd5c3': ['BOND-1742850', 'BOND-7630290'], '90cb': ['BOND-9129450', 'BOND-7175508']}
This might work on a relatively small, 6.5 million row dataset, but it won't work on a huge dataset. "I think making a dictionary from pyspark dataframe should not be very costly" is only true for DataFrames that are really tiny. Making a dictionary from a PySpark DataFrame is actually very expensive.
PySpark is a cluster computing framework that benefits from having data spread out across nodes in a cluster. When you call collect all the data is moved to the driver node and the worker nodes don't help. You'll get an OutOfMemory exception whenever you try to move too much data to the driver node.
It's probably best to avoid the dictionary entirely and figure out a different way to solve the problem. Great question.
From Spark 2.4 onward we can use the groupBy, collect_list, map_from_arrays, and to_json built-in functions for this case.
Example:
df.show()
#+------------+-----------------+
#| id| hash_of_cc_pn_li|
#+------------+-----------------+
#|BOND-9129450|90cb0946cf4139e12|
#|BOND-7175508|90cb0946cf4139e12|
#|BOND-1742850|d5c301f00e9966483|
#|BOND-7630290|d5c301f00e9966483|
#+------------+-----------------+
from pyspark.sql.functions import col, collect_list, map_from_arrays, to_json

df.groupBy(col("hash_of_cc_pn_li")).\
agg(collect_list(col("id")).alias("id")).\
selectExpr("to_json(map_from_arrays(array(hash_of_cc_pn_li),array(id))) as output").\
show(10,False)
#+-----------------------------------------------------+
#|output |
#+-----------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"]}|
#|{"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+-----------------------------------------------------+
To get one dict, use another agg with collect_list:
df.groupBy(col("hash_of_cc_pn_li")).\
agg(collect_list(col("id")).alias("id")).\
agg(to_json(map_from_arrays(collect_list(col("hash_of_cc_pn_li")),collect_list(col("id")))).alias("output")).\
show(10,False)
#+---------------------------------------------------------------------------------------------------------+
#|output |
#+---------------------------------------------------------------------------------------------------------+
#|{"90cb0946cf4139e12":["BOND-9129450","BOND-7175508"],"d5c301f00e9966483":["BOND-1742850","BOND-7630290"]}|
#+---------------------------------------------------------------------------------------------------------+
For context, I am not an SSAS expert, or even an avid user, I'm primarily a Java developer. We have a data science team that uses SSAS to write, develop and test various models.
In order to integrate the output of these models with other non-Microsoft systems, I am building a Java-based service that can query certain fields from the cube using Olap4j/XMLA to run an MDX query. But the performance (or lack thereof) is confusing me.
If I open MSS Studio, "Browse" the cube, drag a number of measures into the measures pane, toggle "Show Empty Cells" (otherwise for some reason I get no results), and hit execute, I get the expected results almost instantly. If I click on the red square to turn off "design mode", it takes me to the MDX code that looks something like:
SELECT { } ON COLUMNS, { (
[Main].[Measure01].[Measure01].ALLMEMBERS *
[Main].[Measure02].[Measure02].ALLMEMBERS *
[Main].[Measure03].[Measure03].ALLMEMBERS *
[Main].[Measure04].[Measure04].ALLMEMBERS
) } DIMENSION PROPERTIES MEMBER_CAPTION, MEMBER_UNIQUE_NAME ON ROWS FROM (
SELECT ( {
[Main].[Measure01].&[1]
} ) ON COLUMNS FROM [Model])
CELL PROPERTIES VALUE
If I take this MDX query and paste it into my Java application, and run it, it takes over 30 seconds to return the results, using the following code:
Class.forName("org.olap4j.driver.xmla.XmlaOlap4jDriver");
Connection connection = DriverManager.getConnection(cubeUrl);
OlapConnection olapConnection = connection.unwrap(OlapConnection.class);
olapConnection.setCatalog(catalog);
OlapStatement statement = olapConnection.createStatement();
LOG.info("Running Cube query");
CellSet cellSet = statement.executeOlapQuery("<<The MDX Query here>>");
And the more measures I add, the slower it gets. I've tried putting some logging and debug breakpoints in my code, but it really seems like it's SSAS itself that is being slow in returning my data.
Bearing in mind I know very little about SSAS, what can I try? Does Olap4j have some config options I haven't set? Does MSS Studio do some optimization behind the scenes that is impossible for me to replicate?
EDIT 1:
On a hunch I installed Wireshark to monitor my network traffic, and while my query is running I see hundreds of thousands, if not millions, of packets going between my laptop and the SSAS server. Network packets are hard to interpret, but a lot of them seem to be sending HTTP data with the measure values in them, like:
<Member Hierarchy="[Main].[Measure01]">
<UName>[Main].[Measure01].&[0]</UName>
<Caption>0.00</Caption>
<LName>[Main].[Measure01].[Measure01]</LName>
<LNum>1</LNum>
<DisplayInfo>131072</DisplayInfo>
<MEMBER_CAPTION>0.00</MEMBER_CAPTION>
<MEMBER_UNIQUE_NAME>[Main].[Measure01].&[0]</MEMBER_UNIQUE_NAME>
<MEMBER_NAME>0</MEMBER_NAME>
<MEMBER_VALUE xsi:type="xsd:double">0</MEMBER_VALUE>
</Member>
<Member Hierarchy="[Main].[Measure02]">
<UName>[Main].[Measure02].&[0]</UName>
<Caption>0</Caption>
<LName>[Main].[Measure02].[Measure02]</LName>
<LNum>1</LNum>
<DisplayInfo>131072</DisplayInfo>
<MEMBER_CAPTION>0</MEMBER_CAPTION>
<MEMBER_UNIQUE_NAME>[Main].[Measure02].&[0]</MEMBER_UNIQUE_NAME>
<MEMBER_NAME>0</MEMBER_NAME>
<MEMBER_VALUE xsi:type="xsd:double">0</MEMBER_VALUE>
</Member>
<Member Hierarchy="[Main].[Measure03]">
<UName>[Main].[Measure03].&</UName>
<Caption/>
<LName>[Main].[Measure03].[Measure03]</LName>
<LNum>1</LNum>
<DisplayInfo>131072</DisplayInfo>
<MEMBER_UNIQUE_NAME>[Main].[Measure03].&</MEMBER_UNIQUE_NAME>
<MEMBER_VALUE xsi:nil="true"/>
</Member>
So it seems like the slowness might actually all be network traffic! Is there a way to get olap4j/IIS/SSAS to compress the traffic, so that I can get similar performance in olap4j as I get with MSS, where the same volume of data is retrieved in under a second?
Background:
We have a 3-node SolrCloud cluster that was migrated to Docker. It works as expected; however, newly inserted data can only be retrieved by id. Once we try to use filters, it doesn't show up. Note that old data can still be filtered without any issues.
The database is used via a Spring Boot CRUD-like application.
More background:
The app and Solr were migrated by another person, and I have recently inherited the codebase, so I am not familiar with the implementation in much detail and am still digging and debugging.
The nodes were migrated as-is (the data was copied into a docker mount).
What I have so far:
I have checked the logs of all the Solr nodes and see the following when making calls to the application:
Filtering:
2019-02-22 14:17:07.525 INFO (qtp15xxxxx-15) [c:content_api s:shard1 r:core_node1 x:content_api_shard1_replica0] o.a.s.c.S.Request
[content_api_shard1_replica0]
webapp=/solr path=/select
params=
{q=*:*&start=0&fq=id-lws-ttf:127103&fq=active-boo-ttf:(true)&fq=(publish-date-tda-ttf:[*+TO+2019-02-22T15:17:07Z]+OR+(*:*+NOT+publish-date-tda-ttf:[*+TO+*]))AND+(expiration-date-tda-ttf:[2019-02-22T15:17:07Z+TO+*]+OR+(*:*+NOT+expiration-date-tda-ttf:[*+TO+*]))&sort=create-date-tda-ttf+desc&rows=10&wt=javabin&version=2}
hits=0 status=0 QTime=37
Get by ID:
2019-02-22 14:16:56.441 INFO (qtp15xxxxxx-16) [c:content_api s:shard1 r:core_node1 x:content_api_shard1_replica0] o.a.s.c.S.Request
[content_api_shard1_replica0]
webapp=/solr path=/get params={ids=https://example.com/app/contents/127103/middle-east&wt=javabin&version=2}
status=0 QTime=0
Disclaimer:
I am an absolute beginner with Solr and am currently going through the documentation in order to get better insight into the nuts and bolts.
Assumptions and WIP:
The person who migrated it told me that only the data was copied, not the configuration. I have acquired the old config files (/opt/solr/server/solr/configsets/) and am trying to compare them to the new ones. But the assumption is that the configs were defaults.
The old version was 6.4.2 and the new one is 6.6.5 (not sure whether this could be the issue).
Is there something obvious that we are missing here? What is super confusing is the fact that the data can be retrieved by id AND that the OLD data can be filtered.
Update:
After some research, I have excluded a configuration issue, because when I inspect the configuration from the admin UI, I see the correct configuration.
Also, another weird behavior is that the data can be queried after some time (more than 5 days). I can see that because I run the query from the UI, ordered by descending creation date, and from there I can see test entries of mine that were not visible just days ago.
Relevant commit config part:
<autoCommit>
<maxTime>${solr.autoCommit.maxTime:15000}</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>${solr.autoSoftCommit.maxTime:-1}</maxTime>
</autoSoftCommit>
More config output from the admin endpoint:
config:{
znodeVersion:0,
luceneMatchVersion:"org.apache.lucene.util.Version:6.0.1",
updateHandler:{
indexWriter:{
closeWaitsForMerges:true
},
commitWithin:{
softCommit:true
},
autoCommit:{
maxDocs:-1,
maxTime:15000,
openSearcher:false
},
autoSoftCommit:{
maxDocs:-1,
maxTime:-1
}
},
query:{
useFilterForSortedQuery:false,
queryResultWindowSize:20,
queryResultMaxDocsCached:200,
enableLazyFieldLoading:true,
maxBooleanClauses:1024,
filterCache:{
autowarmCount:"0",
size:"512",
initialSize:"512",
class:"solr.FastLRUCache",
name:"filterCache"
},
queryResultCache:{
autowarmCount:"0",
size:"512",
initialSize:"512",
class:"solr.LRUCache",
name:"queryResultCache"
},
documentCache:{
autowarmCount:"0",
size:"512",
initialSize:"512",
class:"solr.LRUCache",
name:"documentCache"
},
fieldValueCache:{
size:"10000",
showItems:"-1",
initialSize:"10",
name:"fieldValueCache"
}
},
...
According to your examples you're only retrieving the document when you're querying the realtime get endpoint - i.e. /get. This endpoint returns documents by querying by id, even if the document hasn't been commited to the index or a new searcher has been opened.
A new searcher has to be created before any changes to the index become visible to the regular search endpoints, since the old searcher will still use the old index files for searching. If a new searcher isn't created, the stale content will still be returned. This matches the behaviour you're seeing, where you're not opening any new searchers, and content becomes visible when the searcher is recycled for other reasons (possibly because of restarts/another explicit commit/merges/optimizes/etc.).
Your example configuration shows that the autoSoftCommit is disabled, while the regular autoCommit is set to not open a new searcher (and thus, no new content is shown). I usually recommend disabling this feature and instead relying on using commitWithin in the URL as it allows greater configurability for different types of data, and allows you to ask for a new searcher to be opened within at least x seconds since data has been added. The default behaviour for commitWithin is that a new searcher will be opened after the commit has happened.
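For illustration, a commit-within window can be passed on the update request itself (the document and the interval below are placeholders; the collection name is taken from your logs); Solr will then make the document searchable within that many milliseconds without an explicit commit:
curl 'http://localhost:8983/solr/content_api/update?commitWithin=10000' \
  -H 'Content-Type: application/json' \
  --data-binary '[{"id": "test-doc-1"}]'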
Sounds like you might have switched to the default managed schema on upgrade. Look for schema.xml in your previous install, along with a schemaFactory section in your prior install's solrconfig.xml. More info at https://lucene.apache.org/solr/guide/6_6/schema-factory-definition-in-solrconfig.html#SchemaFactoryDefinitioninSolrConfig-SolrUsesManagedSchemabyDefault
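If that turns out to be the case and you want to keep the classic schema.xml behaviour, the linked page describes forcing it with a schemaFactory entry in solrconfig.xml, along these lines:
<schemaFactory class="ClassicIndexSchemaFactory"/>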
I am getting a CPU performance issue on the server when I try to download a CSV in my project: the CPU goes to 100%, even though SQL returns the response within 1 minute. We are writing around 600K records to the CSV. For one user it works fine, but for concurrent users we get this issue.
Environment
Spring 4.2.5
Tomcat 7/8 (RAM 2GB Allocated)
MySQL 5.0.5
Java 1.7
Here is the Spring controller code:
@RequestMapping(value="csvData")
public void getCSVData(HttpServletRequest request,
        HttpServletResponse response,
        @RequestParam(value="param1", required=false) String param1,
        @RequestParam(value="param2", required=false) String param2,
        @RequestParam(value="param3", required=false) String param3) throws IOException {
    List<Log> logs = service.getCSVData(param1, param2, param3);
    response.setHeader("Content-type", "application/csv");
    response.setHeader("Content-disposition", "inline; filename=logData.csv");
    PrintWriter out = response.getWriter();
    out.println("Field1,Field2,Field3,.......,Field16");
    for (Log row : logs) {
        out.println(row.getField1() + "," + row.getField2() + "," + row.getField3() + "......" + row.getField16());
    }
    out.flush();
    out.close();
}
Persistence code (I am using Spring JdbcTemplate):
@Override
public List<Log> getCSVLog(String param1, String param2, String param3) {
    String sql = SqlConstants.CSV_ACTIVITY.toString();
    List<Log> csvLog = jdbcTemplate.query(sql, new Object[]{param1, param2, param3},
        new RowMapper<Log>() {
            @Override
            public Log mapRow(ResultSet rs, int rowNum) throws SQLException {
                Log log = new Log();
                log.setField1(rs.getInt("field1"));
                log.setField2(rs.getString("field2"));
                log.setField3(rs.getString("field3"));
                // ...
                log.setField16(rs.getString("field16"));
                return log;
            }
        });
    return csvLog;
}
I think you need to be specific about what you mean by "100% CPU usage": whether it's the Java process or the MySQL server. As you have got 600K records, trying to load everything into memory would easily end in an OutOfMemoryError. Given that this works for one user, it means you've got enough heap space to process this number of records for just one user, and the symptoms surface when there are multiple users trying to use the same service.
The first issue I can see in your posted code is that you try to load everything into one big list, and the size of that list varies based on the content of the Log class. Using a list like this also means that you have to have enough memory to process the JDBC result set and generate a new list of Log instances. This can be a major problem with a growing number of users. These short-lived objects cause frequent GC, and once GC cannot keep up with the amount of garbage being collected it obviously fails. To solve this major issue my suggestion is to use a scrollable ResultSet. Additionally, you can make this result set read-only; for example, below is a code fragment for creating a scrollable result set. Take a look at the documentation for how to use it.
Statement st = conn.createStatement(ResultSet.TYPE_SCROLL_SENSITIVE, ResultSet.CONCUR_READ_ONLY);
The above option is suitable if you're using pure JDBC or the Spring JDBC template. If Hibernate is already used in your project, you can still achieve the same thing with the code fragment below. Again, please check the documentation for more information if you have a different JPA provider.
StatelessSession session = sessionFactory.openStatelessSession();
Query query = session.createSQLQuery(queryStr).setCacheable(false).setFetchSize(Integer.MIN_VALUE).setReadOnly(true);
query.setParameter(query_param_key, query_paramter_value);
ScrollableResults resultSet = query.scroll(ScrollMode.FORWARD_ONLY);
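For completeness, consuming that ScrollableResults instance could look roughly like this (continuing the fragment above; the actual CSV writing is left out):
try {
    while (resultSet.next()) {
        Object[] row = resultSet.get(); // raw column values for the current row
        // write the row to the CSV output here
    }
} finally {
    resultSet.close();
    session.close();
}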
This way you're not loading all the records into the Java process in one go; instead they're loaded on demand, and the memory footprint stays small at any given time. Note that the JDBC connection will be open until you're done processing the entire record set. This also means that your DB connection pool can be exhausted if many users are going to download CSV files from this endpoint. You need to take measures to overcome this problem (i.e. use an API manager to rate limit the calls to this endpoint, read from a read replica, or whatever other viable option).
My other suggestion is to stream the data, which you have already done, so that any records fetched from the DB are processed and sent to the client before the next set of records is processed. Again, I would suggest you use a CSV library such as Super CSV to handle this, as these libraries are designed to handle a good load of data.
Please note that this answer may not exactly answer your question, as you haven't provided necessary parts of your source such as how you retrieve data from the DB, but it should give you the right direction to solve this issue.
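To make the streaming suggestion more concrete, here is a hedged sketch that writes each row to the response as it is read instead of building a List<Log> first; the column names are the ones from the question, while the class, method and parameters are illustrative:
import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowCallbackHandler;

public class CsvStreamWriter {

    private final JdbcTemplate jdbcTemplate;

    public CsvStreamWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void writeCsv(final PrintWriter out, String sql, Object[] params) {
        out.println("Field1,Field2,...,Field16");
        // Note: for MySQL to actually stream, the statement fetch size must be
        // Integer.MIN_VALUE; depending on the Spring version this may have to be
        // set on the underlying statement rather than on the JdbcTemplate.
        jdbcTemplate.query(sql, params, new RowCallbackHandler() {
            @Override
            public void processRow(ResultSet rs) throws SQLException {
                // Each row goes straight to the client; nothing accumulates in memory.
                out.println(rs.getInt("field1") + "," + rs.getString("field2") /* + ... */);
            }
        });
        out.flush();
    }
}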
Your problem is loading all the data from the database onto the application server at once. Try running the query with limit and offset parameters (with a mandatory ORDER BY), push the loaded records to the client, then load the next part of the data with a different offset. This will help you decrease the memory footprint and will not require keeping a connection to the database open all the time. Of course, the database will be loaded a bit more, but maybe the whole situation will be better. Try different limit values, for example 5K-50K, and monitor CPU usage on both the app server and the database.
If you can afford to keep many open connections to the database, @Bunti's answer is very good.
http://dev.mysql.com/doc/refman/5.7/en/select.html
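For example, the pages could be fetched with queries of the form below (table and column names are placeholders), increasing the offset on each round trip until an empty page comes back:
SELECT field1, field2, field16 FROM log_table ORDER BY id LIMIT 10000 OFFSET 0;
SELECT field1, field2, field16 FROM log_table ORDER BY id LIMIT 10000 OFFSET 10000;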
I have a problem with Apache Jena TDB. Basically I create a new Dataset, load data from an RDF/XML file into a named model with the name "http://example.com/model/filename" where filename is the name of the XML/RDF file. After loading the data, all statements from the named model are inserted into the default model. The named model is kept in the dataset for backup reasons.
When I now try to query the named models in the Dataset, TDB freezes and the application seems to run in an infinite loop; it neither terminates nor throws an exception.
What is causing that freeze and how can I prevent it?
Example code:
Dataset ds = TDBFactory.createDataset("tdb");
Model mod = ds.getDefaultModel();
File f = new File("example.rdf");
FileInputStream fis = new FileInputStream(f);
ds.begin(ReadWrite.WRITE);
// Get a new named model to load the data into
Model nm = ds.getNamedModel("http://example.com/model/example.rdf");
nm.read(fis, null);
// Do some queries on the Model using the utility methods of Model, no SPARQL used
// Add all statements from the named model to the default model
mod.add(nm);
ds.commit();
ds.end();
// So far everything works as expected, but the following line causes the freeze
Iterator<String> it = ds.listNames();
Any method call that accesses the existing named models causes the same freeze reaction, so this is the same for getNamedModel("http://example.com/model/example.rdf"); for example. Adding new named models by calling getNamedModel("http://example.com/model/example123.rdf"); works fine, so only access to existing models is broken.
Used environment: Linux 64bit, Oracle Java 1.7.0_09, Jena 2.7.4 (incl. TDB 0.9.4)
Thanks in advance for any help!
Edit: Fixed mistake in code fragment
Edit2: Solution (my last comment to AndyS answer)
OK, I went through the whole program and added all the missing transactions. Now it is working as expected. I suspect Jena was throwing an exception during the shutdown sequence of my program, but that exception was not reported properly, and the "freeze" was caused by other threads not terminating correctly. Thanks for pointing out the faulty transaction usage.
Could you turn this into a test case and send it to the jena users mailing list please?
You should get the default model inside the transaction - you got it outside.
Also, if you have used a dataset transactionally, you can't use it untransactionally as you do at ds.listNames. It shouldn't freeze - you should get some kind of warning.
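For example, wrapping the later reads in their own read transaction (continuing from the ds in the question) would look something like this:
ds.begin(ReadWrite.READ);
try {
    // Access existing named models only inside a transaction
    Iterator<String> it = ds.listNames();
    while (it.hasNext()) {
        System.out.println(it.next());
    }
} finally {
    ds.end();
}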