I am trying to prototype performance results for an OrientDB mass delete of vertices. I need to test deleting more than 10,000 and up to a million vertices.
First, I set the lightweight edges property to false while creating my vertices and edges, following this: Issue with creating edge in OrientDb with Blueprints / Tinkerpop.
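For reference, this is roughly how that setting gets applied (a minimal sketch; setUseLightweightEdges is the Blueprints API method I'm assuming here):

OrientGraph g = new OrientGraph("remote:localhost/WorkDBMassDelete2", "admin", "admin");
// Persist edges as full documents rather than lightweight edges (assumed API).
g.setUseLightweightEdges(false);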
When I try deleting (please see the code below):
private static OrientGraph graph = new OrientGraph(
        "remote:localhost/WorkDBMassDelete2", "admin", "admin");

private static void removeCompletedWork() {
    try {
        long startTime = System.currentTimeMillis();
        List<Object> params = new ArrayList<>();
        String deleteQuery = "delete vertex Work where out('status') contains (work-state = 'Not Started')";
        int no = graph.getRawGraph().command(new OCommandSQL(deleteQuery))
                .execute(params);
        // graph.commit();
        long endTime = System.currentTimeMillis();
        System.out.println("No of activities removed : " + no
                + " and time taken is : " + (endTime - startTime));
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        graph.shutdown();
    }
}
The results are good when deleting in the hundreds: 500 activities take ~500 ms. But when I try deleting 2500/5000 activities, the numbers get high: 2500 deletions take ~6000 ms.
A) I also tried creating an index. What is the best practice: to create an index on the attribute work-state, or an index on the edge status? I tried both while creating the vertex and edge, but neither improved the performance much.
((OrientBaseGraph) graph).createKeyIndex("Status", Edge.class);
//or on the vertex
((OrientBaseGraph) graph).createKeyIndex("work-state", Vertex.class);
What is the best practice for deleting mass/group data using a query like the one above? Any help is greatly appreciated.
UPDATE:
I downloaded orientdb-community-1.7-20140416.230539-144-distribution.tar.gz from https://oss.sonatype.org/content/repositories/snapshots/com/orientechnologies/orientdb-community/1.7-SNAPSHOT/.
When I try deleting using the subquery from Studio or from the program, I get the following error: com.orientechnologies.orient.core.sql.OCommandSQLParsingException: Error on parsing command at position #0: Class 'FROM was not found. I had modified my query like this:
delete vertex from (select in('status') from State where work-state = 'Complete')
Also, when running it from the program, I updated my Maven dependencies to the 1.7-SNAPSHOT libraries. My old query was still producing the same numbers, and the subquery deletion was giving errors even in Studio. Please let me know if I am missing anything. Thanks!!
First, please try the same exact code with 1.7-SNAPSHOT. It should be faster.
Then, in 1.7-SNAPSHOT we just added the ability to delete vertices from a sub-query. The reason: why browse all the Work vertices when you could just delete all the vertices coming into the State vertex "Not Started"?
So if you have 1.7-SNAPSHOT, change this query from:
delete vertex Work where out('status') contains (work-state = 'Not Started')
to (assuming the status vertex is called "State"):
delete vertex from (select in('status') from State where work-state = 'Not Started')
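For completeness, a minimal sketch of running the rewritten command from Java (same graph handle and class names as in the code above; the int result follows the original example):

String q = "delete vertex from (select in('status') from State where work-state = 'Not Started')";
// Run the sub-query delete through the raw graph, as in the original code path.
int removed = graph.getRawGraph().command(new OCommandSQL(q)).execute();
System.out.println("Removed " + removed + " vertices");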
I'm trying to copy the exercise about halfway down the page at this link:
https://d2l.ai/chapter_recurrent-neural-networks/sequence.html
The exercise uses a sine function to create 1000 data points between -1 and 1 and uses a recurrent network to approximate the function.
Below is the code I used. I'm going back to study more about why this isn't working, as it doesn't make much sense to me right now; I was easily able to use a feed-forward network to approximate this function.
// get data
ArrayList<DataSet> list = new ArrayList<>();
DataSet dss = DataSetFetch.getDataSet(Constants.DataTypes.math, "sine", 20, 500, 0, 0);
DataSet dsMain = dss.copy();
if (!dss.isEmpty()) {
    list.add(dss);
}
if (list.isEmpty()) {
    return;
}

// format dataset
list = DataSetFormatter.formatReccurnent(list, 0);

// get network
int history = 10;
ArrayList<LayerDescription> ldlist = new ArrayList<>();
LayerDescription l = new LayerDescription(1, history, Activation.RELU);
ldlist.add(l);
LayerDescription ll = new LayerDescription(history, 1, Activation.IDENTITY, LossFunctions.LossFunction.MSE);
ldlist.add(ll);
ListenerDescription ld = new ListenerDescription(20, true, false);
MultiLayerNetwork network = Reccurent.getLstm(ldlist, 123, WeightInit.XAVIER, new RmsProp(), ld);

// train network
final List<DataSet> lister = list.get(0).asList();
DataSetIterator iter = new ListDataSetIterator<>(lister, 50);
network.fit(iter, 50);
network.rnnClearPreviousState();

// test network
ArrayList<DataSet> resList = new ArrayList<>();
DataSet result = new DataSet();
INDArray arr = Nd4j.zeros(lister.size() + 1);
INDArray holder;
if (list.size() > 1) {
    // test on training data
    System.err.println("oops");
} else {
    // test on original or scaled data
    for (int i = 0; i < lister.size(); i++) {
        holder = network.rnnTimeStep(lister.get(i).getFeatures());
        arr.putScalar(i, holder.getFloat(0));
    }
}

// add original data
resList.add(dsMain);

// result
result.setFeatures(dsMain.getFeatures());
result.setLabels(arr);
resList.add(result);

// display
DisplayData.plot2DScatterGraph(resList);
Can you explain the code I would need for a 1-input, 10-hidden, 1-output LSTM network to approximate a sine function?
I'm not using any normalization, as the function is already in -1:1, and I'm using the Y input as the feature and the following Y input as the label to train the network.
You'll notice I am building a class that allows for easier construction of nets, and I have tried throwing many changes at the problem, but I am sick of guessing.
Here are some examples of my results. Blue is data, red is result.
This is one of those times where you go from wondering why this wasn't working to wondering how in the hell my original results were as good as they were.
My failing was not understanding the documentation clearly and also not understanding BPTT.
With feed-forward networks, each iteration is stored as a row and each input as a column. An example is [dataset.size, network inputs.size].
However, with recurrent input it's reversed, with each row being an input and each column an iteration in time, which is necessary to activate the state of the LSTM chain of events. At minimum, my input needed to be [1, networkinputs.size, dataset.size], but it could also be [dataset.size, networkinputs.size, statelength.size].
In my previous example I was training the network with data in the format [dataset.size, networkinputs.size, 1]. So, from my low-resolution understanding, the LSTM network should never have worked at all, but it somehow produced at least something.
There may have also been some issue with converting the dataset to a list, as I also changed how I feed the network, but I think the bulk of the issue was a data structure issue.
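To make the shape difference concrete, here is a minimal sketch of building a sine series in the [miniBatch, inputs, timeSteps] layout that DL4J's recurrent layers expect (plain Nd4j; the sizes and the 0.1 step are illustrative):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.DataSet;
import org.nd4j.linalg.factory.Nd4j;

int timeSteps = 1000;
// Recurrent layout: [miniBatchSize, nIn, timeSeriesLength], here [1, 1, 1000].
INDArray features = Nd4j.zeros(new int[]{1, 1, timeSteps});
INDArray labels = Nd4j.zeros(new int[]{1, 1, timeSteps});
for (int t = 0; t < timeSteps; t++) {
    features.putScalar(new int[]{0, 0, t}, Math.sin(0.1 * t));       // current value as feature
    labels.putScalar(new int[]{0, 0, t}, Math.sin(0.1 * (t + 1)));   // next value as label
}
DataSet ds = new DataSet(features, labels);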
Below are my new results
It's hard to tell what is going on without seeing the full code. For a start, I don't see an RnnOutputLayer specified. You could take a look at this, which shows you how to build an RNN in DL4J.
If your RNN setup is correct, this could be a tuning issue. You can find more on tuning here. Adam is probably a better choice of updater than RMSProp, and tanh is probably a good choice of activation for your output layer, since its range is (-1,1). Other things to check/tweak: the learning rate, the number of epochs, and the setup of your data (like: are you trying to predict too far out?).
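For example, a network along those lines might look like the sketch below (not a drop-in replacement for your builder classes; the layer sizes and learning rate are illustrative, and older DL4J versions use GravesLSTM and indexed .layer(0, ...) calls instead):

import org.deeplearning4j.nn.conf.MultiLayerConfiguration;
import org.deeplearning4j.nn.conf.NeuralNetConfiguration;
import org.deeplearning4j.nn.conf.layers.LSTM;
import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
import org.deeplearning4j.nn.weights.WeightInit;
import org.nd4j.linalg.activations.Activation;
import org.nd4j.linalg.learning.config.Adam;
import org.nd4j.linalg.lossfunctions.LossFunctions;

MultiLayerConfiguration conf = new NeuralNetConfiguration.Builder()
        .seed(123)
        .weightInit(WeightInit.XAVIER)
        .updater(new Adam(0.005))                     // Adam instead of RMSProp
        .list()
        .layer(new LSTM.Builder()
                .nIn(1).nOut(10)                      // 1 input, 10 hidden units
                .activation(Activation.TANH)
                .build())
        .layer(new RnnOutputLayer.Builder(LossFunctions.LossFunction.MSE)
                .nIn(10).nOut(1)
                .activation(Activation.TANH)          // output range (-1,1) matches sine
                .build())
        .build();

MultiLayerNetwork net = new MultiLayerNetwork(conf);
net.init();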
I'm having a problem with a value being populated from a column of a different table instead of the table I want it to pull from.
From MySQL, I get the desired result of null; however, from within the program, it displays either the value (if I have given it one for that row) or the value from a different table. This happens even while using aliases and explicitly stating which table to pull from.
I have a device table and a device type table. The device type table has all the same fields as the device table, but it just holds 1s and 0s that dictate which fields are needed. (This is used when generating the edit / create-a-device GUI form.)
This only happens when using resultSet.getInt("column_with_issue") and the values are INTs in both tables. It returns a proper null when I do resultSet.getString("column_with_no_issues"), as long as the types are different.
Here is my original query, and what I've tried changing, per suggestions, to fix it.
SELECT d.*, dt.type, os.os_name
FROM device d
inner join device_type dt on d.type_id = dt.id
left outer join operating_system os on d.os = os.id
WHERE d.division_number = 1 or 1
AND (d.asset_tag = 1 or 1)
AND (dt.type = 1 or 1);
A possible fix, as described in other people's Stack Overflow / JavaRanch questions, is below. This gives each specific column its own name. However, even when I pull from dwireless (to take one example where the two tables have different variable types), I still get the integer value that should be from device_type.wireless for that kind of device, instead of the expected null value.
SELECT d.id as did, d.division_number as ddivision_number, d.current_location as dcurrent_location,
       d.type_id as dtype_id, d.model as dmodel, d.asset_tag as dasset_tag,
       d.serial_number as dserial_number, d.manufacture_date as dmanufacture_date,
       d.screen as dscreen, d.os as dos, d.users_name as dusers_name,
       d.office_installed as doffice_installed, d.memory as dmemory, d.series as dseries,
       d.ip_address as dip_address, d.host_name as dhost_name, d.vm_host as dvm_host,
       d.processor as dprocessor, d.wireless as dwireless, d.purchase_quality as dpurchase_quality,
       d.purchase_price as dpurchase_price, d.active as dactive, d.notes as dnotes,
       dt.type, os.os_name
FROM device d
inner join device_type dt on d.type_id = dt.id
left outer join operating_system os on d.os = os.id
WHERE d.division_number = 1 or 1
AND (d.asset_tag = 1 or 1)
AND (dt.type = 1 or 1);
Here is the Java code that I'm using to pull the variables from the result set.
private ObservableList<Device> getDevices(String sqlStatement) throws SQLException {
    ObservableList<Device> devices = FXCollections.observableArrayList();
    // Run the query to select all of our devices.
    Statement searchDeviceStatement = Main.myConn.createStatement();
    ResultSet rs = searchDeviceStatement.executeQuery(sqlStatement);
    // Add all of our devices into our array list to be displayed.
    // These are the actual columns of the database.
    while (rs.next()) {
        devices.add(new Device(rs.getInt("did"), rs.getInt("ddivision_number"), rs.getInt("dtype_id"),
                rs.getString("type"), rs.getString("dmodel"), rs.getInt("dasset_tag"), rs.getString("dserial_number"),
                rs.getInt("dmanufacture_date"), rs.getString("dscreen"), rs.getInt("dos"), rs.getString("os_name"),
                rs.getString("dusers_name"), rs.getInt("doffice_installed"), rs.getString("dmemory"),
                rs.getString("dseries"), rs.getString("dip_address"), rs.getString("dhost_name"),
                rs.getString("dvm_host"), rs.getString("dprocessor"), rs.getInt("dwireless"),
                rs.getString("dpurchase_quality"), rs.getDouble("dpurchase_price"), rs.getInt("dactive"), rs.getString("dnotes")));
    }
    return devices;
}
The query works as intended from MySQL Workbench, as seen here, but when displaying the data in my program it looks like this.
Answer: I am an idiot. The default value for an int is 0, while for a String it is null. JDBC is returning SQL NULL, but getInt() converts that to 0, so when my Device object is created with that value it defaults to 0. This is explained about 1/3rd of the way down on this page.
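A minimal sketch of the standard JDBC workaround (wasNull() reports whether the last column read was SQL NULL):

int wireless = rs.getInt("dwireless");
// getInt() returns 0 for SQL NULL, so check wasNull() to tell the two apart.
Integer wirelessOrNull = rs.wasNull() ? null : wireless;

// Alternatively, getObject() preserves the null directly:
Integer wireless2 = (Integer) rs.getObject("dwireless");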
I have been running the following query to find relatives within a certain "distance" of a given person:
#Query("start person=node({0}), relatives=node:__types__(className='Person') match p=person-[:PARTNER|CHILD*]-relatives where LENGTH(p) <= 2*{1} return distinct relatives")
Set<Person> getRelatives(Person person, int distance);
The 2*{1} comes from one conceptual "hop" between people being represented as two nodes - one Person and one Partnership.
This has been fine so far, on test populations. Now I'm moving on to actual data, which ranges in size from 1 to 10 million nodes, and this is taking forever (also from the data browser in the web interface).
Assuming the cost was from loading everything into ancestors, I rewrote the query as a test in the data browser:
start person=node(385716) match p=person-[:PARTNER|CHILD*1..10]-relatives where relatives.__type__! = 'Person' return distinct relatives
And that works fine, in fractions of a second on the same data store. But when I want to put it back into Java:
#Query("start person=node({0}) match p=person-[:PARTNER|CHILD*1..{1}]-relatives where relatives.__type__! = 'Person' return relatives")
Set<Person> getRelatives(Person person, int distance);
That won't work:
[...]
Nested exception is Properties on pattern elements are not allowed in MATCH.
"start person=node({0}) match p=person-[:PARTNER|CHILD*1..{1}]-relatives where relatives.__type__! = 'Neo4jPerson' return relatives"
^
Is there a better way of putting a path-length restriction in there? I would prefer not to use a WHERE clause, as that would involve loading ALL the paths, potentially touching millions of nodes when I need only go to a depth of 10. That would presumably leave me no better off.
Any ideas would be greatly appreciated!
Michael to the rescue!
My solution:
public Set<Person> getRelatives(final Person person, final int distance) {
    final String query = "start person=node(" + person.getId() + ") "
            + "match p=person-[:PARTNER|CHILD*1.." + 2 * distance + "]-relatives "
            + "where relatives.__type__! = '" + Person.class.getSimpleName() + "' "
            + "return distinct relatives";
    return this.query(query);
    // Where I would previously instead have called:
    // return personRepository.getRelatives(person, distance);
}

public Set<Person> query(final String q) {
    final EndResult<Person> result = this.template.query(q, MapUtil.map()).to(Neo4jPerson.class);
    final Set<Person> people = new HashSet<Person>();
    for (final Person p : result) {
        people.add(p);
    }
    return people;
}
Which runs very quickly!
You're almost there :)
Your first query is a full graph scan, which effectively loads the whole database into memory and pulls all nodes through this pattern match multiple times.
So it won't be fast; it would also return huge datasets. I don't know if that's what you want.
The second query looks good. The only thing is that you cannot parametrize the min/max values of variable-length relationships, due to how they affect query optimization / caching.
So for right now you'd have to go with template.query(), or with different query methods in your repo for different max values.
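A sketch of that second option (hypothetical method names; each depth bound is baked into its annotation since it can't be a parameter):

public interface PersonRepository extends GraphRepository<Person> {

    // Depth fixed at 1 conceptual hop (2 nodes, per the 2*distance convention).
    @Query("start person=node({0}) match p=person-[:PARTNER|CHILD*1..2]-relatives "
            + "where relatives.__type__! = 'Person' return distinct relatives")
    Set<Person> getRelativesWithinOneHop(Person person);

    // Depth fixed at 2 conceptual hops (4 nodes).
    @Query("start person=node({0}) match p=person-[:PARTNER|CHILD*1..4]-relatives "
            + "where relatives.__type__! = 'Person' return distinct relatives")
    Set<Person> getRelativesWithinTwoHops(Person person);
}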
I'm concerned about my Java client connecting directly to the MySQL server because of all of the issues that could occur and the security risks I believe it would pose, such as someone being able to decompile the file and get the login details for the database. As beautiful as it would be, I'm too scared to take that risk. Instead, I've written a PHP script to echo data that the client can interpret; the PHP script is what connects to MySQL.
It's rather simple: Java->PHP->MySQL
I'm going to provide screenshots of the MySQL structure, so you may better understand when trying to visualize this.
id: possibly tid/sid
tid: teacher id, used to link to the teacher
sid: student id, used to link to the student
gid: grade id
aid: assignment id
gp: gained points
pp: possible points
Grading rows exist for each assignment per student. So, for example, if a teacher had 30 students assigned to one assignment, there would be 30 rows in the grading table and one in the assignments table. Duplicate assignment names are NOT allowed.
When the client requests the data, I just use a BufferedReader and a URL to download the string. This is example output from when the client receives the assignment names.
Test Assignment;Secondary Test Assignment;
This is what it looks like to the client once the column names are downloaded:
As you can see, the first two columns are there by default; the last two are assignment names.
I want each row in the table to be a student. However, here is where my trouble comes in: I'm trying to receive the proper data from grading, and I don't know how I'm going to do this. I have about 3 months of experience with Java, so you could definitely call me a newbie.
Here is my idea, but I didn't think it was such a great idea:
Search through all of the column names and insert the value into the proper column of the row where the assignment name matches.
I don't know how difficult that would be. I'm guessing the nice people who developed Swing built something in for that, but I can't find any resources.
Does anyone have any recommendations on what to do in this situation? I feel lost.
Let's start with the Java client. Here is some code that reads from a PHP page and creates a JTable out of it. (Actually, it's reading from a String for simplicity, but you can easily change the code to match your real case; see the comment in the code.)
public static void main(String[] args) throws Exception {
    String receivedFromPHP = "Student ID;Student Name;Test Assignment;Secondary Test Assignment;\n"
            + "1;Luc;Test assignment 1;Secondary Test assignment 1;\n"
            + "2;Vador;Test assignment 2;Secondary Test assignment 2;";
    BufferedReader br = new BufferedReader(new StringReader(receivedFromPHP));
    // For real: br = new BufferedReader(new InputStreamReader(new URL("http://localhost/yourPhpPage.php").openStream()));
    DefaultTableModel dtm = new DefaultTableModel();
    String line;
    boolean headersReceived = false;
    while ((line = br.readLine()) != null) {
        String[] columns = line.split(";");
        if (!headersReceived) {
            dtm.setColumnIdentifiers(columns);
            headersReceived = true;
        } else {
            dtm.addRow(columns);
        }
    }
    JTable table = new JTable(dtm);
    JFrame f = new JFrame();
    f.add(new JScrollPane(table));
    f.pack();
    f.setLocationRelativeTo(null);
    f.setVisible(true);
    f.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
}
Nothing really difficult until now. The real work is writing the PHP page with the proper query. Obviously, you know best what you want your page to output, but I guess you are going for something like this:
<?php
$mysqli = new mysqli("localhost", "my_user", "my_password", "pinkfluf_dvonx");

/* check connection */
if ($mysqli->connect_errno) {
    printf("Connect failed: %s\n", $mysqli->connect_error);
    exit();
}

/* Select queries return a resultset */
if ($result = $mysqli->query("SELECT Name FROM City LIMIT 10")) {
    printf("Select returned %d rows.\n", $result->num_rows);
    /* free result set */
    $result->close();
}

/* If we have to retrieve a large amount of data we use MYSQLI_USE_RESULT */
if ($result = $mysqli->query('SELECT u.id AS "Student ID", u.username AS "Student Name", ... FROM members u, grading g, assignments a WHERE ...')) {
    while ($row = $result->fetch_array(MYSQLI_NUM)) {
        for ($i = 0; $i < sizeof($row); $i++) {
            echo $row[$i] . ";";
        }
        echo "\n";
    }
    $result->close();
}

$mysqli->close();
?>
Of course, the code I give here is very approximate (given the information I could extract from your question), so you'll certainly need to adapt it to make it work as you'd like, but I hope it helps you get started (keep going :)).
As far as securing your database goes, I'd recommend creating a locked-down user that can only execute stored procedures; then you don't have to worry about someone decompiling your code, since they'd only be able to access what your code can access. Here's a tutorial on how to do that.
As for your main question, I would recommend that all your data gathering/sorting be done in your SQL query. If you do it in the JTable, you end up mixing your Model and View (see MVC for more detail).
So essentially you want your data coming back from the query in this form:
Student; Student Name; Test Assignment; Secondary Test Assignment
Which means,
You need to add a relation between your grading table and your assignment table (most likely adding aid to the grading table).
You're going to need to come up with a slightly more complicated SQL query - something like this:
Select g.sid, g.name, a.name from ASSIGNMENTS a
join GRADING g on a.aid = g.aid
where g.tid = 123123 order by g.name
Create a 2D array based on the data and put it in the table. (If you're still using your PHP interface, you'll want to split the strings on your delimiters to create the 2D array, as in the sketch below.)
((DefaultTableModel) table.getModel()).setDataVector(data, columnNames);
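For instance, a minimal sketch of building those arrays from the semicolon-delimited response (variable names are illustrative; the format follows the example output above):

// Split the PHP response into a header row plus data rows.
String[] lines = receivedFromPHP.split("\n");
Object[] columnNames = lines[0].split(";");
Object[][] data = new Object[lines.length - 1][];
for (int i = 1; i < lines.length; i++) {
    data[i - 1] = lines[i].split(";");
}
((DefaultTableModel) table.getModel()).setDataVector(data, columnNames);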
EDIT
If you're convinced you just want to search through the rows for a value and then update a column in the row you find, this should get you going in the right direction:
Integer searchStudentID = 123123;
int searchColumn = 0;
String updateValue = "Value";
int updateColumn = 3;

// Look through the table for the right row
Vector<Vector<Object>> data = ((DefaultTableModel) table.getModel()).getDataVector();
for (Vector<Object> row : data) {
    // If on the right row, update it
    if (row.elementAt(searchColumn).equals(searchStudentID)) {
        row.setElementAt(updateValue, updateColumn);
    }
}
My problem is this: I am trying to process about 1.5 million rows of data in Spring, via JdbcTemplate, coming from MySQL. With such a large number of rows, I am using the RowCallbackHandler class, as suggested here.
The code is actually working, but it's SLOW... The thing is that no matter what I set the fetch size to, I seem to get approximately 350 records per fetch, with a 2 to 3 second delay between fetches (from observing my logs). I tried commenting out the store command and confirmed that the behavior stayed the same, so the problem is not with the writes.
There are 6 columns, only 1 of which is a varchar, and that one is only 25 characters long, so I can't see throughput being the issue.
Ideally I'd like to get more like 30000-50000 rows at a time. Is there a way to do that?
Here is my code:
protected void runCallback(String query, Map params, int fetchSize, RowCallbackHandler rch)
        throws DatabaseException {
    int oldFetchSize = getJdbcTemplate().getFetchSize();
    if (fetchSize > 0) {
        getJdbcTemplate().setFetchSize(fetchSize);
    }
    try {
        getJdbcTemplate().query(getSql(query), rch);
    }
    catch (DataAccessException ex) {
        logger.error(ExceptionUtils.getStackTrace(ex));
        throw new DatabaseException(ex.getMessage());
    }
    finally {
        // Restore the previous fetch size even if the query throws.
        getJdbcTemplate().setFetchSize(oldFetchSize);
    }
}
and the handler:
public class SaveUserFolderStatesCallback implements RowCallbackHandler {
    @Override
    public void processRow(ResultSet rs) throws SQLException {
        // Save each row sequentially.
        // Do NOT call ResultSet.next() !!!!
        Calendar asOf = Calendar.getInstance();
        log.info("AS OF DATE: " + asOf.getTime());
        Long x = rs.getLong("x");
        Long xx = rs.getLong("xx");
        String xxx = rs.getString("xxx");
        BigDecimal budgetAmountBD = rs.getBigDecimal("xxxx");
        Double xxxx = (budgetAmountBD == null) ? 0.0 : budgetAmountBD.doubleValue();
        BigDecimal actualAmountBD = rs.getBigDecimal("xxxxx");
        Double xxxxx = (actualAmountBD == null) ? 0.0 : actualAmountBD.doubleValue();
        dbstore(x, xx, xxx, xxxx, xxxxx, asOf);
    }
}
And what is your query? Try creating indexes for the fields you are searching/sorting on. That will help.
A second strategy: an in-memory cache implementation, or Hibernate plus its 2nd-level cache.
Both of these techniques can significantly speed up your query execution.
A few questions:
How long does it take if you query the DB directly? Another issue could be ASYNC_NETWORK_IO delay between the application and DB hosts.
Did you check it without using Spring?
The answer actually is to do setFetchSize(Integer.MIN_VALUE). While this totally violates the stated contract of Statement.setFetchSize, the MySQL Java connector uses this value to stream the result set, and it results in a tremendous performance improvement.
Another part of the fix is that I also needed to create my own subclass of (Spring) JdbcTemplate that would accommodate the negative fetch size. Actually, I took the code example from here, where I first found the idea of setting fetchSize(Integer.MIN_VALUE):
http://javasplitter.blogspot.com/2009/10/pimp-ma-jdbc-resultset.html
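Here is roughly what that subclass looks like (a sketch, assuming a Spring version whose applyStatementSettings() only forwards fetch sizes > 0 to the Statement):

import java.sql.SQLException;
import java.sql.Statement;
import org.springframework.jdbc.core.JdbcTemplate;

public class StreamingJdbcTemplate extends JdbcTemplate {
    @Override
    protected void applyStatementSettings(Statement stmt) throws SQLException {
        int fetchSize = getFetchSize();
        if (fetchSize < 0) {
            // Pass the negative value straight through: MySQL Connector/J
            // treats Integer.MIN_VALUE as "stream the result set row by row".
            stmt.setFetchSize(fetchSize);
            int timeout = getQueryTimeout();
            if (timeout > 0) {
                stmt.setQueryTimeout(timeout);
            }
        } else {
            super.applyStatementSettings(stmt);
        }
    }
}

// Usage: call template.setFetchSize(Integer.MIN_VALUE) before running the streaming query.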
Thank you both for your help!