I want to implement a Java application that can connect to any SQL server and load any table from it. For each table I want to create a histogram based on some arbitrary columns.
For example, if I have this table:
name profit
------------
name1 12
name2 14
name3 18
name4 13
I can create a histogram with bin size 4, based on the min and max values of the profit column, and count the number of records in each bin.
The result is:
profit count
---------------
12-16 3
16-20 1
My current solution is to retrieve all the data for the required columns, then construct the bins and group the records using the Java stream Collectors.groupingBy.
I'm not sure whether this solution is efficient, so I'd like help finding a better approach, especially for a large number of records (for example, by taking advantage of the SQL server itself or of other frameworks).
Is there a better algorithm for this problem?
Edit 1:
Assume my SQL result is in List<Object[]> data.
private String mySimpleHash(Object[] row, int index) {
StringBuilder hash = new StringBuilder();
for (int i = 0; i < row.length; i++)
if (i != index)
hash.append(row[i]).append(":");
return hash.toString();
}
//index is index of column for histogram
List<Object[]> histogramData = new ArrayList<>();
final Map<String, List<Object[]>> map = data.stream().collect(
Collectors.groupingBy(row -> mySimpleHash(row, index)));
for (final Map.Entry<String, List<Object[]>> entry : map.entrySet()) {
Object[] newRow = newData.get(rowNum);
double result = entry.getValue().stream()
.mapToDouble(row ->
Double.valueOf(row[index].toString())).count();
newRow[index] = result;
histogramData.add(newRow);
}
As you suspected, performing the aggregation after pulling all the data out of SQL Server becomes very expensive as the number of rows in your tables increases. You can simply do the aggregation within SQL. Depending on how you express your histogram bins, this is either trivial or requires some work. In your case, the requirement that the lowest bin start at the min value needs a little setup, as opposed to binning starting from 0. See the sample below. The inner query maps values to a bin number; the outer query aggregates and computes the bin boundaries.
CREATE TABLE Test (
Name varchar(max) NOT NULL,
Profit int NOT NULL
)
INSERT Test(Name, Profit)
VALUES
('name1', 12),
('name2', 14),
('name3', 18),
('name4', 13)
DECLARE @minValue int = (SELECT MIN(Profit) FROM Test)
DECLARE @binSize int = 4
SELECT
(@minValue + @binSize * Bin) AS BinLow,
(@minValue + @binSize * Bin) + @binSize - 1 AS BinHigh,
COUNT(*) AS Count
FROM (
SELECT
((Profit - @minValue) / @binSize) AS Bin
FROM
Test
) AS t
GROUP BY Bin
| BinLow | BinHigh | Count |
|--------|---------|-------|
| 12 | 15 | 3 |
| 16 | 19 | 1 |
http://sqlfiddle.com/#!18/d093c/9
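For comparison, here is a minimal Java sketch of the same binning arithmetic done in memory. It is only an illustration under assumptions: the class name HistogramSketch and the method histogram are made up for this example, and it takes the column values as a plain List<Integer> rather than reading them from a database. Each value is mapped to the lower bound of its bin with min + ((v - min) / binSize) * binSize, which mirrors the inner query above.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical helper, not part of the original question or answer.
final class HistogramSketch {

    // Maps each bin's lower bound to the number of values in
    // [lower, lower + binSize). Bins start at the minimum value.
    static Map<Integer, Long> histogram(List<Integer> values, int binSize) {
        if (values.isEmpty()) {
            return new TreeMap<>();
        }
        int min = values.stream().mapToInt(Integer::intValue).min().getAsInt();
        return values.stream().collect(Collectors.groupingBy(
                v -> min + ((v - min) / binSize) * binSize, // lower bound of the bin
                TreeMap::new,                               // keep bins sorted
                Collectors.counting()));
    }

    public static void main(String[] args) {
        // Sample data from the question; prints {12=3, 16=1}
        System.out.println(histogram(List.of(12, 14, 18, 13), 4));
    }
}
```

This still needs every row on the client, so for large tables the SQL version is likely the better choice; the sketch is mainly useful for checking the bin boundaries on small samples.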
I have a list which is a java object like below.
public class TaxIdentifier {
public String id;
public String gender;
public String childId;
public String grade;
public String isProcessed;
// ... getters and setters ...
}
Records in the DB look like this:
id gender childId grade isProcessed
11 M 111 3 Y
12 M 121 4 Y
11 M 131 2 Y
13 M 141 5 Y
14 M 151 1 Y
15 M 161 6 Y
List<TaxIdentifier> taxIdentifierList = new ArrayList<TaxIdentifier>();
for (TaxIdentifier taxIdentifier : taxIdentifierList) {
}
While processing the for loop, when I get id = 11 I have to check whether there are other records with id = 11, process them together and do a DB operation, then take the next record (12 in this case), check whether there are other records with id = 12, and so on.
One option is to take the id and query the DB for all records with id = 11, and so on.
But this is too much back and forth with the DB.
What is the best way to do this in Java? Please advise.
If you need to process all the records in the corresponding database table anyway, you should retrieve all of them in one database round trip.
After that, you can collect all your TaxIdentifier records in a dictionary (map) data structure and process them in whatever way you want.
A brief example may look like this:
Map<String, List<TaxIdentifier>> result = repository.findAll().stream().collect(Collectors.groupingBy(TaxIdentifier::getId));
Here all the TaxIdentifier records are grouped by TaxIdentifier's id. All the records with id equal to "11" can be retrieved and processed this way:
List<TaxIdentifier> taxIdentifiersWithId11 = result.get("11");
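To make the grouping idea concrete, here is a small self-contained sketch. It rests on assumptions: the nested TaxIdentifier class is a trimmed-down stand-in for the one in the question, and GroupByIdSketch and groupById are names invented for this example. A LinkedHashMap is used so the ids come back in the order they were first seen.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical example class, not from the original post.
final class GroupByIdSketch {

    // Trimmed-down stand-in for the TaxIdentifier class in the question.
    static final class TaxIdentifier {
        final String id;
        final String childId;

        TaxIdentifier(String id, String childId) {
            this.id = id;
            this.childId = childId;
        }

        String getId() { return id; }
        String getChildId() { return childId; }
    }

    // Groups records by id, keeping ids in first-seen order.
    static Map<String, List<TaxIdentifier>> groupById(List<TaxIdentifier> records) {
        return records.stream().collect(Collectors.groupingBy(
                TaxIdentifier::getId, LinkedHashMap::new, Collectors.toList()));
    }

    public static void main(String[] args) {
        List<TaxIdentifier> records = List.of(
                new TaxIdentifier("11", "111"),
                new TaxIdentifier("12", "121"),
                new TaxIdentifier("11", "131"));
        // One pass over the groups; each batch could feed a single DB operation.
        groupById(records).forEach((id, batch) ->
                System.out.println(id + " -> " + batch.size() + " record(s)"));
    }
}
```

Iterating over the map entries gives you each id together with all of its records, so no further lookups against the database are needed while processing.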
I would leverage the power of your database. When you query, you should order your query by id in ascending order. Not sure which database you are using but I would do:
SELECT * FROM MY_DATABASE WHERE IS_PROCESSED = 'N' ORDER BY ID ASC;
That takes the load of sorting it off of your application and onto your database. Then your query returns unprocessed records with the lowest id's on top. Then just sequentially work through them in order.
I'm using MySQL and fetching a few different values from a table and then performing some basic math on them. Currently three separate SELECT statements are in use, and afterwards I perform some simple addition and subtraction in Java with the outputs I get.
I'm trying to optimize my code, but sadly I have to admit I'm a complete SQL noob. I'm pretty sure there's a way to join these SELECT queries and the calculations so that I only get one output, but I've not been able to find it.
My table looks something like this:
ID | value | inc | timestamp
--------------------------------------
0 | 5 | 4 | 2018-02-01 10:28:21
1 | 8 | 3 | 2018-02-01 10:28:47
...
My code currently looks like this:
int maxValue = MySQL.executeQuery("SELECT MAX(`value`) AS value FROM `table` where ID = idvalue AND `timestamp` >= TIMESTAMPADD(DAY,-3,NOW())");
int minValue = MySQL.executeQuery("SELECT MIN(`value`) AS value FROM `table` where ID = idvalue AND `timestamp` >= TIMESTAMPADD(DAY,-3,NOW())");
int minInc = MySQL.executeQuery("SELECT `inc` FROM `table` where ID = id AND value = minValue");
int output = maxValue - minValue + minInc;
Is there a way to shorten it to a single
int output = MYSQL.executeQuery( ??? );
?
Simply do select (select ...) - (select ...) + (select ...).
In your case, you can do the following (not tested in a real environment):
select (SELECT MAX(`value`) AS value FROM `table` where ID = idvalue AND `timestamp` >= TIMESTAMPADD(DAY,-3,NOW())) - ( SELECT MIN(`value`) AS value FROM `table` where ID = idvalue AND `timestamp` >= TIMESTAMPADD(DAY,-3,NOW())) + (SELECT `inc` FROM `table` where ID = id AND `value` = (SELECT MIN(`value`) FROM `table` where ID = idvalue AND `timestamp` >= TIMESTAMPADD(DAY,-3,NOW())))
First off, there's something funky going on with the maxValue and minValue selects. The Max() and Min() operators will give you the max and min values of a given column of a given set of rows. Using one of these operators with such a specific where (by what seems to be a table's primary key) is probably not what you want to be doing.
Now, answering your question, I think you could do something along the lines of:
SELECT MAX(`value`) as max, MIN(`value`) as min
FROM `table` as t
WHERE ...
to "join" (careful with this word) the first 2 queries. This is simple select syntax: usually, there's no problem with selecting more than a column or an aggregate function at a time. Or, something like:
SELECT `inc`
FROM `table`
WHERE ID = id AND
`value` = (SELECT MIN(`value`) FROM `table` WHERE ...)
to "join" the last two.
A single statement is possible using INNER JOIN, since you are using a single table. Try this:
SELECT MAX(a.`value`) - MIN(a.`value`) + b.`inc` AS Output
FROM `table` a
INNER JOIN `table` b ON a.ID=b.ID
AND a.ID = idvalue AND a.`timestamp` >= TIMESTAMPADD(DAY,-3,NOW())
AND b.`value`=(select MIN(`value`) from `table` WHERE ID=id);
I am having trouble filtering a Dataset<Row> using the MEAN() and STDDEV() built-in functions in the org.apache.spark.sql.functions library.
This is the set of data I am working with (top 10):
Name Size Volumes
File1 1030 107529
File2 997 106006
File3 1546 112426
File4 2235 117335
File5 2061 115363
File6 1875 114015
File7 1237 110002
File8 1546 112289
File9 1030 107154
File10 1339 110276
What I am currently trying to do is find the outliers in this dataset. For that, I need to find the rows where the SIZE and VOLUMES are outliers using the 95% rule: μ - 2σ ≤ X ≤ μ + 2σ
This is the SQL-like query that I would like to run on this Dataset:
SELECT * FROM DATASET
WHERE size < (SELECT (AVG(size)-2*STDEV(size)) FROM DATASET)
OR size > (SELECT (AVG(size)+2*STDEV(size)) FROM DATASET)
OR volumes < (SELECT (AVG(volumes)-2*STDEV(volumes)) FROM DATASET)
OR volumes > (SELECT (AVG(volumes)+2*STDEV(volumes)) FROM DATASET)
I don't know how to implement nested queries and I'm struggling to find a way to solve this.
Also, if you happen to know other way of getting what I want, feel free to share it.
This is what I attempted to do but I get an error:
Column meanSize = functions.mean(size);
Column meanRecords = functions.mean(records);
Column stdSize = functions.stddev(size);
Column stdRecords = functions.stddev(records);
Column lowerSizeThreshold = size.lt((meanSize.minus(stdSize).minus(stdSize)));
Column upperSizeThreshold = size.gt(meanSize.plus(stdSize).plus(stdSize));
Column lowerRecordsThreshold = records.lt(meanRecords.minus(stdRecords).minus(stdRecords));
Column upperRecordsThreshold = records.gt(meanRecords.plus(stdRecords).plus(stdRecords));
Dataset<Row> outliers = dataFrame.where(lowerSizeThreshold.or(upperSizeThreshold).or(lowerRecordsThreshold).or(upperRecordsThreshold));
You asked about Java, which I'm currently not using at all, so here is a Scala version that I hope will help you work out the corresponding Java version.
What about the following solution?
// preparing the dataset
val input = spark.
read.
text("input.txt").
as[String].
filter(line => !line.startsWith("Name")).
map(_.split("\\W+")).
withColumn("name", $"value"(0)).
withColumn("size", $"value"(1) cast "int").
withColumn("volumes", $"value"(2) cast "int").
select("name", "size", "volumes")
scala> input.show
+------+----+-------+
| name|size|volumes|
+------+----+-------+
| File1|1030| 107529|
| File2| 997| 106006|
| File3|1546| 112426|
| File4|2235| 117335|
| File5|2061| 115363|
| File6|1875| 114015|
| File7|1237| 110002|
| File8|1546| 112289|
| File9|1030| 107154|
|File10|1339| 110276|
+------+----+-------+
// the final computation
import org.apache.spark.sql.functions._
val (sizeAvg, sizeStddev, volumesAvg, volumesStddev) = input.
groupBy().
agg(avg("size"), stddev("size"), avg("volumes"), stddev("volumes")).
as[(Double, Double, Double, Double)].
head
val sizeLessThanStddev = col("size") < (sizeAvg - 2 * sizeStddev)
input.filter(sizeLessThanStddev)
This is only the first part of the four-part filter; I'm leaving the rest as a home exercise.
Thanks for your comments guys.
So this solution is for the Java implementation of Spark. If you want the Scala implementation, look at Jacek Laskowski's post.
Solution:
//first convert the columns Size and Records to a List<Double>
List<Double> sizeList = dataFrame.select("Size").javaRDD().map(f -> f.getDouble(0)).collect();
List<Double> recordsList = dataFrame.select("Records").javaRDD().map(f -> f.getDouble(0)).collect();
//then convert the lists into JavaDoubleRDD
JavaDoubleRDD size = sparkContext.parallelizeDoubles(sizeList);
JavaDoubleRDD records = sparkContext.parallelizeDoubles(recordsList);
//calculate the mean and stddev using the built in functions:
double sizeMean = size.mean();
double sizeStdev = size.stdev();
double recordsMean = records.mean();
double recordsStdev = records.stdev();
After that I can finally use those values in the Column comparison functions.
Hope you understood what I did.
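To show the threshold arithmetic on its own, here is a plain Java sketch of the mean ± 2·stddev rule without any Spark dependency. The class OutlierSketch and its methods are names made up for this illustration, and it assumes the column has already been collected into a List<Double> with at least two values. It uses the sample standard deviation (dividing by n - 1). As far as I know, Spark's functions.stddev is also the sample version, while JavaDoubleRDD.stdev() is the population version, so the two approaches in this thread can give slightly different thresholds.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Spark API.
final class OutlierSketch {

    static double mean(List<Double> xs) {
        double sum = 0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.size();
    }

    // Sample standard deviation (divides by n - 1); needs at least two values.
    static double sampleStddev(List<Double> xs) {
        double m = mean(xs);
        double squares = 0;
        for (double x : xs) {
            squares += (x - m) * (x - m);
        }
        return Math.sqrt(squares / (xs.size() - 1));
    }

    // Returns the values outside [mean - 2 * stddev, mean + 2 * stddev].
    static List<Double> outliers(List<Double> xs) {
        double m = mean(xs);
        double s = sampleStddev(xs);
        List<Double> result = new ArrayList<>();
        for (double x : xs) {
            if (x < m - 2 * s || x > m + 2 * s) {
                result.add(x);
            }
        }
        return result;
    }
}
```

The same two numbers (mean and stddev) per column are what the Spark versions above compute once and then reuse in the filter condition.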
I have written an app which helps organize home bills. The problem is that more than one person can live in one home, and one person can have more than one home (e.g. me - in both cases :) ). So I've decided to let the user bind a contractor (payment receiver) to multiple users and multiple homes.
In my database there are concatenation tables between accounts and contractors and between homes and contractors. Great, isn't it?
Now, the point is that I'm getting a list of related users (or houses) as an SQL array, and I finally keep it as an Integer[] array. I've made some dummy database entries so I can test the functionality, and it works fine.
But... I have absolutely no idea how I should properly store changed values in the database. The structure of my tables is:
Users
id | username | ....
1 | user1 | ...
2 | user2 | ...
Contractors
id | name | ...
1 | contractor1 | ...
users_contractors
user_id | contractor_id | is_deleted
1 | 1 | false
1 | 2 | false
etc .....
So what I have is: an array of users related to a contractor and a second array of users related to that contractor (the modified one). Now I need to store the values in the DB. When a user + contractor pair does not exist, I need to insert that relation. If it already exists in the database but does not exist in my array (e.g. the connection was deleted), I need to update the relation table and mark it as is_deleted=true.
I've found some solutions for comparing two arrays, but they all assume that the arrays are the same length, and they only compare values at the same index.
So what I need is to compare not the arrays as such, but the array values (whether one array contains values from the other, or the opposite). Can this be achieved without a for loop inside a for loop?
Thank you in advance.
Tom
Is there any reason why you are using arrays instead of Lists/Collections? These can help you search for items and make it easier to compare two of them.
I don't have an IDE at hand now, so here is some pseudocode:
// Collect all the values from both lists (a HashSet prevents duplicates)
Set<Integer> all = new HashSet<>();
all.addAll(A);
all.addAll(B);
//for each loop
for (int i : all) {
boolean inA = A.contains(i);
boolean inB = B.contains(i);
if (inA && inB) {
// You can figure out the rest of these statements I think
}
}
Thanks to @DrIvol - I've managed to solve the issue using the code:
List<Integer> allUsers = new ArrayList<Integer>();
allUsers.addAll(bean.getUserId());
allUsers.addAll(bean.getNewUserId());
for(Integer i : allUsers) {
Boolean oldValue = bean.getUserId().contains(i);
Boolean newValue = bean.getNewUserId().contains(i);
if(oldValue && newValue) {
System.out.println(i + " value in both lists");
// Nothing to do
} else if (oldValue && !newValue) {
System.out.println(i + " value removed");
// Set value as deleted
} else if(!oldValue && newValue) {
System.out.println(i + " value added");
// Insert new value to concat table
}
}
It has one problem: if a value was in the first list and is still in the second list (no modification), it's checked twice. But since I don't need to do anything with this value, it's acceptable for now. Someday, when I finish the beta version, I'll do some optimisations and add a deduplicator for the list :)
Thank you very much!
Tom
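For what it's worth, the same comparison can be sketched with set differences, which avoids both the nested loop and the double check of unchanged values. This is only an illustration: RelationDiffSketch, added and removed are names invented for this example, and it assumes the old and new user ids are available as collections of Integer.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not from the original posts.
final class RelationDiffSketch {

    // Ids present in newIds but not in oldIds: relations to insert.
    static Set<Integer> added(Collection<Integer> oldIds, Collection<Integer> newIds) {
        Set<Integer> result = new LinkedHashSet<>(newIds);
        result.removeAll(new HashSet<>(oldIds));
        return result;
    }

    // Ids present in oldIds but not in newIds: relations to mark as deleted.
    static Set<Integer> removed(Collection<Integer> oldIds, Collection<Integer> newIds) {
        Set<Integer> result = new LinkedHashSet<>(oldIds);
        result.removeAll(new HashSet<>(newIds));
        return result;
    }
}
```

With old ids 1, 2, 3 and new ids 2, 3, 4, added gives the set containing 4 and removed gives the set containing 1; ids in both lists never show up, so nothing is visited twice.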
I have a table called friendgraph(friend_id, friend_follower_id) and I want to calculate 6-degrees of separation for a given friend and a given degree.
The table looks like this:
friend_id, friend_follower_id
0,1
0,9
1,47
1,12
2,41
2,66
2,75
3,65
3,21
3,4
3,94
4,64
4,32
How do I go about building a query that, given friend_id_a and order_k, finds the users who are k degrees apart from friend_id_a?
This is what my initial query file looks like:
create or replace function degree6
(friend_id_a in integer, order_k in integer)
return sys_refcursor as crs sys_refcursor;
I am looking for any kind of help or resources that will get me started and eventually arrive at the output.
UPDATE:
The output would be a list of other friends k degrees apart from friend_id_a.
Define order-k follower B of A such that B is not A, and:
1. if k = 0, then A is the only order-0 follower of A.
2. if k = 1, then followers of A are order-1 followers of A.
3. if k > 1; then for i = 0 to k-1, B is not order-i follower of A; and B is a follower
of a order-(k-1) follower of A
Thanks.
You can build a hierarchical query and filter by level and friend_id. For example, to get all friends of user 0 at level 3:
SELECT friend_id, friend_follower_id, level
FROM friendgraph
WHERE LEVEL = 3
CONNECT BY PRIOR friend_follower_id = friend_id
START WITH friend_id = 0
On Oracle 11gR2 or later we can use the recursive subquery-factoring syntax to do this.
with friends (id, follower, lvl) as
( select friend_id, friend_follower_id, 1
from friendgraph
where friend_id = 0
union all
select fg.friend_id, fg.friend_follower_id, f.lvl + 1
from friendgraph fg
join
friends f
on (fg.friend_id = f.follower)
where f.lvl + 1 <= 3
)
select *
from friends
/
Here's one way of implementing this in a function with a Ref Cursor:
create or replace function degree6
(friend_id_a in integer
, order_k in integer)
return sys_refcursor
is
return_value sys_refcursor;
begin
open return_value for
with friends (id, follower, lvl) as
( select friend_id, friend_follower_id, 1
from friendgraph
where friend_id = friend_id_a
union all
select fg.friend_id, fg.friend_follower_id, f.lvl + 1
from friendgraph fg
join
friends f
on (fg.friend_id = f.follower)
where f.lvl + 1 <= order_k
)
select *
from friends;
return return_value;
end degree6;
/
Using it in SQL*Plus:
SQL> var rc refcursor
SQL> exec :rc := degree6(0,3)
PL/SQL procedure successfully completed.
SQL> print rc
ID FOLLOWER LVL
---------- ---------- ----------
0 1 1
0 9 1
1 12 2
1 47 2
SQL>
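As a cross-check for the order-k definition in the question, here is a small Java sketch that applies it as a breadth-first search over an in-memory edge list. It is written under assumptions: DegreeSketch and orderK are invented names, each edge is an int pair {friend_id, friend_follower_id}, and the whole table fits in memory. Unlike the queries above, it keeps a set of already visited users, so a user reached at a lower order is not reported again at a higher one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of the original answers.
final class DegreeSketch {

    // Returns the order-k followers of start; each edge is {friend_id, friend_follower_id}.
    static Set<Integer> orderK(int[][] edges, int start, int k) {
        // Adjacency list: friend_id -> its direct followers.
        Map<Integer, List<Integer>> followers = new HashMap<>();
        for (int[] e : edges) {
            followers.computeIfAbsent(e[0], x -> new ArrayList<>()).add(e[1]);
        }
        Set<Integer> seen = new HashSet<>();
        seen.add(start);
        Set<Integer> frontier = new TreeSet<>();
        frontier.add(start); // order 0 is the start user itself
        for (int level = 0; level < k; level++) {
            Set<Integer> next = new TreeSet<>();
            for (int node : frontier) {
                for (int f : followers.getOrDefault(node, List.of())) {
                    // seen.add returns false for users already found at a lower order
                    if (seen.add(f)) {
                        next.add(f);
                    }
                }
            }
            frontier = next;
        }
        return frontier;
    }
}
```

On the sample rows from the question, user 0 has order-1 followers 1 and 9 and order-2 followers 12 and 47, which matches the LVL column in the SQL*Plus output above.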