Saving data in a for loop in DynamoDB AWS - Java

I have to save data in 6 tables in DynamoDB AWS. Can I use a 'for' loop and save the tables one by one as shown below?
DynamoDBMapper mapper = new DynamoDBMapper(dynamoDB);
for (int i = 0; i < 6; i++) {
    mapper.save(<TABLE 1 DATA>); // and loop and save data in every table
}
Does it look fine, or can it create problems because I am doing a database operation in a loop?
My tables are very small (5 columns).
Thanks
Kailash

Saving items one by one in a for loop is a bad idea; use the batch write item API instead: dynamoDB.batchWriteItem(TableWriteItems... yourMultipleTableWriteItems)
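A minimal sketch of that call with the AWS SDK for Java Document API; the table names ("Table1", "Table2") and item attributes are placeholders for your own:
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.document.BatchWriteItemOutcome;
import com.amazonaws.services.dynamodbv2.document.DynamoDB;
import com.amazonaws.services.dynamodbv2.document.Item;
import com.amazonaws.services.dynamodbv2.document.TableWriteItems;

void saveAllTables() {
    DynamoDB dynamoDB = new DynamoDB(AmazonDynamoDBClientBuilder.defaultClient());

    // One TableWriteItems per target table; table names and attributes are placeholders.
    TableWriteItems table1Writes = new TableWriteItems("Table1")
            .withItemsToPut(new Item().withPrimaryKey("id", "1").withString("name", "foo"));
    TableWriteItems table2Writes = new TableWriteItems("Table2")
            .withItemsToPut(new Item().withPrimaryKey("id", "1").withNumber("count", 42));

    // A single call writes to multiple tables in one request (up to 25 items per request).
    BatchWriteItemOutcome outcome = dynamoDB.batchWriteItem(table1Writes, table2Writes);
    // Anything the service could not process is returned and should be retried.
    System.out.println(outcome.getUnprocessedItems());
}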

If you only need to download the data from the table into a local file, for example a CSV, you can use the CLI tool https://github.com/zshamrock/dynocsv to export data from your table into a CSV file.

Related

JMeter: Using a CSV file with uneven columns to drive a sampler

I have a CSV file that has a list of stores. For every store there are 10 departments.
I need to make a GET API call for all 10 departments in each of 100 stores, so my columns in the CSV file are not even: I have column A with 100 store IDs, and column B with 10 department IDs.
How can I use every store ID 10 times (once with every department ID) in a JMeter sampler?
If you want to achieve this using the CSV Data Set Config, the only way is splitting your CSV file into 2 separate files.
If the CSV file comes from an external source and cannot be changed, you can consider using the __groovy() function like:
${__groovy(new File('test.csv').readLines().get(vars.get('__jm__Loop Controller - Store__idx') as int).split('\,')[0],)}
Given example CSV file test.csv with the following contents:
store1,department1
store2,department2
,department3
,department4
,department5
,department6
,department7
,department8
,department9
,department10
More information on Groovy scripting in JMeter: Apache Groovy - Why and How You Should Use It

How to replace whole SQL table data frequently?

I have a Spring application that runs a cron job. Every few minutes the cron fetches new data from an external API. The data should be stored in a database (MySQL) in place of the old data, i.e. the old data should be overwritten by the new data rather than updated. The application itself provides a REST API so the client is able to get the data from the database. There should never be a situation where the client sees empty or partial data from the database because an update is in progress.
Currently I delete all the old data and insert the new data, but there is a window in which a client gets only part of the data. I've tried it via the Spring Data deleteAll and saveAll methods.
@Override
@Transactional
public List<Country> overrideAll(@NonNull Iterable<Country> countries) {
    removeAllAndFlush();
    List<CountryEntity> countriesToCreate = stream(countries.spliterator(), false)
            .map(CountryEntity::from)
            .collect(toList());
    List<CountryEntity> createdCountries = repository.saveAll(countriesToCreate);
    return createdCountries.stream()
            .map(CountryEntity::toCountry)
            .collect(toList());
}

private void removeAllAndFlush() {
    repository.deleteAll();
    repository.flush();
}
I also thought about having a temporary table that gets the new data and, when the data is complete, just replacing the main table with the temporary table. Is it a good idea? Any other ideas?
It's a good idea. You can minimize the downtime by working on another table until it's ready and then switch tables quickly by renaming. This will also improve perceived performance for users because no records need to be locked, as happens when using UPDATE/DELETE.
In MySQL, you can use RENAME TABLE if you don't have triggers on the table. It allows renaming multiple tables at once and works atomically (i.e. like a transaction: if any error happens, no change is made). For example:
RENAME TABLE countries TO countries_old, countries_new TO countries;
DROP TABLE countries_old;
Refer here for more details
https://dev.mysql.com/doc/refman/5.7/en/rename-table.html
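A minimal sketch of the full swap using Spring's JdbcTemplate; it assumes a JdbcTemplate is available, that countries is a Collection<Country> exposing getCode()/getName(), and that the staging table is named countries_new:
// Build the new data in a staging table, then swap it in atomically.
jdbcTemplate.execute("DROP TABLE IF EXISTS countries_new");
jdbcTemplate.execute("CREATE TABLE countries_new LIKE countries");
jdbcTemplate.batchUpdate(
        "INSERT INTO countries_new (code, name) VALUES (?, ?)",
        countries,
        500,
        (ps, country) -> {
            ps.setString(1, country.getCode());
            ps.setString(2, country.getName());
        });
// Atomic swap: readers see either the complete old data or the complete new data.
jdbcTemplate.execute("RENAME TABLE countries TO countries_old, countries_new TO countries");
jdbcTemplate.execute("DROP TABLE countries_old");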

Why Spark dataframe cache doesn't work here

I just wrote a toy class to test Spark dataframe (actually Dataset since I'm using Java).
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
//
System.out.println(ds.count());
According to my understanding, there are 2 actions, "insertInto" and "count".
I debugged the code step by step; when running "insertInto", I see several lines like:
19/01/21 20:14:56 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
When running "count", I still see similar logs:
19/01/21 20:15:26 INFO FileScanRDD: Reading File path: hdfs://ip:9000/root/hive/warehouse/test2.db/dummy/dt=2018-12-12/000000_0, range: 0-451, partition values: [2018-12-12]
I have 2 questions:
1) When there are 2 actions on the same dataframe as above, if I don't call ds.cache or ds.persist explicitly, will the 2nd action always cause the SQL query to be re-executed?
2) If I understand the log correctly, both actions trigger HDFS file reading. Does that mean ds.cache() doesn't actually work here? If so, why doesn't it work?
Many thanks.
It's because you append into the table that ds is created from, so ds needs to be recomputed because the underlying data changed. In such cases, Spark invalidates the cache. See for example this Jira (https://issues.apache.org/jira/browse/SPARK-24596):
When invalidating a cache, we invalid other caches dependent on this
cache to ensure cached data is up to date. For example, when the
underlying table has been modified or the table has been dropped
itself, all caches that use this table should be invalidated or
refreshed.
Try to run the ds.count before inserting into the table.
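A minimal reordering of the snippet from the question to illustrate that suggestion; nothing new is assumed beyond the original code:
Dataset<Row> ds = spark.sql("select id,name,gender from test2.dummy where dt='2018-12-12'");
ds = ds.withColumn("dt", lit("2018-12-17"));
ds.cache();
// Run an action before modifying test2.dummy, so the result is computed
// (and the cache populated) from the original data.
long countBeforeAppend = ds.count();
// Per the answer above, appending into test2.dummy invalidates caches that depend on it,
// but countBeforeAppend was already computed from the old data.
ds.write().mode(SaveMode.Append).insertInto("test2.dummy");
System.out.println(countBeforeAppend);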
I found that the other answer doesn't work. What I had to do was break the lineage so that the df I was writing does not know that one of its sources is the table I am writing to. To break the lineage, I created a copy of the df using
copy_of_df = sql_context.createDataFrame(df.rdd)
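That snippet is PySpark; a rough Java equivalent of the same lineage-breaking trick, assuming the Dataset and table from the question, could look like:
// Rebuild the Dataset from its RDD and schema so the new Dataset no longer
// carries the plan that references test2.dummy.
Dataset<Row> copyOfDs = spark.createDataFrame(ds.javaRDD(), ds.schema());
copyOfDs.write().mode(SaveMode.Append).insertInto("test2.dummy");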

Java: select large table and export to file

I have a table with approximately 62,000,000 rows, and I need to select data from it and export it to a .txt or .csv file.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it eats all the memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for the DAO, but I can change to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(Params param); //program crashes here
initFile();
for (Map map : list) {
    util.append(map); //this transforms a row into a file line
}
closeFile();
Any suggestions on how I should write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP); to get a Map instead of an entity.
You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine / database driver supports them. For this to work, be sure you set the following properties:
query.setReadOnly(true);
query.setCacheable(false);
// FORWARD_ONLY keeps a server-side cursor instead of loading the whole result set.
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    SomeEntity entity = (SomeEntity) results.get()[0];
    // append the row to your .txt/.csv file here
}
results.close();
Lock the table and then perform the selection and export in subsets, appending to the results file. Ensure you unconditionally unlock when done.
Not nice, but the task will run to completion even on servers or clients with limited resources.
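A rough sketch of that chunked-export approach with plain JDBC and keyset pagination; the table and column names (my_table, id, col_a, col_b), the chunk size, and the MySQL LOCK TABLES syntax are assumptions to adapt to your schema:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.*;

void exportInChunks(String url, String user, String pass) throws Exception {
    try (Connection con = DriverManager.getConnection(url, user, pass);
         PrintWriter out = new PrintWriter(new BufferedWriter(new FileWriter("export.csv")))) {
        try (Statement st = con.createStatement()) {
            st.execute("LOCK TABLES my_table READ"); // keep the data stable during the export
        }
        try {
            final int chunkSize = 10_000;
            long lastId = 0;
            int rows;
            do {
                rows = 0;
                try (PreparedStatement ps = con.prepareStatement(
                        "SELECT id, col_a, col_b FROM my_table WHERE id > ? ORDER BY id LIMIT " + chunkSize)) {
                    ps.setLong(1, lastId);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            lastId = rs.getLong("id");
                            out.println(lastId + "," + rs.getString("col_a") + "," + rs.getString("col_b"));
                            rows++;
                        }
                    }
                }
            } while (rows == chunkSize); // a short chunk means we reached the end
        } finally {
            try (Statement st = con.createStatement()) {
                st.execute("UNLOCK TABLES"); // unconditionally unlock when done
            }
        }
    }
}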

File-based (not in-memory) JDBC driver for CSV files

Is there an open source file-based (NOT in-memory) JDBC driver for CSV files? My CSV files are dynamically generated from the UI according to the user's selections, and each user will have a different CSV file. I'm doing this to reduce database hits, since the information is contained in the CSV file. I only need to perform SELECT operations.
HSQLDB allows indexed searches if we specify an index, but I won't be able to provide a unique column that can be used as an index, so it does the SQL operations in memory.
Edit:
I've tried CSVJDBC, but that doesn't support simple operations like ORDER BY and GROUP BY. It is still unclear whether it reads from the file or loads it into memory.
I've tried xlSQL, but that again relies on HSQLDB and only works with Excel, not CSV. Plus it's no longer developed or supported.
I've also looked at H2, but that only reads the CSV; it doesn't support SQL over it.
You can solve this problem using the H2 database.
The following Groovy script demonstrates:
Loading data into the database
Running a "GROUP BY" and "ORDER BY" sql query
Note: H2 supports in-memory databases, so you have the choice of persisting the data or not.
import groovy.sql.Sql

// Create the database (a file-backed H2 database, so the data is persisted on disk)
def sql = Sql.newInstance("jdbc:h2:db/csv", "user", "pass", "org.h2.Driver")
// Load CSV file
sql.execute("CREATE TABLE data (id INT PRIMARY KEY, message VARCHAR(255), score INT) AS SELECT * FROM CSVREAD('data.csv')")
// Print results
def result = sql.firstRow("SELECT message, score, count(*) FROM data GROUP BY message, score ORDER BY score")
assert result[0] == "hello world"
assert result[1] == 0
assert result[2] == 5
// Cleanup
sql.close()
Sample CSV data:
0,hello world,0
1,hello world,1
2,hello world,0
3,hello world,1
4,hello world,0
5,hello world,1
6,hello world,0
7,hello world,1
8,hello world,0
9,hello world,1
10,hello world,0
If you check out the SourceForge project csvjdbc, please report your experiences. The documentation says it is useful for importing CSV files.
Project page
This was discussed on Superuser https://superuser.com/questions/7169/querying-a-csv-file.
You can use the Text Tables feature of hsqldb: http://hsqldb.org/doc/2.0/guide/texttables-chapt.html
csvsql/gcsvsql are also possible solutions (but there is no JDBC driver, you will have to run a command line program for your query).
SQLite is another solution, but you have to import the CSV file into a database before you can query it.
Alternatively, there is commercial software such as http://www.csv-jdbc.com/ which will do what you want.
To do anything with a file you have to load it into memory at some point. What you could do is just open the file and read it line by line, discarding the previous line as you read in a new one. The only downside to this approach is its linearity. Have you thought about using something like memcached on a server, where you query key-value stores held in memory instead of dumping to a CSV file?
You can either use a specialized JDBC driver, like CsvJdbc (http://csvjdbc.sourceforge.net), or you may choose to configure a database engine such as MySQL to treat your CSV as a table and then manipulate your CSV through a standard JDBC driver.
The trade-off here is available SQL features vs. performance.
Direct access to the CSV via CsvJdbc (or similar) gives you very quick operations on big data volumes, but without the ability to sort or group records using SQL commands.
The MySQL CSV engine provides a rich set of SQL features, but at the cost of performance.
So if your table is relatively small, go with MySQL. However, if you need to process big files (> 100 MB) without grouping or sorting, go with CsvJdbc.
If you need both, i.e. to handle very big files and to manipulate them using SQL, the optimal course of action is to load the CSV into a normal database table (e.g. MySQL) first and then handle the data as a usual SQL table.
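A minimal sketch of the CsvJdbc route, using the driver class and URL format from the CsvJdbc documentation; the directory path and the users.csv file with name and score columns are assumptions:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

void queryCsv() throws Exception {
    Class.forName("org.relique.jdbc.csv.CsvDriver");
    // The directory becomes the "database"; each CSV file inside it is exposed as a table.
    try (Connection con = DriverManager.getConnection("jdbc:relique:csv:/path/to/csv/dir");
         Statement st = con.createStatement();
         ResultSet rs = st.executeQuery("SELECT name, score FROM users WHERE score > 10")) {
        while (rs.next()) {
            System.out.println(rs.getString("name") + " -> " + rs.getInt("score"));
        }
    }
}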
