Is there an open-source, file-based (NOT in-memory) JDBC driver for CSV files? My CSV files are generated dynamically from the UI according to the user's selections, and each user will have a different CSV file. I'm doing this to reduce database hits, since the information is already contained in the CSV file. I only need to perform SELECT operations.
HSQLDB allows indexed searches if we specify an index, but I won't be able to provide a unique column that can be used as an index, so it performs the SQL operations in memory.
Edit:
I've tried CSVJDBC, but it doesn't support simple operations like ORDER BY and GROUP BY. It is also unclear whether it reads from the file or loads everything into memory.
I've tried xlSQL, but that again relies on HSQLDB and only works with Excel, not CSV. Plus, it is no longer developed or supported.
I've tried H2, but that only reads CSV; it doesn't support SQL on it.
You can solve this problem using the H2 database.
The following groovy script demonstrates:
Loading data into the database
Running a "GROUP BY" and "ORDER BY" sql query
Note: H2 supports in-memory databases, so you have the choice of persisting the data or not.
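import groovy.sql.Sql

// Create the database (the H2 driver jar must be on the classpath).
// Recent H2 versions require an explicit ./ for relative file paths.
def sql = Sql.newInstance("jdbc:h2:./db/csv", "user", "pass", "org.h2.Driver")
// Load CSV file. Unless a column list is passed as a second argument,
// CSVREAD treats the first line of the file as column names.
sql.execute("CREATE TABLE data (id INT PRIMARY KEY, message VARCHAR(255), score INT) AS SELECT * FROM CSVREAD('data.csv')")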
// Print results
def result = sql.firstRow("SELECT message, score, count(*) FROM data GROUP BY message, score ORDER BY score")
assert result[0] == "hello world"
assert result[1] == 0
assert result[2] == 5
// Cleanup
sql.close()
Sample CSV data:
0,hello world,0
1,hello world,1
2,hello world,0
3,hello world,1
4,hello world,0
5,hello world,1
6,hello world,0
7,hello world,1
8,hello world,0
9,hello world,1
10,hello world,0
If you try the SourceForge project csvjdbc, please report your experiences. The documentation says it is useful for importing CSV files.
Project page
This was discussed on Superuser https://superuser.com/questions/7169/querying-a-csv-file.
You can use the Text Tables feature of HSQLDB: http://hsqldb.org/doc/2.0/guide/texttables-chapt.html
csvsql/gcsvsql are also possible solutions (but there is no JDBC driver, so you will have to run a command-line program for your query).
SQLite is another solution, but you have to import the CSV file into a database before you can query it.
Alternatively, there is commercial software such as http://www.csv-jdbc.com/ which will do what you want.
To do anything with a file, you have to load it into memory at some point. What you could do is open the file and read it line by line, discarding the previous line as you read in a new one. The only downside to this approach is that it is linear: every query has to scan the whole file. Have you thought about using something like memcache on a server, where you keep key-value stores in memory that you can query instead of dumping to a CSV file?
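The line-by-line approach could look roughly like the sketch below. It is only an illustration: CsvScan and selectByColumn are made-up names, the in-line data stands in for a real file, and the split on commas assumes a simple CSV with no quoted fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class CsvScan {

    /**
     * Returns the rows whose field at columnIndex equals wanted.
     * Only the current line and the matches are held in memory.
     * Assumes a simple CSV: no quoted fields, no embedded commas.
     */
    static List<String[]> selectByColumn(Reader source, int columnIndex, String wanted)
            throws IOException {
        List<String[]> matches = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
                if (columnIndex < fields.length && fields[columnIndex].equals(wanted)) {
                    matches.add(fields);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        // In practice, pass a FileReader for the user's CSV file instead.
        String csv = "0,hello world,0\n1,hello world,1\n2,hello world,0\n";
        List<String[]> rows = selectByColumn(new StringReader(csv), 2, "0");
        System.out.println(rows.size() + " matching rows");
    }
}
```

Sorting or grouping would still have to happen on the matches in memory, which is the limitation mentioned above.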
You can either use a specialized JDBC driver, like CsvJdbc (http://csvjdbc.sourceforge.net), or configure a database engine such as MySQL to treat your CSV file as a table and then work with it through the standard JDBC driver.
The trade-off here is available SQL features vs. performance.
Direct access to the CSV file via CsvJdbc (or similar) allows very quick operations on big data volumes, but without the ability to sort or group records using SQL commands.
The MySQL CSV engine provides a rich set of SQL features, but at the cost of performance.
So if your table is relatively small, go with MySQL. However, if you need to process big files (> 100 MB) without grouping or sorting, go with CsvJdbc.
If you need both, that is, to handle very big files and to manipulate them using SQL, then the best course of action is to load the CSV file into a normal database table (e.g. in MySQL) first and then handle the data as a regular SQL table.
As of NiFi 1.7.1, the new DBCPConnectionPoolLookup enables dynamic selection of database connections: set an attribute database.name on a FlowFile and when a consuming processor accesses a configured DBCPConnectionPoolLookup controller service, the content of that attribute will be used to get a connection through this lookup's configured properties, which contain a mapping of potential values to DBCPConnectionPool controller service.
I'd like to list the tables in each database that I've configured in the lookup, but the ListDatabaseTables processor does not accept incoming FlowFiles. This seems to mean that it's not usable for listing tables in a dynamic set of databases.
What is the best way to accomplish this?
ListDatabaseTables uses the JDBC API for getting table info from the metadata of an established JDBC connection. This hides the underlying method of how to actually get tables from a particular database.
If all your databases are of the same ilk, then if you have a list of databases, you could generate flow files with one per database, filling in the database.name attribute, then using ExecuteSQL with the DBCPConnectionPoolLookup to execute the corresponding SQL statement to get the tables for that database, such as SHOW TABLES. You can parse the records using any of the record-aware processors such as QueryRecord, UpdateRecord, ConvertRecord, etc. and if you need one table per flow file you can use SplitRecord. If the output is JSON or CSV or XML, you could use EvaluateJsonPath, ExtractText, or EvaluateXPath respectively to get the table name into an attribute, and continue on from there.
I wrote up NIFI-5519 to cover the proposal for ListDatabaseTables to optionally accept incoming connections. In the meantime, you'd need one ListDatabaseTables instance for each of your DBCPConnectionPool instances.
I have a requirement to read values from a PDF file and save the result in a database.
I have converted the PDF to text.
The text data now looks like this:
Test Name Results Units Bio. Ref. Interval
LIPID PROFILE, BASIC, SERUM
Cholesterol Total 166.00 mg/dL <200.00
Triglycerides 118.00 mg/dL <150.00
My requirement is to read the table data from the PDF file and save it in the MySQL database as is.
Use java.io to read the text file and JDBC to save the information in MySQL via SQL.
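As a rough sketch of the reading half, the result rows can be picked out of the converted text with a regular expression. LabReportParser and Result are made-up names, and the pattern assumes each data row ends with a numeric result, a unit, and a reference interval, none of which contain spaces. Header and section lines have no such shape and are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for the sample text shown in the question.
class LabReportParser {

    /** One parsed row of the results table. */
    static final class Result {
        final String testName;
        final double value;
        final String unit;
        final String referenceInterval;

        Result(String testName, double value, String unit, String referenceInterval) {
            this.testName = testName;
            this.value = value;
            this.unit = unit;
            this.referenceInterval = referenceInterval;
        }
    }

    // Groups: test name, numeric result, unit, reference interval.
    private static final Pattern ROW =
            Pattern.compile("^(.+?)\\s+(\\d+(?:\\.\\d+)?)\\s+(\\S+)\\s+(\\S+)$");

    /** Parses every line that looks like a data row and ignores the rest. */
    static List<Result> parse(List<String> lines) {
        List<Result> results = new ArrayList<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line.trim());
            if (m.matches()) {
                results.add(new Result(m.group(1), Double.parseDouble(m.group(2)),
                        m.group(3), m.group(4)));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "Test Name Results Units Bio. Ref. Interval",
                "LIPID PROFILE, BASIC, SERUM",
                "Cholesterol Total 166.00 mg/dL <200.00",
                "Triglycerides 118.00 mg/dL <150.00");
        for (Result r : parse(lines)) {
            System.out.println(r.testName + " | " + r.value + " | " + r.unit
                    + " | " + r.referenceInterval);
        }
    }
}
```

Each Result could then be bound to a JDBC PreparedStatement INSERT; the table and column names depend on your schema, so they are left out here.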
LOAD DATA INFILE '/testing.csv'
IGNORE INTO TABLE Test_table FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
IGNORE 1 ROWS (@col1)
SET test_ID="100",
test_reg_ID ="26003",
VALUE =@col1;
testing.csv:
TABLE Test_table
Question 1: Help me convert this to jOOQ 3.6.
Question 2: I want to skip the empty data that falls on row 1.
jOOQ has a CSV import API for that purpose. Here's how you'd translate that MySQL command to jOOQ:
DSL.using(configuration)
.loadInto(TEST_TABLE)
.loadCSV(new File("/testing.csv"))
.fields(TEST_TABLE.VALUE)
.separator(',')
// Available in jOOQ 3.10 only: https://github.com/jOOQ/jOOQ/issues/5737
// .lineSeparator("\n")
.ignoreRows(1)
.execute();
Note that jOOQ's Loader API doesn't support those default expressions as MySQL does (see #5740):
SET test_ID="100",
test_reg_ID ="26003"
There are a few workarounds:
You could patch the CSV data and prepend those columns before loading them.
You could use DSLContext.fetchFromCSV() and then use stream().map() to prepend the missing data, before using the alternative Record import API rather than the suggested CSV import API.
You could run a simple UPDATE statement right after the import for this data.
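The first workaround, patching the CSV data, needs nothing beyond the JDK. Here is a minimal sketch, assuming a simple CSV without quoted line breaks; CsvPrepend and prepend are made-up names, and the constant values are the ones from the question. It drops the leading rows and blank lines and prefixes every remaining line with the constant columns.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper, not part of jOOQ.
class CsvPrepend {

    /**
     * Copies lines from in to out, dropping the first skipRows lines and any
     * blank lines, and prefixing each remaining line with prefix plus a comma.
     */
    static void prepend(Reader in, Writer out, int skipRows, String prefix)
            throws IOException {
        BufferedReader reader = new BufferedReader(in);
        BufferedWriter writer = new BufferedWriter(out);
        String line;
        int lineNo = 0;
        while ((line = reader.readLine()) != null) {
            if (lineNo++ < skipRows || line.isEmpty()) {
                continue; // drop header rows and blank lines
            }
            writer.write(prefix);
            writer.write(",");
            writer.write(line);
            writer.write("\n");
        }
        writer.flush();
    }

    public static void main(String[] args) throws IOException {
        // In-line sample data; in practice use FileReader and FileWriter.
        StringWriter patched = new StringWriter();
        prepend(new StringReader("col1\n5\n7\n"), patched, 1, "100,26003");
        System.out.print(patched);
    }
}
```

The patched data then has three columns, so the loader's field list would need TEST_ID and TEST_REG_ID in front of VALUE, and the header row no longer has to be ignored at load time.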
A note on performance
Do note that jOOQ's loader API can be fine-tuned by specifying bulk, batch, and commit sizes. Nevertheless, the database's out-of-the-box import implementation is very likely to still be much faster than any client side import that has to go through JDBC.
I have a table with approximately 62,000,000 rows, and I need to select data from it and export it to .txt or .csv.
My query limits the result to approximately 60,000 rows.
When I run the query on my development machine, it eats all the memory and I get a java.lang.OutOfMemoryError.
At the moment I use Hibernate for the DAO, but I can switch to a pure JDBC solution if you recommend it.
My pseudo-code is:
List<Map> list = myDao.getMyData(Params param); //program crash here
initFile();
for(Map map : list){
util.append(map); //this transform row to file
}
closeFile();
How would you suggest I write my file?
Note: I use .setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP); to get Map instead of any Entity
You could use Hibernate's ScrollableResults. See the documentation here: http://docs.jboss.org/hibernate/orm/4.3/manual/en-US/html/ch11.html#objectstate-querying-executing-scrolling
This uses server-side cursors, if your database engine / database driver supports them. For this to work, be sure to set the following properties:
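query.setReadOnly(true);
query.setCacheable(false);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
try {
    while (results.next()) {
        // get() returns the current row as Object[]; the first element is the entity
        SomeEntity entity = (SomeEntity) results.get()[0];
        // write the entity to the file here instead of collecting it in a list
    }
} finally {
    results.close();
}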
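To keep memory flat, each row should be appended to the file as soon as it is read rather than collected into a list first. A minimal sketch of that idea in plain Java, assuming rows arrive as maps (as with ALIAS_TO_ENTITY_MAP): RowExporter and export are made-up names, and the iterator stands in for whatever cursor you use, such as ScrollableResults or a JDBC ResultSet.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Collections;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical helper, not part of Hibernate or JDBC.
class RowExporter {

    /**
     * Writes each row as one comma-separated line and then lets it go,
     * so only the current row is held in memory. Returns the row count.
     * Values are written with String.valueOf and are not quoted or escaped.
     */
    static int export(Iterator<Map<String, Object>> rows, Writer out) throws IOException {
        int written = 0;
        while (rows.hasNext()) {
            Map<String, Object> row = rows.next();
            String line = row.values().stream()
                    .map(String::valueOf)
                    .collect(Collectors.joining(","));
            out.write(line);
            out.write("\n");
            written++;
        }
        out.flush();
        return written;
    }

    public static void main(String[] args) throws IOException {
        // A LinkedHashMap keeps the column order stable in the output.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("id", 1);
        row.put("name", "example");
        // In practice, pass a BufferedWriter on the target file instead.
        StringWriter out = new StringWriter();
        int count = export(Collections.singletonList(row).iterator(), out);
        System.out.print(out);
        System.out.println(count + " row(s) written");
    }
}
```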
Lock the table, then select and export the data in subsets, appending each subset to the results file. Make sure you unconditionally unlock the table when done.
It's not nice, but the task will run to completion even on servers or clients with limited resources.
I'm working on a system where users need to be able to upload an Excel file to the server; the system then needs to process the Excel file to load data into the XMPie uProduce system.
I already have it working for loading CSV files into the system. I can confirm that the Excel files have been uploaded to the server successfully. However, when my program then tries to access the Excel file in order to read the data, it gets this error:
The Microsoft Jet database engine could not find the object 'Sheet1'. Make sure the object exists and that you spell its name and the path name correctly.
I am setting the filter as:
select * from [Sheet1]
I have also tried it as:
select * from [filename.xls]
Neither has worked. Does anyone have any suggestions for what the SQL filter should be for pulling data from a database?
Try this..
Writing an Excel query is similar to writing a query against any other traditional data store such as SQL Server or Oracle. However, there are a few differences. First, you specify your sheet name, followed by $, instead of a table name. Next, you can give start and end cell references to limit the range. Look carefully at the following code:
SELECT * FROM [users$A1:F500]
Here users is the sheet name.
When specifying Excel sheet names in an SQL query via ADO or similar, you have to put a $ symbol at the end of the sheet name. Try:
SELECT * FROM [Sheet1$]
More info here