Tokenize textual content using Spark SQL? - java

I am working on implementing a requirement to create a dictionary of words to documents using Apache Spark and MongoDB.
In my scenario I have a Mongo collection in which each document has some text fields along with a field for the owner of the document.
I want to parse the text content of the collection's documents and create a dictionary which maps words to the document and owner fields. Basically, the key would be a word and the value would be the _id and owner fields.
The idea is to provide auto-suggestions specific to the user, based on the user's own documents, as he/she types in the text box in the UI.
A user can create multiple documents and a word can appear in multiple documents, but each document is created by only one user.
I used the MongoDB Spark connector and I am able to load the collection's documents into a DataFrame using Spark SQL.
I am not sure how to process the textual data, which is now in one of the DataFrame columns, to extract the words.
Is there a way, using Spark SQL, to process the text content in the DataFrame column to extract/tokenize words, map them to the _id and owner fields, and write the results to another collection?
If not, can someone please let me know the right approach/steps for how I can achieve it?

Spark has support for tokenization and other text-processing tasks, but it's not in the core library. Check out Spark MLlib:
https://spark.apache.org/docs/2.1.0/ml-guide.html
and more specifically the Transformers that work on DataFrames, such as the Tokenizer:
https://spark.apache.org/docs/2.1.0/ml-features.html#tokenizer
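For example, a minimal sketch in Java, assuming df is the DataFrame you loaded through the MongoDB Spark connector and that the text column is called "text" (the column names here are assumptions, adjust them to your schema):

import org.apache.spark.ml.feature.Tokenizer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.functions;

// Split the assumed "text" column into a "words" array column
Tokenizer tokenizer = new Tokenizer()
    .setInputCol("text")
    .setOutputCol("words");
Dataset<Row> tokenized = tokenizer.transform(df);

// One row per (word, _id, owner) triple; this result can then be written
// back to another collection through the MongoDB Spark connector
Dataset<Row> dictionary = tokenized.select(
    functions.explode(functions.col("words")).as("word"),
    functions.col("_id"),
    functions.col("owner"));

RegexTokenizer gives you more control (lowercasing, token pattern) if plain whitespace splitting isn't enough.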

ETL design: What Queue should I use instead of my SQL table and still be able to process in parallel?

I need your help with redesigning my system. We have a very simple but also very old ETL, and now that we handle a massive amount of data it has become extremely slow and inflexible.
The first process is the collector process:
Collector process - always up
The collector collects messages from the queue (RabbitMQ).
It parses the message properties (JSON format) into a Java object (for example, if the JSON contains fields like 'id', 'name' and 'color', we create a Java object with an int field 'id', a String field 'name' and a String field 'color').
After parsing, we write the object to a CSV file as a CSV row with all the properties of the object.
We send an ack and continue to the next message in the queue.
Processing workflow - happens once every hour
A process named 'Loader' loads all the CSV files (the collector's output) into a DB table named 'Input' using SQL INFILE LOAD; all new rows have a 'Not handled' status. The Input table is like a queue in this design.
A process named 'Processor' reads from the table all the records with 'Not handled' status, transforms them into Java objects, does some enrichment and then inserts the records into another table named 'Output' with new fields. Each iteration we process 1000 rows in parallel, using JDBC batch update for the DB insert.
The major problem in this flow:
The messages are not flexible in the existing flow. If, for example, I want to add a new property to the JSON message (say, to also add 'city'), I also have to add a 'city' column to the table (because of the CSV file load). The table contains a massive amount of data and it's not possible to add a column every time the message changes.
My conclusion
The table is not the right choice for this design.
I have to get rid of the CSV writing and remove the 'Input' table to have a flexible system. I thought of maybe using a queue instead of the table, like Kafka, and maybe using tools such as Kafka Streams for the enrichment. This would give me flexibility, and I wouldn't need to add a column to a table every time I want to add a field to the message.
The huge problem is that I won't be able to process in parallel like I do today.
What can I use instead of the table that will allow me to process the data in parallel?
Yes, using Kafka will improve this.
Ingestion
Your process that currently writes CSV files can instead publish to a Kafka topic. This can possibly be a replacement for RabbitMQ, depending on your requirements and scope.
Loader (optional)
Your other process, which loads data in the initial format and writes it to a database table, can instead publish to another Kafka topic in the format you want. This step can be omitted if you can write directly in the format the processor wants.
Processor
The way you use the 'Not handled' status is a way to treat your data as a queue, but this is handled by design in Kafka, which uses a log (whereas a relational database is modeled as a set).
The processor subscribes to the messages written by the loader or the ingestion. It transforms them into Java objects and does some enrichment, but instead of inserting the result into a new table, it can publish the data to a new output topic.
Instead of doing work in batches ("each iteration we process 1000 rows in parallel, using JDBC batch update for the DB insert"), with Kafka and stream processing this is done in a continuous, real-time stream, as data arrives.
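A minimal Kafka Streams sketch of such a processor, assuming String messages and hypothetical topic names ("collected-messages" and "enriched-messages"):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "enrichment-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> input = builder.stream("collected-messages");
// Placeholder enrichment: replace the mapValues body with your real logic
input.mapValues(value -> value.toUpperCase())
     .to("enriched-messages");

new KafkaStreams(builder.build(), props).start();

Parallelism comes from topic partitions: with N partitions you can run up to N instances (or stream threads) of this processor, which replaces the "1000 rows in parallel" batching.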
Schema evolvability
"If I want, for example, to add a new property to the JSON message (for example to also add 'city'), I also have to add a 'city' column to the table (because of the CSV infile load); the table contains a massive amount of data and it's not possible to add a column every time the message changes."
You can solve this by using an Avro schema when publishing to the Kafka topic.
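New fields can then be added with a default value, so existing consumers keep working and no table migration is needed. A sketch using Avro's SchemaBuilder (record and field names here are just placeholders):

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

Schema messageSchema = SchemaBuilder.record("CollectedMessage")
    .fields()
    .requiredInt("id")
    .requiredString("name")
    .requiredString("color")
    // Newly added field: the default keeps old messages readable
    .name("city").type().stringType().stringDefault("")
    .endRecord();

With a schema registry you can additionally enforce compatibility rules on schema changes.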

Searching and storing values from CSV

I'm a Java beginner and want to learn how to read in files and store data in a way that makes it easy to manipulate.
I have a pretty big CSV file (18,000 rows). The data represents the assortment of all the different beverages sold by a liquor shop. It has 16 or so columns, with headers like "article number", "name", "producer", "amount of alcohol", etc. The columns are separated by "\t".
I now want to do some searching in this file to find things like how many products are produced in Sweden, or the most expensive liquor per liter.
Since I really want to learn how to program and not just find the answer, I'm not looking for any exact code here. I'm instead looking for the pseudo-code behind it, a good way of thinking when dealing with large sets of data, and what kinds of data structures are best suited to the task.
Let's take the "How many products are from Sweden" example.
Since the data consists of strings, ints and floats, I can't put everything in one list. What is the best way of storing it so it can be manipulated later? Or can I find the answer as soon as the file is parsed, so that maybe I don't have to store it at all?
If you're new to Java and programming in general I'd recommend a library to help you view and use your data, without getting into databases and learning SQL. One that I've used in the past is Commons CSV.
https://commons.apache.org/proper/commons-csv/user-guide.html#Parsing_files
It lets you easily parse a whole CSV file into CSVRecord objects. For example:
import java.io.FileReader;
import java.io.Reader;
import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVRecord;

// Column names are taken from the first (header) record
Reader in = new FileReader("path/to/file.csv");
Iterable<CSVRecord> records = CSVFormat.EXCEL.withFirstRecordAsHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}
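Since your file is tab-separated, you would use CSVFormat.TDF instead of CSVFormat.EXCEL. For a question like "How many products are from Sweden" you can simply count while iterating, so you never need to keep the whole file in memory. A sketch, assuming a header column named "Country" (adjust to whatever your file actually calls it):

Reader in = new FileReader("path/to/products.csv");
Iterable<CSVRecord> records = CSVFormat.TDF.withFirstRecordAsHeader().parse(in);

int swedishProducts = 0;
for (CSVRecord record : records) {
    // The "Country" header name is an assumption; use your real header
    if ("Sweden".equalsIgnoreCase(record.get("Country"))) {
        swedishProducts++;
    }
}
System.out.println("Products from Sweden: " + swedishProducts);

For questions that need the data more than once (most expensive per liter, grouping by producer, and so on), parse each CSVRecord into a small class with typed fields and keep those objects in a List.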
If you have a CSV file in particular, then you may use a database to store this data.
You can go through how to read a CSV in Java using this link.
Make use of an ORM framework like Hibernate together with a Spring application. Use this link to create the application.
By using this you can create queries to fetch data like "How many products are from Sweden" and make use of the Collections framework; a rough HQL sketch is shown below. Use this link for HQL queries in the same application.
Create JSP pages to show the results in the UI.
Sorry for my English.
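If you go down that road, the count query could look roughly like this (a sketch assuming a Beverage entity mapped to the imported table with a country property; the entity and property names are assumptions):

// Hibernate Session obtained from your SessionFactory
Long count = (Long) session.createQuery(
        "select count(b) from Beverage b where b.country = :country")
    .setParameter("country", "Sweden")
    .uniqueResult();
System.out.println("Products from Sweden: " + count);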
It seems you are looking for an in-memory SQL engine over your CSV file. I would suggest using CQEngine, which provides an indexed view on top of the Java Collections Framework with SQL-like queries.
You are basically treating a Java collection as a database table. Assuming that each CSV line maps to some POJO class like Beverage:
IndexedCollection<Beverage> table = new ConcurrentIndexedCollection<Beverage>();
table.addIndex(NavigableIndex.onAttribute(Beverage.BEVERAGE_ID));
table.add(new Beverage(...));
table.add(new Beverage(...));
table.add(new Beverage(...));
What you need to do now is read the CSV file, load it into the IndexedCollection, and then build proper indexes on some of the fields. After that, you can query the collection like a usual SQL database. At the end, write the collection back out to a new CSV file (if you made any modifications).
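For the "How many products are from Sweden" example, a query could look roughly like this (a sketch; the COUNTRY attribute is hypothetical and would be declared on Beverage the same way as BEVERAGE_ID):

import static com.googlecode.cqengine.query.QueryFactory.equal;
import com.googlecode.cqengine.resultset.ResultSet;

// Beverage.COUNTRY is an assumed CQEngine attribute over the country field
ResultSet<Beverage> swedish = table.retrieve(equal(Beverage.COUNTRY, "Sweden"));
System.out.println("Products from Sweden: " + swedish.size());

Adding an index (for example a HashIndex) on the queried attribute keeps such lookups fast even with all 18,000 rows loaded.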

Couchbase Java SDK - Partially update document

I have a document that has a data model corresponding to a user.
The user has an addresses array, a phone array and an email array.
I perform CRUD operations on this data using the Java SDK for Couchbase.
I have a constraint: I need to get all the document data in order to display the data associated with the user. In the UI I can modify everything except the data contained in the arrays (phone, email and addresses).
How can I update only that data when I update the document?
When I try to use the JsonIgnore annotation on the array accessors when serializing the user object, the arrays are removed from the document when the Couchbase Java replace method is applied.
Is there a way to partially update documents with the Java SDK for Couchbase?

MarkLogic search and retrieve specific fields

I am fairly new to MarkLogic (and NoSQL) and am currently trying to learn the Java client API. My question is about searching, which returns search result snippets / matches: is it possible for the search result to include specific fields from the document?
For example, given this document:
{"id":"1", "type":"classified", "description": "This is a classified type."}
And I search using this:
QueryManager queryMgr = client.newQueryManager();
StringQueryDefinition query = queryMgr.newStringDefinition();
query.setCriteria("classified");
queryMgr.search(query, resultsHandle);
How can I get the JSON document's 3 defined fields (id, type, description) as part of the search result - so I can display them in my UI table?
Do I need to hit the DB again by loading the document via URI (thus if I have 1000 records, that means hitting the DB again 1000 times)?
You have several options for retrieving specific fields with your search results. You could use the Pojo Data Binding Interface. You could read multiple documents matching a query, which brings back the entirety of each document, which you can then get as a POJO, a String, or any other handle. Or you can use the same API you're using above but add search options that allow you to extract just a portion of each matching document.
If you're bringing back thousands of matches, you're probably not showing all those snippets to end users, so you should probably disable snippeting using something like
<transform-results apply="empty-snippet" />
in your options.
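For extracting specific fields, the same options node can carry an extract-document-data element (a sketch, assuming MarkLogic 8+; the paths match the example document above):

<options xmlns="http://marklogic.com/appservices/search">
  <extract-document-data selected="include">
    <extract-path>/id</extract-path>
    <extract-path>/type</extract-path>
    <extract-path>/description</extract-path>
  </extract-document-data>
  <transform-results apply="empty-snippet"/>
</options>

The extracted values come back inside each search match, so there is no need to fetch every matching document by URI afterwards.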

Insert database (DB2) tuples as XML elements in an XML file using Java?

How can I insert database (DB2) tuples as XML elements into an XML file using Java?
Is there any possibility of retrieving XML elements which were entered earlier as database tuples? Or can they be used to provide a view customized for different users?
Although it would help to see a bit of an example of what you're trying to accomplish, I am fairly certain that a couple of different XML features in DB2 (referred to collectively as pureXML) can help your application convert smoothly between XML documents and relational data.
Publishing tuples/rows as XML is done with SQL/XML functions such as XMLELEMENT, XMLATTRIBUTES, XMLFOREST, XMLAGG, and XMLSERIALIZE, to name a few. These functions have been available since DB2 V8.1, when they were introduced as part of the SQL:2003 spec. Other DBMS vendors support these functions in their products, too. To produce more sophisticated XML constructs, such as hierarchical data relationships and repeating elements, you will probably want to exploit common table expressions that use XMLAGG or XMLGROUP.
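For example, publishing rows of a hypothetical employee table as XML and writing them to a file from Java could look roughly like this (table, column and element names are assumptions, and connection is an open java.sql.Connection):

import java.io.FileWriter;
import java.io.PrintWriter;
import java.sql.ResultSet;
import java.sql.Statement;

String sql =
    "SELECT XMLSERIALIZE(CONTENT " +
    "         XMLELEMENT(NAME \"employee\", " +
    "           XMLATTRIBUTES(e.emp_id AS \"id\"), " +
    "           XMLFOREST(e.firstname AS \"first\", e.lastname AS \"last\")) " +
    "       AS CLOB(1M)) AS empxml " +
    "FROM employee e";

try (Statement stmt = connection.createStatement();
     ResultSet rs = stmt.executeQuery(sql);
     PrintWriter out = new PrintWriter(new FileWriter("employees.xml"))) {
    out.println("<employees>");
    while (rs.next()) {
        out.println(rs.getString("empxml"));   // one <employee> element per row
    }
    out.println("</employees>");
}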
XML data can be stored natively in DB2 v9.1 and newer by using the XML datatype, which produces a column that accepts any well-formed XML input. If you instead want to decompose/shred the inbound XML into one or more columns of a relational table, the XMLTABLE function takes in an XML document and your XPath expressions to convert the relevant nodes into a traditional result set that can be referenced by a SQL insert statement.
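Going the other way, shredding stored XML back into relational rows could look roughly like this (again a sketch with assumed table and element names; dept_docs is a table with an XML column named doc):

String shred =
    "INSERT INTO employee (emp_id, firstname, lastname) " +
    "SELECT x.emp_id, x.firstname, x.lastname " +
    "FROM dept_docs d, " +
    "     XMLTABLE('$doc/dept/employee' PASSING d.doc AS \"doc\" " +
    "       COLUMNS " +
    "         emp_id    INTEGER     PATH '@id', " +
    "         firstname VARCHAR(20) PATH 'name/first', " +
    "         lastname  VARCHAR(25) PATH 'name/last') AS x";

try (Statement stmt = connection.createStatement()) {
    int inserted = stmt.executeUpdate(shred);   // one relational row per <employee> node
    System.out.println("Rows inserted: " + inserted);
}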
