Streaming Kmeans Spark JAVA - java

Hi. We want to use Kafka and Spark Streaming to catch Twitter spam for our thesis, and I would like to use StreamingKMeans. I have a very basic but serious question:
In this Spark StreamingKMeans Scala example (https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/StreamingKMeansExample.scala) there is one line of code for prediction:
model.predictOnValues(testData.map(lp => (lp.label, lp.features))).print()
Why do I need to pass the label along with the features? Am I misunderstanding the whole idea? Isn't the label the thing we want to predict? How am I going to predict whether my tweets are spam or not?

For the prediction only lp.features is used, whereas lp.label is considered as a key that is carried over. Quoting from the docs:
Use the model to make predictions on the values of a DStream and carry over its keys.
I guess in your example you would simply want to replace predictOnValues with predictOn and pass it only the feature vectors (testData.map(_.features)).
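To make the key carry-over concrete, here is a minimal plain-Java sketch. It deliberately does not use the Spark API: the class KeyedPrediction, the two centroids and the tweet ids are made up for illustration. It shows a key passing through untouched while only the value (the feature vector) is used to pick a cluster. Note that the result is a cluster index, not a spam label: k-means is unsupervised, so deciding which cluster corresponds to spam is a separate step.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is NOT the Spark StreamingKMeans API.
class KeyedPrediction {

    // Returns the index of the centroid closest to the point (squared Euclidean distance).
    static int predict(double[][] centroids, double[] point) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
            double dist = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = centroids[i][j] - point[j];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    // Same idea as predictOnValues: the key is carried over, only the value is clustered.
    static <K> List<Map.Entry<K, Integer>> predictOnValues(
            double[][] centroids, List<Map.Entry<K, double[]>> keyed) {
        List<Map.Entry<K, Integer>> out = new ArrayList<>();
        for (Map.Entry<K, double[]> e : keyed) {
            out.add(new SimpleEntry<K, Integer>(e.getKey(), predict(centroids, e.getValue())));
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0.0, 0.0}, {10.0, 10.0}};   // hypothetical cluster centers
        List<Map.Entry<String, double[]>> tweets = new ArrayList<>();
        tweets.add(new SimpleEntry<>("tweet-1", new double[] {1.0, 0.5}));
        tweets.add(new SimpleEntry<>("tweet-2", new double[] {9.0, 11.0}));
        for (Map.Entry<String, Integer> e : predictOnValues(centroids, tweets)) {
            System.out.println(e.getKey() + " -> cluster " + e.getValue());
        }
    }
}
```

In the Scala example the key happens to be lp.label, which is handy for comparing clusters with known labels, but any identifier (for instance a tweet id) could play that role.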

Related

How to pass and refer multiple sideinputs to a DoFn in JAVA

I'm trying to figure out how I can pass multiple side inputs to my DoFn and refer to them separately inside ProcessContext.
I wasn't able to find anything about this in the Beam documentation and would like to get some idea of how to achieve it in Java.
Though the example of using side inputs only has a single side input, the same pattern holds for multiple side inputs.
Specifically, the withSideInputs method of ParDo takes any number of PCollectionViews, each of which can be used as its own key in ProcessContext.sideInput.

Neural Net use for finding specific types of websites?

So I'm working on my first project and I'm trying to incorporate a neural net in it somehow. At the moment I have just created a web crawler that takes a word as input, performs a Google search and retrieves the HTML data of the links.
Now I am trying to use only the HTML data from specific types of websites, in my case websites that offer free educational content/courses. An example is this site: https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-092-java-preparation-for-6-170-january-iap-2006/index.htm
I'm new to neural nets. Is this something a neural net is able to do, or would another method be better?
Also, the rest of my code, such as the web crawler, is in Java, so if a neural net is applicable in this case, what library or tool would you recommend for building/training it? I was thinking of Neuroph but would love to hear some suggestions.
Neural networks are used to predict something. For example, given an image as input, the network outputs what the image contains: a cat, a dog, and so on.
About the web crawler:
The web crawler you describe does not necessarily need a neural network, but if you want to add predictions you can use one, for example by taking a word as input, running a Google search on it and then predicting the nature of the content that comes back.
I don't know exactly what you want to predict or which kind of prediction you need (classification or regression), but I can first suggest how to take HTML as input.
Taking HTML content as input:
First, neural networks do not operate on characters; they operate on numbers. To process HTML content you need a mechanism that turns the text into numbers, and that is not an easy step. The field of NLP (Natural Language Processing) offers good ways to handle text, and you can apply them to HTML content as well (or take a different approach if you prefer).
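As one very simple way to turn HTML into numbers, here is a bag-of-words sketch in plain Java. It is only an illustration under strong assumptions: the tag stripping is a naive regular expression (a real crawler would use an HTML parser), and the class name BagOfWords and the vocabulary are made up. The resulting vector of word counts is the kind of numeric input a classifier could be trained on.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class BagOfWords {

    // Counts how often each vocabulary word occurs in the visible text of the HTML.
    static double[] vectorize(String html, List<String> vocabulary) {
        // Naive tag removal; good enough for a sketch, not for real-world HTML.
        String text = html.replaceAll("<[^>]*>", " ").toLowerCase(Locale.ROOT);
        double[] counts = new double[vocabulary.size()];
        for (String token : text.split("[^a-z0-9]+")) {
            int index = vocabulary.indexOf(token);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> vocabulary = Arrays.asList("free", "course", "price");  // hypothetical
        String html = "<h1>Free Course</h1><p>free lectures</p>";
        System.out.println(Arrays.toString(vectorize(html, vocabulary)));
    }
}
```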
I previously built a text suggestion project with a recurrent neural network in which I used one of the NLP methods. You can check it on my GitHub, where the README explains all the steps in detail: https://github.com/KaramMed/Modele-de-Suggestion-du-Texte
About the library:
I recommend TensorFlow for Java. It is one of the best deep learning libraries, and you can find many tutorials about it.

Parsing non-fixed format binary payload with a custom javascript conversion in Vorto

We are using Vorto now mainly as a normalized format and are starting to look into using the mapping engine for mapping different payload formats to Vorto model as well. I more or less understand how to map functionblock properties from JSON or binary payload using xpath and the conversion functions. However, I'm not clear how to support parsing of non-fixed format binary payload using this method.
For instance we have an off the shelf LoRaWAN sensor which transmits in the following format:
<length><frame type>[<sensor-id><sensor-value>] where length is the total frame length and sensor-id (for eg temperature, humidity, battery, ...) describes how to parse the sensor-value (ie length, datatype). In one frame multiple of these readings may be present in random order.
Parsing this can be done easily in, for instance, loraserver.io using a small JavaScript function which iterates over all the bytes and returns the parsed properties. The same approach will work in the Ditto payload mapping engine, as far as I know.
However, currently I don't see how to do something similar in Vorto mapping. This is just one specific sensor example of course, but more examples exist on the market using similar dynamic payload format. I know there is already an open issue (#1535) to improve the documentation, but it would already be helpful to know if such flexible parsing would be possible using the mapping DSL.
I tried passing the raw payload as bytearray to the javascript function. In order to test this I duplicated the org.eclipse.vorto.mapping.engine.converter.binary.BinaryMappingTest#testMappingBinaryContaining2DataPoints and adapted the model to use a custom javascript function like this
evaluator.addScriptFunction(new ScriptClassFunction("extractTemperature",
"function extractTemperature(value) { " +
" print(\"parameter of type \" + typeof value + \", value = \" + value);" +
" print(value[1]);" +
"}"));
The output of this function is
parameter of type number, value = 1
undefined
Where the value 1 is the first element of the bytearray used.
So the function does not seem to receive the parameter as a byte array.
The model is configured with .withXPathStereotype("custom:extractTemperature(data)", "demo"), so the payload is passed (as BinaryData) in the same way as in the testMappingBinaryContaining2DataPoints test (.withXPathStereotype("custom:convert(vorto_conversion1:byteArrayToInt(data,0,0,0,2))", "demo")). The only difference I see is that in the testMappingBinaryContaining2DataPoints test the byte array parameter is passed to a Java function instead of a JavaScript function. Or am I missing something?
Also, I noticed that loop keywords like for and while are not allowed in the javascript code. So even if I can access the bytearray parameter in the javascript function I see no way for now how to iterate over this.
On gitter I received following reply (together with the suggestion to move discussion to SO)
You are right. We restricted the JavaScript function usage to a very rudimentary set of language keywords, excluding for loops, as nasty stuff can be implemented there. What you could do instead is register a Java function in your own namespace with the mapping engine. That function can take a byte array. Later this function can be contributed to the mapping engine as a standard function to extract a certain value, for other developers to reuse.
I don't think this is a solution to the problem, however. As mentioned above, this is just one example of an off-the-shelf sensor payload format, and I don't see how it can be generalized enough to be included as a generic function in the mapping engine. Nor do I think it should be required to implement a sensor-specific conversion in Java, since (for an end user of an IoT platform wanting to deploy a new sensor type) this is more complex to develop and deploy than a small JavaScript function which can be altered at runtime in the mapping spec. I see a lot of value in being able to do simple mappings in JavaScript, just as can be done in, for example, loraserver.io and Eclipse Ditto.
I think being able to pass a byte array to javascript is a first step. Also I wonder where exactly the risk is in allowing loops in the javascript? For example Ditto also has some restrictions in the javascript sandbox (see here) but this allows loops and only prevents endless looping and recursion.
They state the following:
Using Rhino instead of Nashorn, the newer JavaScript engine shipped with Java, has the benefit that sandboxing can be applied in a better way.
Sandboxing of different payload scripts is required as Ditto is intended to be run as cloud service where multiple connections to different endpoints are managed for different tenants at the same time. This requires the isolation of each single script to avoid interference with other scripts and to protect the JVM executing the script against harmful code execution.
Would using Rhino in Vorto as well make it possible to control the risks you see and to allow loop constructs in Vorto mappings?
PS: can someone with enough SO reputation points add the tag eclipse-vorto please?
I created an issue for your request to support this in the JavaScript converters: https://github.com/eclipse/vorto/issues/2029
As stated in the issue, as a current workaround you can register your own custom converter function written in Java and reuse it across your mappings. In these Java converter functions you have all the power of the Java language to extract the right property from the arbitrary list.
In order to find out how to implement your own custom converter function with Java, take a look here: https://github.com/eclipse/vorto/tree/master/mapping-engine#Advanced-Usage
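As a rough idea of what the body of such a Java function could look like, here is a sketch that walks a frame of the form <length><frame type>[<sensor-id><sensor-value>]. Everything sensor-specific is an assumption made for illustration: the class name FrameParser, the sensor ids (0x01 temperature, 0x02 humidity, 0x03 battery), their widths and their units are hypothetical and not taken from any real device. How to register the function with the mapping engine is covered by the linked documentation and is not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class FrameParser {

    // Parses <length><frame type>[<sensor-id><sensor-value>]* into named readings.
    static Map<String, Double> parse(byte[] frame) {
        if (frame.length < 2) {
            throw new IllegalArgumentException("frame too short");
        }
        int length = frame[0] & 0xFF;          // total frame length, header included
        if (length != frame.length) {
            throw new IllegalArgumentException("length byte does not match frame size");
        }
        // frame[1] is the frame type; this sketch ignores it.
        Map<String, Double> readings = new LinkedHashMap<>();
        int i = 2;
        while (i < length) {
            int sensorId = frame[i++] & 0xFF;
            switch (sensorId) {
                case 0x01:  // assumed: signed 16-bit big-endian, tenths of a degree
                    require(frame, i, 2);
                    short raw = (short) (((frame[i] & 0xFF) << 8) | (frame[i + 1] & 0xFF));
                    readings.put("temperature", raw / 10.0);
                    i += 2;
                    break;
                case 0x02:  // assumed: unsigned 8-bit percentage
                    require(frame, i, 1);
                    readings.put("humidity", (double) (frame[i] & 0xFF));
                    i += 1;
                    break;
                case 0x03:  // assumed: unsigned 16-bit big-endian millivolts
                    require(frame, i, 2);
                    readings.put("battery", (double) (((frame[i] & 0xFF) << 8) | (frame[i + 1] & 0xFF)));
                    i += 2;
                    break;
                default:
                    throw new IllegalArgumentException("unknown sensor id " + sensorId);
            }
        }
        return readings;
    }

    // Guards against a reading that runs past the end of the frame.
    private static void require(byte[] frame, int offset, int count) {
        if (offset + count > frame.length) {
            throw new IllegalArgumentException("truncated reading at offset " + offset);
        }
    }
}
```

Because the sensor id decides how many bytes to consume, the readings can appear in any order within the frame, which is the property a fixed-offset mapping cannot express.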
Since the Eclipse Vorto 0.12.3 release, a fix for your request is available. With it, you can pass array objects to JavaScript converters and use for loops inside JavaScript functions. You might want to give it a try.
See release notes https://github.com/eclipse/vorto/blob/master/docs/release-notes.md

Writing to GCS from dataflow based on windowing and element count

I am attempting to implement a solution where I need to write JSON messages from Pub/Sub into GCS using Dataflow. My question is very similar to this one.
I need to write based on either windowing or element count.
Here is the code sample for writes from the above question:
windowedValues.apply(FileIO.<String, String>writeDynamic()
.by(Event::getKey)
.via(TextIO.sink())
.to("gs://data_pipeline_events_test/events/")
.withDestinationCoder(StringUtf8Coder.of())
.withNumShards(1)
.withNaming(key -> FileIO.Write.defaultNaming(key, ".json")));
The solution suggests using FileIO.writeDynamic(), but I am not able to understand what .by(Event::getKey) does or where it comes from.
Any help on this is greatly appreciated.
It's partitioning elements into groups according to events' keys.
From my understanding, Event is the element type of the PCollection being written, and it exposes a getKey method (Beam's KV class has a method with the same name).
Note that :: is the method reference operator introduced in Java 8. Event::getKey refers to the getKey method of the Event class and is shorthand for the lambda event -> event.getKey().
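The effect of Event::getKey can be tried out without Beam. The sketch below is plain Java, not the FileIO API: the Event class is a hypothetical stand-in for the one in the question, and grouping with Collectors.groupingBy only mimics the idea of routing each element to a destination chosen by its key.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

class MethodRefDemo {

    // Hypothetical stand-in for the Event type used in the question.
    static final class Event {
        private final String key;
        private final String payload;

        Event(String key, String payload) {
            this.key = key;
            this.payload = payload;
        }

        String getKey() { return key; }
        String getPayload() { return payload; }
    }

    // Partitions payloads by key, one group per distinct key.
    static Map<String, List<String>> groupByKey(List<Event> events) {
        Function<Event, String> byKey = Event::getKey;   // same as e -> e.getKey()
        return events.stream().collect(
                Collectors.groupingBy(byKey, TreeMap::new,
                        Collectors.mapping(Event::getPayload, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                new Event("clicks", "{\"id\":1}"),
                new Event("views", "{\"id\":2}"),
                new Event("clicks", "{\"id\":3}"));
        System.out.println(groupByKey(events));
    }
}
```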

What are good methods to perform spreadsheet-like calculations in a programming language?

What's the best way to do spreadsheet-like calculations in a programming language? Example: a multi-user application needs to be available over the web that crunches columns and cells of numbers like a spreadsheet, based on user submissions. What are the best data structures, database models or patterns for this type of work, so that handling the different columns is done efficiently and easily in PHP, Java, or even .NET? Is it better to use data structures within the language, or is it better to use a database? If using a database is the way, how does one go about doing this?
To do the actual calculation, look at graph theory. Basically you want to represent each cell as a node in a graph and each dependency as a directed edge. Next, do a topological sort to calculate the value of each cell in the right order.
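The graph idea above can be sketched in a few dozen lines of Java. This is only an illustration: the SheetCalc and Cell names are made up, formulas are modelled as plain Java functions over already computed values rather than parsed text, and the ordering uses Kahn's algorithm, which is one common way to do a topological sort.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

class SheetCalc {

    // A cell lists the cells it depends on and a formula over their computed values.
    static final class Cell {
        final List<String> deps;
        final ToDoubleFunction<Map<String, Double>> formula;

        Cell(List<String> deps, ToDoubleFunction<Map<String, Double>> formula) {
            this.deps = deps;
            this.formula = formula;
        }
    }

    // Evaluates every cell after all of its dependencies (Kahn's algorithm).
    static Map<String, Double> evaluate(Map<String, Cell> sheet) {
        Map<String, Integer> pending = new HashMap<>();          // unresolved dependency count
        Map<String, List<String>> dependents = new HashMap<>();  // edges: dependency -> users
        for (Map.Entry<String, Cell> e : sheet.entrySet()) {
            pending.put(e.getKey(), e.getValue().deps.size());
            for (String dep : e.getValue().deps) {
                if (!sheet.containsKey(dep)) {
                    throw new IllegalArgumentException("unknown cell " + dep);
                }
                dependents.computeIfAbsent(dep, k -> new ArrayList<>()).add(e.getKey());
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        Map<String, Double> values = new HashMap<>();
        while (!ready.isEmpty()) {
            String name = ready.remove();
            values.put(name, sheet.get(name).formula.applyAsDouble(values));
            for (String user : dependents.getOrDefault(name, Collections.<String>emptyList())) {
                if (pending.merge(user, -1, Integer::sum) == 0) {
                    ready.add(user);
                }
            }
        }
        if (values.size() != sheet.size()) {
            throw new IllegalStateException("circular reference detected");
        }
        return values;
    }
}
```

A cell that never becomes ready is part of a cycle, which is how the sketch detects circular references.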
Aspose.Cells (formerly Aspose.Excel.Web) is a good way to get the functionality you are looking for.
Unless you are asking more for a "How is it done?" than "I need to do it." Then I would look at the other answers given.
Along the lines of "I need to do it"
Microsoft has Excel Services which does just what you want.
Spreadsheet operations on the server. It is available via a web services interface, so you can connect and drive calculations from Java, PHP, .NET, whatever.
Excel Services is part of SharePoint 2007.
Resolver One is a Spreadsheet app made in IronPython.
There is an explanation of the overall mechanism [pythonology.org] it uses to calculate user-generated equations.
The relevant image showing Resolver One's overall algorithm.
Note that users can write Python code to be interpreted both in the cells and in a special 'outside of sheet' place.
Look at another question here in SO, from where I reused my answer.
I can't tell you how to do it. But I would recommend you to look at the code of PHPExcel. PHPExcel is a library that allows you to create Excel files within PHP.
The workflow of PHPExcel is simplified like this:
1. Create an empty Excel file object
2. Add cells (with either data or formulas) to the "Excel file"
3. Call the create function, which generates the file itself
In your case you would have to replace 3. with something like "Create web interface".
Therefore I would recommend that you look at the code of this open source project and see how it is structured in general. This should help you solve your problem.
I once used a binary tree to store the output of parsing a string using BODMAS. Each node was an operation between two other nodes, which could be a number, a variable or another operation.
So y = x * x + 2
became:
    +
  *   2
 x x
Sadly this was at school in Pascal and is stored on a 5 1/4" disk, so you don't want it :)
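In place of the lost Pascal version, here is a small Java sketch of the same idea. It only covers the tree and its evaluation, not the parsing step, and the ExprTree name and its helper methods are invented for this example. Each node is a number, a variable or an operation between two other nodes.

```java
import java.util.Collections;
import java.util.Map;

class ExprTree {

    // A node evaluates itself given the current variable values.
    interface Node {
        double eval(Map<String, Double> vars);
    }

    static Node num(double value) {
        return vars -> value;
    }

    static Node variable(String name) {
        return vars -> {
            Double value = vars.get(name);
            if (value == null) {
                throw new IllegalArgumentException("unbound variable " + name);
            }
            return value;
        };
    }

    // An operation node combines the results of its two child nodes.
    static Node op(char symbol, Node left, Node right) {
        return vars -> {
            double l = left.eval(vars);
            double r = right.eval(vars);
            switch (symbol) {
                case '+': return l + r;
                case '-': return l - r;
                case '*': return l * r;
                case '/': return l / r;
                default: throw new IllegalArgumentException("unknown operator " + symbol);
            }
        };
    }

    public static void main(String[] args) {
        // y = x * x + 2, built by hand in the shape of the tree shown above
        Node y = op('+', op('*', variable("x"), variable("x")), num(2));
        System.out.println(y.eval(Collections.singletonMap("x", 3.0)));  // prints 11.0
    }
}
```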
SpreadsheetGear for .NET will let you load Excel workbooks, plug in values, calculate and then get the results.
You can see a few simple ASP.NET calculation samples here, other ASP.NET samples here and download a free trial here.
Disclaimer: I own SpreadsheetGear LLC
I must point out that Google Spreadsheets already does this kind of thing.
