How to vectorize a text file in Mahout?

I have a text file containing labels and tweets:
positive,I love this car
negative,I hate this book
positive,Good product.
I need to convert each line into a vector. If I use the seq2sparse command, the whole document is converted into a single vector, but I need one vector per line, not one for the whole document.
For example:
key: positive, value: vectorvalue(tweet)
How can we achieve this in Mahout?
/* Here is what I have done */
StringTokenizer str = new StringTokenizer(line, ",");
String label = str.nextToken();
while (str.hasMoreTokens()) {
    tweetline = str.nextToken();
    System.out.println("Tweetline" + tweetline);
    StringTokenizer words = new StringTokenizer(tweetline, " ");
    while (words.hasMoreTokens()) {
        featureList.add(words.nextToken());
    }
}
Vector unclassifiedInstanceVector = new RandomAccessSparseVector(tweetline.split(" ").length);
FeatureVectorEncoder vectorEncoder = new AdaptiveWordValueEncoder(label);
vectorEncoder.setProbes(1);
System.out.println("Feature List: " + featureList);
for (Object feature : featureList) {
    vectorEncoder.addToVector((String) feature, unclassifiedInstanceVector);
}
context.write(new Text("/" + label), new VectorWritable(unclassifiedInstanceVector));
Thanks in advance

You can write the vectors to an HDFS path with SequenceFile.Writer:
Configuration conf = HBaseConfiguration.create();
FileSystem fs = FileSystem.get(conf);
String newPath = "/foo/mahouttest/part-r-00000";
Path newPathFile = new Path(newPath);
Text key = new Text();
VectorWritable value = new VectorWritable();
SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, newPathFile,
        key.getClass(), value.getClass());
.....
// for each line: set the label as key and the line's vector as value
key.set("c/" + label);
value.set(unclassifiedInstanceVector);
writer.append(key, value);
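
To make the "one vector per line" idea concrete without a Mahout dependency, here is a minimal plain-JDK sketch. It is not Mahout's encoder: `LineVectorizer`, `vectorize`, and the cardinality of 1000 are hypothetical names and values chosen for illustration, and `String.hashCode` merely stands in for whatever hashing the real `FeatureVectorEncoder` uses. It only shows the shape of the result: each input line yields its own (label, sparse vector) pair.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Mahout: one sparse vector per input line.
class LineVectorizer {
    // Assumed vector size, chosen arbitrarily for this sketch.
    static final int CARDINALITY = 1000;

    // Splits "label,tweet text" at the first comma and hashes each word of
    // the tweet into a sparse vector represented as index -> count.
    static Map.Entry<String, Map<Integer, Double>> vectorize(String line) {
        int comma = line.indexOf(',');
        if (comma < 0) {
            throw new IllegalArgumentException("Expected 'label,text': " + line);
        }
        String label = line.substring(0, comma).trim();
        String tweet = line.substring(comma + 1).trim();

        // A fresh map per line, so words from earlier lines never leak in.
        Map<Integer, Double> vector = new TreeMap<>();
        for (String word : tweet.split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            int index = Math.floorMod(word.toLowerCase().hashCode(), CARDINALITY);
            vector.merge(index, 1.0, Double::sum);
        }
        return new SimpleEntry<>(label, vector);
    }

    public static void main(String[] args) {
        String[] lines = {"positive,I love this car", "negative,I hate this book"};
        for (String line : lines) {
            Map.Entry<String, Map<Integer, Double>> e = vectorize(line);
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The point to carry over to the Mahout code is that the feature container is created fresh for every line; a feature list shared across lines would accumulate words from earlier tweets.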

Related

How can a List of INDArrays be stored in a file

I am working on a reinforcement-learning project and have a List<INDArray> that holds the states of the world, and a second List<INDArray> that holds the action-prediction and reward values, where each index corresponds to the state at the same index in the first list.
I want to store this data on the hard drive for later training. How can I achieve this?
Let's say, for example, we have:
List<INDArray> stateList = new ArrayList<>();
stateList.add(Nd4j.valueArrayOf(new int[]{3,3,3}, 5));
stateList.add(Nd4j.valueArrayOf(new int[]{3,3,3}, 6));
List<INDArray> valueList = new ArrayList<>();
valueList.add(Nd4j.create(new float[]{1, 2}));
valueList.add(Nd4j.create(new float[]{3, 4}));

You have to prepare the file content and then simply write it to the file.
StringBuilder fileContent = new StringBuilder();
for (INDArray arr : valueList) {
    // appends arr.toString(); use whatever representation you want to store
    fileContent.append(arr).append("\n");
}
FileWriter fileWriter = new FileWriter("c:/temp/samplefile.txt");
fileWriter.write(fileContent.toString());
fileWriter.close();

ArrayList<String> in PDF from a new row

I want to send a survey as a PDF from Java. I tried different methods, with and without StringBuffer, but the text in the PDF always appears on a single line.
public void writePdf(OutputStream outputStream) throws Exception {
    Paragraph paragraph = new Paragraph();
    Document document = new Document();
    PdfWriter.getInstance(document, outputStream);
    document.open();
    document.addTitle("Survey PDF");
    ArrayList nameArrays = new ArrayList();
    StringBuffer sb = new StringBuffer();
    int i = -1;
    for (String properties : textService.getAnswer()) {
        nameArrays.add(properties);
        i++;
    }
    for (int a = 0; a <= i; a++) {
        System.out.println("nameArrays.get(a) -" + nameArrays.get(a));
        sb.append(nameArrays.get(a));
    }
    paragraph.add(sb.toString());
    document.add(paragraph);
    document.close();
}
textService.getAnswer() returns an ArrayList<String>.
Could you please advise how to separate the text so that each sentence starts on a new line?
Currently everything is rendered on one line.

You forgot the newline character \n and your code seems a bit overcomplicated.
Try this:
StringBuffer sb = new StringBuffer();
for (String property : textService.getAnswer()) {
sb.append(property);
sb.append('\n');
}

What about:
nameArrays.add(properties+"\n");

You might be able to fix that by simply appending "\n" to the strings that you are collecting in your list, but I think that very much depends on the PDF library you are using.
You see, "newlines" or "paragraphs" are to a certain degree about formatting. It seems like a conceptual problem to add that formatting information to the data that you are processing.
Meaning: you might want to check whether your library lets you provide plain strings and then does the formatting for you.
In other words: instead of passing strings with newlines, check whether you can keep using strings without newlines and have the PDF library add line breaks where appropriate.
Side note on code quality: you are using raw types:
ArrayList nameArrays = new ArrayList();
should better be
ArrayList<String> names = new ArrayList<>();
[ I also changed the name: there is no point in putting the type of a collection into the variable name! ]

This method saves the values of an ArrayList to a PDF document. In the mFilePath variable you can put a folder name in place of "/", for example "/example/".
For the mFileName variable you can use any name. I use the date and time at which the document is created. Don't use a static name, otherwise each run overwrites the same PDF.
private void savePDF() {
    com.itextpdf.text.Document mDoc = new com.itextpdf.text.Document();
    // yyyy = year, MM = month, dd = day of month, HH = hour, mm = minute, ss = second
    String mFileName = new SimpleDateFormat("yyyy-MM-dd-HH-mm-ss", Locale.getDefault()).format(System.currentTimeMillis());
    String mFilePath = Environment.getExternalStorageDirectory() + "/" + mFileName + ".pdf";
    try {
        PdfWriter.getInstance(mDoc, new FileOutputStream(mFilePath));
        mDoc.open();
        // one Paragraph per answer, so each answer starts on its own line
        for (int d = 0; d < answers.size(); d++) {
            String mtext = answers.get(d);
            mDoc.add(new Paragraph(mtext));
        }
        mDoc.close();
    } catch (Exception e) {
        // don't swallow failures silently
        e.printStackTrace();
    }
}

How to include the unique id of each instance for sake of mapping in the future

I am using the Weka Java API to classify some of my instances. The file that I feed to Weka looks as follows:
0.3,0.1,1
0.0,0.04,0
0.0,0.03,1
Each of the above instances has a unique id assigned to it; for example, the first row has an id of 1098...
I wrote the following code, which uses the Weka Java API to classify the instances and report those that are classified incorrectly:
public static void SVM(ArrayList<String[]> testData) throws FileNotFoundException, IOException,
        Exception {
    BufferedReader breader = null;
    breader = new BufferedReader(new FileReader("weka/train.txt"));
    Instances train = new Instances(breader);
    train.setClassIndex(train.numAttributes() - 1);
    Instances unlabeled = new Instances(new BufferedReader(new FileReader(
            "weka/test.txt")));
    breader.close();
    // set class attribute
    unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
    // create copy
    Instances labeled = new Instances(unlabeled);
    LibSVM svm = new LibSVM();
    svm.buildClassifier(train);
    Evaluation eval = new Evaluation(train);
    BufferedWriter writer = new BufferedWriter(new FileWriter(
            "weka/labeledSVM.txt"));
    for (int i = 0; i < unlabeled.numInstances(); i++) {
        double clsLabel = svm.classifyInstance(unlabeled.instance(i));
        if (unlabeled.instance(i).value(5) != clsLabel) {
            writer.write("the unique id is: " + testData.get(i)[0] + " real label of the text is : " + unlabeled.instance(i).toString() + ", According to Algorithm result label is: " + clsLabel);
            writer.newLine();
        }
    }
    // flush and close once, after all instances have been processed
    writer.flush();
    writer.close();
}
A big problem, however, is that the mapping between the unique id and the instance labeled by the algorithm is incorrect. So I am wondering whether there is a way to include the unique id of each text in my instances but tell the Weka classifier to ignore it.
For example, something like this:
1980,0.3,0.1,1
1981,0.0,0.04,0
1982,0.0,0.03,0
Any other suggestion is appreciated.

The only way I found to do this was to create my own subclass of Instance.

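Use the "AddID" filter, which assigns a unique ID to every instance, then wrap your classifier in a FilteredClassifier, i.e. weka.classifiers.meta.FilteredClassifier. The FilteredClassifier applies a filter before training and prediction, so with a filter that removes the ID attribute the ID stays in your data while the underlying classifier never sees it.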

How to add List into properties file?

I am converting a properties file into XML format as shown below.
public class XmlPropertiesWriter {
public static void main(String args[]) throws FileNotFoundException, IOException {
//Reading properties files in Java example
Properties props = new Properties();
FileOutputStream fos = new FileOutputStream("C:\\Users\\Desktop\\myxml.xml");
props.setProperty("key1", "test");
props.setProperty("key2", "test1");
//writing properites into properties file from Java
props.storeToXML(fos, "Properties file in xml format generated from Java program");
fos.close();
}
}
This works fine, but I want to add an ArrayList to this XML file. How can I do this? Can anyone help me?

You can serialize the list into a string representation to store it in the properties file, and deserialize it when reading:
ArrayList<String> list = new ArrayList<>( );
String serialized = list.stream( ).collect( Collectors.joining( "," ) );
String input = "data,data";
List<String> unserialized = Arrays.asList( input.split( "," ) );
With this method, take care to use a separator that never occurs in your data.
Otherwise, write an XML (or JSON) file reader/writer that supports list elements and does what you want.
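
As a rough end-to-end sketch of the separator approach, the snippet below stores a list under one key, writes the properties as XML, reads them back, and splits the value again. `ListInProperties`, `putList`, `getList`, and the key `myList` are hypothetical names for illustration, and it assumes the list items never contain a comma. It writes to an in-memory stream only so that it runs anywhere; a FileOutputStream works the same way.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper: stores a List<String> as one comma-separated property.
class ListInProperties {
    // Assumption: this separator never occurs inside a list item.
    static final String SEPARATOR = ",";

    static void putList(Properties props, String key, List<String> values) {
        props.setProperty(key, String.join(SEPARATOR, values));
    }

    static List<String> getList(Properties props, String key) {
        String raw = props.getProperty(key, "");
        if (raw.isEmpty()) {
            return new ArrayList<>();
        }
        // limit -1 keeps trailing empty items instead of dropping them
        return new ArrayList<>(Arrays.asList(raw.split(SEPARATOR, -1)));
    }

    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("key1", "test");
        putList(props, "myList", Arrays.asList("a", "b", "c"));

        // Write the XML form to memory instead of a file for this sketch.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        props.storeToXML(out, "Properties with a list");

        Properties loaded = new Properties();
        loaded.loadFromXML(new ByteArrayInputStream(out.toByteArray()));
        System.out.println(getList(loaded, "myList")); // prints [a, b, c]
    }
}
```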

It depends on the element type of the ArrayList. If it holds Strings, you can do
arrayList.toArray(new String[arrayList.size()]);
If it holds other objects, you can create a StringBuilder and append all the values separated by a ; or : so you can split them again when needed:
final StringBuilder builder = new StringBuilder();
final List<Point> list = new ArrayList<Point>();
list.add(new Point(0, 0));
list.add(new Point(1, 0));
for (final Point p : list) {
    // assuming java.awt.Point: write "x,y" explicitly, since toString() is not meant to be parsed
    builder.append(p.x).append(",").append(p.y).append(";");
}
properties.setProperty("list", builder.toString());
When you load the properties, you can then simply do
final List<Point> list = new ArrayList<Point>();
final String[] points = properties.getProperty("list").split(";");
for (final String p : points) {
    final int comma = p.indexOf(",");
    final int x = Integer.parseInt(p.substring(0, comma));
    final int y = Integer.parseInt(p.substring(comma + 1));
    list.add(new Point(x, y));
}

Using setters/getters on a text file

I am trying to make a list of coordinates from a text file, and I want to do it efficiently, so I created a class that has latitude and longitude double variables along with their getter and setter methods.
In the other class I create an object of that class to use the setter methods:
CoordinatesParams params = new CoordinatesParams();
How can I read the list of coordinates from the text file and set them on the latitude and longitude variables?
Sorry if this question is very basic to some.
File Bus_Routes = new File("C:/Users/Daniel Dold/Desktop/Routes/Bus_Routes.txt");
Scanner scanner = new Scanner(Bus_Routes);
String line = scanner.nextLine();
String[] parsed = line.split("\\s");
String routeText = parsed[0];
String dir = "C:/Users/Daniel Dold/Desktop/Routes/";
File routeFile = new File(dir, routeText);
Scanner sc = new Scanner(routeFile);
while(sc.hasNextLine())
{
String line2 = sc.nextLine();
String[] s = line2.split("\t");
}
This is what I have so far. The file contains the following:
51.50177649 -0.05012445
51.50210374 -0.05050666
51.50253617 -0.0509908
51.50265346 -0.05072191
51.50274404 -0.05055025
51.50301702 -0.05011841
The coordinates file has just two columns: the first is latitude and the second is longitude.

I don't know exactly what the file looks like inside, but you should create a list of CoordinatesParams, then iterate over the data you read from the file, filling the list with CoordinatesParams objects and setting the latitude and longitude on each object via the setters.
List<CoordinatesParams> lCoordinates = new ArrayList<CoordinatesParams>();
while (sc.hasNextLine()) {
    CoordinatesParams temp = new CoordinatesParams();
    String pair = sc.nextLine();
    String[] s = pair.split(" ");
    // the setters take doubles, so parse the text first
    temp.setLatitude(Double.parseDouble(s[0]));
    temp.setLongitude(Double.parseDouble(s[1]));
    lCoordinates.add(temp);
}
I replaced split("\t") with split(" ") because it looks like the coordinates are separated by a single space rather than a tab.
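
If you are unsure whether the columns are separated by a space or a tab, splitting on the regex "\\s+" handles both. Below is a minimal self-contained sketch of that variant. The `CoordinatesParams` class here is only a stand-in, since the asker's real class was not shown, and `CoordinateReader` and `read` are hypothetical names. It reads from a string so that it runs without the file; passing `new Scanner(routeFile)` instead should behave the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical reader: turns "latitude longitude" lines into objects.
class CoordinateReader {
    static List<CoordinatesParams> read(Scanner sc) {
        List<CoordinatesParams> result = new ArrayList<>();
        while (sc.hasNextLine()) {
            String line = sc.nextLine().trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            // "\\s+" matches one or more spaces or tabs
            String[] parts = line.split("\\s+");
            CoordinatesParams c = new CoordinatesParams();
            c.setLatitude(Double.parseDouble(parts[0]));
            c.setLongitude(Double.parseDouble(parts[1]));
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        String sample = "51.50177649 -0.05012445\n51.50210374\t-0.05050666\n";
        for (CoordinatesParams c : read(new Scanner(sample))) {
            System.out.println(c.getLatitude() + ", " + c.getLongitude());
        }
    }
}

// Stand-in for the asker's class, which was not shown in the question.
class CoordinatesParams {
    private double latitude;
    private double longitude;

    double getLatitude() { return latitude; }
    void setLatitude(double latitude) { this.latitude = latitude; }
    double getLongitude() { return longitude; }
    void setLongitude(double longitude) { this.longitude = longitude; }
}
```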
