Reading a whole file with Spark - java

I'm trying to sum the lengths of all the lines in a file using Spark.
These lines are formatted as follows:
A1004AX2J2HXGL\tB0007RT9LC\tMay 30, 2005\t3\n\t4\t5.0\tLes carottes sont cuites
To achieve this, I tried the code given in the documentation:
JavaRDD<String> txtFile = sc.textFile(filePath);
JavaRDD<Integer> linesLength = txtFile.map(s -> s.length());
long totalLength = linesLength.reduce((a, b) -> a+b);
However, it doesn't work. For instance, for a 5.8 GB text file, it returns 1602633268 when it should return 5897600784.
I suppose some lines contain strange characters that stop the line from being read completely.
With good old Java, this problem could be solved with a BufferedReader, as in this case. However, I found no mention of anything similar for Spark in the documentation.
How could I proceed?

I know you already found at least part of your problem and answered the question, but I'd like to point out another problem: you are counting characters in this Spark code, but it sounds like you are trying to find the file size in bytes. These are not necessarily the same thing.
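The gap between the two is easy to see on a line containing non-ASCII characters (a standalone sketch; the sample string is made up):

```java
import java.nio.charset.StandardCharsets;

public class BytesVsChars {
    public static void main(String[] args) {
        // accented characters occupy one char each but two UTF-8 bytes each
        String line = "Les carottes sont cuites à l'étouffée";
        System.out.println(line.length());                                // 37 characters
        System.out.println(line.getBytes(StandardCharsets.UTF_8).length); // 40 bytes
    }
}
```

Also note that sc.textFile strips the line terminators, so their bytes are never counted at all.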

I had it all wrong; it was just an integer overflow. I made it work by changing Integer to Long:
JavaRDD<String> txtFile = sc.textFile(path);
JavaRDD<Long> linesLength = txtFile.map(s -> Long.valueOf(s.length()));
Long totalLength = linesLength.reduce((a, b) -> a + b);
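The wrap-around is easy to reproduce outside Spark; int arithmetic silently overflows at 2^31 - 1, which a 5.8 GB sum exceeds by far:

```java
public class OverflowDemo {
    public static void main(String[] args) {
        int sum = Integer.MAX_VALUE; // 2147483647, the largest int
        sum += 1;                    // silently wraps around
        System.out.println(sum);     // -2147483648

        long safeSum = (long) Integer.MAX_VALUE + 1; // widen to long before adding
        System.out.println(safeSum); // 2147483648
    }
}
```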

Related

Two Java strings, same value, different Eclipse ids, don't HashMap right

Here is the code (class names are from Knime):
HashMap<String, DataColumnSpec> rcols = new HashMap<String, DataColumnSpec>();
rightSpec.forEach(rs -> { rcols.put(rs.getName(), rs); });
DataColumnSpec[] jcols = leftSpec.stream()
        .filter(s -> rcols.containsKey(s.getName()))
        .toArray(DataColumnSpec[]::new);
The result is empty, but it should not be! There really is one matching column!
Here is the debugger screenshot:
Note P# in the first instance with id=14978 and the second id=666.
What is going on here? What do I do to fix it?
The answer, sad to admit, was a non-printing character in one of the strings. The source of the data is the FileReader node on Knime, and it has a bug handling UTF-8-BOM data files. It injects a NUL character into the first string it reads, which is invisible in the debugger but throws off all the comparisons.
Full credit to @Ole V.V. It just didn't occur to me. Lesson learned!
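The failure mode is easy to reproduce: two strings that print identically but differ by a non-printing character never match as HashMap keys. A minimal sketch (the NUL injection here simulates the Knime bug):

```java
import java.util.HashMap;

public class HiddenCharDemo {
    public static void main(String[] args) {
        String clean = "P#";
        String tainted = "\u0000P#"; // leading NUL, invisible in most debuggers

        HashMap<String, Integer> map = new HashMap<>();
        map.put(tainted, 1);
        System.out.println(map.containsKey(clean)); // false

        // stripping non-printing characters repairs the lookup
        String repaired = tainted.replaceAll("\\p{C}", "");
        System.out.println(clean.equals(repaired)); // true
    }
}
```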

Implementing Apriori Algorithm on Hadoop

I am attempting to implement the Apriori algorithm using Hadoop. I have already implemented a non-distributed version, but my lack of familiarity with Hadoop and MapReduce has raised a number of concerns.
The way I want to implement the algorithm is in two phases:
1) In the first phase, the MapReduce job will operate on the original transaction dataset. The output of this phase is a file containing all of the 1-itemsets and their support counts.
2) In the second phase, I want to read the output of the previous phase and construct the new itemsets. Importantly, in the mapper, I then want to determine whether any of the new itemsets are still found in the dataset. I imagine that if I send the original dataset as input to the mapper, it will partition the file so that each mapper only scans a partial dataset. The candidate list, however, needs to be constructed from all of the previous phase's output. This then iterates in a loop for a fixed number of passes.
My problem is figuring out how to specifically ensure that I can access the full itemsets in each mapper, as well as being able to access the original dataset to calculate the new support in each phase.
Thanks for any advice, comments, suggestions or answers.
EDIT: Based on the feedback, I just want to be more specific about what I'm asking here.
Before you start, I suggest you read the Hadoop Map-Reduce Tutorial.
Step 1:
Load your data file into HDFS. Let's assume your data is a txt file and each set is a line.
a b c
a c d e
a e f
a f z
...
Step 2:
Follow the Map-Reduce Tutorial to build your own Apriori Class.
public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
    // Separate the line into tokens by space
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
        // Add the token into a writable set
        ... put the element into a writable set ...
    }
    context.write(word, one);
}
Step 3:
Run the MapReduce jar file. The output will be written to a file in HDFS.
You will have something like:
a b 3 (number of occurrences)
a b c 5
a d 2
...
Based on the output file, you can calculate the relationships.
On a related note, you might want to consider using a higher-level abstraction than MapReduce, such as Cascading or Apache Spark.
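For instance, deriving a rule's confidence from those counts is plain division (the support count for the single item a below is an assumed value, not taken from the sample output):

```java
public class ConfidenceDemo {
    public static void main(String[] args) {
        long countAB = 3;  // from the sample output line "a b 3"
        long countA = 10;  // hypothetical support count of {a}
        double confidence = (double) countAB / countA; // confidence(a -> b)
        System.out.println(confidence); // 0.3
    }
}
```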
I implemented the AES algorithm in both Apache Spark and Hadoop MapReduce using Hadoop Streaming.
I know it is not the same as Apriori, but you can try my approach.
Simple example of AES implemented using Hadoop Streaming MapReduce.
Project structure for AES Hadoop Streaming
1n_reducer.py / 1n_combiner.py is the same code but without the constraint.
import sys

CONSTRAINT = 1000

def do_reduce(word, _values):
    return word, sum(_values)

prev_key = None
values = []

for line in sys.stdin:
    key, value = line.split("\t")
    if key != prev_key and prev_key is not None:
        result_key, result_value = do_reduce(prev_key, values)
        if result_value > CONSTRAINT:
            print(result_key + "\t" + str(result_value))
        values = []
    prev_key = key
    values.append(int(value))

if prev_key is not None:
    result_key, result_value = do_reduce(prev_key, values)
    if result_value > CONSTRAINT:
        print(result_key + "\t" + str(result_value))
base_mapper.py:
import sys

def count_usage():
    for line in sys.stdin:
        elements = line.rstrip("\n").rsplit(",")
        for item in elements:
            print("{item}\t{count}".format(item=item, count=1))

if __name__ == "__main__":
    count_usage()
2n_mapper.py uses the result of the previous iteration.
In answer to your question, you can read the output of the previous iteration to form itemsets this way.
import itertools
import sys

sys.path.append('.')

N_DIM = 2

def get_2n_items():
    items = set()
    with open("part-00000") as inf:
        for line in inf:
            parts = line.split('\t')
            if len(parts) > 1:
                items.add(parts[0])
    return items

def count_usage_of_2n_items():
    all_items_set = get_2n_items()
    for line in sys.stdin:
        items = line.rstrip("\n").rsplit(",")  # 74743 43355 53554
        exist_in_items = set()
        for item in items:
            if item in all_items_set:
                exist_in_items.add(item)
        for combination in itertools.combinations(exist_in_items, N_DIM):
            combination = sorted(combination)
            print("{el1},{el2}\t{count}".format(el1=combination[0], el2=combination[1], count=1))

if __name__ == "__main__":
    count_usage_of_2n_items()
From my experience, the Apriori algorithm is not suitable for Hadoop if the number of unique combinations (itemsets) is too large (100K+).
If you find an elegant Apriori implementation using Hadoop MapReduce (Streaming or Java), please share it with the community.
PS. If you need more code snippets, just ask.

Spark - word count using java

I'm quite new to Spark and I would like to extract features (basically counts of words) from a text file using the Dataset class. I have read the "Extracting, transforming and selecting features" tutorial on Spark, but every example starts from a bag of words defined on the fly. I have tried several times to generate the same kind of Dataset starting from a text file, but I have always failed. Here is my code:
SparkSession spark = SparkSession
        .builder()
        .appName("Simple application")
        .config("spark.master", "local")
        .getOrCreate();

Dataset<String> textFile = spark.read()
        .textFile("myFile.txt")
        .as(Encoders.STRING());

Dataset<Row> words = textFile.flatMap(s -> {
    return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty()).toDF();

Word2Vec word2Vec = new Word2Vec()
        .setInputCol("value")
        .setOutputCol("result")
        .setVectorSize(16)
        .setMinCount(0);

Word2VecModel model = word2Vec.fit(words);
Dataset<Row> result = model.transform(words);
I get this error message: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column value must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.
I think I have to convert each line into a Row using something like:
RowFactory.create(0.0, line)
but I cannot figure out how to do that.
Basically, I was trying to train a classification system based on word counts of strings generated from a long sequence of characters. My text file contains one sequence per line, so I need to split and count the sub-strings for each row (the sub-strings are called k-mers; a general description can be found here). Depending on the length of the k-mers I could have more than 4^32 different strings, so I was looking for a scalable machine learning framework like Spark.
If you just want to count occurrences of words, you can do:
Dataset<String> words = textFile.flatMap(s -> {
    return Arrays.asList(s.toLowerCase().split("AG")).iterator();
}, Encoders.STRING()).filter(s -> !s.isEmpty());
Dataset<Row> counts = words.toDF("word").groupBy(col("word")).count();
Word2Vec is a much more powerful ML algorithm; in your case it is not necessary. Remember to add import static org.apache.spark.sql.functions.*; at the beginning of the file.
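The groupBy-and-count pipeline mirrors what plain Java streams do on a single line; this dependency-free sketch (with a made-up sample sequence) may help build intuition:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.stream.Collectors;

public class KmerCountDemo {
    public static void main(String[] args) {
        String line = "TTAGCCAGGAGTTAG"; // made-up sequence
        // same steps as the Spark version: lower-case, split on "AG", drop empties, count
        Map<String, Long> counts = Arrays.stream(line.toLowerCase().split("ag"))
                .filter(s -> !s.isEmpty())
                .collect(Collectors.groupingBy(s -> s, Collectors.counting()));
        System.out.println(counts.get("tt")); // 2
        System.out.println(counts.get("cc")); // 1
    }
}
```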

Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).
What I would like to do is the following: when adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer, if I am not mistaken.
What I would like to implement is the following: before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it; otherwise I discard it).
How should I proceed?
Here is (in Python) my custom implementation of the Analyzer:

class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):
        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)
        ts.reset()
        while ts.incrementToken():
            startOffset = offset.startOffset()
            endOffset = offset.endOffset()
            term = token.toString()
            # accept or reject term
        ts.end()
        ts.close()
        # How to store the terms in the index now?
        return ????
Thank you for your guidance in advance !
EDIT 1: After digging into Lucene's documentation, I figured it had something to do with TokenStreamComponents. It returns a TokenStream with which you can iterate through the token list of the field you are indexing.
Now there is something to do with the Attributes that I do not understand. Or, more precisely, I can read the tokens but have no idea how I should proceed afterward.
EDIT 2: I found this post where they mention the use of CharTermAttribute. However (in Python, though), I cannot access or get a CharTermAttribute. Any thoughts?
EDIT 3: I can now access each term; see the updated code snippet. What is left to be done is actually storing the desired terms...
The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.
By defining a filter extending PythonFilteringTokenFilter, I can make use of the accept() function (like the one used in StopFilter, for instance).
Here is the corresponding code snippet:
class MyFilter(PythonFilteringTokenFilter):

    def __init__(self, version, tokenStream):
        super(MyFilter, self).__init__(version, tokenStream)
        self.termAtt = self.addAttribute(CharTermAttribute.class_)

    def accept(self):
        term = self.termAtt.toString()
        accepted = False
        # Do whatever is needed with the term
        # accepted = ... (True/False)
        return accepted
Then just append the filter to the other filters (as in the code snippet of the question):
filter = MyFilter(Version.LUCENE_4_10_1, filter)
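Stripped of the Lucene types, accept() boils down to a set-membership test over the token's term text; a dependency-free Java sketch of the same idea (all names here are hypothetical, not Lucene API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class DictionaryFilterDemo {
    // keep only tokens present in the dictionary, mirroring accept()
    static List<String> filterTokens(List<String> tokens, Set<String> dictionary) {
        return tokens.stream()
                .filter(dictionary::contains)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("lucene", "index");
        List<String> tokens = Arrays.asList("lucene", "stores", "an", "index");
        System.out.println(filterTokens(tokens, dictionary)); // [lucene, index]
    }
}
```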

android java solve equations in string format

I need to solve string equations in an android app, e.g. "3 + 4*(5 - log(100))". I have tried to use BeanShell for this, unfortunately I have some problems with the integer/decimal numbers. When I enter
Interpreter interpreter = new Interpreter();
String res = "9223372036854775807D";
interpreter.eval("result = " + res);
res = interpreter.get("result").toString();
res = new BigDecimal(res).stripTrailingZeros().toPlainString();
I get as result 9223372036854776000?!
But when I use String res = "9223372036854775807L"; I get the correct 9223372036854775807.
I cannot simply substitute all D with L, because then I get wrong results for something like 3L/2L -> 1 (but it should be 1.5).
Does anyone know how to handle huge numbers such as 9223372036854775807 or -9223372036854775808, or can anyone suggest an alternative to BeanShell?
Use MathEval; download it from this link: http://tech.dolhub.com/code/matheval

Have you tried the JEP expression parser?
It is a good mathematical expression parser written purely in Java. It can parse trigonometric and logarithm functions and complex values, and you can customize your own functions as well...
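Whichever evaluator is used, the loss originates in double itself: 9223372036854775807 has no exact double representation, so it rounds to 2^63 before the interpreter ever prints it. A standalone illustration:

```java
import java.math.BigDecimal;

public class PrecisionDemo {
    public static void main(String[] args) {
        double d = 9223372036854775807D; // Long.MAX_VALUE written as a double literal
        System.out.println(new BigDecimal(d).toPlainString()); // 9223372036854775808

        // parsing the digits as a BigDecimal keeps every digit exact
        BigDecimal exact = new BigDecimal("9223372036854775807");
        System.out.println(exact.toPlainString()); // 9223372036854775807
    }
}
```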
