I am trying to build a deep learning model with a transformer architecture. While cleaning the dataset, the following error occurred. I am using PyTorch and Google Colab, and I am trying to clean a dataset of Java methods and comments.
Tested Code
import re
from typing import List

from fast_trees.core import FastParser

parser = FastParser('java')

def get_cmt_params(cmt: str) -> List[str]:
    '''
    Grabs the parameter identifier names from a JavaDoc comment
    :param cmt: the comment to extract the parameter identifier names from
    :returns: an array of the parameter identifier names found in the given comment
    '''
    params = re.findall(r'@param\s+\w+', cmt)
    param_names = []
    for param in params:
        param_names.append(param.split()[1])
    return param_names
Occurred Error
Downloading repo https://github.com/tree-sitter/tree-sitter-java to /usr/local/lib/python3.7/dist-packages/fast_trees/tree-sitter-java.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-31-64f6fa6ed39b> in <module>()
3 from fast_trees.core import FastParser
4
----> 5 parser.set_language = FastParser('java')
6
7 def get_cmt_params(cmt: str) -> List[str]:
3 frames
/usr/local/lib/python3.7/dist-packages/fast_trees/core.py in FastParser(lang)
96 }
97
---> 98 return PARSERS[lang]()
/usr/local/lib/python3.7/dist-packages/fast_trees/core.py in __init__(self)
46
47 def __init__(self):
---> 48 super().__init__()
49
50 def get_method_parameters(self, mthd: str) -> List[str]:
/usr/local/lib/python3.7/dist-packages/fast_trees/core.py in __init__(self)
15 class BaseParser:
16 def __init__(self):
---> 17 self.build_parser()
18
19 def build_parser(self):
/usr/local/lib/python3.7/dist-packages/fast_trees/core.py in build_parser(self)
35 self.language = Language(build_dir, self.LANG)
36 self.parser = Parser()
---> 37 self.parser.set_language(self.language)
38
39 # Cell
ValueError: Incompatible Language version 13. Must not be between 9 and 12
Can anybody help me to solve this issue?
fast_trees uses tree-sitter, and according to the tree-sitter repo it is an incompatibility issue. If you know the owner of fast_trees, ask them to upgrade their tree-sitter version.
Or you can fork it and upgrade it yourself, but keep in mind it may not be backwards compatible if you take it upon yourself, and it may not be just a simple new-version install.
The fast-trees library uses the tree-sitter library, and its authors recommend tree-sitter version 0.2.0 in order to use fast-trees. However, downgrading tree-sitter to 0.2.0 will not resolve your problem; I tried that myself.
So, rather than investing time figuring out the bug in tree-sitter, it is better to move to another stable library that satisfies your requirements. Since your requirement is to extract features from given Java code, you can use the javalang library for that.
javalang is a pure Python library for working with Java source code.
javalang provides a lexer and parser targeting Java 8. The
implementation is based on the Java language spec available at
http://docs.oracle.com/javase/specs/jls/se8/html/.
You can refer to it at https://pypi.org/project/javalang/0.13.0/
Since javalang is a pure Python library, it should let you go forward with your research without these bugs.
Normally I do not ask questions here, but the problem I am facing is so eerie that I can't fight it alone any more; I'm exhausted. Anyway, I am going to describe everything I have found, and I have found many interesting things that I hope will help someone to help me.
Software versions:
- OS: Windows 10 Pro version: 1909 build: 18363.720
- IntelliJ IDEA: 2019.2.4 Ultimate
- Gradle wrapper version: 5.2.1-all
- jdk: 8
The problem lies in encodings, specifically in console output in a Gradle project.
Here is my build.gradle file:
plugins {
    id 'java'
    id 'idea'
    id 'application'
}

group 'com.diceeee.mentoring'
version 'release'

sourceCompatibility = 1.8
application.mainClassName('D')
compileJava.options.encoding = 'utf-8'

tasks.withType(JavaCompile) {
    options.encoding = 'utf-8'
}

repositories {
    mavenCentral()
    jcenter()
}

dependencies {
    testCompile group: 'junit', name: 'junit', version: '4.12'
}
My sources are in UTF-8 with CRLF line endings, so in build.gradle I specify that sources should be compiled with UTF-8 instead of my system default windows-1251 encoding.
Here is D.java:
import java.io.FileWriter;
import java.io.IOException;

public class D {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        String testLine = "Проверка работоспособности И Ш";
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
Also I have gradle.properties with one line:
org.gradle.jvmargs=-Dfile.encoding=utf-8
I checked that it works and assured myself that it does: the encoding of the Encoder inside System.out really changed to UTF-8.
When I run my gradle project, I get this:
21:04:53: Executing task 'D.main()'...
> Task :compileJava UP-TO-DATE
> Task :processResources NO-SOURCE
> Task :classes UP-TO-DATE
> Task :D.main()
UTF-8
�������� ����������������� � �
Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.2.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 0s
2 actionable tasks: 1 executed, 1 up-to-date
21:04:54: Task execution finished 'D.main()'.
Here is more info.
1) It's not a coincidence that I left the file output in the code. If we look in the file, we see this:
Проверка работоспособности И Ш
I'm not sure whether this conclusion is right, but the problem seems to lie somewhere in the console: if the default encoding were wrong, the FileWriter would have used the wrong encoding for the file too, and the two outputs would be equally broken. But that does not happen.
2) I have debugged the internals of the PrintStream, OutputStreamWriter and StreamEncoder classes. StreamEncoder really uses the UTF-8 charset, and it encodes the text to the right byte sequence:
String testLine = "Проверка работоспособности И Ш";
Every Cyrillic letter is 2 bytes and each space is 1 byte; counting all the characters, we get 57 bytes.
Now, look here:
Encoder debugging screen with resulting bytes
So, as we can see, we get these first 57 bytes (the rest are from other inputs; the buffer uses limits):
[-48, -97, -47, -128, -48, -66, -48, -78, -48, -75, -47, -128, -48, -70, -48, -80, 32, -47, -128, -48, -80, -48, -79, -48, -66, -47, -126, -48, -66, -47, -127, -48, -65, -48, -66, -47, -127, -48, -66, -48, -79, -48, -67, -48, -66, -47, -127, -47, -126, -48, -72, 32, -48, -104, 32, -48, -88, 91]
It looks correct: the Cyrillic letters are encoded as pairs like [-48, -97] and [-47, -128], and the spaces match too. So the encoder does its job, it works, but then what is happening?
I don't know. Seriously. But there is more. If that didn't seem mind-blowing enough, I have prepared something else for you.
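For what it's worth, the byte analysis above can be reproduced in a few lines of Python (a sketch: it only illustrates the encoding mechanism with the same test string, not the Gradle console itself):

```python
text = "Проверка работоспособности И Ш"

data = text.encode('utf-8')
# 27 Cyrillic letters x 2 bytes + 3 spaces x 1 byte = 57 bytes
print(len(data))  # 57

# Java's debugger shows signed bytes: 0xD0 -> -48, 0x9F -> -97, etc.
signed = [b - 256 if b > 127 else b for b in data[:4]]
print(signed)  # [-48, -97, -47, -128], matching the dump above

# Decoding those UTF-8 bytes with a mismatched single-byte codec
# (e.g. windows-1251, the Windows ANSI default) garbles the text,
# which is effectively what a console with the wrong charset does:
print(data.decode('cp1251', errors='replace') != text)  # True
```

So the encoder side really is fine; the corruption happens when the correct UTF-8 bytes are interpreted with the wrong charset on the way to the screen.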
I have created a clean Java project without any Gradle/Maven etc., only my own JDK and nothing more.
Program is the same:
package com.company;

import java.io.FileWriter;
import java.io.IOException;

public class Main {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        String testLine = "Проверка работоспособности И Ш";
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
I run it and what do I get?
"C:\Program Files\Java\jdk1.8.0_181\bin\java.exe" "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA 2019.2.4\lib\idea_rt.jar=58901:C:\Program Files\JetBrains\IntelliJ IDEA 2019.2.4\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_181\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\rt.jar;C:\Users\<my_removed_name>\IdeaProjects\test\out\production\test" com.company.Main
UTF-8
Проверка работоспособности И Ш
Process finished with exit code 0
And after that, I just died. What on earth is happening??? Back to the Gradle project for a moment. I made a little modification:
import java.io.FileWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class D {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "windows-1251");
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
And output now is:
21:43:06: Executing task 'D.main()'...
> Task :compileJava
> Task :processResources NO-SOURCE
> Task :classes
> Task :D.main()
UTF-8
Проверка работоспособности �? Ш
Deprecated Gradle features were used in this build, making it incompatible with Gradle 6.0.
Use '--warning-mode all' to show the individual deprecation warnings.
See https://docs.gradle.org/5.2.1/userguide/command_line_interface.html#sec:command_line_warnings
BUILD SUCCESSFUL in 0s
2 actionable tasks: 2 executed
21:43:06: Task execution finished 'D.main()'.
In file:
Проверка работоспособности � Ш
Also, this console output is the first thing that pushed me to figure out what is going wrong: I was just coding and noticed that something is really wrong with the Cyrillic "И". I tried to solve it again and again... and now I'm here, because I'm at a dead end. I have tried everything I found in similar questions and topics about encoding problems. I have read articles about the default encoding in Java: that Windows uses cp866 in the console and windows-1251 as the default ANSI encoding, and that you need to set the encoding explicitly with -Dfile.encoding=UTF-8. Nothing helps, and I don't even know what to look for any more. I thought Gradle did not recognize the property and the charset was still windows-1251, but debugging showed I was wrong.
Well, here is the complete list of things I have tried to solve the problem:
1) Set -Dfile.encoding=UTF-8 in idea.exe.vmoptions and idea64.exe.vmoptions, with a restart. Didn't help.
2) Set UTF-8 everywhere in IntelliJ IDEA -> Settings -> Editor -> File Encodings. Didn't help.
3) Set the Gradle compiler encoding to UTF-8. Didn't help.
4) Set the Gradle JVM option org.gradle.jvmargs=-Dfile.encoding=utf-8. Didn't help.
5) Checked that Windows has Russian set as the default language for programs that do not support Unicode, for Cyrillic support. Didn't help.
I'm not sure what the problem with Gradle is, because a clean project without Gradle works great and the console output is fine; but with Gradle, Cyrillic symbols are broken. I also tried to correct the console output with the getBytes(charset) method and the new String(byte[], charset) constructor. I tried these variants:
String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "windows-1251");
Output:
Проверка работоспособности �? Ш
Not working.
String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "cp866");
Output:
?�?�???????�???? ?�???????�???�?????�?????????�?�?? ?� ?�
Not working.
String testLine = new String("Проверка работоспособности И Ш".getBytes(StandardCharsets.UTF_8), "utf-8");
Output:
�������� ����������������� � �
That is the result we get without any conversion at all.
Also, I tried one more thing: wrapping System.out to set another console encoding.
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintStream;

public class D {
    public static void main(String[] args) throws IOException {
        System.out.println(System.getProperty("file.encoding"));
        System.setOut(new PrintStream(System.out, true, "utf-8"));
        String testLine = "Проверка работоспособности И Ш";
        System.out.println(testLine);
        FileWriter writer = new FileWriter("D:\\test.txt");
        writer.write(testLine);
        writer.close();
    }
}
And we still get nothing new in the output; it didn't even change:
> Task :D.main()
UTF-8
�������� ����������������� � �
Well, according to all this information, I think something is really wrong with the console itself, because even the last execution of the code above writes this to the file:
Проверка работоспособности И Ш
It is in UTF-8 encoding; it's correct output. But System.out.println prints something irrational to the console, even though the encoder works fine. I don't know what is going on (sorry for the language). If the problem is really in Gradle, how do I check that? How do I make Gradle use another encoding for console output? Or is it maybe still something in IntelliJ IDEA, even though the output in the project without Gradle is correct?
I feel like a detective, but I have stalled, stuck on this case. I'd be grateful if somebody helps me.
Run | Edit Configurations, select your run configuration and add -Dfile.encoding=UTF-8 to the VM options field. This resolved the issue for me.
I was experiencing a similar issue.
It's a problem specific to Gradle + IntelliJ on non-ASCII-language versions of Windows.
I solved this in the following way:
Set systemProp.file.encoding=utf-8 in the gradle.properties file in the project.
In IntelliJ, go to Settings -> Tools -> Terminal -> Application Settings and set cmd.exe /K "chcp 65001" as the "Shell path". The shell path is just cmd.exe by default.
The property value in the properties file makes the build work with the Gradle tool in IntelliJ, and the shell-path setting fixes the encoding in the integrated terminal.
If you are using cmd outside of IntelliJ rather than the integrated terminal, simply run chcp 65001 in the console. This sets the character encoding of the cmd console to UTF-8.
Change the font to one that is able to correctly display all the characters in Settings (Preferences on macOS) | Editor | Font | Font settings.
I recently came across sklearn2pmml and jpmml-sklearn when looking for a way to convert scikit-learn models to PMML. However, I've been hitting errors when trying to run the basic usage examples, and I'm unable to figure them out.
When attempting the usage example in sklearn2pmml, I get the following error about casting a long to an int:
Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
at numpy.core.NDArrayUtil.getShape(NDArrayUtil.java:66)
at org.jpmml.sklearn.ClassDictUtil.getShape(ClassDictUtil.java:92)
at org.jpmml.sklearn.ClassDictUtil.getShape(ClassDictUtil.java:76)
at sklearn.linear_model.BaseLinearClassifier.getCoefShape(BaseLinearClassifier.java:144)
at sklearn.linear_model.BaseLinearClassifier.getNumberOfFeatures(BaseLinearClassifier.java:56)
at sklearn.Classifier.createSchema(Classifier.java:50)
at org.jpmml.sklearn.Main.run(Main.java:104)
at org.jpmml.sklearn.Main.main(Main.java:87)
Traceback (most recent call last):
File "C:\Users\user\workspace\sklearn_pmml\test.py", line 40, in <module>
sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml")
File "C:\Python27\lib\site-packages\sklearn2pmml\__init__.py", line 49, in sklearn2pmml
os.remove(dump)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\user\\appdata\\local\\temp\\tmpmxyp2y.pkl'
Any suggestions as to what is going on here?
Usage code:
#
# Step 1: feature engineering
#
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import pandas
import sklearn_pandas
iris = load_iris()
iris_df = pandas.concat((pandas.DataFrame(iris.data[:, :], columns = ["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"]), pandas.DataFrame(iris.target, columns = ["Species"])), axis = 1)
iris_mapper = sklearn_pandas.DataFrameMapper([
(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], PCA(n_components = 3)),
("Species", None)
])
iris = iris_mapper.fit_transform(iris_df)
#
# Step 2: training a logistic regression model
#
from sklearn.linear_model import LogisticRegressionCV
iris_X = iris[:, 0:3]
iris_y = iris[:, 3]
iris_classifier = LogisticRegressionCV()
iris_classifier.fit(iris_X, iris_y)
#
# Step 3: conversion to PMML
#
from sklearn2pmml import sklearn2pmml
sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml")
EDIT 12/6:
After the new update, the same issue comes up farther down the line:
Dec 06, 2015 5:56:49 PM sklearn_pandas.DataFrameMapper updatePMML
INFO: Updating 1 target field and 3 active field(s)
Dec 06, 2015 5:56:49 PM sklearn_pandas.DataFrameMapper updatePMML
INFO: Mapping target field y to Species
Dec 06, 2015 5:56:49 PM sklearn_pandas.DataFrameMapper updatePMML
INFO: Mapping active field(s) [x1, x2, x3] to [Sepal.Length, Sepal.Width, Petal.Length, Petal.Width]
Traceback (most recent call last):
File "C:\Users\user\workspace\sklearn_pmml\test.py", line 40, in <module>
sklearn2pmml(iris_classifier, iris_mapper, "LogisticRegressionIris.pmml")
File "C:\Python27\lib\site-packages\sklearn2pmml\__init__.py", line 49, in sklearn2pmml
os.remove(dump)
WindowsError: [Error 32] The process cannot access the file because it is being used by another process: 'c:\\users\\user\\appdata\\local\\temp\\tmpqeblat.pkl'
JPMML-SkLearn expected ndarray.shape to be a tuple of i4 values (mapped to java.lang.Integer by the Pyrolite library). However, in this case it was a tuple of i8 values (mapped to java.lang.Long), hence the cast exception.
This issue has been addressed in JPMML-SkLearn commit f7c16ac2fb.
If you encounter another exception (data translation between platforms can be tricky), please open a JPMML-SkLearn issue about it.
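The distinction can be illustrated from the Python side (a sketch; which width actually ends up in the pickle depends on the numpy version and platform, which is why the bug only surfaced on some machines):

```python
import numpy as np

# Pyrolite maps numpy's fixed-width integers to Java boxed types:
# int32 (i4) -> java.lang.Integer, int64 (i8) -> java.lang.Long.
# A shape tuple serialized with i8 values therefore arrives in Java
# as Longs, and a cast to Integer fails with ClassCastException.
shape_i4 = tuple(np.int32(d) for d in (150, 3))
shape_i8 = tuple(np.int64(d) for d in (150, 3))

print(type(shape_i4[0]).__name__)  # int32
print(type(shape_i8[0]).__name__)  # int64
```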
For a project I am currently working on, I need to annotate sentences with FrameNet annotations. This is handled well by the SEMAFOR semantic parser (https://github.com/Noahs-ARK/semafor). I installed and configured the tool as described in the git repository. However, when I run the runSemafor.sh script from the Cygwin terminal, it throws an IllegalArgumentException indicating that the generated pos.tagged file cannot be parsed.
Here is the complete console output in cygwin (running it on windows):
$ ./runSemafor.sh D:/XFrame/Libs/Semafor/semafor/temp/sample.txt D:/XFrame/Libs/Semafor/semafor/temp/output 2
**********************************************************************
Tokenizing file: D:/XFrame/Libs/Semafor/semafor/temp/neu.txt
real 0m0.140s
user 0m0.015s
sys 0m0.108s
Finished tokenization.
**********************************************************************
**********************************************************************
Part-of-speech tagging tokenized data....
/cygdrive/d/XFrame/Libs/Semafor/semafor/scripts/jmx/cygdrive/d/XFrame/Libs/Semafor/semafor/bin
Read 11692 items from tagger.project/word.voc
Read 45 items from tagger.project/tag.voc
Read 42680 items from tagger.project/tagfeatures.contexts
Read 42680 contexts, 117558 numFeatures from tagger.project/tagfeatures.fmap
Read model tagger.project/model : numPredictions=45, numParams=117558
Read tagdict from tagger.project/tagdict
*This is MXPOST (Version 1.0)*
*Copyright (c) 1997 Adwait Ratnaparkhi*
Sentence: 0 Length: 9 Elapsed Time: 0.007 seconds.
real 0m0.762s
user 0m0.046s
sys 0m0.171s
/cygdrive/d/XFrame/Libs/Semafor/semafor/bin
Finished part-of-speech tagging tokenized data.
**********************************************************************
**********************************************************************
Converting postagged input to conll.
Exception in thread "main" java.lang.IllegalArgumentException:
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec.decode(SentenceCodec.java:83)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec$SentenceIterator.computeNext(SentenceCodec.java:115)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec$SentenceIterator.computeNext(SentenceCodec.java:100)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.ConvertFormat.convertStream(ConvertFormat.java:94)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.ConvertFormat.main(ConvertFormat.java:76)
Caused by: java.lang.IllegalArgumentException: PosToken must have 2 "_"-separated fields
at com.google.common.base.Preconditions.checkArgument(Preconditions.java:92)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.Token.fromPosTagged(Token.java:248)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec$2.decodeToken(SentenceCodec.java:28)
at edu.cmu.cs.lti.ark.fn.data.prep.formats.SentenceCodec.decode(SentenceCodec.java:79)
... 6 more
As a sample file for the annotation I use the sample file from the repository:
This is a test for SEMAFOR, a frame-semantic parser.
This is just a dummy line.
There's a Santa Claus!
The generated pos.tagged file, however, looks as if there were no error. Why does this exception occur?
This_DT is_VBZ a_DT test_NN for_IN SEMAFOR_NNP ,_, a_DT frame-semantic_JJ parser_NN ._.
This_DT is_VBZ just_RB a_DT dummy_JJ line_NN ._.
There_EX 's_VBZ a_DT Santa_NNP Claus_NNP !_.
I came across exactly the same issue as you describe, and solved it moments ago myself. It happens because the parser only accepts a properly formatted input file with one sentence per line.
What you need to do: when writing each sentence to the file, add the following lines to your code to remove line breaks and tabs. Then you should be good to go!
line = line.replace('\n', '')
line = line.replace('\t', '')
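Put together, a small helper for writing a SEMAFOR-ready input file could look like this (a sketch; the file name and sample sentences are made up):

```python
def clean_sentence(line: str) -> str:
    # SEMAFOR expects exactly one sentence per line; stray line breaks
    # or tabs inside a sentence produce tokens that the "_"-separated
    # PosToken parser cannot split, causing the IllegalArgumentException.
    return line.replace('\n', '').replace('\t', '')

sentences = [
    "This is a test for SEMAFOR, \na frame-semantic parser.",
    "This is just \ta dummy line.",
]
with open('sample.txt', 'w') as f:
    for s in sentences:
        f.write(clean_sentence(s) + '\n')
```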
I am working in Python, porting some Java code.
I have this in Java:
public static byte[] datosOEM = new byte[900000];
I wrote this in Python following some documents that I found:
datosOEM=bytes([0x90, 0x00, 0x00])
When I run my program it shows me this:
Traceback (most recent call last):
File "test.py", line 63, in <module> # The line 63 is the location of x1=datosOEM[k];
x1=datosOEM[k];
IndexError: string index out of range
Craig corrected this part and recommended changing it to this:
datosOEM = bytearray(900000)
Now, when I run my program it shows me this:
Traceback (most recent call last):
File "test.py", line 10, in <module> # The line 10 is the location of datosOEM = bytearray(900000)
datosOEM = bytearray(900000)
TypeError: 'type' object has no attribute '__getitem_'
How can I fix this problem?
Part of my code is this:
...
response = port.read(8)
print(response)
k = 0
C = 0
conexion = True
if conexion:
    while response > 200:
        while C == 0:
            x1 = datosOEM[k]
            if x1 == 1:
                x2 = datosOEM[k+1]
                ...
Craig already told you how to create a bytearray with 900000 bytes; here is the same idea with 5 bytes so the output stays readable:
datosOEM = bytearray(5)
print(datosOEM) # bytearray(b'\x00\x00\x00\x00\x00')
datosOEM[0] = 65
print(datosOEM) # bytearray(b'A\x00\x00\x00\x00')
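As a side note, the TypeError in your second traceback ("'type' object has no attribute '__getitem__'") is what Python 2 raises when the type itself is subscripted, so the line that actually ran was most likely bytearray[900000] (square brackets) rather than bytearray(900000). A sketch of the difference (the exact TypeError message differs between Python 2 and 3):

```python
datosOEM = bytearray(900000)   # parentheses: constructor call
print(len(datosOEM))           # 900000, every byte initialized to zero

try:
    bytearray[900000]          # square brackets: subscripting the type itself
except TypeError as e:
    print("TypeError:", e)     # the same kind of error as in the traceback
```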