parsing a cv file [closed] - java

I want to write code in either Java or PHP (CodeIgniter) to extract information such as the email address and phone number of a user uploading his resume or CV to the site. Basically, I want to build a CV parser.
Need help with this.
Thanks
EDIT
The CV format will be .doc.

Since there is no standard CV format, parsing will be next to impossible.
Instead, consider collecting contact information in an HTML form when they upload.

I'd suggest building it with a set of regular expressions.
If you just want to extract the phone number and email, the parser is very simple. It will work almost 100% of the time for emails and (I believe) around 98% for phone numbers.
If you wish to extract other information it will be more complicated, because there are no standards for CVs; the same information may be formatted in many different ways. Anyway, good luck!
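As a rough illustration of that approach, here is a minimal Java sketch (the patterns are deliberately simplified and will not cover every real-world email or phone format):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContactExtractor {
    // Simplified patterns; real emails and phone numbers come in many more shapes
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");
    private static final Pattern PHONE =
            Pattern.compile("\\+?\\d[\\d\\s().-]{7,}\\d");

    public static void main(String[] args) {
        String text = "Jane Doe, jane.doe@example.com, +1 (555) 123-4567";
        Matcher email = EMAIL.matcher(text);
        while (email.find()) {
            System.out.println("Email: " + email.group());
        }
        Matcher phone = PHONE.matcher(text);
        while (phone.find()) {
            System.out.println("Phone: " + phone.group());
        }
    }
}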

You should use Python and write your own scraper; it's easy, and in your case it can be done really quickly with modules like Beautiful Soup, urllib2, etc.
what this is all about
Beautiful Soup documentation

Ditto AlexR. If ALL you want to find is the email address and phone number, you could scan for strings of characters in the appropriate format. A couple of simple regular expressions could do that fairly reliably. Even that wouldn't be 100%. If someone included "Learned Java@Technocorp. US citizen." you might easily be fooled into thinking there's an email address of "java@technocorp.us". Okay, that's a strained example, but it's the sort of thing that shoots down natural language parsing.
If you want more than that, there is no easy answer. You could search for keywords; for example, to find where someone went to school you could look for the words "college" or "university". But even then, someone might put "Graduate of Foobar College" or "College: Foobar" or "BA from Foobar" or many other possible formats.

As @Corbin said, there is no standard CV format, so it will be quite difficult to parse with 100% accuracy.
Still, you can try Apache Tika - a content analysis toolkit - to parse resumes in doc/docx format. Tika also supports many other document formats, including PDF, TXT, XML, ODF, etc.
By the way, extracting the email and phone number from a resume can be achieved with a few lines of regex after getting the whole content from the CV using Apache Tika.
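As a sketch of that pipeline (assuming the Tika parsers are on the classpath; the file name is just a placeholder):

import java.io.File;
import org.apache.tika.Tika;

public class ResumeText {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format (.doc, .docx, .pdf, ...) and extracts plain text
        Tika tika = new Tika();
        String text = tika.parseToString(new File("resume.doc"));
        System.out.println(text);
        // The extracted text can then be run through the email/phone regexes
    }
}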
Let me know if you get stuck.
Hope this helps!
Note: I am working on a resume summarizer.

Related

exact speech word recognition [closed]

I used the Android Speech-to-Text API (RecognizerIntent) to recognize the words said by the user. The problem is that it returns the autocorrected word rather than what was literally said. I want it to return the exact word (without correction) said by the user. Please suggest another Android library with this feature, or a way to get this behaviour from the built-in Android speech-to-text API.
I also saw the Google API for this, but it is paid, and it is also AI-based.
I want it to return the exact word (without correction) said by the user.
I think you misunderstand what speech recognition is capable of doing.
A speech recognizing system is only capable of recognizing an uttered word as being one of a number of possible words. It doesn't ... and cannot ... tell you with 100% accuracy what the speaker actually said.
This applies to any speech recognition system, including a human listener. (How many times have you had to ask someone to "Say that again please" ?)
The only way to determine with absolute certainty the exact words that were spoken is to ask the person who spoke them to type them in! (And even then, they may not give you a 100% accurate answer, in some cases.)
In short, what you want is not possible. Software cannot do it. Humans cannot do it, even if they believe that they can [1]. You need to adjust your expectations.
[1] The Two Ronnies - Four Candles sketch
Identifying / recommending better (more accurate) speech recognition software or services is off-topic.

Transform an image of handwritten notes to text [closed]

I have hundreds of images of handwritten notes. They were written by different people, but they are in sequence, so you know that, for example, person1 wrote img1.jpg -> img100.jpg. The style of handwriting varies a lot from person to person, but there are parts of the notes which are always fixed; I imagine that could help an algorithm (it helps me!).
I tried Tesseract and it failed pretty badly at recognizing the text. Since each person has about 100 images, is there an algorithm I can train by feeding it a small number of examples, like 5 or fewer, so it can learn from that? Or would that not be enough data? From searching around, it looks like I need to implement a CNN (e.g. this paper).
My knowledge of AI is limited, though. Is this something that I could still do using a library and some studying? If so, what should I do going forward?
This is called OCR (optical character recognition), and there has been progress. Actually, here is an example of how simple it is to parse an image file to text using Tesseract:
try:
    from PIL import Image
except ImportError:
    import Image
import pytesseract

def ocr_core(file):
    # Run Tesseract OCR on the image and return the recognized text
    text = pytesseract.image_to_string(Image.open(file))
    return text

print(ocr_core('sample.png'))
BUT
I am not very sure that it can recognize different types of handwriting. You can give it a try yourself to find out. If you want to try the Python example, you need to import pytesseract, but first things first: install Tesseract on your OS and add it to your PATH.
There are many OCRs out there, and some perform better than others. However, this is a field that has improved a lot recently thanks to deep neural networks. I would consider using a cloud provider such as Azure, Google Cloud, or Amazon: you upload the image and they return the metadata.
For instance:
https://azure.microsoft.com/en-us/services/cognitive-services/computer-vision/
If you don't want to use cloud services for any reason, I would consider using TensorFlow... but some knowledge is required:
Tensorflow model for OCR

How to extract page number from PDF file [closed]

We explored many APIs, such as Tika, PDFBox, and iText, to extract the page number from a PDF file, but we were not able to do it. In iText we found PdfPageLabels.getPageLabels(reader), but the behaviour of this method is not uniform.
The reason why you don't find any software that is able to extract page numbers from a PDF is simple: the concept of a page number doesn't exist in PDF.
Allow me to predict your response.
*"Wait a minute!" you say, "When I open a PDF in Adobe Reader, I can clearly see a page number in the document!"
Well yes, you can see that page number with your eyes and your human intelligence, but to a machine that number is just some text drawn on a canvas. A machine consuming the document has no idea what all the glyphs and lines and shapes on a page are about. Hence, software can not give you the page number you see as a human. A machine doesn't know where to look!
If you know something about PDF, I can predict your next reply.
"Wait a minute!" you say, "What about Tagged PDF? Doesn't Tagged PDF mean that the semantics of a document are stored along with the representation?"
Well yes, when a PDF is tagged, a snippet of text knows that it is part of a title, or a paragraph, or a list... But Tagged PDF is there to define the structure of the real content. Page numbers, however, are not part of the real content: they are marked as artifacts, along with headers, footers, and other items on a page that are not considered real content, and there is no way to tell the page numbers apart from the other artifacts.
"Then what are these page labels about?" you ask.
Well, page labels are optional. They are present in some PDFs that are well conceived, but they will be absent in a large majority of the PDFs you'll find in the wild.
This is the long answer. The short answer is simple: You are asking for something that is impossible (in general, not only with iText, Tika, PdfBox, or any other tool you might try).
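For what it's worth, when page labels are present, the iText 5 call the question mentions can read them. A minimal sketch ("document.pdf" is a placeholder), which also shows why you should usually expect to get nothing back:

import com.itextpdf.text.pdf.PdfPageLabels;
import com.itextpdf.text.pdf.PdfReader;

public class PageLabelDump {
    public static void main(String[] args) throws Exception {
        PdfReader reader = new PdfReader("document.pdf");
        // Returns null when the PDF defines no page labels, which is the common case
        String[] labels = PdfPageLabels.getPageLabels(reader);
        if (labels == null) {
            System.out.println("No page labels defined in this PDF.");
        } else {
            for (int i = 0; i < labels.length; i++) {
                System.out.println("Page " + (i + 1) + " is labeled " + labels[i]);
            }
        }
        reader.close();
    }
}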

Most efficient way to replace many (5000+) strings in a .txt file [closed]

Using a general-purpose programming language like Java, what is the most efficient way to search through a ~20 page document to replace a set of 5000+ strings with some predetermined replacement string? The program should not replace any strings that have already been replaced. What data structure would be optimal to store the 5000+ strings and each of their replacements - two arrays, a dictionary, or something else?
Here are some of the options that I have considered so far:
Iterate through the entire .txt document once per string using String.replace. The problem is that the algorithm must iterate through the entire .txt document an extra time for each stored string.
Iterate through the .txt once, replacing strings as necessary while building a new string by appending the replacements. This seems more efficient, but each step would still require checking the entire set of 5000+ strings for anything to replace.
Is there a more optimized means of solving this problem, or is one of the above attempts already optimal?
Also, would it be possible to run this algorithm more efficiently in a lower-level language like C?
You want to replace strings from a set of 5000+ and you want to make it optimal... Now my question to you is: how will you know whether you have to replace a string if you don't read it? It's not possible; you have to read everything. And the shortest way to do that is to go line by line and replace immediately. Somebody can correct me if I'm wrong, but reading a file is one of the most basic operations there is, so using a library for that beyond what is available by default in the programming language seems like total overkill to me. Furthermore, every language has basic IO, and if it doesn't, then don't use it.
To store the strings, it all depends on what you want to do with them. Different data structures have different purposes, and some are better suited to some situations than others. If you just need to store them, then a simple array is fine. However, if you need more advanced functions, then you need to consider your options. But again, it's all up to what you want to do with them later.
And there is the memory issue: you need to calculate how much memory your 5000+ strings will take, because you might run out of memory. Then you need to think about whether it's worth it to use all that memory. Check this link.
Finally, your question about C: of course it will be more efficient. Java runs in a virtual machine that adds considerable overhead, so if you know that there is a cost for every single operation, then you understand that C will be more efficient than Java in terms of raw performance.
I would use the commons-lang library, which I think has exactly what you are looking for. Basically you create one array with all the strings you want to substitute and another array with the substitutions. See http://commons.apache.org/proper/commons-lang/javadocs/api-release/index.html for details on the StringUtils#replaceEach method.
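A minimal sketch of that approach (assuming the commons-lang3 artifact; note that replaceEach makes a single pass, so replaced text is not scanned again, which matches the requirement in the question):

import org.apache.commons.lang3.StringUtils;

public class BulkReplace {
    public static void main(String[] args) {
        String text = "the quick brown fox";
        String[] searchList = {"quick", "brown"};
        String[] replacementList = {"slow", "red"};
        // One pass over the input; prints "the slow red fox"
        System.out.println(StringUtils.replaceEach(text, searchList, replacementList));
    }
}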

checking words in a dictionary [closed]

I need to determine if an unknown 5- or 6-letter string is a valid word, i.e. is in the dictionary. I could submit the string/word to an online dictionary, but I need to check this string/word, which will be different each time, about 100 to 150 times. That seems a bit time consuming.
My next thought was to get a dictionary program of my own. It would need to be in Java, as my program is written in Java. Does the Java API already have a class for doing this? Can I get a decent one that someone has already coded, so all I have to do is submit the string/word to it?
My program is not being used for spell checking. I want to write a program for unscrambling Jumble word puzzles when I get stuck on a scrambled word. Thanks for your suggestions.
You could use one of the open source dictionaries and load it into a database: ftp://ftp.cerias.purdue.edu/pub/dict/ and ftp://ftp.ox.ac.uk/pub/wordlists/
For scrambled words, you might want to look at the Jumble algorithm, an implementation of which is seen here.
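The core of the Jumble approach is to index every dictionary word under its letters sorted alphabetically; a scrambled word then maps to the same key as its solutions. A minimal sketch (the words.txt file name is hypothetical):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Unscrambler {
    public static void main(String[] args) throws Exception {
        // Index every dictionary word under its sorted letters
        Map<String, List<String>> index = new HashMap<>();
        for (String word : Files.readAllLines(Paths.get("words.txt"))) {
            index.computeIfAbsent(key(word), k -> new ArrayList<>()).add(word);
        }
        // Any anagram of "listen" shares the key "eilnst"
        System.out.println(index.getOrDefault(key("silnet"), List.of()));
    }

    private static String key(String word) {
        char[] letters = word.toLowerCase().toCharArray();
        Arrays.sort(letters);
        return new String(letters);
    }
}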
If you don't need spell checking this would be really easy. Just load all your words into a HashSet and then check to see if that set contains the word you want to test. There are tons of word lists available.
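A minimal sketch of the HashSet idea, assuming a plain one-word-per-line list saved as words.txt:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

public class WordCheck {
    public static void main(String[] args) throws Exception {
        // Load a one-word-per-line list into a set for O(1) membership tests
        Set<String> dictionary = new HashSet<>(Files.readAllLines(Paths.get("words.txt")));
        System.out.println(dictionary.contains("jumble")); // true if the list contains it
    }
}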
If you do need a spell checker, then check out aspell or other free APIs.
aspell and its associated word lists and dictionaries might be the answer.
I think aspell has a Java version.
edit: actually it looks like you might do better with this aspell spinoff called Jazzy.
Maybe you can check some wordlist:
http://wordlist.sourceforge.net/
This page has some word lists in text format, so you can process them in Java yourself, most easily using a HashSet. You would only need a more efficient data structure if memory or lookup speed became an issue.
Maybe you could try Peter Norvig's spelling checker. I think it's an elegant way to get 80-90% accuracy.
