I currently have a large amount of information sorted into table form in Google Docs; an example can be seen below:
I would like to transfer all of this information into a Google spreadsheet, with lines 1-5 going across columns B-F, respectively, and the information going underneath each respective column.
Would I need to use a script to accomplish this task? If so, what type of script should I use, and where can I find such a script (e.g. by hiring a freelance programmer to write it for me, if necessary)? Are there any other ways this task could be accomplished? All of the information in the Google Docs is very standardized, so there is no variation that could complicate a script. If a script could transfer one set of 5, it could work on all of the sets.
Thank you, let me know if you need any more information.
This can be done with a lot of different languages. I would approach this using Java just because I am most familiar with it. I would start by downloading the Google Doc as plaintext (.txt). Then run it through line by line parsing it into .csv format. From there you can import it directly into Google Sheets.
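The same idea can be sketched in Python rather than Java. This is only a rough sketch: it assumes the exported .txt file holds one value per line and that every record is exactly five non-empty lines. The function names lines_to_rows and rows_to_csv are made up for illustration.

```python
import csv
import io


def lines_to_rows(text, fields_per_row=5):
    """Group consecutive non-empty lines into rows of `fields_per_row` values."""
    values = [ln.strip() for ln in text.splitlines() if ln.strip()]
    rows = []
    for i in range(0, len(values), fields_per_row):
        # The leading empty cell keeps column A free, so values land in B-F.
        rows.append([""] + values[i:i + fields_per_row])
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that Google Sheets can import."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Writing the result of rows_to_csv(lines_to_rows(text)) to a .csv file gives something that File > Import in Google Sheets should accept.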
You can do this with Notepad++ or an equivalent editor, using the find-and-replace tool in extended mode.
For example, to replace a line break, search for \r\n and replace it with whatever you need.
If you place a \t (tab) between fields, you can simply paste them into a sheet and they will align into columns.
So here you can replace double line breaks with some placeholder symbol, then replace single line breaks with \t, and finally replace the placeholder with a single line break. You get all the data in a column structure.
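If you would rather script the three replacements than do them by hand, here is a minimal Python sketch of the same trick. It assumes records are separated by exactly one blank line and that the NUL character never appears in the data, so it is safe to use as the placeholder. The name blocks_to_tsv is hypothetical.

```python
def blocks_to_tsv(text):
    """Turn blank-line-separated blocks into tab-separated rows."""
    text = text.replace("\r\n", "\n").strip()  # normalize Windows line breaks
    placeholder = "\x00"                       # assumed absent from the data
    text = text.replace("\n\n", placeholder)   # 1. protect record boundaries
    text = text.replace("\n", "\t")            # 2. line breaks inside a record become tabs
    return text.replace(placeholder, "\n")     # 3. restore one line break per record
```

The returned text can be pasted straight into a sheet; each tab moves to the next column.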
I'm working with HTML tags, and I need to interpret HTML documents. Here's what I need to achieve:
I have to recognize and remove HTML tags without removing the
original content.
I have to store the index of the previously existing markups.
So here's a example. Imagine that I have the following markup:
This <strong>is a</strong> message.
In this example, we have a String sequence with 35 characters, marked up with a strong tag. As we know, an HTML markup has a start and an end, and if we interpret the start and end markup as a sequence of characters, each also has a start and an end (a character index).
Again, in the previous example, the beginning index of the open/start tag is 5 (counting from index 0), and the end index is 13. The same logic applies to the close tag.
Now, once we remove the markup, we end up with the following:
This is a message.
The question:
How can I remember with this sequence the places where I could enter the markup again?
For example, once the markup has been removed, how do I know that I have to insert the opening tag in the X position/index, and the closing tag in the Y position/index... Like so:
This is a message.
5 9
index 5 = <strong>
index 9 = </strong>
I must remember that it is possible to find the following situation:
<a>T<b attribute="value">h<c>i<d>s</a> <g>i<h>s</h></g> </b>a</c> <e>t</e>e<f>s</d>t</f>.
I need to implement this in Java. I've figured out how to get the start and end index of each tag in a document. For this, I'm using regular expressions (Pattern and Matcher), but I still do not know how to insert the tags again properly (as described). I would like a working example (if possible). It does not have to be the best example (the best solution) in the world, but only that it works the right way for any kind of situation.
If anyone has not understood my question, please comment and I will explain it better.
Thanks in advance.
EDIT
People in the comments are saying that I should not use regular expressions to work with HTML. I do not mind whether or not regular expressions are used to solve this problem; I just want to solve it, no matter how (but of course, in the most appropriate way).
I mentioned that I'm using regular expressions, but I do not mind using another approach that gives the same result. I read that an XML parser could be the solution. Is that correct? Is there an XML parser capable of doing everything I need?
Again, Thanks in advance.
EDIT 2
I'm making this edit now to explain the applicability of my problem (as asked). Well, before I start, I want to say that what I'm trying to do is something I've never done before. It's not in my area, so it may not be the most appropriate way to do it. Anyway...
I'm developing a site where users are allowed to read content but cannot edit it (edit or remove text). However, users can still mark/highlight excerpts (ranges) of the content (with some styling). That is the big picture.
Now the problem is how to do this (in Java). On the client side, for now, I was thinking of using TinyMCE to enable styling of content without text editing. I could save the styled text to a database, but this would take up a lot of space, since every client is allowed to do this and there are many clients. So if a client marks snippets of a paragraph, saving the paragraph back in the database for each client in the system is somewhat costly in terms of storage.
So I thought of just saving the range (indexes) of the markups made by users in a database. It is much easier to save just a few numbers than all the text with the styling required. In the case, for example, I could save a line / record in a table that says:
In X paragraph, from Y to Z index, the user P defined a ABC
stylization.
This would require a translation / conversion, from database to HTML, and HTML to database. Setting a converter can be easy (I guess), but I do not know how to get the indexes (following this logic). And then we stop again at the beginning of my question.
Just to make it clear:
If someone offers a solution that will cost money, such as a paid API, tool, or something similar, unfortunately this option is not feasible for me. I'm sorry :/
In a similar way, I know it would be ideal to do this processing with JavaScript (client-side). It turns out that I do not have a specialized JavaScript team, so this needs to be done on the server side (unfortunately), which is written in Java. I can only use a JavaScript solution if it is already ready, easy and quick to use. Would you know of any ready-made, easy-to-use library that can do it in a simple way? Does it exist?
You can't use a regular expression to parse HTML. See this question (which includes this rather epic answer as well as several other interesting answers) for more information, but HTML isn't a regular language because it has a recursive structure.
A language that allows arbitrarily deep nesting isn't regular, so you can't parse it with a (true) regular expression.
Keep in mind that HTML is a context-free language (or, at least, pretty close to context-free). See also the Chomsky hierarchy.
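For the index bookkeeping itself, a tokenizer is enough. You do not need a full tree, which also means badly nested tags like the ones in the question are not a problem. The question asks for Java, so treat the following as a language-neutral sketch written in Python with the standard html.parser module; the same shape should carry over to any Java tokenizer. The names TagOffsetRecorder, strip_tags and restore_tags are made up for this example. One known limitation: character references such as &amp;amp; are decoded in the plain text, so the round trip is only exact for input without entities, and end tags are rebuilt in lower case.

```python
from html.parser import HTMLParser


class TagOffsetRecorder(HTMLParser):
    """Collects plain text and remembers where each tag sat in that text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.text_parts = []
        self.length = 0    # length of the plain text seen so far
        self.markers = []  # (offset in plain text, raw tag), in document order

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the tag exactly as written, attributes included.
        self.markers.append((self.length, self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are recorded once, not as start plus end.
        self.markers.append((self.length, self.get_starttag_text()))

    def handle_endtag(self, tag):
        self.markers.append((self.length, "</" + tag + ">"))

    def handle_data(self, data):
        self.text_parts.append(data)
        self.length += len(data)


def strip_tags(html):
    """Return (plain_text, markers) for an HTML fragment."""
    parser = TagOffsetRecorder()
    parser.feed(html)
    parser.close()
    return "".join(parser.text_parts), parser.markers


def restore_tags(text, markers):
    """Re-insert the recorded tags at their offsets in the plain text."""
    out, pos = [], 0
    for offset, tag in markers:
        out.append(text[pos:offset])
        out.append(tag)
        pos = offset
    out.append(text[pos:])
    return "".join(out)
```

For the example in the question, strip_tags returns the text "This is a message." together with the markers (5, "<strong>") and (9, "</strong>"), which are exactly the two numbers you would store per highlight.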
I've done a lot of internet searching to find some information, to no avail. Hopefully you can help me.
I want to be able to take a flat file with normal content (i.e. full English sentences, paragraphs, etc.), extract each word, and store each word individually, one word per row, in a SQL database (it doesn't matter if there are spaces, but characters such as apostrophes can be kept in).
I then want to have an HTML page with code to access this DB and output the text to the user one word at a time, essentially 'writing' the input file's text word by word on the web page.
This is just a coding exercise, but I am frustrated as I know the what but not the how. I am not sure where to start. Please note some of these files can be quite big (~20,000 words), so there may be a performance element to consider in any solution.
TL;DR: I want to extract individual words from a text file with normal everyday sentences into a SQL DB that I can retrieve from a HTML page.
Simple read & split exercise
with open(<filename>) as f:
    dd = {}
    for ln in f:
        wds = ln.strip().split()
        for word in wds:
            dd[word] = 1 # need something for value
for wkey in dd:
    <insert into db>
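Here is one way the database step could look. It is only a sketch, and it assumes SQLite (Python's built-in sqlite3 module) as a stand-in for whatever SQL database you end up using. Unlike the dictionary above, which keeps each distinct word once, this version stores every occurrence in reading order so the page can replay the text. The table name words and the function name load_words are made up.

```python
import sqlite3


def load_words(lines, conn):
    """Insert every whitespace-separated word, in order, one word per row."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS words ("
        "position INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    words = [w for line in lines for w in line.split()]
    # position is filled in automatically, in insertion order
    conn.executemany("INSERT INTO words (word) VALUES (?)", ((w,) for w in words))
    conn.commit()
    return len(words)
```

Typical use would be something like load_words(open("input.txt"), sqlite3.connect("words.db")), where both file names are placeholders. A single executemany call keeps 20,000 words well within reach.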
Well, before you start you should choose just one programming language. Since you seem like you are a beginner I would highly recommend Python over Java, but it depends on if you're required to use any particular language by an employer/professor/etc.
Also just to point out, this is also a very BIG task that you've chosen. I'll try to break it down into parts for you, but I recommend starting with just one of these parts before you move on, and make sure it works on your local machine before you try putting it on the web.
First you need to use something to read in your file, preferably line by line. FileReader/BufferedReader in Java or the open() and readlines() functions in Python will do this. I would also check out the tutorials online on file handling for whichever of these two languages you're going to use. The Python one is here. Practice this with a test file or a small section of your real file before you start working on your real input files.
When you start processing the lines from the file, I would recommend splitting them into individual words using a string split function on spaces or on any punctuation, such as ,.!?". This way you'll pull out the individual words from each line in the file.
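As a rough illustration of the splitting step, here is a small Python sketch that pulls out runs of letters, digits and apostrophes and drops other punctuation. The pattern is an assumption about what counts as a word (ASCII only), so adjust it for your text. The name split_words is made up.

```python
import re

# A "word" here is a run of ASCII letters, digits or apostrophes (an assumption).
WORD = re.compile(r"[A-Za-z0-9']+")


def split_words(line):
    """Return the words in a line, without surrounding punctuation."""
    return WORD.findall(line)
```

For example, split_words("Don't stop, it's fine!") gives ["Don't", "stop", "it's", "fine"].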
Next, you'll want to choose a database API for the appropriate programming language. I used PyMySQL, but there is also MySQLdb for Python. In Java there is JDBC.
You'll need to then build your database on a server somewhere, preferably on the same server as your HTML page for ease of connection. You'll want to practice connecting to your database and adding sample rows before you start trying to process your real input files.
You can't have normal HTML access the database directly - you'll need to use a coding language like Python for that. I've never used Java for webpages, but with Python you'll simply output text and tell the server to display it as the webpage. This will do the trick:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import otherstuffhere
## Must have this header to tell browser how to handle this output
## and must be printed first
print ("Content-Type: text/html\n\n")
## Connect to database here
## Your code to display words from the database goes below here
print (myfield1)
Also remember that when you output your text, you'll need to add all the HTML tags to the normal text output. For example, when printing each word, you'll need to add <p> or <br> to end each line, because although the Python print() function will automatically add a line break, this doesn't translate to a line break in HTML. For example:
print ("My word list is: <br>")
for word in dbOutputList:
    print (word)
    print ("<br>")
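One thing to watch for: words coming out of a text file can contain characters like < or &, which the browser would treat as markup. A small sketch of building the same output with escaping, using the standard html.escape function (render_words is a made-up name):

```python
from html import escape


def render_words(words):
    """Build the HTML for a word list, one word per line, with escaping."""
    body = "<br>\n".join(escape(w) for w in words)
    return "<p>My word list is:<br>\n" + body + "</p>"
```

You would then print the returned string after the Content-Type header.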
After that the REAL fun/crying begins, but you should work on the above before you move on.
The Problem:
I have numerous files that contain Apache web server log entries. Those entries are not in date time order and are scattered across the files. I am trying to use Pig to read a day's worth of files, group and order the log entries by date time, then write them to files named for the day and hour of the entries it contains.
Setup:
Once I have imported my files, I am using Regex to get the date field, then I am truncating it to hour. This produces a set that has the record in one field, and the date truncated to hour in another. From here I am grouping on the date-hour field.
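To make the intended result concrete, here is the same grouping written as a plain Python sketch. This is not Pig, just an illustration of the desired output. It assumes the usual Apache timestamp format, e.g. [10/Oct/2000:13:55:36 -0700]. The name group_by_hour and the hour-key format are made up.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumed Apache common-log timestamp, e.g. [10/Oct/2000:13:55:36 -0700]
TIMESTAMP = re.compile(r"\[(\d{2}/[A-Za-z]{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def group_by_hour(lines):
    """Map 'YYYY-MM-DD-HH' to the original log lines of that hour, sorted by time."""
    groups = defaultdict(list)
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a recognizable timestamp
        when = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        groups[when.strftime("%Y-%m-%d-%H")].append((when, line))
    # Only the original record is kept in the output, not the grouping key.
    return {
        hour: [line for _, line in sorted(entries, key=lambda e: e[0])]
        for hour, entries in groups.items()
    }
```

Each key would become a file name, and each value the contents of that file.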
First Attempt:
My first thought was to use the STORE command while iterating through my groups using a FOREACH and quickly found out that is not cool with Pig.
Second Attempt:
My second try was to use the MultiStorage() method in the piggybank, which worked great until I looked at the file. The problem is that MultiStorage wants to write all fields to the file, including the field I used to group on. What I really want is just the original record written to the file.
The Question:
So...am I using Pig for something it is not intended for, or is there a better way for me to approach this problem using Pig? Now that I have this question out there, I will work on a simple code example to further explain my problem. Once I have it, I will post it here. Thanks in advance.
Out of the box, Pig doesn't have a lot of functionality. It does the basic stuff, but more often than not I find myself having to write custom UDFs or load/store funcs to get from 95% of the way there to 100% of the way there. I usually find it worth it, since writing a small store function is a lot less Java than a whole MapReduce program.
Your second attempt is really close to what I would do. You should either copy/paste the source code for MultiStorage or use inheritance as a starting point. Then, modify the putNext method to strip out the group value, but still write to that file. Unfortunately, Tuple doesn't have a remove or delete method, so you'll have to rewrite the entire tuple. Or, if all you have is the original string, just pull that out and output that wrapped in a Tuple.
Some general documentation on writing Load/Store functions in case you need a bit more help: http://pig.apache.org/docs/r0.10.0/udf.html#load-store-functions
I need to parse complex (non-fixed-length) CSV files into Java objects in order to compare their values.
I first tried the Flatform Parsing Framework; I liked the approach of describing the values in an extra (XML) document. Maybe it's the right tool for simple CSV (and also flat) files. Nevertheless, my CSV files contain lines that vary in the number of fields, and sometimes a record spans multiple lines. There are also dependencies among those fields.
Here's a little sample: (each type has a certain amount of extra parameters)
; <COMMENTS (to be ignored)>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_C>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_D>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>,<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>, -
<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_B>,<DESCRIPTION>,<PARAMETER>,<PARAMETER>
<NAME>,<TYPE_A>,<DESCRIPTION>,<PARAMETER>
So I need something to describe and parse the CSV file in a more complex manner. I'm new to this; I've heard about parser generators. Is that what I need?
Try OpenCSV (see http://opencsv.sourceforge.net/#what-features). It handles embedded carriage returns just fine.
One option is to use the Scanner class, or you might want to check out Spring Batch. I've never actually used Spring Batch, but given that batch jobs often read from simple text files, I believe I read that it caters for this, including all sorts of object mapping.
You may also try japaki
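Whichever library you pick, the multi-line records probably need a small pre-processing pass first. Below is a language-neutral sketch in Python, not Java. It assumes, based on the sample, that a trailing "-" marks a continuation, that lines starting with ";" are comments, and that no field legitimately ends a line with "-". The name read_records is made up.

```python
import csv


def read_records(lines):
    """Join continuation lines, skip comments, and split each record into fields."""
    logical, pending = [], ""
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";"):
            continue  # blank line or comment
        if line.endswith("-"):
            # Record continues on the next line: drop the marker, keep the comma.
            pending += line[:-1].rstrip()
            continue
        logical.append(pending + line)
        pending = ""
    return [[field.strip() for field in row] for row in csv.reader(logical)]
```

Each returned list starts with name, type, description, followed by however many parameters that type carries, so the type-specific checks can be done afterwards on ordinary lists.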
I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The second part of this is getting a "reduced" resume back out of the database. In other words, I want to search the uploaded content I now have and print out new resumes only for people who have, let's say, Java programming experience. So I can go into my database, find the people who originally had Java as a skill, and output a new set of resumes that are still in a nice templated format and contain only the relevant info instead of ALL the content.
I have been writing some software to do this in Java that will basically use a docx template, overwriting the items in customXML which are bound to the content controls in the doc, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me, and has some limitations. For one, let's say my template has a place for 3 skills, and the particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit source code to change that additional data input XML tag to bold instead of italic.
I was reading up on using InfoPath to create a form that I could use to get the input, connecting to some SharePoint data source or something to store the stripped-out data. However, I can't seem to find out whether it is possible with SharePoint to get the data back out in a nicely formatted way. What would the general steps for this be? I couldn't find very much about this topic with some quick googling.
Thanks
You could set up the skills:
<skills>
  <skill>..</skill>
  <skill>..</skill>
</skills>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.
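Generating that XML part for an arbitrary number of skills is then straightforward. Here is a sketch in Python using the standard xml.etree.ElementTree module (the project itself is Java, so this only shows the shape). The function name build_skills_xml is made up, and binding the part to the document is not shown.

```python
import xml.etree.ElementTree as ET


def build_skills_xml(skills):
    """Build a <skills> container holding one <skill> element per entry."""
    root = ET.Element("skills")
    for name in skills:
        ET.SubElement(root, "skill").text = name
    return ET.tostring(root, encoding="unicode")
```

With three skills or eight, the output has the same structure, so the template does not need a fixed number of slots.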