Edit, reload and redownload website in java? - java

I have plans for a small application to gather some data from a website.
The website has a few textboxes in which you can enter different numerical values; you then click a button and an output value is written on the page.
What I want the application to do is fill the textboxes, then "click" the button and gather the output data.
I'm only really familiar with Java, but my guess is that it's better to write such an application in JavaScript?
Also, if it's doable in Java, should I be looking at some custom libraries apart from jsoup, which I've already used?
I have already more or less figured out how to download the HTML and extract the data I need using jsoup; it's writing the values back into the textboxes that troubles me.
Thank you

jsoup includes an implementation of the DOM (Document Object Model, a data structure representing web pages as object trees) that can help you change the textboxes' values. If you're going to write your project in Java, then jsoup is the better choice for the job.
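
As a rough sketch of what "filling the textboxes and clicking the button" amounts to when the page uses a plain HTML form: the browser sends the field values to the server as a URL-encoded request body. The snippet below builds such a body with the standard library only. The field names width and height are made up for illustration, and it assumes the output is computed on the server rather than by JavaScript running in the page.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBody {
    // Encodes name/value pairs the way a browser does for an
    // application/x-www-form-urlencoded form submission.
    static String encode(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> field : fields.entrySet()) {
            body.add(URLEncoder.encode(field.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(field.getValue(), StandardCharsets.UTF_8));
        }
        return body.toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("width", "12.5");   // hypothetical textbox names and values
        fields.put("height", "3 m");
        System.out.println(encode(fields));
    }
}
```

With jsoup, the same fields can be passed with Jsoup.connect(url).data(name, value).post(), which returns the response page as a Document ready for the extraction you already have.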

Related

Edit and sanitize user input in a servlet when Code is allowed?

The webpage I'm working on with JSP and a Java Servlet needs to let users write comments and articles that contain text but also code in various languages (including HTML and JavaScript).
The data is stored in a mysql database and displayed later on the page.
For input, I thought to use one of the many WYSIWYG Editors out there.
Those usually produce (x)Html code for the database.
This means I need some kind of sanitizing on the server side before inserting into the database, since the editor could easily be circumvented and malicious code displayed on the site (the database itself is protected by prepared statements).
What would be the best and most simple way to approach this topic?
And would it make more sense to switch to BBCode Input instead of html?
I've found several threads around here, but most don't take into account that code actually needs to be displayed on the site, and most are several years old already.
Huge thanks in advance!
You can use KefirBB, either to process BBCode or to filter HTML.
https://github.com/kefirfromperm/kefirbb
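
For the part of the problem where submitted code must be shown as text rather than executed, a common technique is to escape HTML's special characters before writing the text into the page. The sketch below is not KefirBB's API; it is a standard-library illustration with made-up class and method names, and it only covers displaying code, not filtering the markup a WYSIWYG editor produces.

```java
class HtmlEscaper {
    // Replaces the characters HTML treats as markup with character references,
    // so the browser renders them literally instead of interpreting them.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A user-submitted snippet that should be shown, not run.
        System.out.println(escape("<script>alert('x')</script>"));
    }
}
```

Escaping at output time, when the stored text is written into the page, keeps the original submission intact in the database.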

How to manipulate a Check box in Word with Java and save as PDF?

I need to edit some checkboxes in a big Word file (docx) and then save it as a PDF. The file contains many images and is about 19 MB in size.
I may also need to add some checkboxes and text.
My idea was to use docx4j, but before learning the ropes I want to ask whether this is possible and what the best way is.
Would it be better to save the document as a PDF first and then use that as the base for processing?
Yes, you can manipulate checkboxes using docx4j.
Be aware that there are several different kinds of checkboxes:
legacy checkbox
content control checkbox
checkbox character
and the details depend on which types are present.
For more, you should post a snippet of the relevant OpenXML (and as they say here on SO, code showing what you've tried).
Is it necessary to use only docx4j?
Recently I tried a solution that helped me manage a Word document with checkboxes and save it as a PDF file: Plumsail Documents. The use case is populating a Word template from a form with checkboxes. You can connect your app via Zapier or Power Automate to activate checkboxes depending on a value from your app. You can have the resulting file generated as a PDF and delivered by email or to other systems using Zapier and Power Automate.
The great thing is that Plumsail Documents has a templating engine that can also handle pictures.
Your case may look like this:
Create a form in Plumsail Form. It will allow you to activate checkboxes depending on your needs, or your users' needs.
Create a process in Plumsail Documents, upload your Word document and set it as a template. Put placeholders wherever you want to change or fill in values or data. Set the output format to PDF.
Set the delivery method. Save across apps or deliver by email.
I recommend reading the article. The solution is not free, but there is a free 30-day trial, so you will have enough time to try it.

Scraping issue (data-reactid)

I'm trying to scrape a website and compile a spreadsheet based on what data I pull.
The website I am trying to scrape is WEARVR.
I am not too experienced with scraping, but my approach would be to find unique attributes within html tags and use this to scrape what I want.
So for this website my approach would be firstly to scrape a list of URLs of the pages you are taken to upon clicking on one of the experiences, for example : https://www.wearvr.com/#game_id=game_1041, and then secondly, cycle through this list scraping the relevant attributes each time.
However I am stuck at the first step as instead of working with simple "a href" tags, I come across "data-reactid" tags which confuse the matter.
I do my scraping with iMacros but I'm pretty decent at Java now so would learn scraping in Java if need be (which seems likely as iMacros is pretty limited).
My question is, how do these "data-reactid" tags work, and as such how can I utilise them for my scraping purposes?
Additionally if this is an XY problem, please let me know and suggest a better approach.
Thanks for reading!
The simplest way to approach scraping is to treat the page like a big string (because ultimately, that is what it is). You can search within that string for certain things (like href=) to grab links. You can also intelligently assume that whatever is in the a tags is relevant to the link and grab that.
You really don't have to understand HTML, and you don't have to understand how the page or any additional CSS or markup works; you just need to identify which recognizable string combinations surround the text you want. I will say this is probably much easier to implement in Java than with iMacros, and probably more accurate.
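
A minimal sketch of the string approach, using a regular expression to pull out whatever follows href= in quotes. The markup in main is invented for illustration, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkGrabber {
    // Matches href="..." or href='...' and captures the quoted value in group 2.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*([\"'])(.*?)\\1", Pattern.CASE_INSENSITIVE);

    static List<String> hrefs(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(2));
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up markup; other attributes such as data-reactid are simply ignored.
        String html = "<a href=\"/experiences/one\" data-reactid=\"12\">One</a>"
                + "<a href='/experiences/two'>Two</a>";
        System.out.println(hrefs(html));
    }
}
```

This only finds links that are present in the downloaded text. If the page builds its links in the browser with JavaScript, the raw HTML may not contain them at all.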
The other way you can handle it, which requires a little more knowledge of HTML and XML, is to treat the entire page as an XML document. This doesn't always work with HTML, particularly if it is older or badly formed, so the string approach is easier. You get some utility out of the various XML mapping libraries that exist, but otherwise it's similar to the above.

Dynamic Content Parsing

I am working on content parsing and wrote a sample program for it, using the link below as a sample.
Please visit the link below:
http://www.equitymaster.com/stockquotes/sector.asp?sector=0%2CSOFTL&utm_source=top-menu&utm_medium=website&utm_campaign=performance&utm_content=key-sector
From the page at the link above I parsed the table data and stored it in Java objects.
BSE and NSE are not my exact requirement; I have just taken this as a sample. The page is laid out with tables that use no ids or classes. In my example I parsed the data using XPath.
This is my XPath:
/html/body/table[4]/tbody/tr/td/table[2]/tbody/tr[2]/td[2]/font/table[2]
I selected this and the parsing works fine. The problem is that if they change the website structure in the future, my program will certainly stop working. Is there another way to parse the data dynamically, store it in a database, and display the results based on a condition, even if the page structure changes? I used the jsoup API for this. Are there other APIs that provide better support for this type of requirement?
If you're trying to parse a page without any clear id/class to select your nodes, you have to rely on something else. Spelling out the whole tree is indeed the weakest way of doing it: if anything is added or changed, everything collapses.
You could try relying on color: //table[@bgcolor="#c9d0e0"], the "GET MORE INFO" field: //table[tr/td//text()="GET MORE INFO"], or the "More Info" that appears on every line: //table[.//td//text()="&nbspMore Info&nbsp"]...
The idea is to find something ideally unique (if you can't find any unique criterion, table[color condition selecting a few tables][2] is still stronger than walking the whole tree), present every time, and use that as an id.
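
A small sketch of selecting a table by an attribute instead of by its position in the tree, using the XPath support in the Java standard library. The page content and the class and method names are invented for illustration, and the example assumes well-formed markup; real-world HTML usually has to go through an HTML parser first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class AnchoredXPath {
    // Returns the text of the first cell of every row in the table(s)
    // matched by tableXPath.
    static List<String> firstCells(String xhtml, String tableXPath) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            NodeList cells = (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(tableXPath + "/tr/td[1]", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < cells.getLength(); i++) {
                values.add(cells.item(i).getTextContent());
            }
            return values;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate " + tableXPath, e);
        }
    }

    public static void main(String[] args) {
        // Made-up page: a layout table followed by the data table we want.
        String page = "<html><body>"
                + "<table><tr><td>navigation</td></tr></table>"
                + "<table bgcolor=\"#c9d0e0\"><tr><td>ACME</td><td>10</td></tr>"
                + "<tr><td>Initech</td><td>20</td></tr></table>"
                + "</body></html>";
        System.out.println(firstCells(page, "//table[@bgcolor='#c9d0e0']"));
    }
}
```

Adding or removing layout tables before the data table does not affect this selector, whereas a positional path such as table[4] would break.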

What technologies are there for formatted, structured data input and output?

I am working on a project here that ingests internal resumes from people at my company, strips out the skills and relevant content from them and stores it in a database. This was all done using docx4j and Grails. This required the resumes to first be submitted via a template that formatted everything just right so that the ingest tool knew what to look for to strip the data.
The second part of this is getting a "reduced" resume back out of the database. In other words, I want to search the uploaded content I now have and print new resumes only for people who have, let's say, Java programming experience. So I can go into my database, find the people who originally listed Java as a skill, and output a new set of resumes that are still in a nice templated format and contain only the relevant info instead of ALL the content.
I have been writing some software to do this in Java. It uses a docx template and overwrites the items in the customXML that are bound to the content controls in the document, so the new data shows up and can be saved as a new docx with that custom data.
This seems really cumbersome to me, and it has some limitations. For one, let's say my template has a place for 3 skills and a particular person has 8 skills. There seems to be no good way to add those 5 additional skills to the docx other than painstakingly inserting the data with all of the formatting XML tags and such. This is a real pain, because if the template changes, I don't want to have to go back into my software and edit source code to change that additional data input XML tag to bold instead of italic.
I was reading up on using InfoPath to create a form to collect the input, connected to some SharePoint data source or similar to store the extracted data. However, I can't find out whether it is possible with SharePoint to get the data back out in a nicely formatted way. What would the general steps for this be? I couldn't find much about this topic with some quick googling.
Thanks
You could set up the skills:
<skills>
<skill>..</skill>
<skill>..</skill>
and use a "repeat" content control pointing to the container. This would handle any number of <skill> entries.
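
As a sketch of the data side of this suggestion, the snippet below builds a skills container with one skill element per entry, however many there are, using only the standard library. The class and method names are hypothetical, and binding the resulting XML to a repeat content control (for example through docx4j) is not shown.

```java
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class SkillsXml {
    // Builds <skills> with one <skill> child per list entry.
    static Document build(List<String> skills) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            Element container = doc.createElement("skills");
            doc.appendChild(container);
            for (String skill : skills) {
                Element item = doc.createElement("skill");
                item.setTextContent(skill);
                container.appendChild(item);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Eight skills go into a template that was designed around three.
        Document doc = build(List.of("Java", "Groovy", "Grails", "SQL",
                "XML", "Linux", "Git", "Maven"));
        System.out.println(doc.getElementsByTagName("skill").getLength());
    }
}
```

Because the formatting lives in the template's repeated row, changing bold to italic there would not require touching this code.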
