How to read HTML text in Java and show enriched text

My requirement is:
1) I'll get the response from a third party as HTML text in a String.
2) I have to parse it in Java and show enriched text to the user.
Please help me. I'm not getting any leads from Google results.
Thanks

Seems like you're looking for something like jsoup.
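With jsoup on the classpath, Jsoup.parse(html).text() returns the text of an HTML string. As a rough sketch that runs without any extra dependency, the snippet below swaps in the HTML parser that ships with the JDK (javax.swing.text.html) to do the same kind of tag stripping. The class name HtmlTextExtractor and the sample markup are made up for illustration, and this only extracts text; it does not render the formatting.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of jsoup: strips tags using the JDK's own HTML parser.
class HtmlTextExtractor {

    /** Returns the text runs of an HTML string, joined by single spaces. */
    static String toPlainText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called once per run of text found between tags.
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(data);
            }
        };
        try {
            // 'true' tells the parser to ignore any charset declared inside the markup.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample markup standing in for the third-party response.
        System.out.println(toPlainText("<p>Hello <b>world</b></p>"));
    }
}
```

For actually displaying the formatting, a component that renders HTML is needed; the parsing step above only gets you the content.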

Related

parse text from xml

I have the following link:
https://hero.epa.gov/hero/ws/swift.cfc?method=getProjectRIS&project_id=993&getallabstracts=true
I want to parse this XML to get only the text, like:
Provider: HERO - 2.xx
DBvendor=EPA
Text-encoding=UTF-8
How can I parse it?
Well, it's not a text file, it's an HTML file. If you open the file in a browser and select "view source", you will see the text enclosed in <char> tags.
When it's opened in a browser, these tags and the other HTML content are interpreted and the output is rendered on the page (that's why it looks like text). If you want to implement similar behavior in Java, look into PhantomJS and/or jsoup examples.
It looks like a text file, but it is an XML file and the browser just displays its text content.
To verify, right-click and look at the page source.
You can use a library like jsoup to parse the file and get the contents:
https://jsoup.org/cookbook/introduction/parsing-a-document

How to deal with accent problems using HTMLAgilityPack

I'm trying to extract the text of an HTML file, but inside a tag the following text appears:
<h3>Café</h3>
and when I extract the text using the following code:
htmlDocument.DocumentNode.SelectSingleNode("some XPath").InnerText;
I get this string "Cafédirect". How can I fix this?
I've answered this here; basically, you can ask HtmlAgilityPack to detect the encoding of the HTML document.
HTMLAgilityPack Asp.net C# Error Handling
I know the answer now; I found a way to do it. Here it is:
htmlDocument.OptionDefaultStreamEncoding = Encoding.UTF8;
By default the encoding is System.Text.Encoding.Default; with UTF-8, the accented characters are read correctly.
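The mechanism behind this kind of accent problem is not specific to HtmlAgilityPack: the same bytes give different characters depending on the charset used to decode them. Here is a small Java illustration (not HtmlAgilityPack code; the class name EncodingDemo is made up, and ISO-8859-1 is only assumed as an example of a non-UTF-8 default).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: decodes the same bytes with two different charsets.
class EncodingDemo {

    /** Decodes raw bytes into a String using the given charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // "\u00e9" is the accented e; the document is stored as UTF-8 bytes.
        byte[] raw = "<h3>Caf\u00e9</h3>".getBytes(StandardCharsets.UTF_8);

        // Wrong charset: the two UTF-8 bytes of the accent become two separate characters.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));

        // Matching charset: the accent survives.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

Setting the reader's encoding to match the file, as the line with OptionDefaultStreamEncoding does, corresponds to the second call.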

Parse XML with text between tags

I have the following scenario:
<xml>Text text text<a><b></b>Test text</a> text text text<c>text text</c><d><d/><xml>
How can I parse this XML so that I keep all the information (parse it into a tree?). I need to keep the text and the sequence and position of the tags in the text.
Thanks for your help!
EDIT: I already tried using a Java parser, but I didn't manage to get it to work.
This isn't well-formed XML, so you can't use a standard parser.
You must write your own.
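If the unclosed tags in the sample are only typos and the real input is well-formed, the JDK's DOM parser already keeps text and tags in document order. A minimal sketch under that assumption follows; the class name MixedContentWalker, the root element name and the corrected sample are invented for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: walks a DOM tree and records tags and text in document order.
class MixedContentWalker {

    /** Returns entries such as "<a>", "text:Test text", "</a>" in the order they occur. */
    static List<String> flatten(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            List<String> events = new ArrayList<>();
            walk(doc.getDocumentElement(), events);
            return events;
        } catch (Exception e) {
            // Input that is not well-formed ends up here.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void walk(Node node, List<String> events) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            events.add("text:" + node.getNodeValue());
            return;
        }
        if (node.getNodeType() != Node.ELEMENT_NODE) {
            return; // skip comments, processing instructions, etc.
        }
        events.add("<" + node.getNodeName() + ">");
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, events);
        }
        events.add("</" + node.getNodeName() + ">");
    }

    public static void main(String[] args) {
        // Assumed well-formed version of the sample from the question.
        String xml = "<root>Text text text<a><b></b>Test text</a> text text text"
                + "<c>text text</c><d></d></root>";
        flatten(xml).forEach(System.out::println);
    }
}
```

For input that really is malformed, this throws, and a lenient parser or a hand-written one is still needed.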

Print html portion into pdf using Java

community!
My project is simple: I have a link to a website that has information on various chemical substances, and I want to extract some data and put it into a PDF. The thing is that I want to keep the formatting of the original HTML (using its CSS, of course).
Example of substance: http://www.molbase.com/en/msds_1659-31-0-moldata-2.html#tabs
I used jsoup to read the HTML of the table at the bottom of the page, the MSDS one, which contains multiple sections with different information about the substance, but I really don't know how to save the exact HTML format into my PDF file. I have tried iText too, but it gives me a "missing ending tag" error, and even if it worked, it would print the full page, not only that MSDS table.
Here is what I have tried, but it isn't effective:
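// Needs: import org.jsoup.Jsoup; import org.jsoup.nodes.Document; import org.jsoup.nodes.Element;
// urlbun holds the page URL.
Document docu = Jsoup.connect(urlbun).get();
// First <div> whose class attribute is exactly "msds".
Element tableHeader = docu.select("div[class=\"msds\"]").first();
// text() drops all markup, so the table structure is lost at this point.
String[] finSyn = tableHeader.text().split(" ");
String moreText = " ";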
I tried to split the text that the webpage has under that div (class="msds"), but I cannot find a good way to split it.
Could you please give me a hint on what to do? Even if the formatting is not the same, I would like to be able to display the information in the same way, with indentation and such.
Thank you!
You can put the content that you want to convert to PDF inside an element with a CSS ID (such as a DIV) and then use the PDFmyURL API to convert only that section to PDF.
Please refer to the documentation on our website about how to select pieces of a page to convert to PDF.
Disclosure: I work for the company that owns this site

How can I read the contents of a webpage, not the source code of that page? [duplicate]

Possible Duplicate:
How can I use translate.google.com/ to translate the string in Java program?
I want to read the contents of a webpage, not the source code of that page.
"Contents" means some comment or some lines, etc.
Example:
http://translate.google.com/#en|bn|I%20love%20life .
From this page, I want to collect the translated line "আমি জীবন ভালবাসি"
How can I get this in Java?
But you do want to read the source code, as it contains the content you're looking for.
<span id="result_box" class="short_text" lang="bn"><span class="hps">আমি</span></span>
This is the node that contains the translation. If you can build the URL containing the untranslated string, capture the response to that URL, and then find #result_box, you will have your content.
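The first of those steps, building the URL, could look roughly like this. It is only a sketch: the class and method names are made up, the URL pattern is copied from the example in the question, and the URLEncoder overload used needs Java 10 or later. Whether the response fetched from that URL really contains #result_box is not checked here.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: builds a URL in the same shape as the one in the question.
class TranslateUrlBuilder {

    /** Builds http://translate.google.com/#from|to|text with the text percent-encoded. */
    static String buildUrl(String from, String to, String text) {
        // URLEncoder turns spaces into '+', so swap those for %20 to match the example URL.
        // A literal '+' in the input is already encoded as %2B and is not affected.
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
        return "http://translate.google.com/#" + from + "|" + to + "|" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("en", "bn", "I love life"));
    }
}
```

Fetching the page and locating the #result_box node would then be done with an HTML parser or a tool such as HtmlUnit.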
I believe you can achieve this with HtmlUnit. Look at the method DomNode#asText().

Resources