Journal blinder's Journal: Java HTML Renderer Question 6

first, a little background. the project site that i run catalogs links (provided by users), and every night a batch process queries the database and builds a lucene index (which is then searched on the front end).

anyway, something i want to incorporate is the ability for the indexing engine to go out to a URL it's processing and render the page it gets as an image (i.e. png or jpg), much like how alexa does it.

i've had very limited success with things like flying saucer's xhtml rendering engine (doesn't seem to support CSS at all).

i'm basically looking for a method to render HTML and save it as an image, sorta like bulk screen scraping. anyone have any experience with this type of functionality?

yeah, it has to be java, or accessible via a java framework (i wrote a very elegant and highly scalable back-end for this site so incorporating new functionality is quite simple)... i just need to figure out the best way to do this.

there is a commercial product called web renderer, but i'd prefer not to rely on a 3rd party product, plus i would rather spend my own resources on the solution than spend money on a product.
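
For reference, the "render HTML and save it as an image" step can be sketched with nothing but the JDK, using Swing's `JEditorPane` painted into an offscreen `BufferedImage`. This is only a baseline under stated assumptions: the class name `HtmlThumbnail` and the method `render` are made up for illustration, and Swing's built-in HTML kit handles roughly HTML 3.2 with a limited subset of CSS, so it is unlikely to fix the CSS problem described above on its own.

```java
import javax.swing.JEditorPane;
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;

// Hypothetical helper, not part of any library mentioned in the post.
final class HtmlThumbnail {

    static {
        // Assumption: the nightly batch job runs without a display,
        // so ask AWT to run headless before any AWT class is initialised.
        System.setProperty("java.awt.headless", "true");
    }

    /** Lays out the HTML at width x height pixels and paints it into an image. */
    static BufferedImage render(String html, int width, int height) {
        // Swing is normally driven from the event dispatch thread; doing
        // offscreen rendering on a batch thread is a simplification here.
        JEditorPane pane = new JEditorPane();
        pane.setEditable(false);
        pane.setContentType("text/html");
        pane.setText(html);           // parsed synchronously from the string
        pane.setSize(width, height);  // gives the text view a box to lay out in

        BufferedImage image =
                new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        try {
            // Start from a white page, then let the component paint itself.
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            pane.print(g);
        } finally {
            g.dispose();
        }
        return image;
    }
}
```

Something like `ImageIO.write(HtmlThumbnail.render(html, 1024, 768), "png", new File("thumb.png"))` would then save the result as a png. Images referenced by the page load asynchronously in Swing, so a sketch like this may paint before they arrive.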

This discussion was created by blinder (153117) for no Foes, but now has been archived. No new comments can be posted.

  • How about embedding internet explorer into the java program? Not sure if you'd be able to scrape the whole window or if the OCX for IE would have its own inaccessible handle.

    Check out embedding a readily available browser, maybe FF (not sure on embedding, but the executable does take command line arguments on both linux and windows).
  • You are wanting to take a URL, do an HTTP GET on the URL and save the resulting page as an image that represents how it would appear in a browser, correct? I'm trying to determine what you are trying to accomplish by this. Every reason I could come up with why you would want to do it could be accomplished easier in other ways, but if you simply MUST have it as an image, I think I know a way to do it. I'll have to look up some code I wrote last year and see if it would work for this. If you MUST have the
    • in a nutshell, yes, just do a simple GET, fetch the page, render the HTML/CSS (whatever) and save it as an image, you got it. the indexing engine is already doing GETs to validate URLs (examines http return codes to make sure there is a site there in the first place, helps eliminate the vast majority of bad links).

    • oh yeah, i don't need explaining. really. i fully understand this... in fact, like i said, i have been using flying saucer's xhtml rendering API for just this task, but it doesn't fully support CSS so the results are not what i would expect. in fact, i hacked their framework to allow arbitrary html (instead of just xhtml) using jtidy... so i fully understand what the problems are... *sigh*

      my goal here is to either find an alternate rendering engine (which is what is required for this sort of thing) or learn
  • Roll your own maybe? Take the Gecko [mozilla.org] engine and build your own thumbnail generator. Maybe look at what this guy [mozillazine.org] has done for inspiration.

    Or maybe this [ubrowser.com] or this [mozdev.org] might provide inspiration.

    Sorry I can't offer anything useful though dude.
  • Did you find a way to do this yet? I've been looking for something similar just recently, although I don't need full support for html and css.

    I did find this note:

    http://weblogs.mozillazine.org/roc/archives/2005/05/rendering_web_p.html [mozillazine.org]

    but I don't know enough about mozilla/Gecko internals to take advantage of it, or even figure out whether it's useful or not in this context, plus it is, for me, a rather heavyweight solution to my tiny little problem.

    I also looked at a half dozen other solutions that don't q
