How to extract text from image?


boron
01-30-2007, 02:47 AM
How can I extract text from (many) images on a site (all at once)?

Example:
If I use iwebtool Link extractor (http://iwebtool.com/link_extractor?domain=amazingindiaexperience.info&show=0&title=1&address=1&pagerank=1), it shows the PR as an image below each link. How can I convert that image into text (so I can put it into Excel and sort by it)?

*It can take ~20 seconds before the PR images are shown.

Reckless
01-30-2007, 03:38 AM
by hand ;)

I really don't know how to do that...

Submerge
01-30-2007, 10:02 AM
I think you'd have to write a program to get the desired results, unless an application can use the alt or title attributes to print the text instead of the image.
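If the page does put the value into the alt or title attribute of each image, no OCR is needed at all. A minimal sketch using only the Python standard library follows; the sample markup and the names ImgTextCollector and image_texts are made up for illustration, and the real page may carry no alt/title text.

```python
from html.parser import HTMLParser


class ImgTextCollector(HTMLParser):
    """Collect the alt/title text of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.found = []  # list of (src, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        # Prefer alt, fall back to title; skip images that carry neither.
        text = attr.get("alt") or attr.get("title")
        if text:
            self.found.append((attr.get("src") or "", text))


def image_texts(html):
    """Return (src, text) pairs for all images that have alt or title text."""
    parser = ImgTextCollector()
    parser.feed(html)
    parser.close()
    return parser.found


# Hypothetical markup, not copied from any real site.
sample = '<a href="http://example.com/">x</a><img src="pr.png" alt="PR 4"><img src="logo.png">'
print(image_texts(sample))
```

Images without alt or title text are skipped, so an empty result means the values exist only as pixels.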

boron
02-03-2007, 03:05 AM
I've partially solved PR checking for multiple URLs (http://extra-traffic.com/pra_checker.htm.en.htm): Works for 15 URLs (free version) and PR values are copy-paste friendly.

But I need it for 1,000 URLs :) .

The PR images from my first post are in .png format. I guess the text can be extracted with OCR (optical character recognition) software, like the kind that comes with scanners or can be found on Sourceforge.net (http://sourceforge.net/search/?type_of_search=soft&words=ocr+png). Now I only have to find the right one.

zac439
02-03-2007, 08:11 PM
Hmm, not sure what you mean, but check this out:
http://www.netdisaster.com/sucker.php?url=http://google.com/

It just sucks the text off any buttons/table/body of the webpage. You didn't really say what you needed specifically so I might be a little off base with this. Just a suggestion.

boron
02-04-2007, 11:40 PM
I thought I was clear, but here is what I need:

This is an image in .png format which shows the PR of one site: PR of a particular link (http://www.iwebtool.com/images/vs/gpr.png?l=http%3A%2F%2Ffplanque.net%2FBlog%2FitTrends%2F2004%2F01%2F10%2Frss_rdf_and_atom_in_a_nutshell)

In the real environment it is placed right beside the site link: iwebtool link extractor (http://www.iwebtool.com/link_extractor?domain=amazingindiaexperience.info&show=2&address=1&pagerank=1).

I need that PR value in text format so I can copy-paste it into Excel and sort the links by PR. With a list of 1,000 links, you can see why I need this.
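The bookkeeping around the images can be sketched in a few lines of Python. The URL pattern is taken from the example image link in this post; whether the service still answers that way is an assumption. The function names and the PR numbers in the usage lines are invented for illustration. Reading the value out of each downloaded image would still need an OCR step, which is not shown here.

```python
import csv
import io
from urllib.parse import quote

# Pattern copied from the example image URL in the post (assumed unchanged).
PR_IMAGE_BASE = "http://www.iwebtool.com/images/vs/gpr.png?l="


def pr_image_url(link):
    """Return the PR image URL for one link (the link is percent-encoded)."""
    return PR_IMAGE_BASE + quote(link, safe="")


def to_csv_sorted_by_pr(rows):
    """rows: iterable of (link, pr) pairs; returns CSV text, highest PR first."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["link", "pr"])
    for link, pr in sorted(rows, key=lambda r: r[1], reverse=True):
        writer.writerow([link, pr])
    return out.getvalue()


print(pr_image_url("http://fplanque.net/Blog/"))
# Hypothetical PR values, only to show the sorting.
print(to_csv_sorted_by_pr([("http://a.example/", 2), ("http://b.example/", 5)]))
```

The resulting CSV text can be saved to a file and opened directly in Excel.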

darkcrunk
02-17-2007, 09:01 AM
Hehehe .. there is no definite solution to this I believe.

the iWebTool works well most of the time, but then again, nothing is free, is it :( ?

boron
03-01-2007, 05:47 AM
I think I've found some ways to extract text and images from sites, movies...

1. Movies Extractor Scout (http://www.bytescout.com/moviesextractorscout.html) can extract text or images from a Flash movie or from a video (Youtube/Google). You can make a screenshot and save it as a .jpg, .tiff, .png, or .pdf file in one click.

2. Extract text/image from PDF file (http://www.dd.se/pdf/pdftips/default.html?sidor=/pdf/pdftips/extracttxtgraph.html) (tutorial).

Well, still not happy.

rjp
03-02-2007, 02:45 PM
Ever tried to write a neural network to do this?

Didn't think so. 8)

It's not an easy thing to do.

boron
03-02-2007, 11:31 PM
"It's not an easy thing to do."

When a member on some forum asked about extracting text, he was told that even the CIA, FBI, and MI6 together would have trouble with it.

I know that the security-code images at the end of some forms are made this way on purpose, so that nobody can simply copy them. I once managed to "extract" the code with right click/Properties: I copied the number from the image name in the URL.

I know there are some techniques to extract text based on the color difference between letters and background.

Basically, I would be happy to find out how to use OCR (Optical Character Recognition) online. Well, I've found about 50 links at the bottom of the OCR article in Wikipedia (http://en.wikipedia.org/wiki/Optical_character_recognition).
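The color-difference idea is usually called thresholding: every pixel darker than some cut-off counts as ink, the rest as background. A minimal sketch follows, assuming dark letters on a light background and a grayscale image already loaded as a grid of 0-255 values; the pixel grid and the name binarize are made up for illustration.

```python
def binarize(gray, threshold=None):
    """Turn a grayscale grid (0-255 ints) into 1 = ink, 0 = background.

    Assumes dark letters on a light background; the threshold defaults to
    the midpoint between the darkest and the lightest pixel.
    """
    flat = [p for row in gray for p in row]
    if threshold is None:
        threshold = (min(flat) + max(flat)) / 2
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# Made-up 5x3 image of the digit "1": dark stroke on light paper.
gray = [
    [230, 25, 228],
    [30, 20, 231],
    [229, 22, 230],
    [231, 18, 229],
    [28, 21, 24],
]
for row in binarize(gray):
    print("".join("#" if bit else "." for bit in row))
```

A fixed midpoint only works when letters and background are clearly separated; noisy or multi-colored images need a smarter cut-off.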

Arthur07
03-17-2007, 09:58 AM
"How can I extract text from (many) images on a site (all at once)..."

Yep, by hand sounds like a good option :P.

The problem is how to distinguish the text from the rest of the image. Think of all the different fonts that could be used: it wouldn't be as simple as identifying a single pattern for each character (which isn't exactly simple either).

Try searching on sourceforge.net; you might get lucky.
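For images that all come from one generator and so presumably use one font (an assumption about the PR images, not something confirmed here), a single pattern per character can be enough. A rough sketch of such template matching follows; the 5x3 templates and the names mismatch and recognize are invented for illustration, and real templates would have to be cropped from known sample images.

```python
# Hypothetical 5x3 templates ("#" = ink, "." = background).
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}


def mismatch(a, b):
    """Count the cells where two equally sized glyphs differ."""
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def recognize(glyph):
    """Return the template character with the fewest differing cells."""
    return min(TEMPLATES, key=lambda ch: mismatch(glyph, TEMPLATES[ch]))


# A "7" with one noisy pixel still matches the "7" template best.
noisy_seven = ["###", "..#", ".#.", "##.", ".#."]
print(recognize(noisy_seven))
```

This breaks down as soon as fonts, sizes, or distortions vary, which is exactly the difficulty described above.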