April 13, 2015

Scraping title tags from large sites

I am working on scraping the title tags of large sites, 25 or 50 million pages each. It could be 2, 5, or 10 sites, but probably no more than 10.

There are two programs for this, SEO Spider by Screaming Frog (screamingfrog.co.uk, written in Java) and Xenu's Link Sleuth (a native Windows program), but they are not good for sites this large (I know this from the authors / owners of these programs).

I know that this task can be done in Java or PHP, and it may also be possible with the Google Chrome console or with add-ons for Chrome and Firefox (I am not sure the browser route would work well).

Somebody recommended leasing a very good Linux server and writing or getting a script that would run the job from there.

In general, several things are involved: hardware (a good PC may be a factor too, especially a lot of RAM and an SSD), IT, programming languages, the Internet, and the websites themselves, and I am not sure what the best way to go is. For some reason, the Linux route seems like a good way to go.

Can you tell me how this could work, how much it could cost, and whether it is a good idea overall? I know some Linux and PHP, but I am not very good with Linux overall (I am planning to learn).
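
For a rough sense of scale, the time a crawl takes depends mostly on how many pages per second can be fetched, and that is limited by what the target site tolerates more than by the server. The sketch below only does the arithmetic. The request rates are assumptions for illustration, not measurements, and `crawl_days` is a made-up helper name.

```python
SECONDS_PER_DAY = 86_400


def crawl_days(pages, requests_per_second):
    """Days needed to fetch `pages` pages at a steady request rate."""
    return pages / requests_per_second / SECONDS_PER_DAY


# Assumed sustained rates; the real rate depends on the site being crawled.
for rate in (10, 100, 1000):
    days = crawl_days(50_000_000, rate)
    print(f"{rate:>5} req/s: {days:6.1f} days per 50M-page site")
```

At an assumed 10 requests per second, one 50-million-page site takes about 58 days. At 100 requests per second it takes about 6 days. The server rental cost would then follow from how long the job runs.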

People recommend a good server from Amazon and a Linux script, and scraping the titles from there. I will add that I will not have a list of the sites' URLs. They would need to be mapped in some way too, and I am still not sure how (I know there are programs and websites for creating sitemaps, so this should be possible).
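
One way to handle both the missing URL list and the title scraping is a crawler that discovers pages by following same-host links and records each page's title as it goes. The following is a minimal sketch using only the Python standard library, not a production crawler. The names `TitleAndLinkParser`, `parse_page`, and `crawl_titles` are made up for illustration, and `fetch` is a hypothetical function supplied by the caller that downloads a URL and returns its HTML (or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class TitleAndLinkParser(HTMLParser):
    """Collects the first <title> text and every <a href> value on a page."""

    def __init__(self):
        super().__init__()
        self.title = None
        self.hrefs = []
        self._in_title = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and self.title is None:
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            # Collapse runs of whitespace and newlines inside the title.
            self.title = " ".join("".join(self._chunks).split())


def parse_page(html, page_url):
    """Return (title, same-host absolute links) for one HTML document."""
    parser = TitleAndLinkParser()
    parser.feed(html)
    parser.close()
    host = urlparse(page_url).netloc
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme in ("http", "https") and parts.netloc == host:
            links.append(absolute)
    return parser.title, links


def crawl_titles(start_url, fetch, max_pages=1000):
    """Breadth-first crawl of one host; returns {url: title}."""
    seen = {start_url}
    queue = deque([start_url])
    titles = {}
    while queue and len(titles) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # download failed or page missing
            continue
        title, links = parse_page(html, url)
        titles[url] = title
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return titles
```

In this sketch the set of seen URLs and the queue live in memory, which is fine for a small test but would probably have to move to disk or a database for tens of millions of pages. A real crawl would also need request throttling and robots.txt handling, which are left out here.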

Thanks.
