Web scraping / crawling a particular Google book
For my work, I need to scrape the text from a large book on Google Books.
The book in question is very old and out of copyright: a Gazetteer of the
World. We will be putting the text into a database, so we need the raw
text rather than a PDF.
I have already spent a lot of time researching the tools and techniques
that could be used for this task, but I feel overwhelmed and do not know
where to start, or which method is the best or easiest. I do not want to
waste more time on a dead end.
The problem can be split into two parts: (1) crawling the pages, and (2)
downloading the data. It is really part (1) that I am most stuck on. Once
I have the data (even if it is only the raw HTML pages), I am sure I could
use a parser to extract what I want.
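For part (2), a plain-text dump can be pulled from saved HTML pages with the Python standard library alone. This is only a minimal sketch (the class and its behaviour are my own illustration, not a recipe specific to Google Books; real pages will need the extracted text checked against the scans):

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping
    <script> and <style> blocks."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # >0 while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the page's visible text as one space-joined string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)
```

For anything messier than this, a dedicated parser such as BeautifulSoup would be more robust, but the idea is the same: strip the markup, keep the text, and load the result into the database.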
Navigating the pages is done by clicking "continue" or an arrow. The page
increment is not always consistent; it can vary because some pages have
embedded images, so I cannot necessarily predict the next URL. The
initial URL for volume 1 of the book is:
http://books.google.co.uk/books?id=grENAAAAQAAJ&pg=PR5&output=text
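Since the next URL cannot be predicted, one approach is to fetch each page, find the URL that the "continue" link points at, and follow it until no further link appears. A minimal Python sketch of that loop is below; note that the regex for locating the next-page link is an assumption (the real markup needs inspecting first), and `save_page` is a hypothetical stand-in for writing the HTML to disk:

```python
import re
import urllib.request


def find_next_url(html):
    """Extract the first books.google.* link with a pg= parameter.
    The href pattern here is a guess; inspect the actual page source
    and adjust the regex to match the real 'continue' link."""
    match = re.search(r'href="(https?://books\.google\.[^"]*?pg=[^"]+)"', html)
    return match.group(1).replace("&amp;", "&") if match else None


def fetch(url):
    """Download one page. A browser-like User-Agent is sent because
    the default Python one is often rejected (this may be why wget
    gets a 401); Google may still refuse automated requests."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8", errors="replace")


# Usage sketch (requires network access, so not run here):
# url = "http://books.google.co.uk/books?id=grENAAAAQAAJ&pg=PR5&output=text"
# while url:
#     html = fetch(url)
#     save_page(html)            # hypothetical: write html to a file
#     url = find_next_url(html)  # None when there is no next page
```

If the "continue" link turns out to be generated by JavaScript rather than present in the HTML, a tool that drives a real browser (e.g. Selenium) would be needed instead of plain HTTP requests.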
I can program in Java and JavaScript, and I have basic knowledge of
Python. I have considered Node.js and Scrapy, among many other things. I
tried wget but received a 401 Unauthorized error. I have also tried
iRobot, Greasemonkey, and FoxySpider.
I would appreciate any advice. Thanks very much.