This might be a stupid question, but I want to learn some web scraping. I already know Perl, so I would prefer to do it in that language. I know there are a lot of modules on CPAN; I tried to read their documentation, but I barely understand any of it. I haven't found anything that explains from zero what this process involves. I could use some help with links or materials to study a little web scraping.
At a pretty basic level, 'web scraping' is just downloading a web page and parsing it to extract the information you want. At a starter level, the module you want is LWP, which lets you fetch content, and then 'something' to extract the information you want.
HTML::TableExtract, for example. There's nothing to say you can't roll your own using pattern matching, of course, but... well, processing HTML isn't a new problem, so why reinvent the wheel?
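A minimal sketch of that two-step flow. The URL in the comment and the column headers here are made up for illustration; the HTML snippet is inlined so the example is self-contained:

```perl
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::TableExtract;

# In real use you would fetch the page first, e.g.:
#   my $html = get('http://example.com/prices.html') or die "fetch failed";
# (that URL is a placeholder). Here we parse an inline snippet instead.
my $html = <<'HTML';
<table>
  <tr><th>Item</th><th>Price</th></tr>
  <tr><td>Apple</td><td>0.50</td></tr>
  <tr><td>Bread</td><td>1.20</td></tr>
</table>
HTML

# Only extract the table whose header row matches these column names.
my $te = HTML::TableExtract->new(headers => [ 'Item', 'Price' ]);
$te->parse($html);

# Each row comes back as an arrayref of cell values.
for my $row ($te->rows) {
    print join(' => ', @$row), "\n";
}
```

The `headers` option is the nice part: you describe the table by its column names rather than by its position in the page, so the scraper keeps working if the page layout shuffles around.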
At a more advanced level, though, you might want to interact with a site: log in to it, perhaps, or 'click through' some menus. For this, I like WWW::Mechanize.
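A sketch of that kind of interaction with WWW::Mechanize. Everything site-specific here is a placeholder: the URL, the form, the field names, and the link text all depend on the page you're actually automating, so check the site's HTML first:

```perl
use strict;
use warnings;
use WWW::Mechanize;

# autocheck => 1 makes every request die on HTTP errors,
# so you don't silently scrape an error page.
my $mech = WWW::Mechanize->new(autocheck => 1);

# Placeholder URL and form fields -- inspect the real login form
# to find the correct field names.
$mech->get('http://example.com/login');
$mech->submit_form(
    form_number => 1,
    fields      => {
        username => 'me',
        password => 'secret',
    },
);

# 'Click through' by following a link by its visible text.
$mech->follow_link(text => 'Reports');

print $mech->content;
```

Mechanize keeps cookies and session state between requests automatically, which is what makes the log-in-then-browse pattern work.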
I'm afraid I can't give you much more without a better understanding of the sort of problem you're trying to solve, though. Are you at a basic 'fetch a web page and parse it' sort of level?
(You can find details and examples of the above modules on CPAN. The LWP page has some examples that should get you started.)
I wrote a pretty basic tutorial on WWW::Mechanize here. I have successfully crawled pages on several occasions, so please let me know if you have a case you would like to try and need some help. :)
To start, you can look at the WWW::Mechanize and HTML::TreeBuilder::XPath modules.
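HTML::TreeBuilder::XPath lets you query the parsed page with XPath expressions instead of walking the tree by hand. A self-contained sketch (in practice the HTML would come from LWP or `$mech->content`):

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Inline HTML so the example runs on its own.
my $html = <<'HTML';
<ul>
  <li><a href="/a">First</a></li>
  <li><a href="/b">Second</a></li>
</ul>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($html);
$tree->eof;

# XPath does the navigation for you:
my @titles = $tree->findvalues('//li/a');         # link texts
my @hrefs  = $tree->findvalues('//li/a/@href');   # link targets

print "$titles[$_] -> $hrefs[$_]\n" for 0 .. $#titles;

$tree->delete;    # HTML::TreeBuilder trees should be freed explicitly
```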
In my opinion, the best module for web scraping is Web::Scraper. Its language can be quite terse at times, but there are plenty of examples.
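To give a feel for that terseness, here is a small Web::Scraper sketch. The HTML and the `div.post` structure are invented for illustration; `scrape` also accepts a URI object and will fetch the page itself:

```perl
use strict;
use warnings;
use Web::Scraper;

# Invented markup, inlined so the example is self-contained.
my $html = <<'HTML';
<div class="post"><h2>Hello</h2><a href="/hello">more</a></div>
<div class="post"><h2>World</h2><a href="/world">more</a></div>
HTML

# Declarative: for every div.post, grab the h2 text and the link href.
my $posts = scraper {
    process 'div.post', 'posts[]' => scraper {
        process 'h2', title => 'TEXT';
        process 'a',  link  => '@href';
    };
};

my $res = $posts->scrape($html);
for my $post (@{ $res->{posts} }) {
    print "$post->{title}: $post->{link}\n";
}
```

The appeal is that you describe *what* you want with CSS selectors (or XPath) and Web::Scraper handles the fetching and tree walking; the cost is that the DSL can be cryptic until you've read a few examples.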