Does your role or business require the gathering of data from multiple websites and web portals? Or are you already doing it, but finding the current approach is holding you back from scaling up your data extraction operations? It’s likely you’re searching Google using terms like “web scraping,” “web crawling,” “web harvesting,” or just “web data collection” seeking advice.
Before selecting a web scraping software solution to support your data collection, let’s consider the potential risks of solutions that seem like the right fit but might fail to meet the demands of your business.
1. The risk of dependency on one person to develop and maintain scrapers.
If you rely on a single person to develop your web scrapers, you are at risk when that person goes on vacation, gets ill, or leaves the company. You might have hundreds of homegrown scrapers that need to be maintained and enhanced, or new ones that need to be developed. Do you have a backup person? Be sure to use a product that is not only easy to learn but also easy to maintain over time, and train a backup person who can step in at short notice.
2. The risk of not getting data because a web source changes format and breaks your crawler.
External websites can change without warning and break your web scrapers. When this happens, you need to redevelop the scraper; if you cannot get this done in less than two hours, your technology is inefficient. A better approach is a scalable, easy-to-use product whose technology is resilient to website changes, ensuring you get accurate and reliable data.
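One common way to make extraction resilient to layout changes is to keep a chain of fallback extraction rules and alert a maintainer only when every rule fails. The sketch below illustrates the idea with invented patterns and sample HTML; it is not any particular product's implementation.

```python
import re
from typing import Optional

# Hypothetical fallback chain: each pattern matches a layout the site
# has used at some point, so one redesign does not break the scraper.
PRICE_PATTERNS = [
    re.compile(r'<span class="price">\$([\d.]+)</span>'),  # old layout
    re.compile(r'data-price="([\d.]+)"'),                  # new layout
    re.compile(r'"price"\s*:\s*"?([\d.]+)"?'),             # embedded JSON
]

def extract_price(html: str) -> Optional[float]:
    """Try each known pattern in turn; return the first match."""
    for pattern in PRICE_PATTERNS:
        match = pattern.search(html)
        if match:
            return float(match.group(1))
    return None  # every rule failed: the layout changed, escalate

old_page = '<div><span class="price">$19.99</span></div>'
new_page = '<div data-price="24.50">Widget</div>'
print(extract_price(old_page))  # 19.99
print(extract_price(new_page))  # 24.5
```

When `extract_price` returns `None`, only one new rule needs to be added, keeping the fix well inside the two-hour window described above.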
3. The risk of not getting data because your server dies.
The server on which you run your web scrapers may lose power, lose internet connectivity, or suffer a hardware failure. Unless you are using a web data extraction platform with automated fail-over, your web data stream ceases. Almost all homegrown or open-source web scraping solutions have no fail-over built in and are not recommended for any serious web harvesting scenario. Commercial enterprise-grade solutions have built-in fail-over and let you install the product at no extra charge on multiple servers in a hybrid cloud (on-premises and cloud) environment. These solutions automatically switch the load to other servers when a server dies, and even do automated load balancing to ensure the compute capacity of the individual servers is used in the most optimal way.
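The fail-over behavior described above can be sketched in a few lines. This is a toy simulation with invented server names, where each "server" is just an object and a real platform would coordinate across machines, but it shows the core logic: try the next healthy server instead of losing the data stream.

```python
# Minimal fail-over sketch: dispatch a job to the first healthy server
# in a pool, falling over to the next one when a server is down.
class Server:
    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy

    def run_job(self, job):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{job} completed on {self.name}"

def dispatch(job, pool):
    """Send the job to the first healthy server, failing over as needed."""
    for server in pool:
        try:
            return server.run_job(job)
        except ConnectionError:
            continue  # this server died: fail over to the next one
    raise RuntimeError("all servers down; escalate to an operator")

pool = [Server("srv-1", healthy=False), Server("srv-2")]
print(dispatch("collect-prices", pool))  # completes on srv-2
```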
4. The risk of not knowing what and why something went wrong if you don’t get your data.
Do you have logging to keep track of what your scrapers are doing? Which web scraper collected what data, and when? What went wrong? If you can’t answer these questions, you will likely not know where to start troubleshooting when your web scraping solution fails to deliver the data you expect. I have seen too many homegrown web scrapers that provide no good answers, which leads to costly problems later on. An enterprise-class web data extraction solution has advanced logging and monitoring capabilities that track everything going on in your system. It will have a comprehensive front end where you can search for errors and monitor the timing of data collection, so you’ll know if you collected more or less data than before. You can also specify which events to escalate, for example by sending an email or SMS to a system administrator or business analyst to take immediate action.
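The kind of audit trail described above boils down to recording which scraper ran, when, and how much it collected, then flagging runs that deviate sharply from the last one. Here is a minimal sketch; the 50% drop threshold and the alert list (a stand-in for email/SMS escalation) are assumptions for illustration.

```python
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("scraper-monitor")

def record_run(scraper, count, previous_count, alerts):
    """Log who collected what and when; escalate on a sharp volume drop."""
    timestamp = datetime.now(timezone.utc).isoformat()
    log.info("%s collected %d records at %s", scraper, count, timestamp)
    # A drop of more than half versus the previous run is a common sign
    # that a source changed format or a job silently failed.
    if previous_count and count < previous_count * 0.5:
        message = f"{scraper}: {count} records vs {previous_count} last run"
        alerts.append(message)  # stand-in for an email/SMS escalation hook
        log.error("ESCALATE: %s", message)
    return timestamp

alerts = []
record_run("price-scraper", 980, 1000, alerts)  # normal: no alert
record_run("price-scraper", 120, 1000, alerts)  # sharp drop: escalates
print(alerts)
```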
5. The risk of not having evidence that the data you collected is accurate.
Let’s say you collect product pricing from an affiliate partner’s web portal or website, and your business model earns incremental profit every time you refer a customer who makes a purchase from the affiliate partner. Can you prove that a specific product had a specific price at a specific time? Do you collect evidence? If you don’t, you could be leaving money on the table. An enterprise-class web data extraction solution not only collects the data but also lets you automatically store screenshots of each web page exactly when you extracted the information. No more arguing about who is right or wrong: you now have evidence that the data collected was accurate.
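The platform described above captures screenshots; as a lighter-weight sketch of the same evidence idea, you can store the raw page alongside the extracted record, with a UTC timestamp and a SHA-256 fingerprint of the page content. The URL and record fields below are invented for illustration.

```python
import hashlib
import json
from datetime import datetime, timezone

def capture_evidence(url, raw_html, extracted):
    """Bundle extracted data with the raw source, a timestamp, and a hash."""
    return {
        "url": url,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "extracted": extracted,
        "raw_html": raw_html,  # in practice, a screenshot or archive path
        "sha256": hashlib.sha256(raw_html.encode()).hexdigest(),
    }

page = '<span class="price">$19.99</span>'
evidence = capture_evidence("https://example.com/widget", page,
                            {"product": "widget", "price": 19.99})
print(json.dumps(evidence, indent=2))
```

The hash lets you later demonstrate that the stored page is exactly what was collected at that timestamp, which is the substance of the "who is right or wrong" argument above.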
Are you reconsidering your web scraping approach? If so, click here to download “A Guide to Rethinking Web scraping”
Remember: risk-free web data extraction will save you time and money in the long run, and make your business users happy.
Stefan Andreasen, Corporate Evangelist, Kapow & Information Integration