Highly scalable web data collection/aggregation

A number of companies in the travel industry offer services that search for and compare airfares, travel specials, hotel rates, rental cars, and more. While many of these sites obtain this data through a subscription service, some generate it themselves, using the Kapow Mashup Server to write robots that collect the information from public websites.

Another data collection use case is a company that searches court records and other public records to automate background searches on individuals. Still another is a company that scours the web every night for job postings, which it aggregates and serves to its customers every day. Some of these robots run in real time and some run in batch mode, but they all share sophisticated capabilities for parsing web pages to find very specific web data.

Large-scale data collection is much more than scraping text from the web. It incorporates advanced numeric and text-string search, relational page navigation methods, table handling, and user-defined rules that help define relationships between target data elements. It is even possible to open PDF attachments on a website and search them as if they were web pages.
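To make "table handling" concrete, here is a minimal sketch in plain Python. It does not use the Kapow Mashup Server or represent how its robots work internally; it relies only on the standard library's html.parser module. The TableExtractor class, the extract_fares function, and the sample fare table are all hypothetical names and made-up data chosen for illustration.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row currently being built, or None
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_fares(html):
    """Turn the first row into column names and the rest into dicts."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Made-up sample page fragment, for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Route</th><th>Fare</th></tr>
  <tr><td>SFO-JFK</td><td>$329</td></tr>
  <tr><td>LAX-ORD</td><td>$198</td></tr>
</table>
"""

fares = extract_fares(SAMPLE_HTML)
```

Each row comes back as a dictionary keyed by the header cells, which is one simple way to express the relationship between target data elements that sit in the same table row.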

Many customers who aggregate web content combine several unstructured data elements into a new structured data element, which they then write to a database for use by another application. The Kapow Mashup Server is extremely flexible in letting the developer define input or output data objects, which greatly expands what can be done with the data once it’s collected from the web.
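The idea of combining unstructured elements into one structured element and writing it to a database can be sketched generically in Python. This is not the Kapow Mashup Server's data object model; the JobPosting type, the to_posting and save functions, the postings table, and the sample strings are all assumptions made for illustration, and the database is an in-memory SQLite instance from the standard library.

```python
import re
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class JobPosting:
    """Hypothetical structured output object built from raw page text."""
    title: str
    city: str
    salary_usd: int


def to_posting(raw_title, raw_location, raw_salary):
    """Combine three loosely formatted strings into one structured record."""
    city = raw_location.split(",")[0].strip()
    digits = re.sub(r"[^\d]", "", raw_salary)  # keep only the digits
    return JobPosting(raw_title.strip(), city, int(digits))


def save(conn, posting):
    """Write the record to a table that another application could read."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS postings "
        "(title TEXT, city TEXT, salary_usd INTEGER)"
    )
    conn.execute("INSERT INTO postings VALUES (?, ?, ?)", astuple(posting))
    conn.commit()


# Made-up sample values standing in for text collected from a web page.
conn = sqlite3.connect(":memory:")
posting = to_posting("  Data Engineer ", "Austin, TX", "$95,000 per year")
save(conn, posting)
stored = conn.execute(
    "SELECT title, city, salary_usd FROM postings"
).fetchall()
```

Defining the output object up front, as the dataclass does here, keeps the cleanup rules in one place and gives the downstream application a fixed schema to depend on.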
