Today the World Wide Web holds millions of dynamic and static web pages built with technologies such as HTML, ASP, and PHP. The web is a wonderful source of information and a rich playground for data mining. Because the data stored on the web comes in different formats and is dynamic in nature, searching, processing, and representing this unstructured information is a big challenge. This is where the need for Web Data Research comes into the picture.
A web page is far more complex than a conventional text document. Web pages lack standardization and uniformity, whereas text documents and traditional books are much more consistent. Furthermore, search engines have limited capacity and may not index every web page, which makes the data mining process highly inefficient.
Moreover, the Internet is a highly dynamic knowledge resource that is growing at a rapid pace. News, sports, corporate, and finance sites update their content on a daily or even hourly basis. Today the web reaches billions of users with varied profiles, usage patterns, and interests. Each of them needs relevant information but may not know how to gather it effectively and with little effort.
It is important to note that only a small portion of the web provides really useful information. Users typically access information on the internet in one of the following ways:
- Random surfing: following a large number of hyperlinks from one web page to the next.
- Search engine queries: using Google or Bing to find related documents by entering keywords of interest in the search box.
- Deep query searches: querying a website's searchable database, such as the eBay product search or the Business.com service directory.
To make use of the web as an important resource for knowledge discovery, researchers have created suitable data mining techniques to extract relevant data smoothly, easily, and cost-effectively.
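As a small illustration of the first access method listed above, following hyperlinks, the sketch below pulls the links out of a page so that a program could visit them next. It is only a minimal example that uses Python's standard library. The names `LinkExtractor`, `extract_links`, and `SAMPLE_PAGE`, as well as the example.com URLs, are made up for illustration and do not come from any particular data mining tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Only anchor tags carry the hyperlinks a surfer would follow.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every hyperlink found in the given HTML text, in page order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page; a real crawler would download the HTML instead.
SAMPLE_PAGE = """
<html><body>
<a href="/news/today.html">News</a>
<a href="https://example.org/sports">Sports</a>
<a name="top">no link here</a>
</body></html>
"""

if __name__ == "__main__":
    for link in extract_links(SAMPLE_PAGE, "https://example.com/index.html"):
        print(link)
```

A real web mining system would add page downloading, politeness rules, and duplicate detection on top of a step like this. The sketch only shows how the unstructured markup of a single page can be turned into a structured list of links.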