Web-Scrapper

1. Introduction

1.1 Abstract

The Internet is the largest source of information and data ever built. However, it is a huge collection of heterogeneous and poorly structured data that is difficult to collect manually and complicated to use in automated processes. In recent years, techniques and tools have emerged that allow B2C and B2B systems to collect this data and convert it into structured form. This article introduces web scraping and some of the most popular and novel techniques for extracting data and reusing it in complex processes. Such data can be put to use in many areas, including Open Government Data, Big Data, Business Intelligence, aggregators and comparators, and the development of new applications and mashups.

1.2 Purpose

• Web scraping is the act of going through web pages and extracting selected text or images.
• It is an excellent tool for getting new data or enriching your current data.
• It is usually the first step of a data science project that requires a lot of data.
• It is an alternative to API calls for data retrieval when no API exists or the available API is limited in some way.

1.3 Necessity of Web Scraping

Approximately 70% of the information generated on the Internet is available in PDF documents, an unstructured and hard-to-handle format. This enormous amount of information, published but captive in this kind of format, is usually called “the tyranny of PDF”. PDF scraping is not the subject of this article, although some tools exist to extract information from PDFs, mainly from data tables. Some of the tools presented in later sections of this document can read PDF documents and return information in a structured format, although only in a basic and rudimentary way.

A web page, by contrast, has a structured format (HTML), although not one that is directly reusable. For HTML documents, the main scope of this document, that structure multiplies the possibilities opened up by scraping techniques. Web scraping techniques and tools rely on the structure and properties of the HTML language. This makes scraping easier for tools and robots and frees humans from the boring, repetitive, and error-prone duty of manual data retrieval. Finally, these tools deliver data in formats that are convenient for later processing and integration: JSON, XML, CSV, XLS, or RSS.
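To make the reliance on HTML structure concrete, here is a minimal sketch that pulls every link out of an HTML string and writes the result as CSV. It is an illustration only, not this project's code: the project uses JSoup, while this sketch deliberately uses the HTML parser that ships with the JDK (`javax.swing.text.html.parser.ParserDelegator`) so that it runs without extra dependencies. The class name `LinkScraper`, its methods, and the `example.com` URLs are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class; not part of the Web-Scrapper project.
class LinkScraper {

    /** Returns one {text, href} pair for every <a href="..."> element in the HTML. */
    public static List<String[]> extractLinks(String html) {
        List<String[]> links = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private String href;         // href of the <a> currently open, or null
            private StringBuilder text;  // link text collected so far

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object value = attrs.getAttribute(HTML.Attribute.HREF);
                    href = (value == null) ? null : value.toString();
                    text = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (href != null) {
                    text.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A && href != null) {
                    links.add(new String[] {text.toString().trim(), href});
                    href = null;
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new UncheckedIOException(e);
        }
        return links;
    }

    /** Serializes the pairs as CSV with a header row; every field is quoted. */
    public static String toCsv(List<String[]> rows) {
        StringBuilder out = new StringBuilder("text,href\n");
        for (String[] row : rows) {
            out.append(quote(row[0])).append(',').append(quote(row[1])).append('\n');
        }
        return out.toString();
    }

    private static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<a href=\"https://example.com/a\">First page</a>"
                + "<a href=\"https://example.com/b\">Second page</a>"
                + "</body></html>";
        System.out.print(toCsv(extractLinks(html)));
    }
}
```

The scraper never searches the raw text: it reacts to the `<a>` tags and `href` attributes that the parser reports, which is what makes HTML easier to scrape than a PDF. With JSoup the same idea would normally be expressed with a CSS selector over the parsed document instead of a callback.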

1.4 Technology Used

• Application Architecture - J2EE
• Database Application - DB2
• Web Deployment Server - Tomcat
• Designing Tool - Rational
• Library - JSoup

IMPORTANT

  1. You need to install the JSoup library on your PC.
  2. No database is needed.
  3. The links to the websites are available in their respective Controllers.