Scraper

Base class for all scrapers

Examples:

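>>> import requests
>>> from scraper.base import AbstractScraper
>>> class Scraper(AbstractScraper):
...     def collect_data(self, **kwargs) -> dict:
...         return {}  # placeholder; must be defined because the method is abstract
...     def scrape(self, url: str) -> str:
...         return requests.get(url, timeout=10).text
...
>>> scraper = Scraper()
>>> html = scraper.scrape("https://www.example.com/")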

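This example defines Scraper, a concrete subclass of AbstractScraper. A subclass must implement both abstract methods, collect_data and scrape, before it can be instantiated.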
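The effect of the abstract methods can be checked with a small offline sketch. The AbstractScraper below is a local stand-in that mirrors the interface in scraper/base.py so the snippet runs without the package installed; PartialScraper and CompleteScraper are hypothetical names used only for illustration.

```python
from abc import ABCMeta, abstractmethod


# Local stand-in mirroring scraper/base.py (assumption: same two abstract methods).
class AbstractScraper(metaclass=ABCMeta):
    @abstractmethod
    def collect_data(self, **kwargs) -> dict: ...

    @abstractmethod
    def scrape(self, url: str) -> str: ...


class PartialScraper(AbstractScraper):  # hypothetical: collect_data is missing
    def scrape(self, url: str) -> str:
        return ""


class CompleteScraper(AbstractScraper):  # hypothetical: both methods defined
    def collect_data(self, **kwargs) -> dict:
        return dict(kwargs)

    def scrape(self, url: str) -> str:
        return f"<html>{url}</html>"


# Instantiating a class that still has abstract methods raises TypeError.
try:
    PartialScraper()
    partial_ok = True
except TypeError:
    partial_ok = False

scraper = CompleteScraper()  # succeeds: nothing abstract remains
```

Here partial_ok ends up False, which is why every concrete scraper has to define collect_data even when it only needs scrape.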

AbstractScraper

Interface of scraper class

Source code in scraper/base.py
class AbstractScraper(metaclass=ABCMeta):
    """
    Interface of scraper class
    """

    @abstractmethod
    def collect_data(self, **kwargs) -> dict:
        """
        Method to collect data from page

        Args:
            **kwargs: common kwargs

        Returns:
            (dict): return processed data
        """

    @abstractmethod
    def scrape(self, url: str) -> str:
        """
        Main entry point for scraping data from a URL.

        Args:
            url (str): URL of the website.

        Returns:
            str: Plain text.
        """

collect_data(**kwargs) abstractmethod

Method to collect data from page

Parameters:

Name        Type    Description      Default
**kwargs            common kwargs    {}

Returns:

Type    Description
dict    The processed data

Source code in scraper/base.py
@abstractmethod
def collect_data(self, **kwargs) -> dict:
    """
    Method to collect data from page

    Args:
        **kwargs: common kwargs

    Returns:
        (dict): return processed data
    """
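The interface leaves the contents of **kwargs and of the returned dict open, so the following is only a sketch of one possible collect_data. It assumes a hypothetical `html` keyword argument and uses the standard-library HTMLParser to pull out the page title and link targets. AbstractScraper is again a local stand-in for scraper/base.py, and TitleScraper and _TitleAndLinks are illustrative names, not part of the package.

```python
from abc import ABCMeta, abstractmethod
from html.parser import HTMLParser


# Local stand-in mirroring scraper/base.py so the sketch runs on its own.
class AbstractScraper(metaclass=ABCMeta):
    @abstractmethod
    def collect_data(self, **kwargs) -> dict: ...

    @abstractmethod
    def scrape(self, url: str) -> str: ...


class _TitleAndLinks(HTMLParser):
    """Collects the <title> text and every <a href=...> value."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


class TitleScraper(AbstractScraper):  # hypothetical subclass
    def scrape(self, url: str) -> str:
        # Canned markup keeps the sketch offline; a real scraper would fetch `url`.
        return (
            "<html><head><title>Example</title></head>"
            '<body><a href="/about">About</a></body></html>'
        )

    def collect_data(self, **kwargs) -> dict:
        # `html` is an assumed keyword; the interface only promises **kwargs.
        parser = _TitleAndLinks()
        parser.feed(kwargs.get("html", ""))
        return {"title": parser.title.strip(), "links": parser.links}


scraper = TitleScraper()
data = scraper.collect_data(html=scraper.scrape("https://www.example.com/"))
# data == {"title": "Example", "links": ["/about"]}
```

Passing the markup in explicitly keeps collect_data free of network access, which makes it straightforward to test; whether the real scrapers follow this split is not stated in the source.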

scrape(url) abstractmethod

Main entry point for scraping data from a URL

Parameters:

Name    Type    Description           Default
url     str     URL of the website    required

Returns:

Type    Description
str     Plain text

Source code in scraper/base.py
@abstractmethod
def scrape(self, url: str) -> str:
    """
    Main entry point for scraping data from a URL.

    Args:
        url (str): URL of the website.

    Returns:
        str: Plain text.
    """
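A possible scrape implementation is sketched below. It swaps in the standard library's urllib.request in place of requests so that it has no third-party dependency, adds a timeout, and decodes the response using the charset declared by the server (falling back to UTF-8). AbstractScraper is a local stand-in for scraper/base.py, and UrllibScraper and its timeout parameter are assumptions made for illustration.

```python
from abc import ABCMeta, abstractmethod
from urllib.error import URLError
from urllib.request import urlopen


# Local stand-in mirroring scraper/base.py so the sketch runs on its own.
class AbstractScraper(metaclass=ABCMeta):
    @abstractmethod
    def collect_data(self, **kwargs) -> dict: ...

    @abstractmethod
    def scrape(self, url: str) -> str: ...


class UrllibScraper(AbstractScraper):  # hypothetical subclass
    def __init__(self, timeout: float = 10.0) -> None:
        self.timeout = timeout  # seconds to wait before giving up

    def scrape(self, url: str) -> str:
        try:
            with urlopen(url, timeout=self.timeout) as response:
                # Use the declared charset when present, otherwise assume UTF-8.
                charset = response.headers.get_content_charset() or "utf-8"
                return response.read().decode(charset, errors="replace")
        except URLError as exc:
            # HTTPError is a subclass of URLError, so HTTP failures land here too.
            raise RuntimeError(f"could not fetch {url!r}: {exc}") from exc

    def collect_data(self, **kwargs) -> dict:
        return {}  # placeholder; required because the method is abstract


# A data: URL keeps this demo offline; pass an http(s) URL in real use.
text = UrllibScraper().scrape("data:text/plain,hello")
```

Wrapping the failure in a single exception type is a design choice of this sketch, so callers only need to handle one error; the source does not say how the real scrapers report failures.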