Add Parser concept #72

cmdoret · 2023-09-21T15:25:09Z

Currently, we have an Extractor class whose job is to extract all metadata about a repository.
With #70 and #68, Extractor can also list the files present in a repository (list_files()) and access their contents.

Ideally, the responsibility of an extractor should stop there. It should not be responsible for extracting metadata from file contents.

The proposal here is to have a separate object responsible for that: Parser. A Parser would take a file as input and extract specific RDF triples from it. The repository's RDF graph could then be enriched by taking its union with the graphs produced by the parsers.

graph TD;
    repo[Repository URL]-->ext{Extractor};
    ext --> meta[Metadata];
    ext --> files[Files];
    meta --> repograph{Repository};
    repograph --> repo_rdf[Repo RDF];
    files --> parser{Parser};
    parser --> spec_rdf[Specific RDF];
    repo_rdf --> union((Union));
    spec_rdf --> union;
    union --> enhanced[Enhanced RDF];

Parsers could be added for pyproject.toml, setup.py, licenses, Cargo.toml, R's DESCRIPTION, package.json, etc.
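
The flow in the diagram could be sketched roughly as follows. This is only an illustration under stated assumptions: plain Python sets of (subject, predicate, object) tuples stand in for RDF graphs so the snippet runs without dependencies, and `enrich`, `license_parser` and the `schema:` predicate strings are hypothetical names, not part of the existing code.

```python
from typing import Callable, Iterable, Set, Tuple

# Assumption: a set of (subject, predicate, object) tuples stands in for an RDF graph.
Triple = Tuple[str, str, str]
Graph = Set[Triple]


def enrich(
    repo_graph: Graph,
    files: Iterable[str],
    parsers: Iterable[Callable[[str], Graph]],
) -> Graph:
    """Return the union of the repository graph and every parser graph."""
    parsers = list(parsers)  # allow reuse across files even if given a generator
    enhanced = set(repo_graph)  # copy, so the input graph is left untouched
    for path in files:
        for parse in parsers:
            enhanced |= parse(path)
    return enhanced


def license_parser(path: str) -> Graph:
    """Hypothetical parser: emits one triple for a LICENSE file, nothing otherwise."""
    filename = path.rsplit("/", 1)[-1]
    if filename.upper().startswith("LICENSE"):
        return {("repo", "schema:license", path)}
    return set()


repo_rdf = {("repo", "schema:name", "example")}
enhanced = enrich(repo_rdf, ["README.md", "LICENSE"], [license_parser])
```

With rdflib, the same union can be expressed with the `+` or `+=` operators on `rdflib.Graph`, so the enrichment step itself would stay a one-liner per parser.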


cmdoret · 2023-10-20T07:14:15Z

Here is a rough example of what the parser interface may look like. Each parser would only need to implement the parsing algorithm and the logic that decides which resources it can handle.

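from __future__ import annotations  # defer evaluation of the Resource annotation

from abc import ABC, abstractmethod
from typing import Optional

import rdflib

# Resource is assumed to be the file type handed over by the Extractor (defined elsewhere).


class Parser(ABC):
    def __init__(self, max_size_kb: Optional[int] = 2048):
        # Largest file size (in KB) a parser should accept; None means no limit.
        self.max_size_kb = max_size_kb

    @abstractmethod
    def _parse(self, input: Resource) -> rdflib.Graph:
        """Extract triples"""
        ...

    @abstractmethod
    def can_parse(self, input: Resource) -> bool:
        """Match based on filename (content and size?)"""
        ...

    def parse(self, input: Resource) -> Optional[rdflib.Graph]:
        # Only delegate to the concrete parser when the resource is compatible.
        if self.can_parse(input):
            return self._parse(input)
        return None

    # Potentially more helper methods that will be available to all parsers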
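
To make the interface more concrete, here is a minimal sketch of what one parser might look like, using package.json as the target. Several things are assumptions made for illustration only: the `Resource` dataclass with `path` and `content` fields, the `PackageJsonParser` name, the `schema:` predicate strings, and the use of a set of tuples in place of `rdflib.Graph` so the snippet has no dependencies.

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Set, Tuple

Triple = Tuple[str, str, str]
Graph = Set[Triple]  # assumption: stand-in for rdflib.Graph


@dataclass
class Resource:
    """Hypothetical file handle: a path plus its text content."""
    path: str
    content: str


class Parser(ABC):
    def __init__(self, max_size_kb: Optional[int] = 2048):
        self.max_size_kb = max_size_kb

    @abstractmethod
    def _parse(self, input: Resource) -> Graph: ...

    @abstractmethod
    def can_parse(self, input: Resource) -> bool: ...

    def parse(self, input: Resource) -> Optional[Graph]:
        if self.can_parse(input):
            return self._parse(input)
        return None


class PackageJsonParser(Parser):
    def can_parse(self, input: Resource) -> bool:
        # Match on filename, then enforce the size limit if one is set.
        name_ok = input.path.rsplit("/", 1)[-1] == "package.json"
        size_kb = len(input.content.encode("utf-8")) / 1024
        size_ok = self.max_size_kb is None or size_kb <= self.max_size_kb
        return name_ok and size_ok

    def _parse(self, input: Resource) -> Graph:
        data = json.loads(input.content)
        if not isinstance(data, dict):
            return set()
        triples: Graph = set()
        # Map a couple of package.json keys to illustrative predicates.
        for key, predicate in (("name", "schema:name"), ("license", "schema:license")):
            value = data.get(key)
            if isinstance(value, str):
                triples.add((input.path, predicate, value))
        return triples


parser = PackageJsonParser()
result = parser.parse(Resource("package.json", '{"name": "demo", "license": "MIT"}'))
skipped = parser.parse(Resource("README.md", "# demo"))
```

Keeping the compatibility check in `can_parse` means callers can hand every file to every parser and simply ignore the `None` results.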

cmdoret added the "refactor" label (improving code without user-facing changes) Oct 18, 2023

cmdoret mentioned this issue Oct 20, 2023

make list_files recursive #76

Closed

cmdoret linked a pull request Oct 24, 2023 that will close this issue

feat: add parsers support #97

Merged

cmdoret closed this as completed in #97 Nov 7, 2023
