Add Parser concept #72

Closed
cmdoret opened this issue Sep 21, 2023 · 1 comment · Fixed by #97
Labels
refactor improving code without user-facing changes

cmdoret commented Sep 21, 2023

Currently, we have an Extractor class, whose job is to extract all metadata about a repository.
With #70 and #68, Extractor can list the files present in a repository via list_files() and access their contents.

Ideally, the responsibility of an extractor should stop there. It should not be responsible for extracting metadata from file contents.

The proposal here is to have a separate object responsible for it: Parser. A Parser would take a file as input and extract specific RDF triples from it. The Repo's RDF graph could then be enriched using the Parser graphs.

graph TD;
    repo[Repository URL]-->ext{Extractor};
    ext --> meta[Metadata];
    ext --> files[Files];
    meta --> repograph{Repository};
    repograph --> repo_rdf[Repo RDF];
    files --> parser{Parser};
    parser --> spec_rdf[Specific RDF];
    repo_rdf --> union((Union));
    spec_rdf --> union;
    union --> enhanced[Enhanced RDF];

Parsers could be added for pyproject.toml, setup.py, licenses, Cargo.toml, R's DESCRIPTION, package.json, etc.
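The flow in the diagram above can be sketched roughly as follows. This is only an illustration under stated assumptions: triples are modeled as a plain set of tuples standing in for rdflib.Graph, and `license_parser`, `enrich`, and the `schema:` predicate strings are hypothetical names that do not come from the project.

```python
from typing import Callable, Dict, Iterable, Optional, Set, Tuple

# Stand-in for rdflib.Graph (assumption): a set of (subject, predicate, object) tuples.
Triple = Tuple[str, str, str]
Graph = Set[Triple]

# Hypothetical parser signature: (filename, content) -> triples, or None if not applicable.
ParserFn = Callable[[str, str], Optional[Graph]]


def license_parser(name: str, content: str) -> Optional[Graph]:
    """Toy parser: only handles files named LICENSE and keeps their first line."""
    if name != "LICENSE":
        return None
    first_line = content.splitlines()[0] if content else ""
    return {("repo", "schema:license", first_line)}


def enrich(repo_graph: Graph, files: Dict[str, str], parsers: Iterable[ParserFn]) -> Graph:
    """Union the repository graph with every graph produced by a parser."""
    enhanced = set(repo_graph)  # copy, so the input graph is left untouched
    parsers = list(parsers)  # allow the iterable to be reused for every file
    for name, content in files.items():
        for parser in parsers:
            parsed = parser(name, content)
            if parsed is not None:
                enhanced |= parsed
    return enhanced


repo_graph = {("repo", "schema:name", "demo")}
files = {"LICENSE": "MIT License\n...", "README.md": "# demo"}
enhanced = enrich(repo_graph, files, [license_parser])
```

Files that no parser recognizes (README.md here) simply contribute nothing, so the repository graph is never reduced, only extended.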

@cmdoret cmdoret added the refactor improving code without user-facing changes label Oct 18, 2023

cmdoret commented Oct 20, 2023

Here is a rough example of what the parser interface might look like. Each parser would only need to implement the parsing algorithm and the logic that decides which resources it is compatible with.

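from __future__ import annotations  # Resource is the project's file type, not defined in this sketch

from abc import ABC, abstractmethod
from typing import Optional

import rdflib


class Parser(ABC):
    def __init__(self, max_size_kb: Optional[int] = 2048):
        # Upper bound on the size (in KB) of files this parser accepts
        self.max_size_kb = max_size_kb

    @abstractmethod
    def _parse(self, input: Resource) -> rdflib.Graph:
        """Extract triples"""
        ...

    @abstractmethod
    def can_parse(self, input: Resource) -> bool:
        """Match based on filename (content and size?)"""
        ...

    def parse(self, input: Resource) -> Optional[rdflib.Graph]:
        # Only parse resources this parser declares itself compatible with
        if self.can_parse(input):
            return self._parse(input)
        return None

    # Potentially more helper methods that will be available to all parsers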
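As an illustration of how a concrete parser might plug into this interface, here is a minimal sketch for package.json. Several things are assumptions rather than part of the proposal: the `Resource` dataclass (the issue does not define it), the use of a set of tuples in place of rdflib.Graph to keep the example dependency-free, the `PackageJsonParser` name, the predicate strings, and the reading of `max_size_kb=None` as "no size limit".

```python
import json
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional, Set, Tuple

Graph = Set[Tuple[str, str, str]]  # stand-in for rdflib.Graph (assumption)


@dataclass
class Resource:
    """Hypothetical file handle; the issue does not define Resource."""
    name: str
    content: str


class Parser(ABC):
    """Compact restatement of the proposed interface."""

    def __init__(self, max_size_kb: Optional[int] = 2048):
        self.max_size_kb = max_size_kb

    @abstractmethod
    def _parse(self, input: Resource) -> Graph: ...

    @abstractmethod
    def can_parse(self, input: Resource) -> bool: ...

    def parse(self, input: Resource) -> Optional[Graph]:
        return self._parse(input) if self.can_parse(input) else None


class PackageJsonParser(Parser):
    def can_parse(self, input: Resource) -> bool:
        # Match on filename first, then on size (None is treated as "no limit").
        if input.name != "package.json":
            return False
        size_kb = len(input.content.encode("utf-8")) / 1024
        return self.max_size_kb is None or size_kb <= self.max_size_kb

    def _parse(self, input: Resource) -> Graph:
        data = json.loads(input.content)
        subject = data.get("name", "unknown")
        graph: Graph = set()
        # Illustrative mapping from package.json keys to predicate strings.
        for key, predicate in (("license", "schema:license"), ("version", "schema:version")):
            if key in data:
                graph.add((subject, predicate, str(data[key])))
        return graph


parser = PackageJsonParser()
pkg = Resource("package.json", '{"name": "demo", "license": "MIT", "version": "1.0.0"}')
triples = parser.parse(pkg)
skipped = parser.parse(Resource("README.md", "# demo"))
too_big = PackageJsonParser(max_size_kb=0).parse(pkg)
```

Because `parse()` guards on `can_parse()`, callers can hand every file to every parser and simply ignore the `None` results.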

@cmdoret cmdoret linked a pull request Oct 24, 2023 that will close this issue