Expanded row metadata for graph format #130

Open
MatthewRalston opened this issue Mar 29, 2024 · 3 comments
Labels
dependencies Pull requests that update a dependency file documentation Improvements or additions to documentation duplicate This issue or pull request already exists enhancement New feature or request good first issue Good for newcomers invalid This doesn't seem right
Milestone

Comments

@MatthewRalston
Owner

MatthewRalston commented Mar 29, 2024

Key Question

What is needed for working data structure initialization? Why isn't it working?

The node and edge lists, the prioritization or sort strategy for edge representation, weights, multigraph and combination representation, orientation of edges, dual strandedness, and the .kdbg row metadata fields (non-int, but Boolean, i.e. fast lookup) are not yet finalized.

[[ walk file ]]

Walk files are just like path files, and primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, so every walk, together with its originating context (a.k.a. a .kdbg file), must be either minimal or expanded.

A minimal walk has all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (set if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.1 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes).

An expanded walk adds retrospective, prospective, previous forking nodes, previous walks investigated, and their node IDs.
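The minimal versus expanded distinction could be sketched as two row types. This is only an illustration, not the kmerdb implementation: the class names, field names, and the placeholder version string are all hypothetical, and the format itself is not finalized.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MinimalWalkRow:
    """One edge of a walk with only the 'minimal' fields (hypothetical names)."""
    i: int                   # positioning id of the edge within the walk
    edge: Tuple[int, int]    # (node1_id, node2_id)
    retrospective: bool      # edge comes from the originating .kdbg context
    solutional: bool         # walk is said to be solutional by an assembly process
    kmerdb_version: str      # version number of the kmerdb release
    git_sha256: str          # sha256 of the git release, repeated on each edge


@dataclass(frozen=True)
class ExpandedWalkRow(MinimalWalkRow):
    """Minimal row plus the 'expanded' bookkeeping (hypothetical names)."""
    prospective: bool = False
    previous_fork_node_ids: Tuple[int, ...] = ()
    previous_walk_ids: Tuple[int, ...] = ()
```

An expanded row is still a valid minimal row under this sketch, so a reader that only understands the minimal fields could ignore the rest.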

schema concepts

for format versions of course...

The schema should be self-referential and contain nodes, edges, and walks and/or paths. Metadata includes relevant references to schema versioning, and specific file references for interpretation.

[ minimal walks ]

A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), marked with a retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (as opposed to a minimal path, which is a single linear representation of edges: a sort order with no presumed provided source reference).
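A rough sketch of how the rows of a minimal walk file might be assembled under this description. The function name and dictionary keys are hypothetical, and treating the prospective copies as non-retrospective is an assumption, since the issue does not say whether the two flags are exclusive.

```python
from typing import Dict, List, Sequence, Tuple

Edge = Tuple[int, int]  # (node1_id, node2_id)


def minimal_walk_rows(context_edges: Sequence[Edge], walk: Sequence[Edge]) -> List[Dict]:
    """Emit every context edge once (retrospective), then the walk's edges in order (prospective)."""
    known = set(context_edges)
    rows: List[Dict] = []
    # All edges observed in the originating .kdbg context, flagged retrospective.
    for edge in context_edges:
        rows.append({"edge": edge, "pos_walk": None,
                     "retrospective": True, "prospective": False})
    # One copy per traversal of an edge in this specific walk, flagged prospective.
    for i, edge in enumerate(walk):
        if edge not in known:
            raise ValueError(f"walk edge {edge} is not present in the context")
        rows.append({"edge": edge, "pos_walk": i,
                     "retrospective": False, "prospective": True})
    return rows
```

Because a walk may revisit an edge, the same edge can appear several times among the prospective rows, each with its own position.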

solutional path

A walk, along with all previous walks (in chronological, a.k.a. integer id, order), included by reference, along with the sha256sum of the git release that produced the walk, the metadata, etc.

[[ solutional path file ]]

The header metadata will hold the source and the parameters, plus a walk id (a sha256 of the walk) for an associated walk file, and a walk name (given at "runtime" via the CLI). The walk id may be 0 to represent an unspecific or unqualified walk (origin unclear).
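One way such a header could be put together, as a sketch only. The serialization used for hashing (one tab-separated edge per line) and every name here are assumptions; the issue does not define how the sha256 of a walk is computed.

```python
import hashlib
from typing import Dict, Optional, Sequence, Tuple, Union

Edge = Tuple[int, int]


def walk_id(edges: Sequence[Edge]) -> str:
    """sha256 over an assumed serialization: one 'node1<TAB>node2' line per edge, in walk order."""
    h = hashlib.sha256()
    for node1, node2 in edges:
        h.update(f"{node1}\t{node2}\n".encode("ascii"))
    return h.hexdigest()


def solutional_path_header(source: str, parameters: Dict[str, object],
                           edges: Optional[Sequence[Edge]] = None,
                           walk_name: str = "") -> Dict[str, object]:
    """Build header metadata; walk_id is 0 when the originating walk is unspecified."""
    wid: Union[str, int] = walk_id(edges) if edges else 0
    return {"source": source, "parameters": dict(parameters),
            "walk_id": wid, "walk_name": walk_name}
```

Hashing the edges in order means two walks over the same edge set but in a different order get different ids, which seems to match the idea of a walk as an ordering of edges.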

Related issues

#126 #122 #125 #102 #124

sidenote

The neighbor structure 🌪️ is manifested by particular kmer IDs 🌬️, which may be accessed from kmer arrays loaded alongside the edge list during a path-producing process.
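To illustrate how neighbors can fall out of k-mer IDs alone, here is a sketch that assumes a 2-bit lexicographic encoding (A=0, C=1, G=2, T=3, first base most significant). That encoding and the function names are assumptions made for the example, not something stated in this issue.

```python
from typing import List

ALPHABET = "ACGT"  # assumed 2-bit encoding order


def kmer_to_id(kmer: str) -> int:
    """Encode a k-mer as a base-4 integer, first base most significant."""
    idx = 0
    for base in kmer:
        idx = idx * 4 + ALPHABET.index(base)
    return idx


def successors(kmer_id: int, k: int) -> List[int]:
    """IDs of the 4 k-mers whose (k-1)-prefix equals this k-mer's (k-1)-suffix."""
    suffix = kmer_id % (4 ** (k - 1))
    return [suffix * 4 + b for b in range(4)]


def predecessors(kmer_id: int, k: int) -> List[int]:
    """IDs of the 4 k-mers whose (k-1)-suffix equals this k-mer's (k-1)-prefix."""
    prefix = kmer_id // 4
    return [b * 4 ** (k - 1) + prefix for b in range(4)]
```

Under these assumptions the candidate neighbors never need to be stored: they are computed from the ID, and the edge list only has to record which of them were actually observed.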

A working pipeline would carry all components of the workflow on to the next step, but all commands are currently partial. Schemas are in the planning stage for a future release.

@MatthewRalston MatthewRalston added documentation Improvements or additions to documentation duplicate This issue or pull request already exists enhancement New feature or request good first issue Good for newcomers invalid This doesn't seem right dependencies Pull requests that update a dependency file labels Mar 29, 2024
@MatthewRalston MatthewRalston added this to the V0.7 stable? milestone Mar 29, 2024
@MatthewRalston
Owner Author

MatthewRalston commented Mar 29, 2024

Key Question

What is needed for working data structure initialization? Why isn't it working?

Node files

No comment

Edge files

Not applicable

types of walks

  • Walk files
  • Path files
  • Tree files
    Contains:

walks from/to "central/incidental" nodes

  • Forward walk
  • Reverse walk

[[ node schema (in progress) ]]

  • node_id
  • pos_walk (id in walk file) or pos_path (id in path file)
  • next_edge id (aka edge 2-tuple), next_path id

[[ Edge schema ]]

  • node1_id
  • node2_id
  • pos_path
  • pos_walk
  • prospective bool (i.e. most edges in a walk/path/climb should be retrospective in the destination context)
  • preceding walk id
  • next walk id
  • Forward schema
  • Reverse schema
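The edge fields listed above could be carried as a fixed-width, tab-delimited row. The following is a sketch under that assumption: the class, the column order, and the 0/1 encoding of the bool are hypothetical, and the forward/reverse schema entries are left out because the issue does not define them yet.

```python
from dataclasses import astuple, dataclass, fields


@dataclass(frozen=True)
class EdgeRow:
    """One edge record following the in-progress edge schema (hypothetical layout)."""
    node1_id: int
    node2_id: int
    pos_path: int
    pos_walk: int
    prospective: bool
    preceding_walk_id: int
    next_walk_id: int

    def to_line(self) -> str:
        # Booleans are written as 0/1 so every column parses as an integer.
        return "\t".join(str(int(value)) for value in astuple(self))

    @classmethod
    def from_line(cls, line: str) -> "EdgeRow":
        values = [int(v) for v in line.rstrip("\n").split("\t")]
        if len(values) != len(fields(cls)):
            raise ValueError(f"expected {len(fields(cls))} columns, got {len(values)}")
        n1, n2, pos_path, pos_walk, prospective, prev_walk, next_walk = values
        return cls(n1, n2, pos_path, pos_walk, bool(prospective), prev_walk, next_walk)
```

A round trip through `to_line` and `from_line` returns an equal row, which is a cheap check to run whenever the column list changes.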

[[ Walk schema (in progress) ]]

  • path schema
  • Walk schema
  • Solution schema

[[ The walk file ]]

Walk files are just like path files, and primarily contain an ordering of edges. All walks are paths, but a walk may have a forward and a reverse direction, so every walk, together with its originating context (a.k.a. a .kdbg file), must be either minimal or expanded. A minimal walk has all edges and a positioning id (i) only, plus a "retrospective" bool, a "solutional" bool (set if the walk is said to be solutional from an assembly process associated with .kdbg version 1.0.0 or greater), a version number associated with the kmerdb release, and the sha256 of the git release (on each edge, yes). An expanded walk adds retrospective, prospective, previous forks investigated, and their node IDs.

minimal walks

A minimal walk file must also include all edges of the original context (a.k.a. all edges observed from the dataset(s) in the .kdbg header), marked with a retrospective bool, along with one or more copies of the same edge with prospective bool = True when representing a specific walk (as opposed to a minimal path, which is a single linear representation of edges: a sort order with no presumed origin id).

Related issues

Issues #126 #122 #125 #102 #124

@MatthewRalston thinks the path forward towards a graph format is in creating additional structural definitions. If I think through the relationships preserved among different incomplete and completely self-referential formats, they require associated metadata schemas, and the utility function of taking a table or metadata schematic input and generating a consistently hashable representation (the metadata header format, its parser, and the table parsing functionality, as in these modules)...

  •    `kmerdb.graph`
    
  •    `kmerdb.fileutil`
    
  •    `kmerdb.parse`
    

and references..

i.e. "the format(s)"

And associated schemas...

This utility function wouldn't be part of the algorithm per se, but it would be incident to what is produced by virtue of the file-metadata-log (and this version-dataset pairing) thingawhosit. That's mostly contained in our __init__ and the associated module files for format access, along with the associated value provided from features and solutions in future versions.
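A "consistently hashable representation" of a metadata header could look roughly like the sketch below. It uses canonical JSON from the standard library purely for illustration; the function name is hypothetical and this is not taken from `kmerdb.fileutil` or `kmerdb.parse`.

```python
import hashlib
import json
from typing import Any, Mapping


def metadata_digest(metadata: Mapping[str, Any]) -> str:
    """sha256 of a canonical serialization, so equal metadata always hashes the same."""
    canonical = json.dumps(
        metadata,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no incidental whitespace
        ensure_ascii=True,       # stable bytes regardless of locale
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The point of the canonical form is that two headers with the same content but different key order or spacing produce the same digest, so the digest can be stored alongside a walk or path and rechecked later.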

That, tied to a git sha256 hash, should be preserved with all nodes of a given walk or path.

@MatthewRalston
Owner Author

MatthewRalston commented Apr 10, 2024

This issue has been tabled for the time being in favor of a cleaner UI and experience on the user end.

1. Interface overhaul (issue #132)

I want the user to understand the output and even ASCII styling (in absence of a rich.py dependency, which isn't needed)

output_dir

I want the logfile and output directories (required to collect .kdb, .kdbg, .stats.txt, output.log etc)

usage, steps, and features

I want the expanded help and usage statements, including the 'features' and 'steps' developed further.

minimal STDOUT

And finally, I want the STDOUT to be extremely minimal and/or non-existent, in the profile and graph commands. OR the formatting should display the resulting stats clearly apart from the header.
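A sketch of how an output directory with a logfile and a quiet STDOUT might be wired up using only the standard library. The function name, the logger name, and the `quiet` flag are hypothetical and do not describe the actual kmerdb CLI; `output.log` is the filename mentioned above.

```python
import logging
import sys
from pathlib import Path


def configure_output(output_dir: str, logfile_name: str = "output.log",
                     quiet: bool = True) -> logging.Logger:
    """Create the output directory and send log records to a file inside it."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("kmerdb_sketch")  # hypothetical logger name
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()     # avoid duplicate handlers on repeated calls
    logger.propagate = False    # keep records away from the root logger's stream

    file_handler = logging.FileHandler(out / logfile_name)
    file_handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(file_handler)

    if not quiet:
        # Diagnostics go to STDERR so STDOUT stays reserved for results.
        stream_handler = logging.StreamHandler(sys.stderr)
        stream_handler.setLevel(logging.WARNING)
        logger.addHandler(stream_handler)
    return logger
```

With `quiet=True` nothing is printed at all, and everything a user might want to inspect later sits in the output directory next to the other collected files.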

README "2.0" (issue #137)

Finally, readme overhaul

@MatthewRalston MatthewRalston moved this from Backlog to In review in Assembly algorithm (0.7.7+) Jul 14, 2024
@MatthewRalston MatthewRalston moved this from In review to Backlog in Assembly algorithm (0.7.7+) Jul 31, 2024
@MatthewRalston
Owner Author

Okay, I've been working on some other features and needed documentation/UI overhauls. Delays pushed the deadline back a few months, so I'm reprioritizing the assembly algorithm and possible numba/Python etc. implementations of D2 metrics. More odds-ratio stuff is on the horizon, along with more literature review and beginning to write a report and lit review on applications of kmer count matrices and distances to metagenomics and microbiomes.
