Releases: poseidon-framework/poseidon-hs
Release v1.2.1.0
This release does not include changes for trident end users.
It adds two new subcommands for (public) archive management, but they are only relevant from a developer's perspective: `chronicle` creates or updates a dedicated .yml file to document version iterations of Poseidon packages in a Git-managed archive, and `timetravel` recovers package iterations based on this file to (re)construct said archive from the source repository.
Just like `serve`, both new subcommands are omitted from the command line help.
Release v1.2.0.0
This release comes with a massive rewrite of the server-client infrastructure, that is, the code behind the API to download and list packages from our public data archives.
The server is now implemented as a (hidden) subcommand of trident: `serve`. It returns helpful error messages if an incompatible version of trident tries to connect to it. It is also now capable of serving multiple versions of one package (not just the latest, but also older ones), which is an important step towards computational reproducibility of Poseidon-based pipelines.
All server APIs except for `zip_file` now return a complex JSON datatype with server messages and a payload. The messages contain standard messages like a greeting and in the future perhaps also deprecation warnings. Some APIs also provide information or warnings in the server messages.
All APIs except for `zip_file` also accept an additional parameter `?client_version=X.X.X`, so that the server may in the future respond to old clients that an update is needed in order to understand the API. Our `trident list --remote` functionality already makes use of this.
Here are the individual APIs:
- `https://server.poseidon-adna.org/packages`: returns a list of all packages.
- `https://server.poseidon-adna.org/groups`: returns a list of all groups.
- `https://server.poseidon-adna.org/individuals`: returns a list of all individuals.
- `https://server.poseidon-adna.org/zip_file/<package_name>?package_version=1.0.1`: returns a zip file of the package with the given name and the given version. If no version is given, it returns the latest.
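As a rough illustration of the endpoints listed above, the following Python sketch only assembles request URLs; it does not contact the server and makes no assumption about the shape of the JSON response. The helper names `list_url` and `zip_file_url` are hypothetical and not part of trident or poseidon-hs.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://server.poseidon-adna.org"
LIST_ENDPOINTS = ("packages", "groups", "individuals")


def list_url(entity, client_version=None):
    """Build the URL of a list endpoint, optionally announcing the client version."""
    if entity not in LIST_ENDPOINTS:
        raise ValueError(f"unknown list endpoint: {entity}")
    url = f"{BASE_URL}/{entity}"
    if client_version is not None:
        url += "?" + urlencode({"client_version": client_version})
    return url


def zip_file_url(package_name, package_version=None):
    """Build the URL of the zip_file endpoint; no version means 'latest'."""
    url = f"{BASE_URL}/zip_file/{quote(package_name, safe='')}"
    if package_version is not None:
        url += "?" + urlencode({"package_version": package_version})
    return url
```

For example, `zip_file_url("2010_RasmussenNature", "2.0.1")` yields the versioned download URL, while omitting the version leaves the choice of the latest version to the server.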
The client subcommands `fetch` and `list` cannot yet make full use of this new API in this release, because they lack an interface to request specific package versions. This will be added in a future release. But the output of both subcommands already differs from the previous implementation:
- `fetch` now appends the package version to the directory name when downloading a package. Previously `trident fetch -d . -f "*2010_RasmussenNature*"` yielded a directory named `2010_RasmussenNature` with the package data, but now it creates `2010_RasmussenNature-2.0.1` (or whatever is the latest version of this package in the archive).
- `fetch` no longer has an option `--upgrade`, since the new behaviour allows different versions to live side by side in different directories. If users wish to remove old versions, they should do so manually.
- `list` now lists data (packages, groups, individuals) not just for the latest version of a package, but for all versions. The package version has become an explicit output column.
As before, `forge` and the other subcommands keep ignoring multiple package versions for now, and only read the latest.
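To illustrate what "only read the latest" means once several versions of a package sit side by side, here is a minimal Python sketch. It assumes versions are dot-separated integers that can be compared numerically; the function names are made up for this example and do not mirror poseidon-hs internals.

```python
def parse_version(version):
    """Turn '2.0.1' into (2, 0, 1) so versions compare numerically, not as text."""
    return tuple(int(part) for part in version.split("."))


def latest_versions(packages):
    """Map each package name to its highest version among (name, version) pairs."""
    latest = {}
    for name, version in packages:
        if name not in latest or parse_version(version) > parse_version(latest[name]):
            latest[name] = version
    return latest


def fetch_dir_name(name, version):
    """Directory name in the style described above: <package_name>-<version>."""
    return f"{name}-{version}"
```

Comparing parsed tuples rather than raw strings matters: as text, `2.0.10` would sort before `2.0.9`.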
The new server is available at a new URL (https://server.poseidon-adna.org), but the old version at (https://c107-224.cloud.gwdg.de) will also keep running for now. New releases of trident (v1.2.0.0+) will by default use the new server, while older versions must connect to the old one.
Release v1.1.12.0
This release implements the changes necessary for the Poseidon schema v2.7.1. That mostly means that the constraints on several .ssf file columns previously considered mandatory and unique were lifted.
Beyond that, a number of type constraints already specified in Poseidon v2.7.0 for the .ssf file were finally implemented in poseidon-hs. A broken file will thus actually be flagged upon reading if it violates the following requirements:
- .ssf columns that include Accession_IDs have to feature the correct and valid Accession_IDs according to INSDC specification.
- .ssf columns with dates have to be valid dates of the form `YYYY-MM-DD`.
- .ssf columns featuring MD5 hashes require entries with exactly 32 hex-digits.
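The date and MD5 requirements can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the poseidon-hs implementation; the function names are hypothetical, and the INSDC Accession_ID check is left out because its rules are not given here.

```python
import re
from datetime import date

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")
MD5_RE = re.compile(r"[0-9a-fA-F]{32}")


def is_valid_date(value):
    """True if value has the form YYYY-MM-DD and names a real calendar date."""
    if not DATE_RE.fullmatch(value):
        return False
    year, month, day = (int(part) for part in value.split("-"))
    try:
        date(year, month, day)
    except ValueError:
        return False
    return True


def is_valid_md5(value):
    """True if value consists of exactly 32 hexadecimal digits."""
    return MD5_RE.fullmatch(value) is not None
```

The regular expression enforces the exact layout (so `2024-2-29` is rejected), while constructing a `date` catches impossible days such as 29 February in a non-leap year.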
For both the .janno and the .ssf file we elevated the log level of the common `broken lines` error from debug to error. This makes these errors more prominent and easier to resolve.
Release v1.1.11.4
This release fixes a core issue in the implementation of Poseidon v2.7.0, where multiple columns of the .ssf file were not defined correctly as list columns. Poseidon v2.7.0 is in itself deprecated, though, and will be replaced as soon as possible with an updated version. This trident release thus exists mainly to provide a working implementation of v2.7.0 for future reference.
Beyond this change in functionality, this release also includes heavy refactoring in the `survey` subcommand and the golden test infrastructure, as well as a change to the version of Haskell that poseidon-hs and trident are built with. These changes should not have any user-facing consequences.
Release v1.1.11.0
This release implements the changes necessary to make `trident` capable of handling packages specified for the new Poseidon standard v2.7.0:
- A Poseidon package can now include a .ssf file ("sequencing source file") as specified. `trident` considers it in `validate`, `update`, `survey` and, most importantly, `forge`, where .ssf files are compiled for new packages just as .janno files.
- `trident` now understands and validates the new .janno columns `Country_ISO` and `Library_Names`.
- `trident` now knows the possible value `mixed` for the .janno column `Library_Built`.
The behaviour of `trident` for older package schema versions (v2.5.0 and v2.6.0) should be mostly unchanged. `forge` and `init` now return Poseidon v2.7.0 packages, though.
Release v1.1.10.2
This release bundles a number of minor changes, new minor command line options and some internal refactoring without immediate consequences for `trident`.
Changes in command line options
- By default `validate` only tests genotype data by parsing the first 100 SNPs. This limitation is necessary for performance reasons, but can hide issues outside of this tiny subset. We now added an option `--fullGeno` to `validate`, which forces parsing of the entire .bed/.geno file.
- The .fam file of Plink-formatted genotype data is used inconsistently across different popular aDNA software tools to store group/population name information. See more on this issue in our discussion here. We now added the (global) options `--inPlinkPopName` and `--outPlinkPopName` with the arguments `asFamily` (default), `asPhenotype` and `asBoth` to control the reading and writing of the population name from and to Plink .fam files.
- The `--no-extract` option for faster, package-wise data selection in `forge` was not working properly. We fixed it, renamed it to `--packagewise` and improved its command line help text.
Bugfixes
- As described here, our implementation of .janno file parsing struggled with some encodings of the `No-Break Space` unicode character. We now decided to delete these characters upon reading, following the assumption that they are generally not desired in a .janno file anyway. In this process we also decided to trim all whitespaces around .janno file fields.
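The cleaning step described in this bugfix can be sketched as follows. This is a simplified Python illustration, not the actual poseidon-hs parser; it assumes tab-separated fields, and the function names are hypothetical.

```python
NO_BREAK_SPACE = "\u00a0"


def clean_field(field):
    """Delete No-Break Space characters, then trim surrounding whitespace."""
    return field.replace(NO_BREAK_SPACE, "").strip()


def clean_row(line, sep="\t"):
    """Split one line into fields and clean each field individually."""
    return [clean_field(field) for field in line.split(sep)]
```

Note that the No-Break Space is deleted everywhere in a field, not only at its edges, which matches the assumption that the character is never wanted in a .janno file.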
Other changes
- The `-j` option of `list`, which allows including additional .janno columns in the output with the `--individuals` flag, now also gives access to arbitrary, additional variables.
- `update` now writes messages to the CHANGELOG file with a prefix `-`, to make it proper markdown.
- The verbose debug-level (with `--logMode VerboseLog`) warnings about missing standard columns in the .janno file were turned off.
- The important "schema version mismatch" error message was made more verbose and clear.
- `trident` now fails gracefully if one or all `-d`/`--baseDir`s do not exist.
- The important "broken lines" error message in the .janno reading process now reminds users to turn on `--logMode VerboseLog` to get more information.
Release v1.1.7.0
This release clarifies a long-standing uncertainty about how trident treats individual ID duplicates. It adds a new feature to the forge language to specify individuals more precisely and thus resolve duplication conflicts.
trident does not allow individuals with identical identifiers, that is, identical `Poseidon_ID`s, within one package. We generally also discourage such duplicates across packages in package collections. But there is no reason to enforce this unnecessarily for subcommands where it does not matter. Here are the rules we defined now:
- Generally, so in the subcommands `init`, `fetch`, `genoconvert`, `update`, `list`, `summarise`, and `survey`, trident logs a warning if it observes duplicates in a package collection found in the base dirs. But it proceeds normally then.
- Deviating from this, the special subcommand `validate` stops with an error if it observes duplicates. This behaviour can be changed with the new flag `--ignoreDuplicates`.
- The `forge` subcommand, finally, also ignores duplicates in the base dirs, except (!) if this conflict exists within the entities in the `--forgeString`. In this case it stops with an informative error:
[Error] There are duplicated individuals, but forge does not allow that
[Error] Please specify in your --forgeString or --forgeFile:
[Error] <Inuk.SG> -> <2010_RasmussenNature:Greenland_Saqqaq.SG:Inuk.SG>
[Error] <Inuk.SG> -> <2011_RasmussenNature:Greenland_Saqqaq.SG:Inuk.SG>
[Error] Error in the forge selection: Unresolved duplicated individuals
This already shows that the `-f`/`--forgeString` selection language of `forge` (and incidentally also `fetch`) includes a new syntactic element since this release: Individuals can now be described not just with `<individual>`, but also more specifically with `<package:group:individual>`. Individuals defined this way take precedence over differently defined ones (so: directly with `<individual>` or as a subset of `*package*` or `group`). This allows duplication issues to be resolved precisely, at least in cases where the duplicated individuals differ in source package or primary group.
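The precedence rule can be illustrated with a small Python sketch. It is a strongly simplified model of the selection language, written only for this example: it assumes a comma-separated list of entities, ignores any further syntax the real language may have, and all function names are hypothetical.

```python
import re


def parse_entity(token):
    """Classify one selection token as specific individual, individual, package or group."""
    token = token.strip()
    match = re.fullmatch(r"<([^:<>]+):([^:<>]+):([^:<>]+)>", token)
    if match:
        return ("specific", match.groups())
    match = re.fullmatch(r"<([^:<>]+)>", token)
    if match:
        return ("individual", match.group(1))
    match = re.fullmatch(r"\*(.+)\*", token)
    if match:
        return ("package", match.group(1))
    return ("group", token)


def select(individuals, forge_string):
    """Select (package, group, individual) triples; specific entities win on duplicates."""
    entities = [parse_entity(token) for token in forge_string.split(",")]
    specific = {value for kind, value in entities if kind == "specific"}
    chosen = {}
    for pkg, grp, ind in individuals:
        hit = any(
            (kind == "specific" and value == (pkg, grp, ind))
            or (kind == "individual" and value == ind)
            or (kind == "package" and value == pkg)
            or (kind == "group" and value == grp)
            for kind, value in entities
        )
        if hit:
            chosen.setdefault(ind, []).append((pkg, grp, ind))
    result = []
    for ind, candidates in chosen.items():
        if len(candidates) > 1:
            # Duplicates are only acceptable if exactly one was named specifically.
            preferred = [c for c in candidates if c in specific]
            if len(preferred) != 1:
                raise ValueError(f"Unresolved duplicated individuals: <{ind}>")
            candidates = preferred
        result.append(candidates[0])
    return result
```

With the two `Inuk.SG` entries from the error message above, selecting `<Inuk.SG>` alone fails in this model, while adding `<2010_RasmussenNature:Greenland_Saqqaq.SG:Inuk.SG>` resolves the conflict in favour of that package.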
Release v1.1.6.0
Additional columns in .janno files (V 1.1.5.0)
This release changes the way additional columns in .janno files are treated.
So far `trident` fully ignored additional variables, which had the consequence that `trident forge` dropped them without warning. With this new release, additional variables are loaded and carried along in `forge`. For merging different .janno files A and B the following rules apply regarding additional columns:
- If A has an additional column which is not in B then empty cells in the rows imported from B are filled with `n/a`.
- If A and B share additional columns with identical column name, then they are treated as semantically identical units and merged accordingly.
- In the resulting .janno file, all additional columns from both A and B are sorted alphabetically and appended after the normal, specified variables.
The following example illustrates the described behaviour:
A.janno
Poseidon_ID | Group_Name | Genetic_Sex | AdditionalColumn1 | AdditionalColumn2 |
---|---|---|---|---|
XXX011 | POP1 | M | A | D |
XXX012 | POP2 | F | B | E |
XXX013 | POP1 | M | C | F |
B.janno
Poseidon_ID | Group_Name | Genetic_Sex | AdditionalColumn3 | AdditionalColumn2 |
---|---|---|---|---|
YYY022 | POP5 | F | G | J |
YYY023 | POP5 | F | H | K |
YYY024 | POP5 | M | I | L |
A.janno + B.janno
Poseidon_ID | Group_Name | Genetic_Sex | AdditionalColumn1 | AdditionalColumn2 | AdditionalColumn3 |
---|---|---|---|---|---|
XXX011 | POP1 | M | A | D | n/a |
XXX012 | POP2 | F | B | E | n/a |
XXX013 | POP1 | M | C | F | n/a |
YYY022 | POP5 | F | n/a | J | G |
YYY023 | POP5 | F | n/a | K | H |
YYY024 | POP5 | M | n/a | L | I |
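The merge rules shown in the tables above can be expressed compactly in Python. This is a sketch of the described behaviour only, not the poseidon-hs code; rows are modelled as dictionaries, the function name is hypothetical, and only the three standard columns from the example are listed (a real .janno file specifies many more).

```python
STANDARD_COLUMNS = ["Poseidon_ID", "Group_Name", "Genetic_Sex"]


def merge_janno(rows_a, rows_b, standard=STANDARD_COLUMNS):
    """Merge two lists of row dicts; additional columns are sorted and filled with n/a."""
    rows = rows_a + rows_b
    # Every column that is not a specified one counts as "additional".
    additional = sorted({col for row in rows for col in row if col not in standard})
    header = list(standard) + additional
    merged = [{col: row.get(col, "n/a") for col in header} for row in rows]
    return header, merged
```

Because columns are matched by name, `AdditionalColumn2` from A and B ends up in one shared column, while `AdditionalColumn1` and `AdditionalColumn3` are padded with `n/a` for rows from the other file.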
Minor changes (V 1.1.6.0)
- `--verbose` in `trident validate` was deprecated. The respective output is now logged on the DEBUG level, so can be accessed with `--logMode VerboseLog`.
- Trailing slashes in `--outPath` for `init`, `genoconvert` and `forge` are now automatically removed. This prevents a common, confusing error, where a trailing slash would cause `trident` to assume the name of the resulting package is empty.
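The trailing-slash problem is easy to reproduce in any language that derives a name from the last path component. The Python sketch below assumes, for illustration only, that the package name is taken from the final component of `--outPath`; the function name is hypothetical.

```python
import os.path


def package_name_from_out_path(out_path):
    """Strip trailing slashes before taking the last path component as the name."""
    return os.path.basename(out_path.rstrip("/"))
```

Without the `rstrip`, `os.path.basename("out/my_package/")` returns an empty string, which mirrors the confusing empty package name described above.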
Release v1.1.4.2
With this release trident becomes able to handle the changes introduced for Poseidon v2.6.0.
- The contributor field in the POSEIDON.yml file is optional now and can be left blank.
- The contributor field now also can hold an ORCID in a subfield orcid. `trident` checks the structural correctness of this identifier.
- `trident` now recognizes the new available entries for the `Capture_Type` variable in the .janno file.
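What "structural correctness" covers exactly is not spelled out here. A plausible minimal check, assuming the usual ORCID layout of four hyphen-separated blocks of four characters and the ISO 7064 MOD 11-2 check digit, could look like the following Python sketch. The function name is hypothetical, and whether trident verifies the check digit is an assumption.

```python
import re

ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def is_valid_orcid(orcid):
    """Check the ORCID layout and its final check character (MOD 11-2)."""
    if not ORCID_RE.fullmatch(orcid):
        return False
    chars = orcid.replace("-", "")
    total = 0
    for ch in chars[:15]:
        total = (total + int(ch)) * 2
    check = (12 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return chars[15] == expected
```

The layout test alone already rejects truncated or mistyped identifiers; the check digit additionally catches most single-character errors.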
Beyond that:
- Already V 1.1.3.1 closed a loophole in .bib file validation, where .janno files could have arbitrary references if the .bib file was not correctly referenced in the POSEIDON.yml file.
- V 1.1.4.1 added a small validation check for the janno columns Date_BC_AD_Start, Date_BC_AD_Median and Date_BC_AD_Stop: Ages bigger than 2022 now trigger an error, because they are factually impossible and indicate that somebody accidentally entered a BP age.
- V 1.1.4.2 added parsing for Accession IDs. Wrong IDs are ignored (for now), so this is a non-breaking change.
Release v1.1.3.0
This release introduces a major change to the progress indicators in package downloading, reading, forging and converting. It also includes some minor code changes in the poseidon-hs library and the poseidon server executable.
Trident
From a trident user perspective only the change in the progress indicators is relevant. So far we used updating (self-overwriting) counters, which were great for interactive use of trident in modern terminal emulators. They are not suitable for use in scripts, though, because the command line output does not yield well structured log files. We therefore decided to integrate the progress indicators with our general logging infrastructure.
- Loading packages (so the `Initializing packages...` phase) now stays silent by default. With `--logMode VerboseLog` you can list the packages that are currently loading:
[Debug] [10:56:05] Package 20: ./2015_LlorenteScience/POSEIDON.yml
[Debug] [10:56:05] Package 21: ./2017_KennettNatureCommunications/POSEIDON.yml
[Debug] [10:56:06] Package 22: ./2016_MartinianoNatureCommunications/POSEIDON.yml
[Debug] [10:56:06] Package 23: ./2016_BroushakiScience/POSEIDON.yml
[Debug] [10:56:06] Package 24: ./2017_LindoPNAS/POSEIDON.yml
[Debug] [10:56:06] Package 25: ./2021_Zegarac_SoutheasternEurope/POSEIDON.yml
- `forge` and `genoconvert` now print a log message every 10k SNPs:
[Info] SNPs: 220000 5s
[Info] SNPs: 230000 5s
[Info] SNPs: 240000 5s
[Info] SNPs: 250000 5s
[Info] SNPs: 260000 6s
[Info] SNPs: 270000 6s
- `fetch` now prints a log message whenever a +5% threshold is reached:
[Info] Package size: 15.3MB
[Info] MB: 0.8 5.2%
[Info] MB: 1.6 10.5%
[Info] MB: 2.4 15.7%
[Info] MB: 3.2 20.9%
[Info] MB: 4.0 26.1%
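The threshold-based download messages can be modelled with a short Python sketch. It is an illustration of the idea (emit one line each time another 5% step has been passed), not the actual logging code; the function name, the chunk-based interface and the exact message layout are assumptions.

```python
def progress_messages(total_bytes, chunk_sizes, step=5.0):
    """Return one message each time the downloaded share passes the next threshold."""
    messages = []
    done = 0
    next_threshold = step
    for size in chunk_sizes:
        done += size
        percent = 100.0 * done / total_bytes
        if percent >= next_threshold:
            messages.append(f"MB: {done / 1e6:.1f}    {percent:.1f}%")
            # Skip all thresholds already covered by this chunk.
            while next_threshold <= percent:
                next_threshold += step
    return messages
```

Unlike a self-overwriting counter, this produces a bounded number of plain lines (at most 20 for a 5% step), which keeps log files readable when trident runs inside scripts.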
Server
The server has been updated in the following ways:
- It now uses Co-Log for logging
- A new option `-c` now makes it ignore checksums, which is useful for a fast start of the server if need be
- Zip files are now stored in a separate folder, to keep the (git-backed) repository itself clean
- There is a new API named `/compatibility/<version>` which accepts a client version (from trident) and returns a JSON tuple of Haskell-type (Bool, Maybe String). The first element is simply a Boolean saying if the client version is compatible with the server or not, the second is an optional Warning message the server can return to the client. This will become important in the future.
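A client could interpret the compatibility response roughly as in the following Python sketch. It assumes that the Haskell tuple (Bool, Maybe String) is serialised as a two-element JSON array with `null` standing for a missing warning; this encoding is an assumption, and the function name is hypothetical.

```python
import json


def parse_compatibility(body):
    """Decode a response body into (compatible, warning_or_None)."""
    compatible, warning = json.loads(body)
    if not isinstance(compatible, bool):
        raise ValueError("first element must be a boolean")
    if warning is not None and not isinstance(warning, str):
        raise ValueError("second element must be a string or null")
    return compatible, warning
```

A caller would typically abort when `compatible` is false and print `warning` to the user whenever it is present.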