Merge pull request #211 from poseidon-framework/fixDuplicateManagement
Fix duplicate management
nevrome authored Jan 13, 2023
2 parents 097b645 + f3440ad commit 809afc1
Showing 43 changed files with 613 additions and 162 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,6 +1,7 @@
- V 1.1.7.0: Reorganized handling of duplicate individuals: Duplicates are now generally ignored, except in validate (can also be turned off with a new flag) and forge. The forgeString language features a new syntactic entity to select individuals specifically and thus resolve duplication conflicts
- V 1.1.6.0: Removed outdated --verbose from validate and ignore trailing slashes from --outPath
- V 1.1.5.0: Enabled reading and forging additional, unspecified variables in .janno files
- V 1.1.4.2: Added parsing for Accession IDs (.janno file). Wrong entries are ignored, so this is non-breaking.
- V 1.1.4.2: Added parsing for Accession IDs (.janno file). Wrong entries are ignored, so this is non-breaking
- V 1.1.4.1: Added a small validation check for calibrated ages in the .janno file
- V 1.1.4.0: Changes to make poseidon-hs compatible with Poseidon v2.6.0 (backwards compatible with v2.5.0): contributor field optional, added orcid field for contributors, added more capture type options in janno files
- V 1.1.3.1: Package reading will now fail if bib-entries are not found due to missing bibtex files
20 changes: 20 additions & 0 deletions CHANGELOGRELEASE.md
@@ -1,3 +1,23 @@
### V 1.1.7.0

This release clarifies a long-standing uncertainty about how trident treats individual ID duplicates. It adds a new feature to the forge language to specify individuals more precisely and thus resolve duplication conflicts.

trident does **not** allow individuals with identical identifiers, i.e. `Poseidon_ID`s, **within one package**, and we generally also discourage such duplicates across packages in package collections. But there is no reason to enforce this for subcommands where it does not matter. These are the rules we have now defined:

- Generally, i.e. in the subcommands `init`, `fetch`, `genoconvert`, `update`, `list`, `summarise`, and `survey`, trident logs a warning if it observes duplicates in a package collection found in the base dirs, but then proceeds normally.
- Deviating from this, the special subcommand `validate` stops with an error if it observes duplicates. This behaviour can be changed with the new flag `--ignoreDuplicates`.
- The `forge` subcommand, finally, also ignores duplicates in the base dirs, except (!) if the conflict exists among the entities in the `--forgeString`. In this case it stops with an informative error:

```
[Error] There are duplicated individuals, but forge does not allow that
[Error] Please specify in your --forgeString or --forgeFile:
[Error] <Inuk.SG> -> <2010_RasmussenNature:Greenland_Saqqaq.SG:Inuk.SG>
[Error] <Inuk.SG> -> <2011_RasmussenNature:Greenland_Saqqaq.SG:Inuk.SG>
[Error] Error in the forge selection: Unresolved duplicated individuals
```

This already shows that the `-f`/`--forgeString` selection language of `forge` (and incidentally also `fetch`) includes a new syntactic element since this release: Individuals can now be described not just with `<individual>`, but also more specifically with `<package:group:individual>`. Individuals defined this way take precedence over differently defined ones (so: directly with `<individual>` or as a subset of `*package*` or `group`). This allows duplication issues to be resolved precisely -- at least in cases where the duplicated individuals differ in source package or primary group.
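
For illustration, here is a hypothetical `--forgeString` that selects both Rasmussen packages and resolves the `Inuk.SG` conflict from the error message above by naming the 2011 copy explicitly. The entity spellings are taken from that error message; the concrete selection itself is an assumption for illustration, not part of this changelog:

```
-f "*2010_RasmussenNature*, *2011_RasmussenNature*, <2011_RasmussenNature:Greenland_Saqqaq.SG:Inuk.SG>"
```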

### V 1.1.6.0

#### Additional columns in .janno files (V 1.1.5.0)
2 changes: 1 addition & 1 deletion poseidon-hs.cabal
@@ -1,5 +1,5 @@
name: poseidon-hs
version: 1.1.6.0
version: 1.1.7.0
synopsis: A package with tools for working with Poseidon Genotype Data
description: The tools in this package read and analyse Poseidon-formatted genotype databases, a modular system for storing genotype data from thousands of individuals.
license: MIT
1 change: 1 addition & 0 deletions src-executables/Main-trident.hs
@@ -207,3 +207,4 @@ validateOptParser :: OP.Parser ValidateOptions
validateOptParser = ValidateOptions <$> parseBasePaths
<*> parseIgnoreGeno
<*> parseNoExitCode
<*> parseIgnoreDuplicates
41 changes: 21 additions & 20 deletions src/Poseidon/CLI/Forge.hs
@@ -5,11 +5,12 @@ module Poseidon.CLI.Forge where
import Poseidon.BibFile (BibEntry (..), BibTeX,
writeBibTeXFile)
import Poseidon.EntitiesList (EntityInput, PoseidonEntity (..),
PoseidonIndividual (..),
SignedEntity (..),
conformingEntityIndices,
filterRelevantPackages,
findNonExistentEntities,
readEntityInputs)
readEntityInputs,
resolveEntityIndices)
import Poseidon.GenotypeData (GenoDataSource (..),
GenotypeDataSpec (..),
GenotypeFormatSpec (..),
@@ -29,15 +30,16 @@ import Poseidon.Package (PackageReadOptions (..),
newPackageTemplate,
readPoseidonPackageCollection,
writePoseidonPackage)
import Poseidon.SecondaryTypes (IndividualInfo (..))
import Poseidon.Utils (PoseidonException (..),
PoseidonLogIO,
determinePackageOutName, logInfo,
logWarning)
determinePackageOutName, logError,
logInfo, logWarning)

import Control.Exception (catch, throwIO)
import Control.Monad (forM, forM_, unless, when)
import Control.Monad.Reader (ask)
import Data.List (intercalate, nub, (\\))
import Data.List (intercalate, nub)
import Data.Maybe (mapMaybe)
import Data.Time (getCurrentTime)
import qualified Data.Vector as V
@@ -109,27 +111,35 @@ runForge (
logInfo $ "Forging with the following entity-list: " ++ (intercalate ", " . map show . take 10) entities ++
if length entities > 10 then " and " ++ show (length entities - 10) ++ " more" else ""

-- check for entities that do not exist this this dataset
-- check for entities that do not exist in this dataset
let nonExistentEntities = findNonExistentEntities entities . getJointIndividualInfo $ allPackages
unless (null nonExistentEntities) $
logWarning $ "The following entities do not exist in this dataset and will be ignored: " ++
logWarning $ "Detected entities that do not exist in the dataset. They will be ignored: " ++
intercalate ", " (map show nonExistentEntities)

-- determine relevant packages
let relevantPackages = filterRelevantPackages entities allPackages
logInfo $ (show . length $ relevantPackages) ++ " packages contain data for this forging operation"
when (null relevantPackages) $ liftIO $ throwIO PoseidonEmptyForgeException

-- determine relevant individual indices
let relevantIndices = conformingEntityIndices entities . getJointIndividualInfo $ relevantPackages
-- get all individuals from the relevant packages
let allInds = getJointIndividualInfo $ relevantPackages

-- determine indizes of relevant individuals and resolve duplicates
let (unresolvedDuplicatedInds, relevantIndices) = resolveEntityIndices entities allInds

-- check if there still are duplicates and if yes, then stop
unless (null unresolvedDuplicatedInds) $ do
logError "There are duplicated individuals, but forge does not allow that"
logError "Please specify in your --forgeString or --forgeFile:"
mapM_ (\(_,i@(IndividualInfo n _ _),_) -> logError $ show (SimpleInd n) ++ " -> " ++ show (SpecificInd i)) $ concat unresolvedDuplicatedInds
liftIO $ throwIO $ PoseidonForgeEntitiesException "Unresolved duplicated individuals"

-- collect data --
-- janno
let jannoRows = getJointJanno relevantPackages
relevantJannoRows = map (jannoRows !!) relevantIndices

-- check for duplicates among the individuals selected for merging
checkIndividualsUniqueJanno relevantJannoRows
-- bib
let bibEntries = concatMap posPacBib relevantPackages
relevantBibEntries = filterBibEntries relevantJannoRows bibEntries
@@ -213,15 +223,6 @@ sumNonMissingSNPs accumulator (_, geno) = do
| x == Missing = 0
| otherwise = 1

checkIndividualsUniqueJanno :: [JannoRow] -> PoseidonLogIO ()
checkIndividualsUniqueJanno rows = do
let indIDs = map jPoseidonID rows
when (length indIDs /= length (nub indIDs)) $ do
liftIO $ throwIO $ PoseidonForgeEntitiesException $
"Duplicate individuals in selection (" ++
intercalate ", " (indIDs \\ nub indIDs) ++
")"

filterBibEntries :: [JannoRow] -> BibTeX -> BibTeX
filterBibEntries samples references_ =
let relevantPublications = nub . concatMap getJannoList . mapMaybe jPublication $ samples
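
The duplicate resolution itself happens in `resolveEntityIndices`, which is imported above from `Poseidon.EntitiesList` but whose implementation is not shown on this page. Purely as an illustration of the precedence idea described in the release notes -- not the actual library code -- a minimal, self-contained Haskell sketch with simplified types might look like this:

```haskell
module DuplicateResolutionSketch where

import           Data.Function (on)
import           Data.List     (groupBy, sortOn)

-- Simplified stand-in for the real IndividualInfo type
data IndInfo = IndInfo
    { indName    :: String
    , indGroup   :: String
    , indPackage :: String
    } deriving (Eq, Show)

-- A specific selection, written as <package:group:individual> in the forge language
type SpecificSelection = (String, String, String)

-- For every set of same-named individuals, keep exactly one index if the name is
-- unique or if exactly one member is picked by a specific selection; otherwise
-- report the whole set as an unresolved duplicate.
resolveDuplicatesSketch :: [SpecificSelection]
                        -> [(Int, IndInfo)]
                        -> ([[(Int, IndInfo)]], [Int]) -- (unresolved groups, resolved indices)
resolveDuplicatesSketch specifics indexedInds =
    let sameName = groupBy ((==) `on` (indName . snd))
                 . sortOn (indName . snd)
                 $ indexedInds
        results  = map resolve sameName
    in ([g | Left g <- results], [i | Right i <- results])
  where
    isSpecific ind = (indPackage ind, indGroup ind, indName ind) `elem` specifics
    resolve [(i, _)] = Right i                 -- unique name: nothing to resolve
    resolve g = case filter (isSpecific . snd) g of
        [(i, _)] -> Right i                    -- exactly one explicit pick wins
        _        -> Left g                     -- still ambiguous

-- Example: two Inuk.SG individuals, with the 2011 copy picked explicitly
-- >>> resolveDuplicatesSketch [("2011_RasmussenNature", "Greenland_Saqqaq.SG", "Inuk.SG")]
-- ...     [ (0, IndInfo "Inuk.SG" "Greenland_Saqqaq.SG" "2010_RasmussenNature")
-- ...     , (1, IndInfo "Inuk.SG" "Greenland_Saqqaq.SG" "2011_RasmussenNature") ]
-- ([],[1])
```
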
2 changes: 1 addition & 1 deletion src/Poseidon/CLI/Genoconvert.hs
@@ -41,7 +41,7 @@ data GenoconvertOptions = GenoconvertOptions

pacReadOpts :: PackageReadOptions
pacReadOpts = defaultPackageReadOptions {
_readOptStopOnDuplicates = True
_readOptStopOnDuplicates = False
, _readOptIgnoreChecksums = True
, _readOptIgnoreGeno = False
, _readOptGenoCheck = True
11 changes: 10 additions & 1 deletion src/Poseidon/CLI/OptparseApplicativeParsers.hs
@@ -167,7 +167,9 @@ parseForgeEntitiesDirect = OP.option (OP.eitherReader readSignedEntities) (OP.lo
\forge will apply excludes and includes in order. If the first entity is negative, then forge \
\will assume you want to merge all individuals in the packages found in the baseDirs (except the \
\ones explicitly excluded) before the exclude entities are applied. \
\An empty forgeString (and no --forgeFile) will therefore merge all available individuals.")
\An empty forgeString (and no --forgeFile) will therefore merge all available individuals. \
\If there are individuals in your input packages with equal individual id, but different main group or \
\source package, they can be specified with the special syntax \"<package:group:individual>\".")
where
readSignedEntities s = case readEntitiesFromString s of
Left e -> Left (show e)
@@ -397,6 +399,13 @@ parseNoExitCode = OP.switch (
OP.hidden
)

parseIgnoreDuplicates :: OP.Parser Bool
parseIgnoreDuplicates = OP.switch (
OP.long "ignoreDuplicates" <>
OP.help "do not stop on duplicated individual names in the package collection" <>
OP.hidden
)

parseRemoteURL :: OP.Parser String
parseRemoteURL = OP.strOption (
OP.long "remoteURL" <>
2 changes: 1 addition & 1 deletion src/Poseidon/CLI/Update.hs
@@ -38,7 +38,7 @@ data UpdateOptions = UpdateOptions

pacReadOpts :: PackageReadOptions
pacReadOpts = defaultPackageReadOptions {
_readOptStopOnDuplicates = True
_readOptStopOnDuplicates = False
, _readOptIgnoreChecksums = True
, _readOptIgnoreGeno = True
, _readOptGenoCheck = False
14 changes: 7 additions & 7 deletions src/Poseidon/CLI/Validate.hs
@@ -18,23 +18,23 @@ import System.Exit (exitFailure, exitSuccess)

-- | A datatype representing command line options for the validate command
data ValidateOptions = ValidateOptions
{ _validateBaseDirs :: [FilePath]
, _validateIgnoreGeno :: Bool
, _validateNoExitCode :: Bool
{ _validateBaseDirs :: [FilePath]
, _validateIgnoreGeno :: Bool
, _validateNoExitCode :: Bool
, _validateIgnoreDuplicates :: Bool
}

pacReadOpts :: PackageReadOptions
pacReadOpts = defaultPackageReadOptions {
_readOptStopOnDuplicates = True
, _readOptIgnoreChecksums = False
_readOptIgnoreChecksums = False
, _readOptGenoCheck = True
}

runValidate :: ValidateOptions -> PoseidonLogIO ()
runValidate (ValidateOptions baseDirs ignoreGeno noExitCode) = do
runValidate (ValidateOptions baseDirs ignoreGeno noExitCode ignoreDup) = do
posFiles <- liftIO $ concat <$> mapM findAllPoseidonYmlFiles baseDirs
allPackages <- readPoseidonPackageCollection
pacReadOpts {_readOptIgnoreGeno = ignoreGeno}
pacReadOpts {_readOptIgnoreGeno = ignoreGeno, _readOptStopOnDuplicates = not ignoreDup}
baseDirs
let numberOfPOSEIDONymlFiles = length posFiles
numberOfLoadedPackagesWithDuplicates = foldl' (+) 0 $ map posPacDuplicate allPackages
