Pipeline for building de-novo repeat libraries using the RepeatModeler utility.
A registry file with the locations of the core database server(s).
# REPEAT_MODELLER_OPTIONS='-max_seq_length 19000000' # or whatever
init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::RepeatModeler_conf \
$($CMD details script) \
-pipeline_tag "_${RUN_TAG}" \
-hive_force_init 1 \
-registry $REG_FILE \
-results_dir $RESULTS_DIR \
-species $SPECIES \
${REPEAT_MODELLER_OPTIONS} \
-do_clustering 1 \
2> $OUT_DIR/init.stderr \
1> $OUT_DIR/init.stdout
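The command above assumes a handful of shell variables are already set. A minimal sketch of that setup is below; all paths are placeholders, and `$CMD` is assumed to be a database-server wrapper whose `details script` subcommand prints the `-host`/`-port`/`-user` connection options:

```bash
# Placeholder setup for the init_pipeline.pl call above (values are examples).
RUN_TAG=run1                    # suffix used to tag this pipeline run
REG_FILE=/path/to/registry.pm   # registry file with core database locations
RESULTS_DIR=/path/to/results    # where the repeat libraries will be written
OUT_DIR=/path/to/logs           # where stdout/stderr from each step is kept
SPECIES=my_species              # production_name of the species to process
CMD=my-db-server                # wrapper printing server connection options

mkdir -p $RESULTS_DIR $OUT_DIR
```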
SYNC_CMD=$(grep -- '-sync$' $OUT_DIR/init.stdout | perl -pe 's/^\s*//; s/"//g')
# should get something like
# beekeeper.pl -url $url -sync
LOOP_CMD=$(grep -- '-loop' $OUT_DIR/init.stdout | perl -pe 's/^\s*//; s/\s*#.*$//; s/"//g')
# should get something like
# beekeeper.pl -url $url -reg_file $REG_FILE -loop
$SYNC_CMD 2> $OUT_DIR/sync.stderr 1> $OUT_DIR/sync.stdout
$LOOP_CMD 2> $OUT_DIR/loop.stderr 1> $OUT_DIR/loop.stdout
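Once the loop finishes, it is worth checking that nothing went wrong. A minimal sketch that only inspects the log files captured above (no beekeeper-specific behaviour is assumed):

```bash
# Crude post-run check: surface anything beekeeper wrote to stderr.
for step in sync loop; do
    if [ -s "$OUT_DIR/$step.stderr" ]; then
        echo "== $step reported messages on stderr ==" >&2
        tail -n 20 "$OUT_DIR/$step.stderr" >&2
    fi
done
```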
option | default value | meaning |
---|---|---|
-species | | species to process; several -species options are possible |
-results_dir | | directory to store results in |
-min_slice_length | 5000 | ignore scaffolds that are shorter than this value |
-max_seq_length | 10_000_000 | maximum length for a single scaffold; scaffolds exceeding this value will be split |
-max_seqs_per_file | 10_000 | maximum number of scaffolds to be processed in a chunk (by a single RepeatModeler instance) |
-do_clustering | 0 | deal with redundancy in the final library, using CD-HIT-EST clustering |
-do_filtering | 0 | remove repeat models that are found in CDSs (details) |
-repeatmodeler_dir | | path to the directory containing the RepeatModeler executables |
-blast_engine | ncbi | which BLAST program to use; options are 'ncbi' or 'wublast' |
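For illustration, a hypothetical invocation combining several of the options above (the species names are placeholders):

```bash
# Example only: two species in one run, larger chunks, clustering enabled.
init_pipeline.pl Bio::EnsEMBL::EGPipeline::PipeConfig::RepeatModeler_conf \
    $($CMD details script) \
    -registry $REG_FILE \
    -results_dir $RESULTS_DIR \
    -species species_one -species species_two \
    -max_seq_length 19000000 \
    -max_seqs_per_file 5000 \
    -do_clustering 1
```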
Each chunked portion of the genome will produce a repeat library file. These are simply concatenated to produce a single file in the `${RESULTS_DIR}` directory, named `<SPECIES>.rm.lib`, where `<SPECIES>` is the species production_name.
There is likely to be some redundancy here, particularly for large genomes that have correspondingly large numbers of chunk files. This isn't ideal, but our main goal is to repeat-mask the genome (for which redundancy doesn't matter), so the library is a means to an end, rather than an end in itself. The pipeline uses the multi-core functionality of RepeatModeler (9 cores per hive job), which speeds the program up and lets us chunk as little as possible; but chunking remains a necessary evil.
Use `-do_clustering 1` to deal with the library redundancy.
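For reference, clustering the concatenated library by hand would look something like the sketch below, assuming `cd-hit-est` is on your PATH; the parameter values are illustrative, not the pipeline's actual settings:

```bash
# Manual stand-in for -do_clustering 1 (illustrative parameters only):
#   -c 0.8 : 80% identity threshold for merging repeat models
#   -n 5   : word size appropriate for -c 0.8 (per the cd-hit-est docs)
#   -M/-T  : memory limit (MB) and thread count
cd-hit-est -i ${RESULTS_DIR}/${SPECIES}.rm.lib \
           -o ${RESULTS_DIR}/${SPECIES}.rm.nr.lib \
           -c 0.8 -n 5 -M 2000 -T 4
```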
A few generic modules from Common::RunnableDB.
A few modules from DNAFeatures.