taffish/repeatmodeler

repeatmodeler

TAFFISH tool app for RepeatModeler, a de novo transposable element family identification and modeling package.

This app packages RepeatModeler 2.0.9 from the official Dfam TETools runtime. TETools is the upstream-supported container route for RepeatModeler and includes RepeatModeler, RepeatMasker, RMBlast, FamDB/Dfam data, RepeatScout, RECON, LTR_retriever, GenomeTools, MAFFT, CD-HIT, UCSC twoBit helpers, and related TE utilities.

Package Identity

  • name: repeatmodeler
  • command: taf-repeatmodeler
  • kind: tool
  • version: 2.0.9-r1
  • image: ghcr.io/taffish/repeatmodeler:2.0.9-r1
  • TAFFISH app license: Apache-2.0
  • upstream: RepeatModeler 2.0.9
  • runtime base: dfam/tetools:2.0@sha256:6081b4d3883eff478873cb94cd24addf540275365d4acc4446bb647a341e95e2
  • native platform: linux/amd64

Install

taf install repeatmodeler

Usage

TAFFISH app-level options (help, version, compile):

taf-repeatmodeler --help
taf-repeatmodeler --version
taf-repeatmodeler --compile

Show upstream RepeatModeler help and version:

taf-repeatmodeler -- -help
taf-repeatmodeler -- -version
taf-repeatmodeler RepeatModeler -help
taf-repeatmodeler BuildDatabase -help
taf-repeatmodeler RepeatClassifier -help

Build a RepeatModeler database from an assembled genome FASTA:

taf-repeatmodeler BuildDatabase -name genome genome.fa

Run RepeatModeler on that database:

taf-repeatmodeler -database genome -threads 16

Run the LTR structural discovery extension:

taf-repeatmodeler -database genome -threads 16 -LTRStruct

Use the produced repeat library with RepeatMasker:

taf-repeatmodeler RepeatMasker -pa 8 -engine rmblast -lib genome-families.fa genome.fa
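
The three steps above can be chained in one script. The following is a minimal sketch, not something shipped with the app: the `run` helper, the `pipeline` function, the `DRY_RUN` switch, and the variable names are illustrative assumptions, and only the printed commands come from this README. By default it prints the commands without executing them.

```shell
#!/usr/bin/env bash
# Sketch: chain BuildDatabase -> RepeatModeler -> RepeatMasker.
# GENOME, DB, THREADS, and DRY_RUN are illustrative names, not app options.
GENOME="${GENOME:-genome.fa}"
DB="${DB:-genome}"
THREADS="${THREADS:-16}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute them

run() {
  # Print the command; execute it only when DRY_RUN=0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then
    "$@"
  fi
}

pipeline() {
  # Each step runs only if the previous one succeeded.
  run taf-repeatmodeler BuildDatabase -name "$DB" "$GENOME" &&
  run taf-repeatmodeler -database "$DB" -threads "$THREADS" &&
  run taf-repeatmodeler RepeatMasker -pa 8 -engine rmblast \
      -lib "${DB}-families.fa" "$GENOME"
}

pipeline
```

Setting `DRY_RUN=0` in the environment executes the commands for real; the `&&` chain stops at the first step that fails.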

Command Mode

taf-repeatmodeler is a normal command-mode TAFFISH tool app. Option-leading arguments are passed to the default upstream command, RepeatModeler:

taf-repeatmodeler -database genome -threads 16
taf-repeatmodeler -- -help

If the first argument is not an option, TAFFISH treats it as an executable in the same container. Use this for bundled helper commands:

taf-repeatmodeler BuildDatabase -name genome genome.fa
taf-repeatmodeler RepeatClassifier -consensi genome-families.fa
taf-repeatmodeler RepeatMasker -lib genome-families.fa genome.fa
taf-repeatmodeler rmblastn -version
taf-repeatmodeler famdb.py --help
taf-repeatmodeler gt -version
taf-repeatmodeler mafft --version

Note that taf-repeatmodeler BuildDatabase ... does not invoke a RepeatModeler subcommand. BuildDatabase, RepeatClassifier, and RepeatMasker are separate executables in the same runtime, and TAFFISH runs them directly.
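
The dispatch rule in this section can be modeled with a small shell function. This is only an illustration of the documented behavior, not the TAFFISH implementation: `resolve_command` is a hypothetical name, and the handling of `--` (dropped before the remaining arguments reach RepeatModeler) is an assumption inferred from the `taf-repeatmodeler -- -help` example.

```shell
# Model of the command-mode rule; NOT the real TAFFISH dispatcher.
# Prints the command that would run inside the container.
resolve_command() {
  case "${1-}" in
    --help|--version|--compile)
      # App-level options listed under Usage are handled by TAFFISH itself.
      printf '%s\n' "TAFFISH app option: $1" ;;
    --)
      # Assumption: "--" is consumed and the rest goes to RepeatModeler.
      shift
      printf '%s\n' "RepeatModeler $*" ;;
    -*)
      # Option-leading arguments go to the default upstream command.
      printf '%s\n' "RepeatModeler $*" ;;
    *)
      # A non-option first argument is treated as an executable name.
      printf '%s\n' "$*" ;;
  esac
}

resolve_command -database genome -threads 16        # RepeatModeler -database genome -threads 16
resolve_command BuildDatabase -name genome genome.fa  # BuildDatabase -name genome genome.fa
resolve_command -- -help                             # RepeatModeler -help
```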

Runtime Contents

The Dockerfile starts from the official Dfam TETools image pinned by digest and adds only TAFFISH metadata, a working directory, a few stable helper symlinks for command mode, and build-time self-checks.

Runtime components verified by the Dockerfile self-checks and the smoke tests include:

  • RepeatModeler 2.0.9
  • RepeatMasker 4.2.4
  • RMBlast 2.17.1+
  • TRF 4.09
  • GenomeTools 1.6.4
  • MAFFT 7.471
  • RepeatScout 1.0.7
  • FamDB 3.0.0 and Dfam 4.0 component data bundled by TETools
  • RECON, LTR_retriever 2.9.0, CD-HIT 4.8.1, NINJA, RepeatAfterMe, and UCSC twoBit helper utilities

BLAST_USAGE_REPORT=false is set so RMBlast does not perform NCBI usage reporting during TAFFISH runs. PAGER=cat is set so help output remains non-interactive in terminals, CI, and flows.

Inputs

RepeatModeler is designed for assembled genome sequences, not raw sequencing reads. The normal workflow is:

  1. Create a database with BuildDatabase -name <db> genome.fa.
  2. Run RepeatModeler -database <db> -threads <N>.
  3. Use the resulting repeat-family FASTA as a custom library, commonly with RepeatMasker -lib <db>-families.fa.

For real assemblies, run from a fast local working directory with enough disk space for temporary files. RepeatModeler creates an RM_<pid>.<date>/ directory and keeps it for audit and recovery.
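
Before starting a long run, a quick pre-flight check can catch the two problems mentioned above: read files passed in place of an assembly, and a working directory that is short on space. This is a rough sketch with hypothetical helper names (`looks_like_fasta`, `free_kb`); it assumes an uncompressed input file and sets no disk threshold, since the space needed depends on the genome.

```shell
# Sketch: pre-flight helpers; neither function is part of the app.

# Succeeds when the first non-empty line starts with ">", as in FASTA.
# FASTQ read files start with "@" and are rejected.
looks_like_fasta() {
  first="$(grep -m 1 -v '^[[:space:]]*$' "$1" | cut -c1)"
  [ "$first" = ">" ]
}

# Prints the free space, in KiB, of the filesystem holding the given
# directory (default: the current directory).
free_kb() {
  df -Pk "${1:-.}" | awk 'NR==2 {print $4}'
}
```

A typical use would be `looks_like_fasta genome.fa || echo "genome.fa does not look like FASTA"`, followed by `free_kb .` to see how much room the run directory has.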

Outputs

Successful RepeatModeler runs commonly produce:

<database>-families.fa     Consensus repeat family FASTA
<database>-families.stk    Dfam-compatible Stockholm seed alignments
<database>-rmod.log        Summarized RepeatModeler log
RM_<pid>.<date>/           Detailed run directory with rounds and intermediates

The FASTA library can be passed to RepeatMasker with -lib for genome screening and masking.
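
After a run, the files listed above can be checked in one pass. The helper below is a hypothetical sketch (`check_outputs` is not part of the app); it only tests for the file names given in this section and treats the run directory as informational.

```shell
# Sketch: report which commonly produced outputs exist for a database.
# Usage: check_outputs <database> [directory]
check_outputs() {
  db="$1"; dir="${2:-.}"; missing=0
  for f in "$db-families.fa" "$db-families.stk" "$db-rmod.log"; do
    if [ -f "$dir/$f" ]; then
      echo "ok       $f"
    else
      echo "missing  $f"
      missing=1
    fi
  done
  # The run directory name contains a PID and a date, so match by prefix.
  set -- "$dir"/RM_*/
  if [ -d "$1" ]; then
    echo "ok       ${1#"$dir"/}"
  else
    echo "missing  RM_<pid>.<date>/"
  fi
  # Non-zero exit status when any of the three files is absent.
  return "$missing"
}
```

For example, `check_outputs genome` inspects the current directory for genome-families.fa, genome-families.stk, and genome-rmod.log.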

Data And Database Boundary

This image follows the official TETools runtime and therefore includes the Dfam/FamDB component files bundled there:

/opt/FamDB-Dfam-4.0/Libraries/famdb/dfam40.0.h5
/opt/FamDB-Dfam-4.0/Libraries/famdb/dfam40.curated.consensus.0.h5

This is different from the lighter taf-repeatmasker app, where full species/clade FamDB runs require user-provided Dfam component files. Here the goal is a complete official RepeatModeler/TETools runtime, so the bundled TETools FamDB/Dfam data are retained.

The app does not include RepBase, project-specific genomes, or user reference datasets. Those resources must be supplied by the user and used under their own license terms.

Platform

This app is native linux/amd64 only because the official TETools image is a single amd64 image and includes x64 Linux binaries such as RMBlast and UCSC helper tools. src/main.taf asks Docker and Podman to run with:

--platform linux/amd64

On arm64 hosts, such as Apple Silicon Macs, Docker and Podman can run it through amd64 emulation. That is not native arm64 support.
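
To see which case applies to a given host, the machine architecture can be inspected before launching a long job. This is a small sketch with a hypothetical helper name (`platform_note`); it relies only on `uname -m` and does not query Docker or Podman.

```shell
# Sketch: say whether the amd64 image needs CPU emulation on this host.
# An architecture string can be passed explicitly; default is `uname -m`.
platform_note() {
  arch="${1:-$(uname -m)}"
  case "$arch" in
    x86_64|amd64)  echo "no CPU emulation needed (amd64 host)" ;;
    arm64|aarch64) echo "amd64 emulation required (arm64 host)" ;;
    *)             echo "unknown architecture: $arch" ;;
  esac
}

platform_note
```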

Boundaries

This app does not:

  • provide a full repeat annotation flow;
  • bundle RepBase or commercial/restricted repeat libraries;
  • download external databases during normal execution or smoke tests;
  • validate biological correctness on large genomes in its smoke tests;
  • claim native linux/arm64 support.

The smoke tests validate the packaged command surface, version binding, RepeatModeler configuration paths, key helper availability, and a tiny offline BuildDatabase run.

License Boundary

The TAFFISH app packaging files are licensed under Apache-2.0. The packaged upstream RepeatModeler software is licensed under OSL-2.1. The official TETools runtime bundles additional tools and Dfam/FamDB data under their own notices; those upstream and data licenses are not changed by this TAFFISH wrapper.

Citation

Please cite RepeatModeler and its component tools as requested by upstream for your analysis. Important upstream components include RepeatModeler, Dfam TETools, RepeatMasker, RMBlast/BLAST+, RepeatScout, RECON, TRF, LTR_retriever, GenomeTools, MAFFT, CD-HIT, and Dfam/FamDB data.

Upstream resources:
