AUGUSTUSを用いた非モデル生物ゲノムへの遺伝子アノテーション付 | だれかのためのバイオインフォマティクス

どこにエキソンがあってどこにイントロンがあって、という情報をつけるときには遺伝子予想、遺伝子アノテーション付けという作業をしていきます。

今はBraker3などを用いた方法が主流っぽいですが、ここではAUGUSTUSをlocalで使った遺伝子予想の方法をまとめます。手数は多いですがその分ステップごとに戻ったり、パラメーターを変えた試行錯誤しやすいです。
参考：http://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=Augustus.Augustus

準備
1. ゲノムファイル（fasta file）の配列名をシンプルにする
2. インストール
手順
後の処理

準備

ゲノムファイル（fasta file）の配列名をシンプルにする

配列名（>の部分）をできるだけシンプルで一意な名前にする。
記号はトラブルのもとになりがちで怖いから使わない！！
おすすめは>Homosp0001の形。名前の一括変更はseqkitをつかう。

インストール

以下は全てconda でインストールできる

Augustus
BUSCO （AugustusのSpeciesに登録がない場合に必要）
RepeatModeler
RepeatMasker
HiSat2
Samtools
Bowtie2

あったら便利

Seqkit
Gffread
AGAT (Singularity)

手順

BUSCO遺伝子に基づく遺伝子モデル情報を作る

Augustusの実行時にSpeciesを指定する必要がある。
デフォルトで実装されている種に関しては（https://github.com/Gaius-Augustus/Augustus/blob/master/docs/ABOUT.md）、以下のステップはスキップできる。

busco --augustus --long -m genome \
    -i genome.fa  \
    -o [output] -l [分類群] -c [CPU]

Bash

-o はフォルダ名のみ指定可能。ディレクトリは指定できない

以下にSpeciesファイルが作られている

[output]/[分類群]/augustus_output/retraining_parameters/BUSCO_[output]/

Bash

Speciesファイルはこんな感じ。

BUSCO_*_exon_probs.pbl
BUSCO_*_intron_probs.pbl
BUSCO_*_metapars.cgp.cfg
BUSCO_*_parameters.cfg
BUSCO_*_igenic_probs.pbl
BUSCO_*_metapars.cfg
BUSCO_*_metapars.utr.cfg
BUSCO_*_weightmatrix.txt

Bash

丸ごと

augustus/config/species/

Bash

下にコピペしておく。以下ではSpeciesの名前をBUSCO_Spとする。

リピート配列をソフトマスクする

BuildDatabase -name [database名] genome.fa
RepeatModeler -database [database名] -threads [CPU]
RepeatMasker -xsmall -dir [output] -pa [CPU] -lib RM_*/consensi.fa.classified genome.fa

Bash

genome.fa.maskedの名前でソフトマスクされたFastaファイルが出力される

Augustusの並列化のための準備

Augustusはデフォルトでは並列化できないが、ゲノムファイルを分割して使用すればできる。

seqkit split2 genome.fa.masked -s 1 -O genome_splits

Bash

ヒントファイルをつくる

リピート領域のヒントファイルをつくる

RepeatMaskerの出力からエキソンでない領域のヒントファイルを作る

cat genome.fa.out | tail -n +3 | \
perl -ne 'chomp; s/^\s+//; @t = split(/\s+/);print $t[4]."\t"."repmask\tnonexonpart\t".$t[5]."\t".$t[6]."\t0\t.\t.\tsrc=RM\n";' | \
sort -n -k 1,1 > repeat.hints

Bash

RNA-seqからヒントファイルをつくる

GENOME="dir/genome.fa.masked"
GENOMESPLIT="genome_splits/"
RNAdir="RNAdir/"
READS=(DRR1 DRR2 DRR3)
SPECIES="BUSCO_Sp"
CPU=10

mkdir RNA_hints

hisat2-build -p 12 ${GENOME} genome_index
mkdir ${RNAdir}/bam

for read in ${READS[@]}; do
    hisat2 -p ${CPU} -x genome_index \
    -1 ${RNAdir}/${READS}_1.fastq.gz -2 ${RNAdir}/${READS}_2.fastq.gz \
    -S ${RNAdir}/bam/${READS}.sam
    samtools sort -@ ${CPU} ${RNAdir}/bam/${READS}.sam > ${RNAdir}/bam/${READS}.bam
    rm ${RNAdir}/bam/${READS}.sam
done

samtools merge merged.bam ${RNAdir}/bam/*bam &

samtools sort -@ ${CPU} -n merged.bam -o merged.s.bam 
filterBam --uniq --in merged.s.bam --out merged.s.f.bam 
samtools view -H merged.s.f.bam > header.txt
samtools sort -@ ${CPU} merged.s.f.bam -o both.merged.s.f.bam
bam2hints --intronsonly --in=both.merged.s.f.bam --out=hints.gff

nohup ls -1 "${GENOMESPLIT}" | xargs -n 1 -P ${CPU} -I {} \
  sh -c 'augustus "$1/$2" --species="$3" --extrinsicCfgFile=extrinsic.E.cfg \
  --alternatives-from-evidence=true --hintsfile=hints.gff \
  --allow_hinted_splicesites=atac --introns=on --genemodel=complete \
  > aug1.$2.out' _ "${GENOMESPLIT}" "{}" "${SPECIES}" &

cat aug1.{}.out > aug1.out

cat aug1.out | tee aug.prelim.gff | grep -P "\tintron\t" > aug1.introns.gff
cat hints.gff aug1.introns.gff | perl -ne '@array = split(/\t/, $_);print "$array[0]:$array[3]-$array[4]\n";'| sort -u > introns.lst
intron2exex.pl --introns=introns.lst --flank 150 --seq=${GENOME} --exex=exex.fa --map=map.psl

bowtie2-build exex.fa exex1
for read in ${READS[@]}; do
    bowtie2 -p ${CPU} -x exex1 \
    -1 ${RNAdir}/${READS}_1.fastq.gz \
    -2 ${RNAdir}/${READS}_2.fastq.gz -S ${read}.sam
done

samtools view -@ ${CPU} -S -F 4 <(cat *.sam ) > bowtie.F.sam
samMap.pl bowtie.F.sam map.psl 150 > bowtie.global.sam

bamtools filter -in merged.bam -out merged.noN.bam \
-script operation_N_filter.txt

cat header.txt bowtie.global.sam > bowtie.global.h.sam
samtools view -@ ${CPU} -bS -o bowtie.global.h.bam bowtie.global.h.sam
samtools merge -@ ${CPU} both.bam     bowtie.global.h.bam   merged.noN.bam
samtools sort -@ ${CPU} -n both.bam -o both.s.bam
filterBam --uniq --in both.s.bam --out both.sf.bam
samtools sort -@ ${CPU} both.sf.bam -o both.ssf.bam
bam2hints --intronsonly --in=both.ssf.bam --out=RNA.hints

Bash

近縁種のアミノ酸配列からヒントファイルをつくる

Exonerateでアミン酸配列をマッピングする。配列を分割して並列化する。

GENOME="dir/genome.fa.masked"
CPU=10

seqkit split2 protein.faa -p 100 -O for_exonerate_faa
nohup ls -1 for_exonerate_faa/ | xargs -n 1 -P ${CPU} -I {} \
sh -c 'exonerate -q for_exonerate_faa/"$1" \
  -t "$2" --model protein2genome --querytype protein --targettype dna \
  --showvulgar no --showalignment no --showtargetgff yes --showcigar no \
  --verbose 0 --softmasktarget TRUE > "$1".gff' _ {} ${GENOME} &
  
cat *.gff > exonerate.gff
exonerate2hints.pl --in=exonerate.gff--source=P --out=exonerate.hints

Bash

ヒントファイルをまとめて実行

ヒントファイルをまとめるが、デフォルトのコンフィグにはヒントの情報源（src）を、リピート（RM）、トランスクリプトーム（E）、タンパク質（P）に設定したファイルがないので自分で追加する。
…augustus/config/extrinsic/の下に以下の様なファイルを作る。

例えばこんな感じ（extrinsic.M.RM.E.P.cfg）

# extrinsic information configuration file for AUGUSTUS
#
# include with --extrinsicCfgFile=filename
# date: 15.4.2015
# Mario Stanke (mario.stanke@uni-greifswald.de)


# source of extrinsic information:
# M manual anchor (required)
# P protein database hit
# E EST/cDNA database hit
# C combined est/protein database hit
# D Dialign
# R retroposed genes
# T transMapped refSeqs
# W wiggle track coverage info from RNA-Seq

[SOURCES]
M RM P E

#
# individual_liability: Only unsatisfiable hints are disregarded. By default this flag is not set
# and the whole hint group is disregarded when one hint in it is unsatisfiable.
# 1group1gene: Try to predict a single gene that covers all hints of a given group. This is relevant for
"# hint groups with gaps, e.g. when two ESTs, say 5' and 3', from the same clone align nearby."
#
[SOURCE-PARAMETERS]


#   feature        bonus         malus   gradelevelcolumns
#               r+/r-
#
# the gradelevel colums have the following format for each source
# sourcecharacter numscoreclasses boundary    ...  boundary    gradequot  ...  gradequot
#

[GENERAL]
      start      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1.00E+03
       stop      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1.00E+03
        tss      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1
        tts      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1
        ass      1      1 0.1  M    1  1e+100  RM  1     1    E 1    1    P 1 100
        dss      1      1 0.1  M    1  1e+100  RM  1     1    E 1    1    P 1 100
   exonpart      1  .992 .985  M    1  1e+100  RM  1     1    E 1    1    P 1 1
       exon      1          1  M    1  1e+100  RM  1     1    E 1    1     P 1 1.00E+04
 intronpart      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1
     intron      1        .34  M    1  1e+100  RM  1     1    E 1  1e6    P 1 100
    CDSpart      1     1 .985  M    1  1e+100  RM  1     1    E 1    1   P 1 1.00E+05
        CDS      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1
    UTRpart      1     1 .985  M    1  1e+100  RM  1     1    E 1    1    P 1 1
        UTR      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1
     irpart      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1
nonexonpart      1          1  M    1  1e+100  RM  1     1.15 E 1    1    P 1 1
  genicpart      1          1  M    1  1e+100  RM  1     1    E 1    1    P 1 1

#
# Explanation: see original extrinsic.cfg file
#

Bash

あとは繋げたヒントファイルを入れて実行する。

GENOMESPLIT="genome_splits/"
SPECIES="BUSCO_Sp"
CPU=10

cat repeat.hints RNA.hints exonerate.hints > all.hints

nohup ls -1 ${GENOMESPLIT} | xargs -n 1 -P ${CPU}  -I {} \
sh -c 'augustus "$1/$2" --species="$3" --softmasking=1 \
--extrinsicCfgFile=extrinsic.M.RM.E.P.cfg --alternatives-from-evidence=true \
--hintsfile=all.hints --allow_hinted_splicesites=atac \
> {}.augfin.out' _ ${GENOMESPLIT} "{}" ${SPECIES} &

cat *.augfin.out > augfin.out

Bash

後の処理

そのままだと9列目のIDに被りが生じていたり、ちょっと使いにくいので、整形。

gffread -O augfin.out -o augfin.gff3

#ID を一意にする
awk -F'\t' '{
    OFS="\t";
    if ($3 == "transcript") {
        gsub(/ID=/, "ID="$1, $9);                     # IDに一列目 (scaffold名) を追加
    } else if ($3 == "CDS") {
        sub(/Parent=/, "Parent="$1, $9);             # Parentに一列目 (scaffold名) を追加
    }
    print $1, $2, $3, $4, $5, $6, $7, $8, $9;
}'  augfin.gff3 > augfin.uniqueID.gff3


#gffを標準化
agat_convert_sp_gxf2gxf.pl -g augfin.uniqueID.gff3 -o augfin.uniqueID.agated.gff3

Bash