change-history.org

1 Database schema

Currently we use the Ensembl database schema as a template. A full-featured Ensembl database consists of more than 70 tables. For a gene prediction task that uses Augustus as the annotation engine, we need only 3 of them.

1.1 table 'dna'

Contains the DNA sequence. This table has a 1:1 relationship with the contig records in the seq_region table. Each record in this table also maps one-to-one to a single line of the plain-text file 'dna.txt', in which sequences are stored in the format 'int-id\tsequence'.

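| Column        | Type     | Default value | Description                                                                      | Index       |
|---------------+----------+---------------+----------------------------------------------------------------------------------+-------------|
| seq_region_id | INT(10)  |               | Primary key, internal identifier. Foreign key referencing the seq_region table. | primary key |
| sequence      | LONGTEXT |               | DNA sequence.                                                                    |             |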
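
The 'int-id\tsequence' layout of 'dna.txt' can be illustrated with a small parser. This is only a sketch: DnaRecord and parseDnaLine are hypothetical names that do not appear in the Augustus sources, and rejecting negative identifiers is an assumption made here.

```cpp
#include <climits>
#include <cstdlib>
#include <string>

// Hypothetical in-memory form of one line of dna.txt (one row of table 'dna').
struct DnaRecord {
    int seq_region_id;
    std::string sequence;
};

// Parse "int-id<TAB>sequence". Returns false if the line has no tab
// or if the id is not a plain non-negative integer.
bool parseDnaLine(const std::string& line, DnaRecord& out) {
    const std::string::size_type tab = line.find('\t');
    if (tab == std::string::npos || tab == 0) return false;

    const std::string idText = line.substr(0, tab);
    char* endp = nullptr;
    const long id = std::strtol(idText.c_str(), &endp, 10);
    // The whole id field must have been consumed by strtol.
    if (*endp != '\0' || id < 0 || id > INT_MAX) return false;

    out.seq_region_id = static_cast<int>(id);
    out.sequence = line.substr(tab + 1);  // everything after the first tab
    return true;
}
```

Only the first tab is treated as a separator, so the sequence field is taken verbatim.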

1.2 table 'seq_region'

Stores information about sequence regions. The primary key is used as a pointer into the dna table so that the actual sequence can be obtained, and the coord_system_id allows sequence regions of multiple types to be stored. Contigs are stored with 'coord_system_id=2'. Chromosomes have 'coord_system_id=1' and have no corresponding record in table 'dna'. The relationship between contigs and chromosomes is stored in the assembly table.

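| Column          | Type        | Default value | Description                                    | Index                                |
|-----------------+-------------+---------------+------------------------------------------------+--------------------------------------|
| seq_region_id   | INT(10)     |               | Primary key, internal identifier.              | primary key                          |
| name            | VARCHAR(40) |               | Sequence region name.                          | unique key: name_cs_idx              |
| coord_system_id | INT(10)     |               | Foreign key referencing the coord_system table. | unique key: name_cs_idx; key: cs_idx |
| length          | INT(10)     |               | Sequence length.                               |                                      |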

1.3 table 'assembly'

This is the assembly table structure.

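| Field             | Type             | Null | Key | Default | Extra |
|-------------------+------------------+------+-----+---------+-------|
| asm_seq_region_id | int(10) unsigned | NO   | PRI | NULL    |       |
| cmp_seq_region_id | int(10) unsigned | NO   | PRI | NULL    |       |
| asm_start         | int(10)          | NO   | PRI | NULL    |       |
| asm_end           | int(10)          | NO   | PRI | NULL    |       |
| cmp_start         | int(10)          | NO   | PRI | NULL    |       |
| cmp_end           | int(10)          | NO   | PRI | NULL    |       |
| ori               | tinyint(4)       | NO   | PRI | NULL    |       |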
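
Since chromosomes have no record in table 'dna', their sequence has to be rebuilt from contigs through the assembly rows. The sketch below shows one way this could work. It assumes the usual Ensembl conventions, which this document does not spell out: asm_* columns refer to the chromosome, cmp_* columns to the contig, coordinates are 1-based and inclusive, and ori is 1 for forward or -1 for reverse. AssemblyRow, reverseComplement and assembleRegion are hypothetical names, not functions from Augustus or mysql++, and the rows and sequences are assumed to be already loaded into memory.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical in-memory mirror of one row of table 'assembly'.
struct AssemblyRow {
    int asm_seq_region_id;   // chromosome id
    int cmp_seq_region_id;   // contig id
    int asm_start, asm_end;  // 1-based inclusive, on the chromosome
    int cmp_start, cmp_end;  // 1-based inclusive, on the contig
    int ori;                 // assumed: 1 = forward, -1 = reverse
};

// Reverse-complement a DNA string; characters other than ACGT are kept.
std::string reverseComplement(const std::string& s) {
    std::string out(s.rbegin(), s.rend());
    for (char& c : out) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'T': c = 'A'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'a': c = 't'; break;
            case 't': c = 'a'; break;
            case 'c': c = 'g'; break;
            case 'g': c = 'c'; break;
            default: break;
        }
    }
    return out;
}

// Build chromosome positions [start, end] from contig sequences.
// Positions not covered by any usable row stay 'N'.
std::string assembleRegion(int chrId, int start, int end,
                           const std::vector<AssemblyRow>& rows,
                           const std::map<int, std::string>& dna) {
    assert(start >= 1 && start <= end);
    std::string result(end - start + 1, 'N');
    for (const AssemblyRow& r : rows) {
        if (r.asm_seq_region_id != chrId) continue;
        const int from = std::max(start, r.asm_start);
        const int to = std::min(end, r.asm_end);
        if (from > to) continue;  // row does not overlap the request
        const auto it = dna.find(r.cmp_seq_region_id);
        if (it == dna.end()) continue;
        const int len = r.cmp_end - r.cmp_start + 1;
        // Skip rows that are inconsistent with the stored contig.
        if (r.cmp_start < 1 || len != r.asm_end - r.asm_start + 1 ||
            r.cmp_end > static_cast<int>(it->second.size())) continue;
        std::string piece = it->second.substr(r.cmp_start - 1, len);
        if (r.ori == -1) piece = reverseComplement(piece);
        // piece[0] now corresponds to chromosome position r.asm_start.
        result.replace(from - start, to - from + 1,
                       piece, from - r.asm_start, to - from + 1);
    }
    return result;
}
```

Filling uncovered positions with 'N' is a design choice of this sketch; the actual DbSeqAccess::getSeq implementation may handle gaps differently.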

2 mysql++ API

Currently we use a third-party MySQL API, mysql++, to fetch sequences from the database. I chose it because it is lightweight and works well with the STL.

2.1 install

Install the package:

# apt-get install libmysql++-dev

Alternatively, see docs/INSTALL.md for a complete overview.

2.2 use SSQLS

mysql++ lets the user define a 'Specialized SQL Structure' (SSQLS). At the most superficial level, an SSQLS has a member variable corresponding to each field in the SQL table. 'include/table_structure.h' defines three of them: 'dna', 'seq_region' and 'assembly'.

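#include <mysql++.h>
#include <ssqls.h>   // provides the sql_create_N macros

// sql_create_N(name, compare count, set count, type1, field1, ...)
sql_create_2(dna,
             1, 2,
             int, seq_region_id,
             std::string, sequence)
sql_create_4(seq_region,
             1, 4,
             int, seq_region_id,
             std::string, name,
             std::string, coord_system_id,  // INT(10) in the schema above
             int, length)
// Note: the 'ori' column of the assembly table is not mapped here.
sql_create_6(assembly,
             1, 6,
             int, asm_seq_region_id,
             int, cmp_seq_region_id,
             int, asm_start,
             int, asm_end,
             int, cmp_start,
             int, cmp_end)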

3 cmdline parameters

  • --dbaccess accepts a comma-separated string: "database name,host name,user,passwd,table name".
  • The only parameter without a leading '--' is the query. If '--dbaccess' is given, the query is a name in the 'seq_region' table, so file type detection is skipped in this case.
  • --predictionStart and --predictionEnd work the same way as when the input file is in FASTA or GenBank format.

augustus --dbaccess="fly,localhost,henry,123456,," 3L --predictionStart=100 --predictionEnd=30000000 --species=fly
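
How the '--dbaccess' value breaks down into its fields can be sketched as below. DbAccess, splitCommas and parseDbAccess are hypothetical names, not the code in types.cc. The sketch also assumes that the first four fields are required, that the table name may be empty or missing, and that extra trailing fields (as in the example command above) are ignored.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical holder for the fields of --dbaccess, in documented order.
struct DbAccess {
    std::string database, host, user, passwd, table;
};

// Split on ',' and keep empty fields, including trailing ones.
std::vector<std::string> splitCommas(const std::string& s) {
    std::vector<std::string> out;
    std::string::size_type begin = 0;
    while (true) {
        const std::string::size_type comma = s.find(',', begin);
        if (comma == std::string::npos) {
            out.push_back(s.substr(begin));
            break;
        }
        out.push_back(s.substr(begin, comma - begin));
        begin = comma + 1;
    }
    return out;
}

// Parse "database name,host name,user,passwd,table name".
// Assumption: at least four fields must be present; the table name is optional.
DbAccess parseDbAccess(const std::string& arg) {
    const std::vector<std::string> f = splitCommas(arg);
    if (f.size() < 4) {
        throw std::invalid_argument("--dbaccess needs at least database,host,user,passwd");
    }
    DbAccess a;
    a.database = f[0];
    a.host = f[1];
    a.user = f[2];
    a.passwd = f[3];
    if (f.size() > 4) a.table = f[4];
    return a;
}
```

A hand-written split is used instead of std::getline on a string stream because getline silently drops a trailing empty field.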

4 modification

| File               | Description |
|--------------------+-------------|
| Makefile           | Added 2 header paths and 2 library paths; added -Wl,-rpath=/your/run-timelib/path. |
| types.cc           | Lines 322-324: commented out an exception throw so that 'dbaccess' is allowed in single mode. I did not want to change this behavior at the system level, so I only commented it out. |
| types.cc           | Reordered the --dbaccess fields to "database name,host name,user,passwd,tablename". |
| randaccess.{hh,cc} | Implemented the AnnoSequence* DbSeqAccess::getSeq method. Gave class DbSeqAccess a mysqlpp::Connection object. |
| genbank.cc         | GBSplitter(string fname), line 526: if the input fname is a name in the 'seq_region' table of the database, file type detection is skipped. |
| table_structure.h  | In 'trunks/include/mysqlppheader', added 3 SSQLS: 'dna', 'seq_region', 'assembly'. |

Author: yuqiulin <yuqiulin@genomics.cn>

Date: 2012-06-09 Sat