Alan Wardroper

Project - Spring, 2004: Extending BioPython::Uses of Python in Bioinformatics

Supervisor: James Cussens

Includes:

Pages	…27
Words	7372
References	….1
Figures	….7
Appendices	….7.

Disc:

    |____dna_manip

        |___Project2.py

        |___dna_manip.py

        |___seq.fas

        parser

        |___Embl2Fasta.py

        |___embl2fasta_script.py

        |___sampleEMBL.dat

        |___multiEMBL.dat

        writeup

        |___extendingpython.doc

        |___extendingpython.pdf

        |___extendingpython_pres.ppt

Background

Python ⁽ [1] ^, [2] ^, [3] ^, [4] ⁾ is an Open Source [i] ⁽ [5] ⁾, dynamic, interpreted, Object-Oriented Programming (OOP) language, comparable to Perl ⁽ [6] ^, [7] ^, [8] ⁾, Tcl, Scheme, or Java, which is useful for scientific computing in general and for Bioinformatics in particular. Similar to Perl, Python is a high level scripting language, which offers the power and flexibility of compiled system languages, such as C, combined with ease of use more often associated with Unix shell scripting. Being a high level language, Python has a (small) number of useful data types built in, and provides error checking, memory management, and a garbage collector, all of which speed up development and reduce debugging requirements as compared with C. These features also make Python relatively easy to learn and use.

Active development of the Biopython ⁽ [9] ^, [10] ⁾ toolkit for Computational Molecular Biology is facilitating its application in Biology. However, Biopython is a relatively new project and its file format handling capabilities are not as mature as those of BioPerl ⁽ [11] ⁾. The aim of this project was the development of a module to extend the ability of Biopython to manipulate biological data. This report describes a module and an accompanying sample script developed to facilitate extraction of FASTA format data from an EMBL format file.

Python

Python is an object-oriented language by design, with classes and multiple inheritance; in Python, almost everything is either an object (including the base variable types) or a method that operates on those objects and all calls are references ^{(2
,3 ,4
)}. Although it is possible to use Python simply as a procedural language, to make best use of its capabilities as an object-oriented language it is necessary to think in terms not of functions and operators but of classes and methods. Instead of the manipulation of internal data types using a large number of built-in functions, such as in the Perl core language, as a fully object-oriented language Python uses a combination of internal and external classes, methods, and other objects. As Python modules are treated as classes, compiled bytecode versions can be imported at runtime rather than the source code itself, avoiding the overhead associated with interpretation and thus optimizing performance. The language has built-in memory management allowing a rapid development cycle, and can call and interoperate with both other programs and the underlying operating system. The language has a powerful standard function library, implemented in a series of interrelated modules, and is extensible through the addition of further modules and packages, allowing it to scale well between small and large projects.

Python has several advantages over other languages, including a clean, consistent syntax, which aids in both learning the language and in ensuring code readability – overcoming one of the most frequent criticisms of Perl. These features allow the programmer to focus more on the programming task at hand rather than the nuances of the language itself.

Python has four basic data types: i.e., numbers (integers, floating point, complex, and unlimited-length long integers), strings , lists (analogous to arrays in C or Perl), and dictionaries (analogous to hashes). Python is strongly typed, raising an exception at any attempt to mix incompatible types (e.g., attempting to add a string and a number). However, the language is dynamic with no need to declare variables prior to use and the same variable can refer to objects of different types depending upon assignment over the course of its existence. In common with the other high level languages mentioned above, Python has automatic garbage collection obviating the need to manually perform memory management.

The Python interpreter allows code to be entered directly at the command line in interactive mode. Using this interactive mode, it is possible to test short pieces of code where each command produces immediate results, which aids in both learning the language itself and when exploring a new feature.

In addition, as Python like Perl is written in C, it is highly extensible though the addition of new modules implemented in C or C++; when speed of execution is an issue, an algorithm can be coded in C or C++ and made available as a module, which can then be used in the same way as pure Python modules. In addition, Jython, a version of the Python interpreter written in Java, allows Java objects to be used within Python and Java programs to be written using Python syntax and dynamic language features. There are also other utilities to make code written in Fortran accessible to Python programs. Thus, Python programmers can access and extend existing code written in other languages.

Comparison of Python and Perl

Perl has seen wide acceptance in Bioinformatics due to its powerful text processing capabilities, which make it well suited to the manipulation of biological sequence data – DNA, RNA, and protein sequences can all be represented and manipulated as simple text strings. The large Perl user base in the Bioinformatics community has led to the development of BioPerl ^{( [12] )}, an ever-expanding collection of Perl modules developed specifically for application in Molecular Biology.

The most obvious of the superficial differences when moving between Perl and Python is that in the latter whitespace is significant in the syntax, with common indentation used to delimit statement groups rather than the brace syntax used in C, Java, and Perl. However, due to the consistency of its implementation in Python, this quickly begins to feel quite natural corresponding closely to a normal indentation style and does not present a significant barrier to adoption of the language ^{(
[13] )}.

As Perl was originally developed for relatively small programming tasks, such as sorting lists and preparing reports (with one version of the acronym reading Practical Extraction and Reporting Language), its application to larger programs has required a number of features to be later patched on to address the increased complexity; most notably, the object-orientated features of Perl are supplied as an option, implemented through external modules although there is currently a move to reimplement the core language in an OOP framework ⁽¹¹⁾. This “bolted together” aspect of the language can make large quantities of Perl code unreasonably opaque, even for the original author. In contrast, and as described above, Python is an object-oriented language from inception.

As Python is a relatively new language, with development beginning in 1990, 3 years after the initial release of Perl, Python’s use in biology is much less pervasive than that of Perl. However, with the commencement in 1999 of the BioPython project to develop a toolkit for Computational Molecular Biology, it is rapidly gaining acceptance in the field.

Biopython

The Biopython Project, begun in 1999, provides a series of freely available Python tools to accomplish common Computational Molecular Biology tasks. Similar to BioPerl, Biopython provides a set of reusable libraries implemented as a number of modules, scripts, and packages specialised for life science research, focusing on file parsing, interaction with online resources, interfaces to commonly used bioinformatics programs, sequence analysis tools, and database access ^{(9
,10 )}. Although a relatively new project, Biopython has a great deal of functionality as outlined in Appendix A, with modules arranged in a hierarchical, OOP framework. The Biopython Project provides very good documentation, with a tutorial and code cookbook ^{(9 )}, and there are a number of very good third-party online tutorials ^{(10
)}.

Project Outline

The diversity of biological data base formats necessitates the extraction of desired fields from an input record and conversion to other formats for later analysis. The main aim of this project was to extend the ability of Biopython to manipulate biological data through development of a module for handling a file format that is currently unsupported. As outlined in Appendix A, the Biopython documentation lists a number of supported file formats, foremost of which is FASTA. However, at present there is no support for parsing EMBL flatfile database records. This report describes a module and an accompanying sample script developed to facilitate extraction of FASTA format data from an EMBL format file for use in other applications that require input in this format.

The project consisted of three programming tasks in Python, which will be discussed here in turn. First, a simple examination of Python’s usefulness for biological applications was performed, with the development of a module and accompanying script to allow common manipulations of DNA sequence data. Then, an attempt was made to build a class framework for contig assembly based on BioPerl’s Assembly::Contig module that could be similarly integrated into Biopython (see Results and Discussion). Finally, a module and accompanying script were developed to parse files in the embl format.

Examination of Python’s Usefulness in Biological Applications

DNA sequence data can be represented and manipulated computationally as text strings. Here, several common biologically relevant tasks were selected, and appropriate code was written in Python to accomplish them. The tasks were as follows: determine the length of the sequence, transcribe the DNA sequence into RNA, return the complement and reverse complement of the DNA sequence, determine the G+C content, identify the codons in the sequence, and translate the DNA sequence into the corresponding protein using Biopython’s Bio.Translate module.

EMBL to FASTA Parser

As mentioned above, it is often necessary to convert an input record into other formats for processing using other programs. As Biopython includes no tools for processing data in EMBL format, a module was written to parse EMBL database records and output into the well-supported FASTA format.

EMBL File Format

In contrast to the GenBank file format ^{( [14] )}, the EMBL format does not include a series of header lines ^{( [15]
, [16] )}. Instead, the first sequence entry of the file begins on the first line, and each line begins with a 2-character identifier tag as shown in Appendix B. Each file may contain one or multiple sequences.

The fields relevant to parsing into FASTA format for the purposes of this project are shown in Fig. 1.

Fig. 1 EMBL record fields relevant to parsing into FASTA format.

Line	Tag	Meaning	Column
First line:	ID	EMBL identifier	Col. 6-14
	AC	Accession number (contains db identifier)	Col. 6-11
	DE	Free form text definition	Col. 6-72
	KW	Keyword(s)	Col. 6-72
	OS	Organism Species	Col. 6-72
	SQ	Length of the sequence, with nucleotide composition	Beginning at or after Col. 13. Sequence length is followed by a blank space and ‘BP’ then nucleotide composition.
		The sequence entry itself.	Col. 6-70, followed by 10 columns with right-justified nucleotide position numbers, left-filled with spaces.
Last line:	//	Terminator line

The EMBL file format does not include a series of header lines, but the first sequence entry of the file begins on the first line. Each line has a 2-letter identifier tag as shown above.

Each line of the sequence begins with four blank spaces, followed by 66 columns holding the nucleotide sequence itself in six blocks of ten nucleotides each. Each of the six blocks begins with a blank space. The first nucleotide occurs in column 6, with the last in column 70.

FASTA File Format

The FASTA format ^{(
[17] )} is described in Fig. 2 and more fully in Appendix C. Briefly, sequences in FASTA format consist of a one-line description, which begins in column 1 with a “>” character followed by the GenInfo Identifier ^{(14 )}, the database from which the sequence was obtained (gb, GenBank; embl, EMBL Nucleotide Sequence Database; ddj, DNA Database of Japan), the species from which the sequence is derived, the gene/protein name, and a freeform description. This is then followed by one or more lines of sequence data of ≤80 characters in length.

Fig. 2 FASTA File Format

>gi|532319|pir|TVFV2E|TVFV2E envelope protein

ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT

QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC

HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYCK

MDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK

TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF

APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKNL

LAAVEAQQQMLKLTIWGVK

FASTA is one of the most common file formats used in biology. Fasta format is very simple, consisting of only two parts: a title line beginning with a “>” character, followed by one or more sequence lines. A more detailed description is given in Appendix C.

Methods

Examination of Python’s Usefulness in Biological Applications

To examine Python’s usefulness for biological applications, a module and accompanying script were developed to allow common manipulations of DNA sequence data. Seven common sequence manipulation tasks were defined in a single class in an importable Python module, named here Project2.py. In keeping with the architecture of Python, the module itself imports attributes from the Biopython Alphabet package, which defines standard character sets representing nucleotides (or amino acids) that may be included in the input data. The Project2.py module makes use of the Bio.Alphabet.IUPAC module, and for simplicity in this application defines the input as an IUPACUnambiguousDNA instance. In addition, the accompanying script dna_manip.py also imports, the standard Python sys module defining system-specific parameters and functions, allowing the input file to be passed into the script as a command line argument, as well as Biopython’s Bio.Translate to allow translation of the input DNA sequence into the corresponding amino acid sequence and Bio.SeqIO.FASTA.

The source code for Project2.py is presented in Appendix D, along with that for the accompanying script dna_manip.py in Appendix E. The Project2.py module can be imported from essentially any directory accessible to Python, e.g., from the current directory (from Project2 import *) or from the top-level Biopython directory (from Bio.Project2 import *). Scripts can make use of this module to produce a variety of output – length of the input sequence, codons, complement and reverse complement of the DNA sequence, G+C content, and translated amino acid sequence (via Bio.Translate) – that can then be passed into other applications for further processing.

The module contains a single class, Sequence, with the following methods as detailed in Fig. 3:

[‘__init__’, ‘__len__’, ‘codons’, ‘compString’, ‘gcCont’, ‘revComp’, ‘revSeq’, ‘transcribe’]

In addition, the class defines a Python dictionary specifying purine:pyrimidine key:value pairs, which is used to generate complementary sequences based on the input string.

dnaComp = {'A': 'T', 'C': 'G', 'T': 'A', 'G': 'C'}

Once the module has been imported, each of the methods can be called from a very simple Python script in the standard manner, as shown in Fig. 4. Appendix E presents the complete source code for the demonstration script dna_manip.py indicating use of all methods belonging to the Sequence class.

Fig. 3 Project2.Sequence Class Structure

Class: Sequence
	Method
	__init__	Constructor
		__init__(self, data, alphabet = Alphabet.generic_alphabet)
	__len__	Returns length of input string
		__len__(self)
	codons	Returns list of codons
		codons(self)
	compString	Returns complementary sequence
		CompString(self)
	gccont	Returns G+C content
		gcCont(self)
	revComp	Returns reverse complement
		revComp(self)
	revSeq	Returns reversed sequence
		RevSeq(self)
	transcribe	Transcribe to aminoacid sequence
		transcribe(self)

Fig. 4 Simple Outline of Project2 Module Usage

#!/usr/bin/env python

#import the module

from Project2 import *

#Define alphabet

alphabet = IUPAC.unambiguous_dna

#New Sequence class instance

new_seq = Sequence('ACGCTCCTCGCTCGCT', alphabet)

#View the input

print "The input sequence is 5'-%s-3' \n" %new_seq.data

#Process the input

new_var = new_seq.method_name()

#View the output

print "The output produced by method_name is %s\n" %new_var

A simplified outline demonstrating usage of the Project2 module, where method_name can be any of:

[‘__init__’, ‘__len__’, ‘codons’, ‘compString’, ‘gcCont’, ‘revComp’, ‘revSeq’, ‘transcribe’]

Method Details

__init__

def __init__(self, data, alphabet = Alphabet.generic_nucleotide):

self.data = data

self.data = data.rstrip().upper()

self.alphabet = alphabet

Constructor to create new DNA sequence with value from input data string. The alphabet is set to Bio.Alphabet.generic_nucleotide.

The string methods rstrip() and upper() are used to remove trailing whitespace characters in the input string and to convert to UPPERCASE, respectively.

__len__

def __len__(self):

return len(self.data)

Method applies len() function on input data and returns the length of the input sequence.

codons

def codons(self):

s = self.data

end = len(s) - (len(s) % 3) - 1

codons = [s[i:i+3] for i in range(0, end, 3)]

return codons

Method identifies the codons in the input DNA string by applying modulus 3 to len(string) to determine the end point for codons to return whole number of codons even if DNA string is not evenly divisible by 3. range()returns a list from beginning to end, incrementing by 3. String slicing is used to produce a list of 3-character codons as strings.

compString

def compString(self):

baseSeq = list(self.data)

try:

baseSeq = [self.dnaComp[base] for base in baseSeq]

return ''.join(baseSeq)

except:

sys.stderr.write ("Can't create complement from %s %s %s\n" %(filename, sys.exc_type, sys.exc_value))

sys.exit(40)

Method uses list comprehension (baseSeq = [self.dnaComp[base] for base in baseSeq]) to construct a new list with each base in the input list replaced with its complement defined in the dnaComp dictionary.

gcCont

def gcCont(self):

s = self.data

gcCont = s.count('G') + s.count('C')

return gcCont * 100.0 / len(s)

Method uses string count() to return total number of ‘G’ and ‘C’ nucleotides in the input string, and return as percentage of the total length given by len().

revComp

def revComp(self):

baseSeq = list(self.data)

baseSeq.reverse()

baseSeq = [self.dnaComp[base] for base in baseSeq]

return ''.join(baseSeq)

Method uses string.reverse() and list comprehension to return the reverse complement of the input DNA string.

revSeq

def revSeq(self):

baseSeq = list(self.data)

baseSeq.reverse()

return ''.join(baseSeq)

Method creates a list from a string using the list() function, and reverses the order of the list with reverse().

The list elements are then joined on an empty string with the string join() method and returned as a string.

transcribe

def transcribe(self):

return self.data.replace('T', 'U')

Method accepts a nucleotide sequence string and applies the string method replace() to alter ‘T’ to ‘U’, transcribing DNA to RNA.

EMBL to FASTA Parser

Using a simple object-oriented framework, a module (Embl2Fasta.py) and accompanying script (embl2fasta_script.py) were developed for parsing EMBL format data files into FASTA format. The module consists of a single class, Embl2Fasta, with a __call__ constructor instead of the more usual __init__ to allow the module to be called as a callable object. The Embl2Fasta class has the following five methods as detailed in Fig. 5:

[‘acfunc’, ‘defunc’, ‘idfunc’, ‘kwfunc’, ‘osfunc’]

Calls to these methods in the normal instance.method format are given as values in dictionary key:value pairs to process each line of input and extract relevant information to build the fasta description line.

The EMBL input file is read in via a for line in fileinput.input() loop and processed via a nested for <value> in <dictionary_name> loop as shown in Fig. 6 or in more depth in Appendix G, which presents the complete source code for the embl2fasta_script.py sample script. This script uses the new (as of Python 2.3) extended call syntax format: dictionary_name[index](*[line]). For older Python releases, the apply() built-in function can be used: apply(dictionary_name[index],[line]).

Fig. 5 Embl2Fasta.Embl2Fasta

Class: Embl2Fasta
	Method
	__init__	Constructor
		__init__(self, line)
	Acfunc	Processes AC line (accession number)
		acfunc(self, line)
	Defunc	Processes DE line (text definition)
		defunc(self, line)
	Idfunc	Processes ID line (EMBL identifier)
		idfunc(self, line)
	Kwfunc	Processes KW line (keywords)
		kwfunc(self, line)
	Osfunc	Processes OS line (organism species)
		osfunc(self, line)

Fig. 6 Simplified Example of Embl2Fasta Module Usage

#!/usr/bin/env python

#import : custom module

from Bio.Embl2Fasta import *

#new instance

newE2F = Embl2Fasta()

#Set variables to ''

sq = ''

fasta_output = ''

#Dictionary with method_calls as values in key:value pairs

dictionary = {'ID': newE2F.idfunc}

#Read in file

for line in fileinput.input():

header = line[0:2] #define search position

# if header == 'ID': do things to stuff

if header in dictionary:

dictionary[header](*[line])

#Match sequence lines - remove whitespace/digits using list defined in Embl2Fasta

elif re.search(newE2F.matchstr, line): sq = sq+line

#When reach the record end marker //: do things to stuff

elif header == "//":

#Write out ID to fasta header line

fasta_output = fasta_output+("\n>%s" %newE2F.id)

#Remove whitespace and position numbers from sequence

for s in newE2F.killwhitey:

sq = replace(sq, s, "").upper() #make sequence upper case

#Combine header & sequence, and assign to output variable

fasta_output = fasta_output+'\n'+sq[:-1].upper()

print fasta_output #print fasta output to stdout

sq = '' #Reset sequence to empty to flush for multisequence files

else:

pass #Ignore all other lines

Simplified script demonstrating usage of the Embl2Fasta module. This example parses the ID line from an EMBL input file using the Embl2Fasta.idfunc() method and prints the result to STDOUT. Other available methods are :

[‘acfunc’, ‘defunc’, ‘idfunc’, ‘kwfunc’, ‘osfunc’]

Method Details

__call__

def __call__(self, line):

self.line = line

Class Constructor. Builds a new instance of the class, which can be called like a callable object.

Makes the module callable.

acfunc

def acfunc(self, line):

self.ac = line

self.ac = line.split()[1].replace(';', '')

Accession number line (AC) is compulsory (See Appendix B).

Method takes line of input, splits on whitespace, retains only the second element (the first ‘word’ of the identifier), and removes the trailing ‘;’ character.

Method returns ‘ac’ to calling script.

defunc

def defunc(self, line):

self.de = line

self.de = line.split("DE ")[1][:-1].replace('.', '')

Freeform text description line (DE) is compulsory (See Appendix B).

Method takes line of input, splits after identifier (‘DE’+3 spaces), retains everything from the second to the last element minus the newline (a freeform text string), and removes embedded periods.

Method returns ‘de’ to calling script.

idfunc

def idfunc(self, line):

self.id = line

self.id = line.split()[1]

Identification line (ID) is compulsory with exactly 1 entry per file (See Appendix B).

Method takes line of input, splits on whitespace, and retains only the second element (the first ‘word’ of the identifier).

Method returns ‘id’ to calling script.

kwfunc

def kwfunc(self, line):

self.kw = line

self.kw = line.split("KW ")[1][:-1].replace('.', '')

Keyword line (KW) is compulsory (See Appendix B).

Method takes line of input, splits after identifier (‘KW’+3 spaces) , retains everything from the second to the last element minus the newline (a freeform text string), and removes embedded periods.

Method returns ‘kw’ to calling script.

osfunc

def osfunc(self, line):

self.os = line

self.os = line.split("OS ")[1][:-1].replace('.', '')

Organism Species line (OS) is compulsory (See Appendix B).

Method takes line of input, splits after identifier (‘OS’+3 spaces) , retains everything from the second to the last element minus the newline (a freeform text string), and removes embedded periods.

Method returns ‘os’ to calling script.

The class Embl2Fasta also provides a regular expression definition – matching multiple lines beginning with 5 spaces – to match the actual sequence data in the input Embl file. The regex is compiled once when the module is first imported and then run when called in the loop:

matchstr = re.compile("^\s{5}", re.MULTILINE)

Finally, the class definition includes a list definition to allow easy removal of embedded whitespace and digits from the sequence data, which are included in the EMBL specification but not in the FASTA format definition. The calling script compares each sequence element against this list and replaces whitespace characters, including newlines, and the digits embedded as sequence position markers:

Embl2Fasta.py

…

remove_whitespace = [' ', '0', '1', '2', '3','4', '5', '6', '7', '8', '9', '.', '\r','\n']

…

Calling script

…

for s in remove_whitespace: x = replace(x,s,'')

…

Results and Discussion

The development of a simple Python module containing objects to manipulate DNA sequence strings proved relatively straightforward. The language’s consistent syntax coupled with the Python interpreter’s interactive mode (Fig. 7) allowed for quick experimentation with code snippets, facilitating learning the language and yielding rapid results. Python is a well-designed language ^{(1
)}, and with only a little experience its use begins to feel quite natural.

Fig. 7 Python Interactive Mode

The Python interpreter provides an interactive mode, where commands can be input directly at the command line, with the result returned immediately. Interactive mode is useful for experimenting with code snippets to quickly determine their results. The primary interpreter prompt is three greater-than signs (">>> "), and the secondary prompt for continuation lines is three dots ("... ").

The module Project2.py implemented a number of key features of Python and was useful as a learning aid as well as proof-of-concept that it is possible to do useful “things” with biological “stuff” [ii] with relatively few lines of code and in a relatively short time.

The simple script written to accompany the module successfully:

Determined the length of the input sequence.

Transcribed the dna sequence into rna.

Returned the complement and reverse complement of the input sequence.

Determined the g+c content.

Identified the codons in the sequence.

Translated to the corresponding protein sequence (using Bio.Translate).

Of course, the code could be improved most notably by inclusion of more robust error checking, for example to raise an exception if the script is passed non-nucleotide characters or incorrectly formated data. However, as this was a preliminary exercise, this was beyond the scope of this part of the project.

As the various biological databases grew up independently, each developed its own unique and usually incompatible format(s) for storing sequence data. As the field has grown and it has become necessary to both share data between databases and to utilise data from a variety of sources, it has become necessary to develop computational tools for conversion between formats ^{( [18] , [19] )}.

Although Biopython includes tools for handling a range of important file types, the EMBL file format is currently unsupported. Therefore, in this project a module was developed to parse EMBL records and output in FASTA format, one of the most common file formats used in biology with support in a wide range of programs.

The Embl2Fasta.py module and accompanying embl2fasta_script.py script allowed extraction of FASTA format data from EMBL format files, including multi-record files. Using a single class with a simple set of five string manipulation methods, it was possible to loop through each line in the EMBL file and extract relevant data for inclusion in the FASTA title and sequence fields.

Further development of this module would include integration into the Biopython parser framework, making use of the standard event-driven system of Scanner and Consumer objects. More robust object-based error checking would also be required

An attempt was also made to build a Biopython class framework for contig assembly based on the Assembly::Contig module in BioPerl. Each class and method in the Contig.pm module was first translated directly into Python where comparable functions were available ^{(13 )}, with a view toward fully reworking the code into pure Python. However, this was not successful mainly due to the complexity and high degree of interrelatedness of the BioPerl heirarchy. While the OOP approach can make it relatively simple to implement a new package for an existing system, the multiple layers of inheritance that facilitate object-oriented coding within a system actually hinders development across platforms. The BioPerl Contig module inherits heavily from a number of superclasses that have no equivalent in BioPython. This degree of interdependence, coupled with the obfuscated nature of Perl code and often inaccessible or incomplete documentation, prevented reimplementation of BioPerl’s contig analysis framework for BioPython. Future attmepts in this area would likely be more successful starting with a functional specification and implementation in Python, making full use of the inheritance structure of BioPython iteself.

Conclusions

In conclusion, Python is a powerful development language for computational biology that is both relatively simple to learn and to uncomplicated to use. Although Perl has precedence in Bioinformatics and has both a large and prolific user base, Python is having significant impact in scientific computing. Python offers a robust and flexible architecture with code that is easy to read and understand, allows rapid development, has powerful standard libraries, and is highly scalable from very small to very large programs.

Appendix A: Biopython Functionality [iii]

The main Biopython releases have the following functionality:

Support for the following formats:

Blast output -- both from standalone and WWW Blast
ClustalW
FASTA
GenBank
PubMed and Medline
Expasy files, like Enzyme, Prodoc and Prosite
SCOP, including 'dom' and 'lin' files
Rebase
UniGene
SwissProt

Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.

Code to deal with popular on-line bioinformatics destinations such as:

NCBI -- Blast, Entrez and PubMed services
Expasy -- Prodoc and Prosite entries
Interfaces to common bioinformatics programs such as:
Standalone Blast from NCBI
ClustalW alignment program.

A standard sequence class that deals with sequences, ids on sequences, and sequence features.
Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
Code making it easy to split up parallelizable tasks into separate processes.
GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
Extensive documentation and help with using the modules, including this file, on-line wiki documentation, the web site, and the mailing list.
Integration with other languages, including the BioPerl and BioJava projects, using the BioCorba interface standard (available with the biopython-corba module).

Appendix B: EMBL File Format Description [iv]

The EMBL file format does not include a series of header lines. Instead, the first sequence entry of the file begins on the first line. Each line has a 2-letter identifier tag as shown below. Multiple sequences may appear in each file.Each line begins with a two-character line code, which indicates the type of information contained in the line. The currently used line types, along with their respective line codes, are listed below:

ID	identification	begins each entry; 1 per entry
AC	accession number	≥1 per entry
SV	sequence version	1 per entry
DT	Date	2 per entry
DE	description	≥1 per entry
KW	keyword	≥1 per entry
OS	organism species	≥1 per entry
OC	organism classification	≥1 per entry
OG	organelle	0 or 1 per entry
RN	reference number	≥1 per entry
RC	reference comment	≥0 per entry
RP	reference positions	≥1 per entry
RX	reference cross-reference	≥0 per entry
RG	reference group	≥0 per entry
RA	reference author(s)	≥0 per entry
RT	reference title	≥1 per entry
RL	reference location	≥1 per entry
DR	database cross-reference	≥0 per entry
CC	comments or notes	≥0 per entry
AH	assembly header	0 or 1 per entry
AS	assembly information	0 or ≥1 per entry
FH	feature table header	2 per entry
FT	feature table data	≥2 per entry
XX	spacer line	many per entry
SQ	sequence header	1 per entry
CO	contig/construct line	0 or ≥1 per entry
	blanks sequence data	≥1 per entry
//	termination line	ends each entry; 1 per entry

Some entries will not contain all of the line types, and some line types occur many times in a single entry. As noted, each entry begins with an identification line (ID) and ends with a terminator line (//). The various line types appear in entries in the order in which they are listed above (except for XX lines which may appear anywhere between the ID and SQ lines).

Appendix C: FASTA Format Description [v]

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. An example sequence in FASTA format is:

>gi|44010|emb|X64011.1|LISOD L.ivanovii sod gene for superoxide dismutase

CGTTATTTAAGGTGTTACATAGTTCTATGGAAATAGGGTCTATACCTTTCGCCTTACAATGTAATTTCTT

TTCACATAAATAATAAACAATCCGAGGAGGAATTTTTAATGACTTACGAATTACCAAAATTACCTTATAC

TTATGATGCTTTGGAGCCGAATTTTGATAAAGAAACAATGGAAATTCACTATACAAAGCACCACAATATT

TATGTAACAAAACTAAATGAAGCAGTCTCAGGACACGCAGAACTTGCAAGTAAACCTGGGGAAGAATTAG

TTGCTAATCTAGATAGCGTTCCTGAAGAAATTCGTGGCGCAGTACGTAACCACGGTGGTGGACATGCTAA

CCATACTTTATTCTGGTCTAGTCTTAGCCCAAATGGTGGTGGTGCTCCAACTGGTAACTTAAAAGCAGCA

ATCGAAAGCGAATTCGGCACATTTGATGAATTCAAAGAAAAATTCAATGCGGCAGCTGCGGCTCGTTTTG

GTTCAGGATGGGCATGGCTAGTAGTGAACAATGGTAAACTAGAAATTGTTTCCACTGCTAACCAAGATTC

TCCACTTAGCGAAGGTAAAACTCCAGTTCTTGGCTTAGATGTTTGGGAACATGCTTATTATCTTAAATTC

CAAAACCGTCGTCCTGAATACATTGACACATTTTGGAATGTAATTAACTGGGATGAACGAAATAAACGCT

TTGACGCAGCAAAATAATTATCGAAAGGCTCACTTAGGTGGGTCTTTTTATTTCTA

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and nucleic acid codes, with these exceptions: lower-case letters are accepted and are mapped into upper-case; a single hyphen or dash can be used to represent a gap of indeterminate length; and in amino acid sequences, U and * are acceptable letters (see below). Before submitting a request, any numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for unknown nucleic acid residue or X for unknown amino acid residue).
The nucleic acid codes supported are:

A	à	adenosine	M	à	A C (amino)
C	à	cytidine	S	à	G C (strong)
G	à	guanine	W	à	A T (weak)
T	à	thymidine	B	à	G T C
U	à	uridine	D	à	G A T
R	à	G A (purine)	H	à	A C T
Y	à	T C (pyrimidine)	V	à	G C A
K	à	G T (keto)	N	à	A G C T (any)
-	à	Gap of indeterminate length

Appendix D: Python Source Code for Project2.py

Appendix E: Python Source Code for dna_manip.py

Appendix F: Python Source Code for Embl2Fasta.py

Appendix G: Python Source Code for embl2fasta_script.py

References

[i] Python is copyrighted but placed under an open source license, meaning that it can be freely used and distributed, even for commercial purposes.

[ii] Maintaining the lack of formality of a programming language named after a comedy team, Learning Python ² characterises the language as a way to “do things with stuff”.

[iii] Adapted from Ref. 9 (Chang et al., 2003)

[iv] *Adapted from Ref. 16 (EMBL Nucleotide Sequence Database User Manual Release 78 March 2004: http://www.ebi.ac.uk/embl/Documentation/User_manual/usrman.html#FORMAT)

[v] Adapted from Ref. 17 (NCBI FASTA format description http://www.ncbi.nlm.nih.gov/BLAST/fasta.shtml)