DNA Sequence formats
[Plain] [EMBL] [FASTA]
[GCG] [GenBank] [IG]
[IUPAC]
A sequence in plain format may contain only IUPAC characters and spaces (no numbers!).
Note: A file in plain sequence format may only contain one sequence, while most other formats accept several sequences in one file.
An example sequence in plain format is:
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
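The plain-format rule above can be checked mechanically. The following is a minimal sketch, not a reference implementation: the helper names `is_plain_sequence` and `read_plain` are hypothetical, and it assumes that line breaks count as permitted whitespace (as in the example) and that lower-case letters are acceptable.

```python
import re

# IUPAC nucleotide codes as listed in the IUPAC section of this page.
IUPAC_DNA = "ACGTURYKMSWBDHVN"

# Assumption: any whitespace (spaces, tabs, line breaks) is allowed,
# and letters may be upper or lower case.
_PLAIN_RE = re.compile(rf"[{IUPAC_DNA}\s]+", re.IGNORECASE)


def is_plain_sequence(text: str) -> bool:
    """Return True if text holds only IUPAC letters and whitespace."""
    return bool(text.strip()) and _PLAIN_RE.fullmatch(text) is not None


def read_plain(text: str) -> str:
    """Remove all whitespace and return the sequence in upper case."""
    if not is_plain_sequence(text):
        raise ValueError("not a plain-format sequence")
    return "".join(text.split()).upper()
```

With this sketch, a line such as `1 AACCTG` is rejected because of the leading number, while `aacc tg` is read as `AACCTG`.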
A sequence file in EMBL format can contain several sequences.
One sequence entry starts with an identifier line ("ID "), followed by
further annotation lines. The start of the sequence is marked by a line
starting with "SQ" and the end of the sequence is marked by two slashes
("//").
An example sequence in EMBL format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc 60
tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg 120
ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc 180
tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc 237
//
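The three markers described above ("ID", "SQ" and "//") are enough to split an EMBL file into entries. The sketch below is illustrative only: `parse_embl` is a hypothetical helper, it takes the second field of the ID line as the entry name, and it assumes that every non-letter character on a sequence line (blanks and the trailing position counter) can be discarded.

```python
from typing import Iterator, List, Optional, Tuple


def parse_embl(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (entry_name, sequence) for each entry in EMBL-formatted text."""
    name: Optional[str] = None
    in_sequence = False
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith("ID"):
            # Assumption: the entry name is the second field of the ID line.
            fields = line.split()
            name = fields[1] if len(fields) > 1 else ""
            in_sequence = False
            chunks = []
        elif line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(chunks).upper()
            name = None
            in_sequence = False
            chunks = []
        elif in_sequence:
            # Keep letters only; this drops blanks and the position counter.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
```

Annotation lines between "ID" and "SQ" are skipped here; a fuller reader would keep them.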
A sequence file in FASTA format can contain several sequences.
One sequence in FASTA format begins with a single-line description, followed
by lines of sequence data. The description line must begin with a greater-than
(">") symbol in the first column.
An example sequence in FASTA format is:
>U03518 Aspergillus awamori internal transcribed spacer 1 (ITS1)
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC
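Because every record starts with a ">" line, a multi-sequence FASTA file can be read in a single pass. This is a minimal sketch under stated assumptions: `parse_fasta` is a hypothetical helper, whitespace inside sequence lines is removed, and any text before the first ">" line is ignored.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) for each record in FASTA text."""
    description: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            # A new description line closes the previous record, if any.
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:].strip()
            chunks = []
        elif description is not None:
            # Sequence data: drop any embedded whitespace.
            chunks.append("".join(line.split()))
    if description is not None:
        yield description, "".join(chunks)
```

For the example above, the description would be the text after ">" and the sequence would be the four data lines joined into one 237-character string.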
A sequence file in GCG format contains exactly one sequence. It begins with
annotation lines, and the start of the sequence is marked by a line ending
with two dots (".."). This line also contains the sequence identifier,
the sequence length and a checksum. This format should only be used if
the file was created with the GCG package.
An example sequence in GCG format is:
ID AA03518 standard; DNA; FUN; 237 BP.
XX
AC U03518;
XX
DE Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
DE rRNA and 5.8S rRNA genes, partial sequence.
XX
SQ Sequence 237 BP; 41 A; 77 C; 67 G; 52 T; 0 other;
AA03518 Length: 237 Check: 4514 ..
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
The newer GCG-RSF format can contain several sequences in one file. Like the
single-sequence GCG format, it should only be used if the file was created
with the GCG package.
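For the single-sequence GCG format, the line ending in ".." carries the identifier, length and checksum, so it can serve as both a divider and a sanity check. The sketch below is an assumption-laden illustration: `parse_gcg` and the regular expression are hypothetical, the header is assumed to use the `Length:` and `Check:` labels shown in the example, and the checksum is only extracted, not recomputed.

```python
import re
from typing import Tuple

# Assumed header layout: "<name> Length: <n> ... Check: <n> .."
_GCG_HEADER = re.compile(
    r"^\s*(\S+)\s+Length:\s*(\d+)\s+.*?Check:\s*(\d+)\s+\.\.\s*$"
)


def parse_gcg(text: str) -> Tuple[str, int, int, str]:
    """Return (name, length, checksum, sequence) from GCG-formatted text."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.rstrip().endswith(".."):
            match = _GCG_HEADER.match(line)
            if match is None:
                raise ValueError("unrecognised GCG header line: " + line)
            name = match.group(1)
            length = int(match.group(2))
            check = int(match.group(3))
            # Keep letters only; this drops blanks and position numbers.
            body = "\n".join(lines[i + 1:])
            sequence = "".join(ch for ch in body if ch.isalpha())
            if len(sequence) != length:
                raise ValueError("sequence length does not match header")
            return name, length, check, sequence
    raise ValueError("no line ending in '..' found")
```

Comparing the parsed length with the header value catches truncated files; verifying the checksum itself would require the GCG checksum algorithm, which is not covered here.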
A sequence file in GenBank format can contain several sequences.
One sequence in GenBank format starts with a line containing the word
"LOCUS", followed by a number of annotation lines. The start of the sequence
is marked by a line containing "ORIGIN" and the end of the sequence is marked
by two slashes ("//").
An example sequence in GenBank format is:
LOCUS AAU03518 237 bp DNA PLN 04-FEB-1995
DEFINITION Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION U03518
BASE COUNT 41 a 77 c 67 g 52 t
ORIGIN
1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
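The LOCUS, ORIGIN and "//" markers can be used to pull the sequences out of a GenBank file. This is only a sketch: `parse_genbank` is a hypothetical helper, it assumes the markers appear at the start of their lines, and it takes the second field of the LOCUS line as the entry name.

```python
from typing import Iterator, List, Optional, Tuple


def parse_genbank(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (locus_name, sequence) for each entry in GenBank text."""
    name: Optional[str] = None
    in_origin = False
    chunks: List[str] = []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            # Assumption: the locus name is the second field of the line.
            fields = line.split()
            name = fields[1] if len(fields) > 1 else ""
            in_origin = False
            chunks = []
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if name is not None:
                yield name, "".join(chunks).upper()
            name = None
            in_origin = False
            chunks = []
        elif in_origin:
            # Keep letters only; this drops the leading position numbers.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
```

Applied to the example above, this should give the name `AAU03518` and the same 237 bases as the other formats, in upper case.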
A sequence file in IG format can contain several sequences. Each consists
of a number of comment lines that must begin with a semicolon (";"), a
line with the sequence name (it may not contain spaces!) and the sequence
itself, terminated with the character '1' for linear or '2' for circular
sequences.
An example sequence in IG format is:
; comment
; comment
U03518
AACCTGCGGAAGGATCATTACCGAGTGCGGGTCCTTTGGGCCCAACCTCCCATCCGTGTCTATTGTACCC
TGTTGCTTCGGCGGGCCCGCCGCTTGTCGGCCGCCGGGGGGGCGCCTCTGCCCCCCGGGCCCGTGCCCGC
CGGAGACCCCAACACGAACACTGTCTGAAAGCGTGCAGTCTGAGTTGATTGAATGCAATCAGTTAAAACT
TTCAACAATGGATCTCTTGGTTCCGGC1
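The IG layout (comments, name line, sequence ending in '1' or '2') can be read with a small state machine. The sketch below is hypothetical (`parse_ig` is not part of any package); it assumes comment lines only occur before the name line and silently drops a final sequence that lacks a termination character.

```python
from typing import Iterator, List, Optional, Tuple


def parse_ig(text: str) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, topology) for each record in IG text."""
    name: Optional[str] = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if name is None:
            if line.startswith(";"):
                continue  # comment lines precede the name line
            name = line
            chunks = []
            continue
        if line[-1] in "12":
            # Termination character: '1' = linear, '2' = circular.
            chunks.append(line[:-1])
            topology = "linear" if line[-1] == "1" else "circular"
            yield name, "".join(chunks).replace(" ", ""), topology
            name = None
        else:
            chunks.append(line)
```

Because the termination digit is checked only on sequence lines, a name such as `U03518` is not mistaken for the end of a record.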
The following letters can be used in DNA sequences, including letters that represent ambiguous positions. They follow the rules of the
International Union of Pure and Applied Chemistry (IUPAC):
A = adenine
C = cytosine
G = guanine
T = thymine
U = uracil
R = G or A (purine)
Y = T or C (pyrimidine)
K = G or T (keto)
M = A or C (amino)
S = G or C
W = A or T
B = G, T or C
D = G, A or T
H = A, C or T
V = G, C or A
N = A, G, C or T (any)
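One common use of these codes is searching a sequence for a pattern that contains ambiguous positions. The sketch below turns the table above into a lookup and builds a regular expression from it; the names `IUPAC_TO_BASES` and `iupac_to_regex` are hypothetical, and the pattern matches upper-case sequences only.

```python
import re

# Each IUPAC code mapped to the bases it stands for (from the table above).
IUPAC_TO_BASES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC",
    "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT",
}


def iupac_to_regex(pattern: str) -> re.Pattern:
    """Compile an IUPAC pattern into a regex over unambiguous bases."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_TO_BASES[code]  # raises KeyError for unknown letters
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))
```

For instance, the pattern `GAYTC` compiles to `GA[TC]TC`, which matches both `GACTC` and `GATTC` but not `GAGTC`.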
Copyright © Genomatix Software GmbH 1998-2002 - All rights reserved