Industrial Agricultural Biotech – Robert D. Fleischmann, Mark D. Adams, Owen White, Hamilton O. Smith, J. Craig Venter, Human Genome Sciences Inc, Johns Hopkins University

Abstract for “Nucleotide sequence for the Haemophilus influenzae Rd gene, fragments thereof and uses thereof”

“The present invention provides the complete sequence of the genome of Haemophilus influenzae Rd, SEQ ID NO:1. The present invention also provides the sequence information on computer-readable media, together with computer-based systems and methods that facilitate its use. In addition to the complete genome sequence, the present invention identifies more than 1700 protein-encoding fragments of the Haemophilus genome and determines their positions relative to a unique Not I restriction enzyme site.

Background for “Nucleotide sequence for the Haemophilus influenzae Rd gene, fragments thereof and uses thereof”

“The complete genome sequence of a free-living cellular organism has never been determined. The first mycobacterium sequence should be completed by 1996, and the sequences of E. coli, S. cerevisiae, and other cellular organisms are expected to be completed before 1998. These projects rely on directed or random sequencing of overlapping cosmid clones. No one has attempted to determine sequences on the order of a megabase or more using a random shotgun approach.

“Interest in the medically important aspects of H. influenzae biology has centered mainly on the genes that determine the organism’s virulence characteristics. A number of genes responsible for the production of the capsular polysaccharide have been identified (Kroll et al., Mol. Microbiol. 5(6):1549-1560 (1991)). Numerous genes encoding outer membrane proteins (OMPs) have also been identified (Langford et al., J. Gen. Microbiol. 138:155-159 (1992)). The lipooligosaccharide component of the outer membrane and the genes of its synthetic pathway are under intensive study (Weiser et al., J. Bacteriol. 172:3304-3309 (1990)). Although a vaccine has been available since 1984, research on outer membrane components has been motivated in part by the need to develop better vaccines. Recently, the catalase gene was characterized and sequenced as a potential virulence-related gene (Bishai et al., in press). Elucidating the H. influenzae genome will improve understanding of how the organism causes disease and of how best to combat it.

“H. influenzae has a highly efficient natural DNA transformation system that has been extensively studied in the non-encapsulated strain Rd (Kahn & Smith, J. Membrane Biol. 81.1-103 (1984)). At least 16 transformation-specific genes have been identified and sequenced. Four of these are regulatory genes (Redfield, J. Bacteriol. 173:5612-5618 (1991); Chandler, Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992)), at least two are involved in recombination (Barouki & Smith, J. Bacteriol. 163(2):629-634 (1985)), and at least seven are targeted to the membranes and the periplasmic space (Tomb et al., Gene 104:1-10 (1991); Tomb, Proc. Natl. Acad. Sci. USA 89:10252-10256 (1992)), where they function as structural components or in the assembly of the DNA transport machinery. H. influenzae Rd transformation shows many interesting features, including sequence-specific DNA uptake, rapid uptake of several double-stranded DNA molecules per competent cell into a membrane compartment called the transformasome, linear translocation of a single strand of the donor DNA into the cytoplasm, and synapsis and recombination with the chromosome through a single-strand displacement mechanism. The H. influenzae Rd transformation system is one of the most thoroughly studied gram-negative systems and differs in many ways from the gram-positive systems.

“The size of the H. influenzae Rd genome has been determined by pulsed-field gel electrophoresis of restriction digests to be approximately 1.95 Mb, about 40% of the size of the E. coli genome (Lee & Smith, J. Bacteriol. 170:4402-4405 (1988)). The restriction map of H. influenzae is circular (Lee et al., J. Bacteriol. 171:3016-3024 (1989); Redfield & Lee, “Haemophilus influenzae Rd”, pp. 2110-2112, in O’Brien, S. J. (ed.), Genetic Maps: Locus Maps of Complex Genomes, Cold Spring Harbor Press, N.Y.). Various genes have been mapped to restriction fragments by Southern hybridization probing of restriction digest DNA bands. This map can be used to verify the assembly of a complete genome sequence from randomly sequenced fragments. GenBank currently contains about 100 kb of non-redundant H. influenzae DNA sequence; about half is from serotype b and half from Rd.

“The present invention is based on the sequencing of the Haemophilus influenzae Rd genome. The primary nucleotide sequence that was generated is provided in SEQ ID NO:1.

“The present invention provides the generated nucleotide sequence of the Haemophilus influenzae Rd genome, or a representative fragment thereof, in a form that can be readily used, analyzed and interpreted by a skilled artisan. In one embodiment, the sequence is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence in SEQ ID NO:1.

“The present invention also provides nucleotide sequences that are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1.”

“The nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 may be provided in a variety of media to facilitate its use. In one embodiment of this invention, the sequences are recorded on computer-readable media. Such media include, but are not limited to, magnetic storage media such as floppy disks, hard disk storage media and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.

“The invention also provides systems, particularly computer-based systems, that contain the sequence information described herein stored in a data storage means. Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.

“Another embodiment is directed to isolated fragments of the Haemophilus influenzae Rd genome. These include fragments that encode peptides (hereinafter open reading frames, ORFs), fragments that modulate the expression of an operably linked ORF (hereinafter expression modulating fragments, EMFs), fragments that mediate the uptake of a linked DNA fragment into a cell (hereinafter uptake modulating fragments, UMFs), and fragments that can be used to diagnose the presence of Haemophilus influenzae Rd in a sample (hereinafter diagnostic fragments, DFs).

Each of the ORF fragments of the Haemophilus influenzae Rd genome, and the EMF found 5′ to the ORF, can be used in numerous ways as polynucleotide reagents. These sequences can be used as diagnostic probes or amplification primers to detect a specific microbe in a sample, and to selectively control gene expression.

“The present invention also includes recombinant constructs that contain one or more fragments of the Haemophilus influenzae Rd genome. The recombinant constructs of the present invention comprise a vector, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd genome has been inserted.

The present invention also provides host cells containing any of the isolated fragments of the Haemophilus influenzae Rd genome. The host cell can be a higher eukaryotic cell such as a mammalian cell, a lower eukaryotic cell such as a yeast cell, or a procaryotic cell such as a bacterial cell.

“The present invention further relates to isolated proteins encoded by the ORFs. The proteins of the invention can be obtained by any of a variety of methods known in the art. In the simplest form, the amino acid sequence can be synthesized using commercially available peptide synthesizers. Alternatively, the protein can be purified from bacterial cells that naturally produce it. The proteins of the invention can also be purified from cells that have been altered to express the desired protein.

“The invention also provides methods for obtaining homologs of the fragments of the Haemophilus influenzae Rd genome and homologs of the proteins encoded by the ORFs. One skilled in the art can obtain homologs by using the nucleotide and amino acid sequences disclosed herein as probes or primers in techniques such as PCR cloning and colony/plaque hybridization.

“The invention also provides antibodies that selectively bind to one of the proteins described in the invention.” These antibodies can be monoclonal or polyclonal.

“The invention also provides hybridomas that produce the above-described antibody. Hybridomas are immortalized cell lines that can secrete a monoclonal antibody.

“The present invention also provides methods for identifying test samples derived from cells that express one or more of the ORFs of the invention, or a homolog thereof.” These methods comprise incubating a test sample with one or more of the antibodies or one or more of the DFs of the invention under conditions that allow a skilled artisan to determine whether the sample contains the ORF or the product produced from it.

“In another embodiment, the present invention provides kits that contain the necessary reagents for performing the above-described assays.”

“Specifically, the invention provides a compartmentalized kit to receive, in close confinement, one or more containers, comprising: (a) a first container containing one of the antibodies or one of the DFs of the present invention; and (b) one or more other containers containing one or more of the following: wash reagents, or reagents capable of detecting the presence of bound antibodies or hybridized DFs.

“Using the isolated proteins, the present invention also provides methods for obtaining and identifying agents capable of binding to a protein encoded by one of the ORFs. Such agents include antibodies, peptides and carbohydrates, as well as pharmaceutical agents. These methods comprise the steps of:

“(a) contacting an agent with an isolated protein encoded by one of the ORFs of the present invention; and

“(b) determining whether the agent binds to said protein.”

The complete genome sequence of H. influenzae is of great value to all laboratories working with this organism, and has commercial value as well. Many fragments of the Haemophilus influenzae Rd genome will be immediately identified by similarity searches against GenBank or protein databases. These will be of immediate value to Haemophilus researchers and of immediate commercial value for the production of proteins or the control of gene expression. A specific example concerns PHA synthase. Polyhydroxybutyrate has been reported to be present in the membranes of H. influenzae Rd, and its amount correlates with the level of competence for transformation. In a number of bacteria this polymer is synthesized by PHA synthase, but none of these bacteria is evolutionarily close to H. influenzae. The gene has not yet been isolated from H. influenzae by hybridization probes or PCR techniques. The genomic sequence of the present invention allows the gene to be identified using the search means described below.

“Developing the technology and methodology to elucidate the entire genome sequence of bacterial and other small genomes has greatly enhanced our ability to understand and analyze chromosomal organization.” Sequenced genomes will serve as models for developing tools to analyze chromosome structure and function, including the ability to identify genes within large segments of genomic DNA, to determine their structure, position and spacing, and to identify genes with potential industrial applications.

“DESCRIPTION OF THE FIGURES”

“FIG. 3. A comparison of the experimental coverage obtained with approximately 4000 random sequence fragments assembled with AutoAssembler against the Lander-Waterman prediction for a 2.5 Mb genome (triangles) and a 1.5 Mb genome (circles), with a 460-bp average sequence length and a 25-bp overlap.
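The Lander-Waterman prediction mentioned in the FIG. 3 caption can be illustrated with a short calculation. The following is a minimal sketch, not the inventors' code: it uses the standard Lander-Waterman expectations (coverage redundancy c = NL/G, fraction covered 1 - e^(-c), expected number of contigs N·e^(-c·σ) with σ = 1 - T/L), and the function and parameter names are hypothetical. The figures of 4000 fragments, 460 bp and 25 bp are taken from the caption.

```python
import math


def lander_waterman(n_reads, read_len, genome_len, min_overlap):
    """Return (redundancy, fraction_covered, expected_contigs).

    Standard Lander-Waterman expectations for random shotgun sequencing.
    All names here are illustrative assumptions, not from the source.
    """
    c = n_reads * read_len / genome_len      # average coverage redundancy
    sigma = 1.0 - min_overlap / read_len     # part of a read not needed for overlap detection
    fraction_covered = 1.0 - math.exp(-c)    # probability that a given base is sequenced
    expected_contigs = n_reads * math.exp(-c * sigma)
    return c, fraction_covered, expected_contigs


if __name__ == "__main__":
    # Genome sizes and read parameters as quoted in the FIG. 3 caption.
    for genome_len in (1_500_000, 2_500_000):
        c, covered, contigs = lander_waterman(4000, 460, genome_len, 25)
        print(f"G={genome_len}: c={c:.2f}, covered={covered:.1%}, contigs={contigs:.0f}")
```

Plotting `expected_contigs` or `fraction_covered` against the number of fragments would give curves of the kind the caption compares with the experimental assembly.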

“FIG. 4. Data flow and computer programs used to manage, assemble, edit, and annotate the H. influenzae genome. Both Macintosh and Unix platforms are used to handle the AB 373 sequence data files (Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press, Washington D.C., 585 (1993)). Factura (AB) is a Macintosh program designed for automatic vector sequence removal and end-trimming of sequence files. The program esp runs on a Macintosh platform; it extracts feature data from the sequence files and transfers it to the Unix-based H. influenzae relational database. Assembly is accomplished by retrieving a set of sequence files and their associated features using stp, an X-windows graphical interface and control program that can retrieve sequences from the H. influenzae database using standard or user-defined SQL queries. The sequence files were assembled using TIGR Assembler, an assembly engine designed at TIGR for rapid and accurate assembly of thousands of sequence fragments. TIGR Editor is a graphical interface that parses the aligned sequence files output by TIGR Assembler and displays them for contig editing. Putative coding regions were identified with Genemark (Borodovsky & McIninch, Computers Chem. 17(2):123 (1993)), a Markov and Bayes modeling program for predicting gene locations, trained on an H. influenzae sequence data set. Peptide searches were performed against the three reading frames of each Genemark-predicted coding region using blaze (Brutlag et al., Computers Chem. 17:203 (1993)), run on a MasPar MP-2 massively parallel computer with 4096 processors. The results from each frame were combined into a single output file by mblzt. Optimal protein alignments were obtained using the program praze, which extends alignments across potential frameshifts. The output was inspected using a custom graphic viewing program, gbyob, that interacts directly with the H. influenzae database. The alignments were also used to identify potential frameshift errors, which were then targeted for additional editing.

“FIG. 5. Outer perimeter: the location of the unique NotI restriction site (designated as nucleotide 1), the RsrII sites, and the SmaI sites. Outer concentric circle: the location of each identified coding region for which a gene identification was made. Second concentric circle: regions of high G/C content and high A/T content. High G/C content regions are associated with the 6 ribosomal operons and the mu-like prophage. Third concentric circle: coverage by lambda clones. More than 300 lambda clones were sequenced from each end to confirm the overall structure of the genome and to identify the 6 ribosomal operons. Fourth concentric circle: the locations of the 6 ribosomal operons, the tRNAs, and the cryptic mu-like prophage. Fifth concentric circle: simple tandem repeats. The locations of the following repeats are shown: CTGGCT, GTCT, ATT, AATGGC, TTGA, TTGG, TTTA, TTATC, TGAC, TCGTC, AACC, TTGC, CAAT, CCAA. The putative origin of replication is illustrated by the outward-pointing arrows originating near base 603,000. Two potential termination sequences are shown near the opposite midpoint of the circle.

“FIGS. 6(A)-6(AN). Predicted coding regions are shown on each strand. The rRNA and the tRNA genes are represented as triangles and lines, respectively. GeneID numbers correspond to those in Tables 1(a), 1(b) and 2. Three-letter gene designations are provided where available.

“FIG. 7. A comparison of the region of the H. influenzae chromosome containing the 8 genes of the fimbrial gene cluster present in H. influenzae type b with the same region in H. influenzae Rd. The region is flanked by the pepN and purE genes in both organisms. The 8 genes of the fimbrial gene cluster have been deleted in the Rd strain; in their place, the Rd strain contains a 172 bp spacer region that is still flanked by the purE and pepN genes.

“FIG. 8. Hydrophobicity analysis of five predicted channel proteins. The five predicted coding regions show no homology to sequences in GenBank (release 87), but each contains multiple hydrophobic domains characteristic of channel-forming proteins. The predicted coding region sequences were analyzed with the Kyte-Doolittle algorithm (Kyte & Doolittle, J. Mol. Biol.).
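A Kyte-Doolittle analysis of the kind named in the FIG. 8 caption can be sketched as a sliding-window average over the published hydropathy scale. This is an illustrative sketch only: the function name and the 19-residue default window (a common choice when looking for membrane-spanning segments) are assumptions, and the caption does not state which window the inventors used.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(protein, window=19):
    """Return the mean hydropathy of every full window along the protein.

    `window` is an assumed parameter; a short sequence yields an empty list.
    """
    scores = [KYTE_DOOLITTLE[residue] for residue in protein.upper()]
    if len(scores) < window:
        return []
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


if __name__ == "__main__":
    # A made-up peptide: a hydrophobic stretch flanked by charged residues.
    peptide = "KKDE" + "LIVALLIVGAVLLIAFLVL" + "RRDE"
    profile = hydropathy_profile(peptide)
    print(max(profile))  # the peak marks the candidate hydrophobic domain
```

Windows whose mean hydropathy stays strongly positive are the candidates for the multiple hydrophobic domains the caption describes.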

“The present invention is based on the sequencing of the Haemophilus influenzae Rd genome. The primary nucleotide sequence that was generated is provided in SEQ ID NO:1. As used herein, the “primary sequence” refers to the nucleotide sequence represented by the IUPAC nomenclature system.

“The sequence provided in SEQ ID NO:1 is oriented relative to a unique Not I restriction site found in the Haemophilus influenzae Rd genome. A skilled artisan will readily recognize that this start/stop point was chosen for convenience and does not reflect any structural significance.

“The present invention provides the nucleotide sequence of SEQ ID NO:1, or a representative fragment thereof, in a form that can be readily used, analyzed and interpreted by a skilled artisan. In one embodiment, the sequence is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence in SEQ ID NO:1.

“As used herein, a “representative fragment of the nucleotide sequence depicted in SEQ ID NO:1” refers to any portion of SEQ ID NO:1 that is not presently represented in a publicly available database. Preferred representative fragments of the present invention are Haemophilus influenzae open reading frames, expression modulating fragments, uptake modulating fragments, and fragments that can be used to diagnose the presence of Haemophilus influenzae Rd in a sample. A non-limiting identification of such preferred representative fragments is provided in Tables 1(a) and 2.

“The nucleotide sequence information in SEQ ID NO:1 was obtained by sequencing the Haemophilus influenzae Rd genome using a megabase shotgun sequencing method. Using three parameters of accuracy discussed in the Examples, the present inventors calculated that the sequence in SEQ ID NO:1 has a maximum accuracy of 99.98%. The nucleotide sequence in SEQ ID NO:1 is therefore a highly accurate, although not necessarily 100% perfect, representation of the nucleotide sequence of the Haemophilus influenzae Rd genome.

“As discussed in detail below, using the information in SEQ ID NO:1 and in Tables 1(a) and 2 together with routine cloning and sequencing methods, one of ordinary skill in the art will be able to clone and sequence all “representative fragments” of interest, including open reading frames (ORFs) encoding a large variety of Haemophilus influenzae proteins. In rare instances, this may reveal a nucleotide sequence error in the sequence disclosed in SEQ ID NO:1. Once the present invention is made available (i.e., once the information in SEQ ID NO:1 and Tables 1(a) and 2 has been made available), resolving a rare sequencing error in SEQ ID NO:1 will be well within the skill of the art. Nucleotide sequence editing software is publicly available. For example, Applied Biosystems’ AutoAssembler can be used as an aid during visual inspection of nucleotide sequences.

“Even if all of the rare sequencing errors in SEQ ID NO:1 were corrected, the resulting nucleotide sequence would still be at least 99.9% identical to the nucleotide sequence in SEQ ID NO:1.

“The nucleotide sequences of the genomes of different strains of Haemophilus influenzae differ slightly. However, the nucleotide sequences of the genomes of all Haemophilus influenzae strains will be at least 99.9% identical to the nucleotide sequence provided in SEQ ID NO:1.

“The present invention also provides nucleotide sequences that are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1, in a form that can be readily used, analyzed and interpreted by the skilled artisan. Methods for determining whether a nucleotide sequence is at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 are routine and readily available to the skilled artisan. One example is the well-known fasta algorithm (Pearson & Lipman, Proc. Natl. Acad. Sci.).

“Computer Related Embodiments.”

“The nucleotide sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1, may be “provided” in a variety of media to facilitate its use. As used herein, “provided” refers to a manufacture, other than an isolated nucleic acid molecule, that contains a nucleotide sequence of the present invention, i.e., the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1. Such a manufacture provides the Haemophilus influenzae Rd genome, or a subset thereof, in a form that allows a skilled artisan to examine the manufacture using means not directly applicable to examining the Haemophilus influenzae Rd genome, or a subset thereof, as it exists in nature.

“In one embodiment of this invention, a nucleotide sequence of the present invention can be recorded on computer-readable media. As used herein, “computer-readable media” refers to any medium that can be read and accessed directly by a computer. Such media include, but are not limited to, magnetic storage media such as floppy disks, hard disk storage media and magnetic tape; optical storage media such as CD-ROM; electrical storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media. A skilled artisan will readily appreciate how any of the presently known computer-readable media can be used to create a manufacture having a nucleotide sequence of the present invention recorded on it.

“As used herein, “recorded” refers to a process for storing information on a computer-readable medium. A skilled artisan can readily adopt any of the presently known methods for recording information on computer-readable media to generate manufactures containing the nucleotide sequence information of the present invention.

“A variety of data storage structures are available to a skilled artisan for creating a computer-readable medium having a nucleotide sequence of the present invention recorded on it. The choice of data storage structure will generally be based on the means chosen to access the stored information. In addition, a variety of data processor programs and formats can be used to store the nucleotide sequence information of the present invention on computer-readable media. The sequence information can be represented in a word processing text file, formatted in commercially available software such as WordPerfect and Microsoft Word, represented in the form of an ASCII file, or stored in a database application such as DB2, Sybase, Oracle, or the like. A skilled artisan can readily adapt any number of data processor structuring formats (e.g., text file or database) to obtain a computer-readable medium having the nucleotide sequence information of the present invention recorded on it.
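As an illustration of storing a sequence as a contiguous string in an ASCII file, the sketch below writes and reads a FASTA-style text record. The FASTA layout, the function names and the record name are assumptions chosen for the example; the text above does not prescribe any particular file format.

```python
import io


def write_fasta(handle, name, sequence, width=60):
    """Write one sequence as a header line followed by fixed-width lines."""
    handle.write(f">{name}\n")
    for i in range(0, len(sequence), width):
        handle.write(sequence[i:i + width] + "\n")


def read_fasta(handle):
    """Read back a single-record file written by write_fasta."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:]
        elif line:
            chunks.append(line)
    return name, "".join(chunks)


if __name__ == "__main__":
    # An in-memory buffer stands in for a file on any storage medium.
    buffer = io.StringIO()
    write_fasta(buffer, "example_fragment", "ACGT" * 40)
    buffer.seek(0)
    name, sequence = read_fasta(buffer)
    print(name, len(sequence))
```

The same two functions work unchanged with a file opened through `open(path, "w")` or `open(path)`, which is one reason a plain ASCII representation is convenient.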

By providing the nucleotide sequence of SEQ ID NO:1, or a fragment thereof, in computer-readable form, a skilled artisan can routinely access the sequence information for a variety of purposes. Computer software that allows a skilled artisan to access sequence information provided in a computer-readable medium is publicly available. The examples that follow demonstrate how software implementing the BLAST (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE (Brutlag et al., Comp. Chem. 17:203-207 (1993)) search algorithms on a Sybase system was used to identify open reading frames (ORFs) within the Haemophilus influenzae Rd genome that contain homology to ORFs or proteins from other organisms. Such ORFs are protein-encoding fragments of the Haemophilus influenzae Rd genome and are useful in producing commercially important proteins, such as enzymes used in fermentation reactions and in the production of commercially useful metabolites.

“The invention also provides systems, particularly computer-based systems, that contain the sequence information described herein.” Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.

“As used herein, “a computer-based system” refers to the hardware means, software means, and data storage means used to analyze the nucleotide sequence information of the present invention. The minimum hardware means of the computer-based systems of the present invention comprises a central processing unit (CPU), input means, output means, and data storage means. A skilled artisan can readily appreciate that any one of the currently available computer-based systems is suitable for use in the present invention.

“As stated above, the computer-based systems of the present invention comprise a data storage means having stored therein a nucleotide sequence of the present invention, and the necessary hardware means and software means for supporting and implementing a search means. As used herein, “data storage means” refers to memory that can store the nucleotide sequence information of the present invention, or a memory access means that can access manufactures having the nucleotide sequence information of the present invention recorded on them.

“As used herein, “search means” refers to one or more programs implemented on the computer-based system to compare a target sequence or target structural motif with the sequence information stored in the data storage means. Search means are used to identify fragments or regions of the Haemophilus influenzae Rd genome that match a particular target sequence or target motif. A variety of known algorithms are publicly disclosed, and a variety of commercially available software for conducting search means can be used in the computer-based systems of the present invention. Examples of such software include, but are not limited to, MacPattern (EMBL), BLASTN and BLASTX (NCBI). A skilled artisan will readily recognize that any of the available algorithms or implementing software packages for conducting homology searches can be adapted for use in the present computer-based systems.

“As used herein, a “target sequence” can be any DNA or amino acid sequence of six or more nucleotides or two or more amino acids. A skilled artisan will readily recognize that the longer a target sequence is, the less likely it is to be present in the database as a random occurrence. The most preferred length of a target sequence is from about 10 to 100 amino acids, or from about 30 to 300 nucleotide residues. However, it is well recognized that searches for commercially important fragments of the Haemophilus influenzae Rd genome, such as sequence fragments involved in gene expression and protein processing, may use shorter target sequences.
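The simplest possible search means is an exact-match scan for a nucleotide target sequence. The sketch below is a hypothetical illustration, far simpler than BLAST or BLAZE: it looks for a target on both strands of a circular genome held as a string, and every name in it is an assumption made for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_target(genome, target, circular=True):
    """Return sorted (position, strand) pairs for exact matches.

    Positions are 1-based forward-strand coordinates of the leftmost matched
    base. A palindromic target is reported once per strand.
    """
    genome, target = genome.upper(), target.upper()
    # Append the first len(target)-1 bases so that matches spanning the
    # start/stop point of a circular sequence are also found.
    text = genome + genome[:len(target) - 1] if circular else genome
    hits = []
    for strand, query in (("+", target), ("-", reverse_complement(target))):
        start = text.find(query)
        while start != -1:
            hits.append((start + 1, strand))
            start = text.find(query, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    toy_genome = "GATTACAGGG"  # made-up sequence, not from SEQ ID NO:1
    print(find_target(toy_genome, "ATTAC"))
    print(find_target(toy_genome, "GGGA"))   # match wraps around the origin
```

A homology search differs in that it scores approximate matches, but the inputs (a stored sequence and a target) and the output (positions of matching fragments) have the same shape.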

“As used herein, “a target structural motif,” or “target motif,” refers to any rationally selected sequence or combination of sequences in which the sequence(s) are chosen based on a three-dimensional configuration formed upon the folding of the target motif. A variety of target motifs are known in the art. Protein target motifs include, but are not limited to, enzymatic active sites and signal sequences. Nucleic acid target motifs include, but are not limited to, promoter sequences, hairpin structures and inducible expression elements (protein-binding sequences).

A variety of structural formats for the input and output means can be used to input and output information in the computer-based systems of the present invention. A preferred format for an output means ranks fragments of the Haemophilus influenzae Rd genome possessing varying degrees of homology to the target sequence or target motif. Such presentation provides a skilled artisan with a ranking of sequences that contain various amounts of the target sequence or target motif and identifies the degree of homology contained in each identified fragment.

“A variety of comparing means can be used to compare a target sequence or target motif with the data storage means to identify sequence fragments of the Haemophilus influenzae Rd genome. Software implementing the BLAST and BLAZE algorithms is used in the examples (Altschul et al., J. Mol. Biol. 215:403-410 (1990)). A skilled artisan will readily recognize that any of the publicly available homology search programs can be used as the search means for the computer-based systems of the present invention.

FIG. 2 shows a block diagram of a computer system 102 that can be used to implement the present invention. The computer system 102 includes a processor 106 connected to a bus 104. Also connected to the bus 104 are a main memory 108 (preferably implemented as random access memory, RAM) and a variety of secondary storage devices 110, such as a hard drive 112 and a removable medium storage device 114. The removable medium storage device 114 may be, for example, a floppy disk drive, a CD-ROM drive, or a magnetic tape drive. A removable storage medium 116 (such as a floppy disk, a compact disk, or a magnetic tape) containing control logic and/or data may be inserted into the removable medium storage device 114. The computer system 102 includes appropriate software for reading the control logic and/or the data from the removable storage medium 116 once it is inserted in the removable medium storage device 114.

A nucleotide sequence of the present invention may be stored in a well-known manner in the main memory 108, any of the secondary storage devices 110, and/or a removable storage medium 116. Software for accessing and processing the genomic sequence (such as search tools, comparing tools, etc.) resides in main memory 108 during execution.”

“Biochemical Embodiments”

“Another embodiment is directed to isolated fragments of the Haemophilus influenzae Rd genome. The present invention includes fragments of the Haemophilus influenzae Rd genome.

“As used herein, an “isolated nucleic acid molecule” or an “isolated fragment of the Haemophilus influenzae Rd genome” refers to a nucleic acid molecule that possesses a specific nucleotide sequence and has been subjected to purification means to reduce the number of compounds normally associated with the composition. A variety of purification means can be used to generate the isolated fragments of the present invention. These include, but are not limited to, methods that separate the constituents of a solution based on charge, solubility, or size.

“In one embodiment, Haemophilus influenzae Rd DNA can be mechanically sheared to produce fragments of 15-20 kb in length. These fragments can then be used to generate a Haemophilus influenzae Rd library by inserting them into lambda clones, as described in the Examples. Primers flanking an ORF, such as those listed in Table 1(a), can then be generated using the nucleotide sequence information provided in SEQ ID NO:1. PCR cloning can then be used to isolate the ORF from the lambda DNA library. PCR cloning is well known in the art. Thus, given the availability of SEQ ID NO:1, Table 1(a) and Table 2, it would be routine to isolate any ORF or other nucleic acid fragment of the present invention.

“The isolated nucleic acid molecules of the present invention include, but are not limited to, single-stranded and double-stranded DNA and single-stranded RNA.

“As used herein, an “open reading frame,” ORF, means a series of triplets that code for amino acids and can be translated into protein. Tables 1(a), 1(b) and 2 list ORFs in the Haemophilus influenzae Rd genome. In particular, Table 1(a) indicates the location of ORFs within the Haemophilus influenzae Rd genome that encode the recited proteins, based on a homology match with protein sequences from the organism appearing in parentheticals (see the fourth column of Table 1(a)).”
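To make the notion of an ORF concrete, the sketch below scans all six reading frames of a DNA string for runs of triplets that begin with ATG and end at the first in-frame stop codon. This is a deliberately naive illustration under stated assumptions (ATG-only starts, the standard stop codons, a hypothetical `min_codons` threshold); it is not Genemark and not the method used to build the Tables. Reverse-strand ORFs are reported with start greater than stop.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def _orfs_on_strand(seq, min_codons):
    """Return 0-based half-open (start, end) spans, stop codon included."""
    spans = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    spans.append((start, i + 3))
                start = None
    return spans


def find_orfs(seq, min_codons=2):
    """Return (start, stop, strand) in 1-based forward-strand coordinates.

    For "-" strand ORFs start > stop, i.e. the coordinates run backwards.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = [(s + 1, e, "+") for s, e in _orfs_on_strand(seq, min_codons)]
    reverse = reverse_complement(seq)
    # Index j on the reverse strand is forward index n - 1 - j (0-based).
    orfs += [(n - s, n - e + 1, "-") for s, e in _orfs_on_strand(reverse, min_codons)]
    return orfs


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGCC"  # made-up sequence with one short ORF
    print(find_orfs(toy))
    print(find_orfs(reverse_complement(toy)))
```

Real gene finders add codon-usage statistics and alternative start codons; the point here is only the frame-by-frame triplet scan and the strand-aware coordinates.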

“The first column of Table 1(a) provides the “GeneID” of a particular ORF. This information is useful for two reasons. First, the complete map of the Haemophilus influenzae Rd genome provided in FIGS. 6(A)-6(AN) refers to the ORFs according to their GeneID numbers. Second, Table 1(b) uses the GeneID numbers to indicate which ORFs were previously provided in a public database.

“The second and third columns of Table 1(a) indicate the position of the ORF within the nucleotide sequence provided in SEQ ID NO:1. One of ordinary skill in the art will recognize that ORFs may be oriented in opposite directions in the Haemophilus influenzae genome. This is reflected in columns 2 and 3.

“The fifth column of Table 1(a) indicates the percent identity of the protein encoded by an ORF to the corresponding protein from the organism appearing in parentheticals in the fourth column.”

“The sixth column of Table 1(a) indicates the percent similarity of the protein encoded by an ORF to the corresponding protein from the organism appearing in parentheticals in the fourth column. The concepts of percent identity and percent similarity of two polypeptide sequences are well understood in the art. For example, two polypeptides 10 amino acids in length that differ at three amino acid positions (e.g., at positions 1, 3 and 5) are said to have a percent identity of 70%. However, the same two polypeptides would be deemed to have a percent similarity of 80% if, for example, the amino acid moieties at position 5, although not identical, were “similar” (i.e., possessed similar biochemical characteristics).”
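The 70% identity and 80% similarity example above can be reproduced with a few lines of code. The sketch below assumes two pre-aligned, equal-length peptides; the two made-up 10-residue sequences and the grouping of amino acids into "similar" classes are illustrative assumptions, since the text does not say which similarity scheme underlies Table 1(a).

```python
# Hypothetical biochemical classes used to decide whether residues are "similar".
SIMILARITY_GROUPS = [
    set("AVLIM"),   # aliphatic / hydrophobic
    set("FYW"),     # aromatic
    set("KRH"),     # basic
    set("DE"),      # acidic
    set("ST"),      # hydroxyl
    set("NQ"),      # amide
]


def are_similar(a, b):
    """True if the residues are identical or share one of the classes above."""
    return a == b or any(a in group and b in group for group in SIMILARITY_GROUPS)


def percent_identity(seq1, seq2):
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


def percent_similarity(seq1, seq2):
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(are_similar(a, b) for a, b in zip(seq1, seq2))
    return 100.0 * matches / len(seq1)


if __name__ == "__main__":
    # Two made-up 10-mers differing at positions 1, 3 and 5;
    # position 5 is F versus Y, both aromatic.
    first = "ACDEFGHIKL"
    second = "WCKEYGHIKL"
    print(percent_identity(first, second))    # 70.0
    print(percent_similarity(first, second))  # 80.0
```

The same `percent_identity` function, applied to aligned nucleotide sequences, expresses the kind of threshold test implied by a phrase such as "at least 99.9% identical".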

“The seventh column of Table 1(a), indicates the length of the amino acid homology match.”

“Table 2 lists ORFs from the Haemophilus influenzae Rd genome that encode polypeptide sequences but did not produce a homology match with an existing protein sequence from another organism. The Examples below provide additional information about the criteria and algorithms used to perform homology searches.

“A skilled artisan can readily identify ORFs in the Haemophilus influenzae Rd genome other than those listed in Tables 1(a), 1(b) and 2. These include ORFs that overlap an identified ORF or are encoded by the opposite strand of an identified ORF. Such ORFs can also be identified using the computer-based systems of the invention.

“As used herein, an ‘expression modulating fragment’ (EMF) is a series of nucleotide molecules that modulates the expression of an operably linked ORF or EMF.

“As used herein, a sequence is said to ‘modulate the expression of an operably linked sequence’ when the expression of the sequence is altered by the presence of the EMF. EMFs include, but are not limited to, promoters and promoter modulating sequences. One class of EMFs consists of fragments that induce the expression of an operably linked ORF in response to a specific regulatory factor or physiological event. Known EMFs from Haemophilus are reviewed by Tomb et al., Gene 104:1-10 (1991), and Chandler, M. S., Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992).

EMF sequences can be identified within the Haemophilus influenzae Rd genome by their proximity to the ORFs in Tables 1(a), 1(b) and 2. An intergenic segment, or a fragment of the intergenic segment, of about 10 to 200 nucleotides in length, taken 5′ from any one of the ORFs in Tables 1(a), 1(b) or 2, will modulate the expression of an operably linked 3′ ORF in a manner similar to that found with the naturally linked ORF sequence. As used herein, an ‘intergenic segment’ refers to the fragments of the Haemophilus genome that lie between two of the ORFs described herein. Alternatively, EMFs can be identified by using known EMFs as target sequences or target motifs in the computer-based systems of the invention.
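A minimal sketch of pulling a candidate segment 5′ of an ORF, using the 10 to 200 nucleotide range given above. It handles forward-strand ORFs only and assumes 1-based inclusive coordinates; the function name, parameters and toy genome are hypothetical, and whether a given segment actually modulates expression still has to be tested experimentally.

```python
def upstream_segment(genome, orf_start, prev_orf_end, min_len=10, max_len=200):
    """Return up to `max_len` nt of intergenic sequence immediately 5' of an ORF.

    Coordinates are 1-based and inclusive (an assumption). Returns None when
    the intergenic segment is shorter than `min_len`.
    """
    intergenic_start = prev_orf_end + 1
    intergenic_end = orf_start - 1
    length = intergenic_end - intergenic_start + 1
    if length < min_len:
        return None
    take = min(length, max_len)
    # Keep the nucleotides closest to the ORF start.
    return genome[intergenic_end - take:intergenic_end]

# Toy layout: previous ORF ends at 30, 15-nt intergenic segment, next ORF starts at 46.
genome = "G" * 30 + "T" * 15 + "ATG" + "C" * 12
print(upstream_segment(genome, orf_start=46, prev_orf_end=30))
```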

The presence and activity of an EMF can be confirmed using an EMF trap vector. An EMF trap vector contains a cloning site 5′ to a marker sequence. The marker sequence encodes an identifiable phenotype, such as antibiotic resistance or a complementing nutritional auxotrophic factor, which can be identified or assayed when the EMF trap vector is placed within an appropriate host under appropriate conditions. As described above, an EMF will modulate the expression of an operably linked marker sequence. Various marker sequences are discussed in more detail below.

“A sequence that is suspected of being an EMF is cloned in all three reading frames at one or more restriction sites upstream from the marker sequence of the EMF trap vector. The vector is then transformed into a suitable host using known methods, and the phenotype of the transformed host is examined under appropriate conditions. As described above, an EMF will modulate the expression of an operably linked marker sequence.

“As used herein, an ‘uptake modulating fragment’ (UMF) is a series of nucleotide molecules that mediates the uptake of a linked DNA fragment into a cell. UMFs can be readily identified by using known UMFs as a target sequence or target motif in the computer-based systems of the invention.

The presence and activity of a UMF can be confirmed by attaching the suspected UMF to a marker sequence. The resulting nucleic acid molecule is then incubated with an appropriate host under appropriate conditions, and uptake of the marker sequence is determined. As described above, a UMF will increase the frequency of uptake of a linked marker sequence. DNA uptake in Haemophilus is reviewed by Goodgall, S. H., et al., J. Bact. 172:5924-5928 (1990).”

“As used herein, a ‘diagnostic fragment’ (DF) is a series of nucleotide molecules that selectively hybridizes to Haemophilus influenzae sequences. DFs can be readily identified by identifying unique sequences within the Haemophilus influenzae Rd genome, or by generating and testing probes or amplification primers that contain the DF sequence in a diagnostic format that determines amplification or hybridization selectivity.

“The sequences falling within the scope of the invention are not limited to the specific sequences described herein, but also include allelic and species variations thereof. Allelic and species variations can be routinely determined by comparing the sequence in SEQ ID NO:1, or a representative fragment thereof, with a sequence from another isolate of the same species. To accommodate codon variability, the invention also includes nucleic acid molecules that code for the same amino acid sequences as the ORFs described herein. In other words, in the coding region of an ORF, substitution of one codon for another that encodes the same amino acid is expressly contemplated.
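Codon variability can be shown with a small sketch: two different nucleotide sequences that translate to the same peptide. The code assumes the standard genetic code; the two toy coding sequences are invented for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate a coding sequence codon by codon ('*' marks a stop codon)."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

original = "ATGGCTAAAGGTTAA"
variant = "ATGGCAAAGGGCTAA"   # synonymous codon swapped in at three positions
print(translate(original), translate(variant))
```

Both sequences encode the same peptide even though they differ at three nucleotide positions.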

“Any specific sequence described herein can be readily screened for errors by resequencing a particular fragment, such as an ORF, in both directions (i.e., sequencing both strands). Alternatively, error screening can be performed by sequencing the corresponding polynucleotides of Haemophilus influenzae origin, isolated by using part or all of the fragments in question as a probe or primer.

Each of the ORFs in the Haemophilus influenzae Rd genome, as well as the EMF found 5′ to the ORF, can be used in a variety of ways as polynucleotide reagents. These sequences can be used as diagnostic probes or amplification primers to detect the presence of a specific microbe, such as Haemophilus influenzae Rd, in a sample. This is particularly true of the fragments and ORFs of Table 2, which will be highly selective for Haemophilus influenzae.

“In addition, the fragments of the present invention can be used to control gene expression through triple helix formation or through antisense DNA or RNA. Both methods are based on the binding of a polynucleotide sequence to DNA or RNA. Polynucleotides suitable for these methods are typically 20 to 40 bases long and are designed to be complementary to a region of the gene involved in transcription (triple helix: see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 241:456 (1988); Dervan et al., Science 251:1360 (1991)) or to the mRNA itself (antisense: Okano, J. Neurochem. 56:560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press, Boca Raton, Fla. (1988)).”
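The complementarity requirement for an antisense oligonucleotide can be sketched as follows: take a 20 to 40 base window of the target mRNA and return its reverse complement as DNA. The function name, the window choice and the toy mRNA are hypothetical; practical design also weighs target accessibility and specificity, which this sketch ignores.

```python
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def antisense_oligo(mrna, start, length=20):
    """Return a DNA oligo complementary to mrna[start:start+length] (0-based start)."""
    if not 20 <= length <= 40:
        raise ValueError("length should fall in the 20-40 base range")
    target = mrna[start:start + length].upper().replace("U", "T")
    if len(target) < length:
        raise ValueError("window runs past the end of the mRNA")
    # Reverse complement so the oligo pairs antiparallel with the target.
    return "".join(DNA_COMPLEMENT[base] for base in reversed(target))

mrna = "AUGGCUAAAGGUUUAGCUGAUCCGA"   # toy sequence, not taken from SEQ ID NO:1
print(antisense_oligo(mrna, 0, 20))
```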

“Triple helix formation optimally results in a shut-off of RNA transcription from DNA, while antisense RNA hybridization blocks the translation of an mRNA molecule into a polypeptide. Both methods have been shown to be effective in model systems. The sequences of this invention provide the information necessary to design an antisense or triple helix oligonucleotide.

“The present invention also provides recombinant constructs that contain one or more fragments of the Haemophilus influenzae Rd genome. Recombinant constructs according to the present invention comprise a vector, such as a plasmid or viral vector, into which a Haemophilus influenzae Rd fragment has been inserted in either a forward or reverse orientation. A vector that contains one of the ORFs may also include regulatory sequences, for example a promoter, operably linked to the ORF. A vector that contains an EMF or UMF may also include a marker sequence or heterologous ORF operably linked to the EMF or UMF. Many vectors and promoters suitable for generating the recombinant constructs of the invention are known. The following vectors are provided by way of example. Bacterial: pBs, phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a, pNH46a (Stratagene); pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia). Eukaryotic: pWLneo, pSV2cat, pOG44, pSG, pKK233-3, pDR540, pSVL.

“Promoter regions can be selected from any desired gene using CAT (chloramphenicol transferase) vectors or other vectors with selectable markers. Two suitable vectors are pKK232-8 and pCM7. Particular named bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda PR, and trc. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, and early and late SV40. Selecting the appropriate vector and promoter is well within the level of ordinary skill in the art.

The present invention also provides host cells that contain any of the Haemophilus influenzae Rd fragments, where the fragment has been introduced into the host cell by known transformation methods. The host cell may be a higher eukaryotic cell, such as a mammalian cell; a lower eukaryotic cell, such as a yeast cell; or a procaryotic cell, such as a bacterial cell. The recombinant construct can be introduced into the host cell by calcium phosphate transfection, DEAE-dextran mediated transfection, or electroporation (Davis, L. et al., Basic Methods in Molecular Biology (1986)).

“Host cells containing one of the fragments of the Haemophilus influenzae Rd genome can be used in conventional ways to produce the gene product encoded by the fragment (in the case of an ORF), or to produce a heterologous protein under the control of the EMF.”

“The present invention also provides isolated polypeptides encoded by the nucleic acid fragments of this invention or by degenerate variants of those fragments. A ‘degenerate variant’ is a nucleotide fragment that differs in nucleotide sequence from a nucleic acid fragment of the present invention (e.g., an ORF) but, owing to the degeneracy of the genetic code, encodes an identical polypeptide sequence. The preferred nucleic acid fragments of the present invention are the ORFs in Table 1(a) that encode proteins.

A variety of methods known in the art can be used to obtain any one of the isolated polypeptides or proteins of the invention. At the simplest level, the amino acid sequence can be synthesized using commercially available peptide synthesizers. This is especially useful for producing small peptides or fragments of larger polypeptides. Such fragments are useful, for example, for generating antibodies against the native polypeptide. Alternatively, the polypeptide or protein can be purified from bacterial cells that naturally produce it. One skilled in the art can readily follow known methods for isolating polypeptides and proteins in order to obtain the polypeptides or proteins of the invention. These methods include, but are not limited to, immunochromatography, HPLC, size-exclusion chromatography, ion-exchange chromatography, and immuno-affinity chromatography.”

The polypeptides or proteins of the invention can also be purified from cells that have been altered to express the desired polypeptide or protein. As used herein, a cell is said to be altered to express a desired polypeptide or protein when the cell, through genetic manipulation, is made to produce a polypeptide or protein that it normally does not produce or that it normally produces at a lower level. One skilled in the art can readily adapt methods for introducing and expressing recombinant or synthetic sequences in prokaryotic or eukaryotic cells in order to create a cell that produces the polypeptides or proteins of the invention.

“Any host/vector system can be used to express one or more of the ORFs of the present invention. These include, but are not limited to, eukaryotic hosts such as HeLa cells, Cv-1 cells, COS cells and Sf9 cells, as well as prokaryotic hosts such as E. coli and B. subtilis. The preferred cells are those that do not normally express the desired polypeptide or protein.

“‘Recombinant,’ as used herein, means that a polypeptide or protein is derived from a recombinant (e.g., microbial or mammalian) expression system. ‘Microbial’ refers to recombinant polypeptides or proteins made in bacterial or fungal expression systems. As a product, ‘recombinant microbial’ defines a polypeptide or protein that is essentially free of native endogenous substances and unaccompanied by associated native glycosylation. Polypeptides or proteins expressed in most bacterial cultures (e.g., E. coli) will be free of glycosylation modifications, whereas polypeptides or proteins expressed in yeast will have a glycosylation pattern different from that produced in mammalian cells.

“‘Nucleotide sequence’ refers to a heteropolymer of deoxyribonucleotides. The DNA segments that encode the polypeptides or proteins of this invention are generally assembled from fragments of the Haemophilus influenzae Rd genome and short oligonucleotide linkers.

“‘Recombinant expression vehicle or vector’ refers to a plasmid, phage, virus or other vector for expressing a polypeptide from a DNA (RNA) sequence. The expression vehicle can comprise a transcriptional unit that is an assembly of (1) a genetic element or elements having a regulatory role in gene expression, (2) a structural or coding sequence that is transcribed into mRNA and translated into protein, and (3) appropriate transcription initiation and termination sequences. Structural units intended for use in yeast or eukaryotic expression systems preferably include a leader sequence that enables extracellular secretion of the translated protein by the host cell. Alternatively, where a recombinant protein is expressed without a leader or transport sequence, it may include an N-terminal methionine residue. This residue may or may not be subsequently cleaved from the expressed recombinant protein to provide a final product.

“‘Recombinant expression system’ refers to host cells that have stably integrated a recombinant transcriptional unit into their chromosomal DNA. The cells can be prokaryotic or eukaryotic. Recombinant expression systems as defined herein will express heterologous polypeptides or proteins upon induction of the regulatory elements linked to the DNA segment or synthetic gene to be expressed.

“Mature proteins can be expressed in mammalian cells, yeast, and bacteria. Cell-free translation systems can also be used to produce such proteins using RNAs derived from the constructs of the present invention. Appropriate cloning and expression vectors for use with prokaryotic and eukaryotic hosts are described by Sambrook et al. in Molecular Cloning: A Laboratory Manual, Cold Spring Harbor, N.Y. (1989), the disclosure of which is hereby incorporated by reference.

“Recombinant expression vectors generally include origins of replication and selectable markers that permit transformation of the host cell (e.g., the ampicillin resistance gene of E. coli or the S. cerevisiae TRP1 gene), together with a promoter derived from a highly expressed gene to direct transcription of a downstream structural sequence. Such promoters can be derived, for example, from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase, or from genes such as acid phosphatase. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences and, preferably, with a leader sequence capable of directing secretion of the translated protein into the periplasmic space or extracellular medium. Optionally, the heterologous sequence can encode a fusion protein that includes an N-terminal identification peptide imparting desired characteristics, such as stabilization or simplified purification.

Summary for “Nucleotide sequence for the Haemophilus influenzae Rd gene, fragments thereof and uses thereof”

“H. influenzae has a highly efficient natural DNA transformation system that has been extensively studied in the non-encapsulated strain Rd (Kahn & Smith, J. Membrane Biology 81:1-103 (1984)). At least 16 transformation-specific genes have been identified and sequenced. Of these, four are regulatory genes (Redfield, J. Bacteriol. 173:5612-5618 (1991); Chandler, Proc. Natl. Acad. Sci. USA 89:1626-1630 (1992)), at least two are involved in recombination (Barouki and Smith, J. Bacteriol. 163(2):629-634 (1985)), and at least seven are targeted to the membranes and the periplasmic space (Tomb et al., Gene 104:1-10 (1991); Tomb, Proc. Natl. Acad. Sci. USA 89:10252-10256 (1992)), where they function as structural components or in the assembly of the DNA transport machinery. H. influenzae Rd transformation shows a number of interesting features, including sequence-specific DNA uptake, rapid uptake of several double-stranded DNA molecules per competent cell into a membrane compartment called the transformasome, linear translocation of a single strand of the donor DNA into the cytoplasm, and synapsis and recombination of that strand with the chromosome through a single-strand displacement mechanism. The H. influenzae Rd transformation system is the most thoroughly studied of the gram-negative systems and differs in a number of ways from the gram-positive systems.

“The size of the H. influenzae Rd genome has been determined by pulsed-field gel electrophoresis of restriction digests to be approximately 1.95 Mb, about 40% of the size of the E. coli genome (Lee and Smith, J. Bacteriol. 170:4402-4405 (1988)). The restriction map of H. influenzae is circular (Lee et al., J. Bacteriol. 171:3016-3024 (1989); Redfield and Lee, ‘Haemophilus influenzae Rd’, pp. 2110-2112, in O’Brien, S. J. (ed.), Genetic Maps: Locus Maps of Complex Genomes, Cold Spring Harbor Press, N.Y.). Various genes have been mapped to restriction fragments by Southern hybridization probing of restriction digest DNA bands. This map can be used to verify the assembly of a complete genome sequence from randomly sequenced fragments. GenBank currently contains about 100 kb of non-redundant H. influenzae DNA sequence. About half of the sequences are from serotype b and half are from Rd.

“The present invention is based on the sequencing of the Haemophilus influenzae Rd genome. The primary nucleotide sequence that was generated is provided in SEQ ID NO:1.

“The present invention provides the generated nucleotide sequence of the Haemophilus influenzae Rd genome, or a representative fragment thereof, in a form that can be readily used, analyzed and interpreted by a skilled artisan. In one embodiment, the sequence is provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence in SEQ ID NO:1.

“The present invention also provides nucleotide sequences that are at least 99.9% identical to the sequence of SEQ ID NO:1.”

“To facilitate its use, the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence that is at least 99.9% identical to the sequence of SEQ ID NO:1 may be provided on a variety of media. In one embodiment of this invention, the sequences of the invention are recorded on computer-readable media. Such media include, but are not limited to, magnetic storage media such as floppy disks, hard disk storage media and magnetic tape; optical storage media such as CD-ROM; electronic storage media such as RAM and ROM; and hybrids of these categories such as magnetic/optical storage media.

“The invention also provides systems, especially computer-based systems that contain the sequence information described herein stored in a data storage device. These systems can be used to identify commercially relevant fragments of the Haemophilus Influenzae Rd genome.

“Another embodiment is directed to isolated fragments of the Haemophilus influenzae Rd genome. These fragments include, but are not limited to, fragments that encode peptides (hereinafter open reading frames, ORFs), fragments that modulate the expression of an operably linked ORF (hereinafter expression modulating fragments, EMFs), fragments that mediate the uptake of a linked DNA fragment into a cell (hereinafter uptake modulating fragments, UMFs), and fragments that can be used to diagnose the presence of Haemophilus influenzae Rd in a sample (hereinafter diagnostic fragments, DFs).

Each of the ORF fragments of the Haemophilus influenzae Rd genome, as well as the EMF found 5′ to the ORF, can be used in a variety of ways as polynucleotide reagents. These sequences can be used as diagnostic probes or amplification primers to detect microbes in samples. They can also be used to selectively control gene expression.

“The present invention also includes recombinant constructs that contain one or more fragments of the Haemophilus influenzae Rd genome. The recombinant constructs of the present invention comprise vectors, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd genome has been inserted.

The present invention also provides host cells that contain any fragment of the Haemophilus influenzae Rd genome. The host cells can be higher eukaryotic cells such as mammalian cells, lower eukaryotic cells such as yeast cells, or procaryotic cells such as bacterial cells.

“The present invention further relates to isolated proteins encoded by the ORFs. Any one of the proteins of the invention can be obtained using a variety of methods known in the art. At the simplest level, the amino acid sequence can be synthesized using commercially available peptide synthesizers. Alternatively, the protein can be purified from bacterial cells that naturally produce it. The proteins of the invention can also be purified from cells that have been altered to express the desired protein.

“The invention also provides methods for obtaining homologs of the fragments of the Haemophilus influenzae Rd genome and homologs of the proteins encoded by the ORFs. One skilled in the art can obtain homologs by using the nucleotide or amino acid sequences described herein as probes or primers, together with techniques such as PCR cloning and colony/plaque hybridization.

“The invention also provides antibodies that selectively bind to one of the proteins described in the invention. These antibodies can be monoclonal or polyclonal.

“The invention also provides hybridomas that produce the above-described antibody. Hybridomas are immortalized cell lines that can secrete a monoclonal antibody.

“The present invention also provides methods for identifying test samples that are derived from cells that express one or more of the ORFs of the invention, or a homolog thereof. These methods comprise incubating a test sample with one or more of the antibodies or one or more of the DFs of the invention, under conditions that allow a skilled artisan to determine whether the sample contains the ORF or a product produced from it.

“In another embodiment, the present invention provides kits that contain the necessary reagents for performing the above-described assays.”

“Specifically, the invention provides a compartmentalized kit to receive, in close confinement, one or more containers, comprising: (a) a first container containing one of the antibodies or one of the DFs of the present invention; and (b) one or more other containers containing one or more of the following: wash reagents, or reagents capable of detecting the presence of bound antibodies or hybridized DFs.

“Using the isolated proteins, the present invention also provides methods for obtaining and identifying agents capable of binding to a protein encoded by one of the ORFs. Such agents include antibodies, peptides, carbohydrates and pharmaceutical agents. These methods comprise the following steps:

“(a) contacting an agent with an isolated protein encoded by one of the ORFs of the present invention; and

“(b) determining whether the agent binds to said protein.”

The complete genome sequence of H. influenzae will be of great value to all laboratories that work with the organism, as well as for a variety of commercial purposes. Many fragments of the Haemophilus influenzae Rd genome will be identified immediately by similarity searches against GenBank and protein databases. These will be of immediate value to Haemophilus researchers and of immediate commercial value for the production of proteins or the control of gene expression. PHA synthase is a specific example. Polyhydroxybutyrate has been reported in the membranes of H. influenzae Rd, in amounts that correlate with the level of competence for transformation. The PHA synthase that synthesizes this polymer has been identified in several bacteria, none of which is evolutionarily close to H. influenzae. The gene has not yet been isolated from H. influenzae using hybridization probes or PCR techniques. The genomic sequence of the present invention, however, allows the gene to be identified using the search methods described below.

“Developing the technology and methodology to elucidate the entire genome sequence of bacterial and other small genomes has greatly enhanced, and will continue to enhance, the ability to analyze and understand chromosomal organization. Sequenced genomes will provide models for developing tools to analyze chromosome structure and function. This includes the ability to identify genes within large segments of genomic DNA and to determine their structure, position and spacing, as well as the identification of genes with potential industrial applications.

“DESCRIPTION OF THE FIGURES”

“FIG. 1.

“FIG. 2.

“FIG. 3. A comparison of the experimental coverage obtained from approximately 4000 random sequence fragments assembled with AutoAssembler with the Lander-Waterman prediction for a 2.5 Mb genome (triangles) and a 1.5 Mb genome (circles), assuming a 460 bp average sequence length and a 25 bp overlap.
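For orientation, the standard Lander-Waterman expectations can be evaluated with the parameters quoted in the caption. This is a sketch of the textbook formulas only: redundancy c = NL/G, expected fraction of the genome covered 1 - exp(-c), and expected number of contigs N exp(-c(1 - T/L)). Which of these quantities the figure actually plots is not stated in the caption, and the function name is hypothetical.

```python
import math

def lander_waterman(n_reads, read_len, min_overlap, genome_len):
    """Return (redundancy, expected covered fraction, expected contig count)."""
    c = n_reads * read_len / genome_len           # average redundancy
    sigma = 1.0 - min_overlap / read_len          # effective non-overlap fraction
    covered = 1.0 - math.exp(-c)                  # expected fraction of genome covered
    contigs = n_reads * math.exp(-c * sigma)      # expected number of contigs
    return c, covered, contigs

for genome_len in (1_500_000, 2_500_000):
    c, covered, contigs = lander_waterman(4000, 460, 25, genome_len)
    print(f"G={genome_len}: c={c:.2f}, covered={covered:.3f}, contigs={contigs:.0f}")
```

For a fixed number of reads, the smaller assumed genome size yields the higher expected coverage, which is why the two predicted curves separate.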

“FIG. 4. Data flow and computer programs used for managing, assembling, editing, and annotating the H. influenzae genome. Both Macintosh and Unix platforms are used to handle the AB 373 sequence data files (Kerlavage et al., Proceedings of the Twenty-Sixth Annual Hawaii International Conference on System Sciences, IEEE Computer Society Press, Washington D.C., 585 (1993)). Factura (AB) is a Macintosh program designed to automatically remove vector sequence from, and trim the ends of, sequence files. The program esp runs on a Macintosh platform and parses the feature data extracted from the sequence files into the Unix-based H. influenzae relational database. Assembly is accomplished by retrieving a specific set of sequence files and their associated features using stp, an X-windows graphical user interface and control program that can retrieve sequences from the H. influenzae database using standard or user-defined SQL queries. The sequence files were assembled using TIGR Assembler, an assembly engine designed at TIGR for rapid and accurate assembly of thousands of sequence fragments. TIGR Editor is a graphical interface that can parse the aligned sequence files output by TIGR Assembler and display the alignment for contig editing. Putative coding regions were identified with Genemark (Borodovsky and McIninch, Computers Chem. 17(2):123 (1993)), a Markov- and Bayes-modeled program for predicting gene locations, which was trained on an H. influenzae sequence data set. Peptide searches were performed against the three reading frames of each Genemark-predicted coding region using blaze (Brutlag et al., Computers Chem. 17:203 (1993)), run on a Maspar MP-2 massively parallel computer with 4096 processors. The results from each frame were combined into a single output file by mblzt. Optimal protein alignments were obtained using the program praze, which extends alignments across potential frameshifts. The output was inspected using a custom graphic viewer program, gbyob, which interacts directly with the H. influenzae database. The alignments were also used to identify potential frameshift errors, which were then targeted for further editing.

“FIG. 5. Outer perimeter: The location of the unique NotI restriction site (designated as nucleotide 1), the RsrII sites and the SmaI sites. Outer concentric circle: The location of each identified coding region for which a gene identification has been made. Second concentric circle: Regions of high G/C content and high A/T content. High G/C content regions are associated with the 6 ribosomal operons and the mu-like prophage. Third concentric circle: Coverage by lambda clones. Over 300 lambda clones were sequenced from each end to confirm the overall structure of the genome and to identify the 6 ribosomal operons. Fourth concentric circle: The locations of the 6 ribosomal operons, the tRNAs, and the cryptic mu-like prophage. Fifth concentric circle: Simple tandem repeats. The locations of the following repeats are shown: CTGGCT, GTCT, ATT, AATGGC, TTGA, TTGG, TTTA, TTATC, TGAC, TCGTC, AACC, TTGC, CAAT, CCAA. The putative origin of replication is illustrated by the outward-pointing arrows originating near base 603,000. Two potential termination sequences are shown near the opposite midpoint of the circle.

“FIGS. 6(A)-6(AN). Complete map of the Haemophilus influenzae Rd genome. The predicted coding regions are shown on each strand. The rRNA and tRNA genes are represented as triangles and lines, respectively. GeneID numbers correspond to those in Tables 1(a), 1(b) and 2. Where possible, three-letter designations are also provided.

“FIG. 7. A comparison of the region of the H. influenzae chromosome containing the 8 genes of the fimbrial gene cluster present in H. influenzae type b with the same region in H. influenzae Rd. The region is flanked by the pepN and purE genes in both organisms. In the Rd strain, the 8 genes of the fimbrial gene cluster have been deleted. In their place, the Rd strain contains a 172 bp spacer region, which is still flanked by the pepN and purE genes.

“FIG. 8. Hydrophobicity analysis of five predicted channel proteins. The amino acid sequences of five predicted coding regions show no homology to known peptide sequences (GenBank release 87), yet each contains multiple hydrophobic domains, which are characteristic of channel-forming proteins. The predicted coding region sequences were analyzed with the Kyte-Doolittle algorithm (Kyte & Doolittle, J. Mol. Biol.).
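A Kyte-Doolittle hydropathy profile is a sliding-window average of per-residue hydropathy values. The sketch below uses the published Kyte-Doolittle scale; the window width of 11 is an assumed default, since the caption as preserved here does not give the window used, and the toy sequence is invented.

```python
# Kyte-Doolittle hydropathy scale (per-residue values).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(protein, window=11):
    """Return the mean hydropathy of each window along the sequence."""
    scores = [KD_SCALE[aa] for aa in protein.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]

# Toy sequence: a hydrophobic stretch followed by a charged stretch.
profile = hydropathy_profile("L" * 11 + "D" * 11)
print(round(profile[0], 2), round(profile[-1], 2))
```

Runs of consecutive windows with strongly positive averages are the kind of feature that suggests a membrane-spanning domain.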

“The present invention is based on the sequencing of the Haemophilus influenzae Rd genome. The primary nucleotide sequence that was generated is provided in SEQ ID NO:1. As used herein, the ‘primary sequence’ refers to the nucleotide sequence represented by the IUPAC nomenclature system.

“The sequence in SEQ ID NO:1 is oriented relative to a unique Not I restriction site found in the Haemophilus influenzae Rd genome. A skilled artisan will readily recognize that this start/stop point was chosen for convenience and has no structural significance.

“The present invention provides the nucleotide sequence of SEQ ID NO:1, or a representative fragment thereof, in a form that can be readily used, analyzed and interpreted by a skilled artisan. The sequence can be provided as a contiguous string of primary sequence information corresponding to the nucleotide sequence in SEQ ID NO:1.

“As used herein, a ‘representative fragment of the nucleotide sequence shown in SEQ ID NO:1’ refers to any portion of SEQ ID NO:1 that is not currently available in a publicly accessible database. The preferred representative fragments of this invention are Haemophilus influenzae open reading frames, expression modulating fragments, uptake modulating fragments, and fragments that can be used to diagnose the presence of Haemophilus influenzae Rd in a sample. Tables 1(a) and 2 provide a non-limiting identification of such preferred representative fragments.

“The nucleotide sequence information in SEQ ID NO:1 was obtained by sequencing the Haemophilus influenzae Rd genome using a megabase shotgun sequencing method. Using three parameters of accuracy discussed in the Examples, the present inventors calculated that the sequence in SEQ ID NO:1 has a maximum accuracy of 99.98%. The nucleotide sequence in SEQ ID NO:1 is therefore a highly accurate, although not necessarily 100% perfect, representation of the nucleotide sequence of the Haemophilus influenzae Rd genome.

“As discussed in detail below, using the information provided in SEQ ID NO:1 and in Tables 1(a) and 2 together with routine cloning and sequencing methods, one of ordinary skill in the art will be able to clone and sequence all ‘representative fragments’ of interest, including open reading frames (ORFs) encoding a large variety of Haemophilus influenzae proteins. In rare instances, this may reveal a nucleotide sequence error in the sequence disclosed in SEQ ID NO:1. Thus, once the present invention is made available (i.e., once the information in SEQ ID NO:1 and Tables 1(a) and 2 has been made available), resolving a rare sequencing error in SEQ ID NO:1 will be well within the skill of the art. Nucleotide sequence editing software is publicly available. For example, Applied Biosystem’s AutoAssembler can be used as an aid during visual inspection of nucleotide sequences.

“Even if all of the rare sequencing errors in SEQ ID NO:1 were corrected, the resulting nucleotide sequence would still be at least 99.9% identical to the nucleotide sequence in SEQ ID NO:1.

“The nucleotide sequences of the genomes of different strains of Haemophilus influenzae differ slightly. However, the nucleotide sequences of the genomes of all Haemophilus influenzae strains will be at least 99.9% identical to the nucleotide sequence provided in SEQ ID NO:1.

“The present invention also provides nucleotide sequences that are at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1, in a form that can be readily used, analyzed, and interpreted by the skilled artisan. Routine methods for determining whether a nucleotide sequence is at least 99.9% identical to the nucleotide sequence of SEQ ID NO:1 are readily available to the skilled artisan. The well-known fasta algorithm is one example (Pearson & Lipman, Proc. Natl. Acad. Sci.).
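The 99.9% identity criterion can be illustrated with a short calculation. This is a minimal sketch, not the fasta algorithm named above: it assumes the two sequences have already been aligned to equal length by some other tool, and the function names and the treatment of gap columns as mismatches are choices made only for this example.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns holding identical residues.

    Both inputs must be pre-aligned strings of equal length. A '-' marks a
    gap, and any column containing a gap is counted as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    if not aligned_a:
        raise ValueError("empty alignment")
    matches = sum(1 for x, y in zip(aligned_a.upper(), aligned_b.upper())
                  if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def meets_threshold(aligned_a, aligned_b, threshold=99.9):
    """True when the aligned sequences are at least `threshold` percent identical."""
    return percent_identity(aligned_a, aligned_b) >= threshold
```

Under this definition a 2,000-nucleotide alignment tolerates two mismatches (99.9%) but not three (99.85%).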

“Computer Related Embodiments.”

“The nucleotide sequence provided in SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1 may be ‘provided’ in a variety of media to facilitate its use. As used herein, ‘provided’ refers to a manufacture, other than an isolated nucleic acid molecule, that contains a nucleotide sequence of the present invention, i.e., the nucleotide sequence of SEQ ID NO:1, a representative fragment thereof, or a nucleotide sequence at least 99.9% identical to SEQ ID NO:1. Such a manufacture provides the Haemophilus influenzae Rd genome (or a subset thereof) in a form that allows a skilled artisan to examine the manufacture using means not directly applicable to examining the Haemophilus influenzae Rd genome, or a subset thereof, as it exists in nature.

“In one embodiment of this invention, a nucleotide sequence of the present invention can be recorded on computer-readable media. As used herein, ‘computer-readable media’ refers to any medium that can be read and accessed directly by a computer. Such media include, but are not limited to: magnetic storage media, such as floppy disks, hard disk storage media, and magnetic tape; optical storage media, such as CD-ROM; electrical storage media, such as RAM and ROM; and hybrids of these categories, such as magnetic/optical storage media. A skilled artisan will readily appreciate how any of the presently known computer-readable media can be used to create a manufacture having a nucleotide sequence of the present invention recorded on it.

“As used herein, ‘recorded’ refers to a process for storing information on a computer-readable medium. A skilled artisan can readily adopt any of the presently known methods for recording information on computer-readable media to generate manufactures comprising the nucleotide sequence information of the present invention.

“A variety of data storage structures are available to a skilled artisan for creating a computer-readable medium having a nucleotide sequence of the present invention recorded on it. The choice of data storage structure will generally be based on the means chosen to access the stored information. In addition, a variety of data processor programs and formats can be used to store the nucleotide sequence information on computer-readable media. The sequence information can be represented in a word processing text file, formatted in commercially available software such as WordPerfect and Microsoft Word, or represented as an ASCII file stored in a database application such as DB2, Sybase, Oracle, or the like. A skilled artisan can readily adapt any number of data processor structuring formats (e.g., text file or database) to obtain a computer-readable medium having the nucleotide sequence information of the present invention recorded on it.
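The two storage options described above, a plain ASCII file and a database application, can be sketched as follows. This is a hedged illustration only: the FASTA-style text layout, the table and column names, and the use of SQLite as a stand-in for database applications such as DB2, Sybase, or Oracle are all assumptions made for the example.

```python
import sqlite3


def write_fasta(path, name, sequence, width=60):
    """Record a sequence as an ASCII text file, `width` residues per line."""
    with open(path, "w", encoding="ascii") as handle:
        handle.write(f">{name}\n")
        for i in range(0, len(sequence), width):
            handle.write(sequence[i:i + width] + "\n")


def read_fasta(path):
    """Read back a single-record file written by write_fasta."""
    name, chunks = None, []
    with open(path, encoding="ascii") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                name = line[1:]
            elif line:
                chunks.append(line)
    return name, "".join(chunks)


def store_in_database(conn, name, sequence):
    """Record a sequence in a relational table (hypothetical schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(name TEXT PRIMARY KEY, residues TEXT NOT NULL)"
    )
    conn.execute("INSERT OR REPLACE INTO sequences VALUES (?, ?)",
                 (name, sequence))
    conn.commit()


def fetch_region(conn, name, start, end):
    """Return residues start..end (1-based, inclusive), or None if absent."""
    row = conn.execute(
        "SELECT substr(residues, ?, ?) FROM sequences WHERE name = ?",
        (start, end - start + 1, name),
    ).fetchone()
    return row[0] if row else None
```

The text file suits sequential reading of the whole sequence, while the database form suits retrieval of a region by coordinates, which reflects the point that the access method drives the choice of storage structure.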

By providing the nucleotide sequence of SEQ ID NO:1, or a representative fragment thereof, in computer-readable form, a skilled artisan can routinely access the sequence information for a variety of purposes. Computer software that allows a skilled artisan to access sequence information provided on a computer-readable medium is publicly available. The Examples demonstrate how software implementing the BLAST (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) and BLAZE (Brutlag et al., Comp. Chem. 17:203-207 (1993)) search algorithms on a Sybase system was used to identify open reading frames (ORFs) within the Haemophilus influenzae Rd genome that contain homology to ORFs or proteins from other organisms. Such ORFs are protein-encoding fragments of the Haemophilus influenzae Rd genome and are useful in producing commercially important proteins, such as enzymes used in fermentation reactions, and in the production of commercially useful metabolites.

“The present invention further provides systems, particularly computer-based systems, that contain the sequence information described herein. Such systems are designed to identify commercially important fragments of the Haemophilus influenzae Rd genome.

“As used herein, ‘a computer-based system’ refers to the hardware means, software means, and data storage means used to analyze the nucleotide sequence information of the present invention. The minimum hardware means of the computer-based systems of the present invention comprises a central processing unit (CPU), input means, output means, and data storage means. A skilled artisan will readily appreciate that any of the currently available computer-based systems is suitable for use in the present invention.

“As stated above, the computer-based systems of the present invention comprise a data storage means having stored therein a nucleotide sequence of the present invention, together with the hardware means and software means necessary for supporting and implementing a search means. As used herein, ‘data storage means’ refers to memory that can store the nucleotide sequence information of the present invention, or a memory access means that can access manufactures having the nucleotide sequence information of the present invention recorded on them.

“As used herein, ‘search means’ refers to one or more programs implemented on the computer-based system to compare a target sequence or target structural motif with the sequence information stored in the data storage means. Search means are used to identify fragments or regions of the Haemophilus influenzae Rd genome that match a particular target sequence or target motif. A variety of known algorithms have been publicly disclosed, and a variety of commercially available software for conducting search means can be used in the computer-based systems of the present invention. Examples of such software include, but are not limited to, MacPattern (EMBL), BLASTN, and BLASTX (NCBIA). A skilled artisan will readily recognize that any of the available algorithms or implementing software packages for conducting homology searches can be adapted for use in the present computer-based systems.

“As used herein, a ‘target sequence’ can be any DNA or amino acid sequence of six or more nucleotides or two or more amino acids. A skilled artisan will readily recognize that the longer a target sequence is, the less likely it is to be present in the database as a random occurrence. The most preferred length of a target sequence is from about 10 to 100 amino acids, or from about 30 to 300 nucleotide residues. However, it is well recognized that searches for commercially important fragments of the Haemophilus influenzae Rd genome, such as sequence fragments involved in gene expression and protein processing, may use target sequences of shorter length.
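The relationship between target length and chance matches can be made concrete with a rough estimate. The sketch below assumes a simplified model in which every residue is independent and equally likely, which real genomes do not satisfy, and the 1.8 million nucleotide database size in the usage note is an illustrative figure, not a value taken from this document.

```python
def expected_random_hits(target_length, database_length,
                         alphabet_size=4, strands=2):
    """Expected number of exact chance matches of one target in a database.

    Assumes independent, uniformly distributed residues. Use
    alphabet_size=20 and strands=1 for amino acid sequences.
    """
    if target_length < 1 or database_length < target_length:
        return 0.0
    positions = database_length - target_length + 1
    return strands * positions / (alphabet_size ** target_length)
```

For example, `expected_random_hits(6, 1_800_000)` is in the hundreds, whereas `expected_random_hits(30, 1_800_000)` is vanishingly small, which is why a 30-nucleotide target that matches is unlikely to match by chance.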

“As used herein, ‘a target structural motif,’ or ‘target motif,’ refers to any rationally selected sequence or combination of sequences in which the sequence(s) are chosen based on a three-dimensional configuration that is formed upon the folding of the target motif. A variety of target motifs are known in the art. Protein target motifs include, but are not limited to, enzymic active sites and signal sequences. Nucleic acid target motifs include, but are not limited to, promoter sequences, hairpin structures, and inducible expression elements (protein-binding sequences).

The computer-based systems of the present invention can use a variety of structural formats for input and output. A preferred output format ranks fragments of the Haemophilus influenzae Rd genome by their degree of homology to the target sequence or target motif. Such a presentation gives a skilled artisan a ranking of sequences that contain various amounts of the target sequence or target motif and identifies the degree of homology contained in each identified fragment.
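A search means with ranked output can be sketched in a few lines. This is a deliberately simple, brute-force, ungapped comparison written for illustration; it is not BLAST, BLAZE, or any other program named in this document, and the function names, the 70% default cutoff, and the tuple layout of the results are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the letters A, C, G, T."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def rank_matches(genome, target, min_identity=70.0, top=10):
    """Compare the target with every window on both strands.

    Returns up to `top` hits as (percent identity, start, end, strand),
    with 1-based inclusive forward-strand coordinates, best hit first.
    """
    genome, target = genome.upper(), target.upper()
    k = len(target)
    if k == 0:
        raise ValueError("target sequence must not be empty")
    hits = []
    # A match to the reverse complement on the forward strand is a match
    # to the target itself on the opposite strand.
    for strand, query in (("+", target), ("-", reverse_complement(target))):
        for start in range(len(genome) - k + 1):
            window = genome[start:start + k]
            same = sum(1 for a, b in zip(window, query) if a == b)
            identity = 100.0 * same / k
            if identity >= min_identity:
                hits.append((identity, start + 1, start + k, strand))
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:top]
```

The sorted list corresponds to the preferred output format: fragments ordered by their degree of similarity to the target, each reported with its position.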

“A variety of comparing means can be used to compare a target sequence or target motif with the data storage means to identify sequence fragments of the Haemophilus influenzae Rd genome. In the Examples, software implementing the BLAST and BLAZE algorithms (Altschul et al., J. Mol. Biol. 215:403-410 (1990)) was used. A skilled artisan will readily recognize that any of the publicly available homology search programs can be used as the search means for the computer-based systems of the present invention.

FIG. 2 provides a block diagram of a computer system 102 that can be used to implement the present invention. The computer system 102 includes a processor 106 connected to a bus 104. Also connected to the bus 104 are a main memory 108 (preferably implemented as random-access memory, RAM) and a variety of secondary storage devices 110, such as a hard drive 112 and a removable medium storage device 114. The removable medium storage device 114 may be, for example, a floppy disk drive, a CD-ROM drive, or a magnetic tape drive. A removable storage medium 116 (such as a floppy disk, a compact disk, or a magnetic tape) containing control logic and/or data may be inserted into the removable medium storage device 114. The computer system 102 includes appropriate software for reading the control logic and/or the data from the removable storage medium 116 once it is inserted into the removable medium storage device 114.

A nucleotide sequence of the present invention may be stored in a well-known manner in the main memory 108, any of the secondary storage devices 110, and/or a removable storage medium 116. Software for accessing and processing the genomic sequence (such as search tools, comparing tools, etc.) resides in main memory 108 during execution.”

“Biochemical Embodiments”

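“Another embodiment of the present invention is directed to isolated fragments of the Haemophilus influenzae Rd genome. These fragments include the open reading frames (ORFs), expression modulating fragments (EMFs), uptake modulating fragments (UMFs), and diagnostic fragments (DFs) defined below.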

“As used herein, an ‘isolated nucleic acid molecule’ or an ‘isolated fragment of the Haemophilus influenzae Rd genome’ refers to a nucleic acid molecule that possesses a specific nucleotide sequence and has been subjected to purification means to reduce the number of compounds normally associated with the composition. A variety of purification means can be used to generate the isolated fragments of the present invention. These include, but are not limited to, methods that separate the constituents of a solution based on charge, solubility, or size.

“In one embodiment, Haemophilus influenzae Rd DNA can be mechanically sheared to produce fragments 15-20 kb in length. These fragments can then be used to generate a Haemophilus influenzae Rd library by inserting them into lambda clones as described in the Examples. Primers flanking, for example, an ORF provided in Table 1(a) can then be generated using the nucleotide sequence information provided in SEQ ID NO:1. PCR cloning can then be used to isolate the ORF from the lambda DNA library. PCR cloning is well known in the art. Thus, given the availability of SEQ ID NO:1, Table 1(a), and Table 2, it would be routine to isolate any ORF or other nucleic acid fragment of the present invention.

“The isolated nucleic acid molecules of the present invention include, but are not limited to, single-stranded and double-stranded DNA and single-stranded RNA.

“As used herein, an ‘open reading frame,’ ORF, means a series of triplets that code for amino acids and that is translatable into protein. Tables 1(a), 1(b) and 2 identify ORFs in the Haemophilus influenzae Rd genome. In particular, Table 1(a) indicates the location of ORFs within the Haemophilus influenzae Rd genome that encode the recited proteins, based on homology matching with protein sequences from the organism appearing in parentheticals (see the fourth column of Table 1(a)).”
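The notion of an ORF as a run of translatable triplets can be illustrated with a six-frame scan. This is a minimal sketch under stated assumptions: it treats ATG as the only start codon, uses the three standard stop codons, applies an arbitrary minimum length, and reports reverse-strand ORFs with a start coordinate greater than the stop coordinate. None of these choices is taken from the Examples, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def _orfs_one_strand(seq, min_codons):
    """Half-open (begin, end) spans of ATG...stop runs in the three frames."""
    spans = []
    for frame in range(3):
        begin = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if begin is None and codon == START_CODON:
                begin = i
            elif begin is not None and codon in STOP_CODONS:
                if (i - begin) // 3 >= min_codons:
                    spans.append((begin, i + 3))  # span includes the stop codon
                begin = None
    return spans


def find_orfs(genome, min_codons=30):
    """Return (start, stop, strand) with 1-based forward-strand coordinates.

    Reverse-strand ORFs are reported with start > stop.
    """
    genome = genome.upper()
    n = len(genome)
    found = [(b + 1, e, "+") for b, e in _orfs_one_strand(genome, min_codons)]
    for b, e in _orfs_one_strand(reverse_complement(genome), min_codons):
        # Map positions on the reverse complement back to the forward strand.
        found.append((n - b, n - e + 1, "-"))
    return found
```

A scan of this kind only proposes candidates; deciding which candidates encode proteins requires further evidence such as the homology matching described for Table 1(a).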

“The first column of Table 1(a) provides the ‘GeneID’ of a particular ORF. This information is useful for two reasons. First, the complete map of the Haemophilus influenzae Rd genome provided in FIGS. 6(A)-6(AN) refers to the ORFs according to their GeneID numbers. Second, Table 1(b) uses the GeneID numbers to indicate which ORFs were previously provided in a public database.

“The second and third columns of Table 1(a) indicate an ORF’s position within the nucleotide sequence provided in SEQ ID NO:1. One of ordinary skill in the art will recognize that ORFs may be oriented in opposite directions in the Haemophilus influenzae genome. This is reflected in columns 2 and 3.

“The fifth column of Table 1(a) indicates the percent identity between the protein sequence encoded by an ORF and the corresponding protein from the organism appearing in parentheticals in the fourth column.”

“The sixth column of Table 1(a) indicates the percent similarity between the protein sequence encoded by an ORF and the corresponding protein from the organism appearing in parentheticals in the fourth column. The concepts of percent identity and percent similarity between two polypeptide sequences are well understood in the art. For example, two polypeptides of 10 amino acids each that differ at three positions (e.g., at positions 1, 3, and 5) are said to have a percent identity of 70%. The same two polypeptides would be deemed to have a percent similarity of 80% if, for example, the amino acid moieties at position 5, although not identical, were ‘similar.’”

“The seventh column of Table 1(a) indicates the length of the amino acid homology match.”
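The percent identity and percent similarity example given for the fifth and sixth columns can be reproduced directly. In this sketch the grouping of ‘similar’ amino acids is an illustrative assumption (real comparisons typically rely on a substitution matrix), and the two 10-residue peptides are made up for the demonstration.

```python
# Illustrative groups of biochemically similar residues (an assumption).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"),
                  set("DE"), set("ST"), set("NQ")]


def are_similar(a, b):
    """True when two residues are identical or share a similarity group."""
    return a == b or any(a in g and b in g for g in SIMILAR_GROUPS)


def identity_and_similarity(p, q):
    """Return (percent identity, percent similarity) for equal-length peptides."""
    if len(p) != len(q) or not p:
        raise ValueError("peptides must be non-empty and of equal length")
    n = len(p)
    identical = sum(a == b for a, b in zip(p, q))
    similar = sum(are_similar(a, b) for a, b in zip(p, q))
    return 100.0 * identical / n, 100.0 * similar / n


# Two hypothetical peptides differing at positions 1, 3, and 5;
# only position 5 (L versus I) is a similar substitution.
first = "AKGTLSCNQE"
second = "WKPTISCNQE"
print(identity_and_similarity(first, second))  # (70.0, 80.0)
```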

“Table 2 lists ORFs of the Haemophilus influenzae Rd genome that encode polypeptide sequences but did not produce a homology match with a known protein sequence from another organism. The Examples below provide additional information about the criteria and algorithms used to perform the homology searches.

“A skilled artisan can readily identify ORFs in the Haemophilus influenzae Rd genome other than those listed in Tables 1(a), 1(b) and 2, such as ORFs that overlap an identified ORF or are encoded by its opposite strand. Such ORFs can also be identified using the computer-based systems of the present invention.

“As used herein, an ‘expression modulating fragment,’ EMF, means a series of nucleotide molecules that modulates the expression of an operably linked ORF or EMF.

“As used herein, a sequence is said to ‘modulate the expression of an operably linked sequence’ when the expression of the sequence is altered by the presence of the EMF. EMFs include, but are not limited to, promoters and promoter modulating sequences. One class of EMFs consists of fragments that induce the expression of an operably linked ORF in response to a specific regulatory factor or physiological event. Reviews of known EMFs from Haemophilus are provided by Tomb et al., Gene 104:1-10 (1991), and Chandler, M. S., Proc. Natl. Acad. Sci.

EMF sequences can be identified within the Haemophilus influenzae Rd genome by their proximity to the ORFs provided in Tables 1(a), 1(b) and 2. An intergenic segment, or a fragment of an intergenic segment, from about 10 to 200 nucleotides in length, taken 5′ of any one of the ORFs of Tables 1(a), 1(b) or 2 will modulate the expression of an operably linked 3′ ORF in a fashion similar to that found with the naturally linked ORF sequence. As used herein, an ‘intergenic segment’ refers to the fragments of the Haemophilus genome that lie between two of the ORFs described herein. Alternatively, EMFs can be identified by using known EMFs as target sequences or target motifs in the computer-based systems of the present invention.
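Extracting the segment 5′ of an ORF from its coordinates can be sketched as below. The sketch assumes 1-based inclusive coordinates, assumes that a reverse-strand ORF is written with its start greater than its stop, treats the sequence as linear, and does not trim the segment where it would run into a neighbouring ORF; the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def upstream_segment(genome, orf_start, orf_stop, max_len=200):
    """Up to max_len nucleotides immediately 5' of an ORF, read 5' to 3'.

    Coordinates are 1-based and inclusive on the forward strand;
    orf_start > orf_stop marks an ORF on the reverse strand (assumption).
    """
    if orf_start <= orf_stop:
        # Forward strand: the 5' flank lies at lower coordinates.
        begin = max(0, orf_start - 1 - max_len)
        return genome[begin:orf_start - 1].upper()
    # Reverse strand: the 5' flank lies at higher forward coordinates.
    end = min(len(genome), orf_start + max_len)
    return reverse_complement(genome[orf_start:end])
```

A candidate produced this way is only a starting point; whether it acts as an EMF would still have to be confirmed experimentally.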

The presence and activity of an EMF can be confirmed using an EMF trap vector. An EMF trap vector contains a cloning site 5′ to a marker sequence. A marker sequence encodes an identifiable phenotype, such as antibiotic resistance or a complementing nutritional auxotrophic factor, which can be identified or assayed when the EMF trap vector is placed within an appropriate host under appropriate conditions. As described above, an EMF will modulate the expression of an operably linked marker sequence. A more detailed discussion of various marker sequences is provided below.

“A sequence suspected of being an EMF is cloned in all three reading frames into one or more restriction sites upstream of the marker sequence in the EMF trap vector. The vector is then transformed into an appropriate host using known procedures, and the phenotype of the transformed host is examined under appropriate conditions. As described above, an EMF will modulate the expression of an operably linked marker sequence.

“As used herein, an ‘uptake modulating fragment,’ UMF, means a series of nucleotide molecules that mediates the uptake of a linked DNA fragment into a cell. UMFs can be readily identified using the computer-based systems described above.

The presence and activity of a UMF can be confirmed by attaching the suspected UMF to a marker sequence. The resulting nucleic acid molecule is then incubated with an appropriate host under appropriate conditions, and the uptake of the marker sequence is determined. As described above, a UMF will increase the frequency of uptake of a linked marker sequence. A review of DNA uptake in Haemophilus is provided by Goodgall, S. H., et al., J. Bact. 172:5924-5928 (1990).”

“As used herein, a ‘diagnostic fragment,’ DF, means a series of nucleotide molecules that selectively hybridizes to Haemophilus influenzae sequences. DFs can be readily identified by identifying unique sequences within the Haemophilus influenzae Rd genome, or by generating and testing probes or amplification primers consisting of the DF sequence in an appropriate diagnostic format that determines amplification or hybridization selectivity.

“The sequences falling within the scope of the present invention are not limited to the specific sequences described herein, but also include allelic and species variations thereof. Allelic and species variations can be routinely determined by comparing the sequence provided in SEQ ID NO:1, or a representative fragment thereof, with a sequence from another isolate of the same species. Furthermore, to accommodate codon variability, the invention includes nucleic acid molecules coding for the same amino acid sequences as the specific ORFs disclosed herein. In other words, substitution of one codon for another that encodes the same amino acid is contemplated within the coding region of an ORF.

“Any specific sequence disclosed herein can be readily screened for errors by resequencing a particular fragment, such as an ORF, in both directions (i.e., sequencing both strands). Alternatively, error screening can be performed by sequencing corresponding polynucleotides of Haemophilus influenzae origin, isolated by using part or all of the fragments in question as a probe or primer.

Each of the ORFs of the Haemophilus influenzae Rd genome, as well as the EMF found 5′ to the ORF, can be used in numerous ways as polynucleotide reagents. The sequences can be used to detect the presence of a specific microbe, such as Haemophilus influenzae Rd, in a sample. This is especially the case for the fragments and ORFs of Table 2, which will be highly selective for Haemophilus influenzae.

“In addition, the fragments of the present invention can be used to control gene expression through triple helix formation or through antisense DNA or RNA. Both methods are based on the binding of a polynucleotide sequence to DNA or RNA. Polynucleotides suitable for use in these methods are usually 20 to 40 bases long and are designed to be complementary either to a region of the gene involved in transcription (triple helix: see Lee et al., Nucl. Acids Res. 6:3073 (1979); Cooney et al., Science 241:456 (1988); and Dervan et al., Science 251:1360 (1991)) or to the mRNA itself (antisense: Okano, J. Neurochem. 56:560 (1991); Oligodeoxynucleotides as Antisense Inhibitors of Gene Expression, CRC Press, Boca Raton, Fla. (1988)).”

“Triple helix formation optimally results in a shut-off of RNA transcription from DNA, while antisense RNA hybridization blocks the translation of an mRNA molecule into a polypeptide. Both techniques have been demonstrated to be effective in model systems. The sequences of the present invention provide the information necessary to design an antisense or triple helix oligonucleotide.
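Deriving an antisense oligonucleotide from sequence information amounts to taking the reverse complement of the chosen mRNA region. The sketch below is illustrative only: the function names and the example mRNA are hypothetical, the 20 to 40 base limits follow the typical lengths stated above, and practical design would also weigh factors such as secondary structure and specificity that are not modelled here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def antisense_oligo(mrna, start, length=20):
    """DNA oligonucleotide (5' to 3') complementary to a region of an mRNA.

    `start` is a 0-based offset into the mRNA; U is treated as T.
    """
    if not 20 <= length <= 40:
        raise ValueError("length is expected to be 20 to 40 bases")
    region = mrna.upper().replace("U", "T")[start:start + length]
    if len(region) < length:
        raise ValueError("target region runs past the end of the mRNA")
    return reverse_complement(region)


# Hypothetical mRNA used only to exercise the function.
example_mrna = "AUGGCUAAAGGUCCUGAAUUCGCAUAA"
print(antisense_oligo(example_mrna, 0))  # AATTCAGGACCTTTAGCCAT
```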

“The present invention also provides recombinant constructs comprising one or more fragments of the Haemophilus influenzae Rd genome. The recombinant constructs of the present invention comprise a vector, such as a plasmid or viral vector, into which a fragment of the Haemophilus influenzae Rd genome has been inserted in a forward or reverse orientation. A vector comprising one of the ORFs may further comprise regulatory sequences, including, for example, a promoter operably linked to the ORF. A vector comprising an EMF or UMF may further comprise a marker sequence or a heterologous ORF operably linked to the EMF or UMF. Many suitable vectors and promoters are available for generating the recombinant constructs of the present invention. The following vectors are provided by way of example. Bacterial: pBs, phagescript, PsiX174, pBluescript SK, pBs KS, pNH8a, pNH16a, pNH18a, pNH46a (Stratagene); pTrc99A, pKK223-3, pKK233-3, pDR540, pRIT5 (Pharmacia). Eukaryotic: pWLneo, pSV2cat, pOG44, pSG, pSVL.

“Promoter regions can be selected from any desired gene using CAT (chloramphenicol transferase) vectors or other vectors with selectable markers. Two appropriate vectors are pKK232-8 and pCM7. Particular named bacterial promoters include lacI, lacZ, T3, T7, gpt, lambda PR, and trc. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, and early and late SV40. Selection of the appropriate vector and promoter is well within the level of ordinary skill in the art.

The present invention also provides host cells containing any one of the isolated fragments of the Haemophilus influenzae Rd genome, where the fragment has been introduced into the host cell using known transformation methods. The host cell can be a higher eukaryotic cell, such as a mammalian cell; a lower eukaryotic cell, such as a yeast cell; or a procaryotic cell, such as a bacterial cell. The recombinant construct can be introduced into the host cell by calcium phosphate transfection, DEAE-dextran mediated transfection, or electroporation (Davis, L. et al., Basic Methods in Molecular Biology (1986)).”

“Host cells containing one of the fragments of the Haemophilus influenzae Rd genome of the present invention can be used in conventional ways to produce the gene product encoded by the isolated fragment (in the case of an ORF), or to produce a heterologous protein under the control of the EMF.”

“The present invention also provides isolated polypeptides encoded by the nucleic acid fragments of the present invention or by degenerate variants of those nucleic acid fragments. By ‘degenerate variant’ is intended nucleotide fragments that differ from a nucleic acid fragment of the present invention (e.g., an ORF) in nucleotide sequence but, due to the degeneracy of the genetic code, encode an identical polypeptide sequence. Preferred nucleic acid fragments of the present invention are the ORFs depicted in Table 1(a) that encode proteins.
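Whether one fragment is a degenerate variant of another can be checked by translating both and comparing the results. This is a minimal sketch that assumes the standard genetic code and in-frame input sequences; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(orf):
    """Translate an in-frame DNA sequence; '*' marks a stop codon."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    return "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, usable, 3))


def synonymous_codons(codon):
    """All codons encoding the same amino acid as the given codon."""
    aa = CODON_TABLE[codon.upper()]
    return sorted(c for c, a in CODON_TABLE.items() if a == aa)


def is_degenerate_variant(orf_a, orf_b):
    """True when the sequences differ but encode the same polypeptide."""
    return (orf_a.upper() != orf_b.upper()
            and translate(orf_a) == translate(orf_b))
```

For instance, `ATGGCTTAA` and `ATGGCCTAA` differ at one nucleotide, yet both translate to methionine, alanine, stop, so each is a degenerate variant of the other.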

A variety of methods known in the art can be used to obtain any one of the isolated polypeptides or proteins of the present invention. At the simplest level, the amino acid sequence can be synthesized using commercially available peptide synthesizers. This is particularly useful for producing small peptides and fragments of larger polypeptides. Fragments are useful, for example, in generating antibodies against the native polypeptide. In an alternative method, the polypeptide or protein is purified from bacterial cells that naturally produce it. One skilled in the art can readily follow known methods for isolating polypeptides and proteins in order to obtain one of the isolated polypeptides or proteins of the present invention. These include, but are not limited to, immunochromatography, HPLC, size-exclusion chromatography, ion-exchange chromatography, and immuno-affinity chromatography.”

The polypeptides and proteins of the present invention can alternatively be purified from cells that have been altered to express the desired polypeptide or protein. As used herein, a cell is said to be altered to express a desired polypeptide or protein when the cell, through genetic manipulation, is made to produce a polypeptide or protein that it normally does not produce or that it normally produces at a lower level. One skilled in the art can readily adapt procedures for introducing and expressing recombinant or synthetic sequences in eukaryotic or prokaryotic cells in order to generate a cell that produces one of the polypeptides or proteins of the present invention.

“Any host/vector system can be used to express one or more of the ORFs of the present invention. These include, but are not limited to, eukaryotic hosts such as HeLa cells, Cv-1 cells, COS cells, and Sf9 cells, as well as prokaryotic hosts such as E. coli and B. subtilis. The most preferred cells are those that do not normally express the particular polypeptide or protein.

“‘Recombinant,’ as used herein, means that a polypeptide or protein is derived from recombinant (e.g., microbial or mammalian) expression systems. ‘Microbial’ refers to recombinant polypeptides or proteins made in bacterial or fungal expression systems. As a product, ‘recombinant microbial’ defines a polypeptide or protein essentially free of native endogenous substances and unaccompanied by associated native glycosylation. Polypeptides or proteins expressed in most bacterial cultures (e.g., E. coli) will be free of glycosylation modifications, whereas polypeptides or proteins expressed in yeast will have a glycosylation pattern different from that expressed in mammalian cells.

“‘Nucleotide sequence’ refers to a heteropolymer of deoxyribonucleotides. Generally, DNA segments encoding the polypeptides and proteins provided by this invention are assembled from fragments of the Haemophilus influenzae Rd genome and short oligonucleotide linkers.

“‘Recombinant expression vehicle or vector’ refers to a plasmid, phage, virus, or vector for expressing a polypeptide from a DNA (RNA) sequence. The expression vehicle can comprise a transcriptional unit comprising an assembly of (1) a genetic element or elements having a regulatory role in gene expression, (2) a structural or coding sequence that is transcribed into mRNA and translated into protein, and (3) appropriate transcription initiation and termination sequences. Structural units intended for use in yeast or eukaryotic expression systems preferably include a leader sequence enabling extracellular secretion of the translated protein by a host cell. Alternatively, where a recombinant protein is expressed without a leader or transport sequence, it may include an N-terminal methionine residue. This residue may or may not be subsequently cleaved from the expressed recombinant protein to provide a final product.

“‘Recombinant expression system’ means host cells that have stably integrated a recombinant transcriptional unit into their chromosomal DNA. The cells can be prokaryotic or eukaryotic. Recombinant expression systems as defined herein will express heterologous polypeptides or proteins upon induction of the regulatory elements linked to the DNA segment or synthetic gene to be expressed.

“Mature proteins can be expressed in mammalian cells, yeast, and bacteria. Cell-free translation systems can also be employed to produce such proteins using RNAs derived from the constructs of the present invention. Appropriate cloning and expression vectors for use with prokaryotic and eukaryotic hosts are described by Sambrook et al. in Molecular Cloning: A Laboratory Manual, Cold Spring Harbor, N.Y. (1989), the disclosure of which is hereby incorporated by reference.

“Generally, recombinant expression vectors will include origins of replication and selectable markers permitting transformation of the host cell, e.g., the ampicillin resistance gene of E. coli and the S. cerevisiae TRP1 gene, and a promoter derived from a highly expressed gene to direct transcription of a downstream structural sequence. Such promoters can be derived, for example, from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase, or other highly expressed proteins such as acid phosphatase. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences and, preferably, a leader sequence capable of directing secretion of the translated protein into the periplasmic space or extracellular medium. Optionally, the heterologous sequence can encode a fusion protein including an N-terminal identification peptide imparting desired characteristics, e.g., stabilization or simplified purification.
