1. Field of the Invention
The present invention relates to a system for describing glycan structures that can be easily stored and interpreted by computers.
2. Related Art
Glycans are complex chains of oligosaccharides that play critical roles in several structural and modulatory functions in cells. Although glycans are considered one of the most important classes of molecules after DNA and proteins, the development of informatics methods to support and advance their research has lagged behind those available for other types of data. It is only in recent years that there has been an increase in the availability of informatics resources such as glycan databases and algorithms for analyzing glycan structures and their interactions (Pérez S, Mulloy B (2005) “Prospects for glycoinformatics.” Curr Opin Struct Biol 15:517-524 (“Pérez et al.”)). Such disparity is mainly attributable to the structural complexity of carbohydrates compared to the simpler linear structure of DNA and proteins. While nucleotide and amino acid residues can be represented by four and twenty letters respectively, glycan sequences comprise a larger number of base residues and contain additional information on linkages and branching (von der Lieth C W (2004) “An endorsement to create open databases for analytical data of complex carbohydrates.” J Carbohydr Chem 23:277-297 (“von der Lieth I”); Laine R A (1994) “A calculation of all possible oligosaccharide isomers both branched and linear yields 1.05×10(12) structures for a reducing hexasaccharide: the Isomer Barrier to development of single-method saccharide sequencing or synthesis systems.” Glycobiology 6:759-767). As a result, several research projects suffer from the lack of a suitable digital format that would render glycan data freely available to other researchers and interoperable in different applications (von der Lieth C W, Bohne-Lang A, Lohmann K K, Frank M (2004) “Bioinformatics for glycomics: status, methods, requirements and perspectives.” Brief Bioinform 5:164-178).
Thus, it is necessary to develop a simple, flexible and versatile data format for the representation of glycan structures that is easily understood by scientists and also readable by computers (Brazma A, Krestyaninova M, Sarkans U (2006) “Standards for systems biology.” Nat Rev Genet 7:593-605).
Currently, there are a few nomenclatures available to describe glycan structures, some of which are illustrated in
Mammalian cell lines are ideal for producing recombinant proteins that require post-translational modifications such as glycosylation. Since glycosylation has an effect on various biological properties such as folding, stability and efficacy, the quality of secreted proteins is dependent on the consistency of attached glycan structures. Thus, studying the complex glycosylation reaction pathway in an effort to control the diversity of protein glycosylation is a very active area of research.
It is to the solution of these and other problems that the present invention is directed.
It is accordingly a primary object of the present invention to provide a compact notation for describing glycan structures that can be easily stored and interpreted by computers.
It is another object of the present invention to provide a simplified alpha-numeric representation of glycan structures that can facilitate the development of computer aided analysis tools to study these complex pathways.
It is still another object of the present invention to provide a simplified alpha-numeric representation of glycan structures that can replace text based representations.
It is still another object of the invention to provide a method for representing the structure of at least a portion of an oligosaccharide.
These and other objects of the present invention are achieved by an alpha-numeric code, hereinafter referred to as the “GlycoDigit code,” for the description of N-linked glycan structures that are commonly observed in secreted glycoproteins from engineered mammalian cell lines such as Chinese hamster ovary (CHO) cells.
In one aspect of the invention, a six character alpha-numeric code is used to describe glycan structures on the basis of the monosaccharide chains attached to the different branches of the core structure. In another aspect of the invention, structures in the GlycoDigit code are represented by seven digit-letter pairs for an overall fixed length of fourteen characters. The numeric component of the alpha-numeric code allows for the development of a difference operator and an algorithm to make convenient comparison of glycans based on the unique alpha-numeric code for each structure.
Other objects, features and advantages of the present invention will be apparent to those skilled in the art upon a reading of this specification including the accompanying drawings.
The invention is better understood by reading the following Detailed Description of the Preferred Embodiments with reference to the accompanying drawing figures, in which like reference numerals refer to like elements throughout, and in which:
a is a symbolic representation of N-linked glycan structures using symbols adopted from the nomenclature proposed by the Oxford Glycobiology Institute (UK) to represent a structure pictorially.
b is a full-word representation of the N-linked glycan structures of
c is a representation of the N-linked glycan structures of
d is a representation of the N-linked glycan structures of
a is a pictorial representation of a complex N-linked glycan and its corresponding representation using the first embodiment of the GlycoDigit code in accordance with the present invention.
b is a pictorial representation of a high-mannose N-linked glycan and its corresponding representation using the first embodiment of the GlycoDigit code in accordance with the present invention.
c is a pictorial representation of a hybrid N-linked glycan and its corresponding representation using the first embodiment of the GlycoDigit code in accordance with the present invention.
a is a pictorial representation of a complex N-linked glycan and its corresponding representation using a second embodiment of the GlycoDigit code in accordance with the present invention.
b is a pictorial representation of a high-mannose N-linked glycan and its corresponding representation using the second embodiment of the GlycoDigit code in accordance with the present invention.
c is a pictorial representation of a hybrid N-linked glycan and its corresponding representation using the second embodiment of the GlycoDigit code in accordance with the present invention.
a-6f illustrates a step-by-step representation of the corresponding GlycoDigit code for the complex type structure represented in
a is a visualization of a network of glycans and reaction links for a reduced data set of 64 two-branched glycans, arranged in a hierarchical way.
b is an enlargement of the area designated 11b in
a is a visualization of the entire glycosylation network for 1,024 complex type glycans commonly secreted in CHO cells, arranged in a hierarchical way.
b is an enlargement of the area designated 12b in
c is an enlargement of the area designated 12c in
In describing preferred embodiments of the present invention illustrated in the drawings, specific terminology is employed for the sake of clarity. However, the invention is not intended to be limited to the specific terminology so selected, and it is to be understood that each specific element includes all technical equivalents that operate in a similar manner to accomplish a similar purpose.
Methods
One aspect of the invention is a method for representing the structure of at least a portion of an oligosaccharide. Preferably, the representation will be one which is easily stored on and analyzed by a computer. The method of the invention as described below may be applied to produce the specific “GlycoDigit” code described herein, but it will be understood that it may also be applied to generate different representations of the structure of an oligosaccharide.
The first part of the method of the invention involves the creation of the representational system, and comprises the following steps:
In step (a), a base oligosaccharide structure is selected. Preferably, this base structure will be one which is present in a great many of the oligosaccharide structures of interest. The “larger” the base structure (i.e. the greater the number of common structural features in the oligosaccharides of interest) the less complicated the representational system need be.
In step (b), each of the possible substitution points on the base structure is identified. Typically, each possible substitution point is assigned a number, from 1 to x, which will correspond to a position in the final structural representation. The larger the number of substitution points, the more complicated a structure the method can represent. In step (c), a two-character code is selected, where “character” means any unique identifier. Typically, one character will be a number and one will be a letter, but both could be numbers, or both letters. Non-Roman alphabets can also be used, e.g. Russian, Greek, Hebrew, etc.
In step (d), meanings for the characters selected in step (c) are assigned. An example of this is discussed in detail below with respect to the GlycoDigit code, but any system may be used. The combination of meanings for each two-character grouping is used to specifically define the residue present at each preselected substitution point. It is important to note that it is not necessary that the identifiers be able to identify every single possible residue at a particular substitution point, so long as all the ones of interest are covered. In step (e), step (d) is repeated for each of the substitution points identified in step (b).
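By way of illustration only, the representational system produced by steps (a) through (e) can be sketched as a lookup table keyed by substitution point. The scheme below is entirely hypothetical: the two positions, the codes and the residue names R1 to R3 are placeholders chosen for this sketch and are not taken from the GlycoDigit tables.

```python
# Hypothetical scheme: position -> {two-character code: meaning}.
# Positions, codes and residue names are placeholders, not GlycoDigit values.
SCHEME = {
    1: {"0x": "no residue", "1a": "residue R1", "2a": "residue R1 extended by R2"},
    2: {"0x": "no residue", "1a": "residue R3"},
}


def encode(residues_by_position):
    """Return the two-character codes, in position order, for one structure."""
    pairs = []
    for position in sorted(SCHEME):
        wanted = residues_by_position.get(position, "no residue")
        # Invert the meaning table for this position to find the matching code.
        matches = [code for code, meaning in SCHEME[position].items() if meaning == wanted]
        if not matches:
            raise ValueError(f"position {position}: {wanted!r} is not covered by the scheme")
        pairs.append(matches[0])
    return " ".join(pairs)
```

A residue outside the table raises an error, reflecting the point made above that the identifiers need only cover the residues of interest.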
The second part of the claimed method involves applying the system developed above to a particular oligosaccharide:
It will be apparent to those of skill in the art that the GlycoDigit codes described in detail hereinafter can be applied using this method.
N-Linked Glycan Structures
N-linked glycosylation occurs in all eukaryotic cells with N-linked glycans sharing a common pentasaccharide core structure depicted in
In a first embodiment of the invention, shown in
The first four branches are represented by odd numbers if the branch is a complex type while high-mannose branches are represented by letters. Complex branches terminating as a GlcNAc, galactose or neuraminic acid residue are represented by the number 3, 5 or 7 respectively. The mannose residues of hybrid and high-mannose N-linked glycans are represented by the letters A-F, with each letter designated as an even number, i.e., A=2, B=4, C=6 etc. For each branch, the letter value corresponds to double the number of mannose residues attached to that branch, i.e. A=2 implies that one mannose residue is attached, B=4 implies that two mannose residues are attached, etc. The fifth and sixth characters have a value of 3 if a bisecting GlcNAc and fucose residue are present respectively. If a branch is not present, its corresponding digit is 1. Further rules are defined that limit the number of mannose residues that can be attached to a structure and which combination of complex and high mannose branches are allowed. From these definitions, the GlycoDigit code can be used to describe the structures of 5100 glycans.
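The character definitions of the first embodiment can be sketched as a small decoder. This is a minimal illustration, assuming only the definitions given in the preceding paragraph; the function names `char_value` and `describe` are hypothetical, and the additional validity rules are not applied here.

```python
COMPLEX = {
    "1": "branch absent",
    "3": "terminal GlcNAc",
    "5": "terminal galactose",
    "7": "terminal neuraminic acid",
}
MANNOSE_LETTERS = "ABCDEF"  # A=2, B=4, ..., F=12


def char_value(ch):
    """Numeric value of one character: letters map to even numbers."""
    if ch in MANNOSE_LETTERS:
        return 2 * (MANNOSE_LETTERS.index(ch) + 1)
    return int(ch)


def describe(code):
    """Describe each position of a six character first-embodiment code."""
    if len(code) != 6:
        raise ValueError("expected six characters")
    out = []
    for ch in code[:4]:
        if ch in MANNOSE_LETTERS:
            # The letter value is double the number of mannose residues.
            out.append(f"{char_value(ch) // 2} mannose residue(s)")
        elif ch in COMPLEX:
            out.append(COMPLEX[ch])
        else:
            raise ValueError(f"unexpected branch character {ch!r}")
    for ch, name in zip(code[4:], ("bisecting GlcNAc", "fucose")):
        if ch not in ("1", "3"):
            raise ValueError(f"unexpected character {ch!r}")
        out.append(f"{name} {'present' if ch == '3' else 'absent'}")
    return out
```

For instance, `describe("AB3511")` reports one and two mannose residues on the first two branches, followed by a GlcNAc-terminated and a galactose-terminated complex branch.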
Glycosyltransferases are enzymes that sequentially add one monosaccharide at a time to glycan structures. Six GlcNAc transferases (GlcNAcT I-VI) can add GlcNAc to the three core mannose residues in different linkages. As shown in
Based on these seven possible linkage sites, in a second embodiment of the invention, shown in
Table 1 lists which linkage each digit-letter pair corresponds to in the second embodiment of the GlycoDigit code. High mannose and hybrid structures can be represented by using the first four digit-letter pairs to correspond to α1-2, α1-3 and α1-6 linked mannose chains attached to each of the two mannose residues in the core structure as shown in
a: GlcNAc, mannose or fucose residues can attach to the core structure through these linkages
GlcNAc, Galactose and Polylactosamine Chains
After a GlcNAc residue is added to the core structure, several other monosaccharides can sequentially be attached to it. Galactose (Gal) residues are attached to GlcNAc through a β1-4 link and the branch is then represented as ‘2a’ as listed in Table 2. This Galβ1-4GlcNAc structure is called a lactosamine unit and additional lactosamine units can attach to the first structure through a β1-3 link to form poly-lactosamine chains. The second embodiment of the GlycoDigit code allows up to four lactosamine units to be present in a single branch. Although the first GlcNAc and galactose moieties can be added individually, further additions are restricted in that they must be added together as a single lactosamine unit. This fact is reflected in Table 2 where digit values for branches with only lactosamine units are assigned to even numbers. Thus, a branch with two lactosamine units is depicted by ‘4a’; three units by ‘6a’, etc. Galactose can also attach to GlcNAc through a β1-3 link to form a neo-lactosamine unit (Varki et al). The GlycoDigit code does not allow repeating neo-lactosamine units and the first unit would be represented by ‘2b’ as listed in Table 2. The outermost galactose can have a final monosaccharide such as fucose or a sialic acid attached to it.
Terminal Residues
The outermost galactose residue in a branch can be capped by several terminal monosaccharides. Since even numbers are used to imply the presence of a galactose unit, odd numbers (3, 5, 7 and 9) are used to represent a different terminal sugar in the second embodiment of the GlycoDigit code. Table 3 lists the monosaccharides that can be added to the outermost galactose in several different linkage positions.
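The digit convention for a GlcNAc-initiated branch in the second embodiment can be sketched as follows. This is an illustration under stated assumptions: it reads only the digit (0 for an empty branch, 1 for a lone GlcNAc, even values for one to four lactosamine units, odd values 3 to 9 for a capped chain) and ignores the letter, which in the tables refines the linkage and residue type. The function name is hypothetical.

```python
def describe_branch_digit(digit):
    """Interpret the digit of one digit-letter pair (letter ignored)."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    if digit == 0:
        return "empty branch"
    if digit == 1:
        return "GlcNAc only"
    units = digit // 2  # each lactosamine unit contributes two residues
    if digit % 2 == 0:
        return f"{units} lactosamine unit(s), ending in galactose"
    # Odd digits of 3 or more: one terminal residue caps the outermost galactose.
    return f"{units} lactosamine unit(s) capped by a terminal residue"
```

Under this reading the digit equals the number of residues in the branch, so a GlcNAc, a galactose and one terminal sugar give the digit 3.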
Sialic acids are the most common type of monosaccharide added to the outermost galactose and are often attached either in α2-3 or α2-6 linkage. Though the sialic acid family is very diverse, N-acetyl-neuraminic acid (NeuNAc) and N-glycolyl-neuraminic acid (NeuGc) are the most common sialic acids observed. Mice produce glycoproteins almost exclusively with NeuGc, while CHO cells produce a mix of mostly NeuNAc and a small amount of NeuGc (Baker K N, Rendall M H, Hills A E, Hoare M, Freedman R B, James D C (2001) “Metabolic control of recombinant protein N-glycan processing in NS0 and CHO cells.” Biotechnol Bioeng 73:188-202). NeuGc is absent in humans, and glycoproteins containing it are immunogenic to humans (Irie A, Koyama S, Kozutsumi Y, Kawasaki T, Suzuki A (1998) “The molecular basis for the absence of N-glycolylneuraminic acid in humans.” J Biol Chem 273:15866-15871). In Table 3 the letters ‘a’ to ‘f’ are assigned to represent NeuNAc and NeuGc in various linkages. α2-8 linked sialic acids, which attach to α2-3 sialic acids, are currently not represented in the second embodiment of the GlycoDigit code.
Other terminal residues that can attach to the outermost galactose are fucose (represented by the letter ‘g’) and an additional α1-3 linked galactose (represented by the letter ‘h’). Fucose units attached to terminal galactose in the α1-2 linkage are found in some blood group antigens such as the Lewis Y and Lewis B antigens (Varki et al). The α1-3 galactosyl-transferase enzyme in mouse cells attaches an additional terminal galactose residue to the β1-4 linked galactose (Butler M (2006) “Optimisation of the cellular metabolism of glycosylation for recombinant proteins produced by mammalian cell systems.” Cytotechnology 50:57-76). This Galα1-3Galβ1-4GlcNAc structure is highly immunogenic in humans (Jenkins N, Parekh R B, James D C (1996) “Getting the glycosylation right: implications for the biotechnology industry.” Nat Biotechnol 14:975-981).
Fucosylation
The final digit-letter pair in the second embodiment of the GlycoDigit code is used to represent fucosylation on the core GlcNAc and on the outermost GlcNAc residues in branches attached to the core structure. Fucose is attached to the core GlcNAc residue through an α1-6 link while the peripheral fucosylation can occur through the α1-3 or α1-4 linkage (Ma B, Simala-Grant J L, Taylor D E (2006) “Fucosylation in prokaryotes and eukaryotes.” Glycobiology 16:158R-184R). It is important to note that this digit-letter pair only counts fucose molecules attached to GlcNAc and does not include fucose attached to the outermost galactose which is covered in the cases for representing terminal residues. The digit portion of the last digit-letter pair counts the number of fucose molecules attached to GlcNAc in the structure, while the letter is used to represent which branches are fucosylated and through which linkage. In order to keep the code as concise as possible, not all combinations of possible fucosylation sites are represented in the second embodiment of the GlycoDigit code. Only the outermost GlcNAc residue in a branch is allowed to be fucosylated. Additionally, if more than one branch is fucosylated then all fucose residues must be attached through the same type of linkage. Thus it is possible to have a structure with two fucose residues attached on the outer branches through α1-3 linkages, but not possible to have one fucose attached through an α1-3 link and the other through an α1-4 link. Table 4 lists all the combinations of fucosylation that can be represented by the second embodiment of the GlycoDigit code.
a: C implies that the fucose is attached to the core GlcNAc
b: B indicates which branch's outermost GlcNAc is fucosylated
Results
Representing N-Linked Glycans with the GlycoDigit Code
The GlycoDigit code can be used to represent complex, high-mannose and hybrid type N-linked glycans.
a is a complex type N-linked glycan with the following digits for the code:
1st digit=7: The branch terminates in NeuNAc (N-acetylneuraminic acid)
2nd digit=3: The branch terminates in GlcNAc (N-acetylglucosamine)
3rd digit=5: The branch terminates in Galactose
4th digit=1: The branch is not present
5th digit=1: No bisecting GlcNAc is attached
6th digit=3: Fucose is attached in this structure
Thus the final code for the structure in
The rules described herein are not intended to cover the N-linked glycan structures for all species. Some vertebrate structures have been observed to have five branches, a third branch attached to the upper core mannose (Varki et al.). In CHO cells, a similar branch has been observed to be present only as an intermediary step in the glycosylation pathway (Butler M. 2006. “Optimisation of the cellular metabolism of glycosylation for recombinant proteins produced by mammalian cell systems.” Cytotechnology, 50:57-76). In addition, several other variations on possible linkages have been observed in other species (Schachter H, Brockhausen I, Hull E. 1989. “High-performance liquid chromatography assays for N-acetylglucosaminyltransferases involved in N- and O-glycan synthesis.” Methods Enzymol., 179:351-397). Nevertheless, the GlycoDigit code is sufficiently applicable to most mammalian species that are commonly used in the production of recombinant proteins.
The first embodiment of the GlycoDigit code provides a simple means for generating all possible glycan structures. For branches 1 to 4 there are 10 possible alpha-numeric characters that can be used to describe the branch structure (1, 3, 5, 7, A, B, C, D, E and F), while there are two possible numbers for the 5th and 6th positions (1, 3). Thus, 10×10×10×10×2×2=40,000 different codes can be generated in the six character alpha-numeric embodiment of the GlycoDigit code. However, not all of these codes correspond to valid structures. Invalid structures can be filtered out by the rules described hereinafter, thus resulting in 4860 N-linked glycan structures that can be considered as theoretically valid glycan structures in the six character alpha-numeric embodiment of the GlycoDigit code. Of course, it is possible to further refine the rules to give rise to the glycan population pertaining to the appropriate mammalian cell line.
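The enumeration of candidate codes described above can be sketched in a few lines. This is an illustrative sketch only; the function name `enumerate_candidates` is hypothetical, and no validity rules are applied at this stage.

```python
from itertools import product

BRANCH_CHARS = "1357ABCDEF"  # ten possible characters for branches 1 to 4
FLAG_CHARS = "13"            # two possible values for positions 5 and 6


def enumerate_candidates():
    """Generate every six character code before any validity filtering."""
    alphabets = [BRANCH_CHARS] * 4 + [FLAG_CHARS] * 2
    return ["".join(chars) for chars in product(*alphabets)]
```

The list has 10×10×10×10×2×2 entries and includes codes that the additional rules later exclude.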
Table 5 summarizes the definition for each digit in the first (six character alpha-numeric) embodiment of the GlycoDigit code, and also shows the full branch structure and the anomeric linkage information. Blank cells indicate that the value is not possible for that digit position.
Three additional rules are defined to describe the N-linked glycan structures of secreted proteins from CHO cells by the six character alpha-numeric embodiment of the GlycoDigit code.
Rule 1: For high-mannose and hybrid subtypes in secreted mammalian cells, the maximum possible number of mannose residues attached to the core structure is six, making the total number of mannose residues in a structure equal to nine (counting the three residues in the trimannosyl core) (Varki et al.).
Rule 2: The six character alpha-numeric embodiment of the GlycoDigit code allows at most six mannose residues in a single branch.
Rule 3: For hybrid structures, branches 1 and 2 must be of the same type, and branches 3 and 4 must be of the same type, i.e., either both mannose or both complex type.
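One possible reading of Rules 1 to 3 can be sketched as a validity filter. The sketch makes two assumptions that the rules do not state explicitly: an absent branch (character 1) is treated as compatible with either branch type, and Rule 3 is applied to every code, which leaves purely complex and purely high-mannose codes unaffected. The helper names are hypothetical.

```python
from itertools import product

MANNOSE = "ABCDEF"


def mannose_count(ch):
    """Number of mannose residues encoded by a branch character."""
    return MANNOSE.index(ch) + 1 if ch in MANNOSE else 0


def branch_type(ch):
    if ch == "1":
        return "absent"
    return "mannose" if ch in MANNOSE else "complex"


def pair_ok(a, b):
    """Rule 3 (assumed reading): no mixing of mannose and complex within a pair."""
    types = {branch_type(a), branch_type(b)} - {"absent"}
    return len(types) <= 1


def is_valid(code):
    branches = code[:4]
    # Rule 1: at most six mannose residues attached to the core in total.
    # Rule 2 (at most six per branch) is already enforced by the A-F alphabet.
    if sum(mannose_count(ch) for ch in branches) > 6:
        return False
    return pair_ok(branches[0], branches[1]) and pair_ok(branches[2], branches[3])


def valid_codes():
    alphabets = ["1357ABCDEF"] * 4 + ["13"] * 2
    return [c for c in ("".join(p) for p in product(*alphabets)) if is_valid(c)]
```

Under these assumptions the filter retains 5,100 of the 40,000 candidate codes, which is the number of theoretical glycans quoted elsewhere in this specification for the first embodiment; a different treatment of absent branches would give a different count.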
The complex type glycan structure in
a-6f illustrate a step-by-step representation of the corresponding GlycoDigit code (seven digit-letter embodiment) for the complex type structure presented in
Starting from the first digit-letter pair, in this case the corresponding branch is empty and so the representation is ‘0x’.
Looking at the second branch attached to the α1-3 core mannose, it has three residues and ends in a terminal fucose; its representation is ‘3g’ as listed in Table 3.
The branch in the third digit-letter position has one GlcNAc residue and is represented as ‘1a’.
The fourth branch has three residues ending in an α2-3 linked sialic acid. The code for this branch is ‘3a’.
The fifth and sixth branches are empty and thus both are represented by ‘0x’.
The value for the last digit-letter position is ‘2c’ since in addition to the core fucose, there is also a fucose residue attached to the GlcNAc in the second branch in an α1-3 linkage (see Table 4). The fucose attached to the galactose in that branch is represented in the code for the second branch and is not counted here.
Thus the code for the entire structure results in [0x 3g 1a 3a 0x 0x 2c].
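For computer storage, a seven digit-letter pair code such as the one above can be split into (digit, letter) tuples. The following is a minimal sketch; the function name `parse_glycodigit`, the bracketed, space-separated input format and the restriction to lowercase letters are assumptions made for this illustration.

```python
import re


def parse_glycodigit(code):
    """Split a seven-pair code such as '[0x 3g 1a 3a 0x 0x 2c]' into tuples."""
    pairs = code.strip("[] ").split()
    if len(pairs) != 7:
        raise ValueError("expected seven digit-letter pairs")
    parsed = []
    for pair in pairs:
        # Each pair is assumed to be one digit followed by one lowercase letter.
        if not re.fullmatch(r"[0-9][a-z]", pair):
            raise ValueError(f"malformed pair {pair!r}")
        parsed.append((int(pair[0]), pair[1]))
    return parsed
```

Keeping the digit as an integer makes the numeric component directly available for arithmetic comparison between structures.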
It should be noted that the GlycoDigit code does not aim to provide comprehensive coverage of all possible glycan structures found in all species. Instead it focuses primarily on structures found in secreted glycoproteins in mammalian cell lines such as CHO cells, while still remaining extensible. For this reason the seven digit-letter pairs are chosen to represent the six linkage sites on the core structure for GlcNAc residues along with the ability to describe attached fucose molecules. Currently the GlycoDigit code can represent structures with mannose, GlcNAc, galactose, fucose and sialic acid residues present in them. It can distinguish between NeuNAc and NeuGc, and is capable of representing terminal galactose and fucose. Several structures that are not naturally expressed in CHO cells have been produced in engineered CHO cell lines. These include bisecting GlcNAc (Sburlati et al; Umana et al), repeating lactosamine chains (Sasaki H, Bothner B, Dell A, Fukuda M (1987) “Carbohydrate structure of erythropoietin expressed in Chinese hamster ovary cells by a human erythropoietin cDNA.” J Biol Chem 262:12059-12076) and Lewis blood group structures (Thomas L J, Panneerselvam K, Beattie D T, Picard M D, Xu B, Rittershaus C W, Marsh Jr H C, Hammond R A, Qian J, Stevenson T, Zopf D, Bayer R J (2004) “Production of a complement inhibitor possessing sialyl Lewis X moieties by in vitro glycosylation technology.” Glycobiology 14:883-893; Barrabés S, Pagès-Pons L, Radcliffe C M, Tabarès G, Fort E, Royle L, Harvey D J, Moenner M, Dwek R A, Rudd P M, De Llorens R, Peracaula R (2007) “Glycosylation of serum ribonuclease 1 indicates a major endothelial origin and reveals an increase in core fucosylation in pancreatic cancer.” Glycobiology 17:388-400).
With respect to the second embodiment, if additional branches are required to cover other cases, more digit-letter pairs can be added to the code to represent them. Further, the index-based letters for representing additional linkage information allow the easy addition of further linkage and residue type options. Conversely, the code can be simplified in cases where there are fewer than seven branches or if linkage information is not needed. The main emphasis in the GlycoDigit code is on the fact that the code keeps a numeric component, which can serve as the basis for several computational applications.
Applications of the GlycoDigit Code
Comparing Glycan Structures
The development of BLAST (Altschul S F, Gish W, Miller W, Myers E W, Lipman D J (1990) “Basic local alignment search tool.” J Mol Biol 215:403-410) (“Altschul et al”) provided a solution to a fundamental question that biologists had been asking, i.e., how to measure similarity between different sequences of nucleotides and proteins. However, such algorithms are not directly applicable to the comparison of glycans due to their tree-like structure. Recently a few techniques have been developed for comparing glycans (Aoki K F, Yamaguchi A, Ueda N, Akutsu T, Mamitsuka H, Goto S, Kanehisa M (2004) “KCaM (KEGG Carbohydrate Matcher): a software tool for analyzing the structures of carbohydrate sugar chains.” Nucleic Acids Res 32:W267-272 (“Aoki et al”); Aoki K F, Mamitsuka H, Akutsu T, Kanehisa M (2005) “A score matrix to reveal the hidden links in glycans.” Bioinformatics 21:1457-1463) but this research area is still in its infancy. In both the six character and the seven digit-letter pair embodiments of the GlycoDigit code, we define a difference operator, which allows for easy comparison of different glycan structures.
A lookup table (Table 6) is defined to use the results from the difference operator to find the specific residue and linkage differences between structures. For each branch being compared, the larger digit from the two input structures is indexed against all possible resulting differences. Considering only complex type structures for example, a branch with the value 7 (NeuNAc) can only be compared against the values 7 (NeuNAc), 5 (Gal), 3 (GlcNAc), and 1, meaning that the resulting differences can only be 0, ±2, ±4, and ±6 (see Difference column in Table 6). The zero value indicates no change, and is not recorded in the lookup table. For each of these possible differences, the table lists the linkages that must be changed in order to get from the first to the second structure. For positive differences, linkages must be removed, while for negative values linkages are added. Table 6 is the lookup table for complex N-linked glycan comparisons between single branches. Using the result code obtained in
Lookup Table 6 also contains information on the number of reaction steps that account for the difference between corresponding branches of the two structures. The number of required reaction steps for each branch can be obtained by dividing the absolute value of the difference between two branches by 2. For the above example, two reaction steps must take place to convert the first structure into the second one, i.e. the removal of the GlcNAc residue and the addition of fucose.
The full lookup table also contains information on the changes that occur when comparing branches where both inputs are of the high-mannose type. For example, in comparing the two branches of a high-mannose structure with digits B (value of 4) and D (value of 8) the difference would be 4 and can be described as adding two mannose residues to the first structure. The comparison between complex and high-mannose branches in hybrid glycan structures is more complicated. In order to convert a high-mannose structure to a complex one, all of the mannose residues must be removed before any other monosaccharides can be attached. Comparing branches represented by the digits C and 7 would imply that the three mannose residues have to be removed and that a GlcNAc, galactose and NeuNAc had to be added in a total of six reaction steps.
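The per-branch step counts described in the preceding paragraphs can be sketched as a single function for the first embodiment. This is an illustration under stated assumptions, not the lookup table itself: it returns only the number of reaction steps, and it treats an absent branch (character 1) as a complex branch with zero residues. The function names are hypothetical.

```python
MANNOSE = "ABCDEF"


def value(ch):
    """Numeric value of a branch character (A=2, B=4, ..., F=12)."""
    return 2 * (MANNOSE.index(ch) + 1) if ch in MANNOSE else int(ch)


def branch_steps(first, second):
    """Reaction steps needed to convert one branch character into another."""
    first_mannose = first in MANNOSE
    second_mannose = second in MANNOSE
    if first_mannose == second_mannose:
        # Same branch type: half the absolute difference of the values.
        return abs(value(first) - value(second)) // 2
    mannose_ch, complex_ch = (first, second) if first_mannose else (second, first)
    # Mixed types: every mannose residue and every complex residue is one step.
    return value(mannose_ch) // 2 + (value(complex_ch) - 1) // 2
```

For a mixed comparison the sketch adds the number of mannose residues to the number of complex residues, which reproduces the six steps stated above for the digits C and 7.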
The result code from the difference operator can be used to calculate the number of reaction steps necessary to convert one structure to another for the seven digit-letter pair embodiment. Adding the absolute values of the digits in the difference code reveals the number of reactions needed to convert the first structure into the second. From the difference code, we can calculate the number of steps to be 7 (0+1+0+5+0+0+1). When two complex branches are compared, a positive difference digit for that branch implies that residues must be added as part of the conversion, while a negative difference means residues must be removed. The comparison between complex and high-mannose branches in hybrid glycan structures is more complicated. In order to convert a high-mannose branch to a complex one, all of the mannose residues must first be removed before any other monosaccharides can be attached. Comparing the fourth branch represented by the digits B and 3 in the two structures respectively would imply that the two mannose residues have to be removed and that a GlcNAc, galactose and NeuNAc have to be added for a total of five reaction steps. Tables 1 through 3 can be used to find out which monosaccharide is added for each digit and in which linkage. This information can be used in reverse to find out which linkages are removed when converting one structure to another.
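The step count for the seven digit-letter pair embodiment can be sketched as digit arithmetic. The sketch assumes the difference is taken as second minus first, so that a positive digit means residues are added, and it ignores the letters, so changes of linkage or residue type at equal chain length are not counted. The function names and the second digit list in the usage note are hypothetical.

```python
def digit_difference(first_digits, second_digits):
    """Per-position difference (second minus first) of two seven-digit lists."""
    if len(first_digits) != 7 or len(second_digits) != 7:
        raise ValueError("expected seven digits per structure")
    return [b - a for a, b in zip(first_digits, second_digits)]


def reaction_steps(difference_digits):
    """Sum of absolute values of the digits in a difference code."""
    return sum(abs(d) for d in difference_digits)
```

As a usage example, the digits of [0x 3g 1a 3a 0x 0x 2c] are 0, 3, 1, 3, 0, 0, 2; comparing them with a hypothetical structure whose digits are 0, 2, 1, 3, 0, 0, 1 gives the difference 0, -1, 0, 0, 0, 0, -1 and therefore two reaction steps.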
A Distance Measurement Between Two N-Linked Glycan Structures
Equation (1) represents an algorithm for comparing two valid glycan structures in terms of reaction distance, for the six character alpha-numeric embodiment of the GlycoDigit code:
Using this algorithm, the nearness score between two structures can be simply calculated, allowing the determination of the number of reaction steps needed to convert one structure to another, as described hereinafter. It should be noted that the score is just a naïve approximation, and does not have any clear biological significance.
For the first four branches, the maximum number of reactions needed to convert a branch with six mannose residues into a branch with a terminal NeuNAc residue is nine reactions. Therefore, the maximum number of possible reactions would be (9×4) plus one reaction each for the bisecting GlcNAc at branch 5 and the fucose at branch 6, i.e., 38 possible reactions. The score can then be defined as
Using the first and last two structures in
Six reaction steps are needed to convert the first structure of
Constructing Glycosylation Networks
The glycosylation reaction network can be thought of as a graph with the nodes representing glycan structures and edges showing possible enzymatic reactions. A single glycan structure can act as a substrate to multiple reactions and also be the end product of several reactions, thus creating a highly branched network. Another characteristic feature of the glycan network is how any intermediary structure can be considered an end product and lead to the large variety of structures seen in natural systems. Visualizing such a network can improve our understanding of the glycosylation pathway and serve as a basis for in silico experiments.
To ease storage and processing, a symmetric adjacency matrix was created to store the reaction pairs. A 5100×5100 matrix was created with each (i, j) value recording whether glycan i reacts with glycan j. A zero value implies there is no reaction between these two glycans, while a value of 1 means that there is a reaction link. The difference operator as described above in connection with the first embodiment was used in creating a pair of functions which populate the adjacency matrix; these functions were implemented in MATLAB and their corresponding pseudocode versions are shown in
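The construction of the adjacency matrix can be sketched on the reduced data set of complex type glycans with two variable branches. This is a Python illustration, not the MATLAB functions referred to above. It assumes that the two variable branches are branches 1 and 2, that the other two branches are absent, and that a reaction link exists exactly when two glycans are one reaction step apart.

```python
from itertools import product


def reduced_glycans():
    """Complex type codes with branches 1 and 2 variable, 3 and 4 absent."""
    return [f"{a}{b}11{e}{f}" for a, b, e, f in product("1357", "1357", "13", "13")]


def distance(x, y):
    """Reaction steps between two complex type codes (digits differ in steps of 2)."""
    return sum(abs(int(p) - int(q)) // 2 for p, q in zip(x, y))


def adjacency(glycans):
    """Symmetric 0/1 matrix: 1 where two glycans are one reaction step apart."""
    n = len(glycans)
    matrix = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if distance(glycans[i], glycans[j]) == 1:
                matrix[i][j] = matrix[j][i] = 1
    return matrix
```

Under these assumptions the sketch yields 64 glycans and 160 reaction links, in agreement with the figures reported for the reduced data set.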
In order to visualize the glycosylation network, glycans were arranged from the basic core structure and sugar residues were added until the structure was fully sialylated. Glycans were classified into groups based on the number of reaction steps that separated each glycan from the core structure. For the case of complex type glycans, the core structure would be represented as 111111 in the first embodiment of the GlycoDigit code, while the end point would be a fully sialylated structure represented by the code 777733. The visualization algorithm draws the individual glycan structures in each group and then draws lines between those structures that have a reaction link.
Two data sets of glycan structures were created to test the visualization algorithm. The first set was the full set of 5100 theoretical glycans generated by GlycoDigit, with 19372 reaction pairs. A much smaller data set of 64 structures and 160 reactions was also created, containing only those complex type glycans in which just two of the first four branches are present. In both cases the resulting network showed a highly branched tree structure that diverged at first and then converged. At the start of the network there are many possible sites at which to attach sugars, which leads to the divergent nature, but as these sites fill up the number of possible choices decreases and the network converges to the final few structures. The first network showed a tree structure with a depth of 15 levels, while the smaller set had a depth of 9. The numbers of glycans and reactions in each level for both cases are summarized in Table 7.
A list of enzymes involved in the addition of monosaccharide units to, and their removal from, the glycan structure was obtained from KEGG (Kanehisa M., Goto S., Hattori M., Aoki-Kinoshita K. F., Itoh M., Kawashima S., Katayama T., Araki M., and Hirakawa M. “From genomics to chemical genomics: new developments in KEGG.” Nucleic Acids Res., 34:D354-357, 2006). The 5100 theoretical glycans of all three subtypes were obtained from the first embodiment of the GlycoDigit code, and 19372 reaction pairs were created for pairs of glycan structures that were linked together through an enzymatic reaction.
Using the numeric index of the second embodiment of the GlycoDigit code, an N-linked glycosylation network was constructed that can be represented as a graph with the nodes and edges corresponding to glycan structures and reaction steps, respectively, as shown in
Using the second embodiment of the GlycoDigit code, we enumerated all possible complex type glycan structures commonly secreted in CHO cells, starting from the core structure, which is represented as [0x 0x 0x 0x 0x 0x 0x]. This enumeration was simply carried out by incrementing each digit in the GlycoDigit code by 1, indicating that sugar residues such as GlcNAc, galactose, fucose and sialic acid are sequentially attached to the core structure through enzyme processing by relevant glycosyltransferases. This process continued until the glycan became a tetra-antennary fully sialylated structure with core fucosylation, represented by the code [3a 3a 3a 3a 0x 1a 1a], thus generating 1024 complex type glycans and 4096 reaction steps each linking two subsequent glycans.
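The enumeration just described can be illustrated with a minimal Python sketch. It is not the implementation used for the invention. It assumes that only the digit of each digit-letter pair matters for enumeration (linkage letters are ignored), that the per-branch maxima can be read off the fully sialylated code [3a 3a 3a 3a 0x 1a 1a], and that any digit below its maximum may be incremented independently of the others. The names MAX_DIGITS, enumerate_glycans and enumerate_reactions are hypothetical.

```python
from itertools import product

# Assumed per-branch maximum digits, read off the fully sialylated code
# [3a 3a 3a 3a 0x 1a 1a]; the linkage letters are ignored in this sketch.
MAX_DIGITS = (3, 3, 3, 3, 0, 1, 1)

def enumerate_glycans(max_digits=MAX_DIGITS):
    """Return every digit tuple between the core (all zeros) and max_digits."""
    return list(product(*(range(m + 1) for m in max_digits)))

def enumerate_reactions(glycans, max_digits=MAX_DIGITS):
    """Return (substrate, product) pairs that differ by +1 in exactly one digit."""
    edges = []
    for g in glycans:
        for i, m in enumerate(max_digits):
            if g[i] < m:  # this branch can still accept one more sugar
                successor = g[:i] + (g[i] + 1,) + g[i + 1:]
                edges.append((g, successor))
    return edges

glycans = enumerate_glycans()
reactions = enumerate_reactions(glycans)
print(len(glycans), len(reactions))  # 1024 4096
```

Under these assumptions the sketch reproduces the counts stated above: 4 × 4 × 4 × 4 × 1 × 2 × 2 = 1024 structures and 4096 single-step reactions.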
In order to visualize the constructed network, the resulting graph was arranged hierarchically. First, all glycans were classified into layers based on the number of sugars attached. The core structure [0x 0x 0x 0x 0x 0x 0x] formed the first layer, followed by a second layer composed of glycans with one sugar added to the core structure, and so on until the last layer, which contained the fully sialylated glycan structure [3a 3a 3a 3a 0x 1a 1a]. Once all glycans were placed in their corresponding layers, the reaction edges linking glycan pairs were drawn within the network graph.
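The layering step can be sketched as follows, again as an illustration under stated assumptions rather than the actual visualization code. The sketch assumes that the layer of a glycan equals the sum of its digits (the number of sugars added to the core) and ignores linkage letters. MAX_DIGITS and build_layers are hypothetical names.

```python
from collections import defaultdict
from itertools import product

# Assumed per-branch maximum digits taken from [3a 3a 3a 3a 0x 1a 1a].
MAX_DIGITS = (3, 3, 3, 3, 0, 1, 1)

def build_layers(max_digits=MAX_DIGITS):
    """Group glycan digit tuples by the number of sugars added to the core."""
    by_depth = defaultdict(list)
    for code in product(*(range(m + 1) for m in max_digits)):
        by_depth[sum(code)].append(code)
    # Return the layers ordered from the core structure to the final structure.
    return [by_depth[depth] for depth in sorted(by_depth)]

layers = build_layers()
for depth, members in enumerate(layers):
    print(depth, len(members))
```

Because each reaction adds exactly one sugar, every edge in such a layout joins a glycan in one layer to a glycan in the next, so a drawing routine only needs to connect adjacent layers.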
Most biological pathways are complex, and visualizing their structure is one of the most useful steps in studying them. The networks described herein can be used to identify possible pathways linking glycan structures, or to find shorter paths than were previously known. In the current model there are often several possible pathways from one structure to another, but these paths might not always be biologically plausible. Depending on which species is being modeled, additional rules specifying which glycans can actually react to form others can be incorporated to make the network more realistic. The modular nature of the algorithms allows users to define their own model of reaction pairs and visualize them.
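One way to search such a network for a pathway between two structures is a breadth-first search, sketched below. The search routine is a standard technique chosen for illustration and is not described in the source. The sketch assumes the digit-only form of the second embodiment, treats reactions as one-way sugar additions, and models species-specific rules through an optional, hypothetical `allowed` predicate. All function names are illustrative.

```python
from collections import deque

# Assumed per-branch maximum digits taken from [3a 3a 3a 3a 0x 1a 1a].
MAX_DIGITS = (3, 3, 3, 3, 0, 1, 1)

def successors(code, max_digits=MAX_DIGITS):
    """Yield every glycan reachable from code by adding one sugar."""
    for i, m in enumerate(max_digits):
        if code[i] < m:
            yield code[:i] + (code[i] + 1,) + code[i + 1:]

def shortest_path(start, goal, allowed=lambda substrate, product: True):
    """Breadth-first search; returns a list of codes, or None if unreachable."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:  # walk back to the start
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in successors(node):
            if nxt not in parent and allowed(node, nxt):
                parent[nxt] = node
                queue.append(nxt)
    return None

core = (0, 0, 0, 0, 0, 0, 0)
full = (3, 3, 3, 3, 0, 1, 1)
path = shortest_path(core, full)
print(len(path) - 1)  # number of reaction steps from core to full structure
```

A user-defined rule can be passed as `allowed`, for example a predicate that rejects any step extending the first branch beyond its first sugar. Targets that depend on a rejected step then become unreachable and the function returns None.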
Metabolic flux analysis is one application that greatly benefits from the presence of a visual interface. Additional information can be added to the data model to allow in silico re-engineering of the pathway. The visualization system provides a good basis for building models for this kind of analysis. It can be implemented with an interactive user interface to incorporate experimental data and provide a web browser based service.
Discussion
Research in glycome informatics is slowly catching up with the progress that has been made in other ‘omics’ areas. As described herein, the GlycoDigit code in accordance with the present invention is based on a pre-defined branching structure of N-linked glycans that are commonly found in most mammalian cells. Compared to other standard text representations for glycans, the GlycoDigit code is much shorter and more intuitive because it describes branches rather than individual monosaccharide units, as previous methods do. For example, the glycan structure illustrated in various formats in
Although the GlycoDigit code may be unable to provide comprehensive coverage of all possible glycan structures, it is adaptable and can be customized according to the user's requirements. For example, the number of branches allowed in a structure can be increased or decreased by adjusting the number of digit-letter pairs, while more choices can be added to the letter index to represent different linkage information. The GlycoDigit code is also interoperable, which allows it to be incorporated into a laboratory glyco-information management system in a retrievable format, thereby providing useful resources for biomedical and biotechnological applications (Hashimoto K, Goto S, Kawano S, Aoki-Kinoshita K F, Ueda N, Hamajima M, Kawasaki T, Kanehisa M (2006) “KEGG as a glycome informatics resource.” Glycobiology 16:63R-70R; Lutteke T, Bohne-Lang A, Loss A, Goetz T, Frank M, von der Lieth C W (2006) “GLYCOSCIENCES.de: an Internet portal to support glycomics and glycobiology research.” Glycobiology 16:71R-81R; Raman R, Venkataraman M, Ramakrishnan S, Lang W, Raguram S, Sasisekharan R (2006) “Advancing glycomics: implementation strategies at the consortium for functional glycomics.” Glycobiology 16:82R-90R). As such, relevant glycan structures can be easily stored, accessed, retrieved and rapidly converted into their pictorial formats.
Research on the glycosylation pathway to control the diversity of glycosylation is another area that can benefit from the GlycoDigit code. A simplified numeric representation instead of a text-based representation of glycan structures can further advance the development of computer aided analysis tools to study such a complex network (Hossler et al I). The format of the GlycoDigit code as described herein can be easily applied to constructing and visualizing networks of glycan interactions. This applicability may not be provided as easily by text-based representations. Moreover, describing differences between glycans in terms of reaction steps and having an exhaustive list of possible glycan structures as illustrated in
Further work is needed to define a biologically meaningful measure of similarity among glycan structures in the context of the GlycoDigit code. As was the case with protein structures, it is expected that a similarity of glycan structures will imply a similarity of function as well (Altschul et al; Aoki et al; Bertozzi C R, Kiessling L L (2001) “Carbohydrates and glycobiology review: chemical glycobiology.” Science 291:2357-2364). The GlycoDigit code in accordance with the present invention is also extendable to allow the representation of a more varied range of N-linked glycan structures.
Modifications and variations of the above-described embodiments of the present invention are possible, as appreciated by those skilled in the art in light of the above teachings. It is therefore to be understood that, within the scope of the appended claims and their equivalents, the invention may be practiced otherwise than as specifically described.
The present patent application is based on, and claims priority from, U.S. provisional Application No. 60/929,163, filed Jun. 15, 2007, which is incorporated herein by reference in its entirety.
Filing Document | Filing Date | Country | Kind | 371c Date |
---|---|---|---|---|
PCT/SG08/00212 | 6/13/2008 | WO | 00 | 12/15/2009 |
Number | Date | Country | |
---|---|---|---|
60929163 | Jun 2007 | US |