Proteins from anaerobic fungi and uses thereof

REFERENCE TO SEQUENCE LISTING, A TABLE, OR A COMPUTER PROGRAM LISTING COMPACT DISK APPENDIX

The instant application contains a Sequence Listing which has been filed electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Dec. 6, 2016, is named UCSB007PCT_SL.txt and is 21,694,502 bytes in size.

BACKGROUND OF THE INVENTION

Microbial communities have evolved immense enzymatic capabilities. In particular, anaerobic fungi perform metabolic feats that could potentially be adapted for great benefit. For example, the efficient conversion of biomass into fuels could provide humankind with an inexpensive, unlimited, and environmentally sustainable source of energy. However, current biomass conversion technologies are not economically scalable due to the recalcitrance of woody biomass. While humans have struggled to effectively capture energy from biomass, anaerobic fungi efficiently convert such material into many billions of joules of energy each day in the digestive tracts of herbivores. These organisms have evolved efficient enzymatic machinery to break down cellulosic material in lignin-rich plant material.

In addition to efficiently breaking down biomolecules, anaerobic fungi are able to synthesize complex natural products which are difficult or impossible to make using synthetic chemistry. Fungi have rich enzymatic abilities which create a diversity of biologically active molecules. Roughly 40% of drugs in use today were derived from fungi, for example, including antibiotics such as penicillin, chemotherapeutics such as vincristine or vinblastine, and cholesterol-lowering drugs such as statins. The prevalence of useful biomolecules produced by fungi is enabled by their unique enzymatic capabilities.

While the potential of fungi to improve bioproduction technologies is huge, large numbers of fungal species cannot contribute because they are not amenable to culture, isolation, and study. Anaerobic fungi in particular are very difficult to culture compared to model organisms such as aerobic bacteria or yeast. The anaerobic fungi have therefore been severely underrepresented in bioprospecting efforts due to the bottlenecks associated with their study.

Advantageously, the inventors of the present disclosure have developed methodologies for the culture of anaerobic fungi. This development has enabled the isolation and characterization of organisms which were never previously studied. From this work, novel species of gut fungi have been isolated and their transcriptomes have been sequenced, revealing a multitude of new genes and proteins that can be used in energy production, in the synthesis of novel compounds, and in other applications.

SUMMARY OF THE INVENTION

The inventors of the present disclosure have identified four novel species of anaerobic fungi and have identified numerous useful protein domains and nucleic acid sequences coding therefor. These novel sequences provide the art with new enzymatic tools. In one aspect, the invention is directed to methods and compositions of matter utilized in the production of biofuels from lignocellulosic biomass utilizing the novel domains of the invention. In another aspect, the scope of the invention encompasses novel catalytic domains applied in the digestion of lignocellulosic biomass. In another aspect, the scope of the invention encompasses structural components which are incorporated into enzyme complexes, such as cellulosomes. Disclosed herein are novel engineered scaffoldins, glycoside hydrolase enzymes, dockerins, cohesins, and domains therefrom, as well as other catalytic proteins and protein domains involved in the breakdown of plant material. In another aspect, the scope of the invention encompasses methods of producing biofuels utilizing the novel organisms described herein in bioreactors or like processes.

In yet another aspect, the scope of the invention encompasses methods and compositions of matter which are utilized in the production of secondary compounds, such as polyketides. In one aspect, the compositions of the invention encompass engineered polyketide synthase complexes comprising one or more novel domains of the invention. In another aspect, the scope of the invention encompasses methods of using the domains of the invention in the production of secondary compounds.

BRIEF DESCRIPTION OF THE DRAWING

FIG. 1 is a conceptual diagram of a cellulosome. The cellulosome complex comprises a scaffoldin (104), comprising a plurality of cohesins (105). Enzymatic moieties comprising catalytic domains (106) are attached to the scaffold by the docking of dockerins (107) to complementary cohesin molecules in dockerin-cohesin complexes (111). The docked proteins further include carbohydrate binding entities (108). The cellulosome is anchored in the cell membrane (103) of a host cell (101) by an anchoring moiety such as a transmembrane helix (102). The cellulosome can digest a complex polymer (109) such as cellulose into monomers (110).

DETAILED DESCRIPTION OF THE INVENTION

Four novel anaerobic gut fungi were isolated and cultured. The organisms include Piromyces finnis, isolated from horse feces; Neocallimastix californiae, isolated from goat feces; Anaeromyces robustus, isolated from sheep feces; and Neocallimastix sp. S4, isolated from sheep feces.

Utilizing novel culture methods, the organisms were isolated and pure cultures were attained, enabling the performance of sequencing efforts. Next-generation sequencing techniques were then utilized to identify sequences expressed by the fungal cells. DNA analysis tools were then used to identify domains present in the expressed proteins. By their homology to known sequences, numerous types of useful domains were identified, including catalytic domains and structural domains.

The several domains identified are provided in the sequence listing submitted herewith. Table 1 lists domain names and a description for each domain name which identifies a gene or gene family associated with the sequence, as assigned by bioinformatic tools.

Each domain is represented as a novel polypeptide sequence having a domain description based on its similarity to known proteins from other organisms. Each domain is also provided as a nucleic acid sequence coding for the disclosed polypeptides. The listed protein sequences are provided in standard one-letter amino acid code, as known in the art. The listed nucleic acid sequences comprise fungal cDNA sequences. In the nucleic acid sequence listings, A is adenine; C is cytosine; G is guanine; T is thymine; and N is any of the four bases. The codon preferences of the anaerobic fungi are generally in line with those of model organisms, although fungal sequences tend to have a higher A-T content.
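By way of non-limiting illustration, the A-T content of a listed cDNA sequence may be estimated as sketched below. The function name `at_content` and the example sequence are hypothetical and are not taken from the sequence listing; the sketch assumes that positions designated N are excluded from the calculation.

```python
def at_content(cdna: str) -> float:
    """Return the fraction of A and T bases among unambiguous bases.

    Assumption: positions given as N (any of the four bases) are ignored.
    """
    seq = cdna.upper()
    counted = [base for base in seq if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    at_bases = sum(1 for base in counted if base in "AT")
    return at_bases / len(counted)


# Hypothetical sequence for illustration only (16 unambiguous bases, 12 of them A or T).
example = "ATGAAATTTNCCGATAA"
print(round(at_content(example), 3))  # prints 0.75
```

A value computed in this manner may be compared across sequences of interest, for example when considering codon optimization for a heterologous host.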

It will be noted that in some cases, multiple variants of a domain are listed, having been derived from the same transcript sequence. This is due to the use of multiple genetic identification tools, which in some cases use diverging models to recognize, define, and annotate protein domains. These models recognize a number of unique features, such as the N- or C-termini of catalytic domains, key catalytic residues, etc., each with their own start and stop sites, resulting in overlapping domain annotations for some transcripts.
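The overlap of annotations described above may be illustrated by the following minimal sketch, which reports pairs of annotations on a single transcript whose residue ranges share at least one position. The tool names, coordinates, and the `Annotation` record are hypothetical and are provided for illustration only; the sketch does not represent the annotation pipeline actually used.

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    tool: str   # hypothetical name of the annotation tool
    label: str  # domain label assigned by that tool
    start: int  # first residue of the domain, 1-based, inclusive
    stop: int   # last residue of the domain, 1-based, inclusive


def overlapping_pairs(annotations):
    """Return pairs of annotations whose residue ranges share a position."""
    ordered = sorted(annotations, key=lambda a: (a.start, a.stop))
    pairs = []
    for i, first in enumerate(ordered):
        for second in ordered[i + 1:]:
            if second.start > first.stop:
                break  # sorted by start, so nothing later can overlap `first`
            pairs.append((first, second))
    return pairs


# Hypothetical annotations of one transcript by two tools.
hits = [
    Annotation("tool_A", "Glyco_hydro_6", 25, 310),
    Annotation("tool_B", "GLHYDRLASE6", 40, 305),
    Annotation("tool_A", "CBM_10", 350, 390),
]
for first, second in overlapping_pairs(hits):
    print(first.label, "overlaps", second.label)
```

In this hypothetical example, the two glycoside hydrolase family 6 annotations are reported as overlapping, while the carbohydrate binding domain annotation is not.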

The present disclosure provides the art with a large number of novel protein domains and corresponding nucleic acid sequences that may be applied in various contexts. Domains which are applicable to the various compositions and methods described herein can be readily selected from the sequence listing submitted based on the domain labels and descriptions provided in Table 1.

TABLE 1

Domain labels, descriptions, and SEQ ID NO.'s.

Domain Label
Domain Description
Protein Seq. ID No.'s
Nucleic Acid Seq. ID No.'s

(Trans)glycosidases
Glycosidase
1-155
13663-

13818

1-
1-Phosphatidylinositol
156-157
13819-

PHOSPHATIDYLINOSITOL
phosphodiesterase

13820

PHOSPHODIESTERASE-

RELATED PROTEIN

4-PPantetheinyl_Trfase_SF
4′-phosphopantetheinyl
158-165
13821-

transferase

13827

4'-phosphopantetheinyl
4′-phosphopantetheinyl
166-173
13828-

transferase
transferase

13835

6-blade_b-propeller_TolB-
six-bladed beta-propeller domain
174
13836

like
found in TolB protein

6hp_glycosidase
Glycosidase-six hairpin type
175-322
13837-

13984

9-O-ACETYL-N-
9-O-acetyl-N-acetylneuraminic
323-333
13985-

ACETYLNEURAMINIC
acid deacetylase

13995

ACID DEACETYLASE-

RELATED

AAC-RICH MRNA CLONE
AAC-RICH MRNA CLONE
334
13996

AAC4 PROTEIN-RELATED
AAC4 PROTEIN-RELATED

AB_hydrolase
Alpha-beta hydrolase fold
335-450
13997-

domain found in hydrolytic

14112

enzymes

Abhydrolase_5
Alpha-beta hydrolase fold
451-464
14113-

domain 5 found in hydrolytic

14126

enzymes

Ac_transferase_dom
Acyl Transferase Domain
465-578
14127-

14240

ACCESSORY GLAND
Accessory Gland Protein domain
579-583
14241-

PROTEIN ACP76A-
with similarity to that found in

14245

RELATED
flies

Acetyl-CoA synthetase-like
Acetyl CoA Synthetase
584-603
14246-

14265

ACID PHOSPHATASE
Acid phosphatase
604-623
14266-

RELATED

14285

ACP_DOMAIN
Acyl Carrier Protein
624-668
14286-

14330

ACP-like
Acyl Carrier Protein
669-714
14331-

14376

ACPS
Acyl Carrier Protein Synthase, a
715-722
14377-

phosphopantetheinyl transferase

14384

Acyl_carrier_prot-like
Acyl Carrier Protein
723-767
14385-

14429

Acyl_transf_1
Acyl Transferase
768-798
14430-

14460

ACYL-COA
Acyl CoA Thioesterase I
799
14461

THIOESTERASE I

ADH_N
Catalytic domain of alcohol
800-807
14462-

dehydrogenase

14469

adh_short
Domain from short chain
808-823
14470-

dehydrogenase family

14485

adh_short_C2
Domain from short chain
824-832
14486-

dehydrogenase C2 family

14494

ADH_ZINC
Domain from alcohol
833-835
14495-

dehydrogenase, zinc type

14497

ADH_zinc_N
Domain from alcohol
836-851
14498-

dehydrogenase, zinc type

14513

ALCOHOL
Domain from alcohol
852-862
14514-

DEHYDROGENASE
dehydrogenase

14524

RELATED

Aldolase_TIM
beta/alpha barrel domain found
863-870
14525-

in aldolases

14532

ALPHA-L-FUCOSIDASE 2
Alpha-L-fucosidase
871-874
14533-

14536

alpha/beta-Hydrolases
Alpha-beta hydrolase fold
875-972
14537-

domain found in hydrolytic

14634

enzymes

Aminotran_1_2
Class I/Class II
973-982
14635-

Aminotransferase

14644

AMP_BINDING
AMP binding domain
983-999
14645-

14661

AMP-binding_C
C terminal domain of AMP
1000-1001
14662-

binding enzyme

14663

Arabinanase/levansucrase/
Member of
1002-1058
14664-

invertase
Arabinanase/levansucrase/

14720

invertase superfamily

ARF/SAR SUPERFAMILY
Member of small GTPASE
1059-1060
14721-

PROTEIN-RELATED
superfamily

14722

AT18611P-RELATED
Carbohydrate binding domain
1061-1062
14723-

14724

B_KETOACYL_SYNTHASE
Beta-ketoacyl-ACP synthase
1063-1094
14725-

14756

Barwin-like endoglucanases
Endoglucanase
1095-1204
14757-

14866

Beta_cellobiohydrolase
1,4-beta cellobiohydrolase
1205-1286
14867-

14948

Beta-D-glucan exohydrolase,
Beta-D-glucan exohydrolase, C-
1287-1297
14949-

C-terminal domain
terminal domain

14959

BETA-GALACTOSIDASE
Glycoside hydrolase Beta-
1298-1303
14960-

Galactosidase

14965

beta-
beta-
1304-1309
14966-

Galactosidase/glucuronidase
Galactosidase/glucuronidase

14971

domain
domain

BETA/GAMMA
Beta-Gamma Crystallin
1310-1316
14972-

CRYSTALLIN
Structural Protein

14978

Bgal_small_N
Beta-galactosidase small chain
1317-1322
14979-

14984

BNR
BNR repeat sequence
1323-1324
14985-

14986

Carb_bind
Carbohydrate Binding Domain
1325-1327
14987-

14989

Carbohydrate-binding domain
Carbohydrate Binding Domain
1328-1333
14990-

14995

CarboxyPept_regulatory_dom
Regulatory domain of
1334-1337
14996-

carboxypeptidase

14999

CBD_carb-bd_dom
Carbohydrate Binding Domain
1338-1344
15000-

15006

CBD_IV
Cellulose binding domain, Type
1345-1370
15007-

IV

15032

CBM_1
Fungal cellulose binding domain
1371-1419
15033-

15081

CBM_10
Dockerin and Carbohydrate
1420-3705
15082-

Binding Domain, Type 10

17367

CBM_2
Carbohydrate Binding Domain,
3706-3709
17368-

Type 2

17371

CBM_4_9
Carbohydrate Binding Domain
3710-3714
17372-

17376

CBM_6
Carbohydrate Binding Domain,
3715-3736
17377-

family 6

17398

CBM-like
Carbohydrate Binding Domain
3737-3740
17399-

17402

CBM1_1
Carbohydrate Binding Domain
3741-3758
17403-

17420

CBM1_2
Carbohydrate Binding Domain
3759-3841
17421-

17503

CBM6
Carbohydrate Binding Domain,
3842-3850
17504-

family 6

17512

Cellulase
Cellulase
3851-3962
17513-

17624

CELLULASE (GLYCOSYL
Cellulase-glycosyl hydrolase
3963-3972
17625-

HYDROLASE FAMILY 5)
family 5

17634

PROTEIN-RELATED

Cellulose docking domain,
Carbohydrate Binding Domain
3973-6217
17635-

dockering

19879

Cellulose-binding domain
Carbohydrate Binding Domain
6218-6254
19880-

19916

CHB_HEX_C
Chitinase-Chitobiase, C
6255-6262
19917-

terminal domain

19924

CHIT_BIND_I_1
Chitin Binding Site, Type 1, may
6263-6265
19925-

bind N-acetylglucosamine

19927

CHIT_BIND_I_2
Chitin Binding Site, Type 1, may
6266-6270
19928-

bind N-acetylglucosamine

19932

CHITIN DEACETYLASE 1-
allantoinase/chitin deacetylase 1
6271
19933

RELATED

Chitin_bind_1
Chitin Binding Site, Type 1, may
6272-6275
19934-

bind N-acetylglucosamine

19937

Chitin-bd_1
Chitin Binding Site, Type 1, may
6276-6282
19938-

bind N-acetylglucosamine

19944

CHITINASE
Chitinase
6283-6291
19945-

19953

Chitinase insertion domain
Chitinase insertion domain
6292-6300
19954-

19962

CHITINASE_18
Chitinase, family 18
6301-6303
19963-

19965

Chitinase_insertion
Chitinase insertion domain
6304-6312
19966-

19974

Chitobiase/Hex_dom_2-like
domain 2 of bacterial chitobiases
6313-6314
19975-

and beta-hexosaminidases

19976

ChtBD1
Chitin Binding Site, Type 1, may
6315-6318
19977-

bind N-acetylglucosamine

19980

CINNAMYL ALCOHOL
cinnamyl-alcohol dehydrogenase
6319-6322
19981-

DEHYDROGENASE 2-

19984

RELATED

ClpP/crotonase
Crotonase like domain
6323-6333
19985-

19995

ClpP/crotonase-like_dom
Crotonase like domain
6334-6344
19996-

20006

CoA-dependent
CoA-dependent acyltransferases
6345-6349
20007-

acyltransferases

20011

ConA-like_subgrp
Concanavalin A-like
6350-6375
20012-

lectins/glucanases

20037

Concanavalin A-like
Concanavalin A-like
6376-6433
20038-

lectins/glucanases
lectins/glucanases

20095

Condensation
Condensation domain
6434-6436
20096-

20098

CotH
spore coat protein, involved in
6437-6584
20099-

plant cell wall binding

20246

Cystine-knot_cytokine
Cystine-knot_cytokine
6585
20247

CYTH-like phosphatases
Phosphatase-acts on
6586-6587
20248-

triphosphorylated substrates

20249

CYTH-like_domain
Phosphatase-acts on
6588-6589
20250-

triphosphorylated substrates

20251

Dockerin_dom
Dockerin domain
6590-7679
20252-

21341

Dockerin_dom_fun
Dockerin domain
7680-8910
21342-

22572

DPBB_1
Lytic transglycosylase
8911-8919
22573-

22581

DUF1729
Domain of unknown function-
8920-8930
22582-

Found in acyl transferase

22592

domains

DUF303
Domain of unknown function
8931-8946
22593-

DUF303, acetylesterase

22608

DUF4353
Domain of unknown function
8947-8949
22609-

22611

ECH
Enoyl-CoA hydratase
8950-8959
22612-

22621

EGGSHELL
eggshell
8960-8963
22622-

22625

Endo-1-4-beta-
Endo-1-4-beta-glucanase,
8964
22626

glucanase_dom2
domain 2

ENDO-1,4-BETA-
Endo-1-4-beta-glucanase
8965-8998
22627-

GLUCANASE

22660

ENDOGLUCANASE
Endoglucanase
8999-9022
22661-

22684

Endoglucanase_F_dom3
Endoglucanase F, domain 3
9023-9055
22685-

22717

ENTEROBACTIN
Enterobactin Synthase
9056
22718

SYNTHASE COMPONENT
Component F

F

Esterase
Esterase
9057-9080
22719-

22742

Expansin_CBD
C-terminal carbohydrate binding
9081-9106
22743-

domain of expansin

22768

EXPANSIN_EG45
N terminal domain of expansin
9107-9127
22769-

22789

EXTRACELLULAR
Extracellular matrix glycoprotein
9128-9129
22790-

MATRIX GLYCOPROTEIN
related domain

22791

RELATED

FabD/lysophospholipase-like
FabD/lysophospholipase-like
9130-9240
22792-

domain-found in hydrolases

22902

FAMILY NOT NAMED
Not associated with known
9241-9338
22903-

sequences

23000

FASYNTHASE
Fatty Acid Synthase
9339-9468
23001-

23130

FATTY ACID SYNTHASE
Fatty Acid Synthase-subunit
9469-9479
23131-

SUBUNIT BETA
beta

23141

fCBD
Cellulose binding domain
9480-9605
23142-

23267

fn3_3
domain II of
9606-9609
23268-

rhamnogalacturonan lyase

23271

Fn3_assoc
domain II of
9610
23272

rhamnogalacturonan lyase

Fn3-like
domain II of
9611-9619
23273-

rhamnogalacturonan lyase

23281

Galactose mutarotase-like
Galactose mutarotase-like
9620-9626
23282-

domain-binds carbohydrates

23288

Galactose-bd-like
Galactose binding domain-like
9627-9668
23289-

fold

23330

Galactose-binding domain-
Galactose binding domain-like
9669-9727
23331-

like
fold

23389

GDHRDH
short-chain
9728-9748
23390-

dehydrogenases/reductase family

23410

GH_fam_N_dom
domain is found towards the N
9749-9753
23411-

terminus of some glycosyl

23415

hydrolase family members,

including alpha-L-fucosidases

GH04125P-RELATED
Serine protease inhibitor related
9754
23416

GH97_C
Glycosyl-hydrolase 97, C-
9755
23417

terminal oligomerisation domain

GH97_N
Glycosyl-hydrolase 97, N-
9756
23418

terminal domain

GLHYDRLASE10
Glycoside hydrolase family 10
9757-9889
23419-

domain

23551

GLHYDRLASE11
Glycoside hydrolase family 11
9890-9998
23552-

23660

GLHYDRLASE16
Glycoside hydrolase family 16
9999-10013
23661-

23675

GLHYDRLASE2
Glycoside hydrolase family 2
10014-
23676-

10028
23690

GLHYDRLASE26
Glycoside hydrolase family 26
10029-
23691-

10054
23716

GLHYDRLASE3
Glycoside hydrolase, family 3,
10055-
23717-

N-terminal
10094
23756

GLHYDRLASE48
Glycoside hydrolase family 48
10095-
23757-

10308
23970

GLHYDRLASE6
Glycoside hydrolase family 6
10309-
23971-

10791
24453

GLHYDRLASE8
Glycoside hydrolase family 8
10792-
24454-

10818
24480

GLUCOSE-METHANOL-
Glucose-methanol-choline
10819-
24481-

CHOLINE (GMC)
oxidoreductase
10820
24482

OXIDOREDUCTASE

GLUCOSYLCERAMIDASE
Glucosylceramidase
10821-
24483-

10823
24485

Glyco_10
Glycoside hydrolase family 10
10824-
24486-

10858
24520

Glyco_18
Glycoside hydrolase family 18
10859-
24521-

10865
24527

Glyco_hyd_65N_2
N-terminus of the glycosyl
10866-
24528-

hydrolase 65 family catalytic
10870
24532

domain

Glyco_hydr_30_2
Glycoside hydrolase family 30
10871-
24533-

10873
24535

Glyco_hydro_10
Glycoside hydrolase family 10
10874-
24536-

10913
24575

Glyco_hydro_11
Glycoside hydrolase family 11
10914-
24576-

10949
24611

Glyco_hydro_11/12
Glycoside hydrolase family
10950-
24612-

11/12
10986
24648

Glyco_hydro_114
Glycosyl-hydrolase family,
10987-
24649-

number 114, potential endo-alpha-
10989
24651

1,4-polygalactosaminidase

Glyco_hydro_13_b
Glycoside hydrolase family 13
10990-
24652-

10991
24653

Glyco_hydro_16
Glycoside hydrolase family 16
10992-
24654-

11001
24663

Glyco_hydro_18
Glycoside hydrolase family 18
11002-
24664-

11010
24672

Glyco_hydro_2
Glycoside hydrolase family 2
11011-
24673-

11013
24675

Glyco_hydro_2_C
Glycoside hydrolase family 2
11014-
24676-

11016
24678

Glyco_hydro_2_N
Glycoside hydrolase family 2-N
11017-
24679-

terminal domain
11019
24681

Glyco_hydro_2/20_Ig-like
Glycoside hydrolase, family
11020-
24682-

2/20, immunoglobulin-like beta-
11025
24687

sandwich domain

Glyco_hydro_26
Glycoside hydrolase family 26
11026-
24688-

11032
24694

Glyco_hydro_3
Glycoside hydrolase family 3
11033-
24695-

11041
24703

Glyco_hydro_3_C
Glycoside hydrolase family 3-C
11042-
24704-

terminal domain
11059
24721

Glyco_hydro_3_N
Glycoside hydrolase family 3-N
11060-
24722-

terminal domain
11068
24730

Glyco_hydro_39
Glycoside hydrolase family 39
11069-
24731-

11081
24743

Glyco_hydro_43
Glycoside hydrolase family 43
11082-
24744-

11129
24791

Glyco_hydro_45
Glycoside hydrolase family 45
11130-
24792-

11153
24815

Glyco_hydro_48
Glycoside hydrolase family 48
11154-
24816-

11186
24848

Glyco_hydro_53
Glycoside hydrolase family 53
11187-
24849-

11189
24851

Glyco_hydro_6
Glycoside hydrolase family 6
11190-
24852-

11270
24932

Glyco_hydro_8
Glycoside hydrolase family 8
11271-
24933-

11276
24938

Glyco_hydro_88
Glycoside hydrolase family 88
11277-
24939-

11278
24940

Glyco_hydro_9
Glycoside hydrolase family 9
11279-
24941-

11312
24974

Glyco_hydro_97
Glycoside hydrolase family 97
11313
24975

Glyco_hydro_beta-prop
five-bladed beta-propeller
11314-
24976-

domain found in some glycosyl
11361
25023

hydrolases

Glyco_hydro_catalytic_dom
catalytic TIM beta/alpha barrel
11362-
25024-

common to many different
11510
25172

families of glycosyl hydrolases

Glyco_hydro-type_carb-
Carbohydrate binding domain
11511-
25173-

bd_sub
from glycoside hydrolases
11517
25179

Glycoside
Glycoside hydrolase/deacetylase
11518-
25180-

hydrolase/deacetylase
family
11521
25183

GLYCOSYL HYDROLASE
Glycoside hydrolase family 43
11522-
25184-

43 FAMILY MEMBER

11560
25222

Glycosyl hydrolase domain
catalytic TIM beta/alpha barrel
11561-
25223-

common to many different
11562
25224

families of glycosyl hydrolases

GLYCOSYL HYDROLASE-
related to known glycosyl
11563-
25225-

RELATED
hydrolase domains
11564
25226

Glycosyl hydrolases family 6,
Glycosyl hydrolases family 6,
11565-
25227-

cellulases
cellulases
11648
25310

GLYCOSYL
Glycosyl transferase related
11649-
25311-

TRANSFERASE-RELATED
domain
11669
25331

GLYCOSYL_HYDROL_F10
Glycoside hydrolase family 10
11670-
25332-

11688
25350

GLYCOSYL_HYDROL_F11_
Glycoside hydrolase family 11
11689-
25351-

1

11719
25381

GLYCOSYL_HYDROL_F11_
Glycoside hydrolase family 11
11720-
25382-

2

11721
25383

GLYCOSYL_HYDROL_F3
Glycoside hydrolase family 3
11722
25384

GLYCOSYL_HYDROL_F45
Glycoside hydrolase family 45
11723-
25385-

11742
25404

GLYCOSYL_HYDROL_F5
Glycoside hydrolase family 5
11743-
25405-

11765
25427

GLYCOSYL_HYDROL_F6_
Glycoside hydrolase family 6
11766-
25428-

2

11816
25478

GLYCOSYL_HYDROL_F9_
Glycoside hydrolase family 9-
11817-
25479-

2
signature found in endoglucanases
11844
25506

and other glycoside hydrolases

GroES-like
Similarity to GroES (chaperonin
11845-
25507-

10), an oligomeric molecular
11882
25544

chaperone

HMG_CoA_synt_C
Hydroxymethylglutaryl-
11883-
25545-

coenzyme A synthase C-terminal
11888
25550

domain

HMG_CoA_synt_N
Hydroxymethylglutaryl-
11889-
25551-

coenzyme A synthase N-
11891
25553

terminal domain

HotDog_dom
domain found in thioesterases
11892-
25554-

and thiol ester dehydratase-
11921
25583

isomerases

HxxPF_rpt
HxxPF-repeat domain.
11922
25584

This family is found in non-

ribosomal peptide synthetase

proteins.

ICP-like
ICP-like domain
11923
25585

Inhibitor_I42
Protease inhibitor
11924
25586

Inosine monophosphate
Inosine monophosphate
11925-
25587-

dehydrogenase (IMPDH)
dehydrogenase
11933
25595

Integrin alpha N-terminal
Integrin alpha N-terminal
11934-
25596-

domain
domain
11937
25599

KAZAL_1
serine proteinase inhibitor
11938
25600

ketoacyl-synt
Beta-ketoacyl synthase
11939-
25601-

11983
25645

Ketoacyl-synt_C
Beta-ketoacyl synthase, C-
11984-
25646-

terminal domain
12027
25689

KR
Ketoreductase
12028-
25690-

12043
25705

L domain-like
Leucine rich repeat domain
12044-
25706-

12047
25709

LamGL
LamG-like jellyroll fold
12048-
25710-

12050
25712

Laminin_G_3
This domain belongs to the
12051-
25713-

Concanavalin A-like
12053
25715

lectin/glucanases superfamily

LEUCINE-RICH REPEAT
Leucine-Rich Repeat Receptor-
12054-
25716-

RECEPTOR-LIKE
Like Kinase1
12057
25719

PROTEIN KINASE

Lipase_GDSL
Domain from GDSL esterases
12058-
25720-

and lipases
12067
25729

Lipase_GDSL_2
Domain from family of
12068-
25730-

presumed lipases and related
12070
25732

enzymes

LRR
Leucine rich repeat
12071-
25733-

12082
25744

LRR_1
Leucine rich repeat
12083-
25745-

12090
25752

LRR_4
Leucine rich repeat
12091-
25753-

12093
25755

LRR_6
Leucine rich repeat
12094
25756

LRR_8
Leucine rich repeat
12095-
25757-

12098
25760

LRR_SD22
Leucine rich repeat
12099-
25761-

12101
25763

LRR_TYP
Leucine rich repeat-typical
12102-
25764-

subtype
12116
25778

LYSOPHOSPHOLIPASE-
Lysophospholipase related
12117
25779

RELATED
domain

MALONYL COA-ACYL
Malonyl-CoA: acyl carrier
12118-
25780-

CARRIER PROTEIN
protein transacylase
12128
25790

TRANSACYLASE

MaoC_dehydrat_N
N-terminal domain of MaoC
12129-
25791-

dehydratase
12137
25799

MaoC_dehydratas
C-terminal domain of MaoC
12138-
25800-

dehydratase
12152
25814

Metallo-dependent
Metallo-dependent phosphatases
12153-
25815-

phosphatases

12202
25864

Metallo-depent_PP-like
Metallo-dependent phosphatases
12203-
25865-

12245
25907

Metallophos
Metallo-dependent phosphatases
12246-
25908-

12281
25943

Mucin
Mucins, high molecular weight
12282-
25944-

glycoconjugates
12283
25945

NAD(P)-bd_dom
NADP binding domain
12284-
25946-

12334
25996

NAD(P)-binding Rossmann-
NAD(P)-binding Rossmann-fold
12335-
25997-

fold domains
domains
12384
26046

NODB
catalytic domain found in
12385
26047

members of carbohydrate

esterase family 4

Oligoxyloglucan reducing
Oligoxyloglucan reducing end-
12386-
26048-

end-specific
specific cellobiohydrolase
12397
26059

cellobiohydrolase

Pectin lyase-like
Pectin-lyase like domain
12398-
26060-

12408
26070

Pectin_lyas_fold
Pectin lyase fold domain
12409-
26071-

12417
26079

Peptidase_S8
Peptidase_S8
12418-
26080-

12421
26083

Peptidase_S8/S53_dom
domain found in serine
12422-
26084-

peptidases
12427
26089

PERIPLASMIC BETA-
Periplasmic Beta-glucosidase
12428-
26090-

GLUCOSIDASE-RELATED
related domain
12437
26099

PERIPLASMIC BROAD-
Periplasmic broad specificity
12438-
26100-

SPECIFICITY
esterase/lipase/protease
12441
26103

ESTERASE/LIPASE/

PROTEASE

PEROXISOMAL
Peroxisomal multifunctional
12442-
26104-

MULTIFUNCTIONAL
enzyme type 2
12445
26107

ENZYME TYPE 2

PHL pollen allergen
PHL pollen allergen
12446-
26108-

12462
26124

PHOSPHOPANTETHEINE
Prosthetic group of acyl carrier
12463-
26125-

protein
12473
26135

PI-PLC-X
PI-PLC X domain
12474-
26136-

12475
26137

PIPLC_X_DOMAIN
PI-PLC X domain
12476-
26138-

12478
26140

PKS_AT
Acyl Transferase domain
12479-
26141-

12509
26171

PKS_ER
Enoyl Reductase
12510-
26172-

12525
26178

PKS_KR
Ketoreductase
12526-
26179-

12545
26198

PKS_KS
Ketosynthase
12546-
26199-

12579
26241

PKS_PP
phosphopantetheine-binding
12580-
26242-

domain
12607
26269

Plant lectins/antimicrobial
Plant lectins/antimicrobial
12608-
26270-

peptides
peptides
12612
26274

PLC-like phosphodiesterases
PLC-like phosphodiesterases
12613-
26275-

12618
26280

PLC-like_Pdiesterase_TIM-
domain consisting of a TIM
12619-
26281-

brl
beta/alpha-barrel, found in
12622
26284

several phospholipase C like

phosphodiesterases

PLCXc
Phosphatidylinositol-specific
12623-
26285-

phospholipase C, X domain
12624
26286

PLP-dependent transferases
PLP-dependent transferases
12625-
26287-

12634
26296

POLYKETIDE SYNTHASE-
Related to sequences found in
12635-
26297-

RELATED
PKS domains
12679
26314

Polysac_deacetylase
Polysaccharide deacetylase
12680-
26314-

12683
26345

Polysacc_deac_1
domain found in polysaccharide
12684-
26346-

deacetylase
12687
26349

PP-binding
phosphopantetheine-binding
12688-
26350-

domain
12727
26389

Probable ACP-binding
Probable ACP-binding domain
12728-
26390-

domain of malonyl-CoA ACP
of malonyl-CoA ACP
12740
26402

transacylase
transacylase

PROKAR_LIPOPROTEIN
Prokaryotic lipoprotein domain
12741-
26403-

12761
26423

PROPROTEIN
Proprotein convertase
12762-
26424-

CONVERTASE
subtilisin/kexin type 9
12764
26426

SUBTILISIN/KEXIN

PROSTAGLANDIN
Prostaglandin reductase 1
12765-
26427-

REDUCTASE 1
domain
12767
26429

PROTEIN C41A3.1
Protein C41A3.1
12768-
26430-

12770
26432

PS-DH
Polyketide synthase, dehydratase
12771-
26433-

domain
12785
26447

PT
Polyketide product template
12786-
26448-

domain
12787
26449

PURPLE ACID
purple acid phosphatase 23
12788-
26450-

PHOSPHATASE 23

12797
26459

Purple acid phosphatase, N-
Purple acid phosphatase, N-
12798-
26460-

terminal domain
terminal domain
12829
26491

Purple_acid_Pase_N
Purple acid phosphatase, N-
12830-
26492-

terminal domain
12861
26523

PyrdxlP-
Pyridoxal phosphate-dependent
12862-
26524-

dep_Trfase_major_sub 1
transferase, major region,
12871
26533

subdomain 1

PyrdxlP-
Pyridoxal phosphate-dependent
12872-
26534-

dep_Trfase_major_sub2
transferase, major region,
12879
26541

subdomain 2

Rhamnogal_lyase
Rhamnogalacturonate lyase
12880-
26542-

12883
26545

RhgB_N
Rhamnogalacturonase B, N-
12884-
26546-

terminal domain
12885
26547

RICIN
Ricin
12886-
26548-

12950
26612

Ricin B-like lectins
Ricin B-like lectins
12951-
26613-

13037
26699

RICIN_B_LECTIN
Ricin B-like lectins
13038-
26700-

13108
26770

RicinB_lectin_2
Ricin B-like lectins
13109-
26771-

13186
26848

SCP-like
Spore coat protein like domain
13187-
26849-

13189
26851

SCP2
Spore coat protein 2
13190-
26852-

13192
26854

SCP2_sterol-bd_dom
SCP2 sterol-binding domain
13193-
26855-

13195
26857

SDRFAMILY
Short-chain
13196-
26858-

dehydrogenase/reductase
13207
26869

SERINE PROTEASE
Serine protease inhibitor
13208-
26870-

INHIBITOR, SERPIN

13215
26877

SERINE/THREONINE
Serine/threonine protein kinase
13216
26878-

PROTEIN KINASE

26878

SERPIN
Serine protease inhibitor
13217-
26879-

13242
26904

Serpins
Serine protease inhibitor
13243-
26905-

13253
26915

SGNH hydrolase
SGNH hydrolase domain
13254-
26916-

13294
26956

SGNH_hydro-
SGNHhydro-type esterase
13295-
26957-

type_esterase_dom
domain
13307
26969

Six-hairpin glycosidases
six-hairpin glycoside domain
13308-
26970-

13389
27051

Starch-binding domain-like
Starch-binding domain-like
13390-
27052-

13393
27055

SUBTILASE_ASP
Serine proteases, subtilase
13394-
27056-

family, aspartic acid active site
13395
27057

SUBTILASE_HIS
Serine proteases, subtilase
13396-
27058-

family, histidine active site.
13397
27059

SUBTILASE_SER
Serine proteases, subtilase
13398-
27060-

family, serine active site.
13399
27061

SUBTILISIN
Protease domain
13400-
27062-

13407
27069

Subtilisin-like
Subtilisin-like protease domain
13408-
27070-

13416
27078

Thioesterase
Thioesterase
13417-
27079-

13419
27081

Thioesterase/thiol ester
Thioesterase/thiol ester
13420-
27082-

dehydrase-isomerase
dehydrase-isomerase
13449
27111

Thiolase-like
Thiolase-like domain
13450-
27112-

13544
27206

Thiolase-like_subgr
Thiolase-like_subgr
13545-
27207-

13637
27299

Thioredoxin-like
Thioredoxin-like
13638
27300

Thioredoxin-like_fold
Thioredoxin-like fold domain
13639
27301

TIGR00556
phosphopantetheine-protein
13640-
27302-

transferase domain
13647
27309

TIGR01733
amino acid adenylation domain
13648
27310

TIGR01833
hydroxymethylglutaryl-CoA
13649-
27311-

synthase
13650
27312

TRANS-2-ENOYL-COA
Enoyl-CoA reductase
13651-
27313-

REDUCTASE,

13654
27316

MITOCHONDRIAL

UNCHARACTERIZED
Uncharacterized
13655-
27317-

13656
27318

VCBS
VCBS repeat domain
13657
27319

ZINC FINGER FYVE
Zinc finger FYVE domain-
13658-
27320-

DOMAIN CONTAINING
containing protein
13659
27321

PROTEIN

ZINC FINGER-
Zinc finger containing protein
13660-
27322-

CONTAINING PROTEIN

13662
27324

ENGINEERED PROTEINS. The domains disclosed herein may be utilized in the creation of “engineered proteins.” As used herein, a “protein of the invention,” or an “engineered protein” will refer to a non-naturally occurring protein, wherein such protein comprises one or more domains selected from SEQ ID NO: 1-13662. A non-naturally occurring protein means the protein is not found in any wild-type species, having been engineered by molecular biological techniques known in the art. For example, the engineered protein may comprise heterologous elements, i.e., elements from different species. Alternatively, the engineered protein may comprise an anaerobic fungal protein lacking heterologous elements, but wherein the elements of the protein have been modified in some way such that they differ from those of the native protein, for example by rearrangement, duplication, or deletion of elements.

The domains disclosed herein will impart various properties to the engineered proteins in which they are incorporated. In some cases, the domain will comprise a structural element and will impart a structural property to the engineered protein. In another embodiment, the domain will comprise a binding domain and will impart a binding affinity for specific binding partners. In other embodiments, the domain will comprise a catalytic domain and will impart an enzymatic activity to the engineered protein.

The various domains of SEQ ID NO: 1-13662 encompass a wide variety of domains having diverse properties. One of skill in the art may readily select a domain of the invention for incorporation into an engineered protein based on the putative functions assigned to the domain. The putative functions of various domains of SEQ ID NO: 1-13662 are listed as “domain descriptions” in Table 1. Methods of using the engineered proteins of the invention will be readily ascertained by the skilled practitioner based upon the properties of the one or more domains of SEQ ID NO: 1-13662 and any additional properties imparted by accessory elements in the engineered proteins.

The proteins of the invention may include chemically synthesized polypeptides and recombinantly produced polypeptides comprising the domain sequences disclosed herein. The scope of the invention will be understood to extend to derivatives of the disclosed domain sequences. The term “derivative,” as used herein with reference to the polypeptides of the invention refers to various modifications, analogs, and products based on the polypeptide sequences disclosed herein, as described below.

Protein derivatives of the invention include substantial equivalents of the disclosed amino acid sequences, for example polypeptides having at least 65%, at least 70%, at least 75%, at least 80%, at least 85%, at least 90%, at least 95%, or at least 99% amino acid sequence identity to a disclosed domain and/or which retain the biological activity of the unmodified sequences.
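By way of non-limiting illustration, an identity threshold such as "at least 90%" might be checked as in the following Python sketch. The sketch assumes the two sequences have already been aligned to equal length, with gaps written as "-". The function name, the choice of denominator (aligned columns in which at least one sequence has a residue), and the toy sequences are illustrative assumptions and are not a method prescribed herein; alignment programs differ in how they define identity.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: identity = matching residues / aligned columns, ignoring
    columns in which both sequences carry a gap ('-').
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        raise ValueError("no aligned columns to compare")
    matches = sum(1 for a, b in columns if a == b and a != "-")
    return 100.0 * matches / len(columns)


# Toy example (hypothetical sequences): 9 of 10 aligned positions match.
reference = "MKTLLVAGAS"
variant = "MKTLLVAGAT"
identity = percent_identity(reference, variant)
meets_90_percent = identity >= 90.0
```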

Protein derivatives of the invention further include polypeptides disclosed herein which have been modified by such techniques as ubiquitination, labeling (e.g., with radioactive or fluorescent moieties), covalent polymer attachment, etc. Derivative proteins of the invention include post-translational modifications of the polypeptide including, but not limited to, acetylation, carboxylation, glycosylation, phosphorylation, lipidation and acylation, etc.

Protein derivatives of the invention further include polypeptides differing from the sequences disclosed herein by amino acid substitutions. For example, amino acid substitutions that largely preserve the secondary or tertiary structure of the original polypeptide may be selected on the basis of similarity in polarity, charge, solubility, hydrophobicity, hydrophilicity, and/or the amphipathic properties of specific residues. Determination of which amino acid substitutions may be made while maintaining enzymatic and other activities of interest is within the ability of one of ordinary skill in the art of protein engineering. The invention also comprises substitutions with non-naturally occurring amino acids, amino acid analogs, etc.

Protein derivatives of the invention further include polypeptides resulting from mutations in the disclosed polynucleotide sequence intentionally introduced to enhance or modify characteristics of the polypeptide, such as to alter post-translational processing, binding affinities (e.g. introduction of specific epitopes for antibody binding), degradation/turnover rate, industrial processing compatibility (e.g. optimized expression, purification, etc.) or other properties.

The invention further comprises truncated versions of the protein domains disclosed herein, for example C-terminal, N-terminal, or internal deletions encompassing, for example, 1-20 amino acids. The invention further comprises isolated functional units from the disclosed domain sequences, for example isolated binding domains, catalytic domains, and other motifs having useful structures or functions which may be used in isolation from the remainder of the protein.

The invention further comprises any of the disclosed polypeptide sequences which have been augmented with additional amino acids. For example, the invention also includes fusion proteins and chimeric proteins in which a disclosed polypeptide sequence, or a sub-sequence thereof, is combined with other peptides, proteins, or amino acid sequences. Exemplary fusion or chimeric proteins include the disclosed domain polypeptide sequences, or sub-sequences thereof, which have been combined with functional sequences from different proteins. Such proteins may further include secondary polypeptide sequences that impart desired properties such as enhanced secretion, or which enable purification (e.g. His-Tags), immobilization, and other desirable properties.

The invention further includes antibodies that specifically recognize one or more epitopes present on the disclosed polypeptides, as well as hybridomas producing such antibodies.

POLYNUCLEOTIDE CONSTRUCTS. The scope of the invention further encompasses any nucleic acid construct which codes for an engineered protein of the invention or codes for an engineered multiple enzyme complex of the invention. For example, the nucleic acid constructs of the invention may include any non-naturally occurring nucleic acid construct which comprises one or more nucleic acid sequences selected from SEQ ID NO: 13663-27324 or SEQ ID NO: 27328-27330 (corresponding to the proteins of SEQ ID NO: 1-13662 and SEQ ID NO: 27325-27327). However, it will be understood that, due to the redundancy of the genetic code and the diverging codon preferences in different species, nucleic acid sequences coding for the proteins of SEQ ID NO: 1-13662 and SEQ ID NO: 27325-27327 are not limited to the fungal-derived sequences disclosed in SEQ ID NO: 13663-27324 and SEQ ID NO: 27328-27330, and may comprise any nucleic acid construct comprising a sequence coding for the selected domain.
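For convenience, the correspondence between protein and nucleic acid sequence identifiers may be sketched as follows. This Python sketch assumes a one-to-one, order-preserving pairing between the protein ranges and the nucleic acid ranges. That pairing is inferred from the matching range sizes and from the paired entries of Table 1 (for example, SEQ ID NO: 13648 listed alongside SEQ ID NO: 27310) and is not stated explicitly herein. The function name is hypothetical.

```python
# Assumed ranges, taken from the text: domain proteins and ScaA full proteins.
DOMAIN_PROTEIN_RANGE = (1, 13662)         # coded by SEQ ID NO: 13663-27324
SCAFFOLDIN_PROTEIN_RANGE = (27325, 27327)  # coded by SEQ ID NO: 27328-27330


def coding_seq_id(protein_seq_id: int) -> int:
    """Return the nucleic acid SEQ ID NO presumed to code for a protein SEQ ID NO.

    Assumption: identifiers pair up in order, so the offset is constant
    within each range (13662 for domains, 3 for scaffoldins).
    """
    lo, hi = DOMAIN_PROTEIN_RANGE
    if lo <= protein_seq_id <= hi:
        return protein_seq_id + 13662
    lo, hi = SCAFFOLDIN_PROTEIN_RANGE
    if lo <= protein_seq_id <= hi:
        return protein_seq_id + 3
    raise ValueError(f"SEQ ID NO: {protein_seq_id} is not a protein sequence")


amino_acid_adenylation_coding_id = coding_seq_id(13648)
```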

The polynucleotide sequences of the invention encompass DNA, RNA, DNA-RNA hybrids, peptide nucleic acid (PNA) or any other DNA-like or RNA-like material. For clarity, the polynucleotide sequences disclosed herein do not encompass genomic DNA sequences as present in their natural source (e.g. native organism). The polynucleotide sequences of the invention do not contain introns or untranslated 3-prime and 5-prime sequences. The polynucleotide sequences encompass translated sequences only.

The nucleic acid constructs of the invention encompass sequences which are the reverse or direct complement of any of the disclosed nucleic acid sequences (or their derivatives, as described below). Polynucleotide constructs of the invention may comprise single-stranded or double-stranded polynucleotides and may represent the sense or the antisense strand. The nucleic acid constructs of the present invention also include nucleic acid sequences that hybridize to the disclosed nucleotide sequences or their complements under stringent conditions. Polynucleotide constructs of the invention include sequences having high sequence similarity to the disclosed sequences (and their derivatives), for example, sequences having at least 80% homology, at least 85% homology, at least 90% homology, at least 95% homology, or at least 99% homology.
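The direct and reverse complements referred to above may be illustrated with a short Python sketch. It handles only the four unambiguous DNA bases; the function names and the example sequence are illustrative assumptions.

```python
# Translation table for Watson-Crick pairing of unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna: str) -> str:
    """Direct complement: each base replaced by its pairing partner."""
    return dna.translate(_COMPLEMENT)


def reverse_complement(dna: str) -> str:
    """Reverse complement: the complementary strand read 5' to 3'."""
    return complement(dna)[::-1]


sense = "ATGC"                       # hypothetical fragment
antisense = reverse_complement(sense)
```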

The polynucleotide constructs of the invention further encompass constructs comprising sequences which are derivatives of the disclosed domain polynucleotide sequences. As used herein, with reference to domain polynucleotide sequences, the term “derivative” refers to complementary sequences, degenerate sequences, truncated or augmented sequences, modified sequences, and other polynucleotides based upon the disclosed sequences. One form of polynucleotide derivative contemplated within the scope of the invention is a polynucleotide comprising nucleotide substitutions. For example, utilizing the redundancy in the genetic code, various substitutions may be made within a given polynucleotide sequence that result in a codon which codes for the identical amino acid as coded for in the original sequence, such that the change does not alter the composition of the polypeptide coded for by the polynucleotide. Such “silent” substitutions may be selected by one of skill in the art. Likewise, nucleotide substitutions are contemplated which result in an amino acid substitution, wherein the amino acid is of similar polarity, charge, size, aromaticity, etc., such that the resulting polypeptide is of identical or substantially similar structure and function as a polypeptide resulting from an unmodified sequence. Further, the invention also comprises nucleotide substitutions which result in amino acid substitutions which create a polypeptide derivative, as described above.
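A "silent" substitution of the kind described above may be illustrated as follows. This Python sketch builds the standard genetic code and checks whether two coding sequences translate to the same polypeptide. The function names and the nine-nucleotide example sequences are hypothetical, and the sketch assumes in-frame DNA with no ambiguous bases.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
_BASES = "TCAG"
_AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip(
    (a + b + c for a in _BASES for b in _BASES for c in _BASES),
    _AMINO_ACIDS,
))


def translate(dna: str) -> str:
    """Translate an in-frame DNA coding sequence ('*' marks a stop codon)."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def is_silent(original: str, modified: str) -> bool:
    """True if the modified sequence codes for the identical polypeptide."""
    return len(original) == len(modified) and translate(original) == translate(modified)


original = "ATGGCTTTA"    # Met-Ala-Leu
synonymous = "ATGGCCCTG"  # Met-Ala-Leu using different Ala and Leu codons
missense = "ATGGATTTA"    # Met-Asp-Leu
```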

It is also understood by one of skill in the art that various nucleotide analogs, modified nucleotides, and other compositions may be substituted for the nucleotides of the disclosed DNA sequences and their derivatives, for example modified or non-naturally occurring nucleotides such as 5-propynyl pyrimidines (i.e., 5-propynyl-dTTP and 5-propynyl-dCTP) and 7-deaza purines (i.e., 7-deaza-dATP and 7-deaza-dGTP). Nucleotide analogs include base analogs and comprise modified forms of deoxyribonucleotides as well as ribonucleotides.

Additionally, substitutions in a disclosed polynucleotide sequence may be made which enable the translation of polypeptides from the polynucleotide sequence within a specific expression system. For example, as the polynucleotides of the invention are isolated from fungal species, it is contemplated that the disclosed sequences may be modified as necessary to enable or optimize expression of proteins in eukaryotic, yeast, insect, plant, mammalian, or in other expression systems such as cell-free and chemical systems. The selection of proper substitutions for proper expression within a given expression system is within the skill of one in the art of molecular biology.
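One simple way such host-specific substitutions might be selected is to re-encode the polypeptide using a single preferred codon per amino acid. The Python sketch below shows this under stated assumptions: the preference table is a placeholder for a hypothetical host and is not real codon usage data, it covers only the residues in the toy example, and practical codon optimization typically weighs additional factors.

```python
# PLACEHOLDER preferences for a hypothetical host organism; not real usage data.
PREFERRED_CODON = {
    "M": "ATG",
    "A": "GCT",
    "L": "TTG",
    "K": "AAA",
    "*": "TAA",   # stop codon
}


def back_translate(protein: str, preferred: dict = PREFERRED_CODON) -> str:
    """Encode a polypeptide using one preferred codon per residue."""
    missing = sorted({aa for aa in protein if aa not in preferred})
    if missing:
        raise ValueError(f"no preferred codon defined for: {', '.join(missing)}")
    return "".join(preferred[aa] for aa in protein)


recoded = back_translate("MALK*")   # hypothetical four-residue peptide plus stop
```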

Polynucleotide derivatives of the invention also comprise augmented or chimeric sequences, wherein a disclosed polynucleotide sequence has been modified to include additional nucleotides. For example, a disclosed polynucleotide sequence, or subsequences thereof, may be ligated with additional polynucleotide sequences which enhance expression (for example, promoter sequences), or which alter the properties of the resulting polypeptide, such as sequences which enhance secretion, enable isolation (e.g. sequences which code for His-Tags or like moieties), enable immobilization, or other useful sequences as known in the art.

The scope of the invention additionally includes vectors comprising the polynucleotide constructs of the invention. Exemplary vectors include plasmids, phages, and viral constructs which promote efficient maintenance, amplification, and transcription of the polynucleotide sequences in an expression system. The nucleic acid constructs may comprise sequences integrated into the genome of an organism by transduction techniques known in the art.

ENGINEERED ORGANISMS. In one aspect, the scope of the invention encompasses organisms which have been genetically engineered to express one or more engineered proteins of the invention, i.e. proteins comprising the protein domains selected from SEQ ID NO: 1-13662. The engineered organisms of the invention further encompass organisms which express any of the ScaA full proteins of SEQ ID NO: 27325-27327, or portions thereof. Likewise, engineered organisms may comprise the nucleic acid constructs of the invention, for example, with the nucleic acid sequences being transiently expressed by the organism or being stably integrated into the genome of the organism. In one implementation of the invention, the engineered organism is an organism expressing one or more nucleic acid sequences selected from SEQ ID NO: 13663-27324 or SEQ ID NO: 27328-27330.

Engineered organisms of the invention may comprise any species, for example, fungal species, yeast, bacteria, plants, and other organisms genetically modified to produce one or more engineered proteins of the invention. The engineered organisms of the invention may further comprise cell lines, such as insect cell cultures, CHO cells, and other cell culture systems used in the production of recombinant proteins.

Engineered Enzymes for Bioprocessing.

The various inventions described herein may be applied in numerous bioprocessing methods. The present description is largely directed to bioprocessing methods for the digestion of lignocellulosic biomass into fermentable monomers. However, it will be understood that the engineered proteins and organisms described herein may be applied in other bioprocessing methods, for example, for the synthesis of chemicals from feedstocks, including polymers, biofuels, and others.

In one aspect, the engineered proteins of the invention encompass proteins which participate in the breakdown of lignocellulosic biomass. In one embodiment, the engineered proteins of the invention comprise a glycoside hydrolase or other enzyme capable of digesting a component of lignocellulosic materials. For example, the engineered enzyme of the invention may comprise a cellulase, glycosidase, esterase, SGNH hydrolase, endoglucanase, cellobiohydrolase, Beta-D-glucan exohydrolase, beta-glucanase, phosphatidylinositol phosphodiesterase, pectin lyase, fucosidase, glycoside hydrolase, glycosyl hydrolase, hemicellulase, xylanase, galactosaminoglycan glycanohydrolase, amylase, chitinase, trehalase, glucoamylase, β-glucuronyl hydrolase, or acid phosphatase. In one embodiment, the engineered protein of the invention is a glycoside hydrolase comprising one or more domains selected from the group consisting of SEQ ID NO: 1-155; SEQ ID NO: 1095-1309; SEQ ID NO: 3851-3972; and SEQ ID NO: 9755-11844. In one embodiment, the invention comprises an organism comprising a polynucleotide sequence which codes for a domain selected from the sequences of SEQ ID NO: 1-155; SEQ ID NO: 1095-1309; SEQ ID NO: 3851-3972; and SEQ ID NO: 9755-11844.

The scope of the invention encompasses methods of using engineered proteins comprising lignocellulose-degrading enzymes to facilitate the breakdown of lignocellulosic biomass. In one such method, an engineered protein comprising a lignocellulose-degrading enzyme is produced in an engineered organism. Exemplary engineered organisms include Saccharomyces cerevisiae, Zymomonas mobilis, Escherichia coli, and Clostridium thermocellum. Systems which utilize such organisms in biofuel production are known in the art. For example, the successful heterologous expression of functional saccharification enzymes from a fungal organism in yeast has been previously demonstrated, as described in O'Malley et al., Evaluating expression and catalytic activity of anaerobic fungal fibrolytic enzymes native to Piromyces sp E2 in Saccharomyces cerevisiae. Environmental Progress and Sustainable Energy 31:37-46, 2012.

In one such method, the engineered protein is produced by and is subsequently extracted from the organism. Purification or modification steps may be applied to the extracted enzyme. The enzyme may then be used in any applicable lignocellulosic bioprocessing system by contacting it with an appropriate substrate under suitable conditions for enzymatic action to occur. In one embodiment, the extracted enzyme is used as a component of an enzymatic cocktail, for example, an enzymatic cocktail used in the saccharification of cellulosic materials.

In an alternative implementation, an engineered protein comprising a lignocellulose degrading enzyme of the invention is expressed in an engineered organism, and the engineered organism is cultured with an appropriate lignocellulosic substrate to promote breakdown of the substrate.

Methods of using proteins comprising a lignocellulose degrading enzyme of the invention may be performed in any bioprocessing method, for example, in ethanol production from biomass.

MULTIPLE ENZYME CATABOLIC COMPLEXES. In another embodiment, the invention encompasses engineered enzymatic complexes. An engineered enzyme complex is a complex comprising multiple enzymes bound to a carrier or scaffold and further comprising one or more substrate binding moieties. Such multiple enzyme complexes may be used to process a substrate with high efficiency due to the presence of multiple complementary enzymatic moieties being held in proximity to the substrate by the substrate binding moieties.

The engineered enzyme complexes of the invention are based on the bacterial cellulosome. In anaerobic microorganisms, cellulolytic enzymes are not secreted freely into the extracellular medium, as is generally the case for aerobic microbes, but instead these enzymes assemble into large (MDa) multi-protein cellulolytic complexes called cellulosomes. Cellulosomes comprise various components. A first component is a non-catalytic protein that is anchored to the cell membrane of the host cell expressing the cellulosome, typically a scaffoldin or its equivalent. The scaffoldin comprises multiple domains called cohesins, which are sites to which functional moieties will attach. The functional moieties may comprise enzymes which comprise one or more dockerin domains. The dockerin domain will selectively bind complementary cohesin domains on the scaffoldin protein with high affinity. The cellulolytic complex will typically further comprise one or more carbohydrate binding moieties which bind lignocellulosic substrate. This binding keeps the substrate in proximity to the catalytic enzymes present on the cellulosome, facilitating degradation of the substrate. A conceptual depiction of a cellulosome is shown in FIG. 1.

In one aspect, the scope of the invention encompasses what will be referred to as an engineered enzyme complex. The engineered enzyme complex comprises: a scaffold protein; one or more catalytic proteins; and one or more substrate-binding proteins. In one embodiment, the one or more catalytic proteins and one or more substrate-binding proteins are bound to the scaffold protein by cohesin-dockerin interactions, with complementary dockerin and cohesin elements being present on the scaffold and on the bound moieties. An engineered enzyme complex of the invention is any enzyme complex wherein one or more components are engineered proteins of the invention. Alternatively, the engineered enzyme complex of the invention is one comprising a scaffoldin protein selected from SEQ ID NO: 27325-27327 (for example, being coded for by nucleic acid sequences SEQ ID NO: 27328-27330).
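The composition of such a complex may be pictured with a toy data model. The Python sketch below is a conceptual illustration only: the class names, the one-dockerin-per-cohesin-site binding rule, and the example component names are assumptions made for the sketch and do not describe any particular disclosed sequence.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A protein carrying a dockerin element (names here are hypothetical)."""
    name: str
    role: str                 # "catalytic" or "substrate-binding"
    has_dockerin: bool = True


@dataclass
class Scaffold:
    """A scaffold protein exposing a fixed number of cohesin sites."""
    name: str
    cohesin_sites: int
    bound: list = field(default_factory=list)

    def dock(self, component: Component) -> None:
        # Binding is modeled as one dockerin occupying one free cohesin site.
        if not component.has_dockerin:
            raise ValueError(f"{component.name} has no dockerin element")
        if len(self.bound) >= self.cohesin_sites:
            raise ValueError("no free cohesin site on the scaffold")
        self.bound.append(component)

    def is_complete(self) -> bool:
        # Requires at least one catalytic and one substrate-binding protein.
        roles = {c.role for c in self.bound}
        return {"catalytic", "substrate-binding"} <= roles


scaffold = Scaffold(name="example scaffoldin", cohesin_sites=3)
scaffold.dock(Component("example glycoside hydrolase", "catalytic"))
complete_before = scaffold.is_complete()
scaffold.dock(Component("example carbohydrate binding protein", "substrate-binding"))
complete_after = scaffold.is_complete()
```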

Tools and methodologies for the creation of multiple enzyme complexes and organisms expressing them are known in the art. Cellulosomes and like enzyme complexes have been successfully produced in which the type and precise placement of enzymes can be controlled, for example as described in Fujita et al., Synergistic saccharification, and direct fermentation to ethanol, of amorphous cellulose by use of an engineered yeast strain codisplaying three types of cellulolytic enzyme. Appl Environ Microbiol. 2004 February; 70(2):1207-12. Additional methods of producing engineered cellulosomes are described in United States Patent Application Publication Number 20150167030, entitled “Recombinant cellulosome complex and uses thereof,” by Mazolli; United States Patent Application Publication Number 20130189745, entitled “Artificial cellulosome and the use of the same for enzymatic breakdown of resilient substrates,” by Schwarz; and U.S. Pat. No. 9,315,833, entitled “Yeast cells expressing an exogenous cellulosome and methods of using the same,” by McBride.

In one implementation of the invention, the engineered multiple enzyme complex is an artificial cellulosome designed for the efficient digestion of lignocellulosic biomass, wherein the one or more catalytic proteins comprise a plurality of proteins which degrade lignocellulosic material, e.g. glycoside hydrolase proteins, and the one or more substrate-binding proteins comprise carbohydrate binding domains.

In one embodiment, the scaffold protein of the engineered enzyme complex is a scaffoldin protein comprising multiple cohesin domains. For example, the scaffoldin protein may comprise a scaffoldin protein selected from the group consisting of SEQ ID NO: 27325-27327. In another embodiment, the artificial cellulosome comprises a dockerin domain. In one embodiment, the dockerin domain comprises a dockerin domain selected from the group consisting of: SEQ ID NO: 1420-3705 and SEQ ID NO: 6590-8910. In another embodiment, the artificial cellulosome comprises one or more carbohydrate binding domains. For example, the carbohydrate binding domain may comprise a carbohydrate binding domain selected from the sequences of: SEQ ID NO: 1061-1062; SEQ ID NO: 1325-1333; SEQ ID NO: 1378-1419; SEQ ID NO: 3706-3850; SEQ ID NO: 3973-6254; and SEQ ID NO: 9480-9605. In one embodiment, the artificial cellulosome of the invention comprises one or more glycoside hydrolase proteins comprising one or more domains selected from the sequences of SEQ ID NO: 1-155; SEQ ID NO: 1095-1309; SEQ ID NO: 3851-3972; and SEQ ID NO: 9755-11844. The scope of the invention further extends to nucleic acid sequences which code for the various elements of the artificial cellulosomes. The scope of the invention further encompasses engineered organisms which express the various elements of the artificial cellulosome. The scope of the invention further encompasses methods of using such engineered organisms in the digestion of lignocellulosic biomass. It will be understood that the artificial cellulosomes of the invention comprise or are expressed in combination with anchoring moieties, secretory signals and other elements required for the expression, secretion, and assembly of cellulosomes, as known in the art.
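The sequence ranges recited above lend themselves to a simple lookup. The Python sketch below assigns a protein SEQ ID NO to the cellulosome element under which it is recited in the preceding paragraph. The range values are copied from that paragraph; the dictionary layout and the function name are illustrative assumptions.

```python
from typing import Optional

# SEQ ID NO ranges (inclusive) as recited for artificial cellulosome elements.
CELLULOSOME_RANGES = {
    "glycoside hydrolase": [(1, 155), (1095, 1309), (3851, 3972), (9755, 11844)],
    "dockerin": [(1420, 3705), (6590, 8910)],
    "carbohydrate binding": [(1061, 1062), (1325, 1333), (1378, 1419),
                             (3706, 3850), (3973, 6254), (9480, 9605)],
    "scaffoldin": [(27325, 27327)],
}


def cellulosome_element(seq_id: int) -> Optional[str]:
    """Return the element type recited for a SEQ ID NO, or None if not listed."""
    for element, ranges in CELLULOSOME_RANGES.items():
        if any(lo <= seq_id <= hi for lo, hi in ranges):
            return element
    return None
```

For instance, `cellulosome_element(1500)` falls in the first dockerin range, while an identifier outside every recited range returns `None`.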

The artificial cellulosomes of the invention enable components from two or more species to be advantageously combined. Enzymes from non-fungal species can be utilized in anaerobic fungal cellulosomes, or enzymes from anaerobic fungi can be used in non-fungal cellulosomes. For example, in one implementation, the novel dockerin domains of the invention derived from anaerobic fungi may be fused with enzymes or carbohydrate-binding moieties from other species, such as from yeast or aerobic bacteria, and these combined elements can be bound to scaffoldins from anaerobic fungi. In another implementation, dockerins from other species could be fused to catalytic proteins or carbohydrate binding domains from anaerobic fungi, facilitating the inclusion of these anaerobic fungal proteins in synthetic cellulosomes of other species. This exchange of enzymatic elements from divergent species aids in the creation of novel artificial cellulosomes having extended enzymatic capabilities beyond those of wild type enzymatic complexes. Such hybrid systems can, with a single organism, recapitulate digestive processes in the complex environment of the rumen, where fungal, yeast, and bacterial strains work in concert to digest complex biomass.

It will be understood that the engineered enzyme complexes of the invention are not limited to multiple enzyme complexes which degrade lignocellulosic material, and may be designed for efficient enzymatic action of any kind on any substrate, as determined by the selection of suitable catalytic enzymes and substrate-binding moieties.

Polyketide Synthases

A very large number of important drugs and biologically active compounds are from the group called polyketides. Polyketides are structurally diverse compounds created by multi-domain enzymes or enzyme complexes called polyketide synthases (PKSs). PKS proteins are composed of various peptide domains, each of which has a defined function. Various classes of PKSs are known, including Type I, Type II, and Type III PKSs. The Type I PKSs may be classified as either iterative or modular.

The iterative PKSs comprise a single module. The creation of a polyketide is initiated by binding a starting material to the acyl transferase (AT) domain, the starting material typically being acetyl-CoA or malonyl-CoA. The bound starting material is then shuttled to the ketosynthase (KS) domain by an acyl carrier protein (ACP). An extender material, typically malonyl-CoA, is then loaded into the complex by the AT domain and is added to the starter material by a condensation reaction catalyzed by the KS domain. Additional domains may introduce modifications to the bound chain by catalytic action. Additional extension reactions and modification reactions occur until the polyketide chain has reached its final length, which is specific for each type of iterative PKS. The mechanisms by which final length is controlled are not known. When the polyketide has reached its final length, a thioesterase (TE) domain releases the completed polyketide. Thus, such PKSs are called “iterative” because the final product polyketide is produced in an iterative fashion by the repeated action of the domains to lengthen and modify the growing polyketide chain. The various enzymatic domains of the iterative PKSs are not always used in each cycle, allowing for more variability in final product composition.

In contrast, modular PKSs have multiple repeating modules, arranged from the N-terminal end of the PKS towards the C-terminal end. In each module, the AT, ACP, and KS domains are repeated, and each module also contains its own combination of catalytic domains. Chain elongation is initiated at the N-terminal end in the first module, and the growing chain is passed from module to module towards the C-terminal end, undergoing a single elongation and one or more enzymatic modifications at each step. At the C-terminal module, a thioesterase (TE) domain releases the completed polyketide.
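The contrast between modular and iterative operation may be pictured with a toy model. In the Python sketch below, a modular PKS passes the chain through each module once, from the N-terminal module to the C-terminal module, whereas an iterative PKS reuses a single module for a set number of cycles. The function names, the text representation of the chain, and the example modules are assumptions made for illustration. The sketch does not model the underlying chemistry, and for simplicity it applies the same modifying domains in every iterative cycle, which, as noted above, is not always the case.

```python
def modular_pks(starter: str, modules: list) -> str:
    """Pass the chain through each module once, N-terminal to C-terminal.

    Each module is (extender_unit, modifying_domains); one elongation
    occurs per module, followed by that module's modifications.
    """
    chain = [starter]
    for extender, modifying_domains in modules:
        label = extender
        if modifying_domains:
            label += "[" + "+".join(modifying_domains) + "]"
        chain.append(label)
    return "-".join(chain)


def iterative_pks(starter: str, module: tuple, cycles: int) -> str:
    """Reuse a single module for the given number of extension cycles."""
    return modular_pks(starter, [module] * cycles)


# Hypothetical two-module assembly line and a three-cycle iterative synthase.
modular_product = modular_pks(
    "acetyl",
    [("malonyl", ("KR",)), ("malonyl", ("KR", "DH", "ER"))],
)
iterative_product = iterative_pks("acetyl", ("malonyl", ()), cycles=3)
```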

Just as PKS domains can interact with one another, PKSs can interact, or form hybrid complexes, with non-ribosomal peptide synthases to form active compounds (e.g. the anticancer compound epothilone).

Various classes of enzymatic PKS domain are known, including:

- ketoreductase (KR) domains, which reduce ketone groups to hydroxyl groups;
- dehydratase (DH) domains, which dehydrate hydroxyl groups to form enoyl groups;
- enoyl reductase (ER) domains, which reduce enoyl groups to alkyl groups;
- methyltransferase (MT) domains, which transfer methyl groups to the growing polyketide;
- sulfohydrase domains (SH); and
- product template domains, which determine the folding pattern of the polyketide backbone.
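The KR, DH, and ER domains listed above act in sequence, each on the product of the one before it. The Python sketch below illustrates that dependency by reporting the functional group expected after processing by a given set of domains. The function name and state labels are illustrative assumptions, and the model ignores substrate specificity and stereochemistry.

```python
def processed_group(domains) -> str:
    """Functional group left after KR, DH, and ER processing of a ketone.

    Each step needs the product of the previous one: KR acts on the ketone,
    DH on the resulting hydroxyl, and ER on the resulting enoyl group.
    """
    present = set(domains)
    state = "ketone"
    if "KR" in present:
        state = "hydroxyl"
        if "DH" in present:
            state = "enoyl"
            if "ER" in present:
                state = "alkyl"
    return state


fully_reduced = processed_group(("KR", "DH", "ER"))
```

In this simplified model, a DH or ER domain without the preceding domain leaves the ketone unchanged because its substrate is never formed.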

Additional non-PKS catalytic domains that work in tandem with PKS domains include aminotransferases, pyridoxal-phosphate transferases and HMG-CoA synthases.

The specificity of substrates and products for the domains varies, as does their order within PKSs. Accordingly, the different combinations and orders of enzymatic domains within the PKS modules, and the different arrangements of modules within modular PKSs, mean that these enzymes can be configured to produce an immense range of final products.

The released polyketide may then be further modified by the action of additional enzymes, for example by the addition of carbohydrate moieties or methyl groups. The further complexity of PKS systems enables even greater diversity of products. For example, two iterative PKSs can interact to form a common product (for example as in the synthesis of zearalenone). A PKS may also be fused with another enzyme to form a single enzyme (for example as known in the synthesis of fusarin C).

Accordingly, PKSs, due to their modular nature, including multiple domains arranged within a module, and multiple modules within an enzyme, present a potential platform for the synthesis of myriad biological products.

Engineered PKSs. Engineered PKS systems are known in the art and have been successfully utilized to create various novel end products, some of which have never been observed in nature. Various strategies exist for utilizing novel PKS enzymes, PKS modules, or PKS domains in the creation of diverse, potentially bioactive molecules. Exemplary PKS engineering techniques are described in U.S. Pat. No. 9,334,514, entitled “Hybrid polyketide synthases,” by Fortman et al.; U.S. Pat. No. 8,709,781, entitled “System and method for the heterologous expression of polyketide synthase gene clusters,” by Boddy et al.; and United States Patent Application Publication Number 20130067619, entitled “Genes and proteins for aromatic polyketide synthesis,” by Page and Gagne.

The current state of PKS engineering allows for the recombination and swapping of various PKS enzymes, modules, and domains, enabling novel means of synthesizing compounds using engineered enzyme systems. Accordingly, there is a need in the art for PKS enzymes, modules, and domains with novel functions, which may be employed in engineered PKS systems. The novel PKS gene and protein sequences disclosed herein provide the art with novel tools for the creation of engineered PKS synthesis systems and enable the creation of novel compounds.

In one aspect, the scope of the invention encompasses engineered proteins comprising engineered PKS enzymes. The engineered PKS enzyme of the invention may comprise a modular PKS or an iterative PKS. In one embodiment, the engineered PKS enzyme of the invention comprises an acyl transferase domain. For example, the engineered PKS may comprise an acyl transferase domain selected from the sequences of SEQ ID NO: 465-578; SEQ ID NO: 768-798; and SEQ ID NO: 12479-12509. In one embodiment, the engineered PKS comprises an acyl carrier protein domain. For example, the acyl carrier protein domain may comprise an acyl carrier domain selected from the sequences of SEQ ID NO: 604-767 and SEQ ID NO: 12463-12473. In one embodiment, the engineered PKS comprises a ketosynthase domain. For example, the ketosynthase domain may comprise a ketosynthase domain selected from the sequences of SEQ ID NO: 12546-12579. In one embodiment, the engineered PKS comprises a thioesterase domain. For example, the thioesterase domain may comprise a thioesterase domain selected from the sequences of SEQ ID NO: 13417-13449. In one embodiment, the engineered PKS comprises a ketoreductase domain. For example, the ketoreductase domain may comprise a ketoreductase domain selected from the sequences of SEQ ID NO: 12028-12043 and SEQ ID NO: 12526-12545. In one embodiment, the engineered PKS comprises a dehydratase domain. For example, the dehydratase domain may comprise a dehydratase domain selected from the sequences of SEQ ID NO: 12129-12152 and SEQ ID NO: 12771-12785. In one embodiment, the engineered PKS comprises an enoyl reductase domain. For example, the enoyl reductase domain may comprise an enoyl reductase domain selected from the sequences of SEQ ID NO: 12510-12525 and SEQ ID NO: 13651-13654. In one embodiment, the engineered PKS of the invention comprises a product template domain. For example, the product template domain may comprise a product template domain selected from the sequences of SEQ ID NO: 12786-12787. The scope of the invention further encompasses engineered proteins which are not PKS enzymes, but which contain any of the aforementioned PKS domains.

The scope of the invention further includes engineered accessory enzymes, which, as used herein, are engineered proteins with functions accessory to PKS enzymes. In one embodiment, the engineered protein comprises an aminotransferase domain selected from SEQ ID NO: 973-982. In one embodiment, the engineered protein comprises a pyridoxal-phosphate transferase domain selected from SEQ ID NO: 12862-12879. In one embodiment, the engineered protein comprises an HMG-CoA synthase domain selected from SEQ ID NO: 11883-11891.

The scope of the invention further encompasses nucleic acid constructs which code for any of the aforementioned engineered PKS enzymes or engineered accessory enzymes. Furthermore, the scope of the invention encompasses engineered organisms which express any of the aforementioned engineered PKS enzymes or which comprise a nucleic acid construct coding therefor. Exemplary engineered PKS organisms include fungal species, bacterial species, yeast species, or plant species. The scope of the invention further encompasses methods of creating complex molecules, including polyketides, utilizing the engineered PKS enzymes and/or organisms expressing such engineered PKS enzymes, wherein suitable substrates are exposed to such engineered PKS enzymes and/or organisms expressing such engineered PKS enzymes under conditions which facilitate the synthesis of desired end-products.

Biofuel Production Using Novel Anaerobic Fungal Strains

Lignocellulosic material, or biomass, is a renewable and abundant material and represents a potential feedstock for energy and chemical production. However, the sugars contained in lignocellulosic materials are locked in a complex of lignin, hemicellulose, cellulose, and other plant cell wall components. Currently, to extract fermentable sugars from these recalcitrant feedstocks, lignin and hemicellulose must be separated from the biomass prior to converting cellulose into monosaccharides. As a result, bioprocessing of crude biomass entails energy-intensive pretreatment steps and the addition of expensive and often inefficient cocktails of cellulolytic enzymes.

In contrast, anaerobic gut fungi that are resident in the gut of herbivores routinely and efficiently degrade cellulose in complex, lignin-rich biomass. This is achieved through both mechanical and enzymatic processes: colonizing fungi develop a highly branched rhizoidal network, or rhizomycelium, that penetrates and exposes the substrate to attack by secreted cellulases. Importantly, this unique invasive strategy for plant cell wall degradation enables gut fungi to colonize and decompose complex cellulosic feedstocks. Anaerobic gut fungi degrade plant particulates of dissimilar sizes at nearly the same rate, whereas the degradation rates of eubacterial populations steadily decrease with increasing particle size. Therefore, anaerobic gut fungi may serve as a means to degrade diverse biomass feedstocks to useful bioenergy compounds, without the need for expensive pretreatment, greatly reducing the cost and increasing the efficiency of biomass conversion to useful products.

Accordingly, there is a need in the art for novel organisms capable of efficient conversion of biomass to usable fuel materials, and for methods of culturing such organisms. The four previously undescribed species of anaerobic fungal gut organisms described herein fulfill this need in the art, being capable of breaking down plant material to produce ethanol, hydrogen, and other useful materials. Grown under anaerobic culture conditions, each of the four organisms is capable of degrading a wide range of lignocellulosic materials. For example, the organisms can metabolize reed canary grass, glucose, fructose, avicel, and filter paper, demonstrating an ability to break down a wide range of biomass materials.

In addition to cellulosomes, which convert plant material into fermentable sugars, anaerobic fungi possess hydrogenosomes that convert the released sugars to hydrogen gas following glycolysis. Hydrogenosomes are intracellular membrane-bound organelles that are analogous to the mitochondria of aerobic microbes. In general, they metabolize malate and pyruvate to H₂, CO₂, formate, and acetate, generating energy in the form of ATP. The four novel organisms described herein are each capable of hydrogen production from a range of feedstocks. Accordingly, in one aspect, the invention comprises the use of Piromyces finnis, Neocallimastix californiae, Anaeromyces robustus, and/or Neocallimastix sp S4 in the conversion of biomass feedstocks into ethanol, hydrogen, and other useful materials. The basic process of the invention comprises introducing biomass feedstocks into a bioreactor vessel wherein culture conditions amenable to organism growth and metabolism are maintained, allowing colonization and digestion of biomass by the organisms, and ongoing or subsequent harvesting of end-products.

Anaerobic bioreactors and fungal bioreactors are known in the art. For example, exemplary fungal and/or anaerobic bioreactors are described in: Moreira et al., Fungal Bioreactors: Applications to White-Rot Fungi, Reviews in Environmental Science and Biotechnology, 2003, Volume 2, Issue 2-4, pp 247-259; Martin, An Optimization Study of a Fungal Bioreactor System for the Treatment of Kraft Mill Effluents and Its Application for the Treatment of TNT-containing Wastewater, in Bioreactors, Auburn University Press, 2000; Palma et al., Use of a fungal bioreactor as a pretreatment or post-treatment step for continuous decolorisation of dyes, 1999, WATER SCIENCE AND TECHNOLOGY; 40, 8; 131-136; US Patent Publication Number US 20100159539 A1, Methods and systems for producing biofuels and bioenergy products from xenobiotic compounds, by Ascon; China Patent Publication 101374773, Method and bioreactor for producing synfuel from carbonaceous material, by Khor; and US Patent Publication Number 20100196994 A1, Fungi cultivation on alcohol fermentation stillage for useful products and energy savings, by van Leeuwen. Bioreactor designs amenable to the growth of the gut fungi described herein may be readily developed utilizing knowledge of the growth conditions optimal for anaerobic gut fungi growth and activity. The invention encompasses the use of any type of bioreactor design, including batch reactors, flow-through reactors, and other bioreactor designs known in the art.

Anaerobic fungi may be grown under substantially anaerobic conditions. Optimal temperatures for the growth and biomass digestive activity of such organisms are in the range of 25-40° C., preferably in the range of 30-40° C. Cultures may be grown without agitation, on soluble or insoluble carbon sources, under a head space of 100% CO₂ gas. Liquid culture medium is preferred for growth and maintenance of the anaerobic fungi.

The culture media used to grow anaerobic fungi may be any known in the art, for example formulations based on those used for the cultivation of rumen bacteria. For the most part, they are complex, non-defined media (pH 6.5-6.8) and contain up to 15% (v/v) clarified rumen fluid, but chemically defined media can be used as well, as described in Marvin-Sikkema, F. D., Lahpor, G. A., Kraak, M. N., Gottschal, J. C., Prins, R. A., Characterization of an anaerobic fungus from llama faeces. J. Gen. Microbiol. 1992, 138, 2235-2241. Although phosphate buffers may be used, a preferred buffer is bicarbonate, with CO₂ in the head space contributing to the buffering system. Chemical reducing agents (e.g., sodium sulfide and/or L-cysteine hydrochloride) are added to culture media pre- or post-autoclaving, after the majority of the O₂ has been removed from culture solutions by boiling and gassing with CO₂. These procedures ensure that low oxygen levels of the culture medium are maintained such that anaerobic fungal growth can be supported.

The methods of the invention encompass various steps. In a first step, biomass is fed into the bioreactor. Any form of cellulosic or lignocellulosic material may be utilized in the bioreactors and methods of the invention. Biomass includes, but is not limited to, herbaceous material, agricultural residues, forestry residues, municipal solid wastes, waste paper, and pulp and paper mill residues. Exemplary feedstocks include corn stover, reed canary grass, switchgrass, Miscanthus, hemp, poplar, willow, sorghum, sugarcane, bamboo, and eucalyptus. Additional feedstocks include byproducts of industrial processes, such as pulping liquor (a byproduct of paper production).

Generally, it is preferred that the biomass feedstocks utilized in the processes of the invention be pre-processed to some degree prior to digestion by the fungal organisms. Preprocessing steps include grinding or other mechanical treatments which break the biomass into small particulates that may be more easily colonized and digested by the fungal organisms. Particulates in the range of 0.1 to 10 mm diameter, for example, may be used. The biomass material is then inoculated with one or more fungal strains selected from the group consisting of Piromyces finnis, Neocallimastix californiae, Anaeromyces robustus, and Neocallimastix sp S4. Exemplary inoculant material includes particulate material which has been colonized by the fungal organism(s). As opposed to free zoospores, using such material as the starting inoculum leads to more vigorous growth and a substantial reduction in culture lag. The inoculated biomass is then allowed time to digest. The precise digestion time will vary depending on (1) the composition and lability of the feedstock; (2) the particulate size of the feedstock; (3) the concentration of inoculant; and (4) the specific bioreactor design. End-products of the digestion may be removed from the bioreactor at set intervals, continuously, or at the end of the digestion process. Removal of ethanol may be accomplished using methods known in the art for separation of ethanol from fermentation broth. Likewise, evacuation of hydrogen gas produced by the digestion may be accomplished utilizing means known in the art.

Working cultures of anaerobic fungi may require frequent sub-culturing in order to retain their viability. Most batch cultures remain viable for 5 or 15 days in media containing soluble (glucose) or particulate (reed canary grass) substrates, respectively. Frequent sub-culturing intervals of 2-7 days with growth on particulate substrates are generally employed to ensure the continued production of viable cultures.

The processes of the invention may optionally further encompass the co-culture of the described anaerobic gut fungi with other organisms to promote optimal production of bioenergy materials. For example, co-culture of anaerobic gut fungi with highly effective anaerobic fermenting organisms, such as yeast or bacterial strains, can result in an optimized system with efficient saccharification and fermentation. Similarly, production of specific end-products can be enabled by co-inoculation with organisms that convert the products of fungal digestion to other materials. For example, production of hydrogen through fungal hydrogenosome activity allows other microbes to convert H₂ to the more energetically favorable methane gas. Co-culture with methane-producing organisms such as Archaea shifts end-product formation towards increased methane and acetate production, with a corresponding decrease in lactate, succinate, hydrogen and ethanol accumulation. In another embodiment, co-culture of anaerobic gut fungi with methanogens can be performed, which can significantly enhance the cellulose hydrolysis activity of the anaerobic fungi.

EXAMPLES
Example 1. Sequence Identification

Fresh fecal material was collected from farm animals. Specimens were isolated from 5×10-fold serial dilutions of fecal matter in anaerobic buffer medium. Each dilution was then supplemented with 30 μg/ml chloramphenicol and grown in anaerobic medium containing milled reed canary grass at 39° C. to enrich for gut fungi. Enrichment cultures that were positive for fungal, but not bacterial or protist, growth after 5-10 days, as determined by the generation of fermentation gases without an increase in culture turbidity, were further subcultured. To generate unique fungal isolates, actively growing enrichment cultures were diluted up to 50-fold in serial dilutions, with each dilution being subcultured for ˜4 days. This isolation procedure was repeated five times until a uniform fungal morphology was observed from each specimen and a unique ITS sequence of the isolate was obtained. Subsequent phylogenetic analysis of this ITS sequence confirmed the presence of a single novel fungal isolate in each culture. The new species were named Neocallimastix californiae, isolated from goat feces; Anaeromyces robustus, isolated from sheep feces; Neocallimastix sp S4, isolated from sheep feces; and Piromyces sp. finn, isolated from horse feces.

To identify novel sequences of interest, each strain was grown in anaerobic medium supplemented with either glucose or milled reed canary grass at 39° C. After 2 days, the biomass was harvested and the total RNA extracted using the Qiagen RNeasy kit. This RNA was then enriched for mRNA by selecting for polyadenylated RNA and made into a strand specific cDNA library (single stranded). This cDNA library was then sequenced using an Illumina HiSeq next generation sequencing platform, using a standard workflow, and the resulting data was assembled into a de novo transcriptome using the TRINITY bioinformatics platform. The assembled sequences were then annotated with the BLAST2GO package by BLAST sequence alignment against known protein sequences and protein domain hidden Markov model (HMM) scans on Interpro of all possible translations of each transcript. The results were analyzed for statistical significance and sequences of interest were noted.
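The phrase "all possible translations of each transcript" can be illustrated with a six-frame translation. The following is a minimal sketch only, assuming the standard genetic code; the function names `translate` and `six_frame_translations` are hypothetical and are not taken from the BLAST2GO or Interpro tooling described above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order (an assumption:
# the source does not state which translation table was used).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(transcript):
    """Return translations for the three forward and three reverse frames."""
    forward = transcript.upper().replace("U", "T")
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for strand, seq in (("+", forward), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCC")
```

Each of the six resulting protein strings would then be a candidate query for alignment or domain scanning.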

Example 2. Identification and Characterization of Scaffoldin Proteins

Genomic analysis of 5 unique anaerobic fungi revealed the presence of 1600 total dockerin domain proteins (DDPs) across genera with diverse functionality, primarily related to plant carbohydrate binding and biomass degradation. These include 15 glycoside hydrolase (GH) families, 5 distinct carbohydrate-binding domains, and other functions implicated in plant cell wall modification and deconstruction including pectin modifying enzymes and expansins. 20.2% of DDPs belong to spore coat protein CotH, which are also present in bacterial cellulosomes and are speculated to also be involved in plant cell wall binding. Conversely, 12.6% represent additional GH activities that are not present in bacterial cellulosomes (GH3, GH6, and GH45). The additional β-glucosidase conferred by GH3 in particular enables fungal cellulosomes to convert cellulose directly to fermentable monosaccharides, whereas Clostridial cellulosomes produce low molecular weight oligosaccharides.

To find structural proteins that mediate assembly of DDPs, we isolated the supernatant and cellulosome fractions from three of these isolates growing on reed canary grass as a sole carbon substrate. Size-exclusion chromatography (SEC) of the cellulosome fraction showed complex formation well within the MDa range, and SDS-PAGE revealed the presence of many glycosylated proteins. Each fraction was subjected to tandem mass spectrometry and peptide sequences were mapped to their respective genomic and transcriptomic databases. Many of the proteins associated with these complexes were identified as GHs and other plant cell wall degrading enzymes. Proteins found in the cellulosome fraction were particularly enriched with NCDDs, indicating modular complex formation. Unexpectedly, all fractions also contained very large uncharacterized proteins (hereafter named ScaA) with molecular weights (MW) of approximately 700 kDa. These ScaA proteins share 32% sequence identity over at least 92% sequence length (E value=0.0) between fungal genera. ScaA orthologs were also detected in the only other sequenced gut fungal genomes, Piromyces sp. E2 and Orpinomyces sp. C1A, though the ortholog detected in O. sp. C1A was incomplete, likely due to fragmented genome assembly.

Sequence analysis of these proteins across all 5 sequenced genomes showed a predicted N-terminal signal sequence followed by a large extracellular repeat-rich domain, and ending with a C-terminal membrane anchor. Some of these proteins also encode predicted choline binding repeats (CBRs), which are known to bind glucan in prokaryotic glucosyltransferases. Thus, one possibility is that CBRs help mediate fungal cellulosome assembly, as many cellulosome proteins are glycosylated. Closer examination of the sequences revealed the presence of a repeating amino acid sequence motif that is conserved among all of these homologues, and that occurs many times throughout these proteins. This motif is 20-30 amino acids long and typically includes a Gly residue immediately followed by two large hydrophobic residues (most often Tyr residues) and two non-consecutive downstream Cys residues.
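The motif description above can be turned into a rough screening heuristic. The sketch below is illustrative only and rests on assumptions not stated in the text: the set of "large hydrophobic" residues (`YFWLIM`), the choice to start the 30-residue window at the Gly residue, and the function name `find_candidate_repeats` are all hypothetical. It is not the hidden Markov model used to search for scaffoldins.

```python
import re

LARGE_HYDROPHOBIC = "YFWLIM"  # assumption: residues treated as large hydrophobic
MAX_MOTIF_LEN = 30            # upper bound of the 20-30 residue motif length

# Gly immediately followed by two large hydrophobic residues.
CORE = re.compile(rf"G[{LARGE_HYDROPHOBIC}]{{2}}")


def find_candidate_repeats(protein):
    """Return (start, window) pairs for windows matching the described motif."""
    hits = []
    for match in CORE.finditer(protein):
        window = protein[match.start():match.start() + MAX_MOTIF_LEN]
        downstream = window[3:]  # residues after the Gly + hydrophobic pair
        cys = [i for i, aa in enumerate(downstream) if aa == "C"]
        # Require two Cys residues separated by at least one other residue.
        if len(cys) >= 2 and cys[-1] - cys[0] >= 2:
            hits.append((match.start(), window))
    return hits


hits = find_candidate_repeats("MKTGYYAAACAAACAAAA")
```

A heuristic like this would over-call and under-call relative to a trained profile, so it is best read as a restatement of the motif rules rather than a detection method.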

Because these proteins are highly represented in secretome and cellulosome fractions from these diverse species of gut fungi, we hypothesized that they share a common role in these systems, possibly in DDP assembly. We hypothesized that these proteins function as scaffolds whereby the repeating motifs act as dockerin-binding cohesins. To investigate this, we recombinantly expressed fragments of the ScaA homologues in Escherichia coli and performed enzyme linked immunosorbent assay (ELISA) using purified dockerin and anti-dockerin chemiluminescent secondary antibody. These results showed a strong dockerin binding signal in wells containing the scaffoldin fragment cells compared to those containing the empty vector control. As an additional control, a phenylalanine substitution to dockerin residue W28, previously identified to be critical for binding, showed significantly reduced binding activity. To determine the binding affinity of dockerin-ScaA interaction, we purified Piromyces ScaA fragments and performed equilibrium analysis by surface plasmon resonance (SPR) against purified fungal dockerin. This analysis revealed that a single dockerin domain interacts with the scaffoldin fragment with an approximate dissociation constant (K_d) of 0.7 μM and a maximum response (R_max) of 80 RU. Additionally, the W28F dockerin mutant showed significantly reduced binding affinity (K_d=2.0 μM, R_max=40 RU). Taken together, these results suggest that fungal scaffoldin proteins likely mediate assembly of DDPs in fungal cellulosomes.
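To make the reported K_d and R_max values concrete, the sketch below evaluates a simple 1:1 steady-state binding isotherm, R_eq = R_max·C/(K_d + C). The use of this particular model is an assumption for illustration, since the text states only that equilibrium analysis was performed; the function name and the example concentrations are hypothetical.

```python
def equilibrium_response(conc_um, kd_um, rmax_ru):
    """Steady-state SPR response (RU) under an assumed 1:1 binding model."""
    return rmax_ru * conc_um / (kd_um + conc_um)


# Parameters reported in the text for the single dockerin and the W28F mutant.
WILD_TYPE = {"kd_um": 0.7, "rmax_ru": 80.0}
W28F = {"kd_um": 2.0, "rmax_ru": 40.0}

# Hypothetical analyte concentrations in micromolar, chosen for illustration.
for conc in (0.1, 0.7, 2.0, 10.0):
    wt = equilibrium_response(conc, **WILD_TYPE)
    mut = equilibrium_response(conc, **W28F)
    print(f"{conc:5.1f} uM  wild type {wt:5.1f} RU  W28F {mut:5.1f} RU")
```

Under this model the response reaches half of R_max when the analyte concentration equals K_d, so the wild-type dockerin would give about 40 RU at 0.7 μM, while the mutant would need 2.0 μM to reach half of its lower maximum.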

Though limited, previous studies have shown that fungal cellulosomes are quite divergent from their bacterial counterparts. For example, NCDDs occur as tandem repeats at the N- and/or C-terminus, with the most common form being a double tandem repeat (i.e. double dockerin) at the C-terminus. Though the functional role of this motif repetition is not known, it has been previously noted that double dockerins bind to native cellulosomes more efficiently than single domains⁵. Thus, we hypothesized that increasing the number of NCDDs from one to two could enhance binding affinity to the scaffoldin fragment. By ELISA, we found that the P. finnis single dockerin domain had higher binding affinity than the double dockerin domain. However, by SPR, the double dockerin had a comparable K_d but a higher R_max of 120 RU, suggesting that the double dockerin is indeed capable of binding to more sites on the ScaA fragment than the single dockerin, which suggests that site specificity may be more subtly encoded in the different dockerin domains and cohesin repeats. Though the minimum sequence that defines a single cohesin remains to be determined, it is clear from our study that fragments of the scaffoldin encoding as few as four repeats are sufficient for dockerin assembly. Additionally, we cannot rule out that additional binding factors (e.g. glycosylation) found in native cellulosomes, which are lacking in this recombinant system, further modulate the fungal dockerin-cohesin interaction.

It has previously been reported that dockerins are capable of binding to cellulosome fractions from other species of gut fungi, which is a marked departure from bacterial cellulosomes. In agreement with this observation, a Piromyces dockerin is capable of binding to intact cellulosome fractions harvested from Anaeromyces and Neocallimastix species. Thus, we tested whether this cross-species binding activity is encoded specifically within ScaA homologues. To test this, we purified single dockerin domains from all three genera of gut fungi and tested their ability to bind to all combinations of ScaA fragments. Indeed, we observed binding for all combinations tested and the binding signal was within standard error for almost all cases. Taken together, these results demonstrate that the fungal scaffoldin system is broadly conserved across the anaerobic fungal phylum, allowing for high interspecies infidelity. Therefore, it is not unreasonable to speculate that in their native environments, for example the dense microbial community of the herbivore rumen, fungal cellulosomes are a composite of enzymes from several species of gut fungi. This is in stark contrast to bacterial cellulosomes, which have high species specificity. This promiscuity may confer a selective advantage of fungi over bacteria in these environments.

In addition to the ScaA orthologs, other scaffoldin-like proteins were also detected through proteomic analysis of cellulosome-associated proteins. We tested three of these scaffoldins for dockerin binding activity and each tested positive over an empty vector control, suggesting that multiple scaffoldins likely exist in fungal cellulosomes. To search for scaffoldin-like proteins more broadly within anaerobic fungi, we developed a Hidden Markov Model (HMM) based on the repeating motif from all 6 scaffoldins biochemically verified to interact with dockerins. We found 95 unique loci in the genomes of A. robustus, P. finnis, and N. californiae, that bear a signal peptide and at least 10 cohesin repeats. Fewer loci (14) were detected in P. sp. E2 and O. sp. C1A due to fragmented genome assemblies. Significantly, no loci were found in prokaryotes (˜2000 genomes) and only 1 or 2 weak hits in other fungi (˜400 genomes), demonstrating this HMM is highly specific to fungal scaffoldins. These results indicate that gut fungi likely produce multiple scaffoldins for cellulosome assembly, and these scaffoldins represent a new family of genes that is unique to the early branching anaerobic fungi.

While fungal scaffoldins and their NCDD ligands are specific to gut fungi, many plant biomass degrading enzymes that encode NCDDs are of bacterial origin, which has been noted previously for a limited subset of enzymes. Indeed, all five gut fungal genomes sequenced to date have large numbers of genes that are more similar to bacterial than to eukaryotic genes (9-13%). We aligned 1600 DDPs of the 5 anaerobic fungal genomes with the 394 fungi currently deposited in JGI MycoCosm (excluding Neocallimastigomycota) and the 1774 bacteria and archaea in JGI Integrated Microbial Genomes (IMG). Of these proteins, 611 aligned better with bacterial than fungal proteins and 158 aligned exclusively with bacteria. Conversely, only 38 aligned exclusively with fungi, and 372 aligned better to fungi than to bacteria. The remaining DDPs aligned equally with IMG and MycoCosm protein (3) or did not align with either (418). To determine whether this bacterial resemblance is the result of inter-kingdom horizontal gene transfer (HGT), we queried the domains that are fused to NCDDs to extract homologous sequences from the same bacterial and fungal genomes. When possible, we built phylogenetic trees of the domain sequences. Out of 35 non-dockerin domains analyzed, 10 (29%) passed our 2 criteria of 1) greater amino acid similarity to bacterial than to fungal sequences and 2) branching with bacterial rather than fungal sequences in its phylogenetic tree with >70% bootstrap support. The list of domains with an HGT signature includes 9 CAZyme domains as well as the spore coat domain. However, this analysis does not inform us as to the direction of any possible HGT events. Subjecting NCDDs to the same analysis showed that there are no similar sequences in IMG at all, suggesting that many DDPs may be fusions between native fungal and horizontally transferred bacterial components. 
Intriguingly, we found 12 fungal-bacterial homolog pairs where the bacterial protein is also a bacterial dockerin-domain protein. However, the sequence similarity between each pair of homologs encompasses only the catalytic domain and does not extend into the respective dockerin domains.
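As a quick consistency check on the alignment categories reported above, the following sketch tallies the counts and recomputes the stated percentage. The dictionary keys are descriptive labels introduced here for illustration; the numbers are those given in the text.

```python
# DDP alignment outcomes as reported in the text (labels are illustrative).
ddp_alignment_counts = {
    "better_to_bacteria": 611,
    "exclusively_bacteria": 158,
    "exclusively_fungi": 38,
    "better_to_fungi": 372,
    "equal_img_and_mycocosm": 3,
    "no_alignment": 418,
}

total_ddps = sum(ddp_alignment_counts.values())

# Share of DDPs with a bacterial-leaning alignment.
bacterial_leaning = (
    ddp_alignment_counts["better_to_bacteria"]
    + ddp_alignment_counts["exclusively_bacteria"]
)
bacterial_share = bacterial_leaning / total_ddps

# Fraction of analyzed non-dockerin domains passing both HGT criteria.
hgt_fraction = 10 / 35

print(total_ddps, f"{bacterial_share:.1%}", f"{hgt_fraction:.0%}")
```

The six categories sum to the 1600 DDPs analyzed, which indicates they are mutually exclusive, and 10 of 35 domains rounds to the 29% stated.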

Over the past several decades, characterization of cellulosomes in fungi has been elusive, with multiple studies suggesting conflicting scaffolding schemes. Here, next-generation sequencing combined with functional proteomics uncovered a new family of genes that likely serve as scaffoldins in the cellulosomes of anaerobic fungi. The evidence for this is 3-fold: (1) Scaffoldins appear among the most represented proteins in supernatant and cellulosome fractions in three diverse isolates of gut fungi, and their amino acid sequences encode hallmarks of a membrane-anchored scaffoldin molecule, including an N-terminal secretion motif, a C-terminal membrane anchor, and a repeating amino acid motif in between. These scaffoldins are encoded in all sequenced Neocallimastigomycota (representing 4 of the 8 genera of gut fungi identified to date) and absent in other fungi. (2) Expression of repeat-containing scaffoldin fragments shows robust interaction with purified dockerins by ELISA. (3) This dockerin-scaffoldin interaction is biologically significant (K_d˜0.7 μM) as measured by SPR, whereas a mutated dockerin derivative showed significantly reduced binding activity. Taken together, the identification of a new dockerin-binding protein scaffold from fungi opens the way for exploitation of this modular interaction for synthetic biology and substrate channeling. Finally, the powerful degradation activity of gut fungi is provided by the diverse functionality of its constituents, with 50 unique protein families of bacterial and fungal origin, and the assembly of these constituents onto scaffoldin molecules into cellulosome-like complexes. Perhaps the most intriguing observation from this study is that fungal dockerins and their scaffoldin ligands have no sequence similarity to their bacterial counterparts. Thus, it is possible that the cellulosome-based strategy for plant cell wall degradation evolved in anaerobic gut fungi independently of bacteria. This suggests that co-localizing plant cell wall degrading enzymes is so effective that nature has evolved it on more than one occasion.

Example 3. Characterizing the Production of Secondary Metabolites in Anaerobic Fungi

In order to better understand the production of secondary metabolites by the anaerobic fungi of the invention, bioinformatics tools and analytical tools were employed. It was determined that numerous genes involved in the production of secondary metabolites were actively transcribed in Piromyces finnis, Anaeromyces robustus, and Neocallimastix californiae. Genes involved in secondary metabolite production included PKS genes as well as genes involved in the synthesis of nonribosomal peptides, terpenes, fatty acids, bacteriocins, and others.

Additionally, LC-MS/MS analysis of fungal products was performed, and a large number of peaks were observed, each corresponding to a composition having unique mass and charge. These results provide promising experimental evidence that secondary metabolites are produced in abundance by these anaerobic fungi.

All patents, patent applications, and publications cited in this specification are herein incorporated by reference to the same extent as if each independent patent, patent application, or publication was specifically and individually indicated to be incorporated by reference. The disclosed embodiments are presented for purposes of illustration and not limitation. While the invention has been described with reference to the described embodiments thereof, it will be appreciated by those of skill in the art that modifications can be made to the structure and elements of the invention without departing from the spirit and scope of the invention as a whole.

US Referenced Citations

Number	Name	Date	Kind
8361752	Kohda et al.	Jan 2013	B2
10717768	O'Malley et al.	Jul 2020	B2
11021524	O'Malley et al.	Jun 2021	B2
20110053195	Bauer	Mar 2011	A1
20110306105	Chen et al.	Dec 2011	A1
20120083012	Chang et al.	Apr 2012	A1
20180362597	O'Malley et al.	Dec 2018	A1
20200299338	O'Malley et al.	Sep 2020	A1

Foreign Referenced Citations

Number	Date	Country
2016366227	Jun 2018	AU
2016366227	Nov 2022	AU
2410061	Jan 2012	EP
3387121	Oct 2018	EP
2015-019346	Feb 2015	WO
2017100429	Jun 2017	WO

Provisional Applications

Number	Date	Country
62296064	Feb 2016	US
62265397	Dec 2015	US

Continuations

	Number	Date	Country
Parent	16898421	Jun 2020	US
Child	17321941		US
Parent	15782016		US
Child	16898421		US

Non-Patent Literature Citations (11)

Entry
Rincon, Marco T. et al., “A Novel Cell Surface-Anchored Cellulose-Binding Protein Encoded by the sca Gene Cluster of Ruminococcus flavefaciens”, Journal of Bacteriology, [E-pub] Apr. 27, 2007, vol. 189, No. 13, pp. 4774-4783. See abstract.
Database UniProt [Online], Sep. 16, 2015 (Sep. 16, 2015), “SubName: Full=Zinc finger, MIZ-type {ECO:0000313|EMBL:CRL29765.1};”, XP002791431, retrieved from EBI accession No. Uniprot:A0A0G4PTB7 Database accession No. A0A0G4PTB7, 1 pg.
Database UniProt [Online], Apr. 18, 2006 (Apr. 18, 2006), “SubName: Full=Zinc finger LSD1 subclass family protein {ECO:0000313|EMBL:EAR95065.2};”, XP002791429, retrieved from EBI accession No. Uniprot:Q23F40 Database accession No. Q23F40, 2 pgs.
Database UniProt [Online], Oct. 29, 2014 (Oct. 29, 2014), “SubName: Full=Zinc finger protein, putative {ECO:0000313|EMBL:CDS45340.1};”, XP002791430, retrieved from EBI accession No. Uniprot:A0A077X808 Database accession No. A0A077X808, 2 pgs.
Extended European Search Report for European Application No. 16873834.2, Search Completed May 17, 2019, dated May 28, 2019, 12 Pgs.
Freelove et al., “A Novel Carbohydrate-binding Protein Is a Component of the Plant Cell Wall-degrading Complex of Piromyces equi”, The Journal of Biological Chemistry, 2001, vol. 276, No. 46, Issue of Nov. 16, p. 43010-43017.
Nagy, “Characterization of a Double Dockerin from the Cellulosome of the Anaerobic Fungus Piromyces equi”, Journal of Molecular Biology, vol. 373, No. 3, Oct. 26, 2007, pp. 612-622.
Rachel et al., “Cohesin-Dockerin Microarray: Diverse Specificities Between Two Complementary Families of Interacting Protein Modules”, Proteomics, vol. 8, No. 5, Mar. 1, 2008, pp. 968-979.
Solomon et al., “Early-branching gut fungi possess a large, comprehensive array of biomass-degrading enzymes”, Science, [E-Pub] Feb. 18, 2016, vol. 351, No. 6278, 1192-195 pp. 1-9.
International Preliminary Report on Patentability for International Application PCT/US2016/065579, Report dated Jun. 12, 2018, dated Jun. 21, 2018, 14 Pgs.
International Search Report and Written Opinion for International Application PCT/US2016/065579, search completed Apr. 25, 2017, dated Apr. 26, 2017, 19 Pgs.