Assigning emm Types and Subtypes

New parameters for assigning types and subtypes to emm sequence data:

It has become increasingly clear that the previous type definition for emm genes of S. pyogenes (and emm genes from S. dysgalactiae subsp. equisimilis) is not optimal. That definition was based upon > 95% sequence identity over the first 160 bases of sequence obtained with primer 1 or emmseq2, allowing for one interruption of the reading frame of no more than 7 codons. Although this formula usually works, it relies upon an undefined starting point for all readable sequences. A segment of DNA with defined boundaries must be used instead. Since the sequence encoding the N terminus is widely believed to determine the M serotype, it is most logical to base the emm type upon a defined segment of this sequence. Of the 160 bases used for the previous type definition, an average of about 72 bases (24 codons) encoded a portion of the relatively conserved signal peptide, leaving roughly 88 bases of mature-protein sequence. Thus, a type definition based upon the 90 bases encoding the N-terminal 30 residues of the processed M protein would be predicted to be most consistent with the previous typing scheme. Although type definition will rely upon these 90 bases, subtypes will continue to be assigned according to exact 180-base sequences encoding the 10 C-terminal residues of the signal sequence and the first 50 residues of the mature M protein.
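The segment arithmetic above can be sketched in Python. This is a minimal illustration only, assuming the 180-base subtype region is laid out as 30 bases (10 signal-sequence codons) followed by 150 bases (50 mature-protein codons). The function name and the placeholder sequence are hypothetical and are not part of any CDC tool.

```python
SIGNAL_CODONS = 10   # C-terminal residues of the signal sequence
MATURE_CODONS = 50   # N-terminal residues of the mature M protein
TYPE_CODONS = 30     # mature-protein residues used for type definition

SUBTYPE_LEN = 3 * (SIGNAL_CODONS + MATURE_CODONS)  # 180 bases
TYPE_START = 3 * SIGNAL_CODONS                      # 0-based offset 30
TYPE_LEN = 3 * TYPE_CODONS                          # 90 bases


def type_region(subtype_region: str) -> str:
    """Return the 90 type-defining bases from a 180-base subtype region.

    Assumes the region starts at the first of the 10 signal-sequence codons.
    """
    if len(subtype_region) != SUBTYPE_LEN:
        raise ValueError(
            f"expected {SUBTYPE_LEN} bases, got {len(subtype_region)}"
        )
    return subtype_region[TYPE_START:TYPE_START + TYPE_LEN]


# Placeholder sequence (not a real emm sequence): 30 'A' then 150 'C'.
example = "A" * 30 + "C" * 150
print(len(type_region(example)))  # 90
```

Under this layout, the type-defining segment corresponds to bases 31-120 of the 180-base subtype entry.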

New types will now be identified by the curator of this site (VSrinivasan@cdc.gov) on the basis of sharing less than 92% sequence identity with the emm type reference strain over the first 90 bases encoding the deduced processed M protein. The comparison uses the SSEARCH program in the Wisconsin Package version 10.3, with bases 1-90 of the emm reference strain sequences (identified as subtype 0, e.g. emm1.0) compared to the full-length 150-base subtype-determining region of the query sequence. As before, a single interruption of the reference sequence reading frame (through frame shift, in-frame deletion, or insertion) of no more than 7 codons is tolerated and is not scored for mismatches. However, for each codon involved in such an interruption, a penalty of 0.5% is subtracted from the overall % identity score.
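The penalty rule can be illustrated with a small calculation. The sketch below is not the SSEARCH procedure itself; it only applies the stated arithmetic to match counts that an alignment would supply. The function names are hypothetical, and two points are assumptions rather than statements from the text: that identity is computed over the aligned bases outside the interruption, and that a query is a candidate new type when its best score against the reference strains falls below 92%.

```python
NEW_TYPE_THRESHOLD = 92.0      # percent identity, from the text
PENALTY_PER_CODON = 0.5        # percent subtracted per interrupted codon
MAX_INTERRUPTION_CODONS = 7    # largest tolerated interruption


def adjusted_identity(matches: int, compared: int,
                      interruption_codons: int = 0) -> float:
    """Percent identity minus the interruption penalty.

    `compared` counts aligned bases outside the interruption (assumption),
    because the interruption itself is not scored for mismatches.
    """
    if not 0 <= interruption_codons <= MAX_INTERRUPTION_CODONS:
        raise ValueError("interruption exceeds the tolerated 7 codons")
    if compared <= 0 or not 0 <= matches <= compared:
        raise ValueError("need compared > 0 and 0 <= matches <= compared")
    return 100.0 * matches / compared - PENALTY_PER_CODON * interruption_codons


def is_candidate_new_type(best_score: float) -> bool:
    """True if the best score against the reference strains is below 92%."""
    return best_score < NEW_TYPE_THRESHOLD


# 84 of 90 bases match and the query carries a 3-codon in-frame insertion:
# 93.33% - 3 * 0.5% = 91.83%, which falls below the 92% threshold.
score = adjusted_identity(matches=84, compared=90, interruption_codons=3)
print(round(score, 2), is_candidate_new_type(score))  # 91.83 True
```

Without the interruption, the same 84/90 match would score 93.33% and stay within the existing type, which shows how the penalty can tip a borderline sequence.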

Instructions for assigning known types and subtypes:

A. Use at least the first 220 bases of sequence (edited for accuracy) obtained with primer 1 or emmseq2 to query the type-specific DNA sequence database. For the majority of queries, one will obtain an exact 180/180 match to a 180-base entry, which can be assigned to the sequence (e.g. a match to emm4.4 means type emm4, subtype emm4.4).
B. This database of trimmed 180-base entries corresponds to the first 50 residues of the mature M protein and the adjacent 10 C-terminal residues of the signal sequence. If a perfect 180/180 match to an entry is obtained with the type-specific BLAST option, the subtype has been correctly identified and no additional steps are required.
If you do not obtain a subtype assignment in steps A and B, please submit the sequence trace to the curator of the database (VSrinivasan@cdc.gov). After verification, it will be assigned as a new type and/or subtype and added to the CDC emm sequence database. New types are assigned as described above through comparisons to the emm type reference strains. Within the downloadable sequence file I will add whatever strain, epidemiologic, and clinical information you care to share, and acknowledge you and your institution for the contribution of the sequence and information.
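Steps A and B amount to looking for an exact 180/180 match between the query and a database entry. The sketch below is a simplified stand-in for the type-specific BLAST query, assuming the database is available locally as a mapping from subtype name to 180-base entry. The function names, the toy database, and its sequences are hypothetical placeholders, not real emm entries.

```python
from typing import Optional

SUBTYPE_LEN = 180     # length of each trimmed database entry
MIN_QUERY_LEN = 220   # minimum edited query length from step A


def assign_subtype(query: str, database: dict[str, str]) -> Optional[str]:
    """Return the subtype whose 180-base entry occurs exactly in the query.

    Returns None when no perfect match exists, in which case the trace
    would be submitted to the curator.
    """
    query = query.upper()
    if len(query) < MIN_QUERY_LEN:
        raise ValueError(f"query should contain at least {MIN_QUERY_LEN} bases")
    for name, entry in database.items():
        if len(entry) == SUBTYPE_LEN and entry.upper() in query:
            return name
    return None


def emm_type(subtype: str) -> str:
    """Derive the type from a subtype name, e.g. 'emm4.4' -> 'emm4'."""
    return subtype.split(".", 1)[0]


# Toy database with synthetic sequences (placeholders, not real emm data).
TOY_DB = {
    "toyA.0": "ACGT" * 45,    # 180 bases
    "toyB.1": "TTGCA" * 36,   # 180 bases
}

# A 240-base query: flanking sequence around an exact copy of toyB.1.
query = "GG" * 20 + TOY_DB["toyB.1"] + "AA" * 10
subtype = assign_subtype(query, TOY_DB)
print(subtype, emm_type(subtype))  # toyB.1 toyB
```

A query with even one mismatch inside the 180-base window returns None here, which mirrors the requirement for a perfect match before a subtype is assigned.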

Any of the following information (the list is not exhaustive) is also greatly appreciated if you care to share it:

Your name and institution.
Isolate designation.
Country where isolated.
Year isolated.
Group carbohydrate (A, C, G, etc.).
Specimen (skin lesion, blood, throat, etc.).
Clinical manifestation (if any).
Multilocus sequence type.
sof positive or negative.
Opacity factor positive or negative.
Bacitracin sensitivity.
T antigen type.
spe gene profile.
Other virulence determinants.
Antibiotic resistance phenotypes/genotypes.
GenBank designation (if you have it).

Velusamy Srinivasan, Ph.D.
Streptococcus Laboratory
NCIRD/DBD/RDB
Centers for Disease Control and Prevention
1600 Clifton Rd., NE, MS-C02
Atlanta, GA 30333, USA
VSRINIVASAN@CDC.GOV

Page last reviewed: December 19, 2007
Page last updated: December 19, 2007
Content source:
- National Center for Immunization and Respiratory Diseases