Fig. 1.
Identification and mapping of the MMSET gene on 4p16.3. (A) The diagram, drawn to scale, represents the distal 200 kb of the 2-Mb cosmid contig spanning the Huntington’s disease region, with the telomeric side of the contig, containing the FGFR3 gene, to the left. Vertical lines represent the distribution of the MMSET exons across about 120 kb of the cosmid contig. The MMSET gene includes at least 24 exons that are transcribed from the telomeric to the centromeric end. The solid arrowhead indicates the 3′ end of an EST that is localized 530 bp centromeric to the end of the last MMSET exon but is transcribed in the opposite orientation. The solid arrows indicate the positions of the previously cloned translocation breakpoints for the KMS11, UTMC2, H929, JIM3, and OPM2 MM cell lines and for the tumor sample PCL1. The LP-1 breakpoint has been mapped by sequence analysis of a hybrid transcript splicing to exon 4, and the MM5.1 breakpoint between exons 2b and 3. The KMS11 t(4;14) translocation breakpoint is localized at the 5′ end of MMSET exon 1a, about 15 kb from exon 1; the UTMC2 breakpoint is localized between exons 1a and 1, about 2.5 kb from exon 1; the PCL1 breakpoint is in the intron between exons 1 and 2a; the LP-1, H929, and JIM3 breakpoints are between exons 3 and 4; and the OPM2 breakpoint is between exons 4 and 5.
(B) The MMSET transcription units. MMSET is expressed as transcripts that polyadenylate either in exon 11 (type I) or in exon 24 (type II) as a result of alternative splicing from exon 10 to exon 11 (top) or from exon 10 to exon 12 (bottom). The positions of the polyadenylation signals (PA) are indicated by black circles. Because of the heterogeneity of the 5′ untranslated region (5′UT), we chose to start the numbering of MMSET from the first nucleotide in exon 3. The type I transcript contains an ORF of 1911 bp, encoding a 647-aa protein with the first methionine at nt 30, in exon 3, and the stop codon at nt 1971, after the first 20 aa of exon 11. The type II transcript contains a longer ORF of 4094 bp, encoding a 1365-aa protein that is identical to the shorter protein up to the splicing site in exon 10, as indicated by the light gray shading. The two proteins differ after exon 10, with the portion unique to the short protein shaded in black and the portion unique to the long protein shaded in dark gray. The middle panel shows the characteristic domains of MMSET. A putative nuclear localization signal (NLS, indicated by a thick vertical line) is common to the two proteins, as are the HMG domain (dark gray rectangle) and the hath domain (white rectangle). The long protein is characterized by four PHD fingers (black rectangles), a SET domain (light gray rectangle), an additional hath domain, and another putative NLS. The thick horizontal lines represent the PCR-amplified probes used in the Northern blot assay, with the primer numbers indicated. Numbers within boxes indicate two of the bacteriophage clones isolated from a testis cDNA library. To obtain a complete sequence of the ORFs of MMSET, cDNA fragments were amplified by PCR as indicated by the dashed horizontal lines. 1112 (gray rectangle) is a clone obtained by 3′ RACE and sequenced with SP6 and T7 primers.
(C) Amino acid sequence of the MMSET long and short proteins. The numbers above the amino acids indicate the first residue of the corresponding exon. The first methionines in exons 3, 4, and 6 are in bold. The two hath domains are shown in white boxes; the two putative nuclear localization signals (NLS) are underlined; the HMG domain is shown in a dark gray box; the four PHD fingers are shown in black boxes, with the consensus histidine residues in bold; and the SET domain is shown in a dark gray box.
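The transcript coordinates quoted in (B) can be related to the protein lengths with simple codon arithmetic. The following is a minimal sketch, not part of the original analysis: the helper names `coding_length` and `stop_codon_start` are hypothetical, and it assumes 1-based nucleotide numbering, three nucleotides per residue, and a stop codon that immediately follows the last residue. Under these assumptions, a 647-aa protein starting at nt 30 places the stop codon at nt 1971, as stated in the caption; the quoted ORF lengths themselves may follow a different counting convention, which this sketch does not attempt to resolve.

```python
def coding_length(n_residues: int, include_stop: bool = False) -> int:
    """Nucleotides needed to encode n_residues amino acids.

    Assumes three nucleotides per residue; optionally adds the stop codon.
    """
    return 3 * (n_residues + (1 if include_stop else 0))


def stop_codon_start(first_met_nt: int, n_residues: int) -> int:
    """1-based position of the first nucleotide of the stop codon.

    Assumes translation starts at first_met_nt and runs uninterrupted.
    """
    return first_met_nt + coding_length(n_residues)


# Values taken from the caption: protein lengths and the first methionine.
SHORT_PROTEIN_AA = 647   # type I product
LONG_PROTEIN_AA = 1365   # type II product
FIRST_MET_NT = 30        # numbering starts at the first nucleotide of exon 3

if __name__ == "__main__":
    print("type I coding nucleotides:", coding_length(SHORT_PROTEIN_AA))
    print("type II coding nucleotides:", coding_length(LONG_PROTEIN_AA))
    print("type I stop codon starts at nt:",
          stop_codon_start(FIRST_MET_NT, SHORT_PROTEIN_AA))
```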