Application of CurlySMILES to the encoding of polymer systems
Presented on August 16, 2015, at the ACS 250th National Meeting in Boston
Division: Computers in Chemistry
Session: Accelerated Discovery of Chemical Compounds: Design New Polymers & Inorganic Materials from Integration of Polymer Science, Materials Science & Informatics
Axel Drefahl
Axeleratio, Reno, Nevada
www.axeleratio.com/axel/acs/boston2015/CurlySMILESpolymers.pdf
Copyright © 2015 Axel Drefahl
What is CurlySMILES?
CurlySMILES is:
● a chemical language to capture, process and share nanostructures, based on molecular constitution, connectivity and arrangement;
● a line notation system integrating SMILES with atom- and molecule-anchored annotations, inserted via curly braces: {…};
● customizable by annotation: encoding of polymers, complexes, multi-phase systems, ...;
● available as a suite of Python 3 modules, including a notation parser and unique notation generator.
Overview
● Current state of polymer informatics
● Brief introduction to CurlySMILES
● Encoding of Structural Repeat Units (SRUs)
● Encoding of single-strand polymers
● Encoding of multi-strand polymers
● Encoding of copolymers and miscellaneous polymer systems
● CurlySMILES software/task-specific integration
● Perspective: virtual polymer chemistry
Cheminformatics (sub)domains
Established informatics (domain: capturing & processing)
● “Small molecules”: molecular graph
● Crystalline solids: unit cell, space group
● Peptides, DNAs, ...: fragment sequence
Evolving informatics (domain: capturing & processing)
● Polymer systems: structural repeat unit (SRU)
● Nanomaterials: nano-object (sphere, rod, ...)
● Material classes: variable groups (R, X, Y, Z, ...)
● Composites & design: metalevel components
Polymer informatics: approaches and tools
● IUPAC nomenclature & seniority rules (head-tail selection)
● S-group (superatom): SRU with crossing bonds and brackets (common representation, MDL, MarvinSketch),
● ThermoML with polymer block to specify compounds,
● Polymer Markup Language (PML),
● Polymer Informatics Knowledge System (PIKS) - PolyInfo database | “walled gardens” of polymer information,
● InChI polymer project (awaits implementation),
● CurlySMILES project: actively designs a human-machine interface for nanoarchitectures, including polymers, and develops open-source Python code.
From SMILES to CurlySMILES
SMILES: Simplified Molecular Input Line Entry System
Published by David Weininger in 1988
doi: 10.1021/ci00057a005
CurlySMILES: Curly-braces enhanced Smart Material Input Line Entry Specification
Published by Axel Drefahl in 2011
doi: 10.1186/1758-2946-3-1
CurlySMILES Motivation
● Chemical nomenclature and encoding languages typically employ idealized representations, while minor structural irregularities and impurities are ignored. CurlySMILES encoding enables their insertion via annotation, if desired.
● A molecular-graph-derived notation is often taken to represent molecule and substance interchangeably. CurlySMILES employs molecule multipliers and allows for phase distinction, for example, by using state and shape annotations such as lq, tf, am, cr, np, ...
● Variability of detail: stoichiometric formula notation (SFN)
● Encoding of molecular arrangements: hydrogen-bonded molecules, complexes, macromolecules and other nanoassemblies.
Format of curly-enclosed annotations in CurlySMILES
{AMk1=v1;...;kn=vn}
AM is a one-char or two-char annotation marker; a two-char AM may be followed by an annotation dictionary, a semicolon-separated list of key/value (ki/vi) pairs. Keys are predefined, but extensible by customization ($ prefix).
Example: n-octanethiol functionalized gold nanoparticle dispersed in toluene (●-SCH2(CH2)6CH3 in toluene)
S{-|c=[Au]{np}}CCCCCCCC{dpc=Cc1ccccc1}
AMs: -| for surface-attached, dp for dispersed, np for nanoparticle.
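The annotation format described above can be illustrated with a small parser. This is a minimal sketch and not the CurlySMILES Python modules: the function names `split_top_level` and `parse_annotation` are hypothetical, and the rule "the first two characters are the marker whenever a dictionary follows" is taken from the slide text rather than from the reference implementation.

```python
def split_top_level(text, sep):
    """Split text on sep, ignoring separators nested inside {...} or [...]."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch in "{[":
            depth += 1
        elif ch in "}]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_annotation(annotation):
    """Return (marker, dictionary) for one curly-enclosed annotation."""
    if not (annotation.startswith("{") and annotation.endswith("}")):
        raise ValueError("annotation must be enclosed in curly braces")
    body = annotation[1:-1]
    # Assumption from the slide: a dictionary only follows a two-char marker,
    # so a body of one or two characters is a bare marker.
    marker, rest = body[:2], body[2:]
    entries = {}
    if rest:
        for pair in split_top_level(rest, ";"):
            # Split at the first '=' only, because values may contain '='
            # (for example a SMILES double bond).
            key, _, value = pair.partition("=")
            entries[key] = value
    return marker, entries
```

For the gold nanoparticle example, `parse_annotation("{-|c=[Au]{np}}")` yields the marker `-|` and the dictionary `{"c": "[Au]{np}"}`; the nested `{np}` stays part of the value.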
Annotated Molecular Graph
Example: (Z)-but-1-ene-1,4-diyl substructure
CurlySMILES: C{-}=C{Z}CC{-}
Atom-anchored annotations:
Structural unit annotation (pendent single bond): {-}
Stereodescriptive annotation: {Z}
Poly[(Z)-but-1-ene-1,4-diyl]
CurlySMILES: C{-}=C{Z}CC{+n}
Atom-anchored annotations:
Structural unit annotation: {-} at head node
Stereodescriptive annotation: {Z}
Operational annotation: {+n} at tail node
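To make the idea of atom-anchored annotations concrete, the following sketch separates a notation into its plain SMILES part and a list of annotations with the node each one is anchored to. It is an illustration only, not the CurlySMILES parser. The names `ATOM`, `matching_brace` and `anchored_annotations` are hypothetical, and nodes are assumed to be numbered from 1 in order of appearance, which agrees with the inc/ich values used in the copolymer examples of this presentation.

```python
import re

# Atom tokens: bracket atoms, two-letter halogens, organic-subset and aromatic symbols.
ATOM = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def matching_brace(text, start):
    """Return the index just past the '}' that closes the '{' at text[start]."""
    depth = 0
    for pos in range(start, len(text)):
        if text[pos] == "{":
            depth += 1
        elif text[pos] == "}":
            depth -= 1
            if depth == 0:
                return pos + 1
    raise ValueError("unbalanced curly braces")


def anchored_annotations(notation):
    """Return (plain SMILES, [(node index, annotation), ...])."""
    plain, found = [], []
    node = 0  # 1-based index of the most recent atom; 0 means "before any atom"
    pos = 0
    while pos < len(notation):
        if notation[pos] == "{":
            end = matching_brace(notation, pos)
            found.append((node, notation[pos:end]))
            pos = end
            continue
        atom = ATOM.match(notation, pos)
        if atom:
            node += 1
            plain.append(atom.group())
            pos = atom.end()
        else:
            plain.append(notation[pos])  # bond, branch or ring-closure symbol
            pos += 1
    return "".join(plain), found
```

Applied to `C{-}=C{Z}CC{+n}`, the sketch returns the plain SMILES `C=CCC` together with `{-}` at node 1, `{Z}` at node 2 and `{+n}` at node 4.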
Does CurlySMILES encode macromolecules or polymers?
Answer: both (user choice). CurlySMILES comes with a rich annotation dictionary to encode chain length variation and phases.
A macromolecule is a single molecule. The term “polymer” can mean “macromolecule” or a “substance” composed thereof, typically with a “degree of polymerization” (DOP) range.
An oligomer or a macromolecule of a specific length is encoded based on the chain graph, i.e. the SRU graph, using annotation dictionary key n: {+nn=10} for ten occurrences of the SRU.
A polymer is encoded by leaving out n (generic polymer). The key dpr may specify a DOP range: {+ndpr=gt250}. AMs such as am (amorphous) or cr (crystalline) indicate a particular polymer phase.
A polymer system is encoded by additional annotations specifying, for example, impurities, additives and solvents.
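The distinction between a macromolecule of specific length and a generic polymer comes down to which keys follow the +n marker. The sketch below builds the three tail annotations mentioned above. The function names are hypothetical, the helper only works when the tail node is the last atom written in the SRU string, and it refuses to combine n with dpr because the slide does not show that case.

```python
def tail_annotation(n=None, dpr=None):
    """Build a {+n...} tail annotation.

    n   -- number of SRU occurrences (oligomer or macromolecule of specific length)
    dpr -- degree-of-polymerization range such as "gt250" (generic polymer)
    """
    if n is not None and dpr is not None:
        # Assumption of this sketch: the two keys are treated as alternatives.
        raise ValueError("give either n or dpr, not both")
    if n is not None:
        return "{+nn=" + str(n) + "}"
    if dpr is not None:
        return "{+ndpr=" + str(dpr) + "}"
    return "{+n}"  # generic polymer without length information


def encode_single_strand(sru_ending_at_tail, n=None, dpr=None):
    """Append the tail annotation to an SRU whose last written atom is the tail node."""
    return sru_ending_at_tail + tail_annotation(n=n, dpr=dpr)
```

With the poly(oxymethylene) SRU `O{-}C`, the three cases give `O{-}C{+nn=10}`, `O{-}C{+ndpr=gt250}` and `O{-}C{+n}`.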
Tail node annotations to formally construct polymers
{+n} anchored at tail node of divalent SRU to build single-strand polymer via head-tail single-bond connection.
{+r} anchored at tail node of SRU to build single-strand macrocycle (last tail node connects first head node).
{+m} anchored at tail node of non-single-bond or multivalent SRU to build multibond/multi-strand polymer via specified head-tail connection using key ich to provide index of corresponding head node.
{+s} anchored at tail node of the last (right-most) SRU in a copolymer sequence to provide copolymer details; for example, a copolymer qualifier via key cpq to specify an alternating, block or random sequence.
CurlySMILES notations of some common single-strand homopolymers
Structure-Based Name          Structural Formula    CurlySMILES Notation
Poly(oxymethylene)            -[OCH2]n-             O{-}C{+n}
Poly(iminoethylene)           -[NHCH2CH2]n-         N{-}CC{+n}
Poly(1-hydroxyethylene)       -[CH(OH)CH2]n-        OC{-}C{+n}
Poly(1-cyanoethylene)         -[CH(CN)CH2]n-        N#CC{-}C{+n}
Poly(1,1-difluoroethylene)    -[CF2CH2]n-           FC{-}(F)C{+n}
Poly(1-phenylethylene)        -[CH(Ph)CH2]n-        C{-}(c1ccccc1)C{+n}
Poly(oxy-1,4-phenylene)       -[O-paraPh]n-         O{-}c1ccc{+n}cc1
Poly(methylene)               -[CH2]n-              C{-}{+n}
Poly(difluoromethylene)       -[CF2]n-              FC{-}{+n}F
Polydispersity characterization
With the exception of the dimensionless pdi, units are kg/mol.
Example: {+npMn=89.2} to encode a single-strand polymer with a number-average molar mass of 89.2 kg/mol
Key   Symbol   Meaning                         ThermoML tag name
pMn   Mn       Number-average molar mass       nNumberAvgMolWt
pMm   Mm       Mass-average molar mass         nWeightAvgMolWt
pMz   Mz       z-Average molar mass            nZAvgMolWt
pMv   Mv       Viscosity-average molar mass    nViscosityAvgMolWt
pMp   Mp       Peak molar mass                 nPeakAvgMolWt (?)
pdi   Mm/Mn    Polydispersity index            nPolydispersityIndex
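As a worked illustration of the keys pMn, pMm and pdi, the sketch below computes the number-average and mass-average molar mass from a discrete chain-length distribution using the standard definitions Mn = Σ Ni·Mi / Σ Ni and Mm = Σ Ni·Mi² / Σ Ni·Mi, then formats the result as a tail annotation. The function names and the input data are hypothetical. Placing several keys in one annotation follows the semicolon-separated dictionary format, but is an assumption here since the slide shows only a single key.

```python
def molar_mass_averages(fractions):
    """Return (Mn, Mm, pdi) for a list of (chain count, molar mass in kg/mol)."""
    total_n = sum(count for count, _ in fractions)
    total_nm = sum(count * mass for count, mass in fractions)
    total_nm2 = sum(count * mass * mass for count, mass in fractions)
    mn = total_nm / total_n      # number-average molar mass
    mm = total_nm2 / total_nm    # mass-average molar mass
    return mn, mm, mm / mn       # pdi = Mm/Mn is dimensionless


def polydispersity_annotation(mn, mm):
    """Format Mn, Mm (kg/mol) and pdi as a single-strand tail annotation."""
    pdi = mm / mn
    return "{+n" + f"pMn={mn:.1f};pMm={mm:.1f};pdi={pdi:.2f}" + "}"
```

For two equally populated fractions of 50 and 100 kg/mol, Mn is 75 kg/mol and Mm is about 83.3 kg/mol, so the annotation reads `{+npMn=75.0;pMm=83.3;pdi=1.11}`.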
Anionic homopolymer with monoatomic cations
Example: poly(sodium 1-carboxylatoethylene)
CurlySMILES: O=C([O-]{+Cc=[Na+]})C{-}C{+n}
The operational annotation marker +C is used to include [Na+] as counterion to [O-]. [Na+] is part of the repeat unit.
Homopolymer with terminating groups at head and tail
Example: poly(ethylene terephthalate) by esterification of terephthalic acid with ethylene glycol
[H]O{-}CCOC(=O)c1ccc(cc1)C{+ninc=2-15;ich=2}(=O)OCCO
Nodes 2 to 15 are parts of SRU. Node 1 makes the head terminus and nodes 16-20 belong to tail end group.
Cyclic polymers or oligomers
Example: cyclic poly(silaether)
[Si]{-}(C)(C)[Si](C)(C)O{+rn=24}
Shortcut for a long SMILES notation:
[Si]1(C)(C)[Si](C)(C)O...[Si](C)(C)[Si](C)(C)O1
Such cyclic poly(silaether)s are obtained, for example, as by-products while making their linear homologs by ring-opening polymerization of octamethyl-1,4-dioxatetrasilacyclohexane [doi: 10.1021/ma00086a048].
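The "shortcut" character of the {+r} annotation can be shown by expanding it into the long SMILES string. The sketch below is a hypothetical helper, not part of the CurlySMILES modules, and it only covers a narrow case: the head node is the first atom written, the tail node is the last atom written, annotations are not nested, and the SRU itself contains no ring-closure digits.

```python
import re

# Atom tokens: bracket atoms, two-letter halogens, organic-subset and aromatic symbols.
ATOM = re.compile(r"\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def expand_macrocycle(notation):
    """Expand '<SRU>{+rn=N}' into an explicit ring-closed SMILES string."""
    count = re.search(r"\{\+rn=(\d+)\}$", notation)
    if count is None:
        raise ValueError("expected a trailing {+rn=<integer>} annotation")
    n = int(count.group(1))
    if n < 1:
        raise ValueError("n must be at least 1")
    unit = re.sub(r"\{[^{}]*\}", "", notation)  # drop all (non-nested) annotations
    head = ATOM.match(unit)
    # Conservative scope check: any digit or '%' is rejected, even inside brackets.
    if head is None or unit.endswith(")") or re.search(r"[\d%]", unit):
        raise ValueError("SRU is outside the scope of this sketch")
    # Open ring bond 1 at the first head node and close it at the last tail node.
    first = head.group() + "1" + unit[head.end():]
    return first + unit * (n - 1) + "1"
```

For n=2 the result is `[Si]1(C)(C)[Si](C)(C)O[Si](C)(C)[Si](C)(C)O1`; for the n=24 example the same pattern repeats with 48 silicon atoms in the ring.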
Surface-grafted functional oligomer
Example: polyacrylamide brush grown on silicon
N{-|c=[Si]}C(=O)c1ccc(cc1)CCC{-} \
C{+ninc=12-16;ich=12}C(=O)N
Group environment annotation -| for bond to substrate
Growth of such polyacrylamide brushes on a silicon wafer is studied to understand how to reduce or prevent microbial adhesion on surfaces by chemical surface modification [doi: 10.1021/la063531v].
Regular double-strand polymers: chain of formally fused cycloalkane rings
Example: poly(butane-1,4:3,2-tetrayl)
CurlySMILES notation:
C{-}C{+mich=1}C{+mich=4}C{-}
two head nodes: C{-}, two tail nodes C{+m}
For IUPAC nomenclature of this polymer see A Brief Guide to Polymer Nomenclature.
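The role of key ich in a double-strand notation can be made explicit by listing which tail node connects to which head node. The following sketch is illustrative only: `strand_connections` and `inter_unit_bonds` are hypothetical names, nodes are assumed to be numbered in order of appearance, annotations are assumed not to be nested, and the reading "the tail node of unit k bonds to the indexed head node of unit k+1" is an interpretation of the head-tail connection described in the slides.

```python
import re

# One token is either a non-nested annotation or an atom symbol.
TOKEN = re.compile(r"(\{[^{}]*\})|(\[[^\]]*\]|Br|Cl|[BCNOPSFI]|[bcnops])")


def strand_connections(notation):
    """Return [(tail node, head node), ...] for every {+m...} annotation."""
    node = 0
    links = []
    for match in TOKEN.finditer(notation):
        annotation, atom = match.groups()
        if atom:
            node += 1
        elif annotation.startswith("{+m"):
            pairs = (pair.split("=", 1) for pair in annotation[3:-1].split(";") if pair)
            entries = dict(pairs)
            links.append((node, int(entries["ich"])))
    return links


def inter_unit_bonds(links, units):
    """Bonds between consecutive SRU copies as ((unit, node), (unit, node)) pairs."""
    return [((k, tail), (k + 1, head))
            for k in range(1, units)
            for tail, head in links]
```

For `C{-}C{+mich=1}C{+mich=4}C{-}` the connections are (2, 1) and (3, 4): node 2 of one unit bonds to node 1 of the next, and node 3 bonds to node 4, which closes a six-membered ring across each pair of neighboring units. The silicon-containing double-strand notation `O{-}[Si]{+mich=1}(C)O[Si]{+mich=7}(C)O{-}` gives (2, 1) and (5, 7) in the same way.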
Regular double-strand polymers: chain of formally fused heterocycles
Example: poly(2,4-dimethyl-1,3,5-trioxa-2,4-disilapentane-1,5:4,2-tetrayl)
CurlySMILES notation:
O{-}[Si]{+mich=1}(C)O[Si]{+mich=7}(C)O{-}
two head nodes: O{-}, two tail nodes [Si]{+m}
For IUPAC nomenclature of this polymer see page 1573 in http://old.iupac.org/publications/pac/1993/pdf/6507x1561.pdf.
Double bond between head and tail
Example: poly(piperidine-3,5-diylideneethanediylidene)
CurlySMILES notations:
A: C1{=}CNCC(C1)=CC{+mich=1;b==}
B: C{-}C1CNCC(C1)=C{+n}
Both notations encode correct atom connectivity. In the IUPAC-compliant notation A, key b specifies = as bond between tail and head. For IUPAC nomenclature of this polymer see page 1941 in Nomenclature of Regular Single-Strand Organic Polymers.
Encoding with copolymer qualifiers
Example: poly(styrene-co-isoprene)
CurlySMILES notation:
C{-}C{+ninc=1-8;ich=1}(c1ccccc1) \
C{-}C=C(C)C{+ninc=9-13;ich=9}{+scpq=c}
Copolymer qualifiers (values of key cpq):
cpq   Qualifier   Meaning
a     alt         alternating
b     block       block
c     co          generic
g     graft       graft
p     per         periodic
r     ran         random
s     stat        statistical
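The qualifier table maps directly onto a lookup structure. The sketch below, with hypothetical names `CPQ`, `sequence_annotation` and `copolymer_name`, turns a cpq code into the {+s} annotation and into the source-based copolymer name that uses the corresponding infix.

```python
# cpq code -> (name infix, meaning), as listed in the qualifier table
CPQ = {
    "a": ("alt", "alternating"),
    "b": ("block", "block"),
    "c": ("co", "generic"),
    "g": ("graft", "graft"),
    "p": ("per", "periodic"),
    "r": ("ran", "random"),
    "s": ("stat", "statistical"),
}


def sequence_annotation(cpq):
    """Return the {+s} annotation carrying the copolymer qualifier."""
    if cpq not in CPQ:
        raise ValueError(f"unknown copolymer qualifier: {cpq!r}")
    return "{+scpq=" + cpq + "}"


def copolymer_name(monomers, cpq):
    """Join monomer names with the qualifier infix, e.g. styrene-co-isoprene."""
    infix, _meaning = CPQ[cpq]
    return "poly(" + f"-{infix}-".join(monomers) + ")"
```

`copolymer_name(["styrene", "isoprene"], "c")` returns `poly(styrene-co-isoprene)`, and `sequence_annotation("c")` returns `{+scpq=c}` as used at the end of the notation above.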
Encoding of a terpolymer
Example: poly[methyl-N-(3,4-dimethylphenyl)-N-(4-biphenyl)-N-(4-phenyloxy)siloxane-co-phenylmethylsiloxane-co-methylhydrosiloxane]
c1ccccc1[Si]{-}(C)O{+ninc=1-9;ich=7}[SiH]{-}(C)O{+ninc=10-12;ich=10}[Si]{-}(C)(Oc2ccc(cc2)N(c3cc(C)c(C)cc3)c4ccc(cc4)-c5ccccc5)O{+ninc=13-43;ich=13}{+scpq=c}
For more about this terpolymer see doi: 10.1021/ma202041u.
Nesting of SRUs
Example: unsaturated polyester with α,ω-alkanediyl bridges
CurlySMILES notation:
C{-}(=O)OC{-}{+ninc=4;ich=4;n=5-9} \
OC(=O)C(C)=CCC{+ninc=1-13;ich=1}C
Encoding of polymer blends
Example: polystyrene/poly(methyl methacrylate) blend
CurlySMILES notation:
C{-}C{+n}c1ccccc1.C{-}C{+n}(C)C(=O)OC{mx}
Annotation {mx} indicates a compatible or incompatible mixture.
CurlySMILES encoding as a two-phase system (composite):
{/C{-}C{+n}c1ccccc1/C{-}C{+n}(C)C(=O)OC}
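Both encodings of the blend are simple string compositions of the component notations, which the following sketch reproduces. The function names are hypothetical, and the sketch assumes that the component notations contain no "/" characters of their own (such as SMILES stereo bonds), since it does not escape them.

```python
def blend(*components):
    """Join component notations with '.' and mark the result as a mixture."""
    if len(components) < 2:
        raise ValueError("a blend needs at least two components")
    return ".".join(components) + "{mx}"


def composite(*phases):
    """Encode the components as separate phases of a multi-phase system."""
    if len(phases) < 2:
        raise ValueError("a composite needs at least two phases")
    if any("/" in phase for phase in phases):
        raise ValueError("'/' inside a phase notation is not handled by this sketch")
    return "{/" + "/".join(phases) + "}"


POLYSTYRENE = "C{-}C{+n}c1ccccc1"
PMMA = "C{-}C{+n}(C)C(=O)OC"
```

`blend(POLYSTYRENE, PMMA)` and `composite(POLYSTYRENE, PMMA)` reproduce the two notations shown on this slide.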
Encoding of polymer solutions
Example: poly(1-cyanoethylene) dissolved in dihydrofuran-2(3H)-one (γ-butyrolactone)
CurlySMILES notation:
C{-}(C#N)C{+n}{dsc=O=C1OCCC1}
Annotation marker ds for dissolved
Key c for CurlySMILES notation with assigned value O=C1OCCC1
Encoding of doped polymers
Example: poly(1,4-phenylene sulfide) doped with arsenic pentafluoride
CurlySMILES notation:
c1{-}ccc(cc1)S{+n}{IMc=F[As](F)(F)(F)F}
Annotation marker IM for impurity
Key c specifying dopant F[As](F)(F)(F)F
Encoding of polymer sets
Example: poly[(alkylimino)methyleneimino-1,3-phenylene] with specified alkyl groups
CurlySMILES notation:
N{-}{+Rcc=C{-},CC{-},CCC{-},CC{-}C,CC{-}(C)C}CNc1cccc{+n}c1
Annotation marker +R for alkyl group insertion
Key cc for list of comma-separated CurlySMILES notations; here, encoding the specified alkyl groups methyl, ethyl, n-propyl, iso-propyl and tert-butyl
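Since the value of key cc is a comma-separated list of notations, the members of a polymer set can be enumerated with a brace-aware split. The sketch below uses a hypothetical function name and assumes that cc is the only key in the +R annotation; it lists the group notations and does not attempt to graft them onto the backbone.

```python
def set_members(notation):
    """Return the group notations listed under key cc of a {+Rcc=...} annotation."""
    opener = "{+Rcc="
    start = notation.find(opener)
    if start < 0:
        return []
    # Find the '}' that closes the +R annotation, skipping nested braces.
    depth = 0
    for end in range(start, len(notation)):
        if notation[end] == "{":
            depth += 1
        elif notation[end] == "}":
            depth -= 1
            if depth == 0:
                break
    else:
        raise ValueError("unbalanced curly braces")
    value = notation[start + len(opener):end]
    # Split on commas that are not inside a nested annotation.
    members, current, depth = [], [], 0
    for ch in value:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            members.append("".join(current))
            current = []
        else:
            current.append(ch)
    members.append("".join(current))
    return members
```

For the notation on this slide the function returns five entries, one each for methyl, ethyl, n-propyl, iso-propyl and tert-butyl, so the single line stands for a set of five polymers.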
CurlySMILES in Python 3
Current iteratively tested implementations
● Modules to parse and analyze molecular-graph-based notations and their annotations
● CANGEN-based methods for input-to-unique conversion of notations (regular single-strands)
● Substructure and descriptor generation methods
● Programs to maintain and screen Axeleratio's in-house bibliography of CurlySMILES-tagged literature, including nano-device and polymer publications.
Transformation of a CurlySMILES notation based on node ranks
Example: poly[(2-propyl-1,3-dioxane-4,6-diyl)methylene]
Entered: C1{-}OC(CCC)OC(C1)C{+n}
Unique: C1{-}CC(OC(CCC)O1)C{+n}
The CH2 ring node ranks lower than the left O node; the CH2 tail node ranks higher than the right O node.
Uniqueness depending on selection of head/tail (H/T) pair
Example: poly(oxyethylene), written with three different H/T pairs:
O{-}CC{+n}   C{-}OC{+n}   C{-}CO{+n}
Nomenclature-conform selection of head and tail nodes is recommended in polymer encoding.
[see examples of unique notations for regular single-strand polymers]
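That different head/tail selections describe the same chain can be checked, for very simple cases, by comparing cyclic rotations of the backbone. The sketch below is not the CANGEN-based method mentioned in this presentation and does not pick a unique notation; it only tests equivalence. It is restricted to unbranched, single-bonded SRUs written with one-letter atom symbols, with {-} on the first atom and {+n} at the end, and it compares SRUs of equal length only. The function names are hypothetical.

```python
def backbone(notation):
    """Return the atom string of a simple linear SRU, e.g. 'OCC' for O{-}CC{+n}."""
    if notation[1:4] != "{-}" or not notation.endswith("{+n}"):
        raise ValueError("expected {-} on the first atom and a trailing {+n}")
    atoms = notation[0] + notation[4:-4]
    if any(ch not in "BCNOPS" for ch in atoms):
        raise ValueError("only unbranched SRUs of one-letter atoms are supported")
    return atoms


def same_chain(first, second):
    """True if both notations describe the same repeating chain of equal SRU length."""
    a, b = backbone(first), backbone(second)
    if len(a) != len(b):
        return False
    candidates = set()
    for text in (a, a[::-1]):  # reading direction along the chain does not matter
        for shift in range(len(text)):
            candidates.add(text[shift:] + text[:shift])
    return b in candidates
```

All three poly(oxyethylene) notations above are rotations of one another, so `same_chain` returns True for each pair, whereas `O{-}CC{+n}` and the poly(oxymethylene) notation `O{-}C{+n}` are reported as different.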
Task-specific integration of CurlySMILES modules
● Interfacing polymer structure (input/output): form-to-notation editors; notation-to-sketch and notation-to-query software
● Pipelining polymer data (data administration): automatic ranking and comparison of structure/data pairs; screening of structured lists and repositories
● Generating virtual libraries: automatically building lists of polymer notations for QSPR analysis and identification of optimal-design candidates
Application to polymer data mining: “nurturing the mine sites”
SRU-based CurlySMILES notations in unique form are identifiers of macromolecules and polymer systems that can be employed to
• function as search keys in database applications,
• tag factsheets, notes and bibliographic entries,
• populate spreadsheet cells and XML text nodes,
• index and abstract the polymer literature & patents,
• create ontologies that organize polymer information, ...,
which can be shared via Semantic Web technologies.
Application to polymer data mining: search and data extraction
The CurlySMILES language has a rich and extensible dictionary to encode polymers in diverse contexts and at various levels of detail.
Notations work both ways: as precise data annotations and as query formulations for “needle-in-the-haystack” requests.
Today's polymer knowledge systems are not marked up by CurlySMILES. But client-server mediation can be achieved, behind-the-scenes, via CurlySMILES code to
• compact polymer input provided through entry forms,
• expand notations into query language formats.
Application to polymer modeling
CurlySMILES representations of polymer systems contain detailed structural information to derive macromolecular descriptors and substructures (groups) as entry points for property prediction and model development:
• Structure-property relationships (QSPRs, GCMs)
• SRU similarity (kNN and pattern recognition methods)
• MC & MD simulations (flexibility, solution behavior)
• Backbone modeling (polymer stability & degradation)
• Kinetic & ab initio methods (controlled polymerization)
Application to polymer design
Specialty polymers must meet multifaceted requirements (multi-dimensional property windows).
The virtual design of polymers by permutationally building (co)polymers (or blends) based on systematically varied monomer structures often results in large libraries of structurally related polymers with predictable properties.
The automatic generation of the polymer structures of such libraries as compact CurlySMILES notations and the implementation of predictive methods for the desired properties will allow virtual high-throughput screening to initialize the synthesis of potential candidates.
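A very small virtual library can be generated by substituting side groups into one SRU template and pairing the results. This is a sketch under stated assumptions: the template `C{-}(X)C{+n}` is chosen because it matches the poly(1-phenylethylene) and poly(1-cyanoethylene) notations used earlier in this presentation, the substituent list is illustrative, the function names are hypothetical, and no property prediction is attempted.

```python
from itertools import combinations


def vinyl_library(substituents):
    """Map each substituent name to a homopolymer notation of the form C{-}(X)C{+n}."""
    return {name: "C{-}(" + smiles + ")C{+n}"
            for name, smiles in substituents.items()}


def blend_candidates(library):
    """Pair every two library members as a blend notation marked with {mx}."""
    return [(a, b, library[a] + "." + library[b] + "{mx}")
            for a, b in combinations(sorted(library), 2)]


# Illustrative substituents only: name -> SMILES of the side group
SUBSTITUENTS = {"phenyl": "c1ccccc1", "cyano": "C#N", "hydroxy": "O"}
```

With the three substituents above, `vinyl_library` yields three homopolymer notations and `blend_candidates` yields three binary blends; larger substituent lists grow the library combinatorially, which is where automatic screening becomes necessary.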
Summary & Outlook
Done
● SRU annotations to encode polymers
● Polymer description grammar
● Python implementation
To Do
● Stereochemical descriptions
● Unique notations for nested polymers
● Conquering polymer space
Topics to be addressed for CurlySMILES applications
● Representation and iterative development of models for structure/property estimation
● Extension to advanced architectures: dendrimers, 3D polymers and nanostructure designs combining polymers with carbon nanotubes and fullerene-based bowls and cages