Application of CurlySMILES to the encoding of polymer systems

Presented on August 16, 2015, at the ACS 250th National Meeting in Boston

Division: Computers in Chemistry

Session: Accelerated Discovery of Chemical Compounds: Design New Polymers & Inorganic Materials from Integration of Polymer Science, Materials Science & Informatics

Axel Drefahl

[email protected]

Axeleratio, Reno, Nevada

www.axeleratio.com/axel/acs/boston2015/CurlySMILESpolymers.pdf

Copyright © 2015 Axel Drefahl

What is CurlySMILES?

CurlySMILES is:

● a chemical language to capture, process and share nanostructures, based on molecular constitution, connectivity and arrangement;

● a line notation system integrating SMILES with atom- and molecule-anchored annotations, inserted via curly braces: {…};

● customizable by annotation: encoding of polymers, complexes, multi-phase systems, ...;

● available as a suite of Python 3 modules, including a notation parser and unique notation generator.


Overview

● Current state of polymer informatics

● Brief introduction to CurlySMILES

● Encoding of Structural Repeat Units (SRUs)

● Encoding of single-strand polymers

● Encoding of multi-strand polymers

● Encoding of copolymers and miscellaneous polymer systems

● CurlySMILES software/task-specific integration

● Perspective: virtual polymer chemistry


Cheminformatics (sub)domains

Established informatics       Capturing & processing
“Small molecules”             Molecular graph
Crystalline solids            Unit cell, space group
Peptides, DNAs, ...           Fragment sequence

Evolving informatics          Capturing & processing
Polymer systems               Struct. repeat unit (SRU)
Nanomaterials                 Nano-object (sphere, rod, ...)
Material classes              Variable groups: R, X, Y, Z, ...
Composites & design           Metalevel components


Polymer informatics: approaches and tools

● IUPAC nomenclature & seniority rules (head-tail selection)

● S-group (superatom): SRU with crossing bonds and brackets (common representation, MDL, MarvinSketch),

● ThermoML with polymer block to specify compounds,

● Polymer Markup Language (PML),

● Polymer Informatics Knowledge System (PIKS) - PolyInfo database | “walled gardens” of polymer information,

● InChI polymer project (awaits implementation),

● CurlySMILES project: actively designs a human-machine interface for nanoarchitectures, including polymers, and develops open-source Python code.


From SMILES to CurlySMILES

SMILES: Simplified Molecular Input Line Entry System

Published by David Weininger in 1988

doi: 10.1021/ci00057a005

CurlySMILES: Curly-braces enhanced Smart Material Input Line Entry Specification

Published by Axel Drefahl in 2011

doi: 10.1186/1758-2946-3-1


CurlySMILES Motivation

● Chemical nomenclature and encoding languages typically employ idealized representations, while minor structural irregularities and impurities are ignored. CurlySMILES encoding enables their insertion via annotation, if desired.

● A molecular-graph-derived notation is often taken to represent molecule and substance interchangeably. CurlySMILES employs molecule multipliers and allows for phase distinction, for example, by using state and shape annotations such as lq, tf, am, cr, np, ...

● Variability of detail: stoichiometric formula notation (SFN)

● Encoding of molecular arrangements: hydrogen-bonded molecules, complexes, macromolecules and other nanoassemblies.


Format of curly-enclosed annotations in CurlySMILES

{AMk1=v1;...;kn=vn}

AM is a one-char or two-char annotation marker; a two-char AM may be followed by an annotation dictionary, a semicolon-separated list of key/value (ki/vi) pairs. Keys are predefined, but extensible by customization ($ prefix).

Example: n-octanethiol-functionalized gold nanoparticle dispersed in toluene (●-SCH2(CH2)6CH3 in toluene)

S{-|c=[Au]{np}}CCCCCCCC{dpc=Cc1ccccc1}

AMs: -| for surface-attached, dp for dispersed, np for nanoparticle.
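The annotation format described above can be illustrated with a small parser. This is a minimal sketch, not the CurlySMILES package's own parser: the function names `split_top_level` and `parse_annotation` are hypothetical, and the rule "bodies of one or two characters are bare markers, longer bodies start with a two-char marker" is an assumption derived from the format description on this slide.

```python
def split_top_level(text, separator):
    """Split text on separator, ignoring separators nested inside {...}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == separator and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_annotation(body):
    """Parse the text between the outer braces into (marker, dictionary).

    Assumption: a body of one or two characters is a bare marker; a longer
    body is a two-char marker followed by key=value pairs separated by ';'.
    """
    if len(body) <= 2:
        return body, {}  # bare marker such as "-", "Z" or "np"
    marker, rest = body[:2], body[2:]
    dictionary = {}
    for pair in split_top_level(rest, ";"):
        key, _, value = pair.partition("=")  # split at the first "=" only
        dictionary[key] = value
    return marker, dictionary


# Annotations taken from the gold nanoparticle example
print(parse_annotation("-|c=[Au]{np}"))
print(parse_annotation("dpc=Cc1ccccc1"))
print(parse_annotation("np"))
```

Splitting only at brace depth zero keeps nested annotations such as `[Au]{np}` intact inside a value.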


Annotated Molecular Graph

Example: (Z)-but-1-ene-1,4-diyl substructure

CurlySMILES: C{-}=C{Z}CC{-}

Atom-anchored annotations:

Structural unit annotation (pendent single bond): {-}

Stereodescriptive annotation: {Z}
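To make the idea of atom-anchored annotations concrete, the following sketch separates a notation into its plain SMILES skeleton and a list of annotations with their anchor positions. The function name `split_notation` is hypothetical, and recording the anchor as a character offset into the skeleton is an illustrative choice, not the package's internal representation.

```python
def split_notation(notation):
    """Return (skeleton, annotations) for a CurlySMILES-style string.

    skeleton    : the notation with all {...} annotations removed
    annotations : list of (offset, body), where offset is the length of the
                  skeleton at the point where the annotation appeared
    """
    skeleton, annotations = [], []
    depth, start = 0, 0
    for i, ch in enumerate(notation):
        if ch == "{":
            if depth == 0:
                start = i + 1  # first character inside the outer brace
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing brace")
            if depth == 0:
                annotations.append((len(skeleton), notation[start:i]))
        elif depth == 0:
            skeleton.append(ch)
    if depth != 0:
        raise ValueError("unbalanced opening brace")
    return "".join(skeleton), annotations


print(split_notation("C{-}=C{Z}CC{-}"))
```

For the (Z)-but-1-ene-1,4-diyl example this yields the skeleton `C=CCC` together with the two `-` annotations and the `Z` annotation.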


Poly[(Z)-but-1-ene-1,4-diyl]

CurlySMILES: C{-}=C{Z}CC{+n}

Atom-anchored annotations:

Structural unit annotation: {-} at head node

Stereodescriptive annotation: {Z}

Operational annotation: {+n} at tail node


Does CurlySMILES encode macromolecules or polymers?

Answer: both (user choice). CurlySMILES comes with a rich annotation dictionary to encode chain length variation and phases.

A macromolecule is a single molecule. The term “polymer” can mean “macromolecule” or a “substance” composed thereof, typically with a “degree of polymerization” (DOP) range.

An oligomer or a macromolecule of a specific length is encoded based on the chain graph, i.e. the SRU graph, using annotation dictionary key n: {+nn=10} for ten occurrences of the SRU.

A polymer is encoded by leaving out n (generic polymer). The key dpr may specify a DOP range: {+ndpr=gt250}. AMs such as am (amorphous) or cr (crystalline) indicate a particular polymer phase.

A polymer system is encoded by additional annotations specifying, for example, impurities, additives and solvents.
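The distinction between a macromolecule of fixed length and a generic polymer comes down to which keys appear in the tail annotation. A minimal sketch of a helper that builds these annotations follows; the function name `single_strand_tail` is hypothetical, and only the keys n and dpr shown on this slide are covered.

```python
def single_strand_tail(n=None, dpr=None):
    """Build a {+n...} tail-node annotation.

    n   : exact number of SRU occurrences (oligomer or macromolecule)
    dpr : degree-of-polymerization range, e.g. "gt250"
    With neither key the generic polymer annotation {+n} is returned.
    """
    entries = []
    if n is not None:
        entries.append(f"n={n}")
    if dpr is not None:
        entries.append(f"dpr={dpr}")
    return "{+n" + ";".join(entries) + "}"


sru = "O{-}C"  # SRU of poly(oxymethylene) without its tail annotation
print(sru + single_strand_tail())              # generic polymer
print(sru + single_strand_tail(n=10))          # decamer
print(sru + single_strand_tail(dpr="gt250"))   # polymer with DOP range
```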


Tail node annotations to formally construct polymers

{+n} anchored at tail node of divalent SRU to build single-strand polymer via head-tail single-bond connection.

{+r} anchored at tail node of SRU to build single-strand macrocycle (last tail node connects first head node).

{+m} anchored at tail node of non-single-bond or multivalent SRU to build multibond/multi-strand polymer via specified head-tail connection using key ich to provide index of corresponding head node.

{+s} anchored at tail node of the last (right-most) SRU in a copolymer sequence to provide copolymer details; for example, a copolymer qualifier via key cpq to specify an alternating, block or random sequence.
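As a rough illustration of how the four tail-node markers could be recognized in a notation, the sketch below uses a regular expression to list them. The names `TAIL_MARKERS` and `tail_markers` are hypothetical; the descriptions are paraphrased from this slide, and a plain substring scan like this is only an approximation of real parsing.

```python
import re

# Short descriptions paraphrased from the slide
TAIL_MARKERS = {
    "+n": "single-strand polymer (head-tail single bond)",
    "+r": "single-strand macrocycle",
    "+m": "multibond or multi-strand polymer",
    "+s": "copolymer sequence details",
}


def tail_markers(notation):
    """Return the tail-node markers in order of appearance."""
    return ["+" + code for code in re.findall(r"\{\+([nrms])", notation)]


for example in ("C{-}=C{Z}CC{+n}",
                "[Si]{-}(C)(C)[Si](C)(C)O{+rn=24}",
                "C{-}C{+mich=1}C{+mich=4}C{-}"):
    for marker in tail_markers(example):
        print(example, "->", marker, ":", TAIL_MARKERS[marker])
```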


CurlySMILES notations of some common single-strand homopolymers

Structure-Based Name          Structural Formula    CurlySMILES Notation
Poly(oxymethylene)            -[OCH2]-n             O{-}C{+n}
Poly(iminoethylene)           -[NHCH2CH2]-n         N{-}CC{+n}
Poly(1-hydroxyethylene)       -[CH(OH)CH2]-n        OC{-}C{+n}
Poly(1-cyanoethylene)         -[CH(CN)CH2]-n        N#CC{-}C{+n}
Poly(1,1-difluoroethylene)    -[CF2CH2]-n           FC{-}(F)C{+n}
Poly(1-phenylethylene)        -[CH(Ph)CH2]-n        C{-}(c1ccccc1)C{+n}
Poly(oxy-1,4-phenylene)       -[O-paraPh]-n         O{-}c1ccc{+n}cc1
Poly(methylene)               -[CH2]-n              C{-}{+n}
Poly(difluoromethylene)       -[CF2]-n              FC{-}{+n}F
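The notations in the table above share one pattern: exactly one head annotation {-} and one tail annotation {+n}. The following sketch stores the table as a Python dictionary and checks that pattern; the names `HOMOPOLYMERS`, `skeleton` and `has_single_head_and_tail` are hypothetical, and the regular expression assumes annotations without nested braces, which holds for this table.

```python
import re

# Names and notations copied from the table
HOMOPOLYMERS = {
    "Poly(oxymethylene)": "O{-}C{+n}",
    "Poly(iminoethylene)": "N{-}CC{+n}",
    "Poly(1-hydroxyethylene)": "OC{-}C{+n}",
    "Poly(1-cyanoethylene)": "N#CC{-}C{+n}",
    "Poly(1,1-difluoroethylene)": "FC{-}(F)C{+n}",
    "Poly(1-phenylethylene)": "C{-}(c1ccccc1)C{+n}",
    "Poly(oxy-1,4-phenylene)": "O{-}c1ccc{+n}cc1",
    "Poly(methylene)": "C{-}{+n}",
    "Poly(difluoromethylene)": "FC{-}{+n}F",
}


def skeleton(notation):
    """Remove non-nested {...} annotations, leaving the SRU as plain SMILES."""
    return re.sub(r"\{[^{}]*\}", "", notation)


def has_single_head_and_tail(notation):
    """True if the notation has exactly one {-} and exactly one {+n}."""
    return notation.count("{-}") == 1 and notation.count("{+n}") == 1


for name, notation in HOMOPOLYMERS.items():
    print(f"{name:28s} {notation:22s} {skeleton(notation)}")
```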


Polydispersity characterization

With the exception of the dimensionless pdi, units are kg/mol.

Example: {+npMn=89.2} to encode a single-strand polymer with a number-average molar mass of 89.2 kg/mol

Key   Symbol   Meaning                         ThermoML tag name
pMn   Mn       Number-average molar mass       nNumberAvgMolWt
pMm   Mm       Mass-average molar mass         nWeightAvgMolWt
pMz   Mz       z-Average molar mass            nZAvgMolWt
pMv   Mv       Viscosity-average molar mass    nViscosityAvgMolWt
pMp   Mp       Peak molar mass                 nPeakAvgMolWt (?)
pdi   Mm/Mn    Polydispersity index            nPolydispersityIndex
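Since pdi is defined as Mm/Mn, an annotation carrying all three keys can be assembled from the two molar masses. This is a hedged sketch: the function name `polydispersity_annotation` is hypothetical, the Mm value in the usage line is invented for illustration, and combining several keys in one {+n...} annotation is assumed from the general {AMk1=v1;...;kn=vn} format rather than shown on this slide.

```python
def polydispersity_annotation(mn, mm):
    """Build a {+n...} annotation from number- and mass-average molar mass.

    mn, mm : molar masses in kg/mol
    pdi    : dimensionless ratio Mm/Mn, rounded to two decimals
    """
    if mn <= 0 or mm <= 0:
        raise ValueError("molar masses must be positive")
    pdi = mm / mn
    return "{+n" + f"pMn={mn:g};pMm={mm:g};pdi={pdi:.2f}" + "}"


# Mn from the slide example; the Mm value is a made-up illustration
print("C{-}C" + polydispersity_annotation(89.2, 178.4))
```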


Anionic homopolymer with monoatomic cations

Example: poly(sodium 1-carboxylatoethylene)

CurlySMILES: O=C([O-]{+Cc=[Na+]})C{-}C{+n}

The operational annotation marker +C is used to include [Na+] as counterion to [O-]. [Na+] is part of the repeat unit.


Homopolymer with terminating groups at head and tail

Example: poly(ethylene terephthalate) by esterification of terephthalic acid with ethylene glycol

[H]O{-}CCOC(=O)c1ccc(cc1)C{+ninc=2-15;ich=2}(=O)OCCO

Nodes 2 to 15 are parts of SRU. Node 1 makes the head terminus and nodes 16-20 belong to tail end group.


Cyclic polymers or oligomers

Example: cyclic poly(silaether)

[Si]{-}(C)(C)[Si](C)(C)O{+rn=24}

Shortcut for a long SMILES notation:

[Si]1(C)(C)[Si](C)(C)O...[Si](C)(C)[Si](C)(C)O1

Such cyclic poly(silaether)s are obtained, for example, as by-products while making their linear homologs by ring-opening polymerization of octamethyl-1,4-dioxatetrasilacyclohexane [doi: 10.1021/ma00086a048].


Surface-grafted functional oligomer

Example: polyacrylamide brush grown on silicon

N{-|c=[Si]}C(=O)c1ccc(cc1)CCC{-} \

C{+ninc=12-16;ich=12}C(=O)N

Group environment annotation -| for bond to substrate

Growth of such polyacrylamide brushes on a silicon wafer is studied to understand how to reduce or prevent microbial adhesion on surfaces by chemical surface modification [doi: 10.1021/la063531v].


Regular double-strand polymers: chain of formally fused cycloalkane rings

Example: poly(butane-1,4:3,2-tetrayl)

CurlySMILES notation:

C{-}C{+mich=1}C{+mich=4}C{-}

two head nodes: C{-}, two tail nodes C{+m}

For IUPAC nomenclature of this polymer see A Brief Guide to Polymer Nomenclature.


Regular double-strand polymers: chain of formally fused heterocycles

Example: poly(2,4-dimethyl-1,3,5-trioxa-2,4-disilapentane-1,5:4,2-tetrayl)

CurlySMILES notation:

O{-}[Si]{+mich=1}(C)O[Si]{+mich=7}(C)O{-}

two head nodes: O{-}, two tail nodes [Si]{+m}

For IUPAC nomenclature of this polymer see page 1573 in http://old.iupac.org/publications/pac/1993/pdf/6507x1561.pdf.


Double bond between head and tail

Example: poly(piperidine-3,5-diylideneethanediylidene)

CurlySMILES notations:

A: C1{=}CNCC(C1)=CC{+mich=1;b==}

B: C{-}C1CNCC(C1)=C{+n}

Both notations encode correct atom connectivity. In the IUPAC-compliant notation A, key b specifies = as bond between tail and head. For IUPAC nomenclature of this polymer see page 1941 in Nomenclature of Regular Single-Strand Organic Polymers.


Encoding with copolymer qualifiers

Example: poly(styrene-co-isoprene)

CurlySMILES notation of above example:

C{-}C{+ninc=1-8;ich=1}(c1ccccc1) \

C{-}C=C(C)C{+ninc=9-13;ich=9}{+scpq=c}

Copolymer qualifiers:

cpq   Qualifier   Meaning
a     alt         alternating
b     block       block
c     co          generic
g     graft       graft
p     per         periodic
r     ran         random
s     stat        statistical
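The qualifier table maps directly onto a lookup structure. The sketch below stores it as a dictionary and builds the {+s...} annotation for a given code; the names `CPQ`, `sequence_annotation` and `describe_cpq` are hypothetical helpers, not part of the CurlySMILES modules.

```python
# cpq code -> (qualifier used in names, meaning), copied from the table
CPQ = {
    "a": ("alt", "alternating"),
    "b": ("block", "block"),
    "c": ("co", "generic"),
    "g": ("graft", "graft"),
    "p": ("per", "periodic"),
    "r": ("ran", "random"),
    "s": ("stat", "statistical"),
}


def sequence_annotation(code):
    """Return the {+s...} annotation for a copolymer qualifier code."""
    if code not in CPQ:
        raise KeyError(f"unknown copolymer qualifier code: {code!r}")
    return "{+s" + f"cpq={code}" + "}"


def describe_cpq(code):
    """Return a short human-readable description of a qualifier code."""
    qualifier, meaning = CPQ[code]
    return f"{qualifier} ({meaning})"


print(sequence_annotation("c"), describe_cpq("c"))
```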


Encoding of a terpolymer

Example: poly[methyl-N-(3,4-dimethylphenyl)-N-(4-biphenyl)-N-(4-phenyloxy)siloxane-co-phenylmethylsiloxane-co-methylhydrosiloxane]

c1ccccc1[Si]{-}(C)O{+ninc=1-9;ich=7}[SiH]{-}(C)O{+ninc=10-12;ich=10}[Si]{-}(C)(Oc2ccc(cc2)N(c3cc(C)c(C)cc3)c4ccc(cc4)-c5ccccc5)O{+ninc=13-43;ich=13}{+scpq=c}

For more about this terpolymer see doi: 10.1021/ma202041u.
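The inc and ich keys refer to node indices, which can be checked by counting atoms from left to right. The following is a minimal sketch under stated assumptions: nodes are numbered in writing order starting at 1, bracket atoms count as one node each, and only a small set of organic-subset atom symbols is recognized. The names `ATOM`, `annotated_nodes` and `sru_ranges` are hypothetical.

```python
import re

# Bracket atoms, two-letter halogens, then one-letter aliphatic/aromatic atoms
ATOM = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]")


def annotated_nodes(notation):
    """Return (node_index, annotation_body) for each top-level annotation."""
    results, skeleton = [], []
    depth, start = 0, 0
    for i, ch in enumerate(notation):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # nodes written so far = index of the anchoring node
                node = len(ATOM.findall("".join(skeleton)))
                results.append((node, notation[start:i]))
        elif depth == 0:
            skeleton.append(ch)
    return results


def sru_ranges(notation):
    """Return (ich, tail_node, inc_low, inc_high) for each {+n...inc=...}."""
    ranges = []
    for node, body in annotated_nodes(notation):
        if not body.startswith("+n"):
            continue
        entries = dict(p.split("=", 1) for p in body[2:].split(";") if "=" in p)
        if "inc" not in entries:
            continue
        low, _, high = entries["inc"].partition("-")
        ranges.append((int(entries["ich"]), node, int(low), int(high or low)))
    return ranges


terpolymer = ("c1ccccc1[Si]{-}(C)O{+ninc=1-9;ich=7}"
              "[SiH]{-}(C)O{+ninc=10-12;ich=10}"
              "[Si]{-}(C)(Oc2ccc(cc2)N(c3cc(C)c(C)cc3)c4ccc(cc4)-c5ccccc5)O"
              "{+ninc=13-43;ich=13}{+scpq=c}")
for ich, tail, low, high in sru_ranges(terpolymer):
    print(f"head node {ich}, tail node {tail}, SRU nodes {low}-{high}")
```

For the terpolymer, each tail node found by counting falls inside its declared inc range, and each ich value matches the node carrying the preceding {-} annotation.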


Nesting of SRUs

Example: unsaturated polyester with α,ω-alkanediyl bridges

CurlySMILES notation:

C{-}(=O)OC{-}{+ninc=4;ich=4;n=5-9} \

OC(=O)C(C)=CCC{+ninc=1-13;ich=1}C


Encoding of polymer blends

Example: polystyrene/poly(methyl methacrylate) blend

CurlySMILES notation:

C{-}C{+n}c1ccccc1.C{-}C{+n}(C)C(=O)OC{mx}

Annotation {mx} indicates a compatible or incompatible mixture.

CurlySMILES encoding as a two-phase system (composite):

{/C{-}C{+n}c1ccccc1/C{-}C{+n}(C)C(=O)OC}
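Both blend encodings shown above are simple string compositions of the component notations. A minimal sketch with hypothetical helper names `blend` and `composite` follows; it reproduces the two notations from this slide and makes no claim about how the CurlySMILES modules build them.

```python
def blend(*components):
    """Join component notations with '.' and mark the result as a mixture."""
    if len(components) < 2:
        raise ValueError("a blend needs at least two components")
    return ".".join(components) + "{mx}"


def composite(*phases):
    """Encode a multi-phase system as {/phase1/phase2...}."""
    if len(phases) < 2:
        raise ValueError("a composite needs at least two phases")
    return "{" + "".join("/" + phase for phase in phases) + "}"


polystyrene = "C{-}C{+n}c1ccccc1"
pmma = "C{-}C{+n}(C)C(=O)OC"

print(blend(polystyrene, pmma))
print(composite(polystyrene, pmma))
```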


Encoding of polymer solutions

Example: poly(1-cyanoethylene) dissolved in dihydrofuran-2(3H)-one (γ-butyrolactone)

CurlySMILES notation:

C{-}(C#N)C{+n}{dsc=O=C1OCCC1}

Annotation marker ds for dissolved

Key c for CurlySMILES notation with assigned value O=C1OCCC1
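An environment annotation of this kind is a two-char marker followed by key c and the notation of the surrounding component. The sketch below appends such an annotation to a polymer notation; the function name `with_environment` is hypothetical, and it assumes the annotation is simply placed at the end of the notation, as in the examples in this deck.

```python
def with_environment(notation, marker, component):
    """Append {<marker>c=<component>} to a notation.

    marker    : two-char annotation marker, e.g. "ds" (dissolved)
                or "dp" (dispersed)
    component : notation of the solvent or dispersion medium
    """
    if len(marker) != 2:
        raise ValueError("an annotation dictionary requires a two-char marker")
    return notation + "{" + marker + "c=" + component + "}"


# poly(1-cyanoethylene) dissolved in gamma-butyrolactone
print(with_environment("C{-}(C#N)C{+n}", "ds", "O=C1OCCC1"))
```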


Encoding of doped polymers

Example: poly(1,4-phenylene sulfide) doped with arsenic pentafluoride

CurlySMILES notation:

c1{-}ccc(cc1)S{+n}{IMc=F[As](F)(F)(F)F}

Annotation marker IM for impurity

Key c specifying dopant F[As](F)(F)(F)F


Encoding of polymer sets

Example: poly[(alkylimino)methyleneimino-1,3-phenylene] with specified alkyl groups

CurlySMILES notation:

N{-}{+Rcc=C{-},CC{-},CCC{-},CC{-}C,CC{-}(C)C}CNc1cccc{+n}c1

Annotation marker +R for alkyl group insertion

Key cc for list of comma-separated CurlySMILES notations; here, encoding the specified alkyl groups methyl, ethyl, n-propyl, iso-propyl and tert-butyl
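To see how the cc list could be read back out of such a notation, the following sketch finds the +R annotation and splits its value on commas outside nested braces. The function names are hypothetical, and the sketch assumes cc is the only key in the +R annotation, as in the example above.

```python
def top_level_annotations(notation):
    """Return the bodies of all outermost {...} annotations."""
    bodies, depth, start = [], 0, 0
    for i, ch in enumerate(notation):
        if ch == "{":
            if depth == 0:
                start = i + 1
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                bodies.append(notation[start:i])
    return bodies


def split_commas(text):
    """Split on commas that are not nested inside {...}."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def alkyl_groups(notation):
    """Return the group notations listed under key cc of a +R annotation."""
    prefix = "+Rcc="
    for body in top_level_annotations(notation):
        if body.startswith(prefix):
            return split_commas(body[len(prefix):])
    return []


polymer_set = "N{-}{+Rcc=C{-},CC{-},CCC{-},CC{-}C,CC{-}(C)C}CNc1cccc{+n}c1"
print(alkyl_groups(polymer_set))
```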


CurlySMILES in Python 3

Current iteratively tested implementations

● Modules to parse and analyze molecular-graph-based notations and their annotations

● CANGEN-based methods for input-to-unique conversion of notations (regular single-strands)

● Substructure and descriptor generation methods

● Programs to maintain and screen Axeleratio's in-house bibliography of CurlySMILES-tagged literature, including nano-device and polymer publications.


Transformation of a CurlySMILES notation based on node ranks

Example: poly[(2-propyl-1,3-dioxane-4,6-diyl)methylene]

Entered: C1{-}OC(CCC)OC(C1)C{+n}

Unique: C1{-}CC(OC(CCC)O1)C{+n}

The CH2 ring node ranks lower than the left O node; the CH2 tail node ranks higher than the right O node.


Uniqueness depending on selection of head/tail (H/T) pair

O{-}CC{+n} C{-}OC{+n} C{-}CO{+n}

poly(oxyethylene)

Nomenclature-conform selection of head and tail nodes is recommended in polymer encoding.

[see examples of unique notations for regular single-strand polymers]
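The three poly(oxyethylene) notations differ only in where the chain is cut, which can be reproduced by rotating the SRU atom sequence. This is a toy sketch, not the CANGEN-based method of the CurlySMILES modules: it handles only unbranched SRUs written with one-letter atom symbols, ignores chain reversal, and the "heteroatom first" selection rule is a simplified stand-in for the nomenclature seniority rules.

```python
def head_tail_variants(atoms):
    """Return one notation per possible head node of a linear SRU.

    atoms : string of one-letter atom symbols, e.g. "OCC"
    """
    variants = []
    for shift in range(len(atoms)):
        rotated = atoms[shift:] + atoms[:shift]
        head, rest = rotated[0], rotated[1:]
        variants.append(head + "{-}" + rest + "{+n}")
    return variants


def heteroatom_first(variants):
    """Pick a variant whose head node is not carbon, if there is one.

    Simplified illustration of a selection rule; ties are broken
    alphabetically.
    """
    return min(variants, key=lambda v: (v[0] == "C", v))


variants = head_tail_variants("OCC")
print(variants)
print(heteroatom_first(variants))
```

For poly(oxyethylene) the rule selects O{-}CC{+n}, the notation whose head node matches the "oxy" prefix of the name.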


Task-specific integration of CurlySMILES modules

● Interfacing polymer structure (input/output)
    • Form-to-notation editors
    • Notation-to-sketch and notation-to-query software

● Pipelining polymer data (data administration)
    • Automatic ranking and comparison of structure/data pairs
    • Screening of structured lists and repositories

● Generating virtual libraries
    • Automatically building lists of polymer notations for QSPR analysis and identification of optimal-design candidates


Application to polymer data mining: “nurturing the mine sites”

SRU-based CurlySMILES notations in unique form are identifiers of macromolecules and polymer systems that can be employed to

• function as search keys in database applications,

• tag factsheets, notes and bibliographic entries,

• populate spreadsheet cells and XML text nodes,

• index and abstract the polymer literature & patents,

• create ontologies that organize polymer information, which can be shared via Semantic Web technologies.


Application to polymer data mining: search and data extraction

The CurlySMILES language has a rich and extensible dictionary to encode polymers in diverse contexts and at various levels of detail.

Notations work both ways: as precise data annotations and as query formulations for “needle-in-a-haystack” requests.

Today's polymer knowledge systems are not marked up by CurlySMILES. But client-server mediation can be achieved, behind-the-scenes, via CurlySMILES code to

• compact polymer input provided through entry forms,

• expand notations into query language formats.


Application to polymer modeling

CurlySMILES representations of polymer systems contain detailed structural information to derive macromolecular descriptors and substructures (groups) as entry points for property prediction and model development:

• Structure-property relationships (QSPRs, GCMs)

• SRU similarity (kNN and pattern recognition methods)

• MC & MD simulations (flexibility, solution behavior)

• Backbone modeling (polymer stability & degradation)

• Kinetic & ab initio methods (controlled polymerization)


Application to polymer design

Specialty polymers must meet multifaceted requirements (multi-dimensional property windows).

The virtual design of polymers by permutationally building (co)polymers (or blends) from systematically varied monomer structures often results in large libraries of structurally related polymers with predictable properties.

The automatic generation of the polymer structures of such libraries as compact CurlySMILES notations, together with the implementation of predictive methods for the desired properties, will allow virtual high-throughput screening to initiate the synthesis of potential candidates.
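A very small virtual library can be generated by substituting groups into an SRU template and combining the results pairwise. The sketch below is illustrative only: the template, the substituent list and all names (`SUBSTITUENTS`, `vinyl_polymer`, `library`, `blends`) are assumptions for this example, and the generated notations are not converted to unique form.

```python
from itertools import combinations

# Substituent name -> SMILES fragment (illustrative selection)
SUBSTITUENTS = {
    "hydroxy": "O",
    "cyano": "C#N",
    "phenyl": "c1ccccc1",
}


def vinyl_polymer(substituent):
    """Notation of a poly(1-substituted ethylene): head CH, tail CH2."""
    return "C{-}(" + substituent + ")C{+n}"


# Homopolymer library keyed by substituent name
library = {name: vinyl_polymer(frag) for name, frag in SUBSTITUENTS.items()}

# All binary blends, encoded with '.' and the {mx} mixture annotation
blends = {
    (a, b): library[a] + "." + library[b] + "{mx}"
    for a, b in combinations(sorted(library), 2)
}

for name, notation in library.items():
    print(name, notation)
for pair, notation in blends.items():
    print(pair, notation)
```

Each generated notation could then serve as the key under which predicted property values are stored during screening.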


Summary & Outlook

Done

● SRU annotations to encode polymers

● Polymer description grammar

● Python implementation

To Do

● Stereochemical descriptions

● Unique notations for nested polymers

● Conquering polymer space

Topics to be addressed for CurlySMILES applications

● Representation and iterative development of models for structure/property estimation

● Extension to advanced architectures: dendrimers, 3D polymers and nanostructure designs combining polymers with carbon nanotubes and fullerene-based bowls and cages