Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
TheNOMAD(NovelMaterialsDiscovery)
Laboratory–aEuropeanCentreofExcellence
Deliverable6.2
FinalreportontheHPCplatform
DeliverableNo:6.2
ExpectedDeliveryDate:31/10/2018,M36ActualDeliveryDate:20/11/2018,M37
LeadBeneficiary:CSC-ITCentreforScienceLtd.
ContributingBeneficiaries:MaxPlanckSociety(MPG),BarcelonaSupercomputingCentre(BSC),Leibniz-Rechenzentrum(LRZ),Humboldt-UniversitätzuBerlin(HUB),AaltoUniveristy
(AALTO)
Authors:AtteSillanpää(CSC),T.Zastrow,H.Lederer(MPG-MPCDF),LauriHimanen(Aalto),RubenGarcía-Hernández(LRZ)
NOMAD–ProjectNo.676580
Copyright 2016/2017/2018 by theNOMADConsortium. The information in this document is proprietary totheNOMADConsortium.This document contains preliminary information and is not subject to any license agreement or any otheragreementwiththeNOMADConsortium.Thisdocumentcontainsonlyintendedstrategies,developments,andfunctionalitiesandisnotintendedtobebinding upon to any particular course of business, product strategy, and/or development oftheNOMADConsortium.TheNOMADConsortiumassume no responsibility for errors or omissions in this document. Furthermore,theNOMADConsortiumdoes not warrant the accuracy or completeness of the information, text, graphics,links,orotheritemscontainedwithinthismaterial.Thisdocumentisprovidedwithoutawarrantyofanykind,eitherexpressor implied, includingbutnot limitedtothe impliedwarrantiesofmerchantability, fitnessforaparticular purpose, or non-infringement. TheNOMADConsortiumshall have noliabilityfor damages of anykind includingwithout limitationdirect, special, indirect,orconsequentialdamages thatmayresult fromtheuse of these materials. This limitation shall not apply in cases of intent or gross negligence. Thestatutoryliabilityforpersonalinjuryanddefectiveproductsisnotaffected.Inaddition,thematerialspresentedandviewsexpressedherearetheresponsibilityoftheauthor(s)only.TheEUCommissiontakesnoresponsibilityforanyusemadeoftheinformationsetout.D6.2FinalreportontheHPCplatform 2
ExecutiveSummaryThroughworkpackage (WP)6, theNOMADtechnologyplatformhasbeendesigned, implemented,deployedandputintooperation.Afterdefininganddevelopingthenecessaryoveralldataworkflow,adevelopmentplatformwasfirstputinplaceforallpartnerstostartwiththeirspecificdevelopmentwork. After development of the first NOMAD services from otherWPs, a production platform forservingtheenduserwasinstalledandoperatedbyWP6.Thisproductionplatformhasbeenfurtheroptimizedinlinewithuserneeds,securityandredundancyrequirements.AnessentialfeatureoftheNOMAD technology platform is the centralized usermanagementwith single sign on capability. Amaster instance of the NOMAD technology platform is installed and operated at MPG-MPCDF inGermany,whileacloneofselectedpartsoftheNOMADtechnologyplatformisoperatedatCSC inFinland. The modular design of the NOMAD technology platform allows for easy installation andoperationofadditionalinstancesatfurtherlocations.
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 3
TABLEOFCONTENTS1 Introduction................................................................................................................4
2 TechnologyPlatformanditsEvolution........................................................................42.1 EvolutionofthePlatform.................................................................................................42.2 TheFinalTechnologyPlatform.........................................................................................4
2.2.1 VirtualMachinesandContainers....................................................................................62.2.2 Storage............................................................................................................................72.2.3 Authentication&authorization.......................................................................................82.2.4 Monitoring.....................................................................................................................102.2.5 DeploymentoftheNOMADproductionplatform.........................................................10
2.3 TechnologyPlatformoptimization..................................................................................102.3.1 LoweringtheBarrierforAccessingNOMADServices...................................................102.3.2 SecurityAspects............................................................................................................112.3.3 SoftwareDeploymentAutomation...............................................................................112.3.4 ExtendedComputeservices..........................................................................................122.3.5 ImprovingtheEncyclopediaIngestingProcess.............................................................12
3 LessonsLearnedandConclusion................................................................................13
RevisionHistory
Version1.0,submitted20/11/2018
Originalversion
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 4
1 IntroductionTheobjectivesofWP6wereto:
• define and develop the necessary overall data workflow for theNOMAD Laboratory CoE,withinputfromtherespectivescientificpartners,
• deployandoptimizeitsrespectiveapplicationswithafocusonHPCsystemsandmapthemtoanadequateinfrastructure,sothateffortandresourcesinWP6thussupportallscientificWPs,and
• operatetheunderlyinginfrastructurefortheMaterialsSciencecommunity.
Thisdeliverable,D6.2‘FinalreportonHPCplatform’,describesthetechnologyplatformdeliveredattheendoftheproject.Itexploreshowtheplatformchangedandevolvedduringtheprojecttomeetuserneeds,andwhatlessonscanbelearntforfutureprojectsinsimilardomains.
2 TechnologyPlatformanditsEvolution
2.1 EvolutionofthePlatform
Thedesignand implementationof thedevelopmentplatformwas initiatedat theprojectstartandcarried out as described in detail in D6.1 ‘HPC platform requirements and architecture report’.Essentialrequirementsandfeaturesweretosupportthe:
a) needsofWP1forthecreationofthenormalizeddatafromtheRepositorydata,b) implementationoftheNOMADEncyclopediabyWP2,c) integrationofvisualizationservicesbyWP3,andd) Big-Dataanalytics,includingmachine-learningapproaches,byWP4.
Accordingtothisdesign,themastersetupofthecompletedevelopmentplatformwasimplementedbyanddeployedatMPG-MPCDF.Thissetupishighlyflexibleandcanbeeasilyadaptedtoincreasingneedsboth for increased resourceandsoftware requirementsand -optional -extensions toothercomputingcenters.
2.2 TheFinalTechnologyPlatform
The final technology platform evolved as a production platform from the development platform,according to the requirements. The production platform was implemented in addition to thedevelopment platform according to Figure 1. The development platform has been maintainedalongsidetheproductionplatformforon-goingdevelopmentandtestingpurposes.
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 5
Figure1:Evolutionoftheproductionplatformfromthedevelopmentplatform.
NOMAD services are directly accessed with a web browser without the need for any local clientinstallations. Inaddition, it isalsopossibletoaccesse.g. theEncyclopediaAPIprogrammaticallyviatheRESTfulinterfaceandtheNOMADRepositoryandArchivecontentsdirectly.Someoftheservicesare accessible without authentication, while some require it. The common user management forusingNOMADservicesisdescribedinsection2.2.3.
Thus,fromauserperspective,mostoftheservicescanbeaccessedeitherfromthemainwebsiteorsites linkedto it,while thebackendallows forusingdistributedresources fordeployment.The fullset of services has been put into production at MPG-MPCDF where the largest part of NOMADhardware resourceshasbeenprovided. Through the flexibledesign, the system is ready for setupalsoatotherNOMADsides,asawholeoronlyinparts.Thisgivesadditionalflexibilitytosettingupthe services locally, reduces the capacity requirements, adds redundancyand resilienceand finallylowersthebarriertodeploytheservices.ThisalsomakesiteasierforscientistsoutsideofNOMADtotakesometoolsforlocaldevelopments.ThecurrentsituationofdeploymentofdifferentservicesisshowninFigure2.
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 6
Figure2:CurrentInfrastructuresetup.Themainwebpageisanentrypointtoallservices.TheEncyclopediaAPIbackend runsbothatMPG-MPCDFandatCSC. The loadbalancer redirectsqueries to them.Theblueservicesareonline,thegreyservicesareon"standby",i.e.thefunctionalityexists,butresourceshavenotbeenallocatedandgettingthemonlinerequiressomemanualwork.
Theproductionplatformrunsthefollowingservices:
• TheNOMADRepository -thedatabaseincludingtheuploadedmaterialssciencecalculationinputandoutputs,theGUIandAPItoqueryanduploadthedataandthelinksforwardtotheWP1workflow.
• The parsing infrastructure (WP1) - the scientific code-specific libraries that convert theuploadsintocode-independentdataintheNOMADArchive.
• TheArchive-databaseprovidingaccesstothecode-independentmaterialsdata.• TheEncyclopediaGUIandAPI(WP2),includingtheloadbalancer.• The Remote Visualization service (WP3) providing GPGPU powered visualization service of
local high volume data directly via a browser and functionality to convert data for VRviewing.
• TheAnalyticsToolkit(WP4)withBig-DataAnalyticsNotebooksforqueryingandanalyzingthematerialsdatafromtheArchive,and
• IdentityProvider-userregistrationandauthenticationservice.
Due to the number of services and distributed character of the NOMAD Laboratory CoEinfrastructure,a"SingleSignOn”(SSO)Systemforexternalusersbecamenecessary(see2.2.3).AnSSOhandlesuseraccountsinacentralservice("IdentityProvider"(IDP))andvalidatesuserloginsforanyNOMADservice("ServiceProvider"(SP)).
2.2.1 VirtualMachinesandContainers
Therequirementforeasyscalabilityhasbeenmetbyextensiveusageofvirtualmachines(VMs)andcontainersorchestratedwithKubernetes,OpenStack,Ansibleandotherscripts.Figure3presentsthedeploymentofdifferentNOMADservices.TheRepositoryrunsdirectlyonVMs,whereastheparsinginfrastructure (WP1), the remote visualization (WP3) and the Analytics Toolkit (WP4) all run onDockercontainers,managedbyasharedKubernetescluster.Wheneverauserlogsin,asetofnew
WP1Parsing
WP2Encyclopedia
WP3Remote
Visualization
WP4BigDataAnalysis
CentOS VMsMPG-MPCDF
vGPUs
User,browser
www.nomad-coe.eu
Repository
WP4BigDataAnalysis
WP2Encyclopedia
WP3Remote
Visualization
WP2Encyclopedia
WP3Remote
Visualization
WP4BigDataAnalysis
CentOS VMsCSC
vGPUs
Load balancer
IdPSSO
Inoperation
Stand by
Repository(Raw data)
Archive(Normalized)
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 7
containersaredeployedbyKubernetes.TheEncyclopediausesacombinationofbothtechnologies:itis runondedicatedVMs,where itsdifferentservicesareencapsulated inDockercontainers.Theseservices are the GUI-webserver, the API, and the database for each production system, which isextendedbythedataprocessingforthedevelopmentsystems.
Containers: The use of Kubernetes as an orchestration engine forDocker containers allows for aneasy future development and deployment path. It has become widely used and is likely to besupported in the future. During this reporting period also the Encyclopedia, VR-conversion scriptsand Remote Visualization Services have been migrated to run on Docker containers, furthersimplifying the Infrastructure. This flexibility allows for optimal use of existing heterogeneousresourcesatparticipatingHPCsiteswithouttheneedforallsitessetupan identical infrastructure.FromanHPC-center point of view, the cloud (VMs) and container solutions are also preferred, asthey enable dynamic scaling of the committed resources, thus minimizing idle time and makingefficientuseofresources.
VMs:ThefirstNOMADVMsrunningatMPG-MPCDFwerebasedonSLES.ThefirstservicesrunningatCSCcloudinfrastructurewerebuiltonCentOSVMs.Toharmonizethebuildandmanagementscripts,allVMsweremigrated to supportCentOS. It is awell-supported flavorwithout license constraintsandthereforecanbeusedatanysitewithoutadditionalcosts.
Figure3:NOMADLaboratoryCoEInfrastructurePlatform.TheplatformisbasedonVMs.TheRepositoryrunsdirectly on VMs, whereas the Encyclopedia services (WP2) run on containers on separate VMs, and theparsing infrastructure (WP1), the Remote Visualization (WP3) and theAnalytics Toolkit (WP4) all run onDockercontainers,managedbyasharedKubernetescluster.
2.2.2 Storage
Adifferentiationbetweentwokindsofstoragehasbeenmade:
• Storage for developers on the development platform: storage resources for storing andcreating the data on which theNOMAD Laboratory CoE is based on; this includes sharedstoragesystemsaswellaslocalstorages,mounteddirectlytoserversorVMs,and
• Storagefortheproductionplatform:datamadeavailablethroughtheproductionplatformisseparate from the storage of the development platform (MPG-MPCDF) or a separate filesystem(CSC)andhasread-onlyaccessviathepublicNOMADservices;throughtheNOMADtools (for analysis, visualizations etc.), the data produced by external users is available forfurtherNOMADservicesonaseparatefilesystem.
The200TiBGPFSstorageatMPG-MPCDFhasbeensuitableasthesharedstorageofthetechnologyplatform.Localstorageonthededicatedservershasbeenincreaseduponrequeststoaround10TBper server. Special storage solutions have also been used; for example, SSD disk has been
CentOS VMs
Kubernetes Cluster
vGPUs
WP1Container
WP4Container
WP4Container
WP2GUIContainer
RepositoryVM
WP3Container
SLESVMs CentOS VMs
WP1Container
WP4Container
WP4Container
…
…WP2APIContainer
CentOS VMs CentOS VMs
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 8
implementedfortheEncyclopediadatabaseontheproductionplatformtospeedupthequerytimes.Additional storage capacity was added to the platforms to address the significant growth of theNOMADRepositoryandtheotherdatabasescreatedfromtheRepository.
Forexample,thelargestdatacollectionistheRepository.Thenumberofcontributionsuploadedtoithas grown from 634,001 (01 Oct 2015) to 1,235,389 (30 Oct 2015 orM1) to 3,407,035 (M18) to6.240.929 (M36). The disk footprint of the uploaded data has grown from 0.8 TB to 9.9 TB,respectively.Asdescribed in theDataManagementPlan,WPs1 - 4use a slightly augmentedanddifferentlycompressedversionofthisdata.TheEncyclopediahasgrownaccordingly.
2.2.3 Authentication&authorization
Authentication for developers on the development platform has been handled by the alreadyexistinglocalusermanagementsystemsofthecomputingcenters.Theproductionplatformrequireda separate user management solution and several already existing Single Sign On (SSO) solutionswere considered and evaluated. The only solution that could cater to all developer needs andNOMADuserrequirementswasbasedonMPASSid1,whichinturnisbasedonShibbolethandSAML.TheMPASSidIdPproxywasusedtobuildtheNOMADusermanagementsystem.
Figure4:NOMADIdentityProviderlandingpageforlogginginormanagingtheaccount.
TheimplementedSSOsystemconsistsoftwoparts:(i)anIdentifyProvider(IdP),whichisresponsiblefor the whole user account management: registration and login procedure, and (ii) an arbitrarynumberofServiceProviders(SP).TheuserinterfaceisshowninFigure4.TheseSPsaretheNOMAD
1https://mpass.fi/english/
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 9
applications,whichusetheIdPforauthenticationandauthorizationoftheusersasshowninFigure5.
Figure5WorkflowinaSSOsystem
The NOMAD IdP was developed, operated and modified according to the needs of WPs 2 - 4,althoughtheintegrationoftheRepositoryusermanagementwiththeIdPisstillinprogress.TheIdPcomeswitha linked localLDAPforstoringtheuserdata.Aself-managedsystemgives flexibility tocreatenewuserattributesasneeded.AsMPASSidisbasedonShibbolethitiscompatiblewithmostacademicauthenticationfederations(e.g.HAKA2andSwitch3)andcanlaterbelinkedtoEuropeanornationalAAIinfrastructures.Italsoallowslinkingtosocialmediaaccountsmakingiteasierforuserstoauthenticate.
Uponrequest fromWP4developers, separatecontainerandVMdeploymentsof the IdPcompletewith its own LDAP to manage user data were created. The motivation was firstly to helpdevelopmentworkbyallowingaccess toanon-production IdPserver.Suchan IdPcouldbesetupinside the local development infrastructure, thereby removing the need to configure tunnelingaccess outside the development system firewalls. Another benefit was that this further facilitatessettinguptheNOMADsystematanewsite(e.g.inaprivatecompany)thatdoesnotwanttoaccessacommon IdP for intellectualproperty (IP)orsecurity reasons.The IdPcontainer ismanagedwithAnsiblescriptswhichhelpskeepingthemaintenanceandmanagementworklow.
Useraccountcreationandmanagementareself-serviceviaemailconfirmation,butattherequestofthe IndustryAdvisory Committee (IAC) and other industry colleagues, a guestmodehas also beenconfigured.Theguestmodeallowsaccesswithoutdisclosinganypersonalinformationtothesystem.Aftersuccessfulauthentication,theIdPwillasserttheuseridentitytotherequestingserviceusingaSAMLtoken.IndividualNOMADrequests,thatmostlytakeplaceviatheRESTAPI,includethetokenintheirmessages,therebyachievinguseridentificationandauthentication.AllNOMADservicescanutilize the same token, sonoadditional action from theuser isneeded toaccess the full rangeofNOMADservices.TheIdPrunsonavirtualmachineintheCSCcloudplatformandwhenneeded,aredundantIdPcaneasilybeaddedatanothersiteorserver.
2https://wiki.eduuni.fi/display/CSCHAKA/Federation3https://www.switch.ch/aai/about/federation/
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 10
2.2.4 Monitoring
The involved computing centers monitor the availability of the NOMAD infrastructure. In case offutureregulationfortheaccesstoNOMADresources,monitoringatservicelevelwillbeusedasthebasisforaccounting.
Regarding the actual services, themost important resource tomonitor has been the storage. ThecomputingcapacityonthedeployedserversatMPG-MPCDFhassofarbeenadequatefortheusage.
Accountingandlimitingofresourcesforexternalusagecanbedoneintwoways:
• usersaregrantedaccesstoaplatformwheretheNOMADservicesareinstalledandaregivenadedicatedamountofresources(CPUs,RAM,storage,etc),or
• the NOMAD applications themselves track resource use of the user and potentially limitfurtherusageaboveasetquota.
2.2.5 DeploymentoftheNOMADproductionplatform
TheNOMADRepositoryisthemainsourcefordataintheNOMADLaboratoryCoE.Currently,itistheonly placewhereusers canuploadmaterials sciencedata in theNOMADecosystem. TheNOMADRepository isoperatedatMPG-MPCDF.Thedataof theNOMADRepository isnormalizedandthenmadeavailablefortheNOMADCoEapplications–theNOMADLaboratory.ThemasterinstallationintheNOMADCoEwasrealizedatMPG-MPCDFwhileasecondinstancehasbeendeployedatCSC.
Both normalized data and metadata can be synchronized by the respective service operatorsbetweenthetwositesviaconventionalcommandlinetoolsasshowninFigure6.
Figure 6: Data flow between the Master site (MPG-MPCDF) and a remote site (CSC) in the distributedNOMADinfrastructure.
2.3 TechnologyPlatformoptimization
2.3.1 LoweringtheBarrierforAccessingNOMADServices
StepwiseaccesstoNOMADservicesforeaseofuse:
MPCDF,dedicated VMs
EncyclopediaDBMaster
Repository(raw data,ZIP
Archive)Master
CSC,OpenStack
EncyclopediaDBCopy
NOMADArchive(Normalized)
copy
NOMADArchive(normalized)
Master
rsync
Manual (batch)Transfer
WP1,parsing
WP2,manual
User data,local User data,local
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 11
1) Thefirststep,tryingoutthewebservice,canbedonewithananonymousaccount.Onlyawebbrowserisrequired.
2) Moreservicesandresourcescanbeaccessedandstoredeasilyandwithoutcostbycreatinga NOMAD user account. Only a working email address is required. Users can also accesstheseadditionalservicesthroughawebbrowseralone,asalltheprocessesrunonNOMADservers.
3) ThemodularandportabledesignofthetechnologyplatformenablesuserstosetupselectedNOMADservices locallyat theirownsite.This improves IP securityandenables theuseofexisting computational resources on site. However, few institutions would invest in theinstallationprocesswithout firstbeingconvincedthattheservice isuseful (i.e. through1-2above).
4) TheNOMADIdPallowscreatingadditionalattributes,e.g.forVIPstatusforresearcherswhoneed extensive computational resources, but do not want a local installation. This statuscouldallowadditionalresourcestobedeployed,perhapsforafee.
2.3.2 SecurityAspects
TheNOMADLaboratoryCoEservicesareweb-basedandaccessiblefromanywherebyanypotentialuser.Thissetuprequiresahighattentiononsecurity-relatedissues.
The computing centers host the NOMAD Laboratory CoE services. At the level of hardware andoperating systems, the computing centerswill take careofupcoming security issues. This includesregular checks for necessary updates of the operating system and the installed basic software.BackupcapabilitieshavebeenprovidedforallNOMADrelateddata,configurationsandapplicationcode. For example, MPG-MPCDF offers backup functionality on the GPFS cluster it operates forNOMADRepositoryandotherNOMADservices:a fileset ("/nomad/backup")which isautomaticallybackedupontapesviaTivoliStorageManager.
At the application level, the responsibility is on the developers of theNOMAD services. From theinfrastructurepointofview,usingcontinuousdeploymentandlargelyautomateddeploymentscriptsfacilitate frequent security updates and rapid fixing of reported vulnerabilities, aswell as periodiccleaninstallsfromthesource.
2.3.3 SoftwareDeploymentAutomation
The provisioning of VMs, configurations and application setups in the NOMAD InfrastructurePlatformhasbeenautomated,wherepossible,with toolssuchasAnsible,which isa free-softwareplatform for configuring and managing servers.4 Ansible scripts have been used for different usecases:
• Basic level (throughcomputingcenters): setupofoperating systemandbasic tools;willbeused to duplicate (virtual) machines across computing centers, set up and configureKubernetesclustersortheIdPfunctionality.
• Applicationlevel(throughNOMADservicedevelopers):installationandconfigurationofuserspaceapplicationslikeKubernetes,databases,applicationserversetc.
4http://docs.ansible.com/
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 12
During theproject,MPG-MPCDFhasoffered itsGitLabplatform forusebyallNOMADdevelopers.GitLab is in firstplacea versioning system fordistributedcodedevelopmenton thebasisofGit, aversioncontrolsystemthatcanbeusedforsoftwaredevelopmentandotherversioncontroltasks.Besidesthisbasicfunctionality,theGitLabsoftwareoffersindividualwikis,issuetrackersandabroadanddeep integrationwith furtherdevelopment tools (e.g. IntegratedDevelopmentEnvironments),whichwereallutilizedintheCoE.ContinuousintegrationwasalsoconfiguredtobepossibledirectlyfromwithinGitLab(seeFig.7).Somedevelopmentteamslinkedthisupwithfurthercommunicationtools. For example, a push to the master branch of the Encyclopedia GUI repository causes anautomatic deploy to the developmentwebserver, and at the same time a notification is sent to achannel inSlack,sothatotherdevelopersare immediately informedthat therehasbeenachangetheymightwanttocheckout.
Figure7:GitLabserviceatMPG-MPCDF
2.3.4 ExtendedComputeservices
In addition to the handling of computing needs through the NOMAD computing centers, HPCresourceswereofferedbyPRACEtobeabletofillingapsinthesimulationdataintheRepository,ifrequired. Inparticular, intheearlyphasesoftheparserdevelopment,therewasfrequentneedforheavy re-parsing of NOMAD data. MPG-MPCDF provided direct HPC access that could be usedinsteadofthesmallerscalededicatedresources,whicharemoresuitableforthereactiveparsingofnewlyuploadeddata into theRepository. Ifneeded in future, the resourcesatNOMADcomputingcenters can be made available for parsing the data in a geographically distributed way andsynchronizetheresultstotheNOMADArchive.
2.3.5 ImprovingtheEncyclopediaIngestingProcess
Inthefirsthalfoftheproject,theEncyclopediaingestionprocesswasidentifiedasabottleneck.Thefirst solution to improve the performance was to restructure the database schema to allow forparallelizing the process. This improved the performance, but with ever growing data volumes,eventuallyadifferentapproachwasneeded.
NOMAD–ProjectNo.676580
D6.2FinalreportontheHPCplatform 13
ThehighlyhierarchicalnatureoftheNOMADMetaInfo,frequentupdatesonthedatabaseschemaandhighinsertionvolumesmadetheoriginalrelationalPostgreSQL-basedsolutioninconvenientfortheEncyclopediadevelopmentteam.Toremedytheseproblems,WP6hasassistedtheEncyclopediateam in transitioning from an SQL-based database solution to a schemaless NoSQL environmentbasedonMongoDB.
Population of the Encyclopedia database is done by systematically crawling through the user-uploaded calculations stored in theNOMAD Archive. As described above, this taskwas efficientlyparallelized,butthefollowinginsertionstepspeedoftheSQLdatabasedidnotscalesufficientlyasmoreprocesseswereaddedtothistask.Thisbottleneckcouldbeavoidedbystoringtheinformationas single nested data entries in themore suitableNoSQL alternative.Nested JSON-documents arethusfastertostorebutalsotoserve.MongoDBalsosimplifiestheservingofEncyclopediacontentsas the documents are stored in the JSON format that can be directly served through a web-API.Additionally,theschemalessapproachprovidedbyMongoDBallowsmorerapidprototypingasthereisnoneedtoupdatethedatabaseschemaforeverychangeinthedatalayout.
ThemigrationtotheNoSQLenvironmenthasbeengradualbyfirstmakingaparallelimplementationthatworksalongsidetheolddatabase.Theproductionenvironment isscheduledtobeupdatedtousethenewdatabaseassoonasotherpartsoftheservice,includingtheAPI,havebeenupdatedtofullysupportthechanges.
3 LessonsLearnedandConclusionAt the start of the project, the first aimwas to try to design a full set ofweb-based services in adistributedtechnologyplatform.Overthecourseoftheimplementationofthedifferentpartsofthetechnology platform, it turned out that only parts of the NOMAD infrastructure needed to bedistributedoverNOMADcomputingcenters.AmasterinstanceoftheNOMADtechnologyplatformwas implementedascloseaspossibletothe largestdatabase,theNOMADRepository,withatightintegration of essential resources. Further instances of the NOMAD infrastructure can be easilyclonedtoothercomputingcentersforeaseofusefortheirlocalusers.OnecloneofessentialpartsoftheNOMADtechnologyplatformhasbeensuccessfullyinstalledatCSC.
WP6hasproventobeofessentialimportanceforthedesigning,implementationanddeploymentoftheNOMADtechnologyplatformthroughacollaborativeeffortofthetechnicalexpertsinvolved.