INTRODUCTION TO PYTHON

Gregor von Laszewski

(c) Gregor von Laszewski, 2018, 2019

1 PREFACE
1.1 Disclaimer ☁ 1.1.1 Acknowledgment 1.1.2 Extensions

2 INTRODUCTION
2.1 Introduction to Python ☁ 2.1.1 References

3 INSTALLATION
3.1 Python 3.7.4 Installation ☁ 3.1.1 Hardware 3.1.2 Prerequisites Ubuntu 19.04 3.1.3 Prerequisites macOS 3.1.3.1 Installation from Apple App Store 3.1.3.2 Installation from python.org 3.1.3.3 Installation from Homebrew
3.1.4 Prerequisites Ubuntu 18.04 3.1.5 Prerequisite Windows 10 3.1.5.1 Linux Subsystem Install
3.1.6 Prerequisite venv 3.1.7 Install Python 3.7 via Anaconda 3.1.7.1 Download conda installer 3.1.7.2 Install conda 3.1.7.3 Install Python 3.7.4 via conda
3.2 Multi-Version Python Installation ☁ 3.2.1 Disabling wrong python installs 3.2.2 Managing 2.7 and 3.7 Python Versions without Pyenv 3.2.3 Managing Multiple Python Versions with Pyenv 3.2.3.1 Installation pyenv via Homebrew 3.2.3.2 Install pyenv on Ubuntu 18.04 3.2.3.3 Using pyenv 3.2.3.3.1 Using pyenv to Install Different Python Versions 3.2.3.3.2 Switching Environments
3.2.3.4 Updating Python Version List 3.2.3.4.1 Updating to a new version of Python with pyenv
3.2.4 Anaconda and Miniconda and Conda 3.2.4.1 Miniconda 3.2.4.2 Anaconda
3.2.5 Exercises

4 FIRST STEPS
4.1 Interactive Python ☁ 4.1.1 REPL (Read Eval Print Loop) 4.1.2 Interpreter 4.1.3 Python 3 Features in Python 2
4.2 Editors ☁ 4.2.1 Pycharm 4.2.2 Python in 45 minutes
4.3 Google Colab ☁ 4.3.1 Introduction to Google Colab 4.3.2 Programming in Google Colab 4.3.3 Benchmarking in Google Colab with Cloudmesh

5 LANGUAGE
5.1 Language ☁ 5.1.1 Statements and Strings 5.1.2 Comments 5.1.3 Variables 5.1.4 Data Types 5.1.4.1 Booleans 5.1.4.2 Numbers
5.1.5 Module Management 5.1.5.1 Import Statement 5.1.5.2 The from ... import Statement
5.1.6 Date Time in Python 5.1.7 Control Statements 5.1.7.1 Comparison 5.1.7.2 Iteration
5.1.8 Datatypes 5.1.8.1 Lists 5.1.8.2 Sets 5.1.8.3 Removal and Testing for Membership in Sets 5.1.8.4 Dictionaries 5.1.8.5 Dictionary Keys and Values
5.1.8.6 Counting with Dictionaries 5.1.9 Functions 5.1.10 Classes 5.1.11 Modules 5.1.12 Lambda Expressions 5.1.12.1 map 5.1.12.2 dictionary
5.1.13 Iterators 5.1.14 Generators 5.1.14.1 Generators with function 5.1.14.2 Generators using for loop 5.1.14.3 Generators with List Comprehension 5.1.14.4 Why to use Generators?

6 CLOUDMESH
6.1 Introduction ☁ 6.2 Installation ☁ 6.2.1 Prerequisite 6.2.2 Basic Install
6.3 Output ☁ 6.3.1 Console 6.3.2 Banner 6.3.3 Heading 6.3.4 VERBOSE 6.3.5 Using print and pprint
6.4 Dictionaries ☁ 6.4.1 Dotdict 6.4.2 FlatDict 6.4.3 Printing Dicts
6.5 Shell ☁ 6.6 StopWatch ☁ 6.7 Cloudmesh Command Shell ☁ 6.7.1 CMD5 6.7.1.1 Resources 6.7.1.2 Installation from source 6.7.1.3 Execution 6.7.1.4 Create your own Extension 6.7.1.5 Bug: Quotes
6.8 Exercises ☁ 6.8.1 Cloudmesh Common 6.8.2 Cloudmesh Shell

7 LIBRARIES
7.1 Python Modules ☁ 7.1.1 Updating Pip 7.1.2 Using pip to Install Packages 7.1.3 GUI 7.1.3.1 GUIZero 7.1.3.2 Kivy
7.1.4 Formatting and Checking Python Code 7.1.5 Using autopep8 7.1.6 Writing Python 3 Compatible Code 7.1.7 Using Python on FutureSystems 7.1.8 Ecosystem 7.1.8.1 pypi 7.1.8.2 Alternative Installations
7.1.9 Resources 7.1.9.1 Jupyter Notebook Tutorials
7.1.10 Exercises 7.2 Data Management ☁ 7.2.1 Formats 7.2.1.1 Pickle 7.2.1.2 Text Files 7.2.1.3 CSV Files 7.2.1.4 Excel spreadsheets 7.2.1.5 YAML 7.2.1.6 JSON 7.2.1.7 XML 7.2.1.8 RDF 7.2.1.9 PDF 7.2.1.10 HTML 7.2.1.11 ConfigParser 7.2.1.12 ConfigDict
7.2.2 Encryption 7.2.3 Database Access 7.2.4 SQLite
7.2.4.1 Exercises 7.3 Plotting with matplotlib ☁ 7.4 DocOpts ☁ 7.5 OpenCV ☁ 7.5.1 Overview 7.5.2 Installation 7.5.3 A Simple Example 7.5.3.1 Loading an image 7.5.3.2 Displaying the image 7.5.3.3 Scaling and Rotation 7.5.3.4 Gray-scaling 7.5.3.5 Image Thresholding 7.5.3.6 Edge Detection
7.5.4 Additional Features 7.6 Secchi Disk ☁ 7.6.1 Setup for OSX 7.6.2 Step 1: Record the video 7.6.3 Step 2: Analyse the images from the Video 7.6.3.1 Image Thresholding 7.6.3.2 Edge Detection 7.6.3.3 Black and white

8 DATA
8.1 Data Formats ☁ 8.1.1 YAML 8.1.2 JSON 8.1.3 XML

9 MONGO
9.1 MongoDB in Python ☁ 9.1.1 Cloudmesh MongoDB Usage Quickstart 9.1.2 MongoDB 9.1.2.1 Installation 9.1.2.1.1 Installation procedure
9.1.2.2 Collections and Documents 9.1.2.2.1 Collection example 9.1.2.2.2 Document structure 9.1.2.2.3 Collection Operations
9.1.2.3 MongoDB Querying
9.1.2.3.1 Mongo Queries examples 9.1.2.4 MongoDB Basic Functions 9.1.2.4.1 Import/Export functions examples
9.1.2.5 Security Features 9.1.2.5.1 Collection based access control example
9.1.2.6 MongoDB Cloud Service 9.1.3 PyMongo 9.1.3.1 Installation 9.1.3.2 Dependencies 9.1.3.3 Running PyMongo with Mongo Daemon 9.1.3.4 Connecting to a database using MongoClient 9.1.3.5 Accessing Databases 9.1.3.6 Creating a Database 9.1.3.7 Inserting and Retrieving Documents (Querying) 9.1.3.8 Limiting Results 9.1.3.9 Updating Collection 9.1.3.10 Counting Documents 9.1.3.11 Indexing 9.1.3.12 Sorting 9.1.3.13 Aggregation 9.1.3.14 Deleting Documents from a Collection 9.1.3.15 Copying a Database 9.1.3.16 PyMongo Strengths
9.1.4 MongoEngine 9.1.4.1 Installation 9.1.4.2 Connecting to a database using MongoEngine 9.1.4.3 Querying using MongoEngine
9.1.5 Flask-PyMongo 9.1.5.1 Installation 9.1.5.2 Configuration 9.1.5.3 Connection to multiple databases/servers 9.1.5.4 Flask-PyMongo Methods 9.1.5.5 Additional Libraries 9.1.5.6 Classes and Wrappers
9.2 Mongoengine ☁ 9.2.1 Introduction 9.2.2 Install and connect
9.2.3 Basics

10 OTHER
10.1 Word Count with Parallel Python ☁ 10.1.1 Generating a Document Collection 10.1.2 Serial Implementation 10.1.3 Serial Implementation Using map and reduce 10.1.4 Parallel Implementation 10.1.5 Benchmarking 10.1.6 Exercises 10.1.7 References
10.2 NumPy ☁ 10.2.1 Installing NumPy 10.2.2 NumPy Basics 10.2.3 Data Types: The Basic Building Blocks 10.2.4 Arrays: Stringing Things Together 10.2.5 Matrices: An Array of Arrays 10.2.6 Slicing Arrays and Matrices 10.2.7 Useful Functions 10.2.8 Linear Algebra 10.2.9 NumPy Resources
10.3 Scipy ☁ 10.3.1 Introduction 10.3.2 References
10.4 Scikit-learn ☁ 10.4.1 Introduction to Scikit-learn 10.4.2 Installation 10.4.3 Supervised Learning 10.4.4 Unsupervised Learning 10.4.5 Building an end-to-end pipeline for Supervised machine learning using Scikit-learn 10.4.6 Steps for developing a machine learning model 10.4.7 Exploratory Data Analysis 10.4.7.1 Bar plot 10.4.7.2 Correlation between attributes 10.4.7.3 Histogram Analysis of dataset attributes 10.4.7.4 Box plot Analysis 10.4.7.5 Scatter plot Analysis
10.4.8 Data Cleansing - Removing Outliers 10.4.9 Pipeline Creation 10.4.9.1 Defining DataFrameSelector to separate Numerical and Categorical attributes 10.4.9.2 Feature Creation / Additional Feature Engineering
10.4.10 Creating Training and Testing datasets 10.4.11 Creating pipeline for numerical and categorical attributes 10.4.12 Selecting the algorithm to be applied 10.4.12.1 Linear Regression 10.4.12.2 Logistic Regression 10.4.12.3 Decision trees 10.4.12.4 KMeans 10.4.12.5 Support Vector Machines 10.4.12.6 Naive Bayes 10.4.12.7 Random Forest 10.4.12.8 Neural networks 10.4.12.9 Deep Learning using Keras 10.4.12.10 XGBoost
10.4.13 Scikit Cheat Sheet 10.4.14 Parameter Optimization 10.4.14.1 Hyperparameter optimization/tuning algorithms
10.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline) 10.4.15.1 Creating a parameter grid 10.4.15.2 Implementing Grid search with models and also creating metrics from each of the models 10.4.15.3 Results table from the Model evaluation with metrics 10.4.15.4 ROC AUC Score
10.4.16 K-means in scikit learn 10.4.16.1 Import
10.4.17 K-means Algorithm 10.4.17.1 Import 10.4.17.2 Create samples 10.4.17.3 Create samples 10.4.17.4 Visualize 10.4.17.5 Visualize
10.5 Dask - Random Forest Feature Detection ☁
10.5.1 Setup 10.5.2 Dataset 10.5.3 Detecting Features 10.5.3.1 Data Preparation
10.5.4 Random Forest 10.5.5 Acknowledgement
10.6 Parallel Computing in Python ☁ 10.6.1 Multi-threading in Python 10.6.1.1 Thread vs Threading 10.6.1.2 Locks
10.6.2 Multi-processing in Python 10.6.2.1 Process 10.6.2.2 Pool 10.6.2.2.1 Synchronous Pool.map() 10.6.2.2.2 Asynchronous Pool.map_async()
10.6.2.3 Locks 10.6.2.4 Process Communication 10.6.2.4.1 Value
10.7 Dask ☁ 10.7.1 How Dask Works 10.7.2 Dask Bag 10.7.3 Concurrency Features 10.7.4 Dask Array 10.7.5 Dask DataFrame 10.7.6 Dask DataFrame Storage 10.7.7 Links

11 APPLICATIONS
11.1 Fingerprint Matching ☁ 11.1.1 Overview 11.1.2 Objectives 11.1.3 Prerequisites 11.1.4 Implementation 11.1.5 Utility functions 11.1.6 Dataset 11.1.7 Data Model 11.1.7.1 Utilities 11.1.7.1.1 Checksum
11.1.7.1.2 Path 11.1.7.1.3 Image
11.1.7.2 Mindtct 11.1.7.3 Bozorth3 11.1.7.3.1 Running Bozorth3 11.1.7.3.1.1 One-to-one 11.1.7.3.1.2 One-to-many
11.1.8 Plotting 11.1.9 Putting it all Together
11.2 NIST Pedestrian and Face Detection ☁ 11.2.0.1 Introduction 11.2.0.1.1 INRIA Person Dataset 11.2.0.1.2 HOG with SVM model 11.2.0.1.3 Ansible Automation Tool
11.2.0.2 Deployment by Ansible 11.2.0.3 Cloudmesh for Provisioning 11.2.0.4 Roles Explained for Installation 11.2.0.4.1 Server groups for Masters/Slaves by Ansible inventory
11.2.0.5 Instructions for Deployment 11.2.0.5.1 Cloning Pedestrian Detection Repository from Github 11.2.0.5.2 Ansible Playbook
11.2.0.6 OpenCV in Python 11.2.0.6.1 Import cv2 11.2.0.6.2 Image Detection
11.2.0.7 Human and Face Detection in OpenCV 11.2.0.7.1 INRIA Person Dataset 11.2.0.7.2 Face Detection using Haar Cascades 11.2.0.7.3 Face Detection Python Code Snippet
11.2.0.8 Pedestrian Detection using HOG Descriptor 11.2.0.8.1 Python Code Snippet
11.2.0.9 Processing by Apache Spark 11.2.0.9.1 Parallelize in SparkContext 11.2.0.9.2 Map Function (apply_batch) 11.2.0.9.3 Collect Function
11.2.0.10 Results for 100+ images by Spark Cluster

12 REFERENCES

1 PREFACE

Sat Nov 23 05:25:16 EST 2019 ☁

1.1 DISCLAIMER ☁

This book has been generated with Cyberaide Bookmanager.

Bookmanager is a tool to create a publication from a number of sources on the internet. It is especially useful to create customized books, lecture notes, or handouts. Content is best integrated in markdown format as it is very fast to produce the output.

Bookmanager has been developed based on our experience over the last 3 years with a more sophisticated approach. Bookmanager takes the lessons from this approach and distributes a tool that can easily be used by others.

The following shields provide some information about it. Feel free to click on them.

pypi v0.2.28 | License Apache 2.0 | python 3.7 | format wheel | status stable | build unknown

1.1.1 Acknowledgment

If you use bookmanager to produce a document you must include the following acknowledgement.

"This document was produced with Cyberaide Bookmanager developed by Gregor von Laszewski available at https://pypi.python.org/pypi/cyberaide-bookmanager. It is the responsibility of the user to make sure an author acknowledgement section is included in your document. Copyright verification of content included in a book is the responsibility of the book editor."

The bibtex entry is

@Misc{www-cyberaide-bookmanager,
  author = {Gregor von Laszewski},
  title = {{Cyberaide Book Manager}},
  howpublished = {pypi},
  month = apr,
  year = 2019,
  url = {https://pypi.org/project/cyberaide-bookmanager/}
}

1.1.2 Extensions

We are happy to discuss with you bugs, issues and ideas for enhancements. Please use the convenient github issues at

https://github.com/cyberaide/bookmanager/issues

Please do not file issues with us that relate to an editor's book. They will provide you with their own mechanism on how to correct their content.

2 INTRODUCTION

2.1 INTRODUCTION TO PYTHON ☁

Learning Objectives

- Learn Python quickly under the assumption that you know a programming language
- Work with modules
- Understand docopts and cmd
- Conduct some Python examples to refresh your Python knowledge
- Learn about the map function in Python
- Learn how to start subprocesses and redirect their output
- Learn more advanced constructs such as multiprocessing and Queues
- Understand why we do not use anaconda
- Get familiar with pyenv

Portions of this lesson have been adapted from the official Python Tutorial, copyright Python Software Foundation.

Python is an easy to learn programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python's simple syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms. The Python interpreter and the extensive standard library are freely available in source or binary form for all major platforms from the Python Web site, https://www.python.org/, and may be freely distributed. The same site also contains distributions of and pointers to many free third party Python modules, programs and tools, and additional documentation. The Python interpreter can be extended with new functions and data types implemented in C or C++ (or other languages callable from C). Python is also suitable as an extension language for customizable applications.

Python is an interpreted, dynamic, high-level programming language suitable for a wide range of applications.

The philosophy of Python is summarized in The Zen of Python as follows:

- Explicit is better than implicit
- Simple is better than complex
- Complex is better than complicated
- Readability counts

The main features of Python are:

- Use of indentation whitespace to indicate blocks
- Object-oriented paradigm
- Dynamic typing
- Interpreted runtime
- Garbage collected memory management
- a large standard library
- a large repository of third-party libraries

Python is used by many companies and is applied for web development, scientific computing, embedded applications, artificial intelligence, software development, and information security, to name a few.

The material collected here introduces the reader to the basic concepts and features of the Python language and system. After you have worked through the material you will be able to:

- use Python
- use the interactive Python interface
- understand the basic syntax of Python
- write and run Python programs
- have an overview of the standard library
- install Python libraries using pyenv for multi-Python interpreter development

This material does not attempt to be comprehensive and cover every single feature, or even every commonly used feature. Instead, it introduces many of Python's most noteworthy features, and will give you a good idea of the language's flavor and style. After reading it, you will be able to read and write Python modules and programs, and you will be ready to learn more about the various Python library modules.

In order to conduct this lesson you need

- A computer with Python 2.7.16 or 3.7.4
- Familiarity with command line usage
- A text editor such as PyCharm, emacs, vi, or others. You should identify which works best for you and set it up.

2.1.1 References

Some important additional information can be found on the following Web pages.

- Python
- Pip
- Virtualenv
- NumPy
- SciPy
- Matplotlib
- Pandas
- pyenv
- PyCharm

Python module of the week is a Web site that provides a number of short examples on how to use some elementary Python modules. Not all modules are equally useful and you should decide if there are better alternatives. However, for beginners this site provides a number of good examples.

- Python 2: https://pymotw.com/2/
- Python 3: https://pymotw.com/3/

3 INSTALLATION

3.1 PYTHON 3.7.4 INSTALLATION ☁

Learning Objectives

- Learn how to install Python.
- Find additional information about Python.
- Make sure your computer supports Python.

In this section we explain how to install Python 3.7.4 on a computer. Likely much of the code will work with earlier versions, but we do the development in Python on the newest version of Python available at https://www.python.org/downloads.

3.1.1 Hardware

Python does not require any special hardware. We have installed Python not only on PCs and laptops, but also on Raspberry Pis and Lego Mindstorms.

However, there are some things to consider. If you use many programs on your desktop and run them all at the same time, you will find that even on up-to-date operating systems you quickly run out of memory. This is especially true if you use editors such as PyCharm, which we highly recommend. Furthermore, as you likely have lots of disk access, make sure to use a fast HDD or better an SSD.

A typical modern developer PC or laptop has 16GB RAM and an SSD. You can certainly do Python on a $35 Raspberry Pi, but you probably will not be able to run PyCharm. There are many alternative editors with a smaller memory footprint available.

3.1.2 Prerequisites Ubuntu 19.04

Python 3.7 is installed in Ubuntu 19.04. Therefore, it already fulfills the prerequisites. However, we recommend that you update to the newest version of Python and pip. Please visit: https://www.python.org/downloads

3.1.3 Prerequisites macOS

3.1.3.1 Installation from Apple App Store

You want a number of useful tools on your macOS. They are not installed by default, but are available via Xcode. First you need to install Xcode from

https://apps.apple.com/us/app/xcode/id497799835

Next you need to install the macOS Xcode command line tools:

$ xcode-select --install

3.1.3.2 Installation from python.org

The easiest installation of Python for cloudmesh is to use the installation from https://www.python.org/downloads. Please visit the page and follow the instructions. After this install you have python3 available from the command line.

3.1.3.3 Installation from Homebrew

An alternative installation is provided by Homebrew. To use this install method, you need to install Homebrew first, using the instructions on their web page:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then you should be able to install Python 3.7.4 using:

$ brew install python

3.1.4 Prerequisites Ubuntu 18.04

We recommend you update your Ubuntu version to 19.04 and follow the instructions for that version instead, as it is significantly easier. If you however are not able to do so, the following instructions may be helpful.

We first need to make sure that the correct version of Python 3 is installed. The default version of Python on Ubuntu 18.04 is 3.6. You can get the version with:

$ python3 --version

If the version is not 3.7.4 or newer, you can update it as follows:

$ sudo apt-get update
$ sudo apt install software-properties-common
$ sudo add-apt-repository ppa:deadsnakes/ppa
$ sudo apt-get install python3.7 python3-dev python3.7-dev

You can then check the installed version using python3.7 --version, which should be 3.7.4.

Now we will create a new virtual environment:

$ python3.7 -m venv --without-pip ~/ENV3

Then edit the ~/.bashrc file and add the following line at the end:

alias ENV3="source ~/ENV3/bin/activate"

Now activate the virtual environment using:

$ source ~/.bashrc
ENV3

Now you can install the pip for the virtual environment without conflicting with the native pip:

$ curl "https://bootstrap.pypa.io/get-pip.py" -o "get-pip.py"
$ python get-pip.py
$ rm get-pip.py

3.1.5 Prerequisite Windows 10

Python 3.7 can be installed on Windows 10 using: https://www.python.org/downloads

For 3.7.4 you can go to the download page and download one of the different files for Windows.

Let us assume you chose the Web based installer. Then you click on the file in the Edge browser (make sure the account you use has administrative privileges) and follow the instructions that the installer gives. Important is that you select at one point "[x] Add to Path". There will be an empty checkmark for this that you will click on.

Once it is installed, choose a terminal and execute

python --version

However, if you have installed conda for some reason, you need to read up on how to install Python 3.7.4 in conda or identify how to run conda and the python.org version at the same time. We often see others giving the wrong installation instructions.

An alternative is to use Python from within the Linux Subsystem. But that has some limitations and you will need to explore how to access the file system in the subsystem to have a smooth integration with your Windows host so you can for example use PyCharm.

3.1.5.1 Linux Subsystem Install

To activate the Linux Subsystem, please follow the instructions at

https://docs.microsoft.com/en-us/windows/wsl/install-win10

A suitable distribution would be

https://www.microsoft.com/en-us/p/ubuntu-1804-lts/9n9tngvndl3q?activetab=pivot:overviewtab

However, as it uses an older version of Python, you will have to update it.

3.1.6 Prerequisite venv

This step is highly recommended if you have not yet already installed a venv for Python, to make sure you are not interfering with your system Python. Not using a venv could have catastrophic consequences and a destruction of your operating system tools if they rely on Python. The use of venv is simple. For our purposes we assume that you use the directory:

~/ENV3

Follow these steps first:

First cd to your home directory. Then execute

$ cd ~
$ python3 -m venv ~/ENV3
$ source ~/ENV3/bin/activate

You can add at the end of your .bashrc (Ubuntu) or .bash_profile (macOS) file the line

source ~/ENV3/bin/activate

so the environment is always loaded. Now you are ready to install cloudmesh.

Check if you have the right version of Python installed with

$ python --version

To make sure you have an up to date version of pip, issue the command

$ pip install pip -U

3.1.7 Install Python 3.7 via Anaconda

3.1.7.1 Download conda installer

Miniconda is recommended here. Download an installer for Windows, macOS, and Linux from this page: https://docs.conda.io/en/latest/miniconda.html

3.1.7.2 Install conda

Follow the instructions to install conda for your operating system:

- Windows: https://conda.io/projects/conda/en/latest/user-guide/install/windows.html
- macOS: https://conda.io/projects/conda/en/latest/user-guide/install/macos.html
- Linux: https://conda.io/projects/conda/en/latest/user-guide/install/linux.html

3.1.7.3 Install Python 3.7.4 via conda

It is very important to make sure you have a newer version of pip installed. After you have installed conda, create the ENV3 environment, activate it, and install pip into it:

$ conda create -n ENV3 python=3.7.4
$ conda activate ENV3
$ conda install -c anaconda pip
$ conda deactivate ENV3
$ conda activate ENV3

If you like to activate it when you start a new terminal, please add the line conda activate ENV3 to your .bashrc or .bash_profile.

If you use zsh, please add it to .zprofile instead.

3.2 MULTI-VERSION PYTHON INSTALLATION ☁

Learning Objectives

- Understand why we need to worry about Python 3.7 and 2.7
- Use pyenv to support both versions
- Understand the limitations of anaconda/conda for developers

We are living at an interesting junction point in the development of Python. Since January 2019, Python developers are encouraged to switch from Python version 2.7 to Python version 3.7.

However, there may be the requirement that you still need to develop code not only in Python 3.7 but also in Python 2.7. To facilitate this multi-Python-version development, the best tool we know about capable of doing so is pyenv. We will explain in this section how to install both versions with the help of pyenv.

Python is easy to install and very good instructions for most platforms can be found on the python.org Web page. We see two different versions:


- Python 2.7.16
- Python 3.7.4

To manage Python modules, it is useful to have the pip package installation tool on your system.

We assume that you have a computer with Python installed. The version of Python however may not be the newest version. Please check with

$ python --version

which version of Python you run. If it is not the newest version, we use pyenv to install a newer version so you do not affect the default version of Python from your system.

3.2.1 Disabling wrong python installs

While working with students we have seen at times that they take other classes, either at universities or online, that teach them how to program in Python. Unfortunately, such classes often seem to ignore how to properly install Python. We recently had a student that had installed Python 7 different times on his macOS machine, while another student had 3 different installations, all of which conflicted with each other as they were not set up properly, and the students did not even realize that they were using Python incorrectly on their computer due to setup issues and conflicting libraries.

We recommend that you inspect if you have files such as ~/.bashrc or ~/.bash_profile in your home directory and identify if they activate various versions of Python on your computer. If so, you could try to deactivate them by commenting out the various versions with the # character at the beginning of the line, start a new terminal, and see if the terminal shell still works. Then you can follow our instructions here while using an install based on pyenv.
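To see which interpreter a given terminal actually picks up, a quick check from Python itself can help. This is a minimal sketch using only the standard library; the file name which_python.py is just an illustrative choice:

# which_python.py -- report which Python interpreter is currently active
import sys

print("Interpreter :", sys.executable)          # path of the python binary in use
print("Version     :", sys.version.split()[0])  # e.g. 3.7.4
print("Prefix      :", sys.prefix)              # base directory of this installation

Running it with python which_python.py in a fresh terminal quickly shows whether the interpreter comes from the system, from a venv, or from a stray anaconda install.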

3.2.2 Managing 2.7 and 3.7 Python Versions without Pyenv

If you need to have more than one Python version installed and do not want to or cannot use pyenv, we recommend you download and install Python 2.7.16 and 3.7.4 from python.org (https://www.python.org/downloads/).

You can then use either python2 or python3 to invoke the Python interpreter.

3.2.3 Managing Multiple Python Versions with Pyenv

Python has several versions that are used by the community. This includes Python 2 and Python 3, but also different management of the Python libraries. As each OS may have its own version of Python installed, it is recommended that you not modify that version. Instead you may want to create a localized Python installation that you as a user can modify. To do that we recommend pyenv. Pyenv allows users to switch between multiple versions of Python (https://github.com/yyuu/pyenv). To summarize, it allows:

- users to change the global Python version on a per-user basis;
- users to enable support for per-project Python versions;
- easy version changes without complex environment variable management;
- to search installed commands across different Python versions;
- to integrate with tox (https://tox.readthedocs.io/).

To install pyenv on your system you can use the command

$ curl https://pyenv.run | bash

Now you can install different Python versions on your system, such as Python 2.7 and 3.7, with a few commands:

$ pyenv install 3.7.4
$ pyenv install 2.7.16
$ pyenv virtualenv 3.7.4 ENV3
$ pyenv virtualenv 2.7.16 ENV2

To automatically access them from your shell we integrate them into bash by editing the bash configuration files. Make sure that on Linux you add the following to the ~/.bashrc file and on macOS to the file ~/.bash_profile or .zprofile:

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
export PYENV_VIRTUALENV_DISABLE_PROMPT=1
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

__pyenv_version_ps1() {
  local ret=$?
  output=$(pyenv version-name)
  if [[ ! -z $output ]]; then
    echo -n "($output)"
  fi
  return $ret
}

We recommend that you do this towards the end of your file. Then set up the prompt and the convenience aliases for the environments installed via pyenv:

PS1="\$(__pyenv_version_ps1)${PS1}"

alias ENV2="pyenv activate ENV2"
alias ENV3="pyenv activate ENV3"

Next we recommend that you update pip in both environments:

$ ENV2
$ pip install pip -U
$ ENV3
$ pip install pip -U

3.2.3.1 Installation pyenv via Homebrew

On macOS you can install pyenv also via Homebrew. Before installing anything on your computer make sure you have enough space. Use in the terminal the command:

$ df -h

which gives you an overview of your file system. If you do not have enough space, please make sure you free up unused files from your drive.

On many occasions it is beneficial to use readline, as it provides nice editing features for the terminal, and xz for compression. First, make sure you have xcode installed:

$ xcode-select --install

On Mojave you will get an error that zlib is not installed. This is because the header files are not properly installed. To do this you can say

$ sudo installer -pkg /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg -target /

Next install homebrew, pyenv, pyenv-virtualenv and pyenv-virtualenvwrapper. Additionally install readline and some compression tools:

$ /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
$ brew update
$ brew install readline xz

To install pyenv with Homebrew execute in the terminal:

brew install pyenv pyenv-virtualenv pyenv-virtualenvwrapper

3.2.3.2 Install pyenv on Ubuntu 18.04

The following steps will install pyenv in a new Ubuntu 18.04 distribution.

Start up a terminal and execute in the terminal the following commands. We recommend that you do it one command at a time so you can observe if the command succeeds:

$ sudo apt-get update
$ sudo apt-get install git python-pip make build-essential libssl-dev
$ sudo apt-get install zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
$ sudo pip install virtualenvwrapper
$ git clone https://github.com/yyuu/pyenv.git ~/.pyenv
$ git clone https://github.com/pyenv/pyenv-virtualenv.git ~/.pyenv/plugins/pyenv-virtualenv
$ git clone https://github.com/yyuu/pyenv-virtualenvwrapper.git ~/.pyenv/plugins/pyenv-virtualenvwrapper
$ echo 'export PYENV_ROOT="$HOME/.pyenv"' >> ~/.bashrc
$ echo 'export PATH="$PYENV_ROOT/bin:$PATH"' >> ~/.bashrc

You can also install pyenv using the curl command in the following way:

$ curl -L https://raw.githubusercontent.com/yyuu/pyenv-installer/master/bin/pyenv-installer | bash

Then install its dependencies:

$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install -y make build-essential libssl-dev
$ sudo apt-get install -y zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev
$ sudo apt-get install -y wget curl llvm libncurses5-dev git

Now that you have installed pyenv, it is not yet activated in your current terminal. The easiest thing to do is to start a new terminal and type in:

$ which pyenv

If you see a response, pyenv is installed and you can proceed with the next steps.

Please remember that whenever you modify .bashrc or .bash_profile or .zprofile you need to start a new terminal.

3.2.3.3 Using pyenv

3.2.3.3.1 Using pyenv to Install Different Python Versions

Pyenv provides a large list of different Python versions. To see the entire list please use the command:

$ pyenv install -l

However, for us we only need to worry about Python 2.7.16 and Python 3.7.4. You can now install different versions of Python into your local environment with the following commands:

$ pyenv update
$ pyenv install 2.7.16
$ pyenv install 3.7.4

You can set the global Python default version with:

$ pyenv global 3.7.4

Type the following to determine which version you activated:

$ pyenv version

Type the following to determine which versions you have available:

$ pyenv versions

To associate a specific environment name with a certain Python version, use the following commands:

$ pyenv virtualenv 2.7.16 ENV2
$ pyenv virtualenv 3.7.4 ENV3

In the example, ENV2 would represent Python 2.7.16 while ENV3 would represent Python 3.7.4. Often it is easier to type the alias rather than the explicit version.

3.2.3.3.2 Switching Environments

After setting up the different environments, switching between them is now easy. Simply use the following commands:

(2.7.16) $ pyenv activate ENV2
(ENV2) $ pyenv activate ENV3
(ENV3) $ pyenv activate ENV2
(ENV2) $ pyenv deactivate ENV2
(2.7.16) $

To make it even easier, you can add the following lines to your .bash_profile or .zprofile file:

alias ENV2="pyenv activate ENV2"
alias ENV3="pyenv activate ENV3"

If you start a new terminal, you can switch between the different versions of Python simply by typing:

$ ENV2
$ ENV3

3.2.3.4 Updating Python Version List

Pyenv maintains locally a list of available Python versions. To update and see the list use the commands

$ pyenv update
$ pyenv install -l

You will see the updated list.

3.2.3.4.1 Updating to a new version of Python with pyenv

Naturally Python itself evolves and new versions will become available via pyenv. To use such a new version you need to first install it into pyenv. Let us assume you had an old version of Python installed into the ENV3 environment. Then you need to execute the following steps:

$ pyenv deactivate
$ pyenv uninstall ENV3
$ pyenv install 3.7.4
$ pyenv virtualenv 3.7.4 ENV3
$ ENV3
$ pip install pip -U

With the pip install command, we make sure we have the newest version of pip. In case you get an error, you may have to update xcode as follows and try again:

$ xcode-select --install

After you installed it you can activate it by typing ENV3. Naturally this requires that you added it to your bash environment as discussed in Section 1.1.1.8.

3.2.4 Anaconda and Miniconda and Conda

While others on the internet or in your classes may have taught you to use anaconda, we will avoid it as it has several disadvantages for developers. The reason for this is that it installs many packages that you are likely not to use. In fact, installing anaconda on your VM will waste space and time and you should look into other installs.

We do not recommend that you use anaconda or miniconda as it may interfere with your default Python interpreters and setup.

Please note that beginners to Python should use anaconda or miniconda only after they have installed pyenv and use it. For this class neither anaconda nor miniconda is required. In fact we do not recommend it. We keep this section as we know that other classes at IU may use anaconda. We are not aware if these classes teach you the right way to install it, with pyenv.

3.2.4.1 Miniconda

This section about miniconda is experimental and has not been tested. We are looking for contributors to help complete it. If you use anaconda or miniconda we recommend to manage it via pyenv.

To install miniconda you can use the following commands:

$ mkdir ana
$ cd ana
$ pyenv install miniconda3-latest
$ pyenv local miniconda3-latest
$ pyenv activate miniconda3-latest
$ conda create -n ana anaconda

To activate use:

$ source activate ana

To deactivate use:

$ source deactivate

3.2.4.2 Anaconda

This section about anaconda is experimental and has not been tested. We are looking for contributors to help complete it.

You can add anaconda to your pyenv with the following commands:

pyenv install anaconda3-4.3.1

To switch more easily we recommend that you use the following in your .bash_profile or .zprofile file:

alias ANA="pyenv activate anaconda3-4.3.1"

Once you have done this you can easily switch to anaconda with the command:

$ ANA

Terminology in anaconda could lead to confusion. Thus we like to point out that the version number of anaconda is unrelated to the Python version. Furthermore, anaconda uses the term root not for the root user, but for the originating directory in which the anaconda program is installed.

In case you like to build your own conda packages at a later time, we recommend that you install the conda-build package:

$ conda install conda-build

When executing:

$ pyenv versions

you will see after the install completed the anaconda versions installed:

system
2.7.16
2.7.16/envs/ENV2
3.7.4
3.7.4/envs/ENV3
ENV2
ENV3
* anaconda3-4.3.1 (set by PYENV_VERSION environment variable)

Let us now create a virtualenv for anaconda:

$ pyenv virtualenv anaconda3-4.3.1 ANA

To activate it you can now use:

$ pyenv activate ANA

However, anaconda may modify your .bashrc, .bash_profile, or .zprofile files and may result in incompatibilities with other Python versions. For this reason we recommend not to use it. If you find ways to get it to work reliably with other versions, please let us know and we will update this tutorial.

3.2.5 Exercises

E.Python.Install.1:

Install Python 3.7.4.

E.Python.Install.2:

Write installation instructions for an operating system of your choice and add them to this documentation.

E.Python.Install.3:

Replicate the steps to install pyenv, so you can type in ENV2 and ENV3 in your terminals to switch between Python 2 and 3.

E.Python.Install.4:

Why do you generally not want to use anaconda for cloud computing? When is it ok to use anaconda?

4 FIRST STEPS

4.1 INTERACTIVE PYTHON ☁

Python can be used interactively. You can enter the interactive mode by entering the interactive loop by executing the command:

$ python

You will see something like the following:

$ python
Python 3.7.1 (default, Nov 24 2018, 14:27:15)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

The >>> is the prompt used by the interpreter. This is similar to bash, where commonly $ is used.

Sometimes it is convenient to show the prompt when illustrating an example. This is to provide some context for what we are doing. If you are following along, you will not need to type in the prompt.

This interactive Python process does the following:

- read your input commands
- evaluate your command
- print the result of the evaluation
- loop back to the beginning.

This is why you may see the interactive loop referred to as a REPL: Read-Evaluate-Print-Loop.

4.1.1 REPL (Read Eval Print Loop)

There are many different types beyond what we have seen so far, such as dictionaries, lists, and sets. One handy way of using the interactive Python is to get the type of a value using type():

>>> type(42)
<type 'int'>
>>> type('hello')
<type 'str'>
>>> type(3.14)
<type 'float'>

You can also ask for help about something using help():

>>> help(int)
>>> help(list)
>>> help(str)

Using help() opens up a help message within a pager. To navigate you can use the space bar to go down a page, w to go up a page, the arrow keys to go up/down line-by-line, or q to exit.

4.1.2 Interpreter

Although the interactive mode provides a convenient tool to test things out, you will see quickly that for our class we want to use the Python interpreter from the command line. Let us assume the program is called prg.py. Once you have written it in that file, you simply can call it with

$ python prg.py

It is important to name the program with a meaningful name.

4.1.3 Python 3 Features in Python 2

In this course we want to be able to seamlessly switch between Python 2 and Python 3. Thus it is convenient from the start to use Python 3 syntax when it is also supported in Python 2. One of the most used functions is the print statement, which in Python 3 has parentheses. To enable it in Python 2 you just need to import this function:

from __future__ import print_function, division

The first of these imports allows us to use the print function to output text to the screen, instead of the print statement, which Python 2 uses. This is simply a design decision that better reflects Python's underlying philosophy.

Other functions such as the division also behave differently. Thus we use

from __future__ import division

This import makes sure that the division operator behaves in a way a newcomer to the language might find more intuitive. In Python 2, division / is floor division when the arguments are integers, meaning that the following holds:

(5 / 2 == 2) is True

In Python 3, division / is a floating point division, thus

(5 / 2 == 2.5) is True

4.2 EDITORS ☁

This section is meant to give an overview of the Python editing tools needed for this course. There are many other alternatives; however, we do recommend to use PyCharm.

4.2.1 Pycharm

PyCharm is an Integrated Development Environment (IDE) used for programming in Python. It provides code analysis, a graphical debugger, an integrated unit tester, and integration with git.

Video: Pycharm (8:56)

4.2.2 Python in 45 minutes

An additional community video about the Python programming language that we found on the internet. Naturally there are many alternatives to this video, but the video is probably a good start. It also uses PyCharm, which we recommend.

Video: Python (43:16), using PyCharm

How much you want to understand of Python is actually a bit up to you. While it is good to know classes and inheritance, you may be able for this class to get away without using it. However, we do recommend that you learn it.

PyCharm Installation

Method 1: PyCharm installation on Ubuntu using umake

sudo add-apt-repository ppa:ubuntu-desktop/ubuntu-make
sudo apt-get update
sudo apt-get install ubuntu-make

Once umake is installed, use the next command to install the PyCharm community edition:

umake ide pycharm

If you want to remove PyCharm installed using the umake command, use this:

umake -r ide pycharm

Method 2: PyCharm installation on Ubuntu using PPA

sudo add-apt-repository ppa:mystic-mirage/pycharm
sudo apt-get update
sudo apt-get install pycharm-community

PyCharm also has a Professional (paid) version, which can be installed using the following command:

sudo apt-get install pycharm

Once installed, go to your VM dashboard and search for PyCharm.

4.3 GOOGLE COLAB ☁

In this section we are going to introduce you to Google Colab and show how to use it to run deep learning models.

4.3.1 Introduction to Google Colab

This video contains the introduction to Google Colab. In this section we will be learning how to start a Google Colab project.

4.3.2 Programming in Google Colab

In this video we will learn how to create a simple Colab notebook.

Required Installations

pip install numpy

4.3.3 Benchmarking in Google Colab with Cloudmesh

In this video we learn how to do a basic benchmark with Cloudmesh tools. Cloudmesh StopWatch will be used in this tutorial.

Required Installations

pip install numpy
pip install cloudmesh-installer
pip install cloudmesh-common
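As a rough sketch of what such a benchmark can look like in a Colab cell, the following assumes the StopWatch class from the cloudmesh-common package with its start, stop, and benchmark methods; treat the exact interface as an assumption and verify it against the Cloudmesh documentation:

# time a small NumPy computation with Cloudmesh StopWatch (interface assumed)
import numpy as np
from cloudmesh.common.StopWatch import StopWatch

StopWatch.start("matrix multiply")   # start a named timer
a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
c = a @ b                            # the work we want to measure
StopWatch.stop("matrix multiply")    # stop the named timer

StopWatch.benchmark()                # print a summary of all recorded timers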

5 LANGUAGE

5.1 LANGUAGE ☁

5.1.1 Statements and Strings

Let us explore the syntax of Python while starting with a print statement:

print("Hello world from Python!")

This will print on the terminal

Hello world from Python!

The print function was given a string to process. A string is a sequence of characters. A character can be alphabetic (A through Z, lower and upper case), numeric (any of the digits), white space (spaces, tabs, newlines, etc.), a syntactic directive (comma, colon, quotation, exclamation, etc.), and so forth. A string is just a sequence of characters and is typically indicated by surrounding the characters in double quotes.

Standard output is discussed in the Section Linux.

So, what happened when you pressed Enter? The interactive Python program read the line print("Hello world from Python!"), split it into the print statement and the "Hello world from Python!" string, and then executed the line, showing you the output.

5.1.2 Comments

Comments in Python start with a #:

# This is a comment

5.1.3 Variables

You can store data into a variable to access it later. For instance:

hello = 'Hello world from Python!'
print(hello)

This will print again

Hello world from Python!

5.1.4 Data Types

5.1.4.1 Booleans

A boolean is a value that can have the values True or False. You can combine booleans with boolean operators such as and and or:

print(True and True)     # True
print(True and False)    # False
print(False and False)   # False
print(True or True)      # True
print(True or False)     # True
print(False or False)    # False

5.1.4.2 Numbers

The interactive interpreter can also be used as a calculator. For instance, say we wanted to compute a multiple of 21:

print(21 * 2)  # 42

We saw here the print statement again. We passed in the result of the operation 21 * 2. An integer (or int) in Python is a numeric value without a fractional component (those are called floating point numbers, or float for short).

The mathematical operators compute the related mathematical operation on the provided numbers. Some operators are:

Operator  Function
*         multiplication
/         division
+         addition
-         subtraction
**        exponent

Exponentiation x^y is written as x ** y and is x to the yth power:

print(2 ** 3)  # 8

You can combine floats and ints:

print(3.14 * 42 / 11 + 4 - 2)    # 13.9890909091

Note that operator precedence is important. Using parentheses to affect the order of operations gives different results, as expected:

print(3.14 * (42 / 11) + 4 - 2)  # 11.42
print(1 + 2 * 3 - 4 / 5.0)       # 6.2
print((1 + 2) * (3 - 4) / 5.0)   # -0.6

5.1.5 Module Management

A module allows you to logically organize your Python code. Grouping related code into a module makes the code easier to understand and use. A module is a Python object with arbitrarily named attributes that you can bind and reference. A module is a file consisting of Python code. A module can define functions, classes and variables. A module can also include runnable code.

5.1.5.1 Import Statement

When the interpreter encounters an import statement, it imports the module if the module is present in the search path. A search path is a list of directories that the interpreter searches before importing a module. It is preferred to use a separate line for each import, such as:

import numpy
import matplotlib

5.1.5.2 The from ... import Statement

Python's from statement lets you import specific attributes from a module into the current namespace, for example:

from datetime import datetime

5.1.6 Date Time in Python

The datetime module supplies classes for manipulating dates and times in both simple and complex ways. While date and time arithmetic is supported, the focus of the implementation is on efficient attribute extraction for output formatting and manipulation. For related functionality, see also the time and calendar modules.

The import statement: you can use any Python source file as a module by executing an import statement in some other Python source file.

The dateutil module offers a generic date/time string parser which is able to parse most known formats to represent a date and/or time.

pandas is an open source Python library for data analysis that needs to be imported.

Create a string variable with the class start time.

Convert the string to datetime format.

Create a list of strings as dates.

Convert the class_dates strings into datetime format and save the list into variable a.

Use parse() to attempt to auto-convert common string formats. The parser must be given a string or character stream, not a list.

from datetime import datetime
from dateutil.parser import parse
import pandas as pd

fall_start = '08-21-2018'
datetime.strptime(fall_start, '%m-%d-%Y')
# datetime.datetime(2018, 8, 21, 0, 0)

class_dates = [
    '8/25/2017',
    '9/1/2017',
    '9/8/2017',
    '9/15/2017',
    '9/22/2017',
    '9/29/2017']

a = [datetime.strptime(x, '%m/%d/%Y') for x in class_dates]

Use parse() on every element of the class_dates list.

Use parse, but designate that the day is first.

Create a dataframe. A DataFrame is a tabular data structure comprised of rows and columns, akin to a spreadsheet or database table. A DataFrame can be seen as a group of Series objects that share an index (the column names).

Convert df['dates'] from string to datetime.

parse(fall_start)
# datetime.datetime(2018, 8, 21, 0, 0)

[parse(x) for x in class_dates]
# [datetime.datetime(2017, 8, 25, 0, 0),
#  datetime.datetime(2017, 9, 1, 0, 0),
#  datetime.datetime(2017, 9, 8, 0, 0),
#  datetime.datetime(2017, 9, 15, 0, 0),
#  datetime.datetime(2017, 9, 22, 0, 0),
#  datetime.datetime(2017, 9, 29, 0, 0)]

parse(fall_start, dayfirst=True)
# datetime.datetime(2018, 8, 21, 0, 0)

import pandas as pd

data = {
    'dates': [
        '8/25/2017 18:47:05.069722',
        '9/1/2017 18:47:05.119994',
        '9/8/2017 18:47:05.178768',
        '9/15/2017 18:47:05.230071',
        '9/22/2017 18:47:05.230071',
        '9/29/2017 18:47:05.280592'],
    'complete': [1, 0, 1, 1, 0, 1]}

df = pd.DataFrame(
    data,
    columns=['dates', 'complete'])
print(df)

#                        dates  complete
# 0  8/25/2017 18:47:05.069722         1
# 1   9/1/2017 18:47:05.119994         0
# 2   9/8/2017 18:47:05.178768         1
# 3  9/15/2017 18:47:05.230071         1
# 4  9/22/2017 18:47:05.230071         0
# 5  9/29/2017 18:47:05.280592         1

pd.to_datetime(df['dates'])
# 0   2017-08-25 18:47:05.069722
# 1   2017-09-01 18:47:05.119994
# 2   2017-09-08 18:47:05.178768
# 3   2017-09-15 18:47:05.230071
# 4   2017-09-22 18:47:05.230071
# 5   2017-09-29 18:47:05.280592
# Name: dates, dtype: datetime64[ns]

5.1.7 Control Statements

5.1.7.1 Comparison

Computer programs do not only execute instructions. Occasionally, a choiceneedstobemade.Suchasachoiceisbasedonacondition.Pythonhasseveralconditionaloperators:

Operator Function> greaterthan< smallerthan== equals!= isnot

Conditionsarealwayscombinedwithvariables.Aprogramcanmakeachoiceusingtheifkeyword.Forexample:

In this example,You guessed correctly! will only be printed if the variable xequals to four. Python can also executemultiple conditions using the elif andelsekeywords.

5.1.7.2Iteration

To repeat code, the for keyword can be used. For example, to display thenumbersfrom1to10,wecouldwritesomethinglikethis:

Thesecondargument to range,11, isnot inclusive,meaning that the loopwillonlygetto10beforeitfinishes.Pythonitselfstartscountingfrom0,sothiscodewillalsowork:

x=int(input("Guessx:"))

ifx==4:

print('Correct!')

x=int(input("Guessx:"))

ifx==4:

print('Correct!')

elifabs(4-x)==1:

print('Wrong,butclose!')

else:

print('Wrong,wayoff!')

foriinrange(1,11):

print('Hello!')

foriinrange(0,10):

print(i+1)

Infact,therangefunctiondefaultstostartingvalueof0,soitisequivalentto:

Wecanalsonestloopsinsideeachother:

In this case we have two nested loops. The code will iterate over the entirecoordinaterange(0,0)to(9,9)

5.1.8 Datatypes

5.1.8.1 Lists

See: https://www.tutorialspoint.com/python/python_lists.htm

Lists in Python are ordered sequences of elements, where each element can be accessed using a 0-based index.

To define a list, you simply list its elements between square brackets '[ ]':

You can also use a negative index if you want to start counting elements from the end of the list. Thus, the last element has index -1, the second before last element has index -2, and so on:

Python also allows you to take whole slices of the list by specifying a beginning and end of the slice separated by a colon:


names = [
    'Albert',
    'Jane',
    'Liz',
    'John',
    'Abby']

# access the first element of the list
names[0]
# 'Albert'

# access the third element of the list
names[2]
# 'Liz'

# access the last element of the list
names[-1]
# 'Abby'

# access the second last element of the list
names[-2]
# 'John'

As you can see from the example, the starting index in the slice is inclusive and the ending one, exclusive.

Python provides a variety of methods for manipulating the members of a list.

You can add elements with 'append':

As you can see, the elements in a list need not be unique.

Merge two lists with 'extend':

Find the index of the first occurrence of an element with 'index':

Remove elements by value with 'remove':

Remove elements by index with 'pop':

Notice that pop returns the element being removed, while remove does not.

If you are familiar with stacks from other programming languages, you can use 'insert' and 'pop':

# the middle elements, excluding first and last
names[1:-1]
# ['Jane', 'Liz', 'John']

names.append('Liz')
names
# ['Albert', 'Jane', 'Liz',
#  'John', 'Abby', 'Liz']

names.extend(['Lindsay', 'Connor'])
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Abby', 'Liz', 'Lindsay', 'Connor']

names.index('Liz')
# 2

names.remove('Abby')
names
# ['Albert', 'Jane', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']

names.pop(1)
# 'Jane'
names
# ['Albert', 'Liz', 'John',
#  'Liz', 'Lindsay', 'Connor']

names.insert(0, 'Lincoln')
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay', 'Connor']

names.pop()

The Python documentation contains a full list of list operations.

To go back to the range function you used earlier, it simply creates a list of numbers. (In Python 3, range returns a range object rather than a list; wrap it in list() to see the elements.)

5.1.8.2 Sets

Python lists can contain duplicates as you saw previously:

When we do not want this to be the case, we can use a set:

Keep in mind that the set is an unordered collection of objects, thus we cannot access them by index:

However, we can convert a set to a list easily:

Notice that in this case, the order of elements in the new list matches the order in which the elements were displayed when we created the set. We had set(['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay']) and now we have ['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay'].

# 'Connor'
names
# ['Lincoln', 'Albert', 'Liz',
#  'John', 'Liz', 'Lindsay']

range(10)
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
range(2, 10, 2)
# [2, 4, 6, 8]

names = ['Albert', 'Jane', 'Liz',
         'John', 'Abby', 'Liz']

unique_names = set(names)
unique_names
# set(['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay'])

unique_names[0]
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: 'set' object does not support indexing

unique_names = list(unique_names)
unique_names
# ['Lincoln', 'John', 'Albert', 'Liz', 'Lindsay']
unique_names[0]
# 'Lincoln'

You should not assume this is the case in general. That is, do not make anyassumptionsabouttheorderofelementsinasetwhenitisconvertedtoanytypeofsequentialdatastructure.

You can change a set’s contents using the add, remove and update methodswhich correspond to the append, remove and extend methods in a list. Inaddition to these, set objects support the operations youmay be familiarwithfrommathematicalsets:union,intersection,difference,aswellasoperations tocheck containment. You can read about this in the Python documentation forsets.

5.1.8.3 Removal and Testing for Membership in Sets

One important advantage of a set over a list is that access to elements is fast. If you are familiar with different data structures from a Computer Science class, the Python list is implemented by an array, while the set is implemented by a hash table.

We will demonstrate this with an example. Let us say we have a list and a set with the same number of elements (approximately 100 thousand):

import sys, random, timeit

# sys.maxsize in Python 3; Python 2 used sys.maxint
nums_set = set([random.randint(0, sys.maxsize) for _ in range(10 ** 5)])
nums_list = list(nums_set)
len(nums_set)
# 100000

We will use the timeit Python module to time 100 operations that test for the existence of a member in either the list or the set:

timeit.timeit('random.randint(0, sys.maxsize) in nums',
              setup='import random, sys; nums = %s' % str(nums_set), number=100)
# 0.0004038810729980469

timeit.timeit('random.randint(0, sys.maxsize) in nums',
              setup='import random, sys; nums = %s' % str(nums_list), number=100)
# 0.398054122924804

The exact duration of the operations on your system will be different, but the take away will be the same: searching for an element in a set is orders of magnitude faster than in a list. This is important to keep in mind when you work with large amounts of data.

5.1.8.4 Dictionaries

One of the very important data structures in Python is the dictionary, also referred to as dict.

A dictionary represents a key-value store:

A convenient form to print by named attributes is

This form of printing, with the format statement and a reference to the data, increases the readability of the print statements.

You can delete elements with the following commands:

You can iterate over a dict:

5.1.8.5 Dictionary Keys and Values

You can retrieve both the keys and values of a dictionary using the keys() and values() methods of the dictionary, respectively:

person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}

print("person['Name']:", person['Name'])
# person['Name']: Albert
print("person['Age']:", person['Age'])
# person['Age']: 100

print("{Name} {Age}".format(**person))

del person['Name']   # remove entry with key 'Name'
# person
# {'Age': 100, 'Class': 'Scientist'}

person.clear()       # remove all entries in dict
# person
# {}

del person           # delete entire dictionary
# person
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# NameError: name 'person' is not defined

person = {
    'Name': 'Albert',
    'Age': 100,
    'Class': 'Scientist'
}

for item in person:
    print(item, person[item])

# Age 100
# Name Albert
# Class Scientist

person.keys()     # ['Age', 'Name', 'Class']
person.values()   # [100, 'Albert', 'Scientist']

Both methods return lists. Notice, however, that the order in which the elements appear in the returned lists (Age, Name, Class) is different from the order in which we listed the elements when we declared the dictionary initially (Name, Age, Class). It is important to keep this in mind:

You cannot make any assumptions about the order in which the elements of a dictionary will be returned by the keys() and values() methods.

However, you can assume that if you call keys() and values() in sequence, the order of elements will at least correspond in both methods. In the example Age corresponds to 100, Name to Albert, and Class to Scientist, and you will observe the same correspondence in general as long as keys() and values() are called one right after the other.
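A short way to see this correspondence is to pair the two lists up, for example with zip; this is a small sketch based on the person dictionary from before:

person = {'Name': 'Albert', 'Age': 100, 'Class': 'Scientist'}

# keys() and values() called one after the other line up pairwise
for key, value in zip(person.keys(), person.values()):
    print(key, '->', value)

# the same pairs are returned directly by items()
print(list(person.items()))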

5.1.8.6 Counting with Dictionaries

One application of dictionaries that frequently comes up is counting the elements in a sequence. For example, say we have a sequence of coin flips:

The actual list coin_flips will likely be different when you execute this on your computer, since the outcomes of the flips are random.

To compute the probabilities of heads and tails, we could count how many heads and tails we have in the list:

import random

coin_flips = [
    random.choice(['heads', 'tails']) for _ in range(10)
]
# coin_flips
# ['heads', 'tails', 'heads',
#  'tails', 'heads', 'heads',
#  'tails', 'heads', 'heads', 'heads']

counts = {'heads': 0, 'tails': 0}
for outcome in coin_flips:
    assert outcome in counts
    counts[outcome] += 1

print('Probability of heads: %.2f' % (counts['heads'] / len(coin_flips)))
# Probability of heads: 0.70

print('Probability of tails: %.2f' % (counts['tails'] / sum(counts.values())))
# Probability of tails: 0.30

In addition to how we use the dictionary counts to count the elements of coin_flips, notice a couple of things about this example:

1. We used the assert outcome in counts statement. The assert statement in Python allows you to easily insert debugging statements in your code to help you discover errors more quickly. assert statements are executed whenever the internal Python __debug__ variable is set to True, which is always the case unless you start Python with the -O option, which allows you to run optimized Python. (See the small sketch after this list.)

2. When we computed the probability of tails, we used the built-in sum function, which allowed us to quickly find the total number of coin flips. sum is one of many built-in functions you can read about here.

5.1.9 Functions

You can reuse code by putting it inside a function that you can call in other parts of your programs. Functions are also a good way of grouping code that logically belongs together into one coherent whole. A function has a unique name in the program. Once you call a function, it will execute its body, which consists of one or more lines of code:

The def keyword tells Python we are defining a function. As part of the definition, we have the function name, check_triangle, and the parameters of the function, that is, variables that will be populated when the function is called.

We call the function with arguments 4, 5 and 6, which are passed in order into the parameters a, b and c. A function can be called several times with varying parameters. There is no limit to the number of function calls.

It is also possible to store the output of a function in a variable, so it can be reused.

def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

print(check_triangle(4, 5, 6))

def check_triangle(a, b, c):
    return \
        a < b + c and a > abs(b - c) and \
        b < a + c and b > abs(a - c) and \
        c < a + b and c > abs(a - b)

5.1.10 Classes

A class is an encapsulation of data and the processes that work on them. The data is represented in member variables, and the processes are defined in the methods of the class (methods are functions inside the class). For example, let us see how to define a Triangle class:

Python has full object-oriented programming (OOP) capabilities; however, we cannot cover all of them in this section, so if you need more information please refer to the Python docs on classes and OOP.

5.1.11 Modules

Now write this simple program and save it:

As a check, make sure the file contains the expected contents on the command line:

result = check_triangle(4, 5, 6)
print(result)

class Triangle(object):

    def __init__(self, length, width,
                 height, angle1, angle2, angle3):
        if not self._sides_ok(length, width, height):
            print('The sides of the triangle are invalid.')
        elif not self._angles_ok(angle1, angle2, angle3):
            print('The angles of the triangle are invalid.')
        self._length = length
        self._width = width
        self._height = height
        self._angle1 = angle1
        self._angle2 = angle2
        self._angle3 = angle3

    def _sides_ok(self, a, b, c):
        return \
            a < b + c and a > abs(b - c) and \
            b < a + c and b > abs(a - c) and \
            c < a + b and c > abs(a - b)

    def _angles_ok(self, a, b, c):
        return a + b + c == 180

triangle = Triangle(4, 5, 6, 35, 65, 80)

print("Hello world!")

$ cat hello.py
print("Hello world!")

To execute your program, pass the file as a parameter to the python command:

$ python hello.py
Hello world!

Files in which Python code is stored are called modules. You can execute a Python module from the command line as you just did, or you can import it in other Python code using the import statement.
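For example, assuming you save a small helper in a file greeting.py (a hypothetical file name used only for illustration), another program in the same directory can import it:

# greeting.py -- a tiny module defining one function
def hello(name):
    return "Hello %s!" % name

# main.py -- imports the greeting module and uses its function
import greeting

print(greeting.hello("Albert"))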

Let us write a more involved Python program that will receive as input thelengths of the three sides of a triangle, andwill outputwhether they define avalidtriangle.Atriangleisvalidifthelengthofeachsideislessthanthesumofthelengthsoftheothertwosidesandgreaterthanthedifferenceofthelengthsoftheothertwosides.:

Assumingwesavetheprograminafilecalledcheck_triangle.py,wecanrunitlikeso:

Letusbreakthisdownabit.

1. Weare importing theprint_function anddivisionmodules frompython3likewedidearlierinthissection.It’sagoodideatoalwaysincludetheseinyourprograms.

2. We’vedefinedabooleanexpressionthattellsusifthesidesthatwereinputdefine a valid triangle. The result of the expression is stored in the

$pythonhello.py

Helloworld!

"""Usage:check_triangle.py[-h]LENGTHWIDTHHEIGHT

Checkifatriangleisvalid.

Arguments:

LENGTHThelengthofthetriangle.

WIDTHThewidthofthetraingle.

HEIGHTTheheightofthetriangle.

Options:

-h--help

"""

fromdocoptimportdocopt

if__name__=='__main__':

arguments=docopt(__doc__)

a,b,c=int(arguments['LENGTH']),

int(arguments['WIDTH']),

int(arguments['HEIGHT'])

valid_triangle=\

a<b+canda>abs(b-c)and\

b<a+candb>abs(a-c)and\

c<a+bandc>abs(a-b)

print('Trianglewithsides%d,%dand%disvalid:%r'%(

a,b,c,valid_triangle

))

$pythoncheck_triangle.py456

Trianglewithsides4,5and6isvalid:True

valid_triangle variable. It is True if the conditions inside are true, and False otherwise.

3. We've used the backslash symbol \ to format our code nicely. The backslash simply indicates that the current line is being continued on the next line.

4. When we run the program, we do the check if __name__ == '__main__'. __name__ is an internal Python variable that allows us to tell whether the current file is being run from the command line (value '__main__'), or is being imported by a module (the value will be the name of the module). Thus, with this statement we're just making sure the program is being run from the command line.

5. We are using the docopt module to handle command line arguments. The advantage of using this module is that it generates a usage help statement for the program and enforces command line arguments automatically. All of this is done by parsing the docstring at the top of the file.

6. In the print function, we are using Python's string formatting capabilities to insert values into the string we are displaying.

5.1.12 Lambda Expressions

As opposed to normal functions in Python, which are defined using the def keyword, lambda functions in Python are anonymous functions which do not have a name and are defined using the lambda keyword. The generic syntax of a lambda function is of the form lambda arguments: expression, as shown in the following example:

As you could probably guess, the result is:

Now consider the following examples:

The power2 function defined in the expression is equivalent to the following definition:

greeter = lambda x: print('Hello %s!' % x)

print(greeter('Albert'))

Hello Albert!

power2 = lambda x: x ** 2

def power2(x):
    return x ** 2

Lambda functions are useful when you need a function for a short period of time. Note that they can also be very useful when passed as an argument to other built-in functions that take a function as an argument, e.g. filter() and map(). In the next example we show how a lambda function can be combined with the filter function. Consider the array all_names which contains five words that rhyme together. We want to filter the words that contain the word name. To achieve this, we pass the function lambda x: 'name' in x as the first argument. This lambda function returns True if the word name exists as a sub-string in the string x. The second argument of the filter function is the array of names, i.e. all_names.

As you can see, the names are successfully filtered as we expected.

In Python 3, the filter function returns a filter object, an iterator which gets lazily evaluated. This means we can neither access the elements of the filter object with an index nor use len() to find the length of the filter object.

In Python, we can have a small, usually single-line, anonymous function called a lambda function, which can have any number of arguments just like a normal function but with only one expression and no return statement. The result of this expression can be applied to a value.

Basic syntax:

For example, a function in Python:

The same function can be written as a lambda function. This function, named multiply, has 2 arguments and returns their multiplication.

all_names = ['surname', 'rename', 'nickname', 'acclaims', 'defame']

filtered_names = list(filter(lambda x: 'name' in x, all_names))

print(filtered_names)

# ['surname', 'rename', 'nickname']

list_a = [1, 2, 3, 4, 5]

filter_obj = filter(lambda x: x % 2 == 0, list_a)

# Convert the filter object to a list
even_num = list(filter_obj)

print(even_num)

# Output: [2, 4]

lambda arguments: expression

def multiply(a, b):
    return a * b

# call the function
multiply(3, 5)  # outputs: 15

The lambda equivalent for this function would be:

Here a and b are the 2 arguments and a * b is the expression whose value is returned as an output.

Also, we don't need to assign a lambda function to a variable.

Lambda functions are mostly passed as a parameter to a function which expects a function object, like map or filter.

5.1.12.1 map

The basic syntax of the map function is

The map function expects a function object and any number of iterables, like a list or dictionary. It executes the function_object for each element in the sequence and returns a list of the elements modified by the function object.

Example:

If we want to write the same function using a lambda:

5.1.12.2 dictionary

Now, let's see how we can iterate over a dictionary using map and lambda. Let's say we have a dictionary object

multiply = lambda a, b: a * b

print(multiply(3, 5))

# outputs: 15

(lambda a, b: a * b)(3, 5)

map(function_object, iterable1, iterable2, ...)

def multiply2(x):
    return x * 2

list(map(multiply2, [2, 4, 6, 8]))
# Output [4, 8, 12, 16]

list(map(lambda x: x * 2, [2, 4, 6, 8]))
# Output [4, 8, 12, 16]

dict_movies = [

We can iterate over this dictionary and read its elements using map and lambda functions in the following way:

In Python 3, the map function returns an iterator or map object which gets lazily evaluated. This means we can neither access the elements of the map object with an index nor use len() to find the length of the map object. We can force-convert the map output, i.e. the map object, to a list as shown next:

5.1.13 Iterators

In Python, the iterator protocol is defined using two methods: __iter__() and __next__(). The former returns the iterator object and the latter returns the next element of a sequence. Some advantages of iterators are as follows:

Readability
Supports sequences of infinite length
Saving resources

There are several built-in objects in Python which implement the iterator protocol, e.g. string, list, dictionary. In the following example, we create a new class that follows the iterator protocol. We then use the class to generate log2 of numbers:

    {'movie': 'avengers', 'comic': 'marvel'},
    {'movie': 'superman', 'comic': 'dc'}]

list(map(lambda x: x['movie'], dict_movies))   # Output: ['avengers', 'superman']

list(map(lambda x: x['comic'], dict_movies))   # Output: ['marvel', 'dc']

list(map(lambda x: x['movie'] == "avengers", dict_movies))
# Output: [True, False]

map_output = map(lambda x: x * 2, [1, 2, 3, 4])

print(map_output)

# Output: map object: <map object at 0x04D6BAB0>

list_map_output = list(map_output)

print(list_map_output)  # Output: [2, 4, 6, 8]

from math import log2

class LogTwo:
    "Implements an iterator of log two"

    def __init__(self, last=0):
        self.last = last

    def __iter__(self):
        self.current_num = 1
        return self

    def __next__(self):
        if self.current_num <= self.last:
            result = log2(self.current_num)
            self.current_num += 1
            return result

As you can see, we first create an instance of the class and obtain an iterator from it with iter(), assigning it to a variable called i. Then by calling the next() function four times, we get the following output:

As you probably noticed, the lines are log2() of 1, 2, 3, 4 respectively.

5.1.14 Generators

Before we go to generators, please understand iterators. Generators are also iterators, but they can only be iterated over once. That is because generators do not store the values in memory; instead they generate the values on the go. If we want to print those values then we can either simply iterate over them or use a for loop.

5.1.14.1 Generators with function

For example: we have a function named multiplyBy10 which prints all the input numbers multiplied by 10.

Now, if we want to use generators here then we will make the following changes.

        else:
            raise StopIteration

L = LogTwo(5)

i = iter(L)

print(next(i))
print(next(i))
print(next(i))
print(next(i))

$ python iterator.py

0.0
1.0
1.584962500721156
2.0

def multiplyBy10(numbers):
    result = []
    for i in numbers:
        result.append(i * 10)
    return result

new_numbers = multiplyBy10([1, 2, 3, 4, 5])

print(new_numbers)  # Output: [10, 20, 30, 40, 50]

def multiplyBy10(numbers):
    for i in numbers:
        yield i * 10

new_numbers = multiplyBy10([1, 2, 3, 4, 5])

In generators, we use the yield keyword in place of return. So when we try to print the new_numbers list now, it just prints a generator object. The reason for this is that generators do not hold any value in memory; they yield one result at a time. So essentially the generator is just waiting for us to ask for the next result. To print the next result we can just say print(next(new_numbers)); what happens is that the generator reads the first value, multiplies it by 10 and yields 10. In this case we can call print(next(new_numbers)) 5 times to print all numbers, and if we do it a 6th time we will get a StopIteration error, which means the generator is exhausted and has no 6th element to yield.
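As a minimal sketch of the behavior just described, using the multiplyBy10 generator from the previous listing:

def multiplyBy10(numbers):
    for i in numbers:
        yield i * 10

new_numbers = multiplyBy10([1, 2, 3, 4, 5])

print(new_numbers)        # a generator object, not a list
print(next(new_numbers))  # 10
print(next(new_numbers))  # 20
# three more next() calls yield 30, 40 and 50;
# a sixth call raises StopIteration because the generator is exhausted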

5.1.14.2 Generators using for loop

If we now want to print the complete list of the multiplied values then we can just do:

The output will be:

5.1.14.3 Generators with List Comprehension

Python has something called list comprehension; if we use this then we can replace the complete function definition with just:

Here the point to note is that the square brackets [] in line 1 are very important. If we change them to () then again we will start getting a generator object.

print(new_numbers)  # Output: generator object

print(next(new_numbers))  # Output: 10

def multiplyBy10(numbers):
    for i in numbers:
        yield i * 10

new_numbers = multiplyBy10([1, 2, 3, 4, 5])

for num in new_numbers:
    print(num)

10
20
30
40
50

new_numbers = [x * 10 for x in [1, 2, 3, 4, 5]]

print(new_numbers)  # Output: [10, 20, 30, 40, 50]

new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])

print(new_numbers)  # Output: generator object

We can get the individual elements again from a generator if we do a for loop over new_numbers like we did previously. Alternatively, we can convert it into a list and then print it.

But if we convert this into a list then we lose on performance, which we will see next.

5.1.14.4 Why to use Generators?

Generators are better with performance because they do not hold the values in memory. With the small examples we provide it is not a big deal, since we are dealing with a small amount of data, but consider a scenario where the records are in the millions. If we try to convert millions of data elements into a list then that will definitely make an impact on memory and performance, because everything will be in memory.

Let us see an example of how generators help with performance. First, without generators, a normal function takes 1 million records and returns the result (people) for 1 million.

new_numbers = (x * 10 for x in [1, 2, 3, 4, 5])

print(list(new_numbers))  # Output: [10, 20, 30, 40, 50]

# random, time, and a memory profiling helper (assumed here to be provided
# by a mem_profile module) are needed for this listing
import random
import time
import mem_profile

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']

majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_list(people):
    result = []
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        result.append(person)
    return result

t1 = time.time()
people = people_list(10000000)
t2 = time.time()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')

print('Took {time} seconds'.format(time=t2 - t1))

# Output

Memory (Before): 15 Mb

We are just giving approximate values to compare with the next execution, but if we run it we will see serious memory consumption and a good amount of time taken.

Now, after running the same code using generators, we will see a significant performance boost, with almost 0 seconds. The reason behind this is that in the case of generators we do not keep anything in memory, so the system just reads one element at a time and yields that.

Memory (After): 318 Mb

Took 1.2 seconds

names = ['John', 'Jack', 'Adam', 'Steve', 'Rick']

majors = ['Math',
          'CompScience',
          'Arts',
          'Business',
          'Economics']

# prints the memory before we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (Before): {memory} Mb')

def people_generator(people):
    for i in range(people):
        person = {
            'id': i,
            'name': random.choice(names),
            'major': random.choice(majors)
        }
        yield person

t1 = time.time()
people = people_generator(10000000)
t2 = time.time()

# prints the memory after we run the function
memory = mem_profile.memory_usage_resource()
print(f'Memory (After): {memory} Mb')

print('Took {time} seconds'.format(time=t2 - t1))

# Output

Memory (Before): 15 Mb
Memory (After): 15 Mb
Took 0.01 seconds

6 CLOUDMESH

6.1 INTRODUCTION ☁

Learning Objectives

Introduction to the cloudmesh API
Using cmd5 via cms
Introduction to the cloudmesh convenience API for output, dotdict, shell, stopwatch, benchmark management
Creating your own cms commands
Cloudmesh configuration file
Cloudmesh inventory

In this chapter we would like to introduce you to cloudmesh, which provides you with a number of convenient methods to interface with the local system, but also with cloud services. We will start while focusing on some simple APIs and then gradually introduce the cloudmesh shell, which not only provides a shell, but also a command line interface so you can use cloudmesh from a terminal. This dual ability is quite useful, as we can write cloudmesh scripts, but can also invoke the functionality from the terminal. This is quite an important distinction towards other tools that only allow command line interfaces.

Moreover, we also show you that it is easy to create new commands and add them dynamically to the cloudmesh shell via simple pip installs.

Cloudmesh is an evolving project and you have the opportunity to improve it if you see some features missing.

The manual of cloudmesh can be found at

https://cloudmesh.github.io/cloudmesh-manual

The API documentation is located at

https://cloudmesh.github.io/cloudmesh-manual/api/index.html#cloudmesh-api

We will initially focus on a subset of this functionality.

6.2 INSTALLATION ☁

The installation of cloudmesh is simple and can technically be done via pip by a user. However, you are not a user, you are a developer. Cloudmesh is distributed in different topical repositories, and in order for developers to easily interact with them we have written a convenient cloudmesh-installer program.

As a developer you must also use a python virtual environment to avoid affecting your system-wide python installation. This can be achieved while using Python 3 from python.org or via conda. We do recommend that you use python.org, as this is the vanilla python that most developers in the world use. Conda is often used by users of python if they do not need bleeding-edge but older prepackaged python tools and libraries.

6.2.1 Prerequisite

We require you to create a python virtual environment and activate it. How to do this was discussed in Section 3.1. Please create the ENV3 environment. Please activate it.

6.2.2 Basic Install

Cloudmesh can install for developers a number of bundles. A bundle is a set of git repositories that are needed for a particular install. For us, we are mostly interested in the bundles cms, cloud, and storage. We will introduce you to other bundles throughout this documentation.

If you would like to find out more about the details of this you can look at cloudmesh-installer, which will be regularly updated.

To make use of the bundles and the easy installation for developers, please install the cloudmesh-installer via pip, but make sure you do this in a python virtual env as discussed previously. If not, you may impact your system negatively. Please note that we are not responsible for fixing your computer. Naturally, you can also use a virtual machine, if you prefer. It is also important that we create a uniform development environment. In our case we create an empty directory called cm in which we place the bundles.

To see the bundles you can use

We will start with the basic cloudmesh functionality at this time and only install the shell and some common APIs.

These commands download and install the cloudmesh shell into your environment. It is important that you use the -e flag.

To see if it works you can use the command

You will see an output. If this does not work for you, and you cannot figure out the issue, please contact us so we can identify what went wrong.

For more information, please visit our Installation Instructions for Developers.

6.3 OUTPUT ☁

Cloudmesh provides a number of convenient APIs to make output easier or more fancy.

These APIs include

Console
Banner
Heading

$ mkdir cm

$ cd cm

$ pip install cloudmesh-installer

$ cloudmesh-installer bundles

$ cloudmesh-installer git clone cms

$ cloudmesh-installer install cms -e

$ cms help

VERBOSE

6.3.1 Console

Print is the usual function to output to the terminal. However, often we like to have colored output that helps us in the notification to the user. For this reason we have a simple Console class that has several built-in features. You can even switch and define your own color schemes.

In case of the error message we also have convenient flags that allow us to include the traceback in the output.

The prefix can be switched on and off with the prefix flag, while the traceflag switches on and off whether the trace should be included.

The verbosity of the output is controlled via variables that are stored in the ~/.cloudmesh directory.

For more features, see API: Console

6.3.2 Banner

In case you need a banner you can do this with

For more features, see API: Banner

from cloudmesh.common.console import Console

msg = "my message"

Console.ok(msg)      # prints a green message

Console.error(msg)   # prints a red message preceded by ERROR

Console.msg(msg)     # prints a regular black message

Console.error(msg, prefix=True, traceflag=True)

from cloudmesh.common.variables import Variables

variables = Variables()

variables['debug'] = True
variables['trace'] = True
variables['verbose'] = 10

from cloudmesh.common.util import banner

banner("my text")

6.3.3 Heading

A particularly useful function is HEADING(), which prints the method name.

The invocation of the HEADING() function in the method doit prints a banner with the name information. The reason we did not do it as a decorator is that you can place the HEADING() function in an arbitrary location of the method body.

For more features, see API: Heading

6.3.4 VERBOSE

VERBOSE is a very useful method allowing you to print a dictionary. Not only will it print the dict, but it will also provide you with the information in which file it is used and at which line number. It will even print the name of the dict that you use in your code.

To use this you will have to enable the debugging methods for cloudmesh as discussed in Section 6.3.1.

For more features, please see VERBOSE

6.3.5 Using print and pprint

In many cases it may be sufficient to use print and pprint for debugging. However, as the code gets big and you may forget where you placed print statements, or the print statements may have been added by others, we recommend that you use the VERBOSE function. If you use print or pprint we recommend using a unique prefix, such as:

from cloudmesh.common.util import HEADING

class Example(object):

    def doit(self):
        HEADING()
        print("Hello")

from cloudmesh.common.debug import VERBOSE

m = {"key": "value"}

VERBOSE(m)

from pprint import pprint

6.4 DICTIONARIES ☁

6.4.1 Dotdict

For simple dictionaries we sometimes like to simplify the notation with a .
instead of using the []:

You can achieve this with dotdict

Now you can either call

or

This is especially useful in if conditions as it may be easier to read and write

and is the same as

For more features, see API: dotdict

6.4.2FlatDict

d={"sample":"value"}

print("MYDEBUG:")

pprint(d)

#orwithprint

print("MYDEBUG:",d)

fromcloudmesh.common.dotdictimportdotdict

data={

"name":"Gregor"

}

data=dotdict(data)

data["name"]

data.name

ifdata.nameis"Gregor":

print("thisisquitereadable")

ifdata["name"]is"Gregor":

print("thisisquitereadable")

In some cases it is useful to be able to flatten out dictionaries that contain dicts within dicts. For this we can use FlatDict.

This will be converted to a dict with the following structure.

With sep you can change the separator between the nested dict attributes. For more features, see API: dotdict

6.4.3 Printing Dicts

In case we want to print dicts and lists of dicts in various formats, we have included a simple Printer that can print a dict in yaml, json, table, and csv format.

The function can even guess from the passed parameters what the input format is and uses the appropriate internal function.

A common example is

from cloudmesh.common.Flatdict import FlatDict

data = {
    "name": "Gregor",
    "address": {
        "city": "Bloomington",
        "state": "IN"
    }
}

flat = FlatDict(data, sep=".")

flat = {
    "name": "Gregor",
    "address.city": "Bloomington",
    "address.state": "IN"
}

from pprint import pprint

from cloudmesh.common.Printer import Printer

data = [
    {
        "name": "Gregor",
        "address": {
            "street": "Funny Lane 11",
            "city": "Cloudville"
        }
    },
    {
        "name": "Albert",
        "address": {
            "street": "Memory Lane 1901",
            "city": "Cloudnine"
        }
    }
]

For more features, see API: Printer

More examples are available in the source code as tests.

6.5 SHELL ☁

Python provides a sophisticated method for starting background processes. However, in many cases it is quite complex to interact with it. It also does not provide convenient wrappers that we can use to start them in a pythonic fashion. For this reason we have written a primitive Shell class that provides just enough functionality to be useful in many cases.

Let us review some examples where result is set to the output of the command being executed.

For many common commands, we provide built-in functions. For example:

The list includes the following (naturally the commands must be available on your OS; if a shell command is not available on your OS, please help us improve the code to either provide functions that work on your OS, or develop with us platform-independent functionality for a subset of the shell commands

pprint(data)

table = Printer.flatwrite(data,
                          sort_keys=["name"],
                          order=["name", "address.street", "address.city"],
                          header=["Name", "Street", "City"],
                          output='table')

print(table)

from cloudmesh.common.Shell import Shell

result = Shell.execute('pwd')
print(result)

result = Shell.execute('ls', ["-l", "-a"])
print(result)

result = Shell.execute('ls', "-l -a")
print(result)

result = Shell.ls("-aux")
print(result)

result = Shell.ls("-a", "-u", "-x")
print(result)

result = Shell.pwd()
print(result)

that we may benefit from).

VBoxManage(cls,*args)

bash(cls,*args)

blockdiag(cls,*args)

brew(cls,*args)

cat(cls,*args)

check_output(cls,*args,**kwargs)

check_python(cls)

cm(cls,*args)

cms(cls,*args)

command_exists(cls,name)

dialog(cls,*args)

edit(filename)

execute(cls,*args)

fgrep(cls,*args)

find_cygwin_executables(cls)

find_lines_with(cls,lines,what)

get_python(cls)

git(cls,*args)

grep(cls,*args)

head(cls,*args)

install(cls,name)

install(cls,name)

keystone(cls,*args)

kill(cls,*args)

live(cls,command,cwd=None)

ls(cls,*args)

mkdir(cls,directory)

mongod(cls,*args)

nosetests(cls,*args)

nova(cls,*args)

operating_system(cls)

pandoc(cls,*args)

ping(cls,host=None,count=1)

pip(cls,*args)

ps(cls,*args)

pwd(cls,*args)

rackdiag(cls,*args)

remove_line_with(cls,lines,what)

rm(cls,*args)

rsync(cls,*args)

scp(cls,*args)

sh(cls,*args)

sort(cls,*args)

ssh(cls,*args)

sudo(cls,*args)

tail(cls,*args)

terminal(cls,command='pwd')

terminal_type(cls)

unzip(cls,source_filename,dest_dir)

vagrant(cls,*args)

version(cls,name)

which(cls,command)

For more features, please see Shell

6.6 STOPWATCH ☁

Often you find yourself in a situation where you would like to measure the time between two events. We provide a simple StopWatch that allows you not only to measure a number of times, but also to print them out in a convenient format.

To print them, you can also use:

For more features, please see StopWatch

from cloudmesh.common.StopWatch import StopWatch

from time import sleep

StopWatch.start("test")
sleep(1)
StopWatch.stop("test")

print(StopWatch.get("test"))

StopWatch.benchmark()

6.7 CLOUDMESH COMMAND SHELL ☁

6.7.1 CMD5

Python's CMD (https://docs.python.org/2/library/cmd.html) is a very useful package to create command line shells. However, it does not allow the dynamic integration of newly defined commands. Furthermore, additions to CMD need to be done within the same source tree. To simplify developing commands by a number of people and to have a dynamic plugin mechanism, we developed cmd5. It is a rewrite of our earlier efforts in cloudmesh client and cmd3.

6.7.1.1 Resources

The source code for cmd5 is located in github:

https://github.com/cloudmesh/cmd5

We have discussed in Section 6.2 how to install cloudmesh as a developer and have access to the source code in a directory called cm. As you read this document we assume you are a developer and can skip the next section.

6.7.1.2 Installation from source

WARNING: DO NOT EXECUTE THIS IF YOU ARE A DEVELOPER OR YOUR ENVIRONMENT WILL NOT PROPERLY WORK.

However, if you are a user of cloudmesh you can install it with

6.7.1.3 Execution

To run the shell you can activate it with the cms command. cms stands for cloudmesh shell:

It will print the banner and enter the shell:

$ pip install cloudmesh-cmd5

(ENV2) $ cms

To see the list of commands you can say:

To see the manual page for a specific command, please use:

6.7.1.4 Create your own Extension

One of the most important features of CMD5 is its ability to extend it with new commands. This is done via packaged namespaces. We recommend you name it cloudmesh-mycommand, where mycommand is the name of the command that you like to create. This can easily be done while using the sys* cloudmesh command (we suggest you use a different name than gregor, maybe your first name):

It will download a template from cloudmesh called cloudmesh-bar and generate a new directory cloudmesh-gregor with all the needed files to create your own command and register it dynamically with cloudmesh. All you have to do is to cd into the directory and install the code:

Adding your own command is easy. It is important that all objects are defined in the command itself and that no global variables are used, in order to allow each shell command to stand alone. Naturally you should develop API libraries outside of the cloudmesh shell command and reuse them in order to keep the command code as small as possible. We place the command in:

+-------------------------------------------------------+
|   Cloudmesh CMD5 Shell                                 |
+-------------------------------------------------------+

cms>

cms> help

help COMMANDNAME

$ cms sys command generate gregor

$ cd cloudmesh-gregor
$ python setup.py install

# pip install .

cloudmesh/mycommand/command/gregor.py

Now you can go ahead and modify your command in that directory. It will look similar to (if you used the command name gregor):

An important difference to other CMD solutions is that our commands can leverage (besides the standard definition) docopts as a way to define the manual page. This allows us to use the arguments as a dict and use simple if conditions to interpret the command. Using docopts has the advantage that contributors are forced to think about the command and its options and document them from the start. Previously we did not use docopts but argparse and click. However, we noticed that for our contributors both systems lead to commands that were either not properly documented or the developers delivered ambiguous commands that resulted in confusion and wrong usage by subsequent users. Hence, we do recommend that you use docopts for documenting cmd5 commands. The transformation is enabled by the @command decorator that generates a manual page and creates a proper help message for the shell automatically. Thus there is no need to introduce a separate help method as would normally be needed in CMD, while reducing the effort it takes to contribute new commands in a dynamic fashion.

6.7.1.5 Bug: Quotes

We have one bug in cmd5 that relates to the use of quotes on the command line.

For example you need to say

from __future__ import print_function
from cloudmesh.shell.command import command
from cloudmesh.shell.command import PluginCommand

class GregorCommand(PluginCommand):

    @command
    def do_gregor(self, args, arguments):
        """
        ::

          Usage:
            gregor -f FILE
            gregor list

          This command does some useful things.

          Arguments:
            FILE   a file name

          Options:
            -f     specify the file
        """
        print(arguments)
        if arguments.FILE:
            print("You have used file:", arguments.FILE)
        return ""

$ cms gregor -f \"filename with spaces\"

If you would like to help us fix this, that would be great. It requires the use of shlex. Unfortunately we did not yet have time to fix this "feature".

6.8 EXERCISES ☁

When doing your assignment, make sure you label the programs appropriately with comments that clearly identify the assignment. Place all assignments in a folder on github named "cloudmesh-exercises".

For example, name the program solving E.Cloudmesh.Common.1 e-cloudmesh-1.py and so on. For more complex assignments you can name them as you like, as long as in the file you have a comment such as # fa19-516-000 E.Cloudmesh.Common.1 at the beginning of the file. Please do not store any screenshots of your working program in your git repository.

6.8.1 Cloudmesh Common

E.Cloudmesh.Common.1

Develop a program that demonstrates the use of banner, HEADING, and VERBOSE.

E.Cloudmesh.Common.2

Develop a program that demonstrates the use of dotdict.

E.Cloudmesh.Common.3

Develop a program that demonstrates the use of FlatDict.

E.Cloudmesh.Common.4

Develop a program that demonstrates the use of cloudmesh.common.Shell.

E.Cloudmesh.Common.5

Develop a program that demonstrates the use of cloudmesh.common.StopWatch.

6.8.2 Cloudmesh Shell

E.Cloudmesh.Shell.1

Install cmd5 and the command cms on your computer.

E.Cloudmesh.Shell.2

Write a new command with your first name as the command name.

E.Cloudmesh.Shell.3

Write a new command and experiment with docopt syntax and argument interpretation of the dict with if conditions.

E.Cloudmesh.Shell.4

If you have useful extensions that you would like us to add by default, please work with us.

E.Cloudmesh.Shell.5

At this time one needs to quote in some commands the " in the shell command line. Develop and test code that fixes this.

7 LIBRARIES

7.1 PYTHON MODULES ☁

Often you may need functionality that is not present in Python's standard library. In this case you have two options:

implement the features yourself
use a third-party library that has the desired features.

Often you can find a previous implementation of what you need. Since this is a common situation, there is a service supporting it: the Python Package Index (or PyPI for short).

Our task here is to install the autopep8 tool from PyPI. This will allow us to illustrate the use of virtual environments using the pyenv or virtualenv command, and installing and uninstalling PyPI packages using pip.

7.1.1 Updating Pip

It is important that you have the newest version of pip installed for your version of python. Let us assume your python is registered with python and you use pyenv, then you can update pip with

without interfering with a potentially system-wide installed version of pip that may be needed by the system default version of python. See the section about pyenv for more details.

7.1.2 Using pip to Install Packages

Let us now look at another important tool for Python development: the Python Package Index, or PyPI for short. PyPI provides a large set of third-party python packages. If you want to do something in python, first check pypi, as odds are someone already ran into the problem and created a package solving it.

pip install -U pip

In order to install a package from PyPI, use the pip command. We can search PyPI for packages:

It appears that the top two results are what we want, so install them:

This will cause pip to download the packages from PyPI, extract them, check their dependencies and install those as needed, then install the requested packages.

You can skip the '--trusted-host pypi.python.org' option if you have

patched urllib3 on Python 2.7.9.

7.1.3 GUI

7.1.3.1 GUI Zero

Install guizero with the following command:

For a comprehensive tutorial on guizero, click here.

7.1.3.2 Kivy

You can install Kivy on macOS as follows:

A hello world program for kivy is included in the cloudmesh.robot repository, which you can find here:

https://github.com/cloudmesh/cloudmesh.robot/tree/master/projects/kivy

To run the program, please download it or execute it in cloudmesh.robot as

$ pip search --trusted-host pypi.python.org autopep8 pylint

$ pip install --trusted-host pypi.python.org autopep8 pylint

sudo pip install guizero

brew install pkg-config sdl2 sdl2_image sdl2_ttf sdl2_mixer gstreamer

pip install -U Cython

pip install kivy

pip install pygame

follows:

To create standalone packages with kivy, please see:

7.1.4 Formatting and Checking Python Code

First, get the bad code:

Examine the code:

As you can see, this is very dense and hard to read. Cleaning it up by hand would be a time-consuming and error-prone process. Luckily, this is a common problem, so there exist a couple of packages to help in this situation.

7.1.5 Using autopep8

We can now run the bad code through autopep8 to fix formatting problems:

Let us look at the result. This is considerably better than before. It is easy to tell what the example1 and example2 functions are doing.

It is a good idea to develop a habit of using autopep8 in your python-development workflow. For instance: use autopep8 to check a file, and if it passes, make any changes in place using the -i flag:

If you use pyCharm you have the ability to use a similar function while pressing on Inspect Code.

7.1.6 Writing Python 3 Compatible Code

cd cloudmesh.robot/projects/kivy
python swim.py

- https://kivy.org/docs/guide/packaging-osx.html

$ wget --no-check-certificate http://git.io/pXqb -O bad_code_example.py

$ emacs bad_code_example.py

$ autopep8 bad_code_example.py > code_example_autopep8.py

$ autopep8 file.py     # check output to see if it passes

$ autopep8 -i file.py  # update in place

To write python 2 and 3 compatible code we recommend that you take a look at: http://python-future.org/compatible_idioms.html
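As a minimal sketch of what such compatible code looks like (only two of the idioms from that page are shown here), the Python 3 behavior is imported explicitly at the top of the file:

from __future__ import print_function, division

# with these imports the line below behaves the same under Python 2 and Python 3
print("one half is", 1 / 2)   # prints 0.5 in both versions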

7.1.7 Using Python on FutureSystems

This is only important if you use FutureSystems resources.

In order to use Python you must log into your FutureSystems account. Then at the shell prompt execute the following command:

This will make the python and virtualenv commands available to you.

The details of what the module load command does are described in the future lesson modules.

7.1.8 Ecosystem

7.1.8.1 pypi

The Python Package Index is a large repository of software for the Python programming language containing a large number of packages, many of which can be found on pypi. The nice thing about pypi is that many packages can be installed with the program 'pip'.

To do so you have to locate the <package_name>, for example with the search function in pypi, and say on the command line:

where package_name is the string name of the package. An example would be the package called cloudmesh_client, which you can install with:

If all goes well the package will be installed.

7.1.8.2 Alternative Installations

$ module load python

$ pip install <package_name>

$ pip install cloudmesh_client

The basic installation of python is provided by python.org. However othersclaim to have alternative environments that allow you to install python. Thisincludes

CanopyAnacondaIronPython

Typically they include not only the python compiler but also several usefulpackages.Itisfinetousesuchenvironmentsfortheclass,butitshouldbenotedthat in both cases not every python librarymay be available for install in thegivenenvironment.Forexampleifyouneedtousecloudmeshclient,itmaynotbeavailableascondaorCanopypackage.This isalso thecaseformanyothercloudrelatedandusefulpythonlibraries.Hence,wedorecommendthat ifyouare new to python to use the distribution form python.org, and use pip andvirtualenv.

Additionally some python version have platform specific libraries ordependencies.Forexamplecocalibraries,.NETorotherframeworksareexamples.Fortheassignmentsandtheprojectssuchplatformdependentlibrariesarenottobeused.

If however you can write a platform independent code that works on Linux,macOSandWindowswhileusingthepython.orgversionbutdevelopitwithanyoftheothertoolsthatisjustfine.Howeveritisuptoyoutoguaranteethatthisindependence is maintained and implemented. You do have to writerequirements.txtfilesthatwillinstallthenecessarypythonlibrariesinaplatformindependent fashion.ThehomeworkassignmentPRG1hasevena requirementtodoso.

Inordertoprovideplatformindependencewehavegivenintheclassaminimalpythonversionthatwehavetestedwithhundredsofstudents:python.org.Ifyouuseanyotherversion,thatisyourdecision.Additionallysomestudentsnotonlyusepython.orgbuthaveusediPythonwhichisfinetoo.Howeverthisclassisnotonlyaboutpython,butalsoabouthowtohaveyourcoderunonanyplatform.Thehomeworkisdesignedsothatyoucanidentifyasetupthatworksforyou.

Howeverwehaveconcernsifyouforexamplewantedtousechameleoncloud

whichwerequireyoutoaccesswithcloudmesh.cloudmeshisnotavailableasconda,canopy,orotherframeworkpackage.Cloudmeshclientisavailableformpypiwhichisstandardandshouldbesupportedbytheframeworks.Wehavenottestedcloudmeshonanyotherpythonversionthenpython.orgwhichistheopensourcecommunitystandard.Noneoftheotherversionsarestandard.

In factwe had students over the summer using canopyon theirmachines andtheygotconfusedas theynowhadmultiplepythonversionsanddidnotknowhow to switchbetween themandactivate the correct version.Certainly if youknowhowtodothat,thanfeelfreetousecanopy,andifyouwanttousecanopyall this isuptoyou.However thehomeworkandprojectrequiresyoutomakeyourprogramportabletopython.org.Ifyouknowhowtodothatevenifyouusecanopy,anaconda,oranyotherpythonversionthatisfine.Graderswilltestyourprograms on a python.org installation and not canopy, anaconda, ironpythonwhileusingvirtualenv. It isobviouswhy. Ifyoudonotknowthatansweryoumaywant to thinkabout thatevery timetheytestaprogramtheyneedtodoanewvirtualenvandrunvanillapythoninit.Ifweweretoruntwoinstallsinthesamesystem,thiswillnotworkaswedonotknowifonestudentwillcauseasideeffect foranother.Thusweas instructorsdonot justhave to lookatyourcode but code of hundreds of students with different setups. This is a nonscalablesolutionaseverytimewetestoutcodefromastudentwewouldhavetowipeout theOS, install itnew, installannewversionofwhateverpythonyouhave elected, become familiarwith that version and so on and on.This is thereason why the open source community is using python.org.We follow bestpractices.Usingotherversionsisnotacommunitybestpractice,butmayworkforanindividual.

We have however in regards to using other python version additional bonusprojectssuchas

deployrunanddocumentcloudmeshonironpythondeploy run and document cloudmesh on anaconda, develop script togenerateacondapackageformgithubdeployrunanddocumentcloudmeshoncanopy,developscripttogenerateacondapackageformgithubdeployrunanddocumentcloudmeshonironpythonotherdocumentationthatwouldbeuseful

7.1.9Resources

IfyouareunfamiliarwithprogramminginPython,wealsoreferyoutosomeofthenumerousonlineresources.YoumaywishtostartwithLearnPythonorthebookLearnPythontheHardWay.OtheroptionsincludeTutorialsPointorCodeAcademy, and the Python wiki page contains a long list of references forlearningaswell.Additionalresourcesinclude:

https://virtualenvwrapper.readthedocs.iohttps://github.com/yyuu/pyenvhttps://amaral.northwestern.edu/resources/guides/pyenv-tutorialhttps://godjango.com/96-django-and-python-3-how-to-setup-pyenv-for-multiple-pythons/https://www.accelebrate.com/blog/the-many-faces-of-python-and-how-to-manage-them/http://ivory.idyll.org/articles/advanced-swc/http://python.net/~goodger/projects/pycon/2007/idiomatic/handout.htmlhttp://www.youtube.com/watch?v=0vJJlVBVTFghttp://www.korokithakis.net/tutorials/python/http://www.afterhoursprogramming.com/tutorial/Python/Introduction/http://www.greenteapress.com/thinkpython/thinkCSpy.pdfhttps://docs.python.org/3.3/tutorial/modules.htmlhttps://www.learnpython.org/en/Modules/_and/_Packageshttps://docs.python.org/2/library/datetime.htmlhttps://chrisalbon.com/python/strings/_to/_datetime.html

Averylonglistofusefulinformationarealsoavailablefrom

https://github.com/vinta/awesome-pythonhttps://github.com/rasbt/python_reference

This list may be useful as it also contains links to data visualization andmanipulationlibraries,andAItoolsandlibraries.Pleasenotethatforthisclassyoucanreusesuchlibrariesifnototherwisestated.

7.1.9.1JupyterNotebookTutorials

AShort Introduction toJupyterNotebooksandNumPyToviewthenotebook,open this link in a background tab https://nbviewer.jupyter.org/ and copy andpaste the following link in the URL input areahttps://cloudmesh.github.io/classes/lesson/prg/Jupyter-NumPy-tutorial-I523-F2017.ipynbThenhitGo.

7.1.10Exercises

E.Python.Lib.1:

Write a python program called iterate.py that accepts an integer nfromthecommandline.Passthisintegertoafunctioncallediterate.

Theiteratefunctionshouldtheniteratefrom1ton.Ifthei-thnumberis a multiple of three, print multiple of 3, if a multiple of 5 printmultipleof5,ifamultipleofbothprintmultipleof3and5,elseprintthevalue.

E:Python.Lib.2:

1. Createapyenvorvirtualenv~/ENV

2. Modify your ~/.bashrc shell file to activate your environmentuponlogin.

3. Installthedocoptpythonpackageusingpip

4. Write a program that uses docopt to define a commandlineprogram.Hint:modifytheiterateprogram.

5. Demonstratetheprogramworks.

7.2 DATA MANAGEMENT ☁

Obviously when dealing with big data we may not only be dealing with data in one format but in many different formats. It is important that you will be able to master such formats and seamlessly integrate them in your analysis. Thus we provide some simple examples of which different data formats exist and how to use them.

7.2.1 Formats

7.2.1.1 Pickle

Python pickle allows you to save data in a python native format into a file that can later be read in by other programs. However, the data format may not be portable among different python versions, thus the format is often not suitable to store information. Instead we recommend for standard data to use either json or yaml.

To read it back in use

7.2.1.2 Text Files

To read text files into a variable called content you can use

You can also use the following code while using the convenient with statement

To split up the lines of the file into an array you can do

This can also be done with the built-in readlines function

In case the file is too big you will want to read the file line by line:

import pickle

flavor = {
    "small": 100,
    "medium": 1000,
    "large": 10000
}

pickle.dump(flavor, open("data.p", "wb"))

flavor = pickle.load(open("data.p", "rb"))

content = open('filename.txt', 'r').read()

with open('filename.txt', 'r') as file:
    content = file.read()

with open('filename.txt', 'r') as file:
    lines = file.read().splitlines()

lines = open('filename.txt', 'r').readlines()

7.2.1.3 CSV Files

Often data is contained as comma separated values (CSV) within a file. To read such files you can use the csv package.

Using pandas you can read them as follows.

There are many other modules and libraries that include CSV read functions. In case you need to split a single line by comma, you may also use the split function. However, remember it will split at every comma, including those contained in quotes. So this method, although looking convenient at first, has limitations.
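A small sketch of that limitation on a hypothetical line, contrasting split with the csv module:

import csv

line = 'Laszewski,"Bloomington, IN",USA'

print(line.split(','))
# ['Laszewski', '"Bloomington', ' IN"', 'USA']  <- the quoted field is broken apart

print(next(csv.reader([line])))
# ['Laszewski', 'Bloomington, IN', 'USA']       <- csv respects the quotes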

7.2.1.4 Excel spreadsheets

Pandas contains a method to read Excel files

7.2.1.5 YAML

YAML is a very important format as it allows you to easily structure data in hierarchical fields. It is frequently used to coordinate programs while using yaml as the specification for configuration files, but also for data files. To read in a yaml file the following code can be used

The nice part is that this code can also be used to verify if a file is valid yaml. To write data out we can use

with open('filename.txt', 'r') as file:
    for line in file:
        print(line)

import csv

with open('data.csv', 'r') as f:
    contents = csv.reader(f)
    for row in contents:
        print(row)

import pandas as pd

df = pd.read_csv("example.csv")

import pandas as pd

filename = 'data.xlsx'
data = pd.ExcelFile(filename)
df = data.parse('Sheet1')

import yaml

with open('data.yaml', 'r') as f:
    content = yaml.load(f)

The flow style set to false formats the data in a nicely readable fashion with indentations.

7.2.1.6 JSON

Reading a JSON file into a dictionary works analogously with the json module from the standard library:

7.2.1.7 XML

The XML format is extensively used to transport data across the web. It has a hierarchical data format, and can be represented in the form of a tree.

A sample XML data set looks like:

Python provides the ElementTree XML API to parse and create XML data.

Importing XML data from a file:

Reading XML data from a string directly:

Iterating over child nodes in a root:

Modifying XML data using ElementTree:

Modifying text within a tag of an element using the .text attribute:

with open('data.yml', 'w') as f:
    yaml.dump(data, f, default_flow_style=False)

import json

with open('strings.json') as f:
    content = json.load(f)
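Writing data out works analogously with json.dump; a minimal sketch, reusing the variable name content from the reading example:

import json

content = {"name": "Gregor", "address": {"city": "Bloomington", "state": "IN"}}

with open('strings.json', 'w') as f:
    json.dump(content, f, indent=2)   # indent makes the file human readable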

<data>
  <items>
    <item name="item-1"></item>
    <item name="item-2"></item>
    <item name="item-3"></item>
  </items>
</data>

import xml.etree.ElementTree as ET

tree = ET.parse('data.xml')
root = tree.getroot()

root = ET.fromstring(data_as_string)

for child in root:
    print(child.tag, child.attrib)

tag.text = new_data

tree.write('output.xml')

Adding or modifying an attribute using the .set() method:

Other Python modules used for parsing XML data include

minidom: https://docs.python.org/3/library/xml.dom.minidom.html
BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/

7.2.1.8 RDF

To read RDF files you will need to install RDFLib with

This will then allow you to read RDF files

Good examples on using RDF are provided on the RDFLib Web page at https://github.com/RDFLib/rdflib

From the Web page we showcase also how to directly process RDF data from the Web

7.2.1.9 PDF

The Portable Document Format (PDF) has been made available by Adobe Inc. royalty free. This has enabled PDF to become a worldwide adopted format that also has been standardized in 2008 (ISO/IEC 32000-1:2008, https://www.iso.org/standard/51502.html). A lot of research is published in papers, making PDF one of the de-facto standards for publishing. However, PDF is difficult to parse and is focused on high-quality output instead of data representation. Nevertheless, tools to manipulate PDF exist:

tag.set('key', 'value')

tree.write('output.xml')

$ pip install rdflib

from rdflib.graph import Graph

g = Graph()
g.parse("filename.rdf", format="format")

for entry in g:
    print(entry)

import rdflib

g = rdflib.Graph()
g.load('http://dbpedia.org/resource/Semantic_Web')

for s, p, o in g:
    print(s, p, o)

PDFMiner

https://pypi.python.org/pypi/pdfminer/ allows the simple translation of PDF into text that can then be further mined. The manual page helps to demonstrate some examples: http://euske.github.io/pdfminer/index.html. A small sketch of its use follows after this list.

pdf-parser.py

https://blog.didierstevens.com/programs/pdf-tools/ parses pdf documents and identifies some structural elements that can then be further processed.

If you know about other tools, let us know.
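The following minimal sketch assumes the pdfminer.six distribution, which provides a high-level text extraction helper; the input file name is hypothetical:

# pip install pdfminer.six
from pdfminer.high_level import extract_text

text = extract_text('paper.pdf')   # hypothetical input file
print(text[:500])                  # show the first 500 extracted characters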

7.2.1.10 HTML

A very powerful library to parse HTML Web pages is provided with https://www.crummy.com/software/BeautifulSoup/

More details about it are provided in the documentation page https://www.crummy.com/software/BeautifulSoup/bs4/doc/

TODO: Students can contribute a section

BeautifulSoup is a python library to parse, process and edit HTML documents.

To install BeautifulSoup, use the pip command as follows:

In order to process HTML documents, a parser is required. Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers like the lxml parser, which is commonly used [1].

The following command can be used to install the lxml parser

To begin with, we import the package and instantiate an object as follows for an html document html_handle:

$ pip install beautifulsoup4

$ pip install lxml

Now, we will discuss a few functions, attributes and methods of BeautifulSoup.

prettify function

The prettify() method will turn a Beautiful Soup parse tree into a nicely formatted Unicode string, with a separate line for each HTML/XML tag and string. It is analogous to the pprint() function. The object created above can be viewed by printing the prettified version of the document as follows:

tag object

A tag object refers to tags in the HTML document. It is possible to go down to the inner levels of the DOM tree. To access a tag div under the tag body, it can be done as follows:

The attrs attribute of the tag object returns a dictionary of all the defined attributes of the HTML tag as keys.

has_attr() method

To check if a tag object has a specific attribute, the has_attr() method can be used.

tag object attributes

name - This attribute returns the name of the tag selected.
attrs - This attribute returns a dictionary of all the defined attributes of the HTML tag as keys.
contents - This attribute returns a list of contents enclosed within the HTML tag.
string - This attribute returns the text enclosed within the HTML tag. This returns None if there are multiple children.
strings - This overcomes the limitation of string and returns a generator of all

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_handle, 'lxml')

print(soup.prettify())

body_div = soup.body.div

print(body_div.prettify())

if body_div.has_attr('p'):
    print('The value of \'p\' attribute is:', body_div['p'])

strings enclosed within the given tag.

The following code showcases usage of the attributes discussed above:

Searching the Tree

The find() function takes a filter expression as argument and returns the first match found.
The find_all() function returns a list of all the matching elements.
The select() function can be used to search the tree using CSS selectors.

7.2.1.11 ConfigParser

TODO: Students can contribute a section

https://pymotw.com/2/ConfigParser/
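Until a section is contributed, the following minimal sketch shows how an INI-style file can be read with the configparser module from the standard library; the file name, section, and keys are hypothetical:

import configparser

config = configparser.ConfigParser()
config.read('cloudmesh.ini')        # hypothetical INI file

# sections behave like dictionaries
name = config['server'].get('name', fallback='localhost')
port = config['server'].getint('port', fallback=8080)
print(name, port)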

7.2.1.12 ConfigDict

https://github.com/cloudmesh/cloudmesh-common/blob/master/cloudmesh/common/ConfigDict.py

7.2.2 Encryption

body_tag = soup.body

print('Name of the tag:', body_tag.name)

attrs = body_tag.attrs
print('The attributes defined for body tag are:', attrs)

print('The contents of \'body\' tag are:\n', body_tag.contents)

print('The string value enclosed in \'body\' tag is:', body_tag.string)

for s in body_tag.strings:
    print(repr(s))

search_elem = soup.find('a')
print(search_elem.prettify())

search_elems = soup.find_all("a", class_="sample")
pprint(search_elems)

# Select `a` tag with class `sample`
a_tag_elems = soup.select('a.sample')
print(a_tag_elems)

Often we need to protect the information stored in a file. This is achieved with encryption. There are many methods of supporting encryption, and even if a file is encrypted it may be a target of attacks. Thus it is not only important to encrypt data that you do not want others to see, but also to make sure that the system on which the data is hosted is secure. This is especially important if we talk about big data, which has a potentially large effect if it gets into the wrong hands.

To illustrate one type of encryption that is non-trivial we have chosen to demonstrate how to encrypt a file with an ssh key. In case you have openssl installed on your system, this can be achieved as follows.

Most important here are Step 4, which encrypts the file, and Step 5, which decrypts the file. Using the Python os module it is straightforward to implement this. However, we are providing in cloudmesh a convenient class that makes the use in python very simple.

In our class we initialize it with the locations of the file that is to be encrypted and decrypted. To initiate that action just call the methods encrypt and decrypt.

7.2.3 Database Access

TODO: Students: define conventional database access section

see: https://www.tutorialspoint.com/python/python_database_access.htm

7.2.4 SQLite

#!/bin/sh

# Step 1. Creating a file with data
echo "Big Data is the future." > file.txt

# Step 2. Create the pem
openssl rsa -in ~/.ssh/id_rsa -pubout > ~/.ssh/id_rsa.pub.pem

# Step 3. look at the pem file to illustrate how it looks like (optional)
cat ~/.ssh/id_rsa.pub.pem

# Step 4. encrypt the file into secret.txt
openssl rsautl -encrypt -pubin -inkey ~/.ssh/id_rsa.pub.pem -in file.txt -out secret.txt

# Step 5. decrypt the file and print the contents to stdout
openssl rsautl -decrypt -inkey ~/.ssh/id_rsa -in secret.txt

from cloudmesh.common.ssh.encrypt import EncryptFile

e = EncryptFile('file.txt', 'secret.txt')

e.encrypt()
e.decrypt()

TODO: Students can contribute to this section

https://www.sqlite.org/index.html

https://docs.python.org/3/library/sqlite3.html
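Until this section is contributed, the following minimal sketch shows the sqlite3 module from the standard library; the database file and table are hypothetical:

import sqlite3

con = sqlite3.connect('experiments.db')   # the file is created if it does not exist
cur = con.cursor()

cur.execute('CREATE TABLE IF NOT EXISTS secchi (depth REAL, visible INTEGER)')
cur.execute('INSERT INTO secchi VALUES (?, ?)', (1.5, 1))
con.commit()

for row in cur.execute('SELECT depth, visible FROM secchi'):
    print(row)

con.close()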

7.2.4.1 Exercises

E:Encryption.1:

Test the shell script to replicate how this example works.

E:Encryption.2:

Test the cloudmesh encryption class.

E:Encryption.3:

What other encryption methods exist? Can you provide an example and contribute to the section?

E:Encryption.4:

What is the issue of encryption that makes it challenging for Big Data?

E:Encryption.5:

Given a test data set with many text files, how long will it take to encrypt and decrypt them on various machines? Write a benchmark that you test. Develop this benchmark as a group, and test out the time it takes to execute it on a variety of platforms.

7.3 PLOTTING WITH MATPLOTLIB ☁

A brief overview of plotting with matplotlib along with examples is provided. First matplotlib must be installed, which can be accomplished with pip install as follows:

$ pip install matplotlib

We will start by plotting a simple line graph using built-in numpy functions for sine and cosine. The first step is to import the proper libraries, shown next.

Next we will define the values for the x axis; we do this with the linspace option in numpy. The first two parameters are the starting and ending points; these must be scalars. The third parameter is optional and defines the number of samples to be generated between the starting and ending points; this value must be an integer. Additional parameters for the linspace utility can be found here:

Now we will use the sine and cosine functions in order to generate y values; for this we will use the values of x for the argument of both our sine and cosine functions, i.e. cos(x).

You can display the values of the three parameters we have defined by typing them in a python shell.

Having defined x and y values we can generate a line plot, and since we imported matplotlib.pyplot as plt we simply use plt.plot.

We can display the plot using plt.show(), which will pop up a figure displaying the plot defined.

Additionally we can add the sine line to our line graph by entering the following.

Invoking plt.show() now will show a figure with both sine and cosine lines displayed. Now that we have a figure generated it would be useful to label the x

import numpy as np

import matplotlib.pyplot as plt

x = np.linspace(-np.pi, np.pi, 16)

cos = np.cos(x)

sin = np.sin(x)

x

array([-3.14159265, -2.72271363, -2.30383461, -1.88495559, -1.46607657,
       -1.04719755, -0.62831853, -0.20943951,  0.20943951,  0.62831853,
        1.04719755,  1.46607657,  1.88495559,  2.30383461,  2.72271363,
        3.14159265])

plt.plot(x, cos)

plt.show()

plt.plot(x, sin)

and y axis and provide a title. This is done by the following three commands:

Along with axis labels and a title, another useful figure feature may be a legend. In order to create a legend you must first designate a label for the line; this label will be what shows up in the legend. The label is defined in the initial plt.plot(x, y) instance; next is an example.

Then, in order to display the legend, the following command is issued:

The location is specified by using upper or lower and left or right. Naturally all these commands can be combined and put in a file with the .py extension and run from the command line.

An example of a bar chart is presented next using data from [T:fast-cars].

plt.xlabel("X-label(units)")

plt.ylabel("Y-label(units)")

plt.title("AcleverTitleforyourFigure")

plt.plot(x,cos,label="cosine")

plt.legend(loc='upperright')

importnumpyasnp

importmatplotlib.pyplotasplt

x=np.linspace(-np.pi,np.pi,16)

cos=np.cos(x)

sin=np.sin(x)

plt.plot(x,cos,label="cosine")

plt.plot(x,sin,label="sine")

plt.xlabel("X-label(units)")

plt.ylabel("Y-label(units)")

plt.title("AcleverTitleforyourFigure")

plt.legend(loc='upperright')

plt.show()

import matplotlib.pyplot as plt

x = ['Toyota Prius',
     'Tesla Roadster',
     'Bugatti Veyron',
     'Honda Civic',
     'Lamborghini Aventador']

horse_power = [120, 288, 1200, 158, 695]

x_pos = [i for i, _ in enumerate(x)]

plt.bar(x_pos, horse_power, color='green')

plt.xlabel("Car Model")
plt.ylabel("Horse Power (Hp)")
plt.title("Horse Power for Selected Cars")

You can customize plots further by using plt.style.use(), in python 3. If you provide the following command inside a python command shell you will see a list of available styles.

An example of using a predefined style is shown next.

Up to this point we have only showcased how to display figures through python output; however, web browsers are a popular way to display figures. One example is Bokeh; the following lines can be entered in a python shell and the figure is output to a browser.

7.4 DOCOPTS ☁

When we want to design command line arguments for python programs we have many options. However, as our approach is to create documentation first, docopts provides a good approach for Python. The code for it is located at

https://github.com/docopt/docopt

It can be installed with

A sample program is located at

https://github.com/docopt/docopt/blob/master/examples/options_example.py

A sample program of using docopts for our purposes looks as follows

plt.xticks(x_pos, x)

plt.show()

print(plt.style.available)

plt.style.use('seaborn')

from bokeh.io import show
from bokeh.plotting import figure

x_values = [1, 2, 3, 4, 5]
y_values = [6, 7, 2, 3, 6]

p = figure()
p.circle(x=x_values, y=y_values)

show(p)

$ pip install docopt

Another good feature of using docopts is that we can use the same verbaldescriptioninotherprogramminglanguagesasshowcasedinthisbook.

7.5OPENCV☁�

LearningObjectives

Providesomesimplecalculationssowecantestcloudservices.ShowcasesomeelementaryOpenCVfunctionsShowanenvironmentalimageanalysisapplicationusingSecchidisks

OpenCV (OpenSourceComputerVisionLibrary) is a library of thousands ofalgorithmsforvariousapplicationsincomputervisionandmachinelearning.Ithas C++, C, Python, Java and MATLAB interfaces and supports Windows,Linux,AndroidandMacOS. In this section,wewill explainbasic featuresofthislibrary,includingtheimplementationofasimpleexample.

7.5.1Overview

OpenCVhascountlessfunctionsforimageandvideosprocessing.Thepipelinestarts with reading the images, low-level operations on pixel values,preprocessinge.g.denoising,andthenmultiplestepsofhigher-leveloperations

"""CloudmeshVMmanagement

Usage:

cm-govmstartNAME[--cloud=CLOUD]

cm-govmstopNAME[--cloud=CLOUD]

cm-goset--cloud=CLOUD

cm-go-h|--help

cm-go--version

Options:

-h--helpShowthisscreen.

--versionShowversion.

--cloud=CLOUDThenameofthecloud.

--mooredMoored(anchored)mine.

--driftingDriftingmine.

ARGUMENTS:

NAMEThenameoftheVM`

"""

fromdocoptimportdocopt

if__name__=='__main__':

arguments=docopt(__doc__,version='1.0.0rc2')

print(arguments)

which vary depending on the application. OpenCV covers the whole pipeline, especially providing a large set of library functions for high-level operations. A simpler library for image processing in Python is Scipy's multi-dimensional image processing package (scipy.ndimage).

7.5.2 Installation

OpenCV for Python can be installed on Linux in multiple ways, namely PyPI (Python Package Index), the Linux package manager (apt-get for Ubuntu), the Conda package manager, and also building from source. You are recommended to use PyPI. Here's the command that you need to run:

This was tested on Ubuntu 16.04 with a fresh Python 3.6 virtual environment. In order to test, import the module in the Python command line:

If it does not raise an error, it is installed correctly. Otherwise, try to solve the error.
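A quick additional check is to print the version that the import picked up:

import cv2
print(cv2.__version__)   # e.g. 3.x or 4.x, depending on the wheel that was installed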

For installation on Windows, see:

https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_setup/py_setup_in_windows/py_setup_in_windows.html#install-opencv-python-in-windows

Note that building from source can take a long time and may not be feasible for deploying to limited platforms such as Raspberry Pi.

7.5.3 A Simple Example

In this example, an image is loaded. A simple processing step is performed, and the result is written to a new image.

7.5.3.1 Loading an image

$ pip install opencv-python

import cv2

%matplotlib inline
import cv2

The image was downloaded from the USC standard database:

http://sipi.usc.edu/database/database.php?volume=misc&image=9

7.5.3.2 Displaying the image

The image is saved in a numpy array. Each pixel is represented with 3 values (R, G, B). This provides you with access to manipulate the image at the level of single pixels. You can display the image using the imshow function as well as Matplotlib's imshow function.
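A small sketch of that pixel-level access (note that cv2.imread stores the channels in B, G, R order):

px = img[100, 100]               # the three channel values of one pixel (B, G, R)
print(px)

img[100, 100] = [255, 255, 255]  # set that pixel to white
blue = img[100, 100, 0]          # access a single channel value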

You can display the image using the imshow function:

or you can use Matplotlib. If you have not installed Matplotlib before, install it using:

Now you can use:

which results in Figure 1

Figure 1: Image display

img = cv2.imread('images/opencv/4.2.01.tiff')

cv2.imshow('Original', img)
cv2.waitKey(0)
cv2.destroyAllWindows()

$ pip install matplotlib

import matplotlib.pyplot as plt

plt.imshow(img)

7.5.3.3 Scaling and Rotation

Scaling (resizing) the image relative to different axes:

res = cv2.resize(img,
                 None,
                 fx=1.2,
                 fy=0.7,
                 interpolation=cv2.INTER_CUBIC)
plt.imshow(res)

which results in Figure 2.

Figure 2: Scaling and rotation

Rotation of the image by an angle t:

rows, cols, _ = img.shape
t = 45
M = cv2.getRotationMatrix2D((cols/2, rows/2), t, 1)
dst = cv2.warpAffine(img, M, (cols, rows))
plt.imshow(dst)

which results in Figure 3.

Figure 3: Rotated image

7.5.3.4 Gray-scaling

img2 = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
plt.imshow(img2, cmap='gray')

which results in Figure 4.

Figure 4: Gray-scaling

7.5.3.5 Image Thresholding

ret, thresh = cv2.threshold(img2, 127, 255, cv2.THRESH_BINARY)
plt.subplot(1, 2, 1), plt.imshow(img2, cmap='gray')
plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')

which results in Figure 5.

Figure 5: Image Thresholding

7.5.3.6 Edge Detection

Edge detection using the Canny edge detection algorithm:

edges = cv2.Canny(img2, 100, 200)
plt.subplot(121), plt.imshow(img2, cmap='gray')
plt.subplot(122), plt.imshow(edges, cmap='gray')

which results in Figure 6.

Figure 6: Edge detection

7.5.4 Additional Features

OpenCV has implementations of many machine learning techniques such as KMeans and Support Vector Machines that can be put into use with only a few lines of code. It also has functions especially for video analysis, feature detection, object recognition, and many more. You can find out more about them on their website.

OpenCV (https://docs.opencv.org/3.0-beta/index.html) was initially developed for C++ and still has a focus on that language, but it is still one of the most valuable image processing libraries in Python.

7.6 SECCHI DISK ☁

We are developing an autonomous robot boat that you can be part of developing within this class. The robot boat is actually measuring turbidity or water clarity. Traditionally this has been done with a Secchi disk. The use of the Secchi disk is as follows:

1. Lower the Secchi disk into the water.
2. Measure the point when you can no longer see it.
3. Record the depth at various levels and plot it in a geographical 3D map.

One of the things we can do is take a video of the measurement instead of a human recording it. Then we can analyse the video automatically to see how deep the disk was lowered. This is a classical image analysis program. You are encouraged to identify algorithms that can identify the depth. The simplest seems to be to compute a histogram at a variety of depth steps and measure when the histogram no longer changes significantly. The depth at that image will be the measurement we look for.

Thus if we analyse the images we need to look at the image and identify the numbers on the measuring tape, as well as the visibility of the disk.

To showcase how such a disk looks, we refer to the image showing different Secchi disks. For our purpose the black-white contrast Secchi disk works well. See Figure 7.

Figure 7: Secchi disk types. A marine style on the left and the freshwater version on the right (Wikipedia).

More information about the Secchi disk can be found at:

https://en.wikipedia.org/wiki/Secchi_disk

We have included next a couple of examples using some obviously useful OpenCV methods. Surprisingly, edge detection, which comes to mind first for identifying whether we can still see the disk, seems too complicated to use for the analysis. At this time we believe the histogram will be sufficient.

Please inspect our examples.
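One possible way to quantify the idea that "the histogram no longer changes significantly" is sketched below. This is only an illustrative helper, not part of the robot's actual code; the correlation metric and any threshold you apply to its result are assumptions you would have to tune against the recorded video.

import cv2

def histogram_change(frame_a, frame_b):
    # Compare grayscale histograms of two frames; a correlation close to 1
    # means the image content (and thus the visible disk) barely changed.
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    h_a = cv2.calcHist([gray_a], [0], None, [256], [0, 256])
    h_b = cv2.calcHist([gray_b], [0], None, [256], [0, 256])
    cv2.normalize(h_a, h_a)
    cv2.normalize(h_b, h_b)
    return cv2.compareHist(h_a, h_b, cv2.HISTCMP_CORREL)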

7.6.1 Setup for OSX

First let us set up the OpenCV environment for OSX. Naturally you will have to update the versions based on your versions of Python. When we tried the install of OpenCV on macOS, the setup was slightly more complex than for other packages. This may have changed by now and if you have improved instructions, please let us know. However, we do not want to install it via Anaconda for the obvious reason that Anaconda installs too many other things.

import os, sys
from os.path import expanduser

home = expanduser("~")
sys.path.append('/usr/local/Cellar/opencv/3.3.1_1/lib/python3.6/site-packages/')
sys.path.append(home + '/.pyenv/versions/OPENCV/lib/python3.6/site-packages/')

import cv2
cv2.__version__

!pip install numpy > tmp.log
!pip install matplotlib >> tmp.log

7.6.2 Step 1: Record the video

Record the video on the robot.

We have actually done this for you and will provide you with images and videos if you are interested in analyzing them. See Figure 8.

7.6.3 Step 2: Analyse the images from the Video

For now we just selected 4 images from the video:

%matplotlib inline
import cv2
import matplotlib.pyplot as plt

img1 = cv2.imread('secchi/secchi1.png')
img2 = cv2.imread('secchi/secchi2.png')
img3 = cv2.imread('secchi/secchi3.png')
img4 = cv2.imread('secchi/secchi4.png')

figures = []
fig = plt.figure(figsize=(18, 16))
for i in range(1, 13):
    figures.append(fig.add_subplot(4, 3, i))
count = 0
for img in [img1, img2, img3, img4]:
    figures[count].imshow(img)
    color = ('b', 'g', 'r')
    for i, col in enumerate(color):
        histr = cv2.calcHist([img], [i], None, [256], [0, 256])
        figures[count + 1].plot(histr, color=col)
    figures[count + 2].hist(img.ravel(), 256, [0, 256])
    count += 3

print("Legend")
print("First column = image of Secchi disk")
print("Second column = histogram of colors in image")
print("Third column = histogram of all values")

plt.show()

Figure 8: Histogram

7.6.3.1 Image Thresholding

See Figure 9, Figure 10, Figure 11, Figure 12.

def threshold(img):
    ret, thresh = cv2.threshold(img, 150, 255, cv2.THRESH_BINARY)
    plt.subplot(1, 2, 1), plt.imshow(img, cmap='gray')
    plt.subplot(1, 2, 2), plt.imshow(thresh, cmap='gray')

threshold(img1)
threshold(img2)
threshold(img3)
threshold(img4)

Figure 9: Threshold 1

Figure 10: Threshold 2

Figure 11: Threshold 3

Figure 12: Threshold 4

7.6.3.2 Edge Detection

See Figure 13, Figure 14, Figure 15, Figure 16, Figure 17. Edge detection using the Canny edge detection algorithm:

def find_edge(img):
    edges = cv2.Canny(img, 50, 200)
    plt.subplot(121), plt.imshow(img, cmap='gray')
    plt.subplot(122), plt.imshow(edges, cmap='gray')

find_edge(img1)
find_edge(img2)
find_edge(img3)
find_edge(img4)

Figure 13: Edge Detection 1

Figure 14: Edge Detection 2

Figure 15: Edge Detection 3

Figure 16: Edge Detection 4

7.6.3.3 Black and white

bw1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
plt.imshow(bw1, cmap='gray')

Figure 17: Black and white conversion

8 DATA

8.1 DATA FORMATS ☁

8.1.1 YAML

The term YAML stands for "YAML Ain't Markup Language". According to the Web page at

http://yaml.org/

"YAML is a human friendly data serialization standard for all programming languages." Multiple versions of YAML exist and one needs to take care that your software supports the right version. The current version is YAML 1.2.

YAML is often used for configuration and in many cases can also be used as an XML replacement. Important is that YAML, in contrast to XML, removes the tags and replaces them with indentation. This naturally has the advantage that it is easier to read; however, the format is strict and needs to adhere to proper indentation. Thus it is important that you check your YAML files for correctness, either by writing, for example, a Python program that reads your YAML file, or by using an online YAML checker such as the one provided at

http://www.yamllint.com/

An example of how to use YAML in Python is provided next. Please note that YAML is a superset of JSON. Originally YAML was designed as a markup language. However, as it is not document oriented but data oriented, it has been recast and no longer classifies itself as a markup language.

import os
import sys
import yaml

try:
    yamlFilename = os.sys.argv[1]
    yamlFile = open(yamlFilename, "r")
except:
    print("filename does not exist")
    sys.exit()

try:
    yaml.load(yamlFile.read())
except:
    print("YAML file is not valid.")

Resources:

http://yaml.org/
https://en.wikipedia.org/wiki/YAML
http://www.yamllint.com/

8.1.2 JSON

The term JSON stands for JavaScript Object Notation. It is targeted as an open-standard file format that emphasizes the integration of human-readable text to transmit data objects. The data objects contain attribute-value pairs. Although it originates from JavaScript, the format itself is language independent. It uses brackets to allow organization of the data. Please note that YAML is a superset of JSON and not all YAML documents can be converted to JSON. Furthermore, JSON does not support comments. For these reasons we often prefer to use YAML instead of JSON. However, JSON data can easily be translated to YAML as well as to XML.
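As a small sketch of the last point, the following snippet converts a JSON document to YAML using the json and yaml packages; the file names data.json and data.yaml are only examples.

import json
import yaml

# Read a JSON document (data.json is just an example file name)
with open('data.json') as f:
    data = json.load(f)

# Write the same data as YAML
with open('data.yaml', 'w') as f:
    yaml.safe_dump(data, f, default_flow_style=False)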

Resources:

https://en.wikipedia.org/wiki/JSON
https://www.json.org/

8.1.3 XML

XML stands for Extensible Markup Language. XML allows one to define documents with the help of a set of rules in order to make them machine readable. The emphasis here is on machine readable, as documents in XML can quickly become complex and difficult to understand for humans. XML is used for documents as well as data structures.

A tutorial about XML is available at

https://www.w3schools.com/xml/default.asp
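To give a flavor of how XML can be read from Python, a minimal sketch using the standard library's xml.etree.ElementTree module is shown next; the file catalog.xml and its tag names are assumptions used only for illustration.

import xml.etree.ElementTree as ET

# Parse an XML file and iterate over its elements
# (catalog.xml and the tag names are hypothetical)
tree = ET.parse('catalog.xml')
root = tree.getroot()

for book in root.findall('book'):
    title = book.find('title').text
    print(title)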

Resources:


https://en.wikipedia.org/wiki/XML

9 MONGO

9.1 MONGODB IN PYTHON ☁

Learning Objectives

Introduction to basic MongoDB knowledge.
Use of MongoDB via PyMongo.
Use of MongoEngine, an Object-Document mapper.
Use of Flask-PyMongo.

In today's era, NoSQL databases have developed an enormous potential to process unstructured data efficiently. Modern information is complex, extensive, and may not have pre-existing relationships. With the advent of advanced search engines, machine learning, and Artificial Intelligence, technology expectations to process, store, and analyze such data have grown tremendously [2]. NoSQL database engines such as MongoDB, Redis, and Cassandra have successfully overcome traditional relational database challenges such as scalability, performance, unstructured data growth, agile sprint cycles, and growing needs of processing data in real-time with minimal hardware processing power [3]. The NoSQL databases are a new generation of engines that do not necessarily require the SQL language and are sometimes also called Not Only SQL databases. However, most of them support various third-party open connectivity drivers that can map NoSQL queries to SQL. It would be safe to say that although NoSQL databases are still far from replacing relational databases, they add immense value when used in hybrid IT environments in conjunction with relational databases, based on application specific needs [3]. We will be covering the MongoDB technology, its driver PyMongo, its object-document mapper MongoEngine, and the Flask-PyMongo micro-web framework that make MongoDB more attractive and user-friendly.

9.1.1 Cloudmesh MongoDB Usage Quickstart

Before you read on we would like you to read this quickstart. The easiest way for many of the activities we do to interact with MongoDB is to use our cloudmesh functionality. This prelude section is not intended to describe all the details, but to get you started quickly while leveraging cloudmesh.

This is done via the cloudmesh cmd5 and the cloudmesh_community/cm code:

https://cloudmesh-community.github.io/cm/

To install mongo on, for example, macOS you can use

$ cms admin mongo install

To start, stop, and see the status of mongo you can use

$ cms admin mongo start
$ cms admin mongo stop
$ cms admin mongo status

To add an object to Mongo, you simply have to define a dict with predefined values for kind and cloud. In the future such attributes can be passed to the function to determine the MongoDB collection.

from cloudmesh.mongo.DataBaseDecorator import DatabaseUpdate

@DatabaseUpdate
def test():
    data = {
        "kind": "test",
        "cloud": "testcloud",
        "value": "hello"
    }
    return data

result = test()

When you invoke the function it will automatically store the information into MongoDB. Naturally this requires that the ~/.cloudmesh/cloudmesh.yaml file is properly configured.

9.1.2 MongoDB

Today MongoDB is one of the leading NoSQL databases which is fully capable of handling dynamic changes, processing large volumes of complex and unstructured data, and easily using object-oriented programming features, as well as addressing distributed system challenges [4]. At its core, MongoDB is an open source, cross-platform, document database mainly written in the C++ language.

9.1.2.1 Installation

MongoDB can be installed on various Unix platforms, including Linux, Ubuntu, Amazon Linux, etc. [5]. This section focuses on installing MongoDB on Ubuntu 18.04 Bionic Beaver, used as the standard OS for a virtual machine in the Big Data Application Class during the 2018 Fall semester.

9.1.2.1.1 Installation procedure

Before installing, it is recommended to configure a non-root user and provide administrative privileges to it, in order to be able to perform general MongoDB admin tasks. This can be accomplished by logging in as the root user in the following manner [6]:

$ adduser mongoadmin
$ usermod -aG sudo mongoadmin

When logged in as a regular user, one can perform actions with superuser privileges by typing sudo before each command [6].

Once the user setup is completed, one can log in as a regular user (mongoadmin) and use the following instructions to install MongoDB.

To update the Ubuntu packages to the most recent versions, use the next command:

$ sudo apt update

To install the MongoDB package:

$ sudo apt install -y mongodb

To check the service and database status:

$ sudo systemctl status mongodb

Verifying the status of a successful MongoDB installation can be confirmed with an output similar to this:

mongodb.service - An object/document-oriented database
   Loaded: loaded (/lib/systemd/system/mongodb.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2018-11-15 07:48:04 UTC; 2min 17s ago
     Docs: man:mongod(1)
 Main PID: 2312 (mongod)
    Tasks: 23 (limit: 1153)
   CGroup: /system.slice/mongodb.service
           └─2312 /usr/bin/mongod --unixSocketPrefix=/run/mongodb --config /etc/mongodb.conf

To verify the configuration, more specifically the installed version, server, and port, use the following command:

$ mongo --eval 'db.runCommand({ connectionStatus: 1 })'

Similarly, to restart MongoDB, use the following:

$ sudo systemctl restart mongodb

To allow access to MongoDB from an outside hosted server one can use the following command, which opens the firewall connections [5]:

$ sudo ufw allow from your_other_server_ip/32 to any port 27017

Status can be verified by using:

$ sudo ufw status

Other MongoDB configurations, such as port, hostnames, and file paths, can be edited through the /etc/mongodb.conf file:

$ sudo nano /etc/mongodb.conf

Also, to complete this step, a server's IP address must be added to the bind_ip value [5]:

logappend=true
bind_ip = 127.0.0.1,your_server_ip
port = 27017

MongoDB is now listening for a remote connection that can be accessed by anyone with appropriate credentials [5].

9.1.2.2 Collections and Documents

Each database within the Mongo environment contains collections which in turn contain documents. Collections and documents are analogous to tables and rows respectively in relational databases. The document structure is in a key-value form which allows storing of complex data types composed out of field and value pairs. Documents are objects which correspond to native data types in many programming languages; hence a well defined, embedded document can help reduce expensive joins and improve query performance. The _id field helps to identify each document uniquely [3].

MongoDB offers flexibility towrite records that are not restricted by columntypes.Thedata storageapproach is flexibleas it allowsone to storedataas itgrowsandtofulfillvaryingneedsofapplicationsand/orusers.ItsupportsJSONlikebinarypointsknownasBSONwheredatacanbestoredwithoutspecifyingthe type of data.Moreover, it can be distributed tomultiplemachines at highspeed. It includes a sharding feature that partitions and spreads the data outacrossvariousservers.ThismakesMongoDBanexcellentchoiceforclouddataprocessing. Its utilities can load high volumes of data at high speed whichultimately provides greater flexibility and availability in a cloud-basedenvironment[2].

ThedynamicschemastructurewithinMongoDBallowseasytestingofthesmallsprints in theAgile projectmanagement life cycles and research projects thatrequirefrequentchangestothedatastructurewithminimaldowntime.Contrarytothisflexibleprocess,modifyingthedatastructureofrelationaldatabasescanbeaverytediousprocess[2].

9.1.2.2.1 Collection example

The following collection example for a person named Albert includes additional information such as age, status, and group [7]:

{
   name: "Albert",
   age: "21",
   status: "Open",
   group: ["AI", "Machine Learning"]
}

9.1.2.2.2 Document structure

{
   field1: value1,
   field2: value2,
   field3: value3,
   ...
   fieldN: valueN
}

9.1.2.2.3 Collection Operations

If a collection does not exist, the MongoDB database will create the collection by default.

> db.myNewCollection1.insertOne({ x: 1 })
> db.myNewCollection2.createIndex({ y: 1 })

9.1.2.3 MongoDB Querying

The data retrieval patterns, and the frequency of data manipulation statements such as inserts, updates, and deletes, may demand the use of indexes or incorporating the sharding feature to improve query performance and efficiency of the MongoDB environment [3]. One of the significant differences between relational databases and NoSQL databases are joins. In a relational database, one can combine results from two or more tables using a common column, often called a key. The native table contains the primary key column while the referenced table contains a foreign key. This mechanism allows one to make changes in a single row instead of changing all rows in the referenced table. This action is referred to as normalization. MongoDB is a document database and mainly contains denormalized data which means the data is repeated instead of indexed over a specific key. If the same data is required in more than one table, it needs to be repeated. This constraint has been eliminated in MongoDB's version 3.2. The new release introduced a $lookup feature which works much like a left outer join. Lookups are restricted to aggregation functions, which means that data usually needs some type of filtering and grouping operations to be conducted beforehand. For this reason, joins in MongoDB require more complicated querying compared to the traditional relational database joins. Although at this time lookups are still very far from replacing joins, this is a prominent feature that can resolve some of the relational data challenges for MongoDB [8]. MongoDB queries support regular expressions as well as range queries for specific fields that eliminate the need of returning entire documents [3]. MongoDB collections do not enforce document structure like SQL databases, which is a compelling feature. However, it is essential to keep in mind the needs of the applications [2].

9.1.2.3.1 Mongo Queries examples

The queries can be executed from the Mongo shell as well as through scripts.

To query the data from a MongoDB collection, one would use MongoDB's find() method:

> db.COLLECTION_NAME.find()

The output can be formatted by using the pretty() command:

> db.mycol.find().pretty()

The MongoDB insert statements can be performed in the following manner:

> db.COLLECTION_NAME.insert(document)

"The $lookup command performs a left-outer-join to an unsharded collection in the same database to filter in documents from the joined collection for processing" [9].

{
   $lookup:
     {
       from: <collection to join>,
       localField: <field from the input documents>,
       foreignField: <field from the documents of the "from" collection>,
       as: <output array field>
     }
}

This operation is equivalent to the following SQL operation:

SELECT *, <output array field>
FROM collection
WHERE <output array field> IN (SELECT *
                               FROM <collection to join>
                               WHERE <foreignField> = <collection.localField>);

To perform a Like Match (Regex), one would use the following command:

> db.products.find({ sku: { $regex: /789$/ } })

9.1.2.4 MongoDB Basic Functions

When it comes to the technical elements of MongoDB, it possesses a rich interface for importing and storage of external data in various formats. By using the MongoImport/Export tool, one can easily transfer contents from JSON, CSV, or TSV files into a database. MongoDB supports CRUD (create, read, update, delete) operations efficiently and has detailed documentation available on the product website. It can also query geospatial data, and it is capable of storing geospatial data in GeoJSON objects. The aggregation operations of MongoDB process data records and return computed results. The MongoDB aggregation framework is modeled on the concept of data pipelines [10].

9.1.2.4.1 Import/Export functions examples

To import JSON documents, one would use the following command:

$ mongoimport --db users --collection contacts --file contacts.json

The CSV import uses the input file name to import a collection, hence the collection name is optional [10]:

$ mongoimport --db users --type csv --headerline --file /opt/backups/contacts.csv

"Mongoexport is a utility that produces a JSON or CSV export of data stored in a MongoDB instance" [10].

$ mongoexport --db test --collection traffic --out traffic.json

9.1.2.5 Security Features

Data security is a crucial aspect of enterprise infrastructure management and is the reason why MongoDB provides various security features such as role based access control, numerous authentication options, and encryption. It supports mechanisms such as SCRAM, LDAP, and Kerberos authentication. The administrator can create role/collection-based access control; also, roles can be predefined or custom. MongoDB can audit activities such as DDL, CRUD statements, and authentication and authorization operations [11].

9.1.2.5.1 Collection based access control example

A user defined role can contain the following privileges [11]:

privileges: [
   { resource: { db: "products", collection: "inventory" }, actions: ["find", "update"] },
   { resource: { db: "products", collection: "orders" }, actions: ["find"] }
]

9.1.2.6 MongoDB Cloud Service

In regards to cloud technologies, MongoDB also offers a fully automated cloud service called Atlas with competitive pricing options. The Mongo Atlas Cloud interface offers an interactive GUI for managing cloud resources and deploying applications quickly. The service is equipped with geographically distributed instances to ensure no single point of failure. Also, a well-rounded performance monitoring interface allows users to promptly detect anomalies and generate index suggestions to optimize the performance and reliability of the database. Global technology leaders such as Google, Facebook, eBay, and Nokia are leveraging MongoDB and Atlas cloud services, making MongoDB one of the most popular choices among NoSQL databases [12].

9.1.3 PyMongo

PyMongo is the official Python driver or distribution that allows work with a NoSQL type database called MongoDB [13]. The first version of the driver was developed in 2009 [14], only two years after the development of MongoDB was started. This driver allows developers to combine both Python's versatility and MongoDB's flexible schema nature into successful applications. Currently, this driver supports MongoDB versions 2.6, 3.0, 3.2, 3.4, 3.6, and 4.0 [15]. MongoDB and Python represent a compatible fit considering that BSON (binary JSON) used in this NoSQL database is very similar to Python dictionaries, which makes the collaboration between the two even more appealing [16]. For this reason, dictionaries are the recommended tools to be used in PyMongo when representing documents [17].

9.1.3.1 Installation

Prior to being able to exploit the benefits of Python and MongoDB simultaneously, the PyMongo distribution must be installed using pip. To install it on all platforms, the following command should be used [18]:

$ python -m pip install pymongo

Specific versions of PyMongo can be installed with command lines such as in our example where the 3.5.1 version is installed [18]:

$ python -m pip install pymongo==3.5.1

A single line of code can be used to upgrade the driver as well [18]:

$ python -m pip install --upgrade pymongo

Furthermore, the installation process can be completed with the help of the easy_install tool, which requires users to use the following command [18]:

$ python -m easy_install pymongo

To do an upgrade of the driver using this tool, the following command is recommended [18]:

$ python -m easy_install -U pymongo

There are many other ways of installing PyMongo directly from the source; however, they require the C extension dependencies to be installed prior to the driver installation step, as they are the ones that skim through the sources on GitHub and use the most up-to-date links to install the driver [18].

To check if the installation was completed accurately, the following command is used in the Python console [19]:

import pymongo

If the command returns zero exceptions within the Python shell, one can consider the PyMongo installation to have been completed successfully.

9.1.3.2 Dependencies

The PyMongo driver has a few dependencies that should be taken into consideration prior to its usage. Currently, it supports CPython 2.7, 3.4+, PyPy, and PyPy 3.5+ interpreters [15]. An optional dependency that requires some additional components to be installed is GSSAPI authentication [15]. For Unix based machines, it requires pykerberos, while for Windows machines WinKerberos is needed to fulfill this requirement [15]. The automatic installation of this dependency can be done simultaneously with the driver installation, in the following manner:

$ python -m pip install pymongo[gssapi]

Other third-party dependencies such as ipaddress, certifi, or wincertstore are necessary for connections with the help of TLS/SSL and can also be installed simultaneously along with the driver installation [15].

9.1.3.3 Running PyMongo with the Mongo Daemon

Once PyMongo is installed, the Mongo daemon can be run with a very simple command in a new terminal window [19]:

$ mongod

9.1.3.4 Connecting to a database using MongoClient

In order to be able to establish a connection with a database, a MongoClient class needs to be imported, which subsequently allows the MongoClient object to communicate with the database [19]:

from pymongo import MongoClient
client = MongoClient()

This command allows a connection with the default localhost through port 27017; however, depending on the programming requirements, one can also specify those by listing them in the client instance or use the same information via the Mongo URI format [19].

9.1.3.5 Accessing Databases

Since MongoClient plays a server role, it can be used to access any desired databases in an easy way. To do that, one can use two different approaches. The first approach would be doing this via the attribute method where the name of the desired database is listed as an attribute, and the second approach would include a dictionary-style access [19]. For example, to access a database called cloudmesh_community, one would use the following commands for the attribute and for the dictionary method, respectively:

db = client.cloudmesh_community

db = client['cloudmesh_community']

9.1.3.6 Creating a Database

Creating a database is a straightforward process. First, one must create a MongoClient object and specify the connection (IP address) as well as the name of the database they are trying to create [20]. An example of this command is presented next:

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']

9.1.3.7 Inserting and Retrieving Documents (Querying)

Creating documents and storing data using PyMongo is equally easy as accessing and creating databases. In order to add new data, a collection must be specified first. In this example, a decision is made to use the cloudmesh group of documents:

cloudmesh = db.cloudmesh

Once this step is completed, data may be inserted using the insert_one() method, which means that only one document is being created. Of course, insertion of multiple documents at the same time is possible as well with the use of the insert_many() method [19]. An example of this method is as follows:

course_info = {
    'course': 'Big Data Applications and Analytics',
    'instructor': 'Gregor von Laszewski',
    'chapter': 'technologies'
}

result = cloudmesh.insert_one(course_info)

Another example of this method would be to create a collection. If we wanted to create a collection of students in the cloudmesh_community, we would do it in the following manner:

student = [{'name': 'John', 'st_id': 52642},
           {'name': 'Mercedes', 'st_id': 5717},
           {'name': 'Anna', 'st_id': 5654},
           {'name': 'Greg', 'st_id': 5423},
           {'name': 'Amaya', 'st_id': 3540},
           {'name': 'Cameron', 'st_id': 2343},
           {'name': 'Bozer', 'st_id': 4143},
           {'name': 'Cody', 'st_id': 2165}]

client = MongoClient('mongodb://localhost:27017/')

with client:
    db = client.cloudmesh
    db.students.insert_many(student)

Retrieving documents is equally simple as creating them. The find_one() method can be used to retrieve one document [19]. An implementation of this method is given in the following example:

gregors_course = cloudmesh.find_one({'instructor': 'Gregor von Laszewski'})

Similarly, to retrieve multiple documents, one would use the find() method instead of find_one(). For example, to find all courses taught by Professor von Laszewski, one would use the following command:

gregors_course = cloudmesh.find({'instructor': 'Gregor von Laszewski'})

One thing that users should be cognizant of when using the find() method is that it does not return results in an array format but as a cursor object, which is a combination of methods that work together to help with data querying [19]. In order to return individual documents, iteration over the result must be completed [19].
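For example, iterating over the cursor from the previous find() call could look like this:

# Iterate over the cursor returned by find() to access individual documents
for course in gregors_course:
    print(course)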

9.1.3.8 Limiting Results

When it comes to working with large databases it is always useful to limit the number of query results. PyMongo supports this option with its limit() method [20]. This method takes in one parameter which specifies the number of documents to be returned [20]. For example, if we had a collection with a large number of cloud technologies as individual documents, one could modify the query results to return only the top 10 technologies. To do this, the following example could be utilized:

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['technologies']
topten = col.find().limit(10)

9.1.3.9 Updating Collection

Updating documents is very similar to inserting and retrieving them. Depending on the number of documents to be updated, one would use the update_one() or update_many() method [20]. Two parameters need to be passed into the update_one() method for it to successfully execute. The first argument is the query object that specifies the document to be changed, and the second argument is the object that specifies the new value in the document. An example of the update_one() method in action is the following:

myquery = {'course': 'Big Data Applications and Analytics'}
newvalues = {'$set': {'course': 'Cloud Computing'}}

Updating all documents that fall under the same criteria can be done with the update_many() method [20]. For example, to update all documents in which the course title starts with the letter B with different instructor information, we would do the following:

client = pymongo.MongoClient('mongodb://localhost:27017/')
db = client['cloudmesh']
col = db['courses']

query = {'course': {'$regex': '^B'}}
newvalues = {'$set': {'instructor': 'Gregor von Laszewski'}}

edited = col.update_many(query, newvalues)

9.1.3.10 Counting Documents

Counting documents can be done with one simple operation called count_documents() instead of using a full query [21]. For example, we can count the documents in the cloudmesh_community by using the following command:

cloudmesh = count_documents({})

To create a more specific count, one would use a command similar to this:

cloudmesh = count_documents({'author': 'von Laszewski'})

This technology supports some more advanced querying options as well. Those advanced queries allow one to add certain constraints and narrow down the results even more. For example, to get the courses taught by Professor von Laszewski sorted by author and filtered by a certain date, one would use the following command:

d = datetime.datetime(2017, 11, 12, 12)
for course in cloudmesh.find({'date': {'$lt': d}}).sort('author'):
    pprint.pprint(course)

9.1.3.11 Indexing

Indexing is a very important part of querying. It can greatly improve query performance but also add functionality and aid in storing documents [21].

"To create a unique index on a key that rejects documents whose value for that key already exists in the index" [21].

We need to first create the index in the following manner:

result = db.profiles.create_index([('user_id', pymongo.ASCENDING)],
                                  unique=True)
sorted(list(db.profiles.index_information()))

This command actually creates two different indexes. The first one is the _id index, created by MongoDB automatically, and the second one is the user_id index, created by the user.

The purpose of those indexes is to cleverly prevent future additions of invalid user_ids into a collection.

9.1.3.12 Sorting

Sorting on the server side is also available via MongoDB. The PyMongo sort() method is equivalent to the SQL order by statement and it can be performed with pymongo.ASCENDING and pymongo.DESCENDING [22]. This method is much more efficient as it is completed on the server side, compared to sorting completed on the client side. For example, to return all users with first name Gregor sorted in descending order by birthdate we would use a command such as this:

users = cloudmesh.users.find({'firstname': 'Gregor'}).sort(('dateofbirth', pymongo.DESCENDING))
for user in users:
    print(user.get('email'))

9.1.3.13 Aggregation

Aggregation operations are used to process given data and produce summarized results. Aggregation operations collect data from a number of documents and provide collective results by grouping data. PyMongo in its documentation offers a separate framework that supports data aggregation. This aggregation framework can be used to

"provide projection capabilities to reshape the returned data" [23].

In the aggregation pipeline, documents pass through multiple pipeline stages which convert documents into result data. The basic pipeline stages include filters. Those filters act like document transformations by helping change the document output form. Other pipelines help group or sort documents with specific fields. By using native operations from MongoDB, the pipeline operators are efficient in aggregating results.

The addFields stage is used to add new fields into documents. It reshapes each document in the stream, similarly to the project stage. The output document will contain existing fields from the input documents and the newly added fields [24]. The following example shows how to add student details into a document:

The bucket stage is used to categorize incoming documents into groups based on specified expressions. Those groups are called buckets [24]. The following example shows the bucket stage in action.

In the bucketAuto stage, the boundaries are automatically determined in an attempt to evenly distribute documents into a specified number of buckets. In the following operation, input documents are grouped into four buckets according to the values in the price field [24].

The collStats stage returns statistics regarding a collection or view [24].

The count stage passes a document to the next stage that contains the number of documents that were input to the stage [24].

db.cloudmesh_community.aggregate([

{

$addFields:{

"document.StudentDetails":{

$concat:['$document.student.FirstName','$document.student.LastName']

}

}

}])

db.user.aggregate([

{"$group":{

"_id":{

"city":"$city",

"age":{

"$let":{

"vars":{

"age":{"$subtract":[{"$year":newDate()},{"$year":"$birthDay"}]}},

"in":{

"$switch":{

"branches":[

{"case":{"$lt":["$$age",20]},"then":0},

{"case":{"$lt":["$$age",30]},"then":20},

{"case":{"$lt":["$$age",40]},"then":30},

{"case":{"$lt":["$$age",50]},"then":40},

{"case":{"$lt":["$$age",200]},"then":50}

]}}}}},

"count":{"$sum":1}}})

db.artwork.aggregate([

{

$bucketAuto:{

groupBy:"$price",

buckets:4

}

}

])

db.matrices.aggregate([{$collStats:{latencyStats:{histograms:true}}

}])

db.scores.aggregate([
    { $match: { score: { $gt: 80 } } },
    { $count: "passing_scores" }
])

The facet stage helps process multiple aggregation pipelines in a single stage [24].

The geoNear stage returns an ordered stream of documents based on the proximity to a geospatial point. The output documents include an additional distance field and can include a location identifier field [24].

The graphLookup stage performs a recursive search on a collection. To each output document, it adds a new array field that contains the traversal results of the recursive search for that document [24].

The group stage consumes the document data per each distinct group. It has a RAM limit of 100 MB. If the stage exceeds this limit, the group produces an error [24].

db.artwork.aggregate([{

$facet:{"categorizedByTags":[{$unwind:"$tags"},

{$sortByCount:"$tags"}],"categorizedByPrice":[

//Filteroutdocumentswithoutapricee.g.,_id:7

{$match:{price:{$exists:1}}},

{$bucket:{groupBy:"$price",

boundaries:[0,150,200,300,400],

default:"Other",

output:{"count":{$sum:1},

"titles":{$push:"$title"}

}}}],"categorizedByYears(Auto)":[

{$bucketAuto:{groupBy:"$year",buckets:4}

}]}}])

db.places.aggregate([

{$geoNear:{

near:{type:"Point",coordinates:[-73.99279,40.719296]},

distanceField:"dist.calculated",

maxDistance:2,

query:{type:"public"},

includeLocs:"dist.location",

num:5,

spherical:true

}}])

db.travelers.aggregate([

{

$graphLookup:{

from:"airports",

startWith:"$nearestAirport",

connectFromField:"connects",

connectToField:"airport",

maxDepth:2,

depthField:"numConnections",

as:"destinations"

}

}

])

db.sales.aggregate(
    [
        {
            $group: {
                _id: { month: { $month: "$date" }, day: { $dayOfMonth: "$date" },
                       year: { $year: "$date" } },
                totalPrice: { $sum: { $multiply: ["$price", "$quantity"] } },
                averageQuantity: { $avg: "$quantity" },
                count: { $sum: 1 }
            }
        }
    ]
)

The indexStats stage returns statistics regarding the use of each index for a collection [24].

The limit stage is used for controlling the number of documents passed to the next stage in the pipeline [24].

The listLocalSessions stage gives the session information currently connected to the mongos or mongod instance [24].

The listSessions stage lists out all sessions that have been active long enough to propagate to the system.sessions collection [24].

The lookup stage is useful for performing outer joins to other collections in the same database [24].

The match stage is used to filter the document stream. Only matching documents pass to the next stage [24].

The project stage is used to reshape the documents by adding or deleting fields.

db.orders.aggregate([{$indexStats:{}}])

db.article.aggregate(

{$limit:5}

)

db.aggregate([{$listLocalSessions:{allUsers:true}}])

use config

db.system.sessions.aggregate([{$listSessions:{allUsers:true}}])

{

$lookup:

{

from:<collectiontojoin>,

localField:<fieldfromtheinputdocuments>,

foreignField:<fieldfromthedocumentsofthe"from"collection>,

as:<outputarrayfield>

}

}

db.articles.aggregate(

[{$match:{author:"dave"}}]

)

The redact stage reshapes stream documents by restricting information using information stored in the documents themselves [24].

The replaceRoot stage is used to replace a document with a specified embedded document [24].

The sample stage is used to sample out data by randomly selecting a number of documents from the input [24].

The skip stage skips a specified initial number of documents and passes the remaining documents to the pipeline [24].

The sort stage is useful while reordering the document stream by a specified sort key [24].

The sortByCount stage groups the incoming documents based on a specified expression value and counts documents in each distinct group [24].

The unwind stage deconstructs an array field from the input documents to output a document for each element [24].

db.books.aggregate([{$project:{title:1,author:1}}])

db.accounts.aggregate(

[

{$match:{status:"A"}},

{

$redact:{

$cond:{

if:{$eq:["$level",5]},

then:"$$PRUNE",

else:"$$DESCEND"

}}}]);

db.produce.aggregate([

{

$replaceRoot:{newRoot:"$in_stock"}

}

])

db.users.aggregate(

[{$sample:{size:3}}]

)

db.article.aggregate(

{$skip:5}

);

db.users.aggregate(

[

{$sort:{age:-1,posts:1}}

]

)

db.exhibits.aggregate(

[{$unwind:"$tags"},{$sortByCount:"$tags"}])

db.inventory.aggregate([{ $unwind: "$sizes" }])

db.inventory.aggregate([{ $unwind: { path: "$sizes" } }])

The out stage is used to write aggregation pipeline results into a collection. This stage should be the last stage of a pipeline [24].

db.books.aggregate([
    { $group: { _id: "$author", books: { $push: "$title" } } },
    { $out: "authors" }
])

Another option from the aggregation operations is the Map/Reduce framework, which essentially includes two different functions, map and reduce. The first one provides the key-value pair for each tag in the array, while the latter one

"sums over all of the emitted values for a given key" [23].

The last step in the Map/Reduce process is to call the map_reduce() function and iterate over the results [23]. The Map/Reduce operation provides result data in a collection or returns results in-line. One can perform subsequent operations with the same input collection if the output of the same is written to a collection [25]. An operation that produces results in an in-line form must provide results within the BSON document size limit. The current limit for a BSON document is 16 MB. These types of operations are not supported by views [25]. PyMongo's API supports all features of MongoDB's Map/Reduce engine [26]. Moreover, Map/Reduce has the ability to get more detailed results by passing the full_response=True argument to the map_reduce() function [26].
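A rough sketch of how this could look with PyMongo follows; the collection name pages and the idea of counting tags are assumptions used only for illustration. The map and reduce functions are written in JavaScript and passed as bson.Code objects:

from bson.code import Code

# JavaScript map and reduce functions; here we count how often each tag occurs
# (assumes documents in the hypothetical "pages" collection have a "tags" array)
mapper = Code("""
    function () {
        this.tags.forEach(function (tag) {
            emit(tag, 1);
        });
    }
""")

reducer = Code("""
    function (key, values) {
        return Array.sum(values);
    }
""")

# map_reduce() writes its output to the collection given by the out argument
result = db.pages.map_reduce(mapper, reducer, out="tag_counts")
for doc in result.find():
    print(doc)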

9.1.3.14 Deleting Documents from a Collection

The deletion of documents with PyMongo is fairly straightforward. To do so, one would use the remove() method of the PyMongo Collection object [22]. Similarly to the reads and updates, specification of the documents to be removed is a must. For example, removal of all documents in the collection with a score of 1 would require one to use the following command:

cloudmesh.users.remove({"score": 1}, safe=True)

The safe parameter set to True ensures the operation was completed [22].

9.1.3.15 Copying a Database

Copying databases within the same mongod instance or between different mongod servers is made possible with the command() method after connecting to the desired mongod instance [27]. For example, to copy the cloudmesh database and name the new database cloudmesh_copy, one would use the command() method in the following manner:

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy')

There are two ways to copy a database between servers. If a server is not password-protected, one would not need to pass in the credentials nor to authenticate to the admin database [27]. In that case, to copy a database one would use the following command:

client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

On the other hand, if the server where we are copying the database to is protected, one would use this command instead:

client = MongoClient('target.example.com',
                     username='administrator',
                     password='pwd')
client.admin.command('copydb',
                     fromdb='cloudmesh',
                     todb='cloudmesh_copy',
                     fromhost='source.example.com')

9.1.3.16 PyMongo Strengths

One of PyMongo's strengths is that it allows document creation and querying natively

"through the use of existing language features such as nested dictionaries and lists" [22].

For moderately experienced Python developers, it is very easy to learn and to quickly feel comfortable with.

"For these reasons, MongoDB and Python make a powerful combination for rapid, iterative development of horizontally scalable backend applications" [22].

According to [22], MongoDB is very applicable to modern applications, which makes PyMongo equally valuable [22].

9.1.4 MongoEngine

"MongoEngine is an Object-Document Mapper, written in Python for working with MongoDB" [28].

It is actually a library that allows a more advanced communication with MongoDB compared to PyMongo. As MongoEngine is technically considered to be an object-document mapper (ODM), it can also be considered to be

"equivalent to a SQL-based object relational mapper (ORM)" [19].

The primary reason why one would use an ODM is data conversion between computer systems that are not compatible with each other [29]. For the purpose of converting data to the appropriate form, a virtual object database must be created within the utilized programming language [29]. This library is also used to define schemata for documents within MongoDB, which ultimately helps with minimizing coding errors as well as defining methods on existing fields [30]. It is also very beneficial to the overall workflow as it tracks changes made to the documents and aids in the document saving process [31].

9.1.4.1 Installation

The installation process for this technology is fairly simple as it is considered to be a library. To install it, one would use the following command [32]:

$ pip install mongoengine

A bleeding-edge version of MongoEngine can be installed directly from GitHub by first cloning the repository on the local machine, virtual machine, or cloud.

9.1.4.2 Connecting to a database using MongoEngine

Once installed, MongoEngine needs to be connected to an instance of the mongod, similarly to PyMongo [33]. The connect() function must be used to successfully complete this step and the argument that must be used in this function is the name of the desired database. Prior to using this function, the function name needs to be imported from the MongoEngine library:

from mongoengine import connect
connect('cloudmesh_community')

Similarly to the MongoClient, MongoEngine uses the localhost and port 27017 by default; however, the connect() function also allows specifying other host and port arguments as well [33]:

connect('cloudmesh_community', host='196.185.1.62', port=16758)

Other types of connections are also supported (i.e. URI) and they can be completed by providing the URI in the connect() function [33].
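For example, a URI-based connection might look like the following sketch; the host name and credentials are placeholders:

# Connect using a MongoDB URI (host, user, and password are placeholders)
connect(host='mongodb://user:password@mongo.example.com:27017/cloudmesh_community')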

9.1.4.3 Querying using MongoEngine

To query MongoDB using MongoEngine, an objects attribute is used, which is, technically, a part of the Document class [34]. This attribute is called the QuerySetManager, which in return

"creates a new QuerySet object on access" [34].

To be able to access individual documents from a database, this object needs to be iterated over. For example, to return/print all students in the cloudmesh_community object (database), the following command would be used:

for user in cloudmesh_community.objects:
    print(user)

MongoEngine also has a capability of query filtering, which means that a keyword can be used within the called QuerySet object to retrieve specific information [34]. Let us say one would like to iterate over cloudmesh_community students that are natives of Indiana. To achieve this, one would use the following command:

indy_students = cloudmesh_community.objects(state='IN')

This library also allows the use of all operators except for the equality operator in its queries, and moreover, has the capability of handling string queries, geo queries, list querying, and querying of raw PyMongo queries [34].

The string queries are useful in performing text operations in conditional queries. A query to find a document exactly matching the state ACTIVE can be performed in the following manner:

db.cloudmesh_community.find(State.exact("ACTIVE"))

The query to retrieve document data for names that start with a case sensitive AL can be written as:

db.cloudmesh_community.find(Name.startswith("AL"))

To perform the same query for a case-insensitive AL one would use the following command:

db.cloudmesh_community.find(Name.istartswith("AL"))

MongoEngine allows data extraction of geographic locations by using Geo queries. The geo_within operator checks if a geometry is within a polygon:

cloudmesh_community.objects(
    point__geo_within=[[[40, 5], [40, 6], [41, 6], [40, 5]]])

cloudmesh_community.objects(
    point__geo_within={"type": "Polygon",
                       "coordinates": [[[40, 5], [40, 6], [41, 6], [40, 5]]]})

The list query looks up the documents where the specified field matches exactly the given value. To match all pages that have the word coding as an item in the tags list one would use the following query:

class Page(Document):
    tags = ListField(StringField())

Page.objects(tags='coding')

Overall, it would be safe to say that MongoEngine has good compatibility with Python. It provides different functions to utilize Python easily with MongoDB, which makes this pair even more attractive to application developers.

9.1.5 Flask-PyMongo

"Flask is a micro-web framework written in Python" [35].

It was developed after Django, and it is very pythonic in nature, which implies that it explicitly targets the Python user community. It is lightweight as it does not require additional tools or libraries and hence is classified as a micro-web framework. It is often used with MongoDB via the PyMongo connector, and it treats data within MongoDB as searchable Python dictionaries. Applications such as Pinterest, LinkedIn, and the community web page for Flask use the Flask framework. Moreover, it supports various features such as RESTful request dispatching, secure cookies, Google App Engine compatibility, and integrated support for unit testing, etc. [35]. When it comes to connecting to a database, the connection details for MongoDB can be passed as a variable or configured in the PyMongo constructor with additional arguments such as username and password, if required. It is important that the versions of both Flask and MongoDB are compatible with each other to avoid functionality breaks [36].

9.1.5.1 Installation

Flask-PyMongo can be installed with an easy command such as this:

$ pip install Flask-PyMongo

PyMongo can be added in the following manner:

from flask import Flask
from flask_pymongo import PyMongo

app = Flask(__name__)
app.config["MONGO_URI"] = "mongodb://localhost:27017/cloudmesh_community"
mongo = PyMongo(app)

9.1.5.2 Configuration

There are two ways to configure Flask-PyMongo. The first way would be to pass a MongoDB URI to the PyMongo constructor, while the second way would be to

"assign it to the MONGO_URI Flask configuration variable" [36].

9.1.5.3 Connection to multiple databases/servers

Multiple PyMongo instances can be used to connect to multiple databases or database servers. To achieve this, one would use a command similar to the following:

app = Flask(__name__)
mongo1 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_one")
mongo2 = PyMongo(app, uri="mongodb://localhost:27017/cloudmesh_community_two")
mongo3 = PyMongo(app, uri="mongodb://another.host:27017/cloudmesh_community_three")

9.1.5.4 Flask-PyMongo Methods

Flask-PyMongo provides helpers for some common tasks. One of them is the Collection.find_one_or_404 method shown in the following example:

@app.route("/user/<username>")
def user_profile(username):
    user = mongo.db.cloudmesh_community.find_one_or_404({"_id": username})
    return render_template("user.html", user=user)

This method is very similar to MongoDB's find_one() method; however, instead of returning None it causes a 404 Not Found HTTP status [36].

Similarly, the PyMongo.send_file and PyMongo.save_file methods work on file-like objects and save them to GridFS using the given file name [36].
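A minimal sketch of how these two helpers might be used is shown next; the route paths are just examples and not part of the original text:

from flask import request

@app.route("/upload/<path:filename>", methods=["POST"])
def upload(filename):
    # Store the uploaded file in GridFS under the given filename
    mongo.save_file(filename, request.files["file"])
    return "saved"

@app.route("/file/<path:filename>")
def get_file(filename):
    # Stream the file back from GridFS
    return mongo.send_file(filename)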

9.1.5.5 Additional Libraries

Flask-MongoAlchemy and Flask-MongoEngine are additional libraries that can be used to connect to a MongoDB database while using enhanced features with the Flask app. Flask-MongoAlchemy is used as a proxy between Python and MongoDB to connect. It provides options such as server- or database-based authentication to connect to MongoDB. While the default is set to server based, to use database-based authentication, the config value MONGOALCHEMY_SERVER_AUTH parameter must be set to False [37].

Flask-MongoEngine is the Flask extension that provides integration with MongoEngine. It handles connection management for the apps. It can be installed through pip and set up very easily as well. The default configuration is set to the local host and port 27017. For a custom port, and in cases where MongoDB is running on another server, the host and port must be explicitly specified in connect strings within the MONGODB_SETTINGS dictionary with app.config, along with the database username and password, in cases where database authentication is enabled. The URI style connections are also supported; supply the URI as the host in the MONGODB_SETTINGS dictionary with app.config. There are various custom query sets that are available within Flask-MongoEngine that are attached to MongoEngine's default queryset [38].

9.1.5.6 Classes and Wrappers

Attributes such as cx and db in the PyMongo objects are the ones that help provide access to the MongoDB server [36]. To achieve this, one must pass the Flask app to the constructor or call init_app() [36].

type(mongo.cx)
type(mongo.db)
type(mongo.db.cloudmesh_community)

"Flask-PyMongo wraps PyMongo's MongoClient, Database, and Collection classes, and overrides their attribute and item accessors" [36].

This type of wrapping allows Flask-PyMongo to add methods to Collection while at the same time allowing MongoDB-style dotted expressions in the code [36].

Flask-PyMongo creates connectivity between Python and Flask using a MongoDB database and supports

"extensions that can add application features as if they were implemented in Flask itself" [39],

hence, it can be used as additional Flask functionality in Python code. The extensions are there for the purpose of supporting form validations, authentication technologies, object-relational mappers, and framework related tools, which ultimately adds a lot of strength to this micro-web framework [39]. One of the main reasons and benefits why it is frequently used with MongoDB is its capability of adding more control over databases and history [39].

9.2 MONGOENGINE ☁

9.2.1 Introduction

MongoEngine is a document mapper for working with MongoDB in Python. To be able to use MongoEngine, MongoDB should already be installed and running.

9.2.2 Install and connect

MongoEngine can be installed by running:

$ pip install mongoengine

This will install six, pymongo, and mongoengine.

To connect to MongoDB use the connect() function by specifying the MongoDB instance name. You do not need to go to the mongo shell; this can be done from the Unix shell or cmd line. In this case we are connecting to a database named student_db:

from mongoengine import *
connect('student_db')

If MongoDB is running on a port different from the default port, the port number and host need to be specified. If MongoDB needs authentication, a username and password need to be specified.

9.2.3 Basics

MongoDB does not enforce schemas. Compared to an RDBMS, a row in MongoDB is called a "document" and a table can be compared to a collection. Defining a schema is helpful as it minimizes coding errors. To define a schema we create a class that inherits from Document:

from mongoengine import *

class Student(Document):
    first_name = StringField(max_length=50)
    last_name = StringField(max_length=50)

Fields are not mandatory, but if needed, set the required keyword argument to True. There are multiple values available for field types. Each field can be customized by keyword arguments. If each student is sending text messages to the university's central database, these can be stored using MongoDB. Each text can have different data types; some might have images or some might have URLs. So we can create a class Text and link it to a Student by using a ReferenceField (similar to a foreign key in an RDBMS).

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    meta = {'allow_inheritance': True}

class OnlyText(Text):
    content = StringField()

class ImagePost(Text):
    image_path = StringField()

class LinkPost(Text):
    link_url = StringField()

MongoDB supports adding tags to individual texts rather than storing them separately and then having them referenced. Similarly, comments can also be stored directly in a Text:

class Comment(EmbeddedDocument):
    content = StringField()

class Text(Document):
    title = StringField(max_length=120, required=True)
    author = ReferenceField(Student)
    tags = ListField(StringField(max_length=30))
    comments = ListField(EmbeddedDocumentField(Comment))

For accessing data, if we need to get titles:

for text in OnlyText.objects:
    print(text.title)

Searching texts with tags:

for text in Text.objects(tags='mongodb'):
    print(text.title)

10 OTHER

10.1 WORD COUNT WITH PARALLEL PYTHON ☁

We will demonstrate Python's multiprocessing API for parallel computation by writing a program that counts how many times each word in a collection of documents appears.

10.1.1 Generating a Document Collection

Before we begin, let us write a script that will generate document collections by specifying the number of documents and the number of words per document. This will make benchmarking straightforward.

To keep it simple, the vocabulary of the document collection will consist of random numbers rather than the words of an actual language:

'''Usage: generate_nums.py [-h] NUM_LISTS INTS_PER_LIST MIN_INT MAX_INT DEST_DIR

Generate random lists of integers and save them
as 1.txt, 2.txt, etc.

Arguments:
  NUM_LISTS      The number of lists to create.
  INTS_PER_LIST  The number of integers in each list.
  MIN_INT        Each generated integer will be >= MIN_INT.
  MAX_INT        Each generated integer will be <= MAX_INT.
  DEST_DIR       A directory where the generated numbers will be stored.

Options:
  -h --help
'''
from __future__ import print_function
import os, random, logging
from docopt import docopt

def generate_random_lists(num_lists,
                          ints_per_list, min_int, max_int):
    return [[random.randint(min_int, max_int) \
             for i in range(ints_per_list)] for i in range(num_lists)]

if __name__ == '__main__':
    args = docopt(__doc__)
    num_lists, ints_per_list, min_int, max_int, dest_dir = [
        int(args['NUM_LISTS']),
        int(args['INTS_PER_LIST']),
        int(args['MIN_INT']),
        int(args['MAX_INT']),
        args['DEST_DIR']
    ]
    if not os.path.exists(dest_dir):

        os.makedirs(dest_dir)

    lists = generate_random_lists(num_lists,
                                  ints_per_list,
                                  min_int,
                                  max_int)
    curr_list = 1
    for lst in lists:
        with open(os.path.join(dest_dir, '%d.txt' % curr_list), 'w') as f:
            f.write(os.linesep.join(map(str, lst)))
        curr_list += 1
    logging.debug('Numbers written.')

Notice that we are using the docopt module that you should be familiar with from the Section Python DocOpts to make the script easy to run from the command line.

You can generate a document collection with this script as follows:

python generate_nums.py 1000 10000 0 100 docs-1000-10000

10.1.2 Serial Implementation

A first serial implementation of word count is straightforward:

'''Usage: wordcount.py [-h] DATA_DIR

Read a collection of .txt documents and count how many times each word
appears in the collection.

Arguments:
  DATA_DIR  A directory with documents (.txt files).

Options:
  -h --help
'''
from __future__ import division, print_function
import os, glob, logging
from docopt import docopt

logging.basicConfig(level=logging.DEBUG)

def wordcount(files):
    counts = {}
    for filepath in files:
        with open(filepath, 'r') as f:
            words = [word.strip() for word in f.read().split()]
        for word in words:
            if word not in counts:
                counts[word] = 0
            counts[word] += 1
    return counts

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])

    counts = wordcount(glob.glob(os.path.join(args['DATA_DIR'], '*.txt')))
    logging.debug(counts)

10.1.3 Serial Implementation Using map and reduce

We can improve the serial implementation in anticipation of parallelizing the program by making use of Python's map and reduce functions.

In short, you can use map to apply the same function to the members of a collection. For example, to convert a list of numbers to strings, you could do:

import random
nums = [random.randint(1, 2) for _ in range(10)]
print(nums)
[2, 1, 1, 1, 2, 2, 2, 2, 2, 2]
print(list(map(str, nums)))
['2', '1', '1', '1', '2', '2', '2', '2', '2', '2']

We can use reduce to apply the same function cumulatively to the items of a sequence. For example, to find the total of the numbers in our list, we could use reduce as follows:

from functools import reduce

def add(x, y):
    return x + y

print(reduce(add, nums))
17

We can simplify this even more by using a lambda function:

print(reduce(lambda x, y: x + y, nums))
17

You can read more about Python's lambda function in the docs.

With this in mind, we can reimplement the word count example as follows:

'''Usage:wordcount_mapreduce.py[-h]DATA_DIR

Readacollectionof.txtdocumentsandcounthow

manytimeseachword

appearsinthecollection.

Arguments:

DATA_DIRAdirectorywithdocuments(.txtfiles).

Options:

-h--help

'''

from__future__importdivision,print_function

importos,glob,logging

fromdocoptimportdocopt

logging.basicConfig(level=logging.DEBUG)

defcount_words(filepath):

counts={}

withopen(filepath,'r')asf:

words=[word.strip()forwordinf.read().split()]

    for word in words:
        if word not in counts:
            counts[word] = 0
        counts[word] += 1
    return counts

def merge_counts(counts1, counts2):
    for word, count in counts2.items():
        if word not in counts1:
            counts1[word] = 0
        counts1[word] += counts2[word]
    return counts1


if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])

    per_doc_counts = list(map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt'))))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)

10.1.4 Parallel Implementation

Drawing on the previous implementation using map and reduce, we can parallelize the implementation using Python's multiprocessing API:

'''Usage: wordcount_mapreduce_parallel.py [-h] DATA_DIR NUM_PROCESSES

Read a collection of .txt documents and count, in parallel, how many
times each word appears in the collection.

Arguments:
  DATA_DIR        A directory with documents (.txt files).
  NUM_PROCESSES   The number of parallel processes to use.

Options:
  -h --help

'''

from __future__ import division, print_function

import os, glob, logging

from docopt import docopt
from functools import reduce   # needed on Python 3, where reduce is not a built-in
from wordcount_mapreduce import count_words, merge_counts
from multiprocessing import Pool

logging.basicConfig(level=logging.DEBUG)

if __name__ == '__main__':
    args = docopt(__doc__)
    if not os.path.exists(args['DATA_DIR']):
        raise ValueError('Invalid data directory: %s' % args['DATA_DIR'])

    num_processes = int(args['NUM_PROCESSES'])
    pool = Pool(processes=num_processes)

    per_doc_counts = pool.map(count_words,
                              glob.glob(os.path.join(args['DATA_DIR'],
                                                     '*.txt')))
    counts = reduce(merge_counts, [{}] + per_doc_counts)
    logging.debug(counts)

10.1.5 Benchmarking

To time each of the examples, enter it into its own Python file and use Linux's time command:

$ time python wordcount.py docs-1000-10000

The output contains the real run time and the user run time. real is wall clock time - the time from start to finish of the call. user is the amount of CPU time spent in user-mode code (outside the kernel) within the process, that is, only the actual CPU time used in executing the process.

10.1.6 Exercises

E.python.wordcount.1:

Run the three different programs (serial, serial w/ map and reduce, parallel) and answer the following questions:

1. Is there any performance difference between the different versions of the program?

2. Does user time significantly differ from real time for any of the versions of the program?

3. Experiment with different numbers of processes for the parallel example, starting with 1. What is the performance gain when you go from 1 to 2 processes? From 2 to 3? When do you stop seeing improvement? (this will depend on your machine architecture)

10.1.7 References

Map, Filter and Reduce
multiprocessing API

10.2 NUMPY ☁

NumPy is a popular library that is used by many other Python packages such as Pandas, SciPy, and scikit-learn. It provides a fast, simple-to-use way of interacting with numerical data organized in vectors and matrices. In this section, we will provide a short introduction to NumPy.


10.2.1 Installing NumPy

The most common way of installing NumPy, if it wasn't included with your Python installation, is to install it via pip:

$ pip install numpy

If NumPy has already been installed, you can update to the most recent version using:

$ pip install -U numpy

You can verify that NumPy is installed by trying to use it in a Python program:

import numpy as np

Note that, by convention, we import NumPy using the alias 'np' - whenever you see 'np' sprinkled in example Python code, it's a good bet that it is using NumPy.

10.2.2 NumPy Basics

At its core, NumPy is a container for n-dimensional data. Typically, 1-dimensional data is called an array and 2-dimensional data is called a matrix. Beyond 2 dimensions would be considered a multidimensional array. Examples where you'll encounter these dimensions may include:

1 Dimensional: time series data such as audio, stock prices, or a single observation in a dataset.

2 Dimensional: connectivity data between network nodes, user-product recommendations, and database tables.

3+ Dimensional: network latency between nodes over time, video (RGB + time), and version controlled datasets.

All of these data can be placed into NumPy's array object, just with varying dimensions.
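To make the dimensions concrete, here is a minimal sketch (the values are made up purely for illustration) showing that 1-, 2-, and 3-dimensional data all end up in the same ndarray type, differing only in shape:

import numpy as np

# 1-dimensional: e.g. a short time series
series = np.array([1.2, 1.4, 1.1])

# 2-dimensional: e.g. a tiny user-product rating table
ratings = np.array([[5, 3], [4, 1]])

# 3-dimensional: e.g. two 2x2 "frames" stacked over time
frames = np.array([[[0, 1], [1, 0]], [[1, 1], [0, 0]]])

print(series.shape, ratings.shape, frames.shape)
# (3,) (2, 2) (2, 2, 2)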

10.2.3 Data Types: The Basic Building Blocks

Before we delve into arrays and matrices, we will start off with the most basic element of those: a single value. NumPy can represent data utilizing many different standard data types such as uint8 (an 8-bit unsigned integer), float64 (a 64-bit float), or str (a string). An exhaustive listing can be found at:

https://docs.scipy.org/doc/numpy-1.15.0/user/basics.types.html

Before moving on, it is important to know about the tradeoff made when using different data types. For example, a uint8 can only contain values between 0 and 255. This, however, contrasts with float64, which can express values up to approximately +/- 1.80e+308. So why wouldn't we just always use float64s? Though they allow us to be more expressive in terms of numbers, they also consume more memory. If we were working with a 12-megapixel image, for example, storing that image using uint8 values would require 3000 * 4000 * 8 = 96 million bits, or about 11.44 MB of memory. If we were to store the same image utilizing float64, our image would consume 8 times as much memory: 768 million bits or about 91.55 MB. It's important to use the right datatype for the job to avoid consuming unnecessary resources or slowing down processing.

Finally, while NumPy will conveniently convert between datatypes, one must be aware of overflows when using smaller datatypes. For example:

In this example, it makes sense that 6 + 7 = 13. But how does 13 + 245 = 2? Put simply, the object type (uint8) simply ran out of space to store the value and wrapped back around to the beginning. An 8-bit number is only capable of storing 2^8, or 256, unique values. An operation that results in a value above that range will 'overflow' and cause the value to wrap back around to zero. Likewise, anything below that range will 'underflow' and wrap back around to the end. In our example, 13 + 245 became 258, which was too large to store in 8 bits and wrapped back around to 0 and ended up at 2.

NumPy will, generally, try to avoid this situation by dynamically retyping to whatever datatype will support the result:

a = np.array([6], dtype=np.uint8)
print(a)
>>> [6]

a = a + np.array([7], dtype=np.uint8)
print(a)
>>> [13]

a = a + np.array([245], dtype=np.uint8)
print(a)
>>> [2]

a = a + 260
print(a)

>>> [262]

Here, our addition caused our array, 'a', to be upscaled to use uint16 instead of uint8. Finally, NumPy offers convenience functions akin to Python's range() function to create arrays of sequential numbers:

X = np.arange(0.2, 1, .1)
print(X)

>>> array([0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], dtype=float32)

We can use this function to also generate parameter spaces that can be iterated on:

P = 10.0 ** np.arange(-7, 1, 1)
print(P)

for x, p in zip(X, P):
    print('%f, %f' % (x, p))

10.2.4 Arrays: Stringing Things Together

With our knowledge of datatypes in hand, we can begin to explore arrays. Simply put, arrays can be thought of as a sequence of values (not necessarily numbers). Arrays are 1-dimensional and can be created and accessed simply:

a = np.array([1, 2, 3])

print(type(a))
>>> <class 'numpy.ndarray'>

print(a)
>>> [1 2 3]

print(a.shape)
>>> (3,)

a[0]
>>> 1

Arrays (and, later, matrices) are zero-indexed. This makes it convenient when, for example, using Python's range() function to iterate through an array:

for i in range(3):
    print(a[i])

>>> 1
>>> 2
>>> 3

Arrays are, also, mutable and can be changed easily:

a[0] = 42
print(a)

>>> array([42, 2, 3])

NumPy also includes incredibly powerful broadcasting features. This makes it very simple to perform mathematical operations on arrays in a way that also makes intuitive sense:

a * 3
>>> array([3, 6, 9])

a ** 2
>>> array([1, 4, 9], dtype=int32)

Arrays can also interact with other arrays:

b = np.array([2, 3, 4])

print(a * b)
>>> array([2, 6, 12])

In this example, the result of multiplying together two arrays is to take the element-wise product, while multiplying by a constant will multiply each element in the array by that constant. NumPy supports all of the basic mathematical operations: addition, subtraction, multiplication, division, and powers. It also includes an extensive suite of mathematical functions, such as log() and max(), which are covered later.

10.2.5 Matrices: An Array of Arrays

Matrices can be thought of as an extension of arrays - rather than having one dimension, matrices have 2 (or more). Much like arrays, matrices can be created easily within NumPy:

m = np.array([[1, 2], [3, 4]])

print(m)
>>> [[1 2]
>>>  [3 4]]

Accessing individual elements is similar to how we did it for arrays. We simply need to pass in a number of arguments equal to the number of dimensions:

m[1][0]
>>> 3

In this example, our first index selected the row and the second selected the column - giving us our result of 3. Matrices can be extended out to any number of dimensions by simply using more indices to access specific elements (though use-cases beyond 4 may be somewhat rare).

Matrices support all of the normal mathematical functions such as +, -, *, and /. A special note: the * operator will result in an element-wise multiplication. Use @ or np.matmul() for matrix multiplication:
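Since a full np.matmul() example only appears later in the linear algebra discussion, here is a minimal, illustrative sketch (the values are chosen only for illustration) contrasting element-wise * with the @ operator, and showing how a third index selects into a 3-dimensional array:

import numpy as np

m = np.array([[1, 2], [3, 4]])

print(m * m)   # element-wise: [[ 1  4] [ 9 16]]
print(m @ m)   # matrix product, same as np.matmul(m, m): [[ 7 10] [15 22]]

# a 3-dimensional array needs three indices
cube = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
print(cube[1, 0, 1])   # 6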

print(m - m)
print(m * m)
print(m / m)

More complex mathematical functions can typically be found within the NumPy library itself:

print(np.sin(x))
print(np.sum(x))

A full listing can be found at: https://docs.scipy.org/doc/numpy/reference/routines.math.html

10.2.6 Slicing Arrays and Matrices

As one can imagine, accessing elements one-at-a-time is both slow and can potentially require many lines of code to iterate over every dimension in the matrix. Thankfully, NumPy incorporates a very powerful slicing engine that allows us to access ranges of elements easily:

m[1, :]
>>> array([3, 4])

The ':' value tells NumPy to select all elements in the given dimension. Here, we've requested all elements in row index 1. We can also use indexing to request elements within a given range:

a = np.arange(0, 10, 1)
print(a)

>>> [0 1 2 3 4 5 6 7 8 9]

a[4:8]
>>> array([4, 5, 6, 7])

Here, we asked NumPy to give us elements 4 through 7 (ranges in Python are inclusive at the start and non-inclusive at the end). We can even go backwards:

a[-5:]
>>> array([5, 6, 7, 8, 9])

In the previous example, the negative value is asking NumPy to return the last 5 elements of the array. Had the argument been ':-5', NumPy would've returned everything BUT the last five elements:

a[:-5]
>>> array([0, 1, 2, 3, 4])

Becoming more familiar with NumPy's accessor conventions will allow you to write more efficient, clearer code, as it is easier to read a simple one-line accessor than it is a multi-line, nested loop when extracting values from an array or matrix.

10.2.7 Useful Functions

The NumPy library provides several convenient mathematical functions that users can use. These functions provide several advantages over code written by users:

They are open source and typically have multiple contributors checking for errors.
Many of them utilize a C interface and will run much faster than native Python code.
They're written to be very flexible.

NumPy arrays and matrices contain many useful aggregating functions such as max(), min(), mean(), etc. These functions are usually able to run an order of magnitude faster than looping through the object, so it's important to understand what functions are available to avoid 'reinventing the wheel.' In addition, many of the functions are able to sum or average across axes, which makes them extremely useful if your data has inherent grouping. To return to a previous example:

In this example, we created a 2x2 matrix containing the numbers 1 through 4. The sum of the matrix returned the element-wise addition of the entire matrix. Summing with axis=0 collapses the rows and returns the column sums ([4, 6]), while summing with axis=1 collapses the columns and returns the row sums ([3, 7]).

m = np.array([[1, 2], [3, 4]])

print(m)
>>> [[1 2]
>>>  [3 4]]

m.sum()
>>> 10

m.sum(axis=1)
>>> [3, 7]

m.sum(axis=0)
>>> [4, 6]

10.2.8 Linear Algebra

Perhaps one of the most important uses for NumPy is its robust support for Linear Algebra functions. Like the aggregation functions described in the previous section, these functions are optimized to be much faster than user implementations and can utilize processor level features to provide very quick computations. These functions can be accessed very easily from the NumPy package:

Included within np.linalg are functions for calculating the Eigendecomposition of square matrices and symmetric matrices. Finally, to give a quick example of how easy it is to implement algorithms in NumPy, we can easily use it to calculate the cost and gradient when using simple Mean-Squared-Error (MSE):

Finally, more advanced functions are easily available to users via the linalglibraryofNumPyas:

10.2.9NumPyResources

https://docs.scipy.org/doc/numpyhttp://cs231n.github.io/python-numpy-tutorial/#numpyhttps://docs.scipy.org/doc/numpy-1.15.1/reference/routines.linalg.htmlhttps://en.wikipedia.org/wiki/Mean_squared_error

10.3SCIPY☁�SciPy is a library built around numpy and has a number of off-the-shelfalgorithmsandoperationsimplemented.Theseincludealgorithmsfromcalculus(such as integration), statistics, linear algebra, image-processing, signal

a=np.array([[1,2],[3,4]])

b=np.array([[5,6],[7,8]])

print(np.matmul(a,b))

>>>[[1922]

[4350]]

cost=np.power(Y-np.matmul(X,weights)),2).mean(axis=1)

gradient=np.matmul(X.T,np.matmul(X,weights)-y)

fromnumpyimportlinalg

A=np.diag((1,2,3))

w,v=linalg.eig(A)

print('w=',w)

print('v=',v)

processing,machinelearning.

To achieve this, SciPy bundels a number of useful open-source software formathematics,science,andengineering.Itincludesthefollowingpackages:

NumPy,

formanagingN-dimensionalarrays

SciPylibrary,

toaccessfundamentalscientificcomputingcapabilities

Matplotlib,

toconduct2Dplotting

IPython,

foranInteractiveconsole(seejupyter)

Sympy,

forsymbolicmathematics

pandas,

forprovidingdatastructuresandanalysis

10.3.1 Introduction

First we add the usual scientific computing modules with the typical abbreviations, including sp for scipy. We could invoke scipy's statistical package as sp.stats, but for the sake of laziness we abbreviate that too.

Now we create some random data to play with. We generate 100 samples from a Gaussian distribution centered at zero.

import numpy as np                     # import numpy
import scipy as sp                     # import scipy
from scipy import stats                # refer directly to stats rather than sp.stats
import matplotlib as mpl               # for visualization
from matplotlib import pyplot as plt   # refer directly to pyplot
                                       # rather than mpl.pyplot

How many elements are in the set?

What is the mean (average) of the set?

What is the minimum of the set?

What is the maximum of the set?

We can use the scipy functions too. What's the median?

What about the standard deviation and variance?

Isn't the variance the square of the standard deviation?

How close are the measures? The differences are close, as the following calculation shows.

How does this look as a histogram? See Figure 18, Figure 19, Figure 20.

s = sp.randn(100)

print('There are', len(s), 'elements in the set')

print('The mean of the set is', s.mean())

print('The minimum of the set is', s.min())

print('The maximum of the set is', s.max())

print('The median of the set is', sp.median(s))

print('The standard deviation is', sp.std(s),
      'and the variance is', sp.var(s))

print('The square of the standard deviation is', sp.std(s)**2)

print('The difference is', abs(sp.std(s)**2 - sp.var(s)))

print('And in decimal form, the difference is %0.16f' %
      (abs(sp.std(s)**2 - sp.var(s))))

plt.hist(s)   # yes, one line of code for a histogram
plt.show()

Figure 18: Histogram 1

Let us add some titles.

Figure 19: Histogram 2

Typically we do not include titles when we prepare images for inclusion in LaTeX. There we use the caption to describe what the figure is about.

plt.clf()   # clear out the previous plot
plt.hist(s)
plt.title("Histogram Example")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.show()

plt.clf()   # clear out the previous plot
plt.hist(s)
plt.xlabel("Value")

Figure 20: Histogram 3

Let us try out some linear regression, or curve fitting. See Figure 21.

plt.ylabel("Frequency")
plt.show()

import random

def F(x):
    return 2 * x - 2

def add_noise(x):
    return x + random.uniform(-1, 1)

X = range(0, 10, 1)

Y = []
for i in range(len(X)):
    Y.append(add_noise(F(X[i])))   # noisy samples of F

plt.clf()   # clear out the old figure
plt.plot(X, Y, '.')
plt.show()

Figure 21: Result 1

Now let's try linear regression to fit the curve.

What is the slope and y-intercept of the fitted curve?

Now let's see how well the curve fits the data. We'll call the fitted curve F'.

To save images into a PDF file for inclusion into LaTeX documents you can save the images as follows. Other formats such as png are also possible, but the quality is naturally not sufficient for inclusion in papers and documents. For that you certainly want to use PDF.

m, b, r, p, est_std_err = stats.linregress(X, Y)

print('The slope is', m, 'and the y-intercept is', b)

def Fprime(x):   # the fitted curve
    return m * x + b

X = range(0, 10, 1)

Yprime = []
for i in range(len(X)):
    Yprime.append(Fprime(X[i]))

plt.clf()   # clear out the old figure

# the observed points, blue dots
plt.plot(X, Y, '.', label='observed points')

# the interpolated curve, connected red line
plt.plot(X, Yprime, 'r-', label='estimated points')

plt.title("Linear Regression Example")   # title
plt.xlabel("x")                          # horizontal axis title
plt.ylabel("y")                          # vertical axis title

# legend labels to plot
plt.legend(['observed points', 'estimated points'])

# comment out so that you can save the figure
# plt.show()

The save of the figure has to occur before you use the show() command. See Figure 22.

Figure 22: Result 2

10.3.2 References

For more information about SciPy we recommend that you visit the following link:

https://www.scipy.org/getting-started.html#learning-to-work-with-scipy

Additional material and inspiration for this section are from:

"Getting Started guide", https://www.scipy.org/getting-started.html

Prasanth, "Simple statistics with SciPy", Comfort at 1 AU, February 28, 2011. https://oneau.wordpress.com/2011/02/28/simple-statistics-with-scipy/

SciPy Cookbook. Last updated: 2015. http://scipy-cookbook.readthedocs.io/

plt.savefig("regression.pdf", bbox_inches='tight')
plt.savefig('regression.png')
plt.show()


10.4 SCIKIT-LEARN ☁

Learning Objectives

Exploratory data analysis
Pipeline to prepare data
Full learning pipeline
Fine tune the model
Significance tests

10.4.1 Introduction to Scikit-learn

Scikit-learn is a Machine Learning specific library used in Python. The library can be used for data mining and analysis. It is built on top of NumPy, matplotlib and SciPy. Scikit-learn features dimensionality reduction, clustering, regression and classification algorithms. It also features model selection using grid search, cross validation and metrics.

Scikit-learn also enables users to preprocess the data, which can then be used for machine learning, using modules like preprocessing and feature extraction.

In this section we demonstrate how simple it is to use k-means in scikit-learn.

10.4.2 Installation

If you already have a working installation of numpy and scipy, the easiest way to install scikit-learn is using pip:

$ pip install numpy
$ pip install scipy -U
$ pip install -U scikit-learn

10.4.3 Supervised Learning

Supervised learning is used in machine learning when we already know a set of output predictions based on input characteristics and, based on that, we need to predict the target for a new input. Training data is used to train the model, which then can be used to predict the output from a bounded set.

Problems can be of two types:

1. Classification: Training data belongs to three or four classes/categories and, based on the label, we want to predict the class/category for the unlabeled data.

2. Regression: Training data consists of input vectors with corresponding continuous target values, and the goal is to predict such a continuous value for new inputs.

10.4.4 Unsupervised Learning

Unsupervised learning is used in machine learning when we have the training set available but without any corresponding target. The outcome of the problem is to discover groups within the provided input. It can be done in many ways.

A few of them are listed here:

1. Clustering: Discover groups of similar characteristics.

2. Density Estimation: Finding the distribution of data within the provided input, or changing the data from a high dimensional space to two or three dimensions.

10.4.5 Building an end to end pipeline for supervised machine learning using Scikit-learn

A data pipeline is a set of processing components that are sequenced to produce meaningful data. Pipelines are commonly used in machine learning, since there is a lot of data transformation and manipulation that needs to be applied to make data useful for machine learning. All components are sequenced in a way that the output of one component becomes the input for the next, and each of the components is self contained. Components interact with each other using data.

Even if a component breaks, the downstream component can run normally using the last output. Sklearn provides the ability to build pipelines that can be transformed and modeled for machine learning.

10.4.6 Steps for developing a machine learning model

1. Explore the domain space
2. Extract the problem definition
3. Get the data that can be used to make the system learn to solve the problem definition.
4. Discover and visualize the data to gain insights
5. Feature engineering and prepare the data
6. Fine tune your model
7. Evaluate your solution using metrics
8. Once proven, launch and maintain the model.

10.4.7 Exploratory Data Analysis

Example project = Fraud detection system

The first step is to load the data into a dataframe in order for a proper analysis to be done on the attributes.

Perform the basic analysis on the data shape and null value information.

Here is an example of a few of the visual data analysis methods.

data = pd.read_csv('dataset/data_file.csv')
data.head()

print(data.shape)
print(data.info())
data.isnull().values.any()

10.4.7.1 Bar plot

A bar chart or graph is a graph with rectangular bars or bins that are used to plot categorical values. Each bar in the graph represents a categorical variable and the height of the bar is proportional to the value represented by it.

Bar graphs are used:

To make comparisons between variables
To visualize any trend in the data, i.e., they show the dependence of one variable on another
To estimate values of a variable

Figure 23: Example of scikit-learn bar plots

plt.ylabel('Transactions')
plt.xlabel('Type')
data.type.value_counts().plot.bar()

10.4.7.2 Correlation between attributes

Attributes in a dataset can be related based on different aspects.

Examples include attributes dependent on one another, or loosely or tightly coupled. Another example is two variables that are associated with a third one.

In order to understand the relationship between attributes, correlation represents the best visual way to get an insight. Positive correlation means both attributes move in the same direction. Negative correlation refers to opposite directions: one attribute's values increasing results in value decreases for the other. Zero correlation is when the attributes are unrelated.

Figure 24: scikit-learn correlation array

# compute the correlation matrix
corr = data.corr()

# generate a mask for the lower triangle
mask = np.zeros_like(corr, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# set up the matplotlib figure
f, ax = plt.subplots(figsize=(18, 18))

# generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3,
            square=True,
            linewidths=.5, cbar_kws={"shrink": .5}, ax=ax);

10.4.7.3 Histogram Analysis of dataset attributes

A histogram consists of a set of counts that represent the number of times some event occurred.

Figure 25: scikit-learn

%matplotlib inline
data.hist(bins=30, figsize=(20, 15))
plt.show()

10.4.7.4 Boxplot Analysis

Box plot analysis is useful in detecting whether a distribution is skewed and in detecting outliers in the data.

fig, axs = plt.subplots(2, 2, figsize=(10, 10))

tmp = data.loc[(data.type == 'TRANSFER'), :]

a = sns.boxplot(x='isFlaggedFraud', y='amount', data=tmp, ax=axs[0][0])
axs[0][0].set_yscale('log')

b = sns.boxplot(x='isFlaggedFraud', y='oldbalanceDest', data=tmp, ax=axs[0][1])
axs[0][1].set(ylim=(0, 0.5e8))

c = sns.boxplot(x='isFlaggedFraud', y='oldbalanceOrg', data=tmp, ax=axs[1][0])
axs[1][0].set(ylim=(0, 3e7))

d = sns.regplot(x='oldbalanceOrg', y='amount', data=tmp.loc[(tmp.isFlaggedFraud == 1), :], ax=axs[1][1])
plt.show()

Figure 26: scikit-learn

10.4.7.5 Scatterplot Analysis

The scatter plot displays values of two numerical variables as Cartesian coordinates.

plt.figure(figsize=(12, 8))
sns.pairplot(data[['amount', 'oldbalanceOrg', 'oldbalanceDest', 'isFraud']], hue='isFraud')

Figure 27: scikit-learn scatter plots

10.4.8 Data Cleansing - Removing Outliers

If the transaction amount is lower than 5 percent of all the transactions AND does not exceed USD 3000, we will exclude it from our analysis to reduce Type 1 costs.

If the transaction amount is higher than 95 percent of all the transactions AND exceeds USD 500000, we will exclude it from our analysis, and use a blanket review process for such transactions (similar to the isFlaggedFraud column in the original dataset) to reduce Type 2 costs.

low_exclude = np.round(np.minimum(fin_samp_data.amount.quantile(0.05), 3000), 2)
high_exclude = np.round(np.maximum(fin_samp_data.amount.quantile(0.95), 500000), 2)

### Updating Data to exclude records prone to Type 1 and Type 2 costs
low_data = fin_samp_data[fin_samp_data.amount > low_exclude]
data = low_data[low_data.amount < high_exclude]

10.4.9 Pipeline Creation

A machine learning pipeline is used to help automate machine learning workflows. It operates by enabling a sequence of data to be transformed and correlated together in a model that can be tested and evaluated to achieve an outcome, whether positive or negative.

10.4.9.1 Defining DataFrameSelector to separate Numerical and Categorical attributes

Sample function to separate out numerical and categorical attributes.

10.4.9.2 Feature Creation/Additional Feature Engineering

During EDA we identified that there are transactions where the balances do not tally after the transaction is completed. We believe this could potentially be cases where fraud is occurring. To account for this error in the transactions, we define two new features, "errorBalanceOrig" and "errorBalanceDest", calculated by adjusting the amount with the before and after balances for the Originator and Destination accounts.

Below, we create a function that allows us to create these features in a pipeline.

from sklearn.base import BaseEstimator, TransformerMixin

# Create a class to select numerical or categorical columns
# since Scikit-Learn doesn't handle DataFrames yet
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values

from sklearn.base import BaseEstimator, TransformerMixin

# column index
amount_ix, oldbalanceOrg_ix, newbalanceOrig_ix, oldbalanceDest_ix, newbalanceDest_ix = 0, 1, 2, 3, 4

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self):   # no *args or **kargs
        pass
    def fit(self, X, y=None):
        return self   # nothing else to do
    def transform(self, X, y=None):
        errorBalanceOrig = X[:, newbalanceOrig_ix] + X[:, amount_ix] - X[:, oldbalanceOrg_ix]
        errorBalanceDest = X[:, oldbalanceDest_ix] + X[:, amount_ix] - X[:, newbalanceDest_ix]
        return np.c_[X, errorBalanceOrig, errorBalanceDest]

10.4.10 Creating Training and Testing datasets

The training set includes the set of input examples that the model will be fit to, or trained on, by adjusting the parameters. The testing dataset is critical to test the generalizability of the model. By using this set, we can get the working accuracy of our model.

The testing set should not be exposed to the model until model training has been completed. This way the results from testing will be more reliable.

10.4.11 Creating a pipeline for numerical and categorical attributes

Identifying columns with numerical and categorical characteristics.

10.4.12 Selecting the algorithm to be applied

Algorithm selection primarily depends on the objective you are trying to solve and what kind of dataset is available. There are different types of algorithms which can be applied and we will look into a few of them here.

10.4.12.1 Linear Regression

This algorithm can be applied when you want to compute some continuous value. To predict some future value of a process which is currently running, you can go with a regression algorithm.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

X_train_num = X_train[["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"]]
X_train_cat = X_train[["type"]]
X_model_col = ["amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest", "type"]

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer

num_attribs = list(X_train_num)
cat_attribs = list(X_train_cat)

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('cat_encoder', CategoricalEncoder(encoding="onehot-dense"))
])

Examples where linear regression can be used are:

1. Predict the time taken to go from one place to another
2. Predict the sales for a future month
3. Predict sales data and improve yearly projections.

10.4.12.2 Logistic Regression

This algorithm can be used to perform binary classification. It can be used if you want a probabilistic framework, or in case you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

1. Customer churn prediction.
2. Credit Scoring & Fraud Detection, which is the example problem we are trying to solve in this chapter.
3. Calculating the effectiveness of marketing campaigns.

10.4.12.3 Decision trees

Decision trees handle feature interactions and they're non-parametric. However, they don't

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import time

scl = StandardScaler()
X_train_std = scl.fit_transform(X_train)
X_test_std = scl.transform(X_test)

start = time.time()
lin_reg = LinearRegression()
lin_reg.fit(X_train_std, y_train)   # SKLearn's linear regression
y_train_pred = lin_reg.predict(X_train_std)
train_time = time.time() - start

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_lr_sklearn = LogisticRegression(multi_class="multinomial", C=1e6, solver="sag", max_iter=15)
model_lr_sklearn.fit(X_train, y_train)

y_pred_test = model_lr_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["LRSklearn", np.round(acc, 3)]
results

support online learning: the entire tree needs to be rebuilt when a new training dataset comes in, and memory consumption is very high.

Decision trees can be used for the following cases:

1. Investment decisions
2. Customer churn
3. Bank loan defaulters
4. Build vs Buy decisions
5. Sales lead qualifications

10.4.12.4 KMeans

This algorithm is used when we are not aware of the labels and one needs to be created based on the features of objects. An example would be to divide a group of people into different subgroups based on a common theme or attribute.

The main disadvantage of K-means is that you need to know exactly the number of clusters or groups that is required. It takes a lot of iteration to come up with the best K.
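A full k-means walkthrough appears later in this chapter; as a minimal, illustrative sketch of the point just made, one typically fits KMeans for several candidate values of k and compares a quality measure such as the inertia (the toy data and the range of k values are assumptions made only for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# X is a toy dataset; in practice use your own feature matrix
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(k, km.inertia_)   # inertia drops as k grows; look for the "elbow"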

from sklearn.tree import DecisionTreeRegressor

dt = DecisionTreeRegressor()

start = time.time()
dt.fit(X_train_std, y_train)
y_train_pred = dt.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = dt.predict(X_test_std)
test_time = time.time() - start

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, PredefinedSplit
from sklearn.metrics import accuracy_score

X_train, _, y_train, _ = train_test_split(X_train, y_train, stratify=y_train, train_size=subsample_rate, random_state=42)
X_test, _, y_test, _ = train_test_split(X_test, y_test, stratify=y_test, train_size=subsample_rate, random_state=42)

model_knn_sklearn = KNeighborsClassifier(n_jobs=-1)
model_knn_sklearn.fit(X_train, y_train)

y_pred_test = model_knn_sklearn.predict(X_test)
acc = accuracy_score(y_test, y_pred_test)
results.loc[len(results)] = ["KNNArbitarySklearn", np.round(acc, 3)]
results

10.4.12.5 Support Vector Machines

SVM is a supervised ML technique used for pattern recognition and classification problems when your data has exactly two classes. It is popular in text classification problems.

A few cases where SVM can be used are:

1. Detecting persons with common diseases.
2. Hand-written character recognition
3. Text categorization
4. Stock market price prediction
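The grid-search comparison later in this section uses SVC as one of its models; as a standalone, minimal sketch (the toy data is made up purely for illustration), a support vector classifier is fit like any other scikit-learn estimator:

from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# toy two-class problem, for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

svm = SVC(C=1.0, kernel='rbf', gamma='scale').fit(X_train, y_train)
print(svm.score(X_test, y_test))   # mean accuracy on the held-out data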

10.4.12.6 Naive Bayes

Naive Bayes is used for large datasets. This algorithm works well even when we have limited CPU and memory available. It works by calculating a bunch of counts and requires less training data. The algorithm cannot, however, learn interactions between features.

Naive Bayes can be used in real-world applications such as:

1. Sentiment analysis and text classification
2. Recommendation systems like Netflix, Amazon
3. To mark an email as spam or not spam
4. Face recognition
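No Naive Bayes code appears in the comparison below, so here is a minimal, illustrative sketch using scikit-learn's GaussianNB on made-up data (all variable names are assumptions for illustration only):

from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)   # fits per-class means and variances (the "counts")
print(nb.score(X_test, y_test))           # accuracy of the fitted model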

10.4.12.7 Random Forest

Random forest is similar to a decision tree. It can be used for both regression and classification problems with large datasets.

A few cases where it can be applied:

1. Predict patients for high risks.
2. Predict parts failures in manufacturing.
3. Predict loan defaulters.

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(n_estimators=400, criterion='mse', random_state=1, n_jobs=-1)

start = time.time()
forest.fit(X_train_std, y_train)

10.4.12.8 Neural networks

A neural network works based on the weights of the connections between neurons. The weights are trained and, based on them, the neural network can be utilized to predict a class or a quantity. Neural networks are resource and memory intensive.

A few cases where they can be applied:

1. Unsupervised learning tasks, such as feature extraction.
2. Extracting features from raw images or speech with much less human intervention
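scikit-learn itself ships a simple feed-forward neural network; a minimal, illustrative sketch (toy data, and the two hidden layer sizes are an arbitrary choice for illustration) looks like this:

from sklearn.neural_network import MLPClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=20, random_state=1)

# two hidden layers of 32 neurons each; connection weights are learned by backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(32, 32), max_iter=500, random_state=1)
mlp.fit(X, y)
print(mlp.score(X, y))   # training accuracy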

10.4.12.9 Deep Learning using Keras

Keras is one of the most powerful and easy-to-use Python libraries for developing and evaluating deep learning models. It wraps the efficient numerical computation libraries Theano and TensorFlow.
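As a minimal, illustrative sketch of what a Keras model for a binary classification task such as the fraud example might look like (the layer sizes, optimizer, and the assumed input dimension of 7 are not taken from the chapter's pipeline):

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, activation='relu', input_shape=(7,)),   # 7 input features (assumed)
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid')                    # probability of the positive class
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# model.fit(X_train, y_train, epochs=10, batch_size=64)   # once data is prepared
model.summary()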

10.4.12.10 XGBoost

XGBoost stands for eXtreme Gradient Boosting. XGBoost is an implementation of gradient boosted decision trees designed for speed and performance. It is engineered for efficiency of compute time and memory resources.
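XGBClassifier appears later in the grid-search comparison; a standalone, minimal sketch of its scikit-learn-style interface (toy data, and the shown parameters are illustrative defaults, not tuned values) could look like:

from xgboost import XGBClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

xgb = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
xgb.fit(X_train, y_train)
print(xgb.score(X_test, y_test))   # accuracy on the held-out split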

10.4.13 Scikit Cheat Sheet

Scikit-learn has put together a very in-depth and well explained flowchart to help you choose the right algorithm, which I find very handy.

y_train_pred = forest.predict(X_train_std)
train_time = time.time() - start

start = time.time()
y_test_pred = forest.predict(X_test_std)
test_time = time.time() - start

Figure 28: scikit-learn

10.4.14 Parameter Optimization

Machine learning models are parameterized so that their behavior can be tuned for a given problem. These models can have many parameters, and finding the best combination of parameters can be treated as a search problem.

A parameter is a configuration that is part of the model and whose values can be derived from the given data.

1. Required by the model when making predictions.
2. Values define the skill of the model on your problem.
3. Estimated or learned from data.
4. Often not set manually by the practitioner.
5. Often saved as part of the learned model.

10.4.14.1 Hyperparameter optimization/tuning algorithms

Grid search is an approach to hyperparameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid.

Random search provides a statistical distribution for each hyperparameter from which values may be randomly sampled.
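The section below implements the grid-search variant with GridSearchCV; for completeness, a minimal, illustrative sketch of the random-search variant (the estimator, the distributions, and the toy data are assumptions made only for illustration) could look like:

from scipy.stats import uniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, random_state=3)

# sample C uniformly from [0.01, 100.01] instead of trying a fixed grid
param_distributions = {'C': uniform(0.01, 100), 'penalty': ['l1', 'l2']}

search = RandomizedSearchCV(LogisticRegression(solver='liblinear'),
                            param_distributions, n_iter=20, cv=4, random_state=3)
search.fit(X, y)
print(search.best_params_, search.best_score_)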

10.4.15 Experiments with Keras (deep learning), XGBoost, and SVM (SVC) compared to Logistic Regression (Baseline)

10.4.15.1 Creating a parameter grid

10.4.15.2 Implementing Grid search with the models and also creating metrics from each of the models.

grid_param = [
    [{   # LogisticRegression
        'model__penalty': ['l1', 'l2'],
        'model__C': [0.01, 1.0, 100]
    }],

    [{   # keras
        'model__optimizer': optimizer,
        'model__loss': loss
    }],

    [{   # SVM
        'model__C': [0.01, 1.0, 100],
        'model__gamma': [0.5, 1],
        'model__max_iter': [-1]
    }],

    [{   # XGBClassifier
        'model__min_child_weight': [1, 3, 5],
        'model__gamma': [0.5],
        'model__subsample': [0.6, 0.8],
        'model__colsample_bytree': [0.6],
        'model__max_depth': [3]
    }]
]

Pipeline(memory=None,
         steps=[('preparation', FeatureUnion(n_jobs=None,
             transformer_list=[('num_pipeline', Pipeline(memory=None,
                 steps=[('selector', DataFrameSelector(attribute_names=['amount', 'oldbalanceOrg', 'newbalanceOrig', 'oldbalanceDest'
         tol=0.0001, verbose=0, warm_start=False))])

from sklearn.metrics import mean_squared_error
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from xgboost.sklearn import XGBClassifier
from sklearn.svm import SVC

test_scores = []

# Machine Learning Algorithm (MLA) Selection and Initialization
MLA = [
    linear_model.LogisticRegression(),
    keras_model,
    SVC(),
    XGBClassifier()

]

10.4.15.3 Results table from the Model evaluation with metrics.

# create table to compare MLA metrics
MLA_columns = ['Name', 'Score', 'Accuracy_Score', 'ROC_AUC_score', 'final_rmse', 'Classification_error', 'Recall_Score', 'Precision_Score']

MLA_compare = pd.DataFrame(columns=MLA_columns)
Model_Scores = pd.DataFrame(columns=['Name', 'Score'])

row_index = 0
for alg in MLA:

    # set name and parameters
    MLA_name = alg.__class__.__name__
    MLA_compare.loc[row_index, 'Name'] = MLA_name
    # MLA_compare.loc[row_index, 'Parameters'] = str(alg.get_params())

    full_pipeline_with_predictor = Pipeline([
        ("preparation", full_pipeline),   # combination of numerical and categorical pipelines
        ("model", alg)
    ])

    grid_search = GridSearchCV(full_pipeline_with_predictor, grid_param[row_index], cv=4, verbose=2, scoring='f1', return_train_score=True)
    grid_search.fit(X_train[X_model_col], y_train)
    y_pred = grid_search.predict(X_test)

    MLA_compare.loc[row_index, 'Accuracy_Score'] = np.round(accuracy_score(y_pred, y_test), 3)
    MLA_compare.loc[row_index, 'ROC_AUC_score'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)
    MLA_compare.loc[row_index, 'Score'] = np.round(grid_search.score(X_test, y_test), 3)

    negative_mse = grid_search.best_score_
    scores = np.sqrt(-negative_mse)
    final_mse = mean_squared_error(y_test, y_pred)
    final_rmse = np.sqrt(final_mse)
    MLA_compare.loc[row_index, 'final_rmse'] = final_rmse

    confusion_matrix_var = confusion_matrix(y_test, y_pred)
    TP = confusion_matrix_var[1, 1]
    TN = confusion_matrix_var[0, 0]
    FP = confusion_matrix_var[0, 1]
    FN = confusion_matrix_var[1, 0]
    MLA_compare.loc[row_index, 'Classification_error'] = np.round(((FP + FN) / float(TP + TN + FP + FN)), 5)
    MLA_compare.loc[row_index, 'Recall_Score'] = np.round(metrics.recall_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'Precision_Score'] = np.round(metrics.precision_score(y_test, y_pred), 5)
    MLA_compare.loc[row_index, 'F1_Score'] = np.round(f1_score(y_test, y_pred), 5)

    MLA_compare.loc[row_index, 'mean_test_score'] = grid_search.cv_results_['mean_test_score'].mean()
    MLA_compare.loc[row_index, 'mean_fit_time'] = grid_search.cv_results_['mean_fit_time'].mean()

    Model_Scores.loc[row_index, 'MLAName'] = MLA_name
    Model_Scores.loc[row_index, 'MLScore'] = np.round(metrics.roc_auc_score(y_test, y_pred), 3)

    # Collect Mean Test scores for statistical significance test
    test_scores.append(grid_search.cv_results_['mean_test_score'])

    row_index += 1

Figure 29: scikit-learn

10.4.15.4 ROC AUC Score

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes. The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s.
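As a minimal, illustrative sketch of how the score itself is computed with scikit-learn (the true labels and predicted probabilities shown are placeholders for the chapter's own y_test and model outputs):

from sklearn.metrics import roc_auc_score

# y_true: true 0/1 labels, y_scores: predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

print(roc_auc_score(y_true, y_scores))   # 1.0 means perfect separation, 0.5 is chance level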

Figure 30: scikit-learn

Figure 31: scikit-learn

10.4.16 K-means in scikit-learn

10.4.16.1 Import

10.4.17 K-means Algorithm

In this section we demonstrate how simple it is to use k-means in scikit-learn.

10.4.17.1 Import

10.4.17.2 Create samples

from time import time

import numpy as np
import matplotlib.pyplot as plt

from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

10.4.17.3 Create samples

10.4.17.4 Visualize

See Figure 32

np.random.seed(42)

digits = load_digits()
data = scale(digits.data)

n_samples, n_features = data.shape
n_digits = len(np.unique(digits.target))
labels = digits.target

sample_size = 300

print("n_digits: %d, \t n_samples %d, \t n_features %d" % (n_digits, n_samples, n_features))

print(79 * '_')
print('%9s' % 'init'
      '    time  inertia  homo  compl  v-meas  ARI  AMI  silhouette')


def bench_k_means(estimator, name, data):
    t0 = time()
    estimator.fit(data)
    print('%9s  %.2fs  %i  %.3f  %.3f  %.3f  %.3f  %.3f  %.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
              name="k-means++", data=data)

bench_k_means(KMeans(init='random', n_clusters=n_digits, n_init=10),
              name="random", data=data)


# in this case the seeding of the centers is deterministic, hence we run the
# kmeans algorithm only once with n_init=1
pca = PCA(n_components=n_digits).fit(data)
bench_k_means(KMeans(init=pca.components_, n_clusters=n_digits, n_init=1),
              name="PCA-based", data=data)

print(79 * '_')

10.4.17.5 Visualize

See Figure 32

Figure 32: Result

reduced_data = PCA(n_components=2).fit_transform(data)
kmeans = KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
kmeans.fit(reduced_data)

# Step size of the mesh. Decrease to increase the quality of the VQ.
h = .02   # point in the mesh [x_min, x_max] x [y_min, y_max].

# Plot the decision boundary. For that, we will assign a color to each
x_min, x_max = reduced_data[:, 0].min() - 1, reduced_data[:, 0].max() + 1
y_min, y_max = reduced_data[:, 1].min() - 1, reduced_data[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)

plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')

plt.plot(reduced_data[:, 0], reduced_data[:, 1], 'k.', markersize=2)

# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=169, linewidths=3,
            color='w', zorder=10)

plt.title('K-means clustering on the digits dataset (PCA-reduced data)\n'
          'Centroids are marked with white cross')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())
plt.show()

10.5 DASK - RANDOM FOREST FEATURE DETECTION ☁

10.5.1 Setup

First we need our tools. pandas gives us the DataFrame, very similar to R's DataFrames. The DataFrame is a structure that allows us to work with our data more easily. It has nice features for slicing and transformation of data, and easy ways to do basic statistics.

numpy has some very handy functions that work on DataFrames.

10.5.2 Dataset

We are using the wine quality dataset, archived at UCI's Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php).

Now we will load our data. pandas makes it easy!

Like in R, there is a .describe() method that gives basic statistics for every column in the dataset.

import pandas as pd
import numpy as np

# red wine quality data, packed in a DataFrame
red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)

# white wine quality data, packed in a DataFrame
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# rose? other fruit wines? plum wine? :(

# for red wines
red_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    1599.000000       1599.000000  1599.000000     1599.000000  1599.000000
mean        8.319637          0.527821     0.270976        2.538806     0.087467
std         1.741096          0.179060     0.194801        1.409928     0.047065
min         4.600000          0.120000     0.000000        0.900000     0.012000
25%         7.100000          0.390000     0.090000        1.900000     0.070000
50%         7.900000          0.520000     0.260000        2.200000     0.079000
75%         9.200000          0.640000     0.420000        2.600000     0.090000
max        15.900000          1.580000     1.000000       15.500000     0.611000

# for white wines
white_df.describe()

       fixed acidity  volatile acidity  citric acid  residual sugar    chlorides
count    4898.000000       4898.000000  4898.000000     4898.000000  4898.000000
mean        6.854788          0.278241     0.334192        6.391415     0.045772
std         0.843868          0.100795     0.121020        5.072058     0.021848
min         3.800000          0.080000     0.000000        0.600000     0.009000
25%         6.300000          0.210000     0.270000        1.700000     0.036000
50%         6.800000          0.260000     0.320000        5.200000     0.043000
75%         7.300000          0.320000     0.390000        9.900000     0.050000
max        14.200000          1.100000     1.660000       65.800000     0.346000

Sometimes it is easier to understand the data visually. A histogram of the white wine quality data citric acid samples is shown next. You can of course visualize other columns' data or other datasets. Just replace the DataFrame and column name (see Figure 33).

import matplotlib.pyplot as plt

def extract_col(df, col_name):
    return list(df[col_name])

col = extract_col(white_df, 'citric acid')   # can replace with another dataframe or column

plt.hist(col)

# TODO: add axes and such to set a good example
plt.show()

Figure 33: Histogram

10.5.3 Detecting Features

Let us try out some elementary machine learning models. These models are not always for prediction. They are also useful to find what features are most predictive of a variable of interest. Depending on the classifier you use, you may need to transform the data pertaining to that variable.

10.5.3.1 Data Preparation

Let us assume we want to study what features are most correlated with pH. pH of course is real-valued, and continuous. The classifiers we want to use usually need labeled or integer data. Hence, we will transform the pH data, assigning wines with pH higher than average as hi (more basic or alkaline) and wines with pH lower than average as lo (more acidic).

# refresh to make Jupyter happy

red_df = pd.read_csv('winequality-red.csv', sep=';', header=0, index_col=False)
white_df = pd.read_csv('winequality-white.csv', sep=';', header=0, index_col=False)

# TODO: data cleansing functions here, e.g. replacement of NaN

# if the variable you want to predict is continuous, you can map ranges of values
# to integer/binary/string labels

# for example, map the pH data to 'hi' and 'lo' if a pH value is more than or
# less than the mean pH, respectively

M = np.mean(list(red_df['pH']))   # expect inelegant code in these mappings
Lf = lambda p: int(p < M) * 'lo' + int(p >= M) * 'hi'   # some C-style hackery

# create the new classifiable variable
red_df['pH-hi-lo'] = list(map(Lf, list(red_df['pH'])))   # list() needed on Python 3

# and remove the predecessor
del red_df['pH']

Now we specify which dataset and variable you want to predict by assigning values to SELECTED_DF and TARGET_VAR, respectively.

We like to keep a parameter file where we specify data sources and such. This lets me create generic analytics code that is easy to reuse.

After we have specified what dataset we want to study, we split the training and test datasets. We then scale (normalize) the data, which makes most classifiers run better.

Now we pick a classifier. As you can see, there are many to try out, and even more in scikit-learn's documentation and many examples and tutorials. Random Forests are data science workhorses. They are the go-to method for most data scientists. Be careful relying on them though - they tend to overfit. We try to avoid overfitting by separating the training and test datasets.

10.5.4 Random Forest

Now we will test it out with the default parameters.

Note that this code is boilerplate. You can use it interchangeably for most scikit-learn models.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics

# make selections here without digging in code
SELECTED_DF = red_df        # selected dataset
TARGET_VAR = 'pH-hi-lo'     # the predicted variable

# generate nameless data structures
df = SELECTED_DF
target = np.array(df[TARGET_VAR]).ravel()
del df[TARGET_VAR]          # no cheating

# TODO: data cleansing function calls here

# split datasets for training and testing
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.2)

# set up the scaler
scaler = StandardScaler()
scaler.fit(X_train)

# apply the scaler
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# pick a classifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, ExtraTreeClassifier, ExtraTreeRegressor
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

clf = RandomForestClassifier()

Now output the results. For Random Forests, we get a feature ranking. Relative importances usually exponentially decay. The first few highly-ranked features are usually the most important.

Feature ranking:

fixed acidity           0.269778
citric acid             0.171337
density                 0.089660
volatile acidity        0.088965
chlorides               0.082945
alcohol                 0.080437
total sulfur dioxide    0.067832
sulphates               0.047786
free sulfur dioxide     0.042727
residual sugar          0.037459
quality                 0.021075

Sometimes it's easier to visualize. We'll use a bar chart. See Figure 34.

# test it out
model = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
conf_matrix = metrics.confusion_matrix(y_test, pred)
var_score = clf.score(X_test, y_test)

# the results
importances = clf.feature_importances_
indices = np.argsort(importances)[::-1]

# for the sake of clarity
num_features = X_train.shape[1]
features = list(map(lambda x: df.columns[x], indices))               # list() needed on Python 3
feature_importances = list(map(lambda x: importances[x], indices))

print('Feature ranking:\n')

for i in range(num_features):
    feature_name = features[i]
    feature_importance = feature_importances[i]
    print('%s%f' % (feature_name.ljust(30), feature_importance))

plt.clf()
plt.bar(range(num_features), feature_importances)
plt.xticks(range(num_features), features, rotation=90)
plt.ylabel('relative importance (a.u.)')
plt.title('Relative importances of most predictive features')
plt.show()

Figure 34: Result

The same data can also be loaded with dask instead of pandas, which keeps the interface nearly identical while allowing out-of-core and parallel execution:

import dask.dataframe as dd

red_df = dd.read_csv('winequality-red.csv', sep=';', header=0)
white_df = dd.read_csv('winequality-white.csv', sep=';', header=0)

10.5.5 Acknowledgement

This notebook was developed by Juliette Zerick and Gregor von Laszewski

10.6 PARALLEL COMPUTING IN PYTHON ☁

In this module we will review the available Python modules that can be used for parallel computing. Parallel computing can be in the form of either multi-threading or multi-processing. In the multi-threading approach, the threads run in the same shared memory heap, whereas in the case of multi-processing, the memory heaps of the processes are separate and independent; therefore the communication between the processes is a little bit more complex.

10.6.1 Multi-threading in Python

Threading in Python is perfect for I/O operations where the process is expected to be idle regularly, e.g. web scraping. This is a very useful feature because several applications and scripts might spend the majority of their runtime waiting for network or data I/O. In several cases, e.g. web scraping, the resources, i.e. downloading from different websites, are most of the time independent. Therefore the processor can download in parallel and join the results at the end.

10.6.1.1 Thread vs Threading

There are two built-in modules in Python that are related to threading, namely thread and threading. The former module was deprecated for some time in Python 2, and in Python 3 it was renamed to _thread for the sake of backwards compatibility. The _thread module provides a low-level threading API for multi-threading in Python, whereas the module threading builds a high-level threading interface on top of it.

Thread() is the main entry point of the threading module, the two important arguments of which are target, for specifying the callable object, and args, to pass the arguments for the target callable. We illustrate these in the following example:

import threading

def hello_thread(thread_num):
    print("Hello from Thread", thread_num)

if __name__ == '__main__':
    for thread_num in range(5):
        t = threading.Thread(target=hello_thread, args=(thread_num,))
        t.start()

This is the output of the previous example:

In [1]: %run threading.py
Hello from Thread 0
Hello from Thread 1
Hello from Thread 2
Hello from Thread 3
Hello from Thread 4

In case you are not familiar with the if __name__ == '__main__': statement, what it does is basically making sure that the code nested under this condition will be run only if you run your module as a program, and it will not run in case your module is imported in another file.

10.6.1.2 Locks

As mentioned prior, the memory space is shared between the threads. This is at the same time beneficial and problematic: it is beneficial in the sense that the communication between the threads becomes easy; however, you might experience strange outcomes if you let several threads change the same variable without caution, e.g. thread 2 changes variable x while thread 1 is working with it. This is when lock comes into play. Using a lock, you can allow only one thread to work with a variable. In other words, only a single thread can hold the lock. If the other threads need to work with that variable, they have to wait until the other thread is done and the variable is "unlocked".

We illustrate this with a simple example:

Suppose we want to print multiples of 3 between 1 and 12, i.e. 3, 6, 9 and 12. For the sake of argument, we try to do this using 2 threads and a nested for loop. Then we create a global variable called counter and we initialize it with 0. Then whenever each of the incrementer1 or incrementer2 functions is called, the counter is incremented by 3 twice (the counter is incremented by 6 in each function call). If you run the previous code, you should be really lucky if you get the following as part of your output:

import threading

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is now %d" % counter)

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()

Counter is now 3
Counter is now 6
Counter is now 9
Counter is now 12

The reason is the conflict that happens between the threads while incrementing the counter in the nested for loop. As you probably noticed, the first level for loop is equivalent to adding 3 to the counter, and the conflict that might happen is not effective on that level but on the nested for loop. Accordingly, the output of the previous code is different in every run. This is an example output:

$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 4
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 8
Greeter 1 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 10
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

We can fix this issue using a lock: whenever one of the functions is going to increment the value by 3, it will acquire() the lock, and when it is done the function will release() the lock. This mechanism is illustrated in the following code:

import threading

increment_by_3_lock = threading.Lock()

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 1 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

def incrementer2():
    global counter
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter += 1
            print("Greeter 2 incremented the counter by 1")
        print("Counter is %d" % counter)
        increment_by_3_lock.release()

if __name__ == '__main__':
    t1 = threading.Thread(target=incrementer1)
    t2 = threading.Thread(target=incrementer2)
    t1.start()
    t2.start()

No matter how many times you run this code, the output would always be in the correct order:

$ python3 lock_example.py
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 3
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Greeter 1 incremented the counter by 1
Counter is 6
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 9
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Greeter 2 incremented the counter by 1
Counter is 12

Using the Threading module increases both the overhead associated with thread management as well as the complexity of the program, and that is why in many situations employing the multiprocessing module might be a better approach.

10.6.2 Multi-processing in Python

We already mentioned that multi-threading might not be sufficient in many applications and we might need to use multiprocessing sometimes, or better to say most of the time. That is why we are dedicating this subsection to this particular module. This module provides you with an API for spawning processes the way you spawn threads using the threading module. Moreover, there are some functionalities that are not even available in the threading module, e.g. the Pool class which allows you to run a batch of jobs using a pool of worker processes.

10.6.2.1 Process

Similar to the threading module, which was employing thread (aka _thread) under the hood, multiprocessing employs the Process class. Consider the following example:

frommultiprocessingimportProcess

importos

defgreeter(name):

proc_idx=os.getpid()

print("Process{0}:Hello{1}!".format(proc_idx,name))

if__name__=='__main__':

name_list=['Harry','George','Dirk','David']

process_list=[]

forname_idx,nameinenumerate(name_list):

current_process=Process(target=greeter,args=(name,))

process_list.append(current_process)

current_process.start()

forprocessinprocess_list:

process.join()

thattakesanameandgreetsthatperson.Italsoprintsthepid(processidentifier)oftheprocessthatisrunningit.Notethatweusedtheosmoduletogetthepid.Inthebottomofthecodeaftercheckingthe__name__='__main__'condition,wecreateaseriesofProcessesandstartthem.Finallyinthelastforloopandusingthejoinmethod,wetell Python towait for the processes to terminate. This is one of the possibleoutputsofthecode:

10.6.2.2 Pool

Consider the Pool class as a pool of worker processes. There are several ways of assigning jobs to the Pool class and we will introduce the most important ones in this section. These methods are categorized as blocking or non-blocking. The former means that, after calling the API, the call blocks the thread/process until the result or answer is ready, and control returns only when the call completes. With the non-blocking methods, on the other hand, control returns immediately.

10.6.2.2.1 Synchronous Pool.map()

We illustrate the Pool.map method by re-implementing our previous greeter example using Pool.map:

As you can see, we have seven names here but we do not want to dedicate each greeting to a separate process. Instead, we do the whole job of "greeting seven people" using three processes. We create a pool of three processes with the Pool(processes=3) syntax and then we map an iterable called names to the greeter function using pool.map(greeter, names). As we expected, the greetings in the output will be printed from three different processes:

$ python3 process_example.py
Process 23451: Hello Harry!
Process 23452: Hello George!
Process 23453: Hello Dirk!
Process 23454: Hello David!

from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    sync_map = pool.map(greeter, names)
    print("Done!")

Note that Pool.map() is in the blocking category and does not return control to your script until it is done calculating the results. That is why Done! is printed after all of the greetings are over.

10.6.2.2.2 Asynchronous Pool.map_async()

As the name implies, you can use the map_async method when you want to assign many function calls to a pool of worker processes asynchronously. Note that, unlike map, the order of the results is not guaranteed, and control returns immediately. We now implement the previous example using map_async:

As you probably noticed, the only difference (apart from the map_async method name) is calling the wait() method in the last line. The wait() method tells your script to wait for the result of map_async before terminating:

Note that the order of the results is not preserved. Moreover, Done! is printed before any of the results, meaning that if we do not use the wait() method, you probably will not see the results at all.
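A side note not covered in the original text: map_async returns an AsyncResult object, so instead of wait() you can call get(), which blocks like wait() but also hands back the return values. A minimal sketch (our addition, using a hypothetical square function):

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(processes=3)
    async_result = pool.map_async(square, [1, 2, 3, 4])
    # get() waits for all workers to finish and returns the list of return values
    print(async_result.get())  # [1, 4, 9, 16]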

$ python poolmap_example.py
Process 30585: Hello Jenna!
Process 30586: Hello David!
Process 30587: Hello Marry!
Process 30585: Hello Ted!
Process 30585: Hello Jerry!
Process 30587: Hello Tom!
Process 30585: Hello Justin!
Done!

from multiprocessing import Pool
import os

def greeter(name):
    pid = os.getpid()
    print("Process {0}: Hello {1}!".format(pid, name))

if __name__ == '__main__':
    names = ['Jenna', 'David', 'Marry', 'Ted', 'Jerry', 'Tom', 'Justin']
    pool = Pool(processes=3)
    async_map = pool.map_async(greeter, names)
    print("Done!")
    async_map.wait()

$ python poolmap_example.py
Done!
Process 30740: Hello Jenna!
Process 30741: Hello David!
Process 30740: Hello Ted!
Process 30742: Hello Marry!
Process 30740: Hello Jerry!
Process 30741: Hello Tom!
Process 30742: Hello Justin!

10.6.2.3 Locks

The way the multiprocessing module implements locks is almost identical to the way the threading module does. After importing Lock from multiprocessing, all you need to do is to acquire it, do some computation and then release the lock. We will clarify the use of Lock with an example in the next section about process communication.
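As a small addition that is not in the original text, the same lock can also be used as a context manager, which guarantees the release even if the code inside raises an exception. A minimal sketch:

from multiprocessing import Process, Lock

def worker(lock, label):
    # "with lock" acquires on entry and releases on exit,
    # equivalent to calling acquire() and release() explicitly
    with lock:
        print("%s is inside the critical section" % label)

if __name__ == '__main__':
    lock = Lock()
    processes = [Process(target=worker, args=(lock, name)) for name in ['A', 'B']]
    for p in processes:
        p.start()
    for p in processes:
        p.join()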

10.6.2.4 Process Communication

Process communication in multiprocessing is one of the most important, yet complicated, features of this module. As opposed to threading, Process objects do not have access to any shared variable by default, i.e. there is no shared memory space between the processes by default. This effect is illustrated in the following example:

You probably already noticed that this is almost identical to our example in the threading section. Now, take a look at the strange output:

As you can see, it is as if the processes do not see each other. Instead of having two processes, one counting to 6 and the other counting from 6 to 12, we have

from multiprocessing import Process, Lock, Value
import time

global counter
counter = 0

def incrementer1():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter 1: Counter is %d" % counter)

def incrementer2():
    global counter
    for j in range(2):
        for i in range(3):
            counter += 1
        print("Greeter 2: Counter is %d" % counter)

if __name__ == '__main__':
    t1 = Process(target=incrementer1)
    t2 = Process(target=incrementer2)
    t1.start()
    t2.start()

$ python communication_example.py
Greeter 1: Counter is 3
Greeter 1: Counter is 6
Greeter 2: Counter is 3
Greeter 2: Counter is 6

two processes counting to 6.

Nevertheless, there are several ways that Processes from multiprocessing can communicate with each other, including Pipe, Queue, Value, Array and Manager. Pipe and Queue are appropriate for inter-process message passing. To be more specific, Pipe is useful for process-to-process scenarios while Queue is more appropriate for processes-to-processes ones. Value and Array are both used to provide synchronized access to shared data (very much like shared memory), and Manager can be used on different data types. In the following subsections, we cover both Value and Array since they are both lightweight, yet useful, approaches.
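The original text only details Value and Array, but as a quick illustration of message passing (our addition), a Queue lets one process send picklable objects to another:

from multiprocessing import Process, Queue

def producer(queue):
    for i in range(3):
        queue.put("message %d" % i)
    queue.put(None)  # sentinel telling the consumer to stop

def consumer(queue):
    while True:
        item = queue.get()
        if item is None:
            break
        print("received:", item)

if __name__ == '__main__':
    q = Queue()
    p1 = Process(target=producer, args=(q,))
    p2 = Process(target=consumer, args=(q,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()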

10.6.2.4.1 Value

The following example re-implements the broken example from the previous section. We fix the strange output by using both Lock and Value:

The usage of the Lock object in this example is identical to the example in the threading section. The usage of counter is, on the other hand, the novel part. First, note that counter is not a global variable anymore; instead it is a Value, which returns a ctypes object allocated from memory shared between the processes. The first argument 'i' indicates a signed integer, and the second argument defines the

from multiprocessing import Process, Lock, Value
import time

increment_by_3_lock = Lock()

def incrementer1(counter):
    for j in range(3):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        print("Greeter 1: Counter is %d" % counter.value)
        increment_by_3_lock.release()

def incrementer2(counter):
    for j in range(3):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        print("Greeter 2: Counter is %d" % counter.value)
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    t1 = Process(target=incrementer1, args=(counter,))
    t2 = Process(target=incrementer2, args=(counter,))
    t2.start()
    t1.start()

initialization value. In this case, we are allocating a signed integer in the shared memory, initialized to 0, and assigning it to the counter variable. We then modified our two functions to take this shared variable as an argument. Finally, we changed the way we increment the counter, since counter is not a Python integer anymore but a ctypes signed integer whose value we access through the value attribute. The output of the code is now as we expected:
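Other ctypes type codes work the same way; for instance, 'd' allocates a shared double-precision float. A tiny sketch (our addition):

from multiprocessing import Value

total = Value('d', 0.0)   # shared double, initialized to 0.0
total.value += 0.5
print(total.value)        # 0.5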

The last example related to parallel processing illustrates the use of both Value and Array, as well as a technique to pass multiple arguments to a function. Note that the Process object does not accept multiple arguments for a function, and therefore we need this or a similar technique for passing multiple arguments. This technique can also be used when you want to pass multiple arguments to map or map_async:

$ python mp_lock_example.py
Greeter 2: Counter is 3
Greeter 2: Counter is 6
Greeter 1: Counter is 9
Greeter 1: Counter is 12

from multiprocessing import Process, Lock, Value, Array
import time
from ctypes import c_char_p

increment_by_3_lock = Lock()

def incrementer1(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.1)
        name_idx = counter.value // 3 - 1
        print("Greeter 1: Greeting {0}! Counter is {1}".format(names.value[name_idx], counter.value))
        increment_by_3_lock.release()

def incrementer2(counter_and_names):
    counter = counter_and_names[0]
    names = counter_and_names[1]
    for j in range(2):
        increment_by_3_lock.acquire(True)
        for i in range(3):
            counter.value += 1
            time.sleep(0.05)
        name_idx = counter.value // 3 - 1
        print("Greeter 2: Greeting {0}! Counter is {1}".format(names.value[name_idx], counter.value))
        increment_by_3_lock.release()

if __name__ == '__main__':
    counter = Value('i', 0)
    names = Array(c_char_p, 4)
    names.value = ['James', 'Tom', 'Sam', 'Larry']
    t1 = Process(target=incrementer1, args=((counter, names),))
    t2 = Process(target=incrementer2, args=((counter, names),))
    t2.start()
    t1.start()

In this example we created a multiprocessing.Array() object and assigned it to a variable called names. As we mentioned before, the first argument is the ctype data type, and since we want to create an array of strings of length 4 (the second argument), we imported c_char_p and passed it as the first argument.

Instead of passing the arguments separately, we merged both the Value and Array objects into a tuple and passed the tuple to the functions. We then modified the functions to unpack the objects in their first two lines. Finally, we changed the print statement in a way that each process greets a particular name. The output of the example is:

10.7 DASK ☁

Dask is a Python-based parallel computing library for analytics. Parallel computing is a type of computation in which many calculations or the execution of processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved concurrently.

Dask is composed of two components:

1. Dynamic task scheduling optimized for computation. This is similar to Airflow, Luigi, Celery, or Make, but optimized for interactive computational workloads.

2. Big Data collections like parallel arrays, dataframes, and lists that extend common interfaces like NumPy, Pandas, or Python iterators to larger-than-memory or distributed environments. These parallel collections run on top of the dynamic task schedulers.

Dask emphasizes the following virtues:

Familiar: Provides parallelized NumPy array and Pandas DataFrame objects.

Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.

$ python3 mp_lock_example.py
Greeter 2: Greeting James! Counter is 3
Greeter 2: Greeting Tom! Counter is 6
Greeter 1: Greeting Sam! Counter is 9
Greeter 1: Greeting Larry! Counter is 12

Native: Enables distributed computing in Pure Pythonwith access to thePyDatastack.Fast:Operateswith lowoverhead, low latency, andminimal serializationnecessaryforfastnumericalalgorithmsScalesup:Runsresilientlyonclusterswith1000sofcoresScalesdown:TrivialtosetupandrunonalaptopinasingleprocessResponsive:Designedwithinteractivecomputinginminditprovidesrapidfeedbackanddiagnosticstoaidhumans

The section is structured in a number of subsections addressing the followingtopics:

Foundations:

anexplanationofwhatDaskis,howitworks,andhowtouselowerlevelprimitives to set up computations. Casual users may wish to skip thissection,althoughweconsideritusefulknowledgeforallusers.

DistributedFeatures:

information on runningDask on the distributed scheduler,which enablesscale-uptodistributedsettingsandenhancedmonitoringoftaskoperations.The distributed scheduler is now generally the recommended engine forexecutingtaskwork,evenonsingleworkstationsorlaptops.

Collections:

convenientabstractionsgivingafamiliarfeeltobigdata.

Bags:

Pythoniteratorswithafunctionalparadigm,suchasfoundinfunc/iter-toolsand toolz - generalize lists/generators to big data; this will seem veryfamiliartousersofPySpark’sRDD

Array:

massivemulti-dimensionalnumericaldata,withNumpyfunctionality

Dataframe:

massivetabulardata,withPandasfunctionality

10.7.1 How Dask Works

Dask is a computation tool for larger-than-memory datasets, parallel execution or delayed/background execution.

We can summarize the basics of Dask as follows:

process data that does not fit into memory by breaking it into blocks and specifying task chains
parallelize execution of tasks across cores and even nodes of a cluster
move computation to the data rather than the other way around, to minimize communication overhead

We use for-loops to build basic tasks, Python iterators, and the NumPy (array) and Pandas (dataframe) functions for multi-dimensional or tabular data, respectively.

Dask allows us to construct a prescription for the calculation we want to carry out. A module named dask.delayed lets us parallelize custom code. It is useful whenever our problem doesn't quite fit a high-level parallel object like dask.array or dask.dataframe but could still benefit from parallelism. Dask.delayed works by delaying our function evaluations and putting them into a dask graph. Here is a small example:

Here we have used the delayed annotation to show that we want these functions to operate lazily - to save the set of inputs and execute only on demand.

10.7.2 Dask Bag

from dask import delayed

@delayed
def inc(x):
    return x + 1

@delayed
def add(x, y):
    return x + y
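To make the laziness concrete, the following short sketch (our addition) combines the delayed functions defined above and only runs them when compute() is called:

x = inc(1)           # nothing is executed yet; a delayed task is recorded
y = inc(2)
z = add(x, y)        # builds a small task graph: add(inc(1), inc(2))
print(z.compute())   # executes the graph now and prints 5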

Dask-bag excels in processing data that can be represented as a sequence of arbitrary inputs. We'll refer to this as "messy" data, because it can contain complex nested structures, missing fields, mixtures of data types, etc. The functional programming style fits very nicely with standard Python iteration, such as can be found in the itertools module.

Messy data is often encountered at the beginning of data processing pipelines when large volumes of raw data are first consumed. The initial set of data might be JSON, CSV, XML, or any other format that does not enforce strict structure and data types. For this reason, the initial data massaging and processing is often done with Python lists, dicts, and sets.

These core data structures are optimized for general-purpose storage and processing. Adding streaming computation with iterators/generator expressions or libraries like itertools or toolz lets us process large volumes in a small space. If we combine this with parallel processing then we can churn through a fair amount of data.
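As a tiny illustration of that streaming style (our addition), a generator expression combined with itertools processes an unbounded sequence lazily, one element at a time:

import itertools

# squares of even numbers, computed lazily and never materialized as a whole
squares = (n * n for n in itertools.count() if n % 2 == 0)
print(list(itertools.islice(squares, 5)))   # [0, 4, 16, 36, 64]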

Dask.bag is a high-level Dask collection to automate common workloads of this form. In a nutshell:

You can create a Bag from a Python sequence, from files, from data on S3, etc.

Bag objects hold the standard functional API found in projects like the Python standard library, toolz, or pyspark, including map, filter, groupby, etc.

As with Array and DataFrame objects, operations on Bag objects create new bags. Call the .compute() method to trigger execution.

dask.bag = map, filter, toolz + parallel execution

# each element is an integer
import dask.bag as db
b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# each element is a text file of JSON lines
import os
b = db.read_text(os.path.join('data', 'accounts.*.json.gz'))

# requires the `s3fs` library
# each element is a remote CSV text file
b = db.read_text('s3://dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv')

def is_even(n):
    return n % 2 == 0

b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

For more details on Dask Bag check https://dask.pydata.org/en/latest/bag.html

10.7.3 Concurrency Features

Dask supports a real-time task framework that extends Python's concurrent.futures interface. This interface is good for arbitrary task scheduling, like dask.delayed, but is immediate rather than lazy, which provides some more flexibility in situations where the computations may evolve over time. These features depend on the second-generation task scheduler found in dask.distributed (which, despite its name, runs very well on a single machine).

Dask allows us to simply construct graphs of tasks with dependencies. Such graphs can also be created automatically for us using functional, NumPy or Pandas syntax on data collections. None of this would be very useful if there weren't also a way to execute these graphs in a parallel and memory-aware way. Dask comes with four available schedulers:

dask.threaded.get: a scheduler backed by a thread pool
dask.multiprocessing.get: a scheduler backed by a process pool
dask.async.get_sync: a synchronous scheduler, good for debugging
distributed.Client.get: a distributed scheduler for executing graphs on multiple machines.
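The get functions listed above reflect an older Dask API; in more recent Dask versions the scheduler is usually chosen per call with the scheduler= keyword. A hedged sketch (our addition):

import dask.bag as db

b = db.from_sequence(range(10))
squares = b.map(lambda x: x * x)

print(squares.compute(scheduler='threads'))      # thread pool
print(squares.compute(scheduler='synchronous'))  # single-threaded, handy for debugging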

Here is a simple program for the dask.distributed library:

For more details on Concurrent Features by Dask check https://dask.pydata.org/en/latest/futures.html

10.7.4 Dask Array

c = b.filter(is_even).map(lambda x: x ** 2)
c

# blocking form: wait for completion (which is very fast in this case)
c.compute()

from dask.distributed import Client

client = Client('scheduler:port')

futures = []
for fn in filenames:
    future = client.submit(load, fn)
    futures.append(future)

summary = client.submit(summarize, futures)
summary.result()

Dask arrays implement a subset of the NumPy interface on large arrays using blocked algorithms and task scheduling. These behave like numpy arrays, but break a massive job into tasks that are then executed by a scheduler. The default scheduler uses threading, but you can also use multiprocessing or distributed or even serial processing (mainly for debugging). You can tell the dask array how to break the data into chunks for processing.
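As a self-contained sketch that does not require an HDF5 file (our addition), a dask array can also be created from an in-memory NumPy array with an explicit chunk size:

import numpy as np
import dask.array as da

data = np.arange(1000000).reshape(1000, 1000)
x = da.from_array(data, chunks=(250, 250))  # 16 blocks of 250 x 250
print(x.mean().compute())                   # runs the blocked computation: 499999.5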

For more details on Dask Array check https://dask.pydata.org/en/latest/array.html

10.7.5 Dask DataFrame

A Dask DataFrame is a large parallel dataframe composed of many smaller Pandas dataframes, split along the index. These Pandas dataframes may live on disk for larger-than-memory computing on a single machine, or on many different machines in a cluster. Dask.dataframe implements a commonly used subset of the Pandas interface, including elementwise operations, reductions, grouping operations, joins, time series algorithms, and more. It copies the Pandas interface for these operations exactly and so should be very familiar to Pandas users. Because Dask.dataframe operations merely coordinate Pandas operations, they usually exhibit similar performance characteristics to those found in Pandas. To run the following code, save a 'student.csv' file on your machine.
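The original does not show the contents of student.csv; the snippet below (our assumption, with invented rows chosen to roughly reproduce the printed result) creates a small file so the examples that follow can be executed:

import pandas as pd

# hypothetical sample data: only the column names HID and Serial_No are
# taken from the example below, the values themselves are invented
sample = pd.DataFrame({
    'HID':       [101, 102, 104, 105, 106, 107, 109, 111, 201, 202],
    'Serial_No': [  1,   2,   3,   4,   5,   6,   7,   8,   9,  10],
})
sample.to_csv('student.csv', index=False)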

import h5py
import dask.array as da

f = h5py.File('myfile.hdf5')
x = da.from_array(f['/big-data'], chunks=(1000, 1000))
x - x.mean(axis=1).compute()

import pandas as pd

df = pd.read_csv('student.csv')
d = df.groupby(df.HID).Serial_No.mean()
print(d)

ID
101     1
102     2
104     3
105     4
106     5
107     6
109     7
111     8
201     9
202    10
Name: Serial_No, dtype: int64

import dask.dataframe as dd

df = dd.read_csv('student.csv')
dt = df.groupby(df.HID).Serial_No.mean().compute()
print(dt)

ID
101     1.0

For more details on Dask DataFrame check https://dask.pydata.org/en/latest/dataframe.html

10.7.6 Dask DataFrame Storage

Efficient storage can dramatically improve performance, particularly when operating repeatedly from disk.

Decompressing text and parsing CSV files is expensive. One of the most effective strategies with medium data is to use a binary storage format like HDF5.

Create data if we don't have any.

First we read our csv data as before.

CSV and other text-based file formats are the most common storage format for data from many sources, because they require minimal pre-processing, can be written line-by-line and are human-readable. Since Pandas' read_csv is well-optimized, CSVs are a reasonable input, but far from optimal, since reading requires extensive text parsing.

HDF5 and netCDF are binary array formats very commonly used in the scientific realm.

102     2.0
104     3.0
105     4.0
106     5.0
107     6.0
109     7.0
111     8.0
201     9.0
202    10.0
Name: Serial_No, dtype: float64

# be sure to shut down other kernels running distributed clients
from dask.distributed import Client
client = Client()

from prep import accounts_csvs
accounts_csvs(3, 1000000, 500)

import os
filename = os.path.join('data', 'accounts.*.csv')
filename

import dask.dataframe as dd
df_csv = dd.read_csv(filename)
df_csv.head()

Pandas contains a specialized HDF5 format, HDFStore. The dd.DataFrame.to_hdf method works exactly like the pd.DataFrame.to_hdf method.

For more information on Dask DataFrame Storage, click http://dask.pydata.org/en/latest/dataframe-create.html

10.7.7 Links

https://dask.pydata.org/en/latest/
http://matthewrocklin.com/blog/work/2017/10/16/streaming-dataframes-1
http://people.duke.edu/~ccc14/sta-663-2017/18A_Dask.html
https://www.kdnuggets.com/2016/09/introducing-dask-parallel-programming.html
https://pypi.python.org/pypi/dask/
https://www.hdfgroup.org/2015/03/hdf5-as-a-zero-configuration-ad-hoc-scientific-database-for-python/
https://github.com/dask/dask-tutorial

target = os.path.join('data', 'accounts.h5')
target

%time df_csv.to_hdf(target, '/data')

df_hdf = dd.read_hdf(target, '/data')
df_hdf.head()

11 APPLICATIONS

11.1 FINGERPRINT MATCHING ☁

Please note that NIST has temporarily removed the Fingerprint dataset. We unfortunately do not have a copy of the dataset. If you have one, please notify us.

Python is a flexible and popular language for running data analysis pipelines. In this section we will implement a solution for fingerprint matching.

11.1.1 Overview

Fingerprint recognition refers to the automated method of verifying a match between two fingerprints and is used to identify individuals and verify their identity. Fingerprints (Figure 35) are the most widely used form of biometric used to identify individuals.

Figure 35: Fingerprints

Automated fingerprint matching generally requires the detection of different fingerprint features (aggregate characteristics of ridges, and minutia points) and then the use of a fingerprint matching algorithm, which can do both one-to-one and one-to-many matching operations. Based on the number of matches, a proximity score (distance or similarity) can be calculated.

We use the following NIST dataset for the study:

Special Database 14 - NIST Mated Fingerprint Card Pairs 2. (http://www.nist.gov/itl/iad/ig/special_dbases.cfm)

11.1.2 Objectives

Match the fingerprint images from a probe set to a gallery set and report the match scores.

11.1.3 Prerequisites

For this work we will use the following algorithms:

MINDTCT: The NIST minutiae detector, which automatically locates and records ridge endings and bifurcations in a fingerprint image. (http://www.nist.gov/itl/iad/ig/nbis.cfm)

BOZORTH3: A NIST fingerprint matching algorithm, which is a minutiae-based fingerprint-matching algorithm. It can do both one-to-one and one-to-many matching operations. (http://www.nist.gov/itl/iad/ig/nbis.cfm)

In order to follow along, you must have the NBIS tools, which provide mindtct and bozorth3, installed. If you are on Ubuntu 16.04 Xenial, the following steps will accomplish this:

11.1.4 Implementation

1. Fetch the fingerprint images from the web
2. Call out to external programs to prepare and compute the match scores
3. Store the results in a database
4. Generate a plot to identify likely matches.

$ sudo apt-get update -qq
$ sudo apt-get install -y build-essential cmake unzip
$ wget "http://nigos.nist.gov:8080/nist/nbis/nbis_v5_0_0.zip"
$ unzip -d nbis nbis_v5_0_0.zip
$ cd nbis/Rel_5.0.0
$ ./setup.sh /usr/local --without-X11
$ sudo make

We will be interacting with the operating system and manipulating files and their pathnames.

Some generally useful utilities:

Using the attrs library provides some nice shortcuts to defining objects.

We will be randomly dividing the entire dataset, based on user input, into the probe and gallery sets.

We will need to call out to the NBIS software. We will also be using multiple processes to take advantage of all the cores on our machine.

As for plotting, we will use matplotlib, though there are many alternatives.

Finally, we will write the results to a database.

11.1.5 Utility functions

Next, we will define some utility functions:

from __future__ import print_function
import urllib
import zipfile
import hashlib
import os.path
import os
import sys
import shutil
import tempfile
import itertools
import functools
import types
from pprint import pprint

import attr
import sys
import random
import subprocess
import multiprocessing

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sqlite3

11.1.6 Dataset

We will now define some global parameters.

First, the fingerprint dataset:

def take(n, iterable):
    "Returns a generator of the first **n** elements of an iterable"
    return itertools.islice(iterable, n)

def zipWith(function, *iterables):
    "Zip a set of **iterables** together and apply **function** to each tuple"
    for group in itertools.izip(*iterables):
        yield function(*group)

def uncurry(function):
    "Transforms an N-ary **function** so that it accepts a single parameter of an N-tuple"
    @functools.wraps(function)
    def wrapper(args):
        return function(*args)
    return wrapper

def fetch_url(url, sha256, prefix='.', checksum_blocksize=2**20, dryRun=False):
    """Download a url.

    :param url: the url to the file on the web
    :param sha256: the SHA-256 checksum. Used to determine if the file was previously downloaded.
    :param prefix: directory to save the file
    :param checksum_blocksize: blocksize to use when computing the checksum
    :param dryRun: boolean indicating that calling this function should do nothing
    :returns: the local path to the downloaded file
    :rtype:

    """

    if not os.path.exists(prefix):
        os.makedirs(prefix)

    local = os.path.join(prefix, os.path.basename(url))

    if dryRun: return local

    if os.path.exists(local):
        print('Verifying checksum')
        chk = hashlib.sha256()
        with open(local, 'rb') as fd:
            while True:
                bits = fd.read(checksum_blocksize)
                if not bits: break
                chk.update(bits)
        if sha256 == chk.hexdigest():
            return local

    print('Downloading', url)

    def report(sofar, blocksize, totalsize):
        msg = '{}%\r'.format(100 * sofar * blocksize / totalsize, 100)
        sys.stderr.write(msg)

    urllib.urlretrieve(url, local, report)
    return local

DATASET_URL = 'https://s3.amazonaws.com/nist-srd/SD4/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip'
DATASET_SHA256 = '4db6a8f3f9dc14c504180cbf67cdf35167a109280f121c901be37a80ac13c449'

We'll define how to download the dataset. This function is general enough that it could be used to retrieve most files, but we will default it to use the values defined previously.

11.1.7 Data Model

We will define some classes so we have a nice API for working with the dataflow. We set slots=True so that the resulting objects will be more space-efficient.
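For readers unfamiliar with the attrs library, here is a minimal sketch (our addition, not part of the pipeline) of what @attr.s(slots=True) generates for a class:

import attr

@attr.s(slots=True)
class Point(object):
    x = attr.ib()
    y = attr.ib()

p = Point(x=1, y=2)
print(p)        # Point(x=1, y=2) -- __init__ and __repr__ are generated for us
# p.z = 3       # would raise AttributeError, because slots=True forbids new attributes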

11.1.7.1 Utilities

11.1.7.1.1 Checksum

The checksum consists of the actual hash value (value) as well as a string representing the hashing algorithm. The validator enforces that the algorithm can only be one of the listed acceptable methods.

def prepare_dataset(url=None, sha256=None, prefix='.', skip=False):
    url = url or DATASET_URL
    sha256 = sha256 or DATASET_SHA256
    local = fetch_url(url, sha256=sha256, prefix=prefix, dryRun=skip)

    if not skip:
        print('Extracting', local, 'to', prefix)
        with zipfile.ZipFile(local, 'r') as zip:
            zip.extractall(prefix)

    name, _ = os.path.splitext(local)
    return name

def locate_paths(path_md5list, prefix):
    with open(path_md5list) as fd:
        for line in itertools.imap(str.strip, fd):
            parts = line.split()
            if not len(parts) == 2: continue
            md5sum, path = parts
            chksum = Checksum(value=md5sum, kind='md5')
            filepath = os.path.join(prefix, path)
            yield Path(checksum=chksum, filepath=filepath)

def locate_images(paths):

    def predicate(path):
        _, ext = os.path.splitext(path.filepath)
        return ext in ['.png']

    for path in itertools.ifilter(predicate, paths):
        yield image(id=path.checksum.value, path=path)

@attr.s(slots=True)
class Checksum(object):
    value = attr.ib()
    kind = attr.ib(validator=lambda o, a, v: v in 'md5 sha1 sha224 sha256 sha384 sha512'.split())

11.1.7.1.2 Path

Paths refer to an image's file path and associated Checksum. We get the checksum "for free" since the MD5 hash is provided for each image in the dataset.

11.1.7.1.3 Image

The start of the data pipeline is the image. An image has an id (the md5 hash) and the path to the image.

11.1.7.2 Mindtct

The next step in the pipeline is to apply the mindtct program from NBIS. A mindtct object therefore represents the results of applying mindtct to an image. The xyt output is needed for the next step, and the image attribute represents the image id.

We need a way to construct a mindtct object from an image object. A straightforward way of doing this would be to have a from_image @staticmethod or @classmethod, but that doesn't work well with multiprocessing, as top-level functions work best because they need to be serialized.
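The underlying reason is that multiprocessing pickles the callable it sends to the worker processes, and only top-level functions pickle cleanly. A minimal sketch of the constraint (our addition):

import multiprocessing

def double(x):          # top-level function: picklable, safe to use with pool.map
    return 2 * x

if __name__ == '__main__':
    pool = multiprocessing.Pool(2)
    print(pool.map(double, [1, 2, 3]))   # [2, 4, 6]
    # a lambda or a function defined inside another function would fail here,
    # because the default pickler cannot serialize it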

@attr.s(slots=True)
class Path(object):
    checksum = attr.ib()
    filepath = attr.ib()

@attr.s(slots=True)
class image(object):
    id = attr.ib()
    path = attr.ib()

@attr.s(slots=True)
class mindtct(object):
    image = attr.ib()
    xyt = attr.ib()

    def pretty(self):
        d = dict(id=self.image.id, path=self.image.path)
        return pprint(d)

def mindtct_from_image(image):
    imgpath = os.path.abspath(image.path.filepath)
    tempdir = tempfile.mkdtemp()
    oroot = os.path.join(tempdir, 'result')

    cmd = ['mindtct', imgpath, oroot]

    try:
        subprocess.check_call(cmd)

        with open(oroot + '.xyt') as fd:
            xyt = fd.read()

        result = mindtct(image=image.id, xyt=xyt)

11.1.7.3 Bozorth3

The final step in the pipeline is running bozorth3 from NBIS. The bozorth3 class represents the match being done: tracking the ids of the probe and gallery images as well as the match score.

Since we will be writing these instances out to a database, we provide some static methods for SQL statements. While there are many Object-Relational-Model (ORM) libraries available for Python, this approach keeps the current implementation simple.

In order to work well with multiprocessing, we define a class representing the input parameters to bozorth3 and a helper function to run bozorth3. This way the pipeline definition can be kept simple: a map to create the input and then a map to run the program.

As NBIS bozorth3 can be called to compare one-to-one or one-to-many, we will also dynamically choose between these approaches depending on whether the gallery attribute is a list or a single object.

        return result

    finally:
        shutil.rmtree(tempdir)

@attr.s(slots=True)
class bozorth3(object):
    probe = attr.ib()
    gallery = attr.ib()
    score = attr.ib()

    @staticmethod
    def sql_stmt_create_table():
        return 'CREATE TABLE IF NOT EXISTS bozorth3' \
               + ' (probe TEXT, gallery TEXT, score NUMERIC)'

    @staticmethod
    def sql_prepared_stmt_insert():
        return 'INSERT INTO bozorth3 VALUES (?, ?, ?)'

    def sql_prepared_stmt_insert_values(self):
        return self.probe, self.gallery, self.score

@attr.s(slots=True)
class bozorth3_input(object):
    probe = attr.ib()
    gallery = attr.ib()

    def run(self):
        if isinstance(self.gallery, mindtct):
            return bozorth3_from_one_to_one(self.probe, self.gallery)
        elif isinstance(self.gallery, types.ListType):
            return bozorth3_from_one_to_many(self.probe, self.gallery)
        else:
            raise ValueError('Unhandled type for gallery: {}'.format(type(self.gallery)))

Next is the top-level function for running bozorth3. It accepts an instance of bozorth3_input. It is implemented as a simple top-level wrapper so that it can be easily passed to the multiprocessing library.

11.1.7.3.1 Running Bozorth3

There are two cases to handle:

1. One-to-one probe to gallery sets
2. One-to-many probe to gallery sets

Both approaches are implemented next. The implementations follow the same pattern:

1. Create a temporary directory in which to work
2. Write the probe and gallery images to files in the temporary directory
3. Call the bozorth3 executable
4. The match score is written to stdout, which is captured and then parsed
5. Return a bozorth3 instance for each match
6. Make sure to clean up the temporary directory

11.1.7.3.1.1 One-to-one

11.1.7.3.1.2 One-to-many

def run_bozorth3(input):
    return input.run()

def bozorth3_from_one_to_one(probe, gallery):
    tempdir = tempfile.mkdtemp()
    probeFile = os.path.join(tempdir, 'probe.xyt')
    galleryFile = os.path.join(tempdir, 'gallery.xyt')

    with open(probeFile, 'wb') as fd: fd.write(probe.xyt)
    with open(galleryFile, 'wb') as fd: fd.write(gallery.xyt)

    cmd = ['bozorth3', probeFile, galleryFile]

    try:
        result = subprocess.check_output(cmd)
        score = int(result.strip())
        return bozorth3(probe=probe.image, gallery=gallery.image, score=score)
    finally:
        shutil.rmtree(tempdir)

def bozorth3_from_one_to_many(probe, galleryset):
    tempdir = tempfile.mkdtemp()
    probeFile = os.path.join(tempdir, 'probe.xyt')
    galleryFiles = [os.path.join(tempdir, 'gallery%d.xyt' % i)
                    for i, _ in enumerate(galleryset)]

    with open(probeFile, 'wb') as fd: fd.write(probe.xyt)
    for galleryFile, gallery in itertools.izip(galleryFiles, galleryset):
        with open(galleryFile, 'wb') as fd: fd.write(gallery.xyt)

    cmd = ['bozorth3', '-p', probeFile] + galleryFiles

    try:
        result = subprocess.check_output(cmd).strip()
        scores = map(int, result.split('\n'))

11.1.8 Plotting

For plotting we will operate only on the database. We will select a small number of probe images and plot the score between them and the rest of the gallery images.

The mk_short_labels helper function will be defined next.

The image ids are long hash strings. In order to minimize the amount of space the labels occupy on the figure, we provide a helper function to create a short label that still uniquely identifies each probe image in the selected sample.

11.1.9 Putting it all Together

First, set up a temporary directory in which to work:

Next we download and extract the fingerprint images from NIST:

        return [bozorth3(probe=probe.image, gallery=gallery.image, score=score)
                for score, gallery in zip(scores, galleryset)]
    finally:
        shutil.rmtree(tempdir)

def plot(dbfile, nprobes=10):
    conn = sqlite3.connect(dbfile)
    results = pd.read_sql(
        "SELECT DISTINCT probe FROM bozorth3 ORDER BY score LIMIT '%s'" % nprobes,
        con=conn
    )
    shortlabels = mk_short_labels(results.probe)
    plt.figure()

    for i, probe in results.probe.iteritems():
        stmt = 'SELECT gallery, score FROM bozorth3 WHERE probe = ? ORDER BY gallery DESC'
        matches = pd.read_sql(stmt, params=(probe,), con=conn)
        xs = np.arange(len(matches), dtype=np.int)
        plt.plot(xs, matches.score, label='probe %s' % shortlabels[i])

    plt.ylabel('Score')
    plt.xlabel('Gallery')
    plt.legend(bbox_to_anchor=(0, 0, 1, -0.2))
    plt.show()

def mk_short_labels(series, start=7):
    for size in xrange(start, len(series[0])):
        if len(series) == len(set(map(lambda s: s[:size], series))):
            break
    return map(lambda s: s[:size], series)
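As a quick illustration with made-up ids (our addition), mk_short_labels keeps lengthening the prefix, starting at 7 characters, until every shortened label is unique:

ids = ['aaaaaaaa1111', 'aaaaaaaa2222', 'bbbbbbbb3333']
print(mk_short_labels(ids))   # ['aaaaaaaa1', 'aaaaaaaa2', 'bbbbbbbb3']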

pool = multiprocessing.Pool()

prefix = '/tmp/fingerprint_example/'
if not os.path.exists(prefix):
    os.makedirs(prefix)

%%time
dataprefix = prepare_dataset(prefix=prefix)

Next we will configure the location of the MD5 checksum file that comes with the download.

Load the images from the downloaded files to start the analysis pipeline.

We can examine one of the loaded images. Note that image refers to the MD5 checksum that came with the image and the xyt attribute represents the raw image data.

For example purposes we will only use a small percentage of the database, randomly selected, for our probe and gallery datasets.

We can now compute the matching scores between the probe and gallery sets. This will use all cores available on this workstation.

Verifying checksum
Extracting /tmp/fingerprint_example/NISTSpecialDatabase4GrayScaleImagesofFIGS.zip to /tmp/fingerprint_example/
CPU times: user 3.34 s, sys: 645 ms, total: 3.99 s
Wall time: 4.01 s

md5listpath = os.path.join(prefix, 'NISTSpecialDatabase4GrayScaleImagesofFIGS/sd04/sd04_md5.lst')

%%time
print('Loading images')
paths = locate_paths(md5listpath, dataprefix)
images = locate_images(paths)
mindtcts = pool.map(mindtct_from_image, images)
print('Done')

Loading images
Done
CPU times: user 187 ms, sys: 17 ms, total: 204 ms
Wall time: 1min 21s

print(mindtcts[0].image)
print(mindtcts[0].xyt[:50])

98b15d56330cb17f1982ae79348f711d
141462146252382237255118020
30332214

perc_probe = 0.001
perc_gallery = 0.1

%%time
print('Generating samples')
probes  = random.sample(mindtcts, int(perc_probe   * len(mindtcts)))
gallery = random.sample(mindtcts, int(perc_gallery * len(mindtcts)))
print('|Probes| =', len(probes))
print('|Gallery| =', len(gallery))

Generating samples
|Probes| = 4
|Gallery| = 400
CPU times: user 2 ms, sys: 0 ns, total: 2 ms
Wall time: 993 µs

%%time
print('Matching')
input = [bozorth3_input(probe=probe, gallery=gallery)
         for probe in probes]
bozorth3s = pool.map(run_bozorth3, input)

bozorth3s is now a list of lists of bozorth3 instances.

Now add the results to the database.

We now plot the results. See Figure 36.

Matching
CPU times: user 19 ms, sys: 1 ms, total: 20 ms
Wall time: 1.07 s

print('|Probes| =', len(bozorth3s))
print('|Gallery| =', len(bozorth3s[0]))
print('Result:', bozorth3s[0][0])

|Probes| = 4
|Gallery| = 400
Result: bozorth3(probe='caf9143b268701416fbed6a9eb2eb4cf', gallery='22fa0f24998eaea39dea152e4a73f267', score=4)

dbfile = os.path.join(prefix, 'scores.db')
conn = sqlite3.connect(dbfile)
cursor = conn.cursor()
cursor.execute(bozorth3.sql_stmt_create_table())

<sqlite3.Cursor at 0x7f8a2f677490>

%%time
for group in bozorth3s:
    vals = map(bozorth3.sql_prepared_stmt_insert_values, group)
    cursor.executemany(bozorth3.sql_prepared_stmt_insert(), vals)
    conn.commit()
    print('Inserted results for probe', group[0].probe)

Inserted results for probe caf9143b268701416fbed6a9eb2eb4cf
Inserted results for probe 55ac57f711eba081b9302eab74dea88e
Inserted results for probe 4ed2d53db3b5ab7d6b216ea0314beb4f
Inserted results for probe 20f68849ee2dad02b8fb33ecd3ece507
CPU times: user 2 ms, sys: 3 ms, total: 5 ms
Wall time: 3.57 ms

plot(dbfile, nprobes=len(probes))

Figure 36: Result

11.2 NIST PEDESTRIAN AND FACE DETECTION ☁


Pedestrian and Face Detection uses OpenCV to identify people standing in a picture or a video, and the NIST use case in this document is built with Apache Spark and Mesos clusters on multiple compute nodes.

The example in this tutorial deploys software packages on OpenStack using Ansible with its roles. See Figure 37, Figure 38, Figure 39, Figure 40.

cursor.close()

Figure 37: Original

Figure 38: Pedestrian Detected

Figure 39: Original

Figure 40: Pedestrian and Face/eyes Detected

11.2.0.1 Introduction

Human (pedestrian) detection and face detection have been studied during the last several years and models for them have improved along with Histograms of Oriented Gradients (HOG) for Human Detection [1]. OpenCV is a Computer Vision library including the SVM classifier and the HOG object detector for pedestrian detection, and the INRIA Person Dataset [2] is one of the popular samples for both training and testing purposes. In this document, we deploy Apache Spark on Mesos clusters to train and apply detection models from OpenCV using the Python API.

11.2.0.1.1 INRIA Person Dataset

This dataset contains positive and negative images for training and test purposes with annotation files for upright persons in each image. 288 positive test images, 453 negative test images, 614 positive training images and 1218 negative training images are included, along with normalized 64x128 pixel formats. A 970MB dataset is available to download [3].

11.2.0.1.2 HOG with SVM model

Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM) are used as object detectors and classifiers, and built-in Python libraries from OpenCV provide these models for human detection.

11.2.0.1.3 Ansible Automation Tool

Ansible is a Python tool to install/configure/manage software on multiple machines with JSON files where system descriptions are defined. There are reasons why we use Ansible:

Expandable: Leverages Python (default) but modules can be written in any language

Agentless: no setup required on the managed node

Security: Allows deployment from user space; uses ssh for authentication

Flexibility: only requires ssh access to a privileged user

Transparency: YAML-based script files express the steps of installing and configuring software

Modularity: a single Ansible Role (should) contain all required commands and variables to deploy a software package independently

Sharing and portability: roles are available from source (github, bitbucket, gitlab, etc) or the Ansible Galaxy portal

We use Ansible roles to install software packages for Human and Face Detection, which requires running OpenCV Python libraries on Apache Mesos with a cluster configuration. The dataset is also downloaded from the web using an Ansible role.

11.2.0.2 Deployment by Ansible

Ansible is used to deploy applications and build clusters for batch-processing large datasets on target machines, e.g. VM instances on OpenStack, and we use Ansible roles with the include directive to organize layers of big data software stacks (BDSS). Ansible provides abstractions by Playbook Roles and reusability by include statements. We define application X in Ansible Role X, for example, and use include statements to combine it with other applications, e.g. Y or Z. The layers exist in sub directories (see next) to add modularity to your Ansible deployment. For example, there are five roles used in this example: Apache Mesos in a scheduler layer, Apache Spark in a processing layer, the OpenCV library in an application layer, the INRIA Person Dataset in a dataset layer and a Python script for human and face detection in an analytics layer. If you have an additional software package to add, you can simply add a new role in the main Ansible playbook with an include directive. With this, your Ansible playbook stays simple but flexible enough to add more roles without having a single large file, which gets difficult to read when it deploys more applications on multiple layers. The main Ansible playbook runs the Ansible roles in order, which looks like:

Directory names, e.g. sched, proc, data, or anlys, indicate BDSS layers:

- sched: scheduler layer
- proc: data processing layer
- apps: application layer
- data: dataset layer
- anlys: analytics layer

The two digits in the file name indicate the order in which the roles are run.

11.2.0.3 Cloudmesh for Provisioning

It is assumed that the virtual machines are created by cloudmesh, the cloud management software. For example, on OpenStack the command

cm cluster create -N=6

starts a set of virtual machine instances. The number of machines and the groups for clusters, e.g. namenodes and datanodes, are defined in the Ansible inventory file, a list of target machines with groups, which will be generated once the machines are ready to use by cloudmesh. Ansible roles install software and datasets on the virtual clusters after that stage.

11.2.0.4 Roles Explained for Installation

The Mesos role is installed first as a scheduler layer for masters and slaves, where mesos-master runs on the masters group and mesos-slave runs on the slaves group. Apache Zookeeper is included in the Mesos role, therefore Mesos slaves find an elected Mesos leader for coordination. Spark, as a data processing

```
include: sched/00-mesos.yml
include: proc/01-spark.yml
include: apps/02-opencv.yml
include: data/03-inria-dataset.yml
include: anlys/04-human-face-detection.yml
```

layer, provides two options for distributed job processing: batch job processing via a cluster mode and real-time processing via a client mode. The Mesos dispatcher runs on the masters group to accept a batch job submission, and the Spark interactive shell, which is the client mode, provides real-time processing on any node in the cluster. Either way, Spark is installed later to detect a master (leader) host for a job submission. Other roles for OpenCV, the INRIA Person Dataset and the Human and Face Detection Python application follow.

The following software is expected in the stacks according to the github:

mesos cluster (master, worker)

spark (with dispatcher for mesos cluster mode)

openCV

zookeeper

INRIA Person Dataset

Detection Analytics in Python

[1] Dalal, Navneet, and Bill Triggs. "Histograms of oriented gradients for human detection." 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE, 2005. [pdf]

[2] http://pascal.inrialpes.fr/data/human/

[3] ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar

[4] https://docs.python.org/2/library/configparser.html

11.2.0.4.1 Server groups for Masters/Slaves by Ansible inventory

We may separate compute nodes into two groups, masters and workers, so that Mesos masters and zookeeper quorums manage job requests and leaders, while workers run the actual tasks. Ansible needs group definitions in its inventory so that the software installation associated with the proper part can be completed.

Example of an Ansible Inventory file (inventory.txt):

11.2.0.5 Instructions for Deployment

The following commands complete the NIST Pedestrian and Face Detection deployment on OpenStack.

11.2.0.5.1 Cloning the Pedestrian Detection Repository from Github

Roles are included as submodules, which require the --recursive option to check them all out.

Change the following variable with actual ip addresses:

Create an inventory.txt file with the variable in your local directory.

Add an ansible.cfg file with options for ssh host key checking and login name.

Check accessibility with ansible ping like:

[masters]

10.0.5.67

10.0.5.68

10.0.5.69

[slaves]

10.0.5.70

10.0.5.71

10.0.5.72

$ git clone --recursive https://github.com/futuresystems/pedestrian-and-face-detection.git

sample_inventory = """[masters]

10.0.5.67

10.0.5.68

10.0.5.69

[slaves]

10.0.5.70

10.0.5.71

10.0.5.72"""

!printf"$sample_inventory">inventory.txt

!catinventory.txt

ansible_config = """[defaults]

host_key_checking=false

remote_user=ubuntu"""

!printf"$ansible_config">ansible.cfg

!catansible.cfg

!ansible-mping-iinventory.txtall

Make sure that you have the correct ssh key in your account, otherwise you may encounter 'FAILURE' in the previous ping test.

11.2.0.5.2 Ansible Playbook

We use a main Ansible playbook to deploy software packages for NIST Pedestrian and Face detection, which includes:

- mesos
- spark
- zookeeper
- opencv
- INRIA Person dataset
- Python script for the detection

The installation may take 30 minutes to an hour to complete.

11.2.0.6 OpenCV in Python

Before we run our code for this project, let's try OpenCV first to see how it works.

11.2.0.6.1 Import cv2

Let us import the opencv python module and we will use images from the online database image-net.org to test OpenCV image recognition. See Figure 41, Figure 42.

Let us download a mailbox image with a red color to see if opencv identifies the shape with a color. The example file in this tutorial is:

100  167k  100  167k    0     0   686k      0 --:--:-- --:--:-- --:--:--  684k

!cd pedestrian-and-face-detection/ && ansible-playbook -i ../inventory.txt site.yml

import cv2

$ curl http://farm4.static.flickr.com/3061/2739199963_ee78af76ef.jpg > mailbox.jpg

%matplotlib inline

from IPython.display import Image

mailbox_image = "mailbox.jpg"

Image(filename=mailbox_image)

Figure 41: Mailbox image

You can try other images. Check out image-net.org for mailbox images: http://image-net.org/synset?wnid=n03710193

11.2.0.6.2 Image Detection

Just as a test, let's try to detect a red colored mailbox using opencv python functions.

There are key functions that we use:

* cvtColor: to convert the color space of an image
* inRange: to detect a mailbox based on the range of red color pixel values
* np.array: to define the range of red color using the Numpy library for better calculation
* findContours: to find an outline of the object
* bitwise_and: to black-out the area of contours found

import numpy as np

import matplotlib.pyplot as plt

# imread for loading an image
img = cv2.imread(mailbox_image)

# cvtColor for color conversion
hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)

# define range of red color in hsv
lower_red1 = np.array([0, 50, 50])
upper_red1 = np.array([10, 255, 255])
lower_red2 = np.array([170, 50, 50])
upper_red2 = np.array([180, 255, 255])

# threshold the hsv image to get only red colors
mask1 = cv2.inRange(hsv, lower_red1, upper_red1)
mask2 = cv2.inRange(hsv, lower_red2, upper_red2)

Figure 42: Masked image

The red color mailbox is left alone in the image, which is what we wanted to find in this example with opencv functions. You can try other images with different colors to detect different shapes of objects using findContours and inRange from opencv.

For more information, see the following useful links.

contours features: http://docs.opencv.org/3.1.0/dd/d49/tutorial_py_contour_features.html

contours: http://docs.opencv.org/3.1.0/d4/d73/tutorial_py_contours_begin.html

mask = mask1 + mask2

# find a red color mailbox from the image
im2, contours, hierarchy = cv2.findContours(mask, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)

# bitwise_and to remove other areas in the image except the detected object
res = cv2.bitwise_and(img, img, mask=mask)

# turn off the x, y axis bar
plt.axis("off")

# text for the masked image
cv2.putText(res, "masked image", (20, 300), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255))

# display
plt.imshow(cv2.cvtColor(res, cv2.COLOR_BGR2RGB))
plt.show()

red color in hsv: http://stackoverflow.com/questions/30331944/finding-red-color-using-python-opencv

inrange: http://docs.opencv.org/master/da/d97/tutorial_threshold_inRange.html

inrange: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_imgproc/py_colorspaces/py_colorspaces.html

numpy: http://docs.opencv.org/3.0-beta/doc/py_tutorials/py_core/py_basic_ops/py_basic_ops.html

11.2.0.7 Human and Face Detection in OpenCV

11.2.0.7.1 INRIA Person Dataset

We use the INRIA Person dataset to detect upright people and faces in images in this example. Let us download it first.

100  969M  100  969M    0     0  8480k      0  0:01:57  0:01:57 --:--:-- 12.4M

11.2.0.7.2 Face Detection using Haar Cascades

This section is prepared based on the opencv-python tutorial: http://docs.opencv.org/3.1.0/d7/d8b/tutorial_py_face_detection.html#gsc.tab=0

There is a pre-trained classifier for face detection; download it from here:

100  908k  100  908k    0     0  2225k      0 --:--:-- --:--:-- --:--:-- 2259k

This classifier XML file will be used to detect faces in images. If you would like to create a new classifier, find out more information about training from here: http://docs.opencv.org/3.1.0/dc/d88/tutorial_traincascade.html

$ curl ftp://ftp.inrialpes.fr/pub/lear/douze/data/INRIAPerson.tar > INRIAPerson.tar
$ tar xvf INRIAPerson.tar > logfile && tail logfile
$ curl https://raw.githubusercontent.com/opencv/opencv/master/data/haarcascades/haarcascade_frontalface_default.xml > haarcascade_frontalface_default.xml

11.2.0.7.3 Face Detection Python Code Snippet

Now, we detect faces from the first five images using the classifier. See Figure 43, Figure 44, Figure 45, Figure 46, Figure 47, Figure 48, Figure 49, Figure 50, Figure 51, Figure 52, Figure 53.

# import the necessary packages
from __future__ import print_function
import numpy as np
import cv2
from os import listdir
from os.path import isfile, join
import matplotlib.pyplot as plt

mypath = "INRIAPerson/Test/pos/"

face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
onlyfiles = [join(mypath, f) for f in listdir(mypath) if isfile(join(mypath, f))]

cnt = 0
for filename in onlyfiles:
    image = cv2.imread(filename)
    image_grayscale = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(image_grayscale, 1.3, 5)
    if len(faces) == 0:
        continue

    cnt_faces = 1
    for (x, y, w, h) in faces:
        cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)
        cv2.putText(image, "face" + str(cnt_faces), (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 0), 2)
        plt.figure()
        plt.axis("off")
        plt.imshow(cv2.cvtColor(image[y:y + h, x:x + w], cv2.COLOR_BGR2RGB))
        cnt_faces += 1

    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
    cnt = cnt + 1
    if cnt == 5:
        break
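In the detectMultiScale call above, 1.3 is the scaleFactor (how much the image is shrunk at each pyramid step) and 5 is minNeighbors (how many overlapping detections are required to accept a face). A hedged variant (our addition) that scans more thoroughly at the cost of speed:

# smaller scale steps and stricter neighbor voting: fewer false positives, but slower
faces = face_cascade.detectMultiScale(image_grayscale, scaleFactor=1.1, minNeighbors=6)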

Figure 43: Example

Figure 44: Example

Figure 45: Example

Figure 46: Example

Figure 47: Example

Figure 48: Example

Figure 49: Example

Figure 50: Example

Figure 51: Example

Figure 52: Example

Figure 53: Example

11.2.0.8 Pedestrian Detection using HOG Descriptor

We will use the Histogram of Oriented Gradients (HOG) to detect an upright person in images. See Figure 54, Figure 55, Figure 56, Figure 57, Figure 58, Figure 59, Figure 60, Figure 61, Figure 62, Figure 63.

11.2.0.8.1 Python Code Snippet

# initialize the HOG descriptor/person detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

cnt = 0
for filename in onlyfiles:
    img = cv2.imread(filename)
    orig = img.copy()
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # detect people in the image
    (rects, weights) = hog.detectMultiScale(img, winStride=(8, 8),
                                            padding=(16, 16), scale=1.05)

    # draw the final bounding boxes
    for (x, y, w, h) in rects:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(orig, cv2.COLOR_BGR2RGB))
    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))
    cnt = cnt + 1
    if cnt == 5:
        break

Figure 54: Example

Figure 55: Example

Figure 56: Example

Figure 57: Example

Figure 58: Example

Figure 59: Example

Figure 60: Example

Figure 61: Example

Figure 62: Example

Figure 63: Example

11.2.0.9 Processing by Apache Spark

The INRIA Person dataset provides 100+ images and Spark can be used for image processing in parallel. We load 288 images from the "Test/pos" directory.

Spark provides a special object sc to connect between a Spark cluster and functions in Python code. Therefore, we can run Python functions in parallel to detect objects in this example.

The map function is used to process pedestrian and face detection per image from the parallelize() function of the sc Spark context.

The collect function merges results into an array.

def apply_batch(imagePath):
    import cv2
    import numpy as np

    # initialize the HOG descriptor/person detector
    hog = cv2.HOGDescriptor()
    hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())
    image = cv2.imread(imagePath)

    # detect people in the image
    (rects, weights) = hog.detectMultiScale(image, winStride=(8, 8), padding=(16, 16), scale=1.05)

    # draw the final bounding boxes
    for (x, y, w, h) in rects:
        cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
    return image

11.2.0.9.1 Parallelize in Spark Context

The list of image files is given to parallelize.

11.2.0.9.2 Map Function (apply_batch)

The apply_batch function that we created previously is given to the map function to process in a Spark cluster.

11.2.0.9.3 Collect Function

The result of each map process is merged into an array.

11.2.0.10 Results for 100+ images by Spark Cluster

pd = sc.parallelize(onlyfiles)

pdc = pd.map(apply_batch)

result = pdc.collect()

for image in result:
    plt.figure()
    plt.axis("off")
    plt.imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))

12 REFERENCES ☁

