Nuons && Threads -> Suggestions 1
Nuons && Threads
Suggestions
SFT meeting, December 15 2014
René Brun
15/12/2014
From Algol to Nuons
• 1973: Thesis in Nuclear Physics (SC33/CERN, Diogene/Saturne/Saclay)
• 1973-1975: ISR/R232, p-p elastic scattering with C.Rubbia (Reconstruction)
• 1975-1980: SPS/NA4, deep inelastic muon scattering with C.Rubbia (Simul + Recons)
• 1978-1979: simulation of UA1 with C.Rubbia
• 1980-1989: simulation of OPAL with R.Heuer
• 1988-1993: simulation of GEM & SDC for the defunct SSC
• 1991-1994: simulation of ATLAS and CMS (Letters of Intent) with F.Gianotti, D.Froidevaux, V.Karimaki
• 1995-2010: busy with ROOT
• 2009-2010: interested in theoretical predictions for TOTEM (p-p elastic) and results
• 2009-2011: foundations for the Nuons model
• 2011…: computing particle masses to better than 1/1000
• 2012…: testing p-p elastic with TOTEM/UA4/D0/ISR
• 2012…: testing p-p interactions at the LHC (900 GeV, 2.76 TeV, 7 TeV)
• 2013…: testing the nuons model with Jets at the LHC
• 2014…: predictions for 13 TeV + paper draft
Nuons
[Figure: nuon models of the proton and of the neutron]
Nuons and C++
• I am implementing my « physics model » to:
– Model elementary particles using « nuons »
– Compute particle masses with high accuracy
– Test the model at many energies for p-p elastic scattering
– Test the model at LHC energies: particle production and Jets
Programs: findall.C, totem.C, collide.C
Example of event motivating my project
Standard proton model
Predicted cross section wrong by a factor of more than 1000 for t > 2 GeV^2
Collisions
[Figure: p-p elastic and p-p inelastic collisions]
Some programming details
• The 3 C++ programs findall, totem and collide (about 12000 LOC in total) all run in batch, multi-threaded mode on several OpenLab machines with 2x6 cores Westmere, 2x12 cores Ivy Bridge, or 2x14 cores E5-2697v3 (now upgraded to 2x18 cores). My programs run from a few minutes to one day.
– nohup root.exe -b -q "collide.C+(7000)" >x1.log &
– e.g. process ID 12756
• While the programs are running, I can inspect the results (histograms and/or Trees), say once per minute, from my laptop, then stop and launch again with a new set of parameters.
– root > .x colshow.C(-12756)
– This CINT script takes the file collide_12756.root from OpenLab/AFS and stores it on my laptop, where the histograms are visualized.
More programming details
• Findall is a bit « lattice QCD like ». 99.99% of the time is spent in TMinuit, computing the stable positions of a set of N nuons generated at random in a cube of size 1 fermi.
• Totem and Collide are quite similar to Pythia or Herwig. They simulate proton-proton collisions, generating output particles and Jets.
• The scripts run on my laptop and show plenty of graphs comparing with the results of the LHC experiments.
[Diagram: batch jobs on OpenLab machines produce histograms and Trees, which reach the interactive script on my laptop via AFS and scp]
More programming details (2)
• Findall saves results in a Tree (one particle per entry). It takes about 0.1 s to compute a pion, 10 minutes for a proton and 20 minutes for an Omega.
• Totem generates histograms only (about 20, 1D and 2D).
• Collide generates about 100 histograms (1D and 2D) and a Tree with a size ranging from a few Mbytes/minute to several Gbytes/minute, depending on the desired granularity of the collision information. About one billion collisions are generated in one day.
• Most histograms are filled millions of times per second.
Experience Suggestions
• All these applications are multi-threaded, a HUGE gain in REAL time for what I am doing.
• There are many, many applications in HEP that look very similar:
– All detector simulations
– All event generators
– Most physics analyses
• To make the most efficient use of the hardware, I had to make simple changes in ROOT, or implement solutions that should be implemented in a more general way in ROOT.
Main Topics
• Random numbers and distributions: trivial
• Histograms
• Trees
• I/O in general
• Thread scalability considerations
Current ROOT is a blocker for performant multi-threaded applications
Random Numbers
• No changes are required in the TRandomXX classes. I am using only the nice and efficient TRandom3 (Mersenne Twister).
• I create one TRandom3 object per thread, initialized with: TRandom3(pid + 1000*thnumb).
• I had to modify or circumvent all places referencing gRandom, in full backward compatibility and in totally trivial ways:
– TF1::GetRandom() -> TF1::GetRandom(double r=-1)
– Similar changes should be applied to TH1::GetRandom and FillRandom
– TGenPhaseSpace: add SetRandom function and member fRandom
– Similar changes should also be applied to:
  • TF2,TF3::GetRandom, TUnuran, TKDTree
  • TMVA: Dataset, RuleEnsemble
  • TGeoBBox, TGeoCompositeShape, TGeoChecker
  • TRobustEstimator, TAttParticle, TVirtualMC, RooStudyPackage
  • TApplicationRemote, TProof
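The one-generator-per-thread idea can be sketched without ROOT. The block below is only an illustration in standard C++: std::mt19937 stands in for TRandom3 (both are Mersenne Twister generators), and the names threadSeed and sampleMeansPerThread are hypothetical helpers invented for this sketch. The seed rule pid + 1000*thnumb is the one quoted on the slide.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Seed rule from the slide: pid + 1000 * thread number.
inline std::uint32_t threadSeed(std::uint32_t pid, std::uint32_t thnumb) {
    return pid + 1000u * thnumb;
}

// Hypothetical helper: every worker owns its generator, so no global
// generator (the role gRandom plays in ROOT) is shared between threads.
std::vector<double> sampleMeansPerThread(std::uint32_t pid,
                                         unsigned nthreads,
                                         unsigned ndraws) {
    assert(nthreads > 0 && ndraws > 0);
    std::vector<double> means(nthreads, 0.0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&means, pid, t, ndraws] {
            std::mt19937 rng(threadSeed(pid, t));  // private to this thread
            std::uniform_real_distribution<double> uniform(0.0, 1.0);
            double sum = 0.0;
            for (unsigned i = 0; i < ndraws; ++i) sum += uniform(rng);
            means[t] = sum / ndraws;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();
    return means;
}
```

Because each stream depends only on its seed, rerunning with the same pid and thread count reproduces the same numbers regardless of thread scheduling. Note that with this seed rule two jobs whose process IDs differ by a multiple of 1000 can end up sharing a seed.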
Histograms & Threads
• Currently one has to set TH1::AddDirectory(0) to bypass gDirectory.
• However, this forces users to do the histogram book-keeping themselves. This makes the histogram merging phase a bit complex (see next slides with a solution).
• Histograms may be created in the main thread and filled (with thread-locking) at each fill. This is fine if the number of fills is negligible.
• The only realistic solution is to make a copy of all histograms per thread.
• However, in several applications, this can represent a substantial increase in memory.
– In my case, I have at most 100 histograms (total 400 Kbytes per thread)
– Alice monitoring has 14000 histograms, total size 1.5 Gbytes in memory!
– Most analysis applications have a few hundred, up to a few thousand histograms
• Some tiny work is required to take advantage of the architecture already in place to:
– Do lazy instantiation of the bin structures
– Make better use of the TH1::SetBuffer mechanism, in particular in TH1::Merge, and make vectorization possible.
• I could not survive without my I/O check-pointing (around once per minute) for histograms and Trees. It lets me inspect the current status of my jobs at any time, interrupt them, and change my parameters when I see that the results are not the ones expected. It also makes running multi-threaded applications much safer.
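The "one copy of every histogram per thread, merged afterwards" approach can be illustrated with a toy histogram in standard C++. This is a sketch under stated assumptions, not ROOT code: Hist1D and fillInParallel are hypothetical names, the histogram only counts entries, and underflow/overflow are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Toy fixed-binning histogram standing in for TH1 (assumption of this sketch).
struct Hist1D {
    double xmin, xmax;
    std::vector<long long> bins;
    Hist1D(std::size_t nbins, double lo, double hi)
        : xmin(lo), xmax(hi), bins(nbins, 0) {}
    void fill(double x) {
        if (x < xmin || x >= xmax) return;  // under/overflow ignored here
        std::size_t i =
            static_cast<std::size_t>((x - xmin) * bins.size() / (xmax - xmin));
        if (i >= bins.size()) i = bins.size() - 1;  // guard against rounding
        ++bins[i];
    }
    void add(const Hist1D& other) {  // the "merge" step
        assert(bins.size() == other.bins.size());
        for (std::size_t i = 0; i < bins.size(); ++i) bins[i] += other.bins[i];
    }
    long long entries() const {
        long long n = 0;
        for (long long b : bins) n += b;
        return n;
    }
};

// Each thread fills its private copy without any lock; copies are summed
// once all threads have finished.
Hist1D fillInParallel(unsigned nthreads, unsigned fillsPerThread) {
    std::vector<Hist1D> perThread(nthreads, Hist1D(66, 0.0, 66.0));
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&perThread, t, fillsPerThread] {
            Hist1D& h = perThread[t];
            for (unsigned i = 0; i < fillsPerThread; ++i)
                h.fill(static_cast<double>((i + t) % 66));
        });
    }
    for (auto& w : workers) w.join();
    Hist1D total(66, 0.0, 66.0);
    for (const Hist1D& h : perThread) total.add(h);
    return total;
}
```

The memory cost is the point made in the bullets above: the bin storage is multiplied by the number of threads, which is negligible for about 100 small histograms but not for thousands of large ones.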
Histograms: poor man's solution
Main thread: TH1 *hrun, *hwatch
Thread 1 … Thread 6 … Thread 12, each:
– Create 97 histograms
– Loop on events
– Every N events, save thread histograms to file
Then merge all thread files every NN events or at end of job.
What I have been doing for a long time; efficiency < 8/12.
Histograms (2): much better
Main thread: TH1 *hrun, *hwatch
Thread 1 … Thread 6 … Thread 12, each:
– Create 97 histograms
– Loop on events
– Every N events, merge histograms from all threads and save to file
My current version.
Histograms Management (1) (my current solution)

Main thread:
    TH1::AddDirectory(0);          // histograms are not attached to gDirectory
    TList htr[nthreads];           // one list of histograms per thread
    TH1D *hrun = new TH1D(…);

In thread thnumb:
    TThread::Lock();
    TList &hlist = htr[thnumb];
    TH1D *hncol = new TH1D("hncol","number of collisions",66,0,66);
    hlist.Add(hncol);
    TH1D *hpoiss = new TH1D("hpoiss","Jets particle multiplicity",50,0,50);
    hlist.Add(hpoiss);
    ……
    hncol->Fill(…);
    …

In any thread or at end of main thread:
    TFile *fhist = TFile::Open(TString::Format("collide_%d.root",processID),"recreate");
    hrun->SetBinContent(26,mainwatch->GetRealTime());
    hrun->Write();
    TList hlistall;
    int nh = htr[0].GetSize();
    for (int ih=0;ih<nh;ih++) {
        // start from a clone of thread 0's histogram, then merge the other threads into it
        TH1 *hcur = (TH1*)htr[0].At(ih)->Clone();
        hlistall.Clear();
        for (int t=1;t<ncpus;t++) {
            hlistall.Add(htr[t].At(ih));
        }
        hcur->Merge(&hlistall);
        hcur->Write();
        delete hcur;
    }
    fhist->SaveSelf();
    delete fhist;
Histograms Management (2) (what I would like to see in ROOT)

Main thread:
    TH1::InitializeThreads(nthreads);
    TH1D *hrun = new TH1D(…);

In thread thnumb:
    TH1::SetThreadDirectory(thnumb);
    TH1D *hncol = new TH1D("hncol","number of collisions",66,0,66);
    TH1D *hpoiss = new TH1D("hpoiss","Jets particle multiplicity",50,0,50);
    ……
    hncol->Fill(…);
    …

In any thread or at end of main thread:
    TFile *fhist = TFile::Open(TString::Format("collide_%d.root",processID),"recreate");
    hrun->SetBinContent(26,mainwatch->GetRealTime());
    hrun->Write();
    TH1::MergeThreads()->Write();
    fhist->SaveSelf();
    delete fhist;
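To make the intent of the proposed interface concrete, here is a small standard C++ sketch of a per-thread histogram registry. It is not ROOT code and does not claim to show how ROOT would implement the proposal: ThreadHistRegistry, book, mergeThreads and runWorkers are hypothetical names, loosely mirroring InitializeThreads, the per-thread booking after SetThreadDirectory, and MergeThreads. Histograms are reduced to plain vectors of bin counts.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <vector>

// Hypothetical registry: one slot of named histograms per thread.
class ThreadHistRegistry {
public:
    using Bins = std::vector<long long>;

    explicit ThreadHistRegistry(std::size_t nthreads) : slots_(nthreads) {}

    // Book (or fetch) a histogram in the slot owned by thread `thnumb`.
    // No lock is needed as long as each thread only touches its own slot.
    Bins& book(std::size_t thnumb, const std::string& name, std::size_t nbins) {
        assert(thnumb < slots_.size());
        return slots_[thnumb].emplace(name, Bins(nbins, 0)).first->second;
    }

    // Sum same-named histograms over all slots.
    // In this sketch it must be called after the workers have finished.
    std::map<std::string, Bins> mergeThreads() const {
        std::map<std::string, Bins> merged;
        for (const auto& slot : slots_) {
            for (const auto& kv : slot) {
                Bins& out =
                    merged.emplace(kv.first, Bins(kv.second.size(), 0)).first->second;
                assert(out.size() == kv.second.size());
                for (std::size_t i = 0; i < out.size(); ++i) out[i] += kv.second[i];
            }
        }
        return merged;
    }

private:
    std::vector<std::map<std::string, Bins>> slots_;
};

// Usage: the user code books by name only; the book-keeping lives in the registry.
std::map<std::string, ThreadHistRegistry::Bins> runWorkers(std::size_t nthreads,
                                                           std::size_t nevents) {
    ThreadHistRegistry registry(nthreads);
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < nthreads; ++t) {
        workers.emplace_back([&registry, t, nevents] {
            auto& hncol = registry.book(t, "hncol", 66);
            auto& hpoiss = registry.book(t, "hpoiss", 50);
            for (std::size_t ev = 0; ev < nevents; ++ev) {
                ++hncol[ev % 66];
                ++hpoiss[ev % 50];
            }
        });
    }
    for (auto& w : workers) w.join();
    return registry.mergeThreads();
}
```

Compared with the hand-written TList bookkeeping, the user code no longer tracks which histogram belongs to which thread. A merge that runs while the workers are still filling, as in the periodic check-pointing described earlier, would need additional synchronization that this sketch leaves out.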
Histograms (3): muuuch better
Main thread: TH1 *hrun, *hwatch
Thread 1 … Thread 6 … Thread 12, each:
– Create 97 histograms
– Loop on events
– Every N events, merge histograms from all threads and save to file
Non-blocking asynchronous I/O thread
What I would like to see.
Trees & Threads
• Solution 1: one TTree per thread and one file per thread, then possibly merge the files at end of job.
– Currently this requires locking and/or fixing the non-thread-safe parts of TTree I/O
– Not very user friendly, as it requires more book-keeping
• Solution 2: use the TTree buffer merge facility
– This is much more efficient, but requires more memory
– This solution is not yet fully operational for threads
• Solution 3: create only one TTree in the main thread (or any thread)
– For each fill: Lock, swap branch addresses, Fill, UnLock
– This solution is nice for memory, but adds more sequentiality
– This is my current solution, while waiting for a better one, e.g. Solution 4
• Solution 4: same as Solution 3, but with
– An optimized booking and swapping of branch addresses
– Delegation of the pure I/O part to a separate asynchronous thread doing the zipping and disk writes.
• Solution 5: same as Solution 4, with in addition
– Possibility to call branch::Fill per thread (this will be essential for GeantV)
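The "Lock, swap branch addresses, Fill, UnLock" sequence of Solution 3 can be sketched in standard C++. Everything here is a stand-in: SharedTree, Event and fillFromThreads are hypothetical names, a std::mutex plays the role of TThread::Lock/UnLock, and a single pointer replaces the set of branch addresses.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Per-thread event data; a stand-in for the variables attached to branches.
struct Event {
    int nch;
    double phi1;
};

// Stand-in for a single shared tree: fill() copies from whatever address
// was registered last, like a tree reading its branch addresses.
class SharedTree {
public:
    void setBranchAddress(const Event* src) { src_ = src; }
    void fill() {
        assert(src_ != nullptr);
        entries_.push_back(*src_);
    }
    const std::vector<Event>& entries() const { return entries_; }

private:
    const Event* src_ = nullptr;
    std::vector<Event> entries_;
};

// Each worker keeps its own Event; for every fill it takes the lock,
// points the tree at its own data, fills, and releases the lock.
std::size_t fillFromThreads(SharedTree& tree, int nthreads, int eventsPerThread) {
    std::mutex treeLock;
    std::vector<std::thread> workers;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&tree, &treeLock, t, eventsPerThread] {
            Event ev{0, 0.0};  // thread-local event buffer
            for (int i = 0; i < eventsPerThread; ++i) {
                ev.nch = t;  // event generation happens outside the lock
                ev.phi1 = static_cast<double>(i);
                std::lock_guard<std::mutex> guard(treeLock);
                tree.setBranchAddress(&ev);  // swap addresses
                tree.fill();                 // fill while holding the lock
            }
        });
    }
    for (auto& w : workers) w.join();
    return tree.entries().size();
}
```

Only one tree exists, which keeps memory low, but every fill is serialized through the lock. That is the extra sequentiality mentioned for Solution 3 and the motivation for moving the I/O to a separate thread in Solution 4.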
Trees & Threads (my current solution)

Main thread:
    TTree *T = 0;

In initialisation, thread thnumb:
    if (!T && fillTree) {
        TFile::Open(TString::Format("/data/brun/collide_%d_events.root",processID),"recreate");
        T = new TTree("T","selected collide events");
        T->Branch("i1",&i1,"i1/I");
        T->Branch("i2",&i2,"i2/I");
        T->Branch("nch",&nch,"nch/I");
        T->Branch("nchCMS",&nchCMS,"nchCMS/I");
        T->Branch("njets",&njets,"njets/I");
        T->Branch("njetsCMS",&njetsCMS,"njetsCMS/I");
        T->Branch("phi1",&phi1,"phi1/D");
        …….
        T->Branch("ptype",ptype,"ptype[nchCMS]/I");
        T->Branch("pjet",pjet,"pjet[nchCMS]/I");
        T->Branch("ppx",ppx,"ppx[nchCMS]/D");
        T->Branch("ppy",ppy,"ppy[nchCMS]/D");
        T->Branch("ppz",ppz,"ppz[nchCMS]/D");
        T->Branch("ppt",ppt,"ppt[nchCMS]/D");
        T->Branch("peta",peta,"peta[nchCMS]/D");
        T->AutoSave("SaveSelf");
    }

Filling Tree in thread thnumb:
    if (fillTree && bigjet) {
        TThread::Lock();
        // point every branch at this thread's variables before filling
        T->SetBranchAddress("i1",&i1);
        T->SetBranchAddress("i2",&i2);
        T->SetBranchAddress("nch",&nch);
        T->SetBranchAddress("nchCMS",&nchCMS);
        T->SetBranchAddress("njets",&njets);
        T->SetBranchAddress("njetsCMS",&njetsCMS);
        T->SetBranchAddress("phi1",&phi1);
        …….
        T->SetBranchAddress("ptype",ptype);
        T->SetBranchAddress("pjet",pjet);
        T->SetBranchAddress("ppx",ppx);
        T->SetBranchAddress("ppy",ppy);
        T->SetBranchAddress("ppz",ppz);
        T->SetBranchAddress("ppt",ppt);
        T->SetBranchAddress("peta",peta);
        T->Fill();
        // every N events autosave
        if (event%1000==0) T->AutoSave("SaveSelf");
        TThread::UnLock();
    }
Trees & Threads (2) (what would be faster and simpler)

Main thread:
    TTree *T = 0;

In initialisation, thread thnumb:
    if (!T && fillTree) {
        TFile::Open(TString::Format("/data/brun/collide_%d_events.root",processID),"recreate");
        T = new TTree("T","selected collide events");
        T->Branch("i1",&i1,"i1/I");
        T->Branch("i2",&i2,"i2/I");
        T->Branch("nch",&nch,"nch/I");
        T->Branch("nchCMS",&nchCMS,"nchCMS/I");
        T->Branch("njets",&njets,"njets/I");
        T->Branch("njetsCMS",&njetsCMS,"njetsCMS/I");
        T->Branch("phi1",&phi1,"phi1/D");
        …….
        T->Branch("ptype",ptype,"ptype[nchCMS]/I");
        T->Branch("pjet",pjet,"pjet[nchCMS]/I");
        T->Branch("ppx",ppx,"ppx[nchCMS]/D");
        T->Branch("ppy",ppy,"ppy[nchCMS]/D");
        T->Branch("ppz",ppz,"ppz[nchCMS]/D");
        T->Branch("ppt",ppt,"ppt[nchCMS]/D");
        T->Branch("peta",peta,"peta[nchCMS]/D");
        T->AutoSave("SaveSelf");
        T->SaveThreadBranches(thnumb);    // proposed call, not in current ROOT
    }

Filling Tree in thread thnumb:
    if (fillTree && bigjet) {
        TThread::Lock();
        T->SetThreadBranches(thnumb);     // proposed call, replaces the SetBranchAddress list
        T->Fill();
        // every N events autosave
        if (event%1000==0) T->AutoSave("SaveSelf");
        TThread::UnLock();
    }
Trees & Threads (3) (what would be much faster and even simpler)

Main thread:
    TTree *T = 0;

In initialisation, thread thnumb:
    if (!T && fillTree) {
        TFile::Open(TString::Format("/data/brun/collide_%d_events.root",processID),"recreate");
        T = new TTree("T","selected collide events");
        T->Branch("i1",&i1,"i1/I");
        T->Branch("i2",&i2,"i2/I");
        T->Branch("nch",&nch,"nch/I");
        T->Branch("nchCMS",&nchCMS,"nchCMS/I");
        T->Branch("njets",&njets,"njets/I");
        T->Branch("njetsCMS",&njetsCMS,"njetsCMS/I");
        T->Branch("phi1",&phi1,"phi1/D");
        …….
        T->Branch("ptype",ptype,"ptype[nchCMS]/I");
        T->Branch("pjet",pjet,"pjet[nchCMS]/I");
        T->Branch("ppx",ppx,"ppx[nchCMS]/D");
        T->Branch("ppy",ppy,"ppy[nchCMS]/D");
        T->Branch("ppz",ppz,"ppz[nchCMS]/D");
        T->Branch("ppt",ppt,"ppt[nchCMS]/D");
        T->Branch("peta",peta,"peta[nchCMS]/D");
        T->AutoSave("SaveSelf");
        T->SaveThreadBranches(thnumb);
    }

Filling Tree in thread thnumb:
    if (fillTree && bigjet) {
        TThread::Lock();
        T->SetThreadBranchesFill(thnumb, kAutoSave %( n%1000==0));
        TThread::UnLock();
    }

Here SetThreadBranchesFill quickly copies the branch data to a circular buffer, immediately returns control to the calling thread, and passes the data asynchronously to another thread that fills the TreeCache and does the disk I/O.
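The circular buffer with a separate I/O thread can be sketched in standard C++ as a bounded producer/consumer queue. This is an illustration under stated assumptions, not the proposed ROOT implementation: AsyncTreeWriter, Event and produceAndCount are hypothetical names, and appending to a vector stands in for compression and disk writes. One deliberate difference from the wish expressed above: with a bounded ring, fill() returns immediately only while there is free space, and blocks while the ring is full.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

struct Event {
    int thnumb;
    int nch;
};

// Hypothetical asynchronous writer: producers copy events into a ring,
// a single background thread drains it and performs the "I/O".
class AsyncTreeWriter {
public:
    explicit AsyncTreeWriter(std::size_t capacity) : ring_(capacity) {
        assert(capacity > 0);
        writer_ = std::thread([this] { drain(); });
    }
    ~AsyncTreeWriter() { close(); }

    // Copy the event into the ring and return; waits only if the ring is full.
    void fill(const Event& ev) {
        std::unique_lock<std::mutex> lk(m_);
        notFull_.wait(lk, [this] { return count_ < ring_.size(); });
        ring_[(head_ + count_) % ring_.size()] = ev;
        ++count_;
        notEmpty_.notify_one();
    }

    // Flush pending events and stop the writer thread. Safe to call twice.
    void close() {
        {
            std::lock_guard<std::mutex> lk(m_);
            done_ = true;
        }
        notEmpty_.notify_one();
        if (writer_.joinable()) writer_.join();
    }

    // Only meaningful after close().
    const std::vector<Event>& written() const { return written_; }

private:
    void drain() {
        for (;;) {
            Event ev{0, 0};
            {
                std::unique_lock<std::mutex> lk(m_);
                notEmpty_.wait(lk, [this] { return count_ > 0 || done_; });
                if (count_ == 0) return;  // closed and fully drained
                ev = ring_[head_];
                head_ = (head_ + 1) % ring_.size();
                --count_;
                notFull_.notify_one();
            }
            written_.push_back(ev);  // stand-in for zipping + disk write, done unlocked
        }
    }

    std::vector<Event> ring_;
    std::size_t head_ = 0, count_ = 0;
    bool done_ = false;
    std::mutex m_;
    std::condition_variable notFull_, notEmpty_;
    std::vector<Event> written_;
    std::thread writer_;
};

std::size_t produceAndCount(AsyncTreeWriter& writer, int nthreads, int eventsPerThread) {
    std::vector<std::thread> producers;
    for (int t = 0; t < nthreads; ++t) {
        producers.emplace_back([&writer, t, eventsPerThread] {
            for (int i = 0; i < eventsPerThread; ++i) writer.fill(Event{t, i});
        });
    }
    for (auto& p : producers) p.join();
    writer.close();
    return writer.written().size();
}
```

The lock is held only for the copy into the ring, so the expensive part of the write no longer serializes the event-generating threads.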