In high energy physics, what we typically analyze is a list of events: we
extract what we deem useful from each event to compute physical quantities such
as coupling strengths and production cross-sections. But which variables in an
event are deemed useful varies greatly from analysis to analysis. The Event Data
Model (EDM) format is the backbone of the CMS software suite, giving the user
great flexibility to define an event by the variables they are interested in.
EDM files range from the official “tiers”, which include the raw electronics
data collected at CMS and general-purpose physics objects like electrons and
jets made by the official reconstruction algorithms, to user-made EDM files
where the properties of an event are condensed down to a single double variable
of interest to an analysis. Getting familiar with how to operate on EDM files is
a first step in getting started with analysis at CMS.
Reading the contents of an EDM file
We are going to provide a dummy EDM file here to use as an example, and give
instructions for finding official EDM files that you should be able to access
if you have a CMSSW environment.
As always, move to a CMSSW/src directory and load the environment with cmsenv.
To see which variables are available in each event, run the command
edmDumpEventContent yourfile.root
You should get output like:
Type Module Label Process
-------------------------------------------
double "myprod" "Gauss" "Dummy"
int "myprod" "Poisson" "Dummy"
The first column is the C++ type needed to read the variable. The last three
strings describe how the variable was produced. You can think of them as three
magic strings that allow you to access the variable. This dummy file isn’t very
interesting: each event contains a double generated according to a Gaussian
distribution and an int generated according to a Poisson distribution. The
output for a standard EDM file is much more interesting:
Type Module Label Process
----------------------------------------------------------------------------------------------
...
double "fixedGridRhoAll" "" "RECO"
double "fixedGridRhoFastjetAll" "" "RECO"
...
vector<pat::Electron> "slimmedElectrons" "" "PAT"
...
vector<pat::Jet> "slimmedJets" "" "PAT"
...
vector<pat::MET> "slimmedMETs" "" "PAT"
...
vector<pat::Muon> "slimmedMuons" "" "PAT"
...
unsigned int "bunchSpacingProducer" "" "PAT"
where you can see what look like physics objects.
To read this file from a main function, there are two key classes you will need:
the fwlite::Event class and the fwlite::Handle class. A minimal example of how
to use them:
#include "DataFormats/FWLite/interface/Handle.h"
#include "DataFormats/FWLite/interface/Event.h"
#include <iostream>
#include "TFile.h"

int main()
{
  // For single files
  fwlite::Event ev( TFile::Open("myedmfile.root") );
  fwlite::Handle<double> mydouble;

  for( ev.toBegin() ; !ev.atEnd() ; ++ev ){
    // Module, label and process strings as printed by edmDumpEventContent
    mydouble.getByLabel( ev , "myprod", "Gauss", "Dummy" );
    std::cout << *mydouble << std::endl;
  }
  return 0;
}
Make sure you add the required entries to the BuildFile.xml accordingly.
The class fwlite::Event is the class that helps with iterating through the
file(s), and fwlite::Handle is the class that exposes the data in the correct
data type. To tell the Handle class which variable to extract, the getByLabel()
member function must be called, with the last three arguments corresponding to
the three strings printed by the edmDumpEventContent command. After calling
getByLabel(), one can treat the instance of Handle<T> as a pointer to a const T
instance.
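To make the role of the three strings and the pointer-like behaviour of the handle more concrete, here is a toy model in plain C++ that needs no CMSSW installation. It is only a sketch: LabelledEvent, ProductKey and ToyHandle are hypothetical names invented for this illustration, the toy event stores nothing but doubles, and none of this reflects how fwlite is implemented internally. Only the getByLabel() name and the dereference syntax mirror the usage shown in the text.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical key: the (module, label, process) strings of one product.
using ProductKey = std::tuple<std::string, std::string, std::string>;

// Hypothetical toy event that only stores double products.
struct LabelledEvent {
  std::map<ProductKey, double> doubles;
};

// Hypothetical toy handle: after getByLabel() it behaves like a
// pointer to a const double, or is invalid if nothing matched.
class ToyHandle {
public:
  void getByLabel(const LabelledEvent& ev, const std::string& module,
                  const std::string& label, const std::string& process) {
    auto it = ev.doubles.find(ProductKey(module, label, process));
    ptr_ = (it == ev.doubles.end()) ? nullptr : &it->second;
  }

  // True if the last lookup found a product.
  bool isValid() const { return ptr_ != nullptr; }

  // Read-only access to the product, pointer style.
  const double& operator*() const {
    assert(ptr_ != nullptr);  // dereferencing a failed lookup is a bug
    return *ptr_;
  }

private:
  const double* ptr_ = nullptr;
};
```

In this toy, asking for "myprod", "Gauss", "Dummy" finds the product, while a near miss such as "Gaussian" leaves the handle invalid. All three strings have to match what edmDumpEventContent prints, character for character.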
After this, the analysis is a matter of understanding which variables and
classes are available in the EDM file you are working with, and how to extract
the information you are interested in. For the official EDM files, recipes for
working with the data formats can be found on official documentation pages, as
is the case for MINIAOD. Alternatively, you can look up the documentation of
individual classes on the CMSSW doxygen page to see which variables are
available.
Comparison with ROOT TTree flavoured files
If you are familiar with the ROOT TTree class, you might think this looks
familiar. A simple ROOT-flavoured file could be read with something like:
TFile *myfile = new TFile("myfile.root");
TTree *mytree = (TTree*) myfile->Get("mytree");
double mydouble;
// Bind the branch to the local variable that is printed below
mytree->SetBranchAddress( "myvar", &mydouble );
for( int i = 0 ; i < mytree->GetEntries() ; ++i ){
  mytree->GetEntry(i);
  cout << mydouble << endl;
}
and you would be correct. The EDM file format is indeed powered by TTree
technology, but there are several advantages to using this wrapper system. You
can always make the argument that if this is possible in ROOT, why not just use
ROOT? You certainly can, but how much time are you willing to spend writing
your own code for reading files when you could be writing code for the actual
analysis?
Intuitive Modification
As we will see in later tutorials, the EDM file is intuitive to manipulate. Starting from a large file, you can:
- remove entries
- add variables
- remove variables
- remove parts of arrays
to produce a new, slimmed and much smaller file with very intuitive code.
In high energy physics, unselected data typically contains up to billions of events, most of which are not particularly interesting to the analysis, and even the events that are considered interesting might contain particles that are uninteresting. The ability to strip these files down to a manageable size is a key step in making analysis work of any kind possible at all.
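The idea of removing entries and removing parts of arrays can be sketched in plain C++ without any CMSSW machinery. The following is a minimal illustration under stated assumptions, not the way CMSSW does it: ToyJet, ToyEvent, slimEvent and skimAndSlim are hypothetical names, and the selection (a minimum jet pt and a minimum number of surviving jets) is just an example cut.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, heavily simplified stand-ins for real CMS data formats.
struct ToyJet {
  double pt;   // transverse momentum in GeV
  double eta;  // pseudorapidity
};

struct ToyEvent {
  std::vector<ToyJet> jets;
};

// "Remove parts of arrays": keep only the jets with pt >= minPt.
ToyEvent slimEvent(const ToyEvent& in, double minPt) {
  ToyEvent out;
  for (const auto& jet : in.jets) {
    if (jet.pt >= minPt) out.jets.push_back(jet);
  }
  return out;
}

// "Remove entries": slim every event, then keep it only if at least
// minJets jets survive the slimming.
std::vector<ToyEvent> skimAndSlim(const std::vector<ToyEvent>& events,
                                  double minPt, std::size_t minJets) {
  assert(minPt >= 0.0);  // a negative pt threshold makes no sense
  std::vector<ToyEvent> kept;
  for (const auto& ev : events) {
    ToyEvent slimmed = slimEvent(ev, minPt);
    if (slimmed.jets.size() >= minJets) kept.push_back(slimmed);
  }
  return kept;
}
```

Calling skimAndSlim(events, 30.0, 2) would keep only events with at least two jets above 30 GeV, and within those events only the jets that pass the threshold. Both reductions shrink the output, which is the point made above about focusing on what the analysis needs.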
Ease of Encapsulation
As you can see when dumping the contents of an EDM file, entire classes and
arrays of classes can be stored inside the file. ROOT TTrees can save classes
as well, but doing so is rather finicky. Encapsulation has the advantage of
making your code far more readable and thus much more manageable. If you have
used a TTree consisting only of built-in C++ types before, you might be
familiar with this kind of code for getting the invariant mass of each jet pair
in an event:
int size_of_jetlist;
double jetpt[SIZE];
double jeteta[SIZE];
...
tree->SetBranchAddress( "sizeofjet" , &size_of_jetlist );
tree->SetBranchAddress( "jetpt" , jetpt );
...
for( int i = 0 ; i < tree->GetEntries() ; ++i ){
  tree->GetEntry(i);
  for( int j = 0 ; j < size_of_jetlist ; ++j ){
    for( int k = 0 ; k < size_of_jetlist ; ++k ){
      TLorentzVector jet1 ; jet1.SetPtEtaPhiE( jetpt[j], jeteta[j] .... );
      TLorentzVector jet2 ; jet2.SetPtEtaPhiE( jetpt[k], jeteta[k] .... );
      TLorentzVector sum = jet1 + jet2;
      cout << sum.M() << endl;
    }
  }
}
There are several problems with writing code like this:
- Managing variables becomes a repeated chore that has to be rewritten for every analysis.
- Counter-intuitive index handling: why can the jetpt variable be indexed separately from the jeteta variable?
- Excessive numbers of variables that are only used once.
All in all, it makes the code needlessly long, difficult to read, and difficult
to maintain and alter. With the encapsulation classes that already exist in
CMSSW, the power of the EDM file format and the C++11 standard, one could
write:
fwlite::Event ev( TFile::Open("myfile.root") );
fwlite::Handle<std::vector<pat::Jet>> jethandle;
for( ev.toBegin() ; !ev.atEnd() ; ++ev ){
  jethandle.getByLabel( ev, "MyLabel" );
  for( const auto& jet1 : *jethandle ){
    for( const auto& jet2 : *jethandle ){
      cout << (jet1.p4() + jet2.p4()).M() << endl;
    }
  }
}
This is a lot easier to read and understand than the long-winded code needed
for a flat TTree format.
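The same encapsulation idea can be tried out without CMSSW or ROOT. The sketch below uses a hypothetical FourVector struct as a stand-in for what pat::Jet::p4() provides, and the helper names fromPtEtaPhiE and dijetMasses are invented for this illustration. It also assumes that one wants each unordered jet pair only once, whereas the loops above visit every ordered pair, including a jet paired with itself. The invariant mass follows M^2 = E^2 - |p|^2 in natural units.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal four-vector holding (px, py, pz, E).
struct FourVector {
  double px, py, pz, e;

  // Component-wise sum of two four-vectors.
  FourVector operator+(const FourVector& o) const {
    return {px + o.px, py + o.py, pz + o.pz, e + o.e};
  }

  // Invariant mass from M^2 = E^2 - |p|^2.
  double mass() const {
    const double m2 = e * e - (px * px + py * py + pz * pz);
    return m2 > 0.0 ? std::sqrt(m2) : 0.0;  // clamp rounding noise to zero
  }
};

// Build a four-vector from (pt, eta, phi, E).
FourVector fromPtEtaPhiE(double pt, double eta, double phi, double e) {
  assert(pt >= 0.0);  // transverse momentum is non-negative
  return {pt * std::cos(phi), pt * std::sin(phi), pt * std::sinh(eta), e};
}

// Invariant mass of every unordered jet pair (j < k).
std::vector<double> dijetMasses(const std::vector<FourVector>& jets) {
  std::vector<double> masses;
  for (std::size_t j = 0; j < jets.size(); ++j) {
    for (std::size_t k = j + 1; k < jets.size(); ++k) {
      masses.push_back((jets[j] + jets[k]).mass());
    }
  }
  return masses;
}
```

Because the kinematics live inside one object, there is a single index per jet and no way for pt and eta to drift out of step, which is the problem with the flat-array version.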
Arguments against the EDM file format
In all the analyses I have come across, the EDM file format is typically
disfavored compared with the plain TTree file (commonly referred to as an
Ntuple). I will list the reasons here, and also explain why these arguments are
becoming increasingly invalid.
- Forced ties to the CMSSW environment: This might have been a problem when internet connections were not as reliable as they are now. But now, logging onto a machine with the correct environment should not be a problem most of the time, and a uniform environment makes debugging easier in most cases.
- The EDM file format is large: It is true that an EDM file contains a much larger header than a normal TTree file. But the bulk of an EDM file actually comes from the physics object classes storing a lot of variables. Flattening these objects into arrays, with some variables dropped, can reduce the file size, but it comes with the risk that a wanted variable is thrown out, and it adds an extra level of maintenance on top of the analysis (n-tuplizer code is notoriously hard to maintain). I believe that focusing on removing events and objects that are not useful to the analysis in question is a much more effective way of reducing the storage space an analysis requires.
- The EDM file format is difficult to learn: This is perhaps the only valid argument in my opinion. Learning how to read and modify EDM files of your own is not easy. Learning ROOT is already difficult enough because of sparse documentation (and designs that run counter to C++ intuition), and learning CMSSW and EDM file manipulation is even harder. But all in all, I think it is worth the time.
Closing words
I have said that manipulating the contents of an EDM file is intuitive. This
is where the brilliance of CMSSW really shines, but it is not easy to learn, as
it:
- Requires extensive knowledge of C++ features (templates, virtual functions)
- Lacks a directly accessible main function: you can only interact with the main function through a Python file.
In a lot of cases, I cannot tell you how the CMSSW framework is doing what it
is doing. But hopefully I can give you enough information to get started and
get familiar with this strange and powerful framework. In the next part, I will
start getting you familiar with looping through EDM files in the CMSSW
framework: first how to read a file, and then how to modify its contents.