In high energy physics, what we typically analyze is a list of events: we extract
what we deem useful from each event to compute physical quantities such as coupling
strengths and production cross-sections. But which variables in an event
are deemed useful varies greatly from analysis to analysis. The Event Data
Model (EDM) format is the backbone of the CMS software suite, giving the user great
flexibility to define an event by the variables they are interested
in. EDM files range from the official “tiers”, which include the raw electronics data
collected at CMS and general-purpose physics objects like electrons and jets made by the
official reconstruction algorithms, to user-made EDM files where the content
of an event is condensed down to a single double variable of interest to an
analysis. Getting familiar with how to operate on EDM files is a first step
toward getting started with analysis at CMS.
Reading the contents of an EDM file
We are going to provide a dummy EDM file here to use as
an example, and give instructions for finding official EDM files that you should
be able to get if you have access to a CMSSW environment.
As always, move to a CMSSW/src directory and load the environment with
cmsenv. To see which variables are available in each event, run the
command
edmDumpEventContent yourfile.root
You should get an output like
Type       Module     Label      Process
-------------------------------------------
double     "myprod"   "Gauss"    "Dummy"
int        "myprod"   "Poisson"  "Dummy"
The first column is the C++ type needed to read the variable. The last three
strings describe how the variable was produced. You can think of them as
three magic strings that allow you to access the variable. This dummy file
isn’t very interesting: each event contains a double variable
generated according to a Gaussian distribution and an int generated according
to a Poisson distribution. The output for a standard EDM file is much more
interesting:
Type                                  Module                      Label             Process
----------------------------------------------------------------------------------------------
...
double                                "fixedGridRhoAll"           ""                "RECO"
double                                "fixedGridRhoFastjetAll"    ""                "RECO"
...
vector<pat::Electron>                 "slimmedElectrons"          ""                "PAT"
...
vector<pat::Jet>                      "slimmedJets"               ""                "PAT"
...
vector<pat::MET>                      "slimmedMETs"               ""                "PAT"
...
vector<pat::Muon>                     "slimmedMuons"              ""                "PAT"
...
unsigned int                          "bunchSpacingProducer"      ""                "PAT"
where you can see what look like physics objects.
To read this file from a main function, there are two key classes you will need
to use: the fwlite::Event class and the fwlite::Handle class. An example of how
to use these classes:
#include "DataFormats/FWLite/interface/Handle.h"
#include "DataFormats/FWLite/interface/Event.h"
#include <iostream>
#include "TFile.h"
int main()
{
  // For single files
  fwlite::Event ev( TFile::Open("myedmfile.root") );
  fwlite::Handle<double> mydouble;
  for( ev.toBegin() ; !ev.atEnd() ; ++ev ){
    // Module, label and process strings as printed by edmDumpEventContent
    mydouble.getByLabel( ev , "myprod",   "Gauss",   "Dummy"   );
    std::cout << *mydouble << std::endl;
  }
}
Make sure you add the required entries to the BuildFile.xml accordingly.
The class fwlite::Event is the class that helps with iterating through the
file(s), and fwlite::Handle is the class that exposes the data in the
correct data type. To tell the Handle class which variable to extract, the
getByLabel() member function must be called, with the last three arguments
corresponding to the strings printed by the edmDumpEventContent
command. After calling the getByLabel() function, one can treat the
instance of Handle<T> as a pointer to a const T instance.
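To make the "look up by three strings, then dereference" pattern concrete, here is a toy sketch in plain C++ that compiles without CMSSW. ToyEvent, ToyHandle and Key are names I made up for illustration; they are not the FWLite classes, and the real fwlite::Handle may report a missing product differently (this toy simply returns false).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Toy stand-ins, NOT the FWLite classes: they only mimic the usage pattern.
using Key = std::tuple<std::string, std::string, std::string>;  // module, label, process

struct ToyEvent {
  std::map<Key, double> doubles;  // the double products stored in this event
};

class ToyHandle {
public:
  // Look the product up by its three strings; returns false if it is absent.
  bool getByLabel(const ToyEvent& ev, const std::string& module,
                  const std::string& label, const std::string& process) {
    const auto it = ev.doubles.find(Key{module, label, process});
    ptr_ = (it == ev.doubles.end()) ? nullptr : &it->second;
    return ptr_ != nullptr;
  }

  bool isValid() const { return ptr_ != nullptr; }

  // Read-only access, like dereferencing a pointer to a const double.
  const double& operator*() const {
    assert(ptr_ != nullptr && "getByLabel() must succeed before dereferencing");
    return *ptr_;
  }

private:
  const double* ptr_ = nullptr;  // points into the event, never owns the data
};
```

The handle never copies or owns the product, it only points at what the event already holds, which is why the data is exposed as const. It also shows why a misspelled string (say "Gaussian" instead of "Gauss") finds nothing.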
After this, the analysis is a matter of understanding what variables/classes are
available in the EDM file you are working with, and how to extract
the information you are interested in. For the official
EDM files, recipes for working with the data formats can be found on the official
documentation pages, as is the case for
MINIAOD.
Or you can look in the documentation of the individual classes on the CMSSW doxygen
page to see what variables are available.
Comparison with ROOT TTree flavoured files
If you are familiar with the ROOT TTree class, you might
think this looks familiar. A simple ROOT-flavored file could be read with
something like:
TFile *myfile =  new TFile("myfile.root");
TTree *mytree = (TTree*) myfile->Get("mytree");
double mydouble;
mytree->SetBranchAddress("myvar",&mydouble);
for( int i = 0 ; i < mytree->GetEntries() ; ++i ){
   mytree->GetEntry(i);
   cout << mydouble << endl;
}
and you would be correct. The EDM file format is indeed powered by the TTree
technology, but there are several advantages to using
this wrapper system. You can always argue that if this is possible
in ROOT, why not just use ROOT? The answer is that you can, but how much
time are you willing to spend writing your own code for reading files when
you could be writing code for the actual analysis?
Intuitive Modification
As we will see in later tutorials, the EDM file is intuitive to manipulate. Starting from content stored in a large file, you could:
- remove entries
- add variables
- remove variables
- remove parts of arrays
to produce a new, slimmed and much smaller file with very intuitive code.
In high energy physics, unselected data typically contain up to billions of events, most of which are not particularly interesting to the analysis, and even the events that are considered interesting might contain particles that are uninteresting. The ability to strip these files down to a manageable size is a key step in making analysis work of any kind possible at all.
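To give a feel for what "removing entries" and "removing parts of arrays" mean, here is a small sketch in plain C++, independent of CMSSW. ToyJet, ToyEvent, slimJets and skim are hypothetical names used only for illustration; the actual EDM tools for this are the subject of the later tutorials.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical plain structs standing in for the products stored in an event.
struct ToyJet { double pt; double eta; };
struct ToyEvent { std::vector<ToyJet> jets; };

// "Remove parts of arrays": keep only the jets at or above a pt threshold.
void slimJets(ToyEvent& ev, double minPt) {
  assert(minPt >= 0.0);  // a negative threshold would remove nothing
  auto& jets = ev.jets;
  jets.erase(std::remove_if(jets.begin(), jets.end(),
                            [minPt](const ToyJet& j) { return j.pt < minPt; }),
             jets.end());
}

// "Remove entries": slim every event, then keep only the events that
// still contain at least minJets jets.
std::vector<ToyEvent> skim(const std::vector<ToyEvent>& input,
                           double minPt, std::size_t minJets) {
  std::vector<ToyEvent> output;
  for (ToyEvent ev : input) {  // work on a copy, the input stays untouched
    slimJets(ev, minPt);
    if (ev.jets.size() >= minJets) output.push_back(std::move(ev));
  }
  return output;
}
```

With a 30 GeV threshold and minJets set to 1, an event holding only a 5 GeV jet disappears entirely, while an event holding 50 and 10 GeV jets survives with a single jet.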
Ease of Encapsulation
As you can see when dumping the contents of an EDM file,
entire classes and arrays of classes can be stored inside the file.
ROOT TTrees can save classes as well, but doing so is rather finicky.
Encapsulation has the advantage of making your code far more readable and thus
much more manageable. If you have used a TTree consisting only of plain C++
types before, you might be familiar with this kind of code for getting
the invariant mass of each jet pair in an event:
int size_of_jetlist;
double jetpt[SIZE];
double jeteta[SIZE];
...
tree->SetBranchAddress( "sizeofjet" , &size_of_jetlist );
tree->SetBranchAddress( "jetpt"     , jetpt );
...
for( int i = 0 ; i < tree->GetEntries() ; ++i ){
   tree->GetEntry(i);
   for( int j = 0 ; j < size_of_jetlist ; ++j ){
      for( int k = 0 ; k < size_of_jetlist ; ++k ){
         TLorentzVector jet1 ; jet1.SetPtEtaPhiE( jetpt[j], jeteta[j] .... );
         TLorentzVector jet2 ; jet2.SetPtEtaPhiE( jetpt[k], jeteta[k] .... );
         TLorentzVector sum = jet1 + jet2;
         cout << sum.M() << endl;
      }
   }
}
There are several problems with writing code like this:
- Managing variables becomes a repeated chore that has to be rewritten for every analysis.
- Counter-intuitive index handling: why can the jetpt variable be indexed separately from the jeteta variable?
- Excessive numbers of variables that are only used once.
All in all, it makes the code needlessly long, difficult to read and difficult
to maintain and alter. With the encapsulation classes that already exist in
CMSSW, the power of the EDM file format and the C++11 standard, one
could write:
fwlite::Event ev( TFile::Open("myfile.root") );
fwlite::Handle<std::vector<pat::Jet>> jethandle;
for( ev.toBegin() ; !ev.atEnd() ; ++ev ){
   jethandle.getByLabel( ev, "MyLabel" );
   for( const auto& jet1 : *jethandle ){
      for( const auto& jet2 : *jethandle ){
         cout << (jet1.p4() + jet2.p4()).M() << endl;
      }
   }
}
This is a lot easier to read and understand than the long-winded code of a
flat TTree format.
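If you want to play with the idea without a CMSSW environment, the same encapsulation can be sketched in plain C++. FourVec, ToyJet and pairMasses below are hypothetical stand-ins of my own (in CMSSW the four-momentum comes from the object's p4()), and the invariant mass is computed from M² = E² − |p|². Unlike the loops above, this sketch visits each unordered pair once and skips pairing a jet with itself.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal four-momentum (px, py, pz, E), not a CMSSW class.
struct FourVec {
  double px, py, pz, e;

  FourVec operator+(const FourVec& o) const {
    return {px + o.px, py + o.py, pz + o.pz, e + o.e};
  }

  // Invariant mass from M^2 = E^2 - |p|^2.
  double mass() const {
    assert(e >= 0.0);  // physical energies are non-negative
    const double m2 = e * e - (px * px + py * py + pz * pz);
    return m2 > 0.0 ? std::sqrt(m2) : 0.0;  // clamp small negative rounding errors
  }
};

// The jet keeps all of its kinematics together, so no parallel arrays are needed.
struct ToyJet { FourVec p4; };

// Invariant mass of every distinct jet pair (j < k) in one event.
std::vector<double> pairMasses(const std::vector<ToyJet>& jets) {
  std::vector<double> masses;
  for (std::size_t j = 0; j < jets.size(); ++j) {
    for (std::size_t k = j + 1; k < jets.size(); ++k) {
      masses.push_back((jets[j].p4 + jets[k].p4).mass());
    }
  }
  return masses;
}
```

Because each jet carries its own four-momentum, there is no way to accidentally combine the pt of one jet with the eta of another, which was the index-handling problem of the flat layout.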
Arguments against the EDM file format
In all the analyses I have come across, the EDM file format is typically
disfavored compared to the plain TTree file (commonly referred to as an ntuple).
I will list the reasons here, along with why these arguments are
becoming increasingly invalid.
- Forced ties to the CMSSW environment: This might have been a problem when internet connections were not as reliable as they are now. But now, logging onto a machine with the correct environment should not be a problem most of the time, and a uniform environment makes debugging easier in most cases.
- The EDM file format is large: It is true that an EDM file contains a much larger header than a normal TTree file. But the bulk of an EDM file actually comes from the physics object classes, which store a lot of variables. Flattening these objects into arrays and dropping variables can reduce file size, but it comes at the risk that a wanted variable is thrown out, and it adds an extra layer of code that needs to be maintained in addition to the analysis (n-tuplizer codes are notoriously hard to maintain). I believe that focusing on removing events and objects that are not useful to the analysis in question is a much more effective way of reducing the storage space required for an analysis.
- The EDM file format is difficult to learn: This is perhaps the only valid argument in my opinion. Learning how to read and modify EDM files of your own is not easy. Learning ROOT is already difficult enough because of sparse documentation (and designs that run counter to C++ intuition); learning CMSSW and EDM file manipulation is even harder for the same reason. But all in all, I think it is worth the time.
Closing words
I have said that manipulating the contents of an EDM file is intuitive. This
is where the brilliance of CMSSW really shines, but it is not easy to learn, as
it:
- Requires extensive knowledge of C++ syntax (templates, virtual functions)
- Lacks a direct file for a main function: you can only interact with the main function through a python file.
In a lot of cases, I could not tell you how the CMSSW framework is doing
what it is doing. But hopefully I can give you enough information to get
started and get familiar with this strange and powerful framework. In the next
part, I will start getting you familiar with looping through EDM files in
the CMSSW framework, first showing how to read a file, and then how to
modify its contents.