Introduction to CMSSW part II - First Look at the EDM file format

2016.Aug.15

In high energy physics, what we typically analyze is a list of events: from each event we extract what we deem useful to compute physical quantities such as coupling strengths and production cross-sections. But which variables in an event are useful varies greatly from analysis to analysis. The Event Data Model (EDM) format is the backbone of the CMS software suite, giving users great flexibility to define an event by the variables they are interested in. EDM files range from the official "tiers", which include the raw electronics data collected at CMS and general-purpose physics objects like electrons and jets produced by the official reconstruction algorithms, to user-made EDM files where each event is condensed down to a single double variable of interest to one analysis. Getting familiar with how to operate on EDM files is a first step towards doing analysis at CMS.

Reading the contents of an EDM file

We are going to provide a dummy EDM file here to use as an example, and give instructions for finding official EDM files, which you should be able to get if you have access to a CMSSW environment.

As always, move to a CMSSW/src directory and load the environment with cmsenv. To see which variables are available in each event, run the command

[bash]
edmDumpEventContent  yourfile.root

You should get an output like

[plaintext]
Type       Module     Label      Process
-------------------------------------------
double     "myprod"   "Gauss"    "Dummy"
int        "myprod"   "Poisson"  "Dummy"

The first column is the C++ type needed to read the variable. The last three strings describe how the variable was produced. You can think of them as three magic strings that allow you to access the variable. This dummy file isn't very interesting: each event contains a double generated according to a Gaussian distribution and an int generated according to a Poisson distribution. The output for a standard EDM file is much more interesting:

Example output of a real EDM file used by CMS [plaintext]
Type                                  Module                      Label             Process
----------------------------------------------------------------------------------------------
...
double                                "fixedGridRhoAll"           ""                "RECO"
double                                "fixedGridRhoFastjetAll"    ""                "RECO"
...
vector<pat::Electron>                 "slimmedElectrons"          ""                "PAT"
...
vector<pat::Jet>                      "slimmedJets"               ""                "PAT"
...
vector<pat::MET>                      "slimmedMETs"               ""                "PAT"
...
vector<pat::Muon>                     "slimmedMuons"              ""                "PAT"
...
unsigned int                          "bunchSpacingProducer"      ""                "PAT"

where you can see what look like physical objects.

To read this file from a main function, there are two key classes you will need: the fwlite::Event class and the fwlite::Handle class. A simple example of how to use them:

Simple example of looping over file with fwlite [cpp]
#include "DataFormats/FWLite/interface/Handle.h"
#include "DataFormats/FWLite/interface/Event.h"
#include <iostream>
#include "TFile.h"
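int main()
{
  // For single files: fwlite::Event is constructed from the opened TFile
  fwlite::Event ev( TFile::Open("myedmfile.root") );
  fwlite::Handle<double> mydouble;
  for( ev.toBegin() ; !ev.atEnd() ; ++ev ){
    // Module, label and process strings as printed by edmDumpEventContent
    mydouble.getByLabel( ev , "myprod",   "Gauss",   "Dummy"   );
    std::cout << *mydouble << std::endl;
  }
  return 0;
}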

Make sure you add the required entries to the BuildFile.xml accordingly.

The class fwlite::Event is the class that helps with iterating through the file(s), and fwlite::Handle is the class that helps expose the data in the correct data type. To tell the Handle class which variable to extract, the getByLabel() member function must be called, with the latter three strings corresponding to the strings printed by the edmDumpEventContent command. After calling the getByLabel() function, one can treat the instance of Handle<T> as a pointer to a const T instance.

After this, the analysis is a matter of understanding what variables/classes are available in the EDM file you are working with, and how to extract the information you are interested in. For the official EDM files, recipes for working with the data formats might be found on the official documentation pages, as is the case for MINIAOD. Or you can look at the documentation of individual classes on the CMSSW doxygen page to see what variables are available.
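
To make the role of the three strings more concrete, here is a toy sketch in plain C++ that needs no CMSSW at all. Everything in it (ToyEvent, ToyHandle, put) is hypothetical and only mimics the lookup idea: a product is found by its exact (module, label, process) key, and the handle then behaves like a pointer to a const value. It is not how fwlite is actually implemented.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for an event holding double products,
// each keyed by the (module, label, process) triple.
struct ToyEvent {
  using Key = std::tuple<std::string, std::string, std::string>;
  std::map<Key, double> products;

  void put( const std::string& module, const std::string& label,
            const std::string& process, double value ){
    products[ Key(module, label, process) ] = value;
  }
};

// Hypothetical stand-in for a handle: after getByLabel() it acts like
// a pointer to a const double, or stays invalid if nothing matched.
class ToyHandle {
public:
  void getByLabel( const ToyEvent& ev, const std::string& module,
                   const std::string& label, const std::string& process ){
    auto it = ev.products.find( ToyEvent::Key(module, label, process) );
    ptr_ = ( it == ev.products.end() ) ? nullptr : &it->second;
  }
  bool isValid() const { return ptr_ != nullptr; }
  const double& operator*() const { assert( isValid() ); return *ptr_; }
private:
  const double* ptr_ = nullptr;
};
```

In this toy, a lookup whose strings do not match exactly (say "Gaussian" where the file lists "Gauss") leaves the handle invalid, which is why it pays to copy the strings straight from the edmDumpEventContent output.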

Comparison with ROOT TTree flavoured files

If you are familiar with the ROOT TTree class, you might think this looks familiar. A simple ROOT-flavoured file could be read with something like:

[cpp]
TFile *myfile = new TFile("myfile.root");
TTree *mytree = (TTree*) myfile->Get("mytree");
double mydouble;

// Bind the branch "myvar" to the local variable mydouble
mytree->SetBranchAddress("myvar",&mydouble);
for( Long64_t i = 0 ; i < mytree->GetEntries() ; ++i ){
   mytree->GetEntry(i);
   std::cout << mydouble << std::endl;
}

and you would be correct. The EDM file format is indeed powered by the TTree technology, but there are several advantages to using this wrapper system. You can always argue that if something is possible in plain ROOT, why not just use ROOT? The answer is that yes, you can, but how much time are you willing to spend writing your own code for reading files when you could be writing code for the actual analysis?

Intuitive Modification

As we will see in later tutorials, the EDM file is intuitive to manipulate. You could:

  • remove entries
  • add variables
  • remove variables
  • remove parts of arrays

that were stored in a large file to produce a new, slimmed-down and much smaller file, all with very intuitive code.

In high energy physics, unselected data typically contain up to billions of events, most of which are not particularly interesting to the analysis, and even the events that are considered interesting might contain particles that are uninteresting. The ability to strip these files down to a manageable size is a key step in making analysis work of any kind possible at all.
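
As a rough illustration of what "stripping down" means, here is a toy sketch in plain C++. The ToyJetEvent struct, the skim() function and its thresholds are all made up for this example, and real EDM files are slimmed with the CMSSW framework tools covered in later tutorials, not with code like this. The sketch first drops uninteresting jets inside each event (removing parts of arrays), then drops events that no longer have enough jets (removing entries).

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical minimal event: just a list of jet transverse momenta.
struct ToyJetEvent {
  std::vector<double> jetpt;
};

// Keep only jets with pt above ptmin in every event, then keep only
// the events that still contain at least njets jets.
std::vector<ToyJetEvent> skim( const std::vector<ToyJetEvent>& input,
                               double ptmin, std::size_t njets ){
  std::vector<ToyJetEvent> output;
  for( const auto& ev : input ){
    ToyJetEvent slim;
    std::copy_if( ev.jetpt.begin(), ev.jetpt.end(),
                  std::back_inserter( slim.jetpt ),
                  [ptmin]( double pt ){ return pt > ptmin; } );
    if( slim.jetpt.size() >= njets ){
      output.push_back( slim );
    }
  }
  return output;
}
```

Calling skim( events, 30., 2 ) on a list of such toy events would keep only the events with at least two jets above 30, and within those events only the jets that passed.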

Ease of Encapsulation

As you can see when dumping the contents of an EDM file, entire classes and arrays of classes can be stored inside the file. ROOT TTrees can store classes as well, but doing so is rather finicky.

Encapsulation has the advantage of making your code far more readable and thus much more manageable. If you have used a TTree consisting only of plain C++ types before, you might be familiar with this kind of code for getting the invariant mass of each jet pair in an event:

Example of an n-tuple like loop [cpp]
int size_of_jetlist;
double jetpt[SIZE];   // SIZE: maximum number of jets per event, defined elsewhere
double jeteta[SIZE];
...

tree->SetBranchAddress( "sizeofjet" , &size_of_jetlist );
tree->SetBranchAddress( "jetpt"     , jetpt );
...

for( int i = 0 ; i < tree->GetEntries() ; ++i ){
   tree->GetEntry(i);
   for( int j = 0 ; j < size_of_jetlist ; ++j ){
      for( int k = 0 ; k < size_of_jetlist ; ++k ){
         TLorentzVector jet1 ; jet1.SetPtEtaPhiE( jetpt[j], jeteta[j] .... );
         TLorentzVector jet2 ; jet2.SetPtEtaPhiE( jetpt[k], jeteta[k] .... );
         TLorentzVector sum = jet1 + jet2;
         std::cout << sum.M() << std::endl;
      }
   }
}

There are several problems with writing code like this:

  • Managing variables becomes a repeated chore that has to be rewritten for every analysis.
  • Counter-intuitive index handling: why can the jetpt variable be indexed independently of the jeteta variable, when both describe the same jet?
  • Excessive amounts of variables that are only used once.

All in all, it makes the code needlessly long, difficult to read and difficult to maintain and alter. With the encapsulation classes that already exist in CMSSW, the power of the EDM file format and the C++11 standard, one could write:

Example of an EDM loop [cpp]
// fwlite::Event is constructed from an opened TFile, not from a file name
fwlite::Event ev( TFile::Open("myfile.root") );
fwlite::Handle<std::vector<pat::Jet>> jethandle;

for( ev.toBegin() ; !ev.atEnd() ; ++ev ){
   jethandle.getByLabel( ev, "MyLabel" );
   for( const auto& jet1 : *jethandle ){
      for( const auto& jet2 : *jethandle ){
         std::cout << (jet1.p4() + jet2.p4()).M() << std::endl;
      }
   }
}

This is a lot easier to read and understand than the long-winded code of the flat TTree format.
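
One detail worth noting: nested loops written like the ones above visit every ordered pair, so each pair shows up twice and every jet is also paired with itself. If only distinct pairs are wanted, the inner loop can start one past the outer index. Below is a minimal sketch of that in plain C++, assuming a made-up ToyP4 struct in place of the four-momentum returned by jet.p4(); the names ToyP4, mass() and pairMasses() are hypothetical and exist only for this illustration.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical minimal four-momentum in Cartesian components.
struct ToyP4 {
  double px, py, pz, e;
};

ToyP4 operator+( const ToyP4& a, const ToyP4& b ){
  return { a.px + b.px, a.py + b.py, a.pz + b.pz, a.e + b.e };
}

// Invariant mass m = sqrt(E^2 - |p|^2), clamped at zero against rounding.
double mass( const ToyP4& p ){
  const double m2 = p.e*p.e - p.px*p.px - p.py*p.py - p.pz*p.pz;
  return m2 > 0 ? std::sqrt( m2 ) : 0.0;
}

// Masses of all unordered pairs (j < k): no self-pairs, no duplicates.
std::vector<double> pairMasses( const std::vector<ToyP4>& jets ){
  std::vector<double> out;
  for( std::size_t j = 0 ; j < jets.size() ; ++j ){
    for( std::size_t k = j + 1 ; k < jets.size() ; ++k ){
      out.push_back( mass( jets[j] + jets[k] ) );
    }
  }
  return out;
}
```

For n jets this yields n(n-1)/2 masses instead of the n*n values printed by the full double loop.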

Arguments against the EDM file format

In all the analyses I have come across, the EDM file format is typically disfavored compared to plain TTree files (commonly referred to as n-tuples). I will list the usual reasons here, together with why I think these arguments are becoming increasingly invalid.

  • Forced ties to the CMSSW environment: This might have been a problem when internet connections were not as reliable as they are now. But nowadays, logging onto a machine with the correct environment should not be a problem most of the time, and a uniform environment makes debugging easier in most cases.

  • The EDM file format is large: It is true that an EDM file contains a much larger header than a normal TTree file. But the bulk of an EDM file actually comes from the physics object classes, which store a lot of variables. Flattening these objects into arrays with some variables dropped can reduce the file size, but it comes with the risk that a wanted variable is thrown out, and it adds an extra layer of code that must be maintained in addition to the analysis itself (n-tuplizer codes are notoriously hard to maintain). I believe that focusing on removing the events and objects that are not useful to the analysis in question is a much more effective way of reducing the storage space an analysis requires.

  • The EDM file format is difficult to learn: This is perhaps the only valid argument in my opinion. Learning how to read and modify EDM files of your own is not easy. Learning ROOT is already difficult enough because of sparse documentation (and designs that run extremely counter to C++ intuition); learning CMSSW and EDM file manipulation is even harder, for the same reason. But all in all, I think it is worth the time.

Closing words

I have said that manipulating the contents of an EDM file is intuitive. This is where the brilliance of CMSSW really shines, but it is not easy to learn, as it:

  • Requires extensive knowledge of C++ features (templates, virtual functions)
  • Lacks a source file with a main function that you can edit directly: you can only interact with the main function through a python file.

In a lot of cases, I could not tell you how the CMSSW framework is doing what it is doing. But hopefully I can give you enough information to get started and get familiar with this strange and powerful framework. In the next part, I will start getting you familiar with looping through EDM files in the CMSSW framework: first how to read the file, and then how to modify its contents.