In this article we are going to explore the (supposed) rationale behind why the CMSSW framework is designed as it is, as well as getting a first look at how one could easily alter the contents of an EDM file using this framework. As before all, the concrete code used in this article could be found here
Hiding the Loop
By now, you might have known that the core component of the main function for data processing would center around the event-by-event for loop, a skeleton code for the main function might look something like:
// code for opening file
for( run.toBegin() ; !run.atEnd() ; run++ ){
// come per run calculations...
for( lumi.toBegin() ; !lumi.atEnd() ; ++lumi ){
// some per lumi calculations
for( event.toBegin() ; ! event.atEnd() ; ++event() ){
// some per-event calculations...
// some other per-event calculations...
//...
}
}
}
// Save calculations of files
There are a few rationals why the framework might want to hide the main loop from the user:
- In mass data processing, the framework might not want to give the user freedom to manipulate the ordering of the for loop. For example there might be specialization for multi-threading.
- The framework might what to optimization global object handling, such as the the pointer to handle the event contents.
When designing the framework, one will still need to give the user an interface to defined calculation flows. One way of doing so would be using the concept of polymorphism, which is implemented in C++ as the virtual function syntax:
class BaseCalculator {
virtual void perrun_calc( run& ) {} ; // do nothing is not redefined by derived class
virtual void perlumi_calc( lumi& ) {} ;
virtual void perevent_calc( event& ){} ;
}
class MyCalc1 : public BaseCalculator {}; // implement one kind of calculation
class MyCalc2 : public BaseCalculator {}; // implement another kind of calculation
And the definition of the main functions would turn into something like this:
// open file
vector<BaseCalculator*> processList ;
// Load the user defined calculation
processList.push_back( new MyCalc1 );
processList.push_back( new MyCalc2 );
for( run.toBegin() ; !run.atEnd() ; run++ ){
for( auto& calc : processList ){
calc->perrun_calc( run );
}
for( lumi.toBegin() ; !lumi.atEnd() ; ++lumi ){
for( auto& calc : processList ){
calc->perlumi_calc( lumi );
}
for( event.toBegin() ; ! event.atEnd() ; ++event() ){
for( auto& calc : processList ){
calc->perevent_calc( event );
}
}
}
}
// Save calculations of files
The merit of using this structure for the end user is that one could separate
independent calculations from each other, so if we are only interested in
altering one specific calculation, you could do so without having to comb
through the entire high level control flow. Of course, there is the problem
that calculations might depend on the results of other user calculations. So
how to define the control flow couldn’t be as simple as above with a simple
list of pointers. Luckily for use, the CMSSW
framework as a set of very handy
tools that could help us with the calculations. As will be demonstrated later.
After understanding the rationale behind the “why” when interacting with the CMSSW framework, you won’t be handling the high level control flows directly, we begin to get on the actual code for defining user calculations.
EDProducer
A Look at one calculator class: Structure overview
Let us look at the structure required for using a
EDProducer
derived class to define calculation. It follows the standard
Package/Subpackage
structure required by CMSSW
, but this time the C++ code
is written in the plugins
directory, and also an extra python
directory.
Package
└── MyReader
├── plugins
│ ├── BuildFile.xml
│ └── MyReader.cc
└── python
└── ConfFile_cfg.py
Let us look at the MyReader.cc
file first, which should give you a rough idea
of all the functions that are available in the calculator class. It does the
same as the FWLite
example given in the previous article, dumping the
contents of the “Gaussian” variable stored in an EDM file, except using the C++
technology described above:
#include <memory>
#include "FWCore/Framework/interface/Frameworkfwd.h"
#include "FWCore/Framework/interface/stream/EDProducer.h"
#include "FWCore/Framework/interface/Event.h"
#include "FWCore/Framework/interface/MakerMacros.h"
#include "FWCore/ParameterSet/interface/ParameterSet.h"
#include "FWCore/Utilities/interface/StreamID.h"
class MyReader : public edm::stream::EDProducer<>
{
public:
explicit MyReader(const edm::ParameterSet&);
~MyReader();
private:
virtual void beginStream(edm::StreamID) override;
virtual void endStream() override;
virtual void beginRun(edm::Run const&, edm::EventSetup const&) override;
virtual void endRun(edm::Run const&, edm::EventSetup const&) override;
virtual void beginLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) override;
virtual void endLuminosityBlock(edm::LuminosityBlock const&, edm::EventSetup const&) override;
virtual void produce(edm::Event&, const edm::EventSetup&) override;
};
you could think of these private member functions corresponding to these execution points in the imaginary loop code:
// <--- beginStream() called here
for( run.toBegin() ; !run.atEnd() ; run++ ){
// <-- beginRun() called here
for( lumi.toBegin() ; !lumi.atEnd() ; ++lumi ){
// <--- beginLuminosityBlock() called here
for( event.toBegin() ; ! event.atEnd() ; ++event() ){
// <--- produce() called here
}
// <--- endLuminosityBlock() called here
}
// <--- endRun() called here
}
// <--- endStream() called here
But note that the file opening and saving process are not handled by the
beginStream
/endStream
functions. We will get to that in a bit. Now for the
part of actually writing the “calculation” part of the code in the the
produce
member function.
Getting the Handle class
Since the CMSSW
main framework is designed with fast computing in hand, so
the Handle
class while similar in usage to that we have seen previously,
requires slightly different initialization.
The edm::EDGetToken
class
is mandatory to be initialized during the constructor via a consumes()
call,
and this is used to get the Handle
class we will be interested in.
class MyReader : public edm::stream::EDProducer<>
{...
private:...
const edm::EDGetToken _mytoken;
};
// Getting the token via the consumes call.
// The Input Tag format is identical to that of the fwlite.
MyReader::MyReader(const edm::ParameterSet& iConfig):
_mytoken( consumes<double>( edm::InputTag("myprod","Gauss","Dummy")) )
{} // No other initialization for now
// Defining the calculation to be called per event
void MyReader::produce(edm::Event& iEvent, const edm::EventSetup&)
{
edm::Handle<double> myHandle;
// Note the big difference in the getByToken() call compared to fwlite!
iEvent.getByToken( _mytoken, myHandle );
// After a sucessful getByToken call
// myHandle is essentially a double pointer
std::cout << *myHandle << std::endl;
}
Take special note of the much more complex way of inserting the three magic
strings to get the Handle object. Also take note that the Handle
and Event
class are no longer using the one in the fwlite
namespace, but the edm
work
space.
Exposing the class as a plugin.
The control flow part should be rather straight forwards, but there are still a
few magic lines of code required to let the CMSSW
framework understand this
class as a calculation class. At the very end of the file, we need to call the
MACROS function:
DEFINE_FWK_MODULE(MyReader);
Be sure that the argument in the MACROS functions is the same as the name of
your class. In the case that you have decided to split you calculator class
across multiple .cc
files, make sure that this MACROS function is called
exactly once. Otherwise, you will get repeated definition error during the
compilation.
Now we look at the required <use>
tags needed to add to the BuildFile.xml
file in the plugins
directory:
<use name="FWCore/Framework" />
<use name="FWCore/PluginManager" />
<use name="FWCore/ParameterSet" />
<flags EDM_PLUGIN="1" />
Again, the use
tag is used to list external libraries as dependencies. The
minimal requirements for the “calculator” class are the three listed above. The
last line is new! This time, instead of saying we are compiling to an external
library, we are making a “plugin” for manipulating EDM
files (again with the
name left to be default by the ="1"
flag). Now we are ready to compile!
scram b
A python configuration file for the main function
Next we would want to know how to run the main function, and how to alter its behavior, such as:
- What input file to use.
- What output file (if any) to use
- What calculator classes to load
The main function containing the event loops is invoked via the cmsRun
command provided by the CMSSW
environment, but to alter the behavior of this
main function, you need to provide it with a configuration file read, which is
a python file that should be intuitive to read, and later down the line, allows
the configuration itself to be programmable! But first, lets look at a minimal
example the file in directory MyReader/python/ConfFile_cfg.py
, where the file
is basically a bunch of declaration for what the main function should use.
# Getting the required python modules
import FWCore.ParameterSet.Config as cms
# Defining the process
# You can say that process will be the object what is converted to
# what the main function should do.
process = cms.Process("MyProcess")
# How many events before early exit (-1 for effectively every event)
process.maxEvents = cms.untracked.PSet( input = cms.untracked.int32(-1) )
# Define input, note that for local files, the prefix 'file:' is required
# so the below is referring to the local file ./edm_example.root
process.source = cms.Source(
"PoolSource",
fileNames = cms.untracked.vstring(
'file:edm_example.root'
)
)
# Loading the producer, notice the string must the same as
# the argument you placed in the C++ MACROS function!
# but the name of the member in the process instance can be anything you want.
process.myProducer = cms.EDProducer(
'MyReader'
)
# Defining sequence of producer instances you want to run.
# Again, the name of the member in the process instance could be anything you want.
process.mypath = cms.Path(process.myProducer)
A note to the folks that are not familiar with python syntax. The data members in a python class are completely dynamic, meaning you can add data members at will. The python memory manager will also keep track of what type each variable is declared by. So what the main file will be translated to will be something like this:
process = ReadFromConfig( "ConfFile_cfg.py" )
# Opening the file
open( process.source.fileName )
# Defining the calculation classes to load via cms.Path variables
calclist = []
for member in get_all_member( process ) :
if get_type( member ) == get_type( cms.Path ):
calclist.extend( member.content )
## main for loop
for evt in opened_file :
for calc in calclist :
calc.produce( evt )
It must be stressed that this is a very rough “translation”, and is definitely
not what is actually implemented in the CMSSW
framework. The core concept of
what to take away from this translation is that ultimately what matters in the
configuration file would be the final data members in the cms.Process
instance.
Having written you configuration file, you could then run the following command:
cmsRun ConfFile_cfg.py
And you should get the output that looks something like:
29-Aug-2016 20:53:16 CST Initiating request to open file file://edm_example.root
29-Aug-2016 20:53:17 CST Successfully opened file file://edm_example.root
Begin processing the 1st record. Run 1, Event 383601, LumiSection 1919 at 29-Aug-2016 20:53:18.320 CST
2.9968
Begin processing the 2nd record. Run 1, Event 383602, LumiSection 1919 at 29-Aug-2016 20:53:18.321 CST
2.34539
...
Begin processing the 10th record. Run 1, Event 383610, LumiSection 1919 at 29-Aug-2016 20:53:18.322 CST
6.78134
29-Aug-2016 20:53:18 CST Closed file file://edm_example.root
=============================================
MessageLogger Summary
type category sev module subroutine count total
---- -------------------- -- ---------------- ---------------- ----- -----
1 fileAction -s file_close 1 1
2 fileAction -s file_open 2 2
type category Examples: run/evt run/evt run/evt
---- -------------------- ---------------- ---------------- ----------------
1 fileAction PostEndRun
2 fileAction pre-events pre-events
Severity # Occurrences Total Occurrences
-------- ------------- -----------------
System 3 3
Congratulations! You have completed an event loop using the full CMSSW framework! Notice that what you actually calculate and defined to be outputted is in between the messages pre-defined by the CMSSW framework for loop.
Making the producer produce something
As suggested by the name, the EDProducer
this isn’t just a calculator class.
The EDProducer
class has functionalities that actually produces something
that could be placed into an output EDM
. Suppose I want to compute the square
of all the Gauss
variables and store them in the event. We would need to
alter the constructor:
MyProducer::MyProducer(const edm::ParameterSet& iConfig):
_mytoken( consumes<double>(edm::InputTag("myprod","Gauss","Dummy")))
{
produces<double>("GaussSquare");
}
which tell the main function that we want to produce a double
collection,
called GaussSquare
. Next we need to define how the GaussSquare
variable
should be calculated.
void MyProducer::produce(edm::Event& iEvent, const edm::EventSetup& iSetup)
{
edm::Handle<double> myHandle;
iEvent.getByToken( _mytoken, myHandle );
std::auto_ptr<double> myptr( new double );
*myptr = (*myHandle) * (*myHandle);
iEvent.put( myptr, "GaussSquare" );
}
The three new lines creates a new pointer to a double, store the results that
we want, and associate the double with the GaussSquare
name. If you are
wondering what an
auto_ptr
is, it is
essentially the same as the pointer, except it automatically deletes at the end
of the scope it is declared in. The use of this STL class, or its successor
class std::unique_ptr
to call the put
member function is mandatory.
Next we need to add the following lines to the python configuration file:
# Same as before, but with the MyProducer instance instead
process.myProducerInstance = cms.EDProducer(
'MyProducer'
)
process.p = cms.Path(process.myProducerInstance)
# We wish to output a EDM files containing all the stuff produced in this process
process.out = cms.OutputModule("PoolOutputModule",
fileName = cms.untracked.string('myoutput.root'),
)
# Marking the process as a EndPath (to be ran after everything else)
process.e = cms.EndPath( process.out )
Next you could run the command
cmsRun python/Conf1.py
You should see a new file called myoutput.root
, and if you dump the contents
of this file you should see:
Type Module Label Process
---------------------------------------------------------------------------------
double "myprod" "Gauss" "Dummy"
int "myprod" "Poisson" "Dummy"
double "myProducerInstance" "GaussSquare" "ProduceProcess"
You now have a new collection! You could also being to see where the three
magic strings used to get objects in the EDM
files are coming from:
- The module name is the name you defined for the
cms.EDProducer
instance in the python file - the process name is the name you gave the
cms.Process
instance in the python file. - The Label name is defined in the
C++
file on theproduces
call in the producer constructor. A common practice in the standard CMS data files is for each producer to produce only one variable for one type. In which case you can leave the entry in theproduces
call empty.
It is worth noting that most of the standard “analysis recipes” in CMS revolves
around using EDProducers
to add more objects into an EDM file during runtime.
For instance, we make a joint configuration to first product the “GaussSquare”
variable before passing the calculation results to our MyReader
class. In the
next section, we will be introducing tool for better high level control flow in
the python file, as well as a better look at the interface between the python
and c++ interface.
FWLite
and the full framework
Comparison between As you have probably seen, a CMSSW framework is aimed for event looping and
event looping only. If you want to combine various EDM
files with different
weight associated with each file, it is do-able, but very hard. The CMSSW
full framework is meant to dealing with bulk calculations across many files.
The reason for using plugins and such C++ technologies is used so that the full
framework could also multithreaded with minimal amount of additional user
coding. The full framework should be used to reduce the bulky, general purpose
files into a handful of variable/objects that are interesting to an analysis,
and then a more flexible fwlite
framework could be used to read the output
and produce the final calculation.