Scope
After more than 20 years in the computer business we still lack consistent performance monitoring across operating systems: each system deploys its own type of monitoring and data collection. UNIX systems stay somewhat closer to each other, since they are all POSIX systems and follow similar industry standards, such as those of The Open Group.
The main idea behind SDR is to deploy a number of lightweight recorders that collect and store various operating system metrics, such as CPU, memory, disk, and network usage, over a long period of time without disrupting production. The recording process is kept very simple: each recorder samples at a fixed interval and stores several performance metrics as a time series. The recorded data is stored as plain ASCII text, so it is easy to consolidate and to access from any third-party system.
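The sampling idea can be sketched in a few lines of Python. This is only an illustration, not one of the SDR recorders: the function names, the colon separator, the two-decimal precision, and the use of the load averages as a stand-in metric are all assumptions.

```python
import os
import time


def read_load_averages():
    """Return the 1, 5 and 15 minute load averages (Unix-only stand-in metric)."""
    return list(os.getloadavg())


def format_sample(timestamp, metrics):
    """Render one sample as a single ASCII line: timestamp first, then the metrics.

    The colon separator and two-decimal precision are assumptions.
    """
    fields = [str(int(timestamp))] + ["%.2f" % value for value in metrics]
    return ":".join(fields)


def record(out, read_metrics, interval, count, clock=time.time, sleep=time.sleep):
    """Take `count` samples, one every `interval` seconds, appending each to `out`."""
    for i in range(count):
        out.write(format_sample(clock(), read_metrics()) + "\n")
        out.flush()  # keep the file current even if the recorder is stopped
        if i < count - 1:
            sleep(interval)
```

Calling record(sys.stdout, read_load_averages, interval=5, count=3) would print three samples, five seconds apart. The clock and sleep parameters exist only so the loop can be exercised without waiting.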
Another idea is to make SDR available across many operating systems, with a data collection process that is standard across (Open)Solaris, Linux, and BSD. In this sense SDR acts like a black box in which a large number of application and operating system metrics are stored and analyzed.
SDR can help where the budget is limited and the time to deploy a solution is an important factor for your site. You don't want to spend a lot of time setting up an expensive monitoring system built on a relational database and a complicated reporting module that takes time to set up and learn.
Raw data
To keep things simple, SDR makes all collected metrics available as variables measured sequentially in time, called time series. All these observations, collected at fixed sampling intervals, form a historical time series. To give any person or application access to this volume of data, the historical time series are stored on commodity disk drives, compressed but in text format.
Time series let us understand what has happened in the past and look into the future using various statistical models. In addition, having access to these historical time series helps us build a simple capacity planning model for our application or site.
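As an illustration of the simplest kind of capacity planning model, the sketch below fits a straight line to a series of (timestamp, value) samples and estimates when the metric would reach a given limit. It is not part of SDR. The function names are hypothetical, and a linear trend is only one possible model, which ignores seasonal variations.

```python
def linear_trend(samples):
    """Least-squares line through (timestamp, value) pairs: returns (slope, intercept)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def time_to_reach(samples, limit):
    """Timestamp at which the fitted line reaches `limit`, or None if the trend is flat or falling."""
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope
```

For example, feeding it daily disk usage samples and the disk size as the limit gives a rough estimate of when the disk would fill up if growth stays linear.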
Agentless or agent-based data collection?
Certain monitoring systems use the concept of agentless recording: a system that runs on a centralized machine and executes operating system commands or custom probes via SSH or RSH. One example is HP SiteScope.
In contrast with such systems, SDR tries to stay simple and clear: list and define all recorders used to collect the performance data, and store that data somewhere anyone can easily access it. Each recorder must be installed on every system we plan to collect data from.
Main points about the current recording operation mode:
simplicity and easy control over what is collected: the raw data
time series: keep a history of OS metrics, look for trends and seasonal variations
range of recorders based on their activity: cpu, mem, net
modular approach: enhance some recorders without disrupting other consumers or modules
Probe Definition
A recorder is defined as a lightweight probe developed in KSH, Perl, or even C, which can talk directly to the Kernel Statistics interface, if available, and extract operating system metrics. The probe can also interact directly with a userland process and obtain the required metrics. This should happen without creating additional load or otherwise affecting the measured production environment.
Each recorder should be capable of accessing operating system interfaces without calling additional utilities, and should display its data in the following manner:
where timestamp is expressed as Unix time (POSIX time) and metricN are the values collected from the OS or application.
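A consumer of this output only needs to split each line into a timestamp and a list of values. The sketch below assumes a colon-separated layout (timestamp first, then metric1 to metricN), similar to what rrdtool update accepts. The separator is an assumption, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_sample(line, sep=":"):
    """Split one recorder line into (timestamp, [metric1, ..., metricN]).

    The separator is an assumption; pass a different `sep` if needed.
    """
    fields = line.strip().split(sep)
    if len(fields) < 2:
        raise ValueError("expected a timestamp and at least one metric: %r" % line)
    timestamp = int(fields[0])  # Unix time, seconds since the epoch
    metrics = [float(f) for f in fields[1:]]
    return timestamp, metrics


def to_utc(timestamp):
    """Convert Unix time to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```

Keeping the timestamp as plain Unix time in the raw file avoids any timezone ambiguity. Conversion to a readable date happens only when the data is displayed.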
Data analysis
The next step after data collection is to gather and store all servers' data in a centralized place, from where we can start analysing and estimating the capacity in use. To analyse and digest all this collected data, SDR offers a way to gather the raw data from each server and keep it safe over a long period of time. Built on top of RRD, the high performance data logging and graphing system, SDR can generate reports in a matter of minutes for all kinds of time-series data collected by the recording module. In addition, coupled with the PDQ analytic solver, SDR can be used to model a certain workload and predict future growth.
Why System Data Recorder?
Recording
License free
Very simple
Control over raw data in a matter of minutes
Raw data: time series, easy statistical modeling of trends and seasonal variations
Very simple to change, add, or remove whatever you need
Easy to work with other modules, for example PDQ
Simple to educate your IT staff
Reporting
License free
Very simple
Not an analytics package
Statistical models in R and the PDQ analytical solver help you examine your data and predict the future
You don't have to click through dozens of options and links to get what you need
Saves you time in front of the computer by being simple and to the point
Simple to educate your IT staff
Design
The System Data Recorder is simply organized into two main parts: the collection part, which handles recording the data from each system, and the reporting side, where we permanently store the data, generate simple reports and graphs, and perform the analysis. Some configurations can use only the recording part, without the reporting side at all.
The data recorder module consists of many simple utilities developed in Korn shell, Perl, or C that extract different telemetry from the operating system. Some recorders also gather their data from various processes, directly using OS or third-party utilities. We try to keep the number of dependencies for our main recorders to a minimum.
There are 5 recorders that should be installed and deployed on any system, plus optional recorders required only for certain cases: JVM monitoring, or dedicated hardware platforms such as Niagara-powered CMT servers. SDR was mainly developed around the (Open)Solaris operating environment because of its powerful observability capabilities and robust features. However, SDR is currently being ported to FreeBSD and Red Hat operating systems.
If your system uses some form of virtualization, the recorders operate from the global level. If the virtualization involves domains or Xen technology, the recorders are deployed in each of those systems.
Recorded data:
System CPU, Mem, Disk and Network utilisation, Queuing statistics
Zone utilisation
CPU statistics: cross calls, system and user time
CMT core utilisation: T1, T2
NET UDP, IP, TCP/IP statistics of each zone
NIC statistics
Java Virtual Machine: Garbage Collector statistics
There are five main recorders: sysrec, netrec, nicrec, zonerec, and cpurec. Each of these recorders is a simple Perl or Ksh utility running as a separate process, lightweight and designed not to hog your system when it is under high utilisation. Some others are developed in C to reduce the footprint as much as possible. In addition we have corerec and jvmrec, recorders that should be deployed on systems based on the CMT architecture or running a Java Virtual Machine.
Each recorder is operated by SMF, the (Open)Solaris Service Management Facility, which keeps it running by restarting it automatically if it fails or exits unexpectedly. On other operating systems, where no similar framework is available, we use the standard approach of starting and monitoring daemon processes with a watchdog. SMF has a number of advantages, such as dependency checking, which is easy to implement: the recorders should not start if the local filesystem is not mounted or the network interfaces are not present.
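The watchdog approach amounts to starting the recorder, waiting for it to exit, and starting it again. Below is a minimal sketch of that loop in Python, not the actual SDR watchdog. The function name is hypothetical, and the restart cap and fixed pause between restarts are assumptions added so that the example terminates.

```python
import subprocess
import time


def supervise(command, max_restarts, backoff=1.0):
    """Run `command`, restarting it each time it exits, up to `max_restarts` times.

    Returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        child = subprocess.Popen(command)
        exit_codes.append(child.wait())  # block until the recorder exits
        if attempt < max_restarts:
            time.sleep(backoff)  # pause so a broken recorder cannot spin the CPU
    return exit_codes
```

A production watchdog would typically loop without a cap and log every restart. Unlike SMF, this sketch knows nothing about dependencies such as mounted filesystems or network interfaces.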
Each recorder writes its data to a file called the raw output file. Every night we rotate this file with the logadm utility and compress it, so the stored data stays small and easy to transport to our reporting system. The majority of collectors record raw data directly in RRD format, which is easy to import into the Round-Robin Database system, the final place where the data is stored. The default retention time is 1 year, but this can easily be changed.
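For systems without logadm, the same rotate-and-compress step can be approximated in a few lines. The sketch below only illustrates the idea and does not describe how logadm works. The function name and the date-style suffix are assumptions.

```python
import gzip
import os
import shutil


def rotate_and_compress(raw_path, suffix):
    """Move `raw_path` aside, gzip it as `raw_path.suffix.gz`, and start a fresh empty file.

    Returns the path of the compressed archive.
    """
    rotated = "%s.%s" % (raw_path, suffix)
    os.rename(raw_path, rotated)
    open(raw_path, "w").close()  # new empty raw output file
    archive = rotated + ".gz"
    with open(rotated, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(rotated)  # keep only the compressed copy
    return archive
```

Note that a recorder which keeps its output file open would continue writing to the renamed file, so in practice the recorder has to reopen its output after rotation.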
Last updated: 2010-05-17