Why have consistent recording?

After more than 25 years of the computer business, we still lack consistent performance monitoring across operating systems: each system deploys its own type of monitoring and data collection. UNIX systems stay reasonably close to one another, since they are all POSIX systems and follow similar industry standards, such as those of The Open Group.

If we step back and look at how other industries do it, we see a completely different picture.

  • Aerospace industry: FDR. Airplanes, for example, carry a recorder, usually a device called a flight data recorder (FDR), which stores aircraft data parameters. Such a unit is fitted by default on many airplanes nowadays, and its usage is regulated by governments and federal administrations, for example the FAA in the United States. This device is sometimes referred to as the black box.

  • Shipbuilding industry: VDR. Ships, boats and other types of vessels carry a recorder called a voyage data recorder (VDR), which stores vessel data parameters. As in the aerospace industry, such devices are required when a vessel must comply with international standards, for example the International Convention for the Safety of Life at Sea (SOLAS). Although used mainly for accident investigation, the VDR can also serve preventive maintenance, performance efficiency monitoring, heavy weather damage analysis, accident avoidance and training purposes, to improve safety and reduce running costs. This device is sometimes referred to as the black box.

  • Auto industry: EDR. Automobiles carry a device called an event data recorder (EDR), which stores vehicle parameters. Again, such devices can serve as the main source for accident investigations. EDRs are not enforced by any standards organization and are not really required by law, so their usage varies from vendor to vendor. The National Highway Traffic Safety Administration (NHTSA) proposed a series of changes to standardize EDRs and make their installation and usage mandatory for vendors. Around 2010, over 85% of all vehicles in the US would already have some sort of EDR installed.

  • Computer industry: None. Computers, mainframes, servers and workstations have no such recording devices installed. Manufacturers are not interested in standardizing this effort, since they prefer selling additional software packages which perform such recording for an extra cost. The lack of standardization and agreement between vendors has resulted in a completely different picture from other industries. Currently, there are hundreds of performance monitoring solutions for computer systems.

What if we adopted what other industries use and defined a number of standard recorders, found on every computer system, whether it is a database or an application server? And what if they worked the same way no matter what the operating system is? Wouldn't that be great?


SDR

System Data Recorder. Four main recorders are responsible for storing overall system performance metrics (CPU, memory, disk and network) over long periods of time on commodity storage, disks or SSDs. Additional specialized recorders, for example for Java Virtual Machine, process or network protocol statistics, make SDR a complete recording system suitable for many types of computer systems.


Data Recording

Time series

To keep things simple, SDR makes all collected metrics available as variables measured sequentially in time, called time series. The observations collected at fixed sampling intervals form a historical time series. To ease access to this data, SDR simply records and stores the observations on commodity disk drives, compressed, in text format.

Time series let us understand what has happened in the past and look into the future using various statistical models. In addition, having access to these historical time series helps us build a simple capacity planning model for our application or site.
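As one illustration of a "simple capacity planning model", the sketch below fits a straight line through (timestamp, value) observations and estimates when the trend would reach a limit. It is written in Python purely for illustration; the function names linear_trend and time_to_reach are hypothetical and are not part of SDR, and a linear trend is only one of many possible statistical models.

```python
def linear_trend(samples):
    """Least-squares line through (timestamp, value) pairs.

    Returns (slope, intercept) so that value ~= slope * timestamp + intercept.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("timestamps must not all be equal")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    return slope, mean_v - slope * mean_t


def time_to_reach(samples, limit):
    """Timestamp at which the fitted line reaches `limit`.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = linear_trend(samples)
    if slope <= 0:
        return None
    return (limit - intercept) / slope
```

For example, feeding it disk usage percentages with a limit of 100 gives a rough estimate of when the disk would fill up if the current trend continued.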

Raw Data

We call all recorded observations raw data. This data is not modified or altered in any way; it is exactly what we collected from the computer system. Its format is simple, as already mentioned: the collected parameters are separated by colons (:). Each recorder writes and stores all collected parameters in such a raw data file for the entire duration of its execution. By default, the SDR raw data file extension is sdrd, for system data recorder datafile.
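Because the raw data is plain colon-separated text, it can be read back with very little code. The following Python sketch is an illustration only: it assumes gzip compression when the file name ends in .gz (the compression format is not specified here), assumes the first field is a numeric timestamp, and uses hypothetical names (parse_record, read_records). Metric values are kept as strings so that nothing is altered on the way in.

```python
import gzip


def parse_record(line):
    """Split one raw data line into (timestamp, [metric strings])."""
    fields = [f.strip() for f in line.strip().split(":")]
    if len(fields) < 2:
        raise ValueError(f"not a raw data record: {line!r}")
    # Assumption: the first field is a numeric timestamp.
    return float(fields[0]), fields[1:]


def read_records(path):
    """Yield parsed records from a plain or gzip-compressed raw data file."""
    # Assumption: compressed files are gzip and carry a .gz suffix.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                yield parse_record(line)
```

Note that this simple split would misread a metric value that itself contains a colon, so it only fits recorders whose values are colon-free.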

Recorders

The recording process consists of a number of running recorders: light probes, developed in a dynamic language like Perl5 or Java, which talk directly to operating system interfaces and extract the parameters we are interested in. For example, on Linux-based systems we extract various metrics directly from the /proc interface. On Solaris systems we interact with the kstat interface to record all needed parameters.
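To make the Linux case concrete, here is a minimal sketch of reading CPU counters straight from /proc/stat without calling any external utility. It is written in Python for illustration, not in the Perl5 used by the actual recorders, and the function names are hypothetical. It relies only on the documented layout of the aggregate "cpu" line, whose first four counters are user, nice, system and idle time in clock ticks.

```python
def parse_cpu_line(stat_text):
    """Return the first four counters of the aggregate "cpu" line.

    `stat_text` is the content of /proc/stat; counters are in clock ticks.
    """
    for line in stat_text.splitlines():
        parts = line.split()
        # The aggregate line is labelled exactly "cpu"; per-CPU lines
        # are labelled "cpu0", "cpu1", and so on.
        if parts and parts[0] == "cpu":
            names = ("user", "nice", "system", "idle")
            return dict(zip(names, map(int, parts[1:5])))
    raise ValueError("no aggregate cpu line found")


def read_cpu_counters(path="/proc/stat"):
    """Read and parse the counters from a live Linux system."""
    with open(path) as fh:
        return parse_cpu_line(fh.read())
```

The counters are cumulative since boot, so a recorder would normally report the difference between two consecutive samples rather than the raw values.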

Each recorder should be capable of accessing operating system interfaces without calling additional utilities, and should display its data in the following manner:

timestamp : metric1 : metric2 : ... : metricN

where timestamp is expressed as Unix time (POSIX time) and metric1 to metricN are the values collected from the OS or application.

There are four main recorders: sysrec, cpurec, nicrec and diskrec. Each recorder runs as a separate Perl5 process, independent of the others. Because the recorders are autonomous, they are very flexible to operate. In addition, there are other recorders which collect other types of system or application data: netrec, jvmrec, hdwrec, webrec. See below for a complete list of all available recorders, or check our documentation.
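The output format above can be sketched as a small sampling loop. This is an illustrative Python outline, not the project's Perl5 code: format_record and record are hypothetical names, the timestamp is assumed to be written as whole seconds, and sample stands for any function that returns the current metric values.

```python
import time


def format_record(timestamp, metrics):
    """Build one output line: Unix time first, then each metric value."""
    return " : ".join([str(int(timestamp))] + [str(m) for m in metrics])


def record(sample, interval, count, out, clock=time.time, sleep=time.sleep):
    """Write `count` records to `out`, one every `interval` seconds.

    `sample` is a callable returning the metric values for one observation.
    `clock` and `sleep` are injectable so the loop can be tested.
    """
    for i in range(count):
        out.write(format_record(clock(), sample()) + "\n")
        out.flush()  # make each observation visible to readers immediately
        if i + 1 < count:
            sleep(interval)
```

A call such as record(lambda: [1.5, 2], 60, 10, sys.stdout) would print ten lines at one-minute intervals; a real recorder would run until stopped instead of for a fixed count.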

Recorders:

  • sysrec: overall system cpu, mem, disk and net utilization
  • cpurec: per-cpu statistics
  • nicrec: per-NIC statistics
  • diskrec: per-disk drive statistics
  • dbrec: Oracle, MySQL, PostgreSQL database recorder
  • corerec: SPARC CMT T1, T2 processor statistics
  • netrec: UDP, IP, TCP/IP statistics
  • hdwrec: hardware, software inventory
  • jvmrec: Java Virtual Machine Garbage Collector statistics
  • procrec: per-process statistics
  • webrec: HTTP response time statistics
  • zonerec: Solaris zone statistics

Certain monitoring systems use the concept of agentless recording: a system runs on a centralized machine and executes operating system commands or custom probes on a number of hosts via SSH or RSH. One example is HP SiteScope.

From the start, we wanted to store performance metrics from one or many computer systems no matter what condition these systems are in: heavily used, idling or severely utilized. To avoid starting a new process on each system every time we want to read the system metrics, we chose to deploy a minimal number of recorders on each system we want to monitor.


Transport

All observations are recorded for a number of days on each computer system. However, we would like to send this data to a reporting backend where we can analyze it and view it visually. There are currently two ways to transport sdrd raw data for analysis: instant mode and batch mode.

Instant Mode

The first way of transporting raw data to a reporting backend system is instant mode. In this mode, the output of each data recorder is scanned by a special utility, sender, which is responsible for detecting each change and sending the data over an SSH2 channel for analysis. Sender periodically scans all sdrd raw data files listed in its XML configuration and sends the changes securely to the reporting backend. This way we ensure that recorded data arrives for analysis as soon as it is produced.
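One common way to detect changes in an append-only file is to remember how many bytes have already been handled and read only what follows on the next scan. The Python sketch below shows that idea; it is not a description of how sender is actually implemented. The function name read_new_data is hypothetical, the file is assumed to be uncompressed while it is being written, and the SSH2 transfer itself is deliberately left out.

```python
import os


def read_new_data(path, offset):
    """Return (new_complete_lines, new_offset) for data appended since `offset`."""
    size = os.path.getsize(path)
    if size < offset:
        # The file shrank, so it was truncated or replaced: start over.
        offset = 0
    with open(path, "rb") as fh:
        fh.seek(offset)
        chunk = fh.read()
    # Hand over complete lines only; a partially written last line
    # stays in the file and is picked up by the next scan.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8").splitlines()
    return lines, offset + end
```

A periodic scan would call this once per configured file, pass the returned lines to whatever performs the transfer, and store the returned offset for the next round.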

Batch Mode

In the other mode, where we want to see changes less often, for example every 24 hours, we transport the sdrd raw data to a reporting backend system for analysis once every 24 hours using the raw2day utility. This utility simply transports all sdrd data recorded for one day using SSH2 or FTP.
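Since every record starts with a Unix timestamp, picking out one day's worth of data is a simple range check. The sketch below illustrates this in Python; it does not describe raw2day's internals, the names day_bounds and records_for_day are hypothetical, and day boundaries are assumed to be in UTC.

```python
from datetime import date, datetime, timedelta, timezone


def day_bounds(day):
    """Return (start, end) Unix timestamps of a UTC calendar day, end exclusive."""
    start = datetime(day.year, day.month, day.day, tzinfo=timezone.utc)
    return start.timestamp(), (start + timedelta(days=1)).timestamp()


def records_for_day(lines, day):
    """Keep only the raw data lines whose timestamp falls within `day`."""
    lo, hi = day_bounds(day)
    selected = []
    for line in lines:
        stamp = line.split(":", 1)[0].strip()
        try:
            t = float(stamp)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if lo <= t < hi:
            selected.append(line)
    return selected
```

The selected lines are returned unchanged, so the batch stays raw data in the sense described earlier.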


Analysis

Finally, there is the question of how to view and analyze all the recorded data. The reporting module is responsible for this part, but we briefly describe here how the recording part packages the collected data and prepares it for the reporting module.

Data Visualization Engines

Having access to all recorded observations, the raw data, we can use basically any type of analysis software we want: R, RRDtool, cpuplayer. We are not restricted to a relational database management system from any vendor, like Oracle or IBM, nor do we need any proprietary type of database to store the data.

Currently we use RRDtool, the high-performance data logging and graphing system, as our performance database. We generate our reports and plots mainly with RRDtool. However, we are not restricted to RRDtool, since we have all the raw data. We also use R, the statistical computing and graphics language, to analyze recorded data using heatmap or treemap representations. Finally, we like to bring geometry and performance analysis closer together: cpuplayer is a visual player of multiprocessor data from cpurec which uses barycentric coordinates to display the CPU state transitions from IDLE to USER or SYSTEM time.
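The barycentric idea can be sketched briefly: the three CPU states become the corners of a triangle, and each sample becomes the point whose barycentric weights are the fractions of time spent in each state. The Python code below is an illustration under assumed conventions; the corner positions and the name cpu_point are hypothetical and do not describe cpuplayer's actual layout.

```python
import math

# Assumed corner positions of an equilateral triangle, one per CPU state.
VERTICES = {
    "idle": (0.0, 0.0),
    "user": (1.0, 0.0),
    "system": (0.5, math.sqrt(3) / 2),
}


def cpu_point(idle, user, system):
    """Map idle/user/system time to an (x, y) point inside the triangle.

    The inputs may be percentages or tick counts; they are normalized
    so that the three weights sum to 1.
    """
    total = idle + user + system
    if total <= 0:
        raise ValueError("at least one state must have non-zero time")
    weights = {"idle": idle / total, "user": user / total, "system": system / total}
    x = sum(weights[k] * VERTICES[k][0] for k in VERTICES)
    y = sum(weights[k] * VERTICES[k][1] for k in VERTICES)
    return x, y
```

A fully idle CPU sits on the idle corner, and as load grows the point moves toward the user or system corner, so plotting successive samples traces the transitions over time.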