Tuesday, August 6, 2013

Analysing NetApp sysstat: The CP columns

The CP type is displayed in the CP/ty column in the output of sysstat. The column contains two pieces of data: the first character is the cause of the CP (the CP type), and the second character is the ‘phase’. In the output below, the first row shows a CP type of “T” and a phase of “f”. The second row shows a CP type of “:”, which just means that the same CP was still ongoing the next time sysstat sampled the internal CP counters. We get a bit more insight into this process from the CP/time column, which represents the share of the sample interval that was spent in the CP. So, when we are sampling every second, the CP took about half a second (54%) of the first sample and then continued into the second sample for 17% of one second. We can therefore assume that the CP started in the second half of the first sample period and continued a little way into the next sample.
filer*> sysstat -x 1
    CP    CP Disk    FCP iSCSI   FCP  kB/s iSCSI  kB/s
    time  ty util                 in   out    in   out

    54%   Tf  92%    183     0     4 145228     0     0
    17%   :   96%    201     0     5 149234     0     0
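The arithmetic behind that reading can be sketched in a few lines of Python. This is only an illustration: the function name cp_seconds is made up for this post, and it assumes that the CP time percentages of consecutive rows belong to the same CP (that is, the later rows show type “:”).

```python
def cp_seconds(cp_time_percentages, interval_seconds=1.0):
    """Convert consecutive 'CP time' percentages into seconds spent in a CP.

    Assumes every percentage belongs to the same CP, i.e. the rows after
    the first one show the continuation type ':'.
    """
    return sum(pct / 100.0 * interval_seconds for pct in cp_time_percentages)


# The two samples above: 54% of the first second, 17% of the next one.
total = cp_seconds([54, 17])
print(f"The CP lasted roughly {total:.2f} s")
```

With a 1 second interval this gives roughly 0.71 seconds for the whole CP. For a different sampling interval, pass interval_seconds accordingly.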
The table below describes the CP types as per the sysstat manpage. In a system with little or no incoming data, a CP is still generated artificially by a timer that fires once every 10 seconds; this is CP type “T”.
When there is a high incoming data rate, the filer tries to free up resources before they become exhausted, because exhaustion would mean that the user sees high write latencies. When everything is working well, as is normally the case, the incoming write latency is the time it takes to write the user's data into NVRAM (plus the network round trip time). This is how the filer is able to achieve extremely high write rates, even for a random IO pattern. If the filer is not able to keep up with the incoming workload, it will sometimes show CP type “B”, which can mean that the filer is continually in a state of CP, and the user workload can see higher latencies as a result.
The log full CP (type “F”) literally means that the NVRAM log is 50% full, so the filer must write this dirty data out to disk. Whilst the CP is happening, the other 50% of the NVRAM is used to accept more incoming data, so the clients should see the same low latency as when there is no CP ongoing.
A CP type of “H” indicates that the filer has a large number of dirty buffers in the system, and even though the NVRAM is not yet 50% full, the filer issues a CP in order to free up RAM. This CP type is sometimes seen on filers with a small amount of RAM and a large incoming write rate with a small IO size to random offsets.
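To make the relationship between the “T”, “F”, “B” and “H” types easier to see, here is a toy model in Python. It is not how Data ONTAP is implemented. The function cp_trigger, its parameters and the dirty buffer threshold are invented for illustration; only the 10 second timer and the half-full NVRAM rule come from the text above.

```python
TIMER_SECONDS = 10        # from the text: the timer fires every 10 seconds
DIRTY_BUFFER_LIMIT = 0.8  # invented threshold standing in for the high water mark


def cp_trigger(active_half_fill, dirty_buffer_fill, seconds_since_cp, cp_in_progress):
    """Return the CP type letter this toy model would report, or None.

    active_half_fill  -- fill level (0.0 to 1.0) of the NVRAM half that is
                         currently accepting writes; 1.0 means NVRAM is 50% full
    dirty_buffer_fill -- hypothetical fraction of RAM buffers that are dirty
    seconds_since_cp  -- seconds since the previous CP
    cp_in_progress    -- True while an earlier CP is still being written out
    """
    if cp_in_progress:
        # The second half filled before the first was flushed: back to back CP.
        return "B" if active_half_fill >= 1.0 else None
    if active_half_fill >= 1.0:
        return "F"  # log full
    if dirty_buffer_fill >= DIRTY_BUFFER_LIMIT:
        return "H"  # high water mark on dirty buffers
    if seconds_since_cp >= TIMER_SECONDS:
        return "T"  # nothing else fired, so the timer does
    return None


print(cp_trigger(0.1, 0.1, 10, False))  # idle system
print(cp_trigger(1.0, 0.2, 3, True))    # heavy writes during a running CP
```

The order of the checks is a design choice of this sketch: the conditions caused by load are tested before the timer, so “T” only shows up when nothing more urgent has fired.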
CP type “Z” is often caused by snapshot creation or deletion. Remember that snapshot deletion can be triggered by several conditions besides a user typing “snap delete” at the console. Examples of ‘automatic’ deletion are:
  • Snapshot deletion to recover space on the aggregate
  • Snapshot deletion, to maintain a specific snap schedule
CP types and phases:

CP Types                                           CP Phases
B – Back to back CPs (CP generated CP)             0 – Initializing
b – Deferred back to back CPs (CP generated CP)    n – Processing normal files
F – CP caused by full NVLog                        s – Processing special files
H – CP caused by high water mark                   f – Flushing modified data to disk
L – CP caused by low water mark                    v – Flushing modified superblock to disk
S – CP caused by snapshot operation
T – CP caused by timer
U – CP caused by flush
Z – CP caused by internal sync
: – Continuation of CP from previous interval
# – Continuation of CP from previous interval, and the NVLog for the next CP is now full, so that the next CP will be of type B

Here are explanations of some of the columns in the output of the NetApp sysstat command.
Cache age : The age in minutes of the oldest read-only blocks in the buffer cache. Data in this column indicates how fast read operations are cycling through system memory; when the filer is reading very large files, the buffer cache age will be very low. The cache age will also be low if reads are random. If you have poor read performance, this number may indicate that you need a system with more memory, or that you should analyze the application to reduce the randomness of the workload.
Cache hit : This is the WAFL cache hit rate percentage. It is the percentage of times that WAFL tried to read a data block from disk and found the data already cached in memory. A dash in this column indicates that WAFL did not attempt to load any blocks during the measurement interval.
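As a small illustration of that definition, the value could be derived as below. The function and its two counters are hypothetical names for this post; sysstat only prints the finished percentage.

```python
def cache_hit_column(blocks_requested, blocks_found_in_cache):
    """Format a 'Cache hit' value from two hypothetical counters."""
    if blocks_requested == 0:
        # No block loads were attempted in this interval: show a dash.
        return "-"
    return f"{100 * blocks_found_in_cache / blocks_requested:.0f}%"


print(cache_hit_column(200, 150))
print(cache_hit_column(0, 0))
```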
CP Ty : Consistency Point (CP) type is the reason that a CP started in that interval. The CP types are as follows:
  • - No CP started during sampling interval (no writes happened to disk at this point of time)
  • number Number of CPs started during sampling interval
  • B Back to back CPs (CP generated CP) (The filer is having a tough time keeping up with writes)
  • b Deferred back to back CPs (CP generated CP) (the back to back condition is getting worse)
  • F CP caused by full NVLog (one half of the nvram log was full, and so was flushed)
  • H CP caused by high water mark (rare to see this. The filer has a large number of dirty buffers in memory, so it decides to write them to disk even though the NVRAM log is not yet half full)
  • L CP caused by low water mark
  • S CP caused by snapshot operation
  • T CP caused by timer (every 10 seconds filer data is flushed to disk)
  • U CP caused by flush
  • : continuation of CP from previous interval (a CP is still in progress, which is common when sampling at 1 second intervals)
The type character is followed by a second character which indicates the phase of the CP at the end of the sampling interval. If the CP completed during the sampling interval, this second character will be blank. The phases are as follows:
  • 0 Initializing
  • n Processing normal files
  • s Processing special files
  • f Flushing modified data to disk
  • v Flushing modified superblock to disk
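If you are post-processing sysstat output, the two-character field can be decoded with a simple lookup. The sketch below is in Python and covers the letters listed in this post, including “Z” and “#” from the first table. The function name decode_cp_ty and the wording of the returned strings are my own choices, not part of sysstat.

```python
CP_TYPES = {
    "B": "back to back CPs",
    "b": "deferred back to back CPs",
    "F": "full NVLog",
    "H": "high water mark",
    "L": "low water mark",
    "S": "snapshot operation",
    "T": "timer",
    "U": "flush",
    "Z": "internal sync",
    ":": "continuation of CP from previous interval",
    "#": "continuation, and the NVLog for the next CP is already full",
}

CP_PHASES = {
    "0": "initializing",
    "n": "processing normal files",
    "s": "processing special files",
    "f": "flushing modified data to disk",
    "v": "flushing modified superblock to disk",
}


def decode_cp_ty(field):
    """Split a 'CP ty' value such as 'Tf' into (cause, phase).

    phase is None when the second character is blank, which means the
    CP completed during the sampling interval.
    """
    field = field.strip()
    if not field:
        raise ValueError("empty CP ty field")
    if field == "-":
        return ("no CP started", None)
    if field.isdigit():
        return (f"{field} CPs started", None)
    cause = CP_TYPES.get(field[0], "unknown type")
    phase = CP_PHASES.get(field[1], "unknown phase") if len(field) > 1 else None
    return (cause, phase)


print(decode_cp_ty("Tf"))
print(decode_cp_ty(":"))
```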
CP util : The Consistency Point (CP) utilization, shown as CP time in the output above: the percentage of the sampling interval spent in a CP. One reading is that a value at or near 100% is good, because all of the time allocated to writing data was used, whereas 75% means that a quarter of that time was wasted. Read it together with the CP type, though: as described earlier, a filer that shows type “B” and is continually in a state of CP can mean higher latencies for the user workload.
You can use NetApp's SIO tool to benchmark NetApp systems. SIO is a client-side workload generator that works with any target. It generates I/O load and collects basic statistics to show how any type of storage performs under given conditions.
