{"id":215,"date":"2017-09-02T10:00:47","date_gmt":"2017-09-02T00:00:47","guid":{"rendered":"https:\/\/www.itfault.com.au\/main\/?p=215"},"modified":"2017-09-21T21:02:12","modified_gmt":"2017-09-21T11:02:12","slug":"case-cpus-becoming-100-busy","status":"publish","type":"post","link":"https:\/\/www.itfault.com.au\/main\/2017\/09\/02\/case-cpus-becoming-100-busy\/","title":{"rendered":"The case of CPUs becoming 100% busy"},"content":{"rendered":"<p>It was a curious thing. A website had been deployed to powerful production servers. These servers had many CPU cores.<\/p>\n<p><figure id=\"attachment_216\" aria-describedby=\"caption-attachment-216\" style=\"width: 249px\" class=\"wp-caption alignleft\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.itfault.com.au\/main\/wp-content\/uploads\/2017\/08\/2017-08-31-htop-3-cpu-cores-busy.png\" alt=\"Three CPU cores busy\" width=\"249\" height=\"58\" class=\"size-full wp-image-216\" \/><figcaption id=\"caption-attachment-216\" class=\"wp-caption-text\">Three CPU cores busy<\/figcaption><\/figure> Yet every morning the system administrators would notice several processes running at 100% CPU utilisation. At first it was just one or two rogue web application processes which could be easily killed off. But as the days went by, more and more processes would be spinning at 100% CPU utilisation each morning.<\/p>\n<p>I immediately became suspicious of the application code, and I asked to see any and all code containing <a href=\"https:\/\/en.wikipedia.org\/wiki\/While_loop\"><code>while<\/code> loops<\/a>.<\/p>\n<h3>What is risky about while loops?<\/h3>\n<p>Let me digress for a moment. 
Why would I be suspicious of <code>while<\/code> loops?<\/p>\n<p>The issue is that, unlike other kinds of loops, such as the <a href=\"https:\/\/en.wikipedia.org\/wiki\/For_loop\"><code>for<\/code> loop<\/a>, a <code>while<\/code> loop relies on the programmer to update the <i>condition variable<\/i> inside the loop code.<\/p>\n<p>A simple <code>while<\/code> loop is difficult to get wrong. For example:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\/* iterate through users *\/\r\ncounter = 0;\r\nwhile ( counter &lt; numusers ) {\r\n  printf( &quot;User %s\\n&quot;, user&#x5B;counter] );\r\n  counter++;\r\n}\r\n<\/pre>\n<p>This code will always complete because <code>counter<\/code> will always reach <code>numusers<\/code> eventually and cause the loop to end.<\/p>\n<p>The <code>counter<\/code> variable is <i>the<\/i> &#8220;condition variable&#8221; for this loop because the <code>while<\/code> loop has a condition dependent on the <code>counter<\/code> variable. And this loop does, indeed, update the condition variable every iteration (by the <code>counter++;<\/code> statement).<\/p>\n<p>The problem lies in larger organisations when multiple people are making changes to source code &#8211; perhaps when making a fix in response to a bug report. They may add some code like this:<\/p>\n<pre class=\"brush: cpp; highlight: [4,5,6,7,8]; title: ; notranslate\" title=\"\">\r\n\/* iterate through users *\/\r\ncounter = 0;\r\nwhile ( counter &lt; numusers ) {\r\n  \/* bug report 14132 - ignore user &quot;secretuser&quot; *\/\r\n  if ( strcmp( user&#x5B;counter], &quot;secretuser&quot; ) == 0 ) {\r\n    \/* match! loop again to avoid printing secretuser *\/\r\n    continue;\r\n  }\r\n\r\n  printf( &quot;User %s\\n&quot;, user&#x5B;counter] );\r\n  counter++;\r\n}\r\n<\/pre>\n<p>Now we have a problem. 
If <code>\"secretuser\"<\/code> appears in the <code>user[]<\/code> array, then <code>counter<\/code> will stop being incremented once that entry is reached, because <code>continue<\/code> jumps straight back to the loop condition and skips <code>counter++<\/code>. The <code>while<\/code> loop will then spin forever.<\/p>\n<p>A <code>for<\/code> loop wouldn&#8217;t have this issue. Imagine if the loop had started out as:<\/p>\n<pre class=\"brush: cpp; title: ; notranslate\" title=\"\">\r\n\/* iterate through users *\/\r\nfor ( counter = 0; counter &lt; numusers; counter++ ) {\r\n  printf( &quot;User %s\\n&quot;, user&#x5B;counter] );\r\n}\r\n<\/pre>\n<p>and then the maintainer added the bug fix:<\/p>\n<pre class=\"brush: cpp; highlight: [3,4,5,6,7]; title: ; notranslate\" title=\"\">\r\n\/* iterate through users *\/\r\nfor ( counter = 0; counter &lt; numusers; counter++ ) {\r\n  \/* bug report 14132 - ignore user &quot;secretuser&quot; *\/\r\n  if ( strcmp( user&#x5B;counter], &quot;secretuser&quot; ) == 0 ) {\r\n    \/* match! loop again to avoid printing secretuser *\/\r\n    continue;\r\n  }\r\n\r\n  printf( &quot;User %s\\n&quot;, user&#x5B;counter] );\r\n}\r\n<\/pre>\n<p>Because <code>counter<\/code> is incremented in the <code>for<\/code> loop header, which runs after every iteration including those cut short by <code>continue<\/code>, this loop will complete even though the maintainer added a <code>continue<\/code> statement inside the loop.<\/p>\n<p>Thus <code>for<\/code> loops offer greater protection against accidental infinite looping.<\/p>\n<h3>A tight while loop is bad<\/h3>\n<p>If a CPU finds itself stuck in a tight <code>while<\/code> loop, things can be worse than one might ordinarily expect. 
That&#8217;s because the looping code might never call an operating system function &#8211; which would give the operating system a chance to consider whether anything else needs to be done (outside the application) &#8211; before returning control to the program.<\/p>\n<p>Instead the CPU is executing instructions furiously over and over again &#8211; until a scheduled timer interrupt forces the CPU to visit the operating system and something else is given a chance to run.<\/p>\n<p>So things are not utterly dire &#8211; other code does get a chance to run on the processor. But a spinning process is not well behaved and polite like most programs are, and this kind of impolite hogging of the CPU does have consequences that are worth being aware of.<\/p>\n<h3>Back to the story<\/h3>\n<p>My suspicion was confirmed when I found a <code>while<\/code> loop that had a rogue <code>continue<\/code> statement without any accompanying code to update the condition variable.<\/p>\n<p>Quite simply, I grepped for the string <code>continue<\/code> and closely reviewed every <code>while<\/code> loop that contained the keyword. That is how I found the fault.<\/p>\n<p>Indeed, it had been introduced by a bug fix, and it was only triggered occasionally because the fix added an exception to normal operation. In testing, this exception had not been encountered.<\/p>\n<p>But in production the exception was being triggered on occasion by real customers &#8211; and when it was, their web session froze (never responded) and the process locked up at 100% CPU utilisation.<\/p>\n<p>It was fortunate this issue was identified, and fixed, before whole servers became overwhelmed with CPU load.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>It was a curious thing. A website had been deployed to powerful production servers. These servers had many CPU cores. Yet every morning the system administrators would notice several processes running at 100% CPU utilisation. 
At first it was just one or two rogue web application processes which could be easily killed off. But as the &hellip; <\/p>\n<p class=\"link-more\"><a href=\"https:\/\/www.itfault.com.au\/main\/2017\/09\/02\/case-cpus-becoming-100-busy\/\" class=\"more-link\">Continue reading<span class=\"screen-reader-text\"> &#8220;The case of CPUs becoming 100% busy&#8221;<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[22],"tags":[],"class_list":["post-215","post","type-post","status-publish","format-standard","hentry","category-technical"],"_links":{"self":[{"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/posts\/215","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/comments?post=215"}],"version-history":[{"count":15,"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/posts\/215\/revisions"}],"predecessor-version":[{"id":334,"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/posts\/215\/revisions\/334"}],"wp:attachment":[{"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/media?parent=215"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/categories?post=215"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.itfault.com.au\/main\/wp-json\/wp\/v2\/tags?post=215"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}