In my last blog entry I talked about the X-Link Enterprise Edition Beta Results and how it improved message delivery times by reduction of 77% across the board. We looked at how and why the results were obtained and came to the conclusion that X-Link Enterprise Edition does perform high volume message delivery much better than the application and/or service editions.
But we left one item open: The final graph was showing message delivery times were degrading. The degradation is only affecting a very small portion of the messages, but none the less, is clearly visibly in the graphs.
What did we find?
We used the same X-Link message statistics database as was used in part 1 of this blog entry. We then began the process of more thoroughly understanding what this data means. Our initial review of the logs for messages with long delivery times didn't show any specific delay issues, but occasionally (during a handful of messages) we found X-Link was simply not processing the message as fast as expected - there was an unexplained delay in processing that wasn't obvious what was causing it, but it was responsible for up to about 45 seconds of delay on these messages. Over 99% of the messages were delivered within 12 seconds. In other words, this degradation is not significant - the client will only know of it via this blog. However, I was concerned that under larger loads, the degradation may become significant. EG, it was important to understand it fully to know how it may affect users in the future, including this client.
The graph below is a zoom of the final graph in part 1 showing just November 17th-24th.
What's interesting to note is that the delays appear to occur each day just after midnight and linger for as long as a couple hours, except on Sunday morning. The tech whose been working closely with this client, after seeing this graph, said he knows what is up. At midnight each night, X-Link stops all processing and performs a clean up function. This function removes old temporary files, etc. All TCPip connections are also stopped. Apparently the EHR this client uses does not recover from a lost of connection gracefully. There is some sort of fix in place that restarts the EHR connections, but there is some sort of delay. Once we knew of this issue, it was easy to document it in the logs.
We found 99% of the time when X-Link sent a message to the EHR, the EHR acknowledged the message within 0.15 seconds. However, we did find a couple that took 8 seconds for the EHR to respond. Then using searches, we found many more delayed responses and several times when the EHR host never responded for 20 seconds.
This delay is the major component of the delays we are seeing in the above graph. Basically there is nothing we can do about the EHR's host being slow to respond. And X-Link will eventually deliver the message when the host is ready, so this isn't necessarily an issue.
In addition to the above delays, I noticed a log that had messages stating patients were locked. But the messages were delivered in approximately 2 minutes, not the 5 minutes expected delay when updating a patient whose record is locked. Turns out, this locked record was showing up when reading patients that are being changed in the PM by a user. X-Link when reading waits 20 seconds to allow time for the patient to unlock, then tries again on the next cycle to read the patient again.
These two issues appeared to account for almost all of the delays other than an occasional up to 8 second delay. This delay appeared between X-Link tasks, so it looked like it was not X-Link, but something was blocking X-Link from processing a few messages as quickly as the rest. We had already ruled out the database access times. We didn't have any historical data on network, disk or memory usage. But we did have a ton of data (4.5 million data points) collected by X-Link on CPU usage. We decided to use the data we had and see if CPU utilization was a contributing factor or not.
The graph below shows the internal processing times for X-Link at the customer site. It represents the sum of all Engines operating.
Immediately we noticed this graph corresponded directly with the deliver times graph. And the disturbing item to note was the green layer - X-Link Scheduling and other non-data transfer activities.
What did we learn?
We found that the current degradation of the delivery times is not due to X-Link, but mainly to the slow responses from the EHR and to a lesser extent, the locking of patients when reading patients from the PM.
We previously learned that posting changes to locked patients in the PM caused 5 minute delays in message delivery.
What we don't know yet is what the growing green area is caused by in the graph above. Does it contribute to the message delivery delays? And can it be reduced or eliminated? That is the topic of Part 3 of this blog entry.