I had users report problems this morning where incoming calls weren't ringing into their KIRK phones, but were ringing other phones. Looking in the KWS300 logs, I'm seeing lines like:
notice SIP INVITE request from:20 to:28 Transaction timeout
(this was for an internal call from ext 20 to 28)
I checked & found the KWS300 PCS08 Firmware Release Notes which indicates the that Transaction Timeout is actually referring to SIP "controls time B and F as specified in RFC3261". So my interpretation is basically that the KWS300 received the call request from the handset, and tried to initiate the call (via SIP INVITE), but never received a response back over the network from the SIP server (asterisk in our case). Does this seem right to others?
However, during the same problematic time period, non-KIRK phones were ringing just fine, and incoming calls from the SIP trunk provider were still coming in. So obviously asterisk was online, and connected to the network, which is how it was able to ring our other phones.
My logical approach is to check whether there were network issues that affected only the switch(es) that our KWS300 goes through that would have prevented it from talking to the asterisk box. Does this seem correct to others?
Any other insight or suggestions here? Anywhere else I could/should be looking?
Your diagnosis is perfect in my view
On the KWS you can do a packet capture - it has a limited buffer but a great tool for this type of issue.
DO this when you have this issue.
STATUS - PACKET CAPTURE menu in the GUI
Just enable capture SIP packets click <START> then you will see in wireshark what is happening
I'd also trace on the call handler (sip server) at the same time to see if it gets the packets then you will know for sure whether it is a network, KWS or SIP server issue.
OK. I'll give it a try. It's always tricky to try to catch these things "in the moment". Seems like usually I'm away & unavailable, just until the problem has stopped.... :-)
I did setup on the asterisk box, a cron job running every minute, which pings both KWS300 units, as well as an ATA on the network & the default gateway. It is also pulling some of the stats showing current SIP peers (registered extensions), etc. I have it running every minute & dumping its output to a log file. That should help determine whether this issue exists under the application layer in the OSI model (or TCP/IP mode). It will be my eyes in the event that I can't get to the issue immediatley next time.
I did see the packet capture option, but was hesitant to enable it without knowing what type of load it would add to the KWS300. Are you saying that it is safe for me to enable the capturing of SIP packets now & leave it enabled, even for days or weeks until the next time the problem occurs? I just don't want to run the thing out of memory & kill call quality, etc.
Thanks for your help!
Let me preface this by saying that I still haven't moved & properly wall-mounted the 2 KWS300 units, because I am needing a hole to get drilled through a wall prior to me being able to run the cables to the proper location.
With that said:
We've been having reports of calls sometimes not ringing some or all of the KIRK phones. We have 4 KIRK 4040 phones on 1 KWS300 and 5 KIRK 4040 phones on another KWS300.
I've been chasing the issue around for a while. Recently I found that I our call processor (asterisk) is logging every time that these phones become unreachable, and then again when they become rechable. I didn't know that they were not reachable in the first place. But after analyzing logs, it does appear that there is an absolute correlation between the times when the phones aren't ringing AND when asterisk shows the handsets as unrecheable. Additionally the KWS300 systems will report "transaction timeout" for any attempted outbound calls during this period.
Interestingly enough, the data that I observed from the logs, showed the phones becoming "reachable" again to asterisk pretty shortly after the attempted outbound call. This causes me to suspect whether the KWS is intelligent enough to realize that if an outbound call failed, then maybe it should try to reregister its SIP extensions with the SIP server. I'm assuming here that reregistering with the SIP server is what causes asterisk to show the phones as "reachable again". I may be wrong though, as I'm not an expert on SIP. Just learning more & more as I go along.
One important note though is that when the extensions become unreachable, it happens for ALL extensions on that KWS300. As indicated above, this means that it happens for either 5 or 6 extensions, depending on which KWS300 it is. Today it has already happened at least 11 times, and it has been the same KWS300 every time. But I'm pretty sure that I actually saw it happening with both when I checked into this originally.
The tricky part
So at this point I think that if I had more experience & expertise with SIP in the first place, I'd already have figured out the solution.
I am assuming that when my SIP server tells me that all of the extensions on 1 KWS300 are unreachable (which happens to all of them at the exact same instant), this is the equivelant of them becoming "unregistered". And then I suspect that when the SIP server shows them as reachable again (which also happens to all of its extensions in an instant), this is the equivelant of them becoming "registered".
Assuming I'm right here, then I have to ask the question: What causes extensions to become unregistered? In my case, it's happening to all extensions on a single KWS300 at once.
Is there a "time out" that is occuring? Could asterisk be "unregistering" these extensions because they are not checking in (reregistering) often enough? Do SIP extensions even have a reregister/check-in functionality? If not, then how is the SIP server (asterisk) supposed to be aware of what extensions are registered/online, and which are not? Usually with computer networking applications this type of functionality is accomplished through some kind of "Keep alives". I presume that there is some kind of equivelant with SIP (whether by reregistering, or a more traditional keepalive, or something else....)?
Anyway, I'm not sure whether this is just a matter of me needing to adjust some settings on asterisk or the KWS300 units that has to do with the something along the lines of a keepalive, or whether I'm having an actual network connectivity issue.
I did actually setup a cron-job on the asterisk server that runs every minute & pings both KWS300 systems w/ 10 packets. I have its output logging to a special logfile that I can review and correlate with the other logs. There have been 4 times today where 1-3 packets out of 10 were lost. However, considering that the cron job is running every minute, this may be a small amount of loss. I don't think that the loss should exist at all. However, it is normal to see some loss every now & then waiting for an ARP reply. I guess I could modify my script to output the arp table before it tries to ping, just to confirm whether this is indeed an issue. However, generally speaking the loss is concerning, even though it may be less than 1% packet loss.
So my question really comes back to "How do SIP extensions keep registered with their SIP server"?
Anyway, I've asked a few different questions in here.
If anyone has insight into the situation, I would appreciate it.
I am definitely wondering whether there is any possibility that the KWS300 units could be exhibiting some bug...... Hopefully/probably not...?
welcome to the Polycom Community.
The SIP Registration is handeled by the KWS itself and not by the individual extensions so therefore if the registration is lost on all this explains hopefully why.
The Registration expire Parameter under Configuration => SIP specifies the duration a SIP Extensions.
The reason why the Phones/KWS/Asterisk looses the registration should be within the logs and it may be outside the scope of the community to further investigate into this.
The best way forward would be to liaise with your Polycom Reseller and/or Polycom Support to proceed.
The latest Software for the KWS300 is PCS10 so you may want to upgrade this first before proceeding.
The way the Unit is currently installed may cause issues with the DECT reception but should not cause the Registration/Subscription issues reported by you.
Polycom Global Services
OK. I have contacted our vendor's support team. Hopefully they will help me get a resolution.
FYI, today this incident has happened 16 time so far with the same KWS300. I checked the SIP Re-register setting on the KWS300 and it is currently 3600 (1 hour). It states that the SIP server (asterisk) can override this and require it to be more frequent, and I believe the implication is that the KWS300 will reregister at whatever interval asterisk tells it to. I do not know whether asterisk is telling the KWS300 to register at a different interval. I'd have to research how to determine this.
Of the 16 times it's happened today, the actual times have been scatterred. Below are lines from my asterisk logs, filtering by just one of the extensions on the KWS300. After each log line, I've manually typed the number of minutes that passed prior to the next incident.
[Nov 14 07:02:12] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
18 minutes later
[Nov 14 07:20:46] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
37 minutes later
[Nov 14 07:57:02] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
15 minutes later
[Nov 14 08:12:26] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
21 minutes later
[Nov 14 08:33:02] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
19 minutes later
[Nov 14 08:52:52] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
37 minutes later
[Nov 14 09:29:53] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
35 minutes later
[Nov 14 10:04:42] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
58 minutes later
[Nov 14 11:02:32] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
32 minutes later
[Nov 14 11:34:02] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
14 minutes later
[Nov 14 11:48:22] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
33 minutes later
[Nov 14 12:21:54] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
15 minutes later
[Nov 14 12:36:43] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
18 minutes later
[Nov 14 12:54:12] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 10
43 minutes later
[Nov 14 13:37:25] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
16 minutes later
[Nov 14 13:53:02] NOTICE chan_sip.c: Peer '20' is now UNREACHABLE! Last qualify: 9
So I can't see a pattern in regards to some type of consistent timeout. Perhaps the next step is to figure out whether asterisk has a shorter timeout setting than 1 hour, which could explain why this is sometimes happening nearly 3 times inside of 1 hour (and otherwise sometimes only 15 minutes apart).
If asterisk is NOT requiring a shorter reregistration expiry, then it would seem to me that the KWS300 is actually sending a SIP message to unregister the extensions. But I have no idea why that would be happening.
It will be interesting to get to the bottom of this one.
It's definitely worth noting that the above log lines & number of minutes between are only giving the times & intervals between each incident where the extensions went offline. This does not take into consideration that between each incident the extensions did come back online again. I haven't looked at those numbers, but I'm not too worried about them. If only we can kepe the phones from going offline, then we should be fine.
I think this case is best handled though direct contact with with your Polycom Reseller and/or Polycom Support.
This will mean that full sets of logs and traces can be looked at both at the SIP level and the call handler level. It is going to be a simple thing but just needs that direct contact to solve this issue.
I'm happy to take that approach & it's already in motion. Do you think there is a problem if I post the solution back here? I think that it could help someone in the future who may encounter the same (or similar) issue.
By the way, I did check & with asterisk, UNREACHABLE does indeed mean unregistered, and Reachable does mean registered. So I think I am on the right track.