Plantronics + Polycom. Now together as Poly Logo

SoundPoint IP 650 lockups

SOLVED
Valued Contributor

SoundPoint IP 650 lockups

I plan on enabling logging to a syslog server and seeing if I can manage to capture anything, but I thought I'd also check here to see if any other SPIP650 users have been experiencing problems with their phones behaving erratically.

 

We have 2 650s in our office, and the one that the main receiptionist has is equipped with a sidecar.  I've been getting complaints ever since this phone was installed about a month ago that on an almost-daily basis somebody needs to reboot it.  It will get into a state where calls to the phone do not cause it to ring (either audibly or visually), and it sometimes even becomes nonresponsive to button pushes.

 

I have not been able to catch it in the act and nobody seems to know how to reproduce it or what the exact circumstances are that surround the problem.  I did once observe it in a bizarre state where one of the line lamps (the first) was continuously blinking green, as if it was receiving a call, but this was despite the fact that it wasn't getting a call.  The phone was nonresponsive to me pressing the line key (I couldn't answer the "phantom" call and it wouldn't give me a dialtone).  It also acted as if it were on an active call -- it was showing "End Call" on one of the soft buttons -- but there was no call counter on the display and pressing End Call did not result in any action or response from the phone, either.  Shortly after I got there and observed all of this, the phone suddenly came out of its coma.

 

This is the only phone in the entire office that does this.  We have roughly 20-some-odd 335s, four 450s, and two 650s.  I have tried swapping the troublesome 650 at the receptionist's desk with the other 650 (for the backup receptionist), and the problem did NOT follow the phone hardware: the main receptionist continued to have these issues, and the backup receptionist did not.  The only difference between the two phones is that the phone that is having trouble has a sidecar attached, and the other one does not.  All 6 "line" buttons on the main phone are programmed as line keys and the sidecar buttons are all statically programmed as BLFs for other extensions throughout the office using the "attendant.resourceList.#" configuration parameters.  When I swapped phones, I also moved the sidecar to the other phone.  So far, the problem has always stayed with whatever phone happens to have the sidecar attached to it and the associated "attendant.resourceList" programming, so I'm wondering if there is a UC software bug related to one of those two things (presence of sidecar or use of "attendant.resourceList", or perhaps both together).

 

All phones used to run 4.0.2B.  I pushed out 4.0.5 to the entire office, but the same problems continue to haunt that phone.

 

Any hints, clues, or reaffirmations of my sanity would be appreciated.

 

-- Nathan

Message 1 of 45
1 ACCEPTED SOLUTION

Accepted Solutions
Highlighted
Valued Contributor

Re: SoundPoint IP 650 lockups

I am pleased to announce that, by my testing, this issue has been completely resolved in the UC 4.0.7 software released today (issues VOIP-93121 and VOIP-93082 may have both been involved, but it is unclear to me from the descriptions given to them in the release notes which one was ultimately the fix for the issue I reported).

 

I re-ran my stress-test scripts against a 650 with 1 sidecar attached and 19 BLFs configured on it running 4.0.7.2514, and it didn't break a sweat.  The memory usage did not climb and climb and climb as it would before.  With 4.0.6, this same script would crash a phone in a couple of minutes.  I ran the script for over 2 hours and memory utilization didn't budge.  Even more importantly, the phone continued to function perfectly fine after I terminated running the script.  The UC software finally appears to be able to gracefully take a beating insofar as the number of incoming SIP messages per second goes.

 

My thanks to Polycom for delivering a fix.

 

-- Nathan

View solution in original post

Message 32 of 45
44 REPLIES 44
Highlighted
Polycom Employee & Community Manager

Re: SoundPoint IP 650 lockups

Hello  Nathan,

welcome back to the Polycom Community.

The information about the BLF's may be the root cause of your issue.

 

Do you have a large group that has multiple phones in them ringing at the same time?

 

If yes then this is causing the phone to be overloaded with all the SIP NOTIFY messages.

 

In order to host such a scenario you would need to configure the lines as shared lines so the receptionist can answer the line and then in addition use additional speed dials to dial the individual as the phone is simply unable to handle the traffic in such scenarios.

 

Above is all based on assumption and only a proper wireshark trace and matching logs via your Polycom reseller would enable our support team to confirm this.


Please ensure to provide some feedback if this reply has helped you so other users can profit from your experience.

Best Regards

Steffen Baier

Polycom Global Services

----------------
The title Polycom Employee & Community Manager is a community setting and does not reflect my role. I am just a simple volunteer in the community like everybody else. My official "day" Job is 3rd Level support at Poly but I am unable to provide official support via the community.

----------------

Notice: This community forum is not an official Poly support resource, thus responses from Poly employees, partners, and customers alike are best-effort in attempts to share learned knowledge. If you need immediate and/or official assistance please open a service ticket through your proper support channels.
Please also ensure you always check the VoIP , Video Endpoint , Skype for Business , PSTN or RPM FAQ's
Message 2 of 45
Highlighted
Valued Contributor

Re: SoundPoint IP 650 lockups

Interesting.  Is this a known issue, and are there any plans to improve the phone's handling of NOTIFYs in future firmware updates?

 

What do you think is a reasonable number of simultaneous NOTIFYs for the phone to handle?  Also, do you expect that if they did not come in simultaneously but in rapid succession (within a second or two of each other), that this might also cause phone lockups?

 

I would be surprised if the system is generating enough NOTIFYs to cause problems.  We don't have anybody who is in a ring group (or similar construct) assigned to any of the buttons on the 650 or the sidecar.  I would be shocked if there were more than 4-5 phones with BLF assignments on the sidecar having their states change at roughly the same time, at the very most.

 

It sure doesn't seem like having 4-5 state changes should cause these kinds of issues.  If the phone is capable of housing multiple 14-button sidecars, having it struggle with parsing an arguably average (or lower) number of state changs seems crazy.  Our decrepit Nortel system had a sidecar with 48 buttons on it, and it handled the "load" (well, the KSU did, but whatever...) without breaking a sweat.

 

-- Nathan

 

 

Message 3 of 45
Highlighted
Polycom Employee & Community Manager

Re: SoundPoint IP 650 lockups

Hello Nathan,

 

my reply was based on an assumption and any detailed troubleshooting is outside the scope of the community.

 

It was just an example and meant to help you to troubleshoot the issue with an example on how to replicate.

 

Best regards

 

Steffen Baier

----------------
The title Polycom Employee & Community Manager is a community setting and does not reflect my role. I am just a simple volunteer in the community like everybody else. My official "day" Job is 3rd Level support at Poly but I am unable to provide official support via the community.

----------------

Notice: This community forum is not an official Poly support resource, thus responses from Poly employees, partners, and customers alike are best-effort in attempts to share learned knowledge. If you need immediate and/or official assistance please open a service ticket through your proper support channels.
Please also ensure you always check the VoIP , Video Endpoint , Skype for Business , PSTN or RPM FAQ's
Message 4 of 45
Highlighted
Valued Contributor

Re: SoundPoint IP 650 lockups

Steffen,

 

Thanks for the clarification.  FYI, I have managed to capture logs from before the crash, and there is definitely a difference in the logs between when the config includes the attendant.resourceList.* parameters and when it does not.  At this point, based on what I'm seeing, I do not believe that the issue is with the rate of NOTIFY packets that are being received by the phone.  It actually looks more like a software bug.

 

Without attendant.resourceList.* programming, the syslog output is almost dead-silent after bootup.  With the attendant.resourceList.* programming, I see messages like this extremely frequently...several times a minute, in fact:

 

[SoMediaSess:PpsCallStateIdle] Dialog session is not found

 

After a few hours of this, the phone will start logging errors that almost seem to indicate it is having trouble reading to and writing from the internal flash NAND: [EDIT: a Google search for the phrase "memPartAlloc: block to big" actually reveals that this is a vxWorks message that seems to be related to a failure to allocate memory, not read to or write from flash ROM...a "partition" is not referring to a logical storage/block device in this case but rather a running task or process.  This I think would reinforce the memory leak theory presented at the end of the post.]

 

utilBufferCompress: Couldn't start compression, error -4
utilCryptoLightBufferEncrypt: Could not compress buffer size 1147.
cfgWriteAnyConfigToFlash: Could not encrypt flash data length 1147
0x95777ad0 (utilGenTask): memPartAlloc: block too big - 65536 in partition 0x94e49b0c.

 

At first it just references one "block too big" error whenever it logs this kind of thing, but that number starts to grow over the next   10-20 minutes, and it is logging blocks of errors like this every minute or two:

 

utilBufferDecompress: Error -4 in inflate
utilCryptoLightBufferDecrypt: Could not decompress buffer size 193.
cfgReadAnyConfigFromFlash: Could not decrypt flash data length 229
cfgReadAnyConfigFromFlash: error reading config, using defaults
utilBufferCompress: Couldn't start compression, error -4
utilCryptoLightBufferEncrypt: Could not compress buffer size 1147.
cfgWriteAnyConfigToFlash: Could not encrypt flash data length 1147
SoNcasC::saveMwiInfoRequest - Failed writing flash parameters
0x95777ad0 (utilGenTask): memPartAlloc: block too big - 32768 in partition 0x94e49b0c.
0x95777ad0 (utilGenTask): memPartAlloc: block too big - 65536 in partition 0x94e49b0c.
0x95777ad0 (utilGenTask): memPartAlloc: block too big - 65536 in partition 0x94e49b0c.
0x95777ad0 (utilGenTask): memPartAlloc: block too big - 65536 in partition 0x94e49b0c.
0x95777ad0 (utilGenTask): memPartAlloc: block too big - 65536 in partition 0x94e49b0c.
0x95a1a1d0 (tUtilLog): memPartAlloc: block too big - 34264 in partition 0x94e49b0c.
Couldn't initialize CURL object.
failed to set curl OVERRIDE_DNS option
failed to set curl CONNECTTIMEOUT option
UtilLogC::uploadFifoLog: upload error. protocol 0 result = -1

 

This probably goes on for about 30 minutes total from start to finish, and eventually there is what looks like a catastrophic failure that occurs with one of the phone's internal processes/threads just before it hangs and reboots:

 

0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
Could not allocate space for LLDP packet.
0x9557a250 (tLldpTx): memPartAlloc: block too big - 1500 in partition 0x94e49b0c.
pps2MediaSessDialogUpdate: memory allocation failed for msg.extraP not sending message
0x95962b80 (tPktProSys): memPartAlloc: block too big - 932 in partition 0x94e49b0c.
pps2MediaSessDialogUpdate: memory allocation failed for msg.extraP not sending message
0x95962b80 (tPktProSys): memPartAlloc: block too big - 932 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
pps2MediaSessDialogUpdate: memory allocation failed for msg.extraP not sending message
0x95962b80 (tPktProSys): memPartAlloc: block too big - 932 in partition 0x94e49b0c.
pps2MediaSessDialogUpdate: memory allocation failed for msg.extraP not sending message
0x95962b80 (tPktProSys): memPartAlloc: block too big - 932 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
rtosNetwork: rtosNetwConnectOpen() - Failed to spawn task, CCB# = 5, index = 4. 
rtosNetwork: rtosNetwConnectOpen() - Failed to spawn task, CCB# = 4, index = 3. 
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 10512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 10512 in partition 0x94e49b0c.
rtosNetwork: rtosNetwConnectClose() - task -1 is not valid.
rtosNetwork: rtosNetwConnectClose() - task -1 is not valid.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.
0x95c9d520 (tSupObjs): memPartAlloc: block too big - 2512 in partition 0x94e49b0c.

 

I of course don't know anything about the architecture of the phone firmware, but if I had to hypothesize from what I'm seeing, it almost looks like a memory leak of some kind that eventually leads to an out-of-memory condition.  At this point, the phone becomes comatose (unresponsive to button pushes, SIP INVITE messages, etc.) and does not log anything further.  If you wait long enough (15-20 minutes), the phone will eventually reboot itself.

 

I will be crossing my fingers and opening up a trouble ticket with my supplier, although I won't be holding my breath.

 

Thanks for your input,

 

-- Nathan

Message 5 of 45
Highlighted
Polycom Employee & Community Manager

Re: SoundPoint IP 650 lockups

Hello Nathan,

thanks for the detailed troubleshooting and as already mentioned by yourself the next step is a service ticket via your Polycom reseller.


Best Regards

Steffen Baier

Polycom Global Services

----------------
The title Polycom Employee & Community Manager is a community setting and does not reflect my role. I am just a simple volunteer in the community like everybody else. My official "day" Job is 3rd Level support at Poly but I am unable to provide official support via the community.

----------------

Notice: This community forum is not an official Poly support resource, thus responses from Poly employees, partners, and customers alike are best-effort in attempts to share learned knowledge. If you need immediate and/or official assistance please open a service ticket through your proper support channels.
Please also ensure you always check the VoIP , Video Endpoint , Skype for Business , PSTN or RPM FAQ's
Message 6 of 45
Highlighted
Valued Contributor

Re: SoundPoint IP 650 lockups

I am being told by our reseller that a case has been opened with Polycom.  Our case # is 1-494027911.

 

-- Nathan

Message 7 of 45
Highlighted
Valued Contributor

Re: SoundPoint IP 650 lockups

Other people seem to be experiencing the same problem that I am, and I have confirmation from them that they, too, started having the problem only after adding BLF to their phone configuration.  Thead is here:

 

http://community.polycom.com/t5/VoIP/SoundPoint-IP-450-v4-0-5-Hard-Locking/td-p/49221

 

I have pointed my reseller to the posts on this forum and asked that they pass this along to their Polycom support contacts who are working my ticket.

 

In the meantime, through experimentation, I have learned more about this bug, and have managed to set up a test/lab environment where I can reproduce the crash at will within 5 minutes.  I am working on figuring out how to bundle this test environment up and have it sent to Polycom so that they can reproduce the bug themselves on their end.

 

To avoid this bug for now, you should remove any BLFs from affected phones.  If disabling BLFs is not an option for you, I believe I may have also stumbled upon a way to prolong the phone's uptime before a crash occurs: if you have a short SIP re-registration period set on your PBX (a few minutes), try to increase it.  We had ours set to 2 minutes, and it seems that may have been exacerbating the situation.  This is only a hypothesis at this point, though, based on what I know about the bug so far.  Note that even if you dramatically increase the SIP re-registration period, this will only extend how long the phone will remain functional: even with this workaround, it will still eventually crash.  (Polycom SoundPoint IP phones seem to default to a 1 hour SIP registration expiry -- which means that they would try to re-register every 30 minutes, at the halfway point -- but if you set the minimum registration expiry time on your SIP UAS/PBX even higher, the Polycom should honor that.)

 

More soon.

 

-- Nathan

Message 8 of 45
Highlighted
Valued Contributor

Re: SoundPoint IP 650 lockups

I just sent this detailed description of the problem to Polycom through our reseller:

 

The issue we are seeing is almost definitely a memory leak, and although more BLFs will cause a crash to occur more quickly and frequently, as long as a single BLF is defined on a phone, a crash will eventually occur: it is not a matter of "if" but "when".  Our phone with 14 BLFs crashes once or more daily, but other Polycom SoundPoint IP users I have been communicating with have phones with only a single BLF defined, and they have crashes at least once weekly on those phones.

I have run some more tests, and here is what I have discovered:

I will be referring to our SoundPoint IP 650 phone with the sidecar module attached to it as the "attendant phone".  As I have described before, we have 14 BLFs configured on this phone.  I previously mentioned that in the phone logs, when we have BLFs configured, we see hundreds of lines that look like this:

[SoMediaSess:PpsCallStateIdle] Dialog session is not found


It turns out that the attendant phone logs this whenever it receives a certain BLF state update from the SIP proxy that it is registered to and subscribed to for BLF updates.  Specifically, this "dialog session is not found" error occurs whenever the phone receives a BLF update with a state of "terminated" (not busy) for an extension that the phone already thinks is in the "terminated" state.  If the attendant phone first receives an "early" (ringing) or "confirmed" (busy) state update, and then later receives a "terminated" update for that same extension, that error will not occur and will not be logged, but if the phone receives "terminated" for an extension that is not already shown to be ringing on busy, then that error is logged.

We have determined that the problem with BLFs does not have anything to do with the frequency of SIP NOTIFYs that the phone receives, but only has to do with the total number of NOTIFYs it gets that match the circumstances described above.  This is why reducing the number of BLFs configured on the phone will not prevent a crash, but only delay the inevitable: even if you have only 1 BLF defined, it will take longer for the phone to be sent whatever total number of NOTIFYs are required to cause a crash, but it will still eventually reach that number.

It turns out that the reason that we are seeing these errors with such frequency from our attendant phone is because our PBX is running Asterisk, and Asterisk will send an initial, gratuitous SIP NOTIFY for every SIP SUBSCRIBE request it gets.  If a given extension that the attendant phone is SUBSCRIBEing to is in a non-busy state at the moment the attendant phone sends the SUBSCRIBE request (which is the case the majority of the time), then Asterisk will send a NOTIFY to the attendant phone with a state of "terminated", at which point the phone will generate the "dialog session is not found" error.  Furthermore, SIP SUBSCRIBE session expiration timers match SIP REGISTER expiration timers, so whenever the phone needs to renew its registration, it also re-SUBSCRIBEs to the same BLF extensions as before, which means that the shorter you have your SIP registration expiration timers set to, the more frequently this occurs.

The core issue seems to be that whenever a Polycom SoundPoint phone with any BLF lamps configured on it receives a SIP NOTIFY "terminated" state update for any BLF that does not already show activity (ringing or busy), the SoundPoint IP firmware will log "dialog session is not found", AND also leak some memory at the same time.  I would guess that the firmware is trying to match up the "terminated" state update it got with a BLF that is currently set to show some activity, and if it can't find a match, that error is generated.  Presumably there is some memory allocated by the code path that is followed when it processes SIP NOTIFY messages it receives, and the part of the routine that cleans up and frees that memory at the end is mistakenly skipped if the "dialog session is not found" -- it prints that error, the subroutine returns abruptly, and the end of the subroutine that normally would have freed the allocated memory is jumped over.  After this happens a certain number of times (several hundred at least, by the look of it), all available memory is eventually consumed, and the phone crashes.

This is a bad, bad bug.  As long as it exists, BLFs should not be used in any quantity, even just one.  With this bug, it is possible to execute a denial-of-service (DoS) attack on any Polycom SoundPoint IP phone that has as few as 1 BLFs configured on it, simply by rapidly sending it a series of NOTIFYs for the monitored extension(s) that are all state "terminated".  It is exactly this scenario (rapid-fire NOTIFYs for monitored extensions all containing state "terminated") that I have reproduced in a lab.  I can get a phone running 4.0.5 to crash and reboot in 2-3 minutes.

I had previously claimed that this bug seems to exist throughout all of UC 4.0.x, and I based this off of the fact that we had been running 4.0.2B and seeing crashes, and this was what prompted us to first try upgrading to 4.0.5.  It turns out that 4.0.2B does NOT suffer from this bug, but from a different issue, which is, I believe, related to the rate of NOTIFYs that the phone is receiving.  I have since re-tested with 4.0.2B, and a phone running 4.0.2 with BLFs configured does NOT log "dialog session is not found" when it receives gratuitous "terminated" NOTIFYs for non-busy BLFs, nor does it leak memory.  So this memory leak bug is a relatively recent regression.  What it will do on 4.0.2, seemingly at random, is have certain BLFs and non-BLF line keys become "stuck" in a certain state and unresponsive to key presses.  Often, if you wait long enough (a few minutes), that condition will clear by itself without the need for a reboot.  It is definitely a different problem/bug, and I cannot reproduce this on 4.0.5.  This gives me hope that once this memory leak bug is plugged, issues with BLF support in UC software will be greatly reduced in number.

 

-- Nathan

Message 9 of 45
Highlighted
Polycom Employee & Community Manager

Re: SoundPoint IP 650 lockups

Hello Nathan,

what is the ticket number so I can check what team has this and if it is in EMEA work on this with engineering.

 

The Fact that this is Asterisk specific may explain why we have not seen this so frequent as most of our customers would use another SIP platform utilizing other methods.

 

Great work btw. and much appreciated.


Please ensure to provide some feedback if this reply has helped you so other users can profit from your experience.

Best Regards

Steffen Baier

Polycom Global Services

----------------
The title Polycom Employee & Community Manager is a community setting and does not reflect my role. I am just a simple volunteer in the community like everybody else. My official "day" Job is 3rd Level support at Poly but I am unable to provide official support via the community.

----------------

Notice: This community forum is not an official Poly support resource, thus responses from Poly employees, partners, and customers alike are best-effort in attempts to share learned knowledge. If you need immediate and/or official assistance please open a service ticket through your proper support channels.
Please also ensure you always check the VoIP , Video Endpoint , Skype for Business , PSTN or RPM FAQ's
Message 10 of 45