Troubleshooting IIS ARR Bad Gateway Timeout Issues

I recently encountered an issue whereby Lync 2013 mobile clients would consistently disconnect when utilising IIS ARR 3.0 on Microsoft Windows Server 2012 R2. I have deployed ARR for Lync Server 2013 on several occasions and have always followed the guidelines around timeout values, specifically for the public-facing web services URL, as this is known to cause issues if the timeout value is below 200 seconds.

Initially, in this particular customer's case, I was sure the problem was not ARR related as I had performed this process several times previously. However, after testing with a public IP address directly on the ARR server in order to rule out TCP timeout issues on the customer's firewall appliance, the problem still occurred and I needed a way to look further into what IIS was reporting. The following details the process I went through to identify the issue and reach a resolution.

1. Firstly, we need to take a look at the log file that the Lync Mobile client produces when the issue occurs. Ensure logging is enabled under the client's options and then reproduce the issue. If you then enter the application's options a second time, there is an additional highlighted option to send the log files to an e-mail address. Once this is done, open the log file on a workstation and search for the word “gateway”. If you receive a match similar to “E_BadGateway (E2-3-35)”, you have a timeout issue. In fact, any bad gateway error reported in the Lync Mobile client logs is timeout related, so the key task is to determine where the timeout has occurred. In my case, I was seeing the following error in my client side log, which was consistent on Windows, Android and iOS based devices.

Bad Gateway

2. After ruling out the customer's firewall appliance by placing a public IP address on the ARR server itself, I knew the issue had to be either the Lync Server Front End, the ARR server or the customer's router, although the latter was the most unlikely. Following this I decided to take a look at the IIS log files. The first issue I stumbled across here was that I had not installed the IIS logging role services and as a result no log files had been generated. For reference, to obtain the log files needed to diagnose the problem you will need to add the “HTTP Logging” and “Tracing” Web Server role services from the Windows Server 2012 R2 Server Manager. Below is a screenshot of the options that are required.

Role Services

Once these role services are installed, you will need to reproduce the issue on a Lync Mobile client in order for a log to be generated. The log file will then be viewable on the ARR server under “C:\inetpub\logs\LogFiles\W3SVC1\”. In this folder you will see that one or more log files have been generated, and each contains a date and time stamped entry for every request the server has processed. Searching the file for “502” will locate your timeout event. In my case I was experiencing a 502.3 error, as detailed below.

2014-12-30 11:50:02 GET /ucwa/v1/applications/21223095915/events ack=3&low=5&medium=5&timeout=180&priority=141994047&X-ARR-CACHE-HIT=0&X-ARR-LOG-ID=42ec39b2-cb68-4ad3-8ab8-8ba781a7bcba 443 - 151.228.9.186 Mozilla/5.0+(Windows+Phone+8.1;+ARM;+Trident/7.0;+Touch;+rv:11.0;+IEMobile/11.0;+NOKIA;+Lumia+920)+like+Gecko - 502 3 12002 33124

The end of the entry is the part that is most important at this stage. We see “502 3”, which means we received a 502.3 error, and interestingly we also see that 33.124 seconds elapsed without a response, which caused the timeout to occur. This is shown by the very last field in the above output: the value 33124 is in milliseconds, which is 33.124 seconds. This confused me somewhat initially as I had already configured my web services proxy timeout to be 960 seconds, which should have avoided this timeout issue.
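For anyone who would rather script this check than read the log by eye, the following is a minimal sketch in Python. It assumes the log line ends with the status, sub-status, Win32 status and time-taken fields in that order, as the entry above does (check the “#Fields:” header in your own log, as the logged fields are configurable). The helper name parse_status_fields and the shortened sample line are my own for illustration. The Win32 status of 12002 is the WinHTTP timeout error code, which fits the diagnosis.

```python
# Minimal sketch: read the trailing status fields from one IIS W3C log line.
# Assumption: the line ends with sc-status, sc-substatus, sc-win32-status and
# time-taken (in that order), as the entry above does.

def parse_status_fields(line):
    """Return the HTTP status, Win32 status and elapsed seconds for a log line."""
    fields = line.split()
    status, substatus, win32_status, time_taken_ms = (int(f) for f in fields[-4:])
    return {
        "status": f"{status}.{substatus}",
        "win32_status": win32_status,
        "seconds": time_taken_ms / 1000,  # time-taken is logged in milliseconds
    }

# Abbreviated copy of the log entry above (query string and user agent shortened).
sample = (
    "2014-12-30 11:50:02 GET /ucwa/v1/applications/21223095915/events "
    "ack=3&low=5&medium=5&timeout=180 443 - 151.228.9.186 "
    "Mozilla/5.0+(Windows+Phone+8.1) - 502 3 12002 33124"
)

PROXY_TIMEOUT_SECONDS = 960  # value configured on the web services server farm

result = parse_status_fields(sample)
print(result["status"], result["seconds"])
if result["seconds"] < PROXY_TIMEOUT_SECONDS:
    print("Timed out well before the configured proxy timeout was reached")
```

A request that fails after roughly 33 seconds when the farm is set to 960 seconds is a strong hint that a different timeout value is being applied to that request.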

3. To further investigate the issue, we can additionally configure an IIS logging feature called Failed Request Tracing Rules. This feature allows us to trigger a report and accompanying log file whenever a particular condition is matched. In my case I wanted to trigger a trace when a 502.x error was produced. This can be configured by opening the IIS Management Console and selecting the “Server Name” node to display the IIS Home options.

Failed Request Tracing Rules

After double clicking the Failed Request Tracing Rules icon, click Add in the top right hand corner of the screen and a new dialog box will be presented. Click Next on the first screen, then enter 502 in the “Status Codes” box on the next screen and click Next. In the “Select Trace Providers” window, ensure only “WWW Server” is selected and click Finish.

4. Once the tracing rule has been created, reproduce the timeout issue on a Lync Mobile client again. When complete, a new IIS log file will be available in “C:\inetpub\logs\FailedReqLogFiles\W3SVC1” and an accompanying report will have been produced in XML format, typically named “fr000001”. Open the XML file in Internet Explorer and review the content to identify which application routing request triggered the timeout, as detailed below.

Failed Request Tracing Rules Report

As we can see under the URL_Changed value, the timeout was actually being encountered when the https://meet.domain.com URL was being queried. This was confirmed by matching the GET request detailed in the original IIS log file to the URL detailed in the Failed Request Tracing report. On increasing the proxy timeout value for the meet.domain.com server farm in IIS ARR to 960 seconds, the issue was resolved.

I cannot currently say for sure why this problem occurred in this environment and not in others that I have deployed, nor why the reverse proxy was seeing a URL of https://meet.domain.com/ucwa/v1/applications, which is a web services directory and therefore means the queried URL is actually invalid. I even went as far as installing ARR 2.5 on Windows Server 2012 and experienced an identical issue. Hopefully this will assist someone else from a troubleshooting perspective at least.

AudioCodes E-SBC – Removing ;ext= From An INVITE Header

Recently I was performing an integration between Microsoft Lync Server 2013 and an Alcatel OmniPCX phone system at a customer site. The existing PBX already had an element of VoIP via a separate platform called OpenTouch, which allowed users to have multiple devices associated with a single extension. Calls from Lync Server 2013 to OpenTouch users were sent to an AudioCodes Virtual E-SBC and then on to the Alcatel PBX via a direct SIP trunk, but for an unknown reason the user's handset would not ring. After working with the customer's Alcatel vendor, it transpired that the OpenTouch platform did not like INVITEs that were sent with a from address such as +441234567890;ext=7890, and the ;ext= element was causing the issue. To fix this, the ;ext= element needed to be stripped from the INVITE header using a Message Manipulation on the AudioCodes E-SBC. As with Sonus devices, a regular expression is required in order to remove any parts of an INVITE we do not need. On an AudioCodes device running version 6.8 of the firmware, the following was performed.

1. Log in to the device, expand VoIP -> SIP Definitions -> Msg Policy & Manipulations and then select Message Manipulations.

2. In the Message Manipulations table, click Add and provide an Index number, such as 1 if this is your first manipulation, and then configure the manipulation as per the screenshot below.

Message Manipulation

3. To understand how Message Manipulation works, note that the condition section is where we look for specific information within the INVITE. In this case we are using a regular expression to place each part of the INVITE header into a variable. For example, the (.*) part of the condition places its match into variable $1, which would be the +441234567890 part of the number only. The next part of the regular expression, (;ext=), is placed in variable $2, which would only be the ;ext= part of the header, and so on. By placing each part of the header into a variable we can then control the manipulation by specifying an Action Value, which constructs the header using only the parts that we wish to keep. For example, to gain a result that removes the ;ext= element, we would use $1 (+441234567890), then $4 (@) and then $5 (domain.com) to construct a header that displays +441234567890@domain.com.
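To make the variable numbering easier to follow, here is a small illustration of the same capture group idea using Python's re module. This is not AudioCodes syntax, and the pattern is an assumption: the real condition lives in the screenshot above, so the five groups here simply mirror the parts described in the text, with the extension digits guessed to be group 3. The function name strip_ext is hypothetical.

```python
import re

# Assumed pattern mirroring the five parts described above:
#   1: number   2: ;ext=   3: extension digits   4: @   5: domain
PATTERN = re.compile(r"(.*)(;ext=)(\d+)(@)(.*)")

def strip_ext(user_and_host):
    """Rebuild the address from groups 1, 4 and 5, dropping ;ext= and its digits."""
    match = PATTERN.fullmatch(user_and_host)
    if match is None:
        return user_and_host  # no ;ext= element present, leave the value unchanged
    return match.group(1) + match.group(4) + match.group(5)

print(strip_ext("+441234567890;ext=7890@domain.com"))
```

Addresses without an ;ext= element do not match the pattern and pass through untouched, which is the behaviour you would normally want from a manipulation rule as well.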

4. Once this is complete, expand VoIP -> VoIP Network -> IP Group and then edit the IP Group you wish to apply the message manipulation to. When the properties of the IP Group open, enter the Manipulation Set ID that you entered in step 2 into either the inbound or outbound manipulation set ID box, depending on the direction in which you need to apply the manipulation, as illustrated below.

SBC Manipulation Set

That’s it. When performing a debug trace on the gateway you should now see that the ;ext= element has been removed from the INVITE header.

Thanks to Siplifi (https://www.siplifi.com) for the assistance.

Microsoft Lync Server 2013 – Consolidated Edge Audio Issue

I was assisting a customer with the renewal of a Consolidated Edge external SSL certificate recently, and upon applying the newly issued certificate to the server, audio calls with externally situated users or federated partners via the Consolidated Edge failed. Upon answering an audio call via the Lync client, it immediately reported “Call Ended”, and third party IP handsets would become stuck in a “Connecting” state once the call was answered by the remote party. While the latter symptom might have suggested an audio/video port issue at a firewall level, I was on the phone to the customer via a federated call when the certificate was assigned to the Consolidated Edge services, which immediately caused the call to drop, and as such I knew a firewall issue was not the root cause of this problem. On inspection of the Consolidated Edge server's event log, everything appeared to be fine and all Consolidated Edge specific services were started. Additionally, all other functionality such as remote and federated instant messaging was working as expected. However, upon a restart of the Consolidated Edge services the following event was logged:

The Access Edge Server failed to import a shared session key due to invalid signature.

In the past 1 minutes, the server rejected 1 shared session keys presented from the network due to an invalid signature. This suggests an incorrect certificate configuration. A large number of failures could indicate spoofed session key data sent by an attacker.

Cause: This is most likely to be a configuration problem in a server array.

Resolution: Ensure that all servers in an Access Edge Server Array have the same certificate configured for the external IP address on the Edge Interfaces tab. Inspect the serial number of each certificate. If the problem persists, use the Administrator Log to help identify the source of these messages.

Reading the error immediately prompted me to inspect the certificate via the Certificates MMC snap-in. Upon reviewing the new SSL certificate everything appeared to be correct: the certificate chain was reporting as valid, as was the private key. In this customer's case a GoDaddy SSL certificate had been procured, and while the certificate chain stated the implemented certificate was trusted, it transpired that an intermediate GoDaddy certificate was missing. On adding the missing intermediate certificate, which was included in the downloaded certificate .zip file, into the Local Computer certificate store, the audio calling issue was resolved. In summary, always check the intermediate SSL certificates for your chosen provider, particularly if you are not utilising one of the bigger SSL vendors such as VeriSign or DigiCert. Hope this helps!

Lync Server 2013 – Cannot Find Any Suitable Disks For Database Files

I was adding a new front end pool to a customer's Microsoft Lync Server 2013 deployment recently when I came across an issue while attempting to implement pool pairing between two Standard Edition servers. Once the backup service MSI had been installed, I received the error “Command execution failed: Cannot find any suitable disks for database files. You must manually specify database paths.”, as illustrated below.

Database Error

It became apparent quite quickly that this issue was related to the available storage on the virtual machine's C:\ drive. While there was sufficient space to perform the front end installation, the addition of an extra component was a step too far in terms of required storage. In this case, the customer had provisioned a 50 GB local hard disk drive, which is under the recommended 72 GB required for the deployment. On expanding the local hard disk drive to 80 GB and running the Lync Server Deployment Wizard an additional time, the error no longer appeared. This was a good reminder that any virtual or physical machines utilised for Microsoft Lync Server 2013 roles need to conform to the minimum hardware requirements in order to avoid potentially time consuming issues. That’s it!
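A quick pre-flight check along these lines can be sketched in a few lines of Python. This is only an illustration: the 72 GB figure is taken from the text above, the function name meets_disk_requirement is hypothetical, and treating 1 GB as 1024^3 bytes (as Windows reports it) is my assumption.

```python
import shutil

REQUIRED_GB = 72          # figure quoted above for the deployment
BYTES_PER_GB = 1024 ** 3  # assumption: GB as reported by Windows

def meets_disk_requirement(total_bytes, required_gb=REQUIRED_GB):
    """Return True when a disk of total_bytes is at least required_gb in size."""
    return total_bytes >= required_gb * BYTES_PER_GB

# The two disk sizes from this case: 50 GB as provisioned, 80 GB after expansion.
print(meets_disk_requirement(50 * BYTES_PER_GB))
print(meets_disk_requirement(80 * BYTES_PER_GB))
```

On the server itself the size of the system drive could be read with shutil.disk_usage("C:\\").total and passed to the same function.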

Lync Server 2010 – Limited Functionality Is Available Due To An Outage

I had a customer log a support case detailing an issue whereby all Lync 2010 clients were displaying a red banner stating “Limited functionality is available due to an outage”, even though this customer had only a single front end server. This message is typically displayed when users registered against an SBA experience a WAN outage in their location, which results in them being unable to contact the associated front end server. The message shown in the client is illustrated below.

Limited Functionality

This issue affected all users registered against the front end server, and viewing the application, system and Lync Server event logs gave a picture of how the issue had come about. Firstly, in the front end server's Lync Server event log the following error had been produced at the time the customer reported the issue:

Pas with FQDN: lync.domain.co.uk has been detected to be down.

PAS stands for Presence Agent Server and is the component of Lync Server that handles presence logic and traffic. Turning to the system event log on the front end showed why the problem had occurred and why contacts could not be seen within the Lync client. The first error displayed was the following:

Reset to device, \Device\RaidPort0, was issued.

This showed that access to the virtual machine's storage subsystem had been interrupted for a short period of time, and as such the following event was then also logged:

SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [c:\csdata\backendstore\rtc\dynlogpath\rtcdyn.ldf] in database [rtcdyn] (7).  The OS file handle is 0x0000000000000740.  The offset of the latest long I/O is: 0x000000060aca00

The above error was essentially the underlying cause of the message displayed in the Lync client. In order to resolve the problem the virtual machine was moved to another virtualisation host, which immediately caused the error in the client to disappear as reliable access to the storage volume was now possible.

Active Directory – Telephone Number PowerShell Script

I was recently working on a Lync Server 2013 deployment for a customer whereby they needed to add E.164 formatted numbers to the Telephone Number field of each Enterprise Voice user's Active Directory account. This is a standard process as a part of the installation, however in this customer's case no telephone numbers had been populated and manually adding these numbers for over 400 users would have been a time consuming task. Using their existing telephone extension list and a CSVDE export of SamAccountName values from Active Directory, I created a short PowerShell script to automate the process. The below script needs to be saved as a .PS1 file and executed from a domain controller or a workstation with the AD DS tools installed. To do this, perform the following steps:

1. Copy the below script into Notepad and save the file as script.ps1 for example.

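# Updates the Telephone Number (OfficePhone) attribute for each user listed in the CSV.
Import-Module ActiveDirectory
$inputFile = Import-CSV c:\test.csv
$log = ".\Error-Log.txt"
$date = Get-Date
Write-Host "Active Directory - Telephone Number Change Script"
Start-Sleep 2
Write-Host "The change process is now commencing, please wait..."
Start-Sleep 2
Function ChangeTelephoneNumber
{
    "Change Process Started On: $date" | Out-File $log -Append
    "-------------------------------------------------" | Out-File $log -Append
    foreach ($line in $inputFile)
    {
        $sam = $line.SamAccountName
        $officephone = $line.OfficePhone
        try
        {
            # -ErrorAction Stop turns a failed update into a catchable error
            Set-ADUser -Identity $sam -OfficePhone $officephone -ErrorAction Stop
            Write-Host "Completed Telephone Number Change For: $sam"
            "Completed Telephone Number Change For: $sam" | Out-File $log -Append
        }
        catch
        {
            # Record the failure so the log file actually shows which users were missed
            Write-Host "Failed Telephone Number Change For: $sam"
            "Failed Telephone Number Change For: $sam - $($_.Exception.Message)" | Out-File $log -Append
        }
    }
    Start-Sleep 2
    Write-Host "The process is now complete, please review the log file for any errors."
}
ChangeTelephoneNumber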

2. In Microsoft Excel, for example, create a new sheet with two columns and enter SamAccountName and OfficePhone in the first cell of each column. These are the value names the script looks for when it attempts to update a user's telephone number. Each row beneath the headers then holds one user's SAM account name and the E.164 number to assign.

3. Once you have completed the spreadsheet, save it as a CSV file and copy it to the C:\ drive of the server or workstation that you are executing the script from. If the C:\ drive is not suitable, you can change the location the script searches for the CSV by editing the “$inputFile = Import-CSV c:\test.csv” line.

4. That’s it, you can now execute the script from a domain controller and the telephone numbers should update. I would recommend running this against a single user initially in order to test the functionality.
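Before running the script against a few hundred accounts, it can be worth sanity checking the CSV first. The following is a minimal sketch in Python (so it can be run on any workstation, separate from the PowerShell script above). It assumes the two column headers described in step 2 and treats a valid E.164 number as a plus sign followed by up to 15 digits, the first of which is not zero. The function name find_bad_rows and the sample account names are hypothetical.

```python
import csv
import io
import re

# E.164: "+" followed by up to 15 digits, with a non-zero leading digit.
E164 = re.compile(r"^\+[1-9]\d{1,14}$")

def find_bad_rows(csv_text):
    """Return (line number, SamAccountName, OfficePhone) for rows that look wrong."""
    reader = csv.DictReader(io.StringIO(csv_text))
    bad = []
    # Data rows start on line 2 because line 1 holds the column headers.
    for line_number, row in enumerate(reader, start=2):
        sam = (row.get("SamAccountName") or "").strip()
        phone = (row.get("OfficePhone") or "").strip()
        if not sam or not E164.match(phone):
            bad.append((line_number, sam, phone))
    return bad

# Hypothetical sample data: one good row, one non-E.164 number, one missing name.
sample_csv = (
    "SamAccountName,OfficePhone\n"
    "jsmith,+441234567890\n"
    "adoe,01234567890\n"
    ",+441234567891\n"
)

for line_number, sam, phone in find_bad_rows(sample_csv):
    print(f"Line {line_number}: check SamAccountName '{sam}' / OfficePhone '{phone}'")
```

To check a real export, read the file contents with open(path, newline="").read() and pass the text to find_bad_rows.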

Lync Server 2013 – Cannot Setup Mirroring Database

I recently implemented a Lync Server 2013 infrastructure with a mirrored Microsoft SQL backend. During the configuration of the mirror, a network interruption occurred which resulted in the creation of the mirror database failing. Once the network issue had been resolved, I attempted to recreate the mirror via the “Install Databases” option in the topology builder, only to find the process failed with the following error:

“Cannot setup mirroring because there is an error when validating the current database states on primary and mirror instances for the database “rtcxds”. Exception: System.InvalidOperationException: Cannot setup mirror database “rtcxds” because it already exists.

Despite manually attempting to create the mirror via the Lync Server Management Shell, the error persisted. In order to resolve the problem, the following was performed.

1. Open the SQL Management Studio on the primary database server and locate the affected database, in my case it was “rtcxds”.

2. Right click the database and click “Detach” and then click OK when the following window is displayed.

3. Open the SQL Management Studio on the mirror database server and locate the affected database.

4. Right click the database and click “Delete”. You won’t be able to detach this database as it is not the primary.

5. Remove the created database files on each SQL server and then open the Lync Server topology builder.

6. In the topology builder right click “Lync Server 2013” and then click “Install Databases”.

7. Follow the install databases wizard and the mirror should now deploy correctly.

Exchange 2007 Unified Messaging & Lync Server 2013 – Key Mapping Issue

I recently deployed a Microsoft Lync Server 2013 infrastructure for a customer running Microsoft Exchange Server 2007 SP3, and while this version of Exchange Server is supported, it should be noted there are a few additional amendments that need to be made in order to make UM Auto Attendant key mappings to Lync extensions work correctly. Following the typical UM integration through the use of OCSUMUtil.exe and ExchUMUtil.ps1, all Unified Messaging functionality seemed to be working correctly, including dial by extension. It wasn’t until a key mapping was added to an Auto Attendant to transfer a call to a specific Lync extension that I saw an issue. When calling the attendant and pressing one, for example, which was directed to extension 319, the Unified Messaging service would produce the following error and the attendant would tell the caller “The call could not be transferred”.

There is not a great deal of detail to go on in this error, and researching this particular Event ID suggested it could pertain to a number of issues. On researching further however, I noted the following from the TechNet article on integrating Lync Server 2013 with Exchange Unified Messaging:

If you are using a version of Exchange that is earlier than Microsoft Exchange Server 2010 SP1, you must enter the fully qualified domain name (FQDN) of the corresponding Exchange Unified Messaging (UM) SIP dial plan in the Lync Server 2013 dial plan Simple name field. If you are using Microsoft Exchange Server 2010 SP1 or latest service pack, this dial plan name matching is not necessary.

In order to resolve the key mapping issue the following was performed.

1. Connect to the Lync Server 2013 control panel and click Voice Routing and then select the Dial Plans tab.

2. Double click the “Global” dial plan to edit it and in the Simple Name dialog box, remove the word Global and replace it with the name of your Exchange Unified Messaging dial plan followed by your internal Active Directory domain name. For example, if my UM Dial Plan name was “DefaultUM” and my internal domain was “company.local”, I would enter DefaultUM.company.local into the Simple Name field.

3. Click OK and then commit the change. You will then need to wait a few moments for the change to take effect before trying the key mapping again. It should also be noted that the Global dial plan will need sufficient normalisation rules for the key mapping to work when transferring to an extension. In my case the dial plan now looked like the following:

That’s it, hopefully your Auto Attendant key mapping issues to Lync extensions will now be resolved.