Troubleshooting IIS ARR Bad Gateway Timeout Issues

I recently encountered an issue whereby Lync 2013 mobile clients would consistently disconnect when utilising IIS ARR 3.0 on Microsoft Windows Server 2012 R2. I have deployed ARR for Lync Server 2013 on several ocassasions and have always followed guidelines around timeout values for the webservices public facing URL specifically, as this is known to cause issues if the timeout value is below 200 seconds. Intitially in this particular customers case, I was sure the problem was not ARR related as I had performed this process several times previously, however after testing with a public IP address directly on the ARR server in order to rule out TCP timeout issues on the customers firewall appliance, the problem still occurred and I needed a way to look further into what IIS was reporting. The following details the process I went through to identify the issue and gain a resolution.

1. Firstly, we need to take a look at the log file that the Lync Mobile client produces when the issue occurs. Ensure logging is enabled under the clients options and then reproduce the issue, following this if you enter the applications options for a second time there is an additional highlighted option to send the log files to an e-mail address. Once this is performed, open the log file on a workstation and perform a find operation for the word “gateway”, if you receive a match similar to “E_BadGateway (E2-3-35)” you have a timeout issue. In fact, any bad gateway error reporting in the Lync Mobile client logs is timeout related, finding out where the timeout has occurred is the key factor to determine. In my case, I was seeing the following error in my client side log which was consistent on Windows, Android and iOS based devices.

Bad Gateway

 2. After ruling out the customers firewall appliance by placing a public IP address on the ARR server itself, I then knew the issue had to either be the Lync Server Front End, ARR Server or the customers router, however the latter was the most unlikely. Following this I decided to take a look at the IIS log files, the first issue I stumbled across here is that I had not installed the IIS logging role features and as result no log files had been generated. For reference, to obtain the correct log files in order to assist in diagnostic the problem you will need to add the “HTTP Logging” and “Tracing” Web Server role services from the Windows Server 2012 R2 Server manager, below is a screenshot of the options that are required.

Role Services

Once these role services are installed, you will then need to reproduce the issue on a Lync Mobile client in order for a log to be generated, once performed the log file will be viewable under the following file location on the ARR Server, “C:\inetpub\logs\LogFiles\W3SVC1\”. In this folder you will see a log file or files have been generated, by opening the log file you will be presented with a time and date stamped entry of the processes that have occurred on the server. By performing a find on this file and specifying “502”, this will locate your timeout event, in my case I was experiencing a 502.3 error as detailed below.

2014-12-30 11:50:02 GET /ucwa/v1/applications/21223095915/events ack=3&low=5&medium=5&timeout=180&priority=141994047&X-ARR-CACHE-HIT=0&X-ARR-LOG-ID=42ec39b2-cb68-4ad3-8ab8-8ba781a7bcba 443 – 151.228.9.186 Mozilla/5.0+(Windows+Phone+8.1;+ARM;+Trident/7.0;+Touch;+rv:11.0;+IEMobile/11.0;+NOKIA;+Lumia+920)+like+Gecko – 502 3 12002 33124

The end of the error is the part that is most important at this stage, we see “502 3” which means we received a 502.3 error and interestingly we also see that 33.124 seconds elapsed without a response, which caused the timeout to occur, this is shown by the very last line in the above output, the 33124 is represented in milliseconds which converted in seconds is 33.124. This confused me somewhat initially as I had already configured my webservices proxy timeout to be 960 seconds, which should have avoided this timeout issue.

3. To further investigate the issue, we can additionally configure an IIS logging utility entitled Failed Request Tracing Rules, this feature allows us to trigger a report and accompanying log file if a particular exception is matched, in my case I wanted to trigger an exception when a 502.x error was produced. This can be configured be opening the IIS Management Console and selecting the “Server Name” node to display the IIS Home options.

Failed Request Tracing Rules

On double clicking the Failed Request Tracing Rules icon, click Add in the top right hand corner of the screen and a new dialog box will be presented. In this box, click Next on the first presented screen, in the next window enter 502 in the “Status Codes” dialog box area and click Next. In the “Select Trace Providers” window, ensure only “WWW Server” is selected and click Finish.

 4. Once the tracing rule has been created, again reproduce the timout issue on a Lync Mobile client and when complete a new IIS log file will be available in the following location “C:\inetpub\logs\FailedReqLogFiles\W3SVC1” and an accompanying report will have been produced in an XML format, typically entitled “fr000001”. Proceed and open the XML file in Internet Explorer and when reviewing the content we will be able to identify which application routing request triggered the timeout, as detailed below.

Failed Request Tracing Rules Report

As we can see under the URL_Changed value, the timeout was actually being encounterd when the https://meet.domain.com URL was being queried, this was confirmed by matching the GET command that was detailed in the original IIS log file and then matching this to the URL detailed in the Failed Request Tracing Report. On increasing the proxy timeout value for the meet.domain.com server farm in IIS ARR to 960 seconds, the issue was resolved. As to why this problem occurred specifically within this environment and as to why the reverse proxy was seeing a URL for https://meet.domain.com/ucwa/v1/applications, which is a web services directory thus meaning the queried URL is actually invalid, compared to others that I have deployed, I cannot currently say for sure however I even went as far as installing ARR 2.5 on Windows Server 2012 and experienced an identical issue. Hopefully this will assist someone else from a troubleshooting perspective at least.