Today is a happy day, since we managed to figure this out. Sorry @Duat that I unchecked the “answer” box, but it turned out that the problem is very different from what you (and everyone else) predicted.
In my last update, I was about to redirect this question to MS when we realized that our firewall was failing at name resolution. So we assumed that this was the culprit and waited for it to be fixed. After it was fixed, we still had the same problems, so we re-examined the situation.
We isolated the problem to part of our build process, more specifically to a custom code activity included in our build solution.
I had implemented a code activity that runs in the final stages of each build. This activity gathered BuildDetails about the current build and added them as a new row in "BuildLog.xls".
The implementation uses Microsoft.Office.Interop.Excel.
This Excel sheet lives on another server (NOT on the servers where the controller / agents are located).
While developing this activity I ran into some Excel-related problems, but once I had finished there were no lingering EXCEL instances. So I thought it was done and dealt with.
By trial and error, we noticed that when this activity did not run, there were no problems. With it running, the first build after a build controller reset would succeed, and any subsequent build had a certain chance of failing. As soon as any build failed, no other build would succeed until the build controller was reset again.
I have only a general idea of what the problem is (the Excel call goes through DCOM, the TFS services are WCF: how would they interfere?! Why does it sometimes succeed and sometimes fail?). The provided diagnostics did not help either; in fact they misled us for several months.
If I ever find the time, I would like to cleanly reproduce the error and turn it into a Server Fault question ...
After removing this activity, it works! I then searched SO and found this one, where J. Saunders comments: "In general, you should never use Office Interop from a server environment."
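For anyone who still wants a per-build log without automating Excel on the server, here is a minimal sketch of the same idea as a plain CSV append. This is not what we had in our build (that was a code activity using Interop), and it is written in Python purely for illustration. The function name `append_build_row` and the column names in `FIELDS` are hypothetical, since the actual BuildDetails fields are not listed above.

```python
import csv
from pathlib import Path

# Hypothetical column names, chosen for illustration only.
FIELDS = ["build_number", "status", "finished_at"]


def append_build_row(log_path, details):
    """Append one build record to a CSV log, writing the header on first use."""
    path = Path(log_path)
    # Write the header only when the file is new or empty.
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(details)
```

A CSV file still opens in Excel on a workstation, but appending to it needs no Office installation and no COM call on the build machine.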
It is ironic that as soon as you get to the bottom of any complex problem, the whole universe seems to know about it except you ...