No, this is not a respectable Memorial Day story. This is about System Center Configuration Manager and a site upgrade that almost drove me to drink. More, I mean. So, it’s not really a grueling story of courage and sacrifice, unless you count suffering through the traumatic effects of stale coffee and lack of sleep. (I didn’t think so)
Names of customers, engineers and other victims are withheld to protect them from the mind-numbing stupidity of my crappy jokes. Let’s continue…
(note: I always choose my images carefully with regards to the content of my posts. The above is intended for me to maintain a rational perspective of how serious (or not) my problems really are)
The project was about upgrading a System Center 2012 R2 Configuration Manager site hierarchy to “current branch” (i.e. to 1511 and on to 1602). The hierarchy consisted of the following:
- All sites running
- SCCM 2012 R2 SP1 with CU1
- SQL Server 2008 R2 Standard, 64-bit with SP1
- Windows Server 2008 R2 Enterprise, 64-bit, Service Pack 1
- A CAS
- Three (3) primary sites
- One in the same AD forest/domain as the CAS
- The other two (2) in separate AD forests (non-contiguous namespaces)
- Forest trusts in place
I mirrored this configuration in my home lab on Hyper-V using cutting edge techniques like cold coffee, loud music and multiple trips to the fridge and microwave at random times while normal people sleep.
After getting the baseline set, I created snapshots (checkpoints) of all VM’s.
There was one catch, and that seemed to open a wormhole into some really long discussions on Skype between us, and several rounds of Microsoft support engineers. I have to say, it was a learning experience.
Rather than repeat every nuance of every step we gradually chopped our way through, I’ll summarize them.
- Some of the site servers were using different user accounts for different site functions. No biggie, we set them all to use the same.
- The site servers were installed on SQL named instances, rather than the default instances, even though no other services or processes were using the default instance.
- The named instances were running on non-standard ports for both engine and broker service processes.
- CAS-Primary replication seemed to be working, but later on, a deeper inspection would reveal that some components were not replicating as consistently as they should.
With all indicators on the surface appearing fine, we began our upgrade process with the CAS. Installing SP3 for SQL 2008 R2, and Windows patches, in order to bring the site up to an “upgrade-ready” state.
Then we upgraded the SQL Server instance from 2008 R2 to 2012 R2 SP3. That seemed to go well, with no errors or warnings, and we performed a restart on the server when finished.
Then replication started breaking. This is when it got interesting. The errors were all about ‘NT Authority\SYSTEM’ not having permissions to connect to the database. But every thing we checked seemed correct. The SYSTEM account had full administrator rights to Windows; it was a sysadmins role member, and all security and role settings were standard. Still, the error was saying SCCM could not connect to the “specified database ‘cm_cas’…”
After several hours of comparing every setting with our test labs, and everything matching up, we decided to call Microsoft for some help. The first Microsoft support engineer was an SCCM PFE. After he poked around and found the scenario more complicated than expected, he called a SQL engineer. Thus began the journey into hours and hours of profile traces, examining log files, permissions, running queries and stored procedures, more logs, more coffee, more engineers and on and on.
Their first goal was to determine whether it was SCCM or SQL. At some point one of us looked at the error again and had an epiphany: It was SCCM trying to connect to the wrong database. As to how or why, we still don’t know for sure. But here’s one possible aspect that might be part of it:
SCCM stores the SQL database configuration under the registry key “HKEY_LOCAL_MACHINE\Software\Microsoft\SMS\SQL Server” in three value assignments:
- Database Name
- Service Name
Then under the next sub-level key “Site System SQL Account” it stores the SQL Server service port (typically 1433) and the SQL Broker service port number (typically 4022).
All of the site servers were configured for port 4124 for the SQL service, and 4125 for the Broker service. And since they were all installed to named instances, the registry key values did not match the actual configuration. This part seems to match our test lab environments, so it’s not unusual, it seems, but the ports and named configuration might have been part of this. Adding aliases to the ports and instance configurations did *not* fix the problem either, so something was odd.
With the named instance “CMCAS” and database named “CM_CAS”, you might expect the registry to show this…
- Database Name = “CM_CAS”
- Server = “CAS”
- Service Name = “CMCAS”
Yes, it is confusing to look at, but that’s how it would be laid out. However, SCCM stores it like this…
- Database Name = “CMCAS\CM_CAS”
- Server = “CAS”
- Service Name = “MSSQLServer”
If you’re at all familiar with how a typical SQL provider connection is constructed, you’d expect the “server” to include the instance name, such as “CAS\CMCAS” and the database to be just the database name. But fetching the above and running them using that approach breaks everything, and produces the same error (e.g. “NT Authority\SYSTEM” can’t connect to the specified database).
Changing the registry values was ill-advised by the support engineers, so we decided to do a triple-bypass, stat!
The first attempt involved running ALTER DATABASE to remap the database files to the default instance, leaving the files in-place. But, while this appeared to retain the security accounts from the same “system” and model database configurations, it seemed to wreak havoc with other aspects of certificates and SCCM processes. Site replication was really a mess and attempts to repair things didn’t seem to work. Then it was on to a quadruple bypass, stat!
- Stop the SCCM and SQL services
- Detach the databases
- Change the SQL service ports to standard numbers (1433, and 4022)
- Attach to the default instance and restart
- Run a SCCM site reset
At this point, SCCM was able to connect to the database. However, with the changes, it broke the certificate and site replication as well. So we had to run through this process, particularly, issue 2 and 3, (for us, anyway), and then wait it out as replication gradually got back on its feet.
Umair – Thank you!
In the end, we spent three (3) days on this, with 2 of them joined by Microsoft. It was a cool situation where the combination of everyone, our engineers, the customer and their engineers, all provided valuable pieces to make it work. I don’t think any one part by itself would’ve worked it out nearly as quick, but that’s just my humble opinion.
What Worked for Us
- Make sure you have full SCCM site and database backups on each site server
- Stop and detach the database files
- Attach them to the default instance
- Configure the service ports as needed
- Upgrade SQL Server *after* the database changes
- Run a SCCM site reset
- Stare at logs in cmtrace, until your eyeballs have them firmly imprinted: rcmctrl.log, sender.log, smsexec.log, despool.log, etc. etc.
- Re-run spdiagdrs a few million times
- Re-run “select * from vlog where….” a few million times
- Re-run “select * from RCM_DrsIntializationTracking where InitializationStatus not in (6,7)” a few hundred times
- Run to restroom, fridge, Taco Bell and Starbucks, walk dog, pet cat, talk to yourself in the bathroom.
- Drop a “replication configuration.pub” or “configuration data.pub” file into rcm.box folder and watch for it to vanish before your eyes (a good thing!)
- Smack the keyboard to Refresh like a kid with Tourette’s and too much Red Bull, to watch the replication status go from “failed” to “degraded” to “pending” to “active”, and site status go from “requested” to “pending creation” to “package created” to “active”
- Wait for replication to show “ACTIVE”
- Jump up and scream “hell yeah!!!”
- Sit calmly back down while your dog looks scared to death.
- Lay down and stare at ceiling for a few minutes as dog licks your face
As of now, all of the sites are configured the same with regards to SQL and SCCM 1511, with 1602 in the pipeline. Replication is cranking along. And things are working well.
The Next Part
Now I’m working on the OS upgrade from Windows Server 2008 R2 to Windows Server 2012 R2. So far, in my lab, it’s going well like this…
- Remove the SUP role (if necessary) from SCCM console
- Remove the WSUS role from Windows Server
- Run setup.exe for Windows Server 2012 R2
- When prompted to restart *before* the upgrade, due to patches, do the restart
- Log back on, run setup.exe again and continue through upgrade and restart again
- Log back on, add WSUS role
- Open SCCM, if issues look rough (components not starting, etc.) check IIS and SQL logs, and if required: run a SCCM site reset. Restart again
- Open SCCM and check status summaries and review logs
- Add the SUP role (if necessary)
- Monitor SCCM and SQL logs
I only had to do a site reset on my CAS. The primary had no issues after the OS upgrade, so I can’t say if either situation (CAS vs. Primary) was typical or not.
My brain is still recovering from a week of traumatic sleep deprivation, bad food, cold coffee, and a 12-hour “nap” I finally had on Sunday, so I may have mixed some things around. If you have a question or concern about the particulars, let me know? I may have to re-read my own notes to make sure I transcribed this correctly.
- https://www.youtube.com/watch?v=zGgFEHsY5FQ 🙂
Thoughts? Comments? Coupons for free food or drinks? Winning lottery numbers? Post a comment below – thank you!