Tuesday, December 22, 2015

Moving from one Domain to another for Windows 2003 Cluster - Lessons Learned

Good day All,

Welcome back!!!

Today I will go over the challenges we had moving a Windows 2003 cluster from one domain to another. I know, I know, you must be wondering why we still have an unsupported OS. Well, we do, and I guess I'm not the only one in the world :)
We are making progress: these servers were in another domain, and we have started consolidating into a Windows 2012 AD. Eventually we will move to a 2012 or maybe a 2016 cluster, who knows... but for now I will share the lessons learned during the domain change.

Microsoft has a pretty good article out there and here is the Link. So please make sure to go over it and make all the policy changes. Believe me, I tested this myself in the lab, and the policy changes play a key role.

After reading the article you must be thinking, hmm, this is simple and pretty straightforward, why do we need another article? Well, remember that production servers never give you a smooth ride, and we had our challenges, which I wanted to share with you all.

So let me list them out..

1. Unjoining from the old domain and rejoining to the new domain. As per the article, we shut the passive nodes down and worked on the active node first. The unjoin went well; we updated the DNS servers for the new domain in TCP/IP Properties and joined it, and guess what, we started to get the error below:

The computer failed to join the domain. Please contact your domain
administrator and indicate that the computer failed to update the
dnshostname and/or servicePrincipalName (SPN) attribute in its Active
directory computer account. Once the problem is resolved, you may join the
computer to the domain.


Not sure how many of you know, but when you try to join a domain a log file is written on the server at c:\windows\Debug\NetSetup.log.
After reviewing it we found that the server was trying to join through a domain controller in a different location, not the nearest domain controller specified as DNS1 and DNS2 in TCP/IP Properties.
Stranger still, the computer object was getting created and then just disappearing...
It was time to pull in the AD team. After checking a couple of things, they identified that although the server was unjoining from the domain without any errors, its computer object in the trusted OLD domain was not getting deleted, and that caused the whole issue.
If anyone is about to shout that the ID doesn't have Domain Admin rights, it does :) but that is something for the AD team to investigate.

So the solution was pretty simple: unjoin from the domain, delete the computer object from the old domain, wait 5-10 minutes for the change to replicate, and then join again. This time it was all OK.
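
If you want to quickly see which domain controller a join attempt actually went to, here is a rough Python sketch that pulls the DC names out of the domain-join debug log (NetSetup.log under c:\windows\Debug). The sample log lines, the "found DC" wording the pattern looks for, and the name REMOTE-DC07 are assumptions for illustration only; check a real log from your own server and adjust the pattern before trusting the result.

```python
import re

# Assumed line shape: ... found DC '\\SOME-DC' ...
# Adjust this pattern to whatever your own log actually contains.
DC_PATTERN = re.compile(r"found DC '\\\\([^']+)'")


def domain_controllers_contacted(log_text):
    """Return DC names found in the log, first-seen order, case-insensitive de-dup."""
    seen = []
    for line in log_text.splitlines():
        match = DC_PATTERN.search(line)
        if match and match.group(1).lower() not in (name.lower() for name in seen):
            seen.append(match.group(1))
    return seen


# Hypothetical sample text, not copied from a real server.
SAMPLE_LOG = r"""
12/22 09:14:02 NetpDoDomainJoin
12/22 09:14:03 NetpDsGetDcName: trying to find DC in domain 'new.example.com'
12/22 09:14:05 NetpDsGetDcName: found DC '\\REMOTE-DC07' in the specified domain
12/22 09:14:09 NetpDsGetDcName: found DC '\\remote-dc07' in the specified domain
"""

print(domain_controllers_contacted(SAMPLE_LOG))
```

On a real server you would read the log file into a string first and compare the names returned against the domain controllers you expected the join to use.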

2. Moving on to the 2nd issue... As I said above, we were working on the primary active node as per the Microsoft article.
After joining the new domain, we made all the Local Policy changes as per the link above, changed the Cluster Service account's domain name and password, and started the Cluster Service.
Well, guess what, the Cluster Service just timed out with error 7031 in the System event log.

Now we were stuck. We reapplied the Local Security Policy just in case anything had been missed; nope, that didn't help.
So I started to review the cluster logs and saw that when we started the Cluster Service, the Q (Quorum) drive was trying to come online and then going offline, which shut down all the cluster group resources and terminated the Cluster Service.

So the first clue pointed to a disk/LUN issue. To double check, I ran the command

net start clussvc /fixquorum

Guess what, the Cluster Service came online. When we checked the Cluster Administrator MMC, the Cluster Name and Cluster IP were online and the Q drive was in a failed state.. and when we checked the other drives, those had failed too... that was not good.

So I went to Disk Management and found that there were active LUNs with drive letters assigned, and duplicates of the same LUNs were also showing online... hmm, so I suspected a multipath issue, with each disk showing up twice in Disk Management.


So we reached out to the Storage team, who told us there were updated drivers for the multipath issue. After applying them and rebooting, presto!!! The Cluster Service came online with no issues.
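
The duplicate-disk symptom above is easy to miss by eye when there are many LUNs, so here is a small Python sketch of the check: group the disks by LUN identifier and flag any LUN that shows up more than once. The inventory list, its field names (disk_number, lun_id, drive_letter) and the LUN names are all made up for illustration; in real life you would fill it in from whatever your storage or multipath tooling reports.

```python
from collections import defaultdict


def find_duplicate_luns(disks):
    """Return {lun_id: [disk numbers]} for every LUN presented more than once."""
    by_lun = defaultdict(list)
    for disk in disks:
        by_lun[disk["lun_id"]].append(disk["disk_number"])
    return {lun: numbers for lun, numbers in by_lun.items() if len(numbers) > 1}


# Hypothetical inventory: each LUN appears twice, once with a drive letter
# and once without, the way a broken multipath setup might present it.
inventory = [
    {"disk_number": 1, "lun_id": "LUN-Q", "drive_letter": "Q"},
    {"disk_number": 2, "lun_id": "LUN-DATA", "drive_letter": "S"},
    {"disk_number": 3, "lun_id": "LUN-Q", "drive_letter": None},
    {"disk_number": 4, "lun_id": "LUN-DATA", "drive_letter": None},
]

print(find_duplicate_luns(inventory))
```

An empty result would mean every LUN is presented once, which is what you would hope to see after the multipath drivers are fixed.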


3. So, the 3rd lesson we learned.. I don't think I mentioned it before, but this is our Windows 2003 SQL Active/Passive cluster. SQL was brought down and the SQL services were set to Manual before starting this activity. So we asked the SQL team to change the SQL service account's domain name, account and password, and then started to bring the services online. All the SQL services came online except SQL Server Agent, which failed with this error:

SQLServerAgent could not be started (reason: SQLServerAgent must be able to connect to SQLServer as SysAdmin, but '(Unknown)' is not a member of the SysAdmin role).

The SQL service in the cluster had started, so it couldn't be a permission issue. After some searching, a blog suggested making sure the SQL service account has the "Lock pages in memory" right in the Local Security Policy and then rebooting the server.
Well, after doing that SQL Server Agent came back online too with no issues.
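
To double check a user right like this without clicking through the policy editor, one option is to export the local policy (for example with secedit /export /cfg policy.inf) and look at the [Privilege Rights] section, where "Lock pages in memory" appears as SeLockMemoryPrivilege. Below is a rough Python sketch that parses such an export. The sample text and the account NEWDOM\svc_sql are made up, and a real export is usually saved in a Unicode encoding and often lists SIDs instead of account names, so treat this as a starting point.

```python
def accounts_with_privilege(inf_text, privilege):
    """Return the accounts or SIDs listed for a privilege under [Privilege Rights]."""
    in_section = False
    for raw_line in inf_text.splitlines():
        line = raw_line.strip()
        if line.startswith("[") and line.endswith("]"):
            # Track whether we are inside the section we care about.
            in_section = line.lower() == "[privilege rights]"
            continue
        if in_section and "=" in line:
            name, _, value = line.partition("=")
            if name.strip().lower() == privilege.lower():
                # SIDs are written with a leading '*', which we drop.
                return [item.strip().lstrip("*") for item in value.split(",") if item.strip()]
    return []


# Hypothetical export contents, for illustration only.
SAMPLE_EXPORT = """
[Unicode]
Unicode=yes
[Privilege Rights]
SeServiceLogonRight = *S-1-5-20,NEWDOM\\svc_sql
SeLockMemoryPrivilege = NEWDOM\\svc_sql
[Version]
signature="$CHICAGO$"
"""

holders = accounts_with_privilege(SAMPLE_EXPORT, "SeLockMemoryPrivilege")
print("NEWDOM\\svc_sql" in holders)
```

If the service account (or its SID) is missing from the list, the right has not been granted on that node.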

Sorry, my laptop crashed and I couldn't find my way back to the blog where this was suggested, but I would like to thank its author; people like them help the community fix issues.

Once everything tested OK on the first node, we moved to the 2nd node, repeated everything, and all went fine with no issues.
Hopefully my lessons learned will help someone too!!!!

Special thanks to my buddy Prasanth, who stuck with me through these issues.

Till the next one, have a good day all!!!