I have been meaning to write up procedures for server monitoring and maintenance, so I decided to put this write-up together. It is not an exhaustive document, but it should help if you are fairly new to SharePoint administration.
Server Monitoring and Maintenance Procedures
This document explains the daily procedures that should be performed to monitor and manage the SharePoint 2007 server. It also outlines daily and monthly procedures to help you maintain the SharePoint installation.
Daily Monitoring Procedures
The following system monitoring procedures should be performed daily.
- Check that all relevant services are operational on all the servers in the farm using the Services snap-in.
- Check that ample free space is left on the C: drive, and free up disk space where necessary, so that applications can run properly.
- Verify that the previous night's backup has run.
- Search for unusual entries in the Event Viewer snap-in. This task should be done every 6 hours.
- Check all timer job definitions and resolve any jobs that failed or did not complete.
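The disk space and backup checks in the list above lend themselves to a small script. The following is a minimal Python sketch, not a definitive implementation. The thresholds and the function names are assumptions made for illustration. Note that a recent file timestamp only shows that a backup file was written, not that the backup itself is valid.

```python
import shutil
import time
from pathlib import Path

# Hypothetical thresholds; tune them for your own farm.
MIN_FREE_FRACTION = 0.15      # warn when less than 15% of the drive is free
MAX_BACKUP_AGE_HOURS = 24     # the nightly backup should be younger than this


def free_fraction(total_bytes: int, free_bytes: int) -> float:
    """Return the free share of a drive as a value between 0 and 1."""
    return free_bytes / total_bytes if total_bytes else 0.0


def drive_has_ample_space(path: str, min_free: float = MIN_FREE_FRACTION) -> bool:
    """Return True when the drive holding `path` has at least `min_free` free."""
    usage = shutil.disk_usage(path)
    return free_fraction(usage.total, usage.free) >= min_free


def backup_is_recent(backup_file: Path,
                     max_age_hours: float = MAX_BACKUP_AGE_HOURS,
                     now: float = None) -> bool:
    """Return True when the backup file exists and was modified recently enough."""
    if not backup_file.exists():
        return False
    current = time.time() if now is None else now
    age_hours = (current - backup_file.stat().st_mtime) / 3600
    return age_hours <= max_age_hours
```

On a farm server this could be called as `drive_has_ample_space("C:\\")` and `backup_is_recent(Path(r"D:\Backups\farm.bak"))`, where the backup path is a made-up example and should be replaced with the real backup location.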
In the Event of Emergency Downtime
The following must be restored as quickly as possible:
- SQL Database
- Front End Server
- Search Server
- All Solutions/modifications applied
The different SharePoint functions should be checked after every downtime.
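One way to make the post-downtime check repeatable is to walk through the restored components and record a result for each. The sketch below is an illustration only: it assumes the listed order of the components is also the order in which you want to verify them, and the individual check functions are placeholders that you would supply yourself (for example a database ping or a request to a site home page).

```python
from typing import Callable, Dict, List, Tuple

# Components from the restore list, in the order they are listed.
RESTORE_ORDER = [
    "SQL Database",
    "Front End Server",
    "Search Server",
    "All Solutions/modifications applied",
]


def run_post_downtime_checks(
    checks: Dict[str, Callable[[], bool]]
) -> List[Tuple[str, str]]:
    """Run one check per component and return (component, status) pairs.

    `checks` maps a component name to a zero-argument function that returns
    True when the component works. Components without a check are reported
    as NOT CHECKED so that gaps stay visible.
    """
    results = []
    for component in RESTORE_ORDER:
        check = checks.get(component)
        if check is None:
            results.append((component, "NOT CHECKED"))
            continue
        try:
            ok = check()
        except Exception as exc:  # a crashing check counts as a failure
            results.append((component, f"FAILED ({exc})"))
            continue
        results.append((component, "OK" if ok else "FAILED"))
    return results
```

Reporting unchecked components explicitly, rather than skipping them, makes it harder to sign off a recovery while part of the farm is still unverified.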
To administer and manage SharePoint 2007 smoothly, the following tools are needed:
- Microsoft Office SharePoint Server 2007 MP for MOM 2005
- SharePoint Administration Toolkit
These tools help detect errors and problems that might otherwise go unnoticed.
Event Log Resolution Procedure
Once an error is detected in the Event log, the quick resolution process below should be started. Be aware that a resolution might take the whole day; sometimes it is a quick fix, and sometimes the fix has to wait until after close of work.
- Log this Error/Fault in the Daily Resolution KB on the SharePoint Administration Site.
- Diagnose the problem and propose a solution to this error/fault.
- Perform a Google search for similar problems and document the results. Make sure the results obtained from the search reflect the same problem that is occurring on the server.
- Discuss with Teksys and, if necessary, raise a call ticket.
- Copy the URL from the search and add it to your entry on the Daily Faults Resolution KB.
- Test resolution on the test environment, then staging environment.
- Discuss possibility of a change with Project Manager and/or Infrastructure Team leader in order to resolve this fault.
- Determine how long the resolution will take, when it will start, and the extent of any downtime or effect on the entire farm.
- Have a rollback plan.
- Raise an RFC for the change.
- Inform the CAB about the change and time.
- Effect the change, perform testing and confirm the functionality of the entire farm.
- Update the Issues Log.
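Because the procedure above is a fixed sequence, it can help to track each fault against it so that no step is skipped. The following is a rough Python sketch of such a tracker. The class name, the field names, and the shortened step labels are all assumptions made for illustration; they do not describe the actual structure of the KB on the SharePoint Administration Site.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Shortened labels for the thirteen steps of the procedure, in order.
RESOLUTION_STEPS = [
    "Log the fault in the KB",
    "Diagnose and propose a solution",
    "Search for similar problems and document results",
    "Discuss with Teksys / raise a call ticket",
    "Add the search URLs to the KB entry",
    "Test on test, then staging",
    "Discuss the change with PM / team leader",
    "Determine duration, start time and downtime",
    "Prepare a rollback plan",
    "Raise an RFC",
    "Inform the CAB",
    "Effect the change and confirm farm functionality",
    "Update the Issues Log",
]


@dataclass
class FaultEntry:
    """One fault being worked through the resolution procedure."""
    event_id: int                      # Event log ID of the fault
    summary: str
    reference_urls: List[str] = field(default_factory=list)
    completed: int = 0                 # number of steps finished, in order

    def next_step(self) -> Optional[str]:
        """Return the next outstanding step, or None when all are done."""
        if self.completed < len(RESOLUTION_STEPS):
            return RESOLUTION_STEPS[self.completed]
        return None

    def complete_next(self) -> None:
        """Mark the next outstanding step as finished."""
        if self.completed >= len(RESOLUTION_STEPS):
            raise ValueError("all steps are already complete")
        self.completed += 1

    def is_resolved(self) -> bool:
        return self.completed == len(RESOLUTION_STEPS)
```

For example, `FaultEntry(event_id=1234, summary="example fault").next_step()` returns the first label, "Log the fault in the KB"; the event ID here is a made-up value.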