Previously in this series, we have looked at filtering and monitoring requirements along with the thorny topic of the cloud. In the final part (for now) we will take a tour through servers and storage for those who want to retain their own hardware.
One interesting omission from the standard (in my opinion at least) is that it does not discuss the use of virtual vs physical server infrastructure. I would imagine that most larger-scale server operations these days are heavily virtualised, and virtualisation is generally a good option for most use cases, making some of the conversations around resilient design considerably less burdensome. It can also be extremely cost-effective compared to physical server roll-outs and, of course, makes any migration to cloud or hybrid solutions easier to implement.
All servers and related storage platforms should continue to work if any single component or service fails
The purpose of this element is to ensure that any onsite technology is as reliable as possible and that the risks of downtime are mitigated and minimised as far as possible. Building truly resilient infrastructure is hard. Like, really hard. I will one day write an article to demonstrate how hard it is to get it right. That said, there are some simple items listed within this standard that will at least get things going in the right direction:
- ensure there are multiple power supplies for servers that have an automatic cutover system in place
- ensure that all key elements are covered by a UPS with at least 30 minutes run-time (of which more below)
- ensure that any disk systems utilise some form of RAID or similar technology such that the loss of a drive can be accommodated with no loss of service (see the sketch after this list)
- ensure there are regular backups of the systems and data
- if possible have backup servers in the event of a failure
- ensure that valid manufacturer warranties are in place with service levels that meet business requirements
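As a quick illustration of the RAID point referenced above, here is a minimal sketch of the usable capacity and drive-failure tolerance you might expect from common RAID levels. It is a back-of-the-envelope aid only, not a substitute for your storage vendor's own sizing tools.

```python
# Minimal sketch: rough usable capacity and drive-failure tolerance for
# common RAID levels. Illustrative only - check what your controller or
# vendor actually supports before relying on these numbers.

def raid_summary(level: str, drives: int, drive_tb: float) -> dict:
    """Approximate usable capacity (TB) and how many drive failures
    can be tolerated without loss of service."""
    if level == "RAID1" and drives == 2:
        return {"usable_tb": drive_tb, "failures_tolerated": 1}
    if level == "RAID5" and drives >= 3:
        return {"usable_tb": (drives - 1) * drive_tb, "failures_tolerated": 1}
    if level == "RAID6" and drives >= 4:
        return {"usable_tb": (drives - 2) * drive_tb, "failures_tolerated": 2}
    if level == "RAID10" and drives >= 4 and drives % 2 == 0:
        # guaranteed to survive one failure; more only if failures land
        # in different mirror pairs
        return {"usable_tb": (drives // 2) * drive_tb, "failures_tolerated": 1}
    raise ValueError("unsupported level/drive count for this sketch")

# Example: eight 4 TB drives in RAID6 gives roughly 24 TB usable and
# survives the loss of any two drives.
print(raid_summary("RAID6", drives=8, drive_tb=4))
```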
It goes on to expand on the warranty line with a note that anything not covered by manufacturer warranty should have spares available, such as fans, power supplies, etc. I'd probably go a step further and say that would be a good idea anyway; there is little point waiting for someone to come to site to replace a component you could easily have in stock at low cost. Most manufacturers have schemes to allow staff to be certified to undertake some level of basic component swap as part of the warranty scheme. It is also worth noting that having equipment not covered by manufacturer warranty will count against you for Cyber Essentials accreditation purposes.
There are a couple of other items to note.
The UPS point is fairly standard and oft-repeated. However, what I've seen missed many times is ensuring that all components in the chain are covered by the UPS, so that the servers don't fail due to loss of storage, networking or some other critical element that isn't part of the UPS load. Once all those elements are included, it may require a more substantial UPS to give that 30-minute window.
Two further UPS-related points to note. Firstly, the 30-minute figure only provides cover for minor blips or a controlled shutdown. It is not designed to provide continuity of service for 30 minutes, and depending on the complexity of the environment it may take the full period to shut down all services. If you need more resilience or the ability to continue to operate for longer, more heavy-duty options such as generators may be needed (file under “resilience is hard”, cross-referenced with “schools and colleges aren’t well placed to run commercial-grade data centres”).
The second UPS point is that 30 minutes out of the box may well degrade to 20 minutes after a couple of years, depending on the batteries used. Like all batteries, UPS batteries lose effectiveness over time and must be monitored regularly to check what they can actually do vs what they could do when new. Replacement of UPS batteries should also be part of the annual budget cycle, as the last thing you need is for things to flip to UPS only for it to shut down in seconds.
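To make both of those points a little more concrete, here is a minimal back-of-the-envelope sketch. The load figures, battery capacity and ageing factor are all illustrative assumptions rather than values from the standard.

```python
# Minimal sketch: estimate UPS runtime for the full equipment chain,
# allowing for battery ageing. All figures are illustrative assumptions.

def runtime_minutes(capacity_wh: float, load_w: float,
                    battery_health: float = 1.0,
                    inverter_efficiency: float = 0.9) -> float:
    """Very rough runtime estimate: usable energy divided by load."""
    usable_wh = capacity_wh * battery_health * inverter_efficiency
    return usable_wh / load_w * 60

# Everything the servers depend on must be on the UPS, not just the servers.
load_w = sum({
    "servers": 1200,
    "storage_array": 400,
    "core_switching": 150,
    "firewall_router": 60,
}.values())

new_batteries = runtime_minutes(capacity_wh=2000, load_w=load_w)
aged_batteries = runtime_minutes(capacity_wh=2000, load_w=load_w,
                                 battery_health=0.7)  # a few years old
print(f"New batteries: {new_batteries:.0f} min, aged: {aged_batteries:.0f} min")
```

The point of the exercise is simply that once storage, switching and the firewall are added to the load, and the batteries have aged a little, a unit that comfortably gave 30 minutes on day one may no longer do so.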
At the top of this part of the standard there is, of course, a reference to reducing the use of local servers in favour of cloud options. Not to rehash part 2 of the series, but when considering cloud options it is worth noting that designing for resilience may well still be needed, particularly for IaaS deployments. More than one large web-based service has gone dark because they had all their infrastructure eggs in a single location of their chosen cloud provider. Usually the blame gets placed on the cloud provider, but I disagree. The likes of AWS, Microsoft and Google have multiple regions for a reason; if you choose not to utilise them, it's on you. Then again, I may have mentioned resilience is hard.
Servers and related storage platforms must be secure and follow data protection legislation
This is a bit of a repeat section that has cropped up in more than one element of the standards and should for the most part already be part of the data protection toolkit used by organisations. If you already comply with data protection regulations, have a robust backup process and have read and understood the cyber security standards for schools and colleges then you will already mostly comply with this standard. If you don’t, get help quickly!
There is one throw-away line here, though, that may benefit from a little further discussion. One of the risks identified for damage or data loss is human error through poor management.
One personal pet peeve of mine, having worked in a number of educational environments to help their IT teams, is how often there is a total lack of change management in place to protect the infrastructure from mistakes.
There is perhaps a misunderstanding that change management = bureaucratic and time-sapping, but this need not be the case. It is relatively simple to create a process that manages change, considers impacts and back-out options, and communicates, documents and authorises changes without adding a huge amount of time or resource overhead. Generating change templates also brings a greater degree of repeatability to regular change activities.
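As an illustration of how lightweight such a template can be, here is a minimal sketch of a change record as a simple data structure. The fields are my own suggestion rather than anything prescribed by the standard.

```python
# Minimal sketch of a lightweight change record. The fields are a
# suggestion only - adapt to whatever your team will actually fill in.
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ChangeRecord:
    title: str
    requested_by: str
    systems_affected: list[str]
    impact_assessment: str            # who/what is affected and when
    backout_plan: str                 # how to reverse the change if it fails
    scheduled_for: datetime
    approved_by: str = ""             # left blank until authorised
    communicated_to: list[str] = field(default_factory=list)
    completed: bool = False

change = ChangeRecord(
    title="Apply firmware update to primary storage array",
    requested_by="Network Manager",
    systems_affected=["storage array", "virtual server hosts"],
    impact_assessment="Brief I/O pause expected; run outside teaching hours",
    backout_plan="Roll back to previous firmware from the controller console",
    scheduled_for=datetime(2024, 7, 30, 18, 0),
)
```

Whether this lives in a spreadsheet, a ticketing system or a shared document matters far less than the fact that impact, back-out, communication and authorisation are captured before the change happens.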
All servers and related storage platforms should be energy-efficient and set up to reduce power consumption, while still meeting user needs
Modern server and storage hardware has made great strides in energy efficiency, and when looking for replacement hardware it is always worth undertaking a detailed sizing exercise to understand whether you can achieve the required performance with less physical hardware. However efficient a server is, not having to buy and power it at all is more efficient still, so fewer servers are preferable provided performance requirements can be met.
Although we are starting to strain Moore's Law, if you replace your server and storage hardware every 5 to 10 years, the increase in performance will almost certainly allow you to do more with less. In the most recent virtual server host replacements I've worked on, the reduction was significant, almost halving the number of physical servers while delivering a moderate increase in overall CPU performance.
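To illustrate the kind of arithmetic a sizing exercise involves, here is a minimal sketch of a consolidation estimate. The benchmark scores and utilisation figures are invented for the example; in practice you would substitute published CPU benchmark figures or, better, measured utilisation from your own hosts.

```python
# Minimal sketch: how many new hosts might be needed to replace an
# ageing cluster? All figures below are invented for the example.
import math

old_hosts = 6
old_score_per_host = 15_000   # per-host CPU benchmark score (assumed)
peak_utilisation = 0.55       # measured peak across the old cluster

new_score_per_host = 45_000   # newer CPUs score far higher per host
headroom = 1.3                # 30% allowance for growth
n_plus_one = 1                # tolerate the loss of one host

required_capacity = old_hosts * old_score_per_host * peak_utilisation * headroom
new_hosts = math.ceil(required_capacity / new_score_per_host) + n_plus_one

print(f"Required capacity: {required_capacity:,.0f} benchmark units")
print(f"New hosts needed (including N+1): {new_hosts}")
```

With these example figures, six ageing hosts consolidate to three new ones including an N+1 spare, which is very much the shape of reduction I have seen in practice.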
However, the key phrase in the headline for this element is "while still meeting user needs". Increasingly, demands for access to services go beyond the school or college day and are not even restricted to term time. Reducing power to servers or storage will generally result in reduced performance, so care has to be taken not to make assumptions about when services are being used.
Of course, if you are using virtual servers, something like VMware gives you access to highly optimised power management features that adaptively scale power usage depending on load, which makes life easier in this regard. If you are using current versions of the software on modern hardware, most energy efficiency features will be turned on by default.
In terms of specific advice the standard suggests:
- specify servers that are designed to be energy efficient, with the ENERGY STAR label or equivalent
- include a requirement that all server and related storage platforms are set up to reduce energy consumption
- ensure the solution meets your immediate needs and plans for growth, but does not go beyond that
- ensure the solution is easy to maintain and repair
Put simply, buy modern hardware from well-respected vendors, make sure to turn on any power optimisation features (that aren’t on by default), don’t over-egg the pudding and make sure you can fix it if it goes wrong.
The final part of that is more overall sustainability than pure power but in truth, if you are also complying with the requirement to have maintenance agreements in place then chances are you’ll also tick that box.
I noted earlier the desire to undertake a sizing exercise when considering new hardware. This also helps with the over-egging the pudding issue as there can be a tendency, no offence to my IT colleagues, to want the best and shiniest technology rather than what is required.
Historically, given the often patchy approach to replacement within educational institutions, the aim was to buy the most performance available within the budget to allow for growth over what could be 5, 10 or even more years until new investment was available. This was understandable but not an ideal position.
In the current technology landscape, the discussion should be about what performance your solution is currently giving you, where any deficiencies are (often storage performance rather than processing power) and what services you might want to move to a cloud-based solution over the life of the new hardware. In mature infrastructures, I’d be surprised if that review didn’t point to less hardware than people think.
All servers and related storage platforms should be kept and used in an appropriate physical environment
Anyone who has spent time in schools or colleges should probably look away now because, unless they have had the good fortune of a new build in recent years, and a new build which met the actual standards and didn’t have space “value engineered” out, what I’m about to describe will be a country mile away from reality.
The first section covers the actual HSE and British Standards for the size of a server room. These are also built into the generic design brief for any DfE build project. Put simply, the absolute minimum size of a server room with a single cabinet would be 3.4m x 2.2m, with additional cabinets bringing incremental size increases. To reiterate: these aren't nice-to-haves, these are minimum standards.
It goes on to state that the space should also:
- have servers mounted or stored in cabinets
- be free from flammable items such as paper, clothing, solvents and chemicals
- have a dedicated power supply
- have sufficient cooling or mechanically assisted ventilation to keep equipment within the manufacturer's recommended temperature guidelines
The room must not:
- contain battery-powered end user devices such as laptops
- have any windows or be accessible from a classroom
- store any liquids
It also notes that you should check for potential threats such as water sources above, below or adjacent to the space that could leak into it.
In reality, I have seen server rooms in schools and colleges that:
- were inside cupboards, including one inside a cleaner's store-room alongside water, chemicals and paper towels
- doubled up as the main store for incoming IT hardware, with piles of laptops, phones and other equipment
- were used as a store for cardboard boxes full of cables and other components
- also stored the water for the water dispensing units
- had office-grade air-con that barely kept the temperature down to the maximum operating temperature
- were fed power from a standard 13-amp socket
To make matters worse, an element only covered in passing in the standard, namely the need to ensure that the space is secure, was not even remotely met; in some cases access was available to any member of staff.
Just to restate, the above recommendations are the minimum requirements for a server room location. That so many schools and colleges don’t even get close is an indicator that this is not an issue that is widely understood. It is also another of the reasons I don’t believe that schools and colleges are well placed to host professional-level data centres in most cases.
The standard doesn’t even touch on many items that would be base level for data centre implementations such as raised floors, redundant and resilient power (including backup power sources), resilient cooling, fire suppression, environmental monitoring such as temperature and humidity sensors, access control and CCTV to log anyone entering the space.
If you have arrived at this article by way of part 2 on cloud you will know that I believe there are pros and cons for that approach. However, the fact that so many schools and colleges can’t get the space to house the servers right should be a sign that there are no easy answers.
That brings to a close this second batch of articles on the DfE standards. If anything within these pieces or the standards themselves sparks a desire to know more, or to explore the issues and how they affect your organisation, please get in touch and I'll try to help.