Building a Network Operations Center

The Case

Our customer's network operations center (NOC) was US-based, yet it covered operations 24 hours a day, 7 days per week. This required maintaining a solid staff schedule in a single location and a single time zone. Such a setup meant additional cost on one hand and, on the other, specialists working schedules far removed from the standard 09:00 to 18:00. Employee job satisfaction suffered because of the night shifts.

Our customer was looking to diversify its time zone coverage, ideally with a setup at a 12-hour time zone difference from the US-based team. In addition, the NOC team had a long pipeline of automation work that it could not get to, due to a lack of resources and, in some cases, a lack of expertise. The customer wanted a mirrored team of highly qualified consultants that would help take the NOC's operations to the next level.

 

What is a Network Operations Center (NOC)?

The overall function of a NOC is to maintain optimal network operations across a variety of platforms, mediums and communication channels. Its staff monitor power failures, communication line alarms and performance issues that may affect networks. NOCs analyze problems, perform troubleshooting, communicate with site technicians and track issues until they are resolved. A network operations center also serves as the main focal point for software troubleshooting and distribution, router and domain name management, coordination with affiliated networks, and performance monitoring.
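Much of this monitoring boils down to periodic checks that classify a measurement against thresholds and escalate anything unhealthy. As a purely illustrative sketch (the function names and thresholds here are hypothetical, not the customer's actual tooling), a minimal Nagios-style check could look like this:

```python
# Minimal sketch of a Nagios-style threshold check. Illustrative only:
# the names and thresholds are assumptions, not the customer's tooling.

OK, WARNING, CRITICAL = 0, 1, 2  # conventional Nagios plugin exit codes


def classify(value, warn_threshold, crit_threshold):
    """Map a measured value (e.g. latency in ms) to a Nagios-style state."""
    if value >= crit_threshold:
        return CRITICAL
    if value >= warn_threshold:
        return WARNING
    return OK


def check_latency(latency_ms, warn=200, crit=500):
    """Return (exit_code, status_line) for a hypothetical latency check."""
    state = classify(latency_ms, warn, crit)
    label = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}[state]
    # A Nagios-compatible scheduler reads the status line and exit code.
    return state, f"LATENCY {label} - response time {latency_ms}ms"
```

In this convention, the scheduler runs the check on an interval and uses the exit code (0/1/2) to drive alerting and escalation.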

Building the Network Operations Center

After the official go-ahead for the engagement, it took ScaleFocus less than one month to pick up the initial leadership of the team and two more months to have a fully operational NOC setup. Within the first four months, the team was already integrated with the US NOC and had taken over the main responsibility. By the sixth month, the Bulgaria-based NOC was improving and automating various processes and monitoring solutions, increasing the overall productivity of the entire NOC operation.

As part of the solution, the NOC team in Bulgaria focused on the following tasks:

  • Executed routine day-to-day operations as part of SLA monitoring of business services.
  • Assisted the System Engineering team during regular server patching.
  • Improved the monitoring value for the NOC by integrating all of the key source monitoring solutions.
  • Implemented monitoring checks for newly released services, ensuring faster time-to-market and better visibility of critical business services.
  • Executed brownout sessions to prepare the engineering teams for peak loads.
  • Performed testing of automated monitoring and notification services for an effective incident lifecycle.
  • Became an integral part of the customer's regular release processes.
  • Assisted the System Engineering and Network Engineering teams with Top-of-Rack migrations.
  • Helped the Information Security team with 24/7 monitoring of security events.
  • Helped software engineers in Bulgaria investigate and resolve issues with continuous integration and continuous delivery tools.
  • Helped the System Engineering team maintain the existing configuration management tools and resolve issues affecting hosts.
  • Tuned and updated the alerting mechanism for the time-series metric cluster to withstand the increased number of checks during peak loads.
  • Fine-tuned the existing monitoring solution, which collects, stores, aggregates and displays time-series data in real time, to optimize cluster performance under the increased load from all microservices.
  • Performed monitoring and incident management.
  • Handled service requests and server provisioning.
  • Monitored the performance and capacity of the entire infrastructure.
  • Took responsibility for triaging, tracking and troubleshooting ongoing incidents based on well-defined operational procedures and runbooks.
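Several of the items above concern keeping time-series alerting reliable under peak load. One common technique for this, sketched here under assumed semantics (this is not the team's actual Grafana/Seyren configuration), is to fire an alert only when a threshold is breached for several consecutive samples, which suppresses flapping caused by transient spikes:

```python
from collections import deque


class DebouncedAlert:
    """Fire only after `consecutive` samples exceed `threshold`.

    Illustrative sketch: the class name and parameters are hypothetical,
    not the NOC's actual alerting configuration.
    """

    def __init__(self, threshold, consecutive=3):
        self.threshold = threshold
        # Rolling window of the last `consecutive` breach flags.
        self.window = deque(maxlen=consecutive)

    def observe(self, value):
        """Record one sample; return True only when the alert should fire."""
        self.window.append(value > self.threshold)
        # Fire only when the window is full and every sample breached.
        return len(self.window) == self.window.maxlen and all(self.window)
```

For example, with a threshold of 80 and three required breaches, feeding the samples 10, 90, 95, 91 fires the alert only on the last sample, because the first breach run is interrupted by nothing but still needs three consecutive hits.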

Technologies Used

 

Jenkins, Grafana/Graphite, Maven, Artifactory, Nagios, BigPanda, SolarWinds, Neustar, AppDynamics, Git, SVN, SignalFx, CloudWatch, ThousandEyes, Seyren, SmokePing, Splunk, Pingdom

The Achievements

  • Optimized the 24-hour work day and created new shifts within several months.
  • Automated many repetitive and time-consuming monitoring and support processes.
  • Provided constant monitoring of incidents.
  • Took responsibility for more than 1,000 on-premise servers, more than 1,500 virtual machines and the services running on them.
  • Cleaned up and fixed more than 600 broken checks.
  • Implemented a short-circuit escalation process as part of the four seasons initiative (an integration between the existing source monitoring solutions).
  • Reduced costs at the US location by eliminating night shifts there.
  • Allowed the US team to focus on more productive, higher value-added work, as the Bulgaria-based team automated the low-level work.

The Customer

The Client is a consumer-focused, Silicon Valley-based digital e-commerce business that specializes in photo-based personalized products and mass-customized print-on-demand. The company grew extensively, both organically and through multiple acquisitions of new brands.

The customer operates three state-of-the-art manufacturing facilities. In 2017, it served 10 million unique customers, processed more than 27 million orders and hosted 30 billion photos. Its 2017 annual revenue was close to 2 billion.