Network Status Page and Metrics Aggregator

Team

Geometry Labs

Category

Infrastructure

Description

Status page for the network that aggregates network health into a single dashboard for public use

Dates

April 1, 2020 ~ Dec. 31, 2020

Progress

50%

Status

In Progress

  • Details

    Link to original project description

    TL;DR

    • Status pages are used to give the public a high level overview of the network's services
    • We are building a status page at the request of the foundation
    • The status page will give an overview of the overall health of the network, displaying incidents and planned updates
      • Demo can be found at icon.status-page.net
      • The deployment is fully defined in code and is ready to be put into production
    • Many features are planned for the tool to interface with a broader monitoring and alerting platform

    Overview


    When offering large sets of services to the public, software companies generally run a status page that summarizes the health of their network. For instance, when GitHub experiences an outage, you can see it at githubstatus.com with a description of the interruption. Status pages are generally very high level and are primarily used to communicate service outages and planned maintenance events to users. They can also be configured to notify users of various events by email and other means.

    The ICON foundation requested that Insight build a status page giving a high level overview of their services. In response, Insight put together demos of several open source offerings and settled on Cachet, a well supported open source status page tool. Cachet serves as a central hub into which Telegram bots, alarms, notifications, and, to a small degree, metrics can be fed and then displayed in a format that is easy to view and manage. While Cachet has many easy to manage features, it doesn't take the place of a full monitoring solution such as Prometheus and Grafana, which is better suited to node operators and is covered by the ICON Network Monitoring project. Cachet is instead meant to stay very high level and to communicate events to end users who, in ICON's case, tend to be application developers and users.

    Currently the status page can be found at icon.status-page.net, with a development environment at cachet.blockstatus.net. The code can be found on our GitHub, and because the whole deployment is done with Terraform and Ansible, the page can easily be migrated to any domain. We will keep a development status page and welcome contributions from anyone who wants to customize the appearance or graphics. Cachet is written in PHP and has an active community.

    Process diagram - Visualization of the components in this project.

    We have built a Telegram bot to update incidents and are working with the foundation to customize their action response plan. We also built a metrics collection agent that measures response latency on the four main networks supported by ICON: main net and three test nets. In the process diagram above, everything not directly connected to Prometheus is in the late stages of development as of 4/12/2020. The main work going forward is connecting the metrics collection system so that it feeds high level indicators into the status page, which will display network health stats as both a current indicator and a time series.
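
    As a rough illustration of what such a latency agent does, the sketch below times a probe call, takes the median of a few samples, and shapes the result as a metric point for Cachet. It is not the project's actual agent. The network names, the function names, and the injected `probe` and `clock` callables are hypothetical, and the body layout assumes Cachet's metric point endpoint (POST /api/v1/metrics/{id}/points with `value` and `timestamp` fields). A real probe would be an HTTP request to a node endpoint.

```python
import time
from statistics import median
from typing import Callable, Dict

# Hypothetical labels: the source only says "main net and three test nets".
NETWORKS = ["mainnet", "testnet-1", "testnet-2", "testnet-3"]


def time_call(probe: Callable[[], object],
              clock: Callable[[], float] = time.perf_counter) -> float:
    """Return how long one probe call takes, in milliseconds."""
    start = clock()
    probe()
    return (clock() - start) * 1000.0


def sample_latency(probe: Callable[[], object], samples: int = 5,
                   clock: Callable[[], float] = time.perf_counter) -> float:
    """Median of several timings, which damps one-off spikes."""
    return median(time_call(probe, clock) for _ in range(samples))


def metric_point(value_ms: float, timestamp: int) -> Dict[str, object]:
    """Request body for one metric point (assumed Cachet field names)."""
    return {"value": round(value_ms, 2), "timestamp": timestamp}
```

    Passing the clock in as a parameter keeps the timing logic testable without touching the network; in production the default `time.perf_counter` would be used.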

    Development Roadmap


    Cachet interfaces with a number of different systems through customized tools, but it doesn't do a good job of collecting metrics from the nodes themselves and aggregating them in ways that produce useful alarms. For that, best in class tools from the Prometheus ecosystem will be used. Development will therefore shift focus to the monitoring project, which will then be integrated into the main status page. The idea is to use the full functionality of Cachet without doing too much customization at the data aggregation level, and to leave the brunt of that work to professional tools. The two projects will merge at some point, so that incidents and reports can be shown from a variety of sources with customized thresholds. This will allow us to tune alarm parameters to give users the right amount of detail without raising alarms for minor events.
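
    One way to picture the threshold tuning described above is a small debounce rule: a degraded status is reported only when every sample in a recent window is bad, so a single slow response doesn't raise an alarm. This is a minimal sketch under stated assumptions, not the project's alerting logic. The 500 ms threshold, the window of three samples, and the function names are hypothetical; the integer codes follow Cachet's component status values (1 operational, 2 performance issues, 4 major outage).

```python
from typing import Optional, Sequence

# Cachet component status codes (3, partial outage, is unused in this sketch).
OPERATIONAL = 1
PERFORMANCE_ISSUES = 2
MAJOR_OUTAGE = 4


def classify(latency_ms: Optional[float], warn_ms: float = 500.0) -> int:
    """Map one sample to a status; None means the probe got no response."""
    if latency_ms is None:
        return MAJOR_OUTAGE
    return PERFORMANCE_ISSUES if latency_ms > warn_ms else OPERATIONAL


def debounced_status(samples: Sequence[Optional[float]],
                     warn_ms: float = 500.0, window: int = 3) -> int:
    """Report the mildest status seen in the last `window` samples.

    A degraded status therefore appears only when all recent samples
    are degraded, which filters out isolated spikes.
    """
    recent = list(samples)[-window:]
    if len(recent) < window:
        return OPERATIONAL  # not enough evidence yet
    return min(classify(s, warn_ms) for s in recent)
```

    Raising `window` makes the page slower to react but quieter; lowering `warn_ms` makes it more sensitive. In the planned setup this kind of rule would more likely live in Prometheus alerting rules than in custom code.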

    Links


    Tools Research - Collection of different tools that were evaluated prior to choosing Cachet as the final platform 

    Repos Master DB - Repos associated with this project 

     

  • Updates

    [6/17/2020]

    - With block production metrics now being collected across the network in Prometheus, we are going to start working on the status page again after setting the project aside for a couple of months.

    - Cachet can already send alerts by email, but we are opting to build a custom Telegram bot whose alerts are driven by Alertmanager. This is the more professional way to handle alerting.
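
    To make the Alertmanager-to-Telegram path concrete, the sketch below turns an Alertmanager webhook payload into the parameters for a Telegram Bot API `sendMessage` call. It is an illustration only, not the bot being built. The function names and the message layout are hypothetical, and it assumes alerts carry the conventional `alertname` label and `summary` annotation. No request is sent here; the returned dictionary would be posted to the bot's `sendMessage` endpoint.

```python
from typing import Any, Dict


def format_alerts(payload: Dict[str, Any]) -> str:
    """Render each alert in an Alertmanager webhook payload as one text line."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        state = alert.get("status", "unknown").upper()
        name = labels.get("alertname", "unnamed alert")
        line = f"[{state}] {name}"
        summary = annotations.get("summary", "")
        if summary:
            line += f": {summary}"
        lines.append(line)
    return "\n".join(lines) or "No alerts in payload"


def send_message_params(chat_id: str, payload: Dict[str, Any]) -> Dict[str, str]:
    """Build the `chat_id` and `text` fields for Telegram's sendMessage."""
    return {"chat_id": chat_id, "text": format_alerts(payload)}
```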