An incident management process is a set of procedures and actions taken to respond to and resolve critical issues. Only issues that satisfy all three conditions below qualify as incidents:

  • impact the normal operation of the critical service

  • are noticeable to external or internal users of Knoema

  • have a significant negative impact on the user experience
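
The all-three-conditions rule above can be read as a simple predicate. The following is a minimal sketch, assuming a hypothetical `Issue` record with one boolean field per condition; the class and field names are illustrative only and do not come from any existing tooling.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields, one per condition in the list above.
    impacts_normal_operation: bool
    noticeable_by_users: bool
    degrades_user_experience: bool


def is_incident(issue: Issue) -> bool:
    """Return True only when all three conditions hold at once."""
    return (
        issue.impacts_normal_operation
        and issue.noticeable_by_users
        and issue.degrades_user_experience
    )
```

An issue that meets only one or two of the conditions would be handled as a regular support ticket rather than an incident under this reading.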

...

5 Steps of Incident Management

Roles

Incident Manager

...

Konstantin Trukhin

Responsibilities of the Incident Manager: collect information about open incidents and communicate their status to the CSM team in the #platform-support Slack channel.

Incident Team

...

The incident team is a group of engineers who help the incident manager with the technical side of the investigation. Its membership changes from sprint to sprint.

Incident SLAs

Severity

General description of severity

Status update frequency

1 - Critical

  • Critical client service or its critical client components are not responding

  • Users cannot complete routine tasks

  • Restarting the services or servers does not help

Every 30 min

2

  • Client data is lost

  • Confidential or sensitive data is leaked

  • Workaround is not available

Every hour

2 - High

  • Critical client service or its critical client components are not accessible for some of the clients

  • Critical client service response is heavily delayed (load time increased by 200%)

  • Users cannot normally complete most typical tasks

  • Restarting the services or servers helps only for a short time

  • Workaround is available

Every 2 hours

3 - Medium

  • A high-value customer reported an issue

  • The issue impacts a very common capability of the critical client service and is noticeable by many customers

  • Critical client service response is delayed, but this does not prevent clients from using the service

  • The issue appears on a regular basis

  • Restarting the services or servers solves the issue

  • Workaround is available

Twice a day

4 - Low

  • Certain non-critical components of the service are not functioning normally, but the issue has already been reported by some customers

  • Issues appear sporadically and do not affect many clients

  • The issue has not been reported by users and doesn’t impact any important components or performance of the critical service

Daily

NOTE: Status update frequencies apply on business days during working hours.
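
The status update frequencies above could be encoded as a lookup table, for example to drive reminders. This is a rough sketch under several assumptions: severity 2 (High) is read as "every 2 hours", "twice a day" and "daily" are approximated as fixed 12-hour and 24-hour intervals, and the business-days rule from the NOTE is not modelled. The names `UPDATE_INTERVAL` and `next_update_due` are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed intervals per severity level; see the caveats in the text above.
UPDATE_INTERVAL = {
    1: timedelta(minutes=30),  # Critical: every 30 min
    2: timedelta(hours=2),     # High: assumed to be every 2 hours
    3: timedelta(hours=12),    # Medium: "twice a day" (approximation)
    4: timedelta(days=1),      # Low: "daily" (approximation)
}


def next_update_due(severity: int, last_update: datetime) -> datetime:
    """Return when the next status update is due.

    Working hours and business days are ignored in this sketch.
    """
    return last_update + UPDATE_INTERVAL[severity]
```

A real scheduler would also need to skip weekends and non-working hours, as the NOTE requires.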

Severity Matrix explained

...

Incident logging

Each incident should be added to the Jira project: https://knoema.jira.com/jira/software/c/projects/IN/boards/57

Incident priority should be set according to the severity levels described above.

Postmortems should be written for incidents with Severity 1 and 2.
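
The two rules above (priority follows severity, postmortems for Severity 1 and 2) can be sketched as follows. This assumes the Jira priority names simply mirror the severity names used on this page; the actual priority scheme in the IN project may differ, and both function names are hypothetical.

```python
# Assumption: priority labels mirror the severity names on this page.
SEVERITY_TO_PRIORITY = {
    1: "Critical",
    2: "High",
    3: "Medium",
    4: "Low",
}


def priority_for(severity: int) -> str:
    """Map a severity level to the priority label to set on the ticket."""
    return SEVERITY_TO_PRIORITY[severity]


def requires_postmortem(severity: int) -> bool:
    """Postmortems are written for Severity 1 and 2 incidents only."""
    return severity in (1, 2)
```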

...
