Incident Management

Incident management is the set of procedures and actions used to respond to and resolve critical issues. An issue qualifies as an Incident only if it satisfies all three of the conditions below:

  • It impacts the normal operation of a critical service

  • It is noticeable to external or internal users of Knoema

  • It has a significant negative impact on the user experience
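
The three-condition rule above can be sketched as a simple check. This is only an illustration: the `IssueReport` class and its field names are hypothetical and are not part of any existing Knoema tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IssueReport:
    # Hypothetical fields, one per condition listed above.
    impacts_critical_service: bool
    noticeable_to_users: bool
    significantly_degrades_ux: bool


def is_incident(report: IssueReport) -> bool:
    """Return True only when all three conditions hold at once."""
    return (
        report.impacts_critical_service
        and report.noticeable_to_users
        and report.significantly_degrades_ux
    )
```

For example, an issue that hits a critical service and is visible to users, but does not significantly degrade the user experience, would not be treated as an Incident under this rule.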

Reference material on handling incidents:

https://www.atlassian.com/ru/incident-management/handbook/incident-response#assess

5 Steps of Incident Management

Roles

Incident Manager

@Konstantin Trukhin

Responsibilities of the Incident Manager: collect information about open incidents and communicate their status to the CSM team in the #platform-support Slack channel.

Incident Team

The incident team consists of engineers who support the incident manager with the technical side of the investigation. Team membership rotates from sprint to sprint.

  • Data OASIS - @Niyaz Batyrov

  • Enterprise Core and Data Hub - @Alexey Matyukhin

  • Expert Tools - @Vitaly Popov

 

Incident SLAs

Severity

General description of severity

Status update frequency

1 - Critical

  • A critical client service or its critical client components are not responding

  • Users cannot complete routine tasks

  • Restarting the services or servers does not help

  • Client data is lost

  • Confidential or sensitive data is leaked

  • No workaround is available

Every hour

2 - High

  • A critical client service or its critical client components are not accessible to some clients

  • Critical client service response is heavily delayed (load time increased by 200%)

  • Users cannot complete most typical tasks normally

  • Restarting the services or servers helps only for a short time

  • A workaround is available

Every 2 hours

3 - Medium

  • Either a high-value customer reported the issue, or the issue impacts a very common capability of the critical client service and is noticeable to many customers

  • Critical client service response is delayed, but the delay does not prevent clients from using the service

  • The issue recurs on a regular basis

  • Restarting the services or servers solves the issue

  • A workaround is available

Twice a day

4 - Low

  • Certain non-critical components of the service are not functioning normally, but some customers have already reported the issue

  • The issue appears sporadically and does not affect many clients

  • The issue has not been reported by users and does not affect any important components or the performance of the critical service

Daily

NOTE: Status update frequencies apply only during working hours on business days.
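
As a rough illustration of how the update frequencies and the working-time note combine, the sketch below computes when the next status update is due. It rests on several assumptions that are not stated in this document: a working day of 09:00 to 17:00, Monday to Friday, no public holidays, and "Twice a day" and "Daily" read as every 4 and every 8 working hours. The names `UPDATE_INTERVAL_HOURS`, `WORK_START`, `WORK_END`, and `next_update_due` are hypothetical.

```python
from datetime import datetime, time, timedelta

# Working hours between status updates, keyed by severity level.
# Levels 3 and 4 assume an 8-hour working day (an assumption, see above).
UPDATE_INTERVAL_HOURS = {1: 1, 2: 2, 3: 4, 4: 8}

WORK_START = time(9, 0)   # assumed start of the working day
WORK_END = time(17, 0)    # assumed end of the working day


def is_working_time(moment: datetime) -> bool:
    """True on Monday to Friday between WORK_START and WORK_END."""
    return moment.weekday() < 5 and WORK_START <= moment.time() < WORK_END


def next_update_due(severity: int, last_update: datetime) -> datetime:
    """Advance minute by minute, counting only working minutes."""
    remaining_minutes = UPDATE_INTERVAL_HOURS[severity] * 60
    moment = last_update
    while remaining_minutes > 0:
        if is_working_time(moment):
            remaining_minutes -= 1
        moment += timedelta(minutes=1)
    return moment
```

Under these assumptions, a Severity 1 incident last updated on a Friday at 16:30 would be due for its next update on the following Monday at 09:30, because the time outside working hours is not counted.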

 

Severity Matrix explained

Incident Logging

Each incident should be added to the Jira project: https://knoema.jira.com/jira/software/c/projects/IN/boards/57

Incident priority should be set according to the incident's severity, as described above.

Postmortems should be written for Severity 1 and Severity 2 incidents.