An incident management process is a set of procedures and actions taken to respond to and resolve critical issues. Only issues that satisfy all three conditions below qualify as incidents:
The issue impacts normal operation of the platform.
The issue is noticeable by external or internal users of Knoema.
The issue significantly degrades the user experience.
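As an illustration, the three-condition rule above can be written as a simple check. This is only a sketch: the `Issue` class and its field names are hypothetical and do not correspond to any existing tool or ticket schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Issue:
    # Hypothetical flags, one per condition listed above.
    impacts_normal_operation: bool
    noticeable_by_users: bool
    significantly_degrades_ux: bool


def is_incident(issue: Issue) -> bool:
    """Return True only when all three conditions hold."""
    return (
        issue.impacts_normal_operation
        and issue.noticeable_by_users
        and issue.significantly_degrades_ux
    )
```

An issue that meets only one or two of the conditions is handled as a regular issue rather than as an incident.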
Reference documents on how to work with incidents:
https://www.atlassian.com/ru/incident-management/handbook/incident-response#assess
5 Steps of Incident Management
Roles
Incident Manager
Konstantin Trukhin
Responsibilities of the Incident Manager: collect information about open incidents and communicate their status to the CSM team in the #platform-support Slack channel.
Incident Team
The incident team is a group of engineers who help the incident manager with the technical side of the investigation. Team membership changes from sprint to sprint.
Data OASIS - Niyaz Batyrov
Enterprise Core and Data Hub - Alexey Matyukhin
Expert Tools - Alexey Filippov
Incident SLAs
Severity | General description of severity | Status update frequency |
---|---|---|
1 | The whole platform or its critical components are not responding and users cannot complete routine tasks | Every 30 min |
2 | Platform response is heavily delayed (load time increased by 200%) and users cannot complete most typical tasks normally | Every hour |
3 | Either a high-value customer reported the issue as critical, or the issue impacts a very common capability of the platform and is noticeable by many customers | Twice a day |
4 | Certain non-critical components of the platform are not functioning normally and the issue has already been reported by some customers | Daily |
5 | The issue has not been reported by users and doesn’t impact any important components or performance of the platform | Daily |
Postmortems should be written for incidents with Severity 1 and 2.
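The SLA table and the postmortem rule above can be expressed as a small lookup, for example to drive reminders for status updates. This is a minimal sketch, not an existing tool: the names `STATUS_UPDATE_INTERVAL`, `next_status_update`, and `requires_postmortem` are hypothetical, and "twice a day" is assumed to mean every 12 hours.

```python
from datetime import datetime, timedelta

# Status update frequency per severity, taken from the SLA table.
STATUS_UPDATE_INTERVAL = {
    1: timedelta(minutes=30),
    2: timedelta(hours=1),
    3: timedelta(hours=12),  # assumption: "twice a day" = every 12 hours
    4: timedelta(days=1),
    5: timedelta(days=1),
}

# Severities for which a postmortem should be written.
POSTMORTEM_SEVERITIES = {1, 2}


def next_status_update(severity: int, last_update: datetime) -> datetime:
    """Return the time by which the next status update is due."""
    if severity not in STATUS_UPDATE_INTERVAL:
        raise ValueError(f"Unknown severity: {severity}")
    return last_update + STATUS_UPDATE_INTERVAL[severity]


def requires_postmortem(severity: int) -> bool:
    """Return True when the incident severity calls for a postmortem."""
    return severity in POSTMORTEM_SEVERITIES
```

For example, for a Severity 1 incident whose last update was posted at 12:00, `next_status_update` returns 12:30 on the same day, and `requires_postmortem(1)` returns `True`.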