Data Infrastructure

Work-in-Progress Diagram

Purpose

This page is an ongoing attempt to document the existing Knoema data infrastructure. The diagram currently represents what @Ian Cook has noted down during his initial conversations with people in Russia and India.

The diagram above and statements below are clearly incomplete and likely incorrect.

We will use this page to work on improving and correcting them as a team.

Scope

This page focuses on the technical infrastructure used to move data from a source of any kind to its eventual presentation to a client, whether through Knoema’s primary site, client portals, or APIs and tools. It is restricted to technical details and does not address how Knoema discusses access with clients, names products, or groups items under marketing terminology.

Collaborating

Any person on the Data Team who has relevant insights is encouraged to contribute. The diagram is generated via Python code in this repo. Anyone who wants to update the code directly is welcome to do so, though it is not required; Ian will be updating it as well.

Request:

Please use the comments section on this page to indicate:

  1. Any necessary changes to specific items

  2. Any tools or processes that are used but are not present in the diagram

  3. Existing documentation or diagrams (please include links)

  4. People who may have knowledge of the systems

Current Understanding

The flow of this documentation starts with SOURCES and moves step by step to the final KNOEMA USER ACCESS stage.

Sources

Knoema collects data from at least four types of sources (a minimal ingestion sketch follows the list):

  1. Specific files made available through publicly accessible websites

    1. Example:

  2. Files retrieved via S/FTP and similar protected file-transfer tools, both public and as permitted by data providers

    1. Example:

  3. Data pulled from publicly available API endpoints

    1. Example:

  4. Files, DB extracts, or other binary data delivered directly by data providers

    1. Example:
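
To make the first and third source types concrete, below is a minimal ingestion sketch. It is illustrative only: the URLs and file names are placeholders, not actual Knoema sources, and the real ingestion processes run through the orchestration tools described below.

  import requests

  # Illustrative placeholders only -- not actual Knoema source endpoints.

  def fetch_public_file(url: str, dest_path: str) -> None:
      """Download a file published on a publicly accessible website (type 1)."""
      response = requests.get(url, timeout=60)
      response.raise_for_status()
      with open(dest_path, "wb") as f:
          f.write(response.content)  # saved byte-for-byte, no transformation

  def fetch_public_api(endpoint: str, params: dict) -> bytes:
      """Pull data from a publicly available API endpoint (type 3)."""
      response = requests.get(endpoint, params=params, timeout=60)
      response.raise_for_status()
      return response.content  # raw payload, destined for the Data Lake as-is

  fetch_public_file("https://example.org/data/annual_statistics.csv",
                    "annual_statistics.csv")
  raw_payload = fetch_public_api("https://api.example.org/v1/series",
                                 {"indicator": "GDP"})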

Data Lake Storage

All data brought into Knoema is initially stored in Amazon S3 buckets, referred to as Knoema’s “Data Lake”. Data is stored raw, with no transformations or other operations applied prior to insertion into S3.
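
A minimal sketch of that raw-insertion step, assuming boto3; the bucket name and key layout are hypothetical, since the actual Data Lake conventions are not yet documented here.

  import boto3

  s3 = boto3.client("s3")

  def deposit_raw(local_path: str, source_name: str, file_name: str) -> None:
      """Store an ingested file in the Data Lake exactly as received.

      No parsing, transformation, or re-encoding happens before this step.
      The bucket name and key scheme below are hypothetical.
      """
      key = f"raw/{source_name}/{file_name}"
      s3.upload_file(local_path, "knoema-data-lake", key)

  deposit_raw("annual_statistics.csv", "example-source", "annual_statistics.csv")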


Orchestration

Knoema uses NiFi to run some ingestion processes and deposit raw data into S3 buckets. Other processes run through the custom Oasis tool, a holdover from Adaptive Management.

Transformations and Preparation for Database

Data engineers perform some standard transformations on the data. These operations are applied to data sitting in S3 and are performed using the Data Management Tool (DMT).

  • Need DMT documentation

    • what operations does it permit?

    • what is the output file format, where are they stored, are they versioned, etc?

Examples of transformations include (an illustrative sketch follows this list):

  • simple format corrections (e.g., str → date, int → bigint)

  • aggregations

  • delimiter identification

  • light data modeling/schema specification

  • other…
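
To make the first item concrete, here is an illustrative sketch of that kind of format correction using pandas. This is not DMT code (the DMT documentation is still needed, per the questions above); it only demonstrates the operation itself.

  import numpy as np
  import pandas as pd

  # Illustrative only -- not DMT code. Demonstrates the simple format
  # corrections listed above: str -> date and int -> bigint.
  df = pd.DataFrame({
      "report_date": ["2021-01-31", "2021-02-28"],      # dates arriving as strings
      "value": np.array([1200, 3400], dtype=np.int32),  # 32-bit ints in the raw extract
  })

  df["report_date"] = pd.to_datetime(df["report_date"])  # str -> date
  df["value"] = df["value"].astype(np.int64)             # int -> bigint equivalent

  print(df.dtypes)  # report_date: datetime64[ns], value: int64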

The output of the Data Management Tool is a specification file that describes how to handle the data file when it is picked up and processed.
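
Because the output file format, storage location, and versioning are open questions (see above), the following is a purely hypothetical illustration of the handling directives such a file might carry, written as a Python literal; none of these field names are confirmed.

  # Purely hypothetical -- the real DMT output format is an open question.
  handling_spec = {
      "source_file": "s3://knoema-data-lake/raw/example-source/annual_statistics.csv",
      "delimiter": ",",
      "columns": [
          {"name": "report_date", "type": "date"},
          {"name": "value", "type": "bigint"},
      ],
      "aggregations": [],
  }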


Data Operations

The Data Operations tool (DataOps) tracks the pipelines defined by the specification files that the Data Management Tool outputs. In addition, DataOps tracks and reports on the metadata associated with each data asset.

  • Is NiFi doing coordination here?

  • Get link to DataOps documentation and code

  • Is DataOps getting any information back from knoema.com or portals?

  • Who enters the metadata associated with the data assets?

The outputs of DataOps are transformed data assets that are delivered to Knoema’s database of record (MSSQL).
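
As a hedged sketch of that final delivery step, a bulk insert into MSSQL might look like the following; the driver, connection details, table, and columns are all placeholders, not the documented database-of-record layout.

  import pyodbc

  # Placeholders only -- the actual server, database, and schema for
  # Knoema's database of record are not documented on this page.
  conn = pyodbc.connect(
      "DRIVER={ODBC Driver 17 for SQL Server};"
      "SERVER=db.example.internal;DATABASE=knoema;UID=loader;PWD=changeme"
  )
  cursor = conn.cursor()

  rows = [("2021-01-31", 1200), ("2021-02-28", 3400)]
  cursor.executemany(
      "INSERT INTO dbo.annual_statistics (report_date, value) VALUES (?, ?)",
      rows,
  )
  conn.commit()
  conn.close()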

Knoema User Access