Data Infrastructure
Work-in-Progress Diagram
Purpose
This page is an ongoing attempt to document the existing Knoema data infrastructure. The diagram currently represents what @Ian Cook has noted down during his initial conversations with people in Russia and India.
The diagram above and statements below are clearly incomplete and likely incorrect.
We will use this page to work on improving and correcting them as a team.
Scope
This page focuses on the technical infrastructure used to move data from a source of any kind to its eventual presentation to a client, whether through Knoema’s primary site, client portals, or APIs and tools. The content of this page is restricted to technical details; it is not intended to address how Knoema discusses access with clients, names products, or groups items under any marketing terminology.
Collaborating
Anyone on the Data Team who has relevant insights is encouraged to contribute. The diagram is generated by Python code in this repo. Anyone who wants to update the code directly is welcome to do so, though this is not required, as Ian will be updating it as well.
Request:
Please use the comments section on this page to indicate:
Any necessary changes to specific items
Any tools or processes that are used but are not present in the diagram
Existing documentation or diagrams (please include links)
People who may have knowledge of the systems
Current Understanding
This documentation follows the flow of data, starting with SOURCES and moving step by step to final KNOEMA USER ACCESS.
Sources
Knoema collects data from at least four types of sources:
Specific files made available through publicly accessible websites
Example:
S/FTP and similar protected file-access tools, both public and as permitted by data providers
Example:
Pulling data from publicly available API endpoints (see the sketch after this list)
Example:
Direct delivery from data providers (files, DB extracts, or other binary data)
Example:
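For illustration only, a minimal sketch of the API-pull pattern in Python; the endpoint URL, parameters, and function name are placeholders, not Knoema’s actual sources or code:

```python
import requests

# Hypothetical public API endpoint; the real source catalog lives in the
# ingestion configuration, not in this sketch.
SOURCE_URL = "https://api.example.org/v1/datasets/gdp/observations"

def pull_source(url: str, timeout: int = 30) -> bytes:
    """Fetch the raw payload from a public API endpoint, unmodified."""
    response = requests.get(url, params={"format": "csv"}, timeout=timeout)
    response.raise_for_status()
    return response.content

raw_bytes = pull_source(SOURCE_URL)
```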
Data Lake Storage
All data brought into Knoema is initially stored in Amazon S3 buckets, referred to as Knoema’s “Data Lake”. All data is stored raw, with no transformations or other operations applied prior to initial insertion into S3.
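A minimal sketch of the raw-deposit step, assuming boto3; the bucket name and key convention are placeholders, not the actual data-lake layout:

```python
import boto3

# Hypothetical bucket and key convention; the real layout should come
# from the ingestion team's configuration.
DATA_LAKE_BUCKET = "knoema-data-lake-raw"

def deposit_raw(raw_bytes: bytes, source_id: str, file_name: str) -> str:
    """Store the payload in S3 exactly as received, with no transformation."""
    key = f"raw/{source_id}/{file_name}"
    boto3.client("s3").put_object(Bucket=DATA_LAKE_BUCKET, Key=key, Body=raw_bytes)
    return key

# e.g. deposit_raw(raw_bytes, "example-api", "gdp_observations.csv")
```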
Orchestration
Knoema uses NiFi to run some ingestion processes and deposit raw data into S3 buckets. Other processes run through the custom Oasis tool (a holdover from Adaptive Management).
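For monitoring those NiFi flows, a hedged sketch that polls NiFi’s standard REST API for process-group status; the host, port, and use of the root process group are assumptions, and nothing here is confirmed about Knoema’s deployment:

```python
import requests

# Hypothetical NiFi host; authentication, if enabled, is omitted here.
NIFI_API = "http://nifi.internal.example:8080/nifi-api"

def process_group_status(group_id: str = "root") -> dict:
    """Return the aggregate status of a NiFi process group via the REST API."""
    url = f"{NIFI_API}/flow/process-groups/{group_id}/status"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

# e.g. process_group_status()["processGroupStatus"]["aggregateSnapshot"]
```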
Transformations and Preparation for Database
Data engineers perform some standard transformations on the data. These operations are done on data that sits in S3 and are performed using the Data Management Tool (DMT).
Need DMT documentation
what operations does it permit?
what is the output file format, where are the output files stored, are they versioned, etc.?
Examples of transformations include:
simple format corrections (str → date, int → bigint)
aggregations
delimiter identification
light data modeling/schema specification
other…
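As an illustration of the simple format corrections above, a sketch using pandas; the column names are invented, and the DMT may implement these operations very differently:

```python
import pandas as pd

# Hypothetical raw extract with stringly-typed columns, as a source file
# in the data lake might arrive.
raw = pd.DataFrame({
    "period": ["2023-01-01", "2023-04-01"],  # str -> date
    "value": ["1024", "2048"],               # int -> bigint
})

corrected = raw.assign(
    period=pd.to_datetime(raw["period"]).dt.date,  # parse ISO strings to dates
    value=raw["value"].astype("int64"),            # widen to a 64-bit integer
)
print(corrected.dtypes)
```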
The output of the Data Management Tool is a specification file that describes how the data file should be handled when it is picked up and processed.
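The actual format of that specification is one of the open DMT questions above; purely as a strawman for discussion, expressed here as a Python dict, it might carry fields like these (all hypothetical):

```python
# Purely hypothetical shape for a DMT handling specification; the real
# format, field names, and storage location are open questions above.
handling_spec = {
    "source_id": "example-api",
    "input": "s3://knoema-data-lake-raw/raw/example-api/gdp_observations.csv",
    "delimiter": ",",
    "columns": [
        {"name": "period", "type": "date"},
        {"name": "value", "type": "bigint"},
    ],
    "aggregations": [],
}
```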
Data Operations
The Data Operations tool (DataOps) tracks the pipelines specified by the handling files output by the Data Management Tool. In addition, DataOps tracks and reports on the metadata associated with each data asset.
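Which metadata fields are tracked, and who enters them, are open questions below; as a strawman only, a per-asset record might look like this (every field here is hypothetical):

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical per-asset metadata record; the fields DataOps actually
# tracks are an open question below.
@dataclass
class AssetMetadata:
    asset_id: str
    source_id: str
    raw_key: str              # location of the raw file in the S3 data lake
    pipeline_run_at: datetime
    row_count: int
    schema_version: str
```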
Is NiFi doing coordination here?
Get link to DataOps documentation and code
Is DataOps getting any information back from knoema.com or portals?
Who enters the metadata associated with the data assets?
The outputs of DataOps are transformed data assets that are delivered to Knoema’s database of record (MSSQL).
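A minimal sketch of that final delivery step, assuming SQLAlchemy with the pyodbc driver; the connection string, table name, and the use of pandas are all assumptions about a load path that is not yet documented:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string; host, credentials, and database name
# are placeholders.
ENGINE = create_engine(
    "mssql+pyodbc://user:password@db.internal.example/knoema"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

def deliver_asset(df: pd.DataFrame, table_name: str) -> None:
    """Append a transformed data asset to the MSSQL database of record."""
    df.to_sql(table_name, ENGINE, if_exists="append", index=False)
```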