Revelio Labs - Workforce Composition and Trends Data

Please find attached reference documentation (Data Dictionary, FAQ, companies, JobCluster, 1 pager)

 

Character encoding and character escaping logic:

  1. Files are UTF-8 encoded. Please make sure to read the files with this encoding.

  2. I did not notice a need for character escaping logic in the sample files; however, I am keeping it here in case it is needed for the full files (a short parsing sketch follows the examples below). To escape special characters:

    1. Records in the data files may contain both commas and double quotes. To generate a robust CSV, the vendor escapes the special characters (double quotes and commas) and, on top of that, wraps each field in double quotes.

    2. For CHAR and VARCHAR columns in the data files, an escape character (\) is placed before every occurrence of the following characters:

      1. Linefeed: \n

      2. Carriage return: \r

      3. The delimiter character specified for the unloaded data (in this case, it is “,”)

      4. The escape character: \

      5. A quotation mark character: " or ' (if both ESCAPE and ADDQUOTES are specified in the UNLOAD command). See https://docs.aws.amazon.com/redshift/latest/dg/r_UNLOAD.html for more information.

    3. For example,

      1. the company name "Biblioteca \"P. Florentino Idoate\, S.J.\"" should be read as Biblioteca "P. Florentino Idoate, S.J."

      2. the company name "TBWA\\Raad" should be read as TBWA\Raad
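
For reference, here is a minimal Python sketch of how these files could be read, assuming the Redshift-style ESCAPE + ADDQUOTES format described above (the file name is a placeholder):

    import csv

    # Fields are wrapped in double quotes, and special characters inside
    # fields are escaped with a backslash (Redshift UNLOAD with ESCAPE + ADDQUOTES).
    with open("company_mapping.csv", encoding="utf-8", newline="") as f:
        reader = csv.reader(
            f,
            delimiter=",",
            quotechar='"',
            escapechar="\\",    # unescapes \" \, \\ and escaped \n \r
            doublequote=False,  # quotes are escaped with \, not doubled ("")
        )
        for row in reader:
            print(row)

With doublequote=False and escapechar="\\", the two examples above come out as TBWA\Raad and Biblioteca "P. Florentino Idoate, S.J.", as expected.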

 

Data assets:

Data assets are shared from vendor S3 to Knoema S3. There is 1 mapping file + 6 data files:

  • company_mapping

  • gender

  • inflow

  • long

  • outflow

  • posting

  • skill

Schema and data types:

General notes on data ingestion:

  • Data files contain historical data.

  • All files should be ingested.

  • The work_stream_name (for source record ID) should be “revelio_altdata”

  • The Snowflake table names are OK overall; however, one or more of them contain a double underscore. Please fix those.

  • Update type: Full History Updates

  • Update Frequency: Monthly

  • Some tables contain records with the literal string “empty” as values. Please change these to null (see the sketch below).
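
As a sketch of this “empty”-to-null cleanup, assuming the tables are staged through pandas before loading (the column names are illustrative):

    import pandas as pd

    # Illustrative rows; any string column may carry the literal value "empty".
    df = pd.DataFrame({"company": ["Acme Corp", "empty"],
                       "region": ["empty", "US"]})

    # Replace the literal string "empty" with a proper null across all columns.
    df = df.replace("empty", pd.NA)

If the cleanup is done in SQL during ingestion instead, NULLIF(column, 'empty') achieves the same result.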

 

Match Logic:

To match the data,

  • The company_mapping file contains all the input fields you need: “company”, “ISIN” and “URL”.

    • In the future, the vendor may also add a field called “country” to the company_mapping file, which may contain a country name or a country code. For full transparency, I will skip this logic for now; when the company_mapping file changes, I will review it and update this confluence page.

  • Before matching, derive two new fields:

    • add a new field called ISIN_country and populate this field:

      • If ISIN = null, then ISIN_country = null

      • Else ISIN_country = the first two characters of ISIN

    • add a new field called countryISOAlpha2Code:

      • countryISOAlpha2Code = the ISO code substitution applied to the ISIN_country field (e.g. if ISIN_country = ‘PR’, then countryISOAlpha2Code = ‘US’)

      • If no substitution can be found, then countryISOAlpha2Code = null.

  • We need to do two matches on top of the company_mapping file.

    • Match 1

      • source attribute → D+ IDR Mapping:

        • company → name

        • countryISOAlpha2Code → countryISOAlpha2Code

      • The DUNS field will be called COMPANY_DUNS. Please use the “COMPANY_” prefix for match fields.

      • Acceptance rules

        • Accept all matches with ConfidenceCode >= 7.

        • If ConfidenceCode < 7, accept matches where matchGrade matches the pattern A* (i.e. begins with “A”) AND Name Score >= 95%.

        • Set AcceptanceIndicator = '1' for all matches accepted under the conditions above.

        • If the company name or countryISOAlpha2Code is null, skip the row from matching and set:

          • Confidence code = 0

          • Acceptance Indicator = 0

          • other match fields will be NULL

    • Match 2

      • source attribute → D+ IDR Mapping:

        • url → url

      • The DUNS field will be called URL_DUNS. Please use the “URL_” prefix for match fields.

      • Acceptance rules

        • Accept all matches with ConfidenceCode >= 7.

        • Set AcceptanceIndicator = '1' for all matches accepted under the condition above.

        • If url is null, skip the row from matching and set:

          • Confidence code = 0

          • Acceptance Indicator = 0

          • other match fields will be NULL

    • Add a set of fields to determine which set of DUNS/match fields we will use for mapping (we can simply call this final set DUNS). No prefix is needed for these final fields (see the sketch after this list):

      • DUNS = URL_DUNS where available; other match fields = the URL_DUNS match fields.

      • If URL_DUNS is not available, DUNS = COMPANY_DUNS; other match fields = the COMPANY_DUNS match fields.
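
To make the field derivations, the Match 1 acceptance rule, and the final DUNS selection concrete, here is a hedged Python sketch. The substitution table is only an illustrative subset, the Name Score is assumed to be on a 0-100 scale, and the function names are mine, not the production implementation:

    import pandas as pd

    # Illustrative ISIN-prefix -> ISO alpha-2 substitution table. The real table
    # must cover every valid prefix (most map to themselves, e.g. US -> US;
    # some differ, e.g. PR -> US); prefixes not in the table yield null.
    ISIN_TO_ISO = {"US": "US", "GB": "GB", "PR": "US"}

    def derive_country_fields(mapping: pd.DataFrame) -> pd.DataFrame:
        out = mapping.copy()
        # ISIN_country = first two characters of ISIN; stays null when ISIN is null.
        out["ISIN_country"] = out["ISIN"].str[:2]
        # countryISOAlpha2Code = substitution on ISIN_country; .map() leaves
        # unmatched prefixes (and nulls) as null.
        out["countryISOAlpha2Code"] = out["ISIN_country"].map(ISIN_TO_ISO)
        return out

    def match1_acceptance_indicator(confidence_code, match_grade, name_score) -> int:
        # Accept ConfidenceCode >= 7, or a matchGrade beginning with "A"
        # combined with Name Score >= 95%.
        if confidence_code >= 7:
            return 1
        if str(match_grade).startswith("A") and name_score >= 95:
            return 1
        return 0

    def final_duns(row: pd.Series):
        # Prefer URL_DUNS; fall back to COMPANY_DUNS when URL_DUNS is missing.
        return row["URL_DUNS"] if pd.notna(row["URL_DUNS"]) else row["COMPANY_DUNS"]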

 

The Inflow and Outflow data files each contain 2 company-name columns. All other data files (i.e. Gender, Posting, Skill, Long) contain only 1 company-name column.

After matching is done, we will need to map/join the match information back to the data tables by company name. To do this, we use the “company” field in the company_mapping file to map information to the following (a join sketch follows this list):

  • The “company” field in all data files except for Inflow and Outflow (i.e. Gender, Posting, Skill, Long).

    • The DUNS column would be called “DUNS”

  • The “prev_company” and “new_company” fields in the Inflow or Outflow data files:

    • The DUNS columns would be called “PREV_COMPANY_DUNS” and “NEW_COMPANY_DUNS”

    • The match fields would follow the same naming convention: add the prefix “PREV_COMPANY_” or “NEW_COMPANY_” based on the column you’re mapping.

  • For companies that don’t exist in the company_mapping file, please treat them as no match.

    • Confidence code = 0

    • Acceptance Indicator = 0

    • other match fields will be NULL

  • What to expect in terms of companies that exist in the data files but are missing from the company_mapping file:

    • For “prev_company” in inflow AND “new_company” in outflow, most companies don’t exist in the company_mapping file. (In the sample test, ~98% of the companies are not there).

      • The data vendor will add more companies to the mapping file in the future, so match rates will gradually improve.

    • The other company name columns should not have this problem; only a very small percentage of their companies are missing from the company_mapping file (<2% in the sample test).

    • The company_mapping file is really small, so we should be able to re-dunsify data if needed.

  • Just for emphasis, please note that we are mapping/joining the match information only, and not all the fields from the mapping table.
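
Here is a sketch of that join, assuming pandas DataFrames; MATCH_COLS is an illustrative subset of the match columns, and only those columns travel to the data tables:

    import pandas as pd

    # Illustrative subset of the match columns carried over from company_mapping.
    MATCH_COLS = ["DUNS", "ConfidenceCode", "AcceptanceIndicator"]

    def attach_match_info(data: pd.DataFrame, mapping: pd.DataFrame,
                          company_col: str, prefix: str = "") -> pd.DataFrame:
        # Keep only the join key plus match columns, applying the prefix
        # ("PREV_COMPANY_" or "NEW_COMPANY_"; empty for single-company files).
        match_info = mapping[["company"] + MATCH_COLS].rename(
            columns={"company": company_col,
                     **{c: prefix + c for c in MATCH_COLS}}
        )
        joined = data.merge(match_info, how="left", on=company_col)
        # Companies absent from company_mapping are treated as no match.
        joined[prefix + "ConfidenceCode"] = joined[prefix + "ConfidenceCode"].fillna(0)
        joined[prefix + "AcceptanceIndicator"] = joined[prefix + "AcceptanceIndicator"].fillna(0)
        return joined

    # Gender/Posting/Skill/Long: one company column, unprefixed match fields.
    # gender = attach_match_info(gender, mapping, "company")
    # Inflow/Outflow: two company columns, each with its own prefix.
    # inflow = attach_match_info(inflow, mapping, "prev_company", "PREV_COMPANY_")
    # inflow = attach_match_info(inflow, mapping, "new_company", "NEW_COMPANY_")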

Snowflake Tables to Share:

  • Prior to full data ingestion and dunsification:

    • 7 raw tables for D&B to review.

  • After full data ingestion:

    • 1 dunsified company_mapping table with all columns (including raw, audit, and all DUNS/match columns).

    • 6 dunsified data tables with all columns (including raw, audit, and the DUNS/match columns that were mapped over).