Instructions for Including Matching of a Dataset or Table in Workflow

Overview

This explains how to leverage our high-performance, parallel-processing cloud architecture to run and/or automate the running of matching jobs for datasets, files, or database tables, all with a single API call. Not only can you identify and matching inconsistent and duplicate data quickly and easily, you can also schedule processing, add to business processes, match & merge multiple datasets, and integrate into ETL/ELT processes.

Automation

Schedule and integrate matching jobs via API into your ETL/ELT processes, workflows, devops, or other operations, constantly gauging levels of data quality across all of your data assets.

Multiple Data Sources

Support for various dataset formats, including local files, cloud files, the most popular database platforms, and other data sources.

Single Command

Execute or schedule powerful, highly-performant, API-driven matching capabilities with a single HTTP API request.

How It Works

The matching process is initiated via API using an HTTP request "query string", which can be embedded into any process, batch file, scheduler, or scripted series of commands, including from a browser address bar, a command line, using cURL, and any other method that enables an API call.

CSV Data Source Example: Matching Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the first column in a CSV file. The similar organization names are clustered together in groups in the report. The source file is a URL as the sample file is stored on AWS S3. Since it is a recognized sample file, no API key is necessary. It can be run as-is by cutting and pasting into a URL address bar in your browser. This same call can be used with your own data in CSV file format.

Example API Call:

Try it out by pasting into the URL address bar of your browser, no API key needed.

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=matchreport&category=company&html=true

API Parameters

Use these parameters in your HTTP query string/API call.

Parameter	Description	Required
`function`	Use 'match' for performing matching capabilities.	Required
`process`	Processing type for match processing. Options: `matchreport` (default): Generate a Match Report `keysonly`: Generate a similarity key for every value in the dataset `gensql`: Generate SQL Insert statements for inserting similarity keys into a database `createtable`: Create a new table in an existing database containing similarity keys	Required
`category`	Indicates which set of AI-driven matching algorithms to use for the target data to be analyzed. Options: `company`: For matching company names `individual`: For matching individual names `address`: For matching addresses	Required
`apikey`	Your API Key. Login to www.interzoid.com to obtain one. New users can register at www.interzoid.com/register-api-account	Required
`source`	Source of data, such as 'CSV', 'TSV', Snowflake', 'Postgres', 'json', etc. See the Supported Data Sources section for the full list.	Required
`connection`	Connection string to access database, or for CSV, TSV, or other text files, the full URL of the file location.	Required
`table`	Table name to access the source data. Use "CSV", "TSV", etc. for delimited text files.	Required
`column`	Column name within the table to access the source data. For CSV or TSV files, use a number starting from 1 (leftmost column).	Required
`reference`	An additional column from the source table to display in the output results, such as a primary key. For text files, this is also a number.	Optional
`target`	The target text file delimited format for output, such as "CSV" (commas), "TSV" (tabs), "PSV" (pipes), "SSV" (semicolons), or "json" (structured output). Default is CSV.	Optional
`keysoutputall`	Set to true (&keysoutputall=true) to output all source columns when generating similarity keys where "&process=keysonly", with the similarity key appended as the last column.	Optional
`newtable`	The name of the new table if the output results are written to a new SQL table.	Optional
`html`	Set to true (`&html=true`) to pad line breaks into the output results for better readability in a browser when run from the address bar.	Optional

Supported Data Sources: Connection Strings

Values to use for the API source and connection parameters

Source Value	Description	Connection String Value Example
`snowflake`	Account/Warehouse connection	`user:password@zwa55555/database/schema`
`postgres`	PostgreSQL, AWS/RDS/Aurora Postgres, Google Cloud SQL, ElephantSQL, CockroachDB, etc.	`postgres://user:password@domain/database?sslmode=disable`
`mysql`	MySQL, MariaDB, SkySQL, AWS/RDS/Aurora MySQL, Google Cloud SQL, etc.	`root:password@tcp(domain)/database`
`databricks`	SQL Warehouse example	`token:dapi1ab2c34defabc567890123d4efa56789@dbc-a1b2345c-d6e7.cloud.databricks.com:443/sql/1.0/endpoints/a1b234c5678901d2`
`sqlserver`	Microsoft, Azure, and AWS SQL Server	`server={servername}.database.windows.net;user id={youruserid};password={yourpassword};port=1433;database=mysample;`
`csv`	URL path of CSV file	`https://www.mywebaddress.com/files/myfile.csv`
`tsv`	URL path of TSV file	`https://www.mywebaddress.com/files/myfile.tsv`
`excel`	URL path of Excel file	`https://www.mywebaddress.com/files/myfile.xlsx`
`parquet`	URL path of Parquet file	`https://www.mywebaddress.com/files/myfile.parquet`
`json`	URL path of JSON file (one data value)	`https://www.mywebaddress.com/files/myfile.json`
`jsonref`	URL path of JSON file (one data value + one reference value)	`https://www.mywebaddress.com/files/myfile.json`
`jsonbatch`	Encoded JSON as parameter of API call	`&data=[encoded JSON]`
`jsonrefbatch`	Encoded JSON as parameter of API call	`&data=[encoded JSON]`

Note: to process and analyze local files, use our browser-based wizard.

Running with cURL Example

You can run the command from a Linux, Windows, or macOS command line using cURL:

Linux & Mac

$ curl 'https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=matchreport&category=company'

Windows

curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=matchreport&category=company"

Redirecting Output

Output from these curl commands can be redirected to output files for further processing using the greater-than symbol in both Linux & Windows.

Linux & Mac

$ curl '[HTTP query string]' > output.csv

Windows

curl "[HTTP query string]" > output.csv

Other File Source Examples

Additional examples where a file is the data source that will be used for data matching.

CSV Data Source Example: Matching Individual Names (Match Report)

Description:

Generates a match report of individual names from the first column in a CSV file. The similar individual names are clustered together in groups in the report. The source file is a URL as the sample file is stored on AWS S3. Since it is a recognized sample file, no API key is necessary. It can be run as-is by cutting and pasting into a URL address bar in your browser. This same call can be used with your own data in CSV file format.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/peoplenames.csv&table=CSV&column=1&category=individual&process=matchreport&html=true

CSV Data Source Example: Matching Street Addresses (Match Report)

Description:

Generates a match report of street addresses from the first column in a CSV file. The similar street addresses are clustered together in groups in the report. The source file is a URL as the sample file is stored on AWS S3. Since it is a recognized sample file, no API key is necessary. It can be run as-is by cutting and pasting into a URL address bar in your browser. This same call can be used with your own data in CSV file format.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/addresses.csv&table=CSV&column=1&category=address&process=matchreport&html=true

CSV Data Source Example: Company/Organization Name Similarity Keys (Append)

Description:

Generates a match report of matching/inconsistent organization names from the first column in a CSV file. The similar organization names are clustered together in groups in the report. The source file is a URL as the sample file is stored on AWS S3. Since it is a recognized sample file, no API key is necessary. It can be run as-is by cutting and pasting into a URL address bar in your browser. This same call can be used with your own data in CSV file format.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=keysonly &category=company&html=true

Connecting to Cloud SQL Data Tables

Snowflake Data Source Example: Matching Company/Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the specified column in a Snowflake database table. The similar organization names are clustered together in groups in the report. Using a native Snowflake driver, a connection is achieved with the provided Snowflake connection string. Upon connecting, the specified Snowflake table will be traversed, generating similarity keys on the fly. After scanning the entire table (or view), the match report will be generated. This is a cURL example using sample Snowflake values.

API Call:

curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=snowflake&connection=username:password@account/database/schema&table=companies&column=company&category=company&process=matchreport"

Snowflake Data Source Example: Company/Organization Name Similarity Keys (Generate SQL)

Description:

Generates a similarity key and Insert SQL command for every record in a specified column in a Snowflake database table. Using a native Snowflake driver, a connection is achieved with the provided Snowflake connection string. Upon connecting, the specified Snowflake table will be traversed, generating similarity keys on the fly. For each of these, an Insert SQL statement is generated. After scanning the entire table (or view), the entire output of Insert SQL statements is available for review, and then can be executed if desired within Snowflake. This is a cURL example using sample Snowflake values.

API Call:

curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=snowflake&connection=username:password@account/database/schema&table=companies&column=company&category=company&process=gensql"

Snowflake Data Source Example: Company/Organization Name Similarity Keys (Create Table & Insert SQL)

Description:

A new Snowflake table is created using the name in the 'newtable' parameter. A similarity key is generated and an Insert SQL command issued for every record in a specified column in a Snowflake database table. Using a native Snowflake driver, a connection is achieved with the provided Snowflake connection string. Upon connecting, the specified Snowflake table will be traversed, generating and inserting similarity keys on the fly within the newly created Snowflake table for use as desired within Snowflake. This is a cURL example using sample Snowflake values.

API Call:

curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=snowflake&connection=username:password@account/database/schema&table=companies&column=company&category=company&process=createtable&newtable=mytablename"

Azure SQL Data Source Example: Matching Company/Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the specified column in an Azure SQL/SQL Server database table. The similar organization names are clustered together in groups in the report. Using a native SQL Server driver, a connection is achieved with the provided SQL Server connection string. Upon connecting, the specified Azure SQL table will be traversed, generating similarity keys on the fly. After scanning the entire table (or view), the match report will be generated.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=sqlserver&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company

AWS RDS Data Source Example: Matching Company/Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the specified column in an AWS RDS/Aurora database table using Postgres. The similar organization names are clustered together in groups in the report. Using a native Postgres driver, a connection is achieved with the provided SQL Server connection string. Upon connecting, the specified AWS Postgres table will be traversed, generating similarity keys on the fly. After scanning the entire table (or view), the match report will be generated.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=postgres&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company

Google Cloud SQL Data Source Example: Matching Company/Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the specified column in a Google Cloud SQL database table using Postgres (you can also use MySQL). The similar organization names are clustered together in groups in the report. Using a native Postgres driver, a connection is achieved with the provided Google Cloud SQL Postgres connection string. Upon connecting, the specified Google Cloud SQL table will be traversed, generating similarity keys on the fly. After scanning the entire table (or view), the match report will be generated.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=postgres&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company

PostgreSQL Data Source Example: Matching Company/Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the specified column in a PostgreSQL database table. The similar organization names are clustered together in groups in the report. Using a native Postgres driver, a connection is achieved with the provided PostgreSQL connection string. Upon connecting, the specified PostgreSQL table will be traversed, generating similarity keys on the fly. After scanning the entire table (or view), the match report will be generated.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=postgres&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company

MySQL Data Source Example: Matching Company/Organization Names (Match Report)

Description:

Generates a match report of matching/inconsistent organization names from the specified column in a MySQL database table. The similar organization names are clustered together in groups in the report. Using a native MySQL driver, a connection is achieved with the provided MySQL connection string. Upon connecting, the specified MySQL table will be traversed, generating similarity keys on the fly. After scanning the entire table (or view), the match report will be generated.

API Call:

https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=mysql&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company

JSON:

There are two primary ways to provide JSON as input to the Full Dataset Matching API. These are a source input file, or alternatively a multi-value supporting batch parameter. in the API call (up to 100 values at a time, no file necessary).

Example JSON input file data for matching.

This example illustrates how input data might look when performing similarity matching using the API with source=json parameter, using JSON structured values from an input file. Use one data column for the data to be matched from and from which similarity keys will be generated. Using the target parameter, output can still be JSON, CSV, TSV, etc.

JSON input source file example

[
    {
        "Data": "IBM"
    },
    {
        "Data": "International Business Machines"
    },
    {
        "Data": "ibm corp"
    }
]

Example JSON input data for matching with reference data, such as a primary key, that will also appear in output for reference purposes only and not used as part of the matching.

This example illustrates how input data along with a reference parameter might look when performing similarity matching using the API with source=jsonref parameter. Use one data column for the data to be matched from and from which similarity keys will be generated. The reference value will also appear in the output. Using the target parameter, output can still be JSON, CSV, TSV, etc.

JSON input source file example with reference value

[
    {
        "Data": "IBM",
        "Reference": "376152"
    },
    {
        "Data": "International Business Machines",
        "Reference": "419044"
    },
    {
        "Data": "ibm corp",
        "Reference": "277383"
    }
]

Batch: Encoded JSON as input parameter - no source file necessary

This example illustrates how input data can look when performing similarity matching using the API with the source=jsonbatch parameter, using encoded JSON via the data parameter as input.

JSON Example

[
    {
        "Data": "IBM"
    },
    {
        "Data": "International Business Machines"
    },
    {
        "Data": "ibm corp"
    }
]

Here is what the encoded JSON from above would like for use within the data parameter:

%5B%7B%22Data%22%3A%22IBM%22%7D%2C%7B%22Data%22%3A%22International%20Business%20Machines%22%7D%2C%7B%22Data%22%3A%22ibm%20corp%22%7D%5D

The URL-encoded JSON must then be provided with the "data" parameter. You will not use a "connection" parameter sine there is not a source file.
The source and table parameters must be "jsonbatch". You can provide up to 100 values per call. Using the target parameter, output can still be JSON, CSV, TSV, etc. Here is an example using cURL:

curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=jsonbatch&data=%5B%7B%22Data%22%3A%22IBM%22%7D%2C%7B%22Data%22%3A%22International%20Business%20Machines%22%7D%2C%7B%22Data%22%3A%22ibm%20corp%22%7D%5D&category=company&process=keysonly&table=jsonbatch&keysoutputall=true&target=json"

Batch: Encoded JSON as input parameter with a reference value - no source file necessary

This example illustrates how input data can look when performing similarity matching using the API with source=jsonrefbatch parameter, using encoded JSON via the data parameter as input. The reference parameter is simply a value, such as a primary key, you want to appear along with the generated similarity key. Using the target parameter, output can still be JSON, CSV, TSV, etc.

JSON Example

[
    {
        "Data": "IBM",
        "Reference": "376152"
    },
    {
        "Data": "International Business Machines",
        "Reference": "419044"
    },
    {
        "Data": "ibm corp",
        "Reference": "277383"
    }
]

Here is what the encoded JSON from above would like for use within the data parameter:

%5B%7B%22Data%22%3A%22IBM%22%2C%22Reference%22%3A%22376152%22%7D%2C%7B%22Data%22%3A%22International%20Business%20Machines%22%2C%22Reference%22%3A%22419044%22%7D%2C%7B%22Data%22%3A%22%22%2C%22Reference%22%3A%22277383%22%7D%5D

The URL-encoded JSON must then be provided with the "data" parameter. You will not use a "connection" parameter since there is not a source file.
The source and table parameters must be "jsonrefbatch". You can provide up to 100 values per call. Using the target parameter, output can still be JSON, CSV, TSV, etc. Here is an example using cURL:

curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=jsonrefbatch&data=%5B%7B%22Data%22%3A%22IBM%22%2C%22Reference%22%3A%22376152%22%7D%2C%7B%22Data%22%3A%22International%20Business%20Machines%22%2C%22Reference%22%3A%22419044%22%7D%2C%7B%22Data%22%3A%22%22%2C%22Reference%22%3A%22277383%22%7D%5D&category=company&process=keysonly&table=jsonrefbatch&keysoutputall=true&target=json"

Need help with your query/API call? Contact us at support@interzoid.com