
How to Include Matching of a Dataset or Database Table in a Workflow

In addition to interactively identifying inconsistent, duplicate, or redundant data in a text file or database table with this Cloud Connect application, you can automate these matching jobs so they can be scheduled, added to a business process or workflow, used to match and merge multiple datasets, or run as a step in an ETL/ELT data pipeline. Each job runs from a single command.

Each job is defined by an HTTP request whose "query string" carries the job parameters. The request can be embedded directly into any process, batch file, scheduler, or series of commands.

For example, the following match process can be tested against our demo company name file (CSV source, no credits used). Replace the apikey placeholder with your own API key, paste the URL into your browser's address bar, and press Enter. The response is CSV output with a column of company names, clustered and sorted by their algorithmically generated similarity keys:


    https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=matchreport&category=company&html=true
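The same URL can also be assembled in code, which avoids escaping the `connection` value by hand. The following is a minimal Python sketch, not an official client: the helper name `build_match_url` is hypothetical, the endpoint and parameter names are taken from the example URL above, and it assumes the service decodes percent-encoded values in the standard way.

```python
from urllib.parse import urlencode

BASE_URL = "https://connect.interzoid.com/run"  # endpoint from the example above


def build_match_url(apikey, source, connection, table, column,
                    process="matchreport", category="company", **extra):
    """Assemble a match request URL (hypothetical helper, not an official API)."""
    params = {
        "function": "match",
        "apikey": apikey,
        "source": source,
        "connection": connection,
        "table": table,
        "column": column,
        "process": process,
        "category": category,
    }
    params.update(extra)  # optional flags such as html="true"
    # urlencode percent-encodes reserved characters in every value.
    return BASE_URL + "?" + urlencode(params)


url = build_match_url(
    apikey="use-your-own-api-key-here",
    source="CSV",
    connection="https://dl.interzoid.com/csv/companies.csv",
    table="CSV",
    column=1,
    html="true",
)
print(url)
```

The printed URL differs from the hand-written one only in that the file location inside `connection` is percent-encoded.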
                

Running with 'Curl'

You can also run this command from a Linux, Windows, or macOS command line using curl (also written cURL), a command-line HTTP client that is available by default on most systems. On Windows, wrap the URL in double quotes; on Linux and macOS, single quotes also work:

Linux & Mac

    $ curl 'https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=matchreport&category=company'
                
Windows

    > curl "https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=CSV&connection=https://dl.interzoid.com/csv/companies.csv&table=CSV&column=1&process=matchreport&category=company"
                
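Where the request needs to run inside a larger program rather than from a shell, the same call can be made with Python's standard library. This is a minimal sketch under stated assumptions: `run_match` is a hypothetical helper name, the response body is assumed to be UTF-8 text, and the 60-second timeout is an arbitrary choice.

```python
from urllib.request import urlopen


def run_match(url, opener=urlopen, timeout=60):
    """Send the request and return the response body as text.

    `opener` defaults to urllib's urlopen; it is a parameter so the
    function can be exercised without network access.
    Assumption: the service returns UTF-8 encoded text.
    """
    with opener(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Example call (performs a real HTTP request, so it is left commented out):
# report = run_match(
#     "https://connect.interzoid.com/run?function=match"
#     "&apikey=use-your-own-api-key-here&source=CSV"
#     "&connection=https://dl.interzoid.com/csv/companies.csv"
#     "&table=CSV&column=1&process=matchreport&category=company"
# )
# print(report)
```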

Redirecting Output

Output from these curl commands can be redirected to a file for further processing using the greater-than symbol (>), which works the same way on Linux, macOS, and Windows.

Linux & Mac

    $ curl '[HTTP query string]' > output.csv
                
Windows

    > curl "[HTTP query string]" > output.csv
                
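A scripted equivalent of the redirect streams the response straight into a file. The sketch below is illustrative only: `save_match_output` is a hypothetical helper name, and `output.csv` simply mirrors the file name used in the commands above.

```python
import shutil
from urllib.request import urlopen


def save_match_output(url, path="output.csv", opener=urlopen):
    """Stream the response body to a local file, like `curl ... > output.csv`."""
    with opener(url) as response, open(path, "wb") as out_file:
        # Copy in chunks so large reports are not held in memory all at once.
        shutil.copyfileobj(response, out_file)
    return path


# Example call (performs a real HTTP request, so it is left commented out):
# save_match_output("[HTTP query string]", "output.csv")
```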

Connecting to Cloud SQL Data Tables

Here are some examples of using the same HTTP query string to match an entire database table of company names. See more about connection strings.


    (Snowflake example) https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=Snowflake&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company
    (Azure SQL example) https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=azure%20sql&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company
    (AWS RDS example) https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=aws%20rds%20postgres&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company
    (Google Cloud SQL example) https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=postgres&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company
    (Postgres example) https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=postgres&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company
    (MySQL example) https://connect.interzoid.com/run?function=match&apikey=use-your-own-api-key-here&source=mysql&connection=your-specific-connection-string&table=companies&column=company&process=matchreport&category=company
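Some source names contain spaces, and connection strings often contain characters that are reserved in URLs, so it can be safer to let a library do the encoding. The sketch below generates one URL per source from the examples above. `table_match_url` is a hypothetical helper, and the table and column names are the same placeholders used above; it assumes the service decodes percent-encoded values in the standard way.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://connect.interzoid.com/run"

# Source values from the examples above, written with plain spaces.
SOURCES = ["Snowflake", "azure sql", "aws rds postgres", "postgres", "mysql"]


def table_match_url(source, connection, apikey="use-your-own-api-key-here"):
    """Build a match-report URL for a database table (hypothetical helper)."""
    params = {
        "function": "match",
        "apikey": apikey,
        "source": source,
        "connection": connection,
        "table": "companies",    # placeholder table name from the examples
        "column": "company",     # placeholder column name from the examples
        "process": "matchreport",
        "category": "company",
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return BASE_URL + "?" + urlencode(params, quote_via=quote)


for source in SOURCES:
    print(table_match_url(source, "your-specific-connection-string"))
```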
                


Supported Data Sources


    AWS RDS/Aurora
    Snowflake
    Azure SQL
    Google Cloud SQL
    Databricks
    PostgreSQL
    MySQL
    MariaDB
    Parquet
    CSV
    TSV
    Excel
                    


Data Matching Parameters


    Parameters specific to Data Matching to be set as part of the HTTP query string:

    function	    Required. Use 'match' for data matching.

    process	    Required. Defines the report or action to perform on the dataset. Available process types
                    are 'matchreport', 'keysonly', 'gensql', and 'createtable'. 'matchreport' generates a report
                    of all clusters of similar data that were found. 'keysonly' outputs a generated similarity key
                    for every record in the dataset. 'gensql' does the same, but emits SQL INSERT statements
                    that store the similarity keys in a database. 'createtable' creates a new table in the
                    source database containing the similarity key for each record in the source table, so the
                    keys can be used in additional queries.

    category	    Required. Indicates which set of machine learning and matching algorithms to use, based on
                    the type of data content. Use 'company', 'individual', or 'address'.
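Because a mistyped value only surfaces once the request has been sent, a scheduled job may want to check these three parameters first. The following sketch is one way to do that. `check_match_params` is a hypothetical helper, and the exact-match, lowercase comparison is an assumption, since the text above does not say whether the service is case-sensitive.

```python
# Allowed values as listed in the parameter descriptions above.
VALID_PROCESSES = {"matchreport", "keysonly", "gensql", "createtable"}
VALID_CATEGORIES = {"company", "individual", "address"}


def check_match_params(params):
    """Return a list of problems with the matching parameters (empty if none)."""
    problems = []
    if params.get("function") != "match":
        problems.append("function must be 'match'")
    if params.get("process") not in VALID_PROCESSES:
        problems.append("process must be one of: " + ", ".join(sorted(VALID_PROCESSES)))
    if params.get("category") not in VALID_CATEGORIES:
        problems.append("category must be one of: " + ", ".join(sorted(VALID_CATEGORIES)))
    return problems


print(check_match_params({"function": "match", "process": "keysonly", "category": "address"}))
print(check_match_params({"function": "match", "process": "report", "category": "company"}))
```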
                    


Additional Parameters


    Additional parameters that can be set as part of the HTTP query string:

    apikey	    Required. Log in to www.interzoid.com to obtain your API key, which is how we track and manage usage.
                    If you do not yet have one, register at www.interzoid.com/register-api-account.

    source	    Required. Source of data, such as 'CSV', 'Snowflake', 'Postgres', etc.
                    See the source list on the interactive page for all available values.

    connection	    Required. Connection string to access database, or in the case of a CSV or TSV file,
                    use the full URL of the location of the file.

    table	    Required. Table name to access the source data. Use "CSV" or "TSV" for delimited text files.

    column	    Required. Column name within the table to access the source data. This is a number for CSV or TSV files,
                    starting with number 1 from the left side of the file.

    reference	    An additional column from the source table to display in the output results, such as a primary key.

    newtable	    The name of the new table if the output results are written to a new table.

    json	    Set to true (&json=true) to display the output formatted as JSON.

    html	    Set to true (&html=true) to insert line breaks into the output results for better readability
                    in a browser when run from the address bar.
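The optional parameters can be appended to a query string that already exists, for example one stored in a scheduler configuration. The sketch below shows one way to do this with Python's standard library. `add_optional_params` is a hypothetical helper, and it re-encodes the existing query string, so any percent-encoding is normalized along the way.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def add_optional_params(url, reference=None, newtable=None, json=False, html=False):
    """Return `url` with the requested optional parameters appended."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    if reference:
        pairs.append(("reference", reference))
    if newtable:
        pairs.append(("newtable", newtable))
    if json:
        pairs.append(("json", "true"))
    if html:
        pairs.append(("html", "true"))
    # Rebuild the URL with the extended query string.
    return urlunsplit(parts._replace(query=urlencode(pairs)))


base = "https://connect.interzoid.com/run?function=match&table=companies"
# "id" is an illustrative column name, not one defined by the service.
print(add_optional_params(base, reference="id", json=True))
```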
    

Also see our quick and easy Data Matching Tutorial.

Questions? Contact support@interzoid.com - we are happy to help.



All content (c) 2018-2023 Interzoid Incorporated.

201 Spear Street, Suite 1100, San Francisco, CA 94105-6164
