Quick and Easy Data Matching Tutorial

This quick tutorial shows how to identify and match duplicate or redundant data in a dataset, fully automated and directly from the browser. In this example, we will generate a "match report" from a sample CSV file; the same steps produce similar results with any supported SQL-based Cloud database.

Step 1: Register on the Interzoid site to get an API key. You will need to provide this key to run the match report in this tutorial. Registration and the API key are free, and no credit card is required.

Step 2: Go to the Interzoid Cloud Connect Website and click the "Company Name Matching" square. It is the first square in the upper left.

Step 3: You will now be at the Match Report Web application. Under "Report Setup", first provide the API key you received after registering in Step 1. If you need to log in and retrieve your API key, go to "Account" on the menu. Because the system recognizes the tutorial CSV data file we will be using, no credits will be subtracted from your Interzoid account.

Step 4: Select the matching algorithm type. Be sure to select "Company and Organization Names" for this demo, as different matching algorithms are applied depending on the type of data being analyzed and matched.

Step 5: Select the "Database Connection Type". For this tutorial, select "CSV File" as this is what we will be using.

Step 6: For "Provide Database Connection String", since this is a CSV file, we will provide the URL of the file to analyze. We have a sample file stored on AWS S3 for this tutorial. Enter the following in the edit box: "https://dl.interzoid.com/csv/companies.csv". You can download the file and open it with any text or CSV viewer to see the contents.
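
If you would rather inspect the sample file from a script than from a viewer, the following is a minimal sketch in Python. The helper names `fetch_text` and `preview_rows` are hypothetical and not part of any Interzoid tooling, and the sketch assumes the file is UTF-8 encoded text.

```python
import csv
import io
import urllib.request


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a text file over HTTP(S); UTF-8 encoding is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def preview_rows(csv_text: str, limit: int = 5) -> list[list[str]]:
    """Parse CSV text and return at most `limit` rows for inspection."""
    reader = csv.reader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        if len(rows) >= limit:
            break
        rows.append(row)
    return rows
```

Calling `preview_rows(fetch_text("https://dl.interzoid.com/csv/companies.csv"))` should return the first few rows of the tutorial file as lists of strings, provided the file is reachable from your machine.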

Step 7: For "Table Name", enter "CSV" since this is a CSV file. If this were a SQL database, this is where the table name would be entered.

Step 8: For "Column Name", enter the number of the column in the CSV file to analyze. Columns are numbered from the left, starting with "1". We will be using the first column, so enter "1" in the edit box.
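
To make the 1-based column numbering concrete, here is a small Python sketch that pulls one column out of CSV text. The function name `column_values` is hypothetical, and the sketch makes no assumption about whether the file has a header row; it simply returns every value found in that column.

```python
import csv
import io


def column_values(csv_text: str, column_number: int) -> list[str]:
    """Return the values in a 1-based column, skipping rows that are too short."""
    if column_number < 1:
        raise ValueError("column_number starts at 1")
    index = column_number - 1  # convert to Python's 0-based indexing
    reader = csv.reader(io.StringIO(csv_text))
    return [row[index] for row in reader if len(row) > index]
```

With this convention, `column_values(text, 1)` returns the leftmost column, which is the one used in this tutorial.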

Step 9: Leave the next two edit boxes as they are for now. The last thing to do is to "Select Report Type". For this tutorial, ensure "Display Match Report of Duplicate Candidates" is selected.

Step 10: Click "Connect and Run Report". Behind the scenes, the value in the match column of each record in the CSV file is sent to Interzoid's Company Name Matching API to generate a similarity key. These similarity keys are generated using various algorithms, heuristics, knowledge bases, and an ever-expanding base of Contextual Machine Learning capabilities. Once a similarity key has been generated for each record, the records are sorted by similarity key, so that similar company/organization names line up next to each other in the match report that is created and displayed.
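
The key-then-sort idea described in this step can be sketched in a few lines of Python. Note that `toy_similarity_key` is a deliberately simple, hypothetical stand-in (lowercasing, stripping punctuation, dropping a few legal suffixes); it is not Interzoid's algorithm and does not call the Interzoid API. Only the overall flow, generating a key per record and then sorting so equal keys sit together, mirrors the description above.

```python
import re
from itertools import groupby

# Hypothetical suffix list, for illustration only.
_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "co", "company", "llc", "ltd"}


def toy_similarity_key(name: str) -> str:
    """Stand-in key: lowercase, strip punctuation, drop common legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return "".join(w for w in words if w not in _SUFFIXES)


def match_report(names: list[str]) -> list[list[str]]:
    """Sort names by key and return groups of two or more that share a key."""
    keyed = sorted(names, key=toy_similarity_key)  # similar names become adjacent
    groups = (list(g) for _, g in groupby(keyed, key=toy_similarity_key))
    return [g for g in groups if len(g) > 1]
```

For example, `match_report(["IBM Corp.", "Apple Inc", "I.B.M.", "apple incorporated", "Globex"])` groups the two Apple variants together and the two IBM variants together, while "Globex" is left out because nothing else shares its key.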

That's it! The same process can be used with any Cloud-accessible CSV file or Cloud SQL database. The Database Connection String is all that's required. As part of the free demo tutorial, you can also run an individual name matching report using "https://dl.interzoid.com/csv/peoplenames.csv", and no credits will be used.

For instructions on running data matching against a dataset on a schedule, as ongoing processing, or as part of a data pipeline, see the data matching workflow guide.

Questions? Contact support@interzoid.com - we are happy to help.