Here is a quick and easy tutorial to show you how to identify and match duplicate/redundant data in a dataset, all automated and directly from the browser. In this particular example, we will generate a "match report" with a sample CSV file; however, the same steps can be followed for similar results with any supported SQL-based Cloud database.
Step 1: Register on the Interzoid site to get an API key. You will need to provide this key to run the match report within the tutorial. Registering and receiving an API key are free, and no credit card is required.
Step 2: Go to the Interzoid Cloud Connect Website and click the "Company Name Matching" square. It is the first square in the upper left.
Step 3: You will now be at the Match Report Web application. Under "Report Setup", first provide the API key you received after registering in Step 1. You can go to "Account" on the menu if you need to log in and get your API key. Because the system recognizes the tutorial CSV data file we will be using, no credits will be subtracted from your Interzoid account.
Step 4: Select the matching algorithm type. Be sure to select "Company and Organization Names" for this demo, as different matching algorithms are applied depending on the type of data being analyzed and matched.
Step 5: Select the "Database Connection Type". For this tutorial, select "CSV File" as this is what we will be using.
Step 6: For "Provide Database Connection String", since this is a CSV file, we will provide the URL of the file that we will analyze. We have a sample file stored on AWS S3 that we will use for this tutorial. Enter the following in the edit box: "https://dl.interzoid.com/csv/companies.csv". You can download the file and open it with any text or CSV viewer to see the contents.
Step 7: For "Table Name", enter "CSV" since this is a CSV file. If this were a SQL database, this is where the table name would be entered.
Step 8: For "Column Name", we will enter the column number in the CSV file that we will be analyzing. The first column from the left in a CSV file starts with "1". This is what we will be using here, so enter "1" in the edit box.
Step 9: The next two edit boxes we will leave for now. The last thing to do is to "Select Report Type". For this tutorial, ensure "Display Match Report of Duplicate Candidates" is selected.
Step 10: Click "Connect and Run Report". Behind the scenes, for each record in the CSV file, the value in the match column is sent to Interzoid's Company Name Matching API to generate a similarity key. These similarity keys are generated using various algorithms, heuristics, knowledge bases, and an ever-expanding base of Contextual Machine Learning capabilities. Once a similarity key has been generated for each record, the records in the file are sorted by similarity key, so company/organization names that are similar line up next to each other in the match report that is created and displayed.
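To make the "generate a key, then sort by key" idea concrete, here is a minimal Python sketch. It is only an illustration: `toy_similarity_key` is a hypothetical stand-in that lowercases a name and drops a few common legal suffixes, and it is not Interzoid's algorithm or API. The company names are made up for the example.

```python
import re
from itertools import groupby

# Hypothetical list of suffixes to ignore; an assumption for this sketch only.
SUFFIXES = {"inc", "incorporated", "corp", "corporation",
            "co", "company", "llc", "ltd", "limited"}

def toy_similarity_key(name):
    """Toy stand-in for a similarity key: lowercase, drop punctuation and suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    core = [t for t in tokens if t not in SUFFIXES]
    # Fall back to all tokens if the name consists only of suffix words.
    return "".join(core or tokens)

def match_report(names):
    """Sort names by key and return only the keys shared by more than one name."""
    keyed = sorted((toy_similarity_key(n), n) for n in names)
    report = []
    for key, group in groupby(keyed, key=lambda pair: pair[0]):
        members = [name for _, name in group]
        if len(members) > 1:
            report.append((key, members))
    return report

names = ["Acme Inc.", "Globex Corporation", "ACME, Incorporated",
         "Initech", "Globex Corp"]
for key, members in match_report(names):
    print(key, members)
```

Because the records are sorted by key before grouping, names that share a key end up adjacent, which is the same property that lets similar names line up next to each other in the match report. A real similarity key would need to handle far more variation (misspellings, abbreviations, word order) than this toy version does.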
That's it! The same process can be used with any Cloud-accessible CSV file or Cloud SQL database. The Database Connection String is all that's required. You can also run an individual name matching report using "https://dl.interzoid.com/csv/peoplenames.csv" as part of the free demo tutorial, and no credits will be used.
For instructions on how to incorporate data matching of a dataset into scheduled runs, ongoing processing, or a data pipeline, see the data matching workflow guide.
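If you want to try this with your own CSV file, it can help to confirm which column number to enter for "Column Name" before running the report. The following Python sketch, using only the standard library, lists the first row's values with the 1-based numbering described in Step 8 and extracts a chosen column. The function names and the sample data are made up for illustration, and `fetch_text` assumes the file is UTF-8 encoded.

```python
import csv
import io
import urllib.request

def numbered_columns(csv_text):
    """Return (column_number, value) pairs for the first row, numbered from 1."""
    first_row = next(csv.reader(io.StringIO(csv_text)), [])
    return list(enumerate(first_row, start=1))

def column_values(csv_text, column_number):
    """Return the values of a 1-based column, skipping rows that are too short."""
    if column_number < 1:
        raise ValueError("column numbers start at 1")
    index = column_number - 1
    return [row[index]
            for row in csv.reader(io.StringIO(csv_text))
            if len(row) > index]

def fetch_text(url, timeout=10):
    """Download a file and decode it as text (UTF-8 is an assumption)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")

# Made-up sample data so the example runs without network access.
sample = "Acme Inc,New York\nAcme Incorporated,Boston\n"
print(numbered_columns(sample))   # [(1, 'Acme Inc'), (2, 'New York')]
print(column_values(sample, 1))   # ['Acme Inc', 'Acme Incorporated']
```

To inspect the tutorial file instead, you could pass `fetch_text("https://dl.interzoid.com/csv/companies.csv")` to either function; this requires network access.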
Questions? Contact support@interzoid.com - we are happy to help.