Start a project

We'd love to hear about your project. Please fill out this form to provide us with the necessary details.

Thank you! Your submission has been received! Our team will get back to you within 24 hours.
Oops! Something went wrong while submitting the form.
AI
Data

How To Clean Data and Enhance It With AI

Read in light mode
Read in dark mode
Have an unwieldy spreadsheet that needs to be cleaned? Here's our step-by-step guide for how to clean data and enhance it 100x faster than doing it manually.

Like many people these days, I’m a power user of ChatGPT and other LLMs to help me with things like drafting outlines for blog posts, generating cover letters, summarizing feedback, and much more. Generally, these tasks are long-form text-based, with the outputs being Google Docs.

Recently, I discovered an entirely new use case for LLMs I hadn’t even thought of: cleaning up messy spreadsheets. This involves data cleansing, where AI tools can be used to clean and prepare data for analysis. While I’ve always seen value in using AI for my long-form text tasks, it wasn’t until I used AI for my spreadsheet use case that I saw a 100x time savings with AI.

As with any AI/LLM task, the proper prompts and workflows are critical for getting the right output. Here’s a step-by-step guide for using ChatGPT to clean your data and enhance 100x faster.

Background

Recently, I found myself with a large data set that was messy, inaccurate, and incomplete. It’s a database of venture capitalists and angel investors that I’m putting together as a data product here at Peak Digital Studio. I’ve been using Phantombuster to scrape LinkedIn for VCs/angels and putting them into a spreadsheet. The problem with the data set is that details are missing, specifically around location, and some results aren’t investors. I wanted to clean the list up to ensure the only entries that remain are actually investors and that we have their location information.

I’ll use this specific example to show how I took a messy, inaccurate spreadsheet and ran it through ChatGPT to clean the data, and ensure the output was valuable for the end user.

Steps for an AI Data Cleaning Process

The following is a step-by-step guide for the data cleaning process using AI. Of course this specific use-case I am using as an example might not exactly fit what you’re doing, but generally the principles will apply. I used ChatGPT for this task as that’s the model I am most comfortable with, so results may vary with some of the other LLMs.

Step 1: Make sure you have the right ChatGPT plan

To upload files, you’ll need the ChatGPT plus plan.

Step 2: Prepare the spreadsheet

Before uploading anything to ChatGPT, make sure your file is below the 100mb upload limit. If you need to remove data or split the spreadsheet in half, you can always start with a smaller file to get the initial results, and upload more spreadsheets later. Be sure your spreadsheet is in .CSV format and that the data type for each column is consistent and correctly formatted.

I also remove any extraneous columns that won’t help ChatGPT do the work I need it to.

Step 3: Give ChatGPT a prompt

Now that your spreadsheet is ready and you have the right plan, you can give ChatGPT a prompt. I use ChatGPT 4o for this task, but have used 3.5 in the past.

To create the prompt, first think of all the issues that are making your data unclean, and work back from that.

In my case, I had lots of names with weird characters (I didn’t actually know the name of these characters so I used Perplexity to ask what they were called and what prompt I should use in ChatGPT to remove them).

I also wanted to determine, using all available information I scraped from their LinkedIn, if the person was an actual investor (for some reason lots of people say they work in VC, but don’t actually). Finally, I wanted the data to have coherent geographic data for each entry so it could be filtered by region, country, and continent.

Here’s the prompt I landed on:

You are a bot that takes .CSV files and cleans and enhances the data. I’m going to provide you with a list of profiles I scraped from LinkedIn, and I’ll give you a few instructions for how to clean:
1. Please clean up and correct tany mojibake or character encoding issues. Convert any special characters or symbols to their standard ASCII equivalents where possible.
2. Remove any middle initials and/or qualifications from people’s names. For example, Dr. Sean T. Smith, DDS please remove the “Dr.”, “T.” and “DDS”.
2. Using all available information about each entry, determine whether they are an actual startup investor - either a VC or an angel. Create a new field called Type, and label each as VC, Angel, or Not Investor
3. Use the “linkedinJobLocation” and “location” fields to first infer their location - for example knowing someone’s entry is Miami, this tells you they are in the city of Miami, state/region of Florida, country of United States of America, and continent of North America. Create new fields called “City”, “State/Region”, “Country”, and “Continent” and generate the data in these columns. Prioritize the field “location” if there’s any conflicts.
Some things to keep in mind:
1. Missing Values - If a field is absent or lacks a value in the text, make reasonable inferences whenever possible.
2. Inconsistencies - Ensure that all similar values are consistently formatted and spelled. For example, for the “state” field, “New Mexico”, “NM”, and “nm” should all be represented as “New Mexico”.
3. Final output - purely in .CSV format
My prompt for cleaning our database of VC and Angel investors

Step 4: Upload the spreadsheet

With the prompt inputted, you can now upload the actual spreadsheet. It takes a bit of time for ChatGPT to process, and it will give you some live updates on what it’s up to as it goes.

ChatGPT will let you know what it's up to while it starts it's data cleaning task

Step 5: Review for missing data and make any fixes

Check the output to see if there's any glaring errors

Be sure to review a few lines of the data, and then have ChatGPT make any fixes you might notice. For example, many of the entries for mine didn’t have proper location data, so I added another prompt after the original run to try and fix that:

Many of these entries have the city, but haven’t used that to infer the rest of the location data. Please fix

As you can see, I am very polite when I speak to my future robot overlord.

Optional Step 6: Keep Cleaning Data With Your New Output

Despite a few tries, I still had some location issues, where it was getting the country wrong. I've found sometimes it's best to start with a blank slate, and create a new chat with simpler instructions. I gave the following prompt in a new chat:

You are a bot that infers from a city’s name what state/region, country, and continent it’s on. I’ll provide a csv file that includes information including a field called “City”. Ignore all other columns and just use the “City” column to infer the rest of the geographic information based on existing data. Create new fields for “State/Region”, “Country” and “Continent” and put the data in those fields.

I removed all fields from the previous .CSV ChatGPT generated for me, and just left in the city. This solved the problem.

Don’t get frustrated - you might have to go back and forth a few times to get this fixed. Try to give clear and easy to understand prompts.

There you have it!

And that’s it! Now, you should have a clean and enhanced spreadsheet with high-quality data. Hopefully, this process will save you hours of busy work.

If you’d like to see the investor database I created while writing this article, check it out here. Or, if you need help setting up your own prompts or cleaning data, get in touch with me and we can help.

FAQ

What do we mean by “Clean Data” and Data Quality

“Clean data” refers to datasets free from errors, inconsistencies, and inaccuracies. It ensures that the data is accurate, complete, and formatted in a suitable way for data analysis and decision-making. Clean data is essential for producing reliable insights and making informed business decisions.

Key Characteristics of Clean Data:

  1. Accuracy: The data correctly represents the real-world values or conditions it is supposed to capture.
  2. Completeness: All required data fields are filled, and there are no missing values.
  3. Consistency: Data is uniform and follows a standardized format across different data sets and over time.
  4. Validity: The data conforms to the defined rules, constraints, and formats. For instance, dates follow a specific format (MM/DD/YYYY), and numerical values fall within expected ranges.
  5. Uniqueness: There are no duplicate records, ensuring that each entry is distinct.
  6. Timeliness: The data is up-to-date and relevant for the current analysis or decision-making process.
  7. Relevance: The data is pertinent to the specific context or analysis being performed.

Common Issues in Dirty Data:

  • Typos and Errors: Spelling mistakes, incorrect entries.
  • Missing Data: Empty fields that should contain data.
  • Duplicate Records: Multiple entries for the same item or entity.
  • Inconsistent Formats: Variations in date formats, units of measurement, or naming conventions.
  • Outliers: Data points that are significantly different from others and may indicate errors or unique conditions.
  • Invalid Data: Entries that do not conform to expected formats or fall outside acceptable ranges.

Importance of Clean Data:

  • Improved Accuracy: Ensures that analyses and insights are based on correct information.
  • Better Decision-Making: Reliable data supports more informed and effective decisions.
  • Efficiency: Reduces the time and resources spent on data scrubbing to correct errors and inconsistencies.
  • Compliance: Helps meet regulatory requirements and standards for data quality.
  • Enhanced Customer Trust: Ensures that customer data is accurate and up-to-date, improving trust and satisfaction.

Consolidating Data from Multiple Sources:Combining data from multiple data sources is crucial to avoid errors, duplication, and inconsistency in the final dataset used for analysis.

Formatting Numeric Data:Ensuring numeric data is correctly formatted is essential for accurate analysis, including converting categorical data into numeric data or filling missing numeric data with specific values.

Uniformity in Data:Data must be converted to the same unit of measure to ensure consistency and enable accurate analysis.

What is data analysis?

Data analysis is the process of examining, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves defining objectives, collecting data, cleaning and organizing data, using statistical tools to analyze the data, and interpreting the results to make informed decisions.

What makes manually cleaning data challenging?

Manually cleaning data is challenging because it is time-consuming, prone to human error, and requires a thorough understanding of the data to accurately identify and correct issues. Common methods include removing duplicates, handling missing values, correcting inaccuracies, and standardizing formats.

What is exploratory data analysis?

Exploratory Data Analysis (EDA) is an approach for summarizing and visualizing the important characteristics of a dataset, often using graphical techniques. It helps in understanding data patterns, detecting anomalies, and testing hypotheses.

What are data sets?

Data sets are collections of related data points typically organized in a table format, consisting of rows and columns, where each column represents a variable, and each row represents a record.

Matthew Johnson

Founder
Published on

July 19, 2024