Data

How To Clean Data and Enhance It With AI

Read in light mode

Read in dark mode

Have an unwieldy spreadsheet that needs to be cleaned? Here's our step-by-step guide for how to clean data and enhance it 100x faster than doing it manually.

Like many people these days, I’m a power user of ChatGPT and other LLMs to help me with things like drafting outlines for blog posts, generating cover letters, summarizing feedback, and much more. Generally, these tasks are long-form text-based, with the outputs being Google Docs.

Recently, I discovered an entirely new use case for LLMs I hadn’t even thought of: cleaning up messy spreadsheets. This involves data cleansing, where AI tools can be used to clean and prepare data for analysis. While I’ve always seen value in using AI for my long-form text tasks, it wasn’t until I used AI for my spreadsheet use case that I saw a 100x time savings with AI.

As with any AI/LLM task, the proper prompts and workflows are critical for getting the right output. Here’s a step-by-step guide for using ChatGPT to clean your data and enhance 100x faster.

Background

Recently, I found myself with a large data set that was messy, inaccurate, and incomplete. It’s a database of venture capitalists and angel investors that I’m putting together as a data product here at Peak Digital Studio. I’ve been using Phantombuster to scrape LinkedIn for VCs/angels and putting them into a spreadsheet. The problem with the data set is that details are missing, specifically around location, and some results aren’t investors. I wanted to clean the list up to ensure the only entries that remain are actually investors and that we have their location information.

I’ll use this specific example to show how I took a messy, inaccurate spreadsheet and ran it through ChatGPT to clean the data, and ensure the output was valuable for the end user.

Steps for an AI Data Cleaning Process

The following is a step-by-step guide for the data cleaning process using AI. Of course this specific use-case I am using as an example might not exactly fit what you’re doing, but generally the principles will apply. I used ChatGPT for this task as that’s the model I am most comfortable with, so results may vary with some of the other LLMs.

Step 1: Make sure you have the right ChatGPT plan

To upload files, you’ll need the ChatGPT plus plan.

Step 2: Prepare the spreadsheet

Before uploading anything to ChatGPT, make sure your file is below the 100mb upload limit. If you need to remove data or split the spreadsheet in half, you can always start with a smaller file to get the initial results, and upload more spreadsheets later. Be sure your spreadsheet is in .CSV format and that the data type for each column is consistent and correctly formatted.

I also remove any extraneous columns that won’t help ChatGPT do the work I need it to.

Step 3: Give ChatGPT a prompt

Now that your spreadsheet is ready and you have the right plan, you can give ChatGPT a prompt. I use ChatGPT 4o for this task, but have used 3.5 in the past.

To create the prompt, first think of all the issues that are making your data unclean, and work back from that.

In my case, I had lots of names with weird characters (I didn’t actually know the name of these characters so I used Perplexity to ask what they were called and what prompt I should use in ChatGPT to remove them).

I also wanted to determine, using all available information I scraped from their LinkedIn, if the person was an actual investor (for some reason lots of people say they work in VC, but don’t actually). Finally, I wanted the data to have coherent geographic data for each entry so it could be filtered by region, country, and continent.

Here’s the prompt I landed on:

You are a bot that takes .CSV files and cleans and enhances the data. I’m going to provide you with a list of profiles I scraped from LinkedIn, and I’ll give you a few instructions for how to clean:

1. Please clean up and correct tany mojibake or character encoding issues. Convert any special characters or symbols to their standard ASCII equivalents where possible.

2. Remove any middle initials and/or qualifications from people’s names. For example, Dr. Sean T. Smith, DDS please remove the “Dr.”, “T.” and “DDS”.

2. Using all available information about each entry, determine whether they are an actual startup investor - either a VC or an angel. Create a new field called Type, and label each as VC, Angel, or Not Investor

3. Use the “linkedinJobLocation” and “location” fields to first infer their location - for example knowing someone’s entry is Miami, this tells you they are in the city of Miami, state/region of Florida, country of United States of America, and continent of North America. Create new fields called “City”, “State/Region”, “Country”, and “Continent” and generate the data in these columns. Prioritize the field “location” if there’s any conflicts.

Some things to keep in mind:

1. Missing Values - If a field is absent or lacks a value in the text, make reasonable inferences whenever possible.

2. Inconsistencies - Ensure that all similar values are consistently formatted and spelled. For example, for the “state” field, “New Mexico”, “NM”, and “nm” should all be represented as “New Mexico”.

3. Final output - purely in .CSV format

My prompt for cleaning our database of VC and Angel investors

Step 4: Upload the spreadsheet

With the prompt inputted, you can now upload the actual spreadsheet. It takes a bit of time for ChatGPT to process, and it will give you some live updates on what it’s up to as it goes.

ChatGPT will let you know what it's up to while it starts it's data cleaning task

Step 5: Review for missing data and make any fixes

Check the output to see if there's any glaring errors

Be sure to review a few lines of the data, and then have ChatGPT make any fixes you might notice. For example, many of the entries for mine didn’t have proper location data, so I added another prompt after the original run to try and fix that:

Many of these entries have the city, but haven’t used that to infer the rest of the location data. Please fix

As you can see, I am very polite when I speak to my future robot overlord.

Optional Step 6: Keep Cleaning Data With Your New Output

Despite a few tries, I still had some location issues, where it was getting the country wrong. I've found sometimes it's best to start with a blank slate, and create a new chat with simpler instructions. I gave the following prompt in a new chat:

You are a bot that infers from a city’s name what state/region, country, and continent it’s on. I’ll provide a csv file that includes information including a field called “City”. Ignore all other columns and just use the “City” column to infer the rest of the geographic information based on existing data. Create new fields for “State/Region”, “Country” and “Continent” and put the data in those fields.

I removed all fields from the previous .CSV ChatGPT generated for me, and just left in the city. This solved the problem.

Don’t get frustrated - you might have to go back and forth a few times to get this fixed. Try to give clear and easy to understand prompts.

There you have it!

And that’s it! Now, you should have a clean and enhanced spreadsheet with high-quality data. Hopefully, this process will save you hours of busy work.

If you’d like to see the investor database I created while writing this article, check it out here. Or, if you need help setting up your own prompts or cleaning data, get in touch with me and we can help.

FAQ

What do we mean by “Clean Data” and Data Quality

“Clean data” refers to datasets free from errors, inconsistencies, and inaccuracies. It ensures that the data is accurate, complete, and formatted in a suitable way for data analysis and decision-making. Clean data is essential for producing reliable insights and making informed business decisions.

Key Characteristics of Clean Data:

Accuracy: The data correctly represents the real-world values or conditions it is supposed to capture.
Completeness: All required data fields are filled, and there are no missing values.
Consistency: Data is uniform and follows a standardized format across different data sets and over time.
Validity: The data conforms to the defined rules, constraints, and formats. For instance, dates follow a specific format (MM/DD/YYYY), and numerical values fall within expected ranges.
Uniqueness: There are no duplicate records, ensuring that each entry is distinct.
Timeliness: The data is up-to-date and relevant for the current analysis or decision-making process.
Relevance: The data is pertinent to the specific context or analysis being performed.

Common Issues in Dirty Data:

Typos and Errors: Spelling mistakes, incorrect entries.
Missing Data: Empty fields that should contain data.
Duplicate Records: Multiple entries for the same item or entity.
Inconsistent Formats: Variations in date formats, units of measurement, or naming conventions.
Outliers: Data points that are significantly different from others and may indicate errors or unique conditions.
Invalid Data: Entries that do not conform to expected formats or fall outside acceptable ranges.

Importance of Clean Data:

Improved Accuracy: Ensures that analyses and insights are based on correct information.
Better Decision-Making: Reliable data supports more informed and effective decisions.
Efficiency: Reduces the time and resources spent on data scrubbing to correct errors and inconsistencies.
Compliance: Helps meet regulatory requirements and standards for data quality.
Enhanced Customer Trust: Ensures that customer data is accurate and up-to-date, improving trust and satisfaction.

Consolidating Data from Multiple Sources:Combining data from multiple data sources is crucial to avoid errors, duplication, and inconsistency in the final dataset used for analysis.

Formatting Numeric Data:Ensuring numeric data is correctly formatted is essential for accurate analysis, including converting categorical data into numeric data or filling missing numeric data with specific values.

Uniformity in Data:Data must be converted to the same unit of measure to ensure consistency and enable accurate analysis.

What is data analysis?

Data analysis is the process of examining, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making. It involves defining objectives, collecting data, cleaning and organizing data, using statistical tools to analyze the data, and interpreting the results to make informed decisions.

What makes manually cleaning data challenging?

Manually cleaning data is challenging because it is time-consuming, prone to human error, and requires a thorough understanding of the data to accurately identify and correct issues. Common methods include removing duplicates, handling missing values, correcting inaccuracies, and standardizing formats.

What is exploratory data analysis?

Exploratory Data Analysis (EDA) is an approach for summarizing and visualizing the important characteristics of a dataset, often using graphical techniques. It helps in understanding data patterns, detecting anomalies, and testing hypotheses.

What are data sets?

Data sets are collections of related data points typically organized in a table format, consisting of rows and columns, where each column represents a variable, and each row represents a record.

Matthew Johnson

Founder

Marketing

How to Create a Visual Brand Identity

What is visual brand identity? This guide walks you through the process used by branding pros to get it right.

Matthew Johnson

Founder

Start a project

How To Clean Data and Enhance It With AI

Background

Steps for an AI Data Cleaning Process

Step 1: Make sure you have the right ChatGPT plan

Step 2: Prepare the spreadsheet

Step 3: Give ChatGPT a prompt

Step 4: Upload the spreadsheet

Step 5: Review for missing data and make any fixes

Optional Step 6: Keep Cleaning Data With Your New Output

There you have it!

FAQ

What do we mean by “Clean Data” and Data Quality

What is data analysis?

What makes manually cleaning data challenging?

What is exploratory data analysis?

What are data sets?

Matthew Johnson

How to Create a Visual Brand Identity

Websites Shouldn’t Be So Hard

How to Use ChatGPT As a Copywriting Tool to Reach the Right Audience

The Best Project Management Software for Small Businesses in 2025

8 Steps to Starting a Small Business: A Guide for First-Time Entrepreneurs

The Complete Guide to Sales Collateral: What It Is and Why You Need It

Small Business Marketing Items You Need to Know About

What Are the Best Networking Events for Business Owners?

20+ Ideas for Business Ventures

Tahoe Inc: A Resource for Entrepreneurs in the Tahoe Basin

Truckee/Tahoe businesses and non-profits: Need a website in April?

Marketing Materials Examples for Your Business

How to Choose an Affordable Web Design Package for Your Business

Why You Should Have a Newsletter

Free Product Hunt Launch Consultations

The Best SEO Tools for Small Businesses

How to Excel at Networking

Social Media Audit Services

Best Software for Small Businesses

50 Top Logo Examples to Inspire and Elevate Your Brand Identity

Understanding Website Package Pricing: What to Expect and How to Choose the Best Option

Guide to Website Management Services

Top Options for Affordable Small Business Website Design: Expert Guidance and Solutions

The Complete Guide to Understanding the Rounds of Venture Capital Funding (2025)

Discover the Best AI Tools for Business to Boost Productivity

Starting a Business for Dummies: Ultimate Guide

How to Find VCs: A Proven Guide for Startups

The Ultimate Guide to Branding Pricing: Costs & Tips for 2024

Finding Website Developers for Small Business Owners

8 Types of Marketing Explained

Small Business Web Maintenance: Website Services

2024 Startup Week Database

9 Tips for Starting a Business With No Money

PeoplePerHour And Upwork: Why They Are Bad Places to Find Designers

Why Subscription Design is the Way to Go for Your Business

Where to Look for Small Businesses for Sale

A Simple Podcast Name Generator Tool to Kickstart Your Podcasting Journey

Guide to getting an excellent podcast logo and design

Best Non-Profit Websites

How to Find Affordable Web Design

Technical SEO Services: The Ultimate Guide to Optimize Your Website

30+ Business Networking Events to Attend in 2024

Truckee Jobs Collective: Connecting Truckee-Tahoe Employers and Job Seekers

How to Launch on Product Hunt

YC Application Questions Template

Pre-Seed VC Database

Best domain registrar for small business

I want to start a business, but have no ideas

Key questions to ask when hiring Webflow agencies

Best Productivity System to Help You Get More Done

Looking for an Upwork Alternative?

About Us: Website Design Truckee

Spotlight: Kevin Ascher & Acoustic Selling

Why Are My Emails Going to Spam, and How Can I Fix It?

Average Cost of a Logo Design: A Comprehensive Guide

Spotlight: Startup Village & Lance Ding

Guide to Content Marketing Services

Checklist for Starting a Business

AI Powered Content Creation

Best Small Business Website Builder: Which One is Right for You?

Content Marketing For Startup: How to Generate Ideas

Subscribe to Our Newsletter