How to Import from a CSV Without Duplicating Existing Entries in Postgres?

Suppose you have a table website_visitors in which you regularly want to import visitors’ data from a CSV. Postgres has a built-in command that makes working with CSV files pretty convenient: \COPY. You can use it to import a downloaded visits.csv into website_visitors like this:

psql "$DB_STRING" -c "\COPY website_visitors(ip, visitor_id, user_agent, duration, created) \
  FROM './visits.csv' \
  DELIMITER ',' \
  CSV HEADER;
"

But there’s a problem here. If we import the table again, it will duplicate the existing visitors as well. We can add a UNIQUE constraint to visitor_id but Postgres will stop the copying operation at the first duplicate visitor_id it encounters, with a fail message like this:

ERROR:  duplicate key value violates unique constraint "website_visitors_pkey1"
DETAIL:  Key (visitor_id)=(152) already exists.
CONTEXT:  COPY website_visitors, line 2

Unfortunately, there’s no way to specify which rows should be skipped in \COPY. The way we can still skip duplicates is by using Temp Tables and views. Here’s how:

First, we will create a temporary table temp_website_visitors using the original table. The advantage of temporary tables is that they last only for the database session, which obviates the need to drop them. Another great feature is that they are only visible in the session they were created in.

To create a temporary table using the definition of website_visitors we will do:

CREATE TEMPORARY TABLE temp_website_visitors AS
  SELECT * FROM website_visitors
  WITH NO DATA;

By using WITH NO DATA we will only be copying the structure, not the contents of the original table.

Now, we can import the CSV data into the temporary table we created.

\COPY temp_website_visitors(ip, visitor_id, user_agent, duration, created) \
FROM './visits.csv' \
DELIMITER ',' \
CSV HEADER;

For the last step, we just need to merge both tables and skip visitors that are already there. There are multiple ways to do this. Here, we will be using subqueries.

INSERT INTO website_visitors 
  SELECT * FROM temp_website_visitors
  WHERE visitor_id NOT IN (
    SELECT visitor_id FROM website_visitors
  )

For our final command, we would need to pipe the above commands to psql. That’s because -c flag only supports running a single query.

echo "
  CREATE TEMPORARY TABLE temp_website_visitors AS
    SELECT * FROM website_visitors
    WITH NO DATA;
  
  \COPY temp_website_visitors(ip, visitor_id, user_agent, duration, created) \
    FROM './visits.csv' \
    DELIMITER ',' \
    CSV HEADER;
  
  INSERT INTO website_visitors 
    SELECT * FROM temp_website_visitors
    WHERE visitor_id NOT IN (
      SELECT visitor_id FROM website_visitors
    )
" | psql "$DB_STRING"

Output

CREATE TABLE AS
COPY 149
INSERT 0 0