How to Import from a CSV Without Duplicating Existing Entries in Postgres?
Suppose you have a table website_visitors
in which you regularly want to import visitors’ data from a CSV. Postgres has a built-in command that makes working with CSV files pretty convenient: \COPY. You can use it to import a downloaded visits.csv
into website_visitors
like this:
psql "$DB_STRING" -c "\COPY website_visitors(ip, visitor_id, user_agent, duration, created) \
FROM './visits.csv' \
DELIMITER ',' \
CSV HEADER;
"
But there’s a problem here. If we import the table again, it will duplicate the existing visitors as well. We can add a UNIQUE
constraint to visitor_id
but Postgres will stop the copying operation at the first duplicate visitor_id
it encounters, with a fail message like this:
ERROR: duplicate key value violates unique constraint "website_visitors_pkey1"
DETAIL: Key (visitor_id)=(152) already exists.
CONTEXT: COPY website_visitors, line 2
Unfortunately, there’s no way to specify which rows should be skipped in \COPY
. The way we can still skip duplicates is by using Temp Tables and views. Here’s how:
First, we will create a temporary table temp_website_visitors
using the original table. The advantage of temporary tables is that they last only for the database session, which obviates the need to drop them. Another great feature is that they are only visible in the session they were created in.
To create a temporary table using the definition of website_visitors
we will do:
CREATE TEMPORARY TABLE temp_website_visitors AS
SELECT * FROM website_visitors
WITH NO DATA;
By using WITH NO DATA
we will only be copying the structure, not the contents of the original table.
Now, we can import the CSV data into the temporary table we created.
\COPY temp_website_visitors(ip, visitor_id, user_agent, duration, created) \
FROM './visits.csv' \
DELIMITER ',' \
CSV HEADER;
For the last step, we just need to merge both tables and skip visitors that are already there. There are multiple ways to do this. Here, we will be using subqueries.
INSERT INTO website_visitors
SELECT * FROM temp_website_visitors
WHERE visitor_id NOT IN (
SELECT visitor_id FROM website_visitors
)
For our final command, we would need to pipe the above commands to psql
. That’s because -c
flag only supports running a single query.
echo "
CREATE TEMPORARY TABLE temp_website_visitors AS
SELECT * FROM website_visitors
WITH NO DATA;
\COPY temp_website_visitors(ip, visitor_id, user_agent, duration, created) \
FROM './visits.csv' \
DELIMITER ',' \
CSV HEADER;
INSERT INTO website_visitors
SELECT * FROM temp_website_visitors
WHERE visitor_id NOT IN (
SELECT visitor_id FROM website_visitors
)
" | psql "$DB_STRING"
Output
CREATE TABLE AS
COPY 149
INSERT 0 0