Duplicate Twitter Data (rows) in Same Load

dashDB / Insights for Twitter

I loaded Twitter data from dashDB, with a search keyword, and a date range to limit the number of tweets returned to around 35,000. I did not load the table twice, or append to the table. (I verified the problem I am describing by doing two additional loads.)

I ran the following query:

SELECT COUNT (MESSAGE_ID) row_count FROM MYSUBJECT_TWEETS;
SELECT COUNT (DISTINCT MESSAGE_ID) d_row_count FROM MYSUBJECT_TWEETS;
row_count is about 35,000. (Exactly the same number that IBM reported was originally loaded.)
d_row_count is about 27,000.

So I tried to identify what column(s) had multiple values, and was causing the duplicates.

The following query shows all of the duplicates of MESSAGE_ID, and it returns the maximum number of rows you can display (500). All the duplicates are pairs. There is not 3 of any MESSAGE_ID.

CREATE OR REPLACE VIEW MYSUBJECT
AS ( SELECT *
FROM MYSUBJECT_TWEETS);

SELECT *
FROM (SELECT *, COUNT(*) OVER (PARTITION BY MESSAGE_ID) AS num_id
FROM MYSUBJECT
)
WHERE num_id > 1;
I couldn’t see any difference between the duplicated rows. So I thought maybe it was invisible (non-printable) characters.The following query returns 0 results. (The only change is “SELECT DISTINCT * ” on the second line.)

CREATE OR REPLACE VIEW MYSUBJECT
AS ( SELECT DISTINCT *
FROM MYSUBJECT_TWEETS);

SELECT *
FROM (SELECT *, COUNT(*) OVER (PARTITION BY MESSAGE_ID) AS num_id
FROM MYSUBJECT
)
WHERE num_id > 1;
This means all of those duplicates are identical. (The duplicate pairs were merged with the DISTINCT statement.)

What is going on? Why are their thousands of identical rows in the Twitter data?

Related:

Leave a Reply