Don’t Use A Lookup Component To Find Duplicate Rows

19 October, 2006. 299 words.

Another in a series of random programming tips by Thomas Krehbiel, mostly to help me remember them in the future.

My SSIS project loads data from a source, transforms it, and stores it in a destination. To prevent storing data that has already been processed, I tried to use a Lookup component to find rows that already existed in the destination. This worked at first, but you’ll have problems if the number of lookup rows becomes too large to cache in memory (you’ll be able to tell this from the Windows dialogs informing you to increase the size of your virtual memory, followed by your machine slowing down to the point where it might as well have crashed, followed by a fifteen- or twenty-minute reboot process).

My first solution was to turn off the caching on the Lookup component. I figured it would slow things down, but it would still run. Wrong. With both a partial cache and no cache, the data flow stopped completely at the Lookup component. It sat there for one entire hour with only 6,000 input rows. There was probably a database-related reason for that, but I gave up and went a different route.

I ended up using a Merge Join component to merge the incoming data with the previously stored data using a left outer join. (This, of course, requires a unique identifier on the input and output.) If an existing row was found, the existing row ID will be non-null on the output of the join, and those rows can be split off and ignored with a Conditional Split component (or whatever). This solution worked well and saved a lot of time and memory in the package.

The moral of the story is: Don’t use a Lookup component to filter out duplicate rows.

Related