blogs.conchango.com


Anthony Steele's Blog

Duplicate Finder

A while ago (June last year) I wrote a utility to detect runs of duplicate lines in files, which is useful for finding repetitive code that should be refactored. Then I stopped work on it, since it was done. The original blog post is here and the project is up on CodePlex here.

This year I have revisited it and added two new features, which I think make it much more usable.

The first feature is an MSBuild task wrapper to complement the command line interface. This means that instead of using a command line like:

DuplicateFinder -r -t8 -eAssemblyInfo *.cs

you can now also use an equivalent MSBuild script:

<Project DefaultTargets="RunTest"
    xmlns="http://schemas.microsoft.com/developer/msbuild/2003">

  <UsingTask TaskName="DuplicateFinder"
      AssemblyFile="$(MSBuildExtensionsPath)\DuplicateFinder.Tasks.dll" />

  <ItemGroup>
    <TestFiles
        Include="..\..\**\*.cs"
        Exclude="..\..\**\AssemblyInfo.cs" />
  </ItemGroup>

  <Target Name="RunTest">
    <DuplicateFinder Files="@(TestFiles)"
        DuplicateThreshold="8" />
  </Target>
</Project>

This may be more verbose, but it has the advantage that you can run it as part of your automated build process. Since you may not always look at the build output, you can set the CountThresholdForError and LengthThresholdForError options on the MSBuild task to fail the build when the duplicates in the source are too numerous or too long.
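
To make the idea concrete, here is a minimal Python sketch of the kind of check such thresholds imply. It is not the task's actual code: the function name `should_fail_build` and its parameters are made up for illustration, and it assumes the build fails only when a value is strictly greater than its threshold, which the real task may or may not do.

```python
def should_fail_build(duplicate_lengths, count_threshold=None, length_threshold=None):
    """Decide whether a build step should fail.

    duplicate_lengths: one entry per duplicate found, giving its length in lines.
    count_threshold / length_threshold: hypothetical stand-ins for the
    CountThresholdForError and LengthThresholdForError options; None means "not set".
    """
    # Too many duplicates overall (assumed comparison: strictly greater than).
    too_many = count_threshold is not None and len(duplicate_lengths) > count_threshold
    # At least one duplicate that is too long (same assumption).
    too_long = length_threshold is not None and any(
        length > length_threshold for length in duplicate_lengths
    )
    return too_many or too_long


# Two duplicates of 7 and 12 lines, with at most 5 duplicates and 10 lines allowed:
print(should_fail_build([7, 12], count_threshold=5, length_threshold=10))  # prints True
```

With neither threshold set the function never fails the build, which matches the idea that the duplicates are then only reported in the build output.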

The second new feature cuts down or eliminates false positives. Whole files can already be excluded, but we also need to exclude duplicates where the first line starts with a particular prefix. Here is the reason for this.

If we run the duplicate finder on its own source code (excluding the generated AssemblyInfo.cs files, which we know will all look much the same), we get the following:

> DupFinder.exe -r -t7 -eAssemblyInfo.cs *.cs
 
Processing in C:\Temp\DuplicateFinder\DuplicateFinderLib
6 files read
Duplicate of length 7 at:
 Line 1-7 in C:\Temp\DuplicateFinder\DuplicateFinderLib\DuplicateEventArgs.cs
 Line 1-7 in C:\Temp\DuplicateFinder\DuplicateFinderLib\LineItem.cs
 Line 1-7 in C:\Temp\DuplicateFinder\DuplicateFinderLib\LineItemList.cs
1 duplicate found

The duplicate, lines 1-7 of three different files, consists of these lines:

using System;
using System.Collections.Generic;
using System.Text;
 
namespace DuplicateFinderLib
{
    /// <summary>

While these lines may be the same, I don't regard them as "bad" or "cut and paste" code, so I exclude duplicates where the first line starts with "using", like so:

> DupFinder.exe -r -t7 -eAssemblyInfo.cs -xusing *.cs
Processing in C:\Temp\DuplicateFinder\DuplicateFinderLib
6 files read
0 duplicates found

You can also do this in the MSBuild file. More than one prefix can be excluded.
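
As an illustration of how prefix exclusion can work, the following Python sketch finds windows of identical consecutive lines and skips any window whose first line starts with an excluded prefix. It is a simplified stand-in, not DuplicateFinder's own algorithm: the function `find_duplicate_windows` is hypothetical, it compares lines with leading and trailing whitespace stripped, and it reports fixed-size windows rather than merging them into longer runs. The file contents below the shared header are invented for the example.

```python
from collections import defaultdict


def find_duplicate_windows(files, threshold, excluded_prefixes=()):
    """Return groups of locations that share the same `threshold` consecutive lines.

    files: dict mapping a file name to its list of lines.
    threshold: window size in lines (assumed to be at least 1).
    excluded_prefixes: windows whose first line starts with one of these are skipped.
    """
    prefixes = tuple(excluded_prefixes)
    locations = defaultdict(list)
    for name, lines in files.items():
        stripped = [line.strip() for line in lines]
        for start in range(len(stripped) - threshold + 1):
            window = tuple(stripped[start:start + threshold])
            # Skip the window if its first line starts with an excluded prefix.
            if prefixes and window[0].startswith(prefixes):
                continue
            # Record 1-based first and last line numbers.
            locations[window].append((name, start + 1, start + threshold))
    # A window is a duplicate only if it occurs in more than one place.
    return [places for places in locations.values() if len(places) > 1]


header = [
    "using System;",
    "using System.Collections.Generic;",
    "using System.Text;",
    "",
    "namespace DuplicateFinderLib",
    "{",
    "    /// <summary>",
]
files = {
    "LineItem.cs": header + ["    /// One line of a file", "    class LineItem {}", "}"],
    "LineItemList.cs": header + ["    /// A list of lines", "    class LineItemList {}", "}"],
}

print(len(find_duplicate_windows(files, 7)))             # prints 1
print(len(find_duplicate_windows(files, 7, ["using"])))  # prints 0
```

Because `str.startswith` accepts a tuple, passing several prefixes excludes a window when its first line matches any one of them.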

Published 23 February 2008 15:39 by Anthony.Steele


About Anthony.Steele

Programmer in c# for Conchango