In general, I’m not a huge fan of static code analysis because I’ve never managed to find a use case that stays valuable on an ongoing basis. One exception is dupFinder from JetBrains, a really cool tool that analyses your .NET code base and checks for duplicated code segments.

I joined an existing project in November 2015 and I have been using dupFinder sporadically since then, just to keep track of some simple metrics and to see whether our refactoring is having any effect. If you are super keen you can also make it part of your build process so that the report is generated on a more regular basis (there is a small scheduling sketch after the batch file below). One unfortunate thing about the tool is that it doesn’t produce pretty output by default; that requires a bit more effort.

Using this kind person’s gist, and some PowerShell skillz, I have been able to automate the process and produce the output I need. I started with a batch file to call the dupFinder exe:

"c:\tools\dupfinder.exe" /output=dupReport.xml /e="**/*.Designer.cs;**/*.generated.cs;**/Reference.cs;**/DomainModel.cs;**/Metadata1.cs" /show-stats /debug /show-text S:\MySolution.sln

Since this project is database first, and we use an EDMX with some T4 templates to generate the data access and domain layers, we have a lot of auto-generated classes. So when running dupFinder, I exclude any designer-related classes and any EF-related models and metadata, because that is not code we touch directly. There are various other options you can use to tailor the output, depending on your needs; a couple of examples are shown below.
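
For example, you can raise the threshold below which small fragments are ignored, or treat fragments that differ only in literal values as duplicates. This is just an illustrative sketch, with option names as I remember them from the dupFinder documentation, so check the command-line help for your version before relying on them:

"c:\tools\dupfinder.exe" /output=dupReport.xml /discard-cost=100 /discard-literals=true /show-stats /show-text S:\MySolution.sln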

Then, after the exe runs, we need to transform the XML output into an HTML file using the XSL transformation classes provided by the .NET Framework (System.Xml.Xsl). We can use the following PowerShell script to perform the transformation:

function process-XSLT {
    param([string]$a)

    # Path to the stylesheet that turns the dupFinder XML into HTML
    $xsl = join-path $pwd "dupFinderStylesheet.xslt"

    # Write the XML string into a memory stream so it can be read back as an XML reader
    $inputstream = new-object System.IO.MemoryStream
    $xmlvar = new-object System.IO.StreamWriter($inputstream)
    $xmlvar.Write("$a")
    $xmlvar.Flush()
    $inputstream.position = 0
    $xml = new-object System.Xml.XmlTextReader($inputstream)

    # Compile the stylesheet and run the transformation into a second memory stream
    $output = New-Object System.IO.MemoryStream
    $xslt = New-Object System.Xml.Xsl.XslCompiledTransform
    $arglist = new-object System.Xml.Xsl.XsltArgumentList
    $reader = new-object System.IO.StreamReader($output)
    $xslt.Load($xsl)
    $xslt.Transform($xml, $arglist, $output)
    $output.position = 0

    # Read the transformed HTML back out as a string
    $transformed = [string]$reader.ReadToEnd()
    $reader.Close()
    return $transformed
}

# Read the dupFinder report, transform it, and write the HTML report next to it
$inputPath = join-path $pwd "dupReport.xml"
$outputPath = join-path $pwd "dupFinderReport.html"

$inputText = [System.IO.File]::ReadAllText($inputPath)

$text = process-XSLT $inputText

[System.IO.File]::WriteAllText($outputPath, $text)
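
As an aside, XslCompiledTransform can also read the source XML and write the result file directly, which skips the MemoryStream plumbing. A trimmed-down sketch (not the script I actually run, but assuming the same file names as above) would look something like this:

# Minimal file-to-file variant of the same transformation (sketch)
$xmlPath  = Join-Path $pwd "dupReport.xml"              # dupFinder output
$htmlPath = Join-Path $pwd "dupFinderReport.html"       # HTML report to produce
$xslPath  = Join-Path $pwd "dupFinderStylesheet.xslt"   # stylesheet from the gist

$xslt = New-Object System.Xml.Xsl.XslCompiledTransform
$xslt.Load($xslPath)                  # compile the stylesheet
$xslt.Transform($xmlPath, $htmlPath)  # transform straight from file to file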

Our batch file now looks like this:

"c:\tools\dupfinder.exe" /output=dupReport.xml /e="**/*.Designer.cs;**/*.generated.cs;**/Reference.cs;**/DomainModel.cs;**/Metadata1.cs" /show-stats /debug /show-text S:\MySolution.sln

PowerShell -NoProfile -ExecutionPolicy Bypass -Command "& 'C:\Source\RunDupFinderXslt.ps1'"

pause
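
If you would rather have the report generated on a schedule than run it by hand (as mentioned earlier), one low-tech option is to register the batch file as a Windows scheduled task. The file name and schedule below are purely illustrative, and you would want to drop the pause from the scheduled copy:

schtasks /Create /TN "DupFinderReport" /TR "C:\Source\RunDupFinder.bat" /SC WEEKLY /D MON /ST 07:00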

The html output file now looks something like this:

[Image: dupFinder HTML report output]

In the table below, I am comparing the result set from the first time I ran dupFinder with the result set from today. Overall I was a bit surprised at how much larger our code base is (roughly a 700% increase in 18 months!), and I was impressed that in spite of the increase in the number of lines of code (LOC), the percentage of duplicated code has decreased.

                                            Nov 2015     July 2017
Total codebase size (LOC):                  206,968      1,578,689
Code to analyze (LOC):                      15,014       82,144
Total size of duplicated fragments (LOC):   33,713       206,782
Percentage to analyze (%):                  7.25 %       5.20 %
Total size of duplicated fragments (%):     16.29 %      13.10 %
Codebase increase:                          762.77 %     (1,371,721 LOC added)

(The duplicated-fragment percentages are taken against the total codebase size, not just the code analysed.)


With a little more thought and analysis, I realised that the large increase in auto-generated code was due to the tables we have added for new functionality. Each table means more domain entities and related services, and in 18 months we have done a lot of work on the product.

Given the insight that writing this blog post has given me into our current code base, I think static code analysis deserves another look from me.