\PassOptionsToPackage{table}{xcolor}
%\documentclass[10pt, compsoc,journal,twocolumn]{IEEEtran}
\documentclass[10pt,compsoc,twocolumn]{IEEEtran}
\usepackage{dblfloatfix}
\usepackage[utf8]{inputenc}
\usepackage{tcolorbox}
\usepackage{booktabs}
\usepackage{wrapfig}
\usepackage{multirow}
\usepackage{balance}
\usepackage{ragged2e}
\usepackage{tabularx}
\usepackage{cite}
\usepackage{colortbl}
\usepackage{subcaption}
\usepackage[ruled,linesnumbered]{algorithm2e}
\usepackage{url}
\usepackage{amsmath}
\usepackage{amsthm}
\usepackage{amsfonts}
\usepackage[table]{xcolor}
\newtcolorbox{blockquote}{colback=blue!5!white,boxrule=0.4pt,colframe=gray!60!black,fonttitle=\bfseries}
\newcolumntype{L}{>{\raggedright\arraybackslash}X}
\newcommand{\bi}{\begin{itemize}}
\newcommand{\ei}{\end{itemize}}
%%%%%%%%%%%%%%
% For reviewer responses
%%%%%%%%%%%%%%
\newcommand{\bluecheck}{}%
\DeclareRobustCommand{\bluecheck}{%
\tikz\fill[scale=0.4, color=blue]
(0,.35) -- (.25,0) -- (1,.7) -- (.25,.15) -- cycle;%
}
\newcommand{\response}[1]{\textcolor{blue}{#1}}
\usepackage[tikz]{bclogo}
\newenvironment{RQ}[1]%
{\noindent\begin{minipage}[c]{\linewidth}%
\begin{bclogo}[couleur=gray!20,%
arrondi=0,logo=\none,%
ombre=true%
]{{\small ~#1}}}%
{\end{bclogo}\vspace{2mm}\end{minipage}}
%% reviewing
\newcommand{\blue}[1]{{\color{blue}{#1}}}
\newcommand{\reponse}[1]{\noindent{#1\\}}
\newcommand{\todo}[1]{\textbf{\color{red}{#1}}}
\newcommand{\subsect}[1]{\SS\ref{subsect:#1}}
%% Response text prefix
\newcommand{\respto}[1]{
\fcolorbox{black}{black!15}{%
\label{resp:#1}%
\bf\scriptsize R{#1}}}
%% Response text prefix
\newcommand{\bareresp}[1]{
\fcolorbox{black}{black!15}{%
\bf\scriptsize R{#1}}}
\newcommand{\BLUE}{\color{blue}}
\newcommand{\BLACK}{\color{black}}
%% Cite responses
\newcommand{\citeresp}[1]{%
{(see }\fcolorbox{black}{black!15}{%
\bf\scriptsize R{#1}}~{{on page \pageref{resp:#1})}}}%
\title{Improving Deep Learning for Defect Prediction (using the GHOST Hyperparameter Optimizer)}
\author{Rahul~Yedida and Tim~Menzies,~\IEEEmembership{IEEE Fellow}%
\IEEEcompsocitemizethanks{\IEEEcompsocthanksitem R. Yedida and T. Menzies are with the Department of Computer Science, North Carolina State University. Email: ryedida@ncsu.edu, timm@ieee.org}.%
\thanks{Manuscript submitted to TSE, August 9, 2020.}}
\begin{document}
\IEEEtitleabstractindextext{
\begin{abstract}
There has been much recent interest in the application
of deep learning neural
networks in software engineering.
Some researchers are worried that deep learning is being applied with insufficient critical assessment.
Hence, for one well-studied software analytics task (defect prediction), this paper
compares deep learning versus
prior-state-of-the-art results. We show how deep learning can outperform
those prior results after
adjusting its hyperparameters using GHOST (Goal-oriented Hyperparameter Optimization for Scalable Training).
For defect prediction, GHOST terminates in just a few minutes and scales
to larger data sets;
i.e., it is practical to tune deep learning for defect prediction.
Hence this paper recommends deep learning for defect prediction, but only after
adjusting its goal predicates and tuning its hyperparameters (using some
hyperparameter optimization tool, like GHOST). %We also observe a generalization effect, where GHOST achieves state-of-the-art performance on cross-project defect prediction.
% These results suggest that, for future work, the following research agenda could be insightful:
% (a)~divided analytics into various domains;
% (b)~determine the prior non-deep learning state-of-the-art for each domain; (c)~compare deep learning with that state-of-the-art;
% (d)~for that particular domain, seek ways to better adapt
% deep learning to SE tasks.
\end{abstract}
\begin{IEEEkeywords}
defect prediction, hyperparameter optimization, neural networks
\end{IEEEkeywords}}
\maketitle
%\IEEEpeerreviewmaketitle
%\IEEEdisplaynontitleabstractindextext
\section{Introduction}
There has been much recent interest in the application
of deep learning (DL) neural
networks to (e.g.)
bug localization~\cite{huo2019deep},
sentiment analysis~\cite{lin2018sentiment,guo2017semantically},
API mining~\cite{chen2019mining,gu2016deep,nguyen2017exploring}, effort estimation for agile development~\cite{choetkiertikul2018deep}, code similarity detection~\cite{zhao2018deepsim},
code clone detection~\cite{white2016deep}, etc.
Some researchers are worried that DL is being applied with insufficient critical assessment.
For example, Gary Marcus~\cite{marcus2018deep}
warns that DL may be so over-hyped that it runs a ``fresh risk for seriously dashed expectations'' that could blinker AI researchers from trying new ideas, as well as bring on another AI winter (the period in the mid-to-late 1980s when most AI companies went bankrupt due to the poor generalization
of their methods).
% A major issue with deep learning is it high computationally expensive. For example, in the taking CMA-ES \cite{loshchilov2016cma} for example, they take 17 hours to run hyperparameter optimization on 30 GPUs (running each deep learner for 30 minutes). Such resources are not available to all research groups; we seek to improve this for software engineering.
So what is the benefit of DL? Is the technology over-hyped? Should we stop using everything else and just move to DL?
Very few SE researchers are studying this question.
In their literature review, Li et al.~\cite{li2018deep} explored over 40 SE papers facilitated by DL models and found that 33 of them used DL without comparison to other (non-DL) techniques.
In our own literature review on DL in SE (reported below in \S\ref{sec:background}), we find experimental
comparisons of DL-vs-non-DL in less than 10\% of papers.
Therefore,
in order to better understand the benefits of DL,
% Deep learning is becoming increasingly popular for software defect prediction \cite{Fu2017,Li2017,Tufano2018,Wang2016,White2016}. Recent
% results show that such algorithms can
% (a)~dynamically extracting features from (e.g.) code using complex neural networks \cite{Wang2016,Dam2018};
% and that (b)~these features are insightful to many software
% analytics tasks~\cite{Dam2018,Hoang2019,Dam2019}.
% However, these methods suffer from slow training times which means in turn that it is difficult to explore the hyperparameter space of these systems. In effect, this means the trained models may not, in fact, be optimal. Also, in an industrial setting, such slow runtimes could render the trained models stale, especially as newer data is constantly available and model updating is desirable.
% At the care of this problem are several issues:
% \begin{itemize}
% \item
% Complex deep learners can have many parameters \cite{ba2014deep} and a large parameter space means optimization over the complex loss function landscape is a fundamentally difficult task.
% \item Such loss landscapes are typically ``non-convex''~\cite{li2018visualizing} which, as we shall see, is an important issue that significantly
% effects learner performance.
% \item
% Hyperparameter optimization for such deep neural networks is impractical because of the large training times.
% \end{itemize}
% As we shall show, these issues are not explored in a rigorous manner in the software analytics literature.
% In fact, our reading of the literature (presented below) is that deep learning is mostly reported without comparisons to prior (non-DL) approaches.
% On the whole, most research apply deep learners ``off the shelf'' without due consideration to how deep learners can be better adjusted to SE tasks.
this paper
compares deep learning with established results in software analytics;
specifically: {\em defect prediction from static code attributes}.
% \bi
% \item
% To effectively manage resources, it is common to match the quality assurance (QA) effort to the perceived criticality and bugginess of the code.
% \item
% It is impractical and inefficient to distribute equal qualitaitve assurance effort to every component in a software system~\cite{briand1993developing}.
% \item
% Learning defect predictors (using data miners) from static code attributes is one very cheap way to ``peek'' at the code to see where to spend more QA effort.
% \item
% Such defect predcitors are widely studied, with many recent publications~\cite{hoang2019deepjit, wang2016automatically, dam2019lessons}. Hence, we use it here to comparatively assess DL against prior work in SE.
% \ei
For that comparison, we use a prior state-of-the-art result: specifically, the
DODGE hyperparameter optimizer~\cite{agrawal2019dodge}.
DODGE is a useful comparison method since it is a recent state-of-the-art result (published in TSE'19). Also, internally,
DODGE explores billions of different learners and their configurations. Hence ``comparing against DODGE'' really means comparing against a very wide range of alternatives.
Using defect prediction as a case study and DODGE as a comparison tool, we ask the following questions.
\textbf{RQ1: Does standard deep learning work for defect prediction?}
\begin{blockquote}
\noindent
\BLUE
For defect prediction,
standard deep learners are outperformed by existing state-of-the-art methods in 30/40 experiments. \BLACK
\respto{2a1.1}
\end{blockquote}
The key phrase in the last paragraph is ``standard deep learners''. \BLUE A precise definition of this will be provided in \S \ref{sec:deeplearning}, but for continuity, we briefly mention here that a ``standard" deep learner would be accepted by a deep learning expert as a reasonably well-designed architecture based on advancements in deep learning theory. \BLACK
For {\bf RQ1}, we ran the deep learners using the standard configurations recommended in the literature. What we show in this paper is that standard deep learning can be greatly improved by first asking:
\textbf{RQ2: Why does standard deep learning fail for defect prediction?}
We find that issues with
how the decision boundary
is explored can explain the above.
Specifically:
\begin{blockquote}
\noindent
The lack-of-success of deep learning in defect prediction can be attributed to optimizing for the wrong performance measure.
\end{blockquote}
This result led to a new tool called GHOST (\textbf{G}oal-oriented \textbf{H}yperparameter \textbf{O}ptimization for \textbf{S}calable \textbf{T}raining).
This new tool (a)~uses the same \mbox{``$\epsilon$-domination''} heuristic as DODGE (explained in \S\ref{sec:dodge}) while, at the same time, (b)~extends DODGE to add a novel ``weighted loss function'' (explained in \S\ref{sec:loss}).
To assess GHOST, we ask:
\textbf{RQ3: How might we improve the results of deep learners on defect prediction?}
Using GHOST, we tune the DL loss function to find:
\begin{blockquote}
\noindent
For most evaluation goals,
our modified deep learner performs better than the prior state-of-the-art.
\end{blockquote}
%\textbf{RQ4: Can we always improve Deep Learning? }
%
% {\bf RQ3} reports that improving the loss function leads to useful improvements in deep learners.
\textbf{RQ4: Does deep learning work well in all cases?}
For this question,
we evaluate deep learners over multiple metrics
(area under the ROC curve (AUC), recall, false positive rate, and popt20). GHOST achieves superior results according to 3 of the 4 criteria (AUC, recall, and popt20; see Table \ref{tab:ghost_dodge}), but
according to the fourth criterion (false positive rate), deep learning is
defeated by the prior state-of-the-art. Hence we say:
\begin{blockquote}
\noindent
We recommend the use of deep learning in domains where false alarms are not particularly catastrophic.
\end{blockquote}
\noindent
Another way to state the conclusion of {\bf RQ4} is that
developers need to experiment before selecting the methods that work best in the local domain. In theory, it is hard to conduct these experiments
since many researchers report that deep learners have very long training times~\cite{lee2020biobert, brown2020language} (e.g. Lee et al. ~\cite{lee2020biobert} report that training took 23 days on 8 NVIDIA Tesla V100 GPUs).
Note that if it is so hard to experiment with deep learning,
then it would be hard to apply the lessons of {\bf RQ4}.
Accordingly, we must check:
\textbf{RQ5: How slow is tuning for
deep learning for defect prediction?}
We find that for the defect prediction data sets studied here, the median training time for deep learners with GHOST's hyperparameter optimization is 10 minutes.
Also,
we find that the tuning time increases at a slower rate than the number of rows in the datasets; e.g., with a 10x increase in the size of the data, the largest increase in the runtime was less than 4x.
Hence we say:
\begin{blockquote}
\noindent
For defect prediction, deep learners are both practical and tractable.
\end{blockquote}
\noindent
In summary, our key research contributions are:
\begin{itemize}
\item We show that for defect prediction, off-the-shelf deep learning is not recommended (see {\bf RQ1}).
\item We show that, contrary to prior pessimism, tuning deep learning algorithms is useful (see {\bf RQ2, RQ3}) as well as practical and tractable (see {\bf RQ5}).
\item For many goals, deep learning can out-perform prior state-of-the-art results in defect prediction.
\item %But sometimes, deep learning is recommended in software analytics without sufficient evaluation. Specifically,
That said,
deep learning is not recommended for defect prediction for certain goals (see {\bf RQ4}).
\end{itemize}
We also offer two systems-level contributions:
\begin{itemize}
\item
\textbf{GHOST}, a novel
tuning method;
\item A reproduction package containing all the code and data used in this study\footnote{See
\url{https://tiny.cc/ghost-dl}. }.
\end{itemize}
The rest of this paper is structured as follows. Section \ref{sec:background} offers some background notes. Section \ref{sec:method} discusses our approach in detail, along with our experimental design. Section \ref{sec:results} offers answers
to our research questions. In Section \ref{sec:threats}, we discuss some threats to validity. Our conclusion, in Section \ref{sec:conclusion}, will be
\begin{blockquote}
\noindent
For SE, do not use off-the-shelf deep learning.
Instead,
tune that algorithm to the needs of SE (using tools like, e.g. GHOST).
\end{blockquote}
\begin{table*}[t!]
\caption{Selected papers after applying filters (top SE venues, at least 10 cites per year).}\label{tab:papers}
{\rowcolors{2}{white}{blue!5}
\scriptsize
\begin{tabularx}{\textwidth}{llrlL}
\toprule
Year & Venue & Cites & Use Cases & Title \\
\midrule
2016 & ICSE & 262 & defect prediction & Automatically Learning Semantic Features for Defect Prediction \cite{wang2016automatically} \\
2016 & ASE & 224 & code clone detection & Deep learning code fragments for code clone detection \cite{white2016deep} \\
2015 & MSR & 183 & code representations & Toward Deep Learning Software Repositories \cite{deshmukh2017towards} \\
2017 & ICSE & 83 & trace links & Semantically Enhanced Software Traceability Using Deep Learning Technique \cite{guo2017semantically} \\
2017 & ICSME & 58 & code clone detection & CCLearner: A Deep Learning-Based Clone Detection Approach \cite{li2017cclearner} \\
2019 & TSE & 37 & story point estimation & A Deep Learning Model for Estimating Story Points \cite{choetkiertikul2018deep}\\
2018 & MSR & 34 & code clone detection & Deep Learning Similarities from Different Representations of Source Code \cite{marcus2018deep} \\
2017 & ICSME & 33 & vulnerability prediction & Learning to Predict Severity of Software Vulnerability Using Only Vulnerability Description \cite{han2017learning} \\
2018 & TSE & 24 & defect prediction & Deep Semantic Feature Learning for Software Defect Prediction \cite{wang2018deep} \\
2017 & ICSME & 23 & duplicate bug retrieval & Towards Accurate Duplicate Bug Retrieval Using Deep Learning Techniques \cite{deshmukh2017towards} \\
2019 & ICSE & 10 & bug localization & CRADLE: Cross-Backend Validation to Detect and Localize Bugs in Deep Learning Libraries \cite{pham2019cradle} \\
2019 & MSR & 8 & defect prediction & DeepJIT: An End-to-End Deep Learning Framework for Just-in-Time Defect Prediction \cite{hoang2019deepjit} \\
2019 & MSR & 5 & defect prediction & Lessons Learned from Using a Deep Tree-Based Model for Software Defect Prediction in Practice \cite{dam2019lessons} \\
2019 & ICSE-NIER & 4 & transfer learning & Leveraging Small Software Engineering Data Sets with Pre-Trained Neural Networks \cite{robbes2019leveraging} \\
2018 & TSE & 4 & defect prediction & How Well Do Change Sequences Predict Defects? Sequence Learning from Software Changes \cite{wen2018well} \\
2018 & IC SESS & 1 & language model & A Neural Language Model with a Modified Attention Mechanism for Software Code \cite{zhang2018neural} \\
\bottomrule
\end{tabularx}
}
\end{table*}
\section{Background}
\label{sec:background}
\subsection{Why study defect prediction?}
The case studies
of this paper relate to defect prediction.
This section
reviews why that is worthy of study.
% Defect prediction is a (very) widely explored area with many application areas.
% Specifically, in 2020, software defect prediction is now
% a ``sub-routine'' that enables much other research.
% As today's software grows rapidly both in size and number, software testing for capturing those defects plays an increasingly important role.
During software development, the testing process often has some resource limitations.
For example, the cost of coordinating human effort across large code bases can grow exponentially with the scale of the project~\cite{fu2016tuning}.
Since every quality assurance decision is associated with a human and resource cost to the developer team, it is impractical and inefficient to distribute equal effort to every component in a software system~\cite{briand1993developing}.
Hence,
it is common to match the quality assurance (QA) effort to the perceived criticality and bugginess of the code for managing resources efficiently.
Defect prediction models offer an efficient way to triage incoming changes, focusing inspection on the modules a predictor flags as most defect-prone.
Such predictors can save labor compared with traditional manual methods.
Misirli et al.~\cite{misirli2011ai} built a defect prediction model for a telecommunications company. Their models predicted 87\% of code defects and decreased inspection efforts by 72\% (and reduced post-release defects by 44\%).
Also,
Kim et al.~\cite{kim2015remi} applied the REMI
defect prediction model to API development process at Samsung.
Their models could
predict the bug-prone APIs with reasonable accuracy (an F1 score of 0.68) and reduce the resources required for executing test cases.
Not only that, but defect predictors are also competitive with certain automatic methods.
Rahman et al. ~\cite{rahman2014comparing} compared (a) static code analysis tools FindBugs, Jlint, and PMD with (b) defect predictors (which they called ``statistical defect prediction'') built using logistic regression.
No significant differences in cost-effectiveness were observed.
Consequently,
there is much interest in the industrial community about defect prediction.
In a survey of 395 practitioners from 33 countries and five continents, Wan et al.~\cite{wan2018perceptions} found that over 90\% of the respondents were willing to adopt defect prediction techniques.
\BLUE
\subsection{Deep learning}
\label{sec:deeplearning}
For the remainder of this paper, we define a ``standard deep learner" as one whose architecture would be reasonably justifiable to a deep learning expert.
To do so, we use the results of a recent paper published in a top venue (NeurIPS) by Montufar et al. \cite{montufar2014number}, who derive theoretical lower and upper bounds on the number of piecewise linear regions. In their paper, the authors provide theoretical proof that the decision boundary learned by a feedforward neural network \cite{lecun2015deep} using ReLU activation functions\footnote{The ReLU (rectified linear unit) activation function is $f(x) = \max(0, x)$.} in the hidden layers is comprised of piecewise linear regions. They then go on to derive lower and upper bounds for the number of linear regions that can comprise such a boundary, under reasonable circumstances (we ensured that our ``standard" learners satisfied all their constraints).
While the exact equations derived in their paper are not particularly important here, we do note an observation from their derived lower bound: by maintaining the number of nodes in the hidden layers of the network to be at least as many as the number of inputs (i.e., the dimensionality of the data), the lower bound for the number of piecewise linear regions in the decision boundary is nonzero. We interpret this as, by setting the number of nodes in each hidden layer \textit{equal} (for simplicity) to the dimensionality of the data, the network must make an effort to separate the two classes (i.e., using a nonzero number of piecewise linear regions). We stress here that this certainly does not guarantee an optimal boundary, but it does provide a guarantee for a nontrivial decision boundary.
\respto{1a2.1}
\BLACK
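To make the above design rule concrete, the following is a minimal sketch (our own illustration, assuming the Keras API; it is not the exact GHOST implementation) of a ``standard'' deep learner in this sense: a feed-forward network whose hidden layers use ReLU activations and have at least as many units as the data has attributes, with a sigmoid output for binary defect prediction.
\begin{verbatim}
# Minimal sketch of a "standard" deep learner in the sense of
# this section: feedforward, ReLU hidden layers, and at least
# as many units per hidden layer as there are input attributes.
# (Illustrative only; not the exact GHOST implementation.)
from tensorflow import keras

def standard_learner(n_attributes, n_hidden_layers=2):
    model = keras.Sequential()
    model.add(keras.Input(shape=(n_attributes,)))
    for _ in range(n_hidden_layers):
        # width == input dimensionality keeps the lower bound on
        # the number of linear regions nonzero
        model.add(keras.layers.Dense(n_attributes,
                                     activation="relu"))
    # sigmoid output for binary defect prediction
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model
\end{verbatim}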
\subsection{ Deep learning for Software Engineering}
\label{sec:dlse}
The rest of this paper explores defect prediction
using DL and non-DL methods.
To understand the current state of deep learning in software engineering, in May 2020,
we explored the literature as follows.
Using Google Scholar, we searched for research papers using the keyword ``deep learning AND software engineering''. This returned over 900 papers with at least one citation.
To narrow down that list, we looked for papers published in the top venues (defined per the Google Scholar ``software systems'' metrics), with $N\ge 10$ cites per year (and for papers
more recent than 2017, we used $N \ge 1$). This led to the list of papers in Table~\ref{tab:papers}.
From that survey, it was clear that
there is a growing research interest in the use of deep learning for software engineering. For example:
\bi
\item
Zhang et al.~\cite{zhang2018neural} model code using a language model after standard Natural Language Processing (NLP) preprocessing steps (tokenization, comment removal, rare word replacement, etc.).
\item
Wang et al.~\cite{wang2018deep,wang2016automatically} use a Deep Belief Network (DBN) to learn ``semantic" features and then use classical learners to perform defect prediction.
\item
Pham et al.~\cite{pham2019cradle} propose CRADLE, which uses pre-trained language models for bug localization.
\item
Wen et al.~\cite{wen2018well} extract semantic feature vectors, discretize them, and then use
them for defect prediction.
\item
Hoang et al. ~\cite{hoang2019deepjit} propose DeepJIT, a deep learning framework for just-in-time defect prediction.
\item
Li et al. ~\cite{li2017cclearner} obtain frequency count vectors from source code, learn pairwise similarity scores, and use a vanilla feedforward neural network to find code clones. \item
Deshmukh et al. ~\cite{deshmukh2017towards} use a Siamese LSTM \cite{koch2015siamese} and a convolutional neural network (CNN) to retrieve duplicate
bugs.
\item
Robbes et al.~\cite{robbes2019leveraging} show that deep learners can learn from small datasets by leveraging pre-trained models and ULMFit \cite{howard2018universal}.
\item
Dam et al.~\cite{dam2019lessons} develop a Tree-LSTM \cite{tai2015improved} to extract a feature vector from the Abstract Syntax Tree of
the code (AST) and use it for defect prediction. \item
White et al. ~\cite{white2016deep} use recurrent neural networks to extract embeddings for
code clone detection. White et al. ~\cite{white2015toward} learn code representations using deep learners.
\item
Han et al. ~\cite{han2017learning} feed GloVe \cite{pennington2014glove} embeddings to a CNN to extract features, and finally use an SVM classifier for vulnerability prediction.
\ei
\begin{figure}%{r}{2.1in}
\begin{center}
\includegraphics[width=.5\linewidth]{venn.png}
\end{center}
\caption{Summary of Table \ref{tab:papers}. The paper satisfying all three criteria is \cite{wang2016automatically}; the paper that compares with classical learners is \cite{dam2019lessons}}
\label{fig:venn}
\end{figure}
One feature we note from our sample of research papers
is that the code and data for most of these papers are not open source. Some are even protected by patent applications. This has implications for reproducibility. For example,
this paper
baselines
some of our
results against
Wang et al. \cite{wang2016automatically}.
Their methods
are protected by a patent, and therefore we could not replicate their results using their code.
That said, we were able to
find their exact training and test sets,
which we used
for comparison purposes.
Figure \ref{fig:venn} summarizes key features of that literature.
Note that few papers compare their results with non-deep learning methods \cite{wang2016automatically,dam2019lessons}. Also,
Figure~\ref{fig:venn} shows only one prior
result which included deep learning, hyperparameter optimization, and a comparison to non-DL methods~\cite{wang2016automatically}.
Further,
several of the papers warn of the extreme cost of performing deep learning; specifically, very long training times
(e.g. when learning semantic features~\cite{wang2016automatically}).
Furthermore,
most of the papers do not perform a structured hyperparameter optimization (we only see this done by \cite{wang2016automatically}, using a grid search approach). Even then, very few hyperparameters are tested.
\begin{table*}[!t]
% \caption{Learners and preprocessors used by DODGE and GHOST}
% \label{tab:dodge}
%\captionsetup{font=footnotesize}
\scriptsize
\begin{tcolorbox}[colback=white]
\begin{flushleft}
\textbf{Learners used by DODGE:}
\noindent
\begin{itemize}
\item DecisionTreeClassifier(criterion=b, splitter=c, min\_samples\_split=a).
a, b, c= randuniform(0.0,1.0), randchoice([`gini',`entropy']), randchoice([`best',`random']).
\item RandomForestClassifier(n\_estimators=a,criterion=b, min\_samples\_split=c). a,b,c = randint(50, 150), randchoice(['gini', 'entropy']), randuniform(0.0, 1.0)
\item LogisticRegression(penalty=a, tol=b, C=float(c)).
a,b,c=randchoice([`l1',`l2']), randuniform(0.0,0.1), randint(1,500)
\item MultinomialNB(alpha=a) = randuniform(0.0,0.1)
\item KNeighborsClassifier(n\_neighbors=a, weights=b, p=d, metric=c). a, b,c = randint(2, 25), randchoice([`uniform', `distance']), randchoice([`minkowski',`chebyshev']).
If c=='minkowski': d= randint(1,15) else: d=2
\end{itemize}
{\bf Pre-processors used by DODGE (and the ones in italics were also used by GHOST):}
\bi
\item \textit{StandardScaler}
\item \textit{MinMaxScaler}
\item \textit{Normalizer(norm=a) = randchoice([`l1', `l2',`max'])}
\item MaxAbsScaler
\item RobustScaler(quantile\_range=(a, b)) = randint(0,50), randint(51,100)
\item KernelCenterer
\item QuantileTransformer(n\_quantiles=a, output\_distribution=c, subsample=b). a, b = randint(100, 1000), randint(1000, 1e5);
c=randchoice([`normal',`uniform']).
\item Binarizer(threshold=a) = randuniform(0,100)
\ei
\end{flushleft}
\end{tcolorbox}
\caption{Hyperparameter options explored by DODGE \cite{agrawal2019dodge}. The GHOST optimizer uses some of these parameters (see the pre-processing options shown in {\em italics}) in the manner discussed in Table~\ref{tab:preprocessors2}.}\label{tab:preprocessors1}
\end{table*}
This lack of hyperparameter optimization in DL defect prediction papers~\cite{wang2016automatically,dam2019lessons, wang2018deep}
is of some concern.
Hyperparameter optimization is the tuning of so-called ``hyperparameters''--parameters that are not learned by the algorithm, but instead control how the algorithm works--to attain optimal performance.
Hyperparameter optimization is very useful when using complex learners.
This is especially true for deep learners, which have far more parameters and more complex loss surfaces to explore. Montufar et al.~\cite{montufar2014number} note that the expressive power of deep learners comes from the architecture, which determines bounds on the number of piecewise linear regions that can be represented as a decision boundary.
\subsection{Hyperparameter Optimization}
% Our paper stands out from the above papers by addressing these claims. Specifically, we
% \begin{itemize}
% \item Compare against results that use hyperparameter optimization on classical machine learning models.
% \item Perform hyperparameter optimization on a larger set of hyper-parameters, and use baselines based on the recommendations of deep learning literature \cite{montufar2014number}.
% \item We show that, at least for these data sets, such tuning is not very computational expensive.
% For example, our approach takes less than 4 minutes to train a neural network.
% \item Make our code and data open source to promote the open science in software engineering.
% This is important since, in the above survey, we found that only three of the above seventeen papers \cite{hoang2019deepjit,Tufano2018,choetkiertikul2018deep} have
% made public the scripts and data used in their work.
% \end{itemize}
Hyperparameter optimization is
the automatic tuning
of many parameters that control the internal settings of the learner and data preprocessors.
The last 15 years have seen a dramatic
increase in the application of machine learning and hyperparameter optimization (HPO)
to software analytics~\cite{menzies2003data,last2003data,moeyersoms2015comprehensible,menzies2018software,kim2016emerging,menzies2006data,robillard2009recommendation,hassan2008road}.
% With modern hardware becoming increasingly powerful and computational power cost decreasing, the use of more computationally expensive models is becoming more feasible. Yet, many deep learners are still too expensive to run, and therefore there is a need for cheaper, better defect predictors.
% Deep learnering can be a very slow process. For example, the smallest ResNet \cite{he2016deep}
% (ResNet50) has 25 million parameters while other versions (e.g. DenseNets~\cite{huang2017densely}) have upwards of 8 million parameters (e.g. the recent GPT-3 model \cite{brown2020language} have over 175 billion parameters). However, the complex architectures used for image and language data are not required for software, where the data is intrinsically simple. Therefore, we train simple deep learning models, with few layers, and only densely connected layers, using recommendations from \cite{montufar2014number}.
% This is an important technique since
% numerous researchers reported that data making and data pre-processing techniques/algorithms/tools are often used in a ``black box'' manner without reflecting on
% the merits of choices associated with a particular tool~\cite{fu2016tuning,Binkley:2018}.
% Such black-box usage is risky since it
% means SE practitioners might be applying the wrong tool. Hence,
% An emerging question in the application of machine learning techniques to software engineering is the
% validity of the assumptions that underlie the creation of those tools and techniques.
% Numerous
% researchers now ask if
% superior results can be obtained by adapting
% machine learning tools to the particulars of software engineering problem~\cite{Binkley:2018,fu2016tuning,agrawal2018better,tantithamthavorn2016automated,Novielli:2018}.
A repeated result is that
research results from an ``off-the-shelf'' learner might be overturned by a second study that
tunes the hyperparameters of that learner~\cite{agrawal2019dodge}. For example, in 2008, Lessmann et al.~\cite{Lessmann08} argued that for defect prediction, the best and worst learning methods were random forests and CART, respectively.
In 2016, Fu et al.~\cite{fu2016tuning} showed that hyperparameter optimization effectively
reverses that conclusion since after tuning,
CART out-performed random forests.
While automatically tuning various hyperparameters is rare in the SE deep learning literature (see Figure \ref{fig:venn}), it has been explored in other domains.
For example, ``AutoML" methods seek the best combination of preprocessors and hyperparameters for a given dataset. These are typically based on Bayesian optimization, as in \cite{feurer2015efficient,thornton2013auto}. However, while Bayesian optimization in practice does find an optimal set of hyperparameters for deep learners, it takes a long time to run. For example, Feurer et al.~\cite{feurer2015efficient} report Auto-Sklearn results after 48 hours of running on a CPU farm.
Yet another branch, called Neural Architecture Search, also typically uses Bayesian optimization. These techniques aim to find the optimal learner architecture. These techniques have achieved a rather high level of sophistication: for example, Liu et al.~\cite{liu2019auto} describe a hierarchical approach for building neural network architectures. Elsken et al.~\cite{elsken2018neural} provide a review of neural architecture search methods. Some of these neural architecture search papers inevitably overlap with hyperparameter tuning \cite{bergstra2013making,domhan2015speeding,saxena2016convolutional,shahriari2015taking,stanley2002evolving}.
Both these approaches share the same concerns: (a)~they have long runtimes; and (b)~some neural architecture search approaches tend to generate overly complex architectures for a problem, which may be overkill for software engineering.
Accordingly, the rest of this paper discusses experiments with DL hyperparameter optimization via our GHOST tool. We prefer GHOST over
other methods since it is simple to implement, it runs quickly, and it scales to larger data sets.
% Hence, using that tool we can show hyperparameter optimization for defect prediction is both effective and fast enough for use in practice.
\begin{table}[!b]
\centering
\scriptsize
\caption{Tuning parameters for DL (used by GHOST)}
\begin{tabularx}{\linewidth}{rp{2.5cm}L}
Preprocessor & Description & Options \\
\midrule
\texttt{StandardScaler} & Transforms the dataset to have a mean 0 and variance 1 & None \\
\rowcolor{blue!10} \texttt{MinMaxScaler} & Squeezes the data to the range $[0,1]$. & None \\
\texttt{Normalizer} & Normalizes samples to a unit norm. & Choice of norm: $l_1$, $l_2$, $l_\infty$ \\
\rowcolor{blue!10} \texttt{NumLayer}s & & 1 .. 5\\
\texttt{UnitsPerLayer} & & 2 .. 20\\
\rowcolor{blue!10} \texttt{Epochs} & number of times data reapplied to the algorithm & 10 .. 30\\
\texttt{Network} & topology & Feed forward\\
\rowcolor{blue!10} \texttt{Activation} & hidden units: & rectified linear units \\
\rowcolor{blue!10} & last layer & sigmoid\\
\end{tabularx}
\label{tab:preprocessors2}
\end{table}
Table~\ref{tab:preprocessors1} and Table~\ref{tab:preprocessors2} list hyperparameters that can control the algorithms that generate
defect predictors. In Table~\ref{tab:preprocessors1}, the options were collated from the hyperparameters
explored in recent SE papers~\cite{ghotra2015revisiting,fu2016tuning,agrawal2018better,agrawal2018wrong} and in the documentation of a widely-used data mining library
(Scikit-learn~\cite{scikit-learn}).
As to Table~\ref{tab:preprocessors2}, these options come from recent papers on deep learning.
After reading the literature looking for a ``standard DL architecture'', we
use standard feed-forward neural networks with ReLU activation functions (rectified linear units; a.k.a. ``hockey sticks'') in the hidden layers and sigmoid activation in the last layer.
As to other details of our deep learners,
Montufar et al.~\cite{montufar2014number} discuss feed-forward neural networks with ReLU activations. They derive bounds on the number of piecewise linear regions constituting the decision boundary that such a network can represent, providing a proof based on topological folding. Following Montufar et al., we use no more than 3 layers with up to 20 units per layer.
One option within deep learning is how many ``epochs'' to apply (one epoch is one pass over all the data to adjust the network weights).
For our experiments, we used ten epochs since, as shown in Figure~\ref{fig:convergence}, we found very little improvement after five epochs
(as an aside, in experiments with up to 256 epochs, we found little improvement over the results shown in Figure~\ref{fig:convergence}).
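For illustration, one way to draw a random candidate configuration from the options of Table~\ref{tab:preprocessors2} is sketched below (a hedged example using scikit-learn pre-processors; the function name and ranges simply mirror the table above and are not the exact GHOST code).
\begin{verbatim}
# Sketch: sample one candidate configuration from the options of
# Table 3 (NumLayers 1..5, UnitsPerLayer 2..20, Epochs 10..30,
# plus a choice of pre-processor). Illustrative only.
import random
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   Normalizer)

def sample_config():
    return {
        "preprocessor": random.choice(
            [StandardScaler(), MinMaxScaler(),
             Normalizer(norm=random.choice(["l1", "l2", "max"]))]),
        "n_layers": random.randint(1, 5),
        "units_per_layer": random.randint(2, 20),
        "epochs": random.randint(10, 30),
    }
\end{verbatim}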
% Based on the results of Santurkar et al. ~\cite{santurkar2018does},
% we do not use batch N=normalization \cite{ioffe2015batch}
% since Santurkar argues th. at this for complex problems, this flattening the loss surfaces (more precisely, it reduces the L-Lipschitzness and the $\beta$-smoothness of the loss surface), making optimization easier for gradient descent.
% In detail, when the number of units in any hidden layer is less than the number of inputs, the lower bound vanishes--this provided us with an upper bound for the number of units per layer (informally, while the network is still certainly capable of learning a complex decision boundary, it is no longer obliged to). To elaborate on this line of reasoning, given the results of their paper, it is trivial to show that an ideal neural network would have several layers, each with the same number of units (which in turn, is the same as the number of inputs)--this can be derived via a greedy algorithm that adds layers $l_i$ with the number of units as the number of inputs (denote this as $n_0$), until the remaining number of units (which can be trivially found from the results of the paper) is less than $2n_0$; at this point, we add one layer with the remaining number of units. While adding more units in the hidden layers certainly adds computational power, the formula presented in the paper suggests that deeper networks (in deep learning literature, ``deeper" denotes more layers) are preferable to ``wider" networks (which in deep learning literature refers to networks with more units in each layer).
\subsubsection{Optimization with DODGE}
\label{sec:dodge}
One key point to observe from those tables is the size of optimization space. If each numeric range is divided into five bins using {\em (max-min)/5}, then Table~\ref{tab:preprocessors1} holds nine binary choices and 18 options with five choices; i.e. \[5^{18}*2^9 \approx 2*10^{15} = 2,000,000,000,000,000\;\mathit{options}\]
This space is too large to be exhaustively explored. Hence
GHOST and
DODGE \cite{agrawal2019dodge} explore this space heuristically.
DODGE's exploration is defined around
a concept called $\epsilon$-domination. Proposed by Deb in 2005~\cite{deb2005evaluating}, $\epsilon$-domination states that there is little point in optimizing two options if their performance differs by less than some small value $\epsilon$.
\begin{figure}
\includegraphics[width=.49\textwidth]{convergence.png}
\caption{DL convergence on defect prediction data sets}
\label{fig:convergence}
\end{figure}
DODGE exploits $\epsilon$-domination as follows. $N$ times do: (a)~generate configurations at random, favoring those with the lowest weight (initially, all options have a weight $w=1$); (b)~if one option generates results within $\epsilon$ of some other, then increase that option's weight (thus decreasing the odds that it will be explored again).
This approach can be a very effective way to explore a large space
of options.
If we assess a defect predictor on $d=2$
dimensions (e.g. recall and false alarm), the output space of that performance divides into $(1/\epsilon)^d$ regions.
To understand the practical implications of this,
we note that
defect predictors can exhibit a large variance
in their output, particularly for data sets like
those in Table \ref{tab:datasets} (where the size of the target class in the training and test
data is so variable).
Assuming $d=2$ and $\epsilon=0.2$ then
our defect predictors have
$(1/0.2)^2=25$ output regions.
Each time we apply
DODGE's tabu search, two of these regions can be removed; i.e., in theory, DODGE could terminate very quickly.
Agrawal et al.~\cite{agrawal2019dodge} found that, for defect prediction,
the optimizations found after $N=30$ repeats of
the above loop performed no worse than those found after $N=1000$ loops. Better yet, the optimizations found that way
out-performed numerous prior results.
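The following is a simplified sketch of that $\epsilon$-domination loop (our own rendering, not the exact DODGE or GHOST code; \texttt{configs} and \texttt{evaluate} are hypothetical inputs): options whose results land within $\epsilon$ of an earlier result have their weight increased, so they are less likely to be sampled again. Here, \texttt{evaluate} would train a learner with the sampled configuration and return the chosen goal metric on a hold-out set.
\begin{verbatim}
# Simplified sketch of the epsilon-domination loop described
# above (not the exact DODGE/GHOST implementation).
import random

def dodge_style_search(configs, evaluate, n=30, epsilon=0.2):
    weights = {i: 1.0 for i in range(len(configs))}  # all start equal
    seen_scores, best = [], (None, -1.0)
    for _ in range(n):
        # favor options with the lowest weight
        lo = min(weights.values())
        pool = [i for i, w in weights.items() if w == lo]
        i = random.choice(pool)
        score = evaluate(configs[i])        # e.g. AUC on a hold-out set
        if any(abs(score - s) < epsilon for s in seen_scores):
            weights[i] += 1.0               # epsilon-dominated: deprecate
        seen_scores.append(score)
        if score > best[1]:
            best = (configs[i], score)
    return best
\end{verbatim}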
%The first two lines of Algorithm \ref{alg:ghost} are standard among all learning tasks.
%Lines 3 and 4 are taken from DODGE: we maintain a set of hyperparameter and preprocessor settings, and assign them an initial weight of 0. This weight is modified in lines 13 and 16 if they lead to similar performance (according to the goal criteria): if the difference in performance is less than $\epsilon$, we do not pursue them further and reduce their weight. The authors of DODGE claim that different values of $\epsilon$ lead to the same result; therefore, we use $\epsilon = 0.2$.
%In lines 5-11, we evaluate $N_1$ random samples among the candidate pool. As suggested by the authors of DODGE, we use the default value $N_1 = 12$. The evaluation consists of training a deep learner with the current hyperparameter choice (lines 6-9), and calling an \texttt{eval} function that runs forward propagation on the test set and evaluates the learner according to the metric of choice. Finally, for $N_2=30$ (default value) evaluations, we mutate the highest weighted hyperparameter set and evaluate it, keeping track of the best set of hyperparameters (which we call $\theta^*$).
% We briefly discuss notation used in Algorithm \ref{alg:ghost} to clarify any confusion: we use $f(\textbf{x}; \theta)$ to denote a neural network $f(\cdot)$, whose inputs are $\textbf{x}$, \textit{parameterized by} the hyperparameter (and preprocessor) set $\theta$. We use $\hat{\textbf{y}}$ to denote the predicted values and $\textbf{y}$ to denote the true values. Finally, we use $\hat{\mathcal{L}}$ to denote our modified loss function.
\section{Experimental Methods}
\label{sec:method}
The rest of this paper
discusses a comparative evaluation
of GHOST with DODGE, standard deep learning, and the results of Wang et al. \cite{wang2018deep}, who use a Deep Belief Network \cite{hinton2009deep}.
\subsection{Experimental Design}
Our experiments compared different defect predictors. For non-DL learners, we used the methods from DODGE \cite{agrawal2019dodge}.
For DL, we adopted a deep learner with 4 layers and 20 units per layer, trained for 10 epochs (following Theorem 5 in Montufar et al.~\cite{montufar2014number}). We chose this based on the results of their paper, which was published in a top machine learning conference (NeurIPS 2014). While some deep learning researchers might be surprised to see that we only trained for 10 epochs, we found experimentally that this was sufficient for AUC and popt20; for recall and false alarm rate, we used 30 epochs.
For deep belief networks (DBN), we used the methods seen in a TSE'18 paper by
Wang et al.~\cite{wang2018deep}.
In that approach,
for each file in the source code, they extract tokens, disregarding ones that do not affect the semantics of the code, such as variable names.
These tokens are vectorized, and given unique numbers, forming a vector of integers for each source file.
These vector--label pairs (where the label marks whether the file is buggy) form the training set, which is input to a Deep Belief Network (DBN), a neural network architecture designed to extract features. The extracted features are used as input to a classical learner, such as Naive Bayes.
We use Wang et al. since (a)~that paper made a convincing case that this approach represented a state-of-the-art result for defect prediction using features extracted from source code; and (b)~in contrast to our approach, their deep learner is used to extract features rather than as the classifier, which makes for a valuable comparison.
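To illustrate the kind of input used by that approach, the token-to-integer encoding can be sketched as follows (our own simplified illustration; Wang et al.'s tokenizer is not public, and the example tokens below are hypothetical).
\begin{verbatim}
# Sketch of the token-to-integer encoding described above
# (a simplified illustration; not the original tokenizer).
def encode_files(token_streams):
    vocab = {}                   # token -> unique integer id
    encoded = []
    for tokens in token_streams:
        vec = []
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 1   # 0 reserved for padding
            vec.append(vocab[tok])
        encoded.append(vec)
    return encoded, vocab

# Example: two "files" reduced to AST-level tokens
files = [["if", "while", "MethodCall"], ["for", "MethodCall"]]
vectors, vocab = encode_files(files)   # [[1, 2, 3], [4, 3]]
\end{verbatim}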
We also explored two variants of defect prediction:
\bi
\item
{\em Within-project defect prediction (WPDP).} Here, models
learned from earlier releases of some project predict properties
of latter releases of that project.
\item
{\em Cross-project defect prediction (CPDP).} Here, models learned
from one project were then applied to another project.
\ei
To compare our results
with other cross-project methods, we used TCA+~\cite{liu2019two} since
that paper reported that TCA+ performed as well, or better, than the prior work.
We compare these against the results of GHOST. On experimenting with weights $w_i\in\{1, 10, 100\}$ (from Equation~\ref{eq:w}), we noticed no improvement over $w_i=1$. Hence, we use that weight to optimize for each of the performance goals
listed in \S\ref{sec:perfs}.
%: AUC, recall, false alarm rate, and popt20.
When comparing against DODGE, we split the data into train and test sets, as shown in Table \ref{tab:datasets}. GHOST used the training sets to find good DL settings, which were then applied to the test suite.
We ran our experiments on two machines:
(a)~one with an Intel Broadwell CPU (6 cores), 39 GB RAM, and an NVIDIA Tesla V100 GPU with 16 GB VRAM; and (b)~one with 15 GB RAM and an NVIDIA Tesla P100 GPU with 16 GB VRAM.
%\footnote{To justify that memory, we note that such high RAM is not necessarily required. We know this since early versions of our code had a memory leak issue, and on fixing it, we found that 10 GB was sufficient.}.
Because of the stochastic nature of deep learning (caused by random weight initialization), a statistically valid study of its merits should consider the distribution of its performance. For this reason, we run GHOST 20 times for each dataset, for each metric. All the results reported (in Table \ref{tab:results} and Table \ref{tab:tan}) are median values of these 20 runs.
\begin{table}
\centering
\caption{Static attributes in our datasets.}
\begin{tabularx}{\linewidth}{lL}
\toprule
Attribute & Description (with respect to a class) \\
\midrule
wmc & Weighted methods per class \cite{chidamber1994metrics} \\
\rowcolor{blue!10} dit & Depth of inheritance tree \cite{chidamber1994metrics} \\
noc & Number of children \cite{chidamber1994metrics} \\
\rowcolor{blue!10} cbo & Coupling between objects \cite{chidamber1994metrics} \\
rfc & Response for a class \cite{chidamber1994metrics} \\
\rowcolor{blue!10} lcom & Lack of cohesion in methods \cite{chidamber1994metrics} \\
lcom3 & Another lcom metric proposed by Henderson-Sellers \cite{henderson1996coupling} \\
\rowcolor{blue!10} npm & Number of public methods \cite{bansiya2002hierarchical} \\
loc & Number of lines of code \cite{bansiya2002hierarchical} \\
\rowcolor{blue!10} dam & Fraction of private or protected attributes \cite{bansiya2002hierarchical} \\
moa & Number of fields that are user-defined types \cite{bansiya2002hierarchical} \\
\rowcolor{blue!10} mfa & Fraction of accessible methods that are inherited \cite{bansiya2002hierarchical} \\
cam & Cohesion among methods of a class based on parameter list \cite{bansiya2002hierarchical} \\
\rowcolor{blue!10} ic & Inheritance coupling \cite{tang1999empirical} \\
cbm & Coupling between methods \cite{tang1999empirical} \\
\rowcolor{blue!10} amc & Average method complexity \cite{tang1999empirical} \\
ca & Number of classes depending on a class \cite{tang1999empirical} \\
\rowcolor{blue!10} ce & Number of classes a class depends on \cite{tang1999empirical} \\
max\_cc & Maximum McCabe's cyclomatic complexity score of methods \cite{mccabe1976complexity} \\
\rowcolor{blue!10} avg\_cc & Mean of McCabe's cyclomatic complexity score of methods \cite{mccabe1976complexity}
\end{tabularx}
\label{tab:attributes}
\end{table}
\begin{table}
\centering
\caption{Evaluated software projects for comparing with DODGE \cite{agrawal2019dodge}}
\begin{tabularx}{\linewidth}{lllLL}
\toprule
Project & Train versions & Test versions & Training Buggy \% & Test Buggy \% \\
\midrule
ivy & 1.1, 1.4 & 2.0 & 22 & 11 \\
\rowcolor{blue!10} lucene & 2.0, 2.2 & 2.4 & 53 & 60 \\
poi & 1.5, 2.0, 2.5 & 3.0 & 46 & 65 \\
\rowcolor{blue!10}synapse & 1.0, 1.1 & 1.2 & 20 & 34 \\
velocity & 1.4, 1.5 & 1.6 & 71 & 34 \\
\rowcolor{blue!10}camel & 1.0, 1.2, 1.4 & 1.6 & 21 & 19 \\
jEdit & 3.2, 4.0, 4.1, 4.2 & 4.3 & 23 & 2 \\
\rowcolor{blue!10}log4j & 1.0, 1.1 & 1.2 & 29 & 92 \\
xalan & 2.4, 2.5, 2.6 & 2.7 & 38 & 99 \\
\rowcolor{blue!10}xerces & 1.2, 1.3 & 1.4 & 16 & 74 \\
\bottomrule \\
\end{tabularx}
\label{tab:datasets}
\end{table}
\subsection{Data}
For this study, we use the same data used in DODGE's prior study~\cite{agrawal2019dodge} on defect prediction: see Table \ref{tab:datasets}. These are all open source Java projects from the PROMISE repository \cite{Sayyad-Shirabad+Menzies:2005}. Each dataset has 20 static attributes for each software project, described in Table \ref{tab:attributes}.
% In our subsequent discussion, the
% following feature of Table \ref{tab:datasets} will become very important. In this table, the class balance between defective and non-defective modules is highly variable between our train and test sets:
% \bi
% \item For example, in velocity and jedit, the percent of defect modules decreases more than half between train and test sets;
% \item The reverse pattern is seen in the train/test sets of other data sets.
% For example, the percentage of defect modules
% increases by (approx) 50\% (for poi and synapse) and by over 300\% (for log4j,xalan, xerces).
% \ei
% This feature will motivated the design of the GHOST method, described later in this paper.
\subsection{Performance Metrics}\label{sec:perfs}
For this study, we use performance metrics widely used in
the defect prediction literature.
Specifically, we evaluate the performance of our learners using four metrics: Area Under the ROC Curve (AUC) with false positive rate on the x-axis and true positive rate on the y-axis, recall, false alarm rate, and popt20. We use these measures to compare against the baseline of Agrawal et al.~\cite{agrawal2019dodge}.
\BLUE
A key benefit of using multiple metrics, aside from a more complete view of the model's performance, is that it allows us to more concretely answer RQ4, i.e., whether deep learning works for all situations. Since some metrics are more important for certain stakeholders (e.g., popt20 comments on the effort needed after the defect predictor is run; others may prefer a combination of high recalls and low false alarm rates, i.e., d2h), this exploration allows us to point out the cases where deep learning may not be the most suitable option.
\respto{3a2.3}
\BLACK
Recall measures the number of positive examples that were correctly classified by a learner:
\[
\text{recall} = \mathit{TP}\;/\;(\mathit{TP}+\mathit{FN})
\]
where TP is the number of true positives and FN is the number of false negatives.
The false alarm rate is the fraction of instances that a classifier incorrectly classifies as positive examples when they are, in fact, not. Often, the false alarm rates observed in these experiments are very
small. Specifically, in our experiments, many of the false alarm results were 0. Hence, we will use {\em pf}, defined as follows:
\[
\text{pf} = \mathit{FP}\;/\;(\mathit{FP}+\mathit{TN}) %= %\frac{\text{TN}}{\text{FP}+\text{TN}}
\]
where FP is the number of false positives and TN is the number of true negatives.
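For concreteness, a small worked example of these two measures (our own illustration; the counts are hypothetical) is:
\begin{verbatim}
# Recall and false alarm rate (pf) from confusion-matrix counts,
# as defined above. Illustrative sketch with made-up numbers.
def recall(tp, fn):
    return tp / (tp + fn)

def pf(fp, tn):
    return fp / (fp + tn)

# e.g. 40 true positives, 10 false negatives,
#      5 false positives, 95 true negatives:
print(recall(40, 10))   # 0.8
print(pf(5, 95))        # 0.05
\end{verbatim}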
The next metric, popt20, comments on the inspection effort required after a defect predictor is run. To calculate popt20, we list the defective modules before the non-defective modules, sorting both groups in ascending order of lines of code. Charting on the y-axis the percentage of defects that would be recalled if we traverse the code sorted in this manner, we report the value at the 20\% point.
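The sketch below shows one reading of that definition (our own illustration, treating the 20\% point as 20\% of the total lines of code and ordering modules by the learner's predictions; the exact implementation is in our reproduction package).
\begin{verbatim}
# Sketch of popt20 under one reading of the definition above:
# rank (predicted) defective modules first, each group in ascending
# order of LOC, then report the fraction of actual defects found
# within the first 20% of the total lines of code.
def popt20(modules):
    # modules: list of (loc, predicted_defective, actual_defective)
    ranked = sorted(modules, key=lambda m: (not m[1], m[0]))
    budget = 0.20 * sum(m[0] for m in modules)
    total_bugs = sum(m[2] for m in modules) or 1
    seen_loc, found = 0, 0
    for loc, _, actual in ranked:
        if seen_loc + loc > budget:
            break
        seen_loc += loc
        found += actual
    return found / total_bugs
\end{verbatim}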
Finally, we report the area under the receiver operating characteristics (ROC) curve (AUC), with false positive rate on the x-axis and true positive rate on the y-axis.
Note
that, with one exception, {\em larger} values are {\em better} for these performance measures since:
\begin{itemize}
\item
As recall increases,
more defective modules are seen.
\item
As popt20 increases, the learner recommends that the developer inspect smaller parts of the code (to find bugs).
That is to say, the learner is decreasing the work load of the developer.
\item
As AUC increases, our learners are finding more true positives while raising fewer false alarms.
\end{itemize}
The one exception is pf where {\em smaller} values are {\em better} since,
as pf decreases,
there are fewer cases where we waste a developer's time showing them modules which are not really defective.
\subsection{Impact of Performance Metrics}
The original DODGE paper evaluated its models using metrics that were subtly different from the above.
For example,
that study did not report recalls and false alarms.
When we checked DODGE's false alarms and recalls, we found that the results depended on the optimization goals; e.g. (a)~optimizing
for recall leads to higher recalls but nearly 100\% false alarms;
(b)~optimizing for false alarms leads to near zero false alarms, but also near zero recalls. Hence,
to ensure low false alarms and high recalls, our optimizers guide their search by trying to maximize the harmonic mean of recall $r$ and precision $p$; i.e.
\[ f_1 = 2rp\;/\;(r+p)\]
The results of that search were then assessed via the performance measures listed in the last section.
To simplify DODGE's search (and improve its performance scores), we applied a feature selection algorithm to the data sets prior to optimization. Feature selectors
prune superfluous attributes (i.e., those not associated with the number of defects).
Following the advice of Hall and Holmes~\cite{holmes03}, we used Hall's own CFS selector~\cite{hall00}.
We also improved on the nine data pre-processors used by DODGE. To address issues of class imbalance, the SMOTE tool~\cite{Chawla02} sub-samples the majority class while generating artificial examples of the minority class.
In summary, when we say ``DODGE'' below, we mean the original DODGE, optimizing for a specific metric, or maximizing F1 (in the case of recall and false alarm rate), after (a)~feature selection with CFS and (b)~SMOTE.
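The following is a minimal sketch of that pre-processing step, shown only for illustration. It assumes the \texttt{imbalanced-learn} package for the SMOTE-style rebalancing; \texttt{select\_cfs} is a hypothetical stand-in for Hall's CFS selector, which we do not reproduce here.
\begin{verbatim}
# Illustrative sketch, not our exact implementation.  Assumptions:
# the imbalanced-learn package is installed; select_cfs() is a
# hypothetical stand-in for Hall's CFS feature selector.
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline

def preprocess(X_train, y_train):
    # (a) CFS: prune attributes not associated with defects.
    keep = select_cfs(X_train, y_train)          # hypothetical helper
    X_train = X_train[:, keep]
    # (b) SMOTE-style rebalancing: synthesize minority-class examples,
    #     then sub-sample the majority class.
    sampler = Pipeline([("over",  SMOTE(random_state=1)),
                        ("under", RandomUnderSampler(random_state=1))])
    X_train, y_train = sampler.fit_resample(X_train, y_train)
    return X_train, y_train, keep
\end{verbatim}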
\subsection{Statistics}
\label{sec:stats}
The statistical methods used in this paper were selected according to the particulars of our experiments.
For example, in the {\em within-company experiment}s, we are using learners that employ stochastic search. When testing such algorithms, it is standard~\cite{arcuri11} to repeat those runs 20 times with different random number
seeds. Such experiments generate a distribution of 20 results per learner per data set.
For those experiments, we use
the {\em distribution statistics} of \S\ref{sec:dist}
to compare the efficacy
of different learners.
For the cross-company experiments, we are comparing our results against the methods of Wang et al.~\cite{wang2016automatically}.
As mentioned in \S\ref{sec:dlse}, we do not have access to their implementations, but we do have access to the train and test sets they use. For
these experiments, we must compare our results to the single performance points mentioned in the Wang et al. paper~\cite{wang2016automatically}. Hence, here we train our learners once on their train data, then test once on their test data.
Such experiments generated a single result per learner per data set.
For those experiments, we use the {\em point statistics}
of \S\ref{sec:point}
to compare the efficacy
of different learners.
\subsubsection{Distribution statistics}\label{sec:dist}
Distribution statistics \cite{arcuri13parameterto, ghotra2015revisiting} are used to distinguish two distributions of data. In our experimental setup, we run GHOST and DODGE 20 times, and therefore have a distribution of results for each dataset. This allows us to use distribution statistical methods to compare results.
Our comparison method of choice is the Scott-Knott test, which was endorsed at TSE'13 \cite{mittas2012ranking} and ICSE'15 \cite{ghotra2015revisiting}. The Scott-Knott test is a recursive bi-clustering algorithm that terminates when the difference between the two split groups is insignificant. Scott-Knott searches for split points that maximize the expected value of the difference between the means of the two resulting groups. Specifically, if a group $l$ is split into groups $m$ and $n$, Scott-Knott searches for the split point that maximizes
\[
\mathbb{E}[\Delta] = \frac{|m|}{|l|}\left( \mathbb{E}[m] - \mathbb{E}[l] \right)^2 + \frac{|n|}{|l|}\left( \mathbb{E}[n] - \mathbb{E}[l] \right)^2
\]
where $|m|$ represents the size of the group $m$.
The result of the Scott-Knott test is a \textit{rank} assigned to each result set; the higher the rank, the better the result. Scott-Knott ranks two results the same if the difference between the distributions is insignificant.
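To illustrate the split search at the heart of Scott-Knott, the following sketch finds the split point that maximizes the expected difference defined above. It is illustrative only; the significance test used to accept or reject a split (and the recursion on each half) is omitted.
\begin{verbatim}
# Illustrative sketch of the Scott-Knott split search.  groups is a
# list of 1-d arrays of performance scores, sorted by their means;
# we return the split index k that maximizes E[delta].
import numpy as np

def best_split(groups):
    all_vals = np.concatenate(groups)
    mu_l, n_l = all_vals.mean(), len(all_vals)
    best, best_k = -1.0, None
    for k in range(1, len(groups)):
        m = np.concatenate(groups[:k])
        n = np.concatenate(groups[k:])
        e_delta = (len(m) / n_l) * (m.mean() - mu_l) ** 2 \
                + (len(n) / n_l) * (n.mean() - mu_l) ** 2
        if e_delta > best:
            best, best_k = e_delta, k
    return best_k, best
\end{verbatim}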
\subsubsection{Point statistics}\label{sec:point}
% ,
% Point statistics are used when point samples of two results are available. When comparing against the baselines of Wang et al. \cite{wang2016automatically}, because we do not have 20 runs of their algorithm (because of the proprietary nature of their code and extracted features--see Section \ref{sec:dlse}), we choose to use point statistics to check if one result is better than another.
% For all our results, we report the median over 20 runs. When comparing between different
For point statistics, we have access to various performance points (e.g. recall) across multiple data sets.
To determine if one point is better than another, we have to define a delta $\Delta$
below which we declare two points are the same.
To that end, we
use recommendations from
Rosenthal~\cite{rosenthal1994parametric}
(as of June 2020, this paper has 2713 citations in Google Scholar). Rosenthal comments that for point statistics,
parametric methods have more statistical power (to distinguish groups) than nonparametric methods. Further, within the parametric methods, there are two families of methods:
those that use the
``$r$'' Pearson product moment correlation;
and
those that use some ``$d$'' normalized difference between the means.
After making all those theoretical points, Rosenthal goes on to remark that neither family is intrinsically better than the other.
Using Rosenthal's advice, we apply the most straightforward method endorsed by that research. Specifically, we compare differences in treatment performance using Cohen's delta, which is computed as 30\% of the standard deviation of all the values:
\begin{equation}\label{se:cohen}
\Delta = 0.3*\sigma
\end{equation}
(Here, $\sigma$ is the standard deviation of all the, e.g., recall measures seen across all the data sets in the cross-company experiments.)
When two methods differ by less than $\Delta$, we say that they {\em perform equivalently}.
Otherwise, we say that one method {\em out-performs} the other if the difference in their performance is larger than
$\Delta$.
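The following sketch shows one plausible reading of that comparison procedure (illustrative only; for pf, where smaller values are better, the win/loss labels would be swapped).
\begin{verbatim}
# Illustrative sketch of the point-statistics comparison: two methods
# tie on a data set if their scores differ by less than 30% of the
# standard deviation of all scores pooled across both methods.
import numpy as np

def compare(scores_a, scores_b):
    # scores_a[i], scores_b[i]: the score (e.g. recall) of method A
    # and method B on data set i.
    delta = 0.3 * np.std(np.concatenate([scores_a, scores_b]))
    wins = ties = losses = 0
    for a, b in zip(scores_a, scores_b):
        if abs(a - b) < delta:
            ties += 1
        elif a > b:
            wins += 1
        else:
            losses += 1
    return wins, ties, losses
\end{verbatim}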
% S
% We run our hyperparameter optimization algorithm of choice, DODGE \cite{agrawal2019dodge}, for 40 iterations. In practice, when we ran for 100 iterations, we found no difference between the results after 40 and 100 iterations; therefore, all results reported are from DODGE. We use DODGE since this optimizer has been tested specifically in the software engineering domain.
% \begin{figure}[!t]
% \centering
% \includegraphics[width=\linewidth]{approach.png}
% \caption{An overview of our approach. The blue boxes are various stages in the model development lifecycle. The white boxes inside are classes of hyperparameters to be tuned.}
% \label{fig:approach}
% \end{figure}
% Figure \ref{fig:approach} offers an overview of our system. Raw data is preprocessed by a variety of methods, chosen by the hyperparameter optimizer (this is part of the search space). The result of the preprocessor(s), we call ``cooked" data. This is the data used by the deep learner, whose hyperparameters also we tune. Note that we use DODGE to tune both discrete and continuous hyperparameters. At the end is a trained model equipped with an optimized set of hyperparameters.
% We ran DODGE over three sets of hyperparameters: the preprocessors used (as discussed in Table \ref{tab:preprocessors}), the number of layers in the neural network, and the number of units per layer. The number of layers was constrained between $[1, 4)$, and the number of units per layer (each layer having the same number of units) was constrained to $[2, 20]$.
% Our choice for these bounds was not random: specifically, we used the results of Montufar et al. ~\cite{montufar2014number}; briefly, they discuss, for feedforward neural networks with ReLU activations (as we have used), bounds for the number of piecewise linear regions constituting a decision boundary that can be represented by a deep learner, providing a proof based on topological folding. In detail, when the number of units in any hidden layer is less than the number of inputs, the lower bound vanishes--this provided us with an upper bound for the number of units per layer (informally, while the network is still certainly capable of learning a complex decision boundary, it is no longer obliged to). To elaborate on this line of reasoning, given the results of their paper, it is trivial to show that an ideal neural network would have several layers, each with the same number of units (which in turn, is the same as the number of inputs)--this can be derived via a greedy algorithm that adds layers $l_i$ with the number of units as the number of inputs (denote this as $n_0$), until the remaining number of units (which can be trivially found from the results of the paper) is less than $2n_0$; at this point, we add one layer with the remaining number of units. While adding more units in the hidden layers certainly adds computational power, the formula presented in the paper suggests that deeper networks (in deep learning literature, ``deeper" denotes more layers) are preferable to ``wider" networks (which in deep learning literature refers to networks with more units in each layer). We made an assumption that no more than 3 layers with up to 20 units each would be required based on the formula presented in Montufar et al. ~\cite{montufar2014number}.
% \subsection{Neural network architectures}
% Beyond this, we do not use other types of layers, since our approach is based on the kind of networks discussed by Montufar et al. ~\cite{montufar2014number}. Additionally, we voluntarily do not use Batch Normalization \cite{ioffe2015batch} based on the results of Santurkar et al. ~\cite{santurkar2018does}, who comment that batch normalization helps by flattening the loss surfaces (more precisely, it reduces the L-Lipschitzness and the $\beta$-smoothness of the loss surface), making optimization easier for gradient descent. We hypothesize that with our simple neural networks where the number of parameters is orders of magnitude less than the ones examined by them, that help is not strictly required, and simply using better optimization algorithms should suffice. We use the Adam \cite{kingma2014adam} optimizer in our experiments.
% Experimentally, we found that no more than 80 epochs were required for convergence. We test convergence by training using Early Stopping, which stops training if the validation loss does not improve after 5 epochs. The results of this test are shown in Figure \ref{fig:convergence}. However, our experiments use only 10 epochs. While we could have certainly gotten better results by running our learners for 80 epochs (and this would not have taken significantly longer), we wanted to demonstrate clearly that deep learners can achieve competitive performance very quickly, when used right and with proper tuning and choice of architectures (based off proven reasoning).
\subsection{Optimization with GHOST \respto{3a2.1}}
The lesson of DODGE is that the particulars
of Table~\ref{tab:preprocessors1} can be
less important than how we explore combinations
of learning methods. So can we do more with
that insight? Is there some process, more
direct than DODGE, that explores
how to divide data
into regions that (e.g.) do or do not
predict for defects?
This section proposes GHOST as an
answer to those questions. To understand
GHOST, we first take a glance at our defect prediction data. Table~\ref{tab:attributes} shows
the static code attributes seen in our data
and Table~\ref{tab:datasets} shows what releases
we use for our training and test sets when comparing with DODGE\footnote{These releases were selected as follows: in the available data, use recent releases to predict for defects in the latest release.} (when comparing against our other baseline by Wang et al. \cite{wang2018deep}, we use the same train and test release versions as they do). A curious feature of Table~\ref{tab:datasets} is the wide variability in the percent of defects seen in the training and test sets. For example, the defect ratios in:
\bi
\item velocity's train:test defect ratios decrease from 71\% to 34\%;
\item jEdit's train:test defect ratios decrease from 23\% to 2\%;
\item xerces' train:test defect ratios increase from 16\% to 74\%;
\item log4j's train:test defect ratios increase from 29\% to 92\%;
\item xalan's train:test defect ratios increase from 38\% to 99\%.
\ei
Such massive changes in the target class ratios mean
that the geometry of the hyperspace boundary
between different classifications
can alter dramatically between train and test.
Therefore, the ``appropriate
learner'' for
this kind of data is one that
works well for a wide range
of class distribution
ratios.
Such an appropriate learner
knows how to exploit
``hints'' from the hyperspace boundary in the train
set, then apply those hints to the test set.
GHOST was implemented as an experiment
to see if the {\em loss functions} of neural networks can learn such ``hints''.
What the rest of this paper shows is that
such a loss function
(plus some
fuzzy oversampling, described below) is an effective
method for handling defect prediction.
\subsubsection{Loss Functions}
\label{sec:loss}
In neural networks, the loss function measures the prediction error of the network at each epoch of training. The loss is used to calculate the gradients that update the weights of the neural net for the next epoch of learning.
Loss can be calculated in various ways\footnote{
mean squared error,
binary crossentropy,
categorical crossentropy,
sparse categorical crossentropy, etc. } but the key point to note here is that, usually, the loss function is fixed prior to learning.
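To make that point concrete, the sketch below (using the Keras API; the architecture and epoch count are illustrative, not the exact settings tuned in our experiments) selects binary cross-entropy once, at compile time; every subsequent epoch descends the gradient of that same fixed function.
\begin{verbatim}
# Illustrative Keras sketch: the loss is chosen once, at compile time,
# and every epoch of training descends the gradient of that same
# function.  (Architecture and epoch count here are illustrative only.)
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(20,)),                    # e.g. 20 static code attributes
    keras.layers.Dense(20, activation="relu"),
    keras.layers.Dense(20, activation="relu"),
    keras.layers.Dense(1,  activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",        # fixed before training starts
              metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=10)         # gradients of the same loss
\end{verbatim}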