Security weaknesses in web applications deployed in cloud architectures can seriously compromise data confidentiality and integrity. Static analysis tools for source code security are built on different detection procedures, so each tool finds a different number of instances of each weakness type it is designed for. To exploit the possible synergies among static analysis tools, this work introduces a new method to combine several of them, aiming to increase the performance of security weakness detection while reducing the number of false positives. Specifically, five static analysis tools are combined with the designed method to study their behavior using an updated benchmark for the OWASP Top Ten Security Weaknesses (OWASP TTSW). The method selects specific metrics to rank the tools for different criticality levels of web applications, considering different weights in the ratios. The findings show that simply including more tools in a combination does not guarantee better results; the outcome depends on the specific tools included in the combination, owing to their different designs and techniques.
Nowadays, organizations and companies use Web Applications (WA) in cloud infrastructures to manage their data from anywhere, over the Internet or intranets. Because a WA can be accessed from anywhere, it can also be attacked from anywhere, so it is vital to focus on implementing its security. A WA can include security weaknesses in its source code that affect not only the application but also the server that hosts it, the operating system, and even the cloud infrastructure itself. Therefore, developing new methods to investigate how to preventively eliminate as many weaknesses as possible in WA source code from the beginning of development is a priority.
OWASP Top Ten Project (OWASP TTP) [
Many studies indicate that very good ratios of True Positives (TP) and False Positives (FP) may be achieved by SASTT [
We list some concerns to analyze:
- The effectiveness of n-SASTT combinations in finding OWASP TTSW.
- The average effectiveness of n-SASTT combinations in finding OWASP TTSW, taking into account different metrics.
- The effectiveness of each SASTT in finding OWASP TTSW within each n-tool combination, taking into account different metrics.
- The optimum way to study the security of WA at various criticality degrees.
- The suitability of the OWASP Top Ten Benchmark (OWASP TTB) for comparing SASTT.
The main motivation for this work is the idea that combining different SASTTs can be very beneficial in improving the security of the source code of a web application. A proper combination of SASTTs can find more security vulnerabilities (TPs) while yielding fewer false alarms (FPs). Therefore, comparative work on SASTT combinations is needed as a reference so that auditors and analysts can select the best combinations. Besides, auditors and analysts need to choose the best SASTT combinations taking into account which ones are more adequate for different criticality levels. To formalize the use of SASTT to determine the maximum number of weaknesses and later patch them efficiently, it is necessary to introduce the Secure Software Development Life Cycle (SSDLC) given by Vicente et al. in [
Next, we present our innovations. The first objective is to find out the behaviour of combinations of n SASTT: four commercial tools (Coverity, Klocwork, Fortify SCA and Xanitizer) and one open-source tool (FindSecurityBugs), using a new specific methodology. Combining several tools can improve the overall results, but choosing the optimum tools to analyze the security of a WA is not a simple task. This study investigates how to define a repeatable method to combine several SASTT to achieve better results in terms of the true positive ratio (TPR) and false positive ratio (FPR). The way of combining tools proposed by the method is novel and uses a testbed application (with 669 test cases for TP and FP) specifically designed for the weakness classes of the OWASP TTP, which is widely accepted by the community. The method examines the effectiveness in combination of four relevant commercial SASTT plus an open-source one. It also makes it possible to evaluate how they perform against a wide range of OWASP TTSW weaknesses not covered by other works. We specifically investigate their performance by choosing proper metrics.
The second aim of this work is to pick the most appropriate combinations of tools, taking into account various levels of security criticality. As mentioned above, a security analysis with a SASTT must necessarily include a critical review of the results to eliminate possible FP, but sometimes auditors do not have enough time. In the most critical applications, it is necessary to find the highest number of TP, and the number of FP matters less because there is time to eliminate them. In less critical applications there is not as much time to eliminate FP, so tools are needed that yield fewer FP while keeping a good TPR. The metrics used make it possible to distinguish which combinations of tools are more suitable for auditing the security of WA at various criticality levels.
The tools are used against a new benchmark updated from its first use [
Hence, the findings of the work are:
- A specific approach using a concrete benchmark relying on OWASP TTSW [
- A categorization of the results of SASTT in combination, using a method that ranks them for various degrees of WA significance and criticality.
- A study of the results of leading commercial SASTT in combination, allowing researchers to pick the best tools to carry out an audit of a WA.
The structure of this work is as follows: Section 2 gives background on web technologies security, with emphasis on weaknesses, SASTT and related work. Section 3 presents the proposed assessment approach, with the steps followed to rank the SASTT in combination using the selected benchmark. Section 4 collects the findings, and Section 5 proposes future research.
Background is given on web technologies security, benchmarking initiatives and SASTT.
The advancement of WA for companies and business related
The .NET framework with languages such as C#, along with PHP, Visual Basic, NodeJs and Java, are the most chosen today. Java is the most used language, according to several studies [
One form of security prevention in relation to the development of WA source code is for practitioners to have knowledge in secure source code development [
The OWASP TTP puts together the most interesting security weakness classes and there are several studies that show that WA tested failed the OWASP TTP [
A sound technique to prevent security weaknesses in source code is avoidance [
SASTTs are a type of security tool that analyzes the entire source code of a web application in several steps. First, they compile the source code and, from the parse tree, transform it into a model that is checked against the specific rules or models of each security weakness. In summary, a SASTT successively performs lexical, syntactic and semantic analyses, intra-procedural (local) analysis of each function, and inter-procedural analysis between the different functions of the source code. SASTT provide a clear security analysis, and they analyze both source code and object code, as necessary. SASTT start from an undecidable problem: determining whether a program attains its final state or not [
A final audit of each weakness included in a SASTT report is needed to reduce the FP and locate the FN (which is more difficult). Security auditors need to learn how to recognize all types of weaknesses in the source code of a particular programming language [
The works [
Nguyen et al. [
Muske et al. [
A comparison of 9 SASTT for the C language categorizes them by appropriate metrics using the SAMATE test suites for C. This comparison includes several leading commercial tools, which is important for the security analyst selecting the best tool to audit source code security [
The work of Ye et al. [
Another study shows a conceptual, performance-based ranking framework that prioritizes the output of several SASTT, to enhance the tool effectiveness and usefulness [
Another paper [
Ferrara et al. [
The work of Flynn et al. [
The work of Vasallo et al. [
In the work [
In the work [
Nunes et al. [
The main conclusion of the related work is that the existing comparatives do not include an adequate number of leading commercial tools, and the benchmarks used are not representative with respect to the OWASP TTP. It is very important that the benchmark includes the most frequent and dangerous security weaknesses in each weakness class. However, given the significant cost of commercial tools, this study deals with five static analysis tools (four of them commercial) through a new approach proposal with a new representative benchmark constructed for the weakness classes included in the OWASP TTP [
A repeatable approach is developed to compare and rank the SASTT.
1. Choose the constructed OWASP TTB.
2. Choose the SASTT. We select five commercial and open-source SASTT according to the analysis of the corresponding works in Section 2.3 and official lists of SASTT, and run the selected SASTT against the OWASP TTB designed in [
3. Choose appropriate metrics to scrutinize the results based on three different levels of WA criticality.
4. Calculate the metrics.
5. Discuss, analyze and rank the results.
A proper test bench must be portable, credible, representative, require minimal modifications, and be easy to implement and run, and the tools must be executed under the same conditions [
Taking into account statistics of security weaknesses reported by several works [
Each case includes a function named bad() designed with a concrete weakness, having an input source that is not validated or verified (badsource) and an unverified code line where the weakness appears (badsink). See
Weakness classes and types by category | TP test cases | FP test cases |
---|---|---|
Injection | 84 | 218 |
Broken_authentication_and_Sessions | 24 | 52 |
Sensitive_Data_Exposure | 12 | 26 |
Broken_Access_Control | 25 | 44 |
Security_Misconfiguration | 6 | 9 |
Cross_Site_Scripting | 32 | 62 |
Using_Components_with_Known_Weaknesses | 6 | 11 |
Cross_Site_Request_Forgery | 7 | 22 |
Redirects_not_validated | 11 | 18 |
N° Test cases | 207 | 462 |
Injection weakness class includes weakness types such as SQL Injection, LDAP Injection, Access Through SQL Primary, Command Injection, HTTP Response Splitting and Unsafe Treatment XPath Input. Broken Authentication and Sessions weakness class includes weakness types such as Hard Coded Passwords, Plaintext Storage in a Cookie, Using Referer Field for Authentication, Insufficient Session Expiration and others. Sensitive Data Exposure weakness class includes weakness types such as Information Leak error, Leftover Debug Code, Info Leak by Comment and others. Broken Access Control weakness class includes weakness types such as Relative Path Traversal, Absolute Path Traversal, Unsynchronized Shared Data TOCTOU and others. Security Misconfiguration weakness class includes weakness types such as Reversible One Way Hash, Insufficiently Random Values and Same Seed in PRNG. Using Components with Known Vulnerabilities includes weakness types such as Use Broken Crypto, Weak PRNG and Predictable Salt One Way Hash.
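As an illustration of the test-case pattern described earlier (a bad() function with a badsource and a badsink, plus a safe counterpart used to count FP), the following Java sketch shows an Injection case. The class and identifier names are ours, not the actual benchmark code:

```java
// Illustrative sketch of an OWASP TTB test case (hypothetical names).
// Each case pairs an unsafe bad() function with a safe counterpart so
// a tool can be scored both for TP (alarm on bad) and FP (alarm on good).
public class SqlInjectionCase {

    // bad(): tainted input (badsource) flows unvalidated into the
    // query string (badsink) -> alarms here count as true positives.
    public static String bad(String userInput) {
        String tainted = userInput; // badsource: no validation of the input
        // badsink: unverified concatenation into the SQL statement
        return "SELECT * FROM users WHERE name = '" + tainted + "'";
    }

    // good(): the same flow with the input neutralized, so any alarm
    // raised here counts as a false positive.
    public static String good(String userInput) {
        String safe = userInput.replace("'", "''"); // input neutralized
        return "SELECT * FROM users WHERE name = '" + safe + "'";
    }
}
```

In the real benchmark each weakness type has several such cases with different source inputs and flow complexities, which is why the TP and FP test-case counts in the table differ per class.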
SASTT are chosen with respect to J2EE, one of the most popular technologies in web development; the programming language used by J2EE, Java, is considered one of the more secure [
- Fortify SCA (commercial) supports 18 different languages and the best-known OS platforms, provides SaaS (Software as a Service), and finds more than 479 weaknesses.
- Coverity (commercial) supports numerous languages, such as JavaScript, HTML5, C/C++, Java, C#, TypeScript and others.
- Xanitizer (commercial) supports only the Java language, but it assists auditors in sanitizing the input variables in source code.
- FindSecurityBugs (open source) provides plugins for IntelliJ, Android Studio, SonarQube, Eclipse, and NetBeans; command-line integration is possible with Ant and Maven.
- Klocwork (commercial) supports the C, C++, C#, and Java languages and offers compliance with the OWASP Top Ten project and others.
The metrics have been selected taking into account the related works of the state of the art investigated [

Precision (1). Proportion of the total TP findings penalized by the number of FP:

Precision = TP / (TP + FP) (1)

True positive rate/Recall (2). Ratio of detected weaknesses to the number that really appear in the code:

TPR = Recall = TP / (TP + FN) (2)

False positive rate (3). Ratio of false alarms for weaknesses that do not really appear in the code:

FPR = FP / (FP + TN) (3)

F-measure (4). Harmonic mean of precision and recall:

F-measure = 2 · Precision · Recall / (Precision + Recall) (4)

The Fβ-score generalizes the F-measure with a weight β that rewards recall when β > 1 and precision when β < 1; this work uses the F0,5-score (β = 0,5) and the F1,5-score (β = 1,5):

Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall)
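As a minimal sketch (class and method names are ours), these metrics can be computed from the raw tool counts as follows:

```java
// Sketch of the selected metrics over raw counts from a tool report.
public final class Metrics {
    // Proportion of alarms that are real weaknesses.
    public static double precision(int tp, int fp) { return tp / (double) (tp + fp); }
    // Proportion of real weaknesses that were detected.
    public static double recall(int tp, int fn)    { return tp / (double) (tp + fn); }
    // Proportion of safe cases that raised a false alarm.
    public static double fpr(int fp, int tn)       { return fp / (double) (fp + tn); }
    // F-beta: beta = 1 is the harmonic mean (F-measure),
    // beta = 0.5 rewards precision, beta = 1.5 rewards recall.
    public static double fBeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        return (1 + b2) * precision * recall / (b2 * precision + recall);
    }
}
```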
Once the SASTT and metrics have been chosen, the SASTT are executed against the OWASP TTB and we obtain the TP and FP results for each kind of weakness. Next, the metrics chosen in Section 3.3 are used to find an adequate explanation of the results and to obtain the conclusions.
To determine all metrics, a program in C was developed to process the results of each tool. After executing each tool against the OWASP TTB, the findings are carefully analyzed and formatted in a file that the C program processes to obtain the chosen metrics.
The 1-out-of-N (1ooN) strategy is used for merging the findings of the SASTT. The technique suggested to obtain the merged results for two or more tools relies on several automated steps. 1ooN in SASTT combinations for TP detection: any TP detection (alarm) from any of the n SASTT in a bad function of a test case leads to an alarm for the 1ooN system. 1ooN in SASTT combinations for FP detection: any TN (non-alarm) from any of the n SASTT in a good function of a test case leads to a TN in the 1ooN system only if the same tool detected a TP in the bad function of the same test case. If the same tool did not detect a TP, it either does not properly detect this weakness or is not designed to detect it (see
| | Tool A | | Tool N | N-SASTT |
|---|---|---|---|---|
| Positive cases (P) (bad functions) | TP | or | TP | TP |
| | TP | or | FN | TP |
| | FN | or | TP | TP |
| | FN | and | FN | FN |
| Negative cases (N) (good functions) | FP | or | FP | FP |
| | FP | or | TN | FP |
| | TN | or | FP | FP |
| | FP | or | TN∗ | TN∗ |
| | TN∗ | or | FP | TN∗ |
| | TN | and | TN | TN |

Note: ∗ If any of the tools in a combination obtains a TN in a good function and also obtains a TP in the associated bad function of the same test case.
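The 1ooN merge rules above can be sketched in Java; the types and names here are ours, assuming a per-test-case record of whether each tool alarmed on the bad and on the good function:

```java
import java.util.List;

// Sketch of the 1-out-of-N (1ooN) merge for a single test case.
public final class OneOutOfN {
    // One tool's verdict on a test case: did it alarm on the bad
    // function, and did it alarm on the paired good function?
    public record ToolResult(boolean alarmOnBad, boolean alarmOnGood) {}

    // Bad function: an alarm from any tool yields a combined TP.
    public static boolean combinedTp(List<ToolResult> results) {
        return results.stream().anyMatch(ToolResult::alarmOnBad);
    }

    // Good function: the combination scores a TN when no tool alarms, or
    // when some tool is silent on the good function AND alarmed on the bad
    // one (the TN∗ rows of the table); otherwise the alarm stays a FP.
    public static boolean combinedTn(List<ToolResult> results) {
        boolean anyAlarmOnGood = results.stream().anyMatch(ToolResult::alarmOnGood);
        if (!anyAlarmOnGood) return true; // all tools silent: plain TN
        return results.stream().anyMatch(r -> !r.alarmOnGood() && r.alarmOnBad());
    }
}
```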
The number of found weaknesses (TP) is accounted in
The number of test cases varies across weakness classes. To normalize the findings in each class of weakness (
The execution of tools
Fortify | FsecBugs | Xanitizer | Klocwork | Coverity | ||||||
---|---|---|---|---|---|---|---|---|---|---|
TPR | FPR | TPR | FPR | TPR | FPR | TPR | FPR | TPR | FPR | |
Injection | 0.798 | 0.665 | 0.643 | 0.349 | 0.655 | 0.385 | 0.405 | 0.330 | 0.298 | 0.252 |
Broken auth | 0.292 | 0.269 | 0.208 | 0.038 | 0.250 | 0.288 | 0.083 | 0.058 | 0.083 | 0.154 |
Sensitive data | 0.500 | 0.231 | 0.000 | 0.000 | 0.000 | 0.077 | 0.167 | 0.077 | 0.000 | 0.038 |
Broken A.C | 0.800 | 0.614 | 0.720 | 0.250 | 0.320 | 0.159 | 0.800 | 0.659 | 0.640 | 0.250 |
Broken conf | 0.667 | 0.667 | 0.667 | 0.333 | 1.000 | 0.333 | 0.333 | 0.222 | 1.000 | 0.333 |
XSS | 0.500 | 0.258 | 0.906 | 0.484 | 0.938 | 0.500 | 0.625 | 0.355 | 0.531 | 0.258 |
Comp. Vuln. | 0.667 | 0.545 | 0.667 | 0.182 | 0.667 | 0.182 | 0.333 | 0.364 | 0.667 | 0.182 |
CSRF | 1.000 | 0.864 | 0.143 | 0.000 | 0.000 | 0.000 | 0.714 | 0.545 | 0.000 | 0.000 |
Open redirect | 0.545 | 0.389 | 0.909 | 0.333 | 0.818 | 0.333 | 0.818 | 0.667 | 0.818 | 0.222 |
TPR/FPR | 0.641 | 0.500 | 0.540 | 0.219 | 0.516 | 0.251 | 0.475 | 0.364 | 0.449 | 0.188 |
Precision | 0.562 | 0.712 | 0.673 | 0.566 | 0.705 | |||||
F-measure | 0.599 | 0.614 | 0.584 | 0.517 | 0.548 | |||||
F0,5-score | 0.691 | 0.803 | 0.761 | 0.655 | 0.759 | |||||
F1,5-score | 0.472 | 0.449 | 0.428 | 0.385 | 0.389 |
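The aggregate TPR/FPR row of the table is the unweighted mean of the nine per-class ratios, so that classes with many test cases (e.g. Injection) do not dominate the average. A minimal sketch (class name is ours):

```java
// Sketch of the per-class normalization: the aggregate row is the
// unweighted mean of the nine per-class ratios.
public final class ClassAverage {
    public static double mean(double[] perClassRatios) {
        double sum = 0;
        for (double r : perClassRatios) sum += r; // one ratio per weakness class
        return sum / perClassRatios.length;
    }
}
```

For example, Fortify's per-class TPRs from the table (0.798, 0.292, 0.500, 0.800, 0.667, 0.500, 0.667, 1.000, 0.545) average to 0.641, matching its aggregate TPR cell.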
| | | xnft | ftfb | coft | ftkw | xnkw | fbkw | cokw | xnfb | cofb | coxn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Injection | TPR | 0.952 | 0.893 | 0.798 | 0.857 | 0.762 | 0.714 | 0.512 | 0.726 | 0.655 | 0.702 |
| | FPR | — | 0.757 | 0.638 | — | — | — | — | 0.422 | — | — |
| Broken auth | TPR | 0.333 | 0.417 | 0.292 | 0.292 | 0.250 | 0.292 | 0.167 | 0.333 | 0.250 | 0.250 |
| | FPR | 0.442 | 0.308 | 0.346 | — | — | — | — | 0.250 | — | — |
| Sensitive data | TPR | 0.500 | 0.500 | 0.500 | 0.500 | 0.167 | 0.167 | 0.167 | 0.000 | 0.000 | 0.000 |
| | FPR | 0.231 | 0.231 | 0.231 | — | — | — | — | 0.077 | — | — |
| Broken A.C | TPR | 0.800 | 0.800 | 0.840 | 0.920 | 0.880 | 0.840 | 0.840 | 0.800 | 0.760 | 0.720 |
| | FPR | 0.545 | 0.614 | 0.477 | — | — | — | — | 0.318 | — | — |
| Broken conf | TPR | 1.000 | 0.667 | 1.000 | 0.667 | 1.000 | 0.667 | 1.000 | 1.000 | 1.000 | 1.000 |
| | FPR | 0.333 | 0.667 | 0.333 | — | — | — | — | 0.333 | — | — |
| XSS | TPR | 0.938 | 0.906 | 0.656 | 0.656 | 0.938 | 0.906 | 0.750 | 0.938 | 0.906 | 0.938 |
| | FPR | 0.500 | 0.484 | 0.355 | — | — | — | — | 0.500 | — | — |
| Comp. vuln. | TPR | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 |
| | FPR | 0.182 | 0.545 | 0.182 | — | — | — | — | 0.182 | — | — |
| CSRF | TPR | 1.000 | 1.000 | 1.000 | 1.000 | 0.714 | 0.714 | 0.714 | 0.143 | 0.143 | 0.000 |
| | FPR | 0.864 | 0.864 | 0.864 | — | — | — | — | 0.000 | — | — |
| Open redirect | TPR | 0.909 | 0.909 | 0.818 | 0.818 | 0.909 | 0.909 | 0.818 | 0.909 | 0.909 | 0.909 |
| | FPR | 0.333 | 0.556 | 0.222 | — | — | — | — | 0.333 | — | — |
| Average | TPR | 0.789 | 0.751 | 0.730 | 0.709 | 0.698 | 0.653 | 0.626 | 0.613 | 0.588 | 0.576 |
| | FPR | — | 0.558 | 0.405 | — | — | — | — | 0.268 | — | — |
| Precision | | 0.635 | 0.574 | 0.643 | 0.559 | 0.652 | 0.692 | 0.669 | 0.695 | 0.720 | 0.689 |
| F-measure | | 0.704 | 0.650 | 0.684 | 0.625 | 0.674 | 0.672 | 0.647 | 0.652 | 0.647 | 0.627 |
| F0,5-score | | 0.793 | 0.722 | 0.790 | 0.700 | 0.793 | 0.821 | 0.792 | 0.813 | 0.827 | 0.795 |
| F1,5-score | | 0.565 | 0.527 | 0.539 | 0.503 | 0.526 | 0.511 | 0.491 | 0.489 | 0.479 | 0.467 |
| | | xnftkw | xnftfb | coxnft | coftfb | ftfbkw | coftkw | xnfbkw | coxnkw | cofbkw | coxnfb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Injection | TPR | 0.976 | 0.952 | 0.952 | 0.893 | 0.929 | 0.857 | 0.786 | 0.786 | 0.726 | 0.726 |
| Broken auth | TPR | 0.333 | 0.417 | 0.333 | 0.417 | 0.417 | 0.292 | 0.333 | 0.250 | 0.333 | 0.333 |
| Sensitive data | TPR | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.167 | 0.167 | 0.167 | 0.000 |
| Broken A.C | TPR | 0.920 | 0.800 | 0.840 | 0.840 | 0.920 | 0.920 | 0.920 | 0.920 | 0.840 | 0.840 |
| Broken conf | TPR | 1.000 | 1.000 | 1.000 | 1.000 | 0.667 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| XSS | TPR | 0.938 | 0.938 | 0.938 | 0.906 | 0.906 | 0.750 | 0.938 | 0.938 | 0.906 | 0.938 |
| Comp. vuln. | TPR | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 |
| CSRF | TPR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.714 | 0.714 | 0.714 | 0.143 |
| Open redirect | TPR | 0.909 | 0.909 | 0.909 | 0.909 | 0.909 | 0.818 | 0.909 | 0.909 | 0.909 | 0.909 |
| Average | TPR | 0.805 | 0.798 | 0.793 | 0.792 | 0.768 | 0.756 | 0.715 | 0.706 | 0.696 | 0.617 |
| Precision | | 0.632 | 0.640 | 0.647 | 0.647 | 0.573 | 0.641 | 0.686 | 0.666 | 0.703 | 0.706 |
| F-measure | | 0.708 | 0.710 | 0.713 | 0.712 | 0.656 | 0.694 | 0.700 | 0.685 | 0.700 | 0.658 |
| F0,5-score | | 0.793 | 0.799 | 0.806 | 0.806 | 0.724 | 0.793 | 0.830 | 0.809 | 0.842 | 0.823 |
| F1,5-score | | 0.571 | 0.570 | 0.570 | 0.570 | 0.535 | 0.551 | 0.543 | 0.533 | 0.537 | 0.494 |
We elaborate on the research concerns enumerated in Section 1, based on the proposed method and the relevant technique, and give the findings.
| | | xnftfbkwco | xnftfbkw | coftfbkw | coxnftkw | coxnftfb | coxnfbkw |
|---|---|---|---|---|---|---|---|
| Injection | TPR | 0.976 | 0.976 | 0.929 | 0.976 | 0.952 | 0.786 |
| Broken auth | TPR | 0.417 | 0.417 | 0.417 | 0.333 | 0.417 | 0.333 |
| Sensitive data | TPR | 0.500 | 0.500 | 0.500 | 0.500 | 0.500 | 0.167 |
| Broken A.C | TPR | 0.920 | 0.920 | 0.920 | 0.920 | 0.840 | 0.920 |
| Broken conf | TPR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 |
| XSS | TPR | 0.938 | 0.938 | 0.906 | 0.938 | 0.938 | 0.938 |
| Comp. vuln. | TPR | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 | 0.667 |
| CSRF | TPR | 1.000 | 1.000 | 1.000 | 1.000 | 1.000 | 0.714 |
| Open redirect | TPR | 0.909 | 0.909 | 0.909 | 0.909 | 0.909 | 0.909 |
| Average | TPR | 0.814 | 0.814 | 0.805 | 0.805 | 0.802 | 0.715 |
| Precision | | 0.637 | 0.637 | 0.644 | 0.644 | 0.652 | 0.695 |
| F-measure | | 0.715 | 0.715 | 0.716 | 0.715 | 0.719 | 0.705 |
| F0,5-score | | 0.799 | 0.799 | 0.805 | 0.805 | 0.813 | 0.839 |
| F1,5-score | | 0.577 | 0.577 | 0.575 | 0.575 | 0.576 | 0.545 |
In summary, the graphs in
It can be seen that each metric yields a different ranking of the tools. Each metric will be associated with a different security analysis objective, analyzed later in Section 3.5.4 according to the different criticality levels of the web applications.
In summary,
We develop three cases of n-tool effectiveness related to the F-measure, F0,5-score and Recall metrics, taking into account all n-tool combinations. Each metric permits classifying the SASTT combinations according to distinct levels of criticality.
| n-tools | F0,5-score | n-tools | F-measure | n-tools | TPR (Recall) |
|---|---|---|---|---|---|
cofbkw | 0.842 | coxnftfb | 0.719 | xnftfbkw | 0.814 |
coxnfbkw | 0.839 | coftfbkw | 0.716 | xnftfbkwco | 0.814 |
xnfbkw | 0.830 | xnftfbkwco | 0.715 | xnftkw | 0.805 |
cofb | 0.827 | xnftfbkw | 0.715 | coftfbkw | 0.805 |
fbkw | 0.821 | coxnftkw | 0.715 | coxnftkw | 0.805 |
coxnfb | 0.823 | coxnft | 0.713 | coxnftfb | 0.802 |
xnfb | 0.813 | coftfb | 0.712 | xnftfb | 0.798 |
coxnftfb | 0.813 | xnftfb | 0.710 | coxnft | 0.793 |
coxnkw | 0.809 | xnftkw | 0.708 | ftfbkw | 0.792 |
coxnft | 0.806 | coxnfbkw | 0.705 | coftfb | 0.792 |
coftfb | 0.806 | xnft | 0.704 | xnft | 0.789 |
coxnftkw | 0.805 | xnfbkw | 0.700 | coftkw | 0.756 |
coftfbkw | 0.805 | cofbkw | 0.700 | ftfb | 0.751 |
fb | 0.803 | coftkw | 0.694 | coft | 0.730 |
xnftfbkw | 0.799 | coxnkw | 0.685 | xnfbkw | 0.715 |
xnftfbkwco | 0.799 | coft | 0.684 | coxnfbkw | 0.715 |
xnftfb | 0.799 | xnkw | 0.674 | ftkw | 0.709 |
coxn | 0.795 | fbkw | 0.672 | coxnkw | 0.706 |
xnftkw | 0.793 | coxnfb | 0.658 | xnkw | 0.698 |
xnft | 0.793 | ftfbkw | 0.656 | cofbkw | 0.696 |
coftkw | 0.793 | xnfb | 0.652 | fbkw | 0.653 |
xnkw | 0.793 | ftfb | 0.650 | ft | 0.641 |
cokw | 0.792 | cokw | 0.647 | cokw | 0.626 |
coft | 0.790 | cofb | 0.647 | coxnfb | 0.617 |
xn | 0.761 | coxn | 0.627 | xnfb | 0.613 |
co | 0.759 | ftkw | 0.625 | cofb | 0.588 |
ftfbkw | 0.724 | fb | 0.614 | coxn | 0.576 |
ftfb | 0.722 | ft | 0.599 | fb | 0.540 |
ftkw | 0.700 | xn | 0.584 | xn | 0.516 |
ft | 0.691 | co | 0.548 | kw | 0.475 |
kw | 0.655 | kw | 0.517 | co | 0.449 |
The Recall (TPR) metric addresses applications categorized as critical, since it shows the capability of a SASTT to find the largest number of weaknesses. Recall (TPR) is the most appropriate metric for crucial applications, since it permits picking the tools with the optimum TP ratio. As an alternative to recall, the F1,5-score metric can be used, as it rewards recall with a higher gain than precision. The combinations Xanitizer-Fortify-FindSecBugs-Klocwork, Xanitizer-Fortify-FindSecBugs-Klocwork-Coverity and Xanitizer-Fortify-Klocwork have the best results; Xanitizer-Fortify is the best 2-tool combination, and Fortify the best tool in isolation.
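To illustrate why the β weighting matters for criticality, the following Java sketch (with invented precision/recall values, not taken from the tables) shows two hypothetical tools swapping ranks when moving from the precision-weighted F0,5-score to the recall-weighted F1,5-score:

```java
// Sketch: the choice of beta in the F-beta score changes the ranking.
public final class CriticalityRanking {
    public static double fBeta(double precision, double recall, double beta) {
        double b2 = beta * beta;
        return (1 + b2) * precision * recall / (b2 * precision + recall);
    }

    public static void main(String[] args) {
        double pA = 0.7, rA = 0.5; // tool A: precise but less complete
        double pB = 0.5, rB = 0.7; // tool B: complete but noisier
        // Non-critical applications (beta = 0.5, precision-weighted): A wins.
        System.out.println(fBeta(pA, rA, 0.5) > fBeta(pB, rB, 0.5)); // true
        // Critical applications (beta = 1.5, recall-weighted): B wins.
        System.out.println(fBeta(pA, rA, 1.5) < fBeta(pB, rB, 1.5)); // true
    }
}
```

This is the mechanism behind the three classifications: critical applications rank by Recall or F1,5-score, medium ones by F-measure, and non-critical ones by F0,5-score or Precision.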
F-measure metric (
This case covers, for non-critical applications, the building and evaluation of applications that hold no relevant information and/or are not exposed to threats. The F0,5-score metric rewards precision and is appropriate for non-critical applications, where development time may be shorter, since it favors the tools with better precision. This leads tools with high precision to obtain better results. The Precision metric can be used as an alternative to the F0,5-score. The combinations Coverity-FindSecBugs-Klocwork, Coverity-Xanitizer-FindSecBugs-Klocwork and Xanitizer-FindSecBugs-Klocwork have the best results; FindSecBugs is the best tool in isolation.
The results show that simply adding more tools to a combination is not synonymous with obtaining better results in any of the three classifications. There are combinations with fewer tools, even a single tool, that outperform combinations with more tools.
This benchmark provides representativeness of security weaknesses for WA based on the OWASP TTP and relies on realistic code including distinct source inputs and flow complexities. Besides, it permits increasing the number of classes and types of weaknesses, making it expandable. It has been updated since its first use in [
An analysis has been made of how SASTT behave in combination to improve on the weakness detection efficiency obtained when using a single tool. To compare the distinct n-SASTT combinations, we have used an OWASP Top Ten weaknesses benchmark recently constructed for evaluating the security performance of SASTT, containing test cases for distinct weakness types in each OWASP Top Ten weakness class. The evaluation uses a new and repeatable approach for assessing and classifying the n-SASTT combinations.
In general, it is better to include more than one tool in a combination to obtain better results with respect to the selected metrics, although this is not always the case according to the results obtained for each metric and combination. The tools' different designs make it necessary to study how each combination behaves. TPR results of over 0.800 are achieved in combination. The FPR results of the combinations never exceed the worst result obtained in isolation by a tool included in the combination. An audit phase of the weakness findings is still necessary, performed by a user or team trained in the WA languages used and in the security weaknesses of each language.
The analysis of the results shows that simply adding more tools to a combination is not synonymous with obtaining better results in a classification for the selected metrics. There are combinations with fewer tools, even a single tool, that outperform combinations with more tools. The results depend on each concrete combination and the synergies between the SASTT included in it.
The evaluation process gives a strict classification of n-SASTT combinations, taking into account appropriate and widely accepted metrics applied to the findings of the tools' execution.
In general, weakness detection in the classes related to Disclosure of Information in WA source code and Broken Authentication and Sessions is improving for all tools. Changes in WA technologies make alterations in the weakness classes necessary over time. Continuous study is needed to update and adapt the tools to the most usual and relevant weakness classes. Hence, the OWASP Top Ten must be updated frequently.
This work has evaluated four commercial and one open-source SASTT in combination. It is essential to analyze the behavior of commercial tools in combination with open-source tools, to establish the differences between them when detecting weaknesses in combination and to reduce economic costs when a free tool like FindSecurityBugs, which obtains excellent results, can be included.
It is important to build new benchmarks for all classes of weaknesses and for more languages, to perform new comparatives that assist practitioners and companies in choosing the optimum SASTT.
We are currently studying ways to improve SASTT results with machine learning techniques, to discover security weaknesses in WA source code and reduce the false positive ratio. To reach this objective, it is necessary to develop a labeled dataset with source code test cases on which diverse machine learning algorithms can be trained.