Invented by Jeff Olson, Matthew Kindy II, Praetorian Inc
The Praetorian Inc invention works as follows:
The method includes the following: (i) flattening an abstract syntax tree (AST) into a sequence of structured tokens that includes both syntactic and semantic structure, (ii) implementing a natural-language processing technique to map the sequence of structured tokens to a number of integers, (iii) pre-training the model with unlabeled code as input to predict the next sub-token, and (iv) training the model on labeled code to predict the presence or
Background for System and Method for Automatically Detecting a Security Vulnerability in a Source Code Using a Machine Learning Model
In computer security, a vulnerability is a weakness that a threat actor can exploit to perform unauthorized actions on a computer system. To exploit a vulnerability, an attacker must have at least one tool or technique that can connect to a weakness in the system. Vulnerability management is the practice of identifying, classifying, remediating, and mitigating vulnerabilities. The term generally refers to software vulnerabilities in computing systems.
Discovering software vulnerabilities through automated code analysis has remained elusive. Rice’s theorem states that all non-trivial semantic properties of programs are undecidable. Fully accurate software vulnerability detection by an automated procedure is therefore unattainable in general, because it would require one computer program to decide a semantic property of another program running on a computing system. A semantic property describes the behavior of the program, such as whether it terminates for all inputs. A syntactic attribute, such as whether the program contains an if-then statement, is not a semantic property. A property is non-trivial if it is neither true for every computable function nor false for every computable function.
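The difference between a syntactic attribute and a semantic property can be illustrated with a short Python sketch. It uses the standard `ast` module as a stand-in for a parser; the function name `contains_if` and the two sample programs are hypothetical and are not taken from the patent. The check always terminates because it only inspects the program text.

```python
import ast


def contains_if(source: str) -> bool:
    """Syntactic check: does the program text contain an `if` statement?"""
    tree = ast.parse(source)
    return any(isinstance(node, ast.If) for node in ast.walk(tree))


# Two hypothetical programs that compute the same function (absolute value).
WITH_IF = "def f(x):\n    if x > 0:\n        return x\n    return -x\n"
WITHOUT_IF = "def g(x):\n    return abs(x)\n"

print(contains_if(WITH_IF))
print(contains_if(WITHOUT_IF))
```

The two samples behave identically yet differ in this syntactic attribute, which is why a syntactic check alone says little about behavior.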
Most currently available static software analysis techniques do not provide a complete and accurate picture. Taint-based software analysis, which identifies sources and sanitizers, is prone to false positives and false negatives because of the complexity of the syntactic and semantic structure of the code. Other solutions (e.g., static analysis) have a poor signal-to-noise ratio: the number of false positives they report (i.e., reported vulnerabilities that do not exist) and the number of false negatives (i.e., unreported vulnerabilities that do exist) are both high.
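The two error types can be made concrete by comparing the findings a tool reports with a known ground truth. The following is a minimal sketch; the function name `confusion_counts` and the finding identifiers are hypothetical and do not come from the patent.

```python
def confusion_counts(reported: set[str], actual: set[str]) -> dict[str, int]:
    """Compare reported findings with the vulnerabilities that really exist."""
    return {
        # Reported and real.
        "true_positives": len(reported & actual),
        # Reported, but the vulnerability does not exist.
        "false_positives": len(reported - actual),
        # Real, but never reported.
        "false_negatives": len(actual - reported),
    }


# Hypothetical finding identifiers, for illustration only.
reported = {"finding-a", "finding-b", "finding-c"}
actual = {"finding-b", "finding-c", "finding-d", "finding-e"}
counts = confusion_counts(reported, actual)
print(counts)
```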
U.S. Pat. No. 8,806,619 describes a method and system for determining whether software contains malicious code. The method involves equipping a validation machine with tools and monitors that capture the software’s static and dynamic behavior. The validation machine executes the software under test, and the tools and monitors log data representing the behavior of the program in order to detect malicious or vulnerable code. To enhance software security, one or more operations are performed automatically on the software. Activities that cannot be neutralized by an automatic process are flagged and sent to a human for inspection.
U.S. Pat. No. 8,499,353 discloses a platform for security assessment. The platform comprises a communications server that receives technical characteristics and business context information about a software application, and testing engines that perform a plurality of vulnerability tests on the software application. The platform can also include a module that defines an assurance level based on the technical characteristics and the business context information. It then creates a plan of multiple vulnerability tests according to the assurance level. Finally, it correlates the results of the vulnerability tests in order to identify faults within the application. However, none of these prior art technologies effectively detects vulnerabilities in source code with a low signal-to-noise ratio.
In light of the above discussion, it is necessary to overcome the drawbacks of existing approaches in order to detect security vulnerabilities in source code automatically, without false positives or false negatives, while keeping the signal-to-noise ratio at the lowest possible level.
The present disclosure seeks to provide a method of automatically detecting a security vulnerability in source code using a machine learning model.
The present disclosure, in its first aspect, provides a method of automatically detecting a security vulnerability in source code using a machine learning model.
An advantage of the present disclosure is a better signal-to-noise ratio; in particular, the use of a machine learning (ML) model helps improve the detection of security vulnerabilities in the source code.
Optionally, the method includes detecting a second security vulnerability in the source code before compilation by performing static analysis on a vectorized call graph.
Optionally, the method includes detecting a third security vulnerability during compilation of the source code by performing library analysis on the vectorized call graph.
Optionally, the method includes performing, using the machine learning model, a post-analysis on the first security vulnerability, the second security vulnerability, and the third security vulnerability in order to predict a final security vulnerability.
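The text does not say how the post-analysis combines the three detections. As one possible illustration only, the sketch below feeds three scores into a logistic combination with fixed weights. The function name `post_analysis`, the weights, and the bias are assumptions; a trained model would learn such parameters instead of taking them as arguments.

```python
import math


def post_analysis(scores: list[float], weights: list[float], bias: float = 0.0) -> float:
    """Combine per-stage vulnerability scores into one final probability.

    `scores` holds the (assumed) outputs of the first, second, and third
    detections; the logistic combination is a stand-in, not the patented model.
    """
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    z = bias + sum(w * s for w, s in zip(weights, scores))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical stage scores and hand-picked weights.
final_score = post_analysis([0.9, 0.4, 0.7], weights=[2.0, 1.0, 1.5], bias=-2.0)
print(round(final_score, 3))
```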
Optionally, the method comprises creating a database containing the source code, its metadata, and unlabeled and labeled code.
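A database of this kind could be laid out in many ways. The sketch below uses Python's built-in `sqlite3` module with a single table in which a missing label marks unlabeled code. The table name, column names, and helper functions are all hypothetical.

```python
import json
import sqlite3


def create_database(path: str = ":memory:") -> sqlite3.Connection:
    """Create a store for source code, its metadata, and an optional label."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS code_samples (
            id INTEGER PRIMARY KEY,
            source TEXT NOT NULL,
            metadata TEXT NOT NULL,  -- JSON-encoded metadata
            label INTEGER            -- NULL = unlabeled, 0 = clean, 1 = vulnerable
        )
        """
    )
    return conn


def add_sample(conn: sqlite3.Connection, source: str, metadata: dict, label=None) -> int:
    """Insert one code sample and return its row id."""
    cur = conn.execute(
        "INSERT INTO code_samples (source, metadata, label) VALUES (?, ?, ?)",
        (source, json.dumps(metadata), label),
    )
    conn.commit()
    return cur.lastrowid


def count_samples(conn: sqlite3.Connection, labeled: bool) -> int:
    """Count labeled or unlabeled samples."""
    condition = "IS NOT NULL" if labeled else "IS NULL"
    row = conn.execute(
        f"SELECT COUNT(*) FROM code_samples WHERE label {condition}"
    ).fetchone()
    return row[0]


conn = create_database()
add_sample(conn, "print('hello')", {"language": "python"})
add_sample(conn, "eval(user_input)", {"language": "python"}, label=1)
```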
Optionally, the method includes parsing the source code into an abstract syntax tree (AST), where the AST is a tree representation of the abstract syntactic structure of source code written in any programming language.
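Parsing source code into an AST and flattening it into a token sequence might look like the following sketch. It assumes Python source and uses the standard `ast` module. The bracketed open and close tokens and the function name `flatten_ast` are illustrative choices, and the sketch captures only node types, identifiers, and constants, not the full semantic structure the method refers to.

```python
import ast


def flatten_ast(source: str) -> list[str]:
    """Flatten the AST of `source` into a sequence of structured tokens."""
    tokens: list[str] = []

    def visit(node: ast.AST) -> None:
        name = type(node).__name__
        tokens.append(f"<{name}>")  # opening token preserves tree structure
        if isinstance(node, ast.Name):
            tokens.append(node.id)  # identifier text
        elif isinstance(node, ast.Constant):
            tokens.append(repr(node.value))  # literal value
        for child in ast.iter_child_nodes(node):
            visit(child)
        tokens.append(f"</{name}>")  # closing token

    visit(ast.parse(source))
    return tokens


tokens = flatten_ast("x = 1")
print(tokens)
```

Keeping explicit open and close tokens lets the nesting of the tree be recovered from the flat sequence.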
Optionally, the method includes generating a call graph by integrating the abstract syntax tree with the control flow and data flow of the source code. The call graph represents the calling relationships between the subroutines of a program.
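A call graph can be approximated from the AST alone, as in the sketch below. This is a simplification: it assumes Python source, records only direct calls by plain name, and does not integrate control flow or data flow as the method describes. The function name `build_call_graph` and the sample program are hypothetical.

```python
import ast


def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each function defined in `source` to the names it calls directly."""
    tree = ast.parse(source)
    graph: dict[str, set[str]] = {}
    for func in ast.walk(tree):
        if isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = graph.setdefault(func.name, set())
            # Calls inside nested functions are also attributed to the outer one.
            for node in ast.walk(func):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callees.add(node.func.id)
    return graph


SAMPLE = """
def read_input():
    return input()

def run(cmd):
    return eval(cmd)

def main():
    run(read_input())
"""

graph = build_call_graph(SAMPLE)
print(graph)
```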
Optionally, the method comprises implementing an embedding technique to generate the vectorized call graph.
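The text does not name a specific embedding technique. As a deliberately simple stand-in, the sketch below represents each subroutine by its row of the adjacency matrix, so every node becomes a fixed-length vector. A learned graph embedding would replace this in practice, and the function name `vectorize_call_graph` is hypothetical.

```python
def vectorize_call_graph(graph: dict[str, set[str]]) -> dict[str, list[int]]:
    """Represent each node by its outgoing-edge row of the adjacency matrix."""
    # Collect callers and callees so every subroutine gets a column.
    nodes = sorted(set(graph) | {c for callees in graph.values() for c in callees})
    index = {name: i for i, name in enumerate(nodes)}
    vectors: dict[str, list[int]] = {}
    for name in nodes:
        row = [0] * len(nodes)
        for callee in graph.get(name, ()):
            row[index[callee]] = 1  # 1 marks "name calls callee"
        vectors[name] = row
    return vectors


# Hypothetical call graph: main -> run -> eval.
vectors = vectorize_call_graph({"main": {"run"}, "run": {"eval"}})
print(vectors)
```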
Optionally, the method includes displaying the final security vulnerability on an expert device to receive a first input from a security expert.
Optionally, the method includes processing the first input on the final security vulnerability, wherein the first input comprises feedback associated with the security vulnerability.
Optionally, the method includes providing the first input about the final security vulnerability as training data for the machine learning model, to improve the accuracy with which it predicts the presence of vulnerabilities in the source code.
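The feedback loop described above can be sketched as a conversion from expert feedback to labeled training examples. The `ExpertFeedback` record, its fields, and the function name are assumptions made for illustration; the text does not specify the format of the first input.

```python
from dataclasses import dataclass


@dataclass
class ExpertFeedback:
    """One piece of feedback from a security expert (hypothetical format)."""
    code_snippet: str
    predicted_vulnerable: bool
    expert_confirms: bool


def feedback_to_training_examples(
    feedback: list[ExpertFeedback],
) -> list[tuple[str, int]]:
    """Turn expert feedback into (code, label) pairs for further training."""
    examples = []
    for item in feedback:
        # Keep the prediction if the expert confirms it, otherwise flip it.
        if item.expert_confirms:
            label = item.predicted_vulnerable
        else:
            label = not item.predicted_vulnerable
        examples.append((item.code_snippet, int(label)))
    return examples


examples = feedback_to_training_examples([
    ExpertFeedback("eval(user_input)", predicted_vulnerable=True, expert_confirms=True),
    ExpertFeedback("print('hello')", predicted_vulnerable=True, expert_confirms=False),
])
print(examples)
```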
Optionally, the method includes displaying the final security vulnerability on a user device.
Optionally, in the method, the natural-language processing technique includes Byte Pair Encoding.
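Byte Pair Encoding repeatedly merges the most frequent adjacent pair of symbols, and the resulting sub-tokens can then be mapped to integers. The sketch below is a minimal character-level version written for illustration. The function names, the tie-breaking rule, and the sample identifiers are assumptions, and production tokenizers differ in many details.

```python
from collections import Counter


def apply_merge(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of tokens."""
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[apply_merge(symbols, best)] += freq
        corpus = new_corpus
    return merges


def build_vocab(words: list[str], merges: list[tuple[str, str]]) -> dict[str, int]:
    """Assign an integer id to every character and every merged symbol."""
    chars = sorted({ch for word in words for ch in word})
    symbols = dict.fromkeys(chars + [a + b for a, b in merges])  # dedupe, keep order
    return {symbol: i for i, symbol in enumerate(symbols)}


def encode(word: str, merges: list[tuple[str, str]], vocab: dict[str, int]) -> list[int]:
    """Split `word` into sub-tokens and map them to integers.

    Raises KeyError for characters that never appeared in the training words.
    """
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return [vocab[s] for s in symbols]


# Hypothetical identifiers standing in for tokens from flattened source code.
words = ["getUser", "getName", "setUser", "setName"]
merges = learn_bpe(words, num_merges=10)
vocab = build_vocab(words, merges)
ids = encode("getUser", merges, vocab)
print(ids)
```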