Ehsan Aghaei, Ph.D.

AI Scientist and Researcher


@Cisco | @Carnegie Mellon University

LinkedIn | Scholar | ResearchGate | Git | Medium


*NEWS 

DefenBot: A Hybrid Framework using LLM and Retrieval Augmented Generation for Vulnerability Mitigation

Large Language Models (LLMs) play a pivotal role in cybersecurity and cyber threat intelligence by significantly enhancing the capabilities of natural language processing and understanding. These models can analyze vast amounts of textual data, including threat reports, forum discussions, and news articles, to identify emerging cyber threats and vulnerabilities. LLMs excel in contextual comprehension, enabling them to recognize subtle nuances and evolving patterns within cyber threat landscapes. They facilitate the extraction of actionable insights from unstructured data, aiding in the early detection of potential security breaches, understanding threat actor tactics, and predicting attack vectors. By automating the analysis of diverse textual sources, LLMs empower cybersecurity professionals to stay ahead of the rapidly evolving threat landscape, streamline threat intelligence processes, and bolster overall resilience against cyber threats.

DefenBot proposes a novel NLP-based system for automated vulnerability characterization and remediation. The system will apply large language models (LLMs) through Retrieval Augmented Generation (RAG): it takes a CVE, analyzes the threat action, identifies the corresponding weaknesses (CWEs) and MITRE ATT&CK techniques, and reasons about security measures in order to recommend actionable, practical cyber defense strategies in the form of mitigations and critical security controls.

A novel hybrid architecture will combine LLMs and retrieval models with structured knowledge representations to reduce false positives, improve the quality of domain-specific (cybersecurity domain) knowledge collection, increase interpretability, enhance factuality, and advance text generation.

We expect DefenBot to enhance the vulnerability characterization quality and improve the defense rate based on our related work such as SecureBERT, SecureGPT, and CyRet. 
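To make the retrieval-augmented flow described above more concrete, here is a minimal sketch of the retrieve-then-prompt step. It is not DefenBot's implementation: the knowledge base entries, the token-overlap scorer, and the names `retrieve` and `build_prompt` are assumptions made for illustration, and a real system would use a trained retriever and send the prompt to an LLM.

```python
import re
from collections import Counter

# Hypothetical mini knowledge base; the texts paraphrase public CWE and
# MITRE ATT&CK titles and are illustrative only.
KNOWLEDGE_BASE = [
    {"id": "CWE-79", "text": "cross-site scripting improper neutralization of input during web page generation"},
    {"id": "CWE-89", "text": "sql injection improper neutralization of special elements used in an sql command"},
    {"id": "T1190", "text": "exploit public-facing application to gain initial access"},
]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, doc):
    """Count shared tokens; a simple stand-in for a learned retriever."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())

def retrieve(query, k=2):
    """Return the k knowledge base entries that share the most tokens with the query."""
    ranked = sorted(KNOWLEDGE_BASE, key=lambda e: overlap_score(query, e["text"]), reverse=True)
    return ranked[:k]

def build_prompt(cve_description, k=2):
    """Assemble the text that would be sent to a generator model (not called here)."""
    context = "\n".join(f"- {e['id']}: {e['text']}" for e in retrieve(cve_description, k))
    return (
        "Context:\n" + context + "\n\n"
        "Vulnerability: " + cve_description + "\n"
        "Task: identify the weakness, the likely attack technique, and practical mitigations."
    )

example = "SQL injection in the login form allows remote attackers to execute arbitrary SQL command"
prompt = build_prompt(example)
```

Grounding the prompt in retrieved entries is what lets the generator cite specific weaknesses and techniques instead of relying only on what it memorized during training.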


Git Repo (TBA)


Enhanced SecureBERT

Last year, the buzz was all about BERT, GPT, and their variants. The focus has since shifted to names like ChatGPT, Llama, and Gemini. Nevertheless, for customized large language model (LLM) applications, fine-tuning still plays a crucial role in enhancing overall performance, and BERT-style models remain key building blocks:


* Cross-encoders and Siamese Networks leverage BERT-based models for document embedding and similarity search tasks. These architectures utilize BERT's contextualized embeddings to enhance the encoding quality and improve the effectiveness of matching and retrieval processes for tasks such as vector database creation and similarity search.


* Critic models, validator modules, and guardrails are widely used to check the factuality and correctness of LLM inputs and outputs. BERT-based classifiers and embedders are key components for validating results, detecting biases, improving chain-of-thought processes, and prompt engineering.


For those who are interested in the application of AI, particularly NLP and LLMs, in cybersecurity, I just released SecureBERT+ and SecureDeBERTa, which have been trained on more than 4B tokens with a stronger focus on cyber threat intelligence (CTI).

While I assess these models across various tasks, particularly training a semantic similarity search model that can distinguish between "sore throat" and "Denial of Service" when looking for documents related to "Virus" (!!), I would highly value any technical feedback and welcome additional fine-tuned models built on top of them. Your insights and contributions will be greatly appreciated.


SecureBERT (2.1M downloads on HF)

SecureBERT+

SecureDeBERTa

CyRet: Domain-Specific similarity search model for cybersecurity corpus

For encoding and retrieving similar documents, state-of-the-art general-purpose models such as all-mpnet-base-v2 play a crucial role. This model follows a bi-encoder architecture: each input sequence, typically a sentence or a document, is encoded independently of the other into a fixed-size embedding (in Sentence-Transformers models such as all-mpnet-base-v2, the same shared-weight encoder processes every input). Because no input depends on another, sequences can be encoded in parallel, and document embeddings can be computed once and reused. This design is particularly advantageous for similarity and matching tasks, where the relationship between a pair of sequences is evaluated by comparing their embeddings, and it makes the model effective for document retrieval, sentence similarity, and related applications.
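The bi-encoder pattern can be sketched in a few lines. The `encode` function below is a toy hashing encoder that I use only as a stand-in for a transformer model, so the scores carry no semantic meaning; the point is the flow: every text is encoded on its own into a fixed-size vector, and pairs are compared afterwards with cosine similarity. The names `encode`, `cosine`, and `search` are assumptions for this example.

```python
import math
import re
import zlib

def encode(text, dim=64):
    """Toy encoder: hash each token into one of `dim` buckets, then L2-normalize.
    A real bi-encoder would run a transformer here instead."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def cosine(a, b):
    """Cosine similarity; the inputs are already unit length, so a dot product suffices."""
    return sum(x * y for x, y in zip(a, b))

def search(query, documents, k=1):
    """Encode documents and query independently, then rank documents by similarity."""
    doc_vectors = [encode(d) for d in documents]  # could be computed once and cached
    q = encode(query)
    ranked = sorted(zip(documents, doc_vectors), key=lambda p: cosine(q, p[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]
```

Because the document vectors never depend on the query, they can be stored in a vector database ahead of time, which is what makes this design practical for large-scale retrieval.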

General-purpose models like all-mpnet-base-v2 often fall short of the specificity that cybersecurity text requires. To address this limitation, we developed and trained CyRet, a domain-specific model for cybersecurity document retrieval. CyRet was fine-tuned on a dataset of more than 400K cybersecurity-related samples that includes both positive and negative examples. This dataset has played a pivotal role in CyRet's improved performance over its general-purpose counterparts in identifying and retrieving documents relevant to cybersecurity scenarios. The figure below compares CyRet and all-mpnet-base-v2 in terms of accuracy and F1-score on a natural language inference task, evaluated on a test set of 20K cybersecurity-related pairs.



SecureGPT: A Domain-Specific Text Generation Model for Cybersecurity

SecureGPT is an innovative AI tool trained on cybersecurity data. It aids in various cybersecurity tasks by generating, analyzing, and interpreting text. Its applications include analyzing threat intelligence, automating report writing, detecting phishing attacks, creating security awareness content, reviewing code for vulnerabilities, drafting policies, aiding research, and ensuring legal compliance. SecureGPT's versatile capabilities promise to enhance cybersecurity efforts across the board.

Git Repo (TBA)






Git Repo for CyRet (TBA)

CyberARM: Security Controls Grid for Optimal Cyber Defense Planning

An innovative model with optimization techniques for selecting the critical security controls (CSC) needed to achieve optimal risk mitigation, while accounting for factors such as acceptable residual risk, budget limitations, and resiliency requirements.
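As a rough illustration of the underlying selection problem (not CyberARM's actual model), the sketch below picks the set of controls with the highest total risk reduction that still fits a budget. The control names, costs, and risk-reduction values are invented, risk reduction is assumed to be additive, and residual-risk and resiliency constraints are left out for brevity.

```python
from itertools import combinations

# Hypothetical controls: (name, cost, risk_reduction). All numbers are invented.
CONTROLS = [
    ("patch management", 4, 30),
    ("network segmentation", 5, 35),
    ("multi-factor authentication", 3, 25),
    ("security awareness training", 2, 10),
]

def select_controls(controls, budget):
    """Exhaustively search all subsets and keep the one with the highest
    total risk reduction whose total cost does not exceed the budget."""
    best, best_gain = (), 0
    for r in range(1, len(controls) + 1):
        for subset in combinations(controls, r):
            cost = sum(c[1] for c in subset)
            gain = sum(c[2] for c in subset)
            if cost <= budget and gain > best_gain:
                best, best_gain = subset, gain
    return [c[0] for c in best], best_gain

chosen, reduction = select_controls(CONTROLS, budget=8)
```

Exhaustive search is only workable for a handful of controls; with realistic control catalogs, a dynamic-programming or integer-programming formulation would be the more likely choice.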

Predicting Attack Actions from Vulnerabilities Using Cybersecurity-specific Contextual Language Model

We have developed a semi-supervised transfer learning framework on top of SecureBERT that uses semantic role labeling to generate, collect, and annotate threat-related textual data and to classify cybersecurity vulnerabilities into tactics, techniques, and procedures (TTPs).

paper

SecureBERT

E. Aghaei, X. Niu, W. Shadid, B. Chu, E. Al-Shaer

We have released the first transformer-based, domain-specific language model for representing cybersecurity text, trained and tested on a large body of in-domain textual data.

> Github

> Huggingface

> Paper

> YouTube


CVE to CWE: Hierarchical Classification

I developed a hierarchical design on top of SecureBERT to classify CVEs into CWEs at different levels of the CWE tree structure.

paper
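The idea of classifying level by level can be sketched as a top-down walk over the tree. This is only an illustration under stated assumptions: the tree is a simplified, hand-made slice of the CWE hierarchy with intermediate nodes omitted, and the keyword sets stand in for the per-node classifiers that a model such as SecureBERT would provide.

```python
import re

# Simplified, hand-made slice of a CWE-like tree (intermediate nodes omitted).
TREE = {
    "root": ["CWE-74", "CWE-119"],
    "CWE-74": ["CWE-79", "CWE-89"],
    "CWE-119": ["CWE-125", "CWE-787"],
}
# Invented keyword sets; each one stands in for a trained classifier at that node.
KEYWORDS = {
    "CWE-74": {"injection", "inject", "scripting", "sql", "command"},
    "CWE-119": {"buffer", "memory", "overflow", "bounds"},
    "CWE-79": {"scripting", "xss", "html", "javascript"},
    "CWE-89": {"sql", "query", "database"},
    "CWE-125": {"read"},
    "CWE-787": {"write", "overflow"},
}

def score(node, tokens):
    """Number of node keywords present in the description."""
    return len(KEYWORDS[node] & tokens)

def classify(description):
    """Walk the tree top-down, keeping the best-scoring child at each level."""
    tokens = set(re.findall(r"[a-z]+", description.lower()))
    path, node = [], "root"
    while node in TREE:
        best = max(TREE[node], key=lambda child: score(child, tokens))
        if score(best, tokens) == 0:
            break  # no evidence for a more specific class; stop at this level
        path.append(best)
        node = best
    return path
```

Returning the whole path rather than only the leaf means a description with little evidence can still be assigned to a coarser class instead of being forced into a specific one.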

CVSS Base Metric Prediction

I developed a tool that combines the state-of-the-art SecureBERT language model with the classic TF-IDF approach to predict the values of CVSS base metrics for CVEs on a highly imbalanced dataset.
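For readers unfamiliar with the classic half of this combination, here is a minimal TF-IDF sketch using the common tf * log(N / df) weighting. It covers only the term-weighting step; how these features are combined with SecureBERT is not shown, and the function names and example sentences are assumptions for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    docs = [Counter(tokenize(d)) for d in corpus]
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in doc)
    return [
        {term: (count / sum(doc.values())) * math.log(n / df[term])
         for term, count in doc.items()}
        for doc in docs
    ]

# Invented example descriptions.
corpus = ["remote attacker executes code", "local attacker reads files"]
weights = tfidf(corpus)
```

Terms that appear in every description, such as "attacker" here, receive a weight of zero, while terms like "remote" or "local" that separate the descriptions are weighted up.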

SecureBERT predicts course of defense actions

Research Interests


Machine Learning, Deep Learning, NLP, Language Modeling, LLMs, Text Mining, Information Retrieval, Cyber Threat Intelligence, Cyber Analytics, Adversarial Machine Learning.

Education

2022 - 2024

Postdoctoral Fellow | Carnegie Mellon University, Pittsburgh, PA

Software and Societal Systems (S3D) - School of Computer Science

-----------------------------------------------------------------------------------------------------------------------------------------------

2017 - 2022

Ph.D. | University of North Carolina - Charlotte, NC, USA 

Major: Computing and Information Systems 

Dissertation: “Automated Classification and Mitigation of Cyber Vulnerabilities”

-----------------------------------------------------------------------------------------------------------------------------------------------

2014 - 2017

M.Sc. | The University of Toledo, OH, USA 

Major: Computer Science 

Thesis: “Machine Learning for Host-based Misuse and Anomaly Detection in UNIX Environment”

-----------------------------------------------------------------------------------------------------------------------------------------------

2008 - 2013

B.Sc. | Shahid Beheshti University, Tehran, IRAN 

Major: Computer Engineering