Meet 'DarkBERT': South Korea's Dark Web AI could combat cybercrime

A team of researchers from South Korea has developed a new LLM called "DarkBERT," which has been trained exclusively on the "Dark Web."
Christopher McFadden
The new AI was trained by scouring the "Dark Web."


A team of South Korean researchers has taken the unprecedented step of developing and training artificial intelligence (AI) on the so-called "Dark Web." The Dark Web-trained AI, called DarkBERT, was unleashed to trawl and index what it could find to help shed light on ways to combat cybercrime.

The "Dark Web" is a section of the internet that remains hidden and cannot be accessed through standard web browsers. This part of the web is notorious for its anonymous websites and marketplaces that facilitate illegal activities, such as drug and weapon trading and stolen data sales, and it serves as a haven for cybercriminals.

The "Dark Web" employs complex systems that mask the IP addresses of its users, making it difficult to trace the websites they have visited. Accessing this part of the web requires specialized software, the most popular of which is Tor (The Onion Router). Tor is used by approximately 2.5 million individuals every day.

With the rise of natural language processing programs like ChatGPT, such technology is increasingly being used to commit new kinds of cybercrime. By developing an AI that can fight fire with fire, the researchers wanted to discover how large language models (LLMs) could help.

To this end, the researchers have published their findings in a paper titled "DarkBERT: A Language Model for the Dark Side of the Internet." They connected their model to the Tor network and collected raw data to create a database. However, the paper has yet to be peer-reviewed.

According to the team, their LLM was far better at making sense of the dark web than other models that were trained to complete similar tasks, including RoBERTa, which Facebook researchers designed back in 2019 to "predict intentionally hidden sections of text within otherwise unannotated language examples," according to an official description.
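To give a rough sense of what "predicting intentionally hidden sections of text" means in practice, here is a minimal Python sketch of the masking step. It is a simplified illustration, not the researchers' code: the function names `mask_tokens` and `restore` are hypothetical, the text is split on whitespace rather than with a real tokenizer, and the 15 percent masking rate and `<mask>` placeholder are assumptions borrowed from how BERT-style models are commonly described.

```python
import random

# Placeholder for a hidden token; assumed here for illustration only.
MASK_TOKEN = "<mask>"


def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each hidden
    position to the original token a model would be trained to predict.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for i, tok in enumerate(tokens):
        # Each token is hidden independently with probability mask_prob.
        if rng.random() < mask_prob:
            masked[i] = MASK_TOKEN
            targets[i] = tok
    return masked, targets


def restore(masked, targets):
    """Fill the hidden positions back in, as a perfect predictor would."""
    return [targets.get(i, tok) for i, tok in enumerate(masked)]


if __name__ == "__main__":
    sentence = "the quick brown fox jumps over the lazy dog".split()
    masked, targets = mask_tokens(sentence, mask_prob=0.3, seed=0)
    print(" ".join(masked))  # the sentence with some words hidden
    print(targets)           # positions and words the model must guess
```

During training, a model only sees the masked sentence and is scored on how well it guesses the hidden words. Because no human labeling is needed, the same recipe can be applied to any large body of raw text.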

"Our evaluation results show that DarkBERT-based classification model outperforms that of known pre-trained language models," the researchers wrote in their paper.

According to the team, DarkBERT has the potential to be employed for diverse cybersecurity purposes, including identifying websites that vend ransomware or release confidential data. Additionally, it can scour through the numerous dark web forums updated daily and keep an eye on any illegal information exchange.

You can view the study for yourself at

Study abstract:

"Recent research has suggested that there are clear differences in the language used in the Dark Web compared to that of the Surface Web. As studies on the Dark Web commonly require textual analysis of the domain, language models specific to the Dark Web may provide valuable insights to researchers. In this work, we introduce DarkBERT, a language model trained on Dark Web data. We describe the steps taken to filter and compile the text data used to train DarkBERT to combat the extreme lexical and structural diversity of the dark Web that may be detrimental to building a proper representation of the domain. We evaluate DarkBERT and its vanilla counterpart along with other widely used language models to validate the benefits that a Dark Web domain-specific model offers in various use cases. Our evaluations show that DarkBERT outperforms current language models and may serve as a valuable resource for future research on the Dark Web."
