S2W expands data analysis technology accumulated on the dark web to implement industrial AI

– Applying capabilities accumulated on the dark web to industry… Launched industrial AI solution ‘SAIP’ last year

– Implement domain-specific AI from security to manufacturing and finance through multi-domain cross-analysis

– Developing technology to support decision-making through agent AI beyond simple Q&A

S2W, which started as a dark web specialist company, expanded its business last year by launching 'SAIP (S2W AI Platform)', an industrial generative AI platform.

The dark web is an encrypted network that cannot be accessed with a general internet browser; a special browser is required. Cybercrimes such as drug trafficking, hacking, and ransomware occur there frequently. As a result, dark web data is far more complex than general web data. The network structure is very unstable, and obscure language is used intentionally to avoid tracking. The noise ratio of the data is very high, hidden relationships between data are difficult to discover, and information changes and disappears very quickly.

Working in this area, which most companies find difficult to access, S2W developed the specialized dark web solutions 'Jarvis' and 'Quasar'. To collect the vast and complex unstructured data of the dark web, uncover the meaning in it, and trace relationships, S2W developed 'multi-domain cross-analysis technology'.

Multi-domain cross-analysis integrates and analyzes data from different areas (domains) to derive patterns or insights that are difficult to discover within a single domain. In cybersecurity, for example, network traffic data, user behavior data, and system log data can be cross-analyzed to identify advanced threats that are hard to detect from a single data source. In finance, transaction data, customer behavior data, and external market data can be combined for more accurate risk assessment or fraud detection. The technology uses AI and big data analysis techniques to find correlations between domains, and yields more accurate and comprehensive results than analyzing each domain separately.
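The cybersecurity example can be illustrated with a minimal sketch. It is not S2W's implementation: the host names, alert texts, and the `cross_analyze` function are hypothetical, and real systems would use learned models rather than a simple count. The idea shown is only that a host flagged weakly in several domains at once stands out in a way no single source reveals.

```python
from collections import defaultdict

# Hypothetical weak signals from three separate data sources, keyed by host.
network_alerts = {"host-a": "unusual outbound traffic", "host-b": "port scan"}
behavior_alerts = {"host-a": "login at odd hours", "host-c": "bulk download"}
log_alerts = {"host-a": "privilege escalation", "host-b": "failed sudo"}

sources = {
    "network": network_alerts,
    "behavior": behavior_alerts,
    "logs": log_alerts,
}


def cross_analyze(sources, min_domains=2):
    """Return hosts flagged in at least `min_domains` different sources."""
    evidence = defaultdict(dict)
    for domain, alerts in sources.items():
        for host, reason in alerts.items():
            evidence[host][domain] = reason
    return {
        host: found
        for host, found in evidence.items()
        if len(found) >= min_domains
    }


# Only host-a appears in all three domains.
suspicious = cross_analyze(sources, min_domains=3)
print(sorted(suspicious))
```

Lowering `min_domains` to 2 would also surface host-b, which appears in the network and log data but not in the behavior data.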

Applying the technical capabilities accumulated in the dark web to general industries, launching SAIP (S2W AI Platform)

'SAIP (S2W AI Platform)' is the solution developed so that general companies can use the technology accumulated on the dark web. Launched in February 2024, SAIP is an industrial generative AI platform that lets employees query all of a company's data conversationally, as with ChatGPT, while greatly strengthening security and accuracy.

SAIP's most prominent feature is its security system, called 'Security Guardrail'. General-purpose tools such as ChatGPT sometimes provide incorrect information or risk leaking sensitive information; SAIP is designed to protect corporate data while returning accurate answers. Access can also be set by job title, so that only authorized people can view sensitive information such as personnel data. And unlike AI companies that provide general-purpose solutions, S2W studies each company's domain in depth and provides a customized solution.
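Access control by job title can be pictured with a short sketch. The article does not describe how Security Guardrail works internally, so everything here is an assumption for illustration: the `Document` class, the `ALLOWED_CATEGORIES` table, the job titles, and the document titles are all hypothetical. The sketch simply filters documents before any answer is generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    title: str
    category: str


# Hypothetical mapping from job title to the categories it may read.
ALLOWED_CATEGORIES = {
    "hr_manager": {"general", "personnel"},
    "engineer": {"general", "facility"},
}


def visible_documents(job_title, documents):
    """Keep only documents whose category the job title is allowed to read."""
    allowed = ALLOWED_CATEGORIES.get(job_title, set())
    return [doc for doc in documents if doc.category in allowed]


corpus = [
    Document("Holiday schedule", "general"),
    Document("Salary review", "personnel"),
    Document("Pump maintenance guide", "facility"),
]

# An engineer sees general and facility documents, but no personnel data.
engineer_view = [doc.title for doc in visible_documents("engineer", corpus)]
print(engineer_view)
```

An unknown job title maps to an empty set, so the default in this sketch is to show nothing rather than everything.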

S2W's journey, from starting in the notoriously difficult environment of the dark web to growing into an AI company that handles data analysis across industries, is a prime example of innovation in which a specialized technology expands into a general-purpose one.

We met CTO Park Geun-tae and AI Director Jeong Jin-woo at the S2W headquarters in Pangyo, Seongnam-si, Gyeonggi-do. CTO Park Geun-tae completed his doctoral studies at KAIST and worked on distributed system development at the Electronics and Telecommunications Research Institute (ETRI) and OS development at TmaxSoft. After that, he worked on big data and AI at SK Telecom for 12 years. He joined S2W in 2022 because he wanted to implement big data and AI research at a startup rather than a large corporation.

AI Director Jeong Jin-woo majored in natural language processing (NLP) during his master's and doctoral studies at KAIST, and researched mobile machine translation and information extraction from materials science papers at Samsung Advanced Institute of Technology. With a particular interest in data processing, he joined S2W in 2020, drawn by the problem of processing dark web information with NLP.

We asked CTO Park Geun-tae, who is in charge of technology at S2W, and AI Director Jeong Jin-woo about the company's three core technologies: technology for collecting the desired data from massive volumes of data, language models customized for each domain, and a knowledge graph that connects relationships. We also asked why 'SAIP (S2W AI Platform)', the industrial generative AI platform built on these technologies, is attracting attention. (The content is organized in a Q&A format to convey the technical details accurately.)

Q. S2W started out in security. What advantages and differentiators does that give you?

CTO Park Geun-tae: S2W can be defined as a 'security + data' company. In the AI era, data of a completely different nature than before is flowing into AI systems. In particular, because sensitive data closely tied to personal information is used in large quantities for AI training and service operation, data security has become more important than ever.

Park Geun-tae, CTO of S2W

To implement AI services successfully, security experts are essential, and a deep understanding of security is required. Security becomes even more important when external and internal data are combined. The financial sector, for example, must process large amounts of sensitive data under strict regulation. It is therefore essential to identify security requirements accurately when developing AI services and to reflect them from the design stage.

S2W's core competitiveness is that it started from a foundation in security.

Q. 'Multi-domain cross-analysis technology' integrates and analyzes data from different areas to derive patterns or insights that are difficult to discover in a single domain. S2W applied it to its specialized dark web solutions, 'Jarvis' and 'Quasar', and built SAIP on the know-how accumulated there. Please explain what multi-domain cross-analysis technology is.

CTO Park Geun-tae: Every company or organization faces fundamental problems. To grow the company or avoid risk, its best experts gather, collect all available information, present their own opinions, and discuss until the problem is solved. Multi-domain cross-analysis is the technology that enables AI to carry out this process.

AI Director Jeong Jin-woo: Specifically, it is a combination of three technologies: data collection technology, domain-specific language model technology, and ontology-based knowledge graph technology.

Jeong Jin-woo, AI Director of S2W

Q. You said multi-domain cross-analysis consists of three technologies. Among them, data collection seems to be the most important. How do S2W solutions collect data?

CTO Park Geun-tae: S2W handles external data as well as internal data, and the data types are diverse. We collect all kinds of data, including Excel files, photos, web data, and legal data from government and public agencies. S2W can collect data reliably and effectively even in environments such as the dark web or the battlefield, where networks change drastically or information appears and disappears quickly.

AI Director Jeong Jin-woo: The most important thing is what we call 'needle in a haystack' technology. Measured against all the data on the internet, finding the data of interest is like finding a needle in a haystack. Collecting everything is extremely cost-inefficient, so we use language models from the collection stage. For example, more than half of dark web data is pornography. If a model judges a page to be pornography with 99.9% probability, we discard it immediately. If the probability is only 50%, we store the page for the time being and reclassify it with a more sophisticated language model.

Currently, S2W identifies about 10 million web pages per month. Since we cannot store all of them on our servers, we apply language models step by step to select only the necessary data.
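The step-by-step filtering described here can be sketched as a two-stage cascade. This is a minimal illustration under stated assumptions, not S2W's pipeline: `cheap_score` and `careful_score` are hypothetical keyword stand-ins for a small fast model and a larger slow one, and the `keep_below` threshold and the second-stage cutoff of 0.5 are invented. Only the 99.9% discard threshold and the idea of holding uncertain pages for a second pass come from the interview.

```python
def cheap_score(page):
    """Stand-in for a small, fast classifier: probability the page is unwanted."""
    text = page.lower()
    hits = sum(word in text for word in ("xxx", "adult"))
    return {0: 0.05, 1: 0.5}.get(hits, 0.999)


def careful_score(page):
    """Stand-in for a larger, slower model, run only on uncertain pages."""
    return 0.1 if "for sale" in page.lower() else 0.95


def triage(pages, discard_at=0.999, keep_below=0.2):
    """Drop confident rejects at once; re-check uncertain pages in a second pass."""
    kept, held = [], []
    for page in pages:
        p = cheap_score(page)
        if p >= discard_at:
            continue  # confidently unwanted: discard without storing
        if p < keep_below:
            kept.append(page)  # confidently relevant: keep
        else:
            held.append(page)  # uncertain: store for the slower model
    for page in held:
        if careful_score(page) < 0.5:
            kept.append(page)
    return kept


pages = [
    "Leaked database for sale, contact on Telegram",
    "XXX adult videos",
    "Adult site credentials for sale",
    "Adult chat room",
]
print(triage(pages))
```

The design point is cost: the expensive model only ever sees the pages the cheap model could not decide on, which in this toy input is two of the four.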

Q. You said that domain-specific language models are necessary to collect the necessary data. So how can you create and apply domain-specific language models so quickly?

AI Director Jeong Jin-woo: When a language model is applied to a different field, its performance deteriorates. You can't use a pornography detection model in the financial field. That's why domain specialization is necessary, and it matters most when the model is small. Large language models like ChatGPT perform well on their own, but a large model cannot be run over very large volumes of data. To classify in real time you need a small model, and a small model has to be highly specialized for its domain.

CTO Park Geun-tae: When we receive customer data, we can determine very quickly what data must be fed into a language model of a given scale to make it work. That lets us build an optimally sized language model faster than competitors and carry out projects very quickly.

Q. You need to know the domain to create a domain-specific model. How do you come to understand the domain?

AI Director Jeong Jin-woo: To develop a language model for the baseball domain, you need to know baseball. That's why S2W does a lot of consulting with clients in the early stages. Many companies overlook this and simply say, "Our model is all you need." I think that's impossible. S2W has been able to achieve results because of our extensive experience on the dark web. We work with domain experts and extract categories and features very quickly and accurately.

CTO Park Geun-tae: We completed a very large-scale project for Company H in just five months, about one month of which went to consulting at the start. The role of domain experts is very important at that stage.

Q. Lastly, please explain knowledge graph technology and how to utilize it.

AI Director Jeong Jin-woo: A knowledge graph is made up of nodes (points) and edges (lines): picture a set of circles with lines drawn between them to show relationships. Expressing the collected data as a graph is the final stage of structuring. To understand relationships properly, you have to build a graph.

S2W started using knowledge graph technology to track crimes on the dark web, because relationship information is the key to crime tracking. For example, suppose a hacker has stolen corporate information and posts on the dark web, "OO data for sale for this price. Let's chat on Telegram." S2W extracts the Telegram ID from the post as a key feature, searches for other dark web sites or platforms where the same Telegram ID was used, and connects them. If the criminal also left a Bitcoin address, that becomes another node connected to the Telegram ID, and we follow the transfer history from that Bitcoin address to other addresses. This is possible because Bitcoin transaction information is public by the nature of the blockchain. Connected as a graph in this way, the Telegram ID of the original poster can be linked to the Bitcoin address, and you can even determine which exchange the criminal finally cashed out on. In criminal investigation, where relationship information matters, knowledge graph technology systematically tracks the connections between identifiers.
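The tracing example maps naturally onto a small graph search. The sketch below is an illustration only: the `KnowledgeGraph` class and every identifier in it (forum names, the Telegram handle, the Bitcoin addresses, the exchange) are made up, and a plain undirected graph with breadth-first search stands in for the ontology-based knowledge graph mentioned in the interview.

```python
from collections import defaultdict, deque


class KnowledgeGraph:
    """Undirected graph whose nodes are identifiers such as posts or addresses."""

    def __init__(self):
        self.edges = defaultdict(set)

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def path(self, start, goal):
        """Breadth-first search; returns the shortest chain of nodes, or None."""
        queue = deque([[start]])
        seen = {start}
        while queue:
            chain = queue.popleft()
            node = chain[-1]
            if node == goal:
                return chain
            for nxt in sorted(self.edges[node]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(chain + [nxt])
        return None


graph = KnowledgeGraph()
# The same (hypothetical) Telegram ID appears in posts on two forums.
graph.link("post:forum-a", "telegram:@seller")
graph.link("post:forum-b", "telegram:@seller")
# The seller also left a Bitcoin address; public transfers lead onward.
graph.link("telegram:@seller", "btc:addr-1")
graph.link("btc:addr-1", "btc:addr-2")
graph.link("btc:addr-2", "exchange:example")

trail = graph.path("post:forum-a", "exchange:example")
print(" -> ".join(trail))
```

Each new identifier extracted from a post becomes one more `link` call, and the search then shows how the first post connects to the exchange where the funds ended up.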

Q. We have heard about multi-domain cross-analysis technologies, namely data collection technologies, domain-specific language model technologies, and ontology-based knowledge graph technologies. So how were these technologies applied in SAIP?

AI Director Jeong Jin-woo: S2W has built SAIP solutions for Company H and Company L. Both provide answers to users' questions, but the two deployments have different characteristics.

Company H integrated 130,000 internal documents that had been scattered across multiple business portals and built a chatbot that, when users ask questions by voice, quickly searches the relevant data and returns the answer. The core values of this project are data integration and improved accessibility. Previously, finding safety data meant going to the safety portal, and finding facility data meant going to a separate facility portal. Safety managers unfamiliar with the facility field found the facility portal difficult to use. With the new integrated system, they can ask, “Please tell me the safety guidelines related to this facility,” and immediately receive the relevant information without accessing the facility portal directly.

AI Director Jeong Jin-woo: Company L's trend analysis solution analyzes market changes using purchase data from Company L's affiliates. For example, when alcohol sales surge, it doesn't simply report the numbers; it detects related phenomena in external data such as news articles and social media and produces a comprehensive analysis report. Specifically, if purchases by solo drinkers have risen sharply, it collects, refines, and analyzes external data to reach a conclusion such as "The reasons for the increase in solo drinkers are as follows" and writes the report. S2W has automated complex trend analysis work that employees previously performed manually, greatly improving efficiency and accuracy.

Q. You started out on the dark web and are now expanding into industries. What industries do you plan to expand into in the future?

AI Director Jeong Jin-woo: Palantir also started out working with the CIA and in the security sector, but has since expanded across the corporate sector. I think expansion is easier when the methodology is systematically established. We are currently reviewing the financial and defense sectors as our top priorities.

CTO Park Geun-tae: However, rather than expanding broadly on the assets we have accumulated so far, we plan to prioritize going deeper vertically. Customization is essential for every company. The steel industry and the distribution industry have different characteristics, and even steel companies differ from one another, so we believe a customized approach is more effective. We therefore plan to focus our business on areas where we have proven results.

Q. It seems like technology needs to continue to advance. What specific direction do you plan to develop it in?

CTO Park Geun-tae: We are working to evolve from text-centric to multimodal AI (image, video, voice). In the security field, the level of AI application is higher than in other fields because we have an in-house group of experts. Since they can write advanced security reports, we have the advantage of very high-quality data for AI training.

AI Director Jeong Jin-woo: Agents must be able to provide conclusions that actually help with decision-making. The goal of the S2W agent is AI that goes beyond simple question answering and produces conclusions at the level of expert-written reports.

Expanding to AI solutions that help every business make decisions

In this interview we were able to confirm S2W's distinctive approach. The key is the know-how to understand the characteristics of each domain in depth and quickly develop a small language model that fits it. Most importantly, S2W values the process of acquiring domain knowledge through thorough consulting with customers.

S2W, which started out as a dark web analysis company in 2018, has grown into an AI company that analyzes big data from various industries, thanks to this meticulous technical approach and continuous domain learning. The technical assets that S2W has accumulated since starting in the security field are actually serving as a greater competitive edge in the AI era. The true core hidden behind the somewhat complicated name of multi-domain cross-analysis is ultimately the persistent effort to implement 'AI that thinks like a human.'