Your employees are already using AI. Do you have control over it?

What happens to enterprise data in language models and how to keep it under control.

Andrzej Kossakowski

6 min read

What really happens to information entered into LLMs

Artificial intelligence - in particular large language models (LLMs) - has become part of many people's day-to-day work. These models are used, among other things, to create content and analyze information. In this article, we will not discuss futuristic AI development scenarios or evaluate specific tools. We will focus on one issue: what actually happens to the data a user enters into language models, especially when it involves sensitive information.

What a language model actually is

Language models are statistical systems trained on very large text datasets. Their task is to generate subsequent text fragments in a way that best matches the provided context. An LLM does not understand text in the human sense, but it can generate responses that are highly accurate in context.

A model response is created as the result of statistical computation - the system selects the most probable next word based on patterns learned during training. It has no awareness or intent. It does not "know" whether information is true or confidential - it operates solely on those patterns, without understanding their meaning.
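To make that mechanism concrete, here is a deliberately tiny sketch in Python. The probability table is invented for illustration; a real model derives such a distribution from billions of learned parameters, not a lookup table.

```python
# Toy illustration of next-token selection. The "model" here is a
# hard-coded probability table; a real LLM computes such a distribution
# from billions of learned parameters for every position in the text.
next_token_probs = {
    "quarterly": {"report": 0.62, "budget": 0.21, "meeting": 0.17},
}

def pick_next_token(context: str) -> str:
    probs = next_token_probs[context]
    # Greedy decoding: take the single most probable continuation.
    return max(probs, key=probs.get)

print(pick_next_token("quarterly"))  # -> "report"
```

Notice that nothing in this procedure checks whether the chosen word is true or confidential - it is pure pattern matching over probabilities.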

Where models get their knowledge

The learning process for language models takes place at the training stage, before they are made available to users. During this phase, very large datasets are used, from which the model learns linguistic and contextual relationships. The model does not update its parameters in real time during a single conversation. This means that data entered by the user does not "teach" the model immediately and is not remembered as specific records.

At the same time, for some public services - especially free ones - user-entered data may be used by the provider to further improve models or train subsequent versions, in accordance with the service terms of use. This is usually done in an aggregated and processed manner, not by literally adding individual conversations to the model. This means that a single data point - for example, information about a specific company’s budget - should not affect the model’s future responses. However, there are no guarantees that entered information will not be disclosed or reproduced in some form in responses generated for other users. The user does not have full control over whether and how their data is used in provider-side development processes.

Research published in 2025 showed that some publicly available language models can reproduce entire books from their training data - almost word for word [1]. In the same year, Anthropic (creator of the Claude chatbot) reached a $1.5 billion settlement related to the use of pirated books to train its models [2]. If a model can encode the full content of a book in its parameters, then data entered by your company’s employees into chatbots - proposals, contracts, customer data, internal analyses - may be similarly retained in the model and potentially disclosed in responses generated for other users. Therefore, before deploying any AI tool in an organization, it is worth ensuring that the selected service variant guarantees that data is not used for further model training, and that the tool itself is properly configured for privacy.
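The idea behind such extraction studies can be sketched simply: feed the model the beginning of a known text and measure how much of the original it reproduces verbatim. The sketch below is illustrative only; `generate` is a placeholder for whatever completion function is being probed, not a specific API.

```python
# Rough sketch of a memorization probe in the spirit of extraction
# studies such as [1]. `generate` is a placeholder for any text
# completion function (a local model, an API wrapper) - it is not a
# specific library call.
def memorization_score(original: str, generate, prefix_words: int = 50) -> float:
    words = original.split()
    prompt = " ".join(words[:prefix_words])
    expected = words[prefix_words:]
    produced = generate(prompt).split()[:len(expected)]
    # Fraction of positions where the continuation matches the source verbatim.
    matches = sum(1 for a, b in zip(produced, expected) if a == b)
    return matches / max(len(expected), 1)
```

A score close to 1.0 would suggest the model has effectively memorized the passage rather than merely learned its style.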

How data is processed during a conversation

Data entered by the user is processed in order to generate a response. The query content becomes part of the current conversation context, and the model uses it to calculate the most probable response. This all takes place within the service provider’s infrastructure. Language models are not designed as a secure data repository. They are text-processing tools, not systems that guarantee controlled, long-term storage of sensitive information from the user’s perspective.

The user has limited visibility into how the provider stores and secures data on its side. In practice, this means that entered information leaves the organization’s environment and goes to external infrastructure.
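In technical terms, a chat conversation is usually a series of HTTPS requests to the provider's API. The sketch below uses a hypothetical endpoint and payload, loosely modeled on common chat APIs; the exact schema varies by provider, but the pattern is the same: the conversation is serialized and sent to infrastructure outside the organization.

```python
import json
import urllib.request

# Hypothetical endpoint and payload, loosely modeled on common chat
# APIs - the URL and schema are assumptions, not a specific provider's API.
history = [
    {"role": "user", "content": "Summarize our 2025 budget draft ..."},
    {"role": "assistant", "content": "The draft allocates ..."},
    {"role": "user", "content": "Now compare it with last year."},
]

req = urllib.request.Request(
    "https://api.example-llm-provider.com/v1/chat",  # external infrastructure
    data=json.dumps({"model": "example-model", "messages": history}).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer <API_KEY>"},
)
# urllib.request.urlopen(req) would transmit the request. Chat APIs are
# typically stateless: the *entire* history, including every sensitive
# fragment entered earlier, is resent with each new turn and may be
# logged on the provider side.
```

The practical consequence: one sensitive sentence pasted early in a conversation keeps leaving the organization with every subsequent question in that thread.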

Hallucinations - what they really are

In the context of LLMs, hallucinations are situations in which the model generates content that sounds coherent and credible, even though it does not have sufficient data or context for the response to be correct. Hallucinations are not a technical error or system failure and do not result from the model’s "intent." They are a natural effect of a mechanism that always attempts to generate a response.
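This "always answers" property is visible even in a toy sampling step. In the sketch below, every raw score is low, yet softmax still normalizes the scores into a valid probability distribution and sampling still returns a token - nothing at this level of the mechanism can abstain.

```python
import math
import random

def softmax(logits):
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Even when every raw score is low - i.e., the model has no
# well-supported continuation - softmax still yields a valid
# probability distribution, and sampling still returns *some* token.
weak_logits = [-9.1, -9.3, -9.2]      # no clearly good option
probs = softmax(weak_logits)          # still sums to 1.0
token = random.choices(["A", "B", "C"], weights=probs)[0]
print(probs, token)
```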

In the context of data security, the issue is that the model may generate content that goes beyond the user’s original intent. In certain situations, this can increase the risk of information disclosure in a broader context than the person asking the question intended.

Can AI disclose user data?

The model does not have access to a specific user’s external databases and does not "remember" their data in the traditional sense. However, two risk levels can be distinguished here: generative risk and infrastructure risk. Generative risk concerns situations in which the model generates content that does not match the user’s expectations. Infrastructure risk concerns the fact that data entered into the system is processed by the provider’s infrastructure.

Any IT infrastructure can become the subject of a security incident: misconfiguration, unauthorized access, technical vulnerability, or data leakage. Public language models are no exception. In practice, this means that sensitive data entered into a public LLM may be recorded in logs, stored according to the provider’s policy, and in extreme cases may become part of a security incident on the provider side. This risk is not specific to AI alone - it applies to every cloud service. In the case of language models, however, users often forget that they are using external infrastructure.

An additional issue is how AI tools enter organizations. This often happens without the IT department's knowledge - employees independently create accounts in services such as ChatGPT or Gemini, often logging in through a corporate Microsoft 365 or Google Workspace account. In this way, an external tool becomes linked to the corporate identity, without any control over what data flows through it. The lack of multi-factor authentication (MFA) on such an account further increases the risk - if the account is compromised, an attacker gains access to the chatbot's entire conversation history, including all data previously entered by the employee.
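Organizations can at least detect this pattern. The sketch below assumes a CSV export of sign-in events from an identity provider; the column and application names are assumptions to adapt to your environment, not a specific vendor's schema.

```python
import csv

# Hedged sketch: scan a sign-in log exported from an identity provider
# (Microsoft 365, Google Workspace, ...) for authentications to consumer
# AI services. Column names and app names are assumptions - adapt them
# to your provider's actual export format.
AI_APPS = {"ChatGPT", "Gemini", "Claude", "Copilot"}

def find_shadow_ai(signin_csv: str):
    with open(signin_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row.get("app_name") in AI_APPS:
                yield row.get("timestamp"), row.get("user"), row.get("app_name")

for ts, user, app in find_shadow_ai("signins.csv"):
    print(f"{ts}: {user} signed in to {app}")
```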

A practical example: when intuition fails, not competence

Industry media and official statements have described cases in which people in high-level positions in the United States public administration used publicly available language models to process content containing sensitive information. In 2025, the acting head of CISA (the U.S. cybersecurity agency) uploaded contract documents marked "for official use only" to the public version of ChatGPT, which triggered automatic security alerts on the federal network [3]. The reason was the mistaken assumption that the tool worked similarly to a local text editor. As a result, data was sent to AI-based systems that, under applicable security rules, should not have left the controlled environment. These situations led to internal reviews and the issuance of restrictions on using public language models in government institutions.

This example shows that the source of risk is a false sense of control created by an intuitive, conversational interface.

What this means in practice

Language models are useful tools, but they require informed use. In practice, this means that sensitive and confidential data should not be entered into public models. The lack of "model memory" does not mean data is not processed in provider infrastructure. AI is an external cloud service, not a local tool. AI in itself is not the threat. Risk appears when we expect from language models a level of data control that they technically do not provide.
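One practical mitigation is a pre-filter that masks obvious identifiers before text leaves the organization. The sketch below is a minimal illustration, not a substitute for a real data loss prevention (DLP) solution; the regular expressions catch only simple patterns.

```python
import re

# Minimal pre-filter sketch that masks obvious identifiers before text
# is sent to an external model. These regexes catch only simple
# patterns; a production deployment would rely on a proper DLP tool.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\b\d[\d\s-]{7,}\d\b"),
    "IBAN":  re.compile(r"\b[A-Z]{2}\d{2}(?:\s?\d{4}){4,7}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Contact Jan at jan.kowalski@firma.pl or +48 601 234 567."))
# -> Contact Jan at [EMAIL] or [PHONE].
```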


----

Sources:

[1] Cooper, A. F. et al., Extracting memorized pieces of (copyrighted) books from open-weight language models, arXiv, 2025 - https://arxiv.org/abs/2505.12546

[2] ITHardware.pl, The largest settlement in AI history. 1.5 billion for illegal books, September 2025 - https://ithardware.pl/aktualnosci/miliardowa_ugoda_ai-44876.html

[3] Politico / CSO Online, CISA chief uploaded sensitive government files to public ChatGPT, January 2026 - https://www.csoonline.com/article/4124320/cisa-chief-uploaded-sensitive-government-files-to-public-chatgpt.html
