The problems with Big Tech AI data collection: privacy concerns and how to protect your data

It’s been three years since OpenAI shook the digital world with the launch of ChatGPT, and some 200 million people now log on to ChatGPT every week. The numbers climb even higher once you include other generative AI solutions such as Gemini or Claude: large language models from Big Tech companies that can generate text or images, translate, or write code.

But while AI assistants are everywhere, promising big productivity gains, it’s getting hard to tell what has been discussed more: the opportunities AI will create, or the privacy issues that come with it. Think of data exfiltration, unchecked surveillance, and biased profiling.

Let’s look at why and how Big Tech is collecting your data and what best practices you can implement to protect your data when using AI.

Big Tech and AI: The data grab

Any large language model is only as good as the data it’s trained on. That’s why Big Tech companies such as Google, Microsoft, and Amazon actively use your conversations with their chatbots to improve their models.

A recent Stanford study compared the privacy policies of six Big Tech AI platforms. The researchers found that:

  • All of them use your chat input for training purposes by default (except for Amazon’s Nova AI agent, whose policy is unclear on this point).
  • In half of the cases, these conversations are saved on their servers indefinitely, with no limit on how long your data is kept.
  • The opt-out methods can be hard to find. In two out of six cases, the mechanism to opt out of chat training was unclear or unspecified.
  • What’s more, not all companies clearly state that they de-identify your personal information before using it for their training purposes.
  • And some of the Big Tech platforms allow humans to review your chat transcripts for their model training goals.

These data-collecting mechanisms provide AI companies with a lot of control.

Cloud centralization makes this control easy to build and hard to see. Running independent AI models is expensive, so most AI platforms are hosted in large, centralized cloud environments: remote servers process your inputs and return an answer. While this is efficient, it also concentrates control. The data leaves your local environment, shifting governance from the user to the provider.
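To make this concrete, here is a minimal Python sketch of what such a round trip looks like, using OpenAI’s public chat completions API as an example (other providers work similarly). The full prompt, including anything you paste into it, leaves your machine and is processed on the provider’s servers:

```python
# Minimal sketch: a typical hosted-chatbot request. The full prompt,
# including anything pasted into it, is sent to the provider's remote servers.
import os
import requests

prompt = "Summarize this confidential meeting transcript: ..."  # leaves your machine

response = requests.post(
    "https://api.openai.com/v1/chat/completions",            # remote, provider-controlled endpoint
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": prompt}],    # your data, now on their side
    },
    timeout=30,
)
print(response.json()["choices"][0]["message"]["content"])
```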

In other words, Big Tech is gaining a lot of power by collecting data while also keeping its data processing out of our sight.

Why “free” AI tools don’t exist

A lot of AI tools have seemingly free or cheap versions, but in reality users pay by giving up their data: data is the currency of these platforms. In other words: if it’s free, you are the product.

 

The AI privacy concerns that come with Big Tech data collection

You might wonder: what’s the problem with these data collection and processing mechanisms, as long as they provide you with a smart and efficient business tool?

First of all, there is the issue of unchecked surveillance and bias.

Data collection has been growing for years, and an AI model is only as good as the data behind it. If that training data is biased, the model reproduces the bias, which can lead to unfair profiling and inequality.

A couple of real-life AI bias examples include:

  • An English tutoring company used AI recruiting software that automatically rejected female applicants over 55 and male applicants over 60, leading to a $365,000 settlement for age discrimination.
  • A Brookings Institution study showed AI credit scoring systems can reproduce racial disparities. That’s because historical credit data correlates strongly with race. When an AI model uses this data for its credit scoring, it can lead to unequal loan approvals and interest rates.
  • In 2016, Microsoft’s chatbot Tay began posting racist and antisemitic messages within 24 hours of launch after learning toxic behavior from user interactions on Twitter.

Then there is the issue of data theft and leakage. AI systems can become attractive targets for hackers because they process large amounts of sensitive information in one place.

They can also leak data by reproducing parts of previous conversations or other sensitive information in their responses. Or the AI system conflates information and produces false claims about real people.

That happened, for example, when a journalist was falsely described as “a 54-year-old child molester” by Microsoft’s AI tool Copilot. Because he had reported on criminal court cases involving child abuse, the AI wrongly cast him as the perpetrator.

That’s just the tip of the iceberg of privacy concerns. Other issues surrounding AI include the collection of data without consent, training on children’s data, and the use of data without permission.

What about the EU’s AI Act to protect us from Big Tech’s abuses of power?

At this point, you might be wondering: Isn’t regulation supposed to address these issues?

The EU’s AI Act was designed for exactly that purpose: to set guardrails around how AI systems are developed and used. Once fully applicable in August 2026, it will introduce rules aimed at improving transparency, accountability, and risk management for AI providers.

But the tech industry, especially the Big Tech players, didn’t just stand by. They argued that parts of the AI Act were too complex and burdensome for innovation, and they have been actively pushing for change.

(Digital industry lobbying in the EU has grown significantly from about €113 million annual spending in 2023 to roughly €151 million today, an increase of more than 30% in just two years.)

And their efforts paid off: the EU introduced a new proposal, the “Digital Omnibus”, which aims to simplify compliance and reduce regulatory overhead under the AI Act in certain areas.

The debate on AI now continues.

Supporters say the changes could make it easier for companies to adopt AI while maintaining safeguards. Critics, however, warn that this new proposal could weaken some of the protections originally intended by the AI Act and other digital regulations.

Best practices: Transparency, limited data collection, and control

For most organizations, the question is no longer whether AI is used, but under which conditions. In most cases today, using AI means sending data to external services.

And as this blog post shows, for organizations working with sensitive data, this can become a problem with legal, technical, and ethical consequences.

Does that mean you shouldn’t use AI at all?

The issue is that if you don’t provide AI tools people can trust, they will likely turn to their favorite consumer services like ChatGPT for work. And that means losing control of your sensitive data.

That’s why it might be better to look for responsible AI solutions that you can trust and that provide:

  • Transparency and auditability of training data
  • Minimal data collection, so your organization’s data stays safe
  • Full control over how your data is being processed
  • High performance and intuitive workflows, so employees don’t feel the need to turn to external AI services for their daily work

An AI tool like the Nextcloud Assistant ticks all of these boxes, making sure you can safeguard your online data without having to give up the efficiency and comfort of an AI solution.

How Nextcloud approaches AI: Ethical AI and other pointers

Nextcloud’s core principle for AI is that it should never be tied to any particular provider. In other words, administrators can choose between different providers, including self-hosted options.

Our Ethical AI rating gives them specific guidance for this choice, based on a four-level, color-coded rating scale.

Apart from that foundation, we have specific tools for both admins and users to make the most of this privacy-first AI solution, ensuring governance, compliance, and data protection.

For admins

• We aim to ensure that for each AI function in Nextcloud, there is at least one fully green-rated option, that is, one that is fully open source in terms of model, training data, and running code.

• We aim to give granular control over the various AI features so that different models/solutions can be employed for different functions.

• By default, all AI features are disabled. You can run AI fully on-premises without sharing any data with providers, and our Ethical AI rating helps you understand how data sharing works in provider-hosted models.

For users

• You’ll find clear indicators showing when AI is being used and how data is processed.

• Features are opt-in whenever possible, or easily disabled. AI is also only introduced where it adds real value, avoiding unnecessary automation.

• The AI functionality is available either from the Nextcloud Assistant’s dedicated interface or through deep integrations in the various applications, such as AI-generated subtitles during a video call.


What’s new for the Nextcloud Assistant in Nextcloud Hub 26 Winter?

As AI in Nextcloud is designed to be sovereign, organizations decide where their models run, which models are used, and what happens to their data. That way, your organization can benefit from AI-supported collaboration while staying in control of, and responsible for, its data.

Our latest release of Nextcloud Hub 26 Winter continues to build on this foundation.

AI is still optional and configurable, instead of a mandatory layer imposed on all users or workflows. This allows organizations to adopt AI at their own pace, align it with internal policies, and decide which use cases make sense in their environment.

When AI is enabled, it becomes part of the collaboration environment instead of an external dependency. It integrates into existing workflows without breaking governance or compliance frameworks.

What can the Nextcloud Assistant do for you in practice?

  • Improve and generate texts, media, and documents
  • Answer questions based on organizational data
  • Summarize meetings and conversations in Nextcloud Talk
  • Provide live transcription and translation for multilingual collaboration
  • Integrate AI capabilities directly into email, chat, meetings, and file workflows

Nextcloud Hub 26 Winter also makes compliance easier: you can generate images and documents in various apps and automatically label that content with watermarks. This helps your organization stay in line with the latest regulations, such as the EU’s AI Act.

The Nextcloud Assistant adds a watermark to an AI-generated image.
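As an illustration of the general idea (this is not Nextcloud’s actual implementation, just a minimal Python sketch using the Pillow imaging library), visibly labeling an image as AI-generated can be as simple as stamping a small overlay onto it:

```python
# Minimal sketch (not Nextcloud's implementation): stamping a visible
# "AI-generated" label onto an image with the Pillow library.
from PIL import Image, ImageDraw

def add_ai_label(path_in: str, path_out: str, label: str = "AI-generated") -> None:
    """Stamp a small, semi-transparent 'AI-generated' label onto an image."""
    img = Image.open(path_in).convert("RGBA")
    overlay = Image.new("RGBA", img.size, (0, 0, 0, 0))
    draw = ImageDraw.Draw(overlay)

    # Measure the label with the default font and place it in the bottom-right corner.
    left, top, right, bottom = draw.textbbox((0, 0), label)
    text_w, text_h = right - left, bottom - top
    x, y = img.width - text_w - 16, img.height - text_h - 16

    # Dark, semi-transparent backdrop so the label stays readable on any image.
    draw.rectangle((x - 8, y - 4, x + text_w + 8, y + text_h + 4), fill=(0, 0, 0, 128))
    draw.text((x, y), label, fill=(255, 255, 255, 220))

    Image.alpha_composite(img, overlay).convert("RGB").save(path_out)

add_ai_label("generated.png", "generated_labeled.png")
```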

In short: privacy-first AI solutions such as the Nextcloud Assistant give organizations the efficiency and convenience of AI, while keeping governance, compliance, and data ownership exactly where they belong: under their control.

Regain your digital autonomy with Nextcloud Hub 26 Winter

Our latest release of Nextcloud Hub 26 Winter is here! Discover the latest Nextcloud features.

Continue the discussion at the Nextcloud forums

Go to Forums