Web crawler downloads a lot of archives filling up space (#8210)

Created by: hildensia

We recently had the following problem on our GitLab installation (it's 6.5.0, but as far as I can see nothing has changed regarding this issue since then):

We had a fairly large project that was set to "public". At some point the Google crawler found its way to this project and started indexing it. And of course it also started to follow the archive download links for everything. As a result, GitLab generated hundreds and thousands of .zip, .tar.bz2 and .tar.gz files, taking up hundreds of GB on the hard disk and eventually filling it completely, which in turn led to GitLab no longer being available.

One solution, of course, is to disable crawling completely. But it might be a good idea to disallow crawling of the archive links by default. Archives aren't particularly interesting data for a crawler anyway, and it hurts when generating them fills up the whole disk.
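
To make the suggestion concrete, here is a rough sketch of what such a default rule could look like and which URLs it would cover. The archive URL layout (`/<namespace>/<project>/repository/archive...`) is my assumption and should be checked against the routes of the installed version. The helper names are made up for illustration, and the small matcher only mimics the `*` and `$` wildcard convention that the big crawlers document. It is not GitLab code.

```python
import re

# Hypothetical robots.txt rule. The archive URL layout is an assumption,
# not taken from the GitLab source.
ROBOTS_TXT = """\
User-agent: *
Disallow: /*/*/repository/archive*
"""


def disallow_patterns(robots_txt):
    """Collect all Disallow patterns, ignoring which user agent they belong to."""
    patterns = []
    for line in robots_txt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if line.lower().startswith("disallow:"):
            value = line.split(":", 1)[1].strip()
            if value:  # an empty Disallow means "allow everything"
                patterns.append(value)
    return patterns


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern ('*' wildcard, optional '$' end anchor)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_disallowed(path, robots_txt=ROBOTS_TXT):
    """Return True if any Disallow pattern matches the start of the path."""
    return any(pattern_to_regex(p).match(path) for p in disallow_patterns(robots_txt))


if __name__ == "__main__":
    for path in (
        "/group/project/repository/archive.zip?ref=master",
        "/group/project/repository/archive.tar.gz",
        "/group/project/tree/master",
    ):
        print(path, "->", "blocked" if is_disallowed(path) else "allowed")
```

Note that robots.txt is only advisory. Well-behaved crawlers such as Google's respect it, but it does not stop other clients from requesting archives, so cleaning up or limiting the generated archive files would still be needed on top of this.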