Corpus pruning silently failing #4931

@jasocrow

Description

Corpus pruning was silently failing in a private fork of ClusterFuzz.

I was seeing the Deleting URLs and Done deleting URLs log messages, but no audit logs indicated that any URLs were actually being deleted, and corpus sizes remained large, confirming that deletions were not happening.

def delete_signed_urls(urls):
  if not urls:
    return
  logs.info('Deleting URLs.')
  with concurrency.make_pool(_POOL_SIZE) as pool:
    pool.map(_error_tolerant_delete_signed_url, urls)
  logs.info('Done deleting URLs.')

I updated the code as follows in my fork, and URLs are now being deleted successfully:

def delete_signed_urls(urls):
  if not urls:
    return
  logs.info('Deleting URLs.')
  with concurrency.make_pool(_POOL_SIZE) as pool:
    list(pool.map(_error_tolerant_delete_signed_url, urls))
  logs.info('Done deleting URLs.')

However, no log messages are ever emitted from _error_tolerant_delete_signed_url itself. For example, I updated the code to log every time a URL deletion is attempted, but the message never appeared in Logs Explorer. This means the Failed to delete warning will also never be emitted:

def _error_tolerant_delete_signed_url(url: str):
  logs.info(f'Deleting URL {url}')
  try:
    delete_signed_url(url)
  except Exception:
    logs.warning(f'Failed to delete: {url}')
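
I don't know how ClusterFuzz's logs module sets up its handlers, but one plausible (unverified) explanation for the missing worker-side logs is that handlers configured in the parent process aren't present in the pool's worker processes, so logging calls made inside the workers go nowhere. The standalone sketch below shows the effect with plain logging and the 'spawn' start method: results come back from the workers, but their INFO logs are dropped because only the parent ran basicConfig:

import logging
import multiprocessing
from concurrent import futures

def worker(x):
  # No handler or level was configured in this worker process,
  # so this INFO message is silently dropped.
  logging.getLogger().info('worker saw %s', x)
  return x

if __name__ == '__main__':
  logging.basicConfig(level=logging.INFO)  # parent-only configuration
  ctx = multiprocessing.get_context('spawn')
  with futures.ProcessPoolExecutor(2, mp_context=ctx) as pool:
    # Prints [0, 1, 2]: the workers ran, yet none of their logs appear.
    print(list(pool.map(worker, range(3))))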

I'm not exactly sure why list(pool.map(_error_tolerant_delete_signed_url, urls)) fixes the issue. I've attempted to reproduce it in standalone Python programs without success. For example, the following code always prints 0 through 4, even though pool.map(test_function, array) is called instead of list(pool.map(test_function, array)):

import contextlib
import multiprocessing
from concurrent import futures

POOL_SIZE = multiprocessing.cpu_count()

@contextlib.contextmanager
def make_pool(pool_size=POOL_SIZE, max_pool_size=None):
  """Returns a pool that can (usually) execute tasks concurrently."""
  if max_pool_size is not None:
    pool_size = min(pool_size, max_pool_size)

  # Note: the executor is not explicitly shut down here.
  yield futures.ProcessPoolExecutor(pool_size)

def test_function(x):
  print(x)

if __name__ == '__main__':
  array = list(range(5))
  with make_pool() as pool:
    pool.map(test_function, array)
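
One detail that may explain why this doesn't reproduce: futures.ProcessPoolExecutor.map submits every task eagerly and is only lazy about yielding results, so the side effects run even if the returned iterator is never consumed. If ClusterFuzz's concurrency.make_pool can instead yield some non-concurrent fallback pool whose map is the lazy builtin map (a guess on my part; I haven't checked the actual implementation), the symptom reproduces exactly, and wrapping the call in list(...) fixes it:

import contextlib

class FakePool:
  """Hypothetical non-concurrent fallback; its map() is fully lazy."""

  def map(self, f, iterable):
    return map(f, iterable)  # builtin map: nothing runs until consumed

@contextlib.contextmanager
def make_pool():
  yield FakePool()

def test_function(x):
  print(x)

if __name__ == '__main__':
  with make_pool() as pool:
    pool.map(test_function, range(5))        # prints nothing
  with make_pool() as pool:
    list(pool.map(test_function, range(5)))  # prints 0 through 4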

Consider the following changes for fixing this issue (a rough sketch follows the list):

  • Update _error_tolerant_delete_signed_url to return errors when URLs are not deleted successfully
  • Update delete_signed_urls to iterate through the results and emit a warning for each URL not deleted successfully
  • Update delete_signed_urls to return a count of successfully deleted URLs
  • Update rsync_from_disk to use the return value of delete_signed_urls to log the number of successfully deleted URLs
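
Here is a rough sketch of those changes. This is my proposal, not a confirmed fix; delete_signed_url, concurrency.make_pool and _POOL_SIZE are the existing ClusterFuzz names, and the exact shape of the rsync_from_disk logging is left out:

def _error_tolerant_delete_signed_url(url: str):
  """Returns None on success, or the URL that failed to delete."""
  try:
    delete_signed_url(url)
    return None
  except Exception:
    return url

def delete_signed_urls(urls) -> int:
  """Deletes |urls|, returning the number deleted successfully."""
  if not urls:
    return 0
  logs.info('Deleting URLs.')
  failed = 0
  with concurrency.make_pool(_POOL_SIZE) as pool:
    # Iterating the results both surfaces failures and forces the
    # (possibly lazy) map to actually run.
    for failed_url in pool.map(_error_tolerant_delete_signed_url, urls):
      if failed_url is not None:
        failed += 1
        logs.warning(f'Failed to delete: {failed_url}')
  logs.info('Done deleting URLs.')
  return len(urls) - failed

rsync_from_disk can then log the count returned by delete_signed_urls.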

I may submit a pull request myself with these suggested changes. However, doing so requires going through my company's open source approval process, which may take a while.
