Skip to content

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Mar 31, 2025

The @[ThreadLocal] annotation only works on some targets and doesn't allow to register a destructor callback that will be invoked when a thread shuts down.

We currently don't have threads shutting down, but with RFC 2 it will start happening (at least isolated contexts are expected to shutdown, others should eventually evolve to shutdown too), or use the complex Crystal::ThreadLocalValue to tie a value to a thread, which in turn requires finalize methods.

Extracted from #15395 and now uses Box since #15562 has been merged. We can't use Box, see 1012b54.

This PR is to be followed by:

@ysbaddaden ysbaddaden marked this pull request as ready for review April 18, 2025 12:23
@straight-shoota straight-shoota added this to the 1.17.0 milestone Apr 18, 2025
@BlobCodes
Copy link
Contributor

BlobCodes commented Apr 19, 2025

I think both the Hash-based Crystal::ThreadLocalValue and this solution (pthread specific with ::Box) aren't very elegant.

AFAIK the primary problem with the @[ThreadLocal] annotation is that it doesn't allow to GC its contents.

TLS not being supported isn't very common, so I wouldn't want to introduce unnecessary complexity and overhead just for them.

I've built a small POC for using a "Virtual TLS", which basically means allocating a big struct with all thread-local values for a thread.
Since this VTLS can be allocated with a GC, we don't need to do anything special to keep GC-ed Objects alive or finalize a TLS section.
It should also be faster than the two other variants since it only uses pointer arithmetic at runtime and can be used in signal handlers (malloc is forbidden in signal handlers).

On the few platforms not supporting TLS, we could still use pthread specifics for that one pointer.

The same technique could be used to easily create a "Fiber-Local-Storage" (#15088).

POC Code
# :nodoc:
module Crystal::ThreadLocalManager
  @[ThreadLocal]
  @@this_thread_data : Void* = Pointer(Void).new(0u64)
  
  @@containers : Crystal::PointerLinkedList(Container) = Crystal::PointerLinkedList(Container).new
  @@list_lock : Mutex = Mutex.new(:unchecked)
  
  struct Container
    include Crystal::PointerLinkedList::Node
  end
  
  # Registers a TLS section for this thread using the GC
  def self.register_self : Nil
    # normally I'd use GC.malloc_uncollectable but crystal doesnt expose that
    container = GC.malloc(sizeof(Container))
    
    # There's no pointer-based unsafe_construct for struct
    container.as(Container*).value = Container.new
    
    @@this_thread_data = container
  
    @@list_lock.synchronize do
      @@containers.push(container.as(Container*))
    end    
  end

  # Destroys the TLS section of this thread
  def self.unregister_self
    container = @@this_thread_data
  
    @@list_lock.synchronize do
      @@containers.delete(container.as(Container*))
    end
    
    @@this_thread_data = Pointer(Void).new(0u64)
    GC.free(container)
  end
  
  @[AlwaysInline]
  def self.thread_data : Void*
    @@this_thread_data
  end
end

macro thread_local(var)
  {% unless var.is_a?(TypeDeclaration) %}
    {% var.raise "thread_local requires a type declaration.\n\
                  Example: thread_local pcre_context : Void* = Pointer(Void).null" %}
  {% end %}
  {% unless var.value %}
    {% var.raise "thread_local requires a default value" %}
  {% end %}
  {% if @def %}
    {% var.raise "Cannot use thread_local inside a def" %}
  {% end %}
  
  struct ::Crystal::ThreadLocalManager::Container
    @%var : {{var.type}}{% if var.value %} = {{var.value}}{% end %}
  end
  
  def self.{{var.var.id}} : {{var.type}}
    ptr = ::Crystal::ThreadLocalManager.thread_data + offsetof(::Crystal::ThreadLocalManager::Container, @%var)
    ptr.as({{var.type}}*).value
  end
  
  def self.{{var.var.id}}=(value : {{var.type}})
    ptr = ::Crystal::ThreadLocalManager.thread_data + offsetof(::Crystal::ThreadLocalManager::Container, @%var)
    ptr.as({{var.type}}*).value = value
  end
end

module Test
  thread_local string : String = "test"
end
  
Crystal::ThreadLocalManager.register_self

Test.string = "thread 1!!"

Thread.new do
  Crystal::ThreadLocalManager.register_self

  puts Test.string # should print "test"
  
  Test.string = "thread 2?"
  
  puts Test.string # should print "thread 2?"
  
  Crystal::ThreadLocalManager.unregister_self
end
  
sleep 3.seconds

puts Test.string # should print "thread 1!!"

EDIT: Since this is a fixed-size data structure, we could even try to allocate it at the stack bottom to remove any GC involvement. Performance-wise, that should be very close to using plain TLS.

@BlobCodes
Copy link
Contributor

Referencing #15395 (comment):

Note 3: I'm wondering if this Thread::Local(T) type ain't deprecating the @[ThreadLocal] annotation: it works everywhere and the annotation doesn't even spare a function call —the codegen marks it as noinline to avoid cache issues.

AFAIU from the LLVM bug report (https://bugs.llvm.org/show_bug.cgi?id=19177), the noinline call is only relevant when Fibers change which Thread they run on (work stealing). Since crystal doesn't do that right now, this could probably be optimized. In the issue, a Dlang developer even stated that they just didn't implement work-stealing because it was a net performance loss.

Also, I think we should more cleanly separate the Crystal Compiler/Language from the Stdlib.
Even if this type was better than the annotation in every regard, one of these is a codegen feature (which could even be relevant for c bindings), and the other is a binding around pthread specifics.

@ysbaddaden
Copy link
Contributor Author

I think I must specify that Thread is a private type. Same for this Thread::Local(T), Thread::Mutex, and the other ones. They should all be moved to the internal Crystal::System namespace.

That probably puts some perspective?

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Apr 20, 2025

TLS not being supported isn't very common.

Indeed, it usually works, but we have targets where it doesn't: OpenBSD and MinGW64 for example, Android (Termux) also exhibited problems, and I'm pretty sure it can't work with Cosmopolitan (no yet supported).

the primary problem with the @[ThreadLocal] annotation is that it doesn't allow to GC its contents.

Using Box(T) is confusing: it won't allocate into the GC HEAP when T is a Reference or Pointer: it directly casts to/from Void*.

The GC visibility is what led to the Crystal::ThreadLocalValue(T) type, but neither the PCRE2 nor the Thread.current use cases need the GC to know about the value: PCRE2 are C malloc, the Thread objects are already in a linked list known to the GC. The PCRE2 case, however, needs an abstract destructor (so we don't link libpcre2 when we don't use it).

the noinline call is only relevant when Fibers change which Thread they run on (work stealing)

Which is what we introduced in Crystal 1.16: the Fiber::ExecutionContext::MultiThreaded scheduler does work stealing and fibers are now expected to be resumed by whatever thread. There's Fiber::ExecucutionContext::SingleThreaded if you prefer threads not sharing anything.

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Apr 20, 2025

@BlobCodes I must say that your PoC is very interesting, and a public #thread_local macro could be nice.

We still can't rely on the @[ThreadLocal] annotation, so we still have to wrap libc TLS anyway as a fallback, for both Windows (MinGW64) and UNIX (OpenBSD, Android).

And @[ThreadLocal] must still be a call (because work stealing), so I'm not sure the annotation really brings a clear performance advantage over pthread specifics or win32 TLS.

@BlobCodes
Copy link
Contributor

the Fiber::ExecutionContext::MultiThreaded scheduler does work stealing and fibers are now expected to be resumed by whatever thread

I still think we could reduce the overhead of reading a TLS variable.
Maybe we could replace the noinline call with a volatile read?
That should still force LLVM to re-access the memory.

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Apr 20, 2025

EDIT: Since this is a fixed-size data structure, we could even try to allocate it at the stack bottom to remove any GC involvement.

Oh my, indeed. It wouldn't even need the linked list + lock anymore, we just need Crystal.main and Thread#start to reserve the space + set the system TLS.

@BlobCodes
Copy link
Contributor

we just need Crystal.main and Thread#start to reserve the space + set the system TLS

Yes, the main pain point could be callbacks given to c libraries, being executed in another thread.
I don't know how unknown, external threads are handled in general right now though.

@BlobCodes
Copy link
Contributor

I still think we could reduce the overhead of reading a TLS variable.

It seems like the situation around thread-switching fibers hasn't gotten much better yet:
ziglang/zig#21739
llvm/llvm-project#98479

It is apparently possible to communicate to LLVM that the fiber may be re-scheduled to another thread at a given point using @llvm.coro.suspend (which looks like a lot of work).

I don't know if the MultiThreaded scheduler only re-schedules at fixed points (blocking operations?) or if it could theoretically switch at any time.

If the latter is the case, we probably can't optimize TLS accesses right now.

We can't use `Box(T)` because it will allocate in the GC HEAP and the
only live reference to that memory will be put in the system Thread
Local Storage (TLS) that the GC can't reach, and thus collect the
pointer 💥
@ysbaddaden
Copy link
Contributor Author

I fixed this PR (1012b54): we can't use Box(T) 🤦

@ysbaddaden
Copy link
Contributor Author

I thought lots about this, and here are my personal conclusions:

  • This internal Thread::Local(T) is still useful: it abstracts the system TLS and allows to optimize PCRE2 (thanks to the destructor). We also don't need the memory to be visible to the GC right now.

  • We may eventually optimize the @[ThreadLocal] annotation (great) but that doesn't invalidate the above system abstraction: there are targets where it doesn't work on both Windows (FLS/TLS) and UNIX (pthread specifics).

  • The same applies to the custom table.

  • The custom table is a very interesting evolution towards a public API; it can be used for both threads and fibers' local storage, the values are visible to the GC, and the overhead should be negligible 🎉

Unresolved question: with the custom table, we'll have to wrap PCRE2 matchdata and jitstack pointers in a class with a finalizer to free them (no issues), but we'd always have a live reference from the Container type to the classes. Won't that cause to always link libpcre2 because of the finalizer method? 🤔

@straight-shoota straight-shoota removed this from the 1.17.0 milestone Apr 22, 2025
@straight-shoota
Copy link
Member

Yeah, this is quite an intriguing concept 🤔
In particular, I like the idea to avoid potential name clashes with a fresh variable shared between different scopes.

However, that means it's based on global identifiers, thus it can really only work for class properties. It can't be used for instance and local variables.
Maybe that's acceptable if the typical use case is for class variables. The stdlib doesn't need more (after the refactor).

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jun 26, 2025

So, what do we do with this PR? I believe it still stands, at least for:

  1. An system abstraction when the target doesn't support @[ThreadLocal] (Android, Cosmopolitan, MinGW64, OpenBSD), be it for Thread.current as well as Feature: Fiber-Local-Storage #15889.

  2. The PCRE2 use case, where we can safely register a thread local, and need to register a destructor on thread termination to reclaim the memory, without creating a hard dependency on the external library (not used => not linked). The overhead cost of calling a TLS function is also negligible.

@straight-shoota
Copy link
Member

Yeah, sounds good. We should continue this in parallel to FLS 👍

@straight-shoota straight-shoota added this to the 1.17.0 milestone Jul 1, 2025
@straight-shoota straight-shoota merged commit 8fda88f into crystal-lang:master Jul 2, 2025
37 of 38 checks passed
@ysbaddaden ysbaddaden deleted the feature/add-thread-local-struct branch July 3, 2025 08:47
@ysbaddaden
Copy link
Contributor Author

@BlobCodes

I don't know if the MultiThreaded scheduler only re-schedules at fixed points (blocking operations?) or if it could theoretically switch at any time.

At fixed points. There is no external preemption of fibers; each fiber reschedules itself through Fiber.suspend or Fiber.yield for example (in practice it happens inside Crystal::Scheduler or Fiber::ExecutionContext).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants