Skip to content

theunknownmonk/mercor-challenge

Repository files navigation

Mercor Challenge Project

This project is a referral system application built with Java and Spring Boot.

Language & Setup

Requirements

  • Language: Java 17
  • Framework: Spring Boot 3.3.2
  • Build Tool: Gradle

Setup Commands

  1. Install dependencies and build the project:

    ./gradlew build

    This command will download all necessary dependencies and compile the source code.

  2. Run the application:

    ./gradlew bootRun

    The application will start on port 8090 by default.

Running Tests

To execute the entire test suite, run the following command from the root of the project:

./gradlew test

This will run all unit tests in the project and generate a report.

General Design Choices

This project follows a standard layered architecture pattern, separating concerns into distinct service, resource (controller), and data model layers.

  • In-Memory Storage: For simplicity and performance, the application uses in-memory data structures (ConcurrentHashMap) for all data storage. This means that all data is reset when the application restarts. This approach was chosen to focus on the core business logic without introducing the complexity of a database and to keep things simple.

  • UUID for Internal IDs: While users are identified by a client-provided userId, a unique, internally generated UUID is used for all internal operations. This decouples the internal data structure from external identifiers and prevents potential collisions. Also, UUIDs are often faster to index and join on in relational databases as compared to User Ids which might be of longer length.

  • Thread Safety: ConcurrentHashMap and strategic locking mechanisms (e.g., ReentrantReadWriteLock) are used in the storage services to ensure that the application is thread-safe and can handle concurrent requests without data corruption, especially in the referral graph operations.

  • Defensive Copying: The User model includes a copy constructor, and services return copies of objects rather than direct references. This prevents external modifications to the internal state of the services, enhancing encapsulation and stability.

  • Separation of Concerns: The ReferralStorageService is split into multiple, more focused services (UserStorageService, ReferralCountService, etc.) to adhere to the Single Responsibility Principle. This makes the code easier to maintain, test, and understand.

  • Network Simulation: The network growth simulation logic is encapsulated in its own service (NetworkGrowthSimulationService), completely separate from the core application logic. This ensures that the analytical modeling does not interfere with the main application's functionality.

  • Interface-Based Services: The core services, such as ReferralService, are defined as interfaces. This decouples the business logic from the specific implementation (e.g., ReferralServiceInMem). This design makes the system more flexible and extensible, allowing for different storage mechanisms (e.g., a database-backed service) to be added in the future without changing the controller layer. It also simplifies unit testing by allowing for easy mocking of service dependencies.

Implementation Details

Part 0: Observations and Assumptions

  • The referral graph is a DAG by design. We have also introduced the constraint that a user can be referred only once by another user. This means that the graph is a directed forest with a root/source user who start that referral tree.
  • While returning Top K referrers by reach, we only consider users with > 0 referrals.

Part 1: Referral Graph

Add Referral Relationships

The referral graph is implemented using a directed graph structure where each user can have multiple referrals. The ReferralService interface defines the operations for managing users and referrals, while the ReferralServiceInMem class provides an in-memory implementation. A ConcurrentHashMap<String, Set<String>> is used to store the referral relationships, where the key is the user ID and the value is a set of referred user IDs. (User Ids and UUIDs are used interchangeably in the documentation.) We also maintain a ConcurrentHashMap<String, User> to store user referrer details, allowing for quick access and updates. Time Complexity -> Adding referral relationships is O(1) but the bottleneck is checking whether the referral will create a cycle in the graph. The algorithm we use is going up the referral chain from the referred user to the root user and checking if the referrer is already in the chain. Let h denote the maximum referral chain height, the time complexity of this operation is O(h) in the worst case.

We also update the indirect referral count for the referrers up the referral chain, which is O(h*log(C)) as well. The log(C) factor comes from using a TreeMap to maintain the private final TreeMap<Long, Set<String>> referralCountToUsers = new TreeMap<>(); where C is the number of unique referral counts. This allows us to efficiently retrieve the top referrers by reach.

[New idea : As the graph that we have is a directed forest, for checking whether a referral will create a cycle, we can check if the candidate is one of the root nodes of the directed forest. ]

Get Direct Referrals

The getDirectReferrals method retrieves the list of direct referrals for a given user. It returns a list of user IDs that are directly referred by the specified user. The time complexity of this operation is O(number referrals) as it simply retrieves the set of referrals from the map.

Part 2: Full Network Reach

Total Referral Count

The getTotalReferralCount method calculates the total number of referrals for a given user, including both direct and indirect referrals. The ReferralCountService is used to maintain the count of direct and indirect referrals for each user. We use a concurrent hashmap The time complexity of this operation is O(1) as we precompute the indirect referral count when adding a referral.

Top Referrers by Reach

The getTopReferrersByReach method retrieves the top K referrers based on their indirect referral count. The time complexity of this operation is O(K), where K is the number of top referrers requested. We have done most of the heavy lifting in the addReferral method.

How to pick an appropriate value for K?

Part 3: Identify Influencers

Metric 1: Unique Reach Expansion

The getUniqueReachExpansion method calculates the unique reach expansion for a given user. This returns a set of top K users who cover the maximum number of unique candidates. As our referral graph is a directed forest, we can choose the root user/source user of every referral tree as the influencer. Now, we can use the ReferralCountService to get the top K users based on their indirect referral count which denotes the number of nodes in the source user's tree. Time complexity of this operation is O(T + TlogT) where T is the number of trees/connected component in the referral graph.

[New idea : We can use a max-heap to get the top K users in O(TlogK) time. This will be more efficient if T is large and K is small.]

Metric 2: Flow Centrality

The getTopFlowCentralUsers method calculates the flow centrality for a given user. This uses a Tree DP to compute the critical brokers in the referral graph. The time complexity is that of running a DFS on the referral graph, which is O(V + E) where V is the number of users and E is the number of referrals. Also, we need to get the top K users based on their flow centrality, which will involve sorting the users based on their flow centrality scores. So, total time complexity is O(V + E + VlogV).

Comparison between Total Network Reach, Unique Reach Expansion, and Flow Centrality as metrics for identifying influencers.

Part 4: Network Growth Simulation

The NetworkGrowthSimulationService simulates the growth of the referral network over a specified number of iterations.

Simulate Network Growth

The simulateCumulativeReferralCount method simulates the growth of the referral network by randomly adding referrals between existing users. Let M denote the maximum number of referrals a user can make and D denote the number of days. Time complexity of this operation is O(M * D).

Days to target

We use binary search to find the minimum number of days required to reach a target referral count. The findDaysToReachTarget method performs this search. We use the method from simulate to perform this. Let D denote the original answer order of magnitude of the number of days required to reach the target referral count. Then, the time complexity of this operation is O(logD * M * D) where M is the maximum number of referrals a user can make in a day.

API Related Design Choices

The API is designed following RESTful principles to ensure predictability and ease of use.

  • Resource-Oriented URLs: The API uses a resource-oriented URL structure. For example, /api/createUser and /api/addReferral are used to manage users and referrals.

  • Statelessness: The API is stateless. Every request from a client contains all the information needed by the server to process it, without relying on any stored server-side session state.

  • JSON for Data Interchange: The API uses JSON for all request and response bodies, which is the standard for modern web APIs due to its lightweight nature and ease of parsing.

  • Use of HTTP Methods: Standard HTTP methods are used to perform actions:

    • POST: To create new resources (e.g., creating a user or a referral).
    • GET: To retrieve resources (e.g., getting a user's direct referrals).
  • HTTP Status Codes: The API uses standard HTTP status codes to indicate the success or failure of a request:

    • 200 OK: For successful GET requests.
    • 201 Created: For successful POST requests that result in resource creation.
    • 400 Bad Request: For client-side errors, such as invalid input.
    • 404 Not Found: When a requested resource does not exist.
    • 500 Internal Server Error: For unexpected server-side errors.
  • Clear Error Messaging: In case of an error, the API returns a clear error message in the response body to help clients understand the cause of the issue. Internal exception details are not exposed to prevent information leakage.

Known Issues and Future Improvements

  • Data Persistence: The current implementation uses in-memory storage, meaning all data is lost on application restart. The most critical next step would be to replace the in-memory services with a persistent storage solution, such as a relational (PostgreSQL, MySQL) or NoSQL (MongoDB) database or Graph Database(Need to know more about this ) . The use of service interfaces will make this transition easier.

  • Authentication & Authorization: The API is currently unsecured, with all endpoints being publicly accessible. A future improvement would be to implement a security layer using a framework like Spring Security to protect the endpoints and ensure that only authorized users can perform certain actions.

  • Delete Functionality: The application currently lacks the functionality to delete users or referrals. A key improvement would be to add DELETE endpoints (e.g., /api/users/{userId}) to allow for the removal of data. This would require careful implementation to handle cascading deletes (e.g., how to manage the referrals of a deleted user) to maintain data integrity.

  • Memory Usage: The in-memory data structures (e.g., for storing the referral graph and user data) will grow indefinitely as more users and referrals are added. In a long-running system with a large dataset, this will eventually lead to an OutOfMemoryError, causing the service to crash. This limitation is a direct consequence of not using a persistent, disk-based storage solution.

  • Proper Locking Mechanism: Currently we are acquiring a write lock on the entire referral graph when adding a referral. This can lead to contention and performance issues in a high-concurrency environment. A more granular locking mechanism or an optimistic locking strategy could be implemented to improve performance. The idea is to perform a simple traversal on the nodes of the graph we need to update and acquire locks only on those nodes following a global order. This will reduce contention and improve performance in a high-concurrency environment.

Use of AI

  • Used AI tools to understand to create boilerplate code for the project, including the Gradle build file and initial project structure.
  • Used AI tools extensively to generate unit tests for the various services and methods in the project. This includes generating test cases for edge cases and ensuring that the business logic is thoroughly tested.
  • Used AI tools to understand more about the simulation methods and understand various ideas used in the project.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages