Anirecs: End-to-End Anime Recommendation System Discover personalized anime suggestions with Anirecs! This production-ready engine uses NLP (TF-IDF & cosine similarity) for content-based recs from 37k+ entries, sourced via scraping, Jikan API, and CSVs. Deployable via Pickle.

Genious07/Recommendation-System

Anirecs: Anime Recommendation System

Welcome to Anirecs! This is a production-ready, end-to-end anime recommendation engine designed to help users discover their next favorite anime based on content similarity. Built with a focus on scalability, efficiency, and user-friendliness, Anirecs leverages natural language processing (NLP) techniques to analyze anime synopses, genres, and themes. Whether you're a casual viewer or a hardcore otaku, Anirecs makes personalized recommendations effortless.

This project demonstrates a complete machine learning pipeline, from data acquisition to model deployment.

Project Overview

Anirecs is an intelligent recommendation system that suggests anime based on textual content analysis. It processes over 37,000 anime entries, combining synopses, genres, and themes into a unified "tags" feature. Using TF-IDF vectorization and cosine similarity, it computes content-based recommendations in real-time.

The system is built for production: the model is exportable via Pickle for easy integration into web apps (e.g., Flask/Django backend). In the accompanying Jupyter Notebook (Anirecs.ipynb), we walk through the entire process step-by-step, from raw data to a live recommender.

Why Anirecs?

  • Personalized & Accurate: Focuses on content similarity for meaningful suggestions.
  • Scalable: Handles large datasets efficiently without precomputing full similarity matrices.
  • End-to-End: Covers data ingestion, processing, modeling, and deployment—perfect for demonstrating full-stack ML skills.

Key Features

  • Content-Based Recommendations: Suggests anime similar to your favorites based on synopses, genres, and themes.
  • Real-Time Querying: Fast similarity computation on-the-fly.
  • Robust Data Handling: Manages missing values, duplicates, and inconsistencies gracefully.
  • Exportable Model: Pickle-serialized for seamless deployment.
  • Tested with Popular Anime: Includes examples like "Toradora!" and "Cowboy Bebop" for validation.

Tech Stack

  • Programming Language: Python 3.8+
  • Data Processing: Pandas, NumPy
  • NLP & ML: Scikit-learn (TF-IDF Vectorizer, Cosine Similarity)
  • Data Sources: Web scraping (e.g., MyAnimeList), Jikan API (MAL's unofficial API)
  • Visualization/Notebooks: Jupyter Notebook
  • Deployment: Pickle for model serialization; compatible with Flask, FastAPI, or Streamlit
  • Version Control: Git

Installation

To get started locally, follow these steps:

  1. Clone the Repository:

    git clone https://github.com/Genious07/Recommendation-System
    cd Recommendation-System
    
  2. Set Up a Virtual Environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
  3. Install Dependencies:

    pip install -r requirements.txt
    

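    (If requirements.txt isn't present, install manually: pip install pandas numpy scikit-learn notebook. The scraping and deployment examples additionally use requests, beautifulsoup4, and flask.)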

  4. Download Data (if not included):

    • Run the data gathering scripts (see Data Gathering) or use pre-processed CSVs from the repo.

You're all set! 🎉 Run the Jupyter Notebook with jupyter notebook Anirecs.ipynb.

Usage

  1. Run the Notebook: Open Anirecs.ipynb and execute cells sequentially to build and test the model.

  2. Get Recommendations:

    recommendations = get_recommendations('Toradora!')
    print(recommendations)

    Output: A list of similar anime titles.

  3. Deploy Locally (e.g., via Flask):

    • Export the model: import pickle; pickle.dump((tfidf, vectors, df), open('anirecs_model.pkl', 'wb'))
    • Create a simple API endpoint (see Deployment for details).

For production, host on Heroku/AWS with a web framework.

Data Pipeline

The backbone of Anirecs is a robust data pipeline, ensuring high-quality input for the model. We handle everything from raw data collection to feature-ready datasets. This section mirrors the workflow in Anirecs.ipynb.

Data Gathering

Data is sourced from multiple places to create a comprehensive dataset:

  • Web Scraping: Used BeautifulSoup and Requests to scrape anime details from MyAnimeList (MAL). Focused on pages like top anime lists, extracting fields such as name, synopsis, genres, themes, score, members, favorites, and scored_by.

    • Example: Scraped ~15,000 entries from MAL's top anime dataset.
    • Ethical Note: Rate-limited requests to avoid server overload; complied with robots.txt.
  • Jikan API: MAL's unofficial REST API (via jikan.moe) for structured data retrieval.

    • Fetched anime by ID or search queries: GET /anime/{id} for details like episodes, studios, and demographics.
    • Integrated with scraping: Used API for missing fields (e.g., full synopses) in scraped data.
    • Handled pagination and rate limits (3 requests/second).
  • Additional CSVs: Merged with open datasets like anime-dataset-2023.csv (~25,000 rows) and myanilist.csv (~21,000 rows) for broader coverage.

    • Total Raw Data: ~108,000 rows across 6 files.

Code Snippet (from Notebook):

import pandas as pd

# Example: Loading and inspecting scraped/API data
def load_df(path):
    df = pd.read_csv(path)
    print(f"{path}: {df.shape[0]} rows, columns = {list(df.columns)}")
    return df

files = ['top_anime_dataset.csv', 'myanilist.csv', ...]
dfs = {f: load_df(f) for f in files}
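
The rate-limited Jikan retrieval described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's code: the helper names (http_get_json, fetch_anime_details) and the injectable get_json, sleep, and clock parameters are assumptions made so the pacing logic can be exercised without network access. It assumes the public Jikan v4 base URL and its convention of wrapping results in a "data" key.

```python
import json
import time
import urllib.request

JIKAN_BASE = "https://api.jikan.moe/v4"  # public Jikan v4 endpoint (assumed version)


def http_get_json(url):
    """Fetch a URL and decode the JSON body (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def fetch_anime_details(anime_ids, get_json=http_get_json, max_per_second=3.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Fetch GET /anime/{id} for each id, spacing calls to respect the rate limit."""
    min_interval = 1.0 / max_per_second
    results = {}
    last_call = None
    for anime_id in anime_ids:
        if last_call is not None:
            # Wait out whatever remains of the minimum gap between requests.
            wait = min_interval - (clock() - last_call)
            if wait > 0:
                sleep(wait)
        last_call = clock()
        payload = get_json(f"{JIKAN_BASE}/anime/{anime_id}")
        # Jikan v4 wraps the anime record in a top-level "data" object.
        results[anime_id] = payload.get("data", {})
    return results
```

Calling fetch_anime_details([1, 5]) would issue two requests at least a third of a second apart. The returned records can then be used to fill fields that are missing from the scraped rows.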

Data Cleaning and Merging

Raw data is messy—duplicates, missing values, inconsistent formats. We standardized it:

  • Schema Harmonization: Defined target columns (e.g., anime_id, name, synopsis, genres).
  • Renaming & Selection: Used rename maps to align columns across datasets.
  • Merging: Concatenated harmonized DataFrames into a unified DF (~108,000 rows).
  • Deduplication: Dropped duplicates based on name (reduced to ~39,000 unique anime).
  • Handling Missing Values:
    • Filled NaNs in synopsis with empty strings.
    • Replaced "Unknown" in genres/themes with empties.
    • Median imputation for numerical fields like score, members.
  • Final Selection: Kept key columns: synopsis, genres, themes, name, score, etc. (Output: TheFinalData.csv ~37,000 rows).

Code Snippet:

# Deduplication and missing value handling
# Reset the index so row positions line up with the TF-IDF matrix built later
df = df.drop_duplicates(subset='name', keep='first').reset_index(drop=True)
df['synopsis'] = df['synopsis'].fillna('')
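
The harmonize, merge, deduplicate, and impute steps listed above can be sketched end to end on two toy frames. The TARGET_COLUMNS list, the rename maps, and the helper names harmonize and clean_and_merge are illustrative assumptions, not the notebook's actual schema or code.

```python
import pandas as pd

# Assumed target schema, for illustration only.
TARGET_COLUMNS = ["name", "synopsis", "genres", "themes", "score"]


def harmonize(df, rename_map):
    """Rename source columns to the target schema and keep only target columns."""
    return df.rename(columns=rename_map).reindex(columns=TARGET_COLUMNS)


def clean_and_merge(frames):
    """Concatenate harmonized frames, deduplicate by name, and fill gaps."""
    merged = pd.concat(frames, ignore_index=True)
    merged = merged.drop_duplicates(subset="name", keep="first").reset_index(drop=True)
    merged["synopsis"] = merged["synopsis"].fillna("")
    for col in ("genres", "themes"):
        # Treat both NaN and the literal "Unknown" as empty text.
        merged[col] = merged[col].fillna("").replace("Unknown", "")
    # Median imputation for the numeric score column.
    merged["score"] = merged["score"].fillna(merged["score"].median())
    return merged


source_a = pd.DataFrame({
    "title": ["Toradora!", "Cowboy Bebop"],
    "synopsis": ["Rom-com.", None],
    "genres": ["Romance", "Action"],
    "themes": ["School", "Space"],
    "score": [8.0, None],
})
source_b = pd.DataFrame({
    "Name": ["Cowboy Bebop", "Mushishi"],
    "Synopsis": ["Bounty hunters.", "Mushi."],
    "Genres": ["Action", "Unknown"],
    "Themes": ["Space", "Historical"],
    "Score": [8.7, 8.6],
})

final = clean_and_merge([
    harmonize(source_a, {"title": "name"}),
    harmonize(source_b, {"Name": "name", "Synopsis": "synopsis",
                         "Genres": "genres", "Themes": "themes", "Score": "score"}),
])
print(final)
```

Note that keep='first' retains whichever duplicate appears first, even if a later copy has more complete fields. Sorting by completeness before deduplicating is one possible refinement.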

Feature Engineering

Transformed raw text into a powerful "tags" feature for modeling:

  • Combined genres, themes, and synopsis into a single lowercase string.
  • Removed commas for clean tokenization.

Result: A concise, descriptive feature per anime (e.g., "action adventure fantasy during their decade-long quest...").

Code Snippet:

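# Fill missing values first so a NaN in one column does not turn the whole tag string into NaN
df['tags'] = (df['genres'].fillna('').str.replace(',', ' ') + ' ' +
              df['themes'].fillna('').str.replace(',', ' ') + ' ' +
              df['synopsis'].fillna('')).str.lower()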

Model Building and Training

The core of Anirecs is a content-based recommender using NLP. No supervised training needed—it's unsupervised similarity matching.

Vectorization

  • Used TF-IDF to convert tags into numerical vectors.
  • Limited to top 10,000 features (words) to focus on signal over noise.
  • Removed English stop words for efficiency.

Code Snippet:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=10000, stop_words='english')
vectors = tfidf.fit_transform(df['tags'])  # Sparse matrix, shape: (~37k, up to 10k); a dense .toarray() copy would need roughly 3 GB as float64

Similarity Calculation

  • Computed cosine similarity on-the-fly (efficient for large datasets).
  • Avoided full matrix storage to save memory (a 37k × 37k matrix has about 1.4 billion entries, roughly 11 GB as float64).
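
To make the trade-off concrete, the sketch below estimates the size of a full pairwise matrix and then scores a single query row against all rows on demand. The toy tag strings and the names full_matrix_gib, toy_tags, toy_vectors, and row_scores are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def full_matrix_gib(n_items, bytes_per_value=8):
    """Memory needed for an n_items x n_items dense matrix, in GiB."""
    return n_items * n_items * bytes_per_value / 2**30


# About 10 GiB for 37,000 items stored as float64.
print(f"{full_matrix_gib(37_000):.1f} GiB")

toy_tags = [
    "romance comedy school two students help each other with their crushes",
    "romance comedy school a class rep falls for a delinquent",
    "action sci-fi space bounty hunters chase criminals across the solar system",
]
toy_vectors = TfidfVectorizer(stop_words="english").fit_transform(toy_tags)  # sparse

# One query row against every row: an array of length N, not an N x N matrix.
row_scores = cosine_similarity(toy_vectors[0:1], toy_vectors)[0]
print(row_scores)
```

Each query therefore costs one pass over the N rows, and nothing beyond the TF-IDF matrix itself has to stay in memory.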

Recommendation Engine

  • Function: Takes anime name, returns top-N similar titles.
  • Logic: Fetch vector, compute similarities, sort, and exclude self.

Code Snippet:

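import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_recommendations(anime_name, num_recs=10):
    # Positional row of the first exact title match (df row order must match vectors)
    matches = np.flatnonzero((df['name'] == anime_name).to_numpy())
    if matches.size == 0:
        return []  # Unknown title: nothing to recommend
    anime_index = matches[0]
    scores = cosine_similarity(vectors[anime_index:anime_index + 1], vectors)[0]
    ranked = np.argsort(scores)[::-1]
    similar_indices = ranked[ranked != anime_index][:num_recs]  # Exclude the query itself
    return df['name'].iloc[similar_indices].tolist()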

Deployment

  • Model Export: Serialize with Pickle for production.
    import pickle
    with open('anirecs_model.pkl', 'wb') as f:
        pickle.dump((tfidf, vectors, df), f)
  • Backend Setup: Load Pickle in a Flask/FastAPI app.
    • Endpoint: /recommend?anime=Toradora!&num=10
    • Example Flask Code:
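      import pickle
      from flask import Flask, request, jsonify

      app = Flask(__name__)
      with open('anirecs_model.pkl', 'rb') as f:
          tfidf, vectors, df = pickle.load(f)
      # get_recommendations (shown above) must be defined in or imported into this module

      @app.route('/recommend', methods=['GET'])
      def recommend():
          anime_name = request.args.get('anime')
          if not anime_name:
              return jsonify(error="missing 'anime' query parameter"), 400
          num = request.args.get('num', default=10, type=int)
          recs = get_recommendations(anime_name, num_recs=num)
          return jsonify(recommendations=recs)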
  • Hosting: Deploy on a platform such as Heroku or AWS EC2 (Heroku no longer offers a free tier). Add frontend with React for a full app.
  • Scalability Tips: Use an approximate nearest-neighbor library such as FAISS, or a vector database, for faster queries on larger datasets.
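
Before moving to an approximate index, one cheaper option is worth noting. TfidfVectorizer L2-normalizes rows by default, so cosine similarity reduces to a sparse dot product, and a partial sort is enough to pick the top N. The sketch below assumes that default normalization. The function name top_n_by_dot and the toy data are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_n_by_dot(matrix, query_row, n=10):
    """Return positions of the n rows most similar to query_row."""
    # Rows are unit length (norm='l2' is the TfidfVectorizer default),
    # so the dot product with the query row equals cosine similarity.
    scores = (matrix @ matrix[query_row:query_row + 1].T).toarray().ravel()
    scores[query_row] = -np.inf  # never recommend the query itself
    n = min(n, scores.size - 1)
    if n <= 0:
        return []
    top = np.argpartition(scores, -n)[-n:]  # unordered top-n without a full sort
    return top[np.argsort(scores[top])[::-1]].tolist()


demo_tags = [
    "school romance comedy",
    "school romance drama",
    "space western bounty",
    "space opera mecha",
]
demo_matrix = TfidfVectorizer().fit_transform(demo_tags)
demo_top = top_n_by_dot(demo_matrix, query_row=0, n=2)
print(demo_top)
```

This remains an exact search. An approximate index only becomes relevant once a linear scan per query is too slow.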

Testing and Evaluation

  • Unit Tests: Verified with popular anime (e.g., "Toradora!" suggests school-life rom-coms).
  • Metrics: Qualitative (relevance checks); quantitative (cosine scores >0.5 for top recs).
  • Edge Cases: Handled missing anime, empty tags.

From Notebook: Tested "Toradora!" and "Cowboy Bebop" with accurate results.
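
The checks above could be turned into automated tests along these lines. This is a sketch on a four-row toy dataset, not the project's test suite. The recommend function is a small stand-in for get_recommendations, and the titles and tags are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

toy_df = pd.DataFrame({
    "name": ["A", "B", "C", "D"],
    "tags": ["romance comedy school", "romance comedy drama", "space action bounty", ""],
})
toy_matrix = TfidfVectorizer().fit_transform(toy_df["tags"])


def recommend(title, n=2):
    """Toy stand-in for get_recommendations, bound to toy_df and toy_matrix."""
    hits = np.flatnonzero((toy_df["name"] == title).to_numpy())
    if hits.size == 0:
        return []
    i = hits[0]
    scores = cosine_similarity(toy_matrix[i:i + 1], toy_matrix)[0]
    order = [j for j in np.argsort(scores)[::-1] if j != i]
    return toy_df["name"].iloc[order[:n]].tolist()


def test_similar_title_ranks_first():
    assert recommend("A")[0] == "B"


def test_unknown_title_returns_empty_list():
    assert recommend("Not In Dataset") == []


def test_empty_tags_do_not_crash():
    # A row with empty tags has an all-zero vector; every score is 0, but the call succeeds.
    assert len(recommend("D")) == 2
```

Functions named test_* are collected automatically when the file is run under pytest.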

Contributing

We'd love your input! Fork the repo, create a branch, and submit a PR. Follow these guidelines:

  • Use descriptive commit messages.
  • Add tests for new features.
  • Update docs for changes.

Issues? Open one with details.

License

MIT License. Feel free to use, modify, and distribute. See LICENSE for details.

Acknowledgments

  • Inspired by MyAnimeList and anime communities.
  • Thanks to Jikan API creators and open datasets.

Last Updated: August 15, 2025
If this project impresses you, imagine what we could build together!
