# Anirecs

Welcome to Anirecs! This is a production-ready, end-to-end anime recommendation engine designed to help users discover their next favorite anime based on content similarity. Built with a focus on scalability, efficiency, and user-friendliness, Anirecs leverages natural language processing (NLP) techniques to analyze anime synopses, genres, and themes. Whether you're a casual viewer or a hardcore otaku, Anirecs makes personalized recommendations effortless.
This project demonstrates a complete machine learning pipeline—from data acquisition to model deployment—making it a practical showcase of full-stack ML skills.
## Table of Contents

- Project Overview
- Key Features
- Tech Stack
- Installation
- Usage
- Data Pipeline
- Model Building and Training
- Deployment
- Testing and Evaluation
- Contributing
- License
- Contact
- Acknowledgments
## Project Overview

Anirecs is an intelligent recommendation system that suggests anime based on textual content analysis. It processes over 37,000 anime entries, combining synopses, genres, and themes into a unified "tags" feature. Using TF-IDF vectorization and cosine similarity, it computes content-based recommendations in real time.

The system is built for production: the model is exportable via Pickle for easy integration into web apps (e.g., a Flask/Django backend). The accompanying Jupyter Notebook (`Anirecs.ipynb`) walks through the entire process step by step, from raw data to a live recommender.
### Why Anirecs?
- **Personalized & Accurate:** Focuses on content similarity for meaningful suggestions.
- **Scalable:** Handles large datasets efficiently without precomputing full similarity matrices.
- **End-to-End:** Covers data ingestion, processing, modeling, and deployment—perfect for demonstrating full-stack ML skills.
## Key Features

- **Content-Based Recommendations:** Suggests anime similar to your favorites based on synopses, genres, and themes.
- **Real-Time Querying:** Fast similarity computation on the fly.
- **Robust Data Handling:** Manages missing values, duplicates, and inconsistencies gracefully.
- **Exportable Model:** Pickle-serialized for seamless deployment.
- **Tested with Popular Anime:** Includes examples like "Toradora!" and "Cowboy Bebop" for validation.
## Tech Stack

- **Programming Language:** Python 3.8+
- **Data Processing:** Pandas, NumPy
- **NLP & ML:** Scikit-learn (TF-IDF vectorizer, cosine similarity)
- **Data Sources:** Web scraping (e.g., MyAnimeList), Jikan API (MAL's unofficial API)
- **Visualization/Notebooks:** Jupyter Notebook
- **Deployment:** Pickle for model serialization; compatible with Flask, FastAPI, or Streamlit
- **Version Control:** Git
## Installation

To get started locally, follow these steps:

1. **Clone the repository:**

   ```bash
   git clone https://github.com/Genious07/Recommendation-System
   cd anirecs
   ```

2. **Set up a virtual environment** (recommended):

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   If `requirements.txt` isn't present, install manually: `pip install pandas numpy scikit-learn`.

4. **Download data** (if not included): run the data gathering scripts (see Data Gathering) or use the pre-processed CSVs from the repo.

You're all set! 🎉 Run the Jupyter Notebook with `jupyter notebook Anirecs.ipynb`.
## Usage

- **Run the notebook:** Open `Anirecs.ipynb` and execute cells sequentially to build and test the model.
- **Get recommendations:**

  ```python
  recommendations = get_recommendations('Toradora!')
  print(recommendations)
  ```

  Output: a list of similar anime titles.
- **Deploy locally** (e.g., via Flask):
  - Export the model:

    ```python
    import pickle

    pickle.dump((tfidf, vectors, df), open('anirecs_model.pkl', 'wb'))
    ```

  - Create a simple API endpoint (see Deployment for details).

For production, host on Heroku/AWS with a web framework.
## Data Pipeline

The backbone of Anirecs is a robust data pipeline, ensuring high-quality input for the model. We handle everything from raw data collection to feature-ready datasets. This section mirrors the workflow in `Anirecs.ipynb`.
### Data Gathering

Data is sourced from multiple places to create a comprehensive dataset:
- **Web Scraping:** Used BeautifulSoup and Requests to scrape anime details from MyAnimeList (MAL). Focused on pages like top anime lists, extracting fields such as `name`, `synopsis`, `genres`, `themes`, `score`, `members`, `favorites`, and `scored_by`.
  - Example: Scraped ~15,000 entries for MAL's top anime dataset.
  - Ethical note: Rate-limited requests to avoid server overload; complied with robots.txt.
- **Jikan API:** MAL's unofficial REST API (via `jikan.moe`) for structured data retrieval; a minimal fetch sketch follows this list.
  - Fetched anime by ID or search queries: `GET /anime/{id}` for details like episodes, studios, and demographics.
  - Integrated with scraping: Used the API to fill in missing fields (e.g., full synopses) in scraped data.
  - Handled pagination and rate limits (3 requests/second).
- **Additional CSVs:** Merged with open datasets like `anime-dataset-2023.csv` (~25,000 rows) and `myanilist.csv` (~21,000 rows) for broader coverage.
  - Total raw data: ~108,000 rows across 6 files.
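As referenced above, here is a minimal sketch of a rate-limited Jikan fetch. It assumes the public Jikan v4 base URL (`https://api.jikan.moe/v4`) and a sample MAL ID; the notebook's actual IDs, retry handling, and field selection may differ.

```python
import time

import requests

JIKAN_BASE = "https://api.jikan.moe/v4"  # assumed: public Jikan v4 endpoint

def fetch_anime(anime_id):
    """Fetch one anime record by MAL ID, staying under ~3 requests/second."""
    resp = requests.get(f"{JIKAN_BASE}/anime/{anime_id}", timeout=10)
    resp.raise_for_status()
    time.sleep(0.34)  # simple rate limiting between calls
    return resp.json()["data"]  # Jikan v4 wraps the payload in a "data" key

# Example: MAL ID 1 is "Cowboy Bebop"
record = fetch_anime(1)
print(record["title"], "->", record["synopsis"][:80])
```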
**Code Snippet (from Notebook):**

```python
import pandas as pd

# Example: loading and inspecting scraped/API data
def load_df(path):
    df = pd.read_csv(path)
    print(f"{path}: {df.shape[0]} rows, columns = {list(df.columns)}")
    return df

files = ['top_anime_dataset.csv', 'myanilist.csv']  # plus the remaining CSVs (elided here)
dfs = {f: load_df(f) for f in files}
```
### Data Cleaning and Merging

Raw data is messy—duplicates, missing values, inconsistent formats. We standardized it:
- **Schema Harmonization:** Defined target columns (e.g., `anime_id`, `name`, `synopsis`, `genres`).
- **Renaming & Selection:** Used rename maps to align columns across datasets.
- **Merging:** Concatenated the harmonized DataFrames into a unified DataFrame (~108,000 rows).
- **Deduplication:** Dropped duplicates based on `name` (reduced to ~39,000 unique anime).
- **Handling Missing Values:**
  - Filled NaNs in `synopsis` with empty strings.
  - Replaced "Unknown" in `genres`/`themes` with empty strings.
  - Median imputation for numerical fields like `score` and `members`.
- **Final Selection:** Kept key columns: `synopsis`, `genres`, `themes`, `name`, `score`, etc. (Output: `TheFinalData.csv`, ~37,000 rows.)
**Code Snippet:**

```python
# Deduplication and missing-value handling
df.drop_duplicates(subset='name', keep='first', inplace=True)
df['synopsis'] = df['synopsis'].fillna('')
```
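The remaining cleaning steps from the list above can be sketched as follows. This is a reconstruction rather than the notebook's exact code: the column names come from the bullets, and the strategy is the median fill described there.

```python
# Replace placeholder "Unknown" values in the categorical text fields
df['genres'] = df['genres'].replace('Unknown', '')
df['themes'] = df['themes'].replace('Unknown', '')

# Median imputation for numerical fields
for col in ['score', 'members']:
    df[col] = df[col].fillna(df[col].median())
```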
### Feature Engineering

Transformed raw text into a powerful "tags" feature for modeling:
- Combined `genres`, `themes`, and `synopsis` into a single lowercase string.
- Removed commas for clean tokenization.
Result: A concise, descriptive feature per anime (e.g., "action adventure fantasy during their decade-long quest...").
**Code Snippet:**

```python
df['tags'] = (df['genres'].str.replace(',', ' ') + ' ' +
              df['themes'].str.replace(',', ' ') + ' ' +
              df['synopsis']).str.lower()
```
## Model Building and Training

The core of Anirecs is a content-based recommender using NLP. No supervised training is needed—it's unsupervised similarity matching.
### TF-IDF Vectorization

- Used TF-IDF to convert `tags` into numerical vectors.
- Limited to the top 10,000 features (words) to focus on signal over noise.
- Removed English stop words for efficiency.
**Code Snippet:**

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10000, stop_words='english')
vectors = tfidf.fit_transform(df['tags']).toarray()  # Shape: (~37k, 10k)
```
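Design note: `fit_transform` actually returns a SciPy sparse matrix, and scikit-learn's similarity functions accept sparse input directly, so the `.toarray()` call is optional and can be dropped to avoid materializing a dense ~37k × 10k array:

```python
# Memory-saving variant: keep the TF-IDF matrix sparse.
# cosine_similarity accepts SciPy sparse input, and vectors[i] is already
# a 2-D (1 x n) row, so a later .reshape(1, -1) becomes unnecessary.
vectors = tfidf.fit_transform(df['tags'])
```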
### Similarity and Recommendations

- Computed cosine similarity on the fly (efficient for large datasets).
- Avoided storing the full similarity matrix to save memory (~1.4 GB otherwise).
- Function: takes an anime name, returns the top-N similar titles.
- Logic: fetch the query vector, compute similarities, sort, and exclude the query itself.
**Code Snippet:**

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def get_recommendations(anime_name, num_recs=10):
    # Assumes df has a fresh RangeIndex (reset after deduplication), so
    # label-based .index values align with positional rows in `vectors`.
    matches = df[df['name'] == anime_name].index
    if len(matches) == 0:
        return []  # unknown title: no recommendations (see Edge Cases)
    anime_index = matches[0]
    scores = cosine_similarity(vectors[anime_index].reshape(1, -1), vectors)[0]
    # Sort descending and skip position 0, which is the query anime itself
    similar_indices = np.argsort(scores)[::-1][1:num_recs + 1]
    return df['name'].iloc[similar_indices].tolist()
```
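Because `TfidfVectorizer` L2-normalizes rows by default (`norm='l2'`), cosine similarity here reduces to a plain dot product, so scikit-learn's `linear_kernel` is an equivalent, slightly cheaper drop-in for the call above:

```python
from sklearn.metrics.pairwise import linear_kernel

# Equivalent to cosine_similarity for L2-normalized TF-IDF rows
scores = linear_kernel(vectors[anime_index].reshape(1, -1), vectors)[0]
```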
## Deployment

- **Model Export:** Serialize with Pickle for production.

  ```python
  import pickle

  pickle.dump((tfidf, vectors, df), open('anirecs_model.pkl', 'wb'))
  ```

- **Backend Setup:** Load the pickle in a Flask/FastAPI app.
  - Endpoint: `/recommend?anime=Toradora!&num=10`
  - Example Flask code:

    ```python
    import pickle

    from flask import Flask, request, jsonify

    app = Flask(__name__)
    tfidf, vectors, df = pickle.load(open('anirecs_model.pkl', 'rb'))

    @app.route('/recommend', methods=['GET'])
    def recommend():
        anime_name = request.args.get('anime')
        num = int(request.args.get('num', 10))
        recs = get_recommendations(anime_name, num_recs=num)  # as defined in Model Building
        return jsonify(recommendations=recs)
    ```

- **Hosting:** Deploy on Heroku (free tier) or AWS EC2. Add a frontend with React for a full app.
- **Scalability Tips:** Use vector databases like FAISS for faster queries on larger datasets; a minimal sketch follows.
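As a sketch of that FAISS route (an assumption; FAISS is not part of the current repo), cosine similarity over L2-normalized vectors can be served by an inner-product index:

```python
import faiss
import numpy as np

# Inner product on unit-length vectors equals cosine similarity.
dense = np.asarray(vectors, dtype='float32')  # FAISS expects float32
faiss.normalize_L2(dense)                     # normalize rows in place
index = faiss.IndexFlatIP(dense.shape[1])
index.add(dense)

# Top-10 neighbors for the anime at row 0, skipping the query itself
scores, ids = index.search(dense[0:1], 11)
print(ids[0][1:])
```

For ~37k items an exact `IndexFlatIP` is already fast; approximate indexes (e.g., `IndexIVFFlat`) only pay off at much larger scales.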
## Testing and Evaluation

- **Unit Tests:** Verified with popular anime (e.g., "Toradora!" suggests school-life rom-coms); see the sketch below.
- **Metrics:** Qualitative (relevance checks); quantitative (cosine scores > 0.5 for top recommendations).
- **Edge Cases:** Handled missing anime and empty tags.

From the notebook: tested "Toradora!" and "Cowboy Bebop" with accurate results.
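A hypothetical sketch of such checks as pytest cases (the module name `anirecs` and the expectations are assumptions drawn from the bullets above, not files in the repo):

```python
# test_anirecs.py (hypothetical; assumes get_recommendations is importable)
from anirecs import get_recommendations

def test_known_title_returns_recs():
    recs = get_recommendations('Toradora!')
    assert len(recs) == 10
    assert 'Toradora!' not in recs  # the query itself is excluded

def test_unknown_title_is_handled():
    assert get_recommendations('Definitely Not A Real Anime') == []
```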
## Contributing

We'd love your input! Fork the repo, create a branch, and submit a PR. Follow these guidelines:
- Use descriptive commit messages.
- Add tests for new features.
- Update docs for changes.
Issues? Open one with details.
## License

MIT License. Feel free to use, modify, and distribute. See LICENSE for details.
## Contact

- **Developer:** Satwik (GitHub | LinkedIn)
- **Email:** [email protected]
- **Feedback:** Star the repo or drop a message.
## Acknowledgments

- Inspired by MyAnimeList and anime communities.
- Thanks to the Jikan API creators and open datasets.
Last Updated: August 15, 2025
If this project impresses you, imagine what we could build together!