News Aggregator

News Aggregator is a web application that scrapes Nepali news articles from multiple sources (currently OnlineKhabar and Annapurna Post), groups similar articles by their headlines using topic modeling, and labels each resulting cluster with a meaningful topic.

View Live App


Features and Workflow

  1. Web Scraping
    • Built using Scrapy and deployed to Zyte Scrapy Cloud.
    • The scraper first collects an initial dataset of news articles (headline, publication date, and link), which is stored in MongoDB.
    • It can also scrape new articles at regular intervals to keep the dataset up to date.
  2. Topic Modeling Pipeline
    The pipeline for clustering and topic extraction is powered by BERTopic.

    • Embeddings Generation
      • Initial trials with pre-trained sentence transformers produced suboptimal embeddings for the Nepali language.
      • Switching to the pre-trained Nepali model in FastText, a word embeddings library by Facebook AI, yielded better results for representing headlines.
    • Dimensionality Reduction
      • UMAP (Uniform Manifold Approximation and Projection) projects the high-dimensional embeddings into a lower-dimensional space, in which clustering is more effective.
    • Clustering
      • The reduced embeddings are clustered with HDBSCAN, a density-based algorithm that does not need the number of clusters in advance and can leave outlier headlines unassigned.
    • Topic Extraction
      • For each cluster, the news headlines are lemmatized and stripped of stop words to build a bag of words.
      • c-TF-IDF is applied to score how important each word is to a cluster relative to the other clusters. The top n keywords by score are then used to assign a topic to each cluster.
    • Cluster Prediction
      • Incoming news articles are assigned to the most similar existing cluster using cosine similarity, so new data can be handled without re-clustering the full dataset.
  3. Frontend and Backend Integration
    • The frontend of the application is built using ReactJS.
    • The backend is primarily developed using Express.js, managing APIs, server-side operations, and database interactions.
    • FastAPI is used to build APIs for topic model predictions, integrating the BERTopic model for real-time cluster assignment of new articles.
    • MongoDB serves as the database, storing the scraped news articles and the clustering results.
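
The topic extraction and cluster prediction steps above can be illustrated with a small, dependency-free sketch. It is not the project's code: in the app these steps are handled by BERTopic. The function names, the toy English tokens, and the two-dimensional vectors are hypothetical, and representing each cluster by a single centroid vector is an assumption made for illustration. The scoring follows the c-TF-IDF formula as described in the BERTopic documentation: term frequency within the cluster multiplied by log(1 + A / f), where A is the average number of words per cluster and f is the word's frequency across all clusters.

```python
import math
from collections import Counter


def c_tf_idf(cluster_tokens):
    """Score every word in every cluster with class-based TF-IDF.

    cluster_tokens maps a cluster id to the pooled tokens of all headlines
    in that cluster (assumed already lemmatized and stop-word-free).
    """
    counts = {cid: Counter(tokens) for cid, tokens in cluster_tokens.items()}

    # Frequency of each word across all clusters.
    total = Counter()
    for cluster_counts in counts.values():
        total.update(cluster_counts)

    # Average number of words per cluster.
    avg_words = sum(total.values()) / len(counts)

    scores = {}
    for cid, cluster_counts in counts.items():
        n_words = sum(cluster_counts.values())
        scores[cid] = {
            word: (freq / n_words) * math.log(1 + avg_words / total[word])
            for word, freq in cluster_counts.items()
        }
    return scores


def top_keywords(scores, n=3):
    """Return the n highest-scoring words per cluster (ties broken alphabetically)."""
    return {
        cid: [w for w, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:n]]
        for cid, s in scores.items()
    }


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_cluster(embedding, centroids):
    """Pick the cluster whose centroid is most similar to the new embedding."""
    return max(centroids, key=lambda cid: cosine_similarity(embedding, centroids[cid]))


# Toy data (hypothetical): two clusters of pooled headline tokens.
toy_clusters = {
    "sports": ["football", "match", "goal", "match"],
    "politics": ["election", "vote", "minister", "election"],
}
keywords = top_keywords(c_tf_idf(toy_clusters), n=1)

# Toy centroids (hypothetical): one vector per existing cluster.
toy_centroids = {"sports": [1.0, 0.0], "politics": [0.0, 1.0]}
assigned = assign_cluster([0.9, 0.1], toy_centroids)
```

With the toy data, `keywords` picks the word that occurs most often within each cluster, and `assigned` is the cluster whose centroid points in nearly the same direction as the new vector. In a real setup the vectors would be the headline embeddings rather than hand-written pairs.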

Visualizations


Source Code and Components