How I scraped 2 lakh+ personal records from hamrobazaar.com

Disclaimer: This tutorial is for educational purposes, assuming responsible and ethical use of web scraping. It emphasizes the importance of respecting privacy and website policies.

Note: This article does not endorse illegal or unethical activities. Users are responsible for using the information responsibly and in compliance with applicable laws.

Introduction:
Hamrobazaar.com is a popular online marketplace where individuals can buy and sell a wide variety of products. If you're looking to extract contact information from the platform for a specific category, you can automate the process using Node.js and MongoDB. In this article, we'll walk through the steps to create a web scraper that fetches and stores contact details from Hamrobazaar.com.

Prerequisites:
Before diving into the code, ensure you have the following tools and technologies installed on your system:

1. Node.js 

2. npm (Node Package Manager)

3. MongoDB

Setting up the Project:

1. Create a new Node.js project using npm init.

2. Install the necessary dependencies:

npm install axios mongoose dotenv

3. Set up a MongoDB database and configure the connection in your project.
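For step 3, a minimal connection module might look like the sketch below. The `Config/db.js` path and the `MONGO_URI` environment variable name are assumptions for illustration, not part of the original project:

```javascript
// Config/db.js (hypothetical path): connects Mongoose to MongoDB.
// MONGO_URI is an assumed variable name read from a .env file.
const mongoose = require("mongoose");
const dotenv = require("dotenv");

dotenv.config();

async function connectDB() {
  try {
    await mongoose.connect(process.env.MONGO_URI);
    console.log("MongoDB connected");
  } catch (error) {
    console.error("MongoDB connection failed:", error.message);
    process.exit(1);
  }
}

module.exports = connectDB;
```

Call `connectDB()` once at startup, before the scraper begins inserting documents.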

Writing the Web Scraper:

My approach to scraping a website begins with a thorough check of whether it is client-side rendered or server-side rendered. A server-rendered site ships fully built HTML, so you parse the markup with a library such as Cheerio, or drive a real browser with Puppeteer or Selenium when the page needs one. A client-rendered site is generally easier to scrape: the browser requests JSON from the server and renders it on the client, so you can often call that same JSON API directly and skip HTML parsing altogether. While exploring hamrobazaar.com, I confirmed it is client-rendered. (There is an ongoing debate about the merits of each rendering method; we will delve into their differences in a future article.)

Once you have spotted the request in your browser's DevTools Network tab, grab the endpoint as ready-to-run code:
  1. Right-click on the selected request.
  2. Choose “Copy” from the menu.
  3. Click “Copy as fetch (Node.js)”.

This copies the full request, URL and headers included, so you can paste it straight into your Node.js code for fetching data.

The provided code is a Node.js script that utilizes the axios library to make HTTP requests and MongoDB to store the extracted contact information. The script is structured as follows:

The fetchContact function initiates the scraping process by sending a request to Hamrobazaar.com's GetAllCategory endpoint.

const axios = require("axios");
const dotenv = require("dotenv");
dotenv.config();
const Contact = require("../Models/Contact");
async function fetchContact() {
  try {
    const response = await axios.get(
      "https://api.hamrobazaar.com/api/AppData/GetAllCategory",
      {
        headers: {
          "accept": "application/json, text/plain, */*",
          "access-control-allow-origin": "*",
          "apikey": "You Will need to get your own API Key",
          "country_code": "null",
          "deviceid": "You Will need to get your own Device Id",
          "devicesource": "web",
          "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\"",
          "sec-ch-ua-mobile": "?0",
          "sec-ch-ua-platform": "\"Windows\"",
          "strict-transport-security": "max-age=2592000",
          "x-content-type-options": "nosniff",
          "x-frame-options": "SAMEORIGIN",
          "Referer": "https://hamrobazaar.com/",
          "Referrer-Policy": "strict-origin-when-cross-origin"
        },
      }
    );
    const { data } = response.data;
    for (const category of data) {
      console.log(`Currently Scraping ${category.id} Category`);
      await getAll(category.id);
    }
  } catch (error) {
    // Handle errors
    console.error("Error fetching data:", error.message);
  }
}
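From the destructuring above, the GetAllCategory response is assumed to look like `{ data: [...] }` with an `id` on each category. That shape can be exercised in isolation; the sample payload below, including its `name` fields, is made up for illustration and is not real API output:

```javascript
// Pull category ids out of a GetAllCategory-style payload.
// The payload shape is inferred from the destructuring in fetchContact;
// the sample values are invented.
function extractCategoryIds(responseBody) {
  const { data } = responseBody;
  return data.map((category) => category.id);
}

const sample = {
  data: [
    { id: "c1", name: "Mobile Phones" },
    { id: "c2", name: "Vehicles" },
  ],
};

console.log(extractCategoryIds(sample)); // → [ 'c1', 'c2' ]
```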

The getAll function iterates through pages for a specific category, extracts contact information, and inserts it into a MongoDB collection named "Contact."

async function getAll(CategoryId) {
  let nextPageNumber = 1;
  while (true) {
    try {
      const response = await axios.get(
        `https://api.hamrobazaar.com/api/Product?PageSize=1000&CategoryId=${CategoryId}&IsHBSelect=false&PageNumber=${nextPageNumber}`,
        {
          headers: {
              "accept": "application/json, text/plain, */*",
              "access-control-allow-origin": "*",
              "apikey": "You Will need to get your own API Key",
              "country_code": "null",
              "deviceid": "You Will need to get your own Device Id",
              "devicesource": "web",
              "sec-ch-ua": "\"Chromium\";v=\"122\", \"Not(A:Brand\";v=\"24\", \"Google Chrome\";v=\"122\"",
              "sec-ch-ua-mobile": "?0",
              "sec-ch-ua-platform": "\"Windows\"",
              "strict-transport-security": "max-age=2592000",
              "x-content-type-options": "nosniff",
              "x-frame-options": "SAMEORIGIN",
              "Referer": "https://hamrobazaar.com/",
              "Referrer-Policy": "strict-origin-when-cross-origin"
          },
        }
      );
      const { data } = response.data;
      let nextPage = response.data.nextPageNumber;
      let totalRecords = response.data.totalRecords;
      if (nextPage == null) {
        break;
      }
      console.log(`Currently Scraping Page ${nextPage}`);
      if (nextPage === 2) {
        console.log(`Total Records: ${totalRecords}`);
      }
      const value = [];
      const uniqueSet = new Set();
      for (let i = 0; i < data.length; i++) {
        if (data[i].creatorInfo == null || data[i].creatorInfo == undefined) {
          continue;
        }
        if (
          data[i].creatorInfo.createdByUsername == null ||
          data[i].creatorInfo.createdByUsername == undefined
        ) {
          continue;
        }
      // Omitted phone numbers with obscured portions, such as "986***9399", to maintain data integrity.
        if (!data[i].creatorInfo.createdByUsername.includes("*")) {
          const contactObject = {
            name: data[i].creatorInfo.createdByName,
            contact: data[i].creatorInfo.createdByUsername,
          };
          const contactString = JSON.stringify(contactObject);
          if (!uniqueSet.has(contactString)) {
            value.push(contactObject);
            uniqueSet.add(contactString);
          }
        }
      }
      try {
        await Contact.insertMany(value);
        console.log(
          `Successfully inserted contact information for page ${nextPage - 1}`
        );
        value.length = 0;
        uniqueSet.clear();
      } catch (error) {
        console.error("Insert failed:", error.message);
      }
      nextPageNumber = nextPage;
    } catch (error) {
      console.error("Error fetching data:", error.message);
      break;
    }
  }
}
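The filter-and-deduplicate step inside getAll can be pulled out and tested without any network calls. The helper below is a sketch mirroring that logic; the sample records are invented:

```javascript
// Mirrors the filter in getAll: skip records with missing creatorInfo,
// skip obscured numbers containing "*", and de-duplicate on (name, contact).
function extractContacts(data) {
  const value = [];
  const uniqueSet = new Set();
  for (const item of data) {
    if (item.creatorInfo == null) continue;
    const { createdByName, createdByUsername } = item.creatorInfo;
    if (createdByUsername == null) continue;
    if (createdByUsername.includes("*")) continue;
    const contactObject = { name: createdByName, contact: createdByUsername };
    const contactString = JSON.stringify(contactObject);
    if (!uniqueSet.has(contactString)) {
      value.push(contactObject);
      uniqueSet.add(contactString);
    }
  }
  return value;
}

const sample = [
  { creatorInfo: { createdByName: "A", createdByUsername: "9800000000" } },
  { creatorInfo: { createdByName: "A", createdByUsername: "9800000000" } }, // duplicate
  { creatorInfo: { createdByName: "B", createdByUsername: "986***9399" } }, // obscured
  { creatorInfo: null },
];

console.log(extractContacts(sample)); // → [ { name: 'A', contact: '9800000000' } ]
```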

The MongoDB schema defines the structure for the contact information, including the name and contact number.

const mongoose = require("mongoose");

const contactSchema = new mongoose.Schema({
  name: {
    type: String,
  },
  contact: {
    type: String,
  },
});

const Contact = mongoose.model("Contact", contactSchema);

module.exports = Contact;

Running the Scraper:
To run the web scraper, execute the following command in your terminal:

node your_scraper_script.js

Make sure to replace your_scraper_script.js with the name of your Node.js script.
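Since the script loads configuration with dotenv, it expects a .env file next to it. The variable names below are assumptions for illustration; the shown scraper hardcodes its placeholders, so adapt as needed:

```shell
# .env (hypothetical contents; supply your own credentials)
MONGO_URI=mongodb://localhost:27017/hamrobazaar
APIKEY=your-api-key-here
DEVICEID=your-device-id-here
```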

Output:

While extracting data from more than 200,000 records on Hamrobazaar.com, I identified unique contact information for 18,132 individuals. Hamrobazaar.com users frequently create multiple posts, so the raw data contains many repeats. The code ensures only distinct contact details are kept: duplicates are removed, and phone numbers with obscured portions (e.g., "986***9399") are excluded during extraction within the getAll method. Consequently, just over 9% of the total scraped records were retained for further use.

// Assumes this snippet runs inside an Express-style route handler,
// so `res` and the Contact model are in scope.
try {
    // Use an aggregation pipeline to group documents by name and contact
    const uniqueContacts = await Contact.aggregate([
      {
        $group: {
          _id: { name: "$name", contact: "$contact" },
        },
      },
      {
        $project: {
          _id: 0,
          name: "$_id.name",
          contact: "$_id.contact",
        },
      },
    ]);
    console.log(`Unique Contact Information: ${uniqueContacts.length}`);
    res.json(uniqueContacts);
  } catch (error) {
    console.error(error);
  }
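The retention figure can be sanity-checked with quick arithmetic: 18,132 unique contacts out of roughly 200,000 records is just over 9%, in the ballpark of the rate quoted above:

```javascript
// Quick check of the retention rate reported in the article.
const totalRecords = 200000; // "more than 200,000 records"
const uniqueCount = 18132;   // unique contacts identified
const retained = (uniqueCount / totalRecords) * 100;
console.log(`${retained.toFixed(1)}% of records retained`); // → "9.1% of records retained"
```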

Conclusion:
In this article, we explored how to build a web scraper using Node.js and MongoDB to extract contact information from Hamrobazaar.com. By automating the process, you can efficiently gather data for a specific category and store it in a structured format for further analysis. Keep in mind the ethical considerations and terms of service while scraping data from online platforms. Happy coding!