Dartmouth College - COSC 50 – Software Design & Implementation
Indexed Web Crawler in C
Build an indexed web crawler in C. The crawler should start from a seed URL, fetch web pages, parse HTML to extract further links, and crawl these links up to a specified depth (e.g., depth 2). As it crawls, it should build an inverted index mapping words found on the pages to a list of URLs (and optionally frequencies) where those words occur. The index should be serialized to disk.
Our Suggested Approach
1. Core Components Overview:
- Crawler Engine: Manages the overall crawling process, including the queue of URLs to visit, depth control, and coordination of other components.
- HTTP Fetcher: Responsible for retrieving the HTML content of web pages given a URL (typically using a library like libcurl in C).
- HTML Parser: Extracts hyperlinks (URLs) and textual content from the fetched HTML. In C, this might involve simple string manipulation or a basic state machine for locating <a href="..."> tags and stripping other tags, as full HTML parsing is complex.
- Inverted Indexer: Builds and maintains the inverted index data structure. This maps words to the documents (URLs) they appear in.
- Data Structures: Queues (for BFS crawling strategy), Hash Tables (for the inverted index and to keep track of visited URLs to avoid cycles and re-processing).
2. Crawler Engine Logic (Breadth-First Search - BFS):
- Initialize a queue (FIFO) and add the seed URL along with its initial depth (e.g., depth 0).
- Initialize a hash table (or set) to store visited_urls, to prevent re-crawling pages and getting caught in loops.
- While the queue is not empty AND the current_depth of the URL being processed is within the max_depth limit:
- Dequeue a URL and its current_depth.
- If this URL has already been visited, continue to the next URL in the queue.
- Add the current URL to visited_urls.
- Fetch Page Content: Use the HTTP Fetcher module to retrieve the HTML content for the current URL.
- If fetching is successful:
- Parse HTML for Links: Use the HTML Parser to find all valid, absolute hyperlinks on the page. Normalize these URLs (e.g., resolve relative paths against the current URL, handle http/https).
- Enqueue New Links: For each extracted and normalized link, if it hasn't been visited and current_depth + 1 <= max_depth, add it to the queue with depth = current_depth + 1.
- Parse HTML for Text & Index: Extract the plain textual content from the page (e.g., by stripping HTML tags, ignoring script/style content). Tokenize this text into words (convert to lowercase, remove punctuation, possibly ignore common stop words). For each word, update the Inverted Index data structure (see step 5).
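A minimal sketch of this BFS crawl loop in C follows. It uses a simple linked-list frontier and a linear-scan visited list; fetch_page, extract_links, and index_page stand in for the fetcher, parser, and indexer described in the later steps and are assumed names, not a fixed API.

#include <stdlib.h>
#include <string.h>

#define MAX_DEPTH 2
#define MAX_VISITED 1024

/* Placeholders for the modules described in steps 3-5 (assumed, not a fixed API). */
extern char *fetch_page(const char *url);
extern char **extract_links(const char *html, const char *base_url, int *num_links);
extern void index_page(const char *url, const char *html);

typedef struct task { char *url; int depth; struct task *next; } task_t;

static int already_visited(char *visited[], int count, const char *url) {
    for (int i = 0; i < count; i++)
        if (strcmp(visited[i], url) == 0) return 1;
    return 0;
}

void crawl(const char *seed_url) {
    task_t *head = NULL, *tail = NULL;             /* FIFO frontier for BFS */
    char *visited[MAX_VISITED]; int nvisited = 0;  /* visited set (linear scan) */

    task_t *seed = malloc(sizeof *seed);
    seed->url = strdup(seed_url); seed->depth = 0; seed->next = NULL;
    head = tail = seed;                            /* seed enters at depth 0 */

    while (head != NULL) {
        task_t *cur = head;                        /* dequeue the front item */
        head = head->next;
        if (head == NULL) tail = NULL;

        if (!already_visited(visited, nvisited, cur->url) && nvisited < MAX_VISITED) {
            visited[nvisited++] = strdup(cur->url);
            char *html = fetch_page(cur->url);
            if (html != NULL) {
                index_page(cur->url, html);        /* tokenize and update the index */
                if (cur->depth < MAX_DEPTH) {      /* only expand within the depth limit */
                    int n = 0;
                    char **links = extract_links(html, cur->url, &n);
                    for (int i = 0; i < n; i++) {
                        if (!already_visited(visited, nvisited, links[i])) {
                            task_t *t = malloc(sizeof *t);
                            t->url = links[i];     /* queue takes ownership of the string */
                            t->depth = cur->depth + 1;
                            t->next = NULL;
                            if (tail) tail->next = t; else head = t;
                            tail = t;
                        } else {
                            free(links[i]);
                        }
                    }
                    free(links);
                }
                free(html);
            }
        }
        free(cur->url);
        free(cur);
    }
    for (int i = 0; i < nvisited; i++) free(visited[i]);
}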
3. HTTP Fetcher (Conceptual, often using libcurl in C):
- Initialize libcurl (global setup).
- Create a CURL easy handle.
- Set essential options: CURLOPT_URL (the URL to fetch), CURLOPT_WRITEFUNCTION (a callback function to handle incoming data), and CURLOPT_WRITEDATA (a pointer to a buffer or structure where your callback will store the data).
- Optionally, set CURLOPT_FOLLOWLOCATION to handle redirects, along with timeout options.
- Perform the request with curl_easy_perform(). Check the return code for errors.
- Clean up the CURL handle.
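Putting those steps together, a minimal fetch function using libcurl's easy interface might look like the sketch below. It assumes curl_global_init(CURL_GLOBAL_DEFAULT) is called once at program start; fetch_page and struct buffer are illustrative names, not part of libcurl.

#include <stdlib.h>
#include <string.h>
#include <curl/curl.h>

/* Growable buffer filled by the libcurl write callback. */
struct buffer { char *data; size_t len; };

static size_t write_cb(char *ptr, size_t size, size_t nmemb, void *userdata) {
    struct buffer *buf = userdata;
    size_t chunk = size * nmemb;
    char *grown = realloc(buf->data, buf->len + chunk + 1);
    if (grown == NULL) return 0;                  /* returning 0 signals an error to libcurl */
    buf->data = grown;
    memcpy(buf->data + buf->len, ptr, chunk);
    buf->len += chunk;
    buf->data[buf->len] = '\0';
    return chunk;
}

/* Returns a malloc'd, NUL-terminated copy of the page body, or NULL on failure. */
char *fetch_page(const char *url) {
    CURL *handle = curl_easy_init();
    if (handle == NULL) return NULL;

    struct buffer buf = { malloc(1), 0 };
    if (buf.data) buf.data[0] = '\0';

    curl_easy_setopt(handle, CURLOPT_URL, url);
    curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, write_cb);
    curl_easy_setopt(handle, CURLOPT_WRITEDATA, &buf);
    curl_easy_setopt(handle, CURLOPT_FOLLOWLOCATION, 1L);   /* follow redirects */
    curl_easy_setopt(handle, CURLOPT_TIMEOUT, 10L);         /* give up after 10 seconds */

    CURLcode rc = curl_easy_perform(handle);
    curl_easy_cleanup(handle);

    if (rc != CURLE_OK) { free(buf.data); return NULL; }
    return buf.data;
}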
4. HTML Parsing (Link and Text Extraction):
- Link Extraction: A common simplified method is to scan the HTML for occurrences of href=" (e.g., using strstr). Then, parse the string between the quotes to get the URL. This is not robust for all HTML but is often sufficient for assignments. Resolve relative URLs using the base URL of the current page.
- Text Extraction: A basic approach involves iterating through the HTML content. When a < is encountered, skip characters until a > is found (to strip tags). Collect characters that are not part of tags. Ignore content within <script> and <style> tags.
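For the link-extraction bullet above, a minimal strstr-based scan might look like this sketch (extract_hrefs is an illustrative name; a real crawler would collect and normalize each value instead of printing it, and this naive scan ignores single quotes and whitespace around attributes):

#include <stdio.h>
#include <string.h>

/* Scan html for every href="..." attribute and print the raw value. */
void extract_hrefs(const char *html) {
    const char *p = html;
    while ((p = strstr(p, "href=\"")) != NULL) {
        p += strlen("href=\"");
        const char *end = strchr(p, '"');          /* find the closing quote */
        if (end == NULL) break;
        printf("link: %.*s\n", (int)(end - p), p);
        p = end + 1;                               /* continue scanning after this value */
    }
}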
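And for the text-extraction bullet, a minimal tag-stripping pass could look like the following (strip_tags is an illustrative name; this sketch does not yet skip <script>/<style> bodies or decode HTML entities):

#include <string.h>

/* Copy the text outside of <...> tags into out, which must hold at least
 * strlen(html) + 1 bytes. Characters inside tags are discarded. */
void strip_tags(const char *html, char *out) {
    int in_tag = 0;
    size_t n = 0;
    for (const char *p = html; *p != '\0'; p++) {
        if (*p == '<')       in_tag = 1;    /* entering a tag */
        else if (*p == '>')  in_tag = 0;    /* leaving a tag */
        else if (!in_tag)    out[n++] = *p; /* keep visible text */
    }
    out[n] = '\0';
}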
5. Inverted Index Data Structure and Building:
- Structure: Typically a hash table where:
- Key: A word (string).
- Value: A dynamically growing list (or linked list) of DocumentReference structures. Each DocumentReference might store a document_id (an integer mapping to a URL, to save space) and the frequency of the word in that document. A separate structure would map each document_id back to the actual URL string.
- Building the Index: After extracting and tokenizing words from a page:
- For each unique word found on the page:
- Look up the word in the inverted index hash table.
- If the word is not present, add it and create a new posting list for it.
- Add a reference to the current document (its ID or URL) to the word's posting list. If tracking frequency, increment the count for this word in this document.
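A sketch of such an index in C, assuming integer document IDs (with a separate doc_id-to-URL table kept elsewhere) and collision resolution by chaining. All names here, such as index_add, posting_t, and INDEX_BUCKETS, are illustrative choices rather than a required design:

#include <stdlib.h>
#include <string.h>

/* One (doc_id, frequency) pair in a word's posting list. */
typedef struct posting {
    int doc_id;
    int freq;
    struct posting *next;
} posting_t;

/* One hash-table entry: a word and its posting list; collisions chain via next. */
typedef struct index_entry {
    char *word;
    posting_t *postings;
    struct index_entry *next;
} index_entry_t;

#define INDEX_BUCKETS 10007
static index_entry_t *index_table[INDEX_BUCKETS];

static unsigned hash_word(const char *w) {         /* simple djb2-style string hash */
    unsigned h = 5381;
    while (*w) h = h * 33 + (unsigned char)*w++;
    return h % INDEX_BUCKETS;
}

/* Record one occurrence of word in document doc_id. */
void index_add(const char *word, int doc_id) {
    unsigned b = hash_word(word);
    index_entry_t *e = index_table[b];
    while (e && strcmp(e->word, word) != 0) e = e->next;
    if (e == NULL) {                                /* first time this word is seen */
        e = malloc(sizeof *e);
        e->word = strdup(word);
        e->postings = NULL;
        e->next = index_table[b];
        index_table[b] = e;
    }
    posting_t *p = e->postings;
    while (p && p->doc_id != doc_id) p = p->next;
    if (p == NULL) {                                /* first occurrence in this document */
        p = malloc(sizeof *p);
        p->doc_id = doc_id;
        p->freq = 0;
        p->next = e->postings;
        e->postings = p;
    }
    p->freq++;                                      /* bump the per-document frequency */
}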
6. URL Normalization and Management:
- Convert relative URLs (e.g., /path/page.html, other.html) into absolute URLs using the base URL of the page they were found on.
- Standardize URLs (e.g., remove trailing slashes, default port numbers, convert to lowercase hostname).
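A simplified resolver covering the common cases might look like the sketch below (resolve_url is an illustrative name; it ignores ".." segments, query strings, fragments, default ports, and hostname case-folding, all of which a fuller normalizer would handle):

#include <stdio.h>
#include <string.h>

/* Resolve a link found on page `base` into an absolute URL written to out. */
void resolve_url(const char *base, const char *link, char *out, size_t outlen) {
    if (strncmp(link, "http://", 7) == 0 || strncmp(link, "https://", 8) == 0) {
        snprintf(out, outlen, "%s", link);                 /* already absolute */
    } else if (link[0] == '/') {
        /* root-relative: keep scheme://host from the base URL */
        const char *scheme = strstr(base, "://");
        const char *path = scheme ? strchr(scheme + 3, '/') : NULL;
        size_t prefix = path ? (size_t)(path - base) : strlen(base);
        snprintf(out, outlen, "%.*s%s", (int)prefix, base, link);
    } else {
        /* sibling-relative: replace everything after the last '/' of the base path */
        const char *slash = strrchr(base, '/');
        const char *scheme = strstr(base, "://");
        if (slash && scheme && slash > scheme + 2) {
            snprintf(out, outlen, "%.*s/%s", (int)(slash - base), base, link);
        } else {
            snprintf(out, outlen, "%s/%s", base, link);    /* base has no path component */
        }
    }
}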
7. Index Serialization to Disk:
- Define a clear file format for storing the inverted index.
- Example format: Each line could be:
word doc_id1:freq1 doc_id2:freq2 ...
- A separate file could map doc_id to URL: doc_id1 URL1, doc_id2 URL2, ...
- Iterate through the hash table (inverted index). For each word, iterate through its posting list and write the information to the file(s).
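A sketch of writing that line-oriented format, iterating over the hash table from the step-5 sketch above (save_index is an illustrative name, and index_table, INDEX_BUCKETS, index_entry_t, and posting_t are the assumed structures from that sketch):

#include <stdio.h>

/* Write one line per word: "word doc_id1:freq1 doc_id2:freq2 ...".
 * Returns 0 on success, -1 if the output file cannot be opened. */
int save_index(const char *filename) {
    FILE *fp = fopen(filename, "w");
    if (fp == NULL) return -1;
    for (int b = 0; b < INDEX_BUCKETS; b++) {
        for (index_entry_t *e = index_table[b]; e != NULL; e = e->next) {
            fprintf(fp, "%s", e->word);
            for (posting_t *p = e->postings; p != NULL; p = p->next)
                fprintf(fp, " %d:%d", p->doc_id, p->freq);
            fprintf(fp, "\n");
        }
    }
    fclose(fp);
    return 0;
}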
Memory Management in C: This is a critical aspect. All dynamically allocated memory (for URLs in the queue, HTML content buffers, hash table entries, nodes in posting lists, and strings for words and URLs in the index) must be carefully allocated with malloc/calloc/strdup and deallocated with free to prevent memory leaks or corruption.
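For example, the inverted index sketched in step 5 could be torn down like this once it has been serialized (free_index is an illustrative name, reusing the assumed index_table, INDEX_BUCKETS, index_entry_t, and posting_t from that sketch):

#include <stdlib.h>

/* Release every entry, posting node, and duplicated word string in the index. */
void free_index(void) {
    for (int b = 0; b < INDEX_BUCKETS; b++) {
        index_entry_t *e = index_table[b];
        while (e != NULL) {
            index_entry_t *next_entry = e->next;
            posting_t *p = e->postings;
            while (p != NULL) {              /* free the posting list first */
                posting_t *next_posting = p->next;
                free(p);
                p = next_posting;
            }
            free(e->word);
            free(e);
            e = next_entry;
        }
        index_table[b] = NULL;
    }
}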
Illustrative Code Snippet
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
// #include <curl/curl.h> // For actual HTTP fetching

// For simplicity, network and robust HTML parsing are mocked or highly simplified.

#define MAX_URL_LEN 1024
#define MAX_DEPTH_CRAWL 2

// Define simple Queue and HashTable structures (implementations omitted for brevity)
// typedef struct QueueNode { char url[MAX_URL_LEN]; int depth; struct QueueNode* next; } QueueNode;
// typedef struct Queue { QueueNode *front, *rear; } Queue;
// typedef struct IndexNode { char* word; /* ... list of doc_ids/urls ... */ struct IndexNode* next; } IndexNode;
// typedef struct HashTable { IndexNode** table; int size; /* ... */ } HashTable;

// --- MOCK/SIMPLIFIED FUNCTIONS ---

char* fetch_page_content_mock(const char* url) {
    printf("MOCK_FETCH: %s\n", url);
    if (strcmp(url, "http://seed.com/page1") == 0) {
        return strdup("<html><title>Page 1</title><body>Hello <a href=\"page2.html\">Page 2</a> world. "
                      "<a href=\"http://external.com/other\">External</a></body></html>");
    } else if (strcmp(url, "http://seed.com/page2.html") == 0) {
        return strdup("<html><body>More content for page two. Another word.</body></html>");
    }
    return NULL;
}

// Extremely simplified link extractor
void extract_links_mock(const char* html_content, const char* base_url, char*** found_links, int* num_links) {
    *num_links = 0;
    // Very basic: find "href="
    const char* ptr = html_content;
    char temp_link[MAX_URL_LEN];
    // This mock will only find one link for simplicity of example
    if ((ptr = strstr(ptr, "href=\"")) != NULL) { // Escaped double quote here
        ptr += strlen("href=\"");
        const char* end_quote = strchr(ptr, '"');
        if (end_quote && (end_quote - ptr < MAX_URL_LEN - 1)) {
            strncpy(temp_link, ptr, end_quote - ptr);
            temp_link[end_quote - ptr] = '\0';
            // Rudimentary relative-to-absolute conversion (highly simplified)
            char absolute_link[MAX_URL_LEN];
            if (strncmp(temp_link, "http", 4) != 0) { // If relative
                snprintf(absolute_link, MAX_URL_LEN, "%s/%s", base_url, temp_link); // Needs proper path joining
            } else {
                strncpy(absolute_link, temp_link, MAX_URL_LEN);
                absolute_link[MAX_URL_LEN - 1] = '\0'; // Ensure termination
            }
            *found_links = (char**)malloc(sizeof(char*));
            (*found_links)[0] = strdup(absolute_link);
            *num_links = 1;
        }
    }
}

// Simplified text extraction and indexing
void index_page_text_mock(const char* html_content, const char* url /*, HashTable* inverted_index */) {
    printf("MOCK_INDEX: Text from %s. Words found (example): ", url);
    if (strstr(html_content, "Hello")) printf("Hello ");
    if (strstr(html_content, "world")) printf("world ");
    if (strstr(html_content, "content")) printf("content ");
    printf("\n");
    // In a real implementation, tokenize, normalize, and update the inverted_index HashTable
}

int main(int argc, char *argv[]) {
    // const char* seed_url = "http://seed.com/page1";
    // if (argc > 1) seed_url = argv[1];

    // Queue* url_queue = create_queue_implementation();            // Assume implemented
    // HashTable* visited_urls = create_hashtable_implementation(); // Assume implemented
    // HashTable* inverted_idx_table = create_hashtable_implementation(); // Assume implemented

    // enqueue(url_queue, seed_url, 0);
    // add_to_hashtable(visited_urls, seed_url); // Mark seed as visited

    // while (!is_queue_empty(url_queue)) {
    //     QueueNode* current_item = dequeue(url_queue);
    //     if (!current_item) break;
    //     if (current_item->depth > MAX_DEPTH_CRAWL) {
    //         free(current_item); // Free dequeued item
    //         continue;
    //     }
    //     printf("Crawling: %s at depth %d\n", current_item->url, current_item->depth);
    //     char* page_html = fetch_page_content_mock(current_item->url);
    //     if (page_html) {
    //         index_page_text_mock(page_html, current_item->url /*, inverted_idx_table */);
    //         char** extracted_urls = NULL;
    //         int num_extracted = 0;
    //         extract_links_mock(page_html, current_item->url, &extracted_urls, &num_extracted);
    //         for (int i = 0; i < num_extracted; i++) {
    //             if (!is_in_hashtable(visited_urls, extracted_urls[i])) {
    //                 enqueue(url_queue, extracted_urls[i], current_item->depth + 1);
    //                 add_to_hashtable(visited_urls, extracted_urls[i]);
    //             }
    //             free(extracted_urls[i]);
    //         }
    //         if (extracted_urls) free(extracted_urls);
    //         free(page_html);
    //     }
    //     free(current_item); // Free dequeued item
    // }
    // printf("Crawl finished.\n");
    // serialize_index_to_file(inverted_idx_table, "inverted_index.dat");
    // free_hashtable(inverted_idx_table);
    // free_hashtable(visited_urls);
    // free_queue(url_queue);

    // Mock output to show conceptual flow:
    printf("Conceptual crawl for http://seed.com/page1 at depth 0\n");
    index_page_text_mock("Hello Page 2 world. External", "http://seed.com/page1");
    printf(" Found link: http://seed.com/page2.html (depth 1)\n");
    printf(" Found link: http://external.com/other (depth 1, assume ignored or handled)\n");
    printf("Conceptual crawl for http://seed.com/page2.html at depth 1\n");
    index_page_text_mock("More content for page two.", "http://seed.com/page2.html");
    printf("Crawl finished.\n");

    return 0;
}
Key Concepts to Master
Web Crawling / Spiders
Breadth-First Search (BFS)
HTTP Protocol
HTML Parsing (Basic)
Inverted Index
Information Retrieval
Hash Tables
Queues
Dynamic Memory Management in C (malloc, free)
String Manipulation in C
URL Normalization
File I/O
How Our Tutors Elevate Your Learning
- C Data Structures: Guiding through the implementation of essential data structures like linked lists (for hash table chaining and posting lists), queues (for BFS), and hash tables (for visited set and inverted index), focusing on pointer manipulation and memory management.
- Using libcurl (if applicable): Assisting with setting up libcurl, making requests, and handling the fetched data via callbacks.
- HTML Parsing Strategies in C: Discussing trade-offs of simple string matching for link/text extraction versus more robust (but complex) parsing techniques. Helping implement a basic parser.
- Inverted Index Design: Clarifying how to structure the inverted index, particularly the posting lists (list of document IDs/URLs, potentially with frequencies).
- Memory Management (malloc/free): This is crucial. Helping debug memory leaks, double frees, or segmentation faults using tools like Valgrind or by careful code review.
- URL Normalization Logic: Working through the steps to convert relative URLs to absolute URLs and handle various URL formats.
- Serialization Format and Implementation: Assisting in designing a simple file format for the inverted index and writing the C code to save and potentially load it.