To extract and save all URLs from your domain into a sitemap.txt file using PHP, you can use the script below. It uses cURL to fetch each page and DOMDocument to parse the HTML and extract the links, following them recursively so that all URLs up to the third level are found and saved:
<?php
// The domain to crawl
$domain = 'https://yourdomain.com';

// Ensure the domain ends with a slash
if (substr($domain, -1) !== '/') {
    $domain .= '/';
}

// Function to get URLs from the page
function getUrls($url, $domain, $depth = 1, $maxDepth = 3) {
    static $visited = [];
    $urls = [];

    // If the URL has been visited or we've reached max depth, return
    if (isset($visited[$url]) || $depth > $maxDepth) {
        return [];
    }

    // Mark the URL as visited
    $visited[$url] = true;

    // Initialize cURL
    $ch = curl_init();

    // Set cURL options
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

    // Execute cURL request
    $html = curl_exec($ch);

    // Close cURL session
    curl_close($ch);

    // Check if the request was successful
    if ($html === false) {
        return $urls;
    }

    // Load HTML into DOMDocument
    $dom = new DOMDocument();

    // Suppress warnings due to malformed HTML
    @$dom->loadHTML($html);

    // Extract all anchor tags
    $links = $dom->getElementsByTagName('a');

    // Iterate through the links and extract URLs
    foreach ($links as $link) {
        $href = $link->getAttribute('href');

        // Check if the URL is relative or belongs to the same domain
        if (strpos($href, $domain) === 0) {
            // Absolute URL within the same domain
            $fullUrl = $href;
        } elseif (strpos($href, '/') === 0) {
            // Relative URL, prepend the domain
            $fullUrl = rtrim($domain, '/') . $href;
        } else {
            // Skip external links or invalid URLs
            continue;
        }

        // Add the full URL to the list
        $urls[] = $fullUrl;

        // Recursively get URLs from the linked page (increase depth)
        $urls = array_merge($urls, getUrls($fullUrl, $domain, $depth + 1, $maxDepth));
    }

    return array_unique($urls); // Return unique URLs
}

// Function to save URLs to a file
function saveUrlsToFile($urls, $filename) {
    // Open file for writing
    $file = fopen($filename, 'w');

    // Check if the file was successfully opened
    if ($file) {
        foreach ($urls as $url) {
            fwrite($file, $url . PHP_EOL);
        }
        fclose($file); // Close the file
    } else {
        echo "Failed to open file for writing.\n";
    }
}

// Get all URLs from the domain, up to the third level
$urls = getUrls($domain, $domain);

// Save URLs to sitemap.txt
saveUrlsToFile($urls, 'sitemap.txt');

echo "Sitemap saved to sitemap.txt\n";
?>
Explanation:
- getUrls($url, $domain, $depth, $maxDepth): This function fetches the content of a given URL using cURL, parses it with DOMDocument, and extracts all anchor tags (<a>). It keeps only the URLs that belong to the same domain, follows each of them recursively until $maxDepth is reached, and returns a unique list of the URLs found.
- saveUrlsToFile($urls, $filename): This function takes an array of URLs and saves them to a specified file, in this case sitemap.txt (a small filtering sketch after this list shows how to save only a subset).
- Execution: The script initializes by specifying the domain to crawl and then calls the functions to extract and save the URLs.
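If you only need part of the site in the sitemap, you can filter the array before saving it. The snippet below is a minimal sketch meant to be appended after the call to getUrls() in the script above; the /blog/ prefix is purely an example:

// Keep only URLs under /blog/ before writing the file (the prefix is illustrative)
$blogUrls = array_filter($urls, function ($url) use ($domain) {
    return strpos($url, $domain . 'blog/') === 0;
});

// Write the filtered list to a separate file
saveUrlsToFile($blogUrls, 'sitemap-blog.txt');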
Important:
- Replace https://yourdomain.com with your actual domain.
- The script follows links recursively, so pages linked from the home page (and from those pages in turn) are crawled as well, up to $maxDepth = 3 levels. To crawl deeper, pass a larger $maxDepth, as shown in the sketch after this list.
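For example, to crawl five levels instead of three, only the top-level call needs to change (the value 5 is just an illustration):

// Crawl up to five levels deep instead of the default three
$urls = getUrls($domain, $domain, 1, 5);

// Save the result as before
saveUrlsToFile($urls, 'sitemap.txt');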
Explanation of the modifications (building full URLs):
1. Domain with Slash Handling:
- The script ensures that the domain ends with a slash (/) to prevent issues when appending relative URLs.
2. URL Handling in getUrls():
- The script now checks if a URL is absolute (starts with the domain) or relative (starts with /).
- For relative URLs, the domain is prepended to create a complete URL.
3. Saving Full URLs:
- The URLs saved in sitemap.txt will all be fully qualified, including the domain name, ensuring that each URL is complete.
This way, whether the URL is relative (e.g., /about-us) or absolute (e.g., https://yourdomain.com/about-us), it will be saved correctly in the sitemap.txt file with the full path.
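Here is the same resolution logic isolated into a small standalone sketch; the href values are placeholders used only to show the two branches:

<?php
// Illustrative only: how the two branches build a full URL
$domain = 'https://yourdomain.com/';

$absoluteHref = 'https://yourdomain.com/about-us'; // starts with the domain
$relativeHref = '/about-us';                       // starts with a slash

if (strpos($absoluteHref, $domain) === 0) {
    echo $absoluteHref . PHP_EOL;                       // used as-is
}

if (strpos($relativeHref, '/') === 0) {
    echo rtrim($domain, '/') . $relativeHref . PHP_EOL; // domain prepended
}
?>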
Explanation of the modifications (recursive crawling with a depth limit):
1. Depth Handling:
- The getUrls() function now includes two additional parameters: $depth (current depth level) and $maxDepth (maximum depth to crawl).
- The function is called recursively, increasing the depth by 1 each time a new URL is followed.
2. Visited URLs:
- A static $visited array tracks URLs that have already been crawled, to avoid infinite loops and redundant requests (the guard is sketched in isolation after this list).
3. Recursive URL Crawling:
- After extracting URLs from the current page, the script recursively fetches URLs from the linked pages until the maximum depth is reached.
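As a standalone illustration of that guard, the hypothetical helper below isolates the visited/depth check; the real script inlines this logic at the top of getUrls() rather than using a separate function:

<?php
// visitOnce() is a hypothetical helper that mirrors the guard inside getUrls()
function visitOnce($url, $depth, $maxDepth = 3) {
    static $visited = [];

    // Refuse URLs that were already crawled or are deeper than allowed
    if (isset($visited[$url]) || $depth > $maxDepth) {
        return false;
    }

    // Remember the URL for the rest of the run
    $visited[$url] = true;
    return true;
}

var_dump(visitOnce('https://yourdomain.com/', 1));     // bool(true)  - first visit
var_dump(visitOnce('https://yourdomain.com/', 1));     // bool(false) - already visited
var_dump(visitOnce('https://yourdomain.com/deep', 4)); // bool(false) - beyond max depth
?>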
How It Works:
- Level 1: Fetches URLs from the home page.
- Level 2: Fetches URLs from the pages linked on the home page.
- Level 3: Fetches URLs from the pages linked on the Level 2 pages.
This will effectively generate a sitemap covering all URLs up to three levels deep within the specified domain, ensuring thorough coverage of the site structure.
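For reference, the resulting sitemap.txt is a plain text file with one fully qualified URL per line; with the placeholder domain it would look something like this (the paths are invented purely for illustration):

https://yourdomain.com/about-us
https://yourdomain.com/contact
https://yourdomain.com/blog/
https://yourdomain.com/blog/first-post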