
How to download a website and all its content?


There are situations where it is important to download an entire website, not just the final rendered result, but all of the HTML of the web pages along with resources such as CSS, scripts and images.

This might be because you want a backup of the code but can no longer access the original source, or perhaps you want a detailed record of how a website has changed over time.

Fortunately, GrabzIt's Web Scraper can achieve this by crawling every page on a website. On each web page, the scraper downloads the HTML along with any resources referenced on the page.

Creating a Scrape to Download an Entire Website

To make downloading your website easy, GrabzIt provides a scrape template.

To get started, load this template.

Enter the target URL; the system automatically checks this URL for errors. Keep the Automatically Start Scrape checkbox checked, and your scrape will start automatically.

Customizing Your Scrape

To alter the template, uncheck the Automatically Start Scrape checkbox. One possible alteration is to run the scrape on a regular schedule.

For instance, to create regular copies of a website, go to the Schedule Scrape tab, check the Repeat Scrape checkbox, and then select how frequently you want the scrape to repeat. Then click the Update button to start the scrape.

Using Your Downloaded Website

Once the web scrape has finished, you can download it as a ZIP file. Unzip the file to find a folder named Files containing all the downloaded web pages and website resources. There is also a special HTML page called data.html in the root of the directory. Open this file in a web browser and you will find an HTML table with three columns.

This file lets you map the new filenames back to their original locations. This is necessary because a URL cannot map directly onto a file structure: a URL can exceed the length limit for a file path.

There can also be many permutations of a URL, especially when a single web page can represent many different pieces of content depending on its query string parameters. We therefore store the website in a flat structure in the Files folder and provide the data.html file to map those files back to the original structure.
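GrabzIt's exact naming scheme is not described here, but the general technique of flattening arbitrary URLs into safe filenames can be sketched as follows. The `flat_filename` helper is a hypothetical illustration, not GrabzIt's implementation: it hashes the URL to get a short, unique, fixed-length name, which is exactly why a separate mapping file such as data.html or Website.csv is needed to recover the original URL.

```python
import hashlib
from urllib.parse import urlparse

def flat_filename(url: str) -> str:
    """Map an arbitrary URL to a short, filesystem-safe filename.

    Long URLs and query strings cannot be used directly as file paths,
    so a hash produces a unique fixed-length name; the URL-to-file
    mapping must then be recorded separately.
    """
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    # Keep the extension when one is present so tools can open the file.
    path = urlparse(url).path
    last_segment = path.rsplit("/", 1)[-1]
    ext = path[path.rfind("."):] if "." in last_segment else ".html"
    return digest + ext

# Two URLs that differ only by query string get distinct flat names.
print(flat_filename("https://example.com/page?sort=asc&page=2"))
print(flat_filename("https://example.com/styles/site.css"))
```

The same URL always hashes to the same name, so repeated scrapes overwrite rather than duplicate files.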

Of course, because of this you can't open a downloaded HTML page and expect to see the web page as it appeared online. To do that, you would need to rewrite the paths of the image, script and CSS resources referenced in the HTML file so that the web browser can find them in your local file structure. This is required to view the website offline.
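A minimal sketch of that rewriting step is shown below. It assumes you have already built a dictionary from original resource URL to local filename (for example, by parsing Website.csv); the `rewrite_resources` helper and the regex-based approach are illustrative only, since a real HTML parser would handle more edge cases.

```python
import re

def rewrite_resources(html: str, url_to_file: dict) -> str:
    """Replace src/href attribute values that match a scraped URL
    with the corresponding local filename from the Files folder.
    URLs not present in the mapping are left untouched."""
    def replace(match):
        attr, quote, url = match.group(1), match.group(2), match.group(3)
        local = url_to_file.get(url)
        return f"{attr}={quote}{local}{quote}" if local else match.group(0)

    return re.sub(r'(src|href)=(["\'])([^"\']+)\2', replace, html)

# Hypothetical mapping: original URL -> file inside the Files folder.
mapping = {"https://example.com/site.css": "Files/a1b2c3.css"}
page = '<link rel="stylesheet" href="https://example.com/site.css">'
print(rewrite_resources(page, mapping))
```

For large sites, you would run this over every downloaded HTML file, which makes the whole folder browsable offline.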

In the root of the ZIP file there is another file called Website.csv. This contains exactly the same information as the data.html file.

However, you can use this file if you want to read and process the website download programmatically, perhaps using the mapping from URLs to files to recreate the downloaded website.
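For example, the mapping in Website.csv could be used to rebuild the original site layout from the flat Files folder. This is a sketch under stated assumptions: the column names "URL" and "File" are guesses, so check the header row of your own Website.csv before running it.

```python
import csv
import shutil
from pathlib import Path
from urllib.parse import urlparse

def recreate_site(csv_path: str, files_dir: str, out_dir: str) -> None:
    """Copy each flat file back to a path mirroring its original URL.

    Assumes Website.csv has "URL" and "File" columns; adjust the
    column names to match the actual header of your download.
    """
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Turn /about/team.html into a relative output path;
            # a bare domain root becomes index.html.
            rel = urlparse(row["URL"]).path.lstrip("/") or "index.html"
            dest = Path(out_dir, rel)
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy(Path(files_dir, row["File"]), dest)
```

Note that this simple version ignores query strings, so URLs that differ only by parameters would collide; handling those would need extra logic.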