{ "cells": [ { "cell_type": "markdown", "id": "67b4eba1", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Yesterday was intense\n", "\n", "- We webscraped the population of Italian MPs.\n", " - The rough-API method\n", "- Presented some summary statistics:\n", " - Age\n", " - Gender\n", " - Regional representation" ] }, { "cell_type": "markdown", "id": "bdde60f2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "## Today let's be less intense\n", "\n", "- Finish summary statistics:\n", " - Education level\n", "- Look at other ways to webscrape.\n", " - Full-API\n", " - Full-API with cookie stealing\n", " - Selenium" ] }, { "cell_type": "markdown", "id": "08d46b00", "metadata": { "slideshow": { "slide_type": "slide" } }, "source": [ "## Do you remember how to access your data?\n", "\n", "- Either you go to your folder, right-click your data file and copy its path\n", "- Or you point Python at that folder (set it as the working directory) and import the file by name only" ] }, { "cell_type": "code", "execution_count": 2, "id": "080ec73e", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", " | name | \n", "party | \n", "birthdate | \n", "birthplace | \n", "education | \n", "
---|---|---|---|---|---|
52 | \n", "BRAGA Chiara | \n", "PD-IDP | \n", "1979-09-02 | \n", "COMO | \n", "Laurea in pianificazione territoriale, urbanis... | \n", "
99 | \n", "COLOMBO Beatriz | \n", "FDI | \n", "1978-03-13 | \n", "RIMINI (FORLI') | \n", "Laurea in psicologia | \n", "
181 | \n", "GIGLIO VIGNA Alessandro | \n", "LEGA | \n", "1980-12-13 | \n", "IVREA (TORINO) | \n", "Diploma di istituto tecnico commerciale | \n", "
281 | \n", "PADOVANI Marco | \n", "FDI | \n", "1959-03-25 | \n", "VERONA | \n", "Diploma di maturità d'arte applicata | \n", "
71 | \n", "CARE' Nicola | \n", "PD-IDP | \n", "1960-07-31 | \n", "GUARDAVALLE (CATANZARO) | \n", "Imprenditore | \n", "
\n", " | id | \n", "codConcesion | \n", "fechaConcesion | \n", "aplicacionPresupuestaria | \n", "beneficiario | \n", "instrumento | \n", "importe | \n", "ayudaEquivalente | \n", "urlBR | \n", "tieneProyecto | \n", "numeroConvocatoria | \n", "idConvocatoria | \n", "convocatoria | \n", "descripcionCooficial | \n", "nivel1 | \n", "nivel2 | \n", "nivel3 | \n", "codigoInvente | \n", "idPersona | \n", "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", "105680696 | \n", "105680696 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "****4261* HILDEGARD SCHUMANN MARLIES | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "37600.00 | \n", "37600.00 | \n", "https://www.caib.es/eboibfront/?lang=es | \n", "False | \n", "662529 | \n", "864089 | \n", "PROGRAMA3 Y 4 RD 853/2021 | \n", "PROGRAMES 3 I 4 RD 853&2021 | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DE ARQUITECTURA Y VIVIENDA | \n", "None | \n", "None | \n", "18646722 | \n", "
1 | \n", "105680692 | \n", "105680692 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B57885014 BLANC DE GRIS CALA DOR SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "1237.50 | \n", "1237.50 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "7328022 | \n", "
2 | \n", "105680691 | \n", "105680691 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B57483265 COMENSALS MENJADORS ESCOLARS SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "5292.24 | \n", "5292.24 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "11830624 | \n", "
3 | \n", "105680689 | \n", "105680689 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B16571507 HOTEL BOUTIQUE BOSCH SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "4125.00 | \n", "4125.00 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "13682579 | \n", "
4 | \n", "105680686 | \n", "105680686 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B07917495 INFO MIRBEN SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "1862.35 | \n", "1862.35 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "5587408 | \n", "
5 | \n", "105680685 | \n", "105680685 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "***4931** JAIME CARBONELL ROSSELLO | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "2062.50 | \n", "2062.50 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "11694934 | \n", "
6 | \n", "105680684 | \n", "105680684 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "***2287** CATALINA GALMES RANDALL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "7795.76 | \n", "7795.76 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "12890170 | \n", "
7 | \n", "105680681 | \n", "105680681 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B67899468 REYNES SHOP SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "4125.00 | \n", "4125.00 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "18646717 | \n", "
8 | \n", "105680680 | \n", "105680680 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B07215692 GRAVILLERA SON CHIBETLI SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "20727.41 | \n", "20727.41 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "5585784 | \n", "
9 | \n", "105680679 | \n", "105680679 | \n", "2024-05-08T00:00:00+02:00 | \n", "None | \n", "B72835382 GUIXOS AULA ESTUDI SL | \n", "SUBVENCIÓN y ENTREGA DINERARIA SIN CONTRAPREST... | \n", "1237.50 | \n", "1237.50 | \n", "http://www.caib.es/eboibfront/es/2011/7643/sec... | \n", "False | \n", "745703 | \n", "947263 | \n", "Resolución del consejero de Economia, Hacienda... | \n", "Resolució del conseller d'Economia, Hisenda i ... | \n", "ILLES BALEARS | \n", "DIRECCIÓN GENERAL DEL TESORO, POLÍTICA FINANCI... | \n", "None | \n", "None | \n", "18646715 | \n", "
\\nGenerated by cloudfront (CloudFront)\\nRequest ID: iORg8_SQwr-ELvbWfzgU4G5PPu3VWfwvt7U552pV7kmdPy8kHa1xzw==\\n\\n\\n\\n'\n" ] } ], "source": [ "import requests\n", "# URL of the JSON database\n", "url = \"https://dait.interno.gov.it/documenti/trasparenza/POLITICHE_20220925/CAMERA_ITALIA_20220925/CAMERA_ITALIA_20220925_pluri.json?_=630966\"\n", "json_database = requests.get(url)\n", "print(json_database.content)" ] }, { "cell_type": "markdown", "id": "3b79101a", "metadata": {}, "source": [ "A response status code starting with 4 means that there's a client-side error. Let's have a look at it." ] }, { "cell_type": "code", "execution_count": 7, "id": "13a675e5", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "data": { "text/plain": [ "b'\\n\\n
\\nGenerated by cloudfront (CloudFront)\\nRequest ID: iORg8_SQwr-ELvbWfzgU4G5PPu3VWfwvt7U552pV7kmdPy8kHa1xzw==\\n\\n\\n\\n'" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "json_database.content" ] }, { "cell_type": "markdown", "id": "481352bb", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "In general, if you get that error more than once, it means that you are not authorized to access the website.\n", "\n", "- When you can access the page with your browser but not with Python, you can probably do something about it.\n", "\n", " - When you connect to a webpage, you make a request to a website.\n", " - Whenever you request data online, you have to introduce yourself: this is what `headers` are for, and among the `headers`, `cookies` in particular.\n", " - The default headers and cookies Python sends differ from the ones your browser sends.\n", "\n", "So you just have to pretend to be someone you're not. How do we do that?" ] }, { "cell_type": "markdown", "id": "4fa07dfc", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "We copy the cookies from a request made by the browser. Getting to the cookies is convoluted:\n", "\n", "1. Right-click on any webpage, then select `inspect`\n", "2. In the panel that opens, you'll find a tab called `network`\n", "3. Click on `network` and refresh the page\n", "4. You'll see all the requests \"made by the webpage\" to external sources. Locate the requests of JSON type (in general)\n", "5. There you can go to `headers` and copy and paste whatever is there" ] }, { "cell_type": "markdown", "id": "db2e04f2", "metadata": { "slideshow": { "slide_type": "subslide" } }, "source": [ "Organizing the headers is tedious: they must be arranged in a dictionary." 
] }, { "cell_type": "code", "execution_count": 8, "id": "a6aef3ac", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [], "source": [ "raw_headers = \"\"\"\n", "Host: dait.interno.gov.it\n", "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0\n", "Accept: application/json, text/javascript, */*; q=0.01\n", "Accept-Language: fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3\n", "Accept-Encoding: gzip, deflate, br\n", "Content-Type: application/json\n", "X-Requested-With: XMLHttpRequest\n", "Connection: keep-alive\n", "Referer: https://dait.interno.gov.it/elezioni/trasparenza/elezioni-politiche-2022\n", "Cookie: cookie-agreed-version=1.0.0; _ga_78EEYDJ064=GS1.1.1715148396.2.1.1715149155.0.0.0; _ga=GA1.3.1789629481.1715097354; _gid=GA1.3.362426513.1715097354; cookie-agreed=2; cookie-agreed-categories=%5B%22mandatory%22%2C%22misurazione_cookie_di_tracciamento_%22%5D; SSESS908d44425a37cd2efb28fa8e34cd7e07=FTdZLnFiR6otToy3fHrpCtOfXYVyXR2k1v35AS0vHjA; BIGipServerpool-dait=3793644810.48129.0000; _gat_gtag_UA_39126249_2=1\n", "Sec-Fetch-Dest: empty\n", "Sec-Fetch-Mode: cors\n", "Sec-Fetch-Site: same-origin\n", "TE: trailers\n", "\"\"\".strip()\n", "split_lines = [h.split(\": \", 1) for h in raw_headers.splitlines()]\n", "headers = dict(split_lines)" ] }, { "cell_type": "code", "execution_count": 9, "id": "817c9e68", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'Host': 'dait.interno.gov.it', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) Gecko/20100101 Firefox/125.0', 'Accept': 'application/json, text/javascript, */*; q=0.01', 'Accept-Language': 'fr,fr-FR;q=0.8,en-US;q=0.5,en;q=0.3', 'Accept-Encoding': 'gzip, deflate, br', 'Content-Type': 'application/json', 'X-Requested-With': 'XMLHttpRequest', 'Connection': 'keep-alive', 'Referer': 'https://dait.interno.gov.it/elezioni/trasparenza/elezioni-politiche-2022', 'Cookie': 
'cookie-agreed-version=1.0.0; _ga_78EEYDJ064=GS1.1.1715148396.2.1.1715149155.0.0.0; _ga=GA1.3.1789629481.1715097354; _gid=GA1.3.362426513.1715097354; cookie-agreed=2; cookie-agreed-categories=%5B%22mandatory%22%2C%22misurazione_cookie_di_tracciamento_%22%5D; SSESS908d44425a37cd2efb28fa8e34cd7e07=FTdZLnFiR6otToy3fHrpCtOfXYVyXR2k1v35AS0vHjA; BIGipServerpool-dait=3793644810.48129.0000; _gat_gtag_UA_39126249_2=1', 'Sec-Fetch-Dest': 'empty', 'Sec-Fetch-Mode': 'cors', 'Sec-Fetch-Site': 'same-origin', 'TE': 'trailers'}\n" ] } ], "source": [ "print(headers)" ] }, { "cell_type": "code", "execution_count": 10, "id": "55258cce", "metadata": { "slideshow": { "slide_type": "subslide" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "