
How Do I Extract A Dataframe From A Website

The website (poder360.com.br/banco-de-dados) has a lot of filters that generate a dataframe based on what you select in those filters. I'm trying to extract this dataframe in Python.

Solution 1:

I just tested with some random filter values and checked the network requests in Firefox; when you click on Pesquisar, this request is sent:

https://pesquisas.poder360.com.br/web/consulta/fetch?unidades_federativas_id=15&regioes_id=2&cargos_id=2&institutos_id=3&data_pesquisa_de=2021-09-22&data_pesquisa_ate=2021-09-23&turno=T&tipo_id=T&candidatos_id=1&order_column=ano&order_type=asc

Of course it didn't return any data for my random selection, and it's not an easy task because you have to know all the ids for each parameter.
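
If you do know the ids you need, a minimal sketch of replicating that request in Python could look like the code below. The parameter values are simply the ones from the example URL above, and I'm assuming the endpoint returns JSON that pandas can flatten; adjust to whatever the real response looks like.

import requests
import pandas as pd

# Same endpoint the Pesquisar button calls (taken from the network tab).
url = "https://pesquisas.poder360.com.br/web/consulta/fetch"

# Example filter values copied from the request above -- you still have to
# discover the ids that match the filters you actually want.
params = {
    "unidades_federativas_id": 15,
    "regioes_id": 2,
    "cargos_id": 2,
    "institutos_id": 3,
    "data_pesquisa_de": "2021-09-22",
    "data_pesquisa_ate": "2021-09-23",
    "turno": "T",
    "tipo_id": "T",
    "candidatos_id": 1,
    "order_column": "ano",
    "order_type": "asc",
}

resp = requests.get(url, params=params)
resp.raise_for_status()
data = resp.json()  # assumption: the endpoint returns JSON

# If the payload is a record or list of records, pandas can turn it into a
# dataframe; tweak this once you see the real response structure.
df = pd.json_normalize(data)
print(df.head())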

You should also contact them to see if they can provide an API for easier access.

Solution 2:

What you can do is call the URL behind each of the filter fields on the page. For example, for the region filter, the URL that returns the list of options is https://pesquisas.poder360.com.br/web/regiao/enum?q=

This returns the following JSON:

{"current_page":1,"data":[{"id":1,"descricao":"Regi\u00e3o Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Centro-Oeste"},{"id":2,"descricao":"Regi\u00e3o Nordeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Nordeste"},{"id":3,"descricao":"Regi\u00e3o Norte","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte"},{"id":4,"descricao":"Regi\u00e3o Sudeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sudeste"},{"id":5,"descricao":"Regi\u00e3o Sul","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sul"},{"id":6,"descricao":"Nacional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Nacional"},{"id":7,"descricao":"Regional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regional"},{"id":8,"descricao":"Regi\u00e3o Norte_Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte_Centro-Oeste"},{"id":9,"descricao":"Municipal","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Municipal"}],"first_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","from":1,"last_page":1,"last_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","next_page_url":"null","path":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum","per_page":15,"prev_page_url":"null","to":9,"total":9}

Note that I have replaced every null with "null" so that pasting the response into Python does not trigger an error.
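
You don't actually have to paste the JSON into your script or edit the nulls by hand: if you fetch the URL with requests, response.json() converts JSON null to Python None automatically. A minimal sketch:

import requests

url = "https://pesquisas.poder360.com.br/web/regiao/enum?q="
res = requests.get(url).json()  # JSON null becomes Python None, no manual editing needed

for item in res["data"]:
    print(item["id"], item["descricao"])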

You can then extract the data using the following code:

res = {"current_page":1,"data":[{"id":1,"descricao":"Regi\u00e3o Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Centro-Oeste"},{"id":2,"descricao":"Regi\u00e3o Nordeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Nordeste"},{"id":3,"descricao":"Regi\u00e3o Norte","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte"},{"id":4,"descricao":"Regi\u00e3o Sudeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sudeste"},{"id":5,"descricao":"Regi\u00e3o Sul","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Sul"},{"id":6,"descricao":"Nacional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Nacional"},{"id":7,"descricao":"Regional","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regional"},{"id":8,"descricao":"Regi\u00e3o Norte_Centro-Oeste","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Regi\u00e3o Norte_Centro-Oeste"},{"id":9,"descricao":"Municipal","created_at":"2018-06-12 01:54:30","updated_at":"2018-06-12 01:54:30","deleted_at":"null","text":"Municipal"}],"first_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","from":1,"last_page":1,"last_page_url":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum?page=1","next_page_url":"null","path":"https:\/\/pesquisas.poder360.com.br\/web\/regiao\/enum","per_page":15,"prev_page_url":"null","to":9,"total":9}
for i in range(len(res['data'])):
    print(res['data'][i]['descricao'])

This prints the following output:

Região Centro-Oeste
Região Nordeste
Região Norte
Região Sudeste
Região Sul
Nacional
Regional
Região Norte_Centro-Oeste
Municipal
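
Since the goal is a dataframe, the same data list can be loaded straight into pandas. A short sketch, assuming pandas is installed:

import pandas as pd

# res is the dictionary from the code above; keep only the useful columns.
df = pd.DataFrame(res["data"])[["id", "descricao"]]
print(df)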

Now you just need to put all the relevant URLs in a list and run the same script over each of them.
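
As a sketch of that last step: only the regiao endpoint is confirmed above, so the other entries in the mapping below are guesses that follow the same pattern and should be checked against the network tab before you rely on them.

import requests

BASE = "https://pesquisas.poder360.com.br/web"

# Only the "regiao" endpoint is confirmed; the commented-out ones are
# assumptions that follow the same naming pattern.
endpoints = {
    "regiao": BASE + "/regiao/enum?q=",
    # "cargo": BASE + "/cargo/enum?q=",          # assumption
    # "instituto": BASE + "/instituto/enum?q=",  # assumption
}

lookups = {}
for name, url in endpoints.items():
    data = requests.get(url).json()["data"]
    # Map each option's label to its id so you can build the fetch query parameters.
    lookups[name] = {item["descricao"]: item["id"] for item in data}

print(lookups["regiao"])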
