Following on from the article written by Jean-Marc Manach on finding confidential documents on Google and the trick provided by Frederic Raynal from Quarkslab, we will show how we can go a little further with TaDaweb and, more interestingly, how we can automate the process.
In his article, Jean-Marc explains how you can find confidential documents by using the search query:
filetype:pdf inurl:gouv.fr “ne pas diffuser”
If you try this search query on Google.com, you will get approximately 1,000 results, with 10 results per page.
Now let’s see how we can automate this process. First we launch the TaDaweb Creator and enter the search query in the Google plugin integrated in TaDaweb. We simply drag and drop the Google icon onto the whiteboard:
Then we enter the search query, exactly the same as the one mentioned in Jean-Marc’s article:
We can choose several options, such as the result type (if we choose PDF there is no need to add filetype:pdf to the query string). We can choose the country, which is similar to using google.fr, google.be, google.lu etc. One interesting option is to get only recent results by setting MAX AGE to one month and sorting by RECENT. This is particularly useful for detecting document leaks.
Here we will choose as options:
- 100 results
- PDF file
- Country: France
- Language: French
- max age: one month
- sort by recent
We will also replace inurl:gouv.fr with site:gouv.fr, which avoids results from a host such as gouv.fr.marcelpagnol.fr, which would be a website about “marcelpagnol” and not a government site.
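To see why the two operators behave differently, here is a minimal Python sketch. It only approximates the idea (Google's real matching is more involved): `matches_inurl` and `matches_site` are hypothetical helper names, and the path `document.pdf` on the second URL is made up for illustration.

```python
from urllib.parse import urlparse


def matches_inurl(url: str, needle: str) -> bool:
    """Rough stand-in for inurl: the text may appear anywhere in the URL."""
    return needle in url


def matches_site(url: str, domain: str) -> bool:
    """Rough stand-in for site: the host must be the domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == domain or host.endswith("." + domain)


urls = [
    "http://www.defense.gouv.fr/guide-medias-sociaux/telecharger.pdf",
    "http://gouv.fr.marcelpagnol.fr/document.pdf",  # hypothetical path
]

for url in urls:
    print(url, matches_inurl(url, "gouv.fr"), matches_site(url, "gouv.fr"))
```

Both URLs contain the text gouv.fr, but only the first one has a host name that ends in gouv.fr, so only the first one passes the site-style check.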
And we get... zero results! This is a good sign for the government: no document marked “ne pas diffuser” has been indexed by Google recently. The request is still worth saving, as we can replay it every hour, day, week or month with TaDaweb and thus be alerted by email as soon as Google finds a new “ne pas diffuser” document.
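The idea behind replaying a saved request and alerting on new content can be sketched in a few lines of Python. This is not how TaDaweb implements it; `new_documents` is a hypothetical helper and the example.gouv.fr URLs are placeholders, not real results.

```python
def new_documents(previous_urls, current_urls):
    """Return the URLs found in the current run that were absent from the previous run."""
    seen = set(previous_urls)
    fresh = []
    for url in current_urls:
        if url not in seen:
            seen.add(url)  # also skips duplicates inside the current run
            fresh.append(url)
    return fresh


# Placeholder data standing in for two successive runs of the same query.
yesterday = ["http://www.example.gouv.fr/a.pdf"]
today = ["http://www.example.gouv.fr/a.pdf", "http://www.example.gouv.fr/b.pdf"]

for url in new_documents(yesterday, today):
    print("ALERT: new document indexed:", url)
```

An empty result, as in the query above, simply means that no alert is sent for that run.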
Let’s change our query by removing the “max age one month”:
This time we obtain 100 results as expected:
Now let’s try to do the same query with BING:
We now have 150 results: 100 from Google and 50 from BING. Now we are going to analyze them a little bit further.
We start by merging them:
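Outside TaDaweb, the same merge can be pictured as a plain concatenation of the two result lists. The sketch below uses generated placeholder results (the example.gouv.fr URLs and the `source` field are assumptions), and `merge_results` is a hypothetical helper.

```python
def merge_results(*result_lists):
    """Concatenate several result lists into one, keeping their order."""
    merged = []
    for results in result_lists:
        merged.extend(results)
    return merged


# Placeholder results: 100 from one engine, 50 from the other.
google_results = [
    {"source": "google", "url": f"http://www.example.gouv.fr/g{i}.pdf"} for i in range(100)
]
bing_results = [
    {"source": "bing", "url": f"http://www.example.gouv.fr/b{i}.pdf"} for i in range(50)
]

merged = merge_results(google_results, bing_results)
print(len(merged))  # → 150
```

Note that a plain concatenation keeps a document returned by both engines twice, which matches the count of 150 given above.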
We then want to separate the link and the description in each result. We also want to know which government department the document comes from.
As an example, if we have the URL “www.defense.gouv.fr/guide-medias-sociaux/telecharger.pdf” we want to extract “defense.gouv.fr”.
We connect a loop to the list of 150 results:
The loop function allows us to iterate over our list and apply the same transformation to each element.
What we want to do is extract the title (Document provisoire, ne pas diffuser), the URL just below it, and finally the description.
We connect two Extract Links to the list element icon: one to extract the link title and one to extract the link URL:
With the first Extract Links we extract the title:
With the second Extract Links we have the URL behind the link:
Now we want to extract “sante-jeunesse-sports.gouv.fr”. For this we will use what we call a REGEX (regular expression), which allows advanced string transformations. The REGEX we will use is “^([^/]+)/.*$”. All we have to do is enter this REGEX in the transform dialog:
The transform tool enables us to perform many transformations on our data. In our example we also deleted “http://”, “https://” and “www” from the URL, so that the REGEX captures only the host name.
We now obtain exactly the string we wanted:
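For readers who want to try the same transformation outside TaDaweb, here is a minimal Python sketch. The second pattern is the REGEX from the article; the first pattern, which strips the scheme and a leading "www." (including the dot), is my reading of the deletion step, and `root_domain` is a hypothetical helper name.

```python
import re

# Assumed clean-up step: drop "http://" or "https://" and a leading "www."
PREFIX = re.compile(r"^(?:https?://)?(?:www\.)?")
# The REGEX from the article: capture everything before the first slash.
ROOT = re.compile(r"^([^/]+)/.*$")


def root_domain(url: str) -> str:
    """Return the host part of a URL, without scheme or leading 'www.'."""
    stripped = PREFIX.sub("", url, count=1)
    match = ROOT.match(stripped)
    # A URL without any path has no slash left, so fall back to the stripped string.
    return match.group(1) if match else stripped


print(root_domain("www.defense.gouv.fr/guide-medias-sociaux/telecharger.pdf"))
# → defense.gouv.fr
```

Without the clean-up step, the REGEX applied to a URL starting with "http://" would capture "http:" rather than the host, which is why the prefixes are removed first.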
If we now connect the two Extract Links to the “Add” and execute the loop, we have a nice table with the title and the root URL:
This “TaDa”, which took 30 seconds to create, can now be saved:
We click SAVE and give it a name and a description:
The TaDa is now in the system:
- We can replay it automatically and on demand
- We can configure the TaDa to send an email alert when there is new content
- We can use it in a newsletter for distribution
- We can retrieve the result through an API in JSON format and save it to an external database
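As an illustration of the last point, the sketch below stores a JSON result in a SQLite database using only the Python standard library. The TaDaweb API itself is not shown: the JSON shape (a list of objects with `title` and `root_url` keys), the table name and the `save_results` helper are all assumptions, and the payload is a hand-written sample.

```python
import json
import sqlite3

# Hand-written sample payload; the real API response format may differ.
payload = (
    '[{"title": "Document provisoire, ne pas diffuser", '
    '"root_url": "sante-jeunesse-sports.gouv.fr"}]'
)


def save_results(conn, json_text):
    """Parse a JSON array of results and insert each one into the documents table."""
    rows = json.loads(json_text)
    conn.execute("CREATE TABLE IF NOT EXISTS documents (title TEXT, root_url TEXT)")
    conn.executemany(
        "INSERT INTO documents (title, root_url) VALUES (:title, :root_url)", rows
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
print(save_results(conn, payload), "row(s) saved")
```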
All we have to do is go to tadaweb.com.