If you had to reduce the Open Data model down to a single word, you could do a lot worse than to pick ‘re-use’. Open Data is concerned with a lot of things – transparency, increased data dissemination – but one could argue that the most basic objective of the model is to bring about the greater use of existing and future datasets. And in many ways, all of the experiments carried out as part of the linkedarc.net project are illustrations of this principle in action. The visualisation of the Priniatikos Pyrgos stratigraphy using the Gephi tool, the creation of a web map to visualise the distribution of the loomweights found at Priniatikos Pyrgos and the analysis of the British Museum’s cuneiform collection are fundamentally exercises in data re-use. RDF and SPARQL can be difficult infrastructures to put in place but once established it is reasonably straightforward to exploit their data in any number of different ways.

The following blog post presents yet another example of how the linkedarc.net dataset can be accessed. It outlines the design of a Twitter bot that converts natural language queries into SPARQL queries and directs them at the linkedarc.net SPARQL interface. Essentially, this opens up the linkedarc.net dataset to natural language querying while also connecting the data to one of the world's biggest social networks.

The basic principles of the project

Twitter bots are not unknown in the Twittersphere. In fact, once the network became popular, the advantages of employing algorithms to carry out tasks that would otherwise require human intervention became fairly obvious. For instance, why pay a human to promote your brand on Twitter by engaging with other Twitter users when a small program can do the same job for a much smaller financial outlay?

Of course, the ideal of the Twitter bot and its reality do not always match up. It can often be fairly straightforward to spot a bot at work. Have you ever been followed by another Twitter user who has a fondness for (in fact, can only manage) talking in well-known sayings or old adages? If you have, then it is likely that you have been targeted by a bot. It is a lot easier for a coder to build an app that takes its posts from a list of stock statements than it is to write code that allows for a more natural-sounding Twitter conversation. The problem becomes even more complicated once the bot needs to maintain a two-way conversation with a real human account. Doing this takes time and effort, and only now are we beginning to see the emergence of interfaces of this sophistication. Take Google's current search interface as an example. But even here, Google only has to respond to a single question; it does not have to maintain that conversation over a number of different dialogue phases.

While I knew that achieving this level of machine nous would be difficult (and the results to date show that there is still a long way to go), I believe this is the general direction in which human-machine interfacing is going. In the case of the linkedarc.net project, there are a number of ways of accessing the data. The user can use the web app's faceted search interface, write SPARQL queries or even download a raw dump of the entire dataset and interrogate it in any way they wish. While we have become conditioned to using these types of methods when we access online data resources, they all presuppose that the user will invest a certain amount of time and effort in learning the particular approach. If the need is great enough – for instance, if we need the data to complete the project that we are involved in – then we can accept this extra demand. But often the need is not deemed sufficient to warrant the investment, or the method is too esoteric, and ultimately the user is never fully capable of repaying that investment in tangible data outcomes.

In theory at least, the natural language query gets around both the initial learning curve and the potential for missed opportunities during the engagement. Once you are able to read and write, you are capable of interacting with another digital agent using natural language. Ideally, the Twitter agent that answers the linkedarc.net questions would be a human, but for the reasons described above this is not always possible. In any case, linkedarc.net is a test-bed for new research and this is the sort of experiment that is worth attempting. The goal is to see whether a Twitter bot can be built that is sufficiently adept at interpreting natural language queries and translating them into SPARQL queries. If that can be achieved, the SPARQL system is already in place to answer the query. Remember, this is about re-using existing services in new ways, and providing a new interface layer such as this is a classic example of digital service re-use.

System design

[Diagram: Twitter bot system design]

The diagram above describes the basic requirements of the Twitter bot question server system. A Twitter user first sends a tweet to the @linkedarc Twitter account. The tweet must include the #question hashtag. The Twitter bot is a simple Python app. It polls the Twitter Web API using the Tweepy library, asking whether any new tweets have been directed at the @linkedarc user since the bot was last run. If it finds a new tweet, it checks whether the #question hashtag has been included. If it has, the bot then checks whether it can find a query match within its query database, which it does using regular expressions. Once a match is found, the bot takes the SPARQL query that corresponds to the natural language question and constructs a tweet response, which it sends to the questioner's account. The questioner can then click on the embedded link, which runs the SPARQL query against the linkedarc.net dataset and displays the results. Otherwise, if the bot can't recognise the question, it tells the questioner that it can't help them this time. Here's an example tweet:

@linkedarc How many loomweights are there at Priniatikos Pyrgos? #question
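
To give a rough idea of the matching step, here is a minimal sketch of the regular-expression approach described above. The pattern list and the match_question() helper are my own illustrative placeholders, not the bot's actual query database.

import re

# A minimal sketch of the regex-based question matching described above.
# The patterns and helper name are illustrative placeholders, not the bot's
# actual query database.
QUESTION_PATTERNS = [
    (re.compile(r"how many (\w+) are there at (.+)", re.IGNORECASE), "count_finds"),
    (re.compile(r"list all the (\w+) at (.+)", re.IGNORECASE), "list_finds"),
    (re.compile(r"who is (.+)", re.IGNORECASE), "describe_person"),
]

def match_question(tweet_text):
    """Strip the @mention, the #question hashtag and any question mark, then try each pattern."""
    question = re.sub(r"@\w+|#question|\?", "", tweet_text).strip()
    for pattern, query_id in QUESTION_PATTERNS:
        match = pattern.search(question)
        if match:
            return query_id, match.groups()
    return None  # no match: the bot replies that it can't help this time

For the example tweet above, match_question() would return the "count_finds" identifier along with the captured groups ("loomweights", "Priniatikos Pyrgos").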

Here are a couple of example questions and the SPARQL queries that they generate:

List all the loomweights at Priniatikos Pyrgos

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX crmeh: <http://purl.org/crmeh#>
PREFIX la_vocabs: <http://linkedarc.net/vocabs/>
PREFIX la: <http://linkedarc.net/ontology/>

SELECT DISTINCT ?URI ?label WHERE {
  ?URI la:project <http://linkedarc.net/data/la_pp>.
  ?URI ecrm:P2_has_type la_vocabs:findtype-loomweight.
  ?URI rdfs:label ?label
}

Who is Frank Lynam?

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX la: <http://linkedarc.net/ontology/>

SELECT DISTINCT ?personComment WHERE {
  ?person rdf:type foaf:Person.
  ?person rdfs:comment ?personComment.
  ?person rdfs:label ?personName.
  FILTER (REGEX(str(?personName), "Frank Lynam", 'i')).
}
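
The reply that the bot sends back embeds a link that runs the matched SPARQL query against the linkedarc.net interface. A minimal sketch of how such a link might be built is shown below; the endpoint URL and the query parameter name are assumptions rather than the actual linkedarc.net values.

from urllib.parse import urlencode

# Sketch of composing the bot's reply tweet. The endpoint URL and the 'query'
# parameter name are assumptions; the real linkedarc.net SPARQL interface may
# use different ones.
SPARQL_UI_URL = "http://linkedarc.net/sparql"  # hypothetical query interface

def build_reply(screen_name, sparql_query):
    """Return a reply tweet containing a link that pre-fills and runs the query."""
    link = SPARQL_UI_URL + "?" + urlencode({"query": sparql_query})
    return "@{} here are your results: {}".format(screen_name, link)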

Code structure

The Python project contains three files: one with the main question server function, another containing the TwitterAPI class that wraps the Tweepy library and, lastly, a file for the SPARQLQueryMaker class, which matches the questions to their SPARQL equivalents. Most of the code in these should be fairly self-explanatory and they can all be downloaded from Bitbucket here. You will need to swap in your own Twitter API credentials, which you get once you register an app with Twitter (see here for more). I made another Python file, which I keep in a location that can't be accessed by external users of the site – this bot runs on the linkedarc.net server. The bot itself can be run by invoking the python command followed by the location of the twitterbot_questionserver.py file. If you want to run the bot every minute or so, you can use the Linux utility crontab to schedule it.
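
To give a flavour of how these pieces fit together, here is a condensed sketch of the polling loop, based on the description above rather than on the repository itself. The credential placeholders and the lookup_sparql() helper (standing in for the SPARQLQueryMaker class) are assumptions; match_question() and build_reply() refer to the earlier sketches.

import tweepy

# Condensed sketch of twitterbot_questionserver.py as described above. The
# credential values are placeholders, and lookup_sparql() stands in for the
# SPARQLQueryMaker class in the Bitbucket repository.
CONSUMER_KEY = "..."
CONSUMER_SECRET = "..."
ACCESS_TOKEN = "..."
ACCESS_TOKEN_SECRET = "..."

def run_question_server(since_id=None):
    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
    auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)
    api = tweepy.API(auth)

    # Ask Twitter for any new tweets directed at @linkedarc since the last run.
    for tweet in api.mentions_timeline(since_id=since_id):
        hashtags = [h["text"].lower() for h in tweet.entities.get("hashtags", [])]
        if "question" not in hashtags:
            continue  # only tweets carrying the #question hashtag are handled

        matched = match_question(tweet.text)
        if matched:
            query_id, args = matched
            sparql_query = lookup_sparql(query_id, args)  # fetch the stored SPARQL query
            reply = build_reply(tweet.user.screen_name, sparql_query)
        else:
            reply = "@{} sorry, I can't help you this time.".format(tweet.user.screen_name)

        api.update_status(status=reply, in_reply_to_status_id=tweet.id)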

Other Twitter + SPARQL web bot ideas

Of course, there are lots of ways that a Twitter bot can interface with an RDF/SPARQL-based system. For instance, you could write a script that asks the SPARQL engine for a particular type of resource. I have done this for the archaeological data on linkedarc.net by writing a simple Python Twitter bot that posts a tweet with a link to any one of the images stored on the system. The SPARQL query that gets this information is shown below:

PREFIX ecrm: <http://erlangen-crm.org/current/>
PREFIX la_pp_ont: <http://linkedarc.net/ontology/la_pp/>
PREFIX foaf:  <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?imageSrc ?imageName
WHERE {
  ?image a la_pp_ont:LA_E18_Image .
  ?image rdfs:label ?imageName .
  ?image foaf:depiction ?imageSrc
} OFFSET 100 LIMIT 1

This asks linkedarc.net for a single image resource, skipping the first 100 results, and also gets the name and URL of the image in question. The script first asks the SPARQL server for the number of images in the dataset. It then picks a random number between one and the image count and uses this as the value for the OFFSET parameter in the SPARQL call. This means that each time the script is run, it should return a different image. Again, I have set this up on the linkedarc.net server to run as a scheduled crontab task.
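
A sketch of that image-tweeting script is shown below. The SPARQL endpoint URL and the use of the SPARQLWrapper library are my own assumptions; the logic simply follows the steps just described (count the images, pick a random offset, tweet the result), and the tweet is sent with an authenticated Tweepy api object like the one set up in the question server sketch.

import random

from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch of the random-image bot described above. The endpoint URL and the
# SPARQLWrapper library are assumptions, not details from the article.
ENDPOINT = "http://linkedarc.net/sparql"  # hypothetical endpoint location
PREFIXES = """
PREFIX la_pp_ont: <http://linkedarc.net/ontology/la_pp/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
"""

def run_select(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["results"]["bindings"]

def pick_random_image():
    # 1. Ask the SPARQL server how many images the dataset holds.
    rows = run_select(PREFIXES + """
        SELECT (COUNT(DISTINCT ?image) AS ?count)
        WHERE { ?image a la_pp_ont:LA_E18_Image . }""")
    image_count = int(rows[0]["count"]["value"])

    # 2. Use a random offset so each run should return a different image.
    offset = random.randint(0, image_count - 1)
    rows = run_select(PREFIXES + """
        SELECT DISTINCT ?imageSrc ?imageName
        WHERE {
          ?image a la_pp_ont:LA_E18_Image .
          ?image rdfs:label ?imageName .
          ?image foaf:depiction ?imageSrc
        } OFFSET %d LIMIT 1""" % offset)
    return rows[0]["imageName"]["value"], rows[0]["imageSrc"]["value"]

def tweet_random_image(api):
    """api is an authenticated tweepy.API instance (see the question server sketch)."""
    name, src = pick_random_image()
    api.update_status(status="{} {}".format(name, src))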