Surprising combinations of research contents and contexts are related to impact and emerge with scientific outsiders from distant disciplines

Combinatorial novelty, Science of science

Methods

A hypergraph generative model for tag combinations is proposed to evaluate how surprising a paper is, given its combination of tags. Each publication is treated as a hyperedge that connects the tags belonging to that publication.

Each node (tag) $i$ is represented as a latent vector $\theta_i$, where each element $\theta_{id}$ is the probability that node $i$ belongs to a latent dimension $d$. These dimensions roughly correspond to scientific fields.

Now imagine a combination $h$ that contains multiple nodes. The complementarity between these nodes is conceptualized as $\sum_d \prod_{i \in h} \theta_{id}$, which sums, over all dimensions $d$, the probability that all the nodes belong to the same dimension. Let's think about the extreme cases. First, imagine that the nodes are not complementary at all and belong to totally disjoint dimensions: $\theta_{id} = 0$ for all but one dimension, and every node has its non-zero element in a different dimension. In this case, every product is zero, so the sum is 0. The other extreme is that all nodes have a value of 1 for the same dimension. In this case, the formula produces 1.
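A minimal sketch of the complementarity term and its two extreme cases. The function name and the toy vectors are made up for illustration; only the formula (sum over dimensions of the product over nodes) comes from the description above.

```python
from math import prod

def complementarity(thetas):
    """Sum over dimensions d of the product over nodes i of thetas[i][d]."""
    n_dims = len(thetas[0])
    return sum(prod(theta[d] for theta in thetas) for d in range(n_dims))

# Extreme case 1: each node sits entirely in a different dimension.
disjoint = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Extreme case 2: every node sits entirely in the same dimension.
aligned = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
# In-between case: two nodes split evenly across the same two dimensions.
mixed = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]

print(complementarity(disjoint))  # 0.0
print(complementarity(aligned))   # 1.0
print(complementarity(mixed))     # 0.5
```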

Another concept that is modeled is the salience (or frequency, or popularity) of a node. If a concept is highly useful and used everywhere, then it is available to more people and more likely to be used. To model this, each node $i$ has a scalar $r_i$ that accounts for its popularity.

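Now the propensity (or likelihood) of a combination $h$ can be written as:

$$\lambda_h = \sum_d \prod_{i \in h} \theta_{id} \times \prod_{i \in h} r_i$$

Given this propensity, the number of publications $x_h$ that contain $h$ can be modeled as a Poisson random variable:

$$x_h \sim \mathrm{Poisson}(\lambda_h)$$

The probability of observing a hypergraph $G$ (created from every publication) is the product over all possible combinations:

$$P(G \mid \Theta, r) = \prod_h P(x_h \mid \Theta, r)$$

where $\Theta$ and $r$ are the parameters. The time sequence of hypergraphs $G^1, \dots, G^T$ is modeled as the output of a Hidden Markov Process:

$$P(G^{1:T}, \Theta^{1:T}, r^{1:T}) = \prod_t P(G^t \mid \Theta^t, r^t) \, P(\Theta^t, r^t \mid \Theta^{t-1}, r^{t-1})$$

where the transition density $P(\Theta^t, r^t \mid \Theta^{t-1}, r^{t-1})$ is a Gaussian density.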
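To make the pieces concrete, here is a small sketch of the propensity, the Poisson log-probability of a count, and the resulting log-likelihood of a toy hypergraph. The tag names, parameter values, and function names are all hypothetical, and the toy sums only over the combinations listed in `counts` rather than over every possible combination.

```python
from math import prod, log, lgamma

def propensity(edge, theta, r):
    """Complementarity of the nodes in `edge` times the product of their saliences."""
    n_dims = len(theta[edge[0]])
    comp = sum(prod(theta[i][d] for i in edge) for d in range(n_dims))
    return comp * prod(r[i] for i in edge)

def poisson_log_pmf(count, rate):
    """Log-probability of observing `count` under Poisson(rate)."""
    if rate == 0.0:
        # A zero rate puts all probability mass on count == 0.
        return 0.0 if count == 0 else float("-inf")
    return count * log(rate) - rate - lgamma(count + 1)

def hypergraph_log_likelihood(counts, theta, r):
    """Sum of per-combination log-probabilities (log of the product form)."""
    return sum(poisson_log_pmf(x, propensity(h, theta, r)) for h, x in counts.items())

# Hypothetical parameters for three tags and two latent dimensions.
theta = {"A": [1.0, 0.0], "B": [0.5, 0.5], "C": [0.0, 1.0]}
r = {"A": 2.0, "B": 3.0, "C": 1.0}
# Hypothetical publication counts per tag combination.
counts = {("A", "B"): 2, ("A", "C"): 0, ("B", "C"): 1}

for h in counts:
    print(h, propensity(h, theta, r))
print(hypergraph_log_likelihood(counts, theta, r))
```

A combination with a low propensity that nevertheless appears in a paper is the surprising case; how exactly the propensity is turned into a surprise score is not spelled out in this note.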

The estimation is done with Stochastic gradient descent and Negative sampling.
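A rough sketch of what stochastic gradient descent with negative sampling could look like for this kind of model. Everything specific here is an assumption made for illustration, not the paper's procedure: the softmax parameterization of the latent vectors, the log-salience parameterization, the learning rate, the toy data, and the number of negative samples. Observed combinations are treated as count 1, and randomly sampled unobserved combinations as count 0.

```python
import math
import random

def softmax(a):
    m = max(a)
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]

def forward(edge, a, b):
    """Return (thetas, P, lam): latent vectors, per-dimension products, Poisson rate."""
    thetas = [softmax(a[i]) for i in edge]           # assumed: theta_i = softmax(a_i)
    P = [math.prod(t[d] for t in thetas) for d in range(len(thetas[0]))]
    R = math.exp(sum(b[i] for i in edge))            # assumed: r_i = exp(b_i)
    return thetas, P, sum(P) * R

def sgd_step(edge, count, a, b, lr):
    """One gradient step on the Poisson loss lam - count * log(lam)."""
    thetas, P, lam = forward(edge, a, b)
    C = sum(P)
    for theta, i in zip(thetas, edge):
        for k in range(len(P)):
            # d(loss)/d(a[i][k]) = (lam - count) * (P[k] / C - theta[k])
            a[i][k] -= lr * (lam - count) * (P[k] / C - theta[k])
        b[i] -= lr * (lam - count)                   # d(loss)/d(b[i]) = lam - count

def loss(samples, a, b):
    total = 0.0
    for edge, count in samples:
        lam = forward(edge, a, b)[2]
        total += lam - count * math.log(lam)
    return total

random.seed(0)
n_nodes, n_dims = 6, 2
a = [[random.uniform(-1, 1) for _ in range(n_dims)] for _ in range(n_nodes)]
b = [0.0] * n_nodes

# Toy data: two groups of tags that only co-occur within their own group.
observed = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]
observed_set = set(observed)

def sample_negative(size):
    """Draw a random combination that was never observed (rejection sampling)."""
    while True:
        edge = tuple(sorted(random.sample(range(n_nodes), size)))
        if edge not in observed_set:
            return edge

unobserved = [(i, j) for i in range(n_nodes) for j in range(i + 1, n_nodes)
              if (i, j) not in observed_set]
eval_samples = [(e, 1) for e in observed] + [(e, 0) for e in unobserved]

loss_before = loss(eval_samples, a, b)
for _ in range(300):
    for edge in observed:
        sgd_step(edge, 1, a, b, lr=0.05)                          # positive sample
        for _ in range(2):
            sgd_step(sample_negative(len(edge)), 0, a, b, lr=0.05)  # negative samples
loss_after = loss(eval_samples, a, b)
print(loss_before, loss_after)
```

Negative sampling matters because the likelihood is a product over all possible combinations, which is far too many to enumerate on real data; sampling a few unobserved combinations per observed one stands in for that full sum.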

Content and context surprises are defined separately. Content surprise is based on the keywords (tags) and context surprise is based on the references (journals).

Results

  1. Citation count is positively associated with novelty.
  2. Context novelty is more strongly linked to the probability of a hit, although content novelty seems to be a better predictor of prizes.
  3. The probability of a hit is highest when both content and context novelty are high.
  4. Expedition novelty is calculated as the average distance between team members' publishing backgrounds and the publication venue of their focal paper. This novelty is a better predictor of hits than career or team novelty.
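The expedition novelty in the last item is an average of distances, which can be sketched as follows. The note does not say how backgrounds and venues are represented or which distance is used, so the vector representation and the cosine distance here are assumptions, as are the function names and toy numbers.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity (assumed distance; the note does not specify one)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def expedition_novelty(member_backgrounds, venue):
    """Average distance between each team member's background and the venue."""
    distances = [cosine_distance(m, venue) for m in member_backgrounds]
    return sum(distances) / len(distances)

# Hypothetical two-dimensional positions for a venue and two team members.
venue = [1.0, 0.0]
insider = [1.0, 0.0]   # publishes in the same area as the venue
outsider = [0.0, 1.0]  # publishes in an unrelated area

print(expedition_novelty([insider, outsider], venue))  # 0.5
```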

Combinatorial novelty and computational creativity

The proposed model is a nice way to statistically operationalize combinatorial novelty.

Of course, this combination of tags is a shallow (crude) proxy of the actual innovation contained in the paper. It is easy to generate random combinations that are highly unlikely (thus surprising), but it is extremely challenging to realize a paper with such a random combination of tags. In other words, this model may capture something about the papers, given that the paper is already published. However, the model may not be as useful for deciding whether the paper should be accepted or not.

Can we prescribe the best teams or combinations that can become impactful science? The issue here is that the space of all possibilities is vast. It is easy to create highly surprising-looking combinations, but if a combination does not correspond to real science, it is almost useless. Given the vastness of the space, enumerating surprising combinations may not be a very useful exercise. This is why human expertise and ingenuity remain critical. That's what connects the dots.

What can potentially be more interesting is to think about combinations that are just a bit more surprising than those that are not surprising at all. Based on the idea of the Adjacent possible, going just one step further away may yield plausible yet surprising combinations and ideas.

This may produce computational creativity tools. What could be possible combinations?

This is also related to cognitive availability and who discovers what. See James Evans's other papers about cognitive availability.

See also

Xie2021distributed proposes a hypergraph model for the Coauthorship network.