Exploring the Top-K Problem: A Min-Heap Implementation in Java

Source: reposted


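The Top-K problem asks how to find the K largest items in a large data set (the items are comparable, and larger ones rank first).

Note: "number" is used loosely here. The items need not be numeric; they can be any objects that can be compared with one another.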


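Real-world scenarios: a search engine picking the 10 highest-scoring articles, or a music library listing the 10 songs with the highest download rate.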


Let's look at the possible approaches:

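Method 1: load all the data into an array, sort the array in descending order, and take the first K elements.

This is the most direct approach and the first that comes to mind. With a large data set, however, storing and sorting everything is expensive in both memory and CPU (O(N log N) time and O(N) space for N items), so it is not recommended.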
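As a baseline, Method 1 can be sketched in a few lines. This is only an illustration on int values; the class name TopKBySorting and the method topK are made up for this example, and it assumes k is not negative.

```java
import java.util.Arrays;

final class TopKBySorting {
    /** Returns the k largest values, largest first, by sorting a copy of the whole input. */
    static int[] topK(int[] data, int k) {
        int[] sorted = data.clone();
        Arrays.sort(sorted);                       // ascending order, O(N log N)
        int n = Math.min(k, sorted.length);
        int[] result = new int[n];
        for (int i = 0; i < n; i++) {
            result[i] = sorted[sorted.length - 1 - i];   // read from the large end
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [9, 8, 7]
        System.out.println(Arrays.toString(topK(new int[] {5, 1, 9, 3, 7, 8}, 3)));
    }
}
```

Note that the whole input is copied and sorted even though only K values are needed, which is exactly the cost the other methods try to avoid.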


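Method 2: take K items from the data and store them in an array a of size K, then sort a in ascending order so that a[0] is the minimum. Read the remaining items one at a time and compare each with a[0]. If the item is less than or equal to a[0], skip it and move on to the next one. Otherwise discard a[0], use binary search to find where the new item belongs in a, shift the elements before that position one slot toward the front, and store the new item in the freed slot. Repeat until the data is exhausted.

This is much more efficient than Method 1, but when K is large the shifting itself becomes costly, since a single insertion can move up to K elements.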
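A rough sketch of Method 2 on int values follows. The names TopKBySortedArray and topK are hypothetical, and the code assumes 0 < k <= data.length.

```java
import java.util.Arrays;

final class TopKBySortedArray {
    /** Returns the k largest values in ascending order; assumes 0 < k <= data.length. */
    static int[] topK(int[] data, int k) {
        int[] a = Arrays.copyOf(data, k);
        Arrays.sort(a);                               // a[0] is the smallest kept value
        for (int idx = k; idx < data.length; idx++) {
            int x = data[idx];
            if (x <= a[0]) {
                continue;                             // not larger than the current minimum
            }
            // Binary search for the largest index p with a[p] < x.
            // a[0] < x is already known, so p >= 0.
            int lo = 0;
            int hi = k - 1;
            while (lo < hi) {
                int mid = (lo + hi + 1) >>> 1;
                if (a[mid] < x) {
                    lo = mid;
                } else {
                    hi = mid - 1;
                }
            }
            // Dropping a[0]: slide a[1..lo] one slot toward the front, then place x.
            System.arraycopy(a, 1, a, 0, lo);
            a[lo] = x;
        }
        return a;
    }

    public static void main(String[] args) {
        // prints [7, 8, 9]
        System.out.println(Arrays.toString(topK(new int[] {5, 1, 9, 3, 7, 8}, 3)));
    }
}
```

The binary search costs O(log K), but the System.arraycopy call can move up to K - 1 elements, which is the shifting cost mentioned above.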

For this kind of problem, a more efficient solution is to use a min-heap.


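A min-heap (also called a small-root heap) is a data structure that is, first of all, a complete binary tree, and in which every parent node's value is less than or equal to the values of its two children. As a consequence, the root always holds the smallest element.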


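A min-heap can be stored in an array or in a linked structure. A linked structure is more flexible, but an array is the usual choice for a complete binary tree, and it is what the implementation below uses.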
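For the array layout, the index arithmetic is worth spelling out. The sketch below assumes the same 1-based convention as the Lucene class further down (heap[0] is unused); the class HeapLayout and its helpers are invented for illustration only.

```java
final class HeapLayout {
    /** Index of the parent of node i (1-based layout; the root is at index 1). */
    static int parent(int i) {
        return i >>> 1;
    }

    /** Index of the left child of node i. */
    static int left(int i) {
        return i << 1;
    }

    /** Index of the right child of node i. */
    static int right(int i) {
        return (i << 1) + 1;
    }

    /** Checks the min-heap property for heap[1..size]; heap[0] is ignored. */
    static boolean isMinHeap(int[] heap, int size) {
        for (int i = 2; i <= size; i++) {
            if (heap[parent(i)] > heap[i]) {
                return false;        // a child is smaller than its parent
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isMinHeap(new int[] {0, 1, 3, 2, 7, 4}, 5));   // prints true
        System.out.println(isMinHeap(new int[] {0, 5, 3, 2}, 3));         // prints false
    }
}
```

Because the tree is complete, no slots are wasted between index 1 and size, and moving between parent and child is a single shift.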


Below is one Java implementation of a min-heap (taken from the Lucene source code):


// Assumption: ArrayUtil is Lucene's org.apache.lucene.util.ArrayUtil. The import is
// only needed when this class is copied outside the org.apache.lucene.util package.
import org.apache.lucene.util.ArrayUtil;

public abstract class PriorityQueue<T> {
    private int size;
    private final int maxSize;
    private final T[] heap;

    public PriorityQueue(int maxSize) {
        this(maxSize, true);
    }

    @SuppressWarnings("unchecked")
    public PriorityQueue(int maxSize, boolean prepopulate) {
        size = 0;
        int heapSize;
        if (0 == maxSize) {
            // We allocate 1 extra to avoid if statement in top()
            heapSize = 2;
        } else {
            if (maxSize > ArrayUtil.MAX_ARRAY_LENGTH) {
                throw new IllegalArgumentException("maxSize must be <= " + ArrayUtil.MAX_ARRAY_LENGTH + "; got: " + maxSize);
            } else {
                // NOTE: we add +1 because all access to heap is
                // 1-based not 0-based. heap[0] is unused.
                heapSize = maxSize + 1;
            }
        }
        heap = (T[]) new Object[heapSize]; // T is unbounded type, so this unchecked cast works always
        this.maxSize = maxSize;

        if (prepopulate) {
            // If sentinel objects are supported, populate the queue with them
            T sentinel = getSentinelObject();
            if (sentinel != null) {
                heap[1] = sentinel;
                for (int i = 2; i < heap.length; i++) {
                    heap[i] = getSentinelObject();
                }
                size = maxSize;
            }
        }
    }

    /** Determines the ordering of objects in this priority queue. Subclasses
     * must define this one method.
     * @return <code>true</code> iff parameter <tt>a</tt> is less than parameter <tt>b</tt>.
     */
    protected abstract boolean lessThan(T a, T b);

    /**
     * This method can be overridden by extending classes to return a sentinel
     * object which will be used by the {@link PriorityQueue#PriorityQueue(int,boolean)}
     * constructor to fill the queue, so that the code which uses that queue can always
     * assume it's full and only change the top without attempting to insert any new
     * object.<br>
     *
     * Those sentinel values should always compare worse than any non-sentinel
     * value (i.e., {@link #lessThan} should always favor the
     * non-sentinel values).<br>
     *
     * By default, this method returns null, which means the queue will not be
     * filled with sentinel values. Otherwise, the value returned will be used to
     * pre-populate the queue. Adds sentinel values to the queue.<br>
     *
     * If this method is extended to return a non-null value, then the following
     * usage pattern is recommended:
     *
     * <pre class="prettyprint">
     * // extends getSentinelObject() to return a non-null value.
     * PriorityQueue<MyObject> pq = new MyQueue<MyObject>(numHits);
     * // save the 'top' element, which is guaranteed to not be null.
     * MyObject pqTop = pq.top();
     * <...>
     * // now in order to add a new element, which is 'better' than top (after
     * // you've verified it is better), it is as simple as:
     * pqTop.change().
     * pqTop = pq.updateTop();
     * </pre>
     *
     * <b>NOTE:</b> if this method returns a non-null value, it will be called by
     * the {@link PriorityQueue#PriorityQueue(int,boolean)} constructor
     * {@link #size()} times, relying on a new object to be returned and will not
     * check if it's null again. Therefore you should ensure any call to this
     * method creates a new instance and behaves consistently, e.g., it cannot
     * return null if it previously returned non-null.
     *
     * @return the sentinel object to use to pre-populate the queue, or null if
     *         sentinel objects are not supported.
     */
    protected T getSentinelObject() {
        return null;
    }

    /**
     * Adds an Object to a PriorityQueue in log(size) time. If one tries to add
     * more objects than maxSize from initialize an
     * {@link ArrayIndexOutOfBoundsException} is thrown.
     *
     * @return the new 'top' element in the queue.
     */
    public final T add(T element) {
        size++;
        heap[size] = element;
        upHeap();
        return heap[1];
    }

    /**
     * Adds an Object to a PriorityQueue in log(size) time.
     * It returns the object (if any) that was
     * dropped off the heap because it was full. This can be
     * the given parameter (in case it is smaller than the
     * full heap's minimum, and couldn't be added), or another
     * object that was previously the smallest value in the
     * heap and now has been replaced by a larger one, or null
     * if the queue wasn't yet full with maxSize elements.
     */
    public T insertWithOverflow(T element) {
        if (size < maxSize) {
            add(element);
            return null;
        } else if (size > 0 && !lessThan(element, heap[1])) {
            T ret = heap[1];
            heap[1] = element;
            updateTop();
            return ret;
        } else {
            return element;
        }
    }

    /** Returns the least element of the PriorityQueue in constant time. */
    public final T top() {
        // We don't need to check size here: if maxSize is 0,
        // then heap is length 2 array with both entries null.
        // If size is 0 then heap[1] is already null.
        return heap[1];
    }

    /** Removes and returns the least element of the PriorityQueue in log(size) time. */
    public final T pop() {
        if (size > 0) {
            T result = heap[1];     // save first value
            heap[1] = heap[size];   // move last to first
            heap[size] = null;      // permit GC of objects
            size--;
            downHeap();             // adjust heap
            return result;
        } else {
            return null;
        }
    }

    /**
     * Should be called when the Object at top changes values. Still log(n) worst
     * case, but it's at least twice as fast to
     *
     * <pre class="prettyprint">
     * pq.top().change();
     * pq.updateTop();
     * </pre>
     *
     * instead of
     *
     * <pre class="prettyprint">
     * o = pq.pop();
     * o.change();
     * pq.add(o);
     * </pre>
     *
     * @return the new 'top' element.
     */
    public final T updateTop() {
        downHeap();
        return heap[1];
    }

    /** Returns the number of elements currently stored in the PriorityQueue. */
    public final int size() {
        return size;
    }

    /** Removes all entries from the PriorityQueue. */
    public final void clear() {
        for (int i = 0; i <= size; i++) {
            heap[i] = null;
        }
        size = 0;
    }

    private final void upHeap() {
        int i = size;
        T node = heap[i];           // save bottom node
        int j = i >>> 1;
        while (j > 0 && lessThan(node, heap[j])) {
            heap[i] = heap[j];      // shift parents down
            i = j;
            j = j >>> 1;
        }
        heap[i] = node;             // install saved node
    }

    private final void downHeap() {
        int i = 1;
        T node = heap[i];           // save top node
        int j = i << 1;             // find smaller child
        int k = j + 1;
        if (k <= size && lessThan(heap[k], heap[j])) {
            j = k;
        }
        while (j <= size && lessThan(heap[j], node)) {
            heap[i] = heap[j];      // shift up child
            i = j;
            j = i << 1;
            k = j + 1;
            if (k <= size && lessThan(heap[k], heap[j])) {
                j = k;
            }
        }
        heap[i] = node;             // install saved node
    }

    /** This method returns the internal heap array as Object[].
     * @lucene.internal
     */
    protected final Object[] getHeapArray() {
        return (Object[]) heap;
    }
}

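The abstract class above encapsulates the basic min-heap operations: initializing the heap, adding an element, popping an element, and moving the root to its proper position. Each of these operations preserves the basic min-heap property, namely that a parent's value is less than or equal to the values of its two children.

Because the class is abstract, a subclass must extend it and provide its own implementation of the lessThan method.

The subclass must first call one of the parent constructors, public PriorityQueue(int maxSize) or public PriorityQueue(int maxSize, boolean prepopulate).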


Note: public PriorityQueue(int maxSize) simply calls the second constructor with prepopulate set to true.


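A subclass can override protected T getSentinelObject() to decide whether the heap is prefilled. When that method returns a non-null value and the constructor is called with prepopulate set to true, the heap is prefilled: every element of the T[] heap member array except the first (heap[0], which is unused) is assigned a non-null value, and size is set to maxSize, meaning the array already holds maxSize elements.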


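Note: size is the number of elements currently in the heap; maxSize is the total number of elements the heap can hold, i.e. its capacity.

One more point: if getSentinelObject() returns a non-null value, every call to it must return a newly created object, and that object must rank no higher than any real object, i.e. lessThan(sentinelObject, otherObject) should return true for every non-sentinel otherObject.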


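After the heap is initialized, if it was not prefilled, call add to insert elements. If it was prefilled, call top to get the element at the top of the tree (the root) and modify its values; whenever the root changes, the caller must invoke public final T updateTop() to move it to its proper position.

Note: because the heap's job is to hold the top K elements, once it already holds K elements each candidate must be compared with the root before it goes in. If the candidate is less than or equal to the root, drop it; calling add on a full heap would in any case overflow the array and throw ArrayIndexOutOfBoundsException. Otherwise replace the root with the candidate and call updateTop(). The insertWithOverflow method in the class above wraps this check (it also replaces the root when the candidate compares equal to it). Likewise, when modifying values of the root in place, compare first: change it only if the new value is larger than the old one, and then call updateTop().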
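To make the compare-with-root rule concrete, here is a self-contained sketch. MiniHeap is a trimmed, hypothetical stand-in for the abstract class above (same 1-based array and abstract lessThan, but not the Lucene code itself), and TopKDemo and offer are names invented for this example. It assumes k >= 1.

```java
/** Trimmed, hypothetical stand-in for the abstract class above. */
abstract class MiniHeap<T> {
    private final Object[] heap;    // heap[0] is unused, as in the original
    private final int maxSize;
    private int size;

    MiniHeap(int maxSize) {
        this.maxSize = maxSize;
        this.heap = new Object[maxSize + 1];
    }

    protected abstract boolean lessThan(T a, T b);

    @SuppressWarnings("unchecked")
    private T at(int i) {
        return (T) heap[i];
    }

    T top() {
        return at(1);
    }

    /** Keeps x only if there is room, or if x is larger than the current root. */
    void offer(T x) {
        if (size < maxSize) {
            heap[++size] = x;
            siftUp(size);
        } else if (size > 0 && lessThan(top(), x)) {
            heap[1] = x;            // replace the root, then restore heap order
            siftDown(1);
        }
    }

    private void siftUp(int i) {
        T node = at(i);
        for (int p = i >>> 1; p > 0 && lessThan(node, at(p)); p = i >>> 1) {
            heap[i] = heap[p];      // shift parent down
            i = p;
        }
        heap[i] = node;
    }

    private void siftDown(int i) {
        T node = at(i);
        while (true) {
            int c = i << 1;         // left child
            if (c > size) break;
            if (c + 1 <= size && lessThan(at(c + 1), at(c))) c++;   // pick the smaller child
            if (!lessThan(at(c), node)) break;
            heap[i] = heap[c];      // shift child up
            i = c;
        }
        heap[i] = node;
    }
}

final class TopKDemo {
    /** Returns the k-th largest value: the root of a min-heap holding the k largest. */
    static int kthLargest(int[] data, int k) {
        MiniHeap<Integer> heap = new MiniHeap<Integer>(k) {
            @Override
            protected boolean lessThan(Integer a, Integer b) {
                return a < b;
            }
        };
        for (int x : data) {
            heap.offer(x);
        }
        return heap.top();
    }

    public static void main(String[] args) {
        System.out.println(kthLargest(new int[] {5, 1, 9, 3, 7, 8}, 3));   // prints 7
    }
}
```

Unlike insertWithOverflow, this offer uses a strict comparison, so a candidate equal to the root is dropped, which matches the rule as stated in the note.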


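Finally, to extract the K elements from the heap, call pop() in a loop; each call removes the root, which is then stored in an array or list. By the min-heap property, an element popped earlier is less than or equal to every element popped later, so the result is the top K elements in ascending order. To list the largest first, fill the array from the back or reverse it.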
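The draining step might look like the following. To keep the example runnable on its own, it uses the JDK's java.util.PriorityQueue (a min-heap by default) rather than the class above, with poll() playing the role of pop(); HeapDrain and drainDescending are hypothetical names.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class HeapDrain {
    /** Empties a min-heap into an array, largest value first. */
    static int[] drainDescending(PriorityQueue<Integer> minHeap) {
        int[] out = new int[minHeap.size()];
        // Each poll() yields the smallest remaining value, so fill from the back.
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = minHeap.poll();
        }
        return out;
    }

    public static void main(String[] args) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(Arrays.asList(7, 9, 8));
        // prints [9, 8, 7]
        System.out.println(Arrays.toString(drainDescending(heap)));
    }
}
```

Filling from the back avoids a separate reversal pass; the heap is empty once the method returns.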


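Adding an element to a min-heap takes O(log K) time, where K is the number of elements in the heap, so processing N items costs O(N log K) in total.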



