I am a graduate student of physics, and I am working on writing some code to sort several hundred gigabytes of data and return slices of that data when asked for them. Here is the trick: I know of no good method for sorting and searching data of this kind.
My data essentially consists of a large number of sets of numbers. These sets can contain anywhere from 1 to n numbers (though in 99.9% of the sets, n is less than 15), and there are approximately 1.5 to 2 billion of these sets (unfortunately, this size precludes a brute-force search).
I need to be able to specify a set of k elements and have every set that contains those k elements as a subset returned to me.
A simple example:
Suppose I have the following sets in my data:
(1,2,3)
(1,2,3,4,5)
(4,5,6,7)
(1,3,8,9)
(5,8,11)
If I gave the query (1,3), I would get back the sets (1,2,3), (1,2,3,4,5), and (1,3,8,9).
The query (11) would return the set (5,8,11). The query (1,2,3) would return the sets (1,2,3) and (1,2,3,4,5).
The query (50) would return no sets.
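For concreteness, here is a minimal C++ sketch of exactly this query semantics (all names are mine, not from any real code). It is the brute-force scan that the data size rules out, but it pins down what a query must return:

```cpp
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>

using Set = std::set<int>;

// Return every stored set that contains all elements of `query`.
// std::includes requires sorted ranges; std::set iterates in sorted order.
std::vector<Set> querySubset(const std::vector<Set>& data, const Set& query) {
    std::vector<Set> hits;
    for (const Set& s : data)
        if (std::includes(s.begin(), s.end(), query.begin(), query.end()))
            hits.push_back(s);
    return hits;
}

int main() {
    const std::vector<Set> data = {
        {1, 2, 3}, {1, 2, 3, 4, 5}, {4, 5, 6, 7}, {1, 3, 8, 9}, {5, 8, 11}};
    // Query (1,3): prints the three matching sets from the example above.
    for (const Set& s : querySubset(data, {1, 3})) {
        for (int x : s) std::cout << x << ' ';
        std::cout << '\n';
    }
}
```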
By now the pattern should be clear. The main differences between this example and my data are that the sets in my data are larger, the numbers used for each element of a set run from 0 to 16383 (14 bits), and there are many, many more sets.
If it matters, I am writing this program in C++, though I also know Java, C, some assembly, some Fortran, and some Perl.
Does anyone have any clues on how to pull this off?
Edit:
To answer some questions and add a few points:
1.) The data does not change. It was all taken in one long set of runs (each broken into 2 gig files).
2.) As for storage space: the raw data takes up approximately 250 gigabytes. I estimate that after processing and stripping off the metadata I am not interested in, I could knock that down to anywhere from 36 to 48 gigabytes, depending on how much of the metadata I decide to keep (without indices). Additionally, if in my initial processing of the data I encounter enough identical sets, I might be able to compress the data further by storing counters for repeated events rather than repeating the events over and over again (see the first sketch after this list).
3.) Each number within a processed set actually consists of at least two numbers: 14 bits for the data itself (the detected energy) and 7 bits of metadata (the detector number), so I will need at least three bytes per number (see the packing sketch after this list).
4.) My "though in 99.9% of the sets, n is less than 15" comment was misleading. In a preliminary glance through some chunks of the data, I found sets containing as many as 22 numbers, but the median is 5 numbers per set and the mean is 6 numbers per set.
5.) While I like the idea of an index of pointers in a file, I am a bit leery of it, because for queries involving more than one number I am left with the semi-slow task (at least I think it is slow) of finding the set of all pointers common to the lists, i.e. finding the greatest common subset for a given group of sets.
6.) In terms of the resources available to me, I can muster approximately 300 gigabytes of space once the raw data is on the system (the remainder of my quota on that system). The system is a dual-processor server with two quad-core AMD Opterons and 16 gigabytes of RAM.
7.) Yes, 0 can occur; it is an artifact of the data acquisition system when it does, but it can happen.
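Here is a minimal sketch of the counter idea from point 2, assuming sets are canonicalized by sorting so duplicates compare equal. A single in-memory map obviously cannot hold ~2 billion sets, so in practice this would run per 2 gig file, with the partial counts merged afterwards:

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <vector>

using Set = std::vector<std::uint16_t>;   // 14-bit element values fit in 16 bits

std::map<Set, std::uint64_t> counts;      // canonical set -> occurrence count

void addEvent(Set s) {
    std::sort(s.begin(), s.end());        // canonicalize so duplicates map together
    ++counts[s];                          // operator[] default-initializes to 0
}
```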
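And a hypothetical packing for point 3's three-byte numbers (the exact field layout is my assumption): 14 bits of energy plus 7 bits of detector number is 21 bits, which fits in the low 3 bytes of a 32-bit word:

```cpp
#include <cstdint>

// Pack one "number": low 14 bits = detected energy, next 7 bits = detector.
constexpr std::uint32_t packHit(std::uint32_t energy, std::uint32_t detector) {
    return (energy & 0x3FFFu) | ((detector & 0x7Fu) << 14);   // 21 bits used
}

constexpr std::uint32_t unpackEnergy(std::uint32_t hit)   { return hit & 0x3FFFu; }
constexpr std::uint32_t unpackDetector(std::uint32_t hit) { return (hit >> 14) & 0x7Fu; }

// Only the low 3 bytes of the packed value need to be written to disk.
```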
Your problem is the same one faced by search engines: "I have a huge pile of documents; I need the ones that contain this set of words." You simply have (very conveniently) integers rather than words, and smallish documents. The solution is an inverted index. The book by Manning et al. (at that link), Introduction to Information Retrieval, is available free online, is very readable, and goes into a lot of detail about how to do this.
You will have to pay a price in disk space, but the work can be parallelized, and once the index is constructed, queries should be more than fast enough to meet your timing requirements.
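To make the suggestion concrete, here is a rough in-memory sketch of an inverted index over these sets (names are illustrative; the real postings lists for ~2 billion sets would live on disk). It also shows the list intersection that point 5 of the question worries about: done shortest-list-first over sorted lists, it is linear in the lists' lengths, not "semi-slow":

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iterator>
#include <vector>

constexpr std::size_t kValues = 16384;   // element values are 14-bit (0..16383)

// postings[v] holds the sorted IDs of every set containing value v.
std::array<std::vector<std::uint64_t>, kValues> postings;

void indexSet(std::uint64_t setId, const std::vector<int>& s) {
    for (int v : s) postings[v].push_back(setId);   // IDs appended in increasing order
}

// IDs of all sets containing every queried value: intersect the postings
// lists, starting from the shortest, so intermediate results stay small.
std::vector<std::uint64_t> query(std::vector<int> q) {
    if (q.empty()) return {};
    std::sort(q.begin(), q.end(), [](int a, int b) {
        return postings[a].size() < postings[b].size();
    });
    std::vector<std::uint64_t> result = postings[q[0]];
    for (std::size_t i = 1; i < q.size() && !result.empty(); ++i) {
        std::vector<std::uint64_t> next;
        std::set_intersection(result.begin(), result.end(),
                              postings[q[i]].begin(), postings[q[i]].end(),
                              std::back_inserter(next));
        result.swap(next);
    }
    return result;
}
```

As a rough estimate from the numbers in the question: a mean of 6 numbers per set over ~2 billion sets gives ~12 billion postings entries, about 96 gigabytes at 8 bytes each, and considerably less with delta encoding of the sorted IDs, which fits within the ~300 gigabyte quota.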