Custom data import

The YetiForce system is equipped with built-in data import/export tools. In most cases these tools are sufficient; however, for importing large databases it is recommended to prepare your own solution, one you can fully control. By large databases we mean those that contain tens of thousands of records or more.

Basic tips

  1. Prepare a server environment
  2. Plan / Test / Verify
  3. Do not use the built-in import mechanism
  4. The advantage of a dedicated script over the built-in import mechanism
  5. Optimization
  6. The final result

1. Prepare a server environment

The majority of production servers, including hosting services, are configured to ensure stability, so their configuration is "safe" by default. If the import must be done quickly, you will have to move away from the safe parameters and configure your server so the data can be imported continuously, without any interruptions, until the process is completed. When you import a large amount of data you will always encounter problems related to time, memory or performance. A variety of errors can occur during the process, related to execution time limits, insufficient configuration, or mechanisms that should be disabled for the duration of the import (for example recording server logs, data replication, hardware duplication, etc.). During testing, the server will often indicate why the script was stopped; this information is useful when you do not know which parameters should be increased.
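
For example, a dedicated import script usually starts by lifting the limits that would otherwise stop it mid-run. A minimal PHP sketch; the values are illustrative and should be tuned to your server:

```php
<?php
// Lift the default "safe" runtime limits for the duration of the import.
set_time_limit(0);                // no execution time limit
ini_set('memory_limit', '2048M'); // raise the memory ceiling (illustrative value)
ignore_user_abort(true);          // keep running if the client disconnects
error_reporting(E_ALL);           // surface every problem while testing
ini_set('display_errors', '1');
```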

2. Plan / Test / Verify

Importing large databases causes a lot of issues, especially in the early stages. The main problems are time and preparing the work environment. It is important to remember that the import sometimes has to be repeated multiple times [even dozens of times], so your working environment should be prepared in a way that allows you to easily delete the previous import, start a new one and, above all, verify its accuracy.
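
One possible way to make runs repeatable, sketched below in PHP, is to tag every imported row with a batch identifier so that a failed run can be wiped and repeated, and its accuracy verified with a simple count. The table and column names are hypothetical:

```php
<?php
// Repeatable import run: tag rows with a batch ID so a bad run can be
// deleted and repeated. Table and column names are illustrative.
$db = new PDO('mysql:host=localhost;dbname=yetiforce', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$batchId = 'import_batch_001';

// Remove everything left over from the previous attempt.
$db->prepare('DELETE FROM u_custom_import WHERE batch_id = ?')->execute([$batchId]);

// ... run the import, writing $batchId into every created row ...

// Verify: compare the imported row count with the source count.
$stmt = $db->prepare('SELECT COUNT(*) FROM u_custom_import WHERE batch_id = ?');
$stmt->execute([$batchId]);
echo 'Imported rows: ' . $stmt->fetchColumn() . PHP_EOL;
```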

3. Do not use the built-in import mechanism

For many people who have little or no technical experience with importing data, the built-in mechanism is the only available option. Unfortunately, we do not recommend this tool for large databases. This is not caused by the application's imperfections, but rather by the fact that custom data import requires custom tools and a custom approach. The reasons for creating a dedicated solution for a specific import are described below.

4. The advantage of a dedicated script over the built-in import mechanism

4.1 Speed

The built-in import mechanism runs a number of additional side tasks, such as workflows or handlers, which slow the import down significantly. Importing records with the built-in mechanism can be up to a hundred times slower than with a dedicated script. It is crucial to remember that the dedicated script should then replicate the mechanisms that the workflows or handlers normally apply [for example, updating the record labels used for searching is normally executed in a handler for every record; during a custom import it is best to execute it once, after the import is completed]. This approach is sketched below.
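
A rough illustration of the idea in PHP: insert records directly in batches and run the expensive side tasks once at the end instead of per record. The table name is hypothetical, and the final post-processing call is only a placeholder for whatever your installation requires:

```php
<?php
// Insert records in one transaction instead of saving them one by one
// through the full application stack with all its handlers and workflows.
$db = new PDO('mysql:host=localhost;dbname=yetiforce', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$rows = [/* ... data loaded from the source system ... */];
$insert = $db->prepare(
    'INSERT INTO u_custom_import (batch_id, name, status) VALUES (?, ?, ?)'
);

$db->beginTransaction();
foreach ($rows as $row) {
    $insert->execute(['import_batch_001', $row['name'], $row['status']]);
}
$db->commit();

// One-time post-processing instead of a per-record handler, e.g. rebuilding
// the search labels for all imported records (placeholder, not a real API):
// rebuildSearchLabels($db);
```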

4.2 Full control

When you create your own script, you have full control over the included libraries, queries, import methods and other elements that significantly impact the speed. Full control proves useful when:
  • You look for “bottlenecks” – when creating a record takes too long, you can measure the time spent between the methods used in order to determine what takes the most time and optimize it.
  • You want to correct a small amount of data – if the import is successful but you later find a small imperfection, a well-written script of your own lets you update small chunks of data. This is impossible with the regular import.
  • You need to define conditions – when you import complex data [for example, when the value of a field depends on some other elements], you have to verify those elements and set the right value in real time. The built-in mechanism enters the data exactly as prepared before the process starts [e.g. a status is set in the text file but is not checked dynamically]. A sketch of such conditional mapping follows this list.
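
A small PHP sketch of conditional value mapping during import; the field names and mapping rules are purely illustrative:

```php
<?php
// The status written to the target system depends on other fields of the
// source row and is resolved in real time, not prepared in advance.
function resolveStatus(array $row): string
{
    if ($row['closed_at'] !== null) {
        return 'Closed';
    }
    if ((int) $row['priority'] > 5 && $row['assigned_to'] === null) {
        return 'Escalated';
    }
    return 'Open';
}

$sourceRow = ['closed_at' => null, 'priority' => 7, 'assigned_to' => null];
echo resolveStatus($sourceRow) . PHP_EOL; // prints "Escalated"
```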

4.3 Flexibility

The major benefit of having your own script is its flexibility. A good example is downloading data in real time from a copy of the production system that we import from. With the built-in tool, all the data must be prepared in advance and exported to a file; with your own script, the system can fetch the data automatically and start importing it right away, as in the sketch below.
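
A minimal PHP sketch of this pattern, reading rows straight from a copy of the source database and writing them into the target system with no intermediate file; connection details and table names are illustrative:

```php
<?php
// Read from a copy of the source system and write into the target system
// in a single pass, with no intermediate export file.
$source = new PDO('mysql:host=source-copy;dbname=old_system', 'user', 'pass');
$target = new PDO('mysql:host=localhost;dbname=yetiforce', 'user', 'pass');
$source->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$target->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$insert = $target->prepare('INSERT INTO u_custom_import (name, status) VALUES (?, ?)');

// Stream the rows instead of loading the whole table into memory.
foreach ($source->query('SELECT name, status FROM legacy_contacts') as $row) {
    $insert->execute([$row['name'], $row['status']]);
}
```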

5. Optimization

When you import large amounts of data, every small element is important. Several essential issues that should be taken into consideration are listed below:
  1. The database
    • Whether the indexes for all combinations of conditions used for finding data are set properly.
    • Whether the database queries are formed correctly [when you optimize queries that take too much time, you often find a better solution].
  2. The script
    • Whether the data is imported in the right order.
    • Whether the script can be run several times in parallel in order to utilize all of the server's resources [the script must be constructed so as to prevent it from creating duplicates; see the sketch after this list].
  3. The server
    • How to fix “bottlenecks”.
    • Whether the server was optimized [Apache, PHP, MySQL and disks/processor/memory].
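
One common way to run several copies of the script in parallel without creating duplicates is to partition the source rows, for example with a modulo on the source ID. A PHP sketch with hypothetical connection details and table names; start e.g. four processes as `php import.php 4 0` through `php import.php 4 3`:

```php
<?php
// Each worker receives the total worker count and its own number on the
// command line and processes a disjoint slice of the source rows, so no
// row can be imported twice.
[$workers, $workerNo] = [(int) $argv[1], (int) $argv[2]];

$source = new PDO('mysql:host=source-copy;dbname=old_system', 'user', 'pass');
$source->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

$stmt = $source->prepare(
    'SELECT id, name, status FROM legacy_contacts WHERE MOD(id, ?) = ?'
);
$stmt->execute([$workers, $workerNo]);

foreach ($stmt as $row) {
    // ... insert $row into the target system ...
}
```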

6. The final result

A well-prepared script [one that finds/downloads data from system A and finds/enters data into system B] is able to import up to 200 records per second, which means up to 720,000 records per hour.