processing large volumes #2027
Replies: 5 comments 20 replies
-
Updating with the image below.
-
Hi @paulobreim. Audio transcription is a heavy task, so this is expected. On #1909 @wladimirleite fixed a bottleneck you had helped to find. If your CPU is at 100%, there is probably nothing we can do. I'm currently on vacation, so I won't be able to help analyze your case's processing for now.
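If transcription time becomes prohibitive for a case, it can also be turned off in the processing profile; a minimal sketch, assuming the property name used by recent IPED versions:

```
# IPEDConfig.txt (processing profile) - property name assumed, check your IPED version
# Disabling transcription trades searchable audio content for much shorter processing time
enableAudioTranscription = false
```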
-
Well, as this file has already been processing for more than 48 hours and CPU usage is very low, I decided to interrupt and restart the process with --continue. In a while we will see if it recovers with high CPU usage.
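For reference, resuming is just a matter of re-running the same command against the same output folder with --continue; a minimal sketch with placeholder paths (the jar name depends on the IPED version):

```bat
REM Resume an interrupted IPED run, reusing the existing case output folder (placeholder paths)
java -jar iped.jar -d E:\evidence\image01.dd -o E:\cases\case2027 --continue
```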
-
Finally the last file was processed, although it did not finish normally. There were 5 or 6 reprocessings, but in one of them the log was empty because Windows simply rebooted the machine, and as only IPED was running, I think this reboot was caused by Java. Taking advantage of the end of the year, I would like to thank the entire community that helps maintain this project, especially @lfcnassif and @wladimirleite. You are truly outstanding.
-
Following your advice, I restarted all processing in groups of 5 images. I put everything in a .bat that ran for a few days (a rough sketch of the batch is shown after this list). Looking at the last run, it hit the famous OOME in one of the UFDR images, so I decided to investigate a little deeper, to gather information that allows a better analysis, and see how interesting it is:
1 - The command line that gave the error was:
2 - So I did a new indexing with just this image, using the following command line:
3 - So I decided to reprocess this file on top of the failed case, imagining that everything would go well, and used the following line:
4 - After 1:20 the situation was the same. A single task had accumulated 63 minutes of processing time, that is, practically only it was being executed, although from time to time other tasks appeared and finished quickly (IndexTask), which I believe is expected.
5 - After a few more minutes, the Task Manager showed that the process was no longer responding, although it still indicated CPU usage. And in fact, when I clicked on the IPED window it no longer responded.
6 - After a few more minutes, CPU usage dropped to zero and I noticed there were messages on the console. I decided to stop the processing because it was frozen.
7 - As it generated a dump, I uploaded it; it can be obtained from this link.
8 - Finally, here are the logs of the run that finished OK and the one that had the error. Note that they show different results. https://drive.google.com/file/d/1ZA4fVVqwcg8NlkeLgTlg69GzVleP5nzH/view?usp=sharing
In theory, if the isolated processing of this image was very fast and error-free, something must be causing the long processing time when the process uses --continue. I'm available for further testing.
paulo
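A minimal sketch of such a batch, assuming hypothetical image paths, case folders and jar location (the actual command lines referenced in items 1-3 above are not reproduced here):

```bat
@echo off
REM Hypothetical sketch: process the evidence in groups of 5 images, one case per group.
REM Paths, image names and the IPED jar location are placeholders.
set IPED=java -jar C:\iped\iped.jar

REM Group 1
%IPED% -d D:\images\img01.dd -d D:\images\img02.dd -d D:\images\img03.dd ^
       -d D:\images\img04.dd -d D:\images\img05.dd -o D:\cases\group01

REM Group 2, and so on for the remaining groups and the UFDR extractions
%IPED% -d D:\images\img06.dd -d D:\images\img07.dd -d D:\images\img08.dd ^
       -d D:\images\img09.dd -d D:\images\img10.dd -o D:\cases\group02
```

This assumes multiple -d data sources are accepted per case; if not, each image can simply get its own line and output folder.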
-
A few weeks ago I posted here about a case that has a lot of images to process.
After some problems that required restarting and continuing the processing, I gradually worked through different situations.
Now I have decided to start everything over again, and the screenshot below shows the current situation.
We are talking about 43 computer images, which are not actually raw .dd images, but rather results exported by IPED itself from dd processing done in previous versions, so they are not large volumes.
Added to this, we have emails received from Apple and Google, and finally 44 cell phone images in .ufdr format.
We haven't reached half of the processing yet, and what has a significant impact is precisely voice transcription.
We have 190 hours of processing so far, and there is still a long way to go.
I don't know how to assess whether that is too long or not, but I think it would be worth doing an analysis.
If necessary, I can give you (@lfcnassif @wladimirleite) access via AnyDesk so you can take a look.
I remember that in the first test, where I did several restarts, one thing that took a long time was the check for items possibly shared via WhatsApp. At the time it gave me the impression that it checked all images that had already been indexed, but I'm not sure about that.
tks
paulo